ST 445 -- SPRING 2010
HOMEWORK 6
DUE THURSDAY, 29 APRIL 2010

The files are in the 'apr10' directory.

** updates 21 April **
***
*** file 'ncecon2010.variables' has been corrected
*** file 'NCpres2008.txt' has been modified to make it (hopefully) easier
*** to read -- reading tab-delimited formats is still tricky --
*** suggestion: read the votes as a character variable, say 'mccainvotec',
*** and then convert to numeric with
***      mccainvote = input(mccainvotec,comma7.) ;
***

In the file 'NCpres2008.txt' are county-level summaries of the voting in the 2008 presidential race. This file is tab-delimited and includes both the percentages of votes for the two candidates and the raw vote totals. (These data were taken from the New York Times site: http://elections.nytimes.com/2008/results/states/president/north-carolina.html )

a) Read in the data in this file and create a dataset with the relevant variables.

In the file 'nc2010econ.dat' are selected economic, educational, and demographic data for North Carolina at the county level. The list of variables is given in the file NC2010econ.variables. (Data courtesy of Tammy J Lester, NC Dept of Commerce.)

b) Read in these related variables and create a dataset.

c) Merge these two datasets by county name.

d) The main task of the assignment is to describe the voting data and investigate some relationships with one or two of the economic variables. Particularly appropriate would be correlations (proc corr) or crosstabulations (proc freq), as well as scatterplots. Summarize your conclusions in a paragraph or two, with charts, graphs, etc. as a supporting appendix. Make effective use of labels and titles; use of user-written formats is encouraged.

Note:

1) Be aware that many counties in North Carolina are quite small, both in area and population, so analysis of count data from these counties requires care. My suggestion is to limit your analysis to the largest counties and avoid the smallest.
2) You should be aware that the largest counties will have the largest counts, and the smallest counties the smallest -- of anything -- regardless of any cause/effect. For example, Mecklenburg and Wake counties have the highest numbers of doctors and the highest incidence of cancer, while Camden and Clay counties have small numbers of both. But does the apparent positive correlation here mean that we should try to reduce the number of doctors to reduce cancer incidence? Base any relationship on RATES: number/population.
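
The point of note 2 can be seen in a small sketch. Everything in it is hypothetical: the dataset name 'toy', the variable names, the county labels A-E, and all of the numbers are invented for illustration and are not taken from the assignment files. It only shows one way to turn counts into rates and compare the two correlations.

```sas
/* Hypothetical toy data: county labels and all numbers are invented. */
data toy;
   input county $ pop doctors cancer;
   /* Convert raw counts to rates per 1,000 residents. */
   doctor_rate = 1000 * doctors / pop;
   cancer_rate = 1000 * cancer / pop;
   label doctor_rate = 'Doctors per 1,000 residents'
         cancer_rate = 'Cancer cases per 1,000 residents';
   datalines;
A 900000 2700 4500
B 500000 1000 2600
C 100000 150 520
D 20000 50 98
E 10000 12 55
;
run;

/* Counts: both variables grow with population, so they move together. */
title 'Raw counts (driven by county size)';
proc corr data=toy;
   var doctors cancer;
run;

/* Rates: county size has been divided out. */
title 'Rates per 1,000 residents';
proc corr data=toy;
   var doctor_rate cancer_rate;
run;
title;
```

In this made-up data the raw counts are strongly positively correlated simply because both scale with population, while the rates are not positively correlated at all. The same division by population would apply to whichever count variables you choose from the real files.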