ST 590G -- Computation for Data Analysis
Fourth Assignment -- due Thursday, 24 November 2011

The Federalist Papers are pamphlets published in support of the US Constitution and are often cited as capturing the original intent of the Founding Fathers. Files with these papers are available at the sites

    http://www.foundingfathers.info/federalistpapers/fedxx.htm

where xx takes values from 01 to 85. (Note there are two versions of 70: 'fed70a' and 'fed70b' -- choose one.)

* or * (note change below)

    http://thomas.loc.gov/home/histdox/fed_xx.html

where again xx takes values 01 to 85. (Note that this site has the second version of #70, as the file 'fed_70-2.html'.)

* or * (a third site)

    http://www.constitution.org/fed/federaxx.htm

where xx takes values 01 to 85.

* or * (a fourth site)

    http://www.let.rug.nl/usa/D/1776-1800/federalist/fedxx.htm

where xx takes values 01 to 85. (Same two versions of 70 as with the 'FoundingFathers' site. Note the country of this host.)

Among historians, the general agreement is that John Jay wrote 2, 3, 4, 5, and 64; James Madison wrote 10, 14, 37, 38, ..., 48; no one is sure about 18, 19, 20, 49, 50, ..., 58, 62, 63; and the others were written by Alexander Hamilton. Statisticians Mosteller and Wallace did an extensive analysis of word frequencies to determine who wrote those 15 disputed papers. Your task is to gather the data needed to replicate that analysis.

The steps:

1) Read each file and strip out the html code.

2) Make a dataset of the words and compute the frequencies of all of the words.

3) Find the word frequencies of the words used by Mosteller & Wallace. Their list is given in 'mwwords.dat' -- the hint is 'merge.' (Note: treat adverbial/adjectival or plural forms as the same word: "CONSIDERABLY" = "CONSIDERABLE", "INNOVATIONS" = "INNOVATION", "VIGOROUS" = "VIGOR", "MATTERS" = "MATTER", "WORKS" = "WORK".)

4) Create one dataset with word frequencies for the papers with known authorship. Include all 14 of the Madison papers and at least as many Hamilton papers (use at least 14 of the Hamilton papers; use all of them if you want). You can choose to include the Jay papers or just delete them from the analysis. Include a variable indicating the author.

5) Create a second dataset with the 15 papers where the authorship is not certain. Include a variable 'author' with missing values.
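
The steps above can be sketched in Python roughly as follows. This is one possible outline, not a required solution or the method Mosteller and Wallace used. The function names (strip_html, word_counts, paper_row, split_datasets) are made up for illustration. The layout of 'mwwords.dat' is not described here, so the word list is passed in as a plain Python list, and downloading the files is left out. Two further assumptions: a "word" is a run of letters (apostrophes allowed inside), and missing authorship is represented by None.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Authorship as stated in the assignment text.
JAY = {2, 3, 4, 5, 64}
MADISON = {10, 14} | set(range(37, 49))               # 14 papers
DISPUTED = {18, 19, 20, 62, 63} | set(range(49, 59))  # 15 papers

# Variant forms the assignment says to treat as the same word.
VARIANTS = {
    "CONSIDERABLY": "CONSIDERABLE",
    "INNOVATIONS": "INNOVATION",
    "VIGOROUS": "VIGOR",
    "MATTERS": "MATTER",
    "WORKS": "WORK",
}


def author_of(paper):
    """Return the author label for a paper number; None means 'missing'."""
    if paper in DISPUTED:
        return None
    if paper in JAY:
        return "Jay"
    if paper in MADISON:
        return "Madison"
    return "Hamilton"


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Step 1: return the visible text of an HTML page."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def normalize(word):
    """Upper-case a word and fold the listed variant forms together."""
    word = word.upper()
    return VARIANTS.get(word, word)


def word_counts(text):
    """Step 2: count every word in the text (assumed word definition)."""
    words = re.findall(r"[A-Z]+(?:'[A-Z]+)*", text.upper())
    return Counter(normalize(w) for w in words)


def paper_row(paper, html, mw_words):
    """Step 3: keep only the listed words, like a merge on the word key."""
    counts = word_counts(strip_html(html))
    row = {"paper": paper, "author": author_of(paper),
           "total": sum(counts.values())}
    for w in mw_words:
        row[normalize(w)] = counts[normalize(w)]  # Counter gives 0 if absent
    return row


def split_datasets(rows, include_jay=False):
    """Steps 4 and 5: known-author rows and disputed rows."""
    known = [r for r in rows if r["author"] is not None
             and (include_jay or r["author"] != "Jay")]
    unknown = [r for r in rows if r["author"] is None]
    return known, unknown
```

Each row holds raw counts plus the total number of words in the paper, so a rate (for example, occurrences per 1000 words) can be computed later by dividing by 'total'. Calling paper_row once per paper and passing the resulting list to split_datasets gives the two datasets asked for in steps 4 and 5.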