/******************************************************************* EXAMPLE 1, CHAPTER 4 Using SAS to obtain sample mean vectors, sample covariance matrices, and sample correlation matrices. *******************************************************************/ options ls=80 ps=59 nodate; run; /****************************************************************** The data are not in the correct from for use with the SAS procedures CORR and DISCRIM we use below. These procedures require that the data be in the form of one record (line) per experimental unit. The data in the file dental.dat are in the form of one record per observation (so that each child has 4 data records). In particular, the data set looks like 1 1 8 21 0 2 1 10 20 0 3 1 12 21.5 0 4 1 14 23 0 5 2 8 21 0 . . . column 1 observation number column 2 child id number column 3 age column 4 response (distance) column 5 gender indicator (0=girl, 1=boy) We thus create a new data set such that each record in the data set represents all 4 observations on each child plus gender identifier. To do this, we use some data manipulation features of the SAS data step. The second data step does this. We redefine the values of AGE so that we may use AGE as an "index" in creating the new data set DENT2. The DATA step that creates DENT2 demonstrates one way (using the notion of an ARRAY) to transform a data set in the form of one observation per record (the original form) into a data set in the form of one record per individual. The data must be sorted prior to this operation; we invoke PROC SORT for this purpose. In the new data set, the observations at ages 8, 10, 12, and 14 are placed in variables AGE1, AGE2, AGE3, and AGE4, respectively. We use PROC PRINT to print out the first 5 records (so data for the first 5 children, all girls) using the OBS= feature of the DATA= option. *******************************************************************/ data dent1; infile 'dental.dat'; input obsno child age distance gender; run; data dent1; set dent1; if age=8 then age=1; if age=10 then age=2; if age=12 then age=3; if age=14 then age=4; drop obsno; run; proc sort data=dent1; by gender child; run; data dent2(keep=age1-age4 gender child); array aa{4} age1-age4; do age=1 to 4; set dent1; by gender child; aa{age}=distance; if last.child then return; end; run; /* Alternatively, replace the above code with the following using proc transpose (thanks to Laine Elliott for noting this): */ /* data dent1; infile 'dental.dat'; input obsno child age distance gender; run; proc transpose data=dent1 out=dent2 prefix=age; by gender child notsorted; var distance; run; */ title "TRANSFORMED DATA -- 1 RECORD/INDIVIDUAL"; proc print data=dent2(obs=5); run; /******************************************************************* Here, we use PROC CORR to obtain the sample means at each age (the means of the variables AGE1,...,AGE4 in DENT2 and to calculate the sample covariance matrix and corresponding sample correlation matrix separately for each group (girls and boys). The COV option in the PROC CORR statement asks for the sample covariance to be printed; without it, only the sample correlation matrix would appear in the output. *******************************************************************/ proc sort data=dent2; by gender; run; title "SAMPLE COVARIANCE AND CORRELATION MATRICES BY GENDER"; proc corr data=dent2 cov; by gender; var age1 age2 age3 age4; run; /******************************************************************* We now obtain the "centered" and "scaled" values that may be used for plotting scatterplot matrices such as that in Figure 3. Here, we call PROC MEANS to calculate the sample mean (MAGE1,...,MAGE4) and standard deviation (SDAGE1,...,SDAGE4) for each of the variables AGE1,...,AGE4 for each gender. These are output to the data set DENTSTATS, which has two records, one for each gender (see the output). We then MERGE this data set with DENT2 BY GENDER, which has the effect of matching up the appropriate gender mean and SD to each child. We print out the first three records of the resulting data set to illustrate. We use the NOPRINT option with PROC MEANS to suppress printing of its output. The variables CSAGE1,...,CSAGE4 contain the centered/scaled values. These may be plotted against each other to obtain plots like Figure 3. We have not done this here to save space. We export the centered/scaled values to a file so that we can import them into R to take advantage of the simple function pairs() to plot scatterplots. *******************************************************************/ proc sort data=dent2; by gender child; run; proc means data=dent2 mean std noprint; by gender; var age1 age2 age3 age4; output out=dentstats mean=mage1 mage2 mage3 mage4 std=sdage1 sdage2 sdage3 sdage4; run; title "SAMPLE MEANS AND SDS BY GENDER FROM PROC MEANS"; proc print data=dentstats; run; data dentstats; merge dentstats dent2; by gender; csage1=(age1-mage1)/sdage1; csage2=(age2-mage2)/sdage2; csage3=(age3-mage3)/sdage3; csage4=(age4-mage4)/sdage4; run; title "INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER"; proc print data=dentstats(obs=3); run; * create the data file -- use only the variables we need; filename this "dentcenter.dat"; data _null_; set dentstats; file this; put gender csage1 csage2 csage3 csage4; run; /* Alternatively, we can use the SAS procedure STDIZE to do this, creating a data set altdentstats. Try uncommenting the following code and compare the data set dentstats to the data set altdentstats created here */ proc sort data=dent2; by gender; run; proc stdize data=dent2 method=std pstat out=altdentstats(rename=(age1=csage1 age2=csage2 age3=csage3 age4=csage4)); var age1-age4; by gender; run; title "STANDARDIZED VALUES USING STDIZE"; proc print data=altdentstats(obs=3); run; /******************************************************************* One straightforward way to have SAS calculate the pooled sample covariance matrix and the corresponding estimated correlation matrix is using PROC DISCRIM. This procedure is focused on so-called discriminant analysis, which is discussed in a standard text on general multivariate analysis. The data are considered as in the form of vectors; here, the elements of a data vector are denoted as AGE1,...,AGE4. Here, we only use PROC DISCRIM for its facility to print out the sample covariance matrix and correlation matrix "automatically," and disregard other portions of the output. *******************************************************************/ proc discrim pcov pcorr data=dent2; class gender; var age1 age2 age3 age4; run; /******************************************************************* Another way to obtain the "building blocks" of the pooled sample covariance matrix and to obtain the corresponding estimate of correlation matrix directly is to use the MANOVA statement in PROC GLM. The MANOVA statement instructs PROC GLM to consider the data as being in the form of vectors; as above, the elements of a data vector are AGE1,...,AGE4. The MODEL statement dictates a model where each level of GENDER has a different mean vector. This causes SAS to estimate the mean vector in each group by the sample mean vector for that group. With this MODEL statement, the effect of the PRINTE option of the MANOVA statement is to calculate and print the SS&CP matrix corresponding to the pooled sample covariance matrix Unfortunately, PROC GLM does not divide the SS&CP matrix by its degrees of freedom. It does, however, print the estimate of the associated correlation matrix. By default, PROC GLM also carries out a separate one-way analysis of variance on each variable AGE1,...,AGE4. The Mean Square for Error (MSE) values from each table corresponding to AGE1,...,AGE4 are, in fact, the estimates of the individual variances (the diagonal elements of the pooled sample covariance matrix). As it is often easier to look at the correlation matrix to evaluate the likely pattern of correlation and the individual variance estimates at each time to evaluate an assumption like constant variance over time, this information is generally sufficient for the data analyst's needs. *******************************************************************/ title "OBTAINING THE POOLED SS&CP MATRIX AND CORRESPONDING CORRELATION"; title2 "MATRIX USING THE MANOVA STATEMENT OF PROC GLM"; proc glm data=dent2; class gender; model age1 age2 age3 age4 = gender; manova / printe; run; /******************************************************************* Yet another way to obtain the pooled sample covariance matrix and the associated correlation matrix is to use PROC MIXED, which is discussed in Chapters 8-10. PROC MIXED requires the data to be in the form of one observation per record, which is the original form in the data set DENT1. The R and RCORR options in the REPEATED statement cause PROC MIXED to print out the estimates. The MODEL statement specifies a different mean for each gender-age combination, so that the gender-specific means are correctly used in the calculation. A call to PROC MIXED of the form of that below will provide the pooled sample covariance matrix as long as the data are balanced. We defer further discussion of PROC MIXED and the syntax below to later chapters and programs. *******************************************************************/ title "OBTAINING THE POOLED SAMPLE COVARIANCE MATRIX AND CORRELATION"; title2 "USING PROC MIXED"; proc mixed data=dent1; class gender age child; model distance = gender age gender*age; repeated / type=un subject=child r rcorr; run; /******************************************************************* Although it is a bit cumbersome, we may use some DATA step manipulations and PROC CORR to obtain the values of the autocorrelation function for each gender. We first drop variables no longer needed from the data set DENTSTATS. We create then three data sets, LAG1, LAG2, and LAG3, and describe LAG1 here; the other two are similar. We create two new variables, PAIR1 and PAIR2. For LAG1, PAIR1 and PAIR2 are the two values in (5.43) for u=1. As there are 4 ages, each child has 3 such pairs. The output of PROC PRINT for LAG1 shows this for the first 2 children. We then sort the data by gender and call PROC CORR to find the sample correlation between the two variables for each gender. The same principle is used to obtain the correlation by gender for lags 2 and 3 [u=2,3]. There are other, more sophisticated ways to obtain the values of the autocorrelation function; however, for longitudinal data sets where the number of time points is small, the "manual" approach we have demonstrated here is easy to implement and understand. PAIR1 versus PAIR2 may be plotted for each lag to obtain visual presentation of the results as in Figure 6. *******************************************************************/ data dentstats; set dentstats; drop age1-age4 mage1-mage4 sdage1-sdage4; run; data lag1; set dentstats; by child; pair1=csage1; pair2=csage2; output; pair1=csage2; pair2=csage3; output; pair1=csage3; pair2=csage4; output; if last.child then return; drop csage1-csage4; run; title "AUTOCORRELATION FUNCTION AT LAG 1"; proc print data=lag1(obs=6); run; proc sort data=lag1; by gender; proc corr data=lag1; by gender; var pair1 pair2; run; data lag2; set dentstats; by child; pair1=csage1; pair2=csage3; output; pair1=csage2; pair2=csage4; output; if last.child then return; drop csage1-csage4; run; title "AUTOCORRELATION FUNCTION AT LAG 2"; proc print data=lag2(obs=6); run; proc sort data=lag2; by gender; proc corr data=lag2; by gender; var pair1 pair2; run; data lag3; set dentstats; by child; pair1=csage1; pair2=csage4; output; if last.child then return; drop csage1-csage4; run; title "AUTOCORRELATION FUNCTION AT LAG 3"; proc print data=lag3(obs=6); run; proc sort data=lag3; by gender; proc corr data=lag3; by gender; var pair1 pair2; run;