/******************************************************************* EXAMPLE 1, CHAPTER 4 Modified program including use of PROC INSIGHT to make scatterplot matrices. Using SAS to obtain sample mean vectors, sample covariance matrices, and sample correlation matrices. *******************************************************************/ options ls=80 ps=59 nodate; run; /****************************************************************** The data are not in the correct from for use with the SAS procedures CORR and DISCRIM we use below. These procedures require that the data be in the form of one record (line) per experimental unit. The data in the file dental.dat are in the form of one record per observation (so that each child has 4 data records). In particular, the data set looks like 1 1 8 21 0 2 1 10 20 0 3 1 12 21.5 0 4 1 14 23 0 5 2 8 21 0 . . . column 1 observation number column 2 child id number column 3 age column 4 response (distance) column 5 gender indicator (0=girl, 1=boy) We thus create a new data set such that each record in the data set represents all 4 observations on each child plus gender identifier. To do this, we use some data manipulation features of the SAS data step. The second data step does this. We redefine the values of AGE so that we may use AGE as an "index" in creating the new data set DENT2. The DATA step that creates DENT2 demonstrates one way (using the notion of an ARRAY) to transform a data set in the form of one observation per record (the original form) into a data set in the form of one record per individual. The data must be sorted prior to this operation; we invoke PROC SORT for this purpose. In the new data set, the observations at ages 8, 10, 12, and 14 are placed in variables AGE1, AGE2, AGE3, and AGE4, respectively. We use PROC PRINT to print out the first 5 records (so data for the first 5 children, all girls) using the OBS= feature of the DATA= option. *******************************************************************/ data dent1; infile 'dental.dat'; input obsno child age distance gender; run; data dent1; set dent1; if age=8 then age=1; if age=10 then age=2; if age=12 then age=3; if age=14 then age=4; drop obsno; run; proc sort data=dent1; by gender child; run; data dent2(keep=age1-age4 gender child); array aa{4} age1-age4; do age=1 to 4; set dent1; by gender child; aa{age}=distance; if last.child then return; end; run; /* Alternatively, replace the above code with the following using proc transpose (thanks to Laine Elliott for noting this): */ /* data dent1; infile 'dental.dat'; input obsno child age distance gender; run; proc transpose data=dent1 out=dent2 prefix=age; by gender child notsorted; var distance; run; */ title "TRANSFORMED DATA -- 1 RECORD/INDIVIDUAL"; proc print data=dent2(obs=5); run; /******************************************************************* Here, we use PROC CORR to obtain the sample means at each age (the means of the variables AGE1,...,AGE4 in DENT2 and to calculate the sample covariance matrix and corresponding sample correlation matrix separately for each group (girls and boys). The COV option in the PROC CORR statement asks for the sample covariance to be printed; without it, only the sample correlation matrix would appear in the output. *******************************************************************/ proc sort data=dent2; by gender; run; title "SAMPLE COVARIANCE AND CORRELATION MATRICES BY GENDER"; proc corr data=dent2 cov; by gender; var age1 age2 age3 age4; run; /******************************************************************* We now obtain the "centered" and "scaled" values that may be used for plotting scatterplot matrices such as that in Figure 3. Here, we call PROC MEANS to calculate the sample mean (MAGE1,...,MAGE4) and standard deviation (SDAGE1,...,SDAGE4) for each of the variables AGE1,...,AGE4 for each gender. These are output to the data set DENTSTATS, which has two records, one for each gender (see the output). We then MERGE this data set with DENT2 BY GENDER, which has the effect of matching up the appropriate gender mean and SD to each child. We print out the first three records of the resulting data set to illustrate. We use the NOPRINT option with PROC MEANS to suppress printing of its output. The variables CSAGE1,...,CSAGE4 contain the centered/scaled values. *******************************************************************/ proc sort data=dent2; by gender child; run; proc means data=dent2 mean std noprint; by gender; var age1 age2 age3 age4; output out=dentstats mean=mage1 mage2 mage3 mage4 std=sdage1 sdage2 sdage3 sdage4; run; title "SAMPLE MEANS AND SDS BY GENDER FROM PROC MEANS"; proc print data=dentstats; run; data dentstats; merge dentstats dent2; by gender; csage1=(age1-mage1)/sdage1; csage2=(age2-mage2)/sdage2; csage3=(age3-mage3)/sdage3; csage4=(age4-mage4)/sdage4; run; title "INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER"; proc print data=dentstats(obs=3); run; /******************************************************************* The following code is an addition to that at the end of Chapter 4 of the notes and was contributed by Xiaoyi Gao. Here, we first keep in the data set DENTSTATS only the information needed to make scatterplot matrices. SAS PROC INSIGHT is then called to produce the scatterplot matrices. *******************************************************************/ data dentstats; set dentstats; keep gender csage1 csage2 csage3 csage4; run; proc print; run; proc insight data=dentstats; by gender; scatter csage1 csage2 csage3 csage4 * csage1 csage2 csage3 csage4; window 0 0 50 50; run; /******************************************************************* One straightforward way to have SAS calculate the pooled sample covariance matrix and the corresponding estimated correlation matrix is using PROC DISCRIM. This procedure is focused on so-called discriminant analysis, which is discussed in a standard text on general multivariate analysis. The data are considered as in the form of vectors; here, the elements of a data vector are denoted as AGE1,...,AGE4. Here, we only use PROC DISCRIM for its facility to print out the sample covariance matrix and correlation matrix "automatically," and disregard other portions of the output. *******************************************************************/ proc discrim pcov pcorr data=dent2; class gender; var age1 age2 age3 age4; run; /******************************************************************* Although it is a bit cumbersome, we may use some DATA step manipulations and PROC CORR to obtain the values of the autocorrelation function for each gender. We first drop variables no longer needed from the data set DENTSTATS. We create then three data sets, LAG1, LAG2, and LAG3, and describe LAG1 here; the other two are similar. We create two new variables, PAIR1 and PAIR2. For LAG1, PAIR1 and PAIR2 are the two values in (5.43) for u=1. As there are 4 ages, each child has 3 such pairs. The output of PROC PRINT for LAG1 shows this for the first 2 children. We then sort the data by gender and call PROC CORR to find the sample correlation between the two variables for each gender. The same principle is used to obtain the correlation by gender for lags 2 and 3 [u=2,3]. There are other, more sophisticated ways to obtain the values of the autocorrelation function; however, for longitudinal data sets where the number of time points is small, the "manual" approach we have demonstrated here is easy to implement and understand. PAIR1 versus PAIR2 may be plotted for each lag to obtain visual presentation of the results as in Figure 6. *******************************************************************/ data dentstats; set dentstats; drop age1-age4 mage1-mage4 sdage1-sdage4; run; data lag1; set dentstats; by child; pair1=csage1; pair2=csage2; output; pair1=csage2; pair2=csage3; output; pair1=csage3; pair2=csage4; output; if last.child then return; drop csage1-csage4; run; title "AUTOCORRELATION FUNCTION AT LAG 1"; proc print data=lag1(obs=6); run; proc sort data=lag1; by gender; proc corr data=lag1; by gender; var pair1 pair2; run; data lag2; set dentstats; by child; pair1=csage1; pair2=csage3; output; pair1=csage2; pair2=csage4; output; if last.child then return; drop csage1-csage4; run; title "AUTOCORRELATION FUNCTION AT LAG 2"; proc print data=lag2(obs=6); run; proc sort data=lag2; by gender; proc corr data=lag2; by gender; var pair1 pair2; run; data lag3; set dentstats; by child; pair1=csage1; pair2=csage4; output; if last.child then return; drop csage1-csage4; run; title "AUTOCORRELATION FUNCTION AT LAG 3"; proc print data=lag3(obs=6); run; proc sort data=lag3; by gender; proc corr data=lag3; by gender; var pair1 pair2; run;