ST512 SAS Primer
Written by D. Kim Chantala

ST 512
USING SAS IN THE SICL

INTRODUCTION

This document is intended to serve as a complement to other, more comprehensive manuals for SAS and the Statistics Instructional Computing Laboratory (SICL). These manuals will be your resource for using SAS on the SICL system; only a skeletal outline of necessary information is given here to get you started. As you gain familiarity with the system, using it will become easier, so don't get discouraged if you are confused at first.

All lab sessions will be held in SICL, and SICL will be our computing resource for the class. The lab is a state-of-the-art facility based on SUN workstations. For those of you who are interested, the operating system is UNIX. The lab provides a pleasing environment for computing and is accessible from terminals across campus as well as home modems, as described in the manuals. During open hours, help is available for problems you may encounter with the system (see below).

Even if you have access to SAS on a mainframe account or PC, it is requested that you do all of your class-related computing using the SICL system. See the instructor to discuss exceptions to this policy.

We will be using SAS to perform many of our computations in this class, as they can be tedious and sometimes virtually impossible to do by hand. For those who have not used SAS before, the acronym "SAS" stands for "Statistical Analysis System." The company is based in Cary, NC, and was, in fact, founded by a graduate of the Statistics Department at NCSU. The SAS language is a standard throughout industry and academia all over the world. It is a powerful tool for data organization and analysis and has a simple structure that allows one to enter and access data in an intuitive way and perform various complex analyses with relative ease.

The following manuals are available:

1. SICL Intro. Manual: UNIX Section (by T. Arnold)
2. SICL Intro. Manual: SAS Section (by T. Arnold)
3. SAS Primer (by D.K. Chantala)

Manuals 1 and 2 contain basic descriptions of using the SICL system and SAS. Manual 3 is a detailed, comprehensive reference for SAS, with several excellent examples. For the class, you should purchase manuals 1 and 2. If you intend to use SAS in future work, for example, in your research, you will want to purchase manual 3 as well.

General information about the SICL is given in manual 1, the UNIX section -- specifically, information on logging in/out, connecting to SICL from remote locations, printing, etc. Information about using SAS on SICL is given in manual 2, the SAS section.

In this document, only a few fundamental issues are covered; specific page references to manuals 1 and 2 are given in case more detailed information is desired:

- Logging in, logging out of the system (manual 1, p. 1-9)
- Manipulating files (manual 1, p. 10-13)
- Using SAS on the SICL system -- entering and leaving SAS, the Display Manager and windows, typing and running programs, viewing output, printing programs and results (manual 2, p. 1-11)
- SAS -- programs, DATA step, PROCs (manual 2, p. 12-17)

Some sample SAS programs appear at the end of this document. These programs are examples in the instructor's ST 511 class notes. The programs are available in ready-made files on the SICL system so that you may try running them yourself.

During open hours in the lab, a person knowledgeable about the computer system will be available to help you if you run into problems. These people are there to help you with problems you are having with the system -- logging in, running programs, printing output. They are not there to help you with conceptual issues in programs you are running for homework assignments. So do not ask them for help on homework, only for help with using the computer to run SAS programs.
You may ask the graduate teaching assistant for our class for help with conceptual issues during lab sessions.

LOGGING IN/LOGGING OUT

You have been provided with a username and password for the system. Do not give out your username or password to anyone, even to your mother.

When you enter the lab, you will see that there are several machines with "big" screens -- these are the workstations -- and many smaller WYSE terminals. You may use either. The login procedure is the same for each type, and is described on page 6 of manual 1 and repeated here:

Press any key until you get the prompt

    login:

Type your username, then hit RETURN. You will then be asked for your password. Enter it and hit RETURN. Once you have logged in, you will see a prompt that will look like this:

    sicl%

When you see this prompt, the system is ready to respond to your commands.

At the end of a session, you should always exit whatever software you are using (SAS in our case; see below for how to exit SAS) and return to the sicl% prompt. At the prompt, type

    logout

and hit RETURN. Be sure you read the discussion on page 9 of manual 1.

Structure of SAS    Fall 1992    SAS Primer

Structure of a SAS Program

This section of the manual describes the structure of a SAS program and the syntax of the language. Many of the most common mistakes made when programming SAS are listed in Appendix 1.

SAS Program Blocks: The DATA step and the PROC step

All SAS programs consist of at least two blocks of statements: the DATA step and the PROC step. A step is just a group of program statements that provide instructions to SAS for entering and modifying data (the DATA step) or analyzing data (the PROC step).

Use the DATA step to read in and modify data. A DATA step begins with the statement:

    DATA setname ;

and ends with one of the following statements:

    RUN;
    PROC procedure_name ;

or when another DATA step starts.
Between the beginning and ending statements will be statements for reading data, labeling data, and performing calculations. The collection of data values you read in or create is called a data set and has the name, setname , that you choose. If you need to modify your data (such as transform the data or create new data elements) you must do so in a DATA step, not in a PROC step.

Use the PROC step to analyze and view your data. In a PROC step you can select the type of analysis (regression, analysis of variance, computation of means) for your data set. PROC steps begin with the statement:

    PROC procedure_name ;

and end with a RUN; statement, with the statement

    DATA setname ;

or when another PROC step starts. Other statements used in the PROC step specify the results you want computed and displayed by the SAS procedure named procedure_name .

DATA and PROC steps can appear in any order (except, of course, that a PROC cannot operate on a data set that has not yet been created) and any number of DATA steps or PROC steps can be used in a program. It is important to understand the concept of steps because although some SAS statements can be used anywhere, many SAS statements are used exclusively in a DATA step and others are used exclusively in a PROC step. It is not uncommon to have to begin a new DATA step in order to perform necessary data manipulations.

Syntax for SAS Statements

A SAS statement is a string of keywords, names, and symbols ending in a semicolon. You may use upper or lower case when typing SAS statements. Most SAS statements are specified with the following form:

    keyword parameter [options]

Such SAS statements begin with a keyword identifying the kind of statement it is. Some of the identifying keywords are DATA, PROC, OUTPUT, INFILE, FILE, and VAR. Parameters are often the names of your variables or data sets. Options are keywords specific to a particular SAS statement.
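
As a rough illustration of the two kinds of steps and of the keyword / parameter / options form described above, here is a minimal sketch. The data set name scores, the variables name and score, and the data values are all invented for illustration; they do not come from the class notes.

```sas
/* DATA step: read three observations into a data set named scores */
data scores;
    input name $ score;     /* name is character, score is numeric */
    cards;
ann 82
bob 75
cal 91
;
run;

/* PROC step: list the data set; DATA= is an option of the PROC statement */
proc print data=scores;
    var name score;         /* VAR is the keyword; name and score are parameters */
run;
```

The DATA step ends at its RUN; statement, and the PROC step that follows can only refer to scores because the data set has already been created.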

Assignment statements used to create new variables or modify the values of existing variables have the familiar algebraic form you use when calculating these variables by hand.

SAS statements end with a semicolon (;).

COMMON ERROR: A missing semicolon is a very common syntax error and one that is hard for SAS error checking procedures to identify. For this reason the error messages you receive are often unclear.

SAS statements are free-format. This means a statement can begin and end anywhere on a line, one statement can continue over several lines, several statements can be on one line, and as many blanks as you like can be used to separate fields.

PROGRAMMING TIP: You can make your SAS code easier to read and debug by beginning PROC statements and DATA statements in the first space of the line and indenting all other statements. Adding a blank line and comment between steps is also helpful.

Rules for SAS Names

SAS uses names to identify variables, data sets, formats, arrays, libraries and files. SAS names must conform to the following rules:

(1) Be no longer than eight characters
(2) Have a letter as the first character
(3) Contain only letters, numbers, or underscores

Comment Statements

Use comment statements to make your program easier to read and edit. SAS ignores comment statements. The comment statements will be printed to your log file. There are two ways to make a comment statement:

(1) Put an asterisk, *, at the beginning of a line, write the comment, and terminate the comment with a semicolon. For example,

    *This is a one-line comment; A statement following here is not ignored;

(2) Start a block of comment lines with /*. Write as many comment lines as you want. Terminate the comment with */. For example,

    /* This is a multi-line comment. You may want to use
       this technique to "turn off" sections of your program.
       SAS will ignore all three of these lines*/

PROGRAMMING TIP: You can use comment statements to "turn off" sections of your code; this way you can avoid looking through unwanted output.

Statements Affecting Your Output

You can control the linesize and pagesize of output destined for the lineprinter and add informative titles on tables created by SAS procedures. This section describes two statements used to control these items. These statements can be used anywhere in your SAS program. SAS executes them when encountered and they remain in effect until changed or you exit your SAS session.

Changing the Output Page Setup

You can use the OPTIONS statement to temporarily change one or more of the page setup options from the default value chosen by the manager of your computing center. These changes will be in effect for the duration of your SAS session or until they are changed with another OPTIONS statement. The syntax for the OPTIONS statement is:

    OPTIONS option1 option2 ...;

where any number of options are chosen from the following list:

    pagesize=nn    each page of output will contain nn lines.
    linesize=nn    each line of output can be up to nn characters in length.
    [no]date       prints or suppresses printing of the date on the top of each page of output.
    [no]number     prints or suppresses printing of the page number at the top of each page.
    firstobs=nn    selects nn as the first observation to be processed.
    obs=nn         selects nn as the last observation to be processed.

FIRSTOBS and OBS allow you to select a limited number of observations to process and are useful for checking to see if your SAS program works properly before you process large amounts of data.

The default pagesize chosen by the manager of your computing center will be about 24 lines, which is a good size for viewing graphics on a computer screen.
If you plan to print your output to paper, increase the pagesize to no more than 58 lines per page by using the SAS statement:

    options pagesize=58;

anywhere in your program.

Adding Titles to Tables in Your Output

You can use the TITLE statement to add titles to the tables and plots in your output file. Up to ten lines can be printed on the output using this statement. The form of the title statement is

    TITLE[n] ['title'];

where n immediately follows the keyword TITLE to specify on which of the ten available lines this title should be printed. For the first title line, either TITLE or TITLE1 may be used. 'title' is the title you want printed on line n . Each title can be up to 132 characters long. The title should be enclosed in apostrophes. For example:

    TITLE 'Soybean Yield';
    TITLE3 '1986 through 1988';

Titles will appear on lines one and three of the output file. Once you specify a title for a line, it is used for all subsequent output until you cancel the title or define another title for that line or one above it. To cancel all existing titles specify:

    TITLE;

To suppress the nth and later titles, specify:

    TITLEn;

To associate a title with a particular PROC step, include the title in the PROC step. If you want to change titles for each of the pages produced by a statement within a PROC, be sure to put a RUN; statement after each of the TITLE statements. Since SAS usually starts a new page with each statement, you can use this to find out which tables are produced by each statement or to give unique names to each plot or model analyzed. For example:

    PROC PLOT;
      TITLE 'PLOT 1';
      PLOT X*Y;
    RUN;
      TITLE 'PLOT 2';
      PLOT Z*Y;
    RUN;

will put a different title on each plot. The example "Polynomial Regression with PROC REG" also shows you how to do this.

Viewing DATA: Use PROC PRINT or PROC PLOT

The first step in analyzing your data is often to plot the data. PROC PLOT can be used to create a scatter plot of a data set.
Graphical descriptive displays, such as histograms and stem and leaf plots, can be created with PROC UNIVARIATE, which is covered in the next chapter on describing your data. To verify that your data was input correctly, you can use PROC PRINT to list any data set. PROC SORT can be used to sort the observations in a data set by one or more of the variables. You may need to use PROC SORT to order the data so other SAS procedures can process the data in subsets using the BY statement. Remember that you can use the TITLE statement with any of these procedures.

PROC PRINT

The syntax of the PROC PRINT statement and other statements used with PROC PRINT to alter the table of data printed is shown below:

    PROC PRINT [DATA=setname ];
        /*Initiates PROC PRINT procedure and specifies which data set to print*/
    [VAR variables ;]
        /*Specifies which variables in the data set are to be printed*/
    [SUM variables ;]
        /*Specifies variables whose values you want totaled.*/
    [BY variables ;]
        /*Analyzes the variables in groups having the same values. Use PROC SORT to put the data in ascending order before using PROC PRINT with the BY statement.*/

PROC SORT

Use PROC SORT to sort a data set before it is processed by other PROC steps that include a BY statement. PROC SORT produces no printed output. Below is a description of this procedure and associated statements:

    PROC SORT [DATA=setname OUT=newset ];
        /*Initiates PROC SORT and optionally specifies the input data set (setname ) and a name for the output data set (newset ).*/
    BY [DESCENDING] variable ;
        /*Specifies which variables are used to sort the data set. Any number of variables can be used in the BY statement.*/

For example, PROC SORT can be used to rearrange your data as follows:

    /*ORIGINAL DATA SET*/    PROC SORT; BY X Y;    PROC SORT; BY Y X;

          X   Y                   X   Y                 X   Y
          1   3                   1   2                 3   1
          2   3                   1   3                 1   2
          3   1                   1   5                 1   3
          2   5                   2   3                 2   3
          1   2                   2   5                 1   5
          1   5                   3   1                 2   5

PROC PLOT

PROC PLOT creates scatter plots from two variables that you specify.
You can choose the plot symbol or use the value of a third variable as the plot symbol. PROC PLOT and associated statements follow:

    PROC PLOT [DATA=setname ];
        /*Initiates PROC PLOT and optionally specifies which data set to use; if setname is omitted, the most recently created data set is used.*/
    PLOT Y*X | Y*X='symbol' | Y*X=Z . . . [/OVERLAY];
        /*Specifies that the plot should have variable Y on the vertical axis and variable X on the horizontal axis; if ='symbol' is present the plot symbol is the character inside the apostrophes; if =Z is present then the plot symbol is the value of variable Z . If you specify several plots on the same statement, the option /OVERLAY will put all plots on the same graph.*/
    [BY variables ;]
        /*Used to get separate plots on observations defined by the variables in the BY statement. Use PROC SORT with the same BY statement to make sure the data is in ascending order.*/

Example: Printing and Plotting data

You are asked to plot the yield for each fertilizer and find the total yield per fertilizer in the data set from the example on creating indicator variables. You can start with the data step used in that example.

SAS SOLUTION:

    options pagesize=30;     /* Set default pagesize to be 30 lines*/
    data fertlzr;
      input fert $ @;
      do rep=1 to 5;
        input yield @;
        /*Note there are two SAS statements per line*/
        x1=(fert='A'); x2=(fert='B');
        x3=(fert='C'); x4=(fert='D');
        output;
      end;
    cards;
    A 60 61 59 60 60
    B 62 61 60 62 60
    C 63 61 61 64 66
    D 62 61 63 60 64
    ;
    proc sort out=order;
      by fert;
    proc print data=order;
      title1 'Example of PROC PRINT with the SUM option';
      title2 'Fertilizer data set';
      sum yield;
      by fert;
    proc plot data=fertlzr;
      title1 'Example PROC PLOT using an * for the plot symbol';
      plot yield*fert='*';
    run;

We used PROC SORT to sort the input data set by fertilizer before using the BY statement in PROC PRINT.
Because of the way the data was read in by the data step, the data was already ordered by fertilizer type, so we could have omitted the PROC SORT procedure. In PROC PRINT we used two title statements to annotate the printout; giving a new title1 statement in the PROC PLOT step canceled the previous title2. Note that fert is not a numeric variable; PROC PLOT places the values of non-numeric variables at equal increments on the axis. The output follows:

SAS OUTPUT:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            Example of PROC PRINT with the SUM option
                      Fertilizer data set
                                      10:04 Friday, June 26, 1992

----------------------------------- FERT=A ----------------------

      OBS    REP    YIELD    X1    X2    X3    X4

        1     1       60      1     0     0     0
        2     2       61      1     0     0     0
        3     3       59      1     0     0     0
        4     4       60      1     0     0     0
        5     5       60      1     0     0     0
                    -----
     FERT            300

----------------------------------- FERT=B ----------------------

      OBS    REP    YIELD    X1    X2    X3    X4

        6     1       62      0     1     0     0
        7     2       61      0     1     0     0
        8     3       60      0     1     0     0
        9     4       62      0     1     0     0
       10     5       60      0     1     0     0
                    -----
     FERT            305

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            Example of PROC PRINT with the SUM option
                      Fertilizer data set
                                      10:04 Friday, June 26, 1992

----------------------------------- FERT=C ----------------------

      OBS    REP    YIELD    X1    X2    X3    X4

       11     1       63      0     0     1     0
       12     2       61      0     0     1     0
       13     3       61      0     0     1     0
       14     4       64      0     0     1     0
       15     5       66      0     0     1     0
                    -----
     FERT            315

----------------------------------- FERT=D ----------------------

      OBS    REP    YIELD    X1    X2    X3    X4

       16     1       62      0     0     0     1
       17     2       61      0     0     0     1
       18     3       63      0     0     0     1
       19     4       60      0     0     0     1
       20     5       64      0     0     0     1
                    -----
     FERT            310
                    =====
                    1230

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        Example PROC PLOT using an * for the plot symbol
                                      10:04 Friday, June 26, 1992

            Plot of YIELD*FERT.  Symbol used is '*'.

    YIELD |
          |
       66 +                                      *
          |
       65 +
          |
       64 +                                      *
          |
       63 +                                      *
          |
       62 +                    *
          |
       61 +  *                 *                 *
          |
       60 +  *                 *
          |
       59 +  *
          |
          ---+-----------------+-----------------+-------------
             A                 B                 C
                                                           FERT

    NOTE: 5 obs hidden.
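
The small PROC SORT illustration given earlier in this chapter can be tried directly. The program below is only a sketch: the data set names xy, byxy, and byyx are invented here, while the six (X, Y) pairs are the ones from that illustration.

```sas
/* Six observations on X and Y, in the original (unsorted) order */
data xy;
    input x y;
    cards;
1 3
2 3
3 1
2 5
1 2
1 5
;
run;

/* Sort by X, then by Y within X; OUT= leaves the original data set unchanged */
proc sort data=xy out=byxy;
    by x y;
run;

/* Sort by Y, then by X within Y */
proc sort data=xy out=byyx;
    by y x;
run;

/* PROC SORT prints nothing itself, so list each sorted data set */
proc print data=byxy;
    title 'Sorted by X then Y';
run;

proc print data=byyx;
    title 'Sorted by Y then X';
run;
```

The two listings should show the observations in the same orders as the two sorted tables in the PROC SORT description. Using OUT= is a design choice: without it, PROC SORT replaces the input data set with the sorted version.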

Describing data: use PROC SUMMARY, PROC MEANS, OR PROC UNIVARIATE

Both PROC MEANS and PROC SUMMARY compute descriptive statistics for an entire SAS data set. PROC UNIVARIATE can be used to graphically describe the distribution of your data with a stem and leaf plot, a normal probability plot, horizontal bar chart, or a box plot, as well as to print statistics for quantiles, skewness, and a test statistic for normality.

Using PROC SUMMARY or PROC MEANS

There are two primary differences between PROC MEANS and PROC SUMMARY:

1. PROC MEANS produces subgroup statistics only when a BY statement is used and the input data has been previously sorted (use PROC SORT) by the BY variables. PROC SUMMARY automatically produces statistics for all subgroups, giving you all the information in one run that you would get by repeatedly sorting a data set by the variables that define each subgroup and running PROC MEANS.

2. PROC SUMMARY does not produce any information in your output, so you will always need to use the OUTPUT statement to create a new data set and use PROC PRINT to see the computed statistics.

All of the options and statements used with PROC MEANS are identical to those used with PROC SUMMARY. To invoke PROC SUMMARY use the following statements:

    PROC SUMMARY [DATA=setname ] [NWAY];
    [CLASS variables ;]
    /* Choose additional statements from the list following PROC MEANS*/

where you can optionally choose which data set to process and add NWAY to specify that statistics are to be computed only for the highest level of interaction among the factors (class variables) in your data set. Use a CLASS statement to list all of the variables used in specifying a group or category. An example of a class variable is fertilizer type.
To invoke PROC MEANS use the following statement:

    PROC MEANS [DATA=setname ];
    /*Choose additional statements from the following list*/

You can use the following statements with either procedure:

    [VAR variables ;]
        /*Compute statistics for only the variables in the VAR statement; if the VAR statement is not used, statistics will be computed for all numeric variables in the data set.*/
    [BY variables ;]
        /*Obtain separate analysis for the groups defined by the variables in the BY statement*/
    [OUTPUT OUT=newset keyword=name1 keyword=name2 . . .;]
        /*Creates a new data set named newset containing the computed statistics chosen from the following list of keywords*/

        N          number of observations in calculation
        MEAN       sample mean
        STD        standard deviation
        MIN        minimum value
        MAX        maximum value
        RANGE      range
        SUM        sum
        VAR        variance
        USS        uncorrected sum of squares
        CSS        corrected sum of squares
        STDERR     standard error of the mean
        SKEWNESS   skewness
        KURTOSIS   kurtosis
        T          Student's t value for testing the hypothesis that the population mean is zero
        PRT        probability of a greater absolute value of Student's t

Example: Plotting group means over the data

Using the fertilizer data set in the previous example on creating indicator variables, plot the group means over a plot of the original data.

SAS SOLUTION:

    /* This SAS program calculates group means and plots them
       over the original data set.*/
    options pagesize=30;
    data fertlzr;
      input fert $ @;
      do rep=1 to 5;
        input yield @;
        x1=(fert='A'); x2=(fert='B');
        x3=(fert='C'); x4=(fert='D');
        symbol = '*';
        output;
      end;
    cards;
    A 60 61 59 60 60
    B 62 61 60 62 60
    C 63 61 61 64 66
    D 62 61 63 60 64
    ;
    proc means;
      var yield;
      by fert;
      output out=average mean=aveyield;
    data remake;               /* Take the data set from PROC MEANS and     */
      set average;             /* create a new variable for the means with  */
      yield=aveyield;          /* the same name as in the original data.    */
      symbol='@';              /* Assign a different symbol to YIELD than   */
                               /* the symbol in set FERTLZR.                */
    data both;
      set remake fertlzr;      /* Concatenate the sets.*/
      keep fert yield symbol;  /* Keep only these variables.*/
    proc print;
    proc plot data=both;
      plot yield*fert=symbol;
    run;

SAS OUTPUT:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                            08:47 Friday, July 17, 1992  26

Analysis Variable : YIELD

------------------------------------ FERT=A ----------------

  N          Mean       Std Dev       Minimum       Maximum
 ----------------------------------------------------------
  5    60.0000000     0.7071068    59.0000000    61.0000000
 ----------------------------------------------------------

------------------------------------ FERT=B ----------------

  N          Mean       Std Dev       Minimum       Maximum
 ----------------------------------------------------------
  5    61.0000000     1.0000000    60.0000000    62.0000000
 ----------------------------------------------------------

------------------------------------ FERT=C ----------------

  N          Mean       Std Dev       Minimum       Maximum
 ----------------------------------------------------------
  5    63.0000000     2.1213203    61.0000000    66.0000000
 ----------------------------------------------------------

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                            08:47 Friday, July 17, 1992  27

Analysis Variable : YIELD

------------------------------------ FERT=D ----------------

  N          Mean       Std Dev       Minimum       Maximum
 ----------------------------------------------------------
  5    62.0000000     1.5811388    60.0000000    64.0000000
 ----------------------------------------------------------

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                            08:47 Friday, July 17, 1992  28

      OBS    FERT    YIELD    SYMBOL

        1     A        60       @
        2     B        61       @
        3     C        63       @
        4     D        62       @
        5     A        60       *
        6     A        61       *
        7     A        59       *
        8     A        60       *
        9     A        60       *
       10     B        62       *
       11     B        61       *
       12     B        60       *
       13     B        62       *
       14     B        60       *
       15     C        63       *
       16     C        61       *
       17     C        61       *
       18     C        64       *
       19     C        66       *
       20     D        62       *
       21     D        61       *
       22     D        63       *
       23     D        60       *
       24     D        64       *

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                            08:47 Friday, July 17, 1992  29

        Plot of YIELD*FERT.  Symbol is value of SYMBOL.

    YIELD |
          |
       66 +                                      *
          |
       65 +
          |
       64 +                                      *                 *
          |
       63 +                                      @                 *
          |
       62 +                    *                                   @
          |
       61 +  *                 @                 *                 *
          |
       60 +  @                 *                                   *
          |
       59 +  *
          |
          ---+-----------------+-----------------+-----------------+--
             A                 B                 C                 D
                                      FERT

    NOTE: 9 obs hidden.

PROC UNIVARIATE

Use this procedure to examine the statistical distribution of a variable. The following statements are used to call the UNIVARIATE procedure:

    PROC UNIVARIATE [DATA=setname ] [PLOT] [NORMAL];
        Initiates the UNIVARIATE procedure. Choose the data set to process with the DATA= option; choose PLOT to get a stem-and-leaf plot or a bar chart, a box plot, and a normal probability plot; choose NORMAL to compute a test statistic for the hypothesis that the data come from a normal distribution.
    [VAR variables ;]
        Tells UNIVARIATE which variables to process. If you omit the VAR statement, statistics will be calculated for all numeric variables in the data set.
    [BY variables ;]
        Analyzes the data in groups defined by the BY variables. Use PROC SORT to make sure variables are arranged in ascending order.
    [OUTPUT OUT=newset keyword=name1 keyword=name2 . . .;]
        Produces a new output data set in which each requested statistic is stored in a variable with the name you give. Choose keywords from the following list: N, MEAN, SUM, STD, VAR, SKEWNESS, KURTOSIS, MAX, MIN, RANGE, Q3 (upper quartile, or 75th percentile), MEDIAN, Q1 (lower quartile, or 25th percentile).
See PROC MEANS for keyword definitions. For example, the following statements

    PROC UNIVARIATE;
      VAR INCOME GRADE;
      BY STATE;
      OUTPUT OUT=NEW MEAN = AVE_INC AVE_GR
                     VAR  = VAR_INC VAR_GR;

create a new data set containing one observation for each value of STATE (for example, two observations if the data cover two states) and the variables STATE, AVE_INC, AVE_GR, VAR_INC, and VAR_GR. You can use any number of output statements with PROC UNIVARIATE. Note that the two uses of VAR (VARiable and VARiance) cause no problem to SAS.

Examining Correlation in Data: Use PROC CORR

PROC CORR calculates correlation coefficients between variables. The procedure by default calculates Pearson product-moment correlation and significance probabilities. This is all that will be covered here. Check with the SAS manuals at your computing site for information on the options used with PROC CORR to calculate nonparametric statistics measuring association and partial correlation. The following statements are used to invoke the CORR procedure:

    PROC CORR [DATA=setname ];
        Invokes PROC CORR and optionally specifies the name of the data set to be processed.
    [VAR variables ;]
        Lists the variables to be correlated.
    [WITH variables ;]
        Requests specific combinations of variables to be correlated. Variables in the VAR statement define the columns of the correlation table results and variables in the WITH statement define the rows of the correlation table results. If WITH is omitted, VAR defines both rows and columns.
    [BY variables ;]
        Processes the data set in groups of observations defined by the variables in the BY statement.

For example, the following SAS code:

    PROC CORR DATA=SET1;
      VAR A B;
      WITH X Y Z;

produces correlation coefficients for the following pairs of variables: A and X, A and Y, A and Z, B and X, B and Y, B and Z.
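
To see the VAR / WITH layout for yourself, something like the following sketch can be run. The data set name SET1 and the variable names A, B, X, Y, Z follow the example above, but the five observations are invented purely so that the program has something to read.

```sas
/* Hypothetical data: five observations on A, B, X, Y, Z (values invented) */
data set1;
    input a b x y z;
    cards;
1 9 2 5 7
2 7 3 4 9
3 8 5 6 8
4 4 4 8 6
5 3 7 7 5
;
run;

/* WITH variables (X, Y, Z) label the rows of the correlation table;
   VAR variables (A, B) label the columns */
proc corr data=set1;
    var a b;
    with x y z;
run;
```

The resulting table should have three rows and two columns, one cell for each of the six pairs listed above. Dropping the WITH statement would instead give the square table of A and B against themselves.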

Example: Calculating Pairwise correlation

Calculate the Pearson Product-Moment Correlation for all pairwise combinations of yield, ph and temp from the chemical reaction data set from the example on "Reading Values into a Data Set".

SAS SOLUTION:

    options pagesize=50;
    data react;
      input yield ph temp @@;
      psq=ph**2;
      tsq=temp**2;
      pt=ph*temp;
    cards;
    90 5 60 100 5 80 95 5 100 105 5.5 80 100 6 60 130 6 80
    125 6 100 140 6.5 80 135 7 60 142 7 80 126 7 100
    ;
    proc corr;
      var yield ph temp;
    run;

SAS OUTPUT:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                     Correlation Analysis

    3 'VAR' Variables:  YIELD    PH    TEMP

                      Simple Statistics

    Variable    N        Mean     Std Dev         Sum     Minimum     Maximum

    YIELD      11    117.0909     19.3052        1288     90.0000    142.0000
    PH         11      6.0000      0.8062     66.0000      5.0000      7.0000
    TEMP       11     80.0000     15.4919    880.0000     60.0000    100.0000

    Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 11

                   YIELD          PH        TEMP

    YIELD        1.00000     0.87058     0.14043
                  0.0         0.0005      0.6805

    PH           0.87058     1.00000     0.00000
                  0.0005      0.0         1.0000

    TEMP         0.14043     0.00000     1.00000
                  0.6805      1.0000      0.0

The table in the output shows the correlation for all pairs of the variables. For example, the correlation between PH and YIELD is .87058 (p=.0005).

EXAMPLES

To introduce the features of SAS entities such as the DATA step, comment statements, and PROCs, we will consider the program to perform an analysis of variance for a RCBD given in the file rcbd.sas.

An experiment was set up in a randomized complete block design to investigate differences in yield for 7 hybrid varieties of wheat. In each of 5 blocks, coded as I, II, III, IV, V, 7 plots were defined, and each plot was randomly allocated to one of 7 varieties, a, b, c, d, e, f, g. Thus, there was one plot for each variety in each block, so there is one yield observation available for each block-variety combination. What follows is a description of the components of a SAS program to analyze the data.
(This example is described more fully in Chapter 9 of this instructor's lecture notes for ST 511.)

DATA step features:

Consider the following SAS statements:

     1  data bushels;
     2     input block $ variety $ yield @@;
     3     cards;
     4  I a 10 I b 9 I c 11 I d 15 I e 10 I f 12 I g 11
     5  II a 11 II b 10 II c 12 II d 12 II e 10 II f 11 II g 12
     6  III a 12 III b 13 III c 10 III d 14 III e 15 III f 13
     7  III g 13
     8  IV a 14 IV b 15 IV c 13 IV d 17 IV e 14 IV f 16 IV g 15
     9  V a 13 V b 14 V c 16 V d 19 V e 17 V f 15 V g 18
    10  ;

This block of statements is a DATA step. The statements have been numbered from 1-10 so that we may refer to them in this description. (The numbers are not part of the program, but are here for convenience in referring to specific statements.)

Line 1 defines a data set named "bushels." As far as SAS is concerned, for the duration of the program, "bushels" is the name of the data set containing the data. Line 1 ends, as do all SAS statements, with a semicolon. SAS does not care where statements begin or end on a line, nor does it care how many statements appear on a single line, as long as each statement is separated by a semicolon. This is true not only in the DATA step, but in all SAS statements. Thus, lines 2 and 3 have been indented solely to make the program "look nice" -- all SAS cares about is what the statements say.

Line 2 contains an input statement. In this statement, variables containing the data and information necessary for SAS to run an analysis of variance (see below) are defined. A single observation in this data set consists of a yield observation identified by the block and variety from which it arose. To identify this structure to SAS, we must define variables containing the block and variety information for a given observation as well as the actual yield response value. In particular, there are 3 variables. The first, block, defines the number of the block from which an observation arose (I, II, etc.).
The trailing "$" following a space tells SAS that the variable being defined has character (as opposed to numerical) values. A variable without a "$" is assumed by SAS to take on numerical values. The second variable, variety, also has character values (a, b, etc.), so it also has a trailing "$". The final variable is yield, which will contain the actual values of the yields (no "$"). Thus, the data set "bushels" will contain 3 variables: block, describing from which block a yield value came; variety, describing the variety from which a yield value came; and yield, the actual value. Each triplet (block, variety, yield) identifies a single observation. You may name SAS variables anything you like, just as long as the names do not begin with a number and are 8 or fewer characters long.

The "@@" following a space after yield is a special feature of the input statement. Without it, the data would have to be entered one observation per line. In particular, in the above, following the input statement is a cards statement. Cards tells SAS that the next information to follow after the semicolon is the actual data that are to be assigned to the variables in the input statement. If the "@@" were left off, so that the input statement read

     2     input block $ variety $ yield;

instead, the data would have to be entered as follows:

    I a 10
    I b 9
    I c 11
    I d 15
    etc.

That is, each single observation consisting of a block, variety and yield specification would have to appear on its own line. The function of "@@" is thus to allow one to "string out" the entering of data so that more than one (block, variety, yield) grouping can appear on each line. One can imagine that this can make a program much shorter than if only one observation could appear on a line.

Thus, the first observation is from block I, on variety a, and the yield was 10 bu/acre. The second observation in the data set is from block I, variety b, and the yield was 9 bu/acre.
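
The effect of "@@" can be checked with a short sketch such as the one below, which reads the first four observations of the wheat data both ways. The data set names oneper and several are invented for this illustration; everything else follows the statements discussed above.

```sas
/* Without @@: one (block, variety, yield) observation per data line */
data oneper;
    input block $ variety $ yield;
    cards;
I a 10
I b 9
I c 11
I d 15
;
run;

/* With @@: several observations may share a data line */
data several;
    input block $ variety $ yield @@;
    cards;
I a 10 I b 9 I c 11 I d 15
;
run;

/* The two listings should contain the same four observations */
proc print data=oneper;  run;
proc print data=several; run;
```

Both DATA steps produce a data set with four observations and three variables; only the layout of the data lines differs.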
The final thing to note is that after all of the data have been entered, there is yet another semicolon on a separate line by itself. This lone semicolon indicates to SAS that the data set is complete. Lines 1-10 above complete the DATA step that defines the data to SAS.

PROC features: Now consider our data step with more statements added to tell SAS what we would like to do with the data.

     1  data bushels;
     2     input block $ variety $ yield @@;
     3     cards;
     4  I a 10 I b 9 I c 11 I d 15 I e 10 I f 12 I g 11
     5  II a 11 II b 10 II c 12 II d 12 II e 10 II f 11 II g 12
     6  III a 12 III b 13 III c 10 III d 14 III e 15 III f 13
     7  III g 13
     8  IV a 14 IV b 15 IV c 13 IV d 17 IV e 14 IV f 16 IV g 15
     9  V a 13 V b 14 V c 16 V d 19 V e 17 V f 15 V g 18
    10  ;
    11  *;
    12  * Now print out the data;
    13  *;
    14  proc print data=bushels; run;
    15  *;
    16  * Run an analysis of variance;
    17  *;
    18  proc glm data=bushels; class block variety;
    19     model yield = block variety; run;

Once the data are entered in the data step (lines 1-10), we can use different PROCs to operate on the data. The statements in lines 11-19 use two PROCs to print out the data and run the appropriate analysis of variance. We now discuss these statements.

Comment statements: First, note that there are some statements that begin with an asterisk "*". In SAS, any statement that begins with an asterisk is a comment statement -- it is not a program statement but an explanatory statement inserted by the author to clarify what the program is doing. It is a good idea to comment your programs well, so that when you refer to them later you will remember what you did. Note that comment statements also end in semicolons. The "blank" comments *; in the program above were put in simply to create space between the DATA step and the PROC statements that follow, as well as to set off the comments from the program statements.
An alternate method of inserting a comment is to enclose it as follows:

    /* comment goes here and can be as long as you want
       and cover several lines */

We will see a fancy example of this in the final version of the program below.

After the first set of comment statements, a call to PROC PRINT is made. PROC PRINT does nothing more than print the contents of a data set. It is always a good idea to print out any data set you have entered yourself to check for typos. The specification "data=bushels" tells SAS to print the contents of this data set. Actually, if we had left "data=bushels" off, and simply had

    14  proc print; run;

instead, SAS would still print the contents of bushels, since bushels was the last data set created in the program. It is usually a good idea to specify the name of the data set, though, since in more complicated programs you may have defined several data sets. The run statement after PROC PRINT, or any PROC statement, simply tells SAS to execute that procedure.

The final set of statements, after another block of comment statements, is a call to PROC GLM in order to construct the analysis of variance. Again, "data=bushels" could have been left off. The second statement is the class statement. This statement informs the PROC of which variables are to be regarded as classification variables defining the elements of the design. Here, block and variety are the two classifications for a yield observation. Yield is the response, so it is not included in the class statement. The final statement is a model statement. This tells SAS to fit a two-way classification model with no interaction, i.e.,

    Y(ij) = m + a(i) + b(j) + e(ij),

where Y(ij) = yield from the ith variety in the jth block, m is the overall mean, a(i) is the effect of the ith variety, b(j) is the effect of the jth block and e(ij) is the error associated with Y(ij).
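For readers who prefer typeset notation, the same additive model can be written as below, together with the degrees-of-freedom bookkeeping it implies for this design (7 varieties, 5 blocks, 35 observations). The Greek symbols are a notational assumption made here: they stand for m, a(i), b(j) and e(ij) in the text and are not taken from the course notes.

```latex
\begin{align*}
  Y_{ij} &= \mu + \alpha_i + \beta_j + \varepsilon_{ij},
         \qquad i = 1,\dots,7 \ (\text{varieties}), \quad j = 1,\dots,5 \ (\text{blocks}) \\[4pt]
  \text{df}_{\text{total}}     &= 7 \cdot 5 - 1 = 34 \\
  \text{df}_{\text{blocks}}    &= 5 - 1 = 4 \\
  \text{df}_{\text{varieties}} &= 7 - 1 = 6 \\
  \text{df}_{\text{error}}     &= 34 - 4 - 6 = 24 = (7-1)(5-1)
\end{align*}
```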
(Note that because there is only one observation per block/treatment combination, the model includes no interaction term. We will see later in the course how to include interactions in the ANOVA using PROC GLM.) The model statement is followed by a run statement to indicate that SAS is to perform this analysis.

Here is a full listing of the program as it appears in the file /pub/st512/md/rcbd.sas, fully annotated with comment statements. You should include the program and submit (run) it to see what the output looks like. We will learn how to interpret the output in class.

/*----------------------------------------------------------
|                                                          |
|   An example of using PROC GLM to construct the          |
|   analysis of variance for a randomized complete         |
|   block design with one observation per treatment-block  |
|   combination. There were 5 blocks (I,II,III,IV,V).      |
|   Each of 7 varieties (a-g) appeared exactly once in     |
|   each block.                                            |
|                                                          |
 ---------------------------------------------------------*/

/*----------------------------------------------------------
|                                                          |
|   Note that here we have constructed boxes in which      |
|   comments are contained. All of the text between the    |
|   slash-asterisk combinations is ignored in the          |
|   execution of the program.                              |
|                                                          |
|   Thus, it is possible to make the documentation of      |
|   your program "stand out."                              |
|                                                          |
 ---------------------------------------------------------*/

/*----------------------------------------------------------
|                                                          |
|   The following statement is always recommended. Its     |
|   purpose is to use 55 lines per page in printing the    |
|   program output. If this statement is not included,     |
|   the pages will be much shorter, with annoyingly        |
|   frequent page breaks.                                  |
|                                                          |
 ---------------------------------------------------------*/

options ps=55;

/*----------------------------------------------------------
|                                                          |
|   The DATA step: Enter the data manually into a data     |
|   set called "bushels."                                  |
|   See the file                                           |
|                                                          |
|        /pub/st512/md/sicl.txt                            |
|                                                          |
|   for a detailed description. The input statement        |
|   tells SAS that there are 3 variables, block, variety,  |
|   which are character variables ($) and provide the      |
|   block-treatment classification information, and        |
|   yield, which contains the actual responses. The @@     |
|   allows the (block,variety,yield) triplets to be        |
|   entered several to a line rather than requiring them   |
|   to be entered line by line. The cards statement        |
|   indicates that the data follow. The trailing lone ;    |
|   indicates the end of the data step.                    |
|                                                          |
 ---------------------------------------------------------*/

data bushels;
   input block $ variety $ yield @@;
   cards;
I a 10 I b 9 I c 11 I d 15 I e 10 I f 12 I g 11
II a 11 II b 10 II c 12 II d 12 II e 10 II f 11 II g 12
III a 12 III b 13 III c 10 III d 14 III e 15 III f 13 III g 13
IV a 14 IV b 15 IV c 13 IV d 17 IV e 14 IV f 16 IV g 15
V a 13 V b 14 V c 16 V d 19 V e 17 V f 15 V g 18
;

/*----------------------------------------------------------
|                                                          |
|   The data are now printed out using PROC PRINT. The     |
|   "title" statement places the title in single quotes    |
|   at the top of each page of output. The title will      |
|   appear on all subsequent pages until a new title       |
|   statement is invoked. See the discussion in the file   |
|                                                          |
|        /pub/st512/md/sicl.txt                            |
|                                                          |
|   for more on the "data=bushels" statement.              |
|                                                          |
 ---------------------------------------------------------*/

proc print data=bushels; run;

/*----------------------------------------------------------
|                                                          |
|   The analysis of variance is now constructed using      |
|   PROC GLM. The class statement tells the PROC which     |
|   variables in the data set bushels provide the          |
|   classification information. The model statement tells  |
|   SAS to construct the analysis of variance for the      |
|   additive linear model corresponding to the RCBD. See   |
|                                                          |
|        /pub/st512/md/sicl.txt                            |
|                                                          |
|   for more discussion.                                   |
|                                                          |
 ---------------------------------------------------------*/

proc glm data=bushels;
   class block variety;
   model yield = block variety;
run;

We have seen several features of SAS by considering the analysis of variance example. Now we will consider a second example to illustrate how one might perform least squares regression in SAS using PROC REG. The full program appears in the file reg.sas.

Rate of oxygen consumption (Y) was measured for a bird at a predetermined temperature (X). At each temperature, a different bird was used. This example is treated in greater detail in chapter 10 of this instructor's ST 511 notes. Here are the data:

      X      Y
    -18    5.2
    -15    4.7
    -10    4.5
     -5    3.6
      0    3.4
      5    3.1
     10    2.7
     19    1.8

Here is a SAS program to create a data set with 2 variables, X and Y, print out the data, plot Y versus X, and then fit a simple linear regression model to the data:

    Y(i) = B0 + B1 X(i) + e(i),

where Y(i) and X(i) are the rate of oxygen consumption and temperature for the ith bird, and e(i) is a random error.

     1  data oxygen; input x y; cards;
     2  -18 5.2
     3  -15 4.7
     4  -10 4.5
     5  -5 3.6
     6  0 3.4
     7  5 3.1
     8  10 2.7
     9  19 1.8
    10  ;
    11  *;
    12  * print out the data;
    13  *;
    14  proc print; title 'Oxygen Consumption Data'; run;
    15  *;
    16  * plot the data;
    17  *;
    18  proc plot; plot y*x; run;
    19  *;
    20  * run the simple linear regression;
    21  *;
    22  proc reg; model y = x; run;

In the program, a data set called "oxygen" is created. It contains 2 variables, x and y. Since both are numerical, no "$" was used. Also, note that here we did not use the "@@" in the input statement, so the data from each bird had to go on its own line. If we had put a "@@" in the input statement as above, we could have strung the observations all out:

    data oxygen; input x y @@; cards;
    -18 5.2 -15 4.7 -10 4.5 -5 3.6 0 3.4 5 3.1 10 2.7 19 1.8
    ;

The data step is again followed by some comments, and then the data are printed out using PROC PRINT in line 14.
Note that here we did not bother with adding a "data=oxygen" specification in the PROC statement, since the data set "oxygen" was the last data set created. Note that before the run statement in line 14, there is a title statement with a string in single quotes. The effect of a title statement is for each page of output to have the specified title at the top. See the SAS manuals for more on this -- you can have multiple titles and change the titles on different pages of the output. We will see this later in the course.

After some further comments, line 18 invokes another PROC. PROC PLOT, as you might suspect, creates a plot with the first variable in the PLOT statement (before the asterisk) on the vertical axis, and the second (after the asterisk) on the horizontal axis. It automatically scales the axes. It is possible to make much fancier plots, both with PROC PLOT and other PROCs.

Finally, after some further comments, line 22 contains a call to PROC REG. The syntax is very simple -- PROC REG tells SAS you want to do regression, and the model statement specifies the model. In this case, the model is the simple straight line given above. The specification in line 22 is how you would communicate this to SAS. We will see how to fit more complicated multiple regression models in ST 512.

An annotated version of this program appears in the file /pub/st512/md/reg.sas. You should run it to see what the output looks like. We will learn how to interpret such output later in the course.

Now that you have some familiarity with the SAS language, you will want to look at the files means.sas, ttest.sas, and paired.sas for examples of other analyses. These programs are fully annotated and self-explanatory.

SAS -- THE SAS LANGUAGE AND SAMPLE PROGRAMS

SAS (Statistical Analysis System) is a software package for data manipulation and statistical analysis. The user writes a SAS program to perform the desired tasks.
A SAS program is composed of 2 fundamental components:

DATA step(s) -- the part of the program in which a structure for the data to be analyzed is created. This structure exists while the program is running. Variables corresponding to the various elements of the data set are defined, and the data are assigned to the variables. Data may be input manually in the body of the program, or they may be read in from a file. We will not get very fancy in this course with data input and manipulation, but be aware that it is possible to perform very complex data manipulation and to analyze very large data sets using SAS. It is also possible to create permanent data sets and files for use with later SAS runs.

PROCs (PROCedures) -- the SAS language is organized into a series of procedures, or PROCs, each of which is dedicated to a particular form of data manipulation or statistical analysis to be performed on data sets created in the DATA step. In the example programs below, we will consider several different PROCs:

    PROC MEANS: computes means, standard deviations and other summary statistics for some or all of the variables in a data set.
    PROC TTEST: computes the 2-sample t-test for comparing the means of 2 treatments.
    PROC REG: performs regression analysis using the method of least squares.
    PROC GLM: constructs the analysis of variance desired by the user, with associated F statistics, contrasts, etc.
    PROC PLOT: constructs plots of the data as specified by the user.
    PROC PRINT: prints the contents of a data set.

There are many other PROCs to perform data manipulation and other kinds of analyses; the SAS documentation describes these. We will discuss features of the PROCs above and some additional PROCs in lecture, demonstration labs, and in homework assignments.

A SAS program consists of one or more DATA steps to get the data into a format that SAS can understand and one or more calls to PROCs to perform various analyses on the data.
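The overall shape just described, a DATA step followed by calls to PROCs, can be sketched as follows. The data set name "growth", the variables "trt" and "height", and all of the numbers are made up purely for illustration; they are not from the course files.

```sas
/* Sketch only: "growth", "trt", "height" and the values are hypothetical. */
data growth;
   input trt $ height @@;
   cards;
A 12.1 A 11.4 A 13.0 A 12.6
B 14.2 B 13.5 B 15.1 B 14.0
;

/* Check the data as entered. */
proc print data=growth; run;

/* Summary statistics (mean, standard deviation, etc.) for height. */
proc means data=growth;
   var height;
run;

/* Two-sample t-test comparing mean height between the two treatments. */
proc ttest data=growth;
   class trt;
   var height;
run;
```

Each PROC names the data set explicitly with "data=growth", which keeps the program unambiguous if further DATA steps are added later.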
Several example SAS programs using each of these PROCs are available. The programs are meant to illustrate the features of data input using the DATA step that will be useful to us in the course as well as the syntax used in SAS PROCs. The programs are discussed below and can be viewed and run on SICL. They all reside in the directory

    /pub/st512/md

The example programs are mnemonically named:

    rcbd.sas: analysis of variance of a randomized complete block design using PROC GLM
    reg.sas: simple linear regression analysis using PROC REG and PROC PLOT
    means.sas: summary statistics using PROC MEANS
    ttest.sas: 2 sample t-test using PROC TTEST
    paired.sas: paired comparison t-test using PROC MEANS

The programs are annotated using comment statements (see below) with descriptions of the functions being executed by each line. The first two programs are discussed in detail below.

To view each program, get into SAS and use the SAS command

    include '/pub/st512/md/nameoffile'

on the command line of the PROGRAM window. Here, nameoffile refers to one of the 5 names above. The program will appear in the window. You may scroll through the program on the screen or print a hard copy to look at according to the instructions in the manual. You may then submit (execute) the program and view the results in the OUTPUT window. A hard copy of the results may also be printed out or written to a file as described in the manual.

If you wish to practice editing, you may copy the program file to a file in your own directory by using the command

    file 'nameyouchoose'

on the command line of the PROGRAM window. For example, you may wish to save 'means.sas' to your own directory under the name 'mymeans.sas.' To do so, simply type

    file 'mymeans.sas'

No directory information is needed if the file is to be in your own directory; it will automatically be placed there. When the include command is used later to retrieve it, no directory information is required.
This will save a copy of the file under the name you choose. Then, clear the window and include the renamed file in the PROGRAM window so that you will be working with it rather than the original. Now, you may practice editing and may augment the program if you wish.

Acknowledgments

The work on this manual was funded by the William Mendenhall Teaching Scholar Program. Part of the SAS Primer was taken from the Statistical Instructional Computing Laboratory Manual written by Tim Arnold and handouts from the SAS Short Course taught by Joy Smith and Sandy Donaghy, Department of Statistics, NCSU. Examples were taken from lecture notes prepared by David A. Dickey, PhD.

SAS is a registered trademark of the SAS Institute Inc., Cary, NC USA.

Copyright 1992 by D. Kim Chantala.