EXAMPLES

To introduce the features of SAS entities such as the DATA step, comment statements, and PROCs, we will consider the program in the file rcbd.sas, which performs an analysis of variance for a randomized complete block design (RCBD). An experiment was set up in a randomized complete block design to investigate differences in yield for 7 hybrid varieties of wheat. In each of 5 blocks, coded as I, II, III, IV, V, 7 plots were defined, and each plot was randomly allocated to one of the 7 varieties, a, b, c, d, e, f, g. Thus, there was one plot for each variety in each block, so one yield observation is available for each block-variety combination. What follows is a description of the components of a SAS program to analyze the data. (This example is described more fully in Chapter 9 of this instructor's lecture notes for ST 511.)

DATA step features: Consider the following SAS statements:

     1  data bushels;
     2    input block $ variety $ yield @@;
     3    cards;
     4  I a 10 I b 9 I c 11 I d 15 I e 10 I f 12 I g 11
     5  II a 11 II b 10 II c 12 II d 12 II e 10 II f 11 II g 12
     6  III a 12 III b 13 III c 10 III d 14 III e 15 III f 13
     7  III g 13
     8  IV a 14 IV b 15 IV c 13 IV d 17 IV e 14 IV f 16 IV g 15
     9  V a 13 V b 14 V c 16 V d 19 V e 17 V f 15 V g 18
    10  ;

This block of statements is a DATA step. The lines have been numbered from 1-10 so that we may refer to them in this description. (The numbers are not part of the program, but are here for convenience in referring to specific statements.)

Line 1 defines a data set named "bushels." As far as SAS is concerned, for the duration of the program, "bushels" is the name of the data set containing the data. Line 1 ends, as do all SAS statements, with a semicolon. SAS does not care where statements begin or end on a line, nor does it care how many statements appear on a single line, as long as each statement ends with a semicolon. This is true not only in the DATA step, but throughout a SAS program.
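
As a small illustration of this free-format rule, the sketch below squeezes a DATA step and a call to a PROC onto as few lines as possible. The data set name demo and its two observations are made up for illustration and are not part of rcbd.sas. The only layout requirement assumed here is that the data lines begin on the line after the cards statement. The lines beginning with an asterisk are explanatory comments.

```sas
* Hypothetical two-observation data set, not part of rcbd.sas;
data demo; input block $ variety $ yield @@; cards;
I a 10 I b 9
;
* Two statements on one line are fine as long as each ends in a semicolon;
proc print data=demo; run;
```

Although SAS accepts this layout, the one-statement-per-line form used in these notes is much easier to read.
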
Lines 2 and 3 have therefore been indented solely to make the program "look nice"; all SAS cares about is what the statements say.

Line 2 contains an input statement. This statement defines the variables that hold the data and the information SAS needs to run an analysis of variance (see below). A single observation in this data set consists of a yield observation identified by the block and variety from which it arose. To identify this structure to SAS, we must define variables containing the block and variety information for a given observation as well as the actual yield response value.

In particular, there are 3 variables. The first, block, identifies the block from which an observation arose (I, II, etc.). The trailing "$" following a space tells SAS that the variable being defined has character (as opposed to numerical) values. A variable without a "$" is assumed by SAS to take on numerical values. The second variable, variety, also has character values (a, b, etc.), so it also has a trailing "$". The final variable is yield, which will contain the actual values of the yields (no "$"). Thus, the data set "bushels" will contain 3 variables: block, describing from which block a yield value came; variety, describing the variety from which a yield value came; and yield, the actual value. Each triplet (block, variety, yield) identifies a single observation.

You may name SAS variables anything you like, as long as the names do not begin with a number and are 8 or fewer characters long.

The "@@" following a space after yield is a special feature of the input statement. Without it, the data would have to be entered one observation per line. Following the input statement is a cards statement. Cards tells SAS that what follows after the semicolon is the actual data to be assigned to the variables in the input statement.
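
One way to confirm that SAS has interpreted the "$" markers as intended is to ask for a description of the data set. The sketch below assumes a small made-up data set named demo (not part of rcbd.sas) and uses PROC CONTENTS, a procedure not otherwise discussed in this section, which lists each variable along with its type. Under these assumptions, block and variety should be reported as character variables and yield as numeric.

```sas
* Hypothetical three-observation data set for illustration only;
data demo;
  input block $ variety $ yield @@;
  cards;
I a 10 I b 9 II a 11
;
* List the variables in demo along with their types;
proc contents data=demo; run;
```
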
If the "@@" were left off line 2 of the bushels DATA step, so that the input statement read

     2    input block $ variety $ yield;

instead, the data would have to be entered as follows:

    I a 10
    I b 9
    I c 11
    I d 15
    etc.

That is, each single observation consisting of a block, variety and yield specification would have to appear on its own line. The function of "@@" is thus to allow one to "string out" the entering of data so that more than one (block, variety, yield) grouping can appear on each line. This can make a program much shorter than if only one observation could appear on a line. Reading the data in order, the first observation is from block I, on variety a, and the yield was 10 bu/acre. The second observation in the data set is from block I, variety b, and the yield was 9 bu/acre.

The final thing to note is that after all of the data have been entered, on a separate line by itself is yet another semicolon (line 10). This lone semicolon indicates to SAS that the data set is complete. Lines 1-10 together make up the complete DATA step that defines the data to SAS.

PROC features: Now consider our DATA step with more statements added to tell SAS what we would like to do with the data.

     1  data bushels;
     2    input block $ variety $ yield @@;
     3    cards;
     4  I a 10 I b 9 I c 11 I d 15 I e 10 I f 12 I g 11
     5  II a 11 II b 10 II c 12 II d 12 II e 10 II f 11 II g 12
     6  III a 12 III b 13 III c 10 III d 14 III e 15 III f 13
     7  III g 13
     8  IV a 14 IV b 15 IV c 13 IV d 17 IV e 14 IV f 16 IV g 15
     9  V a 13 V b 14 V c 16 V d 19 V e 17 V f 15 V g 18
    10  ;
    11  *;
    12  * Now print out the data;
    13  *;
    14  proc print data=bushels; run;
    15  *;
    16  * Run an analysis of variance;
    17  *;
    18  proc glm data=bushels; class block variety;
    19  model yield = block variety; run;

Once the data are entered in the DATA step (lines 1-10), we can use different PROCs to operate on the data. The statements in lines 11-19 use two PROCs to print out the data and run the appropriate analysis of variance. We now discuss these statements.

Comment statements: First, note that there are some statements that begin with an asterisk "*". In SAS, any statement that begins with an asterisk is a comment statement: it is not a program statement but an explanatory statement inserted by the author to clarify what the program is doing. It is a good idea to comment your programs well, so that when you refer to them later you will remember what you did. Note that comment statements also end in semicolons. The "blank" comments *; in the program above were put in simply to create space between the DATA step and the PROC statements that follow, as well as to set off the comments from the programming statements.

An alternate method of inserting a comment is to enclose it as follows:

    /* comment goes here and can be as long as
       you want and cover several lines */

We will see a fancy example of this in the final version of the program below.

After the first set of comment statements, a call to PROC PRINT is made. PROC PRINT does nothing more than print the contents of a data set. It is always a good idea to print out any data set you have entered yourself to check for typos. The specification "data=bushels" tells SAS to print the contents of this data set. Actually, if we had left "data=bushels" off, and simply had

    14  proc print; run;

instead, SAS would still print the contents of bushels, since bushels was the last data set referred to in the program. It is usually a good idea to specify the name of the data set, though, since in more complicated programs you may have defined several data sets. The run statement that follows PROC PRINT, or the statements of any other PROC, simply tells SAS to execute that procedure.

The final set of statements, after another block of comment statements, is a call to PROC GLM in order to construct the analysis of variance. Again, "data=bushels" could have been left off. The second statement is the class statement.
The class statement informs the PROC which variables are to be regarded as classification variables defining the elements of the design. Here, block and variety are the two classifications for a yield observation. Yield is the response, so it is not included in the class statement.

The final statement is a model statement, in which the response appears to the left of the equals sign and the classification effects to the right. This tells SAS to fit a two-way classification model with no interaction, i.e.,

    Y(ij) = m + a(i) + b(j) + e(ij),

where Y(ij) is the yield from the ith variety in the jth block, m is the overall mean, a(i) is the effect of the ith variety, b(j) is the effect of the jth block, and e(ij) is the error associated with Y(ij). (Note that because there is only one observation per block-variety combination, the model includes no interaction term. We will see later in the course how to include interactions in the ANOVA using PROC GLM.) The model statement is followed by a run statement to indicate that SAS is to perform this analysis.
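
As a check when reading the PROC GLM output, the model and its degrees of freedom can be written out by hand. The Greek letters below are a notational assumption standing in for m, a(i), b(j) and e(ij) above, and t = 7 and r = 5 are symbols introduced here for the numbers of varieties and blocks.

```latex
\begin{align*}
Y_{ij} &= \mu + \alpha_i + \beta_j + \varepsilon_{ij},
  \qquad i = 1, \dots, 7, \quad j = 1, \dots, 5, \\
\mathrm{df}_{\text{total}}   &= tr - 1 = 35 - 1 = 34, \\
\mathrm{df}_{\text{block}}   &= r - 1 = 4, \\
\mathrm{df}_{\text{variety}} &= t - 1 = 6, \\
\mathrm{df}_{\text{error}}   &= (t - 1)(r - 1) = 24 = 34 - 4 - 6.
\end{align*}
```

With all 35 observations entered correctly, the output should therefore show 10 model degrees of freedom (4 + 6), 24 for error, and 34 for the corrected total. Different numbers usually point to a typo in the data lines.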