SAS Primer

Describing data: use PROC SUMMARY, PROC MEANS, or PROC UNIVARIATE

Both PROC MEANS and PROC SUMMARY compute descriptive statistics for an entire SAS data set. PROC UNIVARIATE can be used to describe the distribution of your data graphically with a stem-and-leaf plot, a normal probability plot, a horizontal bar chart, or a box plot, as well as to print quantiles, skewness, and a test statistic for normality.

Using PROC SUMMARY or PROC MEANS

There are two primary differences between PROC MEANS and PROC SUMMARY:

1. PROC MEANS, as described here, produces subgroup statistics only when a BY statement is used and the input data have previously been sorted (use PROC SORT) by the BY variables. PROC SUMMARY with a CLASS statement automatically produces statistics for all subgroups, giving you in one run all the information you would otherwise get by repeatedly sorting the data set by the variables that define each subgroup and running PROC MEANS. (Later SAS releases also accept a CLASS statement in PROC MEANS.)

2. PROC SUMMARY prints nothing by default, so you will need to use the OUTPUT statement to create a new data set and then use PROC PRINT to see the computed statistics.

All of the other options and statements used with PROC MEANS are identical to those used with PROC SUMMARY.

To invoke PROC SUMMARY, use the following statements:

    PROC SUMMARY [DATA=setname] [NWAY];
    [CLASS variables;]
    /* Choose additional statements from the list following PROC MEANS */

DATA= optionally chooses which data set to process. NWAY specifies that statistics are to be computed only for the highest level of interaction among the factors (class variables) in your data set. Use the CLASS statement to list all of the variables that define a group or category. An example of a class variable is fertilizer type.
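The PROC SUMMARY statements described above can be sketched as a small program. This is only an illustration under stated assumptions: the data set TRIAL, the variable BLOCK, the output names STATS, N_YIELD, AVEYIELD, and SDYIELD, and all data values are invented here and are not part of the primer's fertilizer example.

```sas
/* Hypothetical data: TRIAL, BLOCK, and every value below are
   invented for illustration only. */
data trial;
  input fert $ block yield;
  cards;
A 1 60
A 1 59
A 2 61
A 2 60
B 1 62
B 1 60
B 2 61
B 2 62
;

/* No sorting is needed: CLASS forms the subgroups. */
proc summary data=trial;
  class fert block;
  var yield;
  output out=stats n=n_yield mean=aveyield std=sdyield;
run;

/* PROC SUMMARY prints nothing by default, so print the output set. */
proc print data=stats;
run;
```

With two class variables that each have two levels, STATS should hold nine rows: one for the whole data set, two for the BLOCK levels, two for the FERT levels, and four for the FERT-by-BLOCK combinations. The automatic variable _TYPE_ identifies which level of grouping each row belongs to. Adding NWAY to the PROC SUMMARY statement would keep only the four FERT-by-BLOCK rows.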
To invoke PROC MEANS, use the following statement:

    PROC MEANS [DATA=setname];
    /* Choose additional statements from the following list */

You can use the following statements with either procedure:

[VAR variables;]
    Computes statistics for only the variables in the VAR statement. If the VAR statement is not used, statistics are computed for all numeric variables in the data set.

[BY variables;]
    Obtains a separate analysis for each group defined by the variables in the BY statement.

[OUTPUT OUT=newset keyword=name1 keyword=name2 ...;]
    Creates a new data set named newset containing the computed statistics, each stored under the variable name you assign to its keyword. Choose keywords from the following list:

    N          number of observations in the calculation
    MEAN       sample mean
    STD        standard deviation
    MIN        minimum value
    MAX        maximum value
    RANGE      range
    SUM        sum
    VAR        variance
    USS        uncorrected sum of squares
    CSS        corrected sum of squares
    STDERR     standard error of the mean
    SKEWNESS   skewness
    KURTOSIS   kurtosis
    T          Student's t value for testing the hypothesis that the population mean is zero
    PRT        probability of a greater absolute value of Student's t

Example: Plotting group means over the data

Using the fertilizer data set from the previous example on creating indicator variables, plot the group means over a plot of the original data.
SAS SOLUTION:

    /* This SAS program calculates group means and plots them
       over the original data set. */
    options pagesize=30;

    data fertlzr;
      input fert $ @;
      do rep=1 to 5;
        input yield @;
        x1=(fert='A'); x2=(fert='B'); x3=(fert='C'); x4=(fert='D');
        symbol='*';
        output;
      end;
      cards;
    A 60 61 59 60 60
    B 62 61 60 62 60
    C 63 61 61 64 66
    D 62 61 63 60 64
    ;

    proc means data=fertlzr;
      var yield;
      by fert;
      output out=average mean=aveyield;
    run;

    data remake;
      set average;       /* Take the data set from PROC MEANS and    */
      yield=aveyield;    /* store the means under the same name as   */
                         /* the original data.                       */
      symbol='@';        /* Give the means a different symbol from   */
                         /* the one used in set FERTLZR.             */
    run;

    data both;
      set remake fertlzr;        /* Concatenate the sets.              */
      keep fert yield symbol;    /* Keep only these variables in BOTH. */
    run;

    proc print data=both;
    run;

    proc plot data=both;
      plot yield*fert=symbol;
    run;

SAS OUTPUT:

                                     08:47 Friday, July 17, 1992  26

Analysis Variable : YIELD

------------------------------ FERT=A ------------------------------

 N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
 5    60.0000000     0.7071068    59.0000000    61.0000000
----------------------------------------------------------

------------------------------ FERT=B ------------------------------

 N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
 5    61.0000000     1.0000000    60.0000000    62.0000000
----------------------------------------------------------

------------------------------ FERT=C ------------------------------

 N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
 5    63.0000000     2.1213203    61.0000000    66.0000000
----------------------------------------------------------

------------------------------ FERT=D ------------------------------

 N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
 5    62.0000000     1.5811388    60.0000000    64.0000000
----------------------------------------------------------

OBS    FERT    YIELD    SYMBOL
  1     A        60       @
  2     B        61       @
  3     C        63       @
  4     D        62       @
  5     A        60       *
  6     A        61       *
  7     A        59       *
  8     A        60       *
  9     A        60       *
 10     B        62       *
 11     B        61       *
 12     B        60       *
 13     B        62       *
 14     B        60       *
 15     C        63       *
 16     C        61       *
 17     C        61       *
 18     C        64       *
 19     C        66       *
 20     D        62       *
 21     D        61       *
 22     D        63       *
 23     D        60       *
 24     D        64       *

          Plot of YIELD*FERT.  Symbol is value of SYMBOL.

   YIELD |
         |
      66 +                                      *
         |
      65 +
         |
      64 +                                      *                 *
         |
      63 +                                      @                 *
         |
      62 +                    *                                   @
         |
      61 +  *                 @                 *                 *
         |
      60 +  @                 *                                   *
         |
      59 +  *
         |
         ---+-----------------+-----------------+-----------------+--
            A                 B                 C                 D
                                      FERT

NOTE: 9 obs hidden.

PROC UNIVARIATE

Use this procedure to examine the statistical distribution of a variable. The following statements are used to call the UNIVARIATE procedure:

PROC UNIVARIATE [DATA=setname] [PLOT] [NORMAL];
    Initiates the UNIVARIATE procedure. Choose the data set to process with the DATA= option; choose PLOT to get a stem-and-leaf plot or a horizontal bar chart, a box plot, and a normal probability plot; choose NORMAL to compute a test statistic for the hypothesis that the data come from a normal distribution.

[VAR variables;]
    Tells UNIVARIATE which variables to process. If you omit the VAR statement, statistics are calculated for all numeric variables in the data set.

[BY variables;]
    Analyzes the data in groups defined by the BY variables. Run PROC SORT first so that the data set is sorted in ascending order of the BY variables.

[OUTPUT OUT=newset keyword=name1 keyword=name2 ...;]
    Produces a new output data set in which the requested statistics are stored under the variable names name1, name2, and so on. Choose keywords from the following list: N, MEAN, SUM, STD, VAR, SKEWNESS, KURTOSIS, MAX, MIN, RANGE, Q3 (upper quartile, the 75th percentile), MEDIAN, Q1 (lower quartile, the 25th percentile).
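See PROC MEANS for keyword definitions.

For example, the following statements

    PROC UNIVARIATE;
      VAR INCOME GRADE;
      BY STATE;
      OUTPUT OUT=NEW MEAN=AVE_INC AVE_GR VAR=VAR_INC VAR_GR;

create a new data set containing one observation for each value of STATE (two observations when the data cover two states) and the variables STATE, AVE_INC, AVE_GR, VAR_INC, and VAR_GR. The names after each keyword are matched in order to the variables in the VAR statement, so AVE_INC and VAR_INC describe INCOME while AVE_GR and VAR_GR describe GRADE. You can use any number of OUTPUT statements with PROC UNIVARIATE. Note that the two uses of VAR (VARiable and VARiance) cause SAS no problem.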
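A complete run of the INCOME and GRADE example might look like the following sketch. The data set SURVEY and all of its values are invented for illustration, and the PLOT and NORMAL options are added only to show where they go; none of this comes from the primer's own data.

```sas
/* Hypothetical data: SURVEY and every value below are invented
   for illustration only. */
data survey;
  input state $ income grade;
  cards;
OH 21000 11
OH 25500 12
OH 19800 10
OH 30200 14
OH 27400 12
IN 18900 10
IN 23100 12
IN 26700 13
IN 22000 11
IN 29500 15
;

/* BY-group processing expects the data sorted by the BY variable. */
proc sort data=survey;
  by state;
run;

/* PLOT requests the plots and NORMAL the normality test for each
   variable within each state; OUTPUT saves the means and variances. */
proc univariate data=survey plot normal;
  var income grade;
  by state;
  output out=new mean=ave_inc ave_gr var=var_inc var_gr;
run;

proc print data=new;
run;
```

Because the hypothetical data cover two states, NEW should contain two observations, one for each value of STATE.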