Describing data: use PROC SUMMARY, PROC MEANS, or PROC UNIVARIATE
SAS Primer
Both PROC MEANS and PROC SUMMARY compute descriptive statistics for an
entire SAS data set. PROC UNIVARIATE describes the distribution of
your data graphically, with a stem-and-leaf plot or horizontal bar
chart, a box plot, and a normal probability plot. It also prints
quantiles, skewness, and a test statistic for normality.
Using PROC SUMMARY or PROC MEANS
There are two primary differences between PROC MEANS and
PROC SUMMARY:
1. PROC MEANS produces subgroup statistics when a BY statement
is used and the input data have been previously sorted
(use PROC SORT) by the BY variables. PROC SUMMARY
automatically produces statistics for all subgroups defined
by the CLASS variables, giving you in one run all the
information that you would otherwise get by repeatedly
sorting the data set by the variables that define each
subgroup and running PROC MEANS. (Current SAS releases also
accept a CLASS statement in PROC MEANS.)
2. By default PROC SUMMARY prints nothing in your output,
so you will need to use the OUTPUT statement to create a
new data set and PROC PRINT to see the computed statistics.
All of the options and statements used with PROC MEANS are identical
to those used with PROC SUMMARY. To invoke PROC SUMMARY use the
following statements:
PROC SUMMARY [DATA=setname] [NWAY];
[CLASS variables;]
/* Choose additional statements from the list following PROC MEANS */
where you can optionally choose which data set to process and add NWAY
to specify that statistics are to be computed only for the highest
level of interaction among the factors (class variables) in your data
set. Use a CLASS statement to list all of the variables used in
specifying a group or category. An example of a class variable is
fertilizer type.
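
The effect of NWAY can be seen with a single class variable. The
following is a minimal sketch, assuming the FERTLZR data set (variables
FERT and YIELD) built in the fertilizer example in this section; the
output data set names ALLWAYS and GROUPS and the variable name AVEYIELD
are chosen here for illustration only.

```sas
/* Sketch only: assumes a data set FERTLZR with variables FERT and YIELD. */

/* Without NWAY: one overall row (_TYPE_=0) plus one row per FERT level. */
proc summary data=fertlzr;
   class fert;
   var yield;
   output out=allways mean=aveyield;
run;

/* With NWAY: only the rows for the individual FERT levels are kept. */
proc summary data=fertlzr nway;
   class fert;
   var yield;
   output out=groups mean=aveyield;
run;

/* PROC SUMMARY prints nothing by default, so print the new data sets. */
proc print data=allways;
run;
proc print data=groups;
run;
```

Comparing the two printed data sets shows which rows NWAY drops. The
automatic variable _TYPE_ identifies the level of each row.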
To invoke PROC MEANS use the following statement:
PROC MEANS [DATA=setname ];
/*Choose additional statements from the following list*/
You can use the following statements with either procedure:
[VAR variables ;]
/* Compute statistics for only the
variables in the VAR statement; if
the VAR statement is not used
statistics will be computed for all
numeric variables in the data set.
*/
[BY variables ;]
/*Obtain separate analysis for the
groups defined by the variables in
the BY statement*/
[OUTPUT OUT=newset keyword=name1 keyword=name2 . . .;]
Creates a new data set named newset containing the computed
statistics from the following list of keyword names:
N (number of observations in calculation),
MEAN (sample mean),
STD (standard deviation),
MIN (minimum value),
MAX (maximum value),
RANGE (range),
SUM (sum),
VAR (variance),
USS (uncorrected sum of squares),
CSS (corrected sum of squares),
STDERR (standard error of the mean),
SKEWNESS,
KURTOSIS,
T (Student's t value for testing the
hypothesis that the population mean
is zero),
PRT (probability of a greater absolute
value of Student's t).
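
As a hedged illustration of the keyword=name form, the sketch below
requests several of the statistics listed above for each fertilizer
group. It assumes the FERTLZR data set (variables FERT and YIELD) from
the example that follows; the data set name STATS and the variable
names to the right of each equals sign are illustrative choices, not
names required by SAS.

```sas
/* Sketch only: FERTLZR, FERT, and YIELD come from the example in this   */
/* section; STATS and the names on the right of each = are illustrative. */
proc means data=fertlzr noprint;   /* NOPRINT suppresses the listing */
   var yield;
   by fert;                        /* FERTLZR is already in FERT order */
   output out=stats n=n_yield mean=m_yield std=sd_yield
                    stderr=se_yield t=t_yield prt=p_yield;
run;

proc print data=stats;
run;
```

Each keyword stores its statistic under the name that follows it, so
STATS holds one observation per value of FERT with one variable per
requested statistic.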
Example: Plotting group means over the data
Using the fertilizer data set in the previous example on
creating indicator variables, plot the group means over a
plot of the original data.
SAS SOLUTION:
/* This SAS program calculates group means and plots them
   over the original data set. */
options pagesize=30;
data fertlzr;
  input fert $ @;
  do rep=1 to 5;
    input yield @;
    x1=(fert='A'); x2=(fert='B');
    x3=(fert='C'); x4=(fert='D');
    symbol='*';
    output;
  end;
cards;
A 60 61 59 60 60
B 62 61 60 62 60
C 63 61 61 64 66
D 62 61 63 60 64
;
proc means data=fertlzr;   /* FERTLZR is already in FERT order */
  var yield;
  by fert;
  output out=average mean=aveyield;
run;
data remake;               /* Take the data set from PROC MEANS and  */
  set average;             /* create a variable for the means with   */
  yield=aveyield;          /* the same name as in the original data. */
  symbol='@';              /* Give the means a different symbol from */
run;                       /* the one used in set FERTLZR.           */
data both;
  set remake fertlzr;      /* Concatenate the sets. */
  keep fert yield symbol;  /* Keep only these variables in BOTH. */
run;
proc print data=both;
run;
proc plot data=both;
  plot yield*fert=symbol;
run;
SAS OUTPUT:
08:47 Friday, July 17, 1992   26

Analysis Variable : YIELD

------------------------- FERT=A -------------------------
N          Mean       Std Dev       Minimum       Maximum
5    60.0000000     0.7071068    59.0000000    61.0000000

------------------------- FERT=B -------------------------
N          Mean       Std Dev       Minimum       Maximum
5    61.0000000     1.0000000    60.0000000    62.0000000

------------------------- FERT=C -------------------------
N          Mean       Std Dev       Minimum       Maximum
5    63.0000000     2.1213203    61.0000000    66.0000000

------------------------- FERT=D -------------------------
N          Mean       Std Dev       Minimum       Maximum
5    62.0000000     1.5811388    60.0000000    64.0000000

OBS    FERT    YIELD    SYMBOL
  1    A          60    @
  2    B          61    @
  3    C          63    @
  4    D          62    @
  5    A          60    *
  6    A          61    *
  7    A          59    *
  8    A          60    *
  9    A          60    *
 10    B          62    *
 11    B          61    *
 12    B          60    *
 13    B          62    *
 14    B          60    *
 15    C          63    *
 16    C          61    *
 17    C          61    *
 18    C          64    *
 19    C          66    *
 20    D          62    *
 21    D          61    *
 22    D          63    *
 23    D          60    *
 24    D          64    *

Plot of YIELD*FERT.  Symbol is value of SYMBOL.

YIELD |
   66 +                         *
      |
   65 +
      |
   64 +                         *          *
      |
   63 +                         @          *
      |
   62 +              *                     @
      |
   61 +   *          @          *          *
      |
   60 +   @          *                     *
      |
   59 +   *
      |
      +---+----------+----------+----------+---
          A          B          C          D
                        FERT

NOTE: 9 obs hidden.
PROC UNIVARIATE
Use this procedure to examine the statistical distribution
of a variable. The following statements are used to call
the UNIVARIATE procedure:
PROC UNIVARIATE [DATA=setname ] [PLOT] [NORMAL];
Initiates the UNIVARIATE procedure. Choose the data set to
process with the DATA= option; choose PLOT to get a
stem-and-leaf plot or a bar chart, a box plot, and a
normal probability plot; choose NORMAL to compute a test
statistic for the hypothesis that the data come from a
normal distribution.
[VAR variables ;]
Tells UNIVARIATE which variables to process. If you omit
the VAR statement, statistics will be calculated for all
numeric variables in the data set.
[BY variables ;]
Analyzes the data in groups defined by the BY variables.
Use PROC SORT first so that the data set is sorted in
ascending order of the BY variables.
[OUTPUT OUT=newset keyword=name1 keyword=name2 . . .;]
Produces a new output data set in which each requested
statistic is stored under the variable name you supply.
Choose keywords from the following list:
N, MEAN, SUM, STD, VAR, SKEWNESS, KURTOSIS, MAX, MIN, RANGE,
Q3 (upper quartile, 75th percentile), MEDIAN,
Q1 (lower quartile, 25th percentile).
See PROC MEANS for keyword definitions.
For example, the following statements
PROC UNIVARIATE;
VAR INCOME GRADE;
BY STATE;
OUTPUT OUT=NEW MEAN=AVE_INC AVE_GR VAR=VAR_INC VAR_GR;
create a new data set, NEW, containing one observation for each
value of STATE (two observations when the data cover two states)
and the variables STATE, AVE_INC, AVE_GR, VAR_INC, and VAR_GR. You
can use any number of OUTPUT statements with PROC
UNIVARIATE. Note that the two uses of VAR (VARiable and
VARiance) cause no problem to SAS.
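
The same pattern can be tried on the fertilizer data. The following is
a minimal sketch, assuming the FERTLZR data set (variables FERT and
YIELD) from the PROC MEANS example; the data set names SORTED and QUART
and the output variable names are illustrative only.

```sas
/* Sketch only: assumes FERTLZR (FERT, YIELD) from the earlier example. */
proc sort data=fertlzr out=sorted;   /* BY processing needs sorted data */
   by fert;
run;

proc univariate data=sorted plot normal;
   var yield;
   by fert;
   /* One output observation per FERT level; names are illustrative. */
   output out=quart n=n_yield median=med_yld q1=q1_yld q3=q3_yld;
run;

proc print data=quart;
run;
```

PLOT and NORMAL affect only the printed output. The data set QUART
contains just the BY variable and the statistics named in the OUTPUT
statement.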