If you analyze your data, you probably want to know descriptive statistics such as the sum and the mean. If you work with tables, you can calculate the mean of a row or a column. In this article, we discuss how to calculate the mean of a column in SAS.
In SAS, you can calculate the mean of a column with PROC SQL, PROC MEANS, PROC SUMMARY, PROC UNIVARIATE, and an ordinary Data Step. PROC SQL provides the easiest way to get the column average, while the other procedures give you more flexibility. The Data Step is more complex but shows you the cumulative mean too.
In the remainder of this article, we take a detailed look at all 5 methods. We provide examples with reusable SAS and discuss their advantages and disadvantages. At the end of this article, you find a table with a comparison of the 5 methods.
Throughout this article, we will use the terms mean and averages interchangeably.
In all examples, we use the dataset work.my_data which has 1 column (MyColumn) and 5 rows. The goal is to find the average of the column MyColumn (which is (1 + 2 + 3 + 4 + 5) / 5 = 3).
Here is the SAS code to create this dataset.
data work.my_data; input MyColumn; datalines; 1 2 3 4 5 ; run;
1. Calculate the Column Mean in SAS with PROC SQL
The first, and probably easiest method to calculate the mean of a column in SAS is with PROC SQL.
PROC SQL is a powerful SAS Base procedure that you can use to process SQL statements. So, if you want to find the average of a column and you have experience with coding in SQL, this will be your preferred method.
This method consists of 4 steps:
- Open the SQL procedure with PROC SQL.
- Create a new column containing the column mean by using the SELECT statement and the AVG(column-name) function. The AVG function calculates the average of the provided column. After the AS keyword, you define the name of the column that will contain the result of the AVG function.
- The FROM statement provides the dataset where SAS can find the column of which you want to calculate the mean.
- Close the SQL procedure with QUIT.
proc sql; select avg(MyColumn) as Avg_MyColumn from work.my_data; quit;
By default, PROC SQL shows only the result of the executed code on your screen. In other words, it doesn’t create an output dataset.
If you need a table with the calculated mean, you need to add the CREATE TABLE statement to your code. After this statement, you provide the name of the table that SAS will create, and the AS keyword. If you use this code, SAS doesn’t print the result to your screen.
proc sql; create table work.avg_my_data as select avg(MyColumn) as Avg_MyColumn from work.my_data; quit;
To summarize, PROC SQL provides a quick and easy way to calculate the average of a column. Especially, if you have experience with coding in SQL. However, this method either shows the column average on your screen or creates a table with the result.
2. Calculate the Column Mean in SAS with PROC MEANS
The second method to calculate the mean of a column in SAS is with PROC MEANS.
PROC MEANS is a SAS Base procedure that you can use for analyzing your data. It provides descriptive statistics such as the number of observations, the sum, the mean, and the median.
If you use PROC MEANS to calculate the average of a column in SAS, your code has (at least) 3 parts:
- You start the procedure with the PROC MEANS statement followed by the DATA=-option. With the DATA=-option you provide the input dataset. You use the MEAN keyword to only calculate the column mean.
- With the VAR <column-name> statement, you let SAS know of which column you want to calculate the mean.
- You close the procedure with the RUN statement.
proc means data=work.my_data mean; var MyColumn; run;
By default, PROC MEANS only prints the result of the procedure to your screen. However, you can add an OUTPUT statement to your code to generate an output table too.
Within the OUTPUT statement, you use the OUT=-option to define the name of the output table. With the MEAN=-option, you define the column name that contains the column mean.
With the code below we create an output table where the column Avg_MyColumn contains the average of the MyColumn column.
proc means data=work.my_data mean; var MyColumn; output out=work.avg_my_data mean=Avg_MyColumn; run;
As you can see in the image above, the OUTPUT statement of the MEANS procedure creates two additional columns (_TYPE_ and _FREQ_). You can remove these two columns with the DROP=-option. Check this article to learn more about selecting and removing columns in SAS.
3. Calculate the Column Mean in SAS with PROC SUMMARY
The third method to calculate the average of a variable in SAS is with PROC SUMMARY.
PROC SUMMARY is also a SAS Base procedure to analyze data and calculate descriptive statistics. This procedure is very similar to PROC MEANS, but there is one big difference. By default, PROC SUMMARY doesn’t print the result of the executed code to your screen.
You need 3 statements to calculate the average of a SAS variable with PROC SUMMARY:
- Start the procedure with the PROC SUMMARY statement. Use the DATA=-option to define the input table. Provide the MEAN keyword to calculate the mean of the SAS variable. Use the PRINT keyword to print the result to your screen. If you don’t provide the PRINT keyword, SAS will return an error.
- Use the VAR <variable-name> statement to let SAS know of which column you want to calculate the mean.
- Finish the SUMMARY procedure with the RUN statement.
proc summary data=work.my_data mean print; var MyColumn; run;
With the code above, SAS only prints the average of the variable to your screen. So, if you need a table with the column average, you need extra code.
Like the MEANS procedure, you need to add an OUTPUT statement. Again, with the OUT=-option you provide the name of the desired output table. You use the MEAN=-option to define the name of the new column.
proc summary data=work.my_data mean print; var MyColumn; output out=avg_my_data mean=Avg_MyColumn; run;
The output of PROC SUMMARY is identical to PROC MEANS. Hence, the output dataset has two additional columns which you could remove with the
4. Calculate the Column Mean in SAS with PROC UNIVARIATE
The fourth way to calculate the mean of a column in SAS is with PROC UNIVARIATE.
PROC UNIVARIATE is a powerful SAS Base procedure that you can use to assess the distribution of your data, including a test for normality. Although this procedure is more for statistical purposes, you can still use it to calculate the average of a variable.
You can calculate the column average using PROC UNIVARIATE with 3 statements:
- Start with the PROC UNIVARIATE statement and the DATA=-option. You use the DATA=-option to define your input dataset.
- Use the VAR <column-name> to specify which column you want to use to calculate the average.
- Finish your code with the RUN statement.
proc univariate data=work.my_data; var MyColumn; run;
Running the code above will produce the following output.
As you can see, the UNIVARIATE procedure generates much information about the selected variable. You can find the mean of the variable in the Moments section.
Like the previous procedures, the UNIVARIATE procedure only generates a report. However, you can add extra code to your program to create a SAS dataset.
You use the OUTPUT statement to define the name of the output dataset (OUT=-option) and the statistic that it will contain (MEAN=-option).
proc univariate data=work.my_data; var MyColumn; output out=work.avg_my_data mean=Avg_MyColumn; run;
Compared to the MEANS and SUMMARY procedures, the output dataset of the UNIVARIATE procedure has the advantages that it doesn’t contain the 2 additional columns (_TYPE_ and _FREQ_).
5. Calculate the Column Mean in SAS with a Data Step
The fifth, and final method to calculate the column mean in SAS is with a Data Step.
As you know, a SAS Data Step processes the input data row by row. This is great to calculate the average of a row and might seem less useful to calculate the average of a column. However, it is still possible and provides an extra advantage compared to the previous methods.
As a SAS Data Step processes the dataset in the SET statement row by row, it is necessary to know the sum of the previous rows and the number of rows to calculate the column average. Using the RETAIN keyword you can add the value of the current row to the sum of all previous rows. This way, SAS will create a dataset with a new column that contains the cumulative sum.
With the cumulative sum and the special variable _N_, we can calculate the cumulative average of each row. The variable _N_ contains the row number of the row that SAS is currently processing.
With the code below we create a new, temporary dataset that contains two new columns:
- Sum_MyColumn with the cumulative sum.
- Avg_MyColumn with the cumulative average.
data work.avg_my_data_tmp; set work.my_data; retain Sum_MyColumn; Sum_MyColumn = sum(Sum_MyColumn, MyColumn); Avg_MyColumn = Sum_MyColumn / _N_; run;
If you are only interested in the total average of the column, you need to select the last row. You can do this with the END=-option and the IF statement.
data work.avg_my_data; set work.avg_my_data_tmp end=last_obs; if last_obs then output; run;
If you want to know more about the END=-option and how to select specific rows from a SAS dataset, we highly recommend this article.
In this article, we’ve discussed 5 methods to calculate the mean of a column in SAS. Each method has its own use case. The table below provides an overview of the 5 methods, their advantages, and their disadvantages.
|PROC SQL||Low||Quick & Easy||Can’t create Results & Output table with same code|
|PROC MEANS||Medium||Creates Result & Output table by default||Output table contains 2 extra columns (_TYPE_ & _FREQ_)|
|PROC SUMMARY||Medium||Creates only Results table by default||Output table contains 2 extra columns (_TYPE_ & _FREQ_)|
|PROC UNIVARIATE||Medium||Many Statistics by default||Can be slow because of many extra stastics|
|DATA STEP||High||Also creates cumulative mean||Hard(er) to code|