SAS How To's

How to Replace Missing Values in SAS

In this article, we demonstrate how to replace missing values in SAS. We show how to replace missing values with zeros, the mean or the max (and many more). All examples are supported by images and SAS code.

If you want to count the number of missing values before replacing them, you can check this article.

Example Data

Before we start, we create a SAS data set with 4 columns. The first column group contains a group identifier. The other three numeric columns contain some missing values.

/* CREATE A DATA SET */
data work.ds_missing_values;
	infile datalines dlm=',';
	input group $ var1 var2 var3;
	datalines;
A, 1, 2, 3
A, ., 2, 5
A, 3, 4, .
A, 8, ., 8
B, ., ., 1
B, ., 2, 6
B, 3, 1, 8
C, 9, 4, .
C, ., ., 7
;
run;
A data set with missing values

Replace Missing Values with Zeros

The most common value to replace a missing value is zero. All three methods below replace missing values with a zero. The methods differ in readability and complexity.

SAS Missing Values replaced with Zeros
Missing Values Replaced With Zeros

Using COALESCE

The easiest way to substitute a missing value with a zero is using the COALESCE function. The COALESCE function assigns the first non-missing value of its arguments to a (new) variable. Although this method works fine, it isn’t very efficient and you need to write a separate line of code for each variable. In this article, we explain this function in more detail.

/* REPLACE WITH 0 - USING COALESCE */
data work.ds_no_missing_values;
	set work.ds_missing_values;
 
	var1 = coalesce(var1,0);
	var2 = coalesce(var2,0);
	var3 = coalesce(var3,0);
run;

Using ARRAY

Unlike the previous method, you can use an array to simply replace the missing values of all numeric columns. The code snippet below shows how to do this. First, we create an array (num_array) that holds all the numeric columns. Then, we loop over all elements of this array using a do loop. Finally, we use an if-then statement to replace the missing values with a zero.

/* REPLACE WITH 0 - USING ARRAY */
data work.ds_no_missing_values;
	set work.ds_missing_values;
 
	array num_array _numeric_;
	do over num_array;
		if missing(num_array) then num_array = 0;
	end;
run;

If you don’t want to replace the missing values of all numeric variables, then you need to replace the _numeric _ keyword with the variable(s) where you want to substitute the missing values.

Do you know? How to Create Efficient SAS Code with Arrays

Using PROC STDIZE

Even though the COALESCE function and the array work fine, there is a more efficient way. The STDIZE procedure is the easiest way to replace missing values with a zero. Also, it requires less code than previous methods. The example below demonstrates how to use the STDIZE procedure substitute missing values with a zero.

/* REPLACE WITH 0 - PROC STDIZE */
proc stdize data=work.ds_missing_values 
	out=work.ds_no_missing_values 
	reponly missing=0;
run;

In general, you use the STDIZE procedure to standardize your data. However, you can suppress this default behavior with the reponly keyword such that you can use this procedure to replace missing values. With the missing keyword you specify that you want to replace the missing values with a zero.

With the code above you can replace the missing values in all columns easily. However, you can control the column(s) where you want to replace the missing values with the var statement. The example below show you to substitute missing values in the columns var1 and var3.

/* REPLACE WITH 0 - PROC STDIZE */
proc stdize data=work.ds_missing_values 
	out=work.ds_no_missing_values 
	reponly missing=0;
        var var1 var3
run;

Obviously, you can use all three methods mentioned above to replace missing values with another value then zero.

Replace Missing Values with the Mean / Median

Two other frequently used options to replace missing values are the mean and median. In this section we show how to easy replace missing values in SAS with these two statistics.

Using PROC STDIZE

In the last section, we used the STDIZE procedure to replace missing values with zero. Even so, you can also use the STDIZE procedure to replace missing values with the group mean. First, we specify the input and (optional) output data set. Then, we use the reponly keyword to only replace missing values. With the method keyword, we let SAS know to replace missing values with the group mean. Finally, with the by statement, we specify how to group the data set.

/* REPLACE WITH MEAN (OF GROUP) - PROC STDIZE*/
proc stdize data=work.ds_missing_values 
	out=work.ds_no_missing_values 
	reponly method=mean;
	by group;
run;
SAS Missing Values Replaced With The Group's Mean
Missing Values Replaced With The Group’s Mean

Instead of the group mean, you can also replace the missing value with the overall (column) mean. Simple remove the by statement and re-run the SAS code.

/* REPLACE WITH MEAN - PROC STDIZE*/
proc stdize data=work.ds_missing_values 
	out=work.ds_no_missing_values 
	reponly method=mean;
run;
SAS Missing Values Replaced With The Overall Mean
Missing Values Replaced With The Overall Mean

Besides the (group) mean, you can also use the STDIZE procedure to replace the missing value with the group median. To substitute the missing values with the median, you need to et the method keyword equal to ‘median’.

/* REPLACE WITH MEDIAN (OF GROUP) - PROC STDIZE*/
proc stdize data=work.ds_missing_values 
	out=work.ds_no_missing_values 
	reponly method=median;
	by group;
run;
SAS Missing Values Replaced With The Group's Median
Missing Values Replaced With The Group’s Median

To replace the missing values with the overall column median, you need to remove the by statement. See the example code below.

/* REPLACE WITH MEDIAN - PROC STDIZE*/
proc stdize data=work.ds_missing_values 
	out=work.ds_no_missing_values 
	reponly method=median;
run;
SAS Missing Values Replaced With The Overall Median
Missing Values Replaced With The Overall Median

With the examples in this section, we replaced the missing values in all numeric columns. However, if you want to define for which column(s) you want to replace the missing values, you need to add the var statement. All information about the STDIZE procedure can be found here.

Other than the mean and median, you can use the STDIZE procedure also you for a range of other replacements. See this page for the complete list of options for the method keyword.

Replace Missing Values with the Min / Max

A less common, but still useful option to replace missing values is the minimum or maximum. In this section, we explain how to do this.

Using PROC MEANS

To replace missing values in SAS with the minimum or maximum, we need to calculate these statistics first. The MEANS procedure calculates, amongst others, the minimum and maximum. First, we define the input data set with the data option. Then we specify how to group the data to calculate the minimum or maximum (if necessary). Finally, we define to output data set in which statistic needs to be calculated. Adding the optional autoname keyword, SAS adds the word ‘Min’ as a suffix to the output variable(s).

Now that we have calculated the minimum, we can use a merge statement (join) in combination with a simple if-then statement to replace the missing values. It might be necessary to order the input data sets before you can merge them.

/* REPLACE WITH MIN (OF GROUP) */
proc means data=work.ds_missing_values;
 	by group; 
	output out=work.ds_min min= /autoname;	
run;
 
data work.ds_no_missing_values;
	merge work.ds_missing_values (in=a) 
		work.ds_min (in=b);
 	by group;
 
	if missing(var1) then var1 = var1_Min;
	if missing(var2) then var2 = var2_Min;
	if missing(var3) then var3 = var3_Min;
 
	keep group var1 var2 var3;
run;
SAS Missing Values Replaced With The Group's Min
Missing Values Replaced With The Group’s Minimum

To replace the missing values with the minimum value of the complete column instead of the minimum of the group, simply remove the by statement in the MEANS procedure. However, you can’t use the merge statement as in the previous example. Because we removed the by statement from the MEANS procedure, need to use a join in the SQL procedure.

/* REPLACE WITH MIN (OVERALL) */
proc means data=work.ds_missing_values;
	output out=work.ds_min min= /autoname;	
run;
 
proc sql;
	create table work.ds_no_missing_values as
		select group,
			coalesce(var1, var1_Min) as var1,
			coalesce(var2, var2_Min) as var2,
			coalesce(var3, var3_Min) as var3
		from work.ds_missing_values, work.ds_min;
quit;
SAS Missing Values Replaced With The Overall Min
Missing Values Replaced With The Overall Minimum

You can easily replace missing values with the maximum of a group using the code below.

/* REPLACE WITH MAX (OF GROUP) */
proc means data=work.ds_missing_values;
 	by group; 
	output out=work.ds_max max= /autoname;	
run;
 
data work.ds_no_missing_values;
	merge work.ds_missing_values (in=a) 
		work.ds_max (in=b);
 	by group; 
 
	if missing(var1) then var1 = var1_Max;
	if missing(var2) then var2 = var2_Max;
	if missing(var3) then var3 = var3_Max;
 
	keep group var1 var2 var3;
run;
SAS Missing Values Replaced With The Group's Max
Missing Values Replaced With The Group’s Maximum

Do you know? 5 Ways to Find the Minimum Value of a Column

Finally, to replace missing values with the column maximum instead of the group maximum, you can use the example code below.

/* REPLACE WITH MAX (OVERALL) */
proc means data=work.ds_missing_values;
	output out=work.ds_max max= /autoname;	
run;
 
proc sql;
	create table work.ds_no_missing_values as
		select group,
			coalesce(var1, var1_Max) as var1,
			coalesce(var2, var2_Max) as var2,
			coalesce(var3, var3_Max) as var3
		from work.ds_missing_values, work.ds_max;
quit;
SAS Missing Values Replaced With The Overall Max
Missing Values Replaced With The Overall Maximum

Replace Missing Values with the Previous Non-Missing Value

Finally, we discuss how to replace missing values in SAS with the previous non-missing value. To do so we use the UPDATE statement.

Using UPDATE

The UPDATE statement isn’t frequently used. However, to replace missing values with the last non-missing value it is very useful. The code snippet below shows the correct syntax. We will replace the missing values with the previous non-missing value from a specific group.

/* REPLACE WITH THE LAST NON-MISSING VALUE (OF GROUP) */
data work.ds_no_missing_values;
   update work.ds_missing_values (obs=0) work.ds_missing_values;
   by group;
   output;
run;

The image below shows the result of the SAS code above. As you can see, the code correctly replaced the missing values with the last non-missing value by the variable group. A cell remains blank if all previous values of that specific group are missing. For example, in group B and variable one.

SAS Missing Values Replaced With Last Non-Missing Value
Missing Values Replaced With Last Non-Missing Value