As a SAS Programmer or Data Analyst, you normally start your project with one or more big datasets. However, a large number of observations might be unnecessary for your analysis and even might slow down your work.
So, how to filter data in SAS? In SAS, you can filter data with the IF Statement (or IF-ELSE THEN Statement) or the WHERE Statement based on a specific condition. In this article, we demonstrate how to use both statements, discuss their differences, and show how to create multiple, complex conditions.
How to Filter Data in SAS with the IF Statement?
What is an IF Statement?
The IF Statement is a very powerful statement to filter data. In this section, we discuss the basics.
The IF statement in SAS evaluates some logic, for example, “value of X = 10“, and based on the outcome of this evaluation acts. In this article, the action is to filter the data. With the special keyword output, you let SAS know that you want to filter the observations that meet the condition. For example:
DATA WORK.MY_FILTERED_DATA; SET WORK.MY_DATA; IF X = 10 THEN OUTPUT; RUN;
In the example above, we had one input dataset (work.my_data) and one output dataset (work.my_filtered_data). All rows that meet the condition X = 10 are filtered and saved in the output dataset. However, it’s also possible to subset the input dataset into two output datasets. For this purpose, we need the ELSE statement. For example:
DATA WORK.X_EQUAL_TO_10 WORK.X_NOT_EQUAL_TO_10; SET WORK.MY_DATA; IF X = 10 THEN OUTPUT WORK.X_EQUAL_TO_10; ELSE OUTPUT WORK.X_NOT_EQUAL_TO_10; RUN;
In contrast to the first example, when you subset your data into two datasets, it is necessary to specify the name of the output dataset after the output keyword. Otherwise, SAS doesn’t know to which output dataset the observation should be copied.
In the two examples above, we have filtered the input data based on one condition, namely X = 10. However, it is also possible to subset your data based on multiple IF statements with the ELSE IF statement. In the example below, we filter the input data based on the value of the column Group.
DATA WORK.GROUP_A WORK.GROUP_B WORK.GROUP_C WORK.GROUP_OTHER; SET WORK.MY_DATA; IF GROUP = "A" THEN OUTPUT WORK.GROUP_A; ELSE IF GROUP = "B" THEN OUTPUT WORK.GROUP_B; ELSE IF GROUP = "C" THEN OUTPUT WORK.GROUP_C; ELSE OUTPUT WORK.GROUP_OTHER; RUN;
Check this article if you want to filter the first observations per group.
What happens if multiple IF Statements are true?
What happens if you have an IF statement and more than 1 of these conditions are true? For example, you have the following conditions and the value of X in your dataset is 100.
DATA WORK.FILTER_1 WORK.FILTER_2 WORK.FILTER_OTHER; SET WORK.MY_DATA; IF X > 10 THEN OUTPUT WORK.FILTER_1; ELSE IF X > 50 THEN OUTPUT WORK.FILTER_2; ELSE OUTPUT WORK.FILTER_OTHER; RUN;
In this example, both X > 10 and X > 50 are true. However, because we have used the IF-ELSE THEN construction to filter the data, each row can be filtered to only one output dataset.
By default, SAS executes only the first condition that is true. Once SAS has encountered a true condition, it stops checking the other conditions. Hence, the hierarchy of your conditions is of crucial importance.
How to Filter Data in SAS with the WHERE Statement?
What is a WHERE Statement?
A WHERE statement is a statement that you can use to filter data in SAS. Like the IF statement, the WHERE statement filters all observations that meet a specific condition. In the example below, we subset our data based on the value of X. We keep only the observations where X = 10.
DATA WORK.MY_FILTERED_DATA; SET WORK.MY_DATA; WHERE X = 10; RUN;
Where to Place the WHERE Statement?
You can place the WHERE statement in three different parts of the Data Step to filter your data, namely:
- Below the SET statement (previous section)
- In the SET statement
- In the DATA statement
When you place the WHERE statement in the SET statement, SAS filters the data before processing. Hence, this will improve efficiency.
DATA WORK.MY_FILTERED_DATA; SET WORK.MY_DATA (WHERE=(X = 10)); RUN;
You can also place the WHERE statement in the DATA statement. However, this won’t improve efficiency since SAS processes all observations.
DATA WORK.MY_FILTERED_DATA (WHERE=(X = 10)); SET WORK.MY_DATA; RUN;
So, in general, we recommend that you place the WHERE statement in the SET statement.
Note that, when you place the WHERE statement in the DATA or SET statement, you should place it between parenthesis.
What are the Differences Between the IF and WHERE Statements?
Although, you can use both the IF statement and the WHERE statement to filter data in SAS, there are some differences.
- The WHERE statement can be placed in the SET statement. Because of this possibility, the WHERE statement provides you with a technique to improve the efficiency of your code. The IF statement processes always all observations of the input dataset and, thus, can be slower than the WHERE statement.
- The WHERE statement can only filter based on existing variables. If you create a new variable in your Data Step, you can’t use the WHERE statement to filter based on this new variable. SAS will show an error like “ERROR: Variable X is not on file Y.”. Instead, the IF statement is capable of filtering data based on new variables.
- The IF statement can be used to subset your data into two or more new datasets. As we have shown, you can use the IF-ELSE THEN structure to subset the input dataset into two or more new datasets. However, with the WHERE statement, you can’t define to which output dataset the filtered records are copied.
So, to summarize. The IF statement provides more flexibility when you filter data in SAS, while the WHERE statement can provide efficiency gains.
How to Create Multiple Conditions to Filter data in SAS?
In the previous sections, we have used simple, single conditions in our IF and WHERE statements, like X = 10 and Group = “A”. In this section, we discuss how to use SAS operators to filter your data based on multiple, complex conditions
Instead of checking if X equals 10, you could also check if X is bigger than 10, or smaller than 10, or not equal to 10, etc. All these checks are called comparisons, and SAS provides a long list of comparison operators to carry out different checks.
|Equal to||=||EQ||X = 10|
|Not equal to||^=||NE||X ^= 10|
|~=||NE||X ~= 10|
|¬=||NE||X ¬= 10|
|Greater than||>||GT||X > 10|
|Greater than or equal to||>=||GE||X >= 10|
|Less than||<||LT||X < 10|
|Less than or equal to||<=||LE||X <= 10|
|Equal to one of a list||IN||X IN (5, 10)|
You can also combine comparison operators with logical operators (or Boolean operators) to create multiple conditions.
|This and that||&||AND||X = 10 AND Y > 5|
|This or that|||||OR||X = 10 OR Y > 5|
|Not this||¬||NOT||NOT (X > Y)|
The comparison and logical operators mentioned above allow you to create immensely powerful filter statements and write very efficient code. For example:
DATA WORK.FILTER_1 WORK.FILTER_2 WORK.FILTER_3 WORK.FILTER_OTHER; SET WORK.MY_DATA; IF GROUP = "A" AND 0 <= X <= 10 THEN OUTPUT WORK.FILTER_1; ELSE IF GROUP = "A" AND X > 10 THEN OUTPUT WORK.FILTER_2; ELSE IF GROUP IN ("B", "C") AND (X = 1 OR B = 1) THEN OUTPUT WORK.FILTER_3; ELSE OUTPUT WORK.FILTER_OTHER; RUN;
Note that the comparison operators we have discussed are mainly focused on numerical comparisons. However, you can also create statements with textual conditions. One of the most used textual conditions is to check if a string contains another string. For example, with the code below, we check if someone’s name contains the substring “ab“.
DATA WORK.MY_FILTERED_DATA; SET WORK.MY_DATA; IF find(name, "ab") > 0 THEN OUTPUT; RUN;
The WHERE statement also supports multiple conditions.