In this article, we discuss how to rank data in SAS with PROC RANK.
A data set is ranked when each observation is compared to and given a “position” relative to all other observations in the data set. For example, if you have 3 observations in a data set with the values 10, 5, and 20, then their ranks are 2, 1, and 3, respectively.
Normally, ranks are calculated based on ascending order (example above). However, you can also compute ranks in descending order. Additionally, it is also possible to determine the ranks of values within a group.
Ranking data is a typical operation in data analysis. For example, you can use ranked data if you want to select the top X observations with the highest value.
The best way to rank data in SAS is with the PROC RANK procedure. This procedure computes the ranks of one or more numeric observations and stores the ranks in a new column.
The PROC RANK procedure can compute the ranks in both ascending and descending order, as well as calculate the ranks of values within a group.
Lastly, you can use the PROC RANK procedure to assign values to a predefined number of groups. For example, if you use 4 or 10 groups, you effectively rank your data into quartiles or deciles, respectively.
In this article, we demonstrate all the features of the PROC RANK procedure to rank a SAS data set. We’ll example that you can use directly in your own projects.
Contents
How to Rank the Values of One Variable in SAS
Ranking the values of a single variable in a SAS data set is a standard operation.
These are the 6 steps to rank data with the PROC RANK procedure:
- Start the RANK procedure with the PROC RANK statement.
- Specify the name of the input data set (i.e., the data that needs to be ranked) with the data=-option.
- Define the output data set with the out=-option.
- State the name of the variable whose values need to be ranked with the VAR statement.
- Specify the name of the new variable that will contain the ranks with the RANK statement.
- Execute the RANK procedure with the RUN statement.
The following SAS code shows how to rank the values of the variable TOTAL_POINTS from the MY_DATA dataset. The PROC RANK procedure creates a new data set (i.e., MY_DATA_RANKED) with a new column TOTAL_POINTS_RANKED that contains the ranks.
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK; RUN;
The image above shows the ranks of the values in the column TOTAL_POINTS. The lowest value (4) gets the lowest rank (1), whereas the highest value (15) gets the highest rank (6).
Instead of assigning the lowest rank to the lowest value (i.e., ascending order), you can also assign the lowest rank to the highest value (i.e., descending order). To rank data in SAS in descending order, you need to add the DESCENDING keyword to the PROC RANK statement.
For example:
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED DESCENDING; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK_DESC; RUN;
As you can see, by adding the DESCENDING keyword to the PROC RANK statement, the highest value (15) has the lowest rank, whereas the lowest value (4) has the highest rank.
How to Rank Multiple Variables in SAS
The PROC RANK procedure can also rank the values of multiple variables in a single operation.
This is what you need to do:
- Specify all the variables you want to rank in the VAR statement, and
- Define the names of the columns containing the ranks in the RANKS statement.
For example, with the SAS code below, we compute the ranks of the columns TOTAL_POINTS and TOTAL_TIME, and store the ranks in two new columns (TOTAL_POINTS_RANKED and TOTAL_TIME_RANKED).
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED; VAR TOTAL_POINTS TOTAL_TIME; RANKS TOTAL_POINTS_RANK TOTAL_TIME_RANK; RUN;
Important: If you rank multiple columns in a single operation, then all values are either ranked in ascending order, or in descending order. To create one SAS dataset with both ascending and descending ranks, you need two consecutive PROC RANK statements.
How to Rank Data by One Group in SAS
Another common task is to compute the ranks of values within a group.
For example, you have two groups of 3 observations. Group A has the values 5, 10, and 20, and Group B has the values 50, 100, and 200. If you don’t consider the groups, the ranks would be 1, 2, 3, 4, 5, and 6. However, taking into account the groups, the ranks are 1, 2, 3, 1, 2, and 3.
You can rank data in SAS by groups using the BY statement. This BY statement is part of the PROC RANK procedure and specifies the column(s) that define(s) the groups.
In the example below, the data is first divided into groups based on the value in the column SPORT. Next, the values of the column TOTAL_POINTS are ranked based on their relative positing within their group.
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED; BY SPORT; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK; RUN;
How to Rank Data by Multiple Groups in SAS
Instead of ranking data within a group that is based on a single variable, you can also use PROC RANK to rank data by multiple groups.
For dividing a data set into groups that are based on multiple variables (and subsequently rank them), you can use the BY statement. The BY statement starts with the BY keyword, followed by the variables that define the groups (separated by a blank, not a comma!).
In the example below, we create groups based on the values in the columns SPORT and CATEGORY with the BY statement. Next, we see the VAR and RANKS statements to rank the values within each group.
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED; BY SPORT CATEGORY; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK; RUN;
How to Rank Data in Quartiles / Deciles in SAS
Besides computing the rank of each value, you can also use PROC RANK to separate data into quartiles and deciles.
By using the groups=-option, the PROC RANK statement assigns each value to a group instead of a rank. For example, if you use groups=2, SAS assesses each value and assigns the 50% lowest values to the first group and the other 50% of the values to the second group.
Therefore, you can rank your SAS data set based on quartiles using the group=4 options from the PROC RANK statement.
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED GROUPS=4; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_QUARTILES; RUN;
As the image above shows, SAS has created four groups; numbers 0 to 3. Hence, all observations assigned to group 0 belong to the first quartile of (lowest) values. Similarly, values that are part of group 3 are part of the fourth quartile.
Likewise, you can create deciles with PROC RANK using the group=10 option. For example:
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED GROUPS=10; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_DECILES; RUN;
How to Rank Data with Ties in SAS
In the example so far, all the values (and hence ranks) were distinct. However, in real life, it is very likely that some values appear multiple times and therefore have the same rank, i.e., ties.
The PROC RANK procedure has 3 options to deal with ties:
- Assign all tied values the mean of their ties.
- Assign all tied values the highest of their ties.
- Assign all tied values the lowest of their ties.
Although, it depends on your situation which alternative suits you best, you always need the ties=-option in the PROC RANK statement.
For example, in the following data set, we have the values 1, 2, 2, 2, and 3. It’s clear that the values 1 and 3 have the lowest (1) and highest (5) ranks, respectively. The 3 observations with value 2 contain the ranks 2, 3, and 4.
It depends on the ties=-option which rank SAS assigns to these values.
If you use ties=mean, PROC RANK will assign rank 3 to the tied values (i.e., the mean of 2, 3, and 4). Alternatively, if you use ties=high, SAS will use the highest rank, i.e., 4. Likewise, if you use ties=low, the rank of the tied values will be 2.
See the example below.
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED TIES=MEAN; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK_TIES_MEAN; RUN;
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED TIES=HIGH; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK_TIES_HIGH; RUN;
PROC RANK DATA=WORK.MY_DATA OUT=WORK.MY_DATA_RANKED TIES=LOW; VAR TOTAL_POINTS; RANKS TOTAL_POINTS_RANK_TIES_LOW; RUN;