In this article, we discuss how to select observations from a dataset based on its position. One of the most Googled questions regarding this topic is how to select the first N rows in SAS. We will answer this question as well as how to select a range of observations, select the nth observation, and select the last observation.
Throughout this article, we will use the dataset below. It contains the first 10 days of 2020 in ascending order. We have created this dataset with a DO-loop. The weekday-column is generated with a put statement and the dowName format.
data work.my_ds; do i="01JAN2020"d to "10JAN2020"d; date = i; weekday = compress(put(i, dowName.)); output; end; drop i; format date date9.; run;
Select the First N Rows with PROC SQL
There are multiple ways to select the first N rows in SAS. You can use the PROC SQL procedure and SAS code. First, we discuss two methods using the PROC SQL procedure.
Within the PROC SQL statement, you can provide some options that will be used during the execution of the code. One of these options is the OUTOBS=-option. With this option, you can specify the number of observations that will be written to the output set.
In the example below, we select all observations from the original dataset work.my_ds with the select * from code. However, with the OUTOBS=5 option, we limit the number of observations that will be written to the output data set. We only select the first 5 observations for the new dataset work.first_5_obs_sql.
proc sql outobs=5; create table work.first_5_obs_sql as select * from work.my_ds; quit;
Like the OUTOBS=-options limits the number of observations in the output dataset, you can use the INOBS=-option to specify the number of observations that will be read from the input dataset(s).
In the example below, we use the INOBS=5 option to select the first 5 rows from the input dataset work.my_ds. All these 5 observations will be written to the output dataset work.first_5_obs_sql.
proc sql inobs=5; create table work.first_5_obs_sql as select * from work.my_ds; quit;
The two methods above select both the first 5 observations. However, the two options don’t always produce the same result. For example, if you join tables, the result can be different whether you use the OUTOBS=-option or INOBS=-option.
Select the First N Rows with SAS Code
Instead of using the PROC SQL procedure, you can also use SAS code to select the first N rows of a dataset. We discuss two methods.
SAS contains several special variables, one of which, _N_. As you might know, SAS processes datasets row by row. The _N_ variable stores the row number of the row that is currently being processed.
So, to select, for example, the first 5 rows of a table you can use the _N_ variable in combination with an IF-statement. In the example below, the IF-statement writes the current row to the output dataset work.first_5_obs_sas as long as this row is one of the first 5 rows of the input dataset.
data work.first_5_obs_sas; set work.my_ds; if _N_ le 5 then output; run;
Another method to select the first N rows from a dataset is using the OBS=-option. With this option, you can specify the last row that SAS processes from the input dataset. So, in the example below, SAS processes all the observations from the work.my_ds dataset until the fifth.
data work.first_5_obs_sas; set work.my_ds (obs=5); run;
This method is more efficient than the previous one. Unlike the previous method, SAS stops processing the data once the N-th observation from the input data set is read. The previous method, with the _N_ variable, processes all rows from the input data set and checks for each row if the IF-statement holds.
Obviously, the four methods above can be modified to select the first 10 rows, to select the first 100 rows, etc.
Select a Range of Observations
In the previous sections, we discussed different methods to select the first N rows from a data set. Here we demonstrate how to select a range of observations. For example, all observations between rows 6 and 9.
As discussed above, you can use the OBS=-option to specify the last observation that SAS processes from a data set. In contrast, you can use the FIRSTOBS=-option to specify the first observation that SAS processes. If you combine the FIRSTOBS= and OBS=-options, you are able to select a range of observations. For example, below we select rows 6 to 9.
data work.select_range_obs; set work.my_ds (firstobs=6 obs=9); run;
Select the N-th Observation
In SAS, you can also select a specific observation from a dataset. Selecting the N-th observation is a special case of selecting a range of observations. More specifically, if you want to select the N-th row the first observation and last observation of the range are the same. To select a range of length 1 in SAS, the FIRSTOBS=-option and the OBS=-option contain the same value.
In the example below, we select the third observation of the input dataset using FIRSTOBS=3 and OBS=3.
data work.select_nth_obs; set work.my_ds (firstobs=3 obs=3); run;
Another method to select a specific observation from a dataset is by using the _N_ variable. For example, if you want to select the 5 rows, you can use the IF-statement: if _N_= 5 then output.
Select the Last Row in SAS
Selecting the last row of a SAS dataset is a special case of selecting the N-th observation of a dataset.
If you want to use the method discussed above to select the last row, you need to know the number of rows in the data set first. Then, once you know the number of rows, you can set the FIRSTOBS=-option and OBS=-option equal to this number and obtain the last row.
Fortunately, there is an easier way to select the last observation of a data set. With the END=-option, you can create a boolean variable that is TRUE if SAS is processing the last row of the input dataset and FALSE otherwise. So, if you combine the END=-option with an IF-statement, it’s easy to select the last observation of a dataset.
In the example below, we used END=last_obs to create the variable last_obs. This variable is TRUE if SAS has reached the last observation of work.my_ds. Finally, with the IF-statement we write the last observation to the new work.last_obs dataset.
data work.last_obs; set work.my_ds end=last_obs; if last_obs then output; run;
Select the First and Last Observation
If you want to select the first and last observation from a table, you can use a combination of the methods discussed above.
We use _N_=1 to identify the first observation and END=last_obs to find the last observation. Combining these two boolean operations in an IF-statement we can select the first and last observation of a data set.
data work.first_last_obs_sas; set work.my_ds end=last_obs; if _N_ = 1 or last_obs then output; run;
Select the Last N Rows
Selecting the last N observations from a dataset is a bit more difficult than selecting the first N observations. Nevertheless, it is possible. Here we show how.
First, you need to know how many observations your dataset has and store this number in a macro variable. Then, you use this macro variable in combination with the _N_ variable and an IF-statement.
In this example, select the last two observations of the work.my_ds-dataset. First, we create a macro variable total_obs that contains the number of observations. So, in this example, the value of total_obs is 10. Then, we use the IF-statement to select all rows with row numbers between total_obs-1 (i.e., 9) and total_obs (i.e., 10). Hence, we have selected the last 2 rows.
proc sql; select count(*) into :total_obs from work.my_ds; quit; data work.last_n_obs; set work.my_ds; if &total_obs.-1 le _N_ le &total_obs. then output; run;
Select N Observations Randomly
To finish this article, we show how to select N observations randomly from an input dataset.
You can use the PROC SURVEYSELECT procedure to select observations randomly. With the DATA=-option and OUT=-option, you can specify the input and output dataset, respectively. You can specify the number of randomly selected observations with the SAMPSIZE=-option (in this case 4). With the METHOD=-option, you define the sampling method. With the SRS method (simple random sampling), you sample observations with equal probability and without replacement. Finally, with the SEED=-option you can specify the seed of the sample to make the sampling repeatable.
proc surveyselect data=work.my_ds out=work.random_obs method=srs sampsize=4 seed=123456789; run;
For more information about sampling in SAS, see this page.