Uncategorized

Fastest Way to Sort a SAS Dataset

There are several ways to order a dataset in SAS. Two of the most popular are PROC SORT and PROC SQL. But, do you know which one to use to boost the performance of your SAS code? We neither. So, we did some experiments.

So, which method to sort your SAS dataset is the fastest? For a small dataset, there isn’t a big difference in performance between PROC SORT and PROC SQL. However, for larger datasets, PROC SQL is faster than PROC SORT. Also, it seems that sorting character data is easier (read: faster), than numeric data.

In this article, we present the results of some simulations we carried out. We compared PROC SORT and PROC SQL in different situations. For example, for a different number of observations, numeric data versus character data, a different number of columns to order, etc.

How to Measure the Fastest Way to Sort a Dataset – Real Time vs. CPU Time

Before we start the comparison of PROC SORT and PROC SQL, we first define how to measure the performance of these methods.

When you run a SAS program, SAS writes some useful statistics to the SAS Log. Among these statistics are the Real Time and the CPU Time. But, what do they measure and which should we use in our comparison?

SAS Log: Real Time vs. CPU Time
SAS Log: Real Time vs. CPU Time

Real Time

The real time is the actual, real-world time that it takes to execute your code. In other words, if you had a stopwatch and measured the time it takes to execute your code, then the time on your stopwatch and the real time in the SAS log would be the same.

User CPU Time

The CPU time is the amount of time your code utilizes the CPU resources. If you execute your code on a machine, e.g. laptop, with one CPU, then the amount of CPU time is always less than the real time. However, if you have a machine with multiple CPUs and the load is spread among them, then the CPU time can be longer than the real time. For example, the total real time to execute your code was 3 seconds, but your machine has 4 CPUs that were utilized all during 1 second, then the CPU time is 4 seconds.

The configuration of your machine and the type of code you want to execute (for example data processing or statistical tests) heavily influence the real time and the cpu time. So, the results of our simulations may differ on your machine.

By default, SAS prints the real time and CPU time to the log. If you want more statistics about your last executed code, you can activate the fullstimer option. Here is an excellent post about this SAS option.

The Fastest Way to Order Numeric Data

In our first experiment, we compare the performance of PROC SORT and PROC SQL on sorting one column of numeric data.

We carried out this experiment in three different settings, namely with 100.000 observations, 1.000.000 observations, and 10.000.000 observations. The real time and the CPU time in the tables below are the averages of 5 repetitions.

MethodReal Time (sec.)CPU Time (sec.)
PROC SORT0.090.09
PROC SQL0.100.12
Rows: 100.000. Data: Numeric
MethodReal Time (sec.)CPU Time (sec.)
PROC SORT0.651.09
PROC SQL0.661.21
Rows: 1.000.000. Data: Numeric
MethodReal Time (sec.)CPU Time (sec.)
PROC SORT9.3413.35
PROC SQL8.7013.56
Rows: 10.000.000. Data: Numeric

As you can observe, there is no big difference in performance for datasets with 1.000.000 or fewer observations. However, it seems that PROC SQL sorts the data a bit faster for a dataset with 10.000.000 observations.

The Fastest Way to Order Character Data

Fastest Way to Sort a SAS Dataset with a Different Number of Observations

Next, we check which method is faster to order character data. We create a 3 dataset with one column of one character, but with a different number of observations. Again, the real time and the CPU time is the average of 5 runs.

MethodReal Time (sec.)CPU Time (sec.)
PROC SORT0.070.08
PROC SQL0.070.09
Rows: 100.000. Data: Character
MethodReal Time (sec.)CPU Time (sec.)
PROC SORT0.590.88
PROC SQL0.530.86
Rows: 1.000.000. Data: Character
MethodReal Time (sec.)CPU Time (sec.)
PROC SORT4.517.96
PROC SQL4.988.58
Rows: 10.000.000. Data: Character
MethodReal Time (sec.)CPU Time (sec.)
PROC SORT9.5014.68
PROC SQL7.6613.17
Rows: 15.000.000. Data: Character

In this experiment, it seems that PROC SORT and PROC SQL have similar performance for datasets up to 10 million rows. However, PROC SQL is the fastest method to order character data of 15 million observations.

What is interesting, though, is the difference in performance in sorting numeric data and character data. Both PROC SORT and PROC SQL order character datasets faster than numeric datasets (with the same number of observations).

Fastest Way to Sort a SAS Dataset with Strings of Different Lengths

In the next experiment, we tested the influence of the length of the character string on the performance. Again, we ordered one character column of 15 million rows, but the number of characters was different in each experiment. We ordered columns with 1 to 5 characters.

Method1 Char2 Chars3 Chars4 Chars5 Chars
PROC SORT9.5011.4214.1215.3716.77
PROC SQL7.669.1011.1413.0115.58
Rows: 15.000.000. Measure: Real Time (sec.)
Method1 Char2 Chars3 Chars4 Chars5 Chars
PROC SORT14.6817.2321.1122.3723.88
PROC SQL13.1716.9020.0722.3924.76
Rows: 15.000.000. Measure: CPU Time (sec.)

First of all, the PROC SQL is the fastest way to sort a character dataset in SAS. Irrespectively of the number of characters in a string, PROC SQL outperformed PROC SORT.

Also, both for PROC SORT and PROC SQL, ordering longer strings doesn’t mean a linear increase in the amount of processing time. In other words, PROC SQL needed 7.66 seconds to order a column with 1 character and 15.58 seconds to order a column with 5 characters.

Conclusions

After carrying out the experiments above, we can draw the following conclusions.

  • For small datasets, PROC SORT and PROC SQL have similar performance both for numeric data and character data.
  • For large datasets, PROC SQL is the fastest way to sort a SAS dataset.
  • Both PROC SORT and PROC SQL order character datasets faster than numeric datasets of similar size.
  • The number of characters in a string doesn’t impact the performance of both methods linearly.

Discussion

Although we have drawn some conclusions from our experiments, we have to make the following comments:

  • The experiments were carried out on a normal laptop with regular specifications. It might be that if you run the experiments in other (optimized) environments, your conclusions differ.
  • We used only PROC SQL and PROC SORT. However, there exist other techniques to order datasets that might be faster, such as Hash Tables.
  • Finally, we used the default settings of the PROC SORT and PROC SQL. However, the PROC SORT method provides some interesting options to increase performance. See this excellent post for more information.