Would you like to know if two datasets are identical without having to merge them? Or, are you looking for a method to compare two variables from the same datasets? Then, PROC COMPARE is the right procedure for you.
You can use the PROC COMPARE procedure in SAS to check whether two datasets are the same. This procedure compares two datasets; the base dataset and the comparison dataset. SAS compares the two datasets by matching observations. You can use this procedure also to compare two variables from the same dataset.
In this article, we discuss how to use PROC COMPARE and show its various applications.
How to Compare Two Datasets?
You use the PROC COMPARE procedure to compare two SAS datasets. The syntax of this procedure is very simple and easy to remember.
Here’s how to check if two datasets in SAS are the same:
- Start the comparison procedure with the PROC COMPARE statement.
- Use the BASE=-option to specify the name of the first dataset.
- Use the COMPARE=-option to specify the name of the second dataset.
- Finish and execute the procedure with the RUN statement.
This is how the steps above look like in SAS code:
proc compare base=base-dataset compare=comparison-dataset; run;
Below, we give an example of how to use PROC COMPARE and explain the information it generates. For this purpose, we create two datasets work.my_first_dataset and work.my_second_dataset.
data work.my_first_dataset (label="My 1st Dataset"); infile datalines dlm=","; length FirstName LastName $20; label FirstName = "First Name" LastName = "Last Name"; input FirstName $ LastName $ Age; datalines; John, Williams, 32 Maria, Johnes, 44 Ben, Dickens, 50 Jess, Green, 23 ; run; proc print data=work.my_first_dataset; run; proc contents data=work.my_first_dataset; run; data work.my_second_dataset (label="My 2nd Dataset"); infile datalines dlm=","; length FirstName LastName $25; format Salary dollar8.0; label FirstName = "Name" LastName = "Family Name"; input FirstName $ LastName $ Salary; datalines; John, Williams, 80000 Maria, Johnes, 76000 Ben, Dickenss, 65500 Robin, Cooper, 50000 Jeff, Ross, 72000 ; run; proc print data=work.my_second_dataset; run; proc contents data=work.my_second_dataset; run;
As you can see, both datasets aren’t equal. They differ in:
- The number of observations. The first dataset has 4 observations, whereas the second has 5. (blue)
- The variables. Both datasets contain three variables, but the last variable is different (Age vs. Salary) (orange).
- The attributes of the variables. The label and length of the variables FirstName and LastName are different. (yellow).
- The value of the variables. The last name of the third observation has been written differently (Dickens vs. Dickenss) (green).
Now, we run the PROC COMPARE procedure to compare both datasets. As base dataset we use work.my_first_dataset. The comparison dataset is work.my_second_dataset.
proc compare base=work.my_first_dataset compare=work.my_second_dataset; run;
PROC COMPARE creates a report with six sections:
- Data Set Summary
- Variables Summary
- Listing of Common Variables with Differing Attributes
- Observation Summary
- Values Comparison Summary
- Value Comparison Results for Variables
We will discuss each section in further detail.
Remember, PROC COMPARE compares the values of matching observations. Because we didn’t specify how to match observations, SAS matches observations by rows. In other words, it compares the values in the first row of the base dataset with the values in the first row of the comparison dataset. Then, it compares the values of the second row in both datasets, etc. (Later we will discuss how to specify matching observations)
Data Set Summary
The Data Set Summary shows you which two datasets were taken into account and compares their meta-data. It summarizes the moment when the datasets were created and modified, the number of variables, the number of observations, and the dataset labels.
As you can see in the image above, PROC COMPARE compared the datasets work.my_first_dataset and work.my_second_dataset. It shows you that the number of variables (NVar) is the same, but that the number of observations (NObs) and the labels are different.
Do you know? How to Count the Number of Observations in a Dataset (per Group)
The Variables Summary provides information about the differences in the variables of the compared datasets. It compares the variables and their attributes (i.e., formats, lengths, and labels). It doesn’t compare the values of variables.
The image above shows the Variables Summary of the comparison between work.my_first_dataset and work.my_second_dataset. It summarizes that both datasets have two variables in common (namely, FirstName and LastName), as well as one unique variable (Age and Salary).
The summary also shows that there are two common variables with different attributes. In this case, the attributes that differ are the Length and Labels of the variables FirstName and LastName.
Listing of Common Variables with Differing Attributes
The Listing of Common Variables with Differing Attributes contains the differences in Length, Label, and Format of variables that are present in both datasets. Common variables that don’t differ in these three criteria aren’t shown in this summary.
The image above shows that the variables FirstName and LastName have different attributes. In the work.my_first_dataset, the variables have a length of 20, while the same variables in the other dataset have a length of 25. Also, the labels are different.
The Observation Summary of PROC COMPARE summarizes the number of observations in each dataset. It also shows the number of observations that have some equal and unequal values in the common variables.
This is the first part of the PROC COMPARE report that shows the result of compared values. Remember that PROC COMPARE compares the two datasets row by row.
The image above shows that the work.my_first_dataset has four observations, whereas work.my_second_dataset has five. Because the number of observations in both datasets differs, SAS only compares the values of the common variables in the first four rows.
There are two rows that are identical taking into account only the common variables. These are the first two rows of our datasets. The other two rows have one or more variables with unequal values.
Values Comparison Summary
The Value Comparison Summary summarizes the comparison of values for all observations. In other words, it compares the values of all observations for each variable in both datasets. It tells you both the number of variables with all values equal, as well as the number of variables with one or more differences.
The summary also specifies the number of differences were found per variable (Ndif). If the variable is numeric, then this summary the biggest difference between a pair of compared values (MaxDif).
This summary also reports the total number of compared values with differences.
The picture above shows that in the FirstName variable one pair of values was unequal, where as in the LastName variable two differences were found.
Value Comparison Results for Variables
The Value Comparison Results for Variables section of the PROC COMPARE report contains detailed information about the differences presented in the Value Comparison Summary. This summary shows per variable the actual values of matching observations with inequalities.
If there is an inequality in a numeric variable, then this summary calculates and shows the difference between the two values. Because we only have common character variables this isn’t shown here.
The image above shows that the FirstName variable has unequal values in row four, namely Jess vs. Robin. Also, the variable LastName has differences (rows 3 and 4).
Note that, by default, SAS prints per variable only the first 50 observations with different values to the report. You can change this number with the MAXPRINT option.
How to Compare Two Datasets by ID?
By default, PROC COMPARE compares the values in the base dataset and the comparison dataset based on the order they appear. That is to say, SAS compares the first row of the base dataset with the first row of the comparison dataset. Then, the second row of the base dataset with the second row of the comparison datasets, and so on.
On many occasions, the datasets you want to compare might have a common variable that acts as an ID, and you want to compare observations with a similar ID. For example, a customer ID, a product number, a date, etc. However, if your datasets aren’t ordered, or if not all IDs are present in both datasets, then a comparison row-by-row will generate incorrect results. So, how can you compare datasets based on a common ID?
With the PROC COMPARE procedure and the ID statement, SAS compares observations from two datasets based on one or more common variables. These common variables act as an ID for the observations. This way, it is not necessary to merge the two datasets before you can compare them.
Below we show an example of how to use PROC COMPARE in combination with the ID statement. Remember that the variables that act as the ID variables must have the same name in both datasets.
Do you know? How to Rename a Variable
We have created two tables and we will use the variable my_ID in the ID statement.
As you can see in the images above, there are three observations with an ID that appear in both tables. Moreover, the values in the columns Name and Age are also identical. However, work.dataset_1 starts with ID = 01 and work.dataset_2 with ID = 02.
To demonstrate the difference between PROC COMPARE with and without the ID statement, we compare the two SAS datasets without ID statement first.
proc compare base=work.dataset_1 compare=work.dataset_2; run;
As you can see above, the default procedure to compare two SAS datasets generates a long report with the differences. Now, we use the ID statement to let SAS know to match observations using the my_ID variable.
proc compare base=work.dataset_1 compare=work.dataset_2; id my_ID; run;
The report that PROC COMPARE created using the ID statement is much shorter. It shows the number of observations in common and the number of observations there were in one dataset, but not the other. It also tells us that for common observations no unequal values were found.
In case that there are observations in one dataset, but not in the other, SAS only shows how many of these observations there are. You can use the LISTOBS option to include them in the report.
proc compare base=work.dataset_1 compare=work.dataset_2 listobs; id my_ID; run;
The image above tells us that observation with my_ID = 01 wasn’t found in work.dataset_2. Vice versa, the observations with my_ID = 05 and 06 weren’t found in work.dataset_1.
Unfortunately, it’s not possible to create an output dataset with the IDs that are in one dataset but not the other. If you want a dataset with this information, you need an extra DATA step with a MERGE statement.
Do you know? How to Merge Two Datasets
How to Select Variables to Compare?
By default, PROC COMPARE compares all variables from the base and comparison dataset. If you have datasets with many variables, running this procedure might take a while. So, how do you select only these variables you want to compare?
You can use the VAR statement to select variables you want to compare with the PROC COMPARE procedure. The VAR statement is followed by one or more variable names, separated by blanks. You can use the VAR statement also in combination with the ID statement.
In the example below, we use the VAR statement to only compare the FirstName column.
proc compare base=work.my_first_dataset compare=work.my_second_dataset; var FirstName; run;
Another (less efficient) way to define the variables that SAS includes in the COMPARE procedure is by selecting a subset of the variables with the KEEP option. In this case, you use the KEEP option as a dataset option for the base and comparison datasets.
Do you know? How to Select Variables with the KEEP Option
How to Compare Variables in the Same Dataset?
Like we’ve seen above, one way of comparing data is comparing the same variable from different datasets. However, sometimes you might want to contrast the values of two variables in the same dataset. So, how do you compare different variables from the same dataset?
To compare the values from different variables in the same dataset, you can use the PROC COMPARE procedure. First, you use the BASE=-option to specify your input dataset. Then, with the VAR statement and the WITH statement, you select the two variables you want to compare.
Here we show an example.
The SASHELP library offers many free datasets that you can use for testing purposes. Here we will use the CLASSFIT dataset. This dataset contains the actual weight of 19 subjects (persons) and a predicted weight based on their gender, age, and height.
We’ll use PROC COMPARE to compare the actual weight and the predicted weight.
proc compare base=sashelp.classfit; var weight; with predict; run;
As you can see in the image above, SAS compared the variables Weight and Predict. For none of the observations the actual and predicted weight were the same. However, in some cases the difference is very small. For example, in observation 2 and 10.
How to Use the METHOD and CRITERION Options in PROC COMPARE
In the example above, a couple of differences were very small. By default, SAS shows all differences, but sometimes you are only interested in differences that surpass a certain threshold. So, how can you set a threshold in PROC COMPARE when you check for differences?
You can use the METHOD=-option and the CRITERION=-option to define the inequalities that are reported by the PROC COMPARE procedure. You can set the METHOD=-option to ABSOLUTE or PERCENT. With the CRITERION=-option you specify the threshold for which differences are reported. You can use these options both for comparing the same variables for different datasets and for comparing different variables from the same dataset.
Below we give two examples of how to use the METHOD=-option and CRITERION=-option. First, we use these options to only report inequalities when the absolute difference between the values is bigger than 5. If you want to compare absolute differences, you use METHOD=ABSOLUTE.
proc compare base=sashelp.classfit method=absolute criterion=5; var weight; with predict; run;
As you can see in the image above, PROC COMPARE now only shows the differences that a bigger than 5 (absolute value).
You can use the METHOD and CRITERION option also to filter differences that a bigger than a given percentage. In the example below, we use these options to set the threshold for the differences to 10% (absolute percentage).
proc compare base=sashelp.classfit method=percent criterion=10; var weight; with predict; run;
How to Output Differences Between Datasets with PROC COMPARE?
By default, PROC COMPARE generates only a report that shows the differences between two SAS datasets. However, it doesn’t create an output dataset that you can use for further analysis. So, how do you save the results of PROC COMPARE in an output dataset?
The OUT=-option of the PROC COMPARE procedure creates an output dataset. You can use one or more of the following options to control the output dataset: OUTALL, OUTBASE, OUTCOMP, OUTDIF, or OUTNOEQUAL. These options can be combined with the ID statement and the WITH statement.
Here’s what each of the OUT=-options does:
- OUTALL: Writes an observation to the output dataset for all observations in either the base or comparison dataset.
- OUTBASE; Writes an observation to the output dataset for observations only in the base dataset.
- OUTCOMP: Writes an observation to the output dataset for observations only in the comparison dataset.
- OUTDIF: Writes an observation to the output dataset for observations in both the base and comparison dataset.
- OUTNOEQUAL: Writes an observation to the output dataset if there is at least one inequality in one of the matching variables.
Next, we will show how to use the OUT=-option to create an output dataset with the results of PROC COMPARE. We will compare two SAS datasets based on a common variable (my_ID).
As you can see in the images below, the two datasets have three IDs in common, namely 02, 03, and 04. You can also observe that only the observations with ID 04 are completely equal. The other two observations differ in Age or Name.
We will use the OUTDIF and OUTNOEQUAL options so that the output dataset only contains observations that are in both dataset (i.e., matching observations) and that have one or more differences.
/* OUTPUT OPTIONS */ data work.dataset_1; infile datalines dlm=","; input my_ID $ Name $ Age; datalines; 01, Robin, 18 02, John, 32 03, Maria, 44 04, Ben, 50 ; run; proc print data=work.dataset_1; run; data work.dataset_2; infile datalines dlm=","; input my_ID $ Name $ Age; datalines; 02, John, 40 03, Mariaa, 44 04, Ben, 50 05, Jeff, 24 06, Jess, 23 ; run; proc print data=work.dataset_2; run; proc compare base=work.dataset_1 compare=work.dataset_2 out=work.differences outdif outnoequal; id my_ID; run; proc print data=work.differences noobs; run;
The output dataset of the PROC COMPARE procedure shows that there are only differences for the observations with IDs 02 and 03. In the observation with ID 02, there is a difference of 8 in the Age variable. The Name variable is different for ID 03. (The “E” in the Age column stands for Equal.)
If both datasets are completely equal, then the output dataset is empty.
Do you know? How to Check if a Dataset is Empty
How to Use a Macro Variable to Compare Datasets?
As mentioned before, when you compare two SAS datasets, PROC COMPARE only generates a report. So, to know the outcome of the comparison, you need to read the report. This isn’t ideal if you want to automate a task. Fortunately, SAS stores the outcome of the PROC COMPARE procedure in a macro variable called SYSINFO.
You can use the value of the SYSINFO macro variable to know if, and how, two datasets differ. The SYSINFO macro variable can have one of the following values:
|1||Different Dataset Label|
|2||Different Dataset Type (Dataset vs. View)|
|4||Different Variable Informats|
|8||Different Variable Formats|
|16||Different Variable Lengths|
|32||Different Variable Labels|
|64||Observation(s) in the Base datatset, but not in the Comparison dataset|
|128||Observation(s) in the Comparison dataset, but not in the Base dataset|
|256||The Base dataset has a BY group, but not in the Comparison dataset|
|512||The Comparison dataset has a BY group, but not in the Base dataset|
|1024||The Base dataset has a variable, but not in the Comparison dataset|
|2048||The Comparison dataset has a variable, but not in the Base dataset|
|8192||Unequal variable types|
|16384||Unequal BY variable types|
If the datasets are exactly equal, then the value of SYSINFO is zero.
We will use the SYSINFO macro variable to compare the two SAS datasets above.
data work.dataset_1; infile datalines dlm=","; input my_ID $ Name $ Age; datalines; 01, Robin, 18 02, John, 32 03, Maria, 44 04, Ben, 50 ; run; proc print data=work.dataset_1; run; data work.dataset_2; infile datalines dlm=","; input my_ID $ Name $ Age; datalines; 02, John, 40 03, Mariaa, 44 04, Ben, 50 05, Jeff, 24 06, Jess, 23 ; run; proc print data=work.dataset_2; run; proc compare base=work.dataset_1 compare=work.dataset_2; id my_ID; run; %put &=SYSINFO.;
If you run the code above, the SYSINFO macro variable will have a value of 4288, because:
- There is one observation in the Base dataset that isn’t in the Comparison dataset, namely ID 01. This difference has a value of 64 (see table above).
- There are two observations in the Comparison dataset that aren’t in the Base dataset, namely ID 05 and 06. This difference has a value of 128 (see table above).
- There are differences in values, namely in the Name and Age columns. This difference has a value of 4096 (see table above).
- The sum of 64 + 128 + 4096 = 4288.