In SAS you can join two or more tables in two ways. In this article, we discuss how to join tables with the MERGE statement. We explain the syntax of the MERGE statement and discuss how to create a full, left, right, and inner join.
The MERGE Syntax
MERGE SAS-data-set-1 <(data-set-options)>
SAS-data-set-2 <(data-set-options) >
The MERGE statement contains at least two obligatory arguments, namely the SAS tables you want to join. You can keep adding more arguments to join more tables.
The optional arguments are the data set options. For example, the WHERE statement to subset the input data set or the KEEP statement to select variables of the input data set. In this article, we focus only on the IN statement. With this statement, you can create left, right, and inner joins.
In contrast to a join in SQL, the MERGE statement in SAS doesn’t create a cartesian, many-to-many join. The joins in SAS are one-to-one joins or one-to-many joins. The advantage of theses types of joins is that SAS carries out the joins faster. However, the only condition to create these joins is that the input tables are sorted on the common variable(s), i.e. the variable(s) you want to use to join on.
The Input Data Sets
Throughout this article, we will use 2 data sets of 2 columns with information about cities (population and elevation). Both tables contain the column city. We will use this column to join the 2 tables. The tables will be referred to as Left and Right.
data work.left; infile datalines dlm=','; length city $ 15; input city $ population; format population nlnum12.; datalines; Copenhagen, 602481 Amsterdam, 872680 Paris, 2150271 Berlin, 3769495 Rome, 4342212 Madrid, 6345000 London, 8908081 ; run; data work.right; infile datalines dlm=','; length city $ 15; input city $ elevation; datalines; Amsterdam, -2 Lisbon, 2 Brussels, 13 Rome, 21 Berlin, 34 Paris, 35 Warsaw, 78 Vienna, 151 Madrid, 667 ; run;
As mentioned before, the tables must be sorted on the common variable(s). In this example, we have only one common variable, namely city. Currently, the tables are sorted on population and elevation. So, first, we sort both tables on the city column. We use PROC SORT to order the tables.
proc sort data=work.left out= work.left_srt; by city; run; proc sort data=work.right out=work.right_srt; by city; run;
If we wanted to join tables on more than one the column, then these columns must be specified in the BY statement.
The Full Join
The first join type we discuss is the FULL JOIN. This join returns all matching observations from both tables whether the other table matches or not. This join type is the most straightforward of all joins and doesn’t require the IN keyword. However, we always need to specify the common variable(s) in the BY statement.
data work.full_join; merge work.left_srt work.right_srt; by city; run;
As you can see, some observations contain missing value. For example, the population of Brussels. This is because Brussels isn’t present in the Left table.
The LEFT and RIGHT Join
The second and third join types are the LEFT JOIN and the RIGHT JOIN. The LEFT JOIN returns all observations of the Left table and the matching observations from the Right table. In contrast, the RIGHT JOIN returns all observations of the Right table and the matching observations from the Left table.
To create these types of join we need to use the IN keyword. The IN keyword is used to refer to the input tables. In the examples below we use IN = a to refer to the Left table and IN = b to refer to the Right table. The characters a and b are arbitrary. However, it’s common to use these characters.
We use the IF statement to let SAS know which observation must be written to the output data set.
data work.left_join; merge work.left_srt (in=a) work.right_srt (in=b); by city; if a then output; run; data work.right_join; merge work.left_srt (in=a) work.right_srt (in=b); by city; if b then output; run;
The INNER Join
The last join type we discuss is the INNER JOIN. The INNER JOIN returns all observations that match in the Left and Right table. For this join we need the IN keyword and IF statement too.
data work.inner_join; merge work.left_srt (in=a) work.right_srt (in=b); by city; if a and b then output; run;
In SAS you can join two or more tables using the MERGE statement. Firstly, you need to order the tables you want to join on the common variable(s). Secondly, you need to define the common variable(s) in SAS with the BY statement. Finally, to create a left, right or inner join, you need the IN keyword and the IF statement.
On the SAS website you can find the official documentation.