How to Efficiently Use The COMPRESS Function

If you work with data, especially during data cleansing, you may come across unwanted characters such as whitespace, numbers, or semicolons. In SAS, you can use the COMPRESS function to remove them. In this article, we discuss this function and show how to apply it with examples.

Remove Whitespace with the COMPRESS Function

By default, the COMPRESS function removes only blanks (space characters): leading blanks, trailing blanks, and any blanks within the character string. Other whitespace characters, such as tabs, are removed only when you add the ‘s’ modifier as the third argument. In the example below, we show how to remove all these types of unwanted blanks.

/* REMOVE BLANKS - DEFAULT */
data work.my_data;
	input my_string $char19.;
	datalines;
Emma              
             Peter
  John    Snow      
;
run;
 
data work.remove_blanks;
	set work.my_data;
 
	compress_string = compress(my_string);
run;
[Figure: A SAS dataset with blanks]

Now, if we pass the my_string column as the only argument to the COMPRESS function, all blanks are removed.

[Figure: Remove blanks with the COMPRESS function in SAS]
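
One caveat is worth illustrating: with a single argument, COMPRESS removes blank (space) characters only, not other whitespace such as tabs. The sketch below uses a made-up value with an embedded horizontal tab ('09'x) and adds the ‘s’ modifier as the third argument to strip all space characters. The dataset and variable names are illustrative and not part of the examples above.

```sas
/* REMOVE BLANKS VS. ALL WHITESPACE (illustrative sketch) */
data work.remove_whitespace;
	/* '09'x is a horizontal tab; the sample value is made up */
	my_string = 'John' || '09'x || 'Snow  Jr';

	/* One argument: blanks are removed, the tab is kept */
	blanks_only = compress(my_string);

	/* 's' modifier: blanks, tabs, and other space characters are removed */
	all_whitespace = compress(my_string, , 's');
run;
```

If your data comes from spreadsheets or tab-delimited files, the ‘s’ variant is usually the safer choice.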

Remove Specific Characters with the COMPRESS Function

As we showed in the previous section, the SAS COMPRESS function removes only blanks by default. However, you can use the optional arguments of this function to make your data cleansing process much more efficient.

With the second argument of the COMPRESS function, you can define one or more characters that you want to remove. The match is case sensitive. For example, the SAS code below removes only the lowercase letter “a” from a string.

data work.my_data;
	input my_string $15.;
	datalines;
Banana
;
run;
 
data work.remove_char;
	set work.my_data;
 
	compress_string = compress(my_string, 'a');
run;
[Figure: Remove a single character with the COMPRESS function]

If you want to remove more than one character in the same operation, you can pass a list of characters as the second argument. For example, the code below removes the letter a, the percent sign, and the hyphen.

data work.my_data;
	input my_string $15.;
	datalines;
Banana
95%
two-year-old
;
run;
 
data work.remove_char;
	set work.my_data;
 
	compress_string = compress(my_string, 'a%-');
run;
[Figure: Remove a list of characters with the SAS COMPRESS function]

Note that all characters you want to remove are written together in one quoted string, without separators between them.
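
A related detail is what to do when the character you want to remove is itself a quotation mark. A minimal sketch, assuming a made-up sample value: enclose the list in double quotation marks so that the apostrophe can appear inside it. The dataset and variable names here are illustrative.

```sas
/* REMOVE AN APOSTROPHE (illustrative sketch) */
data work.remove_apostrophe;
	/* Made-up sample value containing an apostrophe and a hyphen */
	my_string = "O'Neil-Smith";

	/* Double quotes around the list let it contain a single quote */
	no_apostrophe = compress(my_string, "'");

	/* Several characters still go into one quoted string */
	no_apostrophe_or_hyphen = compress(my_string, "'-");
run;
```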

How to Use the Modifier Argument of the COMPRESS Function

Although the second argument of the COMPRESS function makes data cleansing in SAS much easier, the third argument makes this function really powerful. The third argument is called the modifier and lets you remove (or keep) whole classes of characters with one simple operation. Below we show the most common uses of the modifier argument.

Keep Specific Characters

Instead of specifying all characters that you want to remove from a character string, you can use the second and third arguments of the COMPRESS function to specify the characters you want to keep: list them in the second argument and pass ‘k’ (keep) as the third. In a previous example, we showed how to remove the letter a, the percent sign, and the hyphen. In the example below, we demonstrate how to keep only these characters.

data work.my_data;
	input my_string $15.;
	datalines;
Banana
95%
two-year-old
;
run;
 
data work.keep_char;
	set work.my_data;
 
	remove_chars = compress(my_string, 'a%-');
	keep_chars = compress(my_string, 'a%-', 'k');
run;
[Figure: Remove and keep specific characters with COMPRESS in SAS]

Remove Lowercase and Uppercase Characters

If you want to remove both the lowercase and uppercase forms of a list of characters, you could write out the complete list. For example, you could write “abcdefgABCDEFG” as the second argument of the COMPRESS function. However, there is a more efficient way to do this.

The third argument of the COMPRESS function provides an option to remove characters irrespective of their case. If you use ‘i’ as the third argument (i = case insensitive), then SAS removes both the lowercase and the uppercase forms of the characters defined in the second argument. Below, we provide an example.

data work.my_data;
	input my_string $15.;
	datalines;
BananA
bear
ABC
;
run;
 
data work.case_sensitive_char;
	set work.my_data;
 
	case_sensitive = compress(my_string, 'ab');
	case_insensitive = compress(my_string, 'ab','i');
run;
[Figure: Remove lowercase and uppercase characters in SAS with the COMPRESS function]

Remove all Digits

Another useful value for the third argument of the COMPRESS function is the letter ‘d’. With this modifier, you can remove all digits (0123456789) from a character string, as the example below shows.

data work.my_data;
	input my_string $15.;
	datalines;
abc123
1-2-3
123
;
run;
 
data work.remove_digits;
	set work.my_data;
 
	remove_digits = compress(my_string,,'d');
run;
[Figure: Remove all digits with the COMPRESS function in SAS]

Note that, in contrast to the previous examples, we leave the second argument blank. That is to say, we don’t specify any characters that should be removed.

Remove all Alphabetic Characters

Like the previous example where we removed all digits, you can also use the third argument to remove all alphabetic characters. That is to say, all letters from A to Z irrespective of their case (lowercase or uppercase). Below we provide an example.

data work.my_data;
	input my_string $15.;
	datalines;
aBcDe
abc123
AB--//??yz
;
run;
 
data work.remove_letters;
	set work.my_data;
 
	remove_alphabetic_chars = compress(my_string,,'a');
run;
[Figure: Remove all Alphabetic Characters]

Again, note that we leave the second argument blank.

Remove all Punctuation Marks

Besides removing all digits and all alphabetic characters, you can also use the SAS COMPRESS function to remove all punctuation marks, such as commas, periods, and exclamation marks, by passing ‘p’ as the third argument. Below we show an example.

data work.my_data;
	input my_string $15.;
	datalines;
I'm John.
My age: 30
It 50% more
Nooo!!
What?
;
run;
 
data work.remove_punctuation;
	set work.my_data;
 
	remove_punctuation = compress(my_string,,'p');
run;
[Figure: Remove all Punctuation Marks]

Combine the Arguments of the COMPRESS-Function

In the previous section, we showed the power of the third argument of the SAS COMPRESS function. To make this function even more powerful, you can combine the second and third arguments. For example, below we show how to:

  • Keep only digits.
  • Keep the letters “a” and “b” (lowercase and uppercase), and the digits 2, 4, 6, and 8.
  • Remove all alphabetic characters, the digits 2 and 4, and the hyphen.

data work.my_data;
	input my_string $15.;
	datalines;
123456890
A1-b2-C3-d4
;
run;
 
data work.combine_modifiers1;
	set work.my_data;
 
	keep_digits = compress(my_string, , 'dk');
run;
 
data work.combine_modifiers2;
	set work.my_data;
 
	combine_modifiers = compress(my_string, 'ab2468', 'ik');
run;
 
data work.combine_modifiers3;
	set work.my_data;
 
	combine_modifiers = compress(my_string, '24-', 'a');
run;
[Figure: Keep only digits]
[Figure: Keep the letters “a” and “b” (lowercase and uppercase), and the digits 2, 4, 6, and 8]
[Figure: Remove all alphabetic characters, the digits 2 and 4, and the hyphen]
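
The same idea extends to combining several class modifiers in one call. As a minimal sketch, with made-up sample values and illustrative dataset and variable names, the code below keeps only letters and digits (‘kad’), and then also keeps space characters (‘kads’).

```sas
/* KEEP SEVERAL CHARACTER CLASSES AT ONCE (illustrative sketch) */
data work.messy_data;
	input my_string $20.;
	datalines;
(555) 123-4567
Order #A-17: paid!
;
run;

data work.keep_classes;
	set work.messy_data;

	/* k = keep, a = alphabetic characters, d = digits */
	letters_digits = compress(my_string, , 'kad');

	/* Adding s also keeps blanks and other space characters */
	letters_digits_spaces = compress(my_string, , 'kads');
run;
```

Keeping a class is often easier to reason about than removing one, because you list what is allowed instead of every character that might be unwanted.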