When you work with data, especially during data cleansing, you often come across unwanted characters such as whitespace, digits, or semicolons. In SAS, you can use the COMPRESS function to remove them. In this article, we discuss this function and show how to apply it with examples.
Remove Whitespace with the COMPRESS Function
By default, the COMPRESS function removes only blanks: leading blanks, trailing blanks, and any blanks within the character string. Other whitespace characters, such as tabs, are left in place unless you add the ‘s’ modifier as the third argument. In the example below, we show how to remove all three kinds of unwanted blanks.
/* REMOVE BLANKS - DEFAULT */
data work.my_data;
    input my_string $char19.;
    datalines;
   Emma
Peter
John   Snow
;
run;

data work.remove_blanks;
    set work.my_data;
    compress_string = compress(my_string);
run;
Because the my_string column is the only argument passed to the COMPRESS function, all blanks are removed from each value.
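The default call targets blanks only, which is easy to overlook when a value contains tabs. The sketch below is an illustration rather than part of the original example: the dataset name work.tab_demo and all variable names are made up, and the tab is built with the hex literal '09'x. It compares the default call with the ‘s’ modifier, which adds space characters such as tabs to the list of characters to remove.

```sas
/* Sketch: blanks versus other whitespace.                        */
/* work.tab_demo and the variable names are illustrative only.    */
data work.tab_demo;
    length with_tab $9;
    with_tab = 'John' || '09'x || 'Snow';         /* '09'x is a horizontal tab */

    default_result = compress(with_tab);          /* removes blanks only        */
    space_result   = compress(with_tab, , 's');   /* also removes the tab       */

    /* Compare the lengths in the log to see which call removed the tab. */
    len_default = length(default_result);
    len_space   = length(space_result);
    put len_default= len_space=;
run;
```

If the modifier behaves as documented, the log should report a length of 9 for the default call and 8 for the call with the ‘s’ modifier.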
Remove Specific Characters with the COMPRESS Function
As we showed in the previous section, the SAS COMPRESS function removes only blanks by default. However, you can use the optional arguments of this function to make your data cleansing process much more efficient.
With the second argument of the COMPRESS function, you can define one or more characters that you want to remove. For example, with the SAS code below we remove only the lowercase letter “a” from a string, so “Banana” becomes “Bnn”.
data work.my_data;
    input my_string $15.;
    datalines;
Banana
;
run;

data work.remove_char;
    set work.my_data;
    compress_string = compress(my_string, 'a');
run;
If you want to remove more than one character in the same operation, you can pass a list of characters as the second argument. For example, with the code below we remove the letter a, the percent sign, and the hyphen.
data work.my_data;
    input my_string $15.;
    datalines;
Banana
95%
two-year-old
;
run;

data work.remove_char;
    set work.my_data;
    compress_string = compress(my_string, 'a%-');
run;
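Note that all characters that you want to remove must be written inside a single pair of quotation marks, without separators. A comma or blank placed between them would itself be treated as a character to remove.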
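A quick way to check what a character list removes is to write the results to the log. The sketch below is an addition for illustration; it uses a DATA _NULL_ step and a hypothetical variable named result, and it applies the same character list to the three values from the example.

```sas
/* Sketch: print the result of COMPRESS to the log.   */
/* The variable name "result" is illustrative only.   */
data _null_;
    length result $15;

    result = compress('Banana', 'a%-');
    put result=;                          /* result=Bnn       */

    result = compress('95%', 'a%-');
    put result=;                          /* result=95        */

    result = compress('two-year-old', 'a%-');
    put result=;                          /* result=twoyerold */
run;
```

The last value shows that every occurrence of a listed character is removed: both hyphens and the “a” in “year” disappear.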
How to Use the Modifier Argument of the COMPRESS Function
Although the second argument of the COMPRESS function makes data cleansing in SAS much easier, the third argument is what makes the function really powerful. This argument is called the modifier, and it lets you remove (or keep) whole classes of characters with one simple operation. Below we show the most common uses of the modifier argument.
Keep Specific Characters
Instead of specifying all characters that you want to remove from a character string, you can use the second and third arguments of the COMPRESS function to specify the characters you want to keep. To do so, list the characters in the second argument and pass the ‘k’ modifier (k = keep) as the third argument. In a previous example, we showed how to remove the letter a, the percent sign, and the hyphen. In the example below, we demonstrate how to keep only these characters.
data work.my_data;
    input my_string $15.;
    datalines;
Banana
95%
two-year-old
;
run;

data work.keep_char;
    set work.my_data;
    remove_chars = compress(my_string, 'a%-');
    keep_chars = compress(my_string, 'a%-', 'k');
run;
Remove Lowercase and Uppercase Characters
If you want to remove both the lowercase and uppercase forms of a list of characters, you could write out the complete list. For example, you could pass “abcdefgABCDEFG” as the second argument of the COMPRESS function. However, there is a more efficient way to do this.
The third argument of the COMPRESS function in SAS provides an option to remove characters irrespective of their case. If you use ‘i’ as the third argument (i = case insensitive), then SAS removes both the lowercase and the uppercase forms of the characters defined in the second argument. Below, we provide an example.
data work.my_data;
    input my_string $15.;
    datalines;
BananA
bear
ABC
;
run;

data work.case_sensitive_char;
    set work.my_data;
    case_sensitive = compress(my_string, 'ab');
    case_insensitive = compress(my_string, 'ab', 'i');
run;
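To see that the ‘i’ modifier is a shorthand for listing both cases, the two approaches can be compared side by side. The following is a minimal sketch under that assumption; the dataset name work.case_check and the variable names are hypothetical, and the input values are the ones from the example.

```sas
/* Sketch: the 'i' modifier versus an explicit list of both cases. */
/* work.case_check and its variable names are illustrative only.   */
data work.case_check;
    input my_string $15.;

    explicit_list = compress(my_string, 'abAB');      /* both cases written out */
    with_modifier = compress(my_string, 'ab', 'i');   /* case-insensitive list  */

    /* A comparison evaluates to 1 when both results match, 0 otherwise. */
    same_result = (explicit_list = with_modifier);
    datalines;
BananA
bear
ABC
;
run;
```

For these values the two columns should match on every row, so the modifier mainly saves typing and avoids missing a case in longer lists.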
Remove all Digits
Another useful modifier for the third argument of the COMPRESS function is the letter ‘d’. With this option, you can remove all digits from a character string. In the example below, we remove all digits (0123456789) from a string.
data work.my_data;
    input my_string $15.;
    datalines;
abc123
1-2-3
123
;
run;

data work.remove_digits;
    set work.my_data;
    remove_digits = compress(my_string, , 'd');
run;
Note that, in contrast to the previous examples, we leave the second argument blank. That is to say, we don’t specify any characters that should be removed.
Remove all Alphabetic Characters
As with digits, you can use the third argument to remove all alphabetic characters. The ‘a’ modifier removes all letters from A to Z, irrespective of their case (lowercase or uppercase). Below we provide an example.
data work.my_data;
    input my_string $15.;
    datalines;
aBcDe
abc123
AB--//??yz
;
run;

data work.remove_letters;
    set work.my_data;
    remove_alphabetic_chars = compress(my_string, , 'a');
run;
Again, note that we leave the second argument blank.
Remove all Punctuation Marks
Besides removing all digits and all alphabetic characters, you can also use the SAS COMPRESS function to remove all punctuation marks, such as commas, periods, and exclamation marks. To do so, pass the ‘p’ modifier as the third argument. Below we show an example.
data work.my_data;
    input my_string $15.;
    datalines;
I'm John.
My age: 30
It 50% more
Nooo!! What?
;
run;

data work.remove_punctuation;
    set work.my_data;
    remove_punctuation = compress(my_string, , 'p');
run;
Combine the Arguments of the COMPRESS Function
In the previous section, we showed the power of the third argument of the SAS COMPRESS function. To make this function even more powerful, you can combine the second and third arguments, or use several modifiers at once. For example, below we show how to:
- Keep only digits.
- Keep the letters “a” and “b” (lowercase and uppercase), and the digits 2, 4, 6, and 8.
- Remove all alphabetic characters, the digits 2 and 4, and the hyphen.
data work.my_data;
    input my_string $15.;
    datalines;
123456890
A1-b2-C3-d4
;
run;

data work.combine_modifiers1;
    set work.my_data;
    keep_digits = compress(my_string, , 'dk');
run;

data work.combine_modifiers2;
    set work.my_data;
    combine_modifiers = compress(my_string, 'ab2468', 'ik');
run;

data work.combine_modifiers3;
    set work.my_data;
    combine_modifiers = compress(my_string, '24-', 'a');
run;
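A common practical use of the “keep only digits” combination is stripping formatting from values such as phone numbers. The sketch below is a hypothetical illustration, not taken from the examples above: the dataset work.phone_demo, the variables raw_phone and digits_only, and the sample value are all made up.

```sas
/* Sketch: keep only the digits of a formatted phone number.   */
/* Dataset, variable names, and the sample value are made up.  */
data work.phone_demo;
    length raw_phone digits_only $20;
    raw_phone = '(555) 123-4567';

    /* 'd' adds all digits to the list, 'k' keeps the list instead of removing it. */
    digits_only = compress(raw_phone, , 'kd');

    put digits_only=;   /* digits_only=5551234567 */
run;
```

The order of the modifier letters does not matter, so ‘kd’ and ‘dk’ give the same result.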