SAS How To's

How to Create a SAS Dataset with Raw Data Manually

For some purposes, you might need a SAS dataset with a small amount of data. In these situations, it’s more convenient to create a SAS dataset manually than to read it from an input file.

In this article, we discuss the most frequently asked questions people have when they try to enter data manually. These questions are:

How to:

  • Create a SAS Dataset Manually
  • Change the Length of the Input Variables
  • Change the Format of the Input Variables
  • Enter Date Variables
  • Create Variables Based on other Input Variables
  • Deal with Whitespace and Blanks

Instead, if you are more interested in how to create an empty dataset (based on an existing table), I recommend you to read this article.

How to Create a SAS Dataset Manually

In SAS, you can create a new dataset by reading data from an input file (Excel, CSV, TXT, etc.), by using an existing dataset, or by entering the data manually. The last one is known as instream data and is a convenient way to quickly create a dataset with a small amount of data.

A SAS program to enter data manually consists of at least 3 statements, namely:

  1. The DATA Statement to begin a DATA Step and to specify the name of your dataset.
  2. The INPUT Statement to define the names and the types of the variables in your dataset.
  3. The DATALINES (or CARDS) Statement to specify the data in your dataset.

For example, we can use the following SAS code to create a dataset with employees’ names and their salaries.

data work.employees;
    input employee $ salary;
    datalines;
Smith 40000
Williams 35000
Jones 38000
;
run;

We use the DATA statement to define the name of the dataset (employees) and the library where it’s stored (work). If you don’t specify the library, then the dataset will be saved by default in the WORK library.

The dataset above consists of one character variable (employee) and one numeric variable (salary). With the INPUT statement, we define the names and the types of the input variables. A dollar-sign after a variable name indicates that this variable is a character variable. If you don’t provide anything, SAS assumes this input variable is numeric.

The data you want to enter follows directly after the DATALINES statement. Each row of input data is written on a separate line (without a semicolon at the end). After the last row of input, there is one final line of code with one semicolon.

As you can see, when you enter character data it isn’t necessary to write the values between quotes. (for example the employee’s name).

Note: Instead of using the Datalines statement, you can use the Cards statement. The Cards statement is an alias of the Datalines statement.

How to Change the Length of the Input Variables

If you create a SAS dataset manually, then all input variables have a default length.

The length of a SAS variable is by default 8. For numeric variables, this is enough to store numbers of 16 digits. However, for character variables, this corresponds to just 8 characters. As the example below shows, this is too short for an employee with the name Hernandez (9 characters).

data work.employees;
    input employee $ salary;
    datalines;
Smith 40000
Williams 35000
Jones 38000
Hernandez 38500
;
run;

As you can see above, the name of the fourth employee has been cutoff after 8 characters and has been saved as Hernande (without the z), although we clearly defined Hernandez in the SAS code. So, how to change the length of an input variable?

You need the LENGTH statement to change the default length of a variable. For a character variable this statement consists of 3 parts, namely:

  1. The LENGTH keyword.
  2. The name of the variables whose length you want to change.
  3. A $-sign followed by the desired new length and a semicolon.

So, for example, to change the length of the employee variable from 8 to 25 we use the following code.

LENGTH employee $25;

If you want to change the length of your variable(s), you need to place the LENGTH statement before the INPUT statement. Otherwise, SAS will write a warning to the log without changing the length of your variable.

data work.employees;	
    length employee $25;
    input employee $ salary;	
    datalines;
Smith 40000
Williams 35000
Jones 38000
Hernandez 38500
;
run;

As the table above shows, the length of the employee variable has been changed successfully and the name of Hernandez is shown completely.

You can also use PROC CONTENTS to check (amongst others) the length of your variables.

proc contents data=work.employees;
run;

Above we demonstrated how to change the length of one variable. However, you can use the LENGTH statement also to change the length of multiple variables at once.

Suppose you have a dataset with two columns, namely FirstName and LastName. To change the length of both columns to 25 characters, you use the following code.

LENGTH FirstName LastName $25;

How to Change the Format of Input Variables

Another frequently asked question about creating a dataset manually is how to format the input variables.

By default, SAS uses the BESTw. format for numeric variables and the $w. format for character variables. You can modify the default formats with the FORMAT statement.

The FORMAT statement consists of 3 parts, namely:

  1. The FORMAT keyword.
  2. The name of the variable to which you want to apply the format.
  3. The format you want to apply and a semicolon.

So, for example, if you want to apply the DOLLAR12.2 format to the salary variable you need the following statement:

FORMAT salary DOLLAR12.2;

If you want to apply a format to your input variables, you need to place the FORMAT statement after the INPUT statement. So, for example:

data work.employees;	
    length employee $25;
    input employee $ salary;
    format salary dollar12.2;
    datalines;
Smith 40000
Williams 35000
Jones 38000
Hernandez 38500
;
run;

With the FORMAT statement, you can apply predefined SAS formats and user-defined formats. If you want to know more about user-defined formats and how to create them, I highly recommend you to read this article.

Like the LENGTH statement, you can use the FORMAT statement to apply the same format to multiple variables at once. To apply a format to several variables, your FORMAT statement needs the following structure:

  1. The FORMAT keyword.
  2. The names of the variables to which you want to apply the format.
  3. The format you want to apply and a semicolon.

So, if you have two columns (for example, salary and bonus) and you want to apply the DOLLAR12.2 format to both of them, then you need the following FORMAT statement:

FORMAT salary bonus DOLLAR12.2;

How to Enter Date Variables

Besides character and numeric variables, SAS provides date variables.

A SAS date variable is stored as the number of days between January 1st, 1960, and the given date. So, how do you enter date variables when you create a SAS dataset manually?

To enter date values after the DATALINES statement you need to do two things:

  1. In the INPUT statement, you need to provide the name of the variable (for example my_date) followed by the format of the values after the DATALINES statement. For example, if you enter the value 31AUG2020, then you need INPUT my_date DATE9. as INPUT statement.
  2. You need a FORMAT statement to make the date value interpretable for humans. Otherwise, SAS will show the number of days between the entered date and January 1st, 1960 in the output dataset.

For example, we will add the column birthdate to our example dataset and enter the values in the DATE9. format.

data work.employees;	
	length employee $25;
	input employee $ salary birthdate date9.;
	format salary dollar12.2 birthdate date9.;
	datalines;
Smith 40000 30MAR1980
Williams 35000 05SEP1994
Jones 38000 15FEB1988
Hernandez 38500 08DEC1991
;
run;

As you can see, it isn’t necessary to enter March 30th, 1980 as “30MAR1980″d. You directly enter 30MAR1980 in the DATALINES statement.

How to Create Input Variables Based on Other Variables

Suppose you have entered in the DATALINES statement the variables salary and birthdate. Now, you want to add the columns monthly_salary and the current age of the employees. You could enter this information by hand, or, more efficiently, let SAS calculate these values.

If you want to add a calculated input variable that is based on a current variable, then you define this variable after the INPUT statement. Defining a new variable follows the same rules as always. (See this article to know more about creating new variables) .

In the example below, we create two new variables based on existing variables:

data work.employees;	
	length employee $25;
	input employee $ salary birthdate date9.;
	monthly_salary = round(salary / 12, 1);
	age = intck("year", birthdate, today(), "c");
	format salary monthly_salary dollar12.2 birthdate date9.;
	datalines;
Smith 40000 30MAR1980
Williams 35000 05SEP1994
Jones 38000 15FEB1988
Hernandez 38500 08DEC1991
;
run;

As you might expect, you can define the length and format of these calculated input variables with the LENGTH and FORMAT statements.

How to Deal with Whitespace and Blanks

Until now, all our input variables were values without whitespaces (blanks). However, it might happen that the values in the DATALINES statement contain blanks.

In the examples above, the employee column contains the surname of the employee. In the code below, we try to change the value of the employee column such that it contains the first name and the surname, for example John Smith or Mike Williams.

data work.employees;	
    length employee $25;
    input employee $ salary birthdate date9.;
    monthly_salary = round(salary / 12, 1);
    age = intck("year", birthdate, today(), "c");
    format salary monthly_salary dollar12.2 birthdate date9.;
    datalines;
John Smith 40000 30MAR1980
Mike Williams 35000 05SEP1994
Marta Jones 38000 15FEB1988
Javi Hernandez 38500 08DEC1991
;
run;

The result of the code above is a table with missing values and notes in the SAS log (Note: Invalid data for X in line Y).

The cause of this problem is the way how SAS reads and processes the values after the DATALINES statement. As an example we use the following line of values.

John Smith 40000 30MAR1980

When SAS processes a line of values, it assumes that each blank indicates the beginning of the value of the next input variable. So, in our example, SAS copies the value John to the column employee. This isn’t a problem since John is a character value, just like the employee column (defined in the INPUT statement). Then, SAS tries to copy the value Smith to the salary column. This is a problem because SAS expects a numeric value, but Smith is a character value. Therefore, the salary column remains empty and SAS writes a note to the log. SAS continues processing the remaining values, but similar problems occur.

To avoid this problem and to be able to enter values that contain blanks, we need to change the delimiter. By default, the value that separates the value of each input variable (i.e., the delimiter) is a blank. We can change this default behavior with the INFILE statement.

The INFILE statement consists of three parts and is the first statement after the DATA statement:

  1. The INFILE keyword.
  2. The DATALINES keyword to indicate that you enter raw data manually.
  3. The DLM= option to define the delimiter and a semicolon. (DLM is an abbreviation of delimiter.)

You can choose the delimiter that suits you best, but the most common delimiters are a comma, a semicolon, or a tab. In the example below, we use the comma as a delimiter.

data work.employees;
    infile datalines dlm=',' dsd;	
    length employee $25;
    input employee $ salary birthdate :date9.;
    monthly_salary = round(salary / 12, 1);
    age = intck("year", birthdate, today(), "c");
    format salary monthly_salary dollar12.2 birthdate date9.;
    datalines;
John Smith, 40000, 30MAR1980
Mike Williams, 35000, 05SEP1994
Marta Jones, 38000, 15FEB1988
Javi Hernandez, 38500, 08DEC1991
;
run;

As you can see, now SAS is able to process data with blanks.