Home Understanding PROC SQL in SAS
Post
Cancel

Understanding PROC SQL in SAS

As my experience with SAS grows, I find myself predominantly using PROC SQL for any data manipulation tasks.

This is the general structure of SQL queries that most interns and entry-level analysts will find themselves using; in the below example, the SAS/SQL syntax will be uppercase with library/table/column names being lowercase.

minimal syntax

1
2
3
4
5
PROC SQL;
  SELECT column
  FROM libname.table
  ;
QUIT;

This will generate a SAS report displaying all rows of the selected column from table in the specified library.

Common syntax

1
2
3
4
5
6
PROC SQL;
  CREATE TABLE output AS
  SELECT *
  FROM libname.table
  ;
QUIT;

This will create a SAS dataset in the working directory titled “output” containing all columns from libname.table.

A “*” is a wildcard which represents all columns

Real-world Examples

For this section, we will be using the pending and paid claims tables given in this. In this example we will be working with a pending claims table and a paid claims table and summarizing differences between the two.

Joins

SQL connects different tables together using joins.

1
2
3
4
5
6
7
8
9
10
proc sql;
  create table pending_and_paid as select t1.*
    ,t2.claim_amount    as paid_amount
    ,t2.claim_amount - t1.claim_amount as claim_difference format=dollar16.2
    ,abs(calculated claim_difference)  as abs_claim_difference
  from claims_pending as t1
  inner join claims_paid as t2
  on t1.claim_number eq t2.claim_number
  ;
quit;

t1 and t2 are ALIASES for the pending and paid tables respectively. Aliases can have any SAS appropriate name; however, when there are multiple columns with the same name an alias must be used to differentiate between them; for such reasons, using a short alias is preferred.

  • The “claim_amount” value from the claims_paid table was renamed to “paid_amount”
    • Renaming is necessary since we wanted our table to include a column with the same name from both source tables
    • Recall that t1.* will include all columns from the t1 table
  • The claim_difference column is calculated by taking the difference between the claim amount from the two tables
    • this value was also formatted as dollars as specified by the format.
  • claim_difference is calculated by referencing the previously (i.e. earlier in the query) calculated value by using the “calculated” prefix

An inner join contains records that are only found on both datasets and is often visualized as an intersection on a venn diagram.

This query has the following results

claim_numberdate_pendingdate_deathdate_submittedname_insured_firstname_insured_lastname_beneficiary_firstname_beneficiary_lastclaim_amountpaid_amountclaim_differenceabs_claim_difference
000801/31/202111/17/202001/30/2021JamesKimSarahKim$30,000.00$15,000.00-$15,00015000
000801/31/202111/17/202001/30/2021JamesKimSarahKim$30,000.00$15,000.00-$15,00015000
000901/31/202112/25/202001/31/2021EmilyLeeJohnLee$80,000.00$50,000.00-$30,00030000
001001/31/202101/10/202101/31/2021RobertKimChrisKim$120,000.00$120,000.0000

I want to point out the following:

  1. Claim 0008 is represented twice, this is because the claims_paid dataset has two payments for a single claim_number (which was also present in the claims_pending) table
  2. abs_claim_difference is numeric, but has no format assigned

Controls

Validation of datasets is essential for properly using SQL; verifying that data was not duplicated is an essential control process.

One way to do this is to count the number of unique policies that exists in both datasets and validate this to the results.

1
2
3
proc sql;
  select distinct coalesce(t1.claim_number,t2.claim_number) as claim_number from claims_pending as T1 inner join claims_paid as t2 on t1.claim_number eq t2.claim_number;
quit;

The distinct qualifier in the subquery removes all duplicates (just like the Excel feature); therefore, the above query returns just the claim_number column but removes one of the “0008” references because it would be identical to an earlier row.

The distinct qualifier is for an entire row from the resulting tables and not just the joining criteria

If the total number of claims in these two tables were different, it would be indicative of data duplication. While this is a basic form of control, it is rather manual and prone to error. Instead of comparing individual claims, comparing the overall count would suffice.

1
2
3
4
5
6
7
8
9
proc sql;
  create table distinct_claims as
  select distinct coalesce(t1.claim_number,t2.claim_number) as claim_number
  from claims_pending as T1
  inner join claims_paid as t2
  on t1.claim_number eq t2.claim_number
  ;
  select count(*) from distinct_claims;
  delete table distinct_claims;

This query

  1. Creates a table of the distinct claims
  2. Counts the number of rows from that table
  3. Deletes the table

This process is an improvement; however, it is not concise. Creating tables should be reduced to a minimum for both efficiency and transparency. Your colleagues and your future-self will tha nk you for writing code that is easy to maintain or debug.

Subqueries

1
2
3
4
proc sql;
  /*Control - # of Unique Claims*/
  select count(s1.*) from (select distinct coalesce(t1.claim_number,t2.claim_number) from claims_pending as T1 inner join claims_paid as t2 on t1.claim_number eq t2.claim_number) as s1;
quit;

This subquery results in the same output without creating the temporary table and then later deleting it.

This query has two parts

  1. The “FROM” statement is no longer a table, but is another query referred to as a subquery (because it is a component of a larger query). This subquery is the same as the previous check, which returns the three unique claim_numbers.
  2. The count(*) function returns to total count, rather than the list of claim_numbers

The subquery results are automatically deleted when the query resolves (i.e. no table was created).

This post is licensed under CC BY 4.0 by the author.

SAS Creating Tables

-