Course 3: SAS Procedures for Data Analysis

Welcome to the SAS Procedures for Data Analysis Certificate Program! This course focuses on using SAS procedures to perform descriptive statistics, data exploration, visualization, and statistical analysis. Over one week, you will learn to summarize data with PROC MEANS and PROC FREQ, explore distributions with PROC UNIVARIATE and PROC SUMMARY, visualize data with PROC SGPLOT and PROC GPLOT, conduct hypothesis tests with PROC TTEST and PROC ANOVA, and model relationships with PROC CORR, PROC REG, and PROC GLM. Designed for learners with basic SAS knowledge, this course equips you with the skills to analyze data effectively.

Objective: By the end of the course, learners will be proficient in using SAS procedures to summarize, visualize, and analyze data, enabling data-driven decision-making and robust statistical reporting.

Scope: The course covers descriptive statistics, frequency analysis, data exploration, visualization, hypothesis testing, and correlation/regression analysis, using practical examples and hands-on exercises.

Day 1: Descriptive Statistics with PROC MEANS and PROC FREQ

Introduction: Descriptive statistics are foundational tools in data analysis, providing essential summaries about the central tendency, dispersion, and frequency distribution of variables. In SAS, the procedures PROC MEANS and PROC FREQ are widely used to generate these statistics efficiently. This session introduces you to these procedures, guiding you through their syntax, options, and practical applications for summarizing both numeric and categorical data.

Learning Objectives: By the end of this session, you will be able to understand the purpose and capabilities of PROC MEANS and PROC FREQ, generate summary statistics for numeric variables using PROC MEANS, produce frequency tables and cross-tabulations for categorical variables with PROC FREQ, interpret the output from both procedures to inform data-driven decisions, and apply options and statements to customize your statistical summaries.

Scope: This session covers the use of PROC MEANS for summarizing numeric data and PROC FREQ for analyzing categorical data. You will learn about key options, output interpretation, and practical scenarios where these procedures are most useful. The focus is on hands-on application, ensuring you can confidently use these tools in your own data analysis projects.

Background Information: PROC MEANS is a SAS procedure designed to compute descriptive statistics such as mean, median, minimum, maximum, standard deviation, and more for numeric variables. It is highly customizable, allowing you to specify which statistics to display and how to group your data. PROC FREQ, on the other hand, is used to analyze categorical data by producing frequency counts, percentages, and cross-tabulations. It is invaluable for understanding the distribution of categorical variables and for exploring relationships between them. Both procedures are essential for initial data exploration and for preparing data summaries for reports and presentations.

Hands-On Example:

Suppose you have a dataset called sales_data with the following variables: Region (categorical), Product (categorical), and Sales (numeric).

PROC MEANS Example:

proc means data=sales_data n mean median min max std;
  var Sales;
  class Region;
run;

PROC FREQ Example:

proc freq data=sales_data;
  tables Region Product Region*Product / nocol nopercent;
run;

Interpretation: The PROC MEANS output provides, for each region, the number of observations, mean, median, minimum, maximum, and standard deviation of sales. This helps you quickly assess sales performance and variability across regions. The PROC FREQ output shows one-way frequency tables for Region and Product (how many observations fall in each category), followed by a cross-tabulation of Region by Product. The nocol and nopercent options trim the percentages shown, so each cross-tabulation cell reports its count and row percentage. Note that these are counts of records, not sales amounts. The tables let you see which products appear most often in each region and spot any patterns or anomalies in the data.
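
As a minimal sketch of how these summaries can be customized, the following uses a small made-up sales_data table (the values are illustrative assumptions, not course data). It limits decimals with maxdec=, filters rows with a where statement, saves regional means to a dataset called region_means (a hypothetical name), and requests a chi-square test on the cross-tabulation.

```sas
/* Hypothetical sample data; values are invented for illustration only */
data sales_data;
  length Region $5 Product $6;
  input Region $ Product $ Sales;
  datalines;
East Widget 120
East Gadget 95
East Widget 130
West Widget 150
West Gadget 80
West Gadget 110
North Widget 70
North Gadget 60
;
run;

/* Regional summaries to 1 decimal place, restricted to sales above 65 */
proc means data=sales_data n mean std maxdec=1;
  where Sales > 65;
  class Region;
  var Sales;
  /* region_means holds one overall row plus one row per Region */
  output out=region_means mean=avg_sales;
run;

/* Cross-tabulation with a chi-square test of association */
proc freq data=sales_data;
  tables Region*Product / chisq nocol nopercent;
run;
```

With so few rows, PROC FREQ will warn that expected cell counts are small, so treat the chi-square result here as a syntax illustration rather than a meaningful test.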

Supplemental Information:

For more on customizing PROC MEANS (including maxdec= and the output statement): SAS PROC MEANS Documentation
For PROC FREQ options such as chisq and plots=freqplot: SAS PROC FREQ Documentation
For using where statements with procedures: SAS WHERE Statement Documentation
For in-depth documentation, refer to the SAS Documentation: SAS Documentation
For questions, visit the SAS Communities forum: SAS Communities

Discussion Points:

When would you use PROC MEANS versus PROC FREQ?
How can summary statistics help you identify data quality issues?
What are the limitations of descriptive statistics, and when should you move to more advanced analyses?
How can you use cross-tabulations to explore relationships between categorical variables?
Discuss scenarios where grouping data (using the class statement) provides more insight than overall summaries.

Day 2: Data Exploration with PROC UNIVARIATE and PROC SUMMARY

Introduction: Exploring your data in detail is a crucial step before any advanced analysis. SAS provides powerful procedures like PROC UNIVARIATE and PROC SUMMARY to help you understand the distribution, central tendency, spread, and potential anomalies in your numeric data. This session will guide you through using these procedures to perform comprehensive data exploration, identify outliers, and generate detailed statistical summaries.

Learning Objectives: By the end of this session, you will be able to use PROC UNIVARIATE to examine the distribution and identify outliers in numeric variables, generate summary statistics for groups of data using PROC SUMMARY, interpret key outputs such as histograms, box plots, and extreme values, customize your analyses with options and statements to focus on specific variables or groups, and apply these procedures to real-world datasets for effective data exploration.

Scope: This session focuses on numeric data exploration using PROC UNIVARIATE for detailed distributional analysis and PROC SUMMARY for flexible, grouped statistical summaries. You will learn how to generate and interpret a variety of statistics and visualizations, and how to use these insights to inform further analysis or data cleaning.

Background Information: PROC UNIVARIATE is designed for in-depth analysis of numeric variables. It provides a wide range of statistics, including measures of central tendency, spread, skewness, and kurtosis, as well as graphical outputs like histograms and box plots. It is especially useful for detecting outliers and understanding the shape of your data’s distribution. PROC SUMMARY computes the same statistics as PROC MEANS and uses the same statements; the main difference is that PROC SUMMARY does not print results unless you add the print option, so it is typically used to write summary statistics for one or more variables and groups to an output dataset for further analysis. Both procedures are essential for thorough data exploration and for preparing your data for modeling or reporting.

Hands-On Example:

Suppose you have a dataset called employee_data with variables: Department (categorical), Salary (numeric), and YearsExperience (numeric).

PROC UNIVARIATE Example:

/* The plots option requests the box plot and related distribution plots */
proc univariate data=employee_data plots;
  var Salary;
  histogram Salary / normal;
  inset mean std min max / position=ne;
  id Department;
run;

PROC SUMMARY Example:

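/* print displays the statistics; nway keeps only per-department rows in dept_summary */
proc summary data=employee_data print nway n mean median min max std;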
  class Department;
  var Salary YearsExperience;
  output out=dept_summary mean= avg_salary avg_experience;
run;

Interpretation: The PROC UNIVARIATE output provides a detailed statistical summary of the Salary variable, including mean, standard deviation, minimum, maximum, skewness, and kurtosis. The histogram and box plot help visualize the distribution and spot any outliers or unusual patterns. The id statement allows you to identify which department an outlier belongs to. The PROC SUMMARY output generates summary statistics for Salary and YearsExperience by department, and creates a new dataset (dept_summary) with the average salary and experience for each department. This is useful for comparing groups and preparing data for further analysis or reporting.
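
The output dataset is often the main product of PROC SUMMARY, so it helps to look at it directly. The sketch below assumes a small invented employee_data table and adds a max_salary column (a hypothetical name) to show that different statistics can be written to the same dataset.

```sas
/* Hypothetical sample data; values are invented for illustration only */
data employee_data;
  length Department $10;
  input Department $ Salary YearsExperience;
  datalines;
Sales 52000 3
Sales 61000 7
Finance 70000 10
Finance 66000 6
IT 80000 12
IT 58000 2
;
run;

/* NWAY keeps only the per-department rows in the output dataset */
proc summary data=employee_data nway;
  class Department;
  var Salary YearsExperience;
  output out=dept_summary
         mean=avg_salary avg_experience
         max(Salary)=max_salary;
run;

/* PROC SUMMARY prints nothing by default, so list the dataset explicitly */
proc print data=dept_summary noobs;
  var Department _FREQ_ avg_salary avg_experience max_salary;
run;
```

The automatic variable _FREQ_ holds the number of observations in each department. Without nway, the dataset would also contain an overall row (with _TYPE_ equal to 0) summarizing all departments together.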

Supplemental Information:

SAS Documentation for PROC UNIVARIATE: SAS PROC UNIVARIATE Documentation
SAS Documentation for PROC SUMMARY: SAS PROC SUMMARY Documentation
SAS Tutorial: Exploring Data with PROC UNIVARIATE: UCLA PROC UNIVARIATE Tutorial
SAS Tutorial: Summarizing Data with PROC SUMMARY: UCLA PROC SUMMARY Tutorial
Video: SAS PROC UNIVARIATE Tutorial: YouTube PROC UNIVARIATE Tutorial
Video: SAS PROC SUMMARY and PROC MEANS: YouTube PROC SUMMARY Tutorial

Discussion Points:

What are the advantages of using PROC UNIVARIATE over PROC MEANS for data exploration?
How can visualizations like histograms and box plots help in identifying data quality issues?
When would you use PROC SUMMARY instead of PROC MEANS?
Discuss the importance of grouping data when summarizing statistics.
How can you use the output datasets from PROC SUMMARY in further analysis or reporting?

Day 3: Data Visualization with PROC SGPLOT and PROC GPLOT

Introduction: Data visualization is a critical component of data analysis, allowing you to communicate insights effectively through graphical representations. SAS offers several procedures for creating visualizations, with PROC SGPLOT and PROC GPLOT being two of the most commonly used. This session will introduce you to these procedures, guiding you through creating various types of plots and customizing them to enhance clarity and impact.

Learning Objectives: By the end of this session, you will be able to create scatter plots, line plots, bar charts, and histograms using PROC SGPLOT, generate two-dimensional scatter and line plots with PROC GPLOT and style them with global statements, customize plot aesthetics such as colors, labels, and titles, interpret visualizations to identify patterns, trends, and outliers in your data, and choose the appropriate plot type for different types of data and analytical questions.

Scope: This session covers the creation of a variety of plots using PROC SGPLOT and PROC GPLOT. You will learn how to use these procedures to visualize both simple and complex datasets, and how to customize your plots for effective communication. The focus is on hands-on application, ensuring you can confidently create visualizations for your own data analysis projects.

Background Information: PROC SGPLOT is a modern SAS procedure designed for creating high-quality, publication-ready graphics. It offers a simple and intuitive syntax for generating a wide range of common plot types, including scatter plots, line plots, bar charts, histograms, and box plots. PROC GPLOT is an older SAS/GRAPH procedure that produces two-dimensional scatter, line, and bubble plots; contour plots and three-dimensional surface plots come from separate SAS/GRAPH procedures (PROC GCONTOUR and PROC G3D). GPLOT output is styled through global statements such as symbol, axis, and legend, which give fine control over the plot’s appearance but require a more detailed understanding of SAS/GRAPH syntax. Both procedures are useful for visualizing data and communicating insights effectively.

Hands-On Example:

Suppose you have a dataset called sales_data with variables: Month (numeric), Sales (numeric), and Region (categorical).

PROC SGPLOT Example:

proc sgplot data=sales_data;
  title "Monthly Sales Trend";
  series x=Month y=Sales / group=Region;
  xaxis label="Month";
  yaxis label="Sales";
run;

PROC GPLOT Example:

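proc gplot data=sales_data;
  title "Sales by Month and Region";
  /* axis definitions take effect only when assigned with haxis= and vaxis= */
  axis1 label=("Month");
  axis2 label=("Sales");
  plot Sales*Month=Region / haxis=axis1 vaxis=axis2;
run;
quit;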

Interpretation: The PROC SGPLOT output displays a line plot of monthly sales trends, with each region represented by a different line. This allows you to quickly compare sales performance across regions and identify any seasonal patterns or trends. The PROC GPLOT output shows a scatter plot of sales by month, with each region represented by a different color or symbol. This can help you identify clusters or outliers and explore relationships between sales, month, and region.
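
The other SGPLOT chart types covered in this session follow the same pattern of one plot statement per chart. The sketch below assumes a small invented sales_data table and shows a bar chart, a box plot, and a histogram with a normal density overlay.

```sas
/* Hypothetical sample data; values are invented for illustration only */
data sales_data;
  length Region $5;
  input Region $ Month Sales;
  datalines;
East 1 120
East 2 135
East 3 128
West 1 150
West 2 142
West 3 160
;
run;

/* Bar chart: mean Sales per Region */
proc sgplot data=sales_data;
  title "Average Sales by Region";
  vbar Region / response=Sales stat=mean;
  yaxis label="Mean Sales";
run;

/* Box plot: spread of Sales within each Region */
proc sgplot data=sales_data;
  title "Sales Distribution by Region";
  vbox Sales / category=Region;
run;

/* Histogram with a fitted normal curve for all Sales values */
proc sgplot data=sales_data;
  title "Distribution of Sales";
  histogram Sales;
  density Sales / type=normal;
run;

title; /* clear the title so it does not carry over to later steps */
```

A bar chart suits comparisons of a summary value across categories, while the box plot and histogram show how individual values are distributed.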

Supplemental Information:

SAS Documentation for PROC SGPLOT: SAS PROC SGPLOT Documentation
SAS Documentation for PROC GPLOT: SAS PROC GPLOT Documentation
SAS Tutorial: Creating Graphics with PROC SGPLOT: UCLA PROC SGPLOT Tutorial
SAS Tutorial: Advanced Graphics with PROC GPLOT: UCLA PROC GPLOT Tutorial
Video: SAS PROC SGPLOT Tutorial: YouTube PROC SGPLOT Tutorial
Video: SAS PROC GPLOT Examples: YouTube PROC GPLOT Tutorial

Discussion Points:

What are the advantages of using PROC SGPLOT over PROC GPLOT?
How can you customize the appearance of your plots to enhance clarity?
When would you use a scatter plot versus a line plot?
Discuss the importance of choosing the right plot type for your data.
How can you use visualizations to identify patterns and trends in your data?

Day 4: Statistical Analysis with PROC TTEST and PROC ANOVA

Introduction: Statistical analysis is a core component of data analysis, allowing you to draw inferences and test hypotheses about your data. SAS provides a variety of procedures for performing statistical tests, with PROC TTEST and PROC ANOVA being two of the most commonly used. This session will introduce you to these procedures, guiding you through conducting t-tests and analysis of variance to compare means and assess statistical significance.

Learning Objectives: By the end of this session, you will be able to conduct independent samples t-tests and paired t-tests using PROC TTEST, perform one-way and two-way analysis of variance (ANOVA) using PROC ANOVA, interpret the output from t-tests and ANOVA to assess statistical significance, customize your analyses with options and statements to focus on specific hypotheses, and apply these procedures to real-world datasets for effective statistical analysis.

Scope: This session covers the use of PROC TTEST for comparing means between two groups and PROC ANOVA for comparing means across multiple groups. You will learn how to set up your data, specify your hypotheses, and interpret the results of these statistical tests. The focus is on hands-on application, ensuring you can confidently use these tools in your own data analysis projects.

Background Information: PROC TTEST is designed for comparing the means of two groups. It can perform independent samples t-tests, where the two groups are independent, and paired t-tests, where the two sets of measurements are related (e.g., pre- and post-treatment measurements on the same patients). PROC ANOVA is used to compare the means of two or more groups. It partitions the total variance in the data into different sources of variation, allowing you to assess whether the means of the groups are significantly different from each other. PROC ANOVA is intended for balanced designs, in which every cell has the same number of observations; one-way layouts are an exception and may have unequal group sizes. For other unbalanced designs, use PROC GLM instead. Both procedures are essential for hypothesis testing and for drawing inferences about your data.

Hands-On Example:

Suppose you have a dataset called treatment_data with variables: Treatment (categorical), Score (numeric), and PatientID (numeric).

PROC TTEST Example:

proc ttest data=treatment_data;
  class Treatment;
  var Score;
run;

PROC ANOVA Example:

proc anova data=treatment_data;
  class Treatment;
  model Score = Treatment;
  means Treatment / duncan;
run;

Interpretation: PROC TTEST requires the class variable to have exactly two levels, so the first example applies when Treatment defines two groups. The output provides the t-statistic, degrees of freedom, and p-value for the test of whether the two group means differ, using both the pooled (equal variance) and Satterthwaite (unequal variance) methods, along with a test for equality of variances and confidence intervals for the difference in means. The PROC ANOVA output provides the F-statistic, degrees of freedom, and p-value for the overall test of whether the treatment group means differ. The means statement with the duncan option performs Duncan's multiple range test to determine which treatment groups are significantly different from each other; this post-hoc comparison is informative only when there are three or more groups.
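
The examples above cover the independent samples t-test and one-way ANOVA. As a hedged sketch of the other two analyses named in the learning objectives, the code below shows a paired t-test and a balanced two-way ANOVA. The datasets paired_scores and trial, and the variables PreScore, PostScore, and Sex, are hypothetical names with invented values.

```sas
/* Hypothetical paired data: one row per patient, invented values */
data paired_scores;
  input PatientID PreScore PostScore;
  datalines;
1 62 68
2 70 74
3 55 61
4 80 79
5 66 72
6 59 65
;
run;

/* Paired t-test: tests whether the mean of (PreScore - PostScore) is zero */
proc ttest data=paired_scores;
  paired PreScore*PostScore;
run;

/* Hypothetical balanced 2x2 design with two observations per cell */
data trial;
  input Treatment $ Sex $ Score;
  datalines;
A F 70
A F 72
A M 65
A M 68
B F 78
B F 80
B M 74
B M 75
;
run;

/* Two-way ANOVA with main effects and their interaction */
proc anova data=trial;
  class Treatment Sex;
  model Score = Treatment Sex Treatment*Sex;
run;
quit;
```

The paired statement expects the two measurements in separate columns of the same row. PROC ANOVA is interactive, so the quit statement ends the procedure after the run statement.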

Supplemental Information:

SAS Documentation for PROC TTEST: SAS PROC TTEST Documentation
SAS Documentation for PROC ANOVA: SAS PROC ANOVA Documentation
SAS Tutorial: Performing T-Tests with PROC TTEST: UCLA PROC TTEST Tutorial
SAS Tutorial: Performing ANOVA with PROC ANOVA: UCLA PROC ANOVA Tutorial
Video: SAS PROC TTEST Tutorial: YouTube PROC TTEST Tutorial
Video: SAS PROC ANOVA Examples: YouTube PROC ANOVA Tutorial

Discussion Points:

What are the assumptions of t-tests and ANOVA?
How do you interpret the p-value in a statistical test?
When would you use a t-test versus ANOVA?
Discuss the importance of checking assumptions before conducting statistical tests.
How can you use post-hoc tests to determine which groups are significantly different from each other?

Day 5: Correlation and Regression Analysis

Introduction: Understanding relationships between variables is a cornerstone of data analysis. Correlation and regression analysis are powerful statistical tools that allow you to quantify associations and model predictive relationships. In SAS, procedures such as PROC CORR and PROC REG make it straightforward to perform these analyses, interpret the results, and apply them to real-world problems. This session will guide you through the process of exploring variable relationships, testing for linear associations, and building regression models.

Learning Objectives: By the end of this session, you will be able to calculate and interpret correlation coefficients using PROC CORR, perform simple and multiple linear regression analyses with PROC REG, assess the strength and direction of relationships between variables, interpret regression output, including coefficients, R-squared, and significance tests, and apply correlation and regression techniques to real datasets for both exploration and prediction.

Scope: This session covers the use of PROC CORR for measuring linear associations between numeric variables and PROC REG for building and interpreting linear regression models. You will learn how to check assumptions, interpret key statistics, and use these analyses to inform data-driven decisions.

Background Information: Correlation analysis quantifies the strength and direction of a linear relationship between two numeric variables, typically using Pearson’s correlation coefficient. Regression analysis, on the other hand, models the relationship between a dependent variable and one or more independent variables, allowing for prediction and hypothesis testing. Both are foundational techniques in statistics and data science.

Hands-On Example:

Suppose you have a dataset called marketing_data with variables: Sales (numeric), Advertising (numeric), Price (numeric), Region (categorical), and Month (numeric).

Example 1: Calculate Pearson Correlation between Sales and Advertising

proc corr data=marketing_data;
  var Sales Advertising;
run;

Example 2: Calculate Correlations among Multiple Variables

proc corr data=marketing_data;
  var Sales Advertising Price;
run;

Example 3: Simple Linear Regression (Sales predicted by Advertising)

proc reg data=marketing_data;
  model Sales = Advertising;
run;

Example 4: Multiple Linear Regression (Sales predicted by Advertising and Price)

proc reg data=marketing_data;
  model Sales = Advertising Price;
run;

Example 5: Regression with Categorical Predictor (Sales predicted by Advertising and Region)

proc glm data=marketing_data;
  class Region;
  model Sales = Advertising Region;
run;

Interpretation: The first two PROC CORR examples provide Pearson correlation coefficients and their p-values, indicating the strength and direction of linear relationships between variables. A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or no linear association. The third example uses PROC REG to fit a simple linear regression model, estimating how sales change, on average, with each unit increase in advertising spending. The output includes the regression coefficient, R-squared value (the proportion of variance explained), and significance tests. The fourth example extends this to multiple regression, allowing you to assess the effect of each predictor while holding the other constant. The output helps you determine which predictors are significant and how well the model fits the data. The fifth example demonstrates how to include a categorical variable (Region) in the model using PROC GLM, whose class statement handles the dummy coding internally. This allows you to assess differences in sales across regions while controlling for advertising.
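
Checking assumptions is part of this session's scope, so here is a minimal sketch of how that might look in practice. It assumes a small invented marketing_data table and requests a rank-based correlation alongside Pearson, plus residual plots, variance inflation factors, and confidence limits from PROC REG.

```sas
/* Hypothetical sample data; values are invented for illustration only */
data marketing_data;
  input Sales Advertising Price;
  datalines;
200 20 9.5
220 25 9.0
210 22 9.8
260 35 8.5
240 30 9.2
280 40 8.0
230 28 9.9
270 38 8.8
;
run;

ods graphics on;

/* Pearson measures linear association; Spearman is rank-based */
proc corr data=marketing_data pearson spearman;
  var Sales Advertising Price;
run;

/* Residual and Q-Q plots help check constant variance and normality;
   vif flags collinear predictors, clb adds confidence limits for coefficients */
proc reg data=marketing_data plots(only)=(residualbypredicted qqplot);
  model Sales = Advertising Price / vif clb;
run;
quit;
```

A residual plot with no visible pattern and a roughly straight Q-Q plot are consistent with the usual linear regression assumptions; large variance inflation factors suggest that predictors overlap in the information they carry.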

Supplemental Information:

SAS Documentation for PROC CORR: SAS PROC CORR Documentation
SAS Documentation for PROC REG: SAS PROC REG Documentation
SAS Documentation for PROC GLM: SAS PROC GLM Documentation
SAS Tutorial: Correlation and Regression in SAS: UCLA Correlation and Regression Tutorial
SAS Tutorial: Linear Regression with PROC REG: UCLA PROC REG Tutorial
Video: SAS PROC CORR and PROC REG Tutorial: YouTube PROC CORR and PROC REG Tutorial

Discussion Points:

What is the difference between correlation and regression analysis?
How do you interpret the sign and magnitude of a correlation coefficient?
When should you use multiple regression instead of simple regression?
What are the assumptions underlying linear regression, and how can you check them?
How can categorical variables be included in regression models, and what is the role of dummy coding?

Practice Lab

SAS OnDemand for Academics provides a free SAS programming environment for practicing the coding exercises.
