SkyLimit Tech Hub: Data Science Training Center

Course 3: Data Cleaning and Transformation with dplyr

The dplyr package is a cornerstone of the tidyverse, offering powerful tools for data manipulation in R. This course focuses on mastering dplyr to clean, transform, and prepare datasets for analysis. Designed for learners with basic R knowledge, it covers essential dplyr functions, handling missing data, grouping and summarizing, data transformation with mutate(), and practical data cleaning projects over one week.

Objective: By the end of the course, students will be proficient in using dplyr to manipulate data frames, handle missing values, perform grouped analyses, transform data, and execute end-to-end data cleaning workflows.

Scope: The course covers dplyr’s core functions (select(), filter(), arrange(), mutate(), summarise()), the pipe operator (%>%), missing data strategies, grouping, and real-world data cleaning applications.

Day 1: Introduction to dplyr

Introduction: The dplyr package is a key component of the tidyverse, a collection of R packages designed to simplify and enhance data science workflows. dplyr focuses specifically on data manipulation, providing an intuitive and consistent set of functions that make cleaning, transforming, and preparing data both efficient and readable. In modern data science, where datasets can be large and messy, mastering dplyr is crucial for any aspiring analyst or researcher who seeks to work effectively with structured data.

Objective: The objective of this lesson is to equip learners with the foundational skills to manipulate data frames using dplyr. By the end of the session, learners should be able to install and load the dplyr package, utilize key functions like select(), filter(), and arrange(), and employ the pipe operator (%>%) to chain multiple operations together into clear, logical workflows. These skills will prepare learners to efficiently prepare datasets for analysis or visualization.

Scope of the Lesson: Today’s lesson introduces several core components of dplyr. Learners will first install and load the package, ensuring their R environment is ready for use. They will then learn how to use select() to focus on particular columns, filter() to subset data based on conditions, and arrange() to reorder rows. Additionally, learners will be introduced to the pipe operator %>%, a powerful tool that allows for seamless chaining of multiple commands, resulting in cleaner, more readable code. Mastery of these functions will serve as a solid foundation for more advanced data transformations.

Background Information: dplyr was created to address the complexity and verbosity often encountered when performing data manipulations in base R. Its design philosophy centers on the idea of "verbs" — clear, concise functions that represent common data manipulation tasks. The core functions include select(), for choosing columns; filter(), for selecting rows that meet certain conditions; arrange(), for ordering rows based on column values; and mutate(), for creating or transforming columns. These functions, combined with the %>% pipe operator, allow users to express complex data manipulations as a series of readable steps, mirroring natural thought processes. Additionally, much of dplyr is implemented in C++, so it generally performs well even on large in-memory datasets and is often faster than equivalent base R code for common operations. The %>% operator, commonly read as "then", passes the output of one function into the next. Rather than nesting functions inside each other (which can be hard to read), %>% enables sequential operations that are easy to follow. Understanding and using %>% effectively is critical for writing professional-grade R code.
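To make the contrast concrete, here is a small sketch (assuming a data frame df with a numeric score column, like the one created below) comparing the nested style with its piped equivalent:

# Nested style: must be read from the inside out
arrange(filter(df, score > 85), desc(score))

# Piped style: reads top to bottom, one step at a time
df %>%
  filter(score > 85) %>%
  arrange(desc(score))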

Simple Code Examples:

# Install dplyr (needed only once per machine)
install.packages("dplyr")

# Load dplyr into the current session
library(dplyr)

# Create a sample data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 95, 79),
  age = c(23, 22, 24)
)

# Select specific columns
df_selected <- df %>% select(name, score)

# Filter rows where score is greater than 85
df_filtered <- df %>% filter(score > 85)

# Arrange the data by score in descending order
df_arranged <- df %>% arrange(desc(score))

Interpretation of the Example Codes: In the first code block, install.packages("dplyr") downloads and installs the dplyr package from CRAN; it needs to be run only once per machine, not in every session. The library(dplyr) command then loads the package into the R session so that its functions can be used without the dplyr:: prefix. Next, a sample data frame df is created using data.frame(). This data frame contains three columns: name, score, and age, with three rows of sample data representing individuals and their corresponding scores and ages. The df_selected object demonstrates the select() function: only the name and score columns are retained, dropping the age column from the result. This operation is useful whenever analysis should focus on specific attributes of the dataset. The df_filtered object illustrates filter() in action: it keeps only those rows where score is greater than 85. In this example, only "Alice" and "Bob" meet the condition, so "Charlie" is excluded from the filtered result. Finally, the df_arranged object shows how arrange() reorders the dataset based on a column’s values. By using desc(score), the data is sorted in descending order of score, placing the highest-scoring individuals at the top; without desc(), the default is ascending order. Together, these examples demonstrate the practical power of dplyr for cleaning and transforming data in a logical, readable way, and show how %>% enables clean, sequential operations without deeply nested function calls, making code easier to maintain and understand.
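The examples above apply one verb at a time; in practice the verbs are usually chained into a single pipeline. A minimal sketch, reusing the df defined above:

# One pipeline: keep two columns, filter, then sort
df %>%
  select(name, score) %>%
  filter(score > 80) %>%
  arrange(desc(score))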

Supplemental Information:

Discussion Points:

  • Why does the tidyverse (and dplyr in particular) emphasize readable code over purely compact code?
  • When chaining multiple data operations, how does the pipe %>% help with error checking and debugging?
  • Suppose you have a large dataset (millions of rows). How would you optimize your dplyr workflow for speed?
  • What are some common mistakes beginners make with select() and filter() functions?
  • How could you extend today's examples to add a new column that classifies scores as "High" or "Low"?

Day 2: Handling Missing Data

Introduction: Missing data is a common issue that can significantly distort the results of data analysis if not properly addressed. In R, missing values are denoted by NA, and ignoring them can lead to incorrect calculations, biased models, and misleading conclusions. Fortunately, dplyr offers a set of powerful tools that make it easy to detect, remove, or impute missing values systematically. Understanding how to handle missing data is essential for maintaining the integrity of any data-driven project, from simple summaries to complex machine learning workflows.

Objective: The primary objective of today’s lesson is to learn how to identify and manage missing values using both dplyr functions and base R techniques. Learners will gain skills in detecting missing data with is.na(), filtering out incomplete observations using filter(), and replacing missing values with appropriate substitutes via mutate() and ifelse(). These techniques will enable learners to clean and prepare datasets for reliable analysis, ensuring that results are both valid and meaningful.

Scope of the Lesson: This session focuses on several key strategies for handling missing data. Learners will use the is.na() function to detect missing values and complete.cases() to identify fully observed rows. Filtering incomplete records using dplyr::filter() and excluding NA values will be demonstrated. The session will also cover how to replace missing values with defaults or imputed values, such as the mean of a column, using a combination of mutate() and ifelse(). Additionally, learners will be introduced to simple imputation techniques and the importance of applying them thoughtfully to avoid introducing bias into datasets.

Background Information: In R, NA represents a missing or undefined value. When performing calculations or analyses, these NA values can interfere with functions like mean() or sum(), which by default return NA when missing values are present unless instructed otherwise (e.g., using na.rm = TRUE). Detecting missing values is often the first step in cleaning data, using functions like is.na() to flag them or complete.cases() to find rows with no missing entries. dplyr offers streamlined ways to handle missing data. Filtering out rows with missing values is easily accomplished using filter(!is.na(column)), removing only those observations where the specified column is incomplete. Alternatively, missing values can be replaced with substitutes using mutate() and ifelse(). For example, missing numerical values might be replaced with the mean or median of the existing data, although this must be done cautiously to avoid skewing results. Other strategies include dropping columns with excessive missingness or using predictive models for more advanced imputations. Learning to systematically address missing data is crucial for ensuring that your analyses are valid, reproducible, and trustworthy.
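As a quick illustration of the na.rm behavior described above:

# mean() propagates NA unless told to remove missing values
x <- c(1, 2, NA, 4)
mean(x)               # returns NA
mean(x, na.rm = TRUE) # returns 2.333...
sum(is.na(x))         # counts the missing values: 1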

Simple Code Examples:

# Create a sample data frame with missing values
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  score = c(88, NA, 79, NA),
  age = c(23, 22, 24, NA)
)

# Detect missing values
missing_scores <- is.na(df$score)

# Filter out rows with missing scores
df_no_missing <- df %>% filter(!is.na(score))

# Replace missing scores with 0
df_replace_zero <- df %>% mutate(score = ifelse(is.na(score), 0, score))

# Replace missing scores with the mean of the existing scores
mean_score <- mean(df$score, na.rm = TRUE)
df_replace_mean <- df %>% mutate(score = ifelse(is.na(score), mean_score, score))

# Find rows with complete data
complete_rows <- df[complete.cases(df), ]

Interpretation of the Example Codes: In the sample data frame df, there are missing values (NA) in both the score and age columns. These missing entries mimic real-world data collection issues, such as participants skipping questions or sensors failing to record data. The is.na(df$score) function checks each value in the score column and returns TRUE if the value is missing and FALSE otherwise. This creates a logical vector useful for further filtering or summarization. Using filter(!is.na(score)), we create df_no_missing, a new data frame that excludes all rows where the score is missing. This is a common step when only complete cases are desired for analysis. The next transformation uses mutate() and ifelse() to create df_replace_zero, where missing scores are replaced with a default value of 0. This is sometimes appropriate when 0 represents a meaningful baseline (e.g., "no attempt" or "failed measurement"). For a more statistically sound method, df_replace_mean replaces missing score values with the mean of the observed scores. First, mean(df$score, na.rm = TRUE) calculates the mean while ignoring NA values. Then mutate() with ifelse() fills in the missing values accordingly. Although this preserves dataset size and overall trends, it can reduce natural variability and should be used with caution. Finally, complete.cases(df) identifies rows where no NA values are present, and df[complete.cases(df), ] extracts only those rows. This method is particularly useful when complete data across multiple columns is required. Through these examples, learners can see multiple approaches to dealing with missing data, from elimination to imputation, depending on the analytical goals.
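Beyond ifelse(), dplyr also provides coalesce(), which returns the first non-missing value at each position and makes simple replacements more concise. A brief sketch using the df from above:

# coalesce() as a shorthand for replacing NA values
df %>% mutate(score = coalesce(score, 0))

# Or impute with the mean in the same style
df %>% mutate(score = coalesce(score, mean(score, na.rm = TRUE)))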

Supplemental Information:

Discussion Points:

  • What are the risks of simply deleting rows with missing data versus imputing values?
  • How does replacing missing data with a constant (like 0 or the mean) affect variance and statistical assumptions?
  • In what situations might advanced imputation techniques (like k-nearest neighbors or multiple imputation) be preferable to simple replacements?
  • How would you handle missing categorical data differently from missing numerical data?
  • What strategies would you apply if a dataset had more than 50% missing values in a critical column?

Day 3: Grouping and Summarizing Data

Introduction: Grouping and summarizing data are powerful techniques in exploratory data analysis, allowing us to discover patterns, trends, and anomalies in datasets. With dplyr, these tasks become highly intuitive through the use of readable, well-structured syntax. Grouping organizes data into categories based on the values of one or more columns, and summarizing computes aggregate statistics within each group, providing deeper insights than would be possible from raw observations alone. Mastering these techniques is fundamental for anyone working in data science, business analytics, or research fields that require meaningful data interpretation.

Objective: Today’s objective is to empower learners with the ability to group datasets by categorical or numerical variables and compute relevant summary statistics using dplyr. Specifically, learners will explore how to use group_by() to create groupings, apply summarise() to calculate statistics such as means and sums, and use helper functions like n() to count records. Learners will also understand the role of ungroup() in resetting the data structure after grouped operations. These skills will facilitate data condensation, making large and complex datasets easier to analyze and interpret.

Scope of the Lesson: This lesson covers several crucial operations for working with grouped data. Learners will use group_by() to define subsets of the data based on one or more columns, such as grouping customers by region or students by grade level. The summarise() function will then be applied to calculate statistical measures like mean(), sum(), min(), and max() within each group. Additional summarizing tools, such as n() for counting observations, will also be demonstrated. Finally, learners will understand how to use ungroup() to return the data to its original ungrouped form after performing grouped transformations, ensuring clarity and control over subsequent operations.

Background Information: The group_by() function in dplyr creates a grouped data structure where operations such as summarization are performed separately within each group. This allows for easy calculation of statistics like average sales per store, total revenue by year, or maximum test scores by classroom. The summarise() function takes grouped data and computes one or more summary statistics for each group, significantly reducing the size of the dataset while retaining key information. Commonly used summary functions include mean() for averages, sum() for totals, min() and max() for range analysis, and n() for counting the number of observations in each group. After performing grouping and summarizing, it is important to use ungroup() if further operations are needed on the entire dataset without regard to prior groupings. Grouping and summarization are foundational for both exploratory data analysis and for preparing datasets for visualization, modeling, or reporting.

Simple Code Examples:

# Load dplyr
library(dplyr)

# Sample data frame
df <- data.frame(
  department = c("Sales", "Sales", "HR", "HR", "IT", "IT"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(50000, 55000, 45000, 47000, 60000, 62000)
)

# Group by department and compute mean salary
dept_salary_mean <- df %>%
  group_by(department) %>%
  summarise(mean_salary = mean(salary))

# Group by department and compute total salary and number of employees
dept_summary <- df %>%
  group_by(department) %>%
  summarise(
    total_salary = sum(salary),
    employee_count = n()
  )

# Ungroup after summarizing
dept_summary_clean <- dept_summary %>%
  ungroup()

Interpretation of the Example Codes: In the provided example, a sample data frame df contains information about employees, their departments, and salaries. To explore department-level salary trends, we first use group_by(department) so that subsequent operations are performed separately within each department. Using summarise(mean_salary = mean(salary)), we calculate the mean salary for each department, producing a summarized data frame where each row represents one department and its average salary. This is useful for understanding salary distribution across different sectors of an organization. Next, we extend the summarization to compute two metrics at once: the total salary paid (sum(salary)) and the number of employees (n()) per department, giving a fuller picture of departmental scale and payroll. Finally, we apply ungroup() to the resulting data frame. One subtlety is worth noting: summarise() drops one level of grouping by default, so when grouping by a single variable the summarised output is already ungrouped, and the explicit ungroup() is a harmless safety habit. It becomes genuinely important when grouping by multiple variables, because the result then remains grouped by the remaining variables, and subsequent operations such as filtering, mutating, or joining can behave unexpectedly on grouped data. By mastering this sequence—grouping, summarizing, and ungrouping—learners can efficiently extract high-level insights from complex datasets.
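For summarizing several columns at once, newer versions of dplyr (1.0 and later) provide across(), and count() offers a shortcut for group sizes. A brief sketch using the df from above:

# Grouped means for every numeric column at once
df %>%
  group_by(department) %>%
  summarise(across(where(is.numeric), mean))

# Quick group sizes without an explicit group_by()/summarise()
df %>% count(department)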

Supplemental Information:

Discussion Points:

  • Why is it important to ungroup data after summarizing it in dplyr workflows?
  • How would you compute grouped summaries across multiple columns simultaneously?
  • What are potential issues when grouping by continuous variables (like age) instead of categorical variables?
  • How would you handle datasets where the grouping variable itself contains missing values?
  • In real-world analytics, how might grouped summaries guide business decisions (e.g., sales performance by region)?

Day 4: Data Transformation with mutate()

Introduction: The mutate() function is a cornerstone of data transformation in dplyr, offering a flexible and intuitive way to create new columns or modify existing ones within a dataset. Data transformation is often essential before analysis, as raw data typically requires scaling, categorization, feature creation, or other adjustments to make it suitable for modeling, visualization, and interpretation. By using mutate(), analysts can enrich their datasets with new insights and derived variables, ensuring data is in its most useful form.

Objective: The objective of this session is to learn how to create, modify, and transform columns in a dataset using mutate() in combination with other related functions such as ifelse() and case_when(). Learners will practice how to derive new metrics, categorize data based on conditions, and incorporate logical decision-making into column creation. By the end of the session, learners should be comfortable chaining mutate() within a sequence of data manipulation tasks, enhancing the overall efficiency and readability of their data workflows.

Scope of the Lesson: This lesson focuses on understanding and applying the mutate() function for various types of data transformations. Key topics include basic column operations (like scaling numeric values), conditional column creation using ifelse() for binary conditions, and using case_when() to handle more complex, multiple-condition logic scenarios. Learners will also see how mutate() integrates naturally with other dplyr verbs using the pipe operator %>%, allowing transformations to fit smoothly within broader data manipulation pipelines. This combination dramatically increases both the expressiveness and clarity of R code.

Background Information: The mutate() function in dplyr allows users to add new columns or overwrite existing ones based on transformations of the current dataset. For example, multiplying a numerical column by 2 or applying a log transformation can be done easily inside mutate(). When conditional logic is needed, ifelse() provides a vectorized way to apply a true/false condition across an entire column, enabling transformations such as creating a "pass/fail" variable based on a score threshold. For more complex scenarios involving multiple conditions, case_when() provides an elegant and readable syntax that mirrors the logic of SQL CASE statements or if-else chains. Using mutate() effectively is critical for real-world tasks like feature engineering in machine learning, deriving new variables from existing ones, and preparing datasets for visualization.

Simple Code Examples:

# Load dplyr
library(dplyr)

# Sample data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  score = c(85, 42, 73, 90)
)

# Create a new column by scaling the score
df_transformed <- df %>%
  mutate(scaled_score = score / 100)

# Create a pass/fail column using ifelse()
df_pass_fail <- df %>%
  mutate(pass = ifelse(score >= 60, "Pass", "Fail"))

# Create a grade category using case_when()
df_grades <- df %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    score >= 70 ~ "C",
    score >= 60 ~ "D",
    TRUE ~ "F"
  ))

Interpretation of the Example Codes: In the examples provided, the dataset df contains the names and scores of four individuals. Using mutate(), we first create a new column scaled_score that normalizes the scores by dividing each by 100, making them comparable on a 0–1 scale. Next, ifelse() is combined with mutate() to create a pass column indicating "Pass" or "Fail" based on a threshold of 60. This is particularly useful where a binary classification is needed, such as in educational evaluations or simple business logic. Lastly, case_when() is demonstrated within mutate() to assign a letter grade (A, B, C, D, or F) based on score ranges. Because case_when() evaluates its conditions in order and uses the first one that matches, the thresholds must be listed from highest to lowest, with TRUE ~ "F" acting as the catch-all default for anything below 60. case_when() provides a clean, highly readable way to map multiple conditions to different outputs, which is invaluable for more complex categorization tasks. By chaining these transformations with %>%, the operations remain readable and logically ordered, following the natural thought process of a data analyst.
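Because mutate() returns an ordinary data frame, it slots naturally into longer pipelines. A small sketch, reusing the df from above, that creates columns and then immediately filters and sorts on them:

# mutate() chained with filter() and arrange() in one pipeline
df %>%
  mutate(
    scaled_score = score / 100,
    pass = ifelse(score >= 60, "Pass", "Fail")
  ) %>%
  filter(pass == "Pass") %>%
  arrange(desc(scaled_score))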

Supplemental Information:

Discussion Points:

  • What are the advantages of using mutate() over manually editing columns outside a pipeline?
  • When would you prefer case_when() instead of multiple ifelse() statements?
  • How can transformations like scaling and normalization with mutate() impact machine learning models?
  • What are best practices for handling missing data during column creation in mutate()?
  • How might incorrect use of mutate() lead to logical errors in an analysis pipeline?

Day 5: Practical Data Cleaning Projects

Introduction: Applying dplyr functions to real-world datasets is crucial for developing practical, hands-on experience in data cleaning and transformation. In this session, learners will move beyond theory and into practice, working with authentic datasets that mirror the challenges faced in professional data analysis. Real-world data is often messy, incomplete, or inconsistent, and the ability to clean and prepare it efficiently determines the success of downstream analyses, reporting, or modeling efforts. Using dplyr, learners will gain confidence in transforming raw data into clean, usable forms ready for deeper analysis.

Objective: The objective of this session is to enable learners to apply their knowledge of dplyr to complete mini-projects focused on real-world data cleaning and transformation. By the end of the session, participants should be able to load messy datasets, identify and handle missing values, eliminate duplicates, perform relevant transformations, and summarize important aspects of the data. The projects simulate professional data preparation workflows that would typically be required in business, government, health, or academic settings.

Scope of the Lesson: This lesson will cover practical data cleaning steps from start to finish. Learners will load datasets into R, explore and identify issues such as missing values and duplicate entries, and apply dplyr techniques to fix these problems. Key tasks include using filter(!is.na(col)) to remove incomplete observations, distinct() to deduplicate records, and mutate() to transform or engineer new columns for analysis. The session will also cover summarization tasks using group_by() and summarise(), helping learners create compact reports and statistics that describe their cleaned data. Emphasis will be placed on combining multiple dplyr operations using the pipe operator %>% for clarity and reproducibility.

Background Information: Real-world datasets typically contain inconsistencies such as missing data points, duplicated records, or poorly structured variables. In professional practice, analysts first inspect datasets for these issues before proceeding to analysis. Missing values (NA) are commonly handled using filter() to exclude incomplete rows or mutate() and ifelse() to impute or replace them based on context. Duplicated rows are often accidental artifacts of data entry or merging processes and can be removed efficiently using distinct(). Transformations applied through mutate()—such as creating ratios, adjusting units, categorizing numerical values, or normalizing scales—prepare the data for meaningful visualization, reporting, or statistical modeling. Summarizing the data using group_by() and summarise() provides a quick way to understand key patterns, trends, or anomalies across groups (such as by category, region, or time period). The combination of these cleaning and summarizing skills forms the foundation of a reproducible, transparent data workflow that is critical in any serious data science project.

Simple Code Examples:

# Load dplyr
library(dplyr)

# Sample messy dataset
df <- data.frame(
  id = c(1, 2, 2, 3, 4, 5, 5, 5),
  age = c(25, 30, 30, NA, 40, 22, 22, 22),
  score = c(88, 75, 75, 90, NA, 70, 70, 70)
)

# Remove rows with missing values
df_clean <- df %>%
  filter(!is.na(age), !is.na(score))

# Remove duplicate rows
df_unique <- df_clean %>%
  distinct()

# Create a new column with transformed data
df_transformed <- df_unique %>%
  mutate(score_scaled = score / 100)

# Summarize average age and score by ID
df_summary <- df_transformed %>%
  group_by(id) %>%
  summarise(
    avg_age = mean(age),
    avg_score = mean(score_scaled)
  )

Interpretation of the Example Codes: In the example provided, a sample dataset df is created with intentional duplication and missing values in the age and score columns. The cleaning process begins with filter(!is.na(age), !is.na(score)) to exclude any rows where age or score is missing, ensuring that further analysis uses complete data. Next, distinct() removes rows that are duplicated across all columns. This is a crucial step to ensure that repeated entries do not skew summary statistics or lead to incorrect conclusions. The mutate() function then adds a new column, score_scaled, by dividing the original score values by 100; this kind of normalization is often useful when comparing scores across different scales or models. Finally, group_by(id) groups the data by the unique id, and summarise() calculates the average age and scaled score for each individual. In this small example each id has only a single row after deduplication, so the "averages" simply equal the original values; the step is included to illustrate the aggregation pattern used in real projects, where each group typically contains many observations. Together these steps prepare the cleaned data for reporting, visualization, or predictive modeling.
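As an alternative to dropping incomplete rows, the same pipeline could impute missing values instead; whether that is appropriate depends on the data, as discussed on Day 2. A hedged sketch, reusing the messy df from above and assuming median imputation is acceptable here:

# Alternative workflow: deduplicate first, then impute with the median
df_imputed <- df %>%
  distinct() %>%  # remove exact duplicates so repeats do not bias the medians
  mutate(
    age = ifelse(is.na(age), median(age, na.rm = TRUE), age),
    score = ifelse(is.na(score), median(score, na.rm = TRUE), score)
  )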

Supplemental Information:

Discussion Points:

  • How can you decide whether to drop missing values versus imputing them?
  • What are common pitfalls when removing duplicates using distinct()?
  • How does chaining multiple dplyr verbs with %>% improve reproducibility and readability?
  • What strategies would you use to document and track each cleaning step in a project?
  • How could improper data cleaning influence the outcome of a machine learning model or statistical analysis?
  • In what situations should you prioritize minimal transformation (keeping original data intact) versus heavy transformation?
