Course 2: Data Frames & Tidyverse Essentials

Data frames and the tidyverse are cornerstones of data manipulation and analysis in R. This course covers handling tabular data with data frames, using the tidyverse ecosystem for readable and efficient data wrangling, and applying these skills to real-world datasets. Designed for learners familiar with R basics, it spans five daily sessions on data frame creation, tidyverse principles, tibbles, joins, reshaping, and practical workflows.

Objective: By the end of the course, students will be able to create and manipulate data frames, use tidyverse tools like dplyr and tidyr for data transformation, and execute end-to-end data cleaning and analysis workflows.

Scope: The course covers data frame operations, tidyverse packages (dplyr, tidyr, tibble), piping, filtering, joining, reshaping, and summarizing data, preparing learners for advanced data science tasks.

Day 1: Introduction to Data Frames

Introduction: Data frames are the fundamental structure for handling tabular data in R, allowing efficient storage, manipulation, and analysis of datasets. Mastery of data frames is essential for data science, statistical analysis, and visualization tasks.

Objective: By the end of today’s session, learners will be able to create data frames from scratch, access and manipulate data within data frames, inspect data frames using basic summary functions, and read external CSV files into R as data frames.

Scope of the Lesson: Creating data frames using data.frame(); Accessing columns and rows using $, [[ ]], and [ , ]; Summarizing data frames with summary() and str(); Importing data from CSV files with read.csv().

Background Information: A data frame in R is a list in which each element (column) is a vector of the same length, so every column has one value per row. Different columns can hold different types (numeric, character, logical, etc.). Accessing Data: df$x and df[["y"]] both return a single column as a vector; df[1, ] returns the first row as a one-row data frame; df[1, 2] returns the element in the first row, second column. Summarizing Data: summary(df) provides descriptive statistics for each column; str(df) displays the structure (column types and a preview of the values). Reading External Data: Load a CSV file into a data frame with survey_data <- read.csv("survey.csv").
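Additional Sketch: The list-of-columns idea and the accessors described above can be checked directly in base R. This is a minimal illustration, not part of the original lesson code; the data frame df and its column names x, y, and z are made up for the example, and stringsAsFactors = FALSE is set explicitly so the behavior is the same on R versions before 4.0.0.

```r
# Illustrative data; the names df, x, y, z are assumptions for this sketch
df <- data.frame(
  x = c(1.5, 2.5, 3.5),
  y = c("a", "b", "c"),
  z = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(df)        # a data frame is a list of columns
length(df)         # one list element per column
sapply(df, class)  # each column keeps its own type

# $ and [[ ]] both extract the column itself (a plain vector)
by_dollar  <- df$x
by_bracket <- df[["x"]]
identical(by_dollar, by_bracket)

# [[ ]] also accepts a column name stored in a variable; $ does not
col_name <- "y"
picked <- df[[col_name]]

# Single brackets with one index keep the data frame wrapper
one_col_df <- df["x"]
class(one_col_df)

# [row, column] indexing
first_row <- df[1, ]   # one-row data frame
cell <- df[1, 2]       # single value from row 1, column 2
```

The practical difference is that df[[col_name]] works when the column name is only known at run time, which is why [[ ]] tends to appear inside functions and loops while $ is more common in interactive work.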

Simple Code Examples:

# Create a simple data frame
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(28, 34, 25),
  department = c("HR", "Finance", "IT")
)

# Inspect the data frame
print(employees)
summary(employees)
str(employees)

# Access specific parts
employees$name          # Access the 'name' column
employees[2, ]          # Access the second row
employees[1, "age"]     # Access the 'age' of the first employee

# Load a CSV file
sales_data <- read.csv("sales.csv")
head(sales_data)

Explanation of the Example Code: Creates a data frame with employee data; inspects it using print(), summary(), and str(); demonstrates accessing columns, rows, and specific elements; shows how to load a CSV file and view its first few rows.
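Additional Sketch: The read.csv() call above needs a sales.csv file in the working directory. For practice without that file, one option is to write a small data frame to a temporary CSV and read it back. The data frame sales_example and its columns region and units are invented for this illustration and do not describe the course's actual sales.csv.

```r
# Illustrative data standing in for a real sales file (assumed columns)
sales_example <- data.frame(
  region = c("North", "South", "East"),
  units = c(120, 95, 143),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV so the example needs no existing file
csv_path <- tempfile(fileext = ".csv")
write.csv(sales_example, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
# (this is already the default from R 4.0.0 onward)
sales_loaded <- read.csv(csv_path, stringsAsFactors = FALSE)

head(sales_loaded)  # first rows
str(sales_loaded)   # column types chosen by read.csv()

# Remove the temporary file
unlink(csv_path)
```

Setting row.names = FALSE in write.csv() avoids an extra unnamed column of row numbers appearing when the file is read back.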

Discussion Points (Optional): What advantages do data frames offer compared to matrices in R? How does accessing a column with $ differ from accessing it with [[]]? Why is it useful to check the structure of a data frame before analysis?

Supplemental Information:

Cheatsheet: df <- data.frame(x = 1:3); df$x
Video: R Data Frames by R Programming 101
Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)

Day 2: Introduction to Tidyverse

Introduction: The tidyverse is a powerful ecosystem of R packages designed for data science tasks, offering a consistent and intuitive approach to data manipulation, transformation, and visualization. Mastering tidyverse workflows is essential for efficient and readable R programming.

Objective: By the end of today’s session, learners will be able to install and load the tidyverse collection, understand and apply tidyverse principles to organize and process data, use piping (%>%) to streamline code, and create tibbles and perform simple data manipulation.

Scope of the Lesson: Installing and loading the tidyverse (install.packages("tidyverse") and library(tidyverse)); Understanding piping with %>%; Creating and using tibbles with tibble(); Selecting columns using select().

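Background Information: Created by Hadley Wickham, the tidyverse is a coherent system of packages designed to work together under the "tidy data" philosophy: each variable forms a column, each observation forms a row, each type of observational unit forms a table. Key Packages: dplyr (data manipulation), tidyr (data reshaping), ggplot2 (visualization), readr (data import), purrr (functional programming), tibble (modern data frames), stringr (strings), forcats (factors). The Pipe Operator (%>%): Provided by the magrittr package and made available when dplyr or the tidyverse is loaded, it passes the result of the expression on its left as the first argument of the function on its right, so a sequence of steps reads top to bottom. Tibbles: Modern data frames with more compact printing and stricter behavior (no partial matching of column names, no silent type changes), created with tibble().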
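Additional Sketch: To make the pipe description concrete, the comparison below writes the same two-step operation once as nested function calls and once as a piped chain. It is a small illustration that reuses the students tibble from this lesson; loading dplyr and tibble individually instead of library(tidyverse) is a choice made only to keep the example light.

```r
library(dplyr)   # select(), arrange(), desc(), and the %>% pipe
library(tibble)  # tibble()

students <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

# Nested form: has to be read from the inside out
nested <- arrange(select(students, name, score), desc(score))

# Piped form: each step receives the previous result as its first argument
piped <- students %>%
  select(name, score) %>%
  arrange(desc(score))

# Both forms produce the same tibble
same_result <- identical(nested, piped)
same_result
```

The two forms do the same work; the piped version is usually preferred once a transformation has more than two steps, because each line corresponds to one operation.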

Simple Code Examples:

# Install tidyverse (needed only once per R installation), then load it
install.packages("tidyverse")
library(tidyverse)

# Create a tibble
students <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

# View the tibble
print(students)

# Use piping to select a column
students %>%
  select(name)

# Chain multiple functions
students %>%
  select(name, score) %>%
  arrange(desc(score))

Explanation of the Example Code: Installs and loads tidyverse; creates a tibble with student data; demonstrates selecting columns and chaining operations with the pipe operator to sort by score.

Discussion Points (Optional): How does piping %>% improve code readability? What advantages do tibbles offer compared to traditional data frames? Why is "tidy" data important for analysis?

Supplemental Information:

Cheatsheet: library(tidyverse); df %>% select(col1)
Video: Tidyverse Tutorial by DataCamp
Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)

Day 3: Tibbles and Basic Operations

Introduction: Tibbles, a modern reimagining of data frames from the tidyverse, enhance data handling by offering cleaner printing, safer subsetting, and a better user experience, especially when working with larger datasets.

Objective: By the end of today’s session, learners will be able to create and manipulate tibbles, perform basic operations like subsetting, filtering, adding columns, and arranging rows, and use piping (%>%) to build readable workflows.

Scope of the Lesson: Creating tibbles with tibble() and as_tibble(); Accessing columns and rows (subsetting); Adding new columns using mutate(); Filtering rows with filter(); Sorting rows with arrange().

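Background Information: Tibbles are the tidyverse's version of the data frame. They print only the first few rows and the columns that fit on screen, and they never convert strings to factors (base data.frame() did so by default before R 4.0.0). Create them with tibble(), or convert an existing data frame with as_tibble(). Subsetting: Use $ to extract one column as a vector and select() to keep one or more columns as a tibble. Key Functions: mutate() adds or transforms columns; filter() keeps rows matching conditions; arrange() sorts rows. The Pipe (%>%): Chains operations for readability, avoiding nested calls.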
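Additional Sketch: Two of the "safer subsetting" differences between tibbles and base data frames can be seen side by side. The objects df and tb below are made up for this comparison; suppressWarnings() is used only to keep the tibble warning about the unknown column out of the console.

```r
library(tibble)

# The same data as a base data frame and as a tibble (illustrative names)
df <- data.frame(
  name = c("Alice", "Bob"),
  score = c(88, 92),
  stringsAsFactors = FALSE
)
tb <- as_tibble(df)

# Selecting one column with [ , ]:
# the data frame drops to a plain vector, the tibble stays a tibble
df_col <- df[, "score"]
tb_col <- tb[, "score"]
class(df_col)
class(tb_col)

# $ with an abbreviated column name:
# the data frame silently matches "sc" to "score", the tibble returns NULL
df_partial <- df$sc
tb_partial <- suppressWarnings(tb$sc)
```

Because a tibble always returns a tibble from [ , ] and refuses abbreviated names, code written against tibbles tends to fail early instead of quietly returning the wrong shape or the wrong column.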

Simple Code Examples:

# Load tidyverse
library(tidyverse)

# Create a tibble
students <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

# Add a new column: grade is "A" for scores of 90 or more, "B" otherwise
students <- students %>%
  mutate(grade = ifelse(score >= 90, "A", "B"))

# Filter students with scores above 80
high_scores <- students %>%
  filter(score > 80)

# Arrange students by descending score
sorted_students <- students %>%
  arrange(desc(score))

Explanation of the Example Code: Creates a tibble; adds a grade column with mutate(); filters high scores; sorts by score in descending order using the pipe operator.
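Additional Sketch: The ifelse() call in the example produces only two categories, so a score of 79 is labelled "B" as well. When more than two categories are needed, dplyr's case_when() is one common alternative. The cutoffs of 90 and 80 below are assumptions chosen for illustration, not a grading scheme from the course.

```r
library(dplyr)
library(tibble)

students <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

graded <- students %>%
  mutate(
    # Conditions are checked in order; the first TRUE one wins
    grade = case_when(
      score >= 90 ~ "A",   # assumed cutoff
      score >= 80 ~ "B",   # assumed cutoff
      TRUE        ~ "C"    # everything else
    )
  ) %>%
  filter(grade != "C") %>%   # keep only A and B students
  arrange(desc(score))       # highest score first

graded
```

The same chain also shows mutate(), filter(), and arrange() combined in a single pipeline instead of three separate assignments.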

Discussion Points (Optional): How do tibbles make working with data easier than traditional data frames? Why is mutate() often preferred over creating columns manually? How does %>% help structure logical data transformation steps?

Supplemental Information:

Cheatsheet: tibble(x = 1:3); df %>% filter(col > 5)
Video: Tibbles in Tidyverse by R Programming 101
Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)

Day 4: Joining and Reshaping Data

Introduction: Joining and reshaping datasets are essential skills in data analysis, enabling you to integrate multiple sources and reorganize data for effective exploration and modeling.

Objective: By the end of today’s session, learners will be able to combine multiple datasets using different types of joins, reshape data between long and wide formats using tidyverse functions, and handle real-world data alignment issues during integration and transformation.

Scope of the Lesson: Performing joins with left_join(), inner_join(), and other dplyr functions; Reshaping data with pivot_longer() and pivot_wider(); Managing key matching and missing data when merging datasets.

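Background Information: Joins (dplyr): Combine data frames based on a shared key column. left_join(df1, df2, by = "key") keeps all rows from df1 and fills the columns from df2 with NA where no match exists; inner_join() keeps only rows whose key appears in both data frames. Reshaping (tidyr): pivot_longer() converts wide format to long format (column names become values); pivot_wider() converts long format to wide format (values become column names). Importance: Joins link sources (e.g., customer and purchase data); reshaping puts data into the layout that plotting and modeling functions expect.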
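Additional Sketch: The difference between the join types is easiest to see by counting rows. The customers and purchases tibbles below are invented for this comparison. anti_join(), which is not covered in the lesson text, is a dplyr function that returns the rows of the first table with no match in the second, which makes it handy for finding missing keys.

```r
library(dplyr)
library(tibble)

# Illustrative tables: key 3 has no purchase, key 4 has no customer
customers <- tibble(key = c(1, 2, 3), customer = c("Ana", "Ben", "Cy"))
purchases <- tibble(key = c(1, 2, 4), amount = c(20, 35, 50))

# Keeps all customers; amount is NA where no purchase matches
left_result <- left_join(customers, purchases, by = "key")

# Keeps only keys present in both tables
inner_result <- inner_join(customers, purchases, by = "key")

# Customers without any matching purchase
unmatched <- anti_join(customers, purchases, by = "key")

nrow(left_result)
nrow(inner_result)
unmatched
```

Checking the row counts before and after a join is a quick way to notice unexpected key mismatches or duplicated keys in real data.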

Simple Code Examples:

# Load tidyverse
library(tidyverse)

# Example datasets
students <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
scores <- tibble(id = c(1, 2, 4), score = c(90, 85, 88))

# Left join (keep all students, match scores if available)
student_scores <- students %>% 
  left_join(scores, by = "id")

# Reshape a dataset
grades <- tibble(
  id = 1:2,
  math = c(90, 85),
  science = c(88, 92)
)

# Pivot from wide to long format
grades_long <- grades %>%
  pivot_longer(cols = c(math, science), names_to = "subject", values_to = "score")

# Pivot from long to wide format
grades_wide <- grades_long %>%
  pivot_wider(names_from = subject, values_from = score)

Explanation of the Example Code: Joins student and score data with left_join(); reshapes grades data from wide to long format and back to wide using pivot_longer() and pivot_wider().
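Additional Sketch: Real long-format data often lacks some combinations, which shows up as NA after pivot_wider(). The sketch below uses a made-up grades_long table in which student 2 has no science score. The fill value of 0 is purely illustrative; filling with 0 would distort averages, so keeping NA is often the safer choice.

```r
library(tidyr)
library(tibble)

# Illustrative long data: no science row for id 2
grades_long <- tibble(
  id = c(1, 1, 2),
  subject = c("math", "science", "math"),
  score = c(90, 88, 85)
)

# Missing combinations become NA in the wide table
grades_wide <- pivot_wider(
  grades_long,
  names_from = subject,
  values_from = score
)

# values_fill supplies a placeholder instead of NA (0 is an assumption)
grades_filled <- pivot_wider(
  grades_long,
  names_from = subject,
  values_from = score,
  values_fill = list(score = 0)
)

# Going back to long format; values_drop_na = TRUE drops the NA cell again
back_to_long <- pivot_longer(
  grades_wide,
  cols = c(math, science),
  names_to = "subject",
  values_to = "score",
  values_drop_na = TRUE
)
```

Without values_drop_na = TRUE, the round trip would return four rows, one of them holding the NA that pivot_wider() introduced.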

Discussion Points (Optional): When would you prefer a left_join() over an inner_join()? Why might you need to reshape data before visualizing it? What issues might arise when joining datasets (e.g., missing keys)?

Supplemental Information:

Cheatsheet: left_join(df1, df2, by = "key"); pivot_longer(df, cols = c(col1, col2))
Video: Joining and Reshaping in Tidyverse by DataCamp
Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)

Day 5: Practical Tidyverse Applications

Introduction: Applying tidyverse tools to real-world datasets develops critical data wrangling skills, essential for data exploration, analysis, and modeling preparation.

Objective: By the end of today’s session, learners will be able to load real-world datasets into R, apply tidyverse functions to filter, join, reshape, and summarize data, and practice end-to-end workflows for data cleaning and transformation.

Scope of the Lesson: Loading datasets (e.g., CSV files from Kaggle or open datasets); Filtering and selecting data using filter() and select(); Joining multiple datasets with left_join() or inner_join(); Reshaping data with pivot_longer() and pivot_wider(); Summarizing grouped data with group_by() and summarise().

Background Information: Typical Workflow: Filter (e.g., price > 100), select columns, join datasets, reshape, summarize (e.g., mean sales by category). Tidyverse Functions: Use %>% to chain transformations. Real-World Relevance: Foundational for business analytics, research, and machine learning.

Simple Code Examples:

# Load tidyverse
library(tidyverse)

# Load the datasets (read_csv() is the readr counterpart of read.csv() and returns a tibble)
products <- read_csv("products.csv")
sales <- read_csv("sales.csv")

# Join datasets
full_data <- sales %>%
  left_join(products, by = "product_id")

# Filter data
filtered_data <- full_data %>%
  filter(price > 100)

# Reshape data
sales_long <- filtered_data %>%
  pivot_longer(cols = starts_with("month_"), names_to = "month", values_to = "sales")

# Summarize data
summary_stats <- sales_long %>%
  group_by(product_name) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE))

Explanation of the Example Code: Loads and joins product and sales data; filters high-priced items; reshapes sales data to long format; summarizes average sales by product.
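Additional Sketch: The workflow above depends on products.csv and sales.csv, whose contents are not shown in the lesson. The version below replaces them with small inline tibbles so the whole pipeline can be run as a single chain. Every value, and the assumption that the files contain the columns product_id, product_name, price, month_1, and month_2, is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up stand-ins for products.csv and sales.csv (assumed columns)
products <- tibble(
  product_id = c(1, 2, 3),
  product_name = c("Desk", "Lamp", "Chair"),
  price = c(250, 40, 120)
)
sales <- tibble(
  product_id = c(1, 2, 3),
  month_1 = c(10, 50, 20),
  month_2 = c(14, 45, NA)
)

summary_stats <- sales %>%
  left_join(products, by = "product_id") %>%   # attach name and price
  filter(price > 100) %>%                      # keep higher-priced items
  pivot_longer(
    cols = starts_with("month_"),              # month_1, month_2 -> rows
    names_to = "month",
    values_to = "sales"
  ) %>%
  group_by(product_name) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE))  # NA months are ignored

summary_stats
```

Writing the steps as one chain avoids the intermediate objects full_data, filtered_data, and sales_long; keeping them, as the lesson code does, makes it easier to inspect each stage while learning.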

Discussion Points (Optional): How does chaining operations with %>% improve code readability? When should you reshape data before vs. after joining datasets? Why is summarizing grouped data critical before visualization?

Supplemental Information:

Cheatsheet: df %>% group_by(col) %>% summarise(mean = mean(val))
Video: Tidyverse Projects by R Programming 101
Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)
