Course 2: Data Frames & Tidyverse Essentials
Data frames and the tidyverse are cornerstones of data manipulation and analysis in R. This course dives into handling tabular data with data frames, leveraging the tidyverse ecosystem for intuitive and efficient data wrangling, and applying these skills to real-world datasets. Designed for learners familiar with R basics, it covers data frame creation, tidyverse principles, tibbles, joins, reshaping, and practical workflows over one week.
Objective: By the end of the course, students will be able to create and manipulate data frames, use tidyverse tools like dplyr and tidyr for data transformation, and execute end-to-end data cleaning and analysis workflows.
Scope: The course covers data frame operations, tidyverse packages (dplyr, tidyr, tibble), piping, filtering, joining, reshaping, and summarizing data, preparing learners for advanced data science tasks.
Day 1: Introduction to Data Frames
Introduction: Data frames are the fundamental structure for handling tabular data in R, allowing efficient storage, manipulation, and analysis of datasets. Mastery of data frames is essential for data science, statistical analysis, and visualization tasks.
Objective: By the end of today’s session, learners will be able to create data frames from scratch, access and manipulate data within data frames, inspect data frames using basic summary functions, and read external CSV files into R as data frames.
Scope of the Lesson: Creating data frames using data.frame(); Accessing columns and rows using $, [[ ]], and [ , ]; Summarizing data frames with summary() and str(); Importing data from CSV files with read.csv().
Background Information: A data frame in R is a list in which each element (column) is a vector of the same length. Different columns can hold different types (numeric, character, logical, etc.), but each column holds a single type.
- Accessing Data: df$x (the $ operator) and df[["y"]] (double brackets) each return one column as a vector; df[1, ] returns the first row; df[1, 2] returns the element in the first row, second column.
- Summarizing Data: summary(df) gives descriptive statistics for each column; str(df) shows the structure (column types and a preview of the values).
- Reading External Data: Load a CSV file into R with survey_data <- read.csv("survey.csv").
Simple Code Examples:
# Create a simple data frame
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(28, 34, 25),
  department = c("HR", "Finance", "IT")
)

# Inspect the data frame
print(employees)
summary(employees)
str(employees)

# Access specific parts
employees$name       # Access the 'name' column
employees[2, ]       # Access the second row
employees[1, "age"]  # Access the 'age' of the first employee

# Load a CSV file
sales_data <- read.csv("sales.csv")
head(sales_data)
Explanation of the Example Code: Creates a data frame with employee data; inspects it using print(), summary(), and str(); demonstrates accessing columns, rows, and specific elements; shows how to load a CSV file and view its first few rows.
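The read.csv() call above expects a sales.csv file in the working directory, which may not exist on your machine. The following is a minimal sketch of the same import step that runs anywhere: it writes a small, made-up data frame to a temporary file and reads it back. The sales_demo data and its column names are illustrative assumptions, not part of the course files.

```r
# Hypothetical data standing in for a real sales.csv
sales_demo <- data.frame(
  product = c("Lamp", "Desk", "Chair"),
  units   = c(10, 4, 6)
)

# Write to a temporary file so nothing in the working directory is touched
csv_path <- tempfile(fileext = ".csv")
write.csv(sales_demo, csv_path, row.names = FALSE)

# Read the file back in as a data frame and inspect it
sales_back <- read.csv(csv_path)
head(sales_back)
str(sales_back)
```

Passing row.names = FALSE to write.csv() keeps the row numbers from being written out as an extra unnamed column.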
Discussion Points (Optional): What advantages do data frames offer compared to matrices in R? How does accessing a column with $ differ from accessing it with [[]]? Why is it useful to check the structure of a data frame before analysis?
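As a starting point for the second question, here is a short sketch of how $, [[ ]], and single brackets behave on the same column. It reuses a reduced version of the employees data frame; the variable names (col, ages_dollar, and so on) are illustrative.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age  = c(28, 34, 25)
)

# $ takes a literal column name
ages_dollar <- employees$age

# [[ ]] takes a string, so the column name can be stored in a variable
col <- "age"
ages_brackets <- employees[[col]]

# Single brackets with a column name keep the data frame structure
age_df <- employees["age"]

identical(ages_dollar, ages_brackets)  # compares the two extracted vectors
is.data.frame(age_df)                  # checks that [ ] kept a data frame
```

In practice, [[ ]] is the usual choice inside functions and loops, where the column name arrives as a string.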
Supplemental Information:
- Cheatsheet: df <- data.frame(x = 1:3); df$x
- Video: R Data Frames by R Programming 101
- Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)
Day 2: Introduction to Tidyverse
Introduction: The tidyverse is a powerful ecosystem of R packages designed for data science tasks, offering a consistent and intuitive approach to data manipulation, transformation, and visualization. Mastering tidyverse workflows is essential for efficient and readable R programming.
Objective: By the end of today’s session, learners will be able to install and load the tidyverse collection, understand and apply tidyverse principles to organize and process data, use piping (%>%) to streamline code, and create tibbles and perform simple data manipulation.
Scope of the Lesson: Installing and loading the tidyverse (install.packages("tidyverse") and library(tidyverse)); Understanding piping with %>%; Creating and using tibbles with tibble(); Selecting columns using select().
Background Information: Created by Hadley Wickham, the tidyverse is a coherent system of packages designed to work together under the "tidy data" philosophy: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
- Key Packages: dplyr (data manipulation), tidyr (data reshaping), ggplot2 (visualization), readr, purrr, tibble, stringr, forcats.
- The Pipe Operator (%>%): Comes from the magrittr package and passes the result on its left as the first argument of the function on its right, so a sequence of steps reads top to bottom.
- Tibbles: Modern data frames with more compact printing and stricter behavior (they do not change input types or partially match column names), created with tibble().
Simple Code Examples:
# Install tidyverse (only needed once per R installation), then load it
install.packages("tidyverse")
library(tidyverse)

# Create a tibble
students <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

# View the tibble
print(students)

# Use piping to select a column
students %>%
  select(name)

# Chain multiple functions
students %>%
  select(name, score) %>%
  arrange(desc(score))
Explanation of the Example Code: Installs and loads tidyverse; creates a tibble with student data; demonstrates selecting columns and chaining operations with the pipe operator to sort by score.
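To make the effect of the pipe concrete, the sketch below writes the same select-then-sort step twice: once as nested function calls and once with %>%. It loads only dplyr and tibble rather than the full tidyverse, which is an assumption made to keep the example light; the object names nested and piped are illustrative.

```r
library(dplyr)   # provides select(), arrange(), desc(), and the %>% pipe
library(tibble)  # provides tibble()

students <- tibble(
  name  = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

# Nested form: has to be read from the inside out
nested <- arrange(select(students, name, score), desc(score))

# Piped form: reads top to bottom, one step per line
piped <- students %>%
  select(name, score) %>%
  arrange(desc(score))

# Both forms should produce the same rows in the same order
identical(nested$name, piped$name)
```

The two versions do the same work; the pipe only changes how the steps are written down.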
Discussion Points (Optional): How does piping %>% improve code readability? What advantages do tibbles offer compared to traditional data frames? Why is "tidy" data important for analysis?
Supplemental Information:
- Cheatsheet: library(tidyverse); df %>% select(col1)
- Video: Tidyverse Tutorial by DataCamp
- Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)
Day 3: Tibbles and Basic Operations
Introduction: Tibbles, a modern reimagining of data frames from the tidyverse, enhance data handling by offering cleaner printing, safer subsetting, and a better user experience, especially when working with larger datasets.
Objective: By the end of today’s session, learners will be able to create and manipulate tibbles, perform basic operations like subsetting, filtering, adding columns, and arranging rows, and use piping (%>%) to build readable workflows.
Scope of the Lesson: Creating tibbles with tibble() and as_tibble(); Accessing columns and rows (subsetting); Adding new columns using mutate(); Filtering rows with filter(); Sorting rows with arrange().
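Background Information: Tibbles are the tidyverse version of the data frame. They print only the first rows and the columns that fit on screen, and they never convert strings to factors (base data.frame() also stopped doing so by default in R 4.0.0). Create them with tibble(), or convert an existing data frame with as_tibble().
- Subsetting: $ and [[ ]] extract a single column as a vector; select() returns a tibble containing the chosen columns.
- Key Functions: mutate() adds or transforms columns; filter() keeps rows that match a condition; arrange() sorts rows.
- The Pipe (%>%): Chains operations so they read in order, avoiding nested calls.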
Simple Code Examples:
# Load tidyverse
library(tidyverse)

# Create a tibble
students <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  score = c(88, 92, 79)
)

# Add a new column (e.g., grade category)
students <- students %>%
  mutate(grade = ifelse(score >= 90, "A", "B"))

# Filter students with scores above 80
high_scores <- students %>%
  filter(score > 80)

# Arrange students by descending score
sorted_students <- students %>%
  arrange(desc(score))
Explanation of the Example Code: Creates a tibble; adds a grade column with mutate(); filters high scores; sorts by score in descending order using the pipe operator.
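The "safer subsetting" of tibbles is easiest to see side by side with a base data frame. The sketch below is illustrative only: the df and tb objects and their values are assumptions, and suppressWarnings() is used just to keep the output quiet where a tibble warns about an unknown column.

```r
library(tibble)

df <- data.frame(name = c("Alice", "Bob"), score = c(88, 92))
tb <- as_tibble(df)

# Single-column [ , ] subsetting: a data frame drops to a vector,
# while a tibble stays a tibble
df_col <- df[, "score"]
tb_col <- tb[, "score"]

is.numeric(df_col)
is_tibble(tb_col)

# $ with an abbreviated name: a data frame partially matches "sc" to "score",
# while a tibble returns NULL (with a warning) instead of guessing
df_partial <- df$sc
tb_partial <- suppressWarnings(tb$sc)
```

Because a tibble always returns the same kind of object from [ , ], code that subsets columns behaves the same whether one or several columns are selected.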
Discussion Points (Optional): How do tibbles make working with data easier than traditional data frames? Why is mutate() often preferred over creating columns manually? How does %>% help structure logical data transformation steps?
Supplemental Information:
- Cheatsheet: tibble(x = 1:3); df %>% filter(col > 5)
- Video: Tibbles in Tidyverse by R Programming 101
- Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)
Day 4: Joining and Reshaping Data
Introduction: Joining and reshaping datasets are essential skills in data analysis, enabling you to integrate multiple sources and reorganize data for effective exploration and modeling.
Objective: By the end of today’s session, learners will be able to combine multiple datasets using different types of joins, reshape data between long and wide formats using tidyverse functions, and handle real-world data alignment issues during integration and transformation.
Scope of the Lesson: Performing joins with left_join(), inner_join(), and other dplyr functions; Reshaping data with pivot_longer() and pivot_wider(); Managing key matching and missing data when merging datasets.
Background Information:
- Joins (dplyr): Combine data frames based on a shared key. left_join(df1, df2, by = "key") keeps all rows from df1 and fills columns from df2 with NA where no match exists; inner_join() keeps only rows whose key appears in both data frames.
- Reshaping (tidyr): pivot_longer() converts wide data to long format; pivot_wider() spreads long data back to wide format.
- Importance: Joins link sources (e.g., customer + purchase data); reshaping puts data in the layout that plotting and modeling functions expect.
Simple Code Examples:
# Load tidyverse
library(tidyverse)

# Example datasets
students <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
scores <- tibble(id = c(1, 2, 4), score = c(90, 85, 88))

# Left join (keep all students, match scores if available)
student_scores <- students %>%
  left_join(scores, by = "id")

# Reshape a dataset
grades <- tibble(
  id = 1:2,
  math = c(90, 85),
  science = c(88, 92)
)

# Pivot from wide to long format
grades_long <- grades %>%
  pivot_longer(cols = c(math, science), names_to = "subject", values_to = "score")

# Pivot from long to wide format
grades_wide <- grades_long %>%
  pivot_wider(names_from = subject, values_from = score)
Explanation of the Example Code: Joins student and score data with left_join(); reshapes grades data from wide to long format and back to wide using pivot_longer() and pivot_wider().
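The students and scores tables above share only two ids (1 and 2), which makes them a convenient case for comparing join types. The sketch below, under the same data, contrasts left_join() with inner_join() and uses anti_join() to list the keys that failed to match. The result names are illustrative.

```r
library(dplyr)
library(tibble)

students <- tibble(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
scores   <- tibble(id = c(1, 2, 4), score = c(90, 85, 88))

# left_join keeps every student; Charlie has no score, so it becomes NA
left_result <- left_join(students, scores, by = "id")

# inner_join keeps only ids present in both tables
inner_result <- inner_join(students, scores, by = "id")

# anti_join lists rows in the first table with no match in the second
students_without_scores <- anti_join(students, scores, by = "id")
scores_without_students <- anti_join(scores, students, by = "id")
```

Running anti_join() in both directions before a join is a quick way to spot missing or mistyped keys.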
Discussion Points (Optional): When would you prefer a left_join() over an inner_join()? Why might you need to reshape data before visualizing it? What issues might arise when joining datasets (e.g., missing keys)?
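One way to approach the reshaping question: once each subject is a value in a single column rather than a column of its own, grouping by subject takes one call. The sketch below reuses the grades data from this lesson; subject_means is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(tibble)

grades <- tibble(
  id      = 1:2,
  math    = c(90, 85),
  science = c(88, 92)
)

# Wide to long: one row per (id, subject) pair
grades_long <- grades %>%
  pivot_longer(cols = c(math, science),
               names_to = "subject", values_to = "score")

# With subject as a column, a per-subject summary is a single group_by()
subject_means <- grades_long %>%
  group_by(subject) %>%
  summarise(mean_score = mean(score))
```

The same long layout is what plotting functions typically need when a variable such as subject should drive color or facets.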
Supplemental Information:
- Cheatsheet: left_join(df1, df2, by = "key"); pivot_longer(df, cols = c(col1, col2))
- Video: Joining and Reshaping in Tidyverse by DataCamp
- Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)
Day 5: Practical Tidyverse Applications
Introduction: Applying tidyverse tools to real-world datasets develops critical data wrangling skills, essential for data exploration, analysis, and modeling preparation.
Objective: By the end of today’s session, learners will be able to load real-world datasets into R, apply tidyverse functions to filter, join, reshape, and summarize data, and practice end-to-end workflows for data cleaning and transformation.
Scope of the Lesson: Loading datasets (e.g., CSV files from Kaggle or open datasets); Filtering and selecting data using filter() and select(); Joining multiple datasets with left_join() or inner_join(); Reshaping data with pivot_longer() and pivot_wider(); Summarizing grouped data with group_by() and summarise().
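Background Information:
- Typical Workflow: Join datasets, filter rows (e.g., price > 100), select columns, reshape, and summarize (e.g., mean sales by category). The order can vary: a filter has to come after the join when the column it tests lives in the other table.
- Tidyverse Functions: Use %>% to chain the transformations into one readable sequence.
- Real-World Relevance: These steps are foundational for business analytics, research, and machine learning.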
Simple Code Examples:
# Load tidyverse
library(tidyverse)

# Load a dataset
products <- read_csv("products.csv")
sales <- read_csv("sales.csv")

# Join datasets
full_data <- sales %>%
  left_join(products, by = "product_id")

# Filter data
filtered_data <- full_data %>%
  filter(price > 100)

# Reshape data
sales_long <- filtered_data %>%
  pivot_longer(cols = starts_with("month_"), names_to = "month", values_to = "sales")

# Summarize data
summary_stats <- sales_long %>%
  group_by(product_name) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE))
Explanation of the Example Code: Loads and joins product and sales data; filters high-priced items; reshapes sales data to long format; summarizes average sales by product.
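Because products.csv and sales.csv are not bundled with this page, the sketch below reproduces the same workflow on small inline tibbles so it can be run as-is. All values are made up for illustration; only the column names (product_id, product_name, price, month_*) follow the example above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-ins for products.csv and sales.csv
products <- tibble(
  product_id   = c(1, 2, 3),
  product_name = c("Lamp", "Desk", "Chair"),
  price        = c(40, 250, 120)
)
sales <- tibble(
  product_id = c(1, 2, 3),
  month_1    = c(10, 4, 6),
  month_2    = c(12, 6, NA)
)

summary_stats <- sales %>%
  left_join(products, by = "product_id") %>%    # attach name and price
  filter(price > 100) %>%                       # keep higher-priced items
  pivot_longer(cols = starts_with("month_"),    # one row per product-month
               names_to = "month", values_to = "sales") %>%
  group_by(product_name) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE))  # ignore missing months

summary_stats
```

The na.rm = TRUE argument matters here: without it, a single missing month would turn that product's average into NA.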
Discussion Points (Optional): How do chaining operations with %>% improve code readability? When should you reshape data before vs. after joining datasets? Why is summarizing grouped data critical before visualization?
Supplemental Information:
- Cheatsheet: df %>% group_by(col) %>% summarise(mean = mean(val))
- Video: Tidyverse Projects by R Programming 101
- Book: R for Data Science by Hadley Wickham & Garrett Grolemund (2016)