Course 3: Data Manipulation with Pandas
This course dives into data manipulation using Pandas, a powerful Python library for handling and analyzing tabular data. Designed for learners familiar with Python basics, this course covers DataFrame operations, handling missing data, data transformation, and practical projects. Over one week, daily lessons build skills to clean, preprocess, and analyze datasets, preparing you for advanced data science tasks.
Objective: By the end of the course, students will be able to create and manipulate Pandas DataFrames, handle missing data, perform transformations like scaling and encoding, and apply these skills in real-world data analysis projects.
Scope: The course covers Pandas fundamentals, DataFrame operations, missing data handling, data transformation techniques, and practical applications, equipping learners with tools for efficient data preprocessing and analysis.
Day 1: Introduction to Pandas
Introduction: Pandas is a powerful and widely used Python library designed for data manipulation and analysis. It provides flexible data structures, such as DataFrames, that make it easier to handle and process tabular data. With Pandas, you can perform a variety of tasks, from cleaning and transforming data to conducting in-depth analysis, all with intuitive syntax and methods. The library is built on top of NumPy and integrates seamlessly with other Python libraries, making it an essential tool for data science and analytics.
Learning Objective: The objective of this lesson is to introduce the core features of Pandas, particularly its DataFrame structure, which is central to data manipulation. By the end of the lesson, you will be able to create and manipulate DataFrames, load data from various sources (such as CSV or Excel), and perform basic operations such as viewing and selecting data. Understanding these concepts is foundational for further exploration of data analysis with Pandas.
Scope of the Lesson: This lesson will cover the basics of Pandas, including installation and setup. You will learn how to create DataFrames from scratch, load external data using functions like pd.read_csv and pd.read_excel, and perform basic operations such as viewing the first few rows of a DataFrame, selecting specific columns, and filtering data. These skills are essential for anyone starting with data analysis, as they provide the foundation for more complex manipulations and analyses.
Background Information: Pandas is built on top of the NumPy library and provides two primary data structures: the DataFrame and the Series. The DataFrame is a 2D table, similar to a spreadsheet or SQL table, with labeled axes (rows and columns), making it easier to work with structured data. Each column in a DataFrame is a Series, which is a 1D array. Pandas is designed to make data handling intuitive, allowing you to perform operations such as reading and writing data from files (e.g., CSV, Excel), selecting specific data, and even handling missing values. Pandas' most common methods include read_csv for loading CSV files into a DataFrame, head for viewing the first few rows of the data, and column selection (e.g., df['column_name']) for accessing individual columns. These operations are crucial when working with real-world datasets, as they help in the initial exploration and cleaning stages of data analysis.
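To make the DataFrame/Series relationship concrete before the practice examples, here is a minimal sketch (the names and values are invented for illustration):

import pandas as pd

# A Series is a labeled 1D array; each DataFrame column is one
ages = pd.Series([24, 27, 22], name='Age')
print(ages)              # values are printed alongside their index labels 0, 1, 2

# Building a DataFrame from a list and a Series aligns them on the index
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': ages})
print(type(df['Age']))   # <class 'pandas.core.series.Series'>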
Examples to Practice:
# Example 1: Importing Pandas and creating a DataFrame
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)

# Example 2: Reading data from a CSV file
# Assuming 'data.csv' is a CSV file in the current working directory
df_csv = pd.read_csv('data.csv')
print(df_csv.head())  # Display the first 5 rows of the DataFrame

# Example 3: Selecting a column from the DataFrame
age_column = df['Age']
print(age_column)

# Example 4: Selecting rows based on a condition
adults = df[df['Age'] >= 25]
print(adults)
Explanation of the Example Codes & Outputs: In Example 1, we import Pandas and create a DataFrame from a dictionary. The pd.DataFrame() function takes a dictionary where the keys represent column names, and the values are lists representing the column values. The DataFrame df is then printed, displaying the tabular data. In Example 2, we use pd.read_csv() to load data from a CSV file named data.csv into a DataFrame called df_csv. The .head() method is used to display the first five rows of the DataFrame, allowing for a quick preview of the data. Example 3 demonstrates how to select a specific column from the DataFrame using df['Age'], which retrieves the column named "Age." The result is a Series that contains the age values. In Example 4, we filter the DataFrame to select rows where the "Age" column is greater than or equal to 25, creating a new DataFrame called adults. The filtered data is then printed.
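For reference, the DataFrame from Example 1 prints along these lines (exact spacing may vary slightly with your Pandas display settings):

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago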
Supplemental Information: Pandas is a versatile tool for handling structured data, and mastering the basics of DataFrame creation and manipulation is essential for any aspiring data analyst. Once you become comfortable with these operations, you can move on to more advanced techniques like merging datasets, handling missing values, and applying statistical functions. With its easy-to-use syntax and robust functionality, Pandas is indispensable for anyone working with data in Python.
Resources:
- Cheatsheet: Pandas Cheatsheet PDF
- Video: Pandas Tutorial by Corey Schafer
- Book: Python for Data Analysis by Wes McKinney
Day 2: DataFrame Operations
Introduction: DataFrame operations like filtering, sorting, grouping, and joining are essential for efficient data manipulation and preprocessing in Pandas. These operations allow you to explore and transform your data in ways that prepare it for further analysis or visualization. Mastering these operations helps to extract meaningful insights, clean the data, and organize it in formats that are ready for modeling or reporting.
Learning Objective: The goal of this lesson is to teach you how to perform common DataFrame operations in Pandas. By the end of this lesson, you will be able to filter data based on specific conditions, sort values in DataFrames, group data for aggregation, and combine multiple DataFrames through joining operations. These techniques are fundamental for handling real-world data and form the foundation of data analysis workflows.
Scope of the Lesson: This lesson will cover the key operations needed to manipulate DataFrames effectively. You will learn how to filter rows based on conditions (e.g., df[df['col'] > value]), sort data by columns (e.g., df.sort_values), group data for aggregation (e.g., df.groupby), and merge DataFrames (e.g., pd.merge). These operations are commonly used in data preprocessing and analysis tasks, making them indispensable for anyone working with data in Pandas.
Background Information: Pandas offers a range of powerful methods for working with DataFrames:
- Filtering extracts rows that meet certain criteria, which is essential for cleaning or subsetting data. A condition inside square brackets (e.g., df[df['age'] > 30]) returns the subset of the DataFrame that satisfies it.
- Sorting rearranges rows based on the values of one or more columns using df.sort_values, and you can specify ascending or descending order.
- Grouping collects rows by one or more columns and applies aggregate functions (e.g., sum, mean, count) to summarize each group, as in df.groupby('category').sum(). This is useful for summarizing datasets.
- Joining combines two DataFrames based on a shared column (key) or index using pd.merge, which is common when related information from different sources is stored in separate datasets.
Examples to Practice:
# Example 1: Filtering rows based on a condition
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [24, 27, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']}
df = pd.DataFrame(data)

# Filtering rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)

# Example 2: Sorting DataFrame by Age
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

# Example 3: Grouping DataFrame by City and calculating mean Age
grouped_df = df.groupby('City')['Age'].mean()
print(grouped_df)

# Example 4: Joining DataFrames
data2 = {'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco'],
         'Population': [8175133, 3792621, 2714856, 870887]}
df2 = pd.DataFrame(data2)

# Merging df and df2 on the 'City' column
merged_df = pd.merge(df, df2, on='City')
print(merged_df)
Explanation of the Example Codes & Outputs: In Example 1, we filter the DataFrame to select rows where the "Age" column is greater than 25. The resulting filtered_df contains only those rows that meet this condition. In Example 2, we use df.sort_values(by='Age', ascending=False) to sort the DataFrame by the "Age" column in descending order. The result is a DataFrame with rows ordered by age, with the oldest individuals first. Example 3 demonstrates how to group the data by the "City" column and calculate the mean age for each city using df.groupby('City')['Age'].mean(). The result is a series with the mean age for each city. In Example 4, we merge two DataFrames (df and df2) based on the shared "City" column using pd.merge. This operation combines the data from both DataFrames into a single DataFrame, allowing us to associate city names with their populations.
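Beyond the single-column calls shown above, sort_values accepts a list of columns and groupby can apply several aggregates at once. A brief sketch, using invented data:

import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'A'],
                   'value': [10, 20, 15, 25, 5]})

# Sort by 'category' ascending, then by 'value' descending within each category
print(df.sort_values(by=['category', 'value'], ascending=[True, False]))

# Apply several aggregate functions to each group in one call
print(df.groupby('category')['value'].agg(['sum', 'mean', 'count']))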
Supplemental Information: Mastering DataFrame operations is a key skill for data manipulation. These operations allow you to filter, sort, aggregate, and merge datasets efficiently, which is essential for data analysis. Once you are comfortable with these basic operations, you can explore more advanced techniques, such as pivot tables, handling missing data, and working with time series data. Pandas also integrates seamlessly with other Python libraries, making it an indispensable tool for data analysis.
Resources:
- Cheatsheet: Pandas Cheatsheet PDF
- Video: Pandas DataFrame Operations by DataCamp
- Book: Python for Data Analysis by Wes McKinney
Day 3: Handling Missing Data
Introduction: Missing data is an inevitable challenge when working with real-world datasets. Incomplete records can arise due to various reasons, such as errors in data collection, data corruption, or issues during data extraction. If not handled properly, missing values can introduce bias or affect the accuracy of your analysis. Pandas offers efficient tools for detecting, handling, and imputing missing data, ensuring that your analyses are robust and accurate.
Learning Objective: The objective of this lesson is to teach you how to detect and handle missing data effectively using Pandas. By the end of this lesson, you will be able to identify missing values, remove or replace them, and apply imputation strategies to ensure data integrity.
Scope of the Lesson: This lesson covers various techniques for dealing with missing values in Pandas. You'll learn how to identify missing data using df.isnull(), remove rows or columns with missing values using df.dropna(), and replace missing values with specific values or statistics using df.fillna(). Additionally, you’ll explore imputation strategies like replacing missing values with the mean or median of a column to preserve the overall structure of the data.
Background Information: Missing values, represented as NaN (Not a Number) or None in Pandas, can occur in various forms, such as missing entries in one or more columns or entire rows being empty. Handling these missing values is critical for maintaining the quality and completeness of your analysis.
- Detecting Missing Values: You can check for missing values using df.isnull(), which returns a DataFrame of the same shape with True indicating the missing values and False for valid ones. df.isnull().sum() gives the count of missing values in each column.
- Dropping Missing Values: If missing data is insignificant or its removal won't distort the analysis, you can drop rows or columns with missing values using df.dropna(). By default, this removes any row that contains at least one NaN value, but you can customize it to drop columns instead, or to drop only rows or columns where every value is missing.
- Filling Missing Values: When removing data isn't an option (such as with small datasets or important missing values), you can fill in missing data with a specific value (e.g., df.fillna(value)) or by using statistical techniques like filling with the mean, median, or mode of the column. This preserves the dataset's size and structure, ensuring that no valuable information is lost.
- Imputation Strategies: For more advanced handling of missing data, imputation techniques replace missing values with predicted or computed values, such as the mean or median of the column. Imputation can help avoid bias and maintain data consistency. It's particularly useful when you can't afford to drop data, especially with larger datasets.
Examples to Practice:
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [24, np.nan, 22, 30],
        'City': ['New York', np.nan, 'Chicago', 'San Francisco']}
df = pd.DataFrame(data)

# Example 1: Detecting missing values
missing_values = df.isnull().sum()
print(missing_values)

# Example 2: Dropping rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)

# Example 3: Filling missing values with a specific value
filled_df = df.fillna('Unknown')
print(filled_df)

# Example 4: Imputing missing values with the mean of the 'Age' column
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
Explanation of the Example Codes & Outputs: In Example 1, we use df.isnull().sum() to check for missing values in the DataFrame. It returns the count of missing values per column. In Example 2, we use df.dropna() to remove rows containing any missing values. The resulting cleaned_df will contain only rows without missing data. Example 3 shows how to replace missing values with a specific value, such as the string 'Unknown', using df.fillna('Unknown'). This ensures that no NaN values remain in the DataFrame. In Example 4, we demonstrate imputing missing values in the 'Age' column by replacing them with the mean of the existing ages in that column. This technique ensures that the data remains consistent while filling the missing values.
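The background above notes that df.dropna() can be customized beyond its default behavior; here is a minimal sketch of those variants, plus median imputation, on toy data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, np.nan],
                   'C': ['x', 'y', 'z']})

# Drop columns (rather than rows) containing any missing value
print(df.dropna(axis=1))             # keeps only 'C'

# Drop only columns where every value is missing
print(df.dropna(axis=1, how='all'))  # drops 'B', keeps 'A' and 'C'

# Keep rows that have at least two non-missing values
print(df.dropna(thresh=2))           # drops the middle row

# Impute with the median instead of the mean
df['A'] = df['A'].fillna(df['A'].median())  # median of [1.0, 3.0] is 2.0
print(df)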
Supplemental Information: Handling missing data is one of the first steps in cleaning a dataset. Depending on the nature of your dataset and the analysis objectives, you might opt to remove missing values, fill them with a default value, or apply more sophisticated imputation techniques. It is important to consider the impact of these choices on your analysis. For example, imputing with the mean or median can introduce bias if the missing data is not randomly distributed.
Resources:
Day 4: Data Transformation
Introduction: Data transformation is an essential step in the data preprocessing pipeline, where raw data is converted into a format that can be used effectively for analysis and modeling. In many cases, data collected from different sources is not in the right shape or scale for further analysis, and thus, transformations like scaling, encoding, and feature creation are applied to make the data suitable for statistical models or machine learning. These transformations help improve the accuracy and performance of predictive models.
Learning Objective: This lesson is focused on teaching various data transformation techniques using Pandas. By the end of this lesson, you will be able to apply techniques like scaling, one-hot encoding, creating derived columns, and using the apply method to apply custom functions to DataFrames, ultimately enhancing the data’s suitability for modeling.
Scope of the Lesson: The scope of this lesson covers several key data transformation techniques that are frequently used in data analysis:
- Normalization: Scaling numerical values to a specific range, often [0, 1], which is important for algorithms sensitive to feature scale.
- One-Hot Encoding: Converting categorical variables into numerical format by creating binary columns for each category, which is often required for machine learning algorithms.
- Feature Creation: Adding new columns derived from existing ones (e.g., calculating ratios, extracting parts of a date, or applying functions to columns).
- Applying Functions: Using the df.apply() method to apply custom functions to columns or rows, enabling more flexible data transformations.
Background Information:
- Normalization: Scaling data to a range between 0 and 1 is common when working with algorithms like k-NN or neural networks, which rely on the relative scale of features. One common method of normalization is Min-Max scaling, computed as (df - df.min()) / (df.max() - df.min()).
- One-Hot Encoding: This technique converts categorical variables into a form that can be provided to machine learning algorithms. For example, if a dataset has a column for "City" with values like "New York," "Chicago," and "San Francisco," one-hot encoding will create new columns for each of these cities, assigning a 1 or 0 to indicate presence or absence.
- Feature Creation: Often in data analysis, new features are derived from existing ones. This might involve simple arithmetic (e.g., df['new_column'] = df['col1'] * df['col2']) or more complex transformations, such as extracting parts of a datetime column.
- The apply Method: This allows you to apply custom functions across rows or columns of a DataFrame. It's a powerful tool for data transformation when built-in Pandas methods do not suffice. You can use apply to perform any function across the entire DataFrame or along a specific axis (rows or columns).
Examples to Practice:
import pandas as pd
import numpy as np

# Sample DataFrame for Data Transformation
data = {'Age': [23, 45, 12, 36, 25],
        'Income': [50000, 120000, 25000, 80000, 45000],
        'City': ['New York', 'Chicago', 'San Francisco', 'New York', 'Chicago']}
df = pd.DataFrame(data)

# Example 1: Normalization (Min-Max scaling)
df['Age_Normalized'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
print(df[['Age', 'Age_Normalized']])

# Example 2: One-Hot Encoding
df_one_hot = pd.get_dummies(df['City'])
df = pd.concat([df, df_one_hot], axis=1)
print(df)

# Example 3: Feature Creation (Derived Column)
df['Income_Age_Ratio'] = df['Income'] / df['Age']
print(df[['Income', 'Age', 'Income_Age_Ratio']])

# Example 4: Applying a Custom Function (Using apply)
df['Age_Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult')
print(df[['Age', 'Age_Category']])
Explanation of the Example Codes & Outputs: In Example 1, we normalize the 'Age' column using Min-Max scaling, ensuring the values lie between 0 and 1. This makes the 'Age' values comparable when combined with other numerical columns in modeling. Example 2 demonstrates how to apply one-hot encoding to the 'City' column, turning it into several binary columns, one for each unique city. Each row receives a 1 in the column corresponding to the city in that row and 0 in others. In Example 3, we create a new feature, 'Income_Age_Ratio', which is simply the ratio of 'Income' to 'Age'. This could provide insight into income relative to age for each individual. Example 4 uses the apply method to create a new categorical column, 'Age_Category', which assigns a label ('Young' or 'Adult') based on the 'Age' column value. The apply function allows for flexible and custom transformations of data.
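Example 4 calls apply on a single column; as the background notes, the same method also works row-wise with axis=1. A small sketch (the 'Summary' column is a made-up feature for illustration):

import pandas as pd

df = pd.DataFrame({'Income': [50000, 120000], 'Age': [23, 45]})

# With axis=1, apply passes each row to the function as a Series,
# so a custom transformation can combine several columns at once
df['Summary'] = df.apply(lambda row: f"{row['Age']} years, ${row['Income']:,}", axis=1)
print(df)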
Supplemental Information: Data transformation is crucial for ensuring that datasets are in a form suitable for machine learning algorithms or statistical models. This process may include scaling, encoding, creating new features, or even handling missing values. Effective transformation ensures that the models can efficiently learn patterns in the data, reducing the risk of overfitting or underfitting due to improperly scaled or processed data.
Resources:
Day 5: Practical Data Manipulation Projects
Introduction: The ability to apply Pandas for real-world data manipulation projects is crucial for building proficiency in data analysis. By working on practical tasks, learners can improve their data cleaning, transformation, and analysis skills. These projects provide valuable experience with common data issues, such as missing values, duplicates, and categorical variables. Engaging with real datasets also helps build a deeper understanding of how to manipulate data for various analytical goals, like summarization and model preparation.
Learning Objective: By the end of this lesson, you will have worked through mini-projects that simulate real-world data manipulation tasks. You will be able to clean and preprocess datasets, handle missing values and duplicates, encode categorical variables, scale numerical features, and compute key statistics to summarize data.
Scope of the Lesson: This lesson focuses on hands-on practice with datasets to apply Pandas techniques in real-world scenarios. The key tasks include:
- Loading Datasets: Importing data from various formats (CSV, Excel, etc.).
- Cleaning Data: Identifying and handling missing values, removing duplicates, and correcting inconsistencies.
- Transforming Data: Encoding categorical variables, scaling numerical data, and creating new features.
- Summarizing Data: Using summary statistics (e.g., df.describe()), grouping data, and computing aggregates.
Background Information: Practical data manipulation projects typically involve working with datasets from sources like Kaggle, government data portals, or corporate datasets. The process usually starts with loading data using Pandas functions such as pd.read_csv() or pd.read_excel(). Once the data is loaded, common cleaning tasks include:
- Handling Missing Values: Identifying missing values with df.isnull() and either removing them (df.dropna()) or filling them with appropriate values (df.fillna()).
- Removing Duplicates: Duplicate rows can introduce biases in analysis, so it's important to clean them with df.drop_duplicates().
- Encoding Categorical Variables: Categorical variables often need to be converted into numerical format for machine learning models, which can be done using pd.get_dummies() for one-hot encoding or scikit-learn's LabelEncoder for ordinal encoding.
- Scaling Features: Scaling ensures that numerical features are comparable by normalizing them to a common range, often necessary for algorithms like k-NN or neural networks. Pandas supports normalizing columns through simple arithmetic operations.
- Summarizing Data: Using functions like df.describe() gives quick insights into the statistical properties of numeric columns, such as mean, median, and standard deviation, which can be used to detect data anomalies or outliers.
Examples to Practice:
import pandas as pd

# Sample dataset for Data Manipulation Project
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000, None],
        'City': ['New York', 'Chicago', 'San Francisco', 'New York', 'Chicago']}
df = pd.DataFrame(data)

# Example 1: Handling Missing Values
df['Age'] = df['Age'].fillna(df['Age'].mean())          # Fill missing Age with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean()) # Fill missing Salary with mean

# Example 2: Removing Duplicates
df = df.drop_duplicates()  # Remove any duplicate rows

# Example 3: Encoding Categorical Data
df_encoded = pd.get_dummies(df['City'], prefix='City')  # One-hot encoding for 'City'
df = pd.concat([df, df_encoded], axis=1)

# Example 4: Scaling Numerical Data (Normalization)
df['Age_Normalized'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())

# Example 5: Summarizing Data
summary = df.describe()  # Get summary statistics for numerical columns

# Displaying results
print(df)
print("\nSummary Statistics:\n", summary)
Explanation of the Example Codes & Outputs:
- Handling Missing Values: In Example 1, the missing Age and Salary values are replaced with the mean of their respective columns. This is a common strategy when dealing with missing data to prevent losing valuable information.
- Removing Duplicates: Example 2 demonstrates how to remove any duplicate rows using df.drop_duplicates(). This ensures that each row in the dataset is unique, preventing bias in analysis (the sample data here happens to contain no duplicates, so a sketch with repeated rows follows this list).
- Encoding Categorical Data: Example 3 applies one-hot encoding to the 'City' column using pd.get_dummies(), which creates new columns for each unique city. This step is crucial for preparing categorical data for machine learning models, which require numerical inputs.
- Scaling Data: Example 4 normalizes the 'Age' column using Min-Max scaling. Normalization ensures that the numerical data is on the same scale, which is essential when features are fed into machine learning models.
- Summarizing Data: Finally, Example 5 uses df.describe() to generate a summary of statistics for the numerical columns, helping identify potential outliers and get a quick overview of the data distribution.
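As noted above, the sample dataset has no duplicate rows. The sketch below uses invented data that does repeat, and also shows drop_duplicates' subset and keep parameters:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Alice', 'Bob'],
                   'City': ['New York', 'New York', 'Chicago'],
                   'Salary': [50000, 52000, 60000]})

# Default: exact duplicate rows only; nothing is dropped here, since the salaries differ
print(df.drop_duplicates())

# Treat rows as duplicates when Name and City match, keeping the last occurrence
print(df.drop_duplicates(subset=['Name', 'City'], keep='last'))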
Supplemental Information: Working on practical data manipulation projects enables you to build real-world skills in data analysis. These projects cover the critical areas of data cleaning, transformation, and summarization, and will improve your ability to preprocess data effectively for analysis or machine learning. Pandas is a versatile tool that simplifies many of these tasks, making it an indispensable part of any data analyst’s or data scientist’s toolkit.
Resources:
- Cheatsheet: Pandas DataFrame Operations
- Video: Pandas Project Tutorial by Data School
- Book: Python for Data Analysis by Wes McKinney