Data#

In God we trust, all others bring data.

—William Edwards Deming

Data is a broad term that refers to facts, statistics, or information in a raw, unprocessed, or organized form. Data can take many forms, including numbers, text, images, audio recordings, and more.

Data processing#

The process of preparing raw data for machine learning involves several stages of data processing and manipulation to transform it into a structured and suitable format. The most common stages are:

  • data collection;

  • data cleaning:

    • handling missing values;

    • removing duplicates;

    • outlier detection;

    • data type conversions;

  • data exploration and visualization;

  • feature engineering.

The result of these manipulations is usually called a dataset: a collection of data organized and structured so that it is suitable for analysis, processing, or machine learning tasks.
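The cleaning steps listed above can be sketched with pandas. This is a minimal illustration, not a prescribed pipeline: the toy table `raw`, its column names, the median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value, one duplicated row,
# one implausible outlier, and a numeric column stored as strings.
raw = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 165.0, 172.0, 168.0, 999.0],
    "children": ["0", "2", "1", "2", "3", "1", "0"],
})

# Removing duplicates: drop rows that repeat an earlier row exactly.
clean = raw.drop_duplicates().copy()

# Handling missing values: here, fill with the column median.
clean["height_cm"] = clean["height_cm"].fillna(clean["height_cm"].median())

# Outlier detection: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = clean["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["height_cm"] < q1 - 1.5 * iqr) | (clean["height_cm"] > q3 + 1.5 * iqr)
clean = clean[~is_outlier].copy()

# Data type conversion: counts stored as text become integers.
clean["children"] = clean["children"].astype(int)
clean = clean.reset_index(drop=True)

print(clean)
```

Other choices (dropping rows with missing values, z-score thresholds, domain-specific limits) are equally common; which one fits depends on the data.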

Data types#

[Figure: data types]

Numerical continuous data#

Continuous data can take on any real[1] value within a range and often involves measurements. For instance:

  • height

  • temperature

  • distance

  • time

Numerical discrete data#

Discrete data consists of distinct, separate values and often involves counts or categorizations, e.g.

  • number of children

  • shoe size

  • test scores

Important

The distinction between continuous and discrete data is occasionally ambiguous. For example, age in whole years should probably be treated as a discrete variable. However, if we allow fractional ages, e.g. \(30.2\) years, it becomes a continuous variable.

Categorical nominal variables#

Nominal data consists of categories with no inherent order or ranking. For example:

  • colors

  • fruits

  • gender

  • countries

Categorical ordinal variables#

Ordinal data includes categories with a meaningful order or ranking. Examples:

  • education level

  • customer satisfaction

  • movie rating

  • top-10 items suggested by a search engine
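The difference between nominal and ordinal variables can be made explicit in pandas, which lets a categorical column carry an order. A small sketch follows; the category names and their ordering are assumptions made up for the example.

```python
import pandas as pd

# Nominal: categories without any order.
colors = pd.Series(["red", "green", "blue", "green"], dtype="category")

# Ordinal: categories with an explicit order (hypothetical education levels).
levels = ["school", "bachelor", "master", "phd"]
education = pd.Series(
    pd.Categorical(["master", "school", "phd", "bachelor"],
                   categories=levels, ordered=True)
)

print(colors.cat.ordered)      # False
print(education.cat.ordered)   # True

# Order-based operations are only meaningful for ordinal data.
print(education.min(), education.max())
print((education >= "bachelor").tolist())
```

Comparisons such as `>=` or `min()` raise an error for the unordered `colors` column, which mirrors the fact that nominal categories have no ranking.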

Examples of datasets#

There are several ways to import well-known datasets in Python.

Tip

To install the Python library scikit-learn (aka sklearn), run the command pip install scikit-learn

For instance, we can use helpers from the sklearn.datasets module.

Iris dataset#

from sklearn.datasets import load_iris
iris_data = load_iris(as_frame=True)
iris_data['data']
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

This is a tabular dataset. The targets are encoded by digits \(0\), \(1\), \(2\):

iris_data['target'].value_counts()
target
0    50
1    50
2    50
Name: count, dtype: int64

What do these values mean?

iris_data.target_names  
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
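If readable labels are preferred over digit codes, the codes can be mapped to species names. A short sketch, assuming scikit-learn is installed; the variable name `species` is introduced here for illustration only.

```python
from sklearn.datasets import load_iris

iris_data = load_iris(as_frame=True)

# Build a {code: name} lookup from target_names and apply it to the targets.
code_to_name = dict(enumerate(iris_data.target_names))
species = iris_data.target.map(code_to_name)

print(species.value_counts())
```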

Here is what they look like in the wild (figure 1.1 from [Murphy, 2022]):

setosa

versicolor

virginica

Bikeshare dataset#

import pandas as pd
auto_df = pd.read_csv("../ISLP_datsets/Bikeshare.csv")
auto_df.drop("Unnamed: 0", axis=1, inplace=True)
auto_df.head()
season mnth day hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered bikers
0 1 Jan 1 0 0 6 0 clear 0.24 0.2879 0.81 0.0 3 13 16
1 1 Jan 1 1 0 6 0 clear 0.22 0.2727 0.80 0.0 8 32 40
2 1 Jan 1 2 0 6 0 clear 0.22 0.2727 0.80 0.0 5 27 32
3 1 Jan 1 3 0 6 0 clear 0.24 0.2879 0.75 0.0 3 10 13
4 1 Jan 1 4 0 6 0 clear 0.24 0.2879 0.75 0.0 0 1 1

Q. Which features are categorical and which are numeric?

MNIST dataset#

A classic dataset of handwritten digits.

from sklearn.datasets import fetch_openml

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X.shape, Y.shape
((70000, 784), (70000,))

Q. What type of data is the MNIST dataset?

One-hot encoding#

Before feeding categorical data into machine learning models, we need to convert it to a numerical form. The standard way to do this is one-hot encoding, also called dummy encoding.

If a feature takes values in the finite set \( \{1, \ldots, K\}\), it is encoded by a binary vector

\[ (\delta_1, \ldots, \delta_K) \in \{0, 1\}^K, \quad \sum\limits_{k=1}^K \delta _k = 1, \]

where \(\delta_k = 1\) if and only if the feature equals \(k\). Thus each categorical variable that takes \(K\) different values is converted to \(K\) numeric variables.

Note

In fact, it is enough to have \(K-1\) dummy variables since the value of \(\delta_K\) can be automatically deduced from the values of \(\delta_1, \ldots, \delta_{K-1}\).
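The encoding and the \(K-1\) remark can be checked on a toy example with NumPy. This is a sketch under assumed values: `K = 3` and the array `feature` are made up for illustration.

```python
import numpy as np

K = 3
# Hypothetical categorical feature with values in {1, ..., K}.
feature = np.array([1, 3, 2, 3])

# Row k-1 of the K x K identity matrix is the one-hot vector for value k.
one_hot = np.eye(K, dtype=int)[feature - 1]
print(one_hot)

# Every row contains exactly one 1, so the deltas sum to 1.
print(one_hot.sum(axis=1))

# K-1 dummy variables: drop the last column and recover it from the rest.
dummies = one_hot[:, :-1]
recovered_last = 1 - dummies.sum(axis=1)
print(np.array_equal(recovered_last, one_hot[:, -1]))
```

In practice, `pandas.get_dummies` and `sklearn.preprocessing.OneHotEncoder` provide library implementations of the same idea.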

Feature matrix#

A tabular numerical dataset can be represented as a feature matrix (or design matrix) \(\boldsymbol X\) of shape \(N\times D\) where

  • \(N\) — number of samples (rows)

  • \(D\) — number of features (columns)

Each sample \(\boldsymbol x_i\) is therefore represented by the \(i\)-th row of the feature matrix \(\boldsymbol X\).

Important

A sample \(\boldsymbol x_i\) is a row vector with \(D\) coordinates. However, in linear algebra a vector is by default a column vector. That’s why in vector-matrix operations a training sample is often denoted as \(\boldsymbol x_i^\top\) to emphasize that it is a row.
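The row and column conventions can be illustrated with a small NumPy array. The matrix `X` and the weight vector `w` below are hypothetical values used only to show the shapes involved.

```python
import numpy as np

# Hypothetical feature matrix with N = 4 samples and D = 3 features.
X = np.arange(12, dtype=float).reshape(4, 3)
N, D = X.shape
print(N, D)

# One sample as a row of X (NumPy indices are zero-based).
row = X[1:2]   # shape (1, D): the sample written as a row vector
col = row.T    # shape (D, 1): the same sample as a column vector
print(row.shape, col.shape)

# A matrix-vector product applies the same weights to every sample:
# the i-th entry of X @ w is the dot product of the i-th row with w.
w = np.array([1.0, 0.0, -1.0])
scores = X @ w
print(scores.shape)
```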

TODO

  • Give other examples of datasets

  • Investigate the type of data in them (all columns of iris dataset are numerical continuous, but this isn’t always the case)

  • Describe the ways of fetching datasets in Python

  • Add info about image and text datasets (see also [Murphy, 2022], pp. 19–22)

  • Add more visualizations and quizzes