Data#

In God we trust, all others bring data.

—William Edwards Deming

Data is a broad term that refers to facts, statistics, or information in a raw, unprocessed, or organized form. Data can take many forms, including numbers, text, images, audio recordings, and more.

Data processing#

The process of preparing raw data for machine learning involves several stages of data processing and manipulation to transform it into a structured and suitable format. The most common stages are:

data collection;
data cleaning:
- handling missing values;
- removing duplicates;
- outlier detection;
- data type conversions;
data exploration and visualization;
feature engineering.

The result of these manipulations is what is usually called a dataset: a specific collection of data that is organized and structured in a way that makes it suitable for analysis, processing, or machine learning tasks.
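The cleaning stages listed above can be sketched with pandas. This is only an illustration on a small made-up table: the column names, the values, the choice of median imputation, and the 1.5 × IQR outlier rule are all assumptions, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value, one duplicated row, one outlier,
# and a numeric column stored as strings.
raw = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 165.0, 180.0, 172.0, 999.0],
    "children": ["2", "0", "1", "0", "3", "1", "2"],
})

df = raw.copy()

# Data type conversion: strings -> integers.
df["children"] = pd.to_numeric(df["children"])

# Handling missing values: fill with the column median (one option among many).
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

# Removing duplicates.
df = df.drop_duplicates().reset_index(drop=True)

# Outlier detection with the 1.5 * IQR rule (an assumed, commonly used heuristic).
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)
clean = df[~is_outlier].reset_index(drop=True)

print(clean)
```

Whether to drop, cap, or keep an outlier, and how to impute missing values, depends on the task; the sketch only shows where each stage fits.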

Data types#

Numerical continuous data#

Continuous data can take on any real[1] value within a range and often involves measurements. For instance:

height
temperature
distance
time

Numerical discrete data#

Discrete data consists of distinct, separate values and often involves counts or categorizations, e.g.

number of children
shoe size
test scores

Important

The distinction between continuous and discrete data is occasionally ambiguous. For example, age in whole years should probably be treated as a discrete variable. However, if we allow fractional ages, e.g. \(30.2\) years, it becomes a continuous variable.

Categorical nominal variables#

Nominal data consists of categories with no inherent order or ranking. For example:

colors
fruits
gender
countries

Categorical ordinal variables#

Ordinal data includes categories with a meaningful order or ranking. Examples:

education level
customer satisfaction
movie rating
top-10 items suggested by a search engine
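The difference between nominal and ordinal variables can be made explicit in pandas. In this sketch the category values and the order of the education levels are hypothetical; `pd.Categorical` with `ordered=True` is one way to record the ranking.

```python
import pandas as pd

# Nominal: categories without an order (hypothetical values).
colors = pd.Categorical(["red", "green", "blue", "green"])

# Ordinal: categories with an explicit order (hypothetical levels).
levels = ["school", "bachelor", "master", "phd"]
education = pd.Categorical(
    ["master", "school", "phd", "bachelor"], categories=levels, ordered=True
)

print(colors.ordered)                    # False
print(education.ordered)                 # True
print(education.min(), education.max())  # school phd
print(education.codes)                   # [2 0 3 1], codes follow the declared order
```

Comparisons such as `min` and `max` are meaningful only for the ordered variable; for a nominal one, pandas raises an error instead of inventing an order.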

Examples of datasets#

There are several ways to import some well-known datasets in Python.

Tip

To install the Python library scikit-learn (aka sklearn), run the command pip install scikit-learn

For instance, we can use helpers from the sklearn.datasets module.

Iris dataset#

from sklearn.datasets import load_iris
iris_data = load_iris(as_frame=True)
iris_data['data']

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2
...	...	...	...	...
145	6.7	3.0	5.2	2.3
146	6.3	2.5	5.0	1.9
147	6.5	3.0	5.2	2.0
148	6.2	3.4	5.4	2.3
149	5.9	3.0	5.1	1.8

150 rows × 4 columns

This is a tabular dataset. The targets are encoded by digits \(0\), \(1\), \(2\):

iris_data['target'].value_counts()

target
0    50
1    50
2    50
Name: count, dtype: int64

What do these values mean?

iris_data.target_names  

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Here is what they look like in the wild (figure 1.1 from [Murphy, 2022]):

setosa	versicolor	virginica
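The numeric target codes can be mapped back to species names. The sketch below assumes scikit-learn is installed; the variable names `code_to_name` and `species` are introduced here purely for illustration.

```python
from sklearn.datasets import load_iris

iris_data = load_iris(as_frame=True)

# Build a lookup table {0: 'setosa', 1: 'versicolor', 2: 'virginica'}.
code_to_name = {code: str(name) for code, name in enumerate(iris_data.target_names)}

# Replace each numeric target code with its species name.
species = iris_data.target.map(code_to_name)

print(species.value_counts())
```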

Bikeshare dataset#

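import pandas as pd

# Load the Bikeshare data; the path is relative to this notebook.
bike_df = pd.read_csv("../ISLP_datsets/Bikeshare.csv")
# Drop the redundant index column that was saved to the CSV.
bike_df = bike_df.drop(columns="Unnamed: 0")
bike_df.head()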

	season	mnth	day	hr	weekday	weathersit	temp	atemp	hum	casual	registered	bikers
0	1	Jan	1	0	6	clear	0.24	0.2879	0.81	3	13	16
1	1	Jan	1	1	6	clear	0.22	0.2727	0.80	8	32	40
2	1	Jan	1	2	6	clear	0.22	0.2727	0.80	5	27	32
3	1	Jan	1	3	6	clear	0.24	0.2879	0.75	3	10	13
4	1	Jan	1	4	6	clear	0.24	0.2879	0.75	0	1	1

Q. Which features are categorical and which are numeric?
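A first hint comes from the storage type of each column. The sketch below rebuilds a few columns from the first rows of the table above (so it runs without the CSV file) and splits them with `select_dtypes`; the names `sample`, `numeric_cols`, and `other_cols` are illustrative.

```python
import pandas as pd

# First rows of a few Bikeshare columns, copied from the table above.
sample = pd.DataFrame({
    "season": [1, 1, 1],
    "mnth": ["Jan", "Jan", "Jan"],
    "weathersit": ["clear", "clear", "clear"],
    "temp": [0.24, 0.22, 0.22],
    "bikers": [16, 40, 32],
})

# Columns stored as numbers vs. everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['season', 'temp', 'bikers']
print(other_cols)    # ['mnth', 'weathersit']
```

The storage type is only a starting point: a column stored as integers may still encode categories, so the meaning of each feature has to be checked as well.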

MNIST dataset#

A classic dataset of handwritten digits.

from sklearn.datasets import fetch_openml

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X.shape, Y.shape

((70000, 784), (70000,))

Q. What type of data is the MNIST dataset?

One-hot encoding#

Before feeding categorical data into machine learning models, we need to convert it to numerical form. The standard way to do this is one-hot encoding, also called dummy encoding.

If a feature takes values in the finite set \( \{1, \ldots, K\}\), it is encoded by a binary vector

\[ (\delta_1, \ldots, \delta_K) \in \{0, 1\}^K, \quad \sum\limits_{k=1}^K \delta _k = 1. \]

Thus each categorical variable, which takes \(K\) different values, is converted to \(K\) numeric variables.

Note

In fact, it is enough to have \(K-1\) dummy variables since the value of \(\delta_K\) can be automatically deduced from the values of \(\delta_1, \ldots, \delta_{K-1}\).
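Both variants can be sketched with `pandas.get_dummies`. The `color` feature and its values are hypothetical; `drop_first=True` produces the reduced encoding with \(K-1\) columns.

```python
import pandas as pd

# Hypothetical categorical feature with K = 3 distinct values.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: K binary columns, exactly one 1 per row.
one_hot = pd.get_dummies(df["color"], dtype=int)

# Reduced encoding: K - 1 columns; the dropped category is the all-zeros row.
dummies = pd.get_dummies(df["color"], dtype=int, drop_first=True)

print(one_hot)
print(dummies)
```

In the reduced encoding the first category in sorted order (`blue` here) is the one dropped, so a row of zeros stands for it.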

Feature matrix#

A tabular numerical dataset can be represented as a feature matrix (or design matrix) \(\boldsymbol X\) of shape \(N\times D\) where

\(N\) — number of samples (rows)
\(D\) — number of features (columns)

Each sample \(\boldsymbol x_i\) is therefore represented by the \(i\)-th row of the feature matrix \(\boldsymbol X\).
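As a small illustration, a numerical table can be converted to a NumPy array that plays the role of \(\boldsymbol X\). The column names and values below are made up; note that Python indexes rows from 0, so the first sample is `X[0]`.

```python
import pandas as pd

# Hypothetical tabular dataset with N = 3 samples and D = 2 features.
df = pd.DataFrame({"height_cm": [170.0, 165.0, 180.0], "age": [30, 25, 41]})

# Feature matrix of shape (N, D).
X = df.to_numpy(dtype=float)
N, D = X.shape

# The first sample is the first row of X (index 0 in Python).
x_first = X[0]

print(N, D)     # 3 2
print(x_first)  # [170.  30.]
```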