Pandas#

http://wdy.h-cdn.co/assets/16/05/768x576/sd-aspect-1454612525-baby-pandas.jpg

Pandas is a Python-based tool for data analysis and manipulation

  • designed for working with heterogeneous data

  • well suited for data importing, aggregation and cleaning

  • convenient for quick visualizations of data

The best of pandas#

import pandas as pd
import numpy as np
df = pd.read_csv("titanic.csv", sep="\t")
type(df), df.shape
(pandas.core.frame.DataFrame, (156, 12))
df.head(10)
    PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5   6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6   7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742            11.1333  NaN    S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.0708  NaN    C
df.describe()
       PassengerId     Survived       Pclass          Age        SibSp        Parch         Fare
count   156.000000   156.000000   156.000000   126.000000   156.000000   156.000000   156.000000
mean     78.500000     0.346154     2.423077    28.141508     0.615385     0.397436    28.109587
std      45.177428     0.477275     0.795459    14.613880     1.056235     0.870146    39.401047
min       1.000000     0.000000     1.000000     0.830000     0.000000     0.000000     6.750000
25%      39.750000     0.000000     2.000000    19.000000     0.000000     0.000000     8.003150
50%      78.500000     0.000000     3.000000    26.000000     0.000000     0.000000    14.454200
75%     117.250000     1.000000     3.000000    35.000000     1.000000     0.000000    30.371850
max     156.000000     1.000000     3.000000    71.000000     5.000000     5.000000   263.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  156 non-null    int64  
 1   Survived     156 non-null    int64  
 2   Pclass       156 non-null    int64  
 3   Name         156 non-null    object 
 4   Sex          156 non-null    object 
 5   Age          126 non-null    float64
 6   SibSp        156 non-null    int64  
 7   Parch        156 non-null    int64  
 8   Ticket       156 non-null    object 
 9   Fare         156 non-null    float64
 10  Cabin        31 non-null     object 
 11  Embarked     155 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 14.8+ KB
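
The same first-look routine can be tried without the Titanic file. The sketch below is only an illustration: the tab-separated text in `raw` is a made-up stand-in for `titanic.csv`, and the variable names `raw` and `sample` are assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for titanic.csv.
raw = (
    "PassengerId\tName\tAge\tFare\n"
    "1\tBraund, Mr. Owen Harris\t22\t7.25\n"
    "2\tMoran, Mr. James\t\t8.4583\n"
)

# sep="\t" tells read_csv that fields are separated by tabs, not commas.
sample = pd.read_csv(io.StringIO(raw), sep="\t")

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # one dtype per column
print(sample.isna().sum())  # number of missing values per column
```

The empty Age field in the second row is read as NaN, which is why the Age column ends up as a float column even though the other value is a whole number.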

Select columns#

Use the syntax df[[col1, ..., colN]] to select several columns as a DataFrame; df[col] with a single label returns a Series.

df['Age']
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
151    22.0
152    55.5
153    40.5
154     NaN
155    51.0
Name: Age, Length: 156, dtype: float64
df[['Age']]
      Age
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
..    ...
151  22.0
152  55.5
153  40.5
154   NaN
155  51.0

156 rows × 1 columns

type(df), type(df['Age']), type(df[['Age']])
(pandas.core.frame.DataFrame,
 pandas.core.series.Series,
 pandas.core.frame.DataFrame)
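
The cells above select a single column. As a minimal sketch of the multi-column case, here is the same idea on a small made-up frame (the `people` DataFrame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})

age_series = people["Age"]         # single label -> Series
age_frame = people[["Age"]]        # list with one label -> DataFrame
subset = people[["Name", "Fare"]]  # list of labels -> DataFrame, columns in list order

print(type(age_series).__name__, type(age_frame).__name__)
print(subset)
```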

Indexing#

df.sort_values("Age", inplace=True)
df.head(10)
     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket         Fare     Cabin  Embarked
78   79           1         2       Caldwell, Master. Alden Gates             male    0.83  0      2      248738         29.0000  NaN    S
119  120          0         3       Andersson, Miss. Ellis Anna Maria         female  2.00  4      2      347082         31.2750  NaN    S
7    8            0         3       Palsson, Master. Gosta Leonard            male    2.00  3      1      349909         21.0750  NaN    S
16   17           0         3       Rice, Master. Eugene                      male    2.00  4      1      382652         29.1250  NaN    Q
43   44           1         2       Laroche, Miss. Simonne Marie Anne Andree  female  3.00  1      2      SC/Paris 2123  41.5792  NaN    C
63   64           0         3       Skoog, Master. Harald                     male    4.00  3      2      347088         27.9000  NaN    S
10   11           1         3       Sandstrom, Miss. Marguerite Rut           female  4.00  1      1      PP 9549        16.7000  G6     S
58   59           1         2       West, Miss. Constance Mirium              female  5.00  1      2      C.A. 34651     27.7500  NaN    S
50   51           0         3       Panula, Master. Juha Niilo                male    7.00  4      1      3101295        39.6875  NaN    S
24   25           0         3       Palsson, Miss. Torborg Danira             female  8.00  3      1      349909         21.0750  NaN    S
df.tail(8)
     PassengerId  Survived  Pclass  Name                              Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
101  102          0         3       Petroff, Mr. Pastcho ("Pentcho")  male    NaN   0      0      349215     7.8958   NaN    S
107  108          1         3       Moss, Mr. Albert Johan            male    NaN   0      0      312991     7.7750   NaN    S
109  110          1         3       Moran, Miss. Bertha               female  NaN   1      0      371110     24.1500  NaN    Q
121  122          0         3       Moore, Mr. Leonard Charles        male    NaN   0      0      A4. 54510  8.0500   NaN    S
126  127          0         3       McMahon, Mr. Martin               male    NaN   0      0      370372     7.7500   NaN    Q
128  129          1         3       Peter, Miss. Anna                 female  NaN   1      1      2668       22.3583  F E69  C
140  141          0         3       Boulos, Mrs. Joseph (Sultana)     female  NaN   0      2      2678       15.2458  NaN    C
154  155          0         3       Olsen, Mr. Ole Martin             male    NaN   0      0      Fa 265302  7.3125   NaN    S
# access by integer position (row order after sorting), not by index label
df.iloc[78]
PassengerId                                                   54
Survived                                                       1
Pclass                                                         2
Name           Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin...
Sex                                                       female
Age                                                         29.0
SibSp                                                          1
Parch                                                          0
Ticket                                                      2926
Fare                                                        26.0
Cabin                                                        NaN
Embarked                                                       S
Name: 53, dtype: object
# access by label
df.loc[78]
PassengerId                               79
Survived                                   1
Pclass                                     2
Name           Caldwell, Master. Alden Gates
Sex                                     male
Age                                     0.83
SibSp                                      0
Parch                                      2
Ticket                                248738
Fare                                    29.0
Cabin                                    NaN
Embarked                                   S
Name: 78, dtype: object
# select several rows and columns by label
df.loc[[78, 79, 100], ["Age", "Cabin"]]
       Age Cabin
78    0.83   NaN
79   30.00   NaN
100  28.00   NaN
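
The difference between `iloc` and `loc` only becomes visible once the row order and the index labels disagree, as they do after sorting. A minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy frame with the default index 0, 1, 2.
toy = pd.DataFrame({"Age": [30.0, 5.0, 18.0]})

# Sorting reorders the rows but keeps their original labels: index is now [1, 2, 0].
toy_sorted = toy.sort_values("Age")

by_position = toy_sorted.iloc[0]  # first row in the current order (Age 5.0)
by_label = toy_sorted.loc[0]      # row whose index label is 0 (Age 30.0)

# reset_index(drop=True) renumbers the rows so that positions and labels match again.
renumbered = toy_sorted.reset_index(drop=True)

print(by_position["Age"], by_label["Age"])
print(renumbered)
```

If the original labels are not needed after sorting, renumbering with `reset_index(drop=True)` is one way to avoid this kind of mismatch.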

pd.Series#

A one-dimensional slice of a DataFrame (a single column or a single row) has type pd.Series

df["Age"].head(5).values
array([0.83, 2.  , 2.  , 2.  , 3.  ])

Access the index labels:

df["Age"].head(5).index
Index([78, 119, 7, 16, 43], dtype='int64')
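
A Series is essentially an array of values paired with an index, and arithmetic between two Series matches entries by label rather than by position. The sketch below illustrates this with made-up data (the names `a`, `b`, and `total` are assumptions for illustration):

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Addition aligns on labels: x -> 1 + 30, y -> 2 + 20, z -> 3 + 10.
total = a + b

values = a.to_numpy()      # the underlying values as a NumPy array
labels = a.index.tolist()  # the index labels as a plain list

print(total)
print(values, labels)
```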

Creating pd.Series#

pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
Red      1
Green    2
Blue     3
dtype: int64
pd.Series(1, index=["Red", "Green", "Blue"])
Red      1
Green    1
Blue     1
dtype: int64

Convert Series to DataFrame

s = pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
type(s.to_frame("Values"))
pandas.core.frame.DataFrame
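
A Series can also be built from a dict, in which case the keys become the index. The following sketch, using the same made-up colour data, shows two ways to turn it into a DataFrame; the column names "Color" and "Values" are assumptions chosen for illustration:

```python
import pandas as pd

# Dict keys become index labels; name= sets the Series name.
counts = pd.Series({"Red": 1, "Green": 2, "Blue": 3}, name="Values")

# to_frame() keeps the labels as the index and uses the Series name as the column name.
as_frame = counts.to_frame()

# rename_axis + reset_index moves the labels into an ordinary column instead.
as_columns = counts.rename_axis("Color").reset_index()

print(as_frame)
print(as_columns)
```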

NaN’s#

df["Cabin"].head(10)
78     NaN
119    NaN
7      NaN
16     NaN
43     NaN
63     NaN
10      G6
58     NaN
50     NaN
24     NaN
Name: Cabin, dtype: object
df["Cabin"].dropna().head(10)
10              G6
27     C23 C25 C27
136            D47
102            D26
151             C2
88     C23 C25 C27
97         D10 D12
118        B58 B60
139            B86
75           F G73
Name: Cabin, dtype: object
df["Cabin"].fillna(3).head(10)
78      3
119     3
7       3
16      3
43      3
63      3
10     G6
58      3
50      3
24      3
Name: Cabin, dtype: object
df["Cabin"].bfill().head(10)
78              G6
119             G6
7               G6
16              G6
43              G6
63              G6
10              G6
58     C23 C25 C27
50     C23 C25 C27
24     C23 C25 C27
Name: Cabin, dtype: object
pd.isna(df["Cabin"]).head(10)
78      True
119     True
7       True
16      True
43      True
63      True
10     False
58      True
50      True
24      True
Name: Cabin, dtype: bool
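
The missing-value tools shown above can be compared side by side on a tiny example. This is a minimal sketch on made-up data (the `cabin` and `age` Series are assumptions, not rows from the Titanic file); filling with the median is just one common choice, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with gaps.
cabin = pd.Series([np.nan, "G6", np.nan, "C23", np.nan], name="Cabin")
age = pd.Series([22.0, np.nan, 26.0], name="Age")

n_missing = cabin.isna().sum()     # count of missing entries

forward = cabin.ffill()            # copy the previous known value forward
backward = cabin.bfill()           # copy the next known value backward
labeled = cabin.fillna("Unknown")  # replace gaps with a constant

# For a numeric column, a summary statistic such as the median is a common filler.
age_filled = age.fillna(age.median())

print(n_missing)
print(forward.tolist())
print(backward.tolist())
print(age_filled.tolist())
```

Note that `ffill` cannot fill a gap at the very start and `bfill` cannot fill one at the very end, because there is no neighbouring value to copy from.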

Visualization#

df.sort_index()["Fare"].plot();
[Figure: line plot of Fare against the row index]
df["Sex"].hist();
[Figure: histogram of the Sex column]
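
For a categorical column such as Sex, an alternative to `hist()` is to count the categories explicitly and draw a bar chart. The sketch below assumes matplotlib is installed and uses a made-up `sex` Series; the non-interactive "Agg" backend is chosen only so the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

# Hypothetical categorical data, not taken from the Titanic file.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

# value_counts() gives the number of rows per category, most frequent first.
counts = sex.value_counts()

# One bar per category; plot() returns a matplotlib Axes object.
ax = counts.plot(kind="bar", title="Passengers by sex")
ax.set_ylabel("Count")

print(counts)
```

In a notebook the chart is displayed inline; in a script, `ax.figure.savefig(...)` can be used to write it to a file.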
np.sqrt(0.95) * 20
19.493588689617926
eps = 0.01
q = 0.95
np.log(eps) / np.log(q)
89.78113496070968