Pandas#
![http://wdy.h-cdn.co/assets/16/05/768x576/sd-aspect-1454612525-baby-pandas.jpg](http://wdy.h-cdn.co/assets/16/05/768x576/sd-aspect-1454612525-baby-pandas.jpg)
Pandas is a Python-based tool for data analysis and manipulation.
It is designed for working with heterogeneous data,
is well suited to importing, aggregating and cleaning data,
and supports quick visualizations of data.
The best of pandas#
import pandas as pd
import numpy as np
df = pd.read_csv("titanic.csv", sep="\t")
type(df), df.shape
(pandas.core.frame.DataFrame, (156, 12))
df.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 156.000000 | 156.000000 | 156.000000 | 126.000000 | 156.000000 | 156.000000 | 156.000000 |
mean | 78.500000 | 0.346154 | 2.423077 | 28.141508 | 0.615385 | 0.397436 | 28.109587 |
std | 45.177428 | 0.477275 | 0.795459 | 14.613880 | 1.056235 | 0.870146 | 39.401047 |
min | 1.000000 | 0.000000 | 1.000000 | 0.830000 | 0.000000 | 0.000000 | 6.750000 |
25% | 39.750000 | 0.000000 | 2.000000 | 19.000000 | 0.000000 | 0.000000 | 8.003150 |
50% | 78.500000 | 0.000000 | 3.000000 | 26.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 117.250000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 30.371850 |
max | 156.000000 | 1.000000 | 3.000000 | 71.000000 | 5.000000 | 5.000000 | 263.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 156 non-null int64
1 Survived 156 non-null int64
2 Pclass 156 non-null int64
3 Name 156 non-null object
4 Sex 156 non-null object
5 Age 126 non-null float64
6 SibSp 156 non-null int64
7 Parch 156 non-null int64
8 Ticket 156 non-null object
9 Fare 156 non-null float64
10 Cabin 31 non-null object
11 Embarked 155 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 14.8+ KB
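The "Non-Null Count" column above can also be read the other way round, as the number of missing entries per column. A minimal sketch, assuming a small hand-made frame (`toy` and its values are hypothetical stand-ins, not rows from `titanic.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Cabin": [None, "C85", None, "C123"],
        "Embarked": ["S", "C", "S", None],
    }
)

# Number of missing entries per column (row count minus the non-null count).
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fraction of missing entries per column, between 0 and 1.
missing_share = toy.isna().mean()
print(missing_share)
```

The same two calls should work unchanged on `df`, where `Cabin` would stand out as the column with the most gaps.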
Select columns#
Use the syntax df[[col1, ..., colN]] to select several columns as a DataFrame. A single label, df[col], returns a Series instead.
df['Age']
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
151 22.0
152 55.5
153 40.5
154 NaN
155 51.0
Name: Age, Length: 156, dtype: float64
df[['Age']]
| | Age |
---|---|
0 | 22.0 |
1 | 38.0 |
2 | 26.0 |
3 | 35.0 |
4 | 35.0 |
... | ... |
151 | 22.0 |
152 | 55.5 |
153 | 40.5 |
154 | NaN |
155 | 51.0 |
156 rows × 1 columns
type(df), type(df['Age']), type(df[['Age']])
(pandas.core.frame.DataFrame,
pandas.core.series.Series,
pandas.core.frame.DataFrame)
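The difference between single and double brackets can be checked on a frame small enough to read at a glance. This is a sketch on a hypothetical `toy` frame with made-up values, not on the Titanic data:

```python
import pandas as pd

# Hypothetical two-row frame; the values are illustrative only.
toy = pd.DataFrame(
    {"Age": [22.0, 38.0], "Sex": ["male", "female"], "Fare": [7.25, 71.28]}
)

age_series = toy["Age"]        # single label -> Series
age_frame = toy[["Age"]]       # list with one label -> one-column DataFrame
subset = toy[["Age", "Fare"]]  # list of labels -> DataFrame with those columns

print(type(age_series).__name__, type(age_frame).__name__, subset.shape)
```

A one-column DataFrame keeps its two-dimensional shape, which matters when the result is passed to code that expects a table rather than a vector.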
Indexing#
df.sort_values("Age", inplace=True)
df.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S |
119 | 120 | 0 | 3 | Andersson, Miss. Ellis Anna Maria | female | 2.00 | 4 | 2 | 347082 | 31.2750 | NaN | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.00 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.00 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
63 | 64 | 0 | 3 | Skoog, Master. Harald | male | 4.00 | 3 | 2 | 347088 | 27.9000 | NaN | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
58 | 59 | 1 | 2 | West, Miss. Constance Mirium | female | 5.00 | 1 | 2 | C.A. 34651 | 27.7500 | NaN | S |
50 | 51 | 0 | 3 | Panula, Master. Juha Niilo | male | 7.00 | 4 | 1 | 3101295 | 39.6875 | NaN | S |
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
df.tail(8)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
101 | 102 | 0 | 3 | Petroff, Mr. Pastcho ("Pentcho") | male | NaN | 0 | 0 | 349215 | 7.8958 | NaN | S |
107 | 108 | 1 | 3 | Moss, Mr. Albert Johan | male | NaN | 0 | 0 | 312991 | 7.7750 | NaN | S |
109 | 110 | 1 | 3 | Moran, Miss. Bertha | female | NaN | 1 | 0 | 371110 | 24.1500 | NaN | Q |
121 | 122 | 0 | 3 | Moore, Mr. Leonard Charles | male | NaN | 0 | 0 | A4. 54510 | 8.0500 | NaN | S |
126 | 127 | 0 | 3 | McMahon, Mr. Martin | male | NaN | 0 | 0 | 370372 | 7.7500 | NaN | Q |
128 | 129 | 1 | 3 | Peter, Miss. Anna | female | NaN | 1 | 1 | 2668 | 22.3583 | F E69 | C |
140 | 141 | 0 | 3 | Boulos, Mrs. Joseph (Sultana) | female | NaN | 0 | 2 | 2678 | 15.2458 | NaN | C |
154 | 155 | 0 | 3 | Olsen, Mr. Ole Martin | male | NaN | 0 | 0 | Fa 265302 | 7.3125 | NaN | S |
# access by integer position (row order after sorting), not by label
df.iloc[78]
PassengerId 54
Survived 1
Pclass 2
Name Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin...
Sex female
Age 29.0
SibSp 1
Parch 0
Ticket 2926
Fare 26.0
Cabin NaN
Embarked S
Name: 53, dtype: object
# access by label
df.loc[78]
PassengerId 79
Survived 1
Pclass 2
Name Caldwell, Master. Alden Gates
Sex male
Age 0.83
SibSp 0
Parch 2
Ticket 248738
Fare 29.0
Cabin NaN
Embarked S
Name: 78, dtype: object
# select several rows and columns by label
df.loc[[78, 79, 100], ["Age", "Cabin"]]
| | Age | Cabin |
---|---|---|
78 | 0.83 | NaN |
79 | 30.00 | NaN |
100 | 28.00 | NaN |
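After sorting, positions and labels no longer coincide, which is why `iloc[78]` and `loc[78]` return different passengers. A minimal sketch of the same effect, assuming a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame with the default labels 0, 1, 2.
toy = pd.DataFrame({"Age": [30.0, 5.0, 18.0]})

# Sorting reorders the rows but keeps each row's original label.
sorted_toy = toy.sort_values("Age")
labels_after_sort = list(sorted_toy.index)

first_by_position = sorted_toy["Age"].iloc[0]  # first row in the current order
row_with_label_0 = sorted_toy["Age"].loc[0]    # row whose label is 0

# Dropping the old labels makes positions and labels agree again.
renumbered = sorted_toy.reset_index(drop=True)

print(labels_after_sort, first_by_position, row_with_label_0)
```

Whether to call `reset_index(drop=True)` after sorting depends on whether the original labels still carry meaning, for example as row identifiers.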
pd.Series#
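A one-dimensional slice of a DataFrame (a single column or a single row) has type pd.Series. Its values are available as a NumPy array through .values: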
df["Age"].head(5).values
array([0.83, 2. , 2. , 2. , 3. ])
The labels are available through .index:
df["Age"].head(5).index
Index([78, 119, 7, 16, 43], dtype='int64')
Creating pd.Series#
pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
Red 1
Green 2
Blue 3
dtype: int64
pd.Series(1, index=["Red", "Green", "Blue"])
Red 1
Green 1
Blue 1
dtype: int64
Convert a Series to a DataFrame with to_frame, passing the name of the resulting column:
s = pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
type(s.to_frame("Values"))
pandas.core.frame.DataFrame
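A Series can also be built from a dict, in which case the keys become the index. The sketch below reuses the colour labels from the examples above and shows what `to_frame` produces. The variable names are chosen for illustration only:

```python
import pandas as pd

# Dict keys become the index, dict values become the data.
colours = pd.Series({"Red": 1, "Green": 2, "Blue": 3})

# to_frame keeps the index as row labels and creates a single named column.
frame = colours.to_frame("Values")

# Selecting that column gives a Series with the same labels again.
back = frame["Values"]

print(frame)
```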
NaNs#
df["Cabin"].head(10)
78 NaN
119 NaN
7 NaN
16 NaN
43 NaN
63 NaN
10 G6
58 NaN
50 NaN
24 NaN
Name: Cabin, dtype: object
df["Cabin"].dropna().head(10)
10 G6
27 C23 C25 C27
136 D47
102 D26
151 C2
88 C23 C25 C27
97 D10 D12
118 B58 B60
139 B86
75 F G73
Name: Cabin, dtype: object
df["Cabin"].fillna(3).head(10)
78 3
119 3
7 3
16 3
43 3
63 3
10 G6
58 3
50 3
24 3
Name: Cabin, dtype: object
df["Cabin"].bfill().head(10)
78 G6
119 G6
7 G6
16 G6
43 G6
63 G6
10 G6
58 C23 C25 C27
50 C23 C25 C27
24 C23 C25 C27
Name: Cabin, dtype: object
pd.isna(df["Cabin"]).head(10)
78 True
119 True
7 True
16 True
43 True
63 True
10 False
58 True
50 True
24 True
Name: Cabin, dtype: bool
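The missing-value operations from this section can be compared side by side on a short Series. This is a sketch under the assumption of a hypothetical four-element `cabin` Series, with labels and values chosen only to resemble the output above:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two gaps.
cabin = pd.Series([np.nan, "G6", np.nan, "C23"], index=[78, 10, 58, 27])

dropped = cabin.dropna()          # remove the missing entries
filled = cabin.fillna("unknown")  # replace every gap with a constant
back_filled = cabin.bfill()       # each gap takes the next valid value
forward_filled = cabin.ffill()    # each gap takes the previous valid value
mask = cabin.isna()               # True where a value is missing

print(back_filled.tolist())
print(forward_filled.tolist())
```

Note that `ffill` leaves a leading gap untouched because there is no earlier value to copy, and `bfill` does the same for a trailing gap.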
Visualization#
df.sort_index()["Fare"].plot();
![../_images/2bc61599e0e12415cc4b52a582d5445a84a1104f8147a5a621455fc2feeb1bf9.png](../_images/2bc61599e0e12415cc4b52a582d5445a84a1104f8147a5a621455fc2feeb1bf9.png)
df["Sex"].hist();
![../_images/6c050848be2d34462b7817ccdffa42f30d37885000a76b94f882a5a867c7963d.png](../_images/6c050848be2d34462b7817ccdffa42f30d37885000a76b94f882a5a867c7963d.png)
np.sqrt(0.95) * 20
19.493588689617926
# Solve q**n = eps for n: n = log(eps) / log(q)
eps = 0.01
q = 0.95
np.log(eps) / np.log(q)
89.78113496070968