Hello everyone! This jupyter notebook is a very simple python guide for loading and analysing the data for the hackathon.

If you are already experienced in Python or if you are planning to use a different language, you can go ahead and skip this one. No hard feelings!

If not, don't worry. This looks more complicated than it is. By the end of this notebook you'll see that loading and manipulating data can be as easy as snapping your fingers.

Well... almost as easy.

First things first!

The first thing you should do is to import the libraries that you'll need for your project.

The ones we present below are just a suggestion. Feel free to use more, use less or don't use any at all!

FYI, Numpy is great for numerical analysis and pandas was designed to help you deal with data structures. Matplotlib is the standard plotting library.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Loading the Data

If you want to load the files into a manageable tabular format (pandas dataframe) you should use something along these lines.

In [4]:
#Simple load - the sep parameter defines the character that separates (makes sense right?) the different columns in the table.

df = pd.read_csv('C:\\Users\\E348493\\Desktop\\Hack_the_electron\\load_pwr.csv',sep=';')

#imagine that your data has some weird characters like Ç,´,^,. To load something like this you you need to define a 
#different encoding

df_2 = pd.read_csv('C:\\Users\\E348493\\Desktop\\Hack_the_electron\\dataset_index.csv',sep=';', encoding="latin1")

In this case, I chose to place the files in a folder on my desktop called "Hack_the_electron". You can place the files wherever you want, just remember to adapt the path in the commands.

Looking at the data

To show a small sample of the data in the dataframe, the head() command is always useful (here we drop one of the columns with the command drop() to ease the analysis)

In [7]:
df=df.drop(columns="Unnamed: 0")

df.head()
Out[7]:
Time meter_0 meter_1 meter_2 meter_3 meter_4 meter_5 meter_6 meter_7 meter_8 ... meter_206 meter_207 meter_208 meter_209 meter_210 meter_211 meter_212 meter_213 meter_214 meter_215
0 2016-10-01 00:15:00 1076.0 404.0 76.0 148.0 284.0 220.0 332.0 144.0 1888.0 ... 0.0 264.0 604.0 316.0 112.0 208.0 1044.0 76.0 1320.0 124.0
1 2016-10-01 00:30:00 1064.0 440.0 56.0 116.0 236.0 164.0 328.0 216.0 1828.0 ... 0.0 272.0 536.0 248.0 64.0 188.0 1068.0 40.0 1316.0 184.0
2 2016-10-01 00:45:00 1036.0 484.0 136.0 20.0 236.0 168.0 356.0 100.0 1696.0 ... 0.0 268.0 660.0 192.0 136.0 144.0 1164.0 44.0 1284.0 212.0
3 2016-10-01 01:00:00 1504.0 580.0 56.0 84.0 196.0 164.0 364.0 36.0 1692.0 ... 0.0 272.0 664.0 176.0 68.0 144.0 1016.0 40.0 1380.0 156.0
4 2016-10-01 01:15:00 3236.0 488.0 68.0 252.0 164.0 84.0 440.0 52.0 1832.0 ... 0.0 276.0 648.0 152.0 104.0 212.0 1016.0 40.0 1500.0 124.0

5 rows × 217 columns

Let's take this opportunity and look into the other dataset as well.

In [8]:
df_2.head()
Out[8]:
meter_id tariff n_phases ctrct_pw
0 meter_0 Time-Of-Use Two Periods T 20.70
1 meter_1 Time-Of-Use Two Periods T 20.70
2 meter_2 Simple Tariff M 4.60
3 meter_3 Simple Tariff M 3.45
4 meter_4 Simple Tariff T 6.90

Data Snapshot

To get the overall look of the data, the describe() and info() commands are great tools

In [9]:
df.info()

df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70072 entries, 0 to 70071
Columns: 217 entries, Time to meter_215
dtypes: float64(216), object(1)
memory usage: 116.0+ MB
Out[9]:
meter_0 meter_1 meter_2 meter_3 meter_4 meter_5 meter_6 meter_7 meter_8 meter_9 ... meter_206 meter_207 meter_208 meter_209 meter_210 meter_211 meter_212 meter_213 meter_214 meter_215
count 70064.000000 70062.000000 70061.000000 70046.000000 70052.000000 70066.000000 70070.000000 70065.000000 70066.000000 70027.000000 ... 70009.000000 70060.000000 70057.000000 70053.000000 70061.000000 70044.000000 70066.000000 70069.000000 70050.000000 70052.000000
mean 1688.141342 848.662984 283.509913 338.758359 294.952321 238.771730 317.813387 252.258189 1092.979705 966.943036 ... 158.428488 254.022723 978.715146 276.105178 205.012832 339.919194 1071.249964 239.781815 1124.807252 271.566779
std 1537.332938 638.520281 506.551761 397.158476 384.147312 457.623944 310.270384 616.990333 532.496515 563.358573 ... 306.613934 113.423875 1479.184981 198.540920 392.445709 454.174532 781.267779 505.232360 302.610938 292.046241
min 0.000000 0.000000 16.000000 12.000000 0.000000 0.000000 0.000000 0.000000 500.000000 144.000000 ... 0.000000 32.000000 0.000000 36.000000 36.000000 0.000000 112.000000 0.000000 40.000000 0.000000
25% 340.000000 492.000000 64.000000 112.000000 116.000000 28.000000 156.000000 48.000000 716.000000 688.000000 ... 0.000000 204.000000 408.000000 148.000000 88.000000 112.000000 524.000000 60.000000 988.000000 108.000000
50% 1168.000000 640.000000 120.000000 184.000000 192.000000 88.000000 224.000000 104.000000 972.000000 800.000000 ... 0.000000 212.000000 480.000000 252.000000 116.000000 168.000000 744.000000 100.000000 1176.000000 176.000000
75% 2628.000000 876.000000 212.000000 336.000000 304.000000 148.000000 372.000000 180.000000 1324.000000 968.000000 ... 192.000000 308.000000 836.000000 348.000000 148.000000 336.000000 1576.000000 148.000000 1320.000000 304.000000
max 12284.000000 7576.000000 5436.000000 4168.000000 4744.000000 2348.000000 3964.000000 8780.000000 6076.000000 8460.000000 ... 3724.000000 1560.000000 17084.000000 3232.000000 5368.000000 3980.000000 7768.000000 4784.000000 2512.000000 3248.000000

8 rows × 216 columns

Time to get our hands dirty!

Let's select some of the columns and perform a couple of simple operations

First, we'll calculate a simple sum

In [10]:
#Select first five elements in the meter_0 column and place them in a list
sample=df[["meter_0"]][0:5].values

print(sample)

#Loop through the elements of the list and sum their values if they are larger than 70
total_sum=sum(i for i in sample if i>700)
    

print("Sum=",total_sum)
[[1076.]
 [1064.]
 [1036.]
 [1504.]
 [3236.]]
Sum= [7916.]

In the document that specifies the data for the challenge (Hack the Electron - Stream A - All Files and Variables.pdf), we mention that we only have the meters' power readings (in W). If we want to obtain the corresponding consumed energy in kWh we need to divide the power by 4 (because the measurements are quarter hour averages). Let's see how we can create a new column with the energy consumptinon in kWh

In [11]:
df_meter_0=df[['Time','meter_0']].copy()

df_meter_0["energy_consumption"]=df_meter_0["meter_0"]/4

df_meter_0.head()
Out[11]:
Time meter_0 energy_consumption
0 2016-10-01 00:15:00 1076.0 269.0
1 2016-10-01 00:30:00 1064.0 266.0
2 2016-10-01 00:45:00 1036.0 259.0
3 2016-10-01 01:00:00 1504.0 376.0
4 2016-10-01 01:15:00 3236.0 809.0

Convert Datatypes

If you want to create a datetime column from a column with a different type you can use the following:

In [12]:
df["time_datetime"]=pd.to_datetime(df.Time)

print("Type of Time column: ",df[["Time"]].dtypes)

print("Type of time_date column: ",df[["time_datetime"]].dtypes)
Type of Time column:  Time    object
dtype: object
Type of time_date column:  time_datetime    datetime64[ns]
dtype: object

Sorting the data

If you would like to sort the data by a particular column, you can use this simple command below

In [13]:
df=df.sort_values("time_datetime")

df.head()
Out[13]:
Time meter_0 meter_1 meter_2 meter_3 meter_4 meter_5 meter_6 meter_7 meter_8 ... meter_207 meter_208 meter_209 meter_210 meter_211 meter_212 meter_213 meter_214 meter_215 time_datetime
0 2016-10-01 00:15:00 1076.0 404.0 76.0 148.0 284.0 220.0 332.0 144.0 1888.0 ... 264.0 604.0 316.0 112.0 208.0 1044.0 76.0 1320.0 124.0 2016-10-01 00:15:00
1 2016-10-01 00:30:00 1064.0 440.0 56.0 116.0 236.0 164.0 328.0 216.0 1828.0 ... 272.0 536.0 248.0 64.0 188.0 1068.0 40.0 1316.0 184.0 2016-10-01 00:30:00
2 2016-10-01 00:45:00 1036.0 484.0 136.0 20.0 236.0 168.0 356.0 100.0 1696.0 ... 268.0 660.0 192.0 136.0 144.0 1164.0 44.0 1284.0 212.0 2016-10-01 00:45:00
3 2016-10-01 01:00:00 1504.0 580.0 56.0 84.0 196.0 164.0 364.0 36.0 1692.0 ... 272.0 664.0 176.0 68.0 144.0 1016.0 40.0 1380.0 156.0 2016-10-01 01:00:00
4 2016-10-01 01:15:00 3236.0 488.0 68.0 252.0 164.0 84.0 440.0 52.0 1832.0 ... 276.0 648.0 152.0 104.0 212.0 1016.0 40.0 1500.0 124.0 2016-10-01 01:15:00

5 rows × 218 columns

Time to Plot!

What can we see if we plot the data from one of the meters? For readability, we will only plot the first 48 values. (Note that in the plot below, the time axis has the following format Day-Month-Hour)

In [14]:
x=df[["time_datetime"]][:48].values

y=df[["meter_0"]][:48].values

plt.plot(x,y)
Out[14]:
[<matplotlib.lines.Line2D at 0x124b6a58>]

Congratulations! You have reached the end of this very small guide.

Now is the time to move fast and break things (as they used to say in Facebook). Just remember to put things back together in the end!

If you would like to know more about Python check out these references:

https://www.w3schools.com/python/

https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/