If you are already experienced in Python or if you are planning to use a different language, you can go ahead and skip this one. No hard feelings!
If not, don't worry. This looks more complicated than it is. By the end of this notebook you'll see that loading and manipulating data can be as easy as snapping your fingers.
Well... almost as easy.
The first thing you should do is to import the libraries that you'll need for your project.
The ones we present below are just a suggestion. Feel free to use more, fewer, or none at all!
FYI, NumPy is great for numerical computing, pandas was designed for working with tabular data, and Matplotlib is the standard plotting library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
If you want to load the files into a manageable tabular format (pandas dataframe) you should use something along these lines.
#Simple load - the sep parameter defines the character that separates (makes sense right?) the different columns in the table.
df = pd.read_csv('C:\\Users\\E348493\\Desktop\\Hack_the_electron\\load_pwr.csv',sep=';')
#Imagine that your data has some unusual characters like Ç, ´ or ^. To load something like this you need to define a
#different encoding
df_2 = pd.read_csv('C:\\Users\\E348493\\Desktop\\Hack_the_electron\\dataset_index.csv',sep=';', encoding="latin1")
In this case, I chose to place the files in a folder on my desktop called "Hack_the_electron". You can place the files wherever you want, just remember to adapt the path in the commands.
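If you want to try the sep and encoding parameters without the challenge files, here is a minimal self-contained sketch. It writes a tiny semicolon-separated file to a temporary folder and reads it back. The file name toy_index.csv and its contents are made up for illustration, and pathlib.Path is just one way to build a path without doubled backslashes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, not the challenge files.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_index.csv"
    # Write the file in latin1 so that it contains a non-ASCII character (ç).
    csv_path.write_text(
        "meter;city\nmeter_0;Bragança\nmeter_1;Faro\n", encoding="latin1"
    )
    # sep=";" splits the columns, encoding="latin1" decodes the special character.
    toy_df = pd.read_csv(csv_path, sep=";", encoding="latin1")

print(toy_df)
```

If the encoding does not match the file, pandas will usually either raise a UnicodeDecodeError or show garbled characters, which is a good hint to try a different one.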
To show a small sample of the data in the dataframe, the head() method is always useful. Here we first use drop() to remove the "Unnamed: 0" column (typically a leftover index written when the CSV was saved) to ease the analysis.
df=df.drop(columns="Unnamed: 0")
df.head()
Let's take this opportunity and look into the other dataset as well.
df_2.head()
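To get an overall picture of the data, the info() and describe() methods are great tools: info() lists the column types and non-null counts, and describe() gives summary statistics for the numeric columns.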
df.info()
df.describe()
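To see what these summaries react to, here is a small sketch on made-up readings (the values below are not from the challenge data). It shows that describe() leaves missing values out of its statistics, and that isna().sum() is a quick way to count them.

```python
import pandas as pd

# Hypothetical readings, for illustration only; the last one is missing.
toy_df = pd.DataFrame({"meter_0": [100.0, 200.0, 300.0, None]})

summary = toy_df.describe()
print(summary)

# describe() skips missing values, so "count" is 3 here, not 4.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```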
Let's select some of the columns and perform a couple of simple operations
First, we'll calculate a simple sum
#Select the first five elements in the meter_0 column and place them in a list
sample=df["meter_0"][0:5].tolist()
print(sample)
#Loop through the elements of the list and sum their values if they are larger than 700
total_sum=sum(i for i in sample if i>700)
print("Sum=",total_sum)
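The same conditional sum can also be written without a loop, using a boolean mask. This is a minimal sketch on made-up readings (toy_df and its values are assumptions, not the challenge data), with the threshold of 700 used in the code above.

```python
import pandas as pd

# Hypothetical power readings in W, for illustration only.
toy_df = pd.DataFrame({"meter_0": [650.0, 720.0, 810.0, 400.0, 705.0]})

first_five = toy_df["meter_0"].iloc[:5]
# Boolean mask: True where the reading is above the threshold.
mask = first_five > 700
# Indexing with the mask keeps only the matching rows before summing.
total_sum = first_five[mask].sum()
print("Sum=", total_sum)
```

On large columns the masked version is usually much faster than a Python loop, because the comparison and the sum run inside pandas.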
In the document that specifies the data for the challenge (Hack the Electron - Stream A - All Files and Variables.pdf), we mention that we only have the meters' power readings (in W). Because the measurements are quarter-hour averages, dividing the power by 4 gives the energy consumed in each interval in Wh; dividing by a further 1000 converts it to kWh. Let's see how we can create a new column with the energy consumption in kWh.
df_meter_0=df[['Time','meter_0']].copy()
#W * 0.25 h = Wh, then divide by 1000 to convert Wh to kWh
df_meter_0["energy_consumption"]=df_meter_0["meter_0"]/4/1000
df_meter_0.head()
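To keep the units straight, here is a small worked sketch of the conversion, assuming the readings are average power in W over 15-minute intervals as stated in the text. Power in W times 0.25 h gives energy in Wh, and dividing by 1000 gives kWh. The helper name power_w_to_energy_kwh and the sample values are made up for illustration.

```python
import pandas as pd

INTERVAL_HOURS = 0.25  # quarter-hour averages, as stated in the text


def power_w_to_energy_kwh(power_w):
    """Convert average power in W over one interval to energy in kWh."""
    energy_wh = power_w * INTERVAL_HOURS  # W x h = Wh
    return energy_wh / 1000  # Wh -> kWh


# Hypothetical readings in W, for illustration only.
toy_power = pd.Series([1000.0, 2000.0, 400.0])
toy_energy = power_w_to_energy_kwh(toy_power)
print(toy_energy.tolist())
```

A quick sanity check: a constant 1000 W load running for a full hour consumes 1 kWh, so each of its four quarter-hour intervals should come out as 0.25 kWh.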
If you want to create a datetime column from a column with a different type you can use the following:
df["time_datetime"]=pd.to_datetime(df.Time)
print("Type of Time column: ",df["Time"].dtype)
print("Type of time_datetime column: ",df["time_datetime"].dtype)
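Once a column has a datetime type, its parts are available through the .dt accessor. The sketch below uses made-up timestamps and an explicit format string; the real format of the Time column may differ, so treat the format argument as an assumption to adapt.

```python
import pandas as pd

# Hypothetical timestamps; the real format of the Time column may differ.
toy_df = pd.DataFrame({"Time": ["2019-01-01 00:15:00", "2019-01-01 00:00:00"]})

# An explicit format avoids ambiguity between day-first and month-first dates.
toy_df["time_datetime"] = pd.to_datetime(toy_df["Time"], format="%Y-%m-%d %H:%M:%S")

# The .dt accessor exposes parts of each timestamp.
toy_df["hour"] = toy_df["time_datetime"].dt.hour
toy_df["minute"] = toy_df["time_datetime"].dt.minute
print(toy_df.dtypes)
```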
If you would like to sort the data by a particular column, you can use the simple command below.
df=df.sort_values("time_datetime")
df.head()
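Two small variations are often handy when sorting: resetting the index so that it follows the new order, and sorting from newest to oldest. Here is a minimal sketch on made-up data (toy_df is an assumption, not the challenge dataset).

```python
import pandas as pd

# Hypothetical, deliberately unordered readings.
toy_df = pd.DataFrame({
    "time_datetime": pd.to_datetime(
        ["2019-01-01 00:30", "2019-01-01 00:00", "2019-01-01 00:15"]
    ),
    "meter_0": [30.0, 10.0, 20.0],
})

# Oldest first, with a fresh 0..n-1 index instead of the shuffled original one.
sorted_df = toy_df.sort_values("time_datetime").reset_index(drop=True)

# Newest first.
newest_first = toy_df.sort_values("time_datetime", ascending=False)

print(sorted_df)
print(newest_first)
```

Note that sort_values returns a new dataframe, which is why the result is assigned to a variable.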
What can we see if we plot the data from one of the meters? For readability, we will only plot the first 48 values. (Note that in the plot below, the time axis has the following format Day-Month-Hour)
x=df[["time_datetime"]][:48].values
y=df[["meter_0"]][:48].values
plt.plot(x,y)
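A plot is easier to read with axis labels and a title. The sketch below shows one way to add them, using made-up readings at 15-minute steps (the values and the "Power (W)" label are assumptions for illustration).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: 48 quarter-hour readings, i.e. 12 hours.
toy_df = pd.DataFrame({
    "time_datetime": pd.date_range("2019-01-01", periods=48, freq="15min"),
    "meter_0": [500.0 + 10 * (i % 12) for i in range(48)],
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(toy_df["time_datetime"], toy_df["meter_0"])
ax.set_xlabel("Time")
ax.set_ylabel("Power (W)")
ax.set_title("meter_0, first 48 readings")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```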
Now is the time to move fast and break things (as they used to say at Facebook). Just remember to put things back together in the end!
If you would like to know more about Python, check out these references:
https://www.w3schools.com/python/
https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/