Introduction to Data Science with Energy Data : Data cleansing

4 min readOct 10, 2018

Let’s explore a process of data science through the new data set we just got. It is an energy consumption data we have with our data center. In our data center, we have setup 10 more meters with data loggers. People who are interested can read our case study here.

Consider a server room containing two racks connected to channels: ULC5, ULC5. This room has two CRACs blowing in underground, called CRAC3, CRAC4. The two CRACs are turned on-off daily alternatively. The temperature setting is fixed.

We setup the meters logging the four channels. Let’s skip the process of setting the meters and logging for now. In short, the meters log data to the database and we exported the database to CSV.

To make it simple, the data is downloaded from http://datascience.ku.ac.th/16-2/ at googledrive. The data is in CSV file with the header on the first line. Totally, there are about 50 more fields. The first column is Timestamp (in the form 16/05/2018 14:41:02), and the second column is dev_name, which can be UDB1_ULC_5,UDB2_ULC_6,ADB1_CRAC3, ADB1_CRAC4,… which correspond to the meter channels.

Suppose that we are predicting Consumed energy kWh.

We explore the data set and show the graph of consumed energy.

Here is the graph. The consumed energy should be monotonically increasing values .We can see the error in the plot indicating the meter reading errors in the data.

Now let’s eliminate the error data first. In this code, we check only the increasing values of the Consumed_active_energy_kW which is field number 49, and check the limit of the values [2000,20000].

max_val = [20000]

min_val = [2000]
field_num = [49]

Figure 1 code to get rid of noises.

Running the program shows some results of lines with error values. We save the new file to powertest_CRAC3.out.csv.

….check field Consumed_active_energy_kW …not incrasing 12827 3884.0 at row 216187 [‘12/06/2018 07:16:29’, ‘ADB1_CRAC3’, ‘0.97’, ‘67.0’, ‘40.04’, ‘9.8’, ‘15.0’, ‘0.0’, ‘18.0’, ‘88.0’, ‘10080.0’, ‘224.8’, ‘0.0’, ‘9000.0’, ‘50.01’, ‘27.88’, ‘224.4’, ‘0.0’, ‘0.97’, ‘39.2’, ‘0.97’, ‘5.76’, ‘1600.0’, ‘0.0’, ‘0.96’, ‘7520.0’, ‘3895.0’, ‘8800.0’, ‘0.97’, ‘45.04’, ‘8.84’, ‘97.0’, ‘3909.0’, ‘27.88’, ‘0.097’, ‘0.097’, ‘225.7’, ‘2320.0’, ‘0.97’, ‘0.0’, ‘272.4’, ‘5.001’, ‘224.7’, ‘1800.0’, ‘388.4’, ‘3916.0’, ‘8.56’, ‘98.0’, ‘14.0’, ‘3884.0’, ‘97.0’, ‘84.0’, ‘0.0’, ‘5.76’, ‘0.97’, ‘8.56’, ‘27.2’, ‘13.0’] …not increasing 12828 3881.0 at row 216195 [‘12/06/2018 07:18:00’, ‘ADB1_CRAC3’, ‘0.94’, ‘92.0’, ‘28.28’, ‘7.08’, ‘2160.0’, ‘0.0’, ‘20.8’, ‘63.2’, ‘7560.0’, ‘224.8’, ‘0.0’, ‘6360.0’, ‘50.05’, ‘20.24’, ‘224.0’, ‘0.0’, ‘0.95’, ‘28.08’, ‘0.94’, ‘6.44’, ‘1760.0’, ‘0.0’, ‘0.93’, ‘5880.0’, ‘28080.0’, ‘6280.0’, ‘0.93’, ‘33.88’, ‘6.08’, ‘94.0’, ‘2255.0’, ‘20.2’, ‘0.094’, ‘0.094’, ‘225.6’, ‘2520.0’, ‘0.93’, ‘0.0’, ‘190.8’, ‘5.005’, ‘224.8’, ‘2120.0’, ‘388.1’, ‘2820.0’, ‘5.92’, ‘96.0’, ‘0.0’, ‘3881.0’, ‘94.0’, ‘93.0’, ‘0.0’, ‘6.44’, ‘0.94’, ‘5.88’, ‘19.08’, ‘6320.0’]….…greater than outliner data at row 386501 20193.0…greater than outliner data at row 386505…21276.0 Processed 393108 lines found 98352 line cleaned line 93924

After cleansing error row, we have eliminated some time values. In order to make a prediction, our data must be sampling in the fixed time interval. Thus, we have to find a way to make our data in the correct sampling interval.

There are several cases here. 1) If there is a missing value between time intervals, we must find a way to insert the guessed value in between. 2) if the interval is not fixed, i.e., some is short and some is long, we must re-correct the sampling time. To do this, the new time may be inserted and the new value must be inserted. There are two kinds of samplings: upsampling and downsampling. Upsampling means the scaled the data from the larger interval to the smaller interval; for example, for the yearly data to monthly data. Downsampling means the scaled the data from the smaller interval to the larger interval; for example, for hourly data to minute data. The approach is called resampling in pandas. The math used for resampling may be : interpolate, mean, last value, etc.

In the data set, powertest_CRAC3.out.csv, the interval generally is 30s.

2018–05–16 14:41:16 1.00 756.0 …
2018–05–16 14:41:46 1.00 757.0 …
2018–05–16 14:42:16 1.00 759.0 …
2018–05–16 14:42:46 1.00 760.0 …

In this following code, we try to resampling different intervals. Suppose we use “means, and interpolate”. In this example, we use the same method for each column just to make it simple. It may be pained to do different methods for different columns which is needed in reality. The plots for different sampling is in Figure 3.

Let’s save the resampling file to powertest_CRAC3_clean.csv for future use.

Now we can plot the new resampling data.

Next, we are ready to explore the prediction. Since what we predict is the consumed energy data, there are two ways of predictions that we show next.

Prediction using variables which can be power meter variables.
Prediction based on time series.

Next, try exploring relationship between variables.

Introduction to Data Science with Energy Data : Data cleansing

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by chantana chantrapornchai

No responses yet

More from chantana chantrapornchai

3D Pose Estimation with Media Pipe and OpenPose

Pose estimation has been around for a while and there are many applications on it. We are wondering around the ML Toolkit as we have done…

Face Analysis using ML-Kit and TensorFlow Lite

I am wandering around and try to find a solution to develop face recognition project on Android. My goal is to run facial expression…

Creating own name entity recognition using BERT and SpaCy: Tourism data set

Since we are interested ontology data extraction for tourism data set, we try to find the way to insert data to the ontology automatically…

Tips in Setting ROS Networking

I’d like to make a note for setting ROS laptop to communicate with any ROS node master. My case is on Raspberry pi. A good starting…

Recommended from Medium

Jeff Bezos Says the 1-Hour Rule Makes Him Smarter. New Neuroscience Says He’s Right

Jeff Bezos’s morning routine has long included the one-hour rule. New neuroscience says yours probably should too.

Sentiment Analysis of Online Reviews with Different Lexicons using R

This is the third article in a series that explores the topic of sentiment analysis using R. Sentiment analysis is a powerful technique…

Lists

Predictive Modeling w/ Python

ChatGPT prompts

Coding & Development

Practical Guides to Machine Learning

Data Science All Algorithm Cheatsheet 2025

Stories, strategies, and secrets to choosing the perfect algorithm.

Creating The Dashboard That Got Me A Data Analyst Job Offer

A walkthrough of the Udemy dashboard that got me a job offer from one of the biggest names in academic publishing.

15 AI Agent Business Ideas to Get Rich in 2025

How Has the Covid-19 Pandemic Affected Singapore: the Cost of Living and More

How the Covid-19 pandemic has affected the cost of living, unemployment rate, and population structure of Singapore using open datasets.