Steam Data Cleaning: Code Optimisation in Python
This post forms part of a larger series on downloading, processing and analysing data from the Steam store. See all posts here.
View original notebook on github (Download). Datasets available on Kaggle.
In my previous post, we took an in-depth look at cleaning data downloaded from the Steam Store. We followed the process from start to finish, omitting just one column, which we will look at today.
The final column to clean, release_date
, provides some interesting optimisation and learning challenges. We encountered columns with a similar structure previously, so we can use what we learned there, but now we also have dates to handle. We're going to approach this problem with optimisation in mind: first we'll figure out how to solve the task and get to a functional solution, then we'll time sections of the code to see where the major slowdowns are, using this to develop a framework for improving efficiency. By iteratively testing, rewriting and rerunning sections of code, we can gradually move towards a more efficient solution.
Importing Local Functions
When cleaning the data, we wrote a print_steam_links
function to easily create links from a dataframe. To use it again, we could copy the code and define it here, but instead we will use a handy trick in jupyter notebook. If we place the function in a separate python (.py) file inside a folder at the root of the project directory (in this case, the 'src' folder), we can tell python to look there for local modules using sys.path.append
. Then we can import the function directly, where the file name (datacleaning) is the module name, as seen below.
import sys
sys.path.append('../src/')
from datacleaning import print_steam_links
Import and Inspect Data
We begin by importing the necessary libraries and inspecting the data, with every column cleaned except release date.
# standard library imports
from ast import literal_eval
import time
import re
import sys
sys.path.append('../src/')
# third-party imports
import numpy as np
import pandas as pd
# local imports
from datacleaning import print_steam_links
# customisations
pd.set_option("max_columns", 100)
imported_steam_data = pd.read_csv('../data/exports/steam_partially_clean.csv')
print('Rows:', imported_steam_data.shape[0])
print('Columns:', imported_steam_data.shape[1])
imported_steam_data.head()
Checking the null counts, we see there are no columns with missing values. This means we did our job properly last time, and we can focus purely on cleaning and formatting this column.
imported_steam_data.isnull().sum()
Checking the Format
First we shall inspect the raw format of the column. As we can see below, it is stored as a dictionary-like string object containing values for coming_soon
and date
. From the first few rows it would appear that the dates are stored in a uniform format - day as an integer, month as a 3-character abbreviation, a comma, then the year as a four-digit number. We can parse this either with the built-in datetime module or, since we already have pandas imported, with the pd.to_datetime function.
Also, as our analysis will involve looking at ownership and sales data, games that are not yet released won't be useful to us, so we can drop any titles marked as coming soon - presumably those with this value set to True. As a side note, once the dates are parsed it may be worth checking that none fall beyond the current date, just to make doubly sure nothing slips through - a possible check is sketched below.
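A minimal sketch of that check, assuming a hypothetical dataframe df whose release_date column has already been parsed into datetimes:

# hypothetical sanity check: once unreleased games are dropped, no row
# should have a release date after today
future_releases = df[df['release_date'] > pd.Timestamp.now()]
assert future_releases.empty, 'found release dates in the future'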
display(imported_steam_data['release_date'][0])
imported_steam_data[['name', 'release_date']].head()
We can explore the data a little further using the value_counts
method. Whilst it looks like most dates are in the format dd mmm, yyyy
, there are at least a couple of rows in the format mmm yyyy
, such as 'Jun 2009'. This means the dates aren't all stored uniformly, so we will have to take care when parsing them.
print(imported_steam_data['release_date'].value_counts().head())
imported_steam_data.loc[imported_steam_data['release_date'].str.contains('Jun 2009'), 'release_date']
There are also a number of rows that have a blank string for the date value. We'll have to treat these as missing values, and decide if we want to remove them from the dataset. We can use the imported print_steam_links
function to inspect some of the rows, using sample
to randomly select a few.
no_release_date = imported_steam_data[imported_steam_data['release_date'].str.contains("'date': ''")]
print('Rows with no release date:', no_release_date.shape[0], '\n')
print_steam_links(no_release_date.sample(5, random_state=0))
It looks like some are special re-releases, like anniversary or game of the year editions, some are early access titles not officially released yet, and others simply have a missing date. Apart from that there don't appear to be any clear patterns, so as there are only 26 rows it may be best to remove them - one way of doing so is sketched below.
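A quick sketch, reusing the mask from above (in practice we'll fold this into the cleaning function instead):

# keep only rows whose release_date value contains a non-empty date
with_date = imported_steam_data[~imported_steam_data['release_date'].str.contains("'date': ''")]
print('Rows remaining:', with_date.shape[0])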
Parsing the Dates
Taking a look at the format of the column, we'll need to be using literal_eval
as we did before. Apart from that it should be straightforward enough to extract the date.
print(type(imported_steam_data['release_date'].iloc[0]))
imported_steam_data['release_date'].iloc[0]
print(type(literal_eval(imported_steam_data['release_date'].iloc[0])))
literal_eval(imported_steam_data['release_date'].iloc[0])['date']
Once extracted, we can use the pd.to_datetime
function to interpret and store dates as datetime
objects. This is useful as it will allow us to search and sort the dataset by date when it comes to performing analysis. Say, for example, we only wish to examine games released in 2010: by converting our dates to a python-recognisable format, this becomes very easy to achieve, as in the hypothetical snippet below.
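A hypothetical example, assuming a dataframe df with release_date already converted:

# the .dt accessor makes filtering a datetime column by year trivial
games_2010 = df[df['release_date'].dt.year == 2010]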
As seen below, we can supply the to_datetime
function with a date and pandas will automatically interpret the format. We can then inspect it or print an attribute like the year. We can also provide pandas with the format explicitly, so it knows what to look for and how to parse it, which may be quicker for large sets of data.
timestamp = pd.to_datetime(literal_eval(imported_steam_data['release_date'].iloc[0])['date'])
print(timestamp)
print(timestamp.year)
pd.to_datetime(literal_eval(imported_steam_data['release_date'].iloc[0])['date'], format='%d %b, %Y')
Initial Function Definition
Now we are ready to begin defining a function. As we only want to keep released games, we first evaluate values from the coming_soon
key, and keep only those where the value is False
. Next we extract the release date, and set missing dates to np.nan, the default way of storing null values in pandas.
Then, using the formats we learned previously, we interpret those dates using the to_datetime
function. Once complete we pass over the dataframe once more with a general call to to_datetime
, catching any dates we missed.
Finally we drop the columns we no longer need and return the dataframe.
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    # Only want to keep released games
    df = df[df['coming_soon'] == False].copy()
    
    # extract release date and set missing dates to null
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    df.loc[df['date'] == '', 'date'] = np.nan
    
    # Parse the date formats we have discovered
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    # Parse the rest of the date formats
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    
    return df
result = process_release_date(imported_steam_data)
result[['steam_appid', 'release_date']].head()
Whilst functional, the process is quite slow. The easiest way to measure the efficiency of code is to time how long it takes to run, and that is the method we'll use here. By running this code inside a jupyter notebook, we can take advantage of IPython magics, using the %timeit magic to easily test how long the function takes to run.
%timeit process_release_date(imported_steam_data)
We can see that, on average, it takes about 3.5 seconds. Whilst manageable, we could certainly benefit from optimising our code, as this could quickly add up in larger data sets, where increasing efficiency can prove invaluable.
There are a few areas we can investigate to make improvements. When initially parsing the date, we end up calling literal_eval
twice, which may be a source of slowdown. We also loop over the entire dataset multiple times when calling the to_datetime
function.
We can be pretty confident that reducing the number of traversals over the dataset will provide some gains, but first, let's find out which part is causing the greatest slowdown. Targeting the slowest part of the code and improving it will lead to the most noticeable gains, and beyond that we can tweak the other parts until we're happy.
We just used the %timeit magic to time our function - the function is run multiple times and the average execution time is given - but we can also use the built-in time
module of python to easily inspect specific sections of code.
# begin timing
start_time = time.time()
# do something
x = 1

for i in range(1000):
    x += 1
    
    for j in range(1000):
        x += 1
# stop timing
end_time = time.time()
# calculate time difference
execution_time = end_time - start_time
print(execution_time)
We'll break down the function into different sections which we think may be causing slowdown, and see which takes the longest to execute.
def process_release_date(df):
    df = df.copy()
    
    # first section
    eval_start = time.time()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    print('Evaluation run-time:', time.time() - eval_start)
    
    df.loc[df['date'] == '', 'date'] = None
    
    # second section
    first_parse_start = time.time()
    
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    print('First parse run-time:', time.time() - first_parse_start)
    
    # third section
    second_parse_start = time.time()
    
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    print('Second parse run-time:', time.time() - second_parse_start)
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    
    return df
function_start = time.time()
process_release_date(imported_steam_data)
print('\nTotal run-time:', time.time() - function_start)
Immediately we can see that the majority of run-time is taken up by the second call to pd.to_datetime
. This suggests that the first two calls are not functioning as expected: with errors='ignore', pandas returns the input unchanged if any value fails to parse against the given format, so our mixed formats mean both calls fall through and most of the work is left to the final call. Now it makes sense why that call is slow - given no explicit format, pandas has to figure out how each date is formatted, and since we know we have some variations mixed in, this slows it down considerably.
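A minimal illustration of that behaviour, assuming the pandas version used here (where to_datetime still accepts errors='ignore'):

# a single unparseable value means the whole series is returned unchanged,
# still as strings rather than datetimes
mixed = pd.Series(['13 Jul, 2018', 'Jun 2009'])
print(pd.to_datetime(mixed, format='%d %b, %Y', errors='ignore'))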
Whilst the evaluation run-time is much shorter, multiple calls to literal_eval
may be slowing the function down as well, so we may wish to investigate that too. Since we know where the biggest slowdown is, we will begin there.
We now know that handling our dates in their current form is slow, and we know that we have some different formats mixed in there. Whilst there are likely many possible solutions to this problem, using regular expressions (or regex) comes to mind as they tend to excel at pattern matching in strings.
We know for sure two of the patterns, so let's build a regex for each of those. Then we can iteratively add more as we discover any other patterns. A powerful and useful tool for building and testing regex can be found at regexr.com.
pattern = r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}'
string = '13 Jul, 2018'
print(re.search(pattern, string))
pattern = r'[A-Za-z]{3} [\d]{4}'
string = 'Apr 2016'
print(re.search(pattern, string))
Using these two patterns we can start building out a function. We're going to apply a function to the date column which searches for each pattern, returning a standardised date string which we will then feed into the to_datetime
function.
The first search matches the 'mmm yyyy' pattern, like 'Apr 2019'. As we don't know the particular day for these matches we will assume it is the first of the month, returning '1 Apr 2019' in this example.
If that doesn't match, we'll check for the second case: the 'dd mmm, yyyy' pattern, like '13 Jul, 2018'. In this case we will simply return the match with the comma removed, becoming '13 Jul 2018'.
If neither of these match, we'll check for the empty string, and return it as it is for now.
For anything else we'll simply print the string so we know what else we should be searching for.
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            print(x)
    
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df
result = process_release_date(imported_steam_data)
As no other rows were printed out, we can be confident that we have caught all of the patterns, and don't have any extras to take care of.
We just used the infer_datetime_format
parameter of to_datetime
, which, according to the documentation, can speed up the process. However, as we now know the exact format the dates will be in, we can explicitly set it ourselves, and this should be the fastest way of doing things.
We also need to decide how to handle the missing dates - those with the empty strings. For now let's set the way the function handles errors as coerce
, which returns NaT
(not a time).
We can now rewrite the function and time it as we did before.
def process_release_date_old(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Simple parsing
    df['release_date'] = pd.to_datetime(df['date'])
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_new(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
    
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df
print('Testing date parsing:\n')
%timeit process_release_date_old(imported_steam_data)
%timeit process_release_date_new(imported_steam_data)
Our results show that the new method is almost four times faster, so we're on the right track.
Another optimisation we can make here is to check which branch of the if/else statement gets the most matches. It makes sense to order the branches from most matches to least, so that for the majority of rows we only have to run a single search.
To do this, instead of returning the date we'll return a number for each match. We can then print the value counts for the column and see which is the most frequent.
def optimise_regex_order(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '0: mmm yyyy' # '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return '1: dd mmm, yyyy' # x.replace(',', '')
        elif x == '':
            return '2: empty' # pass
    
    df['release_date'] = df['date'].apply(parse_date)
    
    return df
result = optimise_regex_order(imported_steam_data)
result['release_date'].value_counts()
By far the majority of dates are in the 'dd mmm, yyyy' format, which is second in our if/else statements. This means that for all these rows we are unnecessarily searching the string twice. Simply by reordering our searches we should see a performance improvement.
def process_release_date_unordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
    
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_ordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        # most frequent pattern first, so most rows need only one search
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return x
    
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df
%timeit process_release_date_unordered(imported_steam_data)
%timeit process_release_date_ordered(imported_steam_data)
It's an improvement, if only slightly, so we'll keep it. If anything this goes to show how fast regex pattern matching is, as there was hardly any slowdown in searching most of the strings twice.
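As an aside, one further micro-optimisation we could try is pre-compiling the patterns with re.compile, rather than passing raw pattern strings to re.search on every call. The gain is usually modest, as the re module caches recently compiled patterns internally, but a sketch might look like this:

# compile each pattern once, outside the function applied to every row
date_full = re.compile(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}')
date_partial = re.compile(r'[A-Za-z]{3} [\d]{4}')

def parse_date(x):
    if date_full.search(x):
        return x.replace(',', '')
    elif date_partial.search(x):
        return '1 ' + x
    elif x == '':
        return x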
Now that parsing is well optimised, we can move on to the evaluation section.
# Testing evaluation methods
def evaluation_method_original(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    return df


def evaluation_method_1(df):
    df = df.copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))
    df['coming_soon'] = df['release_date'].apply(lambda x: x['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['release_date'] = df['release_date'].apply(lambda x: x['date'])
    
    return df


def evaluation_method_2(df):
    df = df.copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))
    df_2 = df['release_date'].transform([lambda x: x['coming_soon'], lambda x: x['date']])
    df = pd.concat([df, df_2], axis=1)
    
    return df


def evaluation_method_3(df):
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return np.nan
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]
    
    return df
print('Original method:\n')
%timeit evaluation_method_original(imported_steam_data)
print('\nNew methods:\n')
%timeit evaluation_method_1(imported_steam_data)
%timeit evaluation_method_2(imported_steam_data)
%timeit evaluation_method_3(imported_steam_data)
It looks like we may have been right in our assumption that multiple calls to literal_eval
were slowing down the function - by calling it once instead of twice we almost halved the run-time.
Of the new methods the final one was just about the fastest, which is useful because it contains flexible custom logic we can modify if needed. Let's put everything together into the final function, and time it once more to see the improvements we've made.
We'll make a couple of changes so we can easily remove missing values at the end, which should mean we end up with clean release dates.
def process_release_date(df):
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return '' # return blank string so can drop missing at end
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return np.nan
    
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    
    df = df[df['release_date'].notnull()]
    
    return df
%timeit process_release_date(imported_steam_data)
Referring back to the original time of 3.5s, we've achieved a 7x speed increase. That's pretty close to an order of magnitude improvement. With a dataset like this, we're only talking a matter of seconds, but in a much larger dataset with millions of rows, spending the time to improve code efficiency could shave hours off of run time.
As I'm sure you're aware if you have some familiarity with writing code, for most tasks there are a million and one ways of approaching and solving the problem. Hopefully this helps lay out a simple methodology for testing, improving and thinking about code. Plus, it can be fun and interesting to figure out different ways of achieving the same output. Speaking of which, let's look at a final little challenge.
Bonus: Vanilla Python Solution
In our final solution to cleaning the release_date
column, we relied heavily on the pandas library. Often using libraries such as this is a good idea as it contains vectorised and optimised methods for dealing with data, plus it's generally quicker to develop a working solution.
As a small challenge, let's have a look at performing the above cleaning process entirely with vanilla python - that is, with only the functionality available by default in python and its built-in modules.
First we need to convert the data from a pandas dataframe into a native python format. We have a few options but let's store the data as a list of lists. We'll also only include the AppID and release date columns, for the sake of demonstration.
date_list = []

for i, row in imported_steam_data.iterrows():
    date_list.append([row['steam_appid'], row['release_date']])

date_list[:5]
The process is actually very similar. We have to loop through the data, rather than using pandas apply
, and we parse the dates using strptime
from the datetime
module. We can generate the output using regex pattern matching, as we did before, and we can store the results in a new list of lists.
We also display the first few rows of the output, and time how long it takes to run so we can compare.
from datetime import datetime as dt

def python_only(ls):
    processed_rows = []
    
    for i, date in ls:
        eval_row = literal_eval(date)
        
        if eval_row['coming_soon'] == False:
            if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', eval_row['date']):
                output = dt.strptime(eval_row['date'], '%d %b, %Y')
            elif re.search(r'[A-Za-z]{3} [\d]{4}', eval_row['date']):
                output = dt.strptime(eval_row['date'], '%b %Y')
            elif eval_row['date'] == '':
                output = 'none'
            else:
                print('Not matched:', eval_row['date'])
                output = 'none' # fall back, so a stale value is never appended
            
            processed_rows.append([i, output])
        else:
            processed_rows.append([i, 'none'])
    
    return processed_rows
start = time.time()
display(python_only(date_list)[:5])
end = time.time() - start
print(f'\nTime taken: {end:.2f}s')
Impressively, this method only took around twice as long as our optimised pandas method. It would probably take a bit longer if we had to handle all the columns in the dataset, but it is still a viable solution. Also, we didn't properly handle the missing values, so the output is littered with 'none' placeholders.
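If we wanted to tidy that up, a simple filter over the output would do it (a quick sketch, not something we timed):

# drop the 'none' placeholders so only parsed dates remain
parsed_dates = [row for row in python_only(date_list) if row[1] != 'none']
print('Rows with usable dates:', len(parsed_dates))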
Thanks for joining me, and as always I welcome any feedback or suggestions you may have in the comments below.