{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Steam Data Cleaning - Optimising Cleaning of the Release Date Column\n", "\n", "*This forms part of a larger series of posts for my [blog](http://nik-davis.github.io) on downloading, processing and analysing data from the steam store. [See all posts here](http://nik-davis.github.io/tag/steam).*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/json": { "Software versions": [ { "module": "Python", "version": "3.7.3 64bit [MSC v.1900 64 bit (AMD64)]" }, { "module": "IPython", "version": "7.5.0" }, { "module": "OS", "version": "Windows 10 10.0.18362 SP0" }, { "module": "numpy", "version": "1.16.3" }, { "module": "pandas", "version": "0.24.2" } ] }, "text/html": [ "
SoftwareVersion
Python3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython7.5.0
OSWindows 10 10.0.18362 SP0
numpy1.16.3
pandas0.24.2
Mon Jun 17 14:42:41 2019 GMT Summer Time
" ], "text/latex": [ "\\begin{tabular}{|l|l|}\\hline\n", "{\\bf Software} & {\\bf Version} \\\\ \\hline\\hline\n", "Python & 3.7.3 64bit [MSC v.1900 64 bit (AMD64)] \\\\ \\hline\n", "IPython & 7.5.0 \\\\ \\hline\n", "OS & Windows 10 10.0.18362 SP0 \\\\ \\hline\n", "numpy & 1.16.3 \\\\ \\hline\n", "pandas & 0.24.2 \\\\ \\hline\n", "\\hline \\multicolumn{2}{|l|}{Mon Jun 17 14:42:41 2019 GMT Summer Time} \\\\ \\hline\n", "\\end{tabular}\n" ], "text/plain": [ "Software versions\n", "Python 3.7.3 64bit [MSC v.1900 64 bit (AMD64)]\n", "IPython 7.5.0\n", "OS Windows 10 10.0.18362 SP0\n", "numpy 1.16.3\n", "pandas 0.24.2\n", "Mon Jun 17 14:42:41 2019 GMT Summer Time" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view software version information\n", "\n", "# http://raw.github.com/jrjohansson/version_information/master/version_information.py\n", "%load_ext version_information\n", "# %reload_ext version_information\n", "\n", "%version_information numpy, pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "In my [previous post](https://nik-davis.github.io/posts/2019/steam-data-cleaning/), we took an in-depth look at cleaning data downloaded from the Steam Store. We followed the process from start to finish, omitting just one column, which we will look at today.\n", "\n", "The final column to clean, `release_date`, provides some interesting optimisation and learning challenges. We encountered columns with a similar structure previously, so we can use what we learned there, but now we will also have dates to handle. We're going to approach this problem with the goal of optimisation in mind - we'll start by figuring out how to solve the task, getting to the point of a functional solution, then we'll test parts of the code to see where the major slowdowns are, using this to develop a framework for improving the efficiency of the code. By iteratively testing, rewriting and rerunning sections of code, we can gradually move towards a more efficienct solution.\n", "\n", "\n", "\n", "## Importing Local Functions\n", "\n", "When cleaning the data, we wrote a `print_steam_links` function to easily create links from a dataframe. To use it again, we could copy the code and define it here, but instead we will use a handy trick in jupyter notebook. If we place the function in a separate python (.py) file inside a folder at the root of the project directory (in this case, the 'src' folder), we can tell python to look there for local modules using `sys.path.append`. Then we can import the function directly, where the file name (datacleaning) is the module name, as seen below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append('../src/')\n", "\n", "from datacleaning import print_steam_links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import and Inspect Data\n", "\n", "We begin by importing the necessary libraries and inspecting the data, with every column cleaned except release date." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# standard library imports\n", "from ast import literal_eval\n", "import time\n", "import re\n", "import sys\n", "sys.path.append('../src/')\n", "\n", "# third-party imports\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# local imports\n", "from datacleaning import print_steam_links\n", "\n", "# customisations\n", "pd.set_option(\"max_columns\", 100)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows: 27391\n", "Columns: 12\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " name steam_appid required_age platforms \\\n", "0 Counter-Strike 10 0 windows;mac;linux \n", "1 Team Fortress Classic 20 0 windows;mac;linux \n", "2 Day of Defeat 30 0 windows;mac;linux \n", "3 Deathmatch Classic 40 0 windows;mac;linux \n", "4 Half-Life: Opposing Force 50 0 windows;mac;linux \n", "\n", " categories genres achievements \\\n", "0 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "1 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "2 Multi-player;Valve Anti-Cheat enabled Action 0 \n", "3 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "4 Single-player;Multi-player;Valve Anti-Cheat en... Action 0 \n", "\n", " release_date price english \\\n", "0 {'coming_soon': False, 'date': '1 Nov, 2000'} 7.19 1 \n", "1 {'coming_soon': False, 'date': '1 Apr, 1999'} 3.99 1 \n", "2 {'coming_soon': False, 'date': '1 May, 2003'} 3.99 1 \n", "3 {'coming_soon': False, 'date': '1 Jun, 2001'} 3.99 1 \n", "4 {'coming_soon': False, 'date': '1 Nov, 1999'} 3.99 1 \n", "\n", " developer publisher \n", "0 Valve Valve \n", "1 Valve Valve \n", "2 Valve Valve \n", "3 Valve Valve \n", "4 Gearbox Software Valve " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imported_steam_data = pd.read_csv('../data/exports/steam_partially_clean.csv')\n", "\n", "print('Rows:', imported_steam_data.shape[0])\n", "print('Columns:', imported_steam_data.shape[1])\n", "imported_steam_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking the null counts, we see there are no columns with missing values. This means we did our job properly previously, and we should just be able to focus on cleaning and formatting the column." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name 0\n", "steam_appid 0\n", "required_age 0\n", "platforms 0\n", "categories 0\n", "genres 0\n", "achievements 0\n", "release_date 0\n", "price 0\n", "english 0\n", "developer 0\n", "publisher 0\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imported_steam_data.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the Format\n", "\n", "First we shall inspect the raw format of the column. As we can see below, it is stored as a dictionary-like string object containing values for `coming_soon` and `date`. From the first few rows it would appear that the dates are stored in a uniform format - day as an integer, month as a 3-character string abbreviation, a comma, then the year as a four-digit number. We can parse this either using the python built-in datetime module, or as we already have pandas imported, we can use the [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.\n", "\n", "Also, as our analysis will involve looking at ownership and sales data, looking at games that are not released yet will not be useful to us. Intuitively, we can drop any titles which are marked as coming soon, presumably having this value set to true. As a side note, once parsed it may be worth checking that no release dates in our data are beyond the current date, just to make doubly sure none slip through." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"{'coming_soon': False, 'date': '1 Nov, 2000'}\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(imported_steam_data['release_date'][0])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name release_date\n", "0 Counter-Strike {'coming_soon': False, 'date': '1 Nov, 2000'}\n", "1 Team Fortress Classic {'coming_soon': False, 'date': '1 Apr, 1999'}\n", "2 Day of Defeat {'coming_soon': False, 'date': '1 May, 2003'}\n", "3 Deathmatch Classic {'coming_soon': False, 'date': '1 Jun, 2001'}\n", "4 Half-Life: Opposing Force {'coming_soon': False, 'date': '1 Nov, 1999'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imported_steam_data[['name', 'release_date']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can explore the data a little further using the `value_counts` method. Whilst it looks like most dates are in the format `dd mmm, yyyy`, there at least a couple of rows in the format `mmm yyyy`, such as 'Jun 2009'. This means that all the dates aren't stored uniformly, so we will have to take care when parsing them." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'coming_soon': False, 'date': '13 Jul, 2018'} 64\n", "{'coming_soon': False, 'date': '31 Jan, 2019'} 58\n", "{'coming_soon': False, 'date': '5 Apr, 2016'} 56\n", "{'coming_soon': False, 'date': '16 Nov, 2018'} 56\n", "{'coming_soon': False, 'date': '31 May, 2018'} 55\n", "Name: release_date, dtype: int64\n" ] }, { "data": { "text/plain": [ "561 {'coming_soon': False, 'date': 'Jun 2009'}\n", "618 {'coming_soon': False, 'date': 'Jun 2009'}\n", "Name: release_date, dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(imported_steam_data['release_date'].value_counts().head())\n", "\n", "imported_steam_data.loc[imported_steam_data['release_date'].str.contains('Jun 2009'), 'release_date']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also a number of rows that have a blank string for the date value. We'll have to treat these as missing values, and decide if we want to remove them from the dataset. We can use the imported `print_steam_links` function to inspect some of the rows, using `sample` to randomly select a few." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows with no release date: 26 \n", "\n", "1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby): https://store.steampowered.com/app/15540\n", "Sub Rosa: https://store.steampowered.com/app/272230\n", "AirBuccaneers: https://store.steampowered.com/app/223630\n", "Sword of the Stars: The Pit: https://store.steampowered.com/app/233700\n", "Stronghold 2: Steam Edition: https://store.steampowered.com/app/40960\n" ] } ], "source": [ "no_release_date = imported_steam_data[imported_steam_data['release_date'].str.contains(\"'date': ''\")]\n", "\n", "print('Rows with no release date:', no_release_date.shape[0], '\\n')\n", "print_steam_links(no_release_date.sample(5, random_state=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like some are special re-releases, like anniversary or game of the year editions, some are early access and not officially released yet, and others simply have a missing date. Apart from that there don't appear to be any clear patterns emerging, so as there are only 26 rows it may be best to remove them.\n", "\n", "\n", "## Parsing the dates\n", "\n", "Taking a look at the format of the column, we'll need to be using `literal_eval` as we did before. 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'str'>\n" ] }, { "data": { "text/plain": [ "\"{'coming_soon': False, 'date': '1 Nov, 2000'}\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(type(imported_steam_data['release_date'].iloc[0]))\n", "\n", "imported_steam_data['release_date'].iloc[0]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'dict'>\n" ] }, { "data": { "text/plain": [ "'1 Nov, 2000'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(type(literal_eval(imported_steam_data['release_date'].iloc[0])))\n", "\n", "literal_eval(imported_steam_data['release_date'].iloc[0])['date']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once extracted, we can use the `pd.to_datetime` function to interpret and store dates as `datetime` objects. This is useful as it will allow us to search and sort the dataset by year when it comes to performing analysis. Say, for example, we only wish to examine games released in 2010: by converting our dates to a python-recognisable format this will be very easy to achieve.\n", "\n", "As seen below, we can supply the `to_datetime` function with a date and pandas will automatically interpret the format. We can then inspect it or print an attribute like the year. We can also provide pandas with the format explicitly, so it knows what to look for and how to parse it, which may be [quicker for large sets of data](https://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000-11-01 00:00:00\n", "2000\n" ] }, { "data": { "text/plain": [ "Timestamp('2000-11-01 00:00:00')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timestamp = pd.to_datetime(literal_eval(imported_steam_data['release_date'].iloc[0])['date'])\n", "\n", "print(timestamp)\n", "print(timestamp.year)\n", "\n", "pd.to_datetime(literal_eval(imported_steam_data['release_date'].iloc[0])['date'], format='%d %b, %Y')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Function Definition\n", "\n", "Now we are ready to begin defining a function. As we only want to keep released games, we first evaluate values from the `coming_soon` key, and keep only those rows where the value is `False`. Next we extract the release date, and set missing dates to np.nan, the default way of storing null values in pandas.\n", "\n", "Then, using the formats we learned previously, we interpret those dates using the `to_datetime` function. Once complete we pass over the dataframe once more with a general call to `to_datetime`, catching any dates we missed.\n", "\n", "Finally we drop the columns we no longer need and return the dataframe." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { 
" ], "text/plain": [ " steam_appid release_date\n", "0 10 2000-11-01\n", "1 20 1999-04-01\n", "2 30 2003-05-01\n", "3 40 2001-06-01\n", "4 50 1999-11-01" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_release_date(df):\n", " df = df.copy()\n", " \n", " df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", " # Only want to keep released games\n", " df = df[df['coming_soon'] == False].copy()\n", " \n", " # extract release date and set missing dates to null\n", " df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", " df.loc[df['date'] == '', 'date'] = np.nan\n", " \n", " # Parse the date formats we have discovered\n", " df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')\n", " df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')\n", " \n", " # Parse the rest of the date formats\n", " df['release_date'] = pd.to_datetime(df['datetime'])\n", " \n", " df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)\n", " return df\n", "\n", "result = process_release_date(imported_steam_data)\n", "result[['steam_appid', 'release_date']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whilst functional, the process is quite slow. The easiest way to measure the efficiency of code is by timing how long it takes to run, and that is the method we'll use here. By running this code inside of jupyter notebook, we can take advanted of IPython magics, and use the [%timeit](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) magic to easily test how long it takes to run the function." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.88 s ± 310 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit process_release_date(imported_steam_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that, on average, it takes about 3.5 seconds. Whilst manageable, we could certainly benefit from optimising our code, as this could quickly add up in larger data sets, where increasing efficiency can prove invaluable.\n", "\n", "There are a few areas we can investigate to make improvements. When initially parsing the date, we end up calling `literal_eval` twice, which may be a source of slowdown. We also loop over the entire dataset multiple times when calling the `to_datetime` function. \n", "\n", "We can be pretty confident that reducing the number of traversals over the dataset will provide some gains, but first, let's find out which part is causing the greatest slowdown. Targetting the slowest part of the code and improving it will lead to the most noticeable gains, and beyond that we can tweak other parts until we're happy. \n", "\n", "We just used the %timeit magic to time our function - the function is run multiple times and the average execution time is given - but we can also use the built-in `time` module of python to easily inspect specific sections of code." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.08697056770324707\n" ] } ], "source": [ "# begin timing\n", "start_time = time.time()\n", "\n", "# do something\n", "x = 1\n", "for i in range(1000):\n", " x += 1\n", " for j in range(1000):\n", " x += 1\n", "\n", "# stop timing\n", "end_time = time.time()\n", "\n", "# calculate time difference\n", "execution_time = end_time - start_time\n", "print(execution_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll break down the function into different sections which we think may be causing slowdown, and see which takes the longest to execute." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Evaluation run-time: 0.7157635688781738\n", "First parse run-time: 0.0059967041015625\n", "Second parse run-time: 2.839102029800415\n", "\n", "Total run-time: 3.5898571014404297\n" ] } ], "source": [ "def process_release_date(df):\n", " df = df.copy()\n", " \n", " # first section\n", " eval_start = time.time()\n", " \n", " df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", " df = df[df['coming_soon'] == False].copy()\n", " df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", " \n", " print('Evaluation run-time:', time.time() - eval_start)\n", " \n", " df.loc[df['date'] == '', 'date'] = None\n", " \n", " # second section\n", " first_parse_start = time.time()\n", " \n", " df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')\n", " df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')\n", " \n", " print('First parse run-time:', time.time() - first_parse_start)\n", " \n", " # third section\n", " second_parse_start = time.time()\n", " \n", " df['release_date'] = pd.to_datetime(df['datetime'])\n", " \n", " print('Second parse run-time:', time.time() - second_parse_start)\n", " \n", " df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)\n", " return df\n", "\n", "function_start = time.time()\n", "process_release_date(imported_steam_data)\n", "print('\\nTotal run-time:', time.time() - function_start)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Immediately we can see that the majority of run-time is taken up by the second call to `pd.to_datetime`. This suggests that the first two calls are not functioning as expected - they are possibly terminating after the first error instead of skipping over it as desired - and most of the work is being done by the final call. Now it makes sense why it is slow - pandas has to figure out how each date is formatted, and since we know we have some variations this may be slowing it down considerably.\n", "\n", "Whilst the evaluation run-time is much shorter, multiple calls to `literal_eval` may be slowing the function as well, so we may wish to investigate that. As we know the biggest slowdown, we will begin there.\n", "\n", "We now know that handling our dates in their current form is slow, and we know that we have some different formats mixed in there. Whilst there are likely many possible solutions to this problem, using regular expressions (or regex) comes to mind as they tend to excel at pattern matching in strings.\n", "\n", "We know for sure two of the patterns, so let's build a regex for each of those. Then we can iteratively add more as we discover any other patterns. 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 12), match='13 Jul, 2018'>\n", "<re.Match object; span=(0, 8), match='Apr 2016'>\n" ] } ], "source": [ "pattern = r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}'\n", "string = '13 Jul, 2018'\n", "\n", "print(re.search(pattern, string))\n", "\n", "pattern = r'[A-Za-z]{3} [\\d]{4}'\n", "string = 'Apr 2016'\n", "\n", "print(re.search(pattern, string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using these two patterns we can start building out a function. We're going to apply a function to the date column which searches for each pattern, returning a standardised date string which we will then feed into the `to_datetime` function.\n", "\n", "The first search matches the 'mmm yyyy' pattern, like 'Apr 2019'. As we don't know the particular day for these matches we will assume it is the first of the month, returning '1 Apr 2019' in this example.\n", "\n", "If we don't match this, we'll check for the second case. The second match will be the 'dd mmm, yyyy' pattern, like '13 Jul, 2018'. In this case we will simply return the match with the comma removed, to become '13 Jul 2018'.\n", "\n", "If neither of these match, we'll check for the empty string, and return it as it is for now.\n", "\n", "For anything else we'll simply print the string, so we know what else we should be searching for." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def process_release_date(df):\n", "    df = df.copy()\n", "    \n", "    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", "    df = df[df['coming_soon'] == False].copy()\n", "    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", "    \n", "    def parse_date(x):\n", "        if re.search(r'[A-Za-z]{3} [\\d]{4}', x):\n", "            return '1 ' + x\n", "        elif re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', x):\n", "            return x.replace(',', '')\n", "        elif x == '':\n", "            return x\n", "        else:\n", "            print(x)\n", "    \n", "    df['date'] = df['date'].apply(parse_date)\n", "    df['release_date'] = pd.to_datetime(df['date'], infer_datetime_format=True)\n", "    \n", "    df = df.drop(['coming_soon', 'date'], axis=1)\n", "    \n", "    return df\n", "\n", "result = process_release_date(imported_steam_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As no other rows were printed out, we can be confident that we caught all of the patterns, and don't have any extra to take care of.\n", "\n", "We just used the `infer_datetime_format` parameter of `to_datetime`, which, according to the documentation, can speed up the process. However, as we now know the exact format the dates will be in, we can explicitly set it ourselves, and this should be the fastest way of doing things.\n", "\n", "We also need to decide how to handle the missing dates - those with the empty strings. For now let's set the `errors` parameter to `'coerce'`, which converts anything unparseable to `NaT` (not a time).\n", "\n", "We can now rewrite the function and time it as we did before." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing date parsing:\n", "\n", "3.61 s ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", "993 ms ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], 
"source": [ "def process_release_date_old(df):\n", "    df = df.copy()\n", "    \n", "    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", "    df = df[df['coming_soon'] == False].copy()\n", "    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", "    \n", "    # Simple parsing\n", "    df['release_date'] = pd.to_datetime(df['date'])\n", "    \n", "    df = df.drop(['coming_soon', 'date'], axis=1)\n", "    \n", "    return df\n", "\n", "\n", "def process_release_date_new(df):\n", "    df = df.copy()\n", "    \n", "    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", "    df = df[df['coming_soon'] == False].copy()\n", "    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", "    \n", "    # Complex parsing\n", "    def parse_date(x):\n", "        if re.search(r'[A-Za-z]{3} [\\d]{4}', x):\n", "            return '1 ' + x\n", "        elif re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', x):\n", "            return x.replace(',', '')\n", "        elif x == '':\n", "            return x\n", "    \n", "    df['date'] = df['date'].apply(parse_date)\n", "    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')\n", "    \n", "    df = df.drop(['coming_soon', 'date'], axis=1)\n", "    \n", "    return df\n", "\n", "print('Testing date parsing:\\n')\n", "%timeit process_release_date_old(imported_steam_data)\n", "%timeit process_release_date_new(imported_steam_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our results show that the new method is almost four times faster, so we're on the right track.\n", "\n", "Another optimisation we can make here is checking which branch of the if/else statement has the most matches. It makes sense to order our checks from most matches to least, so that for the majority of rows we only have to run a single search.\n", "\n", "To do this, instead of returning the date we'll return a numbered label for each match. We can then print the value counts for the column and see which is the most frequent." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1: dd mmm, yyyy 27275\n", "0: mmm yyyy 57\n", "2: empty 22\n", "Name: release_date, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def optimise_regex_order(df):\n", "    df = df.copy()\n", "    \n", "    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", "    df = df[df['coming_soon'] == False].copy()\n", "    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", "    \n", "    def parse_date(x):\n", "        if re.search(r'[A-Za-z]{3} [\\d]{4}', x):\n", "            return '0: mmm yyyy' # '1 ' + x\n", "        elif re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', x):\n", "            return '1: dd mmm, yyyy' # x.replace(',', '')\n", "        elif x == '':\n", "            return '2: empty' # pass\n", "    \n", "    df['release_date'] = df['date'].apply(parse_date)\n", "    \n", "    return df\n", "\n", "\n", "result = optimise_regex_order(imported_steam_data)\n", "\n", "result['release_date'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By far the majority of dates are in the 'dd mmm, yyyy' format, which is second in our if/else statement. This means that for all of these rows we are unnecessarily searching the string twice. Simply by reordering our searches we should see a performance improvement."
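, "\n", "\n", "As an aside, the same counts could have been produced without `apply`, using pandas' vectorised string methods - a quick sketch, assuming the extracted `date` column of strings from inside the functions above (the counts should match the value counts we just produced):\n", "\n", "```python\n", "# hypothetical vectorised count of each date format\n", "dates = df['date']\n", "print('dd mmm, yyyy:', dates.str.contains(r'[\\\\d]{1,2} [A-Za-z]{3}, [\\\\d]{4}').sum())\n", "print('mmm yyyy:', dates.str.contains(r'^[A-Za-z]{3} [\\\\d]{4}').sum())\n", "print('empty:', (dates == '').sum())\n", "```"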
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "820 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", "792 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "def process_release_date_unordered(df):\n", " df = df.copy()\n", " \n", " df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", " df = df[df['coming_soon'] == False].copy()\n", " df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", " \n", " def parse_date(x):\n", " if re.search(r'[A-Za-z]{3} [\\d]{4}', x):\n", " return '1 ' + x\n", " elif re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', x):\n", " return x.replace(',', '')\n", " elif x == '':\n", " return x\n", " \n", " df['release_date'] = df['date'].apply(parse_date)\n", " df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')\n", " df = df.drop(['coming_soon', 'date'], axis=1)\n", " \n", " return df\n", "\n", "\n", "def process_release_date_ordered(df):\n", " df = df.copy()\n", " \n", " df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", " df = df[df['coming_soon'] == False].copy()\n", " df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", " \n", " def parse_date(x):\n", " if re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', x):\n", " return x.replace(',', '')\n", " elif re.search(r'[A-Za-z]{3} [\\d]{4}', x):\n", " return '1 ' + x\n", " elif x == '':\n", " return x\n", " \n", " df['release_date'] = df['date'].apply(parse_date)\n", " df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')\n", " df = df.drop(['coming_soon', 'date'], axis=1)\n", " \n", " return df\n", "\n", "\n", "%timeit process_release_date_unordered(imported_steam_data)\n", "%timeit process_release_date_ordered(imported_steam_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's an improvement, if only slightly, so we'll keep it. If anything this goes to show how fast regex pattern matching is, as there was hardly any slowdown in searching most of the strings twice.\n", "\n", "Now parsing is well-optimised we can move on to the evaluation section." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original method:\n", "\n", "709 ms ± 9.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", "\n", "New methods:\n", "\n", "370 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", "366 ms ± 942 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", "353 ms ± 931 µs per loop (mean ± std. dev. 
"source": [ "# Testing evaluation methods\n", "def evaluation_method_original(df):\n", "    df = df.copy()\n", "    \n", "    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])\n", "    df = df[df['coming_soon'] == False].copy()\n", "    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])\n", "    \n", "    return df\n", "\n", "\n", "def evaluation_method_1(df):\n", "    df = df.copy()\n", "    \n", "    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))\n", "    \n", "    df['coming_soon'] = df['release_date'].apply(lambda x: x['coming_soon'])\n", "    df = df[df['coming_soon'] == False].copy()\n", "    \n", "    df['release_date'] = df['release_date'].apply(lambda x: x['date'])\n", "    \n", "    return df\n", "\n", "\n", "def evaluation_method_2(df):\n", "    df = df.copy()\n", "    \n", "    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))\n", "    df_2 = df['release_date'].transform([lambda x: x['coming_soon'], lambda x: x['date']])\n", "    df = pd.concat([df, df_2], axis=1)\n", "    \n", "    return df\n", "\n", "\n", "def evaluation_method_3(df):\n", "    df = df.copy()\n", "    \n", "    def eval_date(x):\n", "        x = literal_eval(x)\n", "        if x['coming_soon']:\n", "            return np.nan\n", "        else:\n", "            return x['date']\n", "    \n", "    df['release_date'] = df['release_date'].apply(eval_date)\n", "    df = df[df['release_date'].notnull()]\n", "    \n", "    return df\n", "\n", "print('Original method:\\n')\n", "%timeit evaluation_method_original(imported_steam_data)\n", "\n", "print('\\nNew methods:\\n')\n", "%timeit evaluation_method_1(imported_steam_data)\n", "%timeit evaluation_method_2(imported_steam_data)\n", "%timeit evaluation_method_3(imported_steam_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we may have been right in our assumption that multiple calls to `literal_eval` were slowing down the function - by calling it once instead of twice we almost halved the run-time.\n", "\n", "Of the new methods the final one was just about the fastest, which is useful because it contains flexible custom logic we can modify if needed. Let's put everything together into the final function, and time it once more to see the improvements we've made.\n", "\n", "We'll make a couple of changes so we can easily remove missing values at the end, which should mean we end up with clean release dates." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "481 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], 
"source": [ "def process_release_date(df):\n", "    df = df.copy()\n", "    \n", "    def eval_date(x):\n", "        x = literal_eval(x)\n", "        if x['coming_soon']:\n", "            return '' # return blank string so we can drop missing values at the end\n", "        else:\n", "            return x['date']\n", "    \n", "    df['release_date'] = df['release_date'].apply(eval_date)\n", "    \n", "    def parse_date(x):\n", "        if re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', x):\n", "            return x.replace(',', '')\n", "        elif re.search(r'[A-Za-z]{3} [\\d]{4}', x):\n", "            return '1 ' + x\n", "        elif x == '':\n", "            return np.nan\n", "    \n", "    df['release_date'] = df['release_date'].apply(parse_date)\n", "    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')\n", "    \n", "    df = df[df['release_date'].notnull()]\n", "    \n", "    return df\n", "\n", "%timeit process_release_date(imported_steam_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Referring back to the original time of nearly four seconds, we've achieved roughly an 8x speed increase - approaching an order of magnitude improvement. With a dataset like this we're only talking a matter of seconds, but in a much larger dataset with millions of rows, spending the time to improve code efficiency could shave hours off the run time.\n", "\n", "As I'm sure you're aware if you have some familiarity with writing code, most tasks can be approached and solved in many different ways. Hopefully this helps lay out a simple methodology for testing, improving and thinking about code. Plus, it can be fun and interesting to figure out different ways of achieving the same output. Speaking of which, let's look at a final little challenge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: Vanilla Python Solution\n", "\n", "In our final solution to cleaning the `release_date` column, we relied heavily on the pandas library. Using libraries like this is often a good idea, as they provide vectorised and optimised methods for dealing with data, and it's generally quicker to develop a working solution.\n", "\n", "As a small challenge, let's have a look at performing the above cleaning process entirely with vanilla python - that is, using only what is available by default in python and its built-in packages.\n", "\n", "First we need to convert the data from a pandas dataframe into a native python format. We have a few options, but let's store the data as a list of lists. We'll also only include the AppID and release date columns, for the sake of demonstration." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[10, \"{'coming_soon': False, 'date': '1 Nov, 2000'}\"],\n", " [20, \"{'coming_soon': False, 'date': '1 Apr, 1999'}\"],\n", " [30, \"{'coming_soon': False, 'date': '1 May, 2003'}\"],\n", " [40, \"{'coming_soon': False, 'date': '1 Jun, 2001'}\"],\n", " [50, \"{'coming_soon': False, 'date': '1 Nov, 1999'}\"]]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_list = []\n", "\n", "for i, row in imported_steam_data.iterrows():\n", "    date_list.append([row['steam_appid'], row['release_date']])\n", "    \n", "date_list[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The process is actually very similar. We have to loop through the data, rather than using pandas `apply`, and we parse the dates using `strptime` from the `datetime` module. We can generate the output using regex pattern matching, as we did before, and store the results in a new list of lists.\n", "\n", "We also display the first few rows of the output, and time how long it takes to run so we can compare." ] }, 
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[10, datetime.datetime(2000, 11, 1, 0, 0)],\n", " [20, datetime.datetime(1999, 4, 1, 0, 0)],\n", " [30, datetime.datetime(2003, 5, 1, 0, 0)],\n", " [40, datetime.datetime(2001, 6, 1, 0, 0)],\n", " [50, datetime.datetime(1999, 11, 1, 0, 0)]]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Time taken: 0.88s\n" ] } ], "source": [ "from datetime import datetime as dt\n", "\n", "def python_only(ls):\n", "    \n", "    processed_rows = []\n", "    \n", "    # each row is a [steam_appid, release_date string] pair\n", "    for i, date in ls:\n", "        eval_row = literal_eval(date)\n", "        \n", "        if eval_row['coming_soon'] == False:\n", "            if re.search(r'[\\d]{1,2} [A-Za-z]{3}, [\\d]{4}', eval_row['date']):\n", "                output = dt.strptime(eval_row['date'], '%d %b, %Y')\n", "            elif re.search(r'[A-Za-z]{3} [\\d]{4}', eval_row['date']):\n", "                output = dt.strptime(eval_row['date'], '%b %Y')\n", "            elif eval_row['date'] == '':\n", "                output = 'none'\n", "            else:\n", "                print('Not matched:', eval_row['date'])\n", "            processed_rows.append([i, output])\n", "        else:\n", "            processed_rows.append([i, 'none'])\n", "    \n", "    return processed_rows\n", "\n", "start = time.time()\n", "\n", "display(python_only(date_list)[:5])\n", "\n", "end = time.time() - start\n", "print(f'\\nTime taken: {end:.2f}s')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Impressively, this method only took around twice as long as our optimised pandas method. It would probably take a bit longer if we had to deal with all the columns in the dataset, but this is still a viable solution. Also, we didn't properly handle the missing values here - the output is simply padded with 'none' placeholders, which we would still need to filter out." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }