Steam Data Cleaning in Python

This post forms part of a larger series on downloading, processing and analysing data from the Steam store. See all posts here.

View original notebook on github (Download). Datasets available on Kaggle.


In the first part of this project, we downloaded and generated data sets from the Steam Store API and SteamSpy API. We now need to take this raw data and prepare it in a process commonly referred to as data cleaning.

Currently the downloaded data is not in a very useful state. Many of the columns contain lengthy strings or missing values, which hinder analysis and are especially crippling to any machine learning techniques we may wish to implement. Data cleaning involves handling missing values, tidying up values, and ensuring data is neatly and consistently formatted.

Data cleaning is often cited as the lengthiest part of any project, so it will be broken up into sections. We will begin by taking care of the columns in the steam data that are easiest to deal with, and by outlining a framework for the process. Of course it could all be completed in one go and far more concisely, but here we'll step through the reasons for each decision and build the process iteratively.

The main aims of this project are to investigate various sales and play-time statistics for games on the Steam store, and to see how different features of games affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. That said, it may be worth keeping columns which don't seem useful to this particular project, in order to provide a robust dataset for future projects.

Towards the end of this section, we'll take care of columns whose data needs to be exported separately. Some columns require much more in-depth processing and cleaning, or contain information that may be a useful topic for another time. We'll clean these lightly, then export them separately from the main clean dataset.

As a separate post, we'll take a look at an optimisation problem, walking through the process of handling one particular column.

Once that is complete we will repeat the whole cleaning process for the SteamSpy data and combine the results, finishing with a complete data set ready for analysis.

To follow along or perform your own cleaning, the raw data can be found and downloaded on Kaggle.

API references:

Import Libraries and Inspect Data

To begin with, we'll import the required libraries and set customisation options, then take a look at the previously downloaded data by reading it into a pandas dataframe.

In [1]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re

# third-party imports
import numpy as np
import pandas as pd

# customisations
pd.set_option("display.max_columns", 100)
In [2]:
# read in downloaded data
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

# print out number of rows and columns
print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])

# view first five rows
raw_steam_data.head()
Rows: 29235
Columns: 39
Out[2]:
type name steam_appid required_age is_free controller_support dlc detailed_description about_the_game short_description fullgame supported_languages header_image website pc_requirements mac_requirements linux_requirements legal_notice drm_notice ext_user_account_notice developers publishers demos price_overview packages package_groups platforms metacritic reviews categories genres screenshots movies recommendations achievements release_date support_info background content_descriptors
0 game Counter-Strike 10 0.0 False NaN NaN Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... NaN English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... NaN NaN NaN ['Valve'] ['Valve'] NaN {'currency': 'GBP', 'initial': 719, 'final': 7... [7] [{'name': 'default', 'title': 'Buy Counter-Str... {'windows': True, 'mac': True, 'linux': True} {'score': 88, 'url': 'https://www.metacritic.c... NaN [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 65735} {'total': 0} {'coming_soon': False, 'date': '1 Nov, 2000'} {'url': 'http://steamcommunity.com/app/10', 'e... https://steamcdn-a.akamaihd.net/steam/apps/10/... {'ids': [2, 5], 'notes': 'Includes intense vio...
1 game Team Fortress Classic 20 0.0 False NaN NaN One of the most popular online action games of... One of the most popular online action games of... One of the most popular online action games of... NaN English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... NaN NaN NaN ['Valve'] ['Valve'] NaN {'currency': 'GBP', 'initial': 399, 'final': 3... [29] [{'name': 'default', 'title': 'Buy Team Fortre... {'windows': True, 'mac': True, 'linux': True} NaN NaN [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 2802} {'total': 0} {'coming_soon': False, 'date': '1 Apr, 1999'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/20/... {'ids': [2, 5], 'notes': 'Includes intense vio...
2 game Day of Defeat 30 0.0 False NaN NaN Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... NaN English, French, German, Italian, Spanish - Spain https://steamcdn-a.akamaihd.net/steam/apps/30/... http://www.dayofdefeat.com/ {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... NaN NaN NaN ['Valve'] ['Valve'] NaN {'currency': 'GBP', 'initial': 399, 'final': 3... [30] [{'name': 'default', 'title': 'Buy Day of Defe... {'windows': True, 'mac': True, 'linux': True} {'score': 79, 'url': 'https://www.metacritic.c... NaN [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 1992} {'total': 0} {'coming_soon': False, 'date': '1 May, 2003'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/30/... {'ids': [], 'notes': None}
3 game Deathmatch Classic 40 0.0 False NaN NaN Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... NaN English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... NaN NaN NaN ['Valve'] ['Valve'] NaN {'currency': 'GBP', 'initial': 399, 'final': 3... [31] [{'name': 'default', 'title': 'Buy Deathmatch ... {'windows': True, 'mac': True, 'linux': True} NaN NaN [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 931} {'total': 0} {'coming_soon': False, 'date': '1 Jun, 2001'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/40/... {'ids': [], 'notes': None}
4 game Half-Life: Opposing Force 50 0.0 False NaN NaN Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... NaN English, French, German, Korean https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... NaN NaN NaN ['Gearbox Software'] ['Valve'] NaN {'currency': 'GBP', 'initial': 399, 'final': 3... [32] [{'name': 'default', 'title': 'Buy Half-Life: ... {'windows': True, 'mac': True, 'linux': True} NaN NaN [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 4355} {'total': 0} {'coming_soon': False, 'date': '1 Nov, 1999'} {'url': 'https://help.steampowered.com', 'emai... https://steamcdn-a.akamaihd.net/steam/apps/50/... {'ids': [], 'notes': None}

From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns that look to be stored as dictionaries or lists.

We can chain the isnull() and sum() methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of roughly 30,000 rows these are unlikely to provide any meaningful information.

In [3]:
null_counts = raw_steam_data.isnull().sum()
null_counts
Out[3]:
type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres                       196
screenshots                  177
movies                      2078
recommendations            22507
achievements                2381
release_date                 149
support_info                 149
background                   177
content_descriptors          149
dtype: int64

Initial Processing

We will most likely have to handle each column individually, so we will write some functions to keep our methodology organised, and help iteratively develop the process.

Our first function will remove the columns with more than 50% missing values, taking care of the columns with high null counts. We can do this by running a filter on the dataframe, as seen below.

In [4]:
threshold = raw_steam_data.shape[0] // 2

print('Drop columns with more than {} missing rows'.format(threshold))
print()

drop_cols = raw_steam_data.columns[null_counts > threshold]

print('Columns to drop: {}'.format(list(drop_cols)))
Drop columns with more than 14617 missing rows

Columns to drop: ['controller_support', 'dlc', 'fullgame', 'legal_notice', 'drm_notice', 'ext_user_account_notice', 'demos', 'metacritic', 'reviews', 'recommendations']

We can then look at the type and name columns, thinning out our data set a little by removing apps without either.

In the data collection stage, if no information was returned from an app's API request, only the name and appid were stored. We can easily identify these apps by looking at rows with missing data in the type column, as all other apps have a value here. As seen below, these rows contain no other information, so we can safely remove them.

In [5]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

# preview rows with missing type data
raw_steam_data[raw_steam_data['type'].isnull()].head(3)
Rows to remove: 149
Out[5]:
type name steam_appid required_age is_free controller_support dlc detailed_description about_the_game short_description fullgame supported_languages header_image website pc_requirements mac_requirements linux_requirements legal_notice drm_notice ext_user_account_notice developers publishers demos price_overview packages package_groups platforms metacritic reviews categories genres screenshots movies recommendations achievements release_date support_info background content_descriptors
26 NaN Half-Life: Opposing Force 852 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
147 NaN Half-Life: Opposing Force 4330 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
256 NaN Half-Life: Opposing Force 8740 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

We can look at the counts of unique values in a column by using the pandas Series.value_counts method. By checking the value counts we see that all rows either have a missing value, as noted above, or 'game' in the type column.

Once the null rows are removed, we'll be able to remove this column as it doesn't provide us with any more useful information.

In [6]:
raw_steam_data['type'].value_counts(dropna=False)
Out[6]:
game    29086
NaN       149
Name: type, dtype: int64

Taking a look now at the name column, we can check for rows which either have a null value or the string 'none' as the name. The latter isn't recognised as a null value, but should be treated as such.

We achieve this by combining boolean filters using brackets and a vertical bar, |, symbolising a logical 'or'.

There are only four rows which match these criteria, and they appear to be missing a lot of data in other columns so we should definitely remove them.

In [7]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]
Out[7]:
type name steam_appid required_age is_free controller_support dlc detailed_description about_the_game short_description fullgame supported_languages header_image website pc_requirements mac_requirements linux_requirements legal_notice drm_notice ext_user_account_notice developers publishers demos price_overview packages package_groups platforms metacritic reviews categories genres screenshots movies recommendations achievements release_date support_info background content_descriptors
4918 game none 339860 0.0 False NaN NaN NaN NaN NaN NaN English<strong>*</strong><br><strong>*</strong... https://steamcdn-a.akamaihd.net/steam/apps/339... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] NaN NaN NaN NaN [''] NaN NaN NaN [] {'windows': True, 'mac': False, 'linux': False} NaN NaN [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '25', 'description': 'Adventure'}, {'i... NaN NaN NaN {'total': 3, 'highlighted': [{'name': 'Master ... {'coming_soon': False, 'date': '27 Feb, 2015'} {'url': '', 'email': ''} NaN {'ids': [], 'notes': None}
6779 game none 385020 0.0 False NaN NaN - discontinued - (please remove) - discontinued - (please remove) - discontinued - (please remove) NaN English, French, Italian, German, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/385... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... NaN NaN NaN ['none'] [''] NaN NaN NaN [] {'windows': True, 'mac': True, 'linux': True} NaN NaN [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '4', 'description': 'Casual'}, {'id': ... NaN NaN NaN {'total': 0} {'coming_soon': False, 'date': '4 Nov, 2015'} {'url': '', 'email': ''} NaN {'ids': [], 'notes': None}
7235 game NaN 396420 0.0 True NaN NaN Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b... Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b... Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。 村... NaN NaN https://steamcdn-a.akamaihd.net/steam/apps/396... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] NaN NaN NaN NaN [''] NaN NaN NaN [] {'windows': True, 'mac': False, 'linux': False} NaN NaN NaN NaN NaN NaN NaN {'total': 0} {'coming_soon': False, 'date': '1 Nov, 2016'} {'url': '', 'email': ''} NaN {'ids': [], 'notes': None}
7350 game none 398970 0.0 False NaN NaN NaN NaN NaN NaN English<strong>*</strong><br><strong>*</strong... https://steamcdn-a.akamaihd.net/steam/apps/398... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... NaN NaN NaN ['none'] ['none'] [{'appid': 516340, 'description': ''}] NaN NaN [] {'windows': True, 'mac': True, 'linux': True} NaN NaN [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '25', 'description': 'Adventure'}, {'i... NaN NaN NaN {'total': 35, 'highlighted': [{'name': "They'v... {'coming_soon': False, 'date': '5 Nov, 2015'} {'url': '', 'email': ''} NaN {'ids': [], 'notes': None}

As we know for certain that all AppIDs should be unique, any rows with the same ID need to be handled.

We can easily view duplicated rows using the pandas DataFrame.duplicated() method. We can pass keep=False to mark every row involved in duplication, or leave the default (keep='first') to skip the first occurrence and only mark the subsequent duplicates. We can also pass a column label to subset if we want to check for duplication on a single column.

As we only want to remove the extra rows, we can keep the default behaviour.
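
As an aside, the alternative options mentioned above would look something like this (a quick sketch, not part of the cleaning pipeline itself):

# view every row involved in duplication, not just the extras
all_dupes = raw_steam_data[raw_steam_data.duplicated(keep=False)]

# check for duplication on the steam_appid column alone
appid_dupes = raw_steam_data[raw_steam_data.duplicated(subset='steam_appid', keep=False)]

print('Rows involved in full-row duplicates:', all_dupes.shape[0])
print('Rows sharing a steam_appid:', appid_dupes.shape[0])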

In [8]:
duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]

print('Duplicate rows to remove:', duplicate_rows.shape[0])

duplicate_rows.head(3)
Duplicate rows to remove: 7
Out[8]:
type name steam_appid required_age is_free controller_support dlc detailed_description about_the_game short_description fullgame supported_languages header_image website pc_requirements mac_requirements linux_requirements legal_notice drm_notice ext_user_account_notice developers publishers demos price_overview packages package_groups platforms metacritic reviews categories genres screenshots movies recommendations achievements release_date support_info background content_descriptors
31 game SiN Episodes: Emergence 1300 0.0 False NaN NaN You are John Blade, commander of HardCorps, an... You are John Blade, commander of HardCorps, an... You are John Blade, commander of HardCorps, an... NaN English, Russian, French https://steamcdn-a.akamaihd.net/steam/apps/130... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... [] [] NaN NaN NaN ['Ritual Entertainment'] ['Ritual Entertainment'] NaN {'currency': 'GBP', 'initial': 719, 'final': 7... [70] [{'name': 'default', 'title': 'Buy SiN Episode... {'windows': True, 'mac': False, 'linux': False} {'score': 75, 'url': 'https://www.metacritic.c... NaN [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 265} {'total': 0} {'coming_soon': False, 'date': '10 May, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/130... {'ids': [], 'notes': None}
32 game SiN Episodes: Emergence 1300 0.0 False NaN NaN You are John Blade, commander of HardCorps, an... You are John Blade, commander of HardCorps, an... You are John Blade, commander of HardCorps, an... NaN English, Russian, French https://steamcdn-a.akamaihd.net/steam/apps/130... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... [] [] NaN NaN NaN ['Ritual Entertainment'] ['Ritual Entertainment'] NaN {'currency': 'GBP', 'initial': 719, 'final': 7... [70] [{'name': 'default', 'title': 'Buy SiN Episode... {'windows': True, 'mac': False, 'linux': False} {'score': 75, 'url': 'https://www.metacritic.c... NaN [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 265} {'total': 0} {'coming_soon': False, 'date': '10 May, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/130... {'ids': [], 'notes': None}
356 game Jagged Alliance 2 Gold 1620 0.0 False NaN NaN <p>The small country of Arulco has been taken ... <p>The small country of Arulco has been taken ... The small country of Arulco has been taken ove... NaN English https://steamcdn-a.akamaihd.net/steam/apps/162... http://www.jaggedalliance2.com/ {'minimum': '<p><strong>Minimum Configuration:... [] [] NaN NaN NaN ['Strategy First'] ['Strategy First'] NaN {'currency': 'GBP', 'initial': 1499, 'final': ... [94] [{'name': 'default', 'title': 'Buy Jagged Alli... {'windows': True, 'mac': False, 'linux': False} NaN NaN [{'id': 2, 'description': 'Single-player'}] [{'id': '2', 'description': 'Strategy'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN NaN {'total': 0} {'coming_soon': False, 'date': '6 Jul, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/162... {'ids': [], 'notes': None}

We're now ready to define functions implementing the filters we just looked at. This allows us to easily make changes in the future if we want to alter how the columns are handled, or want to choose a different cut-off threshold for getting rid of columns, for example.

We also define a general purpose process function which will run all the processing functions we create on the data set. This will allow us to slowly add to it as we develop more functions and ensure we're cleaning the correct dataframe.

Finally we run this function on the raw data, inspecting the first few rows and viewing how many rows and columns have been removed.

In [9]:
def drop_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
    cutoff_count = len(df) * thresh
    
    return df.dropna(thresh=cutoff_count, axis=1)


def process_name_type(df):
    """Remove null values in name and type columns, and remove type column."""
    df = df[df['type'].notnull()]
    
    df = df[df['name'].notnull()]
    df = df[df['name'] != 'none']
    
    df = df.drop('type', axis=1)
    
    return df
    

def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    
    return df

print(raw_steam_data.shape)
initial_processing = process(raw_steam_data)
print(initial_processing.shape)
initial_processing.head()
(29235, 39)
(29075, 28)
Out[9]:
name steam_appid required_age is_free detailed_description about_the_game short_description supported_languages header_image website pc_requirements mac_requirements linux_requirements developers publishers price_overview packages package_groups platforms categories genres screenshots movies achievements release_date support_info background content_descriptors
0 Counter-Strike 10 0.0 False Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] {'currency': 'GBP', 'initial': 719, 'final': 7... [7] [{'name': 'default', 'title': 'Buy Counter-Str... {'windows': True, 'mac': True, 'linux': True} [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Nov, 2000'} {'url': 'http://steamcommunity.com/app/10', 'e... https://steamcdn-a.akamaihd.net/steam/apps/10/... {'ids': [2, 5], 'notes': 'Includes intense vio...
1 Team Fortress Classic 20 0.0 False One of the most popular online action games of... One of the most popular online action games of... One of the most popular online action games of... English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] {'currency': 'GBP', 'initial': 399, 'final': 3... [29] [{'name': 'default', 'title': 'Buy Team Fortre... {'windows': True, 'mac': True, 'linux': True} [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Apr, 1999'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/20/... {'ids': [2, 5], 'notes': 'Includes intense vio...
2 Day of Defeat 30 0.0 False Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... English, French, German, Italian, Spanish - Spain https://steamcdn-a.akamaihd.net/steam/apps/30/... http://www.dayofdefeat.com/ {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] {'currency': 'GBP', 'initial': 399, 'final': 3... [30] [{'name': 'default', 'title': 'Buy Day of Defe... {'windows': True, 'mac': True, 'linux': True} [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 May, 2003'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/30/... {'ids': [], 'notes': None}
3 Deathmatch Classic 40 0.0 False Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] {'currency': 'GBP', 'initial': 399, 'final': 3... [31] [{'name': 'default', 'title': 'Buy Deathmatch ... {'windows': True, 'mac': True, 'linux': True} [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Jun, 2001'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/40/... {'ids': [], 'notes': None}
4 Half-Life: Opposing Force 50 0.0 False Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... English, French, German, Korean https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Gearbox Software'] ['Valve'] {'currency': 'GBP', 'initial': 399, 'final': 3... [32] [{'name': 'default', 'title': 'Buy Half-Life: ... {'windows': True, 'mac': True, 'linux': True} [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Nov, 1999'} {'url': 'https://help.steampowered.com', 'emai... https://steamcdn-a.akamaihd.net/steam/apps/50/... {'ids': [], 'notes': None}

Processing Age

Next we'll look at the required_age column. By looking at the value counts we can see that values are stored numerically (as floats, a side effect of the missing values in the raw data), ranging from 0 to 20 with one likely error (1818). There are no missing values remaining in this column, but the vast majority of rows have a value of 0. We'll clean the column anyway, though this probably means it won't be of much use in analysis, as there is little variance in the data.

In [10]:
initial_processing['required_age'].value_counts(dropna=False).sort_index()
Out[10]:
0.0       28431
1.0           1
3.0          10
4.0           2
5.0           1
6.0           1
7.0           8
10.0          3
11.0          4
12.0         72
13.0         21
14.0          4
15.0         39
16.0        141
17.0         47
18.0        288
20.0          1
1818.0        1
Name: required_age, dtype: int64

Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. For example, instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+.

To decide which categories (or bins) we should use, we will look at the PEGI age ratings as this is the system used in the United Kingdom, where we're performing our analysis. Ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age recommended to play a game.

Using this to inform our decision, we can use the pandas.cut function to sort our data into each of these categories. A value of 0 may mean the game is unrated, that an age requirement simply wasn't supplied, or that the game is suitable for everyone; because we can't tell which, we'll leave these rows as they are. As the erroneous row (1818) is most likely meant to be rated 18 anyway, we can set the upper bound above this value to catch it inside the 18 category.

Below we define a process_age function to handle this, and inspect the output.

In [11]:
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # copy the dataframe so we don't modify the input in place
    df = df.copy()
    
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 0, 3, 7, 12, 16, 2000]
    label_values = [0, 3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


age_df = process_age(initial_processing)
age_df['required_age'].value_counts().sort_index()
Out[11]:
0     28431
3        11
7        12
12       79
16      205
18      337
Name: required_age, dtype: int64

Processing the Platforms Column

Whilst we could look at the next column in the dataframe, is_free, it would make sense that this is linked to the price_overview column. Ultimately we may wish to combine these columns into one, where free games would have a price of 0.

Looking at the price_overview column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling both of these columns together might be somewhat tricky, so instead we'll look at a simpler example first.

In [12]:
age_df['price_overview'].head()
Out[12]:
0    {'currency': 'GBP', 'initial': 719, 'final': 7...
1    {'currency': 'GBP', 'initial': 399, 'final': 3...
2    {'currency': 'GBP', 'initial': 399, 'final': 3...
3    {'currency': 'GBP', 'initial': 399, 'final': 3...
4    {'currency': 'GBP', 'initial': 399, 'final': 3...
Name: price_overview, dtype: object

The platforms column appears to contain a key for each of the main operating systems - windows, mac and linux - and a corresponding boolean value, set to True or False depending on availability on that platform. This should be a reasonably straightforward place to start. One option would be to separate this data out into three columns - one for each platform - filled with boolean values.

In [13]:
age_df['platforms'].head()
Out[13]:
0    {'windows': True, 'mac': True, 'linux': True}
1    {'windows': True, 'mac': True, 'linux': True}
2    {'windows': True, 'mac': True, 'linux': True}
3    {'windows': True, 'mac': True, 'linux': True}
4    {'windows': True, 'mac': True, 'linux': True}
Name: platforms, dtype: object

So far the cleaning process has been relatively simple, mainly requiring checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.

Our first hurdle is getting Python to recognise the data in these columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do any unnecessarily complicated string parsing. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [14]:
platforms_first_row = age_df['platforms'].iloc[0]

print(type(platforms_first_row))

platforms_first_row
<class 'str'>
Out[14]:
"{'windows': True, 'mac': True, 'linux': True}"

We can get around this using the handy literal_eval function from the built-in ast module. As the name suggests, this will allow us to evaluate the string, and then index into it as a dictionary.

In [15]:
eval_first_row = literal_eval(platforms_first_row)

print(type(eval_first_row))
print(eval_first_row)

eval_first_row['windows']
<class 'dict'>
{'windows': True, 'mac': True, 'linux': True}
Out[15]:
True

We also need to check for missing values, but fortunately it appears there aren't any in this column.

In [16]:
age_df['platforms'].isnull().sum()
Out[16]:
0

As for formatting the output, let's keep things simple and store the supported platforms as a single delimited string. Another option would be to create a column for each platform, with a corresponding True/False value in each row - something to keep in mind for the future (a rough sketch of that approach follows).
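
For reference, a minimal sketch of that alternative - one True/False column per platform - might look like the function below. The name platforms_to_bool_columns is our own and isn't used anywhere else in this notebook:

# sketch only: expand platforms into separate boolean columns
def platforms_to_bool_columns(df):
    df = df.copy()
    
    # evaluate the strings so we can treat them as dictionaries
    evaluated = df['platforms'].apply(literal_eval)
    
    for platform in ['windows', 'mac', 'linux']:
        df[platform] = evaluated.apply(lambda d: d.get(platform, False))
    
    return df.drop('platforms', axis=1)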

We can build the desired string by calling the str.join() method on a separator string and passing an iterable into it. In this case, we can pass in the keys of the row, as seen below.

In [17]:
# create string of keys, joined on a semi-colon
';'.join(eval_first_row.keys())
Out[17]:
'windows;mac;linux'

We also need to inspect the value stored under each key, and only keep the keys whose value is True, skipping those which are False. The example below shows how we can do this, first with a list comprehension and then with a generator expression inside join.

In [18]:
platforms = {'windows': True, 'mac': True, 'linux': False}

# list comprehension
print([x for x in platforms.keys() if platforms[x]])

# using a generator expression inside join
';'.join(x for x in platforms.keys() if platforms[x])
['windows', 'mac']
Out[18]:
'windows;mac'

Putting this all together, we can use the pandas Series.apply method to process the rows.

In [19]:
def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    df = df.copy()
    
    def parse_platforms(x):
        
        d = literal_eval(x)
        
        return ';'.join(platform for platform in d.keys() if d[platform])
    
    df['platforms'] = df['platforms'].apply(parse_platforms)
    
    return df


platforms_df = process_platforms(age_df)
platforms_df['platforms'].value_counts()
Out[19]:
windows              19887
windows;mac;linux     4851
windows;mac           3687
windows;linux          643
mac                      5
mac;linux                1
linux                    1
Name: platforms, dtype: int64

Processing Price

Now we have built up some intuition around how to deal with data stored as dictionaries, let's return to the is_free and price_overview columns as we should now be able to handle them.

First let's check how many null values there are in price_overview.

In [20]:
platforms_df['price_overview'].isnull().sum()
Out[20]:
3559

Whilst that looks like a lot, we have to consider the impact that the is_free column might be having. Before jumping to conclusions, let's check if there are any rows with is_free marked as True and null values in the price_overview column.

In [21]:
free_and_null_price = platforms_df[(platforms_df['is_free']) & (platforms_df['price_overview'].isnull())]

free_and_null_price.shape[0]
Out[21]:
2713

It turns out this accounts for most of the missing values in the price_overview column, meaning we can handle these by setting the price to 0. This makes intuitive sense - free games wouldn't have a price.

This means that there are almost 850 rows which aren't free but have null values in the price_overview column. Let's investigate those next.

In [22]:
not_free_and_null_price = platforms_df[(platforms_df['is_free'] == False) & (platforms_df['price_overview'].isnull())]

not_free_and_null_price.head()
Out[22]:
name steam_appid required_age is_free detailed_description about_the_game short_description supported_languages header_image website pc_requirements mac_requirements linux_requirements developers publishers price_overview packages package_groups platforms categories genres screenshots movies achievements release_date support_info background content_descriptors
63 The Ship: Single Player 2420 0 False For PC gamers who enjoy multiplayer games with... For PC gamers who enjoy multiplayer games with... The Ship is a murder mystery alternative to tr... English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/242... http://www.blazinggriffin.com/games/the-ship-m... {'minimum': '<strong>Minimum:</strong> 1.8 GHz... [] [] ['Outerlight Ltd.'] ['Blazing Griffin Ltd.'] NaN [56669] [{'name': 'default', 'title': 'Buy The Ship: S... windows [{'id': 2, 'description': 'Single-player'}] [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 2035597, 'name': 'the Ship: Intro', '... {'total': 0} {'coming_soon': False, 'date': '20 Nov, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/242... {'ids': [], 'notes': None}
75 RollerCoaster Tycoon® 3: Platinum 2700 0 False Rollercoaster Tycoon 3 Platinum combines the e... Rollercoaster Tycoon 3 Platinum combines the e... Rollercoaster Tycoon 3 Platinum combines the e... English, French, Italian, German, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/270... http://www.atari.com/rollercoastertycoon/us/in... {'minimum': '<strong>Minimum: </strong><br>\t\... {'minimum': '<ul class="bb_ul"><li><strong>OS:... [] ['Frontier', 'Aspyr (Mac)'] ['Atari', 'Aspyr (Mac)'] NaN NaN [] windows;mac [{'id': 2, 'description': 'Single-player'}] [{'id': '28', 'description': 'Simulation'}, {'... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '12 Mar, 2008'} {'url': 'http://www.atari.com/support/atari', ... https://steamcdn-a.akamaihd.net/steam/apps/270... {'ids': [], 'notes': None}
220 BioShock™ 7670 0 False <h1>Special Offer</h1><p>Buying BioShock™ also... BioShock is a shooter unlike any you've ever p... BioShock is a shooter unlike any you've ever p... English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/767... http://www.BioShockGame.com {'minimum': '<h2 class="bb_tag"><strong>Minimu... {'minimum': 'Please See BioShock Remastered'} [] ['2K Boston', '2K Australia'] ['2K'] NaN [451, 127633] [{'name': 'default', 'title': 'Buy BioShock™',... windows [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '21 Aug, 2007'} {'url': 'support.2k.com', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/767... {'ids': [], 'notes': None}
234 Sam & Max 101: Culture Shock 8200 0 False <strong>Sam &amp; Max: Episode 1 - Culture Sho... <strong>Sam &amp; Max: Episode 1 - Culture Sho... Sam &amp; Max: Episode 1 - Culture Shock - The... English, French, German, Italian https://steamcdn-a.akamaihd.net/steam/apps/820... http://store.steampowered.com/app/901660/ {'minimum': 'Windows XP or Vista, 1.5GHz proce... [] [] ['Telltale Games'] ['Telltale Games'] NaN [357, 539] [{'name': 'default', 'title': 'Buy Sam & Max 1... windows [{'id': 2, 'description': 'Single-player'}] [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '15 Jun, 2007'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/820... {'ids': [], 'notes': None}
235 Sam & Max 102: Situation: Comedy 8210 0 False <strong>Sam &amp; Max: Episode 2 - Situation: ... <strong>Sam &amp; Max: Episode 2 - Situation: ... Sam &amp; Max: Episode 2 - Situation: Comedy -... English, German, Italian https://steamcdn-a.akamaihd.net/steam/apps/821... http://store.steampowered.com/app/901660/ {'minimum': 'Windows XP or Vista, 1.5GHz proce... [] [] ['Telltale Games'] ['Telltale Games'] NaN [358, 539] [{'name': 'default', 'title': 'Buy Sam & Max 1... windows [{'id': 2, 'description': 'Single-player'}] [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '15 Jun, 2007'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/821... {'ids': [], 'notes': None}

The first few rows contain some big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.

We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:

https://store.steampowered.com/app/[steam_appid]

This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.

In [23]:
def print_steam_links(df):
    """Print links to store page for apps in a dataframe."""
    url_base = "https://store.steampowered.com/app/"
    
    for i, row in df.iterrows():
        appid = row['steam_appid']
        name = row['name']
        
        print(name + ':', url_base + str(appid))
        

print_steam_links(not_free_and_null_price[:5])
The Ship: Single Player: https://store.steampowered.com/app/2420
RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210

For these games we can conclude that:

  • The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
  • RollerCoaster Tycoon 3: Platinum has been removed from steam (and another game website: GOG)
    • "A spokesperson for GOG told Eurogamer it pulled the game "due to expiring licensing rights", and stressed it'll talk with "new distribution rights holders" to bring the game back as soon as possible." Source: Eurogamer
  • BioShock has been replaced by BioShock Remastered
  • Sam & Max 101 is sold as part of a season, and this can be found in the package_groups column

So we have a few options here: we could simply drop these rows, we could try to figure out the price from the package_groups column, or we could leave them for now and return to them later. We'll leave them for now, handle the two price columns, then take a look at the packages next. It may also be that some of these rows are removed later in the cleaning process for other reasons.

If we want to find rows similar to these and deal with each case individually, we could use the .str.contains() method, as seen below.

In [24]:
platforms_df[platforms_df['name'].str.contains("BioShock™")]
Out[24]:
name steam_appid required_age is_free detailed_description about_the_game short_description supported_languages header_image website pc_requirements mac_requirements linux_requirements developers publishers price_overview packages package_groups platforms categories genres screenshots movies achievements release_date support_info background content_descriptors
220 BioShock™ 7670 0 False <h1>Special Offer</h1><p>Buying BioShock™ also... BioShock is a shooter unlike any you've ever p... BioShock is a shooter unlike any you've ever p... English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/767... http://www.BioShockGame.com {'minimum': '<h2 class="bb_tag"><strong>Minimu... {'minimum': 'Please See BioShock Remastered'} [] ['2K Boston', '2K Australia'] ['2K'] NaN [451, 127633] [{'name': 'default', 'title': 'Buy BioShock™',... windows [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '21 Aug, 2007'} {'url': 'support.2k.com', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/767... {'ids': [], 'notes': None}
7734 BioShock™ Remastered 409710 18 False <h1>Special Offer</h1><p>Buying BioShock™ Rema... BioShock is a shooter unlike any you've ever p... BioShock is a shooter unlike any other, loaded... English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/409... http://www.BioShockGame.com {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... ['2K Boston', '2K Australia', 'Blind Squirrel'... ['2K', 'Feral Interactive (Mac)'] {'currency': 'GBP', 'initial': 999, 'final': 9... [451, 127633] [{'name': 'default', 'title': 'Buy BioShock™ R... windows;mac [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 65, 'highlighted': [{'name': 'Comple... {'coming_soon': False, 'date': '15 Sep, 2016'} {'url': 'support.2k.com', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/409... {'ids': [], 'notes': None}
7735 BioShock™ 2 Remastered 409720 18 False <h1>Special Offer</h1><p>Buying BioShock 2™ Re... BioShock 2 provides players with the perfect b... In BioShock 2, you step into the boots of the ... English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/409... http://www.bioshockgame.com/ {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... ['2K Marin', '2K China', 'Digital Extremes', '... ['2K'] {'currency': 'GBP', 'initial': 1399, 'final': ... [81419, 127633] [{'name': 'default', 'title': 'Buy BioShock™ 2... windows [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 53, 'highlighted': [{'name': "Daddy'... {'coming_soon': False, 'date': '15 Sep, 2016'} {'url': 'support.2k.com', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/409... {'ids': [5], 'notes': None}

Now we need to figure out how to process the column.

If we take a look at the data for the first row, we can see that there are a variety of formats in which the price is stored. There is a currency, GBP, which is perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?

In [25]:
platforms_df['price_overview'][0]
Out[25]:
"{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '£7.19'}"

If we inspect another row, we see that there is an active discount, applying an 80% price reduction to the title. It looks like initial contains the normal price before discount, and final contains the discounted price. initial_formatted and final_formatted contain the price formatted for display in the local currency. We don't have to worry about these last two, as we'll be storing the price as a float (or integer), and if we wanted we could format it like this when printing.

With all this in mind, it looks like we'll be checking the value under the currency key, and using the value in the initial key.

In [26]:
platforms_df['price_overview'][37]
Out[26]:
"{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80, 'initial_formatted': '£22.99', 'final_formatted': '£4.59'}"

Now the preliminary investigation is complete, we can begin defining our function.

We start by evaluating the strings using literal_eval as before; if a value is null, we instead return a properly formatted dictionary with -1 as the initial value. This will allow us to fill in a value of 0 for free games, then be left with an easily targetable value for the genuinely missing rows.

Next we create currency and price columns from the dictionaries in the price_overview column. For each we define an anonymous function on the fly using a lambda expression, returning the value stored under the relevant key.

In [27]:
def process_price(df):
    df = df.copy()
        
    def parse_price(x):
        if x is not np.nan:
            return literal_eval(x)
        else:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # Create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # Set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    return df

price_data = process_price(platforms_df)[['name', 'currency', 'price']]
price_data.head()
Out[27]:
name currency price
0 Counter-Strike GBP 719
1 Team Fortress Classic GBP 399
2 Day of Defeat GBP 399
3 Deathmatch Classic GBP 399
4 Half-Life: Opposing Force GBP 399

We're almost finished, but let's check if any games don't have GBP listed as the currency.

In [28]:
price_data[price_data['currency'] != 'GBP']
Out[28]:
name currency price
991 Robin Hood: The Legend of Sherwood USD 799
5767 Assassin’s Creed® Chronicles: India EUR 999
27593 Mortal Kombat 11 USD 5999
27995 Pagan Online EUR 2699

For some reason there are four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows it's easier and safer to simply drop them.
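
For reference, had we wanted to keep those four rows, a rough conversion could be applied to the price_data frame built above (prices are still in pence/cents at this point). The exchange rates below are made-up placeholders purely to illustrate the approach, not real values:

# illustrative sketch only - placeholder exchange rates
to_gbp = {'GBP': 1.0, 'USD': 0.80, 'EUR': 0.90}

converted = price_data.copy()
converted['price'] = converted.apply(lambda row: round(row['price'] * to_gbp[row['currency']]), axis=1)
converted['currency'] = 'GBP'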

We can also divide the prices by 100 so they are displayed as floats in pounds.

In [29]:
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
    
    def parse_price(x):
        if x is not np.nan:
            return literal_eval(x)
        else:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df.loc[df['price'] > 0, 'price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


price_df = process_price(platforms_df)
price_df[['name', 'price']].head()
Out[29]:
name price
0 Counter-Strike 7.19
1 Team Fortress Classic 3.99
2 Day of Defeat 3.99
3 Deathmatch Classic 3.99
4 Half-Life: Opposing Force 3.99

Processing Packages

We can now take a look at the packages and package_groups columns to help decide what to do with rows that are missing price data. We're not particularly interested in these columns themselves, as they don't appear to provide much new useful information beyond which games are sold together as part of a bundle.

In [30]:
# temporarily set a pandas option using with and option_context
with pd.option_context("display.max_colwidth", 500):
    display(price_df[['steam_appid', 'packages', 'package_groups', 'price']].head(3))
steam_appid packages package_groups price
0 10 [7] [{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}] 7.19
1 20 [29] [{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] 3.99
2 30 [30] [{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] 3.99

Overall we have 846 rows with missing price data, which we previously set to -1.

In [31]:
print(price_df[price_df['price'] == -1].shape[0])
846

We can split these rows into two categories: those with package_groups data and those without.

If we take a quick look at the package_groups column, we see that there appear to be no null values. On closer inspection, we find that rows without data are actually stored as strings containing empty lists ("[]") rather than as nulls.

In [32]:
print('Null counts:', price_df['package_groups'].isnull().sum())
print('Empty list counts:', price_df[price_df['package_groups'] == "[]"].shape[0])
Null counts: 0
Empty list counts: 3353

Using a combination of filters, we can find out how many rows have both missing price and package_group data and investigate. We'll count the rows and print links to some of the store pages and look for patterns.

In [33]:
missing_price_and_package = price_df[(price_df['price'] == -1) & (price_df['package_groups'] == "[]")]

print('Number of rows:', missing_price_and_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])
Number of rows: 799 

First few rows:

RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520
LUMINES™ Advance Pack: https://store.steampowered.com/app/11920
Midnight Club 2: https://store.steampowered.com/app/12160
Age of Booty™: https://store.steampowered.com/app/21600

Last few rows:

RoboVirus: https://store.steampowered.com/app/1001870
soko loco deluxe: https://store.steampowered.com/app/1003730
POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710
The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600
Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670

Most of the games with missing price data - 799 of 846 - fall into this category, which probably means they can be safely removed.

Following the store page links for the first few rows, it looks like these titles are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't yet been released and so haven't had a price set. We'll take care of all the unreleased games when we clean the release_date column, but we can remove all of these apps here.
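
If we wanted to verify that last point, a quick check of the coming_soon flag stored in the release_date column would be one way to do it (a sketch only - we'll deal with release dates properly in a later section):

# sketch: count how many of these apps are flagged as not yet released
coming_soon_flags = missing_price_and_package['release_date'].apply(
    lambda x: literal_eval(x)['coming_soon'] if isinstance(x, str) else False)

print('Flagged as coming soon:', coming_soon_flags.sum())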

Let's now take a look at the rows that have missing price data but do have package_groups data. We may be interested in keeping these rows and extracting price data from the package data.

In [34]:
missing_price_have_package = price_df.loc[(price_df['price'] == -1) & (price_df['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]

print('Number of rows:', missing_price_have_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])
Number of rows: 47 

First few rows:

The Ship: Single Player: https://store.steampowered.com/app/2420
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210
Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220

Last few rows:

Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520
Space Hulk: Deathwing: https://store.steampowered.com/app/298900
7,62 Hard Life: https://store.steampowered.com/app/306290
Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730
Rad Rodgers: World One: https://store.steampowered.com/app/353580

Looking at a selection of these rows, the games appear to be either superseded by a newer release or remaster, part of a bigger bundle or episodic collection, or included with the purchase of another game.

Whilst we could extract prices from the package_groups data, the most sensible option seems to be removing these rows. There are only 47 rows this applies to, and any with a newer release will still have the re-release in the data.

Since our logic interacts heavily with the price data, we will update the process_price function rather than creating a new one.

In [35]:
def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        if x is not np.nan:
            return literal_eval(x)
        else:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1
    df = df[df['price'] != -1]
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


price_df = process_price(platforms_df)
price_df.head()
Out[35]:
name steam_appid required_age detailed_description about_the_game short_description supported_languages header_image website pc_requirements mac_requirements linux_requirements developers publishers platforms categories genres screenshots movies achievements release_date support_info background content_descriptors price
0 Counter-Strike 10 0 Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... English<strong>*</strong>, French<strong>*</st... https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] windows;mac;linux [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Nov, 2000'} {'url': 'http://steamcommunity.com/app/10', 'e... https://steamcdn-a.akamaihd.net/steam/apps/10/... {'ids': [2, 5], 'notes': 'Includes intense vio... 7.19
1 Team Fortress Classic 20 0 One of the most popular online action games of... One of the most popular online action games of... One of the most popular online action games of... English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] windows;mac;linux [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Apr, 1999'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/20/... {'ids': [2, 5], 'notes': 'Includes intense vio... 3.99
2 Day of Defeat 30 0 Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... English, French, German, Italian, Spanish - Spain https://steamcdn-a.akamaihd.net/steam/apps/30/... http://www.dayofdefeat.com/ {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] windows;mac;linux [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 May, 2003'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/30/... {'ids': [], 'notes': None} 3.99
3 Deathmatch Classic 40 0 Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... English, French, German, Italian, Spanish - Sp... https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Valve'] ['Valve'] windows;mac;linux [{'id': 1, 'description': 'Multi-player'}, {'i... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Jun, 2001'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/40/... {'ids': [], 'notes': None} 3.99
4 Half-Life: Opposing Force 50 0 Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... English, French, German, Korean https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... ['Gearbox Software'] ['Valve'] windows;mac;linux [{'id': 2, 'description': 'Single-player'}, {'... [{'id': '1', 'description': 'Action'}] [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 0} {'coming_soon': False, 'date': '1 Nov, 1999'} {'url': 'https://help.steampowered.com', 'emai... https://steamcdn-a.akamaihd.net/steam/apps/50/... {'ids': [], 'notes': None} 3.99
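As a quick sanity check (not part of the pipeline itself), we can confirm that no placeholder prices remain after running the updated function - the count below should be zero.

# count any remaining rows with a negative (placeholder) price
print((price_df['price'] < 0).sum())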

The next columns in the data are descriptive columns - detailed_description, about_the_game and short_description. We won't be handling them now, instead returning to them later on when we deal with export columns. These are columns where we will export all or some of the data to a separate csv file as part of the cleaning.

Processing Languages

Beyond that, the next column is supported_languages. As we will be performing the analysis for an English company, we are only interested in games available in English. Whilst we could remove non-English games at this stage, instead we will create a column marking English games with a boolean value - True or False.

We begin as usual by looking for rows with null values.

In [36]:
price_df['supported_languages'].isnull().sum()
Out[36]:
4

Taking a closer look at these apps, it doesn't look like there's anything wrong with them. It may be that the data simply wasn't supplied. As there are only 4 rows affected we will go ahead and remove these from the data set.

In [37]:
price_df[price_df['supported_languages'].isnull()]
Out[37]:
name steam_appid required_age detailed_description about_the_game short_description supported_languages header_image website pc_requirements mac_requirements linux_requirements developers publishers platforms categories genres screenshots movies achievements release_date support_info background content_descriptors price
4866 Subsiege 338640 0 <img src="https://steamcdn-a.akamaihd.net/stea... <img src="https://steamcdn-a.akamaihd.net/stea... Subsiege is an intense real-time tactic game w... NaN https://steamcdn-a.akamaihd.net/steam/apps/338... http://subsiege-game.com/ {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] ['Icebird Studios'] ['Icebird Studios'] windows NaN NaN [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256729398, 'name': 'Release Trailer', ... {'total': 0} {'coming_soon': False, 'date': '7 Sep, 2018'} {'url': 'http://subsiege-game.com/', 'email': ... https://steamcdn-a.akamaihd.net/steam/apps/338... {'ids': [], 'notes': None} 17.89
14560 MARS VR(全球使命VR) 596560 0 1.\t4K level audio-visual experience <br />\r\... 1.\t4K level audio-visual experience <br />\r\... Welcome to 《Mars VR》. This is an immersive fir... NaN https://steamcdn-a.akamaihd.net/steam/apps/596... http://qqsm.zygames.com/ {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... ['Ying Pei Digital Technology Shanghai Co., Li... ['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD'] windows [{'id': 2, 'description': 'Single-player'}] [{'id': '73', 'description': 'Violent'}, {'id'... [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256681371, 'name': 'marsvr', 'thumbnai... {'total': 0} {'coming_soon': False, 'date': '5 Apr, 2017'} {'url': 'http://www.zygames.com/contact', 'ema... https://steamcdn-a.akamaihd.net/steam/apps/596... {'ids': [], 'notes': None} 1.99
16386 Numberline 2 654970 0 NumberLine 2 is the continuation of the popula... NumberLine 2 is the continuation of the popula... NumberLine 2 is the continuation of the popula... NaN https://steamcdn-a.akamaihd.net/steam/apps/654... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] ['V34D4R', 'Egor Magurin'] ['Indovers Studio'] windows [{'id': 2, 'description': 'Single-player'}] [{'id': '4', 'description': 'Casual'}, {'id': ... [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256687192, 'name': 'Numberline 2 Trail... {'total': 60, 'highlighted': [{'name': '1st le... {'coming_soon': False, 'date': '14 Jul, 2017'} {'url': '', 'email': 'radaew.zhenya@yandex.ru'} https://steamcdn-a.akamaihd.net/steam/apps/654... {'ids': [], 'notes': None} 1.59
26855 SNUSE 221 948070 0 <strong> Hey. My name is *&amp;#!$.<br>Today I... <strong> Hey. My name is *&amp;#!$.<br>Today I... Hey. My name is *&amp;#!$. Today I will tell y... NaN https://steamcdn-a.akamaihd.net/steam/apps/948... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] ['SNUSE GM'] ['SNUSE GM'] windows [{'id': 2, 'description': 'Single-player'}] [{'id': '25', 'description': 'Adventure'}, {'i... [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256745662, 'name': 'snuse', 'thumbnail... {'total': 0} {'coming_soon': False, 'date': '2 Apr, 2019'} {'url': 'vk.com/nilow_i', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/948... {'ids': [], 'notes': None} 0.79

Next we'll take a look at the structure of the column. By looking at the value for the first row and the values for the most common rows, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of html and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.

In [38]:
print(price_df['supported_languages'][0])
price_df['supported_languages'].value_counts().head(10)
English<strong>*</strong>, French<strong>*</strong>, German<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Simplified Chinese<strong>*</strong>, Traditional Chinese<strong>*</strong>, Korean<strong>*</strong><br><strong>*</strong>languages with full audio support
Out[38]:
English                                                                                                        8512
English<strong>*</strong><br><strong>*</strong>languages with full audio support                               7409
English, Russian                                                                                                707
English, Simplified Chinese                                                                                     280
English, Japanese                                                                                               235
English<strong>*</strong>, Russian<strong>*</strong><br><strong>*</strong>languages with full audio support     222
English, French, Italian, German, Spanish - Spain                                                               180
English, German                                                                                                 161
Simplified Chinese                                                                                              157
English, French                                                                                                 143
Name: supported_languages, dtype: int64

It looks like English-only games make up a little over half the rows in our dataset (~16,000), and English plus other languages make up many more. We could create columns for any of the other languages by string searching, but for simplicity we'll focus on just the English ones.
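If we ever wanted flags for other languages too, the same string-search idea could be generalised. Below is a rough sketch, where the list of languages is just an example and the isinstance guard handles the few rows with missing data:

# hypothetical example: one boolean column per language of interest
other_languages = ['french', 'german', 'russian', 'simplified chinese']

for lang in other_languages:
    price_df[lang.replace(' ', '_')] = price_df['supported_languages'].apply(
        lambda x: 1 if isinstance(x, str) and lang in x.lower() else 0
    )

For this project, though, a single English flag is enough.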

Using the Series.apply method once again, we can check if the string 'english' appears in each row. We'll return a 1 if 'english' is found and 0 otherwise. Whilst we could use True/False, a binary 1/0 can be interpreted as a boolean value and saves a little space in the csv file.

Inside the lambda function, the variable x will take on the value of each row as the expression is evaluated. We apply the lower() string method so capitalisation doesn't matter.

In [39]:
def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


language_df = process_language(price_df)
language_df[['name', 'english']].head()
Out[39]:
name english
0 Counter-Strike 1
1 Team Fortress Classic 1
2 Day of Defeat 1
3 Deathmatch Classic 1
4 Half-Life: Opposing Force 1

Before moving on, we can take a quick look at the results and see that most of the apps support English.

In [40]:
language_df['english'].value_counts()
Out[40]:
1    27699
0      522
Name: english, dtype: int64

Processing Developers and Publishers

We'll skip over the next few columns and take a look at the developers and publishers columns. They will most likely contain similar information so we can look at them together.

We'll start by checking the null counts. While the publishers column doesn't appear to have any null values, searching for empty lists reveals over 200 'hidden' missing values.

In [41]:
print('Developers null counts:', language_df['developers'].isnull().sum())
print('Developers empty list counts:', language_df[language_df['developers'] == "['']"].shape[0])

print('\nPublishers null counts:', language_df['publishers'].isnull().sum())
print('Publishers empty list counts:', language_df[language_df['publishers'] == "['']"].shape[0])
Developers null counts: 104
Developers empty list counts: 0

Publishers null counts: 0
Publishers empty list counts: 213

Ultimately we want a data set with no missing values. That means we have a few options for dealing with these two columns:

  • Remove all rows missing either developer or publisher information
  • Impute missing information by replacing the missing column with the column we have (i.e. if developers is missing, fill it with the value in publishers)
  • Fill missing information with 'Unknown' or 'None'

We can investigate some of the rows with missing data to help inform our decision.

In [42]:
no_dev = language_df[language_df['developers'].isnull()]

print('Total games missing developer:', no_dev.shape[0], '\n')

print_steam_links(no_dev[:5])

no_pub = language_df[language_df['publishers'] == "['']"]

print('\nTotal games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])

no_dev_or_pub = language_df[(language_df['developers'].isnull()) & (language_df['publishers'] == "['']")]

print('\nTotal games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])
Total games missing developer: 104 

Tycoon City: New York: https://store.steampowered.com/app/9730
Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370
Crash Time 2: https://store.steampowered.com/app/11390
Hunting Unlimited 2010: https://store.steampowered.com/app/12690
18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730

Total games missing publisher: 213 

RIP - Trilogy™: https://store.steampowered.com/app/2540
Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570
Bullet Candy: https://store.steampowered.com/app/6600
AudioSurf: https://store.steampowered.com/app/12900
Everyday Shooter: https://store.steampowered.com/app/16300

Total games missing developer and publisher: 67 

PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370
Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350
A Walk in the Dark: https://store.steampowered.com/app/248730
Forge Quest: https://store.steampowered.com/app/249950
Vox: https://store.steampowered.com/app/252770

It appears we are looking at a mix of mostly smaller titles, some of which are indie games that may have been self-published. Others, judging by searches for the titles elsewhere, simply have wrong or missing data. As our priority is creating a clean data set, and only a few hundred rows are affected, it will be fine to remove them from the data.

Let's take a look at the structure of the data. Below we inspect some rows near the beginning of the dataframe. It looks like both columns are stored as lists which can have one or multiple values. We'll have to evaluate the rows as before, so they are recognised as lists, then index into them accordingly.

In [43]:
language_df[['developers', 'publishers']].iloc[24:28]
Out[43]:
developers publishers
24 ['Valve'] ['Valve']
25 ['Valve', 'Hidden Path Entertainment'] ['Valve']
27 ['Mark Healey'] ['Mark Healey']
28 ['Tripwire Interactive'] ['Tripwire Interactive']

As we have some single values and some multiple, we have to decide how to handle them. Here are some potential solutions:

  • Create a column for each value in the list (i.e. developer_1, developer_2)
  • Create a column with the first value in the list and a column with the rest of the values (i.e. developer_1, other_developers)
  • Create a column with the first value in the list and disregard the rest
  • Combine all values into one column, simply unpacking the list

Let's begin defining our function, and take a look at how many rows have multiple developers or publishers. After evaluating each row, we can find the length of the lists in each row by using the Series.str.len() method. By filtering only rows where the list has more than one element, we can find the number of rows with more than one value in each column.

In [44]:
def process_developers_and_publishers(df):
    # remove rows with missing data
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    
    for col in ['developers', 'publishers']:
        df[col] = df[col].apply(lambda x: literal_eval(x))
        
        # filter dataframe to rows with lists longer than 1, and store the number of rows
        num_rows = df[df[col].str.len() > 1].shape[0]
        
        print('Rows in {} column with multiple values:'.format(col), num_rows)

process_developers_and_publishers(language_df)
Rows in developers column with multiple values: 1720
Rows in publishers column with multiple values: 884

It turns out that the vast majority have only one value for these columns. If we went with the first or second solutions above, we'd be left with columns with mostly missing data. We could go with the third option, but the first value in the list isn't necessarily the most important, and this seems unfair if multiple teams were involved.

The best way forward seems to be the fourth option - if there are multiple values we combine them into the same column. We'll do this by calling str.join() on the list, as we did before. If we pass a list with only one value, we get a string containing just that value. If we pass a list with multiple values, we get a single string with the values separated by the chosen delimiter, as desired. We can see this below.

In [45]:
', '.join(['one item'])
Out[45]:
'one item'
In [46]:
', '.join(['multiple', 'different', 'items'])
Out[46]:
'multiple, different, items'

We can't join on a comma as a number of developers and publishers have a comma in their name, a couple of which can be seen below.

In [47]:
language_df.loc[language_df['developers'].str.contains(",", na=False), ['steam_appid', 'developers', 'publishers']].head(4)
Out[47]:
steam_appid developers publishers
25 730 ['Valve', 'Hidden Path Entertainment'] ['Valve']
66 2520 ['CINEMAX, s.r.o.'] ['CINEMAX, s.r.o.']
73 2630 ['Infinity Ward', 'Aspyr (Mac)'] ['Activision', 'Aspyr (Mac)']
97 3300 ['PopCap Games, Inc.'] ['PopCap Games, Inc.']

Instead we can join on a semicolon (;). Only 3 rows contain a semicolon in their name, so we'll remove these. Handling it this way means we'll be able to identify and split individual developer/publisher names in the future.

In [48]:
language_df.loc[language_df['developers'].str.contains(";", na=False), ['steam_appid', 'developers', 'publishers']]
Out[48]:
steam_appid developers publishers
9550 460210 ['bool games;'] ['bool games;']
13489 568480 [';)', 'Quickdraw Studios'] ['Quickdraw Studios']
16871 665890 ['Semicolon;'] ['Semicolon;']

There are also a small number of rows that have ['NA'] or ['N/A'] for publisher. These are some really well hidden null values, which didn't actually surface until much later in the original development process. This helps highlight the iterative nature of data cleaning - you may discover errors in rows or data that previously went unnoticed, and have to go back and update or correct them.

In [49]:
language_df[(language_df['publishers'] == "['NA']") | (language_df['publishers'] == "['N/A']")].shape[0]
Out[49]:
12

Now we're ready to finish the function we started. We'll abandon the for loop, as there is not too much repetition, and run it on the data as always.

In [50]:
def process_developers_and_publishers(df):
    """Parse columns as semicolon-separated string."""
    # remove rows with missing data (~ means not)
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    df = df[~(df['developers'].str.contains(';')) & ~(df['publishers'].str.contains(';'))]
    df = df[(df['publishers'] != "['NA']") & (df['publishers'] != "['N/A']")]
    
    # join each list into a semicolon-separated string
    df['developer'] = df['developers'].apply(lambda x: ';'.join(literal_eval(x)))
    df['publisher'] = df['publishers'].apply(lambda x: ';'.join(literal_eval(x)))

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df

dev_pub_df = process_developers_and_publishers(language_df)
dev_pub_df[['name', 'steam_appid', 'developer', 'publisher']].head()
Out[50]:
name steam_appid developer publisher
0 Counter-Strike 10 Valve Valve
1 Team Fortress Classic 20 Valve Valve
2 Day of Defeat 30 Valve Valve
3 Deathmatch Classic 40 Valve Valve
4 Half-Life: Opposing Force 50 Gearbox Software Valve
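Since the values are joined on a semicolon, individual names can easily be recovered later if needed - for example, splitting the developer column back into lists (illustrative only):

# split the semicolon-separated string back into a list of names per row
dev_pub_df['developer'].str.split(';').head()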

Processing Categories and Genres

We'll take a look at the categories and genres columns next. Below we take a look at the null counts and structure of the data. Both appear to be a list of dictionaries containing an id and description key-value pair. Similar to our handling of developers and publishers, it may be best to extract just the descriptions into a list for now. We could make a list of the IDs, keeping track of the corresponding description externally, but that seems overly complex for our purposes.

In [51]:
print('Categories:\n')
print('Null values:', dev_pub_df['categories'].isnull().sum())
print()
print(dev_pub_df['categories'][0])

print('\nGenres:\n')
print('Null values:', dev_pub_df['genres'].isnull().sum())
print()
print(dev_pub_df['genres'].iloc[0])
print(dev_pub_df['genres'].iloc[1000])
Categories:

Null values: 509

[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]

Genres:

Null values: 37

[{'id': '1', 'description': 'Action'}]
[{'id': '3', 'description': 'RPG'}, {'id': '2', 'description': 'Strategy'}]

Before we begin defining a function we'll inspect some of the null rows, then we can decide how to handle them.

Using the pandas DataFrame.sample method, we can randomly sample a number of rows from the dataset. We set a random_state so the output is the same each time.

In [52]:
print_steam_links(dev_pub_df[dev_pub_df['categories'].isnull()].sample(5, random_state=0))
Driver Booster 3 for STEAM: https://store.steampowered.com/app/403040
Swanky Paint: https://store.steampowered.com/app/432030
ResumeMaker® Professional Deluxe 20: https://store.steampowered.com/app/707280
CyberLink PhotoDirector 9 Ultra: https://store.steampowered.com/app/679550
OVRdrop: https://store.steampowered.com/app/586210

All of the above rows with null categories data are applications or software of some kind, and not actually games. It would definitely be best to remove these, as we are interested in analysing games from the Steam store.

Below we take a look at rows with missing genres data. There doesn't seem to be anything wrong with these games, and it suggests that genre data simply wasn't supplied. As there are only 37 rows affected, we'll remove these rows to keep our dataset complete.

In [53]:
print_steam_links(dev_pub_df[dev_pub_df['genres'].isnull()].sample(5, random_state=0))
Antagonist: https://store.steampowered.com/app/585260
Lonelyland VR: https://store.steampowered.com/app/542620
ChessBase 13 Academy: https://store.steampowered.com/app/377340
Sonic the Hedgehog 4 - Episode I: https://store.steampowered.com/app/202530
VEHICLES FURY: https://store.steampowered.com/app/749290

As we did for the developer and publisher columns, we can create an anonymous function and join the results on a semicolon. Inside the function we use a list comprehension to traverse each dictionary and extract the value under the description key, as seen below.

In [54]:
example_category = "[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}]"

[x['description'] for x in literal_eval(example_category)]
Out[54]:
['Multi-player', 'Online Multi-Player', 'Local Multi-Player']

We then use the apply function again to turn each column into a simple delimited list.

In [55]:
def process_categories_and_genres(df):
    df = df.copy()
    df = df[(df['categories'].notnull()) & (df['genres'].notnull())]
    
    for col in ['categories', 'genres']:
        df[col] = df[col].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))
    
    return df

cat_gen_df = process_categories_and_genres(dev_pub_df)
cat_gen_df[['steam_appid', 'categories', 'genres']].head()
Out[55]:
steam_appid categories genres
0 10 Multi-player;Online Multi-Player;Local Multi-P... Action
1 20 Multi-player;Online Multi-Player;Local Multi-P... Action
2 30 Multi-player;Valve Anti-Cheat enabled Action
3 40 Multi-player;Online Multi-Player;Local Multi-P... Action
4 50 Single-player;Multi-player;Valve Anti-Cheat en... Action

Processing Achievements and Content Descriptors

The final columns we will take a look at before moving on to export columns are achievements and content_descriptors. Let's take a look at the null counts for each column and a small sample of rows.

In [56]:
print('Achievements null counts:', cat_gen_df['achievements'].isnull().sum())
print('Content Descriptors null counts:', cat_gen_df['content_descriptors'].isnull().sum())

cat_gen_df[['name', 'achievements', 'content_descriptors']].iloc[8:13]
Achievements null counts: 1856
Content Descriptors null counts: 0
Out[56]:
name achievements content_descriptors
8 Half-Life: Blue Shift {'total': 0} {'ids': [], 'notes': None}
9 Half-Life 2 {'total': 33, 'highlighted': [{'name': 'Defian... {'ids': [], 'notes': None}
10 Counter-Strike: Source {'total': 147, 'highlighted': [{'name': 'Someo... {'ids': [2, 5], 'notes': 'Includes intense vio...
11 Half-Life: Source {'total': 0} {'ids': [], 'notes': None}
12 Day of Defeat: Source {'total': 54, 'highlighted': [{'name': 'Double... {'ids': [], 'notes': None}

It looks like both columns are stored as dictionaries, with standard formats if no details are provided or exist.

Below we take a closer look at a single row from the achievements column.

In [57]:
literal_eval(cat_gen_df['achievements'][9])
Out[57]:
{'total': 33,
 'highlighted': [{'name': 'Defiant',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},
  {'name': 'Submissive',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},
  {'name': 'Malcontent',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},
  {'name': 'What cat?',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},
  {'name': 'Trusty Hardware',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},
  {'name': 'Barnacle Bowling',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},
  {'name': "Anchor's Aweigh!",
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},
  {'name': 'Heavy Weapons',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboatgun.jpg'},
  {'name': 'Vorticough',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_find_vortigauntcave.jpg'},
  {'name': 'Revenge!',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_chopper.jpg'}]}

There are two keys in the top level of the dictionary: total and highlighted. The highlighted data looks too specific - it is a selection of achievements particular to each game - so we will remove it. It may be worthwhile extracting the total value, though.

Now let's take a look at the content_descriptors column.

In [58]:
cat_gen_df['content_descriptors'].value_counts().head(6)
Out[58]:
{'ids': [], 'notes': None}                                                                                                                                                                  25398
{'ids': [2, 5], 'notes': None}                                                                                                                                                                427
{'ids': [1, 5], 'notes': None}                                                                                                                                                                250
{'ids': [5], 'notes': None}                                                                                                                                                                   127
{'ids': [1, 2, 5], 'notes': None}                                                                                                                                                             122
{'ids': [2, 5], 'notes': 'This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, General Mature Content'}        8
Name: content_descriptors, dtype: int64

Content descriptors contain age-related warnings about the content of a game. They are identified by numeric IDs, with optional notes supplied. Over 25,000 rows have an empty list, indicating either that there are no content descriptors or that none were provided. We already have the required_age column, which gives us similar information. Because of this, and because the notes are highly specific to each game, we will drop this column entirely.

Let's now define a function, taking a look at the value counts to verify everything went as expected.

In [59]:
def process_achievements_and_descriptors(df):
    """Parse as total number of achievements."""
    df = df.copy()
    
    df = df.drop('content_descriptors', axis=1)
    
    def parse_achievements(x):
        if x is np.nan:
            # missing data, assume has no achievements
            return 0
        else:
            # else has data, so can extract and return number under total
            return literal_eval(x)['total']
        
    df['achievements'] = df['achievements'].apply(parse_achievements)
    
    return df

achiev_df = process_achievements_and_descriptors(cat_gen_df)
achiev_df['achievements'].value_counts().head()
Out[59]:
0     11966
10      697
12      624
20      579
15      500
Name: achievements, dtype: int64

It looks like we were successful. We'll leave this column as it is for now, however we may wish to consider grouping the values together in bins, like we did for the age column. This is a decision we can make during the feature engineering stage of our analysis, and we can decide at that point if it will be more useful.
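For reference, binning could be done with pandas.cut - a rough sketch, where the bin edges and labels are arbitrary examples rather than final choices:

# hypothetical binning of achievement counts into coarse groups
achievement_bins = pd.cut(achiev_df['achievements'],
                          bins=[-1, 0, 10, 50, np.inf],
                          labels=['none', 'few', 'some', 'many'])
achievement_bins.value_counts()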

Let's now add these functions into the process function and run it on the raw data. This isn't strictly necessary but it will keep things organised and ensure we don't accidentally skip running a function.

We'll then inspect everything we've completed so far. As you will see, there is still plenty more left to do.

In [60]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    df = process_categories_and_genres(df)
    df = process_achievements_and_descriptors(df)
    
    return df

partially_clean = process(raw_steam_data)
partially_clean.head()
Out[60]:
name steam_appid required_age detailed_description about_the_game short_description header_image website pc_requirements mac_requirements linux_requirements platforms categories genres screenshots movies achievements release_date support_info background price english developer publisher
0 Counter-Strike 10 0 Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '1 Nov, 2000'} {'url': 'http://steamcommunity.com/app/10', 'e... https://steamcdn-a.akamaihd.net/steam/apps/10/... 7.19 1 Valve Valve
1 Team Fortress Classic 20 0 One of the most popular online action games of... One of the most popular online action games of... One of the most popular online action games of... https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '1 Apr, 1999'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/20/... 3.99 1 Valve Valve
2 Day of Defeat 30 0 Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... https://steamcdn-a.akamaihd.net/steam/apps/30/... http://www.dayofdefeat.com/ {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... windows;mac;linux Multi-player;Valve Anti-Cheat enabled Action [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '1 May, 2003'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/30/... 3.99 1 Valve Valve
3 Deathmatch Classic 40 0 Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '1 Jun, 2001'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/40/... 3.99 1 Valve Valve
4 Half-Life: Opposing Force 50 0 Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... windows;mac;linux Single-player;Multi-player;Valve Anti-Cheat en... Action [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '1 Nov, 1999'} {'url': 'https://help.steampowered.com', 'emai... https://steamcdn-a.akamaihd.net/steam/apps/50/... 3.99 1 Gearbox Software Valve

Export Columns

There are a number of columns containing information that is either not useful to the current project or is too complex to be useful for now. We may wish to incorporate them into a future project, but for now we will export the columns to separate files and remove them from the dataset.

Processing Description Columns

We have a series of columns with descriptive text about each game: detailed_description, about_the_game and short_description. As the column names imply, these provide information about each game in string format. This is great for human understanding, but much trickier for machines to interpret.

These columns could be used as the basis for an interesting recommender system or sentiment analysis project, however they are not required in our current project. We'll be removing them as they take up large amounts of space, and will only serve to slow down any computation on the data.

We'll briefly inspect the columns, in case of anomalies, and export just the description data to a separate file.

In [61]:
partially_clean[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()
Out[61]:
detailed_description    14
about_the_game          14
short_description       14
dtype: int64

We have 14 rows with missing data for these columns, and chances are the 14 rows with missing detailed_description are the rows with missing about_the_game and short_description data too.

By inspecting the individual rows below, we can see that this is true - all rows with missing data in one description column have missing data in the others as well.

In [62]:
partially_clean[partially_clean['detailed_description'].isnull()].head(3)
Out[62]:
name steam_appid required_age detailed_description about_the_game short_description header_image website pc_requirements mac_requirements linux_requirements platforms categories genres screenshots movies achievements release_date support_info background price english developer publisher
97 Bejeweled 2 Deluxe 3300 0 NaN NaN NaN https://steamcdn-a.akamaihd.net/steam/apps/330... NaN {'minimum': '<p><strong>Minimum Requirements:<... {'minimum': '<ul>\n\t<li><strong>OS:</strong> ... [] windows;mac Single-player Casual [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/330... 4.25 1 PopCap Games, Inc. PopCap Games, Inc.
98 Chuzzle Deluxe 3310 0 NaN NaN NaN https://steamcdn-a.akamaihd.net/steam/apps/331... NaN {'minimum': '<p><strong>Minimum Requirements:<... {'minimum': '<ul>\n\t<li><strong>OS:</strong> ... [] windows;mac Single-player Casual [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/331... 4.25 1 PopCap Games, Inc. PopCap Games, Inc.
99 Insaniquarium Deluxe 3320 0 NaN NaN NaN https://steamcdn-a.akamaihd.net/steam/apps/332... NaN {'minimum': '<strong>Minimum Requirements:</st... [] [] windows Single-player Casual [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} https://steamcdn-a.akamaihd.net/steam/apps/332... 4.25 1 PopCap Games, Inc. PopCap Games, Inc.

Interestingly, all of these titles are games from 2006 developed and published by PopCap Games. My best guess is that they were developed previously and all added to the Steam store in one go after Valve allowed third-party titles.
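We can also verify programmatically, rather than by eye, that the three columns are always missing together by comparing their null masks (a quick check, not part of the pipeline):

# True if a row missing any description column is missing all of them
desc_cols = ['detailed_description', 'about_the_game', 'short_description']
null_mask = partially_clean[desc_cols].isnull()
print(null_mask.any(axis=1).equals(null_mask.all(axis=1)))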

We'll remove these rows, as well as any with a description of less than 20 characters, like those below.

In [63]:
partially_clean[partially_clean['detailed_description'].str.len() <= 20]
Out[63]:
name steam_appid required_age detailed_description about_the_game short_description header_image website pc_requirements mac_requirements linux_requirements platforms categories genres screenshots movies achievements release_date support_info background price english developer publisher
10420 Penguins Cretins 490990 0 ... ... ... https://steamcdn-a.akamaihd.net/steam/apps/490... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... windows Single-player;Multi-player;Online Multi-Player... Action;Adventure;Casual;Indie;Massively Multip... [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 {'coming_soon': False, 'date': '22 Jun, 2016'} {'url': '', 'email': 'support@hfmgames.net'} https://steamcdn-a.akamaihd.net/steam/apps/490... 1.69 1 HFM Games HFM Games
19938 拼词游戏 2017 745840 0 带一点恐怖元素的休闲游戏 带一点恐怖元素的休闲游戏 一款有一点恐怖元素的休闲益智游戏。 https://steamcdn-a.akamaihd.net/steam/apps/745... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... [] windows;mac Single-player Adventure;Indie [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256699963, 'name': 'alpha', 'thumbnail... 11 {'coming_soon': False, 'date': '29 Nov, 2017'} {'url': '', 'email': '12668934@qq.com'} https://steamcdn-a.akamaihd.net/steam/apps/745... 0.79 0 Mianwotu Mianwotu
21937 God Test 797660 0 God Test God Test God Test https://steamcdn-a.akamaihd.net/steam/apps/797... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] windows Multi-player;Online Multi-Player;MMO;Online Co... Action;Massively Multiplayer;RPG;Strategy NaN NaN 0 {'coming_soon': False, 'date': '18 Apr, 2018'} {'url': '', 'email': 'insanegamedev@outlook.com'} NaN 0.00 1 God Test God Test
26219 В поисках Атлантиды 925640 0 Интересная игра Интересная игра Atlantis https://steamcdn-a.akamaihd.net/steam/apps/925... https://vk.com/atlantisforever {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] windows Single-player;Steam Achievements Adventure;Casual;Indie;Early Access [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256725871, 'name': 'Game', 'thumbnail'... 1 {'coming_soon': False, 'date': '1 Nov, 2018'} {'url': 'https://vk.com/atlantisforever', 'ema... https://steamcdn-a.akamaihd.net/steam/apps/925... 1.69 0 Dmitr Che Dmitr Che
26356 东方百问~TouHouAsked 930840 0 Null Null Null https://steamcdn-a.akamaihd.net/steam/apps/930... https://asked.touhou.ren/ {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] windows Single-player;Steam Achievements Casual;Indie;Strategy;Early Access [{'id': 0, 'path_thumbnail': 'https://steamcdn... [{'id': 256726640, 'name': 'TouHouAsked', 'thu... 2 {'coming_soon': False, 'date': '7 Oct, 2018'} {'url': 'https://asked.touhou.ren', 'email': '... https://steamcdn-a.akamaihd.net/steam/apps/930... 0.79 0 Root Nine Studio Root Nine Studio

To handle exporting the data to file, we'll write a reusable function which we can call upon for future columns. We will include the steam_appid column as it will allow us to match up these rows with rows in our primary dataset later on, using a merge (like a join in SQL).

In [64]:
def export_data(df, filename):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/steam_' + filename + '.csv'
    
    df.to_csv(filepath, index=False)
    
    print_name = filename.replace('_', ' ')
    print("Exported {} to '{}'".format(print_name, filepath))

We can now define a function to process and export the description columns.

In [65]:
def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    # remove rows with unusually small description
    df = df[df['detailed_description'].str.len() > 20]
    
    # by default we don't export, useful if calling function later
    if export:
        # create dataframe of description columns
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df

desc_df = process_descriptions(partially_clean, export=True)
Exported description data to '../data/exports/steam_description_data.csv'
In [66]:
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()
Out[66]:
steam_appid detailed_description about_the_game short_description
0 10 Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... Play the world's number 1 online action game. ...
1 20 One of the most popular online action games of... One of the most popular online action games of... One of the most popular online action games of...
2 30 Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ...
3 40 Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death...
4 50 Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ...
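Because steam_appid was included in the export, the descriptions can be merged back onto the main dataset whenever they're needed. A minimal sketch of what that merge might look like (purely illustrative):

# re-attach exported descriptions to the cleaned data using steam_appid as the key
descriptions = pd.read_csv('../data/exports/steam_description_data.csv')
combined = desc_df.merge(descriptions, on='steam_appid', how='left')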

Processing Media Columns

Similar to the description columns, we have three columns that contain links to various images: header_image, screenshots and background. Whilst we won't be needing this data in this project, it could open the door to some interesting image analysis in the future. We will treat these columns in almost the same way, exporting the contents to a csv file then removing them from the dataset.

Again, let's check for missing values.

In [67]:
image_cols = ['header_image', 'screenshots', 'background']

for col in image_cols:
    print(col+':', desc_df[col].isnull().sum())

desc_df[image_cols].head()
header_image: 0
screenshots: 4
background: 4
Out[67]:
header_image screenshots background
0 https://steamcdn-a.akamaihd.net/steam/apps/10/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/10/...
1 https://steamcdn-a.akamaihd.net/steam/apps/20/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/20/...
2 https://steamcdn-a.akamaihd.net/steam/apps/30/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/30/...
3 https://steamcdn-a.akamaihd.net/steam/apps/40/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/40/...
4 https://steamcdn-a.akamaihd.net/steam/apps/50/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/50/...

As with the description columns, it is likely that the 4 rows with no screenshots data are the same rows with no background data. There are so few that it is probably safe to remove them.

Before we make up our mind, let's inspect the rows.

In [68]:
no_screenshots = desc_df[desc_df['screenshots'].isnull()]

print_steam_links(no_screenshots)

no_screenshots
The Light Empire: https://store.steampowered.com/app/416220
Girl and Goblin: https://store.steampowered.com/app/880510
Arida: Backland's Awakening: https://store.steampowered.com/app/907760
Nukalypse: The Final War: https://store.steampowered.com/app/947940
Out[68]:
name steam_appid required_age header_image website pc_requirements mac_requirements linux_requirements platforms categories genres screenshots movies achievements release_date support_info background price english developer publisher
7970 The Light Empire 416220 0 https://steamcdn-a.akamaihd.net/steam/apps/416... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... [] [] windows Single-player;Steam Achievements Indie;RPG;Simulation NaN NaN 4 {'coming_soon': False, 'date': '2 Dec, 2015'} {'url': '', 'email': 'Jemy.TLE@outlook.com'} NaN 4.79 1 Jemy Jemy
24874 Girl and Goblin 880510 0 https://steamcdn-a.akamaihd.net/steam/apps/880... NaN {'minimum': '<strong>最低配置:</strong><br><ul cla... [] [] windows Single-player;Steam Achievements;Steam Trading... Adventure;Casual;RPG;Strategy NaN [{'id': 256739772, 'name': '3', 'thumbnail': '... 1552 {'coming_soon': False, 'date': '5 Jan, 2019'} {'url': '', 'email': 'smagician13@yahoo.com'} NaN 0.79 1 Inverse Game Inverse Game
25696 Arida: Backland's Awakening 907760 0 https://steamcdn-a.akamaihd.net/steam/apps/907... http://www.aridagame.com {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... windows;mac Single-player;Steam Achievements;Full controll... Adventure;Indie NaN [{'id': 256729551, 'name': 'Teaser Beta 2018',... 0 {'coming_soon': True, 'date': ''} {'url': 'http://www.aridagame.com', 'email': '... NaN 0.00 1 Aoca Game Lab Aoca Game Lab
26851 Nukalypse: The Final War 947940 0 https://steamcdn-a.akamaihd.net/steam/apps/947... NaN {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... windows Single-player;Steam Achievements Action;Indie;Early Access NaN [{'id': 256745274, 'name': 'Nukalypse: The Fin... 0 {'coming_soon': True, 'date': 'Oct 2019'} {'url': '', 'email': 'nukalypse@gmail.com'} NaN 0.00 1 Zion Games Studio Zion Games Studio

As predicted, the rows without screenshots are also the rows without a background. It looks like two are unreleased - if we'd already dealt with the release_date column, these would have been removed. One was released recently (5 Jan, 2019) and perhaps didn't have screenshots at the time of downloading, and one simply doesn't have any. Either way, it's safe to remove all of these rows.

There is also a movies column with similar data. It has more missing values - presumably for games without videos - and appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need this data, but we can include it in the export and then remove it from our data set.

In [69]:
print('Movies null values:', desc_df['movies'].isnull().sum())
print()

desc_df[desc_df['movies'].notnull()]['movies'].iloc[0]
Movies null values: 1708

Out[69]:
"[{'id': 904, 'name': 'Half-Life 2 Trailer', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, 'highlight': True}, {'id': 5724, 'name': 'Free Yourself', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/5724/movie.293x165.jpg?t=1507237311', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie480.webm?t=1507237311', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie_max.webm?t=1507237311'}, 'highlight': False}]"

We can now put this all together and define a process_media function.

In [70]:
def process_media(df, export=False):
    """Remove media columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        media_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(media_data, 'media_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df

media_df = process_media(desc_df, export=True)
Exported media data to '../data/exports/steam_media_data.csv'
In [71]:
# inspect exported data
pd.read_csv('../data/exports/steam_media_data.csv').head()
Out[71]:
steam_appid header_image screenshots background movies
0 10 https://steamcdn-a.akamaihd.net/steam/apps/10/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN
1 20 https://steamcdn-a.akamaihd.net/steam/apps/20/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN
2 30 https://steamcdn-a.akamaihd.net/steam/apps/30/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/30/... NaN
3 40 https://steamcdn-a.akamaihd.net/steam/apps/40/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN
4 50 https://steamcdn-a.akamaihd.net/steam/apps/50/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN

Before we move on, we can inspect the memory savings of removing these columns by comparing the output of the DataFrame.info method. If we pass memory_usage="deep" we get the true memory usage of each DataFrame. Without this, pandas estimates the amount used. This is because of the way python stores object (string) columns under the hood. Essentially python keeps track of a list of pointers which point to the actual strings in memory. It's a bit like if you hid a bunch of items around the house, and kept a list of where everything was. You couldn't tell the total size of everything just by looking at the list, but you could take a rough guess. Only by following the list and inspecting each individual item could you get an exact figure.

The blog post 'Why Python Is Slow' goes into more detail, but all we need to be aware of is that by passing the parameter we ensure we get the true value of memory usage. We also pass verbose=False to truncate unnecessary output.

We can see that already we have shrunk the memory usage from 285 MB to just under 55 MB. This is great because in general, the smaller the memory footprint the faster our code will run in future. And of course, we're not finished yet.

In [72]:
print('Before removing data:\n')
achiev_df.info(verbose=False, memory_usage="deep")

print('\nData with descriptions and media removed:\n')
media_df.info(verbose=False, memory_usage="deep")
Before removing data:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27414 entries, 0 to 29234
Columns: 24 entries, name to publisher
dtypes: category(1), float64(1), int64(3), object(19)
memory usage: 285.0 MB

Data with descriptions and media removed:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27391 entries, 0 to 29234
Columns: 17 entries, name to publisher
dtypes: category(1), float64(1), int64(3), object(12)
memory usage: 54.2 MB

Website and Support Info

Next we will look at the website and support_info columns. Seen below, they both contain links to external websites. The website column is simply stored as a string whereas the support info column is stored as a dictionary of url and email.

There are a large number of rows with no website listed, and while there are no null values in the support_info column, it looks like many will have empty url and email values inside the data.

For our dataset we'll be dropping both these columns, as they are far too specific to be useful in our analysis. As you may have guessed, we will extract and export this data as we have done before; even if it isn't useful here, it could prove interesting at a later date.
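As a quick check of that claim, the sketch below evaluates each support_info string and counts rows containing neither a url nor an email (assuming every dictionary follows the {'url': ..., 'email': ...} format seen in the data):

# count rows where the support_info dictionary holds only empty strings
empty_support = media_df['support_info'].apply(lambda x: not any(literal_eval(x).values()))
print('rows with no support info at all:', empty_support.sum())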

In [73]:
print('website null counts:', media_df['website'].isnull().sum())
print('support_info null counts:', media_df['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(media_df[['name', 'website', 'support_info']][75:80])
website null counts: 9328
support_info null counts: 0
name website support_info
82 X: Beyond the Frontier http://www.egosoft.com/games/x/info_en.php {'url': '', 'email': ''}
83 X: Tension http://www.egosoft.com/games/x_tension/info_en.php {'url': '', 'email': ''}
84 X Rebirth http://www.egosoft.com/games/x_rebirth/info_en.php {'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'}
85 688(I) Hunter/Killer NaN {'url': 'http://strategyfirst.com/products/support.html', 'email': ''}
86 Fleet Command NaN {'url': 'http://strategyfirst.com/products/support.html', 'email': ''}

We're going to split the support info into two separate columns. We'll keep all the code that parses the columns inside the export if statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain NaN whereas the other two columns contain a blank string ('') for missing data, as once we have exported to csv they will be represented the same way.
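As a small aside, here's a minimal sketch (with made-up values) showing why NaN and empty strings converge after a csv round-trip: both are written as empty fields and read back as NaN.

from io import StringIO

buffer = StringIO()
pd.DataFrame({'website': [np.nan], 'support_url': ['']}).to_csv(buffer, index=False)
buffer.seek(0)

pd.read_csv(buffer)  # both columns come back as NaN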

In [74]:
def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        # only keep rows with at least one piece of information
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


info_df = process_info(media_df, export=True)
Exported support info to '../data/exports/steam_support_info.csv'
In [75]:
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()
Out[75]:
steam_appid website support_url support_email
0 10 NaN http://steamcommunity.com/app/10 NaN
1 30 http://www.dayofdefeat.com/ NaN NaN
2 50 NaN https://help.steampowered.com NaN
3 70 http://www.half-life.com/ http://steamcommunity.com/app/70 NaN
4 80 NaN http://steamcommunity.com/app/80 NaN

System Requirements

As you may have noticed, there are three columns containing the system specifications required to run each game on each platform. These columns are pc_requirements, mac_requirements, and linux_requirements. As computing power has increased over the years, this information won't be of particular use in this analysis. It could be fascinating to use as a proxy of how computers have developed over the years, so we'll want to export the data, but we won't be keeping it in this dataset.

Taking a look at the null counts, it looks like there is data for every row.

In [76]:
requirements_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']

print('null counts:\n')

for col in requirements_cols:
    print(col+':', info_df[col].isnull().sum())
null counts:

pc_requirements: 0
mac_requirements: 0
linux_requirements: 0

However if we look at the data a little more closely, we see that some rows actually store an empty list. These won't appear as null rows, but once evaluated won't provide any information, so can be treated as null values.
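As a quick illustration, evaluating one of these strings returns an empty list, which is falsy and carries no information:

# '[]' is a non-null string, but evaluates to an empty (falsy) list
print(literal_eval('[]'))        # []
print(bool(literal_eval('[]')))  # False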

In [77]:
info_df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()
Out[77]:
steam_appid pc_requirements mac_requirements linux_requirements
29230 1065230 {'minimum': '<strong>Minimum:</strong><br><ul ... [] []
29231 1065570 {'minimum': '<strong>Minimum:</strong><br><ul ... [] []
29232 1065650 {'minimum': '<strong>Minimum:</strong><br><ul ... [] []
29233 1066700 {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... []
29234 1069460 {'minimum': '<strong>Minimum:</strong><br><ul ... {'minimum': '<strong>Minimum:</strong><br><ul ... []

By using a simple boolean filter and checking the shape attribute, we can get a count of how many empty lists there are.

In [78]:
print('Empty list counts:\n')

for col in requirements_cols:
    print(col+':', info_df[info_df[col] == '[]'].shape[0])
Empty list counts:

pc_requirements: 13
mac_requirements: 16141
linux_requirements: 19025

That's over half of the rows for both mac and linux requirements, which suggests there isn't enough data in these two columns for them to be particularly useful.

It turns out that most games are developed solely for windows, with mac and linux ports only becoming common in recent years. Naturally, any games that aren't supported on mac or linux would not have corresponding requirements.

As we have already cleaned the platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements column with data in the platform column. If a row has an empty list in the requirements column but the platforms column shows it is supported, it means the data is missing.

In [79]:
for col in ['mac_requirements', 'linux_requirements']:
    platform = col.split('_')[0]
    print(platform+':', info_df[(info_df[col] == '[]') & (info_df['platforms'].str.contains(platform))].shape[0])
mac: 133
linux: 154

Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.

Let's also have a look for missing values in the pc/windows column.

In [80]:
print('windows:', info_df[(info_df['pc_requirements'] == '[]') & (info_df['platforms'].str.contains('windows'))].shape[0])
windows: 9

Only 9 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.

In [81]:
missing_windows_requirements = info_df[(info_df['pc_requirements'] == '[]') & (info_df['platforms'].str.contains('windows'))]

print_steam_links(missing_windows_requirements[:5])
Uplink: https://store.steampowered.com/app/1510
Battlestations: Midway: https://store.steampowered.com/app/6870
Grand Theft Auto 2: https://store.steampowered.com/app/12180
Shift 2 Unleashed: https://store.steampowered.com/app/47920
iBomber Defense: https://store.steampowered.com/app/104000

There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.

We can assume that a cross-platform game will have similar requirements in terms of hardware for each platform it supports. With this in mind we can focus on the pc_requirements column, and use the platforms column to tell us which other platforms are supported.

Now we will take a look at a couple of rows from the dataset, to observe the structure of the stored data.

In [82]:
display(info_df['pc_requirements'].iloc[0])
display(info_df['pc_requirements'].iloc[2000])
display(info_df['pc_requirements'].iloc[15000])
"{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</strong> 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t<p><strong>Recommended:</strong> 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t'}"
'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic<br></li><li><strong>Processor:</strong> 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices<br></li><li><strong>Memory:</strong> 2 GB RAM<br></li><li><strong>Storage:</strong> 1400 MB available space</li></ul>\', \'recommended\': \'<strong>Recommended:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic<br></li><li><strong>Processor:</strong> 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices<br></li><li><strong>Memory:</strong> 4 GB RAM<br></li><li><strong>Storage:</strong> 1400 MB available space</li></ul>\'}'
'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li>Requires a 64-bit processor and operating system<br></li><li><strong>OS:</strong> Windows 8 / 10 - 64-bit<br></li><li><strong>Processor:</strong> 2Ghz Intel CPU<br></li><li><strong>Memory:</strong> 2 GB RAM<br></li><li><strong>Graphics:</strong> 1024MB VRAM<br></li><li><strong>DirectX:</strong> Version 9.0<br></li><li><strong>Network:</strong> Broadband Internet connection<br></li><li><strong>Storage:</strong> 10 GB available space</li></ul>\', \'recommended\': \'<strong>Recommended:</strong><br><ul class="bb_ul"><li>Requires a 64-bit processor and operating system<br></li><li><strong>OS:</strong> Windows 10 - 64-bit<br></li><li><strong>Processor:</strong> 2.60Ghz Intel Core i7 CPU<br></li><li><strong>Memory:</strong> 16 GB RAM<br></li><li><strong>Graphics:</strong> 4GB VRAM<br></li><li><strong>Network:</strong> Broadband Internet connection<br></li><li><strong>Storage:</strong> 10 GB available space</li></ul>\'}'

The data is stored as a dictionary, as we've seen before. There is definitely a minimum key, but beyond that the structure is hard to make out at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like some rows have separate categories like Processor and Memory.

Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.

By creating a series from a selection of rows, we can quickly experiment with changes using the pandas .str accessor, which lets us apply python string methods and regular expressions.

In [83]:
view_requirements = info_df['pc_requirements'].iloc[[0, 2000, 15000]].copy()

view_requirements = (view_requirements
                         .str.replace(r'\\[rtn]', '')
                         .str.replace(r'<[pbr]{1,2}>', ' ')
                         .str.replace(r'<[\/"=\w\s]+>', '')
                    )

for i, row in view_requirements.iteritems():
    display(row)
"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}"
"{'minimum': 'Minimum: OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices Memory: 2 GB RAM Storage: 1400 MB available space', 'recommended': 'Recommended: OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices Memory: 4 GB RAM Storage: 1400 MB available space'}"
"{'minimum': 'Minimum: Requires a 64-bit processor and operating system OS: Windows 8 / 10 - 64-bit Processor: 2Ghz Intel CPU Memory: 2 GB RAM Graphics: 1024MB VRAM DirectX: Version 9.0 Network: Broadband Internet connection Storage: 10 GB available space', 'recommended': 'Recommended: Requires a 64-bit processor and operating system OS: Windows 10 - 64-bit Processor: 2.60Ghz Intel Core i7 CPU Memory: 16 GB RAM Graphics: 4GB VRAM Network: Broadband Internet connection Storage: 10 GB available space'}"

We can now see more clearly the contents and structure of these rows. Some rows have both Minimum and Recommended requirements inside a minimum key, and some have separate minimum and recommended keys. Some have headings like Processor: and Storage: before various components, others simply have a list of components. Some state particular speeds for components, like 2 Ghz CPU, others state specific models, like 'Intel Core 2 Duo', amongst this information.

It seems like it would be possible to extract individual component information from this data, but it would be a lengthy and complex process worthy of another project. With that in mind, it seems best to slightly clean the data before exporting it, without trying to deal with individual components right now.
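As a rough indication of what that extraction might involve, here's a hedged sketch that pulls a memory amount out of a cleaned requirements string with a regular expression; the pattern and the sample string are illustrative assumptions, not a tested parser.

# illustrative only: extract a memory amount from a cleaned requirements string
sample = 'Minimum: OS: Windows 10 Processor: 2Ghz Intel CPU Memory: 2 GB RAM Storage: 10 GB available space'

match = re.search(r'Memory:\s*([\d.]+)\s*(GB|MB)', sample)
if match:
    print(match.group(1), match.group(2))  # 2 GB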

We'll export the raw data, and split out the data in the minimum and recommended keys. This will mean lots of rows still have recommended data hidden inside the minimum key, but this will do for now.

In [84]:
def process_requirements(df, export=False):
    """Remove requirements columns from dataframe, optionally exporting lightly cleaned pc requirements first."""
    if export:
        requirements = df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].copy()
        
        # remove rows with missing pc requirements
        requirements = requirements[requirements['pc_requirements'] != '[]']
        
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '')
                                                  .str.replace(r'<[pbr]{1,2}>', ' ')
                                                  .str.replace(r'<[\/"=\w\s]+>', '')
                                             )
        
        requirements['requirements_clean'] = requirements['requirements_clean'].apply(lambda x: literal_eval(x))
        
        # split out minimum and recommended into separate columns
        requirements['minimum'] = requirements['requirements_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['recommended'] = requirements['requirements_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        
        requirements = requirements.drop('requirements_clean', axis=1)
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df

reqs_df = process_requirements(info_df, export=True)
Exported requirements data to '../data/exports/steam_requirements_data.csv'
In [85]:
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()
Out[85]:
steam_appid pc_requirements mac_requirements linux_requirements minimum recommended
0 10 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
1 20 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
2 30 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
3 40 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
4 50 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN

Processing Release Date

We're almost finished, and the final column to clean is release_date. It provides the opportunity for an interesting exploration of code optimisation, so we'll be walking through the development process in a separate post. For now, we'll just include the complete function to run on the dataset.

In [86]:
# export data for optimisation post
reqs_df.to_csv('../data/exports/steam_partially_clean.csv', index=False)
In [87]:
def process_release_date(df):
    """Parse release_date column into datetime format, dropping unreleased titles and unparseable dates."""
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return '' # return blank string so can drop missing at end
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return np.nan
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    
    df = df[df['release_date'].notnull()]
    
    return df
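As a quick sanity check, we can run the function on a tiny dataframe of illustrative sample values (not taken from the real data), covering the full-date, month-and-year, and coming-soon cases:

# illustrative samples: full date, month and year only, and an unreleased title
sample = pd.DataFrame({'release_date': [
    "{'coming_soon': False, 'date': '1 May, 2019'}",
    "{'coming_soon': False, 'date': 'Apr 2018'}",
    "{'coming_soon': True, 'date': ''}"
]})

process_release_date(sample)  # expect 2019-05-01, 2018-04-01, and the coming-soon row dropped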

We can now update the process function once more, and run the full cleaning process on the raw dataset.

In [88]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    df = process_categories_and_genres(df)
    df = process_achievements_and_descriptors(df)  
    df = process_release_date(df)
    
    # Process columns which export data
    df = process_descriptions(df, export=True)
    df = process_media(df, export=True)
    df = process_info(df, export=True)
    df = process_requirements(df, export=True)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()
Exported description data to '../data/exports/steam_description_data.csv'
Exported media data to '../data/exports/steam_media_data.csv'
Exported support info to '../data/exports/steam_support_info.csv'
Exported requirements data to '../data/exports/steam_requirements_data.csv'
Out[88]:
name steam_appid required_age platforms categories genres achievements release_date price english developer publisher
0 Counter-Strike 10 0 windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action 0 2000-11-01 7.19 1 Valve Valve
1 Team Fortress Classic 20 0 windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action 0 1999-04-01 3.99 1 Valve Valve
2 Day of Defeat 30 0 windows;mac;linux Multi-player;Valve Anti-Cheat enabled Action 0 2003-05-01 3.99 1 Valve Valve
3 Deathmatch Classic 40 0 windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action 0 2001-06-01 3.99 1 Valve Valve
4 Half-Life: Opposing Force 50 0 windows;mac;linux Single-player;Multi-player;Valve Anti-Cheat en... Action 0 1999-11-01 3.99 1 Gearbox Software Valve

Final Steps

That just about does it for cleaning the Steam data. We now have a cleaned dataset ready to be combined with the data downloaded from SteamSpy, once that has been cleaned too.

Before we export the cleaned steam data, we'll check that we have eradicated missing values, and have a look at the memory footprint like we did earlier.

In [89]:
steam_data.isnull().sum()
Out[89]:
name            0
steam_appid     0
required_age    0
platforms       0
categories      0
genres          0
achievements    0
release_date    0
price           0
english         0
developer       0
publisher       0
dtype: int64
In [90]:
raw_steam_data.info(verbose=False, memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29235 entries, 0 to 29234
Columns: 39 entries, type to content_descriptors
dtypes: float64(2), int64(1), object(36)
memory usage: 390.0 MB
In [91]:
steam_data.info(verbose=False, memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27332 entries, 0 to 29234
Columns: 12 entries, name to publisher
dtypes: category(1), datetime64[ns](1), float64(1), int64(3), object(6)
memory usage: 13.8 MB

It looks like we've taken care of all the null values, and the size of the dataset has been reduced considerably.

We'll also check that no unreleased games have slipped through (data was scraped on or before 1st May, 2019).

In [92]:
steam_data[steam_data['release_date'] > '2019-05-01']
Out[92]:
name steam_appid required_age platforms categories genres achievements release_date price english developer publisher

Now that we're happy with our dataframe we are ready to export to file and finish this part of the project.

In [93]:
steam_data.to_csv('../data/exports/steam_data_clean.csv', index=False)

As I said at the beginning, data cleaning can be a very lengthy process. We definitely could have taken some shortcuts, and didn't have to go into so much depth, but I think it was interesting and valuable to go through the data as thoroughly as we did. A useful side effect of this is that we're now much more familiar with the data, and this should be incredibly useful when it comes to analysis. One of the most important prerequisites for data analysis is having good data, and the foundation of this is built upon solid data cleaning.

Next time we'll take a look at optimising the processing of the release_date column, as mentioned, then move onto cleaning the SteamSpy data. Once that is complete we can begin exploring and analysing the data.

Thanks for joining me, and I welcome any feedback or suggestions you may have in the comments below.
