SteamSpy Data Cleaning in Python
This post forms part of a larger series on downloading, processing and analysing data from the Steam store. See all posts here.
View the original notebook on GitHub (download). Datasets available on Kaggle.
Welcome to the final part of the data cleaning process. Once we're finished here we'll be ready to move on to exploring and analysing the data.
As a quick re-cap, so far we have downloaded information on games from the Steam Store via the Steam API and SteamSpy API. We have cleaned and processed the data from the Steam API, and in this section we'll walk through cleaning the data downloaded from the SteamSpy API. The overall goal of this project is to collect, clean and analyse data from the Steam Store with the idea of advising a fictional game developer or company.
The previous posts went into great depth about the decisions made and methods used. This post will still go over a number of decisions, but will be more in the style of a brief overview than a full discussion.
Import Libraries and Inspect Data
Let's begin by importing the necessary libraries and inspecting the raw downloaded data.
# standard library imports
from ast import literal_eval
import itertools
# third-party imports
import numpy as np
import pandas as pd
# customisations
pd.set_option("display.max_columns", 100)
raw_steamspy_data = pd.read_csv('../data/raw/steamspy_data.csv')
raw_steamspy_data.head()
There are a lot of columns here that will overlap with columns we already dealt with in the Steam data. This is fine because it means we don't have to clean them here, and can simply use the columns from the other dataset.
A few columns in this data that could be valuable are positive, negative, userscore, owners and average_forever. We'll begin by looking at the null counts for each column.
raw_steamspy_data.isnull().sum()
Most of the columns look fine, with only a few tens or hundreds of missing values, or none at all. The only column we'll remove for certain is the score_rank column, as it has far too many null values to be useful.
Handling missing values
There are some columns we'll be removing, as they're present in the other dataset. Before we do, we can investigate their missing values to decide whether we should remove those rows first. We don't need to go through every column, just a few individual ones.
In the name column, we have five rows with missing data, and four more rows where the name is 'none'. We'll certainly remove all 9 of these rows.
raw_steamspy_data[raw_steamspy_data['name'].isnull()]
raw_steamspy_data[raw_steamspy_data['name'] == 'none']
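Combining those two conditions gives the filter we'll use later inside the cleaning function. A quick sketch of it applied to the raw data, counting the rows that would be kept:

name_filter = raw_steamspy_data['name'].notnull() & (raw_steamspy_data['name'] != 'none')
raw_steamspy_data[name_filter].shape[0]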
There are 197 rows with missing developer data, and 280 rows with missing publisher data. It's conceivable that a small studio may self-publish, but we'll say that a title must have a developer for it to be kept, so we'll remove the rows with missing developer data. Both columns themselves will then be dropped, as this information is already provided in the Steam data.
raw_steamspy_data[raw_steamspy_data['developer'].isnull()].sample(5, random_state=0)
raw_steamspy_data[raw_steamspy_data['publisher'].isnull()].sample(5, random_state=0)
There are a small number of rows with missing price data. This is another column we don't need (as are initialprice and discount), as we already have that data available in the other dataset.
It looks like these rows have a lot of other missing information as well, so if they haven't already been removed when cleaning other columns we'll remove them here.
raw_steamspy_data[raw_steamspy_data['price'].isnull()].sample(5, random_state=0)
Again there are a small number of rows with missing language data. There are a couple of bigger titles here, but most are missing other data too. We'll go ahead and remove these rows, which should leave us with most of the troublesome rows dealt with.
raw_steamspy_data[raw_steamspy_data['languages'].isnull()].sample(5, random_state=0)
Remove unwanted columns
As mentioned, there are a number of columns also present in the Steam dataset, so we can remove any of those with crossover. It makes sense that these exist, as SteamSpy's data is retrieved by looking at Steam's data. We'll remove the score_rank column as it contains too many missing values, and we'll remove the userscore column as most rows have a value of 0 here, so it won't be useful.
raw_steamspy_data['userscore'].value_counts().head()
There are also three columns - average_2weeks and median_2weeks (average and median playtime in the last two weeks), and ccu (peak concurrent users yesterday) - that provide information only relevant to the time the data was collected. Going forward this isn't going to be useful, as we're not interested in how games were performing at that specific moment, so we'll remove those.
drop_cols = [
'score_rank', # too many missing values
'userscore', # too little variance (most have 0)
'genre', 'developer', 'publisher', 'price', 'initialprice', 'discount', # provided by Steam data
'average_2weeks', 'median_2weeks', 'ccu' # not interested in temporally specific columns
]
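As a quick preview of the effect, we can drop these from the raw data and list what remains (the process function we define later performs the equivalent step):

raw_steamspy_data.drop(drop_cols, axis=1).columns.tolist()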
One more column we could remove is the languages column, as we already have the english column in the Steam data. Here in the SteamSpy data, languages is made up of a comma-separated list of all supported languages. This provides us with some more information, so we can keep it for now.
raw_steamspy_data['languages'].head()
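As a small illustration of the extra information this gives us over a simple english flag, here's an exploratory sketch counting the number of supported languages per title (not part of the cleaning pipeline):

raw_steamspy_data['languages'].dropna().str.split(',').str.len().head()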
Process & Export Tags Column
There are a couple of columns that require more specific handling, and the tags column is one of them. There will be a lot of overlap with the genres column, but the tags might provide just enough unique information to be worth keeping.
Below we separate out the tags data for experimentation, then take a look at the first few rows. Each tag for a game has an associated number, corresponding to the number of votes that game has for that tag. The more votes a tag has, the more people think it is a relevant tag for that title.
tags = raw_steamspy_data['tags']
print(tags[0])
tags.head()
Just by looking at the first row you can see that there are probably going to be far too many tags to include them all without making the dataset unwieldy. We have a couple of options here. We could include just the names of the top three or five tags for a game - which should be easy enough, as the tags appear to already be ordered from most votes to least - or we could create a new column for each tag, with the number of votes as the row values, like this:
| appid | name | action | fps | multiplayer |
|---|---|---|---|---|
| 10 | Shoot 'em up | 2681 | 2048 | 0 |
| 20 | Big Brawl | 208 | 0 | 172 |
In a way, we'll do both.
Because including the full tag information in the dataset may make it bulky and unwieldy, we'll only include the top three tags in the core dataset. As we did when cleaning the Steam Data, we'll also export the full tag data as a separate file. This means that if we want to use it as part of our analysis we can import and merge the data quite easily.
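For example, re-attaching the exported tag data to a cleaned dataframe later would just be a merge on the appid column. A minimal sketch (assuming the export path used below, and a hypothetical cleaned dataframe df that has an appid column):

tag_data = pd.read_csv('../data/exports/steamspy_tag_data.csv')
df_with_tags = df.merge(tag_data, on='appid')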
Before we can do that, we have to figure out how to handle and process the data. As we've seen previously, the data in each row is stored as a string containing a dictionary, so we have to use literal_eval to get Python to "recognise" the dictionary first.
eval_row = literal_eval(tags[0])
print(eval_row['Action'])
print(eval_row['FPS'])
We also have to figure out how to get a list of all the different unique tags from across the dataset. As each row only contains the tags that it has votes for, we don't know from the outset the tag names or even how many different tags there are.
We could loop through the data and build up a list of all the unique tag names manually, but instead we'll achieve the same thing much more easily using the chain function from itertools. As seen below, it takes a variety of iterables and unpacks them into a single sequence (we have to use list() here to force the iterator to evaluate). We can then use the set function to obtain just the unique values from the returned sequence.
As you can see below, we can pass lists, tuples and dictionaries, and they will all be handled. Also notice how only the key of the dictionary is returned (d) and not its value (e). This is really useful, as it means we can simply pass the series of evaluated tags data to chain, and it will extract the individual tag names (the keys) for us.
values = [
['a', 'b'], # list
('b', 'c'), # tuple
{'d': 'e'} # dictionary
]
list(itertools.chain(*values))
set(itertools.chain(*values))
If we tried to define a function now, we might run into a problem: it turns out not all of the rows have data stored as dictionaries. As seen below, there are a few hundred rows containing an empty list instead.
tags[tags == '[]'].shape[0]
This doesn't cause an issue when we generate our set of unique tag names, as any empty lists are ignored.
parsed_tags = tags.apply(lambda x: literal_eval(x))
cols = set(itertools.chain(*parsed_tags))
print('Number of unique tags:', len(cols))
print('\nFirst few tags:', sorted(list(cols))[:5])
However, to create the new dataframe from the tags and votes, we need to look inside the dictionaries in each row. We'll expand the logic to parse the tags using the isinstance function to check each row's type. If it is a dictionary, we return the dictionary; if it is a list, we return an empty dictionary {}. We'll also include an extra check in case there is anything else in the rows we haven't found yet.
Next, we loop through the tags stored in cols, standardise the format of the new column names, then extract the vote count of each tag by checking if the tag name is in each row's dictionary of tags.
def parse_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return x
elif isinstance(x, list):
return {}
else:
raise TypeError('Something other than dict or list found')
parsed_tags = tags.apply(parse_tags)
tag_data = pd.DataFrame()
for col in sorted(cols):
# standardise column names
col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")
# check if column in row's dictionary of tags and return that value if it is, or 0 if it isn't
tag_data[col_name] = parsed_tags.apply(lambda x: x[col] if col in x.keys() else 0)
tag_data.head()
We'll have to add in the AppIDs, but apart from that this makes up the tag data that we'll export.
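Adding those in is a one-liner, since the tag dataframe shares its index with the raw data. A minimal sketch (the process_tags function defined below instead copies the appid column up front, which achieves the same thing):

tag_data.insert(0, 'appid', raw_steamspy_data['appid'])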
Next we have to figure out how to extract just the top three tags for use in the core dataset. It turns out this is a relatively simple process. As the keys are in order from most votes to least, we can just extract the first three keys for each row. Python dictionaries preserve insertion order (guaranteed since Python 3.7), so the order from the original string is kept and this is safe for our purposes. We can then join these keys on a semicolon to create a single string of tags.
def parse_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return ';'.join(list(x.keys())[:3])
else:
return np.nan
tags.apply(parse_tags).head()
Handle Owners Column
The final column we'll look at before defining a function to perform the cleaning is the owners column. This column is made up of the lower and upper bounds of an estimate of the number of owners for each title. For privacy reasons, SteamSpy can't get exact figures from Steam (though it used to be able to), so we'll have to make do with the estimates we have here.
owners = raw_steamspy_data['owners']
owners.head()
We have a couple of options for how to deal with the data. We could remove the commas and split the data, keeping the lower or upper bound.
owners_split = owners.str.replace(',', '').str.split(' .. ')
owners_split.apply(lambda x: int(x[0])).head()
We could calculate the mid-point between the two bounds, and keep that.
owners_split.apply(lambda x: (int(x[0]) + int(x[1])) // 2).head()
Or we could reformat the data slightly, but pretty much keep it as is.
owners.str.replace(',', '').str.replace(' .. ', '-').head()
We'll go with this last option. It allows us to keep the ranges and also easily modify the data in the future.
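If we ever need numbers again, the reformatted ranges are easy to parse back out. A minimal sketch recovering the lower bound as an integer (assuming the reformatting shown above):

owners_clean = owners.str.replace(',', '').str.replace(' .. ', '-')
owners_clean.str.split('-').str[0].astype(int).head()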
Define Function
We're now ready to define the functions to clean the data. We'll wrap all the logic inside a process function; for most of the columns there isn't much code to write, so we can keep it all in there. The only logic we'll separate out is for processing the tags column, as it's a more complex endeavour.
In the next code cell we clean the data, export the full tags data to file, and inspect the clean dataframe. After that we inspect the exported data, verifying everything went as planned.
def process_tags(df, export=False):
if export:
tag_data = df[['appid', 'tags']].copy()
def parse_export_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return x
elif isinstance(x, list):
return {}
else:
raise TypeError('Something other than dict or list found')
tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)
cols = set(itertools.chain(*tag_data['tags']))
for col in sorted(cols):
col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")
tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)
tag_data = tag_data.drop('tags', axis=1)
tag_data.to_csv('../data/exports/steamspy_tag_data.csv', index=False)
print("Exported tag data to '../data/exports/steamspy_tag_data.csv'")
def parse_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return ';'.join(list(x.keys())[:3])
else:
return np.nan
df['tags'] = df['tags'].apply(parse_tags)
# rows with null tags seem to be superseded by newer release, so remove (e.g. dead island)
df = df[df['tags'].notnull()]
return df
def process(df):
df = df.copy()
# handle missing values
df = df[(df['name'].notnull()) & (df['name'] != 'none')]
df = df[df['developer'].notnull()]
df = df[df['languages'].notnull()]
df = df[df['price'].notnull()]
# remove unwanted columns
df = df.drop([
'genre', 'developer', 'publisher', 'score_rank', 'userscore', 'average_2weeks',
'median_2weeks', 'price', 'initialprice', 'discount', 'ccu'
], axis=1)
# keep top tags, exporting full tag data to file
df = process_tags(df, export=True)
# reformat owners column
df['owners'] = df['owners'].str.replace(',', '').str.replace(' .. ', '-')
return df
steamspy_data = process(raw_steamspy_data)
steamspy_data.head()
# inspect tag data
pd.read_csv('../data/exports/steamspy_tag_data.csv').head()
Merge and Export Clean Data
Before we wrap up, we'll check for any missing values and export the SteamSpy data to its own csv file. Then we'll look at combining the previously cleaned Steam data and the newly cleaned SteamSpy data into one cohesive dataset. This will give us options and flexibility for analysis in the future, and means we will have one csv file to begin looking at.
steamspy_data.isnull().sum()
steamspy_data.to_csv('../data/exports/steamspy_clean.csv', index=False)
To combine the dataframes we can use the pandas merge function, specifying which columns to merge on. We'll keep the default how='inner' behaviour, meaning we only keep rows with data in both dataframes.
steam_data = pd.read_csv('../data/exports/steam_data_clean.csv')
merged = steam_data.merge(steamspy_data, left_on='steam_appid', right_on='appid', suffixes=('', '_steamspy'))
merged.head()
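Before moving on, it's worth a quick sanity check on how many rows survived the inner merge. A short sketch (exact counts will depend on when the data was downloaded):

print('steam rows:', len(steam_data))
print('steamspy rows:', len(steamspy_data))
print('merged rows:', len(merged))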
Finally, we can remove, reorder and rename some columns before exporting the complete, clean dataset.
We'll remove the overlapping SteamSpy columns - the duplicate name and languages columns, plus one of the two app ID columns. Next we reindex the dataframe, reordering the columns so they make a little more logical sense to look at, and then we rename some of the columns to make it clearer what they describe.
This is a quick and simple step, but as we'll be making this data available for public use it's a very good idea to make it as user-friendly as possible.
# remove overlapping columns
steam_clean = merged.drop(['name_steamspy', 'languages', 'steam_appid'], axis=1)
# reindex to reorder columns
steam_clean = steam_clean[[
'appid',
'name',
'release_date',
'english',
'developer',
'publisher',
'platforms',
'required_age',
'categories',
'genres',
'tags',
'achievements',
'positive',
'negative',
'average_forever',
'median_forever',
'owners',
'price'
]]
steam_clean = steam_clean.rename({
'tags': 'steamspy_tags',
'positive': 'positive_ratings',
'negative': 'negative_ratings',
'average_forever': 'average_playtime',
'median_forever': 'median_playtime'
}, axis=1)
steam_clean.head()
# export clean dataset
steam_clean.to_csv('../data/steam_clean.csv', index=False)
And with that, we have come to the end of the cleaning process. It's been a long journey and hopefully you've picked up something useful along the way, and now we can look ahead to analysing the data, creating some visualisations, and gaining some insights into the multi-billion dollar industry that is video games.
Thanks for joining me, and as always I welcome any feedback or suggestions you may have in the comments below.