SteamSpy Data Cleaning in Python

This post forms part of a larger series on downloading, processing and analysing data from the Steam store. See all posts here.

View the original notebook on GitHub (Download). Datasets available on Kaggle.

Welcome to the final part of the data cleaning process. Once we're finished here we'll be ready to move on to exploring and analysing the data.

As a quick recap, so far we have downloaded information on games from the Steam Store via the Steam API and the SteamSpy API. We have already cleaned and processed the data from the Steam API, and in this section we'll walk through cleaning the data downloaded from the SteamSpy API. The overall goal of this project is to collect, clean and analyse data from the Steam Store with the aim of advising a fictional game developer or company.

The previous posts went into great depth about the decisions made and methods used. This post will still go over a number of decisions, but will be more in the style of a brief overview than full discussion.

Import Libraries and Inspect Data

Let's begin by importing the necessary libraries and inspecting the raw downloaded data.

In [1]:
# standard library imports
from ast import literal_eval
import itertools
# import time
# import re

# third-party imports
import numpy as np
import pandas as pd

# customisations
pd.set_option("display.max_columns", 100)
In [2]:
raw_steamspy_data = pd.read_csv('../data/raw/steamspy_data.csv')
raw_steamspy_data.head()
Out[2]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
0 10 Counter-Strike Valve Valve NaN 124534 3339 0 10,000,000 .. 20,000,000 17612 709 317 26 999.0 999.0 0.0 English, French, German, Italian, Spanish - Sp... Action 14923 {'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...
1 20 Team Fortress Classic Valve Valve NaN 3318 633 0 5,000,000 .. 10,000,000 277 15 62 15 499.0 499.0 0.0 English, French, German, Italian, Spanish - Sp... Action 87 {'Action': 208, 'FPS': 188, 'Multiplayer': 172...
2 30 Day of Defeat Valve Valve NaN 3416 398 0 5,000,000 .. 10,000,000 187 0 34 0 499.0 499.0 0.0 English, French, German, Italian, Spanish - Spain Action 130 {'FPS': 138, 'World War II': 122, 'Multiplayer...
3 40 Deathmatch Classic Valve Valve NaN 1273 267 0 5,000,000 .. 10,000,000 258 0 184 0 499.0 499.0 0.0 English, French, German, Italian, Spanish - Sp... Action 4 {'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...
4 50 Half-Life: Opposing Force Gearbox Software Valve NaN 5250 288 0 5,000,000 .. 10,000,000 624 0 415 0 499.0 499.0 0.0 English, French, German, Korean Action 71 {'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...

There are a lot of columns here that will overlap with columns we already dealt with in the Steam data. This is fine because it means we don't have to clean them here, and can simply use the columns from the other dataset.

A few columns in this data that could be valuable are positive, negative, userscore, owners and average_forever. We'll begin by looking at the null counts for each column.

In [3]:
raw_steamspy_data.isnull().sum()
Out[3]:
appid                  0
name                   5
developer            197
publisher            280
score_rank         29177
positive               0
negative               0
userscore              0
owners                 0
average_forever        0
average_2weeks         0
median_forever         0
median_2weeks          0
price                 29
initialprice          22
discount              22
languages             94
genre                152
ccu                    0
tags                   0
dtype: int64

Most of the columns look fine, with only a few tens or hundreds of missing values, or none at all. The only column we'll remove for certain is the score_rank column, as it has far too many null values to be useful.
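As a general pattern (a hedged sketch on toy data, not the exact code we'll use later), columns exceeding a chosen null threshold can be identified programmatically rather than by eye:

```python
import numpy as np
import pandas as pd

# toy frame: 'score_rank' is almost entirely null, 'positive' is complete
df = pd.DataFrame({
    'positive': [124534, 3318, 3416],
    'score_rank': [np.nan, np.nan, np.nan]
})

# fraction of missing values per column
null_frac = df.isnull().mean()

# keep columns with fewer than 50% missing values (threshold is arbitrary)
keep = null_frac[null_frac < 0.5].index.tolist()

print(keep)  # ['positive']
```

The 50% cutoff here is illustrative; in practice score_rank is dropped because it is null for nearly every row.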

Handling missing values

There are some columns we'll be removing because they're present in the other dataset. Before we do, we can investigate their missing values to decide whether we should remove those rows first. We don't need to go through every column; we'll just look at a few individual ones.

In the name column, we have five rows with missing data and four more rows where the name is 'none'. We'll remove all nine of these rows.

In [4]:
raw_steamspy_data[raw_steamspy_data['name'].isnull()]
Out[4]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
481 17760 NaN NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 0 []
7235 396420 NaN NaN NaN NaN 22 9 0 20,000 .. 50,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 0 {'Free to Play': 24, 'Action': 21, 'Indie': 21...
9553 460250 NaN Jeroen Wimmers Jeroen Wimmers NaN 44 4 0 0 .. 20,000 0 0 0 0 899.0 899.0 0.0 English, French, Italian, German, Spanish - Sp... Casual, Indie 0 {'Indie': 34, 'Casual': 24, 'Puzzle': 15, 'Min...
22244 806160 NaN Paleno Games Paleno Games NaN 4 13 0 0 .. 20,000 0 0 0 0 99.0 99.0 0.0 English, French, Italian, German, Spanish - Sp... Action, Adventure, Casual, Indie 0 {'Action': 22, 'Indie': 22, 'Casual': 21, 'Adv...
27324 965340 NaN 2nd Studio 2nd Studio NaN 31 1 0 0 .. 20,000 0 0 0 0 199.0 199.0 0.0 English, German, Danish, Japanese, Russian, Si... Action, Indie, Simulation 0 {'Indie': 32, 'Sexual Content': 31, 'Action': ...
In [5]:
raw_steamspy_data[raw_steamspy_data['name'] == 'none']
Out[5]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
4918 339860 none NaN NaN NaN 84 32 0 50,000 .. 100,000 309 0 325 0 0.0 0.0 0.0 English Adventure, Indie, Massively Multiplayer, RPG, ... 0 {'Indie': 34, 'RPG': 33, 'Adventure': 32, 'Mas...
6779 385020 none none NaN NaN 14 26 0 20,000 .. 50,000 262 0 272 0 0.0 0.0 0.0 English, French, Italian, German, Spanish - Sp... Casual, Indie, Massively Multiplayer, Strategy 0 {'Indie': 32, 'Strategy': 31, 'Casual': 21, 'M...
7350 398970 none none none NaN 10 21 0 20,000 .. 50,000 259 0 272 0 0.0 0.0 0.0 English Adventure, Indie, Massively Multiplayer, RPG, ... 0 {'Massively Multiplayer': 27, 'Strategy': 23, ...
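Combined, these two checks can be expressed as a single boolean mask (a sketch on toy data, not the final cleaning function):

```python
import pandas as pd

df = pd.DataFrame({
    'appid': [10, 17760, 339860, 20],
    'name': ['Counter-Strike', None, 'none', 'Team Fortress Classic']
})

# keep rows where the name is present and is not the literal string 'none'
mask = df['name'].notnull() & (df['name'] != 'none')

print(df[mask]['appid'].tolist())  # [10, 20]
```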

There are 197 rows with missing developer data, and 280 rows with missing publisher data. It's conceivable that a small studio may self-publish, but we'll require that a title has a developer for it to be kept. We'll remove the rows with missing developer data, then drop both of these columns, as this information is already provided in the Steam data.

In [6]:
raw_steamspy_data[raw_steamspy_data['developer'].isnull()].sample(5, random_state=0)
Out[6]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
8774 439400 Legends of Callasia Demo NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 NaN NaN NaN NaN NaN 0 []
568 23130 Mole Control NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 1 []
18399 703610 Space Girls NaN NaN NaN 70 81 0 20,000 .. 50,000 0 0 0 0 99.0 99.0 0.0 English, Russian Sexual Content, Nudity, Action, Casual, Indie 0 {'Nudity': 26, 'Sexual Content': 25, 'Action':...
2207 247910 Sniper Elite: Nazi Zombie Army 2 NaN Rebellion NaN 3069 554 0 200,000 .. 500,000 289 0 335 0 1499.0 1499.0 0.0 English, French, Italian, German, Spanish - Sp... Action, Adventure 23 {'Zombies': 294, 'Action': 185, 'Co-op': 171, ...
15367 619730 Mad Driver NaN NaN NaN 46 48 0 100,000 .. 200,000 287 0 294 0 99.0 99.0 0.0 English, Russian Casual, Indie, Racing, Strategy 0 {'Strategy': 21, 'Indie': 21, 'Casual': 21, 'R...
In [7]:
raw_steamspy_data[raw_steamspy_data['publisher'].isnull()].sample(5, random_state=0)
Out[7]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
22993 827660 Feminist Cyborg Vs Last guy(gay) NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 NaN NaN NaN NaN NaN 0 []
489 18310 Spectraball - Demo NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 0 []
5347 349260 BlastZone 2 Matt Edzenga NaN NaN 50 25 0 20,000 .. 50,000 195 0 195 0 999.0 999.0 0.0 English Action, Indie 0 {'Indie': 35, 'Action': 34, "Shoot 'Em Up": 18...
2493 259530 Savant - Ascent D-Pad Studio NaN NaN 2467 179 0 200,000 .. 500,000 209 0 294 0 199.0 199.0 0.0 English, Japanese, Korean, Norwegian, Simplifi... Action, Indie 3 {'Action': 116, 'Indie': 102, 'Great Soundtrac...
18285 700580 Rust - Staging Branch NaN NaN NaN 0 0 0 0 .. 20,000 78 0 6 0 NaN NaN NaN NaN NaN 0 []

There are a small number of rows with missing price data. This is another column we don't need (along with initialprice and discount), as we already have that data available in the other dataset.

It looks like these rows have a lot of other missing information as well, so if they haven't already been removed when cleaning other columns we'll remove them here.

In [8]:
raw_steamspy_data[raw_steamspy_data['price'].isnull()].sample(5, random_state=0)
Out[8]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
4261 321210 Blade Symphony Beta NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 NaN NaN NaN NaN NaN 0 []
18285 700580 Rust - Staging Branch NaN NaN NaN 0 0 0 0 .. 20,000 78 0 6 0 NaN NaN NaN NaN NaN 0 []
22496 813350 Ben 10 VR NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 NaN NaN NaN NaN NaN 0 []
15719 630790 PSYCHO-PASS NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 NaN NaN NaN NaN NaN 0 []
12076 530940 BIOHAZARD 7 resident evil グロテスクVer. NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 NaN NaN NaN NaN NaN 0 []

Again there are a small number of rows with missing language data. There are a couple of bigger titles here, but most are also missing other data. We'll go ahead and remove these rows, which should leave us with most of the troublesome rows dealt with.

In [9]:
raw_steamspy_data[raw_steamspy_data['languages'].isnull()].sample(5, random_state=0)
Out[9]:
appid name developer publisher score_rank positive negative userscore owners average_forever average_2weeks median_forever median_2weeks price initialprice discount languages genre ccu tags
249 8350 Strong Bad's Cool Game for Attractive People: ... Telltale Games Telltale Games NaN 12 5 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN Adventure 0 {'Adventure': 20}
1233 94500 Back to the Future: Ep 2 - Get Tannen! Telltale Games Telltale Games NaN 1367 162 0 500,000 .. 1,000,000 91 0 91 0 0.0 0.0 0.0 NaN NaN 2 {'Adventure': 131, 'Point & Click': 116, 'Epis...
2120 243180 The Harvest NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 0 []
757 34390 Football Manager 2011 Demo NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 0 []
569 23140 KrissX NaN NaN NaN 0 0 0 0 .. 20,000 0 0 0 0 0.0 0.0 0.0 NaN NaN 1 []

Remove unwanted columns

As mentioned, a number of columns are also present in the Steam dataset, so we can remove any that overlap. It makes sense that these exist, because SteamSpy retrieves its data by looking at Steam's. We'll remove the score_rank column as it contains too many missing values, and the userscore column as most rows have a value of 0 there, so it won't be useful.

In [10]:
raw_steamspy_data['userscore'].value_counts().head()
Out[10]:
0      29177
100        4
84         4
95         3
55         3
Name: userscore, dtype: int64

There are also three columns - average_2weeks, median_2weeks (average and median playtime in the last two weeks) and ccu (peak concurrent users yesterday) - that provide information only relevant to the time the data was collected. Going forward this isn't going to be useful, as we're not interested in how games were performing at that specific time, so we'll remove those.

In [11]:
drop_cols = [
    'score_rank', # too many missing values
    'userscore', # too little variance (most have 0)
    'genre', 'developer', 'publisher', 'price', 'initialprice', 'discount', # provided by Steam data
    'average_2weeks', 'median_2weeks', 'ccu' # not interested in temporally specific columns
]

One more column we could remove is the languages column, as we already have the english column in the Steam data. Here in the SteamSpy data, languages is a comma-separated list of all supported languages. This provides some additional information, so we'll keep it for now.

In [12]:
raw_steamspy_data['languages'].head()
Out[12]:
0    English, French, German, Italian, Spanish - Sp...
1    English, French, German, Italian, Spanish - Sp...
2    English, French, German, Italian, Spanish - Spain
3    English, French, German, Italian, Spanish - Sp...
4                      English, French, German, Korean
Name: languages, dtype: object
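One reason to keep it: the comma-separated list can later be turned into a per-title language count during analysis. A quick sketch of the idea (illustrative only, on a toy series):

```python
import pandas as pd

languages = pd.Series([
    'English, French, German, Korean',
    'English'
])

# split on commas and count the number of supported languages per row
n_languages = languages.str.split(',').str.len()

print(n_languages.tolist())  # [4, 1]
```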

Process & Export Tags Column

There are a couple of columns that require more specific handling, and the tags column is one of them. There will be a lot of overlap with the genres column, but the tags might provide just enough unique information to be worthwhile keeping.

Below we separate out the tags data for experimentation, then have a look at the first few rows. Each tag for a game has an associated number, corresponding to the number of votes that game has for that tag. The higher the vote count, the more people consider that tag relevant for the title.

In [13]:
tags = raw_steamspy_data['tags']

print(tags[0])
tags.head()
{'Action': 2681, 'FPS': 2048, 'Multiplayer': 1659, 'Shooter': 1420, 'Classic': 1344, 'Team-Based': 943, 'First-Person': 799, 'Competitive': 790, 'Tactical': 734, "1990's": 564, 'e-sports': 550, 'PvP': 480, 'Military': 367, 'Strategy': 329, 'Score Attack': 200, 'Survival': 192, 'Old School': 164, 'Assassin': 151, '1980s': 144, 'Violent': 40}
Out[13]:
0    {'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...
1    {'Action': 208, 'FPS': 188, 'Multiplayer': 172...
2    {'FPS': 138, 'World War II': 122, 'Multiplayer...
3    {'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...
4    {'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...
Name: tags, dtype: object

Just by looking at the first row you can see that there are probably going to be far too many tags to include without disrupting the dataset. We have a couple of options here. We could include just the names of the top 3 or 5 tags for a game - which should be easy enough as it looks like the tags are already in order from most votes to least - or we could create new columns for each tag, and the number of votes as the row values, like this:

appid name action fps multiplayer
10 Shoot 'em up 2681 2048 0
20 Big Brawl 208 0 172

In a way, we'll do both.

Because including the full tag information in the dataset may make it bulky and unwieldy, we'll only include the top three tags in the core dataset. As we did when cleaning the Steam Data, we'll also export the full tag data as a separate file. This means that if we want to use it as part of our analysis we can import and merge the data quite easily.

Before we can do that, we have to figure out how to handle and process the data. As we've seen previously, the data in each row is stored as a string containing a dictionary, so we have to use literal_eval to have Python interpret it as a dictionary first.

In [14]:
eval_row = literal_eval(tags[0])

print(eval_row['Action'])
print(eval_row['FPS'])
2681
2048

We also have to figure out how to get a list of all the different unique tags from across the dataset. As each row only contains the tags that it has votes for, we don't know from the outset the tag names or even how many different tags there are.

We could loop through the data and build a list of all the unique tag names using a dictionary or something similar, but instead we'll achieve the same thing much more easily using the chain function from itertools. As seen below, it will take a variety of iterables and unpack them into a single sequence (we have to use list() here to force the generator to evaluate). We can then use the set function to obtain just the unique values from the returned sequence.

As you can see below, we can pass lists, tuples and dictionaries and they will all be handled. Also notice how only the key from the dictionary is returned ('d') and not its value ('e'). This is really useful: it means we can just pass the evaluated tags data as a series, and chain will extract the individual tag names.

In [15]:
values = [
    ['a', 'b'], # list
    ('b', 'c'), # tuple
    {'d': 'e'}  # dictionary
]

list(itertools.chain(*values))
Out[15]:
['a', 'b', 'b', 'c', 'd']
In [16]:
set(itertools.chain(*values))
Out[16]:
{'a', 'b', 'c', 'd'}

If we tried to define a function now, we might run into a problem: it turns out not all of the rows store their data as dictionaries. As seen below, there are a few hundred rows containing an empty list instead.

In [17]:
tags[tags == '[]'].shape[0]
Out[17]:
656

This doesn't cause an issue when we generate our set of unique tag names, as any empty lists are ignored.

In [18]:
parsed_tags = tags.apply(lambda x: literal_eval(x))

cols = set(itertools.chain(*parsed_tags))

print('Number of unique tags:', len(cols))
print('\nFirst few tags:', sorted(list(cols))[:5])
Number of unique tags: 371

First few tags: ['1980s', "1990's", '2.5D', '2D', '2D Fighter']

However, to create the new dataframe from the tags and votes, we need to look inside the dictionary in each row. We'll expand the parsing logic, using the isinstance function to check each row's type. If it is a dictionary, we return it; if it is a list, we return an empty dictionary ({}). We'll also raise an error in case there is anything else in the rows we haven't found yet.

Next, we loop through the tags stored in cols, standardise the format of the new column names, then extract the vote count of each tag by checking if the tag name is in each row's dictionary of tags.

In [19]:
def parse_tags(x):
    x = literal_eval(x)
    
    if isinstance(x, dict):
        return x
    elif isinstance(x, list):
        return {}
    else:
        raise TypeError('Something other than dict or list found')

parsed_tags = tags.apply(parse_tags)
        
tag_data = pd.DataFrame()

for col in sorted(cols):
    # standardise column names
    col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")

    # check if column in row's dictionary of tags and return that value if it is, or 0 if it isn't
    tag_data[col_name] = parsed_tags.apply(lambda x: x[col] if col in x.keys() else 0)

tag_data.head()
Out[19]:
1980s 1990s 2.5d 2d 2d_fighter 360_video 3d 3d_platformer 3d_vision 4_player_local 4x 6dof atv abstract action action_rpg action_adventure addictive adventure agriculture aliens alternate_history america animation_&_modeling anime arcade arena_shooter artificial_intelligence assassin asynchronous_multiplayer atmospheric audio_production bmx base_building baseball based_on_a_novel basketball batman battle_royale beat_em_up beautiful benchmark bikes blood board_game bowling building bullet_hell bullet_time crpg ... text_based third_person third_person_shooter thriller time_attack time_management time_manipulation time_travel top_down top_down_shooter touch_friendly tower_defense trackir trading trading_card_game trains transhumanism turn_based turn_based_combat turn_based_strategy turn_based_tactics tutorial twin_stick_shooter typing underground underwater unforgiving utilities vr vr_only vampire video_production villain_protagonist violent visual_novel voice_control voxel walking_simulator war wargame warhammer_40k web_publishing werewolves western word_game world_war_i world_war_ii wrestling zombies e_sports
0 144 564 0 0 0 0 0 0 0 0 0 0 0 0 2681 0 0 0 0 0 0 0 0 0 0 0 0 0 151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 550
1 0 71 0 0 0 0 0 0 0 0 0 0 0 0 208 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80 0 0 0 0 0 0 5 122 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 85 0 0 0 0 0 0 0 0 0 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 77 0 0 0 0 0 0 0 0 0 0 0 0 211 0 0 0 87 0 122 0 0 0 0 0 0 0 0 0 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 371 columns

We'll have to add in the AppIDs, but apart from that this makes up the tag data that we'll export. Next we have to figure out how to extract just the top three tags for use in the core dataset.

It turns out this is a relatively simple process. As the keys are ordered from most votes to least, we can just extract the first three keys for each row. This is safe here: since Python 3.7, dictionaries preserve insertion order, and literal_eval builds the dictionary in the order the tags appear in the string. We can then join these keys with semicolons to create a delimited list of the tags.

In [20]:
def parse_tags(x):
    x = literal_eval(x)

    if isinstance(x, dict):
        return ';'.join(list(x.keys())[:3])
    else:
        return np.nan
    
tags.apply(parse_tags).head()
Out[20]:
0          Action;FPS;Multiplayer
1          Action;FPS;Multiplayer
2    FPS;World War II;Multiplayer
3          Action;FPS;Multiplayer
4               FPS;Action;Sci-fi
Name: tags, dtype: object

Handle Owners Column

The final column we'll look at before defining a function to perform the cleaning is the owners column. This column is made up of the lower and upper bound of an estimate of the number of owners for each title. For privacy reasons, SteamSpy can't get exact figures from Steam (though it used to), so we'll have to make do with the figures we have here.

In [21]:
owners = raw_steamspy_data['owners']
owners.head()
Out[21]:
0    10,000,000 .. 20,000,000
1     5,000,000 .. 10,000,000
2     5,000,000 .. 10,000,000
3     5,000,000 .. 10,000,000
4     5,000,000 .. 10,000,000
Name: owners, dtype: object

We have a few options for how to deal with this data. We could remove the commas and split the string, keeping either the lower or the upper bound.

In [22]:
owners_split = owners.str.replace(',', '').str.split(' .. ')
owners_split.apply(lambda x: int(x[0])).head()
Out[22]:
0    10000000
1     5000000
2     5000000
3     5000000
4     5000000
Name: owners, dtype: int64

We could calculate the mid-point between the two bounds, and keep that.

In [23]:
owners_split.apply(lambda x: (int(x[0]) + int(x[1])) // 2).head()
Out[23]:
0    15000000
1     7500000
2     7500000
3     7500000
4     7500000
Name: owners, dtype: int64

Or we could reformat the data slightly, but otherwise keep it as is.

In [24]:
owners.str.replace(',', '').str.replace(' .. ', '-').head()
Out[24]:
0    10000000-20000000
1     5000000-10000000
2     5000000-10000000
3     5000000-10000000
4     5000000-10000000
Name: owners, dtype: object

We'll go with this last option. It allows us to keep the ranges and also easily modify the data in the future.
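Keeping the range as a hyphen-separated string makes it easy to recover the numbers later, for example during analysis. A hypothetical sketch of how that might look:

```python
import pandas as pd

owners = pd.Series(['10000000-20000000', '5000000-10000000'])

# split the range back into numeric lower and upper bounds
bounds = owners.str.split('-', expand=True).astype(int)

# e.g. compute the midpoint of each range
midpoint = (bounds[0] + bounds[1]) // 2

print(midpoint.tolist())  # [15000000, 7500000]
```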

Define Function

We're now ready to define the functions to clean the data. We'll wrap all the logic inside a process function; for most of the columns there isn't much code to write, so it can live there directly. The only logic we'll separate out is for processing the tags column, as it is more involved.

In the next code cell we clean the data, export the full tags data to file, and inspect the clean dataframe. After that we inspect the exported data, verifying everything went as planned.

In [25]:
def process_tags(df, export=False):
    if export: 
        
        tag_data = df[['appid', 'tags']].copy()
        
        def parse_export_tags(x):
            x = literal_eval(x)

            if isinstance(x, dict):
                return x
            elif isinstance(x, list):
                return {}
            else:
                raise TypeError('Something other than dict or list found')

        tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)

        cols = set(itertools.chain(*tag_data['tags']))

        for col in sorted(cols):
            col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")

            tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)

        tag_data = tag_data.drop('tags', axis=1)

        tag_data.to_csv('../data/exports/steamspy_tag_data.csv', index=False)
        print("Exported tag data to '../data/exports/steamspy_tag_data.csv'")
        
        
    def parse_tags(x):
        x = literal_eval(x)
        
        if isinstance(x, dict):
            return ';'.join(list(x.keys())[:3])
        else:
            return np.nan
    
    df['tags'] = df['tags'].apply(parse_tags)
    
    # rows with null tags seem to be superseded by newer release, so remove (e.g. dead island)
    df = df[df['tags'].notnull()]
    
    return df


def process(df):
    df = df.copy()
    
    # handle missing values
    df = df[(df['name'].notnull()) & (df['name'] != 'none')]
    df = df[df['developer'].notnull()]
    df = df[df['languages'].notnull()]
    df = df[df['price'].notnull()]
    
    # remove unwanted columns
    df = df.drop([
        'genre', 'developer', 'publisher', 'score_rank', 'userscore', 'average_2weeks',
        'median_2weeks', 'price', 'initialprice', 'discount', 'ccu'
    ], axis=1)
    
    # keep top tags, exporting full tag data to file
    df = process_tags(df, export=True)
    
    # reformat owners column
    df['owners'] = df['owners'].str.replace(',', '').str.replace(' .. ', '-')
    
    return df


steamspy_data = process(raw_steamspy_data)
steamspy_data.head()
Exported tag data to '../data/exports/steamspy_tag_data.csv'
Out[25]:
appid name positive negative owners average_forever median_forever languages tags
0 10 Counter-Strike 124534 3339 10000000-20000000 17612 317 English, French, German, Italian, Spanish - Sp... Action;FPS;Multiplayer
1 20 Team Fortress Classic 3318 633 5000000-10000000 277 62 English, French, German, Italian, Spanish - Sp... Action;FPS;Multiplayer
2 30 Day of Defeat 3416 398 5000000-10000000 187 34 English, French, German, Italian, Spanish - Spain FPS;World War II;Multiplayer
3 40 Deathmatch Classic 1273 267 5000000-10000000 258 184 English, French, German, Italian, Spanish - Sp... Action;FPS;Multiplayer
4 50 Half-Life: Opposing Force 5250 288 5000000-10000000 624 415 English, French, German, Korean FPS;Action;Sci-fi
In [26]:
# inspect tag data
pd.read_csv('../data/exports/steamspy_tag_data.csv').head()
Out[26]:
appid 1980s 1990s 2.5d 2d 2d_fighter 360_video 3d 3d_platformer 3d_vision 4_player_local 4x 6dof atv abstract action action_rpg action_adventure addictive adventure agriculture aliens alternate_history america animation_&_modeling anime arcade arena_shooter artificial_intelligence assassin asynchronous_multiplayer atmospheric audio_production bmx base_building baseball based_on_a_novel basketball batman battle_royale beat_em_up beautiful benchmark bikes blood board_game bowling building bullet_hell bullet_time ... text_based third_person third_person_shooter thriller time_attack time_management time_manipulation time_travel top_down top_down_shooter touch_friendly tower_defense trackir trading trading_card_game trains transhumanism turn_based turn_based_combat turn_based_strategy turn_based_tactics tutorial twin_stick_shooter typing underground underwater unforgiving utilities vr vr_only vampire video_production villain_protagonist violent visual_novel voice_control voxel walking_simulator war wargame warhammer_40k web_publishing werewolves western word_game world_war_i world_war_ii wrestling zombies e_sports
0 10 144 564 0 0 0 0 0 0 0 0 0 0 0 0 2681 0 0 0 0 0 0 0 0 0 0 0 0 0 151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 550
1 20 0 71 0 0 0 0 0 0 0 0 0 0 0 0 208 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80 0 0 0 0 0 0 5 122 0 0 0
3 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 85 0 0 0 0 0 0 0 0 0 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 50 0 77 0 0 0 0 0 0 0 0 0 0 0 0 211 0 0 0 87 0 122 0 0 0 0 0 0 0 0 0 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 372 columns

Merge and Export Clean Data

Before we wrap up, we'll check for any missing values and export the SteamSpy data to its own csv file. Then we'll look at combining the previously cleaned Steam data and the newly cleaned SteamSpy data into one cohesive dataset. This will give us options and flexibility for analysis in the future, and means we will have one csv file to begin looking at.

In [27]:
steamspy_data.isnull().sum()
Out[27]:
appid              0
name               0
positive           0
negative           0
owners             0
average_forever    0
median_forever     0
languages          0
tags               0
dtype: int64
In [28]:
steamspy_data.to_csv('../data/exports/steamspy_clean.csv', index=False)

To combine the dataframes we can use the pandas merge function, specifying which columns to merge on. We'll keep the default how='inner' behaviour, meaning we only keep rows that appear in both dataframes.
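If we wanted to check how many rows fail to match before committing to an inner merge, the indicator option is handy. A sketch on toy frames (not part of the original notebook):

```python
import pandas as pd

steam = pd.DataFrame({'steam_appid': [10, 20, 99], 'price': [7.19, 3.99, 0.99]})
spy = pd.DataFrame({'appid': [10, 20], 'positive': [124534, 3318]})

# an outer merge with indicator=True adds a '_merge' column showing
# whether each row came from the left frame, the right frame, or both
check = steam.merge(spy, left_on='steam_appid', right_on='appid',
                    how='outer', indicator=True)

print(check['_merge'].value_counts())
```

Here appid 99 would show up as 'left_only', flagging a row that an inner merge will silently drop.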

In [29]:
steam_data = pd.read_csv('../data/exports/steam_data_clean.csv')
    
merged = steam_data.merge(steamspy_data, left_on='steam_appid', right_on='appid', suffixes=('', '_steamspy'))
merged.head()
Out[29]:
name steam_appid required_age platforms categories genres achievements release_date price english developer publisher appid name_steamspy positive negative owners average_forever median_forever languages tags
0 Counter-Strike 10 0 windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action 0 2000-11-01 7.19 1 Valve Valve 10 Counter-Strike 124534 3339 10000000-20000000 17612 317 English, French, German, Italian, Spanish - Sp... Action;FPS;Multiplayer
1 Team Fortress Classic 20 0 windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action 0 1999-04-01 3.99 1 Valve Valve 20 Team Fortress Classic 3318 633 5000000-10000000 277 62 English, French, German, Italian, Spanish - Sp... Action;FPS;Multiplayer
2 Day of Defeat 30 0 windows;mac;linux Multi-player;Valve Anti-Cheat enabled Action 0 2003-05-01 3.99 1 Valve Valve 30 Day of Defeat 3416 398 5000000-10000000 187 34 English, French, German, Italian, Spanish - Spain FPS;World War II;Multiplayer
3 Deathmatch Classic 40 0 windows;mac;linux Multi-player;Online Multi-Player;Local Multi-P... Action 0 2001-06-01 3.99 1 Valve Valve 40 Deathmatch Classic 1273 267 5000000-10000000 258 184 English, French, German, Italian, Spanish - Sp... Action;FPS;Multiplayer
4 Half-Life: Opposing Force 50 0 windows;mac;linux Single-player;Multi-player;Valve Anti-Cheat en... Action 0 1999-11-01 3.99 1 Gearbox Software Valve 50 Half-Life: Opposing Force 5250 288 5000000-10000000 624 415 English, French, German, Korean FPS;Action;Sci-fi

Finally, we can remove, reorder and rename some columns before exporting the complete, clean dataset.

We'll remove the overlapping SteamSpy columns - name, languages and the duplicate app ID. Next we reindex the dataframe, reordering the columns so they read a little more logically, and then rename some of them to make it clearer what they describe.

This is a quick and simple step, but as we'll be making this data available for public use it's a very good idea to make it as user-friendly as possible.

In [30]:
# remove overlapping columns
steam_clean = merged.drop(['name_steamspy', 'languages', 'steam_appid'], axis=1)

# reindex to reorder columns
steam_clean = steam_clean[[
    'appid',
    'name',
    'release_date',
    'english',
    'developer',
    'publisher',
    'platforms',
    'required_age',
    'categories',
    'genres',
    'tags',
    'achievements',
    'positive',
    'negative',
    'average_forever',
    'median_forever',
    'owners',
    'price'
]]

steam_clean = steam_clean.rename({
    'tags': 'steamspy_tags',
    'positive': 'positive_ratings',
    'negative': 'negative_ratings',
    'average_forever': 'average_playtime',
    'median_forever': 'median_playtime'
}, axis=1)

steam_clean.head()
Out[30]:
appid name release_date english developer publisher platforms required_age categories genres steamspy_tags achievements positive_ratings negative_ratings average_playtime median_playtime owners price
0 10 Counter-Strike 2000-11-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 124534 3339 17612 317 10000000-20000000 7.19
1 20 Team Fortress Classic 1999-04-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 3318 633 277 62 5000000-10000000 3.99
2 30 Day of Defeat 2003-05-01 1 Valve Valve windows;mac;linux 0 Multi-player;Valve Anti-Cheat enabled Action FPS;World War II;Multiplayer 0 3416 398 187 34 5000000-10000000 3.99
3 40 Deathmatch Classic 2001-06-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 1273 267 258 184 5000000-10000000 3.99
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software Valve windows;mac;linux 0 Single-player;Multi-player;Valve Anti-Cheat en... Action FPS;Action;Sci-fi 0 5250 288 624 415 5000000-10000000 3.99
In [31]:
# export clean dataset
steam_clean.to_csv('../data/steam_clean.csv', index=False)

And with that, we have come to the end of the cleaning process. It's been a long journey, and hopefully you've picked up something useful along the way. Now we can look ahead to analysing the data, creating some visualisations, and gaining insights into the multi-billion dollar industry that is video games.

Thanks for joining me, and as always I welcome any feedback or suggestions you may have in the comments below.
