SteamSpy Data Cleaning in Python
This post forms part of a larger series on downloading, processing and analysing data from the Steam store. See all posts here.
View the original notebook on GitHub (download). Datasets available on Kaggle.
Welcome to the final part of the data cleaning process. Once we're finished here we'll be ready to move on to exploring and analysing the data.
As a quick re-cap, so far we have downloaded information on games from the Steam Store via the Steam API and SteamSpy API. We have cleaned and processed the data from the Steam API, and in this section we'll walk through cleaning the data downloaded from the SteamSpy API. The overall goal of this project is to collect, clean and analyse data from the Steam Store with the idea of advising a fictional game developer or company.
The previous posts went into great depth about the decisions made and methods used. This post will still go over a number of decisions, but will be more in the style of a brief overview than a full discussion.
Import Libraries and Inspect Data
Let's begin by importing the necessary libraries and inspecting the raw downloaded data.
# standard library imports
from ast import literal_eval
import itertools
# third-party imports
import numpy as np
import pandas as pd
# customisations
pd.set_option("display.max_columns", 100)
raw_steamspy_data = pd.read_csv('../data/raw/steamspy_data.csv')
raw_steamspy_data.head()
There are a lot of columns here that will overlap with columns we already dealt with in the Steam data. This is fine because it means we don't have to clean them here, and can simply use the columns from the other dataset.
A few columns in this data that could be valuable are positive, negative, userscore, owners and average_forever. We'll begin by looking at the null counts for each column.
raw_steamspy_data.isnull().sum()
Most of the columns look fine, with only a few tens or hundreds of missing values, or none at all. The only column we'll remove for certain is the score_rank column, as it has far too many null values to be useful.
Handling missing values
There are some columns we'll be removing, as they're present in the other dataset. Before we do, we can investigate their missing values to decide whether we should remove those rows first. We don't need to go through every column, just a few individual ones.
In the name column, we have five rows with missing data, and four more rows where the name is 'none'. We'll certainly remove all 9 of these rows.
raw_steamspy_data[raw_steamspy_data['name'].isnull()]
raw_steamspy_data[raw_steamspy_data['name'] == 'none']
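Combining those two conditions gives the filter we'll use later inside the cleaning function. A quick sketch of it applied to the raw data, counting the rows that would be kept:

name_filter = raw_steamspy_data['name'].notnull() & (raw_steamspy_data['name'] != 'none')
raw_steamspy_data[name_filter].shape[0]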
There are 197 rows with missing developer data, and 280 rows with missing publisher data. It's conceivable that a small studio may self-publish, but we'll say that a title must have a developer for it to be kept, so we'll remove the rows with missing developer data. Both columns themselves will then be dropped, as this information is already provided in the Steam data.
raw_steamspy_data[raw_steamspy_data['developer'].isnull()].sample(5, random_state=0)
raw_steamspy_data[raw_steamspy_data['publisher'].isnull()].sample(5, random_state=0)
There are a small number of rows with missing price data. This is another column we don't need (as are initialprice and discount), as we already have that data available in the other dataset.
It looks like these rows have a lot of other missing information as well, so if they haven't already been removed when cleaning other columns we'll remove them here.
raw_steamspy_data[raw_steamspy_data['price'].isnull()].sample(5, random_state=0)
Again there are a small number of rows with missing language data. There are a couple of bigger titles here, but most are missing other data too. We'll go ahead and remove these rows, which should leave us with most of the troublesome rows dealt with.
raw_steamspy_data[raw_steamspy_data['languages'].isnull()].sample(5, random_state=0)
Remove unwanted columns
As mentioned, there are a number of columns also present in the Steam dataset, so we can remove any of those with crossover. It makes sense that these exist, as SteamSpy's data is retrieved by looking at Steam's data. We'll remove the score_rank column as it contains too many missing values, and we'll remove the userscore column as most rows have a value of 0 here, so it won't be useful.
raw_steamspy_data['userscore'].value_counts().head()
There are also three columns - average_2weeks and median_2weeks (average and median playtime in the last two weeks), and ccu (peak concurrent users yesterday) - that provide information only relevant to the time the data was collected. Going forward this isn't going to be useful, as we're not interested in how games were performing at that specific moment, so we'll remove those.
drop_cols = [
'score_rank', # too many missing values
'userscore', # too little variance (most have 0)
'genre', 'developer', 'publisher', 'price', 'initialprice', 'discount', # provided by Steam data
'average_2weeks', 'median_2weeks', 'ccu' # not interested in temporally specific columns
]
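As a quick preview of the effect, we can drop these from the raw data and list what remains (the process function we define later performs the equivalent step):

raw_steamspy_data.drop(drop_cols, axis=1).columns.tolist()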
One more column we could remove is the languages column, as we already have the english column in the Steam data. Here in the SteamSpy data, languages is made up of a comma-separated list of all supported languages. This provides us with some more information, so we can keep it for now.
raw_steamspy_data['languages'].head()
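As a small illustration of the extra information this gives us over a simple english flag, here's an exploratory sketch counting the number of supported languages per title (not part of the cleaning pipeline):

raw_steamspy_data['languages'].dropna().str.split(',').str.len().head()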
Process & Export Tags Column
There are a couple of columns that require more specific handling, and the tags column is one of them. There will be a lot of overlap with the genres column, but the tags might provide just enough unique information to be worth keeping.
Below we separate out the tags data for experimentation, then take a look at the first few rows. Each tag for a game has an associated number, corresponding to the number of votes that game has for that tag. The more votes a tag has, the more people think it is a relevant tag for that title.
tags = raw_steamspy_data['tags']
print(tags[0])
tags.head()
Just by looking at the first row you can see that there are probably going to be far too many tags to include them all without making the dataset unwieldy. We have a couple of options here. We could include just the names of the top three or five tags for a game - which should be easy enough, as the tags appear to already be ordered from most votes to least - or we could create a new column for each tag, with the number of votes as the row values, like this:
| appid | name | action | fps | multiplayer |
|---|---|---|---|---|
| 10 | Shoot 'em up | 2681 | 2048 | 0 |
| 20 | Big Brawl | 208 | 0 | 172 |
In a way, we'll do both.
Because including the full tag information in the dataset may make it bulky and unwieldy, we'll only include the top three tags in the core dataset. As we did when cleaning the Steam Data, we'll also export the full tag data as a separate file. This means that if we want to use it as part of our analysis we can import and merge the data quite easily.
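For example, re-attaching the exported tag data to a cleaned dataframe later would just be a merge on the appid column. A minimal sketch (assuming the export path used below, and a hypothetical cleaned dataframe df that has an appid column):

tag_data = pd.read_csv('../data/exports/steamspy_tag_data.csv')
df_with_tags = df.merge(tag_data, on='appid')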
Before we can do that, we have to figure out how to handle and process the data. As we've seen previously, the data in each row is stored as a string containing a dictionary, so we have to use literal_eval to get Python to "recognise" the dictionary first.
eval_row = literal_eval(tags[0])
print(eval_row['Action'])
print(eval_row['FPS'])
We also have to figure out how to get a list of all the different unique tags from across the dataset. As each row only contains the tags that it has votes for, we don't know from the outset the tag names or even how many different tags there are.
We could loop through the data and build up a list of all the unique tag names manually, but instead we'll achieve the same thing much more easily using the chain function from itertools. As seen below, it takes a variety of iterables and unpacks them into a single sequence (we have to use list() here to force the iterator to evaluate). We can then use the set function to obtain just the unique values from the returned sequence.
As you can see below, we can pass lists, tuples and dictionaries, and they will all be handled. Also notice how only the key of the dictionary is returned (d) and not its value (e). This is really useful, as it means we can simply pass the series of evaluated tags data to chain, and it will extract the individual tag names (the keys) for us.
values = [
['a', 'b'], # list
('b', 'c'), # tuple
{'d': 'e'} # dictionary
]
list(itertools.chain(*values))
set(itertools.chain(*values))
If we tried to define a function now, we might run into a problem: it turns out not all of the rows have data stored as dictionaries. As seen below, there are a few hundred rows containing an empty list instead.
tags[tags == '[]'].shape[0]
This doesn't cause an issue when we generate our set of unique tag names, as any empty lists are ignored.
parsed_tags = tags.apply(lambda x: literal_eval(x))
cols = set(itertools.chain(*parsed_tags))
print('Number of unique tags:', len(cols))
print('\nFirst few tags:', sorted(list(cols))[:5])
However, to create the new dataframe from the tags and votes, we need to look inside the dictionaries in each row. We'll expand the logic to parse the tags using the isinstance function to check each row's type. If it is a dictionary, we return the dictionary; if it is a list, we return an empty dictionary {}. We'll also include an extra check in case there is anything else in the rows we haven't found yet.
Next, we loop through the tags stored in cols, standardise the format of the new column names, then extract the vote count of each tag by checking if the tag name is in each row's dictionary of tags.
def parse_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return x
elif isinstance(x, list):
return {}
else:
raise TypeError('Something other than dict or list found')
parsed_tags = tags.apply(parse_tags)
tag_data = pd.DataFrame()
for col in sorted(cols):
# standardise column names
col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")
# check if column in row's dictionary of tags and return that value if it is, or 0 if it isn't
tag_data[col_name] = parsed_tags.apply(lambda x: x[col] if col in x.keys() else 0)
tag_data.head()
We'll have to add in the AppIDs, but apart from that this makes up the tag data that we'll export.
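Adding those in is a one-liner, since the tag dataframe shares its index with the raw data. A minimal sketch (the process_tags function defined below instead copies the appid column up front, which achieves the same thing):

tag_data.insert(0, 'appid', raw_steamspy_data['appid'])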
Next we have to figure out how to extract just the top three tags for use in the core dataset. It turns out this is a relatively simple process. As the keys are in order from most votes to least, we can just extract the first three keys for each row. Python dictionaries preserve insertion order (guaranteed since Python 3.7), so the order from the original string is kept and this is safe for our purposes. We can then join these keys on a semicolon to create a single string of tags.
def parse_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return ';'.join(list(x.keys())[:3])
else:
return np.nan
tags.apply(parse_tags).head()
Handle Owners Column
The final column we'll look at before defining a function to perform the cleaning is the owners column. This column is made up of the lower and upper bounds of an estimate of the number of owners for each title. For privacy reasons, SteamSpy can't get exact figures from Steam (though it used to be able to), so we'll have to make do with the estimates we have here.
owners = raw_steamspy_data['owners']
owners.head()
We have a couple of options for how to deal with the data. We could remove the commas and split the data, keeping the lower or upper bound.
owners_split = owners.str.replace(',', '').str.split(' .. ')
owners_split.apply(lambda x: int(x[0])).head()
We could calculate the mid-point between the two bounds, and keep that.
owners_split.apply(lambda x: (int(x[0]) + int(x[1])) // 2).head()
Or we could reformat the data slightly, but pretty much keep it as is.
owners.str.replace(',', '').str.replace(' .. ', '-').head()
We'll go with this last option. It allows us to keep the ranges and also easily modify the data in the future.
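If we ever need numbers again, the reformatted ranges are easy to parse back out. A minimal sketch recovering the lower bound as an integer (assuming the reformatting shown above):

owners_clean = owners.str.replace(',', '').str.replace(' .. ', '-')
owners_clean.str.split('-').str[0].astype(int).head()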
Define Function
We're now ready to define the functions to clean the data. We'll wrap all the logic inside a process function; for most of the columns there isn't much code to write, so we can keep it all in there. The only logic we'll separate out is for processing the tags column, as it's a more complex endeavour.
In the next code cell we clean the data, export the full tags data to file, and inspect the clean dataframe. After that we inspect the exported data, verifying everything went as planned.
def process_tags(df, export=False):
if export:
tag_data = df[['appid', 'tags']].copy()
def parse_export_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return x
elif isinstance(x, list):
return {}
else:
raise TypeError('Something other than dict or list found')
tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)
cols = set(itertools.chain(*tag_data['tags']))
for col in sorted(cols):
col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")
tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)
tag_data = tag_data.drop('tags', axis=1)
tag_data.to_csv('../data/exports/steamspy_tag_data.csv', index=False)
print("Exported tag data to '../data/exports/steamspy_tag_data.csv'")
def parse_tags(x):
x = literal_eval(x)
if isinstance(x, dict):
return ';'.join(list(x.keys())[:3])
else:
return np.nan
df['tags'] = df['tags'].apply(parse_tags)
# rows with null tags seem to be superseded by newer release, so remove (e.g. dead island)
df = df[df['tags'].notnull()]
return df
def process(df):
df = df.copy()
# handle missing values
df = df[(df['name'].notnull()) & (df['name'] != 'none')]
df = df[df['developer'].notnull()]
df = df[df['languages'].notnull()]
df = df[df['price'].notnull()]
# remove unwanted columns
df = df.drop([
'genre', 'developer', 'publisher', 'score_rank', 'userscore', 'average_2weeks',
'median_2weeks', 'price', 'initialprice', 'discount', 'ccu'
], axis=1)
# keep top tags, exporting full tag data to file
df = process_tags(df, export=True)
# reformat owners column
df['owners'] = df['owners'].str.replace(',', '').str.replace(' .. ', '-')
return df
steamspy_data = process(raw_steamspy_data)
steamspy_data.head()
# inspect tag data
pd.read_csv('../data/exports/steamspy_tag_data.csv').head()
Merge and Export Clean Data
Before we wrap up, we'll check for any missing values and export the SteamSpy data to its own csv file. Then we'll look at combining the previously cleaned Steam data and the newly cleaned SteamSpy data into one cohesive dataset. This will give us options and flexibility for analysis in the future, and means we will have one csv file to begin looking at.
steamspy_data.isnull().sum()
steamspy_data.to_csv('../data/exports/steamspy_clean.csv', index=False)
To combine the dataframes we can use the pandas merge function, specifying which columns to merge on. We'll keep the default how='inner' behaviour, meaning we only keep rows with data in both dataframes.
steam_data = pd.read_csv('../data/exports/steam_data_clean.csv')
merged = steam_data.merge(steamspy_data, left_on='steam_appid', right_on='appid', suffixes=('', '_steamspy'))
merged.head()
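Before moving on, it's worth a quick sanity check on how many rows survived the inner merge. A short sketch (exact counts will depend on when the data was downloaded):

print('steam rows:', len(steam_data))
print('steamspy rows:', len(steamspy_data))
print('merged rows:', len(merged))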
Finally, we can remove, reorder and rename some columns before exporting the complete, clean dataset.
We'll remove the overlapping SteamSpy columns - the duplicate name and languages columns, plus one of the two app ID columns. Next we reindex the dataframe, reordering the columns so they make a little more logical sense to look at, and then we rename some of the columns to make it clearer what they describe.
This is a quick and simple step, but as we'll be making this data available for public use it's a very good idea to make it as user-friendly as possible.
# remove overlapping columns
steam_clean = merged.drop(['name_steamspy', 'languages', 'steam_appid'], axis=1)
# reindex to reorder columns
steam_clean = steam_clean[[
'appid',
'name',
'release_date',
'english',
'developer',
'publisher',
'platforms',
'required_age',
'categories',
'genres',
'tags',
'achievements',
'positive',
'negative',
'average_forever',
'median_forever',
'owners',
'price'
]]
steam_clean = steam_clean.rename({
'tags': 'steamspy_tags',
'positive': 'positive_ratings',
'negative': 'negative_ratings',
'average_forever': 'average_playtime',
'median_forever': 'median_playtime'
}, axis=1)
steam_clean.head()
# export clean dataset
steam_clean.to_csv('../data/steam_clean.csv', index=False)
And with that, we have come to the end of the cleaning process. It's been a long journey and hopefully you've picked up something useful along the way, and now we can look ahead to analysing the data, creating some visualisations, and gaining some insights into the multi-billion dollar industry that is video games.
Thanks for joining me, and as always I welcome any feedback or suggestions you may have in the comments below.