Steam Data Cleaning in Python
This post forms part of a larger series on downloading, processing and analysing data from the Steam store. See all posts here.
View the original notebook on GitHub (Download). Datasets available on Kaggle.
In the first part of this project, we downloaded and generated data sets from the Steam Store API and SteamSpy API. We now need to take this raw data and prepare it in a process commonly referred to as data cleaning.
Currently the downloaded data is not in a very useful state. Many of the columns contain lengthy strings or missing values, which hinder analysis and are especially crippling to any machine learning techniques we may wish to implement. Data cleaning involves handling missing values, tidying up values, and ensuring data is neatly and consistently formatted.
Data cleaning is often cited as being the lengthiest part of any project. As such, it will be broken up into sections. We will begin by taking care of the columns in the steam data that are easiest to deal with and outlining a framework for the process. Of course it could all be completed in one go and a lot more concisely, however we'll be stepping through all the reasons for each decision and building the process iteratively.
The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and see how different features of games affect the success of those games. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set, however it may be a good idea to keep columns which may not seem useful to this particular project in order to provide a robust dataset for future projects.
Towards the end of this section, we'll take care of columns whose cleaning involves exporting data of some kind. Some columns will require much more in-depth processing and cleaning, or provide information that may be a useful topic for another time. We'll clean some of these slightly, then export them separately from the clean dataset.
As a separate post, we'll take a look at an optimisation problem, walking through the process of handling one particular column.
Once that is complete we will repeat the whole cleaning process for the SteamSpy data and combine the results, finishing with a complete data set ready for analysis.
To follow along or perform your own cleaning, the raw data can be found and downloaded on Kaggle.
API references:
- https://partner.steamgames.com/doc/webapi
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamapi.xpaw.me/#
- https://steamspy.com/api.php
Import Libraries and Inspect Data
To begin with, we'll import the required libraries and set customisation options, then take a look at the previously downloaded data by reading it into a pandas dataframe.
# standard library imports
from ast import literal_eval
import itertools
import time
import re
# third-party imports
import numpy as np
import pandas as pd
# customisations
pd.set_option("display.max_columns", 100)
# read in downloaded data
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')
# print out number of rows and columns
print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])
# view first five rows
raw_steam_data.head()
From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns that look to be stored as dictionaries or lists.
We can chain the isnull()
and sum()
methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of roughly 30,000 rows these are unlikely to provide any meaningful information.
null_counts = raw_steam_data.isnull().sum()
null_counts
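If the full list is hard to scan, we could also sort and filter the counts to surface the worst offenders. This is an optional extra step, not part of the cleaning pipeline itself:
# optional: view the columns with the most missing values first
null_counts[null_counts > 0].sort_values(ascending=False).head(10)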
Initial Processing
We will most likely have to handle each column individually, so we will write some functions to keep our methodology organised, and help iteratively develop the process.
Our first function will remove the columns with more than 50% missing values, taking care of the columns with high null counts. We can do this by running a filter on the dataframe, as seen below.
threshold = raw_steam_data.shape[0] // 2
print('Drop columns with more than {} missing values'.format(threshold))
print()
drop_cols = raw_steam_data.columns[null_counts > threshold]
print('Columns to drop: {}'.format(list(drop_cols)))
We can then look at the type and name columns, thinning out our data set a little by removing apps without either.
In the data collection stage, if no information was returned from an app's API request, only the name and appid was stored. We can easily identify these apps by looking at rows with missing data in the type
column, as all other apps have a value here. As seen below, these rows contain no other information so we can safely remove them.
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])
# preview rows with missing type data
raw_steam_data[raw_steam_data['type'].isnull()].head(3)
We can look at the counts of unique values in a column by using the pandas Series.value_counts method. By checking the value counts we see that all rows either have a missing value, as noted above, or 'game' in the type
column.
Once the null rows are removed, we'll be able to remove this column as it doesn't provide us with any more useful information.
raw_steam_data['type'].value_counts(dropna=False)
Taking a look now at the name column, we can check for rows which either have a null value or a string containing 'none'. This isn't recognised as a null value but should be treated as such.
We achieve this by combining boolean filters using brackets and a vertical bar, |
, symbolising a logical 'or'.
There are only four rows which match these criteria, and they appear to be missing a lot of data in other columns so we should definitely remove them.
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]
As we know for certain that all AppIDs should be unique, any rows with the same ID need to be handled.
We can easily view duplicated rows using the DataFrame.duplicated() method of pandas. We can pass keep=False
to view all duplicated rows, or leave the defaults (keep='first'
) to skip over the first row and just show the rest of the duplicates. We can also pass a column label into subset
if we want to filter by a single column.
As we only want to remove the extra rows, we can keep the default behaviour.
duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]
print('Duplicate rows to remove:', duplicate_rows.shape[0])
duplicate_rows.head(3)
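For reference, if we wanted to see every copy of a duplicated app rather than just the extra rows, we could use the keep and subset parameters mentioned above. A quick sketch, not needed for the cleaning itself:
# view all rows sharing a steam_appid, including the first occurrence
raw_steam_data[raw_steam_data.duplicated(subset='steam_appid', keep=False)].head()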
We're now ready to define functions implementing the filters we just looked at. This allows us to easily make changes in the future if we want to alter how the columns are handled, or want to choose a different cut-off threshold for getting rid of columns, for example.
We also define a general purpose process
function which will run all the processing functions we create on the data set. This will allow us to slowly add to it as we develop more functions and ensure we're cleaning the correct dataframe.
Finally we run this function on the raw data, inspecting the first few rows and viewing how many rows and columns have been removed.
def drop_null_cols(df, thresh=0.5):
"""Drop columns with more than a certain proportion of missing values (Default 50%)."""
cutoff_count = len(df) * thresh
return df.dropna(thresh=cutoff_count, axis=1)
def process_name_type(df):
"""Remove null values in name and type columns, and remove type column."""
df = df[df['type'].notnull()]
df = df[df['name'].notnull()]
df = df[df['name'] != 'none']
df = df.drop('type', axis=1)
return df
def process(df):
"""Process data set. Will eventually contain calls to all functions we write."""
# Copy the input dataframe to avoid accidentally modifying original data
df = df.copy()
# Remove duplicate rows - all appids should be unique
df = df.drop_duplicates()
    # Remove columns with more than 50% null values
df = drop_null_cols(df)
# Process rest of columns
df = process_name_type(df)
return df
print(raw_steam_data.shape)
initial_processing = process(raw_steam_data)
print(initial_processing.shape)
initial_processing.head()
Processing Age
Next we'll look at the required_age
column. By looking at the value counts we can see that values are already stored as integers, and the values range from 0 to 20, with one likely error (1818). There are no missing values in this column, but the vast majority have a value of 0. We'll clean the column anyway, but this probably means it won't be of much use in analysis as there is little variance in the data.
initial_processing['required_age'].value_counts(dropna=False).sort_index()
Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. For example, instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+.
To decide which categories (or bins) we should use, we will look at the PEGI age ratings as this is the system used in the United Kingdom, where we're performing our analysis. Ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age recommended to play a game.
Using this to inform our decision, we can use the pandas.cut function to sort our data into each of these categories. Rows with 0 may mean the game is unrated, the rating simply wasn't stated, or the game is rated as suitable for everyone. Because we can't tell which, we'll leave these as they are. As the erroneous row (1818) is most likely meant to be rated 18 anyway, we can set the upper bound well above this value to catch it inside that category.
Below we define a process_age
function to handle this, and inspect the output.
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    df = df.copy()
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 0, 3, 7, 12, 16, 2000]
label_values = [0, 3, 7, 12, 16, 18]
df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
return df
age_df = process_age(initial_processing)
age_df['required_age'].value_counts().sort_index()
Processing the Platforms Column
Whilst we could look at the next column in the dataframe, is_free
, it would make sense that this is linked to the price_overview
column. Ultimately we may wish to combine these columns into one, where free games would have a price of 0.
Looking at the price_overview
column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling both of these together might be somewhat tricky, so instead we'll look at a simpler example.
age_df['price_overview'].head()
The platforms
column appears to contain a key for each of the main operating systems - windows, mac and linux - and a corresponding boolean value, set to True or False depending on the availability on that platform. This should be a reasonably straightforward place to start. We can separate this data out into three columns - one for each platform - filled with boolean values.
age_df['platforms'].head()
So far the cleaning process has been relatively simple, mainly requiring checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.
Our first hurdle is getting python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.
platforms_first_row = age_df['platforms'].iloc[0]
print(type(platforms_first_row))
platforms_first_row
We can get around this using the handy literal_eval function from the built-in ast
module. As the name suggests, this will allow us to evaluate the string, and then index into it as a
dictionary.
eval_first_row = literal_eval(platforms_first_row)
print(type(eval_first_row))
print(eval_first_row)
eval_first_row['windows']
We also need to check for missing values, but fortunately it appears there aren't any in this column.
age_df['platforms'].isnull().sum()
As for formatting the output, let's keep things simple and return a list of supported platforms. Another option would be to create a column for each platform with a corresponding True/False value for each row, something we can keep in mind for the future.
We can create the desired list by calling the str.join() method on a string, and passing an iterable into the function. In this case, we can pass the keys of the row, as seen below.
# create string of keys, joined on a semi-colon
';'.join(eval_first_row.keys())
We also need to inspect the value of each row, and only end up with the keys that have a value of True
, skipping those which are False
. The example below shows how we can do this using a list comprehension.
platforms = {'windows': True, 'mac': True, 'linux': False}
# list comprehension
print([x for x in platforms.keys() if platforms[x]])
# using list comprehension in join
';'.join(x for x in platforms.keys() if platforms[x])
Putting this all together, we can use the pandas Series.apply method to process the rows.
def process_platforms(df):
"""Split platforms column into separate boolean columns for each platform."""
# evaluate values in platforms column, so can index into dictionaries
df = df.copy()
def parse_platforms(x):
d = literal_eval(x)
return ';'.join(platform for platform in d.keys() if d[platform])
df['platforms'] = df['platforms'].apply(parse_platforms)
return df
platforms_df = process_platforms(age_df)
platforms_df['platforms'].value_counts()
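As an aside, if we later decide we do want the separate boolean platform columns mentioned earlier, they could be derived from this semicolon-separated string. The sketch below is for illustration only and isn't part of the cleaning pipeline:
# illustration only: one boolean column per platform, derived from the string
platform_bools = pd.DataFrame({
    platform: platforms_df['platforms'].str.contains(platform)
    for platform in ['windows', 'mac', 'linux']
})
platform_bools.head()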
Processing Price
Now we have built up some intuition around how to deal with data stored as dictionaries, let's return to the is_free
and price_overview
columns as we should now be able to handle them.
First let's check how many null values there are in price_overview
.
platforms_df['price_overview'].isnull().sum()
Whilst that looks like a lot, we have to consider the impact that the is_free
column might be having. Before jumping to conclusions let's check if there any rows with is_free
marked as True and null values in the price_overview
column.
free_and_null_price = platforms_df[(platforms_df['is_free']) & (platforms_df['price_overview'].isnull())]
free_and_null_price.shape[0]
It turns out this accounts for most of the missing values in the price_overview
column, meaning we can handle these by setting the final price as 0. This makes intuitive sense - free games wouldn't have a price.
This means that there are almost 850 rows which aren't free but have null values in the price_overview
column. Let's investigate those next.
not_free_and_null_price = platforms_df[(platforms_df['is_free'] == False) & (platforms_df['price_overview'].isnull())]
not_free_and_null_price.head()
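We can confirm the exact count behind that figure:
# number of non-free rows with missing price data
not_free_and_null_price.shape[0]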
The first few rows contain some big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.
We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:
https://store.steampowered.com/app/[steam_appid]
This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.
def print_steam_links(df):
"""Print links to store page for apps in a dataframe."""
url_base = "https://store.steampowered.com/app/"
for i, row in df.iterrows():
appid = row['steam_appid']
name = row['name']
print(name + ':', url_base + str(appid))
print_steam_links(not_free_and_null_price[:5])
For these games we can conclude that:
- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
- RollerCoaster Tycoon 3: Platinum has been removed from steam (and another game website: GOG)
- "A spokesperson for GOG told Eurogamer it pulled the game "due to expiring licensing rights", and stressed it'll talk with "new distribution rights holders" to bring the game back as soon as possible." Source: Eurogamer
- BioShock has been replaced by BioShock Remastered
- Sam & Max 101 is sold as part of a season, and this can be found in the
package_groups
column
So we have a couple of options here. We could just drop these rows, we could try to figure out the price based on the package_groups
column, or we could leave them for now and return to them later. We'll leave them for now, handling the two price columns, then take a look at the packages next. It may also be that some of these rows are removed later in the cleaning process for other reasons.
If we want to find rows similar to these and deal with each case individually, we could use the .str.contains()
method, as seen below.
platforms_df[platforms_df['name'].str.contains("BioShock™")]
Now we need to figure out how to process the column.
If we take a look at the data for the first row, we can see that the price is stored in a variety of formats. There is a currency key, here GBP, which is perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?
platforms_df['price_overview'][0]
If we inspect another row, we see that there is an active discount, applying an 80% price reduction to the title. It looks like initial
contains the normal price before discount, and final
contains the discounted price. initial_formatted
and final_formatted
contain the price formatted and displayed in the currency. We don't have to worry about these last two, as we'll be storing the price as a float (or integer) and if we wanted, we could format it like this when printing.
With all this in mind, it looks like we'll be checking the value under the currency
key, and using the value in the initial
key.
platforms_df['price_overview'][37]
Now that the preliminary investigation is complete, we can begin defining our function.
We start by evaluating the strings using literal_eval
as before, however if there is a null value we return a properly formatted dictionary with -1 for the initial
value. This will allow us to fill in a value of 0 for free games, then be left with an easily targetable value for the actual null rows.
Next we create currency
and price
columns from the dictionary in the price_overview
column. We define an anonymous function on the fly using a lambda expression, returning the value in each key.
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
def parse_price(x):
if x is not np.nan:
return literal_eval(x)
else:
return {'currency': 'GBP', 'initial': -1}
# evaluate as dictionary and set to -1 if missing
df['price_overview'] = df['price_overview'].apply(parse_price)
# Create columns from currency and initial values
df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
df['price'] = df['price_overview'].apply(lambda x: x['initial'])
# Set price of free games to 0
df.loc[df['is_free'], 'price'] = 0
return df
price_data = process_price(platforms_df)[['name', 'currency', 'price']]
price_data.head()
We're almost finished, but let's check if any games don't have GBP listed as the currency.
price_data[price_data['currency'] != 'GBP']
For some reason there are four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows it's easier and safer to simply drop them.
We can also divide the prices by 100 so they are displayed as floats in pounds.
def process_price(df):
"""Process price_overview column into formatted price column."""
df = df.copy()
def parse_price(x):
if x is not np.nan:
return literal_eval(x)
else:
return {'currency': 'GBP', 'initial': -1}
# evaluate as dictionary and set to -1 if missing
df['price_overview'] = df['price_overview'].apply(parse_price)
# create columns from currency and initial values
df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
df['price'] = df['price_overview'].apply(lambda x: x['initial'])
# set price of free games to 0
df.loc[df['is_free'], 'price'] = 0
# remove non-GBP rows
df = df[df['currency'] == 'GBP']
# change price to display in pounds (only applying to rows with a value greater than 0)
df.loc[df['price'] > 0, 'price'] /= 100
# remove columns no longer needed
df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
return df
price_df = process_price(platforms_df)
price_df[['name', 'price']].head()
Processing Packages
We can now take a look at the packages
and package_groups
columns to help decide what to do with rows that are missing price data. We're not incredibly interested in the columns themselves, as they don't appear to provide much new useful information, except which games come with others as part of a bundle.
# temporarily set a pandas option using with and option_context
with pd.option_context("display.max_colwidth", 500):
display(price_df[['steam_appid', 'packages', 'package_groups', 'price']].head(3))
Overall we have 846 rows with missing price data, which we previously set to -1.
print(price_df[price_df['price'] == -1].shape[0])
We can split these rows into two categories: those with package_groups
data and those without.
If we take a quick look at the package_groups
column we see that there appear to be no null values. On closer inspection, we can find that rows without data are actually stored as empty lists.
print('Null counts:', price_df['package_groups'].isnull().sum())
print('Empty list counts:', price_df[price_df['package_groups'] == "[]"].shape[0])
Using a combination of filters, we can find out how many rows have both missing price
and package_group
data and investigate. We'll count the rows and print links to some of the store pages and look for patterns.
missing_price_and_package = price_df[(price_df['price'] == -1) & (price_df['package_groups'] == "[]")]
print('Number of rows:', missing_price_and_package.shape[0], '\n')
print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])
print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])
Most of the games - 799 of 846 - with missing price data fall into the above category. This probably means they can be safely removed.
From following the links for the first few rows to the store page, it looks like they are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't yet been released and haven't had a price set. We'll take care of all the unreleased games when we clean the release_date
column, but we can remove all of these apps here.
Let's now take a look at the rows that have missing price data but do have package_groups
data. We may be interested in keeping these rows and extracting price data from the package data.
missing_price_have_package = price_df.loc[(price_df['price'] == -1) & (price_df['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]
print('Number of rows:', missing_price_have_package.shape[0], '\n')
print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])
print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])
Looking at a selection of these rows, the games appear to be: superseded by a newer release or remaster, part of a bigger bundle of games or episodic, or included when purchasing another game.
Whilst we could extract prices from the package_groups
data, the most sensible option seems to be removing these rows. There are only 47 rows this applies to, and any with a newer release will still have the re-release in the data.
Since our logic interacts heavily with the price data we will update the process_price
function rather than creating a new one.
def process_price(df):
"""Process price_overview column into formatted price column, and take care of package columns."""
df = df.copy()
def parse_price(x):
if x is not np.nan:
return literal_eval(x)
else:
return {'currency': 'GBP', 'initial': -1}
# evaluate as dictionary and set to -1 if missing
df['price_overview'] = df['price_overview'].apply(parse_price)
# create columns from currency and initial values
df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
df['price'] = df['price_overview'].apply(lambda x: x['initial'])
# set price of free games to 0
df.loc[df['is_free'], 'price'] = 0
# remove non-GBP rows
df = df[df['currency'] == 'GBP']
# remove rows where price is -1
df = df[df['price'] != -1]
    # change price to display in pounds (can apply to all rows now the -1 rows are removed)
df['price'] /= 100
# remove columns no longer needed
df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
return df
price_df = process_price(platforms_df)
price_df.head()
The next columns in the data are descriptive columns - detailed_description
, about_the_game
and short_description
. We won't be handling them now, instead returning to them later on when we deal with export columns. These are columns where we will export all or some of the data to a separate csv file as part of the cleaning.
Processing Languages
Beyond that, the next column is supported_languages
. As we will be performing the analysis for an English company, we will only be interested in games available in English. Whilst we could remove non-English games at this stage, instead we will create a column marking English games with a boolean value - True or False.
We begin as usual by looking for rows with null values.
price_df['supported_languages'].isnull().sum()
Taking a closer look at these apps, it doesn't look like there's anything wrong with them. It may be that the data simply wasn't supplied. As there are only 4 rows affected we will go ahead and remove these from the data set.
price_df[price_df['supported_languages'].isnull()]
Next we'll take a look at the structure of the column. By looking at the value for the first row and the values for the most common rows, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of html and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.
print(price_df['supported_languages'][0])
price_df['supported_languages'].value_counts().head(10)
It looks like English-only games make up a little over half the rows in our dataset (~16,000), and English plus other languages make up many more. We could create columns for any of the other languages by string searching, but for simplicity we'll focus on just the English ones.
Using the Series.apply method once again, we can check if the string 'english' appears in each row. We'll return a 1 if 'english' is found and 0 otherwise. Whilst we could use True/False, a binary 1/0 can be interpreted as a boolean value and saves a little space in the csv file.
Inside the lambda function, the variable x
will take on the value of each row as the expression is evaluated. We apply the lower()
string method so capitalisation doesn't matter.
def process_language(df):
"""Process supported_languages column into a boolean 'is english' column."""
df = df.copy()
# drop rows with missing language data
df = df.dropna(subset=['supported_languages'])
df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
df = df.drop('supported_languages', axis=1)
return df
language_df = process_language(price_df)
language_df[['name', 'english']].head()
Before moving on, we can take a quick look at the results and see that most of the apps support English.
language_df['english'].value_counts()
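If we ever wanted similar flags for other languages, the same string search would work. Below is a quick sketch using French purely as an example, applied to price_df since the supported_languages column has now been dropped from language_df:
# illustration only: count games listing French support
price_df['supported_languages'].dropna().apply(
    lambda x: 1 if 'french' in x.lower() else 0
).value_counts()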
Processing Developers and Publishers
We'll skip over the next few columns and take a look at the developers
and publishers
columns. They will most likely contain similar information so we can look at them together.
We'll start by checking the null counts. Whilst the publishers column doesn't appear to have any null values, if we search for empty lists we see that there are over 200 'hidden' null values.
print('Developers null counts:', language_df['developers'].isnull().sum())
print('Developers empty list counts:', language_df[language_df['developers'] == "['']"].shape[0])
print('\nPublishers null counts:', language_df['publishers'].isnull().sum())
print('Publishers empty list counts:', language_df[language_df['publishers'] == "['']"].shape[0])
Ultimately we want a data set with no missing values. That means we have a few options for dealing with these two columns:
- Remove all rows missing either developer or publisher information
- Impute missing information by replacing the missing column with the column we have (i.e. if developers is missing, fill it with the value in publishers)
- Fill missing information with 'Unknown' or 'None'
We can investigate some of the rows with missing data to help inform our decision.
no_dev = language_df[language_df['developers'].isnull()]
print('Total games missing developer:', no_dev.shape[0], '\n')
print_steam_links(no_dev[:5])
no_pub = language_df[language_df['publishers'] == "['']"]
print('\nTotal games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])
no_dev_or_pub = language_df[(language_df['developers'].isnull()) & (language_df['publishers'] == "['']")]
print('\nTotal games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])
It appears we are looking at a mix of titles, mostly smaller ones, and some of the smaller indie titles may have been self-published. Others simply have wrong or missing data, found by searching for the titles elsewhere. As our priority is creating a clean data set, and there are only a few hundred rows, it will be fine to remove them from the data.
Let's take a look at the structure of the data. Below we inspect some rows near the beginning of the dataframe. It looks like both columns are stored as lists which can have one or multiple values. We'll have to evaluate the rows as before, so they are recognised as lists, then index into them accordingly.
language_df[['developers', 'publishers']].iloc[24:28]
As we have some single values and some multiple, we have to decide how to handle them. Here are some potential solutions:
- Create a column for each value in the list (i.e. developer_1, developer_2)
- Create a column with the first value in the list and a column with the rest of the values (i.e. developer_1, other_developers)
- Create a column with the first value in the list and disregard the rest
- Combine all values into one column, simply unpacking the list
Let's begin defining our function, and take a look at how many rows have multiple developers or publishers. After evaluating each row, we can find the length of the lists in each row by using the Series.str.len() method. By filtering only rows where the list has more than one element, we can find the number of rows with more than one value in each column.
def process_developers_and_publishers(df):
# remove rows with missing data
df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
for col in ['developers', 'publishers']:
df[col] = df[col].apply(lambda x: literal_eval(x))
# filter dataframe to rows with lists longer than 1, and store the number of rows
num_rows = df[df[col].str.len() > 1].shape[0]
print('Rows in {} column with multiple values:'.format(col), num_rows)
process_developers_and_publishers(language_df)
It turns out that the vast majority have only one value for these columns. If we went with the first or second solutions above, we'd be left with columns with mostly missing data. We could go with the third option, but the first value in the list isn't necessarily the most important, and this seems unfair if multiple teams were involved.
The best way forward seems to be the fourth option - if there are multiple values we combine them into the same column, calling str.join() as we did before. If we pass a list with only one value, we get a string with just that value. If we pass a list with multiple values, we get a single delimited string as desired. We can see this below.
', '.join(['one item'])
', '.join(['multiple', 'different', 'items'])
We can't join on a comma as a number of developers and publishers have a comma in their name, a couple of which can be seen below.
language_df.loc[language_df['developers'].str.contains(",", na=False), ['steam_appid', 'developers', 'publishers']].head(4)
Instead we can join on a semi-colon (;
). We have 3 rows which contain a semi-colon in their name, so we'll remove these. We'll be able to identify and split individual developer/publisher names in the future by handling it this way.
language_df.loc[language_df['developers'].str.contains(";", na=False), ['steam_appid', 'developers', 'publishers']]
There are also a small number of rows that have ['NA']
or ['N/A']
for publisher. These are some really well hidden null values, and they didn't actually surface until much later in the original development process. This helps highlight the iterative nature of data cleaning - you may discover errors in rows or data that went previously undiscovered, and have to go back and update or correct them.
language_df[(language_df['publishers'] == "['NA']") | (language_df['publishers'] == "['N/A']")].shape[0]
Now we're ready to finish the function we started. We'll abandon the for loop, as there is not too much repetition, and run it on the data as always.
def process_developers_and_publishers(df):
"""Parse columns as semicolon-separated string."""
# remove rows with missing data (~ means not)
df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
df = df[~(df['developers'].str.contains(';')) & ~(df['publishers'].str.contains(';'))]
df = df[(df['publishers'] != "['NA']") & (df['publishers'] != "['N/A']")]
# create list for each
df['developer'] = df['developers'].apply(lambda x: ';'.join(literal_eval(x)))
df['publisher'] = df['publishers'].apply(lambda x: ';'.join(literal_eval(x)))
df = df.drop(['developers', 'publishers'], axis=1)
return df
dev_pub_df = process_developers_and_publishers(language_df)
dev_pub_df[['name', 'steam_appid', 'developer', 'publisher']].head()
Processing Categories and Genres
We'll take a look at the categories
and genres
columns next. Below we take a look at the null counts and structure of the data. Both appear to be a list of dictionaries containing an id and description key-value pair. Similar to our handling of developers and publishers, it may be best to extract just the descriptions into a list for now. We could make a list of the IDs, keeping track of the corresponding description externally, but that seems overly complex for our purposes.
print('Categories:\n')
print('Null values:', dev_pub_df['categories'].isnull().sum())
print()
print(dev_pub_df['categories'][0])
print('\nGenres:\n')
print('Null values:', dev_pub_df['genres'].isnull().sum())
print()
print(dev_pub_df['genres'].iloc[0])
print(dev_pub_df['genres'].iloc[1000])
Before we begin defining a function we'll inspect some of the null rows, then we can decide how to handle them.
Using the pandas DataFrame.sample method, we can randomly sample a number of rows from the dataset. We set a random_state so the output is the same each time.
print_steam_links(dev_pub_df[dev_pub_df['categories'].isnull()].sample(5, random_state=0))
All of the above rows with null categories data are applications or software of some kind, and not actually games. It would definitely be best to remove these, as we are interested in analysing games from the steam store.
Below we take a look at rows with missing genres data. There doesn't seem to be anything wrong with these games, and it suggests that genre data simply wasn't supplied. As there are only 37 rows affected, we'll remove these rows to keep our dataset complete.
print_steam_links(dev_pub_df[dev_pub_df['genres'].isnull()].sample(5, random_state=0))
As we did for the developer and publisher columns, we can create an anonymous function and join the results on a semicolon. Inside the function we use a list comprehension to traverse each dictionary and extract the value under the description
key, as seen below.
example_category = "[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}]"
[x['description'] for x in literal_eval(example_category)]
We then use the apply function again to turn each column into a simple delimited list.
def process_categories_and_genres(df):
    """Parse categories and genres columns as semicolon-separated strings of descriptions."""
    df = df.copy()
df = df[(df['categories'].notnull()) & (df['genres'].notnull())]
for col in ['categories', 'genres']:
df[col] = df[col].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))
return df
cat_gen_df = process_categories_and_genres(dev_pub_df)
cat_gen_df[['steam_appid', 'categories', 'genres']].head()
Processing Achievements and Content Descriptors
The final columns we will take a look at before moving on to export columns are achievements
and content_descriptors
. Let's take a look at the null counts for each column and a small sample of rows.
print('Achievements null counts:', cat_gen_df['achievements'].isnull().sum())
print('Content Descriptors null counts:', cat_gen_df['content_descriptors'].isnull().sum())
cat_gen_df[['name', 'achievements', 'content_descriptors']].iloc[8:13]
It looks like both columns are stored as dictionaries, with standard formats if no details are provided or exist.
Below we take a closer look at a single row from the achievements column.
literal_eval(cat_gen_df['achievements'][9])
There are two keys in the top level of the dictionary: total
and highlighted
. The highlighted key looks too specific, being a selection of achievements specific to that game, so we will remove it. It may be worthwhile extracting the total
value though.
Now let's take a look at the content_descriptors
column.
cat_gen_df['content_descriptors'].value_counts().head(6)
Content descriptors contain age-related warnings about the content of a game. They are identified by a numeric ID number, with optional notes supplied. Almost 26,000 rows have an empty list, indicating either no content descriptors or none provided. We already have the required_age
column, which gives us similar information. Because of this, and because the rows are highly specific to each game, we will drop this column entirely.
Let's now define a function, taking a look at the value counts to verify everything went as expected.
def process_achievements_and_descriptors(df):
"""Parse as total number of achievements."""
df = df.copy()
df = df.drop('content_descriptors', axis=1)
def parse_achievements(x):
if x is np.nan:
# missing data, assume has no achievements
return 0
else:
# else has data, so can extract and return number under total
return literal_eval(x)['total']
df['achievements'] = df['achievements'].apply(parse_achievements)
return df
achiev_df = process_achievements_and_descriptors(cat_gen_df)
achiev_df['achievements'].value_counts().head()
It looks like we were successful. We'll leave this column as it is for now, however we may wish to consider grouping the values together in bins, like we did for the age column. This is a decision we can make during the feature engineering stage of our analysis, and we can decide at that point if it will be more useful.
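For example, if we did decide to bin the achievement counts later, it could look much like the age processing. The sketch below uses arbitrary cut points, purely for illustration:
# illustration only: arbitrary bins for achievement counts
pd.cut(achiev_df['achievements'],
       bins=[-1, 0, 10, 50, np.inf],
       labels=['none', 'few', 'some', 'many']).value_counts()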
Let's now add these functions into the process
function and run it on the raw data. This isn't strictly necessary but it will keep things organised and ensure we don't accidentally skip running a function.
We'll then inspect everything we've completed so far. As you will see, there is still plenty more left to do.
def process(df):
"""Process data set. Will eventually contain calls to all functions we write."""
# Copy the input dataframe to avoid accidentally modifying original data
df = df.copy()
# Remove duplicate rows - all appids should be unique
df = df.drop_duplicates()
    # Remove columns with more than 50% null values
df = drop_null_cols(df)
# Process columns
df = process_name_type(df)
df = process_age(df)
df = process_platforms(df)
df = process_price(df)
df = process_language(df)
df = process_developers_and_publishers(df)
df = process_categories_and_genres(df)
df = process_achievements_and_descriptors(df)
return df
partially_clean = process(raw_steam_data)
partially_clean.head()
Export Columns
There are a number of columns containing information that is either not useful to the current project or is too complex to be useful for now. We may wish to incorporate them into a future project, but for now we will export the columns to separate files and remove them from the dataset.
Processing Description Columns
We have a series of columns with descriptive text about each game: detailed_description
, about_the_game
and short_description
. As the column names imply, these provide information about each game in string format. This is great for humans' understanding, but when it comes to machines is a lot trickier.
These columns could be used as the basis for an interesting recommender system or sentiment analysis project, however they are not required in our current project. We'll be removing them as they take up large amounts of space, and will only serve to slow down any computation on the data.
We'll briefly inspect the columns, in case of anomalies, and export just the description data to a separate file.
partially_clean[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()
We have 14 rows with missing data for these columns, and chances are the 14 rows with missing detailed_description
are the rows with missing about_the_game
and short_description
data too.
By inspecting the individual rows below, we can see that this is true - all rows with missing data in one description column have missing data in the others as well.
partially_clean[partially_clean['detailed_description'].isnull()].head(3)
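If we want to verify this programmatically rather than just eyeballing the preview, a quick check using the same three column names counts the rows where all of the description columns are null at once:
# rows where all three description columns are missing simultaneously
desc_cols = ['detailed_description', 'about_the_game', 'short_description']
partially_clean[desc_cols].isnull().all(axis=1).sum()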
Interestingly, all of these titles are games from 2006 developed and published by PopCap Games. My best guess is that they were developed previously and all added to the Steam store in one go after Valve allowed third-party titles.
We'll remove these rows, as well as any with a description of less than 20 characters, like those below.
partially_clean[partially_clean['detailed_description'].str.len() <= 20]
To handle exporting the data to file, we'll write a reusable function which we can call upon for future columns. We will include the steam_appid
column as it will allow us to match up these rows with rows in our primary dataset later on, using a merge (like a join in SQL).
def export_data(df, filename):
"""Export dataframe to csv file, filename prepended with 'steam_'.
filename : str without file extension
"""
filepath = '../data/exports/steam_' + filename + '.csv'
df.to_csv(filepath, index=False)
print_name = filename.replace('_', ' ')
print("Exported {} to '{}'".format(print_name, filepath))
We can now define a function to process and export the description columns.
def process_descriptions(df, export=False):
"""Export descriptions to external csv file then remove these columns."""
# remove rows with missing description data
df = df[df['detailed_description'].notnull()].copy()
# remove rows with unusually small description
df = df[df['detailed_description'].str.len() > 20]
# by default we don't export, useful if calling function later
if export:
# create dataframe of description columns
description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
export_data(description_data, filename='description_data')
# drop description columns from main dataframe
df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
return df
desc_df = process_descriptions(partially_clean, export=True)
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()
Processing Media Columns
Similar to the description columns, we have three columns that contain links to various images: header_image
, screenshots
and background
. Whilst we won't be needing this data in this project, it could open the door to some interesting image analysis in the future. We will treat these columns in almost the same way, exporting the contents to a csv file then removing them from the dataset.
Again, let's check for missing values.
image_cols = ['header_image', 'screenshots', 'background']
for col in image_cols:
print(col+':', desc_df[col].isnull().sum())
desc_df[image_cols].head()
As with the description columns, it is likely that the 4 rows with no screenshots
data are the same rows with no background
data. There are so few that it is probably safe to remove them.
Before we make up our minds, let's inspect the rows.
no_screenshots = desc_df[desc_df['screenshots'].isnull()]
print_steam_links(no_screenshots)
no_screenshots
As we predicted, the rows without screenshots are also the rows without a background. It looks like two are unreleased, and if we'd dealt with the release_date
column already these would already be removed. One was released recently (5 Jan, 2019), and perhaps didn't have screenshots at the time of downloading, and one simply doesn't have any. As we suspected, it's safe to remove all these rows.
There is also a movies
column with similar data. Whilst having more missing values, presumably for games without videos, it appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need them but we can include them in the export and remove them from our data set.
print('Movies null values:', desc_df['movies'].isnull().sum())
print()
desc_df[desc_df['movies'].notnull()]['movies'].iloc[0]
We can now put this all together and define a process_media
function.
def process_media(df, export=False):
"""Remove media columns from dataframe, optionally exporting them to csv first."""
df = df[df['screenshots'].notnull()].copy()
if export:
media_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
export_data(media_data, 'media_data')
df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
return df
media_df = process_media(desc_df, export=True)
# inspect exported data
pd.read_csv('../data/exports/steam_media_data.csv').head()
Before we move on, we can inspect the memory savings of removing these columns by comparing the output of the DataFrame.info method. If we pass memory_usage="deep"
we get the true memory usage of each DataFrame. Without this, pandas estimates the amount used. This is because of the way python stores object (string) columns under the hood. Essentially python keeps track of a list of pointers which point to the actual strings in memory. It's a bit like if you hid a bunch of items around the house, and kept a list of where everything was. You couldn't tell the total size of everything just by looking at the list, but you could take a rough guess. Only by following the list and inspecting each individual item could you get an exact figure.
The blog post 'Why Python Is Slow' goes into more detail, but all we need to be aware of is that by passing the parameter we ensure we get the true value of memory usage. We also pass verbose=False
to truncate unnecessary output.
We can see that already we have shrunk the memory usage from 285 MB to just under 55 MB. This is great because in general, the smaller the memory footprint the faster our code will run in future. And of course, we're not finished yet.
print('Before removing data:\n')
achiev_df.info(verbose=False, memory_usage="deep")
print('\nData with descriptions and media removed:\n')
media_df.info(verbose=False, memory_usage="deep")
Website and Support Info
Next we will look at the website
and support_info
columns. Seen below, they both contain links to external websites. The website column is simply stored as a string whereas the support info column is stored as a dictionary of url
and email
.
There are a large number of rows with no website listed, and while there are no null values in the support_info
column, it looks like many will have empty url
and email
values inside the data.
For our dataset we'll be dropping both these columns, as they are far too specific to be useful in our analysis. As you may have guessed, we will extract and export this data as we have done before. Even if it isn't useful now, it could be interesting at a later date.
print('website null counts:', media_df['website'].isnull().sum())
print('support_info null counts:', media_df['support_info'].isnull().sum())
with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
display(media_df[['name', 'website', 'support_info']][75:80])
We're going to split the support info into two separate columns. We'll keep all the code that parses the columns inside the export if
statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain NaN
whereas the other two columns contain a blank string (''
) for missing data, as once we have exported to csv they will be represented the same way.
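A quick way to convince ourselves of that, purely as an illustration: writing both a NaN and an empty string to csv produces the same empty field.
# illustration only: NaN and '' are both written as empty csv fields
pd.DataFrame({'value': [np.nan, '']}).to_csv(index=False)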
def process_info(df, export=False):
"""Drop support information from dataframe, optionally exporting beforehand."""
if export:
support_info = df[['steam_appid', 'website', 'support_info']].copy()
support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))
support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
support_info = support_info.drop('support_info', axis=1)
# only keep rows with at least one piece of information
support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]
export_data(support_info, 'support_info')
df = df.drop(['website', 'support_info'], axis=1)
return df
info_df = process_info(media_df, export=True)
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()
System Requirements
As you may have noticed, there are three columns containing the system specifications required to run each game on each platform. These columns are pc_requirements
, mac_requirements
, and linux_requirements
. As computing power has increased over the years, this information won't be of particular use in this analysis. It could be fascinating to use as a proxy of how computers have developed over the years, so we'll want to export the data, but we won't be keeping it in this dataset.
Taking a look at the null counts, it looks like there is data for every row.
requirements_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']
print('null counts:\n')
for col in requirements_cols:
print(col+':', info_df[col].isnull().sum())
However if we look at the data a little more closely, we see that some rows actually store an empty list. These won't appear as null rows, but once evaluated won't provide any information, so can be treated as null values.
info_df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()
By using a simple boolean filter and checking the shape attribute, we can get a count of how many empty lists there are.
print('Empty list counts:\n')
for col in requirements_cols:
print(col+':', info_df[info_df[col] == '[]'].shape[0])
That's over half of the rows for both mac and linux requirements. That probably means there is not enough data in these two columns for them to be very useful.
It turns out that most games are developed solely for windows, with mac and linux ports only becoming more common in recent years. Naturally it would make sense that any games that aren't supported on mac or linux would not have corresponding requirements.
As we have already cleaned the platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements column with data in the platform column. If a row has an empty list in the requirements column but the platforms column shows it is supported, it means the data is missing.
for col in ['mac_requirements', 'linux_requirements']:
platform = col.split('_')[0]
print(platform+':', info_df[(info_df[col] == '[]') & (info_df['platforms'].str.contains(platform))].shape[0])
Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.
Let's also have a look for missing values in the pc/windows column.
print('windows:', info_df[(info_df['pc_requirements'] == '[]') & (info_df['platforms'].str.contains('windows'))].shape[0])
Only 9 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.
missing_windows_requirements = info_df[(info_df['pc_requirements'] == '[]') & (info_df['platforms'].str.contains('windows'))]
print_steam_links(missing_windows_requirements[:5])
There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.
We can assume that a cross-platform game will have similar requirements in terms of hardware for each platform it supports. With this in mind we can focus on the pc_requirements
column, and use the platforms
column to tell us which other platforms are supported.
Now we will take a look at a few rows from the dataset, to observe the structure of the stored data.
display(info_df['pc_requirements'].iloc[0])
display(info_df['pc_requirements'].iloc[2000])
display(info_df['pc_requirements'].iloc[15000])
The data is stored as a dictionary, as we've seen before. There is definitely a key for minimum
, but apart from that it is hard to see at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like Processor
and Memory
for some of the rows.
Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.
By creating a dataframe from a selection of rows, we can easily and quickly make changes using the pandas .str accessor, allowing us to use python string formatting and regular expressions.
view_requirements = info_df['pc_requirements'].iloc[[0, 2000, 15000]].copy()
view_requirements = (view_requirements
.str.replace(r'\\[rtn]', '')
.str.replace(r'<[pbr]{1,2}>', ' ')
.str.replace(r'<[\/"=\w\s]+>', '')
)
for i, row in view_requirements.items():
display(row)
We can now see more clearly the contents and structure of these rows. Some rows have both Minimum
and Recommended
requirements inside a minimum
key, and some have separate minimum
and recommended
keys. Some have headings like Processor:
and Storage:
before various components, others simply have a list of components. Some state particular speeds for components, like 2 Ghz CPU, others state specific models, like 'Intel Core 2 Duo', amongst this information.
It seems like it would be possible to extract individual component information from this data, however it would be a lengthy and complex process worthy of another project. With that in mind, it seems best to proceed by slightly cleaning the data before exporting it, but not trying to deal with individual components right now.
We'll export the raw data, and split out the data in the minimum
and recommended
keys. This will mean lots of rows still have recommended data hidden inside the minimum key, but this will do for now.
def process_requirements(df, export=False):
    """Remove requirements columns, optionally exporting lightly cleaned pc requirements to csv first."""
    if export:
requirements = df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].copy()
# remove rows with missing pc requirements
requirements = requirements[requirements['pc_requirements'] != '[]']
requirements['requirements_clean'] = (requirements['pc_requirements']
.str.replace(r'\\[rtn]', '')
.str.replace(r'<[pbr]{1,2}>', ' ')
.str.replace(r'<[\/"=\w\s]+>', '')
)
requirements['requirements_clean'] = requirements['requirements_clean'].apply(lambda x: literal_eval(x))
# split out minimum and recommended into separate columns
requirements['minimum'] = requirements['requirements_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
requirements['recommended'] = requirements['requirements_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
requirements = requirements.drop('requirements_clean', axis=1)
export_data(requirements, 'requirements_data')
df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
return df
reqs_df = process_requirements(info_df, export=True)
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()
Processing Release Date
We're almost finished, and the final column to clean is release_date
. It provides the opportunity for an interesting exploration of code optimisation, so we'll be walking through the development process in a separate post. For now, we'll just include the complete function to run on the dataset.
# export data for optimisation post
reqs_df.to_csv('../data/exports/steam_partially_clean.csv', index=False)
def process_release_date(df):
    """Parse release_date column into datetime format, removing unreleased games."""
    df = df.copy()
def eval_date(x):
x = literal_eval(x)
if x['coming_soon']:
return '' # return blank string so can drop missing at end
else:
return x['date']
df['release_date'] = df['release_date'].apply(eval_date)
def parse_date(x):
if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
return x.replace(',', '')
elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
return '1 ' + x
elif x == '':
return np.nan
else:
# Should be everything, print out anything left just in case
print(x)
df['release_date'] = df['release_date'].apply(parse_date)
df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
df = df[df['release_date'].notnull()]
return df
We can now update the process
function once more, and run the full cleaning process on the raw dataset.
def process(df):
"""Process data set. Will eventually contain calls to all functions we write."""
# Copy the input dataframe to avoid accidentally modifying original data
df = df.copy()
# Remove duplicate rows - all appids should be unique
df = df.drop_duplicates()
    # Remove columns with more than 50% null values
df = drop_null_cols(df)
# Process columns
df = process_name_type(df)
df = process_age(df)
df = process_platforms(df)
df = process_price(df)
df = process_language(df)
df = process_developers_and_publishers(df)
df = process_categories_and_genres(df)
df = process_achievements_and_descriptors(df)
df = process_release_date(df)
# Process columns which export data
df = process_descriptions(df, export=True)
df = process_media(df, export=True)
df = process_info(df, export=True)
df = process_requirements(df, export=True)
return df
steam_data = process(raw_steam_data)
steam_data.head()
Final Steps
That just about does it for cleaning the Steam data. We now have a cleaned dataset, ready to be combined with the data downloaded from SteamSpy once that has been cleaned as well.
Before we export the cleaned steam data, we'll check that we have eradicated missing values, and have a look at the memory footprint like we did earlier.
steam_data.isnull().sum()
raw_steam_data.info(verbose=False, memory_usage="deep")
steam_data.info(verbose=False, memory_usage="deep")
It looks like we've taken care of all the null values, and the size of the dataset has been reduced considerably.
We'll also check that no unreleased games have slipped through (data was scraped on or before 1st May, 2019).
steam_data[steam_data['release_date'] > '2019-05-01']
Now that we're happy with our dataframe we are ready to export to file and finish this part of the project.
steam_data.to_csv('../data/exports/steam_data_clean.csv', index=False)
As I said at the beginning, data cleaning can be a very lengthy process. We definitely could have taken some shortcuts, and didn't have to go into so much depth, but I think it was interesting and valuable to go through the data as thoroughly as we did. A useful side effect of this is that we're now much more familiar with the data, and this should be incredibly useful when it comes to analysis. One of the most important prerequisites for data analysis is having good data, and the foundation of this is built upon solid data cleaning.
Next time we'll take a look at optimising the processing of the release_date
column, as mentioned, then move onto cleaning the SteamSpy data. Once that is complete we can begin exploring and analysing the data.
Thanks for joining me, and I welcome any feedback or suggestions you may have in the comments below.