{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Steam Data Cleaning\n", "\n", "*This forms part of a larger series of posts for my [blog](http://nik-davis.github.io) on downloading, processing and analysing data from the steam store. [See all posts here](http://nik-davis.github.io/tag/steam).*" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "application/json": { "Software versions": [ { "module": "Python", "version": "3.7.3 64bit [MSC v.1900 64 bit (AMD64)]" }, { "module": "IPython", "version": "7.5.0" }, { "module": "OS", "version": "Windows 10 10.0.18362 SP0" }, { "module": "numpy", "version": "1.16.3" }, { "module": "pandas", "version": "0.24.2" } ] }, "text/html": [ "
SoftwareVersion
Python3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython7.5.0
OSWindows 10 10.0.18362 SP0
numpy1.16.3
pandas0.24.2
Tue Jun 11 13:07:43 2019 GMT Summer Time
" ], "text/latex": [ "\\begin{tabular}{|l|l|}\\hline\n", "{\\bf Software} & {\\bf Version} \\\\ \\hline\\hline\n", "Python & 3.7.3 64bit [MSC v.1900 64 bit (AMD64)] \\\\ \\hline\n", "IPython & 7.5.0 \\\\ \\hline\n", "OS & Windows 10 10.0.18362 SP0 \\\\ \\hline\n", "numpy & 1.16.3 \\\\ \\hline\n", "pandas & 0.24.2 \\\\ \\hline\n", "\\hline \\multicolumn{2}{|l|}{Tue Jun 11 13:07:43 2019 GMT Summer Time} \\\\ \\hline\n", "\\end{tabular}\n" ], "text/plain": [ "Software versions\n", "Python 3.7.3 64bit [MSC v.1900 64 bit (AMD64)]\n", "IPython 7.5.0\n", "OS Windows 10 10.0.18362 SP0\n", "numpy 1.16.3\n", "pandas 0.24.2\n", "Tue Jun 11 13:07:43 2019 GMT Summer Time" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view software version information\n", "\n", "# http://raw.github.com/jrjohansson/version_information/master/version_information.py\n", "%load_ext version_information\n", "# %reload_ext version_information\n", "\n", "%version_information numpy, pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![steam_logo](https://nik-davis.github.io/images/steam_logo_white.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "In the first part of this project, we downloaded and generated data sets from the Steam Store API and SteamSpy API. We now need to take this raw data and prepare it in a process commonly referred to as [data cleaning](https://en.wikipedia.org/wiki/Data_cleansing).\n", "\n", "Currently the downloaded data is not in a very useful state. Many of the columns contain lengthy strings or missing values, which hinder analysis and are especially crippling to any machine learning techniques we may wish to implement. Data cleaning involves handling missing values, tidying up values, and ensuring data is neatly and consistently formatted.\n", "\n", "\n", "\n", "Data cleaning is often cited as being the lengthiest part of any project. As such, it will be broken up into sections. We will begin by taking care of the columns in the steam data that are easiest to deal with and outlining a framework for the process. Of course it could all be completed in one go and a lot more concisely, however we'll be stepping through all the reasons for each decision and building the process iteratively.\n", "\n", "The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and see how different features of games affect the success of those games. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set, however it may be a good idea to keep columns which may not seem useful to this particular project in order to provide a robust dataset for future projects.\n", "\n", "Towards the end of this section, we'll take care of columns that involve exporting data of some kind. Some columns will require much more in-depth processing and cleaning, or provide information that may be a useful topic for another time. We'll clean some of these slightly, then export them separately from the clean dataset.\n", "\n", "As a separate post, we'll take a look at an optimisation problem, walking through the process of handling one particular column.\n", "\n", "Once that is complete we will repeat the whole cleaning process for the SteamSpy data and combine the results, finishing with a complete data set ready for analysis.\n", "\n", "To follow along or perform your own cleaning, the raw data can be found and downloaded on [Kaggle](https://www.kaggle.com/nikdavis/steam-store-raw).\n", "\n", "## API references:\n", "\n", "- https://partner.steamgames.com/doc/webapi\n", "- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI\n", "- https://steamapi.xpaw.me/#\n", "- https://steamspy.com/api.php\n", "\n", "## Import Libraries and Inspect Data\n", "\n", "To begin with, we'll import the required libraries and set customisation options, then take a look at the previously downloaded data by reading it into a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# standard library imports\n", "from ast import literal_eval\n", "import itertools\n", "import time\n", "import re\n", "\n", "# third-party imports\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# customisations\n", "pd.set_option(\"max_columns\", 100)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows: 29235\n", "Columns: 39\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typenamesteam_appidrequired_ageis_freecontroller_supportdlcdetailed_descriptionabout_the_gameshort_descriptionfullgamesupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementslegal_noticedrm_noticeext_user_account_noticedeveloperspublishersdemosprice_overviewpackagespackage_groupsplatformsmetacriticreviewscategoriesgenresscreenshotsmoviesrecommendationsachievementsrelease_datesupport_infobackgroundcontent_descriptors
0gameCounter-Strike100.0FalseNaNNaNPlay the world's number 1 online action game. ...Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...NaNEnglish<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/10/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...NaNNaNNaN['Valve']['Valve']NaN{'currency': 'GBP', 'initial': 719, 'final': 7...[7][{'name': 'default', 'title': 'Buy Counter-Str...{'windows': True, 'mac': True, 'linux': True}{'score': 88, 'url': 'https://www.metacritic.c...NaN[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 65735}{'total': 0}{'coming_soon': False, 'date': '1 Nov, 2000'}{'url': 'http://steamcommunity.com/app/10', 'e...https://steamcdn-a.akamaihd.net/steam/apps/10/...{'ids': [2, 5], 'notes': 'Includes intense vio...
1gameTeam Fortress Classic200.0FalseNaNNaNOne of the most popular online action games of...One of the most popular online action games of...One of the most popular online action games of...NaNEnglish, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/20/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...NaNNaNNaN['Valve']['Valve']NaN{'currency': 'GBP', 'initial': 399, 'final': 3...[29][{'name': 'default', 'title': 'Buy Team Fortre...{'windows': True, 'mac': True, 'linux': True}NaNNaN[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 2802}{'total': 0}{'coming_soon': False, 'date': '1 Apr, 1999'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/20/...{'ids': [2, 5], 'notes': 'Includes intense vio...
2gameDay of Defeat300.0FalseNaNNaNEnlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...NaNEnglish, French, German, Italian, Spanish - Spainhttps://steamcdn-a.akamaihd.net/steam/apps/30/...http://www.dayofdefeat.com/{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...NaNNaNNaN['Valve']['Valve']NaN{'currency': 'GBP', 'initial': 399, 'final': 3...[30][{'name': 'default', 'title': 'Buy Day of Defe...{'windows': True, 'mac': True, 'linux': True}{'score': 79, 'url': 'https://www.metacritic.c...NaN[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 1992}{'total': 0}{'coming_soon': False, 'date': '1 May, 2003'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/30/...{'ids': [], 'notes': None}
3gameDeathmatch Classic400.0FalseNaNNaNEnjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...NaNEnglish, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/40/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...NaNNaNNaN['Valve']['Valve']NaN{'currency': 'GBP', 'initial': 399, 'final': 3...[31][{'name': 'default', 'title': 'Buy Deathmatch ...{'windows': True, 'mac': True, 'linux': True}NaNNaN[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 931}{'total': 0}{'coming_soon': False, 'date': '1 Jun, 2001'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/40/...{'ids': [], 'notes': None}
4gameHalf-Life: Opposing Force500.0FalseNaNNaNReturn to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...NaNEnglish, French, German, Koreanhttps://steamcdn-a.akamaihd.net/steam/apps/50/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...NaNNaNNaN['Gearbox Software']['Valve']NaN{'currency': 'GBP', 'initial': 399, 'final': 3...[32][{'name': 'default', 'title': 'Buy Half-Life: ...{'windows': True, 'mac': True, 'linux': True}NaNNaN[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 4355}{'total': 0}{'coming_soon': False, 'date': '1 Nov, 1999'}{'url': 'https://help.steampowered.com', 'emai...https://steamcdn-a.akamaihd.net/steam/apps/50/...{'ids': [], 'notes': None}
\n", "
" ], "text/plain": [ " type name steam_appid required_age is_free \\\n", "0 game Counter-Strike 10 0.0 False \n", "1 game Team Fortress Classic 20 0.0 False \n", "2 game Day of Defeat 30 0.0 False \n", "3 game Deathmatch Classic 40 0.0 False \n", "4 game Half-Life: Opposing Force 50 0.0 False \n", "\n", " controller_support dlc detailed_description \\\n", "0 NaN NaN Play the world's number 1 online action game. ... \n", "1 NaN NaN One of the most popular online action games of... \n", "2 NaN NaN Enlist in an intense brand of Axis vs. Allied ... \n", "3 NaN NaN Enjoy fast-paced multiplayer gaming with Death... \n", "4 NaN NaN Return to the Black Mesa Research Facility as ... \n", "\n", " about_the_game \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " short_description fullgame \\\n", "0 Play the world's number 1 online action game. ... NaN \n", "1 One of the most popular online action games of... NaN \n", "2 Enlist in an intense brand of Axis vs. Allied ... NaN \n", "3 Enjoy fast-paced multiplayer gaming with Death... NaN \n", "4 Return to the Black Mesa Research Facility as ... NaN \n", "\n", " supported_languages \\\n", "0 English*, French*Minimum:Minimum:Minimum:Minimum:Minimum: threshold]\n", "\n", "print('Columns to drop: {}'.format(list(drop_rows)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then look at the type and name columns, thinning out our data set a little by removing apps without either.\n", "\n", "In the data collection stage, if no information was returned from an app's API request, only the name and appid was stored. We can easily identify these apps by looking at rows with missing data in the `type` column, as all other apps have a value here. As seen below, these rows contain no other information so we can safely remove them." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows to remove: 149\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typenamesteam_appidrequired_ageis_freecontroller_supportdlcdetailed_descriptionabout_the_gameshort_descriptionfullgamesupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementslegal_noticedrm_noticeext_user_account_noticedeveloperspublishersdemosprice_overviewpackagespackage_groupsplatformsmetacriticreviewscategoriesgenresscreenshotsmoviesrecommendationsachievementsrelease_datesupport_infobackgroundcontent_descriptors
26NaNHalf-Life: Opposing Force852NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
147NaNHalf-Life: Opposing Force4330NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
256NaNHalf-Life: Opposing Force8740NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " type name steam_appid required_age is_free \\\n", "26 NaN Half-Life: Opposing Force 852 NaN NaN \n", "147 NaN Half-Life: Opposing Force 4330 NaN NaN \n", "256 NaN Half-Life: Opposing Force 8740 NaN NaN \n", "\n", " controller_support dlc detailed_description about_the_game \\\n", "26 NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN \n", "\n", " short_description fullgame supported_languages header_image website \\\n", "26 NaN NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN NaN \n", "\n", " pc_requirements mac_requirements linux_requirements legal_notice \\\n", "26 NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN \n", "\n", " drm_notice ext_user_account_notice developers publishers demos \\\n", "26 NaN NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN NaN \n", "\n", " price_overview packages package_groups platforms metacritic reviews \\\n", "26 NaN NaN NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN NaN NaN \n", "\n", " categories genres screenshots movies recommendations achievements \\\n", "26 NaN NaN NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN NaN NaN \n", "\n", " release_date support_info background content_descriptors \n", "26 NaN NaN NaN NaN \n", "147 NaN NaN NaN NaN \n", "256 NaN NaN NaN NaN " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])\n", "\n", "# preview rows with missing type data\n", "raw_steam_data[raw_steam_data['type'].isnull()].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the counts of unique values in a column by using the pandas [Series.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method. By checking the value counts we see that all rows either have a missing value, as noted above, or 'game' in the `type` column.\n", "\n", "Once the null rows are removed, we'll be able to remove this column as it doesn't provide us with any more useful information." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "game 29086\n", "NaN 149\n", "Name: type, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steam_data['type'].value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taking a look now at the name column, we can check for rows which either have a null value or a string containing 'none'. This isn't recognised as a null value but should be treated as such.\n", "\n", "We achieve this by combining boolean filters using brackets and a vertical bar, `|`, symbolising a logical 'or'.\n", "\n", "There are only four rows which match these criteria, and they appear to be missing a lot of data in other columns so we should definitely remove them." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typenamesteam_appidrequired_ageis_freecontroller_supportdlcdetailed_descriptionabout_the_gameshort_descriptionfullgamesupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementslegal_noticedrm_noticeext_user_account_noticedeveloperspublishersdemosprice_overviewpackagespackage_groupsplatformsmetacriticreviewscategoriesgenresscreenshotsmoviesrecommendationsachievementsrelease_datesupport_infobackgroundcontent_descriptors
4918gamenone3398600.0FalseNaNNaNNaNNaNNaNNaNEnglish<strong>*</strong><br><strong>*</strong...https://steamcdn-a.akamaihd.net/steam/apps/339...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...[][]NaNNaNNaNNaN['']NaNNaNNaN[]{'windows': True, 'mac': False, 'linux': False}NaNNaN[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '25', 'description': 'Adventure'}, {'i...NaNNaNNaN{'total': 3, 'highlighted': [{'name': 'Master ...{'coming_soon': False, 'date': '27 Feb, 2015'}{'url': '', 'email': ''}NaN{'ids': [], 'notes': None}
6779gamenone3850200.0FalseNaNNaN- discontinued - (please remove)- discontinued - (please remove)- discontinued - (please remove)NaNEnglish, French, Italian, German, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/385...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...NaNNaNNaN['none']['']NaNNaNNaN[]{'windows': True, 'mac': True, 'linux': True}NaNNaN[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '4', 'description': 'Casual'}, {'id': ...NaNNaNNaN{'total': 0}{'coming_soon': False, 'date': '4 Nov, 2015'}{'url': '', 'email': ''}NaN{'ids': [], 'notes': None}
7235gameNaN3964200.0TrueNaNNaNSpookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。 村...NaNNaNhttps://steamcdn-a.akamaihd.net/steam/apps/396...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...[][]NaNNaNNaNNaN['']NaNNaNNaN[]{'windows': True, 'mac': False, 'linux': False}NaNNaNNaNNaNNaNNaNNaN{'total': 0}{'coming_soon': False, 'date': '1 Nov, 2016'}{'url': '', 'email': ''}NaN{'ids': [], 'notes': None}
7350gamenone3989700.0FalseNaNNaNNaNNaNNaNNaNEnglish<strong>*</strong><br><strong>*</strong...https://steamcdn-a.akamaihd.net/steam/apps/398...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...NaNNaNNaN['none']['none'][{'appid': 516340, 'description': ''}]NaNNaN[]{'windows': True, 'mac': True, 'linux': True}NaNNaN[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '25', 'description': 'Adventure'}, {'i...NaNNaNNaN{'total': 35, 'highlighted': [{'name': \"They'v...{'coming_soon': False, 'date': '5 Nov, 2015'}{'url': '', 'email': ''}NaN{'ids': [], 'notes': None}
\n", "
" ], "text/plain": [ " type name steam_appid required_age is_free controller_support dlc \\\n", "4918 game none 339860 0.0 False NaN NaN \n", "6779 game none 385020 0.0 False NaN NaN \n", "7235 game NaN 396420 0.0 True NaN NaN \n", "7350 game none 398970 0.0 False NaN NaN \n", "\n", " detailed_description \\\n", "4918 NaN \n", "6779 - discontinued - (please remove) \n", "7235 Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。*

**
*Minimum:
    Minimum:
      Minimum:
        Minimum:
          Minimum:
            Minimum:
              Minimum:
                Minimum:
                  \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                  typenamesteam_appidrequired_ageis_freecontroller_supportdlcdetailed_descriptionabout_the_gameshort_descriptionfullgamesupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementslegal_noticedrm_noticeext_user_account_noticedeveloperspublishersdemosprice_overviewpackagespackage_groupsplatformsmetacriticreviewscategoriesgenresscreenshotsmoviesrecommendationsachievementsrelease_datesupport_infobackgroundcontent_descriptors
                  31gameSiN Episodes: Emergence13000.0FalseNaNNaNYou are John Blade, commander of HardCorps, an...You are John Blade, commander of HardCorps, an...You are John Blade, commander of HardCorps, an...NaNEnglish, Russian, Frenchhttps://steamcdn-a.akamaihd.net/steam/apps/130...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...[][]NaNNaNNaN['Ritual Entertainment']['Ritual Entertainment']NaN{'currency': 'GBP', 'initial': 719, 'final': 7...[70][{'name': 'default', 'title': 'Buy SiN Episode...{'windows': True, 'mac': False, 'linux': False}{'score': 75, 'url': 'https://www.metacritic.c...NaN[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 265}{'total': 0}{'coming_soon': False, 'date': '10 May, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/130...{'ids': [], 'notes': None}
                  32gameSiN Episodes: Emergence13000.0FalseNaNNaNYou are John Blade, commander of HardCorps, an...You are John Blade, commander of HardCorps, an...You are John Blade, commander of HardCorps, an...NaNEnglish, Russian, Frenchhttps://steamcdn-a.akamaihd.net/steam/apps/130...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...[][]NaNNaNNaN['Ritual Entertainment']['Ritual Entertainment']NaN{'currency': 'GBP', 'initial': 719, 'final': 7...[70][{'name': 'default', 'title': 'Buy SiN Episode...{'windows': True, 'mac': False, 'linux': False}{'score': 75, 'url': 'https://www.metacritic.c...NaN[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 265}{'total': 0}{'coming_soon': False, 'date': '10 May, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/130...{'ids': [], 'notes': None}
                  356gameJagged Alliance 2 Gold16200.0FalseNaNNaN<p>The small country of Arulco has been taken ...<p>The small country of Arulco has been taken ...The small country of Arulco has been taken ove...NaNEnglishhttps://steamcdn-a.akamaihd.net/steam/apps/162...http://www.jaggedalliance2.com/{'minimum': '<p><strong>Minimum Configuration:...[][]NaNNaNNaN['Strategy First']['Strategy First']NaN{'currency': 'GBP', 'initial': 1499, 'final': ...[94][{'name': 'default', 'title': 'Buy Jagged Alli...{'windows': True, 'mac': False, 'linux': False}NaNNaN[{'id': 2, 'description': 'Single-player'}][{'id': '2', 'description': 'Strategy'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaNNaN{'total': 0}{'coming_soon': False, 'date': '6 Jul, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/162...{'ids': [], 'notes': None}
                  \n", "" ], "text/plain": [ " type name steam_appid required_age is_free \\\n", "31 game SiN Episodes: Emergence 1300 0.0 False \n", "32 game SiN Episodes: Emergence 1300 0.0 False \n", "356 game Jagged Alliance 2 Gold 1620 0.0 False \n", "\n", " controller_support dlc \\\n", "31 NaN NaN \n", "32 NaN NaN \n", "356 NaN NaN \n", "\n", " detailed_description \\\n", "31 You are John Blade, commander of HardCorps, an... \n", "32 You are John Blade, commander of HardCorps, an... \n", "356

                  The small country of Arulco has been taken ... \n", "\n", " about_the_game \\\n", "31 You are John Blade, commander of HardCorps, an... \n", "32 You are John Blade, commander of HardCorps, an... \n", "356

                  The small country of Arulco has been taken ... \n", "\n", " short_description fullgame \\\n", "31 You are John Blade, commander of HardCorps, an... NaN \n", "32 You are John Blade, commander of HardCorps, an... NaN \n", "356 The small country of Arulco has been taken ove... NaN \n", "\n", " supported_languages \\\n", "31 English, Russian, French \n", "32 English, Russian, French \n", "356 English \n", "\n", " header_image \\\n", "31 https://steamcdn-a.akamaihd.net/steam/apps/130... \n", "32 https://steamcdn-a.akamaihd.net/steam/apps/130... \n", "356 https://steamcdn-a.akamaihd.net/steam/apps/162... \n", "\n", " website \\\n", "31 NaN \n", "32 NaN \n", "356 http://www.jaggedalliance2.com/ \n", "\n", " pc_requirements mac_requirements \\\n", "31 {'minimum': '\\r\\n\\t\\t\\t

                  Minimum:Minimum:Minimum Configuration:... [] \n", "\n", " linux_requirements legal_notice drm_notice ext_user_account_notice \\\n", "31 [] NaN NaN NaN \n", "32 [] NaN NaN NaN \n", "356 [] NaN NaN NaN \n", "\n", " developers publishers demos \\\n", "31 ['Ritual Entertainment'] ['Ritual Entertainment'] NaN \n", "32 ['Ritual Entertainment'] ['Ritual Entertainment'] NaN \n", "356 ['Strategy First'] ['Strategy First'] NaN \n", "\n", " price_overview packages \\\n", "31 {'currency': 'GBP', 'initial': 719, 'final': 7... [70] \n", "32 {'currency': 'GBP', 'initial': 719, 'final': 7... [70] \n", "356 {'currency': 'GBP', 'initial': 1499, 'final': ... [94] \n", "\n", " package_groups \\\n", "31 [{'name': 'default', 'title': 'Buy SiN Episode... \n", "32 [{'name': 'default', 'title': 'Buy SiN Episode... \n", "356 [{'name': 'default', 'title': 'Buy Jagged Alli... \n", "\n", " platforms \\\n", "31 {'windows': True, 'mac': False, 'linux': False} \n", "32 {'windows': True, 'mac': False, 'linux': False} \n", "356 {'windows': True, 'mac': False, 'linux': False} \n", "\n", " metacritic reviews \\\n", "31 {'score': 75, 'url': 'https://www.metacritic.c... NaN \n", "32 {'score': 75, 'url': 'https://www.metacritic.c... NaN \n", "356 NaN NaN \n", "\n", " categories \\\n", "31 [{'id': 2, 'description': 'Single-player'}, {'... \n", "32 [{'id': 2, 'description': 'Single-player'}, {'... \n", "356 [{'id': 2, 'description': 'Single-player'}] \n", "\n", " genres \\\n", "31 [{'id': '1', 'description': 'Action'}] \n", "32 [{'id': '1', 'description': 'Action'}] \n", "356 [{'id': '2', 'description': 'Strategy'}] \n", "\n", " screenshots movies recommendations \\\n", "31 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 265} \n", "32 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 265} \n", "356 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN NaN \n", "\n", " achievements release_date \\\n", "31 {'total': 0} {'coming_soon': False, 'date': '10 May, 2006'} \n", "32 {'total': 0} {'coming_soon': False, 'date': '10 May, 2006'} \n", "356 {'total': 0} {'coming_soon': False, 'date': '6 Jul, 2006'} \n", "\n", " support_info \\\n", "31 {'url': '', 'email': ''} \n", "32 {'url': '', 'email': ''} \n", "356 {'url': '', 'email': ''} \n", "\n", " background \\\n", "31 https://steamcdn-a.akamaihd.net/steam/apps/130... \n", "32 https://steamcdn-a.akamaihd.net/steam/apps/130... \n", "356 https://steamcdn-a.akamaihd.net/steam/apps/162... \n", "\n", " content_descriptors \n", "31 {'ids': [], 'notes': None} \n", "32 {'ids': [], 'notes': None} \n", "356 {'ids': [], 'notes': None} " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]\n", "\n", "print('Duplicate rows to remove:', duplicate_rows.shape[0])\n", "\n", "duplicate_rows.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're now ready to define functions implementing the filters we just looked at. This allows us to easily make changes in the future if we want to alter how the columns are handled, or want to choose a different cut-off threshold for getting rid of columns, for example. \n", "\n", "We also define a general purpose `process` function which will run all the processing functions we create on the data set. This will allow us to slowly add to it as we develop more functions and ensure we're cleaning the correct dataframe.\n", "\n", "Finally we run this function on the raw data, inspecting the first few rows and viewing how many rows and columns have been removed." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(29235, 39)\n", "(29075, 28)\n" ] }, { "data": { "text/html": [ "

                  \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                  namesteam_appidrequired_ageis_freedetailed_descriptionabout_the_gameshort_descriptionsupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsdeveloperspublishersprice_overviewpackagespackage_groupsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundcontent_descriptors
                  0Counter-Strike100.0FalsePlay the world's number 1 online action game. ...Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...English<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/10/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']{'currency': 'GBP', 'initial': 719, 'final': 7...[7][{'name': 'default', 'title': 'Buy Counter-Str...{'windows': True, 'mac': True, 'linux': True}[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Nov, 2000'}{'url': 'http://steamcommunity.com/app/10', 'e...https://steamcdn-a.akamaihd.net/steam/apps/10/...{'ids': [2, 5], 'notes': 'Includes intense vio...
                  1Team Fortress Classic200.0FalseOne of the most popular online action games of...One of the most popular online action games of...One of the most popular online action games of...English, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/20/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']{'currency': 'GBP', 'initial': 399, 'final': 3...[29][{'name': 'default', 'title': 'Buy Team Fortre...{'windows': True, 'mac': True, 'linux': True}[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Apr, 1999'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/20/...{'ids': [2, 5], 'notes': 'Includes intense vio...
                  2Day of Defeat300.0FalseEnlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...English, French, German, Italian, Spanish - Spainhttps://steamcdn-a.akamaihd.net/steam/apps/30/...http://www.dayofdefeat.com/{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']{'currency': 'GBP', 'initial': 399, 'final': 3...[30][{'name': 'default', 'title': 'Buy Day of Defe...{'windows': True, 'mac': True, 'linux': True}[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 May, 2003'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/30/...{'ids': [], 'notes': None}
                  3Deathmatch Classic400.0FalseEnjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...English, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/40/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']{'currency': 'GBP', 'initial': 399, 'final': 3...[31][{'name': 'default', 'title': 'Buy Deathmatch ...{'windows': True, 'mac': True, 'linux': True}[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Jun, 2001'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/40/...{'ids': [], 'notes': None}
                  4Half-Life: Opposing Force500.0FalseReturn to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...English, French, German, Koreanhttps://steamcdn-a.akamaihd.net/steam/apps/50/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Gearbox Software']['Valve']{'currency': 'GBP', 'initial': 399, 'final': 3...[32][{'name': 'default', 'title': 'Buy Half-Life: ...{'windows': True, 'mac': True, 'linux': True}[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Nov, 1999'}{'url': 'https://help.steampowered.com', 'emai...https://steamcdn-a.akamaihd.net/steam/apps/50/...{'ids': [], 'notes': None}
                  \n", "
                  " ], "text/plain": [ " name steam_appid required_age is_free \\\n", "0 Counter-Strike 10 0.0 False \n", "1 Team Fortress Classic 20 0.0 False \n", "2 Day of Defeat 30 0.0 False \n", "3 Deathmatch Classic 40 0.0 False \n", "4 Half-Life: Opposing Force 50 0.0 False \n", "\n", " detailed_description \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " about_the_game \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " short_description \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " supported_languages \\\n", "0 English*, French*Minimum:Minimum:Minimum:Minimum:Minimum:\n" ] }, { "data": { "text/plain": [ "\"{'windows': True, 'mac': True, 'linux': True}\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "platforms_first_row = age_df['platforms'].iloc[0]\n", "\n", "print(type(platforms_first_row))\n", "\n", "platforms_first_row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will allow us to evaluate the string, and then index into it as a \n", "dictionary." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "{'windows': True, 'mac': True, 'linux': True}\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eval_first_row = literal_eval(platforms_first_row)\n", "\n", "print(type(eval_first_row))\n", "print(eval_first_row)\n", "\n", "eval_first_row['windows']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need to check for missing values, but fortunately it appears there aren't any in this column." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age_df['platforms'].isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As for formatting the output, let's keep things simple and return a list of supported platforms. Another option would be to create a column for each platform with a corresponding True/False value for each row, something we can keep in mind for the future.\n", "\n", "We can create the desired list by calling the [str.join()](https://docs.python.org/3/library/stdtypes.html#str.join) method on a string, and passing an iterable into the function. In this case, we can pass the keys of the row, as seen below." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'windows;mac;linux'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create string of keys, joined on a semi-colon\n", "';'.join(eval_first_row.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need to inspect the value of each row, and only end up with the keys that have a value of `True`, skipping those which are `False`. The example below shows how we can do this using a list comprehension." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['windows', 'mac']\n" ] }, { "data": { "text/plain": [ "'windows;mac'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "platforms = {'windows': True, 'mac': True, 'linux': False}\n", "\n", "# list comprehension\n", "print([x for x in platforms.keys() if platforms[x]])\n", "\n", "# using list comprehension in join\n", "';'.join(x for x in platforms.keys() if platforms[x])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting this all together, we can use the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to process the rows." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "windows 19887\n", "windows;mac;linux 4851\n", "windows;mac 3687\n", "windows;linux 643\n", "mac 5\n", "mac;linux 1\n", "linux 1\n", "Name: platforms, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_platforms(df):\n", " \"\"\"Split platforms column into separate boolean columns for each platform.\"\"\"\n", " # evaluate values in platforms column, so can index into dictionaries\n", " df = df.copy()\n", " \n", " def parse_platforms(x):\n", " \n", " d = literal_eval(x)\n", " \n", " return ';'.join(platform for platform in d.keys() if d[platform])\n", " \n", " df['platforms'] = df['platforms'].apply(parse_platforms)\n", " \n", " return df\n", "\n", "\n", "platforms_df = process_platforms(age_df)\n", "platforms_df['platforms'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Price\n", "\n", "Now we have built up some intuition around how to deal with data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.\n", "\n", "First let's check how many null values there are in `price_overview`." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3559" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "platforms_df['price_overview'].isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions let's check if there any rows with `is_free` marked as True and null values in the `price_overview` column." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2713" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "free_and_null_price = platforms_df[(platforms_df['is_free']) & (platforms_df['price_overview'].isnull())]\n", "\n", "free_and_null_price.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out this accounts for most of the missing values in the `price_overview` column, meaning we can handle these by setting the final price as 0. This makes intuitive sense - free games wouldn't have a price.\n", "\n", "This means that there are almost 850 rows which aren't free but have null values in the `price_overview` column. Let's investigate those next." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                  \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                  namesteam_appidrequired_ageis_freedetailed_descriptionabout_the_gameshort_descriptionsupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsdeveloperspublishersprice_overviewpackagespackage_groupsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundcontent_descriptors
                  63The Ship: Single Player24200FalseFor PC gamers who enjoy multiplayer games with...For PC gamers who enjoy multiplayer games with...The Ship is a murder mystery alternative to tr...English, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/242...http://www.blazinggriffin.com/games/the-ship-m...{'minimum': '<strong>Minimum:</strong> 1.8 GHz...[][]['Outerlight Ltd.']['Blazing Griffin Ltd.']NaN[56669][{'name': 'default', 'title': 'Buy The Ship: S...windows[{'id': 2, 'description': 'Single-player'}][{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 2035597, 'name': 'the Ship: Intro', '...{'total': 0}{'coming_soon': False, 'date': '20 Nov, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/242...{'ids': [], 'notes': None}
                  75RollerCoaster Tycoon® 3: Platinum27000FalseRollercoaster Tycoon 3 Platinum combines the e...Rollercoaster Tycoon 3 Platinum combines the e...Rollercoaster Tycoon 3 Platinum combines the e...English, French, Italian, German, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/270...http://www.atari.com/rollercoastertycoon/us/in...{'minimum': '<strong>Minimum: </strong><br>\\t\\...{'minimum': '<ul class=\"bb_ul\"><li><strong>OS:...[]['Frontier', 'Aspyr (Mac)']['Atari', 'Aspyr (Mac)']NaNNaN[]windows;mac[{'id': 2, 'description': 'Single-player'}][{'id': '28', 'description': 'Simulation'}, {'...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '12 Mar, 2008'}{'url': 'http://www.atari.com/support/atari', ...https://steamcdn-a.akamaihd.net/steam/apps/270...{'ids': [], 'notes': None}
                  220BioShock™76700False<h1>Special Offer</h1><p>Buying BioShock™ also...BioShock is a shooter unlike any you've ever p...BioShock is a shooter unlike any you've ever p...English<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/767...http://www.BioShockGame.com{'minimum': '<h2 class=\"bb_tag\"><strong>Minimu...{'minimum': 'Please See BioShock Remastered'}[]['2K Boston', '2K Australia']['2K']NaN[451, 127633][{'name': 'default', 'title': 'Buy BioShock™',...windows[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '21 Aug, 2007'}{'url': 'support.2k.com', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/767...{'ids': [], 'notes': None}
                  234Sam & Max 101: Culture Shock82000False<strong>Sam &amp; Max: Episode 1 - Culture Sho...<strong>Sam &amp; Max: Episode 1 - Culture Sho...Sam &amp; Max: Episode 1 - Culture Shock - The...English, French, German, Italianhttps://steamcdn-a.akamaihd.net/steam/apps/820...http://store.steampowered.com/app/901660/{'minimum': 'Windows XP or Vista, 1.5GHz proce...[][]['Telltale Games']['Telltale Games']NaN[357, 539][{'name': 'default', 'title': 'Buy Sam & Max 1...windows[{'id': 2, 'description': 'Single-player'}][{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '15 Jun, 2007'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/820...{'ids': [], 'notes': None}
                  235Sam & Max 102: Situation: Comedy82100False<strong>Sam &amp; Max: Episode 2 - Situation: ...<strong>Sam &amp; Max: Episode 2 - Situation: ...Sam &amp; Max: Episode 2 - Situation: Comedy -...English, German, Italianhttps://steamcdn-a.akamaihd.net/steam/apps/821...http://store.steampowered.com/app/901660/{'minimum': 'Windows XP or Vista, 1.5GHz proce...[][]['Telltale Games']['Telltale Games']NaN[358, 539][{'name': 'default', 'title': 'Buy Sam & Max 1...windows[{'id': 2, 'description': 'Single-player'}][{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '15 Jun, 2007'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/821...{'ids': [], 'notes': None}
                  \n", "
                  " ], "text/plain": [ " name steam_appid required_age is_free \\\n", "63 The Ship: Single Player 2420 0 False \n", "75 RollerCoaster Tycoon® 3: Platinum 2700 0 False \n", "220 BioShock™ 7670 0 False \n", "234 Sam & Max 101: Culture Shock 8200 0 False \n", "235 Sam & Max 102: Situation: Comedy 8210 0 False \n", "\n", " detailed_description \\\n", "63 For PC gamers who enjoy multiplayer games with... \n", "75 Rollercoaster Tycoon 3 Platinum combines the e... \n", "220

                  Special Offer

                  Buying BioShock™ also... \n", "234 Sam & Max: Episode 1 - Culture Sho... \n", "235 Sam & Max: Episode 2 - Situation: ... \n", "\n", " about_the_game \\\n", "63 For PC gamers who enjoy multiplayer games with... \n", "75 Rollercoaster Tycoon 3 Platinum combines the e... \n", "220 BioShock is a shooter unlike any you've ever p... \n", "234 Sam & Max: Episode 1 - Culture Sho... \n", "235 Sam & Max: Episode 2 - Situation: ... \n", "\n", " short_description \\\n", "63 The Ship is a murder mystery alternative to tr... \n", "75 Rollercoaster Tycoon 3 Platinum combines the e... \n", "220 BioShock is a shooter unlike any you've ever p... \n", "234 Sam & Max: Episode 1 - Culture Shock - The... \n", "235 Sam & Max: Episode 2 - Situation: Comedy -... \n", "\n", " supported_languages \\\n", "63 English, French, German, Italian, Spanish - Sp... \n", "75 English, French, Italian, German, Spanish - Sp... \n", "220 English*, French*Minimum: 1.8 GHz... \n", "75 {'minimum': 'Minimum:
                  \\t\\... \n", "220 {'minimum': '

                  Minimu... \n", "234 {'minimum': 'Windows XP or Vista, 1.5GHz proce... \n", "235 {'minimum': 'Windows XP or Vista, 1.5GHz proce... \n", "\n", " mac_requirements linux_requirements \\\n", "63 [] [] \n", "75 {'minimum': '
                  • OS:... [] \n", "220 {'minimum': 'Please See BioShock Remastered'} [] \n", "234 [] [] \n", "235 [] [] \n", "\n", " developers publishers price_overview \\\n", "63 ['Outerlight Ltd.'] ['Blazing Griffin Ltd.'] NaN \n", "75 ['Frontier', 'Aspyr (Mac)'] ['Atari', 'Aspyr (Mac)'] NaN \n", "220 ['2K Boston', '2K Australia'] ['2K'] NaN \n", "234 ['Telltale Games'] ['Telltale Games'] NaN \n", "235 ['Telltale Games'] ['Telltale Games'] NaN \n", "\n", " packages package_groups \\\n", "63 [56669] [{'name': 'default', 'title': 'Buy The Ship: S... \n", "75 NaN [] \n", "220 [451, 127633] [{'name': 'default', 'title': 'Buy BioShock™',... \n", "234 [357, 539] [{'name': 'default', 'title': 'Buy Sam & Max 1... \n", "235 [358, 539] [{'name': 'default', 'title': 'Buy Sam & Max 1... \n", "\n", " platforms categories \\\n", "63 windows [{'id': 2, 'description': 'Single-player'}] \n", "75 windows;mac [{'id': 2, 'description': 'Single-player'}] \n", "220 windows [{'id': 2, 'description': 'Single-player'}, {'... \n", "234 windows [{'id': 2, 'description': 'Single-player'}] \n", "235 windows [{'id': 2, 'description': 'Single-player'}] \n", "\n", " genres \\\n", "63 [{'id': '1', 'description': 'Action'}, {'id': ... \n", "75 [{'id': '28', 'description': 'Simulation'}, {'... \n", "220 [{'id': '1', 'description': 'Action'}, {'id': ... \n", "234 [{'id': '1', 'description': 'Action'}, {'id': ... \n", "235 [{'id': '1', 'description': 'Action'}, {'id': ... \n", "\n", " screenshots \\\n", "63 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "75 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "220 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "234 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "235 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "\n", " movies achievements \\\n", "63 [{'id': 2035597, 'name': 'the Ship: Intro', '... {'total': 0} \n", "75 NaN {'total': 0} \n", "220 NaN {'total': 0} \n", "234 NaN {'total': 0} \n", "235 NaN {'total': 0} \n", "\n", " release_date \\\n", "63 {'coming_soon': False, 'date': '20 Nov, 2006'} \n", "75 {'coming_soon': False, 'date': '12 Mar, 2008'} \n", "220 {'coming_soon': False, 'date': '21 Aug, 2007'} \n", "234 {'coming_soon': False, 'date': '15 Jun, 2007'} \n", "235 {'coming_soon': False, 'date': '15 Jun, 2007'} \n", "\n", " support_info \\\n", "63 {'url': '', 'email': ''} \n", "75 {'url': 'http://www.atari.com/support/atari', ... \n", "220 {'url': 'support.2k.com', 'email': ''} \n", "234 {'url': '', 'email': ''} \n", "235 {'url': '', 'email': ''} \n", "\n", " background \\\n", "63 https://steamcdn-a.akamaihd.net/steam/apps/242... \n", "75 https://steamcdn-a.akamaihd.net/steam/apps/270... \n", "220 https://steamcdn-a.akamaihd.net/steam/apps/767... \n", "234 https://steamcdn-a.akamaihd.net/steam/apps/820... \n", "235 https://steamcdn-a.akamaihd.net/steam/apps/821... \n", "\n", " content_descriptors \n", "63 {'ids': [], 'notes': None} \n", "75 {'ids': [], 'notes': None} \n", "220 {'ids': [], 'notes': None} \n", "234 {'ids': [], 'notes': None} \n", "235 {'ids': [], 'notes': None} " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "not_free_and_null_price = platforms_df[(platforms_df['is_free'] == False) & (platforms_df['price_overview'].isnull())]\n", "\n", "not_free_and_null_price.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first few rows contain some big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.\n", "\n", "We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:\n", "\n", " https://store.steampowered.com/app/[steam_appid]\n", "\n", "This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Ship: Single Player: https://store.steampowered.com/app/2420\n", "RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700\n", "BioShock™: https://store.steampowered.com/app/7670\n", "Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200\n", "Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210\n" ] } ], "source": [ "def print_steam_links(df):\n", " \"\"\"Print links to store page for apps in a dataframe.\"\"\"\n", " url_base = \"https://store.steampowered.com/app/\"\n", " \n", " for i, row in df.iterrows():\n", " appid = row['steam_appid']\n", " name = row['name']\n", " \n", " print(name + ':', url_base + str(appid))\n", " \n", "\n", "print_steam_links(not_free_and_null_price[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For these games we can conclude that:\n", "\n", "- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party\n", "- RollerCoaster Tycoon 3: Platinum has been removed from steam (and another game website: [GOG](https://www.gog.com/)) \n", " - \"A spokesperson for GOG told Eurogamer it pulled the game \"due to expiring licensing rights\", and stressed it'll talk with \"new distribution rights holders\" to bring the game back as soon as possible.\" Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)\n", "- BioShock has been replaced by BioShock Remastered\n", "- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column\n", "\n", "So we have a couple of options here. We could just drop these rows, we could try to figure out the price based on the `package_groups` column, or we could leave them for now and return to them later. We'll leave them for now, handling the two price columns, then take a look at the packages next. It may also be that some of these rows are removed later in the cleaning process for other reasons.\n", "\n", "If we want to find rows similar to these and deal with each case individually, we could use the `.str.contains()` method, as seen below." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                    namesteam_appidrequired_ageis_freedetailed_descriptionabout_the_gameshort_descriptionsupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsdeveloperspublishersprice_overviewpackagespackage_groupsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundcontent_descriptors
                    220BioShock™76700False<h1>Special Offer</h1><p>Buying BioShock™ also...BioShock is a shooter unlike any you've ever p...BioShock is a shooter unlike any you've ever p...English<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/767...http://www.BioShockGame.com{'minimum': '<h2 class=\"bb_tag\"><strong>Minimu...{'minimum': 'Please See BioShock Remastered'}[]['2K Boston', '2K Australia']['2K']NaN[451, 127633][{'name': 'default', 'title': 'Buy BioShock™',...windows[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '21 Aug, 2007'}{'url': 'support.2k.com', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/767...{'ids': [], 'notes': None}
                    7734BioShock™ Remastered40971018False<h1>Special Offer</h1><p>Buying BioShock™ Rema...BioShock is a shooter unlike any you've ever p...BioShock is a shooter unlike any other, loaded...English<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/409...http://www.BioShockGame.com{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...['2K Boston', '2K Australia', 'Blind Squirrel'...['2K', 'Feral Interactive (Mac)']{'currency': 'GBP', 'initial': 999, 'final': 9...[451, 127633][{'name': 'default', 'title': 'Buy BioShock™ R...windows;mac[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 65, 'highlighted': [{'name': 'Comple...{'coming_soon': False, 'date': '15 Sep, 2016'}{'url': 'support.2k.com', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/409...{'ids': [], 'notes': None}
                    7735BioShock™ 2 Remastered40972018False<h1>Special Offer</h1><p>Buying BioShock 2™ Re...BioShock 2 provides players with the perfect b...In BioShock 2, you step into the boots of the ...English<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/409...http://www.bioshockgame.com/{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...['2K Marin', '2K China', 'Digital Extremes', '...['2K']{'currency': 'GBP', 'initial': 1399, 'final': ...[81419, 127633][{'name': 'default', 'title': 'Buy BioShock™ 2...windows[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 53, 'highlighted': [{'name': \"Daddy'...{'coming_soon': False, 'date': '15 Sep, 2016'}{'url': 'support.2k.com', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/409...{'ids': [5], 'notes': None}
                    \n", "
                    " ], "text/plain": [ " name steam_appid required_age is_free \\\n", "220 BioShock™ 7670 0 False \n", "7734 BioShock™ Remastered 409710 18 False \n", "7735 BioShock™ 2 Remastered 409720 18 False \n", "\n", " detailed_description \\\n", "220

                    Special Offer

                    Buying BioShock™ also... \n", "7734

                    Special Offer

                    Buying BioShock™ Rema... \n", "7735

                    Special Offer

                    Buying BioShock 2™ Re... \n", "\n", " about_the_game \\\n", "220 BioShock is a shooter unlike any you've ever p... \n", "7734 BioShock is a shooter unlike any you've ever p... \n", "7735 BioShock 2 provides players with the perfect b... \n", "\n", " short_description \\\n", "220 BioShock is a shooter unlike any you've ever p... \n", "7734 BioShock is a shooter unlike any other, loaded... \n", "7735 In BioShock 2, you step into the boots of the ... \n", "\n", " supported_languages \\\n", "220 English*, French**, French**, French*Minimu... \n", "7734 {'minimum': 'Minimum:

                      Minimum:
                        Minimum:
                          Minimum:
                            Minimum:
                              Minimum:
                                \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                namecurrencyprice
                                0Counter-StrikeGBP719
                                1Team Fortress ClassicGBP399
                                2Day of DefeatGBP399
                                3Deathmatch ClassicGBP399
                                4Half-Life: Opposing ForceGBP399
                                \n", "" ], "text/plain": [ " name currency price\n", "0 Counter-Strike GBP 719\n", "1 Team Fortress Classic GBP 399\n", "2 Day of Defeat GBP 399\n", "3 Deathmatch Classic GBP 399\n", "4 Half-Life: Opposing Force GBP 399" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_price(df):\n", " df = df.copy()\n", " \n", " def parse_price(x):\n", " if x is not np.nan:\n", " return literal_eval(x)\n", " else:\n", " return {'currency': 'GBP', 'initial': -1}\n", " \n", " # evaluate as dictionary and set to -1 if missing\n", " df['price_overview'] = df['price_overview'].apply(parse_price)\n", " \n", " # Create columns from currency and initial values\n", " df['currency'] = df['price_overview'].apply(lambda x: x['currency'])\n", " df['price'] = df['price_overview'].apply(lambda x: x['initial'])\n", " \n", " # Set price of free games to 0\n", " df.loc[df['is_free'], 'price'] = 0\n", " \n", " return df\n", "\n", "price_data = process_price(platforms_df)[['name', 'currency', 'price']]\n", "price_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're almost finished, but let's check if any games don't have GBP listed as the currency." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                namecurrencyprice
                                991Robin Hood: The Legend of SherwoodUSD799
                                5767Assassin’s Creed® Chronicles: IndiaEUR999
                                27593Mortal Kombat 11USD5999
                                27995Pagan OnlineEUR2699
                                \n", "
                                " ], "text/plain": [ " name currency price\n", "991 Robin Hood: The Legend of Sherwood USD 799\n", "5767 Assassin’s Creed® Chronicles: India EUR 999\n", "27593 Mortal Kombat 11 USD 5999\n", "27995 Pagan Online EUR 2699" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "price_data[price_data['currency'] != 'GBP']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some reason there are four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows it's easier and safer to simply drop them.\n", "\n", "We can also divide the prices by 100 so they are displayed as floats in pounds." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
                                \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                nameprice
                                0Counter-Strike7.19
                                1Team Fortress Classic3.99
                                2Day of Defeat3.99
                                3Deathmatch Classic3.99
                                4Half-Life: Opposing Force3.99
                                \n", "
                                " ], "text/plain": [ " name price\n", "0 Counter-Strike 7.19\n", "1 Team Fortress Classic 3.99\n", "2 Day of Defeat 3.99\n", "3 Deathmatch Classic 3.99\n", "4 Half-Life: Opposing Force 3.99" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_price(df):\n", " \"\"\"Process price_overview column into formatted price column.\"\"\"\n", " df = df.copy()\n", " \n", " def parse_price(x):\n", " if x is not np.nan:\n", " return literal_eval(x)\n", " else:\n", " return {'currency': 'GBP', 'initial': -1}\n", " \n", " # evaluate as dictionary and set to -1 if missing\n", " df['price_overview'] = df['price_overview'].apply(parse_price)\n", " \n", " # create columns from currency and initial values\n", " df['currency'] = df['price_overview'].apply(lambda x: x['currency'])\n", " df['price'] = df['price_overview'].apply(lambda x: x['initial'])\n", " \n", " # set price of free games to 0\n", " df.loc[df['is_free'], 'price'] = 0\n", " \n", " # remove non-GBP rows\n", " df = df[df['currency'] == 'GBP']\n", " \n", " # change price to display in pounds (only applying to rows with a value greater than 0)\n", " df.loc[df['price'] > 0, 'price'] /= 100\n", " \n", " # remove columns no longer needed\n", " df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)\n", " \n", " return df\n", "\n", "\n", "price_df = process_price(platforms_df)\n", "price_df[['name', 'price']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Packages\n", "\n", "We can now take a look at the `packages` and `package_groups` columns to help decide what to do with rows that are missing price data. We're not incredibly interested in the columns themselves, as they don't appear to provide much new useful information, except which games come with others as part of a bundle." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
                                \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                steam_appidpackagespackage_groupsprice
                                010[7][{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}]7.19
                                120[29][{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]3.99
                                230[30][{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]3.99
                                \n", "
                                " ], "text/plain": [ " steam_appid packages \\\n", "0 10 [7] \n", "1 20 [29] \n", "2 30 [30] \n", "\n", " package_groups \\\n", "0 [{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}] \n", "1 [{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] \n", "2 [{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] \n", "\n", " price \n", "0 7.19 \n", "1 3.99 \n", "2 3.99 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# temporarily set a pandas option using with and option_context\n", "with pd.option_context(\"display.max_colwidth\", 500):\n", " display(price_df[['steam_appid', 'packages', 'package_groups', 'price']].head(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall we have 846 rows with missing price data, which we previously set to -1." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "846\n" ] } ], "source": [ "print(price_df[price_df['price'] == -1].shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can split these rows into two categories: those with `package_groups` data and those without.\n", "\n", "If we take a quick look at the `package_groups` column we see that there appear to be no null values. On closer inspection, we can find that rows without data are actually stored as empty lists." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Null counts: 0\n", "Empty list counts: 3353\n" ] } ], "source": [ "print('Null counts:', price_df['package_groups'].isnull().sum())\n", "print('Empty list counts:', price_df[price_df['package_groups'] == \"[]\"].shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a combination of filters, we can find out how many rows have both missing `price` and `package_group` data and investigate. We'll count the rows and print links to some of the store pages and look for patterns." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows: 799 \n", "\n", "First few rows:\n", "\n", "RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700\n", "Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520\n", "LUMINES™ Advance Pack: https://store.steampowered.com/app/11920\n", "Midnight Club 2: https://store.steampowered.com/app/12160\n", "Age of Booty™: https://store.steampowered.com/app/21600\n", "\n", "Last few rows:\n", "\n", "RoboVirus: https://store.steampowered.com/app/1001870\n", "soko loco deluxe: https://store.steampowered.com/app/1003730\n", "POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710\n", "The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600\n", "Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670\n" ] } ], "source": [ "missing_price_and_package = price_df[(price_df['price'] == -1) & (price_df['package_groups'] == \"[]\")]\n", "\n", "print('Number of rows:', missing_price_and_package.shape[0], '\\n')\n", "\n", "print('First few rows:\\n')\n", "print_steam_links(missing_price_and_package[:5])\n", "\n", "print('\\nLast few rows:\\n')\n", "print_steam_links(missing_price_and_package[-10:-5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the games - 799 of 846 - with missing price data fall into the above category. This probably means they can be safely removed.\n", "\n", "From following the links for the first few rows to the store page, it looks like they are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't yet been released and haven't had a price set. We'll take care of all the unreleased games when we clean the `release_date` column, but we can remove all of these apps here.\n", "\n", "Let's now take a look at the rows that have missing price data but do have `package_groups` data. We may be interested in keeping these rows and extracting price data from the package data." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows: 47 \n", "\n", "First few rows:\n", "\n", "The Ship: Single Player: https://store.steampowered.com/app/2420\n", "BioShock™: https://store.steampowered.com/app/7670\n", "Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200\n", "Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210\n", "Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220\n", "\n", "Last few rows:\n", "\n", "Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520\n", "Space Hulk: Deathwing: https://store.steampowered.com/app/298900\n", "7,62 Hard Life: https://store.steampowered.com/app/306290\n", "Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730\n", "Rad Rodgers: World One: https://store.steampowered.com/app/353580\n" ] } ], "source": [ "missing_price_have_package = price_df.loc[(price_df['price'] == -1) & (price_df['package_groups'] != \"[]\"), ['name', 'steam_appid', 'package_groups', 'price']]\n", "\n", "print('Number of rows:', missing_price_have_package.shape[0], '\\n')\n", "\n", "print('First few rows:\\n')\n", "print_steam_links(missing_price_have_package[:5])\n", "\n", "print('\\nLast few rows:\\n')\n", "print_steam_links(missing_price_have_package[-10:-5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at a selection of these rows, the games appear to be: supersceded by a newer release or remaster, part of a bigger bundle of games or episodic, or included by purchasing another game. \n", "\n", "Whilst we could extract prices from the `package_groups` data, the most sensible option seems to be removing these rows. There are only 47 rows this applies to, and any with a newer release will still have the re-release in the data.\n", "\n", "Since our logic interacts heavily with the price data we will update the `process_price` function rather than creating a new one." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
                                \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                namesteam_appidrequired_agedetailed_descriptionabout_the_gameshort_descriptionsupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsdeveloperspublishersplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundcontent_descriptorsprice
                                0Counter-Strike100Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...English<strong>*</strong>, French<strong>*</st...https://steamcdn-a.akamaihd.net/steam/apps/10/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']windows;mac;linux[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Nov, 2000'}{'url': 'http://steamcommunity.com/app/10', 'e...https://steamcdn-a.akamaihd.net/steam/apps/10/...{'ids': [2, 5], 'notes': 'Includes intense vio...7.19
                                1Team Fortress Classic200One of the most popular online action games of...One of the most popular online action games of...One of the most popular online action games of...English, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/20/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']windows;mac;linux[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Apr, 1999'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/20/...{'ids': [2, 5], 'notes': 'Includes intense vio...3.99
                                2Day of Defeat300Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...English, French, German, Italian, Spanish - Spainhttps://steamcdn-a.akamaihd.net/steam/apps/30/...http://www.dayofdefeat.com/{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']windows;mac;linux[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 May, 2003'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/30/...{'ids': [], 'notes': None}3.99
                                3Deathmatch Classic400Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...English, French, German, Italian, Spanish - Sp...https://steamcdn-a.akamaihd.net/steam/apps/40/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Valve']['Valve']windows;mac;linux[{'id': 1, 'description': 'Multi-player'}, {'i...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Jun, 2001'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/40/...{'ids': [], 'notes': None}3.99
                                4Half-Life: Opposing Force500Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...English, French, German, Koreanhttps://steamcdn-a.akamaihd.net/steam/apps/50/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...['Gearbox Software']['Valve']windows;mac;linux[{'id': 2, 'description': 'Single-player'}, {'...[{'id': '1', 'description': 'Action'}][{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN{'total': 0}{'coming_soon': False, 'date': '1 Nov, 1999'}{'url': 'https://help.steampowered.com', 'emai...https://steamcdn-a.akamaihd.net/steam/apps/50/...{'ids': [], 'notes': None}3.99
                                \n", "
                                " ], "text/plain": [ " name steam_appid required_age \\\n", "0 Counter-Strike 10 0 \n", "1 Team Fortress Classic 20 0 \n", "2 Day of Defeat 30 0 \n", "3 Deathmatch Classic 40 0 \n", "4 Half-Life: Opposing Force 50 0 \n", "\n", " detailed_description \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " about_the_game \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " short_description \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " supported_languages \\\n", "0 English*, French*Minimum:Minimum:Minimum:Minimum:Minimum:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                namesteam_appidrequired_agedetailed_descriptionabout_the_gameshort_descriptionsupported_languagesheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsdeveloperspublishersplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundcontent_descriptorsprice
                                4866Subsiege3386400<img src=\"https://steamcdn-a.akamaihd.net/stea...<img src=\"https://steamcdn-a.akamaihd.net/stea...Subsiege is an intense real-time tactic game w...NaNhttps://steamcdn-a.akamaihd.net/steam/apps/338...http://subsiege-game.com/{'minimum': '<strong>Minimum:</strong><br><ul ...[][]['Icebird Studios']['Icebird Studios']windowsNaNNaN[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256729398, 'name': 'Release Trailer', ...{'total': 0}{'coming_soon': False, 'date': '7 Sep, 2018'}{'url': 'http://subsiege-game.com/', 'email': ...https://steamcdn-a.akamaihd.net/steam/apps/338...{'ids': [], 'notes': None}17.89
                                14560MARS VR(全球使命VR)59656001.\\t4K level audio-visual experience <br />\\r\\...1.\\t4K level audio-visual experience <br />\\r\\...Welcome to 《Mars VR》. This is an immersive fir...NaNhttps://steamcdn-a.akamaihd.net/steam/apps/596...http://qqsm.zygames.com/{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...['Ying Pei Digital Technology Shanghai Co., Li...['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD']windows[{'id': 2, 'description': 'Single-player'}][{'id': '73', 'description': 'Violent'}, {'id'...[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256681371, 'name': 'marsvr', 'thumbnai...{'total': 0}{'coming_soon': False, 'date': '5 Apr, 2017'}{'url': 'http://www.zygames.com/contact', 'ema...https://steamcdn-a.akamaihd.net/steam/apps/596...{'ids': [], 'notes': None}1.99
                                16386Numberline 26549700NumberLine 2 is the continuation of the popula...NumberLine 2 is the continuation of the popula...NumberLine 2 is the continuation of the popula...NaNhttps://steamcdn-a.akamaihd.net/steam/apps/654...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...[][]['V34D4R', 'Egor Magurin']['Indovers Studio']windows[{'id': 2, 'description': 'Single-player'}][{'id': '4', 'description': 'Casual'}, {'id': ...[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256687192, 'name': 'Numberline 2 Trail...{'total': 60, 'highlighted': [{'name': '1st le...{'coming_soon': False, 'date': '14 Jul, 2017'}{'url': '', 'email': 'radaew.zhenya@yandex.ru'}https://steamcdn-a.akamaihd.net/steam/apps/654...{'ids': [], 'notes': None}1.59
                                26855SNUSE 2219480700<strong> Hey. My name is *&amp;#!$.<br>Today I...<strong> Hey. My name is *&amp;#!$.<br>Today I...Hey. My name is *&amp;#!$. Today I will tell y...NaNhttps://steamcdn-a.akamaihd.net/steam/apps/948...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...[][]['SNUSE GM']['SNUSE GM']windows[{'id': 2, 'description': 'Single-player'}][{'id': '25', 'description': 'Adventure'}, {'i...[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256745662, 'name': 'snuse', 'thumbnail...{'total': 0}{'coming_soon': False, 'date': '2 Apr, 2019'}{'url': 'vk.com/nilow_i', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/948...{'ids': [], 'notes': None}0.79
                                \n", "" ], "text/plain": [ " name steam_appid required_age \\\n", "4866 Subsiege 338640 0 \n", "14560 MARS VR(全球使命VR) 596560 0 \n", "16386 Numberline 2 654970 0 \n", "26855 SNUSE 221 948070 0 \n", "\n", " detailed_description \\\n", "4866 \\r\\... \n", "16386 NumberLine 2 is the continuation of the popula... \n", "26855 Hey. My name is *&#!$.
                                Today I... \n", "\n", " about_the_game \\\n", "4866 \\r\\... \n", "16386 NumberLine 2 is the continuation of the popula... \n", "26855 Hey. My name is *&#!$.
                                Today I... \n", "\n", " short_description supported_languages \\\n", "4866 Subsiege is an intense real-time tactic game w... NaN \n", "14560 Welcome to 《Mars VR》. This is an immersive fir... NaN \n", "16386 NumberLine 2 is the continuation of the popula... NaN \n", "26855 Hey. My name is *&#!$. Today I will tell y... NaN \n", "\n", " header_image \\\n", "4866 https://steamcdn-a.akamaihd.net/steam/apps/338... \n", "14560 https://steamcdn-a.akamaihd.net/steam/apps/596... \n", "16386 https://steamcdn-a.akamaihd.net/steam/apps/654... \n", "26855 https://steamcdn-a.akamaihd.net/steam/apps/948... \n", "\n", " website \\\n", "4866 http://subsiege-game.com/ \n", "14560 http://qqsm.zygames.com/ \n", "16386 NaN \n", "26855 NaN \n", "\n", " pc_requirements \\\n", "4866 {'minimum': 'Minimum:
                                  Minimum:
                                    Minimum:
                                      Minimum:
                                        Minimum:
                                          Minimum:
                                            *, French*, German*, Italian*, Spanish - Spain*, Simplified Chinese*, Traditional Chinese*, Korean*
                                            *languages with full audio support\n" ] }, { "data": { "text/plain": [ "English 8512\n", "English*
                                            *languages with full audio support 7409\n", "English, Russian 707\n", "English, Simplified Chinese 280\n", "English, Japanese 235\n", "English*, Russian*
                                            *languages with full audio support 222\n", "English, French, Italian, German, Spanish - Spain 180\n", "English, German 161\n", "Simplified Chinese 157\n", "English, French 143\n", "Name: supported_languages, dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(price_df['supported_languages'][0])\n", "price_df['supported_languages'].value_counts().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like English-only games make up a little over half the rows in our dataset (~16,000), and English plus other languages make up many more. We could create columns for any of the other languages by string searching, but for simplicity we'll focus on just the English ones.\n", "\n", "Using the Series.apply method once again, we can check if the string 'english' appears in each row. We'll return a 1 if 'english' is found and 0 otherwise. Whilst we could use True/False, a binary 1/0 can be interpreted as a boolean value and saves a little space in the csv file. \n", "\n", "Inside the lambda function, the variable `x` will take on the value of each row as the expression is evaluated. We apply the `lower()` string method so capitalisation doesn't matter." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            nameenglish
                                            0Counter-Strike1
                                            1Team Fortress Classic1
                                            2Day of Defeat1
                                            3Deathmatch Classic1
                                            4Half-Life: Opposing Force1
                                            \n", "
                                            " ], "text/plain": [ " name english\n", "0 Counter-Strike 1\n", "1 Team Fortress Classic 1\n", "2 Day of Defeat 1\n", "3 Deathmatch Classic 1\n", "4 Half-Life: Opposing Force 1" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_language(df):\n", " \"\"\"Process supported_languages column into a boolean 'is english' column.\"\"\"\n", " df = df.copy()\n", " \n", " # drop rows with missing language data\n", " df = df.dropna(subset=['supported_languages'])\n", " \n", " df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)\n", " df = df.drop('supported_languages', axis=1)\n", " \n", " return df\n", "\n", "\n", "language_df = process_language(price_df)\n", "language_df[['name', 'english']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, we can take a quick look at the results and see that most of the apps support English." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 27699\n", "0 522\n", "Name: english, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "language_df['english'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Developers and Publishers\n", "\n", "We'll skip over the next few columns and take a look at the `developers` and `publishers` columns. They will most likely contain similar information so we can look at them together. \n", "\n", "We'll start by checking the null counts, noticing that while the publishers column doesn't appear to have any null values, if we search for empty lists we see that we have over 200 'hidden' null values." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Developers null counts: 104\n", "Developers empty list counts: 0\n", "\n", "Publishers null counts: 0\n", "Publishers empty list counts: 213\n" ] } ], "source": [ "print('Developers null counts:', language_df['developers'].isnull().sum())\n", "print('Developers empty list counts:', language_df[language_df['developers'] == \"['']\"].shape[0])\n", "\n", "print('\\nPublishers null counts:', language_df['publishers'].isnull().sum())\n", "print('Publishers empty list counts:', language_df[language_df['publishers'] == \"['']\"].shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ultimately we want a data set with no missing values. That means we have a few options for dealing with these two columns:\n", "\n", "- Remove all rows missing either developer or publisher information\n", "- Impute missing information by replacing the missing column with the column we have (i.e. if developers is missing, fill it with the value in publishers)\n", "- Fill missing information with 'Unknown' or 'None'\n", "\n", "We can investigate some of the rows with missing data to help inform our decision." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total games missing developer: 104 \n", "\n", "Tycoon City: New York: https://store.steampowered.com/app/9730\n", "Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370\n", "Crash Time 2: https://store.steampowered.com/app/11390\n", "Hunting Unlimited 2010: https://store.steampowered.com/app/12690\n", "18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730\n", "\n", "Total games missing publisher: 213 \n", "\n", "RIP - Trilogy™: https://store.steampowered.com/app/2540\n", "Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570\n", "Bullet Candy: https://store.steampowered.com/app/6600\n", "AudioSurf: https://store.steampowered.com/app/12900\n", "Everyday Shooter: https://store.steampowered.com/app/16300\n", "\n", "Total games missing developer and publisher: 67 \n", "\n", "PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370\n", "Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350\n", "A Walk in the Dark: https://store.steampowered.com/app/248730\n", "Forge Quest: https://store.steampowered.com/app/249950\n", "Vox: https://store.steampowered.com/app/252770\n" ] } ], "source": [ "no_dev = language_df[language_df['developers'].isnull()]\n", "\n", "print('Total games missing developer:', no_dev.shape[0], '\\n')\n", "\n", "print_steam_links(no_dev[:5])\n", "\n", "no_pub = language_df[language_df['publishers'] == \"['']\"]\n", "\n", "print('\\nTotal games missing publisher:', no_pub.shape[0], '\\n')\n", "print_steam_links(no_pub[:5])\n", "\n", "no_dev_or_pub = language_df[(language_df['developers'].isnull()) & (language_df['publishers'] == \"['']\")]\n", "\n", "print('\\nTotal games missing developer and publisher:', no_dev_or_pub.shape[0], '\\n')\n", "print_steam_links(no_dev_or_pub[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears we are looking at a mix of titles, smaller ones especially, and some of the smaller indie titles may have been self-published. Others simply have wrong or missing data, found by searching for the titles elsewhere. As our priority is creating a clean data set, and there are only a few hundred rows, it will be fine to remove them from the data.\n", "\n", "Let's take a look at the structure of the data. Below we inspect some rows near the beginning of the dataframe. It looks like both columns are stored as lists which can have one or multiple values. We'll have to evaluate the rows as before, so they are recognised as lists, then index into them accordingly." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            developerspublishers
                                            24['Valve']['Valve']
                                            25['Valve', 'Hidden Path Entertainment']['Valve']
                                            27['Mark Healey']['Mark Healey']
                                            28['Tripwire Interactive']['Tripwire Interactive']
                                            \n", "
                                            " ], "text/plain": [ " developers publishers\n", "24 ['Valve'] ['Valve']\n", "25 ['Valve', 'Hidden Path Entertainment'] ['Valve']\n", "27 ['Mark Healey'] ['Mark Healey']\n", "28 ['Tripwire Interactive'] ['Tripwire Interactive']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "language_df[['developers', 'publishers']].iloc[24:28]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have some single values and some multiple, we have to decide how to handle them. Here are some potential solutions:\n", "\n", " - Create a column for each value in the list (i.e. developer_1, developer_2)\n", " - Create a column with the first value in the list and a column with the rest of the values (i.e. developer_1, other_developers)\n", " - Create a column with the first value in the list and disregard the rest\n", " - Combine all values into one column, simply unpacking the list\n", " \n", "Let's begin defining our function, and take a look at how many rows have multiple developers or publishers. After evaluating each row, we can find the length of the lists in each row by using the [Series.str.len()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.len.html) method. By filtering only rows where the list has more than one element, we can find the number of rows with more than one value in each column." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows in developers column with multiple values: 1720\n", "Rows in publishers column with multiple values: 884\n" ] } ], "source": [ "def process_developers_and_publishers(df):\n", " # remove rows with missing data\n", " df = df[(df['developers'].notnull()) & (df['publishers'] != \"['']\")].copy()\n", " \n", " for col in ['developers', 'publishers']:\n", " df[col] = df[col].apply(lambda x: literal_eval(x))\n", " \n", " # filter dataframe to rows with lists longer than 1, and store the number of rows\n", " num_rows = df[df[col].str.len() > 1].shape[0]\n", " \n", " print('Rows in {} column with multiple values:'.format(col), num_rows)\n", "\n", "process_developers_and_publishers(language_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that the vast majority have only one value for these columns. If we went with the first or second solutions above, we'd be left with columns with mostly missing data. We could go with the third option, but the first value in the list isn't necessarily the most important, and this seems unfair if multiple teams were involved.\n", "\n", "The best way forward seems to be the fourth option - if there are multiple values we combine them into the same column. We'll create a list in this case, calling str.join() as we did before. If we pass a list with only one value, we get a string with just that value. If we pass a list with multiple values, we get a string-separated list as desired. We can see this below." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'one item'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "', '.join(['one item'])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "'multiple, different, items'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "', '.join(['multiple', 'different', 'items'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can't join on a comma as a number of developers and publishers have a comma in their name, a couple of which can be seen below." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            steam_appiddeveloperspublishers
                                            25730['Valve', 'Hidden Path Entertainment']['Valve']
                                            662520['CINEMAX, s.r.o.']['CINEMAX, s.r.o.']
                                            732630['Infinity Ward', 'Aspyr (Mac)']['Activision', 'Aspyr (Mac)']
                                            973300['PopCap Games, Inc.']['PopCap Games, Inc.']
                                            \n", "
                                            " ], "text/plain": [ " steam_appid developers \\\n", "25 730 ['Valve', 'Hidden Path Entertainment'] \n", "66 2520 ['CINEMAX, s.r.o.'] \n", "73 2630 ['Infinity Ward', 'Aspyr (Mac)'] \n", "97 3300 ['PopCap Games, Inc.'] \n", "\n", " publishers \n", "25 ['Valve'] \n", "66 ['CINEMAX, s.r.o.'] \n", "73 ['Activision', 'Aspyr (Mac)'] \n", "97 ['PopCap Games, Inc.'] " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "language_df.loc[language_df['developers'].str.contains(\",\", na=False), ['steam_appid', 'developers', 'publishers']].head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead we can join on a semi-colon (`;`). We have 3 rows which contain a semi-colon in their name, so we'll remove these. We'll be able to identify and split individual developer/publisher names in the future by handling it this way." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            steam_appiddeveloperspublishers
                                            9550460210['bool games;']['bool games;']
                                            13489568480[';)', 'Quickdraw Studios']['Quickdraw Studios']
                                            16871665890['Semicolon;']['Semicolon;']
                                            \n", "
                                            " ], "text/plain": [ " steam_appid developers publishers\n", "9550 460210 ['bool games;'] ['bool games;']\n", "13489 568480 [';)', 'Quickdraw Studios'] ['Quickdraw Studios']\n", "16871 665890 ['Semicolon;'] ['Semicolon;']" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "language_df.loc[language_df['developers'].str.contains(\";\", na=False), ['steam_appid', 'developers', 'publishers']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also a small number of columns that have `['NA']` or `['N/A']` for publisher. These are some really well hidden null values, and they didn't actually surface until much later in the original development process. This helps highlight the iterative nature of data cleaning - you may discover errors in rows or data that went previously undiscovered, and have to go back and update or correct them." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "language_df[(language_df['publishers'] == \"['NA']\") | (language_df['publishers'] == \"['N/A']\")].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to finish the function we started. We'll abandon the for loop, as there is not too much repetition, and run it on the data as always." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            namesteam_appiddeveloperpublisher
                                            0Counter-Strike10ValveValve
                                            1Team Fortress Classic20ValveValve
                                            2Day of Defeat30ValveValve
                                            3Deathmatch Classic40ValveValve
                                            4Half-Life: Opposing Force50Gearbox SoftwareValve
                                            \n", "
                                            " ], "text/plain": [ " name steam_appid developer publisher\n", "0 Counter-Strike 10 Valve Valve\n", "1 Team Fortress Classic 20 Valve Valve\n", "2 Day of Defeat 30 Valve Valve\n", "3 Deathmatch Classic 40 Valve Valve\n", "4 Half-Life: Opposing Force 50 Gearbox Software Valve" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_developers_and_publishers(df):\n", " \"\"\"Parse columns as semicolon-separated string.\"\"\"\n", " # remove rows with missing data (~ means not)\n", " df = df[(df['developers'].notnull()) & (df['publishers'] != \"['']\")].copy()\n", " df = df[~(df['developers'].str.contains(';')) & ~(df['publishers'].str.contains(';'))]\n", " df = df[(df['publishers'] != \"['NA']\") & (df['publishers'] != \"['N/A']\")]\n", " \n", " # create list for each\n", " df['developer'] = df['developers'].apply(lambda x: ';'.join(literal_eval(x)))\n", " df['publisher'] = df['publishers'].apply(lambda x: ';'.join(literal_eval(x)))\n", "\n", " df = df.drop(['developers', 'publishers'], axis=1)\n", " \n", " return df\n", "\n", "dev_pub_df = process_developers_and_publishers(language_df)\n", "dev_pub_df[['name', 'steam_appid', 'developer', 'publisher']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Categories and Genres\n", "\n", "We'll take a look at the `categories` and `genres` columns next. Below we take a look at the null counts and structure of the data. Both appear to be a list of dictionaries containing an id and description key-value pair. Similar to our handling of developers and publishers, it may be best to extract just the descriptions into a list for now. We could make a list of the IDs, keeping track of the corresponding description externally, but that seems overly complex for our purposes." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Categories:\n", "\n", "Null values: 509\n", "\n", "[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]\n", "\n", "Genres:\n", "\n", "Null values: 37\n", "\n", "[{'id': '1', 'description': 'Action'}]\n", "[{'id': '3', 'description': 'RPG'}, {'id': '2', 'description': 'Strategy'}]\n" ] } ], "source": [ "print('Categories:\\n')\n", "print('Null values:', dev_pub_df['categories'].isnull().sum())\n", "print()\n", "print(dev_pub_df['categories'][0])\n", "\n", "print('\\nGenres:\\n')\n", "print('Null values:', dev_pub_df['genres'].isnull().sum())\n", "print()\n", "print(dev_pub_df['genres'].iloc[0])\n", "print(dev_pub_df['genres'].iloc[1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin defining a function we'll inspect some of the null rows, then we can decide how to handle them.\n", "\n", "Using the pandas [DataFrame.sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) method, we can randomly sample a number of rows from the dataset. We set a random_state so the output is the same each time." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Driver Booster 3 for STEAM: https://store.steampowered.com/app/403040\n", "Swanky Paint: https://store.steampowered.com/app/432030\n", "ResumeMaker® Professional Deluxe 20: https://store.steampowered.com/app/707280\n", "CyberLink PhotoDirector 9 Ultra: https://store.steampowered.com/app/679550\n", "OVRdrop: https://store.steampowered.com/app/586210\n" ] } ], "source": [ "print_steam_links(dev_pub_df[dev_pub_df['categories'].isnull()].sample(5, random_state=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the above rows with null categories data are applications or software of some kind, and not actually games. It would definitely be best to remove these, as we are interested in analysing games from the steam store.\n", "\n", "Below we take a look at rows with missing genres data. There doesn't seem to be anything wrong with these games, and it suggests that genre data simply wasn't supplied. As there are only 37 rows affected, we'll remove these rows to keep our dataset complete." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Antagonist: https://store.steampowered.com/app/585260\n", "Lonelyland VR: https://store.steampowered.com/app/542620\n", "ChessBase 13 Academy: https://store.steampowered.com/app/377340\n", "Sonic the Hedgehog 4 - Episode I: https://store.steampowered.com/app/202530\n", "VEHICLES FURY: https://store.steampowered.com/app/749290\n" ] } ], "source": [ "print_steam_links(dev_pub_df[dev_pub_df['genres'].isnull()].sample(5, random_state=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did for the developer and publisher columns, we can create an anonymous function and join the results on a semicolon. Inside the function we use a list comprehension to traverse each dictionary and extract the value under the `description` key, as seen below. " ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Multi-player', 'Online Multi-Player', 'Local Multi-Player']" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_category = \"[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}]\"\n", "\n", "[x['description'] for x in literal_eval(example_category)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then use the apply function again to turn each column into a simple delimited list." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            steam_appidcategoriesgenres
                                            010Multi-player;Online Multi-Player;Local Multi-P...Action
                                            120Multi-player;Online Multi-Player;Local Multi-P...Action
                                            230Multi-player;Valve Anti-Cheat enabledAction
                                            340Multi-player;Online Multi-Player;Local Multi-P...Action
                                            450Single-player;Multi-player;Valve Anti-Cheat en...Action
                                            \n", "
                                            " ], "text/plain": [ " steam_appid categories genres\n", "0 10 Multi-player;Online Multi-Player;Local Multi-P... Action\n", "1 20 Multi-player;Online Multi-Player;Local Multi-P... Action\n", "2 30 Multi-player;Valve Anti-Cheat enabled Action\n", "3 40 Multi-player;Online Multi-Player;Local Multi-P... Action\n", "4 50 Single-player;Multi-player;Valve Anti-Cheat en... Action" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_categories_and_genres(df):\n", " df = df.copy()\n", " df = df[(df['categories'].notnull()) & (df['genres'].notnull())]\n", " \n", " for col in ['categories', 'genres']:\n", " df[col] = df[col].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))\n", " \n", " return df\n", "\n", "cat_gen_df = process_categories_and_genres(dev_pub_df)\n", "cat_gen_df[['steam_appid', 'categories', 'genres']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Achievements and Content Descriptors\n", "\n", "The final columns we will take a look at before moving on to export columns are `achievements` and `content_descriptors`. Let's take a look at the null counts for each column and a small sample of rows." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Achievements null counts: 1856\n", "Content Decsriptors null counts: 0\n" ] }, { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            nameachievementscontent_descriptors
                                            8Half-Life: Blue Shift{'total': 0}{'ids': [], 'notes': None}
                                            9Half-Life 2{'total': 33, 'highlighted': [{'name': 'Defian...{'ids': [], 'notes': None}
                                            10Counter-Strike: Source{'total': 147, 'highlighted': [{'name': 'Someo...{'ids': [2, 5], 'notes': 'Includes intense vio...
                                            11Half-Life: Source{'total': 0}{'ids': [], 'notes': None}
                                            12Day of Defeat: Source{'total': 54, 'highlighted': [{'name': 'Double...{'ids': [], 'notes': None}
                                            \n", "
                                            " ], "text/plain": [ " name achievements \\\n", "8 Half-Life: Blue Shift {'total': 0} \n", "9 Half-Life 2 {'total': 33, 'highlighted': [{'name': 'Defian... \n", "10 Counter-Strike: Source {'total': 147, 'highlighted': [{'name': 'Someo... \n", "11 Half-Life: Source {'total': 0} \n", "12 Day of Defeat: Source {'total': 54, 'highlighted': [{'name': 'Double... \n", "\n", " content_descriptors \n", "8 {'ids': [], 'notes': None} \n", "9 {'ids': [], 'notes': None} \n", "10 {'ids': [2, 5], 'notes': 'Includes intense vio... \n", "11 {'ids': [], 'notes': None} \n", "12 {'ids': [], 'notes': None} " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Achievements null counts:', cat_gen_df['achievements'].isnull().sum())\n", "print('Content Decsriptors null counts:', cat_gen_df['content_descriptors'].isnull().sum())\n", "\n", "cat_gen_df[['name', 'achievements', 'content_descriptors']].iloc[8:13]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like both columns are stored as dictionaries, with standard formats if no details are provided or exist.\n", "\n", "Below we take a closer look at a single row from the achievements column." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "{'total': 33,\n", " 'highlighted': [{'name': 'Defiant',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},\n", " {'name': 'Submissive',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},\n", " {'name': 'Malcontent',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},\n", " {'name': 'What cat?',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},\n", " {'name': 'Trusty Hardware',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},\n", " {'name': 'Barnacle Bowling',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},\n", " {'name': \"Anchor's Aweigh!\",\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},\n", " {'name': 'Heavy Weapons',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboatgun.jpg'},\n", " {'name': 'Vorticough',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_find_vortigauntcave.jpg'},\n", " {'name': 'Revenge!',\n", " 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_chopper.jpg'}]}" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "literal_eval(cat_gen_df['achievements'][9])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two keys in the top level of the dictionary: `total` and `highlighted`. The highlighted column looks too specific, being a selection of achievements specific to that game, so we will remove it. It may be worthwhile extracting the `total` value though.\n", "\n", "Now let's take a look at the `content_descriptors` column." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ids': [], 'notes': None} 25398\n", "{'ids': [2, 5], 'notes': None} 427\n", "{'ids': [1, 5], 'notes': None} 250\n", "{'ids': [5], 'notes': None} 127\n", "{'ids': [1, 2, 5], 'notes': None} 122\n", "{'ids': [2, 5], 'notes': 'This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, General Mature Content'} 8\n", "Name: content_descriptors, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_gen_df['content_descriptors'].value_counts().head(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Content descriptors contain age-related warnings about the content of a game. They are identified by a numeric ID number, with optional notes supplied. Almost 26,000 rows have an empty list, indicating either no content descriptors or none provided. We already have the `required_age` column, which gives us similar information. Because of this, and because the rows are highly specific to each game, we will drop this column entirely. \n", "\n", "Let's now define a function, taking a look at the value counts to verify everything went as expected." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 11966\n", "10 697\n", "12 624\n", "20 579\n", "15 500\n", "Name: achievements, dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_achievements_and_descriptors(df):\n", " \"\"\"Parse as total number of achievements.\"\"\"\n", " df = df.copy()\n", " \n", " df = df.drop('content_descriptors', axis=1)\n", " \n", " def parse_achievements(x):\n", " if x is np.nan:\n", " # missing data, assume has no achievements\n", " return 0\n", " else:\n", " # else has data, so can extract and return number under total\n", " return literal_eval(x)['total']\n", " \n", " df['achievements'] = df['achievements'].apply(parse_achievements)\n", " \n", " return df\n", "\n", "achiev_df = process_achievements_and_descriptors(cat_gen_df)\n", "achiev_df['achievements'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we were successful. We'll leave this column as it is for now, however we may wish to consider grouping the values together in bins, like we did for the age column. This is a decision we can make during the feature engineering stage of our analysis, and we can decide at that point if it will be more useful.\n", "\n", "Let's now add these functions into the `process` function and run it on the raw data. This isn't strictly necessary but it will keep things organised and ensure we don't accidentally skip running a function.\n", "\n", "We'll then inspect everything we've completed so far. As you will see, there is still plenty more left to do." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            namesteam_appidrequired_agedetailed_descriptionabout_the_gameshort_descriptionheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundpriceenglishdeveloperpublisher
                                            0Counter-Strike100Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...https://steamcdn-a.akamaihd.net/steam/apps/10/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '1 Nov, 2000'}{'url': 'http://steamcommunity.com/app/10', 'e...https://steamcdn-a.akamaihd.net/steam/apps/10/...7.191ValveValve
                                            1Team Fortress Classic200One of the most popular online action games of...One of the most popular online action games of...One of the most popular online action games of...https://steamcdn-a.akamaihd.net/steam/apps/20/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '1 Apr, 1999'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/20/...3.991ValveValve
                                            2Day of Defeat300Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...https://steamcdn-a.akamaihd.net/steam/apps/30/...http://www.dayofdefeat.com/{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...windows;mac;linuxMulti-player;Valve Anti-Cheat enabledAction[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '1 May, 2003'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/30/...3.991ValveValve
                                            3Deathmatch Classic400Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...https://steamcdn-a.akamaihd.net/steam/apps/40/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '1 Jun, 2001'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/40/...3.991ValveValve
                                            4Half-Life: Opposing Force500Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...https://steamcdn-a.akamaihd.net/steam/apps/50/...NaN{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...windows;mac;linuxSingle-player;Multi-player;Valve Anti-Cheat en...Action[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '1 Nov, 1999'}{'url': 'https://help.steampowered.com', 'emai...https://steamcdn-a.akamaihd.net/steam/apps/50/...3.991Gearbox SoftwareValve
                                            \n", "
                                            " ], "text/plain": [ " name steam_appid required_age \\\n", "0 Counter-Strike 10 0 \n", "1 Team Fortress Classic 20 0 \n", "2 Day of Defeat 30 0 \n", "3 Deathmatch Classic 40 0 \n", "4 Half-Life: Opposing Force 50 0 \n", "\n", " detailed_description \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " about_the_game \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " short_description \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " header_image \\\n", "0 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n", "1 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n", "2 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n", "3 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n", "4 https://steamcdn-a.akamaihd.net/steam/apps/50/... \n", "\n", " website \\\n", "0 NaN \n", "1 NaN \n", "2 http://www.dayofdefeat.com/ \n", "3 NaN \n", "4 NaN \n", "\n", " pc_requirements \\\n", "0 {'minimum': '\\r\\n\\t\\t\\t

                                            Minimum:Minimum:Minimum:Minimum:Minimum:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                            namesteam_appidrequired_agedetailed_descriptionabout_the_gameshort_descriptionheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundpriceenglishdeveloperpublisher
                                            97Bejeweled 2 Deluxe33000NaNNaNNaNhttps://steamcdn-a.akamaihd.net/steam/apps/330...NaN{'minimum': '<p><strong>Minimum Requirements:<...{'minimum': '<ul>\\n\\t<li><strong>OS:</strong> ...[]windows;macSingle-playerCasual[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '30 Aug, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/330...4.251PopCap Games, Inc.PopCap Games, Inc.
                                            98Chuzzle Deluxe33100NaNNaNNaNhttps://steamcdn-a.akamaihd.net/steam/apps/331...NaN{'minimum': '<p><strong>Minimum Requirements:<...{'minimum': '<ul>\\n\\t<li><strong>OS:</strong> ...[]windows;macSingle-playerCasual[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '30 Aug, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/331...4.251PopCap Games, Inc.PopCap Games, Inc.
                                            99Insaniquarium Deluxe33200NaNNaNNaNhttps://steamcdn-a.akamaihd.net/steam/apps/332...NaN{'minimum': '<strong>Minimum Requirements:</st...[][]windowsSingle-playerCasual[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '30 Aug, 2006'}{'url': '', 'email': ''}https://steamcdn-a.akamaihd.net/steam/apps/332...4.251PopCap Games, Inc.PopCap Games, Inc.
                                            \n", "" ], "text/plain": [ " name steam_appid required_age detailed_description \\\n", "97 Bejeweled 2 Deluxe 3300 0 NaN \n", "98 Chuzzle Deluxe 3310 0 NaN \n", "99 Insaniquarium Deluxe 3320 0 NaN \n", "\n", " about_the_game short_description \\\n", "97 NaN NaN \n", "98 NaN NaN \n", "99 NaN NaN \n", "\n", " header_image website \\\n", "97 https://steamcdn-a.akamaihd.net/steam/apps/330... NaN \n", "98 https://steamcdn-a.akamaihd.net/steam/apps/331... NaN \n", "99 https://steamcdn-a.akamaihd.net/steam/apps/332... NaN \n", "\n", " pc_requirements \\\n", "97 {'minimum': '

                                            Minimum Requirements:<... \n", "98 {'minimum': '

                                            Minimum Requirements:<... \n", "99 {'minimum': 'Minimum Requirements:\\n\\t

                                          • OS: ... [] \n", "98 {'minimum': '
                                              \\n\\t
                                            • OS: ... [] \n", "99 [] [] \n", "\n", " platforms categories genres \\\n", "97 windows;mac Single-player Casual \n", "98 windows;mac Single-player Casual \n", "99 windows Single-player Casual \n", "\n", " screenshots movies achievements \\\n", "97 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 \n", "98 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 \n", "99 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 \n", "\n", " release_date support_info \\\n", "97 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} \n", "98 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} \n", "99 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} \n", "\n", " background price english \\\n", "97 https://steamcdn-a.akamaihd.net/steam/apps/330... 4.25 1 \n", "98 https://steamcdn-a.akamaihd.net/steam/apps/331... 4.25 1 \n", "99 https://steamcdn-a.akamaihd.net/steam/apps/332... 4.25 1 \n", "\n", " developer publisher \n", "97 PopCap Games, Inc. PopCap Games, Inc. \n", "98 PopCap Games, Inc. PopCap Games, Inc. \n", "99 PopCap Games, Inc. PopCap Games, Inc. " ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "partially_clean[partially_clean['detailed_description'].isnull()].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, all of these titles are games from 2006 developed and published by PopCap Games. My best guess is that they were developed previously and all added to the Steam store in one go after Valve allowed third-party titles.\n", "\n", "We'll remove these rows, as well as any with a description of less than 20 characters, like those below." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                              namesteam_appidrequired_agedetailed_descriptionabout_the_gameshort_descriptionheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundpriceenglishdeveloperpublisher
                                              10420Penguins Cretins4909900.........https://steamcdn-a.akamaihd.net/steam/apps/490...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...windowsSingle-player;Multi-player;Online Multi-Player...Action;Adventure;Casual;Indie;Massively Multip...[{'id': 0, 'path_thumbnail': 'https://steamcdn...NaN0{'coming_soon': False, 'date': '22 Jun, 2016'}{'url': '', 'email': 'support@hfmgames.net'}https://steamcdn-a.akamaihd.net/steam/apps/490...1.691HFM GamesHFM Games
                                              19938拼词游戏 20177458400带一点恐怖元素的休闲游戏带一点恐怖元素的休闲游戏一款有一点恐怖元素的休闲益智游戏。https://steamcdn-a.akamaihd.net/steam/apps/745...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...[]windows;macSingle-playerAdventure;Indie[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256699963, 'name': 'alpha', 'thumbnail...11{'coming_soon': False, 'date': '29 Nov, 2017'}{'url': '', 'email': '12668934@qq.com'}https://steamcdn-a.akamaihd.net/steam/apps/745...0.790MianwotuMianwotu
                                              21937God Test7976600God TestGod TestGod Testhttps://steamcdn-a.akamaihd.net/steam/apps/797...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...[][]windowsMulti-player;Online Multi-Player;MMO;Online Co...Action;Massively Multiplayer;RPG;StrategyNaNNaN0{'coming_soon': False, 'date': '18 Apr, 2018'}{'url': '', 'email': 'insanegamedev@outlook.com'}NaN0.001God TestGod Test
                                              26219В поисках Атлантиды9256400Интересная играИнтересная играAtlantishttps://steamcdn-a.akamaihd.net/steam/apps/925...https://vk.com/atlantisforever{'minimum': '<strong>Minimum:</strong><br><ul ...[][]windowsSingle-player;Steam AchievementsAdventure;Casual;Indie;Early Access[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256725871, 'name': 'Game', 'thumbnail'...1{'coming_soon': False, 'date': '1 Nov, 2018'}{'url': 'https://vk.com/atlantisforever', 'ema...https://steamcdn-a.akamaihd.net/steam/apps/925...1.690Dmitr CheDmitr Che
                                              26356东方百问~TouHouAsked9308400NullNullNullhttps://steamcdn-a.akamaihd.net/steam/apps/930...https://asked.touhou.ren/{'minimum': '<strong>Minimum:</strong><br><ul ...[][]windowsSingle-player;Steam AchievementsCasual;Indie;Strategy;Early Access[{'id': 0, 'path_thumbnail': 'https://steamcdn...[{'id': 256726640, 'name': 'TouHouAsked', 'thu...2{'coming_soon': False, 'date': '7 Oct, 2018'}{'url': 'https://asked.touhou.ren', 'email': '...https://steamcdn-a.akamaihd.net/steam/apps/930...0.790Root Nine StudioRoot Nine Studio
                                              \n", "
                                              " ], "text/plain": [ " name steam_appid required_age detailed_description \\\n", "10420 Penguins Cretins 490990 0 ... \n", "19938 拼词游戏 2017 745840 0 带一点恐怖元素的休闲游戏 \n", "21937 God Test 797660 0 God Test \n", "26219 В поисках Атлантиды 925640 0 Интересная игра \n", "26356 东方百问~TouHouAsked 930840 0 Null \n", "\n", " about_the_game short_description \\\n", "10420 ... ... \n", "19938 带一点恐怖元素的休闲游戏 一款有一点恐怖元素的休闲益智游戏。 \n", "21937 God Test God Test \n", "26219 Интересная игра Atlantis \n", "26356 Null Null \n", "\n", " header_image \\\n", "10420 https://steamcdn-a.akamaihd.net/steam/apps/490... \n", "19938 https://steamcdn-a.akamaihd.net/steam/apps/745... \n", "21937 https://steamcdn-a.akamaihd.net/steam/apps/797... \n", "26219 https://steamcdn-a.akamaihd.net/steam/apps/925... \n", "26356 https://steamcdn-a.akamaihd.net/steam/apps/930... \n", "\n", " website \\\n", "10420 NaN \n", "19938 NaN \n", "21937 NaN \n", "26219 https://vk.com/atlantisforever \n", "26356 https://asked.touhou.ren/ \n", "\n", " pc_requirements \\\n", "10420 {'minimum': 'Minimum:
                                                Minimum:
                                                  Minimum:
                                                    Minimum:
                                                      Minimum:
                                                        Minimum:
                                                          Minimum:
                                                            Minimum:
                                                              20]\n", " \n", " # by default we don't export, useful if calling function later\n", " if export:\n", " # create dataframe of description columns\n", " description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]\n", " \n", " export_data(description_data, filename='description_data')\n", " \n", " # drop description columns from main dataframe\n", " df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)\n", " \n", " return df\n", "\n", "desc_df = process_descriptions(partially_clean, export=True)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                              steam_appiddetailed_descriptionabout_the_gameshort_description
                                                              010Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...Play the world's number 1 online action game. ...
                                                              120One of the most popular online action games of...One of the most popular online action games of...One of the most popular online action games of...
                                                              230Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...Enlist in an intense brand of Axis vs. Allied ...
                                                              340Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...Enjoy fast-paced multiplayer gaming with Death...
                                                              450Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...Return to the Black Mesa Research Facility as ...
                                                              \n", "
                                                              " ], "text/plain": [ " steam_appid detailed_description \\\n", "0 10 Play the world's number 1 online action game. ... \n", "1 20 One of the most popular online action games of... \n", "2 30 Enlist in an intense brand of Axis vs. Allied ... \n", "3 40 Enjoy fast-paced multiplayer gaming with Death... \n", "4 50 Return to the Black Mesa Research Facility as ... \n", "\n", " about_the_game \\\n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... \n", "\n", " short_description \n", "0 Play the world's number 1 online action game. ... \n", "1 One of the most popular online action games of... \n", "2 Enlist in an intense brand of Axis vs. Allied ... \n", "3 Enjoy fast-paced multiplayer gaming with Death... \n", "4 Return to the Black Mesa Research Facility as ... " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect exported data\n", "pd.read_csv('../data/exports/steam_description_data.csv').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Processing Media Columns\n", "\n", "Similar to the description columns, we have three columns that contain links to various images: `header_image`, `screenshots` and `background`. Whilst we won't be needing this data in this project, it could open the door to some interesting image analysis in the future. We will treat these columns in almost the same way, exporting the contents to a csv file then removing them from the dataset.\n", "\n", "Again, let's check for missing values." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "header_image: 0\n", "screenshots: 4\n", "background: 4\n" ] }, { "data": { "text/html": [ "
                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                              header_imagescreenshotsbackground
                                                              0https://steamcdn-a.akamaihd.net/steam/apps/10/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/10/...
                                                              1https://steamcdn-a.akamaihd.net/steam/apps/20/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/20/...
                                                              2https://steamcdn-a.akamaihd.net/steam/apps/30/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/30/...
                                                              3https://steamcdn-a.akamaihd.net/steam/apps/40/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/40/...
                                                              4https://steamcdn-a.akamaihd.net/steam/apps/50/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/50/...
                                                              \n", "
                                                              " ], "text/plain": [ " header_image \\\n", "0 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n", "1 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n", "2 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n", "3 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n", "4 https://steamcdn-a.akamaihd.net/steam/apps/50/... \n", "\n", " screenshots \\\n", "0 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "1 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "2 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "3 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "4 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "\n", " background \n", "0 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n", "1 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n", "2 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n", "3 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n", "4 https://steamcdn-a.akamaihd.net/steam/apps/50/... " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image_cols = ['header_image', 'screenshots', 'background']\n", "\n", "for col in image_cols:\n", " print(col+':', desc_df[col].isnull().sum())\n", "\n", "desc_df[image_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the description columns, it is likely that the 4 rows with no `screenshots` data are the same rows with no `background` data. There are so few that it is probably safe to remove them.\n", "\n", "Before we make up our made let's inspect the rows." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Light Empire: https://store.steampowered.com/app/416220\n", "Girl and Goblin: https://store.steampowered.com/app/880510\n", "Arida: Backland's Awakening: https://store.steampowered.com/app/907760\n", "Nukalypse: The Final War: https://store.steampowered.com/app/947940\n" ] }, { "data": { "text/html": [ "
                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                              namesteam_appidrequired_ageheader_imagewebsitepc_requirementsmac_requirementslinux_requirementsplatformscategoriesgenresscreenshotsmoviesachievementsrelease_datesupport_infobackgroundpriceenglishdeveloperpublisher
                                                              7970The Light Empire4162200https://steamcdn-a.akamaihd.net/steam/apps/416...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...[][]windowsSingle-player;Steam AchievementsIndie;RPG;SimulationNaNNaN4{'coming_soon': False, 'date': '2 Dec, 2015'}{'url': '', 'email': 'Jemy.TLE@outlook.com'}NaN4.791JemyJemy
                                                              24874Girl and Goblin8805100https://steamcdn-a.akamaihd.net/steam/apps/880...NaN{'minimum': '<strong>最低配置:</strong><br><ul cla...[][]windowsSingle-player;Steam Achievements;Steam Trading...Adventure;Casual;RPG;StrategyNaN[{'id': 256739772, 'name': '3', 'thumbnail': '...1552{'coming_soon': False, 'date': '5 Jan, 2019'}{'url': '', 'email': 'smagician13@yahoo.com'}NaN0.791Inverse GameInverse Game
                                                              25696Arida: Backland's Awakening9077600https://steamcdn-a.akamaihd.net/steam/apps/907...http://www.aridagame.com{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...windows;macSingle-player;Steam Achievements;Full controll...Adventure;IndieNaN[{'id': 256729551, 'name': 'Teaser Beta 2018',...0{'coming_soon': True, 'date': ''}{'url': 'http://www.aridagame.com', 'email': '...NaN0.001Aoca Game LabAoca Game Lab
                                                              26851Nukalypse: The Final War9479400https://steamcdn-a.akamaihd.net/steam/apps/947...NaN{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...windowsSingle-player;Steam AchievementsAction;Indie;Early AccessNaN[{'id': 256745274, 'name': 'Nukalypse: The Fin...0{'coming_soon': True, 'date': 'Oct 2019'}{'url': '', 'email': 'nukalypse@gmail.com'}NaN0.001Zion Games StudioZion Games Studio
                                                              \n", "
                                                              " ], "text/plain": [ " name steam_appid required_age \\\n", "7970 The Light Empire 416220 0 \n", "24874 Girl and Goblin 880510 0 \n", "25696 Arida: Backland's Awakening 907760 0 \n", "26851 Nukalypse: The Final War 947940 0 \n", "\n", " header_image \\\n", "7970 https://steamcdn-a.akamaihd.net/steam/apps/416... \n", "24874 https://steamcdn-a.akamaihd.net/steam/apps/880... \n", "25696 https://steamcdn-a.akamaihd.net/steam/apps/907... \n", "26851 https://steamcdn-a.akamaihd.net/steam/apps/947... \n", "\n", " website \\\n", "7970 NaN \n", "24874 NaN \n", "25696 http://www.aridagame.com \n", "26851 NaN \n", "\n", " pc_requirements \\\n", "7970 {'minimum': 'Minimum:
                                                                最低配置:
                                                                  Minimum:
                                                                    Minimum:
                                                                      Minimum:
                                                                        Minimum:
                                                                          Minimum:
                                                                            Minimum:
                                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                              steam_appidheader_imagescreenshotsbackgroundmovies
                                                                              010https://steamcdn-a.akamaihd.net/steam/apps/10/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/10/...NaN
                                                                              120https://steamcdn-a.akamaihd.net/steam/apps/20/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/20/...NaN
                                                                              230https://steamcdn-a.akamaihd.net/steam/apps/30/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/30/...NaN
                                                                              340https://steamcdn-a.akamaihd.net/steam/apps/40/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/40/...NaN
                                                                              450https://steamcdn-a.akamaihd.net/steam/apps/50/...[{'id': 0, 'path_thumbnail': 'https://steamcdn...https://steamcdn-a.akamaihd.net/steam/apps/50/...NaN
                                                                              \n", "" ], "text/plain": [ " steam_appid header_image \\\n", "0 10 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n", "1 20 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n", "2 30 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n", "3 40 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n", "4 50 https://steamcdn-a.akamaihd.net/steam/apps/50/... \n", "\n", " screenshots \\\n", "0 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "1 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "2 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "3 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "4 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n", "\n", " background movies \n", "0 https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN \n", "1 https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN \n", "2 https://steamcdn-a.akamaihd.net/steam/apps/30/... NaN \n", "3 https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN \n", "4 https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect exported data\n", "pd.read_csv('../data/exports/steam_media_data.csv').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we move on, we can inspect the memory savings of removing these columns by comparing the output of the [DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method. If we pass `memory_usage=\"deep\"` we get the true memory usage of each DataFrame. Without this, pandas estimates the amount used. This is because of the way python stores object (string) columns under the hood. Essentially python keeps track of a list of pointers which point to the actual strings in memory. It's a bit like if you hid a bunch of items around the house, and kept a list of where everything was. You couldn't tell the total size of everything just by looking at the list, but you could take a rough guess. Only by following the list and inspecting each individual item could you get an exact figure.\n", "\n", "The blog post '[Why Python Is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)' goes into more detail, but all we need to be aware of is that by passing the parameter we ensure we get the true value of memory usage. We also pass `verbose=False` to truncate unnecessary output.\n", "\n", "We can see that already we have shrunk the memory usage from 285 MB to just under 55 MB. This is great because in general, the smaller the memory footprint the faster our code will run in future. And of course, we're not finished yet." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before removing data:\n", "\n", "\n", "Int64Index: 27414 entries, 0 to 29234\n", "Columns: 24 entries, name to publisher\n", "dtypes: category(1), float64(1), int64(3), object(19)\n", "memory usage: 285.0 MB\n", "\n", "Data with descriptions and media removed:\n", "\n", "\n", "Int64Index: 27391 entries, 0 to 29234\n", "Columns: 17 entries, name to publisher\n", "dtypes: category(1), float64(1), int64(3), object(12)\n", "memory usage: 54.2 MB\n" ] } ], "source": [ "print('Before removing data:\\n')\n", "achiev_df.info(verbose=False, memory_usage=\"deep\")\n", "\n", "print('\\nData with descriptions and media removed:\\n')\n", "media_df.info(verbose=False, memory_usage=\"deep\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Website and Support Info\n", "\n", "Next we will look at the `website` and `support_info` columns. Seen below, they both contain links to external websites. The website column is simply stored as a string whereas the support info column is stored as a dictionary of `url` and `email`.\n", "\n", "There are a large number of rows with no website listed, and while there are no null values in the `support_info` column, it looks like many will have empty `url` and `email` values inside the data.\n", "\n", "For our dataset we'll be dropping both these columns, as they are far too specific to be useful in our analysis. As you may have guessed, we will extract and export this data as we have done before. If not useful, it could be interesting at a later date." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "website null counts: 9328\n", "support_info null counts: 0\n" ] }, { "data": { "text/html": [ "
                                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                              namewebsitesupport_info
                                                                              82X: Beyond the Frontierhttp://www.egosoft.com/games/x/info_en.php{'url': '', 'email': ''}
                                                                              83X: Tensionhttp://www.egosoft.com/games/x_tension/info_en.php{'url': '', 'email': ''}
                                                                              84X Rebirthhttp://www.egosoft.com/games/x_rebirth/info_en.php{'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'}
                                                                              85688(I) Hunter/KillerNaN{'url': 'http://strategyfirst.com/products/support.html', 'email': ''}
                                                                              86Fleet CommandNaN{'url': 'http://strategyfirst.com/products/support.html', 'email': ''}
                                                                              \n", "
                                                                              " ], "text/plain": [ " name \\\n", "82 X: Beyond the Frontier \n", "83 X: Tension \n", "84 X Rebirth \n", "85 688(I) Hunter/Killer \n", "86 Fleet Command \n", "\n", " website \\\n", "82 http://www.egosoft.com/games/x/info_en.php \n", "83 http://www.egosoft.com/games/x_tension/info_en.php \n", "84 http://www.egosoft.com/games/x_rebirth/info_en.php \n", "85 NaN \n", "86 NaN \n", "\n", " support_info \n", "82 {'url': '', 'email': ''} \n", "83 {'url': '', 'email': ''} \n", "84 {'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'} \n", "85 {'url': 'http://strategyfirst.com/products/support.html', 'email': ''} \n", "86 {'url': 'http://strategyfirst.com/products/support.html', 'email': ''} " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print('website null counts:', media_df['website'].isnull().sum())\n", "print('support_info null counts:', media_df['support_info'].isnull().sum())\n", "\n", "with pd.option_context(\"display.max_colwidth\", 100): # ensures strings not cut short\n", " display(media_df[['name', 'website', 'support_info']][75:80])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to split the support info into two separate columns. We'll keep all the code that parses the columns inside the export `if` statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain `NaN` whereas the other two columns contain a blank string (`''`) for missing data, as once we have exported to csv they will be represented the same way." ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Exported support info to '../data/exports/steam_support_info.csv'\n" ] } ], "source": [ "def process_info(df, export=False):\n", " \"\"\"Drop support information from dataframe, optionally exporting beforehand.\"\"\"\n", " if export:\n", " support_info = df[['steam_appid', 'website', 'support_info']].copy()\n", " \n", " support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))\n", " support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])\n", " support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])\n", " \n", " support_info = support_info.drop('support_info', axis=1)\n", " \n", " # only keep rows with at least one piece of information\n", " support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]\n", "\n", " export_data(support_info, 'support_info')\n", " \n", " df = df.drop(['website', 'support_info'], axis=1)\n", " \n", " return df\n", "\n", "\n", "info_df = process_info(media_df, export=True)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                              steam_appidwebsitesupport_urlsupport_email
                                                                              010NaNhttp://steamcommunity.com/app/10NaN
                                                                              130http://www.dayofdefeat.com/NaNNaN
                                                                              250NaNhttps://help.steampowered.comNaN
                                                                              370http://www.half-life.com/http://steamcommunity.com/app/70NaN
                                                                              480NaNhttp://steamcommunity.com/app/80NaN
                                                                              \n", "
                                                                              " ], "text/plain": [ " steam_appid website support_url \\\n", "0 10 NaN http://steamcommunity.com/app/10 \n", "1 30 http://www.dayofdefeat.com/ NaN \n", "2 50 NaN https://help.steampowered.com \n", "3 70 http://www.half-life.com/ http://steamcommunity.com/app/70 \n", "4 80 NaN http://steamcommunity.com/app/80 \n", "\n", " support_email \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect exported file\n", "pd.read_csv('../data/exports/steam_support_info.csv').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### System Requirements\n", "\n", "As you may have noticed, there are three columns containing the system specifications required to run each game on each platform. These columns are `pc_requirements`, `mac_requirements`, and `linux_requirements`. As computing power has increased over the years, this information won't be of particular use in this analysis. It could be fascinating to use as a proxy of how computers have developed over the years, so we'll want to export the data, but we won't be keeping it in this dataset.\n", "\n", "Taking a look at the null counts, it looks like there is data for every row." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "null counts:\n", "\n", "pc_requirements: 0\n", "mac_requirements: 0\n", "linux_requirements: 0\n" ] } ], "source": [ "requirements_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']\n", "\n", "print('null counts:\\n')\n", "\n", "for col in requirements_cols:\n", " print(col+':', info_df[col].isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However if we look at the data a little more closely, we see that some rows actually store an empty list. These won't appear as null rows, but once evaluated won't provide any information, so can be treated as null values." ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
                                                                              \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                              steam_appidpc_requirementsmac_requirementslinux_requirements
                                                                              292301065230{'minimum': '<strong>Minimum:</strong><br><ul ...[][]
                                                                              292311065570{'minimum': '<strong>Minimum:</strong><br><ul ...[][]
                                                                              292321065650{'minimum': '<strong>Minimum:</strong><br><ul ...[][]
                                                                              292331066700{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...[]
                                                                              292341069460{'minimum': '<strong>Minimum:</strong><br><ul ...{'minimum': '<strong>Minimum:</strong><br><ul ...[]
                                                                              \n", "
                                                                              " ], "text/plain": [ " steam_appid pc_requirements \\\n", "29230 1065230 {'minimum': 'Minimum:
                                                                                Minimum:
                                                                                  Minimum:
                                                                                    Minimum:
                                                                                      Minimum:
                                                                                        Minimum:
                                                                                          Minimum:
                                                                                            Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection

                                                                                            \\\\r\\\\n\\\\t\\\\t\\\\t

                                                                                            Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection

                                                                                            \\\\r\\\\n\\\\t\\\\t\\\\t'}\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'{\\'minimum\\': \\'Minimum:
                                                                                            • OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic
                                                                                            • Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices
                                                                                            • Memory: 2 GB RAM
                                                                                            • Storage: 1400 MB available space
                                                                                            \\', \\'recommended\\': \\'Recommended:
                                                                                            • OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic
                                                                                            • Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices
                                                                                            • Memory: 4 GB RAM
                                                                                            • Storage: 1400 MB available space
                                                                                            \\'}'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'{\\'minimum\\': \\'Minimum:
                                                                                            • Requires a 64-bit processor and operating system
                                                                                            • OS: Windows 8 / 10 - 64-bit
                                                                                            • Processor: 2Ghz Intel CPU
                                                                                            • Memory: 2 GB RAM
                                                                                            • Graphics: 1024MB VRAM
                                                                                            • DirectX: Version 9.0
                                                                                            • Network: Broadband Internet connection
                                                                                            • Storage: 10 GB available space
                                                                                            \\', \\'recommended\\': \\'Recommended:
                                                                                            • Requires a 64-bit processor and operating system
                                                                                            • OS: Windows 10 - 64-bit
                                                                                            • Processor: 2.60Ghz Intel Core i7 CPU
                                                                                            • Memory: 16 GB RAM
                                                                                            • Graphics: 4GB VRAM
                                                                                            • Network: Broadband Internet connection
                                                                                            • Storage: 10 GB available space
                                                                                            \\'}'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(info_df['pc_requirements'].iloc[0])\n", "display(info_df['pc_requirements'].iloc[2000])\n", "display(info_df['pc_requirements'].iloc[15000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is stored as a dictionary, as we've seen before. There is definitely a key for `minimum`, but apart from that it is hard to see at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like `Processor` and `Memory` for some of the rows.\n", "\n", "Let's take a stab and cleaning out some of the unnessecary formatting and see if it becomes clearer.\n", "\n", "By creating a dataframe from a selection of rows, we can easily and quickly make changes using the pandas .str accessor, allowing us to use python string formatting and regular expressions." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\"{'minimum': 'Minimum: OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices Memory: 2 GB RAM Storage: 1400 MB available space', 'recommended': 'Recommended: OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices Memory: 4 GB RAM Storage: 1400 MB available space'}\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\"{'minimum': 'Minimum: Requires a 64-bit processor and operating system OS: Windows 8 / 10 - 64-bit Processor: 2Ghz Intel CPU Memory: 2 GB RAM Graphics: 1024MB VRAM DirectX: Version 9.0 Network: Broadband Internet connection Storage: 10 GB available space', 'recommended': 'Recommended: Requires a 64-bit processor and operating system OS: Windows 10 - 64-bit Processor: 2.60Ghz Intel Core i7 CPU Memory: 16 GB RAM Graphics: 4GB VRAM Network: Broadband Internet connection Storage: 10 GB available space'}\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "view_requirements = info_df['pc_requirements'].iloc[[0, 2000, 15000]].copy()\n", "\n", "view_requirements = (view_requirements\n", " .str.replace(r'\\\\[rtn]', '')\n", " .str.replace(r'<[pbr]{1,2}>', ' ')\n", " .str.replace(r'<[\\/\"=\\w\\s]+>', '')\n", " )\n", "\n", "for i, row in view_requirements.iteritems():\n", " display(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now see more clearly the contents and structure of these rows. Some rows have both `Minimum` and `Recommended` requirements inside a `minimum` key, and some have separate `minimum` and `recommended` keys. Some have headings like `Processor:` and `Storage:` before various components, others simply have a list of components. Some state particular speeds for components, like 2 Ghz CPU, others state specific models, like 'Intel Core 2 Duo', amongst this information.\n", "\n", "It seems like it would be possible to extract invidivual component information from this data, however it would be a lengthy and complex process worthy of another project. With that in mind, it seems best to proceed by slightly cleaning the data before exporting it, but not trying to deal with individual components right now.\n", "\n", "We'll export the raw data, and split out the data in the `minimum` and `recommended` keys. This will mean lots of rows still have recommended data hidden inside the minimum key, but this will do for now." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Exported requirements data to '../data/exports/steam_requirements_data.csv'\n" ] } ], "source": [ "def process_requirements(df, export=False):\n", " if export:\n", " requirements = df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].copy()\n", " \n", " # remove rows with missing pc requirements\n", " requirements = requirements[requirements['pc_requirements'] != '[]']\n", " \n", " requirements['requirements_clean'] = (requirements['pc_requirements']\n", " .str.replace(r'\\\\[rtn]', '')\n", " .str.replace(r'<[pbr]{1,2}>', ' ')\n", " .str.replace(r'<[\\/\"=\\w\\s]+>', '')\n", " )\n", " \n", " requirements['requirements_clean'] = requirements['requirements_clean'].apply(lambda x: literal_eval(x))\n", " \n", " # split out minimum and recommended into separate columns\n", " requirements['minimum'] = requirements['requirements_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)\n", " requirements['recommended'] = requirements['requirements_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)\n", " \n", " requirements = requirements.drop('requirements_clean', axis=1)\n", " \n", " export_data(requirements, 'requirements_data')\n", " \n", " df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)\n", " \n", " return df\n", "\n", "reqs_df = process_requirements(info_df, export=True)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
                                                                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                                            steam_appidpc_requirementsmac_requirementslinux_requirementsminimumrecommended
                                                                                            010{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...500 mhz processor, 96mb ram, 16mb video card, ...NaN
                                                                                            120{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...500 mhz processor, 96mb ram, 16mb video card, ...NaN
                                                                                            230{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...500 mhz processor, 96mb ram, 16mb video card, ...NaN
                                                                                            340{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...500 mhz processor, 96mb ram, 16mb video card, ...NaN
                                                                                            450{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st...{'minimum': 'Minimum: OS X Snow Leopard 10.6....{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...500 mhz processor, 96mb ram, 16mb video card, ...NaN
                                                                                            \n", "
                                                                                            " ], "text/plain": [ " steam_appid pc_requirements \\\n", "0 10 {'minimum': '\\r\\n\\t\\t\\t

                                                                                            Minimum:Minimum:Minimum:Minimum:Minimum:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                                            namesteam_appidrequired_ageplatformscategoriesgenresachievementsrelease_datepriceenglishdeveloperpublisher
                                                                                            0Counter-Strike100windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action02000-11-017.191ValveValve
                                                                                            1Team Fortress Classic200windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action01999-04-013.991ValveValve
                                                                                            2Day of Defeat300windows;mac;linuxMulti-player;Valve Anti-Cheat enabledAction02003-05-013.991ValveValve
                                                                                            3Deathmatch Classic400windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action02001-06-013.991ValveValve
                                                                                            4Half-Life: Opposing Force500windows;mac;linuxSingle-player;Multi-player;Valve Anti-Cheat en...Action01999-11-013.991Gearbox SoftwareValve
                                                                                            \n", "" ], "text/plain": [ " name steam_appid required_age platforms \\\n", "0 Counter-Strike 10 0 windows;mac;linux \n", "1 Team Fortress Classic 20 0 windows;mac;linux \n", "2 Day of Defeat 30 0 windows;mac;linux \n", "3 Deathmatch Classic 40 0 windows;mac;linux \n", "4 Half-Life: Opposing Force 50 0 windows;mac;linux \n", "\n", " categories genres achievements \\\n", "0 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "1 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "2 Multi-player;Valve Anti-Cheat enabled Action 0 \n", "3 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "4 Single-player;Multi-player;Valve Anti-Cheat en... Action 0 \n", "\n", " release_date price english developer publisher \n", "0 2000-11-01 7.19 1 Valve Valve \n", "1 1999-04-01 3.99 1 Valve Valve \n", "2 2003-05-01 3.99 1 Valve Valve \n", "3 2001-06-01 3.99 1 Valve Valve \n", "4 1999-11-01 3.99 1 Gearbox Software Valve " ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process(df):\n", " \"\"\"Process data set. Will eventually contain calls to all functions we write.\"\"\"\n", " \n", " # Copy the input dataframe to avoid accidentally modifying original data\n", " df = df.copy()\n", " \n", " # Remove duplicate rows - all appids should be unique\n", " df = df.drop_duplicates()\n", " \n", " # Remove collumns with more than 50% null values\n", " df = drop_null_cols(df)\n", " \n", " # Process columns\n", " df = process_name_type(df)\n", " df = process_age(df)\n", " df = process_platforms(df)\n", " df = process_price(df)\n", " df = process_language(df)\n", " df = process_developers_and_publishers(df)\n", " df = process_categories_and_genres(df)\n", " df = process_achievements_and_descriptors(df) \n", " df = process_release_date(df)\n", " \n", " # Process columns which export data\n", " df = process_descriptions(df, export=True)\n", " df = process_media(df, export=True)\n", " df = process_info(df, export=True)\n", " df = process_requirements(df, export=True)\n", " \n", " return df\n", "\n", "steam_data = process(raw_steam_data)\n", "steam_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Steps\n", "\n", "That just about does is for cleaning the Steam data. Hopefully we have a cleaned dataset ready to be combined with the data downloaded from SteamSpy, once that is cleaned. \n", "\n", "Before we export the cleaned steam data, we'll check that we have eradicated missing values, and have a look at the memory footprint like we did earlier." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name 0\n", "steam_appid 0\n", "required_age 0\n", "platforms 0\n", "categories 0\n", "genres 0\n", "achievements 0\n", "release_date 0\n", "price 0\n", "english 0\n", "developer 0\n", "publisher 0\n", "dtype: int64" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "steam_data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 29235 entries, 0 to 29234\n", "Columns: 39 entries, type to content_descriptors\n", "dtypes: float64(2), int64(1), object(36)\n", "memory usage: 390.0 MB\n" ] } ], "source": [ "raw_steam_data.info(verbose=False, memory_usage=\"deep\")" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 27332 entries, 0 to 29234\n", "Columns: 12 entries, name to publisher\n", "dtypes: category(1), datetime64[ns](1), float64(1), int64(3), object(6)\n", "memory usage: 13.8 MB\n" ] } ], "source": [ "steam_data.info(verbose=False, memory_usage=\"deep\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we've taken care of all the null values, and the size of the dataset has been reduced considerably.\n", "\n", "We'll also check that no unreleased games have slipped through (data was scraped on or before 1st May, 2019)." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

                                                                                            \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                                                                                            namesteam_appidrequired_ageplatformscategoriesgenresachievementsrelease_datepriceenglishdeveloperpublisher
                                                                                            \n", "
                                                                                            " ], "text/plain": [ "Empty DataFrame\n", "Columns: [name, steam_appid, required_age, platforms, categories, genres, achievements, release_date, price, english, developer, publisher]\n", "Index: []" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "steam_data[steam_data['release_date'] > '2019-05-01']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we're happy with our dataframe we are ready to export to file and finish this part of the project." ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "scrolled": true }, "outputs": [], "source": [ "steam_data.to_csv('../data/exports/steam_data_clean.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I said at the beginning, data cleaning can be a very lengthy process. We definitely could have taken some shortcuts, and didn't have to go into so much depth, but I think it was interesting and valuable to go through the data as thoroughly as we did. A useful side effect of this is that we're now much more familiar with the data, and this should be incredibly useful when it comes to analysis. One of the most important prerequisites for data analysis is having good data, and the foundation of this is built upon solid data cleaning.\n", "\n", "Next time we'll take a look at optimising the processing of the `release_date` column, as mentioned, then move onto cleaning the SteamSpy data. Once that is complete we can begin exploring and analysing the data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }