\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" type | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" is_free | \n",
" controller_support | \n",
" dlc | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" fullgame | \n",
" supported_languages | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" legal_notice | \n",
" drm_notice | \n",
" ext_user_account_notice | \n",
" developers | \n",
" publishers | \n",
" demos | \n",
" price_overview | \n",
" packages | \n",
" package_groups | \n",
" platforms | \n",
" metacritic | \n",
" reviews | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" recommendations | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" content_descriptors | \n",
"
\n",
" \n",
" \n",
" \n",
" 31 | \n",
" game | \n",
" SiN Episodes: Emergence | \n",
" 1300 | \n",
" 0.0 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" You are John Blade, commander of HardCorps, an... | \n",
" You are John Blade, commander of HardCorps, an... | \n",
" You are John Blade, commander of HardCorps, an... | \n",
" NaN | \n",
" English, Russian, French | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/130... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" [] | \n",
" [] | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ['Ritual Entertainment'] | \n",
" ['Ritual Entertainment'] | \n",
" NaN | \n",
" {'currency': 'GBP', 'initial': 719, 'final': 7... | \n",
" [70] | \n",
" [{'name': 'default', 'title': 'Buy SiN Episode... | \n",
" {'windows': True, 'mac': False, 'linux': False} | \n",
" {'score': 75, 'url': 'https://www.metacritic.c... | \n",
" NaN | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 265} | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '10 May, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/130... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 32 | \n",
" game | \n",
" SiN Episodes: Emergence | \n",
" 1300 | \n",
" 0.0 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" You are John Blade, commander of HardCorps, an... | \n",
" You are John Blade, commander of HardCorps, an... | \n",
" You are John Blade, commander of HardCorps, an... | \n",
" NaN | \n",
" English, Russian, French | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/130... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" [] | \n",
" [] | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ['Ritual Entertainment'] | \n",
" ['Ritual Entertainment'] | \n",
" NaN | \n",
" {'currency': 'GBP', 'initial': 719, 'final': 7... | \n",
" [70] | \n",
" [{'name': 'default', 'title': 'Buy SiN Episode... | \n",
" {'windows': True, 'mac': False, 'linux': False} | \n",
" {'score': 75, 'url': 'https://www.metacritic.c... | \n",
" NaN | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 265} | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '10 May, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/130... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 356 | \n",
" game | \n",
" Jagged Alliance 2 Gold | \n",
" 1620 | \n",
" 0.0 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" <p>The small country of Arulco has been taken ... | \n",
" <p>The small country of Arulco has been taken ... | \n",
" The small country of Arulco has been taken ove... | \n",
" NaN | \n",
" English | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/162... | \n",
" http://www.jaggedalliance2.com/ | \n",
" {'minimum': '<p><strong>Minimum Configuration:... | \n",
" [] | \n",
" [] | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ['Strategy First'] | \n",
" ['Strategy First'] | \n",
" NaN | \n",
" {'currency': 'GBP', 'initial': 1499, 'final': ... | \n",
" [94] | \n",
" [{'name': 'default', 'title': 'Buy Jagged Alli... | \n",
" {'windows': True, 'mac': False, 'linux': False} | \n",
" NaN | \n",
" NaN | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '2', 'description': 'Strategy'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '6 Jul, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/162... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" type name steam_appid required_age is_free \\\n",
"31 game SiN Episodes: Emergence 1300 0.0 False \n",
"32 game SiN Episodes: Emergence 1300 0.0 False \n",
"356 game Jagged Alliance 2 Gold 1620 0.0 False \n",
"\n",
" controller_support dlc \\\n",
"31 NaN NaN \n",
"32 NaN NaN \n",
"356 NaN NaN \n",
"\n",
" detailed_description \\\n",
"31 You are John Blade, commander of HardCorps, an... \n",
"32 You are John Blade, commander of HardCorps, an... \n",
"356 The small country of Arulco has been taken ... \n",
"\n",
" about_the_game \\\n",
"31 You are John Blade, commander of HardCorps, an... \n",
"32 You are John Blade, commander of HardCorps, an... \n",
"356
The small country of Arulco has been taken ... \n",
"\n",
" short_description fullgame \\\n",
"31 You are John Blade, commander of HardCorps, an... NaN \n",
"32 You are John Blade, commander of HardCorps, an... NaN \n",
"356 The small country of Arulco has been taken ove... NaN \n",
"\n",
" supported_languages \\\n",
"31 English, Russian, French \n",
"32 English, Russian, French \n",
"356 English \n",
"\n",
" header_image \\\n",
"31 https://steamcdn-a.akamaihd.net/steam/apps/130... \n",
"32 https://steamcdn-a.akamaihd.net/steam/apps/130... \n",
"356 https://steamcdn-a.akamaihd.net/steam/apps/162... \n",
"\n",
" website \\\n",
"31 NaN \n",
"32 NaN \n",
"356 http://www.jaggedalliance2.com/ \n",
"\n",
" pc_requirements mac_requirements \\\n",
"31 {'minimum': '\\r\\n\\t\\t\\t
Minimum:Minimum:Minimum Configuration:... [] \n",
"\n",
" linux_requirements legal_notice drm_notice ext_user_account_notice \\\n",
"31 [] NaN NaN NaN \n",
"32 [] NaN NaN NaN \n",
"356 [] NaN NaN NaN \n",
"\n",
" developers publishers demos \\\n",
"31 ['Ritual Entertainment'] ['Ritual Entertainment'] NaN \n",
"32 ['Ritual Entertainment'] ['Ritual Entertainment'] NaN \n",
"356 ['Strategy First'] ['Strategy First'] NaN \n",
"\n",
" price_overview packages \\\n",
"31 {'currency': 'GBP', 'initial': 719, 'final': 7... [70] \n",
"32 {'currency': 'GBP', 'initial': 719, 'final': 7... [70] \n",
"356 {'currency': 'GBP', 'initial': 1499, 'final': ... [94] \n",
"\n",
" package_groups \\\n",
"31 [{'name': 'default', 'title': 'Buy SiN Episode... \n",
"32 [{'name': 'default', 'title': 'Buy SiN Episode... \n",
"356 [{'name': 'default', 'title': 'Buy Jagged Alli... \n",
"\n",
" platforms \\\n",
"31 {'windows': True, 'mac': False, 'linux': False} \n",
"32 {'windows': True, 'mac': False, 'linux': False} \n",
"356 {'windows': True, 'mac': False, 'linux': False} \n",
"\n",
" metacritic reviews \\\n",
"31 {'score': 75, 'url': 'https://www.metacritic.c... NaN \n",
"32 {'score': 75, 'url': 'https://www.metacritic.c... NaN \n",
"356 NaN NaN \n",
"\n",
" categories \\\n",
"31 [{'id': 2, 'description': 'Single-player'}, {'... \n",
"32 [{'id': 2, 'description': 'Single-player'}, {'... \n",
"356 [{'id': 2, 'description': 'Single-player'}] \n",
"\n",
" genres \\\n",
"31 [{'id': '1', 'description': 'Action'}] \n",
"32 [{'id': '1', 'description': 'Action'}] \n",
"356 [{'id': '2', 'description': 'Strategy'}] \n",
"\n",
" screenshots movies recommendations \\\n",
"31 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 265} \n",
"32 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN {'total': 265} \n",
"356 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN NaN \n",
"\n",
" achievements release_date \\\n",
"31 {'total': 0} {'coming_soon': False, 'date': '10 May, 2006'} \n",
"32 {'total': 0} {'coming_soon': False, 'date': '10 May, 2006'} \n",
"356 {'total': 0} {'coming_soon': False, 'date': '6 Jul, 2006'} \n",
"\n",
" support_info \\\n",
"31 {'url': '', 'email': ''} \n",
"32 {'url': '', 'email': ''} \n",
"356 {'url': '', 'email': ''} \n",
"\n",
" background \\\n",
"31 https://steamcdn-a.akamaihd.net/steam/apps/130... \n",
"32 https://steamcdn-a.akamaihd.net/steam/apps/130... \n",
"356 https://steamcdn-a.akamaihd.net/steam/apps/162... \n",
"\n",
" content_descriptors \n",
"31 {'ids': [], 'notes': None} \n",
"32 {'ids': [], 'notes': None} \n",
"356 {'ids': [], 'notes': None} "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]\n",
"\n",
"print('Duplicate rows to remove:', duplicate_rows.shape[0])\n",
"\n",
"duplicate_rows.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now ready to define functions implementing the filters we just looked at. This allows us to easily make changes in the future if we want to alter how the columns are handled, or want to choose a different cut-off threshold for getting rid of columns, for example. \n",
"\n",
"We also define a general purpose `process` function which will run all the processing functions we create on the data set. This will allow us to slowly add to it as we develop more functions and ensure we're cleaning the correct dataframe.\n",
"\n",
"Finally we run this function on the raw data, inspecting the first few rows and viewing how many rows and columns have been removed."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(29235, 39)\n",
"(29075, 28)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" is_free | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" supported_languages | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" developers | \n",
" publishers | \n",
" price_overview | \n",
" packages | \n",
" package_groups | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" content_descriptors | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 10 | \n",
" 0.0 | \n",
" False | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" English<strong>*</strong>, French<strong>*</st... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" {'currency': 'GBP', 'initial': 719, 'final': 7... | \n",
" [7] | \n",
" [{'name': 'default', 'title': 'Buy Counter-Str... | \n",
" {'windows': True, 'mac': True, 'linux': True} | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Nov, 2000'} | \n",
" {'url': 'http://steamcommunity.com/app/10', 'e... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" {'ids': [2, 5], 'notes': 'Includes intense vio... | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 20 | \n",
" 0.0 | \n",
" False | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" English, French, German, Italian, Spanish - Sp... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" {'currency': 'GBP', 'initial': 399, 'final': 3... | \n",
" [29] | \n",
" [{'name': 'default', 'title': 'Buy Team Fortre... | \n",
" {'windows': True, 'mac': True, 'linux': True} | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Apr, 1999'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" {'ids': [2, 5], 'notes': 'Includes intense vio... | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 30 | \n",
" 0.0 | \n",
" False | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" English, French, German, Italian, Spanish - Spain | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" http://www.dayofdefeat.com/ | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" {'currency': 'GBP', 'initial': 399, 'final': 3... | \n",
" [30] | \n",
" [{'name': 'default', 'title': 'Buy Day of Defe... | \n",
" {'windows': True, 'mac': True, 'linux': True} | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 May, 2003'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 40 | \n",
" 0.0 | \n",
" False | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" English, French, German, Italian, Spanish - Sp... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" {'currency': 'GBP', 'initial': 399, 'final': 3... | \n",
" [31] | \n",
" [{'name': 'default', 'title': 'Buy Deathmatch ... | \n",
" {'windows': True, 'mac': True, 'linux': True} | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Jun, 2001'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 50 | \n",
" 0.0 | \n",
" False | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" English, French, German, Korean | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Gearbox Software'] | \n",
" ['Valve'] | \n",
" {'currency': 'GBP', 'initial': 399, 'final': 3... | \n",
" [32] | \n",
" [{'name': 'default', 'title': 'Buy Half-Life: ... | \n",
" {'windows': True, 'mac': True, 'linux': True} | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Nov, 1999'} | \n",
" {'url': 'https://help.steampowered.com', 'emai... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age is_free \\\n",
"0 Counter-Strike 10 0.0 False \n",
"1 Team Fortress Classic 20 0.0 False \n",
"2 Day of Defeat 30 0.0 False \n",
"3 Deathmatch Classic 40 0.0 False \n",
"4 Half-Life: Opposing Force 50 0.0 False \n",
"\n",
" detailed_description \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" about_the_game \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" short_description \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" supported_languages \\\n",
"0 English*, French*Minimum:Minimum:Minimum:Minimum:Minimum:\n"
]
},
{
"data": {
"text/plain": [
"\"{'windows': True, 'mac': True, 'linux': True}\""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"platforms_first_row = age_df['platforms'].iloc[0]\n",
"\n",
"print(type(platforms_first_row))\n",
"\n",
"platforms_first_row"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will allow us to evaluate the string, and then index into it as a \n",
"dictionary."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"{'windows': True, 'mac': True, 'linux': True}\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eval_first_row = literal_eval(platforms_first_row)\n",
"\n",
"print(type(eval_first_row))\n",
"print(eval_first_row)\n",
"\n",
"eval_first_row['windows']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also need to check for missing values, but fortunately it appears there aren't any in this column."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df['platforms'].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As for formatting the output, let's keep things simple and return a list of supported platforms. Another option would be to create a column for each platform with a corresponding True/False value for each row, something we can keep in mind for the future.\n",
"\n",
"We can create the desired list by calling the [str.join()](https://docs.python.org/3/library/stdtypes.html#str.join) method on a string, and passing an iterable into the function. In this case, we can pass the keys of the row, as seen below."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'windows;mac;linux'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create string of keys, joined on a semi-colon\n",
"';'.join(eval_first_row.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also need to inspect the value of each row, and only end up with the keys that have a value of `True`, skipping those which are `False`. The example below shows how we can do this using a list comprehension."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['windows', 'mac']\n"
]
},
{
"data": {
"text/plain": [
"'windows;mac'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"platforms = {'windows': True, 'mac': True, 'linux': False}\n",
"\n",
"# list comprehension\n",
"print([x for x in platforms.keys() if platforms[x]])\n",
"\n",
"# using list comprehension in join\n",
"';'.join(x for x in platforms.keys() if platforms[x])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting this all together, we can use the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to process the rows."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"windows 19887\n",
"windows;mac;linux 4851\n",
"windows;mac 3687\n",
"windows;linux 643\n",
"mac 5\n",
"mac;linux 1\n",
"linux 1\n",
"Name: platforms, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_platforms(df):\n",
" \"\"\"Split platforms column into separate boolean columns for each platform.\"\"\"\n",
" # evaluate values in platforms column, so can index into dictionaries\n",
" df = df.copy()\n",
" \n",
" def parse_platforms(x):\n",
" \n",
" d = literal_eval(x)\n",
" \n",
" return ';'.join(platform for platform in d.keys() if d[platform])\n",
" \n",
" df['platforms'] = df['platforms'].apply(parse_platforms)\n",
" \n",
" return df\n",
"\n",
"\n",
"platforms_df = process_platforms(age_df)\n",
"platforms_df['platforms'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Price\n",
"\n",
"Now we have built up some intuition around how to deal with data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.\n",
"\n",
"First let's check how many null values there are in `price_overview`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3559"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"platforms_df['price_overview'].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions let's check if there any rows with `is_free` marked as True and null values in the `price_overview` column."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2713"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"free_and_null_price = platforms_df[(platforms_df['is_free']) & (platforms_df['price_overview'].isnull())]\n",
"\n",
"free_and_null_price.shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It turns out this accounts for most of the missing values in the `price_overview` column, meaning we can handle these by setting the final price as 0. This makes intuitive sense - free games wouldn't have a price.\n",
"\n",
"This means that there are almost 850 rows which aren't free but have null values in the `price_overview` column. Let's investigate those next."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" is_free | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" supported_languages | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" developers | \n",
" publishers | \n",
" price_overview | \n",
" packages | \n",
" package_groups | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" content_descriptors | \n",
"
\n",
" \n",
" \n",
" \n",
" 63 | \n",
" The Ship: Single Player | \n",
" 2420 | \n",
" 0 | \n",
" False | \n",
" For PC gamers who enjoy multiplayer games with... | \n",
" For PC gamers who enjoy multiplayer games with... | \n",
" The Ship is a murder mystery alternative to tr... | \n",
" English, French, German, Italian, Spanish - Sp... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/242... | \n",
" http://www.blazinggriffin.com/games/the-ship-m... | \n",
" {'minimum': '<strong>Minimum:</strong> 1.8 GHz... | \n",
" [] | \n",
" [] | \n",
" ['Outerlight Ltd.'] | \n",
" ['Blazing Griffin Ltd.'] | \n",
" NaN | \n",
" [56669] | \n",
" [{'name': 'default', 'title': 'Buy The Ship: S... | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 2035597, 'name': 'the Ship: Intro', '... | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '20 Nov, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/242... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 75 | \n",
" RollerCoaster Tycoon® 3: Platinum | \n",
" 2700 | \n",
" 0 | \n",
" False | \n",
" Rollercoaster Tycoon 3 Platinum combines the e... | \n",
" Rollercoaster Tycoon 3 Platinum combines the e... | \n",
" Rollercoaster Tycoon 3 Platinum combines the e... | \n",
" English, French, Italian, German, Spanish - Sp... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/270... | \n",
" http://www.atari.com/rollercoastertycoon/us/in... | \n",
" {'minimum': '<strong>Minimum: </strong><br>\\t\\... | \n",
" {'minimum': '<ul class=\"bb_ul\"><li><strong>OS:... | \n",
" [] | \n",
" ['Frontier', 'Aspyr (Mac)'] | \n",
" ['Atari', 'Aspyr (Mac)'] | \n",
" NaN | \n",
" NaN | \n",
" [] | \n",
" windows;mac | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '28', 'description': 'Simulation'}, {'... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '12 Mar, 2008'} | \n",
" {'url': 'http://www.atari.com/support/atari', ... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/270... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 220 | \n",
" BioShock™ | \n",
" 7670 | \n",
" 0 | \n",
" False | \n",
" <h1>Special Offer</h1><p>Buying BioShock™ also... | \n",
" BioShock is a shooter unlike any you've ever p... | \n",
" BioShock is a shooter unlike any you've ever p... | \n",
" English<strong>*</strong>, French<strong>*</st... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/767... | \n",
" http://www.BioShockGame.com | \n",
" {'minimum': '<h2 class=\"bb_tag\"><strong>Minimu... | \n",
" {'minimum': 'Please See BioShock Remastered'} | \n",
" [] | \n",
" ['2K Boston', '2K Australia'] | \n",
" ['2K'] | \n",
" NaN | \n",
" [451, 127633] | \n",
" [{'name': 'default', 'title': 'Buy BioShock™',... | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '21 Aug, 2007'} | \n",
" {'url': 'support.2k.com', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/767... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 234 | \n",
" Sam & Max 101: Culture Shock | \n",
" 8200 | \n",
" 0 | \n",
" False | \n",
" <strong>Sam & Max: Episode 1 - Culture Sho... | \n",
" <strong>Sam & Max: Episode 1 - Culture Sho... | \n",
" Sam & Max: Episode 1 - Culture Shock - The... | \n",
" English, French, German, Italian | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/820... | \n",
" http://store.steampowered.com/app/901660/ | \n",
" {'minimum': 'Windows XP or Vista, 1.5GHz proce... | \n",
" [] | \n",
" [] | \n",
" ['Telltale Games'] | \n",
" ['Telltale Games'] | \n",
" NaN | \n",
" [357, 539] | \n",
" [{'name': 'default', 'title': 'Buy Sam & Max 1... | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '15 Jun, 2007'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/820... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 235 | \n",
" Sam & Max 102: Situation: Comedy | \n",
" 8210 | \n",
" 0 | \n",
" False | \n",
" <strong>Sam & Max: Episode 2 - Situation: ... | \n",
" <strong>Sam & Max: Episode 2 - Situation: ... | \n",
" Sam & Max: Episode 2 - Situation: Comedy -... | \n",
" English, German, Italian | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/821... | \n",
" http://store.steampowered.com/app/901660/ | \n",
" {'minimum': 'Windows XP or Vista, 1.5GHz proce... | \n",
" [] | \n",
" [] | \n",
" ['Telltale Games'] | \n",
" ['Telltale Games'] | \n",
" NaN | \n",
" [358, 539] | \n",
" [{'name': 'default', 'title': 'Buy Sam & Max 1... | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '15 Jun, 2007'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/821... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age is_free \\\n",
"63 The Ship: Single Player 2420 0 False \n",
"75 RollerCoaster Tycoon® 3: Platinum 2700 0 False \n",
"220 BioShock™ 7670 0 False \n",
"234 Sam & Max 101: Culture Shock 8200 0 False \n",
"235 Sam & Max 102: Situation: Comedy 8210 0 False \n",
"\n",
" detailed_description \\\n",
"63 For PC gamers who enjoy multiplayer games with... \n",
"75 Rollercoaster Tycoon 3 Platinum combines the e... \n",
"220 Special Offer
Buying BioShock™ also... \n",
"234 Sam & Max: Episode 1 - Culture Sho... \n",
"235 Sam & Max: Episode 2 - Situation: ... \n",
"\n",
" about_the_game \\\n",
"63 For PC gamers who enjoy multiplayer games with... \n",
"75 Rollercoaster Tycoon 3 Platinum combines the e... \n",
"220 BioShock is a shooter unlike any you've ever p... \n",
"234 Sam & Max: Episode 1 - Culture Sho... \n",
"235 Sam & Max: Episode 2 - Situation: ... \n",
"\n",
" short_description \\\n",
"63 The Ship is a murder mystery alternative to tr... \n",
"75 Rollercoaster Tycoon 3 Platinum combines the e... \n",
"220 BioShock is a shooter unlike any you've ever p... \n",
"234 Sam & Max: Episode 1 - Culture Shock - The... \n",
"235 Sam & Max: Episode 2 - Situation: Comedy -... \n",
"\n",
" supported_languages \\\n",
"63 English, French, German, Italian, Spanish - Sp... \n",
"75 English, French, Italian, German, Spanish - Sp... \n",
"220 English*, French*Minimum: 1.8 GHz... \n",
"75 {'minimum': 'Minimum:
\\t\\... \n",
"220 {'minimum': 'Minimu... \n",
"234 {'minimum': 'Windows XP or Vista, 1.5GHz proce... \n",
"235 {'minimum': 'Windows XP or Vista, 1.5GHz proce... \n",
"\n",
" mac_requirements linux_requirements \\\n",
"63 [] [] \n",
"75 {'minimum': '- OS:... [] \n",
"220 {'minimum': 'Please See BioShock Remastered'} [] \n",
"234 [] [] \n",
"235 [] [] \n",
"\n",
" developers publishers price_overview \\\n",
"63 ['Outerlight Ltd.'] ['Blazing Griffin Ltd.'] NaN \n",
"75 ['Frontier', 'Aspyr (Mac)'] ['Atari', 'Aspyr (Mac)'] NaN \n",
"220 ['2K Boston', '2K Australia'] ['2K'] NaN \n",
"234 ['Telltale Games'] ['Telltale Games'] NaN \n",
"235 ['Telltale Games'] ['Telltale Games'] NaN \n",
"\n",
" packages package_groups \\\n",
"63 [56669] [{'name': 'default', 'title': 'Buy The Ship: S... \n",
"75 NaN [] \n",
"220 [451, 127633] [{'name': 'default', 'title': 'Buy BioShock™',... \n",
"234 [357, 539] [{'name': 'default', 'title': 'Buy Sam & Max 1... \n",
"235 [358, 539] [{'name': 'default', 'title': 'Buy Sam & Max 1... \n",
"\n",
" platforms categories \\\n",
"63 windows [{'id': 2, 'description': 'Single-player'}] \n",
"75 windows;mac [{'id': 2, 'description': 'Single-player'}] \n",
"220 windows [{'id': 2, 'description': 'Single-player'}, {'... \n",
"234 windows [{'id': 2, 'description': 'Single-player'}] \n",
"235 windows [{'id': 2, 'description': 'Single-player'}] \n",
"\n",
" genres \\\n",
"63 [{'id': '1', 'description': 'Action'}, {'id': ... \n",
"75 [{'id': '28', 'description': 'Simulation'}, {'... \n",
"220 [{'id': '1', 'description': 'Action'}, {'id': ... \n",
"234 [{'id': '1', 'description': 'Action'}, {'id': ... \n",
"235 [{'id': '1', 'description': 'Action'}, {'id': ... \n",
"\n",
" screenshots \\\n",
"63 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"75 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"220 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"234 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"235 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"\n",
" movies achievements \\\n",
"63 [{'id': 2035597, 'name': 'the Ship: Intro', '... {'total': 0} \n",
"75 NaN {'total': 0} \n",
"220 NaN {'total': 0} \n",
"234 NaN {'total': 0} \n",
"235 NaN {'total': 0} \n",
"\n",
" release_date \\\n",
"63 {'coming_soon': False, 'date': '20 Nov, 2006'} \n",
"75 {'coming_soon': False, 'date': '12 Mar, 2008'} \n",
"220 {'coming_soon': False, 'date': '21 Aug, 2007'} \n",
"234 {'coming_soon': False, 'date': '15 Jun, 2007'} \n",
"235 {'coming_soon': False, 'date': '15 Jun, 2007'} \n",
"\n",
" support_info \\\n",
"63 {'url': '', 'email': ''} \n",
"75 {'url': 'http://www.atari.com/support/atari', ... \n",
"220 {'url': 'support.2k.com', 'email': ''} \n",
"234 {'url': '', 'email': ''} \n",
"235 {'url': '', 'email': ''} \n",
"\n",
" background \\\n",
"63 https://steamcdn-a.akamaihd.net/steam/apps/242... \n",
"75 https://steamcdn-a.akamaihd.net/steam/apps/270... \n",
"220 https://steamcdn-a.akamaihd.net/steam/apps/767... \n",
"234 https://steamcdn-a.akamaihd.net/steam/apps/820... \n",
"235 https://steamcdn-a.akamaihd.net/steam/apps/821... \n",
"\n",
" content_descriptors \n",
"63 {'ids': [], 'notes': None} \n",
"75 {'ids': [], 'notes': None} \n",
"220 {'ids': [], 'notes': None} \n",
"234 {'ids': [], 'notes': None} \n",
"235 {'ids': [], 'notes': None} "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"not_free_and_null_price = platforms_df[(platforms_df['is_free'] == False) & (platforms_df['price_overview'].isnull())]\n",
"\n",
"not_free_and_null_price.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first few rows contain some big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.\n",
"\n",
"We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:\n",
"\n",
" https://store.steampowered.com/app/[steam_appid]\n",
"\n",
"This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Ship: Single Player: https://store.steampowered.com/app/2420\n",
"RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700\n",
"BioShock™: https://store.steampowered.com/app/7670\n",
"Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200\n",
"Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210\n"
]
}
],
"source": [
"def print_steam_links(df):\n",
" \"\"\"Print links to store page for apps in a dataframe.\"\"\"\n",
" url_base = \"https://store.steampowered.com/app/\"\n",
" \n",
" for i, row in df.iterrows():\n",
" appid = row['steam_appid']\n",
" name = row['name']\n",
" \n",
" print(name + ':', url_base + str(appid))\n",
" \n",
"\n",
"print_steam_links(not_free_and_null_price[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For these games we can conclude that:\n",
"\n",
"- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party\n",
"- RollerCoaster Tycoon 3: Platinum has been removed from steam (and another game website: [GOG](https://www.gog.com/)) \n",
" - \"A spokesperson for GOG told Eurogamer it pulled the game \"due to expiring licensing rights\", and stressed it'll talk with \"new distribution rights holders\" to bring the game back as soon as possible.\" Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)\n",
"- BioShock has been replaced by BioShock Remastered\n",
"- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column\n",
"\n",
"So we have a couple of options here. We could just drop these rows, we could try to figure out the price based on the `package_groups` column, or we could leave them for now and return to them later. We'll leave them for now, handling the two price columns, then take a look at the packages next. It may also be that some of these rows are removed later in the cleaning process for other reasons.\n",
"\n",
"If we want to find rows similar to these and deal with each case individually, we could use the `.str.contains()` method, as seen below."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" is_free | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" supported_languages | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" developers | \n",
" publishers | \n",
" price_overview | \n",
" packages | \n",
" package_groups | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" content_descriptors | \n",
"
\n",
" \n",
" \n",
" \n",
" 220 | \n",
" BioShock™ | \n",
" 7670 | \n",
" 0 | \n",
" False | \n",
" <h1>Special Offer</h1><p>Buying BioShock™ also... | \n",
" BioShock is a shooter unlike any you've ever p... | \n",
" BioShock is a shooter unlike any you've ever p... | \n",
" English<strong>*</strong>, French<strong>*</st... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/767... | \n",
" http://www.BioShockGame.com | \n",
" {'minimum': '<h2 class=\"bb_tag\"><strong>Minimu... | \n",
" {'minimum': 'Please See BioShock Remastered'} | \n",
" [] | \n",
" ['2K Boston', '2K Australia'] | \n",
" ['2K'] | \n",
" NaN | \n",
" [451, 127633] | \n",
" [{'name': 'default', 'title': 'Buy BioShock™',... | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '21 Aug, 2007'} | \n",
" {'url': 'support.2k.com', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/767... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 7734 | \n",
" BioShock™ Remastered | \n",
" 409710 | \n",
" 18 | \n",
" False | \n",
" <h1>Special Offer</h1><p>Buying BioShock™ Rema... | \n",
" BioShock is a shooter unlike any you've ever p... | \n",
" BioShock is a shooter unlike any other, loaded... | \n",
" English<strong>*</strong>, French<strong>*</st... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/409... | \n",
" http://www.BioShockGame.com | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" ['2K Boston', '2K Australia', 'Blind Squirrel'... | \n",
" ['2K', 'Feral Interactive (Mac)'] | \n",
" {'currency': 'GBP', 'initial': 999, 'final': 9... | \n",
" [451, 127633] | \n",
" [{'name': 'default', 'title': 'Buy BioShock™ R... | \n",
" windows;mac | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 65, 'highlighted': [{'name': 'Comple... | \n",
" {'coming_soon': False, 'date': '15 Sep, 2016'} | \n",
" {'url': 'support.2k.com', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/409... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 7735 | \n",
" BioShock™ 2 Remastered | \n",
" 409720 | \n",
" 18 | \n",
" False | \n",
" <h1>Special Offer</h1><p>Buying BioShock 2™ Re... | \n",
" BioShock 2 provides players with the perfect b... | \n",
" In BioShock 2, you step into the boots of the ... | \n",
" English<strong>*</strong>, French<strong>*</st... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/409... | \n",
" http://www.bioshockgame.com/ | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" ['2K Marin', '2K China', 'Digital Extremes', '... | \n",
" ['2K'] | \n",
" {'currency': 'GBP', 'initial': 1399, 'final': ... | \n",
" [81419, 127633] | \n",
" [{'name': 'default', 'title': 'Buy BioShock™ 2... | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 53, 'highlighted': [{'name': \"Daddy'... | \n",
" {'coming_soon': False, 'date': '15 Sep, 2016'} | \n",
" {'url': 'support.2k.com', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/409... | \n",
" {'ids': [5], 'notes': None} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age is_free \\\n",
"220 BioShock™ 7670 0 False \n",
"7734 BioShock™ Remastered 409710 18 False \n",
"7735 BioShock™ 2 Remastered 409720 18 False \n",
"\n",
" detailed_description \\\n",
"220 Special Offer
Buying BioShock™ also... \n",
"7734
Special Offer
Buying BioShock™ Rema... \n",
"7735
Special Offer
Buying BioShock 2™ Re... \n",
"\n",
" about_the_game \\\n",
"220 BioShock is a shooter unlike any you've ever p... \n",
"7734 BioShock is a shooter unlike any you've ever p... \n",
"7735 BioShock 2 provides players with the perfect b... \n",
"\n",
" short_description \\\n",
"220 BioShock is a shooter unlike any you've ever p... \n",
"7734 BioShock is a shooter unlike any other, loaded... \n",
"7735 In BioShock 2, you step into the boots of the ... \n",
"\n",
" supported_languages \\\n",
"220 English*, French**, French**, French*Minimu... \n",
"7734 {'minimum': 'Minimum:
\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" name | \n",
" currency | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" GBP | \n",
" 719 | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" GBP | \n",
" 399 | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" GBP | \n",
" 399 | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" GBP | \n",
" 399 | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" GBP | \n",
" 399 | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" name currency price\n",
"0 Counter-Strike GBP 719\n",
"1 Team Fortress Classic GBP 399\n",
"2 Day of Defeat GBP 399\n",
"3 Deathmatch Classic GBP 399\n",
"4 Half-Life: Opposing Force GBP 399"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_price(df):\n",
" df = df.copy()\n",
" \n",
" def parse_price(x):\n",
" if x is not np.nan:\n",
" return literal_eval(x)\n",
" else:\n",
" return {'currency': 'GBP', 'initial': -1}\n",
" \n",
" # evaluate as dictionary and set to -1 if missing\n",
" df['price_overview'] = df['price_overview'].apply(parse_price)\n",
" \n",
" # Create columns from currency and initial values\n",
" df['currency'] = df['price_overview'].apply(lambda x: x['currency'])\n",
" df['price'] = df['price_overview'].apply(lambda x: x['initial'])\n",
" \n",
" # Set price of free games to 0\n",
" df.loc[df['is_free'], 'price'] = 0\n",
" \n",
" return df\n",
"\n",
"price_data = process_price(platforms_df)[['name', 'currency', 'price']]\n",
"price_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're almost finished, but let's check if any games don't have GBP listed as the currency."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" currency | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 991 | \n",
" Robin Hood: The Legend of Sherwood | \n",
" USD | \n",
" 799 | \n",
"
\n",
" \n",
" 5767 | \n",
" Assassin’s Creed® Chronicles: India | \n",
" EUR | \n",
" 999 | \n",
"
\n",
" \n",
" 27593 | \n",
" Mortal Kombat 11 | \n",
" USD | \n",
" 5999 | \n",
"
\n",
" \n",
" 27995 | \n",
" Pagan Online | \n",
" EUR | \n",
" 2699 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name currency price\n",
"991 Robin Hood: The Legend of Sherwood USD 799\n",
"5767 Assassin’s Creed® Chronicles: India EUR 999\n",
"27593 Mortal Kombat 11 USD 5999\n",
"27995 Pagan Online EUR 2699"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"price_data[price_data['currency'] != 'GBP']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For some reason there are four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows it's easier and safer to simply drop them.\n",
"\n",
"We can also divide the prices by 100 so they are displayed as floats in pounds."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 7.19 | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 3.99 | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 3.99 | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 3.99 | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 3.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name price\n",
"0 Counter-Strike 7.19\n",
"1 Team Fortress Classic 3.99\n",
"2 Day of Defeat 3.99\n",
"3 Deathmatch Classic 3.99\n",
"4 Half-Life: Opposing Force 3.99"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_price(df):\n",
" \"\"\"Process price_overview column into formatted price column.\"\"\"\n",
" df = df.copy()\n",
" \n",
" def parse_price(x):\n",
" if x is not np.nan:\n",
" return literal_eval(x)\n",
" else:\n",
" return {'currency': 'GBP', 'initial': -1}\n",
" \n",
" # evaluate as dictionary and set to -1 if missing\n",
" df['price_overview'] = df['price_overview'].apply(parse_price)\n",
" \n",
" # create columns from currency and initial values\n",
" df['currency'] = df['price_overview'].apply(lambda x: x['currency'])\n",
" df['price'] = df['price_overview'].apply(lambda x: x['initial'])\n",
" \n",
" # set price of free games to 0\n",
" df.loc[df['is_free'], 'price'] = 0\n",
" \n",
" # remove non-GBP rows\n",
" df = df[df['currency'] == 'GBP']\n",
" \n",
" # change price to display in pounds (only applying to rows with a value greater than 0)\n",
" df.loc[df['price'] > 0, 'price'] /= 100\n",
" \n",
" # remove columns no longer needed\n",
" df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)\n",
" \n",
" return df\n",
"\n",
"\n",
"price_df = process_price(platforms_df)\n",
"price_df[['name', 'price']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Packages\n",
"\n",
"We can now take a look at the `packages` and `package_groups` columns to help decide what to do with rows that are missing price data. We're not incredibly interested in the columns themselves, as they don't appear to provide much new useful information, except which games come with others as part of a bundle."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" packages | \n",
" package_groups | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" [7] | \n",
" [{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}] | \n",
" 7.19 | \n",
"
\n",
" \n",
" 1 | \n",
" 20 | \n",
" [29] | \n",
" [{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] | \n",
" 3.99 | \n",
"
\n",
" \n",
" 2 | \n",
" 30 | \n",
" [30] | \n",
" [{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] | \n",
" 3.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid packages \\\n",
"0 10 [7] \n",
"1 20 [29] \n",
"2 30 [30] \n",
"\n",
" package_groups \\\n",
"0 [{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}] \n",
"1 [{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] \n",
"2 [{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}] \n",
"\n",
" price \n",
"0 7.19 \n",
"1 3.99 \n",
"2 3.99 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# temporarily set a pandas option using with and option_context\n",
"with pd.option_context(\"display.max_colwidth\", 500):\n",
" display(price_df[['steam_appid', 'packages', 'package_groups', 'price']].head(3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overall we have 846 rows with missing price data, which we previously set to -1."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"846\n"
]
}
],
"source": [
"print(price_df[price_df['price'] == -1].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can split these rows into two categories: those with `package_groups` data and those without.\n",
"\n",
"If we take a quick look at the `package_groups` column we see that there appear to be no null values. On closer inspection, we can find that rows without data are actually stored as empty lists."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null counts: 0\n",
"Empty list counts: 3353\n"
]
}
],
"source": [
"print('Null counts:', price_df['package_groups'].isnull().sum())\n",
"print('Empty list counts:', price_df[price_df['package_groups'] == \"[]\"].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a combination of filters, we can find out how many rows have both missing `price` and `package_group` data and investigate. We'll count the rows and print links to some of the store pages and look for patterns."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 799 \n",
"\n",
"First few rows:\n",
"\n",
"RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700\n",
"Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520\n",
"LUMINES™ Advance Pack: https://store.steampowered.com/app/11920\n",
"Midnight Club 2: https://store.steampowered.com/app/12160\n",
"Age of Booty™: https://store.steampowered.com/app/21600\n",
"\n",
"Last few rows:\n",
"\n",
"RoboVirus: https://store.steampowered.com/app/1001870\n",
"soko loco deluxe: https://store.steampowered.com/app/1003730\n",
"POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710\n",
"The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600\n",
"Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670\n"
]
}
],
"source": [
"missing_price_and_package = price_df[(price_df['price'] == -1) & (price_df['package_groups'] == \"[]\")]\n",
"\n",
"print('Number of rows:', missing_price_and_package.shape[0], '\\n')\n",
"\n",
"print('First few rows:\\n')\n",
"print_steam_links(missing_price_and_package[:5])\n",
"\n",
"print('\\nLast few rows:\\n')\n",
"print_steam_links(missing_price_and_package[-10:-5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the games - 799 of 846 - with missing price data fall into the above category. This probably means they can be safely removed.\n",
"\n",
"From following the links for the first few rows to the store page, it looks like they are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't yet been released and haven't had a price set. We'll take care of all the unreleased games when we clean the `release_date` column, but we can remove all of these apps here.\n",
"\n",
"Let's now take a look at the rows that have missing price data but do have `package_groups` data. We may be interested in keeping these rows and extracting price data from the package data."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 47 \n",
"\n",
"First few rows:\n",
"\n",
"The Ship: Single Player: https://store.steampowered.com/app/2420\n",
"BioShock™: https://store.steampowered.com/app/7670\n",
"Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200\n",
"Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210\n",
"Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220\n",
"\n",
"Last few rows:\n",
"\n",
"Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520\n",
"Space Hulk: Deathwing: https://store.steampowered.com/app/298900\n",
"7,62 Hard Life: https://store.steampowered.com/app/306290\n",
"Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730\n",
"Rad Rodgers: World One: https://store.steampowered.com/app/353580\n"
]
}
],
"source": [
"missing_price_have_package = price_df.loc[(price_df['price'] == -1) & (price_df['package_groups'] != \"[]\"), ['name', 'steam_appid', 'package_groups', 'price']]\n",
"\n",
"print('Number of rows:', missing_price_have_package.shape[0], '\\n')\n",
"\n",
"print('First few rows:\\n')\n",
"print_steam_links(missing_price_have_package[:5])\n",
"\n",
"print('\\nLast few rows:\\n')\n",
"print_steam_links(missing_price_have_package[-10:-5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at a selection of these rows, the games appear to be: supersceded by a newer release or remaster, part of a bigger bundle of games or episodic, or included by purchasing another game. \n",
"\n",
"Whilst we could extract prices from the `package_groups` data, the most sensible option seems to be removing these rows. There are only 47 rows this applies to, and any with a newer release will still have the re-release in the data.\n",
"\n",
"Since our logic interacts heavily with the price data we will update the `process_price` function rather than creating a new one."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" supported_languages | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" developers | \n",
" publishers | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" content_descriptors | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 10 | \n",
" 0 | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" English<strong>*</strong>, French<strong>*</st... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" windows;mac;linux | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Nov, 2000'} | \n",
" {'url': 'http://steamcommunity.com/app/10', 'e... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" {'ids': [2, 5], 'notes': 'Includes intense vio... | \n",
" 7.19 | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 20 | \n",
" 0 | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" English, French, German, Italian, Spanish - Sp... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" windows;mac;linux | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Apr, 1999'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" {'ids': [2, 5], 'notes': 'Includes intense vio... | \n",
" 3.99 | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 30 | \n",
" 0 | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" English, French, German, Italian, Spanish - Spain | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" http://www.dayofdefeat.com/ | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" windows;mac;linux | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 May, 2003'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" {'ids': [], 'notes': None} | \n",
" 3.99 | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 40 | \n",
" 0 | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" English, French, German, Italian, Spanish - Sp... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
" windows;mac;linux | \n",
" [{'id': 1, 'description': 'Multi-player'}, {'i... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Jun, 2001'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" {'ids': [], 'notes': None} | \n",
" 3.99 | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 50 | \n",
" 0 | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" English, French, German, Korean | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" ['Gearbox Software'] | \n",
" ['Valve'] | \n",
" windows;mac;linux | \n",
" [{'id': 2, 'description': 'Single-player'}, {'... | \n",
" [{'id': '1', 'description': 'Action'}] | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '1 Nov, 1999'} | \n",
" {'url': 'https://help.steampowered.com', 'emai... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" {'ids': [], 'notes': None} | \n",
" 3.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age \\\n",
"0 Counter-Strike 10 0 \n",
"1 Team Fortress Classic 20 0 \n",
"2 Day of Defeat 30 0 \n",
"3 Deathmatch Classic 40 0 \n",
"4 Half-Life: Opposing Force 50 0 \n",
"\n",
" detailed_description \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" about_the_game \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" short_description \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" supported_languages \\\n",
"0 English*, French*Minimum:Minimum:Minimum:Minimum:Minimum:\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" supported_languages | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" developers | \n",
" publishers | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" content_descriptors | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 4866 | \n",
" Subsiege | \n",
" 338640 | \n",
" 0 | \n",
" <img src=\"https://steamcdn-a.akamaihd.net/stea... | \n",
" <img src=\"https://steamcdn-a.akamaihd.net/stea... | \n",
" Subsiege is an intense real-time tactic game w... | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/338... | \n",
" http://subsiege-game.com/ | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" ['Icebird Studios'] | \n",
" ['Icebird Studios'] | \n",
" windows | \n",
" NaN | \n",
" NaN | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256729398, 'name': 'Release Trailer', ... | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '7 Sep, 2018'} | \n",
" {'url': 'http://subsiege-game.com/', 'email': ... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/338... | \n",
" {'ids': [], 'notes': None} | \n",
" 17.89 | \n",
"
\n",
" \n",
" 14560 | \n",
" MARS VR(全球使命VR) | \n",
" 596560 | \n",
" 0 | \n",
" 1.\\t4K level audio-visual experience <br />\\r\\... | \n",
" 1.\\t4K level audio-visual experience <br />\\r\\... | \n",
" Welcome to 《Mars VR》. This is an immersive fir... | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/596... | \n",
" http://qqsm.zygames.com/ | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" ['Ying Pei Digital Technology Shanghai Co., Li... | \n",
" ['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD'] | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '73', 'description': 'Violent'}, {'id'... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256681371, 'name': 'marsvr', 'thumbnai... | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '5 Apr, 2017'} | \n",
" {'url': 'http://www.zygames.com/contact', 'ema... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/596... | \n",
" {'ids': [], 'notes': None} | \n",
" 1.99 | \n",
"
\n",
" \n",
" 16386 | \n",
" Numberline 2 | \n",
" 654970 | \n",
" 0 | \n",
" NumberLine 2 is the continuation of the popula... | \n",
" NumberLine 2 is the continuation of the popula... | \n",
" NumberLine 2 is the continuation of the popula... | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/654... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" ['V34D4R', 'Egor Magurin'] | \n",
" ['Indovers Studio'] | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '4', 'description': 'Casual'}, {'id': ... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256687192, 'name': 'Numberline 2 Trail... | \n",
" {'total': 60, 'highlighted': [{'name': '1st le... | \n",
" {'coming_soon': False, 'date': '14 Jul, 2017'} | \n",
" {'url': '', 'email': 'radaew.zhenya@yandex.ru'} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/654... | \n",
" {'ids': [], 'notes': None} | \n",
" 1.59 | \n",
"
\n",
" \n",
" 26855 | \n",
" SNUSE 221 | \n",
" 948070 | \n",
" 0 | \n",
" <strong> Hey. My name is *&#!$.<br>Today I... | \n",
" <strong> Hey. My name is *&#!$.<br>Today I... | \n",
" Hey. My name is *&#!$. Today I will tell y... | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/948... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" ['SNUSE GM'] | \n",
" ['SNUSE GM'] | \n",
" windows | \n",
" [{'id': 2, 'description': 'Single-player'}] | \n",
" [{'id': '25', 'description': 'Adventure'}, {'i... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256745662, 'name': 'snuse', 'thumbnail... | \n",
" {'total': 0} | \n",
" {'coming_soon': False, 'date': '2 Apr, 2019'} | \n",
" {'url': 'vk.com/nilow_i', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/948... | \n",
" {'ids': [], 'notes': None} | \n",
" 0.79 | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" name steam_appid required_age \\\n",
"4866 Subsiege 338640 0 \n",
"14560 MARS VR(全球使命VR) 596560 0 \n",
"16386 Numberline 2 654970 0 \n",
"26855 SNUSE 221 948070 0 \n",
"\n",
" detailed_description \\\n",
"4866
\\r\\... \n",
"16386 NumberLine 2 is the continuation of the popula... \n",
"26855 Hey. My name is *&#!$.
Today I... \n",
"\n",
" about_the_game \\\n",
"4866
\\r\\... \n",
"16386 NumberLine 2 is the continuation of the popula... \n",
"26855 Hey. My name is *&#!$.
Today I... \n",
"\n",
" short_description supported_languages \\\n",
"4866 Subsiege is an intense real-time tactic game w... NaN \n",
"14560 Welcome to 《Mars VR》. This is an immersive fir... NaN \n",
"16386 NumberLine 2 is the continuation of the popula... NaN \n",
"26855 Hey. My name is *&#!$. Today I will tell y... NaN \n",
"\n",
" header_image \\\n",
"4866 https://steamcdn-a.akamaihd.net/steam/apps/338... \n",
"14560 https://steamcdn-a.akamaihd.net/steam/apps/596... \n",
"16386 https://steamcdn-a.akamaihd.net/steam/apps/654... \n",
"26855 https://steamcdn-a.akamaihd.net/steam/apps/948... \n",
"\n",
" website \\\n",
"4866 http://subsiege-game.com/ \n",
"14560 http://qqsm.zygames.com/ \n",
"16386 NaN \n",
"26855 NaN \n",
"\n",
" pc_requirements \\\n",
"4866 {'minimum': 'Minimum:
, French*, German*, Italian*, Spanish - Spain*, Simplified Chinese*, Traditional Chinese*, Korean*
*languages with full audio support\n"
]
},
{
"data": {
"text/plain": [
"English 8512\n",
"English*
*languages with full audio support 7409\n",
"English, Russian 707\n",
"English, Simplified Chinese 280\n",
"English, Japanese 235\n",
"English*, Russian*
*languages with full audio support 222\n",
"English, French, Italian, German, Spanish - Spain 180\n",
"English, German 161\n",
"Simplified Chinese 157\n",
"English, French 143\n",
"Name: supported_languages, dtype: int64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(price_df['supported_languages'][0])\n",
"price_df['supported_languages'].value_counts().head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like English-only games make up a little over half the rows in our dataset (~16,000), and English plus other languages make up many more. We could create columns for any of the other languages by string searching, but for simplicity we'll focus on just the English ones.\n",
"\n",
"Using the Series.apply method once again, we can check if the string 'english' appears in each row. We'll return a 1 if 'english' is found and 0 otherwise. Whilst we could use True/False, a binary 1/0 can be interpreted as a boolean value and saves a little space in the csv file. \n",
"\n",
"Inside the lambda function, the variable `x` will take on the value of each row as the expression is evaluated. We apply the `lower()` string method so capitalisation doesn't matter."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" english | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name english\n",
"0 Counter-Strike 1\n",
"1 Team Fortress Classic 1\n",
"2 Day of Defeat 1\n",
"3 Deathmatch Classic 1\n",
"4 Half-Life: Opposing Force 1"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_language(df):\n",
" \"\"\"Process supported_languages column into a boolean 'is english' column.\"\"\"\n",
" df = df.copy()\n",
" \n",
" # drop rows with missing language data\n",
" df = df.dropna(subset=['supported_languages'])\n",
" \n",
" df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)\n",
" df = df.drop('supported_languages', axis=1)\n",
" \n",
" return df\n",
"\n",
"\n",
"language_df = process_language(price_df)\n",
"language_df[['name', 'english']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, we can take a quick look at the results and see that most of the apps support English."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 27699\n",
"0 522\n",
"Name: english, dtype: int64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"language_df['english'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Developers and Publishers\n",
"\n",
"We'll skip over the next few columns and take a look at the `developers` and `publishers` columns. They will most likely contain similar information so we can look at them together. \n",
"\n",
"We'll start by checking the null counts, noticing that while the publishers column doesn't appear to have any null values, if we search for empty lists we see that we have over 200 'hidden' null values."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Developers null counts: 104\n",
"Developers empty list counts: 0\n",
"\n",
"Publishers null counts: 0\n",
"Publishers empty list counts: 213\n"
]
}
],
"source": [
"print('Developers null counts:', language_df['developers'].isnull().sum())\n",
"print('Developers empty list counts:', language_df[language_df['developers'] == \"['']\"].shape[0])\n",
"\n",
"print('\\nPublishers null counts:', language_df['publishers'].isnull().sum())\n",
"print('Publishers empty list counts:', language_df[language_df['publishers'] == \"['']\"].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ultimately we want a data set with no missing values. That means we have a few options for dealing with these two columns:\n",
"\n",
"- Remove all rows missing either developer or publisher information\n",
"- Impute missing information by replacing the missing column with the column we have (i.e. if developers is missing, fill it with the value in publishers)\n",
"- Fill missing information with 'Unknown' or 'None'\n",
"\n",
"We can investigate some of the rows with missing data to help inform our decision."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total games missing developer: 104 \n",
"\n",
"Tycoon City: New York: https://store.steampowered.com/app/9730\n",
"Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370\n",
"Crash Time 2: https://store.steampowered.com/app/11390\n",
"Hunting Unlimited 2010: https://store.steampowered.com/app/12690\n",
"18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730\n",
"\n",
"Total games missing publisher: 213 \n",
"\n",
"RIP - Trilogy™: https://store.steampowered.com/app/2540\n",
"Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570\n",
"Bullet Candy: https://store.steampowered.com/app/6600\n",
"AudioSurf: https://store.steampowered.com/app/12900\n",
"Everyday Shooter: https://store.steampowered.com/app/16300\n",
"\n",
"Total games missing developer and publisher: 67 \n",
"\n",
"PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370\n",
"Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350\n",
"A Walk in the Dark: https://store.steampowered.com/app/248730\n",
"Forge Quest: https://store.steampowered.com/app/249950\n",
"Vox: https://store.steampowered.com/app/252770\n"
]
}
],
"source": [
"no_dev = language_df[language_df['developers'].isnull()]\n",
"\n",
"print('Total games missing developer:', no_dev.shape[0], '\\n')\n",
"\n",
"print_steam_links(no_dev[:5])\n",
"\n",
"no_pub = language_df[language_df['publishers'] == \"['']\"]\n",
"\n",
"print('\\nTotal games missing publisher:', no_pub.shape[0], '\\n')\n",
"print_steam_links(no_pub[:5])\n",
"\n",
"no_dev_or_pub = language_df[(language_df['developers'].isnull()) & (language_df['publishers'] == \"['']\")]\n",
"\n",
"print('\\nTotal games missing developer and publisher:', no_dev_or_pub.shape[0], '\\n')\n",
"print_steam_links(no_dev_or_pub[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It appears we are looking at a mix of titles, smaller ones especially, and some of the smaller indie titles may have been self-published. Others simply have wrong or missing data, found by searching for the titles elsewhere. As our priority is creating a clean data set, and there are only a few hundred rows, it will be fine to remove them from the data.\n",
"\n",
"Let's take a look at the structure of the data. Below we inspect some rows near the beginning of the dataframe. It looks like both columns are stored as lists which can have one or multiple values. We'll have to evaluate the rows as before, so they are recognised as lists, then index into them accordingly."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" developers | \n",
" publishers | \n",
"
\n",
" \n",
" \n",
" \n",
" 24 | \n",
" ['Valve'] | \n",
" ['Valve'] | \n",
"
\n",
" \n",
" 25 | \n",
" ['Valve', 'Hidden Path Entertainment'] | \n",
" ['Valve'] | \n",
"
\n",
" \n",
" 27 | \n",
" ['Mark Healey'] | \n",
" ['Mark Healey'] | \n",
"
\n",
" \n",
" 28 | \n",
" ['Tripwire Interactive'] | \n",
" ['Tripwire Interactive'] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" developers publishers\n",
"24 ['Valve'] ['Valve']\n",
"25 ['Valve', 'Hidden Path Entertainment'] ['Valve']\n",
"27 ['Mark Healey'] ['Mark Healey']\n",
"28 ['Tripwire Interactive'] ['Tripwire Interactive']"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"language_df[['developers', 'publishers']].iloc[24:28]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have some single values and some multiple, we have to decide how to handle them. Here are some potential solutions:\n",
"\n",
" - Create a column for each value in the list (i.e. developer_1, developer_2)\n",
" - Create a column with the first value in the list and a column with the rest of the values (i.e. developer_1, other_developers)\n",
" - Create a column with the first value in the list and disregard the rest\n",
" - Combine all values into one column, simply unpacking the list\n",
" \n",
"Let's begin defining our function, and take a look at how many rows have multiple developers or publishers. After evaluating each row, we can find the length of the lists in each row by using the [Series.str.len()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.len.html) method. By filtering only rows where the list has more than one element, we can find the number of rows with more than one value in each column."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rows in developers column with multiple values: 1720\n",
"Rows in publishers column with multiple values: 884\n"
]
}
],
"source": [
"def process_developers_and_publishers(df):\n",
" # remove rows with missing data\n",
" df = df[(df['developers'].notnull()) & (df['publishers'] != \"['']\")].copy()\n",
" \n",
" for col in ['developers', 'publishers']:\n",
" df[col] = df[col].apply(lambda x: literal_eval(x))\n",
" \n",
" # filter dataframe to rows with lists longer than 1, and store the number of rows\n",
" num_rows = df[df[col].str.len() > 1].shape[0]\n",
" \n",
" print('Rows in {} column with multiple values:'.format(col), num_rows)\n",
"\n",
"process_developers_and_publishers(language_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It turns out that the vast majority have only one value for these columns. If we went with the first or second solutions above, we'd be left with columns with mostly missing data. We could go with the third option, but the first value in the list isn't necessarily the most important, and this seems unfair if multiple teams were involved.\n",
"\n",
"The best way forward seems to be the fourth option - if there are multiple values we combine them into the same column. We'll create a list in this case, calling str.join() as we did before. If we pass a list with only one value, we get a string with just that value. If we pass a list with multiple values, we get a string-separated list as desired. We can see this below."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'one item'"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"', '.join(['one item'])"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"'multiple, different, items'"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"', '.join(['multiple', 'different', 'items'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can't join on a comma as a number of developers and publishers have a comma in their name, a couple of which can be seen below."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" developers | \n",
" publishers | \n",
"
\n",
" \n",
" \n",
" \n",
" 25 | \n",
" 730 | \n",
" ['Valve', 'Hidden Path Entertainment'] | \n",
" ['Valve'] | \n",
"
\n",
" \n",
" 66 | \n",
" 2520 | \n",
" ['CINEMAX, s.r.o.'] | \n",
" ['CINEMAX, s.r.o.'] | \n",
"
\n",
" \n",
" 73 | \n",
" 2630 | \n",
" ['Infinity Ward', 'Aspyr (Mac)'] | \n",
" ['Activision', 'Aspyr (Mac)'] | \n",
"
\n",
" \n",
" 97 | \n",
" 3300 | \n",
" ['PopCap Games, Inc.'] | \n",
" ['PopCap Games, Inc.'] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid developers \\\n",
"25 730 ['Valve', 'Hidden Path Entertainment'] \n",
"66 2520 ['CINEMAX, s.r.o.'] \n",
"73 2630 ['Infinity Ward', 'Aspyr (Mac)'] \n",
"97 3300 ['PopCap Games, Inc.'] \n",
"\n",
" publishers \n",
"25 ['Valve'] \n",
"66 ['CINEMAX, s.r.o.'] \n",
"73 ['Activision', 'Aspyr (Mac)'] \n",
"97 ['PopCap Games, Inc.'] "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"language_df.loc[language_df['developers'].str.contains(\",\", na=False), ['steam_appid', 'developers', 'publishers']].head(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead we can join on a semi-colon (`;`). We have 3 rows which contain a semi-colon in their name, so we'll remove these. We'll be able to identify and split individual developer/publisher names in the future by handling it this way."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" developers | \n",
" publishers | \n",
"
\n",
" \n",
" \n",
" \n",
" 9550 | \n",
" 460210 | \n",
" ['bool games;'] | \n",
" ['bool games;'] | \n",
"
\n",
" \n",
" 13489 | \n",
" 568480 | \n",
" [';)', 'Quickdraw Studios'] | \n",
" ['Quickdraw Studios'] | \n",
"
\n",
" \n",
" 16871 | \n",
" 665890 | \n",
" ['Semicolon;'] | \n",
" ['Semicolon;'] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid developers publishers\n",
"9550 460210 ['bool games;'] ['bool games;']\n",
"13489 568480 [';)', 'Quickdraw Studios'] ['Quickdraw Studios']\n",
"16871 665890 ['Semicolon;'] ['Semicolon;']"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"language_df.loc[language_df['developers'].str.contains(\";\", na=False), ['steam_appid', 'developers', 'publishers']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are also a small number of columns that have `['NA']` or `['N/A']` for publisher. These are some really well hidden null values, and they didn't actually surface until much later in the original development process. This helps highlight the iterative nature of data cleaning - you may discover errors in rows or data that went previously undiscovered, and have to go back and update or correct them."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"language_df[(language_df['publishers'] == \"['NA']\") | (language_df['publishers'] == \"['N/A']\")].shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we're ready to finish the function we started. We'll abandon the for loop, as there is not too much repetition, and run it on the data as always."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 10 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 20 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 30 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 40 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 50 | \n",
" Gearbox Software | \n",
" Valve | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid developer publisher\n",
"0 Counter-Strike 10 Valve Valve\n",
"1 Team Fortress Classic 20 Valve Valve\n",
"2 Day of Defeat 30 Valve Valve\n",
"3 Deathmatch Classic 40 Valve Valve\n",
"4 Half-Life: Opposing Force 50 Gearbox Software Valve"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_developers_and_publishers(df):\n",
" \"\"\"Parse columns as semicolon-separated string.\"\"\"\n",
" # remove rows with missing data (~ means not)\n",
" df = df[(df['developers'].notnull()) & (df['publishers'] != \"['']\")].copy()\n",
" df = df[~(df['developers'].str.contains(';')) & ~(df['publishers'].str.contains(';'))]\n",
" df = df[(df['publishers'] != \"['NA']\") & (df['publishers'] != \"['N/A']\")]\n",
" \n",
" # create list for each\n",
" df['developer'] = df['developers'].apply(lambda x: ';'.join(literal_eval(x)))\n",
" df['publisher'] = df['publishers'].apply(lambda x: ';'.join(literal_eval(x)))\n",
"\n",
" df = df.drop(['developers', 'publishers'], axis=1)\n",
" \n",
" return df\n",
"\n",
"dev_pub_df = process_developers_and_publishers(language_df)\n",
"dev_pub_df[['name', 'steam_appid', 'developer', 'publisher']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Categories and Genres\n",
"\n",
"We'll take a look at the `categories` and `genres` columns next. Below we take a look at the null counts and structure of the data. Both appear to be a list of dictionaries containing an id and description key-value pair. Similar to our handling of developers and publishers, it may be best to extract just the descriptions into a list for now. We could make a list of the IDs, keeping track of the corresponding description externally, but that seems overly complex for our purposes."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Categories:\n",
"\n",
"Null values: 509\n",
"\n",
"[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]\n",
"\n",
"Genres:\n",
"\n",
"Null values: 37\n",
"\n",
"[{'id': '1', 'description': 'Action'}]\n",
"[{'id': '3', 'description': 'RPG'}, {'id': '2', 'description': 'Strategy'}]\n"
]
}
],
"source": [
"print('Categories:\\n')\n",
"print('Null values:', dev_pub_df['categories'].isnull().sum())\n",
"print()\n",
"print(dev_pub_df['categories'][0])\n",
"\n",
"print('\\nGenres:\\n')\n",
"print('Null values:', dev_pub_df['genres'].isnull().sum())\n",
"print()\n",
"print(dev_pub_df['genres'].iloc[0])\n",
"print(dev_pub_df['genres'].iloc[1000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we begin defining a function we'll inspect some of the null rows, then we can decide how to handle them.\n",
"\n",
"Using the pandas [DataFrame.sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) method, we can randomly sample a number of rows from the dataset. We set a random_state so the output is the same each time."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Driver Booster 3 for STEAM: https://store.steampowered.com/app/403040\n",
"Swanky Paint: https://store.steampowered.com/app/432030\n",
"ResumeMaker® Professional Deluxe 20: https://store.steampowered.com/app/707280\n",
"CyberLink PhotoDirector 9 Ultra: https://store.steampowered.com/app/679550\n",
"OVRdrop: https://store.steampowered.com/app/586210\n"
]
}
],
"source": [
"print_steam_links(dev_pub_df[dev_pub_df['categories'].isnull()].sample(5, random_state=0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the above rows with null categories data are applications or software of some kind, and not actually games. It would definitely be best to remove these, as we are interested in analysing games from the steam store.\n",
"\n",
"Below we take a look at rows with missing genres data. There doesn't seem to be anything wrong with these games, and it suggests that genre data simply wasn't supplied. As there are only 37 rows affected, we'll remove these rows to keep our dataset complete."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Antagonist: https://store.steampowered.com/app/585260\n",
"Lonelyland VR: https://store.steampowered.com/app/542620\n",
"ChessBase 13 Academy: https://store.steampowered.com/app/377340\n",
"Sonic the Hedgehog 4 - Episode I: https://store.steampowered.com/app/202530\n",
"VEHICLES FURY: https://store.steampowered.com/app/749290\n"
]
}
],
"source": [
"print_steam_links(dev_pub_df[dev_pub_df['genres'].isnull()].sample(5, random_state=0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we did for the developer and publisher columns, we can create an anonymous function and join the results on a semicolon. Inside the function we use a list comprehension to traverse each dictionary and extract the value under the `description` key, as seen below. "
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Multi-player', 'Online Multi-Player', 'Local Multi-Player']"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_category = \"[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}]\"\n",
"\n",
"[x['description'] for x in literal_eval(example_category)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then use the apply function again to turn each column into a simple delimited list."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" categories | \n",
" genres | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
"
\n",
" \n",
" 1 | \n",
" 20 | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
"
\n",
" \n",
" 2 | \n",
" 30 | \n",
" Multi-player;Valve Anti-Cheat enabled | \n",
" Action | \n",
"
\n",
" \n",
" 3 | \n",
" 40 | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
"
\n",
" \n",
" 4 | \n",
" 50 | \n",
" Single-player;Multi-player;Valve Anti-Cheat en... | \n",
" Action | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid categories genres\n",
"0 10 Multi-player;Online Multi-Player;Local Multi-P... Action\n",
"1 20 Multi-player;Online Multi-Player;Local Multi-P... Action\n",
"2 30 Multi-player;Valve Anti-Cheat enabled Action\n",
"3 40 Multi-player;Online Multi-Player;Local Multi-P... Action\n",
"4 50 Single-player;Multi-player;Valve Anti-Cheat en... Action"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_categories_and_genres(df):\n",
" df = df.copy()\n",
" df = df[(df['categories'].notnull()) & (df['genres'].notnull())]\n",
" \n",
" for col in ['categories', 'genres']:\n",
" df[col] = df[col].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))\n",
" \n",
" return df\n",
"\n",
"cat_gen_df = process_categories_and_genres(dev_pub_df)\n",
"cat_gen_df[['steam_appid', 'categories', 'genres']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Achievements and Content Descriptors\n",
"\n",
"The final columns we will take a look at before moving on to export columns are `achievements` and `content_descriptors`. Let's take a look at the null counts for each column and a small sample of rows."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Achievements null counts: 1856\n",
"Content Decsriptors null counts: 0\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" achievements | \n",
" content_descriptors | \n",
"
\n",
" \n",
" \n",
" \n",
" 8 | \n",
" Half-Life: Blue Shift | \n",
" {'total': 0} | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 9 | \n",
" Half-Life 2 | \n",
" {'total': 33, 'highlighted': [{'name': 'Defian... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 10 | \n",
" Counter-Strike: Source | \n",
" {'total': 147, 'highlighted': [{'name': 'Someo... | \n",
" {'ids': [2, 5], 'notes': 'Includes intense vio... | \n",
"
\n",
" \n",
" 11 | \n",
" Half-Life: Source | \n",
" {'total': 0} | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
" 12 | \n",
" Day of Defeat: Source | \n",
" {'total': 54, 'highlighted': [{'name': 'Double... | \n",
" {'ids': [], 'notes': None} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name achievements \\\n",
"8 Half-Life: Blue Shift {'total': 0} \n",
"9 Half-Life 2 {'total': 33, 'highlighted': [{'name': 'Defian... \n",
"10 Counter-Strike: Source {'total': 147, 'highlighted': [{'name': 'Someo... \n",
"11 Half-Life: Source {'total': 0} \n",
"12 Day of Defeat: Source {'total': 54, 'highlighted': [{'name': 'Double... \n",
"\n",
" content_descriptors \n",
"8 {'ids': [], 'notes': None} \n",
"9 {'ids': [], 'notes': None} \n",
"10 {'ids': [2, 5], 'notes': 'Includes intense vio... \n",
"11 {'ids': [], 'notes': None} \n",
"12 {'ids': [], 'notes': None} "
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Achievements null counts:', cat_gen_df['achievements'].isnull().sum())\n",
"print('Content Decsriptors null counts:', cat_gen_df['content_descriptors'].isnull().sum())\n",
"\n",
"cat_gen_df[['name', 'achievements', 'content_descriptors']].iloc[8:13]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like both columns are stored as dictionaries, with standard formats if no details are provided or exist.\n",
"\n",
"Below we take a closer look at a single row from the achievements column."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'total': 33,\n",
" 'highlighted': [{'name': 'Defiant',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},\n",
" {'name': 'Submissive',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},\n",
" {'name': 'Malcontent',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},\n",
" {'name': 'What cat?',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},\n",
" {'name': 'Trusty Hardware',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},\n",
" {'name': 'Barnacle Bowling',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},\n",
" {'name': \"Anchor's Aweigh!\",\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},\n",
" {'name': 'Heavy Weapons',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboatgun.jpg'},\n",
" {'name': 'Vorticough',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_find_vortigauntcave.jpg'},\n",
" {'name': 'Revenge!',\n",
" 'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_chopper.jpg'}]}"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"literal_eval(cat_gen_df['achievements'][9])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two keys in the top level of the dictionary: `total` and `highlighted`. The highlighted column looks too specific, being a selection of achievements specific to that game, so we will remove it. It may be worthwhile extracting the `total` value though.\n",
"\n",
"Now let's take a look at the `content_descriptors` column."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ids': [], 'notes': None} 25398\n",
"{'ids': [2, 5], 'notes': None} 427\n",
"{'ids': [1, 5], 'notes': None} 250\n",
"{'ids': [5], 'notes': None} 127\n",
"{'ids': [1, 2, 5], 'notes': None} 122\n",
"{'ids': [2, 5], 'notes': 'This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, General Mature Content'} 8\n",
"Name: content_descriptors, dtype: int64"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_gen_df['content_descriptors'].value_counts().head(6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Content descriptors contain age-related warnings about the content of a game. They are identified by a numeric ID number, with optional notes supplied. Almost 26,000 rows have an empty list, indicating either no content descriptors or none provided. We already have the `required_age` column, which gives us similar information. Because of this, and because the rows are highly specific to each game, we will drop this column entirely. \n",
"\n",
"Let's now define a function, taking a look at the value counts to verify everything went as expected."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0 11966\n",
"10 697\n",
"12 624\n",
"20 579\n",
"15 500\n",
"Name: achievements, dtype: int64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process_achievements_and_descriptors(df):\n",
" \"\"\"Parse as total number of achievements.\"\"\"\n",
" df = df.copy()\n",
" \n",
" df = df.drop('content_descriptors', axis=1)\n",
" \n",
" def parse_achievements(x):\n",
" if x is np.nan:\n",
" # missing data, assume has no achievements\n",
" return 0\n",
" else:\n",
" # else has data, so can extract and return number under total\n",
" return literal_eval(x)['total']\n",
" \n",
" df['achievements'] = df['achievements'].apply(parse_achievements)\n",
" \n",
" return df\n",
"\n",
"achiev_df = process_achievements_and_descriptors(cat_gen_df)\n",
"achiev_df['achievements'].value_counts().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we were successful. We'll leave this column as it is for now, however we may wish to consider grouping the values together in bins, like we did for the age column. This is a decision we can make during the feature engineering stage of our analysis, and we can decide at that point if it will be more useful.\n",
"\n",
"Let's now add these functions into the `process` function and run it on the raw data. This isn't strictly necessary but it will keep things organised and ensure we don't accidentally skip running a function.\n",
"\n",
"We'll then inspect everything we've completed so far. As you will see, there is still plenty more left to do."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" price | \n",
" english | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 10 | \n",
" 0 | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" windows;mac;linux | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '1 Nov, 2000'} | \n",
" {'url': 'http://steamcommunity.com/app/10', 'e... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" 7.19 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 20 | \n",
" 0 | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" windows;mac;linux | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '1 Apr, 1999'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" 3.99 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 30 | \n",
" 0 | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" http://www.dayofdefeat.com/ | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" windows;mac;linux | \n",
" Multi-player;Valve Anti-Cheat enabled | \n",
" Action | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '1 May, 2003'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" 3.99 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 40 | \n",
" 0 | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" windows;mac;linux | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '1 Jun, 2001'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" 3.99 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 50 | \n",
" 0 | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" NaN | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" windows;mac;linux | \n",
" Single-player;Multi-player;Valve Anti-Cheat en... | \n",
" Action | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '1 Nov, 1999'} | \n",
" {'url': 'https://help.steampowered.com', 'emai... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" 3.99 | \n",
" 1 | \n",
" Gearbox Software | \n",
" Valve | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age \\\n",
"0 Counter-Strike 10 0 \n",
"1 Team Fortress Classic 20 0 \n",
"2 Day of Defeat 30 0 \n",
"3 Deathmatch Classic 40 0 \n",
"4 Half-Life: Opposing Force 50 0 \n",
"\n",
" detailed_description \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" about_the_game \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" short_description \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" header_image \\\n",
"0 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n",
"1 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n",
"2 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n",
"3 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n",
"4 https://steamcdn-a.akamaihd.net/steam/apps/50/... \n",
"\n",
" website \\\n",
"0 NaN \n",
"1 NaN \n",
"2 http://www.dayofdefeat.com/ \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" pc_requirements \\\n",
"0 {'minimum': '\\r\\n\\t\\t\\tMinimum:Minimum:Minimum:Minimum:Minimum:\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" price | \n",
" english | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
" 97 | \n",
" Bejeweled 2 Deluxe | \n",
" 3300 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/330... | \n",
" NaN | \n",
" {'minimum': '<p><strong>Minimum Requirements:<... | \n",
" {'minimum': '<ul>\\n\\t<li><strong>OS:</strong> ... | \n",
" [] | \n",
" windows;mac | \n",
" Single-player | \n",
" Casual | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '30 Aug, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/330... | \n",
" 4.25 | \n",
" 1 | \n",
" PopCap Games, Inc. | \n",
" PopCap Games, Inc. | \n",
"
\n",
" \n",
" 98 | \n",
" Chuzzle Deluxe | \n",
" 3310 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/331... | \n",
" NaN | \n",
" {'minimum': '<p><strong>Minimum Requirements:<... | \n",
" {'minimum': '<ul>\\n\\t<li><strong>OS:</strong> ... | \n",
" [] | \n",
" windows;mac | \n",
" Single-player | \n",
" Casual | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '30 Aug, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/331... | \n",
" 4.25 | \n",
" 1 | \n",
" PopCap Games, Inc. | \n",
" PopCap Games, Inc. | \n",
"
\n",
" \n",
" 99 | \n",
" Insaniquarium Deluxe | \n",
" 3320 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/332... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum Requirements:</st... | \n",
" [] | \n",
" [] | \n",
" windows | \n",
" Single-player | \n",
" Casual | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '30 Aug, 2006'} | \n",
" {'url': '', 'email': ''} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/332... | \n",
" 4.25 | \n",
" 1 | \n",
" PopCap Games, Inc. | \n",
" PopCap Games, Inc. | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" name steam_appid required_age detailed_description \\\n",
"97 Bejeweled 2 Deluxe 3300 0 NaN \n",
"98 Chuzzle Deluxe 3310 0 NaN \n",
"99 Insaniquarium Deluxe 3320 0 NaN \n",
"\n",
" about_the_game short_description \\\n",
"97 NaN NaN \n",
"98 NaN NaN \n",
"99 NaN NaN \n",
"\n",
" header_image website \\\n",
"97 https://steamcdn-a.akamaihd.net/steam/apps/330... NaN \n",
"98 https://steamcdn-a.akamaihd.net/steam/apps/331... NaN \n",
"99 https://steamcdn-a.akamaihd.net/steam/apps/332... NaN \n",
"\n",
" pc_requirements \\\n",
"97 {'minimum': 'Minimum Requirements:<... \n",
"98 {'minimum': 'Minimum Requirements:<... \n",
"99 {'minimum': 'Minimum Requirements:\\n\\t- OS: ... [] \n",
"98 {'minimum': '
\\n\\t- OS: ... [] \n",
"99 [] [] \n",
"\n",
" platforms categories genres \\\n",
"97 windows;mac Single-player Casual \n",
"98 windows;mac Single-player Casual \n",
"99 windows Single-player Casual \n",
"\n",
" screenshots movies achievements \\\n",
"97 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 \n",
"98 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 \n",
"99 [{'id': 0, 'path_thumbnail': 'https://steamcdn... NaN 0 \n",
"\n",
" release_date support_info \\\n",
"97 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} \n",
"98 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} \n",
"99 {'coming_soon': False, 'date': '30 Aug, 2006'} {'url': '', 'email': ''} \n",
"\n",
" background price english \\\n",
"97 https://steamcdn-a.akamaihd.net/steam/apps/330... 4.25 1 \n",
"98 https://steamcdn-a.akamaihd.net/steam/apps/331... 4.25 1 \n",
"99 https://steamcdn-a.akamaihd.net/steam/apps/332... 4.25 1 \n",
"\n",
" developer publisher \n",
"97 PopCap Games, Inc. PopCap Games, Inc. \n",
"98 PopCap Games, Inc. PopCap Games, Inc. \n",
"99 PopCap Games, Inc. PopCap Games, Inc. "
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"partially_clean[partially_clean['detailed_description'].isnull()].head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interestingly, all of these titles are games from 2006 developed and published by PopCap Games. My best guess is that they were developed previously and all added to the Steam store in one go after Valve allowed third-party titles.\n",
"\n",
"We'll remove these rows, as well as any with a description of less than 20 characters, like those below."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" price | \n",
" english | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
" 10420 | \n",
" Penguins Cretins | \n",
" 490990 | \n",
" 0 | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/490... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" windows | \n",
" Single-player;Multi-player;Online Multi-Player... | \n",
" Action;Adventure;Casual;Indie;Massively Multip... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '22 Jun, 2016'} | \n",
" {'url': '', 'email': 'support@hfmgames.net'} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/490... | \n",
" 1.69 | \n",
" 1 | \n",
" HFM Games | \n",
" HFM Games | \n",
"
\n",
" \n",
" 19938 | \n",
" 拼词游戏 2017 | \n",
" 745840 | \n",
" 0 | \n",
" 带一点恐怖元素的休闲游戏 | \n",
" 带一点恐怖元素的休闲游戏 | \n",
" 一款有一点恐怖元素的休闲益智游戏。 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/745... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" windows;mac | \n",
" Single-player | \n",
" Adventure;Indie | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256699963, 'name': 'alpha', 'thumbnail... | \n",
" 11 | \n",
" {'coming_soon': False, 'date': '29 Nov, 2017'} | \n",
" {'url': '', 'email': '12668934@qq.com'} | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/745... | \n",
" 0.79 | \n",
" 0 | \n",
" Mianwotu | \n",
" Mianwotu | \n",
"
\n",
" \n",
" 21937 | \n",
" God Test | \n",
" 797660 | \n",
" 0 | \n",
" God Test | \n",
" God Test | \n",
" God Test | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/797... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" windows | \n",
" Multi-player;Online Multi-Player;MMO;Online Co... | \n",
" Action;Massively Multiplayer;RPG;Strategy | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" {'coming_soon': False, 'date': '18 Apr, 2018'} | \n",
" {'url': '', 'email': 'insanegamedev@outlook.com'} | \n",
" NaN | \n",
" 0.00 | \n",
" 1 | \n",
" God Test | \n",
" God Test | \n",
"
\n",
" \n",
" 26219 | \n",
" В поисках Атлантиды | \n",
" 925640 | \n",
" 0 | \n",
" Интересная игра | \n",
" Интересная игра | \n",
" Atlantis | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/925... | \n",
" https://vk.com/atlantisforever | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" windows | \n",
" Single-player;Steam Achievements | \n",
" Adventure;Casual;Indie;Early Access | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256725871, 'name': 'Game', 'thumbnail'... | \n",
" 1 | \n",
" {'coming_soon': False, 'date': '1 Nov, 2018'} | \n",
" {'url': 'https://vk.com/atlantisforever', 'ema... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/925... | \n",
" 1.69 | \n",
" 0 | \n",
" Dmitr Che | \n",
" Dmitr Che | \n",
"
\n",
" \n",
" 26356 | \n",
" 东方百问~TouHouAsked | \n",
" 930840 | \n",
" 0 | \n",
" Null | \n",
" Null | \n",
" Null | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/930... | \n",
" https://asked.touhou.ren/ | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" windows | \n",
" Single-player;Steam Achievements | \n",
" Casual;Indie;Strategy;Early Access | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" [{'id': 256726640, 'name': 'TouHouAsked', 'thu... | \n",
" 2 | \n",
" {'coming_soon': False, 'date': '7 Oct, 2018'} | \n",
" {'url': 'https://asked.touhou.ren', 'email': '... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/930... | \n",
" 0.79 | \n",
" 0 | \n",
" Root Nine Studio | \n",
" Root Nine Studio | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age detailed_description \\\n",
"10420 Penguins Cretins 490990 0 ... \n",
"19938 拼词游戏 2017 745840 0 带一点恐怖元素的休闲游戏 \n",
"21937 God Test 797660 0 God Test \n",
"26219 В поисках Атлантиды 925640 0 Интересная игра \n",
"26356 东方百问~TouHouAsked 930840 0 Null \n",
"\n",
" about_the_game short_description \\\n",
"10420 ... ... \n",
"19938 带一点恐怖元素的休闲游戏 一款有一点恐怖元素的休闲益智游戏。 \n",
"21937 God Test God Test \n",
"26219 Интересная игра Atlantis \n",
"26356 Null Null \n",
"\n",
" header_image \\\n",
"10420 https://steamcdn-a.akamaihd.net/steam/apps/490... \n",
"19938 https://steamcdn-a.akamaihd.net/steam/apps/745... \n",
"21937 https://steamcdn-a.akamaihd.net/steam/apps/797... \n",
"26219 https://steamcdn-a.akamaihd.net/steam/apps/925... \n",
"26356 https://steamcdn-a.akamaihd.net/steam/apps/930... \n",
"\n",
" website \\\n",
"10420 NaN \n",
"19938 NaN \n",
"21937 NaN \n",
"26219 https://vk.com/atlantisforever \n",
"26356 https://asked.touhou.ren/ \n",
"\n",
" pc_requirements \\\n",
"10420 {'minimum': 'Minimum:
20]\n",
" \n",
" # by default we don't export, useful if calling function later\n",
" if export:\n",
" # create dataframe of description columns\n",
" description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]\n",
" \n",
" export_data(description_data, filename='description_data')\n",
" \n",
" # drop description columns from main dataframe\n",
" df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)\n",
" \n",
" return df\n",
"\n",
"desc_df = process_descriptions(partially_clean, export=True)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" detailed_description | \n",
" about_the_game | \n",
" short_description | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
" Play the world's number 1 online action game. ... | \n",
"
\n",
" \n",
" 1 | \n",
" 20 | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
" One of the most popular online action games of... | \n",
"
\n",
" \n",
" 2 | \n",
" 30 | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
" Enlist in an intense brand of Axis vs. Allied ... | \n",
"
\n",
" \n",
" 3 | \n",
" 40 | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
" Enjoy fast-paced multiplayer gaming with Death... | \n",
"
\n",
" \n",
" 4 | \n",
" 50 | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
" Return to the Black Mesa Research Facility as ... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid detailed_description \\\n",
"0 10 Play the world's number 1 online action game. ... \n",
"1 20 One of the most popular online action games of... \n",
"2 30 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 40 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 50 Return to the Black Mesa Research Facility as ... \n",
"\n",
" about_the_game \\\n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... \n",
"\n",
" short_description \n",
"0 Play the world's number 1 online action game. ... \n",
"1 One of the most popular online action games of... \n",
"2 Enlist in an intense brand of Axis vs. Allied ... \n",
"3 Enjoy fast-paced multiplayer gaming with Death... \n",
"4 Return to the Black Mesa Research Facility as ... "
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# inspect exported data\n",
"pd.read_csv('../data/exports/steam_description_data.csv').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Processing Media Columns\n",
"\n",
"Similar to the description columns, we have three columns that contain links to various images: `header_image`, `screenshots` and `background`. Whilst we won't be needing this data in this project, it could open the door to some interesting image analysis in the future. We will treat these columns in almost the same way, exporting the contents to a csv file then removing them from the dataset.\n",
"\n",
"Again, let's check for missing values."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"header_image: 0\n",
"screenshots: 4\n",
"background: 4\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" header_image | \n",
" screenshots | \n",
" background | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
"
\n",
" \n",
" 1 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
"
\n",
" \n",
" 2 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
"
\n",
" \n",
" 3 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
"
\n",
" \n",
" 4 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" header_image \\\n",
"0 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n",
"1 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n",
"2 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n",
"3 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n",
"4 https://steamcdn-a.akamaihd.net/steam/apps/50/... \n",
"\n",
" screenshots \\\n",
"0 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"1 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"2 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"3 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"4 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"\n",
" background \n",
"0 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n",
"1 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n",
"2 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n",
"3 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n",
"4 https://steamcdn-a.akamaihd.net/steam/apps/50/... "
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"image_cols = ['header_image', 'screenshots', 'background']\n",
"\n",
"for col in image_cols:\n",
" print(col+':', desc_df[col].isnull().sum())\n",
"\n",
"desc_df[image_cols].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the description columns, it is likely that the 4 rows with no `screenshots` data are the same rows with no `background` data. There are so few that it is probably safe to remove them.\n",
"\n",
"Before we make up our made let's inspect the rows."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Light Empire: https://store.steampowered.com/app/416220\n",
"Girl and Goblin: https://store.steampowered.com/app/880510\n",
"Arida: Backland's Awakening: https://store.steampowered.com/app/907760\n",
"Nukalypse: The Final War: https://store.steampowered.com/app/947940\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" header_image | \n",
" website | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" screenshots | \n",
" movies | \n",
" achievements | \n",
" release_date | \n",
" support_info | \n",
" background | \n",
" price | \n",
" english | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
" 7970 | \n",
" The Light Empire | \n",
" 416220 | \n",
" 0 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/416... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
" windows | \n",
" Single-player;Steam Achievements | \n",
" Indie;RPG;Simulation | \n",
" NaN | \n",
" NaN | \n",
" 4 | \n",
" {'coming_soon': False, 'date': '2 Dec, 2015'} | \n",
" {'url': '', 'email': 'Jemy.TLE@outlook.com'} | \n",
" NaN | \n",
" 4.79 | \n",
" 1 | \n",
" Jemy | \n",
" Jemy | \n",
"
\n",
" \n",
" 24874 | \n",
" Girl and Goblin | \n",
" 880510 | \n",
" 0 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/880... | \n",
" NaN | \n",
" {'minimum': '<strong>最低配置:</strong><br><ul cla... | \n",
" [] | \n",
" [] | \n",
" windows | \n",
" Single-player;Steam Achievements;Steam Trading... | \n",
" Adventure;Casual;RPG;Strategy | \n",
" NaN | \n",
" [{'id': 256739772, 'name': '3', 'thumbnail': '... | \n",
" 1552 | \n",
" {'coming_soon': False, 'date': '5 Jan, 2019'} | \n",
" {'url': '', 'email': 'smagician13@yahoo.com'} | \n",
" NaN | \n",
" 0.79 | \n",
" 1 | \n",
" Inverse Game | \n",
" Inverse Game | \n",
"
\n",
" \n",
" 25696 | \n",
" Arida: Backland's Awakening | \n",
" 907760 | \n",
" 0 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/907... | \n",
" http://www.aridagame.com | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" windows;mac | \n",
" Single-player;Steam Achievements;Full controll... | \n",
" Adventure;Indie | \n",
" NaN | \n",
" [{'id': 256729551, 'name': 'Teaser Beta 2018',... | \n",
" 0 | \n",
" {'coming_soon': True, 'date': ''} | \n",
" {'url': 'http://www.aridagame.com', 'email': '... | \n",
" NaN | \n",
" 0.00 | \n",
" 1 | \n",
" Aoca Game Lab | \n",
" Aoca Game Lab | \n",
"
\n",
" \n",
" 26851 | \n",
" Nukalypse: The Final War | \n",
" 947940 | \n",
" 0 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/947... | \n",
" NaN | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" windows | \n",
" Single-player;Steam Achievements | \n",
" Action;Indie;Early Access | \n",
" NaN | \n",
" [{'id': 256745274, 'name': 'Nukalypse: The Fin... | \n",
" 0 | \n",
" {'coming_soon': True, 'date': 'Oct 2019'} | \n",
" {'url': '', 'email': 'nukalypse@gmail.com'} | \n",
" NaN | \n",
" 0.00 | \n",
" 1 | \n",
" Zion Games Studio | \n",
" Zion Games Studio | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name steam_appid required_age \\\n",
"7970 The Light Empire 416220 0 \n",
"24874 Girl and Goblin 880510 0 \n",
"25696 Arida: Backland's Awakening 907760 0 \n",
"26851 Nukalypse: The Final War 947940 0 \n",
"\n",
" header_image \\\n",
"7970 https://steamcdn-a.akamaihd.net/steam/apps/416... \n",
"24874 https://steamcdn-a.akamaihd.net/steam/apps/880... \n",
"25696 https://steamcdn-a.akamaihd.net/steam/apps/907... \n",
"26851 https://steamcdn-a.akamaihd.net/steam/apps/947... \n",
"\n",
" website \\\n",
"7970 NaN \n",
"24874 NaN \n",
"25696 http://www.aridagame.com \n",
"26851 NaN \n",
"\n",
" pc_requirements \\\n",
"7970 {'minimum': 'Minimum:
\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" header_image | \n",
" screenshots | \n",
" background | \n",
" movies | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/10/... | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 20 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/20/... | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 30 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/30/... | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 40 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/40/... | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 50 | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" [{'id': 0, 'path_thumbnail': 'https://steamcdn... | \n",
" https://steamcdn-a.akamaihd.net/steam/apps/50/... | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" steam_appid header_image \\\n",
"0 10 https://steamcdn-a.akamaihd.net/steam/apps/10/... \n",
"1 20 https://steamcdn-a.akamaihd.net/steam/apps/20/... \n",
"2 30 https://steamcdn-a.akamaihd.net/steam/apps/30/... \n",
"3 40 https://steamcdn-a.akamaihd.net/steam/apps/40/... \n",
"4 50 https://steamcdn-a.akamaihd.net/steam/apps/50/... \n",
"\n",
" screenshots \\\n",
"0 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"1 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"2 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"3 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"4 [{'id': 0, 'path_thumbnail': 'https://steamcdn... \n",
"\n",
" background movies \n",
"0 https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN \n",
"1 https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN \n",
"2 https://steamcdn-a.akamaihd.net/steam/apps/30/... NaN \n",
"3 https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN \n",
"4 https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN "
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# inspect exported data\n",
"pd.read_csv('../data/exports/steam_media_data.csv').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we move on, we can inspect the memory savings of removing these columns by comparing the output of the [DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method. If we pass `memory_usage=\"deep\"` we get the true memory usage of each DataFrame. Without this, pandas estimates the amount used. This is because of the way python stores object (string) columns under the hood. Essentially python keeps track of a list of pointers which point to the actual strings in memory. It's a bit like if you hid a bunch of items around the house, and kept a list of where everything was. You couldn't tell the total size of everything just by looking at the list, but you could take a rough guess. Only by following the list and inspecting each individual item could you get an exact figure.\n",
"\n",
"The blog post '[Why Python Is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)' goes into more detail, but all we need to be aware of is that by passing the parameter we ensure we get the true value of memory usage. We also pass `verbose=False` to truncate unnecessary output.\n",
"\n",
"We can see that already we have shrunk the memory usage from 285 MB to just under 55 MB. This is great because in general, the smaller the memory footprint the faster our code will run in future. And of course, we're not finished yet."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before removing data:\n",
"\n",
"\n",
"Int64Index: 27414 entries, 0 to 29234\n",
"Columns: 24 entries, name to publisher\n",
"dtypes: category(1), float64(1), int64(3), object(19)\n",
"memory usage: 285.0 MB\n",
"\n",
"Data with descriptions and media removed:\n",
"\n",
"\n",
"Int64Index: 27391 entries, 0 to 29234\n",
"Columns: 17 entries, name to publisher\n",
"dtypes: category(1), float64(1), int64(3), object(12)\n",
"memory usage: 54.2 MB\n"
]
}
],
"source": [
"print('Before removing data:\\n')\n",
"achiev_df.info(verbose=False, memory_usage=\"deep\")\n",
"\n",
"print('\\nData with descriptions and media removed:\\n')\n",
"media_df.info(verbose=False, memory_usage=\"deep\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Website and Support Info\n",
"\n",
"Next we will look at the `website` and `support_info` columns. Seen below, they both contain links to external websites. The website column is simply stored as a string whereas the support info column is stored as a dictionary of `url` and `email`.\n",
"\n",
"There are a large number of rows with no website listed, and while there are no null values in the `support_info` column, it looks like many will have empty `url` and `email` values inside the data.\n",
"\n",
"For our dataset we'll be dropping both these columns, as they are far too specific to be useful in our analysis. As you may have guessed, we will extract and export this data as we have done before. If not useful, it could be interesting at a later date."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"website null counts: 9328\n",
"support_info null counts: 0\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" website | \n",
" support_info | \n",
"
\n",
" \n",
" \n",
" \n",
" 82 | \n",
" X: Beyond the Frontier | \n",
" http://www.egosoft.com/games/x/info_en.php | \n",
" {'url': '', 'email': ''} | \n",
"
\n",
" \n",
" 83 | \n",
" X: Tension | \n",
" http://www.egosoft.com/games/x_tension/info_en.php | \n",
" {'url': '', 'email': ''} | \n",
"
\n",
" \n",
" 84 | \n",
" X Rebirth | \n",
" http://www.egosoft.com/games/x_rebirth/info_en.php | \n",
" {'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'} | \n",
"
\n",
" \n",
" 85 | \n",
" 688(I) Hunter/Killer | \n",
" NaN | \n",
" {'url': 'http://strategyfirst.com/products/support.html', 'email': ''} | \n",
"
\n",
" \n",
" 86 | \n",
" Fleet Command | \n",
" NaN | \n",
" {'url': 'http://strategyfirst.com/products/support.html', 'email': ''} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name \\\n",
"82 X: Beyond the Frontier \n",
"83 X: Tension \n",
"84 X Rebirth \n",
"85 688(I) Hunter/Killer \n",
"86 Fleet Command \n",
"\n",
" website \\\n",
"82 http://www.egosoft.com/games/x/info_en.php \n",
"83 http://www.egosoft.com/games/x_tension/info_en.php \n",
"84 http://www.egosoft.com/games/x_rebirth/info_en.php \n",
"85 NaN \n",
"86 NaN \n",
"\n",
" support_info \n",
"82 {'url': '', 'email': ''} \n",
"83 {'url': '', 'email': ''} \n",
"84 {'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'} \n",
"85 {'url': 'http://strategyfirst.com/products/support.html', 'email': ''} \n",
"86 {'url': 'http://strategyfirst.com/products/support.html', 'email': ''} "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print('website null counts:', media_df['website'].isnull().sum())\n",
"print('support_info null counts:', media_df['support_info'].isnull().sum())\n",
"\n",
"with pd.option_context(\"display.max_colwidth\", 100): # ensures strings not cut short\n",
" display(media_df[['name', 'website', 'support_info']][75:80])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to split the support info into two separate columns. We'll keep all the code that parses the columns inside the export `if` statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain `NaN` whereas the other two columns contain a blank string (`''`) for missing data, as once we have exported to csv they will be represented the same way."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exported support info to '../data/exports/steam_support_info.csv'\n"
]
}
],
"source": [
"def process_info(df, export=False):\n",
" \"\"\"Drop support information from dataframe, optionally exporting beforehand.\"\"\"\n",
" if export:\n",
" support_info = df[['steam_appid', 'website', 'support_info']].copy()\n",
" \n",
" support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))\n",
" support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])\n",
" support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])\n",
" \n",
" support_info = support_info.drop('support_info', axis=1)\n",
" \n",
" # only keep rows with at least one piece of information\n",
" support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]\n",
"\n",
" export_data(support_info, 'support_info')\n",
" \n",
" df = df.drop(['website', 'support_info'], axis=1)\n",
" \n",
" return df\n",
"\n",
"\n",
"info_df = process_info(media_df, export=True)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" website | \n",
" support_url | \n",
" support_email | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" NaN | \n",
" http://steamcommunity.com/app/10 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 30 | \n",
" http://www.dayofdefeat.com/ | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 50 | \n",
" NaN | \n",
" https://help.steampowered.com | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 70 | \n",
" http://www.half-life.com/ | \n",
" http://steamcommunity.com/app/70 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 80 | \n",
" NaN | \n",
" http://steamcommunity.com/app/80 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid website support_url \\\n",
"0 10 NaN http://steamcommunity.com/app/10 \n",
"1 30 http://www.dayofdefeat.com/ NaN \n",
"2 50 NaN https://help.steampowered.com \n",
"3 70 http://www.half-life.com/ http://steamcommunity.com/app/70 \n",
"4 80 NaN http://steamcommunity.com/app/80 \n",
"\n",
" support_email \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN "
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# inspect exported file\n",
"pd.read_csv('../data/exports/steam_support_info.csv').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### System Requirements\n",
"\n",
"As you may have noticed, there are three columns containing the system specifications required to run each game on each platform. These columns are `pc_requirements`, `mac_requirements`, and `linux_requirements`. As computing power has increased over the years, this information won't be of particular use in this analysis. It could be fascinating to use as a proxy of how computers have developed over the years, so we'll want to export the data, but we won't be keeping it in this dataset.\n",
"\n",
"Taking a look at the null counts, it looks like there is data for every row."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"null counts:\n",
"\n",
"pc_requirements: 0\n",
"mac_requirements: 0\n",
"linux_requirements: 0\n"
]
}
],
"source": [
"requirements_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']\n",
"\n",
"print('null counts:\\n')\n",
"\n",
"for col in requirements_cols:\n",
" print(col+':', info_df[col].isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However if we look at the data a little more closely, we see that some rows actually store an empty list. These won't appear as null rows, but once evaluated won't provide any information, so can be treated as null values."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
"
\n",
" \n",
" \n",
" \n",
" 29230 | \n",
" 1065230 | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
"
\n",
" \n",
" 29231 | \n",
" 1065570 | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
"
\n",
" \n",
" 29232 | \n",
" 1065650 | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
" [] | \n",
"
\n",
" \n",
" 29233 | \n",
" 1066700 | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
"
\n",
" \n",
" 29234 | \n",
" 1069460 | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" {'minimum': '<strong>Minimum:</strong><br><ul ... | \n",
" [] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid pc_requirements \\\n",
"29230 1065230 {'minimum': 'Minimum:
Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection
\\\\r\\\\n\\\\t\\\\t\\\\tRecommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection
\\\\r\\\\n\\\\t\\\\t\\\\t'}\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'{\\'minimum\\': \\'Minimum:
- OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic
- Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices
- Memory: 2 GB RAM
- Storage: 1400 MB available space
\\', \\'recommended\\': \\'Recommended:
- OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic
- Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices
- Memory: 4 GB RAM
- Storage: 1400 MB available space
\\'}'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'{\\'minimum\\': \\'Minimum:
- Requires a 64-bit processor and operating system
- OS: Windows 8 / 10 - 64-bit
- Processor: 2Ghz Intel CPU
- Memory: 2 GB RAM
- Graphics: 1024MB VRAM
- DirectX: Version 9.0
- Network: Broadband Internet connection
- Storage: 10 GB available space
\\', \\'recommended\\': \\'Recommended:
- Requires a 64-bit processor and operating system
- OS: Windows 10 - 64-bit
- Processor: 2.60Ghz Intel Core i7 CPU
- Memory: 16 GB RAM
- Graphics: 4GB VRAM
- Network: Broadband Internet connection
- Storage: 10 GB available space
\\'}'"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(info_df['pc_requirements'].iloc[0])\n",
"display(info_df['pc_requirements'].iloc[2000])\n",
"display(info_df['pc_requirements'].iloc[15000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data is stored as a dictionary, as we've seen before. There is definitely a key for `minimum`, but apart from that it is hard to see at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like `Processor` and `Memory` for some of the rows.\n",
"\n",
"Let's take a stab and cleaning out some of the unnessecary formatting and see if it becomes clearer.\n",
"\n",
"By creating a dataframe from a selection of rows, we can easily and quickly make changes using the pandas .str accessor, allowing us to use python string formatting and regular expressions."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\"{'minimum': 'Minimum: OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices Memory: 2 GB RAM Storage: 1400 MB available space', 'recommended': 'Recommended: OS: Microsoft® Windows® XP, Windows Server 2008, Windows Vista® Home Premium, Business, Ultimate, or Enterprise (including 64 bit editions) with Service Pack 2, Windows 7, or Windows 8 Classic Processor: 2.33GHz or faster x86-compatible processor, or Intel Atom™ 1.6GHz or faster processor for netbook class devices Memory: 4 GB RAM Storage: 1400 MB available space'}\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\"{'minimum': 'Minimum: Requires a 64-bit processor and operating system OS: Windows 8 / 10 - 64-bit Processor: 2Ghz Intel CPU Memory: 2 GB RAM Graphics: 1024MB VRAM DirectX: Version 9.0 Network: Broadband Internet connection Storage: 10 GB available space', 'recommended': 'Recommended: Requires a 64-bit processor and operating system OS: Windows 10 - 64-bit Processor: 2.60Ghz Intel Core i7 CPU Memory: 16 GB RAM Graphics: 4GB VRAM Network: Broadband Internet connection Storage: 10 GB available space'}\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"view_requirements = info_df['pc_requirements'].iloc[[0, 2000, 15000]].copy()\n",
"\n",
"view_requirements = (view_requirements\n",
" .str.replace(r'\\\\[rtn]', '')\n",
" .str.replace(r'<[pbr]{1,2}>', ' ')\n",
" .str.replace(r'<[\\/\"=\\w\\s]+>', '')\n",
" )\n",
"\n",
"for i, row in view_requirements.iteritems():\n",
" display(row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now see more clearly the contents and structure of these rows. Some rows have both `Minimum` and `Recommended` requirements inside a `minimum` key, and some have separate `minimum` and `recommended` keys. Some have headings like `Processor:` and `Storage:` before various components, others simply have a list of components. Some state particular speeds for components, like 2 Ghz CPU, others state specific models, like 'Intel Core 2 Duo', amongst this information.\n",
"\n",
"It seems like it would be possible to extract invidivual component information from this data, however it would be a lengthy and complex process worthy of another project. With that in mind, it seems best to proceed by slightly cleaning the data before exporting it, but not trying to deal with individual components right now.\n",
"\n",
"We'll export the raw data, and split out the data in the `minimum` and `recommended` keys. This will mean lots of rows still have recommended data hidden inside the minimum key, but this will do for now."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exported requirements data to '../data/exports/steam_requirements_data.csv'\n"
]
}
],
"source": [
"def process_requirements(df, export=False):\n",
" if export:\n",
" requirements = df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].copy()\n",
" \n",
" # remove rows with missing pc requirements\n",
" requirements = requirements[requirements['pc_requirements'] != '[]']\n",
" \n",
" requirements['requirements_clean'] = (requirements['pc_requirements']\n",
" .str.replace(r'\\\\[rtn]', '')\n",
" .str.replace(r'<[pbr]{1,2}>', ' ')\n",
" .str.replace(r'<[\\/\"=\\w\\s]+>', '')\n",
" )\n",
" \n",
" requirements['requirements_clean'] = requirements['requirements_clean'].apply(lambda x: literal_eval(x))\n",
" \n",
" # split out minimum and recommended into separate columns\n",
" requirements['minimum'] = requirements['requirements_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)\n",
" requirements['recommended'] = requirements['requirements_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)\n",
" \n",
" requirements = requirements.drop('requirements_clean', axis=1)\n",
" \n",
" export_data(requirements, 'requirements_data')\n",
" \n",
" df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)\n",
" \n",
" return df\n",
"\n",
"reqs_df = process_requirements(info_df, export=True)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" steam_appid | \n",
" pc_requirements | \n",
" mac_requirements | \n",
" linux_requirements | \n",
" minimum | \n",
" recommended | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" 500 mhz processor, 96mb ram, 16mb video card, ... | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 20 | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" 500 mhz processor, 96mb ram, 16mb video card, ... | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 30 | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" 500 mhz processor, 96mb ram, 16mb video card, ... | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 40 | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" 500 mhz processor, 96mb ram, 16mb video card, ... | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 50 | \n",
" {'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</st... | \n",
" {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | \n",
" {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | \n",
" 500 mhz processor, 96mb ram, 16mb video card, ... | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" steam_appid pc_requirements \\\n",
"0 10 {'minimum': '\\r\\n\\t\\t\\tMinimum:Minimum:Minimum:Minimum:Minimum:\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" achievements | \n",
" release_date | \n",
" price | \n",
" english | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Counter-Strike | \n",
" 10 | \n",
" 0 | \n",
" windows;mac;linux | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
" 0 | \n",
" 2000-11-01 | \n",
" 7.19 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 1 | \n",
" Team Fortress Classic | \n",
" 20 | \n",
" 0 | \n",
" windows;mac;linux | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
" 0 | \n",
" 1999-04-01 | \n",
" 3.99 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 2 | \n",
" Day of Defeat | \n",
" 30 | \n",
" 0 | \n",
" windows;mac;linux | \n",
" Multi-player;Valve Anti-Cheat enabled | \n",
" Action | \n",
" 0 | \n",
" 2003-05-01 | \n",
" 3.99 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 3 | \n",
" Deathmatch Classic | \n",
" 40 | \n",
" 0 | \n",
" windows;mac;linux | \n",
" Multi-player;Online Multi-Player;Local Multi-P... | \n",
" Action | \n",
" 0 | \n",
" 2001-06-01 | \n",
" 3.99 | \n",
" 1 | \n",
" Valve | \n",
" Valve | \n",
"
\n",
" \n",
" 4 | \n",
" Half-Life: Opposing Force | \n",
" 50 | \n",
" 0 | \n",
" windows;mac;linux | \n",
" Single-player;Multi-player;Valve Anti-Cheat en... | \n",
" Action | \n",
" 0 | \n",
" 1999-11-01 | \n",
" 3.99 | \n",
" 1 | \n",
" Gearbox Software | \n",
" Valve | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" name steam_appid required_age platforms \\\n",
"0 Counter-Strike 10 0 windows;mac;linux \n",
"1 Team Fortress Classic 20 0 windows;mac;linux \n",
"2 Day of Defeat 30 0 windows;mac;linux \n",
"3 Deathmatch Classic 40 0 windows;mac;linux \n",
"4 Half-Life: Opposing Force 50 0 windows;mac;linux \n",
"\n",
" categories genres achievements \\\n",
"0 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n",
"1 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n",
"2 Multi-player;Valve Anti-Cheat enabled Action 0 \n",
"3 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n",
"4 Single-player;Multi-player;Valve Anti-Cheat en... Action 0 \n",
"\n",
" release_date price english developer publisher \n",
"0 2000-11-01 7.19 1 Valve Valve \n",
"1 1999-04-01 3.99 1 Valve Valve \n",
"2 2003-05-01 3.99 1 Valve Valve \n",
"3 2001-06-01 3.99 1 Valve Valve \n",
"4 1999-11-01 3.99 1 Gearbox Software Valve "
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def process(df):\n",
" \"\"\"Process data set. Will eventually contain calls to all functions we write.\"\"\"\n",
" \n",
" # Copy the input dataframe to avoid accidentally modifying original data\n",
" df = df.copy()\n",
" \n",
" # Remove duplicate rows - all appids should be unique\n",
" df = df.drop_duplicates()\n",
" \n",
" # Remove collumns with more than 50% null values\n",
" df = drop_null_cols(df)\n",
" \n",
" # Process columns\n",
" df = process_name_type(df)\n",
" df = process_age(df)\n",
" df = process_platforms(df)\n",
" df = process_price(df)\n",
" df = process_language(df)\n",
" df = process_developers_and_publishers(df)\n",
" df = process_categories_and_genres(df)\n",
" df = process_achievements_and_descriptors(df) \n",
" df = process_release_date(df)\n",
" \n",
" # Process columns which export data\n",
" df = process_descriptions(df, export=True)\n",
" df = process_media(df, export=True)\n",
" df = process_info(df, export=True)\n",
" df = process_requirements(df, export=True)\n",
" \n",
" return df\n",
"\n",
"steam_data = process(raw_steam_data)\n",
"steam_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Steps\n",
"\n",
"That just about does is for cleaning the Steam data. Hopefully we have a cleaned dataset ready to be combined with the data downloaded from SteamSpy, once that is cleaned. \n",
"\n",
"Before we export the cleaned steam data, we'll check that we have eradicated missing values, and have a look at the memory footprint like we did earlier."
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name 0\n",
"steam_appid 0\n",
"required_age 0\n",
"platforms 0\n",
"categories 0\n",
"genres 0\n",
"achievements 0\n",
"release_date 0\n",
"price 0\n",
"english 0\n",
"developer 0\n",
"publisher 0\n",
"dtype: int64"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"steam_data.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 29235 entries, 0 to 29234\n",
"Columns: 39 entries, type to content_descriptors\n",
"dtypes: float64(2), int64(1), object(36)\n",
"memory usage: 390.0 MB\n"
]
}
],
"source": [
"raw_steam_data.info(verbose=False, memory_usage=\"deep\")"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 27332 entries, 0 to 29234\n",
"Columns: 12 entries, name to publisher\n",
"dtypes: category(1), datetime64[ns](1), float64(1), int64(3), object(6)\n",
"memory usage: 13.8 MB\n"
]
}
],
"source": [
"steam_data.info(verbose=False, memory_usage=\"deep\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we've taken care of all the null values, and the size of the dataset has been reduced considerably.\n",
"\n",
"We'll also check that no unreleased games have slipped through (data was scraped on or before 1st May, 2019)."
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" steam_appid | \n",
" required_age | \n",
" platforms | \n",
" categories | \n",
" genres | \n",
" achievements | \n",
" release_date | \n",
" price | \n",
" english | \n",
" developer | \n",
" publisher | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [name, steam_appid, required_age, platforms, categories, genres, achievements, release_date, price, english, developer, publisher]\n",
"Index: []"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"steam_data[steam_data['release_date'] > '2019-05-01']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we're happy with our dataframe we are ready to export to file and finish this part of the project."
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"steam_data.to_csv('../data/exports/steam_data_clean.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As I said at the beginning, data cleaning can be a very lengthy process. We definitely could have taken some shortcuts, and didn't have to go into so much depth, but I think it was interesting and valuable to go through the data as thoroughly as we did. A useful side effect of this is that we're now much more familiar with the data, and this should be incredibly useful when it comes to analysis. One of the most important prerequisites for data analysis is having good data, and the foundation of this is built upon solid data cleaning.\n",
"\n",
"Next time we'll take a look at optimising the processing of the `release_date` column, as mentioned, then move onto cleaning the SteamSpy data. Once that is complete we can begin exploring and analysing the data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}