{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SteamSpy Data Cleaning\n", "\n", "*This forms part of a larger series of posts for my [blog](http://nik-davis.github.io) on downloading, processing and analysing data from the steam store. [See all posts here](http://nik-davis.github.io/tag/steam).*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/json": { "Software versions": [ { "module": "Python", "version": "3.7.3 64bit [MSC v.1900 64 bit (AMD64)]" }, { "module": "IPython", "version": "7.5.0" }, { "module": "OS", "version": "Windows 10 10.0.18362 SP0" }, { "module": "numpy", "version": "1.16.3" }, { "module": "pandas", "version": "0.24.2" } ] }, "text/html": [ "
SoftwareVersion
Python3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython7.5.0
OSWindows 10 10.0.18362 SP0
numpy1.16.3
pandas0.24.2
Wed Jun 12 15:12:03 2019 GMT Summer Time
" ], "text/latex": [ "\\begin{tabular}{|l|l|}\\hline\n", "{\\bf Software} & {\\bf Version} \\\\ \\hline\\hline\n", "Python & 3.7.3 64bit [MSC v.1900 64 bit (AMD64)] \\\\ \\hline\n", "IPython & 7.5.0 \\\\ \\hline\n", "OS & Windows 10 10.0.18362 SP0 \\\\ \\hline\n", "numpy & 1.16.3 \\\\ \\hline\n", "pandas & 0.24.2 \\\\ \\hline\n", "\\hline \\multicolumn{2}{|l|}{Wed Jun 12 15:12:03 2019 GMT Summer Time} \\\\ \\hline\n", "\\end{tabular}\n" ], "text/plain": [ "Software versions\n", "Python 3.7.3 64bit [MSC v.1900 64 bit (AMD64)]\n", "IPython 7.5.0\n", "OS Windows 10 10.0.18362 SP0\n", "numpy 1.16.3\n", "pandas 0.24.2\n", "Wed Jun 12 15:12:03 2019 GMT Summer Time" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view software version information\n", "\n", "# http://raw.github.com/jrjohansson/version_information/master/version_information.py\n", "%load_ext version_information\n", "# %reload_ext version_information\n", "\n", "%version_information numpy, pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Welcome to the final part of the data cleaning process. Once we're finished here we'll be ready to move on to exploring and analysing the data.\n", "\n", "As a quick re-cap, so far we have downloaded information on games from the Steam Store via the Steam API and SteamSpy API. We have cleaned and processed the data from the Steam API, and in this section we'll walkthrough cleaning data downloaded from the SteamSpy API. The overall goal of this project is to collect, clean and analyse data from the Steam Store with the idea of advising a fictional game developer or company.\n", "\n", "The previous posts went into great depth about the decisions made and methods used. This post will still go over a number of decisions, but will be more in the style of a brief overview than full discussion.\n", "\n", "\n", "\n", "## Import Libraries and Inspect Data\n", "\n", "Let's begin by importing the necessary libraries and inspecting the raw downloaded data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# standard library imports\n", "from ast import literal_eval\n", "import itertools\n", "# import time\n", "# import re\n", "\n", "# third-party imports\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# customisations\n", "pd.set_option(\"max_columns\", 100)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
010Counter-StrikeValveValveNaN1245343339010,000,000 .. 20,000,0001761270931726999.0999.00.0English, French, German, Italian, Spanish - Sp...Action14923{'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...
120Team Fortress ClassicValveValveNaN331863305,000,000 .. 10,000,000277156215499.0499.00.0English, French, German, Italian, Spanish - Sp...Action87{'Action': 208, 'FPS': 188, 'Multiplayer': 172...
230Day of DefeatValveValveNaN341639805,000,000 .. 10,000,0001870340499.0499.00.0English, French, German, Italian, Spanish - SpainAction130{'FPS': 138, 'World War II': 122, 'Multiplayer...
340Deathmatch ClassicValveValveNaN127326705,000,000 .. 10,000,00025801840499.0499.00.0English, French, German, Italian, Spanish - Sp...Action4{'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...
450Half-Life: Opposing ForceGearbox SoftwareValveNaN525028805,000,000 .. 10,000,00062404150499.0499.00.0English, French, German, KoreanAction71{'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...
\n", "
" ], "text/plain": [ " appid name developer publisher score_rank \\\n", "0 10 Counter-Strike Valve Valve NaN \n", "1 20 Team Fortress Classic Valve Valve NaN \n", "2 30 Day of Defeat Valve Valve NaN \n", "3 40 Deathmatch Classic Valve Valve NaN \n", "4 50 Half-Life: Opposing Force Gearbox Software Valve NaN \n", "\n", " positive negative userscore owners average_forever \\\n", "0 124534 3339 0 10,000,000 .. 20,000,000 17612 \n", "1 3318 633 0 5,000,000 .. 10,000,000 277 \n", "2 3416 398 0 5,000,000 .. 10,000,000 187 \n", "3 1273 267 0 5,000,000 .. 10,000,000 258 \n", "4 5250 288 0 5,000,000 .. 10,000,000 624 \n", "\n", " average_2weeks median_forever median_2weeks price initialprice \\\n", "0 709 317 26 999.0 999.0 \n", "1 15 62 15 499.0 499.0 \n", "2 0 34 0 499.0 499.0 \n", "3 0 184 0 499.0 499.0 \n", "4 0 415 0 499.0 499.0 \n", "\n", " discount languages genre ccu \\\n", "0 0.0 English, French, German, Italian, Spanish - Sp... Action 14923 \n", "1 0.0 English, French, German, Italian, Spanish - Sp... Action 87 \n", "2 0.0 English, French, German, Italian, Spanish - Spain Action 130 \n", "3 0.0 English, French, German, Italian, Spanish - Sp... Action 4 \n", "4 0.0 English, French, German, Korean Action 71 \n", "\n", " tags \n", "0 {'Action': 2681, 'FPS': 2048, 'Multiplayer': 1... \n", "1 {'Action': 208, 'FPS': 188, 'Multiplayer': 172... \n", "2 {'FPS': 138, 'World War II': 122, 'Multiplayer... \n", "3 {'Action': 85, 'FPS': 71, 'Multiplayer': 58, '... \n", "4 {'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data = pd.read_csv('../data/raw/steamspy_data.csv')\n", "raw_steamspy_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of columns here that will overlap with columns we already dealt with in the Steam data. This is fine because it means we don't have to clean them here, and can simply use the columns from the other dataset.\n", "\n", "A few columns in this data that could be valuable are `positive`, `negative`, `userscore`, `owners` and `average_forever`. We'll begin by looking at the null counts for each column." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "appid 0\n", "name 5\n", "developer 197\n", "publisher 280\n", "score_rank 29177\n", "positive 0\n", "negative 0\n", "userscore 0\n", "owners 0\n", "average_forever 0\n", "average_2weeks 0\n", "median_forever 0\n", "median_2weeks 0\n", "price 29\n", "initialprice 22\n", "discount 22\n", "languages 94\n", "genre 152\n", "ccu 0\n", "tags 0\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the columns look fine, with only a few tens or hundreds of missing values, or none at all. The only column we'll remove for certain is the `score_rank` column, as it has far too many null values to be useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling missing values\n", "\n", "There are some columns we'll be removing, as they're present in the other dataset. Before we do we can investigate the missing values, to decide whether we should remove these rows first. We don't need to go through every column, but just look at a few individual ones.\n", "\n", "In the `name` column, we have five rows with missing data, and four more rows where the name is 'none'. We'll certainly remove all 9 of these rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
48117760NaNNaNNaNNaN0000 .. 20,00000000.00.00.0NaNNaN0[]
7235396420NaNNaNNaNNaN229020,000 .. 50,00000000.00.00.0NaNNaN0{'Free to Play': 24, 'Action': 21, 'Indie': 21...
9553460250NaNJeroen WimmersJeroen WimmersNaN44400 .. 20,0000000899.0899.00.0English, French, Italian, German, Spanish - Sp...Casual, Indie0{'Indie': 34, 'Casual': 24, 'Puzzle': 15, 'Min...
22244806160NaNPaleno GamesPaleno GamesNaN41300 .. 20,000000099.099.00.0English, French, Italian, German, Spanish - Sp...Action, Adventure, Casual, Indie0{'Action': 22, 'Indie': 22, 'Casual': 21, 'Adv...
27324965340NaN2nd Studio2nd StudioNaN31100 .. 20,0000000199.0199.00.0English, German, Danish, Japanese, Russian, Si...Action, Indie, Simulation0{'Indie': 32, 'Sexual Content': 31, 'Action': ...
\n", "
" ], "text/plain": [ " appid name developer publisher score_rank positive \\\n", "481 17760 NaN NaN NaN NaN 0 \n", "7235 396420 NaN NaN NaN NaN 22 \n", "9553 460250 NaN Jeroen Wimmers Jeroen Wimmers NaN 44 \n", "22244 806160 NaN Paleno Games Paleno Games NaN 4 \n", "27324 965340 NaN 2nd Studio 2nd Studio NaN 31 \n", "\n", " negative userscore owners average_forever average_2weeks \\\n", "481 0 0 0 .. 20,000 0 0 \n", "7235 9 0 20,000 .. 50,000 0 0 \n", "9553 4 0 0 .. 20,000 0 0 \n", "22244 13 0 0 .. 20,000 0 0 \n", "27324 1 0 0 .. 20,000 0 0 \n", "\n", " median_forever median_2weeks price initialprice discount \\\n", "481 0 0 0.0 0.0 0.0 \n", "7235 0 0 0.0 0.0 0.0 \n", "9553 0 0 899.0 899.0 0.0 \n", "22244 0 0 99.0 99.0 0.0 \n", "27324 0 0 199.0 199.0 0.0 \n", "\n", " languages \\\n", "481 NaN \n", "7235 NaN \n", "9553 English, French, Italian, German, Spanish - Sp... \n", "22244 English, French, Italian, German, Spanish - Sp... \n", "27324 English, German, Danish, Japanese, Russian, Si... \n", "\n", " genre ccu \\\n", "481 NaN 0 \n", "7235 NaN 0 \n", "9553 Casual, Indie 0 \n", "22244 Action, Adventure, Casual, Indie 0 \n", "27324 Action, Indie, Simulation 0 \n", "\n", " tags \n", "481 [] \n", "7235 {'Free to Play': 24, 'Action': 21, 'Indie': 21... \n", "9553 {'Indie': 34, 'Casual': 24, 'Puzzle': 15, 'Min... \n", "22244 {'Action': 22, 'Indie': 22, 'Casual': 21, 'Adv... \n", "27324 {'Indie': 32, 'Sexual Content': 31, 'Action': ... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data[raw_steamspy_data['name'].isnull()]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
4918339860noneNaNNaNNaN8432050,000 .. 100,000309032500.00.00.0EnglishAdventure, Indie, Massively Multiplayer, RPG, ...0{'Indie': 34, 'RPG': 33, 'Adventure': 32, 'Mas...
6779385020nonenoneNaNNaN1426020,000 .. 50,000262027200.00.00.0English, French, Italian, German, Spanish - Sp...Casual, Indie, Massively Multiplayer, Strategy0{'Indie': 32, 'Strategy': 31, 'Casual': 21, 'M...
7350398970nonenonenoneNaN1021020,000 .. 50,000259027200.00.00.0EnglishAdventure, Indie, Massively Multiplayer, RPG, ...0{'Massively Multiplayer': 27, 'Strategy': 23, ...
\n", "
" ], "text/plain": [ " appid name developer publisher score_rank positive negative \\\n", "4918 339860 none NaN NaN NaN 84 32 \n", "6779 385020 none none NaN NaN 14 26 \n", "7350 398970 none none none NaN 10 21 \n", "\n", " userscore owners average_forever average_2weeks \\\n", "4918 0 50,000 .. 100,000 309 0 \n", "6779 0 20,000 .. 50,000 262 0 \n", "7350 0 20,000 .. 50,000 259 0 \n", "\n", " median_forever median_2weeks price initialprice discount \\\n", "4918 325 0 0.0 0.0 0.0 \n", "6779 272 0 0.0 0.0 0.0 \n", "7350 272 0 0.0 0.0 0.0 \n", "\n", " languages \\\n", "4918 English \n", "6779 English, French, Italian, German, Spanish - Sp... \n", "7350 English \n", "\n", " genre ccu \\\n", "4918 Adventure, Indie, Massively Multiplayer, RPG, ... 0 \n", "6779 Casual, Indie, Massively Multiplayer, Strategy 0 \n", "7350 Adventure, Indie, Massively Multiplayer, RPG, ... 0 \n", "\n", " tags \n", "4918 {'Indie': 34, 'RPG': 33, 'Adventure': 32, 'Mas... \n", "6779 {'Indie': 32, 'Strategy': 31, 'Casual': 21, 'M... \n", "7350 {'Massively Multiplayer': 27, 'Strategy': 23, ... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data[raw_steamspy_data['name'] == 'none']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 197 rows with missing developer data, and 280 rows with missing publisher data. It's conceivable that a small studio may self-publish, but we'll say that a title must have a developer for it to be kept. We can remove the rows with missing developer data, but then we'll be removing both of these columns as this information is already provided in the Steam data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
8774439400Legends of Callasia DemoNaNNaNNaN0000 .. 20,0000000NaNNaNNaNNaNNaN0[]
56823130Mole ControlNaNNaNNaN0000 .. 20,00000000.00.00.0NaNNaN1[]
18399703610Space GirlsNaNNaNNaN7081020,000 .. 50,000000099.099.00.0English, RussianSexual Content, Nudity, Action, Casual, Indie0{'Nudity': 26, 'Sexual Content': 25, 'Action':...
2207247910Sniper Elite: Nazi Zombie Army 2NaNRebellionNaN30695540200,000 .. 500,000289033501499.01499.00.0English, French, Italian, German, Spanish - Sp...Action, Adventure23{'Zombies': 294, 'Action': 185, 'Co-op': 171, ...
15367619730Mad DriverNaNNaNNaN46480100,000 .. 200,0002870294099.099.00.0English, RussianCasual, Indie, Racing, Strategy0{'Strategy': 21, 'Indie': 21, 'Casual': 21, 'R...
\n", "
" ], "text/plain": [ " appid name developer publisher \\\n", "8774 439400 Legends of Callasia Demo NaN NaN \n", "568 23130 Mole Control NaN NaN \n", "18399 703610 Space Girls NaN NaN \n", "2207 247910 Sniper Elite: Nazi Zombie Army 2 NaN Rebellion \n", "15367 619730 Mad Driver NaN NaN \n", "\n", " score_rank positive negative userscore owners \\\n", "8774 NaN 0 0 0 0 .. 20,000 \n", "568 NaN 0 0 0 0 .. 20,000 \n", "18399 NaN 70 81 0 20,000 .. 50,000 \n", "2207 NaN 3069 554 0 200,000 .. 500,000 \n", "15367 NaN 46 48 0 100,000 .. 200,000 \n", "\n", " average_forever average_2weeks median_forever median_2weeks price \\\n", "8774 0 0 0 0 NaN \n", "568 0 0 0 0 0.0 \n", "18399 0 0 0 0 99.0 \n", "2207 289 0 335 0 1499.0 \n", "15367 287 0 294 0 99.0 \n", "\n", " initialprice discount \\\n", "8774 NaN NaN \n", "568 0.0 0.0 \n", "18399 99.0 0.0 \n", "2207 1499.0 0.0 \n", "15367 99.0 0.0 \n", "\n", " languages \\\n", "8774 NaN \n", "568 NaN \n", "18399 English, Russian \n", "2207 English, French, Italian, German, Spanish - Sp... \n", "15367 English, Russian \n", "\n", " genre ccu \\\n", "8774 NaN 0 \n", "568 NaN 1 \n", "18399 Sexual Content, Nudity, Action, Casual, Indie 0 \n", "2207 Action, Adventure 23 \n", "15367 Casual, Indie, Racing, Strategy 0 \n", "\n", " tags \n", "8774 [] \n", "568 [] \n", "18399 {'Nudity': 26, 'Sexual Content': 25, 'Action':... \n", "2207 {'Zombies': 294, 'Action': 185, 'Co-op': 171, ... \n", "15367 {'Strategy': 21, 'Indie': 21, 'Casual': 21, 'R... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data[raw_steamspy_data['developer'].isnull()].sample(5, random_state=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
22993827660Feminist Cyborg Vs Last guy(gay)NaNNaNNaN0000 .. 20,0000000NaNNaNNaNNaNNaN0[]
48918310Spectraball - DemoNaNNaNNaN0000 .. 20,00000000.00.00.0NaNNaN0[]
5347349260BlastZone 2Matt EdzengaNaNNaN5025020,000 .. 50,00019501950999.0999.00.0EnglishAction, Indie0{'Indie': 35, 'Action': 34, \"Shoot 'Em Up\": 18...
2493259530Savant - AscentD-Pad StudioNaNNaN24671790200,000 .. 500,00020902940199.0199.00.0English, Japanese, Korean, Norwegian, Simplifi...Action, Indie3{'Action': 116, 'Indie': 102, 'Great Soundtrac...
18285700580Rust - Staging BranchNaNNaNNaN0000 .. 20,00078060NaNNaNNaNNaNNaN0[]
\n", "
" ], "text/plain": [ " appid name developer publisher \\\n", "22993 827660 Feminist Cyborg Vs Last guy(gay) NaN NaN \n", "489 18310 Spectraball - Demo NaN NaN \n", "5347 349260 BlastZone 2 Matt Edzenga NaN \n", "2493 259530 Savant - Ascent D-Pad Studio NaN \n", "18285 700580 Rust - Staging Branch NaN NaN \n", "\n", " score_rank positive negative userscore owners \\\n", "22993 NaN 0 0 0 0 .. 20,000 \n", "489 NaN 0 0 0 0 .. 20,000 \n", "5347 NaN 50 25 0 20,000 .. 50,000 \n", "2493 NaN 2467 179 0 200,000 .. 500,000 \n", "18285 NaN 0 0 0 0 .. 20,000 \n", "\n", " average_forever average_2weeks median_forever median_2weeks price \\\n", "22993 0 0 0 0 NaN \n", "489 0 0 0 0 0.0 \n", "5347 195 0 195 0 999.0 \n", "2493 209 0 294 0 199.0 \n", "18285 78 0 6 0 NaN \n", "\n", " initialprice discount \\\n", "22993 NaN NaN \n", "489 0.0 0.0 \n", "5347 999.0 0.0 \n", "2493 199.0 0.0 \n", "18285 NaN NaN \n", "\n", " languages genre ccu \\\n", "22993 NaN NaN 0 \n", "489 NaN NaN 0 \n", "5347 English Action, Indie 0 \n", "2493 English, Japanese, Korean, Norwegian, Simplifi... Action, Indie 3 \n", "18285 NaN NaN 0 \n", "\n", " tags \n", "22993 [] \n", "489 [] \n", "5347 {'Indie': 35, 'Action': 34, \"Shoot 'Em Up\": 18... \n", "2493 {'Action': 116, 'Indie': 102, 'Great Soundtrac... \n", "18285 [] " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data[raw_steamspy_data['publisher'].isnull()].sample(5, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a small number of rows with missing price data. This is another column we don't need, (as well as `initialprice` and `discount`), as we already have that data available in the other dataset. \n", "\n", "It looks like these rows have a lot of other missing information as well, so if they haven't already been removed when cleaning other columns we'll remove them here." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
4261321210Blade Symphony BetaNaNNaNNaN0000 .. 20,0000000NaNNaNNaNNaNNaN0[]
18285700580Rust - Staging BranchNaNNaNNaN0000 .. 20,00078060NaNNaNNaNNaNNaN0[]
22496813350Ben 10 VRNaNNaNNaN0000 .. 20,0000000NaNNaNNaNNaNNaN0[]
15719630790PSYCHO-PASSNaNNaNNaN0000 .. 20,0000000NaNNaNNaNNaNNaN0[]
12076530940BIOHAZARD 7 resident evil グロテスクVer.NaNNaNNaN0000 .. 20,0000000NaNNaNNaNNaNNaN0[]
\n", "
" ], "text/plain": [ " appid name developer publisher \\\n", "4261 321210 Blade Symphony Beta NaN NaN \n", "18285 700580 Rust - Staging Branch NaN NaN \n", "22496 813350 Ben 10 VR NaN NaN \n", "15719 630790 PSYCHO-PASS NaN NaN \n", "12076 530940 BIOHAZARD 7 resident evil グロテスクVer. NaN NaN \n", "\n", " score_rank positive negative userscore owners \\\n", "4261 NaN 0 0 0 0 .. 20,000 \n", "18285 NaN 0 0 0 0 .. 20,000 \n", "22496 NaN 0 0 0 0 .. 20,000 \n", "15719 NaN 0 0 0 0 .. 20,000 \n", "12076 NaN 0 0 0 0 .. 20,000 \n", "\n", " average_forever average_2weeks median_forever median_2weeks price \\\n", "4261 0 0 0 0 NaN \n", "18285 78 0 6 0 NaN \n", "22496 0 0 0 0 NaN \n", "15719 0 0 0 0 NaN \n", "12076 0 0 0 0 NaN \n", "\n", " initialprice discount languages genre ccu tags \n", "4261 NaN NaN NaN NaN 0 [] \n", "18285 NaN NaN NaN NaN 0 [] \n", "22496 NaN NaN NaN NaN 0 [] \n", "15719 NaN NaN NaN NaN 0 [] \n", "12076 NaN NaN NaN NaN 0 [] " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data[raw_steamspy_data['price'].isnull()].sample(5, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again there are a small number of rows with missing language data. There are a couple of bigger titles here, but most are missing other data also. We'll go ahead and remove these rows and this should leave us with most troublesome rows dealt with." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamedeveloperpublisherscore_rankpositivenegativeuserscoreownersaverage_foreveraverage_2weeksmedian_forevermedian_2weekspriceinitialpricediscountlanguagesgenreccutags
2498350Strong Bad's Cool Game for Attractive People: ...Telltale GamesTelltale GamesNaN12500 .. 20,00000000.00.00.0NaNAdventure0{'Adventure': 20}
123394500Back to the Future: Ep 2 - Get Tannen!Telltale GamesTelltale GamesNaN13671620500,000 .. 1,000,0009109100.00.00.0NaNNaN2{'Adventure': 131, 'Point & Click': 116, 'Epis...
2120243180The HarvestNaNNaNNaN0000 .. 20,00000000.00.00.0NaNNaN0[]
75734390Football Manager 2011 DemoNaNNaNNaN0000 .. 20,00000000.00.00.0NaNNaN0[]
56923140KrissXNaNNaNNaN0000 .. 20,00000000.00.00.0NaNNaN1[]
\n", "
" ], "text/plain": [ " appid name \\\n", "249 8350 Strong Bad's Cool Game for Attractive People: ... \n", "1233 94500 Back to the Future: Ep 2 - Get Tannen! \n", "2120 243180 The Harvest \n", "757 34390 Football Manager 2011 Demo \n", "569 23140 KrissX \n", "\n", " developer publisher score_rank positive negative \\\n", "249 Telltale Games Telltale Games NaN 12 5 \n", "1233 Telltale Games Telltale Games NaN 1367 162 \n", "2120 NaN NaN NaN 0 0 \n", "757 NaN NaN NaN 0 0 \n", "569 NaN NaN NaN 0 0 \n", "\n", " userscore owners average_forever average_2weeks \\\n", "249 0 0 .. 20,000 0 0 \n", "1233 0 500,000 .. 1,000,000 91 0 \n", "2120 0 0 .. 20,000 0 0 \n", "757 0 0 .. 20,000 0 0 \n", "569 0 0 .. 20,000 0 0 \n", "\n", " median_forever median_2weeks price initialprice discount languages \\\n", "249 0 0 0.0 0.0 0.0 NaN \n", "1233 91 0 0.0 0.0 0.0 NaN \n", "2120 0 0 0.0 0.0 0.0 NaN \n", "757 0 0 0.0 0.0 0.0 NaN \n", "569 0 0 0.0 0.0 0.0 NaN \n", "\n", " genre ccu tags \n", "249 Adventure 0 {'Adventure': 20} \n", "1233 NaN 2 {'Adventure': 131, 'Point & Click': 116, 'Epis... \n", "2120 NaN 0 [] \n", "757 NaN 0 [] \n", "569 NaN 1 [] " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data[raw_steamspy_data['languages'].isnull()].sample(5, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove unwanted columns\n", "\n", "As mentioned, there are a number of columns present in the Steam dataset, so we can remove any of those with crossover. It makes sense that these exist because SteamSpy's data is retrieved by looking at Steam's data. We'll remove the `score_rank` column as it contains too many missing values, and we'll remove the `userscore` column as most rows have a value of 0 here, so it won't be useful." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 29177\n", "100 4\n", "84 4\n", "95 3\n", "55 3\n", "Name: userscore, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data['userscore'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also three columns - `average_2weeks`, `median_2weeks`, (average and median playtime in the last 2 weeks) and `ccu` (Peak conccurrent users yesterday) - that provide information only relevant to the time the data was collected. Going forward this isn't going to be useful, as we're not interested in how games were performing at that specific time, so we'll remove those." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "drop_cols = [\n", " 'score_rank', # too many missing values\n", " 'userscore', # too little variance (most have 0)\n", " 'genre', 'developer', 'publisher', 'price', 'initialprice', 'discount', # provided by Steam data\n", " 'average_2weeks', 'median_2weeks', 'ccu' # not interested in temporally specific columns\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One more column we could remove is the `languages` column, as we already have the `english` column in the Steam Data. Here in the SteamSpy data, languages is made up of a comma-separated list of all supported languages. This provides us with some more information so we can keep it for now." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 English, French, German, Italian, Spanish - Sp...\n", "1 English, French, German, Italian, Spanish - Sp...\n", "2 English, French, German, Italian, Spanish - Spain\n", "3 English, French, German, Italian, Spanish - Sp...\n", "4 English, French, German, Korean\n", "Name: languages, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_steamspy_data['languages'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process & Export Tags Column\n", "\n", "There are a couple of columns that require more specific handling, and the `tags` column is one of them. There will be a lot of overlap with the `genres` column, but the tags might provide just enough unique information to be worthwhile keeping.\n", "\n", "Below we separate out the tags data for experimentation, then have a look a the first few rows. Each tag for a game has an associated number, corresponding to the amount of votes that game has for that tag. A game's tags with higher votes means that more people think it is a relevant tag for that title." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Action': 2681, 'FPS': 2048, 'Multiplayer': 1659, 'Shooter': 1420, 'Classic': 1344, 'Team-Based': 943, 'First-Person': 799, 'Competitive': 790, 'Tactical': 734, \"1990's\": 564, 'e-sports': 550, 'PvP': 480, 'Military': 367, 'Strategy': 329, 'Score Attack': 200, 'Survival': 192, 'Old School': 164, 'Assassin': 151, '1980s': 144, 'Violent': 40}\n" ] }, { "data": { "text/plain": [ "0 {'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...\n", "1 {'Action': 208, 'FPS': 188, 'Multiplayer': 172...\n", "2 {'FPS': 138, 'World War II': 122, 'Multiplayer...\n", "3 {'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...\n", "4 {'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...\n", "Name: tags, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags = raw_steamspy_data['tags']\n", "\n", "print(tags[0])\n", "tags.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just by looking at the first row you can see that there are probably going to be far too many tags to include without disrupting the dataset. We have a couple of options here. We could include just the names of the top 3 or 5 tags for a game - which should be easy enough as it looks like the tags are already in order from most votes to least - or we could create new columns for each tag, and the number of votes as the row values, like this:\n", "\n", "appid | name | action | fps | multiplayer\n", "--- | --- | --- | --- | ---\n", "10 | Shoot 'em up | 2681 | 2048 | 0\n", "20 | Big Brawl | 208 | 0 | 172\n", "\n", "\n", "In a way, we'll do both.\n", "\n", "Because including the full tag information in the dataset may make it bulky and unwieldy, we'll only include the top three tags in the core dataset. As we did when cleaning the Steam Data, we'll also export the full tag data as a separate file. This means that if we want to use it as part of our analysis we can import and merge the data quite easily.\n", "\n", "Before we can do that, we have to figure out how to handle and process the data. As we've seen previously, the data in each row is stored as a string containing a dictionary, so we have to use `literal_eval` to get python to \"recognise\" the dictionary first." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2681\n", "2048\n" ] } ], "source": [ "eval_row = literal_eval(tags[0])\n", "\n", "print(eval_row['Action'])\n", "print(eval_row['FPS'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also have to figure out how to get a list of all the different unique tags from across the dataset. As each row only contains the tags that it has votes for, we don't know from the outset the tag names or even how many different tags there are.\n", "\n", "We could loop through the data and build a list of all the unique tag names using a dictionary or something similar, but instead we'll achieve the same thing much more easily using the [chain](https://docs.python.org/3/library/itertools.html?highlight=itertools#itertools.chain) function from itertools. As seen below, it will take a variety of iterables and unpack them into a single sequence (we have to use list() here to force the generator to evaluate). We can then use the set function to obtain just the unique values from the returned sequence.\n", "\n", "As you can see below, we can pass lists, tuples and dictionaries and they will all be handled. Also notice how only the key from the dictionary is returned (d) and not the value inside the key (e). This is really useful and means we can just pass the evaluated tags data as a series, and itertools will just extract the individual keys." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a', 'b', 'b', 'c', 'd']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "values = [\n", " ['a', 'b'], # list\n", " ('b', 'c'), # tuple\n", " {'d': 'e'} # dictionary\n", "]\n", "\n", "list(itertools.chain(*values))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'a', 'b', 'c', 'd'}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(itertools.chain(*values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we tried to define a function now, we might run into a problem. It turns out not all of the rows have data stored as dictionaries. Seen below, there are a few hundred rows with an empty list." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "656" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags[tags == '[]'].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This doesn't cause an issue when we generate our set of unique tag names, as any empty lists are ignored." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique tags: 371\n", "\n", "First few tags: ['1980s', \"1990's\", '2.5D', '2D', '2D Fighter']\n" ] } ], "source": [ "parsed_tags = tags.apply(lambda x: literal_eval(x))\n", "\n", "cols = set(itertools.chain(*parsed_tags))\n", "\n", "print('Number of unique tags:', len(cols))\n", "print('\\nFirst few tags:', sorted(list(cols))[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However to create the new dataframe from the tags and votes, we need to look inside the dictionaries in each row. We'll expand the logic to parse the tags using the `isintance` function to check each row for its type. If it is a dictionary, we return the dictionary. If it is a list, we'll return an empty dictionary `{}`. We'll also include an extra check in case there is anything else in the rows we haven't found yet. \n", "\n", "Next, we loop through the tags stored in `cols`, standardise the format of the new column names, then extract the vote count of each tag by checking if the tag name is in each row's dictionary of tags." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1980s1990s2.5d2d2d_fighter360_video3d3d_platformer3d_vision4_player_local4x6dofatvabstractactionaction_rpgaction_adventureaddictiveadventureagriculturealiensalternate_historyamericaanimation_&_modelinganimearcadearena_shooterartificial_intelligenceassassinasynchronous_multiplayeratmosphericaudio_productionbmxbase_buildingbaseballbased_on_a_novelbasketballbatmanbattle_royalebeat_em_upbeautifulbenchmarkbikesbloodboard_gamebowlingbuildingbullet_hellbullet_timecrpg...text_basedthird_personthird_person_shooterthrillertime_attacktime_managementtime_manipulationtime_traveltop_downtop_down_shootertouch_friendlytower_defensetrackirtradingtrading_card_gametrainstranshumanismturn_basedturn_based_combatturn_based_strategyturn_based_tacticstutorialtwin_stick_shootertypingundergroundunderwaterunforgivingutilitiesvrvr_onlyvampirevideo_productionvillain_protagonistviolentvisual_novelvoice_controlvoxelwalking_simulatorwarwargamewarhammer_40kweb_publishingwerewolveswesternword_gameworld_war_iworld_war_iiwrestlingzombiese_sports
014456400000000000026810000000000000151000000000000000000000...00000000000000000000000000000000040000000000000000550
1071000000000000208000150000000000000000000000000000000...000000000000000000000000000000000260000000000000000
2000000000000009900000000000000000000000000000000000...00000000000000000000000000000000000000800000005122000
30000000000000085000000000002200000000000000000000000...00000000000000000000000000000000000000000000000000
4077000000000000211000870122000000000730000000000000000000...00000000000000000000000000000000000000000000000000
\n", "

5 rows × 371 columns

\n", "
" ], "text/plain": [ " 1980s 1990s 2.5d 2d 2d_fighter 360_video 3d 3d_platformer \\\n", "0 144 564 0 0 0 0 0 0 \n", "1 0 71 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 \n", "4 0 77 0 0 0 0 0 0 \n", "\n", " 3d_vision 4_player_local 4x 6dof atv abstract action action_rpg \\\n", "0 0 0 0 0 0 0 2681 0 \n", "1 0 0 0 0 0 0 208 0 \n", "2 0 0 0 0 0 0 99 0 \n", "3 0 0 0 0 0 0 85 0 \n", "4 0 0 0 0 0 0 211 0 \n", "\n", " action_adventure addictive adventure agriculture aliens \\\n", "0 0 0 0 0 0 \n", "1 0 0 15 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 87 0 122 \n", "\n", " alternate_history america animation_&_modeling anime arcade \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " arena_shooter artificial_intelligence assassin asynchronous_multiplayer \\\n", "0 0 0 151 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 22 0 0 0 \n", "4 0 0 0 0 \n", "\n", " atmospheric audio_production bmx base_building baseball \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 73 0 0 0 0 \n", "\n", " based_on_a_novel basketball batman battle_royale beat_em_up beautiful \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " benchmark bikes blood board_game bowling building bullet_hell \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " bullet_time crpg ... text_based third_person third_person_shooter \\\n", "0 0 0 ... 0 0 0 \n", "1 0 0 ... 0 0 0 \n", "2 0 0 ... 0 0 0 \n", "3 0 0 ... 0 0 0 \n", "4 0 0 ... 0 0 0 \n", "\n", " thriller time_attack time_management time_manipulation time_travel \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " top_down top_down_shooter touch_friendly tower_defense trackir \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " trading trading_card_game trains transhumanism turn_based \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " turn_based_combat turn_based_strategy turn_based_tactics tutorial \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 0 0 \n", "\n", " twin_stick_shooter typing underground underwater unforgiving \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " utilities vr vr_only vampire video_production villain_protagonist \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " violent visual_novel voice_control voxel walking_simulator war \\\n", "0 40 0 0 0 0 0 \n", "1 26 0 0 0 0 0 \n", "2 0 0 0 0 0 80 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " wargame warhammer_40k web_publishing werewolves western word_game \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " world_war_i world_war_ii wrestling zombies e_sports \n", "0 0 0 0 0 550 \n", "1 0 0 0 0 0 \n", "2 5 122 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", "[5 rows x 371 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def parse_tags(x):\n", " x = literal_eval(x)\n", " \n", " if isinstance(x, dict):\n", " return x\n", " elif isinstance(x, list):\n", " return {}\n", " else:\n", " raise TypeError('Something other than dict or list found')\n", "\n", "parsed_tags = tags.apply(parse_tags)\n", " \n", "tag_data = pd.DataFrame()\n", "\n", "for col in sorted(cols):\n", " # standardise column names\n", " col_name = col.lower().replace(' ', '_').replace('-', '_').replace(\"'\", \"\")\n", "\n", " # check if column in row's dictionary of tags and return that value if it is, or 0 if it isn't\n", " tag_data[col_name] = parsed_tags.apply(lambda x: x[col] if col in x.keys() else 0)\n", "\n", "tag_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll have to add in the AppIDs, but apart from that this makes up the tag data that we'll export. Next we have to figure out how to extract just the top three tags for use in the core dataset.\n", "\n", "It turns out this is a relatively simple process. As the keys are in order of highest votes to least, we can just extract the first 3 keys for each row. This may not be the safest approach, as dictionaries don't necessarily preserve key order, but it's fine for our purposes. We can then join these keys on a semicolon to create a list of the tags." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Action;FPS;Multiplayer\n", "1 Action;FPS;Multiplayer\n", "2 FPS;World War II;Multiplayer\n", "3 Action;FPS;Multiplayer\n", "4 FPS;Action;Sci-fi\n", "Name: tags, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def parse_tags(x):\n", " x = literal_eval(x)\n", "\n", " if isinstance(x, dict):\n", " return ';'.join(list(x.keys())[:3])\n", " else:\n", " return np.nan\n", " \n", "tags.apply(parse_tags).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handle Owners Column\n", "\n", "The final column we'll look at before defining a function to perform the cleaning is the `owners` column. This column is made up of the lower and upper bound of an estimation for the amount of owners for each title. For privacy reasons, SteamSpy can't get exact figures from Steam (though it [used to](https://www.polygon.com/2018/4/12/17229752/steam-spy-charts-new-privacy-rules-valve)), so we'll have to make do with the figures we have here." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10,000,000 .. 20,000,000\n", "1 5,000,000 .. 10,000,000\n", "2 5,000,000 .. 10,000,000\n", "3 5,000,000 .. 10,000,000\n", "4 5,000,000 .. 10,000,000\n", "Name: owners, dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners = raw_steamspy_data['owners']\n", "owners.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a couple of options for how to deal with the data. We could remove the commas and split the data, keeping the lower or upper bound." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10000000\n", "1 5000000\n", "2 5000000\n", "3 5000000\n", "4 5000000\n", "Name: owners, dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners_split = owners.str.replace(',', '').str.split(' .. ')\n", "owners_split.apply(lambda x: int(x[0])).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could calculate the mid-point between the points, and keep that." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 15000000\n", "1 7500000\n", "2 7500000\n", "3 7500000\n", "4 7500000\n", "Name: owners, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners_split.apply(lambda x: (int(x[0]) + int(x[1])) // 2).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we could reformat the data slightly, but pretty much keep it as is." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 10000000-20000000\n", "1 5000000-10000000\n", "2 5000000-10000000\n", "3 5000000-10000000\n", "4 5000000-10000000\n", "Name: owners, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners.str.replace(',', '').str.replace(' .. ', '-').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll go with this last option. It allows us to keep the ranges and also easily modify the data in the future." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Function\n", "\n", "We're now ready to define the functions to clean the data. We'll wrap all the logic inside a `process` function, and for most of the columns there isn't much code to write so we can confine it here. The only logic we'll separate out is for processing the tags column, as it is a more complex endeavour.\n", "\n", "In the next code cell we clean the data, export the full tags data to file, and inspect the clean dataframe. After that we inspect the exported data, verifying everything went as planned." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Exported tag data to '../data/exports/steamspy_tag_data.csv'\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamepositivenegativeownersaverage_forevermedian_foreverlanguagestags
010Counter-Strike124534333910000000-2000000017612317English, French, German, Italian, Spanish - Sp...Action;FPS;Multiplayer
120Team Fortress Classic33186335000000-1000000027762English, French, German, Italian, Spanish - Sp...Action;FPS;Multiplayer
230Day of Defeat34163985000000-1000000018734English, French, German, Italian, Spanish - SpainFPS;World War II;Multiplayer
340Deathmatch Classic12732675000000-10000000258184English, French, German, Italian, Spanish - Sp...Action;FPS;Multiplayer
450Half-Life: Opposing Force52502885000000-10000000624415English, French, German, KoreanFPS;Action;Sci-fi
\n", "
" ], "text/plain": [ " appid name positive negative owners \\\n", "0 10 Counter-Strike 124534 3339 10000000-20000000 \n", "1 20 Team Fortress Classic 3318 633 5000000-10000000 \n", "2 30 Day of Defeat 3416 398 5000000-10000000 \n", "3 40 Deathmatch Classic 1273 267 5000000-10000000 \n", "4 50 Half-Life: Opposing Force 5250 288 5000000-10000000 \n", "\n", " average_forever median_forever \\\n", "0 17612 317 \n", "1 277 62 \n", "2 187 34 \n", "3 258 184 \n", "4 624 415 \n", "\n", " languages \\\n", "0 English, French, German, Italian, Spanish - Sp... \n", "1 English, French, German, Italian, Spanish - Sp... \n", "2 English, French, German, Italian, Spanish - Spain \n", "3 English, French, German, Italian, Spanish - Sp... \n", "4 English, French, German, Korean \n", "\n", " tags \n", "0 Action;FPS;Multiplayer \n", "1 Action;FPS;Multiplayer \n", "2 FPS;World War II;Multiplayer \n", "3 Action;FPS;Multiplayer \n", "4 FPS;Action;Sci-fi " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def process_tags(df, export=False):\n", " if export: \n", " \n", " tag_data = df[['appid', 'tags']].copy()\n", " \n", " def parse_export_tags(x):\n", " x = literal_eval(x)\n", "\n", " if isinstance(x, dict):\n", " return x\n", " elif isinstance(x, list):\n", " return {}\n", " else:\n", " raise TypeError('Something other than dict or list found')\n", "\n", " tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)\n", "\n", " cols = set(itertools.chain(*tag_data['tags']))\n", "\n", " for col in sorted(cols):\n", " col_name = col.lower().replace(' ', '_').replace('-', '_').replace(\"'\", \"\")\n", "\n", " tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)\n", "\n", " tag_data = tag_data.drop('tags', axis=1)\n", "\n", " tag_data.to_csv('../data/exports/steamspy_tag_data.csv', index=False)\n", " print(\"Exported tag data to '../data/exports/steamspy_tag_data.csv'\")\n", " \n", " \n", " def parse_tags(x):\n", " x = literal_eval(x)\n", " \n", " if isinstance(x, dict):\n", " return ';'.join(list(x.keys())[:3])\n", " else:\n", " return np.nan\n", " \n", " df['tags'] = df['tags'].apply(parse_tags)\n", " \n", " # rows with null tags seem to be superseded by newer release, so remove (e.g. dead island)\n", " df = df[df['tags'].notnull()]\n", " \n", " return df\n", "\n", "\n", "def process(df):\n", " df = df.copy()\n", " \n", " # handle missing values\n", " df = df[(df['name'].notnull()) & (df['name'] != 'none')]\n", " df = df[df['developer'].notnull()]\n", " df = df[df['languages'].notnull()]\n", " df = df[df['price'].notnull()]\n", " \n", " # remove unwanted columns\n", " df = df.drop([\n", " 'genre', 'developer', 'publisher', 'score_rank', 'userscore', 'average_2weeks',\n", " 'median_2weeks', 'price', 'initialprice', 'discount', 'ccu'\n", " ], axis=1)\n", " \n", " # keep top tags, exporting full tag data to file\n", " df = process_tags(df, export=True)\n", " \n", " # reformat owners column\n", " df['owners'] = df['owners'].str.replace(',', '').str.replace(' .. ', '-')\n", " \n", " return df\n", "\n", "\n", "steamspy_data = process(raw_steamspy_data)\n", "steamspy_data.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appid1980s1990s2.5d2d2d_fighter360_video3d3d_platformer3d_vision4_player_local4x6dofatvabstractactionaction_rpgaction_adventureaddictiveadventureagriculturealiensalternate_historyamericaanimation_&_modelinganimearcadearena_shooterartificial_intelligenceassassinasynchronous_multiplayeratmosphericaudio_productionbmxbase_buildingbaseballbased_on_a_novelbasketballbatmanbattle_royalebeat_em_upbeautifulbenchmarkbikesbloodboard_gamebowlingbuildingbullet_hellbullet_time...text_basedthird_personthird_person_shooterthrillertime_attacktime_managementtime_manipulationtime_traveltop_downtop_down_shootertouch_friendlytower_defensetrackirtradingtrading_card_gametrainstranshumanismturn_basedturn_based_combatturn_based_strategyturn_based_tacticstutorialtwin_stick_shootertypingundergroundunderwaterunforgivingutilitiesvrvr_onlyvampirevideo_productionvillain_protagonistviolentvisual_novelvoice_controlvoxelwalking_simulatorwarwargamewarhammer_40kweb_publishingwerewolveswesternword_gameworld_war_iworld_war_iiwrestlingzombiese_sports
0101445640000000000002681000000000000015100000000000000000000...00000000000000000000000000000000040000000000000000550
12007100000000000020800015000000000000000000000000000000...000000000000000000000000000000000260000000000000000
23000000000000000990000000000000000000000000000000000...00000000000000000000000000000000000000800000005122000
340000000000000008500000000000220000000000000000000000...00000000000000000000000000000000000000000000000000
45007700000000000021100087012200000000073000000000000000000...00000000000000000000000000000000000000000000000000
\n", "

5 rows × 372 columns

\n", "
" ], "text/plain": [ " appid 1980s 1990s 2.5d 2d 2d_fighter 360_video 3d 3d_platformer \\\n", "0 10 144 564 0 0 0 0 0 0 \n", "1 20 0 71 0 0 0 0 0 0 \n", "2 30 0 0 0 0 0 0 0 0 \n", "3 40 0 0 0 0 0 0 0 0 \n", "4 50 0 77 0 0 0 0 0 0 \n", "\n", " 3d_vision 4_player_local 4x 6dof atv abstract action action_rpg \\\n", "0 0 0 0 0 0 0 2681 0 \n", "1 0 0 0 0 0 0 208 0 \n", "2 0 0 0 0 0 0 99 0 \n", "3 0 0 0 0 0 0 85 0 \n", "4 0 0 0 0 0 0 211 0 \n", "\n", " action_adventure addictive adventure agriculture aliens \\\n", "0 0 0 0 0 0 \n", "1 0 0 15 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 87 0 122 \n", "\n", " alternate_history america animation_&_modeling anime arcade \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " arena_shooter artificial_intelligence assassin asynchronous_multiplayer \\\n", "0 0 0 151 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 22 0 0 0 \n", "4 0 0 0 0 \n", "\n", " atmospheric audio_production bmx base_building baseball \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 73 0 0 0 0 \n", "\n", " based_on_a_novel basketball batman battle_royale beat_em_up beautiful \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " benchmark bikes blood board_game bowling building bullet_hell \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " bullet_time ... text_based third_person third_person_shooter thriller \\\n", "0 0 ... 0 0 0 0 \n", "1 0 ... 0 0 0 0 \n", "2 0 ... 0 0 0 0 \n", "3 0 ... 0 0 0 0 \n", "4 0 ... 0 0 0 0 \n", "\n", " time_attack time_management time_manipulation time_travel top_down \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " top_down_shooter touch_friendly tower_defense trackir trading \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " trading_card_game trains transhumanism turn_based turn_based_combat \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " turn_based_strategy turn_based_tactics tutorial twin_stick_shooter \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 0 0 \n", "\n", " typing underground underwater unforgiving utilities vr vr_only \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " vampire video_production villain_protagonist violent visual_novel \\\n", "0 0 0 0 40 0 \n", "1 0 0 0 26 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " voice_control voxel walking_simulator war wargame warhammer_40k \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 80 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " web_publishing werewolves western word_game world_war_i world_war_ii \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 5 122 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " wrestling zombies e_sports \n", "0 0 0 550 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", "[5 rows x 372 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect tag data\n", "pd.read_csv('../data/exports/steamspy_tag_data.csv').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge and Export Clean Data\n", "\n", "Before we wrap up, we'll check for any missing values and export the SteamSpy data to its own csv file. Then we'll look at combining the previously cleaned Steam data and the newly cleaned SteamSpy data into one cohesive dataset. This will give us options and flexibility for analysis in the future, and means we will have one csv file to begin looking at." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "appid 0\n", "name 0\n", "positive 0\n", "negative 0\n", "owners 0\n", "average_forever 0\n", "median_forever 0\n", "languages 0\n", "tags 0\n", "dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "steamspy_data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "steamspy_data.to_csv('../data/exports/steamspy_clean.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To combine the dataframes we can use the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) function, specifying which columns to merge upon. We'll keep the default `how='inner'` behaviour, meaning we only keep rows with data in both dataframes." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesteam_appidrequired_ageplatformscategoriesgenresachievementsrelease_datepriceenglishdeveloperpublisherappidname_steamspypositivenegativeownersaverage_forevermedian_foreverlanguagestags
0Counter-Strike100windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action02000-11-017.191ValveValve10Counter-Strike124534333910000000-2000000017612317English, French, German, Italian, Spanish - Sp...Action;FPS;Multiplayer
1Team Fortress Classic200windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action01999-04-013.991ValveValve20Team Fortress Classic33186335000000-1000000027762English, French, German, Italian, Spanish - Sp...Action;FPS;Multiplayer
2Day of Defeat300windows;mac;linuxMulti-player;Valve Anti-Cheat enabledAction02003-05-013.991ValveValve30Day of Defeat34163985000000-1000000018734English, French, German, Italian, Spanish - SpainFPS;World War II;Multiplayer
3Deathmatch Classic400windows;mac;linuxMulti-player;Online Multi-Player;Local Multi-P...Action02001-06-013.991ValveValve40Deathmatch Classic12732675000000-10000000258184English, French, German, Italian, Spanish - Sp...Action;FPS;Multiplayer
4Half-Life: Opposing Force500windows;mac;linuxSingle-player;Multi-player;Valve Anti-Cheat en...Action01999-11-013.991Gearbox SoftwareValve50Half-Life: Opposing Force52502885000000-10000000624415English, French, German, KoreanFPS;Action;Sci-fi
\n", "
" ], "text/plain": [ " name steam_appid required_age platforms \\\n", "0 Counter-Strike 10 0 windows;mac;linux \n", "1 Team Fortress Classic 20 0 windows;mac;linux \n", "2 Day of Defeat 30 0 windows;mac;linux \n", "3 Deathmatch Classic 40 0 windows;mac;linux \n", "4 Half-Life: Opposing Force 50 0 windows;mac;linux \n", "\n", " categories genres achievements \\\n", "0 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "1 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "2 Multi-player;Valve Anti-Cheat enabled Action 0 \n", "3 Multi-player;Online Multi-Player;Local Multi-P... Action 0 \n", "4 Single-player;Multi-player;Valve Anti-Cheat en... Action 0 \n", "\n", " release_date price english developer publisher appid \\\n", "0 2000-11-01 7.19 1 Valve Valve 10 \n", "1 1999-04-01 3.99 1 Valve Valve 20 \n", "2 2003-05-01 3.99 1 Valve Valve 30 \n", "3 2001-06-01 3.99 1 Valve Valve 40 \n", "4 1999-11-01 3.99 1 Gearbox Software Valve 50 \n", "\n", " name_steamspy positive negative owners \\\n", "0 Counter-Strike 124534 3339 10000000-20000000 \n", "1 Team Fortress Classic 3318 633 5000000-10000000 \n", "2 Day of Defeat 3416 398 5000000-10000000 \n", "3 Deathmatch Classic 1273 267 5000000-10000000 \n", "4 Half-Life: Opposing Force 5250 288 5000000-10000000 \n", "\n", " average_forever median_forever \\\n", "0 17612 317 \n", "1 277 62 \n", "2 187 34 \n", "3 258 184 \n", "4 624 415 \n", "\n", " languages \\\n", "0 English, French, German, Italian, Spanish - Sp... \n", "1 English, French, German, Italian, Spanish - Sp... \n", "2 English, French, German, Italian, Spanish - Spain \n", "3 English, French, German, Italian, Spanish - Sp... \n", "4 English, French, German, Korean \n", "\n", " tags \n", "0 Action;FPS;Multiplayer \n", "1 Action;FPS;Multiplayer \n", "2 FPS;World War II;Multiplayer \n", "3 Action;FPS;Multiplayer \n", "4 FPS;Action;Sci-fi " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "steam_data = pd.read_csv('../data/exports/steam_data_clean.csv')\n", " \n", "merged = steam_data.merge(steamspy_data, left_on='steam_appid', right_on='appid', suffixes=('', '_steamspy'))\n", "merged.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can remove, reorder and rename some columns before exporting the complete, clean dataset.\n", "\n", "We'll remove the SteamSpy rows with overlap - name, languages and AppID. Next we reindex the dataframe, reordering the columns so it makes a little more logical sense to look at, and then we rename some of the columns to make it a bit clearer what they describe. \n", "\n", "This is a quick and simple step, but as we'll be making this data available for public use it's a very good idea to make it as user-friendly as possible." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
appidnamerelease_dateenglishdeveloperpublisherplatformsrequired_agecategoriesgenressteamspy_tagsachievementspositive_ratingsnegative_ratingsaverage_playtimemedian_playtimeownersprice
010Counter-Strike2000-11-011ValveValvewindows;mac;linux0Multi-player;Online Multi-Player;Local Multi-P...ActionAction;FPS;Multiplayer012453433391761231710000000-200000007.19
120Team Fortress Classic1999-04-011ValveValvewindows;mac;linux0Multi-player;Online Multi-Player;Local Multi-P...ActionAction;FPS;Multiplayer03318633277625000000-100000003.99
230Day of Defeat2003-05-011ValveValvewindows;mac;linux0Multi-player;Valve Anti-Cheat enabledActionFPS;World War II;Multiplayer03416398187345000000-100000003.99
340Deathmatch Classic2001-06-011ValveValvewindows;mac;linux0Multi-player;Online Multi-Player;Local Multi-P...ActionAction;FPS;Multiplayer012732672581845000000-100000003.99
450Half-Life: Opposing Force1999-11-011Gearbox SoftwareValvewindows;mac;linux0Single-player;Multi-player;Valve Anti-Cheat en...ActionFPS;Action;Sci-fi052502886244155000000-100000003.99
\n", "
" ], "text/plain": [ " appid name release_date english developer \\\n", "0 10 Counter-Strike 2000-11-01 1 Valve \n", "1 20 Team Fortress Classic 1999-04-01 1 Valve \n", "2 30 Day of Defeat 2003-05-01 1 Valve \n", "3 40 Deathmatch Classic 2001-06-01 1 Valve \n", "4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software \n", "\n", " publisher platforms required_age \\\n", "0 Valve windows;mac;linux 0 \n", "1 Valve windows;mac;linux 0 \n", "2 Valve windows;mac;linux 0 \n", "3 Valve windows;mac;linux 0 \n", "4 Valve windows;mac;linux 0 \n", "\n", " categories genres \\\n", "0 Multi-player;Online Multi-Player;Local Multi-P... Action \n", "1 Multi-player;Online Multi-Player;Local Multi-P... Action \n", "2 Multi-player;Valve Anti-Cheat enabled Action \n", "3 Multi-player;Online Multi-Player;Local Multi-P... Action \n", "4 Single-player;Multi-player;Valve Anti-Cheat en... Action \n", "\n", " steamspy_tags achievements positive_ratings \\\n", "0 Action;FPS;Multiplayer 0 124534 \n", "1 Action;FPS;Multiplayer 0 3318 \n", "2 FPS;World War II;Multiplayer 0 3416 \n", "3 Action;FPS;Multiplayer 0 1273 \n", "4 FPS;Action;Sci-fi 0 5250 \n", "\n", " negative_ratings average_playtime median_playtime owners \\\n", "0 3339 17612 317 10000000-20000000 \n", "1 633 277 62 5000000-10000000 \n", "2 398 187 34 5000000-10000000 \n", "3 267 258 184 5000000-10000000 \n", "4 288 624 415 5000000-10000000 \n", "\n", " price \n", "0 7.19 \n", "1 3.99 \n", "2 3.99 \n", "3 3.99 \n", "4 3.99 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove overlapping columns\n", "steam_clean = merged.drop(['name_steamspy', 'languages', 'steam_appid'], axis=1)\n", "\n", "# reindex to reorder columns\n", "steam_clean = steam_clean[[\n", " 'appid',\n", " 'name',\n", " 'release_date',\n", " 'english',\n", " 'developer',\n", " 'publisher',\n", " 'platforms',\n", " 'required_age',\n", " 'categories',\n", " 'genres',\n", " 'tags',\n", " 'achievements',\n", " 'positive',\n", " 'negative',\n", " 'average_forever',\n", " 'median_forever',\n", " 'owners',\n", " 'price'\n", "]]\n", "\n", "steam_clean = steam_clean.rename({\n", " 'tags': 'steamspy_tags',\n", " 'positive': 'positive_ratings',\n", " 'negative': 'negative_ratings',\n", " 'average_forever': 'average_playtime',\n", " 'median_forever': 'median_playtime'\n", "}, axis=1)\n", "\n", "steam_clean.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# export clean dataset\n", "steam_clean.to_csv('../data/steam_clean.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And with that, we have come to the end of the cleaning process. It's been a long journey and hopefully you've picked up something useful along the way, and now we can look ahead to analysing the data, creating some visualisations, and gaining some insights into the multi-billion dollar industry that is video games." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }