{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SteamSpy Data Cleaning\n",
    "\n",
    "*This forms part of a larger series of posts for my [blog](http://nik-davis.github.io) on downloading, processing and analysing data from the steam store. [See all posts here](http://nik-davis.github.io/tag/steam).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/json": {
       "Software versions": [
        {
         "module": "Python",
         "version": "3.7.3 64bit [MSC v.1900 64 bit (AMD64)]"
        },
        {
         "module": "IPython",
         "version": "7.5.0"
        },
        {
         "module": "OS",
         "version": "Windows 10 10.0.18362 SP0"
        },
        {
         "module": "numpy",
         "version": "1.16.3"
        },
        {
         "module": "pandas",
         "version": "0.24.2"
        }
       ]
      },
      "text/html": [
       "<table><tr><th>Software</th><th>Version</th></tr><tr><td>Python</td><td>3.7.3 64bit [MSC v.1900 64 bit (AMD64)]</td></tr><tr><td>IPython</td><td>7.5.0</td></tr><tr><td>OS</td><td>Windows 10 10.0.18362 SP0</td></tr><tr><td>numpy</td><td>1.16.3</td></tr><tr><td>pandas</td><td>0.24.2</td></tr><tr><td colspan='2'>Wed Jun 12 15:12:03 2019 GMT Summer Time</td></tr></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{|l|l|}\\hline\n",
       "{\\bf Software} & {\\bf Version} \\\\ \\hline\\hline\n",
       "Python & 3.7.3 64bit [MSC v.1900 64 bit (AMD64)] \\\\ \\hline\n",
       "IPython & 7.5.0 \\\\ \\hline\n",
       "OS & Windows 10 10.0.18362 SP0 \\\\ \\hline\n",
       "numpy & 1.16.3 \\\\ \\hline\n",
       "pandas & 0.24.2 \\\\ \\hline\n",
       "\\hline \\multicolumn{2}{|l|}{Wed Jun 12 15:12:03 2019 GMT Summer Time} \\\\ \\hline\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "Software versions\n",
       "Python 3.7.3 64bit [MSC v.1900 64 bit (AMD64)]\n",
       "IPython 7.5.0\n",
       "OS Windows 10 10.0.18362 SP0\n",
       "numpy 1.16.3\n",
       "pandas 0.24.2\n",
       "Wed Jun 12 15:12:03 2019 GMT Summer Time"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view software version information\n",
    "\n",
    "# http://raw.github.com/jrjohansson/version_information/master/version_information.py\n",
    "%load_ext version_information\n",
    "# %reload_ext version_information\n",
    "\n",
    "%version_information numpy, pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- PELICAN_BEGIN_SUMMARY -->\n",
    "\n",
    "Welcome to the final part of the data cleaning process. Once we're finished here we'll be ready to move on to exploring and analysing the data.\n",
    "\n",
    "As a quick re-cap, so far we have downloaded information on games from the Steam Store via the Steam API and SteamSpy API. We have cleaned and processed the data from the Steam API, and in this section we'll walkthrough cleaning data downloaded from the SteamSpy API. The overall goal of this project is to collect, clean and analyse data from the Steam Store with the idea of advising a fictional game developer or company.\n",
    "\n",
    "The previous posts went into great depth about the decisions made and methods used. This post will still go over a number of decisions, but will be more in the style of a brief overview than full discussion.\n",
    "\n",
    "<!-- PELICAN_END_SUMMARY -->\n",
    "\n",
    "## Import Libraries and Inspect Data\n",
    "\n",
    "Let's begin by importing the necessary libraries and inspecting the raw downloaded data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# standard library imports\n",
    "from ast import literal_eval\n",
    "import itertools\n",
    "# import time\n",
    "# import re\n",
    "\n",
    "# third-party imports\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# customisations\n",
    "pd.set_option(\"max_columns\", 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>Counter-Strike</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>NaN</td>\n",
       "      <td>124534</td>\n",
       "      <td>3339</td>\n",
       "      <td>0</td>\n",
       "      <td>10,000,000 .. 20,000,000</td>\n",
       "      <td>17612</td>\n",
       "      <td>709</td>\n",
       "      <td>317</td>\n",
       "      <td>26</td>\n",
       "      <td>999.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action</td>\n",
       "      <td>14923</td>\n",
       "      <td>{'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>Team Fortress Classic</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3318</td>\n",
       "      <td>633</td>\n",
       "      <td>0</td>\n",
       "      <td>5,000,000 .. 10,000,000</td>\n",
       "      <td>277</td>\n",
       "      <td>15</td>\n",
       "      <td>62</td>\n",
       "      <td>15</td>\n",
       "      <td>499.0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action</td>\n",
       "      <td>87</td>\n",
       "      <td>{'Action': 208, 'FPS': 188, 'Multiplayer': 172...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>Day of Defeat</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3416</td>\n",
       "      <td>398</td>\n",
       "      <td>0</td>\n",
       "      <td>5,000,000 .. 10,000,000</td>\n",
       "      <td>187</td>\n",
       "      <td>0</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, German, Italian, Spanish - Spain</td>\n",
       "      <td>Action</td>\n",
       "      <td>130</td>\n",
       "      <td>{'FPS': 138, 'World War II': 122, 'Multiplayer...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>Deathmatch Classic</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1273</td>\n",
       "      <td>267</td>\n",
       "      <td>0</td>\n",
       "      <td>5,000,000 .. 10,000,000</td>\n",
       "      <td>258</td>\n",
       "      <td>0</td>\n",
       "      <td>184</td>\n",
       "      <td>0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action</td>\n",
       "      <td>4</td>\n",
       "      <td>{'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>Half-Life: Opposing Force</td>\n",
       "      <td>Gearbox Software</td>\n",
       "      <td>Valve</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5250</td>\n",
       "      <td>288</td>\n",
       "      <td>0</td>\n",
       "      <td>5,000,000 .. 10,000,000</td>\n",
       "      <td>624</td>\n",
       "      <td>0</td>\n",
       "      <td>415</td>\n",
       "      <td>0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>499.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, German, Korean</td>\n",
       "      <td>Action</td>\n",
       "      <td>71</td>\n",
       "      <td>{'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   appid                       name         developer publisher  score_rank  \\\n",
       "0     10             Counter-Strike             Valve     Valve         NaN   \n",
       "1     20      Team Fortress Classic             Valve     Valve         NaN   \n",
       "2     30              Day of Defeat             Valve     Valve         NaN   \n",
       "3     40         Deathmatch Classic             Valve     Valve         NaN   \n",
       "4     50  Half-Life: Opposing Force  Gearbox Software     Valve         NaN   \n",
       "\n",
       "   positive  negative  userscore                    owners  average_forever  \\\n",
       "0    124534      3339          0  10,000,000 .. 20,000,000            17612   \n",
       "1      3318       633          0   5,000,000 .. 10,000,000              277   \n",
       "2      3416       398          0   5,000,000 .. 10,000,000              187   \n",
       "3      1273       267          0   5,000,000 .. 10,000,000              258   \n",
       "4      5250       288          0   5,000,000 .. 10,000,000              624   \n",
       "\n",
       "   average_2weeks  median_forever  median_2weeks  price  initialprice  \\\n",
       "0             709             317             26  999.0         999.0   \n",
       "1              15              62             15  499.0         499.0   \n",
       "2               0              34              0  499.0         499.0   \n",
       "3               0             184              0  499.0         499.0   \n",
       "4               0             415              0  499.0         499.0   \n",
       "\n",
       "   discount                                          languages   genre    ccu  \\\n",
       "0       0.0  English, French, German, Italian, Spanish - Sp...  Action  14923   \n",
       "1       0.0  English, French, German, Italian, Spanish - Sp...  Action     87   \n",
       "2       0.0  English, French, German, Italian, Spanish - Spain  Action    130   \n",
       "3       0.0  English, French, German, Italian, Spanish - Sp...  Action      4   \n",
       "4       0.0                    English, French, German, Korean  Action     71   \n",
       "\n",
       "                                                tags  \n",
       "0  {'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...  \n",
       "1  {'Action': 208, 'FPS': 188, 'Multiplayer': 172...  \n",
       "2  {'FPS': 138, 'World War II': 122, 'Multiplayer...  \n",
       "3  {'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...  \n",
       "4  {'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data = pd.read_csv('../data/raw/steamspy_data.csv')\n",
    "raw_steamspy_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a lot of columns here that will overlap with columns we already dealt with in the Steam data. This is fine because it means we don't have to clean them here, and can simply use the columns from the other dataset.\n",
    "\n",
    "A few columns in this data that could be valuable are `positive`, `negative`, `userscore`, `owners` and `average_forever`. We'll begin by looking at the null counts for each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "appid                  0\n",
       "name                   5\n",
       "developer            197\n",
       "publisher            280\n",
       "score_rank         29177\n",
       "positive               0\n",
       "negative               0\n",
       "userscore              0\n",
       "owners                 0\n",
       "average_forever        0\n",
       "average_2weeks         0\n",
       "median_forever         0\n",
       "median_2weeks          0\n",
       "price                 29\n",
       "initialprice          22\n",
       "discount              22\n",
       "languages             94\n",
       "genre                152\n",
       "ccu                    0\n",
       "tags                   0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most of the columns look fine, with only a few tens or hundreds of missing values, or none at all. The only column we'll remove for certain is the `score_rank` column, as it has far too many null values to be useful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling missing values\n",
    "\n",
    "There are some columns we'll be removing, as they're present in the other dataset. Before we do we can investigate the missing values, to decide whether we should remove these rows first. We don't need to go through every column, but just look at a few individual ones.\n",
    "\n",
    "In the `name` column, we have five rows with missing data, and four more rows where the name is 'none'. We'll certainly remove all 9 of these rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>481</th>\n",
       "      <td>17760</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7235</th>\n",
       "      <td>396420</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>22</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>20,000 .. 50,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Free to Play': 24, 'Action': 21, 'Indie': 21...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9553</th>\n",
       "      <td>460250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Jeroen Wimmers</td>\n",
       "      <td>Jeroen Wimmers</td>\n",
       "      <td>NaN</td>\n",
       "      <td>44</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>899.0</td>\n",
       "      <td>899.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, Italian, German, Spanish - Sp...</td>\n",
       "      <td>Casual, Indie</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Indie': 34, 'Casual': 24, 'Puzzle': 15, 'Min...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22244</th>\n",
       "      <td>806160</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Paleno Games</td>\n",
       "      <td>Paleno Games</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, Italian, German, Spanish - Sp...</td>\n",
       "      <td>Action, Adventure, Casual, Indie</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Action': 22, 'Indie': 22, 'Casual': 21, 'Adv...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27324</th>\n",
       "      <td>965340</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2nd Studio</td>\n",
       "      <td>2nd Studio</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>199.0</td>\n",
       "      <td>199.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, German, Danish, Japanese, Russian, Si...</td>\n",
       "      <td>Action, Indie, Simulation</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Indie': 32, 'Sexual Content': 31, 'Action': ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        appid name       developer       publisher  score_rank  positive  \\\n",
       "481     17760  NaN             NaN             NaN         NaN         0   \n",
       "7235   396420  NaN             NaN             NaN         NaN        22   \n",
       "9553   460250  NaN  Jeroen Wimmers  Jeroen Wimmers         NaN        44   \n",
       "22244  806160  NaN    Paleno Games    Paleno Games         NaN         4   \n",
       "27324  965340  NaN      2nd Studio      2nd Studio         NaN        31   \n",
       "\n",
       "       negative  userscore            owners  average_forever  average_2weeks  \\\n",
       "481           0          0       0 .. 20,000                0               0   \n",
       "7235          9          0  20,000 .. 50,000                0               0   \n",
       "9553          4          0       0 .. 20,000                0               0   \n",
       "22244        13          0       0 .. 20,000                0               0   \n",
       "27324         1          0       0 .. 20,000                0               0   \n",
       "\n",
       "       median_forever  median_2weeks  price  initialprice  discount  \\\n",
       "481                 0              0    0.0           0.0       0.0   \n",
       "7235                0              0    0.0           0.0       0.0   \n",
       "9553                0              0  899.0         899.0       0.0   \n",
       "22244               0              0   99.0          99.0       0.0   \n",
       "27324               0              0  199.0         199.0       0.0   \n",
       "\n",
       "                                               languages  \\\n",
       "481                                                  NaN   \n",
       "7235                                                 NaN   \n",
       "9553   English, French, Italian, German, Spanish - Sp...   \n",
       "22244  English, French, Italian, German, Spanish - Sp...   \n",
       "27324  English, German, Danish, Japanese, Russian, Si...   \n",
       "\n",
       "                                  genre  ccu  \\\n",
       "481                                 NaN    0   \n",
       "7235                                NaN    0   \n",
       "9553                      Casual, Indie    0   \n",
       "22244  Action, Adventure, Casual, Indie    0   \n",
       "27324         Action, Indie, Simulation    0   \n",
       "\n",
       "                                                    tags  \n",
       "481                                                   []  \n",
       "7235   {'Free to Play': 24, 'Action': 21, 'Indie': 21...  \n",
       "9553   {'Indie': 34, 'Casual': 24, 'Puzzle': 15, 'Min...  \n",
       "22244  {'Action': 22, 'Indie': 22, 'Casual': 21, 'Adv...  \n",
       "27324  {'Indie': 32, 'Sexual Content': 31, 'Action': ...  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data[raw_steamspy_data['name'].isnull()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4918</th>\n",
       "      <td>339860</td>\n",
       "      <td>none</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>84</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>50,000 .. 100,000</td>\n",
       "      <td>309</td>\n",
       "      <td>0</td>\n",
       "      <td>325</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English</td>\n",
       "      <td>Adventure, Indie, Massively Multiplayer, RPG, ...</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Indie': 34, 'RPG': 33, 'Adventure': 32, 'Mas...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6779</th>\n",
       "      <td>385020</td>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14</td>\n",
       "      <td>26</td>\n",
       "      <td>0</td>\n",
       "      <td>20,000 .. 50,000</td>\n",
       "      <td>262</td>\n",
       "      <td>0</td>\n",
       "      <td>272</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, Italian, German, Spanish - Sp...</td>\n",
       "      <td>Casual, Indie, Massively Multiplayer, Strategy</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Indie': 32, 'Strategy': 31, 'Casual': 21, 'M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7350</th>\n",
       "      <td>398970</td>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "      <td>20,000 .. 50,000</td>\n",
       "      <td>259</td>\n",
       "      <td>0</td>\n",
       "      <td>272</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English</td>\n",
       "      <td>Adventure, Indie, Massively Multiplayer, RPG, ...</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Massively Multiplayer': 27, 'Strategy': 23, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       appid  name developer publisher  score_rank  positive  negative  \\\n",
       "4918  339860  none       NaN       NaN         NaN        84        32   \n",
       "6779  385020  none      none       NaN         NaN        14        26   \n",
       "7350  398970  none      none      none         NaN        10        21   \n",
       "\n",
       "      userscore             owners  average_forever  average_2weeks  \\\n",
       "4918          0  50,000 .. 100,000              309               0   \n",
       "6779          0   20,000 .. 50,000              262               0   \n",
       "7350          0   20,000 .. 50,000              259               0   \n",
       "\n",
       "      median_forever  median_2weeks  price  initialprice  discount  \\\n",
       "4918             325              0    0.0           0.0       0.0   \n",
       "6779             272              0    0.0           0.0       0.0   \n",
       "7350             272              0    0.0           0.0       0.0   \n",
       "\n",
       "                                              languages  \\\n",
       "4918                                            English   \n",
       "6779  English, French, Italian, German, Spanish - Sp...   \n",
       "7350                                            English   \n",
       "\n",
       "                                                  genre  ccu  \\\n",
       "4918  Adventure, Indie, Massively Multiplayer, RPG, ...    0   \n",
       "6779     Casual, Indie, Massively Multiplayer, Strategy    0   \n",
       "7350  Adventure, Indie, Massively Multiplayer, RPG, ...    0   \n",
       "\n",
       "                                                   tags  \n",
       "4918  {'Indie': 34, 'RPG': 33, 'Adventure': 32, 'Mas...  \n",
       "6779  {'Indie': 32, 'Strategy': 31, 'Casual': 21, 'M...  \n",
       "7350  {'Massively Multiplayer': 27, 'Strategy': 23, ...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data[raw_steamspy_data['name'] == 'none']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 197 rows with missing developer data, and 280 rows with missing publisher data. It's conceivable that a small studio may self-publish, but we'll say that a title must have a developer for it to be kept. We can remove the rows with missing developer data, but then we'll be removing both of these columns as this information is already provided in the Steam data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>8774</th>\n",
       "      <td>439400</td>\n",
       "      <td>Legends of Callasia Demo</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>568</th>\n",
       "      <td>23130</td>\n",
       "      <td>Mole Control</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18399</th>\n",
       "      <td>703610</td>\n",
       "      <td>Space Girls</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>70</td>\n",
       "      <td>81</td>\n",
       "      <td>0</td>\n",
       "      <td>20,000 .. 50,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, Russian</td>\n",
       "      <td>Sexual Content, Nudity, Action, Casual, Indie</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Nudity': 26, 'Sexual Content': 25, 'Action':...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2207</th>\n",
       "      <td>247910</td>\n",
       "      <td>Sniper Elite: Nazi Zombie Army 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Rebellion</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3069</td>\n",
       "      <td>554</td>\n",
       "      <td>0</td>\n",
       "      <td>200,000 .. 500,000</td>\n",
       "      <td>289</td>\n",
       "      <td>0</td>\n",
       "      <td>335</td>\n",
       "      <td>0</td>\n",
       "      <td>1499.0</td>\n",
       "      <td>1499.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, French, Italian, German, Spanish - Sp...</td>\n",
       "      <td>Action, Adventure</td>\n",
       "      <td>23</td>\n",
       "      <td>{'Zombies': 294, 'Action': 185, 'Co-op': 171, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15367</th>\n",
       "      <td>619730</td>\n",
       "      <td>Mad Driver</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>46</td>\n",
       "      <td>48</td>\n",
       "      <td>0</td>\n",
       "      <td>100,000 .. 200,000</td>\n",
       "      <td>287</td>\n",
       "      <td>0</td>\n",
       "      <td>294</td>\n",
       "      <td>0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, Russian</td>\n",
       "      <td>Casual, Indie, Racing, Strategy</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Strategy': 21, 'Indie': 21, 'Casual': 21, 'R...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        appid                              name developer  publisher  \\\n",
       "8774   439400          Legends of Callasia Demo       NaN        NaN   \n",
       "568     23130                      Mole Control       NaN        NaN   \n",
       "18399  703610                       Space Girls       NaN        NaN   \n",
       "2207   247910  Sniper Elite: Nazi Zombie Army 2       NaN  Rebellion   \n",
       "15367  619730                        Mad Driver       NaN        NaN   \n",
       "\n",
       "       score_rank  positive  negative  userscore              owners  \\\n",
       "8774          NaN         0         0          0         0 .. 20,000   \n",
       "568           NaN         0         0          0         0 .. 20,000   \n",
       "18399         NaN        70        81          0    20,000 .. 50,000   \n",
       "2207          NaN      3069       554          0  200,000 .. 500,000   \n",
       "15367         NaN        46        48          0  100,000 .. 200,000   \n",
       "\n",
       "       average_forever  average_2weeks  median_forever  median_2weeks   price  \\\n",
       "8774                 0               0               0              0     NaN   \n",
       "568                  0               0               0              0     0.0   \n",
       "18399                0               0               0              0    99.0   \n",
       "2207               289               0             335              0  1499.0   \n",
       "15367              287               0             294              0    99.0   \n",
       "\n",
       "       initialprice  discount  \\\n",
       "8774            NaN       NaN   \n",
       "568             0.0       0.0   \n",
       "18399          99.0       0.0   \n",
       "2207         1499.0       0.0   \n",
       "15367          99.0       0.0   \n",
       "\n",
       "                                               languages  \\\n",
       "8774                                                 NaN   \n",
       "568                                                  NaN   \n",
       "18399                                   English, Russian   \n",
       "2207   English, French, Italian, German, Spanish - Sp...   \n",
       "15367                                   English, Russian   \n",
       "\n",
       "                                               genre  ccu  \\\n",
       "8774                                             NaN    0   \n",
       "568                                              NaN    1   \n",
       "18399  Sexual Content, Nudity, Action, Casual, Indie    0   \n",
       "2207                               Action, Adventure   23   \n",
       "15367                Casual, Indie, Racing, Strategy    0   \n",
       "\n",
       "                                                    tags  \n",
       "8774                                                  []  \n",
       "568                                                   []  \n",
       "18399  {'Nudity': 26, 'Sexual Content': 25, 'Action':...  \n",
       "2207   {'Zombies': 294, 'Action': 185, 'Co-op': 171, ...  \n",
       "15367  {'Strategy': 21, 'Indie': 21, 'Casual': 21, 'R...  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data[raw_steamspy_data['developer'].isnull()].sample(5, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22993</th>\n",
       "      <td>827660</td>\n",
       "      <td>Feminist Cyborg Vs Last guy(gay)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>489</th>\n",
       "      <td>18310</td>\n",
       "      <td>Spectraball - Demo</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5347</th>\n",
       "      <td>349260</td>\n",
       "      <td>BlastZone 2</td>\n",
       "      <td>Matt Edzenga</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50</td>\n",
       "      <td>25</td>\n",
       "      <td>0</td>\n",
       "      <td>20,000 .. 50,000</td>\n",
       "      <td>195</td>\n",
       "      <td>0</td>\n",
       "      <td>195</td>\n",
       "      <td>0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English</td>\n",
       "      <td>Action, Indie</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Indie': 35, 'Action': 34, \"Shoot 'Em Up\": 18...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2493</th>\n",
       "      <td>259530</td>\n",
       "      <td>Savant - Ascent</td>\n",
       "      <td>D-Pad Studio</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2467</td>\n",
       "      <td>179</td>\n",
       "      <td>0</td>\n",
       "      <td>200,000 .. 500,000</td>\n",
       "      <td>209</td>\n",
       "      <td>0</td>\n",
       "      <td>294</td>\n",
       "      <td>0</td>\n",
       "      <td>199.0</td>\n",
       "      <td>199.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>English, Japanese, Korean, Norwegian, Simplifi...</td>\n",
       "      <td>Action, Indie</td>\n",
       "      <td>3</td>\n",
       "      <td>{'Action': 116, 'Indie': 102, 'Great Soundtrac...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18285</th>\n",
       "      <td>700580</td>\n",
       "      <td>Rust - Staging Branch</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>78</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        appid                              name     developer publisher  \\\n",
       "22993  827660  Feminist Cyborg Vs Last guy(gay)           NaN       NaN   \n",
       "489     18310                Spectraball - Demo           NaN       NaN   \n",
       "5347   349260                       BlastZone 2  Matt Edzenga       NaN   \n",
       "2493   259530                   Savant - Ascent  D-Pad Studio       NaN   \n",
       "18285  700580             Rust - Staging Branch           NaN       NaN   \n",
       "\n",
       "       score_rank  positive  negative  userscore              owners  \\\n",
       "22993         NaN         0         0          0         0 .. 20,000   \n",
       "489           NaN         0         0          0         0 .. 20,000   \n",
       "5347          NaN        50        25          0    20,000 .. 50,000   \n",
       "2493          NaN      2467       179          0  200,000 .. 500,000   \n",
       "18285         NaN         0         0          0         0 .. 20,000   \n",
       "\n",
       "       average_forever  average_2weeks  median_forever  median_2weeks  price  \\\n",
       "22993                0               0               0              0    NaN   \n",
       "489                  0               0               0              0    0.0   \n",
       "5347               195               0             195              0  999.0   \n",
       "2493               209               0             294              0  199.0   \n",
       "18285               78               0               6              0    NaN   \n",
       "\n",
       "       initialprice  discount  \\\n",
       "22993           NaN       NaN   \n",
       "489             0.0       0.0   \n",
       "5347          999.0       0.0   \n",
       "2493          199.0       0.0   \n",
       "18285           NaN       NaN   \n",
       "\n",
       "                                               languages          genre  ccu  \\\n",
       "22993                                                NaN            NaN    0   \n",
       "489                                                  NaN            NaN    0   \n",
       "5347                                             English  Action, Indie    0   \n",
       "2493   English, Japanese, Korean, Norwegian, Simplifi...  Action, Indie    3   \n",
       "18285                                                NaN            NaN    0   \n",
       "\n",
       "                                                    tags  \n",
       "22993                                                 []  \n",
       "489                                                   []  \n",
       "5347   {'Indie': 35, 'Action': 34, \"Shoot 'Em Up\": 18...  \n",
       "2493   {'Action': 116, 'Indie': 102, 'Great Soundtrac...  \n",
       "18285                                                 []  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data[raw_steamspy_data['publisher'].isnull()].sample(5, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a small number of rows with missing price data. This is another column we don't need, (as well as `initialprice` and `discount`), as we already have that data available in the other dataset. \n",
    "\n",
    "It looks like these rows have a lot of other missing information as well, so if they haven't already been removed when cleaning other columns we'll remove them here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4261</th>\n",
       "      <td>321210</td>\n",
       "      <td>Blade Symphony Beta</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18285</th>\n",
       "      <td>700580</td>\n",
       "      <td>Rust - Staging Branch</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>78</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22496</th>\n",
       "      <td>813350</td>\n",
       "      <td>Ben 10 VR</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15719</th>\n",
       "      <td>630790</td>\n",
       "      <td>PSYCHO-PASS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12076</th>\n",
       "      <td>530940</td>\n",
       "      <td>BIOHAZARD 7 resident evil グロテスクVer.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        appid                                 name developer publisher  \\\n",
       "4261   321210                  Blade Symphony Beta       NaN       NaN   \n",
       "18285  700580                Rust - Staging Branch       NaN       NaN   \n",
       "22496  813350                            Ben 10 VR       NaN       NaN   \n",
       "15719  630790                          PSYCHO-PASS       NaN       NaN   \n",
       "12076  530940  BIOHAZARD 7 resident evil グロテスクVer.       NaN       NaN   \n",
       "\n",
       "       score_rank  positive  negative  userscore       owners  \\\n",
       "4261          NaN         0         0          0  0 .. 20,000   \n",
       "18285         NaN         0         0          0  0 .. 20,000   \n",
       "22496         NaN         0         0          0  0 .. 20,000   \n",
       "15719         NaN         0         0          0  0 .. 20,000   \n",
       "12076         NaN         0         0          0  0 .. 20,000   \n",
       "\n",
       "       average_forever  average_2weeks  median_forever  median_2weeks  price  \\\n",
       "4261                 0               0               0              0    NaN   \n",
       "18285               78               0               6              0    NaN   \n",
       "22496                0               0               0              0    NaN   \n",
       "15719                0               0               0              0    NaN   \n",
       "12076                0               0               0              0    NaN   \n",
       "\n",
       "       initialprice  discount languages genre  ccu tags  \n",
       "4261            NaN       NaN       NaN   NaN    0   []  \n",
       "18285           NaN       NaN       NaN   NaN    0   []  \n",
       "22496           NaN       NaN       NaN   NaN    0   []  \n",
       "15719           NaN       NaN       NaN   NaN    0   []  \n",
       "12076           NaN       NaN       NaN   NaN    0   []  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data[raw_steamspy_data['price'].isnull()].sample(5, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again there are a small number of rows with missing language data. There are a couple of bigger titles here, but most are missing other data also. We'll go ahead and remove these rows and this should leave us with most troublesome rows dealt with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>score_rank</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>userscore</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>average_2weeks</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>median_2weeks</th>\n",
       "      <th>price</th>\n",
       "      <th>initialprice</th>\n",
       "      <th>discount</th>\n",
       "      <th>languages</th>\n",
       "      <th>genre</th>\n",
       "      <th>ccu</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>8350</td>\n",
       "      <td>Strong Bad's Cool Game for Attractive People: ...</td>\n",
       "      <td>Telltale Games</td>\n",
       "      <td>Telltale Games</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Adventure</td>\n",
       "      <td>0</td>\n",
       "      <td>{'Adventure': 20}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1233</th>\n",
       "      <td>94500</td>\n",
       "      <td>Back to the Future: Ep 2 - Get Tannen!</td>\n",
       "      <td>Telltale Games</td>\n",
       "      <td>Telltale Games</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1367</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>500,000 .. 1,000,000</td>\n",
       "      <td>91</td>\n",
       "      <td>0</td>\n",
       "      <td>91</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>{'Adventure': 131, 'Point &amp; Click': 116, 'Epis...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2120</th>\n",
       "      <td>243180</td>\n",
       "      <td>The Harvest</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>757</th>\n",
       "      <td>34390</td>\n",
       "      <td>Football Manager 2011 Demo</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>569</th>\n",
       "      <td>23140</td>\n",
       "      <td>KrissX</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0 .. 20,000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       appid                                               name  \\\n",
       "249     8350  Strong Bad's Cool Game for Attractive People: ...   \n",
       "1233   94500             Back to the Future: Ep 2 - Get Tannen!   \n",
       "2120  243180                                        The Harvest   \n",
       "757    34390                         Football Manager 2011 Demo   \n",
       "569    23140                                             KrissX   \n",
       "\n",
       "           developer       publisher  score_rank  positive  negative  \\\n",
       "249   Telltale Games  Telltale Games         NaN        12         5   \n",
       "1233  Telltale Games  Telltale Games         NaN      1367       162   \n",
       "2120             NaN             NaN         NaN         0         0   \n",
       "757              NaN             NaN         NaN         0         0   \n",
       "569              NaN             NaN         NaN         0         0   \n",
       "\n",
       "      userscore                owners  average_forever  average_2weeks  \\\n",
       "249           0           0 .. 20,000                0               0   \n",
       "1233          0  500,000 .. 1,000,000               91               0   \n",
       "2120          0           0 .. 20,000                0               0   \n",
       "757           0           0 .. 20,000                0               0   \n",
       "569           0           0 .. 20,000                0               0   \n",
       "\n",
       "      median_forever  median_2weeks  price  initialprice  discount languages  \\\n",
       "249                0              0    0.0           0.0       0.0       NaN   \n",
       "1233              91              0    0.0           0.0       0.0       NaN   \n",
       "2120               0              0    0.0           0.0       0.0       NaN   \n",
       "757                0              0    0.0           0.0       0.0       NaN   \n",
       "569                0              0    0.0           0.0       0.0       NaN   \n",
       "\n",
       "          genre  ccu                                               tags  \n",
       "249   Adventure    0                                  {'Adventure': 20}  \n",
       "1233        NaN    2  {'Adventure': 131, 'Point & Click': 116, 'Epis...  \n",
       "2120        NaN    0                                                 []  \n",
       "757         NaN    0                                                 []  \n",
       "569         NaN    1                                                 []  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data[raw_steamspy_data['languages'].isnull()].sample(5, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove unwanted columns\n",
    "\n",
    "As mentioned, there are a number of columns present in the Steam dataset, so we can remove any of those with crossover. It makes sense that these exist because SteamSpy's data is retrieved by looking at Steam's data. We'll remove the `score_rank` column as it contains too many missing values, and we'll remove the `userscore` column as most rows have a value of 0 here, so it won't be useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      29177\n",
       "100        4\n",
       "84         4\n",
       "95         3\n",
       "55         3\n",
       "Name: userscore, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data['userscore'].value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also three columns - `average_2weeks`, `median_2weeks`, (average and median playtime in the last 2 weeks) and `ccu` (Peak conccurrent users yesterday) - that provide information only relevant to the time the data was collected. Going forward this isn't going to be useful, as we're not interested in how games were performing at that specific time, so we'll remove those."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop_cols = [\n",
    "    'score_rank', # too many missing values\n",
    "    'userscore', # too little variance (most have 0)\n",
    "    'genre', 'developer', 'publisher', 'price', 'initialprice', 'discount', # provided by Steam data\n",
    "    'average_2weeks', 'median_2weeks', 'ccu' # not interested in temporally specific columns\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One more column we could remove is the `languages` column, as we already have the `english` column in the Steam Data. Here in the SteamSpy data, languages is made up of a comma-separated list of all supported languages. This provides us with some more information so we can keep it for now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    English, French, German, Italian, Spanish - Sp...\n",
       "1    English, French, German, Italian, Spanish - Sp...\n",
       "2    English, French, German, Italian, Spanish - Spain\n",
       "3    English, French, German, Italian, Spanish - Sp...\n",
       "4                      English, French, German, Korean\n",
       "Name: languages, dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_steamspy_data['languages'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Process & Export Tags Column\n",
    "\n",
    "There are a couple of columns that require more specific handling, and the `tags` column is one of them. There will be a lot of overlap with the `genres` column, but the tags might provide just enough unique information to be worthwhile keeping.\n",
    "\n",
    "Below we separate out the tags data for experimentation, then have a look a the first few rows. Each tag for a game has an associated number, corresponding to the amount of votes that game has for that tag. A game's tags with higher votes means that more people think it is a relevant tag for that title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'Action': 2681, 'FPS': 2048, 'Multiplayer': 1659, 'Shooter': 1420, 'Classic': 1344, 'Team-Based': 943, 'First-Person': 799, 'Competitive': 790, 'Tactical': 734, \"1990's\": 564, 'e-sports': 550, 'PvP': 480, 'Military': 367, 'Strategy': 329, 'Score Attack': 200, 'Survival': 192, 'Old School': 164, 'Assassin': 151, '1980s': 144, 'Violent': 40}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0    {'Action': 2681, 'FPS': 2048, 'Multiplayer': 1...\n",
       "1    {'Action': 208, 'FPS': 188, 'Multiplayer': 172...\n",
       "2    {'FPS': 138, 'World War II': 122, 'Multiplayer...\n",
       "3    {'Action': 85, 'FPS': 71, 'Multiplayer': 58, '...\n",
       "4    {'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si...\n",
       "Name: tags, dtype: object"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tags = raw_steamspy_data['tags']\n",
    "\n",
    "print(tags[0])\n",
    "tags.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just by looking at the first row you can see that there are probably going to be far too many tags to include without disrupting the dataset. We have a couple of options here. We could include just the names of the top 3 or 5 tags for a game - which should be easy enough as it looks like the tags are already in order from most votes to least - or we could create new columns for each tag, and the number of votes as the row values, like this:\n",
    "\n",
    "appid | name | action | fps | multiplayer\n",
    "--- | --- | --- | --- | ---\n",
    "10 | Shoot 'em up | 2681 | 2048 | 0\n",
    "20 | Big Brawl | 208  | 0 | 172\n",
    "\n",
    "\n",
    "In a way, we'll do both.\n",
    "\n",
    "Because including the full tag information in the dataset may make it bulky and unwieldy, we'll only include the top three tags in the core dataset. As we did when cleaning the Steam Data, we'll also export the full tag data as a separate file. This means that if we want to use it as part of our analysis we can import and merge the data quite easily.\n",
    "\n",
    "Before we can do that, we have to figure out how to handle and process the data. As we've seen previously, the data in each row is stored as a string containing a dictionary, so we have to use `literal_eval` to get python to \"recognise\" the dictionary first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2681\n",
      "2048\n"
     ]
    }
   ],
   "source": [
    "eval_row = literal_eval(tags[0])\n",
    "\n",
    "print(eval_row['Action'])\n",
    "print(eval_row['FPS'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also have to figure out how to get a list of all the different unique tags from across the dataset. As each row only contains the tags that it has votes for, we don't know from the outset the tag names or even how many different tags there are.\n",
    "\n",
    "We could loop through the data and build a list of all the unique tag names using a dictionary or something similar, but instead we'll achieve the same thing much more easily using the [chain](https://docs.python.org/3/library/itertools.html?highlight=itertools#itertools.chain) function from itertools. As seen below, it will take a variety of iterables and unpack them into a single sequence (we have to use list() here to force the generator to evaluate). We can then use the set function to obtain just the unique values from the returned sequence.\n",
    "\n",
    "As you can see below, we can pass lists, tuples and dictionaries and they will all be handled. Also notice how only the key from the dictionary is returned (d) and not the value inside the key (e). This is really useful and means we can just pass the evaluated tags data as a series, and itertools will just extract the individual keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', 'b', 'c', 'd']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "values = [\n",
    "    ['a', 'b'], # list\n",
    "    ('b', 'c'), # tuple\n",
    "    {'d': 'e'}  # dictionary\n",
    "]\n",
    "\n",
    "list(itertools.chain(*values))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'a', 'b', 'c', 'd'}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set(itertools.chain(*values))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we tried to define a function now, we might run into a problem. It turns out not all of the rows have data stored as dictionaries. Seen below, there are a few hundred rows with an empty list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "656"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tags[tags == '[]'].shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This doesn't cause an issue when we generate our set of unique tag names, as any empty lists are ignored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of unique tags: 371\n",
      "\n",
      "First few tags: ['1980s', \"1990's\", '2.5D', '2D', '2D Fighter']\n"
     ]
    }
   ],
   "source": [
    "parsed_tags = tags.apply(lambda x: literal_eval(x))\n",
    "\n",
    "cols = set(itertools.chain(*parsed_tags))\n",
    "\n",
    "print('Number of unique tags:', len(cols))\n",
    "print('\\nFirst few tags:', sorted(list(cols))[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However to create the new dataframe from the tags and votes, we need to look inside the dictionaries in each row. We'll expand the logic to parse the tags using the `isintance` function to check each row for its type. If it is a dictionary, we return the dictionary. If it is a list, we'll return an empty dictionary `{}`. We'll also include an extra check in case there is anything else in the rows we haven't found yet. \n",
    "\n",
    "Next, we loop through the tags stored in `cols`, standardise the format of the new column names, then extract the vote count of each tag by checking if the tag name is in each row's dictionary of tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>1980s</th>\n",
       "      <th>1990s</th>\n",
       "      <th>2.5d</th>\n",
       "      <th>2d</th>\n",
       "      <th>2d_fighter</th>\n",
       "      <th>360_video</th>\n",
       "      <th>3d</th>\n",
       "      <th>3d_platformer</th>\n",
       "      <th>3d_vision</th>\n",
       "      <th>4_player_local</th>\n",
       "      <th>4x</th>\n",
       "      <th>6dof</th>\n",
       "      <th>atv</th>\n",
       "      <th>abstract</th>\n",
       "      <th>action</th>\n",
       "      <th>action_rpg</th>\n",
       "      <th>action_adventure</th>\n",
       "      <th>addictive</th>\n",
       "      <th>adventure</th>\n",
       "      <th>agriculture</th>\n",
       "      <th>aliens</th>\n",
       "      <th>alternate_history</th>\n",
       "      <th>america</th>\n",
       "      <th>animation_&amp;_modeling</th>\n",
       "      <th>anime</th>\n",
       "      <th>arcade</th>\n",
       "      <th>arena_shooter</th>\n",
       "      <th>artificial_intelligence</th>\n",
       "      <th>assassin</th>\n",
       "      <th>asynchronous_multiplayer</th>\n",
       "      <th>atmospheric</th>\n",
       "      <th>audio_production</th>\n",
       "      <th>bmx</th>\n",
       "      <th>base_building</th>\n",
       "      <th>baseball</th>\n",
       "      <th>based_on_a_novel</th>\n",
       "      <th>basketball</th>\n",
       "      <th>batman</th>\n",
       "      <th>battle_royale</th>\n",
       "      <th>beat_em_up</th>\n",
       "      <th>beautiful</th>\n",
       "      <th>benchmark</th>\n",
       "      <th>bikes</th>\n",
       "      <th>blood</th>\n",
       "      <th>board_game</th>\n",
       "      <th>bowling</th>\n",
       "      <th>building</th>\n",
       "      <th>bullet_hell</th>\n",
       "      <th>bullet_time</th>\n",
       "      <th>crpg</th>\n",
       "      <th>...</th>\n",
       "      <th>text_based</th>\n",
       "      <th>third_person</th>\n",
       "      <th>third_person_shooter</th>\n",
       "      <th>thriller</th>\n",
       "      <th>time_attack</th>\n",
       "      <th>time_management</th>\n",
       "      <th>time_manipulation</th>\n",
       "      <th>time_travel</th>\n",
       "      <th>top_down</th>\n",
       "      <th>top_down_shooter</th>\n",
       "      <th>touch_friendly</th>\n",
       "      <th>tower_defense</th>\n",
       "      <th>trackir</th>\n",
       "      <th>trading</th>\n",
       "      <th>trading_card_game</th>\n",
       "      <th>trains</th>\n",
       "      <th>transhumanism</th>\n",
       "      <th>turn_based</th>\n",
       "      <th>turn_based_combat</th>\n",
       "      <th>turn_based_strategy</th>\n",
       "      <th>turn_based_tactics</th>\n",
       "      <th>tutorial</th>\n",
       "      <th>twin_stick_shooter</th>\n",
       "      <th>typing</th>\n",
       "      <th>underground</th>\n",
       "      <th>underwater</th>\n",
       "      <th>unforgiving</th>\n",
       "      <th>utilities</th>\n",
       "      <th>vr</th>\n",
       "      <th>vr_only</th>\n",
       "      <th>vampire</th>\n",
       "      <th>video_production</th>\n",
       "      <th>villain_protagonist</th>\n",
       "      <th>violent</th>\n",
       "      <th>visual_novel</th>\n",
       "      <th>voice_control</th>\n",
       "      <th>voxel</th>\n",
       "      <th>walking_simulator</th>\n",
       "      <th>war</th>\n",
       "      <th>wargame</th>\n",
       "      <th>warhammer_40k</th>\n",
       "      <th>web_publishing</th>\n",
       "      <th>werewolves</th>\n",
       "      <th>western</th>\n",
       "      <th>word_game</th>\n",
       "      <th>world_war_i</th>\n",
       "      <th>world_war_ii</th>\n",
       "      <th>wrestling</th>\n",
       "      <th>zombies</th>\n",
       "      <th>e_sports</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>144</td>\n",
       "      <td>564</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2681</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>151</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>71</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>208</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>26</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>99</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>80</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>122</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>85</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>77</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>87</td>\n",
       "      <td>0</td>\n",
       "      <td>122</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>73</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 371 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   1980s  1990s  2.5d  2d  2d_fighter  360_video  3d  3d_platformer  \\\n",
       "0    144    564     0   0           0          0   0              0   \n",
       "1      0     71     0   0           0          0   0              0   \n",
       "2      0      0     0   0           0          0   0              0   \n",
       "3      0      0     0   0           0          0   0              0   \n",
       "4      0     77     0   0           0          0   0              0   \n",
       "\n",
       "   3d_vision  4_player_local  4x  6dof  atv  abstract  action  action_rpg  \\\n",
       "0          0               0   0     0    0         0    2681           0   \n",
       "1          0               0   0     0    0         0     208           0   \n",
       "2          0               0   0     0    0         0      99           0   \n",
       "3          0               0   0     0    0         0      85           0   \n",
       "4          0               0   0     0    0         0     211           0   \n",
       "\n",
       "   action_adventure  addictive  adventure  agriculture  aliens  \\\n",
       "0                 0          0          0            0       0   \n",
       "1                 0          0         15            0       0   \n",
       "2                 0          0          0            0       0   \n",
       "3                 0          0          0            0       0   \n",
       "4                 0          0         87            0     122   \n",
       "\n",
       "   alternate_history  america  animation_&_modeling  anime  arcade  \\\n",
       "0                  0        0                     0      0       0   \n",
       "1                  0        0                     0      0       0   \n",
       "2                  0        0                     0      0       0   \n",
       "3                  0        0                     0      0       0   \n",
       "4                  0        0                     0      0       0   \n",
       "\n",
       "   arena_shooter  artificial_intelligence  assassin  asynchronous_multiplayer  \\\n",
       "0              0                        0       151                         0   \n",
       "1              0                        0         0                         0   \n",
       "2              0                        0         0                         0   \n",
       "3             22                        0         0                         0   \n",
       "4              0                        0         0                         0   \n",
       "\n",
       "   atmospheric  audio_production  bmx  base_building  baseball  \\\n",
       "0            0                 0    0              0         0   \n",
       "1            0                 0    0              0         0   \n",
       "2            0                 0    0              0         0   \n",
       "3            0                 0    0              0         0   \n",
       "4           73                 0    0              0         0   \n",
       "\n",
       "   based_on_a_novel  basketball  batman  battle_royale  beat_em_up  beautiful  \\\n",
       "0                 0           0       0              0           0          0   \n",
       "1                 0           0       0              0           0          0   \n",
       "2                 0           0       0              0           0          0   \n",
       "3                 0           0       0              0           0          0   \n",
       "4                 0           0       0              0           0          0   \n",
       "\n",
       "   benchmark  bikes  blood  board_game  bowling  building  bullet_hell  \\\n",
       "0          0      0      0           0        0         0            0   \n",
       "1          0      0      0           0        0         0            0   \n",
       "2          0      0      0           0        0         0            0   \n",
       "3          0      0      0           0        0         0            0   \n",
       "4          0      0      0           0        0         0            0   \n",
       "\n",
       "   bullet_time  crpg  ...  text_based  third_person  third_person_shooter  \\\n",
       "0            0     0  ...           0             0                     0   \n",
       "1            0     0  ...           0             0                     0   \n",
       "2            0     0  ...           0             0                     0   \n",
       "3            0     0  ...           0             0                     0   \n",
       "4            0     0  ...           0             0                     0   \n",
       "\n",
       "   thriller  time_attack  time_management  time_manipulation  time_travel  \\\n",
       "0         0            0                0                  0            0   \n",
       "1         0            0                0                  0            0   \n",
       "2         0            0                0                  0            0   \n",
       "3         0            0                0                  0            0   \n",
       "4         0            0                0                  0            0   \n",
       "\n",
       "   top_down  top_down_shooter  touch_friendly  tower_defense  trackir  \\\n",
       "0         0                 0               0              0        0   \n",
       "1         0                 0               0              0        0   \n",
       "2         0                 0               0              0        0   \n",
       "3         0                 0               0              0        0   \n",
       "4         0                 0               0              0        0   \n",
       "\n",
       "   trading  trading_card_game  trains  transhumanism  turn_based  \\\n",
       "0        0                  0       0              0           0   \n",
       "1        0                  0       0              0           0   \n",
       "2        0                  0       0              0           0   \n",
       "3        0                  0       0              0           0   \n",
       "4        0                  0       0              0           0   \n",
       "\n",
       "   turn_based_combat  turn_based_strategy  turn_based_tactics  tutorial  \\\n",
       "0                  0                    0                   0         0   \n",
       "1                  0                    0                   0         0   \n",
       "2                  0                    0                   0         0   \n",
       "3                  0                    0                   0         0   \n",
       "4                  0                    0                   0         0   \n",
       "\n",
       "   twin_stick_shooter  typing  underground  underwater  unforgiving  \\\n",
       "0                   0       0            0           0            0   \n",
       "1                   0       0            0           0            0   \n",
       "2                   0       0            0           0            0   \n",
       "3                   0       0            0           0            0   \n",
       "4                   0       0            0           0            0   \n",
       "\n",
       "   utilities  vr  vr_only  vampire  video_production  villain_protagonist  \\\n",
       "0          0   0        0        0                 0                    0   \n",
       "1          0   0        0        0                 0                    0   \n",
       "2          0   0        0        0                 0                    0   \n",
       "3          0   0        0        0                 0                    0   \n",
       "4          0   0        0        0                 0                    0   \n",
       "\n",
       "   violent  visual_novel  voice_control  voxel  walking_simulator  war  \\\n",
       "0       40             0              0      0                  0    0   \n",
       "1       26             0              0      0                  0    0   \n",
       "2        0             0              0      0                  0   80   \n",
       "3        0             0              0      0                  0    0   \n",
       "4        0             0              0      0                  0    0   \n",
       "\n",
       "   wargame  warhammer_40k  web_publishing  werewolves  western  word_game  \\\n",
       "0        0              0               0           0        0          0   \n",
       "1        0              0               0           0        0          0   \n",
       "2        0              0               0           0        0          0   \n",
       "3        0              0               0           0        0          0   \n",
       "4        0              0               0           0        0          0   \n",
       "\n",
       "   world_war_i  world_war_ii  wrestling  zombies  e_sports  \n",
       "0            0             0          0        0       550  \n",
       "1            0             0          0        0         0  \n",
       "2            5           122          0        0         0  \n",
       "3            0             0          0        0         0  \n",
       "4            0             0          0        0         0  \n",
       "\n",
       "[5 rows x 371 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def parse_tags(x):\n",
    "    x = literal_eval(x)\n",
    "    \n",
    "    if isinstance(x, dict):\n",
    "        return x\n",
    "    elif isinstance(x, list):\n",
    "        return {}\n",
    "    else:\n",
    "        raise TypeError('Something other than dict or list found')\n",
    "\n",
    "parsed_tags = tags.apply(parse_tags)\n",
    "        \n",
    "tag_data = pd.DataFrame()\n",
    "\n",
    "for col in sorted(cols):\n",
    "    # standardise column names\n",
    "    col_name = col.lower().replace(' ', '_').replace('-', '_').replace(\"'\", \"\")\n",
    "\n",
    "    # check if column in row's dictionary of tags and return that value if it is, or 0 if it isn't\n",
    "    tag_data[col_name] = parsed_tags.apply(lambda x: x[col] if col in x.keys() else 0)\n",
    "\n",
    "tag_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll have to add in the AppIDs, but apart from that this makes up the tag data that we'll export. Next we have to figure out how to extract just the top three tags for use in the core dataset.\n",
    "\n",
    "It turns out this is a relatively simple process. As the keys are in order of highest votes to least, we can just extract the first 3 keys for each row. This may not be the safest approach, as dictionaries don't necessarily preserve key order, but it's fine for our purposes. We can then join these keys on a semicolon to create a list of the tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0          Action;FPS;Multiplayer\n",
       "1          Action;FPS;Multiplayer\n",
       "2    FPS;World War II;Multiplayer\n",
       "3          Action;FPS;Multiplayer\n",
       "4               FPS;Action;Sci-fi\n",
       "Name: tags, dtype: object"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def parse_tags(x):\n",
    "    x = literal_eval(x)\n",
    "\n",
    "    if isinstance(x, dict):\n",
    "        return ';'.join(list(x.keys())[:3])\n",
    "    else:\n",
    "        return np.nan\n",
    "    \n",
    "tags.apply(parse_tags).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handle Owners Column\n",
    "\n",
    "The final column we'll look at before defining a function to perform the cleaning is the `owners` column. This column is made up of the lower and upper bound of an estimation for the amount of owners for each title. For privacy reasons, SteamSpy can't get exact figures from Steam (though it [used to](https://www.polygon.com/2018/4/12/17229752/steam-spy-charts-new-privacy-rules-valve)), so we'll have to make do with the figures we have here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    10,000,000 .. 20,000,000\n",
       "1     5,000,000 .. 10,000,000\n",
       "2     5,000,000 .. 10,000,000\n",
       "3     5,000,000 .. 10,000,000\n",
       "4     5,000,000 .. 10,000,000\n",
       "Name: owners, dtype: object"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "owners = raw_steamspy_data['owners']\n",
    "owners.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a couple of options for how to deal with the data. We could remove the commas and split the data, keeping the lower or upper bound."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    10000000\n",
       "1     5000000\n",
       "2     5000000\n",
       "3     5000000\n",
       "4     5000000\n",
       "Name: owners, dtype: int64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "owners_split = owners.str.replace(',', '').str.split(' .. ')\n",
    "owners_split.apply(lambda x: int(x[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could calculate the mid-point between the points, and keep that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    15000000\n",
       "1     7500000\n",
       "2     7500000\n",
       "3     7500000\n",
       "4     7500000\n",
       "Name: owners, dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "owners_split.apply(lambda x: (int(x[0]) + int(x[1])) // 2).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we could reformat the data slightly, but pretty much keep it as is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    10000000-20000000\n",
       "1     5000000-10000000\n",
       "2     5000000-10000000\n",
       "3     5000000-10000000\n",
       "4     5000000-10000000\n",
       "Name: owners, dtype: object"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "owners.str.replace(',', '').str.replace(' .. ', '-').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll go with this last option. It allows us to keep the ranges and also easily modify the data in the future."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Function\n",
    "\n",
    "We're now ready to define the functions to clean the data. We'll wrap all the logic inside a `process` function, and for most of the columns there isn't much code to write so we can confine it here. The only logic we'll separate out is for processing the tags column, as it is a more complex endeavour.\n",
    "\n",
    "In the next code cell we clean the data, export the full tags data to file, and inspect the clean dataframe. After that we inspect the exported data, verifying everything went as planned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exported tag data to '../data/exports/steamspy_tag_data.csv'\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>languages</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>Counter-Strike</td>\n",
       "      <td>124534</td>\n",
       "      <td>3339</td>\n",
       "      <td>10000000-20000000</td>\n",
       "      <td>17612</td>\n",
       "      <td>317</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>Team Fortress Classic</td>\n",
       "      <td>3318</td>\n",
       "      <td>633</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>277</td>\n",
       "      <td>62</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>Day of Defeat</td>\n",
       "      <td>3416</td>\n",
       "      <td>398</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>187</td>\n",
       "      <td>34</td>\n",
       "      <td>English, French, German, Italian, Spanish - Spain</td>\n",
       "      <td>FPS;World War II;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>Deathmatch Classic</td>\n",
       "      <td>1273</td>\n",
       "      <td>267</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>258</td>\n",
       "      <td>184</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>Half-Life: Opposing Force</td>\n",
       "      <td>5250</td>\n",
       "      <td>288</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>624</td>\n",
       "      <td>415</td>\n",
       "      <td>English, French, German, Korean</td>\n",
       "      <td>FPS;Action;Sci-fi</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   appid                       name  positive  negative             owners  \\\n",
       "0     10             Counter-Strike    124534      3339  10000000-20000000   \n",
       "1     20      Team Fortress Classic      3318       633   5000000-10000000   \n",
       "2     30              Day of Defeat      3416       398   5000000-10000000   \n",
       "3     40         Deathmatch Classic      1273       267   5000000-10000000   \n",
       "4     50  Half-Life: Opposing Force      5250       288   5000000-10000000   \n",
       "\n",
       "   average_forever  median_forever  \\\n",
       "0            17612             317   \n",
       "1              277              62   \n",
       "2              187              34   \n",
       "3              258             184   \n",
       "4              624             415   \n",
       "\n",
       "                                           languages  \\\n",
       "0  English, French, German, Italian, Spanish - Sp...   \n",
       "1  English, French, German, Italian, Spanish - Sp...   \n",
       "2  English, French, German, Italian, Spanish - Spain   \n",
       "3  English, French, German, Italian, Spanish - Sp...   \n",
       "4                    English, French, German, Korean   \n",
       "\n",
       "                           tags  \n",
       "0        Action;FPS;Multiplayer  \n",
       "1        Action;FPS;Multiplayer  \n",
       "2  FPS;World War II;Multiplayer  \n",
       "3        Action;FPS;Multiplayer  \n",
       "4             FPS;Action;Sci-fi  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def process_tags(df, export=False):\n",
    "    if export: \n",
    "        \n",
    "        tag_data = df[['appid', 'tags']].copy()\n",
    "        \n",
    "        def parse_export_tags(x):\n",
    "            x = literal_eval(x)\n",
    "\n",
    "            if isinstance(x, dict):\n",
    "                return x\n",
    "            elif isinstance(x, list):\n",
    "                return {}\n",
    "            else:\n",
    "                raise TypeError('Something other than dict or list found')\n",
    "\n",
    "        tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)\n",
    "\n",
    "        cols = set(itertools.chain(*tag_data['tags']))\n",
    "\n",
    "        for col in sorted(cols):\n",
    "            col_name = col.lower().replace(' ', '_').replace('-', '_').replace(\"'\", \"\")\n",
    "\n",
    "            tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)\n",
    "\n",
    "        tag_data = tag_data.drop('tags', axis=1)\n",
    "\n",
    "        tag_data.to_csv('../data/exports/steamspy_tag_data.csv', index=False)\n",
    "        print(\"Exported tag data to '../data/exports/steamspy_tag_data.csv'\")\n",
    "        \n",
    "        \n",
    "    def parse_tags(x):\n",
    "        x = literal_eval(x)\n",
    "        \n",
    "        if isinstance(x, dict):\n",
    "            return ';'.join(list(x.keys())[:3])\n",
    "        else:\n",
    "            return np.nan\n",
    "    \n",
    "    df['tags'] = df['tags'].apply(parse_tags)\n",
    "    \n",
    "    # rows with null tags seem to be superseded by newer release, so remove (e.g. dead island)\n",
    "    df = df[df['tags'].notnull()]\n",
    "    \n",
    "    return df\n",
    "\n",
    "\n",
    "def process(df):\n",
    "    df = df.copy()\n",
    "    \n",
    "    # handle missing values\n",
    "    df = df[(df['name'].notnull()) & (df['name'] != 'none')]\n",
    "    df = df[df['developer'].notnull()]\n",
    "    df = df[df['languages'].notnull()]\n",
    "    df = df[df['price'].notnull()]\n",
    "    \n",
    "    # remove unwanted columns\n",
    "    df = df.drop([\n",
    "        'genre', 'developer', 'publisher', 'score_rank', 'userscore', 'average_2weeks',\n",
    "        'median_2weeks', 'price', 'initialprice', 'discount', 'ccu'\n",
    "    ], axis=1)\n",
    "    \n",
    "    # keep top tags, exporting full tag data to file\n",
    "    df = process_tags(df, export=True)\n",
    "    \n",
    "    # reformat owners column\n",
    "    df['owners'] = df['owners'].str.replace(',', '').str.replace(' .. ', '-')\n",
    "    \n",
    "    return df\n",
    "\n",
    "\n",
    "steamspy_data = process(raw_steamspy_data)\n",
    "steamspy_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>1980s</th>\n",
       "      <th>1990s</th>\n",
       "      <th>2.5d</th>\n",
       "      <th>2d</th>\n",
       "      <th>2d_fighter</th>\n",
       "      <th>360_video</th>\n",
       "      <th>3d</th>\n",
       "      <th>3d_platformer</th>\n",
       "      <th>3d_vision</th>\n",
       "      <th>4_player_local</th>\n",
       "      <th>4x</th>\n",
       "      <th>6dof</th>\n",
       "      <th>atv</th>\n",
       "      <th>abstract</th>\n",
       "      <th>action</th>\n",
       "      <th>action_rpg</th>\n",
       "      <th>action_adventure</th>\n",
       "      <th>addictive</th>\n",
       "      <th>adventure</th>\n",
       "      <th>agriculture</th>\n",
       "      <th>aliens</th>\n",
       "      <th>alternate_history</th>\n",
       "      <th>america</th>\n",
       "      <th>animation_&amp;_modeling</th>\n",
       "      <th>anime</th>\n",
       "      <th>arcade</th>\n",
       "      <th>arena_shooter</th>\n",
       "      <th>artificial_intelligence</th>\n",
       "      <th>assassin</th>\n",
       "      <th>asynchronous_multiplayer</th>\n",
       "      <th>atmospheric</th>\n",
       "      <th>audio_production</th>\n",
       "      <th>bmx</th>\n",
       "      <th>base_building</th>\n",
       "      <th>baseball</th>\n",
       "      <th>based_on_a_novel</th>\n",
       "      <th>basketball</th>\n",
       "      <th>batman</th>\n",
       "      <th>battle_royale</th>\n",
       "      <th>beat_em_up</th>\n",
       "      <th>beautiful</th>\n",
       "      <th>benchmark</th>\n",
       "      <th>bikes</th>\n",
       "      <th>blood</th>\n",
       "      <th>board_game</th>\n",
       "      <th>bowling</th>\n",
       "      <th>building</th>\n",
       "      <th>bullet_hell</th>\n",
       "      <th>bullet_time</th>\n",
       "      <th>...</th>\n",
       "      <th>text_based</th>\n",
       "      <th>third_person</th>\n",
       "      <th>third_person_shooter</th>\n",
       "      <th>thriller</th>\n",
       "      <th>time_attack</th>\n",
       "      <th>time_management</th>\n",
       "      <th>time_manipulation</th>\n",
       "      <th>time_travel</th>\n",
       "      <th>top_down</th>\n",
       "      <th>top_down_shooter</th>\n",
       "      <th>touch_friendly</th>\n",
       "      <th>tower_defense</th>\n",
       "      <th>trackir</th>\n",
       "      <th>trading</th>\n",
       "      <th>trading_card_game</th>\n",
       "      <th>trains</th>\n",
       "      <th>transhumanism</th>\n",
       "      <th>turn_based</th>\n",
       "      <th>turn_based_combat</th>\n",
       "      <th>turn_based_strategy</th>\n",
       "      <th>turn_based_tactics</th>\n",
       "      <th>tutorial</th>\n",
       "      <th>twin_stick_shooter</th>\n",
       "      <th>typing</th>\n",
       "      <th>underground</th>\n",
       "      <th>underwater</th>\n",
       "      <th>unforgiving</th>\n",
       "      <th>utilities</th>\n",
       "      <th>vr</th>\n",
       "      <th>vr_only</th>\n",
       "      <th>vampire</th>\n",
       "      <th>video_production</th>\n",
       "      <th>villain_protagonist</th>\n",
       "      <th>violent</th>\n",
       "      <th>visual_novel</th>\n",
       "      <th>voice_control</th>\n",
       "      <th>voxel</th>\n",
       "      <th>walking_simulator</th>\n",
       "      <th>war</th>\n",
       "      <th>wargame</th>\n",
       "      <th>warhammer_40k</th>\n",
       "      <th>web_publishing</th>\n",
       "      <th>werewolves</th>\n",
       "      <th>western</th>\n",
       "      <th>word_game</th>\n",
       "      <th>world_war_i</th>\n",
       "      <th>world_war_ii</th>\n",
       "      <th>wrestling</th>\n",
       "      <th>zombies</th>\n",
       "      <th>e_sports</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>144</td>\n",
       "      <td>564</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2681</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>151</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>71</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>208</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>26</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>99</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>80</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>122</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>85</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>77</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>87</td>\n",
       "      <td>0</td>\n",
       "      <td>122</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>73</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 372 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   appid  1980s  1990s  2.5d  2d  2d_fighter  360_video  3d  3d_platformer  \\\n",
       "0     10    144    564     0   0           0          0   0              0   \n",
       "1     20      0     71     0   0           0          0   0              0   \n",
       "2     30      0      0     0   0           0          0   0              0   \n",
       "3     40      0      0     0   0           0          0   0              0   \n",
       "4     50      0     77     0   0           0          0   0              0   \n",
       "\n",
       "   3d_vision  4_player_local  4x  6dof  atv  abstract  action  action_rpg  \\\n",
       "0          0               0   0     0    0         0    2681           0   \n",
       "1          0               0   0     0    0         0     208           0   \n",
       "2          0               0   0     0    0         0      99           0   \n",
       "3          0               0   0     0    0         0      85           0   \n",
       "4          0               0   0     0    0         0     211           0   \n",
       "\n",
       "   action_adventure  addictive  adventure  agriculture  aliens  \\\n",
       "0                 0          0          0            0       0   \n",
       "1                 0          0         15            0       0   \n",
       "2                 0          0          0            0       0   \n",
       "3                 0          0          0            0       0   \n",
       "4                 0          0         87            0     122   \n",
       "\n",
       "   alternate_history  america  animation_&_modeling  anime  arcade  \\\n",
       "0                  0        0                     0      0       0   \n",
       "1                  0        0                     0      0       0   \n",
       "2                  0        0                     0      0       0   \n",
       "3                  0        0                     0      0       0   \n",
       "4                  0        0                     0      0       0   \n",
       "\n",
       "   arena_shooter  artificial_intelligence  assassin  asynchronous_multiplayer  \\\n",
       "0              0                        0       151                         0   \n",
       "1              0                        0         0                         0   \n",
       "2              0                        0         0                         0   \n",
       "3             22                        0         0                         0   \n",
       "4              0                        0         0                         0   \n",
       "\n",
       "   atmospheric  audio_production  bmx  base_building  baseball  \\\n",
       "0            0                 0    0              0         0   \n",
       "1            0                 0    0              0         0   \n",
       "2            0                 0    0              0         0   \n",
       "3            0                 0    0              0         0   \n",
       "4           73                 0    0              0         0   \n",
       "\n",
       "   based_on_a_novel  basketball  batman  battle_royale  beat_em_up  beautiful  \\\n",
       "0                 0           0       0              0           0          0   \n",
       "1                 0           0       0              0           0          0   \n",
       "2                 0           0       0              0           0          0   \n",
       "3                 0           0       0              0           0          0   \n",
       "4                 0           0       0              0           0          0   \n",
       "\n",
       "   benchmark  bikes  blood  board_game  bowling  building  bullet_hell  \\\n",
       "0          0      0      0           0        0         0            0   \n",
       "1          0      0      0           0        0         0            0   \n",
       "2          0      0      0           0        0         0            0   \n",
       "3          0      0      0           0        0         0            0   \n",
       "4          0      0      0           0        0         0            0   \n",
       "\n",
       "   bullet_time  ...  text_based  third_person  third_person_shooter  thriller  \\\n",
       "0            0  ...           0             0                     0         0   \n",
       "1            0  ...           0             0                     0         0   \n",
       "2            0  ...           0             0                     0         0   \n",
       "3            0  ...           0             0                     0         0   \n",
       "4            0  ...           0             0                     0         0   \n",
       "\n",
       "   time_attack  time_management  time_manipulation  time_travel  top_down  \\\n",
       "0            0                0                  0            0         0   \n",
       "1            0                0                  0            0         0   \n",
       "2            0                0                  0            0         0   \n",
       "3            0                0                  0            0         0   \n",
       "4            0                0                  0            0         0   \n",
       "\n",
       "   top_down_shooter  touch_friendly  tower_defense  trackir  trading  \\\n",
       "0                 0               0              0        0        0   \n",
       "1                 0               0              0        0        0   \n",
       "2                 0               0              0        0        0   \n",
       "3                 0               0              0        0        0   \n",
       "4                 0               0              0        0        0   \n",
       "\n",
       "   trading_card_game  trains  transhumanism  turn_based  turn_based_combat  \\\n",
       "0                  0       0              0           0                  0   \n",
       "1                  0       0              0           0                  0   \n",
       "2                  0       0              0           0                  0   \n",
       "3                  0       0              0           0                  0   \n",
       "4                  0       0              0           0                  0   \n",
       "\n",
       "   turn_based_strategy  turn_based_tactics  tutorial  twin_stick_shooter  \\\n",
       "0                    0                   0         0                   0   \n",
       "1                    0                   0         0                   0   \n",
       "2                    0                   0         0                   0   \n",
       "3                    0                   0         0                   0   \n",
       "4                    0                   0         0                   0   \n",
       "\n",
       "   typing  underground  underwater  unforgiving  utilities  vr  vr_only  \\\n",
       "0       0            0           0            0          0   0        0   \n",
       "1       0            0           0            0          0   0        0   \n",
       "2       0            0           0            0          0   0        0   \n",
       "3       0            0           0            0          0   0        0   \n",
       "4       0            0           0            0          0   0        0   \n",
       "\n",
       "   vampire  video_production  villain_protagonist  violent  visual_novel  \\\n",
       "0        0                 0                    0       40             0   \n",
       "1        0                 0                    0       26             0   \n",
       "2        0                 0                    0        0             0   \n",
       "3        0                 0                    0        0             0   \n",
       "4        0                 0                    0        0             0   \n",
       "\n",
       "   voice_control  voxel  walking_simulator  war  wargame  warhammer_40k  \\\n",
       "0              0      0                  0    0        0              0   \n",
       "1              0      0                  0    0        0              0   \n",
       "2              0      0                  0   80        0              0   \n",
       "3              0      0                  0    0        0              0   \n",
       "4              0      0                  0    0        0              0   \n",
       "\n",
       "   web_publishing  werewolves  western  word_game  world_war_i  world_war_ii  \\\n",
       "0               0           0        0          0            0             0   \n",
       "1               0           0        0          0            0             0   \n",
       "2               0           0        0          0            5           122   \n",
       "3               0           0        0          0            0             0   \n",
       "4               0           0        0          0            0             0   \n",
       "\n",
       "   wrestling  zombies  e_sports  \n",
       "0          0        0       550  \n",
       "1          0        0         0  \n",
       "2          0        0         0  \n",
       "3          0        0         0  \n",
       "4          0        0         0  \n",
       "\n",
       "[5 rows x 372 columns]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect tag data\n",
    "pd.read_csv('../data/exports/steamspy_tag_data.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merge and Export Clean Data\n",
    "\n",
    "Before we wrap up, we'll check for any missing values and export the SteamSpy data to its own csv file. Then we'll look at combining the previously cleaned Steam data and the newly cleaned SteamSpy data into one cohesive dataset. This will give us options and flexibility for analysis in the future, and means we will have one csv file to begin looking at."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "appid              0\n",
       "name               0\n",
       "positive           0\n",
       "negative           0\n",
       "owners             0\n",
       "average_forever    0\n",
       "median_forever     0\n",
       "languages          0\n",
       "tags               0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "steamspy_data.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "steamspy_data.to_csv('../data/exports/steamspy_clean.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To combine the dataframes we can use the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) function, specifying which columns to merge upon. We'll keep the default `how='inner'` behaviour, meaning we only keep rows with data in both dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>steam_appid</th>\n",
       "      <th>required_age</th>\n",
       "      <th>platforms</th>\n",
       "      <th>categories</th>\n",
       "      <th>genres</th>\n",
       "      <th>achievements</th>\n",
       "      <th>release_date</th>\n",
       "      <th>price</th>\n",
       "      <th>english</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>appid</th>\n",
       "      <th>name_steamspy</th>\n",
       "      <th>positive</th>\n",
       "      <th>negative</th>\n",
       "      <th>owners</th>\n",
       "      <th>average_forever</th>\n",
       "      <th>median_forever</th>\n",
       "      <th>languages</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Counter-Strike</td>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>Multi-player;Online Multi-Player;Local Multi-P...</td>\n",
       "      <td>Action</td>\n",
       "      <td>0</td>\n",
       "      <td>2000-11-01</td>\n",
       "      <td>7.19</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>10</td>\n",
       "      <td>Counter-Strike</td>\n",
       "      <td>124534</td>\n",
       "      <td>3339</td>\n",
       "      <td>10000000-20000000</td>\n",
       "      <td>17612</td>\n",
       "      <td>317</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Team Fortress Classic</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>Multi-player;Online Multi-Player;Local Multi-P...</td>\n",
       "      <td>Action</td>\n",
       "      <td>0</td>\n",
       "      <td>1999-04-01</td>\n",
       "      <td>3.99</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>20</td>\n",
       "      <td>Team Fortress Classic</td>\n",
       "      <td>3318</td>\n",
       "      <td>633</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>277</td>\n",
       "      <td>62</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Day of Defeat</td>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>Multi-player;Valve Anti-Cheat enabled</td>\n",
       "      <td>Action</td>\n",
       "      <td>0</td>\n",
       "      <td>2003-05-01</td>\n",
       "      <td>3.99</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>30</td>\n",
       "      <td>Day of Defeat</td>\n",
       "      <td>3416</td>\n",
       "      <td>398</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>187</td>\n",
       "      <td>34</td>\n",
       "      <td>English, French, German, Italian, Spanish - Spain</td>\n",
       "      <td>FPS;World War II;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Deathmatch Classic</td>\n",
       "      <td>40</td>\n",
       "      <td>0</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>Multi-player;Online Multi-Player;Local Multi-P...</td>\n",
       "      <td>Action</td>\n",
       "      <td>0</td>\n",
       "      <td>2001-06-01</td>\n",
       "      <td>3.99</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>40</td>\n",
       "      <td>Deathmatch Classic</td>\n",
       "      <td>1273</td>\n",
       "      <td>267</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>258</td>\n",
       "      <td>184</td>\n",
       "      <td>English, French, German, Italian, Spanish - Sp...</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Half-Life: Opposing Force</td>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>Single-player;Multi-player;Valve Anti-Cheat en...</td>\n",
       "      <td>Action</td>\n",
       "      <td>0</td>\n",
       "      <td>1999-11-01</td>\n",
       "      <td>3.99</td>\n",
       "      <td>1</td>\n",
       "      <td>Gearbox Software</td>\n",
       "      <td>Valve</td>\n",
       "      <td>50</td>\n",
       "      <td>Half-Life: Opposing Force</td>\n",
       "      <td>5250</td>\n",
       "      <td>288</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>624</td>\n",
       "      <td>415</td>\n",
       "      <td>English, French, German, Korean</td>\n",
       "      <td>FPS;Action;Sci-fi</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        name  steam_appid  required_age          platforms  \\\n",
       "0             Counter-Strike           10             0  windows;mac;linux   \n",
       "1      Team Fortress Classic           20             0  windows;mac;linux   \n",
       "2              Day of Defeat           30             0  windows;mac;linux   \n",
       "3         Deathmatch Classic           40             0  windows;mac;linux   \n",
       "4  Half-Life: Opposing Force           50             0  windows;mac;linux   \n",
       "\n",
       "                                          categories  genres  achievements  \\\n",
       "0  Multi-player;Online Multi-Player;Local Multi-P...  Action             0   \n",
       "1  Multi-player;Online Multi-Player;Local Multi-P...  Action             0   \n",
       "2              Multi-player;Valve Anti-Cheat enabled  Action             0   \n",
       "3  Multi-player;Online Multi-Player;Local Multi-P...  Action             0   \n",
       "4  Single-player;Multi-player;Valve Anti-Cheat en...  Action             0   \n",
       "\n",
       "  release_date  price  english         developer publisher  appid  \\\n",
       "0   2000-11-01   7.19        1             Valve     Valve     10   \n",
       "1   1999-04-01   3.99        1             Valve     Valve     20   \n",
       "2   2003-05-01   3.99        1             Valve     Valve     30   \n",
       "3   2001-06-01   3.99        1             Valve     Valve     40   \n",
       "4   1999-11-01   3.99        1  Gearbox Software     Valve     50   \n",
       "\n",
       "               name_steamspy  positive  negative             owners  \\\n",
       "0             Counter-Strike    124534      3339  10000000-20000000   \n",
       "1      Team Fortress Classic      3318       633   5000000-10000000   \n",
       "2              Day of Defeat      3416       398   5000000-10000000   \n",
       "3         Deathmatch Classic      1273       267   5000000-10000000   \n",
       "4  Half-Life: Opposing Force      5250       288   5000000-10000000   \n",
       "\n",
       "   average_forever  median_forever  \\\n",
       "0            17612             317   \n",
       "1              277              62   \n",
       "2              187              34   \n",
       "3              258             184   \n",
       "4              624             415   \n",
       "\n",
       "                                           languages  \\\n",
       "0  English, French, German, Italian, Spanish - Sp...   \n",
       "1  English, French, German, Italian, Spanish - Sp...   \n",
       "2  English, French, German, Italian, Spanish - Spain   \n",
       "3  English, French, German, Italian, Spanish - Sp...   \n",
       "4                    English, French, German, Korean   \n",
       "\n",
       "                           tags  \n",
       "0        Action;FPS;Multiplayer  \n",
       "1        Action;FPS;Multiplayer  \n",
       "2  FPS;World War II;Multiplayer  \n",
       "3        Action;FPS;Multiplayer  \n",
       "4             FPS;Action;Sci-fi  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "steam_data = pd.read_csv('../data/exports/steam_data_clean.csv')\n",
    "    \n",
    "merged = steam_data.merge(steamspy_data, left_on='steam_appid', right_on='appid', suffixes=('', '_steamspy'))\n",
    "merged.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can remove, reorder and rename some columns before exporting the complete, clean dataset.\n",
    "\n",
    "We'll remove the SteamSpy rows with overlap - name, languages and AppID. Next we reindex the dataframe, reordering the columns so it makes a little more logical sense to look at, and then we rename some of the columns to make it a bit clearer what they describe. \n",
    "\n",
    "This is a quick and simple step, but as we'll be making this data available for public use it's a very good idea to make it as user-friendly as possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>appid</th>\n",
       "      <th>name</th>\n",
       "      <th>release_date</th>\n",
       "      <th>english</th>\n",
       "      <th>developer</th>\n",
       "      <th>publisher</th>\n",
       "      <th>platforms</th>\n",
       "      <th>required_age</th>\n",
       "      <th>categories</th>\n",
       "      <th>genres</th>\n",
       "      <th>steamspy_tags</th>\n",
       "      <th>achievements</th>\n",
       "      <th>positive_ratings</th>\n",
       "      <th>negative_ratings</th>\n",
       "      <th>average_playtime</th>\n",
       "      <th>median_playtime</th>\n",
       "      <th>owners</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>Counter-Strike</td>\n",
       "      <td>2000-11-01</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>0</td>\n",
       "      <td>Multi-player;Online Multi-Player;Local Multi-P...</td>\n",
       "      <td>Action</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "      <td>0</td>\n",
       "      <td>124534</td>\n",
       "      <td>3339</td>\n",
       "      <td>17612</td>\n",
       "      <td>317</td>\n",
       "      <td>10000000-20000000</td>\n",
       "      <td>7.19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>Team Fortress Classic</td>\n",
       "      <td>1999-04-01</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>0</td>\n",
       "      <td>Multi-player;Online Multi-Player;Local Multi-P...</td>\n",
       "      <td>Action</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "      <td>0</td>\n",
       "      <td>3318</td>\n",
       "      <td>633</td>\n",
       "      <td>277</td>\n",
       "      <td>62</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>3.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>Day of Defeat</td>\n",
       "      <td>2003-05-01</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>0</td>\n",
       "      <td>Multi-player;Valve Anti-Cheat enabled</td>\n",
       "      <td>Action</td>\n",
       "      <td>FPS;World War II;Multiplayer</td>\n",
       "      <td>0</td>\n",
       "      <td>3416</td>\n",
       "      <td>398</td>\n",
       "      <td>187</td>\n",
       "      <td>34</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>3.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>Deathmatch Classic</td>\n",
       "      <td>2001-06-01</td>\n",
       "      <td>1</td>\n",
       "      <td>Valve</td>\n",
       "      <td>Valve</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>0</td>\n",
       "      <td>Multi-player;Online Multi-Player;Local Multi-P...</td>\n",
       "      <td>Action</td>\n",
       "      <td>Action;FPS;Multiplayer</td>\n",
       "      <td>0</td>\n",
       "      <td>1273</td>\n",
       "      <td>267</td>\n",
       "      <td>258</td>\n",
       "      <td>184</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>3.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>Half-Life: Opposing Force</td>\n",
       "      <td>1999-11-01</td>\n",
       "      <td>1</td>\n",
       "      <td>Gearbox Software</td>\n",
       "      <td>Valve</td>\n",
       "      <td>windows;mac;linux</td>\n",
       "      <td>0</td>\n",
       "      <td>Single-player;Multi-player;Valve Anti-Cheat en...</td>\n",
       "      <td>Action</td>\n",
       "      <td>FPS;Action;Sci-fi</td>\n",
       "      <td>0</td>\n",
       "      <td>5250</td>\n",
       "      <td>288</td>\n",
       "      <td>624</td>\n",
       "      <td>415</td>\n",
       "      <td>5000000-10000000</td>\n",
       "      <td>3.99</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   appid                       name release_date  english         developer  \\\n",
       "0     10             Counter-Strike   2000-11-01        1             Valve   \n",
       "1     20      Team Fortress Classic   1999-04-01        1             Valve   \n",
       "2     30              Day of Defeat   2003-05-01        1             Valve   \n",
       "3     40         Deathmatch Classic   2001-06-01        1             Valve   \n",
       "4     50  Half-Life: Opposing Force   1999-11-01        1  Gearbox Software   \n",
       "\n",
       "  publisher          platforms  required_age  \\\n",
       "0     Valve  windows;mac;linux             0   \n",
       "1     Valve  windows;mac;linux             0   \n",
       "2     Valve  windows;mac;linux             0   \n",
       "3     Valve  windows;mac;linux             0   \n",
       "4     Valve  windows;mac;linux             0   \n",
       "\n",
       "                                          categories  genres  \\\n",
       "0  Multi-player;Online Multi-Player;Local Multi-P...  Action   \n",
       "1  Multi-player;Online Multi-Player;Local Multi-P...  Action   \n",
       "2              Multi-player;Valve Anti-Cheat enabled  Action   \n",
       "3  Multi-player;Online Multi-Player;Local Multi-P...  Action   \n",
       "4  Single-player;Multi-player;Valve Anti-Cheat en...  Action   \n",
       "\n",
       "                  steamspy_tags  achievements  positive_ratings  \\\n",
       "0        Action;FPS;Multiplayer             0            124534   \n",
       "1        Action;FPS;Multiplayer             0              3318   \n",
       "2  FPS;World War II;Multiplayer             0              3416   \n",
       "3        Action;FPS;Multiplayer             0              1273   \n",
       "4             FPS;Action;Sci-fi             0              5250   \n",
       "\n",
       "   negative_ratings  average_playtime  median_playtime             owners  \\\n",
       "0              3339             17612              317  10000000-20000000   \n",
       "1               633               277               62   5000000-10000000   \n",
       "2               398               187               34   5000000-10000000   \n",
       "3               267               258              184   5000000-10000000   \n",
       "4               288               624              415   5000000-10000000   \n",
       "\n",
       "   price  \n",
       "0   7.19  \n",
       "1   3.99  \n",
       "2   3.99  \n",
       "3   3.99  \n",
       "4   3.99  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# remove overlapping columns\n",
    "steam_clean = merged.drop(['name_steamspy', 'languages', 'steam_appid'], axis=1)\n",
    "\n",
    "# reindex to reorder columns\n",
    "steam_clean = steam_clean[[\n",
    "    'appid',\n",
    "    'name',\n",
    "    'release_date',\n",
    "    'english',\n",
    "    'developer',\n",
    "    'publisher',\n",
    "    'platforms',\n",
    "    'required_age',\n",
    "    'categories',\n",
    "    'genres',\n",
    "    'tags',\n",
    "    'achievements',\n",
    "    'positive',\n",
    "    'negative',\n",
    "    'average_forever',\n",
    "    'median_forever',\n",
    "    'owners',\n",
    "    'price'\n",
    "]]\n",
    "\n",
    "steam_clean = steam_clean.rename({\n",
    "    'tags': 'steamspy_tags',\n",
    "    'positive': 'positive_ratings',\n",
    "    'negative': 'negative_ratings',\n",
    "    'average_forever': 'average_playtime',\n",
    "    'median_forever': 'median_playtime'\n",
    "}, axis=1)\n",
    "\n",
    "steam_clean.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# export clean dataset\n",
    "steam_clean.to_csv('../data/steam_clean.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And with that, we have come to the end of the cleaning process. It's been a long journey and hopefully you've picked up something useful along the way, and now we can look ahead to analysing the data, creating some visualisations, and gaining some insights into the multi-billion dollar industry that is video games."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}