I am working in Python with a pandas DataFrame loaded from a CSV of Steam games. The categorical columns are publishers, developers, categories, genres, and tags; categories, genres, and tags are the most problematic. Each cell in these columns is a list of strings, e.g.
tags
"['Psychological Horror', 'D Vision', 'Emotional', 'Modern', 'Immersive Sim', 'Singleplayer', 'Dungeon Crawler', 'Realistic', 'Exploration', 'Mature', 'Walking Simulator', 'First-Person', 'Mystery', 'VR', 'Indie', 'Hidden Object', 'RPG', 'Puzzle', 'Adventure', 'Multiple Endings']"
"['Indie', 'Singleplayer', 'Narration', 'Hidden Object', 'Retro', 'D', 'Puzzle', 'Classic', 'Fantasy', 'Adventure', 'Story Rich', 'Family Friendly', 'Point & Click', 'Atmospheric', 'Minigames', 'Mystery']"
genres
"['Adventure', 'Indie', 'RPG', 'Simulation']"
"['Adventure', 'Indie']"
"['Action', 'Adventure']"
['Adventure']
"['RPG', 'Simulation', 'Sports', 'Early Access']"
"['Action', 'Adventure', 'RPG']"
"['Action', 'Adventure', 'Indie', 'Simulation']"
"['Adventure', 'Indie']"
"['Casual', 'Indie']"
"['Action', 'Adventure', 'Indie']"
categories
"['Single-player', 'VR Supported', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'In-App Purchases', 'Partial Controller Support', 'Family Sharing']"
"['Single-player', 'Family Sharing']"
"['Multi-player', 'PvP', 'Online PvP', 'Steam Achievements', 'Full controller support', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Remote Play on Tablet', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Remote Play on Phone', 'Remote Play on Tablet', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Cloud', 'Family Sharing']"
These rows are copied straight from the CSV; each double-quoted string is one entry.
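To show what these cells look like once loaded, here is a minimal sketch, assuming each cell is a Python-style list literal as in the rows above. The `raw` frame is made-up sample data standing in for the real CSV, and `ast.literal_eval` is just one way to parse such strings.

```python
import ast

import pandas as pd

# Hypothetical two-row sample mimicking the CSV cells shown above.
raw = pd.DataFrame({
    "genres": ["['Adventure', 'Indie']", "['Adventure']"],
})

# Each cell is a string that looks like a Python list; literal_eval
# turns it into a real list without executing arbitrary code.
raw["genres"] = raw["genres"].apply(ast.literal_eval)

print(raw["genres"].tolist())
```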
Because there are so many distinct genres, categories, and tags, encoding these columns to train a machine learning algorithm left me with over 34,000 columns. This is a project for a college class, so I do NOT want to work with that much data.
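A minimal sketch of one way such indicator columns can arise, not necessarily the exact encoding used here. It assumes the cells are already real lists, that no label contains the `|` character, and that the `is_` prefix matches the column naming in this question; the `games` frame is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample: each cell is already a list of labels.
games = pd.DataFrame({
    "genres": [["Adventure", "Indie"], ["Adventure"]],
})

# Join each list into one "|"-separated string, then let pandas build
# one 0/1 column per distinct label. Assumes no label contains "|".
dummies = games["genres"].str.join("|").str.get_dummies(sep="|").add_prefix("is_")

print(dummies)
```

With thousands of distinct tags, this produces one column per tag, which is how the column count grows so quickly.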
I want to drop the "is_[blank]" columns created by the encoding whose label appears only 1-5 times, as they just aren't that important in the grand scheme of 14000 data entries.
What can I do to drop some of these specific genres/tags/categories without dropping entire rows that have other tags/genres/categories that DO appear many more times than 1-5? Developers and Publishers are fine.
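To make the goal concrete, here is a minimal sketch of the kind of column-wise filtering described above. It assumes the encoded columns hold 0/1 values and share the `is_` prefix; the `encoded` frame, the `price` column, and the `MIN_COUNT` threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the encoded frame: one non-encoded column
# plus a few 0/1 indicator columns using the "is_" prefix.
encoded = pd.DataFrame({
    "price": [9.99, 0.0, 19.99, 4.99],
    "is_Indie": [1, 1, 1, 0],
    "is_Adventure": [1, 0, 1, 1],
    "is_Narration": [0, 1, 0, 0],
})

# Assumed threshold for this toy data; a value of 6 would correspond
# to removing labels with 1-5 appearances.
MIN_COUNT = 2

# Only the indicator columns are candidates for removal.
indicator_cols = [c for c in encoded.columns if c.startswith("is_")]

# For 0/1 columns, the column sum is the number of games with that label.
counts = encoded[indicator_cols].sum()
rare_cols = counts[counts < MIN_COUNT].index

# Dropping columns leaves every row in place.
reduced = encoded.drop(columns=rare_cols)

print(reduced.columns.tolist())
```

Because only columns are removed, a game that has one rare tag and several common ones keeps its row and its common-tag columns.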
is_[blank] columns from encoding categories with low (1-5) counts.
You need to post your code, else this question doesn't follow SO guidelines.