I am using Python with a pandas DataFrame loaded from a CSV of Steam games. I have the categorical columns publishers, developers, categories, genres, and tags, but categories, genres, and tags are the most problematic. These columns hold lists of strings, e.g.

tags
"['Psychological Horror', 'D Vision', 'Emotional', 'Modern', 'Immersive Sim', 'Singleplayer', 'Dungeon Crawler', 'Realistic', 'Exploration', 'Mature', 'Walking Simulator', 'First-Person', 'Mystery', 'VR', 'Indie', 'Hidden Object', 'RPG', 'Puzzle', 'Adventure', 'Multiple Endings']"
"['Indie', 'Singleplayer', 'Narration', 'Hidden Object', 'Retro', 'D', 'Puzzle', 'Classic', 'Fantasy', 'Adventure', 'Story Rich', 'Family Friendly', 'Point & Click', 'Atmospheric', 'Minigames', 'Mystery']"
genres
"['Adventure', 'Indie', 'RPG', 'Simulation']"
"['Adventure', 'Indie']"
"['Action', 'Adventure']"
['Adventure']
"['RPG', 'Simulation', 'Sports', 'Early Access']"
"['Action', 'Adventure', 'RPG']"
"['Action', 'Adventure', 'Indie', 'Simulation']"
"['Adventure', 'Indie']"
"['Casual', 'Indie']"
"['Action', 'Adventure', 'Indie']"
categories
"['Single-player', 'VR Supported', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'In-App Purchases', 'Partial Controller Support', 'Family Sharing']"
"['Single-player', 'Family Sharing']"
"['Multi-player', 'PvP', 'Online PvP', 'Steam Achievements', 'Full controller support', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Remote Play on Tablet', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Remote Play on Phone', 'Remote Play on Tablet', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Cloud', 'Family Sharing']"

from the CSV. A double quote mark starts a new entry.

Since there are so many genres, categories, and tags, encoding these columns to train a machine learning algorithm left me with over 34,000 columns. This is a project for my class in college, so I do NOT want to work with that much data.

I want to drop the "is_[blank]" columns created by the encoding that have only 1-5 appearances, as they just aren't as important in the grand scheme of 14,000 data entries.

What can I do to drop some of these specific genres/tags/categories without dropping entire rows that have other tags/genres/categories that DO appear many more times than 1-5? Developers and Publishers are fine.
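For concreteness, this is the kind of drop I mean — a toy sketch with made-up tags, using ast.literal_eval to parse the stringified lists and scikit-learn's MultiLabelBinarizer to encode them (the data, the `df` name, and the `> 5` threshold are illustrative):

```python
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the real CSV: each cell is a stringified Python list.
df = pd.DataFrame({
    "tags": [
        "['Indie', 'Puzzle']",
        "['Indie', 'RPG']",
        "['Indie', 'Mystery']",
        "['Indie']",
        "['Indie', 'Obscure Tag']",
        "['Indie', 'Puzzle']",
    ]
})

# Turn the stringified lists back into real Python lists.
df["tags"] = df["tags"].apply(ast.literal_eval)

# One 0/1 column per tag.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df["tags"]),
                       columns=mlb.classes_, index=df.index)

# Keep only tags appearing more than 5 times overall; no rows are dropped.
keep = encoded.columns[encoded.sum() > 5]
encoded = encoded[keep]
```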

  • So you want to drop the is_[blank] columns from encoding categories with low (1-5) counts. You need to post your code, else this question doesn't follow SO guidelines. Commented Nov 8 at 21:33

2 Answers


There are many ways you can handle this:

  1. Feature engineering - you can transform some of these features into booleans. For example, remove Single-player and Multi-player from categories and add a boolean flag has_multiplayer, which is True for games with multiplayer capabilities and False for everything else. Or group labels into bigger categories based on their similarity - e.g., Horror and Mystery group together well.

  2. Use other types of encoders. One-hot, which is the one you're using, is not the only option. You can try Ordinal or Mean (also called Target) encoders and see what works best for you. But be cautious about the model types you use - while tree-based algorithms work well with Ordinal encoding, logistic regression does not. On the other hand, Target encoding may lead to overfitting if not used carefully. You can learn more about them here

  3. Employ dimensionality reduction techniques, like PCA or SVD, on top of one-hot. They effectively lower the number of features, but it takes some experimentation to find the best number of resulting features, and those features become hard to interpret.

  4. Trim the number of features - you may not actually need all of this data to fit a model with decent performance. There are many ways to select features: based on statistics like correlation with the target, on model metrics, or on feature importances.
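A sketch of the feature-engineering idea in point 1, with toy data (the `has_multiplayer` name and the exact label set are illustrative, and this assumes the categories column has already been parsed into real Python lists):

```python
import pandas as pd

df = pd.DataFrame({
    "categories": [
        ["Single-player", "Family Sharing"],
        ["Multi-player", "PvP", "Online PvP"],
        ["Single-player", "Multi-player", "Steam Cloud"],
    ]
})

# Collapse several related labels into one boolean flag.
multiplayer_labels = {"Multi-player", "PvP", "Online PvP", "Co-op"}
df["has_multiplayer"] = df["categories"].apply(
    lambda cats: bool(multiplayer_labels.intersection(cats))
)

# The collapsed labels can then be removed from the original lists,
# shrinking the one-hot vocabulary without dropping any rows.
df["categories"] = df["categories"].apply(
    lambda cats: [c for c in cats if c not in multiplayer_labels]
)
```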



Dimensionality reduction ended up working well: it stopped my code from crashing and brought the column count down to 150 from over 34,000 (after encoding with a one-hot encoder). I used a pipeline with a ColumnTransformer with sparse output and TruncatedSVD; code posted below:

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["developers", "publishers", "categories", "genres", "tags"]
numeric = ["price", "windows", "mac", "linux"]

# One-hot encode the categorical columns; pass the numeric ones through.
ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore",
                                        sparse_output=True), categorical)],
    remainder="passthrough",
    sparse_threshold=0.0,
)
# Reduce the encoded matrix to 150 components before the classifier.
svd = TruncatedSVD(n_components=150, random_state=42)
pipeline = Pipeline([("ct", ct), ("svd", svd), ("clf", BernoulliNB())])

X = randomizedDf[categorical + numeric]
y = randomizedDf["recommendation"]

This brought my shape down to (11200, 300) for training data.

