I am using Python with a pandas DataFrame loaded from a CSV of Steam games. I have the categorical columns publishers, developers, categories, genres, and tags, but categories, genres, and tags are the most problematic. These columns hold lists of strings, e.g.

tags
"['Psychological Horror', 'D Vision', 'Emotional', 'Modern', 'Immersive Sim', 'Singleplayer', 'Dungeon Crawler', 'Realistic', 'Exploration', 'Mature', 'Walking Simulator', 'First-Person', 'Mystery', 'VR', 'Indie', 'Hidden Object', 'RPG', 'Puzzle', 'Adventure', 'Multiple Endings']"
"['Indie', 'Singleplayer', 'Narration', 'Hidden Object', 'Retro', 'D', 'Puzzle', 'Classic', 'Fantasy', 'Adventure', 'Story Rich', 'Family Friendly', 'Point & Click', 'Atmospheric', 'Minigames', 'Mystery']"
genres
"['Adventure', 'Indie', 'RPG', 'Simulation']"
"['Adventure', 'Indie']"
"['Action', 'Adventure']"
['Adventure']
"['RPG', 'Simulation', 'Sports', 'Early Access']"
"['Action', 'Adventure', 'RPG']"
"['Action', 'Adventure', 'Indie', 'Simulation']"
"['Adventure', 'Indie']"
"['Casual', 'Indie']"
"['Action', 'Adventure', 'Indie']"
categories
"['Single-player', 'VR Supported', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'In-App Purchases', 'Partial Controller Support', 'Family Sharing']"
"['Single-player', 'Family Sharing']"
"['Multi-player', 'PvP', 'Online PvP', 'Steam Achievements', 'Full controller support', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Remote Play on Tablet', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Trading Cards', 'Steam Cloud', 'Remote Play on Phone', 'Remote Play on Tablet', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Steam Cloud', 'Family Sharing']"
"['Single-player', 'Steam Achievements', 'Full controller support', 'Steam Cloud', 'Family Sharing']"

from the CSV. A double quote mark starts a new entry.

Since there are so many genres, categories, and tags, encoding these columns to train a machine learning algorithm left me with over 34,000 columns. This is a project for my class in college, so I do NOT want to work with that much data.

I want to drop the "is_[blank]" columns created by the encoding that have only 1-5 appearances, as they just aren't as important in the grand scheme of 14,000 data entries.

What can I do to drop some of these specific genres/tags/categories without dropping entire rows that have other tags/genres/categories that DO appear many more times than 1-5? Developers and Publishers are fine.
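For concreteness, this is the kind of drop I mean — a toy sketch with made-up tags, using ast.literal_eval to parse the stringified lists and scikit-learn's MultiLabelBinarizer to encode them (the data, the `df` name, and the `> 5` threshold are illustrative):

```python
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the real CSV: each cell is a stringified Python list.
df = pd.DataFrame({
    "tags": [
        "['Indie', 'Puzzle']",
        "['Indie', 'RPG']",
        "['Indie', 'Mystery']",
        "['Indie']",
        "['Indie', 'Obscure Tag']",
        "['Indie', 'Puzzle']",
    ]
})

# Turn the stringified lists back into real Python lists.
df["tags"] = df["tags"].apply(ast.literal_eval)

# One 0/1 column per tag.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df["tags"]),
                       columns=mlb.classes_, index=df.index)

# Keep only tags appearing more than 5 times overall; no rows are dropped.
keep = encoded.columns[encoded.sum() > 5]
encoded = encoded[keep]
```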

  • So you want to drop the is_[blank] columns from encoding categories with low (1-5) counts. You need to post your code, else this question doesn't follow SO guidelines. Commented Nov 8 at 21:33

2 Answers


There are many ways you can handle this:

  1. Feature engineering - you can transform some of these features into booleans. For example, remove Single-player and Multi-player from categories and add a boolean flag has_multiplayer, which is True for games with multiplayer capabilities and False for everything else. Or group labels into bigger categories based on their similarity - e.g., Horror and Mystery group together well.

  2. Use other types of encoders. One-hot, which is the one you're using, is not the only option. You can try Ordinal or Mean (also called Target) encoders and see what works best for you. But be cautious about the model types you use - while tree-based algorithms work well with Ordinal encoding, logistic regression does not. On the other hand, Target encoding may lead to overfitting if not used carefully. You can learn more about them here

  3. Employ dimensionality reduction techniques, like PCA or SVD, on top of one-hot. They effectively lower the number of features, but it takes some experimentation to find the best number of resulting features, and those features become hard to interpret.

  4. Trim the number of features - you may not actually need all of this data to fit a model with decent performance. There are many ways to select features: based on statistics like correlation with the target, on model metrics, or on feature importances.
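A sketch of the feature-engineering idea in point 1, with toy data (the `has_multiplayer` name and the exact label set are illustrative, and this assumes the categories column has already been parsed into real Python lists):

```python
import pandas as pd

df = pd.DataFrame({
    "categories": [
        ["Single-player", "Family Sharing"],
        ["Multi-player", "PvP", "Online PvP"],
        ["Single-player", "Multi-player", "Steam Cloud"],
    ]
})

# Collapse several related labels into one boolean flag.
multiplayer_labels = {"Multi-player", "PvP", "Online PvP", "Co-op"}
df["has_multiplayer"] = df["categories"].apply(
    lambda cats: bool(multiplayer_labels.intersection(cats))
)

# The collapsed labels can then be removed from the original lists,
# shrinking the one-hot vocabulary without dropping any rows.
df["categories"] = df["categories"].apply(
    lambda cats: [c for c in cats if c not in multiplayer_labels]
)
```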



Dimensionality reduction ended up working well: it stopped my code from crashing and brought the column count down to 150 from over 34,000 (after encoding with a one-hot encoder). I used a pipeline with a ColumnTransformer with sparse output and TruncatedSVD; code posted below:

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["developers", "publishers", "categories", "genres", "tags"]
numeric = ["price", "windows", "mac", "linux"]

# One-hot encode the categorical columns; pass the numeric ones through.
ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore",
                                        sparse_output=True), categorical)],
    remainder="passthrough",
    sparse_threshold=0.0,
)
# Reduce the encoded matrix to 150 components before the classifier.
svd = TruncatedSVD(n_components=150, random_state=42)
pipeline = Pipeline([("ct", ct), ("svd", svd), ("clf", BernoulliNB())])

X = randomizedDf[categorical + numeric]
y = randomizedDf["recommendation"]

This brought my shape down to (11200, 300) for training data.

