
Dimensionality reduction ended up working well: it stopped my code from crashing and brought the column count down from ~34,000 (after one-hot encoding) to 150. I used a Pipeline with a ColumnTransformer (sparse output) feeding into TruncatedSVD; code posted below:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB

categorical = ["developers", "publishers", "categories", "genres", "tags"]
numeric = ["price", "windows", "mac", "linux"]

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=True), categorical)],
    remainder="passthrough",
    sparse_threshold=1.0,  # keep the combined output sparse so TruncatedSVD receives a sparse matrix
)
svd = TruncatedSVD(n_components=150, random_state=42)
pipeline = Pipeline([("ct", ct), ("svd", svd), ("clf", BernoulliNB())])
X = randomizedDf[categorical + numeric]
y = randomizedDf["recommendation"]

This brought my training data down to a shape of (11200, 150).
