Hello, I am following a video on Udemy in which we apply a random forest classifier. Before doing so, we convert one of the columns in a data frame into a string. The 'Cabin' column holds values such as "4C", and to reduce the number of unique values we keep only the first character and map it onto a new column, 'Cabin_mapped'.


data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
    data['Cabin_mapped'].unique(),0)}

data.loc[:,'Cabin_mapped'] =  data.loc[:,'Cabin_mapped'].map(cabin_dict)

data[['Cabin_mapped', 'Cabin']].head() 
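As a rough plain-Python illustration of what this mapping does (the cabin values below are made up, and a missing cabin is assumed to be stored as NaN), note that the string form of NaN is "nan", so a missing entry ends up under the key 'n':

```python
# Hypothetical cabin values; float("nan") stands in for a missing entry.
cabins = ["C85", float("nan"), "E46", "C123"]

# str(nan) is "nan", so a missing cabin contributes the first character "n".
first_chars = [str(c)[0] for c in cabins]

# Number the distinct first characters in order of first appearance,
# mirroring enumerate(<unique values>, 0) in the snippet above.
cabin_dict = {k: i for i, k in enumerate(dict.fromkeys(first_chars))}
cabin_mapped = [cabin_dict[c] for c in first_chars]

print(first_chars)
print(cabin_dict)
print(cabin_mapped)
```

This is only an analogue of the pandas code, not a replacement for it; it shows why a lowercase 'n' can appear next to the capital deck letters.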

This part below is simply splitting the data into training and test set. The parameters don't really matter for figuring out the problem.

X_train_less_cat, X_test_less_cat, y_train, y_test = \
    train_test_split(data[use_cols].fillna(0), data.Survived, 
                     test_size = 0.3, random_state=0) 

I get an error here after the fit, saying it could not convert a string to float.

rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)

It seems like I need to convert one of the inputs back into float to use the random forest algorithms. This is despite the error not showing up in the tutorial video. If anyone could help me out, that'd be great.
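One way to see which inputs trigger this kind of error is to list the columns that are not numeric. A minimal sketch, assuming a made-up frame X with columns named Cabin, Cabin_mapped and Sex (the values are hypothetical, and Sex is assumed to be encoded as numbers already):

```python
import pandas as pd

# Hypothetical toy frame: 'Cabin' is still text, the other columns are numeric.
X = pd.DataFrame({
    "Cabin": ["C85", "E49", "X0"],
    "Cabin_mapped": [0, 1, 2],
    "Sex": [0, 1, 1],
})

# Columns without a numeric dtype are the ones that cannot be cast to float.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Keep only the numeric columns before fitting a model.
X_numeric = X.select_dtypes(include="number")
print(X_numeric.columns.tolist())
```

Running the same check on the real training frame should point at whichever column still holds strings.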

  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. Minimal, complete, verifiable example applies here. We cannot effectively help you until you post your MCVE code and accurately describe the problem. We should be able to paste your posted code into a text file and reproduce the problem you described. Commented Sep 20, 2018 at 23:15
  • "I get an error here after the fit, saying I could not convert the float into a string" - but the error in your title is other way around, string-to-float, and I'd guess that's because you have NaNs in your table, hence the 'n'. Where did they come from? Should they be there? Commented Sep 20, 2018 at 23:33
  • Rup, sorry I meant to write "convert the string into a float". So the 'n' keys are corresponding to the NaN values and mapped to 0 in Cabin_mapped. I'm still wondering why the 'n' would cause a problem unlike the other capitalized letters. I tried solving this by data['Cabin'].fillna() but I wouldn't be able to use fillna(0) would I since I need a letter? Commented Sep 20, 2018 at 23:40
  • Rup, I did data['Cabin'] = data['Cabin'].fillna('X0'), but the same problem persists, except this time with "X0" instead of n. I also just tried fillna(0), which gives me an error indicating that 'E49' cannot be converted. That is strange, because E49 is just a random cabin with no missing values. Commented Sep 21, 2018 at 0:02

1 Answer

Here's a fully working example; I've highlighted the bit that you are missing. You need to convert EVERY column to a number, not just 'Cabin'.

!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv

import pandas as pd

data = pd.read_csv("train.csv")




data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
    data['Cabin_mapped'].unique(),0)}

data.loc[:,'Cabin_mapped'] =  data.loc[:,'Cabin_mapped'].map(cabin_dict)

data[['Cabin_mapped', 'Cabin']].head()


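from sklearn.ensemble import RandomForestClassifier
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split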


## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n,v in data.items():
    if v.dtype == "object":
        data[n] = v.factorize()[0]
## END of the bit you're missing

use_cols = data.drop("Survived",axis=1).columns

X_train_less_cat, X_test_less_cat, y_train, y_test = \
    train_test_split(data[use_cols].fillna(0), data.Survived, 
                    test_size = 0.3, random_state=0) 


rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
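
For reference, a small sketch of what factorize does on a single text column. The sex series below is made-up data, not taken from the Titanic file:

```python
import numpy as np
import pandas as pd

# Hypothetical text column with one missing value.
sex = pd.Series(["male", "female", "male", np.nan], dtype=object)

# factorize returns integer codes plus the distinct values they stand for.
codes, uniques = sex.factorize()

print(codes.tolist())
print(list(uniques))
```

With the default settings, a missing value is coded as -1 rather than left as NaN, so a later fillna(0) does not change it. The codes are arbitrary labels, which tree-based models tend to tolerate but which carry no real ordering.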

1 Comment

Wow thank you for the extensive answer! You helped me solve the problem indirectly. So in the tutorial, the instructor used data[use_cols] which equals the data with three columns (Cabin, Cabin mapped, Sex). I had no idea why she was using this as it unnecessarily contained 'Cabin' which was not converted into a number. For her it worked. I simply took 'Cabin' out. Thank you!
