This question is an extension of an earlier question. Consider the pandas DataFrame shown in the table below.

   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst   set
0           a  volvo       p      swe      1        0        1   23    set   set
1           b  volvo    None      swe      0        0        1   45    set   set
2           c    bmw       p       us      0        0        1   56   test  test
3           d    bmw       p       us      0        1        1   43   test  test
4           e    bmw       d  germany      1        0        1   34    set   set
5           f   audi       d  germany      1        0        1   59    set   set
6           g  volvo       d      swe      1        0        0   65   test   set
7           h   audi       d      swe      1        0        0   78   test   set
8           i  volvo       d       us      1        1        1   32    set   set

To convert a column of string entries to numbers, one can define a mapping and pass it to DataFrame.replace().

For example:

mapping = {'set': 1, 'test': 2}
df.replace({'set': mapping, 'tesst': mapping})

This would lead to the following DataFrame (table):

   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0           a  volvo       p      swe      1        0        1   23      1    1
1           b  volvo    None      swe      0        0        1   45      1    1
2           c    bmw       p       us      0        0        1   56      2    2
3           d    bmw       p       us      0        1        1   43      2    2
4           e    bmw       d  germany      1        0        1   34      1    1
5           f   audi       d  germany      1        0        1   59      1    1
6           g  volvo       d      swe      1        0        0   65      2    1
7           h   audi       d      swe      1        0        0   78      2    1
8           i  volvo       d       us      1        1        1   32      1    1

As seen above, the strings in the last two columns are replaced with numbers representing them.

The question is then: is there a faster, less hands-on way to replace all the strings with numbers? Can the mapping be created automatically (and output somewhere for human reference)?

Something that makes the DataFrame end up like:

   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0           1      1       1        1      1        0        1   23      1    1
1           2      1       2        1      0        0        1   45      1    1
2           3      2       1        2      0        0        1   56      2    2
3           4      2       1        2      0        1        1   43      2    2
4           5      2       3        3      1        0        1   34      1    1
5           6      3       3        3      1        0        1   59      1    1
6           7      1       3        1      1        0        0   65      2    1
7           8      3       3        1      1        0        0   78      2    1
8           9      1       3        2      1        1        1   32      1    1

Also output:

[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]

Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.

3 Answers

You can adapt the code given in this answer, https://stackoverflow.com/a/39989896/15320403 (inside the post you linked), to generate a mapping for each column of your choice, and then apply replace as you suggested:

all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
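
To extend this to several columns at once, something like the following might work. It is only a sketch: the small DataFrame is a stand-in for the one in the question, the names `string_cols`, `mappings` and `encoded` are made up for illustration, codes start at 1 (rather than 0 as above) to match the desired output in the question, and `Series.map` is used in place of `replace` because each column has its own mapping.

```python
import pandas as pd

# Stand-in for the question's DataFrame (only three of its columns).
df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "bmw", "bmw", "audi", "volvo", "audi", "volvo"],
    "country": ["swe", "swe", "us", "us", "germany", "germany", "swe", "swe", "us"],
    "age": [23, 45, 56, 43, 34, 59, 65, 78, 32],
})

string_cols = ["brand", "country"]  # columns to encode, chosen by hand here

# One mapping per column; codes follow the order of first appearance, starting at 1.
mappings = {
    col: {value: code for code, value in enumerate(df[col].unique(), start=1)}
    for col in string_cols
}

# Apply each column's own mapping to a copy, leaving df untouched.
encoded = df.copy()
for col in string_cols:
    encoded[col] = df[col].map(mappings[col])

print(mappings)
# {'brand': {'volvo': 1, 'bmw': 2, 'audi': 3}, 'country': {'swe': 1, 'us': 2, 'germany': 3}}
```

The `mappings` dict doubles as the human-readable reference the question asks for. Note that this sample has no missing values; a column such as engine, which contains None, would need a decision on how missing entries should be coded.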

You will need to first change the type of the columns to Categorical and then create a new column or overwrite the existing column with codes:

df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes

If you need the mapping:

dict(enumerate(df['brand'].cat.categories )) #This will work only after you've converted the column to categorical
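
As a rough sketch of how this could be applied to every non-numeric column in one go (the tiny DataFrame and the name `maps` are assumptions for illustration only):

```python
import pandas as pd

# Tiny stand-in for the question's DataFrame.
df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "audi"],
    "engine": ["p", None, "p", "d"],
    "age": [23, 45, 56, 59],
})

maps = {}
for col in df.select_dtypes(exclude="number").columns:
    cat = pd.Categorical(df[col])
    # Categories are sorted, and each code is the position of the value
    # in that sorted list. Missing values are not a category; they get -1.
    maps[col] = {value: code for code, value in enumerate(cat.categories)}
    df[col] = cat.codes

print(maps)
# {'brand': {'audi': 0, 'bmw': 1, 'volvo': 2}, 'engine': {'d': 0, 'p': 1}}
```

Here the mapping goes from string to code (the reverse of `dict(enumerate(...))` above), which is the direction shown in the question. Because categories are sorted, the codes are alphabetical rather than in order of first appearance.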

Based on the other answers, I've written this function to solve the problem:

import pandas as pd

def convertStringColumnsToNum(data):
    maps = []

    # iterate over (column name, dtype) pairs instead of indexing by position
    for col, dtype in data.dtypes.items():
        # don't change columns that already consist of numbers
        if dtype == 'int64':  # can be extended to more dtypes
            continue
        # inspired by Shivam Roy's answer
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes  # overwrite the column in place with its codes
        maps.append(tmp.categories)

    return maps

This function returns the maps used to replace the strings with numeric codes, one list of categories per converted column. The code for a string is the index at which that string sits in its column's list. The DataFrame itself is modified in place. The function works, but it triggers a SettingWithCopyWarning.

If it ain't broke, don't fix it, right? ;)

But if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment on it. It does work, though *shrugs*.
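
One possible adaptation, sketched under an assumption the post does not confirm: that the warning appears because the DataFrame passed in is itself a slice of another DataFrame. In that case, working on an explicit copy and returning it should avoid the warning. The name `encode_string_columns` and the sample data are hypothetical.

```python
import pandas as pd

def encode_string_columns(data):
    """Return an encoded copy of `data` and a {column: {string: code}} dict."""
    encoded = data.copy()  # write into a copy, never into the caller's frame
    maps = {}
    for col in encoded.columns:
        if pd.api.types.is_numeric_dtype(encoded[col]):
            continue  # leave numeric columns as they are
        cat = pd.Categorical(encoded[col])
        encoded[col] = cat.codes
        maps[col] = {value: code for code, value in enumerate(cat.categories)}
    return encoded, maps

full = pd.DataFrame({
    "brand": ["volvo", "bmw", "audi", "volvo"],
    "age": [23, 45, 56, 65],
})
subset = full[full["age"] > 30]  # a slice like this is a typical source of the warning

encoded, maps = encode_string_columns(subset)
print(maps)
# {'brand': {'audi': 0, 'bmw': 1, 'volvo': 2}}
```

The trade-off is that the caller has to use the returned DataFrame instead of relying on in-place modification; `subset` keeps its original strings.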
