This question is an extension of an earlier question. Consider the pandas DataFrame shown in the table below.
| | respondent | brand | engine | country | aware | aware_2 | aware_3 | age | tesst | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | volvo | p | swe | 1 | 0 | 1 | 23 | set | set |
| 1 | b | volvo | None | swe | 0 | 0 | 1 | 45 | set | set |
| 2 | c | bmw | p | us | 0 | 0 | 1 | 56 | test | test |
| 3 | d | bmw | p | us | 0 | 1 | 1 | 43 | test | test |
| 4 | e | bmw | d | germany | 1 | 0 | 1 | 34 | set | set |
| 5 | f | audi | d | germany | 1 | 0 | 1 | 59 | set | set |
| 6 | g | volvo | d | swe | 1 | 0 | 0 | 65 | test | set |
| 7 | h | audi | d | swe | 1 | 0 | 0 | 78 | test | set |
| 8 | i | volvo | d | us | 1 | 1 | 1 | 32 | set | set |
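For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: it assumes the `None` in the `engine` column is the literal string `"None"` (which is what the desired mapping further down suggests), and that the unnamed first column is the default integer index.

```python
import pandas as pd

# Rebuild the example table; the default RangeIndex supplies the 0..8 index column.
df = pd.DataFrame({
    "respondent": list("abcdefghi"),
    "brand": ["volvo", "volvo", "bmw", "bmw", "bmw", "audi", "volvo", "audi", "volvo"],
    # Assumption: the missing engine is stored as the string "None", not Python's None.
    "engine": ["p", "None", "p", "p", "d", "d", "d", "d", "d"],
    "country": ["swe", "swe", "us", "us", "germany", "germany", "swe", "swe", "us"],
    "aware": [1, 0, 0, 0, 1, 1, 1, 1, 1],
    "aware_2": [0, 0, 0, 1, 0, 0, 0, 0, 1],
    "aware_3": [1, 1, 1, 1, 1, 1, 0, 0, 1],
    "age": [23, 45, 56, 43, 34, 59, 65, 78, 32],
    "tesst": ["set", "set", "test", "test", "set", "set", "test", "test", "set"],
    "set": ["set", "set", "test", "test", "set", "set", "set", "set", "set"],
})
```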
To convert a column of string entries to numbers, one can define a mapping and pass it to `DataFrame.replace()`.
For example:
mapping = {'set': 1, 'test': 2}
df = df.replace({'set': mapping, 'tesst': mapping})
This would lead to the following DataFrame (table):
| | respondent | brand | engine | country | aware | aware_2 | aware_3 | age | tesst | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | volvo | p | swe | 1 | 0 | 1 | 23 | 1 | 1 |
| 1 | b | volvo | None | swe | 0 | 0 | 1 | 45 | 1 | 1 |
| 2 | c | bmw | p | us | 0 | 0 | 1 | 56 | 2 | 2 |
| 3 | d | bmw | p | us | 0 | 1 | 1 | 43 | 2 | 2 |
| 4 | e | bmw | d | germany | 1 | 0 | 1 | 34 | 1 | 1 |
| 5 | f | audi | d | germany | 1 | 0 | 1 | 59 | 1 | 1 |
| 6 | g | volvo | d | swe | 1 | 0 | 0 | 65 | 2 | 1 |
| 7 | h | audi | d | swe | 1 | 0 | 0 | 78 | 2 | 1 |
| 8 | i | volvo | d | us | 1 | 1 | 1 | 32 | 1 | 1 |
As seen above, the strings in the last two columns are replaced with the numbers that represent them.
The question is then: is there a faster, less manual way to replace all the strings with numbers? Can one create the mapping automatically (and output it somewhere for human reference)?
Something that makes the DataFrame end up like:
| | respondent | brand | engine | country | aware | aware_2 | aware_3 | age | tesst | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 23 | 1 | 1 |
| 1 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 45 | 1 | 1 |
| 2 | 3 | 2 | 1 | 2 | 0 | 0 | 1 | 56 | 2 | 2 |
| 3 | 4 | 2 | 1 | 2 | 0 | 1 | 1 | 43 | 2 | 2 |
| 4 | 5 | 2 | 3 | 3 | 1 | 0 | 1 | 34 | 1 | 1 |
| 5 | 6 | 3 | 3 | 3 | 1 | 0 | 1 | 59 | 1 | 1 |
| 6 | 7 | 1 | 3 | 1 | 1 | 0 | 0 | 65 | 2 | 1 |
| 7 | 8 | 3 | 3 | 1 | 1 | 0 | 0 | 78 | 2 | 1 |
| 8 | 9 | 1 | 3 | 2 | 1 | 1 | 1 | 32 | 1 | 1 |
Also output:
[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]
Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.
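To make the requirement concrete, here is a rough sketch of the kind of thing I have in mind, built on `pd.factorize`. The helper name `encode_strings` is made up for illustration, and the small `sample` frame is a cut-down stand-in for the table above. It assumes codes should start at 1 and follow order of first appearance, and that every non-numeric column should be encoded.

```python
import pandas as pd

def encode_strings(frame):
    """Hypothetical helper: return (encoded copy, {column: {string: code}})."""
    encoded = frame.copy()
    mappings = {}
    for col in frame.columns:
        # Leave numeric columns (aware, age, ...) untouched.
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue
        # factorize assigns 0-based codes in order of first appearance.
        codes, uniques = pd.factorize(frame[col])
        # Shift to 1-based codes; real missing values (code -1) would become 0.
        mappings[col] = {value: code for code, value in enumerate(uniques, start=1)}
        encoded[col] = codes + 1
    return encoded, mappings

# Cut-down stand-in for the table in the question.
sample = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "audi"],
    "engine": ["p", "None", "p", "d"],
    "age": [23, 45, 56, 59],
})
encoded, mappings = encode_strings(sample)
print(encoded)
print(mappings)
```

Here the mappings come back as a dict keyed by column name rather than a bare list; `list(mappings.values())` would give the list form shown above. I am not sure this is the fastest or most idiomatic way, hence the question.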