Efficient conversion of dataframe distinct values in Python

Question

I have data like this:

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y

The data comes from an external source (loaded below). I would like to convert every distinct value in the whole DataFrame to a numeric value in the most efficient way. In the example above, I would like to map republican -> 1, democrat -> 2, y -> 3, n -> 4, and ? -> 5 (or NULL).

I tried to use the following:

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

However, I'm not sure whether Pandas could do this more efficiently, or whether there is a better solution altogether. (It should be generic to any data source.) Here is how I load the data into a DataFrame using Pandas:

import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset = pd.read_csv(file_path, header=None)

piRSquared · Accepted Answer · 2017-09-25 08:02:36Z

# df is the DataFrame loaded in the question (named dataset there)
v = df.values

f = pd.factorize(v.ravel())[0].reshape(v.shape)

pd.DataFrame(f)

   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
0   0   1   2   1   2   2   2   1   1   1   2   3   2   2   2   1   2
1   0   1   2   1   2   2   2   1   1   1   1   1   2   2   2   1   3
2   4   3   2   2   3   2   2   1   1   1   1   2   1   2   2   1   1
3   4   1   2   2   1   3   2   1   1   1   1   2   1   2   1   1   2
4   4   2   2   2   1   2   2   1   1   1   1   2   3   2   2   2   2
5   4   1   2   2   1   2   2   1   1   1   1   1   1   2   2   2   2
6   4   1   2   1   2   2   2   1   1   1   1   1   1   3   2   2   2
7   0   1   2   1   2   2   2   1   1   1   1   1   1   2   2   3   2
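As a minimal sketch of how the codes above can be traced back to their labels: pd.factorize also returns the unique values, in order of first appearance, as its second element. The three-row frame and the names encoded and lookup below are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the voting data
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
    ["democrat", "n", "y", "y"],
])

v = df.values

# codes: one integer per cell (row-major order); uniques: labels in order of first appearance
codes, uniques = pd.factorize(v.ravel())

# Rebuild a frame of codes with the original shape, index and columns
encoded = pd.DataFrame(codes.reshape(v.shape), index=df.index, columns=df.columns)

# Label -> code lookup, so the encoding can be inspected or reused
lookup = {label: code for code, label in enumerate(uniques)}

# Indexing uniques with the codes recovers the original labels
decoded = uniques[encoded.values]
```

Keeping uniques around means the encoding is reversible without storing a separate dictionary.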


Martin Evans · Accepted Answer · 2017-09-25 09:44:31Z

Use replace on the whole dataframe to apply the mappings. You could first pass a dictionary of known mappings for the values that need to stay consistent, then collect the remaining values in the dataset and map them to codes from, say, 100 upwards.

For example, the ? here is not in the dictionary, so it gets the value 100:

mappings = {'republican':1, 'democrat':2, 'y':3, 'n':4}
unknown = set(pd.unique(df.values.ravel())) - set(mappings.keys())
mappings.update([v, c] for c, v in enumerate(unknown, start=100))
df.replace(mappings, inplace=True)

Giving you:

   republican    n  n.1  n.2  n.3  n.4  n.5  n.6  n.7  n.8  n.9    ?  n.10  n.11  n.12  n.13  n.14
0           1    4    3    4    3    3    3    4    4    4    3  100     3     3     3     4     3
1           1    4    3    4    3    3    3    4    4    4    4    4     3     3     3     4   100
2           2  100    3    3  100    3    3    4    4    4    4    3     4     3     3     4     4
3           2    4    3    3    4  100    3    4    4    4    4    3     4     3     4     4     3
4           2    3    3    3    4    3    3    4    4    4    4    3   100     3     3     3     3
5           2    4    3    3    4    3    3    4    4    4    4    4     4     3     3     3     3
6           2    4    3    4    3    3    3    4    4    4    4    4     4   100     3     3     3
7           1    4    3    4    3    3    3    4    4    4    4    4     4     3     3   100     3

A more generalized version would be:

mappings = {v:c for c, v in enumerate(sorted(set(pd.unique(df.values.ravel()))), start=1)}
df.replace(mappings, inplace=True)
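The same sorted mapping can also be applied with a per-column dictionary lookup instead of replace. This is only a sketch under assumptions: the two-row frame is made up, and using Series.map through DataFrame.apply is a swapped-in alternative, not the method from the answer above.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the voting data
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
])

# Sort the distinct labels so the codes do not depend on row order
labels = sorted(pd.unique(df.values.ravel()))
mappings = {label: code for code, label in enumerate(labels, start=1)}

# Look every cell up in the dictionary, one column at a time;
# this returns a new frame and leaves df unchanged
encoded = df.apply(lambda col: col.map(mappings))
```

Note that Series.map turns any value missing from the dictionary into NaN, whereas replace leaves such values untouched. Here every value is in the dictionary, so the result is all integers.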

Thanks a lot, @Martin Evans. Can you please generalize your solution so that the values don't need to be hard-coded?

jezrael · Accepted Answer · 2017-09-25 08:36:08Z

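You can factorize the transposed values, so that the codes are assigned column by column rather than row by row: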

v = df.values

a, b = v.shape
f = pd.factorize(v.T.ravel())[0].reshape(b,a).T

df = pd.DataFrame(f)
print (df)
   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
0   0   2   4   2   4   4   4   2   2   2   4   3   4   4   4   2   4
1   0   2   4   2   4   4   4   2   2   2   2   2   4   4   4   2   3
2   1   3   4   4   3   4   4   2   2   2   2   4   2   4   4   2   2
3   1   2   4   4   2   3   4   2   2   2   2   4   2   4   2   2   4
4   1   4   4   4   2   4   4   2   2   2   2   4   3   4   4   4   4
5   1   2   4   4   2   4   4   2   2   2   2   2   2   4   4   4   4
6   1   2   4   2   4   4   4   2   2   2   2   2   2   3   4   4   4
7   0   2   4   2   4   4   4   2   2   2   2   2   2   4   4   3   4
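To illustrate what the transposes are doing, here is a small sketch showing that the same column-by-column numbering can be written with NumPy's order='F' argument. The three-row frame and the variable names are hypothetical; the order='F' form is an equivalent spelling, not code from the answer.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the voting data
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
    ["democrat", "n", "y", "y"],
])

v = df.values
a, b = v.shape

# Form used in the answer: flatten the transpose, reshape, transpose back
f_transpose = pd.factorize(v.T.ravel())[0].reshape(b, a).T

# Equivalent form: flatten and reshape in column-major (Fortran) order
f_order = pd.factorize(v.ravel(order="F"))[0].reshape(v.shape, order="F")

encoded = pd.DataFrame(f_transpose, index=df.index, columns=df.columns)
```

Because the first column is scanned in full before the second, the party labels receive the lowest codes (0 and 1), which matches the output above.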
