1

I have data like this:

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y

from a source file. I would like to convert every distinct value across the whole DataFrame into a numeric value in the most efficient way. In the example above, I would like to map republican -> 1, democrat -> 2, y -> 3, n -> 4 and ? -> 5 (or NULL).

I tried to use the following:

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

However, I'm not sure whether Pandas would be more efficient, or whether there is some better solution. (It should be generic, so that it works for data from any source.) Here is how I load the data into a DataFrame using Pandas:

import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset = pd.read_csv(file_path, header=None)

3 Answers

2
v = df.values

f = pd.factorize(v.ravel())[0].reshape(v.shape)

pd.DataFrame(f)

   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
0   0   1   2   1   2   2   2   1   1   1   2   3   2   2   2   1   2
1   0   1   2   1   2   2   2   1   1   1   1   1   2   2   2   1   3
2   4   3   2   2   3   2   2   1   1   1   1   2   1   2   2   1   1
3   4   1   2   2   1   3   2   1   1   1   1   2   1   2   1   1   2
4   4   2   2   2   1   2   2   1   1   1   1   2   3   2   2   2   2
5   4   1   2   2   1   2   2   1   1   1   1   1   1   2   2   2   2
6   4   1   2   1   2   2   2   1   1   1   1   1   1   3   2   2   2
7   0   1   2   1   2   2   2   1   1   1   1   1   1   2   2   3   2
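To make the step above easier to follow, here is a minimal sketch of the same technique that also keeps the second return value of pd.factorize, the array of distinct labels, so the codes can be translated back. The two-row df is a hypothetical stand-in for the voting data, and the names encoded, lookup and decoded are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the voting data
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
])

v = df.values

# factorize returns the integer codes and the distinct labels,
# numbered in order of first appearance in the flattened (row-major) array
codes, uniques = pd.factorize(v.ravel())
encoded = pd.DataFrame(codes.reshape(v.shape), index=df.index, columns=df.columns)

# uniques[i] is the original label for code i
lookup = dict(enumerate(uniques))

# Indexing the labels with the codes restores the original values
decoded = pd.DataFrame(np.asarray(uniques)[encoded.values],
                       index=df.index, columns=df.columns)

print(encoded)
print(lookup)
```

Because the codes follow order of first appearance, they depend on how the rows happen to be ordered; keeping uniques (or lookup) around is what makes the encoding reproducible and reversible.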

2

Use replace on the whole DataFrame to apply the mappings. You could first pass a dictionary of known mappings for the values that need to stay consistent, then collect the remaining values in the dataset and map those to codes from, say, 100 upwards.

For example, ? is not in the known mappings here, so it gets the value 100:

mappings = {'republican':1, 'democrat':2, 'y':3, 'n':4}
unknown = set(pd.unique(df.values.ravel())) - set(mappings.keys())
mappings.update([v, c] for c, v in enumerate(unknown, start=100))
df.replace(mappings, inplace=True)

Giving you:

   republican    n  n.1  n.2  n.3  n.4  n.5  n.6  n.7  n.8  n.9    ?  n.10  n.11  n.12  n.13  n.14
0           1    4    3    4    3    3    3    4    4    4    3  100     3     3     3     4     3
1           1    4    3    4    3    3    3    4    4    4    4    4     3     3     3     4   100
2           2  100    3    3  100    3    3    4    4    4    4    3     4     3     3     4     4
3           2    4    3    3    4  100    3    4    4    4    4    3     4     3     4     4     3
4           2    3    3    3    4    3    3    4    4    4    4    3   100     3     3     3     3
5           2    4    3    3    4    3    3    4    4    4    4    4     4     3     3     3     3
6           2    4    3    4    3    3    3    4    4    4    4    4     4   100     3     3     3
7           1    4    3    4    3    3    3    4    4    4    4    4     4     3     3   100     3

A more generalized version would be:

mappings = {v:c for c, v in enumerate(sorted(set(pd.unique(df.values.ravel()))), start=1)}
df.replace(mappings, inplace=True)
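The question also allows ? to become NULL instead of a code. One possible variant of the generalized mapping, sketched here under that assumption, leaves ? out of the dictionary and applies it with Series.map column by column, since map turns any value missing from the dictionary into NaN. The two-row df is a hypothetical stand-in for the voting data, and the final conversion to the nullable Int64 dtype is optional.

```python
import pandas as pd

# Hypothetical two-row stand-in for the voting data
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
])

# Distinct values except "?", numbered from 1 in sorted order
values = sorted(set(df.values.ravel()) - {"?"})
mappings = {v: c for c, v in enumerate(values, start=1)}

# Series.map yields NaN for anything not in the dictionary, so "?" becomes missing
encoded = df.apply(lambda col: col.map(mappings))

# Optional: nullable integer dtype keeps the codes as integers despite the gaps
encoded = encoded.astype("Int64")

print(mappings)
print(encoded)
```

Sorting the values before numbering them makes the codes independent of row order, at the cost of the codes changing if a new distinct value appears later.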

1 Comment

Thanks a lot, @Martin Evans. Can you please generalize your solution so that there is no need to hard-code the values?
1

You can factorize the values column by column, so that the codes are assigned in column order:

v = df.values

a, b = v.shape
f = pd.factorize(v.T.ravel())[0].reshape(b,a).T

df = pd.DataFrame(f)
print (df)
   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
0   0   2   4   2   4   4   4   2   2   2   4   3   4   4   4   2   4
1   0   2   4   2   4   4   4   2   2   2   2   2   4   4   4   2   3
2   1   3   4   4   3   4   4   2   2   2   2   4   2   4   4   2   2
3   1   2   4   4   2   3   4   2   2   2   2   4   2   4   2   2   4
4   1   4   4   4   2   4   4   2   2   2   2   4   3   4   4   4   4
5   1   2   4   4   2   4   4   2   2   2   2   2   2   4   4   4   4
6   1   2   4   2   4   4   4   2   2   2   2   2   2   3   4   4   4
7   0   2   4   2   4   4   4   2   2   2   2   2   2   4   4   3   4
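As a rough cross-check of the transpose approach above, the same column-by-column numbering can be sketched with NumPy's order="F" (column-major) option for both ravel and reshape. The two-row df is a hypothetical stand-in for the voting data, and by_column and by_transpose are illustrative names.

```python
import pandas as pd

# Hypothetical two-row stand-in for the voting data
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
])

v = df.values

# Flatten column by column, factorize, then fold back in the same order
codes = pd.factorize(v.ravel(order="F"))[0]
by_column = pd.DataFrame(codes.reshape(v.shape, order="F"))

# The transpose-based version from the answer, for comparison
a, b = v.shape
by_transpose = pd.DataFrame(pd.factorize(v.T.ravel())[0].reshape(b, a).T)

print(by_column)
print((by_column.values == by_transpose.values).all())
```

Both versions walk down each column before moving to the next, so the labels in the first column (the party names here) receive the lowest codes.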
