2

I have a CSV file dataset that contains 21 columns, the first 10 columns are numbers and I don't want to change them. The next 10 columns are binary data and contain only 1 and 0 in it, one "1" and the others are "0", and the last column is the given label.

the example data looks like below

2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)

Suppose I load the data into a matrix, can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into one column? After transformation, I want the column value to be based on the index of the "1" in the row, like the row above, the wanted result is

2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is on 6th column of the binary data),2 #12 columns in total

Can I achieve this using NumPy, scikit-learn or something else?

4 Answers 4

2

This should do it if it is loaded into a numpy array

out = np.c_[in[:, :11], np.where(in[:, 11:-1])[1] + 1, in[:, -1]]

Sign up to request clarification or add additional context in comments.

2 Comments

I use this on the data set and I get "ValueError: all the input arrays must have same number of dimensions", after that I create a small numpy array like "b=array([[11, 22, 0, 1, 0, 2], [22, 33, 1, 0, 0, 1]])" and use "out = np.r_[b[:,:2],np.where(b[:,2:-1])[1]+1,b[:,-1]]" I also get this error. Did I wrongly set the parameters? Thank you.
Woops, should have used np.c_ not np.r_. Fixed
1
from io import StringIO

import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")

df = pd.read_csv(csv, header=None)

df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)

print(df)

Output:

     0   1   2    3   4    5    6    7    8     9   10  0   21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1     1   2   3    4   5    6    7    8    9    10  11   5   1

Comments

0

Data:

In [135]: df
Out[135]:
     0   1   2    3   4    5    6    7    8     9  ...  12  13  14  15  16  17  18  19  20  21
0  2596  51   3  258   0  510  221  232  148  6279 ...   0   0   0   0   1   0   0   0   0   2
1  2596  51   3  258   0  510  221  232  148  6279 ...   0   0   0   0   0   0   0   0   1   2

[2 rows x 22 columns]

Solution:

df = pd.read_csv('/path/to/file.csv', header=None)

In [137]: df.iloc[:, :11] \
            .join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
            .join(df.iloc[:, -1])
Out[137]:
     0   1   2    3   4    5    6    7    8     9   10  11  21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1  2596  51   3  258   0  510  221  232  148  6279  24  10   2

Comments

0

Setup

df = pd.DataFrame({0: {2596: 51},
 1: {2596: 3},
 2: {2596: 258},
 3: {2596: 0},
 4: {2596: 510},
 5: {2596: 221},
 6: {2596: 232},
 7: {2596: 148},
 8: {2596: 6279},
 9: {2596: 24},
 10: {2596: 0},
 11: {2596: 0},
 12: {2596: 0},
 13: {2596: 0},
 14: {2596: 0},
 15: {2596: 1},
 16: {2596: 0},
 17: {2596: 0},
 18: {2596: 0},
 19: {2596: 0},
 20: {2596: 2}})

Solution

#find the index of the column with value 1 within the 10 columns
df.iloc[:,10] = np.argmax(df.iloc[:,10:20].values,axis=1)+1

#select the first 10 columns, the position column and the label column
df.iloc[:,list(range(11))+[20]]

Out[2167]: 
      0   1    2   3    4    5    6    7     8   9   10  20
2596  51   3  258   0  510  221  232  148  6279  24   6   2

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.