Append string of column index to DataFrame columns

Question

I am working on a project using Learning to Rank. Below is the example dataset format (taken from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The first column is the rank, the second column is the query id, and the remaining columns are [feature number]:[feature value] pairs.

1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 … 46:0.00000

1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 … 46:0.000000

1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 … 46:0.000000

So far, I have successfully converted my data into the following format in a pandas DataFrame.

10  qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0 92.28260 ...
...

The first two columns are already fine. What I need next is to add the feature number as a prefix to each of the remaining columns (e.g. the first feature, 3500, becomes 1:3500).

I know I can add a string in front of a column's values with the following command.

df['col'] = 'str' + df['col'].astype(str)

The first feature, 3500, is located at column index 2, so my idea is to use (column index - 1) as the prefix for each column. How do I add a string that depends on the column number?

Any help would be appreciated.

jezrael · Accepted Answer · 2018-04-25 05:16:15Z

1

I think you need DataFrame.radd (reversed addition, so the added string ends up on the left) to add the column numbers, and iloc to select from column index 2 to the end:

print (df)
    0                    1     2  3      4      5     6       7       8  \
0  10  qid:354714443278337  3500  1  122.0  156.0  13.0  1698.0  1840.0   
1  10  qid:354714443278337  3500  1  122.0  156.0  13.0  1698.0  1840.0   

         9  
0  92.2826  
1  92.2826  

df.iloc[:, 2:] = df.iloc[:, 2:].astype(str).radd(':').radd((df.columns[2:] - 1).astype(str))
print (df)
    0                    1       2    3        4        5       6         7  \
0  10  qid:354714443278337  1:3500  2:1  3:122.0  4:156.0  5:13.0  6:1698.0   
1  10  qid:354714443278337  1:3500  2:1  3:122.0  4:156.0  5:13.0  6:1698.0   

          8          9  
0  7:1840.0  8:92.2826  
1  7:1840.0  8:92.2826
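
For reference, here is the same idea as a self-contained sketch. It assumes a trimmed version of the sample frame and numbers the features by position (1, 2, 3, ...) instead of deriving them from the column labels, so it does not rely on the labels being integers. It also builds a new frame rather than assigning back through iloc, since writing strings into numeric columns may trigger dtype warnings in newer pandas versions. The names features, prefixes, tagged and result are only illustrative.

```python
import pandas as pd

# Trimmed sample frame (assumption: same layout as in the question).
df = pd.DataFrame(
    [
        [10, "qid:354714443278337", 3500, 1, 122.0, 156.0],
        [10, "qid:354714443278337", 3500, 1, 122.0, 156.0],
    ]
)

features = df.iloc[:, 2:]

# Feature numbers start at 1 and follow column position, not column labels.
prefixes = [f"{i}:" for i in range(1, features.shape[1] + 1)]

# radd computes other + frame, so each prefix lands in front of its column.
tagged = features.astype(str).radd(prefixes)

# Put the untouched first two columns back in front.
result = pd.concat([df.iloc[:, :2], tagged], axis=1)
print(result)
```

The list passed to radd must have one entry per feature column; pandas matches it to the columns by position.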


4 Comments

Darren Christopher Over a year ago

Thanks! I tried this on a small dataset and the approach works. However, I am working with a large dataset and got a MemoryError. Is there a more efficient approach?

jezrael Over a year ago

@DarrenChristopher - What is the size of the DataFrame?

Darren Christopher Over a year ago

It's 50K rows and 1.5K columns. The total data I have is actually around 440K rows, but since I frequently got MemoryError, I decided to split it into chunks of about 50K rows, write each chunk to a file, and append them afterwards. If there is a more efficient approach, it would be helpful. If not, I guess I should try splitting it into fewer than 50K rows per transformation.

jezrael Over a year ago

@DarrenChristopher - Not so easy. A small modification would be df.iloc[:, 2:] = np.arange(1, len(df.columns)-1).astype(str) + (':' + df.iloc[:, 2:].astype(str)), but I am not sure whether it consumes less memory.
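
One way to follow up on the chunking idea from this thread is to convert and write a slice of rows at a time, so that only one chunk of strings exists in memory at once. The sketch below is an assumption-laden illustration, not a measured fix: write_letor and chunk_rows are hypothetical names, the frame is a toy stand-in for the real data, and whether it avoids the MemoryError depends on the data and the machine.

```python
import io

import pandas as pd


def write_letor(df, handle, chunk_rows=10_000):
    """Write df to handle as space-separated text, chunk_rows rows at a time."""
    n_features = df.shape[1] - 2
    # Feature numbers start at 1 and follow column position.
    prefixes = [f"{i}:" for i in range(1, n_features + 1)]
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        # Only this chunk is converted to strings and prefixed.
        tagged = chunk.iloc[:, 2:].astype(str).radd(prefixes)
        out = pd.concat([chunk.iloc[:, :2], tagged], axis=1)
        out.to_csv(handle, sep=" ", header=False, index=False)


# Toy frame standing in for the real data (assumption).
df = pd.DataFrame([[10, "qid:354714443278337", 3500, 1, 122.0]] * 5)

buffer = io.StringIO()
write_letor(df, buffer, chunk_rows=2)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

In practice the handle would be a file opened for writing; a StringIO is used here only to keep the example self-contained.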

Arpit Solanki · Accepted Answer · 2018-04-25 05:13:48Z

0

You can simply concatenate the columns:

df['new_col'] = df[df.columns[3]].astype(str) + ':' + df[df.columns[2]].astype(str)

This will create a new column in your df named new_col. You can then delete the columns you no longer need.


2 Comments

Darren Christopher Over a year ago

Hi, thanks for the reply. The third column also needs to be transformed into a similar format (i.e. 2:1). In complete form, the row will be: 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0 7:1840.0 8:92.28260 ...

Arpit Solanki Over a year ago

The same can be achieved using the code above.

Aritesh · Accepted Answer · 2018-04-25 05:34:12Z

0

You can convert the string to a dictionary and then read it back as a pandas DataFrame.

import pandas as pd
import ast

df = pd.DataFrame({'rank': [1008, 1007, 1006], 'column':['qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000',\
                    'qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333',\
                    'qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000']} )

def putquotes(x):
    x1 = x.split(":")
    return "'" + x1[0] +"':" + x1[1]

def putcommas(x):
    x1 = x.split()
    return "{" + ",".join([putquotes(t) for t in x1]) + "}"

# Parse each row into a dict, then attach the parsed fields as new columns.
df1 = [ast.literal_eval(putcommas(x)) for x in df['column'].tolist()]
df = pd.concat([df, pd.DataFrame(df1)], axis=1)
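
The same parsing step can also be sketched without building a literal string for ast.literal_eval, by splitting each token on the first colon. This is only an illustrative variant: parse_pairs and rows are hypothetical names, and every value (including qid) is converted to float here, which may not be what you want for the query id.

```python
import pandas as pd

# Sample rows in "key:value key:value ..." form (taken from the example above).
rows = [
    "qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000",
    "qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333",
]


def parse_pairs(text):
    """Turn 'key:value key:value ...' into a dict mapping key to float value."""
    pairs = (token.split(":", 1) for token in text.split())
    return {key: float(value) for key, value in pairs}


# One dict per row becomes one DataFrame row; keys become column names.
parsed = pd.DataFrame([parse_pairs(r) for r in rows])
print(parsed)
```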
