
I am working on a project using Learning to Rank. Below is an example of the dataset format (taken from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The first column is the rank, the second column is the query id, and the remaining columns are [feature number]:[feature value] pairs.

1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 … 46:0.00000

1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 … 46:0.000000

1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 … 46:0.000000

So far, I have successfully converted my data into the following format in a pandas DataFrame.

10  qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0 92.28260 ...
...

The first two columns are already fine. What I need next is to prepend the feature number to each of the remaining columns (e.g. the first feature, 3500, becomes 1:3500).

I know I can prepend a string to a column using the following command.

df['col'] = 'str' + df['col'].astype(str)

The first feature, 3500, is located at column index 2, so my idea is to prepend (column index - 1) to each column. How do I prepend a string based on the column number?

Any help would be appreciated.

3 Answers

1

I think you need DataFrame.radd to prepend the separator and the feature numbers on the left side of each value, and iloc to select from column index 2 to the end:

print (df)
    0                    1     2  3      4      5     6       7       8  \
0  10  qid:354714443278337  3500  1  122.0  156.0  13.0  1698.0  1840.0   
1  10  qid:354714443278337  3500  1  122.0  156.0  13.0  1698.0  1840.0   

         9  
0  92.2826  
1  92.2826  

df.iloc[:, 2:] = df.iloc[:, 2:].astype(str).radd(':').radd((df.columns[2:] - 1).astype(str))
print (df)
    0                    1       2    3        4        5       6         7  \
0  10  qid:354714443278337  1:3500  2:1  3:122.0  4:156.0  5:13.0  6:1698.0   
1  10  qid:354714443278337  1:3500  2:1  3:122.0  4:156.0  5:13.0  6:1698.0   

          8          9  
0  7:1840.0  8:92.2826  
1  7:1840.0  8:92.2826  
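
As an alternative to assigning back through iloc, here is a minimal sketch that builds the labelled columns as new string columns and then joins them to the first two. The names `demo`, `labelled`, and `result` are illustrative and not from the question, and the frame is a shortened copy of the sample row with default integer column labels.

```python
import pandas as pd

# Shortened, illustrative copy of the sample data (default integer column labels).
demo = pd.DataFrame([
    [10, "qid:354714443278337", 3500, 1, 122.0, 156.0],
    [10, "qid:354714443278337", 3500, 1, 122.0, 156.0],
])

# For every column from index 2 onward, prepend "<feature number>:" to the
# string form of its values. Feature numbers start at 1.
labelled = pd.DataFrame(
    {
        col: f"{i}:" + demo[col].astype(str)
        for i, col in enumerate(demo.columns[2:], start=1)
    },
    index=demo.index,
)

# Keep the rank and qid columns untouched and attach the labelled features.
result = pd.concat([demo.iloc[:, :2], labelled], axis=1)
print(result)
```

This keeps the original frame unchanged, which can make it easier to check the output before overwriting anything.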

4 Comments

Thanks! I tried it on a small dataset and this approach works. However, I am working with a large dataset and got a MemoryError. Is there a more efficient approach?
@DarrenChristopher - What is the size of the DataFrame?
It's 50K rows and 1.5K columns. The total data I have is around 440K rows, but since I frequently got a MemoryError, I decided to split it into chunks of about 50K rows, write each chunk to a file, and append them afterwards. If there is a more efficient approach, it would be helpful. If not, I guess I should try processing fewer than 50K rows per transformation.
@DarrenChristopher - Not so easy. A small modification would be df.iloc[:, 2:] = np.arange(1, len(df.columns)-1).astype(str) + (':' + df.iloc[:, 2:].astype(str)), but I am not sure whether it consumes less memory.
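
If the end goal is a text file in the LETOR layout, one option is to skip the intermediate string DataFrame and write each row directly. The following is a rough sketch under that assumption; `write_letor` and `demo` are hypothetical names, and whether this actually avoids the MemoryError on the 50K x 1.5K data has not been measured.

```python
import io
import pandas as pd

def write_letor(df, handle):
    """Write rows as 'rank qid 1:v1 2:v2 ...' without building string columns."""
    # name=None yields plain tuples, which also works for very wide frames.
    for row in df.itertuples(index=False, name=None):
        features = " ".join(
            f"{i}:{value}" for i, value in enumerate(row[2:], start=1)
        )
        handle.write(f"{row[0]} {row[1]} {features}\n")

# Illustrative data: rank, qid, then three feature columns.
demo = pd.DataFrame([
    [10, "qid:354714443278337", 3500, 1, 122.0],
    [10, "qid:354714443278337", 3500, 1, 122.0],
])

buffer = io.StringIO()  # stand-in for an open file handle
write_letor(demo, buffer)
print(buffer.getvalue())
```

With a real file, the same function could be called with a handle from `open(path, "a")`, so each batch of rows is appended as it is processed.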
0

You can simply concatenate the columns:

df['new_col'] = df[df.columns[3]].astype(str) + ':' + df[df.columns[2]].astype(str)

This creates a new column in your df named new_col. You can then delete the columns you no longer need.

2 Comments

Hi, thanks for the reply. The third column also needs to be transformed into a similar format (i.e. 2:1). In complete form, the row will be (10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0 7:1840.0 8:92.28260 ...)
The same can be achieved using the code mentioned above.
0

You can convert each string to a dictionary and then read the result back in as a pandas DataFrame.

import pandas as pd
import ast

df = pd.DataFrame({'rank': [1008, 1007, 1006], 'column':['qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000',\
                    'qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333',\
                    'qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000']} )

def putquotes(x):
    x1 = x.split(":")
    return "'" + x1[0] +"':" + x1[1]

def putcommas(x):
    x1 = x.split()
    return "{" + ",".join([putquotes(t) for t in x1]) + "}"

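# Turn each row string into a dict literal, parse it safely, and
# append the resulting columns to the original frame.
df1 = [ast.literal_eval(putcommas(x)) for x in df['column'].tolist()]
df = pd.concat([df, pd.DataFrame(df1)], axis=1)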
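
The same parsing idea can also be sketched without building and evaluating dict literals, by splitting each token once at the colon. The names `raw`, `parse_pairs`, and `parsed` are illustrative, and the sample strings are shortened versions of the ones above.

```python
import pandas as pd

# Shortened sample rows in "key:value key:value ..." form.
raw = pd.Series([
    "qid:10 1:0.004356 2:0.080000",
    "qid:10 1:0.004901 2:0.000000",
])

def parse_pairs(text):
    # Split on whitespace, then split each token at its first colon
    # to get a (key, value) pair.
    return dict(token.split(":", 1) for token in text.split())

# One dict per row becomes one DataFrame row; values start out as strings.
parsed = pd.DataFrame([parse_pairs(text) for text in raw])

# Convert every column from strings to numbers.
parsed = parsed.apply(pd.to_numeric)
print(parsed)
```

This assumes every token contains exactly one key and one value separated by a colon, as in the sample data.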
