
I have the following dataframe:

+-------------------------------------------+----------------------------------------+----------------+----------------------------------+
|                  Lookup                   |             LookUp Value 1             | LookUp Value 2 |          LookUp Value 3          |
+-------------------------------------------+----------------------------------------+----------------+----------------------------------+
| 300000,50000,500000,100000,1000000,200000 | -1820,-1820,-1820,-1820,-1820,-1820    |    1,1,1,1,1,1 |    1820,1820,1820,1820,1820,1820 |
| 100000,1000000,200000,300000,50000,500000 | -1360,-28760,-1360,-28760,-1360,-28760 |    2,3,2,3,2,3 | 4120,31520,4120,31520,4120,31520 |
+-------------------------------------------+----------------------------------------+----------------+----------------------------------+

Each cell holds a comma-separated list. The first column contains the lookup keys and the remaining columns contain the corresponding lookup values. I would like to generate a dataframe like this:

+--------------------+--------------------+--------------------+
| Lookup_300K_Value1 | Lookup_300K_Value2 | Lookup_300K_Value3 |
+--------------------+--------------------+--------------------+
|              -1820 |                  1 |               1820 |
|             -28760 |                  3 |              31520 |
+--------------------+--------------------+--------------------+

I already have a solution that uses pandas.apply to process the data row by row, but it is very slow. Is there a faster way to do this? Thank you very much.

EDIT: I added the dataframe generation code below

import pandas as pd

d = {'Lookup_Key': ['300000,50000,500000,100000,1000000,200000', '100000,1000000,200000,300000,50000,500000'],
     'LookUp_Value_1': ['-1820,-1820,-1820,-1820,-1820,-1820', '-1360,-28760,-1360,-28760,-1360,-28760'],
     'LookUp_Value_2': ['1,1,1,1,1,1', '2,3,2,3,2,3'],
     'LookUp_Value_3': ['1820,1820,1820,1820,1820,1820', '4120,31520,4120,31520,4120,31520']}
df = pd.DataFrame(data=d)
  • 1) Post your current solution anyway. 2) If it's working, it may be better to move your question to the Code Review site. Commented Oct 31, 2019 at 9:31
  • Can you change the Lookup column so that it matches the expected output? Commented Oct 31, 2019 at 9:31
  • @jezreal Even hard-coding would do. But I am using apply on a row-by-row basis, and the performance is extremely slow when I have a lot of data. Commented Oct 31, 2019 at 10:55
  • Have you tried using map()? Commented Oct 31, 2019 at 11:58
  • Nope. I guess the groupby solution below is close to what I want, but I failed to apply the code. Commented Nov 1, 2019 at 6:13

2 Answers


This solution was tested with missing values in some of the value columns, but it assumes that Lookup contains no NaNs or Nones:

df = pd.concat([df[x].str.split(',', expand=True).stack() for x in df.columns], axis=1, keys=df.columns)
df = df.reset_index(level=1, drop=True).set_index('Lookup', append=True).unstack().sort_index(axis=1, level=1)
df.columns = [f'{b}_{a}' for a, b in df.columns]

The idea is to split each column's values in a loop, explode each resulting Series, concatenate them, and finally reshape with unstack (Series.explode requires pandas 0.25 or later):

df = pd.concat([df[x].str.split(',').explode() for x in df.columns], axis=1)
df = df.set_index('Lookup', append=True).unstack().sort_index(axis=1, level=1)
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
  100000_LookUp Value 1 100000_LookUp Value 2 100000_LookUp Value 3  \
0                 -1820                     1                  1820   
1                 -1360                     2                  4120   

  1000000_LookUp Value 1 1000000_LookUp Value 2 1000000_LookUp Value 3  \
0                  -1820                      1                   1820   
1                 -28760                      3                  31520   

  200000_LookUp Value 1 200000_LookUp Value 2 200000_LookUp Value 3  \
0                 -1820                     1                  1820   
1                 -1360                     2                  4120   

  300000_LookUp Value 1 300000_LookUp Value 2 300000_LookUp Value 3  \
0                 -1820                     1                  1820   
1                -28760                     3                 31520   

  50000_LookUp Value 1 50000_LookUp Value 2 50000_LookUp Value 3  \
0                -1820                    1                 1820   
1                -1360                    2                 4120   

  500000_LookUp Value 1 500000_LookUp Value 2 500000_LookUp Value 3  
0                 -1820                     1                  1820  
1                -28760                     3                 31520  
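To get from this wide result to the three columns asked for in the question, one option is to filter on the key prefix and convert the strings to numbers. This is a minimal sketch, assuming the column names shown in the question's table ('Lookup', 'LookUp Value 1', ...) rather than those in the generation code. The names long, wide and wanted are illustrative only.

```python
import pandas as pd

# Sample data from the question, using the column names shown in its table.
df = pd.DataFrame({
    'Lookup': ['300000,50000,500000,100000,1000000,200000',
               '100000,1000000,200000,300000,50000,500000'],
    'LookUp Value 1': ['-1820,-1820,-1820,-1820,-1820,-1820',
                       '-1360,-28760,-1360,-28760,-1360,-28760'],
    'LookUp Value 2': ['1,1,1,1,1,1', '2,3,2,3,2,3'],
    'LookUp Value 3': ['1820,1820,1820,1820,1820,1820',
                       '4120,31520,4120,31520,4120,31520'],
})

# Split every column and explode it to one row per list element.
long = pd.concat([df[c].str.split(',').explode() for c in df.columns], axis=1)

# Pivot the lookup keys into columns, then flatten the column MultiIndex.
wide = long.set_index('Lookup', append=True).unstack().sort_index(axis=1, level=1)
wide.columns = [f'{key}_{col}' for col, key in wide.columns]

# Keep only the columns for one key and convert the strings to numbers.
wanted = wide.filter(regex=r'^300000_').apply(pd.to_numeric)

# Rename to the column names used in the question's expected output.
wanted.columns = [f'Lookup_300K_Value{i}' for i in range(1, 4)]
print(wanted)
```

Filtering after the reshape keeps the sketch close to the code above. If only one key is ever needed, selecting the matching rows of long before unstacking would avoid building the full wide frame.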

1 Comment

Thank you very, very much. I learned a lot from looking at the procedures.

At the core, you can use groupby very well to achieve your goal:

grouped = df.groupby("Lookup")

This is now a dict-like object that has the values you want for every Lookup value in separate dataframes. The question now is how we get it back together again, and here I have to resort to a quite hacky method. I'm sure there are better ones, but this one does produce a nice result.

dflist = []
keylist = []
basecols = df.columns[1:]

# Iterate over (key, sub-dataframe) pairs; a separate loop name avoids shadowing df.
for key, group in grouped:
    keylist.append(key)
    dflist.append(group[basecols].reset_index(drop=True))

result = pd.concat(dflist, axis=1)
resultcolumns = pd.MultiIndex.from_product([keylist, basecols])
result.columns = resultcolumns

This produces a MultiIndexed DataFrame with the result you described.

Output:

>> result
   50000                 100000                200000                300000                500000                1000000
   Value1 Value2 Value3  Value1 Value2 Value3  Value1 Value2 Value3  Value1 Value2 Value3  Value1 Value2 Value3  Value1 Value2 Value3
0   -1820      1   1820   -1820      1   1820   -1820      1   1820   -1820      1   1820   -1820      1   1820   -1820      1   1820
1   -1360      2   4120   -1360      2   4120   -1360      2   4120  -28760      3  31520  -28760      3  31520  -28760      3  31520
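As posted, the groupby call expects one row per lookup key with numeric values, while the question's data stores comma-separated strings. Below is a minimal sketch of one possible preprocessing step followed by the same groupby idea. It assumes the column names from the question's table, and that every key occurs exactly once in each original row (otherwise the rows would not line up after reset_index). The name long is hypothetical.

```python
import pandas as pd

# Sample data from the question, using the column names shown in its table.
df = pd.DataFrame({
    'Lookup': ['300000,50000,500000,100000,1000000,200000',
               '100000,1000000,200000,300000,50000,500000'],
    'LookUp Value 1': ['-1820,-1820,-1820,-1820,-1820,-1820',
                       '-1360,-28760,-1360,-28760,-1360,-28760'],
    'LookUp Value 2': ['1,1,1,1,1,1', '2,3,2,3,2,3'],
    'LookUp Value 3': ['1820,1820,1820,1820,1820,1820',
                       '4120,31520,4120,31520,4120,31520'],
})

# Preprocessing: one row per list element, with strings converted to numbers.
long = pd.concat([df[c].str.split(',').explode() for c in df.columns], axis=1)
long = long.apply(pd.to_numeric)

basecols = long.columns[1:]
keylist, dflist = [], []

# Each group holds the rows for one lookup key, in original row order.
for key, group in long.groupby('Lookup'):
    keylist.append(key)
    dflist.append(group[basecols].reset_index(drop=True))

# Put the per-key frames side by side under a (key, value column) MultiIndex.
result = pd.concat(dflist, axis=1)
result.columns = pd.MultiIndex.from_product([keylist, basecols])
print(result[300000])
```

Because the keys are converted to integers first, groupby sorts them numerically, which gives the 50000 to 1000000 column order shown in the output above.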

2 Comments

Yeah I guess the overall idea is valid but the original lookup column is a string list. More preprocessing is needed.
There is no way I can smell the dtype if you just post numbers. You have to specify such things.
