I have my first serious question in Python.

I have a few nested lists that I need to convert to a pandas DataFrame. It seems easy, but three things make it challenging for me:

  • the lists are huge (so the code needs to be fast)
  • they are nested
  • when they are nested, I need combinations.

So having this input:

la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

I need the below as output

la      lb      lc
a       1       1
b       2       2
b       2       22
c       3       3
c       33      3
d       11      11
d       11      12
d       11      13
d       12      11
d       12      12
d       12      13
d       13      11
d       13      12
d       13      13
e       4       4

Note that I need all combinations (the Cartesian product of the lb and lc entries) whenever I have a nested list. At first I tried simply:

import pandas as pd
pd.DataFrame({'la' : [x for x in la],
              'lb' : [x for x in lb],
              'lc' : [x for x in lc]})

But looking for rows that need expanding and actually expanding (a huge) DataFrame seemed harder than tinkering around the way I create the DataFrame.

I looked at some great posts about itertools (Flattening a shallow list in Python), the documentation (https://docs.python.org/3.6/library/itertools.html) and generators (What does the "yield" keyword do?), and came up with something like this:

import itertools

def f(la, lb, lc):
    tmp = len(la) == len(lb) == len(lc)
    if tmp:
        for item in range(len(la)):
            len_b = len(lb[item])
            len_c = len(lc[item])
            if ((len_b>1) or (len_c>1)):
                yield list(itertools.product(la[item], lb[item], lc[item]))
                ## above: list is not the result I need,
                ##        without it it breaks (not an iterable)
            else:
                yield (la[item], lb[item], lc[item])
    else:
        print('error: unequal length')

which I test

my_gen = f(la, lb, lc)
pd.DataFrame.from_records(my_gen)

which... well... breaks when I yield the itertools.product object directly (it has no length), and creates the wrong data structure once I wrap it in a list.

My questions are as follows:

  • how can I fix that issue with yielding itertools?
  • is this efficient? In the real application I will be creating the lists by parsing a file and they will be huge... Any performance tips or better solutions from more advanced colleagues? Right now it breaks/misbehaves so I can't even benchmark...
  • would it make sense to generate the lists element by element and then use my f function?

Thank you in advance!

2 Answers

I have a solution:

import pandas as pd
from functools import reduce  # reduce is not a builtin in Python 3
from itertools import product

la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

list_product = reduce(lambda x, y: x + y, [list(product(*_)) for _ in zip(la,lb,lc)])
df = pd.DataFrame(list_product, columns=["la", "lb", "lc"])
print(df)

result:

    la  lb  lc
0   a   1   1
1   b   2   2
2   b   2   22
3   c   3   3
4   c   33  3
5   d   11  11
6   d   11  12
7   d   11  13
8   d   12  11
9   d   12  12
10  d   12  13
11  d   13  11
12  d   13  12
13  d   13  13
14  e   4   4
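
Regarding the generator in the question, here is a minimal sketch of one way it might be repaired. It is not benchmarked, and the name expand_rows is made up for illustration. The idea is to use `yield from` so that each tuple produced by itertools.product becomes its own row, and to wrap the label in a one-element list so that product() does not iterate over its characters.

```python
from itertools import product

la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]


def expand_rows(la, lb, lc):
    """Lazily yield one (la, lb, lc) tuple per combination.

    expand_rows is a hypothetical helper name, not part of pandas
    or itertools.
    """
    if not (len(la) == len(lb) == len(lc)):
        raise ValueError('la, lb and lc must have equal length')
    for a, bs, cs in zip(la, lb, lc):
        # [a] keeps the label as a single item, so multi-character
        # labels such as 'ab' are not split into 'a' and 'b'.
        yield from product([a], bs, cs)


rows = list(expand_rows(la, lb, lc))
print(rows[:3])  # [('a', 1, 1), ('b', 2, 2), ('b', 2, 22)]
```

The resulting rows can then be passed to pd.DataFrame(rows, columns=["la", "lb", "lc"]). Because nothing is concatenated step by step, this should avoid the repeated list copying that reduce(lambda x, y: x + y, ...) does, which is likely to matter for huge inputs.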

1 Comment

Ingenious! Thank you!

It's not an abstract solution, but it does get the results you are looking for. I look forward to seeing a more pandas-centric answer to this problem, but offer this up in the meantime.

import pandas as pd
la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

l1 = []
l2 = []
l3 = []

l1Temp = []
l2Temp = []
l3Temp = []

for i, listInt in enumerate(lb):
    if isinstance(listInt, list):  # type(listInt == list) was always truthy
        for j, item in enumerate(listInt):
            # print('%s - %s' % (lb[i], lc[i][j]))
            l1Temp.append(la[i])
            l2Temp.append(lb[i][j])
            l3Temp.append(lc[i])
            # print('%s - %s' % (l1[i], l2[i]))
    else:
        l1Temp.append(la[i])
        l2Temp.append(lb[i])
        l3Temp.append(lc[i])
        # print('%s - %s' % (lb[i], lc[i]))

for i, listInt in enumerate(l3Temp):
    if isinstance(listInt, list):  # type(listInt == list) was always truthy
        for j, item in enumerate(listInt):
            l1.append(l1Temp[i])
            l2.append(l2Temp[i])
            l3.append(l3Temp[i][j])
    else:
        l1.append(l1Temp[i])
        l2.append(l2Temp[i])
        l3.append(l3Temp[i])

for i, item in enumerate(l3):
    print('%s - %s - %s' % (l1[i], l2[i], l3[i]))

df = pd.DataFrame({'la':[x for x in l1],
    'lb':[x for x in l2],
    'lc': [x for x in l3]})
print(df)
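
For a more pandas-centric variant, the following is a rough sketch that assumes pandas 0.25 or later, where DataFrame.explode is available. Exploding lb and then lc row by row yields the same per-row combinations as the loops above. The index is reset between the two calls as a precaution against duplicate index labels, and the final column selection and integer cast are choices made for this example, not something the question requires.

```python
import pandas as pd

la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]

# Start from one row per label, with list-valued cells.
df = pd.DataFrame({'la': la, 'lb': lb, 'lc': lc})

# Explode one column at a time; each explode repeats the other
# columns, so two passes give every (lb, lc) pair for a label.
df = df.explode('lb').reset_index(drop=True)
df = df.explode('lc').reset_index(drop=True)

# explode leaves object dtype behind, so cast back to integers
# (assumes every value in lb and lc is an int, as in the example).
df = df[['la', 'lb', 'lc']].astype({'lb': int, 'lc': int})
print(df)
```

Whether this is faster than building the rows with itertools first has not been measured here, so it is worth benchmarking both on realistic data.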
