I am trying to create a pandas DataFrame from two lists, and the output is wrong for certain list lengths. (This is not caused by the two lists having different lengths.)

Here are two cases, one that works as expected and one that doesn't (commented out):

import string
from pandas import DataFrame
d = dict.fromkeys(string.ascii_lowercase, 0).keys()
groups = sorted(d)[:3]
numList = range(0,4)
# groups = sorted(d)[:20]
# numList = range(0,25)

df = DataFrame({'Number':sorted(numList)*len(groups), 'Group':sorted(groups)*len(numList)})

df.sort_values(['Group', 'Number'])

Expected output: every item in groups paired with every item in numList

  Group Number 
    a   0
    a   1
    a   2
    a   3
    b   0
    b   1
    b   2
    b   3
    c   0
    c   1
    c   2
    c   3

Actual results: this works when the lists have sizes 3 and 4, but not when they have sizes 20 and 25 (the case commented out in the code above).

Why is that, and how can I fix it?

  • Try print(df) for both cases (lists of sizes 3 and 4, and lists of sizes 20 and 25) before calling df.sort_values(['Group', 'Number']) and compare the results. That should help you find the root cause of the problem. Commented May 6, 2019 at 8:57

1 Answer

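If I understand correctly, you want a DataFrame that contains every pair of a group and a number. That operation is called a Cartesian product. Your approach happens to work when the two lengths are coprime, which is always the case when they differ by exactly 1 (as with 3 and 4), but that is more or less an accident. With lengths 20 and 25 the two repeated lists fall back into step every 100 rows, so only 100 of the 500 possible pairs appear, each of them 5 times. For the general case, do this: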

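# Give both frames the same constant key so the merge pairs every row with every row.
df1 = DataFrame({'Number': sorted(numList)})
df2 = DataFrame({'Group': sorted(groups)})
# drop(columns=...) is used because pandas 2.0 no longer accepts the positional axis argument.
df = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop(columns='key')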
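
To see why repeating both lists breaks down, it can help to count the distinct pairs that approach produces. The sketch below is only an illustration: the helper tiled_pairs and the variable numbers are names made up here, and itertools.product is swapped in as an alternative way to build the product, not as the method used in the merge above.

```python
from itertools import product
from math import gcd
import string

import pandas as pd


def tiled_pairs(groups, numbers):
    """Pair the lists as in the question: repeat both to equal length, then zip."""
    tiled_numbers = sorted(numbers) * len(groups)
    tiled_groups = sorted(groups) * len(numbers)
    return list(zip(tiled_groups, tiled_numbers))


groups = sorted(string.ascii_lowercase)[:20]
numbers = list(range(25))

pairs = tiled_pairs(groups, numbers)
# There are 500 rows, but the pairing repeats every lcm(20, 25) = 100 rows.
distinct = len(set(pairs))
expected_distinct = len(groups) * len(numbers) // gcd(len(groups), len(numbers))

# itertools.product yields every (group, number) pair exactly once,
# ordered by group first and by number second.
df = pd.DataFrame(list(product(groups, numbers)), columns=["Group", "Number"])
```

With lengths 3 and 4 the same helper returns all 12 pairs, which is why the small case looked correct.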

One note about sorting DataFrames: in pandas, most DataFrame operations return a new DataFrame by default and leave the original unchanged, unless you pass inplace=True. So you should write

df = df.sort_values(['Group', 'Number'])

or

df.sort_values(['Group', 'Number'], inplace=True)

and it should work now.
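
As a small illustration of the sorting point, the following sketch uses a made-up four-row frame (not the data from the question) to show that sort_values leaves the original frame untouched unless the result is assigned back.

```python
import pandas as pd

# Rows are deliberately out of order so the effect of sorting is visible.
df = pd.DataFrame({"Group": ["b", "a", "b", "a"], "Number": [1, 1, 0, 0]})

# sort_values returns a new, sorted DataFrame.
sorted_df = df.sort_values(["Group", "Number"])

# The original keeps its row order because the result was not assigned to df.
unchanged_groups = df["Group"].tolist()
sorted_groups = sorted_df["Group"].tolist()
sorted_numbers = sorted_df["Number"].tolist()
```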

1 Comment

Your answer works, but I found another way by trial and error: I sorted numList*len(groups) as a whole and it worked perfectly well. I am not sure why that is the case, though. df = DataFrame({'Number':sorted(numList*len(groups)), 'Group':sorted(groups)*len(numList)}) Any ideas? @Matěj Račinský
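
A likely explanation for the variant in this comment, sketched with plain lists: sorting the repeated numbers gathers equal values into consecutive blocks, while the groups keep cycling, so each block of equal numbers meets every group exactly once. The names blocked_numbers and cycled_groups are illustrative only, and numbers is built with list(range(4)) on the assumption that Python 3 is in use, where a range object cannot be multiplied directly.

```python
groups = ["a", "b", "c"]
numbers = list(range(4))  # a list, so that `numbers * n` also works on Python 3

# Sorting after repeating turns the cycle 0,1,2,3,0,1,... into blocks 0,0,0,1,1,1,...
blocked_numbers = sorted(numbers * len(groups))
# The groups still cycle a,b,c,a,b,c,...
cycled_groups = sorted(groups) * len(numbers)

# Each block of equal numbers lines up with one full cycle of the groups.
pairs = list(zip(cycled_groups, blocked_numbers))
```

Because one column is blocked and the other cycles, this pattern should cover every pair for any two lengths, not only coprime ones.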
