
I consulted a lot of the posts on `ValueError: cannot reindex from a duplicate axis` ("What does `ValueError: cannot reindex from a duplicate axis` mean?" and other related posts). I understand that the error can arise with duplicate row indices or column names, but I still can't figure out exactly what is raising the error in my case.

Below is my best attempt at reproducing the spirit of the dataframe; it does throw the error.

import pandas as pd

d = {"id": [1,2,3,4,5],
     "cata": [['aaa1','bbb2','ccc3'],['aaa4','bbb5','ccc6'],['aaa7','bbb8','ccc9'],['aaa10','bbb11','ccc12'],['aaa13','bbb14','ccc15']],
     "catb": [['ddd1','eee2','fff3','ggg4'],['ddd5','eee6','fff7','ggg8'],['ddd9','eee10','fff11','ggg12'],['ddd13','eee14','fff15','ggg16'],['ddd17','eee18','fff19','ggg20']],
     "catc": [['hhh1','iii2','jjj3', 'kkk4', 'lll5'],['hhh6','iii7','jjj8', 'kkk9', 'lll10'],['hhh11','iii12','jjj13', 'kkk14', 'lll15'],['hhh16','iii17','jjj18', 'kkk18', 'lll19'],['hhh20','iii21','jjj22', 'kkk23', 'lll24']]}

df = pd.DataFrame(d)

df.head()

    id  cata    catb    catc
0   1   [aaa1, bbb2, ccc3]  [ddd1, eee2, fff3, ggg4]    [hhh1, iii2, jjj3, kkk4, lll5]
1   2   [aaa4, bbb5, ccc6]  [ddd5, eee6, fff7, ggg8]    [hhh6, iii7, jjj8, kkk9, lll10]
2   3   [aaa7, bbb8, ccc9]  [ddd9, eee10, fff11, ggg12]     [hhh11, iii12, jjj13, kkk14, lll15]
3   4   [aaa10, bbb11, ccc12]   [ddd13, eee14, fff15, ggg16]    [hhh16, iii17, jjj18, kkk18, lll19]
4   5   [aaa13, bbb14, ccc15]   [ddd17, eee18, fff19, ggg20]    [hhh20, iii21, jjj22, kkk23, lll24]

df.set_index(['id']).apply(pd.Series.explode).reset_index()

Here is the error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-63-17e7c29b180c> in <module>()
----> 1 df.set_index(['id']).apply(pd.Series.explode).reset_index()

14 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

The dataset I'm using is a few hundred MB and it's a pain (lots of lists inside lists), but the example above is a fair representation of where I'm stuck. Even when I generate a fake dataframe with unique values, I still don't understand why I'm getting the ValueError.

I have explored other ways to explode the lists, such as df.apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1), which doesn't throw a ValueError. However, it's definitely not as fast, and I'd probably have to reconsider how I'm processing the df. Still, I want to understand why I'm getting this ValueError when I don't have any obvious duplicate values.

Thanks!!!!

Adding the desired output below, which I generated by chaining apply/stack and dropping levels.

    id  cata    catb    catc
0   1   aaa1    ddd1    hhh1
1   1   bbb2    eee2    iii2
2   1   ccc3    fff3    jjj3
3   1   NaN     ggg4    kkk4
4   1   NaN     NaN     lll5
5   2   aaa4    ddd5    hhh6
6   2   bbb5    eee6    iii7
7   2   ccc6    fff7    jjj8
8   2   NaN     ggg8    kkk9
9   2   NaN     NaN     lll10
10  3   aaa7    ddd9    hhh11
11  3   bbb8    eee10   iii12
12  3   ccc9    fff11   jjj13
13  3   NaN     ggg12   kkk14
14  3   NaN     NaN     lll15
15  4   aaa10   ddd13   hhh16
16  4   bbb11   eee14   iii17
17  4   ccc12   fff15   jjj18
18  4   NaN     ggg16   kkk18
19  4   NaN     NaN     lll19
20  5   aaa13   ddd17   hhh20
21  5   bbb14   eee18   iii21
22  5   ccc15   fff19   jjj22
23  5   NaN     ggg20   kkk23
24  5   NaN     NaN     lll24
  • It is possible that the error is triggered because you have varying list lengths across the columns; some lists are of length 3 or 4. I'd like to think that's where the duplicate index error stems from. Commented Jun 3, 2020 at 3:25
  • Nice. It takes the data down a wide path though, instead of long form. Commented Jun 3, 2020 at 3:29
  • What does your expected output look like? Commented Jun 3, 2020 at 3:30
  • sammywemmy - yeah, I think you're right about the unbalanced lists. Once the lists were balanced, the function ran error-free. I need to dig into .explode further, as I had incorrectly assumed it was a faster way to do the stack/dropping levels. Alternatively, I need to seriously reconsider how I'm cleaning the data in the first place. I'm going to chew on this for a few more days... Commented Jun 3, 2020 at 13:03
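
To make the point in the comments above concrete, here is a minimal sketch of why unbalanced lists trip up apply(pd.Series.explode). It uses a smaller, made-up frame than the one in the question. Each column explodes to a different length, and every exploded column carries a repeated id index, so pandas has no unambiguous way to line the columns up again. The exact exception text differs between pandas versions, so the sketch prints whatever is raised rather than assuming a particular message.

```python
import pandas as pd

# Made-up, smaller frame: "cata" lists have 2 items, "catb" lists have 3.
df = pd.DataFrame({
    "id": [1, 2],
    "cata": [["a1", "a2"], ["a3", "a4"]],
    "catb": [["b1", "b2", "b3"], ["b4", "b5", "b6"]],
}).set_index("id")

# Explode each column on its own and compare the results.
exploded = {col: df[col].explode() for col in df.columns}
lengths = {col: len(s) for col, s in exploded.items()}
print(lengths)                            # {'cata': 4, 'catb': 6}
print(exploded["cata"].index.is_unique)   # False: each id repeats once per list item

# apply() has to combine the per-column results into one frame. The indexes
# differ and contain duplicates, so the alignment step is where it fails.
try:
    df.apply(pd.Series.explode)
except Exception as exc:  # a ValueError in the traceback shown in the question
    print(type(exc).__name__, exc)
```

With balanced lists, every exploded column ends up with an identical index, so no realignment is needed and the same call succeeds.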

3 Answers


This does not resolve the pd.Series.explode() error itself, but it does produce a long-form frame with an 'id' column.

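# Expand each list column into one column per list position, side by side
tmp = pd.concat([df['id'], df['cata'].apply(pd.Series), df['catb'].apply(pd.Series), df['catc'].apply(pd.Series)], axis=1)
# Unstack to long form: level_0 holds the column label, level_1 the original row index
tmp2 = tmp.unstack().to_frame().reset_index()
tmp2 = tmp2[tmp2['level_0'] != 'id']
tmp2 = tmp2.drop('level_1', axis=1)
# rename() returns a new frame, so assign the result back
tmp2 = tmp2.rename(columns={'level_0': 'id', 0: 'value'})
tmp2 = tmp2.reset_index(drop=True)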

    id  value
0   0   aaa1
1   0   aaa4
2   0   aaa7
3   0   aaa10
4   0   aaa13
5   1   bbb2
6   1   bbb5
7   1   bbb8
8   1   bbb11
9   1   bbb14
10  2   ccc3
11  2   ccc6
12  2   ccc9
...

I had to rethink how I was parsing the data. What I accidentally omitted from this post is that the unbalanced lists were a consequence of using .str.findall(regex_pattern).to_frame() on different columns: certain metadata fields (e.g., "name") were missing in some years, so the resulting lists had different lengths. Because I started with a column of lists of lists, I now explode that column first using df.explode and then use findall to extract the patterns into new columns, which means null values can appear where a field is missing.

For a 500MB dataset of several hundred thousand rows of string-type fields, the whole process probably took less than 5 minutes.
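
A rough sketch of the "explode first, then extract" order described above. The data, the field names and the regex patterns are all made up for illustration, and the sketch swaps in .str.extract (one match per row, NaN when absent) where the original workflow used findall.

```python
import pandas as pd

# Hypothetical raw data: each cell holds a list of record strings,
# and some records lack the "name" field.
raw = pd.DataFrame({
    "id": [1, 2],
    "records": [
        ["name=Ann;year=2001", "year=2002"],
        ["name=Bob;year=2003"],
    ],
})

# Step 1: one record per row, so later extractions yield one value per row.
records_long = raw.explode("records").reset_index(drop=True)

# Step 2: pull each field into its own column. A missing field becomes NaN
# instead of shortening a list, so the columns always stay the same length.
records_long["name"] = records_long["records"].str.extract(r"name=(\w+)", expand=False)
records_long["year"] = records_long["records"].str.extract(r"year=(\d+)", expand=False)

print(records_long)
```

Because every extracted column has exactly one entry per exploded row, there is nothing left to realign afterwards.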

import pandas as pd

# Balanced example: every list has exactly three items
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        0: [['x', 'y', 'z'], ['a', 'b', 'c'], ['a', 'b', 'c']],
        1: [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
        2: [['a', 'b', 'c'], ['x', 'y', 'z'], ['a', 'b', 'c']],
    }
)


print(df)

"""
   id          0          1          2
0   1  [x, y, z]  [a, b, c]  [a, b, c]
1   2  [a, b, c]  [a, b, c]  [x, y, z]
2   3  [a, b, c]  [a, b, c]  [a, b, c]

"""

# Long form: stack the columns into one Series of lists, then explode it
bb = (
    df.set_index('id').stack().explode()
    .reset_index(name='val')
    .drop(columns='level_1')
)
print(bb)
"""

    id val
0    1   x
1    1   y
2    1   z
3    1   a
4    1   b
5    1   c
6    1   a
7    1   b
8    1   c
9    2   a
10   2   b
11   2   c
12   2   a
13   2   b
14   2   c
15   2   x
16   2   y
17   2   z
18   3   a
19   3   b
20   3   c
21   3   a
22   3   b
23   3   c
24   3   a
25   3   b
26   3   c

"""


aa = df.set_index('id').apply(pd.Series.explode).reset_index()
print(aa)
"""
   id  0  1  2
0   1  x  a  a
1   1  y  b  b
2   1  z  c  c
3   2  a  a  x
4   2  b  b  y
5   2  c  c  z
6   3  a  a  a
7   3  b  b  b
8   3  c  c  c

"""

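The balanced case above works because every exploded column ends up with the same index. For unbalanced lists like those in the question, one possible workaround is to explode each column separately, number the items within each id, and join the columns on (id, position). This is only a sketch under those assumptions, not the only approach; the helper name explode_unbalanced and the temporary "pos" level are invented for this illustration.

```python
import pandas as pd

def explode_unbalanced(frame, id_col):
    """Explode list columns of differing lengths, padding short ones with NaN."""
    indexed = frame.set_index(id_col)
    parts = []
    for col in indexed.columns:
        s = indexed[col].explode()
        # Position of each item within its id: 0, 1, 2, ...
        pos = s.groupby(level=0).cumcount()
        # (id, pos) is unique, so the columns can be aligned safely.
        s.index = pd.MultiIndex.from_arrays(
            [s.index, pos.to_numpy()], names=[id_col, "pos"]
        )
        parts.append(s)
    out = pd.concat(parts, axis=1).rename_axis([id_col, "pos"])
    return out.sort_index().reset_index().drop(columns="pos")

# Small made-up frame with 2-item and 3-item lists.
df = pd.DataFrame({
    "id": [1, 2],
    "cata": [["aaa1", "bbb2"], ["aaa4", "bbb5"]],
    "catb": [["ddd1", "eee2", "fff3"], ["ddd5", "eee6", "fff7"]],
})

result = explode_unbalanced(df, "id")
print(result)
```

Shorter lists are padded with NaN at the end of each id group, which matches the shape of the desired output in the question. Performance on a large frame has not been measured here.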