How to flatten an array column in a pandas DataFrame

Question

Assuming I have a pandas dataframe such as

import pandas as pd

df_p = pd.DataFrame(
   {'name_array':
    [[20130101, 320903902, 239032902],
     [20130101, 3253453, 239032902],
     [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c']} )

I want to extract a Series containing the flattened arrays from each row, preserving their order.

The expected result is a pandas.core.series.Series.

This question is not a duplicate because my expected output is a pandas Series, and not a dataframe.

@AlexanderReynolds yes, it is irrelevant. Just a sample of the dataframe
– Alex, Commented Mar 13, 2019 at 18:23
Possible duplicate of How to convert column with list of values into rows in Pandas DataFrame
– alkasm, Commented Mar 13, 2019 at 18:28
Not the accepted answer, but the second one down showing the use of chain.from_iterable should work for you; you just need to pass that into the constructor of Series instead of DataFrame. So: pd.Series(list(chain.from_iterable(df['name_array'])))
– alkasm, Commented Mar 13, 2019 at 18:29
@AlexanderReynolds I've come up with a possible approach (I've posted it as an answer). I don't know whether this is an efficient way to do it.
– Alex, Commented Mar 13, 2019 at 18:30

alkasm · Accepted Answer · 2019-03-13 19:51:59Z

The solutions using melt are slower than OP's original method, which they shared in the answer here, especially after the speedup from my comment on that answer.

I created a larger dataframe to test on:

df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})

Timing the two solutions using melt on this dataframe yields:

In [16]: %timeit pd.melt(df.name_array.apply(pd.Series).reset_index(), id_vars=['index'],value_name='name_array').drop('variable', axis=1).sort_values('index')
173 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit df['name_array'].apply(lambda x: pd.Series([i for i in x])).melt().drop('variable', axis=1)['value']
175 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The OP's method with the speedup I suggested in the comments:

In [18]: %timeit pd.Series(np.concatenate(df['name_array']))
18 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And finally, the fastest solution as provided here but modified to provide a series instead of dataframe output:

In [14]: from itertools import chain
In [15]: %timeit pd.Series(list(chain.from_iterable(df['name_array'])))
402 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In these timings, this last method is roughly 430 times faster than the melt() solutions (nearly 3 orders of magnitude) and roughly 45 times faster than np.concatenate() (between 1 and 2 orders of magnitude).
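
For anyone who wants to rerun the comparison outside IPython, here is a minimal sketch using the standard timeit module. The helper names flatten_concat and flatten_chain are made up for this example, and .to_numpy() is used in place of passing the Series directly; absolute timings will differ by machine and library version.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Same shape as the benchmark dataframe above: 1000 rows, each a list of 3 floats.
df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})

def flatten_concat(frame):
    # Hypothetical helper: np.concatenate turns each inner list into an array, then joins them.
    return pd.Series(np.concatenate(frame['name_array'].to_numpy()))

def flatten_chain(frame):
    # Hypothetical helper: chain.from_iterable walks the inner lists in pure Python.
    return pd.Series(list(chain.from_iterable(frame['name_array'])))

# Both approaches must agree before their timings are worth comparing.
assert flatten_concat(df).equals(flatten_chain(df))

for fn in (flatten_concat, flatten_chain):
    seconds = timeit.timeit(lambda: fn(df), number=20) / 20
    print(f'{fn.__name__}: {seconds * 1e3:.3f} ms per call')
```

Checking that the two results are equal first guards against timing two functions that do not actually produce the same Series.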

Alex · Accepted Answer · 2019-03-13 18:32:02Z

This is the solution I've figured out. Don't know if there are more efficient ways.

import numpy as np
import pandas as pd

df_p = pd.DataFrame(
   {'name_array':
    [[20130101, 320903902, 239032902],
     [20130101, 3253453, 239032902],
     [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c']} )

# Concatenate the per-row lists into one flat array, then pull it out as a Series.
data = pd.DataFrame( {'column':np.concatenate(df_p['name_array'].values)} )['column']

output:

0       20130101
1      320903902
2      239032902
3       20130101
4        3253453
5      239032902
6          65756
7        4342452
8    32425432523
Name: column, dtype: int64

edited Mar 13, 2019 at 18:32

answered Mar 13, 2019 at 18:29

2 Comments

alkasm Over a year ago

You can remove the [] around the data, since you're just putting the new values into a list for no reason. Also, OP asked for a series and you're creating a dataframe and then indexing it with the column name to get a series---you should just be able to use the Series() constructor itself without the middle-man :). Edit: lol didn't realize you were OP.

alkasm Over a year ago

To be specific I'm saying you could do pd.Series(np.concatenate(df_p['name_array']))
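
Spelled out against the sample data from the question, the suggestion in these comments might look like the sketch below. The name= argument and the .to_numpy() call are additions for this example, not part of the comment.

```python
import numpy as np
import pandas as pd

df_p = pd.DataFrame({
    'name_array': [[20130101, 320903902, 239032902],
                   [20130101, 3253453, 239032902],
                   [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c'],
})

# Build the Series directly from the concatenated values,
# skipping the intermediate DataFrame and column lookup.
flat = pd.Series(np.concatenate(df_p['name_array'].to_numpy()), name='name_array')
print(flat)
```

The result gets a fresh 0..8 index, so the link back to the original row is lost; that matches the output requested in the question.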

panktijk · Accepted Answer · 2019-03-13 18:47:30Z

You can use pd.melt:

pd.melt(df_p.name_array.apply(pd.Series).reset_index(), 
        id_vars=['index'],
        value_name='name_array') \
        .drop('variable', axis=1) \
        .sort_values('index')

OUTPUT:

index   name_array
0       20130101
0       320903902
0       239032902
1       20130101
1       3253453
1       239032902
2       65756
2       4342452
2       32425432523
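
Note that the result above is a DataFrame, while the question asks for a Series. A possible adaptation, sketched under two assumptions that are not in the answer: the 'index' column is moved into the Series index, and sort_values is given kind='mergesort' because the default sort is not guaranteed to be stable, so the within-row order could otherwise change on larger inputs.

```python
import pandas as pd

df_p = pd.DataFrame({
    'name_array': [[20130101, 320903902, 239032902],
                   [20130101, 3253453, 239032902],
                   [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c'],
})

long_form = (
    pd.melt(df_p['name_array'].apply(pd.Series).reset_index(),
            id_vars=['index'],
            value_name='name_array')
    .drop('variable', axis=1)
    # A stable sort keeps elements of the same original row in their melted order.
    .sort_values('index', kind='mergesort')
)

# Select the value column so the result is a Series; its index is the original row number.
flat = long_form.set_index('index')['name_array']
print(flat)
```

Appending .reset_index(drop=True) would give a plain 0..8 index instead.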

answered Mar 13, 2019 at 18:47

Milad Ce · Accepted Answer · 2021-06-13 17:26:04Z

You can flatten the column's list of lists and then build a Series from the result:

pd.Series([element for row in df_p.name_array for element in row])
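
As a runnable sketch, here is the comprehension applied to the question's data. It is shown next to Series.explode, a different technique that none of the answers use and that is available from pandas 0.25 onward. The reset_index and astype calls are assumptions added here so the two results line up, since explode keeps the original row labels and returns object dtype for a column of lists.

```python
import pandas as pd

df_p = pd.DataFrame({
    'name_array': [[20130101, 320903902, 239032902],
                   [20130101, 3253453, 239032902],
                   [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c'],
})

# Nested comprehension: outer loop over rows, inner loop over each row's elements.
flat_listcomp = pd.Series([element for row in df_p['name_array'] for element in row])

# Alternative (not from the answer): explode gives one output row per list element.
flat_explode = (
    df_p['name_array']
    .explode()
    .reset_index(drop=True)   # drop the repeated original row labels
    .astype('int64')          # explode leaves an object-dtype Series
)

assert flat_listcomp.tolist() == flat_explode.tolist()
print(flat_listcomp)
```

Leaving out reset_index(drop=True) keeps the index 0, 0, 0, 1, 1, 1, 2, 2, 2, which is useful when each value still needs to be traced back to its source row.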

answered Jun 13, 2021 at 17:26
