6

Assuming I have a pandas dataframe such as

import pandas as pd

df_p = pd.DataFrame(
   {'name_array':
    [[20130101, 320903902, 239032902],
     [20130101, 3253453, 239032902],
     [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c']} )

Image of dataframe

I want to extract a Series that contains the flattened arrays from each row, preserving their order.

The expected result is a pandas.core.series.Series

Image of expected output

This question is not a duplicate because my expected output is a pandas Series, and not a dataframe.

5
  • So the name column is irrelevant? Commented Mar 13, 2019 at 18:22
  • @AlexanderReynolds yes, it is irrelevant. Just a sample of the dataframe Commented Mar 13, 2019 at 18:23
  • 1
    Possible duplicate of How to convert column with list of values into rows in Pandas DataFrame Commented Mar 13, 2019 at 18:28
  • Not the accepted answer, but the second one down showing the use of chain.from_iterable should work for you---you just need to pass that into the constructor of Series instead of DataFrame. So: pd.Series(list(chain.from_iterable(df['name_array']))) Commented Mar 13, 2019 at 18:29
  • @AlexanderReynolds I've come up with a possible approach (I've posted it as an answer). I don't know whether this is an efficient way to do it. Commented Mar 13, 2019 at 18:30

4 Answers

6

The solutions using melt are slower than OP's original method, which they shared in the answer here, especially after the speedup from my comment on that answer.

I created a larger dataframe to test on:

df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})

Timing the two melt-based solutions on this dataframe yields:

In [16]: %timeit pd.melt(df.name_array.apply(pd.Series).reset_index(), id_vars=['index'],value_name='name_array').drop('variable', axis=1).sort_values('index')
173 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit df['name_array'].apply(lambda x: pd.Series([i for i in x])).melt().drop('variable', axis=1)['value']
175 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The OP's method with the speedup I suggested in the comments:

In [18]: %timeit pd.Series(np.concatenate(df['name_array']))
18 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And finally, the fastest solution as provided here but modified to provide a series instead of dataframe output:

In [14]: from itertools import chain
In [15]: %timeit pd.Series(list(chain.from_iterable(df['name_array'])))
402 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This last method is roughly 430 times faster than the melt() solutions (between two and three orders of magnitude) and roughly 45 times faster than np.concatenate() (between one and two orders of magnitude).
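The %timeit lines above only run inside IPython. As a rough sketch, the same comparison can be reproduced in a plain script with the standard-library timeit module. The function names with_concatenate and with_chain are made up for this illustration, and the repeat counts are arbitrary; absolute numbers will differ by machine and library version.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Same shape of test data as in the answer: 1000 rows, each holding a 3-element list.
df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})


def with_concatenate():
    # .to_numpy() hands NumPy a plain object array of lists, so the call
    # does not depend on the dataframe's index labels.
    return pd.Series(np.concatenate(df['name_array'].to_numpy()))


def with_chain():
    # Lazily walks every inner list in row order, then builds the Series once.
    return pd.Series(list(chain.from_iterable(df['name_array'])))


# Both approaches should produce the same flattened Series.
assert with_concatenate().equals(with_chain())

for fn in (with_concatenate, with_chain):
    # Best of 3 repeats, 20 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```

Taking the minimum over the repeats reduces noise from other processes, which is a different summary from the mean ± std. dev. that %timeit prints.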


2

This is the solution I've figured out. Don't know if there are more efficient ways.

import numpy as np
import pandas as pd

df_p = pd.DataFrame(
   {'name_array':
    [[20130101, 320903902, 239032902],
     [20130101, 3253453, 239032902],
     [65756, 4342452, 32425432523]],
    'name': ['a', 'a', 'c']} )

data = pd.DataFrame( {'column':np.concatenate(df_p['name_array'].values)} )['column']

output:

[0       20130101
 1      320903902
 2      239032902
 3       20130101
 4        3253453
 5      239032902
 6          65756
 7        4342452
 8    32425432523
 Name: column, dtype: int64]

2 Comments

You can remove the [] around the data, since you're just putting the new values into a list for no reason. Also, OP asked for a series and you're creating a dataframe and then indexing it with the column name to get a series---you should just be able to use the Series() constructor itself without the middle-man :). Edit: lol didn't realize you were OP.
To be specific I'm saying you could do pd.Series(np.concatenate(df_p['name_array']))
1

You can use pd.melt:

pd.melt(df_p.name_array.apply(pd.Series).reset_index(), 
        id_vars=['index'],
        value_name='name_array') \
        .drop('variable', axis=1) \
        .sort_values('index')

OUTPUT:

index   name_array
0       20130101
0       320903902
0       239032902
1       20130101
1       3253453
1       239032902
2       65756
2       4342452
2       32425432523
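The output above is a dataframe, while the question asks for a Series. A possible adaptation, offered as a sketch and not as part of this answer: sort on both 'index' and 'variable' so the order inside each row does not depend on the sort algorithm (the default sort_values kind, quicksort, is not guaranteed to be stable), then select the value column. The names wide, long_form and flat are introduced here only for illustration.

```python
import pandas as pd

df_p = pd.DataFrame(
    {'name_array':
     [[20130101, 320903902, 239032902],
      [20130101, 3253453, 239032902],
      [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})

# One column per list position (0, 1, 2), plus the original row number as 'index'.
wide = df_p['name_array'].apply(pd.Series).reset_index()

# Long format: one row per (original row, list position) pair.
long_form = pd.melt(wide, id_vars=['index'], value_name='name_array')

# Sort by row, then by position within the row, and keep only the values.
flat = (long_form.sort_values(['index', 'variable'])['name_array']
        .reset_index(drop=True))

print(flat)
```

The final reset_index(drop=True) replaces the shuffled melt index with 0..8, matching the Series shown in the question's own answer.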


1

You can flatten the column's list of lists and build a Series from the result:

pd.Series([element for row in df_p.name_array for element in row])
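As a small check on the sample data, the nested comprehension visits rows in order and elements left to right within each row, which is the same traversal that itertools.chain.from_iterable performs. The variable names below are chosen for this sketch only.

```python
from itertools import chain

import pandas as pd

df_p = pd.DataFrame(
    {'name_array':
     [[20130101, 320903902, 239032902],
      [20130101, 3253453, 239032902],
      [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})

# Outer loop over rows, inner loop over the elements of each row.
by_comprehension = pd.Series(
    [element for row in df_p['name_array'] for element in row])

# Same traversal, expressed with itertools.
by_chain = pd.Series(list(chain.from_iterable(df_p['name_array'])))

print(by_comprehension.tolist())
# [20130101, 320903902, 239032902, 20130101, 3253453, 239032902, 65756, 4342452, 32425432523]
print(by_comprehension.equals(by_chain))  # True
```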
