dask dataframe apply meta

Question

I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta, I get the error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I'd like to know how to do that for future reference.

Dummy dataframe and the column frequencies

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
# from_pandas needs a partition count (or chunksize) in dask 0.14.x
dask_df = dd.from_pandas(df, npartitions=1)

In [39]: dask_df.head()
Out[39]: 
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18

(dask_df.groupby('Column B')
        .apply(lambda group: len(group))
       ).compute()

UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
Out[60]: 
Column B
Alice    2
David    2
Sam      3
dtype: int64

Trying to define meta produces AttributeError

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()

The same happens with this:

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()

The same happens if I pass the dtype as int instead of "int", or for that matter as 'f8' or np.float64, so the dtype doesn't seem to be what is causing the problem.

The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).

What is meta, and how am I supposed to define it?

Using Python 3.6, dask 0.14.3, and pandas 0.20.2.

Hmm, not sure why that would fail. Does meta=('Column B', 'int') work?
– Bob Haffner, Commented Jun 8, 2017 at 13:38
Not to answer your question, but how about dask_df.groupby('Column B').count().compute()? That gets the number of valid values in each column, not the length. dask_df['Column B'].value_counts().compute() is a more exact translation. I believe the error arises because the output has Column B as the index, not as the column name.
– mdurant, Commented Jun 8, 2017 at 13:39
Both of those seem to do the right thing; no idea which one is more efficient.
– Matti Lyra, Commented Jun 8, 2017 at 15:28

tobsecret · Accepted Answer · 2017-09-22 19:07:23Z

39

meta describes the names and dtypes of the output of the computation. It is required because apply() is flexible enough to produce just about anything from a dataframe. As you can see, if you don't provide meta, dask actually computes part of the data to see what the types should be. That is fine, but you should know it is happening. When you know what the output should look like, you can avoid this pre-computation (which can be expensive) and be more explicit by providing a zero-row version of the output (dataframe or series), or just the types.

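The output of your computation is actually a series, so the following is the simplest that works. Note that ('int') is just the string 'int' (the parentheses alone do not make a tuple), so only a dtype is being passed: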

(dask_df.groupby('Column B')
     .apply(len, meta=('int'))).compute()

but a more accurate version would be:

(dask_df.groupby('Column B')
     .apply(len, meta=pd.Series(dtype='int', name='Column B')))

edited Sep 22, 2017 at 19:07 by tobsecret

answered Jun 8, 2017 at 13:53 by mdurant

2 Comments

djakubosky Over a year ago

Is there any performance boost from including the full pd.Series meta?

mdurant Over a year ago

No, but it's more explicit, and in some cases allows you finer control, e.g., over the name and type of the index.
