
I have a huge dataframe. Following a group_by operation, I have a list of strings corresponding to every element of the first column. What I need is to quickly find the common strings between a particular i'th row and all the other rows. I could do that in Pandas by saving the above dataframe as a pickle file, but that solution was suboptimal, as loading takes a very long time.

I then found polars to be promising, except that I cannot store a dataframe with a column of sets in any format it supports for quick loading. That leaves the alternative of storing the grouped column as a list and quickly converting it to sets after loading from parquet. (I faced the same problems with datatable and vaex too.)

The solution I found with polars was to use .map_elements. But it runs in a single thread and is very slow. The code I used was as follows:

df = pl.from_repr("""
┌────────┬────────┐
│ ColA   ┆ ColB   │
│ ---    ┆ ---    │
│ str    ┆ str    │
╞════════╪════════╡
│ apple  ┆ boy    │
│ orange ┆ ball   │
│ apple  ┆ bamboo │
│ orange ┆ bull   │
└────────┴────────┘
""")
df = df.lazy().group_by('ColA').agg('ColB').collect()
shape: (2, 2)
┌────────┬───────────────────┐
│ ColA   ┆ ColB              │
│ ---    ┆ ---               │
│ str    ┆ list[str]         │
╞════════╪═══════════════════╡
│ apple  ┆ ["boy", "bamboo"] │
│ orange ┆ ["ball", "bull"]  │
└────────┴───────────────────┘
df.with_columns(
    pl.col('ColB').map_elements(set)
)
shape: (2, 2)
┌────────┬───────────────────┐
│ ColA   ┆ ColB              │
│ ---    ┆ ---               │
│ str    ┆ object            │
╞════════╪═══════════════════╡
│ apple  ┆ {'boy', 'bamboo'} │
│ orange ┆ {'ball', 'bull'}  │
└────────┴───────────────────┘

I found a discussion on using map_batches, but it works on series only. Unlike that example, which worked on a per-element basis, when I used np.asarray to convert the lists to numpy arrays (to intersect them later), it also gave me an object column.

df.select(pl.all().map_batches(np.asarray))
shape: (2, 2)
┌────────┬──────────────────┐
│ ColA   ┆ ColB             │
│ ---    ┆ ---              │
│ str    ┆ object           │
╞════════╪══════════════════╡
│ apple  ┆ ['boy' 'bamboo'] │
│ orange ┆ ['ball' 'bull']  │
└────────┴──────────────────┘

I would like to know where I went wrong, and how to convert a column of lists to a column of numpy arrays (or, preferably, sets) using multiple threads, as map does.

2 Answers


Perhaps not the best approach, but the following worked reasonably well.

>>> my_dict = dict(df.to_numpy().tolist())
>>> my_dict
{'orange': array(['ball', 'bull', 'boy'], dtype=object), 'apple': array(['boy', 'bamboo'], dtype=object)}
>>> for i in my_dict:
...     my_dict[i] = set(my_dict[i])
...
>>> my_dict
{'orange': {'ball', 'boy', 'bull'}, 'apple': {'bamboo', 'boy'}}



A better approach could be:

df = df.lazy().group_by('ColA').agg('ColB').with_columns(pl.col("ColB").list.unique())

2 Comments

Could you explain what makes this approach better?
that's still a list, not a set.
