2

I am working in Polars and I have data set where one column is lists of strings. To see what it's like:

import pandas as pd

list_of_lists = [['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
                  , ['base', 'base.current base', 'base.current base.inventories - total','ABCD']
                  , ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
                  , ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']]

pd_df = pd.DataFrame({'lol': list_of_lists})

Gives:

    lol
0   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
1   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
2   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
3   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']

I want to hash each list. I thought to cast each list as a string and then hash it. I can do that with Pandas

pd_df = pd.DataFrame({'lol': list_of_lists}).astype({'lol':str})
pl_df_1 = pl.DataFrame(pd_df)
pl_df_1.with_columns(pl.col('lol')
                    .hash(seed=140)
                    .name.suffix('_hashed')
                    )

Gives:

               lol                  lol_hashed
               str                   u64
"['base', 'base.current base', …    14283628883798345624
"['base', 'base.current base', …    14283628883798345624
"['base', 'base.current base', …    14283628883798345624
"['base', 'base.current base', …    14283628883798345624

But if I try to do similar in Polars I get an error:

pl_df_2 = pl.DataFrame({'lol': list_of_lists})
pl_df_2.with_columns(pl.col('lol') # <== can insert .cast(pl.String) here still get error
                    .hash(seed=140)
                    .name.suffix('_hashed')
                    )

Gives:

# PanicException: Hashing a list with a non-numeric inner type not supported. 
#   Got dtype: List(String)

I would prefer to work just with the Polars library so is it possible to cast the column of lists as strings or is there a better way in Polars of achieving the same result?

UPDATE:

Based on the accepted answer I experimented further.

list_of_lists = [
                 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'], 
                 ['base', 'base.current base', 'base.current base.inventories - total', 'DEFG'], 
                 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'], 
                 ['base', 'base.current base', 'base.current base.inventories - total', 'HIJK'], 
                 '(bobbyJoe460)',
                 'bobby, Joe (xx866e)',
                 137642039575
                 ]

pl_df_1 = pl.DataFrame({'lol': list_of_lists}, strict=False)  # <==== allow mixed types in column
pl_df_1.with_columns(pl.col('lol')
                     .cast(pl.Categorical)  # <==== cast to Categorical
                     .hash(seed=140)
                     .name.suffix('_hashed')
                     )

Gives:

                 lol.               lol_hashed
                 str                    u64
"["base", "base.current base", …    11231070086490249882
"["base", "base.current base", …    6519339301964281776
"["base", "base.current base", …    11231070086490249882
"["base", "base.current base", …    14549859594875138034
"(bobbyJoe460)"                     1954884316252525743
"bobby, Joe (xx866e)"               4241414284122449899
"137642039575"                      6383308039250228053

1 Answer 1

1

The PanicException is a bug and could be reported.

.list.join() can be used to create a "single string" which you can hash.

df = pl.DataFrame({"lol": list_of_lists})

df.with_columns(
    pl.col("lol").list.join("").hash()
)
shape: (4, 1)
┌─────────────────────┐
│ lol                 │
│ ---                 │
│ u64                 │
╞═════════════════════╡
│ 8244365561843513530 │
│ 8244365561843513530 │
│ 8244365561843513530 │
│ 8244365561843513530 │
└─────────────────────┘

You can also cast to Categoricals which can be hashed.

df.with_columns(
    pl.col("lol").cast(pl.List(pl.Categorical)).hash()
)
shape: (4, 1)
┌────────────────────┐
│ lol                │
│ ---                │
│ u64                │
╞════════════════════╡
│ 599332512135826737 │
│ 599332512135826737 │
│ 599332512135826737 │
│ 599332512135826737 │
└────────────────────┘
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.