3

I am trying to use python polars over pandas sql for a large dataframe as I am running into memory errors. There are two where conditions that are utilized in this dataframe but can't get the syntax right.

Here is what the data looks like:

Key Field DateColumn
1234 Plumb 2020-02-01
1234 Plumb 2020-03-01
1234 Pear 2020-04-01
import pandas as pd
import datetime as dt
import pandasql as ps

d = {'Key':
          [1234,    1234,   1234,   1234,   1234,   1234,   1234,   

1234,   1234,   1234,   1234,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754],
         'Field':[
              "Plumb",  "Plumb",    "Pear", "Plumb",    "Orange",   "Pear", "Plumb",    "Plumb",    "Pear", "Apple",    "Plumb",    "Orange",   "Orange",   "Apple",    "Apple",    "Pear", "Apple",    "Plumb",    "Plumb",    "Orange",   "Orange",   "Pear", "Plumb",    "Pear", "Plumb",    "Pear", "Apple",    "Plumb",    "Orange",   "Pear", "Apple",    "Pear", "Apple"],
         'DateColumn':[
              '2020-02-01', '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01',   '2020-02-01',   '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01',   '2020-02-01',   '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01'

     ]}

df = pd.DataFrame(data=d)

df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df['PreviousMonth'] = df['DateColumn'] - pd.DateOffset(months=1)


df_output = ps.sqldf("""
select
a.Key
,a.Field
,b.Field as PreviousField
,a.DateColumn
,b.DateColumn as PreviousDate

from df as a, df as b
where a.Key = b.Key
and b.DateColumn = a.PreviousMonth

""")

print(df_output.head())

    Key   Field                  DateColumn                PreviousDate
0  1234   Plumb  2020-03-01 00:00:00.000000  2020-02-01 00:00:00.000000
1  1234    Pear  2020-04-01 00:00:00.000000  2020-03-01 00:00:00.000000
2  1234   Plumb  2020-05-01 00:00:00.000000  2020-04-01 00:00:00.000000
3  1234  Orange  2020-06-01 00:00:00.000000  2020-05-01 00:00:00.000000
4  1234    Pear  2020-07-01 00:00:00.000000  2020-06-01 00:00:00.000000

I have tried to do

data_output = df.join(df, left_on='Key', right_on='Key')

But unable to find a good example on how to put the two conditions on the join condition.

2
  • Look for the section Merging on multiple columns in this answer Commented Feb 7, 2023 at 19:39
  • 1
    Please create a copy/pastable df definition with sample data as well as an example of the result you're looking for. Commented Feb 7, 2023 at 19:41

1 Answer 1

4

Let's accomplish everything using Polars.

import polars as pl
df = (
    pl.DataFrame(d)
    .with_columns(
        pl.col('DateColumn').str.to_date()
    )
)

(
    df
    .join(
        df
        .with_columns(pl.col('DateColumn').alias('PreviousDate'))
        .rename({'Field': 'PreviousField'}),
        left_on=['Key', 'DateColumn'],
        right_on=['Key', pl.col('DateColumn').dt.offset_by('1mo')],
        how="inner"
    )
)
shape: (30, 5)
┌──────┬────────┬────────────┬───────────────┬──────────────┐
│ Key  ┆ Field  ┆ DateColumn ┆ PreviousField ┆ PreviousDate │
│ ---  ┆ ---    ┆ ---        ┆ ---           ┆ ---          │
│ i64  ┆ str    ┆ date       ┆ str           ┆ date         │
╞══════╪════════╪════════════╪═══════════════╪══════════════╡
│ 1234 ┆ Plumb  ┆ 2020-03-01 ┆ Plumb         ┆ 2020-02-01   │
│ 1234 ┆ Pear   ┆ 2020-04-01 ┆ Plumb         ┆ 2020-03-01   │
│ 1234 ┆ Plumb  ┆ 2020-05-01 ┆ Pear          ┆ 2020-04-01   │
│ 1234 ┆ Orange ┆ 2020-06-01 ┆ Plumb         ┆ 2020-05-01   │
│ 1234 ┆ Pear   ┆ 2020-07-01 ┆ Orange        ┆ 2020-06-01   │
│ 1234 ┆ Plumb  ┆ 2020-08-01 ┆ Pear          ┆ 2020-07-01   │
│ 1234 ┆ Plumb  ┆ 2020-09-01 ┆ Plumb         ┆ 2020-08-01   │
│ 1234 ┆ Pear   ┆ 2020-10-01 ┆ Plumb         ┆ 2020-09-01   │
│ 1234 ┆ Apple  ┆ 2020-11-01 ┆ Pear          ┆ 2020-10-01   │
│ 1234 ┆ Plumb  ┆ 2020-12-01 ┆ Apple         ┆ 2020-11-01   │
│ 2456 ┆ Orange ┆ 2020-03-01 ┆ Orange        ┆ 2020-02-01   │
│ 2456 ┆ Apple  ┆ 2020-04-01 ┆ Orange        ┆ 2020-03-01   │
│ 2456 ┆ Apple  ┆ 2020-05-01 ┆ Apple         ┆ 2020-04-01   │
│ 2456 ┆ Pear   ┆ 2020-06-01 ┆ Apple         ┆ 2020-05-01   │
│ 2456 ┆ Apple  ┆ 2020-07-01 ┆ Pear          ┆ 2020-06-01   │
│ 2456 ┆ Plumb  ┆ 2020-08-01 ┆ Apple         ┆ 2020-07-01   │
│ 2456 ┆ Plumb  ┆ 2020-09-01 ┆ Plumb         ┆ 2020-08-01   │
│ 2456 ┆ Orange ┆ 2020-10-01 ┆ Plumb         ┆ 2020-09-01   │
│ 2456 ┆ Orange ┆ 2020-11-01 ┆ Orange        ┆ 2020-10-01   │
│ 2456 ┆ Pear   ┆ 2020-12-01 ┆ Orange        ┆ 2020-11-01   │
│ 3754 ┆ Pear   ┆ 2020-03-01 ┆ Plumb         ┆ 2020-02-01   │
│ 3754 ┆ Plumb  ┆ 2020-04-01 ┆ Pear          ┆ 2020-03-01   │
│ 3754 ┆ Pear   ┆ 2020-05-01 ┆ Plumb         ┆ 2020-04-01   │
│ 3754 ┆ Apple  ┆ 2020-06-01 ┆ Pear          ┆ 2020-05-01   │
│ 3754 ┆ Plumb  ┆ 2020-07-01 ┆ Apple         ┆ 2020-06-01   │
│ 3754 ┆ Orange ┆ 2020-08-01 ┆ Plumb         ┆ 2020-07-01   │
│ 3754 ┆ Pear   ┆ 2020-09-01 ┆ Orange        ┆ 2020-08-01   │
│ 3754 ┆ Apple  ┆ 2020-10-01 ┆ Pear          ┆ 2020-09-01   │
│ 3754 ┆ Pear   ┆ 2020-11-01 ┆ Apple         ┆ 2020-10-01   │
│ 3754 ┆ Apple  ┆ 2020-12-01 ┆ Pear          ┆ 2020-11-01   │
└──────┴────────┴────────────┴───────────────┴──────────────┘

Note that we use an Expression in the right_on columns to generate our offset date column on-the-fly.

Sign up to request clarification or add additional context in comments.

1 Comment

This is perfect, exactly what I was looking for!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.