I need to apply a window function in PySpark but have to ignore certain rows while doing it.
I have tried the code below.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = (sc.parallelize([
    {"id": "900",  "service": "MM", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-09-13 13:38:17.229"},
    {"id": "900",  "service": "MM", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-09-13 13:38:17.242"},
    {"id": "1527", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 14:52:02.331"},
    {"id": "1527", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 14:52:02.490"},
    {"id": "1527", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 14:52:02.647"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.095"},
    {"id": "1504", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.253"},
    {"id": "1504", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.372"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.732"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:53.445"},
    {"id": "1504", "service": "MT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:53.643"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:53.924"},
    {"id": "1504", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:54.094"},
    {"id": "1504", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:54.243"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:54.732"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:30.764"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:31.099"},
    {"id": "1504", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:33.363"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:33.677"},
    {"id": "1504", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:39.572"}
]).toDF())
(
    df
    .withColumn(
        'rank',
        F.when(
            F.col('id') != 900,
            F.row_number().over(
                Window.partitionBy(
                    # also tried F.when((F.col('id') != 90000), F.col('guid')) as the key here, without success
                    F.col('guid')
                )
                .orderBy(F.col('time').asc())
            )
        )
    )
    .select('id', 'service', 'guid', 'time', 'rank')
    .show(truncate=False)
)
I almost have it, but the row numbers need to start from 1 instead of 3 in this case. So in the rank column, the value on the first row after the two nulls should be 1, not 3.
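
One possible fix, sketched under the assumption that the id == 900 rows only need to be excluded from the numbering rather than dropped from the output: add the ignore condition itself to the partition key. The ignored rows then fall into their own partition, so row_number restarts at 1 for the rows that are kept.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Partitioning by guid *and* the ignore condition sends the id == 900 rows
# into a separate partition, so the remaining rows per guid are numbered from 1.
w = (
    Window
    .partitionBy(F.col('guid'), F.col('id') != 900)
    .orderBy(F.col('time').asc())
)

(
    df
    .withColumn('rank', F.when(F.col('id') != 900, F.row_number().over(w)))
    .select('id', 'service', 'guid', 'time', 'rank')
    .show(truncate=False)
)

The outer F.when still nulls out rank for the ignored rows as before; the difference is that they no longer consume row numbers in the partition of interest, so the first non-null rank is 1.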
