
I have the following DataFrame in Spark (Python). For each "grupo_edad" I am trying to select the first day on which the "datos_acumulados" column exceeds 20480. In this case, the output should be a table like the following (a table that includes the null rows for groups that never reach that value):

Results:

+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
|         4|      null|        null|            null|
+----------+----------+------------+----------------+

DataFrame: df_datos_acumulados

    +----------+----------+------------+----------------+
    |grupo_edad|     fecha|acumuladosMB|datos_acumulados|
    +----------+----------+------------+----------------+
    |         1|2020-08-01|        6185|            6185|
    |         1|2020-08-02|        5854|           12039|
    |         1|2020-08-03|        4018|           16057|
    |         1|2020-08-04|        4864|           20921|
    |         1|2020-08-05|        5526|           26447|
    |         1|2020-08-06|        4818|           31265|
    |         1|2020-08-07|        5359|           36624|
    |         4|2020-08-01|         674|             674|
    |         4|2020-08-02|         744|            1418|
    |         4|2020-08-03|         490|            1908|
    |         4|2020-08-04|         355|            2263|
    |         4|2020-08-05|        1061|            3324|
    |         4|2020-08-06|         752|            4076|
    |         4|2020-08-07|         560|            4636|
    +----------+----------+------------+----------------+

Thanks!

Thanks to @pasha701's answer I could get the final table, but it doesn't show the null rows that I also need:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

grupoDistinctDF = df_datos_acumulados.withColumn("grupo_edad", col("grupo_edad"))

grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

df_datos_acumulados = df_datos_acumulados.where(col("datos_acumulados") >= 20480) \
  .withColumn("row_number", row_number().over(grupoWindow)) \
  .where(col("row_number") == 1) \
  .drop("row_number")

grupoDistinctDF = grupoDistinctDF.join(df_datos_acumulados, ["grupo_edad"], "left")

Output:

+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
+----------+----------+------------+----------------+

1 Answer


If the first row where "datos_acumulados" > 20480 is required per group, the window function row_number() can be used to get that first row, which is then left-joined with the distinct "grupo_edad" values; the left join is what keeps groups without any matching row as null rows (Scala):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = Seq(
  (1, "2020-08-01", 6185, 6185),
  (1, "2020-08-02", 5854, 12039),
  (1, "2020-08-03", 4018, 16057),
  (1, "2020-08-04", 4864, 20921),
  (1, "2020-08-05", 5526, 26447),
  (1, "2020-08-06", 4818, 31265),
  (1, "2020-08-07", 5359, 36624),
  (4, "2020-08-01", 674, 674),
  (4, "2020-08-02", 744, 1418),
  (4, "2020-08-03", 490, 1908),
  (4, "2020-08-04", 355, 2263),
  (4, "2020-08-05", 1061, 3324),
  (4, "2020-08-06", 752, 4076),
  (4, "2020-08-07", 560, 4636)
).toDF("grupo_edad", "fecha", "acumuladosMB", "datos_acumulados")

// One row per "grupo_edad", so groups with no matching row survive the left join
val grupoDistinctDF = df.select("grupo_edad").distinct()

val grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

// First row per group whose running total exceeds the threshold
val firstMatchingRowDF = df
  .where($"datos_acumulados" > 20480)
  .withColumn("row_number", row_number().over(grupoWindow))
  .where($"row_number" === 1)
  .drop("row_number")

grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left")

Output:

+----------+----------+------------+----------------+
|grupo_edad|fecha     |acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|4         |null      |null        |null            |
|1         |2020-08-04|4864        |20921           |
+----------+----------+------------+----------------+

4 Comments

Thank you so much pasha701! I have never used Scala before. I have tried to do the same in Python but I am receiving errors such as "'DataFrame' object is not callable". Would the solution be very different in Python?
I guess the same approach can be used in Python, but a proper translation is required. Unfortunately, I am not familiar with Python.
Thank you so much pasha701 for your time. I could get the final table with Python thanks to your explanation. But I could only get the rows where the number is higher than 20480, and not the nulls...
The last code line (the left join) is what produces the null rows. The line can be assigned to a variable, and the variable printed, like: val result = grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left"); result.show(false)
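
For reference, a PySpark translation of the accepted answer might look like the sketch below (assuming an active SparkSession named spark; the variable names simply mirror the Scala answer). The "'DataFrame' object is not callable" error mentioned in the comments typically comes from translating Scala's df("col") literally, where Python needs col("col") or df["col"] instead.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Rebuild the example data from the question
df = spark.createDataFrame(
    [(1, "2020-08-01", 6185, 6185),
     (1, "2020-08-02", 5854, 12039),
     (1, "2020-08-03", 4018, 16057),
     (1, "2020-08-04", 4864, 20921),
     (1, "2020-08-05", 5526, 26447),
     (1, "2020-08-06", 4818, 31265),
     (1, "2020-08-07", 5359, 36624),
     (4, "2020-08-01", 674, 674),
     (4, "2020-08-02", 744, 1418),
     (4, "2020-08-03", 490, 1908),
     (4, "2020-08-04", 355, 2263),
     (4, "2020-08-05", 1061, 3324),
     (4, "2020-08-06", 752, 4076),
     (4, "2020-08-07", 560, 4636)],
    ["grupo_edad", "fecha", "acumuladosMB", "datos_acumulados"])

# One row per group, so groups that never cross the threshold survive the join
grupoDistinctDF = df.select("grupo_edad").distinct()

grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

# First row per group whose running total exceeds the threshold
firstMatchingRowDF = (df
    .where(col("datos_acumulados") > 20480)
    .withColumn("row_number", row_number().over(grupoWindow))
    .where(col("row_number") == 1)
    .drop("row_number"))

# The left join is what keeps grupo_edad 4 as a null row
result = grupoDistinctDF.join(firstMatchingRowDF, ["grupo_edad"], "left")
result.show()

result.show() should then print both group 1's first matching day and the null row for group 4, as in the answer's output above.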
