
I have the following DataFrame in Spark (Python). For each "grupo_edad" I am trying to select the first day on which the "datos_acumulados" column exceeds 20480. In this case, the output should be a table like the following (a table that includes the null rows for groups that never reach that value):

Results:

+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
|         4|      null|        null|            null|
+----------+----------+------------+----------------+

DataFrame: df_datos_acumulados

    +----------+----------+------------+----------------+
    |grupo_edad|     fecha|acumuladosMB|datos_acumulados|
    +----------+----------+------------+----------------+
    |         1|2020-08-01|        6185|            6185|
    |         1|2020-08-02|        5854|           12039|
    |         1|2020-08-03|        4018|           16057|
    |         1|2020-08-04|        4864|           20921|
    |         1|2020-08-05|        5526|           26447|
    |         1|2020-08-06|        4818|           31265|
    |         1|2020-08-07|        5359|           36624|
    |         4|2020-08-01|         674|             674|
    |         4|2020-08-02|         744|            1418|
    |         4|2020-08-03|         490|            1908|
    |         4|2020-08-04|         355|            2263|
    |         4|2020-08-05|        1061|            3324|
    |         4|2020-08-06|         752|            4076|
    |         4|2020-08-07|         560|            4636|
    +----------+----------+------------+----------------+

Thanks!

Thanks to @pasha701's answer I could get the final table, but it doesn't show the null rows that I also need:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

grupoDistinctDF = df_datos_acumulados.withColumn("grupo_edad", col("grupo_edad"))

grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

df_datos_acumulados = df_datos_acumulados.where(col("datos_acumulados") >= 20480) \
  .withColumn("row_number", row_number().over(grupoWindow)) \
  .where(col("row_number") == 1) \
  .drop("row_number")

grupoDistinctDF = grupoDistinctDF.join(df_datos_acumulados, ["grupo_edad"], "left")

Output:

+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
+----------+----------+------------+----------------+

1 Answer


If the first row where "datos_acumulados" > 20480 is required per group, the window function row_number() can be used to get that first row, which is then left-joined with the distinct "grupo_edad" values; the left join is what keeps groups without any matching row as null rows (Scala):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = Seq(
  (1, "2020-08-01", 6185, 6185),
  (1, "2020-08-02", 5854, 12039),
  (1, "2020-08-03", 4018, 16057),
  (1, "2020-08-04", 4864, 20921),
  (1, "2020-08-05", 5526, 26447),
  (1, "2020-08-06", 4818, 31265),
  (1, "2020-08-07", 5359, 36624),
  (4, "2020-08-01", 674, 674),
  (4, "2020-08-02", 744, 1418),
  (4, "2020-08-03", 490, 1908),
  (4, "2020-08-04", 355, 2263),
  (4, "2020-08-05", 1061, 3324),
  (4, "2020-08-06", 752, 4076),
  (4, "2020-08-07", 560, 4636)
).toDF("grupo_edad", "fecha", "acumuladosMB", "datos_acumulados")

// One row per "grupo_edad", so groups with no matching row survive the left join
val grupoDistinctDF = df.select("grupo_edad").distinct()

val grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

// First row per group whose running total exceeds the threshold
val firstMatchingRowDF = df
  .where($"datos_acumulados" > 20480)
  .withColumn("row_number", row_number().over(grupoWindow))
  .where($"row_number" === 1)
  .drop("row_number")

grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left")

Output:

+----------+----------+------------+----------------+
|grupo_edad|fecha     |acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|4         |null      |null        |null            |
|1         |2020-08-04|4864        |20921           |
+----------+----------+------------+----------------+

4 Comments

Thank you so much pasha701! I have never used Scala before. I have tried to do the same in Python but I am receiving errors such as "'DataFrame' object is not callable". Would the solution be very different in Python?
I guess the same approach can be used in Python, but a proper translation is required. Unfortunately, I am not familiar with Python.
Thank you so much pasha701 for your time. I could get the final table with Python thanks to your explanation. But I could only get the rows where the number is higher than 20480, and not the nulls...
The last code line (the left join) is what produces the null rows. The line can be assigned to a variable, and the variable printed, like: val result = grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left"); result.show(false)
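
For reference, a PySpark translation of the accepted answer might look like the sketch below (assuming an active SparkSession named spark; the variable names simply mirror the Scala answer). The "'DataFrame' object is not callable" error mentioned in the comments typically comes from translating Scala's df("col") literally, where Python needs col("col") or df["col"] instead.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Rebuild the example data from the question
df = spark.createDataFrame(
    [(1, "2020-08-01", 6185, 6185),
     (1, "2020-08-02", 5854, 12039),
     (1, "2020-08-03", 4018, 16057),
     (1, "2020-08-04", 4864, 20921),
     (1, "2020-08-05", 5526, 26447),
     (1, "2020-08-06", 4818, 31265),
     (1, "2020-08-07", 5359, 36624),
     (4, "2020-08-01", 674, 674),
     (4, "2020-08-02", 744, 1418),
     (4, "2020-08-03", 490, 1908),
     (4, "2020-08-04", 355, 2263),
     (4, "2020-08-05", 1061, 3324),
     (4, "2020-08-06", 752, 4076),
     (4, "2020-08-07", 560, 4636)],
    ["grupo_edad", "fecha", "acumuladosMB", "datos_acumulados"])

# One row per group, so groups that never cross the threshold survive the join
grupoDistinctDF = df.select("grupo_edad").distinct()

grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

# First row per group whose running total exceeds the threshold
firstMatchingRowDF = (df
    .where(col("datos_acumulados") > 20480)
    .withColumn("row_number", row_number().over(grupoWindow))
    .where(col("row_number") == 1)
    .drop("row_number"))

# The left join is what keeps grupo_edad 4 as a null row
result = grupoDistinctDF.join(firstMatchingRowDF, ["grupo_edad"], "left")
result.show()

result.show() should then print both group 1's first matching day and the null row for group 4, as in the answer's output above.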
