Hello, I am new to Spark and Scala, and I would like to split the following dataframe:

df:
+----------+-----+------+----------+--------+
|        Ts| Temp|  Wind|  Precipit|Humidity|
+----------+-----+------+----------+--------+
|1579647600|   10|    22|        10|      50|
|1579734000|   11|    21|        10|      55|
|1579820400|   10|    18|        15|      60|
|1579906800|    9|    23|        20|      60|
|1579993200|    8|    24|        25|      50|
|1580079600|   10|    18|        27|      60|
|1580166000|   11|    20|        30|      50|
|1580252400|   12|    17|        15|      50|
|1580338800|   10|    14|        21|      50|
|1580425200|    9|    16|        25|      60|
+----------+-----+------+----------+--------+

The resulting dataframes should be as follows:

df1:
+----------+-----+------+----------+--------+
|        Ts| Temp|  Wind|  Precipit|Humidity|
+----------+-----+------+----------+--------+
|1579647600|   10|    22|        10|      50|
|1579734000|   11|    21|        10|      55|
|1579820400|   10|    18|        15|      60|
|1579906800|    9|    23|        20|      60|
|1579993200|    8|    24|        25|      50|
|1580079600|   10|    18|        27|      60|
|1580166000|   11|    20|        30|      50|
|1580252400|   12|    17|        15|      50|
+----------+-----+------+----------+--------+
df2:
+----------+-----+------+----------+--------+
|        Ts| Temp|  Wind|  Precipit|Humidity|
+----------+-----+------+----------+--------+
|1580338800|   10|    14|        21|      50|
|1580425200|    9|    16|        25|      60|
+----------+-----+------+----------+--------+

where df1 contains the top 80% of the rows of df, and df2 the remaining 20%.

2 Answers


Try monotonically_increasing_id() together with the window function percent_rank(), as monotonically_increasing_id() preserves the existing row order.

Example:

val df=sc.parallelize(Seq((1579647600,10,22,10,50),
(1579734000,11,21,10,55),
(1579820400,10,18,15,60),
(1579906800, 9,23,20,60),
(1579993200, 8,24,25,50),
(1580079600,10,18,27,60),
(1580166000,11,20,30,50),
(1580252400,12,17,15,50),
(1580338800,10,14,21,50),
(1580425200, 9,16,25,60)),10).toDF("Ts","Temp","Wind","Precipit","Humidity")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// define the window over the generated id, otherwise `w` below is undefined
val w = Window.orderBy("mid")

val df1 = df.withColumn("mid", monotonically_increasing_id)
val df_above_80 = df1.withColumn("pr", percent_rank().over(w)).filter(col("pr") >= 0.8).drop("mid", "pr")
val df_below_80 = df1.withColumn("pr", percent_rank().over(w)).filter(col("pr") < 0.8).drop("mid", "pr")

df_below_80.show()
/*
+----------+----+----+--------+--------+
|        Ts|Temp|Wind|Precipit|Humidity|
+----------+----+----+--------+--------+
|1579647600|  10|  22|      10|      50|
|1579734000|  11|  21|      10|      55|
|1579820400|  10|  18|      15|      60|
|1579906800|   9|  23|      20|      60|
|1579993200|   8|  24|      25|      50|
|1580079600|  10|  18|      27|      60|
|1580166000|  11|  20|      30|      50|
|1580252400|  12|  17|      15|      50|
+----------+----+----+--------+--------+
*/

df_above_80.show()
/*
+----------+----+----+--------+--------+
|        Ts|Temp|Wind|Precipit|Humidity|
+----------+----+----+--------+--------+
|1580338800|  10|  14|      21|      50|
|1580425200|   9|  16|      25|      60|
+----------+----+----+--------+--------+
*/
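An alternative sketch, assuming the same df as above: compute the cutoff row count explicitly and use limit together with except. Note that except is a set difference, so it only works cleanly here because every row is distinct, and it does not guarantee the order of the remaining rows.

```scala
// Sketch: split by an explicit row-count cutoff.
// Assumes all rows are distinct, since except() deduplicates.
val n = df.count()                  // 10 rows in the example
val cut = math.ceil(n * 0.8).toInt  // 8 rows go to the first split
val dfTop = df.limit(cut)           // first 80% of rows
val dfRest = df.except(dfTop)       // remaining 20% (order not guaranteed)
```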

Assuming the data are randomly split:

val Array(df1, df2) = df.randomSplit(Array(0.8, 0.2))
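Keep in mind that randomSplit assigns each row to a split independently, so the resulting sizes are only approximately 80/20. If reproducibility matters, a seed can be passed (a sketch, assuming the same df):

```scala
// randomSplit weights are normalized and applied per row,
// so the split is approximate; the seed makes it repeatable.
val Array(df1, df2) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
```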

If, however, by "top rows" you mean ordering by the Ts column in your example dataframe, then you could do this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col,percent_rank}

val window = Window.orderBy(col("Ts").desc)

val df1 = df.withColumn("rank", percent_rank().over(window))
  .filter(col("rank") >= 0.2)
  .drop("rank")

val df2 = df.withColumn("rank", percent_rank().over(window))
  .filter(col("rank") < 0.2)
  .drop("rank")
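To see why a 0.2 cutoff gives the 8/2 split: percent_rank for the row at rank r over n rows is (r - 1) / (n - 1), and with Ts ordered descending the newest rows get the smallest values. A plain-Scala sanity check of that arithmetic (not part of the answer):

```scala
// percent_rank over n rows: (rank - 1) / (n - 1)
val n = 10
val pr = (1 to n).map(r => (r - 1).toDouble / (n - 1))
// With Ts descending, pr < 0.2 picks the newest 20% of rows (df2)
// and pr >= 0.2 the remaining 80% (df1).
val newest = pr.count(_ < 0.2)  // 2 rows -> df2
val rest = pr.count(_ >= 0.2)   // 8 rows -> df1
```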

2 Comments

I meant that I want to split the dataframe without changing the order of its rows: the 80% of rows that appear first should go in df1, and the remaining 20% in df2.
Then the second option will work, as your dataset is ordered by Ts. 484's answer above will also work for you.
