Format csv file with column creation in Spark scala

Question

I have a csv file, as below

It has 6 rows with top row as header, while header read as "Students Marks" dataframe is treating them as one columns, now i want to separate both columns with data. "student" and "marks" are separated by space.

df.show()    
_______________
##Student Marks##    
---------------    
A   10;20;10;20    
A   20;20;30;10    
B   10;10;10;10    
B   20;20;20;10    
B   30;30;30;20

Now i want to transform this csv table into two columns, with student and Marks, Also for every student the marks with add up, something like below

Student | Marks
A       | 30;40;40;30
B       | 60;60;60;40

I have tried with below but it is throwing an error

df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")

koiralo · Accepted Answer · 2019-05-29 11:30:57Z

2

You can read the csv file with the delimiteryou want and calculate result as below

    val df = spark.read
      .option("header", true)
      .option("delimiter", " ")
      .csv("path to csv")

After You get the dataframe df

val resultDF = df.withColumn("split", split($"Marks", ";"))
  .withColumn("a", $"split"(0))
  .withColumn("b", $"split"(1))
  .withColumn("c", $"split"(2))
  .withColumn("d", $"split"(3))
  .groupBy("Student")
  .agg(concat_ws(";", array(
     Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*)
  ).as("Marks"))


resultDF.show(false)

Output:

+-------+-------------------+
|Student|Marks              |
+-------+-------------------+
|B      |60.0;60.0;60.0;40.0|
|A      |30.0;40.0;40.0;30.0|
+-------+-------------------+

edited May 29, 2019 at 11:30

answered May 29, 2019 at 10:36

koiralo

23.2k6 gold badges57 silver badges77 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Pablo López Gallego · Accepted Answer · 2019-05-29 09:39:57Z

1

Three Ideas. The first one is to read the file, split it by space and then create the dataFrame:

val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", " ")
  .load("your_file.csv")

The second one is to read the file to dataframe and split it:

df.withColumn("Student", split($"Students Marks"," ").getItem(0))
  .withColumn("Marks", split($"Students Marks"," ").getItem(1))
  .drop("Students Marks")

The last one is your solution. It should work, but when you use the select, you don't use $"_tmp", therefore, it should work without the .drop("_tmp")

df.withColumn("_tmp", split($"Students Marks"," "))
  .select($"_tmp".getItem(0).as("Student"),$"_tmp".getItem(1).as("Marks"))

edited May 29, 2019 at 9:39

answered May 29, 2019 at 8:58

Pablo López Gallego

7104 silver badges17 bronze badges

9 Comments

Sam Over a year ago

org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns : [Student Marks];;

Pablo López Gallego Over a year ago

I edited my response, can you use sep instead delimiter too?

Pablo López Gallego Over a year ago

val df = spark.read.option("delimiter", " ").option("header", "true").csv("path/to/file") if this doen't work, can I see your error?

Sam Over a year ago

not an error here but on second line, it is throwing error org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns : [Student Marks];; .

Pablo López Gallego Over a year ago

I edited my response, and change 'value' with "Students Marks", try one of the three new options

|

Collectives™ on Stack Overflow

Format csv file with column creation in Spark scala

2 Answers 2

Comments

9 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

9 Comments

Your Answer

Sign up or log in

Post as a guest

Related