1

I have a csv file, as below

It has 6 rows with top row as header, while header read as "Students Marks" dataframe is treating them as one columns, now i want to separate both columns with data. "student" and "marks" are separated by space.

df.show()    
_______________
##Student Marks##    
---------------    
A   10;20;10;20    
A   20;20;30;10    
B   10;10;10;10    
B   20;20;20;10    
B   30;30;30;20

Now i want to transform this csv table into two columns, with student and Marks, Also for every student the marks with add up, something like below

Student | Marks
A       | 30;40;40;30
B       | 60;60;60;40

I have tried with below but it is throwing an error

df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")

2 Answers 2

2

You can read the csv file with the delimiteryou want and calculate result as below

    val df = spark.read
      .option("header", true)
      .option("delimiter", " ")
      .csv("path to csv")

After You get the dataframe df

val resultDF = df.withColumn("split", split($"Marks", ";"))
  .withColumn("a", $"split"(0))
  .withColumn("b", $"split"(1))
  .withColumn("c", $"split"(2))
  .withColumn("d", $"split"(3))
  .groupBy("Student")
  .agg(concat_ws(";", array(
     Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*)
  ).as("Marks"))


resultDF.show(false)

Output:

+-------+-------------------+
|Student|Marks              |
+-------+-------------------+
|B      |60.0;60.0;60.0;40.0|
|A      |30.0;40.0;40.0;30.0|
+-------+-------------------+
Sign up to request clarification or add additional context in comments.

Comments

1

Three Ideas. The first one is to read the file, split it by space and then create the dataFrame:

val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", " ")
  .load("your_file.csv")

The second one is to read the file to dataframe and split it:

df.withColumn("Student", split($"Students Marks"," ").getItem(0))
  .withColumn("Marks", split($"Students Marks"," ").getItem(1))
  .drop("Students Marks")

The last one is your solution. It should work, but when you use the select, you don't use $"_tmp", therefore, it should work without the .drop("_tmp")

df.withColumn("_tmp", split($"Students Marks"," "))
  .select($"_tmp".getItem(0).as("Student"),$"_tmp".getItem(1).as("Marks"))

9 Comments

org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns : [Student Marks];;
I edited my response, can you use sep instead delimiter too?
val df = spark.read.option("delimiter", " ").option("header", "true").csv("path/to/file") if this doen't work, can I see your error?
not an error here but on second line, it is throwing error org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns : [Student Marks];; .
I edited my response, and change 'value' with "Students Marks", try one of the three new options
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.