
I created an RDD[String] in which each String element contains multiple JSON strings, but all of these JSON strings have the same schema across the whole RDD.

For example:

An RDD[String] called rdd contains the following entries.

String 1:

{"data":"abc", "field1":"def"}
{"data":"123", "field1":"degf"}
{"data":"87j", "field1":"hzc"}
{"data":"efs", "field1":"ssaf"}

String 2:

{"data":"fsg", "field1":"agas"}
{"data":"sgs", "field1":"agg"}
{"data":"sdg", "field1":"agads"}

My goal is to convert this RDD[String] into a DataFrame. If I just do it this way:

val df = rdd.toDF()

..., then it does not work correctly: df.count() gives me 2 instead of 7 for the example above, because each element of the RDD holds a whole batch of JSON strings, so they are not recognized individually.

How can I create a DataFrame so that each row corresponds to a single JSON string?

  • You can use flatMap over your first RDD[String] so that each JSON string becomes its own row in a new RDD[String]. Commented May 12, 2017 at 15:21
  • @RameshMaharjan: Could you show it in the answer? Commented May 12, 2017 at 16:32
  • @RameshMaharjan: I get Char if I do flatMap. Commented May 12, 2017 at 16:34
  • If you have valid JSON you can read it directly, as in val data = spark.read.json(input) Commented May 12, 2017 at 17:30

2 Answers


I can't check it right now, but I think this should work:

// split each string by newline character
val splitted: RDD[Array[String]] = rdd.map(_.split("\n"))

// flatten
val jsonRdd: RDD[String] = splitted.flatMap(identity)
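
From there, toDF() alone would still give you a single string column; to get one column per JSON field, you can let Spark parse the JSON itself. This is a sketch of the approach the asker ultimately used (see the comments on the next answer), assuming a Spark 1.x sqlContext as in the question:

// parse each JSON string into a row with one column per field;
// the schema ("data", "field1") is inferred from the JSON keys
val df = sqlContext.read.json(jsonRdd)
df.printSchema() // root |-- data: string |-- field1: string
df.count()       // 7 for the example above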

4 Comments

What is identity?
en.wikipedia.org/wiki/Identity_function. Basically, flatMap(identity) means flatMap(x => x).
Hmm, jsonRdd.toDF().printSchema() gives me root |-- _1: string (nullable = true)
In fact, when I run jsonRdd.toDF().foreach(f => println(f)), I get [{"data":"abc", "field1":"def"}] [{"data":"123", "field1":"degf"}] .... It looks like each whole string became a single-column row; I now need each JSON field to be converted into its own DataFrame column.

Following the information you've provided in your question, the following can be a solution:

import sqlContext.implicits._

// two input strings, each containing several newline-separated JSON objects
val str1 = "{\"data\":\"abc\", \"field1\":\"def\"}\n{\"data\":\"123\", \"field1\":\"degf\"}\n{\"data\":\"87j\", \"field1\":\"hzc\"}\n{\"data\":\"efs\", \"field1\":\"ssaf\"}"
val str2 = "{\"data\":\"fsg\", \"field1\":\"agas\"}\n{\"data\":\"sgs\", \"field1\":\"agg\"}\n{\"data\":\"sdg\", \"field1\":\"agads\"}"
val input = Seq(str1, str2)

// split each element into individual JSON lines, then extract the two values
// by splitting on "," and ":" and stripping the remaining non-word characters
val rddData = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => (array(0).split(":")(1).trim.replaceAll("\\W", ""), array(1).split(":")(1).trim.replaceAll("\\W", "")))
rddData.toDF("data", "field1").show
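
For the two example strings above, this should print something like:

+----+------+
|data|field1|
+----+------+
| abc|   def|
| 123|  degf|
| 87j|   hzc|
| efs|  ssaf|
| fsg|  agas|
| sgs|   agg|
| sdg| agads|
+----+------+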


Edited
You can omit the field names and just use .toDF(), but that would give default column names (like _1, _2 for a tuple).
Instead, you can create a schema and build the DataFrame from Rows as below (you can add more fields):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// same parsing as above, but build generic Rows instead of tuples
val rddData = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => Row.fromSeq(Seq(array(0).split(":")(1).trim.replaceAll("\\W", ""), array(1).split(":")(1).trim.replaceAll("\\W", ""))))

// explicit schema: one StructField per column
val schema = StructType(Array(StructField("data", StringType, true),
  StructField("field1", StringType, true)))

sqlContext.createDataFrame(rddData, schema).show
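
Because the schema is an ordinary Scala value, it can also be generated programmatically when there are many fields, which is relevant to the 100-column case raised in the comments below. A minimal sketch, where fieldNames is a hypothetical list you would supply yourself:

// hypothetical: generate one StructField per known field name
val fieldNames = Seq("data", "field1") // extend to all of your column names
val wideSchema = StructType(fieldNames.map(name => StructField(name, StringType, true)))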

Or
You can create a Dataset directly, but you will need a case class (you can add more fields), as below:

// parse each line into an instance of the case class, then convert to a Dataset
val dataSet = sc.parallelize(input).flatMap(_.split("\n"))
  .map(line => line.split(","))
  .map(array => Dinasaurius(array(0).split(":")(1).trim.replaceAll("\\W", ""),
    array(1).split(":")(1).trim.replaceAll("\\W", ""))).toDS

dataSet.show

The case class for the above Dataset is below; define it at the top level (outside any method or block) so that Spark can derive an encoder for it:

case class Dinasaurius(data: String,
                       field1: String)

I hope I answered all your questions.

5 Comments

Is it mandatory to pass column names "data", "field1"? I have 100 columns in my real data set.
While it seems to work, it is tailored to 2 columns (fields). In my real data I have 100 columns (fields). How can I adapt your solution?
@Dinosaurius, I have answered all of your questions and points of confusion.
Indeed I managed to solve the problem as follows, based on your ideas, but without a case class, and it works: val jsonStrings: RDD[String] = sc.parallelize(input).map(_.split("\n")).flatMap(x => x); val result = sqlContext.read.json(jsonStrings); var df = result.toDF()
@Dinosaurius, I am happy to see that my post helped you :)
