I'm trying to read an in-memory JSON string into a Spark DataFrame on the fly:

val someJSON: String = getJSONSomehow()
val someDF: DataFrame = magic.convert(someJSON)

I've spent quite a bit of time looking at the Spark API, and the best I can find is to use a sqlContext like so:

import java.util.UUID
import scalax.io._

val someJSON: String = getJSONSomehow()
val tmpPath = s"/tmp/json/${UUID.randomUUID().toString}"
val tmpFile: Output = Resource.fromFile(tmpPath)
tmpFile.write(someJSON)(Codec.UTF8)                    // write the actual JSON, not a placeholder
val someDF: DataFrame = sqlContext.read.json(tmpPath)  // json() takes a path, not an Output

But this feels kind of awkward/wonky and imposes the following constraints:

  1. It requires me to format my JSON as one object per line (per the documentation); and
  2. It forces me to write the JSON to a temp file, which is slow and awkward; and
  3. It forces me to clean up temp files over time, which is cumbersome and feels "wrong" to me.

So I ask: Is there a direct and more efficient way to convert a JSON string into a Spark DataFrame?

1 Answer

From the Spark SQL guide:

// Wrap the JSON string in a one-element RDD; spark.read.json accepts an RDD[String].
val otherPeopleRDD = spark.sparkContext.makeRDD(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

This creates the DataFrame from an intermediate RDD[String], built by wrapping the in-memory String in a one-element list, so nothing is ever written to disk.
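
Note that since Spark 2.2 the RDD[String] overload of json is deprecated in favor of Dataset[String], so the same trick works without the intermediate RDD. A minimal sketch, assuming a local SparkSession (the app name, master, and sample JSON are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("json-from-string")  // illustrative app name
  .master("local[*]")           // illustrative; omit on a real cluster
  .getOrCreate()
import spark.implicits._        // brings .toDS into scope for Seq[String]

val someJSON: String =
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""

// Wrap the in-memory string in a Dataset[String] and parse it directly;
// no temp file and no intermediate RDD needed.
val someDF: DataFrame = spark.read.json(Seq(someJSON).toDS)
someDF.show()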

1 Comment

The very good thing is that you can use this to filter out bad lines before parsing, e.g. sqlContext.read.json(sc.textFile("...").filter(...)).
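
A minimal sketch of that pre-filtering idea, using the same sc/sqlContext as the question (the path and the predicate are illustrative):

// Hypothetical file of line-delimited JSON: drop lines that would fail to
// parse (blank lines here) before handing the RDD to the JSON reader.
val rawLines = sc.textFile("/tmp/json/events.json")          // illustrative path
val parsable = rawLines.filter(line => line.trim.nonEmpty)   // illustrative predicate
val eventsDF = sqlContext.read.json(parsable)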
