
First of all, I am completely new to Scala and Spark, although I am a bit familiar with PySpark. I am working with an external JSON file which is pretty huge, and I am not allowed to convert it into a Dataset or DataFrame. I have to perform operations on a pure RDD.

So I want to know how I can get the value of a specific key. I read my JSON file with sc.textFile("information.json"). Normally, in Python, I would do

x = sc.textFile("information.json").map(lambda x: json.loads(x))\
    .map(lambda x: (x['name'], x['roll_no'])).collect()

Is there any equivalent of the above code in Scala (extracting the values of specific keys) on an RDD, without converting to a DataFrame or Dataset?

This is essentially the same question as Equivalent pyspark's json.loads function for spark-shell, but I am hoping for a more concrete and noob-friendly answer. Thank you.

JSON data: {"name":"ABC", "roll_no":"12", "Major":"CS"}

  • Can you give an example of your JSON please? Commented Sep 21, 2019 at 15:21
  • Updated with JSON data Commented Sep 21, 2019 at 16:22
  • My answer on how to parse JSON with Scala should help you Commented Sep 21, 2019 at 16:28
  • Is there any specific reason for not using spark.read.json? Then you don't need to do any custom parsing Commented Sep 21, 2019 at 17:51

2 Answers


Option 1: RDD API + json4s lib

One way is to use the json4s library, which Spark already uses internally.

import org.json4s._
import org.json4s.jackson.JsonMethods._

// {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
// {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
// {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
val file_location = "information.json"

val rdd = sc.textFile(file_location)

rdd.map { row =>
  // parse each line into a json4s JValue
  val json_row = parse(row)

  // select fields with the \ operator; compact() renders them back to JSON strings
  (compact(json_row \ "name"), compact(json_row \ "roll_no"))
}.collect().foreach(println)

// Output
// ("ABC1","12")
// ("ABC2","13")
// ("ABC3","14")

First we parse each row into json_row, then we access the fields of the row with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples.
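Note that compact returns the JSON-encoded value, so strings keep their quotes. If you want plain Scala strings instead, json4s can also extract typed values; a minimal sketch (extract requires an implicit Formats in scope):

import org.json4s._
import org.json4s.jackson.JsonMethods._

rdd.map { row =>
  // extract[...] needs an implicit Formats; defining it inside the
  // closure sidesteps serialization issues on the executors
  implicit val formats: Formats = DefaultFormats
  val json_row = parse(row)

  // extract[String] returns the raw value without the JSON quotes
  ((json_row \ "name").extract[String], (json_row \ "roll_no").extract[String])
}.collect().foreach(println)

// Output
// (ABC1,12)
// (ABC2,13)
// (ABC3,14)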

Option 2: dataframe API + get_json_object()

A more straightforward approach is the DataFrame API in combination with the get_json_object() function.

import org.apache.spark.sql.functions.get_json_object
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)

// read each line as a single string column named "value"
val df = spark.read.text(file_location)

df.select(
    get_json_object($"value", "$.name").as("name"),
    get_json_object($"value", "$.roll_no").as("roll_no"))
  .collect()
  .foreach(println)

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]
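
If reading the file with Spark's JSON reader is acceptable (as one of the comments on the question suggests), spark.read.json avoids the manual parsing entirely; a minimal sketch with the same file:

// Spark infers the schema directly from the JSON lines
val df2 = spark.read.json(file_location)

df2.select($"name", $"roll_no")
  .collect()
  .foreach(println)

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]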

2 Comments

Hello there @Max, did the solution above work for you?
It did. Thank you so much

I used to parse JSON in Scala with this kind of method:

/** Example of a method to parse simple JSON of the shape:
  * {
  *   "fields": [
  *     {
  *       "field1": "value",
  *       "field2": "value",
  *       "field3": "value"
  *     }
  *   ]
  * }
  */

import scala.io.Source
import scala.util.parsing.json._ // legacy JSON parser, deprecated in newer Scala versions

case class OutputData(field1: String, field2: String, field3: String)

def singleMapJsonParser(jsonDataFile: String): List[OutputData] = {

  // read the whole file into a single string
  val jsonData: String = Source.fromFile(jsonDataFile).getLines.mkString

  // parseFull returns an Option[Any]; the match below assumes the expected
  // shape (unchecked due to erasure) and .get fails fast on malformed input
  JSON.parseFull(jsonData).map {
    case json: Map[String, List[Map[String, String]]] =>
      json("fields").map(v => OutputData(v("field1"), v("field2"), v("field3")))
  }.get
}

Then you just have to call your SparkContext to transform the List[OutputData] output into an RDD, as shown in the sketch below.
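
A minimal sketch of that last step, assuming an active SparkContext sc (e.g. in spark-shell) and a hypothetical file sample.json with the shape above:

// sample.json is a hypothetical file matching the "fields" shape shown above
val parsed: List[OutputData] = singleMapJsonParser("sample.json")

// parallelize turns the local List into an RDD[OutputData]
val rddOut = sc.parallelize(parsed)

rddOut.map(d => (d.field1, d.field2)).collect().foreach(println)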
