
First of all, I am completely new to Scala and Spark, although I am a bit familiar with PySpark. I am working with an external JSON file which is pretty huge, and I am not allowed to convert it into a Dataset or DataFrame. I have to perform operations on a pure RDD.

So I want to know how I can get the value of a specific key. I read my JSON file with sc.textFile("information.json"). Normally, in Python, I would do

x = sc.textFile("information.json").map(lambda x: json.loads(x))\
    .map(lambda x: (x['name'], x['roll_no'])).collect()

Is there any equivalent of the above code in Scala (extracting the values of specific keys) on an RDD, without converting to a DataFrame or Dataset?

This is essentially the same question as Equivalent pyspark's json.loads function for spark-shell, but I am hoping for a more concrete and noob-friendly answer. Thank you.

JSON data: {"name":"ABC", "roll_no":"12", "Major":"CS"}

  • Can you give an example of your JSON please? Commented Sep 21, 2019 at 15:21
  • Updated with JSON data Commented Sep 21, 2019 at 16:22
  • My answer on how to parse JSON with Scala should help you Commented Sep 21, 2019 at 16:28
  • Is there any specific reason for not using spark.read.json? Then you don't need to do any custom parsing Commented Sep 21, 2019 at 17:51

2 Answers


Option 1: RDD API + json4s lib

One way is to use the json4s library, which Spark already uses internally.

import org.json4s._
import org.json4s.jackson.JsonMethods._

// {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
// {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
// {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
val file_location = "information.json"

val rdd = sc.textFile(file_location)

rdd.map { row =>
  // parse each line into a json4s JValue
  val json_row = parse(row)

  // select fields with the \ operator; compact() renders them back to JSON strings
  (compact(json_row \ "name"), compact(json_row \ "roll_no"))
}.collect().foreach(println)

// Output
// ("ABC1","12")
// ("ABC2","13")
// ("ABC3","14")

First we parse each row into json_row, then we access the fields of the row with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples.
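Note that compact returns the JSON-encoded value, so strings keep their quotes. If you want plain Scala strings instead, json4s can also extract typed values; a minimal sketch (extract requires an implicit Formats in scope):

import org.json4s._
import org.json4s.jackson.JsonMethods._

rdd.map { row =>
  // extract[...] needs an implicit Formats; defining it inside the
  // closure sidesteps serialization issues on the executors
  implicit val formats: Formats = DefaultFormats
  val json_row = parse(row)

  // extract[String] returns the raw value without the JSON quotes
  ((json_row \ "name").extract[String], (json_row \ "roll_no").extract[String])
}.collect().foreach(println)

// Output
// (ABC1,12)
// (ABC2,13)
// (ABC3,14)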

Option 2: dataframe API + get_json_object()

A more straightforward approach is the DataFrame API in combination with the get_json_object() function.

import org.apache.spark.sql.functions.get_json_object
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)

// read each line as a single string column named "value"
val df = spark.read.text(file_location)

df.select(
    get_json_object($"value", "$.name").as("name"),
    get_json_object($"value", "$.roll_no").as("roll_no"))
  .collect()
  .foreach(println)

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]
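
If reading the file with Spark's JSON reader is acceptable (as one of the comments on the question suggests), spark.read.json avoids the manual parsing entirely; a minimal sketch with the same file:

// Spark infers the schema directly from the JSON lines
val df2 = spark.read.json(file_location)

df2.select($"name", $"roll_no")
  .collect()
  .foreach(println)

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]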

2 Comments

Hello there @Max, did the solution above work for you?
It did. Thank you so much

I used to parse JSON in Scala with this kind of method:

/** Example of a method to parse simple JSON of the shape:
  * {
  *   "fields": [
  *     {
  *       "field1": "value",
  *       "field2": "value",
  *       "field3": "value"
  *     }
  *   ]
  * }
  */

import scala.io.Source
import scala.util.parsing.json._ // legacy JSON parser, deprecated in newer Scala versions

case class OutputData(field1: String, field2: String, field3: String)

def singleMapJsonParser(jsonDataFile: String): List[OutputData] = {

  // read the whole file into a single string
  val jsonData: String = Source.fromFile(jsonDataFile).getLines.mkString

  // parseFull returns an Option[Any]; the match below assumes the expected
  // shape (unchecked due to erasure) and .get fails fast on malformed input
  JSON.parseFull(jsonData).map {
    case json: Map[String, List[Map[String, String]]] =>
      json("fields").map(v => OutputData(v("field1"), v("field2"), v("field3")))
  }.get
}

Then you just have to call your SparkContext to transform the List[OutputData] output into an RDD, as shown in the sketch below.
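
A minimal sketch of that last step, assuming an active SparkContext sc (e.g. in spark-shell) and a hypothetical file sample.json with the shape above:

// sample.json is a hypothetical file matching the "fields" shape shown above
val parsed: List[OutputData] = singleMapJsonParser("sample.json")

// parallelize turns the local List into an RDD[OutputData]
val rddOut = sc.parallelize(parsed)

rddOut.map(d => (d.field1, d.field2)).collect().foreach(println)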
