
Spark SQL nested JSON error:

{
  "xxxDetails":{  
      "yyyData":{  
         "0":{  
            "additionalData":{  

            },
            "quantity":80000,
            "www":12.6,
            "ddd":5.0,
            "eee":72000,
            "rrr":false
         },
         "130":{  
            "additionalData":{  
               "quantity":1
            },
            "quantity":0,
            "www":1.0,
            "ddd":0.0,
            "eee":0,
            "rrr":false
         },
         "yyy":{  
            "additionalData":{  
               "quantity":1
            },
            "quantity":0,
            "www":1.0,
            "ddd":0.0,
            "eee":0,
            "rrr":false
         }       
      }
   },
   "mmmDto":{  
      "id":0,
      "name":"",
      "data":null
   }
 }

Reading with spark.sql("select cast(xxxDetails.yyyData.yyy.additionalData.quantity as Long) as quantity from table") works, but spark.sql("select cast(xxxDetails.yyyData.130.additionalData.quantity as Long) as quantity from table") throws an exception:

org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'cast (xxxDetails.yyyData.130.

When I"m usning datafame API for myDF.select("xxxDetails.yyyData.130.additionalData.quantity") its work . Anyone with decent explanation :)

1 Answer


It's because SQL column names are expected to start with a letter or certain other characters such as _, @ or #, but not a digit. Consider this simple example:

Seq((1, 2)).toDF("x", "666").createOrReplaceTempView("test")

Calling spark.sql("SELECT x FROM test").show() would output

+---+
|  x|
+---+
|  1|
+---+

but calling spark.sql("SELECT 666 FROM test").show() instead outputs

+---+
|666|
+---+
|666|
+---+

because 666 is interpreted as a literal, not a column name. To fix this, the column name needs to be quoted with backticks:

spark.sql("SELECT `666` FROM test").show()
+---+
|666|
+---+
|  2|
+---+
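
Applied to the nested structure from the question, only the numeric path segment needs the backticks (a sketch, assuming the temp view is named table as in the question):

spark.sql("select cast(xxxDetails.yyyData.`130`.additionalData.quantity as Long) as quantity from table").show()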

5 Comments

Hi @ollik1, I apologize, I'm updating my example/question with more details. The error still occurs even when I use: spark.sql("select cast (xxxDetails.'130'.yyy.quantity as Long) as quantity. Again, sorry for the first incomplete example.
This works for me. @ArnonRodman remember it is not a single quote but a backtick, i.e. spark.sql("select cast (xxxDetails.yyyData.`130`.additionalData.quantity as Long) as quantity from table")
Edited the answer to emphasise using the correct quotation characters
Thanks @ollik1 and Richard Nemeth, it worked. Where can I find this in the documentation? And why is the spark.sql API different from the DataFrame API?
Not sure if there is any better documentation than this: issues.apache.org/jira/browse/SPARK-3483, which leads to github.com/apache/spark/pull/2804/files. It does not explain, though, why the backtick was chosen instead of the double quote, which would be standard SQL. The DataFrame API is different because it is explicit from the method signatures that the passed string refers to a column. A SQL string, however, needs to be parsed and analyzed according to certain rules. Note that identifiers starting with a number would also fail in Java, Scala and Python.
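
To illustrate the point about method signatures, a short sketch against the toy table from the answer: select(String) can only ever mean a column, while a SQL string has to survive the lexer first, where a leading digit makes a token a number.

// The DataFrame API resolves the raw string as a column name, no backticks needed
Seq((1, 2)).toDF("x", "666").select("666").show()  // prints 2

// In a SQL string, 666 is tokenized as an integer literal before name
// resolution ever happens, hence the backticks
spark.sql("SELECT `666` FROM test").show()         // prints 2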
