
I have a requirement to pivot a table. I have tried it in PySpark SQL but am facing issues.

Store      Week     xCount
-------    ----     ------
102        1        96
101        1        138
105        1        37
109        1        59
101        2        282
102        2        212
105        2        78
109        2        97
105        3        60
102        3        123
101        3        220
109        3        87

I would like it to come out as a pivot table, like this:

Store        1          2          3
-----      ---        ---        ---
101        138        282        220
102         96        212        123
105         37         78         60
109         59         97         87

The code I am using is below:

from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.context import SQLContext

conf = SparkConf().setAppName("FileSystem").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the comma-separated file and build Rows of (store, week, xcount)
a = sc.textFile("C:/Users/ashok.darsi/Desktop/Data1.txt")
b = a.map(lambda x: x.split(",")).map(lambda x: Row(store=int(x[0]), week=int(x[1]), xcount=int(x[2])))
c = sqlContext.createDataFrame(b)

c.registerTempTable("d")

# This line raises the ParseException shown below
e = sqlContext.sql("select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt")

e.show()

The error I am facing:

select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt
---------------------^^^

pyspark.sql.utils.ParseException: "\nmismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 21)\n\n== SQL ==\nselect * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt\n---------------------^^^\n"

1 Answer


PIVOT is not natively supported by HiveQL or Spark SQL (Spark SQL only gained a PIVOT clause in version 2.4).

You can use the DataFrame-specific pivot function here instead:

df.groupBy("store").pivot("week").max("xcount").show()

+-----+---+---+---+
|store|  1|  2|  3|
+-----+---+---+---+
|  101|138|282|220|
|  102| 96|212|123|
|  105| 37| 78| 60|
|  109| 59| 97| 87|
+-----+---+---+---+
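
For completeness, a minimal end-to-end sketch of this approach (assuming Spark 2.x, where SparkSession replaces SQLContext, and the same comma-separated Data1.txt layout as in the question):

from pyspark.sql import SparkSession

# Minimal sketch, assuming Spark 2.x and a comma-separated file of store,week,xcount
spark = SparkSession.builder.appName("FileSystem").master("local").getOrCreate()

df = (spark.read.csv("C:/Users/ashok.darsi/Desktop/Data1.txt")
           .toDF("store", "week", "xcount")
           .selectExpr("cast(store as int) as store",
                       "cast(week as int) as week",
                       "cast(xcount as int) as xcount"))

# Pivot the distinct week values into columns, taking max(xcount) per store/week
df.groupBy("store").pivot("week").max("xcount").show()

If the week values are known up front, passing them explicitly, e.g. pivot("week", [1, 2, 3]), avoids the extra pass over the data that computing the distinct values requires.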

With Spark SQL, you'll have to specify each week value manually and then take the max for each:

select 
  store,
  max(case when week = 1 then xcount else null end) as `1`,
  max(case when week = 2 then xcount else null end) as `2`,
  max(case when week = 3 then xcount else null end) as `3`
from
  d
group by 
  store
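
A sketch of running that query from PySpark, reusing the temp table d registered in the question:

# Assumes the temp table "d" from the question is already registered
e = sqlContext.sql("""
    select
      store,
      max(case when week = 1 then xcount else null end) as `1`,
      max(case when week = 2 then xcount else null end) as `2`,
      max(case when week = 3 then xcount else null end) as `3`
    from d
    group by store
""")
e.show()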

Comments

Thanks for your input. Can you please provide a solution in Spark SQL? @philantrovert
@AshSr Check now.
Yes, it's working great. Just to confirm: in SQL we have a PIVOT function; can't we use the same in Spark SQL?
No, because Spark uses HiveQL-specific syntax.
Thank you @philantrovert
