
I have a requirement to pivot a table. I have tried it in PySpark SQL but am facing issues.

Store      Week     xCount
-------    ----     ------
102        1        96
101        1        138
105        1        37
109        1        59
101        2        282
102        2        212
105        2        78
109        2        97
105        3        60
102        3        123
101        3        220
109        3        87

I would like it to come out as a pivot table, like this:

Store        1          2          3
-----      ---        ---        ---
101        138        282        220
102         96        212        123
105         37         78         60
109         59         97         87

The code I am using is below:

from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.context import SQLContext

conf = SparkConf().setAppName("FileSystem").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the comma-separated file and build Rows of (store, week, xcount)
a = sc.textFile("C:/Users/ashok.darsi/Desktop/Data1.txt")
b = a.map(lambda x: x.split(",")).map(lambda x: Row(store=int(x[0]), week=int(x[1]), xcount=int(x[2])))
c = sqlContext.createDataFrame(b)

c.registerTempTable("d")

# This line raises the ParseException shown below
e = sqlContext.sql("select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt")

e.show()

The error I am facing:

select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt
---------------------^^^

pyspark.sql.utils.ParseException: "\nmismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 21)\n\n== SQL ==\nselect * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt\n---------------------^^^\n"

1 Answer


PIVOT is not natively supported by HiveQL or Spark SQL (Spark SQL only gained a PIVOT clause in version 2.4).

You can use the DataFrame-specific pivot function here instead:

df.groupBy("store").pivot("week").max("xcount").show()

+-----+---+---+---+
|store|  1|  2|  3|
+-----+---+---+---+
|  101|138|282|220|
|  102| 96|212|123|
|  105| 37| 78| 60|
|  109| 59| 97| 87|
+-----+---+---+---+
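
For completeness, a minimal end-to-end sketch of this approach (assuming Spark 2.x, where SparkSession replaces SQLContext, and the same comma-separated Data1.txt layout as in the question):

from pyspark.sql import SparkSession

# Minimal sketch, assuming Spark 2.x and a comma-separated file of store,week,xcount
spark = SparkSession.builder.appName("FileSystem").master("local").getOrCreate()

df = (spark.read.csv("C:/Users/ashok.darsi/Desktop/Data1.txt")
           .toDF("store", "week", "xcount")
           .selectExpr("cast(store as int) as store",
                       "cast(week as int) as week",
                       "cast(xcount as int) as xcount"))

# Pivot the distinct week values into columns, taking max(xcount) per store/week
df.groupBy("store").pivot("week").max("xcount").show()

If the week values are known up front, passing them explicitly, e.g. pivot("week", [1, 2, 3]), avoids the extra pass over the data that computing the distinct values requires.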

With Spark SQL, you'll have to specify each week value manually and then take the max for each:

select 
  store,
  max(case when week = 1 then xcount else null end) as `1`,
  max(case when week = 2 then xcount else null end) as `2`,
  max(case when week = 3 then xcount else null end) as `3`
from
  d
group by 
  store
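
A sketch of running that query from PySpark, reusing the temp table d registered in the question:

# Assumes the temp table "d" from the question is already registered
e = sqlContext.sql("""
    select
      store,
      max(case when week = 1 then xcount else null end) as `1`,
      max(case when week = 2 then xcount else null end) as `2`,
      max(case when week = 3 then xcount else null end) as `3`
    from d
    group by store
""")
e.show()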

Comments

Thanks for your input. Can you please provide a solution in Spark SQL? @philantrovert
@AshSr Check now.
Yes, it's working great. Just to confirm: in SQL we have a PIVOT function; can't we use the same in Spark SQL?
No, because Spark uses HiveQL-specific syntax.
Thank you @philantrovert
