
I am trying to select every warehouse code from the Warehouses and Boxes tables such that the warehouse's capacity is smaller than its count of boxes.

This SQL query works in PostgreSQL:

select w.code
from Warehouses w
join Boxes b
on w.code = b.warehouse
group by w.code
having count(b.code) > w.capacity

But the same query does not work in PySpark:

spark.sql("""
select w.code
from Warehouses w
join Boxes b
on w.code = b.warehouse
group by w.code
having count(b.code) > w.capacity

""").show()

How can I fix the code?

SETUP

import numpy as np
import pandas as pd


# pyspark
import pyspark
from pyspark.sql import functions as F 
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext, SQLContext


spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
sc.setLogLevel("INFO")


# warehouse
dfw = pd.DataFrame({'code': [1, 2, 3, 4, 5],
          'location': ['Chicago', 'Chicago', 'New York', 'Los Angeles', 'San Francisco'],
          'capacity': [3, 4, 7, 2, 8]})

schema = StructType([
    StructField('code',IntegerType(),True),
    StructField('location',StringType(),True),
    StructField('capacity',IntegerType(),True),
    ])

sdfw = sqlContext.createDataFrame(dfw, schema)
sdfw.createOrReplaceTempView("Warehouses")


# box
dfb = pd.DataFrame({'code': ['0MN7', '4H8P', '4RT3', '7G3H', '8JN6', '8Y6U', '9J6F', 'LL08', 'P0H6', 'P2T6', 'TU55'],
          'contents': ['Rocks', 'Rocks', 'Scissors', 'Rocks', 'Papers', 'Papers', 'Papers', 'Rocks', 'Scissors', 'Scissors', 'Papers'],
          'value': [180.0, 250.0, 190.0, 200.0, 75.0, 50.0, 175.0, 140.0, 125.0, 150.0, 90.0],
          'warehouse': [3, 1, 4, 1, 1, 3, 2, 4, 1, 2, 5]})

schema = StructType([
    StructField('code',StringType(),True),
    StructField('contents',StringType(),True),
    StructField('value',FloatType(),True),
    StructField('warehouse',IntegerType(),True),

    ])

sdfb = sqlContext.createDataFrame(dfb, schema)
sdfb.createOrReplaceTempView("Boxes")

spark.sql("""
select w.code
from Warehouses w
join Boxes b
on w.code = b.warehouse
group by w.code
having count(b.code) > w.capacity

""").show()

2 Answers


Try this. The error occurs because Spark cannot reference w.capacity in the HAVING clause when it is neither a grouping column nor wrapped in an aggregate function. Wrapping it in first() should do that for you:

spark.sql("""
select w.code
from Warehouses w
join Boxes b
on w.code = b.warehouse
group by w.code
having count(b.code) > first(w.capacity)

""").show()
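An alternative fix is to add w.capacity to the GROUP BY (it is constant per warehouse code, so the result is the same). Either way, the expected answer can be sanity-checked without Spark; a minimal pandas sketch using the same data as the question's setup:

```python
import pandas as pd

# same data as in the question's setup (only the relevant columns)
dfw = pd.DataFrame({'code': [1, 2, 3, 4, 5],
                    'capacity': [3, 4, 7, 2, 8]})
dfb = pd.DataFrame({'warehouse': [3, 1, 4, 1, 1, 3, 2, 4, 1, 2, 5]})

# count boxes per warehouse, then compare against capacity
counts = dfb.groupby('warehouse').size().rename('n_boxes')
merged = dfw.set_index('code').join(counts)
overfull = merged.index[merged['n_boxes'] > merged['capacity']].tolist()
print(overfull)  # → [1]
```

Warehouse 1 has capacity 3 but holds 4 boxes, so it is the only code returned.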



How to fix the code?

The problem may not be with your code at all.

Check which Java JDK version you are using. In my experience, spark.sql(...).show() is not compatible with JDK 11. If you are on that version, downgrade to JDK 8 (and also point the JAVA_HOME environment variable correctly at the JDK 8 install).
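To check which JDK PySpark will pick up (the JAVA_HOME path below is only an example; adjust it to wherever your JDK 8 is installed):

```shell
java -version                                        # should report 1.8.x, not 11
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # hypothetical path
export PATH="$JAVA_HOME/bin:$PATH"
```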

