Pyspark join returning no data in output

Question

When I perform a simple join on two DataFrames, PySpark returns no rows in the output.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

file_path="C:\\bigdata\\pipesep_data\\Sales_ny.csv"

df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load(file_path)

addData=[(1,"1523 Main St","SFO","CA"),
    (2,"3453 Orange St","SFO","NY"),
    (3,"34 Warner St","Jersey","NJ"),
    (4,"221 Cavalier St","Newark","DE"),
    (5,"789 Walnut St","Sandiago","CA")
  ]
addColumns = ["emp_id","addline1","city","State"]
addDF = spark.createDataFrame(addData,addColumns)
addDF.show()
    
df.join(addDF,df["State"] == addDF["State"]).show()

Sales_ny schema

Sales_ny.csv

Output: the result contains only the joined column headers and no rows. I also tried left, right, full outer, and other join types.

Sachin Tiwari · Accepted Answer · 2022-06-06 10:52:16Z

It works fine for me:

>>> df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load("/Path to/sample1.csv")
>>> df.show()
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
| OrderID|             Product|Quantity| Price|          OrderDate|      StoreAddres|         City|State|Month|Hour|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
|295665.0|  Macbook Pro Laptop|     1.0|1700.0|2019-12-30 00:01:00|136 Church St, Ne|New York City|  123| 12.0| 0.0|
|295666.0|  LG Washing Machine|     1.0| 600.0|2019-12-29 07:03:00|   562 2nd St, Ne|New York City|   NY| 12.0| 7.0|
|295667.0|USB-C Charging Cable|     1.0| 11.95|2019-12-12 18:21:00| 277 Main St, New|New York City|   NY| 12.0|18.0|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+

>>> addDF.show()
+------+---------------+--------+-----+
|emp_id|       addline1|    city|State|
+------+---------------+--------+-----+
|     1|   1523 Main St|     SFO|   CA|
|     2| 3453 Orange St|     SFO|   NY|
|     3|   34 Warner St|  Jersey|   NJ|
|     4|221 Cavalier St|  Newark|   DE|
|     5|  789 Walnut St|Sandiago|   CA|
+------+---------------+--------+-----+

>>> df.join(addDF,df["State"] == addDF["State"]).show()
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
| OrderID|             Product|Quantity|Price|          OrderDate|     StoreAddres|         City|State|Month|Hour|emp_id|      addline1|city|State|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
|295667.0|USB-C Charging Cable|     1.0|11.95|2019-12-12 18:21:00|277 Main St, New|New York City|   NY| 12.0|18.0|     2|3453 Orange St| SFO|   NY|
|295666.0|  LG Washing Machine|     1.0|600.0|2019-12-29 07:03:00|  562 2nd St, Ne|New York City|   NY| 12.0| 7.0|     2|3453 Orange St| SFO|   NY|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+

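I think the values in your df.State column contain leading or trailing spaces, so they never equal the values in addDF.State. You can use the code below to trim the spaces and then perform the join: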

>>> from pyspark.sql.functions import trim
>>> df=df.withColumn('State',trim(df.State))
