When I perform a simple join on two DataFrames, PySpark returns no output data.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

file_path="C:\\bigdata\\pipesep_data\\Sales_ny.csv"

df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load(file_path)

addData=[(1,"1523 Main St","SFO","CA"),
    (2,"3453 Orange St","SFO","NY"),
    (3,"34 Warner St","Jersey","NJ"),
    (4,"221 Cavalier St","Newark","DE"),
    (5,"789 Walnut St","Sandiago","CA")
  ]
addColumns = ["emp_id","addline1","city","State"]
addDF = spark.createDataFrame(addData,addColumns)
addDF.show()
    
df.join(addDF,df["State"] == addDF["State"]).show()

Sales_ny schema: [image]

Sales_ny.csv: [image]

Output: there is no data in the output; only the column headers of the joined result are shown. I also tried left, right, and full outer joins, among others.

1 Answer

The same code works fine for me on similar sample data:

>>> df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load("/Path to/sample1.csv")
>>> df.show()
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
| OrderID|             Product|Quantity| Price|          OrderDate|      StoreAddres|         City|State|Month|Hour|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
|295665.0|  Macbook Pro Laptop|     1.0|1700.0|2019-12-30 00:01:00|136 Church St, Ne|New York City|  123| 12.0| 0.0|
|295666.0|  LG Washing Machine|     1.0| 600.0|2019-12-29 07:03:00|   562 2nd St, Ne|New York City|   NY| 12.0| 7.0|
|295667.0|USB-C Charging Cable|     1.0| 11.95|2019-12-12 18:21:00| 277 Main St, New|New York City|   NY| 12.0|18.0|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+

>>> addDF.show()
+------+---------------+--------+-----+
|emp_id|       addline1|    city|State|
+------+---------------+--------+-----+
|     1|   1523 Main St|     SFO|   CA|
|     2| 3453 Orange St|     SFO|   NY|
|     3|   34 Warner St|  Jersey|   NJ|
|     4|221 Cavalier St|  Newark|   DE|
|     5|  789 Walnut St|Sandiago|   CA|
+------+---------------+--------+-----+

>>> df.join(addDF,df["State"] == addDF["State"]).show()
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
| OrderID|             Product|Quantity|Price|          OrderDate|     StoreAddres|         City|State|Month|Hour|emp_id|      addline1|city|State|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
|295667.0|USB-C Charging Cable|     1.0|11.95|2019-12-12 18:21:00|277 Main St, New|New York City|   NY| 12.0|18.0|     2|3453 Orange St| SFO|   NY|
|295666.0|  LG Washing Machine|     1.0|600.0|2019-12-29 07:03:00|  562 2nd St, Ne|New York City|   NY| 12.0| 7.0|     2|3453 Orange St| SFO|   NY|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+

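I think the values in your df.State column contain leading or trailing spaces, so they never equal the values in addDF.State. You can trim the column with the code below and then perform the join: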

>>> from pyspark.sql.functions import trim
>>> df = df.withColumn('State', trim(df.State))
>>> df.join(addDF, df["State"] == addDF["State"]).show()
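
To see why stray spaces would produce an empty result, here is a minimal plain-Python sketch of an inner equi-join. It is not Spark code: `inner_join` is a hypothetical helper and the rows are made up, assuming the CSV values look like "NY " with a trailing space. It only illustrates that an exact string comparison treats "NY " and "NY" as different keys.

```python
# Plain-Python stand-in for an inner equi-join; helper and row values are hypothetical.
def inner_join(left, right, key):
    """Pair every left row with every right row whose key matches exactly."""
    return [
        {**lrow, **{f"right_{k}": v for k, v in rrow.items()}}
        for lrow in left
        for rrow in right
        if lrow[key] == rrow[key]
    ]

# Assumed situation: the CSV values carry a trailing space ("NY " instead of "NY").
sales = [
    {"OrderID": 295666, "State": "NY "},
    {"OrderID": 295667, "State": "NY "},
]
addresses = [
    {"emp_id": 1, "State": "CA"},
    {"emp_id": 2, "State": "NY"},
]

untrimmed = inner_join(sales, addresses, "State")

# Strip the key on the left side, mirroring trim(df.State) in the answer.
trimmed_sales = [{**row, "State": row["State"].strip()} for row in sales]
trimmed = inner_join(trimmed_sales, addresses, "State")

print(len(untrimmed))  # 0: "NY " never equals "NY"
print(len(trimmed))    # 2: both orders now match emp_id 2
```

In PySpark, one way to check for this before joining would be to compare the length of the State column with the length of its trimmed value (for example with `length` and `trim` from `pyspark.sql.functions`); any row where the two differ has padding.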