
Case 1: When I try to fetch "b.no" I get an error; the code and error message are shared below. How can I get values from the second dataframe (aliased as b)? Is selecting values from b even allowed here? If I remove b.no it works fine.

df1.csv:

no,name,sal
1,sri,3000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4000

df2.csv:

no,name,sal
1,sri,3000
1,vas,4000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4500
5,toy,4200
5,koy,4999
6,jim,3090
7,kim,2080

code:

from pyspark.shell import spark
from pyspark.sql import SQLContext

sc = spark.sparkContext
sqlContext = SQLContext(sc)  # created but never used below

df11 = spark.read.option("header", "true").option("delimiter", ",").csv("C:\\inputs\\df1.csv")
df22 = spark.read.option("header", "true").option("delimiter", ",").csv("C:\\inputs\\df2.csv")
print("df11", df11.count())
print("df22", df22.count())

# selecting "b.no" here raises the AnalysisException shown below
resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.no")
print("resDF", resDF.count())
print("resDF", resDF.distinct().show())

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling o48.select. :
org.apache.spark.sql.AnalysisException: cannot resolve 'b.no' given input columns:
[b.sal, a.no, b.name, a.sal, a.name];;
'Project [no#10, name#11, 'b.no]
+- AnalysisBarrier
   +- Project [no#10, name#11, sal#12, name#27, sal#28]
      +- Join Inner, (no#10 = no#26)
         :- SubqueryAlias a
         :  +- Relation[no#10,name#11,sal#12] csv
         +- SubqueryAlias b
            +- Relation[no#26,name#27,sal#28] csv

Case 2: When I use b.sal I get duplicate values, and distinct() is not filtering them out.

resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.sal")
print("resDF", resDF.distinct().show())

In this case, how do I get distinct values based on 'no' only?

Could you please add some more details about the df1 structure? Your question is not well written. Please put the tables for df1 and df2, with the first row as columns and the rest as data. – Commented Jan 28, 2019 at 13:42

1 Answer


The problem in Case 1 is that when you use a string (or array type) as the join argument, Spark will only add a.no and not b.no, to avoid duplicate columns after the join (see the link for more information). You can avoid this by defining a join expression like F.col('a.no') == F.col('b.no'). See the full example below:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

columns1 = ['no', 'name', 'sal']
columns2 = ['no', 'name', 'sal']

vals1 = [(1, 'sri', 3000), (2, 'ram', 2000), (3, 'sam', 2500), (4, 'kri', 5000), (5, 'tom', 4000)]

vals2 = [(1, 'sri', 3000), (1, 'vas', 4000), (2, 'ram', 2000), (3, 'sam', 2500), (4, 'kri', 5000),
         (5, 'tom', 4500), (5, 'toy', 4200), (5, 'koy', 4999), (6, 'jim', 3090), (7, 'kim', 2080)]

df1 = spark.createDataFrame(vals1, columns1)
df2 = spark.createDataFrame(vals2, columns2)

# here I use a join expression instead of a string, so b.no stays resolvable
resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.no")
resDF.show()

Output:

+---+----+---+
| no|name| no|
+---+----+---+
|  1| sri|  1|
|  1| sri|  1|
|  2| ram|  2|
|  3| sam|  3|
|  4| kri|  4|
|  5| tom|  5|
|  5| tom|  5|
|  5| tom|  5|
+---+----+---+
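
To see the deduplication behaviour directly, you can compare the schemas of both join styles (a quick sketch using the df1/df2 defined above; the long types come from createDataFrame inferring Python ints):

# string join: Spark deduplicates the join column, only one 'no' remains
df1.alias("a").join(df2.alias("b"), on='no').printSchema()
# root
#  |-- no: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)

# expression join: both a.no and b.no are kept
df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).printSchema()
# root
#  |-- no: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)
#  |-- no: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)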

For your Case 2: the dataframe distinct method compares whole rows. When you only need the unique values of one column, you have to perform a select first:

resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.sal")
resDF.select('no').distinct().show()
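
If you instead want to keep whole rows but deduplicate on 'no' only, a minimal sketch using the standard DataFrame.dropDuplicates API with a subset (not part of the original answer) would be:

# keep one row per distinct 'no'; which of the duplicate rows survives is arbitrary
resDF.dropDuplicates(['no']).show()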

5 Comments

So the dataframes df11 and df22 each have only 3 columns (no, name, sal)? I have updated my answer.
One last question: using F.col and on='no' both give the same result, so I did not understand the difference here exactly. And if I want to test not-equal in the condition F.col('a.no') == F.col('b.no'), how do I modify it? Thanks for your time.
No, a join expression will not give the same result as a string join argument. A dataframe generated with a string join on column 'no' will only have one 'no' column (a.no in your case). When you use a join expression, the generated dataframe will have two 'no' columns (a.no and b.no in your case). Just try it out: df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).printSchema() and df1.alias("a").join(df2.alias("b"), on='no').printSchema()
That's great, clarified now. Thank you so much. :-)
@cronoik, I have a similar kind of issue: stackoverflow.com/questions/72963925/… Any clue how to fix this?
