
Case 1: When I try to fetch "b.no" I get an error; the code and error message are shared below. How can I get values from the second dataframe (aliased as b)? Is selecting values from b even allowed here? If I remove b.no it works fine.

df1.csv:

no,name,sal
1,sri,3000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4000

df2.csv:

no,name,sal
1,sri,3000
1,vas,4000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4500
5,toy,4200
5,koy,4999
6,jim,3090
7,kim,2080

code:

from pyspark.shell import spark
from pyspark.sql import SQLContext

sc = spark.sparkContext
sqlContext = SQLContext(sc)  # created but never used below

df11 = spark.read.option("header", "true").option("delimiter", ",").csv("C:\\inputs\\df1.csv")
df22 = spark.read.option("header", "true").option("delimiter", ",").csv("C:\\inputs\\df2.csv")
print("df11", df11.count())
print("df22", df22.count())

# selecting "b.no" here raises the AnalysisException shown below
resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.no")
print("resDF", resDF.count())
print("resDF", resDF.distinct().show())

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling o48.select. :
org.apache.spark.sql.AnalysisException: cannot resolve 'b.no' given input columns:
[b.sal, a.no, b.name, a.sal, a.name];;
'Project [no#10, name#11, 'b.no]
+- AnalysisBarrier
   +- Project [no#10, name#11, sal#12, name#27, sal#28]
      +- Join Inner, (no#10 = no#26)
         :- SubqueryAlias a
         :  +- Relation[no#10,name#11,sal#12] csv
         +- SubqueryAlias b
            +- Relation[no#26,name#27,sal#28] csv

Case 2: When I use b.sal I get duplicate values, and distinct() is not filtering them out.

resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.sal")
print("resDF", resDF.distinct().show())

In this case, how do I get distinct values based on 'no' only?

Could you please add some more details about the df1 structure? Your question is not well written. Please put the tables for df1 and df2, with the first row as columns and the rest as data. – Commented Jan 28, 2019 at 13:42

1 Answer


The problem in Case 1 is that when you use a string (or array type) as the join argument, Spark will only add a.no and not b.no, to avoid duplicate columns after the join (see the link for more information). You can avoid this by defining a join expression like F.col('a.no') == F.col('b.no'). See the full example below:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

columns1 = ['no', 'name', 'sal']
columns2 = ['no', 'name', 'sal']

vals1 = [(1, 'sri', 3000), (2, 'ram', 2000), (3, 'sam', 2500), (4, 'kri', 5000), (5, 'tom', 4000)]

vals2 = [(1, 'sri', 3000), (1, 'vas', 4000), (2, 'ram', 2000), (3, 'sam', 2500), (4, 'kri', 5000),
         (5, 'tom', 4500), (5, 'toy', 4200), (5, 'koy', 4999), (6, 'jim', 3090), (7, 'kim', 2080)]

df1 = spark.createDataFrame(vals1, columns1)
df2 = spark.createDataFrame(vals2, columns2)

# here I use a join expression instead of a string, so b.no stays resolvable
resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.no")
resDF.show()

Output:

+---+----+---+
| no|name| no|
+---+----+---+
|  1| sri|  1|
|  1| sri|  1|
|  2| ram|  2|
|  3| sam|  3|
|  4| kri|  4|
|  5| tom|  5|
|  5| tom|  5|
|  5| tom|  5|
+---+----+---+
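
To see the deduplication behaviour directly, you can compare the schemas of both join styles (a quick sketch using the df1/df2 defined above; the long types come from createDataFrame inferring Python ints):

# string join: Spark deduplicates the join column, only one 'no' remains
df1.alias("a").join(df2.alias("b"), on='no').printSchema()
# root
#  |-- no: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)

# expression join: both a.no and b.no are kept
df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).printSchema()
# root
#  |-- no: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)
#  |-- no: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- sal: long (nullable = true)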

For your Case 2: the dataframe distinct method compares whole rows. When you only need the unique values of one column, you have to perform a select first:

resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.sal")
resDF.select('no').distinct().show()
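
If you instead want to keep whole rows but deduplicate on 'no' only, a minimal sketch using the standard DataFrame.dropDuplicates API with a subset (not part of the original answer) would be:

# keep one row per distinct 'no'; which of the duplicate rows survives is arbitrary
resDF.dropDuplicates(['no']).show()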

5 Comments

So the dataframes df11 and df22 each have only 3 columns (no, name, sal)? I have updated my answer.
One last question: using F.col and on='no' both give the same result, so I did not understand the difference here exactly. And if I want to test not-equal in the condition F.col('a.no') == F.col('b.no'), how do I modify it? Thanks for your time.
No, a join expression will not give the same result as a string join argument. A dataframe generated with a string join on column 'no' will only have one 'no' column (a.no in your case). When you use a join expression, the generated dataframe will have two 'no' columns (a.no and b.no in your case). Just try it out: df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).printSchema() and df1.alias("a").join(df2.alias("b"), on='no').printSchema()
That's great, clarified now. Thank you so much. :-)
@cronoik, I have a similar kind of issue: stackoverflow.com/questions/72963925/… Any clue how to fix this?
