Case 1: When I try to select "b.no" I get the error below (code and full error message shared). How can I get values from the second DataFrame (aliased as b)? Is selecting columns from b allowed here at all? If I remove b.no, it works fine.
df1.csv:
no,name,sal
1,sri,3000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4000

df2.csv:
no,name,sal
1,sri,3000
1,vas,4000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4500
5,toy,4200
5,koy,4999
6,jim,3090
7,kim,2080
code:
from pyspark.shell import spark
from pyspark.sql import SQLContext
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df11 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\df1.csv")
df22 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\df2.csv")
print("df11", df11.count())
print("df22", df22.count())
resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.no")
print("resDF", resDF.count())
print("resDF", resDF.distinct().show())
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.select.
: org.apache.spark.sql.AnalysisException: cannot resolve 'b.no' given input columns: [b.sal, a.no, b.name, a.sal, a.name];;
pyspark.sql.utils.AnalysisException: "cannot resolve 'b.no' given input columns: [b.sal, a.no, b.name, a.sal, a.name];;
'Project [no#10, name#11, 'b.no]
+- AnalysisBarrier
   +- Project [no#10, name#11, sal#12, name#27, sal#28]
      +- Join Inner, (no#10 = no#26)
         :- SubqueryAlias a
         :  +- Relation[no#10,name#11,sal#12] csv
         +- SubqueryAlias b
            +- Relation[no#26,name#27,sal#28] csv"
Case 2: When I select b.sal instead, I get duplicate rows, and distinct() does not filter them out (the rows differ in b.sal, so they are not exact duplicates).
resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.sal")
print("resDF", resDF.distinct().show())
In this case, how do I get distinct values based on the 'no' column only?