Spark SQL - Joining two tables: how to reference column names?

Question

I have two tables with the same schema:

var champs = List(  StructField("nom"    , StringType, true),
                    StructField("heure " , StringType, true),
                    StructField("velo"   , StringType, true),
                    StructField("action" , StringType, true))
var schema = StructType(champs)

I am trying to join them with classic SQL in Spark SQL:

Select  distinct  p.nom, p.velo, p.action, p.heure, r.action, r.heure
from    prises as p, 
        rendus as r
WHERE   p.velo == r.velo

But I get an error:

Name: org.apache.spark.sql.AnalysisException
Message: cannot resolve '`p.heure`' given input columns: [heure , heure , velo, velo, action, nom, action, nom]; line 2 pos 41;

Is this kind of query possible in Spark?

I see a lot of pages where people use the DataFrame join method. Is that the only way?

EDIT 1

val requete = s"""
Select  distinct  p.nom, p.velo, p.action, p.heure, r.action, r.heure
from prises p 
join rendus r
  on (p.velo = r.velo)
"""

sqlContext.sql(requete).show()

gives an error:

Name: org.apache.spark.sql.AnalysisException
Message: cannot resolve '`p.heure`' given input columns: [action, nom, nom, heure , heure , velo, velo, action]; line 2 pos 43;

EDIT 2

The same goes for:

val requete = s"""
SELECT DISTINCT p.nom, p.velo, p.action, p.heure, r.action, r.heure 
FROM       prises AS p 
INNER JOIN rendus AS r 
ON p.velo = r.velo
"""
sqlContext.sql(requete).show()

gives an error:

Name: org.apache.spark.sql.AnalysisException
Message: cannot resolve '`p.heure`' given input columns: [action, nom, nom, heure , heure , velo, velo, action]; line 2 pos 41;

I can't test it at the moment, but it might be confused by the comma join syntax (never use that; write out your joins explicitly) or by the incorrect double equals in your WHERE clause.
– MK., commented Jan 27, 2017 at 14:14

MK. · Accepted Answer · 2017-01-27 14:32:40Z

1

[OK this really shouldn't be an answer but]

You somehow have a trailing space in one of your column names. Look at the error message: some column names are followed by a space before the comma ("heure ,") and some are not. That is why p.heure cannot be resolved.

Also, please use explicit JOIN syntax; comma joins are always a terrible, unreadable, confusing idea. SQL uses a single equals sign, not a double one, and <> rather than != while we are at it (even though != is legal in a lot of places, unfortunately).
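
To confirm the diagnosis, one option is to look for field names that change when trimmed. The snippet below is a minimal sketch in plain Scala, not part of the original answer; `ColumnNameCheck`, `paddedNames` and `trimmedNames` are hypothetical helper names, and the list of names is copied from the schema in the question.

```scala
// Sketch: find column names that carry leading or trailing whitespace,
// such as the "heure " field declared in the question's schema.
object ColumnNameCheck {
  // Returns the names that change when trimmed, i.e. the suspicious ones.
  def paddedNames(names: Seq[String]): Seq[String] =
    names.filter(n => n != n.trim)

  // Returns the same names with surrounding whitespace removed.
  def trimmedNames(names: Seq[String]): Seq[String] =
    names.map(_.trim)

  def main(args: Array[String]): Unit = {
    // Field names exactly as declared in the question (note "heure ").
    val names = Seq("nom", "heure ", "velo", "action")
    // Brackets make the stray space visible when printed.
    paddedNames(names).foreach(n => println(s"[$n]"))
    println(trimmedNames(names).mkString(", "))
  }
}
```

With an actual DataFrame `df` (assumed here, it is not defined in this sketch), the names would come from `df.columns.toSeq`, and `df.toDF(ColumnNameCheck.trimmedNames(df.columns.toSeq): _*)` should return a copy with the cleaned names. Fixing the `StructField("heure " ...)` declaration itself is the simpler cure.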


1 Comment

Romain Jouin Over a year ago

A trailing space in the column name: my fault :-( You have sharp eyes!

JohnHC · Accepted Answer · 2017-01-27 14:23:52Z

0

As @MK. says, use explicit JOIN syntax (and a single = in the join condition).

Try:

Select  distinct  p.nom, p.velo, p.action, p.heure, r.action, r.heure
from prises p 
join rendus r
  on (p.velo = r.velo)

Check the Hive documentation for more information.


Dhruv Saxena · Accepted Answer · 2017-01-27 14:24:12Z

0

The query should be:

SELECT DISTINCT p.nom, p.velo, p.action, p.heure, r.action, r.heure
FROM       prises AS p
INNER JOIN rendus AS r
ON p.velo = r.velo

Notice that the problem is with using ==. It should be =.

