
I am struggling to get the CROSS JOIN of two data frames. I am using Spark 2.0. How can I implement a CROSS JOIN with two data frames?

Edit:

val df=df.join(df_t1, df("Col1")===df_t1("col")).join(df2,joinType=="cross join").where(df("col2")===df2("col2"))

5 Answers


Use crossJoin if no join condition needs to be specified.

Here is an extract of working code:

people.crossJoin(area).show()



Upgrade to the latest version of spark-sql_2.11 (2.1.0) and use the crossJoin method of Dataset.
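If you build with sbt, the version bump is a one-line change (a sketch; adjust the Scala binary suffix to match your build):

```scala
// build.sbt — Spark SQL 2.1.0, which adds Dataset.crossJoin
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
```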



Call join with the other dataframe without using a join condition.

Have a look at the following example. Given a first dataframe of people:

+---+------+-------+------+
| id|  name|   mail|idArea|
+---+------+-------+------+
|  1|  Jack|[email protected]|     1|
|  2|Valery|[email protected]|     1|
|  3|  Karl|[email protected]|     2|
|  4|  Nick|[email protected]|     2|
|  5|  Luke|[email protected]|     3|
|  6| Marek|[email protected]|     3|
+---+------+-------+------+

and second dataframe of areas:

+------+--------------+
|idArea|      areaName|
+------+--------------+
|     1|Amministration|
|     2|        Public|
|     3|         Store|
+------+--------------+

the cross join is simply given by:

val cross = people.join(area)
cross.show()
+---+------+-------+------+------+--------------+
| id|  name|   mail|idArea|idArea|      areaName|
+---+------+-------+------+------+--------------+
|  1|  Jack|[email protected]|     1|     1|Amministration|
|  1|  Jack|[email protected]|     1|     3|         Store|
|  1|  Jack|[email protected]|     1|     2|        Public|
|  2|Valery|[email protected]|     1|     1|Amministration|
|  2|Valery|[email protected]|     1|     3|         Store|
|  2|Valery|[email protected]|     1|     2|        Public|
|  3|  Karl|[email protected]|     2|     1|Amministration|
|  3|  Karl|[email protected]|     2|     2|        Public|
|  3|  Karl|[email protected]|     2|     3|         Store|
|  4|  Nick|[email protected]|     2|     3|         Store|
|  4|  Nick|[email protected]|     2|     2|        Public|
|  4|  Nick|[email protected]|     2|     1|Amministration|
|  5|  Luke|[email protected]|     3|     2|        Public|
|  5|  Luke|[email protected]|     3|     3|         Store|
|  5|  Luke|[email protected]|     3|     1|Amministration|
|  6| Marek|[email protected]|     3|     1|Amministration|
|  6| Marek|[email protected]|     3|     2|        Public|
|  6| Marek|[email protected]|     3|     3|         Store|
+---+------+-------+------+------+--------------+
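An unconditioned join is just the Cartesian product, so the row count is 6 people × 3 areas = 18. A plain-Python sketch of the same semantics (toy data, no Spark involved), using itertools.product:

```python
from itertools import product

# toy versions of the two dataframes above
people = [(1, "Jack", 1), (2, "Valery", 1), (3, "Karl", 2),
          (4, "Nick", 2), (5, "Luke", 3), (6, "Marek", 3)]
areas = [(1, "Amministration"), (2, "Public"), (3, "Store")]

# cross join: every person row paired with every area row
cross = [p + a for p, a in product(people, areas)]

print(len(cross))  # 6 x 3 = 18 rows
print(cross[0])    # (1, 'Jack', 1, 1, 'Amministration')
```

This also shows why cross joins are dangerous at scale: the output size is the product of the two input sizes.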


Dataframes now have a method named crossJoin for cross joining

You might have to enable cross joins in the Spark configuration. Example:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("distance_matrix") \
    .config("spark.sql.crossJoin.enabled", "true") \
    .getOrCreate()

and use something like this:

df1.join(df2, <condition>)


If the areas data is small, you can do it with explode and avoid shuffling:

import org.apache.spark.sql.functions.{array, explode, lit, struct}
import spark.implicits._

val df1 = Seq(
    (1,"Jack","[email protected]",1),
    (2,"Valery","[email protected]",1),
    (3,"Karl","[email protected]",2),
    (4,"Nick","[email protected]",2),
    (5,"Luke","[email protected]",3),
    (6,"Marek","[email protected]",3)
).toDF("id","name","mail","idArea")

val arr = array(
    Seq(
            (1,"Amministration"),
            (2,"Public"),
            (3,"Store")
        )
    .map(r => struct(lit(r._1).as("idArea"), lit(r._2).as("areaName"))):_*
)

val cross = df1
    .withColumn("d", explode(arr))
    .withColumn("idArea", $"d.idArea")
    .withColumn("areaName", $"d.areaName")
    .drop("d")

df1.show
cross.show

Output

+---+------+-------+------+
| id|  name|   mail|idArea|
+---+------+-------+------+
|  1|  Jack|[email protected]|     1|
|  2|Valery|[email protected]|     1|
|  3|  Karl|[email protected]|     2|
|  4|  Nick|[email protected]|     2|
|  5|  Luke|[email protected]|     3|
|  6| Marek|[email protected]|     3|
+---+------+-------+------+

+---+------+-------+------+--------------+
| id|  name|   mail|idArea|      areaName|
+---+------+-------+------+--------------+
|  1|  Jack|[email protected]|     1|Amministration|
|  1|  Jack|[email protected]|     2|        Public|
|  1|  Jack|[email protected]|     3|         Store|
|  2|Valery|[email protected]|     1|Amministration|
|  2|Valery|[email protected]|     2|        Public|
|  2|Valery|[email protected]|     3|         Store|
|  3|  Karl|[email protected]|     1|Amministration|
|  3|  Karl|[email protected]|     2|        Public|
|  3|  Karl|[email protected]|     3|         Store|
|  4|  Nick|[email protected]|     1|Amministration|
|  4|  Nick|[email protected]|     2|        Public|
|  4|  Nick|[email protected]|     3|         Store|
|  5|  Luke|[email protected]|     1|Amministration|
|  5|  Luke|[email protected]|     2|        Public|
|  5|  Luke|[email protected]|     3|         Store|
|  6| Marek|[email protected]|     1|Amministration|
|  6| Marek|[email protected]|     2|        Public|
|  6| Marek|[email protected]|     3|         Store|
+---+------+-------+------+--------------+
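The trick works because the small table is baked into the plan as literals, so each executor expands its rows locally and no shuffle is needed. A plain-Python sketch of what the explode step does (toy data, no Spark):

```python
# each row of the big side is expanded against a small in-memory list,
# mimicking withColumn("d", explode(arr)) plus the field extraction
people = [(1, "Jack"), (2, "Valery")]
areas = [(1, "Amministration"), (2, "Public"), (3, "Store")]  # the "literal array"

cross = [(pid, name, aid, aname)
         for (pid, name) in people      # one local pass over the big side
         for (aid, aname) in areas]     # expansion against the literal list

print(len(cross))  # 2 x 3 = 6 rows
```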

