
I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one df2 has also 1 column called col2. The number of rows is equal in both DataFrames.

How can I merge these two columns into a new DataFrame?

I tried join, but I think that there should be some other way to do it.

Also, I tried to apply withColumn, but it does not compile.

val result = df1.withColumn(col("col2"), df2.col1)

UPDATE:

For example:

df1 = 
col1
1
2
3

df2 = 
col2
4
5
6

result = 
col1  col2
1     4
2     5
3     6
  • how did you join them? Commented Nov 24, 2017 at 21:15
  • what does "merge" mean? Commented Nov 25, 2017 at 0:58
  • Do you want to take the first row from df1 and "merge" it with the first row of df2 and so on for every row in df1? Commented Nov 25, 2017 at 11:23
  • @JacekLaskowski: Yes, right. I want to get a DataFrame with two columns. But actually I cannot use join because there is no joining criteria. I just want to have two columns from two different DataFrames placed next to each other in the new DataFrame. Commented Nov 25, 2017 at 20:46
  • @Mike: Please see my update. Commented Nov 25, 2017 at 20:48

3 Answers


If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two DataFrames:

var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")

df1.union(df2).show

+---+ 
|one| 
+---+ 
| a | 
| b | 
| c | 
| d | 
| e | 
| f | 
+---+

[edit] Now that you've made clear you just want the two columns side by side, you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value (note: the generated IDs are only guaranteed to be increasing, not consecutive, so this pairing is only safe when both DataFrames have identical partitioning, e.g. small, single-partition data):

import org.apache.spark.sql.functions.monotonically_increasing_id

var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")

df1.withColumn("id", monotonically_increasing_id())
    .join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
    .drop("id")
    .show

+---+---+ 
|one|two|
+---+---+ 
| a | d | 
| b | e | 
| c | f |
+---+---+

6 Comments

I need the columns to be next to each other. So, I need to get 2 columns, not one.
Should I import monotonically_increasing_id?
Oh, yes, you'll need to import that
Could you please add the import statement to your answer? I cannot find the import path for monotonically_increasing_id.
Added the import. Please remember to accept the answer if you end up using the solution!

As far as I know, the only way to do what you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
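A minimal sketch of that approach, assuming a SparkSession named spark is in scope (the helper name addIndex and the column name row_idx are my own, not from the linked answer):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

// Append a 0-based row index to each DataFrame via RDD.zipWithIndex,
// then join on that index so rows pair up positionally.
def addIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ idx)
  }
  spark.createDataFrame(
    indexed,
    df.schema.add(StructField("row_idx", LongType, nullable = false)))
}

val result = addIndex(df1)
  .join(addIndex(df2), Seq("row_idx"))
  .drop("row_idx")
```

Unlike monotonically_increasing_id, zipWithIndex produces consecutive indices regardless of partitioning, at the cost of an extra Spark job to compute per-partition offsets.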

But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.

[Update with example of in-driver collect/zip]

val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")

2 Comments

Yes, the DataFrames are indeed very small. How can I zip them together as you proposed in the second paragraph?
I added an example of doing collect and zip in the driver.

It depends on what you want to do.

If you want to merge two DataFrames, you would normally use a join. The same join types exist as in relational algebra (or in any DBMS).

You said that your DataFrames have just one column each.

In that case you might want to do a cross join (Cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by @Chondrops), which gives you a one-column table with all the elements.

I think all other join types can be achieved with specialized operations in Spark (in this case, two DataFrames with one column each).
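For illustration, a cross join is a one-liner in Spark 2.x (this is a sketch of the Cartesian-product option, not the row-by-row pairing the question asks for):

```scala
// Every combination of col1 and col2: 3 x 3 = 9 rows
// for the two 3-row DataFrames in the question.
val cartesian = df1.crossJoin(df2)
cartesian.show()
```

Note that crossJoin is explicit by design; a plain join with no condition requires enabling spark.sql.crossJoin.enabled.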

