
I am working with Spark 1.5, using Java. I need to append an ID/index column to an existing DataFrame, for example:

+---------+--------+
|  surname|    name|
+---------+--------+
|    Green|    Jake|
| Anderson|  Thomas|
| Corleone| Michael|
|    Marsh|   Randy|
|  Montana|    Tony|
|    Green|   Julia|
|Brenneman|    Eady|
|   Durden|   Tyler|
| Corleone|    Vito|
|   Madiro|     Mat|
+---------+--------+

I want every row to be appended with an index, in the range between 1 and the number of records in the table. The order of the indexes does not matter; each row just has to contain a unique ID/index. It could be done by converting to an RDD, appending an index to each row, and converting back to a DataFrame with a modified StructType, but, if I understand correctly, this operation consumes a lot of resources for the conversions, and there must be another way. The result must look like:

+---------+--------+---+
|  surname|    name| id|
+---------+--------+---+
|    Green|    Jake|  3|
| Anderson|  Thomas|  5|
| Corleone| Michael|  2|
|    Marsh|   Randy| 10|
|  Montana|    Tony|  7|
|    Green|   Julia|  1|
|Brenneman|    Eady|  2|
|   Durden|   Tyler|  9|
| Corleone|    Vito|  4|
|   Madiro|     Mat|  6|
+---------+--------+---+

Thank you.

3 Comments
  • 1
    Possible duplicate of Primary keys with Apache Spark Commented Aug 10, 2016 at 14:02
  • The first solution he proposes (if I understood the Scala syntax correctly) is conversion into an RDD etc. The second one I can't call from Java, and it generates unique values that are not in the required range, so the only remaining option is using hash functions, but that has unacceptable drawbacks. Commented Aug 10, 2016 at 14:09
  • 2
    Actually my point here is that given your requirements there is no better solution than rdd -> zipWithIndex. Also, excluding the Python snippets, every piece of code there should be Java compatible. Commented Aug 10, 2016 at 14:37

5 Answers

3

I know this question was asked a while ago, but you can do it as follows:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

w = Window.orderBy("myColumn")
withIndexDF = originalDF.withColumn("index", row_number().over(w))
  • myColumn: any specific column from your DataFrame.
  • originalDF: the original DataFrame without the index column.
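
Since the original question is about Java, roughly the same thing with the Java API might look like the sketch below. It assumes originalDF already exists; note that row_number() is the Spark 1.6+ name (on 1.5 the equivalent function is rowNumber()), and that on 1.x releases window functions require a HiveContext.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.row_number;

// Number the rows 1..N according to the ordering of "myColumn".
WindowSpec w = Window.orderBy("myColumn");
DataFrame withIndexDF = originalDF.withColumn("index", row_number().over(w));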

1 Comment

When using a window without a partition clause, Spark will warn that all data falls into a single partition, which can cause a huge performance degradation.
1

The most concise way to do this on a Spark DataFrame:

.withColumn("idx",monotonically_increasing_id())

Complete documentation: https://docs.databricks.com/spark/latest/sparkr/functions/withColumn.html
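
In Java, that is roughly the sketch below (df stands for the existing DataFrame; monotonically_increasing_id() is the name since Spark 1.6, on 1.4/1.5 the same function is called monotonicallyIncreasingId()):

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.monotonically_increasing_id;

// Appends a unique 64-bit id per row; the values are increasing but not consecutive.
DataFrame withIdxDF = df.withColumn("idx", monotonically_increasing_id());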

1 Comment

From the question: "I want every row to be appended with an index, in the range between 1 and the number of records in the table." From the documentation of monotonically_increasing_id(): "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
0

Folks, a good approach is described at:

DataFrame-ified zipWithIndex

It simulates the zipWithIndex method from RDD. The first suggestion there performs better, but the pure DataFrames solution is no big deal either so far (with a table of over 100M rows in my scenario).
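
For reference, a minimal Java sketch of that RDD zipWithIndex route (assuming the Spark 1.x DataFrame/SQLContext API and Java 8 lambdas; the helper name withIdColumn and the "id" column name are just illustrative):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ZipWithIndexHelper {

    // Appends a 1-based "id" column by zipping the underlying RDD with an index
    // and rebuilding the DataFrame with an extended schema.
    public static DataFrame withIdColumn(DataFrame df, SQLContext sqlContext) {
        JavaRDD<Row> indexedRows = df.javaRDD()
                .zipWithIndex()
                .map(tuple -> {
                    Row row = tuple._1();
                    long id = tuple._2() + 1;            // shift the 0-based index to 1..N
                    List<Object> values = new ArrayList<>();
                    for (int i = 0; i < row.length(); i++) {
                        values.add(row.get(i));
                    }
                    values.add(id);
                    return RowFactory.create(values.toArray());
                });

        List<StructField> fields = new ArrayList<>(Arrays.asList(df.schema().fields()));
        fields.add(DataTypes.createStructField("id", DataTypes.LongType, false));
        StructType schemaWithId = DataTypes.createStructType(fields);

        return sqlContext.createDataFrame(indexedRows, schemaWithId);
    }
}

The table still has to be rescanned and every row rebuilt, which is exactly the cost the question is trying to avoid, but it keeps the index dense in the range 1..N.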


0

In Scala, first we need to create an indexing Array:

val indx_arr = (1 to your_df.count.toInt).toArray

indx_arr: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Now we want to append this column to our DataFrame. First, we collect the DataFrame as an array, then we zip it with indx_arr, convert the newly created array back into an RDD, and finally turn it into a DataFrame:

val final_df = sc.parallelize((your_df.collect.map(
    x => (x(0), x(1))) zip indx_arr).map(
    x => (x._1._1.toString, x._1._2.toString, x._2))).toDF("surname", "name", "id")

This is also an easy and straightforward way of appending an array of any kind to a Spark DataFrame, although note that collect brings the whole DataFrame to the driver, so it only suits small tables.


-2

You can use the withColumn function. Usage should be something like val myDF = existingDF.withColumn("index", express(random(1, existingDF.count())))

2 Comments

What is express?
expr, to express an expression
