48

Is there any alternative to R's df[100, c("column")] for Scala Spark DataFrames? I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R code above.

9 Answers

29

Firstly, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you must run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than any of the other languages' docs.

Continuing with my explanation, I would use some methods of the RDD API, because every DataFrame has an RDD as an attribute. Please see my example below, and notice how I take the 2nd record.

df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()                        # RDD of (Row, index) pairs
            .filter(lambda pair: pair[1] == myIndex)   # keep only the requested index
            .map(lambda pair: tuple(pair[0]))          # unwrap the Row back into a plain tuple
            .collect())

print(values[0])
# ('b', 2)

Hopefully, someone will give another solution with fewer steps.
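
For reference, a minimal Scala sketch of the same zipWithIndex approach, assuming a DataFrame df built from the same data as in the PySpark example above:

val myIndex = 1
val values = df.rdd
  .zipWithIndex()                          // RDD[(Row, Long)]
  .filter { case (_, i) => i == myIndex }  // keep only the requested index
  .map { case (row, _) => row }            // drop the index again
  .collect()

println(values(0))
// [b,2]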

2 Comments

Your link is dead. It should probably be this: spark.apache.org/docs/latest/api/python/reference/api/…
@MyrionSC2 Your link seems broken too.
21

This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less coding.

val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")

val myRow7th = parquetFileDF.rdd.take(7).last

2 Comments

Will the output change depending on how many nodes the data is clustered across?
The order is not guaranteed, so the output might change on each run.
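
If you need the result to be deterministic, one option (a sketch, not part of the original answer) is to impose an explicit ordering before taking the rows; someKeyColumn is a placeholder for whatever column defines your order:

// Sort on an explicit key column first, so "the 7th row" is well defined.
val myRow7thOrdered = parquetFileDF
  .orderBy("someKeyColumn")   // placeholder ordering column
  .take(7)
  .last
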
17

In PySpark, if your dataset is small (it can fit into the memory of the driver), you can do

df.collect()[n]

where df is the DataFrame object, and n is the index of the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
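
A rough Scala equivalent, assuming the whole DataFrame still fits into the driver's memory (n and "myColumn" are placeholders):

// Collect everything to the driver, then index into the resulting Array[Row].
val n = 100
val row = df.collect()(n)                  // the n-th Row (0-based)
val value = row.getAs[Any]("myColumn")     // use getAs[Int], getAs[String], ... if the type is known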

Comments

9

The getrows() function below should get the specific rows you want.

For completeness, I have written down the full code in order to reproduce the output.

# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()

# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])

# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()

# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
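
A rough Scala counterpart of getrows, assuming the same df as above; it keeps only the rows whose zipWithIndex position is in the requested set:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def getRows(df: DataFrame, rownums: Set[Long]): RDD[Row] =
  df.rdd.zipWithIndex()
    .filter { case (_, i) => rownums.contains(i) }  // keep requested positions only
    .map { case (row, _) => row }                   // drop the index again

getRows(df, Set(0L, 2L)).collect()
// Array([a,1], [c,3])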

Comments

6

This works for me in PySpark:

df.select("column").collect()[0][0]

Comments

2

There is a Scala way (if you have enough memory on the working machine):

val arr = df.select("column").rdd.collect
println(arr(100))

If the DataFrame schema is unknown and you know the actual type of the "column" field (for example, Double), then you can get arr as follows:

import sqlContext.implicits._  // for the $ column syntax and the Double encoder (spark.implicits._ with a SparkSession)

val arr = df.select($"column".cast("Double")).as[Double].rdd.collect

Comments

2

You can simply do that with the single line of code below:

val arr = df.select("column").collect()(99)
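
In Scala, getting at the value itself rather than the wrapping Row would look roughly like this ((99) is the 100th row of the single-column result, get(0) its only field):

val value = df.select("column").collect()(99).get(0)   // Any; use getAs[...] if the type is known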

1 Comment

more like: .collect()[1][0], in case someone needs the help
2

When you want to fetch the max value of a date column from a DataFrame, just the value itself without the Row object wrapper, you can refer to the code below.

table = "mytable"

max_date = df.select(max('date_col')).first()[0]

This gives 2020-06-26 instead of Row(max(date_col)=datetime.date(2020, 6, 26)).
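
A hedged Scala equivalent, assuming a date column named date_col:

import org.apache.spark.sql.functions.max

// Aggregate, take the single resulting Row, then unwrap the value itself.
val maxDate = df.agg(max("date_col")).first().getDate(0)   // java.sql.Date, e.g. 2020-06-26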

Comments

-2

Following is a Java/Spark way to do it: 1) add a sequentially increasing id column, 2) select the row by that id, 3) drop the column.

import static org.apache.spark.sql.functions.*;
..

ds = ds.withColumn("rownum", monotonically_increasing_id());
ds = ds.filter(col("rownum").equalTo(99));
ds = ds.drop("rownum");

N.B. monotonically_increasing_id starts from 0;

1 Comment

monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
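
If consecutive, gap-free numbers are required, a common alternative (a Scala sketch, not part of the original answer) is row_number() over a Window; someKeyColumn is a placeholder for the column that defines the row order, and with no partitioning column all rows go through a single partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// row_number() yields consecutive values 1, 2, 3, ... under the given ordering,
// so the 100th row is the one with rownum == 100.
val w = Window.orderBy("someKeyColumn")
val row100 = df.withColumn("rownum", row_number().over(w))
  .filter(col("rownum") === 100)
  .drop("rownum")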
