48

Is there any alternative to R's df[100, c("column")] for Scala Spark DataFrames? I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R code above.

9 Answers

29

Firstly, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you must run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than any of the other languages' docs.

Continuing with my explanation, I would use some methods of the RDD API, because every DataFrame has an RDD as an attribute. Please see my example below, and notice how I take the 2nd record.

df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()                        # RDD of (Row, index) pairs
            .filter(lambda pair: pair[1] == myIndex)   # keep only the requested index
            .map(lambda pair: tuple(pair[0]))          # unwrap the Row back into a plain tuple
            .collect())

print(values[0])
# ('b', 2)

Hopefully, someone will give another solution with fewer steps.
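
For reference, a minimal Scala sketch of the same zipWithIndex approach, assuming a DataFrame df built from the same data as in the PySpark example above:

val myIndex = 1
val values = df.rdd
  .zipWithIndex()                          // RDD[(Row, Long)]
  .filter { case (_, i) => i == myIndex }  // keep only the requested index
  .map { case (row, _) => row }            // drop the index again
  .collect()

println(values(0))
// [b,2]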

2 Comments

Your link is dead. It should probably be this: spark.apache.org/docs/latest/api/python/reference/api/…
@MyrionSC2 Your link seems broken too.
21

This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less coding.

val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")

val myRow7th = parquetFileDF.rdd.take(7).last

2 Comments

Will the output change depending on how many nodes the data is clustered across?
The order is not guaranteed, so the output might change on each run.
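
If you need the result to be deterministic, one option (a sketch, not part of the original answer) is to impose an explicit ordering before taking the rows; someKeyColumn is a placeholder for whatever column defines your order:

// Sort on an explicit key column first, so "the 7th row" is well defined.
val myRow7thOrdered = parquetFileDF
  .orderBy("someKeyColumn")   // placeholder ordering column
  .take(7)
  .last
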
17

In PySpark, if your dataset is small (it can fit into the memory of the driver), you can do

df.collect()[n]

where df is the DataFrame object, and n is the index of the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
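
A rough Scala equivalent, assuming the whole DataFrame still fits into the driver's memory (n and "myColumn" are placeholders):

// Collect everything to the driver, then index into the resulting Array[Row].
val n = 100
val row = df.collect()(n)                  // the n-th Row (0-based)
val value = row.getAs[Any]("myColumn")     // use getAs[Int], getAs[String], ... if the type is known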

Comments

9

The getrows() function below should get the specific rows you want.

For completeness, I have written down the full code in order to reproduce the output.

# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()

# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])

# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()

# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
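
A rough Scala counterpart of getrows, assuming the same df as above; it keeps only the rows whose zipWithIndex position is in the requested set:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def getRows(df: DataFrame, rownums: Set[Long]): RDD[Row] =
  df.rdd.zipWithIndex()
    .filter { case (_, i) => rownums.contains(i) }  // keep requested positions only
    .map { case (row, _) => row }                   // drop the index again

getRows(df, Set(0L, 2L)).collect()
// Array([a,1], [c,3])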

Comments

6

This works for me in PySpark:

df.select("column").collect()[0][0]

Comments

2

There is a Scala way (if you have enough memory on the working machine):

val arr = df.select("column").rdd.collect
println(arr(100))

If the DataFrame schema is unknown and you know the actual type of the "column" field (for example, Double), then you can get arr as follows:

import sqlContext.implicits._  // for the $ column syntax and the Double encoder (spark.implicits._ with a SparkSession)

val arr = df.select($"column".cast("Double")).as[Double].rdd.collect

Comments

2

You can simply do that with the single line of code below:

val arr = df.select("column").collect()(99)
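
In Scala, getting at the value itself rather than the wrapping Row would look roughly like this ((99) is the 100th row of the single-column result, get(0) its only field):

val value = df.select("column").collect()(99).get(0)   // Any; use getAs[...] if the type is known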

1 Comment

more like: .collect()[1][0], in case someone needs the help
2

When you want to fetch the max value of a date column from a DataFrame, just the value itself without the Row object wrapper, you can refer to the code below.

table = "mytable"

max_date = df.select(max('date_col')).first()[0]

This gives 2020-06-26 instead of Row(max(date_col)=datetime.date(2020, 6, 26)).
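
A hedged Scala equivalent, assuming a date column named date_col:

import org.apache.spark.sql.functions.max

// Aggregate, take the single resulting Row, then unwrap the value itself.
val maxDate = df.agg(max("date_col")).first().getDate(0)   // java.sql.Date, e.g. 2020-06-26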

Comments

-2

Following is a Java/Spark way to do it: 1) add a sequentially increasing id column, 2) select the row by that id, 3) drop the column.

import static org.apache.spark.sql.functions.*;
..

ds = ds.withColumn("rownum", monotonically_increasing_id());
ds = ds.filter(col("rownum").equalTo(99));
ds = ds.drop("rownum");

N.B. monotonically_increasing_id starts from 0;

1 Comment

monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
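
If consecutive, gap-free numbers are required, a common alternative (a Scala sketch, not part of the original answer) is row_number() over a Window; someKeyColumn is a placeholder for the column that defines the row order, and with no partitioning column all rows go through a single partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// row_number() yields consecutive values 1, 2, 3, ... under the given ordering,
// so the 100th row is the one with rownum == 100.
val w = Window.orderBy("someKeyColumn")
val row100 = df.withColumn("rownum", row_number().over(w))
  .filter(col("rownum") === 100)
  .drop("rownum")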
