
E.g.:

from pyspark.sql import SQLContext  # sc is an existing SparkContext

sqlContext = SQLContext(sc)

sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()

The above statement prints the entire table on the terminal. But I want to access each row in that table using a for or while loop to perform further calculations.


7 Answers


No and Yes.

No:

Technically speaking, you simply cannot iterate over DataFrames and other distributed data structures. They can only be accessed through dedicated higher-order functions and/or SQL methods (see https://docs.python.org/3/glossary.html#term-iterable).

Yes:

You can use collect to get a local list of Row objects that can be iterated.

for row in df.rdd.collect():
    do_something(row)

or use toLocalIterator:

for row in df.rdd.toLocalIterator():
    do_something(row)

Note:

Spark's distributed data and distributed processing let you work on amounts of data that would otherwise be very hard to handle.

When using collect(), there is a trade-off: you can loop over the rows locally, but the data might no longer fit into local memory, or the computation might take much more time.


3 Comments

Newbie question: As iterating an already collected dataframe "beats the purpose", from a dataframe, how should I pick the rows I need for further processing?
Did some reading and looks like forming a new dataframe with where() would be the Spark-way of doing it properly.
"it beats all purpose of using Spark" is pretty strong and subjective language. The collect() method exists for a reason, and there are many valid use cases for it. Once Spark is done processing the data, iterating through the final results might be the only way to integrate with/write to external APIs or legacy systems.

To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.

def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)

or

sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))

The custom function would then be applied to every row of the DataFrame. Note that sample2 will be an RDD, not a DataFrame.

map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a DataFrame.

sample3 = sample.withColumn('age2', sample.age + 2)

2 Comments

Can you please tell me how to actually use the customFunction so that the return values could be used inside a loop for further processing? I have a collect() based approach but my data is too large and it causes the Pyspark (v. 3) to fail. Thank you!
hi @David, if I use map() on a rdd, will each row run customFunction() in order? In my case, I hope every row will be processed sequentially.

Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:

df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]

In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().

Or more abbreviated:

tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]

And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.

sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]} 
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))



Give it a try like this:

    result = spark.createDataFrame([('SpeciesId', 'int'), ('SpeciesName', 'string')],
                                   ["col_name", "data_type"])
    for f in result.collect():
        print(f.col_name)



It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.

Assume this is your df:

+----------+----------+-------------------+-----------+-----------+------------------+ 
|      Date|  New_Date|      New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+ 
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148            | 
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252           |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548         |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148         |
+----------+----------+-------------------+-----------+-----------+------------------+

To loop through the rows in the Date column:

rows = df3.select('Date').collect()

final_list = []
for i in rows:
    final_list.append(i[0])

print(final_list)



If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.

Note that this will return a PipelinedRDD, not a DataFrame.



In the answer above,

tupleList = [{name:x["name"], age:x["age"], city:x["city"]} 

should be

tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]} 

because name, age, and city are not variables but simply keys of the dictionary.

2 Comments

Is a square bracket missing from right hand side of code line 2?
When you're not addressing the original question, don't post it as an answer; instead, comment on or suggest an edit to the partially correct answer.
