
I have a dataframe as follows:

|Property ID|Location|Price|Bedrooms|Bathrooms|Size|Price SQ Ft|Status|

When I filter it by Bedrooms or Bathrooms, it gives the correct result:

df = spark.read.csv('/FileStore/tables/realestate.txt', header=True, inferSchema=True, sep='|')
df.filter(df.Bedrooms==2).show()

But when I filter it by Property ID as df.filter(df.Property ID==1532201).show(), I get an error. Is it because there is a space between Property and ID?
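For reference, a minimal way to reproduce this without the original file is to build a tiny DataFrame with the same column names; the sample rows below are made up, not the real data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample rows standing in for realestate.txt
rows = [
    (1532201, "Downtown", 350000, 2, 2, 1100, 318.18, "Active"),
    (1499102, "Suburb",   275000, 3, 2, 1600, 171.88, "Sold"),
]
cols = ["Property ID", "Location", "Price", "Bedrooms",
        "Bathrooms", "Size", "Price SQ Ft", "Status"]
df = spark.createDataFrame(rows, cols)

df.filter(df.Bedrooms == 2).show()   # works: no space in the column name

# df.filter(df.Property ID == 1532201)  # fails: not even valid Python syntax
print(df.columns)                     # confirms the column name contains a space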

2 Answers


Yes, the space between Property and ID is the cause of the issue. Another approach you can follow is to reference the column with F.col:

from pyspark.sql import functions as F
df.filter(F.col('Property ID')==1532201).show()

1 Comment

OK, thanks. I was able to do it using df.filter(col("Property ID")==1499102).show().

You can also use the square bracket notation to select the column:

df.filter(df['Property ID'] == 1532201).show()

Or use a raw SQL expression string to filter (note the backticks around the column name):

df.filter('`Property ID` = 1532201').show()
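
For what it's worth, all three working variants select the same rows. Here is a minimal sketch against a made-up two-row DataFrame (the sample values are assumptions, not the real data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample with only the columns needed for the comparison
df = spark.createDataFrame(
    [(1532201, 2), (1499102, 3)],
    ["Property ID", "Bedrooms"],
)

# All three filters are equivalent for a column name containing a space
df.filter(F.col("Property ID") == 1532201).show()
df.filter(df["Property ID"] == 1532201).show()
df.filter("`Property ID` = 1532201").show()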
