
I am trying to read the first row from a file and then filter that from the dataframe.

I am using take(1) to read the first row. I then want to filter this from the dataframe (it could appear multiple times within the dataset).

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext(appName = "solution01")
spark = SparkSession(sc)

df1 = spark.read.csv("/Users/abc/test.csv")
header = df1.take(1)
print(header)

final_df = df1.filter(lambda x: x != header)
final_df.show()

However, I get the following error: TypeError: condition should be string or Column.

I was trying to follow the answer from Nicky here: How to skip more then one lines of header in RDD in Spark.
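For context, that linked answer operates on an RDD, where filter takes a Python function; DataFrame.filter instead expects a SQL string or a Column expression, which is why the lambda raises the TypeError. A minimal sketch of the RDD-style equivalent (assuming the df1 from the code above):

# The lambda style works on the RDD API, not the DataFrame API.
rdd = df1.rdd                      # RDD of Row objects
header_row = rdd.first()           # first Row, e.g. Row(_c0='customer_id')
# Drop every row identical to the header Row, then convert back to a DataFrame
cleaned = rdd.filter(lambda row: row != header_row).toDF(df1.columns)
cleaned.show()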

The data looks like this (but it will have multiple columns that I need to do the same for):

customer_id
1
2
3
customer_id
4
customer_id
5

I want the result as:

1
2
3
4
5

1 Answer

take on a DataFrame returns a list of Row objects, so we need to extract the value with [0][0]. In the filter clause, use the column name and keep only the rows that are not equal to the header:

from pyspark.sql.functions import col

header = df1.take(1)[0][0]
# keep only the rows that are not equal to the header
final_df = df1.filter(col("<col_name>") != header)
final_df.show()
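For the multi-column case mentioned in the question, one option (a sketch, not tested against your data) is to build one combined condition over all columns and keep any row where at least one column differs from the header:

from functools import reduce
from pyspark.sql.functions import col

header = df1.take(1)[0]  # the full header Row, e.g. Row(_c0='customer_id', ...)
# Keep a row if at least one column differs from its header value;
# this drops every exact repeat of the header row.
keep = reduce(lambda a, b: a | b,
              [col(c) != header[c] for c in df1.columns])
final_df = df1.filter(keep)
final_df.show()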