
I am trying to read the first row from a file and then filter that from the dataframe.

I am using take(1) to read the first row. I then want to filter this from the dataframe (it could appear multiple times within the dataset).

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext(appName = "solution01")
spark = SparkSession(sc)

df1 = spark.read.csv("/Users/abc/test.csv")
header = df1.take(1)
print(header)

final_df = df1.filter(lambda x: x != header)
final_df.show()

However, I get the following error: TypeError: condition should be string or Column.

I was trying to follow the answer from Nicky here: How to skip more then one lines of header in RDD in Spark.
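For context, that linked answer operates on an RDD, where filter takes a Python function; DataFrame.filter instead expects a SQL string or a Column expression, which is why the lambda raises the TypeError. A minimal sketch of the RDD-style equivalent (assuming the df1 from the code above):

# The lambda style works on the RDD API, not the DataFrame API.
rdd = df1.rdd                      # RDD of Row objects
header_row = rdd.first()           # first Row, e.g. Row(_c0='customer_id')
# Drop every row identical to the header Row, then convert back to a DataFrame
cleaned = rdd.filter(lambda row: row != header_row).toDF(df1.columns)
cleaned.show()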

The data looks like this (but it will have multiple columns that I need to do the same for):

customer_id
1
2
3
customer_id
4
customer_id
5

I want the result as:

1
2
3
4
5

1 Answer

take on a DataFrame returns a list of Row objects, so we need to extract the value with [0][0]. In the filter clause, use the column name and keep only the rows that are not equal to the header:

from pyspark.sql.functions import col

header = df1.take(1)[0][0]
# keep only the rows that are not equal to the header
final_df = df1.filter(col("<col_name>") != header)
final_df.show()
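For the multi-column case mentioned in the question, one option (a sketch, not tested against your data) is to build one combined condition over all columns and keep any row where at least one column differs from the header:

from functools import reduce
from pyspark.sql.functions import col

header = df1.take(1)[0]  # the full header Row, e.g. Row(_c0='customer_id', ...)
# Keep a row if at least one column differs from its header value;
# this drops every exact repeat of the header row.
keep = reduce(lambda a, b: a | b,
              [col(c) != header[c] for c in df1.columns])
final_df = df1.filter(keep)
final_df.show()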