
I have a problem where I can't skip my own header in a CSV file while reading it with PySpark's read.csv.
The CSV file looks like this:

°°°°°°°°°°°°°°°°°°°°°°°°
°      My Header       °
°    Important Data    °
°        Data          °
°°°°°°°°°°°°°°°°°°°°°°°°

MYROW;SECONDROW;THIRDROW
290;6848;66484
96849684;68463;63848
84646;6484;98718

I can't figure out how to skip those first lines, or the first n lines in general.
I tried something like:

    df_read = spark.read.csv('MyCSV-File.csv', sep=';') \
        .rdd.zipWithIndex() \
        .filter(lambda x: x[1] > 6) \
        .map(lambda x: x[0]) \
        .toDF(['MYROW', 'SECONDROW', 'THIRDROW'])

Is there any way to skip those lines, and in particular, how fast will it be? The data could be several GB. Thanks.

1 Answer


You can add a filter to drop those first lines:

.filter(lambda line: not line.startswith("°"))
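This filter works on raw text lines rather than on the DataFrame that read.csv returns, so one way to use it is to read the file as text first, drop the decorated lines, and parse the rest yourself. Here is a minimal sketch, assuming the file name and ';' separator from the question; the blank-line filter and the string splitting are illustrative additions, not part of the original answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the file as plain text lines and drop the decorated box plus blank lines.
    lines = spark.sparkContext.textFile('MyCSV-File.csv')
    data = lines.filter(lambda line: not line.startswith("°")) \
                .filter(lambda line: line.strip() != "")

    # The first surviving line is the real header; split everything else on ';'.
    header = data.first()
    df = data.filter(lambda line: line != header) \
             .map(lambda line: line.split(";")) \
             .toDF(header.split(";"))
    df.show()

All columns come back as strings this way, so you would still need to cast them if you want numeric types.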

Another option is to mark those lines as comments:

.option("comment", "°")