
I have a problem where I can't skip my own header in a CSV file while reading it with PySpark's read.csv.
The CSV file looks like this:

°°°°°°°°°°°°°°°°°°°°°°°°
°      My Header       °
°    Important Data    °
°        Data          °
°°°°°°°°°°°°°°°°°°°°°°°°

MYROW;SECONDROW;THIRDROW
290;6848;66484
96849684;68463;63848
84646;6484;98718

I can't figure out how to skip those first lines, or the first n lines in general.
I tried something like:

    df_read = spark.read.csv('MyCSV-File.csv', sep=';') \
        .rdd.zipWithIndex() \
        .filter(lambda x: x[1] > 6) \
        .map(lambda x: x[0]) \
        .toDF(['MYROW', 'SECONDROW', 'THIRDROW'])

Is there any way to skip those lines, and in particular, how fast will it be? The data could be several GB. Thanks.

1 Answer


You can add a filter to drop those first lines:

.filter(lambda line: not line.startswith("°"))
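This filter works on raw text lines rather than on the DataFrame that read.csv returns, so one way to use it is to read the file as text first, drop the decorated lines, and parse the rest yourself. Here is a minimal sketch, assuming the file name and ';' separator from the question; the blank-line filter and the string splitting are illustrative additions, not part of the original answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the file as plain text lines and drop the decorated box plus blank lines.
    lines = spark.sparkContext.textFile('MyCSV-File.csv')
    data = lines.filter(lambda line: not line.startswith("°")) \
                .filter(lambda line: line.strip() != "")

    # The first surviving line is the real header; split everything else on ';'.
    header = data.first()
    df = data.filter(lambda line: line != header) \
             .map(lambda line: line.split(";")) \
             .toDF(header.split(";"))
    df.show()

All columns come back as strings this way, so you would still need to cast them if you want numeric types.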

Another option is to mark those lines as comments:

.option("comment", "°")