import org.apache.spark.sql.functions.monotonically_increasing_id

val data = spark.read
    .text(filepath)
    .toDF("val")
    .withColumn("id", monotonically_increasing_id())
val count = data.count()

This code works fine when reading a file with up to ~50k rows, but once a file has more rows than that, it starts losing data: when it reads a file with 1 million+ rows, the final DataFrame count only reports about 65k rows. I can't understand where the problem is in this code, or what needs to change so that every row ends up in the final DataFrame. P.S. The largest file this code has to ingest has almost 14 million rows; currently it ingests only about 2 million of them.

1 Answer


This seems related to How do I add a persistent column of row ids to Spark DataFrame?

i.e. avoid using monotonically_increasing_id and follow some of the suggestions from that thread.
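For reference, a minimal sketch of one alternative discussed in that thread: derive the id from the RDD's zipWithIndex, which assigns consecutive 0-based indices, instead of monotonically_increasing_id, whose values are unique and increasing but not consecutive (the partition id is encoded in the upper bits). This sketch assumes the same spark session and filepath as in the question.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val raw = spark.read.text(filepath).toDF("val")

// zipWithIndex assigns consecutive 0-based indices across all partitions
val indexed = raw.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}

// Append the id column to the original schema and rebuild the DataFrame
val schema = StructType(raw.schema.fields :+ StructField("id", LongType, nullable = false))
val data = spark.createDataFrame(indexed, schema)

val count = data.count()  // should match the number of lines in the file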


1 Comment

Do you mean the value of count or the size of finalDF?
