import org.apache.spark.sql.functions.monotonically_increasing_id

val data = spark.read
    .text(filepath)
    .toDF("val")
    .withColumn("id", monotonically_increasing_id())
val count = data.count()

This code works fine when reading a file with up to ~50k rows, but once a file has more rows than that, it starts losing data: when it reads a file with 1 million+ rows, the final DataFrame count only reports about 65k rows. I can't understand where the problem is in this code, or what needs to change so that every row ends up in the final DataFrame. P.S. The largest file this code has to ingest has almost 14 million rows; currently it ingests only about 2 million of them.

1 Answer


This seems related to How do I add a persistent column of row ids to Spark DataFrame?

i.e. avoid using monotonically_increasing_id and follow some of the suggestions from that thread.
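For reference, a minimal sketch of one alternative discussed in that thread: derive the id from the RDD's zipWithIndex, which assigns consecutive 0-based indices, instead of monotonically_increasing_id, whose values are unique and increasing but not consecutive (the partition id is encoded in the upper bits). This sketch assumes the same spark session and filepath as in the question.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val raw = spark.read.text(filepath).toDF("val")

// zipWithIndex assigns consecutive 0-based indices across all partitions
val indexed = raw.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}

// Append the id column to the original schema and rebuild the DataFrame
val schema = StructType(raw.schema.fields :+ StructField("id", LongType, nullable = false))
val data = spark.createDataFrame(indexed, schema)

val count = data.count()  // should match the number of lines in the file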


1 Comment

Do you mean the value of count or the size of finalDF?
