A CSV file needs to be processed with Spring Batch, where multiple rows may have to be grouped based on a field value. In the sample below, two lines share the same GROUP_NAME value, group_1. Note also that the records are not necessarily ordered.

GROUP_NAME,ENTITY_NAME,DATA,ADDITIONAL_DATA
group_1,Foo book,"{""paperback"" : true}","{""digital"":false}"
group_999,Bar book,,
group_1,Foo book,"{""edition"":5}",

The end goal is to create a single database row out of the two lines for group_1:

ID | ENTITY_NAME | DATA                                | ADDITIONAL_DATA
1  | Foo book    | {"paperback" : true, "edition" : 5} | {"digital" : false}

Spring Batch does provide a way to partition the input before the job starts and then process the partitions in chunks, but that does not fit this use case: which records belong to a given GROUP_NAME is only known at read time, so partition boundaries cannot be derived up front. Furthermore, records belonging to the same GROUP_NAME could span multiple chunks, and chunks are processed independently.

As per How to group the records in Spring Batch 5.x so that every thread/execution is dedicated for one group of records, chunk-based processing is not suitable for this use case. I also have the constraint that a staging database table cannot be used (otherwise the CSV contents could have been loaded into it and grouped with a GROUP BY clause).

One way to solve this would be to:

  • Sort the lines in the CSV file on the GROUP_NAME field (yes, this will be expensive; sorting at the operating-system level appears to be the more performant option)

  • Use SingleItemPeekableItemReader to read the lines, collecting records until the GROUP_NAME value changes

Given that Spring Batch is designed to process records independently, is there a better way to implement this use case than sorting the file contents and grouping the records with a SingleItemPeekableItemReader?
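For reference, the peek-and-collect step I have in mind can be sketched in plain Java. This is a self-contained illustration, not Spring Batch code: the PeekableIterator below plays the role that SingleItemPeekableItemReader (with its peek()/read() pair) would play in the real job, and CSV quoting and the JSON merge are elided.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class GroupingSketch {

    // One-item lookahead, mirroring SingleItemPeekableItemReader's peek()/read().
    static class PeekableIterator<T> {
        private final Iterator<T> delegate;
        private T peeked; // buffered next item, or null when nothing is buffered

        PeekableIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        T peek() {
            if (peeked == null && delegate.hasNext()) {
                peeked = delegate.next();
            }
            return peeked;
        }

        T read() {
            T next = peek();
            peeked = null;
            return next;
        }
    }

    // Collects consecutive lines sharing the same first CSV field (GROUP_NAME).
    // Assumes the input has already been sorted on that field; the naive
    // split(",") is safe here because GROUP_NAME itself contains no commas.
    static List<List<String>> groupSortedLines(List<String> sortedLines) {
        PeekableIterator<String> reader = new PeekableIterator<>(sortedLines.iterator());
        List<List<String>> groups = new ArrayList<>();
        while (reader.peek() != null) {
            String line = reader.read();
            String key = line.split(",", 2)[0];
            List<String> group = new ArrayList<>();
            group.add(line);
            // Keep reading while the peeked line still belongs to the same group.
            while (reader.peek() != null && reader.peek().split(",", 2)[0].equals(key)) {
                group.add(reader.read());
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of(
                "group_1,Foo book",
                "group_1,Foo book",
                "group_999,Bar book");
        System.out.println(groupSortedLines(sorted));
    }
}
```

Each inner list would then be merged into a single item by an ItemProcessor before writing.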

2 Replies

Make the group id one of the columns in the database, and write a query that does an upsert instead of an insert when a record with that group already exists.
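As a sketch of what that upsert could look like, assuming PostgreSQL with jsonb columns and a unique constraint on GROUP_NAME (the table and column names here are hypothetical), the jsonb || operator merges the two objects, with the incoming row's keys winning on a clash:

```sql
-- Hypothetical schema:
-- books(group_name text UNIQUE, entity_name text, data jsonb, additional_data jsonb)
INSERT INTO books (group_name, entity_name, data, additional_data)
VALUES (:groupName, :entityName, CAST(:data AS jsonb), CAST(:additionalData AS jsonb))
ON CONFLICT (group_name) DO UPDATE SET
    data            = COALESCE(books.data, '{}') || COALESCE(EXCLUDED.data, '{}'),
    additional_data = COALESCE(books.additional_data, '{}') || COALESCE(EXCLUDED.additional_data, '{}');
```

The COALESCE calls guard against NULL columns, since NULL || jsonb yields NULL in PostgreSQL.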

Alternatively, get a machine with plenty of memory and a decent number of cores, and use the sort utility on Linux to sort the file on the columns you want. Then write your batch job against the sorted file (the sorting could even be part of the job, using a SystemCommandTasklet).
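For example, the command a SystemCommandTasklet step could wrap might look like this (file names are illustrative). Sorting on the first comma-delimited field is safe here even though later fields contain quoted commas, because -k1,1 only compares the text before the first comma:

```shell
# Keep the header line first, then sort the body on GROUP_NAME (field 1).
head -n 1 input.csv > sorted.csv
tail -n +2 input.csv | LC_ALL=C sort -t, -k1,1 >> sorted.csv
```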

Next to that, you could also use other utilities like awk, sed and friends to combine multiple lines into a single one and do the merge there. That way the only thing the batch job needs to do is read the lines and insert them.

You could implement the sorting in Java too.
