A CSV file needs to be processed using Spring Batch, where multiple rows may be grouped based on a field value. In the sample below, two lines in the CSV share the same GROUP_NAME value, group_1. Note that the records are not necessarily ordered.
GROUP_NAME,ENTITY_NAME,DATA,ADDITIONAL_DATA
group_1,Foo book,"{""paperback"" : true}","{""digital"":false}"
group_999,Bar book,,
group_1,Foo book,"{""edition"":5}",
The end goal is to create one database row from the two lines for group_1 as:
| ID | ENTITY_NAME | DATA | ADDITIONAL_DATA |
|---|---|---|---|
| 1 | Foo book | {"paperback" : true, "edition" : 5} | {"digital" : false} |
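To make the intended merge semantics explicit: the DATA column of the output row is a shallow key merge of the JSON objects from the grouped lines. A minimal sketch of that merge, with plain Maps standing in for JSON objects that would in practice be parsed (e.g. with Jackson):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeSketch {
    // Merge later-arriving key/value pairs into the accumulated DATA map.
    // In the real job, each JSON column value would first be parsed into
    // a Map (e.g. with Jackson's ObjectMapper) before merging.
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> update) {
        Map<String, Object> merged = new LinkedHashMap<>(base);
        merged.putAll(update);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> row1 = Map.of("paperback", true); // from the first group_1 line
        Map<String, Object> row2 = Map.of("edition", 5);      // from the second group_1 line
        System.out.println(merge(row1, row2)); // {paperback=true, edition=5}
    }
}
```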
Spring Batch does provide a way to split the input into partitions before the job starts and then process them in chunks, but that cannot be applied here: the set of records belonging to a given GROUP_NAME is dynamic, so partitions cannot be defined up front. Further, records belonging to the same GROUP_NAME could span multiple chunks, and chunks are processed independently.
As per How to group the records in Spring Batch 5.x so that every thread/execution is dedicated for one group of records, chunk-based processing is not suitable for this use case. I also have the limitation that a staging database table cannot be used (otherwise the CSV contents could have been loaded into it and a GROUP BY clause would have helped).
One way to solve this would be to:

1. Sort the lines in the CSV file based on the GROUP_NAME field (yes, this is going to be expensive; sorting at the operating-system level seems to be performant).
2. Use SingleItemPeekableItemReader to read the lines, collecting the records (lines) until a change in GROUP_NAME is identified.
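For reference, the grouping loop I have in mind would look like the following stdlib-only sketch. An Iterator stands in for the FlatFileItemReader delegate, and the peek()/read() pair mirrors the methods of SingleItemPeekableItemReader; the Row record and its fields are illustrative, not actual Spring Batch types:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the grouping logic: keep reading while the peeked record has
// the same GROUP_NAME as the current one, and emit the accumulated group
// when it changes. Assumes the input is already sorted by GROUP_NAME.
public class GroupingReaderSketch {
    record Row(String groupName, String entityName, String data) {}

    private final Iterator<Row> delegate;
    private Row peeked; // emulates SingleItemPeekableItemReader.peek()

    public GroupingReaderSketch(List<Row> sortedRows) {
        this.delegate = sortedRows.iterator();
    }

    private Row next() {
        if (peeked != null) { Row r = peeked; peeked = null; return r; }
        return delegate.hasNext() ? delegate.next() : null;
    }

    private Row peek() {
        if (peeked == null && delegate.hasNext()) peeked = delegate.next();
        return peeked;
    }

    /** Returns the next complete group, or null when the input is exhausted. */
    public List<Row> read() {
        Row current = next();
        if (current == null) return null;
        List<Row> group = new ArrayList<>();
        group.add(current);
        while (peek() != null && peek().groupName().equals(current.groupName())) {
            group.add(next());
        }
        return group;
    }

    public static void main(String[] args) {
        var reader = new GroupingReaderSketch(List.of(
                new Row("group_1", "Foo book", "{\"paperback\": true}"),
                new Row("group_1", "Foo book", "{\"edition\": 5}"),
                new Row("group_999", "Bar book", "")));
        List<Row> group;
        while ((group = reader.read()) != null) {
            System.out.println(group.get(0).groupName() + " -> " + group.size() + " row(s)");
        }
    }
}
```

Downstream, an ItemProcessor/ItemWriter pair would then merge each emitted List into a single database row.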
Given that Spring Batch is designed to process records independently, is there a better way to implement this use case than sorting the file content and grouping the records with SingleItemPeekableItemReader?