
I have a text file with the data below, which has no particular format:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

I want to split the file based on the string abc, so the output should be the two files below.

file 1:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~

file 2:

abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

The file names should be taken from the IT name (the line starting with k7), so the first file should be named IT_1234 and the second file IT_8876.

  • How much data would there be in each new file? Commented Feb 22, 2018 at 13:33
  • Why do you want to do this with Spark? Why not, say, bash? Is the file on HDFS? Commented Feb 22, 2018 at 13:34
  • Yes, could you give us an idea of the end goal of doing so; it would help us find the appropriate solution. Commented Feb 22, 2018 at 13:36
  • Yes. We get 2.5 GB of data per hour, and the file is in HDFS. Commented Feb 23, 2018 at 6:54

2 Answers


There is a little dirty trick that I used for a project:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")

You can set the record delimiter of your Spark context for reading files, so you could do something like this:

import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for toDF on an RDD (spark is the SparkSession, as in spark-shell)

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
           .map(x => delimit + x)                     // put back the delimiter stripped by the reader
           .toDF("delimit_column")
           .filter(col("delimit_column") =!= delimit) // drop the empty first record

Then you can map each element of your DataFrame (or RDD) to be written to a file, as sketched below.
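
For example, here is a minimal sketch of that write step (not from the original answer). It assumes each record contains exactly one k7 segment; the helper itName and the output directory /output/dir are made up for illustration. It writes one HDFS file per record via the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

// Hypothetical helper: find the k7 line of a record and turn "IT 1234" into "IT_1234"
def itName(record: String): String =
  record.split("\\r?\\n")
    .find(_.startsWith("k7*"))
    .map(_.split("\\*")(1).trim.replace(" ", "_"))
    .getOrElse("UNKNOWN")

val fs = FileSystem.get(sc.hadoopConfiguration)
df.collect().map(_.getString(0)).foreach { record =>
  val out = fs.create(new Path(s"/output/dir/${itName(record)}.txt")) // hypothetical output dir
  out.write(record.getBytes(StandardCharsets.UTF_8))
  out.close()
}

Note that collect() brings everything onto the driver, which is fine for a couple of GB per hour but would not scale much further; at larger volumes you would write from the executors instead.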

It's a dirty method, but it might help you!

Have a good day.

PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.


3 Comments

Thank you, I can use this solution. I have edited the question; please help me with that.
Hello, I don't think you can name output files directly in Spark. You should either use the Hadoop library and create the paths with filenames before writing, or create a shell script.
Hi, what if I have the "abc" string in the middle of a record? In that case, instead of 2 I will get 3 records in the output. Can we use substring functions here when defining the delimiter?

You can use sparkContext's wholeTextFiles function to read the whole file, then parse it to separate the records (here I have used #### as a distinct combination of characters that won't appear in the text):

val records = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect() // an Array[String] on the driver, one element per abc block

Then loop over the array to save each text to an output file; one way to fill in the saving step is sketched after the loop:

for (str <- records) {
  // saving code here (see the sketch below)
}
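
As one possible way to fill in that saving step (not part of the original answer), you could derive the file name from the k7 segment and write each block to HDFS with the Hadoop FileSystem API. The regex, the UNKNOWN fallback, and the /output/dir path below are assumptions for illustration:

import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

val fs = FileSystem.get(sc.hadoopConfiguration)
val k7Pattern = """k7\*(IT \d+)\*""".r // assumed shape of the k7 segment

for (str <- records if str.trim.nonEmpty) {
  // "k7*IT 8876*e*278df~" -> file name "IT_8876"
  val name = k7Pattern.findFirstMatchIn(str)
    .map(_.group(1).replace(" ", "_"))
    .getOrElse("UNKNOWN")
  val out = fs.create(new Path(s"/output/dir/$name.txt")) // hypothetical output directory
  out.write(str.getBytes(StandardCharsets.UTF_8))
  out.close()
}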

