
I have a text file with the data below, which has no particular format:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

I want to split the file based on the string abc, so the output should be the two files below.

file 1:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~

file 2:

abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

The file names should be taken from the IT name (the line starting with k7), so the first file should be named IT_1234 and the second file IT_8876.

  • How much data would there be in each new file? Commented Feb 22, 2018 at 13:33
  • Why do you want to do this with Spark? Why not, say, bash? Is the file on HDFS? Commented Feb 22, 2018 at 13:34
  • Yes, could you give us an idea of the end goal of doing so; it would help us find the appropriate solution. Commented Feb 22, 2018 at 13:36
  • Yes. We get 2.5 GB of data per hour, and the file is in HDFS. Commented Feb 23, 2018 at 6:54

2 Answers


There is a little dirty trick that I used for a project:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")

You can set the record delimiter of your Spark context for reading files, so you could do something like this:

import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for toDF on an RDD (spark is the SparkSession, as in spark-shell)

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
           .map(x => delimit + x)                     // put back the delimiter stripped by the reader
           .toDF("delimit_column")
           .filter(col("delimit_column") =!= delimit) // drop the empty first record

Then you can map each element of your DataFrame (or RDD) to be written to a file, as sketched below.
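
For example, here is a minimal sketch of that write step (not from the original answer). It assumes each record contains exactly one k7 segment; the helper itName and the output directory /output/dir are made up for illustration. It writes one HDFS file per record via the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

// Hypothetical helper: find the k7 line of a record and turn "IT 1234" into "IT_1234"
def itName(record: String): String =
  record.split("\\r?\\n")
    .find(_.startsWith("k7*"))
    .map(_.split("\\*")(1).trim.replace(" ", "_"))
    .getOrElse("UNKNOWN")

val fs = FileSystem.get(sc.hadoopConfiguration)
df.collect().map(_.getString(0)).foreach { record =>
  val out = fs.create(new Path(s"/output/dir/${itName(record)}.txt")) // hypothetical output dir
  out.write(record.getBytes(StandardCharsets.UTF_8))
  out.close()
}

Note that collect() brings everything onto the driver, which is fine for a couple of GB per hour but would not scale much further; at larger volumes you would write from the executors instead.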

It's a dirty method, but it might help you!

Have a good day.

PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.


3 Comments

Thank you, I can use this solution. I have edited the question; please help me with that.
Hello, I don't think you can name output files directly in Spark. You should either use the Hadoop library and create the paths with filenames before writing, or create a shell script.
Hi, what if I have the "abc" string in the middle of a record? In that case, instead of 2 I will get 3 records in the output. Can we use substring functions here when defining the delimiter?

You can use sparkContext's wholeTextFiles function to read the whole file, then parse it to separate the records (here I have used #### as a distinct combination of characters that won't appear in the text):

val records = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect() // an Array[String] on the driver, one element per abc block

Then loop over the array to save each text to an output file; one way to fill in the saving step is sketched after the loop:

for (str <- records) {
  // saving code here (see the sketch below)
}
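
As one possible way to fill in that saving step (not part of the original answer), you could derive the file name from the k7 segment and write each block to HDFS with the Hadoop FileSystem API. The regex, the UNKNOWN fallback, and the /output/dir path below are assumptions for illustration:

import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

val fs = FileSystem.get(sc.hadoopConfiguration)
val k7Pattern = """k7\*(IT \d+)\*""".r // assumed shape of the k7 segment

for (str <- records if str.trim.nonEmpty) {
  // "k7*IT 8876*e*278df~" -> file name "IT_8876"
  val name = k7Pattern.findFirstMatchIn(str)
    .map(_.group(1).replace(" ", "_"))
    .getOrElse("UNKNOWN")
  val out = fs.create(new Path(s"/output/dir/$name.txt")) // hypothetical output directory
  out.write(str.getBytes(StandardCharsets.UTF_8))
  out.close()
}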

