
I am trying to split a file's records into good and bad data based on the date field, producing 2 result files. For the test file below, the first 4 lines should go into the good data and the last 2 lines into the bad data.

I am having 2 issues:

  1. I am not getting any good data; the result file is empty.
  2. The bad data result looks like the following, picking up only characters from the name:

    (,C,h) (,J,u) (,T,h) (,J,o) (,N,e) (,B,i)

Test file

Christopher|Jan 11, 2017|5 
Justin|11 Jan, 2017|5 
Thomas|6/17/2017|5 
John|11-08-2017|5 
Neli|2016|5 
Bilu||5

Load the file into an RDD

scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split("|"))

RegEx

scala> val singleReg = """(\w(3))\s(\d+)(,)\s(\d(4))|(\d+)\s(\w(3))(,)\s(\d(4))|(\d+)(\/)(\d+)(\/)(\d(4))|(\d+)(-)(\d+)(-)(\d(4))""".r

Are the three double quotes (""") at the beginning and end, and the trailing .r, important here?

Filter issue area

scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))

Turn array into string

scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))

Write file

scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")

Update 1: My regex above was wrong, so I have updated it as follows. In Scala the backslash is an escape character, so I doubled each one.

val singleReg = """\\w{3}\\s\\d+,\\s\\d{4}|\\d+\\s\\w{3},\\s\\d{4}|\\d+\/\\d+\/\\d{4}|\\d+-\\d+-\\d{4}""".r

Checked the regex on regex101 and the dates in the first 4 lines pass.

I have run the test again and I am still getting the same result.

  • Can you please mention the expected output for good and bad data? Commented Aug 13, 2017 at 7:15
  • First 4 lines need to go in good data and the last 2 lines in bad data, per the regex. Commented Aug 13, 2017 at 7:24
  • Why do you think that your regex matches the first 4 lines? What do you think \w(3) does? Without curly braces it is certainly not a repetition count of 3; yours matches a literal 3. You can test the regex online, e.g. at regex101.com Commented Aug 13, 2017 at 7:33
  • I have updated the regex, testing... will update shortly Commented Aug 13, 2017 at 7:54
  • Added update 1 to the question Commented Aug 13, 2017 at 8:08

1 Answer


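There are 2 issues with the code:

  1. The argument used to split the lines of data.txt is wrong. It should be the Char '|' instead of the String "|". split with a String argument treats it as a regular expression, in which | is the alternation operator, so the line is split between every character. That is why x(1) holds a single letter and never matches a date.
  2. The regex singleReg is wrong. \w(3) matches one word character followed by a literal 3; a repetition count needs curly braces, as in \w{3}. Inside a triple-quoted string the backslashes must not be doubled either.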

The correct code is as follows:

Load the file into an RDD

scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split('|'))
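To see why the Char overload matters, here is a minimal sketch in plain Scala (no Spark needed) that splits one line of the test file three ways. The object name SplitDemo and its field names are made up for illustration.

```scala
object SplitDemo {
  val line = "Christopher|Jan 11, 2017|5 "

  // String argument: interpreted as a regular expression. A bare | is an
  // alternation of two empty patterns, so it matches between every character
  // and the line falls apart into single-character pieces.
  val byRegex: Array[String] = line.split("|")

  // Char argument: splits on the literal pipe character.
  val byChar: Array[String] = line.split('|')

  // A String works too, as long as the pipe is escaped for the regex engine.
  val byEscaped: Array[String] = line.split("\\|")

  def main(args: Array[String]): Unit = {
    println(byRegex.take(4).mkString("[", "; ", "]"))
    println(byChar.mkString("[", "; ", "]"))
    println(byEscaped.mkString("[", "; ", "]"))
  }
}
```

With the Char or the escaped String, index 1 of the array holds the whole date field, which is what the filter step expects.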

RegEx

scala> val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r
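The regex can be checked without Spark. The sketch below applies the same pattern to the six date fields from the test file; it also touches on the triple quotes and .r asked about in the question. The names DateRegexDemo, isValid, good, bad, rawDoubled, and escaped are illustrative only.

```scala
object DateRegexDemo {
  // Triple quotes make a raw string: a backslash is an ordinary character there,
  // so \d is written with ONE backslash. .r turns the String into a
  // scala.util.matching.Regex.
  val singleReg =
    """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r

  // Doubling the backslash inside triple quotes asks for a literal backslash
  // followed by four d characters, so it no longer matches digits.
  val rawDoubled = """\\d{4}""".r
  // In an ordinary string literal the doubling IS needed.
  val escaped = "\\d{4}".r

  // The date column of each line in the test file.
  val dates = Seq("Jan 11, 2017", "11 Jan, 2017", "6/17/2017", "11-08-2017", "2016", "")

  // matches requires the whole field to match, not just a part of it.
  def isValid(s: String): Boolean = singleReg.pattern.matcher(s).matches

  val (good, bad) = dates.partition(isValid)

  def main(args: Array[String]): Unit = {
    println("good: " + good.mkString("; "))
    println("bad:  " + bad.mkString("; "))
  }
}
```

Note that this pattern is tailored to the sample data: \d{1} and \d{2} fix the digit counts, so a date such as 12/17/2017 or Jan 1, 2017 would land in the bad data unless the counts are relaxed (for example to \d{1,2}).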

Filter

scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))

Turn array into string

scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))
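The map above builds a Tuple3, so each saved line is the tuple's toString, wrapped in parentheses and comma-separated. If the original pipe-delimited layout were preferred in the result files, one possible alternative (an assumption, not part of the original code) is mkString. The sketch below compares the two on a single record; FormatDemo and its field names are hypothetical.

```scala
object FormatDemo {
  val fields: Array[String] = "Thomas|6/17/2017|5 ".split('|')

  // What the tuple-based map produces when written out as text.
  val asTuple: String = (fields(0), fields(1), fields(2)).toString

  // Alternative: join the fields back together with the original delimiter.
  val asLine: String = fields.mkString("|")

  def main(args: Array[String]): Unit = {
    println(asTuple)
    println(asLine)
  }
}
```

In the Spark code this would mean mapping each record with x => x.mkString("|") instead of building the tuple.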

Write file

scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")

The above code gives the following output:

data/singValid

(Christopher,Jan 11, 2017,5 )
(Justin,11 Jan, 2017,5 )
(Thomas,6/17/2017,5 )
(John,11-08-2017,5 )

data/singBad

(Neli,2016,5 )
(Bilu,,5)

2 Comments

When splitting on a comma or a space we use "," or " ". Why do we use single quotes, as in '|', for the pipe?
The answer to your query is here - stackoverflow.com/questions/47867743/…
