I have one file as follows:

dept.txt:
1,It,pune,2017-03-12
2,CS,delhi,2017-03-21
3,mech,mum,
4,fin,pune,2017-04-15
5,It,delhi,

What I need to do:

  1. Read data from 2 files into 2 RDDs (this I have done)

  2. Apply a filter on the date column of the dept file and get two output files, one for null and one for non-null values (this I am unable to do)

How far I have been able to proceed:

val loadDept = sc.textFile("/path/to/file/dept.txt")
val cleanDept = loadDept.map(_.split(","))
val dateCol = cleanDept.filter(i => i(3) != "") 

The error occurs in the last line:

java.lang.ArrayIndexOutOfBoundsException: 3

I understand that I am getting an out-of-bounds exception because of the empty string/null (please correct me if I am wrong), but how do I get around it?

Note: I only need to use RDDs in Scala

4 Comments

  • First check the length before accessing array elements. Commented Dec 29, 2017 at 10:52
  • @Saravana thanks for the response. If you look at dept.txt, there are 4 values in each line separated by commas, so I believed that splitting would give me positions 0, 1, 2 and 3, hence the i(3). Now I think that in line 3, "3,mech,mum,", the last comma has nothing after it, not even a space, just a carriage return. I believe that is the problem, but I don't know how to get around it and achieve what I want. Commented Dec 29, 2017 at 11:00
  • Please see the answer for checking the length and also for grouping the results based on the element at index 3. Commented Dec 29, 2017 at 11:20
  • Possible duplicate of Why is spark throwing an ArrayIndexOutOfBoundsException expection for empty attributes? Commented Dec 29, 2017 at 12:56

3 Answers

1

From the Javadoc description of the String.split(String regex, int limit) method:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

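Since split(String regex) works as if by invoking the two-argument split method with the given expression and a limit argument of zero, the trailing empty strings are discarded. A line such as "3,mech,mum," therefore yields only three elements, and i(3) throws the ArrayIndexOutOfBoundsException.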
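To make the effect of the limit concrete, here is a small plain-Scala sketch (no Spark needed) that splits one of the rows from dept.txt with three different limits. The object name SplitLimitDemo and the value names are made up for illustration.

```scala
object SplitLimitDemo {
  // A row from dept.txt whose last field (the date) is empty.
  val row = "3,mech,mum,"

  // One-argument split behaves like limit 0: trailing empty strings are dropped.
  val defaultSplit: Array[String] = row.split(",")

  // A negative limit applies the pattern as often as possible and keeps trailing empty strings.
  val keepAll: Array[String] = row.split(",", -1)

  // A positive limit n yields at most n elements; the last one holds the rest of the input.
  val atMostTwo: Array[String] = row.split(",", 2)

  def main(args: Array[String]): Unit = {
    println(defaultSplit.length)       // 3
    println(keepAll.length)            // 4
    println(keepAll.last.isEmpty)      // true
    println(atMostTwo.mkString(" | ")) // 3 | mech,mum,
  }
}
```

With the default split, index 3 does not exist for this row, which matches the exception in the question.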

The solution, also mentioned in Natalia's answer:

// limit -1 keeps trailing empty strings, so every row of dept.txt yields 4 fields
val cleanDept = loadDept.map(_.split(",", -1))
// keep only the rows whose last field (the date) is non-empty
val filledDateDept = cleanDept.filter(_.last.nonEmpty)
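The question asks for two outputs, one with dates and one without. Below is a rough sketch of that split on a local Seq so it runs without Spark; the object DeptDates and the helpers fields and hasDate are hypothetical names, not part of any library. On an RDD, the same hasDate predicate could presumably be passed to two filter calls (one negated), each followed by saveAsTextFile, since RDDs do not offer the tuple-returning partition used here.

```scala
object DeptDates {
  // Hypothetical helper: split one line, keeping trailing empty fields.
  def fields(line: String): Array[String] = line.split(",", -1)

  // Hypothetical helper: true when the date column (index 3) exists and is not blank.
  def hasDate(f: Array[String]): Boolean = f.length > 3 && f(3).trim.nonEmpty

  def main(args: Array[String]): Unit = {
    // Sample rows copied from dept.txt in the question.
    val lines = Seq(
      "1,It,pune,2017-03-12",
      "2,CS,delhi,2017-03-21",
      "3,mech,mum,",
      "4,fin,pune,2017-04-15",
      "5,It,delhi,"
    )

    // Local-collection stand-in for two RDD filters: rows with a date vs. rows without.
    val (withDate, withoutDate) = lines.map(fields).partition(hasDate)

    println("with date:")
    withDate.foreach(f => println(f.mkString(",")))
    println("without date:")
    withoutDate.foreach(f => println(f.mkString(",")))
  }
}
```

Keeping the check in one named predicate means both outputs are derived from the same rule, so a row cannot end up in neither or both.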

0
  1. Check the length of the array before accessing its elements

    sc.textFile("/file/path/dept.txt")
    .map(_.split(","))
    .filter(a => a.length > 3 && a(3) != null && !a(3).equals(""))
    
  2. You can group the results using groupBy

    sc.textFile("/file/path/dept.txt")
    .map(_.split(","))
    .groupBy(a => a.length > 3 && a(3) != null && !a(3).equals(""))
    

This groups the rows by the key computed in groupBy, which in this case is either true or false.

The false key holds all rows whose date is empty or missing, in a buffer.

The true key holds all rows with a non-empty date, in a buffer.

Output:

(false,CompactBuffer([Ljava.lang.String;@5bef517c, [Ljava.lang.String;@51ec0ad3))
(true,CompactBuffer([Ljava.lang.String;@44a4871, [Ljava.lang.String;@16a375ad, [Ljava.lang.String;@37703dfd))
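The entries above print as [Ljava.lang.String;@... because a JVM array's toString shows its type and identity hash, not its contents. A minimal sketch of one way to make the groups readable, written against a local Seq so it runs without Spark (the object name ReadableGroups is made up). On the grouped RDD, something like mapValues(_.map(_.mkString(","))) should give a similar result.

```scala
object ReadableGroups {
  // Sample rows copied from dept.txt in the question.
  val lines = Seq(
    "1,It,pune,2017-03-12",
    "2,CS,delhi,2017-03-21",
    "3,mech,mum,",
    "4,fin,pune,2017-04-15",
    "5,It,delhi,"
  )

  // Group rows by whether index 3 holds a non-empty value,
  // then join each array back into text so it prints legibly.
  val grouped: Map[Boolean, Seq[String]] =
    lines
      .map(_.split(","))
      .groupBy(a => a.length > 3 && a(3).nonEmpty)
      .map { case (hasDate, rows) => hasDate -> rows.map(_.mkString(",")) }

  def main(args: Array[String]): Unit =
    grouped.foreach { case (hasDate, rows) => println(s"$hasDate -> ${rows.mkString("; ")}") }
}
```

Note that with the one-argument split the trailing comma is lost when the row is joined again, so "3,mech,mum," comes back as "3,mech,mum".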

0

The problem is in the split operation.

"ab,dfd,".split(",") --> Array(ab, dfd)
"ab,dfd,".split(",", -1) --> Array(ab, dfd, )

If you invoke split with only one parameter, the default limit is used, and it equals 0.

From the Javadoc for split:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

As you can see, in case of 0, all trailing empty strings will be discarded.

3 Comments

Superb. This solved the problem. If possible, could you explain the -1 a bit, please? I have never seen or used it like this.
Your statement "substrings with length 0 are discarded" is incorrect. The second parameter denotes the number of times the pattern will be applied to split the string.
@philantrovert thanks, fixed by copy-pasting the Javadoc for split
