0

I'm returning to Java after a several-year hiatus with Ruby. I'm looking for idiomatic and short Java code that accomplishes the following Ruby statement:

some_string.scan(/[\w|\']+/)

The above expression creates an array from a string. The elements of the array are all the substrings of some_string that are composed of word characters (\w) or apostrophes (\', so that "John's" is not split into two words).

For example:

"(The farmer's daughter) went to the market".scan(/[\w|\']+/)

=>

["The", "farmer's", "daughter", ...]

Update

I know the solution will use something like this:

String[] words = sentence.split(" ");

I just need the regex part that goes in split().

2
  • I know Java and regex in Java, but I can't see what your Ruby regex there is doing. Can you describe it in words? :) Commented Apr 18, 2012 at 23:07
  • You don't need | in a character class (surrounded by brackets [ ]), and you don't need to escape the '. The regular expression /[\w']+/ is correct, and yours is buggy. Commented Apr 19, 2012 at 1:02

3 Answers

3

Java doesn't have a built-in scan method that can do this in a function call, so you need to roll the loop yourself. You can do this quite easily with Java's regex Matcher class.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

String yourString = "(The farmer's daughter) went to the supermarket";

/* The regex syntax is basically identical to Ruby, except that you need
 * to specify your regex as a normal string literal, and therefore you need to
 * double up on your backslashes. The other differences between my regex and
 * yours are all things that I think you need to change about the Ruby version
 * as well. */
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(yourString);
// Collect each match in order, as Ruby's scan does
List<String> words = new ArrayList<>();
while (m.find()) {
    words.add(m.group());
}

I'm not sure what the relative merits are of using Matcher versus using Scanner for this situation.
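For what it's worth, the find() loop above can also be written as a stream. The following is only a sketch, assuming Java 9 or later (where Matcher.results() is available); the class name ScanSketch and the helper scan are hypothetical names chosen for illustration, and the helper only imitates Ruby's scan for patterns without capture groups.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ScanSketch {
    // Hypothetical helper: returns every non-overlapping match of regex in input, in order.
    static List<String> scan(String input, String regex) {
        return Pattern.compile(regex)
                .matcher(input)
                .results()                // Stream<MatchResult>, available since Java 9
                .map(MatchResult::group)  // the text of each whole match
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [The, farmer's, daughter, went, to, the, market]
        System.out.println(scan("(The farmer's daughter) went to the market", "[\\w']+"));
    }
}
```

On older Java versions the explicit while (m.find()) loop remains the way to go.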

2

Regular expressions should behave more or less the same even across languages. In this case, the only difference is that the pattern is written as a Java string literal, so you have to double the backslashes. The single quote does not need to be escaped.

If in Ruby we write /[\w']+/, in Java we would write Pattern.compile("[\\w']+").


Oh, Scanners can scan Strings as well!

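import java.util.Scanner;
import java.util.regex.Pattern;

final String s = "The farmer's daughter went to the market";
Scanner sc = new Scanner(s);
Pattern p = Pattern.compile("[\\w']+");
// hasNext(p) tests each whitespace-delimited token as a whole against p
while (sc.hasNext(p)) { System.out.println(sc.next(p)); }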
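One caveat with the Scanner loop: hasNext(Pattern) checks whether the next complete token matches, and tokens are delimited by whitespace by default. With the string from the question, the first token is "(The", which does not match, so the loop would end without printing anything. A possible workaround, sketched here with the hypothetical names ScannerWordsSketch and words, is to make everything that is not a word character or an apostrophe the delimiter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWordsSketch {
    // Hypothetical helper: collects the tokens between runs of non-word, non-apostrophe characters.
    static List<String> words(String input) {
        List<String> words = new ArrayList<>();
        // Any run of characters outside [\w'] acts as a delimiter between tokens
        try (Scanner sc = new Scanner(input).useDelimiter("[^\\w']+")) {
            while (sc.hasNext()) {
                words.add(sc.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("(The farmer's daughter) went to the market"));
    }
}
```

This inverts the pattern in the same way a split-based approach does: the regex describes the separators rather than the words.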

Splitting is not exactly the same thing as scanning, but why not split the string on whitespace, which separates the words?

"The farmer's daughter went to the market".split("\\s+");

4 Comments

That's pretty close. I know I need to use the .split, I just need the regex to filter out non-alphanum chars except apostrophes.
@bevanb, I just learned that Scanners work with Strings as well. See if it can solve your problem. Also, the | inside the square brackets is unnecessary.
The regex in Ruby should be /[\w']+/ and the equivalent regex in Java is "[\\w']+".
@MatheusMoreira, thanks for pointing out that the | is unnecessary.
0

How about

String[] words = test.split("[^a-zA-Z0-9']+");

or

words = test.split("[^\\w']+");

These patterns differ from your Ruby example because you were using Ruby's String#scan, where you supply the pattern that matches a word. Java's String#split is like Ruby's method of the same name: you supply the pattern that matches your word delimiters.
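One thing to watch for with split: when the input begins with a delimiter, as the question's "(The farmer's daughter) ..." does, the resulting array starts with an empty string. A minimal sketch of filtering that out follows; the class name SplitWordsSketch and the helper words are hypothetical, and the stream calls assume Java 8 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitWordsSketch {
    // Hypothetical helper: splits on delimiter runs and drops any empty pieces.
    static List<String> words(String input) {
        return Arrays.stream(input.split("[^\\w']+"))
                .filter(w -> !w.isEmpty())  // removes the empty leading element, if any
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String test = "(The farmer's daughter) went to the market";
        // The raw split result starts with "" because the string begins with "("
        System.out.println(Arrays.toString(test.split("[^\\w']+")));
        System.out.println(words(test));
    }
}
```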
