I have a text file whose contents span several lines, like the sample below. I read the whole file into a single String.

random Text
State v. USA
some more text
USA v.
NY
Some more text
USA
v.LA ,  MN v. ND
USA vs. MN

I want to find the offsets (i.e. the starting and ending character indices) of patterns like [some word starting with a capital letter] v. [some word starting with a capital letter]

or [some word starting with a capital letter] vs. [some word starting with a capital letter]

For the example above, "State v. USA" => Start=11 and End=22

"USA v. NY" => Start=36 and End=45

I started with something like this: http://rubular.com/r/T7Ii2WDADw, but it does not cover all cases.

So the program could return a Map where the key is Start+","+End and the value is the matched text, e.g. "State v. USA".

3 Answers

To cover both cases ("v." and "vs.") you can use this regex. The dots are escaped so that they match a literal '.' rather than any character.

\w+\s(v\.|vs\.)\s\w+

In Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Testapp {

    public static void main(String[] args) {
        String text = "USA v. Russia \n Some other text \n India vs. Aus";
        // Backslashes are doubled inside a Java string literal.
        String regex = "\\w+\\s(v\\.|vs\\.)\\s\\w+";
        Pattern p = Pattern.compile(regex);
        Matcher matcher = p.matcher(text);

        // find() moves to the next match; start() is inclusive, end() is exclusive.
        while (matcher.find()) {
            System.out.println(matcher.group() + ": start = " + matcher.start() + " end = " + matcher.end());
        }
    }
}

Output:

USA v. Russia: start = 0 end = 13
India vs. Aus: start = 34 end = 47

This would be a working regex: \w+\s+vs?[.]\s+\w+

Then, using Matcher.find(), you could get the beginning and end of each match using Matcher.start(0) and Matcher.end(0).
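
As a rough sketch of how this could produce the Map described in the question, the loop below collects each match under a "start,end" key. The class name CaseOffsets and the method findOffsets are made up for illustration, and the choice of LinkedHashMap (to keep matches in order of appearance) is an assumption rather than something the question asked for.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class CaseOffsets {

    // Regex from this answer: word, whitespace, "v." or "vs.", whitespace, word.
    private static final Pattern CASE = Pattern.compile("\\w+\\s+vs?[.]\\s+\\w+");

    // Returns every match keyed by "start,end", in order of appearance.
    static Map<String, String> findOffsets(String text) {
        Map<String, String> offsets = new LinkedHashMap<>();
        Matcher m = CASE.matcher(text);
        while (m.find()) {
            // start() is the index of the first matched char; end() is one past the last.
            offsets.put(m.start() + "," + m.end(), m.group());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "USA v. Russia \n Some other text \n India vs. Aus";
        findOffsets(text).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

For that sample string the map holds "0,13" -> "USA v. Russia" and "34,47" -> "India vs. Aus". Note that the end index is exclusive, so subtract one if you need the index of the last matched character.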

4 Comments

Thanks! But I just tested it, and it doesn't cover the cases where there is a newline. Please see rubular.com/r/6xA0SBCLy0
You didn't indicate you wanted any/multiple whitespace. Updated.
Your example also includes "v.State". If you intend to match that as well, change the '\s+' to '\s*'.
I thought my example illustrated any/multiple whitespace. Thanks for the RegExp and Java code, this is what I needed.
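
Putting the comments above together, here is a hedged sketch run against the sample text from the question. The pattern is an assumption, not something posted in the thread: it requires each name to start with a capital letter ([A-Z]\w*), allows any whitespace (including newlines) before "v." or "vs.", and makes the whitespace after the dot optional so that "v.LA" still matches. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class CapitalizedCases {

    // Assumed pattern: capitalized word, whitespace, "v." or "vs.",
    // optional whitespace, capitalized word. \s also matches newlines.
    private static final Pattern CASE =
            Pattern.compile("\\b[A-Z]\\w*\\s+vs?\\.\\s*[A-Z]\\w*");

    // Returns the matched text of every occurrence, in order of appearance.
    static List<String> findCases(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CASE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "random Text\nState v. USA\nsome more text\nUSA v.\nNY\n"
                + "Some more text\nUSA\nv.LA ,  MN v. ND\nUSA vs. MN";
        for (String match : findCases(text)) {
            // Show embedded newlines as "\n" so each match prints on one line.
            System.out.println(match.replace("\n", "\\n"));
        }
    }
}
```

On the question's sample this finds five matches: "State v. USA", "USA v.\nNY", "USA\nv.LA", "MN v. ND" and "USA vs. MN". Changing the first '\s+' to '\s*' as well would also accept text with no space before the "v", at the cost of false positives on words that merely end in "v" followed by a dot.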

Method String.indexOf(String) does exactly what you need.

5 Comments

I might have oversimplified the question and made you think indexOf() would work. I don't know the actual search string beforehand; as the question shows, I am working with RegExp and need a solution using Matcher and find(). If you can, please elaborate on how to find the offset of the above-mentioned pattern "USA v. State" using String.indexOf(String). Thanks!
@S.Singh int start = string.indexOf("USA v. State") will give you the start; int end = start + "USA v. State".length() will give you the end.
I don't know whether it is "USA v. State" or something else. It could be Iraq v. USA or anything. The only thing I know is that it will contain "v." or "vs." Also, I need the offsets of ALL the occurrences, not just the first one. That is why I mentioned a Map as the return value. Let me know if it is not clear.
@S.Singh Well, then you should have said so in your question ;)
@Baz My bad, I thought that would be obvious to RegExp experts when they see the rubular link :)
