1

I'm using a third-party application that uses regex to find matches. It is set to return only the first match, since it is only looking for one piece of information per page. I cannot change this setting unless I want it to return all matches as an array, which I rarely want, and that option doesn't suit the match I want here.

What I want it to find are ID codes. It just so happens that all the IDs start with 10 and are followed by 4 more digits.

Example:

104230

So I wrote this regex

10[0-9]{4}

The only problem with this is that there is a .js file in the header that is named 10022008.js and since it automatically chooses the first match, all the IDs get set to this.

How do you get regex to ignore that string of numbers, and that string only? The similar "ignore"-type patterns I found in my searches have not worked.

5
  • Are the others surrounded by whitespace? Does a . follow any of them? The simple solution is to use \s in the pattern as \s+10[0-9]{4}\s. Post some examples of where the ids would occur. Commented Aug 17, 2012 at 18:13
  • sometimes they are, but sometimes they start with #. It varies too which is annoying Commented Aug 17, 2012 at 18:24
  • 1
    So you don't know your problem itself. Check your extract rules against them, and YOU NEED to tag a REGEX question with the language that you are using!!
  • the most common regex code i use is like this - - - id="imaunicorn"><a href="(.*?)" id="unicornfriend Commented Aug 17, 2012 at 18:48
  • sorry for not being specific @Anirudha, the extract rules are what I define them to be Commented Aug 20, 2012 at 13:58

3 Answers 3

5

Add the "word boundary" regex \b to each end of your regex:

\b10[0-9]{4}\b

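The word boundary matches between any "word" character (i.e. \w, which is [0-9a-zA-Z_]) and any non-word character, or vice versa, and is zero-width, so it won't add any characters to your capture. In 10022008.js the six digits 100220 are followed by another digit, so there is no boundary at that point and the file name is skipped.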
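Here is a minimal sketch of the difference, written in Python only because the question doesn't name the application's regex engine. The page text is made up for illustration, and the app's own flavor may behave differently.

```python
import re

# Hypothetical page text: the script file name appears before the real ID.
page = '<script src="10022008.js"></script> <a href="#104230">item</a>'

plain = re.compile(r"10[0-9]{4}")        # the original pattern
bounded = re.compile(r"\b10[0-9]{4}\b")  # the same pattern with word boundaries

# The unbounded pattern stops at the first six digits of the file name.
first_plain = plain.search(page).group()

# With \b on both ends, the six digits must not touch another word character,
# so the eight-digit file name is skipped and the standalone ID is found.
first_bounded = bounded.search(page).group()

print(first_plain)
print(first_bounded)
```

The bounded pattern still matches an ID preceded by # or followed by a quote or period, since those are non-word characters.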


5 Comments

this one works great! I didn't think of that. I tried a look ahead but it kept canceling all the matches it found and returned nothing
Doesn't \b also match the strings followed by . in .js?
so far it has not, I'm running the crawl right now so I'll know in a few minutes ^_^
@Michael Yes, but in his example the js file has more than 6 chars in its name, so it won't match it
@Bohemian +1 Ah, I didn't look closely at it -I thought the issue was that it was exactly the same length, hence my comment on the OP about whitespace boundaries.
2

Lookahead is one solution. May not be the most efficient, but I think it is the most readable.

10\d{4}(?!08\.js)

This will match 10 followed by any four digits, provided that those digits are not followed by 08.js.
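As a rough illustration, here is the lookahead run in Python against a made-up page string. The asker reports that this pattern did not work in their application, so treat this as a sketch of the regex itself rather than of that tool.

```python
import re

# Hypothetical page text: the script file name appears before the real ID.
page = '<script src="10022008.js"></script> id #104230'

# Negative lookahead: reject a candidate if "08.js" follows the six digits.
pattern = re.compile(r"10\d{4}(?!08\.js)")

match = pattern.search(page)
lookahead_id = match.group() if match else None
print(lookahead_id)

# The lookahead excludes only this one file name. It does not limit the
# length of the number, so the first six digits of a longer run still match.
longer_run = pattern.search("1042301").group()
print(longer_run)
```

Because the lookahead names the one string to skip, any other long digit run starting with 10 would still produce a match.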

1 Comment

this works in the ruby tester I use but not in the app I am using. I'm not entirely sure how regex in Blosm differs from ruby, perl etc
-1

I'm not sure what the input data looks like, but could you limit it to the beginning and end of line?

^10[0-9]{4}$
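A small sketch of when the anchors would and would not match, using Python and invented input strings. Whether ^ and $ apply per line depends on the engine and its flags, so the multiline behaviour shown is an assumption about the setup.

```python
import re

anchored = re.compile(r"^10[0-9]{4}$")
# In Python, re.MULTILINE makes ^ and $ match at every line start and end.
anchored_multiline = re.compile(r"^10[0-9]{4}$", re.MULTILINE)

embedded = 'see <a href="#104230">item</a>'  # ID inside other text
own_line = "header\n104230\nfooter"          # ID alone on its own line

# An ID embedded in surrounding text never matches an anchored pattern.
embedded_hit = anchored_multiline.search(embedded)

# Without the multiline flag, ^ only matches at the very start of the input.
whole_input_hit = anchored.search(own_line)

# With the flag, a line that contains nothing but the ID does match.
per_line_hit = anchored_multiline.search(own_line)

print(embedded_hit, whole_input_hit, per_line_hit.group())
```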

3 Comments

^ and $ would not work cuz the match is somewhere within the file,not at the start...use \b instead!
I guess I didn't understand the question. I assumed the input data was on separate lines, and was being processed as such.
Thanks for answering my first question, but I tried this too and it didn't work. Anirudha is right about using \b. I don't know what I was thinking, but I tried to use \b to specify what I was trying to ignore and not what I was trying to find...
