1

I'm trying to use a regex to detect URLs in all the rows of my table, here's the regex

\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/)))

However, I invariably get the "repetition-operator operand invalid" error, which, after hours of search on the internet, still remains obscure. Where have I gone wrong? What can I do to fix this? And alternaltively, is there a better way to detect URLs in messages in SQL other than a Regex?

Thank you.

5
  • What about Regexp to validate URL in MySQL? Does it answer your question? Commented May 25, 2015 at 10:49
  • @stribizhev I might need a more complex regex, otherwise not all URLs would be detected. The one above worked for me perfectly, and if I can help it, I would like to keep working with it. Commented May 25, 2015 at 10:53
  • @stribizhev Getting the Repetition error for this one. As for the regex you suggested, I tried it, and it only detects a small portions of my URLs. Commented May 25, 2015 at 11:00
  • Looks like the problem is with using ? quantifier, *? and +? cannot be used in MySQL regexp. Also, not sure about non-capturing groups, replace them with capturing ones. Please re-try with [[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/))). Commented May 25, 2015 at 11:07
  • @stribizhev It works this. But the results improved just slightly. Are you sure that the regex hasn't changed at all? (This new syntax boggles me) Commented May 25, 2015 at 11:13

1 Answer 1

1

You cannot use ? quantifier in MySQL regex as the syntax is POSIX-based. Still, you can use * to match 0 or more characters. Also, \b in MySQL regex should be replaced with [[:<:]] (since this matches at the beginning of a word).

Thus, I suggest using

[[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/)))

I am expanding \w to [a-zA-Z0-9_] as it is exactly what \w is. Instead of \s, I am using a literal space. Instead of \d, I am using [0-9]. This is done for readability and better compatibility. If \w, \d and \s work for you, you can use them, but I do not see them among the supported entities in POSIX specs.

Also, instead of literal space, you could use [:space:], it matches space, tab, newline, and carriage return. Instead of [a-zA-Z] you can use [:alpha:], and instead of [0-9], you can use [:digit:]. Please also check this:

[[:<:]](([[:alpha:][:digit:]-]+:\/\/*|www[.])[^[:space:]()<>]+(\([[:alpha:][:digit:]_]+\)|([^[:space:][:punct:]]|\/)))
Sign up to request clarification or add additional context in comments.

2 Comments

It's working flawlessly this time. I'll make sure to read thoroughly through your answer so as to not get stuck twice. Thank you very much. And good day to you.
You are welcome. BTW, I have just found an URL matching regex here and tried to adapt it for you: [[:<:]](https*://)*[a-zA-Z0-9_]+\.(\.*[a-zA-Z0-9_]+)+. No idea if it works better.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.