Regex for matching alphanumeric within a String containing URLs

Question

Given a few scenarios, how can I match and extract alphanumeric characters (and symbols) within a String containing URLs? I'm using Google Apps Script to retrieve the plain body text of hyperlinked text from a Gmail thread message, and I'd like to match and extract the title from Strings such as the following:

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

... in which I'd like to only output: "Testing: Stack Overflow Title 123?"

Here's another scenario:

var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

... again, in which I'd like to only output: "Testing: Stack Overflow Title 123?"

I tried the following: first test whether the String contains a URL (I confirmed that the URL regex works and outputs https://www.stackoverflow.com), then test whether a title exists so that I can extract it. The title test never matches:

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var urlRegex = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;
var titleRegex = /^[a-zA-Z0-9_:?']*$/;
var containsUrl = urlRegex.test(scenario1);
if (containsUrl) {
    var containsTitle = titleRegex.test(scenario1);
    if (containsTitle) { // No match, and doesn't run
      var title = titleRegex.exec(scenario1)[0];
      Logger.log("title: " + title);
    }
}

Basically, I'd like a regex pattern that matches everything except URLs, if possible.

Can there be multiple non-URL substrings? (In which case, would you want an array of those substrings?)
– CertainPerformance, Commented Feb 2, 2019 at 17:37

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-02-02 18:31:48Z

2

We can capture any run of text that does not look like a URL using this regex:

(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)

Explanation:

(?:^|\s+) - Matches either the start of the string or one or more whitespace characters
((?:(?!:\/\/).)*) - Captures text one character at a time, stopping before any position where :// begins, which is what marks a URL
(?=\s|$) - Positive lookahead to ensure the match is followed by whitespace or the end of the string
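
As a rough illustration of how the three parts interact, the sketch below runs the pattern once (without the g flag) and compares the whole match with capture group 1. The variable names and the sample string are made up for illustration and are not part of the original answer.

```javascript
// Same pattern as above, without the g flag, so exec() returns the first match only.
var partsRegex = /(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)/;

// Hypothetical sample: URL, four spaces, a title, then another URL.
var sample = "https://www.stackoverflow.com    Testing2: Title xyz? https://www.stackoverflow.com";
var firstMatch = partsRegex.exec(sample);

// firstMatch[0] is the whole match; it still begins with the spaces consumed by (?:^|\s+).
console.log(JSON.stringify(firstMatch[0])); // "    Testing2: Title xyz?"

// firstMatch[1] is capture group 1, which begins at the first title character.
console.log(JSON.stringify(firstMatch[1])); // "Testing2: Title xyz?"
```

This is why the loop in the demo reads match[1] rather than match[0].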

Demo

This matches and captures any sequential text except URLs. Hope this works for you.

Here is a JavaScript demo.

var arr = ['Testing1: Stack Overflow Title 123? https://www.stackoverflow.com','https://www.stackoverflow.com    Testing2: Stack Overflow Title xyz? https://www.stackoverflow.com Hello this is simple text ftp://www.downloads.com/']

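for (var s of arr) {
	var reg = /(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)/g;
	var match = reg.exec(s);
	while (match != null) {
		// exec() does not advance past a zero-length match, so step forward manually
		if (match[0] === '') reg.lastIndex++;
		// match[1] is the captured text without the leading whitespace
		console.log(match[1]);
		match = reg.exec(s);
	}
}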

Also, since you want to limit the characters allowed in the title, you can replace the . in my regex with your character set [a-zA-Z0-9_:?' ] (I added a space to the set so that spaces are captured too). This is more precise and avoids capturing titles that contain unintended characters:

(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)

Demo with your title character set
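
A minimal sketch of how the restricted pattern might be wrapped for reuse, assuming an array of titles is the desired result. The helper name extractTitles is hypothetical; it uses an exec() loop rather than newer APIs so that it also runs on older engines.

```javascript
// Hypothetical helper: collects every non-URL segment made up only of the
// title characters from the question (plus spaces).
function extractTitles(text) {
  var titleRegex = /(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)/g;
  var titles = [];
  var match;
  while ((match = titleRegex.exec(text)) !== null) {
    // exec() does not advance on a zero-length match, so step forward manually.
    if (match[0] === '') {
      titleRegex.lastIndex++;
    }
    // Trim stray whitespace and skip empty captures.
    var title = match[1].trim();
    if (title !== '') {
      titles.push(title);
    }
  }
  return titles;
}

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(extractTitles(scenario1)); // [ 'Testing: Stack Overflow Title 123?' ]
console.log(extractTitles(scenario2)); // [ 'Testing: Stack Overflow Title 123?' ]
```

Note that a segment containing a character outside the set (a comma, for example) is only partially captured, so the character set may need widening for real message bodies.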

edited Feb 2, 2019 at 18:31

answered Feb 2, 2019 at 18:06

Pushpesh Kumar Rajwanshi


4 Comments

CertainPerformance Over a year ago

This matches the leading whitespaces too, though, which probably isn't desirable.

Pushpesh Kumar Rajwanshi Over a year ago

Group 1 captures the text without the leading whitespace.

Pushpesh Kumar Rajwanshi Over a year ago

I could have used a positive lookbehind so that the whole match contains no extra spaces, but many older tools and browsers don't support ECMAScript 2018, where lookbehind was introduced, so it wouldn't work there; that's why I used a capturing group. The group doesn't contain any extra leading or trailing whitespace, as can be seen in my demo, which should work for the OP.

Mohammed Elhag Over a year ago

@PushpeshKumarRajwanshi +1 perfect

The fourth bird · Accepted Answer · 2019-02-02 20:20:44Z

1

One possibility is to match until you encounter the first URL, using either a capturing group or a positive lookahead.

Using a positive lookahead, that might look like:

\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)

const regexLookahead = /\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)/;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexLookahead)[0]));

Using a capturing group where your value would be in the first capturing group:

(\bTesting: .*?)\s*(?:https?|ftps?):\/\/

const regexGroup = /(\bTesting: .*?)\s*(?:https?|ftps?):\/\//;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexGroup)[1]));
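
Both snippets above index straight into the result of match(), which is null when the input has no "Testing: " prefix or no URL after it. A small null-safe wrapper could look like the sketch below; the helper name titleBeforeUrl is hypothetical.

```javascript
// Hypothetical helper: returns the title that precedes the first URL,
// or null when the pattern does not match at all.
function titleBeforeUrl(text) {
  var m = text.match(/\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)/);
  return m ? m[0] : null;
}

console.log(titleBeforeUrl("https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
// Testing: Stack Overflow Title 123?

// No URL follows the title, so the lookahead can never succeed.
console.log(titleBeforeUrl("Testing: no link here")); // null
```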

If you want to keep everything except the URLs, you could match them and replace them with an empty string:

\s*(?:https?|ftps?):\/\/\S+

[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com test https://www.stackoverflow.com test https://www.stackoverflow.com test",
  "https://www.stackoverflow.com test",
  "test https://www.stackoverflow.com"
].forEach(s => console.log(s.replace(/\s*(?:https?|ftps?):\/\/\S+/g, '').trim()));

edited Feb 2, 2019 at 20:20

answered Feb 2, 2019 at 17:42

The fourth bird


2 Comments

CertainPerformance Over a year ago

This does depend on the non-URL substrings starting at a word boundary, though. Unfortunately, lookbehind (for a space, or for ^) isn't widely enough supported yet, as you probably know, and I'm not sure how I'd fix it.

The fourth bird Over a year ago

@CertainPerformance I see what you mean; I have added a replacing variant as well.

guest271314 · Accepted Answer · 2019-02-02 18:05:49Z

0

You can .split() the string on space characters and .filter() the resulting array to exclude elements that begin with a protocol (a word followed by ://) or that end with a word, a dot, and another word:

const splitURL = s => s.split(' ').filter(w => !/^\w+(?=:\/\/)|\w+\.\w+$/.test(w)).join(' ');

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(splitURL(scenario1), splitURL(scenario2));
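
One thing to watch with this filter: the second alternative (\w+\.\w+$) also drops ordinary tokens that end in word-dot-word, such as version numbers or file names. The sketch below shows that on a made-up input and compares it with a stricter variant that only drops tokens starting with a protocol. Both function names and the example.com URL are assumptions for illustration.

```javascript
// Mirrors the filter from the answer above.
function removeUrlTokens(s) {
  return s.split(' ').filter(function (w) {
    return !/^\w+(?=:\/\/)|\w+\.\w+$/.test(w);
  }).join(' ');
}

// Hypothetical stricter variant: only tokens that start with "<word>://" are dropped.
function removeProtocolTokens(s) {
  return s.split(' ').filter(function (w) {
    return !/^\w+:\/\//.test(w);
  }).join(' ');
}

var text = "Release 1.2 notes https://example.com";
console.log(removeUrlTokens(text));      // Release notes
console.log(removeProtocolTokens(text)); // Release 1.2 notes
```

The stricter variant keeps "1.2", but it would also keep bare domains such as www.stackoverflow.com, so which one fits depends on the message bodies being processed.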

answered Feb 2, 2019 at 18:05

guest271314


1 Comment

guest271314 Over a year ago

Another option is to .replace() the URL with an empty string using the URL regex from the question: s.replace(/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/g, '')
