1

Given a few scenarios, how can I match and extract alphanumeric characters (and symbols) within a String containing URLs? I'm currently using Google Apps Script to retrieve the plain body text of hyperlinked text from a Gmail thread message, and I'd like to match and extract the title from Strings such as the following:

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

... in which I'd like to only output: "Testing: Stack Overflow Title 123?"

Here's another scenario:

var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

... again, in which I'd like to only output: "Testing: Stack Overflow Title 123?"

I've tried the following. It first tests whether the String contains a URL (I confirmed that the URL regex works and outputs https://www.stackoverflow.com), and then tests whether a title exists so that it can be extracted, but to no avail:

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var urlRegex = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;
var titleRegex = /^[a-zA-Z0-9_:?']*$/;
var containsUrl = urlRegex.test(scenario1);
if (containsUrl) {
    var containsTitle = titleRegex.test(scenario1);
    if (containsTitle) { // No match, and doesn't run
      var title = titleRegex.exec(scenario1)[0];
      Logger.log("title: " + title);
    }
}

Basically, I'd like a regex pattern that matches EVERYTHING but URLs, if possible.

2
  • Can there be multiple non-URL substrings? (in which case, would you want an array of those substrings?) Commented Feb 2, 2019 at 17:37
  • Will all URLs begin with a protocol? Commented Feb 2, 2019 at 18:02

3 Answers

2

We can capture any run of text that doesn't look like a URL using this regex:

(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)

Explanation:

  • (?:^|\s+) - Matches either the start of the string or one or more whitespace characters
  • ((?:(?!:\/\/).)*) - Captures any run of characters that does not contain the literal ://, which is what identifies a URL here
  • (?=\s|$) - Positive lookahead that ensures the match is followed by whitespace or the end of the string

Demo

This matches and captures any sequential text except URLs. Hope this works for you.

Here is a JavaScript demo.

const arr = ['Testing1: Stack Overflow Title 123? https://www.stackoverflow.com','https://www.stackoverflow.com    Testing2: Stack Overflow Title xyz? https://www.stackoverflow.com Hello this is simple text ftp://www.downloads.com/'];

for (const s of arr) {
	// A fresh global regex per string, so lastIndex starts at 0
	const reg = /(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)/g;
	let match = reg.exec(s);
	while (match !== null) {
		// Group 1 holds the text without the leading whitespace
		console.log(match[1]);
		match = reg.exec(s);
	}
}

Also, since you seem to want to limit the characters allowed in the title, you can use your character set [a-zA-Z0-9_:?' ] in place of . in my regex (I added a space to the set so that spaces are captured as well). This is more precise and avoids capturing titles that contain unintended characters:

(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)

Demo with your title character set
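As a rough sketch of how the restricted character set behaves on the two scenarios from the question, the snippet below wraps the regex in a helper. The name extractTitles is hypothetical, and using matchAll (rather than an exec loop) is an assumption made here so that a zero-length match cannot stall the iteration; empty captures are dropped.

```javascript
// Hypothetical helper (not from the answer): returns every non-empty group-1 capture.
function extractTitles(text) {
  // Same pattern as above, with the question's character set plus a space
  const titleRegex = /(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)/g;
  // matchAll advances past zero-length matches on its own
  return Array.from(text.matchAll(titleRegex), m => m[1]).filter(t => t.length > 0);
}

const scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
const scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(extractTitles(scenario1)); // [ 'Testing: Stack Overflow Title 123?' ]
console.log(extractTitles(scenario2)); // [ 'Testing: Stack Overflow Title 123?' ]
```

Returning an array keeps the helper usable when a string holds more than one non-URL segment.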


4 Comments

This matches the leading whitespace too, though, which probably isn't desirable.
Group 1 captures the text without the leading whitespace.
I could have used a positive lookbehind so that the whole match contains no extra spaces, but many older tools and browsers don't support ECMAScript 2018, so it wouldn't work there, and I had to use a capturing group instead. The group doesn't contain any extra leading or trailing whitespace, as can be seen in my demo, which should work for the OP.
@PushpeshKumarRajwanshi +1 perfect
1

One possibility is to match until you encounter the first URL, using either a capturing group or a positive lookahead.

Using a positive lookahead that might look like:

\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)

const regexLookahead = /\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)/;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexLookahead)[0]));

Using a capturing group where your value would be in the first capturing group:

(\bTesting: .*?)\s*(?:https?|ftps?):\/\/

const regexGroup = /(\bTesting: .*?)\s*(?:https?|ftps?):\/\//;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexGroup)[1]));
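Both snippets above index straight into the result of match(), which throws a TypeError when a string has no "Testing: " prefix or no URL after it. A minimal null-safe sketch of the capturing-group variant follows; the helper name titleBeforeUrl is hypothetical and not part of the answer.

```javascript
// Hypothetical helper: returns the captured title, or null when the pattern does not match.
function titleBeforeUrl(text) {
  const m = text.match(/(\bTesting: .*?)\s*(?:https?|ftps?):\/\//);
  return m ? m[1] : null;
}

console.log(titleBeforeUrl("Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
// "Testing: Stack Overflow Title 123?"
console.log(titleBeforeUrl("Testing: no link here"));
// null
```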

If you want to keep everything except the URLs, you could match them and replace them with an empty string:

\s*(?:https?|ftps?):\/\/\S+

[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com test https://www.stackoverflow.com test https://www.stackoverflow.com test",
  "https://www.stackoverflow.com test",
  "test https://www.stackoverflow.com"
].forEach(s => console.log(s.replace(/\s*(?:https?|ftps?):\/\/\S+/g, '').trim()));
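If the non-URL pieces are wanted as separate strings rather than one joined string, a variation is to split on the URL pattern instead of replacing it. This is only a sketch: the helper name nonUrlSegments is hypothetical, and the trailing \s* added to the pattern is an assumption made here so that each segment comes out without surrounding spaces.

```javascript
// Hypothetical helper: splits on URLs and returns the remaining text segments.
function nonUrlSegments(text) {
  return text
    .split(/\s*(?:https?|ftps?):\/\/\S+\s*/) // the URL and its surrounding whitespace act as the separator
    .map(part => part.trim())
    .filter(part => part.length > 0);        // drop the empty pieces left at the edges
}

console.log(nonUrlSegments("Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
// [ 'Testing: Stack Overflow Title 123?' ]
console.log(nonUrlSegments("https://www.stackoverflow.com test1 https://www.stackoverflow.com test2"));
// [ 'test1', 'test2' ]
```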

2 Comments

This does depend on the non-URL substrings starting at a word boundary, though, and unfortunately lookbehind (for a space, or for ^) isn't sufficiently widely supported yet, as you probably know. I'm not sure how I'd fix it.
@CertainPerformance I see what you mean; I have added a replacing variant as well.
0

You can .split() the string on space characters and .filter() the resulting array to exclude elements that either begin with a protocol followed by :// or end with a word, a dot character, and another word at the end of the element.

const splitURL = s => s.split` `.filter(w => !/^\w+(?=:\/\/)|\w+\.\w+$/.test(w)).join` `;
 
var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(splitURL(scenario1), splitURL(scenario2));
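One thing to keep in mind with the second alternative (\w+\.\w+$) is that it also drops ordinary tokens that merely look like word.word, such as a version number. The sketch below contrasts it with a stricter filter that only drops tokens starting with a known protocol; splitProtocolOnly and the sample string are made up for illustration.

```javascript
// The filter from the answer, written with regular string arguments
const splitURL = s => s.split(" ").filter(w => !/^\w+(?=:\/\/)|\w+\.\w+$/.test(w)).join(" ");

// Hypothetical stricter variant: drop only tokens that start with a known protocol
const splitProtocolOnly = s => s.split(" ").filter(w => !/^(?:https?|ftps?):\/\//.test(w)).join(" ");

const sample = "Release 3.14 notes https://www.stackoverflow.com";

console.log(splitURL(sample));          // "Release notes" (the token 3.14 is removed too)
console.log(splitProtocolOnly(sample)); // "Release 3.14 notes"
```

Which behavior is preferable depends on whether protocol-less links such as www.stackoverflow.com can appear in the input.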

1 Comment

Another option is to .replace() the URL with an empty string using the URL regex: s.replace(/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/g, '')
