Parse part of a URL with a regex in Python

Question

I have data as follows:

data

url
http://hostname.com/part1/part2/part3/a+b+c+d
http://m.hostname.com/part3.html?nk!e+f+g+h&_junk
http://hostname.com/as/ck$st=f+g+h+k+i/
http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa

I want to find the first + symbol in the URL, move backward until I reach a special character such as /, ?, = or any other special character, and then take everything from there forward until I reach a space, the end of the line, & or /. My output should be:

parsed
abcd
efgh
fghki
qwert

In short, my aim is to find the first + in the URL, go backward until a special character, and go forward until the end of the line, a space or an & symbol.

I am new to regex and still learning it, and since this is a bit complex I am finding it difficult to write. Can anybody help me write a regex in Python to parse these out?

Thanks

alecxe · Accepted Answer · 2016-08-23 22:21:09Z

1

Here is the expression that works for your sample use cases:

>>> import re
>>>
>>> l = [
...     "http://hostname.com/part1/part2/part3/a+b+c+d",
...     "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
...     "http://hostname.com/as/ck$st=f+g+h+k+i/",
...     "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa"
... ]
>>>
>>> pattern = re.compile(r"[^\w\+]([\w\+]+\+[\w\+]+)(?:[^\w\+]|$)")
>>> for item in l:
...     print("".join(pattern.search(item).group(1).split("+")))
... 
abcd
efgh
fghki
qwert

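The idea is to capture a run of word characters (\w: letters, digits, underscore) and plus characters that contains at least one plus, where the run is preceded by a character that is neither a word character nor a plus, and followed by such a character or the end of the string. Then split the captured group on plus and join the pieces.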

Regex101 link.

I have a feeling that it can be further simplified/improved.
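One possible simplification, offered as a sketch rather than a drop-in replacement: match runs of word characters joined by single plus signs directly, and remove the plus signs with str.replace(). The names PLUS_RUN and join_plus_parts are hypothetical and not part of the expression above. Unlike the original pattern, this one does not accept a leading or trailing plus inside the run, and it returns None instead of raising when a URL has no match.

```python
import re

# Hypothetical helper (an assumption, not the answer's original pattern):
# one word-character run followed by one or more "+run" pieces, e.g. "a+b+c+d".
PLUS_RUN = re.compile(r"\w+(?:\+\w+)+")

def join_plus_parts(url):
    """Return the first '+'-joined run with the '+' removed, or None."""
    match = PLUS_RUN.search(url)
    if match is None:
        return None
    return match.group(0).replace("+", "")

urls = [
    "http://hostname.com/part1/part2/part3/a+b+c+d",
    "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
    "http://hostname.com/as/ck$st=f+g+h+k+i/",
    "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa",
]
for url in urls:
    print(join_plus_parts(url))  # abcd, efgh, fghki, qwert
```

The surrounding delimiters do not need to be spelled out here because \w+ cannot extend across /, ?, =, & or a space, so the match stops at them on its own.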


2 Comments

Observer Over a year ago

Suppose I have something like hostname.com/as/ck$st=f+g-h-k-i. This one gives f+g. I know this was not the original requirement, but just asking: can we make it check for - as well after the first +? And ideally not limit it to the '-' character alone, but allow other special characters too?

alecxe Over a year ago

@Observer you should add - to the character classes in the expression. E.g. with the pattern defined as pattern = re.compile(r"[^\w\+\-]([\w\+\-]+(?:\+|-)[\w\+\-]+)(?:[^\w\+\-]|$)") and re.split() used to split on multiple delimiters, print("".join(re.split(r"\+|-", pattern.search("http://hostname.com/as/ck$st=f+g-h-k-i/").group(1)))) would produce fghki. Hope that helps.
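Laid out as a runnable snippet, the suggestion in this comment might look like the sketch below. The pattern and the re.split() call are taken from the comment; the wrapper function join_parts and the None check are additions for illustration.

```python
import re

# Pattern from the comment: the captured run may contain "+" or "-".
pattern = re.compile(r"[^\w\+\-]([\w\+\-]+(?:\+|-)[\w\+\-]+)(?:[^\w\+\-]|$)")

def join_parts(url):
    """Hypothetical wrapper: join the pieces of the first '+'/'-' run."""
    match = pattern.search(url)
    if match is None:
        return None
    # re.split() handles both delimiters in one pass.
    return "".join(re.split(r"\+|-", match.group(1)))

print(join_parts("http://hostname.com/as/ck$st=f+g-h-k-i/"))  # fghki
print(join_parts("http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa"))  # plk
```

The second call shows a side effect worth knowing about: once - counts as a delimiter, a hyphenated path segment such as p-l-k is matched before the q+w+e+r+t part, so this variant no longer returns qwert for the fourth sample URL.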

Vasif · Accepted Answer · 2016-08-23 22:17:13Z

1

The regex that matches the characters you want is ((.\+)+.). I am using a JavaScript regex here, but you should be able to do the same in Python.

This regex extracts a+b+c+d from your first URL. It needs a little more processing to get abcd from a+b+c+d.

I will update this with a Python function in a bit.
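A Python version of this idea could look like the following sketch. The function name extract is hypothetical, and the inner group is written as non-capturing because only the whole match is used.

```python
import re

# Python spelling of the JavaScript pattern ((.\+)+.): one or more
# "any character followed by +" pairs, then one final character.
pattern = re.compile(r"(?:.\+)+.")

def extract(url):
    """Hypothetical helper: first match with the '+' signs removed."""
    match = pattern.search(url)
    return match.group(0).replace("+", "") if match else None

print(extract("http://hostname.com/part1/part2/part3/a+b+c+d"))  # abcd
print(extract("http://m.hostname.com/part3.html?nk!e+f+g+h&_junk"))  # efgh
print(extract("x=foo+bar"))  # ob
```

Because . matches exactly one character, this only works when every piece between the plus signs is a single character. For multi-character words such as foo+bar it keeps just the characters adjacent to the plus, as the last call shows.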


5 Comments

Observer Over a year ago

This gives an output of a+b, not a+b+c+d.

Observer Over a year ago

Can you consider a, b, c, d as individual words? I used single letters here for simplification. Your regex takes one letter from each side of the +. I want to take whole words on both sides, until we reach a special character.

Vasif Over a year ago

I mean, I just did this in the console: var a = new RegExp(/((.\+)+.)/); a.exec('http://m.hostname.com/part3.html?nk!e+f+g+h&_junk')

Vasif Over a year ago

I guess you got the complete answer above.

Observer Over a year ago

Yes, I got it. Thanks a lot :)
