
I have data as follows:

data

url
http://hostname.com/part1/part2/part3/a+b+c+d
http://m.hostname.com/part3.html?nk!e+f+g+h&_junk
http://hostname.com/as/ck$st=f+g+h+k+i/
http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa

I want to find the first + symbol in the URL, move backward until I hit a special character such as /, ?, = or any other special character, and then take everything from there forward until I hit a space, the end of the line, & or /. My output should be:

parsed
abcd
efgh
fghki
qwert

My aim is to find the first + in the URL, go back until a special character is found, and go forward until the end of the line, a space, or an & symbol.

I am new to regex and still learning it, and since this is a bit complex, I am finding it difficult to write. Can anybody help me write a regex in Python to parse these out?

Thanks

2 Answers

1

Here is the expression that works for your sample use cases:

>>> import re
>>>
>>> l = [
...     "http://hostname.com/part1/part2/part3/a+b+c+d",
...     "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
...     "http://hostname.com/as/ck$st=f+g+h+k+i/",
...     "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa"
... ]
>>>
>>> pattern = re.compile(r"[^\w\+]([\w\+]+\+[\w\+]+)(?:[^\w\+]|$)")
>>> for item in l:
...     print("".join(pattern.search(item).group(1).split("+")))
... 
abcd
efgh
fghki
qwert

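The idea is to capture a run of word characters (\w, which also matches underscores) and plus signs that contains at least one plus. The run must be preceded by a character that is neither a word character nor a plus, and followed by another such character or the end of the string. Then split the captured run on the plus and join the pieces.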
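For reuse outside the REPL, the same idea can be wrapped in a small function. This is only a sketch: the names `PLUS_RUN` and `parse_plus_terms` are made up for illustration, and returning `None` when nothing matches is an assumption about what you want for URLs without a `+`. The pattern is the one above, just without the backslashes before `+` inside the character classes, where they are not needed.

```python
import re

# Same pattern as in the answer: a run of word characters and pluses that
# contains at least one plus, preceded by any other character and followed
# by any other character or the end of the string.
PLUS_RUN = re.compile(r"[^\w+]([\w+]+\+[\w+]+)(?:[^\w+]|$)")


def parse_plus_terms(url):
    """Return the plus-separated terms joined together, or None if no run is found.

    Hypothetical helper; the None fallback is an assumption, not part of the
    original answer.
    """
    match = PLUS_RUN.search(url)
    if match is None:
        return None
    # Dropping every "+" is equivalent to splitting on "+" and joining.
    return match.group(1).replace("+", "")


urls = [
    "http://hostname.com/part1/part2/part3/a+b+c+d",
    "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
    "http://hostname.com/as/ck$st=f+g+h+k+i/",
    "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa",
]
for url in urls:
    print(parse_plus_terms(url))
```

Checking for `None` avoids the `AttributeError` that `pattern.search(item).group(1)` would raise on a URL with no plus-separated run.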

Regex101 link.

I have a feeling that it can be further simplified/improved.


2 Comments

Suppose I have something like hostname.com/as/ck$st=f+g-h-k-i. This one gives f+g. I know this was not the original requirement, but just asking: can we make it accept - as well after finding the first +, and not limit it to the '-' character alone?
@Observer You should add - to the expression appropriately. For example, with the pattern defined as pattern = re.compile(r"[^\w\+\-]([\w\+\-]+(?:\+|-)[\w\+\-]+)(?:[^\w\+\-]|$)") and re.split() used to split on multiple delimiters, print("".join(re.split(r"\+|-", pattern.search("http://hostname.com/as/ck$st=f+g-h-k-i/").group(1)))) would produce fghki. Hope that helps.
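One thing to watch with the hyphen-aware pattern from the comment: because a hyphen alone satisfies it, a hyphenated path segment such as p-l-k in the fourth sample URL would match before the plus-separated run. A possible variant, sketched below, still requires the first separator in the run to be a plus. The names `MIXED_RUN` and `parse_mixed_terms` are hypothetical, and "the run must contain a plus before any other separator counts" is my reading of the follow-up question, not something it states.

```python
import re

# Hyphens are allowed inside the run, but the first separator must be a plus:
# [\w-]+ covers the leading term, \+ is the mandatory first plus, and
# [\w+-]+ covers the rest of the run (pluses and hyphens mixed).
MIXED_RUN = re.compile(r"[^\w+-]([\w-]+\+[\w+-]+)(?:[^\w+-]|$)")


def parse_mixed_terms(url):
    """Join the terms of the first run that contains a plus, or return None.

    Hypothetical helper; requiring a plus is an assumption about the intent.
    """
    match = MIXED_RUN.search(url)
    if match is None:
        return None
    # Split on either delimiter, then glue the terms back together.
    return "".join(re.split(r"[+-]", match.group(1)))


print(parse_mixed_terms("http://hostname.com/as/ck$st=f+g-h-k-i/"))
print(parse_mixed_terms("http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa"))
```

To accept further separators, add them to each character class and to the `re.split()` pattern.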
1

A regex that extracts the characters you want is ((.\+)+.). I am using a JavaScript regex here, but you should be able to implement it in Python as well.

This regex extracts a+b+c+d from your first URL. It needs a little more processing to get abcd from a+b+c+d.

I will update this with a Python function in a bit.
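In the meantime, here is a rough sketch of what a Python version could look like. It is a direct translation of the pattern above rather than a finished solution, and the names `SINGLE_CHAR_RUN` and `extract_single_char_run` are made up for illustration. Because each `.` matches exactly one character, this only works when every term is a single character, as in the sample data.

```python
import re

# Direct translation of the JavaScript pattern: one or more
# "any character followed by +" pairs, then one final character.
SINGLE_CHAR_RUN = re.compile(r"((.\+)+.)")


def extract_single_char_run(url):
    """Return the matched run with the pluses removed, or None if no match.

    Hypothetical helper; only suitable for single-character terms.
    """
    match = SINGLE_CHAR_RUN.search(url)
    if match is None:
        return None
    return match.group(1).replace("+", "")


print(extract_single_char_run("http://hostname.com/part1/part2/part3/a+b+c+d"))
# With longer terms only the characters next to the plus are captured:
print(extract_single_char_run("http://x.com/?q=foo+bar"))
```

For terms longer than one character, a pattern built on `\w` runs, as in the other answer, is the safer choice.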

5 Comments

This gives an output of a+b, not a+b+c+d.
Can you consider a, b, c, d as individual words? I used single letters here for simplification. Your pattern takes one letter from each side of the +; I want to take the whole words, until we find a special character.
I mean, I just did this in the console: var a = new RegExp(/((.\+)+.)/); a.exec('http://m.hostname.com/part3.html?nk!e+f+g+h&_junk')
I guess you got the complete answer above.
Yes, I got it. Thanks a lot :)
