Find and display links to specified URLs using regex

Question

I am trying to extract only links to particular sites. I wrote the following after sifting through this site for hours, but it does not work well for me.

match = re.compile('<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',re.DOTALL).findall(html)
for title in match:
    print '<a href="'+title+'>'+title+'</a>'

The method above gives this error:

    print '<a href="'+title+'>'+title+'</a>'
TypeError: cannot concatenate 'str' and 'tuple' objects

If I simply use "print title" instead, I get the following ugly result:

('https://www.', 'youtube', 'com/watch?v=gm2SGfjvgjM')

All of the scraped links look like this:

<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM"

I'm hoping to have it print like the following:

<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">youtube</a>
<a href="http://www.dailymotion.com/video/x5zuvuu">dailymotion</a>

I have been playing with Python for a while, but I struggle a lot. FYI, I've spent endless hours trying to figure out Beautiful Soup but just don't get it. I would appreciate any help on this without changing the method totally, if possible. Thanks for any help.

I will try that, Dani. Thanks, I have not seen that site before. What would be the benefit of testing there as opposed to running in IDLE?
– Bobby Peters, Commented Sep 11, 2017 at 0:16
The reason you get the error is that you are trying to concatenate a tuple and a string. If you are not sure at what point title stops being a string (though you can try figuring that out yourself), Python Tutor can help by showing you visually, one by one, the steps the program takes.
– user5306470, Commented Sep 11, 2017 at 0:18
Also, there is probably a solution that does not use regex, and you should definitely try that. stackoverflow.com/a/7553730/5306470
– user5306470, Commented Sep 11, 2017 at 0:19
Regex is not ideal for parsing HTML. Use an HTML parser like BeautifulSoup.
– Mark Tolonen, Commented Sep 11, 2017 at 2:50

cs95 · Accepted Answer · 2017-09-11 00:34:18Z

1

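Your pattern seems okay. The problem is with the capturing groups inside it: findall returns one tuple of groups per match, which is why concatenating it with a string fails. Make the inner groups non-capturing with ?: and wrap the whole URL in a single capturing group, so each match comes back as one string.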

p = re.compile(r'<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'
               r'(?:youtu|www.youtube|youtube|vimeo|dailymotion|)'
               r'\.(?:.+?))"', re.DOTALL)
match = p.findall(html)
for title in match:
    # title is now a plain string holding the full URL
    print '<a href="' + title + '">' + title + '</a>'
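To see why the original loop received tuples, here is a minimal sketch of how `re.findall` shapes its results depending on the number of capturing groups. The `html` string is a made-up sample, and both patterns are simplified for illustration rather than copied from the question.

```python
import re

# Hypothetical sample input, not taken from the question.
html = '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">'

# Three capturing groups: findall returns one tuple of three strings per match.
three_groups = re.findall(
    r'<a href="(https?://(?:www\.)?)(youtube|vimeo)\.(.+?)"', html)

# One capturing group around the whole URL: findall returns plain strings.
one_group = re.findall(
    r'<a href="((?:https?://(?:www\.)?)(?:youtube|vimeo)\.(?:.+?))"', html)

print(three_groups)
print(one_group)
```

With three groups each result is a tuple, which cannot be concatenated with a string; with a single group each result is already a string.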

To retain the link as well as the domain name, another small change is needed: capture the whole expression and the website name as two separate groups (the former also contains the latter):

p = re.compile(r'<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'
               r'(youtu|www.youtube|youtube|vimeo|dailymotion|)'
               r'\.(?:.+?))"', re.DOTALL)

match = p.findall(html)
for title in match:
    # title[0] is the full URL, title[1] is the site name
    print '<a href="' + title[0] + '">' + title[1] + '</a>'

Access the groups by title[i].
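Run on sample input, the two-group idea produces the output asked for in the question. This is a sketch under assumptions: the `html` string is invented, and the pattern is a tidied variant (`https?://(?:www\.)?` in place of the four spelled-out prefixes, and `[^"]+` for the rest of the URL) rather than the exact pattern above. Calling `print(...)` with a single argument runs on both Python 2 and Python 3.

```python
import re

# Hypothetical sample page, not from the question.
html = ('<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">'
        '<a href="http://www.dailymotion.com/video/x5zuvuu">'
        '<a href="http://example.com/other">')

# Group 1: the whole URL. Group 2 (nested inside group 1): the site name.
pattern = re.compile(
    r'<a href="(https?://(?:www\.)?(youtube|youtu|vimeo|dailymotion)\.[^"]+)"'
)

links = []
for url, site in pattern.findall(html):
    # Each result is a 2-tuple, so it unpacks cleanly into url and site.
    links.append('<a href="' + url + '">' + site + '</a>')

for link in links:
    print(link)
```

Unpacking works here because the pattern has exactly two capturing groups; a "too many values to unpack" error means the number of names on the left does not match the number of groups.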

edited Sep 11, 2017 at 0:34

answered Sep 11, 2017 at 0:16

4 Comments

Bobby Peters Over a year ago

That works pretty darn well, COLDSPEED. Thanks for your help. If I'm not pushing my luck, could you help with adding the host name as the link title? I have always been under the impression that every instance of "(.+? or whatever)" would be a match I could name and print, but in this case when I give it a name it tells me there are too many values to unpack. Any insight as to why it's not a match would be useful too. Thanks so much.

cs95 Over a year ago

@BobbyPeters Made an edit. Take a look and see if it works.

cs95 Over a year ago

@BobbyPeters Note that, if you pass capturing groups to findall, only the capture groups are returned. Knowing how this works helps you work around it.

Bobby Peters Over a year ago

That works flawlessly, I really appreciate your help. I now understand "group". I have been playing with Python for a couple of years but am not made to be a programmer; I have ADHD and have a hard time reading through documentation. I learn best by playing with other people's code. Thanks again :) Means a lot to me.

y.luis.rojo · Accepted Answer · 2017-09-11 00:35:36Z

1

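You can rebuild the URL from the pieces of the tuple:

print '<a href="' + title[0] + title[1] + '.' + title[2] + '">' + title[1] + '</a>'

Each match is a tuple in which every element is one capturing group. Concatenate the elements to form the URL, adding back the dot between the second and third groups (the pattern matches it outside any group, so a plain ''.join(title) would drop it). The second element is the group you want to use as the link text.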
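A quick check of what the tuple pieces contain, using the tuple shown in the question. Because the `\.` in the question's pattern sits outside every group, a plain `''.join(title)` comes out without that dot, so this sketch puts it back explicitly. The variable names `joined`, `url`, and `link` are illustrative only.

```python
# One result tuple, copied from the output shown in the question.
title = ('https://www.', 'youtube', 'com/watch?v=gm2SGfjvgjM')

# A plain join loses the dot that the pattern matched outside the groups.
joined = ''.join(title)

# Rebuild the URL with the dot restored between the site name and the rest.
url = title[0] + title[1] + '.' + title[2]

link = '<a href="' + url + '">' + title[1] + '</a>'
print(joined)
print(link)
```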

edited Sep 11, 2017 at 0:35

answered Sep 11, 2017 at 0:27

2 Comments

Bobby Peters Over a year ago

Thank you, y.luis. I had never seen the word tuple till that error came up. I appreciate the knowledge you share. :)

y.luis.rojo Over a year ago

You're welcome. Great documentation about tuples: openbookproject.net/thinkcs/python/english3e/tuples.html
