I am trying to extract only the links that point to particular sites. I put the following together after sifting through this site for hours, but it does not work well for me.

match = re.compile('<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',re.DOTALL).findall(html)
for title in match:
    print '<a href="'+title+'>'+title+'</a>'

The method above gives this error:

    print '<a href="'+title+'>'+title+'</a>'
TypeError: cannot concatenate 'str' and 'tuple' objects

And if I simply put "print title" I get the following ugly result:

('https://www.', 'youtube', 'com/watch?v=gm2SGfjvgjM')

All the scraped links will look like this:

<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM"

I'm hoping to have it print like the following:

<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">youtube</a>
<a href="http://www.dailymotion.com/video/x5zuvuu">dailymotion</a>

I have been playing with Python for a while, but I struggle a lot. FYI, I have spent endless hours trying to figure out Beautiful Soup but just don't get it. I would appreciate any help on this without changing the method totally, if possible. Thanks for any help.

  • Try running your code here: pythontutor.com Commented Sep 11, 2017 at 0:09
  • I will try, Dani. Thanks, I have not seen that site before. What would be the benefit of testing in there as opposed to running in IDLE? Commented Sep 11, 2017 at 0:16
  • The reason you get the error is that you are trying to concatenate tuples and strings. If you are not sure at what point title stops being a string (though you can try figuring that out yourself), Python Tutor can help by showing you the steps the program takes, visually, one by one. Commented Sep 11, 2017 at 0:18
  • Also, there probably is a solution without using regex, and you should definitely try that. stackoverflow.com/a/7553730/5306470 Commented Sep 11, 2017 at 0:19
  • Regex is not ideal for parsing HTML. Use an HTML parser like BeautifulSoup. Commented Sep 11, 2017 at 2:50

2 Answers

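Your pattern seems okay. The problem is with the capturing groups inside it: when a pattern has more than one capturing group, findall returns a tuple of the groups for each match. Make the inner groups non-capturing with ?: and wrap the whole URL in a single capturing group, so that each match comes back as one string.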

p = re.compile('<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'\
                         '(?:youtu|www.youtube|youtube|vimeo|dailymotion|)'\
                         '\.(?:.+?))"',re.DOTALL)
match = p.findall(html)
for title in match:
    # title is a plain string here; note the closing quote after the URL
    print '<a href="' + title + '">' + title + '</a>'
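To see why the original loop failed, here is a minimal sketch in Python 3 syntax of what findall returns depending on the number of capturing groups. The html string is a made-up sample, and both patterns are simplified stand-ins for the ones in this thread, not the originals.

```python
import re

# Hypothetical sample input; the real page HTML is not shown in the question.
html = '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">'

# Three capturing groups: findall returns a list of 3-tuples.
three_groups = re.findall(
    r'<a href="(https?://(?:www\.)?)(youtube|vimeo|dailymotion)\.(.+?)"', html
)

# One capturing group around the whole URL: findall returns a list of strings.
one_group = re.findall(
    r'<a href="((?:https?://(?:www\.)?)(?:youtube|vimeo|dailymotion)\.(?:.+?))"', html
)

print(three_groups)
print(one_group)
```

With the first pattern each item is a tuple, which is what triggered the TypeError when it was concatenated with a string; with the second, each item can be concatenated directly.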

To retain the link as well as the domain name, another small change is needed: capture the whole expression and the website name as two separate groups (the former also contains the latter):

p = re.compile('<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'\
                         '(youtu|www.youtube|youtube|vimeo|dailymotion|)'\
                         '\.(?:.+?))"',re.DOTALL)

match = p.findall(html)
for title in match:
    # title[0] is the full URL, title[1] is the site name
    print '<a href="' + title[0] + '">' + title[1] + '</a>'

Access the groups by title[i]: title[0] is the full URL and title[1] is the site name.
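The two-group idea can be tried end to end with the sketch below, written in Python 3 syntax. The pattern is a simplified variant rather than the exact one above, and format_links, LINK_RE and the sample HTML are hypothetical names and data introduced only for illustration.

```python
import re

# Simplified variant of the two-group pattern (an assumption, not the original):
# group 1 is the whole URL, group 2 is the site name nested inside it.
LINK_RE = re.compile(
    r'<a href="((?:https?://)(?:www\.)?(youtube|youtu|vimeo|dailymotion)\.[^"]+)"'
)

def format_links(html):
    """Return one '<a href="URL">site</a>' string per matching link."""
    return ['<a href="{0}">{1}</a>'.format(url, site)
            for url, site in LINK_RE.findall(html)]

# Hypothetical sample HTML for illustration.
sample = (
    '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM" class="x">a</a>\n'
    '<a href="http://www.dailymotion.com/video/x5zuvuu">b</a>\n'
    '<a href="https://example.com/page">c</a>\n'
)

for line in format_links(sample):
    print(line)
```

Unpacking each match as url, site works because every item returned by findall is a 2-tuple here; links to other hosts, such as the example.com one, are simply not matched.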

4 Comments

That works pretty darn well, COLDSPEED. Thanks for your help. If I'm not pushing my luck, could you help with adding the host name as the link title? I have always been under the impression that every instance of "(.+? or whatever)" would be a match I could name and print, but in this case when I give it a name it tells me there are too many values to unpack. Any insight as to why it's not a match would be useful info too. Thanks so much.
@BobbyPeters Made an edit. Take a look and see if it works.
@BobbyPeters Note that, if you pass capturing groups to findall, only the capture groups are returned. Knowing how this works helps you work around it.
That works flawlessly, I really appreciate your help. I now understand "group". I have been playing with Python for a couple of years but am not made to be a programmer. I have ADHD and have a hard time reading through documentation; I learn best by playing with other people's code. Thanks again :) It means a lot to me.

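You can simply use:

print '<a href="' + title[0] + title[1] + '.' + title[2] + '">' + title[1] + '</a>'

Each match is a tuple in which each element is one capturing group. Concatenating the elements rebuilds the URL, but the literal dot between the site name and the rest of the address is matched outside the groups, so it has to be added back by hand (a plain ''.join(title) would give "youtubecom"). The second element is the group you want to use to name the link.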
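One caveat when rebuilding the URL from the tuple, sketched here in Python 3 syntax with the tuple printed in the question: the \. in the original pattern sits outside every capturing group, so joining the groups alone drops that dot. The variable names joined, url and link are only illustrative.

```python
# The tuple printed in the question for one match.
title = ('https://www.', 'youtube', 'com/watch?v=gm2SGfjvgjM')

# Joining the groups alone loses the dot matched by the uncaptured \. in the pattern.
joined = ''.join(title)

# Re-inserting the dot between the site name and the rest restores the URL.
url = title[0] + title[1] + '.' + title[2]

link = '<a href="{0}">{1}</a>'.format(url, title[1])
print(joined)
print(link)
```

An alternative that avoids the manual dot is to move \. inside the third group of the pattern, at which point ''.join(title) reproduces the full URL.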

2 Comments

Thank you, y.luis. I had never seen the word tuple until that error came up. I appreciate the knowledge you share. :)
You're welcome. Great documentation about tuples: openbookproject.net/thinkcs/python/english3e/tuples.html
