Search and extract a URL from a text file

Question

I'm looking to grab a URL that begins with http:// or https:// from a text file that also contains other unrelated text and transfer it to another file/list.

    def test():
        with open('findlink.txt') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                if "https://" in line:
                    outfile.write(line[line.find("https://"): line.find("")])
            print("Done")

The code currently does nothing.

Edit: I see this is being negatively voted like usual, is there anything I can add here?

This is not a duplicate, please re-read carefully.

What is expected out of this outfile.write(line[line.find("https://"): line.find("")])? — mad_
– mad_, Commented Feb 5, 2019 at 21:13
It is expected to separate the URL from other unrelated text. Picture a file with contents like this lorem ipsum https://stackoverflow.com/questions/54543095/search-and-extract-a-url-from-a-text-file dolor sit amet There may or may not be text written after the URL so line.find(" ") would not be useful here. — Dann
– Dann, Commented Feb 5, 2019 at 21:15
The second part of your slice line.find("") this returns 0 that will completely mess up the slice. use re — Jab
– Jab, Commented Feb 5, 2019 at 21:16
Yes @Jaba, I'm looking for the proper solution to fix that. Leaving that out won't return only the URL like needed. — Dann
– Dann, Commented Feb 5, 2019 at 21:17
Possible duplicate of How do you extract a url from a string using python? — Jab
– Jab, Commented Feb 5, 2019 at 21:20

Osman Mamun · Accepted Answer · 2019-02-05 21:34:08Z

2

You can use re to extract all the URLs.

In [1]: st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov
   ...: h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'''

In [2]: st
Out[2]: 'https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'

In [3]: import re

In [4]: a = re.compile(r"https?://(\w+\.\w{3})/*")
In [5]: for i in a.findall(st):
   ...:     print(i)


regex101.com
regex202.gov
regex303.com
regex101.com

For a variable TLD and path:

st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/ ie fah fah http://regex101.co/ ty ahn fah jaio l http://regex101/yhes.com/'''
a = re.compile(r"https?://([\w/]+\.\w{0,3})/*")
for i in a.findall(st):
    print(i)

regex101.com
regex202.gov
regex303.com
regex101.com
regex101.co
regex101/yhes.com
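Note that findall returns only the captured group when the pattern contains one, which is why the scheme is missing from the output above. As a minimal sketch, assuming the whole URL (scheme and path included) is wanted, the group can be dropped and the match extended up to the next whitespace. The pattern https?://\S+ and the sample text here are illustrative assumptions, not part of the answer above.

```python
import re

# Hypothetical sample text mixing URLs with unrelated words.
text = "lorem https://regex101.com/ ipsum http://regex101.co/a/b dolor https://regex202.gov"

# With a capturing group, findall returns only what the group matched.
hosts = re.findall(r"https?://(\w+\.\w{2,3})", text)

# Without a group, findall returns each whole match.
urls = re.findall(r"https?://\S+", text)

print(hosts)
print(urls)
```

Matching on \S+ is deliberately loose: it will also pick up punctuation that directly follows a URL, so the result may need trimming depending on the input.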

edited Feb 5, 2019 at 21:34

answered Feb 5, 2019 at 21:21

Osman Mamun



3 Comments

Dann Over a year ago

Would this also work for URLs that have a path? All URLs being used here will contain a path and a TLD that may or may not have 3 characters.

Osman Mamun Over a year ago

For the TLD, you can use {0,3} to match anywhere from zero up to three characters. For the path, you can include the path separator (/) in the group.

Osman Mamun Over a year ago

Also, it would help if you included some examples of the URLs you are extracting.

Jab · Accepted Answer · 2019-02-05 23:11:53Z

1

You need to use re like in this answer. Below is this incorporated into your function.

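import re

def test():
    with open('findlink.txt', 'r') as infile, open('extractlink.txt', 'w') as outfile:
        for line in infile:
            try:
                # raw string so that \s reaches the regex engine unchanged
                url = re.search(r"(?P<url>https?://[^\s]+)", line).group("url")
                outfile.write(url + "\n")  # one URL per output line
            except AttributeError:
                # re.search returned None: no URL on this line
                pass
        print("Done")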
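Since re.search stops at the first match, a line holding several URLs yields only one of them. As a hedged sketch of one way around that, assuming every URL on a line should be kept, findall can collect them all. The names extract_urls and copy_urls are hypothetical helpers introduced for illustration, and the pattern is an assumption.

```python
import re

# Assumed pattern: scheme followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(lines):
    """Return every URL found in an iterable of text lines, in order."""
    urls = []
    for line in lines:
        urls.extend(URL_PATTERN.findall(line))
    return urls


def copy_urls(in_path, out_path):
    """Write each URL found in in_path to out_path, one per line."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for url in extract_urls(infile):
            outfile.write(url + "\n")
```

With the file names from the question, this would be called as copy_urls('findlink.txt', 'extractlink.txt').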

edited Feb 5, 2019 at 23:11

answered Feb 5, 2019 at 21:24

Jab


5 Comments

Dann Over a year ago

Thank you. This solves the issue without worrying about specific TLDs or lengths!

Dann Over a year ago

Just a note so you can edit your solution: this raises an AttributeError from .group if re.search doesn't find a URL.

Jab Over a year ago

@Dansey Edited my answer

Dann Over a year ago

I know this is weeks later but does .group serve a purpose? could ?P<url> and .group("url") be removed to make a simple re search? @Jab

Jab Over a year ago

No, re.search returns either None or a match object (re.Match in Python 3), as per the docs. Read there and see your options.
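To make the options concrete, here is a small sketch, assuming the same kind of pattern as in the answer. It suggests that the named group itself is optional, because group(0) returns the whole match, while some call to .group() is still needed to get a string out of the match object. The sample line is made up for illustration.

```python
import re

line = "lorem ipsum https://stackoverflow.com/questions/54543095/search-and-extract-a-url-from-a-text-file dolor"

named = re.search(r"(?P<url>https?://\S+)", line)
plain = re.search(r"https?://\S+", line)

# The named group spans the whole match, so both calls return the same text.
print(named.group("url"))
print(plain.group(0))

# With no URL present, re.search returns None; check before calling .group().
missing = re.search(r"https?://\S+", "no link here")
print(missing is None)
```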

Green Cloak Guy · Accepted Answer · 2019-02-05 21:21:13Z

-1

Here's why the code currently does nothing:

outfile.write(line[line.find("https://"): line.find("")])

Note that line.find("") is looking for the empty string. This is always going to be found at the very beginning of the string, and therefore will always return 0. Thus your slice stops at index 0, which is never after its start, so it is always empty.
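A quick way to see this, as a sketch with a made-up sample line (example.com is just a placeholder):

```python
line = "lorem ipsum https://example.com/page dolor sit amet"

start = line.find("https://")  # index where the URL begins
end = line.find("")            # the empty string is found at index 0

print(start, end)
print(repr(line[start:end]))   # the stop index precedes the start, so nothing is selected
```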

Try changing it to line.find(" ") - you're looking for a space, not an empty string.

However, if the line contains a space before the URL, line.find(" ") returns the index of that earlier space and the slice is still wrong. The simplest-to-read way to do it is probably just using separate variables:

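if "https://" in line:
    https_begin = line.find("https://")
    # search from the start of the URL so the result is an index into line
    https_end = line.find(" ", https_begin)
    if https_end == -1:  # no space after the URL: take the rest of the line
        https_end = len(line)
    outfile.write(line[https_begin:https_end].rstrip() + "\n")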

answered Feb 5, 2019 at 21:21

Green Cloak Guy


3 Comments

Dann Over a year ago

See comments. I'm not looking for a space as infile may or may not contain a space after the url.

Jordan Singer Over a year ago

That's not a problem though. Then the find(' ') would return a -1 and you're good to go.

Dann Over a year ago

A possible solution using this would be to add a space to the end of findlink.txt regardless of its contents.
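The -1 case raised in these comments can be checked directly. The sketch below uses a hypothetical helper, slice_url, and made-up sample lines. It suggests that using -1 as the slice end drops the last character. That happens to be the newline for most lines read from a file, but it is a real URL character when the final line has no trailing newline.

```python
def slice_url(line):
    """Slice from "https://" to the next space, passing find()'s result through unchanged."""
    begin = line.find("https://")
    end = line.find(" ", begin)  # -1 when no space follows the URL
    return line[begin:end]


# Typical line read from a file: the dropped character is the newline.
print(repr(slice_url("see https://example.com/page\n")))

# Final line without a trailing newline: the URL loses its last character.
print(repr(slice_url("see https://example.com/page")))
```

Testing for -1 explicitly and falling back to len(line) avoids both this and the need to append a space to the input file.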
