0

I'm looking to grab a URL that begins with http:// or https:// from a text file that also contains other unrelated text, and write it to another file or list.

    def test():
        with open('findlink.txt') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                if "https://" in line:
                    outfile.write(line[line.find("https://"): line.find("")])
            print("Done")

The code currently does nothing.

Edit: I see this is being negatively voted like usual, is there anything I can add here?

This is not a duplicate, please re-read carefully.

13
  • What is expected out of this outfile.write(line[line.find("https://"): line.find("")])? Commented Feb 5, 2019 at 21:13
  • It is expected to separate the URL from other unrelated text. Picture a file with contents like this lorem ipsum https://stackoverflow.com/questions/54543095/search-and-extract-a-url-from-a-text-file dolor sit amet There may or may not be text written after the URL so line.find(" ") would not be useful here. Commented Feb 5, 2019 at 21:15
  • The second part of your slice, line.find(""), returns 0, which completely messes up the slice. Use re instead. Commented Feb 5, 2019 at 21:16
  • Yes @Jaba, I'm looking for the proper solution to fix that. Leaving that out won't return only the URL like needed. Commented Feb 5, 2019 at 21:17
  • 2
    Possible duplicate of How do you extract a url from a string using python? Commented Feb 5, 2019 at 21:20

3 Answers

2

You can use the re module to extract all the URLs.

In [1]: st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov
   ...: h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'''

In [2]: st
Out[2]: 'https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'

In [3]: import re

In [4]: a = re.compile(r"https?://(\w+\.\w{3})/*")
In [5]: for i in a.findall(st):
   ...:     print(i)


regex101.com
regex202.gov
regex303.com
regex101.com

For variable tld and path:

st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/ ie fah fah http://regex101.co/ ty ahn fah jaio l http://regex101/yhes.com/'''
a = re.compile(r"https?://([\w/]+\.\w{0,3})/*")
for i in a.findall(st):
    print(i)

regex101.com
regex202.gov
regex303.com
regex101.com
regex101.co
regex101/yhes.com
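If the URLs always carry a path and the TLD length varies, another option is to match up to the next whitespace and let the standard library split the result, rather than encoding TLD rules in the pattern. This is only a rough sketch: the sample text and the helper name `split_urls` are made up for illustration, and `\S+` will also swallow trailing punctuation such as a comma or closing bracket.

```python
import re
from urllib.parse import urlsplit


def split_urls(text):
    """Return (host, path) pairs for every http(s) URL found in text."""
    # \S+ runs to the next whitespace, so the TLD length does not matter
    urls = re.findall(r"https?://\S+", text)
    return [(urlsplit(url).netloc, urlsplit(url).path) for url in urls]


# Hypothetical sample text, not taken from the asker's file
text = "ha the https://regex101.com/r/abc hkj adh http://regex101.co/ erht"
for host, path in split_urls(text):
    print(host, path)
```

Because the full URL is matched first, the host and the path stay separate instead of being mixed inside one capture group.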

3 Comments

Would this also work for URLs that have a path? All URLs being used here will contain a path and a TLD that may or may not have 3 characters.
For the TLD, you can use {0,3} to allow anywhere from zero up to three characters. For the path, you can include the path separator / inside the group.
Also it would help if you include some examples of url you are extracting.
1

You need to use re, as in this answer. Below it is incorporated into your function.

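import re

def test():
    with open('findlink.txt', 'r') as infile, open('extractlink.txt', 'w') as outfile:
        for line in infile:
            try:
                # re.search returns None when the line has no URL, so .group raises AttributeError
                url = re.search(r"(?P<url>https?://[^\s]+)", line).group("url")
                outfile.write(url + "\n")  # newline keeps each URL on its own line
            except AttributeError:
                pass
        print("Done")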
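Since the question also mentions collecting the URLs into a list, and a line could hold more than one link, a variant built on `findall` might look like the sketch below. The function name `extract_urls` and the sample lines are assumptions for illustration only; the pattern is the same "everything up to the next whitespace" idea as above.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(lines):
    """Return every http:// or https:// URL found in an iterable of lines."""
    urls = []
    for line in lines:
        # findall returns an empty list for lines without a URL, so no try/except is needed
        urls.extend(URL_PATTERN.findall(line))
    return urls


# Hypothetical sample lines standing in for the contents of findlink.txt
sample = [
    "lorem ipsum https://stackoverflow.com/questions/54543095/search-and-extract-a-url-from-a-text-file dolor sit amet\n",
    "no link here\n",
    "two links: http://regex101.com/ and https://regex202.gov\n",
]
print(extract_urls(sample))
```

An open file is itself an iterable of lines, so `extract_urls(infile)` should work inside the `with open('findlink.txt') as infile:` block as well.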

5 Comments

Thank you. This solves the issue without worrying about specific TLDs or lengths!
Just a note so you can edit your solution, this returns an attribute error due to group if re.search doesn't return a url.
@Dansey Edited my answer
I know this is weeks later but does .group serve a purpose? could ?P<url> and .group("url") be removed to make a simple re search? @Jab
No, re.search returns either None or a match object, as per the docs. Read there and see your options.
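On the question of whether the named group is needed: as far as I can tell it is a convenience rather than a requirement, because `group(0)` on a match object returns the whole match. A small check, using a made-up sample line:

```python
import re

line = "lorem ipsum https://regex101.com/ dolor sit amet"  # hypothetical sample line

named = re.search(r"(?P<url>https?://\S+)", line)
plain = re.search(r"https?://\S+", line)

# group("url") reads the named group; group(0) is the entire match
print(named.group("url"))
print(plain.group(0))

# A line without a URL makes re.search return None, which has no .group method
print(re.search(r"https?://\S+", "no link here"))
```

Either form still needs a guard against `None`, whether that is the `try`/`except AttributeError` used in the answer or an explicit `if match:` test.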
-1

Here's why the code currently does nothing:

outfile.write(line[line.find("https://"): line.find("")])

Note that line.find("") is looking for the empty string. The empty string is always found at the very beginning of the string, so the call always returns 0. Your string slice therefore ends at index 0, and since it cannot start before index 0, it is always empty.
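A quick way to see this for yourself, using a made-up sample line:

```python
line = "lorem ipsum https://example.com dolor"  # hypothetical sample line

start = line.find("https://")
end = line.find("")  # the empty string matches at index 0

print(start)                   # 12
print(end)                     # 0
print(repr(line[start:end]))   # ''
```

A slice whose end is at or before its start is empty, so `write` receives an empty string on every matching line.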

Try changing it to line.find(" ") - you're looking for a space, not an empty string.


However, if the line contains spaces before that point, you're still going to mess up. The simplest-to-read way to do it is probably just using separate variables:

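if "https://" in line:
    https_begin = line.find("https://")
    https_end = line.find(" ", https_begin)  # find the next space after the url begins
    if https_end == -1:  # no space after the url, so it runs to the end of the line
        https_end = len(line)
    outfile.write(line[https_begin:https_end])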

3 Comments

See comments. I'm not looking for a space as infile may or may not contain a space after the url.
That's not a problem though. Then the find(' ') would return a -1 and you're good to go.
A possible solution using this would be to add a space to the end of findlink.txt regardless of its contents.
