
I am working on a Python script that goes through a CSV and performs two checks:

  1. Are there one or more URLs in this text?
  2. What status code is returned when making a request via requests.get?

The CSV has two columns:

richAnswer,kbid
"<p>This answer has one URL in it!  <a href=""https://www.google.com"" target=""_blank""> and also some text! </p>",301
"<p>This answer has two URLs in it!  <a href=""https://www.google.com"" target=""_blank""> and <a href=""https://www.bing.com"" target=""_blank"">! </p>",258
"<p>This answer has absolutely no URL in it and is very sad :( </p>",774

The rich answer contains one or more URLs, or none at all.

I use pandas to read the CSV into a DataFrame and set up a row counter for later.

import re
import pandas as pd

df = pd.read_csv("./FAQs/faq-data-short.csv")
answers = df['richAnswer']
current_row = 0

After that, I start a while loop that keeps running as long as the current answer is not empty, and looks for URLs in that answer. If there is more than one URL in the answer, I start a for loop over the URLs.

while len(answers[current_row]) != 0:
    urls = re.findall(r'(https?://[^\s]+)', answers[current_row].replace('"',''))
    if len(urls) > 1:        
        for url in urls:
            status = get_statuscode(url)
            df['URL'] = url
            KBID = df['kbid'][current_row]                        
            print('URL', current_row, ': ', KBID, str([urls]), status)
            current_row += 1
    elif len(urls) == 1: 
        status = get_statuscode(url)
        df['URL'] = url[current_row]
        KBID = df['kbid'][current_row]                        
        print('URL', current_row, ': ', KBID, urls, status)
        current_row += 1
    else:
        status = "None"
        df['URL'] = "None"
        KBID = df['kbid'][current_row]                        
        print('URL', current_row, ': ', KBID, status)
        current_row += 1

Now there are some issues I am facing. If I let my script run like that, it ends with KeyError: 3, because I run past the last row when looking up the KBID. This is due to the way I check both URLs to get both status codes.

URL 0 :  301 ['https://www.google.com'] 200
URL 1 :  301 [['https://www.google.com', 'https://www.bing.com']] 200
URL 2 :  774 [['https://www.google.com', 'https://www.bing.com']] 200

I can see two possible solutions:

  1. a way to write both URLs (if there are several) in the same row, but in different columns like URL1, URL1-Status, URL2, and so on
  2. Have each URL have its own row but retain the corresponding ID.

Sadly, I have not found a solution for either so far. I am sure I missed something or just went down the wrong path, so I'd be very grateful for any help you can offer.
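For what it's worth, option 1 could be sketched roughly like this. get_statuscode is stubbed out here so the snippet runs standalone, and the sample data is inlined rather than read from the CSV; in real use it would call requests.get:

```python
import re
import pandas as pd

def get_statuscode(url):
    # Stand-in for requests.get(url).status_code, so this runs offline
    return 200

df = pd.DataFrame({
    "richAnswer": [
        '<p><a href="https://www.google.com"> one link </p>',
        '<p><a href="https://www.google.com"> and <a href="https://www.bing.com"> </p>',
        '<p>no URL here</p>',
    ],
    "kbid": [301, 258, 774],
})

rows = []
for _, entry in df.iterrows():
    urls = re.findall(r'https?://[^\s"]+', entry["richAnswer"])
    row = {"kbid": entry["kbid"]}
    # One pair of columns per URL: URL1/URL1-Status, URL2/URL2-Status, ...
    for i, url in enumerate(urls, start=1):
        row[f"URL{i}"] = url
        row[f"URL{i}-Status"] = get_statuscode(url)
    rows.append(row)

wide = pd.DataFrame(rows)  # rows with fewer URLs get NaN in the extra columns
print(wide)
```

Answers with fewer URLs than the widest row simply get NaN in the unused columns, which pandas handles automatically when building the DataFrame from a list of dicts.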

  • Do you really need to loop over the dataframe? Looks like you need a function that splits a string into URLs, joins it with the existing status code, and then you work on a list of (url, code) tuples. You gain very little by working in a dataframe. Commented Jul 7, 2020 at 20:52
  • Are you looking to find all the links in the richAnswer column? Or just the ones that are in the HTML a tag? Commented Jul 8, 2020 at 3:46
  • @Evgeny Thank you for the question. Quite honestly, I am not sure. For me this was the logical path. I am happy to learn an easier or more efficient way though. Commented Jul 8, 2020 at 12:41
  • 1
    @zmike solution is good, I did not think of lxml for parsing. Commented Jul 8, 2020 at 21:11
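The function-based approach from the first comment might look something like this; the name urls_with_status and the injectable get_status argument are my own, for illustration (in real use you would pass something like lambda u: requests.get(u).status_code):

```python
import re

def urls_with_status(text, get_status=lambda u: 200):
    """Split a rich-answer string into URLs and pair each with a status code.

    get_status is injectable so the example runs without network access;
    the default simply returns 200 for any URL.
    """
    urls = re.findall(r'https?://[^\s"]+', text)
    return [(url, get_status(url)) for url in urls]

pairs = urls_with_status(
    '<a href="https://www.google.com"> and <a href="https://www.bing.com">'
)
print(pairs)  # [('https://www.google.com', 200), ('https://www.bing.com', 200)]
```

Working on the resulting list of (url, code) tuples avoids the manual row counter entirely.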

1 Answer


Here's a solution that has each URL on its own row and retains the corresponding ID. Some suggestions:

  • Use an XML/HTML parser to parse the markup column (richAnswer)
  • Use XPath to find the link URLs in the HTML a tags. Sometimes they are relative links, so they won't match the regex. If you want all URLs (not just the ones in a tags), regex would come back into play.
  • Use for _, entry in df.iterrows() to iterate through all the rows.
import pandas as pd
from lxml import etree
from io import StringIO
import requests

df = pd.read_csv("./FAQs/faq-data-short.csv", header=0)
parser = etree.HTMLParser()  # Use an XML/HTML parser to parse the Rich Text entries

URL_number = 0
for _, entry in df.iterrows():
    # Search for the links using XPath
    tree = etree.parse(StringIO(entry["richAnswer"]), parser)
    links = tree.xpath("//a/@href")

    kbid = entry["kbid"]
    if links:
        # Print out the links if there are any
        for url in links:
            # Get response status code
            response = requests.get(url)
            # Print to console - change this to a file write if you wish
            print(f'URL {URL_number}: {kbid}, {url}, {response.status_code}')
            URL_number += 1
    else:
        # Print out something if none are found
        print(f'KBID {kbid}: no URLs found')

Output

URL 0: 301, https://www.google.com, 200
URL 1: 258, https://www.google.com, 200
URL 2: 258, https://www.bing.com, 200
KBID 774: no URLs found

1 Comment

Thank you very much! This was exactly what I needed! I managed to work further with it and instead of printing I am saving into a new CSV. I also had to add some try/excepts for Connection Errors or Invalid/Missing Schemas since the original list wasn't quite as clean. Again, thank you very much!
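The try/except handling mentioned here could be sketched roughly as follows; the fallback labels are illustrative, not part of the original script:

```python
import requests

def get_statuscode_safe(url, timeout=10):
    """Return the HTTP status code for url, or a short error label
    instead of raising, for messy URL lists like the one described above."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except (requests.exceptions.MissingSchema,
            requests.exceptions.InvalidSchema):
        # e.g. a bare "www.example.com" or an unsupported scheme
        return "invalid URL"
    except requests.exceptions.ConnectionError:
        return "connection error"
    except requests.exceptions.RequestException:
        # Catch-all for other requests errors (timeouts, etc.)
        return "request failed"

print(get_statuscode_safe("not-a-url"))  # prints "invalid URL"
```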
