I am working on a Python script that goes through a CSV file and performs two checks:
- Are there one or more URLs in the text?
- What status code is returned when requesting each URL via requests.get?
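The second check could be wrapped in a small helper such as the get_statuscode function called later in the script. Its body is not shown in this post, so the following is only a sketch: the 5-second timeout and the choice to return None for a failed request are assumptions.

```python
import requests


def get_statuscode(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        # timeout is an assumed value; without one a dead host can hang the script
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        # covers malformed URLs, connection errors, and timeouts
        return None


# A malformed URL is rejected before any network traffic happens.
print(get_statuscode("not-a-url"))
```

Returning None instead of raising keeps a single bad link from stopping the whole run.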
The CSV has two columns:
richAnswer,kbid
"<p>This answer has one URL in it! <a href=""https://www.google.com"" target=""_blank""> and also some text! </p>",301
"<p>This answer has two URLs in it! <a href=""https://www.google.com"" target=""_blank""> and <a href=""https://www.bing.com"" target=""_blank"">! </p>",258
"<p>This answer has absolutely no URL in it and is very sad :( </p>",774
Each rich answer contains zero, one, or several URLs.
I use pandas to read the CSV into a DataFrame and set up a row counter for later.
import re
import pandas as pd

df = pd.read_csv("./FAQs/faq-data-short.csv")
answers = df['richAnswer']
current_row = 0
After that, I start a while loop that keeps running as long as the current answer is not empty and looks for URLs in that answer. If the answer contains more than one URL, I run a for loop over the URLs.
while len(answers[current_row]) != 0:
    urls = re.findall(r'(https?://[^\s]+)', answers[current_row].replace('"',''))
    if len(urls) > 1:
        # several URLs: request each one in turn
        for url in urls:
            status = get_statuscode(url)
            df['URL'] = url
            KBID = df['kbid'][current_row]
            print('URL', current_row, ': ', KBID, str([urls]), status)
            current_row += 1
    elif len(urls) == 1:
        # exactly one URL
        status = get_statuscode(urls[0])
        df['URL'] = urls[0]
        KBID = df['kbid'][current_row]
        print('URL', current_row, ': ', KBID, urls, status)
        current_row += 1
    else:
        # no URL in this answer
        status = "None"
        df['URL'] = "None"
        KBID = df['kbid'][current_row]
        print('URL', current_row, ': ', KBID, status)
        current_row += 1
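The extraction step can also be looked at on its own. A minimal sketch, assuming the hypothetical names URL_PATTERN and find_urls, that excludes quotes and angle brackets in the pattern itself instead of calling replace('"','') first:

```python
import re

# Stop at whitespace, quotes, and angle brackets so a URL inside href="..."
# is captured without the closing quote or the end of the tag.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


answer = (
    '<p>Two links: <a href="https://www.google.com" target="_blank"> and '
    '<a href="https://www.bing.com" target="_blank">!</p>'
)
print(find_urls(answer))
print(find_urls("<p>No link here</p>"))
```

Because the result is always a list (possibly empty), a single loop over it would handle the none, one, and many cases alike.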
I am facing a few issues. If I run the script as shown, it ends with KeyError: 3 once it moves past the last row and looks up the KBID. This is due to the way I have to check both URLs to get both status codes.
URL 0 : 301 ['https://www.google.com'] 200
URL 1 : 301 [['https://www.google.com', 'https://www.bing.com']] 200
URL 2 : 774 [['https://www.google.com', 'https://www.bing.com']] 200
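The KeyError can be reproduced in isolation. A minimal sketch, assuming a three-row Series as a stand-in for df['richAnswer']: the labels run from 0 to 2, so a lookup with label 3 fails, whereas iterating over the Series stops by itself after the last row.

```python
import pandas as pd

answers = pd.Series(["first", "second", "third"])  # stand-in for df['richAnswer']

# Labels run from 0 to 2, so asking for label 3 raises KeyError.
try:
    answers[3]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("lookup past the end failed:", lookup_failed)

# Iterating over the Series visits each row exactly once and then stops.
visited = [row for row, _ in answers.items()]
print(visited)
```

With this kind of iteration no manual row counter is needed, so it cannot run ahead of the data when an answer holds more than one URL.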
I can see two possible solutions:
- Write all URLs of an answer (if there are several) into the same row, but in separate columns such as URL1, URL1-Status, URL2, and so on.
- Give each URL its own row but retain the corresponding ID.
Sadly, I have not found a way to do either so far. I am sure I have missed something or gone down the wrong path, so I'd be very grateful for any help you can offer.
richAnswer column? Or just the ones that are in the HTML a tag?
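Both layouts from the list above can be reached by collecting the URLs per row first and reshaping afterwards. The following is only a sketch under a few assumptions: the sample rows are made up in the same shape as the CSV, every http(s) URL in richAnswer counts (not only those inside a tags), and the names long_df and wide_df are hypothetical. Status codes are left out so the example runs without network access.

```python
import io

import pandas as pd

# Two sample rows in the same shape as the CSV in the question (made-up data).
csv_text = '''richAnswer,kbid
"<p>Two links: <a href=""https://www.google.com""> and <a href=""https://www.bing.com"">!</p>",258
"<p>No link here</p>",774
'''
df = pd.read_csv(io.StringIO(csv_text))

# One list of URLs per answer; the pattern stops at quotes and angle brackets.
df["URL"] = df["richAnswer"].str.findall(r"https?://[^\s\"'<>]+")

# Long layout: one row per URL with kbid repeated; answers without URLs get NaN.
long_df = df[["kbid", "URL"]].explode("URL").reset_index(drop=True)
print(long_df)

# Wide layout: one row per kbid with columns URL1, URL2, ...
urls = long_df.dropna(subset=["URL"]).copy()
urls["n"] = urls.groupby("kbid").cumcount() + 1
wide_df = urls.pivot(index="kbid", columns="n", values="URL").add_prefix("URL")
print(wide_df)
```

In the long layout a status column could then be filled by applying a request helper to the non-empty URL values, one request per row.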