
Hi everyone, I am currently trying to fetch data from URLs and then predict which category each article belongs to. So far I have this, but it raises an error:

    import re
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','',
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

And the error is this:

IndexError: single positional indexer is out-of-bounds.

Can someone help me please?

2 Answers


You can avoid the iloc call by using iterrows instead. (Otherwise you would have to use loc rather than iloc, because you were indexing with the DataFrame's index values; either way, loc/iloc lookups inside loops are generally not very efficient.) You can try the following code (with a waiting time inserted):

    import time

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i, row in info.iterrows():
        url = row.iloc[0]
        time.sleep(2.5)  # wait 2.5 seconds between requests
        # you can use row[column_name] here as well
        # (I only use iloc because I don't know the column names)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>', '',
                      str(soup.findAll(['p', 'h1', '\href="/avtorji/'])))])
        # the following iloc was probably raising the error, because it
        # accesses the i-th column in the first row of your df:
        # category.append(info.iloc[0, i])
        # not sure which field you wanted here; you could also use row['name']
        category.append(row.iloc[0])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

In case you really only need the URL in your loop, you can replace:

    for i, row in info.iterrows():
        url = row.iloc[0]

with something like:

    for url in info[put_the_name_of_the_url_column_here]:  # or info.iloc[:,0] as proposed by serge
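To make the column-based loop concrete, here is a small self-contained sketch. The two-row DataFrame and its URLs are made up for illustration; with header=None the columns are simply the integers 0 and 1:

```python
import pandas as pd

# Toy stand-in for pd.read_csv('labeled_urls.tsv', sep='\t', header=None):
# column 0 holds the URL, column 1 the category label (assumed layout).
info = pd.DataFrame({0: ['http://example.com/a', 'http://example.com/b'],
                     1: ['sport', 'politics']})

urls = []
for url in info.iloc[:, 0]:  # iterate the first column directly, no iterrows needed
    urls.append(url)

print(urls)  # ['http://example.com/a', 'http://example.com/b']
```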

7 Comments

I just have row[0] and row[1] as names. Thank you for your answer. I am guessing it should take a while, since it has 4000 rows. What is the minimum time it should take, in your estimation?
I can't tell, because most of the time will be spent in the requests.get calls. Maybe an hour? By the way, if your URLs all point to the same server, you should probably add some waiting time in between requests to give it some air to breathe and avoid being blocked.
Yes, it is all from one server. Can you help me do that? I don't know how yet; I am new to this interesting field.
Just a moment, I'll add it.
As always: please make sure you are really allowed to scrape that website. Happy scraping!
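To spell out the "waiting time" idea from the comments: a tiny throttling helper that sleeps between items, so requests.get is only called every few seconds. The name throttle and the delay values are my own, not from the answer:

```python
import time

def throttle(items, delay):
    """Yield each item, sleeping `delay` seconds between successive items."""
    for i, item in enumerate(items):
        if i:  # no need to sleep before the very first item
            time.sleep(delay)
        yield item

# usage sketch: for url in throttle(urls, 2.5): response = requests.get(url)
urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
start = time.monotonic()
fetched = list(throttle(urls, 0.1))
elapsed = time.monotonic() - start
print(fetched)
print(elapsed >= 0.2)  # two gaps of 0.1 s each
```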

The error is likely caused by passing index values to iloc: loc expects index values and column names, while iloc expects the numerical positions of rows and columns. Furthermore, you have swapped the row and column positions for category with category.append(info.iloc[0,i]). So you should at least do:

    for i in range(len(info)):
        response = requests.get(info.iloc[i,0])
        ...
        category.append(info.iloc[i,0])

But since you are iterating over the first column of a DataFrame, the code above is not Pythonic. It is better to use the column directly:

    for url in info.loc[:, 0]:
        response = requests.get(url)
        ...
        category.append(url)
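The loc/iloc distinction above can be demonstrated on a toy DataFrame (the index labels 10/20/30 are arbitrary, chosen so that labels and positions differ):

```python
import pandas as pd

df = pd.DataFrame({'url': ['u0', 'u1', 'u2']}, index=[10, 20, 30])

# iloc is purely positional: row 0 is the first row, whatever its label is.
first_by_position = df.iloc[0, 0]
# loc uses index labels and column names instead.
first_by_label = df.loc[10, 'url']

# Passing a label where a position is expected reproduces the question's error:
try:
    df.iloc[10, 0]  # there is no row at position 10
    error = None
except IndexError as exc:
    error = str(exc)

print(first_by_position, first_by_label, error)
```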

2 Comments

I will try this one too, but first I need to wait for the previous one to finish. Do you think there is code that works more optimally?
Since you are only using the URLs, it is enough to iterate over just that one column.
