
Hi everyone, I am currently trying to fetch data from URLs and then predict which category each article belongs to. So far I have this, but it raises an error:

    import re
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','',
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

And the error is this:

IndexError: single positional indexer is out-of-bounds.

Can someone help me please?

2 Answers


You can avoid the iloc call by using iterrows instead. (Otherwise you would have to use loc rather than iloc, because you were indexing with the DataFrame's index values; either way, loc/iloc lookups inside loops are generally not very efficient.) You can try the following code (with a waiting time inserted):

    import time

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i, row in info.iterrows():
        url = row.iloc[0]
        time.sleep(2.5)  # wait 2.5 seconds between requests
        # you can use row[column_name] here as well
        # (I only use iloc because I don't know the column names)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>', '',
                      str(soup.findAll(['p', 'h1', '\href="/avtorji/'])))])
        # the following iloc was probably raising the error, because it
        # accesses the i-th column in the first row of your df:
        # category.append(info.iloc[0, i])
        # not sure which field you wanted here; you could also use row['name']
        category.append(row.iloc[0])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

In case you really only need the URL in your loop, you can replace:

    for i, row in info.iterrows():
        url = row.iloc[0]

with something like:

    for url in info[put_the_name_of_the_url_column_here]:  # or info.iloc[:,0] as proposed by serge
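To make the column-based loop concrete, here is a small self-contained sketch. The two-row DataFrame and its URLs are made up for illustration; with header=None the columns are simply the integers 0 and 1:

```python
import pandas as pd

# Toy stand-in for pd.read_csv('labeled_urls.tsv', sep='\t', header=None):
# column 0 holds the URL, column 1 the category label (assumed layout).
info = pd.DataFrame({0: ['http://example.com/a', 'http://example.com/b'],
                     1: ['sport', 'politics']})

urls = []
for url in info.iloc[:, 0]:  # iterate the first column directly, no iterrows needed
    urls.append(url)

print(urls)  # ['http://example.com/a', 'http://example.com/b']
```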

7 Comments

I just have row[0] and row[1] as names. Thank you for your answer. I am guessing it should take a while, since it has 4000 rows. What is the minimum time it should take, in your estimation?
I can't tell, because most of the time will be spent in the requests.get calls. Maybe an hour? By the way, if your URLs all point to the same server, you should probably add some waiting time in between requests to give it some air to breathe and avoid being blocked.
Yes, it is all from one server. Can you help me do that? I don't know how yet; I am new to this interesting field.
Just a moment, I'll add it.
As always: please make sure you are really allowed to scrape that website. Happy scraping!
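To spell out the "waiting time" idea from the comments: a tiny throttling helper that sleeps between items, so requests.get is only called every few seconds. The name throttle and the delay values are my own, not from the answer:

```python
import time

def throttle(items, delay):
    """Yield each item, sleeping `delay` seconds between successive items."""
    for i, item in enumerate(items):
        if i:  # no need to sleep before the very first item
            time.sleep(delay)
        yield item

# usage sketch: for url in throttle(urls, 2.5): response = requests.get(url)
urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
start = time.monotonic()
fetched = list(throttle(urls, 0.1))
elapsed = time.monotonic() - start
print(fetched)
print(elapsed >= 0.2)  # two gaps of 0.1 s each
```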

The error is likely caused by passing index values to iloc: loc expects index values and column names, while iloc expects the numerical positions of rows and columns. Furthermore, you have swapped the row and column positions for category with category.append(info.iloc[0,i]). So you should at least do:

    for i in range(len(info)):
        response = requests.get(info.iloc[i,0])
        ...
        category.append(info.iloc[i,0])

But since you are iterating over the first column of a DataFrame, the code above is not Pythonic. It is better to use the column directly:

    for url in info.loc[:, 0]:
        response = requests.get(url)
        ...
        category.append(url)
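The loc/iloc distinction above can be demonstrated on a toy DataFrame (the index labels 10/20/30 are arbitrary, chosen so that labels and positions differ):

```python
import pandas as pd

df = pd.DataFrame({'url': ['u0', 'u1', 'u2']}, index=[10, 20, 30])

# iloc is purely positional: row 0 is the first row, whatever its label is.
first_by_position = df.iloc[0, 0]
# loc uses index labels and column names instead.
first_by_label = df.loc[10, 'url']

# Passing a label where a position is expected reproduces the question's error:
try:
    df.iloc[10, 0]  # there is no row at position 10
    error = None
except IndexError as exc:
    error = str(exc)

print(first_by_position, first_by_label, error)
```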

2 Comments

I will try this one too, but first I need to wait for the previous one to finish. Do you think there is code that works more optimally?
Since you are only using the URLs, it is enough to iterate over just that one column.
