
I am trying to grab geolocations by looking up Twitter profile URLs taken from a CSV. The input file has more than 100K rows and a number of columns.

I am using Python 3.x (Anaconda, fully updated) and I am getting the following error:

Traceback (most recent call last):
  File "__main__.py", line 21, in <module>
    location = get_location(userid)
  File "C:path\twitter_location.py", line 22, in get_location
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
IndexError: list index out of range

The code below :

#!/usr/bin/env python
import urllib.request
from bs4 import BeautifulSoup

def get_location(userid):
    '''
    Get location as a string ('Paris', 'New York', ...) by scraping the
    Twitter profile page. Returns None if the location cannot be scraped.
    '''

    page_url = 'http://twitter.com/{0}'.format(userid)

    try:
        page = urllib.request.urlopen(page_url)
    except urllib.request.HTTPError:
        print ('ERROR: user {} not found'.format(userid))
        return None

    content = page.read()
    html = BeautifulSoup(content, 'html.parser')
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()

    if location == '':
        return None
    return location

I am looking for a quick fix so that I can process the whole input file of more than 100k rows.

Edit I: As mentioned in the answer below, after including the try block the script stopped grabbing geolocations.

Before including the try block, I got the 'list index out of range' error after a certain count of rows. After including it, the error is gone, but so are the coordinates: I am getting all None values.

Here is the Dropbox link with the input, the before & after outputs, and the entire code bundle.

Edit II:

The entire code and inputs are in the Dropbox. I am looking for help to eliminate the API approach entirely and find an alternative way to pull the geolocations of Twitter usernames.

Appreciate the help in fixing the problem. Thanks in advance.

2 Comments

  • My guess is that some of the contents do not have '.ProfileHea...' in them and therefore html.select gives an empty list with no index 0 Commented Apr 20, 2017 at 23:29
  • @EzerK Thank you for the suggestion, in that case how can I neglect such a row and move forward to next ? I am trying with ('.ProfileHeaderCard-locationText')[-1] and executing. Let me see how does that work in this case. Commented Apr 20, 2017 at 23:35
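The first comment pinpoints the failure mode: select() returns a list of matches, and indexing an empty list raises IndexError — note that [-1] fails on an empty list just like [0], so that workaround will not help. A minimal sketch, with a plain list standing in for the select() result:

```python
# What html.select('.ProfileHeaderCard-locationText') returns when the
# profile page has no location element: an empty list.
matches = []

# matches[0] (and matches[-1]) would raise IndexError here.
# A safe guard: only index when the list is non-empty.
location = matches[0] if matches else None
print(location)  # -> None instead of a crash
```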

2 Answers


Well, you have exception handling for HTTPError, but there is no handling for the case where there is no .ProfileHeaderCard-locationText element. That's probably the issue. For diagnostics you can use the logging module:

import logging
logging.warning('Watch out!')  # will print a message to the console
logging.info('I told you so')  # will not print anything (default level is WARNING)
logging.exception('Lookup failed')  # call inside an except block to log the traceback too

You can use this in all of your programs (and you should!). Just like you added a try/except block for the request:

try:
    page = urllib.request.urlopen(page_url)
except urllib.request.HTTPError:
    print('ERROR: user {} not found'.format(userid))
    return None

you can do the same for

try:
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
except IndexError:
    print("ERROR: no location found on profile of {}".format(userid))
    return None
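Combining both guards, the selector logic can be factored into a small helper that never indexes an empty list. This is only a sketch: first_location_text is a hypothetical name, and the Tag class below merely stands in for a BeautifulSoup tag so the behaviour can be shown without a network call:

```python
def first_location_text(matches):
    """Return the stripped text of the first match, or None when the
    selector found nothing or the location text is blank."""
    if not matches:        # empty result list: no location element on the page
        return None
    text = matches[0].text.strip()
    return text or None    # treat an all-whitespace location as missing

# Hypothetical stand-in for a bs4 tag, purely for illustration:
class Tag:
    def __init__(self, text):
        self.text = text

print(first_location_text([]))                # -> None
print(first_location_text([Tag('   ')]))      # -> None
print(first_location_text([Tag(' Paris ')]))  # -> Paris
```

In get_location you would then call first_location_text(html.select('.ProfileHeaderCard-locationText')) and return the result directly.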

The main problem might be that Google limits usage of their API when you call it this way. A much more convenient approach is the Google-Maps-Python-API library. Usage example:

from geolocation.google_maps import GoogleMaps

address = "New York City Wall Street 12"

google_maps = GoogleMaps(api_key='your_google_maps_key')

location = google_maps.search(location=address)  # sends search to Google Maps

print(location.all())  # returns all locations

my_location = location.first()  # returns only the first location

print(my_location.city)
print(my_location.route)
print(my_location.street_number)
print(my_location.postal_code)

EDIT:

    if location.strip() == '':
        return None
    return location.strip()

I think you meant:

if location.strip() == None:
    return None
else:
    return location.strip()

14 Comments

Thank you so much for the suggestion. But in which part of this code do I use the logging? Also, there needs to be a way to skip that particular row in the input and move forward with parsing the next row, isn't there?
I've added code, you can use try: except block everywhere in your code where you expect an error (for any reason), now just using except Exception is a pretty sloppy way of doing things, but I do it and it gets the job done. *For this instance you can use except IndexError: print("There's no .profileHeaderCard, moving on...")
Thank you so much .. Let me try the addition and execute and will inform you back .
Of course, make sure you give feedback and explain your problem / edit your question, add code to make it easier for you and for us as well. Tip: If you think you don't know what you're looking for exactly ask for videos / docs explaining the subject.
I included the try: block but that is giving me indentation error try: ^ TabError: inconsistent use of tabs and spaces in indentation

While handling the 'index out of range' exception as suggested in Faulty Fuse's answer is important, this will only fix the symptom.

The root of the problem is that after a certain number of requests Twitter blocks your IP(s) and stops sending any usable content (they don't like such mass queries).

Potential solutions:

  1. Go slower. This will delay being blocked by Twitter; ideally you go slowly enough that you don't get blocked at all. While this might not be feasible for 100k records, it is an easy fix if you have the time to wait for results.

  2. Use rotating proxies. Use many of them. ... or combine a handful of proxies with going somewhat slower.
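A minimal sketch of option 1, assuming the get_location function from the question; throttled and all the delay values are hypothetical names/numbers to tune against the rate limit you actually hit:

```python
import time

def throttled(items, wait_seconds=2.0, batch_size=100, batch_pause=60.0):
    """Yield items one by one, sleeping between each one and pausing
    longer after every batch_size items."""
    for i, item in enumerate(items):
        if i:  # no sleep before the very first item
            time.sleep(wait_seconds)
            if i % batch_size == 0:
                time.sleep(batch_pause)  # longer pause between batches
        yield item

# Usage sketch, with get_location as defined in the question:
# for userid in throttled(userids, wait_seconds=2.0):
#     location = get_location(userid)
```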

3 Comments

Thank you so much.. Can u please help me with writing delay block in the code.. Please it will be huge help for me.. May be like 100 queries at a time and halt n delay and again some more queries.. Until the end.. Please try to help with code..
Sure, if you want to go the slow route, I would put a single time.sleep(wait_seconds) into your loop that goes through all the userids and then experiment with different values for wait_seconds, starting with wait_seconds=2 and adjusting up and down accord to your test results.
Please can u help with code edit.. I have put all the input, output and code in the question with Dropbox link.. It would be huge help for me..
