Issue with web-scraping a public GitHub repo

Question

I am trying to scrape a public GitHub repo (https://github.com/stlrda/redb_python/tree/master/python/DAGs) to grab the name and datetime of each file. The code below works, but not every time. Sometimes the line DAGs[counter]['age'] = x.find('.no-wrap')[0].attrs['datetime'] raises an IndexError (index out of range). I don't understand why the same code sometimes succeeds and other times fails to find the datetime. How can I make it find the datetime on every run?

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://github.com/stlrda/redb_python/tree/master/python/DAGs')

div = r.html.find('tbody', first=True)
title = div.find('.content')

DAGs = []

#Grab the names of each DAG in the repo
for x in range((len(title))):

    if x == 0:
        continue
    else:
        info = {"name": title[x].text}
        DAGs.append(info)

#Update the dictionary with the age of the DAG
gitTable = div.find('.js-navigation-item')

counter = 0
for x in gitTable:
    DAGs[counter]['age'] = x.find('.no-wrap')[0].attrs['datetime']
#     print (x.find('.no-wrap')[0].attrs['datetime'])
    counter+=1

When the code fails, here is what the gitTable variable contains:

[<Element 'tr' class=('js-navigation-item',)>,
 <Element 'tr' class=('js-navigation-item',)>,
 <Element 'tr' class=('js-navigation-item',)>,
 <Element 'tr' class=('js-navigation-item',)>]

And the html of one of these items in the gitTable list is:

>>> gitTable[0].html
'<tr class="js-navigation-item">\n<td class="icon">\n<svg aria-label="file" class="octicon octicon-file" height="16" role="img" version="1.1" viewbox="0 0 12 16" width="12"><path d="M6 5H2V4h4v1zM2 8h7V7H2v1zm0 2h7V9H2v1zm0 2h7v-1H2v1zm10-7.5V14c0 .55-.45 1-1 1H1c-.55 0-1-.45-1-1V2c0-.55.45-1 1-1h7.5L12 4.5zM11 5L8 2H1v12h10V5z" fill-rule="evenodd"/></svg>\n<img alt="" class="spinner" height="16" src="https://github.githubassets.com/images/spinners/octocat-spinner-32.gif" width="16"/>\n</td>\n<td class="content">\n<span class="css-truncate css-truncate-target"><a class="js-navigation-open" href="/stlrda/redb_python/blob/master/python/DAGs/MigratetoPG_DAG.py" id="5554cd417ad3b8097206c9a0e81566d0-7416c3966dc565eb1b0115b89fa72116e4cc3ee6" title="MigratetoPG_DAG.py">MigratetoPG_DAG.py</a></span>\n</td>\n<td class="message">\n<span class="css-truncate css-truncate-target">\n</span>\n</td>\n<td class="age">\n<span class="css-truncate css-truncate-target"/>\n</td>\n</tr>'

If you get an error, first check what the HTML you received actually contains: print() it, or save it to a file and open it in a web browser. You may be getting HTML that differs from what you expect, such as a different page design or malformed markup. You could also check that x.find('.no-wrap') is not empty before indexing it with [0], or wrap the lookup in try/except.
– furas, Commented Jan 11, 2020 at 0:21
Or maybe the server did not recognize the browser (user-agent) and generated HTML with different tags or different attributes.
– furas, Commented Jan 11, 2020 at 0:39
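
The guard suggested in these comments could look roughly like the sketch below. It assumes the rows behave like the requests_html elements in the question (find() returns a possibly empty list, and each element has an attrs dict). The names extract_datetime and collect_ages are hypothetical helpers made up for this illustration, not part of any library.

```python
def extract_datetime(row):
    """Return the 'datetime' attribute of the first '.no-wrap' element, or None.

    `row` is assumed to behave like a requests_html Element: find(selector)
    returns a (possibly empty) list of elements, each with an `attrs` dict.
    """
    matches = row.find('.no-wrap')
    if not matches:
        # The row arrived without a timestamp element, so there is nothing to index.
        return None
    return matches[0].attrs.get('datetime')


def collect_ages(rows):
    """Map every row to its datetime (or None) without raising IndexError."""
    return [extract_datetime(row) for row in rows]
```

Rows that come back as None can then be logged, skipped, or retried. This only avoids the crash; it does not explain why the timestamp is missing from the HTML on some runs.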

Amanser · Accepted Answer · 2020-01-13 17:02:40Z

Looks like I was taking a much harder route by trying to scrape GitHub, and completely overlooked their API.

The commits and contents endpoints were able to provide me with the file name and datetime info that I needed. Below are examples of the endpoints.

I could not find a single endpoint that gave both the filename and the datetime data, so if anyone knows of one, please let me know.

Datetime --> https://api.github.com/repos/<github account>/<repo name>/commits?path=<path to folder>

Name --> https://api.github.com/repos/<github account>/<repo name>/contents/<path to folder>
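
One way to combine the two endpoints is sketched below, using only the standard library. It makes several assumptions that are not in the answer itself: that the contents response is a list of entries with "name", "path" and "type" fields, that the commits response lists the newest commit first with its date under commit.committer.date, and that per_page=1 is accepted to limit the result. All function names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def contents_url(owner, repo, folder):
    """URL that lists the entries (files and subfolders) of a folder."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{quote(folder)}"


def commits_url(owner, repo, path):
    """URL that lists commits touching `path`; per_page=1 asks for a single commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits?path={quote(path, safe='')}&per_page=1"


def latest_commit_date(commits):
    """Return the committer date of the first commit in the list, or None if it is empty."""
    if not commits:
        return None
    return commits[0]["commit"]["committer"]["date"]


def fetch_json(url):
    """GET a URL and decode its JSON body (unauthenticated, so rate limits apply)."""
    request = Request(url, headers={"Accept": "application/vnd.github+json",
                                    "User-Agent": "dag-age-example"})
    with urlopen(request) as response:
        return json.load(response)


def list_files_with_dates(owner, repo, folder):
    """One contents call for the names, then one commits call per file for the date."""
    files = []
    for entry in fetch_json(contents_url(owner, repo, folder)):
        if entry["type"] != "file":
            continue  # skip subfolders and other non-file entries
        commits = fetch_json(commits_url(owner, repo, entry["path"]))
        files.append({"name": entry["name"], "age": latest_commit_date(commits)})
    return files
```

For the repo in the question, the call would be list_files_with_dates("stlrda", "redb_python", "python/DAGs"). Because this issues one request per file, a large folder can run into the API's limits for unauthenticated requests.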

