How to make a copy of a GitHub repository at each commit using the GitPython package in Python

Question

I've been trying to make a copy of a GitHub repository at each commit in its history using the GitPython package in Python and am running into this error when it gets partway through my code.

git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git reset --mixed HEAD~1 --
stderr: 'fatal: Failed to resolve 'HEAD~1' as a valid revision.'

This is the code that I've been running:

from git import *
import os, shutil

repo = Repo(repo_path)
commits = list(repo.iter_commits('master'))
for c in commits:
    # reset to previous commit
    repo.head.reset('HEAD~1', index = True, working_tree = True)
    # unique SHA key
    sha = c.name_rev.split()[0] 
    shutil.copytree(repo_path, destination_path)

Might this error be because of a merge? If so, how do I get around it such that I can get all the commits in the master branch of the repo?

torek · Accepted Answer · 2017-08-03 02:52:09Z

Before I even start on an answer, I will say: it is not clear to me why you are doing any of this at all. You could, for instance, use git archive to create a tar or zip file of any given commit. For instance:

git archive -o foo.tar v2.3.1

makes a foo.tar file out of the revision tagged v2.3.1. To make many tar or zip files out of all the revisions reachable from master, you could write:

git rev-list master | while read hash; do
    git archive -o /path/to/$hash.zip $hash
done

and be done with it.
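For anyone who would rather stay in Python, the same loop can be sketched with the standard library's subprocess module, with no GitPython involved. This is only a minimal sketch under a few assumptions: the function names archive_command and archive_every_commit and the one-zip-per-hash layout are made up for illustration, and git must be available on the PATH.

```python
import os
import subprocess


def archive_command(commit_hash, out_dir):
    """Build the `git archive` command line for one commit (hypothetical helper)."""
    out_file = os.path.join(out_dir, commit_hash + ".zip")
    # Same command as in the shell loop: git archive -o <file> <hash>
    return ["git", "archive", "-o", out_file, commit_hash]


def archive_every_commit(repo_path, out_dir, branch="master"):
    """Write one zip per commit reachable from `branch`; return the hashes."""
    # Use an absolute path because git itself runs with cwd=repo_path.
    out_dir = os.path.abspath(out_dir)
    os.makedirs(out_dir, exist_ok=True)
    # git rev-list prints one full commit hash per line.
    listing = subprocess.run(
        ["git", "rev-list", branch],
        cwd=repo_path, check=True, capture_output=True, text=True,
    )
    hashes = listing.stdout.split()
    for commit_hash in hashes:
        subprocess.run(archive_command(commit_hash, out_dir),
                       cwd=repo_path, check=True)
    return hashes
```

Calling archive_every_commit("path/to/repo", "path/to/archives") would leave the repository itself untouched, since git archive only reads commits and never moves HEAD or any branch name.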

Might this error be because of a merge?

Yes, it might.

If so, how do I get around it such that I can get all the commits in the master branch of the repo?

Beware: the commits in master likely include many commits that are also in other branches.

When you do this:

commits = list(repo.iter_commits('master'))

you get a full list of every commit that is reachable from the name master, starting with the most recent. Suppose master points to a commit in a graph that looks like this one, for instance. Instead of actual commit hash IDs, I'll use single uppercase letters to represent the commits:

A--B--C------G   <-- master
    \       /
     D--E--F   <--- develop

This repository has seven (count them!) commits. All seven commits are on, i.e., reachable from, branch master. Five of the seven commits (A, B, D, E, and F) are on branch develop. The name master identifies commit G, which is a merge commit. The name develop identifies commit F, which is not.
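Reachability in this example graph can be checked with a small model. The sketch below is not GitPython code: the PARENTS dictionary is a hand-written stand-in for the drawing above, and reachable is a hypothetical helper that follows every parent link, which is roughly what listing all commits from a branch name does.

```python
# Each commit maps to its parents; for the merge G, the first parent (C) is listed first.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "F": ["E"],
    "G": ["C", "F"],
}
BRANCHES = {"master": "G", "develop": "F"}


def reachable(start, parents=PARENTS):
    """Return the set of commits found by following all parent links from start."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


print(sorted(reachable(BRANCHES["master"])))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(sorted(reachable(BRANCHES["develop"])))  # ['A', 'B', 'D', 'E', 'F']
```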

When you do this:

    repo.head.reset('HEAD~1', index = True, working_tree = True)

you have Python tell Git to resolve the current commit, which is one of these seven, to its first parent, and then change the repository's idea of "current commit" to the commit you just found. Let's say that you start out with HEAD (the current commit) being commit G. Then HEAD~1 is commit C.

Here things get a bit complicated. The repo.head object represents Git's own HEAD, which is always one of two different things: a symbolic reference to a branch name, or a raw commit hash (a "detached HEAD"). In this case it's pretty clearly a symbolic reference, pointing to master. I have not tested this out but it seems virtually certain that GitPython faithfully reproduces Git's own behavior here, and does the equivalent of git reset with one of --soft, --mixed, or --hard depending on your parameters, and yours are those for --hard (curiously the command shown failing here uses --mixed; either your code doesn't match your posting, or more likely, GitPython uses an extra step). So what this ends up doing is making the name master point to the newly selected commit C:

A--B--C   <-- master
    \
     D--E--F   <-- develop

Where did commit G go? Well, nowhere really, but it's now "lost": it is hard to find, and after an expiration period, it really will be removed entirely. So commit G is effectively gone. (It could be resurrected, if we know its hash: we could force master to point to it again with another git reset or equivalent. Your list of commits in variable commits still lists its hash, so that's one of many ways we could find and resurrect it.)

You now do your main loop body code, working with commit C:

sha = c.name_rev.split()[0] 
shutil.copytree(repo_path, destination_path)

You've gone through one of the seven commits in your list, making a copy of commit C while thinking it was commit G (the first commit in repo.iter_commits('master') is commit G since that's the one master points to).

You are now ready to loop around to work on the second. The repository, however, now has just six commits, and master points to commit C. You now do another git reset --hard, erasing commit C from the picture, leaving us with:

A--B   <-- master
    \
     D--E--F   <-- develop

Now you do something with commit B (while the c in for c in commits is on the second commit of the seven, listed in some order—it's not clear what order repo.iter_commits uses, but it probably runs git rev-list and hence gets the default order; if so, see the git rev-list documentation).

Now you do another git reset --hard. This time, commit B is not forgotten: commit D remembers it. But master winds up pointing to commit A:

A   <-- master
 \
  B--D--E--F   <-- develop

You do your thing with commit A, while the for c in commits is on the third commit of seven.

Now you ask Git to find A's first parent commit ... but A doesn't have a first parent, or any parent at all. Commit A is the first commit ever made; it's a root commit. At this point, git reset simply fails. You've iterated over the four commits that are reachable from master by following only the first-parent links. The other three commits that are reachable from master require, at one point, following the second parent. You have also removed two of the four commits you visited; two remain only because they're reachable from another name.
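The walk described above can be replayed on a toy model of the example graph, without touching a real repository. This is a sketch only: PARENTS is a hand-written copy of the drawing, and first_parent and walk_back are hypothetical helpers that imitate what resolving HEAD~1 and resetting to it do to the current commit.

```python
# Each commit maps to its parents; for the merge G, the first parent (C) is listed first.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "F": ["E"],
    "G": ["C", "F"],
}


def first_parent(commit):
    """Imitate resolving HEAD~1: fail on a root commit, as git reset does."""
    parents = PARENTS[commit]
    if not parents:
        raise ValueError(f"{commit} is a root commit: no HEAD~1 to resolve")
    return parents[0]


def walk_back(tip, steps):
    """Step back `steps` times from tip; return (commits landed on, error or None)."""
    head, copied = tip, []
    for _ in range(steps):
        try:
            head = first_parent(head)
        except ValueError as err:
            return copied, str(err)
        copied.append(head)
    return copied, None


# Seven commits are listed, so the loop tries to step back seven times.
copied, error = walk_back("G", steps=7)
print(copied)  # ['C', 'B', 'A']
print(error)   # A is a root commit: no HEAD~1 to resolve
```

Only C, B, and A are ever copied: G is skipped because the reset happens before the copy, and D, E, and F are never visited because they sit behind G's second parent.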

Note that you could have the same graph but without the name develop any more:

A--B--C------G   <-- master
    \       /
     D--E--F

In this case, the first git reset that wipes out G also wipes out access to the D-E-F chain, because G was the key to that access: it's now G^2, which is the second parent of commit G, that finds F. It's F that finds E, and E that finds D; so losing G loses all of these, and this winds up leaving just:

A--B--C   <-- master

visible. (As before, all the "erased" commits stick around for a grace period, and can be resurrected as long as you can find them again.)

... how do I get around it

Use a completely different algorithm, and/or choose your commits wisely. Just because there are seven (or whatever other number) of commits that are reachable from some branch name, does not mean that all seven (or whatever) are linked as first parents.

Note that even in a completely linear setup, such as:

A--B   <-- master

you will have a list of two commits (in the order B then A), but you can only run git reset HEAD~1 once, to step back from B to A. Once you are on A, you cannot step back again. In this situation you must step back one fewer time than you do things with commits. You should also do your thing, whatever it is, with the current commit first, before stepping back.
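One way to sidestep the reset entirely while staying in GitPython is to iterate over the commit objects and export each one directly. The following is a rough sketch rather than a tested solution: archive_path, snapshot_commits, and the one-tar-per-hash layout are hypothetical, and it assumes that Repo.iter_commits forwards extra keyword arguments to git rev-list and that Repo.archive accepts a treeish argument.

```python
import os


def archive_path(dest_root, hexsha):
    """Hypothetical naming scheme: one tar file per commit, named by full hash."""
    return os.path.join(dest_root, hexsha + ".tar")


def snapshot_commits(repo_path, dest_root, branch="master", first_parent_only=False):
    """Write a tar archive of every commit on `branch` without moving HEAD."""
    # Imported here so this module still loads where GitPython is not installed.
    from git import Repo

    repo = Repo(repo_path)
    os.makedirs(dest_root, exist_ok=True)
    # Assumption: keyword arguments are handed to `git rev-list`,
    # so first_parent=True becomes the --first-parent option.
    rev_list_options = {"first_parent": True} if first_parent_only else {}
    written = []
    for commit in repo.iter_commits(branch, **rev_list_options):
        target = archive_path(dest_root, commit.hexsha)
        with open(target, "wb") as stream:
            # Exports this commit's tree; no branch name is reset along the way.
            repo.archive(stream, treeish=commit.hexsha)
        written.append(target)
    return written
```

With first_parent_only=True the loop would visit only G, C, B, and A in the example graph; with the default it would visit all seven commits. Either way each commit is processed exactly once, and a root commit needs no special handling.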

It's not immediately obvious to me how GitPython deals with a "detached HEAD", though if you want to access files directly from Python code there's not that much point to using a detached HEAD. But if you're going to run shutil.copytree you might as well just write this whole thing as a shell script, which is far simpler. Git is full of shell scripts, is designed to work well with them, and requires a shell interpreter in order to function at all, so if you have Git, you have a shell interpreter.

thanks for the comprehensive response! running git archive is so much simpler and fixed the problem. new to python and i had previously received suggestions to use GitPython but running the terminal command is so much simpler. thanks so much!

Arount · Accepted Answer · 2017-08-02 23:59:13Z

'fatal: Failed to resolve 'HEAD~1' as a valid revision.' means Git cannot find a previous commit. This happens only when no previous commit exists.

This is surely because you ran your script several times.

GitPython interacts with your repository in exactly the same way you would from the command line, so if you run a script that resets the whole repository back to the first commit, your repository is left with a single commit.

So, the next time you run it, nothing will happen except this error.

I advise you to clone an existing repository in a temporary directory first, like:

import git
git.Git().clone("git://foobar.git", "path/to/cloned_repo")

Or from local directory (if you don't need online repos):

git.Git().clone("path/to/source_repo/", "path/to/cloned_repo")

ps:

commits = list(repo.iter_commits('master'))
for c in commits:

could just as well be written as:

for commit in repo.iter_commits('master'):

edited Aug 2, 2017 at 23:59

answered Aug 2, 2017 at 23:53


2 Comments

emillyrock Over a year ago

I figured that might be the issue (if I'm understanding you correctly) so I have been running the code on a new copy of the repo each time. The error only pops up when it gets part way through the commits (i.e. it works fine and makes the appropriate copy up until a specific commit).

Arount Over a year ago

Is the specific commit the last one? That's the main problem: you list all the commits and reset to the previous one each time, so you eventually reach the point where only one commit is left, and you cannot go back to a previous one.
