aboutsummaryrefslogtreecommitdiffstats
path: root/commit-graph.c
AgeCommit message (Collapse)AuthorFilesLines
2019-04-01commit-graph: don't pass filename to load_commit_graph_one_fd_st()Ævar Arnfjörð Bjarmason1-4/+3
An earlier change implemented load_commit_graph_one_fd_st() in a way that was bug-compatible with earlier code in terms of the "graph file %s is too small" error message printing out the path to the commit-graph (".git/objects/info/commit-graph"). But change that, because: * A function that takes an already-open file descriptor also needing the filename isn't very intuitive. * The vast majority of errors we might emit when loading the graph come from parse_commit_graph(), which doesn't report the filename. Let's not do that either in this case for consistency. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01commit-graph: don't early exit(1) on e.g. "git status"Ævar Arnfjörð Bjarmason1-12/+30
Make the commit-graph loading code work as a library that returns an error code instead of calling exit(1) when the commit-graph is corrupt. This means that e.g. "status" will now report commit-graph corruption as an "error: [...]" at the top of its output, but then proceed to work normally. This required splitting up the load_commit_graph_one() function so that the code that deals with open()-ing and stat()-ing the graph can now be called independently as open_commit_graph(). This is needed because "commit-graph verify" where the graph doesn't exist isn't an error. See the third paragraph in 283e68c72f ("commit-graph: add 'verify' subcommand", 2018-06-27). There's a bug in that logic where we conflate the intended ENOENT with other errno values (e.g. EACCES), but this change doesn't address that. That'll be addressed in a follow-up change. I'm then splitting most of the logic out of load_commit_graph_one() into load_commit_graph_one_fd_st(), which allows for providing an existing file descriptor and stat information to the loading code. This isn't strictly needed, but it would be redundant and confusing to open() and stat() the file twice for some of the codepaths, this allows for calling open_commit_graph() followed by load_commit_graph_one_fd_st(). The "graph_file" still needs to be passed to that function for the the "graph file %s is too small" error message. This leaves load_commit_graph_one() unused by everything except the internal prepare_commit_graph_one() function, so let's mark it as "static". If someone needs it in the future we can remove the "static" attribute. I could also rewrite its sole remaining user ("prepare_commit_graph_one()") to use load_commit_graph_one_fd_st() instead, but let's leave it at this. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01commit-graph: fix segfault on e.g. "git status"Ævar Arnfjörð Bjarmason1-9/+34
When core.commitGraph=true is set, various common commands now consult the commit graph. Because the commit-graph code is very trusting of its input data, it's possibly to construct a graph that'll cause an immediate segfault on e.g. "status" (and e.g. "log", "blame", ...). In some other cases where git immediately exits with a cryptic error about the graph being broken. The root cause of this is that while the "commit-graph verify" sub-command exhaustively verifies the graph, other users of the graph simply trust the graph, and will e.g. deference data found at certain offsets as pointers, causing segfaults. This change does the bare minimum to ensure that we don't segfault in the common fill_commit_in_graph() codepath called by e.g. setup_revisions(), to do this instrument the "commit-graph verify" tests to always check if "status" would subsequently segfault. This fixes the following tests which would previously segfault: not ok 50 - detect low chunk count not ok 51 - detect missing OID fanout chunk not ok 52 - detect missing OID lookup chunk not ok 53 - detect missing commit data chunk Those happened because with the commit-graph enabled setup_revisions() would eventually call fill_commit_in_graph(), where e.g. g->chunk_commit_data is used early as an offset (and will be 0x0). With this change we get far enough to detect that the graph is broken, and show an error instead. E.g.: $ git status; echo $? error: commit-graph is missing the Commit Data chunk 1 That also sucks, we should *warn* and not hard-fail "status" just because the commit-graph is corrupt, but fixing is left to a follow-up change. A side-effect of changing the reporting from graph_report() to error() is that we now have an "error: " prefix for these even for "commit-graph verify". Pseudo-diff before/after: $ git commit-graph verify -commit-graph is missing the Commit Data chunk +error: commit-graph is missing the Commit Data chunk Changing that is OK. Various errors it emits now early on are prefixed with "error: ", moving these over and changing the output doesn't break anything. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-05Merge branch 'ab/commit-graph-write-progress'Junio C Hamano1-36/+94
The codepath to show progress meter while writing out commit-graph file has been improved. * ab/commit-graph-write-progress: commit-graph write: emit a percentage for all progress commit-graph write: add itermediate progress commit-graph write: remove empty line for readability commit-graph write: add more descriptive progress output commit-graph write: show progress for object search commit-graph write: more descriptive "writing out" output commit-graph write: add "Writing out" progress output commit-graph: don't call write_graph_chunk_extra_edges() unnecessarily commit-graph: rename "large edges" to "extra edges"
2019-02-05Merge branch 'ab/commit-graph-write-optim'Junio C Hamano1-2/+4
The codepath to write out commit-graph has been optimized by following the usual pattern of visiting objects in in-pack order. * ab/commit-graph-write-optim: commit-graph write: use pack order when finding commits
2019-02-05Merge branch 'js/commit-graph-chunk-table-fix'Junio C Hamano1-19/+48
The codepath to read from the commit-graph file attempted to read past the end of it when the file's table-of-contents was corrupt. * js/commit-graph-chunk-table-fix: Makefile: correct example fuzz build commit-graph: fix buffer read-overflow commit-graph, fuzz: add fuzzer for commit-graph
2019-02-05Merge branch 'sb/more-repo-in-api'Junio C Hamano1-16/+24
The in-core repository instances are passed through more codepaths. * sb/more-repo-in-api: (23 commits) t/helper/test-repository: celebrate independence from the_repository path.h: make REPO_GIT_PATH_FUNC repository agnostic commit: prepare free_commit_buffer and release_commit_memory for any repo commit-graph: convert remaining functions to handle any repo submodule: don't add submodule as odb for push submodule: use submodule repos for object lookup pretty: prepare format_commit_message to handle arbitrary repositories commit: prepare logmsg_reencode to handle arbitrary repositories commit: prepare repo_unuse_commit_buffer to handle any repo commit: prepare get_commit_buffer to handle any repo commit-reach: prepare in_merge_bases[_many] to handle any repo commit-reach: prepare get_merge_bases to handle any repo commit-reach.c: allow get_merge_bases_many_0 to handle any repo commit-reach.c: allow remove_redundant to handle any repo commit-reach.c: allow merge_bases_many to handle any repo commit-reach.c: allow paint_down_to_common to handle any repo commit: allow parse_commit* to handle any repo object: parse_object to honor its repository argument object-store: prepare has_{sha1, object}_file to handle any repo object-store: prepare read_object_file to deal with any repo ...
2019-01-29Merge branch 'bc/sha-256'Junio C Hamano1-16/+17
Add sha-256 hash and plug it through the code to allow building Git with the "NewHash". * bc/sha-256: hash: add an SHA-256 implementation using OpenSSL sha256: add an SHA-256 implementation using libgcrypt Add a base implementation of SHA-256 support commit-graph: convert to using the_hash_algo t/helper: add a test helper to compute hash speed sha1-file: add a constant for hash block size t: make the sha1 test-tool helper generic t: add basic tests for our SHA-1 implementation cache: make hashcmp and hasheq work with larger hashes hex: introduce functions to print arbitrary hashes sha1-file: provide functions to look up hash algorithms sha1-file: rename algorithm to "sha1"
2019-01-23commit-graph write: emit a percentage for all progressÆvar Arnfjörð Bjarmason1-7/+7
Follow-up 01ca387774 ("commit-graph: split up close_reachable() progress output", 2018-11-19) by making the progress bars in close_reachable() report a completion percentage. This fixes the last occurrence where in the commit graph writing where we didn't report that. The change in 01ca387774 split up the 1x progress bar in close_reachable() into 3x, but left them as dumb counters without a percentage completion. Fixing that is easy, and the only reason it wasn't done already is because that commit was rushed in during the v2.20.0 RC period to fix the unrelated issue of over-reporting commit numbers. See [1] and follow-ups for ML activity at the time and [2] for an alternative approach where the progress bars weren't split up. Now for e.g. linux.git we'll emit: $ ~/g/git/git --exec-path=$HOME/g/git commit-graph write Finding commits for commit graph among packed objects: 100% (6529159/6529159), done. Expanding reachable commits in commit graph: 100% (815990/815980), done. Computing commit graph generation numbers: 100% (815983/815983), done. Writing out commit graph in 4 passes: 100% (3263932/3263932), done. 1. https://public-inbox.org/git/20181119202300.18670-1-avarab@gmail.com/ 2. https://public-inbox.org/git/20181122153922.16912-11-avarab@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph write: add itermediate progressÆvar Arnfjörð Bjarmason1-0/+13
Add progress output to sections of code between "Annotating[...]" and "Computing[...]generation numbers". This can collectively take 5-10 seconds on a large enough repository. On a test repository with I have with ~7 million commits and ~50 million objects we'll now emit: $ ~/g/git/git --exec-path=$HOME/g/git commit-graph write Finding commits for commit graph among packed objects: 100% (124763727/124763727), done. Loading known commits in commit graph: 100% (18989461/18989461), done. Expanding reachable commits in commit graph: 100% (18989507/18989461), done. Clearing commit marks in commit graph: 100% (18989507/18989507), done. Counting distinct commits in commit graph: 100% (18989507/18989507), done. Finding extra edges in commit graph: 100% (18989507/18989507), done. Computing commit graph generation numbers: 100% (7250302/7250302), done. Writing out commit graph in 4 passes: 100% (29001208/29001208), done. Whereas on a medium-sized repository such as linux.git these new progress bars won't have time to kick in and as before and we'll still emit output like: $ ~/g/git/git --exec-path=$HOME/g/git commit-graph write Finding commits for commit graph among packed objects: 100% (6529159/6529159), done. Expanding reachable commits in commit graph: 815990, done. Computing commit graph generation numbers: 100% (815983/815983), done. Writing out commit graph in 4 passes: 100% (3263932/3263932), done. The "Counting distinct commits in commit graph" phase will spend most of its time paused at "0/*" as we QSORT(...) the list. That's not optimal, but at least we don't seem to be stalling anymore most of the time. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph write: remove empty line for readabilityÆvar Arnfjörð Bjarmason1-1/+0
Remove the empty line between a QSORT(...) and the subsequent oideq() for-loop. This makes it clearer that the QSORT(...) is being done so that we can run the oideq() loop on adjacent OIDs. Amends code added in 08fd81c9b6 ("commit-graph: implement write_commit_graph()", 2018-04-02). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph write: add more descriptive progress outputÆvar Arnfjörð Bjarmason1-7/+18
Make the progress output shown when we're searching for commits to include in the graph more descriptive. This amends code I added in 7b0f229222 ("commit-graph write: add progress output", 2018-09-17). Now, on linux.git, we'll emit this sort of output in the various modes we support: $ git commit-graph write Finding commits for commit graph among packed objects: 100% (6529159/6529159), done. [...] # Actually we don't emit this since this takes almost no time at # all. But if we did (s/_delayed//) we'd show: $ git for-each-ref --format='%(objectname)' | git commit-graph write --stdin-commits Finding commits for commit graph from 630 refs: 100% (630/630), done. [...] $ (cd .git/objects/pack/ && ls *idx) | git commit-graph write --stdin-pack Finding commits for commit graph in 3 packs: 6529159, done. [...] The middle on of those is going to be the output users might see in practice, since it'll be emitted when they get the commit graph via gc.writeCommitGraph=true. But as noted above you need a really large number of refs for this message to show. It'll show up on a test repository I have with ~165k refs: Finding commits for commit graph from 165203 refs: 100% (165203/165203), done. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph write: show progress for object searchÆvar Arnfjörð Bjarmason1-2/+7
Show the percentage progress for the "Finding commits for commit graph" phase for the common case where we're operating on all packs in the repository, as "commit-graph write" or "gc" will do. Before we'd emit on e.g. linux.git with "commit-graph write": Finding commits for commit graph: 6529159, done. [...] And now: Finding commits for commit graph: 100% (6529159/6529159), done. [...] Since the commit graph only includes those commits that are packed (via for_each_packed_object(...)) the approximate_object_count() returns the actual number of objects we're going to process. Still, it is possible due to a race with "gc" or another process maintaining packs that the number of objects we're going to process is lower than what approximate_object_count() reported. In that case we don't want to stop the progress bar short of 100%. So let's make sure it snaps to 100% at the end. The inverse case is also possible and more likely. I.e. that a new pack has been added between approximate_object_count() and for_each_packed_object(). In that case the percentage will go beyond 100%, and we'll do nothing to snap it back to 100% at the end. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph write: more descriptive "writing out" outputÆvar Arnfjörð Bjarmason1-2/+10
Make the "Writing out" part of the progress output more descriptive. Depending on the shape of the graph we either make 3 or 4 passes over it. Let's present this information to the user in case they're wondering what this number, which is much larger than their number of commits, has to do with writing out the commit graph. Now e.g. on linux.git we emit: $ ~/g/git/git --exec-path=$HOME/g/git -C ~/g/linux commit-graph write Finding commits for commit graph: 6529159, done. Expanding reachable commits in commit graph: 815990, done. Computing commit graph generation numbers: 100% (815983/815983), done. Writing out commit graph in 4 passes: 100% (3263932/3263932), done. A note on i18n: Why are we using the Q_() function and passing a number & English text for a singular which'll never be used? Because the plural rules of translated languages may not match those of English, and to use the plural function we need to use this format. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph write: add "Writing out" progress outputÆvar Arnfjörð Bjarmason1-9/+30
Add progress output to be shown when we're writing out the commit-graph, this adds to the output already added in 7b0f229222 ("commit-graph write: add progress output", 2018-09-17). As noted in that commit most of the progress output isn't displayed on small repositories, but before this change we'd noticeably hang for 2-3 seconds at the end on medium sized repositories such as linux.git. Now we'll instead show output like this, and reduce the human-observable times at which we're not producing progress output: $ ~/g/git/git --exec-path=$HOME/g/git -C ~/g/2015-04-03-1M-git commit-graph write Finding commits for commit graph: 13064614, done. Expanding reachable commits in commit graph: 1000447, done. Computing commit graph generation numbers: 100% (1000447/1000447), done. Writing out commit graph: 100% (3001341/3001341), done. This "Writing out" number is 3x or 4x the number of commits, depending on the graph we're processing. A later change will make this explicit to the user. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23commit-graph: don't call write_graph_chunk_extra_edges() unnecessarilySZEDER Gábor1-1/+2
The optional 'Extra Edge List' chunk of the commit graph file stores parent information for commits with more than two parents. Since the chunk is optional, write_commit_graph() looks through all commits to find those with more than two parents, and then writes the commit graph file header accordingly, i.e. if there are no such commits, then there won't be a 'Extra Edge List' chunk written, only the three mandatory chunks. However, when it later comes to writing actual chunk data, write_commit_graph() unconditionally invokes write_graph_chunk_extra_edges(), even when it was decided earlier that that chunk won't be written. Strictly speaking there is no bug here, because write_graph_chunk_extra_edges() won't write anything if it doesn't find any commits with more than two parents, but then it unnecessarily and in vain looks through all commits once again in search for such commits. Don't call write_graph_chunk_extra_edges() when that chunk won't be written to spare an unnecessary iteration over all commits. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-22commit-graph: rename "large edges" to "extra edges"SZEDER Gábor1-12/+12
The optional 'Large Edge List' chunk of the commit graph file stores parent information for commits with more than two parents, and the names of most of the macros, variables, struct fields, and functions related to this chunk contain the term "large edges", e.g. write_graph_chunk_large_edges(). However, it's not a really great term, as the edges to the second and subsequent parents stored in this chunk are not any larger than the edges to the first and second parents stored in the "main" 'Commit Data' chunk. It's the number of edges, IOW number of parents, that is larger compared to non-merge and "regular" two-parent merge commits. And indeed, two functions in 'commit-graph.c' have a local variable called 'num_extra_edges' that refer to the same thing, and this "extra edges" term is much better at describing these edges. So let's rename all these references to "large edges" in macro, variable, function, etc. names to "extra edges". There is a GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency rename it to GRAPH_EXTRA_EDGES_NEEDED. We can do so safely without causing any incompatibility issues, because the term "large edges" doesn't come up in the file format itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there is no 'L' in there), but only in the specification text. The string "large edges", however, does come up in the output of 'git commit-graph read' and in tests looking at its input, but that command is explicitly documented as debugging aid, so we can change its output and the affected tests safely. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-22commit-graph write: use pack order when finding commitsÆvar Arnfjörð Bjarmason1-2/+4
Slightly optimize the "commit-graph write" step by using FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack(). See commit [1] and [2] for the facility and a similar optimization for "cat-file". On Linux it is around 5% slower to run: echo 1 >/proc/sys/vm/drop_caches && cat .git/objects/pack/* >/dev/null && git cat-file --batch-all-objects --batch-check --unordered Than the same thing with the "cat" omitted. This is as expected, since we're iterating in pack order and the "cat" is extra work. Before this change the opposite was true of "commit-graph write". We were 6% faster if we first ran "cat" to efficiently populate the FS cache for our sole big pack on linux.git, than if we had populated it via for_each_object_in_pack(). Now we're 3% faster without the "cat" instead. My tests were done on an unloaded Linux 3.10 system with 10 runs for each. Derrick Stolee did his own tests on Windows[3] showing a 2% improvement with a high degree of accuracy. 1. 736eb88fdc ("for_each_packed_object: support iterating in pack-order", 2018-08-10) 2. 0750bb5b51 ("cat-file: support "unordered" output for --batch-all-objects", 2018-08-10) 3. https://public-inbox.org/git/f71fa868-25e8-a9c9-46a6-611b987f1a8f@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-18Merge branch 'ds/commit-graph-assert-missing-parents'Junio C Hamano1-6/+11
Tightening error checking in commit-graph writer. * ds/commit-graph-assert-missing-parents: commit-graph: writing missing parents is a BUG
2019-01-15commit-graph: fix buffer read-overflowJosh Steadmon1-2/+12
fuzz-commit-graph identified a case where Git will read past the end of a buffer containing a commit graph if the graph's header has an incorrect chunk count. A simple bounds check in parse_commit_graph() prevents this. Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-15commit-graph, fuzz: add fuzzer for commit-graphJosh Steadmon1-17/+36
Break load_commit_graph_one() into a new function, parse_commit_graph(). The latter function operates on arbitrary buffers, which makes it suitable as a fuzzing target. Since parse_commit_graph() is only called by load_commit_graph_one() (and the fuzzer described below), we omit error messages that would be duplicated by the caller. Adds fuzz-commit-graph.c, which provides a fuzzing entry point compatible with libFuzzer (and possibly other fuzzing engines). Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-14Merge branch 'ab/commit-graph-progress-fix'Junio C Hamano1-3/+10
* ab/commit-graph-progress-fix: commit-graph: split up close_reachable() progress output
2019-01-02commit-graph: writing missing parents is a BUGDerrick Stolee1-6/+11
When writing a commit-graph, we write GRAPH_MISSING_PARENT if the parent's object id does not appear in the list of commits to be written into the commit-graph. This was done as the initial design allowed commits to have missing parents, but the final version requires the commit-graph to be closed under reachability. Thus, this GRAPH_MISSING_PARENT value should never be written. However, there are reasons why it could be written! These range from a bug in the reachable-closure code to a memory error causing the binary search into the list of object ids to fail. In either case, we should fail fast and avoid writing the commit-graph file with bad data. Remove the GRAPH_MISSING_PARENT constant in favor of the constant GRAPH_EDGE_LAST_MASK, which has the same value. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-12-28commit-graph: convert remaining functions to handle any repoStefan Beller1-16/+24
Convert all functions to handle arbitrary repositories in commit-graph.c that are used by functions taking a repository argument already. Notable exclusion is write_commit_graph and its local functions as that only works on the_repository. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-20commit-graph: split up close_reachable() progress outputÆvar Arnfjörð Bjarmason1-3/+10
Amend the progress output added in 7b0f229222 ("commit-graph write: add progress output", 2018-09-17) so that the total numbers it reports aren't higher than the total number of commits anymore. See [1] for a bug report pointing that out. When I added this I wasn't intending to provide an accurate count, but just have some progress output to show the user the command wasn't hanging[2]. But since we are showing numbers, let's make them accurate. The progress descriptions were suggested by Derrick Stolee in [3]. As noted in [2] we are unlikely to show anything except the "Expanding reachable..." message even on fairly large repositories such as linux.git. On a test repository I have with north of 7 million commits all of these are displayed. Two of them don't show up for long, but as noted in [5] future-proofing this for if the loops become more expensive in the future makes sense. 1. https://public-inbox.org/git/20181010203738.GE23446@szeder.dev/ 2. https://public-inbox.org/git/87pnwhea8y.fsf@evledraar.gmail.com/ 3. https://public-inbox.org/git/f7a0cbee-863c-61d3-4959-5cec8b43c705@gmail.com/ 4. https://public-inbox.org/git/20181015160545.GG19800@szeder.dev/ 5. https://public-inbox.org/git/87murle8da.fsf@evledraar.gmail.com/ Reported-by: SZEDER Gábor <szeder.dev@gmail.com> Helped-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-14commit-graph: convert to using the_hash_algobrian m. carlson1-16/+17
Instead of using hard-coded constants for object sizes, use the_hash_algo to look them up. In addition, use a function call to look up the object ID version and produce the correct value. For now, we use version 1, which means to use the default algorithm used in the rest of the repository. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13sha1-file: use an object_directory for the main object dirJeff King1-4/+1
Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13rename "alternate_object_database" to "object_directory"Jeff King1-5/+5
In preparation for unifying the handling of alt odb's and the normal repo object directory, let's use a more neutral name. This patch is purely mechanical, swapping the type name, and converting any variables named "alt" to "odb". There should be no functional change, but it will reduce the noise in subsequent diffs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-19Merge branch 'ds/commit-graph-leakfix'Junio C Hamano1-6/+10
Code clean-up. * ds/commit-graph-leakfix: commit-graph: reduce initial oid allocation builtin/commit-graph.c: UNLEAK variables commit-graph: clean up leaked memory during write
2018-10-16Merge branch 'ds/commit-graph-with-grafts'Junio C Hamano1-4/+34
The recently introduced commit-graph auxiliary data is incompatible with mechanisms such as replace & grafts that "breaks" immutable nature of the object reference relationship. Disable optimizations based on its use (and updating existing commit-graph) when these incompatible features are in use in the repository. * ds/commit-graph-with-grafts: commit-graph: close_commit_graph before shallow walk commit-graph: not compatible with uninitialized repo commit-graph: not compatible with grafts commit-graph: not compatible with replace objects test-repository: properly init repo commit-graph: update design document refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback refs.c: migrate internal ref iteration to pass thru repository argument
2018-10-16Merge branch 'ab/commit-graph-progress'Junio C Hamano1-8/+57
Generation of (experimental) commit-graph files have so far been fairly silent, even though it takes noticeable amount of time in a meaningfully large repository. The users will now see progress output. * ab/commit-graph-progress: gc: fix regression in 7b0f229222 impacting --quiet commit-graph verify: add progress output commit-graph write: add progress output
2018-10-07commit-graph: reduce initial oid allocationDerrick Stolee1-1/+1
While writing a commit-graph file, we store the full list of commits in a flat list. We use this list for sorting and ensuring we are closed under reachability. The initial allocation assumed that (at most) one in four objects is a commit. This is a dramatic over-count for many repos, especially large ones. Since we grow the repo dynamically, reduce this count by a factor of eight. We still set it to a minimum of 1024 before allocating. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-07commit-graph: clean up leaked memory during writeDerrick Stolee1-5/+9
The write_commit_graph() method in commit-graph.c leaks some lits and strings during execution. In addition, a list of strings is leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. Further, if we use a list of pack-files to find the commits, we can leak the packed_git structs after scanning them for commits. Running the following commands demonstrates the leak before and the fix after: * valgrind --leak-check=full ./git commit-graph write --reachable * valgrind --leak-check=full ./git commit-graph write --stdin-packs Signed-off-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17Merge branch 'ds/commit-graph-tests'Junio C Hamano1-2/+3
We can now optionally run tests with commit-graph enabled. * ds/commit-graph-tests: commit-graph: define GIT_TEST_COMMIT_GRAPH
2018-09-17Merge branch 'jk/cocci'Junio C Hamano1-5/+5
spatch transformation to replace boolean uses of !hashcmp() to newly introduced oideq() is added, and applied, to regain performance lost due to support of multiple hash algorithms. * jk/cocci: show_dirstat: simplify same-content check read-cache: use oideq() in ce_compare functions convert hashmap comparison functions to oideq() convert "hashcmp() != 0" to "!hasheq()" convert "oidcmp() != 0" to "!oideq()" convert "hashcmp() == 0" to hasheq() convert "oidcmp() == 0" to oideq() introduce hasheq() and oideq() coccinelle: use <...> for function exclusion
2018-09-17Merge branch 'ds/reachable'Junio C Hamano1-0/+18
The code for computing history reachability has been shuffled, obtained a bunch of new tests to cover them, and then being improved. * ds/reachable: commit-reach: correct accidental #include of C file commit-reach: use can_all_from_reach commit-reach: make can_all_from_reach... linear commit-reach: replace ref_newer logic test-reach: test commit_contains test-reach: test can_all_from_reach_with_flags test-reach: test reduce_heads test-reach: test get_merge_bases_many test-reach: test is_descendant_of test-reach: test in_merge_bases test-reach: create new test tool for ref_newer commit-reach: move can_all_from_reach_with_flags upload-pack: generalize commit date cutoff upload-pack: refactor ok_to_give_up() upload-pack: make reachable() more generic commit-reach: move commit_contains from ref-filter commit-reach: move ref_newer from remote.c commit.h: remove method declarations commit-reach: move walk methods from commit.c
2018-09-17commit-graph verify: add progress outputÆvar Arnfjörð Bjarmason1-0/+5
For the reasons explained in the "commit-graph write: add progress output" commit leading up to this one, emit progress on "commit-graph verify". Since e0fd51e1d7 ("fsck: verify commit-graph", 2018-06-27) "git fsck" has called this command if core.commitGraph=true, but there's been no progress output to indicate that anything was different. Now there is (on my tiny dotfiles.git repository): $ git -c core.commitGraph=true -C ~/ fsck Checking object directories: 100% (256/256), done. Checking objects: 100% (2821/2821), done. dangling blob 5b8bbdb9b788ed90459f505b0934619c17cc605b Verifying commits in commit graph: 100% (867/867), done. And on a larger repository, such as the 2015-04-03-1M-git.git test repository: $ time git -c core.commitGraph=true -C ~/g/2015-04-03-1M-git/ commit-graph verify Verifying commits in commit graph: 100% (1000447/1000447), done. real 0m7.813s [...] Since the "commit-graph verify" subcommand is never called from "git gc", we don't have to worry about passing some some "report_progress" progress variable around for this codepath. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17commit-graph write: add progress outputÆvar Arnfjörð Bjarmason1-8/+52
Before this change the "commit-graph write" command didn't report any progress. On my machine this command takes more than 10 seconds to write the graph for linux.git, and around 1m30s on the 2015-04-03-1M-git.git[1] test repository (a test case for a large monorepository). Furthermore, since the gc.writeCommitGraph setting was added in d5d5d7b641 ("gc: automatically write commit-graph files", 2018-06-27), there was no indication at all from a "git gc" run that anything was different. This why one of the progress bars being added here uses start_progress() instead of start_delayed_progress(), so that it's guaranteed to be seen. E.g. on my tiny 867 commit dotfiles.git repository: $ git -c gc.writeCommitGraph=true gc Enumerating objects: 2821, done. [...] Computing commit graph generation numbers: 100% (867/867), done. On larger repositories, such as linux.git the delayed progress bar(s) will kick in, and we'll show what's going on instead of, as was previously happening, printing nothing while we write the graph: $ git -c gc.writeCommitGraph=true gc [...] Annotating commits in commit graph: 1565573, done. Computing commit graph generation numbers: 100% (782484/782484), done. Note that here we don't show "Finding commits for commit graph", this is because under "git gc" we seed the search with the commit references in the repository, and that set is too small to show any progress, but would e.g. on a smaller repo such as git.git with --stdin-commits: $ git rev-list --all | git -c gc.writeCommitGraph=true write --stdin-commits Finding commits for commit graph: 100% (162576/162576), done. Computing commit graph generation numbers: 100% (162576/162576), done. With --stdin-packs we don't show any estimation of how much is left to do. This is because we might be processing more than one pack. We could be less lazy here and show progress, either by detecting that we're only processing one pack, or by first looping over the packs to discover how many commits they have. I don't see the point in doing that work. So instead we get (on 2015-04-03-1M-git.git): $ echo pack-<HASH>.idx | git -c gc.writeCommitGraph=true --exec-path=$PWD commit-graph write --stdin-packs Finding commits for commit graph: 13064614, done. Annotating commits in commit graph: 3001341, done. Computing commit graph generation numbers: 100% (1000447/1000447), done. No GC mode uses --stdin-packs. It's what they use at Microsoft to manually compute the generation numbers for their collection of large packs which are never coalesced. The reason we need a "report_progress" variable passed down from "git gc" is so that we don't report this output when we're running in the process "git gc --auto" detaches from the terminal. Since we write the commit graph from the "git gc" process itself (as opposed to what we do with say the "git repack" phase), we'd end up writing the output to .git/gc.log and reporting it to the user next time as part of the "The last gc run reported the following[...]" error, see 329e6e8794 ("gc: save log from daemonized gc --auto and print it next time", 2015-09-19). So we must keep track of whether or not we're running in that demonized mode, and if so print no progress. See [2] and subsequent replies for a discussion of an approach not taken in compute_generation_numbers(). I.e. we're saying "Computing commit graph generation numbers", even though on an established history we're mostly skipping over all the work we did in the past. This is similar to the white lie we tell in the "Writing objects" phase (not all are objects being written). Always showing progress is considered more important than accuracy. I.e. on a repository like 2015-04-03-1M-git.git we'd hang for 6 seconds with no output on the second "git gc" if no changes were made to any objects in the interim if we'd take the approach in [2]. 1. https://github.com/avar/2015-04-03-1M-git 2. <c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com> (https://public-inbox.org/git/c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com/) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29convert "hashcmp() != 0" to "!hasheq()"Jeff King1-1/+1
This rounds out the previous three patches, covering the inequality logic for the "hash" variant of the functions. As with the previous three, the accompanying code changes are the mechanical result of applying the coccinelle patch; see those patches for more discussion. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29convert "oidcmp() != 0" to "!oideq()"Jeff King1-3/+3
This is the flip side of the previous two patches: checking for a non-zero oidcmp() can be more strictly expressed as inequality. Like those patches, we write "!= 0" in the coccinelle transformation, which covers by isomorphism the more common: if (oidcmp(E1, E2)) As with the previous two patches, this patch can be achieved almost entirely by running "make coccicheck"; the only differences are manual line-wrap fixes to match the original code. There is one thing to note for anybody replicating this, though: coccinelle 1.0.4 seems to miss the case in builtin/tag.c, even though it's basically the same as all the others. Running with 1.0.7 does catch this, so presumably it's just a coccinelle bug that was fixed in the interim. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29convert "oidcmp() == 0" to oideq()Jeff King1-1/+1
Using the more restrictive oideq() should, in the long run, give the compiler more opportunities to optimize these callsites. For now, this conversion should be a complete noop with respect to the generated code. The result is also perhaps a little more readable, as it avoids the "zero is equal" idiom. Since it's so prevalent in C, I think seasoned programmers tend not to even notice it anymore, but it can sometimes make for awkward double negations (e.g., we can drop a few !!oidcmp() instances here). This patch was generated almost entirely by the included coccinelle patch. This mechanical conversion should be completely safe, because we check explicitly for cases where oidcmp() is compared to 0, which is what oideq() is doing under the hood. Note that we don't have to catch "!oidcmp()" separately; coccinelle's standard isomorphisms make sure the two are treated equivalently. I say "almost" because I did hand-edit the coccinelle output to fix up a few style violations (it mostly keeps the original formatting, but sometimes unwraps long lines). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29commit-graph: define GIT_TEST_COMMIT_GRAPHDerrick Stolee1-2/+3
The commit-graph feature is tested in isolation by t5318-commit-graph.sh and t6600-test-reach.sh, but there are many more interesting scenarios involving commit walks. Many of these scenarios are covered by the existing test suite, but we need to maintain coverage when the optional commit-graph structure is not present. To allow running the full test suite with the commit-graph present, add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try to load the commit-graph when parsing commits, and writes the commit-graph file after every 'git commit' command. There are a few tests that rely on commits not existing in pack-files to trigger important events, so manually set GIT_TEST_COMMIT_GRAPH to false for the necessary commands. There is one test in t6024-recursive-merge.sh that relies on the merge-base algorithm picking one of two ambiguous merge-bases, and the commit-graph feature changes which merge-base is picked. Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-21commit-graph: close_commit_graph before shallow walkDerrick Stolee1-4/+4
Call close_commit_graph() when about to start a rev-list walk that includes shallow commits. This is necessary in code paths that "fake" shallow commits for the sake of fetch. Specifically, test 351 in t5500-fetch-pack.sh runs git fetch --shallow-exclude one origin with a file-based transfer. When the "remote" has a commit-graph, we do not prevent the commit-graph from being loaded, but then the commits are intended to be dynamically transferred into shallow commits during get_shallow_commits_by_rev_list(). By closing the commit-graph before this call, we prevent this interaction. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-21commit-graph: not compatible with uninitialized repoDerrick Stolee1-0/+3
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-21commit-graph: not compatible with graftsDerrick Stolee1-0/+6
Augment commit_graph_compatible(r) to return false when the given repository r has commit grafts or is a shallow clone. Test that in these situations we ignore existing commit-graph files and we do not write new commit-graph files. Helped-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-21commit-graph: not compatible with replace objectsDerrick Stolee1-0/+21
Create new method commit_graph_compatible(r) to check if a given repository r is compatible with the commit-graph feature. Fill the method with a check to see if replace-objects exist. Test this interaction succeeds, including ignoring an existing commit-graph and failing to write a new commit-graph. However, we do ensure that we write a new commit-graph by setting read_replace_refs to 0, thereby ignoring the replace refs. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20Merge branch 'jk/for-each-object-iteration'Junio C Hamano1-1/+1
The API to iterate over all objects learned to optionally list objects in the order they appear in packfiles, which helps locality of access if the caller accesses these objects while as objects are enumerated. * jk/for-each-object-iteration: for_each_*_object: move declarations to object-store.h cat-file: use a single strbuf for all output cat-file: split batch "buf" into two variables cat-file: use oidset check-and-insert cat-file: support "unordered" output for --batch-all-objects cat-file: rename batch_{loose,packed}_object callbacks t1006: test cat-file --batch-all-objects with duplicates for_each_packed_object: support iterating in pack-order for_each_*_object: give more comprehensive docstrings for_each_*_object: take flag arguments as enum for_each_*_object: store flag definitions in a single location
2018-08-15Merge branch 'nd/i18n'Junio C Hamano1-10/+10
Many more strings are prepared for l10n. * nd/i18n: (23 commits) transport-helper.c: mark more strings for translation transport.c: mark more strings for translation sha1-file.c: mark more strings for translation sequencer.c: mark more strings for translation replace-object.c: mark more strings for translation refspec.c: mark more strings for translation refs.c: mark more strings for translation pkt-line.c: mark more strings for translation object.c: mark more strings for translation exec-cmd.c: mark more strings for translation environment.c: mark more strings for translation dir.c: mark more strings for translation convert.c: mark more strings for translation connect.c: mark more strings for translation config.c: mark more strings for translation commit-graph.c: mark more strings for translation builtin/replace.c: mark more strings for translation builtin/pack-objects.c: mark more strings for translation builtin/grep.c: mark strings for translation builtin/config.c: mark more strings for translation ...
2018-08-13for_each_packed_object: support iterating in pack-orderJeff King1-1/+1
We currently iterate over objects within a pack in .idx order, which uses the object hashes. That means that it is effectively random with respect to the location of the object within the pack. If you're going to access the actual object data, there are two reasons to move linearly through the pack itself: 1. It improves the locality of access in the packfile. In the cold-cache case, this may mean fewer disk seeks, or better usage of disk cache. 2. We store related deltas together in the packfile. Which means that the delta base cache can operate much more efficiently if we visit all of those related deltas in sequence, as the earlier items are likely to still be in the cache. Whereas if we visit the objects in random order, our cache entries are much more likely to have been evicted by unrelated deltas in the meantime. So in general, if you're going to access the object contents pack order is generally going to end up more efficient. But if you're simply generating a list of object names, or if you're going to end up sorting the result anyway, you're better off just using the .idx order, as finding the pack order means generating the in-memory pack-revindex. According to the numbers in 8b8dfd5132 (pack-revindex: radix-sort the revindex, 2013-07-11), that takes about 200ms for linux.git, and 20ms for git.git (those numbers are a few years old but are still a good ballpark). That makes it a good optimization for some cases (we can save tens of seconds in git.git by having good locality of delta access, for a 20ms cost), but a bad one for others (e.g., right now "cat-file --batch-all-objects --batch-check="%(objectname)" is 170ms in git.git, so adding 20ms to that is noticeable). Hence this patch makes it an optional flag. You can't actually do any interesting timings yet, as it's not plumbed through to any user-facing tools like cat-file. That will come in a later patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-23commit-graph.c: mark more strings for translationNguyễn Thái Ngọc Duy1-10/+10
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20commit-reach: use can_all_from_reachDerrick Stolee1-0/+18
The is_descendant_of method previously used in_merge_bases() to check if the commit can reach any of the commits in the provided list. This had two performance problems: 1. The performance is quadratic in worst-case. 2. A single in_merge_bases() call requires walking beyond the target commit in order to find the full set of boundary commits that may be merge-bases. The can_all_from_reach method avoids this quadratic behavior and can limit the search beyond the target commits using generation numbers. It requires a small prototype adjustment to stop using commit-date as a cutoff, as that optimization is no longer appropriate here. Since in_merge_bases() uses paint_down_to_common(), is_descendant_of() naturally found cutoffs to avoid walking the entire commit graph. Since we want to always return the correct result, we cannot use the min_commit_date cutoff in can_all_from_reach. We then rely on generation numbers to provide the cutoff. Since not all repos will have a commit-graph file, nor will we always have generation numbers computed for a commit-graph file, create a new method, generation_numbers_enabled(), that checks for a commit-graph file and sees if the first commit in the file has a non-zero generation number. In the case that we do not have generation numbers, use the old logic for is_descendant_of(). Performance was meausured on a copy of the Linux repository using the 'test-tool reach is_descendant_of' command using this input: A:v4.9 X:v4.10 X:v4.11 X:v4.12 X:v4.13 X:v4.14 X:v4.15 X:v4.16 X:v4.17 X.v3.0 Note that this input is tailored to demonstrate the quadratic nature of the previous method, as it will compute merge-bases for v4.9 versus all of the later versions before checking against v4.1. Before: 0.26 s After: 0.21 s Since we previously used the is_descendant_of method in the ref_newer method, we also measured performance there using 'test-tool reach ref_newer' with this input: A:v4.9 B:v3.19 Before: 0.10 s After: 0.08 s By adding a new commit with parent v3.19, we test the non-reachable case of ref_newer: Before: 0.09 s After: 0.08 s Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17commit-graph: add repo arg to graph readersJonathan Tan1-27/+33
Add a struct repository argument to the functions in commit-graph.h that read the commit graph. (This commit does not affect functions that write commit graphs.) Because the commit graph functions can now read the commit graph of any repository, the global variable core_commit_graph has been removed. Instead, the config option core.commitGraph is now read on the first time in a repository that a commit is attempted to be parsed using its commit graph. This commit includes a test that exercises the functionality on an arbitrary repository that is not the_repository. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17commit-graph: store graph in struct object_storeJonathan Tan1-21/+19
Instead of storing commit graphs in static variables, store them in struct object_store. There are no changes to the signatures of existing functions - they all still only support the_repository, and support for other instances of struct repository will be added in a subsequent commit. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17commit-graph: add free_commit_graphJonathan Tan1-10/+14
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17commit-graph: refactor preparing commit graphJonathan Tan1-11/+17
Two functions in the code (1) check if the repository is configured for commit graphs, (2) call prepare_commit_graph(), and (3) check if the graph exists. Move (1) and (3) into prepare_commit_graph(), reducing duplication of code. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17Merge branch 'ds/commit-graph-fsck' into jt/commit-graph-per-object-storeJunio C Hamano1-16/+235
* ds/commit-graph-fsck: (23 commits) coccinelle: update commit.cocci commit-graph: update design document gc: automatically write commit-graph files commit-graph: add '--reachable' option commit-graph: use string-list API for input fsck: verify commit-graph commit-graph: verify contents match checksum commit-graph: test for corrupted octopus edge commit-graph: verify commit date commit-graph: verify generation number commit-graph: verify parent list commit-graph: verify root tree OIDs commit-graph: verify objects exist commit-graph: verify corrupt OID fanout and lookup commit-graph: verify required chunks are present commit-graph: verify catches corrupt signature commit-graph: add 'verify' subcommand commit-graph: load a root tree from specific graph commit: force commit to parse from object database commit-graph: parse commit from chosen graph ...
2018-06-29commit: add repository argument to lookup_commitStefan Beller1-5/+5
Add a repository argument to allow callers of lookup_commit to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29commit: add repository argument to lookup_commit_reference_gentlyStefan Beller1-1/+1
Add a repository argument to allow callers of lookup_commit_reference_gently to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29tree: add repository argument to lookup_treeStefan Beller1-1/+1
Add a repository argument to allow the callers of lookup_tree to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: add '--reachable' optionDerrick Stolee1-0/+20
When writing commit-graph files, it can be convenient to ask for all reachable commits (starting at the ref set) in the resulting file. This is particularly helpful when writing to stdin is complicated, such as a future integration with 'git gc'. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: use string-list API for inputDerrick Stolee1-8/+7
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify contents match checksumDerrick Stolee1-2/+14
The commit-graph file ends with a SHA1 hash of the previous contents. If a commit-graph file has errors but the checksum hash is correct, then we know that the problem is a bug in Git and not simply file corruption after-the-fact. Compute the checksum right away so it is the first error that appears, and make the message translatable since this error can be "corrected" by a user by simply deleting the file and recomputing. The rest of the errors are useful only to developers. Be sure to continue checking the rest of the file data if the checksum is wrong. This is important for our tests, as we break the checksum as we modify bytes of the commit-graph file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify commit dateDerrick Stolee1-0/+6
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify generation numberDerrick Stolee1-0/+34
While iterating through the commit parents, perform the generation number calculation and compare against the value stored in the commit-graph. The tests demonstrate that having a different set of parents affects the generation number calculation, and this value propagates to descendants. Hence, we drop the single-line condition on the output. Since Git will ship with the commit-graph feature without generation numbers, we need to accept commit-graphs with all generation numbers equal to zero. In this case, ignore the generation number calculation. However, verify that we should never have a mix of zero and non-zero generation numbers. Create a test that sets one commit to generation zero and all following commits report a failure as they have non-zero generation in a file that contains generation number zero. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify parent listDerrick Stolee1-0/+28
The commit-graph file stores parents in a two-column portion of the commit data chunk. If there is only one parent, then the second column stores 0xFFFFFFFF to indicate no second parent. The 'verify' subcommand checks the parent list for the commit loaded from the commit-graph and the one parsed from the object database. Test these checks for corrupt parents, too many parents, and wrong parents. Add a boundary check to insert_parent_or_die() for when the parent position value is out of range. The octopus merge will be tested in a later commit. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify root tree OIDsDerrick Stolee1-1/+16
The 'verify' subcommand must compare the commit content parsed from the commit-graph against the content in the object database. Use lookup_commit() and parse_commit_in_graph_one() to parse the commits from the graph and compare against a commit that is loaded separately and parsed directly from the object database. Add checks for the root tree OID. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify objects existDerrick Stolee1-0/+18
In the 'verify' subcommand, load commits directly from the object database to ensure they exist. Parse by skipping the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify corrupt OID fanout and lookupDerrick Stolee1-0/+36
In the commit-graph file, the OID fanout chunk provides an index into the OID lookup. The 'verify' subcommand should find incorrect values in the fanout. Similarly, the 'verify' subcommand should find out-of-order values in the OID lookup. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: verify required chunks are presentDerrick Stolee1-0/+9
The commit-graph file requires the following three chunks: * OID Fanout * OID Lookup * Commit Data If any of these are missing, then the 'verify' subcommand should report a failure. This includes the chunk IDs malformed or the chunk count is truncated. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: add 'verify' subcommandDerrick Stolee1-0/+23
If the commit-graph file becomes corrupt, we need a way to verify that its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph verify' subcommand to report all issues with the file. Add the 'verify' subcommand to the 'commit-graph' builtin and its documentation. The subcommand is currently a no-op except for loading the commit-graph into memory, which may trigger run-time errors that would be caught by normal use. Add a simple test that ensures the command returns a zero error code. If no commit-graph file exists, this is an acceptable state. Do not report any errors. Helped-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: load a root tree from specific graphDerrick Stolee1-3/+9
When lazy-loading a tree for a commit, it will be important to select the tree from a specific struct commit_graph. Create a new method that specifies the commit-graph file and use that in get_commit_tree_in_graph(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: parse commit from chosen graphDerrick Stolee1-3/+15
Before verifying a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27commit-graph: fix GRAPH_MIN_SIZEDerrick Stolee1-2/+3
The GRAPH_MIN_SIZE macro should be the smallest size of a parsable commit-graph file. However, the minimum number of chunks was wrong. It is possible to write a commit-graph file with zero commits, and that violates this macro's value. Rewrite the macro, and use extra macros to better explain the magic constants. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-25Merge branch 'ds/commit-graph-lockfile-fix'Junio C Hamano1-32/+81
Update to ds/generation-numbers topic. * ds/commit-graph-lockfile-fix: commit-graph: fix UX issue when .lock file exists commit-graph.txt: update design document merge: check config before loading commits commit: use generation number in remove_redundant() commit: add short-circuit to paint_down_to_common() commit: use generation numbers for in_merge_bases() ref-filter: use generation number for --contains commit-graph: always load commit-graph information commit: use generations in paint_down_to_common() commit-graph: compute generation numbers commit: add generation number to struct commit ref-filter: fix outdated comment on in_commit_list
2018-05-23Merge branch 'sb/oid-object-info'Junio C Hamano1-1/+1
The codepath around object-info API has been taught to take the repository object (which in turn tells the API which object store the objects are to be located). * sb/oid-object-info: cache.h: allow oid_object_info to handle arbitrary repositories packfile: add repository argument to cache_or_unpack_entry packfile: add repository argument to unpack_entry packfile: add repository argument to read_object packfile: add repository argument to packed_object_info packfile: add repository argument to packed_to_object_type packfile: add repository argument to retry_bad_packed_offset cache.h: add repository argument to oid_object_info cache.h: add repository argument to oid_object_info_extended
2018-05-23Merge branch 'ds/lazy-load-trees'Junio C Hamano1-4/+24
The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so. * ds/lazy-load-trees: coccinelle: avoid wrong transformation suggestions from commit.cocci commit-graph: lazy-load trees for commits treewide: replace maybe_tree with accessor methods commit: create get_commit_tree() method treewide: rename tree to maybe_tree
2018-05-22commit-graph: fix UX issue when .lock file existsDerrick Stolee1-17/+5
We use the lockfile API to avoid multiple Git processes from writing to the commit-graph file in the .git/objects/info directory. In some cases, this directory may not exist, so we check for its existence. The existing code does the following when acquiring the lock: 1. Try to acquire the lock. 2. If it fails, try to create the .git/object/info directory. 3. Try to acquire the lock, failing if necessary. The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user: "fatal: cannot mkdir .git/objects/info: File exists" While technically this honors the lockfile, it does not help the user. Instead, do the following: 1. Check for existence of .git/objects/info; create if necessary. 2. Try to acquire the lock, failing if necessary. The new output looks like: fatal: Unable to create '<dir>/.git/objects/info/commit-graph.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-22commit-graph: always load commit-graph informationDerrick Stolee1-15/+31
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-22commit-graph: compute generation numbersDerrick Stolee1-0/+43
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-22commit: add generation number to struct commitDerrick Stolee1-0/+2
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0xFFFFFFFF GENERATION_NUMBER_MAX 0x3FFFFFFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-08Merge branch 'ds/commit-graph'Junio C Hamano1-0/+741
Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. * ds/commit-graph: commit-graph: implement "--append" option commit-graph: build graph from starting commits commit-graph: read only from specific pack-indexes commit: integrate commit graph with commit parsing commit-graph: close under reachability commit-graph: add core.commitGraph setting commit-graph: implement git commit-graph read commit-graph: implement git-commit-graph write commit-graph: implement write_commit_graph() commit-graph: create git-commit-graph builtin graph: add commit graph design document commit-graph: add format document csum-file: refactor finalize_hashfile() method csum-file: rename hashclose() to finalize_hashfile()
2018-04-11commit-graph: lazy-load trees for commitsDerrick Stolee1-3/+23
The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits. Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree(). Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative peformance improvement when trees are not needed is small. On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.92s After: 0.66s Rel %: -28.3% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11treewide: replace maybe_tree with accessor methodsDerrick Stolee1-1/+1
In anticipation of making trees load lazily, create a Coccinelle script (contrib/coccinelle/commit.cocci) to ensure that all references to the 'maybe_tree' member of struct commit are either mutations or accesses through get_commit_tree() or get_commit_tree_oid(). Apply the Coccinelle script to create the rest of the patch. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11treewide: rename tree to maybe_treeDerrick Stolee1-2/+2
Using the commit-graph file to walk commit history removes the large cost of parsing commits during the walk. This exposes a performance issue: lookup_tree() takes a large portion of the computation time, even when Git never uses those trees. In anticipation of lazy-loading these trees, rename the 'tree' member of struct commit to 'maybe_tree'. This serves two purposes: it hints at the future role of possibly being NULL even if the commit has a valid tree, and it allows for unambiguous transformation from simple member access (i.e. commit->maybe_tree) to method access. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11commit-graph: implement "--append" optionDerrick Stolee1-1/+16
Teach git-commit-graph to add all commits from the existing commit-graph file to the file about to be written. This should be used when adding new commits without performing garbage collection. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11commit-graph: build graph from starting commitsDerrick Stolee1-2/+25
Teach git-commit-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11commit-graph: read only from specific pack-indexesDerrick Stolee1-2/+24
Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11commit: integrate commit graph with commit parsingDerrick Stolee1-1/+140
Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. If core.commitGraph is false, then do not check graph files. In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits. | Command | Before | After | Rel % | |----------------------------------|--------|--------|-------| | log --oneline --topo-order -1000 | 8.31s | 0.94s | -88% | | branch -vv | 1.02s | 0.14s | -86% | | rev-list --all | 5.89s | 1.07s | -81% | | rev-list --all --objects | 66.15s | 58.45s | -11% | Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11commit-graph: close under reachabilityDerrick Stolee1-0/+45
Teach write_commit_graph() to walk all parents from the commits discovered in packfiles. This prevents gaps given by loose objects or previously-missed packfiles. Also automatically add commits from the existing graph file, if it exists. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11commit-graph: implement git commit-graph readDerrick Stolee1-1/+136
Teach git-commit-graph to read commit graph files and summarize their contents. Use the read subcommand to verify the contents of a commit graph file in the tests. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02commit-graph: implement write_commit_graph()Derrick Stolee1-0/+359
Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given object directory. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>