aboutsummaryrefslogtreecommitdiffstats
path: root/builtin/pack-objects.c
AgeCommit message (Collapse)AuthorFilesLines
2019-04-01pack-bitmap: switch hash tables to use struct object_idbrian m. carlson1-3/+3
Instead of storing unsigned char pointers in the hash tables, switch to storing instances of struct object_id. Update several internal functions and one external function to take pointers to struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-22trace2:data: pack-objects: add trace2 regionsDerrick Stolee1-1/+15
When studying the performance of 'git push' we would like to know how much time is spent at various parts of the command. One area that could cause performance trouble is 'git pack-objects'. Add trace2 regions around the three main actions taken in this command: 1. Enumerate objects. 2. Prepare pack. 3. Write pack-file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-06Merge branch 'sl/const'Junio C Hamano1-7/+7
Code cleanup. * sl/const: various: tighten constness of some local variables
2019-02-06Merge branch 'jk/loose-object-cache-oid'Junio C Hamano1-2/+2
Code clean-up. * jk/loose-object-cache-oid: prefer "hash mismatch" to "sha1 mismatch" sha1-file: avoid "sha1 file" for generic use in messages sha1-file: prefer "loose object file" to "sha1 file" in messages sha1-file: drop has_sha1_file() convert has_sha1_file() callers to has_object_file() sha1-file: convert pass-through functions to object_id sha1-file: modernize loose header/stream functions sha1-file: modernize loose object file functions http: use struct object_id instead of bare sha1 update comment references to sha1_object_info() sha1-file: fix outdated sha1 comment references
2019-02-06Merge branch 'ds/push-sparse-tree-walk'Junio C Hamano1-1/+9
"git pack-objects" learned another algorithm to compute the set of objects to send, that trades the resulting packfile off to save traversal cost to favor small pushes. * ds/push-sparse-tree-walk: pack-objects: create GIT_TEST_PACK_SPARSE pack-objects: create pack.useSparse setting revision: implement sparse algorithm list-objects: consume sparse tree walk revision: add mark_tree_uninteresting_sparse
2019-02-06Merge branch 'nd/the-index-final'Junio C Hamano1-1/+1
The assumption to work on the single "in-core index" instance has been reduced from the library-ish part of the codebase. * nd/the-index-final: cache.h: flip NO_THE_INDEX_COMPATIBILITY_MACROS switch read-cache.c: remove the_* from index_has_changes() merge-recursive.c: remove implicit dependency on the_repository merge-recursive.c: remove implicit dependency on the_index sha1-name.c: remove implicit dependency on the_index read-cache.c: replace update_index_if_able with repo_& read-cache.c: kill read_index() checkout: avoid the_index when possible repository.c: replace hold_locked_index() with repo_hold_locked_index() notes-utils.c: remove the_repository references grep: use grep_opt->repo instead of explict repo argument
2019-02-05Merge branch 'ph/pack-objects-mutex-fix'Junio C Hamano1-17/+10
"git pack-objects" incorrectly used uninitialized mutex, which has been corrected. * ph/pack-objects-mutex-fix: pack-objects: merge read_lock and lock in packing_data struct pack-objects: move read mutex to packing_data struct
2019-02-04various: tighten constness of some local variablesShahzad Lone1-7/+7
Signed-off-by: Shahzad Lone <shahzadlone@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-29Merge branch 'bc/tree-walk-oid'Junio C Hamano1-2/+2
The code to walk tree objects has been taught that we may be working with object names that are not computed with SHA-1. * bc/tree-walk-oid: cache: make oidcpy always copy GIT_MAX_RAWSZ bytes tree-walk: store object_id in a separate member match-trees: use hashcpy to splice trees match-trees: compute buffer offset correctly when splicing tree-walk: copy object ID before use
2019-01-28pack-objects: merge read_lock and lock in packing_data structPatrick Hogg1-14/+10
Rename the packing_data lock to obd_lock and upgrade it to a recursive mutex to make it suitable for current read_lock usages. Additionally remove the superfluous #ifndef NO_PTHREADS guard around mutex initialization in prepare_packing_data as the mutex functions themselves are already protected. Signed-off-by: Patrick Hogg <phogg@novamoon.net> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-28pack-objects: move read mutex to packing_data structPatrick Hogg1-5/+2
ac77d0c37 ("pack-objects: shrink size field in struct object_entry", 2018-04-14) added an extra usage of read_lock/read_unlock in the newly introduced oe_get_size_slow for thread safety in parallel calls to try_delta(). Unfortunately oe_get_size_slow is also used in serial code, some of which is called before the first invocation of ll_find_deltas. As such the read mutex is not guaranteed to be initialized. Resolve this by moving the read mutex to packing_data and initializing it in prepare_packing_data which is initialized in cmd_pack_objects. Signed-off-by: Patrick Hogg <phogg@novamoon.net> Reviewed-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-24cache.h: flip NO_THE_INDEX_COMPATIBILITY_MACROS switchNguyễn Thái Ngọc Duy1-1/+1
By default, index compat macros are off from now on, because they could hide the_index dependency. Only those in builtin can use it. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17pack-objects: create GIT_TEST_PACK_SPARSEDerrick Stolee1-0/+1
Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17pack-objects: create pack.useSparse settingDerrick Stolee1-0/+4
The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17list-objects: consume sparse tree walkDerrick Stolee1-1/+4
When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. This commit walk is not changing during this series. The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly different way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. Add a '--sparse' flag in 'git pack-objects' to call this new logic. Add a new test script t/t5322-pack-objects-sparse.sh that tests this option. The tests currently demonstrate that the resulting object list is the same as the old algorithm. This includes a case where both algorithms pack an object that is not needed by a remote due to limits on the explored set of trees. When the sparse algorithm is changed in a later commit, we will add a test that demonstrates a change of behavior in some cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-15tree-walk: store object_id in a separate memberbrian m. carlson1-2/+2
When parsing a tree, we read the object ID directly out of the tree buffer. This is normally fine, but such an object ID cannot be used with oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1, there may not be that many bytes to copy. Instead, store the object ID in a separate struct member. Since we can no longer efficiently compute the path length, store that information as well in struct name_entry. Ensure we only copy the object ID into the new buffer if the path length is nonzero, as some callers will pass us an empty path with no object ID following it, and we will not want to read past the end of the buffer. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-14Merge branch 'md/exclude-promisor-objects-fix-cleanup'Junio C Hamano1-2/+4
Code clean-up. * md/exclude-promisor-objects-fix-cleanup: revision.c: put promisor option in specialized struct
2019-01-08update comment references to sha1_object_info()Jeff King1-2/+2
Commit abef9020e3 (sha1_file: convert sha1_object_info* to object_id, 2018-03-12) renamed the function to oid_object_info(), but missed some comments which mention it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-04Merge branch 'nd/the-index'Junio C Hamano1-3/+3
More codepaths become aware of working with in-core repository instance other than the default "the_repository". * nd/the-index: (22 commits) rebase-interactive.c: remove the_repository references rerere.c: remove the_repository references pack-*.c: remove the_repository references pack-check.c: remove the_repository references notes-cache.c: remove the_repository references line-log.c: remove the_repository reference diff-lib.c: remove the_repository references delta-islands.c: remove the_repository references cache-tree.c: remove the_repository references bundle.c: remove the_repository references branch.c: remove the_repository reference bisect.c: remove the_repository reference blame.c: remove implicit dependency the_repository sequencer.c: remove implicit dependency on the_repository sequencer.c: remove implicit dependency on the_index transport.c: remove implicit dependency on the_index notes-merge.c: remove implicit dependency the_repository notes-merge.c: remove implicit dependency on the_index list-objects.c: reduce the_repository references list-objects-filter.c: remove implicit dependency on the_index ...
2018-12-06revision.c: put promisor option in specialized structMatthew DeVore1-2/+4
Put the allow_exclude_promisor_objects flag in setup_revision_opt. When it was in rev_info, it was unclear when it was used, since rev_info is passed to functions that don't use the flag. This resulted in unnecessary setting of the flag in prune.c, so fix that as well. Signed-off-by: Matthew DeVore <matvore@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-21Merge branch 'cc/delta-islands'Junio C Hamano1-1/+3
A few issues in the implementation of "delta-islands" feature has been corrected. * cc/delta-islands: pack-objects: fix off-by-one in delta-island tree-depth computation pack-objects: zero-initialize tree_depth/layer arrays pack-objects: fix tree_depth and layer invariants
2018-11-21pack-objects: fix off-by-one in delta-island tree-depth computationJeff King1-1/+3
When delta-islands are in use, we need to record the deepest path at which we find each tree and blob. Our loop to do so counts slashes, so "foo" is depth 0, "foo/bar" is depth 1, and so on. However, this neglects root trees, which are represented by the empty string. Those also have depth 0, but are at a layer above "foo". Thus, "foo" should be 1, "foo/bar" at 2, and so on. We use this depth to topo-sort the trees in resolve_tree_islands(). As a result, we may fail to visit a root tree before the sub-trees it contains, and therefore not correctly pass down the island marks. That in turn could lead to missing some delta opportunities (objects are in the same island, but we didn't realize it) or creating unwanted cross-island deltas (one object is in an island another isn't, but we don't realize). In practice, it seems to have only a small effect. Some experiments on the real-world git/git fork network at GitHub showed an improvement of only 0.14% in the resulting clone size. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-19Merge branch 'tb/print-size-t-with-uintmax-format'Junio C Hamano1-6/+6
Code preparation to replace ulong vars with size_t vars where appropriate. * tb/print-size-t-with-uintmax-format: Upcast size_t variables to uintmax_t when printing
2018-11-19Merge branch 'ds/push-squelch-ambig-warning'Junio C Hamano1-0/+6
"git push" used to check ambiguities between object-names and refnames while processing the list of refs' old and new values, which was unnecessary (as it knew that it is feeding raw object names). This has been optimized out. * ds/push-squelch-ambig-warning: pack-objects: ignore ambiguous object warnings
2018-11-18Merge branch 'jk/unused-parameter-fixes'Junio C Hamano1-1/+4
Various functions have been audited for "-Wunused-parameter" warnings and bugs in them got fixed. * jk/unused-parameter-fixes: midx: double-check large object write loop assert NOARG/NONEG behavior of parse-options callbacks parse-options: drop OPT_DATE() apply: return -1 from option callback instead of calling exit(1) cat-file: report an error on multiple --batch options tag: mark "--message" option with NONEG show-branch: mark --reflog option as NONEG format-patch: mark "--no-numbered" option with NONEG status: mark --find-renames option with NONEG cat-file: mark batch options with NONEG pack-objects: mark index-version option as NONEG ls-files: mark exclude options as NONEG am: handle --no-patch-format option apply: mark include/exclude options as NONEG
2018-11-18Merge branch 'nd/pthreads'Junio C Hamano1-24/+2
The codebase has been cleaned up to reduce "#ifndef NO_PTHREADS". * nd/pthreads: Clean up pthread_create() error handling read-cache.c: initialize copy_len to shut up gcc 8 read-cache.c: reduce branching based on HAVE_THREADS read-cache.c: remove #ifdef NO_PTHREADS pack-objects: remove #ifdef NO_PTHREADS preload-index.c: remove #ifdef NO_PTHREADS grep: clean up num_threads handling grep: remove #ifdef NO_PTHREADS attr.c: remove #ifdef NO_PTHREADS name-hash.c: remove #ifdef NO_PTHREADS index-pack: remove #ifdef NO_PTHREADS send-pack.c: move async's #ifdef NO_PTHREADS back to run-command.c run-command.h: include thread-utils.h instead of pthread.h thread-utils: macros to unconditionally compile pthreads API
2018-11-12Upcast size_t variables to uintmax_t when printingTorsten Bögershausen1-6/+6
When printing variables which contain a size, today "unsigned long" is used at many places. In order to be able to change the type from "unsigned long" into size_t some day in the future, we need to have a way to print 64 bit variables on a system that has "unsigned long" defined to be 32 bit, like Win64. Upcast all those variables into uintmax_t before they are printed. This is to prepare for a bigger change, when "unsigned long" will be converted into size_t for variables which may be > 4Gib. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12pack-*.c: remove the_repository referencesNguyễn Thái Ngọc Duy1-1/+1
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12delta-islands.c: remove the_repository referencesNguyễn Thái Ngọc Duy1-2/+2
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-07pack-objects: ignore ambiguous object warningsDerrick Stolee1-0/+6
A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! I observed a simple push spending three seconds checking these paths. The fix here is similar to 4c30d50 "rev-list: disable object/refname ambiguity check with --stdin". While the get_object_list() method reads the objects from stdin, turn warn_on_object_refname_ambiguity flag (which is usually true) to false. Just for code hygiene, save away the original at the beginning and restore it once we are done. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-06Merge branch 'md/exclude-promisor-objects-fix'Junio C Hamano1-0/+1
Operations on promisor objects make sense in the context of only a small subset of the commands that internally use the revisions machinery, but the "--exclude-promisor-objects" option were taken and led to nonsense results by commands like "log", to which it didn't make much sense. This has been corrected. * md/exclude-promisor-objects-fix: exclude-promisor-objects: declare when option is allowed Documentation/git-log.txt: do not show --exclude-promisor-objects
2018-11-06assert NOARG/NONEG behavior of parse-options callbacksJeff King1-0/+3
When we define a parse-options callback, the flags we put in the option struct must match what the callback expects. For example, a callback which does not handle the "unset" parameter should only be used with PARSE_OPT_NONEG. But since the callback and the option struct are not defined next to each other, it's easy to get this wrong (as earlier patches in this series show). Fortunately, the compiler can help us here: compiling with -Wunused-parameters can show us which callbacks ignore their "unset" parameters (and likewise, ones that ignore "arg" expect to be triggered with PARSE_OPT_NOARG). But after we've inspected a callback and determined that all of its callers use the right flags, what do we do next? We'd like to silence the compiler warning, but do so in a way that will catch any wrong calls in the future. We can do that by actually checking those variables and asserting that they match our expectations. Because this is such a common pattern, we'll introduce some helper macros. The resulting messages aren't as descriptive as we could make them, but the file/line information from BUG() is enough to identify the problem (and anyway, the point is that these should never be seen). Each of the annotated callbacks in this patch triggers -Wunused-parameters, and was manually inspected to make sure all callers use the correct options (so none of these BUGs should be triggerable). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-06pack-objects: mark index-version option as NONEGJeff King1-1/+1
Running "git pack-objects --no-index-version" will segfault, since the callback is not prepared to handle the "unset" flag. In theory this might be used to counteract an earlier "--index-version", or override a pack.indexversion config setting. But the semantics aren't immediately obvious, and it's unlikely anybody wants this. Let's just disable the broken option for now. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-05pack-objects: remove #ifdef NO_PTHREADSNguyễn Thái Ngọc Duy1-24/+2
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-30Merge branch 'js/pack-objects-mutex-init-fix'Junio C Hamano1-1/+0
A mutex used in "git pack-objects" were not correctly initialized and this caused "git repack" to dump core on Windows. * js/pack-objects-mutex-init-fix: pack-objects (mingw): initialize `packing_data` mutex in the correct spot pack-objects (mingw): demonstrate a segmentation fault with large deltas pack-objects: fix typo 'detla' -> 'delta'
2018-10-23exclude-promisor-objects: declare when option is allowedMatthew DeVore1-0/+1
The --exclude-promisor-objects option causes some funny behavior in at least two commands: log and blame. It causes a BUG crash: $ git log --exclude-promisor-objects BUG: revision.c:2143: exclude_promisor_objects can only be used when fetch_if_missing is 0 Aborted [134] Fix this such that the option is treated like any other unknown option. The commands that must support it are limited, so declare in those commands that the flag is supported. In particular: pack-objects prune rev-list The commands were found by searching for logic which parses --exclude-promisor-objects outside of revision.c. Extra logic outside of revision.c is needed because fetch_if_missing must be turned on before revision.c sees the option or it will BUG-crash. The above list is supported by the fact that no other command is introspectively invoked by another command passing --exclude-promisor-object. Signed-off-by: Matthew DeVore <matvore@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-19pack-objects (mingw): initialize `packing_data` mutex in the correct spotJohannes Schindelin1-1/+0
In 9ac3f0e5b3e4 (pack-objects: fix performance issues on packing large deltas, 2018-07-22), a mutex was introduced that is used to guard the call to set the delta size. This commit even added code to initialize it, but at an incorrect spot: in `init_threaded_search()`, while the call to `oe_set_delta_size()` (and hence to `packing_data_lock()`) can happen in the call chain `check_object()` <- `get_object_details()` <- `prepare_pack()` <- `cmd_pack_objects()`, which is long before the `prepare_pack()` function calls `ll_find_deltas()` (which initializes the threaded search). Another tell-tale that the mutex was initialized in an incorrect spot is that the function to initialize it lives in builtin/, while the code that uses the mutex is defined in a libgit.a header file. Let's use a more appropriate function: `prepare_packing_data()`, which not only lives in libgit.a, but *has* to be called before the `packing_data` struct is used that contains that mutex. This fixes https://github.com/git-for-windows/git/issues/1839. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-19Merge branch 'nd/the-index'Junio C Hamano1-1/+1
Various codepaths in the core-ish part learn to work on an arbitrary in-core index structure, not necessarily the default instance "the_index". * nd/the-index: (23 commits) revision.c: reduce implicit dependency the_repository revision.c: remove implicit dependency on the_index ws.c: remove implicit dependency on the_index tree-diff.c: remove implicit dependency on the_index submodule.c: remove implicit dependency on the_index line-range.c: remove implicit dependency on the_index userdiff.c: remove implicit dependency on the_index rerere.c: remove implicit dependency on the_index sha1-file.c: remove implicit dependency on the_index patch-ids.c: remove implicit dependency on the_index merge.c: remove implicit dependency on the_index merge-blobs.c: remove implicit dependency on the_index ll-merge.c: remove implicit dependency on the_index diff-lib.c: remove implicit dependency on the_index read-cache.c: remove implicit dependency on the_index diff.c: remove implicit dependency on the_index grep.c: remove implicit dependency on the_index diff.c: remove the_index dependency in textconv() functions blame.c: rename "repo" argument to "r" combine-diff.c: remove implicit dependency on the_index ...
2018-10-16Merge branch 'jk/delta-islands-with-bitmap-reuse-delta-fix'Junio C Hamano1-16/+52
Fix interactions between two recent topics. * jk/delta-islands-with-bitmap-reuse-delta-fix: pack-objects: handle island check for "external" delta base
2018-09-24Merge branch 'tb/void-check-attr'Junio C Hamano1-2/+1
Code clean-up. * tb/void-check-attr: Make git_check_attr() a void function
2018-09-21revision.c: remove implicit dependency on the_indexNguyễn Thái Ngọc Duy1-1/+1
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-19pack-objects: handle island check for "external" delta baseJeff King1-16/+52
Two recent topics, jk/pack-delta-reuse-with-bitmap and cc/delta-islands, can have a funny interaction. When checking if we can reuse an on-disk delta, the first topic allows base_entry to be NULL when we find an object that's not in the packing list. But the latter topic introduces a call to in_same_island(), which needs to look at base_entry->idx.oid. When these two features are used together, we might try to dereference a NULL base_entry. In practice, this doesn't really happen. We'd generally only use delta islands when packing to disk, since the whole point is to optimize the pack for serving fetches later. And the new delta-reuse code relies on having used reachability bitmaps to determine the set of objects, which we would typically only do when serving an actual fetch. However, it is technically possible to combine these features. And even without doing so, building with "SANITIZE=address,undefined" will cause t5310.46 to complain. Even though that test does not have delta islands enabled, we still take the address of the NULL entry to pass to in_same_island(). That function then promptly returns without dereferencing the value when it sees that islands are not enabled, but it's enough to trigger a sanitizer error. The solution is straight-forward: when both features are used together, we should pass the oid of the found base to in_same_island(). This is tricky to do inside a single "if" statement. And after the merge in f3504ea3dd (Merge branch 'cc/delta-islands', 2018-09-17), that "if" condition is already getting pretty unwieldy. So this patch moves the logic into a helper function, where we can easily use multiple return paths. The result is a bit longer, but the logic should be much easier to follow. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17Merge branch 'jk/cocci'Junio C Hamano1-2/+2
spatch transformation to replace boolean uses of !hashcmp() to newly introduced oideq() is added, and applied, to regain performance lost due to support of multiple hash algorithms. * jk/cocci: show_dirstat: simplify same-content check read-cache: use oideq() in ce_compare functions convert hashmap comparison functions to oideq() convert "hashcmp() != 0" to "!hasheq()" convert "oidcmp() != 0" to "!oideq()" convert "hashcmp() == 0" to hasheq() convert "oidcmp() == 0" to oideq() introduce hasheq() and oideq() coccinelle: use <...> for function exclusion
2018-09-17Merge branch 'cc/delta-islands'Junio C Hamano1-43/+94
Lift code from GitHub to restrict delta computation so that an object that exists in one fork is not made into a delta against another object that does not appear in the same forked repository. * cc/delta-islands: pack-objects: move 'layer' into 'struct packing_data' pack-objects: move tree_depth into 'struct packing_data' t5320: tests for delta islands repack: add delta-islands support pack-objects: add delta-islands support pack-objects: refactor code into compute_layer_order() Add delta-islands.{c,h}
2018-09-17Merge branch 'jk/pack-delta-reuse-with-bitmap'Junio C Hamano1-9/+19
When creating a thin pack, which allows objects to be made into a delta against another object that is not in the resulting pack but is known to be present on the receiving end, the code learned to take advantage of the reachability bitmap; this allows the server to send a delta against a base beyond the "boundary" commit. * jk/pack-delta-reuse-with-bitmap: pack-objects: reuse on-disk deltas for thin "have" objects pack-bitmap: save "have" bitmap from walk t/perf: add perf tests for fetches from a bitmapped server t/perf: add infrastructure for measuring sizes t/perf: factor out percent calculations t/perf: factor boilerplate out of test_perf
2018-09-17Merge branch 'ds/multi-pack-index'Junio C Hamano1-7/+35
When there are too many packfiles in a repository (which is not recommended), looking up an object in these would require consulting many pack .idx files; a new mechanism to have a single file that consolidates all of these .idx files is introduced. * ds/multi-pack-index: (32 commits) pack-objects: consider packs in multi-pack-index midx: test a few commands that use get_all_packs treewide: use get_all_packs packfile: add all_packs list midx: fix bug that skips midx with alternates midx: stop reporting garbage midx: mark bad packed objects multi-pack-index: store local property multi-pack-index: provide more helpful usage info midx: clear midx on repack packfile: skip loading index if in multi-pack-index midx: prevent duplicate packfile loads midx: use midx in approximate_object_count midx: use existing midx when writing new one midx: use midx in abbreviation calculations midx: read objects from multi-pack-index config: create core.multiPackIndex setting midx: write object offsets midx: write object id fanout chunk midx: write object ids in a chunk ...
2018-09-12Make git_check_attr() a void functionTorsten Bögershausen1-2/+1
git_check_attr() returns always 0. Remove all the error handling code of the callers, which is never executed. Change git_check_attr() to be a void function. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29convert "oidcmp() == 0" to oideq()Jeff King1-2/+2
Using the more restrictive oideq() should, in the long run, give the compiler more opportunities to optimize these callsites. For now, this conversion should be a complete noop with respect to the generated code. The result is also perhaps a little more readable, as it avoids the "zero is equal" idiom. Since it's so prevalent in C, I think seasoned programmers tend not to even notice it anymore, but it can sometimes make for awkward double negations (e.g., we can drop a few !!oidcmp() instances here). This patch was generated almost entirely by the included coccinelle patch. This mechanical conversion should be completely safe, because we check explicitly for cases where oidcmp() is compared to 0, which is what oideq() is doing under the hood. Note that we don't have to catch "!oidcmp()" separately; coccinelle's standard isomorphisms make sure the two are treated equivalently. I say "almost" because I did hand-edit the coccinelle output to fix up a few style violations (it mostly keeps the original formatting, but sometimes unwraps long lines). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-22Merge branch 'nd/pack-deltify-regression-fix'Junio C Hamano1-4/+1
In a recent update in 2.18 era, "git pack-objects" started producing a larger than necessary packfiles by missing opportunities to use large deltas. * nd/pack-deltify-regression-fix: pack-objects: fix performance issues on packing large deltas
2018-08-21pack-objects: reuse on-disk deltas for thin "have" objectsJeff King1-9/+19
When we serve a fetch, we pass the "wants" and "haves" from the fetch negotiation to pack-objects. That tells us not only which objects we need to send, but we also use the boundary commits as "preferred bases": their trees and blobs are candidates for delta bases, both for reusing on-disk deltas and for finding new ones. However, this misses some opportunities. Modulo some special cases like shallow or partial clones, we know that every object reachable from the "haves" could be a preferred base. We don't use all of them for two reasons: 1. It's expensive to traverse the whole history and enumerate all of the objects the other side has. 2. The delta search is expensive, so we want to keep the number of candidate bases sane. The boundary commits are the most likely to work. When we have reachability bitmaps, though, reason 1 no longer applies. We can efficiently compute the set of reachable objects on the other side (and in fact already did so as part of the bitmap set-difference to get the list of interesting objects). And using this set conveniently covers the shallow and partial cases, since we have to disable the use of bitmaps for those anyway. The second reason argues against using these bases in the search for new deltas. But there's one case where we can use this information for free: when we have an existing on-disk delta that we're considering reusing, we can do so if we know the other side has the base object. This in fact saves time during the delta search, because it's one less delta we have to compute. And that's exactly what this patch does: when we're considering whether to reuse an on-disk delta, if bitmaps tell us the other side has the object (and we're making a thin-pack), then we reuse it. Here are the results on p5311 using linux.git, which simulates a client fetching after `N` days since their last fetch: Test origin HEAD -------------------------------------------------------------------------- 5311.3: server (1 days) 0.27(0.27+0.04) 0.12(0.09+0.03) -55.6% 5311.4: size (1 days) 0.9M 237.0K -73.7% 5311.5: client (1 days) 0.04(0.05+0.00) 0.10(0.10+0.00) +150.0% 5311.7: server (2 days) 0.34(0.42+0.04) 0.13(0.10+0.03) -61.8% 5311.8: size (2 days) 1.5M 347.7K -76.5% 5311.9: client (2 days) 0.07(0.08+0.00) 0.16(0.15+0.01) +128.6% 5311.11: server (4 days) 0.56(0.77+0.08) 0.13(0.10+0.02) -76.8% 5311.12: size (4 days) 2.8M 566.6K -79.8% 5311.13: client (4 days) 0.13(0.15+0.00) 0.34(0.31+0.02) +161.5% 5311.15: server (8 days) 0.97(1.39+0.11) 0.30(0.25+0.05) -69.1% 5311.16: size (8 days) 4.3M 1.0M -76.0% 5311.17: client (8 days) 0.20(0.22+0.01) 0.53(0.52+0.01) +165.0% 5311.19: server (16 days) 1.52(2.51+0.12) 0.30(0.26+0.03) -80.3% 5311.20: size (16 days) 8.0M 2.0M -74.5% 5311.21: client (16 days) 0.40(0.47+0.03) 1.01(0.98+0.04) +152.5% 5311.23: server (32 days) 2.40(4.44+0.20) 0.31(0.26+0.04) -87.1% 5311.24: size (32 days) 14.1M 4.1M -70.9% 5311.25: client (32 days) 0.70(0.90+0.03) 1.81(1.75+0.06) +158.6% 5311.27: server (64 days) 11.76(26.57+0.29) 0.55(0.50+0.08) -95.3% 5311.28: size (64 days) 89.4M 47.4M -47.0% 5311.29: client (64 days) 5.71(9.31+0.27) 15.20(15.20+0.32) +166.2% 5311.31: server (128 days) 16.15(36.87+0.40) 0.91(0.82+0.14) -94.4% 5311.32: size (128 days) 134.8M 100.4M -25.5% 5311.33: client (128 days) 9.42(16.86+0.49) 25.34(25.80+0.46) +169.0% In all cases we save CPU time on the server (sometimes significant) and the resulting pack is smaller. We do spend more CPU time on the client side, because it has to reconstruct more deltas. But that's the right tradeoff to make, since clients tend to outnumber servers. It just means the thin pack mechanism is doing its job. From the user's perspective, the end-to-end time of the operation will generally be faster. E.g., in the 128-day case, we saved 15s on the server at a cost of 16s on the client. Since the resulting pack is 34MB smaller, this is a net win if the network speed is less than 270Mbit/s. And that's actually the worst case. The 64-day case saves just over 11s at a cost of just under 11s. So it's a slight win at any network speed, and the 40MB saved is pure bonus. That trend continues for the smaller fetches. The implementation itself is mostly straightforward, with the new logic going into check_object(). But there are two tricky bits. The first is that check_object() needs access to the relevant information (the thin flag and bitmap result). We can do this by pushing these into program-lifetime globals. The second is that the rest of the code assumes that any reused delta will point to another "struct object_entry" as its base. But of course the case we are interested in here is the one where don't have such an entry! I looked at a number of options that didn't quite work: - we could use a flag to signal a reused delta, but it's not a single bit. We have to actually store the oid of the base, which is normally done by pointing to the existing object_entry. And we'd have to modify all the code which looks at deltas. - we could add the reused bases to the end of the existing object_entry array. While this does create some extra work as later stages consider the extra entries, it's actually not too bad (we're not sending them, so they don't cost much in the delta search, and at most we'd have 2*N of them). But there's a more subtle problem. Adding to the existing array means we might need to grow it with realloc, which could move the earlier entries around. While many of the references to other entries are done by integer index, some (including ones on the stack) use pointers, which would become invalidated. This isn't insurmountable, but it would require quite a bit of refactoring (and it's hard to know that you've got it all, since it may work _most_ of the time and then fail subtly based on memory allocation patterns). - we could allocate a new one-off entry for the base. In fact, this is what an earlier version of this patch did. However, since the refactoring brought in by ad635e82d6 (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23), the delta_idx code requires that both entries be in the main packing list. So taking all of those options into account, what I ended up with is a separate list of "external bases" that are not part of the main packing list. Each delta entry that points to an external base has a single-bit flag to do so; we have a little breathing room in the bitfield section of object_entry. This lets us limit the change primarily to the oe_delta() and oe_set_delta_ext() functions. And as a bonus, most of the rest of the code does not consider these dummy entries at all, saving both runtime CPU and code complexity. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20pack-objects: consider packs in multi-pack-indexDerrick Stolee1-0/+28
When running 'git pack-objects --local', we want to avoid packing objects that are in an alternate. Currently, we check for these objects using the packed_git_mru list, which excludes the pack-files covered by a multi-pack-index. Add a new iteration over the multi-pack-indexes to find these copies and mark them as unwanted. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20treewide: use get_all_packsDerrick Stolee1-7/+7
There are many places in the codebase that want to iterate over all packfiles known to Git. The purposes are wide-ranging, and those that can take advantage of the multi-pack-index already do. So, use get_all_packs() instead of get_packed_git() to be sure we are iterating over all packfiles. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20Merge branch 'nd/no-the-index'Junio C Hamano1-1/+1
The more library-ish parts of the codebase learned to work on the in-core index-state instance that is passed in by their callers, instead of always working on the singleton "the_index" instance. * nd/no-the-index: (24 commits) blame.c: remove implicit dependency on the_index apply.c: remove implicit dependency on the_index apply.c: make init_apply_state() take a struct repository apply.c: pass struct apply_state to more functions resolve-undo.c: use the right index instead of the_index archive-*.c: use the right repository archive.c: avoid access to the_index grep: use the right index instead of the_index attr: remove index from git_attr_set_direction() entry.c: use the right index instead of the_index submodule.c: use the right index instead of the_index pathspec.c: use the right index instead of the_index unpack-trees: avoid the_index in verify_absent() unpack-trees: convert clear_ce_flags* to avoid the_index unpack-trees: don't shadow global var the_index unpack-trees: add a note about path invalidation unpack-trees: remove 'extern' on function declaration ls-files: correct index argument to get_convert_attr_ascii() preload-index.c: use the right index instead of the_index dir.c: remove an implicit dependency on the_index in pathspec code ...
2018-08-17Merge branch 'rs/parse-opt-lithelp'Junio C Hamano1-1/+1
The parse-options machinery learned to refrain from enclosing placeholder string inside a "<bra" and "ket>" pair automatically without PARSE_OPT_LITERAL_ARGHELP. Existing help text for option arguments that are not formatted correctly have been identified and fixed. * rs/parse-opt-lithelp: parse-options: automatically infer PARSE_OPT_LITERAL_ARGHELP shortlog: correct option help for -w send-pack: specify --force-with-lease argument help explicitly pack-objects: specify --index-version argument help explicitly difftool: remove angular brackets from argument help add, update-index: fix --chmod argument help push: use PARSE_OPT_LITERAL_ARGHELP instead of unbalanced brackets
2018-08-16pack-objects: move 'layer' into 'struct packing_data'Christian Couder1-2/+2
This reduces the size of 'struct object_entry' from 88 bytes to 80 and therefore makes packing objects more efficient. For example on a Linux repo with 12M objects, `git pack-objects --all` needs extra 96MB memory even if the layer feature is not used. Helped-by: Jeff King <peff@peff.net> Helped-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-16pack-objects: move tree_depth into 'struct packing_data'Christian Couder1-2/+2
This reduces the size of 'struct object_entry' and therefore makes packing objects more efficient. This also renames cmp_tree_depth() into tree_depth_compare(), as it is more modern to have the name of the compare functions end with "compare". Helped-by: Jeff King <peff@peff.net> Helped-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-16pack-objects: add delta-islands supportJeff King1-8/+49
Implement support for delta islands in git pack-objects and document how delta islands work in "Documentation/git-pack-objects.txt" and Documentation/config.txt. This allows users to setup delta islands in their config and get the benefit of less disk usage while cloning and fetching is still quite fast and not much more CPU intensive. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-16pack-objects: refactor code into compute_layer_order()Christian Couder1-40/+50
In a following commit, as we will use delta islands, we will have to compute the write order for different layers, not just for one. Let's prepare for that by refactoring the code that will be used to compute the write order for a given layer into a new compute_layer_order() function. This will make it easier to see and understand what the following changes are doing. Helped-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-15Merge branch 'nd/pack-objects-threading-doc'Junio C Hamano1-0/+19
Doc fix. * nd/pack-objects-threading-doc: pack-objects: document about thread synchronization
2018-08-15Merge branch 'nd/i18n'Junio C Hamano1-54/+61
Many more strings are prepared for l10n. * nd/i18n: (23 commits) transport-helper.c: mark more strings for translation transport.c: mark more strings for translation sha1-file.c: mark more strings for translation sequencer.c: mark more strings for translation replace-object.c: mark more strings for translation refspec.c: mark more strings for translation refs.c: mark more strings for translation pkt-line.c: mark more strings for translation object.c: mark more strings for translation exec-cmd.c: mark more strings for translation environment.c: mark more strings for translation dir.c: mark more strings for translation convert.c: mark more strings for translation connect.c: mark more strings for translation config.c: mark more strings for translation commit-graph.c: mark more strings for translation builtin/replace.c: mark more strings for translation builtin/pack-objects.c: mark more strings for translation builtin/grep.c: mark strings for translation builtin/config.c: mark more strings for translation ...
2018-08-15Merge branch 'jk/core-use-replace-refs'Junio C Hamano1-1/+1
A new configuration variable core.usereplacerefs has been added, primarily to help server installations that want to ignore the replace mechanism altogether. * jk/core-use-replace-refs: add core.usereplacerefs config option check_replace_refs: rename to read_replace_refs check_replace_refs: fix outdated comment
2018-08-13attr: remove an implicit dependency on the_indexNguyễn Thái Ngọc Duy1-1/+1
Make the attr API take an index_state instead of assuming the_index in attr code. All call sites are converted blindly to keep the patch simple and retain current behavior. Individual call sites may receive further updates to use the right index instead of the_index. There is one ugly temporary workaround added in attr.c that needs some more explanation. Commit c24f3abace (apply: file commited with CRLF should roundtrip diff and apply - 2017-08-19) forces one convert_to_git() call to NOT read the index at all. But what do you know, we read it anyway by falling back to the_index. When "istate" from convert_to_git is now propagated down to read_attr_from_array() we will hit segfault somewhere inside read_blob_data_from_index. The right way of dealing with this is to kill "use_index" variable and only follow "istate" but at this stage we are not ready for that: while most git_attr_set_direction() calls just passes the_index to be assigned to use_index, unpack-trees passes a different one which is used by entry.c code, which has no way to know what index to use if we delete use_index. So this has to be done later. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-03parse-options: automatically infer PARSE_OPT_LITERAL_ARGHELPRené Scharfe1-1/+1
Parseopt wraps argument help strings in a pair of angular brackets by default, to tell users that they need to replace it with an actual value. This is useful in most cases, because most option arguments are indeed single values of a certain type. The option PARSE_OPT_LITERAL_ARGHELP needs to be used in option definitions with arguments that have multiple parts or are literal strings. Stop adding these angular brackets if special characters are present, as they indicate that we don't deal with a simple placeholder. This simplifies the code a bit and makes defining special options slightly easier. Remove the flag PARSE_OPT_LITERAL_ARGHELP in the cases where the new and more cautious handling suffices. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-03pack-objects: specify --index-version argument help explicitlyRené Scharfe1-2/+2
Wrap both placeholders in the argument help string in angular brackets to signal that users needs replace them with some actual value. Use the flag PARSE_OPT_LITERAL_ARGHELP to prevent parseopt from adding another pair. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-02Merge branch 'sb/object-store-lookup'Junio C Hamano1-1/+1
lookup_commit_reference() and friends have been updated to find in-core object for a specific in-core repository instance. * sb/object-store-lookup: (32 commits) commit.c: allow lookup_commit_reference to handle arbitrary repositories commit.c: allow lookup_commit_reference_gently to handle arbitrary repositories tag.c: allow deref_tag to handle arbitrary repositories object.c: allow parse_object to handle arbitrary repositories object.c: allow parse_object_buffer to handle arbitrary repositories commit.c: allow get_cached_commit_buffer to handle arbitrary repositories commit.c: allow set_commit_buffer to handle arbitrary repositories commit.c: migrate the commit buffer to the parsed object store commit-slabs: remove realloc counter outside of slab struct commit.c: allow parse_commit_buffer to handle arbitrary repositories tag: allow parse_tag_buffer to handle arbitrary repositories tag: allow lookup_tag to handle arbitrary repositories commit: allow lookup_commit to handle arbitrary repositories tree: allow lookup_tree to handle arbitrary repositories blob: allow lookup_blob to handle arbitrary repositories object: allow lookup_object to handle arbitrary repositories object: allow object_as_type to handle arbitrary repositories tag: add repository argument to deref_tag tag: add repository argument to parse_tag_buffer tag: add repository argument to lookup_tag ...
2018-07-30pack-objects: document about thread synchronizationNguyễn Thái Ngọc Duy1-0/+19
These extra comments should be make it easier to understand how to use locks in pack-objects delta search code. For reference, see 8ecce684a3 (basic threaded delta search - 2007-09-06) 384b32c09b (pack-objects: fix threaded load balancing - 2007-12-08) 50f22ada52 (threaded pack-objects: Use condition... - 2007-12-16) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-23builtin/pack-objects.c: mark more strings for translationNguyễn Thái Ngọc Duy1-50/+52
Most of these are straight forward. GETTEXT_POISON does catch the last string in cmd_pack_objects(), but since this is --progress output, it's not supposed to be machine-readable. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-23Update messages in preparation for i18nNguyễn Thái Ngọc Duy1-10/+15
Many messages will be marked for translation in the following commits. This commit updates some of them to be more consistent and reduce diff noise in those commits. Changes are - keep the first letter of die(), error() and warning() in lowercase - no full stop in die(), error() or warning() if it's single sentence messages - indentation - some messages are turned to BUG(), or prefixed with "BUG:" and will not be marked for i18n - some messages are improved to give more information - some messages are broken down by sentence to be i18n friendly (on the same token, combine multiple warning() into one big string) - the trailing \n is converted to printf_ln if possible, or deleted if not redundant - errno_errno() is used instead of explicit strerror() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-23pack-objects: fix performance issues on packing large deltasNguyễn Thái Ngọc Duy1-4/+1
Let's start with some background about oe_delta_size() and oe_set_delta_size(). If you already know, skip the next paragraph. These two are added in 0aca34e826 (pack-objects: shrink delta_size field in struct object_entry - 2018-04-14) to help reduce 'struct object_entry' size. The delta size field in this struct is reduced to only contain max 1MB. So if any new delta is produced and larger than 1MB, it's dropped because we can't really save such a large size anywhere. Fallback is provided in case existing packfiles already have large deltas, then we can retrieve it from the pack. While this should help small machines repacking large repos without large deltas (i.e. less memory pressure), dropping large deltas during the delta selection process could end up with worse pack files. And if existing packfiles already have >1MB delta and pack-objects is instructed to not reuse deltas, all of them will be dropped on the floor, and the resulting pack would be definitely bigger. There is also a regression in terms of CPU/IO if we have large on-disk deltas because fallback code needs to parse the pack every time the delta size is needed and just access to the mmap'd pack data is enough for extra page faults when memory is under pressure. Both of these issues were reported on the mailing list. Here's some numbers for comparison. Version Pack (MB) MaxRSS(kB) Time (s) ------- --------- ---------- -------- 2.17.0 5498 43513628 2494.85 2.18.0 10531 40449596 4168.94 This patch provides a better fallback that is - cheaper in terms of cpu and io because we won't have to read existing pack files as much - better in terms of pack size because the pack heuristics is back to 2.17.0 time, we do not drop large deltas at all If we encounter any delta (on-disk or created during try_delta phase) that is larger than the 1MB limit, we stop using delta_size_ field for this because it can't contain such size anyway. A new array of delta size is dynamically allocated and can hold all the deltas that 2.17.0 can. This array only contains delta sizes that delta_size_ can't contain. With this, we do not have to drop deltas in try_delta() anymore. Of course the downside is we use slightly more memory, even compared to 2.17.0. But since this is considered an uncommon case, a bit more memory consumption should not be a problem. Delta size limit is also raised from 1MB to 16MB to better cover common case and avoid that extra memory consumption (99.999% deltas in this reported repo are under 12MB; Jeff noted binary artifacts topped out at about 3MB in some other private repos). Other fields are shuffled around to keep this struct packed tight. We don't use more memory in common case even with this limit update. A note about thread synchronization. Since this code can be run in parallel during delta searching phase, we need a mutex. The realloc part in packlist_alloc() is not protected because it only happens during the object counting phase, which is always single-threaded. Access to e->delta_size_ (and by extension pack->delta_size[e - pack->objects]) is unprotected as before, the thread scheduler in pack-objects must make sure "e" is never updated by two different threads. The area under the new lock is as small as possible, avoiding locking at all in common case, since lock contention with high thread count could be expensive (most blobs are small enough that delta compute time is short and we end up taking the lock very often). The previous attempt to always hold a lock in oe_delta_size() and oe_set_delta_size() increases execution time by 33% when repacking linux.git with with 40 threads. Reported-by: Elijah Newren <newren@gmail.com> Helped-by: Elijah Newren <newren@gmail.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-18check_replace_refs: rename to read_replace_refsJeff King1-1/+1
This was added as a NEEDSWORK in c3c36d7de2 (replace-object: check_replace_refs is safe in multi repo environment, 2018-04-11), waiting for a calmer period. Since doing so now doesn't conflict with anything in 'pu', it seems as good a time as any. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-18Merge branch 'jt/remove-pack-bitmap-global'Junio C Hamano1-2/+5
The effort to move globals to per-repository in-core structure continues. * jt/remove-pack-bitmap-global: pack-bitmap: add free function pack-bitmap: remove bitmap_git global variable
2018-07-18Merge branch 'sb/object-store-grafts'Junio C Hamano1-3/+3
The conversion to pass "the_repository" and then "a_repository" throughout the object access API continues. * sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-06-29tag: add repository argument to lookup_tagStefan Beller1-1/+1
Add a repository argument to allow the callers of lookup_tag to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29Merge branch 'sb/object-store-grafts' into sb/object-store-lookupJunio C Hamano1-3/+3
* sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-06-21pack-bitmap: add free functionJonathan Tan1-0/+1
Add a function to free struct bitmap_index instances, and use it where needed (except when rebuild_existing_bitmaps() is used, since it creates references to the bitmaps within the struct bitmap_index passed to it). Note that the hashes field in struct bitmap_index is not freed because it points to another field within the same struct. The documentation for that field has been updated to clarify that. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-21pack-bitmap: remove bitmap_git global variableJonathan Tan1-2/+4
Remove the bitmap_git global variable. Instead, generate on demand an instance of struct bitmap_index for code that needs to access it. This allows us significant control over the lifetime of instances of struct bitmap_index. In particular, packs can now be closed without worrying if an unnecessarily long-lived "pack" field in struct bitmap_index still points to it. The bitmap API is also clearer in that we need to first obtain a struct bitmap_index, then we use it. This patch raises two potential issues: (1) memory for the struct bitmap_index is allocated without being freed, and (2) prepare_bitmap_git() and prepare_bitmap_walk() can reuse a previously loaded bitmap. For (1), this will be dealt with in a subsequent patch in this patch set that also deals with freeing the contents of the struct bitmap_index (which were not freed previously, because they have global scope). For (2), current bitmap users only load the bitmap once at most (note that pack-objects can use bitmaps or write bitmaps, but not both at the same time), so support for reuse has no effect - and future users can pass around the struct bitmap_index * obtained if they need to do 2 or more things with the same bitmap. Helped-by: Stefan Beller <sbeller@google.com> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-01Merge branch 'nd/use-opt-int-set-f'Junio C Hamano1-12/+12
Code simplification. * nd/use-opt-int-set-f: Use OPT_SET_INT_F() for cmdline option specification
2018-05-30Merge branch 'bc/object-id'Junio C Hamano1-15/+17
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (42 commits) merge-one-file: compute empty blob object ID add--interactive: compute the empty tree value Update shell scripts to compute empty tree object ID sha1_file: only expose empty object constants through git_hash_algo dir: use the_hash_algo for empty blob object ID sequencer: use the_hash_algo for empty tree object ID cache-tree: use is_empty_tree_oid sha1_file: convert cached object code to struct object_id builtin/reset: convert use of EMPTY_TREE_SHA1_BIN builtin/receive-pack: convert one use of EMPTY_TREE_SHA1_HEX wt-status: convert two uses of EMPTY_TREE_SHA1_HEX submodule: convert several uses of EMPTY_TREE_SHA1_HEX sequencer: convert one use of EMPTY_TREE_SHA1_HEX merge: convert empty tree constant to the_hash_algo builtin/merge: switch tree functions to use object_id builtin/am: convert uses of EMPTY_TREE_SHA1_BIN to the_hash_algo sha1-file: add functions for hex empty tree and blob OIDs builtin/receive-pack: avoid hard-coded constants for push certs diff: specify abbreviation size in terms of the_hash_algo upload-pack: replace use of several hard-coded constants ...
2018-05-30Merge branch 'js/use-bug-macro'Junio C Hamano1-2/+2
Developer support update, by using BUG() macro instead of die() to mark codepaths that should not happen more clearly. * js/use-bug-macro: BUG_exit_code: fix sparse "symbol not declared" warning Convert remaining die*(BUG) messages Replace all die("BUG: ...") calls by BUG() ones run-command: use BUG() to report bugs, not die() test-tool: help verifying BUG() code paths
2018-05-24Use OPT_SET_INT_F() for cmdline option specificationNguyễn Thái Ngọc Duy1-12/+12
The only thing these commands need is extra parseopt flag which can be passed in by OPT_SET_INT_F() and it is a bit more compact than full struct initialization. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-23Merge branch 'nd/pack-unreachable-objects-doc'Junio C Hamano1-0/+2
Doc update. * nd/pack-unreachable-objects-doc: pack-objects: validation and documentation about unreachable options
2018-05-23Merge branch 'nd/pack-objects-pack-struct'Junio C Hamano1-128/+241
"git pack-objects" needs to allocate tons of "struct object_entry" while doing its work, and shrinking its size helps the performance quite a bit. * nd/pack-objects-pack-struct: ci: exercise the whole test suite with uncommon code in pack-objects pack-objects: reorder members to shrink struct object_entry pack-objects: shrink delta_size field in struct object_entry pack-objects: shrink size field in struct object_entry pack-objects: clarify the use of object_entry::size pack-objects: don't check size when the object is bad pack-objects: shrink z_delta_size field in struct object_entry pack-objects: refer to delta objects by index instead of pointer pack-objects: move in_pack out of struct object_entry pack-objects: move in_pack_pos out of struct object_entry pack-objects: use bitfield for object_entry::depth pack-objects: use bitfield for object_entry::dfs_state pack-objects: turn type and in_pack_type to bitfields pack-objects: a bit of document about struct object_entry read-cache.c: make $GIT_TEST_SPLIT_INDEX boolean
2018-05-23Merge branch 'sb/oid-object-info'Junio C Hamano1-4/+7
The codepath around object-info API has been taught to take the repository object (which in turn tells the API which object store the objects are to be located). * sb/oid-object-info: cache.h: allow oid_object_info to handle arbitrary repositories packfile: add repository argument to cache_or_unpack_entry packfile: add repository argument to unpack_entry packfile: add repository argument to read_object packfile: add repository argument to packed_object_info packfile: add repository argument to packed_to_object_type packfile: add repository argument to retry_bad_packed_offset cache.h: add repository argument to oid_object_info cache.h: add repository argument to oid_object_info_extended
2018-05-23Merge branch 'nd/repack-keep-pack'Junio C Hamano1-20/+63
"git gc" in a large repository takes a lot of time as it considers to repack all objects into one pack by default. The command has been taught to pretend as if the largest existing packfile is marked with ".keep" so that it is left untouched while objects in other packs and loose ones are repacked. * nd/repack-keep-pack: pack-objects: show some progress when counting kept objects gc --auto: exclude base pack if not enough mem to "repack -ad" gc: handle a corner case in gc.bigPackThreshold gc: add gc.bigPackThreshold config gc: add --keep-largest-pack option repack: add --keep-pack option t7700: have closing quote of a test at the beginning of line
2018-05-18shallow: add repository argument to is_repository_shallowStefan Beller1-2/+2
Add a repository argument to allow callers of is_repository_shallow to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-18shallow: add repository argument to register_shallowStefan Beller1-1/+1
Add a repository argument to allow callers of register_shallow to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-08Merge branch 'ds/commit-graph'Junio C Hamano1-3/+3
Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. * ds/commit-graph: commit-graph: implement "--append" option commit-graph: build graph from starting commits commit-graph: read only from specific pack-indexes commit: integrate commit graph with commit parsing commit-graph: close under reachability commit-graph: add core.commitGraph setting commit-graph: implement git commit-graph read commit-graph: implement git-commit-graph write commit-graph: implement write_commit_graph() commit-graph: create git-commit-graph builtin graph: add commit graph design document commit-graph: add format document csum-file: refactor finalize_hashfile() method csum-file: rename hashclose() to finalize_hashfile()
2018-05-06Replace all die("BUG: ...") calls by BUG() onesJohannes Schindelin1-2/+2
In d8193743e08 (usage.c: add BUG() function, 2017-05-12), a new macro was introduced to use for reporting bugs instead of die(). It was then subsequently used to convert one single caller in 588a538ae55 (setup_git_env: convert die("BUG") to BUG(), 2017-05-12). The cover letter of the patch series containing this patch (cf 20170513032414.mfrwabt4hovujde2@sigill.intra.peff.net) is not terribly clear why only one call site was converted, or what the plan is for other, similar calls to die() to report bugs. Let's just convert all remaining ones in one fell swoop. This trick was performed by this invocation: sed -i 's/die("BUG: /BUG("/g' $(git grep -l 'die("BUG' \*.c) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-06pack-objects: validation and documentation about unreachable optionsNguyễn Thái Ngọc Duy1-0/+2
These options are added in [1] [2] [3]. All these depend on running rev-list internally which is normally true since they are always used with "--all --objects" which implies --revs. But let's keep this dependency explicit. While at there, add documentation for them. These are mostly used internally by git-repack. But it's still good to not chase down the right commit message to know how they work. [1] ca11b212eb (let pack-objects do the writing of unreachable objects as loose objects - 2008-05-14) [2] 08cdfb1337 (pack-objects --keep-unreachable - 2007-09-16) [3] e26a8c4721 (repack: extend --keep-unreachable to loose objects - 2016-06-13) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02pack-objects: abstract away hash algorithmbrian m. carlson1-14/+16
Instead of using hard-coded instances of the constant 20, use the_hash_algo to look up the correct constant. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02sha1-file: convert freshen functions to object_idbrian m. carlson1-1/+1
Convert the various functions for freshening objects and has_loose_object_nonlocal to use struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-26packfile: add repository argument to packed_object_infoJonathan Nieder1-1/+2
Add a repository argument to allow callers of packed_object_info to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-26cache.h: add repository argument to oid_object_infoStefan Beller1-3/+5
Add a repository argument to allow the callers of oid_object_info to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Stefan Beller <sbeller@google.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: show some progress when counting kept objectsNguyễn Thái Ngọc Duy1-6/+12
We only show progress when there are new objects to be packed. But when --keep-pack is specified on the base pack, we will exclude most of objects. This makes 'pack-objects' stay silent for a long time while the counting phase is going. Let's show some progress whenever we visit an object instead. The old "Counting objects" is renamed to "Enumerating objects" and a new progress "Counting objects" line is added. This new "Counting objects" line should progress pretty quick when the system is beefy. But when the system is under pressure, the reading object header done in this phase could be slow and showing progress is an improvement over staying silent in the current code. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16gc --auto: exclude base pack if not enough mem to "repack -ad"Nguyễn Thái Ngọc Duy1-1/+1
pack-objects could be a big memory hog especially on large repos, everybody knows that. The suggestion to stick a .keep file on the giant base pack to avoid this problem is also known for a long time. Recent patches add an option to do just this, but it has to be either configured or activated manually. This patch lets `git gc --auto` activate this mode automatically when it thinks `repack -ad` will use a lot of memory and start affecting the system due to swapping or flushing OS cache. gc --auto decides to do this based on an estimation of pack-objects memory usage, which is quite accurate at least for the heap part, and whether that fits in half of system memory (the assumption here is for desktop environment where there are many other applications running). This mechanism only kicks in if gc.bigBasePackThreshold is not configured. If it is, it is assumed that the user already knows what they want. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16repack: add --keep-pack optionNguyễn Thái Ngọc Duy1-13/+50
We allow to keep existing packs by having companion .keep files. This is helpful when a pack is permanently kept. In the next patch, git-gc just wants to keep a pack temporarily, for one pack-objects run. git-gc can use --keep-pack for this use case. A note about why the pack_keep field cannot be reused and pack_keep_in_core has to be added. This is about the case when --keep-pack is specified together with either --keep-unreachable or --unpack-unreachable, but --honor-pack-keep is NOT specified. In this case, we want to exclude objects from the packs specified on command line, not from ones with .keep files. If only one bit flag is used, we have to clear pack_keep on pack files with the .keep file. But we can't make any assumption about unreachable objects in .keep packs. If "pack_keep" field is false for .keep packs, we could potentially pull lots of unreachable objects into the new pack, or unpack them loose. The safer approach is ignore all packs with either .keep file or --keep-pack. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: shrink delta_size field in struct object_entryNguyễn Thái Ngọc Duy1-10/+16
Allowing a delta size of 64 bits is crazy. Shrink this field down to 20 bits with one overflow bit. If we find an existing delta larger than 1MB, we do not cache delta_size at all and will get the value from oe_size(), potentially from disk if it's larger than 4GB. Note, since DELTA_SIZE() is used in try_delta() code, it must be thread-safe. Luckily oe_size() does guarantee this so we it is thread-safe. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: shrink size field in struct object_entryNguyễn Thái Ngọc Duy1-25/+80
It's very very rare that an uncompressed object is larger than 4GB (partly because Git does not handle those large files very well to begin with). Let's optimize it for the common case where object size is smaller than this limit. Shrink size field down to 31 bits and one overflow bit. If the size is too large, we read it back from disk. As noted in the previous patch, we need to return the delta size instead of canonical size when the to-be-reused object entry type is a delta instead of a canonical one. Add two compare helpers that can take advantage of the overflow bit (e.g. if the file is 4GB+, chances are it's already larger than core.bigFileThreshold and there's no point in comparing the actual value). Another note about oe_get_size_slow(). This function MUST be thread safe because SIZE() macro is used inside try_delta() which may run in parallel. Outside parallel code, no-contention locking should be dirt cheap (or insignificant compared to i/o access anyway). To exercise this code, it's best to run the test suite with something like make test GIT_TEST_OE_SIZE=4 which forces this code on all objects larger than 3 bytes. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: clarify the use of object_entry::sizeNguyễn Thái Ngọc Duy1-3/+8
While this field most of the time contains the canonical object size, there is one case it does not: when we have found that the base object of the delta in question is also to be packed, we will very happily reuse the delta by copying it over instead of regenerating the new delta. "size" in this case will record the delta size, not canonical object size. Later on in write_reuse_object(), we reconstruct the delta header and "size" is used for this purpose. When this happens, the "type" field contains a delta type instead of a canonical type. Highlight this in the code since it could be tricky to see. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: don't check size when the object is badNguyễn Thái Ngọc Duy1-2/+2
sha1_object_info() in check_objects() may fail to locate an object in the pack and return type OBJ_BAD. In that case, it will likely leave the "size" field untouched. We delay error handling until later in prepare_pack() though. Until then, do not touch "size" field. This field should contain the default value zero, but we can't say sha1_object_info() cannot damage it. This becomes more important later when the object size may have to be retrieved back from the (non-existing) pack. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: shrink z_delta_size field in struct object_entryNguyễn Thái Ngọc Duy1-6/+18
We only cache deltas when it's smaller than a certain limit. This limit defaults to 1000 but save its compressed length in a 64-bit field. Shrink that field down to 20 bits, so you can only cache 1MB deltas. Larger deltas must be recomputed at when the pack is written down. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: refer to delta objects by index instead of pointerNguyễn Thái Ngọc Duy1-54/+63
These delta pointers always point to elements in the objects[] array in packing_data struct. We can only hold maximum 4G of those objects because the array size in nr_objects is uint32_t. We could use uint32_t indexes to address these elements instead of pointers. On 64-bit architecture (8 bytes per pointer) this would save 4 bytes per pointer. Convert these delta pointers to indexes. Since we need to handle NULL pointers as well, the index is shifted by one [1]. [1] This means we can only index 2^32-2 objects even though nr_objects could contain 2^32-1 objects. It should not be a problem in practice because when we grow objects[], nr_alloc would probably blow up long before nr_objects hits the wall. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: move in_pack out of struct object_entryNguyễn Thái Ngọc Duy1-11/+17
Instead of using 8 bytes (on 64 bit arch) to store a pointer to a pack. Use an index instead since the number of packs should be relatively small. This limits the number of packs we can handle to 1k. Since we can't be sure people can never run into the situation where they have more than 1k pack files. Provide a fall back route for it. If we find out they have too many packs, the new in_pack_by_idx[] array (which has at most 1k elements) will not be used. Instead we allocate in_pack[] array that holds nr_objects elements. This is similar to how the optional in_pack_pos field is handled. The new simple test is just to make sure the too-many-packs code path is at least executed. The true test is running make test GIT_TEST_FULL_IN_PACK_ARRAY=1 to take advantage of other special case tests. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: move in_pack_pos out of struct object_entryNguyễn Thái Ngọc Duy1-1/+2
This field is only need for pack-bitmap, which is an optional feature. Move it to a separate array that is only allocated when pack-bitmap is used (like objects[], it is not freed, since we need it until the end of the process) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: use bitfield for object_entry::depthNguyễn Thái Ngọc Duy1-0/+6
Because of struct packing from now on we can only handle max depth 4095 (or even lower when new booleans are added in this struct). This should be ok since long delta chain will cause significant slow down anyway. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: use bitfield for object_entry::dfs_stateNguyễn Thái Ngọc Duy1-0/+3
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16pack-objects: turn type and in_pack_type to bitfieldsNguyễn Thái Ngọc Duy1-24/+35
An extra field type_valid is added to carry the equivalent of OBJ_BAD in the original "type" field. in_pack_type always contains a valid type so we only need 3 bits for it. A note about accepting OBJ_NONE as "valid" type. The function read_object_list_from_stdin() can pass this value [1] and it eventually calls create_object_entry() where current code skip setting "type" field if the incoming type is zero. This does not have any bad side effects because "type" field should be memset()'d anyway. But since we also need to set type_valid now, skipping oe_set_type() leaves type_valid zero/false, which will make oe_type() return OBJ_BAD, not OBJ_NONE anymore. Apparently we do care about OBJ_NONE in prepare_pack(). This switch from OBJ_NONE to OBJ_BAD may trigger fatal: unable to get type of object ... Accepting OBJ_NONE [2] does sound wrong, but this is how it is has been for a very long time and I haven't time to dig in further. [1] See 5c49c11686 (pack-objects: better check_object() performances - 2007-04-16) [2] 21666f1aae (convert object type handling from a string to a number - 2007-02-26) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11Merge branch 'sb/packfiles-in-repository'Junio C Hamano1-1/+0
Refactoring of the internal global data structure continues. * sb/packfiles-in-repository: packfile: keep prepare_packed_git() private packfile: allow find_pack_entry to handle arbitrary repositories packfile: add repository argument to find_pack_entry packfile: allow reprepare_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git_one to handle arbitrary repositories packfile: add repository argument to reprepare_packed_git packfile: add repository argument to prepare_packed_git packfile: add repository argument to prepare_packed_git_one packfile: allow install_packed_git to handle arbitrary repositories packfile: allow rearrange_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git_mru to handle arbitrary repositories
2018-04-11Merge branch 'sb/object-store'Junio C Hamano1-9/+12
Refactoring the internal global data structure to make it possible to open multiple repositories, work with and then close them. Rerolled by Duy on top of a separate preliminary clean-up topic. The resulting structure of the topics looked very sensible. * sb/object-store: (27 commits) sha1_file: allow sha1_loose_object_info to handle arbitrary repositories sha1_file: allow map_sha1_file to handle arbitrary repositories sha1_file: allow map_sha1_file_1 to handle arbitrary repositories sha1_file: allow open_sha1_file to handle arbitrary repositories sha1_file: allow stat_sha1_file to handle arbitrary repositories sha1_file: allow sha1_file_name to handle arbitrary repositories sha1_file: add repository argument to sha1_loose_object_info sha1_file: add repository argument to map_sha1_file sha1_file: add repository argument to map_sha1_file_1 sha1_file: add repository argument to open_sha1_file sha1_file: add repository argument to stat_sha1_file sha1_file: add repository argument to sha1_file_name sha1_file: allow prepare_alt_odb to handle arbitrary repositories sha1_file: allow link_alt_odb_entries to handle arbitrary repositories sha1_file: add repository argument to prepare_alt_odb sha1_file: add repository argument to link_alt_odb_entries sha1_file: add repository argument to read_info_alternates sha1_file: add repository argument to link_alt_odb_entry sha1_file: add raw_object_store argument to alt_odb_usable pack: move approximate object count to object store ...
2018-04-10Merge branch 'bc/object-id'Junio C Hamano1-16/+11
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (36 commits) convert: convert to struct object_id sha1_file: introduce a constant for max header length Convert lookup_replace_object to struct object_id sha1_file: convert read_sha1_file to struct object_id sha1_file: convert read_object_with_reference to object_id tree-walk: convert tree entry functions to object_id streaming: convert istream internals to struct object_id tree-walk: convert get_tree_entry_follow_symlinks internals to object_id builtin/notes: convert static functions to object_id builtin/fmt-merge-msg: convert remaining code to object_id sha1_file: convert sha1_object_info* to object_id Convert remaining callers of sha1_object_info_extended to object_id packfile: convert unpack_entry to struct object_id sha1_file: convert retry_bad_packed_offset to struct object_id sha1_file: convert assert_sha1_type to object_id builtin/mktree: convert to struct object_id streaming: convert open_istream to use struct object_id sha1_file: convert check_sha1_signature to struct object_id sha1_file: convert read_loose_object to use struct object_id builtin/index-pack: convert struct ref_delta_entry to object_id ...
2018-04-02csum-file: refactor finalize_hashfile() methodDerrick Stolee1-2/+2
If we want to use a hashfile on the temporary file for a lockfile, then we need finalize_hashfile() to fully write the trailing hash but also keep the file descriptor open. Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional change that checks this flag before writing the checksum to the stream. This differs from previous behavior since it would be written if either CSUM_CLOSE or CSUM_FSYNC is provided. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02csum-file: rename hashclose() to finalize_hashfile()Derrick Stolee1-3/+3
The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26packfile: keep prepare_packed_git() privateNguyễn Thái Ngọc Duy1-1/+0
The reason callers have to call this is to make sure either packed_git or packed_git_mru pointers are initialized since we don't do that by default. Sometimes it's hard to see this connection between where the function is called and where packed_git pointer is used (sometimes in separate functions). Keep this dependency internal because now all access to packed_git and packed_git_mru must go through get_xxx() wrappers. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26packfile: add repository argument to prepare_packed_gitStefan Beller1-1/+1
Add a repository argument to allow prepare_packed_git callers to be more specific about which repository to handle. See commit "sha1_file: add repository argument to link_alt_odb_entry" for an explanation of the #define trick. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26object-store: move packed_git and packed_git_mru to object storeStefan Beller1-9/+12
In a process with multiple repositories open, packfile accessors should be associated to a single repository and not shared globally. Move packed_git and packed_git_mru into the_repository and adjust callers to reflect this. [nd: while at there, wrap access to these two fields in get_packed_git() and get_packed_git_mru(). This allows us to lazily initialize these fields without caller doing that explicitly] Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14Merge branch 'nd/object-allocation-comments'Junio C Hamano1-0/+1
Code doc update. * nd/object-allocation-comments: object.h: realign object flag allocation comment object.h: update flag allocation comment
2018-03-14sha1_file: convert read_sha1_file to struct object_idbrian m. carlson1-10/+6
Convert read_sha1_file to take a pointer to struct object_id and rename it read_object_file. Do the same for read_sha1_file_extended. Convert one use in grep.c to use the new function without any other code change, since the pointer being passed is a void pointer that is already initialized with a pointer to struct object_id. Update the declaration and definitions of the modified functions, and apply the following semantic patch to convert the remaining callers: @@ expression E1, E2, E3; @@ - read_sha1_file(E1.hash, E2, E3) + read_object_file(&E1, E2, E3) @@ expression E1, E2, E3; @@ - read_sha1_file(E1->hash, E2, E3) + read_object_file(E1, E2, E3) @@ expression E1, E2, E3, E4; @@ - read_sha1_file_extended(E1.hash, E2, E3, E4) + read_object_file_extended(&E1, E2, E3, E4) @@ expression E1, E2, E3, E4; @@ - read_sha1_file_extended(E1->hash, E2, E3, E4) + read_object_file_extended(E1, E2, E3, E4) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14sha1_file: convert read_object_with_reference to object_idbrian m. carlson1-1/+1
Convert read_object_with_reference to take pointers to struct object_id. Update the internals of the function accordingly. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14sha1_file: convert sha1_object_info* to object_idbrian m. carlson1-4/+3
Convert sha1_object_info and sha1_object_info_extended to take pointers to struct object_id and rename them to use "oid" instead of "sha1" in their names. Update the declaration and definition and apply the following semantic patch, plus the standard object_id transforms: @@ expression E1, E2; @@ - sha1_object_info(E1.hash, E2) + oid_object_info(&E1, E2) @@ expression E1, E2; @@ - sha1_object_info(E1->hash, E2) + oid_object_info(E1, E2) @@ expression E1, E2, E3; @@ - sha1_object_info_extended(E1.hash, E2, E3) + oid_object_info_extended(&E1, E2, E3) @@ expression E1, E2, E3; @@ - sha1_object_info_extended(E1->hash, E2, E3) + oid_object_info_extended(E1, E2, E3) Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14streaming: convert open_istream to use struct object_idbrian m. carlson1-1/+1
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-06Merge branch 'bw/c-plus-plus'Junio C Hamano1-4/+4
Avoid using identifiers that clash with C++ keywords. Even though it is not a goal to compile Git with C++ compilers, changes like this help use of code analysis tools that targets C++ on our codebase. * bw/c-plus-plus: (37 commits) replace: rename 'new' variables trailer: rename 'template' variables tempfile: rename 'template' variables wrapper: rename 'template' variables environment: rename 'namespace' variables diff: rename 'template' variables environment: rename 'template' variables init-db: rename 'template' variables unpack-trees: rename 'new' variables trailer: rename 'new' variables submodule: rename 'new' variables split-index: rename 'new' variables remote: rename 'new' variables ref-filter: rename 'new' variables read-cache: rename 'new' variables line-log: rename 'new' variables imap-send: rename 'new' variables http: rename 'new' variables entry: rename 'new' variables diffcore-delta: rename 'new' variables ...
2018-03-06object.h: update flag allocation commentNguyễn Thái Ngọc Duy1-0/+1
Since the "flags" is shared, it's a good idea to keep track of who uses what bit. When we need to use more flags in library code, we can be sure it won't be re-used for another purpose by some caller. While at there, fix the location of "5" (should be in a different column than "4" two lines down) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-15Merge branch 'bc/hash-algo'Junio C Hamano1-26/+26
More abstraction of hash function from the codepath. * bc/hash-algo: hash: update obsolete reference to SHA1_HEADER bulk-checkin: abstract SHA-1 usage csum-file: abstract uses of SHA-1 csum-file: rename sha1file to hashfile read-cache: abstract away uses of SHA-1 pack-write: switch various SHA-1 values to abstract forms pack-check: convert various uses of SHA-1 to abstract forms fast-import: switch various uses of SHA-1 to the_hash_algo sha1_file: switch uses of SHA-1 to the_hash_algo builtin/unpack-objects: switch uses of SHA-1 to the_hash_algo builtin/index-pack: improve hash function abstraction hash: create union for hash context allocation hash: move SHA-1 macros to hash.h
2018-02-15Merge branch 'po/object-id'Junio C Hamano1-1/+1
Conversion from uchar[20] to struct object_id continues. * po/object-id: sha1_file: rename hash_sha1_file_literally sha1_file: convert write_loose_object to object_id sha1_file: convert force_object_loose to object_id sha1_file: convert write_sha1_file to object_id notes: convert write_notes_tree to object_id notes: convert combine_notes_* to object_id commit: convert commit_tree* to object_id match-trees: convert splice_tree to object_id cache: clear whole hash buffer with oidclr sha1_file: convert hash_sha1_file to object_id dir: convert struct sha1_stat to use object_id sha1_file: convert pretend_sha1_file to object_id
2018-02-14pack-objects: rename 'this' variablesBrandon Williams1-4/+4
Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-13Merge branch 'gs/retire-mru'Junio C Hamano1-5/+4
Retire mru API as it does not give enough abstraction over underlying list API to be worth it. * gs/retire-mru: mru: Replace mru.[ch] with list.h implementation
2018-02-13Merge branch 'ot/mru-on-list'Junio C Hamano1-2/+3
The first step to getting rid of mru API and using the doubly-linked list API directly instead. * ot/mru-on-list: mru: use double-linked list from list.h
2018-02-13Merge branch 'jh/fsck-promisors'Junio C Hamano1-2/+35
In preparation for implementing narrow/partial clone, the machinery for checking object connectivity used by gc and fsck has been taught that a missing object is OK when it is referenced by a packfile specially marked as coming from trusted repository that promises to make them available on-demand and lazily. * jh/fsck-promisors: gc: do not repack promisor packfiles rev-list: support termination at promisor objects sha1_file: support lazily fetching missing objects introduce fetch-object: fetch one promisor object index-pack: refactor writing of .keep files fsck: support promisor objects as CLI argument fsck: support referenced promisor objects fsck: support refs pointing to promisor objects fsck: introduce partialclone extension extension.partialclone: introduce partial clone extension
2018-02-02csum-file: rename sha1file to hashfilebrian m. carlson1-26/+26
Rename struct sha1file to struct hashfile, along with all of its related functions. The transformation in this commit was made by global search-and-replace. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-01-30sha1_file: convert force_object_loose to object_idPatryk Obara1-1/+1
Convert the definition and declaration of force_object_loose to struct object_id and adjust usage of this function. Signed-off-by: Patryk Obara <patryk.obara@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-01-24mru: Replace mru.[ch] with list.h implementationGargi Sharma1-5/+4
Replace the custom calls to mru.[ch] with calls to list.h. This patch is the final step in removing the mru API completely and inlining the logic. This patch leads to significant code reduction and the mru API hence, is not a useful abstraction anymore. Signed-off-by: Gargi Sharma <gs051095@gmail.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-12-27Merge branch 'jh/object-filtering'Junio C Hamano1-1/+63
In preparation for implementing narrow/partial clone, the object walking machinery has been taught a way to tell it to "filter" some objects from enumeration. * jh/object-filtering: rev-list: support --no-filter argument list-objects-filter-options: support --no-filter list-objects-filter-options: fix 'keword' typo in comment pack-objects: add list-objects filtering rev-list: add list-objects filtering support list-objects: filter objects in traverse_commit_list oidset: add iterator methods to oidset oidmap: add oidmap iterator methods dir: allow exclusions from blob in addition to file
2017-12-08gc: do not repack promisor packfilesJonathan Tan1-2/+35
Teach gc to stop traversal at promisor objects, and to leave promisor packfiles alone. This has the effect of only repacking non-promisor packfiles, and preserves the distinction between promisor packfiles and non-promisor packfiles. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-22pack-objects: add list-objects filteringJeff Hostetler1-1/+63
Teach pack-objects to use the filtering provided by the traverse_commit_list_filtered() interface to omit unwanted objects from the resulting packfile. Filtering requires the use of the "--stdout" option. Add t5317 test. In the future, we will introduce a "partial clone" mechanism wherein an object in a repo, obtained from a remote, may reference a missing object that can be dynamically fetched from that remote once needed. This "partial clone" mechanism will have a way, sometimes slow, of determining if a missing link is one of the links expected to be produced by this mechanism. This patch introduces handling of missing objects to help debugging and development of the "partial clone" mechanism, and once the mechanism is implemented, for a power user to perform operations that are missing-object aware without incurring the cost of checking if a missing link is expected. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-16refs: convert peel_ref to struct object_idbrian m. carlson1-2/+2
Convert peel_ref (and its corresponding backend) to struct object_id. This transformation was done with an update to the declaration, definition, comments, and test helper and the following semantic patch: @@ expression E1, E2; @@ - peel_ref(E1, E2.hash) + peel_ref(E1, &E2) @@ expression E1, E2; @@ - peel_ref(E1, E2->hash) + peel_ref(E1, E2) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-16builtin/pack-objects: convert to struct object_idbrian m. carlson1-67/+68
This is one of the last unconverted callers to peel_ref. While we're fixing that, convert the rest of the file, since it will need to be converted at some point anyway. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-16pack-bitmap: convert traverse_bitmap_commit_list to object_idbrian m. carlson1-4/+4
Convert traverse_bitmap_commit_list and the callbacks it takes to use a pointer to struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-10cleanup: fix possible overflow errors in binary searchDerrick Stolee1-1/+1
A common mistake when writing binary search is to allow possible integer overflow by using the simple average: mid = (min + max) / 2; Instead, use the overflow-safe version: mid = min + (max - min) / 2; This translation is safe since the operation occurs inside a loop conditioned on "min < max". The included changes were found using the following git grep: git grep '/ *2;' '*.c' Making this cleanup will prevent future review friction when a new binary search is contructed based on existing code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-01mru: use double-linked list from list.hOlga Telezhnaya1-2/+3
Simplify mru.[ch] and related code by reusing the double-linked list implementation from list.h instead of a custom one. This commit is an intermediate step. Our final goal is to get rid of mru.[ch] at all and inline all logic. Mentored-by: Christian Couder <christian.couder@gmail.com> Mentored by: Jeff King <peff@peff.net> Signed-off-by: Olga Telezhnaia <olyatelezhnaya@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-29Merge branch 'rj/no-sign-compare'Junio C Hamano1-2/+2
Many codepaths have been updated to squelch -Wsign-compare warnings. * rj/no-sign-compare: ALLOC_GROW: avoid -Wsign-compare warnings cache.h: hex2chr() - avoid -Wsign-compare warnings commit-slab.h: avoid -Wsign-compare warnings git-compat-util.h: xsize_t() - avoid -Wsign-compare warnings
2017-09-22ALLOC_GROW: avoid -Wsign-compare warningsRamsay Jones1-2/+2
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-14pack: make packed_git_mru global a value instead of a pointerJonathan Nieder1-2/+2
The MRU cache that keeps track of recently used packs is represented using two global variables: struct mru packed_git_mru_storage; struct mru *packed_git_mru = &packed_git_mru_storage; Callers never assign to the packed_git_mru pointer, though, so we can simplify by eliminating it and using &packed_git_mru_storage (renamed to &packed_git_mru) directly. This variable is only used by the packfile subsystem, making this a relatively uninvasive change (and any new unadapted callers would trigger a compile error). Noticed while moving these globals to the object_store struct. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Acked-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-10Merge branch 'ma/ts-cleanups'Junio C Hamano1-0/+6
Assorted bugfixes and clean-ups. * ma/ts-cleanups: ThreadSanitizer: add suppressions strbuf_setlen: don't write to strbuf_slopbuf pack-objects: take lock before accessing `remaining` convert: always initialize attr_action in convert_attrs
2017-08-23pack: move open_pack_index(), parse_pack_index()Jonathan Tan1-0/+1
alloc_packed_git() in packfile.c is duplicated from sha1_file.c. In a subsequent commit, alloc_packed_git() will be removed from sha1_file.c. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-23Merge branch 'rs/pack-objects-pbase-cleanup' into maintJunio C Hamano1-1/+1
Code clean-up. * rs/pack-objects-pbase-cleanup: pack-objects: remove unnecessary NULL check
2017-08-23pack-objects: take lock before accessing `remaining`Martin Ågren1-0/+6
When checking the conditional of "while (me->remaining)", we did not hold the lock. Calling find_deltas would still be safe, since it checks "remaining" (after taking the lock) and is able to handle all values. In fact, this could (currently) not trigger any bug: a bug could happen if `remaining` transitioning from zero to non-zero races with the evaluation of the while-condition, but these are always separated by the data_ready-mechanism. Make sure we have the lock when we read `remaining`. This does mean we release it just so that find_deltas can take it immediately again. We could tweak the contract so that the lock should be taken before calling find_deltas, but let's defer that until someone can actually show that "unlock+lock" has a measurable negative impact. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-11Merge branch 'rs/pack-objects-pbase-cleanup'Junio C Hamano1-1/+1
Code clean-up. * rs/pack-objects-pbase-cleanup: pack-objects: remove unnecessary NULL check
2017-07-20pack-objects: remove unnecessary NULL checkRené Scharfe1-1/+1
If done_pbase_paths is NULL then done_pbase_paths_num must be zero and done_pbase_path_pos() returns -1 without accessing the array, so the check is not necessary. If the invariant was violated then the check would make sure we keep on going and allocate the necessary amount of memory in the next ALLOC_GROW call. That sounds nice, but all array entries except for one would contain garbage data. If the invariant was violated without the check we'd get a segfault in done_pbase_path_pos(), i.e. an observable crash, alerting us of the presence of a bug. Currently there is no such bug: Only the functions check_pbase_path() and cleanup_preferred_base() change pointer and counter, and both make sure to keep them in sync. Get rid of the check anyway to allow us to see if later changes introduce such a defect, and to simplify the code. Detected by Coverity Scan. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-17use MOVE_ARRAYRené Scharfe1-3/+2
Simplify the code for moving members inside of an array and make it more robust by using the helper macro MOVE_ARRAY. It calculates the size based on the specified number of elements for us and supports NULL pointers when that number is zero. Raw memmove(3) calls with NULL can cause the compiler to (over-eagerly) optimize out later NULL checks. This patch was generated with contrib/coccinelle/array.cocci and spatch (Coccinelle). Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-24Merge branch 'ab/free-and-null'Junio C Hamano1-8/+4
A common pattern to free a piece of memory and assign NULL to the pointer that used to point at it has been replaced with a new FREE_AND_NULL() macro. * ab/free-and-null: *.[ch] refactoring: make use of the FREE_AND_NULL() macro coccinelle: make use of the "expression" FREE_AND_NULL() rule coccinelle: add a rule to make "expression" code use FREE_AND_NULL() coccinelle: make use of the "type" FREE_AND_NULL() rule coccinelle: add a rule to make "type" code use FREE_AND_NULL() git-compat-util: add a FREE_AND_NULL() wrapper around free(ptr); ptr = NULL
2017-06-24Merge branch 'bw/config-h'Junio C Hamano1-0/+1
Fix configuration codepath to pay proper attention to commondir that is used in multi-worktree situation, and isolate config API into its own header file. * bw/config-h: config: don't implicitly use gitdir or commondir config: respect commondir setup: teach discover_git_directory to respect the commondir config: don't include config.h by default config: remove git_config_iter config: create config.h
2017-06-16coccinelle: make use of the "type" FREE_AND_NULL() ruleÆvar Arnfjörð Bjarmason1-8/+4
Apply the result of the just-added coccinelle rule. This manually excludes a few occurrences, mostly things that resulted in many FREE_AND_NULL() on one line, that'll be manually fixed in a subsequent change. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-15config: don't include config.h by defaultBrandon Williams1-0/+1
Stop including config.h by default in cache.h. Instead only include config.h in those files which require use of the config system. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-04Merge branch 'jk/disable-pack-reuse-when-broken' into maintJunio C Hamano1-1/+5
"pack-objects" can stream a slice of an existing packfile out when the pack bitmap can tell that the reachable objects are all needed in the output, without inspecting individual objects. This strategy however would not work well when "--local" and other options are in use, and need to be disabled. * jk/disable-pack-reuse-when-broken: t5310: fix "; do" style pack-objects: disable pack reuse for object-selection options
2017-06-02Merge branch 'ab/grep-preparatory-cleanup'Junio C Hamano1-1/+3
The internal implementation of "git grep" has seen some clean-up. * ab/grep-preparatory-cleanup: (31 commits) grep: assert that threading is enabled when calling grep_{lock,unlock} grep: given --threads with NO_PTHREADS=YesPlease, warn pack-objects: fix buggy warning about threads pack-objects & index-pack: add test for --threads warning test-lib: add a PTHREADS prerequisite grep: move is_fixed() earlier to avoid forward declaration grep: change internal *pcre* variable & function names to be *pcre1* grep: change the internal PCRE macro names to be PCRE1 grep: factor test for \0 in grep patterns into a function grep: remove redundant regflags assignments grep: catch a missing enum in switch statement perf: add a comparison test of log --grep regex engines with -F perf: add a comparison test of log --grep regex engines perf: add a comparison test of grep regex engines with -F perf: add a comparison test of grep regex engines perf: emit progress output when unpacking & building perf: add a GIT_PERF_MAKE_COMMAND for when *_MAKE_OPTS won't do grep: add tests to fix blind spots with \0 patterns grep: prepare for testing binary regexes containing rx metacharacters grep: add a test helper function for less verbose -f \0 tests ...
2017-05-29Merge branch 'jk/disable-pack-reuse-when-broken'Junio C Hamano1-1/+5
"pack-objects" can stream a slice of an existing packfile out when the pack bitmap can tell that the reachable objects are all needed in the output, without inspecting individual objects. This strategy however would not work well when "--local" and other options are in use, and need to be disabled. * jk/disable-pack-reuse-when-broken: t5310: fix "; do" style pack-objects: disable pack reuse for object-selection options
2017-05-29Merge branch 'bc/object-id'Junio C Hamano1-30/+41
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (53 commits) object: convert parse_object* to take struct object_id tree: convert parse_tree_indirect to struct object_id sequencer: convert do_recursive_merge to struct object_id diff-lib: convert do_diff_cache to struct object_id builtin/ls-tree: convert to struct object_id merge: convert checkout_fast_forward to struct object_id sequencer: convert fast_forward_to to struct object_id builtin/ls-files: convert overlay_tree_on_cache to object_id builtin/read-tree: convert to struct object_id sha1_name: convert internals of peel_onion to object_id upload-pack: convert remaining parse_object callers to object_id revision: convert remaining parse_object callers to object_id revision: rename add_pending_sha1 to add_pending_oid http-push: convert process_ls_object and descendants to object_id refs/files-backend: convert many internals to struct object_id refs: convert struct ref_update to use struct object_id ref-filter: convert some static functions to struct object_id Convert struct ref_array_item to struct object_id Convert the verify_pack callback to struct object_id Convert lookup_tag to struct object_id ...
2017-05-26pack-objects: fix buggy warning about threadsÆvar Arnfjörð Bjarmason1-1/+3
Fix a buggy warning about threads under NO_PTHREADS=YesPlease. Due to re-using the delta_search_threads variable for both the state of the "pack.threads" config & the --threads option, setting "pack.threads" but not supplying --threads would trigger the warning for both "pack.threads" & --threads. Solve this bug by resetting the delta_search_threads variable in git_pack_config(), it might then be set by --threads again and be subsequently warned about, as the test I'm changing here asserts. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-09pack-objects: disable pack reuse for object-selection optionsJeff King1-1/+5
If certain options like --honor-pack-keep, --local, or --incremental are used with pack-objects, then we need to feed each potential object to want_object_in_pack() to see if it should be filtered out. But when the bitmap reuse_packfile optimization is in effect, we do not call that function at all, and in fact skip adding the objects to the to_pack list entirely. This means we have a bug: for certain requests we will silently ignore those options and include objects in that pack that should not be there. The problem has been present since the inception of the pack-reuse code in 6b8fda2db (pack-objects: use bitmaps when packing objects, 2013-12-21), but it was unlikely to come up in practice. These options are generally used for on-disk packing, not transfer packs (which go to stdout), but we've never allowed pack reuse for non-stdout packs (until 645c432d6, we did not even use bitmaps, which the reuse optimization relies on; after that, we explicitly turned it off when not packing to stdout). We can fix this by just disabling the reuse_packfile optimization when the options are in use. In theory we could teach the pack-reuse code to satisfy these checks, but it's not worth the complexity. The purpose of the optimization is to keep the amount of per-object work we do to a minimum. But these options inherently require us to search for other copies of each object, drowning out any benefit of the pack-reuse optimization. But note that the optimizations from 56dfeb626 (pack-objects: compute local/ignore_pack_keep early, 2016-07-29) happen before pack-reuse, meaning that specifying "--honor-pack-keep" in a repository with no .keep files can still follow the fast path. There are tests in t5310 that check these options with bitmaps and --stdout, but they didn't catch the bug, and it's hard to adapt them to do so. One problem is that they don't use --delta-base-offset; without that option, we always disable the reuse optimization entirely. It would be fine to add it in (it actually makes the test more realistic), but that still isn't quite enough. The other problem is that the reuse code is very picky; it only kicks in when it can reuse most of a pack, starting from the first byte. So we'd have to start from a fully repacked and bitmapped state to trigger it. But the tests for these options use a much more subtle state; they want to be sure that the want_object_in_pack() code is allowing some objects but not others. Doing a full repack runs counter to that. So this patch adds new tests at the end of the script which create the fully-packed state and make sure that each option is not fooled by reusable pack. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08Convert lookup_tag to struct object_idbrian m. carlson1-1/+1
Convert lookup_tag to take a pointer to struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08pack: convert struct pack_idx_entry to struct object_idbrian m. carlson1-26/+37
Convert struct pack_idx_entry to use struct object_id by changing the definition and applying the following semantic patch, plus the standard object_id transforms: @@ struct pack_idx_entry E1; @@ - E1.sha1 + E1.oid.hash @@ struct pack_idx_entry *E1; @@ - E1->sha1 + E1->oid.hash Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08shallow: convert shallow registration functions to object_idbrian m. carlson1-3/+3
Convert register_shallow and unregister_shallow to take struct object_id. register_shallow is a caller of lookup_commit, which we will convert later. It doesn't make sense for the registration and unregistration functions to have incompatible interfaces, so convert them both. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-27timestamp_t: a new data type for timestampsJohannes Schindelin1-2/+2
Git's source code assumes that unsigned long is at least as precise as time_t. Which is incorrect, and causes a lot of problems, in particular where unsigned long is only 32-bit (notably on Windows, even in 64-bit versions). So let's just use a more appropriate data type instead. In preparation for this, we introduce the new `timestamp_t` data type. By necessity, this is a very, very large patch, as it has to replace all timestamps' data type in one go. As we will use a data type that is not necessarily identical to `time_t`, we need to be very careful to use `time_t` whenever we interact with the system functions, and `timestamp_t` everywhere else. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-19Merge branch 'bc/object-id'Junio C Hamano1-12/+12
Conversion from unsigned char [40] to struct object_id continues. * bc/object-id: Documentation: update and rename api-sha1-array.txt Rename sha1_array to oid_array Convert sha1_array_for_each_unique and for_each_abbrev to object_id Convert sha1_array_lookup to take struct object_id Convert remaining callers of sha1_array_lookup to object_id Make sha1_array_append take a struct object_id * sha1-array: convert internal storage for struct sha1_array to object_id builtin/pull: convert to struct object_id submodule: convert check_for_new_submodule_commits to object_id sha1_name: convert disambiguate_hint_fn to take object_id sha1_name: convert struct disambiguate_state to object_id test-sha1-array: convert most code to struct object_id parse-options-cb: convert sha1_array_append caller to struct object_id fsck: convert init_skiplist to struct object_id builtin/receive-pack: convert portions to struct object_id builtin/pull: convert portions to struct object_id builtin/diff: convert to struct object_id Convert GIT_SHA1_RAWSZ used for allocation to GIT_MAX_RAWSZ Convert GIT_SHA1_HEXSZ used for allocation to GIT_MAX_HEXSZ Define new hash-size constants for allocating memory
2017-03-31Rename sha1_array to oid_arraybrian m. carlson1-5/+5
Since this structure handles an array of object IDs, rename it to struct oid_array. Also rename the accessor functions and the initialization constant. This commit was produced mechanically by providing non-Documentation files to the following Perl one-liners: perl -pi -E 's/struct sha1_array/struct oid_array/g' perl -pi -E 's/\bsha1_array_/oid_array_/g' perl -pi -E 's/SHA1_ARRAY_INIT/OID_ARRAY_INIT/g' Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-31Convert sha1_array_lookup to take struct object_idbrian m. carlson1-1/+1
Convert this function by changing the declaration and definition and applying the following semantic patch to update the callers: @@ expression E1, E2; @@ - sha1_array_lookup(E1, E2.hash) + sha1_array_lookup(E1, &E2) @@ expression E1, E2; @@ - sha1_array_lookup(E1, E2->hash) + sha1_array_lookup(E1, E2) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-31Convert remaining callers of sha1_array_lookup to object_idbrian m. carlson1-8/+8
There are a very small number of callers which don't already use struct object_id. Convert them. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-31Make sha1_array_append take a struct object_id *brian m. carlson1-2/+2
Convert the callers to pass struct object_id by changing the function declaration and definition and applying the following semantic patch: @@ expression E1, E2; @@ - sha1_array_append(E1, E2.hash) + sha1_array_append(E1, &E2) @@ expression E1, E2; @@ - sha1_array_append(E1, E2->hash) + sha1_array_append(E1, E2) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-28Merge branch 'jk/fast-import-cleanup'Junio C Hamano1-4/+8
Code clean-up. * jk/fast-import-cleanup: pack.h: define largest possible encoded object size encode_in_pack_object_header: respect output buffer length fast-import: use xsnprintf for formatting headers fast-import: use xsnprintf for writing sha1s
2017-03-24pack.h: define largest possible encoded object sizeJeff King1-2/+4
Several callers use fixed buffers for storing the pack object header, and they've picked 10 as a magic number. This is reasonable, since it handles objects up to 2^67. But let's give them a constant so it's clear that the number isn't pulled out of thin air. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-24encode_in_pack_object_header: respect output buffer lengthJeff King1-2/+4
The encode_in_pack_object_header() writes a variable-length header to an output buffer, but it doesn't actually know long the buffer is. At first glance, this looks like it might be possible to overflow. In practice, this is probably impossible. The smallest buffer we use is 10 bytes, which would hold the header for an object up to 2^67 bytes. Obviously we're not likely to see such an object, but we might worry that an object could lie about its size (causing us to overflow before we realize it does not actually have that many bytes). But the argument is passed as a uintmax_t. Even on systems that have __int128 available, uintmax_t is typically restricted to 64-bit by the ABI. So it's unlikely that a system exists where this could be exploited. Still, it's easy enough to use a normal out/len pair and make sure we don't write too far. That protects the hypothetical 128-bit system, makes it harder for callers to accidentally specify a too-small buffer, and makes the resulting code easier to audit. Note that the one caller in fast-import tried to catch such a case, but did so _after_ the call (at which point we'd have already overflowed!). This check can now go away. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-17Merge branch 'bc/object-id'Junio C Hamano1-3/+3
"uchar [40]" to "struct object_id" conversion continues. * bc/object-id: wt-status: convert to struct object_id builtin/merge-base: convert to struct object_id Convert object iteration callbacks to struct object_id sha1_file: introduce an nth_packed_object_oid function refs: simplify parsing of reflog entries refs: convert each_reflog_ent_fn to struct object_id reflog-walk: convert struct reflog_info to struct object_id builtin/replace: convert to struct object_id Convert remaining callers of resolve_refdup to object_id builtin/merge: convert to struct object_id builtin/clone: convert to struct object_id builtin/branch: convert to struct object_id builtin/grep: convert to struct object_id builtin/fmt-merge-message: convert to struct object_id builtin/fast-export: convert to struct object_id builtin/describe: convert to struct object_id builtin/diff-tree: convert to struct object_id builtin/commit: convert to struct object_id hex: introduce parse_oid_hex
2017-02-27Merge branch 'bw/attr'Junio C Hamano1-14/+5
The gitattributes machinery is being taught to work better in a multi-threaded environment. * bw/attr: (27 commits) attr: reformat git_attr_set_direction() function attr: push the bare repo check into read_attr() attr: store attribute stack in attr_check structure attr: tighten const correctness with git_attr and match_attr attr: remove maybe-real, maybe-macro from git_attr attr: eliminate global check_all_attr array attr: use hashmap for attribute dictionary attr: change validity check for attribute names to use positive logic attr: pass struct attr_check to collect_some_attrs attr: retire git_check_attrs() API attr: convert git_check_attrs() callers to use the new API attr: convert git_all_attrs() to use "struct attr_check" attr: (re)introduce git_check_attr() and struct attr_check attr: rename function and struct related to checking attributes attr.c: outline the future plans by heavily commenting Documentation: fix a typo attr.c: add push_stack() helper attr: support quoting pathname patterns in C style attr.c: plug small leak in parse_attr_line() attr.c: tighten constness around "git_attr" structure ...
2017-02-27Merge branch 'jk/delta-chain-limit'Junio C Hamano1-23/+110
"git repack --depth=<n>" for a long time busted the specified depth when reusing delta from existing packs. This has been corrected. * jk/delta-chain-limit: pack-objects: convert recursion to iteration in break_delta_chain() pack-objects: enforce --depth limit in reused deltas
2017-02-22Convert object iteration callbacks to struct object_idbrian m. carlson1-3/+3
Convert each_loose_object_fn and each_packed_object_fn to take a pointer to struct object_id. Update the various callbacks. Convert several 40-based constants to use GIT_SHA1_HEXSZ. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-01attr: convert git_check_attrs() callers to use the new APIJunio C Hamano1-14/+5
The remaining callers are all simple "I have N attributes I am interested in. I'll ask about them with various paths one by one". After this step, no caller to git_check_attrs() remains. After removing it, we can extend "struct attr_check" struct with data that can be used in optimizing the query for the specific N attributes it contains. Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-01attr: rename function and struct related to checking attributesJunio C Hamano1-3/+3
The traditional API to check attributes is to prepare an N-element array of "struct git_attr_check" and pass N and the array to the function "git_check_attr()" as arguments. In preparation to revamp the API to pass a single structure, in which these N elements are held, rename the type used for these individual array elements to "struct attr_check_item" and rename the function to "git_check_attrs()". Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-27pack-objects: convert recursion to iteration in break_delta_chain()Jeff King1-30/+99
The break_delta_chain() function is recursive over the depth of a given delta chain, which can lead to possibly running out of stack space. Normally delta depth is quite small, but if there _is_ a pathological case, this is where we would find and fix it, so we should be more careful. We can do it without recursion at all, but there's a little bit of cleverness needed to do so. It's easiest to explain by covering the less-clever strategies first. The obvious thing to try is just keeping our own stack on the heap. Whenever we would recurse, push the new entry onto the stack and loop instead. But this gets tricky; when we see an ACTIVE entry, we need to care if we just pushed it (in which case it's a cycle) or if we just popped it (in which case we dealt with its bases, and no we need to clear the ACTIVE flag and compute its depth). You can hack around that in various ways, like keeping a "just pushed" flag, but the logic gets muddled. However, we can observe that we do all of our pushes first, and then all of our pops afterwards. In other words, we can do this in two passes. First dig down to the base, stopping when we see a cycle, and pushing each item onto our stack. Then pop the stack elements, clearing the ACTIVE flag and computing the depth for each. This works, and is reasonably elegant. However, why do we need the stack for the second pass? We can just walk the delta pointers again. There's one complication. Popping the stack went over our list in reverse, so we could compute the depth of each entry by incrementing the depth of its base, which we will have just computed. To go forward in the second pass, we have to compute the total depth on the way down, and then assign it as we go. This patch implements this final strategy, because it not only keeps the memory off the stack, but it eliminates it entirely. Credit for the cleverness in that approach goes to Michael Haggerty; bugs are mine. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-27pack-objects: enforce --depth limit in reused deltasJeff King1-0/+18
Since 898b14c (pack-objects: rework check_delta_limit usage, 2007-04-16), we check the delta depth limit only when figuring out whether we should make a new delta. We don't consider it at all when reusing deltas, which means that packing once with --depth=250, and then again with --depth=50, the second pack may still contain chains larger than 50. This is generally considered a feature, as the results of earlier high-depth repacks are carried forward, used for serving fetches, etc. However, since we started using cross-pack deltas in c9af708b1 (pack-objects: use mru list when iterating over packs, 2016-08-11), we are no longer bounded by the length of an existing delta chain in a single pack. Here's one particular pathological case: a sequence of N packs, each with 2 objects, the base of which is stored as a delta in a previous pack. If we chain all the deltas together, we have a cycle of length N. We break the cycle, but the tip delta is still at depth N-1. This is less unlikely than it might sound. See the included test for a reconstruction based on real-world actions. I ran into such a case in the wild, where a client was rapidly sending packs, and we had accumulated 10,000 before doing a server-side repack. The pack that "git repack" tried to generate had a very deep chain, which caused pack-objects to run out of stack space in the recursive write_one(). This patch bounds the length of delta chains in the output pack based on --depth, regardless of whether they are caused by cross-pack deltas or existed in the input packs. This fixes the problem, but does have two possible downsides: 1. High-depth aggressive repacks followed by "normal" repacks will throw away the high-depth chains. In the long run this is probably OK; investigation showed that high-depth repacks aren't actually beneficial, and we dropped the aggressive depth default to match the normal case in 07e7dbf0d (gc: default aggressive depth to 50, 2016-08-11). 2. If you really do want to store high-depth deltas on disk, they may be discarded and new delta computed when serving a fetch, unless you set pack.depth to match your high-depth size. The implementation uses the existing search for delta cycles. That lets us compute the depth of any node based on the depth of its base, because we know the base is DFS_DONE by the time we look at it (modulo any cycles in the graph, but we know there cannot be any because we break them as we see them). There is some subtlety worth mentioning, though. We record the depth of each object as we compute it. It might seem like we could save the per-object storage space by just keeping track of the depth of our traversal (i.e., have break_delta_chains() report how deep it went). But we may visit an object through multiple delta paths, and on subsequent paths we want to know its depth immediately, without having to walk back down to its final base (doing so would make our graph walk quadratic rather than linear). Likewise, one could try to record the depth not from the base, but from our starting point (i.e., start recursion_depth at 0, and pass "recursion_depth + 1" to each invocation of break_delta_chains()). And then when recursion_depth gets too big, we know that we must cut the delta chain. But that technique is wrong if we do not visit the nodes in topological order. In a chain A->B->C, it if we visit "C", then "B", then "A", we will never recurse deeper than 1 link (because we see at each node that we have already visited it). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-15compression: unify pack.compression configuration parsingJunio C Hamano1-14/+0
There are three codepaths that use a variable whose name is pack_compression_level to affect how objects and deltas sent to a packfile is compressed. Unlike zlib_compression_level that controls the loose object compression, however, this variable was static to each of these codepaths. Two of them read the pack.compression configuration variable, using core.compression as the default, and one of them also allowed overriding it from the command line. The other codepath in bulk-checkin did not pay any attention to the configuration. Unify the configuration parsing to git_default_config(), where we implement the parsing of core.loosecompression and core.compression and make the former override the latter, by moving code to parse pack.compression and also allow core.compression to give default to this variable. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-25sha1_file: rename git_open_noatime() to git_open()Lars Schneider1-1/+1
This function is meant to be used when reading from files in the object store, and the original objective was to avoid smudging atime of loose object files too often, hence its name. Because we'll be extending its role in the next commit to also arrange the file descriptors they return auto-closed in the child processes, rename it to lose "noatime" part that is too specific. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10Merge branch 'jk/pack-objects-optim-mru'Junio C Hamano1-2/+90
"git pack-objects" in a repository with many packfiles used to spend a lot of time looking for/at objects in them; the accesses to the packfiles are now optimized by checking the most-recently-used packfile first. * jk/pack-objects-optim-mru: pack-objects: use mru list when iterating over packs pack-objects: break delta cycles before delta-search phase sha1_file: make packed_object_info public provide an initializer for "struct object_info"
2016-09-29use QSORTRené Scharfe1-4/+3
Apply the semantic patch contrib/coccinelle/qsort.cocci to the code base, replacing calls of qsort(3) with QSORT. The resulting code is shorter and supports empty arrays with NULL pointers. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-21Merge branch 'ks/pack-objects-bitmap'Junio C Hamano1-40/+88
Some codepaths in "git pack-objects" were not ready to use an existing pack bitmap; now they are and as the result they have become faster. * ks/pack-objects-bitmap: pack-objects: use reachability bitmap index when generating non-stdout pack pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
2016-09-15Merge branch 'jk/pack-tag-of-tag'Junio C Hamano1-1/+30
"git pack-objects --include-tag" was taught that when we know that we are sending an object C, we want a tag B that directly points at C but also a tag A that points at the tag B. We used to miss the intermediate tag B in some cases. * jk/pack-tag-of-tag: pack-objects: walk tag chains for --include-tag t5305: simplify packname handling t5305: use "git -C" t5305: drop "dry-run" of unpack-objects t5305: move cleanup into test block
2016-09-12pack-objects: use reachability bitmap index when generating non-stdout packKirill Smelkov1-7/+24
Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects) if a repository has bitmap index, pack-objects can nicely speedup "Counting objects" graph traversal phase. That however was done only for case when resultant pack is sent to stdout, not written into a file. The reason here is for on-disk repack by default we want: - to produce good pack (with bitmap index not-yet-packed objects are emitted to pack in suboptimal order). - to use more robust pack-generation codepath (avoiding possible bugs in bitmap code and possible bitmap index corruption). Jeff King further explains: The reason for this split is that pack-objects tries to determine how "careful" it should be based on whether we are packing to disk or to stdout. Packing to disk implies "git repack", and that we will likely delete the old packs after finishing. We want to be more careful (so as not to carry forward a corruption, and to generate a more optimal pack), and we presumably run less frequently and can afford extra CPU. Whereas packing to stdout implies serving a remote via "git fetch" or "git push". This happens more frequently (e.g., a server handling many fetching clients), and we assume the receiving end takes more responsibility for verifying the data. But this isn't always the case. One might want to generate on-disk packfiles for a specialized object transfer. Just using "--stdout" and writing to a file is not optimal, as it will not generate the matching pack index. So it would be useful to have some way of overriding this heuristic: to tell pack-objects that even though it should generate on-disk files, it is still OK to use the reachability bitmaps to do the traversal. So we can teach pack-objects to use bitmap index for initial object counting phase when generating resultant pack file too: - if we take care to not let it be activated under git-repack: See above about repack robustness and not forward-carrying corruption. - if we know bitmap index generation is not enabled for resultant pack: The current code has singleton bitmap_git, so it cannot work simultaneously with two bitmap indices. We also want to avoid (at least with current implementation) generating bitmaps off of bitmaps. The reason here is: when generating a pack, not-yet-packed objects will be emitted into pack in suboptimal order and added to tail of the bitmap as "extended entries". When the resultant pack + some new objects in associated repository are in turn used to generate another pack with bitmap, the situation repeats: new objects are again not emitted optimally and just added to bitmap tail - not in recency order. So the pack badness can grow over time when at each step we have bitmapped pack + some other objects. That's why we want to avoid generating bitmaps off of bitmaps, not to let pack badness grow. - if we keep pack reuse enabled still only for "send-to-stdout" case: Because pack-to-file needs to generate index for destination pack, and currently on pack reuse raw entries are directly written out to the destination pack by write_reused_pack(), bypassing needed for pack index generation bookkeeping done by regular codepath in write_one() and friends. ( In the future we might teach pack-reuse code about cases when index also needs to be generated for resultant pack and remove pack-reuse-only-for-stdout limitation ) This way for pack-objects -> file we get nice speedup: erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup repository managed by git-backup[2] via time echo 0186ac99 | git pack-objects --revs erp5pack before: 37.2s after: 26.2s And for `git repack -adb` packed git.git time echo 5c589a73 | git pack-objects --revs gitpack before: 7.1s after: 3.6s i.e. it can be 30% - 50% speedup for pack extraction. git-backup extracts many packs on repositories restoration. That was my initial motivation for the patch. [1] https://lab.nexedi.com/nexedi/erp5 [2] https://lab.nexedi.com/kirr/git-backup NOTE Jeff also suggests that pack.useBitmaps was probably a mistake to introduce originally. This way we are not adding another config point, but instead just always default to-file pack-objects not to use bitmap index: Tools which need to generate on-disk packs with using bitmap, can pass --use-bitmap-index explicitly. And git-repack does never pass --use-bitmap-index, so this way we can be sure regular on-disk repacking remains robust. NOTE2 `git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower than `git pack-objects file.pack`. Extracting erp5.git pack from lab.nexedi.com backup repository: $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack real 0m22.309s user 0m21.148s sys 0m0.932s $ time git index-pack erp5pack-stdout.pack real 0m50.873s <-- more than 2 times slower than time to generate pack itself! user 0m49.300s sys 0m1.360s So the time for `pack-object --stdout >file.pack` + `index-pack file.pack` is 72s, while `pack-objects file.pack` which does both pack and index is 27s. And even `pack-objects --no-use-bitmap-index file.pack` is 37s. Jeff explains: The packfile does not carry the sha1 of the objects. A receiving index-pack has to compute them itself, including inflating and applying all of the deltas. that's why for `git-backup restore` we want to teach `git pack-objects file.pack` to use bitmaps instead of using `git pack-objects --stdout >file.pack` + `git index-pack file.pack`. NOTE3 The speedup is now tracked via t/perf/p5310-pack-bitmaps.sh Test 56dfeb62 this tree -------------------------------------------------------------------------------- 5310.2: repack to disk 8.98(8.05+0.29) 9.05(8.08+0.33) +0.8% 5310.3: simulated clone 2.02(2.27+0.09) 2.01(2.25+0.08) -0.5% 5310.4: simulated fetch 0.81(1.07+0.02) 0.81(1.05+0.04) +0.0% 5310.5: pack to file 7.58(7.04+0.28) 7.60(7.04+0.30) +0.3% 5310.6: pack to file (bitmap) 7.55(7.02+0.28) 3.25(2.82+0.18) -57.0% 5310.8: clone (partial bitmap) 1.83(2.26+0.12) 1.82(2.22+0.14) -0.5% 5310.9: pack to file (partial bitmap) 6.86(6.58+0.30) 2.87(2.74+0.20) -58.2% More context: http://marc.info/?t=146792101400001&r=1&w=2 http://public-inbox.org/git/20160707190917.20011-1-kirr@nexedi.com/T/#t Cc: Vicent Marti <tanoku@gmail.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-12pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is ↵Kirill Smelkov1-33/+64
in use Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there are two codepaths in pack-objects: with & without using bitmap reachability index. However add_object_entry_from_bitmap(), despite its non-bitmapped counterpart add_object_entry(), in no way does check for whether --local or --honor-pack-keep or --incremental should be respected. In non-bitmapped codepath this is handled in want_object_in_pack(), but bitmapped codepath has simply no such checking at all. The bitmapped codepath however was allowing to pass in all those options and with bitmap indices still being used under such conditions - potentially giving wrong output (e.g. including objects from non-local or .keep'ed pack). We can easily fix this by noting the following: when an object comes to add_object_entry_from_bitmap() it can come for two reasons: 1. entries coming from main pack covered by bitmap index, and 2. object coming from, possibly alternate, loose or other packs. "2" can be already handled by want_object_in_pack() and to cover "1" we can teach want_object_in_pack() to expect that *found_pack can be non-NULL, meaning calling client already found object's pack entry. In want_object_in_pack() we care to start the checks from already found pack, if we have one, this way determining the answer right away in case neither --local nor --honour-pack-keep are active. In particular, as p5310-pack-bitmaps.sh shows (3 consecutive runs), we do not do harm to served-with-bitmap clones performance-wise: Test 56dfeb62 this tree ----------------------------------------------------------------- 5310.2: repack to disk 9.08(8.20+0.25) 9.09(8.14+0.32) +0.1% 5310.3: simulated clone 1.92(2.12+0.08) 1.93(2.12+0.09) +0.5% 5310.4: simulated fetch 0.82(1.07+0.04) 0.82(1.06+0.04) +0.0% 5310.6: partial bitmap 1.96(2.42+0.13) 1.95(2.40+0.15) -0.5% Test 56dfeb62 this tree ----------------------------------------------------------------- 5310.2: repack to disk 9.11(8.16+0.32) 9.11(8.19+0.28) +0.0% 5310.3: simulated clone 1.93(2.14+0.07) 1.92(2.11+0.10) -0.5% 5310.4: simulated fetch 0.82(1.06+0.04) 0.82(1.04+0.05) +0.0% 5310.6: partial bitmap 1.95(2.38+0.16) 1.94(2.39+0.14) -0.5% Test 56dfeb62 this tree ----------------------------------------------------------------- 5310.2: repack to disk 9.13(8.17+0.31) 9.07(8.13+0.28) -0.7% 5310.3: simulated clone 1.92(2.13+0.07) 1.91(2.12+0.06) -0.5% 5310.4: simulated fetch 0.82(1.08+0.03) 0.82(1.08+0.03) +0.0% 5310.6: partial bitmap 1.96(2.43+0.14) 1.96(2.42+0.14) +0.0% with delta timings showing they are all within noise from run to run. In the general case we do not want to call find_pack_entry_one() more than once, because it is expensive. This patch splits the loop in want_object_in_pack() into two parts: finding the object and seeing if it impacts our choice to include it in the pack. We may call the inexpensive want_found_object() twice, but we will never call find_pack_entry_one() if we do not need to. I appreciate help and discussing this change with Junio C Hamano and Jeff King. Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-07pack-objects: walk tag chains for --include-tagJeff King1-1/+30
When pack-objects is given --include-tag, it peels each tag ref down to a non-tag object, and if that non-tag object is going to be packed, we include the tag, too. But what happens if we have a chain of tags (e.g., tag "A" points to tag "B", which points to commit "C")? We'll peel down to "C" and realize that we want to include tag "A", but we do not ever consider tag "B", leading to a broken pack (assuming "B" was not otherwise selected). Instead, we have to walk the whole chain, adding any tags we find to the pack. Interestingly, it doesn't seem possible to trigger this problem with "git fetch", but you can with "git clone --single-branch". The reason is that we generate the correct pack when the client explicitly asks for "A" (because we do a real reachability analysis there), and "fetch" is more willing to do so. There are basically two cases: 1. If "C" is already a ref tip, then the client can deduce that it needs "A" itself (via find_non_local_tags), and will ask for it explicitly rather than relying on the include-tag capability. Everything works. 2. If "C" is not already a ref tip, then we hope for include-tag to send us the correct tag. But it doesn't; it generates a broken pack. However, the next step is to do a follow-up run of find_non_local_tags(), followed by fetch_refs() to backfill any tags we learned about. In the normal case, fetch_refs() calls quickfetch(), which does a connectivity check and sees we have no new objects to fetch. We just write the refs. But for the broken-pack case, the connectivity check fails, and quickfetch will follow-up with the remote, asking explicitly for each of the ref tips. This picks up the missing object in a new pack. For a regular "git clone", we are similarly OK, because we explicitly request all of the tag refs, and get a correct pack. But with "--single-branch", we kick in tag auto-following via "include-tag", but do _not_ do a follow-up backfill. We just take whatever the server sent us via include-tag and write out tag refs for any tag objects we were sent. So prior to c6807a4 (clone: open a shortcut for connectivity check, 2013-05-26), we actually claimed the clone was a success, but the result was silently corrupted! Since c6807a4, index-pack's connectivity check catches this case, and we correctly complain. The included test directly checks that pack-objects does not generate a broken pack, but also confirms that "clone --single-branch" does not hit the bug. Note that tag chains introduce another interesting question: if we are packing the tag "B" but not the commit "C", should "A" be included? Both before and after this patch, we do not include "A", because the initial peel_ref() check only knows about the bottom-most level, "C". To realize that "B" is involved at all, we would have to switch to an incremental peel, in which we examine each tagged object, asking if it is being packed (and including the outer tag if so). But that runs contrary to the optimizations in peel_ref(), which avoid accessing the objects at all, in favor of using the value we pull from packed-refs. It's OK to walk the whole chain once we know we're going to include the tag (we have to access it anyway, so the effort is proportional to the pack we're generating). But for the initial selection, we have to look at every ref. If we're only packing a few objects, we'd still have to parse every single referenced tag object just to confirm that it isn't part of a tag chain. This could be addressed if packed-refs stored the complete tag chain for each peeled ref (in most cases, this would be the same cost as now, as each "chain" is only a single link). But given the size of that project, it's out of scope for this fix (and probably nobody cares enough anyway, as it's such an obscure situation). This commit limits itself to just avoiding the creation of a broken pack. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11pack-objects: use mru list when iterating over packsJeff King1-3/+7
In the original implementation of want_object_in_pack(), we always looked for the object in every pack, so the order did not matter for performance. As of the last few patches, however, we can now often break out of the loop early after finding the first instance, and avoid looking in the other packs at all. In this case, pack order can make a big difference, because we'd like to find the objects by looking at as few packs as possible. This patch switches us to the same packed_git_mru list that is now used by normal object lookups. Here are timings for p5303 on linux.git: Test HEAD^ HEAD ------------------------------------------------------------------------ 5303.3: rev-list (1) 31.31(31.07+0.23) 31.28(31.00+0.27) -0.1% 5303.4: repack (1) 40.35(38.84+2.60) 40.53(39.31+2.32) +0.4% 5303.6: rev-list (50) 31.37(31.15+0.21) 31.41(31.16+0.24) +0.1% 5303.7: repack (50) 58.25(68.54+2.03) 47.28(57.66+1.89) -18.8% 5303.9: rev-list (1000) 31.91(31.57+0.33) 31.93(31.64+0.28) +0.1% 5303.10: repack (1000) 304.80(376.00+3.92) 87.21(159.54+2.84) -71.4% The rev-list numbers are unchanged, which makes sense (they are not exercising this code at all). The 50- and 1000-pack repack cases show considerable improvement. The single-pack repack case doesn't, of course; there's nothing to improve. In fact, it gives us a baseline for how fast we could possibly go. You can see that though rev-list can approach the single-pack case even with 1000 packs, repack doesn't. The reason is simple: the loop we are optimizing is only part of what the repack is doing. After the "counting" phase, we do delta compression, which is much more expensive when there are multiple packs, because we have fewer deltas we can reuse (you can also see that these numbers come from a multicore machine; the CPU times are much higher than the wall-clock times due to the delta phase). So the good news is that in cases with many packs, we used to be dominated by the "counting" phase, and now we are dominated by the delta compression (which is faster, and which we have already parallelized). Here are similar numbers for git.git: Test HEAD^ HEAD --------------------------------------------------------------------- 5303.3: rev-list (1) 1.55(1.51+0.02) 1.54(1.53+0.00) -0.6% 5303.4: repack (1) 1.82(1.80+0.08) 1.82(1.78+0.09) +0.0% 5303.6: rev-list (50) 1.58(1.57+0.00) 1.58(1.56+0.01) +0.0% 5303.7: repack (50) 2.50(3.12+0.07) 2.31(2.95+0.06) -7.6% 5303.9: rev-list (1000) 2.22(2.20+0.02) 2.23(2.19+0.03) +0.5% 5303.10: repack (1000) 10.47(16.78+0.22) 7.50(13.76+0.22) -28.4% Not as impressive in terms of percentage, but still measurable wins. If you look at the wall-clock time improvements in the 1000-pack case, you can see that linux improved by roughly 10x as many seconds as git. That's because it has roughly 10x as many objects, and we'd expect this improvement to scale linearly with the number of objects (since the number of packs is kept constant). It's just that the "counting" phase is a smaller percentage of the total time spent for a git.git repack, and hence the percentage win is smaller. The implementation itself is a straightforward use of the MRU code. We only bother marking a pack as used when we know that we are able to break early out of the loop, for two reasons: 1. If we can't break out early, it does no good; we have to visit each pack anyway, so we might as well avoid even the minor overhead of managing the cache order. 2. The mru_mark() function reorders the list, which would screw up our traversal. So it is only safe to mark when we are about to break out of the loop. We could record the found pack and mark it after the loop finishes, of course, but that's more complicated and it doesn't buy us anything due to (1). Note that this reordering does have a potential impact on the final pack, as we store only a single "found" pack for each object, even if it is present in multiple packs. In principle, any copy is acceptable, as they all refer to the same content. But in practice, they may differ in whether they are stored as deltas, against which base, etc. This may have an impact on delta reuse, and even the delta search (since we skip pairs that were already in the same pack). It's not clear whether this change of order would hurt or even help average cases, though. The most likely reason to have duplicate objects is from the completion of thin packs (e.g., you have some objects in a "base" pack, then receive several pushes; the packs you receive may be thin on the wire, with deltas that refer to bases outside the pack, but we complete them with duplicate base objects when indexing them). In such a case the current code would always find the thin duplicates (because we currently walk the packs in reverse chronological order). Whereas with this patch, some of those duplicates would be found in the base pack instead. In my tests repacking a real-world case of linux.git with 3600 thin-pack pushes (on top of a large "base" pack), the resulting pack was about 0.04% larger with this patch. On the other hand, because we were more likely to hit the base pack, there were more opportunities for delta reuse, and we had 50,000 fewer objects to examine in the delta search. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11pack-objects: break delta cycles before delta-search phaseJeff King1-0/+84
We do not allow cycles in the delta graph of a pack (i.e., A is a delta of B which is a delta of A) for the obvious reason that you cannot actually access any of the objects in such a case. There's a last-ditch attempt to notice cycles during the write phase, during which we issue a warning to the user and write one of the objects out in full. However, this is "last-ditch" for two reasons: 1. By this time, it's too late to find another delta for the object, so the resulting pack is larger than it otherwise could be. 2. The warning is there because this is something that _shouldn't_ ever happen. If it does, then either: a. a pack we are reusing deltas from had its own cycle b. we are reusing deltas from multiple packs, and we found a cycle among them (i.e., A is a delta of B in one pack, but B is a delta of A in another, and we choose to use both deltas). c. there is a bug in the delta-search code So this code serves as a final check that none of these things has happened, warns the user, and prevents us from writing a bogus pack. Right now, (2b) should never happen because of the static ordering of packs in want_object_in_pack(). If two objects have a delta relationship, then they must be in the same pack, and therefore we will find them from that same pack. However, a future patch would like to change that static ordering, which will make (2b) a common occurrence. In preparation, we should be able to handle those kinds of cycles better. This patch does by introducing a cycle-breaking step during the get_object_details() phase, when we are deciding which deltas can be reused. That gives us the chance to feed the objects into the delta search as if the cycle did not exist. We'll leave the detection and warning in the write_object() phase in place, as it still serves as a check for case (2c). This does mean we will stop warning for (2a). That case is caused by bogus input packs, and we ideally would warn the user about it. However, since those cycles show up after picking reusable deltas, they look the same as (2b) to us; our new code will break the cycles early and the last-ditch check will never see them. We could do analysis on any cycles that we find to distinguish the two cases (i.e., it is a bogus pack if and only if every delta in the cycle is in the same pack), but we don't need to. If there is a cycle inside a pack, we'll run into problems not only reusing the delta, but accessing the object data at all. So when we try to dig up the actual size of the object, we'll hit that same cycle and kick in our usual complain-and-try-another-source code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-08Merge branch 'jk/pack-objects-optim'Junio C Hamano1-0/+40
"git pack-objects" has a few options that tell it not to pack objects found in certain packfiles, which require it to scan .idx files of all available packs. The codepaths involved in these operations have been optimized for a common case of not having any non-local pack and/or any .kept pack. * jk/pack-objects-optim: pack-objects: compute local/ignore_pack_keep early pack-objects: break out of want_object loop early find_pack_entry: replace last_found_pack with MRU cache add generic most-recently-used list sha1_file: drop free_pack_by_name t/perf: add tests for many-pack scenarios
2016-08-08Merge branch 'nd/pack-ofs-4gb-limit' into maintJunio C Hamano1-8/+9
"git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation
2016-07-29pack-objects: compute local/ignore_pack_keep earlyJeff King1-1/+25
In want_object_in_pack(), we can exit early from our loop if neither "local" nor "ignore_pack_keep" are set. If they are, however, we must examine each pack to see if it has the object and is non-local or has a ".keep". It's quite common for there to be no non-local or .keep packs at all, in which case we know ahead of time that looking further will be pointless. We can pre-compute this by simply iterating over the list of packs ahead of time, and dropping the flags if there are no packs that could match. Another similar strategy would be to modify the loop in want_object_in_pack() to notice that we have already found the object once, and that we are looping only to check for "local" and "keep" attributes. If a pack has neither of those, we can skip the call to find_pack_entry_one(), which is the expensive part of the loop. This has two advantages: - it isn't all-or-nothing; we still get some improvement when there's a small number of kept or non-local packs, and a large number of non-kept local packs - it eliminates any possible race where we add new non-local or kept packs after our initial scan. In practice, I don't think this race matters; we already cache the packed_git information, so somebody who adds a new pack or .keep file after we've started will not be noticed at all, unless we happen to need to call reprepare_packed_git() because a lookup fails. In other words, we're already racy, and the race is not a big deal (losing the race means we might include an object in the pack that would not otherwise be, which is an acceptable outcome). However, it also has a disadvantage: we still loop over the rest of the packs for each object to check their flags. This is much less expensive than doing the object lookup, but still not free. So if we wanted to implement that strategy to cover the non-all-or-nothing cases, we could do so in addition to this one (so you get the most speedup in the all-or-nothing case, and the best we can do in the other cases). But given that the all-or-nothing case is likely the most common, it is probably not worth the trouble, and we can revisit this later if evidence points otherwise. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-29pack-objects: break out of want_object loop earlyJeff King1-0/+16
When pack-objects collects the list of objects to pack (either from stdin, or via its internal rev-list), it filters each one through want_object_in_pack(). This function loops through each existing packfile, looking for the object. When we find it, we mark the pack/offset combo for later use. However, we can't just return "yes, we want it" at that point. If --honor-pack-keep is in effect, we must keep looking to find it in _all_ packs, to make sure none of them has a .keep. Likewise, if --local is in effect, we must make sure it is not present in any non-local pack. As a result, the sum effort of these calls is effectively O(nr_objects * nr_packs). In an ordinary repository, we have only a handful of packs, and this doesn't make a big difference. But in pathological cases, it can slow the counting phase to a crawl. This patch notices the case that we have neither "--local" nor "--honor-pack-keep" in effect and breaks out of the loop early, after finding the first instance. Note that our worst case is still "objects * packs" (i.e., we might find each object in the last pack we look in), but in practice we will often break out early. On an "average" repo, my git.git with 8 packs, this shows a modest 2% (a few dozen milliseconds) improvement in the counting-objects phase of "git pack-objects --all <foo" (hackily instrumented by sticking exit(0) right after list_objects). But in a much more pathological case, it makes a bigger difference. I ran the same command on a real-world example with ~9 million objects across 1300 packs. The counting time dropped from 413s to 45s, an improvement of about 89%. Note that this patch won't do anything by itself for a normal "git gc", as it uses both --honor-pack-keep and --local. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-28Merge branch 'nd/pack-ofs-4gb-limit'Junio C Hamano1-8/+9
"git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation
2016-07-13pack-objects: do not truncate result in-pack object size on 32-bit systemsNguyễn Thái Ngọc Duy1-7/+8
A typical diff will not show what's going on and you need to see full functions. The core code is like this, at the end of of write_one() e->idx.offset = *offset; size = write_object(f, e, *offset); if (!size) { e->idx.offset = recursing; return WRITE_ONE_BREAK; } written_list[nr_written++] = &e->idx; /* make sure off_t is sufficiently large not to wrap */ if (signed_add_overflows(*offset, size)) die("pack too large for current definition of off_t"); *offset += size; Here we can see that the in-pack object size is returned by write_object (or indirectly by write_reuse_object). And it's used to calculate object offsets, which end up in the pack index file, generated at the end. If "size" overflows (on 32-bit sytems, unsigned long is 32-bit while off_t can be 64-bit), we got wrong offsets and produce incorrect .idx file, which may make it look like the .pack file is corrupted. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-12pack-objects: pass length to check_pack_crc() without truncationNguyễn Thái Ngọc Duy1-1/+1
On 32 bit systems with large file support, unsigned long is 32-bit while the two offsets in the subtraction expression (pack-objects has the exact same expression as in sha1_file.c but not shown in diff) are in 64-bit. If an in-pack object is larger than 2^32 len/datalen is truncated and we get a misleading "error: bad packed object CRC for ..." as a result. Use off_t for len and datalen. check_pack_crc() already accepts this argument as off_t and can deal with 4+ GB. Noticed-by: Christoph Michelbach <michelbach94@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-14repack: extend --keep-unreachable to loose objectsJeff King1-0/+31
If you use "repack -adk" currently, we will pack all objects that are already packed into the new pack, and then drop the old packs. However, loose unreachable objects will be left as-is. In theory these are meant to expire eventually with "git prune". But if you are using "repack -k", you probably want to keep things forever and therefore do not run "git prune" at all. Meaning those loose objects may build up over time and end up fooling any object-count heuristics (such as the one done by "gc --auto", though since git-gc does not support "repack -k", this really applies to whatever custom scripts people might have driving "repack -k"). With this patch, we instead stuff any loose unreachable objects into the pack along with the already-packed unreachable objects. This may seem wasteful, but it is really no more so than using "repack -k" in the first place. We are at a slight disadvantage, in that we have no useful ordering for the result, or names to hand to the delta code. However, this is again no worse than what "repack -k" is already doing for the packed objects. The packing of these objects doesn't matter much because they should not be accessed frequently (unless they actually _do_ become referenced, but then they would get moved to a different part of the packfile during the next repack). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-05-17Merge branch 'nd/error-errno'Junio C Hamano1-4/+2
The code for warning_errno/die_errno has been refactored and a new error_errno() reporting helper is introduced. * nd/error-errno: (41 commits) wrapper.c: use warning_errno() vcs-svn: use error_errno() upload-pack.c: use error_errno() unpack-trees.c: use error_errno() transport-helper.c: use error_errno() sha1_file.c: use {error,die,warning}_errno() server-info.c: use error_errno() sequencer.c: use error_errno() run-command.c: use error_errno() rerere.c: use error_errno() and warning_errno() reachable.c: use error_errno() mailmap.c: use error_errno() ident.c: use warning_errno() http.c: use error_errno() and warning_errno() grep.c: use error_errno() gpg-interface.c: use error_errno() fast-import.c: use error_errno() entry.c: use error_errno() editor.c: use error_errno() diff-no-index.c: use error_errno() ...
2016-05-10Merge branch 'ew/doc-split-pack-disables-bitmap'Junio C Hamano1-1/+8
Doc update. * ew/doc-split-pack-disables-bitmap: pack-objects: warn on split packs disabling bitmaps