aboutsummaryrefslogtreecommitdiffstats
path: root/midx.c
AgeCommit message (Collapse)AuthorFilesLines
2025-10-30packfile: always add packfiles to MRU when adding a packPatrick Steinhardt1-2/+0
When preparing the packfile store we know to also prepare the MRU list of packfiles with all packs that are currently loaded in the store via `packfile_store_prepare_mru()`. So we know that the list of packs in the MRU list should match the list of packs in the non-MRU list. But there are some direct or indirect callsites that add a packfile to the store via `packfile_store_add_pack()` without adding the pack to the MRU. And while functions that access the MRU (e.g. `find_pack_entry()`) know to call `packfile_store_prepare()`, which knows to prepare the MRU via `packfile_store_prepare_mru()`, that operation will be turned into a no-op because the packfile store is already prepared. So this will not cause us to add the packfile to the MRU, and consequently we won't be able to find the packfile in our MRU list. There are only a handful of callers outside of "packfile.c" that add a packfile to the store: - "builtin/fast-import.c" adds multiple packs of imported objects, but it knows to look up objects via `packfile_store_get_packs()`. This function does not use the MRU, so we're good. - "builtin/index-pack.c" adds the indexed pack to the store in case it needs to perform consistency checks on its objects. - "http.c" adds the fetched pack to the store so that we can access its objects. In all of these cases we actually want to access the contained objects. And luckily, reading these objects works as expected: 1. We eventually end up in `do_oid_object_info_extended()`. 2. Calling `find_pack_entry()` fails because the MRU list doesn't contain the newly added packfile. 3. The callers don't pass `OBJECT_INFO_QUICK`, so we end up repreparing the object database. This will also cause us to reprepare the MRU list. 4. We now retry reading the object via `find_pack_entry()`, and now we succeed because the MRU list got populated. This logic feels quite fragile: we intentionally add the packfile to the store, but we then ultimately rely on repreparing the entire store only to make the packfile accessible. While we do the correct thing in `do_oid_object_info_extended()`, other sites that access the MRU may not know to reprepare. But besides being fragile it's also a waste of resources: repreparing the object database requires us to re-read the alternates file and discard any caches. Refactor the code so that we unconditionally add packfiles to the MRU when adding them to a packfile store. This makes the logic less fragile and ensures that we don't have to reprepare the store to make the pack accessible. Note that this does not allow us to drop `packfile_store_prepare_mru()` just yet: while the MRU list is already populated with all packs now, the order in which we add these packs is indeterministic for most of the part. So by first calling `sort_pack()` on the other packfile list and then re-preparing the MRU list we inherit its sorting. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-10-30packfile: move the MRU list into the packfile storePatrick Steinhardt1-1/+1
Packfiles have two lists associated to them: - A list that keeps track of packfiles in the order that they were added to a packfile store. - A list that keeps track of packfiles in most-recently-used order so that packfiles that are more likely to contain a specific object are ordered towards the front. Both of these lists are hosted by `struct packed_git` itself, So to identify all packfiles in a repository you simply need to grab the first packfile and then iterate the `->next` pointers or the MRU list. This pattern has the problem that all packfiles are part of the same list, regardless of whether or not they belong to the same object source. With the upcoming pluggable object database effort this needs to change: packfiles should be contained by a single object source, and reading an object from any such packfile should use that source to look up the object. Consequently, we need to break up the global lists of packfiles into per-object-source lists. A first step towards this goal is to move those lists out of `struct packed_git` and into the packfile store. While the packfile store is currently sitting on the `struct object_database` level, the intent is to push it down one level into the `struct odb_source` in a subsequent patch series. Introduce a new `struct packfile_list` that is used to manage lists of packfiles and use it to store the list of most-recently-used packfiles in `struct packfile_store`. For now, the new list type is only used in a single spot, but we'll expand its usage in subsequent patches. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-24packfile: move `get_multi_pack_index()` into "midx.c"Patrick Steinhardt1-0/+6
The `get_multi_pack_index()` function is declared and implemented in the packfile subsystem, even though it really belongs into the multi-pack index subsystem. The reason for this is likely that it needs to call `packfile_store_prepare()`, which is not exposed by the packfile system. In a subsequent commit we're about to add another caller outside of the packfile system though, so we'll have to expose the function anyway. Do so now already and move `get_multi_pack_index()` into the MIDX subsystem. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-24packfile: introduce function to load and add packfilesPatrick Steinhardt1-19/+4
We have a recurring pattern where we essentially perform an upsert of a packfile in case it isn't yet known by the packfile store. The logic to do so is non-trivial as we have to reconstruct the packfile's key, check the map of packfiles, then create the new packfile and finally add it to the store. Introduce a new function that does this dance for us. Refactor callsites to use it. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-24packfile: refactor `install_packed_git()` to work on packfile storePatrick Steinhardt1-1/+1
The `install_packed_git()` functions adds a packfile to a specific object store. Refactor it to accept a packfile store instead of a repository to clarify its scope. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-24odb: move MRU list of packfiles into `struct packfile_store`Patrick Steinhardt1-1/+1
The object database tracks the list of packfiles in most-recently-used order, which is mostly used to favor reading from packfiles that contain most of the objects that we're currently accessing. With the introduction of the `struct packfile_store` we have a better place to host this list though. Move the list accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-24odb: move packfile map into `struct packfile_store`Patrick Steinhardt1-1/+1
The object database tracks a map of packfiles by their respective paths, which is used to figure out whether a given packfile has already been loaded. With the introduction of the `struct packfile_store` we have a better place to host this list though. Move the map accordingly. `pack_map_entry_cmp()` isn't used anywhere but in "packfile.c" anymore after this change, so we convert it to a static function, as well. Note that we also drop the `inline` hint: the function is used as a callback function exclusively, and callbacks cannot be inlined. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11midx: compute paths via their sourcePatrick Steinhardt1-29/+25
With the preceding commits we started to always have the object database source available when we load, write or access multi-pack indices. With this in place we can change how MIDX paths are computed so that we don't have to pass in the combination of a hash algorithm and object directory anymore, but only the object database source. Refactor the code accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11midx: stop duplicating info redundant with its owning sourcePatrick Steinhardt1-10/+11
Multi-pack indices store some information that is redundant with their owning source: - The locality bit that tracks whether the source is the primary object source or an alternate. - The object directory path the multi-pack index is located in. - The pointer to the owning parent directory. All of this information is already contained in `struct odb_source`. So now that we always have that struct available when loading a multi-pack index we have it readily accessible. Drop the redundant information and instead store a pointer to the object source. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11midx: load multi-pack indices via their sourcePatrick Steinhardt1-32/+25
To load a multi-pack index the caller is expected to pass both the repository and the object directory where the multi-pack index is located. While this works, this layout has a couple of downsides: - We need to pass in information reduntant with the owning source, namely its object directory and whether the source is local or not. - We don't have access to the source when loading the multi-pack index. If we had that access, we could store a pointer to the owning source in the MIDX and thus deduplicate some information. - Multi-pack indices are inherently specific to the object source and its format. With the goal of pluggable object backends in mind we will eventually want the backends to own the logic of reading and writing multi-pack indices. Making the logic work on top of object sources is a step into that direction. Refactor loading of multi-pack indices accordingly. This surfaces one small problem though: git-multi-pack-index(1) and our MIDX test helper both know to read and write multi-pack-indices located in a different object directory. This issue is addressed by adding the user-provided object directory as an in-memory alternate. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11midx: drop redundant `struct repository` parameterPatrick Steinhardt1-9/+9
There are a couple of functions that take both a `struct repository` and a `struct multi_pack_index`. This provides redundant information though without much benefit given that the multi-pack index already has a pointer to its owning repository. Drop the `struct repository` parameter from such functions. While at it, reorder the list of parameters of `fill_midx_entry()` so that the MIDX comes first to better align with our coding guidelines. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11odb: store locality in object database sourcesPatrick Steinhardt1-2/+3
Object database sources are classified either as: - Local, which means that the source is the repository's primary source. This is typically ".git/objects". - Non-local, which is everything else. Most importantly this includes alternates and quarantine directories. This locality is often computed ad-hoc by checking whether a given object source is the first one. This works, but it is quite roundabout. Refactor the code so that we store locality when creating the sources in the first place. This makes it both more accessible and robust. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-15midx: remove now-unused linked list of multi-pack indicesPatrick Steinhardt1-16/+2
In the preceding commits we have migrated all users of the linked list of multi-pack indices to instead use those stored in the object database sources. Remove those now-unused pointers. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-15midx: stop using linked list when closing MIDXPatrick Steinhardt1-5/+8
When calling `close_midx()` we not only close the multi-pack index for one object source, but instead we iterate through the whole linked list of MIDXs to close all of them. This linked list is about to go away in favor of using the new per-source pointer to its respective MIDX. Refactor the function to iterate through sources instead. Note that after this patch, there's a couple of callsites left that continue to use `close_midx()` without iterating through all sources. These are all cases where we don't care about the MIDX from other sources though, so it's fine to keep them as-is. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-15midx: start tracking per object database sourcePatrick Steinhardt1-8/+11
Multi-pack indices are tracked via `struct multi_pack_index`. This data structure is stored as a linked list inside `struct object_database`, which is the global database that spans across all of the object sources. This layout causes two problems: - Object databases consist of multiple object sources (e.g. one source per alternate object directory), where each multi-pack index is specific to one of those sources. Regardless of that though, the MIDX is not tracked per source, but tracked globally for the whole object database. This creates a mismatch between the on-disk layout and how things are organized in the object database subsystems and makes some parts, like figuring out whether a source has an MIDX, quite awkward. - Multi-pack indices are an implementation detail of how efficient access for packfiles work. As such, they are neither relevant in the context of loose objects, nor in a potential future where we have pluggable backends. Refactor `prepare_multi_pack_index_one()` so that it works on a specific source, which allows us to easily store a pointer to the multi-pack index inside of it. For now, this pointer exists next to the existing linked list we have in the object database. Users will be adjusted in subsequent patches to instead use the per-source pointers. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01object-store: rename `object_directory` to `odb_source`Patrick Steinhardt1-3/+3
The `object_directory` structure is used as an access point for a single object directory like ".git/objects". While the structure isn't yet fully self-contained, the intent is for it to eventually contain all information required to access objects in one specific location. While the name "object directory" is a good fit for now, this will change over time as we continue with the agenda to make pluggable object databases a thing. Eventually, objects may not be accessed via any kind of directory at all anymore, but they could instead be backed by any kind of durable storage mechanism. While it seems quite far-fetched for now, it is thinkable that eventually this might even be some form of a database, for example. As such, the current name of this structure will become worse over time as we evolve into the direction of pluggable ODBs. Immediate next steps will start to carve out proper self-contained object directories, which requires us to pass in these object directories as parameters. Based on our modern naming schema this means that those functions should then be named after their subsystem, which means that we would start to bake the current name into the codebase more and more. Let's preempt this by renaming the structure. There have been a couple alternatives that were discussed: - `odb_backend` was discarded because it led to the association that one object database has a single backend, but the model is that one alternate has one backend. Furthermore, "backend" is more about the actual backing implementation and less about the high-level concept. - `odb_alternate` was discarded because it is a bit of a stretch to also call the main object directory an "alternate". Instead, pick `odb_source` as the new name. It makes it sufficiently clear that there can be multiple sources and does not cause confusion when mixed with the already-existing "alternate" terminology. In the future, this change allows us to easily introduce for example a `odb_files_source` and other format-specific implementations. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-28midx: stop repeatedly looking up nonexistent packfilesPatrick Steinhardt1-2/+10
The multi-pack index acts as a cache across a set of packfiles so that we can quickly look up which of those packfiles contains a given object. As such, the multi-pack index naturally needs to be updated every time one of the packfiles goes away, or otherwise the multi-pack index has grown stale. A stale multi-pack index should be handled gracefully by Git though, and in fact it is: if the indexed pack cannot be found we simply ignore it and eventually we fall back to doing the object lookup by just iterating through all packs, even if those aren't indexed. But while this fallback works, it has one significant downside: we don't cache the fact that a pack has vanished. This leads to us repeatedly trying to look up the same pack only to realize that it (still) doesn't exist. This issue can be easily demonstrated by creating a repository with a stale multi-pack index and a couple of objects. We do so by creating a repository with two packfiles, both of which are indexed by the multi-pack index, and then repack those two packfiles. Note that we have to move the multi-pack-index before doing the final repack, as Git knows to delete it otherwise. $ git init repo $ cd repo/ $ git config set maintenance.auto false $ for i in $(seq 1000); do printf "%d-original" $i >file-$i; done $ git add . $ git commit -moriginal $ git repack -dl $ for i in $(seq 1000); do printf "%d-modified" $i >file-$i; done $ git commit -a -mmodified $ git repack -dl $ git multi-pack-index write $ mv .git/objects/pack/multi-pack-index . $ git repack -Adl $ mv multi-pack-index .git/objects/pack/ Commands that cause a lot of objects lookups will now repeatedly invoke `add_packed_git()`, which leads to three failed access(3p) calls as well as one failed stat(3p) call. The following strace for example is done for `git log --patch` in the above repository: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 74.67 0.024693 1 18038 18031 access 25.33 0.008378 1 6045 6017 newfstatat ------ ----------- ----------- --------- --------- ---------------- 100.00 0.033071 1 24083 24048 total Fix the issue by introducing a negative lookup cache for indexed packs. This cache works by simply storing an invalid pointer for a missing pack when `prepare_midx_pack()` fails to look up the pack. Most users of the `packs` array don't need to be adjusted, either, as they all know to call `prepare_midx_pack()` before accessing the array. With this change in place we can now see a significantly reduced number of syscalls: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 73.58 0.000323 5 60 28 newfstatat 26.42 0.000116 5 23 16 access ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000439 5 83 44 total Furthermore, this change also results in a speedup: Benchmark 1: git log --patch (revision = HEAD~) Time (mean ± σ): 50.4 ms ± 2.5 ms [User: 22.0 ms, System: 24.4 ms] Range (min … max): 45.4 ms … 54.9 ms 53 runs Benchmark 2: git log --patch (revision = HEAD) Time (mean ± σ): 12.7 ms ± 0.4 ms [User: 11.1 ms, System: 1.6 ms] Range (min … max): 12.4 ms … 15.0 ms 191 runs Summary git log --patch (revision = HEAD) ran 3.96 ± 0.22 times faster than git log --patch (revision = HEAD~) In the end, it should in theory never be necessary to have this negative lookup cache given that we know to update the multi-pack index together with repacks. But as the change is quite contained and as the speedup can be significant as demonstrated above, it does feel sensible to have the negative lookup cache regardless. Based-on-patch-by: Jeff King <peff@peff.net> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `git_open_cloexec()` to "compat/open.c"Patrick Steinhardt1-1/+0
The `git_open_cloexec()` wrapper function provides the ability to open a file with `O_CLOEXEC` in a platform-agnostic way. This function is provided by "object-file.c" even though it is not specific to the object subsystem at all. Move the file into "compat/open.c". This file already exists before this commit, but has only been compiled conditionally depending on whether or not open(3p) may return EINTR. With this change we now unconditionally compile the object, but wrap `git_open_with_retry()` in an ifdef. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10csum-file: stop depending on `the_repository`Patrick Steinhardt1-1/+2
There are multiple sites in "csum-file.c" where we use the global `the_repository` variable, either explicitly or implicitly by using `the_hash_algo`. Refactor the code to stop using `the_repository` by adapting functions to receive required data as parameters. Adapt callsites accordingly by either using `the_repository->hash_algo`, or by using a context-provided hash algorithm in case the subsystem already got rid of its dependency on `the_repository`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18progress: stop using `the_repository`Patrick Steinhardt1-4/+9
Stop using `the_repository` in the "progress" subsystem by passing in a repository when initializing `struct progress`. Furthermore, store a pointer to the repository in that struct so that we can pass it to the trace2 API when logging information. Adjust callers accordingly by using `the_repository`. While there may be some callers that have a repository available in their context, this trivial conversion allows for easier verification and bubbles up the use of `the_repository` by one level. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18Merge branch 'ps/build-sign-compare' into ps/the-repositoryJunio C Hamano1-0/+2
* ps/build-sign-compare: t/helper: don't depend on implicit wraparound scalar: address -Wsign-compare warnings builtin/patch-id: fix type of `get_one_patchid()` builtin/blame: fix type of `length` variable when emitting object ID gpg-interface: address -Wsign-comparison warnings daemon: fix type of `max_connections` daemon: fix loops that have mismatching integer types global: trivial conversions to fix `-Wsign-compare` warnings pkt-line: fix -Wsign-compare warning on 32 bit platform csum-file: fix -Wsign-compare warning on 32-bit platform diff.h: fix index used to loop through unsigned integer config.mak.dev: drop `-Wno-sign-compare` global: mark code units that generate warnings with `-Wsign-compare` compat/win32: fix -Wsign-compare warning in "wWinMain()" compat/regex: explicitly ignore "-Wsign-compare" warnings git-compat-util: introduce macros to disable "-Wsign-compare" warnings
2024-12-06global: mark code units that generate warnings with `-Wsign-compare`Patrick Steinhardt1-0/+1
Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04midx: inline the `MIDX_MIN_SIZE` definitionKarthik Nayak1-5/+1
The `MIDX_MIN_SIZE` definition is used to check the midx_size in `local_multi_pack_index_one`. This definition relies on the `the_hash_algo` global variable. Inline this and remove the global variable usage. With this, remove `USE_THE_REPOSITORY_VARIABLE` usage from `midx.c`. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04midx: pass down `hash_algo` to functions using global variablesKarthik Nayak1-11/+15
The functions `get_split_midx_filename_ext()`, `get_midx_filename()` and `get_midx_filename_ext()` use `hash_to_hex()` which internally uses the `the_hash_algo` global variable. Remove this dependency on global variables by passing down the `hash_algo` through to the functions mentioned and instead calling `hash_to_hex_algop()` along with the obtained `hash_algo`. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04midx: pass `repository` to `load_multi_pack_index`Karthik Nayak1-5/+6
The `load_multi_pack_index` function in midx uses `the_repository` variable to access the `repository` struct. Modify the function and its callee's to send the `repository` field. This moves usage of `the_repository` to the `test-read-midx.c` file. While that is not optimal, it is okay, since the upcoming commits will slowly move the usage of `the_repository` up the layers and remove it eventually. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04midx: cleanup internal usage of `the_repository` and `the_hash_algo`Karthik Nayak1-22/+27
In the `midx.c` file, there are multiple usages of `the_repository` and `the_hash_algo` within static functions of the file. Some of the usages can be simply swapped out with the available `repository` struct. While some of them can be swapped out by passing the repository to the required functions. This leaves out only some other usages of `the_repository` and `the_hash_algo` in the file in non-static functions, which we'll tackle in upcoming commits. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04midx: add repository to `multi_pack_index` structKarthik Nayak1-0/+1
The `multi_pack_index` struct represents the MIDX for a repository. Here, we add a pointer to the repository in this struct, allowing direct use of the repository variable without relying on the global `the_repository` struct. With this addition, we can determine the repository associated with a `bitmap_index` struct. A `bitmap_index` points to either a `packed_git` or a `multi_pack_index`, both of which have direct repository references. To support this, we introduce a static helper function, `bitmap_repo`, in `pack-bitmap.c`, which retrieves a repository given a `bitmap_index`. With this, we clear up all usages of `the_repository` within `pack-bitmap.c` and also remove the `USE_THE_REPOSITORY_VARIABLE` definition. Bringing us another step closer to remove all global variable usage. Although this change also opens up the potential to clean up `midx.c`, doing so would require additional refactoring to pass the repository struct to functions where the MIDX struct is created: a task better suited for future patches. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04packfile: add repository to struct `packed_git`Karthik Nayak1-1/+1
The struct `packed_git` holds information regarding a packed object file. Let's add the repository variable to this object, to represent the repository that this packfile belongs to. This helps remove dependency on the global `the_repository` object in `packfile.c` by simply using repository information now readily available in the struct. We do need to consider that a packfile could be part of the alternates of a repository, but considering that we only have one repository struct and also that we currently anyways use 'the_repository', we should be OK with this change. We also modify `alloc_packed_git` to ensure that the repository is added to newly created `packed_git` structs. This requires modifying the function and all its callee to pass the repository object down the levels. Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-25packfile: use object_id in find_pack_entry_one()Jeff King1-1/+1
The main function we use to search a pack index for an object is find_pack_entry_one(). That function still takes a bare pointer to the hash, despite the fact that its underlying bsearch_pack() function needs an object_id struct. And so we end up making an extra copy of the hash into the struct just to do a lookup. As it turns out, all callers but one already have such an object_id. So we can just take a pointer to that struct and use it directly. This avoids the extra copy and provides a more type-safe interface. The one exception is get_delta_base() in packfile.c, when we are chasing a REF_DELTA from inside the pack (and thus we have a pointer directly to the mmap'd pack memory, not a struct). We can just bump the hashcpy() from inside find_pack_entry_one() to this one caller that needs it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25midx: avoid duplicate packed_git entriesJeff King1-3/+17
When we scan a pack directory to load the idx entries we find into the packed_git list, we skip any of them that are contained in a midx. We then load them later lazily if we actually need to access the corresponding pack, referencing them both from the midx struct and the packed_git list. The lazy-load in the midx code checks to see if the midx already mentions the pack, but doesn't otherwise check the packed_git list. This makes sense, since we should have added any pack to both lists. But there's a loophole! If we call close_object_store(), that frees the midx entirely, but _not_ the packed_git structs, which we must keep around for Reasons[1]. If we then try to look up more objects, we'll auto-load the midx again, which won't realize that we've already loaded those packs, and will create duplicate entries in the packed_git list. This is possibly inefficient, because it means we may open and map the pack redundantly. But it can also lead to weird user-visible behavior. The case I found is in "git repack", which closes and reopens the midx after repacking and then calls update_server_info(). We end up writing the duplicate entries into objects/info/packs. We could obviously de-dup them while writing that file, but it seems like a violation of more core assumptions that we end up with these duplicate structs at all. We can avoid the duplicates reasonably efficiently by checking their names in the pack_map hash. This annoyingly does require a little more than a straight hash lookup due to the naming conventions, but it should only happen when we are about to actually open a pack. I don't think one extra malloc will be noticeable there. [1] I'm not entirely sure of all the details, except that we generally assume the packed_git structs never go away. We noted this restriction in the comment added by 6f1e9394e2 (object: fix leaking packfiles when closing object store, 2024-08-08), but it's somewhat vague. At any rate, if you try freeing the structs in close_object_store(), you can observe segfaults all over the test suite. So it might be fixable, but it's not trivial. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-08-27pack-bitmap: tag bitmapped packs with their corresponding MIDXTaylor Blau1-0/+1
The next commit will need to use the bitmap's MIDX (if one exists) to translate bit positions into pack-relative positions in the source pack. Ordinarily, we'd use the "midx" field of the bitmap_index struct. But since that struct is defined within pack-bitmap.c, and our caller is in a separate compilation unit, we do not have access to the MIDX field. Instead, add a "from_midx" field to the bitmapped_pack structure so that we can use that piece of data from outside of pack-bitmap.c. The caller that uses this new piece of information will be added in the following commit. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-13midx: drop unused parameters from add_midx_to_chain()Jeff King1-9/+5
When loading a chained midx, we build up an array of hashes, one per layer of the chain. But since the chain is also represented by the linked list of multi_pack_index structs, nobody actually reads this array. We pass it to add_midx_to_chain(), but the parameters are completely ignored. So we can drop those unused parameters. And then we can see that its sole caller, load_midx_chain_fd_st(), only cares about one layer hash at a time (for parsing each line and feeding it to the single-layer midx code). So we can replace the array with a single object_id on the stack. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: implement support for writing incremental MIDX chainsTaylor Blau1-11/+51
Now that the rest of the MIDX subsystem and relevant callers have been updated to learn about how to read and process incremental MIDX chains, let's finally update the implementation in `write_midx_internal()` to be able to write incremental MIDX chains. This new feature is available behind the `--incremental` option for the `multi-pack-index` builtin, like so: $ git multi-pack-index write --incremental The implementation for doing so is relatively straightforward, and boils down to a handful of different kinds of changes implemented in this patch: - The `compute_sorted_entries()` function is taught to reject objects which appear in any existing MIDX layer. - Functions like `write_midx_revindex()` are adjusted to write pack_order values which are offset by the number of objects in the base MIDX layer. - The end of `write_midx_internal()` is adjusted to move non-incremental MIDX files when necessary (i.e. when creating an incremental chain with an existing non-incremental MIDX in the repository). There are a handful of other changes that are introduced, like new functions to clear incremental MIDX files that are unrelated to the current chain (using the same "keep_hash" mechanism as in the non-incremental case). The tests explicitly exercising the new incremental MIDX feature are relatively limited for two reasons: 1. Most of the "interesting" behavior is already thoroughly covered in t5319-multi-pack-index.sh, which handles the core logic of reading objects through a MIDX. The new tests in t5334-incremental-multi-pack-index.sh are mostly focused on creating and destroying incremental MIDXs, as well as stitching their results together across layers. 2. A new GIT_TEST environment variable is added called "GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL", which modifies the entire test suite to write incremental MIDXs after repacking when combined with the "GIT_TEST_MULTI_PACK_INDEX" variable. This exercises the long tail of other interesting behavior that is defined implicitly throughout the rest of the CI suite. It is likewise added to the linux-TEST-vars job. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: implement verification support for incremental MIDXsTaylor Blau1-17/+30
Teach the verification implementation used by `git multi-pack-index verify` to perform verification for incremental MIDX chains by independently validating each layer within the chain. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: support reading incremental MIDX chainsTaylor Blau1-10/+174
Now that the MIDX machinery's internals have been taught to understand incremental MIDXs over the previous handful of commits, the MIDX machinery itself can begin reading incremental MIDXs. (Note that while the on-disk format for incremental MIDXs has been defined, the writing end has not been implemented. This will take place in the commit after next.) The core of this change involves following the order specified in the MIDX chain in reverse and opening up MIDXs in the chain one-by-one, adding them to the previous layer's `->base_midx` pointer at each step. In order to implement this, the `load_multi_pack_index()` function is taught to call a new `load_multi_pack_index_chain()` function if loading a non-incremental MIDX failed via `load_multi_pack_index_one()`. When loading a MIDX chain, `load_midx_chain_fd_st()` reads each line in the file one-by-one and dispatches calls to `load_multi_pack_index_one()` to read each layer of the MIDX chain. When a layer was successfully read, it is added to the MIDX chain by calling `add_midx_to_chain()` which validates the contents of the `BASE` chunk, performs some bounds checks on the number of combined packs and objects, and attaches the new MIDX by assigning its `base_midx` pointer to the existing part of the chain. As a supplement to this, introduce a new mode in the test-read-midx test-tool which allows us to read the information for a specific MIDX in the chain by specifying its trailing checksum via the command-line arguments like so: $ test-tool read-midx .git/objects [checksum] Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `midx_preferred_pack()` about incremental MIDXsTaylor Blau1-2/+5
The function `midx_preferred_pack()` is used to determine the identity of the preferred pack, which is the identity of a unique pack within the MIDX which is used as a tie-breaker when selecting from which pack to represent an object that appears in multiple packs within the MIDX. Historically we have said that the MIDX's preferred pack has the unique property that all objects from that pack are represented in the MIDX. But that isn't quite true: a more precise statement would be that all objects from that pack *which appear in the MIDX* are selected from that pack. This helps us extend the concept of preferred packs across a MIDX chain, where some object(s) in the preferred pack may appear in other packs in an earlier MIDX layer, in which case those object(s) will not appear in a subsequent MIDX layer from either the preferred pack or any other pack. Extend the concept of preferred packs by using the pack which represents the object at the first position in MIDX pseudo-pack order belonging to the current MIDX layer (i.e., at position 'm->num_objects_in_base'). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `midx_contains_pack()` about incremental MIDXsTaylor Blau1-1/+10
Now that the `midx_contains_pack()` versus `midx_locate_pack()` debacle has been cleaned up, teach the former about how to operate in an incremental MIDX-aware world in a similar fashion as in previous commits. Instead of using either of the two `midx_for_object()` or `midx_for_pack()` helpers, this function is split into two: one that determines whether a pack is contained in a single MIDX, and another which calls the former in a loop over all MIDXs. This approach does not require that we change any of the implementation in what is now `midx_contains_pack_1()` as it still operates over a single MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: remove unused `midx_locate_pack()`Taylor Blau1-11/+2
Commit 307d75bbe6 (midx: implement `midx_locate_pack()`, 2023-12-14) introduced `midx_locate_pack()`, which was described at the time as a complement to the function `midx_contains_pack()` which allowed callers to determine where in the MIDX lexical order a pack appeared, as opposed to whether or not it was simply contained. 307d75bbe6 suggests that future patches would be added which would introduce callers for this new function, but none ever were, meaning the function has gone unused since its introduction. Clean this up by in effect reverting 307d75bbe6, which removes the unused functions and inlines its definition back into `midx_contains_pack()`. (Looking back through the list archives when 307d75bbe6 was written, this was in preparation for this[1] patch from back when we had the concept of "disjoint" packs while developing multi-pack verbatim reuse. That concept was abandoned before the series was merged, but I never dropped what would become 307d75bbe6 from the series, leading to the state prior to this commit). [1]: https://lore.kernel.org/git/3019738b52ba8cd78ea696a3b800fa91e722eb66.1701198172.git.me@ttaylorr.com/ Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `fill_midx_entry()` about incremental MIDXsTaylor Blau1-4/+2
In a similar fashion as previous commits, teach the `fill_midx_entry()` function to work in a incremental MIDX-aware fashion. This function, unlike others which accept an index into either the lexical order of objects or packs, takes in an object_id, and attempts to fill a caller-provided 'struct pack_entry' with the remaining pieces of information about that object from the MIDX. The function uses `bsearch_midx()` which fills out the frame-local 'pos' variable, recording the given object_id's lexical position within the MIDX chain, if found (if no matching object ID was found, we'll return immediately without filling out the `pack_entry` structure). Once given that position, we jump back through the `->base_midx` pointer to ensure that our `m` points at the MIDX layer which contains the given object_id (and not an ancestor or descendant of it in the chain). Note that we can drop the bounds check "if (pos >= m->num_objects)" because `midx_for_object()` performs this check for us. After that point, we only need to make two special considerations within this function: - First, the pack_int_id returned to us by `nth_midxed_pack_int_id()` is a position in the concatenated lexical order of packs, so we must ensure that we subtract `m->num_packs_in_base` before accessing the MIDX-local `packs` array. - Second, we must avoid translating the `pos` back to a MIDX-local index, since we use it as an argument to `nth_midxed_offset()` which expects a position relative to the concatenated lexical order of objects. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `nth_midxed_offset()` about incremental MIDXsTaylor Blau1-0/+2
In a similar fashion as in previous commits, teach the function `nth_midxed_offset()` about incremental MIDXs. The given object `pos` is used to find the containing MIDX, and translated back into a MIDX-local position by assigning the return value of `midx_for_object()` to it. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `bsearch_midx()` about incremental MIDXsTaylor Blau1-1/+4
Now that the special cases callers of `bsearch_midx()` have been dealt with, teach `bsearch_midx()` to handle incremental MIDX chains. The incremental MIDX-aware version of `bsearch_midx()` works by repeatedly searching for a given OID in each layer along the `->base_midx` pointer, stopping either when an exact match is found, or the end of the chain is reached. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: introduce `bsearch_one_midx()`Taylor Blau1-3/+14
The `bsearch_midx()` function will be extended in a following commit to search for the location of a given object ID across all MIDXs in a chain (or the single non-chain MIDX if no chain is available). While most callers will naturally want to use the updated `bsearch_midx()` function, there are a handful of special cases that will want finer control and will only want to search through a single MIDX. For instance, the object abbreviation code, which cares about object IDs near to where we'd expect to find a match in a MIDX. In that case, we want to look at the nearby matches in each layer of the MIDX chain, not just a single one). Split the more fine-grained control out into a separate function called `bsearch_one_midx()` which searches only a single MIDX. At present both `bsearch_midx()` and `bsearch_one_midx()` have identical behavior, but the following commit will rewrite the former to be aware of incremental MIDXs for the remaining non-special case callers. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `nth_bitmapped_pack()` about incremental MIDXsTaylor Blau1-3/+5
In a similar fashion as in previous commits, teach the function `nth_bitmapped_pack()` about incremental MIDXs by translating the given `pack_int_id` from the concatenated lexical order to a MIDX-local lexical position. When accessing the containing MIDX's array of packs, use the local pack ID. Likewise, when reading the 'BTMP' chunk, use the MIDX-local offset when accessing the data within that chunk. (Note that the both the call to prepare_midx_pack() and the assignment of bp->pack_int_id both care about the global pack_int_id, so avoid shadowing the given 'pack_int_id' parameter). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `nth_midxed_object_oid()` about incremental MIDXsTaylor Blau1-1/+3
The function `nth_midxed_object_oid()` returns the object ID for a given object position in the MIDX lexicographic order. Teach this function to instead operate over the concatenated lexicographic order defined in an earlier step so that it is able to be used with incremental MIDXs. To do this, we need to both (a) adjust the bounds check for the given 'n', as well as record the MIDX-local position after chasing the `->base_midx` pointer to find the MIDX which contains that object. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `prepare_midx_pack()` about incremental MIDXsTaylor Blau1-4/+22
The function `prepare_midx_pack()` is part of the midx.h API and loads the pack identified by the MIDX-local 'pack_int_id'. This patch prepares that function to be aware of an incremental MIDX world. To do this, introduce the second of the two general purpose helpers mentioned in the previous commit. This commit introduces `midx_for_pack()`, which is the pack-specific analog of `midx_for_object()`, and works in the same fashion. Like `midx_for_object()`, this function chases down the '->base_midx' field until it finds the MIDX layer within the chain that contains the given pack. Use this function within `prepare_midx_pack()` so that the `pack_int_id` it expects is now relative to the entire MIDX chain, and that it prepares the given pack in the appropriate MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06midx: teach `nth_midxed_pack_int_id()` about incremental MIDXsTaylor Blau1-2/+21
The function `nth_midxed_pack_int_id()` takes in a object position in MIDX lexicographic order and returns an identifier of the pack from which that object was selected in the MIDX. Currently, the given object position is an index into the lexicographic order of objects in a single MIDX. Change this position to instead refer into the concatenated lexicographic order of all MIDXs in a MIDX chain. This has two visible effects within the implementation of `prepare_midx_pack()`: - First, the given position is now an index into the concatenated lexicographic order of all MIDXs in the order in which they appear in the MIDX chain. - Second the pack ID returned from this function is now also in the concatenated order of packs among all layers of the MIDX chain in the same order that they appear in the MIDX chain. To do this, introduce the first of two general purpose helpers, this one being `midx_for_object()`. `midx_for_object()` takes a double pointer to a `struct multi_pack_index` as well as an object `pos` in terms of the entire MIDX chain[^1]. The function chases down the '->base_midx' field until it finds the MIDX layer within the chain that contains the given object. It then: - modifies the double pointer to point to the containing MIDX, instead of the tip of the chain, and - returns the MIDX-local position[^2] at which the given object can be found. Use this function within `nth_midxed_pack_int_id()` so that the `pos` it expects is now relative to the entire MIDX chain, and that it returns the appropriate pack position for that object. [^1]: As a reminder, this means that the object is identified among the objects contained in all layers of the incremental MIDX chain, not any particular layer. For example, consider MIDX chain with two individual MIDXs, one with 4 objects and another with 3 objects. If the MIDX with 4 objects appears earlier in the chain, then asking for object 6 would return the second object in the MIDX with 3 objects. [^2]: Building on the previous example, asking for object 6 in a MIDX chain with (4, 3) objects, respectively, this would set the double pointer to point at the MIDX containing three objects, and would return an index to the second object within that MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14global: introduce `USE_THE_REPOSITORY_VARIABLE` macroPatrick Steinhardt1-0/+2
Use of the `the_repository` variable is deprecated nowadays, and we slowly but steadily convert the codebase to not use it anymore. Instead, callers should be passing down the repository to work on via parameters. It is hard though to prove that a given code unit does not use this variable anymore. The most trivial case, merely demonstrating that there is no direct use of `the_repository`, is already a bit of a pain during code reviews as the reviewer needs to manually verify claims made by the patch author. The bigger problem though is that we have many interfaces that implicitly rely on `the_repository`. Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code units to opt into usage of `the_repository`. The intent of this macro is to demonstrate that a certain code unit does not use this variable anymore, and to keep it from new dependencies on it in future changes, be it explicit or implicit For now, the macro only guards `the_repository` itself as well as `the_hash_algo`. There are many more known interfaces where we have an implicit dependency on `the_repository`, but those are not guarded at the current point in time. Over time though, we should start to add guards as required (or even better, just remove them). Define the macro as required in our code units. As expected, most of our code still relies on the global variable. Nearly all of our builtins rely on the variable as there is no way yet to pass `the_repository` to their entry point. For now, declare the macro in "biultin.h" to keep the required changes at least a little bit more contained. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14hash: require hash algorithm in `oidread()` and `oidclr()`Patrick Steinhardt1-1/+2
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash function that shall be used. Require callers to pass in the hash algorithm to get rid of this implicit dependency. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-30midx: replace `get_midx_rev_filename()` with a generic helperTaylor Blau1-4/+6
Commit f894081deae (pack-revindex: read multi-pack reverse indexes, 2021-03-30) introduced the `get_midx_rev_filename()` helper (later modified by commit 60980aed786 (midx.c: write MIDX filenames to strbuf, 2021-10-26)). This function returns the location of the classic ".rev" files we used to write for MIDXs (prior to 95e8383bac1 (midx.c: make changing the preferred pack safe, 2022-01-25)), which is always of the form: $GIT_DIR/objects/pack/multi-pack-index-$HASH.rev Replace this function with a generic helper that populates a strbuf with the above form, replacing the ".rev" extension with a caller-provided argument. This will allow us to remove a similarly-defined function in the pack-bitmap code (used to determine the location of a MIDX .bitmap file) by reimplementing it in terms of `get_midx_filename_ext()`. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-23Merge branch 'ps/missing-btmp-fix'Junio C Hamano1-3/+4
GIt 2.44 introduced a regression that makes the updated code to barf in repositories with multi-pack index written by older versions of Git, which has been corrected. * ps/missing-btmp-fix: pack-bitmap: gracefully handle missing BTMP chunks
2024-04-15pack-bitmap: gracefully handle missing BTMP chunksPatrick Steinhardt1-3/+4
In 0fea6b73f1 (Merge branch 'tb/multi-pack-verbatim-reuse', 2024-01-12) we have introduced multi-pack verbatim reuse of objects. This series has introduced a new BTMP chunk, which encodes information about bitmapped objects in the multi-pack index. Starting with dab60934e3 (pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions, 2023-12-14) we use this information to figure out objects which we can reuse from each of the packfiles. One thing that we glossed over though is backwards compatibility with repositories that do not yet have BTMP chunks in their multi-pack index. In that case, `nth_bitmapped_pack()` would return an error, which causes us to emit a warning followed by another error message. These warnings are visible to users that fetch from a repository: ``` $ git fetch ... remote: error: MIDX does not contain the BTMP chunk remote: warning: unable to load pack: 'pack-f6bb7bd71d345ea9fe604b60cab9ba9ece54ffbe.idx', disabling pack-reuse remote: Enumerating objects: 40, done. remote: Counting objects: 100% (40/40), done. remote: Compressing objects: 100% (39/39), done. remote: Total 40 (delta 5), reused 0 (delta 0), pack-reused 0 (from 0) ... ``` While the fetch succeeds the user is left wondering what they did wrong. Furthermore, as visible both from the warning and from the reuse stats, pack-reuse is completely disabled in such repositories. What is quite interesting is that this issue can even be triggered in case `pack.allowPackReuse=single` is set, which is the default value. One could have expected that in this case we fall back to the old logic, which is to use the preferred packfile without consulting BTMP chunks at all. But either we fail with the above error in case they are missing, or we use the first pack in the multi-pack-index. The former case disables pack-reuse altogether, whereas the latter case may result in reusing objects from a suboptimal packfile. Fix this issue by partially reverting the logic back to what we had before this patch series landed. Namely, in the case where we have no BTMP chunks or when `pack.allowPackReuse=single` are set, we use the preferred pack instead of consulting the BTMP chunks. Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-01midx-write: move writing-related functions from midx.cTaylor Blau1-1539/+14
Introduce a new midx-write.c source file, which holds all of the functionality from the MIDX sub-system related to writing new MIDX files. Similar to the relationship between "pack-bitmap.c" and "pack-bitmap-write.c", this source file will hold code that is specific to writing MIDX files as opposed to reading them (the latter will remain in midx.c). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-03-25midx: use strvec_pushf() for pack-objects base nameRené Scharfe1-6/+1
Build the pack base name argument directly using strvec_pushf() instead of with an intermediate strbuf. This is shorter, simpler and avoids the need for explicit cleanup. Signed-off-by: René Scharfe <l.s.r@web.de> Reviewed-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14midx: implement `midx_preferred_pack()`Taylor Blau1-0/+20
When performing a binary search over the objects in a MIDX's bitmap (i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack ordering using a combination of (a) the preferred pack, (b) the pack's lexical position in the MIDX based on pack names, and (c) the object offset within the pack. In order to perform this binary search, the reader must know the identity of the preferred pack. This could be stored in the MIDX, but isn't for historical reasons, mostly because it can easily be inferred at read-time by looking at the object in the first bit position and finding out which pack it was selected from in the MIDX, like so: nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0)); In midx_to_pack_pos() which performs this binary search, we look up the identity of the preferred pack before each search. This is relatively quick, since it involves two table-driven lookups (one in the MIDX's revindex for `pack_pos_to_midx()`, and another in the MIDX's object table for `nth_midxed_pack_int_id()`). But since the preferred pack does not change after the MIDX is written, it is safe to cache this value on the MIDX itself. Write a helper to do just that, and rewrite all of the existing call-sites that care about the identity of the preferred pack in terms of this new helper. This will prepare us for a subsequent patch where we will need to binary search through the MIDX's pseudo-pack order multiple times. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14midx: implement `midx_locate_pack()`Taylor Blau1-2/+11
The multi-pack index API exposes a `midx_contains_pack()` function that takes in a string ending in either ".idx" or ".pack" and returns whether or not the MIDX contains a given pack corresponding to that string. There is no corresponding function to locate the position of a pack within the MIDX's pack order (sorted lexically by pack filename). We could add an optional out parameter to `midx_contains_pack()` that is filled out with the pack's position when the parameter is non-NULL. To minimize the amount of fallout from this change, instead introduce a new function by renaming `midx_contains_pack()` to `midx_locate_pack()`, adding that output parameter, and then reimplementing `midx_contains_pack()` in terms of it. Future patches will make use of this new function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14midx: implement `BTMP` chunkTaylor Blau1-3/+72
When a multi-pack bitmap is used to implement verbatim pack reuse (that is, when verbatim chunks from an on-disk packfile are copied directly[^1]), it does so by using its "preferred pack" as the source for pack-reuse. This allows repositories to pack the majority of their objects into a single (often large) pack, and then use it as the single source for verbatim pack reuse. This increases the amount of objects that are reused verbatim (and consequently, decrease the amount of time it takes to generate many packs). But this performance comes at a cost, which is that the preferred packfile must pace its growth with that of the entire repository in order to maintain the utility of verbatim pack reuse. As repositories grow beyond what we can reasonably store in a single packfile, the utility of verbatim pack reuse diminishes. Or, at the very least, it becomes increasingly more expensive to maintain as the pack grows larger and larger. It would be beneficial to be able to perform this same optimization over multiple packs, provided some modest constraints (most importantly, that the set of packs eligible for verbatim reuse are disjoint with respect to the subset of their objects being sent). If we assume that the packs which we treat as candidates for verbatim reuse are disjoint with respect to any of their objects we may output, we need to make only modest modifications to the verbatim pack-reuse code itself. Most notably, we need to remove the assumption that the bits in the reachability bitmap corresponding to objects from the single reuse pack begin at the first bit position. Future patches will unwind these assumptions and reimplement their existing functionality as special cases of the more general assumptions (e.g. that reuse bits can start anywhere within the bitset, but happen to start at 0 for all existing cases). This patch does not yet relax any of those assumptions. Instead, it implements a foundational data-structure, the "Bitampped Packs" (`BTMP`) chunk of the multi-pack index. The `BTMP` chunk's contents are described in detail here. Importantly, the `BTMP` chunk contains information to map regions of a multi-pack index's reachability bitmap to the packs whose objects they represent. For now, this chunk is only written, not read (outside of the test-tool used in this patch to test the new chunk's behavior). Future patches will begin to make use of this new chunk. [^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the pack that wasn't used verbatim. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14midx: factor out `fill_pack_info()`Taylor Blau1-18/+20
When selecting which packfiles will be written while generating a MIDX, the MIDX internals fill out a 'struct pack_info' with various pieces of book-keeping. Instead of filling out each field of the `pack_info` structure individually in each of the two spots that modify the array of such structures (`ctx->info`), extract a common routine that does this for us. This reduces the code duplication by a modest amount. But more importantly, it zero-initializes the structure before assigning values into it. This hardens us for a future change which will add additional fields to this structure which (until this patch) was not zero-initialized. As a result, any new fields added to the `pack_info` structure need only be updated in a single location, instead of at each spot within midx.c. There are no functional changes in this patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14pack-objects: free packing_data in more placesTaylor Blau1-0/+5
The pack-objects internals use a packing_data struct to track what objects are part of the pack(s) being formed. Since these structures contain allocated fields, failing to appropriately free() them results in a leak. Plug that leak by introducing a clear_packing_data() function, and call it in the appropriate spots. This is a fairly straightforward leak to plug, since none of the callers expect to read any values or have any references to parts of the address space being freed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09midx: check consistency of fanout tableJeff King1-9/+11
The commit-graph, midx, and pack idx on-disk formats all have oid fanout tables which are fed to bsearch_hash(). If these tables do not increase monotonically, then the binary search may not only produce bogus values, it may cause out of bounds reads. We fixed this for commit graphs in 4169d89645 (commit-graph: check consistency of fanout table, 2023-10-09). That commit argued that we did not need to do the same for midx and pack idx files, because they already did this check. However, that is wrong. We _do_ check the fanout table for pack idx files when we load them, but we only do so for midx files when running "git multi-pack-index verify". So it is possible to get an out-of-bounds read by running a normal command with a specially crafted midx file. Let's fix this using the same solution (and roughly the same test) we did for the commit-graph in 4169d89645. This replaces the same check from "multi-pack-index verify", because verify uses the same read routines, we'd bail on reading the midx much sooner now. So let's make sure to copy its verbose error message. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: check size of revindex chunkJeff King1-1/+2
When we load a revindex from disk, we check the size of the file compared to the number of objects we expect it to have. But when we use a RIDX chunk stored directly in the midx, we just access the memory directly. This can lead to out-of-bounds memory access for a corrupted or malicious multi-pack-index file. We can catch this by recording the RIDX chunk size, and then checking it against the expected size when we "load" the revindex. Note that this check is much simpler than the one that load_revindex_from_disk() does, because we just have the data array with no header (so we do not need to account for the header size, and nor do we need to bother validating the header values). The test confirms both that we catch this case, and that we continue the process (the revindex is required to use the midx bitmaps, but we fallback to a non-bitmap traversal). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: bounds-check large offset chunkJeff King1-3/+5
When we see a large offset bit in the regular midx offset table, we use the entry as an index into a separate large offset table (just like a pack idx does). But we don't bounds-check the access to that large offset table (nor even record its size when we parse the chunk!). The equivalent code for a regular pack idx is in check_pack_index_ptr(). But things are a bit simpler here because of the chunked format: we can just check our array index directly. As a bonus, we can get rid of the st_mult() here. If our array bounds-check is successful, then we know that the result will fit in a size_t (and the bounds check uses a division to avoid overflow entirely). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: check size of object offset chunkJeff King1-1/+14
The object offset chunk has one fixed-size entry for each object in the midx. But since we don't check its size, we may access out-of-bounds memory if we see a corrupt or malicious midx file. Sine the entries are fixed-size, the total length can be known up-front, and we can just check it while parsing the chunk (this is similar to what we do when opening pack idx files, which contain a similar offset table). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: enforce chunk alignment on readingJeff King1-1/+2
The midx reader assumes chunks are aligned to a 4-byte boundary: we treat the fanout chunk as an array of uint32_t, indexing it to feed the results to ntohl(). Without aligning the chunks, we may violate the CPU's alignment constraints. Though many platforms allow this, some do not. And certanily UBSan will complain, since it is undefined behavior. Even though most chunks are naturally 4-byte-aligned (because they are storing uint32_t or larger types), PNAM is not. It stores NUL-terminated pack names, so you can have a valid chunk with any length. The writing side handles this by 4-byte-aligning the chunk, introducing a few extra NULs as necessary. But since we don't check this on the reading side, we may end up with a misaligned fanout and trigger the undefined behavior. We have two options here: 1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The latter handles alignment itself. It's possible that it's slightly slower (though in practice I'm not sure how true that is, especially for these code paths which then go on to do a binary search). 2. Enforce the alignment when reading the chunks. This is easy to do, since the table-of-contents reader can check it in one spot. I went with the second option here, just because it places less burden on maintenance going forward (it is OK to continue using ntohl), and we know it can't have any performance impact on the actual reads. The commit-graph code uses the same chunk API. It's usually also 4-byte aligned, but some chunks are not (like Bloom filter BDAT chunks). So we'll pass "1" here to allow any alignment. It doesn't suffer from the same problem as midx with its fanout because the fanout chunk is always the first (and the rest of the format dictates that the first chunk will start aligned). The new test shows the effect on a midx with a misaligned PNAM chunk. Note that the midx-reading code treats chunk-toc errors as soft, falling back to the non-midx path rather than calling die(), as we do for other parsing errors. Arguably we should make all of these behave the same, but that's out of scope for this patch. For now the test just expects the fallback behavior. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: check size of pack names chunkJeff King1-2/+9
We parse the pack-name chunk as a series of NUL-terminated strings. But since we don't look at the chunk size, there's nothing to guarantee that we don't parse off the end of the chunk (or even off the end of the mapped file). We can record the length, and then as we parse make sure that we never walk past it. The new test exercises the case, though note that it does not actually segfault before this patch. It hits a NUL byte somewhere in one of the other chunks, and comes up with a garbage pack name. You could construct one that reads out-of-bounds (e.g., a PNAM chunk at the end of file), but this case is simple and sufficient to check that we detect the problem. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: check size of oid lookup chunkJeff King1-3/+15
When reading an on-disk multi-pack-index, we take the number of objects in the midx from the final value of the fanout table. But we just blindly assume that the chunk containing the actual oid entries is the correct size. This can lead to us reading out-of-bounds memory if the lookup chunk is too small (or if the fanout is corrupted; when they don't agree we cannot tell which one is wrong). Note that we bump the assignment of m->num_objects into the fanout parser callback, so that it's set when we parse the lookup table (otherwise we'd have to manually record the lookup table size and check it later). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: stop ignoring malformed oid fanout chunkJeff King1-8/+8
When we load the oid-fanout chunk, our callback checks that its size is reasonable and returns an error if not. However, the caller only checks our return value against CHUNK_NOT_FOUND, so we end up ignoring the error completely! Using a too-small fanout table means we end up accessing random memory for the fanout and segfault. We can fix this by checking for any non-zero return value, rather than just CHUNK_NOT_FOUND, and adjusting our error message to cover both cases. We could handle each error code individually, but there's not much point for such a rare case. The extra message produced in the callback makes it clear what is going on. The same pattern is used in the adjacent code. Those cases are actually OK for now because they do not use a custom callback, so the only error they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an accident waiting to happen (especially as we convert some of them away from pair_chunk). The error messages are more verbose, but it should be rare for a user to see these anyway. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09chunk-format: note that pair_chunk() is unsafeJeff King1-5/+5
The pair_chunk() function is provided as an easy helper for parsing chunks that just want a pointer to a set of bytes. But every caller has a hidden bug: because we return only the pointer without the matching chunk size, the callers have no clue how many bytes they are allowed to look at. And as a result, they may read off the end of the mmap'd data when the on-disk file does not match their expectations. Since chunk files are typically used for local-repository data like commit-graph files and midx's, the security implications here are pretty mild. The worst that can happen is that you hand somebody a corrupted repository tarball, and running Git on it does an out-of-bounds read and crashes. So it's worth being more defensive, but we don't need to drop everything and fix every caller immediately. I noticed the problem because the pair_chunk_fn() callback does not look at its chunk_size argument, and wanted to annotate it to silence -Wunused-parameter. We could do that now, but we'd lose the hint that this code should be audited and fixed. So instead, let's set ourselves up for going down that path: 1. Provide a pair_chunk() function that does return the size, which prepares us for fixing these cases. 2. Rename the existing function to pair_chunk_unsafe(). That gives us an easy way to grep for cases which still need to be fixed, and the name should cause anybody adding new calls to think twice before using it. There are no callers of the "safe" version yet, but we'll add some in subsequent patches. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-25Merge branch 'tb/object-access-overflow-protection'Junio C Hamano1-18/+24
Various offset computation in the code that accesses the packfiles and other data in the object layer has been hardened against arithmetic overflow, especially on 32-bit systems. * tb/object-access-overflow-protection: commit-graph.c: prevent overflow in `verify_commit_graph()` commit-graph.c: prevent overflow in `write_commit_graph()` commit-graph.c: prevent overflow in `merge_commit_graph()` commit-graph.c: prevent overflow in `split_graph_merge_strategy()` commit-graph.c: prevent overflow in `load_tree_for_commit()` commit-graph.c: prevent overflow in `fill_commit_in_graph()` commit-graph.c: prevent overflow in `fill_commit_graph_info()` commit-graph.c: prevent overflow in `load_oid_from_graph()` commit-graph.c: prevent overflow in add_graph_to_chain() commit-graph.c: prevent overflow in `write_commit_graph_file()` pack-bitmap.c: ensure that eindex lookups don't overflow midx.c: prevent overflow in `fill_included_packs_batch()` midx.c: prevent overflow in `write_midx_internal()` midx.c: store `nr`, `alloc` variables as `size_t`'s midx.c: prevent overflow in `nth_midxed_offset()` midx.c: prevent overflow in `nth_midxed_object_oid()` midx.c: use `size_t`'s for fanout nr and alloc packfile.c: use checked arithmetic in `nth_packed_object_offset()` packfile.c: prevent overflow in `load_idx()` packfile.c: prevent overflow in `nth_packed_object_id()`
2023-07-14midx.c: prevent overflow in `fill_included_packs_batch()`Taylor Blau1-2/+2
In a similar spirit as in previous commits, avoid an integer overflow when computing the expected size of a MIDX. (Note that this is also OK as-is, since `p->pack_size` is an `off_t`, so this computation should already be done as 64-bit integers. But again, let's use `st_mult()` to make this fact clear). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14midx.c: prevent overflow in `write_midx_internal()`Taylor Blau1-4/+5
When writing a MIDX, we use the chunk-format API to write out each individual chunk of the MIDX. Each chunk of the MIDX is tracked via a call to `add_chunk()`, along with the expected size of that chunk. Guard against overflow when dealing with a MIDX with a large number of entries (and consequently, large chunks within the MIDX file itself) to avoid corrupting the contents of the MIDX itself. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14midx.c: store `nr`, `alloc` variables as `size_t`'sTaylor Blau1-7/+9
In the `write_midx_context` structure, we use two `uint32_t`'s to track the length and allocated size of the packs, and one `uint32_t` to track the number of objects in the MIDX. In practice, having these be 32-bit unsigned values shouldn't cause any problems since we are unlikely to have that many objects or packs in any real-world repository. But these values should be `size_t`'s, so change their type to reflect that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14midx.c: prevent overflow in `nth_midxed_offset()`Taylor Blau1-1/+2
In a similar spirit as previous patches, avoid an overflow when looking up object offsets in the MIDX's large offset table by guarding the computation via `st_mult()`. This instance is also OK as-is, since the left operand is the result of `sizeof(...)`, which is already a `size_t`. But use `st_mult()` instead here to make it explicit that this computation is to be performed using 64-bit unsigned integers. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14midx.c: prevent overflow in `nth_midxed_object_oid()`Taylor Blau1-1/+1
In a similar spirit as previous commits, avoid overflow when looking up an object's OID in a MIDX when its position is greater than `2^32-1/m->hash_len`. As usual, it is perfectly OK for a MIDX to have as many as 2^32-1 objects (since we use 32-bit fields to count the number of objects at each fanout layer). But if we have more than `2^32-1/m->hash_len` number of objects, we will incorrectly perform the computation using 32-bit integers, overflowing the result. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14midx.c: use `size_t`'s for fanout nr and allocTaylor Blau1-3/+5
The `midx_fanout` struct is used to keep track of a set of OIDs corresponding to each layer of the MIDX's fanout table. It stores an array of entries, along with the number of entries in the table, and the allocated size of the array. Both `nr` and `alloc` are stored as 32-bit unsigned integers. In practice, this should never cause any problems, since most packs have far fewer than 2^32-1 objects. But storing these as `size_t`'s is more appropriate, and prevents us from accidentally overflowing some result when multiplying or adding to either of these values. Update these struct members to be `size_t`'s as appropriate. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05git-compat-util: move alloc macros to git-compat-util.hCalvin Wan1-1/+0
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21object-store-ll.h: split this header out of object-store.hElijah Newren1-1/+1
The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch: $ git grep -h include..object-store | sort | uniq -c 2 #include "object-store.h" 129 #include "object-store-ll.h" Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-05-09Merge branch 'en/header-split-cache-h-part-2'Junio C Hamano1-1/+1
More header clean-up. * en/header-split-cache-h-part-2: (22 commits) reftable: ensure git-compat-util.h is the first (indirect) include diff.h: reduce unnecessary includes object-store.h: reduce unnecessary includes commit.h: reduce unnecessary includes fsmonitor: reduce includes of cache.h cache.h: remove unnecessary headers treewide: remove cache.h inclusion due to previous changes cache,tree: move basic name compare functions from read-cache to tree cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c hash-ll.h: split out of hash.h to remove dependency on repository.h tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h dir.h: move DTYPE defines from cache.h versioncmp.h: move declarations for versioncmp.c functions from cache.h ws.h: move declarations for ws.c functions from cache.h match-trees.h: move declarations for match-trees.c functions from cache.h pkt-line.h: move declarations for pkt-line.c functions from cache.h base85.h: move declarations for base85.c functions from cache.h copy.h: move declarations for copy.c functions from cache.h server-info.h: move declarations for server-info.c functions from cache.h packfile.h: move pack_window and pack_entry from cache.h ...
2023-04-25Merge branch 'ps/fix-geom-repack-with-alternates'Junio C Hamano1-3/+3
Geometric repacking ("git repack --geometric=<n>") in a repository that borrows from an alternate object database had various corner case bugs, which have been corrected. * ps/fix-geom-repack-with-alternates: repack: disable writing bitmaps when doing a local repack repack: honor `-l` when calculating pack geometry t/helper: allow chmtime to print verbosely without modifying mtime pack-objects: extend test coverage of `--stdin-packs` with alternates pack-objects: fix error when same packfile is included and excluded pack-objects: fix error when packing same pack twice pack-objects: split out `--stdin-packs` tests into separate file repack: fix generating multi-pack-index with only non-local packs repack: fix trying to use preferred pack in alternates midx: fix segfault with no packs and invalid preferred pack
2023-04-24treewide: remove cache.h inclusion due to previous changesElijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-14midx: fix segfault with no packs and invalid preferred packPatrick Steinhardt1-3/+3
When asked to write a multi-pack-index the user can specify a preferred pack that is used as a tie breaker when multiple packs contain the same objects. When this packfile cannot be found, we just pick the first pack that is getting tracked by the newly written multi-pack-index as a fallback. Picking the fallback can fail in the case where we're asked to write a multi-pack-index with no packfiles at all: picking the fallback value will cause a segfault as we blindly index into the array of packfiles, which would be empty. Fix this bug by resetting the preferred packfile index to `-1` before searching for the preferred pack. This fixes the segfault as we already check for whether the index is `> - 1`. If it is not, we simply don't pick a preferred packfile at all. Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11object-file.h: move declarations for object-file.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21csum-file.h: remove unnecessary inclusion of cache.hElijah Newren1-1/+1
With the change in the last commit to move several functions to write-or-die.h, csum-file.h no longer needs to include cache.h. However, removing that include forces several other C files, which directly or indirectly dependend upon csum-file.h's inclusion of cache.h, to now be more explicit about their dependencies. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21abspath.h: move absolute path functions from cache.hElijah Newren1-0/+1
This is another step towards letting us remove the include of cache.h in strbuf.c. It does mean that we also need to add includes of abspath.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: be explicit about dependence on gettext.hElijah Newren1-0/+1
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-17Merge branch 'jk/unused-post-2.39-part2'Junio C Hamano1-1/+1
More work towards -Wunused. * jk/unused-post-2.39-part2: (21 commits) help: mark unused parameter in git_unknown_cmd_config() run_processes_parallel: mark unused callback parameters userformat_want_item(): mark unused parameter for_each_commit_graft(): mark unused callback parameter rewrite_parents(): mark unused callback parameter fetch-pack: mark unused parameter in callback function notes: mark unused callback parameters prio-queue: mark unused parameters in comparison functions for_each_object: mark unused callback parameters list-objects: mark unused callback parameters mark unused parameters in signal handlers run-command: mark error routine parameters as unused mark "pointless" data pointers in callbacks ref-filter: mark unused callback parameters http-backend: mark unused parameters in virtual functions http-backend: mark argc/argv unused object-name: mark unused parameters in disambiguate callbacks serve: mark unused parameters in virtual functions serve: use repository pointer to get config ls-refs: drop config caching ...
2023-02-24for_each_object: mark unused callback parametersJeff King1-1/+1
The for_each_{loose,packed}_object interface uses callback functions, but not every callback needs all of the parameters. Mark the unused ones to satisfy -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23alloc.h: move ALLOC_GROW() functions from cache.hElijah Newren1-1/+2
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-27Merge branch 'tb/midx-bitmap-selection-fix'Junio C Hamano1-1/+33
A bugfix with tracing support in midx codepath * tb/midx-bitmap-selection-fix: pack-bitmap-write.c: instrument number of reused bitmaps midx.c: instrument MIDX and bitmap generation with trace2 regions midx.c: consider annotated tags during bitmap selection midx.c: fix whitespace typo
2022-10-13midx.c: instrument MIDX and bitmap generation with trace2 regionsTaylor Blau1-0/+28
When debugging MIDX and MIDX-bitmap related issues, it is useful to figure out where Git is spending its time. GitHub has been using the below trace2 regions to instrument various components of generating a MIDX itself, as well time spent preparing to build a MIDX bitmap. These are limited to instrumenting the following functions: - midx.c::find_commits_for_midx_bitmap() - midx.c::midx_pack_order() - midx.c::prepare_midx_packing_data() - midx.c::write_midx_bitmap() - midx.c::write_midx_internal() - midx.c::write_midx_reverse_index() to start and end with a trace2_region_enter() and trace2_region_leave(), respectively. The category for all of these is "midx", which matches the existing convention. The region description matches the name of the function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-13midx.c: consider annotated tags during bitmap selectionTaylor Blau1-0/+4
When generating a multi-pack bitmap without a `--refs-snapshot` (e.g., by running `git multi-pack-index write --bitmap` directly), we determine the set of bitmap-able commits by enumerating each reference, and adding the referrent as the tip of a reachability traversal when it appears somewhere in the MIDX. (Any commit we encounter during the reachability traversal then becomes a candidate for bitmap selection). But we incorrectly avoid peeling the object at the tip of each reference. So if we see some reference that points at an annotated tag (which in turn points through zero or more additional annotated tags at a commit), that we will not add it as a tip for the reachability traversal. This means that if some commit C is only referenced through one or more annotated tag(s), then C won't become a bitmap candidate. Correct this by peeling the reference tips as we enumerate them to ensure that we consider commits which are the targets of annotated tags, in addition to commits which are referenced directly. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-13midx.c: fix whitespace typoTaylor Blau1-1/+1
This was unintentionally introduced via 893b563505 (midx: inline nth_midxed_pack_entry(), 2021-09-11) where "struct repository *r" became "struct repository * r". The latter does not adhere to our usual style conventions, so fix that up to look more like our usual declarations. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: avoid cruft packs with non-zero `repack --batch-size`Taylor Blau1-0/+2
Apply similar treatment with respect to cruft packs as in a few commits ago to `repack` with a non-zero `--batch-size`. Since the case of a non-zero `--batch-size` is handled separately (in `fill_included_packs_batch()` instead of `fill_included_packs_all()`), a separate fix must be applied for this case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: remove unnecessary loop conditionTaylor Blau1-1/+1
The fill_included_packs_batch() routine is responsible for aggregating objects in packs with a non-zero value for the `--batch-size` option of the `git multi-pack-index repack` sub-command. Since this routine is explicitly called only when `--batch-size` is non-zero, there is no point in checking that this is the case in our loop condition. Remove the unnecessary part of this condition to avoid confusion. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: replace `xcalloc()` with `CALLOC_ARRAY()`Taylor Blau1-1/+3
Replace a direct invocation of Git's `xcalloc()` wrapper with the `CALLOC_ARRAY()` macro instead. The latter is preferred since it is more conventional in Git's codebase, but also because it automatically picks the correct value for the record size. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: avoid cruft packs with `repack --batch-size=0`Taylor Blau1-0/+2
The `repack` sub-command of the `git multi-pack-index` builtin creates a new pack aggregating smaller packs contained in the MIDX up to some given `--batch-size`. When `--batch-size=0`, this instructs the MIDX builtin to repack everything contained in the MIDX into a single pack. In similar spirit as a previous commit, it is undesirable to repack the contents of a cruft pack in this step. Teach `repack` to ignore any cruft pack(s) when `--batch-size=0` for the same reason(s). (The case of a non-zero `--batch-size` will be handled in a subsequent commit). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: prevent `expire` from removing the cruft packTaylor Blau1-1/+1
The `expire` sub-command unlinks any packs that are (a) contained in the MIDX, but (b) have no objects referenced by the MIDX. This sub-command ignores `.keep` packs, which remain on-disk even if they have no objects referenced by the MIDX. Cruft packs, however, aren't given the same treatment: if none of the objects contained in the cruft pack are selected from the cruft pack by the MIDX, then the cruft pack is eligible to be expired. This is less than desireable, since the cruft pack has important metadata about the individual object mtimes, which is useful to determine how quickly an object should age out of the repository when pruning. Ordinarily, we wouldn't expect the contents of a cruft pack to duplicated across non-cruft packs (and we'd expect to see the MIDX select all cruft objects from other sources even less often). But nonetheless, it is still possible to trick the `expire` sub-command into removing the `.mtimes` file in this circumstance. Teach the `expire` sub-command to ignore cruft packs in the same manner as it does `.keep` packs, in order to keep their metadata around, even when they are unreferenced by the MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-05Merge branch 'ac/bitmap-lookup-table'Junio C Hamano1-0/+3
The pack bitmap file gained a bitmap-lookup table to speed up locating the necessary bitmap for a given commit. * ac/bitmap-lookup-table: pack-bitmap-write: drop unused pack_idx_entry parameters bitmap-lookup-table: add performance tests for lookup table pack-bitmap: prepare to read lookup table extension pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests pack-bitmap-write.c: write lookup table extension bitmap: move `get commit positions` code to `bitmap_writer_finish` Documentation/technical: describe bitmap lookup table extension
2022-09-05Merge branch 'tb/midx-with-changing-preferred-pack-fix'Junio C Hamano1-46/+95
Multi-pack index got corrupted when preferred pack changed from one pack to another in a certain way, which has been corrected. * tb/midx-with-changing-preferred-pack-fix: midx.c: avoid adding preferred objects twice midx.c: include preferred pack correctly with existing MIDX midx.c: extract `midx_fanout_add_pack_fanout()` midx.c: extract `midx_fanout_add_midx_fanout()` midx.c: extract `struct midx_fanout` t/lib-bitmap.sh: avoid silencing stderr t5326: demonstrate potential bitmap corruption
2022-08-26pack-bitmap-write: learn pack.writeBitmapLookupTable and add testsAbhradeep Chakraborty1-0/+3
Teach Git to provide a way for users to enable/disable bitmap lookup table extension by providing a config option named 'writeBitmapLookupTable'. Default is false. Also add test to verify writting of lookup table. Mentored-by: Taylor Blau <me@ttaylorr.com> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com> Co-Authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-22midx.c: avoid adding preferred objects twiceTaylor Blau1-2/+13
The last commit changes the behavior of midx.c's `get_sorted_objects()` function to handle the case of writing a MIDX bitmap while reusing an existing MIDX and changing the identity of the preferred pack separately. As part of this change, all objects from the (new) preferred pack are added to the fanout table in a separate pass. Since these copies of the objects all have their preferred bits set, any duplicates will be resolved in their favor. Importantly, this includes any copies of those same objects that come from the existing MIDX. We know at the time of adding them that they'll be redundant if their source pack is the (new) preferred one, so we can avoid adding them to the list in this case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-22midx.c: include preferred pack correctly with existing MIDXTaylor Blau1-7/+7
This patch resolves an issue where the object order used to generate a MIDX bitmap would violate an invariant that all of the preferred pack's objects are represented by that pack in the MIDX. The problem arises when reusing an existing MIDX while generating a new one, and occurs specifically when the identity of the preferred pack changes from one MIDX to another, along with a few other conditions: - the new preferred pack must also be present in the existing MIDX - the new preferred pack must *not* have been the preferred pack in the existing MIDX - most importantly, there must be at least one object present in the physical preferred pack (ie., it shows up in that pack's index) but was selected from a *different* pack when the previous MIDX was generated When the above conditions are all met, we end up (incorrectly) discarding copies of some objects in the pack selected as the preferred pack. This is because `get_sorted_entries()` adds objects to its list by doing the following at each fanout level: - first, adding all objects from that fanout level from an existing MIDX - then, adding all objects from that fanout level in each pack *not* included in the existing MIDX So if some object was not selected from the to-be-preferred pack when writing the previous MIDX, then we will never consider it as a candidate when generating the new MIDX. This means that it's possible for the preferred pack to not include all of its objects in the MIDX's pseudo-pack object order, which is an invariant violation of that order. Resolve this by adding all objects from the preferred pack separately when it appears in the existing MIDX (if one was present). This will duplicate objects from that pack that *did* appear in the MIDX, but this is fine, since get_sorted_entries() already handles duplicates. (A future optimization in this area could avoid adding copies of objects that we know already existing in the MIDX.) Note that we no longer need to compute the preferred-ness of objects added from the MIDX, since we only want to select the preferred objects from a single source. (We could still mark these preferred bits, but doing so is redundant and unnecessary). This resolves the bug demonstrated by t5326.174 ("preferred pack change with existing MIDX bitmap"). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-22midx.c: extract `midx_fanout_add_pack_fanout()`Taylor Blau1-15/+28
Extract a routine to add all objects whose object ID's first byte is `cur_fanout` from a given pack (identified by its index into the `struct pack_info` array maintained by the MIDX writing routine). Unlike the previous extraction (for `midx_fanout_add_midx_fanout()`), this function will be called twice, once for all new packs, and again for the preferred pack (if it appears in an existing MIDX). The latter change is to resolve the bug described a few patches ago, and will be made in the subsequent commit. Similar to the previous refactoring, this function also enhances the readability of its caller in `get_sorted_entries()`. Its functionality is unchanged in this commit. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-22midx.c: extract `midx_fanout_add_midx_fanout()`Taylor Blau1-19/+28
Extract a routine to add all objects whose object ID's first byte is `cur_fanout` from an existing MIDX. This function will only be called once, so extracting it is purely cosmetic to improve the readability of `get_sorted_entries()` (its sole caller) below. The functionality is unchanged in this commit. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-22midx.c: extract `struct midx_fanout`Taylor Blau1-19/+35
To build up a list of objects (along with their packs, and the offsets within those packs that each object appears at), the MIDX code implements `get_sorted_entries()` which builds up a list of candidates, sorts them, and then removes duplicate entries. To do this, it keeps an array of `pack_midx_entry` structures that it builds up once for each fanout level (ie., for all possible values of the first byte of each object's ID). This array is a function-local variable of `get_sorted_entries()`. Since it uses the ALLOC_GROW() macro, having the `alloc_fanout` variable also be local to that function, and only modified within that function is convenient. However, subsequent changes will extract the two ways this array is filled (from a pack at some fanout value, and from an existing MIDX at some fanout value) into separate functions. Instead of passing around pointers to the entries array, along with `nr_fanout` and `alloc_fanout`, encapsulate these three into a structure instead. Then pass around a pointer to this structure instead. This patch does not yet extract the above two functions, but sets us up to begin doing so in the following commit. For now, the implementation of get_sorted_entries() is only modified to replace `entries_by_fanout` with `fanout.entries`, `nr_fanout` with `fanout.nr`, and so on. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-27write_midx_bitmap(): drop unused refs_snapshot parameterJeff King1-2/+1
The refactoring in 90b2bb710d (midx: extract bitmap write setup, 2022-07-19) hoisted our call to find_commits_for_midx_bitmap() into the caller, which means we no longer need to see the refs_snapshot at all. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-19midx: reduce memory pressure while writing bitmapsDerrick Stolee1-0/+13
We noticed that some 'git multi-pack-index write --bitmap' processes were running with very high memory. It turns out that a lot of this memory is required to store a list of every object in the written multi-pack-index, with a second copy that has additional information used for the bitmap writing logic. Using 'valgrind --tool=massif' before this change, the following chart shows how memory load increased and was maintained throughout the process: GB 4.102^ :: | @ @::@@::@@::::::::@::::::@@:#:::::::::::::@@:: : | :::::@@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :::: :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :::: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | : :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | : :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | @ :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | @ :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : | @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: : 0 +---------------------------------------------------------------> It turns out that the 'struct write_midx_context' data is persisting through the life of the process, including the 'entries' array. This array is used last inside find_commits_for_midx_bitmap() within write_midx_bitmap(). If we free (and nullify) the array at that point, we can free a decent chunk of memory before the bitmap logic adds more to the memory footprint. Here is the massif memory load chart after this change: GB 3.111^ # | # :::::::::::@::::::::::::::@ | # ::::::::::::::::::::::::: : :: : @:: ::::: :: ::@ | @# :::::::::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | :::@#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | :: @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | :: @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ | :: @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@ 0 +---------------------------------------------------------------> The previous change introduced a refactoring of write_midx_bitmap() to make it more clear how much of the 'struct write_midx_context' instance is needed at different parts of the process. In addition, the following defensive programming measures were put in place: 1. Using FREE_AND_NULL() we will at least get a segfault from reading a NULL pointer instead of a use-after-free. 2. 'entries_nr' is also set to zero to make any loop that would iterate over the entries be trivial. 3. Add significant comments in write_midx_internal() to add warnings for future authors who might accidentally add references to this cleared memory. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-19midx: extract bitmap write setupDerrick Stolee1-24/+32
The write_midx_bitmap() method is a long method that does a lot of steps. It requires the write_midx_context struct for use in prepare_midx_packing_data() and find_commits_for_midx_bitmap(), but after that only needs the pack_order array. This is a messy, but completely non-functional refactoring. The code is only being moved around to reduce visibility of the write_midx_context during the longest part of computing reachability bitmaps. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-07Merge branch 'ab/plug-leak-in-revisions'Junio C Hamano1-0/+1
Plug the memory leaks from the trickiest API of all, the revision walker. * ab/plug-leak-in-revisions: (27 commits) revisions API: add a TODO for diff_free(&revs->diffopt) revisions API: have release_revisions() release "topo_walk_info" revisions API: have release_revisions() release "date_mode" revisions API: call diff_free(&revs->pruning) in revisions_release() revisions API: release "reflog_info" in release revisions() revisions API: clear "boundary_commits" in release_revisions() revisions API: have release_revisions() release "prune_data" revisions API: have release_revisions() release "grep_filter" revisions API: have release_revisions() release "filter" revisions API: have release_revisions() release "cmdline" revisions API: have release_revisions() release "mailmap" revisions API: have release_revisions() release "commits" revisions API users: use release_revisions() for "prune_data" users revisions API users: use release_revisions() with UNLEAK() revisions API users: use release_revisions() in builtin/log.c revisions API users: use release_revisions() in http-push.c revisions API users: add "goto cleanup" for release_revisions() stash: always have the owner of "stash_info" free it revisions API users: use release_revisions() needing REV_INFO_INIT revision.[ch]: document and move code declared around "init" ...
2022-06-03Merge branch 'tb/cruft-packs'Junio C Hamano1-15/+3
A mechanism to pack unreachable objects into a "cruft pack", instead of ejecting them into loose form to be reclaimed later, has been introduced. * tb/cruft-packs: sha1-file.c: don't freshen cruft packs builtin/gc.c: conditionally avoid pruning objects via loose builtin/repack.c: add cruft packs to MIDX during geometric repack builtin/repack.c: use named flags for existing_packs builtin/repack.c: allow configuring cruft pack generation builtin/repack.c: support generating a cruft pack builtin/pack-objects.c: --cruft with expiration reachable: report precise timestamps from objects in cruft packs reachable: add options to add_unseen_recent_objects_to_traversal builtin/pack-objects.c: --cruft without expiration builtin/pack-objects.c: return from create_object_entry() t/helper: add 'pack-mtimes' test-tool pack-mtimes: support writing pack .mtimes files chunk-format.h: extract oid_version() pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' pack-mtimes: support reading .mtimes files Documentation/technical: add cruft-packs.txt
2022-05-26chunk-format.h: extract oid_version()Taylor Blau1-15/+3
There are three definitions of an identical function which converts `the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a copy of this function for writing both the commit-graph and multi-pack-index file, and another inline definition used to write the .rev header. Consolidate these into a single definition in chunk-format.h. It's not clear that this is the best header to define this function in, but it should do for now. (Worth noting, the .rev caller expects a 4-byte unsigned, but the other two callers work with a single unsigned byte. The consolidated version uses the latter type, and lets the compiler widen it when required). Another caller will be added in a subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-04Merge branch 'ds/midx-normalize-pathname-before-comparison'Junio C Hamano1-4/+13
The path taken by "git multi-pack-index" command from the end user was compared with path internally prepared by the tool withut first normalizing, which lead to duplicated paths not being noticed, which has been corrected. * ds/midx-normalize-pathname-before-comparison: cache: use const char * for get_object_directory() multi-pack-index: use --object-dir real path midx: use real paths in lookup_multi_pack_index()
2022-04-25midx: use real paths in lookup_multi_pack_index()Derrick Stolee1-4/+13
This helper looks for a parsed multi-pack-index whose object directory matches the given object_dir. Before going into the loop over the parsed multi-pack-indexes, it calls find_odb() to ensure that the given object_dir is actually a known object directory. However, find_odb() uses real-path manipulations to compare the input to the alternate directories. This same real-path comparison is not used in the loop, leading to potential issues with the strcmp(). Update the method to use the real-path values instead. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-13revisions API users: add straightforward release_revisions()Ævar Arnfjörð Bjarmason1-0/+1
Add a release_revisions() to various users of "struct rev_list" in those straightforward cases where we only need to add the release_revisions() call to the end of a block, and don't need to e.g. refactor anything to use a "goto cleanup" pattern. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10core.fsync: introduce granular fsync control infrastructureNeeraj Singh1-1/+2
This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-18Merge branch 'tb/midx-no-bitmap-for-no-objects'Junio C Hamano1-0/+9
When there is no object to write .bitmap file for, "git multi-pack-index" triggered an error, instead of just skipping, which has been corrected. * tb/midx-no-bitmap-for-no-objects: midx: prevent writing a .bitmap without any objects
2022-02-09midx: prevent writing a .bitmap without any objectsTaylor Blau1-0/+9
When trying to write a MIDX, we already prevent the case where there weren't any packs present, and thus we would have written an empty MIDX. But there is another "empty" case, which is more interesting, and we don't yet handle. If we try to write a MIDX which has at least one pack, but those packs together don't contain any objects, we will encounter a BUG() when trying to use the bitmap corresponding to that MIDX, like so: $ git rev-parse HEAD | git pack-objects --revs --use-bitmap-index --stdout >/dev/null BUG: pack-revindex.c:394: pack_pos_to_midx: out-of-bounds object at 0 (note that in the above reproduction, both `--use-bitmap-index` and `--stdout` are important, since without the former we won't even both to load the .bitmap, and without the latter we wont attempt pack reuse). The problem occurs when we try to discover the identity of the preferred pack to determine which range if any of existing packs we can reuse verbatim. This path is: `reuse_packfile_objects()` -> `reuse_partial_packfile_from_bitmap()` -> `midx_preferred_pack()`. #4 0x000055555575401f in pack_pos_to_midx (m=0x555555997160, pos=0) at pack-revindex.c:394 #5 0x00005555557502c8 in midx_preferred_pack (bitmap_git=0x55555599c280) at pack-bitmap.c:1431 #6 0x000055555575036c in reuse_partial_packfile_from_bitmap (bitmap_git=0x55555599c280, packfile_out=0x5555559666b0 <reuse_packfile>, entries=0x5555559666b8 <reuse_packfile_objects>, reuse_out=0x5555559666c0 <reuse_packfile_bitmap>) at pack-bitmap.c:1452 #7 0x00005555556041f6 in get_object_list_from_bitmap (revs=0x7fffffffcbf0) at builtin/pack-objects.c:3658 #8 0x000055555560465c in get_object_list (ac=2, av=0x555555997050) at builtin/pack-objects.c:3765 #9 0x0000555555605e4e in cmd_pack_objects (argc=0, argv=0x7fffffffe920, prefix=0x0) at builtin/pack-objects.c:4154 Since neither the .bitmap or MIDX stores the identity of the preferred pack, we infer it by trying to load the first object in pseudo-pack order, and then asking the MIDX which pack was chosen to represent that object. But this fails our bounds check, since there are zero objects in the MIDX to begin with, which results in the BUG(). We could catch this more carefully in `midx_preferred_pack()`, but signaling the absence of a preferred pack out to all of its callers is somewhat awkward. Instead, let's avoid writing a MIDX .bitmap without any objects altogether. We catch this case in `write_midx_internal()`, and emit a warning if the caller indicated they wanted to write a bitmap before clearing out the relevant flags. If we somehow got to write_midx_bitmap(), then we will call BUG(), but this should now be an unreachable path. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27midx: read `RIDX` chunk when presentTaylor Blau1-1/+5
When a MIDX contains the new `RIDX` chunk, ensure that the reverse index is read from it instead of the on-disk .rev file. Since we need to encode the object order in the MIDX itself for correctness reasons, there is no point in storing the same data again outside of the MIDX. So, this patch stops writing separate .rev files, and reads it out of the MIDX itself. This is possible to do with relatively little new code, since the format of the RIDX chunk is identical to the data in the .rev file. In other words, we can implement this by pointing the `revindex_data` field at the reverse index chunk of the MIDX instead of the .rev file without any other changes. Note that we have two knobs that are adjusted for the new tests: GIT_TEST_MIDX_WRITE_REV and GIT_TEST_MIDX_READ_RIDX. The former controls whether the MIDX .rev is written at all, and the latter controls whether we read the MIDX's RIDX chunk. Both are necessary to ensure that the test added at the beginning of this series continues to work. This is because we always need to write the RIDX chunk in the MIDX in order to change its checksum, but we want to make sure reading the existing .rev file still works (since the RIDX chunk takes precedence by default). Arguably this isn't a very interesting mode to test, because the precedence rules mean that we'll always read the RIDX chunk over the .rev file. But it makes it impossible for a user to induce corruption in their repository by adjusting the test knobs (since if we had an either/or knob they could stop writing the RIDX chunk, allowing them to tweak the MIDX's object order without changing its checksum). Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27midx.c: make changing the preferred pack safeTaylor Blau1-3/+20
The previous patch demonstrates a bug where a MIDX's auxiliary object order can become out of sync with a MIDX bitmap. This is because of two confounding factors: - First, the object order is stored in a file which is named according to the multi-pack index's checksum, and the MIDX does not store the object order. This means that the object order can change without altering the checksum. - But the .rev file is moved into place with finalize_object_file(), which link(2)'s the file into place instead of renaming it. For us, that means that a modified .rev file will not be moved into place if MIDX's checksum was unchanged. This fix is to force the MIDX's checksum to change when the preferred pack changes but the set of packs contained in the MIDX does not. In other words, when the object order changes, the MIDX's checksum needs to change with it (regardless of whether the MIDX is tracking the same or different packs). This prevents a race whereby changing the object order (but not the packs themselves) enables a reader to see the new .rev file with the old MIDX, or similarly seeing the new bitmap with the old object order. But why can't we just stop hardlinking the .rev into place instead adding additional data to the MIDX? Suppose that's what we did. Then when we go to generate the new bitmap, we'll load the old MIDX bitmap, along with the MIDX that it references. That's fine, since the new MIDX isn't moved into place until after the new bitmap is generated. But the new object order *has* been moved into place. So we'll read the old bitmaps in the new order when generating the new bitmap file, meaning that without this secondary change, bitmap generation itself would become a victim of the race described here. This can all be prevented by forcing the MIDX's checksum to change when the object order does. By embedding the entire object order into the MIDX, we do just that. That is, the MIDX's checksum will change in response to any perturbation of the underlying object order. In t5326, this will cause the MIDX's checksum to update (even without changing the set of packs in the MIDX), preventing the stale read problem. Note that this makes it safe to continue to link(2) the MIDX .rev file into place, since it is now impossible to have a .rev file that is out-of-sync with the MIDX whose checksum it references. (But we will do away with MIDX .rev files later in this series anyway, so this is somewhat of a moot point). In theory, it is possible to store a "fingerprint" of the full object order here, so long as that fingerprint changes at least as often as the full object order does. Some possibilities here include storing the identity of the preferred pack, along with the mtimes of the non-preferred packs in a consistent order. But storing a limited part of the information makes it difficult to reason about whether or not there are gaps between the two that would cause us to get bitten by this bug again. Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-29Merge branch 'tb/plug-pack-bitmap-leaks'Junio C Hamano1-28/+38
Leakfix. * tb/plug-pack-bitmap-leaks: pack-bitmap.c: more aggressively free in free_bitmap_index() pack-bitmap.c: don't leak type-level bitmaps midx.c: write MIDX filenames to strbuf builtin/multi-pack-index.c: don't leak concatenated options builtin/repack.c: avoid leaking child arguments builtin/pack-objects.c: don't leak memory via arguments t/helper/test-read-midx.c: free MIDX within read_midx_file() midx.c: don't leak MIDX from verify_midx_file midx.c: clean up chunkfile after reading the MIDX
2021-10-28midx.c: write MIDX filenames to strbufTaylor Blau1-26/+33
To ask for the name of a MIDX and its corresponding .rev file, callers invoke get_midx_filename() and get_midx_rev_filename(), respectively. These both invoke xstrfmt(), allocating a chunk of memory which must be freed later on. This makes callers in pack-bitmap.c somewhat awkward. Specifically, midx_bitmap_filename(), which is implemented like: return xstrfmt("%s-%s.bitmap", get_midx_filename(midx->object_dir), hash_to_hex(get_midx_checksum(midx))); this leaks the second argument to xstrfmt(), which itself was allocated with xstrfmt(). This caller could assign both the result of get_midx_filename() and the outer xstrfmt() to a temporary variable, remembering to free() the former before returning. But that involves a wasteful copy. Instead, get_midx_filename() and get_midx_rev_filename() take a strbuf as an output parameter. This way midx_bitmap_filename() can manipulate and pass around a temporary buffer which it detaches back to its caller. That allows us to implement the function without copying or open-coding get_midx_filename() in a way that doesn't leak. Update the other callers of get_midx_filename() and get_midx_rev_filename() accordingly. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-27midx.c: don't leak MIDX from verify_midx_fileTaylor Blau1-1/+3
The function midx.c:verify_midx_file() allocates a MIDX struct by calling load_multi_pack_index(). But when cleaning up, it calls free() without freeing any resources associated with the MIDX. Call the more appropriate close_midx() which does free those resources, which causes t5319.3 to pass when Git is compiled with SANITIZE=leak. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-21midx.c: clean up chunkfile after reading the MIDXTaylor Blau1-1/+2
In order to read the contents of a MIDX, we initialize a chunkfile structure which can read the table of contents and assign pointers into different sections of the file for us. We do call free(), since the chunkfile struct is heap allocated, but not the more appropriate free_chunkfile(), which also frees memory that the structure itself owns. Call that instead to avoid leaking memory in this function. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-18Merge branch 'tb/repack-write-midx'Junio C Hamano1-12/+98
"git repack" has been taught to generate multi-pack reachability bitmaps. * tb/repack-write-midx: test-read-midx: fix leak of bitmap_index struct builtin/repack.c: pass `--refs-snapshot` when writing bitmaps builtin/repack.c: make largest pack preferred builtin/repack.c: support writing a MIDX while repacking builtin/repack.c: extract showing progress to a variable builtin/repack.c: rename variables that deal with non-kept packs builtin/repack.c: keep track of existing packs unconditionally midx: preliminary support for `--refs-snapshot` builtin/multi-pack-index.c: support `--stdin-packs` mode midx: expose `write_midx_file_only()` publicly
2021-10-15midx.c: guard against commit_lock_file() failuresTaylor Blau1-1/+2
When writing a MIDX, we atomically move the new MIDX into place via commit_lock_file(), but do not check to see if that call was successful. Make sure that we do check in order to prevent us from incorrectly reporting that we wrote a new MIDX if we actually encountered an error. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15midx.c: lookup MIDX by object directory during repackTaylor Blau1-4/+1
Apply similar treatment as in the last commit to the MIDX `repack` operation. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15midx.c: lookup MIDX by object directory during expireTaylor Blau1-4/+3
Before a new MIDX can be written, expire_midx_packs() first loads the existing MIDX, figures out which packs can be expired, and then writes a new MIDX based on that information. In order to load the existing MIDX, it uses load_multi_pack_index(), which mmaps the multi-pack-index file, but does not store the resulting `struct multi_pack_index *` in the object store. write_midx_internal() also needs to open the existing MIDX, and it does so by iterating the results of get_multi_pack_index(), so that it reuses the same pointer held by the object store. But before it can move the new MIDX into place, it close_object_store() to munmap() the multi-pack-index file to accommodate platforms like Windows which don't allow overwriting files which are memory mapped. That's where things get weird. Since expire_midx_packs has its own *separate* memory mapped copy of the MIDX, the MIDX file is still memory mapped! Interestingly, this doesn't seem to cause a problem in our tests. (I believe that this has much more to do with my own lack of familiarity with Windows than it does a lack of coverage in our tests). In any case, we can side-step the whole issue by teaching expire_midx_packs() to use the `struct multi_pack_index` pointer it found via the object store instead of maintain its own copy. That way, when write_midx_internal() calls `close_object_store()`, we know that there are no memory mapped copies of the MIDX laying around. A couple of other small notes about this patch: - As far as I can tell, passing `local == 1` to the call to load_multi_pack_index() was an error, since object_dir could be an alternate. But it doesn't matter, since even though we write `m->local = 1`, we never read that field back later on. - Setting `m = NULL` after write_midx_internal() was likely to prevent a double-free back from when that function took a `struct multi_pack_index *` that it called close_midx() on itself. We can rely on write_midx_internal() to call that for us now. Finally, this enforces the same "the value of --object-dir must be the local object store, or an alternate" rule from f57a739691 (midx: avoid opening multiple MIDXs when writing, 2021-09-01) to the `expire` sub-command, too. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15midx.c: extract MIDX lookup by object_dirTaylor Blau1-10/+17
The first thing that write_midx_internal() does is load the MIDX corresponding to the given object directory, if one is present. Prepare for other functions in midx.c to do the same thing by extracting that operation out to a small helper function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15Merge branch 'tb/repack-write-midx' into tb/fix-midx-rename-while-mappedJunio C Hamano1-12/+98
* tb/repack-write-midx: test-read-midx: fix leak of bitmap_index struct builtin/repack.c: pass `--refs-snapshot` when writing bitmaps builtin/repack.c: make largest pack preferred builtin/repack.c: support writing a MIDX while repacking builtin/repack.c: extract showing progress to a variable builtin/repack.c: rename variables that deal with non-kept packs builtin/repack.c: keep track of existing packs unconditionally midx: preliminary support for `--refs-snapshot` builtin/multi-pack-index.c: support `--stdin-packs` mode midx: expose `write_midx_file_only()` publicly
2021-10-11Merge branch 'tb/midx-write-propagate-namehash'Junio C Hamano1-1/+5
"git multi-pack-index write --bitmap" learns to propagate the hashcache from original bitmap to resulting bitmap. * tb/midx-write-propagate-namehash: t5326: test propagating hashcache values p5326: generate pack bitmaps before writing the MIDX bitmap p5326: don't set core.multiPackIndex unnecessarily p5326: create missing 'perf-tag' tag midx.c: respect 'pack.writeBitmapHashcache' when writing bitmaps pack-bitmap.c: propagate namehash values from existing bitmaps t/helper/test-bitmap.c: add 'dump-hashes' mode
2021-09-28midx: preliminary support for `--refs-snapshot`Taylor Blau1-8/+53
To figure out which commits we can write a bitmap for, the multi-pack index/bitmap code does a reachability traversal, marking any commit which can be found in the MIDX as eligible to receive a bitmap. This approach will cause a problem when multi-pack bitmaps are able to be generated from `git repack`, since the reference tips can change during the repack. Even though we ignore commits that don't exist in the MIDX (when doing a scan of the ref tips), it's possible that a commit in the MIDX reaches something that isn't. This can happen when a multi-pack index contains some pack which refers to loose objects (e.g., if a pack was pushed after starting the repack but before generating the MIDX which depends on an object which is stored as loose in the repository, and by definition isn't included in the multi-pack index). By taking a snapshot of the references before we start repacking, we can close that race window. In the above scenario (where we have a packed object pointing at a loose one), we'll either (a) take a snapshot of the references before seeing the packed one, or (b) take it after, at which point we can guarantee that the loose object will be packed and included in the MIDX. This patch does just that. It writes a temporary "reference snapshot", which is a list of OIDs that are at the ref tips before writing a multi-pack bitmap. References that are "preferred" (i.e,. are a suffix of at least one value of the 'pack.preferBitmapTips' configuration) are marked with a special '+'. The format is simple: one line per commit at each tip, with an optional '+' at the beginning (for preferred references, as described above). When provided, the reference snapshot is used to drive bitmap selection instead of the MIDX code doing its own traversal. When it isn't provided, the usual traversal takes place instead. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-28midx: expose `write_midx_file_only()` publiclyTaylor Blau1-8/+49
Expose a variant of the write_midx_file() function which ignores packs that aren't included in an explicit "allow" list. This will be used in an upcoming patch to power a new `--stdin-packs` mode of `git multi-pack-index write` for callers that only want to include certain packs in a MIDX (and ignore any packs which may have happened to enter the repository independently, e.g., from pushes). Those patches will provide test coverage for this new function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-23Merge branch 'rs/packfile-bad-object-list-in-oidset'Junio C Hamano1-26/+11
Replace a handcrafted data structure used to keep track of bad objects in the packfile API by an oidset. * rs/packfile-bad-object-list-in-oidset: packfile: use oidset for bad objects packfile: convert has_packed_and_bad() to object_id packfile: convert mark_bad_packed_object() to object_id midx: inline nth_midxed_pack_entry() oidset: make oidset_size() an inline function
2021-09-14midx.c: respect 'pack.writeBitmapHashcache' when writing bitmapsTaylor Blau1-1/+5
In the previous commit, the bitmap writing code learned to propagate values from an existing hash-cache extension into the bitmap that it is writing. Now that that functionality exists, let's expose it by teaching the 'git multi-pack-index' builtin to respect the `pack.writeBitmapHashCache` option so that the hash-cache may be written at all. Two minor points worth noting here: - The 'git multi-pack-index write' sub-command didn't previously read any configuration (instead this is handled in the base command). A separate handler is added here to respect this write-specific config option. - I briefly considered adding a 'bitmap_flags' field to the static options struct, but decided against it since it would require plumbing through a new parameter to the write_midx_file() function. Instead, a new MIDX-specific flag is added, which is translated to the corresponding bitmap one. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12packfile: use oidset for bad objectsRené Scharfe1-7/+3
Store the object ID of broken pack entries in an oidset instead of keeping only their hashes in an unsorted array. The resulting code is shorter and easier to read. It also handles the (hopefully) very rare case of having a high number of bad objects better. Helped-by: Jeff King <peff@peff.net> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12midx: inline nth_midxed_pack_entry()René Scharfe1-20/+9
fill_midx_entry() finds the position of an object ID and passes it to nth_midxed_pack_entry(), which uses the position to look up the object ID for its own purposes. Inline the latter into the former to avoid that lookup. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09pack-bitmap: drop repository argument from prepare_midx_bitmap_git()Jeff King1-1/+1
We never look at the repository argument which is passed. This makes sense, since the multi_pack_index struct already tells us everything we need to access the files in its associated object directory. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap: write multi-pack bitmapsTaylor Blau1-8/+201
Write multi-pack bitmaps in the format described by Documentation/technical/bitmap-format.txt, inferring their presence with the absence of '--bitmap'. To write a multi-pack bitmap, this patch attempts to reuse as much of the existing machinery from pack-objects as possible. Specifically, the MIDX code prepares a packing_data struct that pretends as if a single packfile has been generated containing all of the objects contained within the MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap: read multi-pack bitmapsTaylor Blau1-2/+2
This prepares the code in pack-bitmap to interpret the new multi-pack bitmaps described in Documentation/technical/bitmap-format.txt, which mostly involves converting bit positions to accommodate looking them up in a MIDX. Note that there are currently no writers who write multi-pack bitmaps, and that this will be implemented in the subsequent commit. Note also that get_midx_checksum() and get_midx_filename() are made non-static so they can be called from pack-bitmap.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: avoid opening multiple MIDXs when writingTaylor Blau1-11/+18
Opening multiple instance of the same MIDX can lead to problems like two separate packed_git structures which represent the same pack being added to the repository's object store. The above scenario can happen because prepare_midx_pack() checks if `m->packs[pack_int_id]` is NULL in order to determine if a pack has been opened and installed in the repository before. But a caller can construct two copies of the same MIDX by calling get_multi_pack_index() and load_multi_pack_index() since the former manipulates the object store directly but the latter is a lower-level routine which allocates a new MIDX for each call. So if prepare_midx_pack() is called on multiple MIDXs with the same pack_int_id, then that pack will be installed twice in the object store's packed_git pointer. This can lead to problems in, for e.g., the pack-bitmap code, which does something like the following (in pack-bitmap.c:open_pack_bitmap()): struct bitmap_index *bitmap_git = ...; for (p = get_all_packs(r); p; p = p->next) { if (open_pack_bitmap_1(bitmap_git, p) == 0) ret = 0; } which is a problem if two copies of the same pack exist in the packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a conditional like the following: if (bitmap_git->pack || bitmap_git->midx) { /* ignore extra bitmap file; we can only handle one */ warning("ignoring extra bitmap file: %s", packfile->pack_name); close(fd); return -1; } Avoid this scenario by not letting write_midx_internal() open a MIDX that isn't also pointed at by the object store. So long as this is the case, other routines should prefer to open MIDXs with get_multi_pack_index() or reprepare_packed_git() instead of creating instances on their own. Because get_multi_pack_index() returns `r->object_store->multi_pack_index` if it is non-NULL, we'll only have one instance of a MIDX open at one time, avoiding these problems. To encourage this, drop the `struct multi_pack_index *` parameter from `write_midx_internal()`, and rely instead on the `object_dir` to find (or initialize) the correct MIDX instance. Likewise, replace the call to `close_midx()` with `close_object_store()`, since we're about to replace the MIDX with a new one and should invalidate the object store's memory of any MIDX that might have existed beforehand. Note that this now forbids passing object directories that don't belong to alternate repositories over `--object-dir`, since before we would have happily opened a MIDX in any directory, but now restrict ourselves to only those reachable by `r->objects->multi_pack_index` (and alternate MIDXs that we can see by walking the `next` pointer). As far as I can tell, supporting arbitrary directories with `--object-dir` was a historical accident, since even the documentation says `<alt>` when referring to the value passed to this option. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: close linked MIDXs, avoid leaking memoryTaylor Blau1-0/+3
When a repository has at least one alternate, the MIDX belonging to each alternate is accessed through the `next` pointer on the main object store's copy of the MIDX. close_midx() didn't bother to close any of the linked MIDXs. It likewise didn't free the memory pointed to by `m`, leaving uninitialized bytes with live pointers to them left around in the heap. Clean this up by closing linked MIDXs, and freeing up the memory pointed to by each of them. When callers call close_midx(), then they can discard the entire linked list of MIDXs and set their pointer to the head of that list to NULL. This isn't strictly required for the upcoming patches, but it makes it much more difficult (though still possible, for e.g., by calling `close_midx(m->next)` which leaves `m->next` pointing at uninitialized bytes) to have pointers to uninitialized memory. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: infer preferred pack when not given oneTaylor Blau1-6/+44
In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the multi-pack index code learned how to select a pack which all duplicate objects are selected from. That is, if an object appears in multiple packs, select the copy in the preferred pack before breaking ties according to the other rules like pack mtime and readdir() order. Not specifying a preferred pack can cause serious problems with multi-pack reachability bitmaps, because these bitmaps rely on having at least one pack from which all duplicates are selected. Not having such a pack causes problems with the code in pack-objects to reuse packs verbatim (e.g., that code assumes that a delta object in a chunk of pack sent verbatim will have its base object sent from the same pack). So why does not marking a pack preferred cause problems here? The reason is roughly as follows: - Ties are broken (when handling duplicate objects) by sorting according to midx_oid_compare(), which sorts objects by OID, preferred-ness, pack mtime, and finally pack ID (more on that later). - The psuedo pack-order (described in Documentation/technical/pack-format.txt under the section "multi-pack-index reverse indexes") is computed by midx_pack_order(), and sorts by pack ID and pack offset, with preferred packs sorting first. - But! Pack IDs come from incrementing the pack count in add_pack_to_midx(), which is a callback to for_each_file_in_pack_dir(), meaning that pack IDs are assigned in readdir() order. When specifying a preferred pack, all of that works fine, because duplicate objects are correctly resolved in favor of the copy in the preferred pack, and the preferred pack sorts first in the object order. "Sorting first" is critical, because the bitmap code relies on finding out which pack holds the first object in the MIDX's pseudo pack-order to determine which pack is preferred. But if we didn't specify a preferred pack, and the pack which comes first in readdir() order does not also have the lowest timestamp, then it's possible that that pack (the one that sorts first in pseudo-pack order, which the bitmap code will treat as the preferred one) did *not* have all duplicate objects resolved in its favor, resulting in breakage. The fix is simple: pick a (semi-arbitrary, non-empty) preferred pack when none was specified. This forces that pack to have duplicates resolved in its favor, and (critically) to sort first in pseudo-pack order. Unfortunately, testing this behavior portably isn't possible, since it depends on readdir() order which isn't guaranteed by POSIX. (Note that multi-pack reachability bitmaps have yet to be implemented; so in that sense this patch is fixing a bug which does not yet exist. But by having this patch beforehand, we can prevent the bug from ever materializing.) Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: reject empty `--preferred-pack`'sTaylor Blau1-0/+29
The soon-to-be-implemented multi-pack bitmap treats object in the first bit position specially by assuming that all objects in the pack it was selected from are also represented from that pack in the MIDX. In other words, the pack from which the first object was selected must also have all of its other objects selected from that same pack in the MIDX in case of any duplicates. But this assumption relies on the fact that there is at least one object in that pack to begin with; otherwise the object in the first bit position isn't from a preferred pack, in which case we can no longer assume that all objects in that pack were also selected from the same pack. Guard this assumption by checking the number of objects in the given preferred pack, and failing if the given pack is empty. To make sure we can safely perform this check, open any packs which are contained in an existing MIDX via prepare_midx_pack(). The same is done for new packs via the add_pack_to_midx() callback, but packs picked up from a previous MIDX will not yet have these opened. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: clear auxiliary .rev after replacing the MIDXTaylor Blau1-1/+2
When writing a new multi-pack index, write_midx_internal() attempts to clean up any auxiliary files (currently just the MIDX's `.rev` file, but soon to include a `.bitmap`, too) corresponding to the MIDX it's replacing. This step should happen after the new MIDX is written into place, since doing so beforehand means that the old MIDX could be read without its corresponding .rev file. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: fix `*.rev` cleanups with `--object-dir`Taylor Blau1-5/+5
If using --object-dir to point into an object directory which belongs to a different repository than the one in the current working directory, such as: git init repo git -C repo ... # add some objects cd alternate git multi-pack-index --object-dir ../repo/.git/objects write the binary will segfault trying to access the object-dir via the repo it found, but that's not fully initialized. Worse, if we later call clear_midx_files_ext(), we will use `the_repository` and remove files out of the wrong object directory. Fix this by using the given object_dir (or the object directory of `the_repository` if `--object-dir` wasn't given) to properly to clean up the *.rev files, avoiding the crash. Original-patch-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-28Merge branch 'ab/attribute-format'Junio C Hamano1-0/+1
Many "printf"-like helper functions we have have been annotated with __attribute__() to catch placeholder/parameter mismatches. * ab/attribute-format: advice.h: add missing __attribute__((format)) & fix usage *.h: add a few missing __attribute__((format)) *.c static functions: add missing __attribute__((format)) sequencer.c: move static function to avoid forward decl *.c static functions: don't forward-declare __attribute__
2021-07-13*.c static functions: add missing __attribute__((format))Ævar Arnfjörð Bjarmason1-0/+1
Add missing __attribute__((format)) function attributes to various "static" functions that take printf arguments. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28midx: report checksum mismatches during 'verify'Taylor Blau1-0/+3
'git multi-pack-index verify' inspects the data in an existing MIDX for correctness by checking that the recorded object offsets are correct, and so on. But it does not check that the file's trailing checksum matches the data that it records. So, if an on-disk corruption happened to occur in the final few bytes (and all other data was recorded correctly), we would: - get a clean result from 'git multi-pack-index verify', but - be unable to reuse the existing MIDX when writing a new one (since we now check for checksum mismatches before reusing a MIDX) Teach the 'verify' sub-command to recognize corruption in the checksum by calling midx_checksum_valid(). Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28midx: don't reuse corrupt MIDXs when writingTaylor Blau1-0/+10
When writing a new multi-pack index, Git tries to reuse as much of the data from an existing MIDX as possible, like object offsets. This is done to avoid re-opening a bunch of *.idx files unnecessarily, but can lead to problems if the data we are reusing is corrupt. That's because we'll blindly reuse data from an existing MIDX without checking its trailing checksum for validity. So if there is memory corruption while writing a MIDX, or disk corruption in the intervening period between writing and reuse, we'll blindly propagate those bad values forward. Suppose we experience a memory corruption while writing a MIDX such that we write an incorrect object offset (or alternatively, the disk corrupts the data after being written, but before being reused). Then when we go to write a new MIDX, we'll reuse the bad object offset without checking its validity. This means that the MIDX we just wrote is broken, but its trailing checksum is in-tact, since we never bothered to look at the values before writing. In the above, a "git multi-pack-index verify" would have caught the problem before writing, but writing a new MIDX wouldn't have noticed anything wrong, blindly carrying forward the corrupt offset. Individual pack indexes check their validity by verifying the crc32 attached to each entry when carrying data forward during a repack. We could solve this problem for MIDXs in the same way, but individual crc32's don't make much sense, since their entries are so small. Likewise, checking the whole file on every read may be prohibitively expensive if a repository has a lot of objects, packs, or both. But we can check the trailing checksum when reusing an existing MIDX when writing a new one. And a corrupt MIDX need not stop us from writing a new one, since we can just avoid reusing the existing one at all and pretend as if we are writing a new MIDX from scratch. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27Always use oidread to read into struct object_idbrian m. carlson1-1/+1
In the future, we'll want oidread to automatically set the hash algorithm member for an object ID we read into it, so ensure we use oidread instead of hashcpy everywhere we're copying a hash value into a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-08Merge branch 'tb/reverse-midx'Junio C Hamano1-13/+206
An on-disk reverse-index to map the in-pack location of an object back to its object name across multiple packfiles is introduced. * tb/reverse-midx: midx.c: improve cache locality in midx_pack_order_cmp() pack-revindex: write multi-pack reverse indexes pack-write.c: extract 'write_rev_file_order' pack-revindex: read multi-pack reverse indexes Documentation/technical: describe multi-pack reverse indexes midx: make some functions non-static midx: keep track of the checksum midx: don't free midx_name early midx: allow marking a pack as preferred t/helper/test-read-midx.c: add '--show-objects' builtin/multi-pack-index.c: display usage on unrecognized command builtin/multi-pack-index.c: don't enter bogus cmd_mode builtin/multi-pack-index.c: split sub-commands builtin/multi-pack-index.c: define common usage with a macro builtin/multi-pack-index.c: don't handle 'progress' separately builtin/multi-pack-index.c: inline 'flags' with options
2021-04-01midx.c: improve cache locality in midx_pack_order_cmp()Jeff King1-26/+29
There is a lot of pointer dereferencing in the pre-image version of 'midx_pack_order_cmp()', which this patch gets rid of. Instead of comparing the pack preferred-ness and then the pack id, both of these checks are done at the same time by using the high-order bit of the pack id to represent whether it's preferred. Then the pack id and offset are compared as usual. This produces the same result so long as there are less than 2^31 packs, which seems like a likely assumption to make in practice. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01pack-revindex: write multi-pack reverse indexesTaylor Blau1-0/+115
Implement the writing half of multi-pack reverse indexes. This is nothing more than the format describe a few patches ago, with a new set of helper functions that will be used to clear out stale .rev files corresponding to old MIDXs. Unfortunately, a very similar comparison function as the one implemented recently in pack-revindex.c is reimplemented here, this time accepting a MIDX-internal type. An effort to DRY these up would create more indirection and overhead than is necessary, so it isn't pursued here. Currently, there are no callers which pass the MIDX_WRITE_REV_INDEX flag, meaning that this is all dead code. But, that won't be the case for long, since subsequent patches will introduce the multi-pack bitmap, which will begin passing this field. (In midx.c:write_midx_internal(), the two adjacent if statements share a conditional, but are written separately since the first one will eventually also handle the MIDX_WRITE_BITMAP flag, which does not yet exist.) Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01pack-revindex: read multi-pack reverse indexesTaylor Blau1-0/+11
Implement reading for multi-pack reverse indexes, as described in the previous patch. Note that these functions don't yet have any callers, and won't until multi-pack reachability bitmaps are introduced in a later patch series. In the meantime, this patch implements some of the infrastructure necessary to support multi-pack bitmaps. There are three new functions exposed by the revindex API: - load_midx_revindex(): loads the reverse index corresponding to the given multi-pack index. - midx_to_pack_pos() and pack_pos_to_midx(): these convert between the multi-pack index and pseudo-pack order. load_midx_revindex() and pack_pos_to_midx() are both relatively straightforward. load_midx_revindex() needs a few functions to be exposed from the midx API. One to get the checksum of a midx, and another to get the .rev's filename. Similar to recent changes in the packed_git struct, three new fields are added to the multi_pack_index struct: one to keep track of the size, one to keep track of the mmap'd pointer, and another to point past the header and at the reverse index's data. pack_pos_to_midx() simply reads the corresponding entry out of the table. midx_to_pack_pos() is the trickiest, since it needs to find an object's position in the psuedo-pack order, but that order can only be recovered in the .rev file itself. This mapping can be implemented with a binary search, but note that the thing we're binary searching over isn't an array of values, but rather a permuted order of those values. So, when comparing two items, it's helpful to keep in mind the difference. Instead of a traditional binary search, where you are comparing two things directly, here we're comparing a (pack, offset) tuple with an index into the multi-pack index. That index describes another (pack, offset) tuple, and it is _those_ two tuples that are compared. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: make some functions non-staticTaylor Blau1-2/+2
In a subsequent commit, pack-revindex.c will become responsible for sorting a list of objects in the "MIDX pack order" (which will be defined in the following patch). To do so, it will need to be know the pack identifier and offset within that pack for each object in the MIDX. The MIDX code already has functions for doing just that (nth_midxed_offset() and nth_midxed_pack_int_id()), but they are statically declared. Since there is no reason that they couldn't be exposed publicly, and because they are already doing exactly what the caller in pack-revindex.c will want, expose them publicly so that they can be reused there. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: keep track of the checksumTaylor Blau1-1/+2
write_midx_internal() uses a hashfile to write the multi-pack index, but discards its checksum. This makes sense, since nothing that takes place after writing the MIDX cares about its checksum. That is about to change in a subsequent patch, when the optional reverse index corresponding to the MIDX will want to include the MIDX's checksum. Store the checksum of the MIDX in preparation for that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: don't free midx_name earlyTaylor Blau1-1/+0
A subsequent patch will need to refer back to 'midx_name' later on in the function. In fact, this variable is already free()'d later on, so this makes the later free() no longer redundant. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: allow marking a pack as preferredTaylor Blau1-9/+73
When multiple packs in the multi-pack index contain the same object, the MIDX machinery must make a choice about which pack it associates with that object. Prior to this patch, the lowest-ordered[1] pack was always selected. Pack selection for duplicate objects is relatively unimportant today, but it will become important for multi-pack bitmaps. This is because we can only invoke the pack-reuse mechanism when all of the bits for reused objects come from the reuse pack (in order to ensure that all reused deltas can find their base objects in the same pack). To encourage the pack selection process to prefer one pack over another (the pack to be preferred is the one a caller would like to later use as a reuse pack), introduce the concept of a "preferred pack". When provided, the MIDX code will always prefer an object found in a preferred pack over any other. No format changes are required to store the preferred pack, since it will be able to be inferred with a corresponding MIDX bitmap, by looking up the pack associated with the object in the first bit position (this ordering is described in detail in a subsequent commit). [1]: the ordering is specified by MIDX internals; for our purposes we can consider the "lowest ordered" pack to be "the one with the most-recent mtime. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13use CALLOC_ARRAYRené Scharfe1-4/+4
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-01Merge branch 'ds/chunked-file-api'Junio C Hamano1-264/+169
The common code to deal with "chunked file format" that is shared by the multi-pack-index and commit-graph files have been factored out, to help codepaths for both filetypes to become more robust. * ds/chunked-file-api: commit-graph.c: display correct number of chunks when writing chunk-format: add technical docs chunk-format: restore duplicate chunk checks midx: use 64-bit multiplication for chunk sizes midx: use chunk-format read API commit-graph: use chunk-format read API chunk-format: create read chunk API midx: use chunk-format API in write_midx_internal() midx: drop chunk progress during write midx: return success/failure in chunk write methods midx: add num_large_offsets to write_midx_context midx: add pack_perm to write_midx_context midx: add entries to write_midx_context midx: use context in write_midx_pack_names() midx: rename pack_info to write_midx_context commit-graph: use chunk-format write API chunk-format: create chunk format write API commit-graph: anonymize data in chunk_write_fn
2021-02-24Merge branch 'ds/chunked-file-api' into tb/reverse-midxJunio C Hamano1-264/+169
* ds/chunked-file-api: commit-graph.c: display correct number of chunks when writing chunk-format: add technical docs chunk-format: restore duplicate chunk checks midx: use 64-bit multiplication for chunk sizes midx: use chunk-format read API commit-graph: use chunk-format read API chunk-format: create read chunk API midx: use chunk-format API in write_midx_internal() midx: drop chunk progress during write midx: return success/failure in chunk write methods midx: add num_large_offsets to write_midx_context midx: add pack_perm to write_midx_context midx: add entries to write_midx_context midx: use context in write_midx_pack_names() midx: rename pack_info to write_midx_context commit-graph: use chunk-format write API chunk-format: create chunk format write API commit-graph: anonymize data in chunk_write_fn
2021-02-18midx: use 64-bit multiplication for chunk sizesDerrick Stolee1-5/+6
When calculating the sizes of certain chunks, we should use 64-bit multiplication always. This allows us to properly predict the chunk sizes without risk of overflow. Other possible overflows were discovered by evaluating each multiplication in midx.c and ensuring that at least one side of the operator was of type size_t or off_t. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: use chunk-format read APIDerrick Stolee1-47/+26
Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). In particular, we can use the return value of pair_chunk() to generate an error when a required chunk is missing. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: use chunk-format API in write_midx_internal()Derrick Stolee1-86/+20
The chunk-format API allows writing the table of contents and all chunks using the anonymous 'struct chunkfile' type. We only need to convert our local chunk logic to this API for the multi-pack-index writes to share that logic with the commit-graph file writes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: drop chunk progress during writeDerrick Stolee1-7/+0
Most expensive operations in write_midx_internal() use the context struct's progress member, and these indicate the process of the expensive operations within the chunk writing methods. However, there is a competing progress struct that counts the progress over all chunks. This is not very helpful compared to the others, so drop it. This also reduces our barriers to combining the chunk writing code with chunk-format.c. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: return success/failure in chunk write methodsDerrick Stolee1-36/+27
Historically, the chunk-writing methods in midx.c have returned the amount of data written so the writer method could compare this with the table of contents. This presents with some interesting issues: 1. If a chunk writing method has a bug that miscalculates the written bytes, then we can satisfy the table of contents without actually writing the right amount of data to the hashfile. The commit-graph writing code checks the hashfile struct directly for a more robust verification. 2. There is no way for a chunk writing method to gracefully fail. Returning an int presents an opportunity to fail without a die(). 3. The current pattern doesn't match chunk_write_fn type exactly, so we cannot share code with commit-graph.c For these reasons, convert the midx chunk writer methods to return an 'int'. Since none of them fail at the moment, they all return 0. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: add num_large_offsets to write_midx_contextDerrick Stolee1-7/+10
In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "uint32_t num_large_offsets" into the context. With this new data, write_midx_large_offsets() now matches the chunk_write_fn type. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: add pack_perm to write_midx_contextDerrick Stolee1-19/+21
In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "uint32_t *pack_perm" and large_offsets_needed bit into the context. Update write_midx_object_offsets() to match chunk_write_fn. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: add entries to write_midx_contextDerrick Stolee1-23/+26
In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "struct pack_midx_entry *entries" list and its count into the context. Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the context directly, as these are easy conversions with this new data. Only the callers of write_midx_object_offsets() and write_midx_large_offsets() are updated here, since additional data in the context before those methods can match chunk_write_fn. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: use context in write_midx_pack_names()Derrick Stolee1-11/+10
In an effort to align the write_midx_internal() to use the chunk-format API, start converting chunk writing methods to match chunk_write_fn. The first case is to convert write_midx_pack_names() to take "void *data". We already have the necessary data in "struct write_midx_context", so this conversion is rather mechanical. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18midx: rename pack_info to write_midx_contextDerrick Stolee1-65/+65
In an effort to streamline our chunk-based file formats, align some of the code structure in write_midx_internal() to be similar to the patterns in write_commit_graph_file(). Specifically, let's create a "struct write_midx_context" that can be used as a data parameter to abstract function types. This change only renames "struct pack_info" to "struct write_midx_context" and the names of instances from "packs" to "ctx". In future changes, we will expand the data inside "struct write_midx_context" and align our chunk-writing method with the chunk-format API. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25Merge branch 'ma/more-opaque-lock-file'Junio C Hamano1-1/+1
Code clean-up. * ma/more-opaque-lock-file: read-cache: try not to peek into `struct {lock_,temp}file` refs/files-backend: don't peek into `struct lock_file` midx: don't peek into `struct lock_file` commit-graph: don't peek into `struct lock_file` builtin/gc: don't peek into `struct lock_file`
2021-01-06midx: don't peek into `struct lock_file`Martin Ågren1-1/+1
Similar to the previous commits, avoid peeking into the `struct lock_file`. Use the lock file API instead. The two functions we're calling here double-check that the tempfile is indeed "active", which is arguably overkill considering how we took the lock on the line immediately above. More importantly, this future-proofs us against, e.g., other code appearing between these two lines or the lock file and/or tempfile internals changing. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04hash-lookup: rename from sha1-lookupMartin Ågren1-1/+1
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08Merge branch 'tb/idx-midx-race-fix'Junio C Hamano1-1/+1
Processes that access packdata while the .idx file gets removed (e.g. while repacking) did not fail or fall back gracefully as they could. * tb/idx-midx-race-fix: midx.c: protect against disappearing packs packfile.c: protect against disappearing indexes
2020-11-25Merge branch 'rs/hashwrite-be64'Junio C Hamano1-5/+2
Code simplification. * rs/hashwrite-be64: pack-write: use hashwrite_be64() midx: use hashwrite_be64() csum-file: add hashwrite_be64()
2020-11-25midx.c: protect against disappearing packsTaylor Blau1-1/+1
When a packed object is stored in a multi-pack index, but that pack has racily gone away, the MIDX code simply calls die(), when it could be returning an error to the caller, which would in turn lead to re-scanning the pack directory. A pack can racily disappear, for example, due to a simultaneous 'git repack -ad', You can also reproduce this with two terminals, where one is running: git init while true; do git commit -q --allow-empty -m foo git repack -ad git multi-pack-index write done (in effect, constantly writing new MIDXs), and the other is running: obj=$(git rev-parse HEAD) while true; do echo $obj | git cat-file --batch-check='%(objectsize:disk)' || break done That will sometimes hit the error preparing packfile from multi-pack-index message, which this patch fixes. Right now, that path to discovering a missing pack looks something like 'find_pack_entry()' calling 'fill_midx_entry()' and eventually making its way to call 'nth_midxed_pack_entry()'. 'nth_midxed_pack_entry()' already checks 'is_pack_valid()' and propagates an error if the pack is invalid. So, this works if the pack has gone away between calling 'prepare_midx_pack()' and before calling 'is_pack_valid()', but not if it disappears before then. Catch the case where the pack has already disappeared before 'prepare_midx_pack()' by returning an error in that case, too. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-12midx: use hashwrite_be64()René Scharfe1-5/+2
Call hashwrite_be64() to write 64-bit values instead of open-coding it using hashwrite_be32() and sizeof. This shortens the code and makes its intent clearer. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-27Merge branch 'ds/maintenance-part-2'Junio C Hamano1-13/+8
"git maintenance", an extended big brother of "git gc", continues to evolve. * ds/maintenance-part-2: maintenance: add incremental-repack auto condition maintenance: auto-size incremental-repack batch maintenance: add incremental-repack task midx: use start_delayed_progress() midx: enable core.multiPackIndex by default maintenance: create auto condition for loose-objects maintenance: add loose-objects task maintenance: add prefetch task
2020-09-25midx: use start_delayed_progress()Derrick Stolee1-5/+5
Now that the multi-pack-index may be written as part of auto maintenance at the end of a command, reduce the progress output when the operations are quick. Use start_delayed_progress() instead of start_progress(). Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that the progress indicators are conditional. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25midx: enable core.multiPackIndex by defaultDerrick Stolee1-8/+3
The core.multiPackIndex setting has been around since c4d25228ebb (config: create core.multiPackIndex setting, 2018-07-12), but has been disabled by default. If a user wishes to use the multi-pack-index feature, then they must enable this config and run 'git multi-pack-index write'. The multi-pack-index feature is relatively stable now, so make the config option true by default. For users that do not use a multi-pack-index, the only extra cost will be a file lookup to see if a multi-pack-index file exists (once per process, per object directory). Also, this config option will be referenced by an upcoming "incremental-repack" task in the maintenance builtin, so move the config option into the repository settings struct. Note that if GIT_TEST_MULTI_PACK_INDEX=1, then we want to ignore the config option and treat core.multiPackIndex as enabled. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18Merge branch 'rs/misc-cleanups'Junio C Hamano1-7/+4
Misc cleanups. * rs/misc-cleanups: pack-bitmap-write: use hashwrite_be32() in write_hash_cache() midx: use hashwrite_u8() in write_midx_header() fast-import: use write_pack_header()
2020-09-09Merge branch 'tb/repack-clearing-midx'Junio C Hamano1-2/+6
When a packfile is removed by "git repack", multi-pack-index gets cleared; the code was taught to do so less aggressively by first checking if the midx actually refers to a pack that no longer exists. * tb/repack-clearing-midx: midx: traverse the local MIDX first builtin/repack.c: invalidate MIDX only when necessary
2020-09-06midx: use hashwrite_u8() in write_midx_header()René Scharfe1-7/+4
Emit byte-sized values using hashwrite_u8() instead of buffering them locally first. The hashwrite functions already do their own buffering, so this double-buffering does not reduce the number of system calls. Getting rid of it shortens and simplifies the code a bit. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-28midx: traverse the local MIDX firstTaylor Blau1-2/+6
When a repository has an alternate object directory configured, callers can traverse through each alternate's MIDX by walking the '->next' pointer. But, when 'prepare_multi_pack_index_one()' loads multiple MIDXs, it places the new ones at the front of this pointer chain, not at the end. This can be confusing for callers such as 'git repack -ad', causing test failures like in t7700.6 with 'GIT_TEST_MULTI_PACK_INDEX=1'. The occurs when dropping a pack known to the local MIDX with alternates configured that have their own MIDX. Since the alternate's MIDX is returned via 'get_multi_pack_index()', 'midx_contains_pack()' returns true (which is correct, since it traverses through the '->next' pointer to find the MIDX in the chain that does contain the requested object). But, we call 'clear_midx_file()' on 'the_repository', which drops the MIDX at the path of the first MIDX in the chain, which (in the case of t7700.6 is the one in the alternate). This patch addresses that by: - placing the local MIDX first in the chain when calling 'prepare_multi_pack_index_one()', and - introducing a new 'get_local_multi_pack_index()', which explicitly returns the repository-local MIDX, if any. Don't impose an additional order on the MIDX's '->next' pointer beyond that the first item in the chain must be local if one exists so that we avoid a quadratic insertion. Likewise, use 'get_local_multi_pack_index()' in 'remove_redundant_pack()' to fix the formerly broken t7700.6 when run with 'GIT_TEST_MULTI_PACK_INDEX=1'. Finally, note that the MIDX ordering invariant is only preserved by the insertion order in 'prepare_packed_git()', which traverses through the ODB's '->next' pointer, meaning we visit the local object store first. This fragility makes this an undesirable long-term solution if more callers are added, but it is acceptable for now since this is the only caller. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24Merge branch 'rs/more-buffered-io'Junio C Hamano1-3/+5
Use more buffered I/O where we used to call many small write(2)s. * rs/more-buffered-io: upload-pack: use buffered I/O to talk to rev-list midx: use buffered I/O to talk to pack-objects connected: use buffered I/O to talk to rev-list
2020-08-24Merge branch 'jk/unleak-fixes'Junio C Hamano1-6/+2
Fix some incorrect UNLEAK() annotations. * jk/unleak-fixes: ls-remote: simplify UNLEAK() usage stop calling UNLEAK() before die()
2020-08-24Merge branch 'ds/midx-repack-to-batch-size'Junio C Hamano1-1/+1
The "--batch-size" option of "git multi-pack-index repack" command is now used to specify that very small packfiles are collected into one until the total size roughly exceeds it. * ds/midx-repack-to-batch-size: multi-pack-index: repack batches below --batch-size
2020-08-17multi-pack-index: use hash version byteDerrick Stolee1-6/+29
Similar to the commit-graph format, the multi-pack-index format has a byte in the header intended to track the hash version used to write the file. This allows one to interpret the hash length without having the context of the repository config specifying the hash length. This was not modified as part of the SHA-256 work because the hash length was automatically up-shifted due to that config. Since we have this byte available, we can make the file formats more obviously incompatible instead of relying on other context from the repository. Add a new oid_version() method in midx.c similar to the one in commit-graph.c. This is specifically made separate from that implementation to avoid artificially linking the formats. The test impact requires a few more things than the corresponding change in the commit-graph format. Specifically, 'test-tool read-midx' was not writing anything about this header value to output. Since the value available in 'struct multi_pack_index' is hash_len instead of a version value, we output "20" or "32" instead of "1" or "2". Since we want a user to not have their Git commands fail if their multi-pack-index has the incorrect hash version compared to the repository's hash version, we relax the die() to an error() in load_multi_pack_index(). This has some effect on 'git multi-pack-index verify' as we need to check that a failed parse of a file that exists is actually a verify error. For that test that checks the hash version matches, we change the corrupted byte from "2" to "3" to ensure the test fails for both hash algorithms. Helped-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17midx: use buffered I/O to talk to pack-objectsRené Scharfe1-3/+5
Like f0bca72dc77 (send-pack: use buffered I/O to talk to pack-objects, 2016-06-08), significantly reduce the number of system calls and simplify the code for sending object IDs to pack-objects by using stdio's buffering. Helped-by: Chris Torek <chris.torek@gmail.com> Helped-by: Johannes Sixt <j6t@kdbg.org> Encouraged-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-13stop calling UNLEAK() before die()Jeff King1-6/+2
The point of UNLEAK() is to make a reference to a variable that is about to go out of scope so that leak-checkers will consider it to be not-leaked. Doing so right before die() is therefore pointless; even though we are about to exit the program, the variable will still be on the stack and accessible to leak-checkers. These annotations aren't really hurting anything, but they clutter the code and set a bad example of how to use UNLEAK(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-11multi-pack-index: repack batches below --batch-sizeDerrick Stolee1-1/+1
The --batch-size=<size> option of 'git multi-pack-index repack' is intended to limit the amount of work done by the repack. In the case of a large repository, this command should repack a number of small pack-files but leave the large pack-files alone. Most often, the repository has one large pack-file from a 'git clone' operation and number of smaller pack-files from incremental 'git fetch' operations. The issue with '--batch-size' is that it also _prevents_ the repack from happening if the expected size of the resulting pack-file is too small. This was intended as a way to avoid frequent churn of small pack-files, but it has mostly caused confusion when a repository is of "medium" size. That is, not enormous like the Windows OS repository, but also not so small that this incremental repack isn't valuable. The solution presented here is to collect pack-files for repack if their expected size is smaller than the batch-size parameter until either the total expected size exceeds the batch-size or all pack-files are considered. If there are at least two pack-files, then these are combined to a new pack-file whose size should not be too much larger than the batch-size. This new strategy should succeed in keeping the number of pack-files small in these "medium" size repositories. The concern about churn is likely not interesting, as the real control over that is the frequency in which the repack command is run. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28strvec: convert remaining callers away from argv_array nameJeff King1-6/+6
We eventually want to drop the argv_array name and just use strvec consistently. There's no particular reason we have to do it all at once, or care about interactions between converted and unconverted bits. Because of our preprocessor compat layer, the names are interchangeable to the compiler (so even a definition and declaration using different names is OK). This patch converts all of the remaining files, as the resulting diff is reasonably sized. The conversion was done purely mechanically with: git ls-files '*.c' '*.h' | xargs perl -i -pe ' s/ARGV_ARRAY/STRVEC/g; s/argv_array/strvec/g; ' We'll deal with any indentation/style fallouts separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-10multi-pack-index: respect repack.packKeptObjects=falseDerrick Stolee1-5/+21
When selecting a batch of pack-files to repack in the "git multi-pack-index repack" command, Git should respect the repack.packKeptObjects config option. When false, this option says that the pack-files with an associated ".keep" file should not be repacked. This config value is "false" by default. There are two cases for selecting a batch of objects. The first is the case where the input batch-size is zero, which specifies "repack everything". The second is with a non-zero batch size, which selects pack-files using a greedy selection criteria. Both of these cases are updated and tested. Reported-by: Son Luong Ngoc <sluongng@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-10midx: teach "git multi-pack-index repack" honor "git repack" configurationsSon Luong Ngoc1-0/+16
When the "repack" subcommand of "git multi-pack-index" command creates new packfile(s), it does not call the "git repack" command but instead directly calls the "git pack-objects" command, and the configuration variables meant for the "git repack" command, like "repack.usedaeltabaseoffset", are ignored. Check the configuration variables used by "git repack" ourselves in "git multi-index-pack" and pass the corresponding options to underlying "git pack-objects". Note that `repack.writeBitmaps` configuration is ignored, as the pack bitmap facility is useful only with a single packfile. Signed-off-by: Son Luong Ngoc <sluongng@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-01Merge branch 'ds/multi-pack-index'Junio C Hamano1-3/+1
The multi-pack-index left mmapped file descriptors open when it does not have to. * ds/multi-pack-index: multi-pack-index: close file descriptor after mmap
2020-04-24multi-pack-index: close file descriptor after mmapDerrick Stolee1-3/+1
The multi-pack-index subsystem was not closing its file descriptor after memory-mapping the file contents. After this mmap() succeeds, there is no need to keep the file descriptor open. In fact, there is signficant reason to close it so we do not run out of descriptors. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-28midx.c: fix an integer underflowDamien Robert1-0/+15
When verifying a midx index with 0 objects, the m->num_objects - 1 underflows and wraps around to 4294967295. Fix this both by checking that the midx contains at least one oid, and also that we don't write any midx when there is no packfiles. Update the tests to check that `git multi-pack-index write` does not write an midx when there is no objects, and another to check that `git multi-pack-index verify` warns when it verifies an midx with no objects. For this last test, use t5319/no-objects.midx which was generated by an older version of git. Signed-off-by: Damien Robert <damien.olivier.robert+git@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24nth_packed_object_oid(): use customary integer returnJeff King1-1/+1
Our nth_packed_object_sha1() function returns NULL for error. So when we wrapped it with nth_packed_object_oid(), we kept the same semantics. But it's a bit funny, because the caller actually passes in an out parameter, and the pointer we return is just that same struct they passed to us (or NULL). It's not too terrible, but it does make the interface a little non-idiomatic. Let's switch to our usual "0 for success, negative for error" return value. Most callers either don't check it, or are trivially converted. The one that requires the biggest change is actually improved, as we can ditch an extra aliased pointer variable. Since we are changing the interface in a subtle way that the compiler wouldn't catch, let's also change the name to catch any topics in flight. We can drop the 'o' and make it nth_packed_object_id(). That's slightly shorter, but also less redundant since the 'o' stands for "object" already. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23midx: honor the MIDX_PROGRESS flag in midx_repackWilliam Baker1-0/+6
Update midx_repack to only display progress when the MIDX_PROGRESS flag is set. Signed-off-by: William Baker <William.Baker@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>