|
When preparing the packfile store we know to also prepare the MRU list
of packfiles with all packs that are currently loaded in the store via
`packfile_store_prepare_mru()`. So we know that the list of packs in the
MRU list should match the list of packs in the non-MRU list.
But there are some direct or indirect callsites that add a packfile to
the store via `packfile_store_add_pack()` without adding the pack to the
MRU. And while functions that access the MRU (e.g. `find_pack_entry()`)
know to call `packfile_store_prepare()`, which knows to prepare the MRU
via `packfile_store_prepare_mru()`, that operation will be turned into a
no-op because the packfile store is already prepared. So this will not
cause us to add the packfile to the MRU, and consequently we won't be
able to find the packfile in our MRU list.
There are only a handful of callers outside of "packfile.c" that add a
packfile to the store:
- "builtin/fast-import.c" adds multiple packs of imported objects, but
it knows to look up objects via `packfile_store_get_packs()`. This
function does not use the MRU, so we're good.
- "builtin/index-pack.c" adds the indexed pack to the store in case it
needs to perform consistency checks on its objects.
- "http.c" adds the fetched pack to the store so that we can access
its objects.
In all of these cases we actually want to access the contained objects.
And luckily, reading these objects works as expected:
1. We eventually end up in `do_oid_object_info_extended()`.
2. Calling `find_pack_entry()` fails because the MRU list doesn't
contain the newly added packfile.
3. The callers don't pass `OBJECT_INFO_QUICK`, so we end up
repreparing the object database. This will also cause us to
reprepare the MRU list.
4. We now retry reading the object via `find_pack_entry()`, and now we
succeed because the MRU list got populated.
This logic feels quite fragile: we intentionally add the packfile to the
store, but we then ultimately rely on repreparing the entire store only
to make the packfile accessible. While we do the correct thing in
`do_oid_object_info_extended()`, other sites that access the MRU may not
know to reprepare.
But besides being fragile it's also a waste of resources: repreparing
the object database requires us to re-read the alternates file and
discard any caches.
Refactor the code so that we unconditionally add packfiles to the MRU
when adding them to a packfile store. This makes the logic less fragile
and ensures that we don't have to reprepare the store to make the pack
accessible.
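As a rough illustration of the new behavior, a minimal sketch (the
helper name `packfile_list_append()` and the field names are
assumptions for illustration, not the actual code):
```
void packfile_store_add_pack(struct packfile_store *store,
			     struct packed_git *pack)
{
	/* Keep adding the pack to the insertion-ordered list, as before. */
	packfile_list_append(&store->packs, pack);

	/*
	 * New: also add it to the MRU list right away, so that lookups
	 * via find_pack_entry() can see the pack without repreparing
	 * the whole store.
	 */
	packfile_list_append(&store->mru, pack);
}
```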
Note that this does not allow us to drop `packfile_store_prepare_mru()`
just yet: while the MRU list is already populated with all packs now,
the order in which we add these packs is nondeterministic for the most
part. So by first calling `sort_pack()` on the other packfile list and
then re-preparing the MRU list we inherit its sorting.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Packfiles have two lists associated with them:
- A list that keeps track of packfiles in the order that they were
added to a packfile store.
- A list that keeps track of packfiles in most-recently-used order so
that packfiles that are more likely to contain a specific object are
ordered towards the front.
Both of these lists are hosted by `struct packed_git` itself, so to
identify all packfiles in a repository you simply need to grab the first
packfile and then iterate the `->next` pointers or the MRU list. This
pattern has the problem that all packfiles are part of the same list,
regardless of whether or not they belong to the same object source.
With the upcoming pluggable object database effort this needs to change:
packfiles should be contained by a single object source, and reading an
object from any such packfile should use that source to look up the
object. Consequently, we need to break up the global lists of packfiles
into per-object-source lists.
A first step towards this goal is to move those lists out of `struct
packed_git` and into the packfile store. While the packfile store is
currently sitting on the `struct object_database` level, the intent is
to push it down one level into the `struct odb_source` in a subsequent
patch series.
Introduce a new `struct packfile_list` that is used to manage lists of
packfiles and use it to store the list of most-recently-used packfiles
in `struct packfile_store`. For now, the new list type is only used in a
single spot, but we'll expand its usage in subsequent patches.
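A hypothetical sketch of how such a list type and its use in the store
could look (field names are illustrative assumptions, not the actual
declarations):
```
struct packfile_list_entry {
	struct packed_git *pack;
	struct packfile_list_entry *next;
};

struct packfile_list {
	struct packfile_list_entry *head;
	struct packfile_list_entry *tail;
};

struct packfile_store {
	/* Insertion-ordered list of packs, still linked via packed_git. */
	struct packed_git *packs;
	/* Most-recently-used list, now hosted by the store itself. */
	struct packfile_list mru;
};
```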
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `get_multi_pack_index()` function is declared and implemented in the
packfile subsystem, even though it really belongs in the multi-pack
index subsystem. The reason for this is likely that it needs to call
`packfile_store_prepare()`, which is not exposed by the packfile
subsystem. In a subsequent commit we're about to add another caller
outside of the packfile subsystem though, so we'll have to expose the
function anyway.
Do so now already and move `get_multi_pack_index()` into the MIDX
subsystem.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We have a recurring pattern where we essentially perform an upsert of a
packfile in case it isn't yet known by the packfile store. The logic to
do so is non-trivial as we have to reconstruct the packfile's key, check
the map of packfiles, then create the new packfile and finally add it to
the store.
Introduce a new function that does this dance for us. Refactor callsites
to use it.
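A sketch of what such an upsert helper might look like (the helper and
lookup names are assumptions; the exact signature of `add_packed_git()`
may also differ):
```
struct packed_git *packfile_store_load_pack(struct packfile_store *store,
					    const char *idx_path, int local)
{
	struct packed_git *p;

	/* Reconstruct the key and check whether the pack is already known. */
	p = packfile_store_lookup(store, idx_path);
	if (p)
		return p;

	/* Otherwise create the packfile and add it to the store. */
	p = add_packed_git(store->repo, idx_path, strlen(idx_path), local);
	if (p)
		packfile_store_add_pack(store, p);
	return p;
}
```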
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `install_packed_git()` function adds a packfile to a specific
object store. Refactor it to accept a packfile store instead of a
repository to clarify its scope.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The object database tracks the list of packfiles in most-recently-used
order, which is mostly used to favor reading from packfiles that contain
most of the objects that we're currently accessing. With the
introduction of the `struct packfile_store` we have a better place to
host this list though.
Move the list accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The object database tracks a map of packfiles by their respective paths,
which is used to figure out whether a given packfile has already been
loaded. With the introduction of the `struct packfile_store` we have a
better place to host this list though.
Move the map accordingly.
`pack_map_entry_cmp()` isn't used anywhere but in "packfile.c" anymore
after this change, so we convert it to a static function, as well. Note
that we also drop the `inline` hint: the function is used as a callback
function exclusively, and callbacks cannot be inlined.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
With the preceding commits we started to always have the object database
source available when we load, write or access multi-pack indices. With
this in place we can change how MIDX paths are computed so that we don't
have to pass in the combination of a hash algorithm and object directory
anymore, but only the object database source.
Refactor the code accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Multi-pack indices store some information that is redundant with their
owning source:
- The locality bit that tracks whether the source is the primary
object source or an alternate.
- The object directory path the multi-pack index is located in.
- The pointer to the owning parent directory.
All of this information is already contained in `struct odb_source`. So
now that we always have that struct available when loading a multi-pack
index we have it readily accessible.
Drop the redundant information and instead store a pointer to the object
source.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
To load a multi-pack index the caller is expected to pass both the
repository and the object directory where the multi-pack index is
located. While this works, this layout has a couple of downsides:
- We need to pass in information redundant with the owning source,
namely its object directory and whether the source is local or not.
- We don't have access to the source when loading the multi-pack
index. If we had that access, we could store a pointer to the owning
source in the MIDX and thus deduplicate some information.
- Multi-pack indices are inherently specific to the object source and
its format. With the goal of pluggable object backends in mind we
will eventually want the backends to own the logic of reading and
writing multi-pack indices. Making the logic work on top of object
sources is a step into that direction.
Refactor loading of multi-pack indices accordingly.
This surfaces one small problem though: git-multi-pack-index(1) and our
MIDX test helper both know to read and write multi-pack-indices located
in a different object directory. This issue is addressed by adding the
user-provided object directory as an in-memory alternate.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
There are a couple of functions that take both a `struct repository` and
a `struct multi_pack_index`. This provides redundant information though
without much benefit given that the multi-pack index already has a
pointer to its owning repository.
Drop the `struct repository` parameter from such functions. While at it,
reorder the list of parameters of `fill_midx_entry()` so that the MIDX
comes first to better align with our coding guidelines.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Object database sources are classified either as:
- Local, which means that the source is the repository's primary
source. This is typically ".git/objects".
- Non-local, which is everything else. Most importantly this includes
alternates and quarantine directories.
This locality is often computed ad-hoc by checking whether a given
object source is the first one. This works, but it is quite roundabout.
Refactor the code so that we store locality when creating the sources in
the first place. This makes it both more accessible and robust.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In the preceding commits we have migrated all users of the linked list
of multi-pack indices to instead use those stored in the object database
sources. Remove those now-unused pointers.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When calling `close_midx()` we do not just close the multi-pack index
for one object source; instead, we iterate through the whole linked list
of MIDXs to close all of them. This linked list is about to go away in
favor of using the new per-source pointer to its respective MIDX.
Refactor the function to iterate through sources instead.
Note that after this patch, there are a couple of callsites left that
continue to use `close_midx()` without iterating through all sources.
These are all cases where we don't care about the MIDX from other
sources though, so it's fine to keep them as-is.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Multi-pack indices are tracked via `struct multi_pack_index`. This data
structure is stored as a linked list inside `struct object_database`,
which is the global database that spans across all of the object
sources.
This layout causes two problems:
- Object databases consist of multiple object sources (e.g. one source
per alternate object directory), where each multi-pack index is
specific to one of those sources. Regardless of that though, the
MIDX is not tracked per source, but tracked globally for the whole
object database. This creates a mismatch between the on-disk layout
and how things are organized in the object database subsystems and
makes some parts, like figuring out whether a source has an MIDX,
quite awkward.
- Multi-pack indices are an implementation detail of how efficient
access to packfiles works. As such, they are neither relevant in the
context of loose objects, nor in a potential future where we have
pluggable backends.
Refactor `prepare_multi_pack_index_one()` so that it works on a specific
source, which allows us to easily store a pointer to the multi-pack
index inside of it. For now, this pointer exists next to the existing
linked list we have in the object database. Users will be adjusted in
subsequent patches to instead use the per-source pointers.
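A rough sketch of the per-source shape described above (the `midx`
field and the exact signature of `load_multi_pack_index()` are
assumptions):
```
static void prepare_multi_pack_index_one(struct odb_source *source)
{
	if (source->midx)
		return; /* this source's MIDX was already loaded */

	/* Load the MIDX for this specific source and remember it there. */
	source->midx = load_multi_pack_index(source);
}
```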
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `object_directory` structure is used as an access point for a single
object directory like ".git/objects". While the structure isn't yet
fully self-contained, the intent is for it to eventually contain all
information required to access objects in one specific location.
While the name "object directory" is a good fit for now, this will
change over time as we continue with the agenda to make pluggable object
databases a thing. Eventually, objects may not be accessed via any kind
of directory at all anymore, but they could instead be backed by any
kind of durable storage mechanism. While it seems quite far-fetched for
now, it is conceivable that eventually this might even be some form of
database, for example.
As such, the current name of this structure will become worse over time
as we evolve into the direction of pluggable ODBs. Immediate next steps
will start to carve out proper self-contained object directories, which
requires us to pass in these object directories as parameters. Based on
our modern naming schema this means that those functions should then be
named after their subsystem, which means that we would start to bake the
current name into the codebase more and more.
Let's preempt this by renaming the structure. A couple of alternatives
were discussed:
- `odb_backend` was discarded because it led to the association that
one object database has a single backend, but the model is that one
alternate has one backend. Furthermore, "backend" is more about the
actual backing implementation and less about the high-level concept.
- `odb_alternate` was discarded because it is a bit of a stretch to
also call the main object directory an "alternate".
Instead, pick `odb_source` as the new name. It makes it sufficiently
clear that there can be multiple sources and does not cause confusion
when mixed with the already-existing "alternate" terminology.
In the future, this change allows us to easily introduce, for example,
an `odb_files_source` and other format-specific implementations.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The multi-pack index acts as a cache across a set of packfiles so that
we can quickly look up which of those packfiles contains a given object.
As such, the multi-pack index naturally needs to be updated every time
one of the packfiles goes away, or otherwise the multi-pack index has
grown stale.
A stale multi-pack index should be handled gracefully by Git though, and
in fact it is: if the indexed pack cannot be found we simply ignore it
and eventually we fall back to doing the object lookup by just iterating
through all packs, even if those aren't indexed.
But while this fallback works, it has one significant downside: we don't
cache the fact that a pack has vanished. This leads to us repeatedly
trying to look up the same pack only to realize that it (still) doesn't
exist.
This issue can be easily demonstrated by creating a repository with a
stale multi-pack index and a couple of objects. We do so by creating a
repository with two packfiles, both of which are indexed by the
multi-pack index, and then repack those two packfiles. Note that we have
to move the multi-pack-index before doing the final repack, as Git knows
to delete it otherwise.
$ git init repo
$ cd repo/
$ git config set maintenance.auto false
$ for i in $(seq 1000); do printf "%d-original" $i >file-$i; done
$ git add .
$ git commit -moriginal
$ git repack -dl
$ for i in $(seq 1000); do printf "%d-modified" $i >file-$i; done
$ git commit -a -mmodified
$ git repack -dl
$ git multi-pack-index write
$ mv .git/objects/pack/multi-pack-index .
$ git repack -Adl
$ mv multi-pack-index .git/objects/pack/
Commands that cause a lot of object lookups will now repeatedly invoke
`add_packed_git()`, which leads to three failed access(3p) calls as well
as one failed stat(3p) call. The following strace for example is done
for `git log --patch` in the above repository:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
74.67 0.024693 1 18038 18031 access
25.33 0.008378 1 6045 6017 newfstatat
------ ----------- ----------- --------- --------- ----------------
100.00 0.033071 1 24083 24048 total
Fix the issue by introducing a negative lookup cache for indexed packs.
This cache works by simply storing an invalid pointer for a missing pack
when `prepare_midx_pack()` fails to look up the pack. Most users of the
`packs` array don't need to be adjusted, either, as they all know to
call `prepare_midx_pack()` before accessing the array.
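A sketch of the negative-lookup cache pattern (the sentinel name and
the `open_midxed_pack()` helper are hypothetical, for illustration
only):
```
#define MISSING_PACK_SENTINEL ((struct packed_git *)-1)

static int prepare_midx_pack_sketch(struct multi_pack_index *m,
				    uint32_t pack_int_id)
{
	struct packed_git **slot = &m->packs[pack_int_id];

	if (*slot == MISSING_PACK_SENTINEL)
		return -1;	/* cached negative result: pack is missing */
	if (*slot)
		return 0;	/* pack was already opened */

	*slot = open_midxed_pack(m, pack_int_id);
	if (!*slot) {
		*slot = MISSING_PACK_SENTINEL;	/* remember the failure */
		return -1;
	}
	return 0;
}
```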
With this change in place we can now see a significantly reduced number
of syscalls:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
73.58 0.000323 5 60 28 newfstatat
26.42 0.000116 5 23 16 access
------ ----------- ----------- --------- --------- ----------------
100.00 0.000439 5 83 44 total
Furthermore, this change also results in a speedup:
Benchmark 1: git log --patch (revision = HEAD~)
Time (mean ± σ): 50.4 ms ± 2.5 ms [User: 22.0 ms, System: 24.4 ms]
Range (min … max): 45.4 ms … 54.9 ms 53 runs
Benchmark 2: git log --patch (revision = HEAD)
Time (mean ± σ): 12.7 ms ± 0.4 ms [User: 11.1 ms, System: 1.6 ms]
Range (min … max): 12.4 ms … 15.0 ms 191 runs
Summary
git log --patch (revision = HEAD) ran
3.96 ± 0.22 times faster than git log --patch (revision = HEAD~)
In the end, it should in theory never be necessary to have this negative
lookup cache given that we know to update the multi-pack index together
with repacks. But as the change is quite contained and as the speedup
can be significant as demonstrated above, it does feel sensible to have
the negative lookup cache regardless.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `git_open_cloexec()` wrapper function provides the ability to open a
file with `O_CLOEXEC` in a platform-agnostic way. This function is
provided by "object-file.c" even though it is not specific to the object
subsystem at all.
Move the function into "compat/open.c". This file already existed before
this commit, but was only compiled conditionally depending on whether or
not open(3p) may return EINTR. With this change we now unconditionally
compile the object, but wrap `git_open_with_retry()` in an ifdef.
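The resulting layout of "compat/open.c" roughly looks as follows (a
simplified sketch; the real implementations handle more edge cases):
```
#include "git-compat-util.h"

/* Always compiled: platform-agnostic open with O_CLOEXEC. */
int git_open_cloexec(const char *name, int flags)
{
	int fd = open(name, flags | O_CLOEXEC);
	if (fd < 0 && errno == EINVAL)
		fd = open(name, flags); /* kernel lacks O_CLOEXEC support */
	return fd;
}

/* Only needed on platforms where open(3p) may fail with EINTR. */
#ifdef OPEN_RETURNS_EINTR
#undef open
int git_open_with_retry(const char *path, int flags)
{
	int fd;
	do {
		fd = open(path, flags);
	} while (fd < 0 && errno == EINTR);
	return fd;
}
#endif
```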
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
There are multiple sites in "csum-file.c" where we use the global
`the_repository` variable, either explicitly or implicitly by using
`the_hash_algo`.
Refactor the code to stop using `the_repository` by adapting functions
to receive required data as parameters. Adapt callsites accordingly by
either using `the_repository->hash_algo`, or by using a context-provided
hash algorithm in case the subsystem already got rid of its dependency
on `the_repository`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Stop using `the_repository` in the "progress" subsystem by passing in a
repository when initializing `struct progress`. Furthermore, store a
pointer to the repository in that struct so that we can pass it to the
trace2 API when logging information.
Adjust callers accordingly by using `the_repository`. While there may be
some callers that have a repository available in their context, this
trivial conversion allows for easier verification and bubbles up the use
of `the_repository` by one level.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
* ps/build-sign-compare:
t/helper: don't depend on implicit wraparound
scalar: address -Wsign-compare warnings
builtin/patch-id: fix type of `get_one_patchid()`
builtin/blame: fix type of `length` variable when emitting object ID
gpg-interface: address -Wsign-comparison warnings
daemon: fix type of `max_connections`
daemon: fix loops that have mismatching integer types
global: trivial conversions to fix `-Wsign-compare` warnings
pkt-line: fix -Wsign-compare warning on 32 bit platform
csum-file: fix -Wsign-compare warning on 32-bit platform
diff.h: fix index used to loop through unsigned integer
config.mak.dev: drop `-Wno-sign-compare`
global: mark code units that generate warnings with `-Wsign-compare`
compat/win32: fix -Wsign-compare warning in "wWinMain()"
compat/regex: explicitly ignore "-Wsign-compare" warnings
git-compat-util: introduce macros to disable "-Wsign-compare" warnings
|
|
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `MIDX_MIN_SIZE` definition is used to check the `midx_size` in
`load_multi_pack_index_one()`. This definition relies on the
`the_hash_algo` global variable. Inline this and remove the global
variable usage.
With this, remove `USE_THE_REPOSITORY_VARIABLE` usage from `midx.c`.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The functions `get_split_midx_filename_ext()`, `get_midx_filename()` and
`get_midx_filename_ext()` use `hash_to_hex()` which internally uses the
`the_hash_algo` global variable.
Remove this dependency on global variables by passing down the
`hash_algo` through to the functions mentioned and instead calling
`hash_to_hex_algop()` along with the obtained `hash_algo`.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `load_multi_pack_index` function in midx uses `the_repository`
variable to access the `repository` struct. Modify the function and its
callers to pass in the `repository` struct instead.
This moves usage of `the_repository` to the `test-read-midx.c` file.
While that is not optimal, it is okay, since the upcoming commits will
slowly move the usage of `the_repository` up the layers and remove it
eventually.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In the `midx.c` file, there are multiple usages of `the_repository` and
`the_hash_algo` within static functions of the file. Some of the usages
can simply be swapped out with the available `repository` struct, while
others can be swapped out by passing the repository to the required
functions.
This leaves out only some other usages of `the_repository` and
`the_hash_algo` in the file in non-static functions, which we'll tackle
in upcoming commits.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `multi_pack_index` struct represents the MIDX for a repository.
Here, we add a pointer to the repository in this struct, allowing direct
use of the repository variable without relying on the global
`the_repository` struct.
With this addition, we can determine the repository associated with a
`bitmap_index` struct. A `bitmap_index` points to either a `packed_git`
or a `multi_pack_index`, both of which have direct repository
references. To support this, we introduce a static helper function,
`bitmap_repo`, in `pack-bitmap.c`, which retrieves a repository given a
`bitmap_index`.
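A sketch of that helper (assuming the `repo` pointers added by this and
the preceding commit; field names are illustrative):
```
static struct repository *bitmap_repo(struct bitmap_index *bitmap_git)
{
	if (bitmap_is_midx(bitmap_git))
		return bitmap_git->midx->repo;
	return bitmap_git->pack->repo;
}
```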
With this, we clear up all usages of `the_repository` within
`pack-bitmap.c` and also remove the `USE_THE_REPOSITORY_VARIABLE`
definition. Bringing us another step closer to remove all global
variable usage.
Although this change also opens up the potential to clean up `midx.c`,
doing so would require additional refactoring to pass the repository
struct to functions where the MIDX struct is created: a task better
suited for future patches.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The struct `packed_git` holds information regarding a packed object
file. Let's add the repository variable to this object, to represent the
repository that this packfile belongs to. This helps remove dependency
on the global `the_repository` object in `packfile.c` by simply using
repository information now readily available in the struct.
We do need to consider that a packfile could be part of the alternates
of a repository, but considering that we only have one repository struct
and also that we currently use 'the_repository' anyway, we should be
OK with this change.
We also modify `alloc_packed_git` to ensure that the repository is added
to newly created `packed_git` structs. This requires modifying the
function and all its callers to pass the repository object down the
levels.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.
As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.
The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.
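In terms of the declaration, the change is roughly the following
(sketched; the exact parameter order may differ):
```
/* before */
off_t find_pack_entry_one(const unsigned char *sha1, struct packed_git *p);

/* after */
off_t find_pack_entry_one(const struct object_id *oid, struct packed_git *p);
```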
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
|
|
When we scan a pack directory to load the idx entries we find into the
packed_git list, we skip any of them that are contained in a midx. We
then load them later lazily if we actually need to access the
corresponding pack, referencing them both from the midx struct and the
packed_git list.
The lazy-load in the midx code checks to see if the midx already
mentions the pack, but doesn't otherwise check the packed_git list. This
makes sense, since we should have added any pack to both lists.
But there's a loophole! If we call close_object_store(), that frees the
midx entirely, but _not_ the packed_git structs, which we must keep
around for Reasons[1]. If we then try to look up more objects, we'll
auto-load the midx again, which won't realize that we've already loaded
those packs, and will create duplicate entries in the packed_git list.
This is possibly inefficient, because it means we may open and map the
pack redundantly. But it can also lead to weird user-visible behavior.
The case I found is in "git repack", which closes and reopens the midx
after repacking and then calls update_server_info(). We end up writing
the duplicate entries into objects/info/packs.
We could obviously de-dup them while writing that file, but it seems
like a violation of more core assumptions that we end up with these
duplicate structs at all.
We can avoid the duplicates reasonably efficiently by checking their
names in the pack_map hash. This annoyingly does require a little more
than a straight hash lookup due to the naming conventions, but it should
only happen when we are about to actually open a pack. I don't think one
extra malloc will be noticeable there.
[1] I'm not entirely sure of all the details, except that we generally
assume the packed_git structs never go away. We noted this
restriction in the comment added by 6f1e9394e2 (object: fix leaking
packfiles when closing object store, 2024-08-08), but it's somewhat
vague. At any rate, if you try freeing the structs in
close_object_store(), you can observe segfaults all over the test
suite. So it might be fixable, but it's not trivial.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
|
|
The next commit will need to use the bitmap's MIDX (if one exists) to
translate bit positions into pack-relative positions in the source pack.
Ordinarily, we'd use the "midx" field of the bitmap_index struct. But
since that struct is defined within pack-bitmap.c, and our caller is in
a separate compilation unit, we do not have access to the MIDX field.
Instead, add a "from_midx" field to the bitmapped_pack structure so that
we can use that piece of data from outside of pack-bitmap.c. The caller
that uses this new piece of information will be added in the following
commit.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When loading a chained midx, we build up an array of hashes, one per
layer of the chain. But since the chain is also represented by the
linked list of multi_pack_index structs, nobody actually reads this
array. We pass it to add_midx_to_chain(), but the parameters are
completely ignored.
So we can drop those unused parameters. And then we can see that its
sole caller, load_midx_chain_fd_st(), only cares about one layer's hash at a
time (for parsing each line and feeding it to the single-layer midx
code). So we can replace the array with a single object_id on the stack.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Now that the rest of the MIDX subsystem and relevant callers have been
updated to learn about how to read and process incremental MIDX chains,
let's finally update the implementation in `write_midx_internal()` to be
able to write incremental MIDX chains.
This new feature is available behind the `--incremental` option for the
`multi-pack-index` builtin, like so:
$ git multi-pack-index write --incremental
The implementation for doing so is relatively straightforward, and boils
down to a handful of different kinds of changes implemented in this
patch:
- The `compute_sorted_entries()` function is taught to reject objects
which appear in any existing MIDX layer.
- Functions like `write_midx_revindex()` are adjusted to write
pack_order values which are offset by the number of objects in the
base MIDX layer.
- The end of `write_midx_internal()` is adjusted to move
non-incremental MIDX files when necessary (i.e. when creating an
incremental chain with an existing non-incremental MIDX in the
repository).
There are a handful of other changes that are introduced, like new
functions to clear incremental MIDX files that are unrelated to the
current chain (using the same "keep_hash" mechanism as in the
non-incremental case).
The tests explicitly exercising the new incremental MIDX feature are
relatively limited for two reasons:
1. Most of the "interesting" behavior is already thoroughly covered in
t5319-multi-pack-index.sh, which handles the core logic of reading
objects through a MIDX.
The new tests in t5334-incremental-multi-pack-index.sh are mostly
focused on creating and destroying incremental MIDXs, as well as
stitching their results together across layers.
2. A new GIT_TEST environment variable is added called
"GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL", which modifies the
entire test suite to write incremental MIDXs after repacking when
combined with the "GIT_TEST_MULTI_PACK_INDEX" variable.
This exercises the long tail of other interesting behavior that is
defined implicitly throughout the rest of the CI suite. It is
likewise added to the linux-TEST-vars job.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Teach the verification implementation used by `git multi-pack-index
verify` to perform verification for incremental MIDX chains by
independently validating each layer within the chain.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Now that the MIDX machinery's internals have been taught to understand
incremental MIDXs over the previous handful of commits, the MIDX
machinery itself can begin reading incremental MIDXs.
(Note that while the on-disk format for incremental MIDXs has been
defined, the writing end has not been implemented. This will take place
in the commit after next.)
The core of this change involves following the order specified in the
MIDX chain in reverse and opening up MIDXs in the chain one-by-one,
adding them to the previous layer's `->base_midx` pointer at each step.
In order to implement this, the `load_multi_pack_index()` function is
taught to call a new `load_multi_pack_index_chain()` function if loading
a non-incremental MIDX failed via `load_multi_pack_index_one()`.
When loading a MIDX chain, `load_midx_chain_fd_st()` reads each line in
the file one-by-one and dispatches calls to
`load_multi_pack_index_one()` to read each layer of the MIDX chain. When
a layer was successfully read, it is added to the MIDX chain by calling
`add_midx_to_chain()` which validates the contents of the `BASE` chunk,
performs some bounds checks on the number of combined packs and objects,
and attaches the new MIDX by assigning its `base_midx` pointer to the
existing part of the chain.
As a supplement to this, introduce a new mode in the test-read-midx
test-tool which allows us to read the information for a specific MIDX in
the chain by specifying its trailing checksum via the command-line
arguments like so:
$ test-tool read-midx .git/objects [checksum]
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `midx_preferred_pack()` is used to determine the identity
of the preferred pack, which is the identity of a unique pack within
the MIDX which is used as a tie-breaker when selecting from which pack
to represent an object that appears in multiple packs within the MIDX.
Historically we have said that the MIDX's preferred pack has the unique
property that all objects from that pack are represented in the MIDX.
But that isn't quite true: a more precise statement would be that all
objects from that pack *which appear in the MIDX* are selected from that
pack.
This helps us extend the concept of preferred packs across a MIDX chain,
where some object(s) in the preferred pack may appear in other packs
in an earlier MIDX layer, in which case those object(s) will not appear
in a subsequent MIDX layer from either the preferred pack or any other
pack.
Extend the concept of preferred packs by using the pack which represents
the object at the first position in MIDX pseudo-pack order belonging to
the current MIDX layer (i.e., at position 'm->num_objects_in_base').
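A sketch of the per-layer definition (signature and helper use are
assumptions based on the description above):
```
int midx_preferred_pack(struct multi_pack_index *m, uint32_t *pack_int_id)
{
	/*
	 * The preferred pack is the pack from which the first object in
	 * this layer's pseudo-pack order was selected.
	 */
	uint32_t pos = pack_pos_to_midx(m, m->num_objects_in_base);

	*pack_int_id = nth_midxed_pack_int_id(m, pos);
	return 0;
}
```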
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Now that the `midx_contains_pack()` versus `midx_locate_pack()` debacle
has been cleaned up, teach the former about how to operate in an
incremental MIDX-aware world in a similar fashion as in previous
commits.
Instead of using either of the two `midx_for_object()` or
`midx_for_pack()` helpers, this function is split into two: one that
determines whether a pack is contained in a single MIDX, and another
which calls the former in a loop over all MIDXs.
This approach does not require that we change any of the implementation
in what is now `midx_contains_pack_1()` as it still operates over a
single MIDX.
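Schematically, that split looks like this (a sketch; the single-MIDX
helper's body is elided since it is unchanged):
```
static int midx_contains_pack_1(struct multi_pack_index *m,
				const char *idx_or_pack_name);

int midx_contains_pack(struct multi_pack_index *m,
		       const char *idx_or_pack_name)
{
	for (; m; m = m->base_midx)
		if (midx_contains_pack_1(m, idx_or_pack_name))
			return 1;
	return 0;
}
```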
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Commit 307d75bbe6 (midx: implement `midx_locate_pack()`, 2023-12-14)
introduced `midx_locate_pack()`, which was described at the time as a
complement to the function `midx_contains_pack()` which allowed
callers to determine where in the MIDX lexical order a pack appeared, as
opposed to whether or not it was simply contained.
307d75bbe6 suggests that future patches would be added which would
introduce callers for this new function, but none ever were, meaning the
function has gone unused since its introduction.
Clean this up by in effect reverting 307d75bbe6, which removes the
unused function and inlines its definition back into
`midx_contains_pack()`.
(Looking back through the list archives when 307d75bbe6 was written,
this was in preparation for this[1] patch from back when we had the
concept of "disjoint" packs while developing multi-pack verbatim reuse.
That concept was abandoned before the series was merged, but I never
dropped what would become 307d75bbe6 from the series, leading to the
state prior to this commit).
[1]: https://lore.kernel.org/git/3019738b52ba8cd78ea696a3b800fa91e722eb66.1701198172.git.me@ttaylorr.com/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In a similar fashion as previous commits, teach the `fill_midx_entry()`
function to work in an incremental MIDX-aware fashion.
This function, unlike others which accept an index into either the
lexical order of objects or packs, takes in an object_id, and attempts
to fill a caller-provided 'struct pack_entry' with the remaining pieces
of information about that object from the MIDX.
The function uses `bsearch_midx()` which fills out the frame-local 'pos'
variable, recording the given object_id's lexical position within the
MIDX chain, if found (if no matching object ID was found, we'll return
immediately without filling out the `pack_entry` structure).
Once given that position, we jump back through the `->base_midx` pointer
to ensure that our `m` points at the MIDX layer which contains the given
object_id (and not an ancestor or descendant of it in the chain). Note
that we can drop the bounds check "if (pos >= m->num_objects)" because
`midx_for_object()` performs this check for us.
After that point, we only need to make two special considerations within
this function:
- First, the pack_int_id returned to us by `nth_midxed_pack_int_id()`
is a position in the concatenated lexical order of packs, so we must
ensure that we subtract `m->num_packs_in_base` before accessing the
MIDX-local `packs` array.
- Second, we must avoid translating the `pos` back to a MIDX-local
index, since we use it as an argument to `nth_midxed_offset()` which
expects a position relative to the concatenated lexical order of
objects.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In a similar fashion as in previous commits, teach the function
`nth_midxed_offset()` about incremental MIDXs.
The given object `pos` is used to find the containing MIDX, and
translated back into a MIDX-local position by assigning the return value
of `midx_for_object()` to it.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Now that the special cases callers of `bsearch_midx()` have been dealt
with, teach `bsearch_midx()` to handle incremental MIDX chains.
The incremental MIDX-aware version of `bsearch_midx()` works by
repeatedly searching for a given OID in each layer along the
`->base_midx` pointer, stopping either when an exact match is found, or
the end of the chain is reached.
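In sketch form (assuming the `bsearch_one_midx()` helper introduced in
the previous commit):
```
int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m,
		 uint32_t *result)
{
	for (; m; m = m->base_midx)
		if (bsearch_one_midx(oid, m, result))
			return 1;
	return 0;
}
```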
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `bsearch_midx()` function will be extended in a following commit to
search for the location of a given object ID across all MIDXs in a chain
(or the single non-chain MIDX if no chain is available).
While most callers will naturally want to use the updated
`bsearch_midx()` function, there are a handful of special cases that
will want finer control and will only want to search through a single
MIDX.
One instance is the object abbreviation code, which cares about object
IDs near to where we'd expect to find a match in a MIDX. In that case,
we want to look at the nearby matches in each layer of the MIDX chain,
not just a single one.
Split the more fine-grained control out into a separate function called
`bsearch_one_midx()` which searches only a single MIDX.
At present both `bsearch_midx()` and `bsearch_one_midx()` have identical
behavior, but the following commit will rewrite the former to be aware
of incremental MIDXs for the remaining non-special case callers.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In a similar fashion as in previous commits, teach the function
`nth_bitmapped_pack()` about incremental MIDXs by translating the given
`pack_int_id` from the concatenated lexical order to a MIDX-local
lexical position.
When accessing the containing MIDX's array of packs, use the local pack
ID. Likewise, when reading the 'BTMP' chunk, use the MIDX-local offset
when accessing the data within that chunk.
(Note that both the call to prepare_midx_pack() and the assignment of
bp->pack_int_id care about the global pack_int_id, so avoid shadowing
the given 'pack_int_id' parameter.)
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `nth_midxed_object_oid()` returns the object ID for a given
object position in the MIDX lexicographic order.
Teach this function to instead operate over the concatenated
lexicographic order defined in an earlier step so that it is able to be
used with incremental MIDXs.
To do this, we need to both (a) adjust the bounds check for the given
'n', as well as record the MIDX-local position after chasing the
`->base_midx` pointer to find the MIDX which contains that object.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `prepare_midx_pack()` is part of the midx.h API and
loads the pack identified by the MIDX-local 'pack_int_id'. This patch
prepares that function to be aware of an incremental MIDX world.
To do this, introduce the second of the two general purpose helpers
mentioned in the previous commit. This commit introduces
`midx_for_pack()`, which is the pack-specific analog of
`midx_for_object()`, and works in the same fashion.
Like `midx_for_object()`, this function chases down the '->base_midx'
field until it finds the MIDX layer within the chain that contains the
given pack.
Use this function within `prepare_midx_pack()` so that the `pack_int_id`
it expects is now relative to the entire MIDX chain, and that it
prepares the given pack in the appropriate MIDX.
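A sketch of `midx_for_pack()` (error handling and exact field names are
assumptions):
```
static uint32_t midx_for_pack(struct multi_pack_index **_m,
			      uint32_t pack_int_id)
{
	struct multi_pack_index *m = *_m;

	/* Walk down the chain to the layer that contains this pack. */
	while (m && pack_int_id < m->num_packs_in_base)
		m = m->base_midx;
	if (!m || pack_int_id >= m->num_packs_in_base + m->num_packs)
		BUG("pack_int_id %"PRIu32" outside of the MIDX chain",
		    pack_int_id);

	*_m = m;
	return pack_int_id - m->num_packs_in_base;
}
```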
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `nth_midxed_pack_int_id()` takes in an object position in
MIDX lexicographic order and returns an identifier of the pack from
which that object was selected in the MIDX.
Currently, the given object position is an index into the lexicographic
order of objects in a single MIDX. Change this position to instead refer
into the concatenated lexicographic order of all MIDXs in a MIDX chain.
This has two visible effects within the implementation of
`prepare_midx_pack()`:
- First, the given position is now an index into the concatenated
lexicographic order of all MIDXs in the order in which they appear
in the MIDX chain.
- Second, the pack ID returned from this function is now also in the
concatenated order of packs among all layers of the MIDX chain in
the same order that they appear in the MIDX chain.
To do this, introduce the first of two general purpose helpers, this one
being `midx_for_object()`. `midx_for_object()` takes a double pointer to
a `struct multi_pack_index` as well as an object `pos` in terms of the
entire MIDX chain[^1].
The function chases down the '->base_midx' field until it finds the MIDX
layer within the chain that contains the given object. It then:
- modifies the double pointer to point to the containing MIDX, instead
of the tip of the chain, and
- returns the MIDX-local position[^2] at which the given object can be
found.
Use this function within `nth_midxed_pack_int_id()` so that the `pos` it
expects is now relative to the entire MIDX chain, and that it returns
the appropriate pack position for that object.
[^1]: As a reminder, this means that the object is identified among the
objects contained in all layers of the incremental MIDX chain, not any
particular layer. For example, consider MIDX chain with two individual
MIDXs, one with 4 objects and another with 3 objects. If the MIDX with
4 objects appears earlier in the chain, then asking for object 6 would
return the second object in the MIDX with 3 objects.
[^2]: Building on the previous example, asking for object 6 in a MIDX
chain with (4, 3) objects, respectively, this would set the double
pointer to point at the MIDX containing three objects, and would
return an index to the second object within that MIDX.
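A sketch of `midx_for_object()` along the lines described above (error
handling is an assumption). With the (4, 3) example from the footnotes,
a chain-wide position that falls into the second layer maps to
`pos - 4` within that layer:
```
static uint32_t midx_for_object(struct multi_pack_index **_m, uint32_t pos)
{
	struct multi_pack_index *m = *_m;

	/* Walk down the chain to the layer that contains this object. */
	while (m && pos < m->num_objects_in_base)
		m = m->base_midx;
	if (!m || pos >= m->num_objects_in_base + m->num_objects)
		BUG("object position %"PRIu32" outside of the MIDX chain", pos);

	*_m = m;				/* point at the containing layer */
	return pos - m->num_objects_in_base;	/* MIDX-local position */
}
```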
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.
It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.
Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it free of new dependencies on it in future
changes, be they explicit or implicit.
For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).
Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "builtin.h" to keep the
required changes at least a little bit more contained.
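Schematically, the guard and the opt-in look roughly like this
(simplified sketch; the actual declarations live in "repository.h" and
"hash.h"):
```
/* repository.h / hash.h (simplified) */
#ifdef USE_THE_REPOSITORY_VARIABLE
extern struct repository *the_repository;
# define the_hash_algo the_repository->hash_algo
#endif

/* In a code unit that still relies on the global (before any includes): */
#define USE_THE_REPOSITORY_VARIABLE
#include "git-compat-util.h"
```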
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Commit f894081deae (pack-revindex: read multi-pack reverse indexes,
2021-03-30) introduced the `get_midx_rev_filename()` helper (later
modified by commit 60980aed786 (midx.c: write MIDX filenames to
strbuf, 2021-10-26)).
This function returns the location of the classic ".rev" files we used
to write for MIDXs (prior to 95e8383bac1 (midx.c: make changing the
preferred pack safe, 2022-01-25)), which is always of the form:
$GIT_DIR/objects/pack/multi-pack-index-$HASH.rev
Replace this function with a generic helper that populates a strbuf with
the above form, replacing the ".rev" extension with a caller-provided
argument.
This will allow us to remove a similarly-defined function in the
pack-bitmap code (used to determine the location of a MIDX .bitmap file)
by reimplementing it in terms of `get_midx_filename_ext()`.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Git 2.44 introduced a regression that made the updated code barf in
repositories with a multi-pack index written by older versions of Git,
which has been corrected.
* ps/missing-btmp-fix:
pack-bitmap: gracefully handle missing BTMP chunks
|
|
In 0fea6b73f1 (Merge branch 'tb/multi-pack-verbatim-reuse', 2024-01-12)
we have introduced multi-pack verbatim reuse of objects. This series has
introduced a new BTMP chunk, which encodes information about bitmapped
objects in the multi-pack index. Starting with dab60934e3 (pack-bitmap:
pass `bitmapped_pack` struct to pack-reuse functions, 2023-12-14) we use
this information to figure out objects which we can reuse from each of
the packfiles.
One thing that we glossed over though is backwards compatibility with
repositories that do not yet have BTMP chunks in their multi-pack index.
In that case, `nth_bitmapped_pack()` would return an error, which causes
us to emit a warning followed by another error message. These warnings
are visible to users that fetch from a repository:
```
$ git fetch
...
remote: error: MIDX does not contain the BTMP chunk
remote: warning: unable to load pack: 'pack-f6bb7bd71d345ea9fe604b60cab9ba9ece54ffbe.idx', disabling pack-reuse
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 40 (delta 5), reused 0 (delta 0), pack-reused 0 (from 0)
...
```
While the fetch succeeds the user is left wondering what they did wrong.
Furthermore, as visible both from the warning and from the reuse stats,
pack-reuse is completely disabled in such repositories.
What is quite interesting is that this issue can even be triggered in
case `pack.allowPackReuse=single` is set, which is the default value.
One could have expected that in this case we fall back to the old logic,
which is to use the preferred packfile without consulting BTMP chunks at
all. But either we fail with the above error in case they are missing,
or we use the first pack in the multi-pack-index. The former case
disables pack-reuse altogether, whereas the latter case may result in
reusing objects from a suboptimal packfile.
Fix this issue by partially reverting the logic back to what we had
before this patch series landed. Namely, in the case where we have no
BTMP chunks or when `pack.allowPackReuse=single` is set, we use the
preferred pack instead of consulting the BTMP chunks.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Introduce a new midx-write.c source file, which holds all of the
functionality from the MIDX sub-system related to writing new MIDX files.
Similar to the relationship between "pack-bitmap.c" and
"pack-bitmap-write.c", this source file will hold code that is specific
to writing MIDX files as opposed to reading them (the latter will remain
in midx.c).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Build the pack base name argument directly using strvec_pushf() instead
of with an intermediate strbuf. This is shorter, simpler and avoids the
need for explicit cleanup.
Signed-off-by: René Scharfe <l.s.r@web.de>
Reviewed-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When performing a binary search over the objects in a MIDX's bitmap
(i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack
ordering using a combination of (a) the preferred pack, (b) the pack's
lexical position in the MIDX based on pack names, and (c) the object
offset within the pack.
In order to perform this binary search, the reader must know the
identity of the preferred pack. This could be stored in the MIDX, but
isn't for historical reasons, mostly because it can easily be inferred
at read-time by looking at the object in the first bit position and
finding out which pack it was selected from in the MIDX, like so:
nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));
In midx_to_pack_pos() which performs this binary search, we look up the
identity of the preferred pack before each search. This is relatively
quick, since it involves two table-driven lookups (one in the MIDX's
revindex for `pack_pos_to_midx()`, and another in the MIDX's object
table for `nth_midxed_pack_int_id()`).
But since the preferred pack does not change after the MIDX is written,
it is safe to cache this value on the MIDX itself.
Write a helper to do just that, and rewrite all of the existing
call-sites that care about the identity of the preferred pack in terms
of this new helper.
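A sketch of such a caching helper (the cache fields on the MIDX are
assumptions for illustration; the real error handling differs):
```
int midx_preferred_pack(struct multi_pack_index *m, uint32_t *pack_int_id)
{
	if (!m->preferred_pack_cached) {
		m->preferred_pack_idx =
			nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));
		m->preferred_pack_cached = 1;
	}

	*pack_int_id = m->preferred_pack_idx;
	return 0;
}
```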
This will prepare us for a subsequent patch where we will need to binary
search through the MIDX's pseudo-pack order multiple times.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.
There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).
We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL. To
minimize the amount of fallout from this change, instead introduce a new
function by renaming `midx_contains_pack()` to `midx_locate_pack()`,
adding that output parameter, and then reimplementing
`midx_contains_pack()` in terms of it.
Future patches will make use of this new function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.
This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the amount of objects that are
reused verbatim (and consequently, decrease the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.
As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly more expensive to maintain as the pack
grows larger and larger.
It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
to the subset of their objects being sent).
If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to any of their objects we may output,
we need to make only modest modifications to the verbatim pack-reuse
code itself. Most notably, we need to remove the assumption that the
bits in the reachability bitmap corresponding to objects from the single
reuse pack begin at the first bit position.
Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).
This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data structure, the "Bitmapped Packs" (`BTMP`)
chunk of the multi-pack index. The `BTMP` chunk's contents are described
in detail here. Importantly, the `BTMP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.
For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.
[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
pack that wasn't used verbatim.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When selecting which packfiles will be written while generating a MIDX,
the MIDX internals fill out a 'struct pack_info' with various pieces of
book-keeping.
Instead of filling out each field of the `pack_info` structure
individually in each of the two spots that modify the array of such
structures (`ctx->info`), extract a common routine that does this for
us.
This reduces the code duplication by a modest amount. But more
importantly, it zero-initializes the structure before assigning values
into it. This hardens us against a future change which will add
additional fields to this structure, which (until this patch) was not
zero-initialized.
As a result, any new fields added to the `pack_info` structure need only
be updated in a single location, instead of at each spot within midx.c.
There are no functional changes in this patch.
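A sketch of what such a shared helper might look like; the field names are illustrative, and the important part is the memset() that zero-initializes the structure before any assignments:

    static void fill_pack_info(struct pack_info *info,
                               struct packed_git *p, const char *pack_name,
                               uint32_t orig_pack_int_id)
    {
            memset(info, 0, sizeof(struct pack_info));

            info->orig_pack_int_id = orig_pack_int_id;
            info->pack_name = xstrdup(pack_name);
            info->p = p;
    }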
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.
Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a clear_packing_data() function, and call it in the
appropriate spots.
This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.
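A rough sketch of the shape of such a function (the actual set of heap-allocated members in 'struct packing_data' is larger; this only illustrates the pattern):

    void clear_packing_data(struct packing_data *pdata)
    {
            if (!pdata)
                    return;

            /* release each heap-allocated member, e.g. the object array
             * and the lookup index */
            free(pdata->objects);
            free(pdata->index);
    }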
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The commit-graph, midx, and pack idx on-disk formats all have oid fanout
tables which are fed to bsearch_hash(). If these tables do not increase
monotonically, then the binary search may not only produce bogus values,
it may cause out of bounds reads.
We fixed this for commit graphs in 4169d89645 (commit-graph: check
consistency of fanout table, 2023-10-09). That commit argued that we did
not need to do the same for midx and pack idx files, because they
already did this check. However, that is wrong. We _do_ check the fanout
table for pack idx files when we load them, but we only do so for midx
files when running "git multi-pack-index verify". So it is possible to
get an out-of-bounds read by running a normal command with a specially
crafted midx file.
Let's fix this using the same solution (and roughly the same test) we
did for the commit-graph in 4169d89645. This replaces the same check
from "multi-pack-index verify"; because verify uses the same read
routines, we'd bail on reading the midx much sooner now. So let's make
sure to copy its verbose error message.
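The check itself is a monotonicity scan over the 256-entry fanout table while parsing the chunk; a hedged sketch (field names and the error message are approximate):

    static int midx_read_oid_fanout(const unsigned char *chunk_start,
                                    size_t chunk_size, void *data)
    {
            int i;
            struct multi_pack_index *m = data;

            m->chunk_oid_fanout = (const uint32_t *)chunk_start;
            if (chunk_size != 4 * 256)
                    return error(_("multi-pack-index OID fanout is of the wrong size"));

            for (i = 0; i < 255; i++) {
                    uint32_t f1 = ntohl(m->chunk_oid_fanout[i]);
                    uint32_t f2 = ntohl(m->chunk_oid_fanout[i + 1]);

                    if (f1 > f2)
                            return error(_("oid fanout out of order: fanout[%d] = %"PRIx32
                                           " > %"PRIx32" = fanout[%d]"),
                                         i, f1, f2, i + 1);
            }
            return 0;
    }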
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When we load a revindex from disk, we check the size of the file
compared to the number of objects we expect it to have. But when we use
a RIDX chunk stored directly in the midx, we just access the memory
directly. This can lead to out-of-bounds memory access for a corrupted
or malicious multi-pack-index file.
We can catch this by recording the RIDX chunk size, and then checking it
against the expected size when we "load" the revindex. Note that this
check is much simpler than the one that load_revindex_from_disk() does,
because we just have the data array with no header (so we do not need
to account for the header size, and nor do we need to bother validating
the header values).
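A minimal sketch of that check, assuming the chunk pointer and its recorded length live in hypothetical chunk_revindex/chunk_revindex_len fields; each entry is one 4-byte pack position, so the expected size is just the object count times that:

    if (m->chunk_revindex_len != st_mult(sizeof(uint32_t), m->num_objects))
            return error(_("multi-pack-index reverse-index chunk is the wrong size"));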
The test confirms both that we catch this case, and that we continue the
process (the revindex is required to use the midx bitmaps, but we fall
back to a non-bitmap traversal).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When we see a large offset bit in the regular midx offset table, we
use the entry as an index into a separate large offset table (just like
a pack idx does). But we don't bounds-check the access to that large
offset table (nor even record its size when we parse the chunk!).
The equivalent code for a regular pack idx is in check_pack_index_ptr().
But things are a bit simpler here because of the chunked format: we can
just check our array index directly.
As a bonus, we can get rid of the st_mult() here. If our array
bounds-check is successful, then we know that the result will fit in a
size_t (and the bounds check uses a division to avoid overflow
entirely).
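A sketch of what the direct array-index check can look like (names approximate); note the division on the right-hand side, which is what lets us avoid any overflow-prone multiplication:

    if (offset32 & MIDX_LARGE_OFFSET_NEEDED) {
            offset32 ^= MIDX_LARGE_OFFSET_NEEDED;

            if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
                    die(_("multi-pack-index large offset out of bounds"));

            return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
    }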
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The object offset chunk has one fixed-size entry for each object in the
midx. But since we don't check its size, we may access out-of-bounds
memory if we see a corrupt or malicious midx file.
Since the entries are fixed-size, the total length can be known up-front,
and we can just check it while parsing the chunk (this is similar to
what we do when opening pack idx files, which contain a similar offset
table).
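Since each object contributes one fixed-width entry, the whole check fits in the chunk callback; a sketch with illustrative field and constant names:

    static int midx_read_object_offsets(const unsigned char *chunk_start,
                                        size_t chunk_size, void *data)
    {
            struct multi_pack_index *m = data;

            m->chunk_object_offsets = chunk_start;
            if (chunk_size != st_mult(m->num_objects, MIDX_CHUNK_OFFSET_WIDTH))
                    return error(_("multi-pack-index object offset chunk is the wrong size"));
            return 0;
    }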
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The midx reader assumes chunks are aligned to a 4-byte boundary: we
treat the fanout chunk as an array of uint32_t, indexing it to feed the
results to ntohl(). Without aligning the chunks, we may violate the
CPU's alignment constraints. Though many platforms allow this, some do
not. And certainly UBSan will complain, since it is undefined behavior.
Even though most chunks are naturally 4-byte-aligned (because they are
storing uint32_t or larger types), PNAM is not. It stores NUL-terminated
pack names, so you can have a valid chunk with any length. The writing
side handles this by 4-byte-aligning the chunk, introducing a few extra
NULs as necessary. But since we don't check this on the reading side, we
may end up with a misaligned fanout and trigger the undefined behavior.
We have two options here:
1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The
latter handles alignment itself. It's possible that it's slightly
slower (though in practice I'm not sure how true that is,
especially for these code paths which then go on to do a binary
search).
2. Enforce the alignment when reading the chunks. This is easy to do,
since the table-of-contents reader can check it in one spot.
I went with the second option here, just because it places less burden
on maintenance going forward (it is OK to continue using ntohl), and we
know it can't have any performance impact on the actual reads.
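The enforcement amounts to a single modulus check while walking the table of contents; a sketch, assuming the reader is handed the caller's expected alignment (names approximate):

    if (expected_alignment && (chunk_offset % expected_alignment))
            return error(_("chunk id %"PRIx32" is misaligned (offset %"PRIu64")"),
                         chunk_id, chunk_offset);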
The commit-graph code uses the same chunk API. It's usually also 4-byte
aligned, but some chunks are not (like Bloom filter BDAT chunks). So
we'll pass "1" here to allow any alignment. It doesn't suffer from the
same problem as midx with its fanout because the fanout chunk is always
the first (and the rest of the format dictates that the first chunk will
start aligned).
The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling
back to the non-midx path rather than calling die(), as we do for other
parsing errors. Arguably we should make all of these behave the same,
but that's out of scope for this patch. For now the test just expects
the fallback behavior.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We parse the pack-name chunk as a series of NUL-terminated strings. But
since we don't look at the chunk size, there's nothing to guarantee that
we don't parse off the end of the chunk (or even off the end of the
mapped file).
We can record the length, and then as we parse make sure that we never
walk past it.
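A sketch of the bounded walk, assuming the chunk pointer and its recorded length are at hand; memchr() is what refuses to run past the end of the chunk:

    uint32_t i;
    const char *cur = (const char *)m->chunk_pack_names;
    const char *end = cur + m->chunk_pack_names_len;

    for (i = 0; i < m->num_packs; i++) {
            const char *nul = memchr(cur, '\0', end - cur);

            if (!nul)
                    die(_("multi-pack-index pack-name chunk is too short"));

            m->pack_names[i] = cur;
            cur = nul + 1;
    }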
The new test exercises the case, though note that it does not actually
segfault before this patch. It hits a NUL byte somewhere in one of the
other chunks, and comes up with a garbage pack name. You could construct
one that reads out-of-bounds (e.g., a PNAM chunk at the end of file),
but this case is simple and sufficient to check that we detect the
problem.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When reading an on-disk multi-pack-index, we take the number of objects
in the midx from the final value of the fanout table. But we just
blindly assume that the chunk containing the actual oid entries is the
correct size. This can lead to us reading out-of-bounds memory if the
lookup chunk is too small (or if the fanout is corrupted; when they
don't agree we cannot tell which one is wrong).
Note that we bump the assignment of m->num_objects into the fanout
parser callback, so that it's set when we parse the lookup table
(otherwise we'd have to manually record the lookup table size and check
it later).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When we load the oid-fanout chunk, our callback checks that its size is
reasonable and returns an error if not. However, the caller only checks
our return value against CHUNK_NOT_FOUND, so we end up ignoring the
error completely! Using a too-small fanout table means we end up
accessing random memory for the fanout and segfault.
We can fix this by checking for any non-zero return value, rather than
just CHUNK_NOT_FOUND, and adjusting our error message to cover both
cases. We could handle each error code individually, but there's not
much point for such a rare case. The extra message produced in the
callback makes it clear what is going on.
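The fix is mechanical; roughly (exact chunk id and message aside):

    -       if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
    -               die(_("multi-pack-index missing required OID fanout chunk"));
    +       if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
    +               die(_("multi-pack-index required OID fanout chunk missing or corrupted"));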
The same pattern is used in the adjacent code. Those cases are actually
OK for now because they do not use a custom callback, so the only error
they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an
accident waiting to happen (especially as we convert some of them away
from pair_chunk). The error messages are more verbose, but it should be
rare for a user to see these anyway.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The pair_chunk() function is provided as an easy helper for parsing
chunks that just want a pointer to a set of bytes. But every caller has
a hidden bug: because we return only the pointer without the matching
chunk size, the callers have no clue how many bytes they are allowed to
look at. And as a result, they may read off the end of the mmap'd data
when the on-disk file does not match their expectations.
Since chunk files are typically used for local-repository data like
commit-graph files and midx's, the security implications here are pretty
mild. The worst that can happen is that you hand somebody a corrupted
repository tarball, and running Git on it does an out-of-bounds read and
crashes. So it's worth being more defensive, but we don't need to drop
everything and fix every caller immediately.
I noticed the problem because the pair_chunk_fn() callback does not look
at its chunk_size argument, and wanted to annotate it to silence
-Wunused-parameter. We could do that now, but we'd lose the hint that
this code should be audited and fixed.
So instead, let's set ourselves up for going down that path:
1. Provide a pair_chunk() function that does return the size, which
prepares us for fixing these cases.
2. Rename the existing function to pair_chunk_unsafe(). That gives us
an easy way to grep for cases which still need to be fixed, and the
name should cause anybody adding new calls to think twice before
using it.
There are no callers of the "safe" version yet, but we'll add some in
subsequent patches.
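A hedged sketch of the resulting pair of declarations in chunk-format.h (parameter order and return values approximate):

    /* Returns a pointer to the chunk's data *and* how many bytes it spans. */
    int pair_chunk(struct chunkfile *cf, uint32_t chunk_id,
                   const unsigned char **p, size_t *size);

    /* The old behavior, under a name that is easy to grep for and audit. */
    int pair_chunk_unsafe(struct chunkfile *cf, uint32_t chunk_id,
                          const unsigned char **p);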
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Various offset computations in the code that accesses the packfiles
and other data in the object layer have been hardened against
arithmetic overflow, especially on 32-bit systems.
* tb/object-access-overflow-protection:
commit-graph.c: prevent overflow in `verify_commit_graph()`
commit-graph.c: prevent overflow in `write_commit_graph()`
commit-graph.c: prevent overflow in `merge_commit_graph()`
commit-graph.c: prevent overflow in `split_graph_merge_strategy()`
commit-graph.c: prevent overflow in `load_tree_for_commit()`
commit-graph.c: prevent overflow in `fill_commit_in_graph()`
commit-graph.c: prevent overflow in `fill_commit_graph_info()`
commit-graph.c: prevent overflow in `load_oid_from_graph()`
commit-graph.c: prevent overflow in add_graph_to_chain()
commit-graph.c: prevent overflow in `write_commit_graph_file()`
pack-bitmap.c: ensure that eindex lookups don't overflow
midx.c: prevent overflow in `fill_included_packs_batch()`
midx.c: prevent overflow in `write_midx_internal()`
midx.c: store `nr`, `alloc` variables as `size_t`'s
midx.c: prevent overflow in `nth_midxed_offset()`
midx.c: prevent overflow in `nth_midxed_object_oid()`
midx.c: use `size_t`'s for fanout nr and alloc
packfile.c: use checked arithmetic in `nth_packed_object_offset()`
packfile.c: prevent overflow in `load_idx()`
packfile.c: prevent overflow in `nth_packed_object_id()`
|
|
In a similar spirit as in previous commits, avoid an integer overflow
when computing the expected size of a MIDX.
(Note that this is also OK as-is, since `p->pack_size` is an `off_t`, so
this computation should already be done as 64-bit integers. But again,
let's use `st_mult()` to make this fact clear).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When writing a MIDX, we use the chunk-format API to write out each
individual chunk of the MIDX. Each chunk of the MIDX is tracked via a
call to `add_chunk()`, along with the expected size of that chunk.
Guard against overflow when dealing with a MIDX with a large number of
entries (and consequently, large chunks within the MIDX file itself) to
avoid corrupting the contents of the MIDX itself.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In the `write_midx_context` structure, we use two `uint32_t`'s to track
the length and allocated size of the packs, and one `uint32_t` to track
the number of objects in the MIDX.
In practice, having these be 32-bit unsigned values shouldn't cause any
problems since we are unlikely to have that many objects or packs in any
real-world repository. But these values should be `size_t`'s, so change
their type to reflect that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In a similar spirit as previous patches, avoid an overflow when looking
up object offsets in the MIDX's large offset table by guarding the
computation via `st_mult()`.
This instance is also OK as-is, since the left operand is the result of
`sizeof(...)`, which is already a `size_t`. But use `st_mult()` instead
here to make it explicit that this computation is to be performed using
64-bit unsigned integers.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In a similar spirit as previous commits, avoid overflow when looking up
an object's OID in a MIDX when its position is greater than
`2^32-1/m->hash_len`.
As usual, it is perfectly OK for a MIDX to have as many as 2^32-1
objects (since we use 32-bit fields to count the number of objects at
each fanout layer). But if we have more than `2^32-1/m->hash_len`
objects, we will incorrectly perform the computation using 32-bit
integers, overflowing the result.
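The fix is to route the multiplication through st_mult() so it is carried out (and overflow-checked) in size_t; roughly:

    -       oidread(oid, m->chunk_oid_lookup + m->hash_len * n);
    +       oidread(oid, m->chunk_oid_lookup + st_mult(m->hash_len, n));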
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `midx_fanout` struct is used to keep track of a set of OIDs
corresponding to each layer of the MIDX's fanout table. It stores an
array of entries, along with the number of entries in the table, and the
allocated size of the array.
Both `nr` and `alloc` are stored as 32-bit unsigned integers. In
practice, this should never cause any problems, since most packs have
far fewer than 2^32-1 objects.
But storing these as `size_t`'s is more appropriate, and prevents us
from accidentally overflowing some result when multiplying or adding to
either of these values. Update these struct members to be `size_t`'s as
appropriate.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for
dynamic array allocation. Moving these macros to git-compat-util.h with
the other alloc macros focuses alloc.[ch] to allocation for Git objects
and additionally allows us to remove inclusions to alloc.h from files
that solely used the above macros.
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The vast majority of files including object-store.h did not need dir.h
nor khash.h. Split the header into two files, and let most just depend
upon object-store-ll.h, while letting the two callers that need it
depend on the full object-store.h.
After this patch:
$ git grep -h include..object-store | sort | uniq -c
2 #include "object-store.h"
129 #include "object-store-ll.h"
Diff best viewed with `--color-moved`.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
More header clean-up.
* en/header-split-cache-h-part-2: (22 commits)
reftable: ensure git-compat-util.h is the first (indirect) include
diff.h: reduce unnecessary includes
object-store.h: reduce unnecessary includes
commit.h: reduce unnecessary includes
fsmonitor: reduce includes of cache.h
cache.h: remove unnecessary headers
treewide: remove cache.h inclusion due to previous changes
cache,tree: move basic name compare functions from read-cache to tree
cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
hash-ll.h: split out of hash.h to remove dependency on repository.h
tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
dir.h: move DTYPE defines from cache.h
versioncmp.h: move declarations for versioncmp.c functions from cache.h
ws.h: move declarations for ws.c functions from cache.h
match-trees.h: move declarations for match-trees.c functions from cache.h
pkt-line.h: move declarations for pkt-line.c functions from cache.h
base85.h: move declarations for base85.c functions from cache.h
copy.h: move declarations for copy.c functions from cache.h
server-info.h: move declarations for server-info.c functions from cache.h
packfile.h: move pack_window and pack_entry from cache.h
...
|
|
Geometric repacking ("git repack --geometric=<n>") in a repository
that borrows from an alternate object database had various corner
case bugs, which have been corrected.
* ps/fix-geom-repack-with-alternates:
repack: disable writing bitmaps when doing a local repack
repack: honor `-l` when calculating pack geometry
t/helper: allow chmtime to print verbosely without modifying mtime
pack-objects: extend test coverage of `--stdin-packs` with alternates
pack-objects: fix error when same packfile is included and excluded
pack-objects: fix error when packing same pack twice
pack-objects: split out `--stdin-packs` tests into separate file
repack: fix generating multi-pack-index with only non-local packs
repack: fix trying to use preferred pack in alternates
midx: fix segfault with no packs and invalid preferred pack
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When asked to write a multi-pack-index the user can specify a preferred
pack that is used as a tie breaker when multiple packs contain the same
objects. When this packfile cannot be found, we just pick the first pack
that is getting tracked by the newly written multi-pack-index as a
fallback.
Picking the fallback can fail in the case where we're asked to write a
multi-pack-index with no packfiles at all: picking the fallback value
will cause a segfault as we blindly index into the array of packfiles,
which would be empty.
Fix this bug by resetting the preferred packfile index to `-1` before
searching for the preferred pack. This fixes the segfault as we already
check for whether the index is `> -1`. If it is not, we simply don't
pick a preferred packfile at all.
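A sketch of the fixed lookup, with field names borrowed loosely from the MIDX writing context (the marking helper at the end is hypothetical, shown only to illustrate the existing `> -1` guard):

    uint32_t i;

    ctx.preferred_pack_idx = -1;
    if (preferred_pack_name) {
            for (i = 0; i < ctx.nr; i++) {
                    if (!strcmp(ctx.info[i].pack_name, preferred_pack_name)) {
                            ctx.preferred_pack_idx = i;
                            break;
                    }
            }
    }

    /* later: only dereference the index if we actually found a pack */
    if (ctx.preferred_pack_idx > -1)
            mark_preferred_pack_objects(&ctx); /* hypothetical helper */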
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
With the change in the last commit to move several functions to
write-or-die.h, csum-file.h no longer needs to include cache.h.
However, removing that include forces several other C files, which
directly or indirectly depended upon csum-file.h's inclusion of
cache.h, to now be more explicit about their dependencies.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This is another step towards letting us remove the include of cache.h in
strbuf.c. It does mean that we also need to add includes of abspath.h
in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Dozens of files made use of gettext functions, without explicitly
including gettext.h. This made it more difficult to find which files
could remove a dependence on cache.h. Make C files explicitly include
gettext.h if they are using it.
However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an
include of gettext.h, it was left out to avoid conflicting with an
in-flight topic.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
More work towards -Wunused.
* jk/unused-post-2.39-part2: (21 commits)
help: mark unused parameter in git_unknown_cmd_config()
run_processes_parallel: mark unused callback parameters
userformat_want_item(): mark unused parameter
for_each_commit_graft(): mark unused callback parameter
rewrite_parents(): mark unused callback parameter
fetch-pack: mark unused parameter in callback function
notes: mark unused callback parameters
prio-queue: mark unused parameters in comparison functions
for_each_object: mark unused callback parameters
list-objects: mark unused callback parameters
mark unused parameters in signal handlers
run-command: mark error routine parameters as unused
mark "pointless" data pointers in callbacks
ref-filter: mark unused callback parameters
http-backend: mark unused parameters in virtual functions
http-backend: mark argc/argv unused
object-name: mark unused parameters in disambiguate callbacks
serve: mark unused parameters in virtual functions
serve: use repository pointer to get config
ls-refs: drop config caching
...
|
|
The for_each_{loose,packed}_object interface uses callback functions,
but not every callback needs all of the parameters. Mark the unused ones
to satisfy -Wunused-parameter.
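For example, a callback that only cares about the object ID can annotate its path argument like this (a generic sketch, not a specific callback from this series):

    static int add_loose_object(const struct object_id *oid,
                                const char *path UNUSED,
                                void *data)
    {
            struct oidset *set = data;

            oidset_insert(set, oid);
            return 0;
    }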
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This allows us to replace includes of cache.h with includes of the much
smaller alloc.h in many places. It does mean that we also need to add
includes of alloc.h in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
A bugfix with tracing support in midx codepath
* tb/midx-bitmap-selection-fix:
pack-bitmap-write.c: instrument number of reused bitmaps
midx.c: instrument MIDX and bitmap generation with trace2 regions
midx.c: consider annotated tags during bitmap selection
midx.c: fix whitespace typo
|
|
When debugging MIDX and MIDX-bitmap related issues, it is useful to
figure out where Git is spending its time.
GitHub has been using the below trace2 regions to instrument various
components of generating a MIDX itself, as well as time spent preparing to
build a MIDX bitmap.
These are limited to instrumenting the following functions:
- midx.c::find_commits_for_midx_bitmap()
- midx.c::midx_pack_order()
- midx.c::prepare_midx_packing_data()
- midx.c::write_midx_bitmap()
- midx.c::write_midx_internal()
- midx.c::write_midx_reverse_index()
to start and end with a trace2_region_enter() and trace2_region_leave(),
respectively.
The category for all of these is "midx", which matches the existing
convention. The region description matches the name of the function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When generating a multi-pack bitmap without a `--refs-snapshot` (e.g.,
by running `git multi-pack-index write --bitmap` directly), we determine
the set of bitmap-able commits by enumerating each reference, and adding
the referrent as the tip of a reachability traversal when it appears
somewhere in the MIDX. (Any commit we encounter during the reachability
traversal then becomes a candidate for bitmap selection).
But we incorrectly avoid peeling the object at the tip of each
reference. So if we see some reference that points at an annotated tag
(which in turn points through zero or more additional annotated tags at
a commit), we will not add it as a tip for the reachability
traversal. This means that if some commit C is only referenced through
one or more annotated tag(s), then C won't become a bitmap candidate.
Correct this by peeling the reference tips as we enumerate them to
ensure that we consider commits which are the targets of annotated tags,
in addition to commits which are referenced directly.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This was unintentionally introduced via 893b563505 (midx: inline
nth_midxed_pack_entry(), 2021-09-11) where "struct repository *r"
became "struct repository * r".
The latter does not adhere to our usual style conventions, so fix that
up to look more like our usual declarations.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Apply the same treatment with respect to cruft packs as a few commits
ago, this time to `repack` with a non-zero `--batch-size`.
Since the case of a non-zero `--batch-size` is handled separately (in
`fill_included_packs_batch()` instead of `fill_included_packs_all()`), a
separate fix must be applied for this case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The fill_included_packs_batch() routine is responsible for aggregating
objects in packs with a non-zero value for the `--batch-size` option of
the `git multi-pack-index repack` sub-command.
Since this routine is explicitly called only when `--batch-size` is
non-zero, there is no point in checking that this is the case in our
loop condition.
Remove the unnecessary part of this condition to avoid confusion.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Replace a direct invocation of Git's `xcalloc()` wrapper with the
`CALLOC_ARRAY()` macro instead.
The latter is preferred since it is more conventional in Git's codebase,
but also because it automatically picks the correct value for the record
size.
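The change boils down to something like this (variable name illustrative):

    -       count = xcalloc(m->num_packs, sizeof(uint32_t));
    +       CALLOC_ARRAY(count, m->num_packs);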
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `repack` sub-command of the `git multi-pack-index` builtin creates a
new pack aggregating smaller packs contained in the MIDX up to some
given `--batch-size`.
When `--batch-size=0`, this instructs the MIDX builtin to repack
everything contained in the MIDX into a single pack.
In similar spirit as a previous commit, it is undesirable to repack the
contents of a cruft pack in this step. Teach `repack` to ignore any
cruft pack(s) when `--batch-size=0` for the same reason(s).
(The case of a non-zero `--batch-size` will be handled in a subsequent
commit).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `expire` sub-command unlinks any packs that are (a) contained in the
MIDX, but (b) have no objects referenced by the MIDX.
This sub-command ignores `.keep` packs, which remain on-disk even if
they have no objects referenced by the MIDX. Cruft packs, however,
aren't given the same treatment: if none of the objects contained in the
cruft pack are selected from the cruft pack by the MIDX, then the cruft
pack is eligible to be expired.
This is less than desirable, since the cruft pack has important
metadata about the individual object mtimes, which is useful to
determine how quickly an object should age out of the repository when
pruning.
Ordinarily, we wouldn't expect the contents of a cruft pack to be
duplicated across non-cruft packs (and we'd expect to see the MIDX
select all cruft objects from other sources even less often). But
nonetheless, it is still possible to trick the `expire` sub-command into
removing the `.mtimes` file in this circumstance.
Teach the `expire` sub-command to ignore cruft packs in the same manner
as it does `.keep` packs, in order to keep their metadata around, even
when they are unreferenced by the MIDX.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The pack bitmap file gained a bitmap-lookup table to speed up
locating the necessary bitmap for a given commit.
* ac/bitmap-lookup-table:
pack-bitmap-write: drop unused pack_idx_entry parameters
bitmap-lookup-table: add performance tests for lookup table
pack-bitmap: prepare to read lookup table extension
pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
pack-bitmap-write.c: write lookup table extension
bitmap: move `get commit positions` code to `bitmap_writer_finish`
Documentation/technical: describe bitmap lookup table extension
|
|
Multi-pack index got corrupted when preferred pack changed from one
pack to another in a certain way, which has been corrected.
* tb/midx-with-changing-preferred-pack-fix:
midx.c: avoid adding preferred objects twice
midx.c: include preferred pack correctly with existing MIDX
midx.c: extract `midx_fanout_add_pack_fanout()`
midx.c: extract `midx_fanout_add_midx_fanout()`
midx.c: extract `struct midx_fanout`
t/lib-bitmap.sh: avoid silencing stderr
t5326: demonstrate potential bitmap corruption
|
|
Teach Git to provide a way for users to enable/disable bitmap lookup
table extension by providing a config option named 'writeBitmapLookupTable'.
The default is false.
Also add a test to verify writing of the lookup table.
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Co-Authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The last commit changes the behavior of midx.c's `get_sorted_entries()`
function to handle the case of writing a MIDX bitmap while reusing an
existing MIDX and changing the identity of the preferred pack
separately.
As part of this change, all objects from the (new) preferred pack are
added to the fanout table in a separate pass. Since these copies of the
objects all have their preferred bits set, any duplicates will be
resolved in their favor.
Importantly, this includes any copies of those same objects that come
from the existing MIDX. We know at the time of adding them that they'll
be redundant if their source pack is the (new) preferred one, so we can
avoid adding them to the list in this case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This patch resolves an issue where the object order used to generate a
MIDX bitmap would violate an invariant that all of the preferred pack's
objects are represented by that pack in the MIDX.
The problem arises when reusing an existing MIDX while generating a new
one, and occurs specifically when the identity of the preferred pack
changes from one MIDX to another, along with a few other conditions:
- the new preferred pack must also be present in the existing MIDX
- the new preferred pack must *not* have been the preferred pack in
the existing MIDX
- most importantly, there must be at least one object present in the
physical preferred pack (i.e., it shows up in that pack's index)
but was selected from a *different* pack when the previous MIDX
was generated
When the above conditions are all met, we end up (incorrectly)
discarding copies of some objects in the pack selected as the preferred
pack. This is because `get_sorted_entries()` adds objects to its list
by doing the following at each fanout level:
- first, adding all objects from that fanout level from an existing
MIDX
- then, adding all objects from that fanout level in each pack *not*
included in the existing MIDX
So if some object was not selected from the to-be-preferred pack when
writing the previous MIDX, then we will never consider it as a candidate
when generating the new MIDX. This means that it's possible for the
preferred pack to not include all of its objects in the MIDX's
pseudo-pack object order, which is an invariant violation of that order.
Resolve this by adding all objects from the preferred pack separately
when it appears in the existing MIDX (if one was present). This will
duplicate objects from that pack that *did* appear in the MIDX, but this
is fine, since get_sorted_entries() already handles duplicates. (A
future optimization in this area could avoid adding copies of objects
that we know already exist in the MIDX.)
Note that we no longer need to compute the preferred-ness of objects
added from the MIDX, since we only want to select the preferred objects
from a single source. (We could still mark these preferred bits, but
doing so is redundant and unnecessary).
This resolves the bug demonstrated by t5326.174 ("preferred pack change
with existing MIDX bitmap").
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Extract a routine to add all objects whose object ID's first byte is
`cur_fanout` from a given pack (identified by its index into the `struct
pack_info` array maintained by the MIDX writing routine).
Unlike the previous extraction (for `midx_fanout_add_midx_fanout()`),
this function will be called twice, once for all new packs, and again
for the preferred pack (if it appears in an existing MIDX). The latter
change is to resolve the bug described a few patches ago, and will be
made in the subsequent commit.
Similar to the previous refactoring, this function also enhances the
readability of its caller in `get_sorted_entries()`.
Its functionality is unchanged in this commit.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Extract a routine to add all objects whose object ID's first byte is
`cur_fanout` from an existing MIDX.
This function will only be called once, so extracting it is purely
cosmetic to improve the readability of `get_sorted_entries()` (its sole
caller) below.
The functionality is unchanged in this commit.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
To build up a list of objects (along with their packs, and the offsets
within those packs that each object appears at), the MIDX code
implements `get_sorted_entries()` which builds up a list of candidates,
sorts them, and then removes duplicate entries.
To do this, it keeps an array of `pack_midx_entry` structures that it
builds up once for each fanout level (i.e., for all possible values of
the first byte of each object's ID).
This array is a function-local variable of `get_sorted_entries()`. Since
it uses the ALLOC_GROW() macro, having the `alloc_fanout` variable also
be local to that function, and only modified within that function is
convenient.
However, subsequent changes will extract the two ways this array is
filled (from a pack at some fanout value, and from an existing MIDX at
some fanout value) into separate functions. Instead of passing around
pointers to the entries array, along with `nr_fanout` and
`alloc_fanout`, encapsulate these three into a structure instead. Then
pass around a pointer to this structure instead.
This patch does not yet extract the above two functions, but sets us up
to begin doing so in the following commit. For now, the implementation
of get_sorted_entries() is only modified to replace `entries_by_fanout`
with `fanout.entries`, `nr_fanout` with `fanout.nr`, and so on.
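A sketch of the encapsulation described above; the grow helper illustrates why keeping `alloc` next to the array is convenient:

    struct midx_fanout {
            struct pack_midx_entry *entries;
            uint32_t nr;
            uint32_t alloc;
    };

    static void midx_fanout_grow(struct midx_fanout *fanout, uint32_t nr)
    {
            ALLOC_GROW(fanout->entries, nr, fanout->alloc);
    }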
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The refactoring in 90b2bb710d (midx: extract bitmap write setup,
2022-07-19) hoisted our call to find_commits_for_midx_bitmap() into the
caller, which means we no longer need to see the refs_snapshot at all.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We noticed that some 'git multi-pack-index write --bitmap' processes
were running with very high memory. It turns out that a lot of this
memory is required to store a list of every object in the written
multi-pack-index, with a second copy that has additional information
used for the bitmap writing logic.
Using 'valgrind --tool=massif' before this change, the following chart
shows how memory load increased and was maintained throughout the
process:
GB
4.102^ ::
| @ @::@@::@@::::::::@::::::@@:#:::::::::::::@@:: :
| :::::@@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :::: :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :::: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| : :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| : :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| @ :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| @ :: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
| @::: :: : :: @@:@: @ ::@ ::: ::::@: ::: @@:#:::::: :: : :@ :: :
0 +--------------------------------------------------------------->
It turns out that the 'struct write_midx_context' data is persisting
through the life of the process, including the 'entries' array. This
array is used last inside find_commits_for_midx_bitmap() within
write_midx_bitmap(). If we free (and nullify) the array at that point,
we can free a decent chunk of memory before the bitmap logic adds more
to the memory footprint.
Here is the massif memory load chart after this change:
GB
3.111^ #
| # :::::::::::@::::::::::::::@
| # ::::::::::::::::::::::::: : :: : @:: ::::: :: ::@
| @# :::::::::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| :::@#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| :: @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| :: @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
| :: @#::: ::: :::::: :::: :: : :::::::: : :: : @:: ::::: :: ::@
0 +--------------------------------------------------------------->
The previous change introduced a refactoring of write_midx_bitmap() to
make it more clear how much of the 'struct write_midx_context' instance
is needed at different parts of the process. In addition, the following
defensive programming measures were put in place:
1. Using FREE_AND_NULL() we will at least get a segfault from reading a
NULL pointer instead of a use-after-free.
2. 'entries_nr' is also set to zero to make any loop that would iterate
over the entries be trivial.
3. Add significant comments in write_midx_internal() to add warnings
for future authors who might accidentally add references to this
cleared memory.
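Measures (1) and (2) above boil down to something like the following at the point where the entries array is last needed (field names approximate):

    FREE_AND_NULL(ctx->entries);
    ctx->entries_nr = 0;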
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The write_midx_bitmap() method is a long method that does a lot of
steps. It requires the write_midx_context struct for use in
prepare_midx_packing_data() and find_commits_for_midx_bitmap(), but
after that only needs the pack_order array.
This is a messy, but completely non-functional refactoring. The code is
only being moved around to reduce visibility of the write_midx_context
during the longest part of computing reachability bitmaps.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Plug the memory leaks from the trickiest API of all, the revision
walker.
* ab/plug-leak-in-revisions: (27 commits)
revisions API: add a TODO for diff_free(&revs->diffopt)
revisions API: have release_revisions() release "topo_walk_info"
revisions API: have release_revisions() release "date_mode"
revisions API: call diff_free(&revs->pruning) in revisions_release()
revisions API: release "reflog_info" in release revisions()
revisions API: clear "boundary_commits" in release_revisions()
revisions API: have release_revisions() release "prune_data"
revisions API: have release_revisions() release "grep_filter"
revisions API: have release_revisions() release "filter"
revisions API: have release_revisions() release "cmdline"
revisions API: have release_revisions() release "mailmap"
revisions API: have release_revisions() release "commits"
revisions API users: use release_revisions() for "prune_data" users
revisions API users: use release_revisions() with UNLEAK()
revisions API users: use release_revisions() in builtin/log.c
revisions API users: use release_revisions() in http-push.c
revisions API users: add "goto cleanup" for release_revisions()
stash: always have the owner of "stash_info" free it
revisions API users: use release_revisions() needing REV_INFO_INIT
revision.[ch]: document and move code declared around "init"
...
|
|
A mechanism to pack unreachable objects into a "cruft pack",
instead of ejecting them into loose form to be reclaimed later, has
been introduced.
* tb/cruft-packs:
sha1-file.c: don't freshen cruft packs
builtin/gc.c: conditionally avoid pruning objects via loose
builtin/repack.c: add cruft packs to MIDX during geometric repack
builtin/repack.c: use named flags for existing_packs
builtin/repack.c: allow configuring cruft pack generation
builtin/repack.c: support generating a cruft pack
builtin/pack-objects.c: --cruft with expiration
reachable: report precise timestamps from objects in cruft packs
reachable: add options to add_unseen_recent_objects_to_traversal
builtin/pack-objects.c: --cruft without expiration
builtin/pack-objects.c: return from create_object_entry()
t/helper: add 'pack-mtimes' test-tool
pack-mtimes: support writing pack .mtimes files
chunk-format.h: extract oid_version()
pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
pack-mtimes: support reading .mtimes files
Documentation/technical: add cruft-packs.txt
|
|
There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.
Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.
(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).
Another caller will be added in a subsequent patch.
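The consolidated helper is roughly of this shape (exact placement and error message aside):

    static inline int oid_version(void)
    {
            switch (hash_algo_by_ptr(the_hash_algo)) {
            case GIT_HASH_SHA1:
                    return 1;
            case GIT_HASH_SHA256:
                    return 2;
            default:
                    die(_("invalid hash version"));
            }
    }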
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The path given to the "git multi-pack-index" command by the end user
was compared with the path internally prepared by the tool without
first normalizing, which led to duplicated paths not being noticed.
This has been corrected.
* ds/midx-normalize-pathname-before-comparison:
cache: use const char * for get_object_directory()
multi-pack-index: use --object-dir real path
midx: use real paths in lookup_multi_pack_index()
|
|
This helper looks for a parsed multi-pack-index whose object directory
matches the given object_dir. Before going into the loop over the parsed
multi-pack-indexes, it calls find_odb() to ensure that the given
object_dir is actually a known object directory.
However, find_odb() uses real-path manipulations to compare the input to
the alternate directories. This same real-path comparison is not used in
the loop, leading to potential issues with the strcmp().
Update the method to use the real-path values instead.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add a release_revisions() to various users of "struct rev_list" in
those straightforward cases where we only need to add the
release_revisions() call to the end of a block, and don't need to
e.g. refactor anything to use a "goto cleanup" pattern.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This commit introduces the infrastructure for the core.fsync
configuration knob. The repository components we want to sync
are identified by flags so that we can turn on or off syncing
for specific components.
If core.fsyncObjectFiles is set and the core.fsync configuration
also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any
loose objects. This picks the strictest data integrity behavior
if core.fsync and core.fsyncObjectFiles are set to conflicting values.
This change introduces the currently unused fsync_component
helper, which will be used by a later patch that adds fsyncing to
the refs backend.
Actual configuration and documentation of the fsync components
list are in other patches in the series to separate review of
the underlying mechanism from the policy of how it's configured.
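A hedged sketch of the mechanism: components are bits in a mask, and the helper only syncs when the configured mask includes the component in question (the mask variable and low-level call shown here are illustrative):

    enum fsync_component {
            FSYNC_COMPONENT_LOOSE_OBJECT = 1 << 0,
            FSYNC_COMPONENT_PACK         = 1 << 1,
            /* ... */
    };

    static inline void fsync_component(enum fsync_component component, int fd)
    {
            if (fsync_components & component)
                    fsync_or_die(fd, "component");
    }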
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When there were no objects to write a .bitmap file for, "git
multi-pack-index" triggered an error instead of just skipping;
this has been corrected.
* tb/midx-no-bitmap-for-no-objects:
midx: prevent writing a .bitmap without any objects
|
|
When trying to write a MIDX, we already prevent the case where there
weren't any packs present, and thus we would have written an empty MIDX.
But there is another "empty" case, which is more interesting, and we
don't yet handle. If we try to write a MIDX which has at least one pack,
but those packs together don't contain any objects, we will encounter a
BUG() when trying to use the bitmap corresponding to that MIDX, like so:
$ git rev-parse HEAD | git pack-objects --revs --use-bitmap-index --stdout >/dev/null
BUG: pack-revindex.c:394: pack_pos_to_midx: out-of-bounds object at 0
(note that in the above reproduction, both `--use-bitmap-index` and
`--stdout` are important, since without the former we won't even bother
to load the .bitmap, and without the latter we won't attempt pack reuse).
The problem occurs when we try to discover the identity of the
preferred pack to determine which range (if any) of existing packs we can
reuse verbatim. This path is: `reuse_packfile_objects()` ->
`reuse_partial_packfile_from_bitmap()` -> `midx_preferred_pack()`.
#4 0x000055555575401f in pack_pos_to_midx (m=0x555555997160, pos=0) at pack-revindex.c:394
#5 0x00005555557502c8 in midx_preferred_pack (bitmap_git=0x55555599c280) at pack-bitmap.c:1431
#6 0x000055555575036c in reuse_partial_packfile_from_bitmap (bitmap_git=0x55555599c280, packfile_out=0x5555559666b0 <reuse_packfile>,
entries=0x5555559666b8 <reuse_packfile_objects>, reuse_out=0x5555559666c0 <reuse_packfile_bitmap>) at pack-bitmap.c:1452
#7 0x00005555556041f6 in get_object_list_from_bitmap (revs=0x7fffffffcbf0) at builtin/pack-objects.c:3658
#8 0x000055555560465c in get_object_list (ac=2, av=0x555555997050) at builtin/pack-objects.c:3765
#9 0x0000555555605e4e in cmd_pack_objects (argc=0, argv=0x7fffffffe920, prefix=0x0) at builtin/pack-objects.c:4154
Since neither the .bitmap nor the MIDX stores the identity of the
preferred pack, we infer it by trying to load the first object in
pseudo-pack order, and then asking the MIDX which pack was chosen to
represent that object.
But this fails our bounds check, since there are zero objects in the
MIDX to begin with, which results in the BUG().
We could catch this more carefully in `midx_preferred_pack()`, but
signaling the absence of a preferred pack out to all of its callers is
somewhat awkward.
Instead, let's avoid writing a MIDX .bitmap without any objects
altogether. We catch this case in `write_midx_internal()`, and emit a
warning if the caller indicated they wanted to write a bitmap before
clearing out the relevant flags. If we somehow got to
write_midx_bitmap(), then we will call BUG(), but this should now be an
unreachable path.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When a MIDX contains the new `RIDX` chunk, ensure that the reverse index
is read from it instead of the on-disk .rev file. Since we need to
encode the object order in the MIDX itself for correctness reasons,
there is no point in storing the same data again outside of the MIDX.
So, this patch stops writing separate .rev files, and instead reads the
reverse index out of the MIDX itself. This is possible to do with
relatively little new code,
since the format of the RIDX chunk is identical to the data in the .rev
file. In other words, we can implement this by pointing the
`revindex_data` field at the reverse index chunk of the MIDX instead of
the .rev file without any other changes.
Note that we have two knobs that are adjusted for the new tests:
GIT_TEST_MIDX_WRITE_REV and GIT_TEST_MIDX_READ_RIDX. The former controls
whether the MIDX .rev is written at all, and the latter controls whether
we read the MIDX's RIDX chunk.
Both are necessary to ensure that the test added at the beginning of
this series continues to work. This is because we always need to write
the RIDX chunk in the MIDX in order to change its checksum, but we want
to make sure reading the existing .rev file still works (since the RIDX
chunk takes precedence by default).
Arguably this isn't a very interesting mode to test, because the
precedence rules mean that we'll always read the RIDX chunk over the
.rev file. But it makes it impossible for a user to induce corruption in
their repository by adjusting the test knobs (since if we had an
either/or knob they could stop writing the RIDX chunk, allowing them to
tweak the MIDX's object order without changing its checksum).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The previous patch demonstrates a bug where a MIDX's auxiliary object
order can become out of sync with a MIDX bitmap.
This is because of two confounding factors:
- First, the object order is stored in a file which is named according
to the multi-pack index's checksum, and the MIDX does not store the
object order. This means that the object order can change without
altering the checksum.
- But the .rev file is moved into place with finalize_object_file(),
which link(2)'s the file into place instead of renaming it. For us,
that means that a modified .rev file will not be moved into place if
MIDX's checksum was unchanged.
This fix is to force the MIDX's checksum to change when the preferred
pack changes but the set of packs contained in the MIDX does not. In
other words, when the object order changes, the MIDX's checksum needs to
change with it (regardless of whether the MIDX is tracking the same or
different packs).
This prevents a race whereby changing the object order (but not the
packs themselves) enables a reader to see the new .rev file with the old
MIDX, or similarly seeing the new bitmap with the old object order.
But why can't we just stop hardlinking the .rev into place, instead of
adding additional data to the MIDX? Suppose that's what we did. Then
when we go to generate the new bitmap, we'll load the old MIDX bitmap,
along with the MIDX that it references. That's fine, since the new MIDX
isn't moved into place until after the new bitmap is generated. But the
new object order *has* been moved into place. So we'll read the old
bitmaps in the new order when generating the new bitmap file, meaning
that without this secondary change, bitmap generation itself would
become a victim of the race described here.
This can all be prevented by forcing the MIDX's checksum to change when
the object order does. By embedding the entire object order into the
MIDX, we do just that. That is, the MIDX's checksum will change in
response to any perturbation of the underlying object order. In t5326,
this will cause the MIDX's checksum to update (even without changing the
set of packs in the MIDX), preventing the stale read problem.
Note that this makes it safe to continue to link(2) the MIDX .rev file
into place, since it is now impossible to have a .rev file that is
out-of-sync with the MIDX whose checksum it references. (But we will do
away with MIDX .rev files later in this series anyway, so this is
somewhat of a moot point).
In theory, it is possible to store a "fingerprint" of the full object
order here, so long as that fingerprint changes at least as often as the
full object order does. Some possibilities here include storing the
identity of the preferred pack, along with the mtimes of the
non-preferred packs in a consistent order. But storing a limited part of
the information makes it difficult to reason about whether or not there
are gaps between the two that would cause us to get bitten by this bug
again.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Leakfix.
* tb/plug-pack-bitmap-leaks:
pack-bitmap.c: more aggressively free in free_bitmap_index()
pack-bitmap.c: don't leak type-level bitmaps
midx.c: write MIDX filenames to strbuf
builtin/multi-pack-index.c: don't leak concatenated options
builtin/repack.c: avoid leaking child arguments
builtin/pack-objects.c: don't leak memory via arguments
t/helper/test-read-midx.c: free MIDX within read_midx_file()
midx.c: don't leak MIDX from verify_midx_file
midx.c: clean up chunkfile after reading the MIDX
|
|
To ask for the name of a MIDX and its corresponding .rev file, callers
invoke get_midx_filename() and get_midx_rev_filename(), respectively.
These both invoke xstrfmt(), allocating a chunk of memory which must be
freed later on.
This makes callers in pack-bitmap.c somewhat awkward. Specifically,
midx_bitmap_filename(), which is implemented like:
    return xstrfmt("%s-%s.bitmap",
                   get_midx_filename(midx->object_dir),
                   hash_to_hex(get_midx_checksum(midx)));
this leaks the second argument to xstrfmt(), which itself was allocated
with xstrfmt(). This caller could assign both the result of
get_midx_filename() and the outer xstrfmt() to a temporary variable,
remembering to free() the former before returning. But that involves a
wasteful copy.
Instead, get_midx_filename() and get_midx_rev_filename() take a strbuf
as an output parameter. This way midx_bitmap_filename() can manipulate
and pass around a temporary buffer which it detaches back to its caller.
That allows us to implement the function without copying or open-coding
get_midx_filename() in a way that doesn't leak.
Update the other callers of get_midx_filename() and
get_midx_rev_filename() accordingly.
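A sketch of the reworked callers, close to (but not necessarily identical with) the final code:

    void get_midx_filename(struct strbuf *out, const char *object_dir)
    {
            strbuf_addf(out, "%s/pack/multi-pack-index", object_dir);
    }

    static char *midx_bitmap_filename(struct multi_pack_index *midx)
    {
            struct strbuf buf = STRBUF_INIT;

            get_midx_filename(&buf, midx->object_dir);
            strbuf_addf(&buf, "-%s.bitmap", hash_to_hex(get_midx_checksum(midx)));

            return strbuf_detach(&buf, NULL);
    }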
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function midx.c:verify_midx_file() allocates a MIDX struct by
calling load_multi_pack_index(). But when cleaning up, it calls free()
without freeing any resources associated with the MIDX.
Call the more appropriate close_midx() which does free those resources,
which causes t5319.3 to pass when Git is compiled with SANITIZE=leak.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In order to read the contents of a MIDX, we initialize a chunkfile
structure which can read the table of contents and assign pointers into
different sections of the file for us.
We do call free(), since the chunkfile struct is heap allocated, but not
the more appropriate free_chunkfile(), which also frees memory that the
structure itself owns.
Call that instead to avoid leaking memory in this function.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
"git repack" has been taught to generate multi-pack reachability
bitmaps.
* tb/repack-write-midx:
test-read-midx: fix leak of bitmap_index struct
builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
builtin/repack.c: make largest pack preferred
builtin/repack.c: support writing a MIDX while repacking
builtin/repack.c: extract showing progress to a variable
builtin/repack.c: rename variables that deal with non-kept packs
builtin/repack.c: keep track of existing packs unconditionally
midx: preliminary support for `--refs-snapshot`
builtin/multi-pack-index.c: support `--stdin-packs` mode
midx: expose `write_midx_file_only()` publicly
|
|
When writing a MIDX, we atomically move the new MIDX into place via
commit_lock_file(), but do not check to see if that call was successful.
Make sure that we do check in order to prevent us from incorrectly
reporting that we wrote a new MIDX if we actually encountered an error.
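As a rough sketch of the added check (the lock variable and error
message are illustrative):
    if (commit_lock_file(&lk) < 0)
        die_errno(_("could not write multi-pack-index"));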
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Apply similar treatment as in the last commit to the MIDX `repack`
operation.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Before a new MIDX can be written, expire_midx_packs() first loads the
existing MIDX, figures out which packs can be expired, and then writes a
new MIDX based on that information.
In order to load the existing MIDX, it uses load_multi_pack_index(),
which mmaps the multi-pack-index file, but does not store the resulting
`struct multi_pack_index *` in the object store.
write_midx_internal() also needs to open the existing MIDX, and it does
so by iterating the results of get_multi_pack_index(), so that it reuses
the same pointer held by the object store. But before it can move the
new MIDX into place, it close_object_store() to munmap() the
multi-pack-index file to accommodate platforms like Windows which don't
allow overwriting files which are memory mapped.
That's where things get weird. Since expire_midx_packs has its own
*separate* memory mapped copy of the MIDX, the MIDX file is still memory
mapped! Interestingly, this doesn't seem to cause a problem in our
tests. (I believe that this has much more to do with my own lack of
familiarity with Windows than with a lack of coverage in our tests.)
In any case, we can side-step the whole issue by teaching
expire_midx_packs() to use the `struct multi_pack_index` pointer it
found via the object store instead of maintaining its own copy. That way,
when write_midx_internal() calls `close_object_store()`, we know that
there are no memory mapped copies of the MIDX laying around.
A couple of other small notes about this patch:
- As far as I can tell, passing `local == 1` to the call to
load_multi_pack_index() was an error, since object_dir could be an
alternate. But it doesn't matter, since even though we write
`m->local = 1`, we never read that field back later on.
- Setting `m = NULL` after write_midx_internal() was likely meant to
prevent a double-free, dating back to when that function took a
`struct multi_pack_index *` that it called close_midx() on itself. We
can now rely on write_midx_internal() to call that for us.
Finally, this enforces the same "the value of --object-dir must be the
local object store, or an alternate" rule from f57a739691 (midx: avoid
opening multiple MIDXs when writing, 2021-09-01) to the `expire`
sub-command, too.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The first thing that write_midx_internal() does is load the MIDX
corresponding to the given object directory, if one is present.
Prepare for other functions in midx.c to do the same thing by extracting
that operation out to a small helper function.
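A sketch of what such a helper might look like (the helper name and the
exact matching logic are illustrative):
    static struct multi_pack_index *lookup_multi_pack_index(struct repository *r,
                                                            const char *object_dir)
    {
        struct multi_pack_index *cur;
        for (cur = get_multi_pack_index(r); cur; cur = cur->next)
            if (!strcmp(object_dir, cur->object_dir))
                return cur;
        return NULL;
    }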
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
* tb/repack-write-midx:
test-read-midx: fix leak of bitmap_index struct
builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
builtin/repack.c: make largest pack preferred
builtin/repack.c: support writing a MIDX while repacking
builtin/repack.c: extract showing progress to a variable
builtin/repack.c: rename variables that deal with non-kept packs
builtin/repack.c: keep track of existing packs unconditionally
midx: preliminary support for `--refs-snapshot`
builtin/multi-pack-index.c: support `--stdin-packs` mode
midx: expose `write_midx_file_only()` publicly
|
|
"git multi-pack-index write --bitmap" learns to propagate the
hashcache from original bitmap to resulting bitmap.
* tb/midx-write-propagate-namehash:
t5326: test propagating hashcache values
p5326: generate pack bitmaps before writing the MIDX bitmap
p5326: don't set core.multiPackIndex unnecessarily
p5326: create missing 'perf-tag' tag
midx.c: respect 'pack.writeBitmapHashcache' when writing bitmaps
pack-bitmap.c: propagate namehash values from existing bitmaps
t/helper/test-bitmap.c: add 'dump-hashes' mode
|
|
To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.
This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.
This can happen when a multi-pack index contains some pack which refers
to loose objects (e.g., if a pack was pushed after starting the repack
but before generating the MIDX which depends on an object which is
stored as loose in the repository, and by definition isn't included in
the multi-pack index).
By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.
This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e., are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.
The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).
When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.
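For illustration, a snapshot might look like this (hypothetical object
IDs; the leading '+' marks a preferred tip):
    +5f0bdf9bbd8a4f5a477bdcb2955e8f9a1d8da9a1
    3e63b979e4751d9bb07c504ba1b3000e1dafc08c
    94d8d4c56d6b4f4a0f2c3e9a2f9f35f6b3f7c2aa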
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.
This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).
Those patches will provide test coverage for this new function.
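The exposed entry point looks roughly like this (a sketch; the exact
parameter list may differ):
    int write_midx_file_only(const char *object_dir,
                             struct string_list *packs_to_include,
                             const char *preferred_pack_name,
                             unsigned flags);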
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Replace a handcrafted data structure used to keep track of bad
objects in the packfile API by an oidset.
* rs/packfile-bad-object-list-in-oidset:
packfile: use oidset for bad objects
packfile: convert has_packed_and_bad() to object_id
packfile: convert mark_bad_packed_object() to object_id
midx: inline nth_midxed_pack_entry()
oidset: make oidset_size() an inline function
|
|
In the previous commit, the bitmap writing code learned to propagate
values from an existing hash-cache extension into the bitmap that it is
writing.
Now that that functionality exists, let's expose it by teaching the 'git
multi-pack-index' builtin to respect the `pack.writeBitmapHashCache`
option so that the hash-cache may be written at all.
Two minor points worth noting here:
- The 'git multi-pack-index write' sub-command didn't previously read
any configuration (instead this is handled in the base command). A
separate handler is added here to respect this write-specific
config option.
- I briefly considered adding a 'bitmap_flags' field to the static
options struct, but decided against it since it would require
plumbing through a new parameter to the write_midx_file() function.
Instead, a new MIDX-specific flag is added, which is translated to
the corresponding bitmap one.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Store the object ID of broken pack entries in an oidset instead of
keeping only their hashes in an unsorted array. The resulting code is
shorter and easier to read. It also handles the (hopefully) very rare
case of having a high number of bad objects better.
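A minimal sketch of the converted bookkeeping (using the oidset API;
where the set lives is illustrative):
    struct oidset bad_objects = OIDSET_INIT;   /* replaces the unsorted array */
    oidset_insert(&bad_objects, oid);          /* mark_bad_packed_object() */
    if (oidset_contains(&bad_objects, oid))    /* has_packed_and_bad() */
        return -1;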
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
fill_midx_entry() finds the position of an object ID and passes it to
nth_midxed_pack_entry(), which uses the position to look up the object
ID for its own purposes. Inline the latter into the former to avoid
that lookup.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We never look at the repository argument which is passed. This makes
sense, since the multi_pack_index struct already tells us everything we
need to access the files in its associated object directory.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Write multi-pack bitmaps in the format described by
Documentation/technical/bitmap-format.txt, inferring their presence with
the absence of '--bitmap'.
To write a multi-pack bitmap, this patch attempts to reuse as much of
the existing machinery from pack-objects as possible. Specifically, the
MIDX code prepares a packing_data struct that pretends as if a single
packfile has been generated containing all of the objects contained
within the MIDX.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This prepares the code in pack-bitmap to interpret the new multi-pack
bitmaps described in Documentation/technical/bitmap-format.txt, which
mostly involves converting bit positions to accommodate looking them up
in a MIDX.
Note that there are currently no writers who write multi-pack bitmaps,
and that this will be implemented in the subsequent commit. Note also
that get_midx_checksum() and get_midx_filename() are made non-static so
they can be called from pack-bitmap.c.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Opening multiple instances of the same MIDX can lead to problems like two
separate packed_git structures which represent the same pack being added
to the repository's object store.
The above scenario can happen because prepare_midx_pack() checks if
`m->packs[pack_int_id]` is NULL in order to determine if a pack has been
opened and installed in the repository before. But a caller can
construct two copies of the same MIDX by calling get_multi_pack_index()
and load_multi_pack_index() since the former manipulates the
object store directly but the latter is a lower-level routine which
allocates a new MIDX for each call.
So if prepare_midx_pack() is called on multiple MIDXs with the same
pack_int_id, then that pack will be installed twice in the object
store's packed_git pointer.
This can lead to problems in, e.g., the pack-bitmap code, which does
something like the following (in pack-bitmap.c:open_pack_bitmap()):
    struct bitmap_index *bitmap_git = ...;
    for (p = get_all_packs(r); p; p = p->next) {
        if (open_pack_bitmap_1(bitmap_git, p) == 0)
            ret = 0;
    }
which is a problem if two copies of the same pack exist in the
packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a
conditional like the following:
    if (bitmap_git->pack || bitmap_git->midx) {
        /* ignore extra bitmap file; we can only handle one */
        warning("ignoring extra bitmap file: %s", packfile->pack_name);
        close(fd);
        return -1;
    }
Avoid this scenario by not letting write_midx_internal() open a MIDX
that isn't also pointed at by the object store. So long as this is the
case, other routines should prefer to open MIDXs with
get_multi_pack_index() or reprepare_packed_git() instead of creating
instances on their own. Because get_multi_pack_index() returns
`r->object_store->multi_pack_index` if it is non-NULL, we'll only have
one instance of a MIDX open at one time, avoiding these problems.
To encourage this, drop the `struct multi_pack_index *` parameter from
`write_midx_internal()`, and rely instead on the `object_dir` to find
(or initialize) the correct MIDX instance.
Likewise, replace the call to `close_midx()` with
`close_object_store()`, since we're about to replace the MIDX with a new
one and should invalidate the object store's memory of any MIDX that
might have existed beforehand.
Note that this now forbids passing object directories that don't belong
to alternate repositories over `--object-dir`, since before we would
have happily opened a MIDX in any directory, but now restrict ourselves
to only those reachable by `r->objects->multi_pack_index` (and alternate
MIDXs that we can see by walking the `next` pointer).
As far as I can tell, supporting arbitrary directories with
`--object-dir` was a historical accident, since even the documentation
says `<alt>` when referring to the value passed to this option.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When a repository has at least one alternate, the MIDX belonging to each
alternate is accessed through the `next` pointer on the main object
store's copy of the MIDX. close_midx() didn't bother to close any
of the linked MIDXs. It likewise didn't free the memory pointed to by
`m`, leaving uninitialized bytes with live pointers to them left around
in the heap.
Clean this up by closing linked MIDXs, and freeing up the memory pointed
to by each of them. When callers call close_midx(), then they can
discard the entire linked list of MIDXs and set their pointer to the
head of that list to NULL.
This isn't strictly required for the upcoming patches, but it makes it
much more difficult (though still possible, e.g., by calling
`close_midx(m->next)` which leaves `m->next` pointing at uninitialized
bytes) to have pointers to uninitialized memory.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the
multi-pack index code learned how to select a pack which all duplicate
objects are selected from. That is, if an object appears in multiple
packs, select the copy in the preferred pack before breaking ties
according to the other rules like pack mtime and readdir() order.
Not specifying a preferred pack can cause serious problems with
multi-pack reachability bitmaps, because these bitmaps rely on having at
least one pack from which all duplicates are selected. Not having such a
pack causes problems with the code in pack-objects to reuse packs
verbatim (e.g., that code assumes that a delta object in a chunk of pack
sent verbatim will have its base object sent from the same pack).
So why does not marking a pack preferred cause problems here? The reason
is roughly as follows:
- Ties are broken (when handling duplicate objects) by sorting
according to midx_oid_compare(), which sorts objects by OID,
preferred-ness, pack mtime, and finally pack ID (more on that
later).
- The pseudo pack-order (described in
Documentation/technical/pack-format.txt under the section
"multi-pack-index reverse indexes") is computed by
midx_pack_order(), and sorts by pack ID and pack offset, with
preferred packs sorting first.
- But! Pack IDs come from incrementing the pack count in
add_pack_to_midx(), which is a callback to
for_each_file_in_pack_dir(), meaning that pack IDs are assigned in
readdir() order.
When specifying a preferred pack, all of that works fine, because
duplicate objects are correctly resolved in favor of the copy in the
preferred pack, and the preferred pack sorts first in the object order.
"Sorting first" is critical, because the bitmap code relies on finding
out which pack holds the first object in the MIDX's pseudo pack-order to
determine which pack is preferred.
But if we didn't specify a preferred pack, and the pack which comes
first in readdir() order does not also have the lowest timestamp, then
it's possible that that pack (the one that sorts first in pseudo-pack
order, which the bitmap code will treat as the preferred one) did *not*
have all duplicate objects resolved in its favor, resulting in breakage.
The fix is simple: pick a (semi-arbitrary, non-empty) preferred pack
when none was specified. This forces that pack to have duplicates
resolved in its favor, and (critically) to sort first in pseudo-pack
order. Unfortunately, testing this behavior portably isn't possible,
since it depends on readdir() order which isn't guaranteed by POSIX.
(Note that multi-pack reachability bitmaps have yet to be implemented;
so in that sense this patch is fixing a bug which does not yet exist.
But by having this patch beforehand, we can prevent the bug from ever
materializing.)
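For illustration, the tie-breaking described above corresponds roughly
to a comparison function like this (a sketch, not the exact code):
    static int midx_oid_compare(const void *_a, const void *_b)
    {
        const struct pack_midx_entry *a = _a, *b = _b;
        int cmp = oidcmp(&a->oid, &b->oid);
        if (cmp)
            return cmp;
        if (a->preferred > b->preferred)        /* preferred copies sort first */
            return -1;
        if (a->preferred < b->preferred)
            return 1;
        if (a->pack_mtime > b->pack_mtime)      /* then newer packs */
            return -1;
        if (a->pack_mtime < b->pack_mtime)
            return 1;
        return a->pack_int_id - b->pack_int_id; /* finally readdir() order */
    }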
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The soon-to-be-implemented multi-pack bitmap treats the object in the first
bit position specially by assuming that all objects in the pack it was
selected from are also represented from that pack in the MIDX. In other
words, the pack from which the first object was selected must also have
all of its other objects selected from that same pack in the MIDX in
case of any duplicates.
But this assumption relies on the fact that there is at least one object
in that pack to begin with; otherwise the object in the first bit
position isn't from a preferred pack, in which case we can no longer
assume that all objects in that pack were also selected from the same
pack.
Guard this assumption by checking the number of objects in the given
preferred pack, and failing if the given pack is empty.
To make sure we can safely perform this check, open any packs which are
contained in an existing MIDX via prepare_midx_pack(). The same is done
for new packs via the add_pack_to_midx() callback, but packs picked up
from a previous MIDX will not yet have these opened.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When writing a new multi-pack index, write_midx_internal() attempts to
clean up any auxiliary files (currently just the MIDX's `.rev` file, but
soon to include a `.bitmap`, too) corresponding to the MIDX it's
replacing.
This step should happen after the new MIDX is written into place, since
doing so beforehand means that the old MIDX could be read without its
corresponding .rev file.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
If using --object-dir to point into an object directory which belongs to
a different repository than the one in the current working directory,
such as:
    git init repo
    git -C repo ... # add some objects
    cd alternate
    git multi-pack-index --object-dir ../repo/.git/objects write
the binary will segfault trying to access the object-dir via the repo it
found, but that's not fully initialized. Worse, if we later call
clear_midx_files_ext(), we will use `the_repository` and remove files
out of the wrong object directory.
Fix this by using the given object_dir (or the object directory of
`the_repository` if `--object-dir` wasn't given) to properly clean up
the *.rev files, avoiding the crash.
Original-patch-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Many "printf"-like helper functions we have have been annotated
with __attribute__() to catch placeholder/parameter mismatches.
* ab/attribute-format:
advice.h: add missing __attribute__((format)) & fix usage
*.h: add a few missing __attribute__((format))
*.c static functions: add missing __attribute__((format))
sequencer.c: move static function to avoid forward decl
*.c static functions: don't forward-declare __attribute__
|
|
Add missing __attribute__((format)) function attributes to various
"static" functions that take printf arguments.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
'git multi-pack-index verify' inspects the data in an existing MIDX for
correctness by checking that the recorded object offsets are correct,
and so on.
But it does not check that the file's trailing checksum matches the data
that it records. So, if an on-disk corruption happened to occur in the
final few bytes (and all other data was recorded correctly), we would:
- get a clean result from 'git multi-pack-index verify', but
- be unable to reuse the existing MIDX when writing a new one (since
we now check for checksum mismatches before reusing a MIDX)
Teach the 'verify' sub-command to recognize corruption in the checksum
by calling midx_checksum_valid().
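The check itself can be a thin wrapper over the csum-file helper (a
sketch, assuming the mmap'd MIDX data and its length are tracked on the
struct):
    int midx_checksum_valid(struct multi_pack_index *m)
    {
        return hashfile_checksum_valid(m->data, m->data_len);
    }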
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When writing a new multi-pack index, Git tries to reuse as much of the
data from an existing MIDX as possible, like object offsets. This is
done to avoid re-opening a bunch of *.idx files unnecessarily, but can
lead to problems if the data we are reusing is corrupt.
That's because we'll blindly reuse data from an existing MIDX without
checking its trailing checksum for validity. So if there is memory
corruption while writing a MIDX, or disk corruption in the intervening
period between writing and reuse, we'll blindly propagate those bad
values forward.
Suppose we experience a memory corruption while writing a MIDX such that
we write an incorrect object offset (or alternatively, the disk corrupts
the data after being written, but before being reused). Then when we go
to write a new MIDX, we'll reuse the bad object offset without checking
its validity. This means that the MIDX we just wrote is broken, but its
trailing checksum is in-tact, since we never bothered to look at the
values before writing.
In the above, a "git multi-pack-index verify" would have caught the
problem before writing, but writing a new MIDX wouldn't have noticed
anything wrong, blindly carrying forward the corrupt offset.
Individual pack indexes check their validity by verifying the crc32
attached to each entry when carrying data forward during a repack.
We could solve this problem for MIDXs in the same way, but individual
crc32's don't make much sense, since their entries are so small.
Likewise, checking the whole file on every read may be prohibitively
expensive if a repository has a lot of objects, packs, or both.
But we can check the trailing checksum when reusing an existing MIDX
when writing a new one. And a corrupt MIDX need not stop us from writing
a new one, since we can just avoid reusing the existing one at all and
pretend as if we are writing a new MIDX from scratch.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In the future, we'll want oidread to automatically set the hash
algorithm member for an object ID we read into it, so ensure we use
oidread instead of hashcpy everywhere we're copying a hash value into a
struct object_id.
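A typical conversion looks like this (a sketch; the surrounding pointer
arithmetic is illustrative):
    /* before */
    hashcpy(oid->hash, m->chunk_oid_lookup + st_mult(m->hash_len, n));
    /* after */
    oidread(oid, m->chunk_oid_lookup + st_mult(m->hash_len, n));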
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
An on-disk reverse-index to map the in-pack location of an object
back to its object name across multiple packfiles is introduced.
* tb/reverse-midx:
midx.c: improve cache locality in midx_pack_order_cmp()
pack-revindex: write multi-pack reverse indexes
pack-write.c: extract 'write_rev_file_order'
pack-revindex: read multi-pack reverse indexes
Documentation/technical: describe multi-pack reverse indexes
midx: make some functions non-static
midx: keep track of the checksum
midx: don't free midx_name early
midx: allow marking a pack as preferred
t/helper/test-read-midx.c: add '--show-objects'
builtin/multi-pack-index.c: display usage on unrecognized command
builtin/multi-pack-index.c: don't enter bogus cmd_mode
builtin/multi-pack-index.c: split sub-commands
builtin/multi-pack-index.c: define common usage with a macro
builtin/multi-pack-index.c: don't handle 'progress' separately
builtin/multi-pack-index.c: inline 'flags' with options
|
|
There is a lot of pointer dereferencing in the pre-image version of
'midx_pack_order_cmp()', which this patch gets rid of.
Instead of comparing the pack preferred-ness and then the pack id, both
of these checks are done at the same time by using the high-order bit of
the pack id to represent whether it's preferred. Then the pack id and
offset are compared as usual.
This produces the same result so long as there are less than 2^31 packs,
which seems like a likely assumption to make in practice.
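A sketch of the idea (field and variable names are illustrative):
    /*
     * Fold preferred-ness into the comparison key: preferred packs keep
     * the high bit clear, so they sort ahead of every other pack before
     * the usual pack id/offset comparison kicks in.
     */
    uint32_t key = pack_int_id;
    if (!preferred)
        key |= (1U << 31);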
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Implement the writing half of multi-pack reverse indexes. This is
nothing more than the format describe a few patches ago, with a new set
of helper functions that will be used to clear out stale .rev files
corresponding to old MIDXs.
Unfortunately, a very similar comparison function as the one implemented
recently in pack-revindex.c is reimplemented here, this time accepting a
MIDX-internal type. An effort to DRY these up would create more
indirection and overhead than is necessary, so it isn't pursued here.
Currently, there are no callers which pass the MIDX_WRITE_REV_INDEX
flag, meaning that this is all dead code. But, that won't be the case
for long, since subsequent patches will introduce the multi-pack bitmap,
which will begin passing this field.
(In midx.c:write_midx_internal(), the two adjacent if statements share a
conditional, but are written separately since the first one will
eventually also handle the MIDX_WRITE_BITMAP flag, which does not yet
exist.)
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Implement reading for multi-pack reverse indexes, as described in the
previous patch.
Note that these functions don't yet have any callers, and won't until
multi-pack reachability bitmaps are introduced in a later patch series.
In the meantime, this patch implements some of the infrastructure
necessary to support multi-pack bitmaps.
There are three new functions exposed by the revindex API:
- load_midx_revindex(): loads the reverse index corresponding to the
given multi-pack index.
- midx_to_pack_pos() and pack_pos_to_midx(): these convert between the
multi-pack index and pseudo-pack order.
load_midx_revindex() and pack_pos_to_midx() are both relatively
straightforward.
load_midx_revindex() needs a few functions to be exposed from the midx
API. One to get the checksum of a midx, and another to get the .rev's
filename. Similar to recent changes in the packed_git struct, three new
fields are added to the multi_pack_index struct: one to keep track of
the size, one to keep track of the mmap'd pointer, and another to point
past the header and at the reverse index's data.
pack_pos_to_midx() simply reads the corresponding entry out of the
table.
midx_to_pack_pos() is the trickiest, since it needs to find an object's
position in the pseudo-pack order, but that order can only be recovered
in the .rev file itself. This mapping can be implemented with a binary
search, but note that the thing we're binary searching over isn't an
array of values, but rather a permuted order of those values.
So, when comparing two items, it's helpful to keep in mind the
difference. Instead of a traditional binary search, where you are
comparing two things directly, here we're comparing a (pack, offset)
tuple with an index into the multi-pack index. That index describes
another (pack, offset) tuple, and it is _those_ two tuples that are
compared.
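Of the three, pack_pos_to_midx() is the simplest; a sketch (assuming a
`revindex_data` field pointing at the uint32_t table past the .rev
header):
    uint32_t pack_pos_to_midx(struct multi_pack_index *m, uint32_t pos)
    {
        if (pos >= m->num_objects)
            BUG("pack_pos_to_midx: out-of-bounds object");
        return get_be32(m->revindex_data + pos);
    }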
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In a subsequent commit, pack-revindex.c will become responsible for
sorting a list of objects in the "MIDX pack order" (which will be
defined in the following patch). To do so, it will need to know the
pack identifier and offset within that pack for each object in the MIDX.
The MIDX code already has functions for doing just that
(nth_midxed_offset() and nth_midxed_pack_int_id()), but they are
statically declared.
Since there is no reason that they couldn't be exposed publicly, and
because they are already doing exactly what the caller in
pack-revindex.c will want, expose them publicly so that they can be
reused there.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
write_midx_internal() uses a hashfile to write the multi-pack index, but
discards its checksum. This makes sense, since nothing that takes place
after writing the MIDX cares about its checksum.
That is about to change in a subsequent patch, when the optional
reverse index corresponding to the MIDX will want to include the MIDX's
checksum.
Store the checksum of the MIDX in preparation for that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
A subsequent patch will need to refer back to 'midx_name' later on in
the function. In fact, this variable is already free()'d later on, so
this makes the later free() no longer redundant.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When multiple packs in the multi-pack index contain the same object, the
MIDX machinery must make a choice about which pack it associates with
that object. Prior to this patch, the lowest-ordered[1] pack was always
selected.
Pack selection for duplicate objects is relatively unimportant today,
but it will become important for multi-pack bitmaps. This is because we
can only invoke the pack-reuse mechanism when all of the bits for reused
objects come from the reuse pack (in order to ensure that all reused
deltas can find their base objects in the same pack).
To encourage the pack selection process to prefer one pack over another
(the pack to be preferred is the one a caller would like to later use as
a reuse pack), introduce the concept of a "preferred pack". When
provided, the MIDX code will always prefer an object found in a
preferred pack over any other.
No format changes are required to store the preferred pack, since it
will be able to be inferred with a corresponding MIDX bitmap, by looking
up the pack associated with the object in the first bit position (this
ordering is described in detail in a subsequent commit).
[1]: the ordering is specified by MIDX internals; for our purposes we
can consider the "lowest ordered" pack to be "the one with the
most-recent mtime".
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead. It shortens the code and infers the
element size automatically.
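The conversion is mechanical; a representative before/after (not a
specific call site):
    /* before */
    x = xcalloc(n, sizeof(*x));
    /* after */
    CALLOC_ARRAY(x, n);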
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The common code to deal with "chunked file format" that is shared
by the multi-pack-index and commit-graph files have been factored
out, to help codepaths for both filetypes to become more robust.
* ds/chunked-file-api:
commit-graph.c: display correct number of chunks when writing
chunk-format: add technical docs
chunk-format: restore duplicate chunk checks
midx: use 64-bit multiplication for chunk sizes
midx: use chunk-format read API
commit-graph: use chunk-format read API
chunk-format: create read chunk API
midx: use chunk-format API in write_midx_internal()
midx: drop chunk progress during write
midx: return success/failure in chunk write methods
midx: add num_large_offsets to write_midx_context
midx: add pack_perm to write_midx_context
midx: add entries to write_midx_context
midx: use context in write_midx_pack_names()
midx: rename pack_info to write_midx_context
commit-graph: use chunk-format write API
chunk-format: create chunk format write API
commit-graph: anonymize data in chunk_write_fn
|
|
* ds/chunked-file-api:
commit-graph.c: display correct number of chunks when writing
chunk-format: add technical docs
chunk-format: restore duplicate chunk checks
midx: use 64-bit multiplication for chunk sizes
midx: use chunk-format read API
commit-graph: use chunk-format read API
chunk-format: create read chunk API
midx: use chunk-format API in write_midx_internal()
midx: drop chunk progress during write
midx: return success/failure in chunk write methods
midx: add num_large_offsets to write_midx_context
midx: add pack_perm to write_midx_context
midx: add entries to write_midx_context
midx: use context in write_midx_pack_names()
midx: rename pack_info to write_midx_context
commit-graph: use chunk-format write API
chunk-format: create chunk format write API
commit-graph: anonymize data in chunk_write_fn
|
|
When calculating the sizes of certain chunks, we should use 64-bit
multiplication always. This allows us to properly predict the chunk
sizes without risk of overflow.
Other possible overflows were discovered by evaluating each
multiplication in midx.c and ensuring that at least one side of the
operator was of type size_t or off_t.
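For example, a chunk-size computation along these lines (a sketch;
names are illustrative):
    /* force the multiplication into 64 bits, even on 32-bit platforms */
    uint64_t chunk_size = (uint64_t)num_objects * the_hash_algo->rawsz;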
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the progress of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.
This also reduces our barriers to combining the chunk writing code with
chunk-format.c.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents some interesting issues:
1. If a chunk writing method has a bug that miscalculates the written
bytes, then we can satisfy the table of contents without actually
writing the right amount of data to the hashfile. The commit-graph
writing code checks the hashfile struct directly for a more robust
verification.
2. There is no way for a chunk writing method to gracefully fail.
Returning an int presents an opportunity to fail without a die().
3. The current pattern doesn't match chunk_write_fn type exactly, so we
cannot share code with commit-graph.c
For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.
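The function-pointer type the writers now conform to is roughly:
    typedef int (*chunk_write_fn)(struct hashfile *f, void *data);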
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.
Update write_midx_object_offsets() to match chunk_write_fn.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.
Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.
Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data in
the context before those methods can match chunk_write_fn.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In an effort to align the write_midx_internal() to use the chunk-format
API, start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().
Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.
This change only renames "struct pack_info" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Code clean-up.
* ma/more-opaque-lock-file:
read-cache: try not to peek into `struct {lock_,temp}file`
refs/files-backend: don't peek into `struct lock_file`
midx: don't peek into `struct lock_file`
commit-graph: don't peek into `struct lock_file`
builtin/gc: don't peek into `struct lock_file`
|
|
Similar to the previous commits, avoid peeking into the `struct
lock_file`. Use the lock file API instead.
The two functions we're calling here double-check that the tempfile is
indeed "active", which is arguably overkill considering how we took the
lock on the line immediately above. More importantly, this future-proofs
us against, e.g., other code appearing between these two lines or the
lock file and/or tempfile internals changing.
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to
reflect that we're not just able to handle SHA-1 these days.
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Processes that access packdata while the .idx file gets removed
(e.g. while repacking) did not fail or fall back gracefully as they
could.
* tb/idx-midx-race-fix:
midx.c: protect against disappearing packs
packfile.c: protect against disappearing indexes
|
|
Code simplification.
* rs/hashwrite-be64:
pack-write: use hashwrite_be64()
midx: use hashwrite_be64()
csum-file: add hashwrite_be64()
|
|
When a packed object is stored in a multi-pack index, but that pack has
racily gone away, the MIDX code simply calls die(), when it could be
returning an error to the caller, which would in turn lead to
re-scanning the pack directory.
A pack can racily disappear, for example, due to a simultaneous 'git
repack -ad'.
You can also reproduce this with two terminals, where one is running:
    git init
    while true; do
        git commit -q --allow-empty -m foo
        git repack -ad
        git multi-pack-index write
    done
(in effect, constantly writing new MIDXs), and the other is running:
    obj=$(git rev-parse HEAD)
    while true; do
        echo $obj | git cat-file --batch-check='%(objectsize:disk)' || break
    done
That will sometimes hit the "error preparing packfile from
multi-pack-index" message, which this patch fixes.
Right now, that path to discovering a missing pack looks something like
'find_pack_entry()' calling 'fill_midx_entry()' and eventually making
its way to call 'nth_midxed_pack_entry()'.
'nth_midxed_pack_entry()' already checks 'is_pack_valid()' and
propagates an error if the pack is invalid. So, this works if the pack
has gone away between calling 'prepare_midx_pack()' and before calling
'is_pack_valid()', but not if it disappears before then.
Catch the case where the pack has already disappeared before
'prepare_midx_pack()' by returning an error in that case, too.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Call hashwrite_be64() to write 64-bit values instead of open-coding it
using hashwrite_be32() and sizeof. This shortens the code and makes its
intent clearer.
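A representative before/after (sketch):
    /* before: open-coded as two 32-bit big-endian writes */
    hashwrite_be32(f, offset >> 32);
    hashwrite_be32(f, offset & 0xffffffffUL);
    /* after */
    hashwrite_be64(f, offset);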
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
"git maintenance", an extended big brother of "git gc", continues
to evolve.
* ds/maintenance-part-2:
maintenance: add incremental-repack auto condition
maintenance: auto-size incremental-repack batch
maintenance: add incremental-repack task
midx: use start_delayed_progress()
midx: enable core.multiPackIndex by default
maintenance: create auto condition for loose-objects
maintenance: add loose-objects task
maintenance: add prefetch task
|
|
Now that the multi-pack-index may be written as part of auto maintenance
at the end of a command, reduce the progress output when the operations
are quick. Use start_delayed_progress() instead of start_progress().
Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that
the progress indicators are conditional.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The core.multiPackIndex setting has been around since c4d25228ebb
(config: create core.multiPackIndex setting, 2018-07-12), but has been
disabled by default. If a user wishes to use the multi-pack-index
feature, then they must enable this config and run 'git multi-pack-index
write'.
The multi-pack-index feature is relatively stable now, so make the
config option true by default. For users that do not use a
multi-pack-index, the only extra cost will be a file lookup to see if a
multi-pack-index file exists (once per process, per object directory).
Also, this config option will be referenced by an upcoming
"incremental-repack" task in the maintenance builtin, so move the config
option into the repository settings struct. Note that if
GIT_TEST_MULTI_PACK_INDEX=1, then we want to ignore the config option
and treat core.multiPackIndex as enabled.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Misc cleanups.
* rs/misc-cleanups:
pack-bitmap-write: use hashwrite_be32() in write_hash_cache()
midx: use hashwrite_u8() in write_midx_header()
fast-import: use write_pack_header()
|
|
When a packfile is removed by "git repack", multi-pack-index gets
cleared; the code was taught to do so less aggressively by first
checking if the midx actually refers to a pack that no longer
exists.
* tb/repack-clearing-midx:
midx: traverse the local MIDX first
builtin/repack.c: invalidate MIDX only when necessary
|
|
Emit byte-sized values using hashwrite_u8() instead of buffering them
locally first. The hashwrite functions already do their own buffering,
so this double-buffering does not reduce the number of system calls.
Getting rid of it shortens and simplifies the code a bit.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When a repository has an alternate object directory configured, callers
can traverse through each alternate's MIDX by walking the '->next'
pointer.
But, when 'prepare_multi_pack_index_one()' loads multiple MIDXs, it
places the new ones at the front of this pointer chain, not at the end.
This can be confusing for callers such as 'git repack -ad', causing test
failures like in t7700.6 with 'GIT_TEST_MULTI_PACK_INDEX=1'.
This occurs when dropping a pack known to the local MIDX with alternates
configured that have their own MIDX. Since the alternate's MIDX is
returned via 'get_multi_pack_index()', 'midx_contains_pack()' returns
true (which is correct, since it traverses through the '->next' pointer
to find the MIDX in the chain that does contain the requested pack).
But, we call 'clear_midx_file()' on 'the_repository', which drops the
MIDX at the path of the first MIDX in the chain, which (in the case of
t7700.6 is the one in the alternate).
This patch addresses that by:
- placing the local MIDX first in the chain when calling
'prepare_multi_pack_index_one()', and
- introducing a new 'get_local_multi_pack_index()', which explicitly
returns the repository-local MIDX, if any.
Don't impose any additional order on the MIDX's '->next' pointer beyond
requiring that the first item in the chain be the local one (if it
exists), so that we avoid a quadratic insertion.
Likewise, use 'get_local_multi_pack_index()' in
'remove_redundant_pack()' to fix the formerly broken t7700.6 when run
with 'GIT_TEST_MULTI_PACK_INDEX=1'.
Finally, note that the MIDX ordering invariant is only preserved by the
insertion order in 'prepare_packed_git()', which traverses through the
ODB's '->next' pointer, meaning we visit the local object store first.
This fragility makes this an undesirable long-term solution if more
callers are added, but it is acceptable for now since this is the only
caller.
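A sketch of the new accessor, relying on the invariant that the local
MIDX (if any) sits at the front of the chain:
    struct multi_pack_index *get_local_multi_pack_index(struct repository *r)
    {
        struct multi_pack_index *m = get_multi_pack_index(r);
        /* the local MIDX, if present, is always first in the chain */
        return (m && m->local) ? m : NULL;
    }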
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Use more buffered I/O where we used to call many small write(2)s.
* rs/more-buffered-io:
upload-pack: use buffered I/O to talk to rev-list
midx: use buffered I/O to talk to pack-objects
connected: use buffered I/O to talk to rev-list
|
|
Fix some incorrect UNLEAK() annotations.
* jk/unleak-fixes:
ls-remote: simplify UNLEAK() usage
stop calling UNLEAK() before die()
|
|
The "--batch-size" option of "git multi-pack-index repack" command
is now used to specify that very small packfiles are collected into
one until the total size roughly exceeds it.
* ds/midx-repack-to-batch-size:
multi-pack-index: repack batches below --batch-size
|
|
Similar to the commit-graph format, the multi-pack-index format has a
byte in the header intended to track the hash version used to write the
file. This allows one to interpret the hash length without having the
context of the repository config specifying the hash length. This was
not modified as part of the SHA-256 work because the hash length was
automatically up-shifted due to that config.
Since we have this byte available, we can make the file formats more
obviously incompatible instead of relying on other context from the
repository.
Add a new oid_version() method in midx.c similar to the one in
commit-graph.c. This is specifically made separate from that
implementation to avoid artificially linking the formats.
The test impact requires a few more things than the corresponding change
in the commit-graph format. Specifically, 'test-tool read-midx' was not
writing anything about this header value to output. Since the value
available in 'struct multi_pack_index' is hash_len instead of a version
value, we output "20" or "32" instead of "1" or "2".
Since we want a user to not have their Git commands fail if their
multi-pack-index has the incorrect hash version compared to the
repository's hash version, we relax the die() to an error() in
load_multi_pack_index(). This has some effect on 'git multi-pack-index
verify' as we need to check that a failed parse of a file that exists is
actually a verify error. For that test that checks the hash version
matches, we change the corrupted byte from "2" to "3" to ensure the test
fails for both hash algorithms.
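A sketch of the new helper (mirroring the commit-graph version; the
exact error wording may differ):
    static uint8_t oid_version(void)
    {
        switch (hash_algo_by_ptr(the_hash_algo)) {
        case GIT_HASH_SHA1:
            return 1;
        case GIT_HASH_SHA256:
            return 2;
        default:
            die(_("invalid hash version"));
        }
    }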
Helped-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Like f0bca72dc77 (send-pack: use buffered I/O to talk to pack-objects,
2016-06-08), significantly reduce the number of system calls and
simplify the code for sending object IDs to pack-objects by using
stdio's buffering.
Helped-by: Chris Torek <chris.torek@gmail.com>
Helped-by: Johannes Sixt <j6t@kdbg.org>
Encouraged-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The point of UNLEAK() is to make a reference to a variable that is about
to go out of scope so that leak-checkers will consider it to be
not-leaked. Doing so right before die() is therefore pointless; even
though we are about to exit the program, the variable will still be on
the stack and accessible to leak-checkers.
These annotations aren't really hurting anything, but they clutter the
code and set a bad example of how to use UNLEAK().
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The --batch-size=<size> option of 'git multi-pack-index repack' is
intended to limit the amount of work done by the repack. In the case of
a large repository, this command should repack a number of small
pack-files but leave the large pack-files alone. Most often, the
repository has one large pack-file from a 'git clone' operation and
number of smaller pack-files from incremental 'git fetch' operations.
The issue with '--batch-size' is that it also _prevents_ the repack from
happening if the expected size of the resulting pack-file is too small.
This was intended as a way to avoid frequent churn of small pack-files,
but it has mostly caused confusion when a repository is of "medium"
size. That is, not enormous like the Windows OS repository, but also not
so small that this incremental repack isn't valuable.
The solution presented here is to collect pack-files for repack if their
expected size is smaller than the batch-size parameter until either the
total expected size exceeds the batch-size or all pack-files are
considered. If there are at least two pack-files, then these are
combined to a new pack-file whose size should not be too much larger
than the batch-size.
This new strategy should succeed in keeping the number of pack-files
small in these "medium" size repositories. The concern about churn is
likely not interesting, as the real control over that is the frequency
in which the repack command is run.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We eventually want to drop the argv_array name and just use strvec
consistently. There's no particular reason we have to do it all at once,
or care about interactions between converted and unconverted bits.
Because of our preprocessor compat layer, the names are interchangeable
to the compiler (so even a definition and declaration using different
names is OK).
This patch converts all of the remaining files, as the resulting diff is
reasonably sized.
The conversion was done purely mechanically with:
    git ls-files '*.c' '*.h' |
    xargs perl -i -pe '
        s/ARGV_ARRAY/STRVEC/g;
        s/argv_array/strvec/g;
    '
We'll deal with any indentation/style fallouts separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When selecting a batch of pack-files to repack in the "git
multi-pack-index repack" command, Git should respect the
repack.packKeptObjects config option. When false, this option says that
the pack-files with an associated ".keep" file should not be repacked.
This config value is "false" by default.
There are two cases for selecting a batch of objects. The first is the
case where the input batch-size is zero, which specifies "repack
everything". The second is with a non-zero batch size, which selects
pack-files using a greedy selection criteria. Both of these cases are
updated and tested.
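In the selection loop, the new check amounts to something like this
(a sketch; variable names are illustrative):
    if (!pack_kept_objects && m->packs[i]->pack_keep)
        continue;   /* never add .keep'd packs to the batch */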
Reported-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When the "repack" subcommand of "git multi-pack-index" command
creates new packfile(s), it does not call the "git repack"
command but instead directly calls the "git pack-objects"
command, and the configuration variables meant for the "git
repack" command, like "repack.usedaeltabaseoffset", are ignored.
Check the configuration variables used by "git repack" ourselves
in "git multi-index-pack" and pass the corresponding options to
underlying "git pack-objects".
Note that `repack.writeBitmaps` configuration is ignored, as the
pack bitmap facility is useful only with a single packfile.
Signed-off-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The multi-pack-index left mmapped file descriptors open when it
does not have to.
* ds/multi-pack-index:
multi-pack-index: close file descriptor after mmap
|
|
The multi-pack-index subsystem was not closing its file descriptor
after memory-mapping the file contents. After this mmap() succeeds,
there is no need to keep the file descriptor open. In fact, there
is significant reason to close it so we do not run out of
descriptors.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When verifying a midx index with 0 objects, the
    m->num_objects - 1
underflows and wraps around to 4294967295.
Fix this both by checking that the midx contains at least one oid, and
also that we don't write any midx when there are no packfiles.
Update the tests to check that `git multi-pack-index write` does
not write a midx when there are no objects, and add another to check
that `git multi-pack-index verify` warns when it verifies a midx with no
objects. For this last test, use t5319/no-objects.midx which was
generated by an older version of git.
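The verify-side guard is a simple early return (a sketch; the warning
text and counter are illustrative):
    if (!m->num_objects) {
        warning(_("the midx contains no oid"));
        /* nothing left to check, so don't iterate up to num_objects - 1 */
        return verify_midx_error;
    }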
Signed-off-by: Damien Robert <damien.olivier.robert+git@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Our nth_packed_object_sha1() function returns NULL for error. So when we
wrapped it with nth_packed_object_oid(), we kept the same semantics. But
it's a bit funny, because the caller actually passes in an out
parameter, and the pointer we return is just that same struct they
passed to us (or NULL).
It's not too terrible, but it does make the interface a little
non-idiomatic. Let's switch to our usual "0 for success, negative for
error" return value. Most callers either don't check it, or are
trivially converted. The one that requires the biggest change is
actually improved, as we can ditch an extra aliased pointer variable.
Since we are changing the interface in a subtle way that the compiler
wouldn't catch, let's also change the name to catch any topics in
flight. We can drop the 'o' and make it nth_packed_object_id(). That's
slightly shorter, but also less redundant since the 'o' stands for
"object" already.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Update midx_repack to only display progress when
the MIDX_PROGRESS flag is set.
Signed-off-by: William Baker <William.Baker@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|