|
The "git_istream" abstraction has been revamped to make it easier
to interface with the pluggable object database design.
* ps/object-read-stream:
streaming: drop redundant type and size pointers
streaming: move into object database subsystem
streaming: refactor interface to be object-database-centric
streaming: move logic to read packed objects streams into backend
streaming: move logic to read loose objects streams into backend
streaming: make the `odb_read_stream` definition public
streaming: get rid of `the_repository`
streaming: rely on object sources to create object stream
packfile: introduce function to read object info from a store
streaming: move zlib stream into backends
streaming: create structure for filtered object streams
streaming: create structure for packed object streams
streaming: create structure for loose object streams
streaming: create structure for in-core object streams
streaming: allocate stream inside the backend-specific logic
streaming: explicitly pass packfile info when streaming a packed object
streaming: propagate final object type via the stream
streaming: drop the `open()` callback function
streaming: rename `git_istream` into `odb_read_stream`
|
|
Some of the code paths that deal with loose objects have been cleaned
up.
* ps/object-source-loose:
object-file: refactor writing objects via a stream
object-file: rename `write_object_file()`
object-file: refactor freshening of objects
object-file: rename `has_loose_object()`
object-file: read objects via the loose object source
object-file: move loose object map into loose source
object-file: hide internals when we need to reprepare loose sources
object-file: move loose object cache into loose source
object-file: introduce `struct odb_source_loose`
object-file: move `fetch_if_missing`
odb: adjust naming to free object sources
odb: introduce `odb_source_new()`
odb: fix subtle logic to check whether an alternate is usable
|
|
Adding a repository that uses a different hash function is a no-no,
but "git submodule add" did nt prevent it, which has been corrected.
* bc/submodule-force-same-hash:
read-cache: drop submodule check from add_to_cache()
object-file: disallow adding submodules of different hash algo
|
|
In the preceding commits we have turned `struct odb_read_stream` into a
publicly visible structure. Furthermore, this structure now contains the
type and size of the object that we are about to stream. Consequently,
the out-pointers that we used before to propagate the type and size of
the streamed object are now somewhat redundant with the data contained
in the structure itself.
Drop these out-pointers and adapt callers accordingly.
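To illustrate the shape of the change (a sketch only, with simplified
signatures; the exact parameters of the open function in the tree may
differ):

  /* Before: type and size were returned via out-pointers. */
  struct odb_read_stream *odb_read_stream_open(struct object_database *odb,
                                               const struct object_id *oid,
                                               enum object_type *type,
                                               unsigned long *size);

  /* After: the now-public structure carries both itself. */
  struct odb_read_stream {
          enum object_type type; /* type of the object being streamed */
          unsigned long size;    /* total size of the object */
          /* ... backend-specific state ... */
  };

  struct odb_read_stream *odb_read_stream_open(struct object_database *odb,
                                               const struct object_id *oid);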
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The "streaming" terminology is somewhat generic, so it may not be
immediately obvious that "streaming.{c,h}" is specific to the object
database. Rectify this by moving it into the "odb/" directory so that it
can be immediately attributed to the object subsystem.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Refactor the streaming interface to be centered around object databases
instead of centered around the repository. Rename the functions
accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Move the logic to read loose object streams into the respective
subsystem. This allows us to make a couple of function declarations
private.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In the following patches we are about to make the `git_istream` more
generic so that it becomes fully controlled by the specific object
source that wants to create it. As part of these refactorings we'll
fully move the structure into the object database subsystem.
Prepare for this change by renaming the structure from `git_istream`
to `odb_read_stream`. This mirrors the `odb_write_stream` structure that
we already have.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In add_to_cache(), we treat any directories as submodules, and complain
if we can't resolve their HEAD. This call to resolve_gitlink_ref() was
added by f937bc2f86 (add: error appropriately on repository with no
commits, 2019-04-09), with the goal of improving the error message for
empty repositories.
But we already resolve the submodule HEAD in index_path(), which is
where we find the actual oid we're going to use. Resolving it again here
introduces some downsides:
1. It's more work, since we have to open up the submodule repository's
files twice.
2. There are call paths that get to index_path() without going through
add_to_cache(). For instance, we'd want a similar informative
message if "git diff empty" finds that it can't resolve the
submodule's HEAD. (In theory we can also get there through
update-index, but AFAICT it refuses to consider directories as
submodules at all, and just complains about them).
3. The resolution in index_path() catches more errors that we don't
handle here. In particular, it will validate that the object format
for the submodule matches that of the superproject. This isn't a
bug, since our call in add_to_cache() throws away the oid it gets
without looking at it. But it certainly caused confusion for me
when looking at where the object-format check should go.
So instead of resolving the submodule HEAD in add_to_cache(), let's just
teach the call in index_path() to actually produce an error message
(which it already does for other cases). That's probably what f937bc2f86
should have done in the first place, and it gives us a single point of
resolution when adding a submodule to the index.
The resulting output is slightly more verbose, as we propagate the error
up the call stack, but I think that's OK (and again, matches many other
errors we get when indexing fails).
I've left the text of the error message as-is, though it is perhaps
overly specific. There are many reasons that resolving the submodule
HEAD might fail, though outside of corruption or system errors it is
probably most likely that the submodule HEAD is simply on an unborn
branch.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The design of the hash algorithm transition plan is that objects stored
must be entirely in one algorithm since we lack any way to indicate a
mix of algorithms. This also includes submodules, but we have
traditionally not enforced this, which leads to various problems when
trying to clone or check out the submodule from the remote.
Since this cannot work in the general case, restrict adding a submodule
of a different algorithm to the index. Add tests for git add and git
submodule add that these are rejected.
Note that we cannot check this in git fsck because the malformed
submodule is stored in the tree as an object ID which is either
truncated (when a SHA-256 submodule is added to a SHA-1 repository) or
padded with zeros (when a SHA-1 submodule is added to a SHA-256
repository). We cannot detect even the latter case because someone
could have a legitimate submodule whose object ID happens to end in 24
zeros, which would be a false positive.
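To make the two corruption modes concrete, here is a toy sketch
(illustrative C, not code from this change) of what ends up in a tree
entry, assuming 20-byte SHA-1 and 32-byte SHA-256 binary object IDs:

  #include <string.h>

  #define SHA1_RAWSZ   20
  #define SHA256_RAWSZ 32

  /*
   * A SHA-1 tree entry only has room for 20 bytes, so storing a
   * SHA-256 submodule OID silently truncates it; storing a SHA-1 OID
   * in a SHA-256 tree pads it with 12 zero bytes (24 zeros in hex).
   * Neither form is reliably distinguishable from a legitimate hash.
   */
  static void store_in_sha1_tree(unsigned char entry[SHA1_RAWSZ],
                                 const unsigned char oid[SHA256_RAWSZ])
  {
          memcpy(entry, oid, SHA1_RAWSZ); /* truncated */
  }

  static void store_in_sha256_tree(unsigned char entry[SHA256_RAWSZ],
                                   const unsigned char oid[SHA1_RAWSZ])
  {
          memcpy(entry, oid, SHA1_RAWSZ);
          memset(entry + SHA1_RAWSZ, 0,
                 SHA256_RAWSZ - SHA1_RAWSZ); /* zero-padded */
  }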
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We have two different ways to write an object into the database:
- We either provide the full buffer and write the object all at once.
- Or we provide an input stream that has a `read()` function so that
we can chunk the object.
The latter is especially used for large objects, where it may be too
expensive to hold the complete object in memory all at once.
While we already have `odb_write_object()` at the ODB-layer, we don't
have an equivalent for streaming an object. Introduce a new function
`odb_write_object_stream()` to address this gap so that callers don't
have to be aware of the inner workings of how to stream an object to
disk with a specific object source.
Rename `stream_loose_object()` to `odb_source_loose_write_stream()` to
clarify its scope. This matches our modern best practices around how to
name functions.
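The gist of the two entry points, sketched (the stream type follows the
shape of the `input_stream` used by the bulk-checkin code; the exact
signatures are approximations):

  /* Whole-buffer write: the object is fully in memory. */
  int odb_write_object(struct object_database *odb,
                       const void *buf, unsigned long len,
                       enum object_type type, struct object_id *oid);

  /* Streaming write: content is pulled in chunks via a read callback,
   * so a large object never has to be held in memory all at once. */
  struct input_stream {
          const void *(*read)(struct input_stream *, unsigned long *len);
          void *data;
          int is_finished;
  };

  int odb_write_object_stream(struct object_database *odb,
                              struct input_stream *in,
                              unsigned long len,
                              enum object_type type,
                              struct object_id *oid);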
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Rename `write_object_file()` to `odb_source_loose_write_object()` so
that it becomes clear that this is tied to a specific loose object
source. This matches our modern naming schema for functions.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When writing an object that already exists in our object database we
skip the write and instead only update mtimes of the object, either in
its packed or loose object format. This logic is wholly contained in
"object-file.c", but that file is really only concerned with loose
objects. So it does not really make sense that it also contains the
logic to freshen a packed object.
Introduce a new `odb_freshen_object()` function that sits on the object
database level and two functions `packfile_store_freshen_object()` and
`odb_source_loose_freshen_object()`. Like this, the format-specific
functions can be part of their respective subsystems, while the backend
agnostic function to freshen an object sits at the object database
layer.
Note that this change also moves the logic that iterates through object
sources from the object source layer into the object database layer.
This change is intentional: object sources should ideally only have to
worry about themselves, and coordination of different sources should be
handled on the object database level.
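The coordination could look roughly like this (a sketch; the member
names `sources` and `packfiles` are assumptions for illustration):

  int odb_freshen_object(struct object_database *odb,
                         const struct object_id *oid)
  {
          struct odb_source *source;

          /* Try the packed format first, across the whole store... */
          if (packfile_store_freshen_object(odb->packfiles, oid))
                  return 1;

          /* ...then ask each source whether it has the object loose. */
          for (source = odb->sources; source; source = source->next)
                  if (odb_source_loose_freshen_object(source, oid))
                          return 1;

          return 0;
  }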
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Rename `has_loose_object()` to `odb_source_loose_has_object()` so that
it becomes clear that this is tied to a specific loose object source.
This matches our modern naming schema for functions.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When reading an object via `loose_object_info()` or `map_loose_object()`
we hand in the whole repository. We then iterate through each of the
object sources to figure out whether that source has the object in
question.
This logic is reversing responsibility though: a specific backend should
only care about one specific source, where the object sources themselves
are then managed by the object database.
Refactor the code accordingly by passing an object source to both of
these functions instead. The different sources are then handled by
either `do_oid_object_info_extended()`, which sits on the object
database level, and by `open_istream_loose()`. The latter function
arguably is still at the wrong level, but this will be cleaned up at a
later point in time.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The loose object map is used to map from the repository's canonical
object hash to the compatibility hash. As the name indicates, this map
is only used for loose objects, and as such it is tied to a specific
loose object source.
Same as with preceding commits, move this map into the loose object
source accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
There are two different situations where we have to clear the cache of
loose objects:
- When freeing the loose object source itself to avoid memory leaks.
- When repreparing the loose object source so that any potentially-
stale data is getting evicted from the cache.
The former is already handled by `odb_source_loose_free()`. But the
latter case is still handled manually in `odb_reprepare()`, so we are
leaking internals into that code.
Introduce a new `odb_source_loose_reprepare()` function as an equivalent
to `packfile_store_prepare()` to hide these implementation details.
Furthermore, while at it, rename the function `odb_clear_loose_cache()`
to `odb_source_loose_clear()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Our loose objects use a cache that (optionally) stores all objects for
each of the opened sharding directories. This cache is located in the
`struct odb_source`, but now that we have `struct odb_source_loose` it
makes sense to move it into the latter structure so that all state that
relates to loose objects is entirely self-contained.
Do so. While at it, rename corresponding functions to have a prefix that
relates to `struct odb_source_loose`.
Note that despite this prefix, the functions still accept a `struct
odb_source` as input. This is done intentionally: once we introduce
pluggable object databases, we will continue to accept this struct but
then do a cast inside these functions to `struct odb_source_loose`. This
design is similar to how we do it for our ref backends.
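For illustration, the eventual downcast pattern alluded to here might
look as follows (member names are made up), just like the ref backends
cast a generic `struct ref_store *` to `struct files_ref_store *`:

  struct odb_source_loose {
          struct odb_source base; /* must stay the first member */
          struct oidtree *cache;  /* cached loose object IDs */
  };

  void odb_source_loose_clear(struct odb_source *source)
  {
          struct odb_source_loose *loose =
                  (struct odb_source_loose *)source;

          oidtree_clear(loose->cache);
  }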
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Currently, all state that relates to loose objects is held directly by
the `struct odb_source`. Introduce a new `struct odb_source_loose` to
hold the state instead so that it is entirely self-contained.
This structure will eventually morph into the backend for accessing
loose objects. As such, this is part of the refactorings to introduce
pluggable object databases.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Code clean-up around the in-core list of all the pack files and
object database(s).
* ps/packfile-store:
packfile: refactor `get_packed_git_mru()` to work on packfile store
packfile: refactor `get_all_packs()` to work on packfile store
packfile: refactor `get_packed_git()` to work on packfile store
packfile: move `get_multi_pack_index()` into "midx.c"
packfile: introduce function to load and add packfiles
packfile: refactor `install_packed_git()` to work on packfile store
packfile: split up responsibilities of `reprepare_packed_git()`
packfile: refactor `prepare_packed_git()` to work on packfile store
packfile: reorder functions to avoid function declaration
odb: move kept cache into `struct packfile_store`
odb: move MRU list of packfiles into `struct packfile_store`
odb: move packfile map into `struct packfile_store`
odb: move initialization bit into `struct packfile_store`
odb: move list of packfiles into `struct packfile_store`
packfile: introduce a new `struct packfile_store`
|
|
Transactions are managed via the {begin,end}_odb_transaction() functions
in the object-file subsystem, and their implementation is specific to
the files object source. Introduce odb_transaction_{begin,commit}() in
the odb subsystem to eventually provide an object-source-agnostic means
to manage transactions.
Update call sites to instead manage transactions through the odb
subsystem. Also rename {begin,end}_odb_transaction() functions to
object_file_transaction_{begin,commit}() to clarify the object source it
supports.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Update the names of several functions and types relocated from the
bulk-checkin subsystem for better clarity. Also drop
finish_tmp_packfile() as a standalone function in favor of embedding it
in flush_packfile_transaction() directly.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The bulk-checkin subsystem provides various functions to manage ODB
transactions. Apart from {begin,end}_odb_transaction(), these functions
are only used by the object-file subsystem to manage aspects of a
transaction implementation specific to the files object source.
Relocate all the transaction code in bulk-checkin to object-file. This
simplifies the exposed transaction interface by reducing it to only
{begin,end}_odb_transaction(). Function and type names are adjusted in
the subsequent commit to better fit the new location.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
ODB transactions support being nested. Only the outermost
{begin,end}_odb_transaction() start and finish a transaction. This
allows internal object write codepaths to be optimized with ODB
transactions without worrying about whether a transaction is already
active. When {begin,end}_odb_transaction() is invoked during an active
transaction, these operations are essentially treated as no-ops. This
can make the interface a bit awkward to use, as calling
end_odb_transaction() does not guarantee that a transaction is actually
ended. Thus, in situations where a transaction needs to be explicitly
flushed, flush_odb_transaction() must be used.
To remove the need for an explicit transaction flush operation via
flush_odb_transaction() and better clarify transaction semantics, drop
the transaction nesting mechanism in favor of begin_odb_transaction()
returning a NULL transaction value to signal it was a no-op, and
end_odb_transaction() behaving as a no-op when a NULL transaction value
is passed. This is safe for existing callers as the transaction value
wired to end_odb_transaction() already comes from
begin_odb_transaction() and thus continues the same no-op behavior when
a transaction is already pending. With this model, passing a pending
transaction to end_odb_transaction() ensures it is committed at that
point in time.
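Sketched in code (internal helpers and member names are assumptions,
not the exact implementation):

  struct odb_transaction *begin_odb_transaction(struct object_database *odb)
  {
          /*
           * A transaction is already pending: signal a no-op by
           * returning NULL instead of nesting.
           */
          if (odb->transaction)
                  return NULL;

          odb->transaction = xcalloc(1, sizeof(*odb->transaction));
          odb->transaction->odb = odb;
          return odb->transaction;
  }

  void end_odb_transaction(struct odb_transaction *transaction)
  {
          /* NULL means the matching begin was a no-op; so are we. */
          if (!transaction)
                  return;

          flush_and_commit(transaction); /* illustrative helper */
          transaction->odb->transaction = NULL;
          free(transaction);
  }

Any write path can thus bracket itself with a begin/end pair without
checking for an outer transaction; only the outermost pair commits.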
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The bulk-checkin subsystem provides a mechanism to write blobs directly
to a packfile via `index_blob_bulk_checkin()`. If there is an ongoing
transaction when invoked, objects written via this function are stored
in the same packfile. The packfile is not flushed until the transaction
itself is flushed. If there is no transaction, the single object is
written to a packfile and immediately flushed. This complicates
`index_blob_bulk_checkin()` as it cannot reliably use the provided
transaction to get the associated repository.
Update `index_blob_bulk_checkin()` to assume that a valid transaction is
always provided. Callers are now expected to ensure a transaction is set
up beforehand. With this simplification, `deflate_blob_bulk_checkin()`
is no longer needed as a standalone internal function and is combined
with `index_blob_bulk_checkin()`. The single call site in
`object-file.c:index_fd()` is updated accordingly. Due to how
`{begin,end}_odb_transaction()` handles nested transactions, a new
transaction is only created and committed if there is not already an
ongoing transaction.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Object database transactions in the bulk-checkin subsystem rely on
global state to track transaction status. Stop relying on global state
and instead store the transaction in the `struct object_database`.
Functions that operate on transactions are updated to now wire
transaction state.
Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Both `index_fd()` and `index_path()` still use `the_repository` even
though they have a repository available via `struct index_state`. Adapt
them so that they use the index's repository instead to get rid of this
global dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `force_object_loose()` forces an object to become a loose
object in case it only exists in its packed form. To do so it implicitly
relies on `the_repository`.
Refactor the function by passing a `struct odb_source` as parameter.
While the check whether any such loose object exists already acts on the
whole object database, writing the loose object happens in one specific
source.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `read_loose_object()` takes a path to an object file and
tries to parse it. As such, the function does not depend on any specific
object database but instead acts as an ODB-independent way to read a
specific file. As such, all it needs as input is a repository so that we
can derive repo settings and the hash algorithm.
That repository isn't passed in as a parameter though, as we implicitly
depend on the global `the_repository`. Refactor the function so that we
pass in the repository as a parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The iterators for loose objects still rely on `the_repository`. Refactor
them:
- `for_each_loose_file_in_objdir()` is refactored so that the caller
is now expected to pass an `odb_source` as parameter instead of the
path to that source. Furthermore, it is renamed accordingly to
`for_each_loose_file_in_source()`.
- `for_each_loose_object()` is refactored to take in an object
database now and calls the above function in a loop.
This allows us to get rid of the global dependency.
Adjust callers accordingly.
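Roughly (a sketch; the loop body and flag handling are simplified):

  int for_each_loose_object(struct object_database *odb,
                            each_loose_object_fn cb, void *data,
                            enum for_each_object_flags flags)
  {
          struct odb_source *source;

          for (source = odb->sources; source; source = source->next) {
                  int ret = for_each_loose_file_in_source(source, cb, data);
                  if (ret)
                          return ret;
                  if (flags & FOR_EACH_OBJECT_LOCAL_ONLY)
                          break; /* only the primary source was asked for */
          }
          return 0;
  }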
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `for_each_file_in_obj_subdir()` is declared in our headers,
but it is not used anywhere else than in the corresponding code file
itself. Drop the declaration and mark the function as file-local.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `for_each_loose_file_in_objdir_buf()` is declared in our
headers, but it is not used anywhere else than in the corresponding code
file itself. Drop the declaration and inline the function into its only
caller.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The logic that writes loose objects still relies on `the_repository` to
decide where exactly the object shall be written to. Refactor it so that
the logic instead operates on a `struct odb_source` so that we can get
rid of this global dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We do not have a backend-agnostic way to write objects into an object
database. While there is `write_object_file()`, this function is rather
specific to the loose object format.
Introduce `odb_write_object()` to plug this gap. For now, this function
is a simple wrapper around `write_object_file()` and doesn't even use
the passed-in object database yet. This will change in subsequent
commits, where `write_object_file()` is converted so that it works on
top of an `odb_source`. `odb_write_object()` will then become
responsible for deciding which source an object shall be written to.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When a repository is configured to have a compatibility hash algorithm
we keep track of object ID mappings for loose objects via the loose
object map. This map simply maps an object ID of the actual hash to the
object ID of the compatibility hash. This loose object map is an
inherent property of the loose files backend and thus of one specific
object source.
Refactor the interfaces to reflect this by requiring a `struct
odb_source` as input instead of a repository. This prepares for
subsequent commits where we will refactor writing of loose objects to
work on a `struct odb_source`, as well.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We implicitly depend on `the_repository` when moving an object file into
place in `finalize_object_file()`. Get rid of this global dependency by
passing in a repository.
Note that one might be tempted to inject an object database instead of a
repository. But the function doesn't really care about the ODB at all.
All it does is to move a file into place while checking whether there is
any collision. As such, the functionality it provides is independent of
the object database and only needs the repository as parameter so that
it can adjust permissions of the file we are about to finalize.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
While `loose_object_info()` already accepts a repository as parameter we
still have one callsite in there where we use `the_repository` to figure
out the hash algorithm. Use the passed-in repository instead.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We implicitly depend on `the_repository` when freshening either loose or
packed objects. Refactor these functions to instead accept an object
database as input so that we can get rid of the global dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `check_and_freshen()` functions are only used by a single caller
now. Inline them into `freshen_loose_object()`.
While at it, rename `check_and_freshen_odb()` to `check_and_freshen_source()` to reflect
that it works on a single object source instead of on the whole database.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We implicitly depend on `the_repository` in `has_loose_object()`.
Refactor the function to accept an `odb_source` as input that should be
checked for such a loose object.
This refactoring changes semantics of the function to not check the
whole object database for such a loose object anymore, but instead we
now only check that single source. Existing callers thus need to loop
through all sources manually now.
While this change may seem illogical at first, whether or not an object
exists in a specific format should be answered by the source using that
format. As such, we can eventually convert this into a generic function
`odb_source_has_object()` that simply checks whether a given object
exists in an object source. And as we will know about the format that
any given source uses it allows us to derive whether the object exists
in a given format.
This change also makes `has_loose_object_nonlocal()` obsolete. The only
caller of this function is adapted so that it skips the primary object
source.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
There are a couple of users of the `the_hash_algo` macro, which
implicitly depends on `the_repository`. Adapt these callers to not do so
anymore, either by deriving it from already-available context or by
using `the_repository->hash_algo`. The latter variant doesn't yet help
to remove the global dependency, but such users will be adapted in the
following commits to not use `the_repository` anymore.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
There are some trivial -Wsign-compare warnings in "object-file.c". Fix
them and drop the preprocessor define that disables those warnings.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Rename `oid_object_info()` to `odb_read_object_info()` as well as their
`_extended()` variant to match other functions related to the object
database and our modern coding guidelines.
Introduce compatibility wrappers so that any in-flight topics will
continue to compile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The functions to manage alternates all depend on `the_repository`.
Refactor them to accept an object database as a parameter and adjust all
callers. The functions are renamed accordingly.
Note that right now the situation is still somewhat weird because we end
up using the object store path provided by the object store's repository
anyway. Consequently, we could have instead passed in a pointer to the
repository instead of passing in the pointer to the object store. This
will be addressed in subsequent commits though, where we will start to
use the path owned by the object store itself.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In the preceding commits we have renamed the structures contained in
"object-store.h" to `struct object_database` and `struct odb_backend`.
As such, the code files "object-store.{c,h}" are confusingly named now.
Rename them to "odb.{c,h}" accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `object_directory` structure is used as an access point for a single
object directory like ".git/objects". While the structure isn't yet
fully self-contained, the intent is for it to eventually contain all
information required to access objects in one specific location.
While the name "object directory" is a good fit for now, this will
change over time as we continue with the agenda to make pluggable object
databases a thing. Eventually, objects may not be accessed via any kind
of directory at all anymore, but they could instead be backed by any
kind of durable storage mechanism. While it seems quite far-fetched for
now, it is conceivable that eventually this might even be some form of a
database, for example.
As such, the current name of this structure will become worse over time
as we evolve into the direction of pluggable ODBs. Immediate next steps
will start to carve out proper self-contained object directories, which
requires us to pass in these object directories as parameters. Based on
our modern naming schema this means that those functions should then be
named after their subsystem, which means that we would start to bake the
current name into the codebase more and more.
Let's preempt this by renaming the structure. There have been a couple
alternatives that were discussed:
- `odb_backend` was discarded because it led to the association that
one object database has a single backend, but the model is that one
alternate has one backend. Furthermore, "backend" is more about the
actual backing implementation and less about the high-level concept.
- `odb_alternate` was discarded because it is a bit of a stretch to
also call the main object directory an "alternate".
Instead, pick `odb_source` as the new name. It makes it sufficiently
clear that there can be multiple sources and does not cause confusion
when mixed with the already-existing "alternate" terminology.
In the future, this change allows us to easily introduce for example a
`odb_files_source` and other format-specific implementations.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Since "hash-object --literally" no longer supports objects with unknown
types, there are now no callers of write_object_file_literally() and its
helpers. Let's drop them to simplify the code.
In particular, this gets rid of some ugly copy-and-paste code from
write_object_file_literally(), which is a parallel implementation of
write_object_file(). When the split was originally made, the two weren't
that long, but commits like 63a6745a07 (object-file: update the loose
object map when writing loose objects, 2023-10-01) ended up having to
duplicate some tricky code.
This patch drops all of that duplication and should make things less
error-prone going forward.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We provide a mechanism for callers to get the object type as a raw
string, rather than an object_type enum. This was in theory useful for
returning types that are not representable in the enum, but we consider
any such type to be an error, and there are no callers that use the
strbuf anymore.
Let's drop support to simplify the code a bit.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When fsck-ing a loose object, we use object_info's type_name strbuf to
record the parsed object type as a string. For most objects this is
redundant with the object_type enum, but it does let us report the
string when we encounter an object with an unknown type (for which there
is no matching enum value).
There are a few downsides, though:
1. The code to report these cases is not actually robust. Since we did
not pass a strbuf to unpack_loose_header(), we only retrieved types
from headers up to 32 bytes. In longer cases, we'd simply say
"object corrupt or missing".
2. This is the last caller that uses object_info's type_name strbuf
support. It would be nice to refactor it so that we can simplify
that code.
3. Likewise, we'll check the hash of the object using its unknown type
(again, as long as that type is short enough). That depends on the
hash_object_file_literally() code, which we'd eventually like to
get rid of.
So we can simplify things by bailing immediately in read_loose_object()
when we encounter an unknown type. This has a few user-visible effects:
a. Instead of producing a single line of error output like this:
error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object is of unknown type 'bogus': .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
we'll now issue two lines (the first from read_loose_object() when
we see the unparsable header, and the second from the fsck code,
since we couldn't read the object):
error: unable to parse type from header 'bogus 4' of .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object corrupt or missing: .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
This is a little more verbose, but this sort of error should be
rare (such objects are almost impossible to work with, and cannot
be transferred between repositories as they are not representable
in packfiles). And as a bonus, reporting the broken header in full
could help with debugging other cases (e.g., a header like "blob
xyzzy\0" would fail in parsing the size, but previously we'd not
have shown the offending bytes).
b. An object with an unknown type will be reported as corrupt, without
actually doing a hash check. Again, I think this is unlikely to
matter in practice since such objects are totally unusable.
We'll update one fsck test to match the new error strings. And we can
remove another test that covered the case of an object with an unknown
type _and_ a hash corruption. Since we'll skip the hash check now in
this case, the test is no longer interesting.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Since cat-file dropped its "--allow-unknown-type" option in the previous
commit, there are no more uses of the internal flag that implemented it.
Let's drop it.
That in turn lets us drop the strbuf parameter of unpack_loose_header(),
which now is always NULL. And without that, we can drop all of the
additional code to inflate larger headers into the strbuf.
Arguably we could drop ULHR_TOO_LONG, as no callers really care about
the distinction from ULHR_BAD. But it's easy enough to retain, and it
does let us produce a slightly more specific message in one instance.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function `loose_object_path()` is a trivial wrapper around
`odb_loose_path()`, with the only exception that it always uses the
primary object database of the given repository. This doesn't really add
a ton of value though, so let's drop the function and inline it at every
callsite.
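The change at each callsite is mechanical; schematically (the field
holding the primary source here is an assumption, not the exact name):

  /* Before: the wrapper implicitly picked the primary object source. */
  const char *path = loose_object_path(repo, &buf, oid);

  /* After: the callsite names the source itself. */
  const char *path = odb_loose_path(repo->objects->sources, &buf, oid);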
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
* ps/object-file-cleanup:
object-store: merge "object-store-ll.h" and "object-store.h"
object-store: remove global array of cached objects
object: split out functions relating to object store subsystem
object-file: drop `index_blob_stream()`
object-file: split up concerns of `HASH_*` flags
object-file: split out functions relating to object store subsystem
object-file: move `xmmap()` into "wrapper.c"
object-file: move `git_open_cloexec()` to "compat/open.c"
object-file: move `safe_create_leading_directories()` into "path.c"
object-file: move `mkdir_in_gitdir()` into "path.c"
|
|
The object layer has been updated to take an explicit repository
instance as a parameter in more code paths.
* ps/object-wo-the-repository:
hash: stop depending on `the_repository` in `null_oid()`
hash: fix "-Wsign-compare" warnings
object-file: split out logic regarding hash algorithms
delta-islands: stop depending on `the_repository`
object-file-convert: stop depending on `the_repository`
pack-bitmap-write: stop depending on `the_repository`
pack-revindex: stop depending on `the_repository`
pack-check: stop depending on `the_repository`
environment: move access to "core.bigFileThreshold" into repo settings
pack-write: stop depending on `the_repository` and `the_hash_algo`
object: stop depending on `the_repository`
csum-file: stop depending on `the_repository`
|
|
Fix our use of zlib corner cases.
* jk/zlib-inflate-fixes:
unpack_loose_rest(): rewrite return handling for clarity
unpack_loose_rest(): simplify error handling
unpack_loose_rest(): never clean up zstream
unpack_loose_rest(): avoid numeric comparison of zlib status
unpack_loose_header(): avoid numeric comparison of zlib status
git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT
unpack_loose_header(): fix infinite loop on broken zlib input
unpack_loose_header(): report headers without NUL as "bad"
unpack_loose_header(): simplify next_out assignment
loose_object_info(): BUG() on inflating content with unknown type
|
|
The "object-store-ll.h" header has been introduced to keep transitive
header dependencies and compile times at bay. Now that we have created
a new "object-store.c" file though we can easily move the last remaining
additional bit of "object-store.h", the `odb_path_map`, out of the
header.
Do so. As the "object-store.h" header is now equivalent to its low-level
alternative we drop the latter and inline it into the former.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `index_blob_stream()` function is a mere wrapper around
`index_blob_bulk_checkin()`. This has been the case since 568508e7657
(bulk-checkin: replace fast-import based implementation, 2011-10-28),
which has moved the implementation from `index_blob_stream()` (which was
still called `index_stream()`) into `index_bulk_checkin()` (which has
since been renamed to `index_blob_bulk_checkin()`).
Remove the redirection by dropping the wrapper. Move the comment to
`index_blob_bulk_checkin()` to retain its context.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The functions `hash_object_file()`, `write_object_file()` and
`index_fd()` reuse the same set of flags to alter their behaviour. This
not only adds confusion, but given that every function only supports a
subset of the flags it becomes very hard to see which flags can be
passed to what function. Last but not least, this entangles the
implementation of all three function families.
Split up concerns by creating separate flags for each of the function
families.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
While we have the "object-store.h" header, most of the functionality for
object stores is actually hosted in "object-file.c". This makes it hard
to find relevant functions and causes us to mix up concerns.
Split out functions relating to the object store subsystem into a new
"object-store.c" file.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `xmmap()` function is provided by "object-file.c" even though its
functionality has nothing to do with the object file subsystem. Move it
into "wrapper.c", whose header already declares those functions.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `git_open_cloexec()` wrapper function provides the ability to open a
file with `O_CLOEXEC` in a platform-agnostic way. This function is
provided by "object-file.c" even though it is not specific to the object
subsystem at all.
Move the function into "compat/open.c". This file already existed
before this commit, but was only compiled conditionally depending on
whether or not open(3p) may return EINTR. With this change we now
unconditionally compile the object, but wrap `git_open_with_retry()` in
an ifdef.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `safe_create_leading_directories()` function and its relatives are
located in "object-file.c", which is not a good fit as they provide
generic functionality not related to objects at all. Move them into
"path.c", which already hosts `safe_create_dir()` and its relative
`safe_create_dir_in_gitdir()`.
"path.c" is free of `the_repository`, but the moved functions depend on
`the_repository` to read the "core.sharedRepository" config. Adapt the
function signature to accept a repository as argument to fix the issue
and adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `mkdir_in_gitdir()` function is similar to `safe_create_dir()`, but
the former is hosted in "object-file.c" whereas the latter is hosted in
"path.c". The latter code unit makes way more sense though as the logic
has nothing to do with object files in particular.
Move the function into "path.c". While at it, we:
- Rename the function to `safe_create_dir_in_gitdir()` so that the
function names are similar to one another.
- Remove the dependency on `the_repository` by making the callers pass
the repository instead.
Adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When the compiler/linker cannot verify that an assert() invocation is
free of side effects for us (e.g. because the assertion includes some
kind of function call), replace the use of assert() with ASSERT().
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `null_oid()` function returns the object ID that only consists of
zeroes. Naturally, this ID also depends on the hash algorithm used, as
the number of zeroes is different between SHA1 and SHA256. Consequently,
the function returns the hash-algorithm-specific null object ID.
This is currently done by depending on `the_hash_algo`, which implicitly
makes us depend on `the_repository`. Refactor the function to instead
pass in the hash algorithm for which we want to retrieve the null object
ID. Adapt callsites accordingly by passing in `the_repository`, thus
bubbling up the dependency on that global variable by one layer.
There are a couple of trivial exceptions for subsystems that already got
rid of `the_repository`. These subsystems instead use the repository
that is available via the calling context:
- "builtin/grep.c"
- "grep.c"
- "refs/debug.c"
There are also two non-trivial exceptions:
- "diff-no-index.c": Here we know that we may not have a repository
initialized at all, so we cannot rely on `the_repository`. Instead,
we adapt `diff_no_index()` to get a `struct git_hash_algo` as
parameter. The only caller is located in "builtin/diff.c", where we
know to call `repo_set_hash_algo()` in case we're running outside of
a Git repository. Consequently, it is fine to continue passing
`the_repository->hash_algo` even in this case.
- "builtin/ls-files.c": There is an in-flight patch series that drops
`USE_THE_REPOSITORY_VARIABLE` in this file, which causes a semantic
conflict because we use `null_oid()` in `show_submodule()`. The
value is passed to `repo_submodule_init()`, which may use the object
ID to resolve a tree-ish in the superproject from which we want to
read the submodule config. As such, the object ID should refer to an
object in the superproject, and consequently we need to use its hash
algorithm.
This means that we could in theory just not bother about this edge
case at all and just use `the_repository` in "diff-no-index.c". But
doing so would feel misdesigned.
Remove the `USE_THE_REPOSITORY_VARIABLE` preprocessor define in
"hash.c".
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
While we have a "hash.h" header, the actual implementation of the
subsystem is hosted by "object-file.c". This makes it harder than
necessary to find the actual implementation of the hash subsystem and
intermingles the different concerns with one another.
Split out the implementation of hash algorithms into a new, separate
"hash.c" file.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
There are multiple sites in "object-file-convert.c" where we use the
global `the_repository` variable, either explicitly or implicitly by
using `the_hash_algo`. All of these callsites are transitively called
from `convert_object_file()`, which indeed has no repo as input.
Refactor the function so that it receives a repository as a parameter
and pass it through to all internal functions to get rid of the
dependency. Remove the `USE_THE_REPOSITORY_VARIABLE` define.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The "core.bigFileThreshold" setting is stored in a global variable and
populated via `git_default_core_config()`. This may cause issues in
the case where one is handling multiple different repositories in a
single process with different values for that config key, as we may or
may not see the correct value in that case. Furthermore, global state
blocks our path towards libification.
Refactor the code so that we instead store the value in `struct
repo_settings`, where the value is computed as-needed and cached.
Note that this change requires us to adapt one test in t1050 that
verifies that we die when parsing an invalid "core.bigFileThreshold"
value. The exercised Git command doesn't use the value at all, and thus
it won't hit the new code path that parses the value. This is addressed
by using git-hash-object(1) instead, which does read the value.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The path.[ch] API takes an explicit repository parameter passed
throughout the callchain, instead of relying on the_repository
singleton instance.
* ps/path-sans-the-repository:
path: adjust last remaining users of `the_repository`
environment: move access to "core.sharedRepository" into repo settings
environment: move access to "core.hooksPath" into repo settings
repo-settings: introduce function to clear struct
path: drop `git_path()` in favor of `repo_git_path()`
rerere: let `rerere_path()` write paths into a caller-provided buffer
path: drop `git_common_path()` in favor of `repo_common_path()`
worktree: return allocated string from `get_worktree_git_dir()`
path: drop `git_path_buf()` in favor of `repo_git_path_replace()`
path: drop `git_pathdup()` in favor of `repo_git_path()`
path: drop unused `strbuf_git_path()` function
path: refactor `repo_submodule_path()` family of functions
submodule: refactor `submodule_to_gitdir()` to accept a repo
path: refactor `repo_worktree_path()` family of functions
path: refactor `repo_git_path()` family of functions
path: refactor `repo_common_path()` family of functions
|
|
With the preceding refactorings we now only have a couple of implicit
users of `the_repository` left in the "path" subsystem, all of which
depend on global state via `calc_shared_perm()`. Make the dependency on
`the_repository` explicit by passing the repo as a parameter instead and
adjust callers accordingly.
Note that this change bubbles up into a couple of subsystems that were
previously declared as free from `the_repository`. Instead of marking
all of them as `the_repository`-dependent again, we instead use the
repository that is available in the calling context. There are three
exceptions though with "copy.c", "pack-write.c" and "tempfile.c".
Adjusting these would require us to adapt callsites all over the place,
so this is left for a future iteration.
Mark "path.c" as free from `the_repository`.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We have a pattern like:
  if (error1)
          ...handle error 1...
  else if (error2)
          ...handle error 2...
  else
          ...return buf...
  ...free buf and return NULL...
This is a little subtle because it is the return in the success block
that lets us skip the common error handling. Rewrite this instead to
free the buffer in each error path, marking it as NULL, and then all
code paths can use the common return.
This should make the logic a bit easier to follow. It does mean
duplicating the buf cleanup for errors, but it's a single line.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Inflating a loose object is considered successful only if we got
Z_STREAM_END and there were no more bytes. We check both of those
conditions and return success, but then have to check them a second time
to decide which error message to produce.
I.e., we do something like this:
  if (!error_1 && !error_2)
          ...return success...
  if (error_1)
          ...handle error1...
  else if (error_2)
          ...handle error2...
  ...common error handling...
This repetition was the source of a small bug fixed in an earlier commit
(our Z_STREAM_END check was not the same in the two conditionals).
Instead we can chain them all into a single if/else cascade, which
avoids repeating ourselves:
  if (error_1)
          ...handle error1...
  else if (error_2)
          ...handle error2...
  else
          ...return success...
  ...common error handling...
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The unpack_loose_rest() function has funny ownership semantics: we pass
in a z_stream opened by the caller, but then only _sometimes_ close it.
This oddity has developed over time. When the function was originally
split out in 5180cacc20 (Split up unpack_sha1_file() some more,
2005-06-02), it always called inflateEnd() to clean up the stream
(though nowadays it is a git_zstream and we call git_inflate_end()).
But in 7efbff7531 (unpack_sha1_file(): detect corrupt loose object
files., 2007-03-05) we added error code paths which don't close the
stream. This makes some sense, as we'd still look at parts of the stream
struct to decide which error to show (though I am not sure in practice
if inflateEnd() even touches those fields).
This subtlety makes it hard to know when the caller has to clean up the
stream and when it does not. That led to the leak fixed by aa9ef614dc
(object-file: fix memory leak when reading corrupted headers,
2024-08-14).
Let's instead always leave the stream intact, forcing the caller to
clean it up. You might think that would create more work for the
callers, but it actually ends up simplifying them, since they can put
the call to git_inflate_end() in the common cleanup code path.
Two things to note, though:
- The check_stream_oid() function is used as a replacement for
unpack_loose_rest() in read_loose_object() to read blobs. It
inherited the same funny semantics, and we should fix it here, too
(to keep the cleanup in read_loose_object() consistent).
- In read_loose_object() we need a second "out" label, as we can jump
to the existing label before opening the stream at all (and since
the struct is opaque, there is no way to tell if it was initialized or
not, so we must not call git_inflate_end() in that case).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When unpacking the actual content of a loose object file, we insist both
that the status code we got is Z_STREAM_END, and that we consumed all
bytes.
If we didn't, we'll return an error, but the specific error message we
produce depends on which of the two error conditions we saw. So we'll
check both a second time to decide which error to produce. But this
second time, our status code check is loose: it checks for a negative
status value.
This can get confused by zlib codes which are not negative, such as
Z_NEED_DICT. In this case we'd erroneously print nothing at all, when we
should say "corrupt loose object".
Instead, this second check should check explicitly against Z_STREAM_END.
Note that Z_OK is "0", so the existing code also produced no message for
Z_OK. But it's impossible to see that status, since we only break out of
the inflate loop when we stop seeing Z_OK (so a stream which has more
bytes than its object header claims would eventually yield Z_BUF_ERROR).
There's no test here, as it would require a loose object whose zlib
stream returns Z_NEED_DICT in the middle of the object content. I think
that is probably possible, but even our Z_NEED_DICT test in t1006 does
not trigger this, because we hit that error while reading the header. I
found this bug while reviewing all callers of git_inflate() for bugs
similar to the one we saw in unpack_loose_header(). This was the only
other case that did a numeric comparison rather than explicitly checking
for Z_STREAM_END.
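A condensed sketch of the corrected classification (helper name and
error strings are illustrative):

  #include <zlib.h>

  /*
   * Decide which error to report after the inflate loop. Comparing
   * "status < 0" misses non-negative failure codes such as Z_NEED_DICT
   * (which is 2), so compare against Z_STREAM_END explicitly.
   */
  static const char *classify_inflate_result(int status,
                                             unsigned long bytes_left)
  {
          if (status == Z_STREAM_END && !bytes_left)
                  return NULL;                    /* success */
          if (status != Z_STREAM_END)             /* was: status < 0 */
                  return "corrupt loose object";
          return "garbage at end of loose object";
  }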
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When unpacking a loose header, we try to inflate the first 32 bytes.
We'd expect either Z_OK (we filled up the output buffer, but there are
more bytes in the object) or Z_STREAM_END (this is a tiny object whose
header and content fit in the buffer).
We check for that with "if (status < Z_OK)", making the assumption that
all of the errors we'd see have negative values (as Z_OK itself is "0",
and Z_STREAM_END is "1").
But there's at least one case this misses: Z_NEED_DICT is "2". This
isn't something we'd ever expect to see, but if we do see it, we should
consider it an error (since we have no dictionary to load).
Instead, the current code interprets Z_NEED_DICT as success and looks
for the object header's terminating NUL in the bytes we've read. This
will generally be zero bytes if the dictionary is mentioned at the start
of the stream. So we'll fail to find it and complain "the header is too
long" (ULHR_LONG). But really, the problem is that the object is
malformed, and we should return ULHR_BAD.
This is a minor bug, as we consider both cases to be an error. But it
does mean we print the wrong error message. The test case added in the
previous patch triggers this code, so we can just confirm the error
message we see here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When reading a loose object, we first try to expand the first 32 bytes
to read the type+size header. This is enough for any of the normal Git
types. But since 46f034483e (sha1_file: support reading from a loose
object of unknown type, 2015-05-03), the caller can also ask us to parse
any unknown names, which can be much longer. In this case we keep
inflating until we find the NUL at the end of the header, or hit
Z_STREAM_END.
But what if zlib can't make forward progress? For example, if the loose
object file is truncated, we'll have no more data to feed it. It will
return Z_BUF_ERROR, and we'll just loop infinitely, calling
git_inflate() over and over but never seeing new bytes nor an
end-of-stream marker.
We can fix this by only looping when we think we can make forward
progress. This will always be Z_OK in this case. In other code we might
also be able to continue on Z_BUF_ERROR, but:
- We will never see Z_BUF_ERROR because the output buffer is full; we
always feed a fresh 32-byte buffer on each call to git_inflate().
- We may see Z_BUF_ERROR if we run out of input. But since we've fed
the whole mmap'd buffer to zlib, if it runs out of input there is
nothing more we can do.
So if we don't see Z_OK (and didn't see the end-of-header NUL, otherwise
we'd have broken out of the loop), then we should stop looping and
return an error.
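Schematically, the fixed loop looks like this (simplified, not the
verbatim code):

  do {
          /* A fresh output buffer every round, so Z_BUF_ERROR can
           * only mean that we ran out of input. */
          stream->next_out = buf;
          stream->avail_out = sizeof(buf) - 1;
          status = git_inflate(stream, 0);
          strbuf_add(header, buf, stream->next_out - buf);
          if (memchr(buf, '\0', stream->next_out - buf))
                  return ULHR_OK;  /* found the end-of-header NUL */
  } while (status == Z_OK);        /* was: looped unconditionally */
  return ULHR_BAD;                 /* no forward progress: bail */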
The test case shows an example where the input is truncated (which gives
us the input Z_BUF_ERROR case above).
Although we do operate on objects we might get from an untrusted remote,
I don't think the security implications of this bug are too great. It
can only trigger if both of these are true:
- You're reading a loose object whose on-disk representation was
written by an attacker. So fetching an object (or receiving a push)
are mostly OK, because even with unpack-objects it is our local,
trusted code that writes out the object file. The exception may be
fetching from an untrusted local repo, or using dumb-http, which
copies objects verbatim. But...
- The only code path which triggers the inflate loop is cat-file's
--allow-unknown-type option. This is unlikely to be called at all
outside of debugging. But I also suspect that objects with
non-standard types (or that are truncated) would not survive the
usual fetch/receive checks in the first place.
So I think it would be quite hard to trick somebody into running the
infinite loop, and we can just fix the bug.
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
If a caller asks us to read the whole loose object header value into a
strbuf (e.g., via "cat-file --allow-unknown-type"), we'll keep reading
until we see a NUL byte marking the end of the header.
If we hit Z_STREAM_END before seeing the NUL, we obviously have to stop.
But we return ULHR_TOO_LONG, which doesn't make any sense. The "too
long" return code is used in the normal, 32-byte limited mode to
indicate that we stopped looking. There is no such thing as "too long"
here, as we'd keep reading forever until we see the end of stream or the
NUL.
Instead, we should return ULHR_BAD. The loose object has no NUL marking
the end of header, so it is malformed. The behavior difference is
slight; in either case we'd consider the object unreadable and refuse to
go further. The only difference is the specific error message we
produce.
There's no test case here, as we'd need to generate a valid zlib stream
without a NUL. That's not something Git will do without writing new
custom code. And in the next patch we'll fix another bug in this area
which will make this easier to do (and we will test it then).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When using OBJECT_INFO_ALLOW_UNKNOWN_TYPE to unpack a header that
doesn't fit into our initial 32-byte buffer, we loop over calls to
git_inflate(), feeding it our buffer via the "next_out" pointer each
time. As the code is written, we reset next_out after each inflate call
(and after reading the output), ready for the next loop.
This isn't wrong, but there are a few advantages to setting up
"next_out" right before each inflate call, rather than after:
1. It drops a few duplicated lines of code.
2. It makes it obvious that we always feed a fresh buffer on each call
(and thus can never see Z_BUF_ERROR due to a lack of output
space).
3. After we exit the loop, we'll leave stream->next_out pointing to
the end of the fetched data (this is how zlib callers find out how
much data is in the buffer). This doesn't matter in practice, since
nobody looks at it again. But it's probably the least-surprising
thing to do, as it matches how next_out is left when the whole
thing fits in the initial 32-byte buffer (and we don't enter the
loop at all).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
After unpack_loose_header() returns, it will have inflated not only the
object header, but possibly some bytes of the object content. When we
call unpack_loose_rest() to extract the actual content, it finds those
extra bytes by skipping past the header's terminating NUL in the buffer.
Like this:
    int bytes = strlen(buffer) + 1;
    n = stream->total_out - bytes;
    ...
    memcpy(buf, (char *) buffer + bytes, n);
This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there
we allow a header of arbitrary size. We put it into a strbuf, but feed
only the final 32-byte chunk we read to unpack_loose_rest(). In that
case stream->total_out may be unexpectedly large, and thus our "n" will be
large, causing an out-of-bounds read (we do check it against our
allocated buffer size, which prevents an out-of-bounds write).
Probably this could be made to work by feeding the strbuf to
unpack_loose_rest(), along with adjusting some types (e.g., "bytes"
would need to be a size_t, since it is no longer operating on a 32-byte
buffer).
But I don't think it's possible to actually trigger this in practice.
The only caller who passes ALLOW_UNKNOWN_TYPE is cat-file, which only
allows it with the "-t" and "-s" options (neither of which access the
content). There is one way you can _almost_ trigger it: the oid compat
routines (i.e., accessing sha1 via sha256 names and vice versa) will
convert objects on the fly (which requires access to the content) using
the same flags that were passed in. So in theory this:
    t='some very large type field that causes an extra inflate call'
    sha1_oid=$(git hash-object -w -t "$t" file)
    sha256_oid=$(git rev-parse --output-object-format=sha256 $sha1_oid)
    git cat-file --allow-unknown-type -s $sha256_oid
would try to access the content. But it doesn't work, because using
compat objects requires an entry in the .git/objects/loose-object-idx
file, and we don't generate such an entry for non-standard types (see
the "compat" section of write_object_file_literally()).
If we use "t=blob" instead, then it does access the compat object, but
it doesn't trigger the problem (because "blob" is a standard short type
name, and it fits in the initial 32-byte buffer).
So given that this is almost a memory error bug, I think it's worth
addressing. But because we can't actually trigger the situation, I'm
hesitant to try a fix that we can't run. Instead let's document the
restriction and protect ourselves from the out-of-bounds read by adding
a BUG() check.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Remove `git_path_buf()` in favor of `repo_git_path_replace()`. The
latter does essentially the same, with the only exception that it does
not rely on `the_repository` but takes the repo as separate parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Remove `git_pathdup()` in favor of `repo_git_path()`. The latter does
essentially the same, with the only exception that it does not rely on
`the_repository` but takes the repo as separate parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The API around choosing to use unsafe variant of SHA-1
implementation has been updated in an attempt to make it harder to
abuse.
* tb/unsafe-hash-cleanup:
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
|
|
Adapt callers to use generic hash context helpers instead of using the
hash algorithm to update them. This makes the callsites easier to reason
about and removes the possibility that the wrong hash algorithm is used
to update the hash context's state. And as a nice side effect this also
gets rid of a bunch of users of `the_hash_algo`.
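Schematically, call sites change along these lines (illustrative
before/after, using the helpers introduced in the preceding commit):

    /* before: the caller must pair the context with the right algo */
    the_hash_algo->init_fn(&ctx);
    the_hash_algo->update_fn(&ctx, buf, len);
    the_hash_algo->final_fn(hash, &ctx);

    /* after: the context remembers its own algorithm */
    the_hash_algo->init_fn(&ctx);
    git_hash_update(&ctx, buf, len);
    git_hash_final(hash, &ctx);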
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The hash context is supposed to be updated via the `git_hash_algo`
structure, which contains a list of function pointers to update, clone
or finalize a hashing context. This requires the callers to track which
algorithm was used to initialize the context and continue to use the
exact same algorithm. If they fail to do that correctly, it can happen
that we start to access context state of one hash algorithm with
functions of a different hash algorithm. The result would typically be a
segfault, as could be seen e.g. in the patches part of 98422943f0 (Merge
branch 'ps/weak-sha1-for-tail-sum-fix', 2025-01-01).
The situation was significantly improved starting with 04292c3796
(hash.h: drop unsafe_ function variants, 2025-01-23) and its parent
commits. These refactorings ensure that it is not possible to mix up
safe and unsafe variants of the same hash algorithm anymore. But in
theory, it is still possible to mix up different hash algorithms with
each other, even though this is a lot less likely to happen.
But still, we can do better: instead of asking the caller to remember
the hash algorithm used to initialize a context, we can instead make the
context itself remember which algorithm it has been initialized with. If
we do so, callers can use a set of generic helpers to update the context
and don't need to be aware of the hash algorithm at all anymore.
Adapt the context initialization functions to store the hash algorithm
in the hashing context and introduce these generic helpers. Callers will
be adapted in the subsequent commit.
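A condensed sketch of the idea (abridged; see hash.h for the real
definitions):

    struct git_hash_ctx {
        const struct git_hash_algo *algop; /* recorded by init_fn */
        union {
            git_SHA_CTX sha1;
            git_SHA256_CTX sha256;
        } state;
    };

    static inline void git_hash_update(struct git_hash_ctx *ctx,
                                       const void *in, size_t len)
    {
        ctx->algop->update_fn(ctx, in, len);
    }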
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We generally avoid using `typedef` in the Git codebase. One exception
though is the `git_hash_ctx`, likely because it used to be a union
rather than a struct until the preceding commit refactored it. But now
that it is a normal `struct` there isn't really a need for a typedef
anymore.
Drop the typedef and adapt all callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `git_hash_ctx` is a union containing the different hash-specific
states for SHA1, its unsafe variant as well as SHA256. We know that only
one of these states will ever be in use at the same time because hash
contexts cannot be used for multiple different hashes at the same point
in time.
We're about to extend the structure though to keep track of the hash
algorithm used to initialize the context, which is impossible to do
while the context is a union. Refactor it to instead be a structure that
contains the union of context states.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
* tb/unsafe-hash-cleanup:
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
|
|
Now that all callers have been converted from:
    the_hash_algo->unsafe_init_fn();

to

    unsafe_hash_algo(the_hash_algo)->init_fn();
and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.
This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.
In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.
Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:
    if (unsafe) {
        the_hash_algo->unsafe_init_fn(...);
        the_hash_algo->unsafe_update_fn(...);
        the_hash_algo->unsafe_final_fn(...);
    } else {
        the_hash_algo->init_fn(...);
        the_hash_algo->update_fn(...);
        the_hash_algo->final_fn(...);
    }

we can instead write:

    const struct git_hash_algo *algop = the_hash_algo;

    if (unsafe)
        algop = unsafe_hash_algo(algop);

    algop->init_fn(...);
    algop->update_fn(...);
    algop->final_fn(...);
This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.
Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.
Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
unsafe variant of a hash function. All other query functions on the
hash_algos array will continue to return the safe variants of any
function.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
CI jobs gave sporadic failures, which turned out to be caused by the
object finalization code giving an error when it did not have to.
* ps/object-collision-check:
object-file: retry linking file into place when occluding file vanishes
object-file: don't special-case missing source file in collision check
object-file: rename variables in `check_collision()`
object-file: fix race in object collision check
|
|
Prior to 0ad3d65652 (object-file: fix race in object collision check,
2024-12-30), callers could expect that a successful return from
`finalize_object_file()` means that either the file was moved into
place, or the identical bytes were already present. If neither of those
happens, we'd return an error.
Since that commit, if the destination file disappears between our
link(3p) call and the collision check, we'd return success without
actually checking the contents, and without retrying the link. This
solves the common case that the files were indeed the same, but it means
that we may corrupt the repository if they weren't (this implies a hash
collision, but the whole point of this function is protecting against
hash collisions).
We can't be pessimistic and assume they're different; that hurts the
common case that the mentioned commit was trying to fix. But after
seeing that the destination file went away, we can retry linking again.
Adapt the code to do so when we see that the destination file has racily
vanished. This should generally succeed as we have just observed that
the destination file does not exist anymore, except in the very unlikely
event that it gets recreated by another concurrent process again.
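A hedged sketch of the retry (simplified; the error signaling of
`check_collision()` is illustrative only):

    int ret;

    while (link(tmpfile, filename) < 0) {
        if (errno != EEXIST)
            return error_errno(_("unable to write file %s"), filename);
        ret = check_collision(tmpfile, filename);
        if (ret != -ENOENT)
            return ret; /* identical contents, or a real error */
        /* the destination vanished racily; try linking again */
    }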
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In 0ad3d65652 (object-file: fix race in object collision check,
2024-12-30) we have started to ignore ENOENT when opening either the
source or destination file of the collision check. This was done to
handle races more gracefully in case either of the potentially-colliding
files disappears.
The fix is overly broad though: while the destination file may indeed
vanish racily, this shouldn't ever happen for the source file, which is
a temporary object file (either loose or in packfile format) that we
have just created. So if any concurrent process would have removed that
temporary file it would indicate an actual issue.
Stop treating ENOENT specially for the source file so that we always
bubble up this error.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Rename variables used in `check_collision()` to clearly identify which
file is the source and which is the destination. This will make the next
step easier to reason about when we start to treat those files
differently from one another.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
One of the tests in t5616 asserts that git-fetch(1) with `--refetch`
triggers repository maintenance with the correct set of arguments. This
test is flaky and causes us to fail sometimes:
    ++ git -c protocol.version=0 -c gc.autoPackLimit=0 -c maintenance.incremental-repack.auto=1234 -C pc1 fetch --refetch origin
    error: unable to open .git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack: No such file or directory
    fatal: unable to rename temporary file to '.git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack'
    fatal: could not finish pack-objects to repack local links
    fatal: index-pack failed
    error: last command exited with $?=128
The error message is quite confusing as it talks about trying to rename
a temporary packfile. A first hunch would thus be that this packfile
gets written by git-fetch(1), but removed by git-maintenance(1) while it
hasn't yet been finalized, which shouldn't ever happen. And indeed, when
looking closer one notices that the file that is supposedly of temporary
nature does not have the typical `tmp_pack_` prefix.
As it turns out, the "unable to rename temporary file" fatal error is a
red herring and the real error is "unable to open". That error is raised
by `check_collision()`, which is called by `finalize_object_file()` when
moving the new packfile into place. Because t5616 re-fetches objects, we
end up with the exact same pack as we already have in the repository. So
when the concurrent git-maintenance(1) process rewrites the preexisting
pack and unlinks it exactly at the point in time where git-fetch(1)
wants to check the old and new packfiles for equality we will see ENOENT
and thus `check_collision()` returns an error, which gets bubbled up by
`finalize_object_file()` and is then handled by `rename_tmp_packfile()`.
That function does not know about the exact root cause of the error and
instead just claims that the rename has failed.
This race is thus caused by b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26), where we have newly introduced
the collision check.
By definition, two files cannot collide with each other when one of them
has been removed. We can thus trivially fix the issue by ignoring ENOENT
when opening either of the files we're about to check for collision.
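Concretely, the fix amounts to tolerating ENOENT when opening the files
to compare, roughly:

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        if (errno == ENOENT)
            return 0; /* vanished racily; nothing to collide with */
        return error_errno(_("unable to open %s"), path);
    }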
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We define macros with the bytes of the empty trees and blobs for sha1
and sha256. But since e1ccd7e2b1 (sha1_file: only expose empty object
constants through git_hash_algo, 2018-05-02), those are used only for
initializing the git_hash_algo entries. Any other code using the macros
directly would be suspicious, since a hash_algo pointer is the level of
indirection we use to make everything work with both sha1 and sha256.
So let's future proof against code doing the wrong thing by dropping the
macros entirely and just initializing the structs directly.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The cached-object API maps oids to in-memory entries. Once inserted,
these entries should be immutable. Let's return them from the
find_cached_object() call with a const tag to make this clear.
Suggested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The pretend_object_file() function adds to an array mapping oids to
object contents, which are later retrieved with find_cached_object().
We naturally need to store the oid for each entry, since it's the lookup
key.
But find_cached_object() also returns a hard-coded empty_tree object.
There we don't care about its oid field and instead compare against
the_hash_algo->empty_tree. The oid field is left as all-zeroes.
This all works, but it means that the cached_object struct we return
from find_cached_object() may or may not have a valid oid field,
depending on whether it is the hard-coded tree or came from
pretend_object_file().
Nobody looks at the field, so there's no bug. But let's future-proof it
by returning only the object contents themselves, not the oid. We'll
continue to call this "struct cached_object", and the array entry
mapping the key to those contents will be a "cached_object_entry".
This would also let us swap out the array for a better data structure
(like a hashmap) if we chose, but there's not much point. The only code
that adds an entry is git-blame, which adds at most a single entry per
process.
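The resulting layout is roughly this (abridged sketch):

    struct cached_object {
        enum object_type type;
        const void *buf;
        unsigned long size;
    };

    struct cached_object_entry {
        struct object_id oid;       /* the lookup key */
        struct cached_object value; /* what find_cached_object() returns */
    };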
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The fake empty_tree struct is a static global, but the only code that
looks at it is find_cached_object(). The struct itself is a little odd,
with an invalid "oid" field that is handled specially by that function.
Since it's really just an implementation detail, let's move it to a
static within the function. That future-proofs against other code trying
to use it and seeing the weird oid value.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We treat the empty tree specially, providing an in-memory "cached" copy,
which allows you to diff against it even if the object doesn't exist in
the repository. This is implemented as part of the larger cached_object
subsystem, but we use a stand-alone empty_tree struct.
We initialize the oid of that struct using EMPTY_TREE_SHA1_BIN_LITERAL.
At first glance, that seems like a bug; how could this ever work for
sha256 repositories?
The answer is that we never look at the oid field! The oid field is used
to look up entries added by pretend_object_file() to the cached_objects
array. But for our stand-alone entry, we look for it independently using
the_hash_algo->empty_tree, which will point to the correct algo struct
for the repository.
This happened in 62ba93eaa9 (sha1_file: convert cached object code to
struct object_id, 2018-05-02), which even mentions that this field is
never used. Let's reduce confusion for anybody reading this code by
replacing the sha1 initializer with a comment. The resulting field will
be all-zeroes, so any violation of our assumption that the oid field is
not used will break equally for sha1 and sha256.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We hard-code a few well-known hash values for empty trees and blobs in
both sha1 and sha256 formats. We do so with string literals like this:
    #define EMPTY_TREE_SHA256_BIN_LITERAL \
        "\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
        "\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5" \
        "\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc" \
        "\x53\x21"
and then use it to initialize the hash field of an object_id struct.
That hash field is exactly 32 bytes long (the size we need for sha256).
But the string literal above is actually 33 bytes long due to the NUL
terminator. This is legal in C, and the NUL is ignored.
Side note on legality: in general excess initializer elements are
forbidden, and gcc will warn on both of these:
    char foo[3] = { 'h', 'u', 'g', 'e' };
    char bar[3] = "VeryLongString";
I couldn't find specific language in the standard allowing
initialization from a string literal where _just_ the NUL is ignored,
but C99 section 6.7.8 (Initialization), paragraph 32 shows this exact
case as "example 8".
However, the upcoming gcc 15 will start warning for this case (when
compiled with -Wextra via DEVELOPER=1):
    CC object-file.o
    object-file.c:52:9: warning: initializer-string for array of ‘unsigned char’ is too long [-Wunterminated-string-initialization]
       52 |         "\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    object-file.c:79:17: note: in expansion of macro ‘EMPTY_TREE_SHA256_BIN_LITERAL’
which is understandable. Even though this is not a bug for us, since we
do not care about the NUL terminator (and are just using the literal as
a convenient format), it would be easy to accidentally create an array
that was mistakenly unterminated.
We can avoid this warning by switching the initializer to an actual
array of unsigned values. That arguably demonstrates our intent more
clearly anyway.
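The replacement initializer spells out the same 32 bytes as numbers,
along these lines:

    #define EMPTY_TREE_SHA256_BIN_LITERAL { \
        0x6e, 0xf1, 0x9b, 0x41, 0x22, 0x5c, 0x53, 0x69, 0xf1, 0xc1, \
        0x04, 0xd4, 0x5d, 0x8d, 0x85, 0xef, 0xa9, 0xb0, 0x57, 0xb5, \
        0x3b, 0x14, 0xb4, 0xb9, 0xb9, 0x39, 0xdd, 0x74, 0xde, 0xcc, \
        0x53, 0x21 \
    }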
Reported-by: Sam James <sam@gentoo.org>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The checksum at the tail of files is now computed without
collision detection protection. This is safe, as the consumer of
the information checks for hash collisions independently to
protect itself from replay attacks.
* tb/weak-sha1-for-tail-sum:
csum-file.c: use unsafe SHA-1 implementation when available
Makefile: allow specifying a SHA-1 for non-cryptographic uses
hash.h: scaffolding for _unsafe hashing variants
sha1: do not redefine `platform_SHA_CTX` and friends
pack-objects: use finalize_object_file() to rename pack/idx/etc
finalize_object_file(): implement collision check
finalize_object_file(): refactor unlink_or_warn() placement
finalize_object_file(): check for name collision before renaming
|
|
Git's default SHA-1 implementation is collision-detecting, which hardens
us against known SHA-1 attacks against Git objects. This makes Git
object writes safer at the expense of some speed when hashing through
the collision-detecting implementation, which is slower than
non-collision detecting alternatives.
Prepare for loading a separate "unsafe" SHA-1 implementation that can be
used for non-cryptographic purposes, like computing the checksum of
files that use the hashwrite() API.
This commit does not actually introduce any new compile-time knobs to
control which implementation is used as the unsafe SHA-1 variant, but
does add scaffolding so that the "git_hash_algo" structure has five new
function pointers which are "unsafe" variants of the five existing
hashing-related function pointers:
- git_hash_init_fn unsafe_init_fn
- git_hash_clone_fn unsafe_clone_fn
- git_hash_update_fn unsafe_update_fn
- git_hash_final_fn unsafe_final_fn
- git_hash_final_oid_fn unsafe_final_oid_fn
The following commit will introduce compile-time knobs to specify which
SHA-1 implementation is used for non-cryptographic uses.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We've had "FIXME!!! Collision check here ?" in finalize_object_file()
since aac1794132 (Improve sha1 object file writing., 2005-05-03). That
is, when we try to write a file with the same name, we assume the
on-disk contents are the same and blindly throw away the new copy.
One of the reasons we never implemented this is because the files it
moves are all named after the cryptographic hash of their contents
(either loose objects, or packs which have their hash in the name these
days). So we are unlikely to see such a collision by accident. And even
though there are weaknesses in sha1, we assume they are mitigated by our
use of sha1dc.
So while it's a theoretical concern now, it hasn't been a priority.
However, if we start using weaker hashes for pack checksums and names,
this will become a practical concern. So in preparation, let's actually
implement a byte-for-byte collision check.
The new check will cause the write of new differing content to be a
failure, rather than a silent noop, and we'll retain the temporary file
on disk. If there's no collision present, we'll clean up the temporary
file as usual after either rename()-ing or link()-ing it into place.
Note that this may cause some extra computation when the files are in
fact identical, but this should happen rarely.
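At its core the check is a chunked byte-for-byte comparison, something
like this hedged sketch:

    char buf_a[8192], buf_b[8192];

    for (;;) {
        ssize_t sz_a = read_in_full(fd_a, buf_a, sizeof(buf_a));
        ssize_t sz_b = read_in_full(fd_b, buf_b, sizeof(buf_b));

        if (sz_a < 0 || sz_b < 0)
            return error(_("unable to read collision candidates"));
        if (sz_a != sz_b || memcmp(buf_a, buf_b, sz_a))
            return error(_("files '%s' and '%s' differ in contents"),
                         src, dst);
        if (!sz_a)
            return 0; /* both at EOF with identical contents */
    }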
Loose objects are exempt from this check, and the collision check may be
skipped by calling the _flags variant of this function with the
FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons:
- We don't treat the hash of the loose object file's contents as a
checksum, since the same loose object can be stored using different
bytes on disk (e.g., when adjusting core.compression, using a
different version of zlib, etc.).
This is fundamentally different from cases where
finalize_object_file() is operating over a file which uses the hash
value as a checksum of the contents. In other words, a pair of
identical loose objects can be stored using different bytes on disk,
and that should not be treated as a collision.
- We already use the path of the loose object as its hash value /
object name, so checking for collisions at the content level doesn't
add anything.
Adding a content-level collision check would have to happen at a
higher level than in finalize_object_file(), since (avoiding race
conditions) writing an object loose which already exists in the
repository will prevent us from even reaching finalize_object_file()
via the object freshening code.
There is a collision check in index-pack via its `check_collision()`
function, but there isn't an analogous function in unpack-objects,
which just feeds the result to write_object_file().
So skipping the collision check here does not change for better or
worse the hardness of loose object writes.
As a small note related to the latter bullet point above, we must teach
the tmp-objdir routines to similarly skip the content-level collision
checks when calling migrate_one() on a loose object file, which we do by
setting the FOF_SKIP_COLLISION_CHECK bit when we are inside of a loose
object shard.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
As soon as we've tried to link() a temporary object into place, we then
unlink() the tempfile immediately, whether we were successful or not.
For the success case, this is because we no longer need the old file
(it's now linked into place).
For the error case, there are two outcomes. Either we got EEXIST, in
which case we consider the collision to be a noop. Or we got a system
error, in which case we are just cleaning up after ourselves.
Using a single line for all of these cases has some problems:
- in the error case, our unlink() may clobber errno, which we use in
the error message
- for the collision case, there's a FIXME that indicates we should do
a collision check. In preparation for implementing that, we'll need
to actually hold on to the file.
Split these three cases into their own calls to unlink_or_warn(). This
is more verbose, but lets us do the right thing in each case.
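Schematically, each case now gets its own cleanup (illustrative):

    ret = link(tmpfile, filename) ? errno : 0;

    if (!ret) {
        /* success: the object is linked into place */
        unlink_or_warn(tmpfile);
    } else if (ret == EEXIST) {
        /*
         * collision: a noop for now, but the tempfile must survive
         * until the upcoming collision check has run
         */
        unlink_or_warn(tmpfile);
    } else {
        /* report first, so that unlink() cannot clobber errno */
        error_errno(_("unable to write file %s"), filename);
        unlink_or_warn(tmpfile);
    }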
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We prefer link()/unlink() to rename() for object files, with the idea
that we should prefer the data that is already on disk to what is
incoming. But we may fall back to rename() if the user has configured us
to do so, or if the filesystem seems not to support cross-directory
links. This loses the "prefer what is on disk" property.
We can mitigate this somewhat by trying to stat() the destination
filename before doing the rename. This is racy, since the object could
be created between the stat() and rename() calls. But in practice it is
expanding the definition of "what is already on disk" to be the point
that the function is called. That is enough to deal with any potential
attacks where an attacker is trying to collide hashes with what's
already in the repository.
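A sketch of the mitigation (simplified):

    struct stat st;

    if (!stat(filename, &st))
        ret = EEXIST; /* prefer what is on disk at the time of the call */
    else if (rename(tmpfile, filename))
        ret = errno;
    else
        ret = 0;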
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Typofix.
* ak/typofix-2.46-maint:
upload-pack: fix a typo
sideband: fix a typo
setup: fix a typo
run-command: fix a typo
revision: fix a typo
refs: fix typos
rebase: fix a typo
read-cache-ll: fix a typo
pretty: fix a typo
object-file: fix a typo
merge-ort: fix typos
merge-ll: fix a typo
http: fix a typo
gpg-interface: fix a typo
git-p4: fix typos
git-instaweb: fix a typo
fsmonitor-settings: fix a typo
diffcore-rename: fix typos
config.mak.dev: fix a typo
|
|
Fix a typo in comments.
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `odb_mkstemp()` and `odb_pack_keep()` functions are quite clearly
tied to the object store, but regardless of that they are located in
"environment.c". Move them over, which also helps to get rid of
dependencies on `the_repository` in the environment subsystem.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `get_object_directory()` function retrieves the path to the object
directory for `the_repository`. Make it accept a `struct repository`
such that it can work on arbitrary repositories and make it part of the
repository subsystem. This reduces our reliance on `the_repository` and
clarifies scope.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
More leak fixes.
* ps/leakfixes-part-4: (22 commits)
builtin/diff: free symmetric diff members
diff: free state populated via options
builtin/log: fix leak when showing converted blob contents
userdiff: fix leaking memory for configured diff drivers
builtin/format-patch: fix various trivial memory leaks
diff: fix leak when parsing invalid ignore regex option
unpack-trees: clear index when not propagating it
sequencer: release todo list on error paths
merge-ort: unconditionally release attributes index
builtin/fast-export: plug leaking tag names
builtin/fast-export: fix leaking diff options
builtin/fast-import: plug trivial memory leaks
builtin/notes: fix leaking `struct notes_tree` when merging notes
builtin/rebase: fix leaking `commit.gpgsign` value
config: fix leaking comment character config
submodule-config: fix leaking name entry when traversing submodules
read-cache: fix leaking hashfile when writing index fails
bulk-checkin: fix leaking state TODO
object-name: fix leaking symlink paths in object context
object-file: fix memory leak when reading corrupted headers
...
|
|
When reading corrupt object headers in `read_loose_object()`, we bail
out immediately. This causes a memory leak though because we would have
already initialized the zstream in `unpack_loose_header()`, and it is
the caller's responsibility to finish the zstream even on error. While
this feels weird, other callsites do it correctly already.
Fix this leak by ending the zstream even on errors. We may want to
revisit this interface in the future such that the callee handles this
for us already when there was an error.
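The fix boils down to one extra call on the error path, roughly
(signature abridged):

    if (unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr),
                            NULL) != ULHR_OK) {
        git_inflate_end(&stream); /* caller owns the zstream, even on error */
        return error(_("unable to unpack header of %s"), path);
    }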
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The "fsck_error" callback is designed to report the objects-related
error messages. It accepts two parameter "oid" and "object_type" which
is not generic. In order to provide a unified callback which can report
either objects or refs, remove the objects-related parameters and add
the generic parameter "void *fsck_report".
Create a new "fsck_object_report" structure which incorporates the
removed parameters "oid" and "object_type". Then change the
corresponding references to adapt to new "fsck_error" callback.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Leakfix.
* ew/object-convert-leakfix:
object-file: fix leak on conversion failure
|
|
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help
transition the codebase to rely less on the availability of the
singleton the_repository instance.
* ps/use-the-repository:
hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`
t/helper: remove dependency on `the_repository` in "proc-receive"
t/helper: fix segfault in "oid-array" command without repository
t/helper: use correct object hash in partial-clone helper
compat/fsmonitor: fix socket path in networked SHA256 repos
replace-object: use hash algorithm from passed-in repository
protocol-caps: use hash algorithm from passed-in repository
oidset: pass hash algorithm when parsing file
http-fetch: don't crash when parsing packfile without a repo
hash-ll: merge with "hash.h"
refs: avoid include cycle with "repository.h"
global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
hash: require hash algorithm in `empty_tree_oid_hex()`
hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
hash: make `is_null_oid()` independent of `the_repository`
hash: convert `oidcmp()` and `oideq()` to compare whole hash
global: ensure that object IDs are always padded
hash: require hash algorithm in `oidread()` and `oidclr()`
hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
|
|
I'm not sure exactly how to trigger the leak, but it seems fairly
obvious that the `content' buffer should be freed even if
convert_object_file() fails. Noticed while working in this area
on unrelated things.
Signed-off-by: Eric Wong <e@80x24.org>
Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Building with "-Werror -Wwrite-strings" is now supported.
* ps/no-writable-strings: (27 commits)
config.mak.dev: enable `-Wwrite-strings` warning
builtin/merge: always store allocated strings in `pull_twohead`
builtin/rebase: always store allocated string in `options.strategy`
builtin/rebase: do not assign default backend to non-constant field
imap-send: fix leaking memory in `imap_server_conf`
imap-send: drop global `imap_server_conf` variable
mailmap: always store allocated strings in mailmap blob
revision: always store allocated strings in output encoding
remote-curl: avoid assigning string constant to non-const variable
send-pack: always allocate receive status
parse-options: cast long name for OPTION_ALIAS
http: do not assign string constant to non-const field
compat/win32: fix const-correctness with string constants
pretty: add casts for decoration option pointers
object-file: make `buf` parameter of `index_mem()` a constant
object-file: mark cached object buffers as const
ident: add casts for fallback name and GECOS
entry: refactor how we remove items for delayed checkouts
line-log: always allocate the output prefix
line-log: stop assigning string constant to file parent buffer
...
|
|
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.
It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.
Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from gaining new dependencies on it in future
changes, be they explicit or implicit.
For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).
Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "builtin.h" to keep the
required changes at least a little bit more contained.
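The pattern itself is small; a sketch of the guard and of a code unit
opting in:

    /* repository.h (sketch) */
    #ifdef USE_THE_REPOSITORY_VARIABLE
    extern struct repository *the_repository;
    #endif

    /* a code unit that still needs the global */
    #define USE_THE_REPOSITORY_VARIABLE
    #include "git-compat-util.h"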
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `empty_tree_oid_hex()` function uses `the_repository` to derive the
hash function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
While at it, remove the unused `empty_blob_oid_hex()` function.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `oidcmp()` and `oideq()` functions only compare the prefix length as
specified by the given hash algorithm. This mandates that the object IDs
have a valid hash algorithm set, or otherwise we wouldn't be able to
figure out that prefix. As we do not have a hash algorithm in many
cases, for example when handling null object IDs, this assumption cannot
always be fulfilled. We thus have a fallback in place that instead uses
`the_repository` to derive the hash function. This implicit dependency
is hidden away from callers and can be quite surprising, especially in
contexts where there may be no repository.
In theory, we can adapt those functions to always memcmp(3P) the whole
length of their hash arrays. But there exist a couple of sites where we
populate `struct object_id`s such that only the prefix of its hash that
is actually used by the hash algorithm is populated. The remaining bytes
are left uninitialized. The fact that those bytes are uninitialized also
leads to warnings under Valgrind in some places where we copy those
bytes.
Refactor callsites where we populate object IDs to always initialize all
bytes. This also allows us to get rid of `oidcpy_with_padding()`, for
one because the input is now fully initialized, and because `oidcpy()`
will now always copy the whole hash array.
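With fully padded object IDs, the comparison helpers no longer need any
algorithm information; conceptually:

    static inline int oideq(const struct object_id *a,
                            const struct object_id *b)
    {
        /* compare the whole fixed-size array, not an algo-sized prefix */
        return !memcmp(a->hash, b->hash, GIT_MAX_RAWSZ);
    }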
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The `buf` parameter of `index_mem()` is a non-constant string. This will
break once we enable `-Wwrite-strings` because we also pass constants
from at least one callsite.
Adapt the parameter to be a constant. As we cannot free the buffer
without casting now, this also requires us to move the lifetime of the
nested buffer around.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The buffers of cached objects are never modified, but are still stored
as a non-constant pointer. This will cause a compiler warning once we
enable the `-Wwrite-strings` compiler warning as we assign an empty
constant string when initializing the static `empty_tree` cached object.
Convert the field to be constant. This requires us to shuffle around
the code a bit because we memcpy(3P) into the allocated buffer in
`pretend_object_file()`. This is easily fixed though by allocating the
buffer into a temporary variable first.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In `resolve_gitlink_ref()` we implicitly rely on `the_repository` to
look up the submodule ref store. Now that we can look up submodule ref
stores for arbitrary repositories we can improve this function to
instead accept a repository as parameter for which we want to resolve
the gitlink.
Do so and adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Work to support a repository that works with both SHA-1 and SHA-256
hash algorithms has started.
* eb/hash-transition: (30 commits)
t1016-compatObjectFormat: add tests to verify the conversion between objects
t1006: test oid compatibility with cat-file
t1006: rename sha1 to oid
test-lib: compute the compatibility hash so tests may use it
builtin/ls-tree: let the oid determine the output algorithm
object-file: handle compat objects in check_object_signature
tree-walk: init_tree_desc take an oid to get the hash algorithm
builtin/cat-file: let the oid determine the output algorithm
rev-parse: add an --output-object-format parameter
repository: implement extensions.compatObjectFormat
object-file: update object_info_extended to reencode objects
object-file-convert: convert commits that embed signed tags
object-file-convert: convert commit objects when writing
object-file-convert: don't leak when converting tag objects
object-file-convert: convert tag objects when writing
object-file-convert: add a function to convert trees between algorithms
object: factor out parse_mode out of fast-import and tree-walk into in object.h
cache: add a function to read an OID of a specific algorithm
tag: sign both hashes
commit: export add_header_signature to support handling signatures on tags
...
|
|
Each of these was checked with

    gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}

to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).
...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file. These cases were:
* builtin/credential-cache.c
* builtin/pull.c
* builtin/send-pack.c
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Update check_object_signature to find the hash algorithm the existing
signature uses, and to use the same hash algorithm when recomputing it
to check the signature is valid.
This will be useful when teaching git ls-tree to display objects
encoded with the compat hash algorithm.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
oid_object_info_extended is updated to detect an oid encoding that
does not match the current repository, use repo_oid_to_algop to find
the corresponding oid in the current repository and to return the data
for the oid.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
To create the proper signatures for commit objects both versions of
the commit object need to be generated and signed. After that it is
a waste to throw away the work of generating the compatibility hash
so update write_object_file_flags to take a compatibility hash input
parameter that it can use to skip the work of generating the
compatibility hash.
Update the places that don't generate the compatibility hash to
pass NULL so it is easy to tell write_object_file_flags should
not attempt to use their compatibility hash.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
To implement SHA1 compatibility on SHA256 repositories the loose
object map needs to be updated whenever a loose object is written.
Updating the loose object map this way allows git to support
the old hash algorithm in constant time.
The functions write_loose_object, and stream_loose_object are
the only two functions that write to the loose object store.
Update stream_loose_object to compute the compatibility hash, update
the loose object, and then call repo_add_loose_object_map to update
the loose object map.
Update write_object_file_flags to convert the object into
its compatibility encoding, hash the compatibility encoding,
write the object, and then update the loose object map.
Update force_object_loose to lookup the hash of the compatibility
encoding, write the loose object, and then update the loose object
map.
Update write_object_file_literally to convert the object into its
compatibility hash encoding, hash the compatibility encoding, write
the object, and then update the loose object map, when the type string
is a known type. For objects with an unknown type this results in a
partially broken repository, as the objects are not mapped.
The point of write_object_file_literally is to generate a partially
broken repository for testing. For testing skipping writing the loose
object map is much more useful than refusing to write the broken
object at all.
Except for noting that the loose objects are updated before the loose
object map, I have not done any analysis to see how robust this scheme
is in the
event of failure.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
As the code is written today index_bulk_checkin only accepts blobs.
Remove the enum object_type parameter and rename index_bulk_checkin to
index_blob_bulk_checkin, index_stream to index_blob_stream,
deflate_to_pack to deflate_blob_to_pack, stream_to_pack to
stream_blob_to_pack, to make this explicit.
Not supporting commits, tags, or trees has no downside, as they are not
currently supported, and commits, tags, and trees, being smaller by
design, do not have the problem that index_bulk_checkin was built to
solve.
Before we start adding code to support the hash function transition
supporting additional objects types in index_bulk_checkin has no real
additional cost, just an extra function parameter to know what the
object type is. Once we begin the hash function transition this is not
the case.
The hash function transition document specifies that a repository with
compatObjectFormat enabled will compute and store both the SHA-1 and
SHA-256 hash of every object in the repository.
What makes this a challenge is that it is not just an additional hash
over the same object. Instead the hash function transition document
specifies that the compatibility hash (specified with
compatObjectFormat) be computed over the equivalent object that another
git repository whose storage hash (specified with objectFormat) would
store. When comparing equivalent repositories built with different
storage hash functions, the oids embedded in objects used to refer to
other objects differ and the location of signatures within objects
differ.
As blob objects have neither oids referring to other objects nor stored
signatures their storage hash and their compatibility hash are computed
over the same object.
The other kinds of objects: trees, commits, and tags, all store oids
referring to other objects. Signatures are stored in commit and tag
objects. As oids and the tags to store signatures are not the same size
in repositories built with different storage hashes, the sizes of the
equivalent objects are also different.
A version of index_bulk_checkin that supports more than just blobs when
computing both the SHA-1 and the SHA-256 of every object added would
need a different, and more expensive structure. The structure is more
expensive because it would be required to temporarily buffer the
equivalent object the compatibility hash needs to be computed over.
A temporary object is needed, because before a hash over an object can
be computed, its object header needs to be computed. One of the members of
the object header is the entire size of the object. To know the size of
an equivalent object an entire pass over the original object needs to be
made, as trees, commits, and tags are composed of a variable number of
variable sized pieces. Unfortunately there is no formula to compute the
size of an equivalent object from just the size of the original object.
Avoid all of those future complications by limiting index_bulk_checkin
to only work on blobs.
Inspired-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Mark-up unused parameters in the code so that we can eventually
enable -Wunused-parameter by default.
* jk/unused-parameter:
t/helper: mark unused callback void data parameters
tag: mark unused parameters in each_tag_name_fn callbacks
rev-parse: mark unused parameter in for_each_abbrev callback
replace: mark unused parameter in each_mergetag_fn callback
replace: mark unused parameter in ref callback
merge-tree: mark unused parameter in traverse callback
fsck: mark unused parameters in various fsck callbacks
revisions: drop unused "opt" parameter in "tweak" callbacks
count-objects: mark unused parameter in alternates callback
am: mark unused keep_cr parameters
http-push: mark unused parameter in xml callback
http: mark unused parameters in curl callbacks
do_for_each_ref_helper(): mark unused repository parameter
test-ref-store: drop unimplemented reflog-expire command
|
|
There are a few callback functions which are used with the fsck code,
but it's natural that not all callbacks need all parameters. For
reporting, even something as obvious as "the oid of the object which had
a problem" is not always used, as some callers are only checking a
single object in the first place. And for both reporting and walking,
things like void data pointers and the fsck_options aren't always
necessary.
But since each such parameter is used by _some_ callback, we have to
keep them in the interface. Mark the unused ones in specific callbacks
to avoid triggering -Wunused-parameter.
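The marking follows the usual UNUSED pattern, e.g. (an illustrative
callback, not one lifted from fsck.c):

    static int count_cb(const struct object_id *oid UNUSED, void *data)
    {
        int *count = data;

        (*count)++;
        return 0;
    }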
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for
dynamic array allocation. Moving these macros to git-compat-util.h with
the other alloc macros focuses alloc.[ch] to allocation for Git objects
and additionally allows us to remove inclusions to alloc.h from files
that solely used the above macros.
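For reference, the usage pattern of these macros is unchanged by the
move (hypothetical list type):

    struct int_list {
        int *items;
        size_t nr, alloc;
    };

    static void int_list_push(struct int_list *list, int value)
    {
        ALLOC_GROW(list->items, list->nr + 1, list->alloc);
        list->items[list->nr++] = value;
    }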
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This also made it clear that several .c files that depended upon path.h
were missing a #include for it; add the missing includes while at it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
More header clean-up.
* en/header-split-cache-h-part-2: (22 commits)
reftable: ensure git-compat-util.h is the first (indirect) include
diff.h: reduce unnecessary includes
object-store.h: reduce unnecessary includes
commit.h: reduce unnecessary includes
fsmonitor: reduce includes of cache.h
cache.h: remove unnecessary headers
treewide: remove cache.h inclusion due to previous changes
cache,tree: move basic name compare functions from read-cache to tree
cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
hash-ll.h: split out of hash.h to remove dependency on repository.h
tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
dir.h: move DTYPE defines from cache.h
versioncmp.h: move declarations for versioncmp.c functions from cache.h
ws.h: move declarations for ws.c functions from cache.h
match-trees.h: move declarations for match-trees.c functions from cache.h
pkt-line.h: move declarations for pkt-line.c functions from cache.h
base85.h: move declarations for base85.c functions from cache.h
copy.h: move declarations for copy.c functions from cache.h
server-info.h: move declarations for server-info.c functions from cache.h
packfile.h: move pack_window and pack_entry from cache.h
...
|
|
Geometric repacking ("git repack --geometric=<n>") in a repository
that borrows from an alternate object database had various corner
case bugs, which have been corrected.
* ps/fix-geom-repack-with-alternates:
repack: disable writing bitmaps when doing a local repack
repack: honor `-l` when calculating pack geometry
t/helper: allow chmtime to print verbosely without modifying mtime
pack-objects: extend test coverage of `--stdin-packs` with alternates
pack-objects: fix error when same packfile is included and excluded
pack-objects: fix error when packing same pack twice
pack-objects: split out `--stdin-packs` tests into separate file
repack: fix generating multi-pack-index with only non-local packs
repack: fix trying to use preferred pack in alternates
midx: fix segfault with no packs and invalid preferred pack
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
In order to write a bitmap, we need to have full coverage of all objects
that are about to be packed. In the traditional non-multi-pack-index
world this meant we need to do a full repack of all objects into a
single packfile. But in the new multi-pack-index world we can get away
with writing bitmaps when we have multiple packfiles as long as the
multi-pack-index covers all objects.
This is not always the case though. When asked to perform a repack of
local objects, only, then we cannot guarantee to have full coverage of
all objects regardless of whether we do a full repack or a repack with a
multi-pack-index. The end result is that writing the bitmap will fail in
both worlds:
    $ git multi-pack-index write --stdin-packs --bitmap <packfiles
    warning: Failed to write bitmap index. Packfile doesn't have full closure (object 1529341d78cf45377407369acb0f4ff2b5cdae42 is missing)
    error: could not write multi-pack bitmap
Now there are two different ways to fix this. The first one would be to
amend git-multi-pack-index(1) to disable writing bitmaps when we notice
that we don't have full object coverage. But that would be a bad fix,
for two reasons:
- We don't have enough information in git-multi-pack-index(1) in
order to tell whether the local repository _should_ have full
coverage. Because even when connected to an alternate object
directory, it may be the case that we still have all objects
around in the main object database.
- git-multi-pack-index(1) is quite a low-level tool. Automatically
disabling functionality that it was asked to provide does not feel
like the right thing to do.
We can easily fix it at a higher level in git-repack(1), though. When
asked to only include local objects via `-l` while connected to an
alternate object directory, we override the user's request and disable
writing bitmaps with a warning. This is similar to what we do in
git-pack-objects(1), where we also disable writing bitmaps in case we
omit an object from the pack.
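As an illustration, the override in git-repack(1) could look roughly
like this (a sketch; the `has_alternates()` helper is a hypothetical
stand-in for the actual check):

    /* sketch: in builtin/repack.c, before writing bitmaps */
    if (write_bitmaps > 0 && po_args.local &&
        has_alternates(the_repository)) {
            /*
             * Non-local objects may be omitted from the pack, so
             * we cannot guarantee full object closure for bitmaps.
             */
            warning(_("disabling bitmap writing, as some objects are not being packed"));
            write_bitmaps = 0;
    }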
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Merge branch 'ab/remove-implicit-use-of-the-repository' into en/header-split-cache-h
* ab/remove-implicit-use-of-the-repository:
libs: use "struct repository *" argument, not "the_repository"
post-cocci: adjust comments for recent repo_* migration
cocci: apply the "revision.h" part of "the_repository.pending"
cocci: apply the "rerere.h" part of "the_repository.pending"
cocci: apply the "refs.h" part of "the_repository.pending"
cocci: apply the "promisor-remote.h" part of "the_repository.pending"
cocci: apply the "packfile.h" part of "the_repository.pending"
cocci: apply the "pretty.h" part of "the_repository.pending"
cocci: apply the "object-store.h" part of "the_repository.pending"
cocci: apply the "diff.h" part of "the_repository.pending"
cocci: apply the "commit.h" part of "the_repository.pending"
cocci: apply the "commit-reach.h" part of "the_repository.pending"
cocci: apply the "cache.h" part of "the_repository.pending"
cocci: add missing "the_repository" macros to "pending"
cocci: sort "the_repository" rules by header
cocci: fix incorrect & verbose "the_repository" rules
cocci: remove dead rule from "the_repository.pending.cocci"
|
|
In preceding commits we changed many calls to macros that were
providing a "the_repository" argument to invoke corresponding repo_*()
function instead. Let's follow up and adjust references to those in
comments, which coccinelle didn't (and inherently can't) catch.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Apply the part of "the_repository.pending.cocci" pertaining to
"object-store.h".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
With the change in the last commit to move several functions to
write-or-die.h, csum-file.h no longer needs to include cache.h.
However, removing that include forces several other C files, which
directly or indirectly depended upon csum-file.h's inclusion of
cache.h, to now be more explicit about their dependencies.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This is another step towards letting us remove the include of cache.h in
strbuf.c. It does mean that we also need to add includes of abspath.h
in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Dozens of files made use of gettext functions, without explicitly
including gettext.h. This made it more difficult to find which files
could remove a dependence on cache.h. Make C files explicitly include
gettext.h if they are using it.
However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an
include of gettext.h, it was left out to avoid conflicting with an
in-flight topic.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
More work towards -Wunused.
* jk/unused-post-2.39-part2: (21 commits)
help: mark unused parameter in git_unknown_cmd_config()
run_processes_parallel: mark unused callback parameters
userformat_want_item(): mark unused parameter
for_each_commit_graft(): mark unused callback parameter
rewrite_parents(): mark unused callback parameter
fetch-pack: mark unused parameter in callback function
notes: mark unused callback parameters
prio-queue: mark unused parameters in comparison functions
for_each_object: mark unused callback parameters
list-objects: mark unused callback parameters
mark unused parameters in signal handlers
run-command: mark error routine parameters as unused
mark "pointless" data pointers in callbacks
ref-filter: mark unused callback parameters
http-backend: mark unused parameters in virtual functions
http-backend: mark argc/argv unused
object-name: mark unused parameters in disambiguate callbacks
serve: mark unused parameters in virtual functions
serve: use repository pointer to get config
ls-refs: drop config caching
...
|
|
The for_each_{loose,packed}_object interface uses callback functions,
but not every callback needs all of the parameters. Mark the unused ones
to satisfy -Wunused-parameter.
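As a sketch of the pattern (an illustrative callback, not one from the
patch itself):

    /* an each_loose_object_fn that only needs its "data" parameter */
    static int count_loose(const struct object_id *oid UNUSED,
                           const char *path UNUSED,
                           void *data)
    {
            int *count = data;
            (*count)++;
            return 0;
    }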
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This allows us to replace includes of cache.h with includes of the much
smaller alloc.h in many places. It does mean that we also need to add
includes of alloc.h in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
"git hash-object" now checks that the resulting object is well
formed with the same code as "git fsck".
* jk/hash-object-fsck:
fsck: do not assume NUL-termination of buffers
hash-object: use fsck for object checks
fsck: provide a function to fsck buffer without object struct
t: use hash-object --literally when created malformed objects
t7030: stop using invalid tag name
t1006: stop using 0-padded timestamps
t1007: modernize malformed object tests
|
|
Since c879daa237 (Make hash-object more robust against malformed
objects, 2011-02-05), we've done some rudimentary checks against objects
we're about to write by running them through our usual parsers for
trees, commits, and tags.
These parsers catch some problems, but they are not nearly as careful as
the fsck functions (which makes sense; the parsers are designed to be
fast and forgiving, bailing only when the input is unintelligible). We
are better off doing the more thorough fsck checks when writing objects.
Doing so at write time is much better than writing garbage only to find
out later (after building more history atop it!) that fsck complains
about it, or hosts with transfer.fsckObjects reject it.
This is obviously going to be a user-visible behavior change, and the
test changes earlier in this series show the scope of the impact. But
I'd argue that this is OK:
- the documentation for hash-object is already vague about which
checks we might do, saying that --literally will allow "any
garbage[...] which might not otherwise pass standard object parsing
or git-fsck checks". So we are already covered under the documented
behavior.
- users don't generally run hash-object anyway. There are a lot of
spots in the tests that needed to be updated because creating
garbage objects is something that Git's tests disproportionately do.
- it's hard to imagine anyone thinking the new behavior is worse. Any
object we reject would be a potential problem down the road for the
user. And if they really want to create garbage, --literally is
already the escape hatch they need.
Note that the change here is actually in index_mem(), which handles the
HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by
"git-replace --edit" to sanity-check the result. Covering that with more
thorough checks likewise seems like a good thing.
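A sketch of what the buffer-level check might look like inside
index_mem() (the error reporter name is hypothetical; details assumed):

    if (flags & HASH_FORMAT_CHECK) {
            struct fsck_options opts = FSCK_OPTIONS_STRICT;

            opts.error_func = hash_format_check_report; /* hypothetical */
            if (fsck_buffer(null_oid(), type, buf, size, &opts))
                    die(_("refusing to create malformed object"));
    }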
Besides being more thorough, there are a few other bonuses:
- we get rid of some questionable stack allocations of object structs.
These don't seem to currently cause any problems in practice, but
they subtly violate some of the assumptions made by the rest of the
code (e.g., the "struct commit" we put on the stack and
zero-initialize will not have a proper index from
alloc_comit_index().
- likewise, those parsed object structs are the source of some small
memory leaks
- the resulting messages are much better. For example:
[before]
$ echo 'tree 123' | git hash-object -t commit --stdin
error: bogus commit object 0000000000000000000000000000000000000000
fatal: corrupt commit
[after]
$ echo 'tree 123' | git.compile hash-object -t commit --stdin
error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
fatal: refusing to create malformed object
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Commit b25562e63f (object-file: inline calls to read_object(),
2023-01-07) accidentally indented a conditional block with spaces
instead of a tab.
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The only caller of read_object_file_extended() is the thin wrapper of
repo_read_object_file(). Instead of wrapping, let's just rename the
inner function and let people call it directly. This cleans up the
namespace and reduces confusion.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Our sole caller always passes in "1", so we can just drop the parameter
entirely. Anybody who doesn't want this behavior could easily call
oid_object_info_extended() themselves, as we're just a thin wrapper
around it.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Since read_object() is these days just a thin wrapper around
oid_object_info_extended(), and since it only has two callers, let's
just inline those calls. This has a few positive outcomes:
- it's a net reduction in source code lines
- even though the callers end up with a few extra lines, they're now
more flexible and can use object_info flags directly. So no more
need to convert die_if_corrupt between parameter/flag, and we can
ask for lookup replacement with a flag rather than doing it
ourselves.
- there's one fewer function in an already crowded namespace (e.g.,
the difference between read_object() and read_object_file() was not
immediately obvious; now we only have one of them).
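As the second bullet above hints, an inlined caller might now read
something like this (a sketch; flags as named in this series):

    struct object_info oi = OBJECT_INFO_INIT;
    void *content = NULL;
    unsigned long size;
    enum object_type type;

    oi.typep = &type;
    oi.sizep = &size;
    oi.contentp = &content;
    /* corruption handling and replace lookup via flags, not parameters */
    if (oid_object_info_extended(the_repository, oid, &oi,
                                 OBJECT_INFO_DIE_IF_CORRUPT |
                                 OBJECT_INFO_LOOKUP_REPLACE) < 0)
            return NULL;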
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Even in a repository with a promisor remote, it is useless to
attempt to lazily fetch an object that is expected to be a
commit, because no "filter" mode omits commit objects. Take
advantage of this assumption to fail fast on errors.
* jt/avoid-lazy-fetch-commits:
commit: don't lazy-fetch commits
object-file: emit corruption errors when detected
object-file: refactor map_loose_object_1()
object-file: remove OBJECT_INFO_IGNORE_LOOSE
|
|
Instead of relying on errno being preserved across function calls, teach
do_oid_object_info_extended() to itself report object corruption when
it first detects it. There are 3 types of corruption being detected:
- when a replacement object is missing
- when a loose object is corrupt
- when a packed object is corrupt and the object cannot be read
in another way
Note that in the RHS of this patch's diff, a check for ENOENT that was
introduced in 3ba7a06552 (A loose object is not corrupt if it cannot
be read due to EMFILE, 2010-10-28) is also removed. The purpose of this
check is to avoid a false report of corruption if the errno contains
something like EMFILE (or anything that is not ENOENT), in which case
a more generic report is presented. Because, as of this patch, we no
longer rely on such a heuristic to determine corruption, but surface
the error message at the point when we read something that we did not
expect, this check is no longer necessary.
Besides being more resilient, this also prepares for a future patch in
which an indirect caller of do_oid_object_info_extended() will need
such functionality.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This function can do 3 things:
1. Get an fd given a path
2. Simultaneously get a path and fd given an OID
3. Memory-map an fd
Keep 3 (renaming the function accordingly) and inline 1 and 2 into their
respective callers.
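The kept piece might look roughly like this (a sketch; the name follows
this patch, the body is an assumption):

    static void *map_fd(int fd, const char *path, unsigned long *size)
    {
            void *map = NULL;
            struct stat st;

            if (!fstat(fd, &st)) {
                    *size = xsize_t(st.st_size);
                    if (*size)
                            map = xmmap(NULL, *size, PROT_READ,
                                        MAP_PRIVATE, fd, 0);
                    else
                            error(_("object file %s is empty"), path);
            }
            close(fd);
            return map;
    }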
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Its last user was removed in 97b2fa08b6 (fetch-pack: drop
custom loose object cache, 2018-11-12), so we can remove it.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
write_buffer() reports the OS error if it is unable to write. Its only
caller dies in that case, giving some more context in its last message.
Inline this function and show only a single error message that includes
both the context (writing a loose object file) and the OS error. This
shortens the code and simplifies the output.
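A sketch of the inlined result (exact message wording assumed):

    /* formerly write_buffer() plus a second die() in the caller */
    if (write_in_full(fd, buf, len) < 0)
            die_errno(_("unable to write loose object file"));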
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When adding an alternate ODB, we check if the alternate has the same
path as the object dir, and if so, we do nothing. However, that
comparison does not resolve symlinks. This makes it possible to add the
object dir as an alternate, which may result in bad behavior. For
example, it can trick "git repack -a -l -d" (possibly run by "git gc")
into thinking that all packs come from an alternate and delete all
objects.
rm -rf test &&
git clone https://github.com/git/git test &&
(
cd test &&
ln -s objects .git/alt-objects &&
# -c repack.updateserverinfo=false silences a warning about not
# being able to update "info/refs", it isn't needed to show the
# bad behavior
GIT_ALTERNATE_OBJECT_DIRECTORIES=".git/alt-objects" git \
-c repack.updateserverinfo=false repack -a -l -d &&
# It's broken!
git status
# Because there are no more objects!
ls .git/objects/pack
)
Fix this by resolving symlinks and relative paths before comparing the
alternate and object dir. This lets us clean up a number of issues noted
in 37a95862c6 (alternates: re-allow relative paths from environment,
2016-11-07):
- Now that we compare the real paths, duplicate detection is no longer
foiled by relative paths.
- Using strbuf_realpath() allows us to "normalize" paths that
strbuf_normalize_path() can't, so we can stop silently ignoring errors
when "normalizing" paths from the environment.
- We now store an absolute path based on getcwd() (the "future
direction" named in 37a95862c6), so chdir()-ing in the process no
longer changes the directory pointed to by the alternate. This is a
change in behavior, but a desirable one.
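A sketch of the resulting comparison (variable names are illustrative):

    struct strbuf alt_real = STRBUF_INIT, objdir_real = STRBUF_INIT;

    strbuf_realpath(&objdir_real, objdir, 1);
    if (!strbuf_realpath(&alt_real, alt_path, 0))
            goto error;       /* report instead of silently ignoring */
    if (!fspathcmp(alt_real.buf, objdir_real.buf))
            goto done;        /* the "alternate" is the object dir */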
Signed-off-by: Glen Choo <chooglen@google.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
More UNUSED annotation to help using -Wunused option with the
compiler.
* jk/unused-anno-more:
ll-merge: mark unused parameters in callbacks
diffcore-pickaxe: mark unused parameters in pickaxe functions
convert: mark unused parameter in null stream filter
apply: mark unused parameters in noop error/warning routine
apply: mark unused parameters in handlers
date: mark unused parameters in handler functions
string-list: mark unused callback parameters
object-file: mark unused parameters in hash_unknown functions
mark unused parameters in trivial compat functions
update-index: drop unused argc from do_reupdate()
submodule--helper: drop unused argc from module_list_compute()
diffstat_consume(): assert non-zero length
|
|
The 0'th entry of our hash_algos array fills out the virtual methods
with a series of functions which simply BUG(). This is the right thing
to do, since the point is to catch use of an invalid algo parameter, but
we need to annotate them to appease -Wunused-parameters.
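For instance, one of those stubs (a sketch):

    static void git_hash_unknown_init(git_hash_ctx *ctx UNUSED)
    {
            BUG("trying to init unknown hash");
    }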
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Remove error detection from a function that fetches from promisor
remotes, and make it die when such a fetch fails to bring all the
requested objects, to give an early failure to various operations.
* jt/promisor-remote-fetch-tweak:
promisor-remote: die upon failing fetch
promisor-remote: remove a return value
|
|
In a partial clone, an attempt to read a missing object results in an
attempt to fetch that single object. In order to avoid multiple
sequential fetches, which would occur when multiple objects are missing
(which is the typical case), some commands have been taught to prefetch
in a batch: such a command would, in a partial clone, notice that
several objects that it will eventually need are missing, and call
promisor_remote_get_direct() with all such objects at once.
When this batch prefetch fails, these commands fall back to the
sequential fetches. But at $DAYJOB we have noticed that this results in
a bad user experience: a command would take unexpectedly long to finish
(and possibly use up a lot of bandwidth) if the batch prefetch would
fail for some intermittent reason, but all subsequent fetches would
work. It would be a better user experience for such a command to
just fail.
Therefore, make it a fatal error if the prefetch fails and at least one
object being fetched is known to be a promisor object. (The latter
criterion is to make sure that we are not misleading the user that such
an object would be present from the promisor remote. For example, a
missing object may be a result of repository corruption and not because
it is expectedly missing due to the repository being a partial clone.)
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Allow large objects read from a packstream to be streamed into a
loose object file straight, without having to keep it in-core as a
whole.
* hx/unpack-streaming:
unpack-objects: use stream_loose_object() to unpack large objects
core doc: modernize core.bigFileThreshold documentation
object-file.c: add "stream_loose_object()" to handle large object
object-file.c: factor out deflate part of write_loose_object()
object-file.c: refactor write_loose_object() to several steps
unpack-objects: low memory footprint for get_data() in dry_run mode
|
|
If we want unpack and write a loose object using "write_loose_object",
we have to feed it with a buffer with the same size of the object, which
will consume lots of memory and may cause OOM. This can be improved by
feeding data to "stream_loose_object()" in a stream.
Add a new function "stream_loose_object()", which is a stream version of
"write_loose_object()" but with a low memory footprint. We will use this
function to unpack large blob objects in a later commit.
Another difference with "write_loose_object()" is that we have no chance
to run "write_object_file_prepare()" to calculate the oid in advance.
In "write_loose_object()", we know the oid and we can write the
temporary file in the same directory as the final object, but for an
object with an undetermined oid, we don't know the exact directory for
the object.
Still, we need to save the temporary file we're preparing
somewhere. We'll do that in the top-level ".git/objects/"
directory (or whatever "GIT_OBJECT_DIRECTORY" is set to). Once we've
streamed it we'll know the OID, and will move it to its canonical
path.
"freshen_packed_object()" or "freshen_loose_object()" will be called
inside "stream_loose_object()" after obtaining the "oid". After the
temporary file is written, we wants to mark the object to recent and we
may find that where indeed is already the object. We should remove the
temporary and do not leave a new copy of the object.
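A sketch of the new interface and a caller (the read callback shown is
hypothetical):

    /* data is pulled in via a caller-provided callback */
    struct input_stream {
            const void *(*read)(struct input_stream *st, unsigned long *len);
            void *data;
            int is_finished;
    };

    int stream_loose_object(struct input_stream *in_stream, size_t len,
                            struct object_id *oid);

    /* usage, e.g. feeding inflated data from a pack stream */
    struct input_stream in = {
            .read = feed_inflated_data, /* hypothetical callback */
            .data = &state,
    };
    if (stream_loose_object(&in, size, &oid))
            die(_("failed to write object in stream"));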
Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <chiyutianyi@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Split out the part of write_loose_object() that deals with calling
git_deflate() into a utility function; a subsequent commit will
introduce another function that'll make use of it.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When writing a large blob using "write_loose_object()", we have to pass
a buffer with the whole content of the blob, and this behavior will
consume lots of memory and may cause OOM. We will introduce a stream
version function ("stream_loose_object()") in later commit to resolve
this issue.
Before introducing that streaming function, do some refactoring on
"write_loose_object()" to reuse code for both versions.
Rewrite "write_loose_object()" as follows:
1. Figure out a path for the (temp) object file. This step is only
used in "write_loose_object()".
2. Move common steps for starting to write loose objects into a new
function "start_loose_object_common()".
3. Compress data.
4. Move common steps for ending zlib stream into a new function
"end_loose_object_common()".
5. Close fd and finalize the object file.
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <chiyutianyi@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Rename .env_array member to .env in the child_process structure.
* ab/env-array:
run-command API users: use "env" not "env_array" in comments & names
run-command API: rename "env_array" to "env"
|
|
A mechanism to pack unreachable objects into a "cruft pack",
instead of ejecting them into loose form to be reclaimed later, has
been introduced.
* tb/cruft-packs:
sha1-file.c: don't freshen cruft packs
builtin/gc.c: conditionally avoid pruning objects via loose
builtin/repack.c: add cruft packs to MIDX during geometric repack
builtin/repack.c: use named flags for existing_packs
builtin/repack.c: allow configuring cruft pack generation
builtin/repack.c: support generating a cruft pack
builtin/pack-objects.c: --cruft with expiration
reachable: report precise timestamps from objects in cruft packs
reachable: add options to add_unseen_recent_objects_to_traversal
builtin/pack-objects.c: --cruft without expiration
builtin/pack-objects.c: return from create_object_entry()
t/helper: add 'pack-mtimes' test-tool
pack-mtimes: support writing pack .mtimes files
chunk-format.h: extract oid_version()
pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
pack-mtimes: support reading .mtimes files
Documentation/technical: add cruft-packs.txt
|
|
Coding style fix.
* ds/object-file-unpack-loose-header-fix:
object-file: convert 'switch' back to 'if'
|
|
Introduce a filesystem-dependent mechanism to optimize the way the
bits for many loose object files are ensured to hit the disk
platter.
* ns/batch-fsync:
core.fsyncmethod: performance tests for batch mode
t/perf: add iteration setup mechanism to perf-lib
core.fsyncmethod: tests for batch mode
test-lib-functions: add parsing helpers for ls-files and ls-tree
core.fsync: use batch mode and sync loose objects by default on Windows
unpack-objects: use the bulk-checkin infrastructure
update-index: use the bulk-checkin infrastructure
builtin/add: add ODB transaction around add_files_to_cache
cache-tree: use ODB transaction around writing a tree
core.fsyncmethod: batched disk flushes for loose-objects
bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'
bulk-checkin: rename 'state' variable and separate 'plugged' boolean
|
|
Start following up on the rename mentioned in c7c4bdeccf3 (run-command
API: remove "env" member, always use "env_array", 2021-11-25) of
"env_array" to "env".
The "env_array" name was picked in 19a583dc39e (run-command: add
env_array, an optional argv_array for env, 2014-10-19) because "env"
was taken. Let's not forever keep the oddity of "*_array" for this
"struct strvec", but not for its "args" sibling.
This commit is almost entirely made with a coccinelle rule[1]. The
only manual change here is in run-command.h to rename the struct
member itself and to change "env_array" to "env" in the
CHILD_PROCESS_INIT initializer.
The rest of this is all a result of applying [1]:
* make contrib/coccinelle/run_command.cocci.patch
* patch -p1 <contrib/coccinelle/run_command.cocci.patch
* git add -u
1. cat contrib/coccinelle/run_command.pending.cocci
@@
struct child_process E;
@@
- E.env_array
+ E.env
@@
struct child_process *E;
@@
- E->env_array
+ E->env
I've avoided changing any comments and derived variable names here;
that will all be done in the next commit.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes`
file, which may incur some wasted effort, especially if there are a lot
of cruft objects and they are freshened infrequently.
Instead, have the freshening code skip that optimization and write the
object out loose, letting it pick up a current mtime. This works
because we prefer the mtime of the loose copy of an object when both a
loose and a packed one exist (whether or not the packed copy comes from
a cruft pack).
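A sketch of the check (the `is_cruft` bit is the flag this series adds
to "struct packed_git"):

    /* sketch: in freshen_packed_object() */
    struct pack_entry e;

    if (!find_pack_entry(the_repository, oid, &e))
            return 0;
    if (e.p->is_cruft)
            return 0; /* fall through to writing a fresh loose copy */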
This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.
When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
Generating a non-expiring cruft pack works as follows:
- Callers provide a list of every pack they know about, and indicate
which packs are about to be removed.
- All packs which are going to be removed (we'll call these the
redundant ones) are marked as kept in-core.
Any packs the caller did not mention (but are known to the
`pack-objects` process) are also marked as kept in-core. Packs not
mentioned by the caller are assumed to be unknown to them, i.e.,
they entered the repository after the caller decided which packs
should be kept and which should be discarded.
Since we do not want to include objects in these "unknown" packs
(because we don't know which of their objects are or aren't
reachable), these are also marked as kept in-core.
- Then, we enumerate all objects in the repository, and add them to
our packing list if they do not appear in an in-core kept pack.
This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
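A sketch of the marking step (the predicate helper is hypothetical):

    /* mark redundant and unknown packs as kept in-core */
    for (p = get_all_packs(the_repository); p; p = p->next) {
            if (p->pack_keep)
                    continue; /* already kept on disk */
            if (pack_is_redundant_or_unknown(p)) /* hypothetical */
                    p->pack_keep_in_core = 1;
    }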
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
A bit of test framework fixes with a few fixes to issues found by
valgrind.
* ab/valgrind-fixes:
commit-graph.c: don't assume that stat() succeeds
object-file: fix a unpack_loose_header() regression in 3b6a8db3b03
log test: skip a failing mkstemp() test under valgrind
tests: using custom GIT_EXEC_PATH breaks --valgrind tests
|
|
Introduce and apply a coccinelle rule to discourage explicit
comparisons between a pointer and NULL, and apply the clean-up to
the maintenance track.
* ep/maint-equals-null-cocci:
tree-wide: apply equals-null.cocci
tree-wide: apply equals-null.cocci
contrib/coccinnelle: add equals-null.cocci
|
|
This switch statement was recently added to make it clear that
unpack_loose_header() returns an enum value, not an int. This adds
complications for future developers if that enum gains new values, since
that developer would need to add a case statement to this switch for
little real value.
Instead, we can revert back to an 'if' statement, but make the enum
explicit by using "!= ULHR_OK" instead of assuming it has the numerical
value zero.
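The result reads something like this (a sketch):

    if (unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr),
                            NULL) != ULHR_OK)
            status = error(_("unable to unpack %s header"),
                           oid_to_hex(oid));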
Co-authored-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Fix a regression in my 3b6a8db3b03 (object-file.c: use "enum" return
type for unpack_loose_header(), 2021-10-01) revealed both by running
the test suite with --valgrind, and with the amended "git fsck" test.
In practice this regression in v2.34.0 caused us to claim that we
couldn't parse the header, as opposed to not being able to unpack
it. Before the change in the C code the test_cmp added here would emit:
-error: unable to unpack header of ./objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
+error: unable to parse header of ./objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
I.e. we'd proceed to call parse_loose_header() on the uninitialized
"hdr" value, and it would have been very unlikely for that
uninitialized memory to be a valid git object.
The other callers of unpack_loose_header() were already checking the
enum values exhaustively. See 3b6a8db3b03 and
5848fb11acd (object-file.c: return ULHR_TOO_LONG on "header too long",
2021-10-01).
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
* ep/maint-equals-null-cocci:
tree-wide: apply equals-null.cocci
contrib/coccinnelle: add equals-null.cocci
|
|
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When adding many objects to a repo with `core.fsync=loose-object`,
the cost of fsync'ing each object file can become prohibitive.
One major source of the cost of fsync is the implied flush of the
hardware writeback cache within the disk drive. This commit introduces
a new `core.fsyncMethod=batch` option that batches up hardware flushes.
It hooks into the bulk-checkin odb-transaction functionality, takes
advantage of tmp-objdir, and uses the writeout-only support code.
When the new mode is enabled, we do the following for each new object:
1a. Create the object in a tmp-objdir.
2a. Issue a pagecache writeback request and wait for it to complete.
At the end of the entire transaction when unplugging bulk checkin:
1b. Issue an fsync against a dummy file to flush the log and hardware
writeback cache, which should by now have seen the tmp-objdir writes.
2b. Rename all of the tmp-objdir files to their final names.
3b. When updating the index and/or refs, we assume that Git will issue
another fsync internal to that operation. This is not the default
today, but the user now has the option of syncing the index and there
is a separate patch series to implement syncing of refs.
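Step (1b) might look roughly like the following (a sketch using the
tempfile API; simplified from the actual series):

    /* one hardware flush covers all objects in the transaction */
    static void do_batch_fsync(void)
    {
            struct strbuf temp_path = STRBUF_INIT;
            struct tempfile *temp;

            strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX",
                        get_object_directory());
            temp = xmks_tempfile(temp_path.buf);
            fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
            delete_tempfile(&temp);
            strbuf_release(&temp_path);
    }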
On a filesystem with a singular journal that is updated during name
operations (e.g. create, link, rename, etc.), such as NTFS, HFS+, or XFS,
we would expect the fsync to trigger a journal writeout so that this
sequence is enough to ensure that the user's data is durable by the time
the git command returns. This sequence also ensures that no object files
appear in the main object store unless they are fsync-durable.
Batch mode is only enabled if core.fsync includes loose-objects. If
the legacy core.fsyncObjectFiles setting is enabled, but core.fsync does
not include loose-objects, we will use file-by-file fsyncing.
In step (1a) of the sequence, the tmp-objdir is created lazily to avoid
work if no loose objects are ever added to the ODB. We use a tmp-objdir
to maintain the invariant that no loose-objects are visible in the main
ODB unless they are properly fsync-durable. This is important since
future ODB operations that try to create an object with specific
contents will silently drop the new data if an object with the target
hash already exists, without checking that the loose-object contents
match the hash. Only a full git-fsck would restore the ODB to a
functional state where data loss doesn't occur.
In step (1b) of the sequence, we issue an fsync against a dummy file
created specifically for this purpose. This method has a slightly
higher cost than using one of the input object files, but it makes
adding new
callers of this mechanism easier, since we don't need to figure out
which object file is "last" or risk sharing violations by caching the fd
of the last object file.
_Performance numbers_:
Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD.
Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD.
Windows - Same host as Linux, a preview version of Windows 11.
Adding 500 files to the repo with 'git add'. Times reported in seconds.
object file syncing | Linux | Mac | Windows
--------------------|-------|-------|--------
disabled | 0.06 | 0.35 | 0.61
fsync | 1.88 | 11.18 | 2.47
batch | 0.15 | 0.41 | 1.53
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
If we die while trying to fsync a loose object file, pass the actual
filename we're trying to sync. This is likely to be more helpful for a
user trying to diagnose the cause of the failure than the former
'loose object file' string. It also sidesteps any concerns about
translating the die message differently for loose objects versus
something else that has a real path.
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Code clean-up.
* ab/refs-various-fixes:
refs debug: add a wrapper for "read_symbolic_ref"
packed-backend: remove stub BUG(...) functions
misc *.c: use designated initializers for struct assignments
refs: use designated initializers for "struct ref_iterator_vtable"
refs: use designated initializers for "struct ref_storage_be"
|
|
Replace core.fsyncObjectFiles with two new configuration variables,
core.fsync and core.fsyncMethod.
* ns/core-fsyncmethod:
core.fsync: documentation and user-friendly aggregate options
core.fsync: new option to harden the index
core.fsync: add configuration parsing
core.fsync: introduce granular fsync control infrastructure
core.fsyncmethod: add writeout-only mode
wrapper: make inclusion of Windows csprng header tightly scoped
|
|
Change a few miscellaneous non-designated initializer assignments to
use designated initializers.
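For example, rather than relying on field order, a designated
initializer names each field it sets (illustrative):

    struct child_process cp = {
            .git_cmd = 1,
            .in = -1,
    };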
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Object-file API shuffling.
* ab/object-file-api-updates:
object-file API: pass an enum to read_object_with_reference()
object-file.c: add a literal version of write_object_file_prepare()
object-file API: have hash_object_file() take "enum object_type"
object API: rename hash_object_file_literally() to write_*()
object-file API: split up and simplify check_object_signature()
object API users + docs: check <0, not !0 with check_object_signature()
object API docs: move check_object_signature() docs to cache.h
object API: correct "buf" v.s. "map" mismatch in *.c and *.h
object-file API: have write_object_file() take "enum object_type"
object-file API: add a format_object_header() function
object-file API: return "void", not "int" from hash_object_file()
object-file.c: split up declaration of unrelated variables
|
|
This commit introduces the infrastructure for the core.fsync
configuration knob. The repository components we want to sync
are identified by flags so that we can turn on or off syncing
for specific components.
If core.fsyncObjectFiles is set and the core.fsync configuration
also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any
loose objects. This picks the strictest data integrity behavior
if core.fsync and core.fsyncObjectFiles are set to conflicting values.
This change introduces the currently unused fsync_component
helper, which will be used by a later patch that adds fsyncing to
the refs backend.
Actual configuration and documentation of the fsync components
list are in other patches in the series to separate review of
the underlying mechanism from the policy of how it's configured.
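The flags and helper might be declared like so (a sketch; the exact
component list shown is an assumption):

    enum fsync_component {
            FSYNC_COMPONENT_LOOSE_OBJECT    = 1 << 0,
            FSYNC_COMPONENT_PACK            = 1 << 1,
            FSYNC_COMPONENT_PACK_METADATA   = 1 << 2,
            FSYNC_COMPONENT_COMMIT_GRAPH    = 1 << 3,
    };

    static inline void fsync_component_or_die(enum fsync_component component,
                                              int fd, const char *msg)
    {
            if (fsync_components & component)
                    fsync_or_die(fd, msg);
    }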
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Change the read_object_with_reference() function to take an "enum
object_type". It was not prepared to handle an arbitrary "const
char *type", as it was itself calling type_from_string().
Let's change the only caller that passes in user data to use
type_from_string(), and convert the rest to use e.g. "OBJ_TREE"
instead of "tree_type".
The "cat-file" caller is not on the codepath that
handles"--allow-unknown", so the type_from_string() there is safe. Its
use of type_from_string() doesn't functionally differ from that of the
pre-image.
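A sketch of a converted caller:

    unsigned long size;
    void *buf = read_object_with_reference(the_repository, &oid,
                                           OBJ_TREE, &size, NULL);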
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Split off a *_literally() variant of the write_object_file_prepare()
function. To do this create a new "hash_object_body()" static helper.
We now defer the type_name() call until the very last moment in
format_object_header() for those callers that aren't "hash-object
--literally".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Change the hash_object_file() function to take an "enum
object_type".
Since a preceding commit, all of its callers pass either
"{commit,tree,blob,tag}_type" or the result of a call to type_name();
the parse_object() caller that would have passed NULL now uses
stream_object_signature().
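The resulting signature (parameter order as an assumption):

    void hash_object_file(const struct git_hash_algo *algo,
                          const void *buf, unsigned long len,
                          enum object_type type, struct object_id *oid);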
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Before 0c3db67cc81 (hash-object --literally: fix buffer overrun with
extra-long object type, 2015-05-04) the hash-object code being changed
here called write_sha1_file() to both hash and write a loose
object. Before that we'd use hash_sha1_file() to if "-w" wasn't
provided, and otherwise call write_sha1_file().
Now we'll always call the same function in both cases. Let's rename
it from hash_*_literally() to write_*_literally(). Even though the
write_*() might not actually write if HASH_WRITE_OBJECT isn't in
"flags", it is more similar to write_object_file_flags() than to
hash_object_file(), and carrying a name that suggests it's a
variant of the latter is confusing.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Split up the check_object_signature() function into that non-streaming
version (it accepts an already filled "buf"), and a new
stream_object_signature() which will retrieve the object from storage,
and hash it on-the-fly.
All of the callers of check_object_signature() were effectively
calling two different functions, if we go by cyclomatic
complexity. I.e. they'd either take the early "if (map)" branch and
return early, or not. This has been the case since the "if (map)"
condition was added in 090ea12671b (parse_object: avoid putting whole
blob in core, 2012-03-07).
We can then further simplify the resulting check_object_signature()
function since only one caller wanted to pass a non-NULL "buf" and a
non-NULL "real_oidp". That "read_loose_object()" codepath used by "git
fsck" can instead use hash_object_file() followed by oideq().
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|