aboutsummaryrefslogtreecommitdiffstats
path: root/object-file.c
AgeCommit message (Collapse)AuthorFilesLines
2025-04-29object-store: drop `loose_object_path()`Patrick Steinhardt1-2/+2
The function `loose_object_path()` is a trivial wrapper around `odb_loose_path()`, with the only exception that it always uses the primary object database of the given repository. This doesn't really add a ton of value though, so let's drop the function and inline it at every callsite. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-24Merge branch 'ps/object-file-cleanup' into ps/object-store-cleanupJunio C Hamano1-1193/+27
* ps/object-file-cleanup: object-store: merge "object-store-ll.h" and "object-store.h" object-store: remove global array of cached objects object: split out functions relating to object store subsystem object-file: drop `index_blob_stream()` object-file: split up concerns of `HASH_*` flags object-file: split out functions relating to object store subsystem object-file: move `xmmap()` into "wrapper.c" object-file: move `git_open_cloexec()` to "compat/open.c" object-file: move `safe_create_leading_directories()` into "path.c" object-file: move `mkdir_in_gitdir()` into "path.c"
2025-04-15Merge branch 'ps/object-wo-the-repository'Junio C Hamano1-283/+9
The object layer has been updated to take an explicit repository instance as a parameter in more code paths. * ps/object-wo-the-repository: hash: stop depending on `the_repository` in `null_oid()` hash: fix "-Wsign-compare" warnings object-file: split out logic regarding hash algorithms delta-islands: stop depending on `the_repository` object-file-convert: stop depending on `the_repository` pack-bitmap-write: stop depending on `the_repository` pack-revindex: stop depending on `the_repository` pack-check: stop depending on `the_repository` environment: move access to "core.bigFileThreshold" into repo settings pack-write: stop depending on `the_repository` and `the_hash_algo` object: stop depending on `the_repository` csum-file: stop depending on `the_repository`
2025-04-15Merge branch 'jk/zlib-inflate-fixes'Junio C Hamano1-25/+23
Fix our use of zlib corner cases. * jk/zlib-inflate-fixes: unpack_loose_rest(): rewrite return handling for clarity unpack_loose_rest(): simplify error handling unpack_loose_rest(): never clean up zstream unpack_loose_rest(): avoid numeric comparison of zlib status unpack_loose_header(): avoid numeric comparison of zlib status git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT unpack_loose_header(): fix infinite loop on broken zlib input unpack_loose_header(): report headers without NUL as "bad" unpack_loose_header(): simplify next_out assignment loose_object_info(): BUG() on inflating content with unknown type
2025-04-15object-store: merge "object-store-ll.h" and "object-store.h"Patrick Steinhardt1-0/+1
The "object-store-ll.h" header has been introduced to keep transitive header dependendcies and compile times at bay. Now that we have created a new "object-store.c" file though we can easily move the last remaining additional bit of "object-store.h", the `odb_path_map`, out of the header. Do so. As the "object-store.h" header is now equivalent to its low-level alternative we drop the latter and inline it into the former. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: drop `index_blob_stream()`Patrick Steinhardt1-24/+2
The `index_blob_stream()` function is a mere wrapper around `index_blob_bulk_checkin()`. This has been the case since 568508e7657 (bulk-checkin: replace fast-import based implementation, 2011-10-28), which has moved the implementation from `index_blob_stream()` (which was still called `index_stream()`) into `index_bulk_checkin()` (which has since been renamed to `index_blob_bulk_checkin()`). Remove the redirection by dropping the wrapper. Move the comment to `index_blob_bulk_checkin()` to retain its context. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: split up concerns of `HASH_*` flagsPatrick Steinhardt1-9/+9
The functions `hash_object_file()`, `write_object_file()` and `index_fd()` reuse the same set of flags to alter their behaviour. This not only adds confusion, but given that every function only supports a subset of the flags it becomes very hard to see which flags can be passed to what function. Last but not least, this entangles the implementation of all three function families. Split up concerns by creating separate flags for each of the function families. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: split out functions relating to object store subsystemPatrick Steinhardt1-977/+13
While we have the "object-store.h" header, most of the functionality for object stores is actually hosted in "object-file.c". This makes it hard to find relevant functions and causes us to mix up concerns. Split out functions relating to the object store subsystem into a new "object-store.c" file. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `xmmap()` into "wrapper.c"Patrick Steinhardt1-48/+0
The `xmmap()` function is provided by "object-file.c" even though its functionality has nothing to do with the object file subsystem. Move it into "wrapper.c", whose header already declares those functions. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `git_open_cloexec()` to "compat/open.c"Patrick Steinhardt1-27/+0
The `git_open_cloexec()` wrapper function provides the ability to open a file with `O_CLOEXEC` in a platform-agnostic way. This function is provided by "object-file.c" even though it is not specific to the object subsystem at all. Move the file into "compat/open.c". This file already exists before this commit, but has only been compiled conditionally depending on whether or not open(3p) may return EINTR. With this change we now unconditionally compile the object, but wrap `git_open_with_retry()` in an ifdef. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `safe_create_leading_directories()` into "path.c"Patrick Steinhardt1-79/+2
The `safe_create_leading_directories()` function and its relatives are located in "object-file.c", which is not a good fit as they provide generic functionality not related to objects at all. Move them into "path.c", which already hosts `safe_create_dir()` and its relative `safe_create_dir_in_gitdir()`. "path.c" is free of `the_repository`, but the moved functions depend on `the_repository` to read the "core.sharedRepository" config. Adapt the function signature to accept a repository as argument to fix the issue and adjust callers accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `mkdir_in_gitdir()` into "path.c"Patrick Steinhardt1-31/+2
The `mkdir_in_gitdir()` function is similar to `safe_create_dir()`, but the former is hosted in "object-file.c" whereas the latter is hosted in "path.c". The latter code unit makes way more sense though as the logic has nothing to do with object files in particular. Move the file into "path.c". While at it, we: - Rename the function to `safe_create_dir_in_gitdir()` so that the function names are similar to one another. - Remove the dependency on `the_repository` by making the callers pass the repository instead. Adjust callers accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21treewide: replace assert() with ASSERT() in special casesElijah Newren1-1/+1
When the compiler/linker cannot verify that an assert() invocation is free of side effects for us (e.g. because the assertion includes some kind of function call), replace the use of assert() with ASSERT(). Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10hash: stop depending on `the_repository` in `null_oid()`Patrick Steinhardt1-1/+1
The `null_oid()` function returns the object ID that only consists of zeroes. Naturally, this ID also depends on the hash algorithm used, as the number of zeroes is different between SHA1 and SHA256. Consequently, the function returns the hash-algorithm-specific null object ID. This is currently done by depending on `the_hash_algo`, which implicitly makes us depend on `the_repository`. Refactor the function to instead pass in the hash algorithm for which we want to retrieve the null object ID. Adapt callsites accordingly by passing in `the_repository`, thus bubbling up the dependency on that global variable by one layer. There are a couple of trivial exceptions for subsystems that already got rid of `the_repository`. These subsystems instead use the repository that is available via the calling context: - "builtin/grep.c" - "grep.c" - "refs/debug.c" There are also two non-trivial exceptions: - "diff-no-index.c": Here we know that we may not have a repository initialized at all, so we cannot rely on `the_repository`. Instead, we adapt `diff_no_index()` to get a `struct git_hash_algo` as parameter. The only caller is located in "builtin/diff.c", where we know to call `repo_set_hash_algo()` in case we're running outside of a Git repository. Consequently, it is fine to continue passing `the_repository->hash_algo` even in this case. - "builtin/ls-files.c": There is an in-flight patch series that drops `USE_THE_REPOSITORY_VARIABLE` in this file, which causes a semantic conflict because we use `null_oid()` in `show_submodule()`. The value is passed to `repo_submodule_init()`, which may use the object ID to resolve a tree-ish in the superproject from which we want to read the submodule config. As such, the object ID should refer to an object in the superproject, and consequently we need to use its hash algorithm. This means that we could in theory just not bother about this edge case at all and just use `the_repository` in "diff-no-index.c". But doing so would feel misdesigned. Remove the `USE_THE_REPOSITORY_VARIABLE` preprocessor define in "hash.c". Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10object-file: split out logic regarding hash algorithmsPatrick Steinhardt1-277/+0
While we have a "hash.h" header, the actual implementation of the subsystem is hosted by "object-file.c". This makes it harder than necessary to find the actual implementation of the hash subsystem and intermingles the different concerns with one another. Split out the implementation of hash algorithms into a new, separate "hash.c" file. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10object-file-convert: stop depending on `the_repository`Patrick Steinhardt1-3/+4
There are multiple sites in "object-file-convert.c" where we use the global `the_repository` variable, either explicitly or implicitly by using `the_hash_algo`. All of these callsites are transitively called from `convert_object_file()`, which indeed has no repo as input. Refactor the function so that it receives a repository as a parameter and pass it through to all internal functions to get rid of the dependency. Remove the `USE_THE_REPOSITORY_VARIABLE` define. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10environment: move access to "core.bigFileThreshold" into repo settingsPatrick Steinhardt1-2/+4
The "core.bigFileThreshold" setting is stored in a global variable and populated via `git_default_core_config()`. This may cause issues in the case where one is handling multiple different repositories in a single process with different values for that config key, as we may or may not see the correct value in that case. Furthermore, global state blocks our path towards libification. Refactor the code so that we instead store the value in `struct repo_settings`, where the value is computed as-needed and cached. Note that this change requires us to adapt one test in t1050 that verifies that we die when parsing an invalid "core.bigFileThreshold" value. The exercised Git command doesn't use the value at all, and thus it won't hit the new code path that parses the value. This is addressed by using git-hash-object(1) instead, which does read the value. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-05Merge branch 'ps/path-sans-the-repository'Junio C Hamano1-7/+7
The path.[ch] API takes an explicit repository parameter passed throughout the callchain, instead of relying on the_repository singleton instance. * ps/path-sans-the-repository: path: adjust last remaining users of `the_repository` environment: move access to "core.sharedRepository" into repo settings environment: move access to "core.hooksPath" into repo settings repo-settings: introduce function to clear struct path: drop `git_path()` in favor of `repo_git_path()` rerere: let `rerere_path()` write paths into a caller-provided buffer path: drop `git_common_path()` in favor of `repo_common_path()` worktree: return allocated string from `get_worktree_git_dir()` path: drop `git_path_buf()` in favor of `repo_git_path_replace()` path: drop `git_pathdup()` in favor of `repo_git_path()` path: drop unused `strbuf_git_path()` function path: refactor `repo_submodule_path()` family of functions submodule: refactor `submodule_to_gitdir()` to accept a repo path: refactor `repo_worktree_path()` family of functions path: refactor `repo_git_path()` family of functions path: refactor `repo_common_path()` family of functions
2025-02-28path: adjust last remaining users of `the_repository`Patrick Steinhardt1-4/+4
With the preceding refactorings we now only have a couple of implicit users of `the_repository` left in the "path" subsystem, all of which depend on global state via `calc_shared_perm()`. Make the dependency on `the_repository` explicit by passing the repo as a parameter instead and adjust callers accordingly. Note that this change bubbles up into a couple of subsystems that were previously declared as free from `the_repository`. Instead of marking all of them as `the_repository`-dependent again, we instead use the repository that is available in the calling context. There are three exceptions though with "copy.c", "pack-write.c" and "tempfile.c". Adjusting these would require us to adapt callsites all over the place, so this is left for a future iteration. Mark "path.c" as free from `the_repository`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_rest(): rewrite return handling for clarityJeff King1-6/+6
We have a pattern like: if (error1) ...handle error 1... else if (error2) ...handle error 2... else ...return buf... ...free buf and return NULL... This is a little subtle because it is the return in the success block that lets us skip the common error handling. Rewrite this instead to free the buffer in each error path, marking it as NULL, and then all code paths can use the common return. This should make the logic a bit easier to follow. It does mean duplicating the buf cleanup for errors, but it's a single line. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_rest(): simplify error handlingJeff King1-3/+3
Inflating a loose object is considered successful only if we got Z_STREAM_END and there were no more bytes. We check both of those conditions and return success, but then have to check them a second time to decide which error message to produce. I.e., we do something like this: if (!error_1 && !error_2) ...return success... if (error_1) ...handle error1... else if (error_2) ...handle error2... ...common error handling... This repetition was the source of a small bug fixed in an earlier commit (our Z_STREAM_END check was not the same in the two conditionals). Instead we can chain them all into a single if/else cascade, which avoids repeating ourselves: if (error_1) ...handle error1... else if (error_2) ...handle error2.... else ...return success... ...common error handling... Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_rest(): never clean up zstreamJeff King1-10/+8
The unpack_loose_rest() function has funny ownership semantics: we pass in a z_stream opened by the caller, but then only _sometimes_ close it. This oddity has developed over time. When the function was originally split out in 5180cacc20 (Split up unpack_sha1_file() some more, 2005-06-02), it always called inflateEnd() to clean up the stream (though nowadays it is a git_zstream and we call git_inflate_end()). But in 7efbff7531 (unpack_sha1_file(): detect corrupt loose object files., 2007-03-05) we added error code paths which don't close the stream. This makes some sense, as we'd still look at parts of the stream struct to decide which error to show (though I am not sure in practice if inflateEnd() even touches those fields). This subtlety makes it hard to know when the caller has to clean up the stream and when it does not. That led to the leak fixed by aa9ef614dc (object-file: fix memory leak when reading corrupted headers, 2024-08-14). Let's instead always leave the stream intact, forcing the caller to clean it up. You might think that would create more work for the callers, but it actually ends up simplifying them, since they can put the call to git_inflate_end() in the common cleanup code path. Two things to note, though: - The check_stream_oid() function is used as a replacement for unpack_loose_rest() in read_loose_object() to read blobs. It inherited the same funny semantics, and we should fix it here, too (to keep the cleanup in read_loose_object() consistent). - In read_loose_object() we need a second "out" label, as we can jump to the existing label before opening the stream at all (and since the struct is opaque, there is no way to if it was initialized or not, so we must not call git_inflate_end() in that case). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_rest(): avoid numeric comparison of zlib statusJeff King1-1/+1
When unpacking the actual content of a loose object file, we insist both that the status code we got is Z_STREAM_END, and that we consumed all bytes. If we didn't, we'll return an error, but the specific error message we produce depends on which of the two error conditions we saw. So we'll check both a second time to decide which error to produce. But this second time, our status code check is loose: it checks for a negative status value. This can get confused by zlib codes which are not negative, such as Z_NEED_DICT. In this case we'd erroneously print nothing at all, when we should say "corrupt loose object". Instead, this second check should check explicitly against Z_STREAM_END. Note that Z_OK is "0", so the existing code also produced no message for Z_OK. But it's impossible to see that status, since we only break out of the inflate loop when we stop seeing Z_OK (so a stream which has more bytes than its object header claims would eventually yield Z_BUF_ERROR). There's no test here, as it would require a loose object whose zlib stream returns Z_NEED_DICT in the middle of the object content. I think that is probably possible, but even our Z_NEED_DICT test in t1006 does not trigger this, because we hit that error while reading the header. I found this bug while reviewing all callers of git_inflate() for bugs similar to the one we saw in unpack_loose_header(). This was the only other case that did a numeric comparison rather than explicitly checking for Z_STREAM_END. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_header(): avoid numeric comparison of zlib statusJeff King1-1/+1
When unpacking a loose header, we try to inflate the first 32 bytes. We'd expect either Z_OK (we filled up the output buffer, but there are more bytes in the object) or Z_STREAM_END (this is a tiny object whose header and content fit in the buffer). We check for that with "if (status < Z_OK)", making the assumption that all of the errors we'd see have negative values (as Z_OK itself is "0", and Z_STREAM_END is "1"). But there's at least one case this misses: Z_NEED_DICT is "2". This isn't something we'd ever expect to see, but if we do see it, we should consider it an error (since we have no dictionary to load). Instead, the current code interprets Z_NEED_DICT as success and looks for the object header's terminating NUL in the bytes we've read. This will generaly be zero bytes if the dictionary is mentioned at the start of the stream. So we'll fail to find it and complain "the header is too long" (ULHR_LONG). But really, the problem is that the object is malformed, and we should return ULHR_BAD. This is a minor bug, as we consider both cases to be an error. But it does mean we print the wrong error message. The test case added in the previous patch triggers this code, so we can just confirm the error message we see here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_header(): fix infinite loop on broken zlib inputJeff King1-1/+1
When reading a loose object, we first try to expand the first 32 bytes to read the type+size header. This is enough for any of the normal Git types. But since 46f034483e (sha1_file: support reading from a loose object of unknown type, 2015-05-03), the caller can also ask us to parse any unknown names, which can be much longer. In this case we keep inflating until we find the NUL at the end of the header, or hit Z_STREAM_END. But what if zlib can't make forward progress? For example, if the loose object file is truncated, we'll have no more data to feed it. It will return Z_BUF_ERROR, and we'll just loop infinitely, calling git_inflate() over and over but never seeing new bytes nor an end-of-stream marker. We can fix this by only looping when we think we can make forward progress. This will always be Z_OK in this case. In other code we might also be able to continue on Z_BUF_ERROR, but: - We will never see Z_BUF_ERROR because the output buffer is full; we always feed a fresh 32-byte buffer on each call to git_inflate(). - We may see Z_BUF_ERROR if we run out of input. But since we've fed the whole mmap'd buffer to zlib, if it runs out of input there is nothing more we can do. So if we don't see Z_OK (and didn't see the end-of-header NUL, otherwise we'd have broken out of the loop), then we should stop looping and return an error. The test case shows an example where the input is truncated (which gives us the input Z_BUF_ERROR case above). Although we do operate on objects we might get from an untrusted remote, I don't think the security implications of this bug are too great. It can only trigger if both of these are true: - You're reading a loose object whose on-disk representation was written by an attacker. So fetching an object (or receiving a push) are mostly OK, because even with unpack-objects it is our local, trusted code that writes out the object file. The exception may be fetching from an untrusted local repo, or using dumb-http, which copies objects verbatim. But... - The only code path which triggers the inflate loop is cat-file's --allow-unknown-type option. This is unlikely to be called at all outside of debugging. But I also suspect that objects with non-standard types (or that are truncated) would not survive the usual fetch/receive checks in the first place. So I think it would be quite hard to trick somebody into running the infinite loop, and we can just fix the bug. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_header(): report headers without NUL as "bad"Jeff King1-1/+1
If a caller asks us to read the whole loose object header value into a strbuf (e.g., via "cat-file --allow-unknown-type"), we'll keep reading until we see a NUL byte marking the end of the header. If we hit Z_STREAM_END before seeing the NUL, we obviously have to stop. But we return ULHR_TOO_LONG, which doesn't make any sense. The "too long" return code is used in the normal, 32-byte limited mode to indicate that we stopped looking. There is no such thing as "too long" here, as we'd keep reading forever until we see the end of stream or the NUL. Instead, we should return ULHR_BAD. The loose object has no NUL marking the end of header, so it is malformed. The behavior difference is slight; in either case we'd consider the object unreadable and refuse to go further. The only difference is the specific error message we produce. There's no test case here, as we'd need to generate a valid zlib stream without a NUL. That's not something Git will do without writing new custom code. And in the next patch we'll fix another bug in this area which will make this easier to do (and we will test it then). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25unpack_loose_header(): simplify next_out assignmentJeff King1-4/+3
When using OBJECT_INFO_ALLOW_UNKNOWN_TYPE to unpack a header that doesn't fit into our initial 32-byte buffer, we loop over calls git_inflate(), feeding it our buffer to the "next_out" pointer each time. As the code is written, we reset next_out after each inflate call (and after reading the output), ready for the next loop. This isn't wrong, but there are a few advantages to setting up "next_out" right before each inflate call, rather than after: 1. It drops a few duplicated lines of code. 2. It makes it obvious that we always feed a fresh buffer on each call (and thus can never see Z_BUF_ERROR due to due to a lack of output space). 3. After we exit the loop, we'll leave stream->next_out pointing to the end of the fetched data (this is how zlib callers find out how much data is in the buffer). This doesn't matter in practice, since nobody looks at it again. But it's probably the least-surprising thing to do, as it matches how next_out is left when the whole thing fits in the initial 32-byte buffer (and we don't enter the loop at all). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-25loose_object_info(): BUG() on inflating content with unknown typeJeff King1-0/+2
After unpack_loose_header() returns, it will have inflated not only the object header, but possibly some bytes of the object content. When we call unpack_loose_rest() to extract the actual content, it finds those extra bytes by skipping past the header's terminating NUL in the buffer. Like this: int bytes = strlen(buffer) + 1; n = stream->total_out - bytes; ... memcpy(buf, (char *) buffer + bytes, n); This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there we allow a header of arbitrary size. We put into a strbuf, but feed only the final 32-byte chunk we read to unpack_loose_rest(). In that case stream->total_out may unexpectedly large, and thus our "n" will be large, causing an out-of-bounds read (we do check it against our allocated buffer size, which prevents an out-of-bounds write). Probably this could be made to work by feeding the strbuf to unpack_loose_rest(), along with adjusting some types (e.g., "bytes" would need to be a size_t, since it is no longer operating on a 32-byte buffer). But I don't think it's possible to actually trigger this in practice. The only caller who passes ALLOW_UNKNOWN_TYPE is cat-file, which only allows it with the "-t" and "-s" options (neither of which access the content). There is one way you can _almost_ trigger it: the oid compat routines (i.e., accessing sha1 via sha256 names and vice versa) will convert objects on the fly (which requires access to the content) using the same flags that were passed in. So in theory this: t='some very large type field that causes an extra inflate call' sha1_oid=$(git hash-object -w -t "$t" file) sha256_oid=$(git rev-parse --output-object-format=sha256 $sha1_oid) git cat-file --allow-unknown-type -s $sha256_oid would try to access the content. But it doesn't work, because using compat objects requires an entry in the .git/objects/loose-object-idx file, and we don't generate such an entry for non-standard types (see the "compat" section of write_object_file_literally()). If we use "t=blob" instead, then it does access the compat object, but it doesn't trigger the problem (because "blob" is a standard short type name, and it fits in the initial 32-byte buffer). So given that this is almost a memory error bug, I think it's worth addressing. But because we can't actually trigger the situation, I'm hesitant to try a fix that we can't run. Instead let's document the restriction and protect ourselves from the out-of-bounds read by adding a BUG() check. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-07path: drop `git_path_buf()` in favor of `repo_git_path_replace()`Patrick Steinhardt1-2/+2
Remove `git_path_buf()` in favor of `repo_git_path_replace()`. The latter does essentially the same, with the only exception that it does not rely on `the_repository` but takes the repo as separate parameter. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-07path: drop `git_pathdup()` in favor of `repo_git_path()`Patrick Steinhardt1-1/+1
Remove `git_pathdup()` in favor of `repo_git_path()`. The latter does essentially the same, with the only exception that it does not rely on `the_repository` but takes the repo as separate parameter. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03Merge branch 'tb/unsafe-hash-cleanup'Junio C Hamano1-15/+26
The API around choosing to use unsafe variant of SHA-1 implementation has been updated in an attempt to make it harder to abuse. * tb/unsafe-hash-cleanup: hash.h: drop unsafe_ function variants csum-file: introduce hashfile_checkpoint_init() t/helper/test-hash.c: use unsafe_hash_algo() csum-file.c: use unsafe_hash_algo() hash.h: introduce `unsafe_hash_algo()` csum-file.c: extract algop from hashfile_checksum_valid() csum-file: store the hash algorithm as a struct field t/helper/test-tool: implement sha1-unsafe helper
2025-01-31global: adapt callers to use generic hash context helpersPatrick Steinhardt1-17/+15
Adapt callers to use generic hash context helpers instead of using the hash algorithm to update them. This makes the callsites easier to reason about and removes the possibility that the wrong hash algorithm is used to update the hash context's state. And as a nice side effect this also gets rid of a bunch of users of `the_hash_algo`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-31hash: provide generic wrappers to update hash contextsPatrick Steinhardt1-0/+6
The hash context is supposed to be updated via the `git_hash_algo` structure, which contains a list of function pointers to update, clone or finalize a hashing context. This requires the callers to track which algorithm was used to initialize the context and continue to use the exact same algorithm. If they fail to do that correctly, it can happen that we start to access context state of one hash algorithm with functions of a different hash algorithm. The result would typically be a segfault, as could be seen e.g. in the patches part of 98422943f0 (Merge branch 'ps/weak-sha1-for-tail-sum-fix', 2025-01-01). The situation was significantly improved starting with 04292c3796 (hash.h: drop unsafe_ function variants, 2025-01-23) and its parent commits. These refactorings ensure that it is not possible to mix up safe and unsafe variants of the same hash algorithm anymore. But in theory, it is still possible to mix up different hash algorithms with each other, even though this is a lot less likely to happen. But still, we can do better: instead of asking the caller to remember the hash algorithm used to initialize a context, we can instead make the context itself remember which algorithm it has been initialized with. If we do so, callers can use a set of generic helpers to update the context and don't need to be aware of the hash algorithm at all anymore. Adapt the context initialization functions to store the hash algorithm in the hashing context and introduce these generic helpers. Callers will be adapted in the subsequent commit. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-31hash: stop typedeffing the hash contextPatrick Steinhardt1-31/+31
We generally avoid using `typedef` in the Git codebase. One exception though is the `git_hash_ctx`, likely because it used to be a union rather than a struct until the preceding commit refactored it. But now that it is a normal `struct` there isn't really a need for a typedef anymore. Drop the typedef and adapt all callers accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-31hash: convert hashing context to a structurePatrick Steinhardt1-15/+15
The `git_hash_context` is a union containing the different hash-specific states for SHA1, its unsafe variant as well as SHA256. We know that only one of these states will ever be in use at the same time because hash contexts cannot be used for multiple different hashes at the same point in time. We're about to extend the structure though to keep track of the hash algorithm used to initialize the context, which is impossible to do while the context is a union. Refactor it to instead be a structure that contains the union of context states. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-31Merge branch 'tb/unsafe-hash-cleanup' into ps/hash-cleanupJunio C Hamano1-15/+26
* tb/unsafe-hash-cleanup: hash.h: drop unsafe_ function variants csum-file: introduce hashfile_checkpoint_init() t/helper/test-hash.c: use unsafe_hash_algo() csum-file.c: use unsafe_hash_algo() hash.h: introduce `unsafe_hash_algo()` csum-file.c: extract algop from hashfile_checksum_valid() csum-file: store the hash algorithm as a struct field t/helper/test-tool: implement sha1-unsafe helper
2025-01-23hash.h: drop unsafe_ function variantsTaylor Blau1-15/+0
Now that all callers have been converted from: the_hash_algo->unsafe_init_fn(); to unsafe_hash_algo(the_hash_algo)->init_fn(); and similar, we can remove the scaffolding for the unsafe_ function variants and force callers to use the new unsafe_hash_algo() mechanic instead. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-23hash.h: introduce `unsafe_hash_algo()`Taylor Blau1-0/+26
In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants, 2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing functions by introducing new functions like "unsafe_init_fn()" and so on. This approach has a major shortcoming that callers must remember to consistently use one variant or the other. Failing to consistently use (or not use) the unsafe variants can lead to crashes at best, or subtle memory corruption issues at worst. In the hashfile API, this isn't difficult to achieve, but verifying that all callers consistently use the unsafe variants is somewhat of a chore given how spread out all of the callers are. In the sha1 and sha1-unsafe test helpers, all of the calls to various hash functions are guarded by an "if (unsafe)" conditional, which is repetitive and cumbersome. Address these issues by introducing a new pattern whereby one 'git_hash_algo' can return a pointer to another 'git_hash_algo' that represents the unsafe version of itself. So instead of having something like: if (unsafe) the_hash_algo->init_fn(...); the_hash_algo->update_fn(...); the_hash_algo->final_fn(...); else the_hash_algo->unsafe_init_fn(...); the_hash_algo->unsafe_update_fn(...); the_hash_algo->unsafe_final_fn(...); we can instead write: struct git_hash_algo *algop = the_hash_algo; if (unsafe) algop = unsafe_hash_algo(algop); algop->init_fn(...); algop->update_fn(...); algop->final_fn(...); This removes the existing shortcoming by no longer forcing the caller to "remember" which variant of the hash functions it wants to call, only to hold onto a 'struct git_hash_algo' pointer that is initialized once. Similarly, while there currently is still a way to "mix" safe and unsafe functions, this too will go away after subsequent commits remove all direct calls to the unsafe_ variants. Note that hash_algo_by_ptr() needs an adjustment to allow passing in the unsafe variant of a hash function. All other query functions on the hash_algos array will continue to return the safe variants of any function. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-16Merge branch 'ps/object-collision-check'Junio C Hamano1-24/+42
CI jobs gave sporadic failures, which turns out that that the object finalization code was giving an error when it did not have to. * ps/object-collision-check: object-file: retry linking file into place when occluding file vanishes object-file: don't special-case missing source file in collision check object-file: rename variables in `check_collision()` object-file: fix race in object collision check
2025-01-06object-file: retry linking file into place when occluding file vanishesPatrick Steinhardt1-4/+21
Prior to 0ad3d65652 (object-file: fix race in object collision check, 2024-12-30), callers could expect that a successful return from `finalize_object_file()` means that either the file was moved into place, or the identical bytes were already present. If neither of those happens, we'd return an error. Since that commit, if the destination file disappears between our link(3p) call and the collision check, we'd return success without actually checking the contents, and without retrying the link. This solves the common case that the files were indeed the same, but it means that we may corrupt the repository if they weren't (this implies a hash collision, but the whole point of this function is protecting against hash collisions). We can't be pessimistic and assume they're different; that hurts the common case that the mentioned commit was trying to fix. But after seeing that the destination file went away, we can retry linking again. Adapt the code to do so when we see that the destination file has racily vanished. This should generally succeed as we have just observed that the destination file does not exist anymore, except in the very unlikely event that it gets recreated by another concurrent process again. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-06object-file: don't special-case missing source file in collision checkPatrick Steinhardt1-2/+1
In 0ad3d65652 (object-file: fix race in object collision check, 2024-12-30) we have started to ignore ENOENT when opening either the source or destination file of the collision check. This was done to handle races more gracefully in case either of the potentially-colliding disappears. The fix is overly broad though: while the destination file may indeed vanish racily, this shouldn't ever happen for the source file, which is a temporary object file (either loose or in packfile format) that we have just created. So if any concurrent process would have removed that temporary file it would indicate an actual issue. Stop treating ENOENT specially for the source file so that we always bubble up this error. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-06object-file: rename variables in `check_collision()`Patrick Steinhardt1-20/+20
Rename variables used in `check_collision()` to clearly identify which file is the source and which is the destination. This will make the next step easier to reason about when we start to treat those files different from one another. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-30object-file: fix race in object collision checkPatrick Steinhardt1-2/+4
One of the tests in t5616 asserts that git-fetch(1) with `--refetch` triggers repository maintenance with the correct set of arguments. This test is flaky and causes us to fail sometimes: ++ git -c protocol.version=0 -c gc.autoPackLimit=0 -c maintenance.incremental-repack.auto=1234 -C pc1 fetch --refetch origin error: unable to open .git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack: No such file or directory fatal: unable to rename temporary file to '.git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack' fatal: could not finish pack-objects to repack local links fatal: index-pack failed error: last command exited with $?=128 The error message is quite confusing as it talks about trying to rename a temporary packfile. A first hunch would thus be that this packfile gets written by git-fetch(1), but removed by git-maintenance(1) while it hasn't yet been finalized, which shouldn't ever happen. And indeed, when looking closer one notices that the file that is supposedly of temporary nature does not have the typical `tmp_pack_` prefix. As it turns out, the "unable to rename temporary file" fatal error is a red herring and the real error is "unable to open". That error is raised by `check_collision()`, which is called by `finalize_object_file()` when moving the new packfile into place. Because t5616 re-fetches objects, we end up with the exact same pack as we already have in the repository. So when the concurrent git-maintenance(1) process rewrites the preexisting pack and unlinks it exactly at the point in time where git-fetch(1) wants to check the old and new packfiles for equality we will see ENOENT and thus `check_collision()` returns an error, which gets bubbled up by `finalize_object_file()` and is then handled by `rename_tmp_packfile()`. That function does not know about the exact root cause of the error and instead just claims that the rename has failed. This race is thus caused by b1b8dfde69 (finalize_object_file(): implement collision check, 2024-09-26), where we have newly introduced the collision check. By definition, two files cannot collide with each other when one of them has been removed. We can thus trivially fix the issue by ignoring ENOENT when opening either of the files we're about to check for collision. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06global: mark code units that generate warnings with `-Wsign-compare`Patrick Steinhardt1-0/+1
Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18object-file: inline empty tree and blob literalsJeff King1-27/+20
We define macros with the bytes of the empty trees and blobs for sha1 and sha256. But since e1ccd7e2b1 (sha1_file: only expose empty object constants through git_hash_algo, 2018-05-02), those are used only for initializing the git_hash_algo entries. Any other code using the macros directly would be suspicious, since a hash_algo pointer is the level of indirection we use to make everything work with both sha1 and sha256. So let's future proof against code doing the wrong thing by dropping the macros entirely and just initializing the structs directly. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18object-file: treat cached_object values as constJeff King1-4/+4
The cached-object API maps oids to in-memory entries. Once inserted, these entries should be immutable. Let's return them from the find_cached_object() call with a const tag to make this clear. Suggested-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18object-file: drop oid field from find_cached_object() return valueJeff King1-11/+12
The pretend_object_file() function adds to an array mapping oids to object contents, which are later retrieved with find_cached_object(). We naturally need to store the oid for each entry, since it's the lookup key. But find_cached_object() also returns a hard-coded empty_tree object. There we don't care about its oid field and instead compare against the_hash_algo->empty_tree. The oid field is left as all-zeroes. This all works, but it means that the cached_object struct we return from find_cached_object() may or may not have a valid oid field, depend whether it is the hard-coded tree or came from pretend_object_file(). Nobody looks at the field, so there's no bug. But let's future-proof it by returning only the object contents themselves, not the oid. We'll continue to call this "struct cached_object", and the array entry mapping the key to those contents will be a "cached_object_entry". This would also let us swap out the array for a better data structure (like a hashmap) if we chose, but there's not much point. The only code that adds an entry is git-blame, which adds at most a single entry per process. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18object-file: move empty_tree struct into find_cached_object()Jeff King1-6/+5
The fake empty_tree struct is a static global, but the only code that looks at it is find_cached_object(). The struct itself is a little odd, with an invalid "oid" field that is handled specially by that function. Since it's really just an implementation detail, let's move it to a static within the function. That future-proofs against other code trying to use it and seeing the weird oid value. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18object-file: drop confusing oid initializer of empty_tree structJeff King1-3/+1
We treat the empty tree specially, providing an in-memory "cached" copy, which allows you to diff against it even if the object doesn't exist in the repository. This is implemented as part of the larger cached_object subsystem, but we use a stand-alone empty_tree struct. We initialize the oid of that struct using EMPTY_TREE_SHA1_BIN_LITERAL. At first glance, that seems like a bug; how could this ever work for sha256 repositories? The answer is that we never look at the oid field! The oid field is used to look up entries added by pretend_object_file() to the cached_objects array. But for our stand-alone entry, we look for it independently using the_hash_algo->empty_tree, which will point to the correct algo struct for the repository. This happened in 62ba93eaa9 (sha1_file: convert cached object code to struct object_id, 2018-05-02), which even mentions that this field is never used. Let's reduce confusion for anybody reading this code by replacing the sha1 initializer with a comment. The resulting field will be all-zeroes, so any violation of our assumption that the oid field is not used will break equally for sha1 and sha256. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18object-file: prefer array-of-bytes initializer for hash literalsJeff King1-17/+21
We hard-code a few well-known hash values for empty trees and blobs in both sha1 and sha256 formats. We do so with string literals like this: #define EMPTY_TREE_SHA256_BIN_LITERAL \ "\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \ "\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5" \ "\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc" \ "\x53\x21" and then use it to initialize the hash field of an object_id struct. That hash field is exactly 32 bytes long (the size we need for sha256). But the string literal above is actually 33 bytes long due to the NUL terminator. This is legal in C, and the NUL is ignored. Side note on legality: in general excess initializer elements are forbidden, and gcc will warn on both of these: char foo[3] = { 'h', 'u', 'g', 'e' }; char bar[3] = "VeryLongString"; I couldn't find specific language in the standard allowing initialization from a string literal where _just_ the NUL is ignored, but C99 section 6.7.8 (Initialization), paragraph 32 shows this exact case as "example 8". However, the upcoming gcc 15 will start warning for this case (when compiled with -Wextra via DEVELOPER=1): CC object-file.o object-file.c:52:9: warning: initializer-string for array of ‘unsigned char’ is too long [-Wunterminated-string-initialization] 52 | "\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \ | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ object-file.c:79:17: note: in expansion of macro ‘EMPTY_TREE_SHA256_BIN_LITERAL’ which is understandable. Even though this is not a bug for us, since we do not care about the NUL terminator (and are just using the literal as a convenient format), it would be easy to accidentally create an array that was mistakenly unterminated. We can avoid this warning by switching the initializer to an actual array of unsigned values. That arguably demonstrates our intent more clearly anyway. Reported-by: Sam James <sam@gentoo.org> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02Merge branch 'tb/weak-sha1-for-tail-sum'Junio C Hamano1-6/+118
The checksum at the tail of files are now computed without collision detection protection. This is safe as the consumer of the information to protect itself from replay attacks checks for hash collisions independently. * tb/weak-sha1-for-tail-sum: csum-file.c: use unsafe SHA-1 implementation when available Makefile: allow specifying a SHA-1 for non-cryptographic uses hash.h: scaffolding for _unsafe hashing variants sha1: do not redefine `platform_SHA_CTX` and friends pack-objects: use finalize_object_file() to rename pack/idx/etc finalize_object_file(): implement collision check finalize_object_file(): refactor unlink_or_warn() placement finalize_object_file(): check for name collision before renaming
2024-09-27hash.h: scaffolding for _unsafe hashing variantsTaylor Blau1-0/+42
Git's default SHA-1 implementation is collision-detecting, which hardens us against known SHA-1 attacks against Git objects. This makes Git object writes safer at the expense of some speed when hashing through the collision-detecting implementation, which is slower than non-collision detecting alternatives. Prepare for loading a separate "unsafe" SHA-1 implementation that can be used for non-cryptographic purposes, like computing the checksum of files that use the hashwrite() API. This commit does not actually introduce any new compile-time knobs to control which implementation is used as the unsafe SHA-1 variant, but does add scaffolding so that the "git_hash_algo" structure has five new function pointers which are "unsafe" variants of the five existing hashing-related function pointers: - git_hash_init_fn unsafe_init_fn - git_hash_clone_fn unsafe_clone_fn - git_hash_update_fn unsafe_update_fn - git_hash_final_fn unsafe_final_fn - git_hash_final_oid_fn unsafe_final_oid_fn The following commit will introduce compile-time knobs to specify which SHA-1 implementation is used for non-cryptographic uses. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-27finalize_object_file(): implement collision checkTaylor Blau1-3/+64
We've had "FIXME!!! Collision check here ?" in finalize_object_file() since aac1794132 (Improve sha1 object file writing., 2005-05-03). That is, when we try to write a file with the same name, we assume the on-disk contents are the same and blindly throw away the new copy. One of the reasons we never implemented this is because the files it moves are all named after the cryptographic hash of their contents (either loose objects, or packs which have their hash in the name these days). So we are unlikely to see such a collision by accident. And even though there are weaknesses in sha1, we assume they are mitigated by our use of sha1dc. So while it's a theoretical concern now, it hasn't been a priority. However, if we start using weaker hashes for pack checksums and names, this will become a practical concern. So in preparation, let's actually implement a byte-for-byte collision check. The new check will cause the write of new differing content to be a failure, rather than a silent noop, and we'll retain the temporary file on disk. If there's no collision present, we'll clean up the temporary file as usual after either rename()-ing or link()-ing it into place. Note that this may cause some extra computation when the files are in fact identical, but this should happen rarely. Loose objects are exempt from this check, and the collision check may be skipped by calling the _flags variant of this function with the FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons: - We don't treat the hash of the loose object file's contents as a checksum, since the same loose object can be stored using different bytes on disk (e.g., when adjusting core.compression, using a different version of zlib, etc.). This is fundamentally different from cases where finalize_object_file() is operating over a file which uses the hash value as a checksum of the contents. In other words, a pair of identical loose objects can be stored using different bytes on disk, and that should not be treated as a collision. - We already use the path of the loose object as its hash value / object name, so checking for collisions at the content level doesn't add anything. Adding a content-level collision check would have to happen at a higher level than in finalize_object_file(), since (avoiding race conditions) writing an object loose which already exists in the repository will prevent us from even reaching finalize_object_file() via the object freshening code. There is a collision check in index-pack via its `check_collision()` function, but there isn't an analogous function in unpack-objects, which just feeds the result to write_object_file(). So skipping the collision check here does not change for better or worse the hardness of loose object writes. As a small note related to the latter bullet point above, we must teach the tmp-objdir routines to similarly skip the content-level collision checks when calling migrate_one() on a loose object file, which we do by setting the FOF_SKIP_COLLISION_CHECK bit when we are inside of a loose object shard. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-27finalize_object_file(): refactor unlink_or_warn() placementTaylor Blau1-1/+6
As soon as we've tried to link() a temporary object into place, we then unlink() the tempfile immediately, whether we were successful or not. For the success case, this is because we no longer need the old file (it's now linked into place). For the error case, there are two outcomes. Either we got EEXIST, in which case we consider the collision to be a noop. Or we got a system error, in which we case we are just cleaning up after ourselves. Using a single line for all of these cases has some problems: - in the error case, our unlink() may clobber errno, which we use in the error message - for the collision case, there's a FIXME that indicates we should do a collision check. In preparation for implementing that, we'll need to actually hold on to the file. Split these three cases into their own calls to unlink_or_warn(). This is more verbose, but lets us do the right thing in each case. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-27finalize_object_file(): check for name collision before renamingTaylor Blau1-2/+6
We prefer link()/unlink() to rename() for object files, with the idea that we should prefer the data that is already on disk to what is incoming. But we may fall back to rename() if the user has configured us to do so, or if the filesystem seems not to support cross-directory links. This loses the "prefer what is on disk" property. We can mitigate this somewhat by trying to stat() the destination filename before doing the rename. This is racy, since the object could be created between the stat() and rename() calls. But in practice it is expanding the definition of "what is already on disk" to be the point that the function is called. That is enough to deal with any potential attacks where an attacker is trying to collide hashes with what's already in the repository. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-25Merge branch 'ak/typofix-2.46-maint'Junio C Hamano1-1/+1
Typofix. * ak/typofix-2.46-maint: upload-pack: fix a typo sideband: fix a typo setup: fix a typo run-command: fix a typo revision: fix a typo refs: fix typos rebase: fix a typo read-cache-ll: fix a typo pretty: fix a typo object-file: fix a typo merge-ort: fix typos merge-ll: fix a typo http: fix a typo gpg-interface: fix a typo git-p4: fix typos git-instaweb: fix a typo fsmonitor-settings: fix a typo diffcore-rename: fix typos config.mak.dev: fix a typo
2024-09-19object-file: fix a typoAndrew Kreimer1-1/+1
Fix a typo in comments. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-12environment: move object database functions into object layerPatrick Steinhardt1-0/+33
The `odb_mkstemp()` and `odb_pack_keep()` functions are quite clearly tied to the object store, but regardless of that they are located in "environment.c". Move them over, which also helps to get rid of dependencies on `the_repository` in the environment subsystem. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-12environment: make `get_object_directory()` accept a repositoryPatrick Steinhardt1-2/+2
The `get_object_directory()` function retrieves the path to the object directory for `the_repository`. Make it accept a `struct repository` such that it can work on arbitrary repositories and make it part of the repository subsystem. This reduces our reliance on `the_repository` and clarifies scope. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-23Merge branch 'ps/leakfixes-part-4'Junio C Hamano1-0/+1
More leak fixes. * ps/leakfixes-part-4: (22 commits) builtin/diff: free symmetric diff members diff: free state populated via options builtin/log: fix leak when showing converted blob contents userdiff: fix leaking memory for configured diff drivers builtin/format-patch: fix various trivial memory leaks diff: fix leak when parsing invalid ignore regex option unpack-trees: clear index when not propagating it sequencer: release todo list on error paths merge-ort: unconditionally release attributes index builtin/fast-export: plug leaking tag names builtin/fast-export: fix leaking diff options builtin/fast-import: plug trivial memory leaks builtin/notes: fix leaking `struct notes_tree` when merging notes builtin/rebase: fix leaking `commit.gpgsign` value config: fix leaking comment character config submodule-config: fix leaking name entry when traversing submodules read-cache: fix leaking hashfile when writing index fails bulk-checkin: fix leaking state TODO object-name: fix leaking symlink paths in object context object-file: fix memory leak when reading corrupted headers ...
2024-08-14object-file: fix memory leak when reading corrupted headersPatrick Steinhardt1-0/+1
When reading corrupt object headers in `read_loose_object()`, we bail out immediately. This causes a memory leak though because we would have already initialized the zstream in `unpack_loose_header()`, and it is the callers responsibility to finish the zstream even on error. While this feels weird, other callsites do it correctly already. Fix this leak by ending the zstream even on errors. We may want to revisit this interface in the future such that the callee handles this for us already when there was an error. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-08fsck: make "fsck_error" callback genericshejialuo1-5/+4
The "fsck_error" callback is designed to report the objects-related error messages. It accepts two parameter "oid" and "object_type" which is not generic. In order to provide a unified callback which can report either objects or refs, remove the objects-related parameters and add the generic parameter "void *fsck_report". Create a new "fsck_object_report" structure which incorporates the removed parameters "oid" and "object_type". Then change the corresponding references to adapt to new "fsck_error" callback. Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-07-02Merge branch 'ew/object-convert-leakfix'Junio C Hamano1-1/+1
Leakfix. * ew/object-convert-leakfix: object-file: fix leak on conversion failure
2024-07-02Merge branch 'ps/use-the-repository'Junio C Hamano1-10/+9
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help transition the codebase to rely less on the availability of the singleton the_repository instance. * ps/use-the-repository: hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE` t/helper: remove dependency on `the_repository` in "proc-receive" t/helper: fix segfault in "oid-array" command without repository t/helper: use correct object hash in partial-clone helper compat/fsmonitor: fix socket path in networked SHA256 repos replace-object: use hash algorithm from passed-in repository protocol-caps: use hash algorithm from passed-in repository oidset: pass hash algorithm when parsing file http-fetch: don't crash when parsing packfile without a repo hash-ll: merge with "hash.h" refs: avoid include cycle with "repository.h" global: introduce `USE_THE_REPOSITORY_VARIABLE` macro hash: require hash algorithm in `empty_tree_oid_hex()` hash: require hash algorithm in `is_empty_{blob,tree}_oid()` hash: make `is_null_oid()` independent of `the_repository` hash: convert `oidcmp()` and `oideq()` to compare whole hash global: ensure that object IDs are always padded hash: require hash algorithm in `oidread()` and `oidclr()` hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()` hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
2024-06-24object-file: fix leak on conversion failureEric Wong1-1/+1
I'm not sure exactly how to trigger the leak, but it seems fairly obvious that the `content' buffer should be freed even if convert_object_file() fails. Noticed while working in this area on unrelated things. Signed-off-by: Eric Wong <e@80x24.org> Acked-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-17Merge branch 'ps/no-writable-strings'Junio C Hamano1-10/+12
Building with "-Werror -Wwrite-strings" is now supported. * ps/no-writable-strings: (27 commits) config.mak.dev: enable `-Wwrite-strings` warning builtin/merge: always store allocated strings in `pull_twohead` builtin/rebase: always store allocated string in `options.strategy` builtin/rebase: do not assign default backend to non-constant field imap-send: fix leaking memory in `imap_server_conf` imap-send: drop global `imap_server_conf` variable mailmap: always store allocated strings in mailmap blob revision: always store allocated strings in output encoding remote-curl: avoid assigning string constant to non-const variable send-pack: always allocate receive status parse-options: cast long name for OPTION_ALIAS http: do not assign string constant to non-const field compat/win32: fix const-correctness with string constants pretty: add casts for decoration option pointers object-file: make `buf` parameter of `index_mem()` a constant object-file: mark cached object buffers as const ident: add casts for fallback name and GECOS entry: refactor how we remove items for delayed checkouts line-log: always allocate the output prefix line-log: stop assigning string constant to file parent buffer ...
2024-06-14global: introduce `USE_THE_REPOSITORY_VARIABLE` macroPatrick Steinhardt1-0/+3
Use of the `the_repository` variable is deprecated nowadays, and we slowly but steadily convert the codebase to not use it anymore. Instead, callers should be passing down the repository to work on via parameters. It is hard though to prove that a given code unit does not use this variable anymore. The most trivial case, merely demonstrating that there is no direct use of `the_repository`, is already a bit of a pain during code reviews as the reviewer needs to manually verify claims made by the patch author. The bigger problem though is that we have many interfaces that implicitly rely on `the_repository`. Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code units to opt into usage of `the_repository`. The intent of this macro is to demonstrate that a certain code unit does not use this variable anymore, and to keep it from new dependencies on it in future changes, be it explicit or implicit For now, the macro only guards `the_repository` itself as well as `the_hash_algo`. There are many more known interfaces where we have an implicit dependency on `the_repository`, but those are not guarded at the current point in time. Over time though, we should start to add guards as required (or even better, just remove them). Define the macro as required in our code units. As expected, most of our code still relies on the global variable. Nearly all of our builtins rely on the variable as there is no way yet to pass `the_repository` to their entry point. For now, declare the macro in "biultin.h" to keep the required changes at least a little bit more contained. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14hash: require hash algorithm in `empty_tree_oid_hex()`Patrick Steinhardt1-8/+2
The `empty_tree_oid_hex()` function use `the_repository` to derive the hash function that shall be used. Require callers to pass in the hash algorithm to get rid of this implicit dependency. While at it, remove the unused `empty_blob_oid_hex()` function. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14global: ensure that object IDs are always paddedPatrick Steinhardt1-0/+2
The `oidcmp()` and `oideq()` functions only compare the prefix length as specified by the given hash algorithm. This mandates that the object IDs have a valid hash algorithm set, or otherwise we wouldn't be able to figure out that prefix. As we do not have a hash algorithm in many cases, for example when handling null object IDs, this assumption cannot always be fulfilled. We thus have a fallback in place that instead uses `the_repository` to derive the hash function. This implicit dependency is hidden away from callers and can be quite surprising, especially in contexts where there may be no repository. In theory, we can adapt those functions to always memcmp(3P) the whole length of their hash arrays. But there exist a couple of sites where we populate `struct object_id`s such that only the prefix of its hash that is actually used by the hash algorithm is populated. The remaining bytes are left uninitialized. The fact that those bytes are uninitialized also leads to warnings under Valgrind in some places where we copy those bytes. Refactor callsites where we populate object IDs to always initialize all bytes. This also allows us to get rid of `oidcpy_with_padding()`, for one because the input is now fully initialized, and because `oidcpy()` will now always copy the whole hash array. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14hash: require hash algorithm in `oidread()` and `oidclr()`Patrick Steinhardt1-2/+2
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash function that shall be used. Require callers to pass in the hash algorithm to get rid of this implicit dependency. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07object-file: make `buf` parameter of `index_mem()` a constantPatrick Steinhardt1-7/+7
The `buf` parameter of `index_mem()` is a non-constant string. This will break once we enable `-Wwrite-strings` because we also pass constants from at least one callsite. Adapt the parameter to be a constant. As we cannot free the buffer without casting now, this also requires us to move the lifetime of the nested buffer around. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07object-file: mark cached object buffers as constPatrick Steinhardt1-3/+5
The buffers of cached objects are never modified, but are still stored as a non-constant pointer. This will cause a compiler warning once we enable the `-Wwrite-strings` compiler warning as we assign an empty constant string when initializing the static `empty_tree` cached object. Convert the field to be constant. This requires us to shuffle around the code a bit because we memcpy(3P) into the allocated buffer in `pretend_object_file()`. This is easily fixed though by allocating the buffer into a temporary variable first. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-17refs: refactor `resolve_gitlink_ref()` to accept a repositoryPatrick Steinhardt1-1/+1
In `resolve_gitlink_ref()` we implicitly rely on `the_repository` to look up the submodule ref store. Now that we can look up submodule ref stores for arbitrary repositories we can improve this function to instead accept a repository as parameter for which we want to resolve the gitlink. Do so and adjust callers accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-03-28Merge branch 'eb/hash-transition'Junio C Hamano1-20/+192
Work to support a repository that work with both SHA-1 and SHA-256 hash algorithms has started. * eb/hash-transition: (30 commits) t1016-compatObjectFormat: add tests to verify the conversion between objects t1006: test oid compatibility with cat-file t1006: rename sha1 to oid test-lib: compute the compatibility hash so tests may use it builtin/ls-tree: let the oid determine the output algorithm object-file: handle compat objects in check_object_signature tree-walk: init_tree_desc take an oid to get the hash algorithm builtin/cat-file: let the oid determine the output algorithm rev-parse: add an --output-object-format parameter repository: implement extensions.compatObjectFormat object-file: update object_info_extended to reencode objects object-file-convert: convert commits that embed signed tags object-file-convert: convert commit objects when writing object-file-convert: don't leak when converting tag objects object-file-convert: convert tag objects when writing object-file-convert: add a function to convert trees between algorithms object: factor out parse_mode out of fast-import and tree-walk into in object.h cache: add a function to read an OID of a specific algorithm tag: sign both hashes commit: export add_header_signature to support handling signatures on tags ...
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-8/+0
Each of these were checked with gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE} to ensure that removing the direct inclusion of the header actually resulted in that header no longer being included at all (i.e. that no other header pulled it in transitively). ...except for a few cases where we verified that although the header was brought in transitively, nothing from it was directly used in that source file. These cases were: * builtin/credential-cache.c * builtin/pull.c * builtin/send-pack.c Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02object-file: handle compat objects in check_object_signatureEric W. Biederman1-1/+3
Update check_object_signature to find the hash algorithm the exising signature uses, and to use the same hash algorithm when recomputing it to check the signature is valid. This will be useful when teaching git ls-tree to display objects encoded with the compat hash algorithm. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02object-file: update object_info_extended to reencode objectsEric W. Biederman1-0/+91
oid_object_info_extended is updated to detect an oid encoding that does not match the current repository, use repo_oid_to_algop to find the correspoding oid in the current repository and to return the data for the oid. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02object-file: add a compat_oid_in parameter to write_object_file_flagsEric W. Biederman1-2/+4
To create the proper signatures for commit objects both versions of the commit object need to be generated and signed. After that it is a waste to throw away the work of generating the compatibility hash so update write_object_file_flags to take a compatibility hash input parameter that it can use to skip the work of generating the compatability hash. Update the places that don't generate the compatability hash to pass NULL so it is easy to tell write_object_file_flags should not attempt to use their compatability hash. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02object-file: update the loose object map when writing loose objectsEric W. Biederman1-18/+95
To implement SHA1 compatibility on SHA256 repositories the loose object map needs to be updated whenver a loose object is written. Updating the loose object map this way allows git to support the old hash algorithm in constant time. The functions write_loose_object, and stream_loose_object are the only two functions that write to the loose object store. Update stream_loose_object to compute the compatibiilty hash, update the loose object, and then call repo_add_loose_object_map to update the loose object map. Update write_object_file_flags to convert the object into it's compatibility encoding, hash the compatibility encoding, write the object, and then update the loose object map. Update force_object_loose to lookup the hash of the compatibility encoding, write the loose object, and then update the loose object map. Update write_object_file_literally to convert the object into it's compatibility hash encoding, hash the compatibility enconding, write the object, and then update the loose object map, when the type string is a known type. For objects with an unknown type this results in a partially broken repository, as the objects are not mapped. The point of write_object_file_literally is to generate a partially broken repository for testing. For testing skipping writing the loose object map is much more useful than refusing to write the broken object at all. Except that the loose objects are updated before the loose object map I have not done any analysis to see how robust this scheme is in the event of failure. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-26bulk-checkin: only support blobs in index_bulk_checkinEric W. Biederman1-6/+6
As the code is written today index_bulk_checkin only accepts blobs. Remove the enum object_type parameter and rename index_bulk_checkin to index_blob_bulk_checkin, index_stream to index_blob_stream, deflate_to_pack to deflate_blob_to_pack, stream_to_pack to stream_blob_to_pack, to make this explicit. Not supporting commits, tags, or trees has no downside as it is not currently supported now, and commits, tags, and trees being smaller by design do not have the problem that the problem that index_bulk_checkin was built to solve. Before we start adding code to support the hash function transition supporting additional objects types in index_bulk_checkin has no real additional cost, just an extra function parameter to know what the object type is. Once we begin the hash function transition this is not the case. The hash function transition document specifies that a repository with compatObjectFormat enabled will compute and store both the SHA-1 and SHA-256 hash of every object in the repository. What makes this a challenge is that it is not just an additional hash over the same object. Instead the hash function transition document specifies that the compatibility hash (specified with compatObjectFormat) be computed over the equivalent object that another git repository whose storage hash (specified with objectFormat) would store. When comparing equivalent repositories built with different storage hash functions, the oids embedded in objects used to refer to other objects differ and the location of signatures within objects differ. As blob objects have neither oids referring to other objects nor stored signatures their storage hash and their compatibility hash are computed over the same object. The other kinds of objects: trees, commits, and tags, all store oids referring to other objects. Signatures are stored in commit and tag objects. As oids and the tags to store signatures are not the same size in repositories built with different storage hashes the size of the equivalent objects are also different. A version of index_bulk_checkin that supports more than just blobs when computing both the SHA-1 and the SHA-256 of every object added would need a different, and more expensive structure. The structure is more expensive because it would be required to temporarily buffering the equivalent object the compatibility hash needs to be computed over. A temporary object is needed, because before a hash over an object can computed it's object header needs to be computed. One of the members of the object header is the entire size of the object. To know the size of an equivalent object an entire pass over the original object needs to be made, as trees, commits, and tags are composed of a variable number of variable sized pieces. Unfortunately there is no formula to compute the size of an equivalent object from just the size of the original object. Avoid all of those future complications by limiting index_bulk_checkin to only work on blobs. Inspired-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-25Merge branch 'jk/unused-parameter'Junio C Hamano1-5/+5
Mark-up unused parameters in the code so that we can eventually enable -Wunused-parameter by default. * jk/unused-parameter: t/helper: mark unused callback void data parameters tag: mark unused parameters in each_tag_name_fn callbacks rev-parse: mark unused parameter in for_each_abbrev callback replace: mark unused parameter in each_mergetag_fn callback replace: mark unused parameter in ref callback merge-tree: mark unused parameter in traverse callback fsck: mark unused parameters in various fsck callbacks revisions: drop unused "opt" parameter in "tweak" callbacks count-objects: mark unused parameter in alternates callback am: mark unused keep_cr parameters http-push: mark unused parameter in xml callback http: mark unused parameters in curl callbacks do_for_each_ref_helper(): mark unused repository parameter test-ref-store: drop unimplemented reflog-expire command
2023-07-13fsck: mark unused parameters in various fsck callbacksJeff King1-5/+5
There are a few callback functions which are used with the fsck code, but it's natural that not all callbacks need all parameters. For reporting, even something as obvious as "the oid of the object which had a problem" is not always used, as some callers are only checking a single object in the first place. And for both reporting and walking, things like void data pointers and the fsck_options aren't always necessary. But since each such parameter is used by _some_ callback, we have to keep them in the interface. Mark the unused ones in specific callbacks to avoid triggering -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05git-compat-util: move alloc macros to git-compat-util.hCalvin Wan1-1/+0
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05treewide: remove unnecessary includes for wrapper.hCalvin Wan1-1/+0
Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21repository: remove unnecessary include of path.hElijah Newren1-0/+1
This also made it clear that several .c files that depended upon path.h were missing a #include for it; add the missing includes while at it. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-05-09Merge branch 'en/header-split-cache-h-part-2'Junio C Hamano1-1/+2
More header clean-up. * en/header-split-cache-h-part-2: (22 commits) reftable: ensure git-compat-util.h is the first (indirect) include diff.h: reduce unnecessary includes object-store.h: reduce unnecessary includes commit.h: reduce unnecessary includes fsmonitor: reduce includes of cache.h cache.h: remove unnecessary headers treewide: remove cache.h inclusion due to previous changes cache,tree: move basic name compare functions from read-cache to tree cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c hash-ll.h: split out of hash.h to remove dependency on repository.h tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h dir.h: move DTYPE defines from cache.h versioncmp.h: move declarations for versioncmp.c functions from cache.h ws.h: move declarations for ws.c functions from cache.h match-trees.h: move declarations for match-trees.c functions from cache.h pkt-line.h: move declarations for pkt-line.c functions from cache.h base85.h: move declarations for base85.c functions from cache.h copy.h: move declarations for copy.c functions from cache.h server-info.h: move declarations for server-info.c functions from cache.h packfile.h: move pack_window and pack_entry from cache.h ...
2023-04-25Merge branch 'ps/fix-geom-repack-with-alternates'Junio C Hamano1-0/+6
Geometric repacking ("git repack --geometric=<n>") in a repository that borrows from an alternate object database had various corner case bugs, which have been corrected. * ps/fix-geom-repack-with-alternates: repack: disable writing bitmaps when doing a local repack repack: honor `-l` when calculating pack geometry t/helper: allow chmtime to print verbosely without modifying mtime pack-objects: extend test coverage of `--stdin-packs` with alternates pack-objects: fix error when same packfile is included and excluded pack-objects: fix error when packing same pack twice pack-objects: split out `--stdin-packs` tests into separate file repack: fix generating multi-pack-index with only non-local packs repack: fix trying to use preferred pack in alternates midx: fix segfault with no packs and invalid preferred pack
2023-04-24object-store.h: reduce unnecessary includesElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24treewide: remove cache.h inclusion due to previous changesElijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-14repack: disable writing bitmaps when doing a local repackPatrick Steinhardt1-0/+6
In order to write a bitmap, we need to have full coverage of all objects that are about to be packed. In the traditional non-multi-pack-index world this meant we need to do a full repack of all objects into a single packfile. But in the new multi-pack-index world we can get away with writing bitmaps when we have multiple packfiles as long as the multi-pack-index covers all objects. This is not always the case though. When asked to perform a repack of local objects, only, then we cannot guarantee to have full coverage of all objects regardless of whether we do a full repack or a repack with a multi-pack-index. The end result is that writing the bitmap will fail in both worlds: $ git multi-pack-index write --stdin-packs --bitmap <packfiles warning: Failed to write bitmap index. Packfile doesn't have full closure (object 1529341d78cf45377407369acb0f4ff2b5cdae42 is missing) error: could not write multi-pack bitmap Now there are two different ways to fix this. The first one would be to amend git-multi-pack-index(1) to disable writing bitmaps when we notice that we don't have full object coverage. - We don't have enough information in git-multi-pack-index(1) in order to tell whether the local repository _should_ have full coverage. Because even when connected to an alternate object directory, it may be the case that we still have all objects around in the main object database. - git-multi-pack-index(1) is quite a low-level tool. Automatically disabling functionality that it was asked to provide does not feel like the right thing to do. We can easily fix it at a higher level in git-repack(1) though. When asked to only include local objects via `-l` and when connected to an alternate object directory then we will override the user's ask and disable writing bitmaps with a warning. This is similar to what we do in git-pack-objects(1), where we also disable writing bitmaps in case we omit an object from the pack. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11object-file.h: move declarations for object-file.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: be explicit about dependence on convert.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-04Merge branch 'ab/remove-implicit-use-of-the-repository' into ↵Junio C Hamano1-2/+2
en/header-split-cache-h * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"
2023-03-28post-cocci: adjust comments for recent repo_* migrationÆvar Arnfjörð Bjarmason1-1/+1
In preceding commits we changed many calls to macros that were providing a "the_repository" argument to invoke corresponding repo_*() function instead. Let's follow-up and adjust references to those in comments, which coccinelle didn't (and inherently can't) catch. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28cocci: apply the "object-store.h" part of "the_repository.pending"Ævar Arnfjörð Bjarmason1-1/+1
Apply the part of "the_repository.pending.cocci" pertaining to "object-store.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21csum-file.h: remove unnecessary inclusion of cache.hElijah Newren1-1/+1
With the change in the last commit to move several functions to write-or-die.h, csum-file.h no longer needs to include cache.h. However, removing that include forces several other C files, which directly or indirectly dependend upon csum-file.h's inclusion of cache.h, to now be more explicit about their dependencies. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21setup.h: move declarations for setup.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21environment.h: move declarations for environment.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21wrapper.h: move declarations for wrapper.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21abspath.h: move absolute path functions from cache.hElijah Newren1-0/+1
This is another step towards letting us remove the include of cache.h in strbuf.c. It does mean that we also need to add includes of abspath.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: be explicit about dependence on gettext.hElijah Newren1-0/+1
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-17Merge branch 'jk/unused-post-2.39-part2'Junio C Hamano1-1/+2
More work towards -Wunused. * jk/unused-post-2.39-part2: (21 commits) help: mark unused parameter in git_unknown_cmd_config() run_processes_parallel: mark unused callback parameters userformat_want_item(): mark unused parameter for_each_commit_graft(): mark unused callback parameter rewrite_parents(): mark unused callback parameter fetch-pack: mark unused parameter in callback function notes: mark unused callback parameters prio-queue: mark unused parameters in comparison functions for_each_object: mark unused callback parameters list-objects: mark unused callback parameters mark unused parameters in signal handlers run-command: mark error routine parameters as unused mark "pointless" data pointers in callbacks ref-filter: mark unused callback parameters http-backend: mark unused parameters in virtual functions http-backend: mark argc/argv unused object-name: mark unused parameters in disambiguate callbacks serve: mark unused parameters in virtual functions serve: use repository pointer to get config ls-refs: drop config caching ...
2023-02-24for_each_object: mark unused callback parametersJeff King1-1/+2
The for_each_{loose,packed}_object interface uses callback functions, but not every callback needs all of the parameters. Mark the unused ones to satisfy -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23alloc.h: move ALLOC_GROW() functions from cache.hElijah Newren1-1/+2
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-30Merge branch 'jk/hash-object-fsck'Junio C Hamano1-32/+23
"git hash-object" now checks that the resulting object is well formed with the same code as "git fsck". * jk/hash-object-fsck: fsck: do not assume NUL-termination of buffers hash-object: use fsck for object checks fsck: provide a function to fsck buffer without object struct t: use hash-object --literally when created malformed objects t7030: stop using invalid tag name t1006: stop using 0-padded timestamps t1007: modernize malformed object tests
2023-01-18hash-object: use fsck for object checksJeff King1-32/+23
Since c879daa237 (Make hash-object more robust against malformed objects, 2011-02-05), we've done some rudimentary checks against objects we're about to write by running them through our usual parsers for trees, commits, and tags. These parsers catch some problems, but they are not nearly as careful as the fsck functions (which make sense; the parsers are designed to be fast and forgiving, bailing only when the input is unintelligible). We are better off doing the more thorough fsck checks when writing objects. Doing so at write time is much better than writing garbage only to find out later (after building more history atop it!) that fsck complains about it, or hosts with transfer.fsckObjects reject it. This is obviously going to be a user-visible behavior change, and the test changes earlier in this series show the scope of the impact. But I'd argue that this is OK: - the documentation for hash-object is already vague about which checks we might do, saying that --literally will allow "any garbage[...] which might not otherwise pass standard object parsing or git-fsck checks". So we are already covered under the documented behavior. - users don't generally run hash-object anyway. There are a lot of spots in the tests that needed to be updated because creating garbage objects is something that Git's tests disproportionately do. - it's hard to imagine anyone thinking the new behavior is worse. Any object we reject would be a potential problem down the road for the user. And if they really want to create garbage, --literally is already the escape hatch they need. Note that the change here is actually in index_mem(), which handles the HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by "git-replace --edit" to sanity-check the result. Covering that with more thorough checks likewise seems like a good thing. Besides being more thorough, there are a few other bonuses: - we get rid of some questionable stack allocations of object structs. These don't seem to currently cause any problems in practice, but they subtly violate some of the assumptions made by the rest of the code (e.g., the "struct commit" we put on the stack and zero-initialize will not have a proper index from alloc_comit_index(). - likewise, those parsed object structs are the source of some small memory leaks - the resulting messages are much better. For example: [before] $ echo 'tree 123' | git hash-object -t commit --stdin error: bogus commit object 0000000000000000000000000000000000000000 fatal: corrupt commit [after] $ echo 'tree 123' | git.compile hash-object -t commit --stdin error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1 fatal: refusing to create malformed object Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-13object-file: fix indent-with-spaceJeff King1-1/+1
Commit b25562e63f (object-file: inline calls to read_object(), 2023-01-07) accidentally indented a conditional block with spaces instead of a tab. Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-08repo_read_object_file(): stop wrapping read_object_file_extended()Jeff King1-4/+4
The only caller of read_object_file_extended() is the thin wrapper of repo_read_object_file(). Instead of wrapping, let's just rename the inner function and let people call it directly. This cleans up the namespace and reduces confusion. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-08read_object_file_extended(): drop lookup_replace optionJeff King1-5/+2
Our sole caller always passes in "1", so we can just drop the parameter entirely. Anybody who doesn't want this behavior could easily call oid_object_info_extended() themselves, as we're just a thin wrapper around it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-08object-file: inline calls to read_object()Jeff King1-28/+17
Since read_object() is these days just a thin wrapper around oid_object_info_extended(), and since it only has two callers, let's just inline those calls. This has a few positive outcomes: - it's a net reduction in source code lines - even though the callers end up with a few extra lines, they're now more flexible and can use object_info flags directly. So no more need to convert die_if_corrupt between parameter/flag, and we can ask for lookup replacement with a flag rather than doing it ourselves. - there's one fewer function in an already crowded namespace (e.g., the difference between read_object() and read_object_file() was not immediately obvious; now we only have one of them). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-05Merge branch 'jt/avoid-lazy-fetch-commits'Junio C Hamano1-56/+52
Even in a repository with promisor remote, it is useless to attempt to lazily attempt fetching an object that is expected to be commit, because no "filter" mode omits commit objects. Take advantage of this assumption to fail fast on errors. * jt/avoid-lazy-fetch-commits: commit: don't lazy-fetch commits object-file: emit corruption errors when detected object-file: refactor map_loose_object_1() object-file: remove OBJECT_INFO_IGNORE_LOOSE
2022-12-15object-file: emit corruption errors when detectedJonathan Tan1-27/+28
Instead of relying on errno being preserved across function calls, teach do_oid_object_info_extended() to itself report object corruption when it first detects it. There are 3 types of corruption being detected: - when a replacement object is missing - when a loose object is corrupt - when a packed object is corrupt and the object cannot be read in another way Note that in the RHS of this patch's diff, a check for ENOENT that was introduced in 3ba7a06552 (A loose object is not corrupt if it cannot be read due to EMFILE, 2010-10-28) is also removed. The purpose of this check is to avoid a false report of corruption if the errno contains something like EMFILE (or anything that is not ENOENT), in which case a more generic report is presented. Because, as of this patch, we no longer rely on such a heuristic to determine corruption, but surface the error message at the point when we read something that we did not expect, this check is no longer necessary. Besides being more resilient, this also prepares for a future patch in which an indirect caller of do_oid_object_info_extended() will need such functionality. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-15object-file: refactor map_loose_object_1()Jonathan Tan1-26/+24
This function can do 3 things: 1. Gets an fd given a path 2. Simultaneously gets a path and fd given an OID 3. Memory maps an fd Keep 3 (renaming the function accordingly) and inline 1 and 2 into their respective callers. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-15object-file: remove OBJECT_INFO_IGNORE_LOOSEJonathan Tan1-3/+0
Its last user was removed in 97b2fa08b6 (fetch-pack: drop custom loose object cache, 2018-11-12), so we can remove it. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-14object-file: inline write_buffer()René Scharfe1-9/+2
write_buffer() reports the OS error if it is unable to write. Its only caller dies in that case, giving some more context in its last message. Inline this function and show only a single error message that includes both the context (writing a loose object file) and the OS error. This shortens the code and simplifies the output. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-11-25object-file: use real paths when adding alternatesGlen Choo1-13/+13
When adding an alternate ODB, we check if the alternate has the same path as the object dir, and if so, we do nothing. However, that comparison does not resolve symlinks. This makes it possible to add the object dir as an alternate, which may result in bad behavior. For example, it can trick "git repack -a -l -d" (possibly run by "git gc") into thinking that all packs come from an alternate and delete all objects. rm -rf test && git clone https://github.com/git/git test && ( cd test && ln -s objects .git/alt-objects && # -c repack.updateserverinfo=false silences a warning about not # being able to update "info/refs", it isn't needed to show the # bad behavior GIT_ALTERNATE_OBJECT_DIRECTORIES=".git/alt-objects" git \ -c repack.updateserverinfo=false repack -a -l -d && # It's broken! git status # Because there are no more objects! ls .git/objects/pack ) Fix this by resolving symlinks and relative paths before comparing the alternate and object dir. This lets us clean up a number of issues noted in 37a95862c6 (alternates: re-allow relative paths from environment, 2016-11-07): - Now that we compare the real paths, duplicate detection is no longer foiled by relative paths. - Using strbuf_realpath() allows us to "normalize" paths that strbuf_normalize_path() can't, so we can stop silently ignoring errors when "normalizing" paths from the environment. - We now store an absolute path based on getcwd() (the "future direction" named in 37a95862c6), so chdir()-ing in the process no longer changes the directory pointed to by the alternate. This is a change in behavior, but a desirable one. Signed-off-by: Glen Choo <chooglen@google.com> Acked-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-27Merge branch 'jk/unused-anno-more'Junio C Hamano1-5/+10
More UNUSED annotation to help using -Wunused option with the compiler. * jk/unused-anno-more: ll-merge: mark unused parameters in callbacks diffcore-pickaxe: mark unused parameters in pickaxe functions convert: mark unused parameter in null stream filter apply: mark unused parameters in noop error/warning routine apply: mark unused parameters in handlers date: mark unused parameters in handler functions string-list: mark unused callback parameters object-file: mark unused parameters in hash_unknown functions mark unused parameters in trivial compat functions update-index: drop unused argc from do_reupdate() submodule--helper: drop unused argc from module_list_compute() diffstat_consume(): assert non-zero length
2022-10-17object-file: mark unused parameters in hash_unknown functionsJeff King1-5/+10
The 0'th entry of our hash_algos array fills out the virtual methods with a series of functions which simply BUG(). This is the right thing to do, since the point is to catch use of an invalid algo parameter, but we need to annotate them to appease -Wunused-parameters. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-17Merge branch 'jt/promisor-remote-fetch-tweak'Junio C Hamano1-4/+0
Remove error detection from a function that fetches from promisor remotes, and make it die when such a fetch fails to bring all the requested objects, to give an early failure to various operations. * jt/promisor-remote-fetch-tweak: promisor-remote: die upon failing fetch promisor-remote: remove a return value
2022-10-05promisor-remote: die upon failing fetchJonathan Tan1-4/+0
In a partial clone, an attempt to read a missing object results in an attempt to fetch that single object. In order to avoid multiple sequential fetches, which would occur when multiple objects are missing (which is the typical case), some commands have been taught to prefetch in a batch: such a command would, in a partial clone, notice that several objects that it will eventually need are missing, and call promisor_remote_get_direct() with all such objects at once. When this batch prefetch fails, these commands fall back to the sequential fetches. But at $DAYJOB we have noticed that this results in a bad user experience: a command would take unexpectedly long to finish (and possibly use up a lot of bandwidth) if the batch prefetch would fail for some intermittent reason, but all subsequent fetches would work. It would be a better user experience for such a command would just fail. Therefore, make it a fatal error if the prefetch fails and at least one object being fetched is known to be a promisor object. (The latter criterion is to make sure that we are not misleading the user that such an object would be present from the promisor remote. For example, a missing object may be a result of repository corruption and not because it is expectedly missing due to the repository being a partial clone.) Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-14Merge branch 'hx/unpack-streaming'Junio C Hamano1-30/+203
Allow large objects read from a packstream to be streamed into a loose object file straight, without having to keep it in-core as a whole. * hx/unpack-streaming: unpack-objects: use stream_loose_object() to unpack large objects core doc: modernize core.bigFileThreshold documentation object-file.c: add "stream_loose_object()" to handle large object object-file.c: factor out deflate part of write_loose_object() object-file.c: refactor write_loose_object() to several steps unpack-objects: low memory footprint for get_data() in dry_run mode
2022-06-13object-file.c: add "stream_loose_object()" to handle large objectHan Xin1-0/+104
If we want unpack and write a loose object using "write_loose_object", we have to feed it with a buffer with the same size of the object, which will consume lots of memory and may cause OOM. This can be improved by feeding data to "stream_loose_object()" in a stream. Add a new function "stream_loose_object()", which is a stream version of "write_loose_object()" but with a low memory footprint. We will use this function to unpack large blob object in later commit. Another difference with "write_loose_object()" is that we have no chance to run "write_object_file_prepare()" to calculate the oid in advance. In "write_loose_object()", we know the oid and we can write the temporary file in the same directory as the final object, but for an object with an undetermined oid, we don't know the exact directory for the object. Still, we need to save the temporary file we're preparing somewhere. We'll do that in the top-level ".git/objects/" directory (or whatever "GIT_OBJECT_DIRECTORY" is set to). Once we've streamed it we'll know the OID, and will move it to its canonical path. "freshen_packed_object()" or "freshen_loose_object()" will be called inside "stream_loose_object()" after obtaining the "oid". After the temporary file is written, we wants to mark the object to recent and we may find that where indeed is already the object. We should remove the temporary and do not leave a new copy of the object. Helped-by: René Scharfe <l.s.r@web.de> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Han Xin <chiyutianyi@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-13object-file.c: factor out deflate part of write_loose_object()Ævar Arnfjörð Bjarmason1-6/+25
Split out the part of write_loose_object() that deals with calling git_deflate() into a utility function, a subsequent commit will introduce another function that'll make use of it. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-13object-file.c: refactor write_loose_object() to several stepsHan Xin1-24/+74
When writing a large blob using "write_loose_object()", we have to pass a buffer with the whole content of the blob, and this behavior will consume lots of memory and may cause OOM. We will introduce a stream version function ("stream_loose_object()") in later commit to resolve this issue. Before introducing that streaming function, do some refactoring on "write_loose_object()" to reuse code for both versions. Rewrite "write_loose_object()" as follows: 1. Figure out a path for the (temp) object file. This step is only used in "write_loose_object()". 2. Move common steps for starting to write loose objects into a new function "start_loose_object_common()". 3. Compress data. 4. Move common steps for ending zlib stream into a new function "end_loose_object_common()". 5. Close fd and finalize the object file. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Han Xin <chiyutianyi@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-10Merge branch 'ab/env-array'Junio C Hamano1-1/+1
Rename .env_array member to .env in the child_process structure. * ab/env-array: run-command API users: use "env" not "env_array" in comments & names run-command API: rename "env_array" to "env"
2022-06-03Merge branch 'tb/cruft-packs'Junio C Hamano1-1/+3
A mechanism to pack unreachable objects into a "cruft pack", instead of ejecting them into loose form to be reclaimed later, has been introduced. * tb/cruft-packs: sha1-file.c: don't freshen cruft packs builtin/gc.c: conditionally avoid pruning objects via loose builtin/repack.c: add cruft packs to MIDX during geometric repack builtin/repack.c: use named flags for existing_packs builtin/repack.c: allow configuring cruft pack generation builtin/repack.c: support generating a cruft pack builtin/pack-objects.c: --cruft with expiration reachable: report precise timestamps from objects in cruft packs reachable: add options to add_unseen_recent_objects_to_traversal builtin/pack-objects.c: --cruft without expiration builtin/pack-objects.c: return from create_object_entry() t/helper: add 'pack-mtimes' test-tool pack-mtimes: support writing pack .mtimes files chunk-format.h: extract oid_version() pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' pack-mtimes: support reading .mtimes files Documentation/technical: add cruft-packs.txt
2022-06-03Merge branch 'ds/object-file-unpack-loose-header-fix'Junio C Hamano1-6/+2
Coding style fix. * ds/object-file-unpack-loose-header-fix: object-file: convert 'switch' back to 'if'
2022-06-03Merge branch 'ns/batch-fsync'Junio C Hamano1-1/+6
Introduce a filesystem-dependent mechanism to optimize the way the bits for many loose object files are ensured to hit the disk platter. * ns/batch-fsync: core.fsyncmethod: performance tests for batch mode t/perf: add iteration setup mechanism to perf-lib core.fsyncmethod: tests for batch mode test-lib-functions: add parsing helpers for ls-files and ls-tree core.fsync: use batch mode and sync loose objects by default on Windows unpack-objects: use the bulk-checkin infrastructure update-index: use the bulk-checkin infrastructure builtin/add: add ODB transaction around add_files_to_cache cache-tree: use ODB transaction around writing a tree core.fsyncmethod: batched disk flushes for loose-objects bulk-checkin: rebrand plug/unplug APIs as 'odb transactions' bulk-checkin: rename 'state' variable and separate 'plugged' boolean
2022-06-02run-command API: rename "env_array" to "env"Ævar Arnfjörð Bjarmason1-1/+1
Start following-up on the rename mentioned in c7c4bdeccf3 (run-command API: remove "env" member, always use "env_array", 2021-11-25) of "env_array" to "env". The "env_array" name was picked in 19a583dc39e (run-command: add env_array, an optional argv_array for env, 2014-10-19) because "env" was taken. Let's not forever keep the oddity of "*_array" for this "struct strvec", but not for its "args" sibling. This commit is almost entirely made with a coccinelle rule[1]. The only manual change here is in run-command.h to rename the struct member itself and to change "env_array" to "env" in the CHILD_PROCESS_INIT initializer. The rest of this is all a result of applying [1]: * make contrib/coccinelle/run_command.cocci.patch * patch -p1 <contrib/coccinelle/run_command.cocci.patch * git add -u 1. cat contrib/coccinelle/run_command.pending.cocci @@ struct child_process E; @@ - E.env_array + E.env @@ struct child_process *E; @@ - E->env_array + E->env I've avoided changing any comments and derived variable names here, that will all be done in the next commit. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-26sha1-file.c: don't freshen cruft packsTaylor Blau1-0/+2
We don't bother to freshen objects stored in a cruft pack individually by updating the `.mtimes` file. This is because we can't portably `mmap` and write into the middle of a file (i.e., to update the mtime of just one object). Instead, we would have to rewrite the entire `.mtimes` file which may incur some wasted effort especially if there a lot of cruft objects and they are freshened infrequently. Instead, force the freshening code to avoid an optimizing write by writing out the object loose and letting it pick up a current mtime. This works because we prefer the mtime of the loose copy of an object when both a loose and packed one exist (whether or not the packed copy comes from a cruft pack or not). This could certainly do with a test and/or be included earlier in this series/PR, but I want to wait until after I have a chance to clean up the overly-repetitive nature of the cruft pack tests in general. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-26builtin/pack-objects.c: --cruft without expirationTaylor Blau1-1/+1
Teach `pack-objects` how to generate a cruft pack when no objects are dropped (i.e., `--cruft-expiration=never`). Later patches will teach `pack-objects` how to generate a cruft pack that prunes objects. When generating a cruft pack which does not prune objects, we want to collect all unreachable objects into a single pack (noting and updating their mtimes as we accumulate them). Ordinary use will pass the result of a `git repack -A` as a kept pack, so when this patch says "kept pack", readers should think "reachable objects". Generating a non-expiring cruft packs works as follows: - Callers provide a list of every pack they know about, and indicate which packs are about to be removed. - All packs which are going to be removed (we'll call these the redundant ones) are marked as kept in-core. Any packs the caller did not mention (but are known to the `pack-objects` process) are also marked as kept in-core. Packs not mentioned by the caller are assumed to be unknown to them, i.e., they entered the repository after the caller decided which packs should be kept and which should be discarded. Since we do not want to include objects in these "unknown" packs (because we don't know which of their objects are or aren't reachable), these are also marked as kept in-core. - Then, we enumerate all objects in the repository, and add them to our packing list if they do not appear in an in-core kept pack. This results in a new cruft pack which contains all known objects that aren't included in the kept packs. When the kept pack is the result of `git repack -A`, the resulting pack contains all unreachable objects. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-23Merge branch 'ab/valgrind-fixes'Junio C Hamano1-2/+6
A bit of test framework fixes with a few fixes to issues found by valgrind. * ab/valgrind-fixes: commit-graph.c: don't assume that stat() succeeds object-file: fix a unpack_loose_header() regression in 3b6a8db3b03 log test: skip a failing mkstemp() test under valgrind tests: using custom GIT_EXEC_PATH breaks --valgrind tests
2022-05-20Merge branch 'ep/maint-equals-null-cocci'Junio C Hamano1-1/+1
Introduce and apply coccinelle rule to discourage an explicit comparison between a pointer and NULL, and applies the clean-up to the maintenance track. * ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci
2022-05-16object-file: convert 'switch' back to 'if'Derrick Stolee1-6/+2
This switch statement was recently added to make it clear that unpack_loose_header() returns an enum value, not an int. This adds complications for future developers if that enum gains new values, since that developer would need to add a case statement to this switch for little real value. Instead, we can revert back to an 'if' statement, but make the enum explicit by using "!= ULHR_OK" instead of assuming it has the numerical value zero. Co-authored-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-12object-file: fix a unpack_loose_header() regression in 3b6a8db3b03Ævar Arnfjörð Bjarmason1-2/+6
Fix a regression in my 3b6a8db3b03 (object-file.c: use "enum" return type for unpack_loose_header(), 2021-10-01) revealed both by running the test suite with --valgrind, and with the amended "git fsck" test. In practice this regression in v2.34.0 caused us to claim that we couldn't parse the header, as opposed to not being able to unpack it. Before the change in the C code the test_cmp added here would emit: -error: unable to unpack header of ./objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 +error: unable to parse header of ./objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 I.e. we'd proceed to call parse_loose_header() on the uninitialized "hdr" value, and it would have been very unlikely for that uninitialized memory to be a valid git object. The other callers of unpack_loose_header() were already checking the enum values exhaustively. See 3b6a8db3b03 and 5848fb11acd (object-file.c: return ULHR_TOO_LONG on "header too long", 2021-10-01). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-02Merge branch 'ep/maint-equals-null-cocci' for maint-2.35Junio C Hamano1-1/+1
* ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci
2022-05-02tree-wide: apply equals-null.cocciJunio C Hamano1-1/+1
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-06core.fsyncmethod: batched disk flushes for loose-objectsNeeraj Singh1-1/+6
When adding many objects to a repo with `core.fsync=loose-object`, the cost of fsync'ing each object file can become prohibitive. One major source of the cost of fsync is the implied flush of the hardware writeback cache within the disk drive. This commit introduces a new `core.fsyncMethod=batch` option that batches up hardware flushes. It hooks into the bulk-checkin odb-transaction functionality, takes advantage of tmp-objdir, and uses the writeout-only support code. When the new mode is enabled, we do the following for each new object: 1a. Create the object in a tmp-objdir. 2a. Issue a pagecache writeback request and wait for it to complete. At the end of the entire transaction when unplugging bulk checkin: 1b. Issue an fsync against a dummy file to flush the log and hardware writeback cache, which should by now have seen the tmp-objdir writes. 2b. Rename all of the tmp-objdir files to their final names. 3b. When updating the index and/or refs, we assume that Git will issue another fsync internal to that operation. This is not the default today, but the user now has the option of syncing the index and there is a separate patch series to implement syncing of refs. On a filesystem with a singular journal that is updated during name operations (e.g. create, link, rename, etc), such as NTFS, HFS+, or XFS we would expect the fsync to trigger a journal writeout so that this sequence is enough to ensure that the user's data is durable by the time the git command returns. This sequence also ensures that no object files appear in the main object store unless they are fsync-durable. Batch mode is only enabled if core.fsync includes loose-objects. If the legacy core.fsyncObjectFiles setting is enabled, but core.fsync does not include loose-objects, we will use file-by-file fsyncing. In step (1a) of the sequence, the tmp-objdir is created lazily to avoid work if no loose objects are ever added to the ODB. We use a tmp-objdir to maintain the invariant that no loose-objects are visible in the main ODB unless they are properly fsync-durable. This is important since future ODB operations that try to create an object with specific contents will silently drop the new data if an object with the target hash exists without checking that the loose-object contents match the hash. Only a full git-fsck would restore the ODB to a functional state where dataloss doesn't occur. In step (1b) of the sequence, we issue a fsync against a dummy file created specifically for the purpose. This method has a little higher cost than using one of the input object files, but makes adding new callers of this mechanism easier, since we don't need to figure out which object file is "last" or risk sharing violations by caching the fd of the last object file. _Performance numbers_: Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD. Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD. Windows - Same host as Linux, a preview version of Windows 11. Adding 500 files to the repo with 'git add' Times reported in seconds. object file syncing | Linux | Mac | Windows --------------------|-------|-------|-------- disabled | 0.06 | 0.35 | 0.61 fsync | 1.88 | 11.18 | 2.47 batch | 0.15 | 0.41 | 1.53 Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-30object-file: pass filename to fsync_or_dieNeeraj Singh1-4/+4
If we die while trying to fsync a loose object file, pass the actual filename we're trying to sync. This is likely to be more helpful for a user trying to diagnose the cause of the failure than the former 'loose object file' string. It also sidesteps any concerns about translating the die message differently for loose objects versus something else that has a real path. Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-29Merge branch 'ab/refs-various-fixes'Junio C Hamano1-4/+5
Code clean-up. * ab/refs-various-fixes: refs debug: add a wrapper for "read_symbolic_ref" packed-backend: remove stub BUG(...) functions misc *.c: use designated initializers for struct assignments refs: use designated initializers for "struct ref_iterator_vtable" refs: use designated initializers for "struct ref_storage_be"
2022-03-25Merge branch 'ns/core-fsyncmethod'Junio C Hamano1-4/+9
Replace core.fsyncObjectFiles with two new configuration variables, core.fsync and core.fsyncMethod. * ns/core-fsyncmethod: core.fsync: documentation and user-friendly aggregate options core.fsync: new option to harden the index core.fsync: add configuration parsing core.fsync: introduce granular fsync control infrastructure core.fsyncmethod: add writeout-only mode wrapper: make inclusion of Windows csprng header tightly scoped
2022-03-17misc *.c: use designated initializers for struct assignmentsÆvar Arnfjörð Bjarmason1-4/+5
Change a few miscellaneous non-designated initializer assignments to use designated initializers. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-16Merge branch 'ab/object-file-api-updates'Junio C Hamano1-52/+93
Object-file API shuffling. * ab/object-file-api-updates: object-file API: pass an enum to read_object_with_reference() object-file.c: add a literal version of write_object_file_prepare() object-file API: have hash_object_file() take "enum object_type" object API: rename hash_object_file_literally() to write_*() object-file API: split up and simplify check_object_signature() object API users + docs: check <0, not !0 with check_object_signature() object API docs: move check_object_signature() docs to cache.h object API: correct "buf" v.s. "map" mismatch in *.c and *.h object-file API: have write_object_file() take "enum object_type" object-file API: add a format_object_header() function object-file API: return "void", not "int" from hash_object_file() object-file.c: split up declaration of unrelated variables
2022-03-10core.fsync: introduce granular fsync control infrastructureNeeraj Singh1-4/+9
This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: pass an enum to read_object_with_reference()Ævar Arnfjörð Bjarmason1-3/+2
Change the read_object_with_reference() function to take an "enum object_type". It was not prepared to handle an arbitrary "const char *type", as it was itself calling type_from_string(). Let's change the only caller that passes in user data to use type_from_string(), and convert the rest to use e.g. "OBJ_TREE" instead of "tree_type". The "cat-file" caller is not on the codepath that handles"--allow-unknown", so the type_from_string() there is safe. Its use of type_from_string() doesn't functionally differ from that of the pre-image. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file.c: add a literal version of write_object_file_prepare()Ævar Arnfjörð Bjarmason1-10/+29
Split off a *_literally() variant of the write_object_file_prepare() function. To do this create a new "hash_object_body()" static helper. We now defer the type_name() call until the very last moment in format_object_header() for those callers that aren't "hash-object --literally". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: have hash_object_file() take "enum object_type"Ævar Arnfjörð Bjarmason1-14/+21
Change the hash_object_file() function to take an "enum object_type". Since a preceding commit all of its callers are passing either "{commit,tree,blob,tag}_type", or the result of a call to type_name(), the parse_object() caller that would pass NULL is now using stream_object_signature(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API: rename hash_object_file_literally() to write_*()Ævar Arnfjörð Bjarmason1-3/+3
Before 0c3db67cc81 (hash-object --literally: fix buffer overrun with extra-long object type, 2015-05-04) the hash-object code being changed here called write_sha1_file() to both hash and write a loose object. Before that we'd use hash_sha1_file() to if "-w" wasn't provided, and otherwise call write_sha1_file(). Now we'll always call the same function for both writing. Let's rename it from hash_*_literally() to write_*_literally(). Even though the write_*() might not actually write if HASH_WRITE_OBJECT isn't in "flags", having it be more similar to write_object_file_flags() than hash_object_file(), but carrying a name that would suggest that it's a variant of the latter is confusing. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: split up and simplify check_object_signature()Ævar Arnfjörð Bjarmason1-14/+18
Split up the check_object_signature() function into that non-streaming version (it accepts an already filled "buf"), and a new stream_object_signature() which will retrieve the object from storage, and hash it on-the-fly. All of the callers of check_object_signature() were effectively calling two different functions, if we go by cyclomatic complexity. I.e. they'd either take the early "if (map)" branch and return early, or not. This has been the case since the "if (map)" condition was added in 090ea12671b (parse_object: avoid putting whole blob in core, 2012-03-07). We can then further simplify the resulting check_object_signature() function since only one caller wanted to pass a non-NULL "buf" and a non-NULL "real_oidp". That "read_loose_object()" codepath used by "git fsck" can instead use hash_object_file() followed by oideq(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API users + docs: check <0, not !0 with check_object_signature()Ævar Arnfjörð Bjarmason1-1/+1
Change those users of the object API that misused check_object_signature() by assuming it returned any non-zero when the OID didn't match the expected value to check <0 instead. In practice all of this code worked before, but it wasn't consistent with rest of the users of the API. Let's also clarify what the <0 return value means in API docs. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API docs: move check_object_signature() docs to cache.hÆvar Arnfjörð Bjarmason1-6/+0
Move the API documentation for check_object_signature() to cache.h, where its prototype is declared. This is in preparation for adding a companion function. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API: correct "buf" v.s. "map" mismatch in *.c and *.hÆvar Arnfjörð Bjarmason1-5/+5
Change the name of the second argument to check_object_signature() to be "buf" in object-file.c, making it consistent with the prototype in cache.h This fixes an inconsistency that's been with us since 2ade9340262 (Add "check_sha1_signature()" helper function, 2005-04-08), and makes a subsequent commit's diff smaller, as we'll move these API docs to cache.h. While we're at it fix a small grammar error in the documentation, dropping an "an" before "in-core object-data". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: have write_object_file() take "enum object_type"Ævar Arnfjörð Bjarmason1-5/+5
Change the write_object_file() function to take an "enum object_type" instead of a "const char *type". Its callers either passed {commit,tree,blob,tag}_type and can pass the corresponding OBJ_* type instead, or were hardcoding strings like "blob". This avoids the back & forth fragility where the callers of write_object_file() would have the enum type, and convert it themselves via type_name(). We do have to now do that conversion ourselves before calling write_object_file_prepare(), but those codepaths will be similarly adjusted in subsequent commits. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: add a format_object_header() functionÆvar Arnfjörð Bjarmason1-3/+20
Add a convenience function to wrap the xsnprintf() command that generates loose object headers. This code was copy/pasted in various parts of the codebase, let's define it in one place and re-use it from there. All except one caller of it had a valid "enum object_type" for us, it's only write_object_file_prepare() which might need to deal with "git hash-object --literally" and a potential garbage type. Let's have the primary API use an "enum object_type", and define a *_literally() function that can take an arbitrary "const char *" for the type. See [1] for the discussion that prompted this patch, i.e. new code in object-file.c that wanted to copy/paste the xsnprintf() invocation. In the case of fast-import.c the callers unfortunately need to cast back & forth between "unsigned char *" and "char *", since format_object_header() ad encode_in_pack_object_header() take different signedness. 1. https://lore.kernel.org/git/211213.86bl1l9bfz.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: return "void", not "int" from hash_object_file()Ævar Arnfjörð Bjarmason1-8/+8
The hash_object_file() function added in abdc3fc8421 (Add hash_sha1_file(), 2006-10-14) did not have a meaningful return value, and it never has. One was seemingly added to avoid adding braces to the "ret = " assignments being modified here. Let's instead assign "0" to the "ret" variables at the beginning of the relevant functions, and have them return "void". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file.c: split up declaration of unrelated variablesÆvar Arnfjörð Bjarmason1-1/+2
Split up the declaration of the "ret" and "re_allocated" variables. It's not our usual style to group variable declarations simply because they share a type, we'd only prefer to do so when the two are closely related (e.g. "int i, j"). This change makes a subsequent and meaningful change's diff smaller. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-24object-file: use designated initializers for "struct git_hash_algo"Ævar Arnfjörð Bjarmason1-39/+39
As with the preceding commit, change another file-level struct assignment to use designated initializers. Retain the ".name = NULL" etc. in the case of the first element of "unknown hash algorithm", to make it explicit that we're intentionally not setting those, it's not just that we forgot. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-03Merge branch 'ns/tmp-objdir'Junio C Hamano1-2/+48
New interface into the tmp-objdir API to help in-core use of the quarantine feature. * ns/tmp-objdir: tmp-objdir: disable ref updates when replacing the primary odb tmp-objdir: new API for creating temporary writable databases
2021-12-15Merge branch 'ab/run-command'Junio C Hamano1-1/+1
API clean-up. * ab/run-command: run-command API: remove "env" member, always use "env_array" difftool: use "env_array" to simplify memory management run-command API: remove "argv" member, always use "args" run-command API users: use strvec_push(), not argv construction run-command API users: use strvec_pushl(), not argv construction run-command tests: use strvec_pushv(), not argv assignment run-command API users: use strvec_pushv(), not argv assignment upload-archive: use regular "struct child_process" pattern worktree: stop being overly intimate with run_command() internals
2021-12-15Merge branch 'hn/reftable'Junio C Hamano1-5/+2
The "reftable" backend for the refs API, without integrating into the refs subsystem, has been added. * hn/reftable: Add "test-tool dump-reftable" command. reftable: add dump utility reftable: implement stack, a mutable database of reftable files. reftable: implement refname validation reftable: add merged table view reftable: add a heap-based priority queue for reftable records reftable: reftable file level tests reftable: read reftable files reftable: generic interface to tables reftable: write reftable files reftable: a generic binary tree implementation reftable: reading/writing blocks Provide zlib's uncompress2 from compat/zlib-compat.c reftable: (de)serialization for the polymorphic record type. reftable: add blocksource, an abstraction for random access reads reftable: utility functions reftable: add error related functionality reftable: add LICENSE hash.h: provide constants for the hash IDs
2021-12-10Merge branch 'po/size-t-for-vs'Junio C Hamano1-1/+1
On platforms where ulong is shorter than size_t, code paths that shifted 1 or 1U to the left lacked the necessary cast to size_t, which have been corrected. * po/size-t-for-vs: object-file.c: LLP64 compatibility, upcast unity for left shift diffcore-delta.c: LLP64 compatibility, upcast unity for left shift repack.c: LLP64 compatibility, upcast unity for left shift
2021-12-08tmp-objdir: disable ref updates when replacing the primary odbNeeraj Singh1-0/+6
When creating a subprocess with a temporary ODB, we set the GIT_QUARANTINE_ENVIRONMENT env var to tell child Git processes not to update refs, since the tmp-objdir may go away. Introduce a similar mechanism for in-process temporary ODBs when we call tmp_objdir_replace_primary_odb. Now both mechanisms set the disable_ref_updates flag on the odb, which is queried by the ref_transaction_prepare function. Peff's test case [1] was invoking ref updates via the cachetextconv setting. That particular code silently does nothing when a ref update is forbidden. See the call to notes_cache_put in fill_textconv where errors are ignored. [1] https://lore.kernel.org/git/YVOn3hDsb5pnxR53@coredump.intra.peff.net/ Reported-by: Jeff King <peff@peff.net> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-08tmp-objdir: new API for creating temporary writable databasesNeeraj Singh1-2/+42
The tmp_objdir API provides the ability to create temporary object directories, but was designed with the goal of having subprocesses access these object stores, followed by the main process migrating objects from it to the main object store or just deleting it. The subprocesses would view it as their primary datastore and write to it. Here we add the tmp_objdir_replace_primary_odb function that replaces the current process's writable "main" object directory with the specified one. The previous main object directory is restored in either tmp_objdir_migrate or tmp_objdir_destroy. For the --remerge-diff usecase, add a new `will_destroy` flag in `struct object_database` to mark ephemeral object databases that do not require fsync durability. Add 'git prune' support for removing temporary object databases, and make sure that they have a name starting with tmp_ and containing an operation-specific name. Based-on-patch-by: Elijah Newren <newren@gmail.com> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-01object-file.c: LLP64 compatibility, upcast unity for left shiftPhilip Oakley1-1/+1
Visual Studio reports C4334 "was 64-bit shift intended" warning because of size miss-match. Promote unity to the matching type to fit with the assignment. Signed-off-by: Philip Oakley <philipoakley@iee.email> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-29Merge branch 'mc/clean-smudge-with-llp64'Junio C Hamano1-3/+3
The clean/smudge conversion code path has been prepared to better work on platforms where ulong is narrower than size_t. * mc/clean-smudge-with-llp64: clean/smudge: allow clean filters to process extremely large files odb: guard against data loss checking out a huge file git-compat-util: introduce more size_t helpers odb: teach read_blob_entry to use size_t t1051: introduce a smudge filter test for extremely large files test-lib: add prerequisite for 64-bit platforms test-tool genzeros: generate large amounts of data more efficiently test-genzeros: allow more than 2G zeros in Windows
2021-11-25run-command API: remove "env" member, always use "env_array"Ævar Arnfjörð Bjarmason1-1/+1
Remove the "env" member from "struct child_process" in favor of always using the "env_array". As with the preceding removal of "argv" in favor of "args" this gets rid of current and future oddities around memory management at the API boundary (see the amended API docs). For some of the conversions we can replace patterns like: child.env = env->v; With: strvec_pushv(&child.env_array, env->v); But for others we need to guard the strvec_pushv() with a NULL check, since we're not passing in the "v" member of a "struct strvec", e.g. in the case of tmp_objdir_env()'s return value. Ideally we'd rename the "env_array" member to simply "env" as a follow-up, since it and "args" are now inconsistent in not having an "_array" suffix, and seemingly without any good reason, unless we look at the history of how they came to be. But as we've currently got 122 in-tree hits for a "git grep env_array" let's leave that for now (and possibly forever). Doing that rename would be too disruptive. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-12Merge branch 'ab/fsck-unexpected-type'Junio C Hamano1-3/+2
Regression fix. * ab/fsck-unexpected-type: object-file: free(*contents) only in read_loose_object() caller object-file: fix SEGV on free() regression in v2.34.0-rc2
2021-11-11object-file: free(*contents) only in read_loose_object() callerÆvar Arnfjörð Bjarmason1-5/+2
In the preceding commit a free() of uninitialized memory regression in 96e41f58fe1 (fsck: report invalid object type-path combinations, 2021-10-01) was fixed, but we'd still have an issue with leaking memory from fsck_loose(). Let's fix that issue too. That issue was introduced in my 31deb28f5e0 (fsck: don't hard die on invalid object types, 2021-10-01). It can be reproduced under SANITIZE=leak with the test I added in 093fffdfbec (fsck tests: add test for fsck-ing an unknown type, 2021-10-01): ./t1450-fsck.sh --run=84 -vixd In some sense it's not a problem, we lost the same amount of memory in terms of things malloc'd and not free'd. It just moved from the "still reachable" to "definitely lost" column in valgrind(1) nomenclature[1], since we'd have die()'d before. But now that we don't hard die() anymore in the library let's properly free() it. Doing so makes this code much easier to follow, since we'll now have one function owning the freeing of the "contents" variable, not two. For context on that memory management pattern the read_loose_object() function was added in f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13) and subsequently used in c68b489e564 (fsck: parse loose object paths directly, 2017-01-13). The pattern of it being the task of both sides to free() the memory has been there in this form since its inception. 1. https://valgrind.org/docs/manual/mc-manual.html#mc-manual.leaks Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-11object-file: fix SEGV on free() regression in v2.34.0-rc2Ævar Arnfjörð Bjarmason1-0/+2
Fix a regression introduced in my 96e41f58fe1 (fsck: report invalid object type-path combinations, 2021-10-01). When fsck-ing blobs larger than core.bigFileThreshold, we'd free() a pointer to uninitialized memory. This issue would have been caught by SANITIZE=address, but since it involves core.bigFileThreshold, none of the existing tests in our test suite covered it. Running them with the "big_file_threshold" in "environment.c" changed to say "6" would have shown this failure, but let's add a dedicated test for this scenario based on Han Xin's report[1]. The bug was introduced between v9 and v10[2] of the fsck series merged in 061a21d36d8 (Merge branch 'ab/fsck-unexpected-type', 2021-10-25). 1. https://lore.kernel.org/git/20211111030302.75694-1-hanxin.hx@alibaba-inc.com/ 2. https://lore.kernel.org/git/cover-v10-00.17-00000000000-20211001T091051Z-avarab@gmail.com/ Reported-by: Han Xin <chiyutianyi@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-03odb: guard against data loss checking out a huge fileMatt Cooper1-3/+3
This introduces an additional guard for platforms where `unsigned long` and `size_t` are not of the same size. If the size of an object in the database would overflow `unsigned long`, instead we now exit with an error. A complete fix will have to update _many_ other functions throughout the codebase to use `size_t` instead of `unsigned long`. It will have to be implemented at some stage. This commit puts in a stop-gap for the time being. Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-25Merge branch 'ab/fix-commit-error-message-upon-unwritable-object-store'Junio C Hamano1-8/+12
"git commit" gave duplicated error message when the object store was unwritable, which has been corrected. * ab/fix-commit-error-message-upon-unwritable-object-store: commit: fix duplication regression in permission error output unwritable tests: assert exact error output
2021-10-25Merge branch 'jt/no-abuse-alternate-odb-for-submodules'Junio C Hamano1-1/+8
Follow through the work to use the repo interface to access submodule objects in-process, instead of abusing the alternate object database interface. * jt/no-abuse-alternate-odb-for-submodules: submodule: trace adding submodule ODB as alternate submodule: pass repo to check_has_commit() object-file: only register submodule ODB if needed merge-{ort,recursive}: remove add_submodule_odb() refs: peeling non-the_repository iterators is BUG refs: teach arbitrary repo support to iterators refs: plumb repo into ref stores
2021-10-25Merge branch 'ab/fsck-unexpected-type'Junio C Hamano1-99/+77
"git fsck" has been taught to report mismatch between expected and actual types of an object better. * ab/fsck-unexpected-type: fsck: report invalid object type-path combinations fsck: don't hard die on invalid object types object-file.c: stop dying in parse_loose_header() object-file.c: return ULHR_TOO_LONG on "header too long" object-file.c: use "enum" return type for unpack_loose_header() object-file.c: simplify unpack_loose_short_header() object-file.c: make parse_loose_header_extended() public object-file.c: return -1, not "status" from unpack_loose_header() object-file.c: don't set "typep" when returning non-zero cat-file tests: test for current --allow-unknown-type behavior cat-file tests: add corrupt loose object test cat-file tests: test for missing/bogus object with -t, -s and -p cat-file tests: move bogus_* variable declarations earlier fsck tests: test for garbage appended to a loose object fsck tests: test current hash/type mismatch behavior fsck tests: refactor one test to use a sub-repo fsck tests: add test for fsck-ing an unknown type
2021-10-12commit: fix duplication regression in permission error outputÆvar Arnfjörð Bjarmason1-8/+12
Fix a regression in the error output emitted when .git/objects can't be written to. Before 9c4d6c0297b (cache-tree: Write updated cache-tree after commit, 2014-07-13) we'd emit only one "insufficient permission" error, now we'll do so again. The cause is rather straightforward, we've got WRITE_TREE_SILENT for the use-case of wanting to prepare an index silently, quieting any permission etc. error output. Then when we attempt to update to that (possibly broken) index we'll run into the same errors again. But with 9c4d6c0297b the gap between the cache-tree API and the object store wasn't closed in terms of asking write_object_file() to be silent. I.e. post-9c4d6c0297b the first call is to prepare_index(), and after that we'll call prepare_to_commit(). We only want verbose error output from the latter. So let's add and use that facility with a corresponding HASH_SILENT flag, its only user is cache-tree.c's update_one(), which will set it if its "WRITE_TREE_SILENT" flag is set. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-08object-file: only register submodule ODB if neededJonathan Tan1-1/+8
In a35e03dee0 ("submodule: lazily add submodule ODBs as alternates", 2021-09-08), Git was taught to add all known submodule ODBs as alternates when attempting to read an object that doesn't exist, as a fallback for when a submodule object is read as if it were in the_repository. However, this behavior wasn't restricted to happen only when reading from the_repository. Fix this. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-03Merge branch 'hn/refs-errno-cleanup'Junio C Hamano1-68/+0
Futz with the way 'errno' is relied on in the refs API to carry the failure modes up the call chain. * hn/refs-errno-cleanup: refs: make errno output explicit for read_raw_ref_fn refs/files-backend: stop setting errno from lock_ref_oid_basic refs: remove EINVAL errno output from specification of read_raw_ref_fn refs file backend: move raceproof_create_file() here
2021-10-01fsck: report invalid object type-path combinationsÆvar Arnfjörð Bjarmason1-11/+10
Improve the error that's emitted in cases where we find a loose object we parse, but which isn't at the location we expect it to be. Before this change we'd prefix the error with a not-a-OID derived from the path at which the object was found, due to an emergent behavior in how we'd end up with an "OID" in these codepaths. Now we'll instead say what object we hashed, and what path it was found at. Before this patch series e.g.: $ git hash-object --stdin -w -t blob </dev/null e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ mv objects/e6/ objects/e7 Would emit ("[...]" used to abbreviate the OIDs): git fsck error: hash mismatch for ./objects/e7/9d[...] (expected e79d[...]) error: e79d[...]: object corrupt or missing: ./objects/e7/9d[...] Now we'll instead emit: error: e69d[...]: hash-path mismatch, found at: ./objects/e7/9d[...] Furthermore, we'll do the right thing when the object type and its location are bad. I.e. this case: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ mv objects/83 objects/84 As noted in an earlier commits we'd simply die early in those cases, until preceding commits fixed the hard die on invalid object type: $ git fsck fatal: invalid object type Now we'll instead emit sensible error messages: $ git fsck error: 8315[...]: hash-path mismatch, found at: ./objects/84/15[...] error: 8315[...]: object is of unknown type 'garbage': ./objects/84/15[...] In both fsck.c and object-file.c we're using null_oid as a sentinel value for checking whether we got far enough to be certain that the issue was indeed this OID mismatch. We need to add the "object corrupt or missing" special-case to deal with cases where read_loose_object() will return an error before completing check_object_signature(), e.g. if we have an error in unpack_loose_rest() because we find garbage after the valid gzip content: $ git hash-object --stdin -w -t blob </dev/null e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ chmod 755 objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ echo garbage >>objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ git fsck error: garbage at end of loose object 'e69d[...]' error: unable to unpack contents of ./objects/e6/9d[...] error: e69d[...]: object corrupt or missing: ./objects/e6/9d[...] There is currently some weird messaging in the edge case when the two are combined, i.e. because we're not explicitly passing along an error state about this specific scenario from check_stream_oid() via read_loose_object() we'll end up printing the null OID if an object is of an unknown type *and* it can't be unpacked by zlib, e.g.: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ chmod 755 objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ echo garbage >>objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ /usr/bin/git fsck fatal: invalid object type $ ~/g/git/git fsck error: garbage at end of loose object '8315a83d2acc4c174aed59430f9a9c4ed926440f' error: unable to unpack contents of ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 8315a83d2acc4c174aed59430f9a9c4ed926440f: object corrupt or missing: ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 0000000000000000000000000000000000000000: object is of unknown type 'garbage': ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f [...] I think it's OK to leave that for future improvements, which would involve enum-ifying more error state as we've done with "enum unpack_loose_header_result" in preceding commits. In these increasingly more obscure cases the worst that can happen is that we'll get slightly nonsensical or inapplicable error messages. There's other such potential edge cases, all of which might produce some confusing messaging, but still be handled correctly as far as passing along errors goes. E.g. if check_object_signature() returns and oideq(real_oid, null_oid()) is true, which could happen if it returns -1 due to the read_istream() call having failed. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01fsck: don't hard die on invalid object typesÆvar Arnfjörð Bjarmason1-12/+6
Change the error fsck emits on invalid object types, such as: $ git hash-object --stdin -w -t garbage --literally </dev/null <OID> From the very ungraceful error of: $ git fsck fatal: invalid object type $ To: $ git fsck error: <OID>: object is of unknown type 'garbage': <OID_PATH> [ other fsck output ] We'll still exit with non-zero, but now we'll finish the rest of the traversal. The tests that's being added here asserts that we'll still complain about other fsck issues (e.g. an unrelated dangling blob). To do this we need to pass down the "OBJECT_INFO_ALLOW_UNKNOWN_TYPE" flag from read_loose_object() through to parse_loose_header(). Since the read_loose_object() function is only used in builtin/fsck.c we can simply change it to accept a "struct object_info" (which contains the OBJECT_INFO_ALLOW_UNKNOWN_TYPE in its flags). See f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13) for the introduction of read_loose_object(). Since we'll need a "struct strbuf" to hold the "type_name" let's pass it to the for_each_loose_file_in_objdir() callback to avoid allocating a new one for each loose object in the iteration. It also makes the memory management simpler than sticking it in fsck_loose() itself, as we'll only need to strbuf_reset() it, with no need to do a strbuf_release() before each "return". Before this commit we'd never check the "type" if read_loose_object() failed, but now we do. We therefore need to initialize it to OBJ_NONE to be able to tell the difference between e.g. its unpack_loose_header() having failed, and us getting past that and into parse_loose_header(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: stop dying in parse_loose_header()Ævar Arnfjörð Bjarmason1-34/+33
Make parse_loose_header() return error codes and data instead of invoking die() by itself. For now we'll move the relevant die() call to loose_object_info() and read_loose_object() to keep this change smaller. In a subsequent commit we'll make read_loose_object() return an error code instead of dying. We should also address the "allow_unknown" case (should be moved to builtin/cat-file.c), but for now I'll be leaving it. For making parse_loose_header() not die() change its prototype to accept a "struct object_info *" instead of the "unsigned long *sizep" it accepted before. Its callers can now check the populated populated "oi->typep". Because of this we don't need to pass in the "unsigned int flags" which we used for OBJECT_INFO_ALLOW_UNKNOWN_TYPE, we can instead do that check in loose_object_info(). This also refactors some confusing control flow around the "status" variable. In some cases we set it to the return value of "error()", i.e. -1, and later checked if "status < 0" was true. Since 93cff9a978e (sha1_loose_object_info: return error for corrupted objects, 2017-04-01) the return value of loose_object_info() (then named sha1_loose_object_info()) had been a "status" variable that be any negative value, as we were expecting to return the "enum object_type". The only negative type happens to be OBJ_BAD, but the code still assumed that more might be added. This was then used later in e.g. c84a1f3ed4d (sha1_file: refactor read_object, 2017-06-21). Now that parse_loose_header() will return 0 on success instead of the type (which it'll stick into the "struct object_info") we don't need to conflate these two cases in its callers. Since parse_loose_header() doesn't need to return an arbitrary "status" we only need to treat its "ret < 0" specially, but can idiomatically overwrite it with our own error() return. This along with having made unpack_loose_header() return an "enum unpack_loose_header_result" in an earlier commit means that we can move the previously nested if/else cases mostly into the "ULHR_OK" branch of the "switch" statement. We should be less silent if we reach that "status = -1" branch, which happens if we've got trailing garbage in loose objects, see f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13) for a better way to handle it. For now let's punt on it, a subsequent commit will address that edge case. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: return ULHR_TOO_LONG on "header too long"Ævar Arnfjörð Bjarmason1-2/+6
Split up the return code for "header too long" from the generic negative return value unpack_loose_header() returns, and report via error() if we exceed MAX_HEADER_LEN. As a test added earlier in this series in t1006-cat-file.sh shows we'll correctly emit zlib errors from zlib.c already in this case, so we have no need to carry those return codes further down the stack. Let's instead just return ULHR_TOO_LONG saying we ran into the MAX_HEADER_LEN limit, or other negative values for "unable to unpack <OID> header". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: use "enum" return type for unpack_loose_header()Ævar Arnfjörð Bjarmason1-13/+21
In a preceding commit we changed and documented unpack_loose_header() from its previous behavior of returning any negative value or zero, to only -1 or 0. Let's add an "enum unpack_loose_header_result" type and use it for these return values, and have the compiler assert that we're exhaustively covering all of them. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: simplify unpack_loose_short_header()Ævar Arnfjörð Bjarmason1-38/+20
Combine the unpack_loose_short_header(), unpack_loose_header_to_strbuf() and unpack_loose_header() functions into one. The unpack_loose_header_to_strbuf() function was added in 46f034483eb (sha1_file: support reading from a loose object of unknown type, 2015-05-03). Its code was mostly copy/pasted between it and both of unpack_loose_header() and unpack_loose_short_header(). We now have a single unpack_loose_header() function which accepts an optional "struct strbuf *" instead. I think the remaining unpack_loose_header() function could be further simplified, we're carrying some complexity just to be able to emit a garbage type longer than MAX_HEADER_LEN, we could alternatively just say "we found a garbage type <first 32 bytes>..." instead. But let's leave the current behavior in place for now. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: make parse_loose_header_extended() publicÆvar Arnfjörð Bjarmason1-13/+7
Make the parse_loose_header_extended() function public and remove the parse_loose_header() wrapper. The only direct user of it outside of object-file.c itself was in streaming.c, that caller can simply pass the required "struct object-info *" instead. This change is being done in preparation for teaching read_loose_object() to accept a flag to pass to parse_loose_header(). It isn't strictly necessary for that change, we could simply use parse_loose_header_extended() there, but will leave the API in a better end state. It would be a better end-state to have already moved the declaration of these functions to object-store.h to avoid the forward declaration of "struct object_info" in cache.h, but let's leave that cleanup for some other time. 1. https://lore.kernel.org/git/patch-v6-09.22-5b9278e7bb4-20210907T104559Z-avarab@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: return -1, not "status" from unpack_loose_header()Ævar Arnfjörð Bjarmason1-1/+1
Return a -1 when git_inflate() fails instead of whatever Z_* status we'd get from zlib.c. This makes no difference to any error we report, but makes it more obvious that we don't care about the specific zlib error codes here. See d21f8426907 (unpack_sha1_header(): detect malformed object header, 2016-09-25) for the commit that added the "return status" code. As far as I can tell there was never a real reason (e.g. different reporting) for carrying down the "status" as opposed to "-1". At the time that d21f8426907 was written there was a corresponding "ret < Z_OK" check right after the unpack_sha1_header() call (the "unpack_sha1_header()" function was later rename to our current "unpack_loose_header()"). However, that check was removed in c84a1f3ed4d (sha1_file: refactor read_object, 2017-06-21) without changing the corresponding return code. So let's do the minor cleanup of also changing this function to return a -1. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: don't set "typep" when returning non-zeroÆvar Arnfjörð Bjarmason1-2/+0
When the loose_object_info() function returns an error stop faking up the "oi->typep" to OBJ_BAD. Let the return value of the function itself suffice. This code cleanup simplifies subsequent changes. That we set this at all is a relic from the past. Before 052fe5eaca9 (sha1_loose_object_info: make type lookup optional, 2013-07-12) we would always return the type_from_string(type) via the parse_sha1_header() function, or -1 (i.e. OBJ_BAD) if we couldn't parse it. Then in a combination of 46f034483eb (sha1_file: support reading from a loose object of unknown type, 2015-05-03) and b3ea7dd32d6 (sha1_loose_object_info: handle errors from unpack_sha1_rest, 2017-10-05) our API drifted even further towards conflating the two again. Having read the code paths involved carefully I think this is OK. We are just about to return -1, and we have only one caller: do_oid_object_info_extended(). That function will in turn go on to return -1 when we return -1 here. This might be introducing a subtle bug where a caller of oid_object_info_extended() would inspect its "typep" and expect a meaningful value if the function returned -1. Such a problem would not occur for its simpler oid_object_info() sister function. That one always returns the "enum object_type", which in the case of -1 would be the OBJ_BAD. Having read the code for all the callers of these functions I don't believe any such bug is being introduced here, and in any case we'd likely already have such a bug for the "sizep" member (although blindly checking "typep" first would be a more common case). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-23Merge branch 'rs/packfile-bad-object-list-in-oidset'Junio C Hamano1-2/+2
Replace a handcrafted data structure used to keep track of bad objects in the packfile API by an oidset. * rs/packfile-bad-object-list-in-oidset: packfile: use oidset for bad objects packfile: convert has_packed_and_bad() to object_id packfile: convert mark_bad_packed_object() to object_id midx: inline nth_midxed_pack_entry() oidset: make oidset_size() an inline function
2021-09-20Merge branch 'jt/grep-wo-submodule-odb-as-alternate'Junio C Hamano1-0/+5
The code to make "git grep" recurse into submodules has been updated to migrate away from the "add submodule's object store as an alternate object store" mechanism (which is suboptimal). * jt/grep-wo-submodule-odb-as-alternate: t7814: show lack of alternate ODB-adding submodule-config: pass repo upon blob config read grep: add repository to OID grep sources grep: allocate subrepos on heap grep: read submodule entry with explicit repo grep: typesafe versions of grep_source_init grep: use submodule-ODB-as-alternate lazy-addition submodule: lazily add submodule ODBs as alternates
2021-09-12packfile: convert has_packed_and_bad() to object_idRené Scharfe1-1/+1
The single caller has a full object ID, so pass it on instead of just its hash member. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12packfile: convert mark_bad_packed_object() to object_idRené Scharfe1-1/+1
All callers have full object IDs, so pass them on instead of just their hash member. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08submodule: lazily add submodule ODBs as alternatesJonathan Tan1-0/+5
Teach Git to add submodule ODBs as alternates to the object store of the_repository only upon the first access of an object not in the_repository, and not when add_submodule_odb() is called. This provides a means of gradually migrating from accessing a submodule's object through alternates to accessing a submodule's object by explicitly passing its repository object. Any Git command can declare that it might access submodule objects by calling add_submodule_odb() (as they do now), but the submodule ODBs themselves will not be added until needed, so individual commands and/or combinations of arguments can be migrated one by one. [The advantage of explicit repository-object passing is code clarity (it is clear which repository an object read is from), performance (there is no need to linearly search through all submodule ODBs whenever an object is accessed from any repository, whether superproject or submodule), and the possibility of future features like partial clone submodules (which right now is not possible because if an object is missing, we do not know which repository to lazy-fetch into).] This commit also introduces an environment variable that a test may set to make the actual registration of alternates fatal, in order to demonstrate that its codepaths do not need this registration. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Emily Shaffer <emilyshaffer@google.com> Reviewed-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07hash.h: provide constants for the hash IDsHan-Wen Nienhuys1-5/+2
This will simplify referencing them from code that is not deeply integrated with Git, in particular, the reftable library. Signed-off-by: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: avoid opening multiple MIDXs when writingTaylor Blau1-0/+21
Opening multiple instance of the same MIDX can lead to problems like two separate packed_git structures which represent the same pack being added to the repository's object store. The above scenario can happen because prepare_midx_pack() checks if `m->packs[pack_int_id]` is NULL in order to determine if a pack has been opened and installed in the repository before. But a caller can construct two copies of the same MIDX by calling get_multi_pack_index() and load_multi_pack_index() since the former manipulates the object store directly but the latter is a lower-level routine which allocates a new MIDX for each call. So if prepare_midx_pack() is called on multiple MIDXs with the same pack_int_id, then that pack will be installed twice in the object store's packed_git pointer. This can lead to problems in, for e.g., the pack-bitmap code, which does something like the following (in pack-bitmap.c:open_pack_bitmap()): struct bitmap_index *bitmap_git = ...; for (p = get_all_packs(r); p; p = p->next) { if (open_pack_bitmap_1(bitmap_git, p) == 0) ret = 0; } which is a problem if two copies of the same pack exist in the packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a conditional like the following: if (bitmap_git->pack || bitmap_git->midx) { /* ignore extra bitmap file; we can only handle one */ warning("ignoring extra bitmap file: %s", packfile->pack_name); close(fd); return -1; } Avoid this scenario by not letting write_midx_internal() open a MIDX that isn't also pointed at by the object store. So long as this is the case, other routines should prefer to open MIDXs with get_multi_pack_index() or reprepare_packed_git() instead of creating instances on their own. Because get_multi_pack_index() returns `r->object_store->multi_pack_index` if it is non-NULL, we'll only have one instance of a MIDX open at one time, avoiding these problems. To encourage this, drop the `struct multi_pack_index *` parameter from `write_midx_internal()`, and rely instead on the `object_dir` to find (or initialize) the correct MIDX instance. Likewise, replace the call to `close_midx()` with `close_object_store()`, since we're about to replace the MIDX with a new one and should invalidate the object store's memory of any MIDX that might have existed beforehand. Note that this now forbids passing object directories that don't belong to alternate repositories over `--object-dir`, since before we would have happily opened a MIDX in any directory, but now restrict ourselves to only those reachable by `r->objects->multi_pack_index` (and alternate MIDXs that we can see by walking the `next` pointer). As far as I can tell, supporting arbitrary directories with `--object-dir` was a historical accident, since even the documentation says `<alt>` when referring to the value passed to this option. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-25refs file backend: move raceproof_create_file() hereÆvar Arnfjörð Bjarmason1-68/+0
Move the raceproof_create_file() API added to cache.h and object-file.c in 177978f56ad (raceproof_create_file(): new function, 2017-01-06) to its only user, refs/files-backend.c. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-11Merge branch 'cb/many-alternate-optim-fixup'Junio C Hamano1-1/+1
Build fix. * cb/many-alternate-optim-fixup: object-file: use unsigned arithmetic with bit mask object-store: avoid extra ';' from KHASH_INIT oidtree: avoid nested struct oidtree_node
2021-08-11object-file: use unsigned arithmetic with bit maskRené Scharfe1-1/+1
33f379eee6 (make object_directory.loose_objects_subdir_seen a bitmap, 2021-07-07) replaced a wasteful 256-byte array with a 32-byte array and bit operations. The mask calculation shifts a literal 1 of type int left by anything between 0 and 31. UndefinedBehaviorSanitizer doesn't like that and reports: object-file.c:2477:18: runtime error: left shift of 1 by 31 places cannot be represented in type 'int' Make sure to use an unsigned 1 instead to avoid the issue. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-28Merge branch 'ew/many-alternate-optim'Junio C Hamano1-30/+45
Optimization for repositories with many alternate object store. * ew/many-alternate-optim: oidtree: a crit-bit tree for odb_loose_cache oidcpy_with_padding: constify `src' arg make object_directory.loose_objects_subdir_seen a bitmap avoid strlen via strbuf_addstr in link_alt_odb_entry speed up alt_odb_usable() with many alternates
2021-07-16Merge branch 'jt/partial-clone-submodule-1'Junio C Hamano1-5/+2
Prepare the internals for lazily fetching objects in submodules from their promisor remotes. * jt/partial-clone-submodule-1: promisor-remote: teach lazy-fetch in any repo run-command: refactor subprocess env preparation submodule: refrain from filtering GIT_CONFIG_COUNT promisor-remote: support per-repository config repository: move global r_f_p_c to repo struct
2021-07-07oidtree: a crit-bit tree for odb_loose_cacheEric Wong1-11/+12
This saves 8K per `struct object_directory', meaning it saves around 800MB in my case involving 100K alternates (half or more of those alternates are unlikely to hold loose objects). This is implemented in two parts: a generic, allocation-free `cbtree' and the `oidtree' wrapper on top of it. The latter provides allocation using alloc_state as a memory pool to improve locality and reduce free(3) overhead. Unlike oid-array, the crit-bit tree does not require sorting. Performance is bound by the key length, for oidtree that is fixed at sizeof(struct object_id). There's no need to have 256 oidtrees to mitigate the O(n log n) overhead like we did with oid-array. Being a prefix trie, it is natively suited for expanding short object IDs via prefix-limited iteration in `find_short_object_filename'. On my busy workstation, p4205 performance seems to be roughly unchanged (+/-8%). Startup with 100K total alternates with no loose objects seems around 10-20% faster on a hot cache. (800MB in memory savings means more memory for the kernel FS cache). The generic cbtree implementation does impose some extra overhead for oidtree in that it uses memcmp(3) on "struct object_id" so it wastes cycles comparing 12 extra bytes on SHA-1 repositories. I've not yet explored reducing this overhead, but I expect there are many places in our code base where we'd want to investigate this. More information on crit-bit trees: https://cr.yp.to/critbit.html Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-07make object_directory.loose_objects_subdir_seen a bitmapEric Wong1-3/+8
There's no point in using 8 bits per-directory when 1 bit will do. This saves us 224 bytes per object directory, which ends up being 22MB when dealing with 100K alternates. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>