aboutsummaryrefslogtreecommitdiffstats
path: root/sparse-index.c
AgeCommit message (Collapse)AuthorFilesLines
2025-10-20sparse-index: improve advice message instructionsDerrick Stolee1-2/+3
When an on-disk sparse index is expanded to a full one, it could be due to some worktree state that requires looking at file entries hidden within sparse tree entries. This can be avoided if the worktree is cleaned up and some other issues related to the index state are resolved. Expand the advice message to include all of these cases, since 'git sparse-checkout clean' is not currently capable of handling all cases. In the future, we may improve the behavior of 'git sparse-checkout clean' to handle all of the cases. Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-15sparse-index: point users to new 'clean' actionDerrick Stolee1-1/+2
In my experience, the most-common reason that the sparse index must expand to a full one is because there is some leftover file in a tracked directory that is now outside of the sparse-checkout. The new 'git sparse-checkout clean' command will find and delete these directories, so point users to it when they hit the sparse index expansion advice. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06global: mark code units that generate warnings with `-Wsign-compare`Patrick Steinhardt1-0/+1
Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-04sparse-index: correctly free EWAH contentsPatrick Steinhardt1-2/+5
While we free the `fsmonitor_dirty` member of `struct index_state`, we do not free the contents of that EWAH. Do so by using `ewah_free()` instead of `FREE_AND_NULL()`. This leak is exposed by t7519, but plugging it alone does not make the test suite pass. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02Merge branch 'ds/sparse-checkout-expansion-advice'Junio C Hamano1-2/+3
When "git sparse-checkout disable" turns a sparse checkout into a regular checkout, the index is fully expanded. This totally expected behaviour however had an "oops, we are expanding the index" advice message, which has been corrected. * ds/sparse-checkout-expansion-advice: sparse-checkout: disable advice in 'disable'
2024-09-23sparse-checkout: disable advice in 'disable'Derrick Stolee1-2/+3
When running 'git sparse-checkout disable' with the sparse index enabled, Git is expected to expand the index into a full index. However, it currently outputs the advice message saying that that is unexpected and likely due to an issue with the working directory. Disable this advice message when in this code path. Establish a pattern for doing a similar removal in the future. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-12environment: guard state depending on a repositoryPatrick Steinhardt1-0/+2
In "environment.h" we have quite a lot of functions and variables that either explicitly or implicitly depend on `the_repository`. The implicit set of stateful declarations includes for example variables which get populated when parsing a repository's Git configuration. This set of variables is broken by design, as their state often depends on the last repository config that has been parsed. So they may or may not represent the state of `the_repository`. Fixing that is quite a big undertaking, and later patches in this series will demonstrate a solution for a first small set of those variables. So for now, let's guard these with `USE_THE_REPOSITORY_VARIABLE` so that callers are aware of the implicit dependency. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-07-16Merge branch 'ds/advice-sparse-index-expansion'Junio C Hamano1-0/+28
A new warning message is issued when a command has to expand a sparse index to handle working tree cruft that are outside of the sparse checkout. * ds/advice-sparse-index-expansion: advice: warn when sparse index expands
2024-07-08advice: warn when sparse index expandsDerrick Stolee1-0/+28
Typically, forcing a sparse index to expand to a full index means that Git could not determine the status of a file outside of the sparse-checkout and needed to expand sparse trees into the full list of sparse blobs. This operation can be very slow when the sparse-checkout is much smaller than the full tree at HEAD. When users are in this state, there is usually a modified or untracked file outside of the sparse-checkout mentioned by the output of 'git status'. There are a number of reasons why this is insufficient: 1. Users may not have a full understanding of which files are inside or outside of their sparse-checkout. This is more common in monorepos that manage the sparse-checkout using custom tools that map build dependencies into sparse-checkout definitions. 2. In some cases, an empty directory could exist outside the sparse-checkout and these empty directories are not reported by 'git status' and friends. 3. If the user has '.gitignore' or 'exclude' files, then 'git status' will squelch the warnings and not demonstrate any problems. In order to help users who are in this state, add a new advice message to indicate that a sparse index is expanded to a full index. This message should be written at most once per process, so add a static global 'give_advice_on_expansion' to sparse-index.c. Further, there is a case in 'git sparse-checkout set' that uses the sparse index as an in-memory data structure (even when writing a full index) so we need to disable the message in that kind of case. The t1092-sparse-checkout-compatibility.sh test script compares the behavior of several Git commands across full and sparse repositories, including sparse repositories with and without a sparse index. We need to disable the advice in the sparse-index repo to avoid differences in stderr. By leaving the advice on in the sparse-checkout repo (without the sparse index), we can test the behavior of disabling the advice in convert_to_sparse(). (Indeed, these tests are how that necessity was discovered.) Add a test that reenables the advice and demonstrates that the message is output. The advice message is defined outside of expand_index() to avoid super- wide lines. It is also defined as a macro to avoid compile issues with -Werror=format-security. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-28sparse-index: improve lstat caching of sparse pathsDerrick Stolee1-24/+90
The clear_skip_worktree_from_present_files() method was first introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to allow better interaction with the working directory in the presence of paths outside of the sparse-checkout. The initial implementation would lstat() every single SKIP_WORKTREE path to see if it existed; if it ran across a sparse directory that existed (when a sparse index was in use), then it would expand the index and then check every SKIP_WORKTREE path. Since these lstat() calls were very expensive, this was improved in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14) by caching directories that do not exist so it could avoid lstat()ing any files under such directories. However, there are some inefficiencies in that caching mechanism. The caching mechanism stored only the parent directory as not existing, even if a higher parent directory also does not exist. This means that wasted lstat() calls would occur when the paths passed to path_found() change immediate parent directories but within the same parent directory that does not exist. To create an example repository that demonstrates this problem, it helps to have a directory outside of the sparse-checkout that contains many deep paths. In particular, the first paths (in lexicographic order) underneath the sparse directory should have deep directory structures, maximizing the difference between the old caching algorithm that looks to a single parent and the new caching algorithm that looks to the top-most missing directory. The performance test script p2000-sparse-operations.sh takes the sample repository and copies its HEAD to several copies nested in directories of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior and also lead to some interesting cases for the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do not. This is difficult to notice when running performance tests using the Git repository (or a blow-up of the Git repository, as in p2000-sparse-operations.sh) because Git has a very shallow directory structure. This change reorganizes the caching algorithm to focus on storing the highest level leading directory that does not exist; specifically this means that that directory's parent _does_ exist. By doing a little extra work on a path passed to path_found(), we can short-circuit all of the paths passed to path_found() afterwards that match a prefix with that non-existing directory. When in a repository where the first sparse file is likely to have a much deeper path than the first non-existing directory, this can realize significant gains. The details of this algorithm require careful attention, so the new implementation of path_found() has detailed comments, including the use of a new max_common_dir_prefix() method that may be of independent interest. It's worth noting that this is not universally positive, since we are doing extra lstat() calls to establish the exact path to cache. In the blow-up of the Git repository, we can see that the lstat count _increases_ from 28 to 31. However, these numbers were already artificially low. Contributor Elijah Newren created a publicly-available test repository that demonstrates the difference in these caching algorithms in the most extreme way. To test, follow these steps: git clone --sparse https://github.com/newren/gvfs-like-git-bomb cd gvfs-like-git-bomb ./runme.sh # NOTE: check scripts before running! At this point, assuming you do not have index.sparse=true set globally, the index has one million paths with the SKIP_WORKTREE bit and they will all be sent to path_found() in the sparse loop. You can measure this by running 'git status' with GIT_TRACE2_PERF=1: Sparse files in the index: 1,000,000 sparse_lstat_count (before): 200,000 sparse_lstat_count (after): 2 And here are the performance numbers: Benchmark 1: old Time (mean ± σ): 397.5 ms ± 4.1 ms Range (min … max): 391.2 ms … 404.8 ms 10 runs Benchmark 2: new Time (mean ± σ): 252.7 ms ± 3.1 ms Range (min … max): 249.4 ms … 259.5 ms 11 runs Summary 'new' ran 1.57 ± 0.02 times faster than 'old' By modifying this example further, we can demonstrate a more realistic example and include the sparse index expansion. Continue by creating this directory, confusing both caching algorithms somewhat: mkdir -p bomb/d/e/f/a/a Then re-run the 'git status' tests to see these statistics: Sparse files in the index: 1,000,000 sparse_lstat_count (before): 724,010 sparse_lstat_count (after): 106 Benchmark 1: old Time (mean ± σ): 753.0 ms ± 3.5 ms Range (min … max): 749.7 ms … 760.9 ms 10 runs Benchmark 2: new Time (mean ± σ): 201.4 ms ± 3.2 ms Range (min … max): 196.0 ms … 207.9 ms 14 runs Summary 'new' ran 3.74 ± 0.06 times faster than 'old' Note that if this repository had a sparse index enabled, the additional cost of expanding the sparse index affects the total time of these commands by over four seconds, significantly diminishing the benefit of the caching algorithm. Having existing paths outside of the sparse-checkout is a known performance issue for the sparse index and is a known trade-off for the performance benefits given when no such paths exist. Using an internal monorepo with over two million paths at HEAD and a typical sparse-checkout cone such that the sparse index contains ~190,000 entries (including over two thousand sparse trees), I was able to measure these lstat counts when one sparse directory actually exists on disk: Sparse files in expanded index: 1,841,997 full_lstat_count (before): 1,188,161 full_lstat_count (after): 4,404 This resulted in this absolute time change, on a warm disk: Time in full loop (before): 13.481 s Time in full loop (after): 0.081 s (These times were calculated on a Windows machine, where lstat() is slower than a similar Linux machine.) Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-28sparse-index: count lstat() callsDerrick Stolee1-0/+7
The clear_skip_worktree.. methods already report some statistics about how many cache entries are checked against path_found() due to having the skip-worktree bit set. However, due to path_found() performing some caching, this isn't the only information that would be helpful to report. Add a new lstat_count member to the path_found_data struct to count the number of times path_found() calls lstat(). This will be helpful to help explain performance problems in this method as well as to demonstrate future changes to the caching algorithm in a more concrete way than end-to-end timings. Signed-off-by: Derrick Stolee <stolee@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-28sparse-index: use strbuf in path_found()Derrick Stolee1-12/+9
The path_found() method previously reused strings from the cache entries the calling methods were using. This prevents string manipulation in place and causes some odd reallocation before the final lstat() call in the method. Refactor the method to use strbufs and copy the path into the strbuf, but also only the parent directory and not the whole path. This looks like extra copying when assigning the path to the strbuf, but we save an allocation by dropping the 'tmp' string, and we are "reusing" the copy from 'tmp' to put the data in the strbuf. Signed-off-by: Derrick Stolee <stolee@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-28sparse-index: refactor path_found()Derrick Stolee1-16/+29
In advance of changing the behavior of path_found(), take all of the intermediate data values and group them into a single struct. This simplifies the method prototype as well as the initialization. Future changes can be made directly to the struct and method without changing the callers with this approach. Note that the clear_path_found_data() method is currently empty, as there is nothing to free. This method is a placeholder for future changes that require a non-trivial implementation. Its stub is created now so consumers could call it now and not change in future changes. Signed-off-by: Derrick Stolee <stolee@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-28sparse-checkout: refactor skip worktree retry logicDerrick Stolee1-24/+53
The clear_skip_worktree_from_present_files() method was introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to help cases where sparse-checkout is enabled but some paths outside of the sparse-checkout also exist on disk. This operation can be slow as it needs to check path existence in a way not stored in the index, so caching was introduced in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14). This check is particularly confusing in the presence of a sparse index, as a sparse tree entry corresponding to an existing directory must first be expanded to a full index before examining the paths within. This is currently implemented using a 'goto' and a boolean variable to ensure we restart only once. Even with that caching, it was noticed that this could take a long time to execute. 89aaab11a3 (index: add trace2 region for clear skip worktree, 2022-11-03) introduced trace2 regions to measure this time. Further, the way the loop repeats itself was slightly confusing and prone to breakage, so a BUG() statement was added in 8c7abdc596 (index: raise a bug if the index is materialised more than once, 2022-11-03) to be sure that the second run of the loop does not hit any sparse trees. One thing that can be confusing about the current setup is that the trace2 regions nest and it is not clear that a second loop is running after a sparse index is expanded. Here is an example of what the regions look like in a typical case: | region_enter | ... | label:clear_skip_worktree_from_present_files | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_enter | ... | ..label:ensure_full_index | region_enter | ... | ....label:update | region_leave | ... | ....label:update | region_leave | ... | ..label:ensure_full_index | data | ... | ..sparse_path_count:1 | data | ... | ..sparse_path_count_full:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files One thing that is particularly difficult to understand about these regions is that most of the time is spent between the close of the ensure_full_index region and the reporting of the end data. This is because of the restart of the loop being within the same region as the first iteration of the loop. This change refactors the method into two separate methods that are traced separately. This will be more important later when we change other features of the methods, but for now the only functional change is the difference in the structure of the trace regions. After this change, the same telemetry section is split into three distinct chunks: | region_enter | ... | label:clear_skip_worktree_from_present_files_sparse | data | ... | ..sparse_path_count:1 | region_leave | ... | label:clear_skip_worktree_from_present_files_sparse | region_enter | ... | label:update | region_leave | ... | label:update | region_enter | ... | label:ensure_full_index | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_leave | ... | label:ensure_full_index | region_enter | ... | label:clear_skip_worktree_from_present_files_full | data | ... | ..full_path_count:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files_full Here, we see the sparse loop terminating early with its first sparse path being a sparse directory containing a file. Then, that loop's region terminates before ensure_full_index begins (in this case, the cache-tree must also be computed). Then, _after_ the index is expanded, the full loop begins with its own region. Signed-off-by: Derrick Stolee <stolee@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-08Merge branch 'jh/sparse-index-expand-to-path-fix'Junio C Hamano1-2/+2
A caller called index_file_exists() that takes a string expressed as <ptr, length> with a wrong length, which has been corrected. * jh/sparse-index-expand-to-path-fix: sparse-index: pass string length to index_file_exists()
2024-02-02sparse-index: pass string length to index_file_exists()Jeff Hostetler1-2/+2
The call to index_file_exists() in the loop in expand_to_path() passes the wrong string length. Let's fix that. The loop in expand_to_path() searches the name-hash for each sub-directory prefix in the provided pathname. That is, by searching for "dir1/" then "dir1/dir2/" then "dir1/dir2/dir3/" and so on until it finds a cache-entry representing a sparse directory. The code creates "strbuf path_mutable" to contain the working pathname and modifies the buffer in-place by temporarily replacing the character following each successive "/" with NUL for the duration of the call to index_file_exists(). It does not update the strbuf.len during this substitution. Pass the patched length of the prefix path instead. Signed-off-by: Jeff Hostetler <jeffhostetler@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-31read_tree(): respect max_allowed_tree_depthJeff King1-1/+1
The read_tree() function reads trees recursively (via its read_tree_at() helper). This can cause it to run out of stack space on very deep trees. Let's teach it about the new core.maxTreeDepth option. The easiest way to demonstrate this is via "ls-tree -r", which the test covers. Note that I needed a tree depth of ~30k to trigger a segfault on my Linux system, not the 4100 used by our "big" test in t6700. However, that test still tells us what we want: that the default 4096 limit is enough to prevent segfaults on all platforms. We could bump it, but that increases the cost of the test setup for little gain. As an interesting side-note: when I originally wrote this patch about 4 years ago, I needed a depth of ~50k to segfault. But porting it forward, the number is much lower. Seemingly little things like cf0983213c (hash: add an algo member to struct object_id, 2021-04-26) take it from 32,722 to 29,080. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05git-compat-util: move alloc macros to git-compat-util.hCalvin Wan1-1/+0
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21fsmonitor-ll.h: split this header out of fsmonitor.hElijah Newren1-1/+2
This creates a new fsmonitor-ll.h with most of the functions from fsmonitor.h, though it leaves three inline functions where they were. Two-thirds of the files that previously included fsmonitor.h did not need those three inline functions or the six extra includes those inline functions required, so this allows them to only include the lower level header. Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21cache.h: remove this no-longer-used headerElijah Newren1-1/+1
Since this header showed up in some places besides just #include statements, update/clean-up/remove those other places as well. Note that compat/fsmonitor/fsm-path-utils-darwin.c previously got away with violating the rule that all files must start with an include of git-compat-util.h (or a short-list of alternate headers that happen to include it first). This change exposed the violation and caused it to stop building correctly; fix it by having it include git-compat-util.h first, as per policy. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21name-hash.h: move declarations for name-hash.c from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21environment.h: move declarations for environment.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: be explicit about dependence on gettext.hElijah Newren1-0/+1
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23alloc.h: move ALLOC_GROW() functions from cache.hElijah Newren1-0/+1
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-17treewide: always have a valid "index_state.repo" memberÆvar Arnfjörð Bjarmason1-9/+0
When the "repo" member was added to "the_index" in [1] the repo_read_index() was made to populate it, but the unpopulated "the_index" variable didn't get the same treatment. Let's do that in initialize_the_repository() when we set it up, and likewise for all of the current callers initialized an empty "struct index_state". This simplifies code that needs to deal with "the_index" or a custom "struct index_state", we no longer need to second-guess this part of the "index_state" deep in the stack. A recent example of such second-guessing is the "istate->repo ? istate->repo : the_repository" code in [2]. We can now simply use "istate->repo". We're doing this by making use of the INDEX_STATE_INIT() macro (and corresponding function) added in [3], which now have mandatory "repo" arguments. Because we now call index_state_init() in repository.c's initialize_the_repository() we don't need to handle the case where we have a "repo->index" whose "repo" member doesn't match the "repo" we're setting up, i.e. the "Complete the double-reference" code in repo_read_index() being altered here. That logic was originally added in [1], and was working around the lack of what we now have in initialize_the_repository(). For "fsmonitor-settings.c" we can remove the initialization of a NULL "r" argument to "the_repository". This was added back in [4], and was needed at the time for callers that would pass us the "r" from an "istate->repo". Before this change such a change to "fsmonitor-settings.c" would segfault all over the test suite (e.g. in t0002-gitfile.sh). This change has wider eventual implications for "fsmonitor-settings.c". The reason the other lazy loading behavior in it is required (starting with "if (!r->settings.fsmonitor) ..." is because of the previously passed "r" being "NULL". I have other local changes on top of this which move its configuration reading to "prepare_repo_settings()" in "repo-settings.c", as we could now start to rely on it being called for our "r". But let's leave all of that for now, and narrowly remove this particular part of the lazy-loading. 1. 1fd9ae517c4 (repository: add repo reference to index_state, 2021-01-23) 2. ee1f0c242ef (read-cache: add index.skipHash config option, 2023-01-06) 3. 2f6b1eb794e (cache API: add a "INDEX_STATE_INIT" macro/function, add release_index(), 2023-01-12) 4. 1e0ea5c4316 (fsmonitor: config settings are repository-specific, 2022-03-25) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-13sparse-index API: BUG() out on NULL ensure_full_index()Ævar Arnfjörð Bjarmason1-1/+3
Make the ensure_full_index() function stricter, and have it only accept a non-NULL "struct index_state". This function (and this behavior) was added in [1]. The only reason it needed to be this lax was due to interaction with repo_index_has_changes(). See the addition of that code in [2]. The other reason for why this was needed dates back to interaction with code added in [3]. In [4] we started calling ensure_full_index() in unpack_trees(), but the caller added in 34110cd4e39 wants to pass us a NULL "dst_index". Let's instead do the NULL check in unpack_trees() itself. 1. 4300f8442a2 (sparse-index: implement ensure_full_index(), 2021-03-30) 2. 0c18c059a15 (read-cache: ensure full index, 2021-04-01) 3. 34110cd4e39 (Make 'unpack_trees()' have a separate source and destination index, 2008-03-06) 4. 6863df35503 (unpack-trees: ensure full index, 2021-03-30) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-13sparse-index.c: expand_to_path() can assume non-NULL "istate"Ævar Arnfjörð Bjarmason1-1/+1
This function added in [1] was subsequently used in [2]. All of the calls to it are in name-hash.c, and come after calls to lazy_init_name_hash(istate). The first thing that function does is: if (istate->name_hash_initialized) return; So we can already assume that we have a non-NULL "istate" here, or we'd be segfaulting. Let's not confuse matters by making it appear that's not the case. 1. 71f82d032f3 (sparse-index: expand_to_path(), 2021-04-12) 2. 4589bca829a (name-hash: use expand_to_path(), 2021-04-12) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-11-04index: raise a bug if the index is materialised more than onceAnh Le1-0/+2
If clear_skip_worktree_from_present_files() encounter a sparse directory, it fully materialise the index which should expand any sparse directories and start going through each entries again. If this happens more than once, raise it with a BUG. Signed-off-by: Anh Le <anh@canva.com> Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-04index: add trace2 region for clear skip worktreeAnh Le1-6/+22
When using sparse checkout, clear_skip_worktree_from_present_files() must enumerate index entries to find ones with the SKIP_WORKTREE bit to determine whether those index entries exist on disk (in which case their SKIP_WORKTREE bit should be removed). In a large repository, this may take considerable time depending on the size of the index. Add a trace2 region to surface this information, keeping a count of how many paths have been checked. Separately, keep counts after a full index is materialized. Signed-off-by: Anh Le <anh@canva.com> Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-06-03Merge branch 'ds/sparse-sparse-checkout'Junio C Hamano1-15/+117
"sparse-checkout" learns to work well with the sparse-index feature. * ds/sparse-sparse-checkout: sparse-checkout: integrate with sparse index p2000: add test for 'git sparse-checkout [add|set]' sparse-index: complete partial expansion sparse-index: partially expand directories sparse-checkout: --no-sparse-index needs a full index cache-tree: implement cache_tree_find_path() sparse-index: introduce partially-sparse indexes sparse-index: create expand_index() t1092: stress test 'git sparse-checkout set' t1092: refactor 'sparse-index contents' test
2022-05-23sparse-index: complete partial expansionDerrick Stolee1-3/+36
To complete the implementation of expand_to_pattern_list(), we need to detect when a sparse directory entry should remain sparse. This avoids a full expansion, so we now need to use the PARTIALLY_SPARSE mode to indicate this state. There still are no callers to this method, but we will add one in the next change. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-23sparse-index: partially expand directoriesDerrick Stolee1-7/+50
The expand_to_pattern_list() method expands sparse directory entries to their list of contained files when either the pattern list is NULL or the directory is contained in the new pattern list's cone mode patterns. It is possible that the pattern list has a recursive match with a directory 'A/B/C/' and so an existing sparse directory 'A/B/' would need to be expanded. If there exists a directory 'A/B/D/', then that directory should not be expanded and instead we can create a sparse directory. To implement this, we plug into the add_path_to_index() callback for the call to read_tree_at(). Since we now need access to both the index we are writing and the pattern list we are comparing, create a 'struct modify_index_context' to use as a data transfer object. It is important that we use the given pattern list since we will use this pattern list to change the sparse-checkout patterns and cannot use istate->sparse_checkout_patterns. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-23sparse-index: introduce partially-sparse indexesDerrick Stolee1-3/+3
A future change will present a temporary, in-memory mode where the index can both contain sparse directory entries but also not be completely collapsed to the smallest possible sparse directories. This will be necessary for modifying the sparse-checkout definition while using a sparse index. For now, convert the single-bit member 'sparse_index' in 'struct index_state' to be a an 'enum sparse_index_mode' with three modes: * INDEX_EXPANDED (0): No sparse directories exist. This is always the case for repositories that do not use cone-mode sparse-checkout. * INDEX_COLLAPSED: Sparse directories may exist. Files outside the sparse-checkout cone are reduced to sparse directory entries whenever possible. * INDEX_PARTIALLY_SPARSE: Sparse directories may exist. Some file entries outside the sparse-checkout cone may exist. Running convert_to_sparse() may further reduce those files to sparse directory entries. The main reason to store this extra information is to allow convert_to_sparse() to short-circuit when the index is already in INDEX_EXPANDED mode but to actually do the necessary work when in INDEX_PARTIALLY_SPARSE mode. The INDEX_PARTIALLY_SPARSE mode will be used in an upcoming change. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-23sparse-index: create expand_index()Derrick Stolee1-3/+29
This is the first change in a series to allow modifying the sparse-checkout pattern set without expanding a sparse index to a full one in the process. Here, we focus on the problem of expanding the pattern set through a command like 'git sparse-checkout add <path>' which needs to create new index entries for the paths now being written to the worktree. To achieve this, we need to be able to replace sparse directory entries with their contained files and subdirectories. Once this is complete, other code paths can discover those cache entries and write the corresponding files to disk before committing the index. We already have logic in ensure_full_index() that expands the index entries, so we will use that as our base. Create a new method, expand_index(), which takes a pattern list, but for now mostly ignores it. The current implementation is only correct when the pattern list is NULL as that does the same as ensure_full_index(). In fact, ensure_full_index() is converted to a shim over expand_index(). A future update will actually implement expand_index() to its full capabilities. For now, it is created and documented. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-10sparse-index: expose 'is_sparse_index_allowed()'Victoria Dye1-1/+1
Expose 'is_sparse_index_allowed()' publicly so that it may be used by callers outside of 'sparse-index.c'. While no such callers exist yet, it will be used in a subsequent commit. Signed-off-by: Victoria Dye <vdye@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-09Merge branch 'en/present-despite-skipped'Junio C Hamano1-0/+74
In sparse-checkouts, files mis-marked as missing from the working tree could lead to later problems. Such files were hard to discover, and harder to correct. Automatically detecting and correcting the marking of such files has been added to avoid these problems. * en/present-despite-skipped: repo_read_index: add config to expect files outside sparse patterns Accelerate clear_skip_worktree_from_present_files() by caching Update documentation related to sparsity and the skip-worktree bit repo_read_index: clear SKIP_WORKTREE bit from files present in worktree unpack-trees: fix accidental loss of user changes t1011: add testcase demonstrating accidental loss of user modifications
2022-03-01repo_read_index: add config to expect files outside sparse patternsElijah Newren1-1/+2
Typically with sparse checkouts, we expect files outside the sparsity patterns to be marked as SKIP_WORKTREE and be missing from the working tree. Sometimes this expectation would be violated however; including in cases such as: * users grabbing files from elsewhere and writing them to the worktree (perhaps by editing a cached copy in an editor, copying/renaming, or even untarring) * various git commands having incomplete or no support for the SKIP_WORKTREE bit[1,2] * users attempting to "abort" a sparse-checkout operation with a not-so-early Ctrl+C (updating $GIT_DIR/info/sparse-checkout and the working tree is not atomic)[3]. When the SKIP_WORKTREE bit in the index did not reflect the presence of the file in the working tree, it traditionally caused confusion and was difficult to detect and recover from. So, in a sparse checkout, since af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14), Git automatically clears the SKIP_WORKTREE bit at index read time for entries corresponding to files that are present in the working tree. There is another workflow, however, where it is expected that paths outside the sparsity patterns appear to exist in the working tree and that they do not lose the SKIP_WORKTREE bit, at least until they get modified. A Git-aware virtual file system[4] takes advantage of its position as a file system driver to expose all files in the working tree, fetch them on demand using partial clone on access, and tell Git to pay attention to them on demand by updating the sparse checkout pattern on writes. This means that commands like "git status" only have to examine files that have potentially been modified, whereas commands like "ls" are able to show the entire codebase without requiring manual updates to the sparse checkout pattern. Thus since af6a51875a, Git with such Git-aware virtual file systems unsets the SKIP_WORKTREE bit for all files and commands like "git status" have to fetch and examine them all. Introduce a configuration setting sparse.expectFilesOutsideOfPatterns to allow limiting the tracked set of files to a small set once again. A Git-aware virtual file system or other application that wants to maintain files outside of the sparse checkout can set this in a repository to instruct Git not to check for the presence of SKIP_WORKTREE files. The setting defaults to false, so most users of sparse checkout will still get the benefit of an automatically updating index to recover from the variety of difficult issues detailed in af6a51875a for paths with SKIP_WORKTREE set despite the path being present. [1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/ [2] The three long paragraphs in the middle of https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/ [3] https://lore.kernel.org/git/CABPp-BFnFpzwGC11TLoLs8YK5yiisA5D5-fFjXnJsbESVDwZsA@mail.gmail.com/ [4] such as the vfsd described in https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/ Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25Merge branch 'ds/sparse-checkout-requires-per-worktree-config'Junio C Hamano1-7/+3
"git sparse-checkout" wants to work with per-worktree configuration, but did not work well in a worktree attached to a bare repository. * ds/sparse-checkout-requires-per-worktree-config: config: make git_configset_get_string_tmp() private worktree: copy sparse-checkout patterns and config on add sparse-checkout: set worktree-config correctly config: add repo_config_set_worktree_gently() worktree: create init_worktree_config() Documentation: add extensions.worktreeConfig details
2022-02-08sparse-checkout: set worktree-config correctlyDerrick Stolee1-7/+3
`git sparse-checkout set/init` enables worktree-specific configuration[*] by setting extensions.worktreeConfig=true, but neglects to perform the additional necessary bookkeeping of relocating `core.bare=true` and `core.worktree` from $GIT_COMMON_DIR/config to $GIT_COMMON_DIR/config.worktree, as documented in git-worktree.txt. As a result of this oversight, these settings, which are nonsensical for secondary worktrees, can cause Git commands to incorrectly consider a worktree bare (in the case of `core.bare`) or operate on the wrong worktree (in the case of `core.worktree`). Fix this problem by taking advantage of the recently-added init_worktree_config() which enables `extensions.worktreeConfig` and takes care of necessary bookkeeping. While at it, for backward-compatibility reasons, also stop upgrading the repository format to "1" since doing so is (unintentionally) not required to take advantage of `extensions.worktreeConfig`, as explained by 11664196ac ("Revert "check_repository_format_gently(): refuse extensions for old repositories"", 2020-07-15). [*] The main reason to use worktree-specific config for the sparse-checkout builtin was to avoid enabling sparse-checkout patterns in one and causing a loss of files in another. If a worktree does not have a sparse-checkout patterns file, then the sparse-checkout logic will not kick in on that worktree. Reported-by: Sean Allred <allred.sean@gmail.com> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-23sparse-index: sparse index is disallowed when split index is activeJohannes Schindelin1-1/+1
In 6e773527b6b (sparse-index: convert from full to sparse, 2021-03-30), we introduced initial support for a sparse index, and were careful to avoid converting to a sparse index in the presence of a split index. However, when we _just_ read a freshly-initialized index, it might not contain a split index even if _writing_ it will add one by virtue of being asked for via the `GIT_TEST_SPLIT_INDEX` variable. We did not notice any problems with checking _only_ for `split_index` (and not `GIT_TEST_SPLIT_INDEX`) right until both `vd/sparse-sparsity-fix-on-read` _and_ `vd/sparse-reset` were merged. Those two topics' interplay triggers a bug in conjunction with running t1091.15 when `GIT_TEST_SPLIT_INDEX=true` in the following way: `vd/sparse-sparsity-fix-on-read` ensures that the index is made sparse right after reading, and `vd/sparse-reset` ensures that the index is made non-sparse again unless running in the `--soft` mode. Since the split index feature is incompatible with the sparse index feature, we see a symptom like this: fatal: position for replacement 4 exceeds base index size 4 Let's fix this by avoiding the conversion to a sparse index when `GIT_TEST_SPLIT_INDEX=true`. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-14Accelerate clear_skip_worktree_from_present_files() by cachingElijah Newren1-2/+54
Trying to clear the skip-worktree bit from files that are present does present some computational overhead, for sparse-checkouts. (We do not do the bit clearing in non-sparse-checkouts.) Optimize it as follows: Rather than lstat()'ing every SKIP_WORKTREE path, take advantage of the fact that entire directories will often be missing, especially for cone mode and even more so ever since commit 55dfcf9591 ("sparse-checkout: clear tracked sparse dirs", 2021-09-08). If we have already determined that the parent directory of a file (or other previous ancestor) does not exist, then the file cannot exist either so we do not need to lstat() it separately. Timings for p2000 included below, reformatted to fit in normal commit message line lengths, which compare three things: * Timings before this series * Timings of the unoptimized version of clear_skip_worktree_from_present_files() from a few commits ago * Timings after the optimization in this commit (NOTE: t/perf/ appears to have timing resolution only down to 0.01 s, which presents significant measurement error when timings only differ by 0.01s. I don't trust any such timings below, and yet all the optimized results differ by at most 0.01s.) Test Before Series Unoptimized Optimized ----------------------------------------------------------------------------- *git status* full-v3 0.15(0.10+0.06) 0.32(0.16+0.17) +113.3% 0.16(0.10+0.07) +6.7% full-v4 0.15(0.11+0.05) 0.32(0.17+0.16) +113.3% 0.16(0.11+0.05) +6.7% sparse-v3 0.04(0.03+0.04) 0.04(0.02+0.05) +0.0% 0.04(0.02+0.05) +0.0% sparse-v4 0.04(0.03+0.04) 0.04(0.02+0.05) +0.0% 0.04(0.03+0.05) +0.0% *git add -A* full-v3 0.40(0.30+0.07) 0.56(0.36+0.17) +40.0% 0.39(0.30+0.07) -2.5% full-v4 0.37(0.28+0.07) 0.54(0.37+0.16) +45.9% 0.38(0.29+0.07) +2.7% sparse-v3 0.06(0.04+0.05) 0.08(0.05+0.05) +33.3% 0.06(0.05+0.04) +0.0% sparse-v4 0.05(0.03+0.05) 0.05(0.04+0.04) +0.0% 0.06(0.04+0.05) +20.0% *git add .* full-v3 0.40(0.31+0.07) 0.57(0.37+0.17) +42.5% 0.41(0.30+0.08) +2.5% full-v4 0.38(0.30+0.06) 0.55(0.37+0.16) +44.7% 0.38(0.30+0.06) +0.0% sparse-v3 0.06(0.04+0.05) 0.06(0.05+0.04) +0.0% 0.06(0.03+0.05) +0.0% sparse-v4 0.06(0.05+0.05) 0.06(0.04+0.05) +0.0% 0.06(0.04+0.06) +0.0% *git commit -a -m A* full-v3 0.41(0.32+0.06) 0.58(0.39+0.17) +41.5% 0.42(0.32+0.07) +2.4% full-v4 0.39(0.30+0.07) 0.56(0.38+0.17) +43.6% 0.40(0.31+0.07) +2.6% sparse-v3 0.04(0.03+0.04) 0.04(0.03+0.04) +0.0% 0.04(0.03+0.04) +0.0% sparse-v4 0.04(0.03+0.05) 0.04(0.03+0.05) +0.0% 0.04(0.03+0.04) +0.0% *git checkout -f -* full-v3 0.56(0.46+0.07) 0.73(0.55+0.16) +30.4% 0.57(0.47+0.08) +1.8% full-v4 0.54(0.45+0.07) 0.71(0.53+0.17) +31.5% 0.55(0.45+0.07) +1.9% sparse-v3 0.06(0.04+0.04) 0.06(0.04+0.05) +0.0% 0.06(0.04+0.05) +0.0% sparse-v4 0.05(0.05+0.04) 0.05(0.04+0.05) +0.0% 0.06(0.04+0.05) +20.0% *git reset* full-v3 0.34(0.26+0.05) 0.51(0.34+0.15) +50.0% 0.34(0.26+0.06) +0.0% full-v4 0.32(0.24+0.06) 0.49(0.32+0.15) +53.1% 0.33(0.25+0.06) +3.1% sparse-v3 0.04(0.03+0.04) 0.04(0.03+0.04) +0.0% 0.04(0.03+0.04) +0.0% sparse-v4 0.03(0.03+0.04) 0.03(0.02+0.04) +0.0% 0.03(0.03+0.04) +0.0% *git reset --hard* full-v3 0.57(0.46+0.07) 0.90(0.61+0.25) +57.9% 0.57(0.45+0.08) +0.0% full-v4 0.54(0.46+0.05) 0.88(0.59+0.26) +63.0% 0.55(0.45+0.07) +1.9% sparse-v3 0.07(0.03+0.03) 0.07(0.04+0.03) +0.0% 0.07(0.03+0.03) +0.0% sparse-v4 0.06(0.03+0.03) 0.06(0.04+0.02) +0.0% 0.06(0.03+0.03) +0.0% *git reset -- does-not-exist* full-v3 0.35(0.27+0.06) 0.52(0.32+0.17) +48.6% 0.35(0.27+0.06) +0.0% full-v4 0.33(0.26+0.05) 0.50(0.33+0.15) +51.5% 0.33(0.26+0.06) +0.0% sparse-v3 0.04(0.03+0.04) 0.04(0.03+0.04) +0.0% 0.04(0.03+0.04) +0.0% sparse-v4 0.04(0.02+0.04) 0.03(0.02+0.04) -25.0% 0.03(0.02+0.04) -25.0% *git diff* full-v3 0.07(0.04+0.04) 0.24(0.11+0.14) +242.9% 0.07(0.04+0.04) +0.0% full-v4 0.07(0.03+0.05) 0.24(0.13+0.12) +242.9% 0.08(0.04+0.05) +14.3% sparse-v3 0.02(0.01+0.04) 0.02(0.01+0.04) +0.0% 0.02(0.01+0.05) +0.0% sparse-v4 0.02(0.02+0.03) 0.02(0.01+0.04) +0.0% 0.02(0.01+0.04) +0.0% *git diff --cached* full-v3 0.05(0.03+0.02) 0.22(0.12+0.09) +340.0% 0.05(0.03+0.01) +0.0% full-v4 0.05(0.03+0.01) 0.23(0.12+0.11) +360.0% 0.05(0.03+0.02) +0.0% sparse-v3 0.01(0.00+0.00) 0.01(0.00+0.00) +0.0% 0.01(0.00+0.00) +0.0% sparse-v4 0.01(0.00+0.00) 0.01(0.00+0.00) +0.0% 0.01(0.00+0.00) +0.0% *git blame f2/f4/a* full-v3 0.18(0.13+0.05) 0.52(0.29+0.23) +188.9% 0.19(0.15+0.04) +5.6% full-v4 0.19(0.15+0.04) 0.52(0.28+0.23) +173.7% 0.19(0.14+0.04) +0.0% sparse-v3 0.10(0.08+0.02) 0.10(0.09+0.01) +0.0% 0.10(0.09+0.01) +0.0% sparse-v4 0.10(0.08+0.02) 0.10(0.08+0.02) +0.0% 0.10(0.08+0.02) +0.0% *git blame f2/f4/f3/a* full-v3 0.45(0.36+0.08) 0.78(0.51+0.27) +73.3% 0.45(0.37+0.08) +0.0% full-v4 0.45(0.37+0.08) 0.78(0.51+0.26) +73.3% 0.45(0.37+0.08) +0.0% sparse-v3 0.36(0.32+0.04) 0.36(0.31+0.05) +0.0% 0.36(0.31+0.04) +0.0% sparse-v4 0.36(0.31+0.05) 0.36(0.31+0.05) +0.0% 0.36(0.31+0.04) +0.0% *git checkout-index -f --all* full-v3 0.07(0.02+0.05) 0.24(0.12+0.12) +242.9% 0.08(0.04+0.04) +14.3% full-v4 0.07(0.03+0.04) 0.24(0.11+0.13) +242.9% 0.08(0.03+0.04) +14.3% sparse-v3 0.04(0.01+0.03) 0.04(0.00+0.03) +0.0% 0.04(0.01+0.03) +0.0% sparse-v4 0.04(0.01+0.02) 0.04(0.01+0.03) +0.0% 0.04(0.01+0.02) +0.0% *git update-index --add --remove f2/f4/a* full-v3 0.29(0.23+0.02) 0.46(0.30+0.12) +58.6% 0.30(0.24+0.02) +3.4% full-v4 0.27(0.22+0.02) 0.45(0.29+0.12) +66.7% 0.28(0.22+0.03) +3.7% sparse-v3 0.02(0.02+0.00) 0.02(0.01+0.00) +0.0% 0.02(0.01+0.00) +0.0% sparse-v4 0.02(0.02+0.00) 0.02(0.02+0.00) +0.0% 0.02(0.02+0.00) +0.0% So, with the optimization, the extra work appears to be essentially 0 for sparse-checkouts that are also using sparse-indexes (even before my optimization), and the extra work appears to be just marginally more than 0 for sparse-checkouts that are using full indexes. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-14repo_read_index: clear SKIP_WORKTREE bit from files present in worktreeElijah Newren1-0/+21
The fix is short (~30 lines), but the description is not. Sorry. There is a set of problems caused by files in what I'll refer to as the "present-despite-SKIP_WORKTREE" state. This commit aims to not just fix these problems, but remove the entire class as a possibility -- for those using sparse checkouts. But first, we need to understand the problems this class presents. A quick outline: * Problems * User facing issues * Problem space complexity * Maintenance and code correctness challenges * SKIP_WORKTREE expectations in Git * Suggested solution * Pros/Cons of suggested solution * Notes on testcase modifications === User facing issues === There are various ways for users to get files to be present in the working copy despite having the SKIP_WORKTREE bit set for that file in the index. This may come from: * various git commands not really supporting the SKIP_WORKTREE bit[1,2] * users grabbing files from elsewhere and writing them to the worktree (perhaps even cached in their editor) * users attempting to "abort" a sparse-checkout operation with a not-so-early Ctrl+C (updating $GIT_DIR/info/sparse-checkout and the working tree is not atomic)[3]. Once users have present-despite-SKIP_WORKTREE files, any modifications users make to these files will be ignored, possibly to users' confusion. Further: * these files will degrade performance for the sparse-index case due to requiring the index to be expanded (see commit 55dfcf9591 ("sparse-checkout: clear tracked sparse dirs", 2021-09-08) for why we try to delete entire directories outside the sparse cone). * these files will not be updated by by standard commands (switch/checkout/pull/merge/rebase will leave them alone unless conflicts happen -- and even then, the conflicted file may be written somewhere else to avoid overwriting the SKIP_WORKTREE file that is present and in the way) * there is nothing in Git that users can use to discover such files (status, diff, grep, etc. all ignore it) * there is no reasonable mechanism to "recover" from such a condition (neither `git sparse-checkout reapply` nor `git reset --hard` will correct it). So, not only are users modifications ignored, but the files get progressively more stale over time. At some point in the future, they may change their sparseness specification or disable sparse-checkouts. At that time, all present-despite-SKIP_WORKTREE files will show up as having lots of modifications because they represent a version from a different branch or commit. These might include user-made local changes from days before, but the only way to tell is to have users look through them all closely. If these users come to others for help, there will be no logs that explain the issue; it's just a mysterious list of changes. Users might adamantly claim (correctly, as it turns out) that they didn't modify these files, while others presume they did. [1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/ [2] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/ [3] https://lore.kernel.org/git/CABPp-BFnFpzwGC11TLoLs8YK5yiisA5D5-fFjXnJsbESVDwZsA@mail.gmail.com/ === Problem space complexity === SKIP_WORKTREE has been part of Git for over a decade. Duy did lots of work on it initially, and several others have since come along and put lots of work into it. Stolee spent most of 2021 on the sparse-index, with lots of bugfixes along the way including to non-sparse-index cases as we are still trying to get sparse checkouts to behave reasonably. Basically every codepath throughout the treat needs to be aware of an additional type of file: tracked-but-not-present. The extra type results in lots of extra testcases and lots of extra code everywhere. But, the sad thing is that we actually have more than one extra type. We have tracked, tracked-but-not-present (SKIP_WORKTREE), and tracked-but-promised-to-not-be-present-but-is-present-anyway (present-despite-SKIP_WORKTREE). Two types is a monumental amount of effort to support, and adding a third feels a bit like insanity[4]. [4] Some examples of which can be seen at https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/ === Maintenance and code correctness challenges === Matheus' patches to grep stalled for nearly a year, in part because of complications of how to handle sparse-checkouts appropriately in all cases[5][6] (with trying to sanely figure out how to sanely handle present-despite-SKIP_WORKTREE files being one of the complications). His rm/add follow-ups also took months because of those kinds of issues[7]. The corner cases with things like submodules and SKIP_WORKTREE with the addition of present-despite-SKIP_WORKTREE start becoming really complex[8]. We've had to add ugly logic to merge-ort to attempt to handle present-despite-SKIP_WORKTREE files[9], and basically just been forced to give up in merge-recursive knowing full well that we'll sometimes silently discard user modifications. Despite stash essentially being a merge, it needed extra code (beyond what was in merge-ort and merge-recursive) to manually tweak SKIP_WORKTREE bits in order to avoid a few different bugs that'd result in an early abort with a partial stash application[10]. [5] See https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/#t and the dates on the thread; also Matheus and I had several conversations off-list trying to resolve the issues over that time [6] ...it finally kind of got unstuck after https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/ [7] See for example https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/#t and quotes like "The core functionality of sparse-checkout has always been only partially implemented", a statement I still believe is true today. [8] https://lore.kernel.org/git/pull.809.git.git.1592356884310.gitgitgadget@gmail.com/ [9] See commit 66b209b86a ("merge-ort: implement CE_SKIP_WORKTREE handling with conflicted entries", 2021-03-20) [10] See commit ba359fd507 ("stash: fix stash application in sparse-checkouts", 2020-12-01) === SKIP_WORKTREE expectations in Git === A couple quotes: * From [11] (before the "sparse-checkout" command existed): If it needs too many special cases, hacks, and conditionals, then it is not worth the complexity---if it is easier to write a correct code by allowing Git to populate working tree files, it is perfectly fine to do so. In a sense, the sparse checkout "feature" itself is a hack by itself, and that is why I think this part should be "best effort" as well. * From the git-sparse-checkout manual (still present today): THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE. [11] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/ === Suggested solution === SKIP_WORKTREE was written to allow sparse-checkouts, in particular, as the name of the option implies, to allow the file to NOT be in the worktree but consider it to be unchanged rather than deleted. The suggests a simple solution: present-despite-SKIP_WORKTREE files should not exist, for those using sparse-checkouts. Enforce this at index loading time by checking if core.sparseCheckout is true; if so, check files in the index with the SKIP_WORKTREE bit set to verify that they are absent from the working tree. If they are present, unset the bit (in memory, though any commands that write to the index will record the update). Users can, of course, can get the SKIP_WORKTREE bit back such as by running `git sparse-checkout reapply` (if they have ensured the file is unmodified and doesn't match the specified sparsity patterns). === Pros/Cons of suggested solution === Pros: * Solves the user visible problems reported above, which I've been complaining about for nearly a year but couldn't find a solution to. * Helps prevent slow performance degradation with a sparse-index. * Much easier behavior in sparse-checkouts for users to reason about * Very simple, ~30 lines of code. * Significantly simplifies some ugly testcases, and obviates the need to test an entire class of potential issues. * Reduces code complexity, reasoning, and maintenance. Avoids disagreements about weird corner cases[12]. * It has been reported that some users might be (ab)using SKIP_WORKTREE as a let-me-modify-but-keep-the-file-in-the-worktree mechanism[13, and a few other similar references]. These users know of multiple caveats and shortcomings in doing so; perhaps not surprising given the "SKIP_WORKTREE expecations" section above. However, these users use `git update-index --skip-worktree`, and not `git sparse-checkout` or core.sparseCheckout=true. As such, these users would be unaffected by this change and can continue abusing the system as before. [12] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/ [13] https://stackoverflow.com/questions/13630849/git-difference-between-assume-unchanged-and-skip-worktree Cons: * When core.sparseCheckout is enabled, this adds a performance cost to reading the index. I'll defer discussion of this cost to a subsequent patch, since I have some optimizations to add. === Notes on testcase modifications === The good: * t1011: Compare to two cases above it ('read-tree will not throw away dirty changes, non-sparse'); since the file is present, it should match the non-sparse case now * t1092: sparse-index & sparse-checkout now match full-worktree behavior in more cases! Yaay for consistency! * t6428, t7012: look at how much simpler the tests become! Merge and stash can just fail early telling the user there's a file in the way, instead of not noticing until it's about to write a file and then have to implement sudden crash avoidance. Hurray for sanity! * t7817: sparse behavior better matches full tree behavior. Hurray for sanity! The confusing: * t3705: These changes were ONLY needed on Windows, but they don't hurt other platforms. Let's discuss each individually: * core.sparseCheckout should be false by default. Nothing in this testcase toggles that until many, many tests later. However, early tests (#5 in particular) were testing `update-index --skip-worktree` behavior in a non-sparse-checkout, but the Windows tests in CI were behaving as if core.sparseCheckout=true had been specified somewhere. I do not have access to a Windows machine. But I just manually did what should have been a no-op and turned the config off. And it fixed the test. * I have no idea why the leftover .gitattributes file from this test was causing failures for test #18 on Windows, but only with these changes of mine. Test #18 was checking for empty stderr, and specifically wanted to know that some error completely unrelated to file endings did not appear. The leftover .gitattributes file thus caused some spurious stderr unrelated to the thing being checked. Since other tests did not intend to test normalization, just proactively remove the .gitattributes file. I'm certain this is cleaner and better, I'm just unsure why/how this didn't trigger problems before. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-24sparse-index: add ensure_correct_sparsity functionVictoria Dye1-4/+29
The `ensure_correct_sparsity` function is intended to provide a means of aligning the in-core index with the sparsity required by the repository settings and other properties of the index. The function first checks whether a sparse index is allowed (per repository & sparse checkout pattern settings). If the sparse index may be used, the index is converted to sparse; otherwise, it is explicitly expanded with `ensure_full_index`. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Victoria Dye <vdye@github.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-24sparse-index: avoid unnecessary cache tree clearingVictoria Dye1-11/+14
When converting a full index to sparse, clear and recreate the cache tree only if the cache tree is not fully valid. The convert_to_sparse operation should exit silently if a cache tree update cannot be successfully completed (e.g., due to a conflicted entry state). However, because this failure scenario only occurs when at least a portion of the cache tree is invalid, we can save ourselves the cost of clearing and recreating the cache tree by skipping the check when the cache tree is fully valid. Helped-by: Derrick Stolee <dstolee@microsoft.com> Co-authored-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Victoria Dye <vdye@github.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-20Merge branch 'ds/sparse-index-ignored-files'Junio C Hamano1-33/+43
In cone mode, the sparse-index code path learned to remove ignored files (like build artifacts) outside the sparse cone, allowing the entire directory outside the sparse cone to be removed, which is especially useful when the sparse patterns change. * ds/sparse-index-ignored-files: sparse-checkout: clear tracked sparse dirs sparse-index: add SPARSE_INDEX_MEMORY_ONLY flag attr: be careful about sparse directories sparse-checkout: create helper methods sparse-index: use WRITE_TREE_MISSING_OK sparse-index: silently return when cache tree fails unpack-trees: fix nested sparse-dir search sparse-index: silently return when not using cone-mode patterns t7519: rewrite sparse index test
2021-09-07sparse-index: add SPARSE_INDEX_MEMORY_ONLY flagDerrick Stolee1-15/+22
The convert_to_sparse() method checks for the GIT_TEST_SPARSE_INDEX environment variable or the "index.sparse" config setting before converting the index to a sparse one. This is for ease of use since all current consumers are preparing to compress the index before writing it to disk. If these settings are not enabled, then convert_to_sparse() silently returns without doing anything. We will add a consumer in the next change that wants to use the sparse index as an in-memory data structure, regardless of whether the on-disk format should be sparse. To that end, create the SPARSE_INDEX_MEMORY_ONLY flag that will skip these config checks when enabled. All current consumers are modified to pass '0' in the new 'flags' parameter. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07sparse-checkout: create helper methodsDerrick Stolee1-11/+3
As we integrate the sparse index into more builtins, we occasionally need to check the sparse-checkout patterns to see if a path is within the sparse-checkout cone. Create some helper methods that help initialize the patterns and check for pattern matching to make this easier. The existing callers of commands like get_sparse_checkout_patterns() use a custom 'struct pattern_list' that is not necessarily the one in the 'struct index_state', so there are not many previous uses that could adopt these helpers. There are just two in builtin/add.c and sparse-index.c that can use path_in_sparse_checkout(). We add a path_in_cone_mode_sparse_checkout() as well that will only return false if the path is outside of the sparse-checkout definition _and_ the sparse-checkout patterns are in cone mode. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07sparse-index: use WRITE_TREE_MISSING_OKDerrick Stolee1-1/+4
When updating the cache tree in convert_to_sparse(), the WRITE_TREE_MISSING_OK flag indicates that trees might be computed that do not already exist within the object database. This happens in cases such as 'git add' creating new trees that it wants to store in anticipation of a following 'git commit'. If this flag is not specified, then it might trigger a promisor fetch or a failure due to the object not existing locally. Use WRITE_TREE_MISSING_OK during convert_to_sparse() to avoid these possible reasons for the cache_tree_update() to fail. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07sparse-index: silently return when cache tree failsDerrick Stolee1-4/+6
If cache_tree_update() returns a non-zero value, then it could not create the cache tree. This is likely due to a path having a merge conflict. Since we are already returning early, let's return silently to avoid making it seem like we failed to write the index at all. If we remove our dependence on the cache tree within convert_to_sparse(), then we could still recover from this scenario and have a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07sparse-index: silently return when not using cone-mode patternsDerrick Stolee1-5/+11
While the sparse-index is only enabled when core.sparseCheckoutCone is also enabled, it is possible for the user to modify the sparse-checkout file manually in a way that does not match cone-mode patterns. In this case, we should refuse to convert an index into a sparse index, since the sparse_checkout_patterns will not be initialized with recursive and parent path hashsets. Also silently return if there are no cache entries, which is a simple case: there are no paths to make sparse! Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-30sparse-index: copy dir_hash in ensure_full_index()Jeff Hostetler1-0/+1
Copy the 'index_state->dir_hash' back to the real istate after expanding a sparse index. A crash was observed in 'git status' during some hashmap lookups with corrupted hashmap entries. During an index expansion, new cache-entries are added to the 'index_state->name_hash' and the 'dir_hash' in a temporary 'index_state' variable 'full'. However, only the 'name_hash' hashmap from this temp variable was copied back into the real 'istate' variable. The original copy of the 'dir_hash' was incorrectly preserved. If the table in the 'full->dir_hash' hashmap were realloced, the stale version (in 'istate') would be corrupted. The test suite does not operate on index sizes sufficiently large to trigger this reallocation, so they do not cover this behavior. Increasing the test suite to cover such scale is fragile and likely wasteful. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-14sparse-index: recompute cache-treeDerrick Stolee1-0/+2
When some commands run with command_requires_full_index=1, then the index can get in a state where the in-memory cache tree is actually equal to the sparse index's cache tree instead of the full one. This results in incorrect entry_count values. By clearing the cache tree before converting to sparse, we avoid this issue. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-14fsmonitor: integrate with sparse indexDerrick Stolee1-0/+7
If we need to expand a sparse-index into a full one, then the FS Monitor bitmap is going to be incorrect. Ensure that we start fresh at such an event. While this is currently a performance drawback, the eventual hope of the sparse-index feature is that these expansions will be rare and hence we will be able to keep the FS Monitor data accurate across multiple Git commands. These tests are added to demonstrate that the behavior is the same across a full index and a sparse index, but also that file modifications to a tracked directory outside of the sparse cone will trigger ensure_full_index(). Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-14sparse-index: include EXTENDED flag when expandingDerrick Stolee1-1/+1
When creating a full index from a sparse one, we create cache entries for every blob within a given sparse directory entry. These are correctly marked with the CE_SKIP_WORKTREE flag, but the CE_EXTENDED flag is not included. The CE_EXTENDED flag would exist if we loaded a full index from disk with these entries marked with CE_SKIP_WORKTREE, so we can add the flag here to be consistent. This allows us to directly compare the flags present in cache entries when testing the sparse-index feature, but has no significance to its correctness in the user-facing functionality. Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-14sparse-index: skip indexes with unmerged entriesDerrick Stolee1-0/+18
The sparse-index format is designed to be compatible with merge conflicts, even those outside the sparse-checkout definition. The reason is that when converting a full index to a sparse one, a cache entry with nonzero stage will not be collapsed into a sparse directory entry. However, this behavior was not tested, and a different behavior within convert_to_sparse() fails in this scenario. Specifically, cache_tree_update() will fail when unmerged entries exist. convert_to_sparse_rec() uses the cache-tree data to recursively walk the tree structure, but also to compute the OIDs used in the sparse-directory entries. Add an index scan to convert_to_sparse() that will detect if these merge conflict entries exist and skip the conversion before trying to update the cache-tree. This is marked as NEEDSWORK because this can be removed with a suitable update to cache_tree_update() or a similar method that can construct a cache-tree with invalid nodes, but still allow creating the nodes necessary for creating sparse directory entries. It is possible that in the future we will not need to make such an update, since if we do not expand a sparse-index into a full one, this conversion does not need to happen. Thus, this can be deferred until the merge machinery is made to integrate with the sparse-index. Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-21Merge branch 'ds/sparse-index-protections'Junio C Hamano1-1/+1
Fix access to uninitialized piece of memory, introduced during this cycle. * ds/sparse-index-protections: sparse-index: fix uninitialized jump
2021-05-18sparse-index: fix uninitialized jumpDerrick Stolee1-1/+1
While testing the sparse-index, I verified a test with --valgrind and it complained about an uninitialized value being used in a jump in the path_matches_pattern_list() method. The line was this one: if (*dtype == DT_UNKNOWN) In the call stack, the culprit was the initialization of the dtype variable in convert_to_sparse_rec(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-06sparse-index.c: remove set_index_sparse_config()Ævar Arnfjörð Bjarmason1-10/+1
Remove the set_index_sparse_config() function by folding it into set_sparse_index_config(), which was its only user. Since 122ba1f7b52 (sparse-checkout: toggle sparse index from builtin, 2021-03-30) the flow of this code hasn't made much sense, we'd get "enabled" in set_sparse_index_config(), proceed to call set_index_sparse_config() with it. There we'd call prepare_repo_settings() and set "repo->settings.sparse_index = 1", only to needlessly call prepare_repo_settings() again in set_sparse_index_config() (where it would early abort), and finally setting "repo->settings.sparse_index = enabled". Instead we can just call prepare_repo_settings() once, and set the variable to "enabled" in the first place. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-14sparse-index: expand_to_path()Derrick Stolee1-0/+73
Some users of the index API have a specific path they are looking for, but choose to use index_file_exists() to rely on the name-hash hashtable instead of doing binary search with index_name_pos(). These users only need to know a yes/no answer, not a position within the cache array. When the index is sparse, the name-hash hash table does not contain the full list of paths within sparse directories. It _does_ contain the directory names for the sparse-directory entries. Create a helper function, expand_to_path(), for intended use with the name-hash hashtable functions. The integration with name-hash.c will follow in a later change. The solution here is to use ensure_full_index() when we determine that the requested path is within a sparse directory entry. This will populate the name-hash hashtable as the index is recomputed from scratch. There may be cases where the caller is trying to find an untracked path that is not in the index but also is not within a sparse directory entry. We want to minimize the overhead for these requests. If we used index_name_pos() to find the insertion order of the path, then we could determine from that position if a sparse-directory exists. (In fact, just calling index_name_pos() in that case would lead to expanding the index to a full index.) However, this takes O(log N) time where N is the number of cache entries. To keep the performance of this call based mostly on the input string, use index_file_exists() to look for the ancestors of the path. Using the heuristic that a sparse directory is likely to have a small number of parent directories, we start from the bottom and build up. Use a string buffer to allow mutating the path name to terminate after each slash for each hashset test. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30cache-tree: integrate with sparse directory entriesDerrick Stolee1-1/+9
The cache-tree extension was previously disabled with sparse indexes. However, the cache-tree is an important performance feature for commands like 'git status' and 'git add'. Integrate it with sparse directory entries. When writing a sparse index, completely clear and recalculate the cache tree. By starting from scratch, the only integration necessary is to check if we hit a sparse directory entry and create a leaf of the cache-tree that has an entry_count of one and no subtrees. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30sparse-checkout: toggle sparse index from builtinDerrick Stolee1-12/+21
The sparse index extension is used to signal that index writes should be in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1. Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that specifies if the sparse index should be used. It also updates the index to use the correct format, either way. Add a warning in the documentation that the use of a repository extension might reduce compatibility with third-party tools. 'git sparse-checkout init' already sets extension.worktreeConfig, which places most sparse-checkout users outside of the scope of most third-party tools. Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of GIT_TEST_SPARSE_INDEX=1. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30sparse-index: add index.sparse config optionDerrick Stolee1-5/+29
When enabled, this config option signals that index writes should attempt to use sparse-directory entries. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30submodule: sparse-index should not collapse linksDerrick Stolee1-0/+1
A submodule is stored as a "Git link" that actually points to a commit within a submodule. Submodules are populated or not depending on submodule configuration, not sparse-checkout. To ensure that the sparse-index feature integrates correctly with submodules, we should not collapse a directory if there is a Git link within its range. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30sparse-index: convert from full to sparseDerrick Stolee1-0/+139
If we have a full index, then we can convert it to a sparse index by replacing directories outside of the sparse cone with sparse directory entries. The convert_to_sparse() method does this, when the situation is appropriate. For now, we avoid converting the index to a sparse index if: 1. the index is split. 2. the index is already sparse. 3. sparse-checkout is disabled. 4. sparse-checkout does not use cone mode. Finally, we currently limit the conversion to when the GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git config will be added in a later change. The trickiest thing about this conversion is that we might not be able to mark a directory as a sparse directory just because it is outside the sparse cone. There might be unmerged files within that directory, so we need to look for those. Also, if there is some strange reason why a file is not marked with CE_SKIP_WORKTREE, then we should give up on converting that directory. There is still hope that some of its subdirectories might be able to convert to sparse, so we keep looking deeper. The conversion process is assisted by the cache-tree extension. This is calculated from the full index if it does not already exist. We then abandon the cache-tree as it no longer applies to the newly-sparse index. Thus, this cache-tree will be recalculated in every sparse-full-sparse round-trip until we integrate the cache-tree extension with the sparse index. Some Git commands use the index after writing it. For example, 'git add' will update the index, then write it to disk, then read its entries to report information. To keep the in-memory index in a full state after writing, we re-expand it to a full one after the write. This is wasteful for commands that only write the index and do not read from it again, but that is only the case until we make those commands "sparse aware." We can compare the behavior of the sparse-index in t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1 when operating on the 'sparse-index' repo. We can also compare the two sparse repos directly, such as comparing their indexes (when expanded to full in the case of the 'sparse-index' repo). We also verify that the index is actually populated with sparse directory entries. The 'checkout and reset (mixed)' test is marked for failure when comparing a sparse repo to a full repo, but we can compare the two sparse-checkout cases directly to ensure that we are not changing the behavior when using a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30sparse-index: implement ensure_full_index()Derrick Stolee1-1/+97
We will mark an in-memory index_state as having sparse directory entries with the sparse_index bit. These currently cannot exist, but we will add a mechanism for collapsing a full index to a sparse one in a later change. That will happen at write time, so we must first allow parsing the format before writing it. Commands or methods that require a full index in order to operate can call ensure_full_index() to expand that index in-memory. This requires parsing trees using that index's repository. Sparse directory entries have a specific 'ce_mode' value. The macro S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type. This ce_mode is not possible with the existing index formats, so we don't also verify all properties of a sparse-directory entry, which are: 1. ce->ce_mode == 0040000 2. ce->flags & CE_SKIP_WORKTREE is true 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) 4. ce->oid references a tree object. These are all semi-enforced in ensure_full_index() to some extent. Any deviation will cause a warning at minimum or a failure in the worst case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30sparse-index: add guard to ensure full indexDerrick Stolee1-0/+8
Upcoming changes will introduce modifications to the index format that allow sparse directories. It will be useful to have a mechanism for converting those sparse index files into full indexes by walking the tree at those sparse directories. Name this method ensure_full_index() as it will guarantee that the index is fully expanded. This method is not implemented yet, and instead we focus on the scaffolding to declare it and call it at the appropriate time. Add a 'command_requires_full_index' member to struct repo_settings. This will be an indicator that we need the index in full mode to do certain index operations. This starts as being true for every command, then we will set it to false as some commands integrate with sparse indexes. If 'command_requires_full_index' is true, then we will immediately expand a sparse index to a full one upon reading from disk. This suffices for now, but we will want to add more callers to ensure_full_index() later. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>