aboutsummaryrefslogtreecommitdiffstats
path: root/xdiff/xhistogram.c
AgeCommit message (Collapse)AuthorFilesLines
2025-11-18xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hashEzekiel Newren1-2/+2
The ha field is serving two different purposes, which makes the code harder to read. At first glance, it looks like many places assume there could never be hash collisions between lines of the two input files. In reality, line_hash is used together with xdl_recmatch() to ensure correct comparisons of lines, even when collisions occur. To make this clearer, the old ha field has been split: * line_hash: a straightforward hash of a line, independent of any external context. Its type is uint64_t, as it comes from a fixed width hash function. * minimal_perfect_hash: Not a new concept, but now a separate field. It comes from the classifier's general-purpose hash table, which assigns each line a unique and minimal hash across the two files. A size_t is used here because it's meant to be used to index an array. This also avoids ` as usize` casts on the Rust side when using it to index a slice. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-10-03xdiff: change type of xdfile_t.changed from char to boolEzekiel Newren1-4/+4
The only values possible for 'changed' is 1 and 0, which exactly maps to a bool type. It might not look like this because action1 and action2 (which use to be dis1, and dis2) were also of type char and were assigned numerical values within a few lines of 'changed' (what used to be rchg). Using DISCARD/KEEP/INVESTIGATE for action1[i]/action2[j], and true/false for changed[k] makes it clear to future readers that these are logically separate concepts. Best-viewed-with: --color-words Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30xdiff: rename rchg -> changed in xdfile_tEzekiel Newren1-4/+4
The field rchg (now 'changed') declares if a line in a file is changed or not. A later commit will change it's type from 'char' to 'bool' to make its purpose even more clear. Best-viewed-with: --color-words Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30xdiff: delete chastore from xdfile_tEzekiel Newren1-1/+1
xdfile_t currently uses chastore_t which is an arena allocator. I think that xrecord_t used to be a linked list and recs didn't exist originally. When recs was added I think they forgot to remove xdfile_t.next, but was overlooked. This dual data structure setup makes the code somewhat confusing. Additionally the C type chastore_t isn't FFI friendly, and provides little to no performance benefit over using realloc to grow an array. Performance impact of deleting fields from xdfile_t: Deleting ha is about 5% slower. Deleting cha is about 5% faster. Delete ha, but keep cha time hyperfine --warmup 3 -L exe build_v2.51.0/git,build_delete_ha/git '{exe} log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null' Benchmark 1: build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Time (mean ± σ): 1.269 s ± 0.017 s [User: 1.135 s, System: 0.128 s] Range (min … max): 1.249 s … 1.286 s 10 runs Benchmark 2: build_delete_ha/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Time (mean ± σ): 1.339 s ± 0.017 s [User: 1.234 s, System: 0.099 s] Range (min … max): 1.320 s … 1.358 s 10 runs Summary build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null ran 1.06 ± 0.02 times faster than build_delete_ha/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Delete cha, but keep ha time hyperfine --warmup 3 -L exe build_v2.51.0/git,build_delete_chastore/git '{exe} log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null' Benchmark 1: build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Time (mean ± σ): 1.290 s ± 0.001 s [User: 1.154 s, System: 0.130 s] Range (min … max): 1.288 s … 1.292 s 10 runs Benchmark 2: build_delete_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Time (mean ± σ): 1.232 s ± 0.017 s [User: 1.105 s, System: 0.121 s] Range (min … max): 1.205 s … 1.249 s 10 runs Summary build_delete_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null ran 1.05 ± 0.01 times faster than build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Delete ha AND chastore time hyperfine --warmup 3 -L exe build_v2.51.0/git,build_delete_ha_and_chastore/git '{exe} log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null' Benchmark 1: build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Time (mean ± σ): 1.291 s ± 0.002 s [User: 1.156 s, System: 0.129 s] Range (min … max): 1.287 s … 1.295 s 10 runs Benchmark 2: build_delete_ha_and_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Time (mean ± σ): 1.306 s ± 0.001 s [User: 1.195 s, System: 0.105 s] Range (min … max): 1.305 s … 1.308 s 10 runs Summary build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null ran 1.01 ± 0.00 times faster than build_delete_ha_and_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null Best-viewed-with: --color-words Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-12xdiff: avoid signed vs. unsigned comparisons in xhistogram.cDavid Aguilar1-6/+4
The comparisons all involve unsigned variables. Cast the comparison to unsigned to eliminate the mismatch. Signed-off-by: David Aguilar <davvid@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-12xdiff: move sign comparison warning guard into each fileDavid Aguilar1-0/+2
Allow each file to fix the warnings guarded by the macro separately by moving the definition from the shared xinclude.h into each file that needs it. xmerge.c and xprepare.c do not contain any signed vs. unsigned comparisons so the definition was not included in these files. Signed-off-by: David Aguilar <davvid@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-19xdiff: drop unused mmfile parameters from xdl_do_histogram_diff()Jeff King1-2/+1
These are no longer used since 9df0fc3d57 (xdiff: fix a memory leak, 2022-02-16), as the caller is expected to call xdl_prepare_env() itself. After that change the histogram code only examines the prepared xdfenv_t, not the original buffers. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-08xdiff: introduce XDL_CALLOC_ARRAY()Phillip Wood1-6/+3
Add a helper for allocating an array and initialize the elements to zero. This is analogous to CALLOC_ARRAY() in the rest of the codebase but it returns NULL on allocation failures rather than dying to accommodate other users of libxdiff such as libgit2. Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-08xdiff: introduce xdl_callocPhillip Wood1-13/+9
In preparation for introducing XDL_CALLOC_ARRAY() use calloc() to obtain zeroed out memory rather than malloc() followed by memset(). To try and keep the lines a reasonable length this commit also stops casting the pointer returned by calloc() as this is unnecessary. Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16xdiff: fix a memory leakPhillip Wood1-3/+0
Although the patience and histogram algorithms initialize the environment they do not free it if there is an error. In contrast for the Myers algorithm the environment is initalized in xdl_do_diff() and it is freed if there is an error. Fix this by always initializing the environment in xdl_do_diff() and freeing it there if there is an error. Remove the comment in do_patience_diff() about the environment being freed by xdl_diff() as it is not accurate because (a) xdl_diff() does not do that if there is an error and (b) xdl_diff() is not the only caller. Reported-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-04xdiff: drop xpparam_t parameter from histogram cmp_recs()Jeff King1-3/+2
Since 663c5ad035 (diff histogram: intern strings, 2021-11-17), our cmp_recs() does not call xdl_recmatch(), and thus no longer needs an xpparam_t struct from which to get the flags. We can drop the unused parameter from the function, as well as the macro which wraps it. There's no functional change here; it's just simplifying things (and making -Wunused-parameter happier). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-04xdiff: drop CMP_ENV macro from xhistogramJeff King1-3/+0
This macro has been unused since 43ca7530df (xdiff/xhistogram: rely on xdl_trim_ends(), 2011-08-01). The function that called it went away there, and its use in the CMP() macro was inlined. It probably should have been deleted then, but nobody noticed. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-18diff histogram: intern stringsPhillip Wood1-3/+2
Histogram is the only diff algorithm not to call xdl_classify_record(). xdl_classify_record() ensures that the hash values of two strings that are not equal differ which means that it is not necessary to use xdl_recmatch() when comparing lines, all that is necessary is to compare the hash values. This gives a 7% reduction in the runtime of "git log --patch" when using the histogram diff algorithm. Test HEAD^ HEAD ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) +5.6% 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) -1.0% 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) -0.6% 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) -7.4% 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) -0.7% Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-20merge-base, xdiff: zero out xpparam_t structuresMichał Kępień1-0/+2
xpparam_t structures are usually zero-initialized before their specific fields are assigned to, but there are three locations in the tree where that does not happen. Add the missing memset() calls in order to make initialization of xpparam_t structures consistent tree-wide and to prevent stack garbage from being used as field values. Signed-off-by: Michał Kępień <michal@isc.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-28xdiff: remove duplicate headers from xhistogram.cCarlo Marcelo Arenas Belón1-2/+0
8c912eea94 ("teach --histogram to diff", 2011-07-12) included them, but were already part of xinclude.h Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-23xdiff/histogram: remove tail recursionStefan Beller1-6/+14
When running the same reproduction script as the previous patch, it turns out the stack is too small, which can be easily avoided. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-19xdiff/xhistogram: move index allocation into find_lcsStefan Beller1-43/+53
This fixes a memory issue when recursing a lot, which can be reproduced as seq 1 100000 >one seq 1 4 100000 >two git diff --no-index --histogram one two Before this patch, histogram_diff would call itself recursively before calling free_index, which would mean a lot of memory is allocated during the recursion and only freed afterwards. By moving the memory allocation (and its free call) into find_lcs, the memory is free'd before we recurse, such that memory is reused in the next step of the recursion instead of using new memory. This addresses only the memory pressure, not the run time complexity, that is also awful for the corner case outlined above. Helpful in understanding the code (in addition to the sparse history of this file), was https://stackoverflow.com/a/32367597 which reproduces most of the code comments of the JGit implementation. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-19xdiff/xhistogram: factor out memory cleanup into free_index()Stefan Beller1-4/+9
This will be useful in the next patch as we'll introduce multiple callers. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-19xdiff/xhistogram: pass arguments directly to fall_back_to_classic_diffStefan Beller1-11/+11
By passing the 'xpp' and 'env' argument directly to the function 'fall_back_to_classic_diff', we eliminate an occurrence of the 'index' in histogram_diff, which will prove useful in a bit. While at it, move it up in the file. This will make the diff of one of the next patches more legible. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-12Correct common spelling mistakes in comments and testsStefano Lattarini1-1/+1
Most of these were found using Lucas De Marchi's codespell tool. Signed-off-by: Stefano Lattarini <stefano.lattarini@gmail.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Acked-by: Matthieu Moy <Matthieu.Moy@imag.fr> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-19xdiff: PATIENCE/HISTOGRAM are not independent option bitsJunio C Hamano1-1/+1
Because the default Myers, patience and histogram algorithms cannot be in effect at the same time, XDL_PATIENCE_DIFF and XDL_HISTOGRAM_DIFF are not independent bits. Instead of wasting one bit per algorithm, define a few macros to access the few bits they occupy and update the code that access them. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-08xdiff/xhistogram: drop need for additional variableTay Ray Chuan1-5/+4
Having an additional variable (ptr) instead of changing line(1|2) and count(1|2) was for debugging purposes. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-08xdiff/xhistogram: rely on xdl_trim_ends()Tay Ray Chuan1-27/+4
Do away with reduce_common_start_end() and use xdf->dstart and xdf->dend set by xdl_trim_ends() that similarly tells us where the first unmatched line from the start and end occurs. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-08xdiff/xhistogram: rework handling of recursed resultsTay Ray Chuan1-6/+9
Previously we were over-complicating matters by trying to combine the recursed results. Now, terminate immediately if a recursive call failed and return its result. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-07-12teach --histogram to diffTay Ray Chuan1-0/+384
Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its --patience cousin, as well as the default Meyers algorithm. The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's 2^28 line limit. We also use xdiff's default hash table implementation (xdl_hash_bits() with XDL_HASHLONG()) for convenience. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>