git.git - The core git plumbing

Age	Commit message	Author	Files	Lines
2024-12-06	global: mark code units that generate warnings with `-Wsign-compare`	Patrick Steinhardt	1	-0/+2
	Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-26	doc: switch links to https	Josh Soref	1	-1/+1
	These sites offer https versions of their content. Using the https versions provides some protection for users. Signed-off-by: Josh Soref <jsoref@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-13	Sync with Git 2.31.6	Junio C Hamano	1	-34/+43

2022-12-09	utf8: refactor `strbuf_utf8_replace` to not rely on preallocated buffer	Patrick Steinhardt	1	-21/+13
	In `strbuf_utf8_replace`, we preallocate the destination buffer and then use `memcpy` to copy bytes into it at computed offsets. This feels rather fragile and is hard to understand at times. Refactor the code to instead use `strbuf_add` and `strbuf_addstr` so that we can be sure that there is no possibility to perform an out-of-bounds write. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-09	utf8: fix checking for glyph width in `strbuf_utf8_replace()`	Patrick Steinhardt	1	-5/+14
	In `strbuf_utf8_replace()`, we call `utf8_width()` to compute the width of the current glyph. If the glyph is a control character though it can be that `utf8_width()` returns `-1`, but because we assign this value to a `size_t` the conversion will cause us to underflow. This bug can easily be triggered with the following command: $ git log --pretty='format:xxx%<|(1,trunc)%x10' From all I can see though this seems to be a benign underflow that has no security-related consequences. Fix the bug by using an `int` instead. When we see a control character, we now copy it into the target buffer but don't advance the current width of the string. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
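The signed-to-unsigned conversion described in this message can be illustrated with a minimal sketch. `demo_glyph_width()` is a hypothetical stand-in for `utf8_width()` that only handles single bytes; the other two function names are likewise made up for illustration and are not Git's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf8_width(): -1 for control characters. */
static int demo_glyph_width(unsigned char ch)
{
	return (ch < 0x20 || ch == 0x7f) ? -1 : 1;
}

/* Buggy pattern: storing the result in a size_t turns -1 into SIZE_MAX. */
static size_t width_as_size_t(unsigned char ch)
{
	size_t w = (size_t)demo_glyph_width(ch);
	return w;
}

/* Fixed pattern: keep the width in an int and advance by zero columns
 * when the glyph is a control character. */
static int columns_advanced(unsigned char ch)
{
	int w = demo_glyph_width(ch);
	return w < 0 ? 0 : w;
}
```

With the `int` variant, the caller can still copy the control character but leaves the running width unchanged.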
2022-12-09	utf8: fix overflow when returning string width	Patrick Steinhardt	1	-3/+9
	The return type of both `utf8_strwidth()` and `utf8_strnwidth()` is `int`, but we operate on string lengths which are typically of type `size_t`. This means that when the string is longer than `INT_MAX`, we will overflow and thus return a negative result. This can lead to an out-of-bounds write with `--pretty=format:%<1)%B` and a commit message that is 2^31+1 bytes long: ================================================================= ==26009==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000001168 at pc 0x7f95c4e5f427 bp 0x7ffd8541c900 sp 0x7ffd8541c0a8 WRITE of size 2147483649 at 0x603000001168 thread T0 #0 0x7f95c4e5f426 in __interceptor_memcpy /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 #1 0x5612bbb1068c in format_and_pad_commit pretty.c:1763 #2 0x5612bbb1087a in format_commit_item pretty.c:1801 #3 0x5612bbc33bab in strbuf_expand strbuf.c:429 #4 0x5612bbb110e7 in repo_format_commit_message pretty.c:1869 #5 0x5612bbb12d96 in pretty_print_commit pretty.c:2161 #6 0x5612bba0a4d5 in show_log log-tree.c:781 #7 0x5612bba0d6c7 in log_tree_commit log-tree.c:1117 #8 0x5612bb691ed5 in cmd_log_walk_no_free builtin/log.c:508 #9 0x5612bb69235b in cmd_log_walk builtin/log.c:549 #10 0x5612bb6951a2 in cmd_log builtin/log.c:883 #11 0x5612bb56c993 in run_builtin git.c:466 #12 0x5612bb56d397 in handle_builtin git.c:721 #13 0x5612bb56db07 in run_argv git.c:788 #14 0x5612bb56e8a7 in cmd_main git.c:923 #15 0x5612bb803682 in main common-main.c:57 #16 0x7f95c4c3c28f (/usr/lib/libc.so.6+0x2328f) #17 0x7f95c4c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) #18 0x5612bb5680e4 in _start ../sysdeps/x86_64/start.S:115 0x603000001168 is located 0 bytes to the right of 24-byte region [0x603000001150,0x603000001168) allocated by thread T0 here: #0 0x7f95c4ebe7ea in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:85 #1 0x5612bbcdd556 in xrealloc wrapper.c:136 #2 0x5612bbc310a3 in strbuf_grow 
strbuf.c:99 #3 0x5612bbc32acd in strbuf_add strbuf.c:298 #4 0x5612bbc33aec in strbuf_expand strbuf.c:418 #5 0x5612bbb110e7 in repo_format_commit_message pretty.c:1869 #6 0x5612bbb12d96 in pretty_print_commit pretty.c:2161 #7 0x5612bba0a4d5 in show_log log-tree.c:781 #8 0x5612bba0d6c7 in log_tree_commit log-tree.c:1117 #9 0x5612bb691ed5 in cmd_log_walk_no_free builtin/log.c:508 #10 0x5612bb69235b in cmd_log_walk builtin/log.c:549 #11 0x5612bb6951a2 in cmd_log builtin/log.c:883 #12 0x5612bb56c993 in run_builtin git.c:466 #13 0x5612bb56d397 in handle_builtin git.c:721 #14 0x5612bb56db07 in run_argv git.c:788 #15 0x5612bb56e8a7 in cmd_main git.c:923 #16 0x5612bb803682 in main common-main.c:57 #17 0x7f95c4c3c28f (/usr/lib/libc.so.6+0x2328f) SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 in __interceptor_memcpy Shadow bytes around the buggy address: 0x0c067fff81d0: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x0c067fff81e0: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd 0x0c067fff81f0: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa 0x0c067fff8200: fd fd fd fa fa fa fd fd fd fd fa fa 00 00 00 fa 0x0c067fff8210: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd =>0x0c067fff8220: fd fa fa fa fd fd fd fa fa fa 00 00 00[fa]fa fa 0x0c067fff8230: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8240: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8250: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8260: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8270: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: 
f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==26009==ABORTING Now the proper fix for this would be to convert both functions to return a `size_t` instead of an `int`. But given that this commit may be part of a security release, let's instead do the minimal viable fix and die in case we see an overflow. Add a test that would have previously caused us to crash. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
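The "minimal viable fix" amounts to a range check before narrowing a `size_t` width to `int`. The sketch below assumes a hypothetical helper, `demo_width_to_int()`, that reports failure instead of calling Git's `die()`.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: narrow a size_t width to int.
 * Returns 0 on success and -1 if the value does not fit;
 * the commit described above dies at this point instead.
 */
static int demo_width_to_int(size_t width, int *out)
{
	if (width > (size_t)INT_MAX)
		return -1;
	*out = (int)width;
	return 0;
}
```

A caller would treat the `-1` return as fatal rather than continue with a wrapped, negative width.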
2022-12-09	utf8: fix returning negative string width	Patrick Steinhardt	1	-2/+6
	The `utf8_strnwidth()` function calls `utf8_width()` in a loop and adds its returned width to the end result. `utf8_width()` can return `-1` though in case it reads a control character, which means that the computed string width is going to be wrong. In the worst case where there are more control characters than non-control characters, we may even return a negative string width. Fix this bug by treating control characters as having zero width. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
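A rough sketch of the summing problem, restricted to single-byte characters. All names here are hypothetical; Git's `utf8_strnwidth()` decodes full UTF-8 sequences, which this example does not attempt.

```c
/* Hypothetical per-byte width: -1 for control characters, 1 otherwise. */
static int demo_byte_width(unsigned char ch)
{
	return (ch < 0x20 || ch == 0x7f) ? -1 : 1;
}

/* Naive sum: every control character subtracts one column. */
static int naive_strwidth(const char *s)
{
	int width = 0;
	for (; *s; s++)
		width += demo_byte_width((unsigned char)*s);
	return width;
}

/* Fixed sum: control characters contribute zero columns. */
static int fixed_strwidth(const char *s)
{
	int width = 0;
	for (; *s; s++) {
		int w = demo_byte_width((unsigned char)*s);
		width += w < 0 ? 0 : w;
	}
	return width;
}
```

For a string with two control characters and one letter, the naive sum is negative while the fixed sum is 1.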
2022-12-09	utf8: fix truncated string lengths in `utf8_strnwidth()`	Patrick Steinhardt	1	-5/+3
	The `utf8_strnwidth()` function accepts an optional string length as input parameter. This parameter can either be set to `-1`, in which case we call `strlen()` on the input. Or it can be set to a positive integer that indicates a precomputed length, which callers typically compute by calling `strlen()` at some point themselves. The input parameter is an `int` though, whereas `strlen()` returns a `size_t`. This can lead to implementation-defined behaviour though when the `size_t` cannot be represented by the `int`. In the general case though this leads to wrap-around and thus to negative string sizes, which is sure enough to not lead to well-defined behaviour. Fix this by accepting a `size_t` instead of an `int` as string length. While this takes away the ability of callers to simply pass in `-1` as string length, it really is trivial enough to convert them to instead pass in `strlen()` instead. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-04	t0060: test ntfs/hfs-obscured dotfiles	Jeff King	1	-0/+5
	We have tests that cover various filesystem-specific spellings of ".gitmodules", because we need to reliably identify that path for some security checks. These are from dc2d9ba318 (is_{hfs,ntfs}_dotgitmodules: add tests, 2018-05-12), with the actual code coming from e7cb0b4455 (is_ntfs_dotgit: match other .git files, 2018-05-11) and 0fc333ba20 (is_hfs_dotgit: match other .git files, 2018-05-02). Those latter two commits also added similar matching functions for .gitattributes and .gitignore. These ended up not being used in the final series, and are currently dead code. But in preparation for them being used in some fsck checks, let's make sure they actually work by throwing a few basic tests at them. Likewise, let's cover .mailmap (which does need matching code added). I didn't bother with the whole battery of tests that we cover for .gitmodules. These functions are all based on the same generic matcher, so it's sufficient to test most of the corner cases just once. Note that the ntfs magic prefix names in the tests come from the algorithm described in e7cb0b4455 (and are different for each file). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-10	utf8: use skip_iprefix() in same_utf_encoding()	René Scharfe	1	-5/+4
	Get rid of magic numbers by using skip_iprefix() and skip_prefix() for parsing the leading "[uU][tT][fF]-?" of both strings instead of checking with istarts_with() and an explicit comparison. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
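The prefix-skipping approach might look roughly like the following. `demo_skip_iprefix()` and `demo_same_utf_encoding()` are hypothetical re-implementations written for this example; they mirror the idea in the message, not Git's actual source.

```c
#include <ctype.h>
#include <stddef.h>

/* If str starts with prefix (ignoring case), store the remainder in *out
 * and return 1; otherwise return 0 and leave *out untouched. */
static int demo_skip_iprefix(const char *str, const char *prefix,
			     const char **out)
{
	while (*prefix) {
		if (tolower((unsigned char)*str) != tolower((unsigned char)*prefix))
			return 0;
		str++;
		prefix++;
	}
	*out = str;
	return 1;
}

/* Compare two names of the form "[uU][tT][fF]-?<rest>" case-insensitively,
 * treating the dash after "utf" as optional. */
static int demo_same_utf_encoding(const char *a, const char *b)
{
	if (!demo_skip_iprefix(a, "utf", &a) || !demo_skip_iprefix(b, "utf", &b))
		return 0;
	if (*a == '-')
		a++;
	if (*b == '-')
		b++;
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b; /* both must have reached the terminating NUL */
}
```

No offsets such as 3 or 4 appear in the comparison; the helpers advance the pointers themselves.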
2019-10-12	utf8: use ARRAY_SIZE() in git_wcwidth()	Beat Bolli	1	-4/+2
	This macro has been available globally since b4f2a6ac92 ("Use #define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))", 2006-03-09), so let's use it. Signed-off-by: Beat Bolli <dev+git@drbeat.li> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-11	utf8: handle systems that don't write BOM for UTF-16	brian m. carlson	1	-0/+14
	When serializing UTF-16 (and UTF-32), there are three possible ways to write the stream. One can write the data with a BOM in either big-endian or little-endian format, or one can write the data without a BOM in big-endian format. Most systems' iconv implementations choose to write it with a BOM in some endianness, since this is the most foolproof, and it is resistant to misinterpretation on Windows, where UTF-16 and the little-endian serialization are very common. For compatibility with Windows and to avoid accidental misuse there, Git always wants to write UTF-16 with a BOM, and will refuse to read UTF-16 without it. However, musl's iconv implementation writes UTF-16 without a BOM, relying on the user to interpret it as big-endian. This causes t0028 and the related functionality to fail, since Git won't read the file without a BOM. Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the iconv implementation has this behavior. When set, Git will write a BOM manually for UTF-16 and UTF-32 and then force the data to be written in UTF-16BE or UTF-32BE. We choose big-endian behavior here because the tests use the raw "UTF-16" encoding, which will be big-endian when the implementation requires this knob to be set. Update the tests to detect this case and write test data with an added BOM if necessary. Always write the BOM in the tests in big-endian format, since all iconv implementations that omit a BOM must use big-endian serialization according to the Unicode standard. Preserve the existing behavior for systems which do not have this knob enabled, since they may use optimized implementations, including defaulting to the native endianness, which may improve performance. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-31	Support working-tree-encoding "UTF-16LE-BOM"	Torsten Bögershausen	1	-10/+32
	Users who want UTF-16 files in the working tree set the .gitattributes like this: test.txt working-tree-encoding=UTF-16 The unicode standard itself defines 3 allowed ways how to encode UTF-16. The following 3 versions convert all back to 'g' 'i' 't' in UTF-8: a) UTF-16, without BOM, big endian: $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t b) UTF-16, with BOM, little endian: $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t c) UTF-16, with BOM, big endian: $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the working tree. After a checkout, the resulting file has a BOM and is encoded in "UTF-16", in the version (c) above. This is what iconv generates, more details follow below. iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE: d) UTF-16 $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c 0000000 376 377 \0 g \0 i \0 t e) UTF-16LE $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c 0000000 g \0 i \0 t \0 f) UTF-16BE $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c 0000000 \0 g \0 i \0 t There is no way to generate version (b) from above in a Git working tree, but that is what some applications need. (All fully unicode aware applications should be able to read all 3 variants, but in practice we are not there yet). When producing UTF-16 as an output, iconv generates the big endian version with a BOM. (big endian is probably chosen for historical reasons). iconv can produce UTF-16 files with little endianness by using "UTF-16LE" as encoding, and that file does not have a BOM. Not all users (especially under Windows) are happy with this. Some tools are not fully unicode aware and can only handle version (b). Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will be used in all future iconv versions (for compatibility reasons). Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM". libiconv cannot handle the encoding, so Git picks it up, handles the BOM and uses libiconv to convert the rest of the stream. (UTF-16BE-BOM is added for consistency) Reported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com> Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
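To make version (b) concrete, here is a small sketch that writes the BOM by hand and then emits little-endian code units. It is an assumption-laden illustration: `demo_ascii_to_utf16le_bom()` is a hypothetical function that only accepts ASCII input, whereas Git hands the actual conversion of the rest of the stream to libiconv.

```c
#include <stddef.h>

/*
 * Hypothetical helper: encode an ASCII string as UTF-16LE preceded by
 * the little-endian BOM (0xFF 0xFE). Returns the number of bytes
 * written, or 0 if the input is not ASCII or the buffer is too small.
 */
static size_t demo_ascii_to_utf16le_bom(const char *in, unsigned char *out,
					size_t outsz)
{
	size_t len = 0, n = 0, i;

	while (in[len])
		len++;
	if (outsz < 2 || (outsz - 2) / 2 < len)
		return 0;

	out[n++] = 0xff; /* BOM, little-endian byte order */
	out[n++] = 0xfe;
	for (i = 0; i < len; i++) {
		if ((unsigned char)in[i] & 0x80)
			return 0; /* non-ASCII is out of scope for this sketch */
		out[n++] = (unsigned char)in[i]; /* low byte first */
		out[n++] = 0;                    /* high byte */
	}
	return n;
}
```

Encoding "git" this way yields the same eight bytes as the `printf` input shown for version (b).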
2018-08-15	Merge branch 'jk/size-t'	Junio C Hamano	1	-5/+5
	Code clean-up to use size_t/ssize_t when they are the right type. * jk/size-t: strbuf_humanise: use unsigned variables pass st.st_size as hint for strbuf_readlink() strbuf_readlink: use ssize_t strbuf: use size_t for length in intermediate variables reencode_string: use size_t for string lengths reencode_string: use st_add/st_mult helpers
2018-07-24	reencode_string: use size_t for string lengths	Jeff King	1	-3/+3
	The iconv interface takes a size_t, which is the appropriate type for an in-memory buffer. But our reencode_string_* functions use integers, meaning we may get confusing results when the sizes exceed INT_MAX. Let's use size_t consistently. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-24	reencode_string: use st_add/st_mult helpers	Jeff King	1	-2/+2
	When converting a string with iconv, if the output buffer isn't big enough, we grow it. But our growth is done without any concern for integer overflow. So when we add: outalloc = sofar + insz * 2 + 32; we may end up wrapping outalloc (which is a size_t), and allocating a too-small buffer. We then manipulate it further: outsz = outalloc - sofar - 1; and feed outsz back to iconv. If outalloc is wrapped and smaller than sofar, we'll end up with a small allocation but feed a very large outsz to iconv, which could result in it overflowing the buffer. Can we use this to construct an attack wherein the victim clones a repository with a very large commit object with an encoding header, and running "git log" reencodes it into utf8, causing an overflow? An attack of this sort is likely impossible in practice. "sofar" is how many output bytes we've written total, and "insz" is the number of input bytes remaining. Imagine our input doubles in size as we output it (which is easy to do by converting latin1 to utf8, for example), and that we start with N input bytes. Our initial output buffer also starts at N bytes, so after the first call we'd have N/2 input bytes remaining (insz), and have written N bytes (sofar). That means our next allocation will be (N + N/2 * 2 + 32) bytes, or (2N + 32). We can therefore overflow a 32-bit size_t with a commit message that's just under 2^31 bytes, assuming it consists mostly of "doubling" sequences (e.g., latin1 0xe1 which becomes utf8 0xc3 0xa1). But we'll never make it that far with such a message. We'll be spending 2^31 bytes on the original string. And our initial output buffer will also be 2^31 bytes. Which is not going to succeed on a system with a 32-bit size_t, since there will be other things using the address space, too. The initial malloc will fail. If we imagine instead that we can triple the size when converting, then our second allocation becomes (N + 2/3N * 2 + 32), or (7/3N + 32). 
That still requires two allocations of 3/7 of our address space (6/7 of the total) to succeed. If we imagine we can quadruple, it becomes (5/2N + 32); we need to be able to allocate 4/5 of the address space to succeed. This might start to get plausible. But is it possible to get a 4-to-1 increase in size? Probably if you're converting to some obscure encoding. But since git defaults to utf8 for its output, that's the likely destination encoding for an attack. And while there are 4-character utf8 sequences, it's unlikely that you'd be able to find a single-byte source sequence in any encoding. So this is certainly buggy code which should be fixed, but it is probably not a useful attack vector. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
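The overflow-checked growth computation can be sketched as follows. The helper names are hypothetical, and unlike Git's `st_add()` and `st_mult()`, which die on overflow, these versions return a status so the behaviour is easy to test.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a + b in *out and return 1, or return 0 if the sum would wrap. */
static int demo_add(size_t a, size_t b, size_t *out)
{
	if (a > SIZE_MAX - b)
		return 0;
	*out = a + b;
	return 1;
}

/* Store a * b in *out and return 1, or return 0 if the product would wrap. */
static int demo_mult(size_t a, size_t b, size_t *out)
{
	if (b && a > SIZE_MAX / b)
		return 0;
	*out = a * b;
	return 1;
}

/* Checked form of: outalloc = sofar + insz * 2 + 32 */
static int demo_outalloc(size_t sofar, size_t insz, size_t *out)
{
	size_t doubled, sum;

	if (!demo_mult(insz, 2, &doubled))
		return 0;
	if (!demo_add(sofar, doubled, &sum))
		return 0;
	return demo_add(sum, 32, out);
}
```

Each intermediate step is checked, so a wrapped `outalloc` can never reach the allocator or be fed back to iconv as `outsz`.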
2018-07-09	utf8.c: avoid char overflow	Beat Bolli	1	-4/+4
	In ISO C, char constants must be in the range -128..127. Change the BOM constants to char literals to avoid overflow. Signed-off-by: Beat Bolli <dev+git@drbeat.li> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-29	Sync with Git 2.17.1	Junio C Hamano	1	-12/+46
	* maint: (25 commits) Git 2.17.1 Git 2.16.4 Git 2.15.2 Git 2.14.4 Git 2.13.7 fsck: complain when .gitmodules is a symlink index-pack: check .gitmodules files with --strict unpack-objects: call fsck_finish() after fscking objects fsck: call fsck_finish() after fscking objects fsck: check .gitmodules content fsck: handle promisor objects in .gitmodules check fsck: detect gitmodules files fsck: actually fsck blob data fsck: simplify ".git" check index-pack: make fsck error message more specific verify_path: disallow symlinks in .gitmodules update-index: stat updated files earlier verify_dotfile: mention case-insensitivity in comment verify_path: drop clever fallthrough skip_prefix: add case-insensitive variant ...
2018-05-22	Sync with Git 2.14.4	Junio C Hamano	1	-12/+46
	* maint-2.14: Git 2.14.4 Git 2.13.7 verify_path: disallow symlinks in .gitmodules update-index: stat updated files earlier verify_dotfile: mention case-insensitivity in comment verify_path: drop clever fallthrough skip_prefix: add case-insensitive variant is_{hfs,ntfs}_dotgitmodules: add tests is_ntfs_dotgit: match other .git files is_hfs_dotgit: match other .git files is_ntfs_dotgit: use a size_t for traversing string submodule-config: verify submodule names as paths
2018-05-21	is_hfs_dotgit: match other .git files	Jeff King	1	-12/+46
	Both verify_path() and fsck match ".git", ".GIT", and other variants specific to HFS+. Let's allow matching other special files like ".gitmodules", which we'll later use to enforce extra restrictions via verify_path() and fsck. Signed-off-by: Jeff King <peff@peff.net>
2018-05-08	Merge branch 'ls/checkout-encoding'	Junio C Hamano	1	-2/+63
	The new "checkout-encoding" attribute can ask Git to convert the contents to the specified encoding when checking out to the working tree (and the other way around when checking in). * ls/checkout-encoding: convert: add round trip check based on 'core.checkRoundtripEncoding' convert: add tracing for 'working-tree-encoding' attribute convert: check for detectable errors in UTF encodings convert: add 'working-tree-encoding' attribute utf8: add function to detect a missing UTF-16/32 BOM utf8: add function to detect prohibited UTF-16/32 BOM utf8: teach same_encoding() alternative UTF encoding names strbuf: add a case insensitive starts_with() strbuf: add xstrdup_toupper() strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
2018-04-16	utf8: add function to detect a missing UTF-16/32 BOM	Lars Schneider	1	-0/+13
	If the endianness is not defined in the encoding name, then let's be strict and require a BOM to avoid any encoding confusion. The is_missing_required_utf_bom() function returns true if a required BOM is missing. The Unicode standard instructs to assume big-endian if there is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used in HTML5 recommends to assume little-endian to "deal with deployed content" [3]. Strictly requiring a BOM seems to be the safest option for content in Git. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#gen6 [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf Section 3.10, D98, page 132 [3] https://encoding.spec.whatwg.org/#utf-16le Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
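A simplified sketch of the check described here. `demo_is_missing_required_bom()` is a hypothetical name, and it compares the encoding name exactly, so it does not recognize alternative spellings such as "utf16".

```c
#include <stddef.h>
#include <string.h>

static const unsigned char demo_utf16_be_bom[] = { 0xFE, 0xFF };
static const unsigned char demo_utf16_le_bom[] = { 0xFF, 0xFE };
static const unsigned char demo_utf32_be_bom[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char demo_utf32_le_bom[] = { 0xFF, 0xFE, 0x00, 0x00 };

/* Return 1 if data starts with the given BOM. */
static int demo_starts_with(const unsigned char *data, size_t len,
			    const unsigned char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Return 1 if enc leaves the byte order unspecified ("UTF-16" or
 * "UTF-32") and data does not begin with either BOM.
 */
static int demo_is_missing_required_bom(const char *enc,
					const unsigned char *data, size_t len)
{
	if (!strcmp(enc, "UTF-16"))
		return !(demo_starts_with(data, len, demo_utf16_be_bom, 2) ||
			 demo_starts_with(data, len, demo_utf16_le_bom, 2));
	if (!strcmp(enc, "UTF-32"))
		return !(demo_starts_with(data, len, demo_utf32_be_bom, 4) ||
			 demo_starts_with(data, len, demo_utf32_le_bom, 4));
	return 0;
}
```

Encodings that name their byte order, such as "UTF-16LE", never report a missing BOM in this sketch.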
2018-04-16	utf8: add function to detect prohibited UTF-16/32 BOM	Lars Schneider	1	-0/+26
	Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used [1]. The function returns true if this is the case. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#bom10 Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
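The opposite check can be sketched in the same spirit. `demo_has_prohibited_bom()` is a hypothetical name; treating a BOM of either byte order as prohibited, and comparing the encoding name exactly, are assumptions of this example.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if data starts with the bom_len bytes at bom. */
static int demo_has_prefix(const char *data, size_t len,
			   const char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Return 1 if enc declares its byte order explicitly and data
 * nevertheless begins with a BOM.
 */
static int demo_has_prohibited_bom(const char *enc, const char *data,
				   size_t len)
{
	if (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE"))
		return demo_has_prefix(data, len, "\xFE\xFF", 2) ||
		       demo_has_prefix(data, len, "\xFF\xFE", 2);
	if (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE"))
		return demo_has_prefix(data, len, "\0\0\xFE\xFF", 4) ||
		       demo_has_prefix(data, len, "\xFF\xFE\0\0", 4);
	return 0;
}
```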
2018-04-16	utf8: teach same_encoding() alternative UTF encoding names	Lars Schneider	1	-2/+24
	The function same_encoding() could only recognize alternative names for UTF-8 encodings. Teach it to recognize all kinds of alternative UTF encoding names (e.g. utf16). While we are at it, fix a crash that would occur if same_encoding() was called with a NULL argument and a non-NULL argument. This function is used in a subsequent commit. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11	unicode_width.h: rename to use dash in file name	Stefan Beller	1	-1/+1
	This is more consistent with the project style. The majority of Git's source files use dashes in preference to underscores in their file names. Also adjust contrib/update-unicode as well. Signed-off-by: Stefan Beller <sbeller@google.com>
2017-10-10	cleanup: fix possible overflow errors in binary search	Derrick Stolee	1	-1/+1
	A common mistake when writing binary search is to allow possible integer overflow by using the simple average: mid = (min + max) / 2; Instead, use the overflow-safe version: mid = min + (max - min) / 2; This translation is safe since the operation occurs inside a loop conditioned on "min < max". The included changes were found using the following git grep: git grep '/ 2;' '*.c' Making this cleanup will prevent future review friction when a new binary search is constructed based on existing code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
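The two midpoint formulas can be compared directly. The sketch uses `size_t` so that the wrap-around is well defined and observable; the function names and the small search routine are hypothetical and not taken from Git.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple average: min + max can wrap around for large indices. */
static size_t naive_mid(size_t min, size_t max)
{
	return (min + max) / 2;
}

/* Overflow-safe average: valid whenever min <= max. */
static size_t safe_mid(size_t min, size_t max)
{
	return min + (max - min) / 2;
}

/* Return the index of key in sorted[0..n), or n if it is absent. */
static size_t demo_bsearch(const int *sorted, size_t n, int key)
{
	size_t min = 0, max = n;

	while (min < max) {
		size_t mid = safe_mid(min, max);
		if (sorted[mid] == key)
			return mid;
		if (sorted[mid] < key)
			min = mid + 1;
		else
			max = mid;
	}
	return n;
}
```

Near the top of the `size_t` range the naive midpoint lands below `min`, while the safe form stays between the two bounds.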
2017-09-07	utf8: release strbuf on error return in strbuf_utf8_replace()	Rene Scharfe	1	-1/+2
	Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26	utf8: accept "latin-1" as ISO-8859-1	Junio C Hamano	1	-0/+7
	Even though latin-1 is still seen in e-mail headers, some platforms only install ISO-8859-1. "iconv -f ISO-8859-1" succeeds, while "iconv -f latin-1" fails on such a system. Using the same fallback_encoding() mechanism factored out in the previous step, teach ourselves that "ISO-8859-1" has a better chance of being accepted than "latin-1". Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26	utf8: refactor code to decide fallback encoding	Junio C Hamano	1	-11/+18
	The codepath we use to call iconv_open() has a provision to use a fallback encoding when it fails, hoping that "UTF-8" being spelled differently could be the reason why the library function did not like the encoding names we gave it. Essentially, we turn what we have observed to be used as variants of "UTF-8" (e.g. "utf8") into the most official spelling and use that as a fallback. We do the same thing for input and output encoding. Introduce a helper function to do just one side and call that twice. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-17	utf8: add function to align a string into given strbuf	Karthik Nayak	1	-0/+21
	Add strbuf_utf8_align() which will align a given string into a strbuf as per given align_type and width. If the width is greater than the string length then no alignment is performed. Helped-by: Eric Sunshine <sunshine@sunshineco.com> Mentored-by: Christian Couder <christian.couder@gmail.com> Mentored-by: Matthieu Moy <matthieu.moy@grenoble-inp.fr> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-16	utf8-bom: introduce skip_utf8_bom() helper	Junio C Hamano	1	-0/+11
	With the recent change to ignore the UTF8 BOM at the beginning of .gitignore files, we now have two codepaths that do such a skipping (the other one is for reading the configuration files). Introduce utf8_bom[] constant string and skip_utf8_bom() helper and teach .gitignore code how to use it. Signed-off-by: Junio C Hamano <gitster@pobox.com>
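A helper of this kind can be sketched in a few lines. The names `demo_utf8_bom` and `demo_skip_utf8_bom()` are chosen for this example; the signature is an assumption modelled on the description above.

```c
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+FEFF: bytes 0xEF 0xBB 0xBF. */
static const char demo_utf8_bom[] = "\357\273\277";

/*
 * If the buffer at *text (len bytes) starts with a UTF-8 BOM, advance
 * *text past it and return 1; otherwise leave *text alone and return 0.
 */
static int demo_skip_utf8_bom(char **text, size_t len)
{
	size_t bom_len = strlen(demo_utf8_bom);

	if (len < bom_len || memcmp(*text, demo_utf8_bom, bom_len))
		return 0;
	*text += bom_len;
	return 1;
}
```

Both the configuration reader and the .gitignore reader could then call the same function on their input buffers.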
2015-01-07	Merge branch 'maint-2.1' into maint	Junio C Hamano	1	-12/+20
	* maint-2.1: is_hfs_dotgit: loosen over-eager match of \u{..47}
2015-01-07	Merge branch 'maint-2.0' into maint-2.1	Junio C Hamano	1	-12/+20
	* maint-2.0: is_hfs_dotgit: loosen over-eager match of \u{..47}
2015-01-07	Merge branch 'maint-1.9' into maint-2.0	Junio C Hamano	1	-12/+20
	* maint-1.9: is_hfs_dotgit: loosen over-eager match of \u{..47}
2015-01-07	Merge branch 'maint-1.8.5' into maint-1.9	Junio C Hamano	1	-12/+20
	* maint-1.8.5: is_hfs_dotgit: loosen over-eager match of \u{..47}
2014-12-29	is_hfs_dotgit: loosen over-eager match of \u{..47}	Jeff King	1	-12/+20
	Our is_hfs_dotgit function relies on the hackily-implemented next_hfs_char to give us the next character that an HFS+ filename comparison would look at. It's hacky because it doesn't implement the full case-folding table of HFS+; it gives us just enough to see if the path matches ".git". At the end of next_hfs_char, we use tolower() to convert our 32-bit code point to lowercase. Our tolower() implementation only takes an 8-bit char, though; it throws away the upper 24 bits. This means we can't have any false negatives for is_hfs_dotgit. We only care about matching 7-bit ASCII characters in ".git", and we will correctly process 'G' or 'g'. However, we _can_ have false positives. Because we throw away the upper bits, code point \u{0147} (for example) will look like 'G' and get downcased to 'g'. It's not known whether a sequence of code points whose truncation ends up as ".git" is meaningful in any language, but it does not hurt to be more accurate here. We can just pass out the full 32-bit code point, and compare it manually to the upper and lowercase characters we care about. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
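The false positive comes from truncating the code point to 8 bits before lowercasing. The two hypothetical functions below contrast the truncating comparison with a comparison on the full 32-bit value.

```c
#include <ctype.h>
#include <stdint.h>

/* Old pattern: only the low 8 bits survive, so U+0147 looks like 'G'. */
static int matches_g_truncated(uint32_t cp)
{
	return tolower((unsigned char)cp) == 'g';
}

/* New pattern: compare the full code point against both cases. */
static int matches_g_full(uint32_t cp)
{
	return cp == 'g' || cp == 'G';
}
```

Code point 0x0147 has low byte 0x47, which is ASCII 'G', so the truncating version matches it while the full comparison does not.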
2014-12-17	Sync with v2.1.4	Junio C Hamano	1	-0/+64
	* maint-2.1: Git 2.1.4 Git 2.0.5 Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17	Sync with v2.0.5	Junio C Hamano	1	-0/+64
	* maint-2.0: Git 2.0.5 Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17	Sync with v1.9.5	Junio C Hamano	1	-0/+64
	* maint-1.9: Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17	Sync with v1.8.5.6	Junio C Hamano	1	-0/+64
	* maint-1.8.5: Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17	utf8: add is_hfs_dotgit() helper	Jeff King	1	-0/+64
	We do not allow paths with a ".git" component to be added to the index, as that would mean repository contents could overwrite our repository files. However, asking "is this path the same as .git" is not as simple as strcmp() on some filesystems. HFS+'s case-folding does more than just fold uppercase into lowercase (which we already handle with strcasecmp). It may also skip past certain "ignored" Unicode code points, so that (for example) ".gi\u200ct" is mapped to ".git". The full list of folds can be found in the tables at: https://www.opensource.apple.com/source/xnu/xnu-1504.15.3/bsd/hfs/hfscommon/Unicode/UCStringCompareData.h Implementing a full "is this path the same as that path" comparison would require us importing the whole set of tables. However, what we want to do is much simpler: we only care about checking ".git". We know that 'G' is the only thing that folds to 'g', and so on, so we really only need to deal with the set of ignored code points, which is much smaller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
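The reduced comparison might be sketched like this. To stay short, the example works on already-decoded code points and ignores only a small illustrative subset (U+200C, U+200D, U+FEFF), not the full HFS+ table; all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset of the code points HFS+ ignores in comparisons. */
static int demo_is_ignored(uint32_t cp)
{
	return cp == 0x200c || cp == 0x200d || cp == 0xfeff;
}

/*
 * Return 1 if the code points spell ".git" once ignored code points
 * are skipped and ASCII uppercase is folded to lowercase.
 */
static int demo_is_hfs_dotgit(const uint32_t *cps, size_t n)
{
	static const char want[] = ".git";
	size_t matched = 0, i;

	for (i = 0; i < n; i++) {
		uint32_t cp = cps[i];

		if (demo_is_ignored(cp))
			continue;
		if (matched == 4)
			return 0; /* extra visible character after ".git" */
		if (cp >= 'A' && cp <= 'Z')
			cp += 'a' - 'A';
		if (cp != (uint32_t)want[matched])
			return 0;
		matched++;
	}
	return matched == 4;
}
```

Because only four ASCII letters matter, no general case-folding table is needed.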
2014-09-19	Merge branch 'rs/export-strbuf-addchars'	Junio C Hamano	1	-7/+0
	Code clean-up. * rs/export-strbuf-addchars: strbuf: use strbuf_addchars() for adding a char multiple times strbuf: export strbuf_addchars()
2014-09-09	Merge branch 'nd/strbuf-utf8-replace'	Junio C Hamano	1	-0/+3
	* nd/strbuf-utf8-replace: utf8.c: fix strbuf_utf8_replace() consuming data beyond input string
2014-09-08	strbuf: export strbuf_addchars()	René Scharfe	1	-7/+0
	Move strbuf_addchars() to strbuf.c, where it belongs, and make it available for other callers. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-08-11	utf8.c: fix strbuf_utf8_replace() consuming data beyond input string	Nguyễn Thái Ngọc Duy	1	-0/+3
	The main loop in strbuf_utf8_replace() could be summed up as: while ('src' is still valid) { 1) advance 'src' to copy ANSI escape sequences 2) advance 'src' to copy/replace visible characters } The problem is that after #1, 'src' may have reached the end of the string (so 'src' points to NUL) and #2 will continue to copy that NUL as if it were a normal character. Because the output is stored in a strbuf, this NUL is accounted for in the 'len' field as well. Check after #1 and break the loop if necessary. The test does not look obvious, but the combination of %>>() should make a call trace like this: show_log() pretty_print_commit() format_commit_message() strbuf_expand() format_commit_item() format_and_pad_commit() strbuf_utf8_replace() where %C(auto)%d would insert a color reset escape sequence at the end of the string given to strbuf_utf8_replace() and show_log() uses fwrite() to send everything to stdout (including the incorrect NUL inserted by strbuf_utf8_replace). Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
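The loop shape and the added check can be illustrated with a small stand-alone sketch. It is not the strbuf code: `esc_len` and `copy_with_escapes` are hypothetical names, the destination is a plain caller-supplied array, and only sequences of the form ESC `[` digits-or-semicolons `m` are recognised.

```c
#include <stddef.h>
#include <string.h>

/* Length of an "ESC [ ... m" display-mode sequence at s, or 0 if none. */
static size_t esc_len(const char *s, const char *end)
{
	const char *p = s;

	if (end - p < 2 || p[0] != '\033' || p[1] != '[')
		return 0;
	p += 2;
	while (p < end && ((*p >= '0' && *p <= '9') || *p == ';'))
		p++;
	if (p < end && *p == 'm')
		return (size_t)(p + 1 - s);
	return 0;
}

/*
 * Copy len bytes of src into dst (which must hold at least len bytes),
 * counting only non-escape bytes in *visible. Returns bytes written.
 */
size_t copy_with_escapes(char *dst, const char *src, size_t len, size_t *visible)
{
	const char *end = src + len;
	char *out = dst;
	size_t n;

	*visible = 0;
	while (src < end) {
		/* step 1: copy escape sequences verbatim */
		while ((n = esc_len(src, end)) != 0) {
			memcpy(out, src, n);
			out += n;
			src += n;
		}
		/* the check the fix adds: step 1 may have consumed the rest */
		if (src >= end)
			break;
		/* step 2: copy one visible byte */
		*out++ = *src++;
		(*visible)++;
	}
	return (size_t)(out - dst);
}
```

Without the `src >= end` check, an input that ends in an escape sequence would make step 2 copy the byte just past the input, which for a C string is the terminating NUL.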
2014-06-06	Merge branch 'tb/unicode-6.3-zero-width'	Junio C Hamano	1	-69/+7
	Update the logic to compute the display width needed for utf8 strings and allow us to more easily maintain the tables used in that logic. We may want to let the users choose if codepoints with ambiguous widths are treated as a double or single width in a follow-up patch. * tb/unicode-6.3-zero-width: utf8: make it easier to auto-update git_wcwidth() utf8.c: use a table for double_width
2014-05-12	utf8: make it easier to auto-update git_wcwidth()	Torsten Bögershausen	1	-59/+2
	The function git_wcwidth() returns for a given unicode code point the width on the display: -1 for control characters, 0 for combining or other non-visible code points, 1 for e.g. ASCII, 2 for double-width code points. This table had originally been extracted for one Unicode version, probably 3.2. We use two tables these days, one for zero-width and another for double-width. Make it easier to update these tables to a later version of Unicode by factoring out the table from utf8.c into unicode_width.h and add the script update_unicode.sh to update the table based on the latest Unicode specification files. Thanks to Peter Krefting <peter@softwolves.pp.se> and Kevin Bracey <kevin@bracey.fi> for helping with their Unicode knowledge. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-05-12	utf8.c: use a table for double_width	Torsten Bögershausen	1	-23/+18
	Refactor git_wcwidth() and replace the if-else-if chain. Use the table double_width which is scanned by the bisearch() function, which is already used to find combining code points. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
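The two-table lookup described in the two commits above might look roughly like the sketch below. The tables here are tiny illustrative excerpts, not the generated contents of unicode_width.h, and `in_table` and `wcwidth_sketch` are hypothetical names. Returning 0 for NUL is also an assumption.

```c
#include <stddef.h>

typedef unsigned int ucs_char_t;

struct interval {
	ucs_char_t first;
	ucs_char_t last;
};

/* Illustrative excerpts only; both tables must stay sorted and non-overlapping. */
static const struct interval zero_width[] = {
	{ 0x0300, 0x036F },	/* combining diacritical marks */
	{ 0x0483, 0x0487 },	/* Cyrillic combining marks */
};

static const struct interval double_width[] = {
	{ 0x1100, 0x115F },	/* Hangul Jamo initial consonants */
	{ 0x4E00, 0x9FFF },	/* CJK unified ideographs */
	{ 0xAC00, 0xD7A3 },	/* Hangul syllables */
};

/* Binary search for ch in a sorted interval table. */
static int in_table(ucs_char_t ch, const struct interval *t, size_t n)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ch > t[mid].last)
			lo = mid + 1;
		else if (ch < t[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* -1 for control characters, 0 for zero-width, 2 for double-width, else 1. */
int wcwidth_sketch(ucs_char_t ch)
{
	if (ch == 0)
		return 0;
	if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
		return -1;
	if (in_table(ch, zero_width, sizeof(zero_width) / sizeof(zero_width[0])))
		return 0;
	if (in_table(ch, double_width, sizeof(double_width) / sizeof(double_width[0])))
		return 2;
	return 1;
}
```

Keeping both classes in sorted interval tables means a Unicode update only replaces data, while the lookup code stays unchanged.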
2014-04-16	Merge branch 'tb/unicode-6.3-zero-width'	Junio C Hamano	1	-5/+4
	Teach our display-column-counting logic about decomposed umlauts and friends. * tb/unicode-6.3-zero-width: utf8.c: partially update to version 6.3
2014-04-09	utf8.c: partially update to version 6.3	Torsten Bögershausen	1	-5/+4
	Unicode 6.3 defines more code points as combining or accents. For example, the character "ö" could be expressed as an "o" followed by U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above). We should consider that such a sequence of two codepoints occupies one display column for the alignment purposes, and for that, git_wcwidth() should return 0 for them. Affected codepoints are: U+0358..U+035C U+0487 U+05A2, U+05BA, U+05C5, U+05C7 U+0604, U+0616..U+061A, U+0659..U+065F Earlier unicode standards had defined these as "reserved". Only the range 0..U+07FF has been checked to see which codepoints need to be marked as 0-width while preparing for this commit; more updates may be needed. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-02-18	utf8: use correct type for values in interval table	John Keeping	1	-2/+2
	We treat these as unsigned everywhere and compare against unsigned values, so declare them using the typedef we already have for this. While we're here, fix the indentation as well. Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-02-18	utf8: fix iconv error detection	John Keeping	1	-1/+1
	iconv(3) returns "(size_t) -1" on error. Make sure that we cast the "-1" properly when checking for this. Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-28	pretty: Fix bug in truncation support for %>, %< and %><	Ramsay Jones	1	-2/+2
	Some systems experience failures in t4205-*.sh (tests 18-20, 27) which all relate to the use of truncation with the %< padding placeholder. This capability was added in the commit a7f01c6b ("pretty: support truncating in %>, %< and %><", 19-04-2013). The truncation support was implemented with the assistance of a new strbuf function (strbuf_utf8_replace). This function contains the following code: strbuf_attach(sb_src, strbuf_detach(&sb_dst, NULL), sb_dst.len, sb_dst.alloc); Unfortunately, this code is subject to unspecified behaviour. In particular, the order of evaluation of the argument expressions (along with the associated side effects) is not specified by the C standard. Note that the second argument expression is a call to strbuf_detach() which, as a side effect, sets the 'len' and 'alloc' fields of the sb_dst argument to zero. Depending on the order of evaluation of the argument expressions to the strbuf_attach call, this can lead to assigning an empty string to 'sb_src'. In order to remove the undesired behaviour, we replace the above line of code with: strbuf_swap(sb_src, &sb_dst); strbuf_release(&sb_dst); which achieves the desired effect without provoking unspecified behaviour. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Acked-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
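The hazard and the fix can be reproduced with a small stand-in for strbuf. The `struct buf` type and the `buf_*` functions below are hypothetical simplifications, not the strbuf API. A call such as `attach(dst, buf_detach(&src), src.len, src.alloc)` would read `src.len` and `src.alloc` either before or after `buf_detach()` has zeroed them, depending on the compiler.

```c
#include <stdlib.h>
#include <string.h>

struct buf {
	char *data;
	size_t len;
	size_t alloc;
};

/* Hand the allocation to the caller and reset the buffer (a side effect). */
char *buf_detach(struct buf *b)
{
	char *p = b->data;

	b->data = NULL;
	b->len = 0;
	b->alloc = 0;
	return p;
}

/* Exchange the contents of two buffers. */
void buf_swap(struct buf *a, struct buf *b)
{
	struct buf tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Free the allocation and reset the buffer. */
void buf_release(struct buf *b)
{
	free(b->data);
	b->data = NULL;
	b->len = 0;
	b->alloc = 0;
}

/*
 * Move src into dst without relying on argument evaluation order:
 * each step is a separate statement, so the sequencing is well defined.
 */
void buf_move(struct buf *dst, struct buf *src)
{
	buf_swap(dst, src);
	buf_release(src);	/* frees whatever dst held before */
}
```

Splitting the operation into two statements puts a sequence point between the swap and the release, which is what removes the dependence on evaluation order.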
2013-04-18	pretty: support %>> that steal trailing spaces	Nguyễn Thái Ngọc Duy	1	-1/+1
	This is pretty useful in `%<(100)%s%Cred%>(20)% an' where %s does not use up all 100 columns and %an needs more than 20 columns. By replacing %>(20) with %>>(20), %an can steal spaces from %s. %>> understands escape sequences, so %Cred does not stop it from stealing spaces in %<(100). Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18	pretty: support truncating in %>, %< and %><	Nguyễn Thái Ngọc Duy	1	-0/+46
	%>(N,trunc) truncates the right part after N columns and replace the last two letters with "..". ltrunc does the same on the left. mtrunc cuts the middle out. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
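As a rough illustration of the `trunc` rule only, the sketch below cuts an ASCII string to N columns and overwrites the last two kept characters with "..". The function name is hypothetical, and the sketch ignores multi-byte characters, escape sequences, and the `ltrunc` and `mtrunc` variants that the real implementation has to handle.

```c
#include <string.h>

/* Truncate s in place to at most `columns` ASCII characters, ending in "..". */
void trunc_right_sketch(char *s, size_t columns)
{
	size_t len = strlen(s);

	if (len <= columns || columns < 2)
		return;	/* fits already, or no room for ".." */
	s[columns - 2] = '.';
	s[columns - 1] = '.';
	s[columns] = '\0';
}
```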
2013-04-18	utf8.c: add reencode_string_len() that can handle NULs in string	Nguyễn Thái Ngọc Duy	1	-3/+7
	Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18	utf8.c: add utf8_strnwidth() with the ability to skip ansi sequences	Nguyễn Thái Ngọc Duy	1	-6/+14
	Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18	utf8.c: move display_mode_esc_sequence_len() for use by other functions	Nguyễn Thái Ngọc Duy	1	-14/+14
	Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-03-25	Merge branch 'ks/rfc2047-one-char-at-a-time'	Junio C Hamano	1	-0/+39
	When "format-patch" quoted a non-ascii string in the header fields, it incorrectly applied rfc2047 and chopped a single character in the middle of it. * ks/rfc2047-one-char-at-a-time: format-patch: RFC 2047 says multi-octet character may not be split
2013-03-21	Merge branch 'jk/utf-8-can-be-spelled-differently'	Junio C Hamano	1	-2/+18
	Some platforms and users spell UTF-8 differently; retry with the most official "UTF-8" when the system does not understand a user-supplied encoding name that is one of the common alternative spellings of UTF-8. * jk/utf-8-can-be-spelled-differently: utf8: accept alternate spellings of UTF-8
2013-03-09	format-patch: RFC 2047 says multi-octet character may not be split	Kirill Smelkov	1	-0/+39
	Even though an earlier attempt (bafc478..41dd00bad) cleaned up RFC 2047 encoding, pretty.c::add_rfc2047() still decides where to split the output line by going through the input one byte at a time, and potentially splits a character in the middle. A subject line may end up showing like this: ".... fö?? bar". (instead of ".... föö bar".) if split incorrectly. RFC 2047, section 5 (3) explicitly forbids such behaviour: Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded- word's. that means that e.g. for Subject: .... föö bar encoding Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?= =?UTF-8?q?=20bar?= is correct, and Subject: =?UTF-8?q?....=20f=C3=B6=C3?= <-- NOTE ö is broken here =?UTF-8?q?=B6=20bar?= is not, because "ö" character UTF-8 encoding C3 B6 is split here across adjacent encoded words. To fix the problem, make the loop grab one _character_ at a time and determine its output length to see where to break the output line. Note that this version only knows about UTF-8, but the logic to grab one character is abstracted out in mbs_chrlen() function to make it possible to extend it to other encodings with the help of iconv in the future. Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>
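The "one character at a time" idea can be sketched as below. The function names are hypothetical and this is not the mbs_chrlen() from the commit. For simplicity the budget is counted in raw input bytes, whereas the real encoder has to budget for the longer quoted-printable output such as `=C3=B6`.

```c
#include <stddef.h>

/*
 * Byte length of the UTF-8 character at s, given `remaining` bytes.
 * Invalid or truncated sequences are treated as single bytes.
 */
size_t utf8_chrlen_sketch(const char *s, size_t remaining)
{
	unsigned char c;
	size_t need, i;

	if (!remaining)
		return 0;
	c = (unsigned char)s[0];
	if (c < 0x80)
		need = 1;
	else if ((c & 0xe0) == 0xc0)
		need = 2;
	else if ((c & 0xf0) == 0xe0)
		need = 3;
	else if ((c & 0xf8) == 0xf0)
		need = 4;
	else
		return 1;
	if (need > remaining)
		return 1;
	for (i = 1; i < need; i++)
		if (((unsigned char)s[i] & 0xc0) != 0x80)
			return 1;
	return need;
}

/*
 * Largest prefix of s (len bytes) that fits in max_bytes without
 * cutting a multi-byte character in half.
 */
size_t chunk_len_sketch(const char *s, size_t len, size_t max_bytes)
{
	size_t used = 0;

	while (used < len) {
		size_t n = utf8_chrlen_sketch(s + used, len - used);

		if (used + n > max_bytes)
			break;
		used += n;
	}
	return used;
}
```

With the input "föö" (5 bytes) and a 4-byte budget, a byte-wise split would cut the second "ö" in half, while this sketch stops after 3 bytes.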
2013-02-25	utf8: accept alternate spellings of UTF-8	Jeff King	1	-2/+18
	The iconv implementation on many platforms will accept variants of UTF-8, including "UTF8", "utf-8", and "utf8", but some do not. We make allowances in our code to treat them all identically, but we sometimes hand the string from the user directly to iconv. In this case, the platform iconv may or may not work. There are really four levels of platform iconv support for these synonyms: 1. All synonyms understood (e.g., glibc). 2. Only the official "UTF-8" understood (e.g., Windows). 3. Official "UTF-8" not understood, but some other synonym understood (it's not known whether such a platform exists). 4. Neither "UTF-8" nor any synonym understood (e.g., ancient systems, or ones without utf8 support installed). This patch teaches git to fall back to using the official "UTF-8" spelling when iconv_open fails (and the encoding was one of the synonym spellings). This makes things more convenient to users of type 2 systems, as they can now use any of the synonyms for the log output encoding. Type 1 systems are not affected, as iconv already works on the first try. Type 4 systems are not affected, as both attempts already fail. Type 3 systems will not benefit from the feature, but because we only use "UTF-8" as a fallback, they will not be regressed (i.e., you can continue to use "utf8" if your platform supports it). We could try all the various synonyms, but since such systems are not even known to exist, it's not worth the effort. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
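The retry decision can be sketched independently of iconv. The helper names below are hypothetical. A caller would first try `iconv_open()` with the user-supplied name and, only if that fails, ask `retry_name_sketch()` for a second spelling to try.

```c
#include <ctype.h>
#include <string.h>

/* ASCII case-insensitive string equality. */
static int ieq(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;
}

/* The synonyms named in the message: "UTF-8", "utf-8", "UTF8", "utf8". */
static int is_utf8_spelling(const char *name)
{
	return ieq(name, "utf-8") || ieq(name, "utf8");
}

/*
 * Name to retry with after the first attempt failed, or NULL when a
 * retry cannot help (not a UTF-8 synonym, or already the official name).
 */
const char *retry_name_sketch(const char *name)
{
	if (is_utf8_spelling(name) && strcmp(name, "UTF-8") != 0)
		return "UTF-8";
	return NULL;
}
```

Returning NULL for the official spelling mirrors the reasoning about type 4 systems: if "UTF-8" itself was rejected, a second attempt with the same name would fail as well.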
2013-02-14	Merge branch 'jx/utf8-printf-width'	Junio C Hamano	1	-0/+21
	Use a new helper that prints a message and counts its display width to align the help messages parse-options produces. * jx/utf8-printf-width: Add utf8_fprintf helper that returns correct number of columns
2013-02-11	Add utf8_fprintf helper that returns correct number of columns	Jiang Xin	1	-0/+21
	Since command usages can be translated, they may include utf-8 encoded strings, and the output in console may not align well any more. This is because strlen() is different from strwidth() on utf-8 strings. A wrapper utf8_fprintf() can help to return the correct number of columns required. Signed-off-by: Jiang Xin <worldhello.net@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Reviewed-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
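The difference between byte count and column count can be shown with a simplified wrapper. This is a sketch under a loud assumption: every code point is treated as one column wide, so combining and double-width characters are miscounted. The names and the fixed 1024-byte buffer are hypothetical choices, not taken from the commit.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx). */
size_t codepoints_sketch(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	return n;
}

/*
 * Like fprintf(), but returns the number of columns written instead of
 * the number of bytes. Output longer than the buffer is truncated.
 */
int fprintf_columns_sketch(FILE *fp, const char *fmt, ...)
{
	char buf[1024];
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);
	if (n < 0)
		return -1;
	if (fputs(buf, fp) == EOF)
		return -1;
	return (int)codepoints_sketch(buf);
}
```

A caller aligning help text would pad by the target column minus this return value, rather than by the byte count that plain `fprintf()` reports.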
2013-01-02	Merge branch 'sp/shortlog-missing-lf'	Junio C Hamano	1	-7/+6
	When a line to be wrapped has a solid run of non space characters whose length exactly is the wrap width, "git shortlog -w" failed to add a newline after such a line. * sp/shortlog-missing-lf: strbuf_add_wrapped*(): Remove unused return value shortlog: fix wrapping lines of wraplen
2012-12-11	strbuf_add_wrapped*(): Remove unused return value	Steffen Prohaska	1	-7/+6
	Since shortlog isn't using the return value anymore (see previous commit), the functions can be changed to void. Signed-off-by: Steffen Prohaska <prohaska@zib.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-12-07	Merge branch 'jc/same-encoding' into maint	Junio C Hamano	1	-0/+7
	Various codepaths checked if two encoding names are the same using ad-hoc code and some of them ended up asking iconv() to convert between "utf8" and "UTF-8". The former is not a valid way to spell the encoding name, but often people use it by mistake, and we equated them in some but not all codepaths. Introduce a new helper function to make these codepaths consistent. * jc/same-encoding: reencode_string(): introduce and use same_encoding()
2012-11-20	Merge branch 'js/format-2047' into maint	Junio C Hamano	1	-1/+1
	Various rfc2047 quoting issues around a non-ASCII name on the From: line in the output from format-patch have been corrected. * js/format-2047: format-patch tests: check quoting/encoding in To: and Cc: headers format-patch: fix rfc2047 address encoding with respect to rfc822 specials format-patch: make rfc2047 encoding more strict format-patch: introduce helper function last_line_length() format-patch: do not wrap rfc2047 encoded headers too late format-patch: do not wrap non-rfc2047 headers too early utf8: fix off-by-one wrapping of text
2012-11-15	Merge branch 'jc/same-encoding'	Junio C Hamano	1	-0/+7
	Various codepaths checked if two encoding names are the same using ad-hoc code and some of them ended up asking iconv() to convert between "utf8" and "UTF-8". The former is not a valid way to spell the encoding name, but often people use it by mistake, and we equated them in some but not all codepaths. Introduce a new helper function to make these codepaths consistent. * jc/same-encoding: reencode_string(): introduce and use same_encoding() Conflicts: builtin/mailinfo.c
2012-11-09	Merge branch 'js/format-2047'	Jeff King	1	-1/+1
	Fixes many rfc2047 quoting issues in the output from format-patch. * js/format-2047: format-patch tests: check quoting/encoding in To: and Cc: headers format-patch: fix rfc2047 address encoding with respect to rfc822 specials format-patch: make rfc2047 encoding more strict format-patch: introduce helper function last_line_length() format-patch: do not wrap rfc2047 encoded headers too late format-patch: do not wrap non-rfc2047 headers too early utf8: fix off-by-one wrapping of text
2012-11-04	reencode_string(): introduce and use same_encoding()	Junio C Hamano	1	-0/+7
	Callers of reencode_string() that re-encodes a string from one encoding to another all used ad-hoc way to bypass the case where the input and the output encodings are the same. Some did strcmp(), some did strcasecmp(), yet some others when converting to UTF-8 used is_encoding_utf8(). Introduce same_encoding() helper function to make these callers use the same logic. Notably, is_encoding_utf8() has a work-around for common misconfiguration to use "utf8" to name UTF-8 encoding, which does not match "UTF-8" hence strcasecmp() would not consider the same. Make use of it in this helper function. Signed-off-by: Junio C Hamano <gitster@pobox.com>
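A helper along these lines might be sketched as follows. The names are hypothetical, both arguments are assumed to be non-NULL, and the set of accepted UTF-8 spellings is limited to the two mentioned in the message.

```c
#include <ctype.h>

/* ASCII case-insensitive string equality. */
static int ieq(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;
}

/* Accept both "UTF-8" and the common misspelling "utf8", in any case. */
static int is_utf8_name(const char *name)
{
	return ieq(name, "utf-8") || ieq(name, "utf8");
}

/* 1 if converting from src to dst would be a no-op, else 0. */
int same_encoding_sketch(const char *src, const char *dst)
{
	if (is_utf8_name(src) && is_utf8_name(dst))
		return 1;
	return ieq(src, dst);
}
```

Checking the UTF-8 special case first is what lets "utf8" and "UTF-8" compare equal, which a plain case-insensitive comparison would miss.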
2012-10-18	utf8: fix off-by-one wrapping of text	Jan H. Schönherr	1	-1/+1
	The wrapping logic in strbuf_add_wrapped_text() does currently not allow lines that entirely fill the allowed width, instead it wraps the line one character too early. For example, the text "This is the sixth commit." formatted via "%w(11,1,2)" (wrap at 11 characters, 1 char indent of first line, 2 char indent of following lines) results in four lines: " This is", " the", " sixth", " commit." This is wrong, because " the sixth" is exactly 11 characters long, and thus allowed. Fix this by allowing the (width+1) character of a line to be a valid wrapping point if it is a whitespace character. Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
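The boundary condition can be seen in a simplified greedy wrapper. This is a sketch, not strbuf_add_wrapped_text(): it handles only ASCII words separated by spaces, writes into a caller-supplied array, and uses hypothetical names throughout. The relevant detail is the `>` comparison, since a `>=` there would reproduce the too-early wrap.

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to out, always leaving room for the terminator. */
static void put(char *out, size_t cap, size_t *pos, const char *s, size_t n)
{
	size_t i;

	for (i = 0; i < n && *pos + 1 < cap; i++)
		out[(*pos)++] = s[i];
	out[*pos] = '\0';
}

static void put_spaces(char *out, size_t cap, size_t *pos, int n)
{
	while (n-- > 0)
		put(out, cap, pos, " ", 1);
}

/* Wrap text at `width` columns; indent1 for the first line, indent2 after. */
void wrap_sketch(char *out, size_t cap, const char *text,
		 int width, int indent1, int indent2)
{
	size_t pos = 0;
	int col = indent1;
	int at_line_start = 1;

	if (!cap)
		return;
	out[0] = '\0';
	put_spaces(out, cap, &pos, indent1);
	while (*text) {
		size_t wlen;

		while (*text == ' ')
			text++;
		wlen = strcspn(text, " ");
		if (!wlen)
			break;
		/* A line that exactly fills `width` is allowed, hence '>'. */
		if (!at_line_start && col + 1 + (int)wlen > width) {
			put(out, cap, &pos, "\n", 1);
			put_spaces(out, cap, &pos, indent2);
			col = indent2;
			at_line_start = 1;
		}
		if (!at_line_start) {
			put(out, cap, &pos, " ", 1);
			col++;
		}
		put(out, cap, &pos, text, wlen);
		col += (int)wlen;
		at_line_start = 0;
		text += wlen;
	}
}
```

For the example in the message (width 11, indents 1 and 2), this sketch keeps "the sixth" on one line because that line is exactly 11 columns wide including its indent.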
2012-07-08	git on Mac OS and precomposed unicode	Torsten Bögershausen	1	-10/+16
	Mac OS X mangles file names containing unicode on file systems HFS+, VFAT or SAMBA. When a file using unicode code points outside ASCII is created on a HFS+ drive, the file name is converted into decomposed unicode and written to disk. No conversion is done if the file name is already decomposed unicode. Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same result as open("\x41\xcc\x88",...) with a decomposed "Ä". As a consequence, readdir() returns the file names in decomposed unicode, even if the user expects precomposed unicode. Unlike on HFS+, Mac OS X stores files on a VFAT drive (e.g. a USB drive) in precomposed unicode, but readdir() still returns file names in decomposed unicode. When a git repository is stored on a network share using SAMBA, file names are sent over the wire and written to disk on the remote system in precomposed unicode, but Mac OS X readdir() returns decomposed unicode to be compatible with its behaviour on HFS+ and VFAT. The unicode decomposition causes many problems: - The names "git add" and other commands get from the end user may often be precomposed form (the decomposed form is not easily input from the keyboard), but when the commands read from the filesystem to see what it is going to update the index with already is on the filesystem, readdir() will give decomposed form, which is different. - Similarly "git log", "git mv" and all other commands that need to compare pathnames found on the command line (often but not always precomposed form; a command line input resulting from globbing may be in decomposed) with pathnames found in the tree objects (should be precomposed form to be compatible with other systems and for consistency in general). - The same for names stored in the index, which should be precomposed, that may need to be compared with the names read from readdir(). NFS mounted from Linux is fully transparent and does not suffer from the above. 
As Mac OS X treats precomposed and decomposed file names as equal, we can - wrap readdir() on Mac OS X to return the precomposed form, and - normalize decomposed form given from the command line also to the precomposed form, to ensure that all pathnames used in Git are always in the precomposed form. This behaviour can be requested by setting "core.precomposedunicode" configuration variable to true. The code in compat/precomposed_utf8.c implements basically 4 new functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(), precomposed_utf8_closedir() and precompose_argv(). The first three are to wrap opendir(3), readdir(3), and closedir(3) functions. The argv[] conversion allows to use the TAB filename completion done by the shell on command line. It tolerates other tools which use readdir() to feed decomposed file names into git. When creating a new git repository with "git init" or "git clone", "core.precomposedunicode" will be set "false". The user needs to activate this feature manually. She typically sets core.precomposedunicode to "true" on HFS and VFAT, or file systems mounted via SAMBA. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-02-23	strbuf: add fixed-length version of add_wrapped_text	Jeff King	1	-0/+9
	The function strbuf_add_wrapped_text takes a NUL-terminated string. This makes it annoying to wrap strings we have as a pointer and a length. Refactoring strbuf_add_wrapped_text and all of its sub-functions to handle fixed-length strings turned out to be really ugly. So this implementation is lame; it just strdups the text and operates on the NUL-terminated version. This should be fine as the strings we are wrapping are generally pretty short. If it becomes a problem, we can optimize later. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
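The copy-then-delegate approach can be sketched generically. The names below are hypothetical: `text_fn` stands in for any routine that needs NUL-terminated input, and `measure_len` is just a sample callback.

```c
#include <stdlib.h>
#include <string.h>

/* Any routine that expects a NUL-terminated string. */
typedef void (*text_fn)(const char *text, void *ctx);

/*
 * Run fn on the first len bytes of data by making a temporary
 * NUL-terminated copy. Returns 0 on success, -1 if allocation fails.
 */
int call_with_nul_terminated(const char *data, size_t len, text_fn fn, void *ctx)
{
	char *tmp = malloc(len + 1);

	if (!tmp)
		return -1;
	memcpy(tmp, data, len);
	tmp[len] = '\0';
	fn(tmp, ctx);
	free(tmp);
	return 0;
}

/* Sample callback: store the length of the text it was given. */
void measure_len(const char *text, void *ctx)
{
	*(size_t *)ctx = strlen(text);
}
```

The extra allocation and copy are the cost the message accepts in exchange for not restructuring the NUL-terminated code path.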
2010-03-02	Merge branch 'rs/optim-text-wrap'	Junio C Hamano	1	-33/+28
	* rs/optim-text-wrap: utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text() utf8.c: remove strbuf_write() utf8.c: remove print_spaces() utf8.c: remove print_wrapped_text()
2010-02-20	utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text()	René Scharfe	1	-6/+17
	is_utf8() works by calling utf8_width() for each character at the supplied location. In strbuf_add_wrapped_text(), we do that anyway while wrapping the lines. So instead of checking the encoding beforehand, optimistically assume that it's utf-8 and wrap along until an invalid character is hit, and when that happens start over. This pays off if the text consists only of valid utf-8 characters. The following command was run against the Linux kernel repo with git 1.7.0: $ time git log --format='%b' v2.6.32 >/dev/null real 0m2.679s user 0m2.580s sys 0m0.100s $ time git log --format='%w(60,4,8)%b' >/dev/null real 0m4.342s user 0m4.230s sys 0m0.110s And with this patch series: $ time git log --format='%w(60,4,8)%b' >/dev/null real 0m3.741s user 0m3.630s sys 0m0.110s So the cost of wrapping is reduced to 70% in this case. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20	utf8.c: remove strbuf_write()	René Scharfe	1	-13/+5
	The patch before the previous one made sure that all callers of strbuf_add_wrapped_text() supply a strbuf. Replace all calls of strbuf_write() with regular strbuf functions and remove it. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20	utf8.c: remove print_spaces()	René Scharfe	1	-9/+6
	The previous patch made sure that strbuf_add_wrapped_text() (and thus strbuf_add_indented_text(), too) always get a strbuf. Make use of this fact by adding strbuf_addchars(), a small helper that adds a char the specified number of times to a strbuf, and use it to replace print_spaces(). Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
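A helper of this kind might look like the sketch below. It uses a minimal hypothetical `struct buf` in place of strbuf, with its own simple growth policy, and reports allocation failure through the return value.

```c
#include <stdlib.h>
#include <string.h>

struct buf {
	char *data;
	size_t len;
	size_t alloc;
};

/* Append the character c to b exactly n times. Returns 0, or -1 on OOM. */
int buf_addchars(struct buf *b, int c, size_t n)
{
	if (b->len + n + 1 > b->alloc) {
		size_t want = (b->len + n + 1) * 2;
		char *p = realloc(b->data, want);

		if (!p)
			return -1;
		b->data = p;
		b->alloc = want;
	}
	memset(b->data + b->len, c, n);
	b->len += n;
	b->data[b->len] = '\0';	/* keep the buffer usable as a C string */
	return 0;
}
```

Emitting an indent then becomes a single call such as `buf_addchars(&b, ' ', indent)` instead of a loop that appends one space at a time.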
2010-02-20	utf8.c: remove print_wrapped_text()	René Scharfe	1	-5/+0
	strbuf_add_wrapped_text() is called only from print_wrapped_text() without a strbuf (in which case it writes its results to stdout). At its only callsite, supply a strbuf, call strbuf_add_wrapped_text() directly and remove the wrapper function. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-12	utf8.c: mark file-local function static	Junio C Hamano	1	-1/+1
	Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-23	strbuf_add_wrapped_text(): skip over colour codes	René Scharfe	1	-1/+21
	Ignore display mode escape sequences (colour codes) for the purpose of text wrapping because they don't have a visible width. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-22	strbuf_add_wrapped_text(): factor out strbuf_add_indented_text()	René Scharfe	1	-9/+17
	Add a new helper function, strbuf_add_indented_text(), to indent text without a width limit, and call it from strbuf_add_wrapped_text(). It respects both indent (applied to the first line) and indent2 (applied to the rest of the lines); indent2 was ignored by the indent-only path of strbuf_add_wrapped_text() before the patch. Two simple test cases are added, one exercising strbuf_add_wrapped_text() and the other strbuf_add_indented_text(). Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-10-22	Teach --wrap to only indent without wrapping	Junio C Hamano	1	-0/+13
	When a zero or negative width is given to "shortlog -w<width>,<in1>,<in2>" and --format=%[wrap(w,in1,in2)...%], just indent the text by in1 without wrapping. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-10-19	Add strbuf_add_wrapped_text() to utf8.[ch]	Johannes Schindelin	1	-9/+24
	The newly added function can rewrap text according to a given first-line indent, other-indent and text width. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2009-10-19	print_wrapped_text(): allow hard newlines	Johannes Schindelin	1	-2/+16
	print_wrapped_text() will insert its own newlines. Up until now, if the text passed to it contained newlines, they would not be handled properly (the wrapping got confused after that). The strategy is to replace a single new-line with a space, but keep double new-lines so that already-wrapped text with empty lines between paragraphs will be handled properly. However, single new-line characters are only handled this way if the character after it is an alphanumeric character, as per Linus' suggestion. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
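The stated strategy can be sketched as an in-place pre-pass over the text. This is an illustration of the rule only, under the assumption that it can be applied before wrapping, and the function name is hypothetical. The commit itself handles newlines inside the wrapping loop.

```c
#include <ctype.h>

/*
 * Turn a single newline into a space when the next character is
 * alphanumeric; leave runs of two or more newlines untouched.
 */
void join_soft_newlines(char *s)
{
	char *p;

	for (p = s; *p; p++) {
		if (*p != '\n')
			continue;
		if (p[1] == '\n') {
			/* paragraph break: skip the whole run */
			while (p[1] == '\n')
				p++;
			continue;
		}
		if (isalnum((unsigned char)p[1]))
			*p = ' ';
	}
}
```

Lines that start with something other than a letter or digit, such as a list bullet, keep their newline, which is the effect of the alphanumeric restriction.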
2009-06-06	On Solaris choose the OLD_ICONV iconv() declaration based on the UNIX spec	Brandon Casey	1	-1/+1
	OLD_ICONV is only necessary on Solaris until UNIX03. This is indicated by the private macro _XPG6 which is set in /usr/include/sys/feature_tests.h. Signed-off-by: Brandon Casey <drafnel@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-02-04	utf8: add utf8_strwidth()	Geoffrey Thomas	1	-0/+19
	I'm about to use this pattern more than once, so make it a common function. Signed-off-by: Geoffrey Thomas <geofft@mit.edu> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-06	utf8_width(): allow non NUL-terminated input	Junio C Hamano	1	-32/+52
	The original interface assumed that the input string is always terminated with a NUL, but that wasn't too useful. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-06	utf8: pick_one_utf8_char()	Junio C Hamano	1	-6/+21
	The utf8_width() function was doing two different things: picking a valid character from a UTF-8 stream, and computing the display width of that character. This splits the former into a separate function, pick_one_utf8_char(). Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-15	Remove unreachable statements	Guido Ostkamp	1	-1/+0
	Solaris Workshop Compiler found a few unreachable statements. Signed-off-by: Guido Ostkamp <git@ostkamp.fastmail.fm> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-08	Style: place opening brace of a function definition at column 1	Junio C Hamano	1	-1/+2
	Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-05-07	wcwidth redeclaration	Amos Waterland	1	-2/+2
	Build fails for git 1.5.1.3 on AIX, with the message: utf8.c:66: error: conflicting types for 'wcwidth' /.../lib/gcc/powerpc-ibm-aix5.3.0.0/4.0.3/include/string.h:266: error: previous declaration of 'wcwidth' was here Fix this by renaming our static variant to our own name. Signed-off-by: Amos Waterland <apw@us.ibm.com> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-03	Merge branch 'maint'	Junio C Hamano	1	-6/+14
	* maint: Unset NO_C99_FORMAT on Cygwin. Fix a "pointer type missmatch" warning. Fix some "comparison is always true/false" warnings. Fix an "implicit function definition" warning. Fix a "label defined but unreferenced" warning. Document the config variable format.suffix git-merge: fail correctly when we cannot fast forward. builtin-archive: use RUN_SETUP Fix git-gc usage note
2007-03-03	Fix a "pointer type missmatch" warning.	Ramsay Jones	1	-2/+8
	In particular, the second parameter in the call to iconv() will cause this warning if your library declares iconv() with the second (input buffer pointer) parameter of type const char **. This is the old prototype, which is none-the-less used by the current version of newlib on Cygwin. (It appears in old versions of glibc too). Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-03	Fix some "comparison is always true/false" warnings.	Ramsay Jones	1	-4/+6
	On Cygwin the wchar_t type is an unsigned short (16-bit) int. This results in the above warnings from the return statement in the wcwidth() function (in particular, the expressions involving constants with values larger than 0xffff). Simply replace the use of wchar_t with an unsigned int, typedef-ed as ucs_char_t. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-02	print_wrapped_text: fix output for negative indent	Johannes Schindelin	1	-1/+1
	When providing a negative indent, it means that -indent columns were already printed. Fix a bug where the function ate the first character if already the first word did not fit into the first line. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-27	Actually make print_wrapped_text() useful	Johannes Schindelin	1	-5/+12
	Now, it returns the current column, does not add a newline, and you can pass a negative indent, to indicate that the indent was already printed. With this, you can actually continue in the middle of a paragraph, not having to print everything into a buffer first. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-30	commit-tree: cope with different ways "utf-8" can be spelled.	Junio C Hamano	1	-0/+9
	People can spell config.commitencoding differently from what we internally have ("utf-8") to mean UTF-8. Try to accept them and treat them equally. Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-26	Move encoding conversion routine out of mailinfo to utf8.c	Junio C Hamano	1	-0/+54
	This moves the body of convert_to_utf8() routine used in mailinfo to the utf8.c i18n library. Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-24	commit-tree: encourage UTF-8 commit messages.	Johannes Schindelin	1	-0/+278
	Introduce is_utf8() to check if a text looks like it is encoded in UTF-8, utf8_width() to count display width, and implement print_wrapped_text() using them. git-commit-tree warns if the commit message does not minimally conform to the UTF-8 encoding when i18n.commitencoding is either unset, or set to "utf-8". Signed-off-by: Junio C Hamano <junkio@cox.net>