Simple case folding vs full case folding in Python regex module

Question

The module I'm asking about is Matthew Barnett's regex: https://pypi.org/project/regex/

On the project description page, the differences in behaviour between V0 and V1 are stated as follows (note the lines about case-folding):

Old vs new behaviour

In order to be compatible with the re module, this module has 2 behaviours:

Version 0 behaviour (old behaviour, compatible with the re module):

Please note that the re module’s behaviour may change over time, and I’ll endeavour to match that behaviour in version 0.

Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.

Case-insensitive matches in Unicode use simple case-folding by default.

Version 1 behaviour (new behaviour, possibly different from the re module):

Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.

Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module will default to regex.DEFAULT_VERSION.

I tried a few examples myself but couldn't figure out what the difference is:

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> r = regex.compile("(?V0i)и")
>>> r
regex.Regex('(?V0i)и', flags=regex.I | regex.V0)
>>> r.search("И")
<regex.Match object; span=(0, 1), match='И'>
>>> regex.search("(?V0i)é", "É")
<regex.Match object; span=(0, 1), match='É'>
>>> regex.search("(?V0i)é", "E")
>>> regex.search("(?V1i)é", "E")

What is the difference between simple case-folding and full case-folding? Or can you provide an example where a (case insensitive) regex matches something in V1 but not in V0?

Not tested, but it probably follows the Unicode case folding table. Full case folding may replace a few special characters with two characters; simple case folding doesn't. Such characters include the capital and small Latin sharp s.
– Michael Butscher, Commented Feb 9, 2019 at 6:15
@MichaelButscher Great, it works. You can get a green tick if you write it as an answer.
– iBug, Commented Feb 9, 2019 at 6:19

Michael Butscher · Accepted Answer · 2019-02-09 14:22:25Z

It follows the Unicode case folding table. Excerpt:

# The entries in this file are in the following machine-readable format:
#
# <code>; <status>; <mapping>; # <name>
#
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.

[...]

# Usage:
#  A. To do a simple case folding, use the mappings with status C + S.
#  B. To do a full case folding, use the mappings with status C + F.
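The two usage rules above can be sketched in a few lines of Python. This is only an illustration of the table format, not how the regex module implements folding. The `SAMPLE` lines are hand-copied in the same format (the sharp-s entries plus one common entry for `A`), and `SAMPLE`, `parse_folding`, and `fold` are hypothetical names.

```python
# Hand-copied sample in the "<code>; <status>; <mapping>; # <name>" format.
# A real program would read the complete Unicode CaseFolding.txt file instead.
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_folding(text, statuses):
    """Build a {char: folded_string} table from entries with the given statuses."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


SIMPLE = parse_folding(SAMPLE, {"C", "S"})  # usage A: C + S
FULL = parse_folding(SAMPLE, {"C", "F"})    # usage B: C + F


def fold(s, table):
    """Fold each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)
```

With this sample, `fold("\u1e9e", SIMPLE)` yields the single character "ß", while `fold("\u1e9e", FULL)` and `fold("\u00df", FULL)` both yield the two-character string "ss". Simple folding never changes the length of the string.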

The two foldings differ only for a few special characters. Examples are the small and capital Latin sharp s:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

[...]

1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
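The sharp s also gives a pattern that matches under V1 but not under V0 in the regex module itself: full folding turns "ß" into "ss", while simple folding leaves it as a single character. A minimal sketch, assuming the third-party regex package is installed (the import is guarded so the snippet still loads without it); `matches` is a hypothetical helper, not part of the module.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:  # not installed; the helper below then returns None
    regex = None


def matches(version, pattern, text):
    """Case-insensitive match of pattern at the start of text under V0 or V1."""
    if regex is None:
        return None
    # Builds e.g. "(?V1i)strasse", the same inline-flag style as in the question.
    return regex.match("(?%si)%s" % (version, pattern), text) is not None


if regex is not None:
    for version in ("V0", "V1"):
        print(version, matches(version, "strasse", "straße"))
```

If the package behaves as its documentation describes, this prints `V0 False` and `V1 True`: only full case folding lets the two-character "ss" in the pattern match the single "ß" in the text.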

edited Feb 9, 2019 at 14:22

answered Feb 9, 2019 at 6:29

Michael Butscher
