Split a string into a list by a set of strings

Question

I am dealing with words written in the Uzbek language. The language has the following letters:

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i", 
    "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r", 
    "s", "sh", "t", "u", "v", "x", "y", "z"]

As you can see, some letters consist of more than one character, such as o', g' and sh. How can I split a word in this language into a list of Uzbek letters? For example, the word "o'zbek" should be split into ["o'", "z", "b", "e", "k"].

If I do the following:

word = "o'zbek"
letters = list(word)

It results in:

['o', "'", 'z', 'b', 'e', 'k']

which is incorrect as o and ' are not together.

I also tried using regex like this:

import re
expression = "|".join(alphabet)
re.split(expression, word)

But it results in:

['', "'", '', '', '', '']

You say the language has 'sh' as a letter, but it also has 's' and 'h'. How would you expect the script to correctly read 'asha'? Is it ['a', 'sh', 'a'] or ['a', 's', 'h', 'a']? (Similarly, is the symbol ' allowed in other contexts, or is it only used after an o or a g?)
– Grismar, Commented Mar 31, 2021 at 5:53
If it's a combination of s and h, then it should be recognized as the letter sh, so 'asha' should be split as ['a', 'sh', 'a']. And yes, ' is used only in the letters o' and g'.
– Sayyor Y, Commented Mar 31, 2021 at 5:55

Mustafa Aydın · Accepted Answer · 2021-03-31 06:12:36Z

3

To give priority to the multi-character letters, first sort the alphabet by length, longest first. Regex alternation tries its alternatives from left to right, so this ensures that "o'" is tried before "o". Then join the sorted letters with "|", as you did, and let re.findall return the list of letters:

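import re

# `alphabet` is the list defined in the question.
# Longest letters first, so "o'" is tried before "o".
sorted_alphabet = sorted(alphabet, key=len, reverse=True)
regex = re.compile("|".join(sorted_alphabet))

def split_word(word):
    # findall returns every non-overlapping match, in order
    return regex.findall(word)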

Usage:

>>> split_word("o'zbek")
["o'", 'z', 'b', 'e', 'k']

>>> split_word("asha")
['a', 'sh', 'a']
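
To see why the ordering matters, the sketch below runs the same alternation with and without the length sort. The helper name `build_splitter` and the use of `re.escape` are additions for illustration and are not part of the answer above; `re.escape` is only a precaution in case a letter ever contains a regex metacharacter.

```python
import re

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
            "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
            "s", "sh", "t", "u", "v", "x", "y", "z"]


def build_splitter(letters, longest_first=True):
    """Compile an alternation of the given letters (hypothetical helper)."""
    if longest_first:
        # sorted() is stable, so letters of equal length keep their order
        letters = sorted(letters, key=len, reverse=True)
    return re.compile("|".join(re.escape(letter) for letter in letters))


unsorted_result = build_splitter(alphabet, longest_first=False).findall("o'zbek")
sorted_result = build_splitter(alphabet).findall("o'zbek")

print(unsorted_result)  # ['o', 'z', 'b', 'e', 'k']  (the apostrophe is skipped)
print(sorted_result)    # ["o'", 'z', 'b', 'e', 'k']
```

In the unsorted case "o" matches before "o'" is ever tried, and the leftover apostrophe matches nothing, so findall silently skips it.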

answered Mar 31, 2021 at 6:12

Mustafa Aydın


Arpad Horvath -- Слава Україні · Accepted Answer · 2021-03-31 06:16:04Z

2

Something like this works.

double = {"o'", "ng", "g'", "sh"}

string = "o'zbek"
letters = []
while string:
    if string[:2] in double:
        letters.append(string[:2])
        string = string[2:]
    else:
        letters.append(string[0])
        string = string[1:]

If there are no letters of three or more characters, you can list all the two-character letters in a set (a membership test on a set is faster than on a list).

Then you go through the string and check whether it begins with one of the two-character letters. If it does, you store those two characters in the list of letters and drop them from the string; otherwise you store the single first character.
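
The loop above can also be wrapped in a function that advances an index instead of re-slicing the string on every step. This is a minimal sketch under the same assumption (no letter is longer than two characters); the name `split_letters` and its `digraphs` parameter are made up for illustration.

```python
DIGRAPHS = frozenset({"o'", "g'", "ng", "sh"})


def split_letters(word, digraphs=DIGRAPHS):
    """Split word into letters, preferring two-character letters."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in digraphs:
            # Two-character letter found at the current position
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


print(split_letters("o'zbek"))  # ["o'", 'z', 'b', 'e', 'k']
print(split_letters("asha"))    # ['a', 'sh', 'a']
```

Unlike a findall-based approach, this version keeps any character that is not part of the alphabet as its own list item rather than dropping it.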

import re
string = "o'zbek"  # the loop above consumes `string`, so define it again
letters = re.findall("o'|g'|ng|sh|[a-z]", string)

works too.

edited Mar 31, 2021 at 6:16

answered Mar 31, 2021 at 5:58

Arpad Horvath -- Слава Україні


JvdV · Accepted Answer · 2021-03-31 06:33:13Z

2

If you are looking for regex specifically, you could try to use re.findall with a pattern like so:

[a-fh-mp-rt-z]|[go]'?|ng?|sh?

[a-fh-mp-rt-z] - A character class holding every letter that cannot start a two-character letter (everything except g, n, o and s).
| - Or:
[go]'? - Either "g" or "o" followed by an optional quote.
| - Or:
ng? - A literal "n" followed by an optional "g".
| - Or:
sh? - A literal "s" followed by an optional "h".

See the online demo

import re
word = "o'zbek"
letters = re.findall("[a-fh-mp-rt-z]|[go]'?|ng?|sh?", word)
print(letters)

Prints:

["o'", 'z', 'b', 'e', 'k']

Note that you could also give priority to those "double" letters like so: [go]'|ng|sh|[a-z], kind of like how @MustafaAydin explained in his answer.
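
As a quick sanity check, the two patterns can be compared on a few words. The sample words and the `lossless` helper are assumptions added here for illustration. Because re.findall silently skips characters that match nothing, the helper checks that rejoining the pieces reproduces the input.

```python
import re

optional_suffix = re.compile(r"[a-fh-mp-rt-z]|[go]'?|ng?|sh?")
digraphs_first = re.compile(r"[go]'|ng|sh|[a-z]")


def lossless(pattern, word):
    """Return True if findall consumed every character of word."""
    return "".join(pattern.findall(word)) == word


for word in ["o'zbek", "asha", "tong", "g'isht"]:
    first = optional_suffix.findall(word)
    second = digraphs_first.findall(word)
    # Print the split, whether both patterns agree, and whether nothing was dropped
    print(word, first, first == second, lossless(digraphs_first, word))
```

On these sample words both patterns produce the same split. A check like `lossless` is one way to notice input that contains characters outside the alphabet.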

edited Mar 31, 2021 at 6:33

answered Mar 31, 2021 at 6:21

JvdV
