2

I am working with words written in the Uzbek language. The language has the following letters:

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i", 
    "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r", 
    "s", "sh", "t", "u", "v", "x", "y", "z"]

As you can see, some letters consist of more than one character, such as o', g' and sh. How can I split a word in this language into a list of Uzbek letters? For example, the word "o'zbek" should be split into ["o'", "z", "b", "e", "k"].

If I do the following:

word = "o'zbek"
letters = list(word)

It results in:

['o', "'", 'z', 'b', 'e', 'k']

which is incorrect as o and ' are not together.

I also tried using regex like this:

import re
expression = "|".join(alphabet)
re.split(expression, word)

But it results in:

['', "'", '', '', '', '']

Comments
  • You say the language has 'sh' as a letter, but it also has 's' and 'h'. How would you expect the script to read 'asha'? Is it ['a', 'sh', 'a'] or ['a', 's', 'h', 'a']? (Similarly, is the symbol ' allowed in other contexts, or is it only used after an o or a g?) Commented Mar 31, 2021 at 5:53
  • If s is followed by h, the pair should be recognized as the letter sh, so 'asha' should be split as ['a', 'sh', 'a']. And yes, ' is used only in the letters o' and g'. Commented Mar 31, 2021 at 5:55

3 Answers

3

Regex alternation tries its alternatives from left to right, so with the original order "o" matches before "o'" is ever tried. To give priority to the multi-character letters, first sort the alphabet by length, longest first. Then join it with "|" as you did, and re.findall returns the list of letters:

import re

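# `alphabet` is the list from the question.
# Longest letters first, so "sh" is tried before "s" and "o'" before "o".
sorted_alphabet = sorted(alphabet, key=len, reverse=True)
regex = re.compile("|".join(sorted_alphabet))

def split_word(word):
    # findall returns every non-overlapping match, scanning left to right.
    return regex.findall(word)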

Usage:

>>> split_word("o'zbek")
["o'", 'z', 'b', 'e', 'k']

>>> split_word("asha")
['a', 'sh', 'a']
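
One thing to be aware of: re.findall silently skips characters that match none of the alternatives. If that matters, a slightly more defensive variant could look like the sketch below. The name split_word_strict is hypothetical, the alphabet is copied from the question, and the input is assumed to be lowercase.

```python
import re

# Alphabet copied from the question.
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
            "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
            "s", "sh", "t", "u", "v", "x", "y", "z"]

# Longest letters first; re.escape guards against regex metacharacters.
LETTER_RE = re.compile(
    "|".join(re.escape(letter)
             for letter in sorted(ALPHABET, key=len, reverse=True))
)

def split_word_strict(word):
    """Hypothetical helper: split `word`, rejecting characters outside ALPHABET."""
    letters = LETTER_RE.findall(word)
    # findall drops unmatched characters, so the pieces must rebuild the word.
    if "".join(letters) != word:
        raise ValueError(f"{word!r} contains characters outside the alphabet")
    return letters

print(split_word_strict("o'zbek"))  # ["o'", 'z', 'b', 'e', 'k']
print(split_word_strict("asha"))    # ['a', 'sh', 'a']
```

A word such as "o-zbek" raises ValueError here instead of quietly losing the hyphen.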

2

Something like this works.

double = {"o'", "ng", "g'", "sh"}

string = "o'zbek"
letters = []
while string:
    if string[:2] in double:
        letters.append(string[:2])
        string = string[2:]
    else:
        letters.append(string[0])
        string = string[1:]

If no letter is three or more characters long, you can list all the two-character letters in a set (a membership test on a set is faster than on a list).

Then you go through the string and check whether it starts with one of the two-character letters. If it does, you append those two characters to the list of letters and drop them from the string; otherwise you append the first character on its own.
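
The same loop can be wrapped in a reusable function. This is only a sketch: the name split_letters is hypothetical, and it walks the word by index instead of re-slicing the string on every step, which also leaves the input untouched.

```python
# Two-character letters from the question's alphabet.
DOUBLE = frozenset({"o'", "g'", "ng", "sh"})

def split_letters(word, double=DOUBLE):
    """Hypothetical helper: greedy split, preferring two-character letters."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in double:
            # The next two characters form one letter.
            letters.append(pair)
            i += 2
        else:
            # Otherwise take a single character.
            letters.append(word[i])
            i += 1
    return letters

print(split_letters("o'zbek"))  # ["o'", 'z', 'b', 'e', 'k']
print(split_letters("asha"))    # ['a', 'sh', 'a']
```

Unlike the regex versions, this keeps any character it does not recognise as a single-character item rather than skipping it.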

import re
letters = re.findall("(o'|g'|ng|sh|[a-z])", string)

works too.


2

If you are looking for regex specifically, you could try to use re.findall with a pattern like so:

[a-fh-mp-rt-z]|[go]'?|ng?|sh?
  • [a-fh-mp-rt-z] - A character class holding every letter that cannot start a two-character letter (everything except g, n, o and s).
  • | - Or:
  • [go]'? - Either "g" or "o" followed by an optional quote.
  • | - Or:
  • ng? - A literal "n" followed by an optional "g".
  • | - Or:
  • sh? - A literal "s" followed by an optional "h".
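
For readability, the same pattern can be written with re.VERBOSE so that the explanation lives next to each alternative. This is just a sketch, and the name UZBEK_LETTER is made up for illustration.

```python
import re

# Same pattern as above, spread over several lines with inline comments.
UZBEK_LETTER = re.compile(
    r"""
      [a-fh-mp-rt-z]   # letters that never start a two-character letter
    | [go]'?           # g or o, optionally followed by an apostrophe
    | ng?              # n, optionally followed by g
    | sh?              # s, optionally followed by h
    """,
    re.VERBOSE,
)

print(UZBEK_LETTER.findall("o'zbek"))  # ["o'", 'z', 'b', 'e', 'k']
```

In verbose mode, whitespace outside character classes and anything after # is ignored, so the compiled pattern behaves like the one-line version.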

See the online demo

import re
word = "o'zbek"
letters = re.findall("[a-fh-mp-rt-z]|[go]'?|ng?|sh?", word)
print(letters)

Prints:

["o'", 'z', 'b', 'e', 'k']

Note that you could also give priority to those "double" letters like so: [go]'|ng|sh|[a-z], kind of like how @MustafaAydin explained in his answer.
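
As a quick sanity check, the two patterns can be run side by side on a few sample strings. The variable names and the sample list below are my own choices, not part of the question.

```python
import re

# Pattern with optional second characters.
OPTIONAL_SUFFIX = re.compile(r"[a-fh-mp-rt-z]|[go]'?|ng?|sh?")
# Pattern that tries the two-character letters first.
DOUBLES_FIRST = re.compile(r"[go]'|ng|sh|[a-z]")

samples = ["o'zbek", "asha", "g'isht", "singil"]
for word in samples:
    first = OPTIONAL_SUFFIX.findall(word)
    second = DOUBLES_FIRST.findall(word)
    # Print the split and whether both patterns agree on it.
    print(word, first, first == second)
```

Both patterns give the same split for each of these samples.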
