I am working with words written in the Uzbek language. The alphabet has the following letters:
alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
"j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
"s", "sh", "t", "u", "v", "x", "y", "z"]
As you can see, some letters consist of multiple characters, like o', g' and sh. How can I split a word in this language into a list of Uzbek letters? For example, the word "o'zbek" should be split into ["o'", "z", "b", "e", "k"].
If I do the following:
word = "o'zbek"
letters = list(word)
It results in:
['o', "'", 'z', 'b', 'e', 'k']
which is incorrect, because o and ' should stay together as the single letter o'.
I also tried using regex like this:
import re
expression = "|".join(alphabet)
re.split(expression, word)
But it results in:
['', "'", '', '', '', '']
Comment: The alphabet has sh as a letter, but it also has s and h. How would you expect the script to correctly read "asha"? Is it ['a', 'sh', 'a'] or ['a', 's', 'h', 'a']? (Similarly, is the symbol ' allowed in other contexts, or is it only used after an o or a g?)
Reply: If s and h appear next to each other, they should be recognized as the letter sh, so "asha" should be split as ['a', 'sh', 'a']. And yes, ' is used only in the letters o' and g'.
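The rule stated in the comments (always prefer the multi-character letter when its characters appear together) can be sketched with re.findall instead of re.split. This is only a minimal sketch, not a definitive solution. The function name split_letters is hypothetical, and the code assumes lowercase input and that ng follows the same "longest letter wins" rule as sh, which the comments do not confirm.

```python
import re

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
            "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
            "s", "sh", "t", "u", "v", "x", "y", "z"]

# Put longer letters first so that "sh" is tried before "s" and "o'" before "o".
# re.escape guards against characters that are special in regular expressions.
by_length = sorted(alphabet, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(letter) for letter in by_length))


def split_letters(word):
    """Split a lowercase word into letters of the alphabet above.

    Hypothetical helper: characters that are not part of any letter
    are silently skipped by findall.
    """
    return pattern.findall(word)


print(split_letters("o'zbek"))  # ["o'", 'z', 'b', 'e', 'k']
print(split_letters("asha"))    # ['a', 'sh', 'a']
```

The re.split attempt in the question returns the text between matches rather than the matches themselves, which is why it produced mostly empty strings. It also matched o before o' because regex alternation takes the first alternative that matches, not the longest one.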