
Python
I want to split a string into parts of at most 5000 characters each. The split must not fall in the middle of a word; it should only happen where there is a space.
Right now I iterate through the string character by character and cut off a part every 4980 characters. Each part is passed to a translation function, and whatever remains at the end (fewer than 4980 characters) is translated as well. I am new to Python, so I'm sure my method is a mess: it works, but it certainly isn't good code.
I haven't checked for spaces yet because Japanese and Chinese text has none, but that check is needed too so that a word is not split in two.

with open('lightnovel.txt', 'r', encoding="utf8") as f:
    text = f.read()

# trans() is my own translation function (not shown here).
CHUNK_SIZE = 4980  # stay a little below the 5000-character limit
partofbook = ''
for character in text:
    partofbook = partofbook + character
    if len(partofbook) >= CHUNK_SIZE:
        # A full part has been collected: translate it and start a new one.
        trans(partofbook)
        partofbook = ''
# Translate whatever is left after the last full part.
if partofbook:
    trans(partofbook)
  • Why don't you start at index 5000, iterate backwards till you find whitespace at position A, let's say, then your first output is string[0,A-1]. Then jump ahead to index A+5000 and do the same thing, searching backwards for whitespace, found at index B, so your next output is string[A, B-1]. Repeat until done. Obviously check that you don't skip beyond len(string). Commented Mar 1, 2021 at 18:52
  • Thank you, this is a great idea! Commented Mar 1, 2021 at 19:01
  • Can you post this comment as an answer so I can check it as a solution? Commented Mar 1, 2021 at 19:05
  • Yes, see How to get char from string by index? and [ ](stackoverflow.com/questions/663171/…) Commented Mar 1, 2021 at 19:06

2 Answers

1

I'm also new to Python, so I apologise for not implementing this in your code.

Strings have a method called split(), used as string.split() (where string is the sentence you want to split).

Called without arguments, it splits only at whitespace, so words are never cut in half.
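
To illustrate, here is a minimal sketch of how the words returned by split() could be regrouped into parts that stay under a length limit. The function name chunk_words and its limit parameter are made up for this example and are not part of the original code. Note that the parts are rebuilt with single spaces, so the original line breaks and repeated spaces are lost, and a word longer than the limit (for instance text without any spaces) is simply cut at the limit.

```python
def chunk_words(text, limit=5000):
    """Group whitespace-separated words into parts of at most `limit` characters."""
    chunks = []
    current = ''
    for word in text.split():  # no argument: split on any run of whitespace
        # A single word longer than the limit cannot fit anywhere,
        # so it is cut into limit-sized pieces (assumption of this sketch).
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + ' ' + word
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding the word would exceed the limit: close the current part.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


parts = chunk_words("apple banana orange", limit=12)
print(parts)
```

Each element of the returned list could then be passed to the translation function in turn.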


1 Comment

The problem is that this doesn't split by length but by occurrences of a separator. As a w3schools example shows, splitting apple#banana#orange by "#" gives a list containing apple, banana, and orange. I haven't found a way to use this function with a length parameter.
0

I would start at index 5000 and iterate backwards until you find whitespace, say at position A; your first output is then string[0, A-1] (in Python, you can use s[0:A] to get this substring).

Then jump ahead to index A+5000 and do the same thing: search backwards for whitespace, found at index B, so your next output is string[A+1, B-1] (in Python, you can use s[A+1:B] to get this substring). Note: it's A+1 because you want to skip the whitespace found at index A.

Repeat until done. Obviously check that you don't skip beyond len(string).
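
The steps above could be sketched roughly as follows. The function name split_at_whitespace and the limit parameter are only illustrative. The sketch also makes one assumption that goes beyond the description: if no whitespace is found in a window (as can happen with Japanese or Chinese text), it falls back to a hard cut at the limit.

```python
def split_at_whitespace(text, limit=5000):
    """Split text into parts of at most `limit` characters, cutting at whitespace."""
    chunks = []
    start = 0
    # Keep cutting while more than `limit` characters remain.
    while len(text) - start > limit:
        # Walk backwards from start + limit until a whitespace character is found.
        cut = start + limit
        while cut > start and not text[cut].isspace():
            cut -= 1
        if cut == start:
            # No whitespace in this window: hard cut at the limit (assumption).
            cut = start + limit
            chunks.append(text[start:cut])
            start = cut
        else:
            chunks.append(text[start:cut])
            start = cut + 1  # skip the whitespace found at index `cut`
    # Whatever is left fits within the limit.
    if start < len(text):
        chunks.append(text[start:])
    return chunks


for part in split_at_whitespace("aaa bbb ccc ddd", limit=7):
    print(repr(part))
```

Because the loop only runs while more than limit characters remain, text[cut] never indexes beyond len(text).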

Also, see
