Swift 4 String index offset by too slow while processing a large string

Question

I'm processing a csv file with last column not being always same format. Each row has this structure:

"Root/Word1","some string","some string","some œ0'fqw[å”ƒ#”≠§
\nfw@\tfa0j
"
"Root/Word2","some string","some string","some string"
...

So 6 columns and the last can contain \n. Which makes it hard to split by components. Another restriction is that all the strings can be any possible special character. Which makes it hard to use regex.

I decided to solve the problem brute force first. (Yes I have seen that index offset by is O(n). But can't come up with an alternative.)

static func importData(_ db: DB) {
    let csvString = readDataFromCSV(fileName: "data", fileType: "csv")!

    let totalCharCount = csvString.count
    print("total: \(totalCharCount)")
    for i in 0..<totalCharCount {
        print(i)
        if i+5 >= totalCharCount {
            continue
        }
        let index = csvString.index(csvString.startIndex, offsetBy: i)
        let endIndex = csvString.index(csvString.startIndex, offsetBy:i+5)

        let part = csvString[index ..< endIndex]
        if part == "Root/" {
            let accum = lookInside(i: i, totalCharCount: totalCharCount, csvString: csvString)

            var rows = accum.components(separatedBy: "\",\"")

            if var lastt = rows.last {
                lastt.removeLast()
                lastt.removeLast()
                rows[rows.count-1] = lastt
            }
        }
    }
}

static func lookInside(i:Int, totalCharCount: Int, csvString: String) -> String {
    var accum = ""
    var found = false
    var j = i+5
    while !found {

        if j+5 >= totalCharCount {
            found = true
        }
        let index2 = csvString.index(csvString.startIndex, offsetBy: j)
        let endIndex2 = csvString.index(csvString.startIndex, offsetBy:j+5)

        if csvString[index2 ..< endIndex2] == "Root/" {
            found = true
            accum.removeLast()
        } else {
            accum += String(csvString[index2])
        }
        j += 1
    }

    return accum
}

Basically I'm traversing the whole string looking for a pattern "Root/". When found, I advance from this moment to the next occurrence of the pattern.

Problem is that the csv results in a string 200k chars long and when I run this on simulator it lasts too much time (~30min).

So now I'm asking some help here because according to Instruments all the time is consumed in String.index(offset by) method which is called too many times.

The obvious micro-optimization is to only index from a previous index, not from the start of the string each time. But is there a reason you're not using an OTS CSV parser? This ad-hoc one looks like it will fail in some cases. — Ssswift
– Ssswift, Commented Nov 16, 2017 at 19:27
What @Ssswift said is more than micro-optimization. Iterating through a string is O(n) and since you begin from csvString.startIndex every time, it quickly becomes an O(n^2) operation. Since you question is about performance optimization, please upload the full CSV file to Pastebin and include a link here — Code Different
– Code Different, Commented Nov 16, 2017 at 19:36
You said there's 6 columns, but I only count 4 columns. Can you confirm whether any of these columns may include "? Is it legal to quote a double-quote (i.e \")? Or is it true that there will be precisely 8 (12?) double-quotes per record, and the end of a record will be a double-quote? Is removing Root/ important, or just how you're finding the records? — Rob Napier
– Rob Napier, Commented Nov 16, 2017 at 20:37

Codo · Accepted Answer · 2017-11-16 19:45:50Z

3

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

Quote from Joel on Software

You need to understand how strings work. Poor string handling was slow in 2001 when Joel Spolsky wrote the above story. With proper Unicode handling, things have become even more expensive.

You aren't really traversing the string. You start all over again and again by using String.startIndex.

Instead use functions that return a String.Index result. That's an efficient index into the string. Use it to save your last position and then work relative to this position. Your code will be faster if you use String.Index only once at the beginning.

answered Nov 16, 2017 at 19:45

Codo

79.7k18 gold badges176 silver badges218 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Vik Marino Over a year ago

which functions are those?

Codo Over a year ago

One of them is String.index(:offsetBy:), which you already use, but don't make good use of. E.g. pass the result to lookInside instead of the first Int parameter. Then work relative to that position.

GetSwifty · Accepted Answer · 2017-11-16 19:28:11Z

1

First, I would highly recommend using a csv parser if you're needing to read in the entire dataset, e.g. https://github.com/naoty/SwiftCSV

Second, if you're looking for a specific entry in the entire file, you probably want to use the range function: Answer using String.range

answered Nov 16, 2017 at 19:28

GetSwifty

7,7761 gold badge32 silver badges46 bronze badges

3 Comments

Vik Marino Over a year ago

Wow, didn't know about it. Just went with the ad-hoc one. This might solve my problem, but still I'm curious in how would one solve this problem of traversal. No, I'm not looking for a specific entry. Just need to reformat the input csv.

rmaddy Over a year ago

Since this is just a couple of links, you really should post this as a comment, not an answer.

Ssswift Over a year ago

rmaddy: The first one is a general solution -- "use a CSV parser" -- and the link is merely an example of one that could be used. I think it's perfectly fine as an answer.

Vik Marino · Accepted Answer · 2017-11-16 22:17:32Z

0

I have found a solution just using components and matching patterns. This works in less than 4s. Thanks for all the info, I learnt a lot from it.

        let csvString = readDataFromCSV(fileName: "data", fileType: "csv")!
        var components = csvString.components(separatedBy: "Root/")
        var result = [Note]()
        for i in components {
            let noteComponents = i.components(separatedBy: "\",\"")

answered Nov 16, 2017 at 22:17

Vik Marino

1,7751 gold badge19 silver badges37 bronze badges

Collectives™ on Stack Overflow

Swift 4 String index offset by too slow while processing a large string

3 Answers 3

2 Comments

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related