0

I'm processing a csv file with last column not being always same format. Each row has this structure:

"Root/Word1","some string","some string","some œ0'fqw[唃#”≠§
\nfw@\tfa0j
"
"Root/Word2","some string","some string","some string"
...

So 6 columns and the last can contain \n. Which makes it hard to split by components. Another restriction is that all the strings can be any possible special character. Which makes it hard to use regex.

I decided to solve the problem brute force first. (Yes I have seen that index offset by is O(n). But can't come up with an alternative.)

static func importData(_ db: DB) {
    let csvString = readDataFromCSV(fileName: "data", fileType: "csv")!

    let totalCharCount = csvString.count
    print("total: \(totalCharCount)")
    for i in 0..<totalCharCount {
        print(i)
        if i+5 >= totalCharCount {
            continue
        }
        let index = csvString.index(csvString.startIndex, offsetBy: i)
        let endIndex = csvString.index(csvString.startIndex, offsetBy:i+5)

        let part = csvString[index ..< endIndex]
        if part == "Root/" {
            let accum = lookInside(i: i, totalCharCount: totalCharCount, csvString: csvString)

            var rows = accum.components(separatedBy: "\",\"")

            if var lastt = rows.last {
                lastt.removeLast()
                lastt.removeLast()
                rows[rows.count-1] = lastt
            }
        }
    }
}

static func lookInside(i:Int, totalCharCount: Int, csvString: String) -> String {
    var accum = ""
    var found = false
    var j = i+5
    while !found {

        if j+5 >= totalCharCount {
            found = true
        }
        let index2 = csvString.index(csvString.startIndex, offsetBy: j)
        let endIndex2 = csvString.index(csvString.startIndex, offsetBy:j+5)

        if csvString[index2 ..< endIndex2] == "Root/" {
            found = true
            accum.removeLast()
        } else {
            accum += String(csvString[index2])
        }
        j += 1
    }

    return accum
}

Basically I'm traversing the whole string looking for a pattern "Root/". When found, I advance from this moment to the next occurrence of the pattern.

Problem is that the csv results in a string 200k chars long and when I run this on simulator it lasts too much time (~30min).

So now I'm asking some help here because according to Instruments all the time is consumed in String.index(offset by) method which is called too many times.

4
  • The obvious micro-optimization is to only index from a previous index, not from the start of the string each time. But is there a reason you're not using an OTS CSV parser? This ad-hoc one looks like it will fail in some cases. Commented Nov 16, 2017 at 19:27
  • Thanks. Good point. Will thry the library you mention Commented Nov 16, 2017 at 19:35
  • 1
    What @Ssswift said is more than micro-optimization. Iterating through a string is O(n) and since you begin from csvString.startIndex every time, it quickly becomes an O(n^2) operation. Since you question is about performance optimization, please upload the full CSV file to Pastebin and include a link here Commented Nov 16, 2017 at 19:36
  • You said there's 6 columns, but I only count 4 columns. Can you confirm whether any of these columns may include "? Is it legal to quote a double-quote (i.e \")? Or is it true that there will be precisely 8 (12?) double-quotes per record, and the end of a record will be a double-quote? Is removing Root/ important, or just how you're finding the records? Commented Nov 16, 2017 at 20:37

3 Answers 3

3

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

Quote from Joel on Software

You need to understand how strings work. Poor string handling was slow in 2001 when Joel Spolsky wrote the above story. With proper Unicode handling, things have become even more expensive.

You aren't really traversing the string. You start all over again and again by using String.startIndex.

Instead use functions that return a String.Index result. That's an efficient index into the string. Use it to save your last position and then work relative to this position. Your code will be faster if you use String.Index only once at the beginning.

Sign up to request clarification or add additional context in comments.

2 Comments

which functions are those?
One of them is String.index(:offsetBy:), which you already use, but don't make good use of. E.g. pass the result to lookInside instead of the first Int parameter. Then work relative to that position.
1

First, I would highly recommend using a csv parser if you're needing to read in the entire dataset, e.g. https://github.com/naoty/SwiftCSV

Second, if you're looking for a specific entry in the entire file, you probably want to use the range function: Answer using String.range

3 Comments

Wow, didn't know about it. Just went with the ad-hoc one. This might solve my problem, but still I'm curious in how would one solve this problem of traversal. No, I'm not looking for a specific entry. Just need to reformat the input csv.
Since this is just a couple of links, you really should post this as a comment, not an answer.
rmaddy: The first one is a general solution -- "use a CSV parser" -- and the link is merely an example of one that could be used. I think it's perfectly fine as an answer.
0

I have found a solution just using components and matching patterns. This works in less than 4s. Thanks for all the info, I learnt a lot from it.

        let csvString = readDataFromCSV(fileName: "data", fileType: "csv")!
        var components = csvString.components(separatedBy: "Root/")
        var result = [Note]()
        for i in components {
            let noteComponents = i.components(separatedBy: "\",\"")

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.