1

I need to detect the language of a string read from a PDF file. The text is in English, but NLLanguageRecognizer reports it as Romanian.

The function I am using is:

// Requires: import Foundation, import NaturalLanguage
class func detectedLangaugeFormat(for string: String) -> String {
    if #available(iOS 12.0, *) {
        let recognizer = NLLanguageRecognizer()
        recognizer.processString(string)
        guard let languageCode = recognizer.dominantLanguage?.rawValue else { return "rtl" }
        let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
        print("lan")
        let direction: NSLocale.LanguageDirection = NSLocale.characterDirection(forLanguage: languageCode)
        if direction == .rightToLeft {
            return "rtl"
        } else if direction == .leftToRight {
            return "ltr"
        }
    } else {
        // Fallback on earlier versions
    }

    return "rtl"
}

The string passed to this method is:

"\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
  • Does the text actually contain the \r\n text? That would probably cause a problem. Commented Dec 24, 2019 at 10:35
  • When I tried your code and text, the language code returned was en. Commented Dec 24, 2019 at 10:40
  • @s3cretshadow I updated the string; please check with this one. Commented Dec 24, 2019 at 10:45
  • Is this the actual string that you are passing? Commented Dec 24, 2019 at 10:57
  • @AhmadF Yes, it was converted from a sample PDF. Commented Dec 24, 2019 at 10:58

3 Answers

1

One possible solution is to collapse every run of two or more spaces in the string into a single space.

let regex = try? NSRegularExpression(pattern: "  +")
// Build the range from String indices: NSRange counts UTF-16 units, which str.count does not.
str = regex?.stringByReplacingMatches(in: str, options: [], range: NSRange(str.startIndex..., in: str), withTemplate: " ") ?? str

I tried your string with this regex and it worked: the language recognizer returned the en language code.
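The regex above only touches runs of spaces, so the \r\n sequences from the PDF text stay in the string. A minimal sketch of a broader cleanup, assuming you would rather collapse all whitespace (spaces, tabs, carriage returns, newlines) in one pass using only Foundation. The helper name normalizedWhitespace is hypothetical, and I have not checked how this variant affects the recognizer's result for this particular input.

```swift
import Foundation

/// Collapses every run of whitespace (spaces, tabs, "\r", "\n") into one space
/// and drops leading and trailing whitespace.
/// `normalizedWhitespace` is a hypothetical helper name, not part of any Apple API.
func normalizedWhitespace(_ text: String) -> String {
    return text
        .components(separatedBy: .whitespacesAndNewlines) // split on any whitespace character
        .filter { !$0.isEmpty }                           // drop the empty pieces between adjacent separators
        .joined(separator: " ")
}

// Shortened excerpt of the string from the question.
let sample = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - "
let cleaned = normalizedWhitespace(sample)
print(cleaned)
```

The cleaned string can then be passed to detectedLangaugeFormat(for:) in place of the raw PDF text.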


1

For some reason, whitespace and newlines make the result of processString(_:) unreliable. Trim the leading and trailing whitespace and newlines before passing the string to your method:

let givenString = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
let trimmedString = givenString.trimmingCharacters(in: .whitespacesAndNewlines)

let result = detectedLangaugeFormat(for: trimmedString)
print(result) // ltr

At this point, the text should be recognized as English: if you print detectedLangauge inside your method instead of "lan", you'll see "English".

let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
print(detectedLangauge) // Optional("English")
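When the dominant language looks wrong, it can help to see which candidates the recognizer is weighing. A rough sketch, assuming the NaturalLanguage framework is available (the canImport guard just makes the snippet return an empty list elsewhere). It uses NLLanguageRecognizer's languageHypotheses(withMaximum:); the helper name topLanguageGuesses is hypothetical.

```swift
import Foundation
#if canImport(NaturalLanguage)
import NaturalLanguage
#endif

/// Returns up to `maximum` candidate language codes with their probabilities,
/// most likely first. Returns an empty array where NaturalLanguage is unavailable.
/// `topLanguageGuesses` is a hypothetical helper name.
func topLanguageGuesses(for text: String, maximum: Int = 3) -> [(code: String, probability: Double)] {
    #if canImport(NaturalLanguage)
    if #available(iOS 12.0, macOS 10.14, tvOS 12.0, watchOS 5.0, *) {
        let recognizer = NLLanguageRecognizer()
        recognizer.processString(text)
        // languageHypotheses returns a dictionary of NLLanguage to probability.
        return recognizer.languageHypotheses(withMaximum: maximum)
            .sorted { $0.value > $1.value }
            .map { (code: $0.key.rawValue, probability: $0.value) }
    }
    #endif
    return []
}

for guess in topLanguageGuesses(for: "This is a small demonstration .pdf file.") {
    print(guess.code, guess.probability)
}
```

Comparing the output for the raw and the trimmed string shows how much the surrounding whitespace shifts the probabilities.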


0

Remove the leading non-alphanumeric characters (whitespace, !, @, #, etc.) from the string, then try to detect the language.

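extension String {
    /// Index of the first ASCII letter or digit, or nil if there is none.
    func findFirstAlphabetic() -> String.Index? {
        for index in self.indices {
            if String(self[index]).isAlphanumeric {
                return index
            }
        }
        return nil
    }

    /// True when the string is non-empty and contains only ASCII letters and digits.
    var isAlphanumeric: Bool {
        return !isEmpty && range(of: "[^a-zA-Z0-9]", options: .regularExpression) == nil
    }

    /// The substring starting at the first ASCII letter or digit, or nil if there is none.
    func alphabetic_Leading_SubString() -> String? {
        if let startIndex = self.findFirstAlphabetic() {
            let newSubString = self[startIndex..<self.endIndex]
            return String(newSubString)
        }
        return nil
    }
}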

Usage:

let string = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
detectedLangaugeFormat(for: string.alphabetic_Leading_SubString()!)
