First, I realize the question title is about as terrible as the sample code I'll post below, so please bear with me while I explain the problem more clearly. If you have a better idea for the title, be my guest and edit it.
Imagine a long plain text. It consists of words separated by punctuation marks and/or spaces. I need to convert it into a list of fragments, each holding a word together with the punctuation that separates it from the next word. The twist is that I also need to determine which punctuation mark it is (or which is the last one, if there are several in a row). So I need to turn the text into a collection of structures like this:
{
    wordFollowedByPunctuation: String;
    punctuationMark: PunctuationType; // E.g. {Point, Comma, Colon, Space, ...}
}
If all the punctuation marks were single characters, it would be easy since we could use single-pass character-wise parsing. I have a working, albeit awful, C++ prototype (using Qt - QString and QChar - for Unicode support).
Here's the TextFragment structure - I'll be converting the text into a collection of these:
struct TextFragment
{
    enum Delimiter {
        Space,
        Comma,
        Point,
        ExclamationMark,
        QuestionMark,
        Dash,
        Colon,
        Semicolon,
        Ellipsis,
        Bracket,
        Newline
    };
    inline TextFragment(const QString& text, Delimiter delimiter) : _text(text), _delimiter(delimiter) {}
    const QString _text;          // The word plus the delimiter characters that follow it
    const Delimiter _delimiter;   // The type of the delimiter that ends this fragment
};
And here's the actual parsing:
const QString text = readText(device);
// Maps a single delimiter character to its type; ordered by character so it can live in a std::set.
struct Delimiter {
    QChar delimiterCharacter;
    TextFragment::Delimiter delimiterType;
    inline bool operator< (const Delimiter& other) const {
        return delimiterCharacter < other.delimiterCharacter;
    }
};
static const std::set<Delimiter> delimiters {
    {' ', TextFragment::Space},
    {'.', TextFragment::Point},
    {':', TextFragment::Colon},
    {';', TextFragment::Semicolon},
    {',', TextFragment::Comma},
    // TODO: dash should be ignored unless it has an adjacent space!
    {'-', TextFragment::Dash},
    // TODO:
    // {"...", TextFragment::Ellipsis},
    {'⋯', TextFragment::Ellipsis},
    {'…', TextFragment::Ellipsis},
    {'!', TextFragment::ExclamationMark},
    {'\n', TextFragment::Newline},
    {'?', TextFragment::QuestionMark},
    {')', TextFragment::Bracket},
    {'(', TextFragment::Bracket},
    {'[', TextFragment::Bracket},
    {']', TextFragment::Bracket},
    {'{', TextFragment::Bracket},
    {'}', TextFragment::Bracket}
};
std::vector<TextFragment> fragments;
QString buffer;
bool wordEnded = false;
TextFragment::Delimiter lastDelimiter = TextFragment::Space;
for (QChar ch: text)
{
    if (ch == '\r')
        continue;
    // Only the character takes part in the lookup; the type passed here is a dummy.
    const auto it = delimiters.find({ch, TextFragment::Space});
    if (it == delimiters.end()) // Not a delimiter
    {
        if (wordEnded) // This is the first letter of a new word
        {
            fragments.emplace_back(buffer, lastDelimiter);
            wordEnded = false;
            buffer = ch;
        }
        else
            buffer += ch;
    }
    else // This is a delimiter. Append it to the current word.
    {
        lastDelimiter = it->delimiterType;
        wordEnded = true;
        buffer += ch;
    }
}
// Flush the final fragment: the loop only emits a fragment when the next word starts.
if (!buffer.isEmpty())
    fragments.emplace_back(buffer, lastDelimiter);
return fragments;
Here we have a state machine that tries hard not to look like one, and is worse for it. It works. But it has a bigger problem than coding style: some delimiters are multi-character, and it simply can't handle them. One example is an ellipsis consisting of three dots ("..."), which I want to tell apart from a single dot. Another example is distinguishing a hyphen from a dash. A hyphen separates parts of a compound word, e.g. "up-to-date", and as far as I'm concerned it's not a punctuation mark. A dash, on the other hand, is: "Joe — and his trusty mutt — was always welcome." There is a dedicated dash character, but in plain non-Unicode texts both are commonly represented by a hyphen ("-"). The only way I see to tell them apart is to treat only "space+hyphen+space" or "space+hyphen+other punctuation mark" as a dash.
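To make the hyphen/dash rule concrete, here is a minimal sketch of it as a predicate. It uses plain std::string instead of QString so that it is self-contained; the names isOtherPunctuation and isDashAt are hypothetical, and the set of "other punctuation marks" is an assumption based on the delimiter table above. A hyphen at the very end of the text is treated as not a dash, which is also an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: punctuation that may follow "space+hyphen" to make it a dash.
// The exact character set is an assumption taken from the delimiter table.
inline bool isOtherPunctuation(char c)
{
    return std::string(".,:;!?()[]{}\n").find(c) != std::string::npos;
}

// Returns true if the '-' at text[pos] should be treated as a dash, i.e. it is
// preceded by a space and followed by a space or another punctuation mark.
// Anything else (e.g. "up-to-date") is treated as a hyphen inside a word.
inline bool isDashAt(const std::string& text, std::size_t pos)
{
    if (pos >= text.size() || text[pos] != '-')
        return false;
    const bool spaceBefore = pos > 0 && text[pos - 1] == ' ';
    const bool hasNext = pos + 1 < text.size();
    const bool delimiterAfter =
        hasNext && (text[pos + 1] == ' ' || isOtherPunctuation(text[pos + 1]));
    return spaceBefore && delimiterAfter;
}
```

With this rule, the hyphens in "up-to-date" are not dashes, while the one in "Joe - and his trusty mutt" is.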
Example: the sentence
If only he had tried... well, it doesn't matter now.
should result in
{"If ", Space},
{"only ", Space},
{"he ", Space},
{"had ", Space},
{"tried... ", Ellipsis},
{"well, ", Comma},
{"it ", Space},
{"doesn't ", Space},
{"matter ", Space},
{"now.", Point}
The way I see it, I need some sort of exhaustive parsing instead of my greedy prototype (and, naturally, separators themselves should be represented by strings, not characters). What's the simplest way to do that? Can I use regular expressions for this (I'm terrible with them)?
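A minimal sketch of the string-delimiter idea, without regular expressions, might look like the following. It is written against plain std::string (UTF-8 bytes) rather than QString so that it is self-contained, and all names in it (Delim, Fragment, kDelimiters, matchDelimiter, tokenize) are hypothetical. Two assumptions are baked in: delimiters are tried longest first, so "..." wins over "."; and a space never overrides a punctuation mark seen earlier in the same run, which is what the expected output above implies ("well, " is a Comma, not a Space). The hyphen/dash rule is left out to keep the sketch short.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Delim { Space, Comma, Point, ExclamationMark, QuestionMark,
                   Dash, Colon, Semicolon, Ellipsis, Bracket, Newline };

struct Fragment {
    std::string text;   // the word plus the delimiter characters that follow it
    Delim delimiter;    // the type of the delimiter run that ends the fragment
};

// Delimiters as strings, listed longest first so "..." is tried before ".".
static const std::vector<std::pair<std::string, Delim>> kDelimiters = {
    {"...", Delim::Ellipsis},
    {"\xE2\x80\xA6", Delim::Ellipsis},   // U+2026 encoded as UTF-8
    {" ", Delim::Space},       {".", Delim::Point},
    {",", Delim::Comma},       {":", Delim::Colon},
    {";", Delim::Semicolon},   {"!", Delim::ExclamationMark},
    {"?", Delim::QuestionMark},{"\n", Delim::Newline},
    {"(", Delim::Bracket},     {")", Delim::Bracket},
    {"[", Delim::Bracket},     {"]", Delim::Bracket},
    {"{", Delim::Bracket},     {"}", Delim::Bracket},
};

// Returns the length of the delimiter starting at pos (0 if none) and sets type.
static std::size_t matchDelimiter(const std::string& s, std::size_t pos, Delim& type)
{
    for (const auto& entry : kDelimiters) {
        if (s.compare(pos, entry.first.size(), entry.first) == 0) {
            type = entry.second;
            return entry.first.size();
        }
    }
    return 0;
}

std::vector<Fragment> tokenize(const std::string& text)
{
    std::vector<Fragment> fragments;
    std::string buffer;
    bool inDelimiters = false;   // true while consuming a run of delimiters
    Delim last = Delim::Space;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '\r') { ++i; continue; }
        Delim type = Delim::Space;
        const std::size_t length = matchDelimiter(text, i, type);
        if (length == 0) {                       // part of a word
            if (inDelimiters) {                  // first character of a new word
                fragments.push_back({buffer, last});
                buffer.clear();
                inDelimiters = false;
            }
            buffer += text[i];
            ++i;
        } else {                                 // a (possibly multi-character) delimiter
            // The first delimiter of a run always sets the type; later ones
            // override it only if they are not plain spaces.
            if (!inDelimiters || type != Delim::Space)
                last = type;
            inDelimiters = true;
            buffer.append(text, i, length);
            i += length;
        }
    }
    if (!buffer.empty())                         // flush the final fragment
        fragments.push_back({buffer, inDelimiters ? last : Delim::Space});
    return fragments;
}
```

Calling tokenize("If only he had tried... well, it doesn't matter now.") yields ten fragments, including {"tried... ", Ellipsis}, {"well, ", Comma} and a final {"now.", Point}. A regular expression could probably express the same longest-match alternation, but the explicit table keeps the mapping from delimiter string to type in one place.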