An elegant way to split text into words combined with adjacent punctuation and determine which punctuation mark it is

Question

Firstly, I realize that the question title is about as terrible as the sample code I'll post below, so please bear with me while I explain the problem more clearly. If you have a better idea for the title, be my guest and edit it.

Imagine a long plain text. It consists of words separated by punctuation marks and/or spaces. What I need to do is convert it to a list of words, each combined with the punctuation marks that separate it from the next word. And the twist is that I also need to determine which punctuation mark it is (or which is the last one if there's more than one in a row). So, I need to turn the text into a collection of structures:

{
   wordFollowedByPunctuation: String;
   punctuationMark: PunctuationType; // E. g. {Point, Comma, Colon, Space, ...}
}

If all the punctuation marks were single characters, it would be easy since we could use single-pass character-wise parsing. I have a working, albeit awful, C++ prototype (using Qt - QString and QChar - for Unicode support).

Here's the TextFragment structure - I'll be converting the text into a collection of these:

struct TextFragment
{
    enum Delimiter {
        Space,
        Comma,
        Point,
        ExclamationMark,
        QuestionMark,
        Dash,
        Colon,
        Semicolon,
        Ellipsis,
        Bracket,
        Newline
    };

    inline TextFragment(const QString& text, Delimiter delimiter) : _text(text), _delimiter(delimiter) {}

    const QString _text;
    const Delimiter _delimiter;
};

And here's the actual parsing:

const QString text = readText(device);

    struct Delimiter {
        QChar delimiterCharacter;
        TextFragment::Delimiter delimiterType;

        inline bool operator< (const Delimiter& other) const {
            return delimiterCharacter < other.delimiterCharacter;
        }
    };

    static const std::set<Delimiter> delimiters {
        {' ', TextFragment::Space},
        {'.', TextFragment::Point},
        {':', TextFragment::Colon},
        {';', TextFragment::Semicolon},
        {',', TextFragment::Comma},
        // TODO: dash should be ignored unless it has an adjacent space!
        {'-', TextFragment::Dash},
        // TODO:
        // {"...", TextFragment::Ellipsis},
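        // Non-ASCII characters are given as code points: a narrow literal such as
        // '…' is a multi-character constant when the source file is UTF-8.
        {QChar(0x22EF), TextFragment::Ellipsis}, // '⋯' midline horizontal ellipsis
        {QChar(0x2026), TextFragment::Ellipsis}, // '…' horizontal ellipsis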
        {'!', TextFragment::ExclamationMark},
        {'\n', TextFragment::Newline},
        {'?', TextFragment::QuestionMark},

        {')', TextFragment::Bracket},
        {'(', TextFragment::Bracket},
        {'[', TextFragment::Bracket},
        {']', TextFragment::Bracket},
        {'{', TextFragment::Bracket},
        {'}', TextFragment::Bracket}
    };

    std::vector<TextFragment> fragments;

    QString buffer;
    bool wordEnded = false;
    TextFragment::Delimiter lastDelimiter = TextFragment::Space;
    for (QChar ch: text)
    {
        if (ch == '\r')
            continue;

        const auto it = delimiters.find({ch, TextFragment::Space});
        if (it == delimiters.end()) // Not a delimiter
        {
            if (wordEnded) // This is the first letter of a new word
            {
                fragments.emplace_back(buffer, lastDelimiter);
                wordEnded = false;
                buffer = ch;
            }
            else
                buffer += ch;
        }
        else // This is a delimiter. Append it to the current word.
        {
            lastDelimiter = it->delimiterType;
            wordEnded = true;
            buffer += ch;
        }
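    }

    // Flush the last word: the loop only emits a fragment when the next word
    // starts, so the final buffer would otherwise be lost.
    if (!buffer.isEmpty())
        fragments.emplace_back(buffer, wordEnded ? lastDelimiter : TextFragment::Space);

    return fragments;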

Here we have a state machine that tries hard not to look like one, and is worse for it. It works. But it has a bigger problem than coding style: some delimiters are multi-character, and it just can't handle them. One example is an ellipsis consisting of three dots ("..."), which I want to tell apart from a single dot. Another example is that I want to distinguish between a hyphen and a dash. A hyphen separates parts of a compound word, e. g. "up-to-date", and as far as I'm concerned it's not a punctuation mark. A dash, on the other hand, is: "Joe — and his trusty mutt — was always welcome." Now, there is a special dash character, but in plain non-Unicode texts both are commonly represented by a hyphen ("-"). The only way I see to tell them apart is to treat only "space+hyphen+space" or "space+hyphen+other punctuation mark" as a dash.
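The three-dot and hyphen-versus-dash rules described above can be written as two small lookahead predicates. This is only a sketch under assumptions: it uses plain `std::string` instead of `QString`, it covers only the ASCII delimiters, and `isDelimiterChar`, `isEllipsisAt` and `isDashAt` are hypothetical helper names that are not part of the prototype.

```cpp
#include <cstddef>
#include <string>

// ASCII subset of the delimiter table (assumption: no Unicode marks here).
bool isDelimiterChar(char c) {
    static const std::string chars = " .,:;-!?\n()[]{}";
    return chars.find(c) != std::string::npos;
}

// True if three consecutive dots start at position i.
bool isEllipsisAt(const std::string& text, std::size_t i) {
    return i < text.size() && text.compare(i, 3, "...") == 0;
}

// True if the '-' at position i is a dash rather than a hyphen:
// it must be preceded by a space and followed by a space or another
// punctuation mark.
bool isDashAt(const std::string& text, std::size_t i) {
    if (i >= text.size() || text[i] != '-')
        return false;
    const bool spaceBefore = i > 0 && text[i - 1] == ' ';
    const bool delimiterAfter =
        i + 1 < text.size() && isDelimiterChar(text[i + 1]);
    return spaceBefore && delimiterAfter;
}
```

With these, `isDashAt("Joe - and", 4)` holds, while the hyphen at index 2 of "up-to-date" is rejected because a letter precedes it.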

Example: the sentence

If only he had tried... well, it doesn't matter now.

should result in

{"If ", Space},
{"only ", Space},
{"he ", Space},
{"had ", Space},
{"tried... ", Ellipsis},
{"well, ", Comma},
{"it ", Space},
{"doesn't ", Space},
{"matter ", Space},
{"now.", Point}

The way I see it, I need some sort of exhaustive parsing instead of my greedy prototype (and, naturally, separators themselves should be represented by strings, not characters). What's the simplest way to do that? Can I use regular expressions for this (I'm terrible with them)?
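As for regular expressions, one possible shape is sketched below. It is not a definitive implementation. It assumes ASCII text in a `std::string` and uses `std::regex` from the standard library, which matches `char` by `char`, so the Unicode ellipsis characters are left out. `Fragment`, `classify` and `split` are hypothetical names. Two further assumptions: a space never overrides a real punctuation mark (which is what the expected output above implies), and a newline outranks only a space.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

enum class Delimiter { Space, Comma, Point, ExclamationMark, QuestionMark,
                       Dash, Colon, Semicolon, Ellipsis, Bracket, Newline };

struct Fragment {
    std::string text;     // the word plus its trailing delimiter run
    Delimiter delimiter;  // the last significant mark in that run
};

// Classifies a run of delimiter characters such as ", " or "... ".
Delimiter classify(const std::string& run) {
    Delimiter result = Delimiter::Space;
    for (std::size_t i = 0; i < run.size(); ++i) {
        switch (run[i]) {
        case '.':
            if (run.compare(i, 3, "...") == 0) {  // three dots: one ellipsis
                result = Delimiter::Ellipsis;
                i += 2;
            } else {
                result = Delimiter::Point;
            }
            break;
        case ',': result = Delimiter::Comma; break;
        case ':': result = Delimiter::Colon; break;
        case ';': result = Delimiter::Semicolon; break;
        case '!': result = Delimiter::ExclamationMark; break;
        case '?': result = Delimiter::QuestionMark; break;
        case '(': case ')': case '[': case ']': case '{': case '}':
            result = Delimiter::Bracket;
            break;
        case '-':
            // A dash needs a space before it and another delimiter after it.
            if (i > 0 && run[i - 1] == ' ' && i + 1 < run.size())
                result = Delimiter::Dash;
            break;
        case '\n':
            if (result == Delimiter::Space)
                result = Delimiter::Newline;
            break;
        default:
            break;  // a space never overrides an earlier mark
        }
    }
    return result;
}

std::vector<Fragment> split(const std::string& text) {
    // Group 1: a word, optionally with inner hyphens ("up-to-date").
    // Group 2: every delimiter character that follows the word.
    static const std::regex pattern(
        R"re(([^ \n.,:;!?(){}\[\]-]+(?:-[^ \n.,:;!?(){}\[\]-]+)*)([ \n.,:;!?(){}\[\]-]*))re");
    std::vector<Fragment> fragments;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it)
        fragments.push_back({it->str(0), classify(it->str(2))});
    return fragments;
}
```

For the example sentence this yields ten fragments, with "tried... " classified as an ellipsis and "well, " as a comma. Punctuation that appears before the first word is skipped rather than attached to anything, which may or may not be what is wanted.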

If the goal is correct parsing of natural English no matter how it uses punctuation, that's an unsolved research-level problem that entire books are written about, and it would be considered "too broad" here. If you can nail down your exact requirements (concrete examples of all the cases you do want your parser to handle would be a good start), then we could answer software design questions like whether or not regular expressions are capable of solving the problem.
– Ixrec, Commented May 18, 2016 at 8:44
@Ixrec: it's not a research problem. I know exactly how the algorithm should behave, I just don't know how to implement it in a non-awkward way. I was going to include examples but forgot about it as the post grew longer. Let me fix that.
– Violet Giraffe, Commented May 18, 2016 at 8:47

gbjbaanb · Accepted Answer · 2016-05-18 08:54:44Z

2

A hyphen is punctuation. Consider the text "we need to fix this up - to date we haven't bothered" and "fix the errors up-to and including Tuesday". Relying on spaces won't help you with sloppy typists.

Typically, though, you handle your text a single character at a time, and once you locate the start of a multi-character punctuation mark, you process the subsequent text at that point. E.g. when you find a '.' you read ahead to determine if the next 2 characters are also '.', in which case you combine the 3 into a single 'ellipsis' punctuation mark. The problem becomes one of reading ahead into the stream and, if you do not consume the subsequent characters, putting them back into the stream for the main processing loop to work with. This kind of problem is why stream buffers have functions such as putback() and peek().
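A minimal sketch of that read-ahead, assuming the text comes from a standard `std::istream` rather than a Qt device. `Mark`, `readDotOrEllipsis` and `countEllipses` are hypothetical names used only for illustration. Because `peek()` looks just one character ahead, the second dot has to be consumed tentatively and handed back with `unget()` if no third dot follows.

```cpp
#include <istream>
#include <sstream>
#include <string>

enum class Mark { Point, Ellipsis };

// Call right after a '.' has been read from `in`.
// Consumes two further dots if both are present and reports an ellipsis;
// otherwise leaves the stream position unchanged and reports a point.
Mark readDotOrEllipsis(std::istream& in) {
    if (in.peek() != '.')
        return Mark::Point;
    in.get();                 // tentatively consume the second dot
    if (in.peek() == '.') {
        in.get();             // third dot: this is an ellipsis
        return Mark::Ellipsis;
    }
    in.unget();               // only two dots: give the second one back
    return Mark::Point;
}

// Example driver: counts the three-dot ellipses in a string.
int countEllipses(const std::string& text) {
    std::istringstream in(text);
    int count = 0;
    for (char c; in.get(c); )
        if (c == '.' && readDotOrEllipsis(in) == Mark::Ellipsis)
            ++count;
    return count;
}
```

The main loop stays a plain character-by-character read; only the dot handling knows about lookahead, which keeps the index arithmetic out of the caller.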


You're right, but I don't have a big problem with the two example sentences. More importantly, the implementation you've offered is exactly the obvious, head-on, non-elegant kind I was hoping to avoid.

– Violet Giraffe, Commented May 18, 2016 at 8:59

@VioletGiraffe as someone who's worked in the industry for some time, and had to maintain a lot of code, head-on and obvious is far too under-rated.

– gbjbaanb, Commented May 18, 2016 at 9:02

Unfortunately, obvious doesn't always mean simple or easy to understand and maintain. In this case it's quite the opposite. A forgotten - 1 in one of the index calculations could provide hours of debugging joy.

– Violet Giraffe, Commented May 18, 2016 at 9:04

@VioletGiraffe trust me, it does! (or at least, if properly isolated in its own functions/modules, then you have a fighting chance if the code is straight-forward even if that means more verbose. You can read through it and figure it out with just a little patience). "Elegant" usually means more to the person originally writing it, and "WTF" to the person maintaining it. The elegant code I've seen that was easy to maintain was liberally annotated with comments.

– gbjbaanb, Commented May 18, 2016 at 9:08

I'm not saying it can't be done, I'm only saying it's the worst way of doing it that I can think of. I was really hoping to use regex instead which would do all the actual tedious work for me.

– Violet Giraffe, Commented May 18, 2016 at 9:11
