regex with only numbers in a string c++

I stand corrected; I was using Visual Studio 2013 when I last tested lookarounds. It appears std::regex's ECMAScript grammar does support lookaheads (though not lookbehinds). However, I'd still make the case that lookarounds are among the most expensive regex operations. They should be avoided unless absolutely necessary, which they aren't here.

In this case, following this logic, the look-ahead is a must. You can't match the numbers in <SPACE>41<SPACE>31<SPACE> without a look-ahead.
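To illustrate the point about <SPACE>41<SPACE>31<SPACE>, here is a minimal sketch using std::regex. The helper names `captures`, `consuming` and `lookahead` are made up for this example; the two patterns are simplified stand-ins, not the regex from any answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect capture group 1 of every match of `pattern` in `text`.
std::vector<std::string> captures(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back((*it)[1].str());
    return out;
}

// The trailing \s is consumed, so the next number has no leading space left to match.
std::vector<std::string> consuming() { return captures(" 41 31 ", R"(\s(\d+)\s)"); }

// The lookahead only checks for the trailing space and leaves it for the next match.
std::vector<std::string> lookahead() { return captures(" 41 31 ", R"(\s(\d+)(?=\s))"); }
```

With the consuming pattern only 41 is found; with the lookahead both 41 and 31 are found.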

@JonathanMee: Please have a look at your results - your regex does not match the expected 31.

@user3641602: Glad it works for you, please consider accepting the answer.

It is a raw string literal. The notation is R"()". Inside the parentheses, \ symbol means a literal \ symbol, not a C-escaping symbol.
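A small sketch of what that means in practice: the same pattern written as an ordinary literal and as a raw string literal. The names `escaped`, `raw` and `isDecimal` are invented for this illustration.

```cpp
#include <regex>
#include <string>

// The same pattern spelled two ways.
const std::string escaped = "\\d+\\.\\d+";  // ordinary literal: every backslash is doubled
const std::string raw     = R"(\d+\.\d+)";  // raw literal: backslashes are taken as written

// Hypothetical helper: true if `s` is digits, a dot, then digits.
bool isDecimal(const std::string& s) {
    return std::regex_match(s, std::regex(raw));
}
```

Both strings hold the identical characters, so either one can be passed to std::regex.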


Mayur Koshti · Accepted Answer · 2015-11-04 12:12:30Z

1

You need this regex:

(?<!,)\b([\d\.]+)\b(?!,)

edited Nov 4, 2015 at 12:12

answered Nov 4, 2015 at 11:47

Mayur Koshti


14 Comments

Thanks! But with your regex I print token . token , token .

Mayur Koshti Over a year ago

@user3641602 This will match 1.2.3... Do you want to enforce correct numbering on your number?

Then you just need to modify the current regex: \b([\d\.]+)\b

This has the bugs of your original regex (it captures 1.2.3...), and it now also picks up the need for Boost, like the regex by @KarolyHorvath.

@user2079303 Correction, I meant to type that this will pull from any symbol other than a comma: "12#3" for example will capture 12 and 3.
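The behaviour discussed in these comments can be checked with a short sketch. It uses the simplified regex \b([\d\.]+)\b from the comment above; `findAll` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: every match of \b([\d\.]+)\b in `text`.
std::vector<std::string> findAll(const std::string& text) {
    static const std::regex re(R"(\b([\d\.]+)\b)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back((*it)[1].str());
    return out;
}
```

For "1.2.3" this returns the single token 1.2.3, and for "12#3" it returns 12 and 3, because \b treats any non-word character as a boundary.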


Community · Accepted Answer · 2017-05-23 11:59:23Z

1

As stribizhev states, this can only be accomplished with lookarounds, since the single whitespace separating two numbers would otherwise need to be consumed twice: once as the end of the number before it and once as the start of the number after it.

user2079303 proposes a viable alternative to regexes, which can be simplified to the point where it rivals the simplicity of a regex:

istringstream iss(" li 12.12 si 43,23 45 31 uf 889 uf31 3.12345"); // named stream: istream_iterator cannot bind to a temporary

for_each(istream_iterator<string>(iss),
         istream_iterator<string>(),
         [](const string& i) {
            char* it;
            double num = strtod(i.c_str(), &it);
            // Print the token only if strtod consumed all of it
            if (*it == '\0') cout << num << endl; });

However it is possible to accomplish this without the weight of an istringstream or a regex, by simply using strtok:

char buffer[] = " li 12.12 si 43,23 45 31 uf 889 uf31 3.12345";

for (auto i = strtok(buffer, " \f\n\r\t\v"); i != nullptr; i = strtok(nullptr, " \f\n\r\t\v")) {
    char* it;
    double num = strtod(i, &it);

    if (*it == '\0') cout << num << endl;
}

Note that for the delimiter argument I'm simply using the characters that isspace treats as whitespace in the default "C" locale.

edited May 23, 2017 at 11:59

CommunityBot


answered Nov 4, 2015 at 12:02

Jonathan Mee


4 Comments

No need for escaping if a raw string literal is used. 31 is not matched, BTW.

@KarolyHorvath Wrong, notice those are non-capturing parentheses.

eerorika Over a year ago

+1 Thanks for the simplified use of second parameter of strtod. Took me a while to understand the documentation.

@user2079303 It seems we're the only ones on board with the strtod :( Ah well, if you're looking for a better explanation of how to use it you might want to check out: stackoverflow.com/q/32991193/2642059

eerorika · Accepted Answer · 2015-11-04 15:06:10Z

1

Regexes are usually unreadable and hard to prove correct. Regexes matching only valid rational numbers need to be intricate and are easy to mess up. Therefore, I propose an alternative approach. Instead of regexes, tokenize your string with C++ and use std::strtod to test whether each token is a valid number. Here is example code:

std::vector<std::string> split(const std::string& str) {
    std::istringstream iss(str);
    return {
        std::istream_iterator<std::string>{iss},
        std::istream_iterator<std::string>{}
    };
}

bool isValidNumber(const std::string& str) {
    char* end;
    std::strtod(str.c_str(), &end);
    // Valid only if something was parsed and the whole token was consumed
    return end != str.c_str() && *end == '\0';
}

// ...
auto tokens = split(" li 12.12 si 43,23 45 31 uf 889 uf31 3.12345");
std::vector<std::string> matches;
std::copy_if(tokens.begin(), tokens.end(), std::back_inserter(matches), isValidNumber);
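One caveat worth knowing with the strtod approach: std::strtod accepts more than plain decimals, for example exponent notation such as 1e3 and the text inf. The sketch below contrasts a strtod-based check with a stricter hand-written one. Both function names, `parsesFully` and `isPlainDecimal`, are hypothetical and not part of the answer above.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// True if strtod parses something and consumes the whole token.
bool parsesFully(const std::string& token) {
    char* end = nullptr;
    std::strtod(token.c_str(), &end);
    return end != token.c_str() && *end == '\0';
}

// Stricter check (an assumption about what "only numbers" should mean):
// optional sign, digits, at most one dot, and at least one digit overall.
bool isPlainDecimal(const std::string& s) {
    std::size_t i = 0;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
    std::size_t digits = 0;
    bool seenDot = false;
    for (; i < s.size(); ++i) {
        if (std::isdigit(static_cast<unsigned char>(s[i]))) ++digits;
        else if (s[i] == '.' && !seenDot) seenDot = true;
        else return false;
    }
    return digits > 0;
}
```

If tokens like 1e3 or inf should not count as numbers in your input, a check along the lines of `isPlainDecimal` may be the safer filter.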

edited Nov 4, 2015 at 15:06

answered Nov 4, 2015 at 13:00

eerorika


3 Comments

You beat me to the use of strtod +1

Yes, that's a possible way, but I already have a solution to the problem. I would like to shorten my code with a regex, because a regex is a powerful tool to have at hand! :) But, as you mentioned before, "Regexes are usually unreadable and hard to prove correct." :)

@user3641602 His solution is I believe a simpler one than the regex solution in the first place. I've streamlined his code in one of the options I provide in my answer: stackoverflow.com/a/33521413/2642059

Karoly Horvath · Accepted Answer · 2015-11-04 11:55:08Z

0

Use negative lookahead and lookbehind to assert that there are no funny characters on either side of the number:

(?<![^\\s])(\\+|-)?[0-9]+(\\.[0-9]*)?(?![^\\s])

Unfortunately you're going to need Boost.Regex for the task, as the built-in std::regex does not support lookbehind (it does support lookahead).
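For readers who want to stay with std::regex, here is a hedged sketch. `compiles` checks whether a pattern is accepted at all, and `extractNumbers` is one possible rewrite that replaces the lookbehind with a consumed (?:^|\s) prefix while keeping a lookahead. Both names and the rewritten pattern are assumptions made for this example, not the answer's regex.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: does std::regex accept this pattern?
bool compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Lookbehind replaced by a consumed (?:^|\s); the trailing check stays a lookahead
// so the separating space is still available to the next match.
std::vector<std::string> extractNumbers(const std::string& text) {
    static const std::regex re(R"((?:^|\s)([+-]?[0-9]+(?:\.[0-9]*)?)(?=\s|$))");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back((*it)[1].str());
    return out;
}
```

On the sample string from the question this should yield 12.12, 45, 31, 889 and 3.12345, skipping 43,23 and uf31.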

You're probably better off splitting the input into words and then using a simple regex on each word.

edited Nov 4, 2015 at 11:55

answered Nov 4, 2015 at 11:53

Karoly Horvath


8 Comments

Karoly Horvath Over a year ago

C++ doesn't support look aheads or look behinds

ATM I don't really see another way of doing it.

Just a note: [^\\s] is looking for characters that are not '\\' or 's'. What you actually meant was \S

@JonathanMee cplusplus.com/reference/regex/ECMAScript C++ supports lookahead.

I refuse and have always refused to use Boost. I prefer to use standard, for compatibility in team.


bobble bubble · Accepted Answer · 2015-11-04 12:33:23Z

0

You could play with a trick to consume stuff you don't want. Something like this.

(?:\d+,|[a-z]+)\d+|(\d+[.\d]*)

Modify it so that everything that should be excluded goes into the pipe-separated alternatives before the capture group, then grab the captures of the first group.

See demo at regex101. No idea if the (?: non-capturing group is OK for C++. Remove it if not.
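As a rough sketch of how this trick could look with std::regex (which does accept (?: groups): iterate over all matches and keep only those where group 1 took part. The function name `wantedNumbers` is made up for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: run the "consume what you don't want" regex and
// return only the matches in which capture group 1 participated.
std::vector<std::string> wantedNumbers(const std::string& text) {
    static const std::regex re(R"((?:\d+,|[a-z]+)\d+|(\d+[.\d]*))");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        // Group 1 is unmatched when the first alternative swallowed an unwanted token.
        if ((*it)[1].matched)
            out.push_back((*it)[1].str());
    }
    return out;
}
```

On the question's sample this keeps 12.12, 45, 31, 889 and 3.12345, while 43,23 and uf31 are consumed by the first alternative and dropped.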

edited Nov 4, 2015 at 12:33

answered Nov 4, 2015 at 12:26

bobble bubble


2 Comments

bobble bubble Over a year ago

Impressive way to think about it, but this will capture: "123abc" and "12#3" do you have a way to work around that?

@JonathanMee This approach only makes sense if the cases that can occur are known. For your samples you would have to add those cases like this.

Simon Kraemer · Accepted Answer · 2015-11-04 15:31:03Z

0

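Two attempts: the first searches the whole string with a single regex, the second splits the input into tokens and validates each token with a simpler regex.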

#include <string>
#include <iostream>
#include <regex>
#include <sstream>


int main()
{
    using namespace std;

    string buffer(" li 12.12 si 43,23 45 31 uf 889 uf31 3.12345 .5");

    regex num_regex("(^|\\s)([\\+-]?([0-9]+\\.?[0-9]*|\\.?[0-9]+))(\\s|$)");
    smatch num_match;
    while (regex_search(buffer, num_match, num_regex))
    {
        if (num_match.size() > 2) // full match + 4 capture groups = 5 sub-matches; index 2 is the number
        {
            //We only need the second group
            auto token = num_match[2].str();
            cout << token << endl;
        }

        buffer = num_match.suffix().str();
    }
    return 0;
}

#include <string>
#include <iostream>
#include <iterator>
#include <regex>
#include <sstream>
#include <vector>


int main()
{
    using namespace std;

    string buffer(" li 12.12 si 43,23 45 31 uf 889 uf31 3.12345 .5");

    istringstream iss(buffer);
    vector<string> tokens{ istream_iterator<string>{iss}, istream_iterator<string>{} };

    regex num_regex("^[\\+-]?([0-9]+\\.?[0-9]*|\\.?[0-9]+)$");
    for(auto token : tokens)
    {
        if (regex_search(token, num_regex))
        {
            //Valid entry
            cout << token << endl;
        }
    }

    return 0;
}

edited Nov 4, 2015 at 15:31

answered Nov 4, 2015 at 12:20

Simon Kraemer


2 Comments