Parse C++ and control flow

Question

There is a huge C++ project that is built using CMake and gcc 4.2.3. The application employs multiple processes.

The end goal is to make a list of all error messages that could ever be written to the log file. Information and debug messages are also written to this file.

I found that in some main.cpp (file where everything starts) there is a catch expression where writing to the file occurs. So I need to find throw expressions that satisfy the following criteria:

The throw expression uses one of certain error types (e.g. runtime_error, logic_error, etc.).
There is no other catch in the stack between the catch located in main.cpp and the throw expression. If there is a catch, it may append additional info (which is important) and rethrow. Moreover, it may rethrow using a different error type or even be silent.
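
To make the second criterion concrete, here is a minimal sketch of an intermediate catch that appends context and rethrows as a different type. All function names and the file name are hypothetical and are not taken from the project; `run_and_capture_log` merely stands in for the catch in main.cpp.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical low-level function: the original throw site.
void load_config(const std::string& path) {
    throw std::runtime_error("cannot open " + path);
}

// Hypothetical intermediate layer: catches, appends context, and rethrows
// as a different type, so the logged text differs from the throw site text.
void init_module() {
    try {
        load_config("module.cfg");
    } catch (const std::runtime_error& e) {
        throw std::logic_error(std::string("init_module failed: ") + e.what());
    }
}

// Stand-in for the catch in main.cpp: returns the text that would be logged,
// or an empty string if nothing was thrown.
std::string run_and_capture_log() {
    try {
        init_module();
    } catch (const std::exception& e) {
        return e.what();
    }
    return "";
}
```

Here the logged message is assembled from two places and its type changes on the way up, so listing throw sites alone would not reproduce the final text.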

The project is very big and it is difficult to tell whether this part of code will ever be executed in this build. Some builds are using certain libs and others don't.

Maybe I'm wrong with the approach, but I think that the solution is a 2-step process:

Parse all C++ code as compiler sees it (to make sure throw isn't in comment section, isn't a macro, etc.)

Find all throw expressions in the compiled tree and emulate throwing. In fact, I see a problem here because conditions may be really involved, for example:

string error_msg;
enum Condition condition;  // assigned elsewhere at run time
switch (condition)
{
   case CONDITION1: error_msg = "sadasda"; break;
   case CONDITION2: error_msg = "sadasds1111a"; break;
   case CONDITION3: error_msg = "sasdasadasda"; break;
   default: error_msg = "sadasda"; break;
}
throw logic_error(error_msg);

Maybe it's all wrong and a different approach should be taken. I would be glad to see your advice.

Parsing C++... not an easy task to do correctly and completely ("as compiler sees it").
– crashmstr, Commented Nov 25, 2013 at 17:55
Maybe looking at the bigger picture would let us give you a better answer: What do you need that list of error messages for? Does every error message in that list really have to be printed, or could there be unused ones in there? Are you planning to change the program (e.g. localization) or is this just informational? Are there other sources of documentation (e.g. design/requirements documents, original developer) available? I don't think parsing the program is the right thing to do. Grepping for strings and manually filtering might be more feasible.
– Fozi, Commented Nov 25, 2013 at 19:38
That is the biggest picture I have. There is a client that purchased the software and demands a list of all possible errors that may appear in the log file.
– Egor Okhterov, Commented Nov 26, 2013 at 14:25
@Pixar: you really should remove the "regex" tag; regex is completely useless for this task.
– Ira Baxter, Commented Jul 26, 2015 at 15:59

Antoine · Accepted Answer · 2013-11-25 18:30:58Z

Writing a valid C++ parser is indeed a daunting task, to say the least, and probably not the fastest way to get where you want.

Basically, what you want is to reuse an existing parser for your purposes, which is not easy either. You'd need to research various compiler plugins and static analysis tools. For example, the clang static analyzer seems (relatively!) easily extensible. Perhaps a simpler way would be to use an existing static C++ analyzer, like lint, and detect uncaught exceptions. Then you modify your main to stop catching the exceptions you're interested in and have a look at the list of uncaught exceptions. You're far from done, but you can start working from there. C++ lint is not free software, but AFAIK free alternatives (cppcheck, clang analyzer) don't have advanced exception analysis. Coverity could also be of interest; they have scripts and/or an SDK for writing extensions.

Another way would be to leak memory on purpose in your exception objects, and any good static analyzer will find the source of the leak at the point where the exception object was created, i.e. the throw site and maybe even points where you add info to the exception. I don't know if this is realistic with your code, but in this setup, I think free analyzers could work.

Anyway, I wish you luck, working with large codebases is never easy ;)

edited Nov 25, 2013 at 18:30

answered Nov 25, 2013 at 18:23


2 Comments

Egor Okhterov Over a year ago

Thank you. At least now I have proved to myself that this task is as hard as I thought it to be :).

Antoine Over a year ago

You're welcome (don't forget to upvote ^^). I still think the static analysis route is realistic, but yes, it's not gonna be easy ;)

Ira Baxter · Accepted Answer · 2015-07-26 17:30:07Z

(Responding long after the question; a recent edit to that question popped this question to visibility for me).

OP has it basically right; you need a compiler-accurate parse of the source code, and you need to track the throw sequences to see what they do.

In fact, you need compiler-accurate parses of all the compilation units involved in the project, and you'll need all of them at once to navigate from one compilation unit to another to track the throws. This means using a conventional compiler front end isn't the right starting place; those only parse one compilation unit at a time, and you need all of them at once.

Then there's the bit about tracing the "throws". You need the control flow within each function/method to follow throws within the method, and then you need to track throws across method calls. For the latter, you need an accurate call graph. A standard compiler might give the intra-method control flow, but it won't compute a global call graph.

To get a good call graph, you need to resolve explicit calls from foo to bar, and you need to determine, for indirect calls through pointers, which methods/functions are possible targets of the call. You need to determine the same thing for polymorphic method calls (a special case of indirect calls). So you need a points-to analyzer.
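
As a rough illustration of why indirect calls complicate the call graph, the sketch below shows one polymorphic call and one call through a function pointer. All types and functions are hypothetical; whether an exception can escape depends entirely on which targets can reach the call site, which is the question a points-to analysis has to answer.

```cpp
#include <stdexcept>
#include <string>

struct Handler {
    virtual ~Handler() {}
    virtual void handle() = 0;
};
struct QuietHandler : Handler {
    void handle() {}  // never throws
};
struct StrictHandler : Handler {
    void handle() { throw std::runtime_error("strict handler rejected input"); }
};

// Polymorphic call: statically this is Handler::handle, so a call graph
// has to consider every override as a possible target.
void dispatch(Handler& h) { h.handle(); }

void noop() {}
void fail() { throw std::logic_error("callback failed"); }

// Indirect call through a function pointer: the possible targets depend on
// which function addresses can flow into `cb`.
void invoke(void (*cb)()) { cb(); }

// Helpers that report the message of whatever escapes, or "" if nothing does.
std::string message_from_dispatch(Handler& h) {
    try { dispatch(h); } catch (const std::exception& e) { return e.what(); }
    return "";
}
std::string message_from_invoke(void (*cb)()) {
    try { invoke(cb); } catch (const std::exception& e) { return e.what(); }
    return "";
}
```

The same `dispatch` and `invoke` call sites throw or stay silent depending on the argument, so a text search at the call site cannot tell which throw expressions are reachable.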

With local control flow and an accurate call graph, you can now find each initial throw and track ("simulate") it from the throw site through the catch chains to see if it ultimately arrives at main (or at least at a call to a logging function). The throw-catch-test-rethrow pattern is fairly straightforward to track; you'll have trouble with complex catch clauses containing a lot of logic that eventually rethrows, where it is hard to determine the actual rethrown exception or even whether something gets rethrown. Welcome to static analysis and the Turing tar pit.
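
A toy version of this tracking step might look like the sketch below (C++11). It assumes the call graph is already available as a caller map and that each function is simply marked as swallowing or not; both are simplifications invented here, and the sketch ignores exception types, conditional rethrows, and message rewriting, which is where the real difficulty lies.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Toy model (all names hypothetical): who calls each function, and which
// functions contain a catch that stops exceptions coming from their callees.
struct CallGraph {
    std::map<std::string, std::vector<std::string>> callers;  // callee -> callers
    std::set<std::string> swallows;  // functions whose catch ends propagation
};

// Returns true if an exception thrown in `throw_site` can propagate to
// `entry` along some caller chain that avoids every swallowing function.
bool reaches(const CallGraph& g, const std::string& throw_site,
             const std::string& entry) {
    std::set<std::string> seen;
    std::queue<std::string> work;
    work.push(throw_site);
    seen.insert(throw_site);
    while (!work.empty()) {
        std::string f = work.front();
        work.pop();
        if (f == entry) return true;
        auto it = g.callers.find(f);
        if (it == g.callers.end()) continue;  // no known callers
        for (const std::string& caller : it->second) {
            if (g.swallows.count(caller)) continue;       // exception stops here
            if (seen.insert(caller).second) work.push(caller);  // visit once
        }
    }
    return false;
}
```

Treating every non-swallowing caller as a pass-through over-approximates what can reach `entry`; a real tool would refine this with the control flow inside each catch clause.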

So in fact you need a tool that is designed to do these things as well as they can be done.

Alas, I know of no tool as of this moment that will do all of that nicely, off the shelf, and I try to keep track of such things. (This is generally true of any specific static analysis somebody might want). So the question becomes, where do you get infrastructure that will let you accomplish this task as a custom job?

Clang can provide some of this; it will certainly parse and build ASTs for C++. After firing up LLVM, you will have intra-method control flow analysis. I think Clang can be configured to parse multiple compilation units, so that's a big step up from what using a compiler will offer you. I don't know what Clang offers for doing points-to analysis or building call graphs. You'll have to fill that in, and build custom code for "simulating" the throws.

Our DMS Software Reengineering Toolkit, used for program analysis and transformation, could be used for this. DMS can also parse full C++ in a compiler accurate way, and is designed to parse/process multiple compilation units simultaneously.

DMS does produce intra-method control flow analysis, and it has intra-method data flow analysis. We presently don't have points-to analysis for C++, but DMS does have both points-to analysis and call-graph construction for C that could be pressed into service. That machinery has been tested on applications with 15,000 (not a typo) compilation units in one image, having some 50,000 functions and indirect calls tangled across all of this. (If Clang doesn't have this kind of machinery already, this is a huge difference in starting places). With that, you then get to build the throw simulation on top.

Having considered all this, my guess is the work to do the above for Clang and/or DMS is significant. If your application is less than a million lines, I'd expect you would get done faster (if more sloppily) by just hunting for throw clauses using grep and hand-tracing them through the code yourself. You said your application was huge; it is hard to tell what that actually means without specific numbers. These tools work really well at scale, but aren't worth the effort when your problem is small. [What is interesting is that the boundary for "small" moves over time, as the tools get better.]
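
For the grep-style route, a first pass could be as simple as the sketch below. The function name is made up for illustration, and the scan is deliberately naive: it only strips `//` comments, so block comments, string literals, macros, and `throw()` exception specifications would still need manual filtering.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// True for characters that can appear inside a C++ identifier.
static bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Returns the 1-based numbers of lines whose code part (the text before any
// `//`) contains `throw` as a whole word.
std::vector<int> find_throw_lines(const std::string& source) {
    std::vector<int> hits;
    std::istringstream in(source);
    std::string line;
    int number = 0;
    while (std::getline(in, line)) {
        ++number;
        // Drop a trailing line comment; substr keeps the whole line if none.
        std::string code = line.substr(0, line.find("//"));
        std::string::size_type pos = code.find("throw");
        while (pos != std::string::npos) {
            std::string::size_type end = pos + 5;  // length of "throw"
            bool left_ok = pos == 0 || !is_ident_char(code[pos - 1]);
            bool right_ok = end >= code.size() || !is_ident_char(code[end]);
            if (left_ok && right_ok) {
                hits.push_back(number);
                break;  // one hit per line is enough
            }
            pos = code.find("throw", end);
        }
    }
    return hits;
}
```

Each reported line is only a candidate; the hand-tracing through callers and intermediate catch clauses still has to be done for every hit.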

Looking through the list of my old questions, I found this one with fewer than 5 tags, so I decided to complete it :) I managed to create the list of exceptions the same week I posted the question. I took the grep approach and got a bearable number of exception sources :) Although my previous job with logging exceptions is done, your answer is still valuable and I can use this info in the future.
