3

I am writing some code to generate call graphs for a particular intermediate representation without executing it by statically scanning the IR code. The IR code itself is not too complex and I have a good understanding of what function call sequences look like so all I need to do is trace the calls. I am currently doing it the obvious way:

  • Keep track of where we are
  • If we encounter a function call, branch to that location, execute and come back
  • While branching put an edge between the caller and the callee

I am satisfied with where I am getting at but I want to make sure that I am not reinventing the wheel here and face corner cases. I am wondering if there are any accepted good algorithms (and/or design patterns) that do this efficiently?

UPDATE: The IR code is a byte-code disassembly from a homebrewn Java-like language and looks like the Jasmine specification.

4
  • Could you say a bit more about the IR that you want to analyze? What does it look like, and what its features related to binding and function calls? Commented Mar 6, 2011 at 17:16
  • @phooji: Sure. It is an IR generated by byte-code disassembly produced by JReveal Jasmine disassembler. It does not seem to describe anything about re-binding but I will post back if I find anything new. But I can definitely say there are no pointers because it is disassembling a Java programs. Commented Mar 6, 2011 at 21:49
  • I have taken a quick look at the Jasmine bytecode spec. The only thing that immediately caught my attention is the virtual method invocation; see updated answer below. Commented Mar 6, 2011 at 22:31
  • @phooji: Thank you for your time. I just accepted your answer and added a comment. Commented Mar 6, 2011 at 22:41

2 Answers 2

4

From an academic perspective, here are some considerations:

  • Do you care about being conservative / correct? For example, suppose the code you're analyzing contains a call through a function pointer. If you're just generating documentation, then it's not necessary to deal with this. If you're doing a code optimization that might go wrong, you will need to assume that 'call through pointer' means 'could be anything.'

  • Beware of exceptional execution paths. Your IR may or may not abstract this away from you, but keep in mind that many operations can throw both language-level exceptions as well as hardware interrupts. Again, it depends on what you want to do with the call graph later.

  • Consider how you'll deal with cycles (e.g. recursion, mutual recursion). This may affect how you write code for traversing the graphs later on (i.e., they will need some sort of 'visited' set to avoid traversing cycles forever).

Cheers.

Update March 6:

Based on extra information added to the original post:

  • Be careful about virtual method invocations. Keep in mind that, in general, it is unknowable which method will execute. You may have to assume that the call will go to any of the subclasses of a particular class. The standard example goes a bit like this: suppose you have an ArrayList<A>, and you have class B extends A. Based on a random number generator, you will add instances of A and B to the list. Now you call x.foo() for all x in the list, where foo() is a virtual method in A with an override in B. So, by just looking at the source code, there is no way of knowing whether the loop calls A.foo, B.foo, or both at run time.
Sign up to request clarification or add additional context in comments.

6 Comments

OP also has to be aware of run-time binding of symbols; he may not get a true picture of the actual call graph. Many langauges, including Python I think, will (re)bind a symbol when new code is loaded. So a static analysis of a call in function F to X interpreted by itself might not reflect that X is bound to a hairball later (or worse, bound to something that eventually calls F again).
@Ira Baxter: Great point. I'm not sure what language being analyzed looks like; I will ask the OP for clarification.
@phooji: +1 for the tips. I have accepted this as an answer. If you have some time, could you please elaborate a little more on virtual method invocations? There are three types currently in the code: invoke-static, invoke-direct and invoke-virtual. While I understood the part where you mentioned the random number generator example, I am still a bit unclear on when byte-code is diassembled to invoke-virtual and how dynamic re-binding helps. If all I am interested is constructing approximate call graphs, could you suggest any steps that I should be taking to solve the problem partially?
@Legend: Thanks for the upvote. I'm not familiar with the details of the JVM (or Jasmin)specification. However, if you can do a full roundtrip (compile-disassemble), then you can probably find out by doing some experimentation. Here is my guess for the foo virtual method if your code looks like A x = new B(); x.foo(): "push object ref x; push args (none); invokevirtual A.foo()." Note that, if this is indeed the case, then you will have to mimick virtual method lookups in your analysis code.
(Oh and 'dynamic re-binding' is just another example of a situation in which a (variable, method, class, module) name can be modified at run time.)
|
2

I don't know the algorithm, but pycallgraph does a decent job. It is worth checking out the source for it. It is not long and should be good for checking out existing design patterns.

1 Comment

+1 Thank you. Sorry about not clarifying properly. I updated my question. I came across pycallgraph but I do not want to execute the code but like to generate the call graph by parsing my intermediate code.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.