Assuming your graph is sufficiently large, you want to perform edge contraction until each supernode represents a large enough unit of work to amortize your parallel overhead. Then process the graph as before, but assign each thread a group of vertices to process rather than a single one at a time.
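As a rough sketch of the coarsening step (names like `coarsen`, `min_work`, and the unit node weights are my own assumptions, not from your problem): contract an edge only while both endpoint supernodes are still below the work threshold, using a union-find structure to track the merged groups.

```python
class DSU:
    """Union-find tracking the accumulated work weight of each supernode."""

    def __init__(self, nodes):
        self.parent = {v: v for v in nodes}
        self.weight = {v: 1 for v in nodes}  # assume 1 unit of work per vertex

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def union(self, ra, rb):
        # ra, rb must be roots; merge smaller weight into larger
        if self.weight[ra] < self.weight[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.weight[ra] += self.weight[rb]


def coarsen(edges, nodes, min_work):
    """Contract edges until each supernode carries at least min_work units
    (where the graph allows it). Returns the vertex groups, one per thread."""
    dsu = DSU(nodes)
    for u, v in edges:
        ru, rv = dsu.find(u), dsu.find(v)
        # contract only while both sides are still undersized, which caps
        # group size at roughly 2 * min_work
        if ru != rv and dsu.weight[ru] < min_work and dsu.weight[rv] < min_work:
            dsu.union(ru, rv)
    groups = {}
    for v in nodes:
        groups.setdefault(dsu.find(v), []).append(v)
    return list(groups.values())
```

Each returned group then becomes one task for your thread pool. A real implementation would order the edges (e.g. heavy-edge matching) rather than take them as given, but the threshold logic is the same.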
The reason you need the graph to be large is that edge contraction takes linear time, and it sounds like your problem is also linear time (with a similar constant factor). But since edge contraction parallelizes well, you should be able to use it to achieve near-linear speedup, making the parallel version faster on sufficiently large graphs.
This becomes quite similar to the graph partitioning problem (of which edge contraction is often a step). There exist several parallel graph partitioning packages which scale relatively well: