1

I have a Gremlin query that applies multiple filters to a graph traversal, and I'm seeking advice on optimizing its performance. The query structure is as follows:

g.V()
    .hasLabel('vertex1_label')
    .has('Property1', eq(49))
    .has('Property2', eq(14))
    .as('vertex1')
    .out('edge_json')
    .as('vertexJson_label')
    .select('vertex_label')
    .project('properties', 'edges')
    .by(
        select('vertexJson_label').valueMap()
    )
    .by(
        coalesce(
            outE('edge_label').valueMap().fold(), 
            constant(null)
        )
    ).toList()

In my graph, "vertex1_label" has a large number of properties used for filtering, and it can be connected to other "vertex1_label" instances. Additionally, I introduced "vertexJson_label" to store only a JSON property containing the necessary data from "vertex1_label" to improve query performance. This was necessary because using valueMap() directly on "vertex1_label" was too slow, even when specifying the required properties as parameters. After implementing the JSON solution, the query time was reduced significantly, from 9 seconds to 2.4 seconds, to retrieve 10,000 vertices and 5,000 links from a total of 40,000 vertices in the graph. Also, "edge_label" contains some properties that has to be returned to user. Another condition, can be exist "vertex1_label" without connections.

I'm not an expert in Gremlin, and I'm looking for advice on whether there are ways to further improve query performance as the graph continues to grow. It seems that Amazon Neptune has a "Neptune lookup cache," but I'm not sure if it would benefit my query. Any suggestions or insights would be greatly appreciated.

I've already tried optimizing the query by introducing a 'vertexJson_label' to store a JSON property with the necessary data, which improved performance significantly. Initially, I expected the query to be faster by specifying the required properties with valueMap() directly, but it was slow.

With the JSON solution, the query time improved from 9 seconds to 2.4 seconds for retrieving 10,000 vertices and 5,000 links in a graph with 40,000 vertices. However, as the graph continues to grow, I'm interested in exploring further optimization options to ensure the query's efficiency.

I've heard about Amazon Neptune's 'Neptune lookup cache,' but I'm not sure if it's applicable or beneficial for my use case. Any insights, suggestions, or best practices for Gremlin query optimization in a large graph database would be greatly appreciated.

*******************************************************
                Neptune Gremlin Profile
*******************************************************

Query String
==================
g.V().hasLabel('Location').has('CorporateMember_Id', eq(49)).as('Location').out('LinkedProperty_Json').as('LocationProperty_Json').select('Location').project('properties', 'edges').by(select('LocationProperty_Json').valueMap()).by(coalesce(outE('Linked').valueMap().fold(), constant(null)))

Original Traversal
==================
[GraphStep(vertex,
    []), HasStep([~label.eq(Location), CorporateMember_Id.eq(49)
    ])@[Location
    ], VertexStep(OUT,
    [LinkedProperty_Json
    ],vertex)@[LocationProperty_Json
    ], SelectOneStep(last,Location,null), ProjectStep([properties, edges
    ],
    [
        [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
        ],
        [CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge), PropertyMapStep(value), FoldStep
                ],
                [ConstantStep(null)
                ]
            ])
        ]
    ])
]

Optimized Traversal
===================
Neptune steps: [
    NeptuneGraphQueryStep(Vertex)@[LocationProperty_Json
    ] {
        JoinGroupNode {
            PatternNode[(?1, <CorporateMember_Id>, ?13, ?) . project distinct ?1 . ContainsFilter(?13 in (49^^<SBYTE>,
                49^^<SHORT>,
                49^^<INT>,
                49^^<LONG>,
                49.0^^<FLOAT>,
                49.0^^<DOUBLE>)) .
            ],
            {estimatedCardinality=18048, expectedTotalInput=-1, expectedTotalOutput=18048, indexTime=0, joinTime=35, numSearches=1, actualTotalOutput=10574
            }
            PatternNode[(?1, <~label>, ?2=<Location>, <~>) . project ask .
            ],
            {estimatedCardinality=81004, expectedTotalInput=18048, joinRatio=1.0, expectedTotalOutput=18048, indexTime=12, joinTime=123, numSearches=11, actualTotalOutput=10574
            }
            PatternNode[(?1, ?5=<LinkedProperty_Json>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .
            ],
            {estimatedCardinality=75988, expectedTotalInput=18048, indexTime=19, joinTime=176, numSearches=11
            }
        },
        {path=[Vertex(?1):GraphStep@[Location
                ], Vertex(?3):VertexStep@[LocationProperty_Json
                ], Vertex(?1):SelectOneStep
            ], joinStats=true, optimizationTime=3, maxVarId=14, executionTime=2579
        }
    },
    NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: ProjectStep([properties, edges
],
[
    [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
    ],
    [CoalesceStep([
            [VertexStep(OUT,
                [Linked
                ],edge), PropertyMapStep(value), NeptuneMemoryTrackerStep, FoldStep
            ],
            [ConstantStep(null)
            ]
        ])
    ]
]),

WARNING: >> [ProjectStep([properties, edges
    ],
    [
        [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
        ],
        [CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge), PropertyMapStep(value), NeptuneMemoryTrackerStep, FoldStep
                ],
                [ConstantStep(null)
                ]
            ])
        ]
    ])
] << (or one of the children for each step) is not supported natively yet

Physical Pipeline
=================
NeptuneGraphQueryStep@[LocationProperty_Json
]
    |-- StartOp
    |-- JoinGroupOp@5344294c
        |-- DynamicJoinOp@54e43dbc(PipelineJoinOp(PatternNode[(?1, <CorporateMember_Id>, ?13, ?) . project distinct ?1 . ContainsFilter(?13 in (49^^<SBYTE>,
    49^^<SHORT>,
    49^^<INT>,
    49^^<LONG>,
    49.0^^<FLOAT>,
    49.0^^<DOUBLE>)) .
],
{estimatedCardinality=18048, expectedTotalInput=-1, expectedTotalOutput=18048
}))
        |-- SpoolerOp(1000, @54e43dbc, null)
        |-- DynamicJoinOp@4498cbbe(PipelineJoinOp(PatternNode[(?1, <~label>, ?2=<Location>, <~>) . project ask .
],
{estimatedCardinality=81004, expectedTotalInput=18048, joinRatio=1.0, expectedTotalOutput=18048
}))
        |-- SpoolerOp(1000, @4498cbbe, @54e43dbc)
        |-- DynamicJoinOp@7fe16b82(PipelineJoinOp(PatternNode[(?1, ?5=<LinkedProperty_Json>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .
],
{estimatedCardinality=75988, expectedTotalInput=18048
}))

Runtime (ms)
============
Query Execution: 2579.766
Serialization: 222.424

Traversal Metrics
=================
Step                                                               Count  Traversers       Time (ms)    % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneGraphQueryStep(Vertex)@[LocationProperty...                 10574       10574         400.334    15.79
NeptuneTraverserConverterStep                                      10574       10574          77.076     3.04
ProjectStep([properties, edges
    ],
    [
        [SelectOneStep...                 10574       10574        2057.792    81.17
  SelectOneStep(last,LocationProperty_Json,null)                   10574       10574          51.849
  PropertyMapStep(value)                                           10574       10574         639.459
  CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge),...                 10574       10574        1118.509
    VertexStep(OUT,
                    [Linked
                    ],edge)                                   5001        5001         662.831
    PropertyMapStep(value)                                          5001        5001         330.494
    NeptuneMemoryTrackerStep                                        5001        5001          19.325
    FoldStep                                                       10574       10574          29.690
    ConstantStep(null)                                                                         0.000
                                            >TOTAL                     -           -        2535.204        -

Predicates
==========
# of predicates: 85

Results
=======
Count: 10574
Response serializer: application/vnd.gremlin-v3.0+gryo
Response size (bytes): 5,
                    238,
                    081


Index Operations
================
Query execution:
    # of statement index ops: 26172
    # of unique statement index ops: 26172
    Duplication ratio: 1.0
    # of terms materialized: 1026
Serialization:
    # of statement index ops: 0
    # of terms materialized: 0
2
  • If you run the query against the Neptune Gremlin Profile API [1], what does it say in terms of query execution time versus serialization time? Anecdotally, what you're say leads me to suspect this is a serialization issue because you're pulling back associated properties for the resulting vertices and edges. But I just want to make sure before providing guidance [1] docs.aws.amazon.com/neptune/latest/userguide/… Commented Oct 26, 2023 at 22:06
  • @TaylorRiggan I have added all the result from Neptune Profile API Commented Oct 30, 2023 at 17:10

1 Answer 1

2

If you notice in the profile output:

WARNING: >> [ProjectStep([properties, edges
    ],
    [
        [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
        ],
        [CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge), PropertyMapStep(value), NeptuneMemoryTrackerStep, FoldStep
                ],
                [ConstantStep(null)
                ]
            ])
        ]
    ])
] << (or one of the children for each step) is not supported natively yet

This means that, starting with the project() step, everything after is not being executed using Neptune-specific query operators and falling back to using TinkerPop operators. It is often desirable to try to rewrite the query in a way that gets executed using 100% Neptune-specific operators. Sometimes it doesn't provide much in terms of performance, but often times it can.

The reason why it is falling back to TinkerPop is because something inside of the project() step is using a Gremlin step that isn't fully optimized by Neptune [1]. In this particular instance, it is the coalesce() step causing the issue.

If you don't need an explicit null in the results, I would suggest just getting rid of that and having the query return an empty list in the second by-modulator of the project. Maybe include the full valueMap with id and/or label to ensure for situations where you are getting an actual empty list versus an empty list due to the edges not having properties:

    .project('properties', 'edges')
    .by(
        select('vertexJson_label').valueMap()
    )
    .by(
        outE('edge_label').valueMap(true).fold()
    ).toList()

I did some testing with a sample dataset with 5,000 vertices with random incoming edges and found that by making the change above, it improved the query performance by over two-fold.

Note that coalesce() is converted to Neptune-specific operators (as of the time of this post) only when used via the "upsert pattern" for writing new data.

[1] https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-step-support.html

Sign up to request clarification or add additional context in comments.

7 Comments

thanks for your observation and for your time. I changed like you are suggesting. It improve. I am taking the value from Traversal Metric (Profile API), >TOTAL 2395.225 (with coalesce) >TOTAL 1975.002 (without coalesce) do you think that I can improve it more?
Do you need all of the results at the same time, or is your application taking these results and iterating over them? If the latter, you may consider not using toList() as your terminal step and instead using next(#) and streaming the results back to your application. You can do this if you're connecting over websockets. This would allow your application to process initial results returned by the server while remaining results are still being found.
I am using Gremlin.net. Right now, I am executing the query with ToList(). After that, I am iterating over all items and returning the resulted in one web api (I need to return all items). In my case, is not necessary has all the elements in same time, I can process n elements many times, and one time finished I can return the resulted. I am trying understand you idea with Next(#), is like as paging. I'd have to execute Next(#) n times until that the query return empty result, it is that you suggest? it can be faster that tolist() ?
Not sure if streaming would help in that particular case. It is more of a means to lazily execute a query and have it return results as they are computed (instead of waiting for all results to be fetched and returned as a single list - using toList()).
In the profiled query above, it looks like you're going back-and-forth a lot. Instead of fetching things using as/select, what does the performance look like if you go directly into the project: g.V().hasLabel('Location').has('CorporateMember_Id', eq(49)).project('properties', 'edges').by(out('LinkedProperty_Json').valueMap()).by(outE('Linked').valueMap().fold()). Is there a reason for not wanting to do that?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.