
I would like to understand the performance impact of indexing documents of multiple types into a single index where there is an imbalance in the number of items of each type (one type has millions of documents, while another has just thousands). I have spotted issues on some of my indexes, and knowing whether or not types are indexed separately within a single index would help me rule things out. Can I assume that types are indexed separately along the lines of a relational database, where each table is effectively separate?

If the answer to the above is no, and types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try and get some more detailed input.

The use case for this example is capturing tweets for Twitter users (call each one an owner for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:

  • I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch
  • Each tweet refers to a parent type, the user who authored the tweet (which may or may not be the owner), via a parent mapping; a sketch of these mappings follows this list. There is only a single 'user' type for all the timeline types
  • I only ever search and facet on one owner in a single query, so I don't have to concern myself with searching across multiple indexes
  • The home timeline may capture millions of tweets, whereas the owner's own tweets may only number in the hundreds or thousands
  • The user documents are routinely updated with information outside of the Twitter timelines, so I would like to avoid (if possible) having to keep multiple copies of the same user object in sync across multiple indexes
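To make the layout concrete, here is a cut-down sketch of the mappings (the index name owner_index and the field names are placeholders rather than my real mapping); each timeline type declares the single user type as its parent:

    curl -XPUT 'localhost:9200/owner_index' -d '{
      "mappings": {
        "user": {
          "properties": {
            "screen_name": { "type": "string", "index": "not_analyzed" }
          }
        },
        "home_timeline": {
          "_parent": { "type": "user" },
          "properties": {
            "text":       { "type": "string" },
            "created_at": { "type": "date" }
          }
        },
        "my_tweets_timeline": {
          "_parent": { "type": "user" },
          "properties": {
            "text":       { "type": "string" },
            "created_at": { "type": "date" }
          }
        }
      }
    }'

The mentions_timeline and direct_messages_timeline types follow the same pattern.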

I have noticed much slower responses when querying the indexes with millions of documents, even when excluding the 'home timeline' type that holds those millions and querying only the types with a few thousand entries. I would rather not split the types into separate indexes (unless I have to), because of the parent-child relationship between a tweet and a user.
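For reference, this is roughly how I confirm the per-type document counts on one of the slow indexes (owner_index is again a placeholder name for the per-owner index):

    # per-type document counts within a single owner's index
    curl -XGET 'localhost:9200/owner_index/home_timeline/_count?pretty'
    curl -XGET 'localhost:9200/owner_index/my_tweets_timeline/_count?pretty'
    curl -XGET 'localhost:9200/owner_index/user/_count?pretty'

    # total across all types in the index
    curl -XGET 'localhost:9200/owner_index/_count?pretty'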

Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?

Any input would be appreciated.

EDIT

To clarify the statement that tweets are stored per timeline: there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc., corresponding to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.

I have gone back to check the has_child queries, and they are a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand documents (my_tweets_timeline).

  • My answer feels incomplete, but so does your question: please provide the has_child query you're using, as well as examples of the different documents with their relationships. In particular I wasn't sure what you meant by "excluding the 'home timeline' type" - I only got a sense of the tweet and user types, so that confused me. Commented Jun 22, 2013 at 1:20
  • Paul, I edited the question a little to clarify the timelines. Also, going back to look at the queries, has_child is not any more of a performance issue than regular queries. Commented Jun 24, 2013 at 13:14
  • Hmm, okay. Seems like it's a general scalability issue then. Hopefully someone else can chime in. +1 Commented Jun 25, 2013 at 0:59

1 Answer


Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?

No, types are all lumped together into one index as you guessed.
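To illustrate (a sketch with placeholder index and type names, not taken from your setup): limiting a search to one type is effectively just an automatic filter on the internal _type field of the shared index, so documents of the small types sit in the same underlying Lucene index as the millions of home_timeline documents:

    # searching a single type...
    curl -XGET 'localhost:9200/owner_index/my_tweets_timeline/_search' -d '{
      "query": { "match_all": {} }
    }'

    # ...is roughly equivalent to filtering the whole index on _type
    curl -XGET 'localhost:9200/owner_index/_search' -d '{
      "query": {
        "filtered": {
          "query":  { "match_all": {} },
          "filter": { "term": { "_type": "my_tweets_timeline" } }
        }
      }
    }'

That is part of why a query against a type with only a few thousand documents can still be affected by the overall size of the index.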

Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?

The total number of documents in the index is obviously a factor. Whether has_child queries are slow in particular is another question: try comparing the performance of has_child queries with trivial term queries, for example (a sketch of such a comparison is at the end of this answer). The has_child documentation offers one clue under "memory considerations":

With the current implementation, all _id values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.

This would imply a large amount of memory is required for any has_child query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for has_child.
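To make the suggested comparison concrete, here is a rough sketch (the index name, type names, and the field/value in the term query are placeholders based on the question, not a known schema); run both against one of the large indexes and compare the took values in the responses:

    # plain term query directly against the child (tweet) type
    curl -XGET 'localhost:9200/owner_index/home_timeline/_search?pretty' -d '{
      "query": { "term": { "text": "elasticsearch" } }
    }'

    # the same term query wrapped in has_child, returning the matching parent users
    curl -XGET 'localhost:9200/owner_index/user/_search?pretty' -d '{
      "query": {
        "has_child": {
          "type":  "home_timeline",
          "query": { "term": { "text": "elasticsearch" } }
        }
      }
    }'

To see how much heap that _id lookup data is actually costing you, the index and node stats are a reasonable place to look (again a sketch; the exact sections reported vary by Elasticsearch version, but the id cache figures, where present, cover the parent/child _id data):

    # index-level stats: look at the id_cache / cache memory figures
    curl -XGET 'localhost:9200/owner_index/_stats?pretty'

    # node-level stats, including JVM heap usage
    curl -XGET 'localhost:9200/_nodes/stats?pretty'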


1 Comment

In response to the first part of this answer, is there any way for an index to optimize based on _type? I understand the has_child memory issue, although my original question was ill-considered in mentioning this, as that query is not substantially slower than a regular query. Good clarification though.
