What is the fastest way to get all _ids of a certain index from Elasticsearch? Is it possible using a simple query? One of my indices has around 20,000 documents.
Edit: Please also read the answer from Aleck Landgraf
You just want the elasticsearch-internal _id field? Or an id field from within your documents?
For the former, try
curl http://localhost:9200/index/type/_search?pretty=true -d '
{
  "query" : {
    "match_all" : {}
  },
  "stored_fields": []
}
'
If you are using Elastic dev tools, use this instead:
GET <your-index-name>/_search
{
  "query" : {
    "match_all" : {}
  },
  "stored_fields": []
}
Note (2017 update): this answer originally used "fields": [], but the parameter has since been renamed; stored_fields is the current name.
The result will contain only the "metadata" of your documents:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}
For the latter, if you want to include a field from your documents, simply add it to the fields array:
curl http://localhost:9200/index/type/_search?pretty=true -d '
{
  "query" : {
    "match_all" : {}
  },
  "fields": ["document_field_to_be_returned"]
}
'
Comments:
- fields was removed; instead, add the "_source": false param.
- Better to use scroll and scan to get the result list, so Elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl Python library this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
Console log:
GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...
Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, etc., which you can update); scan disables sorting. The scan helper function returns a Python generator which can be safely iterated through.
Comments:
- fields has been removed in version 5.0.0 (see: elasticsearch-dsl.readthedocs.io/en/latest/…). You should now use s = s.source([]).

For Elasticsearch 5.x, you can use the "_source" field:
GET /_search
{
  "_source": false,
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
"fields" has been deprecated.
(Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")
Elaborating on the answers by Robert Lujo and Aleck Landgraf: if you want the IDs in a list from the returned generator, here is what I use:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
es = Elasticsearch(hosts=[YOUR_ES_HOST])
hits = helpers.scan(
    es,
    query={"query": {"match_all": {}}},
    scroll='1m',
    index=INDEX_NAME
)
ids = [hit['_id'] for hit in hits]
Another option
curl 'http://localhost:9200/index/type/_search?pretty=true&fields='
will return _index, _type, _id and _score.
Comments:
- stored_fields instead of fields for newer versions.

I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python, anyway). I'm dealing with hundreds of millions of documents, rather than thousands.
The helpers class can be used with a sliced scroll, which allows multi-threaded execution. In my case, I also have a high-cardinality field (acquired_at) to provide for slicing. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.
Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size; a rough way to do that is sketched after the code.
# note below I have es, index, and cluster_name variables already set
import gzip
from concurrent import futures
from elasticsearch import helpers

max_workers = 14
scroll_slice_ids = list(range(0, max_workers))

def get_doc_ids(scroll_slice_id):
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        # "max" must equal the total number of slices; valid slice ids are 0..max-1
        query = {"sort": ["_doc"],
                 "slice": {"field": "acquired_at",
                           "id": scroll_slice_id,
                           "max": len(scroll_slice_ids)},
                 "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m',
                            size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write(doc['_id'] + '\n')
        results_file.flush()
    return count

if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices'
          % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
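To put a number on "estimate the final dump size", here is a minimal back-of-the-envelope sketch; the sample_ids values are hypothetical (grab a few real ids from your index instead), while es and index are the same variables as above:
# average id length (+1 byte for the newline) times the document count
# gives the uncompressed size; gzip will shrink it considerably
sample_ids = ["w6BsrmsBVC4sQYl4wbMj", "xKBsrmsBVC4sQYl4wbMj"]
avg_bytes = sum(len(i) + 1 for i in sample_ids) / len(sample_ids)
doc_count = es.count(index=index)["count"]
print("uncompressed dump ~= %.2f GB" % (avg_bytes * doc_count / 1e9))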
For Python users: the Python Elasticsearch client provides a convenient abstraction for the scroll API:
from elasticsearch import Elasticsearch, helpers
client = Elasticsearch()
query = {
    "query": {
        "match_all": {}
    }
}
scan = helpers.scan(client, index=index, query=query, scroll='1m', size=100)
for doc in scan:
    # do something with each hit, e.g. collect doc['_id']
You can also do it in Python, which gives you a proper list:
import elasticsearch
es = elasticsearch.Elasticsearch()
res = es.search(
    index=your_index,
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
ids = [d['_id'] for d in res['hits']['hits']]
Inspired by @Aleck-Landgraf's answer, this worked for me by using the scan function directly from the standard elasticsearch Python API:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields": []},
                 index="your-index-name", doc_type="your-doc-type"):
    print(dobj["_id"])
I suggest using a neat tool like elasticdump and issuing a query like the following:
~/.bin/elasticdump --input='http://username:[email protected]:9200/my-index' --output=output.txt --searchBody='{"_source": ["_id"], "query":{ "match_all": {}}}' --limit 10000
Then you can process the output.txt file using the Linux cut command to keep only the id part of each document (a Python alternative is sketched below).
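Since elasticdump writes one JSON document per line, a minimal Python sketch to pull out the ids (assuming the output.txt produced above) could be:
import json

# each line of the dump is a standalone JSON document; collect its _id
with open('output.txt') as f:
    ids = [json.loads(line)['_id'] for line in f if line.strip()]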
This works for me:
def select_ids(self, **kwargs):
    """
    :param kwargs: params from modules
    :return: list of incident ids
    """
    index = kwargs.get('index')
    if not index:
        return None
    # print("Params", kwargs)
    query = self._build_query(**kwargs)
    # print("Query", query)
    # get results, keeping only the hit ids in the response
    results = self._db_client.search(body=query, index=index, stored_fields=[],
                                     filter_path="hits.hits._id")
    print(results)
    ids = [_['_id'] for _ in results['hits']['hits']]
    return ids
Elasticsearch docs now recommend search_after + point in time over the scroll API:
We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).
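The quoted recommendation doesn't come with code, so here is a minimal sketch of the PIT + search_after approach, assuming the official elasticsearch Python client (7.10 or later) and a hypothetical index name my-index:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# open a point in time so the index state stays frozen while we page through it
pit = es.open_point_in_time(index="my-index", keep_alive="1m")

ids, search_after = [], None
while True:
    body = {
        "size": 10000,
        "_source": False,
        "pit": {"id": pit["id"], "keep_alive": "1m"},
        "sort": [{"_shard_doc": "asc"}],  # cheap, unique sort for PIT searches
    }
    if search_after is not None:
        body["search_after"] = search_after
    hits = es.search(body=body)["hits"]["hits"]  # no index arg: the PIT carries it
    if not hits:
        break
    ids.extend(hit["_id"] for hit in hits)
    search_after = hits[-1]["sort"]  # resume after the last hit of this page

es.close_point_in_time(body={"id": pit["id"]})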
Url -> http://localhost:9200/<index>/<type>/_search
HTTP method -> GET
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}