
I'm using the Node.js elasticsearch package to interact with Elasticsearch. I have a document that has a file field. I want to be able to upload a file to the index, but the only way I have found is the elasticsearch-mapper-attachments plugin.

The problem is that if I use it, I have to load the whole file in memory, encode it to Base64 and then pass the String to ElasticSearch.

I'd like to be able to pass a Stream to ElasticSearch (referencing any binary file: pdf, xls, doc, ppt).

  • ES will not do it for you. How big are your files? Commented Sep 29, 2016 at 3:53
  • Mmm so the only way is with a base64 string? I'm not sure about the file size. Let's say 1GB, but if 10000 users uploaded a big file, I'd have to load a lot into memory. Commented Sep 29, 2016 at 16:42
  • Do you want that attachment just stored along the index or actually indexed and searchable? Commented Sep 29, 2016 at 18:10
  • The files are stored in S3, I want to be able to perform searches. Commented Sep 29, 2016 at 20:10

1 Answer

The elasticsearch-mapper-attachments plugin parses the uploaded binary file and extracts its text for indexing, using the built-in Tika extractor.

Some applications (for example, Search Technologies' Aspire) instead run the binaries through Tika locally, extract the text, and upload just that text with the documents to be indexed.

It might not be the answer you are looking for, but you really have just two options: use the Elasticsearch plugin (and convert the binary to Base64 in your code before uploading the document), or parse the binary and extract the text in your own code, then upload just that text. The former is easier; the latter gives you more control over the process.
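To make the difference between the two options concrete, here is a minimal Node.js sketch of the document body each one would send. It is only an illustration under assumptions: the field names `title`, `file`, and `content` and both helper functions are hypothetical, and the real field names depend on your mapping.

```javascript
// Hypothetical field names ("title", "file", "content"); adjust them to your mapping.

// Option 1: the attachment plugin expects the binary as a Base64 string,
// so the whole file has to be in memory as a Buffer before encoding.
function buildAttachmentBody(title, fileBuffer) {
  return {
    title: title,
    file: fileBuffer.toString('base64')
  };
}

// Option 2: the text was already extracted by your own code
// (for example with a local Tika run), so only plain text is sent.
function buildTextBody(title, extractedText) {
  return {
    title: title,
    content: extractedText
  };
}

// Small in-memory sample standing in for a real file.
const sample = Buffer.from('hello world', 'utf8');

const attachmentBody = buildAttachmentBody('greeting.txt', sample);
const textBody = buildTextBody('greeting.txt', 'hello world');

console.log(attachmentBody.file); // aGVsbG8gd29ybGQ=
console.log(textBody.content);    // hello world
```

Either object would then be passed as the `body` of the client's `index` call. With option 2 the Base64 overhead (roughly a third more bytes) disappears, and the extraction step can run wherever it suits you.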


3 Comments

Using Tika in my application is out of scope. I found a related issue about this: github.com/elastic/elasticsearch-mapper-attachments/issues/146 Apparently, they don't want to consume the files from external data sources.
@Andrey, if I use Tika to extract the document content as text, will uploading that text to Elasticsearch have any limitations? For example, will a huge PDF file cause problems?
@AKS - the standard ES document size limit of 2GB per document will apply, so as long as your PDF text plus all other fields are less than 2GB, you are good.
