I am trying to index a PDF using Elasticsearch 2.3.4 and Python. I want to extract the text and metadata from the PDF and index them, using the mapper-attachments plugin.

When I try to index the document, I get a 'mapper_parsing_exception' error. Here is my code:

import json
from elasticsearch import Elasticsearch

# Configuration

DIR = 'D:/QA_Testing/testing/data'
ES_HOST = {"host" : "localhost", "port" : 9200}
INDEX_NAME = 'testing'
TYPE_NAME = 'documents'
URL = "D:/xyz.pdf"

es = Elasticsearch(hosts = [ES_HOST])

mapping = {
  "mappings": {
    "documents": {
      "properties": {
        "cv": { "type": "attachment" }
}}}}

file64 = open(URL, "rb").read().encode("base64")
data_dict = {'cv': file64}
data_dict = json.dumps(data_dict)

res = es.indices.create(index = INDEX_NAME, body = mapping)

es.index(index = INDEX_NAME, body = data_dict ,doc_type = "attachment", id=1)

ERROR:

Traceback (most recent call last):
  File "C:/Users/537095/Desktop/QA/IndexingWorkspace/MainWorkspace/index3.py", line 51, in <module>
    es.index(index = INDEX_NAME, body = data_dict ,doc_type = "attachment", id=1)
  File "C:\Python27\lib\site-packages\elasticsearch\client\utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\elasticsearch\client\__init__.py", line 261, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "C:\Python27\lib\site-packages\elasticsearch\transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "C:\Python27\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 106, in perform_request
    self._raise_error(response.status, raw_data)
  File "C:\Python27\lib\site-packages\elasticsearch\connection\base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
RequestError: TransportError(400, u'mapper_parsing_exception', u'failed to parse')

Am I doing anything wrong?

  • doc_type = "attachment" should be doc_type = "documents". Also can you show the error you see in the ES server logs? Commented Aug 24, 2016 at 7:28
  • Thank you very much! My silly mistake. It's working now :) Commented Aug 24, 2016 at 8:45

1 Answer

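You need to change your doc_type: it should be "documents", not "attachment". Your mapping defines the cv attachment field under the documents type, so the document has to be indexed into that same type: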

es.index(index = INDEX_NAME, body = data_dict ,doc_type = "documents", id=1)
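
To catch this kind of mismatch before the request is sent, one option is to check the doc_type against the types declared in the mapping. The sketch below reuses the mapping from the question. check_doc_type and build_attachment_body are hypothetical helper names introduced only for illustration; they are not part of the elasticsearch client. The second helper uses base64.b64encode, which works on both Python 2 and 3 and, unlike .encode("base64"), does not insert line breaks into the output.

```python
import base64
import json

# Mapping copied from the question: the `cv` attachment field
# is declared under the type "documents".
mapping = {
    "mappings": {
        "documents": {
            "properties": {
                "cv": {"type": "attachment"}
            }
        }
    }
}


def check_doc_type(mapping, doc_type):
    """Hypothetical helper: raise ValueError if doc_type is not declared in the mapping."""
    declared = sorted(mapping.get("mappings", {}))
    if doc_type not in declared:
        raise ValueError(
            "doc_type %r is not in the mapping; declared types: %s"
            % (doc_type, declared)
        )
    return doc_type


def build_attachment_body(raw_bytes, field="cv"):
    """Hypothetical helper: return a JSON string with the file bytes base64-encoded under `field`."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})
```

With these in place, the indexing call could be written as es.index(index=INDEX_NAME, doc_type=check_doc_type(mapping, "documents"), body=build_attachment_body(open(URL, "rb").read()), id=1), so a wrong type name fails locally with a readable message instead of a 400 from the server.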