Getting data into Python from ElasticSearch-JSON files

Question

How can I get the query results into a DataFrame whose columns preserve the hierarchical structure? I would like columns like this:

type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|

I have an Elasticsearch index with around 1,000,000 JSON documents. I want to use this dataset for Natural Language Processing (NLP) with Python. Could someone please show me how to get data from Elasticsearch into Python, and how to write data back to Elasticsearch from Python? I would much appreciate it: I am stuck and cannot do any NLP on the dataset because I cannot connect it to Python.

I also want to add a new entry to the hierarchy called "Process info", just like "University info". It should tag the dataset based on a set of keywords I supply and, just like "universityKeywords", every JSON document should store the keywords that produced its tags. I want four tags (categories) on the JSON documents: Applications, Offers, Enrolments and Requirements, assigned from keywords found in the post title and post text.

This is what the index structure of the Elasticsearch index looks like:

 "educationforumsenriched2": {
          "mappings": {
             "whirlpool": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "references": {
                      "type": "string"
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             },
             "atarnotes": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "discussionTitle": {
                      "type": "string"
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "query": {
                      "properties": {
                         "match_all": {
                            "type": "object"
                         }
                      }
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             }
          }
       }
    }

This is the code I used to create the process info tags in Java. I want to do the same in Python:

    processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
    processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
    processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled")));
    processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement", "requirements", "require")));

pyelasticsearch? I have installed the packages but can't figure out how to get this dataset into Python. A small example would be very useful. This is the mapping structure of my Elasticsearch index:
– BA stu, Commented Jun 3, 2017 at 6:06

"educationforumsenriched2": { "mappings": { "whirlpool": { "properties": { "CourseInfo": {..
– BA stu, Commented Jun 3, 2017 at 6:10

Adonis · Accepted Answer · 2017-06-06 21:45:46Z

With the Elasticsearch Python client, once you have established a connection, you only need to provide the DSL query and the index (or indices) you want to search in order to retrieve the required information. For instance, if you have a query:

GET educationforumsenriched2/_search
{
    "query": {
        "match" : {
            "CourseInfo.subjectKeywords" : "foo"
        }
    }
}

The equivalent in Python would be:

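from elasticsearch import Elasticsearch

# hosts is passed as a list; many other settings are available if using https and so on
es = Elasticsearch([{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)

# do some processing on res["hits"]["hits"]
# (only the first 10 hits come back unless "size" is set in the query)

# index a new document in ES; new_doc_after_processing is the dict produced above.
# index() lets ES generate the id, whereas create() needs an explicit id.
# doc_type applies to older clusters like the one in the question ("whirlpool" is one
# of its types) and should be dropped on recent versions, which no longer have types.
es.index(index="educationforumsenriched2", doc_type="whirlpool", body=new_doc_after_processing)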
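
The "do some processing" step is where the keyword tagging from the question could go. The following is only a sketch under assumptions: the keyword lists are copied from the Java snippet in the question, while the names `PROCESS_KEYWORDS` and `tag_process_info` and the output fields `processTags` and `processKeywords` are made up here by analogy with `universityKeywords`. It matches whole lowercase words in the title fields and `postText`, and does no stemming.

```python
import re

# Keyword lists copied from the Java processMap in the question.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}

WORD_RE = re.compile(r"[a-z]+")
TEXT_FIELDS = ("threadTitle", "discussionTitle", "postText")


def tag_process_info(source):
    """Return hypothetical process-info fields for one document's _source dict."""
    text = " ".join(str(source.get(field) or "") for field in TEXT_FIELDS)
    words = set(WORD_RE.findall(text.lower()))
    tags, keywords = [], []
    for tag, candidates in PROCESS_KEYWORDS.items():
        found = [kw for kw in candidates if kw in words]
        if found:
            tags.append(tag)
            keywords.extend(found)
    return {"processTags": tags, "processKeywords": keywords}


# Invented sample document for illustration.
doc = {
    "threadTitle": "Applying for engineering",
    "postText": "I got an offer but have not enrolled yet.",
}
print(tag_process_info(doc))
```

The returned dict could then be stored under a new object field (for example a hypothetical `ProcessInfo`) next to `UniversityInfo` when the document is written back.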

Edit: Thinking about it further, if your processing is not too complex, you could also consider building an ingest pipeline.
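
With around 1,000,000 documents, writing results back one document at a time would be slow, so one option is the client's bulk helper. The sketch below only builds the action dicts, in the format documented for `elasticsearch.helpers.bulk`; `build_update_actions` and its `make_fields` callback are hypothetical names, and the sample hit is invented. It assumes each hit carries `_index`, `_id` and `_source`, plus `_type` on older clusters like the one in the question.

```python
def build_update_actions(hits, make_fields):
    """Yield one partial-update action per hit for elasticsearch.helpers.bulk.

    hits: iterable of search hits (dicts with _index, _id, _source, maybe _type).
    make_fields: hypothetical callback returning the fields to merge into a document.
    """
    for hit in hits:
        action = {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": make_fields(hit["_source"]),
        }
        # Mapping types only exist on older clusters, so copy _type when present.
        if "_type" in hit:
            action["_type"] = hit["_type"]
        yield action


# Invented hit, shaped like what a search returns.
sample_hits = [{
    "_index": "educationforumsenriched2",
    "_type": "whirlpool",
    "_id": "42",
    "_source": {"postText": "How do I apply?"},
}]
actions = list(build_update_actions(
    sample_hits,
    lambda source: {"ProcessInfo": {"processTags": ["Applications"]}},
))
print(actions[0]["doc"])
```

Against a live cluster, the hits would typically come from `elasticsearch.helpers.scan(es, index="educationforumsenriched2", query={"query": {"match_all": {}}})`, which scrolls through every matching document, and the generator would be passed to `elasticsearch.helpers.bulk(es, ...)`. Both helpers need the `elasticsearch` package and a running cluster, so they are left out of the runnable part.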

edited Jun 6, 2017 at 21:45

answered Jun 3, 2017 at 14:09


3 Comments

BA stu Over a year ago

Thank you. But how can I get the results from ES into a structure like a DataFrame, with columns for the fields as in the question edit?

Adonis Over a year ago

@BAstu What kind of DataFrame are we talking about: a pandas DataFrame or a Spark DataFrame? Maybe this question can help: stackoverflow.com/questions/25186148/…

BA stu Over a year ago

Yes, a pandas DataFrame. Thanks.
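
For the pandas case raised in these comments, one possible approach is to flatten each hit's `_source` and use the leaf field names as columns, which gives the flat column list from the question. This is a rough sketch, not a definitive method: `flatten_source` and `hits_to_rows` are hypothetical helper names, the sample `res` is invented, and it assumes `res` has the usual `res["hits"]["hits"]` layout returned by `es.search`. Leaf names would clash if two groups shared a field name, which is not the case in the mapping shown.

```python
def flatten_source(source):
    """Flatten nested dicts, keeping only the leaf key as the column name."""
    flat = {}
    for key, value in source.items():
        if isinstance(value, dict):
            flat.update(flatten_source(value))
        else:
            flat[key] = value
    return flat


def hits_to_rows(res):
    """Turn a search response into a list of flat dicts, one per hit."""
    rows = []
    for hit in res["hits"]["hits"]:
        row = {"type": hit.get("_type")}
        row.update(flatten_source(hit["_source"]))
        rows.append(row)
    return rows


# Invented sample shaped like an Elasticsearch search response.
res = {"hits": {"hits": [{
    "_type": "atarnotes",
    "_source": {
        "postDate": "2017-06-03",
        "discussionTitle": "Uni offers",
        "CourseInfo": {"courses": ["Engineering"], "subjectKeywords": ["maths"]},
        "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.25},
    },
}]}}

rows = hits_to_rows(res)
print(sorted(rows[0]))

# With pandas installed, the rows load directly into a DataFrame:
# import pandas as pd
# df = pd.DataFrame(rows)
```

If dotted column names such as `CourseInfo.courses` are acceptable, `pandas.json_normalize` applied to the list of `_source` dicts is an alternative that keeps the hierarchy visible in the column names.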
