How can I load the query results into a DataFrame whose columns preserve the hierarchical structure? I would like columns like this:
type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|
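A minimal sketch of the flattening step, assuming each hit has the usual `_type` / `_source` shape that Elasticsearch returns under `hits.hits`. The helper names `flatten_hit` and `hits_to_rows` and the sample document are made up for illustration. It only handles one level of nesting, which is what the groups in this index need.

```python
def flatten_hit(hit, keep_parent=False):
    """Turn one search hit into a flat dict (one future DataFrame row).

    keep_parent=False gives leaf names such as "courses";
    keep_parent=True gives dotted names such as "CourseInfo.courses".
    """
    row = {"type": hit.get("_type")}
    for key, value in hit.get("_source", {}).items():
        if isinstance(value, dict):
            # Nested group (e.g. CourseInfo): lift its fields one level up.
            for sub_key, sub_value in value.items():
                name = f"{key}.{sub_key}" if keep_parent else sub_key
                row[name] = sub_value
        else:
            row[key] = value
    return row


def hits_to_rows(hits, keep_parent=False):
    """Flatten a list of hits into a list of row dicts."""
    return [flatten_hit(hit, keep_parent) for hit in hits]


# Made-up hit, shaped like one entry of response["hits"]["hits"].
sample_hits = [{
    "_type": "atarnotes",
    "_id": "1",
    "_source": {
        "postDate": "2016-01-01",
        "discussionTitle": "Example title",
        "CourseInfo": {"courses": ["Commerce"], "subjectKeywords": ["maths"]},
        "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.25},
        "UniversityInfo": {"universities": ["Monash"], "universityKeywords": ["monash"]},
    },
}]

rows = hits_to_rows(sample_hits)
# With pandas installed, the rows can then become a DataFrame:
#   import pandas as pd
#   df = pd.DataFrame(rows)
```

If dotted column names are acceptable, `pandas.json_normalize` (available in recent pandas versions) applied to the list of `_source` dicts may be a simpler alternative to a hand-written helper.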
I have an Elasticsearch index with around 1,000,000 JSON documents.
I want to use this dataset for Natural Language Processing (NLP) with Python.
Could someone please help me get data from Elasticsearch into Python, and write data back to Elasticsearch from Python?
I would much appreciate it: I am stuck, unable to do any NLP on the dataset, because I cannot connect to it from Python.
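For context, this is roughly the round trip I have in mind, sketched with the official `elasticsearch` Python client and its `helpers.scan` and `helpers.bulk` functions. The server address, the function names `iter_docs`, `update_action` and `write_back`, and the use of `_type` in the actions are assumptions (the `_type` key applies to older Elasticsearch versions that still have mapping types, as this index appears to).

```python
def iter_docs(es, index):
    """Yield every hit in the index via the scroll-based scan helper."""
    from elasticsearch import helpers  # needs the elasticsearch package installed
    query = {"query": {"match_all": {}}}
    for hit in helpers.scan(es, index=index, query=query):
        yield hit


def update_action(hit, partial_doc):
    """Build one bulk 'update' action that merges partial_doc into the hit's document."""
    action = {
        "_op_type": "update",
        "_index": hit["_index"],
        "_id": hit["_id"],
        "doc": partial_doc,
    }
    if "_type" in hit:
        # Only present on clusters that still use mapping types.
        action["_type"] = hit["_type"]
    return action


def write_back(es, actions):
    """Send the actions to Elasticsearch in batches."""
    from elasticsearch import helpers
    return helpers.bulk(es, actions)


# Usage sketch (not run here; assumes a node on localhost:9200):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(["http://localhost:9200"])
#   actions = (update_action(hit, {"someField": "someValue"})
#              for hit in iter_docs(es, "educationforumsenriched2"))
#   write_back(es, actions)
```

Passing a generator to `write_back` keeps memory use flat, which seems relevant with around a million documents.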
I want to add a new entry to the hierarchy, a field group just like "UniversityInfo", called "ProcessInfo".
It should tag each JSON document with one or more of four categories (Applications, Offers, Enrolments, Requirements) based on keywords found in the post title and post text.
Just as "universityKeywords" does for universities, every JSON document should also store the set of keywords that triggered its tags.
This is what the index mapping currently looks like:
{
  "educationforumsenriched2": {
    "mappings": {
      "whirlpool": {
        "properties": {
          "CourseInfo": {
            "properties": {
              "courses": {
                "type": "string",
                "index": "not_analyzed"
              },
              "subjectKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "SentimentInfo": {
            "properties": {
              "SentiStrength": {
                "type": "float"
              },
              "SentiWordNet": {
                "type": "float"
              }
            }
          },
          "UniversityInfo": {
            "properties": {
              "universities": {
                "type": "string",
                "index": "not_analyzed"
              },
              "universityKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "postDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "postID": {
            "type": "integer"
          },
          "postText": {
            "type": "string"
          },
          "references": {
            "type": "string"
          },
          "threadID": {
            "type": "integer"
          },
          "threadTitle": {
            "type": "string"
          }
        }
      },
      "atarnotes": {
        "properties": {
          "CourseInfo": {
            "properties": {
              "courses": {
                "type": "string",
                "index": "not_analyzed"
              },
              "subjectKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "SentimentInfo": {
            "properties": {
              "SentiStrength": {
                "type": "float"
              },
              "SentiWordNet": {
                "type": "float"
              }
            }
          },
          "UniversityInfo": {
            "properties": {
              "universities": {
                "type": "string",
                "index": "not_analyzed"
              },
              "universityKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "discussionTitle": {
            "type": "string"
          },
          "postDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "postID": {
            "type": "integer"
          },
          "postText": {
            "type": "string"
          },
          "query": {
            "properties": {
              "match_all": {
                "type": "object"
              }
            }
          },
          "threadID": {
            "type": "integer"
          },
          "threadTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}
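For the new process-info group described above, the mapping addition might look something like the following, mirroring how "UniversityInfo" is defined. The field names `processes` and `processKeywords` are placeholders I made up, and the `put_mapping` call with `doc_type` and `body` is how older versions of the Python client accept it, so it may need adjusting for other versions.

```python
# Hypothetical mapping fragment for the new group, modelled on UniversityInfo.
process_info_mapping = {
    "properties": {
        "ProcessInfo": {
            "properties": {
                # Placeholder field names, not part of the existing index.
                "processes": {"type": "string", "index": "not_analyzed"},
                "processKeywords": {"type": "string", "index": "not_analyzed"},
            }
        }
    }
}


def add_process_info_mapping(es, index, doc_types=("whirlpool", "atarnotes")):
    """Add the fragment to each document type (es is an Elasticsearch client)."""
    for doc_type in doc_types:
        es.indices.put_mapping(index=index, doc_type=doc_type, body=process_info_mapping)
```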
This is the code I used to create the process info tags in Java. I want to do the same in Python:
processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement","requirements", "require")));
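A rough Python counterpart of the map above, plus a tagging function, could look like this. The keyword lists are copied from the Java snippet; everything else is an assumption: the names `PROCESS_KEYWORDS` and `tag_post`, the output field names `processes` and `processKeywords`, and the choice to match whole lowercase words rather than substrings (the Java matching logic is not shown here).

```python
import re

# Same tags and keywords as the Java processMap.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}


def tag_post(title, text):
    """Return the tags whose keywords occur as whole words in the title or text."""
    combined = f"{title or ''} {text or ''}".lower()
    words = set(re.findall(r"[a-z]+", combined))
    tags, matched = [], []
    for tag, keywords in PROCESS_KEYWORDS.items():
        found = [kw for kw in keywords if kw in words]
        if found:
            tags.append(tag)
            matched.extend(found)
    # Hypothetical field names for the new process-info group.
    return {"processes": tags, "processKeywords": matched}


result = tag_post("How to apply", "I got an offer and enrolled last week")
```

The returned dict could then be stored under the new group for each document, using the post title and post text fields as inputs.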