I am working on putting arbitrary JSON objects into a Lucene.NET index. Given an object that might look like:
{
name: "Tony",
age: 40,
address: {
street: "Weakroad",
number: 10,
floor: 2,
door: "Left"
},
skills: [
{ name: ".NET", level: 5, experience: 12 },
{ name: "JavaScript", level: 3, experience: 6 },
{ name: "HTML5", level: 4, experience: 6 },
{ name: "Lucene.NET", level: 1, experience: 12 },
{ name: "C#", level: 10, experience: 12 }
],
aliases: [ "Bucks", "SirTalk", "BeemerBoy" ]
}
That would produce the following fields:
"name": "Tony"
"age": "40"
"address.street": "Weakroad"
"address.number": "10"
"address.floor": "2"
"address.door": "Left"
"skills": ???
"aliases": "Bucks SirTalk BeemerBoy" //should turn into 3 tokens.
As you may have noticed, skills has a ???, because right now I am not sure how to deal with it, or whether there even is a "meaningful-generic" way to do it.
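The mapping above can be sketched in a few lines. This is only an illustration of the flattening rule in Python, not Lucene.NET code: the function name `flatten` is hypothetical, scalar arrays are joined with spaces on the assumption that the analyzer splits on whitespace, and arrays of objects are left as `None` because that case is still undecided.

```python
def flatten(obj, prefix=""):
    """Flatten nested objects into dotted field names with string values."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted path.
            fields.update(flatten(value, path))
        elif isinstance(value, list):
            if all(not isinstance(v, (dict, list)) for v in value):
                # Array of scalars: one space-separated value, one token each.
                fields[path] = " ".join(str(v) for v in value)
            else:
                # Array of objects: the open question, left undecided here.
                fields[path] = None
        else:
            fields[path] = str(value)
    return fields


person = {
    "name": "Tony",
    "age": 40,
    "address": {"street": "Weakroad", "number": 10, "floor": 2, "door": "Left"},
    "skills": [{"name": ".NET", "level": 5, "experience": 12}],
    "aliases": ["Bucks", "SirTalk", "BeemerBoy"],
}

flat = flatten(person)
```

Here `flat["address.street"]` is `"Weakroad"`, `flat["aliases"]` is `"Bucks SirTalk BeemerBoy"`, and `flat["skills"]` is `None`.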
Here are some options I have been able to think about:
1) Concatenation: But then, AFAIK, I will lose the ability to do more advanced queries against Lucene, like finding persons with .NET skills above level 4.
For clarification, concatenation could be something like:
"skills": ".NET, JavaScript, HTML5, Lucene.NET, C#"
Numbers are discarded as they wouldn't make much sense in this case. If additional properties on a child object were strings, they would have been gathered as well. An alternative would be to concatenate each field independently:
"skills.name": ".NET, JavaScript, HTML5, Lucene.NET, C#"
"skills.level": "5, 3, 4, 1, 10"
"skills.experience": "12, 6, 6, 12, 12"
Again, numbers don't make all that much sense here; I added them just to provide an example.
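A minimal sketch of the per-field concatenation, again in plain Python rather than Lucene.NET code (`concat_per_field` is a hypothetical helper). It also makes the drawback visible: once the values are joined per column, nothing ties ".NET" to level 5 any more.

```python
def concat_per_field(items, prefix):
    """Collect each property of the child objects into one joined string."""
    columns = {}
    for item in items:
        for key, value in item.items():
            columns.setdefault(f"{prefix}.{key}", []).append(str(value))
    return {name: ", ".join(values) for name, values in columns.items()}


skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
    {"name": "HTML5", "level": 4, "experience": 6},
    {"name": "Lucene.NET", "level": 1, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]

concatenated = concat_per_field(skills, "skills")
```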
2) Linked Documents: Creating a new document per array entry with a back reference to this document. This might work, but without newer features such as Nested Documents and BlockJoinQuery, which haven't been ported to the .NET version yet, it really sounds messy, and it sounds like it would tank performance. It would also kill the usefulness of document scoring, though I think that might be less of an issue.
Basically, a child document would contain a stored field acting as a foreign key; whenever a search found that document, we would pick up the referenced document instead.
So if we illustrate the documents, they would be:
//Primary Document - ContentType: Person
"$id": 1
"$doctype": Primary
"name": "Tony"
...etc
"skills": [ 2, 3 ] //Just a stored field for retrieving data
//Child Document - ContentType: Skill
"$id": 2
"$ref": 1
"$doctype": Secondary
"name": ".NET"
"level": 5
"experience": 12
//Child Document - ContentType: Skill
"$id": 3
"$ref": 1
"$doctype": Secondary
"name": "JavaScript"
"level": 3
"experience": 6
etc.
I have added some meta fields ($id, $ref, $doctype) to link the documents together.
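The linked-document layout can be sketched as follows. This is an assumption-laden illustration in Python, with dicts standing in for Lucene documents; the helpers `to_linked_documents` and `resolve` are hypothetical, and only the meta field names `$id`, `$ref` and `$doctype` come from the listing above.

```python
from itertools import count


def to_linked_documents(obj, array_field, ids):
    """Split one object into a primary document plus one child per array entry."""
    primary_id = next(ids)
    primary = {"$id": primary_id, "$doctype": "Primary"}
    primary.update({k: v for k, v in obj.items() if k != array_field})
    children = []
    for item in obj[array_field]:
        # Each child carries a back reference ($ref) to its primary document.
        child = {"$id": next(ids), "$ref": primary_id, "$doctype": "Secondary"}
        child.update(item)
        children.append(child)
    # Stored-only list of child ids, used for retrieving data.
    primary[array_field] = [c["$id"] for c in children]
    return primary, children


def resolve(hit, docs_by_id):
    """Swap a matching child document for the primary document it refers to."""
    return docs_by_id[hit["$ref"]] if hit["$doctype"] == "Secondary" else hit


tony = {
    "name": "Tony",
    "skills": [
        {"name": ".NET", "level": 5, "experience": 12},
        {"name": "JavaScript", "level": 3, "experience": 6},
    ],
}

primary, children = to_linked_documents(tony, "skills", count(1))
docs_by_id = {d["$id"]: d for d in [primary, *children]}
resolved = resolve(children[0], docs_by_id)
```

A hit on the ".NET" child document resolves to Tony's primary document, which is the extra lookup per hit that makes this option feel expensive.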
3) A third option I have found since is to index the properties as multiple fields with the same name, so the above example would then result in:
// index: 0
"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
// index: 1
"skills.name": "JavaScript"
"skills.level": 3
"skills.experience": 6
// index: 2
"skills.name": "HTML5"
"skills.level": 4
"skills.experience": 6
// index: 3
"skills.name": "Lucene.NET"
"skills.level": 1
"skills.experience": 12
// index: 4
"skills.name": "C#"
"skills.level": 10
"skills.experience": 12
This is supported by Lucene.NET, yet it still leaves me short of the demand to query like: [skills.name: ".NET" AND skills.level: [3 TO 5]].
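The multi-valued layout from option 3 can be sketched as a flat list of (field name, value) pairs, where the same name may occur several times. The tuples are only a stand-in for adding several fields with the same name to one Lucene document, and `to_multivalued_fields` is a hypothetical helper.

```python
def to_multivalued_fields(items, prefix):
    """Emit one (field name, value) pair per property of every array entry."""
    return [(f"{prefix}.{key}", value) for item in items for key, value in item.items()]


skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
    {"name": "HTML5", "level": 4, "experience": 6},
    {"name": "Lucene.NET", "level": 1, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]

fields = to_multivalued_fields(skills, "skills")
names = [value for name, value in fields if name == "skills.name"]
```

With the five skills from the example, `skills.name` occurs five times in the one document, but the pairing between a name and its level is not recorded anywhere.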
But since this does allow me to search in the fields separately, I might be able to solve the other issue in another way by:
- a) adding an extra combined field.
- b) doing post validation in a collector on the results.
- c) a combination of the above.
It all depends on the data. Obviously, sticking to post validation alone for data like the above would perform really badly, as I am likely to get a lot of false hits. It will still filter out people without .NET skills, however, which is a good thing.
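One way option a) could look, as a rough sketch: index one extra token per skill that fuses name and level, so the two values can no longer drift apart. Everything here is an assumption for illustration: the `name|level` token format, the helper names, the idea that levels are small integers so a range can be expanded into individual tokens, and that the analyzer would keep each token whole.

```python
def combined_tokens(skills):
    """Build one token per skill that keeps name and level together."""
    return [f"{s['name']}|{s['level']}" for s in skills]


def matches_combined(tokens, name, low, high):
    """Expand the level range into exact tokens and test for any of them."""
    wanted = {f"{name}|{level}" for level in range(low, high + 1)}
    return any(token in wanted for token in tokens)


tony_tokens = combined_tokens([
    {"name": ".NET", "level": 1},
    {"name": "JavaScript", "level": 3},
    {"name": "HTML5", "level": 5},
])
marilyn_tokens = combined_tokens([
    {"name": "JavaScript", "level": 5},
    {"name": "HTML5", "level": 3},
    {"name": "Node", "level": 1},
])

tony_hit = matches_combined(tony_tokens, "JavaScript", 4, 5)
marilyn_hit = matches_combined(marilyn_tokens, "JavaScript", 4, 5)
```

The trade-off is that every combination to be queried needs its own combined field, which is hard to do generically for arbitrary JSON.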
But at least so far I am a step closer, I think.
Taking the scenario above, we can now have (greatly shortened):
[{
name: "Tony",
skills: [
{ name: ".NET", level: 1 },
{ name: "JavaScript", level: 3 },
{ name: "HTML5", level: 5 }
]
},
{
name: "Peter",
skills: [
{ name: ".NET", level: 5 },
{ name: "HTML5", level: 3 },
{ name: "Lucene.NET", level: 1 }
]
},
{
name: "Marilyn",
skills: [
{ name: "JavaScript", level: 5 },
{ name: "HTML5", level: 3 },
{ name: "Node", level: 1 }
]
}]
What we get is 3 documents with duplicate fields for skills.name and skills.level, that's fine... And I can actually search for { skills.name: 'JavaScript', skills.level: [1 TO 5] } which correctly returns Marilyn and Tony.
But if I search for { skills.name: 'JavaScript', skills.level: [4 TO 5] } I obviously still get both of them with this way of structuring the document, whereas I should only have gotten Marilyn as a result.
Hence the need for post filtering that will reject Tony as an actual match.
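The false hit and the post filter can be reproduced outside Lucene with a small Python simulation. `flat_match` mimics how the multi-valued fields behave (each clause is checked against all values of its field independently), while `post_validate` re-checks the original objects; both names are hypothetical, and a real implementation would do the second step in a collector or after loading the stored source.

```python
people = [
    {"name": "Tony", "skills": [
        {"name": ".NET", "level": 1},
        {"name": "JavaScript", "level": 3},
        {"name": "HTML5", "level": 5}]},
    {"name": "Peter", "skills": [
        {"name": ".NET", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Lucene.NET", "level": 1}]},
    {"name": "Marilyn", "skills": [
        {"name": "JavaScript", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Node", "level": 1}]},
]


def flat_match(person, name, low, high):
    """Each clause matches against all values of its field, independently."""
    names = [s["name"] for s in person["skills"]]
    levels = [s["level"] for s in person["skills"]]
    return name in names and any(low <= level <= high for level in levels)


def post_validate(person, name, low, high):
    """Require name and level to match within the same skill entry."""
    return any(s["name"] == name and low <= s["level"] <= high
               for s in person["skills"])


candidates = [p for p in people if flat_match(p, "JavaScript", 4, 5)]
confirmed = [p["name"] for p in candidates if post_validate(p, "JavaScript", 4, 5)]
candidate_names = [p["name"] for p in candidates]
```

Here `candidate_names` is `["Tony", "Marilyn"]` (Tony slips through because his HTML5 level is 5), and `confirmed` is `["Marilyn"]`.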
skills.name appears multiple times in a single Lucene document? What is "a compound fashion"?
skills.name would exist 5 times in the document in the sample provided here. Since I found out that Lucene handles this well, I have been going down that path for now. There is still much to figure out though, so I might change paths.