Indexing Json Object Arrays in Lucene.NET

Question

I am working on putting arbitrary JSON objects into a Lucene.NET index. Take an object that might look like this:

{
  name: "Tony",
  age: 40,
  address: {
     street: "Weakroad",
     number: 10,
     floor: 2,
     door: "Left"
  },
  skills: [ 
    { name: ".NET", level: 5, experience: 12 },
    { name: "JavaScript", level: 3, experience: 6 },
    { name: "HTML5", level: 4, experience: 6 },
    { name: "Lucene.NET", level: 1, experience: 12 },
    { name: "C#", level: 10, experience: 12 }
  ],
  aliases: [ "Bucks", "SirTalk", "BeemerBoy" ]
}

That would produce the following fields:

"name": "Tony"
"age": "40"
"address.street": "Weakroad"
"address.number": "10"
"address.floor": "2"
"address.door": "Left"
"skills": ???
"aliases": "Bucks SirTalk BeemerBoy" //should turn into 3 tokens.

As you may have noticed, skills has a ??? because right now I am not sure how to deal with it, or whether there even is a meaningful, generic way to do it.
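The mapping described so far can be sketched as follows. This is plain Python, not Lucene.NET code, and the function name `flatten` is hypothetical. Nested objects become dotted field names, arrays of plain values are joined into one space-separated string, and arrays of objects are left as `None` because that is exactly the open question.

```python
def flatten(obj, prefix=""):
    """Map a nested dict to {dotted field name: string value}."""
    fields = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted prefix.
            fields.update(flatten(value, name + "."))
        elif isinstance(value, list):
            if all(not isinstance(v, (dict, list)) for v in value):
                # Array of plain values: one string, tokenized later by the analyzer.
                fields[name] = " ".join(str(v) for v in value)
            else:
                # Array of objects: unresolved, this is the "???" above.
                fields[name] = None
        else:
            fields[name] = str(value)
    return fields


person = {
    "name": "Tony",
    "age": 40,
    "address": {"street": "Weakroad", "number": 10, "floor": 2, "door": "Left"},
    "skills": [{"name": ".NET", "level": 5, "experience": 12}],
    "aliases": ["Bucks", "SirTalk", "BeemerBoy"],
}
fields = flatten(person)
```

With the sample object, `fields["address.street"]` is `"Weakroad"` and `fields["aliases"]` is `"Bucks SirTalk BeemerBoy"`.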

Here are some options I have been able to think about:

1) Concatenation: But then, AFAIK, I lose the ability to run more advanced queries against Lucene, such as finding persons with .NET skills above level 4.

For clarification, concatenation could be something like:

"skills": ".NET, JavaScript, HTML5, Lucene.NET, C#"

Numbers are discarded here, as they wouldn't make much sense in this case. If a child object had additional string properties, those would be gathered as well. An alternative would be to concatenate each field independently:

"skills.name": ".NET, JavaScript, HTML5, Lucene.NET, C#"
"skills.level": "5, 3, 4, 1, 10"
"skills.experience": "12, 6, 6, 12, 12"

Again, the numbers don't make much sense here; I added them just to provide an example.
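A small sketch of the per-field concatenation, again in plain Python with a hypothetical helper name (`concat_fields`). It shows why the association between a skill's name and its level is lost: each property ends up in its own independent string.

```python
skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
    {"name": "HTML5", "level": 4, "experience": 6},
    {"name": "Lucene.NET", "level": 1, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]


def concat_fields(items, prefix):
    """Collect each property of the array entries into one comma-separated string."""
    columns = {}
    for item in items:
        for key, value in item.items():
            columns.setdefault(f"{prefix}.{key}", []).append(str(value))
    return {name: ", ".join(values) for name, values in columns.items()}


concatenated = concat_fields(skills, "skills")
```

Nothing in `concatenated["skills.level"]` records that the value 5 belongs to ".NET" rather than to any other skill.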

2) Linked Documents: Create a new document per array entry with a back reference to this document. This might work, but without newer features such as Nested Documents and BlockJoinQuery, which haven't been ported to the .NET version yet, it sounds messy and like it would tank performance. It would also kill the usefulness of document scoring, though I think that might be less of an issue.

Basically, a child document would contain a stored field acting as a foreign key; whenever a search found that document, we would pick up the referenced document instead.

So if we illustrate the documents, they would be:

//Primary Document - ContentType: Person
"$id": 1
"$doctype": Primary
"name": "Tony"
...etc
"skills": [ 2, 3 ] //Just a stored field for retrieving data

//Child Document - ContentType: Skill
"$id": 2
"$ref": 1
"$doctype": Secondary
"name": ".NET"
"level": 5
"experience": 12

//Child Document - ContentType: Skill
"$id": 3
"$ref": 1
"$doctype": Secondary
"name": "JavaScript"
"level": 3
"experience": 6

etc.

I have added some meta fields ($id, $ref and $doctype) to tie the documents together.
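The linked-document layout can be sketched like this. It is a plain Python illustration, not Lucene.NET code; the helper names `split_documents` and `find_parents` are hypothetical, and the meta fields follow the `$id` / `$ref` / `$doctype` convention from the illustration above.

```python
from itertools import count


def split_documents(person, array_key, ids):
    """Split one object into a primary document plus one child per array entry."""
    primary_id = next(ids)
    children = []
    for item in person[array_key]:
        # $ref is the foreign key pointing back at the primary document.
        child = {"$id": next(ids), "$ref": primary_id, "$doctype": "Secondary"}
        child.update(item)
        children.append(child)
    primary = {"$id": primary_id, "$doctype": "Primary"}
    primary.update({k: v for k, v in person.items() if k != array_key})
    # Stored-only list of child ids, used for retrieving data.
    primary[array_key] = [c["$id"] for c in children]
    return primary, children


def find_parents(children, predicate):
    """Search the child documents, then follow $ref to the primary documents."""
    return {c["$ref"] for c in children if predicate(c)}


person = {
    "name": "Tony",
    "skills": [
        {"name": ".NET", "level": 5, "experience": 12},
        {"name": "JavaScript", "level": 3, "experience": 6},
    ],
}
primary, children = split_documents(person, "skills", count(1))
matches = find_parents(children, lambda c: c["name"] == ".NET" and c["level"] > 4)
```

Because each child holds exactly one skill, the name and level conditions are evaluated against the same array entry, which is what the flattened layouts cannot guarantee. The cost is the extra lookup from child to parent.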

3) A third option I have found since is to index the properties as multiple fields with the same name, so the above example would result in:

// index: 0
"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
// index: 1
"skills.name": "JavaScript"
"skills.level": 3
"skills.experience": 6
// index: 2
"skills.name": "HTML5"
"skills.level": 4
"skills.experience": 6
// index: 3
"skills.name": "Lucene.NET"
"skills.level": 1
"skills.experience": 12
// index: 4    
"skills.name": "C#"
"skills.level": 10
"skills.experience": 12

This is supported by Lucene.NET, yet it still does not meet the requirement to query like: [skills.name: ".NET" AND skills.level: [3 TO 5]].
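As a rough sketch of this layout, the array can be turned into an ordered list of (field name, value) pairs in which the same name repeats once per entry. This is plain Python for illustration only; `multi_valued_fields` and `values_for` are hypothetical names, not part of any Lucene.NET API.

```python
skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
    {"name": "HTML5", "level": 4, "experience": 6},
    {"name": "Lucene.NET", "level": 1, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]


def multi_valued_fields(items, prefix):
    """Emit one (name, value) pair per property per array entry; names repeat."""
    return [(f"{prefix}.{key}", value) for item in items for key, value in item.items()]


def values_for(fields, name):
    """All values stored under one field name, as a per-field search would see them."""
    return [value for field_name, value in fields if field_name == name]


fields = multi_valued_fields(skills, "skills")
levels = values_for(fields, "skills.level")
```

Each field can be searched on its own, but `values_for` makes the limitation visible: once grouped by name, the values no longer say which array entry they came from.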

But since this does allow me to search in the fields separately, I might be able to solve the other issue in another way by:

a) adding an extra combined field.
b) doing post-validation in a collector on the results.
c) a combination of the above.

It all depends on the data. Obviously, relying only on post-validation like the above would perform really badly, as I am likely to get a lot of false hits. It would still filter out people without .NET skills, however, which is a good thing.

But at least I am a step closer so far, I think.

Given the scenario above, we can now have (shortened greatly):

[{
  name: "Tony",
  skills: [ 
    { name: ".NET",       level: 1 },
    { name: "JavaScript", level: 3 },
    { name: "HTML5",      level: 5 }
  ]
 },
 {
  name: "Peter",
  skills: [ 
    { name: ".NET",       level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Lucene.NET", level: 1 }
  ]
 },
 {
  name: "Marilyn",
  skills: [ 
    { name: "JavaScript", level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Node",       level: 1 }
  ]
 }]

What we get is 3 documents with duplicate fields for skills.name and skills.level, and that's fine. I can actually search for { skills.name: 'JavaScript', skills.level: [1 TO 5] }, which correctly returns Marilyn and Tony.

But if I search for { skills.name: 'JavaScript', skills.level: [4 TO 5] }, I obviously still get both of them with this way of structuring the document, whereas I should only have gotten Marilyn.

Hence the need for post-filtering that rejects Tony as an actual match.
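The false hit and the post-filter can be reproduced without an index. The sketch below is plain Python and only imitates the matching behaviour described above; `field_level_match` stands in for the query against the flattened fields and `element_level_match` for the post-validation step. Both names are hypothetical, and a real implementation would do the second check in a collector or on the stored JSON.

```python
people = [
    {"name": "Tony", "skills": [
        {"name": ".NET", "level": 1},
        {"name": "JavaScript", "level": 3},
        {"name": "HTML5", "level": 5}]},
    {"name": "Peter", "skills": [
        {"name": ".NET", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Lucene.NET", "level": 1}]},
    {"name": "Marilyn", "skills": [
        {"name": "JavaScript", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Node", "level": 1}]},
]


def field_level_match(person, name, lo, hi):
    """Each condition is satisfied if ANY value of that field matches."""
    names = [s["name"] for s in person["skills"]]
    levels = [s["level"] for s in person["skills"]]
    return name in names and any(lo <= level <= hi for level in levels)


def element_level_match(person, name, lo, hi):
    """Post-validation: both conditions must hold on the SAME array entry."""
    return any(s["name"] == name and lo <= s["level"] <= hi for s in person["skills"])


# Step 1: what the flattened index returns for JavaScript with level [4 TO 5].
candidates = [p for p in people if field_level_match(p, "JavaScript", 4, 5)]
# Step 2: post-filter the candidates against the original objects.
confirmed = [p["name"] for p in candidates if element_level_match(p, "JavaScript", 4, 5)]
```

Here `candidates` contains Tony and Marilyn, because Tony's level 5 belongs to HTML5, and `confirmed` contains only Marilyn.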

I'd like to help but am finding your question hard to understand. Could you give examples to illustrate what you mean by 1) Concatenation and 2) Linked Documents? Also, for Option 3), does "multiple fields with the same name" mean that the field skills.name appears multiple times in a single Lucene document? What is "a compound fashion"?
– groverboy, Commented Apr 1, 2014 at 13:11
I have tried clarifying the question in response to your comment, except for the multiple fields: each line in the data examples refers to a field by its name and with a value, so yes, skills.name would exist 5 times in the document in the sample provided here. Since I found out that Lucene handles this well, I have been going down that path for now. There is still much to figure out though, so I might change paths.
– Jens, Commented Apr 1, 2014 at 17:32

Jens · Accepted Answer · 2014-05-07 09:03:16Z

1

For now I ended up accepting the limitations of Solution 3. The rationale is that if data needs to be queried in that way, it should be structured differently in the index (in line with Solution 2).

But I have chosen to move that decision outside of a possible framework handling this. As a result I have created https://github.com/dotJEM/json-index

1 Comment

pelican_george Over a year ago

It seems that you figured out your own solution. Can you have a look at this "similar" question and give your 2 cents on it? stackoverflow.com/questions/35268277/…

cris almodovar · Accepted Answer · 2016-03-14 02:39:56Z

0

Adding on to Option 3, you could try indexing "skills" separately, i.e. something like this:

"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
"skills": "name .NET level 5 experience 12"

This way you can do a query like this:

skills: ("name .NET" AND "level 5" AND "experience 12")
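Building the combined value is straightforward; a minimal sketch in plain Python follows, with `combined_value` as a hypothetical helper name. It assumes one combined "skills" value is added per array entry. How the phrase queries above behave against it depends on the analyzer in use, which this sketch does not cover.

```python
def combined_value(skill):
    """Join an entry's properties into one 'key value key value ...' string."""
    return " ".join(f"{key} {value}" for key, value in skill.items())


skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
]
# One combined "skills" field value per array entry (an assumption of this sketch).
combined = [("skills", combined_value(skill)) for skill in skills]
```

For the first entry this yields `"name .NET level 5 experience 12"`, matching the example value above.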
