I am working on putting arbitrary JSON objects into a Lucene.NET index. Given an object that might look like this:

{
  name: "Tony",
  age: 40,
  address: {
     street: "Weakroad",
     number: 10,
     floor: 2,
     door: "Left"
  },
  skills: [ 
    { name: ".NET", level: 5, experience: 12 },
    { name: "JavaScript", level: 3, experience: 6 },
    { name: "HTML5", level: 4, experience: 6 },
    { name: "Lucene.NET", level: 1, experience: 12 },
    { name: "C#", level: 10, experience: 12 }
  ],
  aliases: [ "Bucks", "SirTalk", "BeemerBoy" ]
}

That would produce the following fields:

"name": "Tony"
"age": "40"
"address.street": "Weakroad"
"address.number": "10"
"address.floor": "2"
"address.door": "Left"
"skills": ???
"aliases": "Bucks SirTalk BeemerBoy" //should turn into 3 tokens.
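
As a rough illustration of the mapping above (plain Python, not Lucene.NET code), a recursive walk can produce the dotted field names. The function name `flatten` and the decision to skip arrays of objects are assumptions made for this sketch; it only shows how the simple cases fall out and why `skills` is left open.

```python
def flatten(obj, prefix=""):
    """Map a nested JSON-like dict to {dotted field name: string value}.

    Arrays of scalars are joined with spaces so an analyzer could split
    them into tokens. Arrays of objects are skipped (the open question).
    """
    fields = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            # Nested object: recurse with "parent." as the new prefix.
            fields.update(flatten(value, name + "."))
        elif isinstance(value, list):
            if all(not isinstance(v, (dict, list)) for v in value):
                fields[name] = " ".join(str(v) for v in value)
            # else: array of objects, deliberately not handled here
        else:
            fields[name] = str(value)
    return fields


person = {
    "name": "Tony",
    "age": 40,
    "address": {"street": "Weakroad", "number": 10, "floor": 2, "door": "Left"},
    "skills": [{"name": ".NET", "level": 5, "experience": 12}],
    "aliases": ["Bucks", "SirTalk", "BeemerBoy"],
}

for field, value in flatten(person).items():
    print(field, "=", value)
```

The `skills` array produces no field at all in this sketch, which is exactly the gap the options below try to fill.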

As you may have noticed, skills is marked ???, because right now I am not sure how to deal with it, or whether there even is a meaningful, generic way to do it.

Here are some options I have been able to think about:


1) Concatenation: As far as I know, I would then lose the ability to run more advanced queries against Lucene, such as finding persons with .NET skills above level 4.

For clarification, concatenation could be something like:

"skills": ".NET, JavaScript, HTML5, Lucene.NET, C#"

The numbers are discarded, as they wouldn't make much sense in this case. If a child object had additional string properties, those would have been gathered as well. An alternative would be to concatenate each field independently:

"skills.name": ".NET, JavaScript, HTML5, Lucene.NET, C#"
"skills.level": "5, 3, 4, 1, 10"
"skills.experience": "12, 6, 6, 12, 12"

Again, the numbers don't make much sense here; I added them only to provide an example.


2) Linked Documents: Create a new document per array entry with a back reference to this document. This might work, but without newer features such as Nested Documents and BlockJoinQuery, which haven't been ported to the .NET version yet, it sounds messy and likely to tank performance. It would also kill the usefulness of document scoring, though I think that might be less of an issue.

Basically, each child document would contain a stored field acting as a foreign key; whenever a search found that document, we would pick up the referenced document instead.

Illustrated as documents, that would be:

//Primary Document - ContentType: Person
"$id": 1
"$doctype": Primary
"name": "Tony"
...etc
"skills": [ 2, 3 ] //Just a stored field for retrieving data

//Child Document - ContentType: Skill
"$id": 2
"$ref": 1
"$doctype": Secondary
"name": ".NET"
"level": 5
"experience": 12

//Child Document - ContentType: Skill
"$id": 3
"$ref": 1
"$doctype": Secondary
"name": "JavaScript"
"level": 3
"experience": 6

etc.

I have added some meta fields ($id, $ref and $doctype).
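
To make the linked-document idea concrete, here is a minimal Python sketch (not Lucene.NET code). The helper names `to_linked_documents` and `resolve_to_primary` are hypothetical; the `$id`, `$ref` and `$doctype` fields are the ones from the illustration above. It assumes ids are simply handed out sequentially.

```python
from itertools import count


def to_linked_documents(person, ids=None):
    """Split a person into one primary document and one child per skill."""
    if ids is None:
        ids = count(1)  # assumption: sequential ids starting at 1
    primary_id = next(ids)
    primary = {"$id": primary_id, "$doctype": "Primary"}
    for key, value in person.items():
        if key != "skills":
            primary[key] = value
    children = []
    for skill in person.get("skills", []):
        # "$ref" is the foreign key pointing back at the primary document.
        child = {"$id": next(ids), "$ref": primary_id, "$doctype": "Secondary"}
        child.update(skill)
        children.append(child)
    # Stored-only list of child ids, used for retrieving data.
    primary["skills"] = [child["$id"] for child in children]
    return primary, children


def resolve_to_primary(hit, documents_by_id):
    """If a search hit is a child document, return the document it refers to."""
    return documents_by_id[hit["$ref"]] if "$ref" in hit else hit


tony = {
    "name": "Tony",
    "skills": [
        {"name": ".NET", "level": 5, "experience": 12},
        {"name": "JavaScript", "level": 3, "experience": 6},
    ],
}
primary, children = to_linked_documents(tony)
by_id = {doc["$id"]: doc for doc in [primary] + children}
print(resolve_to_primary(children[0], by_id)["name"])  # Tony
```

The extra lookup in `resolve_to_primary` is the step that would have to happen for every child hit, which is where the performance concern comes from.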


3) A third option I have found since is to index the properties as multiple fields with the same name, so the above example would result in:

// index: 0
"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
// index: 1
"skills.name": "JavaScript"
"skills.level": 3
"skills.experience": 6
// index: 2
"skills.name": "HTML5"
"skills.level": 4
"skills.experience": 6
// index: 3
"skills.name": "Lucene.NET"
"skills.level": 1
"skills.experience": 12
// index: 4    
"skills.name": "C#"
"skills.level": 10
"skills.experience": 12
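
A small Python sketch of how such repeated fields could be generated (again not Lucene.NET code; `flatten_multi` is a hypothetical name). Instead of a dictionary it returns a list of (field name, value) pairs, so the same name can appear any number of times, mirroring a document with several fields of the same name.

```python
def flatten_multi(obj, prefix=""):
    """Flatten a JSON-like dict into (field name, value) pairs.

    Every array element contributes its own pair(s), so a field name
    such as "skills.name" is repeated once per array entry.
    """
    fields = []
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            fields.extend(flatten_multi(value, name + "."))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    # Array of objects: reuse the array's name as prefix.
                    fields.extend(flatten_multi(item, name + "."))
                else:
                    fields.append((name, item))
        else:
            fields.append((name, value))
    return fields


doc = flatten_multi({
    "name": "Tony",
    "skills": [
        {"name": ".NET", "level": 5, "experience": 12},
        {"name": "JavaScript", "level": 3, "experience": 6},
    ],
})
for field, value in doc:
    print(field, "=", value)
```

Note that the pairing between a given `skills.name` and its `skills.level` exists only through ordering in this list; once the values are indexed per field, that association is not something a query can rely on.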

This is supported by Lucene.NET, yet it still leaves me short of the requirement to query like: [skills.name: ".NET" AND skills.level: [3 TO 5]].

But since this does allow me to search the fields separately, I might be able to solve the other issue in another way by:

  • a) adding an extra combined field.
  • b) making post validations in a collector on the results.
  • c) a combination of the above.

It all depends on the data. Obviously, relying only on post validation for data like the above would perform really badly, as I am likely to get a lot of false hits. It would still filter out people without .NET skills, however, which is a good thing.

But at least I am a step closer so far, I think.


Given the scenario above, we can now have (shortened greatly):

[{
  name: "Tony",
  skills: [ 
    { name: ".NET",       level: 1 },
    { name: "JavaScript", level: 3 },
    { name: "HTML5",      level: 5 }
  ]
 },
 {
  name: "Peter",
  skills: [ 
    { name: ".NET",       level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Lucene.NET", level: 1 }
  ]
 },
 {
  name: "Marilyn",
  skills: [ 
    { name: "JavaScript", level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Node",       level: 1 }
  ]
 }]

What we get is 3 documents with duplicate fields for skills.name and skills.level, and that's fine. I can actually search for { skills.name: 'JavaScript', skills.level: [1 TO 5] }, which correctly returns Marilyn and Tony.

But if I search for { skills.name: 'JavaScript', skills.level: [4 TO 5] }, I obviously still get both of them with this way of structuring the document, where I should only have gotten Marilyn as a result.

Hence the need for a post filtering that will reject Tony as an actual match...
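
The false hit and the post filter can be simulated in plain Python, without an index. Both function names are hypothetical: `field_level_match` mimics what the flattened fields can answer (each condition may be satisfied by a different skill entry), and `entry_level_match` is the post validation against the original JSON.

```python
people = [
    {"name": "Tony", "skills": [
        {"name": ".NET", "level": 1},
        {"name": "JavaScript", "level": 3},
        {"name": "HTML5", "level": 5}]},
    {"name": "Peter", "skills": [
        {"name": ".NET", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Lucene.NET", "level": 1}]},
    {"name": "Marilyn", "skills": [
        {"name": "JavaScript", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Node", "level": 1}]},
]


def field_level_match(person, name, low, high):
    """Conditions are checked per field, independently of each other."""
    names = [s["name"] for s in person["skills"]]
    levels = [s["level"] for s in person["skills"]]
    return name in names and any(low <= lvl <= high for lvl in levels)


def entry_level_match(person, name, low, high):
    """Post validation: one single skill entry must satisfy both conditions."""
    return any(s["name"] == name and low <= s["level"] <= high
               for s in person["skills"])


candidates = [p for p in people if field_level_match(p, "JavaScript", 4, 5)]
confirmed = [p for p in candidates if entry_level_match(p, "JavaScript", 4, 5)]

print([p["name"] for p in candidates])  # ['Tony', 'Marilyn']
print([p["name"] for p in confirmed])   # ['Marilyn']
```

Tony passes the first stage because his HTML5 level of 5 falls in the range while a different entry supplies the JavaScript name; the second stage rejects him.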

  • I'd like to help but am finding your question hard to understand. Could you give examples to illustrate what you mean by 1) Concatenation and 2) Linked Documents? Also for Option 3) does "multiple fields with the same name" - mean field skills.name appears multiple times in a single Lucene document? What is "a compound fashion"? Commented Apr 1, 2014 at 13:11
  • I have tried clarifying the question according to your comment, except for the multiple fields. Each line in the data examples refers to a field by its name together with a value, so yes, skills.name would exist 5 times in the document in the sample provided here. Since I found out that Lucene handles this well, I have been going down that path for now... There is still much to figure out though, so I might change paths... Commented Apr 1, 2014 at 17:32

2 Answers

For now I ended up accepting the limitations of Solution 3. The rationale is that if data needs to be queried in that way, it should be structured differently in the index (in line with Solution 2).

But I have chosen to move that decision outside of a possible framework handling this. As a result I have created https://github.com/dotJEM/json-index

1 Comment

It seems that you figured out your own solution. Can you have a look at this "similar" question and give your 2 cents on it? stackoverflow.com/questions/35268277/…

Adding on to Option 3, you could try indexing "skills" separately, i.e. something like this:

"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
"skills": "name .NET level 5 experience 12"

This way you can do a query like this:

skills: ("name .NET" AND "level 5" AND "experience 12")
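
A minimal sketch of how the combined value could be built from a skill object (plain Python; `combined_skill_values` is a hypothetical helper, and it assumes the property order of the source object is kept). How the resulting text is tokenized, for example what happens to ".NET", depends on the analyzer and is not covered here.

```python
def combined_skill_values(skills):
    """Build one "key value key value ..." string per skill entry."""
    return [" ".join(f"{key} {value}" for key, value in skill.items())
            for skill in skills]


values = combined_skill_values([
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
])
for value in values:
    print(value)
# name .NET level 5 experience 12
# name JavaScript level 3 experience 6
```

Because the values become plain text, a phrase such as "level 5" matches one exact value; a range like [3 TO 5] would presumably still need the separate `skills.level` field or a post filter.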
