I have a sample of JSON data that I am converting to a JArray with Newtonsoft Json.NET.

        string jsonString = @"[{'features': ['sunroof','mag wheels']},{'features': ['sunroof']},{'features': ['mag wheels']},{'features': ['sunroof','mag wheels','spoiler']},{'features': ['sunroof','spoiler']},{'features': ['sunroof','mag wheels']},{'features': ['spoiler']}]";

I am trying to retrieve the features that are most commonly requested together. Based on the above dataset, my expected output would be:

sunroof, mag wheels, 2
sunroof, 1
mag wheels, 1
sunroof, mag wheels, spoiler, 1
sunroof, spoiler, 1
spoiler, 1

However, my LINQ is rusty, and the code I am using to query my JSON data is returning the count of the individual features, not the features selected together:

        JArray autoFeatures = JArray.Parse(jsonString);
        var features = from f in autoFeatures.Select(feat => feat["features"]).Values<string>()
                       group f by f into grp
                       orderby grp.Count() descending
                       select new { indFeature = grp.Key, count = grp.Count() };

        foreach (var feature in features)
        {
            Console.WriteLine("{0}, {1}", feature.indFeature, feature.count);
        }

Actual Output:
sunroof, 5
mag wheels, 4
spoiler, 3

I was thinking maybe my query needs a 'distinct' in it, but I'm just not sure.

  • var features = JsonConvert.DeserializeObject<List<Dictionary<string, string[]>>>(jsonString).SelectMany(d => d).GroupBy(k => string.Concat(k.Value.OrderBy(s => s))).Select(g => new { Feature = g.Key, Count = g.Count() }).OrderByDescending(a => a.Count); The value strings are sorted before being concatenated, so the resulting groups ignore the order in which the features are listed. Commented Aug 30, 2019 at 21:55

2 Answers

The problem is in the Select. It flattens the arrays, so each value found in them becomes its own item. Instead, you need to combine all the values in each features array into a single string. Here is how to do it:

var features = from f in autoFeatures.Select(feat => string.Join(",", feat["features"].Values<string>()))
               group f by f into grp
               orderby grp.Count() descending
               select new { indFeature = grp.Key, count = grp.Count() };

This produces the following output:

sunroof,mag wheels, 2
sunroof, 1
mag wheels, 1
sunroof,mag wheels,spoiler, 1
sunroof,spoiler, 1
spoiler, 1
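
Note that the join above treats "sunroof,mag wheels" and "mag wheels,sunroof" as two different keys. If the order of the features should not matter, one option is to sort each list before joining it. The following is only a minimal sketch of that idea: it uses plain string arrays instead of the parsed JArray so that it is self-contained, and the names FeatureGrouping and CountCombinations are hypothetical. It also assumes that feature names never contain a comma.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FeatureGrouping
{
    // Hypothetical helper: counts how often each combination of features occurs,
    // treating the same features in a different order as the same combination.
    public static List<KeyValuePair<string, int>> CountCombinations(IEnumerable<string[]> records)
    {
        return records
            // Sort each record's features so the key does not depend on input order.
            .Select(features => string.Join(",", features.OrderBy(f => f, StringComparer.Ordinal)))
            .GroupBy(key => key)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(pair => pair.Value)
            .ToList();
    }

    public static void Main()
    {
        var records = new[]
        {
            new[] { "sunroof", "mag wheels" },
            new[] { "mag wheels", "sunroof" },   // same features, different order
            new[] { "spoiler" }
        };

        foreach (var pair in CountCombinations(records))
        {
            Console.WriteLine("{0}, {1}", pair.Key, pair.Value);
        }
    }
}
```

With the JArray from the question, the input could presumably be built with something like autoFeatures.Select(f => f["features"].Values<string>().ToArray()). For the three sample records above, the two differently ordered lists end up in one group with a count of 2.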

2 Comments

That gets me exactly what I need. Thanks. I didn't realize I would have to join the strings together. I figured there was a way to extract that information without doing string manipulation.
There very well may be a way to do that with LINQ, but it's far beyond my powers! Though hopefully, now that I have posted an answer, the experts will come out of the woodwork to show how wrong and inefficient my version is, and we can both learn something :)

You could use a HashSet to identify the distinct sets of features, and group on those sets. That way, your LINQ looks basically identical to what you have now, but GroupBy needs an additional IEqualityComparer class to check whether one set of features is the same as another.

For example:

var featureSets = autoFeatures
    .Select(feature => new HashSet<string>(feature["features"].Values<string>()))
    .GroupBy(a => a, new HashSetComparer<string>())
    .Select(a => new { Set = a.Key, Count = a.Count() })
    .OrderByDescending(a => a.Count);

foreach (var result in featureSets)
{
    Console.WriteLine($"{String.Join(",", result.Set)}: {result.Count}");
}

The comparer class uses the SetEquals method of the HashSet class to check whether two sets contain the same elements (this also handles the strings being in a different order within the set):

public class HashSetComparer<T> : IEqualityComparer<HashSet<T>>
{
    public bool Equals(HashSet<T> x, HashSet<T> y)
    {
        // so if x and y both contain "sunroof" only, this is true 
        // even if x and y are a different instance
        return x.SetEquals(y);
    }

    public int GetHashCode(HashSet<T> obj)
    {
        // force comparison every time by always returning the same, 
        // or we could do something smarter like hash the contents
        return 0; 
    }
}
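
The comment in GetHashCode mentions hashing the contents as a smarter alternative. Returning 0 puts every set in the same hash bucket, so each new set is compared against the existing keys with SetEquals. Below is a minimal sketch of a content-based hash, not a tuned implementation: ContentHashSetComparer and Demo are hypothetical names, and it assumes all sets were created with the same element comparer. XOR is used because it is commutative, so the hash does not depend on enumeration order. .NET also provides HashSet<T>.CreateSetComparer(), which can be passed to GroupBy directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical variant of the comparer above that hashes the set's contents.
public class ContentHashSetComparer<T> : IEqualityComparer<HashSet<T>>
{
    public bool Equals(HashSet<T> x, HashSet<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SetEquals(y);
    }

    public int GetHashCode(HashSet<T> obj)
    {
        // XOR the element hashes; the result is the same for any enumeration order.
        int hash = 0;
        foreach (T item in obj)
        {
            hash ^= item == null ? 0 : obj.Comparer.GetHashCode(item);
        }
        return hash;
    }
}

public static class Demo
{
    public static void Main()
    {
        var sets = new[]
        {
            new HashSet<string> { "sunroof", "mag wheels" },
            new HashSet<string> { "mag wheels", "sunroof" },   // same set, different order
            new HashSet<string> { "spoiler" }
        };

        var groups = sets
            .GroupBy(s => s, new ContentHashSetComparer<string>())
            .Select(g => new { Set = g.Key, Count = g.Count() })
            .OrderByDescending(g => g.Count);

        foreach (var g in groups)
        {
            Console.WriteLine($"{string.Join(",", g.Set)}: {g.Count}");
        }

        // The framework's built-in set comparer should produce the same two groups.
        var builtIn = sets.GroupBy(s => s, HashSet<string>.CreateSetComparer());
        Console.WriteLine(builtIn.Count());
    }
}
```

Equal sets still have to pass the SetEquals check; the content hash only lets GroupBy skip comparisons between sets whose hashes differ.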

4 Comments

Just tried your solution on a more complex dataset. The HashSetComparer is essential to capture and combine (group) the cases where the features are not listed in the same order. Thank you.
Sure, it's a good consideration, but I'm not so sure it's a good answer to this question. The merit of raising this consideration seems to lie in the fact that it was not part of the question itself. Raising it also prompts more considerations: was there any meaning in the order of the features to begin with, and is there a means of ordering these features to begin with? If you want "sunroof, mag wheels" and "mag wheels, sunroof" to be counted as different combinations, then you wouldn't use this approach; if you can already order the features, this approach would be unnecessary.
I don't want to dissect this thoughtful answer too much, but in case someone thinks this is the best approach: it depends.
Based on my question 'as asked' it's a bit overkill. Based on my actual needs and dataset, this answer is absolutely necessary.
