Sorted data structure that allows duplicates on sort key, but replaces duplicates on another key in sub linear time

Question

This question is language-agnostic, but I'm specifically looking for a solution using C++ STL containers. I have a struct like this.

struct User {
   int query_count;
   std::string user_id;
}

std::multiset<User> users; //currently using

I use a multiset with a comparator that sort on query_count. This allow me to have sorted multiple entries with the same query_count. Now, if I want to avoid duplicates on user_id, I need to scan data and remove the entry and create a new one, taking O(n). I'm trying to think of a way to do this in sub-linear time. I was thinking of a solution based on a map ordered on user_id, but then I would have to scan all the whole data when trying to locate the largest query_count.

EDIT: requirements are insert, delete, update(delete/insert), get highest query_count, find user_id in sub-linear time.

I prefer to use the standard stl containers, but simple modifications are fine. Is there any way to achieve my requirements?

Summary:

The summary of the answers is that to use a ootb solution, I can use boost bi-directional map.

If I'm sticking to STL, then it has to be a combination of using two maps together, updating both carefully, for each insertion of a user.

Wouldn't that create duplicates based on user_id? Could you elaborate? — dev_nut
– dev_nut, Commented Apr 21, 2015 at 17:46
Maybe I missed this (1, "User"), (2, "User"), ... Do we have an XY-problem? — user2249683
– user2249683, Commented Apr 21, 2015 at 17:52
How are users added and removed? How is the query count modified? Can you include the minimum required interface in the question? — Arun
– Arun, Commented Apr 21, 2015 at 17:53
@Arun, just insert and delete on the multiset. There is no required interface. I want to get sub-linear time in all operations, i.e. find user, get highest count, insert, delete, update (delete/insert). — dev_nut
– dev_nut, Commented Apr 21, 2015 at 17:57

Mark B · Accepted Answer · 2015-04-21 17:43:54Z

2

This sounds like a job for boost's multi_index: http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/tutorial/

You can set one index based on the user id to easily prevent duplicates (you insert based on this), and then another sorted index on the query count to easily locate the max.

answered Apr 21, 2015 at 17:43

Mark B

96.6k10 gold badges113 silver badges198 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

dev_nut Over a year ago

I wonder whether there's a roundabout way to do this with the STL containers?

SumitV · Accepted Answer · 2015-04-22 07:25:05Z

multi_index from boost is the way to go. But if you want to use your own DataStructure using basic STL containers, then i suggest you create a class which has two conatiners internally.

keep an itertor to SortedContainer in the map. So that you can delete and access it in O(1)( same as lookup of unordered_map).

X

struct User {
    int query_count;
    std::string user_id;
}


class UserQueryCountSomething
{
    typedef std::list<int> SortedContainer; //better to use a Stack or Heap here instead of list.
    SortedContainer  sortedQueryCount; //keep the query_count sorted here.
    typedef std::pair< User, typename SortedContainer::iterator>  UserPosition_T;//a pair of User struct and the iterator in list.
    typedef unordered_map  < std::string,  UserPosition_T > Map_T;  // Keep your User struct and the iterator here in this map, aginst the user_id.

    Map_T map_;

    public:

    Insert(User u)
    {
        //insert into map_ and also in sortedQueryCount
    }

    int getHighestQueryCount()
    {
        //return first element in sortedQueryCount.
    }

    Delete()
    {
        //find in map and delete.
        //get the iterator from the map's value type here.
        //delete from the sortedQueryCount using the iteartor.
    }
};
}

This can be a starting point for you. Let me know if you more details.

Arun · Accepted Answer · 2015-04-22 15:10:48Z

1

If we just need the highest count, not other ranks of count, then one approach may be to track it explicitly. We may do that as

unordered_map<UserId, QueryCount>;
int max_query_count;

Unfortunately, in some operations, e.g. when the user with max query count is removed, the max value need to freshly computed. Note that, for all other users, whose query count is not maximum, removal of them does not need re-computation of max_query_count. The re-computation, when done, would be O(N), which does not meet the "sub linear" requirement. That may be good enough for many use cases, because the user with maximum query count may not be frequently removed.

However, if we absolutely want to avoid the O(N) re-computation, then we may introduce another container as

multimap<QueryCount, UserId>

to map a specific query count to a collection of users.

In this approach, any mutation operation e.g. add, remove, update, may need to update both the containers. That is little bit of pain, but the gain is that such updates are expected to be logarithmic, e.g. O(lg N), i.e. sub linear.

Update with some code sketch. Note I have used unordered_map and unordered_set, instead of multimap, for count-to-user mapping. Since we do not really need ordering on count, this might be fine; in case if not, unordered_map may be simply changed to map.

class UserQueryCountTracker {
 public:
  typedef std::string UserId;
  typedef int QueryCount;

  void AddUser(UserId id) {
    int new_count = -1;
    auto it = user_count_map_.find(id);
    if (it == user_count_map_.end()) {  // id does not exist
      new_count = 1;
      user_count_map_[id] = new_count;
      count_user_map_[new_count].insert(id);
    }
    else {                              // id exists
      const int old_count = it->second;
      new_count = old_count + 1;
      it->second = new_count;
      // move 'id' from old count to new count
      count_user_map_[old_count].erase(id);
      count_user_map_[new_count].insert(id);
    }
    assert(new_count != -1);
    if (new_count > max_query_count_) {
      max_query_count_ = new_count;
    }
  }

  const unordered_set<UserId>& UsersWithMaxCount() const {
    return count_user_map_[max_query_count_];
  }

 private:
  unordered_map<UserId, QueryCount> user_count_map_{};
  int max_query_count_{0};
  unordered_map<QueryCount, unordered_set<UserId>> count_user_map_{};
};

edited Apr 22, 2015 at 15:10

answered Apr 21, 2015 at 18:35

Arun

20.6k12 gold badges54 silver badges61 bronze badges

3 Comments

dev_nut Over a year ago

That complication is acceptable, as long as it's sublinear. I will check this out. Thank you for pointing to multimap. If you include some sample code, I could mark this as an answer.

dev_nut Over a year ago

It seems you are not storing and data as to who has the max count. Maybe, I'm not clear about my requirements.

Arun Over a year ago

I added a new method UsersWithMaxCount()

Nikita · Accepted Answer · 2015-04-22 15:16:47Z

Use bidirectional map, where user id is key and query count is value

#include <map>
#include <utility>
#include <functional>
template
<
    typename K, // key
    typename V, // value
    typename P = std::less<V>  // predicate
>
class value_ordered_map
{
private:
    std::map<K, V>         key_to_value_;
    std::multimap<V, K, P> value_to_key_;

public:
    typedef typename std::multimap<typename V, typename K, typename P>::iterator by_value_iterator;

    const V& value(const K& key) {
        return key_to_value_[key];
    }

    std::pair<by_value_iterator, by_value_iterator> keys(const V& value) {
        return value_to_key_.equal_range(value);
    }

    void set(const K& key, const V& value) {
        by_key_iterator it = key_to_value_.find(key);
        if (key_to_value_.end() != it) {
            std::pair<by_value_iterator, by_value_iterator> it_pair = value_to_key_.equal_range(key_to_value_[key]);
            while (it_pair.first != it_pair.second)
                if (it_pair.first->first == it->second) {
                    value_to_key_.erase(it_pair.first);
                    break;
                } else ++it_pair.first;
        }
        key_to_value_[key] = value;
        value_to_key_.insert(std::make_pair(value, key));
    }
};

Collectives™ on Stack Overflow

Sorted data structure that allows duplicates on sort key, but replaces duplicates on another key in sub linear time

4 Answers 4

1 Comment

Comments

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related