2

I'm struggling to put together a query and could use some help. The documents are very simple and just record a user's login time:

{
"timestamp":"2019-01-01 13:14:15",
"username":"theuser"
}

I would like counts using the following rules based on an offset from today, for example 10 days ago.

  • Any user whose latest login is before 10 days ago is counted as 'inactive user'
  • Any user whose first login is after 10 days ago is counted as 'new user'
  • Anyone else is just counted as 'active user'.

I can get the first and latest logins per user using this (I've found this can also be done with the top_hits aggregation)

GET mytest/_search?filter_path=**.buckets
{
    "aggs" : {
        "username_grouping" : {
            "terms" : {
                "field" : "username"
            },
            "aggs" : {
                "first_login" : {
                    "min": { "field" : "timestamp" }
                },
                "latest_login" : {
                    "max": { "field" : "timestamp" }
                }
            }
        }
    }
}

I was thinking of using this as the source for a date range aggregation but couldn't get anything working.

Is this possible in one query? If not, can the 'inactive user' and 'new user' counts be calculated in separate queries?

Here's some sample data. Assuming today's date is 2019-08-20 and an offset of 10 days, this gives a count of 1 for each type of user:

PUT _template/mytest-index-template
{
  "index_patterns": [ "mytest" ],
  "mappings": {
    "properties": {
      "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "username": { "type": "keyword" }
    }
  }
}

POST /mytest/_bulk
{"index":{}}
{"timestamp":"2019-01-01 13:14:15","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-20 18:55:05","username":"olduser"}
{"index":{}}
{"timestamp":"2019-01-31 09:33:19","username":"olduser"}
{"index":{}}
{"timestamp":"2019-08-16 08:02:43","username":"newuser"}
{"index":{}}
{"timestamp":"2019-08-18 07:31:34","username":"newuser"}
{"index":{}}
{"timestamp":"2019-03-01 09:02:54","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-14 07:34:22","username":"activeuser"}
{"index":{}}
{"timestamp":"2019-08-19 06:09:08","username":"activeuser"}

Thanks in advance.

2 Answers

1

First, sorry in advance. This will be a long answer.

How about using the Date Range Aggregation?

You can set the "from" and "to" on a specific field and "tag" each range with a key. This will help you determine who is an old user and who is an active user.

I can think of something like this, where "old_user" covers logins more than 10 days old and "active_user" covers logins within the last 10 days:

{
    "aggs": {
        "range": {
            "date_range": {
                "field": "timestamp",
                "ranges": [
                    { "to": "now-10d/d", "key": "old_user" },
                    { "from": "now-10d/d", "to": "now/d", "key": "active_user" }
                ],
                "keyed": true
            }
        }
    }
}

The first range can be read as: "All the docs whose 'timestamp' is more than 10 days old are tagged old_user". In math terms:

-infinity (no "from" given) <= timestamp < "to" (10 days ago)

The second range can be read as: "All the docs whose 'timestamp' falls within the last 10 days are tagged active_user". In math terms:

"from" (10 days ago) <= timestamp < "to" (now, rounded to the day)

OK, we have figured out how to "tag" your users. But if you run the query as it is, you will find something like this in the results:

user1: old_user
user1: old_user
user1: active_user
user2: old_user
user2: old_user
user2: active_user
user2: old_user
user3: old_user
user3: active_user
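
To see why the same user ends up under both tags, here is a small Python sketch that applies the two ranges to the sample data from the question, one login document at a time. It only imitates the bucketing in plain Python; it is not what Elasticsearch runs internally. The function name `tag_logins` is made up for this illustration, and "today" is assumed to be 2019-08-20 as in the question.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample logins copied from the question.
LOGINS = [
    ("olduser", "2019-01-01 13:14:15"),
    ("olduser", "2019-01-20 18:55:05"),
    ("olduser", "2019-01-31 09:33:19"),
    ("newuser", "2019-08-16 08:02:43"),
    ("newuser", "2019-08-18 07:31:34"),
    ("activeuser", "2019-03-01 09:02:54"),
    ("activeuser", "2019-08-14 07:34:22"),
    ("activeuser", "2019-08-19 06:09:08"),
]


def tag_logins(logins, today, offset_days=10):
    """Tag each login document the way the two ranges would (hypothetical helper)."""
    start_of_today = today.replace(hour=0, minute=0, second=0, microsecond=0)  # like now/d
    cutoff = start_of_today - timedelta(days=offset_days)  # like now-10d/d
    tags = {}
    for username, raw in logins:
        ts = datetime.strptime(raw, FMT)
        if ts < cutoff:
            tag = "old_user"
        elif ts < start_of_today:
            tag = "active_user"
        else:
            continue  # today's logins fall outside both ranges
        tags.setdefault(username, set()).add(tag)
    return tags


tags = tag_logins(LOGINS, datetime(2019, 8, 20, 12, 0, 0))
for user, user_tags in sorted(tags.items()):
    print(user, sorted(user_tags))
# activeuser ['active_user', 'old_user']
# newuser ['active_user']
# olduser ['old_user']
```

Note that activeuser gets both tags, because the ranges look at individual login documents rather than at each user's latest login.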

This is because all the timestamps are stored in a single index and the aggregation runs over every doc, not once per user. I'm assuming you only want to work with the latest timestamp. You can do one of the following:

  1. Playing with bucket paths.

I'm thinking of running a max aggregation on the timestamp field, creating a bucket_path to it, and running the date_range aggregation on that bucket_path. This might be painful to get working. If you have issues, create another question for that.

  2. Add the field "is_active" to your docs. You can do it in two ways:

2a. Every time a user logs in, have your back-end code do the comparison. First fetch the user's most recent login with a query like the one below. Here <user_value> is a placeholder that your back-end code fills in, "_source" limits the result to the timestamp field, "size": 1 brings back a single doc, and the sort orders the timestamps descending:

{
    "query": {
        "match": {
            "username": "<user_value>"
        }
    },
    "_source": "timestamp",
    "size": 1,
    "sort": [
        { "timestamp": { "order": "desc" } }
    ]
}

Get the results in your back-end. If the timestamp you get is more than 10 days old, add "is_active": 0 (or any value you prefer, such as 'no') to the doc you are about to index. Otherwise add "is_active": 1 (or 'yes').

2b. Run a script in Logstash that parses the info. This will require you to:

  • Play with Ruby scripts
  • Send the info via sockets from your back-end
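
For option 2a, the back-end comparison could look roughly like the Python sketch below. It is only a sketch under a few assumptions: the function names `is_active_flag` and `build_login_doc` are invented here, the previous login timestamp is assumed to have already been fetched with the query from 2a, and treating a user with no previous login as active is my own choice, not something settled above.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # matches the index mapping's date format


def is_active_flag(previous_login, now, offset_days=10):
    """Return 0 if the previous login is more than offset_days old, else 1."""
    if previous_login is None:
        return 1  # assumption: a first-time login counts as active
    latest = datetime.strptime(previous_login, FMT)
    return 0 if now - latest > timedelta(days=offset_days) else 1


def build_login_doc(username, now, previous_login=None, offset_days=10):
    """Build the doc to index for this login, including the is_active field."""
    return {
        "timestamp": now.strftime(FMT),
        "username": username,
        "is_active": is_active_flag(previous_login, now, offset_days),
    }


now = datetime(2019, 8, 20, 9, 0, 0)
print(build_login_doc("olduser", now, "2019-01-31 09:33:19"))
print(build_login_doc("activeuser", now, "2019-08-19 06:09:08"))
```

With the sample data, olduser's doc gets "is_active": 0 and activeuser's doc gets "is_active": 1.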

Hope this is helpful! :D


1

I think I have a working solution, thanks to Kevin. Rather than using max and min dates, just get login counts and use cardinality aggregation to get the number of users. The final figures I want are just differences of the three values returned from the query.

GET mytest/_search?filter_path=aggregations.username_groups.buckets.key,aggregations.username_groups.buckets.username_counts.value,aggregations.active_and_inactive_and_new.value
{
  "size": 0,
  "aggs": {
    "active_and_inactive_and_new": {
      "cardinality": {
        "field": "username"
      }
    },
    "username_groups": {
      "range": {
        "field": "timestamp",
        "ranges": [
          {
            "to": "now-10d/d",
            "key": "active_and_inactive"
          },
          {
            "from": "now-10d/d",
            "key": "active_and_new"
          }
        ]
      },
      "aggs": {
        "username_counts": {
          "cardinality": {
            "field": "username"
          }
        }
      }
    }
  }
}
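
The "differences" can be worked out in a few lines of Python once the response is back. This is a sketch, not captured output: the `sample_response` dict is hand-written to match the shape implied by the filter_path above, with the values I would expect for the sample data, and the function name `user_counts` is made up for the example.

```python
def user_counts(response):
    """Derive inactive/new/active counts from the three cardinality values."""
    aggs = response["aggregations"]
    total = aggs["active_and_inactive_and_new"]["value"]
    by_key = {
        bucket["key"]: bucket["username_counts"]["value"]
        for bucket in aggs["username_groups"]["buckets"]
    }
    before = by_key["active_and_inactive"]  # users with a login before the cutoff
    since = by_key["active_and_new"]        # users with a login since the cutoff
    return {
        "inactive": total - since,          # never seen since the cutoff
        "new": total - before,              # never seen before the cutoff
        "active": before + since - total,   # seen on both sides of the cutoff
    }


# Hand-written response for the sample data (today assumed to be 2019-08-20).
sample_response = {
    "aggregations": {
        "active_and_inactive_and_new": {"value": 3},
        "username_groups": {
            "buckets": [
                {"key": "active_and_inactive", "username_counts": {"value": 2}},
                {"key": "active_and_new", "username_counts": {"value": 2}},
            ]
        },
    }
}

counts = user_counts(sample_response)
print(counts)  # {'inactive': 1, 'new': 1, 'active': 1}
```

Since cardinality is an approximate count, the derived figures can be slightly off on large user sets.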

1 Comment

That's a great solution! Just keep in mind that cardinality can hurt your performance a little (I have some bad memories about it) and the maximum precision_threshold is 40,000. If your own answer solved your problem, consider accepting it so people won't see this question as "not solved".
