Logically verifying column num_people
Yes, for the count of people you're right to observe that the 1st row of query 2 = sum(column 2) over (query 1), because:
- sum(column 2) over (query 1) represents "count of people that visited once + count of people that visited twice + … + count of people that visited 5 times",
- while the 1st row of query 2 represents "the 1st visit of all those people" (even those who visited 5 times had a first visit).
Logically, only those who visited twice or more had a second visit. Thus 2nd row of query 2 = sum(column 2) over (query 1) where num_visits >= 2 (7 = sum of only the 4 last rows of query 1 = 3 + 2 + 1 + 1).
Thus the count of people having an nth visit is the sum of all the people who visited a total of n times or more,
which we can express as an SQL windowed sum:
WITH visit_people_counts AS (
/* your first query here */
)
SELECT num_visits, SUM(num_people) OVER (ORDER BY num_visits DESC) people_having_a_nth_visit
FROM visit_people_counts
ORDER BY num_visits;
(see it as the first SELECT in this fiddle)
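As a quick cross-check outside the database, here is a minimal Python sketch of the same "n times or more" sum. The counts (5, 3, 2, 1, 1) are the per-visit-count figures quoted in this answer; the function and variable names are mine, not part of your query.

```python
from itertools import accumulate

# People per total number of visits (num_visits -> num_people),
# using the figures quoted in this answer.
people_by_total_visits = {1: 5, 2: 3, 3: 2, 4: 1, 5: 1}


def people_having_nth_visit(counts):
    """Everyone with n or more visits in total had an n-th visit."""
    visits_desc = sorted(counts, reverse=True)
    # Running total from the highest visit count down to the lowest,
    # like SUM(...) OVER (ORDER BY num_visits DESC).
    running = accumulate(counts[v] for v in visits_desc)
    return dict(sorted(zip(visits_desc, running)))


result = people_having_nth_visit(people_by_total_visits)
print(result)  # {1: 12, 2: 7, 3: 4, 4: 2, 5: 1}
```

The 12 and the 7 are the two values discussed above; the remaining ones follow the same rule.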
Logically verifying column avg_days_between
We could cross-validate by verifying the grand total (relying on "the sum of averages should be the average of sums"), and by hand-validating one item of this grand total.
For the hand validation, make sure to measure something significant; and for both the hand validation and the grand total, make sure to measure the same reality on both sides.
For example, we shouldn't conclude that patient12 has an average of 38 days between his visits, because the "average for 5" in the results doesn't mean "the average for patient12", but "the average for the 5th visit of patient12" (so it's an uninteresting case: OK, 38 days between 05-08 and 06-15 is quickly validated, but no averaging took place, so it's insignificant).
On the other hand, we cannot isolate "people with 1 visit only" in your results table either, because "1" groups together the first (and only) visit of patients with 1 visit, as well as the first visit of patients with up to 5 visits.
So for now we won't try to verify an individual line from your query (which would force us to at least reuse the first CTE, and thus trust it, which is not the way to validate…).
Instead we'll pick the individual validation from the "average of sums (of delays between two visits)" pool, which will be easier.
So let's start with the "average of sums" way:
Computing average delay patient by patient
A person making 5 visits with "on average" n days between them could have made each visit exactly n days after the previous one (thus on days 0, n, 2n, 3n, 4n). Or they could have visited 4 days in a row (days 0, 1, 2, 3) and then again 4n-3 days later (→ day 4n); it would give the same average.
What matters here is that the last visit always occurs on day 4n: the difference between first and last visit always = (count of visits - 1) * avg(duration between visits).
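To convince ourselves, here is a small Python sketch comparing the two schedules described above. The value n = 10 is an arbitrary choice of mine, and the helper names are hypothetical.

```python
def average_gap(visit_days):
    """Average of the delays between consecutive visits."""
    gaps = [later - earlier for earlier, later in zip(visit_days, visit_days[1:])]
    return sum(gaps) / len(gaps)


def span_based_average(visit_days):
    """Same average, computed only from the first and last visit."""
    return (max(visit_days) - min(visit_days)) / (len(visit_days) - 1)


n = 10  # arbitrary average delay, for illustration only
regular = [0, n, 2 * n, 3 * n, 4 * n]  # one visit every n days
clustered = [0, 1, 2, 3, 4 * n]        # 4 days in a row, then a long wait

for schedule in (regular, clustered):
    print(average_gap(schedule), span_based_average(schedule))
```

Both schedules give the same figure whichever way it is computed, because the intermediate visits cancel out in the sum of delays.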
Thus we can simplify: the "sum of delays between visits" is just MAX(hospital_visit_date) - MIN(hospital_visit_date), and the average delay is that sum divided by COUNT(hospital_visit_date) - 1:
WITH by_people AS
(
SELECT
DAYS(MAX(hospital_visit_date)) - DAYS(MIN(hospital_visit_date)) as total_span,
COUNT(hospital_visit_date) as num_visits
FROM hospital_visits
GROUP BY name
)
SELECT
num_visits, COUNT(*) num_people, AVG(1.0 * total_span) avg_total_span,
CASE WHEN num_visits > 1 THEN ROUND(AVG(1.0 * total_span) / (num_visits - 1), 1) END as avg_days_between
FROM by_people
GROUP BY num_visits
ORDER BY num_visits;
(second SELECT of the same fiddle)
| NUM_VISITS | NUM_PEOPLE | AVG_TOTAL_SPAN | AVG_DAYS_BETWEEN |
|---|---|---|---|
| 1 | 5 | 0 | null |
| 2 | 3 | 86.3 | 86.3 |
| 3 | 2 | 106 | 53.0 |
| 4 | 1 | 142 | 47.3 |
| 5 | 1 | 162 | 40.5 |
In fact at first I got only round numbers, but then I found it suspicious that there was no decimal part, and understood that AVG(162) / 4 returned the Euclidean division of 162 by 4, that is, 40 (with remainder 2) instead of 40.5.
So I introduced a 1.0 * in my expression to make it a floating point AVG().
So I learnt two things while doing this query:
- use DAYS(…) - DAYS(…) to get the day count between two dates. Just using … - … for periods spanning more than a month returns an unexpected number: instead of 162 (days), patient12 got 510, which reads as "5 months, 10 days" (January 5th to June 15th) and of course gives us an unexpected average.
- convert your integers to floating point before doing an AVG()
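Both pitfalls can be reproduced outside DB2 with a short Python sketch. The assumptions are mine: the decoding helper is hypothetical and only mimics how a date duration packs years, months and days into one number, and the year 2024 is chosen because the data shown here has no year and a leap year makes the span come out at 162 days.

```python
from datetime import date

# Pitfall 1: integer division silently drops the decimal part.
total_span = 162   # days between patient12's first and last visit
gap_count = 4      # 5 visits -> 4 delays
integer_result = total_span // gap_count   # 40, what the integer AVG gave
float_result = total_span / gap_count      # 40.5, what we wanted


# Pitfall 2: a date duration packs years, months and days as yyyymmdd.
def decode_date_duration(duration):
    """Split a yyyymmdd-style duration into (years, months, days)."""
    years, rest = divmod(duration, 10000)
    months, days = divmod(rest, 100)
    return years, months, days


# Assumed year (2024): not given in the data, picked so the span is 162 days.
day_count = (date(2024, 6, 15) - date(2024, 1, 5)).days

print(integer_result, float_result)   # 40 40.5
print(decode_date_duration(510))      # (0, 5, 10)
print(day_count)                      # 162
```

So 510 is not a number of days at all, which is why averaging it gives nonsense.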
Let's consider that this rounding error served as the hand-validation of one row (here patient12).
And thus our grand total of average days between visits should be (3 * 1 * 86.3 + 2 * 2 * 53 + 1 * 3 * 47.3 + 1 * 4 * 40.5) / 14 = 55.34 as an average delay over all visits (14 being the total number of delays: 3 * 1 + 2 * 2 + 1 * 3 + 1 * 4).
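The weighting is easy to get wrong by hand, so here is the same computation as a Python sketch. The rows are copied from the per-patient results above; the variable names are mine.

```python
# (num_visits, num_people, avg_days_between), from the per-patient results;
# the 1-visit row is left out because it has no delay at all.
rows = [(2, 3, 86.3), (3, 2, 53.0), (4, 1, 47.3), (5, 1, 40.5)]

# A patient with n visits contributes n - 1 delays.
total_gaps = sum(people * (visits - 1) for visits, people, _ in rows)
total_days = sum(people * (visits - 1) * avg for visits, people, avg in rows)
grand_average = total_days / total_gaps

print(total_gaps)                # 14
print(round(grand_average, 2))   # 55.34
```

Each group is weighted by its number of delays, not by its number of people.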
Validating against the delay computed visit number by visit number
If I wrap your last SELECT in a CTE then compute the grand total:
SELECT SUM(1.0 * num_people * avg_days_between) / SUM(num_people) FROM res WHERE current_visit IS NOT NULL;
I only get 55.0 within the Developer-C 11.1 DB2 fiddle.
So based on our now long experience with AVG(), we go back to your query, put a 1.0 * into AVG(1.0 * days_between_visits), and here we go with the fixed results:
| CURRENT_VISIT | NEXT_VISIT | NUM_PEOPLE | AVG_DAYS_BETWEEN |
|---|---|---|---|
| 1 | 2 | 7 | 59.4 |
| 2 | 3 | 4 | 54 |
| 3 | 4 | 2 | 52.5 |
| 4 | 5 | 1 | 38 |
| null | 1 | 12 | null |
(last query in the fiddle)
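To close the loop, a Python sketch of the grand total over these fixed results, compared with the 55.34 obtained patient by patient. The rows are copied from the table above (without the null row); the names are mine.

```python
# (current_visit, next_visit, num_people, avg_days_between), fixed results.
fixed_rows = [(1, 2, 7, 59.4), (2, 3, 4, 54.0), (3, 4, 2, 52.5), (4, 5, 1, 38.0)]

delay_count = sum(people for _, _, people, _ in fixed_rows)
weighted_days = sum(people * avg for _, _, people, avg in fixed_rows)
visit_by_visit_average = weighted_days / delay_count

per_patient_average = 55.34  # grand total computed patient by patient

print(delay_count)                       # 14
print(round(visit_by_visit_average, 2))  # 55.34
```

Both ways of slicing the data count the same 14 delays and agree on the average, up to the rounding of the displayed figures.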
Conclusion
… But after re-reading your question, I saw that you already had averages with decimals,
while copy-pasting your data and query to the fiddle returned:
- integer averages
- "random" results for the number of days, due to the DATE - DATE rule that I discovered
Moral: even if your query works in your environment, to make it portable and avoid unexpected side effects in other versions / configurations of DB2, you should apply the 2 rules that I learnt:
- convert your numbers from int to float before AVG()ing them
- convert your dates to int (with DAYS()) before subtracting them