-1

I have a dataset in SQL like this:

CREATE TABLE hospital_visits (
    name VARCHAR(50),
    hospital_visit_date DATE
);


INSERT INTO hospital_visits (name, hospital_visit_date) VALUES

('patient1', '2024-01-15'),
('patient2', '2024-02-20'),
('patient3', '2024-03-10'),
('patient4', '2024-04-05'),
('patient5', '2024-05-12'),


('patient6', '2024-01-08'),
('patient6', '2024-03-15'),
('patient7', '2024-02-14'),
('patient7', '2024-04-20'),
('patient8', '2024-01-25'),
('patient8', '2024-05-30'),


('patient9', '2024-01-12'),
('patient9', '2024-02-28'),
('patient9', '2024-05-15'),
('patient10', '2024-03-05'),
('patient10', '2024-04-10'),
('patient10', '2024-06-01'),


('patient11', '2024-01-20'),
('patient11', '2024-02-25'),
('patient11', '2024-04-15'),
('patient11', '2024-06-10'),


('patient12', '2024-01-05'),
('patient12', '2024-02-12'),
('patient12', '2024-03-20'),
('patient12', '2024-05-08'),
('patient12', '2024-06-15');

I first ran a query which does a group by/count on the number of visits:

WITH visit_counts AS (
    SELECT name, COUNT(*) as num_visits
    FROM hospital_visits
    GROUP BY name
)
SELECT num_visits, COUNT(*) as num_people
FROM visit_counts

GROUP BY num_visits

ORDER BY num_visits;

  num_visits num_people
          1          5
          2          3
          3          2
          4          1
          5          1

I then wrote an SQL query which looks at the average time between consecutive visits for different visit combinations:

 WITH visit_sequence AS (
    SELECT 
        name,
        hospital_visit_date,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY hospital_visit_date) as visit_number,
        LAG(hospital_visit_date) OVER (PARTITION BY name ORDER BY hospital_visit_date) as prev_visit_date
    FROM hospital_visits
),
visit_transitions AS (
    SELECT 
        name,
        CASE WHEN visit_number = 1 THEN NULL ELSE visit_number - 1 END as current_visit,
        visit_number as next_visit,
        hospital_visit_date,
        prev_visit_date,
        hospital_visit_date - prev_visit_date as days_between_visits
    FROM visit_sequence
)
SELECT 
    current_visit,
    next_visit,
    COUNT(*) as num_people,
    ROUND(AVG(days_between_visits), 1) as avg_days_between
FROM visit_transitions
GROUP BY current_visit, next_visit
ORDER BY current_visit, next_visit;


 current_visit next_visit num_people avg_days_between
            NA          1         12               NA
             1          2          7             59.4
             2          3          4             54.0
             3          4          2             52.5
             4          5          1             38.0

I am wondering: how can I check to make sure that both queries are correct? For example, the first row of the second query is 12 and the sum of the second column in the first query is also 12. Is this a logical check I can use?

2
  • 1
    You know your data... You know your requirements... You get to write your test scenarios and verify if they are correct. Commented Jun 17 at 5:07
  • 1
    You could start with much less data. Verify the results, add a little more data (one row?), verify the changes, repeat. Commented Jun 17 at 7:02

1 Answer 1

2

Logically verifying column num_people

Yes, for the count of people you're right finding that 1st row of query 2 = sum(column 2) over (query 1), because

  • sum(column 2) over (query 1) represents "count of people that visited once + count of people that visited twice + … + count of people that visited 5 times",
  • while 1st row of query 2 = sum(column 2) over (query 1) represents "1st visit of all those people" (so even those who visited 5 times had a first visit)

Logically, only those who visited twice or more had a second visit. Thus 2nd row of query 2 = sum(column 2) over (query 1) where num_visits >= 2 (7 = sum of only the 4 last rows of query 1 = 3 + 2 + 1 + 1).

Thus the count of people having a nth visit is the sum of every people having visited a total of n times or more,
which we can express in terms of an SQL windowed sum:

WITH visit_people_counts AS (
/* your first query here */
)
SELECT num_visits, SUM(num_people) OVER (ORDER BY num_visits DESC) people_having_a_nth_visit
FROM visit_people_counts
ORDER BY num_visits;

(see it as the first SELECT in this fiddle)

Logically verifying column avg_days_between

We could cross validate by verifying the grand total (relying on "the sum of averages should be the average of sums"), and by hand-validating one item of this grand total.

For the hand validation, beware measuring something significant; and for both the hand validation and the grand total, beware measuring the same reality.

For example, we shouldn't consider that patient12 has an average of 38 days between his visits, because the "average for 5" in the results array doesn't mean "the average for patient12", but "the average for the 5th visit of patient12" (so it's an uninteresting case: OK, 38 days between 05-08 and 06-15 is quickly validated, but we didn't do any average: it's unsignificant).
On the other hand, we cannot isolate "people with 1 visit only" in your results table either, because "1" regroups the first (and only visit) of patients with 1 visit, as well as the first visit of patients up to 5 visits.

So for now we'll not try verifying an individual line from your query (which would force us to at least reuse the first CTE, thus trust it, which is not the way to validate…).
Instead we'll take the individual validation in the "average of sums (of delays between two visits)" pool, which will be easier.

So let's start with the "average of sums" way:

Computing average delay patient by patient

The people doing 5 visits with "on average" n days between each could have done each visit exactly n days after the previous, to get this average (thus on days 0, n, 2n, 3n, 4n). Or it could have visited 4 times in a row (days 0, 1, 2, 3) and then 4n-3 days after (→ day 4n), it would give the same average.
What is important here is that the last visit always occur on day 4n: the difference between first and last visit always = (count of visits - 1) * avg(duration between visits).

Thus we can simply our "sum of delays between visits" as the MAX(hospital_visit_date) - MIN(hospital_visit_date) divided by COUNT(hospital_visit_date) - 1:

WITH by_people AS
(
    SELECT
        DAYS(MAX(hospital_visit_date)) - DAYS(MIN(hospital_visit_date)) as total_span,
        COUNT(hospital_visit_date) as num_visits
    FROM hospital_visits
    GROUP BY name
)
SELECT
    num_visits, COUNT(*) num_people, AVG(1.0 * total_span) avg_total_span,
    CASE WHEN num_visits > 1 THEN ROUND(AVG(1.0 * total_span) / (num_visits - 1), 1) END as avg_days_between
FROM by_people
GROUP BY num_visits
ORDER BY num_visits;

(second SELECT of the same fiddle)

NUM_VISITS NUM_PEOPLE AVG_TOTAL_SPAN AVG_DAYS_BETWEEN
1 5 0 null
2 3 86.3 86.3
3 2 106 53.0
4 1 142 47.3
5 1 162 40.5

In fact at first I got only round numbers, but then I found it suspect that it had no decimal part, and understood that AVG(162) / 5 returned the euclidian divide of 162 by 5, that is, 40 (with remainder 2) instead of 40.5.
So I introduced a 1.0 * in my expression to make it a floating point AVG().

So I learnt two things while doing this query:

  • use DAYS(…) - DAYS(…) to get day count between two dates. Just using … - … for periods spanning over more than a month returns an unexpected number.
    Instead of 162 (days) patient12 got 510, which reads as "5 months, 10 days" (January 5th to June 10th) but gives us an unexpected average of course.
  • convert your integers to floating point before doing an AVG()

Let's consider that this rounding error served as the hand-validation of one row (here patient12).

And thus our grand total of average days between visits should be (3 * 1 * 86.3 + 2 * 2 * 53 + 3 * 47.3 + 4 * 40.5) / 10 = 55.34 as an average delay over all visits.

Validating against delay number of visits by number of visits

If I wrap your last SELECT in a CTE then compute the grand total:
SELECT SUM(1.0 * num_people * avg_days_between) / SUM(num_people) FROM res WHERE current_visit IS NOT NULL;
I only get 55.0 within the Developer-C 11.1 DB2 fiddle.

So based on our now long-time experience with AVG(), we get back to your query, put an 1.0 into AVG(1.0 * days_between_visits), and here we go with the fixed results:

CURRENT_VISIT NEXT_VISIT NUM_PEOPLE AVG_DAYS_BETWEEN
1 2 7 59.4
2 3 4 54
3 4 2 52.5
4 5 1 38
null 1 12 null

(last query in the fiddle)

Conclusion

… But after having re-read your question, I saw that you already had averages with decimals.

While copy-pasting your data and query to the fiddle returned:

  1. integer averages
  2. "random" results for the number of days due to the DATE - DATE rule that I discovered

Morality: even if your query works in your environment, to make it universal and avoid unexpected side effects in other versions / configurations of DB2, you should apply the 2 rules that I learnt:

  • convert your numbers from int to float before AVG()ing them
  • convert your dates to int before subtracting them
Sign up to request clarification or add additional context in comments.

1 Comment

(after posting a first version of this answer that ended by "fix your query!", I re-reading the question, and it appeared that @farrow90 you already had the right results, contrary to what I got in the fiddle: so the conclusion is now to ensure your query works universally, because what works on your edition won't always work the same)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.