Skip Navigation
Expand
Reports have different totals for same group of data
Answer ID 1748   |   Last Review Date 03/18/2019

Why are there discrepencies in aggregate columns given the same data set?

Environment:

Analytics, Grouping Results

Resolution:

It is very important when adding aggregate columns to reports that you fully understand the data set being used. Especially in cases where you are counting a group of distinct items, the results can differ depending on which element you are counting and how those items are grouped.

For example, even a simple data set gives different results depending on how the data is grouped and counted.

Sample data set: 1,1,1,2,2,3,4,5,6,6

 

Case 1: If you use a count distinct function on the data set above, you get 6 as the result since there are only 6 distinct values of the 10 data points listed.

Case 2: If the data is grouped as illustrated below, however, the count can differ.

Group 1: 1, 2 (2 distinct values)
Group 2: 1, 2, 3, 4, 5 (5 distinct values)
Group 3: 6, 1, 6 (2 distinct values)

In this case, a count of distinct values may return 9 (2+5+2) because values are duplicated in the subsets that you may not intend.Therefore, when comparing reports it is strongly recommended that you consider the grouping used within the report.

 

Example:

Let's assume you have two aggregate columns, each displaying a count of distinct answer IDs grouped by the interface (interface 1 & interface 2)

Column #1 expression: count(distinct(if(ans_stats.interface_id=1, ans_stats.a_id)))

Column #2 expression: count(distinct(if(ans_stats.interface_id=2, ans_stats.a_id)))

 

Ultimately, you want to display a total count of unique answer IDs from ans_stats across all interfaces. Thus you create the following two columns:

Total #1 colum expression: count(distinct ans_stats.a_id)

Total #2 column expression: count(distinct(if(ans_stats.interface_id=1, ans_stats.a_id)))+count(distinct(if(ans_stats.interface_id=8, ans_stats.a_id)))

 

Initially you may expect these both to return the same value. The 'Total #1' column disregards the interface value while the 'Total #2' column simply adds the unique count of both interfaces together. However, if you have an answer ID row associated to both interfaces in ans_stats that value will be duplicated in the 'Total #2' column and the 'Total #1' column is the true count of unique answer IDs from ans_stats across all interfaces.

ex.

Total #1          Total #2          Column #1          Column #2

         30                  33                        30                        3