Why are there discrepencies in aggregate columns given the same data set?
Analytics, Grouping Results
It is very important when adding aggregate columns to reports that you fully understand the data set being used. Especially in cases where you are counting a group of distinct items, the results can differ depending on which element you are counting and how those items are grouped.
For example, even a simple data set gives different results depending on how the data is grouped and counted.
Sample data set: 1,1,1,2,2,3,4,5,6,6
Case 1: If you use a count distinct function on the data set above, you get 6 as the result since there are only 6 distinct values of the 10 data points listed.
Case 2: If the data is grouped as illustrated below, however, the count can differ.
Group 1: 1, 2 (2 distinct values)
Group 2: 1, 2, 3, 4, 5 (5 distinct values)
Group 3: 6, 1, 6 (2 distinct values)
In this case, a count of distinct values may return 9 (2+5+2) because values are duplicated in the subsets that you may not intend.Therefore, when comparing reports it is strongly recommended that you consider the grouping used within the report.
Let's assume you have two aggregate columns, each displaying a count of distinct answer IDs grouped by the interface (interface 1 & interface 2)
Column #1 expression: count(distinct(if(ans_stats.interface_id=1, ans_stats.a_id)))
Column #2 expression: count(distinct(if(ans_stats.interface_id=2, ans_stats.a_id)))
Ultimately, you want to display a total count of unique answer IDs from ans_stats across all interfaces. Thus you create the following two columns:
Total #1 colum expression: count(distinct ans_stats.a_id)
Total #2 column expression: count(distinct(if(ans_stats.interface_id=1, ans_stats.a_id)))+count(distinct(if(ans_stats.interface_id=8, ans_stats.a_id)))
Initially you may expect these both to return the same value. The 'Total #1' column disregards the interface value while the 'Total #2' column simply adds the unique count of both interfaces together. However, if you have an answer ID row associated to both interfaces in ans_stats that value will be duplicated in the 'Total #2' column and the 'Total #1' column is the true count of unique answer IDs from ans_stats across all interfaces.
Total #1 Total #2 Column #1 Column #2
30 33 30 3