Why are there discrepencies in aggregate columns given the same data set?
Environment:
Analytics, Grouping Results
Resolution:
It is very important when adding aggregate columns to reports that you fully understand the data set being used. Especially in cases where you are counting a group of distinct items, the results can differ depending on which element you are counting and how those items are grouped.
For example, even a simple data set gives different results depending on how the data is grouped and counted.
Sample data set: 1,1,1,2,2,3,4,5,6,6
Case 1: If you use a count distinct function on the data set above, you get 6 as the result since there are only 6 distinct values of the 10 data points listed.
Case 2: If the data is grouped as illustrated below, however, the count can differ.
Group 1: 1, 2 (2 distinct values)
Group 2: 1, 2, 3, 4, 5 (5 distinct values)
Group 3: 6, 1, 6 (2 distinct values)
In this case, a count of distinct values may return 9 (2+5+2) because values are duplicated in the subsets that you may not intend.Therefore, when comparing reports it is strongly recommended that you consider the grouping used within the report.
Example:
Let's assume you have two aggregate columns, each displaying a count of distinct answer IDs grouped by the interface (interface 1 & interface 2)
Column #1 expression: count(distinct(if(ans_stats.interface_id=1, ans_stats.a_id)))
Column #2 expression: count(distinct(if(ans_stats.interface_id=2, ans_stats.a_id)))
Ultimately, you want to display a total count of unique answer IDs from ans_stats across all interfaces. Thus you create the following two columns:
Total #1 colum expression: count(distinct ans_stats.a_id)
Total #2 column expression: count(distinct(if(ans_stats.interface_id=1, ans_stats.a_id)))+count(distinct(if(ans_stats.interface_id=8, ans_stats.a_id)))
Initially you may expect these both to return the same value. The 'Total #1' column disregards the interface value while the 'Total #2' column simply adds the unique count of both interfaces together. However, if you have an answer ID row associated to both interfaces in ans_stats that value will be duplicated in the 'Total #2' column and the 'Total #1' column is the true count of unique answer IDs from ans_stats across all interfaces.
ex.
Total #1 Total #2 Column #1 Column #2
30 33 30 3