Monday, September 29, 2014

Aggregate Bins / Summarized Bins

Have you ever wanted to create 'bins', 'buckets', or some categorical breakdown from a number that needs to be aggregated before the bins are created?

For example, say you had data that was one row per transaction, but you wanted to bin the customers into lifetime spend buckets.  You first need to aggregate the data to the customer level and THEN make the bins out of it (from SQL perspective need to sum the sales and 'group by' customer).  This is not possible using Tableau's normal 'Create Bin' functionality because that will result in each row (transaction) falling into a particular bin, not the sum of all transactions for a given customer falling into a bin.

The first step is to create the aggregate bins.  This can be done with a neat math trick using the INT() function to round things to the nearest bin:


Where X is the size you want each bin to be (in this case 0-1000, 1000-2000, 2000-3000, etc,)

After you create it, you can right-click on this field and choose 'Convert to Discrete' so it appears blue in your data window.  This will make sure it displays as text headers when you drag it to your Columns instead of making an axis.

Drag the new calculated field to Columns.

Drag the Customer Name field to Detail (this will make one mark per customer).

Next we need to create another calculated field that will sum up the number of customers in each bin:

window_sum(countd([Customer Name])

Now drag that field onto the Rows shelf.  After that, right-click on it and choose 'Compute Using-> Customer Name'.  This will tell Tableau to compute the window sum across all the customers.

The numbers in the resulting view will probably look way too high.  This is because Tableau is stacking the same number on top of one another over and over (once for each customer).  To fix this we need to create one final calculated field that will be used to filter the view to only show the last mark in each bin:


Put that new calculated field onto the Filters shelf and check the box for 'True' (or dont, we are going to change the settings anyway so it doesnt matter what you initially choose).

Now right-click the field you just put on the Filters and choose 'Compute Using->Customer Name'.  It will pop up the filter dialogue box again and this time only select 'True'.

Almost there!  So now you probably have a view that looks like a single bar.  The last step to fix this is to right-click on the bins field on your Columns and uncheck the option that says 'Ignore in Table Calculations':

(This is very important!)

Once you uncheck that, each customer will be placed into the correct bin based on the total spend they have in the entire dataset.

Last cleanup item would be to edit the tooltip and remove the Customer Name from showing up.  It is showing the last customer in each bin, but that really doesnt mean anything and can be misleading to users.

And there you have it!  Bins based on aggregated/summarized number instead of at the row level.

Note:  Performance can be a concern using this technique if you have hundreds of thousands or millions of 'things' you are trying to count.  For instance in this example we are counting Customers.  If I have millions of customers in my dataset this technique could be slow.  This is because tableau is still including all the customers in the Group By statement of the query we sent to the database and then we are doing some local math (the window sum computation) and hiding them locally in tableau (the 'show last' calculation).  Alternatives would be to create a aggregated extract in tableau (hide all dimensions at a lower level than Customer Name and choose 'Aggregate for visible dimensions' when creating your extract) or create a pre-aggregated view/table in the database and connect tableau to that.


-The Don Data

1 comment:

  1. Very nice. My only concern is that when using this method you remove the ability to see what accounts (or whatever is you are aggregating over prior to binning) make up the population within each bin.

    This method is nice as it allows Tableau filters to be applied to the data without the need to paramatarise a customer sql query.

    Anyway of geting around this loss of visibility of the underlying population this without pre-aggregatingthe data within a SQL query?