I’ve just spent two hours with our DBA trying to force SQL to apply the 80/20 rule to a massive data source. I’m so pleased with the result that I have to write it up as a tutorial, even though it is not technically an Excel solution. In my defence, I will use the output in Excel, and the SQL trick will enable me to play with manageable volumes rather than gazillions of records. Besides, it is a totally exciting solution and I can’t resist sharing.
The 80/20 Rule or Pareto Principle
The 80/20 rule, or Pareto principle, is sometimes called the law of the vital few. It is terribly useful in business and in analysis generally, as it enables you to focus your limited effort on the areas that will make the biggest difference. Using this rule can turn you into a seriously efficient and effective analyst. Without it you run the risk of being run ragged analysing trivia.
The basis is that 80% of the results can be explained by 20% of the inputs; so, for example, 20% of a company’s customers generate 80% of the revenue. Quite often the ratios are even more skewed, e.g. 99% of all complaints are generated by only 1% of customers, but the principle is the same.
With the two examples above, the run-ragged approach would see you spending a lot of your time responding to and worrying about all your complaints, while at the same time attempting to attract your entire customer base to return for a re-order with an unfocussed marketing campaign. Alternatively, you could determine whether your 1% of complainants also fell into the 20% of customers who generated 80% of your income. Let’s assume for argument’s sake that they didn’t; you could then reprioritise your efforts away from the noise and focus on the 20% of customers that keep you afloat.
If we are looking for NHS examples:
- The bulk of a hospital’s expenditure (80%) is related to a relatively small proportion of patients (20%).
- Most GP appointments are consumed by a small proportion of patients – the chronically sick.
Using the 80/20 Rule to focus Analytical Efforts
My analytical challenge is to amend the monthly income estimate based on some last minute trend data that I have access to.
The reporting estimate has shed loads of data, with each row comprising hospital site code, specialty code, POD, HRG and commissioner code. The data is a profiled estimate based on the actual activity achieved over the previous months. It includes piffling low-volume records where perhaps a single activity record in the past 6 months is split into a possible 0.166666 activities estimated for M7.
I need to increase or decrease this estimate based on a trend I can see from another more recent activity source.
Given that I need to take this estimate and then apply a scaling factor, with the output in an Excel file (this is how our adjustment process works), I really don’t want to be messing around with many thousands of records, changing an activity estimate of 0.166666 to a far more accurate but equally inconsequential 0.183333.
That’s where the 80/20 rule comes into its own. I reckon that I can trim down the volume of data in the estimate file to about 20% of its full size and still have records that contribute in the region of 80% of the value. If I focus my efforts here I can save a lot of time and potential for error that arises when you deal with excessive volumes.
How would I do this in Excel?
This illustration shows an extract of data from my original estimates table (on the left). I’ve coloured it to represent the four different partitions I’m interested in, one for each unique Site/POD/Spec combination that exists.
On the right I have done a bit of analysis to show the cumulative total by partition and then the cumulative % by partition. To do this I have had to sort the data in descending order of price within each partition and then work out the running total for each.
I’ve dragged the colouring across to show the rows that I want to retain on the basis that they contribute to at least 80% of the total value of each partition.
This second set of data has only 5 rows but still accounts for £3.5m.
In 80/20 speak, 23% of the data contribute to 90% of the value. That’s pretty good!
How to use the 80/20 rule in SQL Server
My SQL skills are fairly rudimentary, so I’m going to explain the process step by step so I have a chance of remembering what we did and why. There may be better ways to do this; if so, shout up in the comments section – I like to learn.
First of all, my script is riddled with CTEs. That stands for common table expression, and it basically allows you to create a temporary, named result set on the fly that you can reference within your query. A CTE isn’t stored and lives only as long as the query.
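Here’s a minimal runnable sketch of a CTE in action. I’m using SQLite from Python purely so the snippet can be run anywhere (the real work was in SQL Server), and the table and figures are invented:

```python
import sqlite3

# Minimal CTE sketch: the WITH clause defines a temporary, named result set
# (big_orders) that exists only for the duration of this single query.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 50), (2, 500), (3, 900);
""")
result = con.execute("""
WITH big_orders AS (
    SELECT id, amount FROM orders WHERE amount > 100
)
SELECT COUNT(*), SUM(amount) FROM big_orders
""").fetchone()
print(result)  # (2, 1400.0) - two orders qualify, totalling 1400
```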
Step 1 – Summarise the Original Estimate Data
I only want selected PODs and specialties in my extract, and I want to group my detailed PODs to a higher-level grouping, e.g. OPFAMPCL and OPFASPCL to Outpatients.
The estimates table holds all the historic estimates as well, so the first CTE, called cte_max_month, returns the maximum and therefore latest month’s estimate. I use this in the second CTE, called cte_POD.
The third CTE, called cte_All, just aggregates at my new higher level WEEKLY_POD to ensure I have only one record returned for each unique level of detail.
WITH cte_max_month AS (
    SELECT MAX(MONTH) maxm
    FROM [BH_SLAM_STAGE].[dbo].[QVD_Income_Estimate_1718]
),
cte_POD AS (
    SELECT [Physical Site Code]
        ,CASE WHEN [POD] IN ('NAFF','OPFAMPCL','OPFASPCL','OPFUPMPCL','OPFUPSPCL','OPPROC')
              THEN 'Outpatients' ELSE [POD] END AS WEEKLY_POD
        ,[SpecCode]
        ,[CCGCode]
        ,SUM([priceactual]) price
        ,SUM([activityactual]) activity
    FROM [BH_SLAM_STAGE].[dbo].[QVD_Income_Estimate_1718] est
    LEFT JOIN cte_max_month ON est.month = cte_max_month.maxm
    WHERE POD IN ('DC','EL','NAFF','OPFAMPCL','OPFASPCL','OPFUPMPCL','OPFUPSPCL','OPPROC','NEL')
        AND [SpecCode] NOT IN ('650','812')
        AND MONTH = maxm
    GROUP BY [Physical Site Code]
        ,[POD]
        ,[SpecCode]
        ,[CCGCode]
),
cte_All AS (
    SELECT [Physical Site Code]
        ,WEEKLY_POD
        ,[SpecCode]
        ,[CCGCode]
        ,SUM([price]) price
        ,SUM([activity]) activity
    FROM cte_POD
    GROUP BY [Physical Site Code]
        ,[WEEKLY_POD]
        ,[SpecCode]
        ,[CCGCode]
)
If I were to return the output of cte_All it would look something like the left hand section of the partition illustration shown above.
Step 2 – Use OVER(PARTITION BY) to insert a Running Total and a Partition Total
At this stage we begin to build on the summary table cte_All and add a series of columns that will help identify the top 80%.
It utilises OVER(PARTITION BY). OVER allows you to aggregate data and return the detail rows alongside the aggregate:
SELECT SUM(price) OVER() AS VALUE
    ,[Physical Site Code]
    ,WEEKLY_POD
    ,[SpecCode]
    ,[CCGCode]
FROM cte_All
This returns all the detail from cte_All, but the first column holds the total estimate value for the entire file.
In this case I want to split my data into partitions that correspond to unique combinations of Site/POD/Spec:
SELECT SUM(price) OVER (PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode]) AS VALUE
    ,[Physical Site Code]
    ,WEEKLY_POD
    ,[SpecCode]
    ,[CCGCode]
FROM cte_All
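To make the difference concrete, here’s a runnable sketch on invented data (SQLite from Python, with simplified stand-in column names): OVER() stamps the grand total on every detail row, while OVER(PARTITION BY ...) stamps each row with its own group’s total.

```python
import sqlite3

# Sketch of OVER() vs OVER(PARTITION BY): window totals are repeated on
# every detail row instead of collapsing the rows as GROUP BY would.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cte_All (Site TEXT, Spec TEXT, price REAL);
INSERT INTO cte_All VALUES ('A','100',10), ('A','100',30), ('B','200',60);
""")
rows = con.execute("""
SELECT Site, Spec, price,
       SUM(price) OVER ()                        AS grand_total,
       SUM(price) OVER (PARTITION BY Site, Spec) AS partition_total
FROM cte_All
ORDER BY Site, price
""").fetchall()
for r in rows:
    print(r)
```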
If you also add ORDER BY in the syntax you can use it to return a running total.
So while this script returns the total value for each partition:
SUM(price) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode]) AS Grp_Price
This script returns the running total for each partition:
SUM(price) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY price DESC) AS Running_Price
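Here’s a runnable sketch of that distinction on invented numbers (SQLite from Python): with ORDER BY price DESC inside the window, each row accumulates its partition’s prices from the largest down to itself.

```python
import sqlite3

# Adding ORDER BY inside the OVER() clause turns the partition total into
# a running total, accumulated in descending price order.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cte_All (Site TEXT, price REAL);
INSERT INTO cte_All VALUES ('A', 500), ('A', 300), ('A', 200);
""")
rows = con.execute("""
SELECT Site, price,
       SUM(price) OVER (PARTITION BY Site)                     AS Grp_Price,
       SUM(price) OVER (PARTITION BY Site ORDER BY price DESC) AS Running_Price
FROM cte_All
ORDER BY price DESC
""").fetchall()
for r in rows:
    print(r)  # Running_Price climbs 500 -> 800 -> 1000; Grp_Price stays 1000
```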
Here’s the next stage of my script:
cte_Groups AS (
    SELECT [Physical Site Code]
        ,WEEKLY_POD
        ,[SpecCode]
        ,[CCGCode]
        ,price
        ,activity
        ,COUNT(*) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode]) AS Grp_Count
        ,SUM(price) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode]) AS Grp_Price
        ,SUM(activity) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode]) AS Grp_Activity
        ,SUM(price) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY price DESC) AS Running_Price
        ,SUM(activity) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY activity DESC) AS Running_Activity
    FROM cte_All
)
At this point I am starting to get data that looks useful to me. If this were in Excel I would very easily be able to calculate the cumulative percentages and work out a way to pick out the ones I considered significant.
Step 3 – Tidy up the table to remove future Divide by Zero errors
I freely admit to not understanding this stage very well. In step 4 we were getting divide-by-zero errors where a partition’s total price or activity summed to zero, so this step simply forces those group totals to a value of 1 before they are used as denominators. I’m sure this could be refined in some way.
cte_cte_Filter AS (
    SELECT [Physical Site Code]
        ,WEEKLY_POD
        ,[SpecCode]
        ,[CCGCode]
        ,price
        ,activity
        ,Grp_Count
        ,IIF(Grp_Price = 0, 1, Grp_Price) AS Grp_Price
        ,IIF(Grp_Activity = 0, 1, Grp_Activity) AS Grp_Activity
        ,Running_Price
        ,Running_Activity
    FROM cte_Groups
)
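One possible refinement I’m aware of (an assumption on my part, not what we actually ran): SQL Server’s NULLIF(Grp_Price, 0) turns a zero denominator into NULL, so the division yields NULL rather than an error. A quick sketch of NULLIF behaviour (in SQLite from Python, which happens not to raise on divide-by-zero but evaluates NULLIF in the same way):

```python
import sqlite3

# NULLIF(a, b) returns NULL when a = b, otherwise a. Dividing by
# NULLIF(denominator, 0) therefore yields NULL instead of a zero division.
con = sqlite3.connect(":memory:")
zero_case = con.execute("SELECT 1.0 * 5 / NULLIF(0, 0)").fetchone()[0]
ok_case = con.execute("SELECT 1.0 * 5 / NULLIF(10, 0)").fetchone()[0]
print(zero_case)  # None - the zero denominator became NULL
print(ok_case)    # 0.5
```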
Step 4 – Add Cumulative Proportion Columns
cte_Portions AS (
    SELECT [Physical Site Code]
        ,WEEKLY_POD
        ,[SpecCode]
        ,[CCGCode]
        ,price
        ,activity
        ,Grp_Count
        ,Running_Price
        ,Grp_Price
        ,1.0 * Running_Price / Grp_Price AS Portion_Pct_Price
        ,1.0 * ISNULL(Running_Activity, 0) / Grp_Activity AS Portion_pct_Activity
    FROM cte_cte_Filter
)
In this extract we can see that the very first record contributes more than 80% of the value of DC for Spec Code 100.
Step 5 – Using SQL LEAD and LAG functions to Identify at least 80% of data by Value
In the table above, we could return the correct subset by using WHERE Portion_Pct_Price <= 0.8.
However, there are many cases where two or more CCGs make a significant contribution, e.g.:
In this example 13R contributes 51.6% and so would be captured by the Portion_Pct_Price <= 0.8 filter. The next CCG, 13G, brings the cumulative Portion_Pct_Price to 85.3% and so would fail the <= 0.8 filter. To ensure that at least 80% is captured, we use the LEAD function, which reveals information about the next record in the partition, and the LAG function, which reveals the detail of the previous record. In this case I want to return rows where the cumulative % is at most 80%, plus the first row where the cumulative % exceeds 80% but the previous row’s value was below it (or there was no previous row). Here’s the script that does that along with the final output query:
cte_Lead AS (
    --this query determines the cumulative portion % for the next (lead) and the previous (lag) record by partition
    SELECT [Physical Site Code]
        ,WEEKLY_POD
        ,[SpecCode]
        ,[CCGCode]
        ,price
        ,activity
        ,Grp_Count
        ,Running_Price
        ,Grp_Price
        ,Portion_Pct_Price
        ,Portion_pct_Activity
        ,LEAD(Portion_Pct_Price, 1) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY price DESC) AS Nxt_Portion_Pct_Price
        ,LEAD(Portion_pct_Activity, 1) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY price DESC) AS Nxt_Portion_pct_Activity
        ,LAG(Portion_Pct_Price, 1) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY price DESC) AS Pre_Portion_Pct_Price
        ,LAG(Portion_pct_Activity, 1) OVER(PARTITION BY [Physical Site Code], WEEKLY_POD, [SpecCode] ORDER BY price DESC) AS Pre_Portion_pct_Activity
    FROM cte_Portions
)
--output
SELECT *
FROM cte_Lead
WHERE Portion_Pct_Price <= 0.8

UNION ALL

SELECT *
FROM cte_Lead
WHERE (Portion_Pct_Price > 0.8 AND Pre_Portion_Pct_Price < 0.8)
    OR (Portion_Pct_Price > 0.8 AND Pre_Portion_Pct_Price IS NULL)
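To check the logic end to end, here’s a toy version of the whole pipeline (SQLite from Python, with invented prices chosen to mirror the 51.6%/85.3% example above). Note that I’ve folded the UNION ALL into a single WHERE with ORs here; since the conditions are disjoint, the result is the same set of rows.

```python
import sqlite3

# Toy end-to-end 80/20 filter: cumulative % per partition, then keep rows
# up to 80% plus the first row that tips the cumulative % past 80%.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cte_All (Site TEXT, POD TEXT, Spec TEXT, CCG TEXT, price REAL);
INSERT INTO cte_All VALUES
  ('RVJ','DC','100','13R', 516),  -- 51.6% of its partition: kept
  ('RVJ','DC','100','13G', 337),  -- cumulative 85.3%: first past 80%, kept
  ('RVJ','DC','100','13P', 147);  -- cumulative 100%: dropped
""")
rows = con.execute("""
WITH cte_Portions AS (
    SELECT Site, POD, Spec, CCG, price,
           1.0 * SUM(price) OVER (PARTITION BY Site, POD, Spec
                                  ORDER BY price DESC)
               / SUM(price) OVER (PARTITION BY Site, POD, Spec) AS Portion_Pct_Price
    FROM cte_All
),
cte_Lead AS (
    SELECT *,
           LAG(Portion_Pct_Price, 1) OVER (PARTITION BY Site, POD, Spec
                                           ORDER BY price DESC) AS Pre_Portion_Pct_Price
    FROM cte_Portions
)
SELECT CCG, Portion_Pct_Price
FROM cte_Lead
WHERE Portion_Pct_Price <= 0.8
   OR (Portion_Pct_Price > 0.8 AND Pre_Portion_Pct_Price < 0.8)
   OR (Portion_Pct_Price > 0.8 AND Pre_Portion_Pct_Price IS NULL)
ORDER BY price DESC
""").fetchall()
for r in rows:
    print(r)  # 13R and 13G survive; 13P is trimmed away
```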
Outstanding Issues to Consider
At the moment my process assumes all my financial values are positive. If I had significant credits in my dataset I would be ignoring them. That might not be appropriate for your application so think carefully about it – perhaps you could let me know how you solved it.