The ANSI standard SQL dialect provides a set of aggregate functions for performing quantitative analysis on database tables, such as SUM(), MAX(), MIN(), AVG(), and COUNT(). There are two main ways these functions get used in practice: (i) apply aggregate functions to groups of rows, collapsing those rows down to single rows containing only the aggregate results, or (ii) apply the same aggregate calculations over groupings of rows but keep the original row-level granularity, presenting the aggregate results as new columns on each row. This second option, where row granularity is maintained alongside the aggregate calculations, is achieved through window functions, which are the focus of this article.
I should note that window functions not only enable fairly complex analytics right in your SQL-based database, but these same techniques become quite powerful as SQL extends to technologies like Hive, Spark, and ksqlDB for batch and stream data processing on highly parallelized compute clusters.
To aid in this discussion I will use the following simple orders table.
create table orders (
  id integer primary key,
  date date not null,
  year integer not null,
  month integer not null,
  amount decimal(9, 2) not null
);
insert into orders
(id, date, year, month, amount)
values
(1, '2021-01-10', 2021, 1, 17.52),
(2, '2021-01-12', 2021, 1, 81.27),
(3, '2021-01-18', 2021, 1, 71.28),
(4, '2021-01-19', 2021, 1, 10.03),
(5, '2021-01-21', 2021, 1, 76.96),
(6, '2021-02-03', 2021, 2, 12.46),
(7, '2021-02-12', 2021, 2, 25.30),
(8, '2021-02-23', 2021, 2, 13.48),
(9, '2021-02-25', 2021, 2, 33.70),
(10, '2021-03-03', 2021, 3, 83.41),
(11, '2021-03-07', 2021, 3, 38.70),
(12, '2021-03-17', 2021, 3, 58.77);
Aggregate calculations are often applied in the context of collapsing the data in a table down to either a single row representing the entire table or a row per grouping of data as specified in a GROUP BY clause.
For example, let's say I want to find the average order price in the orders table. That would look something like the following.
orders=# select AVG(amount) as avg_order from orders;
avg_order
---------------------
43.5733333333333333
(1 row)
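For the second, collapsed-per-group case mentioned earlier, the same aggregate paired with a GROUP BY clause returns one row per month instead of one row for the whole table. Here is a quick sketch of what that looks like against the orders table (avg_month_order is just an illustrative alias).

select month, AVG(amount) as avg_month_order
from orders
group by month
order by month;

This returns three rows, one per month, with averages of roughly 51.41, 21.24, and 60.29, but the individual orders are collapsed away.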
However, what if I instead wanted to compare each order to the average order price? Window functions provide a way to apply an analytic function, such as an average of the order amount, over a defined window of rows without collapsing them down. That is exactly what a window function with the default window of the entire result set does for us.
orders=# select id,
date,
amount,
AVG(amount) OVER() as avg_order
from orders;
id | date | amount | avg_order
----+------------+--------+---------------------
1 | 2021-01-10 | 17.52 | 43.5733333333333333
2 | 2021-01-12 | 81.27 | 43.5733333333333333
3 | 2021-01-18 | 71.28 | 43.5733333333333333
4 | 2021-01-19 | 10.03 | 43.5733333333333333
5 | 2021-01-21 | 76.96 | 43.5733333333333333
6 | 2021-02-03 | 12.46 | 43.5733333333333333
7 | 2021-02-12 | 25.30 | 43.5733333333333333
8 | 2021-02-23 | 13.48 | 43.5733333333333333
9 | 2021-02-25 | 33.70 | 43.5733333333333333
10 | 2021-03-03 | 83.41 | 43.5733333333333333
11 | 2021-03-07 | 38.70 | 43.5733333333333333
12 | 2021-03-17 | 58.77 | 43.5733333333333333
(12 rows)
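With the average now carried along on every row, the comparison I originally asked about is just a subtraction away. As a sketch (diff_from_avg is simply an illustrative alias), each order's difference from the overall average could be computed like this.

select id,
       date,
       amount,
       amount - AVG(amount) OVER () as diff_from_avg
from orders
order by date;

Orders below the overall average of roughly 43.57 come back with a negative difference and orders above it with a positive one.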
The part that is specific to the window function, sometimes also referred to as an analytic function, is the line below.
AVG(amount) OVER() as avg_order
You can break this down into the general form:
ANALYTIC_FUNCTION(expression ...) OVER (window definition) as alias
Here the ANALYTIC_FUNCTION can be any standard or custom function that returns a scalar value. The aggregate functions used throughout this article are the ones I've found most useful; the PostgreSQL documentation on Window Functions covers many more.
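To make the general form concrete, here is a sketch that applies several of the common aggregate functions as window functions in a single query; each one follows the same ANALYTIC_FUNCTION(...) OVER (...) pattern, and the aliases are purely illustrative.

select id,
       date,
       amount,
       MIN(amount) OVER () as min_order,
       MAX(amount) OVER () as max_order,
       COUNT(*)    OVER () as order_count,
       SUM(amount) OVER () as total_sales
from orders;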
You define the window of rows that the analytic function is applied to using the OVER(...) operator, which has three optional clauses, taking the following form.
OVER (
  PARTITION BY ...
  ORDER BY ...
  ROWS BETWEEN ...
)
Here is a breakdown of what each of these means. PARTITION BY divides the rows into groups (partitions), and the analytic function is applied to each partition independently. ORDER BY controls the order of the rows within each partition, which matters for ranking functions and for running calculations. ROWS BETWEEN defines the frame, that is, which rows relative to the current row the function actually operates over.
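As a quick preview of how the three clauses fit together, here is a sketch that uses all of them at once to average each order with its immediate neighbors within the same month, ordered by date (neighbor_avg is just an illustrative alias); each clause is revisited individually in the examples that follow.

select id,
       date,
       month,
       amount,
       AVG(amount) OVER (
         PARTITION BY month                          -- limit the window to the order's month
         ORDER BY date                               -- order rows within the month by date
         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING    -- frame: previous, current, and next row
       ) as neighbor_avg
from orders
order by date;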
I've found that this notion of a windowed set of rows can be a bit difficult to internalize if you've not used the technique much, so let me expand on it with another example.
For the new example, I use the PARTITION BY clause of the OVER(...) operator, which restricts the window of rows being operated on to those within each month. Put more practically, I am calculating the average order amount for each month.
orders=# select id,
date,
month,
amount,
AVG(amount) OVER () as avg_order,
AVG(amount) OVER (PARTITION BY month) as avg_month_order
from orders
order by date;
id | date | month | amount | avg_order | avg_month_order
----+------------+-------+--------+---------------------+---------------------
1 | 2021-01-10 | 1 | 17.52 | 43.5733333333333333 | 51.4120000000000000
2 | 2021-01-12 | 1 | 81.27 | 43.5733333333333333 | 51.4120000000000000
3 | 2021-01-18 | 1 | 71.28 | 43.5733333333333333 | 51.4120000000000000
4 | 2021-01-19 | 1 | 10.03 | 43.5733333333333333 | 51.4120000000000000
5 | 2021-01-21 | 1 | 76.96 | 43.5733333333333333 | 51.4120000000000000
6 | 2021-02-03 | 2 | 12.46 | 43.5733333333333333 | 21.2350000000000000
7 | 2021-02-12 | 2 | 25.30 | 43.5733333333333333 | 21.2350000000000000
8 | 2021-02-23 | 2 | 13.48 | 43.5733333333333333 | 21.2350000000000000
9 | 2021-02-25 | 2 | 33.70 | 43.5733333333333333 | 21.2350000000000000
10 | 2021-03-03 | 3 | 83.41 | 43.5733333333333333 | 60.2933333333333333
11 | 2021-03-07 | 3 | 38.70 | 43.5733333333333333 | 60.2933333333333333
12 | 2021-03-17 | 3 | 58.77 | 43.5733333333333333 | 60.2933333333333333
(12 rows)
An interesting and quite useful aspect of window functions is the ability to change the order of the rows within the defined window via the familiar ORDER BY clause. Where this is particularly useful is in determining the rank of each row within the window. For example, say I want to rank orders from lowest to highest based on their order amount while still keeping the overall result sorted by date. I accomplish this using the RANK() function and specifying ORDER BY amount within the OVER(...) operator, like so.
orders=# select id,
date,
amount,
RANK() OVER(ORDER BY amount) as amount_rank
from orders
order by date;
id | date | amount | amount_rank
----+------------+--------+-------------
1 | 2021-01-10 | 17.52 | 4
2 | 2021-01-12 | 81.27 | 11
3 | 2021-01-18 | 71.28 | 9
4 | 2021-01-19 | 10.03 | 1
5 | 2021-01-21 | 76.96 | 10
6 | 2021-02-03 | 12.46 | 2
7 | 2021-02-12 | 25.30 | 5
8 | 2021-02-23 | 13.48 | 3
9 | 2021-02-25 | 33.70 | 6
10 | 2021-03-03 | 83.41 | 12
11 | 2021-03-07 | 38.70 | 7
12 | 2021-03-17 | 58.77 | 8
(12 rows)
Here I see that the lowest rank goes to the lowest order amount and the highest rank goes to the largest order amount.
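One thing worth knowing about RANK() is how it handles ties: rows with equal amounts receive the same rank and a gap is left in the sequence afterward. The orders table happens to contain no duplicate amounts, but as a sketch you could compare RANK() with its close relatives DENSE_RANK(), which leaves no gaps, and ROW_NUMBER(), which always assigns a unique number (the aliases here are illustrative).

select id,
       date,
       amount,
       RANK()       OVER (ORDER BY amount) as rank_with_gaps,
       DENSE_RANK() OVER (ORDER BY amount) as rank_no_gaps,
       ROW_NUMBER() OVER (ORDER BY amount) as row_num
from orders
order by date;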
Of course, this also works with window subsets defined using the PARTITION BY clause of the OVER(...) operator, so finding the rank of order amounts within each month works as follows.
orders=# select id,
date,
month,
amount,
RANK() OVER (
PARTITION BY month
ORDER BY amount
) as month_amt_rank
from orders
order by date;
id | date | month | amount | month_amt_rank
----+------------+-------+--------+----------------
1 | 2021-01-10 | 1 | 17.52 | 2
2 | 2021-01-12 | 1 | 81.27 | 5
3 | 2021-01-18 | 1 | 71.28 | 3
4 | 2021-01-19 | 1 | 10.03 | 1
5 | 2021-01-21 | 1 | 76.96 | 4
6 | 2021-02-03 | 2 | 12.46 | 1
7 | 2021-02-12 | 2 | 25.30 | 3
8 | 2021-02-23 | 2 | 13.48 | 2
9 | 2021-02-25 | 2 | 33.70 | 4
10 | 2021-03-03 | 3 | 83.41 | 3
11 | 2021-03-07 | 3 | 38.70 | 1
12 | 2021-03-17 | 3 | 58.77 | 2
(12 rows)
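A practical follow-on: window function results cannot be referenced directly in a WHERE clause, so picking out, say, the largest order in each month takes a wrapping query. Here is a sketch of that pattern (month_rank and the ranked alias are illustrative).

select id, date, month, amount
from (
  select id,
         date,
         month,
         amount,
         RANK() OVER (
           PARTITION BY month
           ORDER BY amount DESC
         ) as month_rank
  from orders
) ranked
where month_rank = 1
order by month;

For this data that returns orders 2, 9, and 10, the largest order in each of the three months.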
The ability to control the order in which rows appear to the window function is a very powerful property of these analytic functions. For example, I can combine a GROUP BY style aggregate to calculate total monthly order sales and then use the LAG window function to determine the expansion or contraction of sales from month to month. Let's work through this example in steps.
Calculating the total monthly order sales would look like this.
orders=# select month, SUM(amount) as monthly_sales
from orders
group by month
order by month;
month | monthly_sales
-------+---------------
1 | 257.06
2 | 84.94
3 | 180.88
(3 rows)
Now, if I use the LAG(expression, rows_behind) function over a window ordered by month, with the rows_behind offset set to 1, and subtract that lagged value from the current row, I get the month-to-month differences representing expansion and contraction of sales.
orders=# WITH monthly_tbl as (
select month, SUM(amount) as monthly_sales
from orders
group by month
order by month
)
select *, monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month) as month_change
from monthly_tbl;
month | monthly_sales | month_change
-------+---------------+--------------
1 | 257.06 |
2 | 84.94 | -172.12
3 | 180.88 | 95.94
(3 rows)
I've used a Common Table Expression (CTE) to wrap the GROUP BY monthly sales aggregate into its own result set, which then feeds the difference calculation in the LAG-based window function. I'm working in Postgres, which has supported CTEs for quite some time. Unfortunately, not all relational database systems support them, but I could have accomplished the same thing with a select ... from (select ... from ...) nested query, like so.
orders=# select *, monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month) as month_change
from (
select month, SUM(amount) as monthly_sales
from orders
group by month
order by month
) monthly_tbl
order by monthly_tbl.month;
month | monthly_sales | month_change
-------+---------------+--------------
1 | 257.06 |
2 | 84.94 | -172.12
3 | 180.88 | 95.94
(3 rows)
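The same pattern extends naturally to a percentage change, and the companion LEAD() function works just like LAG() but looks ahead instead of behind. As a sketch (pct_change is an illustrative alias), the month-over-month percentage change could be computed like this.

WITH monthly_tbl as (
  select month, SUM(amount) as monthly_sales
  from orders
  group by month
)
select month,
       monthly_sales,
       round(
         100.0 * (monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month))
         / LAG(monthly_sales, 1) OVER (ORDER BY month),
         1
       ) as pct_change
from monthly_tbl
order by month;

For this data the changes work out to roughly -67% from January to February and +113% from February to March.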
Within the OVER(...) operator of a window function there is a great deal of power to define the specific set of rows you wish to apply analytic functions to. For example, if I wanted a running total of orders over time, I would use the ROWS BETWEEN clause of the OVER(...) operator to include the UNBOUNDED PRECEDING rows up to the CURRENT ROW, being sure to order by date, like so.
orders=# select id,
date,
month,
amount,
SUM(amount) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as total_sales
from orders
order by date;
id | date | month | amount | total_sales
----+------------+-------+--------+-------------
1 | 2021-01-10 | 1 | 17.52 | 17.52
2 | 2021-01-12 | 1 | 81.27 | 98.79
3 | 2021-01-18 | 1 | 71.28 | 170.07
4 | 2021-01-19 | 1 | 10.03 | 180.10
5 | 2021-01-21 | 1 | 76.96 | 257.06
6 | 2021-02-03 | 2 | 12.46 | 269.52
7 | 2021-02-12 | 2 | 25.30 | 294.82
8 | 2021-02-23 | 2 | 13.48 | 308.30
9 | 2021-02-25 | 2 | 33.70 | 342.00
10 | 2021-03-03 | 3 | 83.41 | 425.41
11 | 2021-03-07 | 3 | 38.70 | 464.11
12 | 2021-03-17 | 3 | 58.77 | 522.88
(12 rows)
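A side note on the frame: in PostgreSQL, once an ORDER BY is present in the OVER(...) clause the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the explicit ROWS BETWEEN clause above could have been omitted and the running total would come out the same here (RANGE and ROWS only differ when the ORDER BY column has duplicate values, and the date column does not). A sketch of the shorter form:

select id,
       date,
       amount,
       SUM(amount) OVER (ORDER BY date) as total_sales
from orders
order by date;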
Of course, this works in conjunction with a partitioned subset of windowed rows as well. If I wanted a running total of sales within each month, I would use the following.
orders=# select id,
date,
month,
amount,
SUM(amount) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as total_sales,
SUM(amount) OVER (
PARTITION BY month
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as monthly_sales
from orders
order by date;
id | date | month | amount | total_sales | monthly_sales
----+------------+-------+--------+-------------+---------------
1 | 2021-01-10 | 1 | 17.52 | 17.52 | 17.52
2 | 2021-01-12 | 1 | 81.27 | 98.79 | 98.79
3 | 2021-01-18 | 1 | 71.28 | 170.07 | 170.07
4 | 2021-01-19 | 1 | 10.03 | 180.10 | 180.10
5 | 2021-01-21 | 1 | 76.96 | 257.06 | 257.06
6 | 2021-02-03 | 2 | 12.46 | 269.52 | 12.46
7 | 2021-02-12 | 2 | 25.30 | 294.82 | 37.76
8 | 2021-02-23 | 2 | 13.48 | 308.30 | 51.24
9 | 2021-02-25 | 2 | 33.70 | 342.00 | 84.94
10 | 2021-03-03 | 3 | 83.41 | 425.41 | 83.41
11 | 2021-03-07 | 3 | 38.70 | 464.11 | 122.11
12 | 2021-03-17 | 3 | 58.77 | 522.88 | 180.88
(12 rows)
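When a query repeats similar OVER(...) definitions like this, standard SQL (and PostgreSQL) also lets you name a window once in a WINDOW clause and reference it by name, which keeps the select list tidier. Here is a sketch of the previous query rewritten that way (w_all and w_month are illustrative names); it relies on the default frame discussed above, which gives the same running totals because the dates are unique.

select id,
       date,
       month,
       amount,
       SUM(amount) OVER w_all   as total_sales,
       SUM(amount) OVER w_month as monthly_sales
from orders
WINDOW w_all   as (ORDER BY date),
       w_month as (PARTITION BY month ORDER BY date)
order by date;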
You can also look at downstream rows. Similar to how I used the PRECEDING keyword to include rows before the current row, I can use the FOLLOWING keyword to include a number of rows after the current row. As an example, I will find the maximum order amount among the current row and the rows immediately before and after it within each month's window.
orders=# select id,
date,
amount,
MAX(amount) OVER (
PARTITION BY month
ORDER BY date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) as three_order_max
from orders
order by date;
id | date | amount | three_order_max
----+------------+--------+-----------------
1 | 2021-01-10 | 17.52 | 81.27
2 | 2021-01-12 | 81.27 | 81.27
3 | 2021-01-18 | 71.28 | 81.27
4 | 2021-01-19 | 10.03 | 76.96
5 | 2021-01-21 | 76.96 | 76.96
6 | 2021-02-03 | 12.46 | 25.30
7 | 2021-02-12 | 25.30 | 25.30
8 | 2021-02-23 | 13.48 | 33.70
9 | 2021-02-25 | 33.70 | 33.70
10 | 2021-03-03 | 83.41 | 83.41
11 | 2021-03-07 | 38.70 | 83.41
12 | 2021-03-17 | 58.77 | 58.77
(12 rows)
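The same three-row frame is also the basis of a simple moving average, which is one of the more common uses of PRECEDING and FOLLOWING. As a sketch over the full table rather than per month (moving_avg_3 is an illustrative alias):

select id,
       date,
       amount,
       round(AVG(amount) OVER (
         ORDER BY date
         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ), 2) as moving_avg_3
from orders
order by date;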
In this How To article I demonstrated the use of SQL window functions (sometimes also referred to as analytic functions) to perform simple to moderately complex analysis right in your SQL code.
As always, thanks for reading and please do not hesitate to critique or comment below.