theCodingInterface

How To Use Window Functions in SQL

By Adam McQuistan in Data Engineering 03/29/2021 Comment

Introduction

The ANSI stand SQL dialect provides a set of aggregate functions for performing quantitative analysis on database tables like SUM(), MAX(), MIN(), AVG() and, COUNT(). There are two main ways these functions get used in practice: (i) apply aggregate functions to groups of rows then collapse those rows down to single rows containing only the aggregate results or, (ii) apply the aggregate calculation on similar grouping of rows but maintain the original row level granularity and represent the aggregate results as new columns per row. This second option where row granularity is maintained along with aggregate calculations are achieved through window functions which is the focus of this article.

I should note that not only do windowing functions provide for fairly complex analytics right in your SQL based database but, these same techniques become quite powerful with the extension of SQL being applied to technologies like HIVE, Spark and ksqlDB for batch and stream data processing on highly parallelized compute clusters.

What are Window Functions and Why Do We Need Them

To aid in this discussion I will use the following simple orders table.

create table orders (
   id integer primary key,
   date date not null,
   year integer not null,
   month integer not null,
   amount decimal(9, 2) not null
);
insert into orders 
  (id, date, year, month, amount) 
values 
  (1,  '2021-01-10', 2021, 1, 17.52),
  (2,  '2021-01-12', 2021, 1, 81.27),
  (3,  '2021-01-18', 2021, 1, 71.28),
  (4,  '2021-01-19', 2021, 1, 10.03),
  (5,  '2021-01-21', 2021, 1, 76.96),
  (6,  '2021-02-03', 2021, 2, 12.46),
  (7,  '2021-02-12', 2021, 2, 25.30),
  (8,  '2021-02-23', 2021, 2, 13.48),
  (9,  '2021-02-25', 2021, 2, 33.70),
  (10, '2021-03-03', 2021, 3, 83.41),
  (11, '2021-03-07', 2021, 3, 38.70),
  (12, '2021-03-17', 2021, 3, 58.77);

Aggregate calcuations are often applied in the context of collapsing the data in a table down to either a single row representative of the entire table or a row per grouping of data as specified in a GROUP BY clause.

For example lets say I want to find the average order price in the orders table. That would look something like the following.

orders=# select AVG(amount) as avg_order from orders;
      avg_order
---------------------
 43.5733333333333333
(1 row)

However, what if I instead wanted to make a comparison of each order relative to the average order price. Windowing functions provide a way to apply an analytic function such as an average of the order amount over a defined window of rows without collapsing them down. That is exactly what a window function with the default window size of the entire table does for us.

orders=# select id,
                date,
                amount,
                AVG(amount) OVER() as avg_order
from orders;
 id |    date    | amount |      avg_order
----+------------+--------+---------------------
  1 | 2021-01-10 |  17.52 | 43.5733333333333333
  2 | 2021-01-12 |  81.27 | 43.5733333333333333
  3 | 2021-01-18 |  71.28 | 43.5733333333333333
  4 | 2021-01-19 |  10.03 | 43.5733333333333333
  5 | 2021-01-21 |  76.96 | 43.5733333333333333
  6 | 2021-02-03 |  12.46 | 43.5733333333333333
  7 | 2021-02-12 |  25.30 | 43.5733333333333333
  8 | 2021-02-23 |  13.48 | 43.5733333333333333
  9 | 2021-02-25 |  33.70 | 43.5733333333333333
 10 | 2021-03-03 |  83.41 | 43.5733333333333333
 11 | 2021-03-07 |  38.70 | 43.5733333333333333
 12 | 2021-03-17 |  58.77 | 43.5733333333333333
(12 rows)

The part that is specific to the window function, or sometimes referred to as an analytic function, is the line below.

AVG(amount) OVER() as avg_order

You can break down this into the general form:

ANALYTIC_FUNCTION(expression ...) OVER (window definition) as alias

Where the ANALYTIC_FUNCTION can be any standard or custom function that returns a scalar value. Here is a list of the aggregate functions one's I've found most useful along with a link back to PostgreSQL docs on Window Functions where you can find more.

MIN() - max value over the row set window defined
MAX() - min value over the row set window defined
AVG() - average value over the row set window defined
COUNT() - number of items over the row set window defined
SUM() - sum of numeric items over the row set window defined
ROW_NUMBER() - row index number starting from 1 of the window defined
RANK() - rank the current row over the row set defined
LAG(expression) - returns the result the expression applied to the previous row, or an optional second value ofter the expression dictating another number of rows before
LEAD(expression) - returns the result of the expression applied to the next row, or an optional second value ofter the expression dictating another number of rows after
FIRST_VALUE(expression) - returns the expression applied to the first value of the first row in the defined window
LAST_VALUE(expression) - returns the expression applied tothe last value of the last row in the defined window
NTH_VALUE(expression, N) - returns the pression applied to the Nth row's value in the defined window

You define the window of rows to apply analytic functions to using the OVER(...) operator which has three optional clauses for the following form.

OVER (
  PARTITION BY ...
  ORDER BY ...
  ROWS BETWEEN ....
)

Here is a break down of what each of these mean.

PARTITION BY is used to define zero or more columns to window the row set by and, if omitted then the window of rows to apply the analytic function will be the entire rowset returned by the query.
ORDER BY works similar to a regular ORDER BY clause except it is scoped only the window defined by the PARTITION BY clause and is only applied to the rows operated on by the analytic function irrespective of the ORDER BY clause of the outer query
ROWS BETWEEN is used to frame a specific sets of rows within the window defined by the PARTITION BY clause to be fed to the analytic function

Developing an Understanding of Windowing Rows of a Resultset

I've found that this notion of a windowed set of rows can be a bit difficult to mentally internalize if you've not used this technique a whole lot so, let me try to expand upon this with another example paired with a graphical depiction.

For the new example, I use the PARTITION BY clause of the OVER(...) operator which windows the set of rows being operated on defined as those spanning each month. Said in a more practical way I am calculating the average order amount over each month.

orders=# select id,
       date,
       month,
       amount,
       AVG(amount) OVER () as avg_order,
       AVG(amount) OVER (PARTITION BY month) as avg_month_order
from orders
order by date;
 id |    date    | month | amount |      avg_order      |   avg_month_order
----+------------+-------+--------+---------------------+---------------------
  1 | 2021-01-10 |     1 |  17.52 | 43.5733333333333333 | 51.4120000000000000
  2 | 2021-01-12 |     1 |  81.27 | 43.5733333333333333 | 51.4120000000000000
  3 | 2021-01-18 |     1 |  71.28 | 43.5733333333333333 | 51.4120000000000000
  4 | 2021-01-19 |     1 |  10.03 | 43.5733333333333333 | 51.4120000000000000
  5 | 2021-01-21 |     1 |  76.96 | 43.5733333333333333 | 51.4120000000000000
  6 | 2021-02-03 |     2 |  12.46 | 43.5733333333333333 | 21.2350000000000000
  7 | 2021-02-12 |     2 |  25.30 | 43.5733333333333333 | 21.2350000000000000
  8 | 2021-02-23 |     2 |  13.48 | 43.5733333333333333 | 21.2350000000000000
  9 | 2021-02-25 |     2 |  33.70 | 43.5733333333333333 | 21.2350000000000000
 10 | 2021-03-03 |     3 |  83.41 | 43.5733333333333333 | 60.2933333333333333
 11 | 2021-03-07 |     3 |  38.70 | 43.5733333333333333 | 60.2933333333333333
 12 | 2021-03-17 |     3 |  58.77 | 43.5733333333333333 | 60.2933333333333333
(12 rows)

Specifying Order within a Windowed Calculation

An interesting and quite useful apsect of window functions is the ability to change the order of the rows within the defined window via the familiar ole ORDER BY clause. Where this can be particularly useful is determining a rank of each row within the defined order within the window. For example, if I wanted to rank orders from lowest to highest based off their order amount while still maintaining the overall order of the rows sorted by date. I accomplish this using the RANK() function and specify ORDER BY amount within the OVER(...) operator like so.

orders=# select id,
       date,
       amount,
       RANK() OVER(ORDER BY amount) as amount_rank
from orders
order by date;
 id |    date    | amount | amount_rank
----+------------+--------+-------------
  1 | 2021-01-10 |  17.52 |           4
  2 | 2021-01-12 |  81.27 |          11
  3 | 2021-01-18 |  71.28 |           9
  4 | 2021-01-19 |  10.03 |           1
  5 | 2021-01-21 |  76.96 |          10
  6 | 2021-02-03 |  12.46 |           2
  7 | 2021-02-12 |  25.30 |           5
  8 | 2021-02-23 |  13.48 |           3
  9 | 2021-02-25 |  33.70 |           6
 10 | 2021-03-03 |  83.41 |          12
 11 | 2021-03-07 |  38.70 |           7
 12 | 2021-03-17 |  58.77 |           8
(12 rows)

Here I see that the lowest rank goes to the lowest order amount and the highest rank goes to the largest order amount.

Of course, this works with window subsets defined using the PARTITION BY clause of the OVER(...) operator as well so, finding the rank of order amounts for each month works as follows.

orders=# select id,
       date,
       month,
       amount,
       RANK() OVER (
         PARTITION BY month
         ORDER BY amount
       ) as month_amt_rank
from orders
order by date;
 id |    date    | month | amount | month_amt_rank
----+------------+-------+--------+----------------
  1 | 2021-01-10 |     1 |  17.52 |              2
  2 | 2021-01-12 |     1 |  81.27 |              5
  3 | 2021-01-18 |     1 |  71.28 |              3
  4 | 2021-01-19 |     1 |  10.03 |              1
  5 | 2021-01-21 |     1 |  76.96 |              4
  6 | 2021-02-03 |     2 |  12.46 |              1
  7 | 2021-02-12 |     2 |  25.30 |              3
  8 | 2021-02-23 |     2 |  13.48 |              2
  9 | 2021-02-25 |     2 |  33.70 |              4
 10 | 2021-03-03 |     3 |  83.41 |              3
 11 | 2021-03-07 |     3 |  38.70 |              1
 12 | 2021-03-17 |     3 |  58.77 |              2

Calculating Difference using LAG

The ability to control the order that rows appear to window function is a very powerful property of these analytical functions. For example, I could combine a group by style of aggregate to calculate total monthly order sales then use the LAG aggegate window function to detemine the expansion or contraction of sales. Lets work through this example in steps.

Calculating the total monthly order sales would look like this.

orders=# select month, SUM(amount) as monthly_sales
            from orders
            group by month
            order by month;
 month | monthly_sales
-------+---------------
     1 |        257.06
     2 |         84.94
     3 |        180.88
(3 rows)

Now if I use the LAG(expression, rows_behind) function on a window ordered by month with the rows_behind parameter as 1 and subtracting that from the current row I get the month to month differences representing expansion and contraction of sales.

orders=# WITH monthly_tbl as (
        select month, SUM(amount) as monthly_sales
        from orders
        group by month
        order by month
    )
    select *, monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month) as month_change
    from monthly_tbl;
 month | monthly_sales | month_change
-------+---------------+--------------
     1 |        257.06 |
     2 |         84.94 |      -172.12
     3 |        180.88 |        95.94
(3 rows)

I've used a Common Table Expression (CTE) to wrap the group by monthly sales aggregate results here into its own table result set to feed into the difference calculation in the LAG based window function. I'm working in Postgres which has had support for CTEs for quite some time. Unfortunately not all relational database systems have support for them but, I could have accomplish the same thing with a select ... from (select ... from ) nested query like so.

orders=# select *, monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month) as month_change
             from (
                   select month, SUM(amount) as monthly_sales
                   from orders
                   group by month
                   order by month
            ) monthly_tbl
            order by monthly_tbl.month;
 month | monthly_sales | month_change
-------+---------------+--------------
     1 |        257.06 |
     2 |         84.94 |      -172.12
     3 |        180.88 |        95.94
(3 rows)

Defining Ranges of Row to Operate on Within a Window

Within the OVER(...) operator of a window function there is a whole lot of power to define the specific set of rows within a window you wish to apply analytic functions to. For example, if I wanted to get a running total of orders over time I would use the ROWS BETWEEN clause of the OVER(...) operator to give me the UNBOUNDED PRECEDING rows up to the CURRENT ROW being sure to order by date like so.

orders=# select id,
       date,
       month,
       amount,
       SUM(amount) OVER (
         ORDER BY date 
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as total_sales
from orders
order by date;
 id |    date    | month | amount | total_sales
----+------------+-------+--------+-------------
  1 | 2021-01-10 |     1 |  17.52 |       17.52
  2 | 2021-01-12 |     1 |  81.27 |       98.79
  3 | 2021-01-18 |     1 |  71.28 |      170.07
  4 | 2021-01-19 |     1 |  10.03 |      180.10
  5 | 2021-01-21 |     1 |  76.96 |      257.06
  6 | 2021-02-03 |     2 |  12.46 |      269.52
  7 | 2021-02-12 |     2 |  25.30 |      294.82
  8 | 2021-02-23 |     2 |  13.48 |      308.30
  9 | 2021-02-25 |     2 |  33.70 |      342.00
 10 | 2021-03-03 |     3 |  83.41 |      425.41
 11 | 2021-03-07 |     3 |  38.70 |      464.11
 12 | 2021-03-17 |     3 |  58.77 |      522.88

Of course, this works in conjunction with a partitioned subset of windowed rows as well. If I wanted a running total of monthly sales I would use the following.

orders=# select id,
       date,
       month,
       amount,
       SUM(amount) OVER (
         ORDER BY date 
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as total_sales,
       SUM(amount) OVER (
         PARTITION BY month
         ORDER BY date
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as monthly_sales
from orders
order by date;
 id |    date    | month | amount | total_sales | monthly_sales
----+------------+-------+--------+-------------+---------------
  1 | 2021-01-10 |     1 |  17.52 |       17.52 |         17.52
  2 | 2021-01-12 |     1 |  81.27 |       98.79 |         98.79
  3 | 2021-01-18 |     1 |  71.28 |      170.07 |        170.07
  4 | 2021-01-19 |     1 |  10.03 |      180.10 |        180.10
  5 | 2021-01-21 |     1 |  76.96 |      257.06 |        257.06
  6 | 2021-02-03 |     2 |  12.46 |      269.52 |         12.46
  7 | 2021-02-12 |     2 |  25.30 |      294.82 |         37.76
  8 | 2021-02-23 |     2 |  13.48 |      308.30 |         51.24
  9 | 2021-02-25 |     2 |  33.70 |      342.00 |         84.94
 10 | 2021-03-03 |     3 |  83.41 |      425.41 |         83.41
 11 | 2021-03-07 |     3 |  38.70 |      464.11 |        122.11
 12 | 2021-03-17 |     3 |  58.77 |      522.88 |        180.88
(12 rows)

You can also look at downstream rows. Similar to how I used the PRECEDING keyword to specify rows to include before the current row, I can use the FOLLOWING keyword to specify an amount of rows to include ahead of the current row. As an example of this I will demonstrate finding the max order amount for current row along with the row immediately before and after it within a window.

orders=# select id,
       date,
       amount,
       MAX(amount) OVER (
         PARTITION BY month
         ORDER BY date
         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) as three_order_max
from orders
order by date;
 id |    date    | amount | three_order_max
----+------------+--------+-----------------
  1 | 2021-01-10 |  17.52 |           81.27
  2 | 2021-01-12 |  81.27 |           81.27
  3 | 2021-01-18 |  71.28 |           81.27
  4 | 2021-01-19 |  10.03 |           76.96
  5 | 2021-01-21 |  76.96 |           76.96
  6 | 2021-02-03 |  12.46 |           25.30
  7 | 2021-02-12 |  25.30 |           25.30
  8 | 2021-02-23 |  13.48 |           33.70
  9 | 2021-02-25 |  33.70 |           33.70
 10 | 2021-03-03 |  83.41 |           83.41
 11 | 2021-03-07 |  38.70 |           83.41
 12 | 2021-03-17 |  58.77 |           58.77
(12 rows)

Conclusion

In this How To article I demonstrated the use of SQL window functions (sometimes also referred to as analytic functions) to perform simple to moderately complex analysis right in your SQL code.

As always, thanks for reading and please do not hesitate to critique or comment below.

Data Engineering PostgreSQL Databases

Share with friends and colleagues

[[ likes ]] likes