Ranking with Databricks

It is not always possible and sometimes does not make sense to assign rankings uniquely ...

Ranking with Databricks
Photo by Joshua Golde / Unsplash

When it comes to deal with historical data, it is a common task to work with the most recent and fresh data to perform analysis: that's where the ranking comes in.

The rank of an object is defined as a relationship between a set of items such that, for any pair of items, the first is either "ranked higher than", "ranked lower than", or "ranked equal to" the second.

Databricks implementations

It is not always possible and sometimes does not make sense to assign rankings uniquely. Depending of the use case, some strategies might be better than other. A common shorthand way to distinguish the best ranking strategy for your use-case is by the ranking 4 items:

  • The first item ranked ahead
  • The second and third, which compare equal (ties)
  • The fourth item ranked last

The object of the different methods is to define the value of the ranking for equal items and items around them.

Databricks implements 3 ranking strategies.

Standard competition ranking “1224”

This ranking strategy is frequently adopted for competitions, as items that are equal receive the same ranking number, and the following item rank is its position in the ordered list.

In the competition example, it means that if two (or more) competitors tie for a position in the ranking, the position of all those ranked below them is unaffected.

Databricks implements this ranking with rank() over subsets of data called partitions.

SELECT 
    id_runner
  , rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

According to the documentation, rank will produce gaps in the ranking sequence but will not break ties.

Dense ranking “1223”

In dense ranking, items that compare equally receive the same ranking number, and the next items receive the immediately following ranking number.

Databricks implements this ranking with dense_rank() over partitions.

SELECT 
    id_runner
  , dense_rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

According to the documentation, dense_rank will not produce gaps in the ranking sequence and will not break ties.

Ordinal ranking “1234”

This strategy it meant for scenarios where the uniqueness of the rank is required.

In ordinal ranking, all items receive distinct ordinal numbers, including items that compare equal. The assignment of distinct ordinal numbers to items that compare equal can be done at random, or arbitrarily, but it is generally preferable to use a system that is arbitrary but consistent, as this gives stable results if the ranking is done multiple times.

Databricks implements this ranking with row_number() over partitions.

SELECT 
    id_runner
  , row_number() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

According to the documentation, row_number will not produce gaps in the ranking sequence but will break ties.

If the order is not unique, the result is non-deterministic.

An example of an arbitrary but consistent system would be to incorporate other attributes into the ranking order

Comparing methods

Now let's wrap up everything and compare the methods. It's up to you to use the one that best fits your needs!

SELECT 
    id_runner
  , race
  , rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
  , dense_rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
  , row_number() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions
id_runner | race | rank | dense_rank | row_number |
----------|------|------|------------|------------|
        21|     A|     1|           1|           1|
        56|     A|     2|           2|           2|
        87|     A|     2|           2|           3|
        18|     A|     4|           3|           4|
----------|------|------|------------|------------|
        87|     B|     1|           1|           1|
       ...|   ...|   ...|         ...|         ...|

Sources

Ranking - Wikipedia
rank ranking window function
Learn the syntax of the rank function of the SQL language in Databricks SQL and Databricks Runtime.
dense_rank ranking window function
Learn the syntax of the dense_rank function of the SQL language in Databricks SQL and Databricks Runtime.
row_number ranking window function
Learn the syntax of the row_number function of the SQL language in Databricks SQL and Databricks Runtime.