By Martin Labenne — Jun 13, 2024

Ranking with Databricks

It is not always possible and sometimes does not make sense to assign rankings uniquely ...

When it comes to deal with historical data, it is a common task to work with the most recent and fresh data to perform analysis: that's where the ranking comes in.

The rank of an object is defined as a relationship between a set of items such that, for any pair of items, the first is either "ranked higher than", "ranked lower than", or "ranked equal to" the second.

Databricks implementations

It is not always possible and sometimes does not make sense to assign rankings uniquely. Depending of the use case, some strategies might be better than other. A common shorthand way to distinguish the best ranking strategy for your use-case is by the ranking 4 items:

The first item ranked ahead
The second and third, which compare equal (ties)
The fourth item ranked last

The object of the different methods is to define the value of the ranking for equal items and items around them.

Databricks implements 3 ranking strategies.

Standard competition ranking “1224”

This ranking strategy is frequently adopted for competitions, as items that are equal receive the same ranking number, and the following item rank is its position in the ordered list.

In the competition example, it means that if two (or more) competitors tie for a position in the ranking, the position of all those ranked below them is unaffected.

Databricks implements this ranking with rank() over subsets of data called partitions.

SELECT 
    id_runner
  , rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

According to the documentation, rank will produce gaps in the ranking sequence but will not break ties.

Dense ranking “1223”

In dense ranking, items that compare equally receive the same ranking number, and the next items receive the immediately following ranking number.

Databricks implements this ranking with dense_rank() over partitions.

SELECT 
    id_runner
  , dense_rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

According to the documentation, dense_rank will not produce gaps in the ranking sequence and will not break ties.

Ordinal ranking “1234”

This strategy it meant for scenarios where the uniqueness of the rank is required.

In ordinal ranking, all items receive distinct ordinal numbers, including items that compare equal. The assignment of distinct ordinal numbers to items that compare equal can be done at random, or arbitrarily, but it is generally preferable to use a system that is arbitrary but consistent, as this gives stable results if the ranking is done multiple times.

Databricks implements this ranking with row_number() over partitions.

SELECT 
    id_runner
  , row_number() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

According to the documentation, row_number will not produce gaps in the ranking sequence but will break ties.

If the order is not unique, the result is non-deterministic.

An example of an arbitrary but consistent system would be to incorporate other attributes into the ranking order

Comparing methods

Now let's wrap up everything and compare the methods. It's up to you to use the one that best fits your needs!

SELECT 
    id_runner
  , race
  , rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
  , dense_rank() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
  , row_number() OVER(PARTITION BY race ORDER BY time_to_finish DESC)
FROM competitions

id_runner | race | rank | dense_rank | row_number |
----------|------|------|------------|------------|
        21|     A|     1|           1|           1|
        56|     A|     2|           2|           2|
        87|     A|     2|           2|           3|
        18|     A|     4|           3|           4|
----------|------|------|------------|------------|
        87|     B|     1|           1|           1|
       ...|   ...|   ...|         ...|         ...|

Databricks implementations

Standard competition ranking “1224”

Dense ranking “1223”

Ordinal ranking “1234”

Comparing methods

Sources