Recently, I was trying to group consecutive rows in SQL using one and then multiple column values. I looked up the internet on how to do and couldn't find anything so finally when I was able to figure it out, I thought I should share this so that it can be of some help others who might be facing the problem.

## The Problem

Let's start with defining and understanding the problem first. Let's say we are running a mobility company (take for example, Rapido). In order to solve a particular use case, we are collecting information on the time spent by each bike in different areas. Let's say this is represented as a table named `lapse_time`

in our database with `bike_id`

, `area_id`

and the corresponding `time_spent`

(in minutes).

```
CREATE table lapse_time(
bike_id varchar,
area_id VARCHAR,
time_spent INTEGER
)
```

Here is the information we have collected for example:

***Note:** This data is sorted by `bike_id`

and then by the order of entry of the row in the database. For example, `bike_1`

first spent 120 minutes in `area_1`

. Then, it went to `area_2`

and spent 10 minutes over there and so on.

Now if you take a close look at row numbers 2, 3, and 4 in figure 1, you will observe that all these consecutive rows have the same `area_id`

and `bike_id`

, and as the data is sorted by the order of entry of the row in the database, it means that `bike_1`

spent 10 minutes in `area_2`

, then 20 minutes again in `area_2`

, and then again 40 minutes in `area_2`

. This looks very cumbersome and to solve the particular use case I was looking at, I needed to group together these 3 consecutive columns so that it conveys that `bike_1`

spent 70 (40 + 20 +10) minutes in `area_2`

. Let's take another example to make it abundantly clear. Both the consecutive rows, 6 and 7, have the same `area_id`

and `bike_id`

, and thus can be grouped to the `time_spent`

by the `bike_2`

in`area_2`

as 37 (12 + 25) minutes.

Many of you might be wondering why at all have we got separate rows for the same `bike_id`

in the same area. That can be because of multiple reasons, primarily owing to the way how the data collection has been done. For example, the hardware might be collecting information at specific intervals (e.g. every 10 minutes or so) and calculating the time spent in that interval. It might be the case that a bike is at the same location in multiple intervals, in which case we get separate rows. So, this scenario is expected in real-time data.

## The Solution

So, now let's try to solve this problem. The end goal here is to convert what it looks like in the figure 2 to that in the figure 3.

**To this →**

I personally encountered this problem while I was trying to calculate `median idle time_spent`

in an area. If you take the data present in figure 2 and try to calculate the median idle time spent in let's say `area_2`

, you won't get the desired result as the data hasn't been arranged accordingly. In figure 3, we have arranged the data by merging consecutive rows having the same `bike_id`

and `area_id`

; thus making it perfect for all sorts of calculations. Merging consecutive rows according to values of multiple columns was something that wasn't easily available, which then inspired me to write a blog regarding it.

So, how to do this? Let's go step by step.

Let's first make a dummy table in postgres named as `lapse_time`

```
CREATE table lapse_time(
bike_id varchar,
area_id VARCHAR,
time_spent INTEGER
)
```

Now, let's insert some dummy data into it.

```
INSERT into lapse_time VALUES
('bike_1', 'area_1', 120),
('bike_1', 'area_2', 20),
('bike_1', 'area_2', 40),
('bike_1', 'area_2', 10),
('bike_1', 'area_3', 45),
('bike_2', 'area_2', 12),
('bike_2', 'area_2', 25),
('bike_2', 'area_3', 50)
```

We have a table exactly like the one in figure 2.

- Now, we need to add a
`row_number()`

to our table. This can be done simply using the following query. Let's call it`query_1`

```
select *, ROW_NUMBER() OVER (ORDER BY bike_id) as rw_nm from lapse_time;
```

Here's the result:

We need to calculate LAG of the columns based on which we want to perform the sum (in this case, `bike_id`

and `area_id`

). So, let's call the following operation `query_2`

.

```
SELECT t.*,
LAG(bike_id) OVER (
ORDER BY rw_nm
) as prev_1,
LAG(area_id) OVER (
ORDER BY rw_nm
) AS prev_2
from ({query_1}) t;
```

Here's the result:

Now comes the tricky part. We'll need to use `sum over window function`

. In order to know more about the same, one should read more about window functions in postgres.

Before going to the actual query, here's one query that one should try to get clarity:

```
SELECT t.*,
SUM(1) OVER (
ORDER BY rw_nm
) as grp
from ({ query_2 }) t
```

Here's the result for this query

For the actual query, we would use this with a slight modification in the `sum`

function. We would be using CASE condition with sum, so here's the actual query. Let's call this `query_3`

.

```
SELECT t.*,
SUM(
CASE
WHEN bike_id = prev_1
AND area_id = prev_2 THEN 0
ELSE 1
END
) OVER (
ORDER BY rw_nm
) as grp
from ({ query_2 }) t
```

We would do the sum only when the column value is not equal to its lag value. Here's the result that we would get.

If we look at the `grp`

column over here, we would notice that the consecutive rows with the same values of `bike_id`

and `area_id`

have the same value for `grp`

. The final step is to use group by clause over the `grp`

column, here's the query for the same:

```
select bike_id,
area_id,
SUM(time_spent) as time_spent,
min(rw_nm) as rw_nm
from ({query_3}) t
GROUP by grp,
bike_id,
area_id
ORDER by rw_nm
```

And here's the result:

Now one can further go above this and select just the `bike_id`

, `area_id`

and `time_spent`

from this table to get the exact result as in the Figure 3.

### The final query

Aggregating all the steps mentioned above, here's the final query for merging consecutive rows in SQL using one or multiple columns

```
select bike_id,
area_id,
SUM(time_spent) as time_spent,
min(rw_nm) as rw_nm
from (
SELECT t.*,
SUM(
CASE
WHEN bike_id = prev_1
AND area_id = prev_2 THEN 0
ELSE 1
END
) OVER (
ORDER BY rw_nm
) as grp
from (
SELECT t.*,
LAG(bike_id) OVER (
ORDER BY rw_nm
) as prev_1,
LAG(area_id) OVER (
ORDER BY rw_nm
) AS prev_2
from (
select *,
ROW_NUMBER() OVER (
ORDER BY bike_id
) as rw_nm
from lapse_time
) t
) t
) t
GROUP by grp,
bike_id,
area_id
ORDER by rw_nm
```

## Further Reading

- This problem can further be generalized as recognition of patterns in SQL and rapid detection of unbroken sequences, better known as
**gaps and island problem**. - I personally faced this problem while working to measure fleet utilization and idle time. I have written another detailed blog post on the same. You might find interesting to deep dive into that problem here:

Read more on the problem we are solving at Locale.ai: