# Sorting (with attitude)

This post takes a look at a number of different sorting techniques: heap sort, merge sort, quick sort and quick select, counting sort, radix sort, and bucket sort.

The notes below come from the Algorithms with Attitude YouTube channel, specifically the Sorting playlist, comprising the following videos: Heap Sort; Merge Sort: Top-Down and Bottom-Up; Quick Sort and Quick Select; Runtime Analysis for Quick Sort and Quick Select; Lower Bounds for Comparison Based Sorting: Decision Trees; and Linear Time Sorting: Counting Sort, Radix Sort, and Bucket Sort.

There are several sections in this post, accessible in the table of contents on the right, but most of the sections have been listed below for ease of reference and navigation.

- Heap sort
- Merge sort - top-down and bottom-up
- Merging lists
- Algorithm concept
- Standard recursive merge sort (top-down)
- Analysis
- Optimizations
- Naive parallelization
- Unwinding the recursion
- Bottom-up merge sort (iterative)
- Comparison to top-down (8 elements)
- Comparison to top-down (13 elements)
- History lesson
- A different optimization (bottom-up)
- Bottom-up implementation (optimized)
- Timsort ideas

- Quick sort and quick select
- Introduction
- Quick sort concept
- Quick sort pseudocode (1)
- Partitioning (CLRS/Lomuto partition)
- Worst case performance
- Quick sort pseudocode (2)
- Quick sort pseudocode (3)
- Expected performance
- Randomized quick sort
- Expected performance (continued)
- Partitioning with duplicates
- Optimizations
- Naive parallelization
- Selection
- Quick select
- Implementations

- Runtime analysis for quick sort and quick select
- Lower bounds for comparison based sorting - decision trees
- Linear time sorting - counting sort, radix sort, and bucket sort
- LeetCode implementations

## Heap sort

There is another post dedicated to binary heaps, where heap sort is discussed in the general context of heaps as a whole. The content below for heap sort has been excerpted from the linked post for the sake of completeness.

### Overview

Suppose we have the `heapify` and `deleteMax` operations as well as the `buildHeap` operation. Heap sort is then quite easy: first, build a max-heap; then, delete all its values:

- `buildHeap()`
- `deleteMax()` $n$ times

Done.

### buildHeap() phase

For a bit more detail, we start with the `buildHeap` operation. We assume we're given an array of values to be sorted, and we use the space of that array to store the heap itself. One of the nice properties of heap sort is that it is an in-place algorithm: it only needs a fixed amount of memory beyond what is used to store the data being sorted.

### deleteMax() phase

Next, we delete the max value. We swap the root with the last leaf node (bottom level to the far right) and then delete the last leaf node — the new root value then has to be compared to its children and possibly sifted down the tree to ensure the properties of the whole heap and all sub-heaps are respected. We conclude the sifting down when what was previously the last leaf node is now either a leaf or the root of its own sub-heap.

We continue to call `deleteMax` until the heap is empty. As indicated above, note that when `deleteMax` is called, we don't just overwrite the value to be deleted; we swap it with the last leaf of the heap and continue as explained above. What happens is that we delete more and more items, and the heap gets smaller and smaller. But all of the values are still stored in the array, until finally, the heap is gone.

### Analysis

- `buildHeap`: $O(n)$
- `deleteMax` $n$ times: $O(\lg n)$ each, $O(n\lg n)$ total

For our analysis, we had a `buildHeap` operation, which was $O(n)$ time, and then $n$ `deleteMax` operations, which are worst-case $O(\lg n)$ time each. That's $O(n\lg n)$ total time.

Note that we would still have an $O(n\lg n)$ algorithm even if we ran $n$ insertions to start the algorithm instead of the linear `buildHeap`. If we want to be more precise, we can better bound the total number of comparisons:

- `buildHeap`: $< 2n$ comparisons (CLRS)
- `deleteMax` $n$ times: $< 2n\lg n$ comparisons (CLRS)

Note that the above are *not* asymptotic results. The `2` term above comes from the fact that moving down one level in `heapify` usually takes `2` comparisons, but that can be improved by optimizing the `heapify` method.

In all, heap sort gives us a worst-case $O(n\lg n)$ in-place sorting algorithm that is *not* stable. It's supposedly slower in practice than a good quicksort.
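The whole routine can be sketched in Python (a minimal sketch, not the exact code from the linked heaps post; `sift_down` is the textbook two-comparisons-per-level `heapify`):

```python
def sift_down(A, i, size):
    # Restore the max-heap property for the subtree rooted at index i,
    # considering only the first `size` elements of A.
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < size and A[left] > A[largest]:
            largest = left
        if right < size and A[right] > A[largest]:
            largest = right
        if largest == i:
            return
        A[i], A[largest] = A[largest], A[i]
        i = largest

def heap_sort(A):
    n = len(A)
    # buildHeap phase: sift down every internal node, O(n) total.
    for i in range(n // 2 - 1, -1, -1):
        sift_down(A, i, n)
    # deleteMax phase, n times: swap root with last leaf, shrink the
    # heap by one, and sift the new root down.
    for end in range(n - 1, 0, -1):
        A[0], A[end] = A[end], A[0]
        sift_down(A, 0, end)
```

Note that the sort happens entirely inside `A`, matching the in-place property discussed above.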

## Merge sort - top-down and bottom-up

### Merging lists

The idea for merge sort is that if we have two lists of numbers that are *already* sorted, then we can merge those two lists into one longer sorted list in linear total time. If each list is sorted smallest to largest, then we can find the overall smallest by just comparing the smallest of each list. The next smallest is the smallest of the lists remaining after we take that first element out. For equal values, we take the one from the left list as the smaller one (even though it really doesn't make a difference) — its value started in a smaller index; thus, taking it in case of a tie, we'll keep it in a smaller index. That will make merge sort a *stable* sorting routine — it won't flip the relative order of equal values.

Here's an example run:

Merging two lists is pretty easy and takes linear time: $\Theta(n)$.
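A quick Python sketch of that merge (the function name is mine, not from the video):

```python
def merge_sorted(left, right):
    # Merge two already-sorted lists into one sorted list in linear time.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # On ties, take from the left list first: this is exactly the
        # tie-breaking rule that makes merge sort stable.
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these extends the result; the other list is empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```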

### Algorithm concept

`MergeSort(list)`:

- What if the first half of the list was already sorted?
- What if the second half of the list was already sorted?
- Merge the two sorted lists into one.

If we had a list and broke it in half and each half was magically sorted for us, then we could merge those two halves to make a complete sorted list:

Excluding that magic part, that's a *really* simple algorithm. So where did those two sorted lists come from? Starting with the two unsorted halves of the original list, we sort them recursively with merge sort:

Recursion is the magic. As a base case for the recursion, if we're given a list with one item (or none), then it's sorted:

```
MergeSort(array A)
    if A.length < 2: return
    Copy the first half of A to new array B
    MergeSort(B)
    Copy the second half of A to new array C
    MergeSort(C)
    merge B and C back into A
```

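The pseudocode above translates to Python roughly as follows (a sketch; taking ties from `B` keeps the sort stable):

```python
def merge_sort(A):
    # Base case: an empty or single-element list is already sorted.
    if len(A) < 2:
        return
    mid = len(A) // 2
    B = A[:mid]        # copy the first half of A to a new array B
    merge_sort(B)
    C = A[mid:]        # copy the second half of A to a new array C
    merge_sort(C)
    # Merge B and C back into A.
    i = j = k = 0
    while i < len(B) and j < len(C):
        if B[i] <= C[j]:
            A[k] = B[i]
            i += 1
        else:
            A[k] = C[j]
            j += 1
        k += 1
    while i < len(B):
        A[k] = B[i]
        i += 1
        k += 1
    while j < len(C):
        A[k] = C[j]
        j += 1
        k += 1
```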

Here's a concise breakdown of the time and space complexity of the approach above:

**Time:** The algorithm recursively divides the source array, `A`, into halves, `B` and `C`, leading to a logarithmic number of levels, $\lg n$, as shown in the animation above (in the animation example, we have $\lceil \lg 13 \rceil = \lceil 3.70\ldots \rceil = 4$). Each level involves merging the entire array, which takes linear time, $O(n)$. Hence, the *total* time complexity is $O(n\lg n)$.

**Space:** *New* arrays, `B` and `C`, are created to hold halves of the original array, `A`, for each recursive call. Each level of the recursion tree, as demonstrated in the animation, uses $O(n)$ space; since there are $\lg n$ levels in the recursion tree, the *total* space complexity is $O(n\lg n)$.

### Standard recursive merge sort (top-down)

While we could copy the array halves out before they're sorted and then sort them, as shown above and in the animation, that implementation uses more space than is needed. It's not too bad, maybe twice what we'd usually need, until we get into implementations that run in parallel, which can be much worse.

Instead, it's more efficient to only copy the data out of the original array after it's sorted, immediately merging back into the original array; that is, only copy lists *after* they are sorted, not beforehand:

- Use original array to store unsorted values waiting to be sorted
- Copy sorted values to another array just before the merge back into the original space
- Less space (especially for naive parallelized version)

```
MergeSort(A[], start, end)
    if (end <= start) return
    mid = (end + start) / 2
    MergeSort(A[], start, mid)
    MergeSort(A[], mid + 1, end)
    copy A[start...mid] to B[]
    copy A[mid + 1...end] to C[]
    merge B[] and C[] back into A[start...end]
```

We can have `start` and `end` indices as parameters in our recursive calls, specifying which part of the original array we are currently sorting. So, to sort the array from the start to the end, we recursively sort its first half, from the start index to the middle index. That call will first sort *its* first half, the first quarter of the whole array, and so on; we return to the second half of the array only after the first half is complete.

Each of the merge operations will copy out two sorted lists, and then merge back into the original array, overwriting it. We'll call this the standard "unoptimized" version of top-down merge sort (it's better than the previous approach, but there are still several improvements we can make). We assume here that we're just comparing integer values in an array. More generally we might be sorting records or objects by some key value, but the basic idea is the same. If we're working in a language where we have to explicitly free up memory instead of using garbage collection, we'd need to free up the extra space used for merging the lists once we finish with them.

In the animation of this algorithm below, it's easy to see that we use linear extra space during the algorithm instead of $O(n\lg n)$ space as before:

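A Python sketch of this standard version, copying each half out only after it has been sorted:

```python
def merge_sort(A, start, end):
    # Sort A[start...end] in place.
    if end <= start:
        return
    mid = (end + start) // 2
    merge_sort(A, start, mid)         # recursively sort first half
    merge_sort(A, mid + 1, end)       # recursively sort second half
    B = A[start:mid + 1]              # copy sorted first half to B[]
    C = A[mid + 1:end + 1]            # copy sorted second half to C[]
    # Merge B[] and C[] back into A[start...end]; ties go to B, which
    # keeps the sort stable.
    i = j = 0
    k = start
    while i < len(B) and j < len(C):
        if B[i] <= C[j]:
            A[k] = B[i]
            i += 1
        else:
            A[k] = C[j]
            j += 1
        k += 1
    while i < len(B):                 # any leftovers of B
        A[k] = B[i]
        i += 1
        k += 1
    while j < len(C):                 # any leftovers of C
        A[k] = C[j]
        j += 1
        k += 1
```

It would be called as `merge_sort(A, 0, len(A) - 1)`.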

Summary of time and space complexity for the approach above:

**Time:** The algorithm above again recursively divides the source array into halves ($\lg n$ levels in the recursion tree) and merges all elements at each level, for a total runtime of $O(n\lg n)$.

**Space:** The total extra space used by the temporary arrays `B` and `C`, at any point in time, is proportional to $n$, the number of elements in the source array, `A`; hence, the overall space complexity is $O(n)$.

### Analysis

Concise summaries of the time/space analysis for each previous approach were provided at the end of each approach's section, but it's helpful to consider the analysis of merge sort from a more formal point of view. It's worth noting merge sort is probably one of the most analyzed sorting algorithms there is, and its recurrence relation is likely the most frequently analyzed recurrence relation ever.

To run the algorithm on $n$ items, we recursively sort the first $n/2$ items and then recursively sort the second $n/2$ items, and then we merge them. The merge takes linear time. We get the following recurrence relation:

$$T(n) = 2T(n/2) + \Theta(n)$$

Solving this recurrence relation shows that merge sort takes $\Theta(n\lg n)$ time.

As an arbitrary tie-breaker, we'll use the convention CLRS uses, namely that if there is an odd number of items, then the first (left) half gets the extra one (e.g., if the entire array has `13` elements, then the first half will range from index `0` through index `6`, since `(end + start) // 2 = (0 + 12) // 2 = 6`, thus containing a total of `7` items).

In summary:

- Runtime analysis:
- $T(n) = 2T(n/2) + \Theta(n)$
- $T(n) = \Theta(n\lg n)$

- Tie breakers:
- Important: for equal key values, the lower indexed (left) value goes first (makes algorithm stable)
- Arbitrary: for an odd number of elements, the left half gets the extra one
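To spell out the runtime line above, we can unroll the recurrence with a recursion-tree argument (assuming, for simplicity, that $n$ is a power of $2$, and writing $cn$ for the $\Theta(n)$ merge cost):

```latex
\begin{aligned}
T(n) &= 2T(n/2) + cn \\
     &= 4T(n/4) + cn + cn \\
     &\;\;\vdots \\
     &= 2^k\,T(n/2^k) + k \cdot cn
\end{aligned}
```

The recursion bottoms out when $n/2^k = 1$, i.e., $k = \lg n$, giving $T(n) = n\,T(1) + cn\lg n = \Theta(n\lg n)$.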

### Optimizations

#### Optimization 1

Only copy the first sorted list, and leave the second list in place:

- Decreases copies by about 25%
- Doesn't change number of comparisons

Recall the animation for the standard recursive merge sort algorithm:

Before the merge, the *entire* sorted array is copied elsewhere, and it looks kind of like one operation, but it isn't. We're copying the entire array into another position, which takes time linear in the size of that array.

Instead of copying out both the left and the right arrays, we can copy out the left array and just leave the right array in-place. If we fill in the sorted values from small to large, then by the time we need to *overwrite* anything in the right array, that value will already have been copied elsewhere. Unoptimized, a merge of two size $n/2$ lists takes $n$ read and write operations to copy the arrays elsewhere and $n$ more to copy them back into the array during the merge. There are also between $n/2$ and $n$ comparisons made *during* the merge. Here, for this first optimization, we get rid of $n/2$ read and write operations for one of the initial list copies (the right one). It won't change the asymptotic runtime — it will just make it run a little more quickly, getting rid of a quarter of our copy operations.

This new optimization of only copying the left array and leaving the right one in-place can be visualized as follows:

We can modify our code from the standard implementation to implement this optimization; the resulting `merge` and `merge_sort` routines are listed in the next section.

#### Optimization 2

Once the first list is merged, the remainder of the second list is already in-place

- Decrease copies (amount depends on data)
- Decrease comparisons (amount depends on data)

This second optimization is easier to see after we've made the first optimization. Once all elements have been copied from the left list into the combined list, the remaining elements from the right list are already in-place so we don't have to loop through the remaining elements from the right list to copy them into the merged list — they're already there.

That was true even without the first optimization. In the standard version, we would have been copying values from the right list (i.e., list `C`) just to overwrite the same values in the original list (i.e., list `A`). But the second optimization is easier to notice here, where instead of just overwriting data (i.e., list `A`) with the same data from somewhere else (i.e., list `C`), we'd be overwriting the data with itself (i.e., also list `A`) from the exact same position.

We can explicitly see this if we look at the code provided for implementing the first optimization:

```python
# merge B and C back into A
def merge(A, start, end, B):
    i = 0                          # start index of B
    j = (end + start) // 2 + 1    # start index of C (in A)
    k = start
    # compare smallest elements before each merge
    while i < len(B) and j < end + 1:
        if B[i] <= A[j]:
            A[k] = B[i]
            i += 1
        else:
            A[k] = A[j]
            j += 1
        k += 1
    # merge any remaining elements of B into A
    while i < len(B):
        A[k] = B[i]
        i += 1
        k += 1
    # merge any remaining elements of C into A
    while j < end + 1:
        A[k] = A[j]
        j += 1
        k += 1

def merge_sort(A, start, end):
    # base case (empty or single-element list is already sorted)
    if end <= start:
        return
    mid = (end + start) // 2
    merge_sort(A, start, mid)      # recursively sort first half of A
    merge_sort(A, mid + 1, end)    # recursively sort second half of A
    B = A[start:mid+1]             # copy sorted first half to B[]
    merge(A, start, end, B)        # merge sorted values B[] and C[] back into A[start...end]
```

All we have to do to implement this second optimization is delete the final `while` loop above (the one that "merges" any remaining elements of `C` into `A`), since those elements are already sitting in their final positions.

#### Optimization 3

Only allocate working space once

- Decrease allocations and deallocations

Instead of allocating or deallocating or garbage collecting space for the merges over and over again, we *know* it's going to take linear total space, so we can just allocate a single array and use it for *all* of the needed copies. Anytime we need to copy a value from the original array, just copy it into the same indexed position of the extra array. That spot is reserved for data from the same index. We'll see why using that position is nice in a moment.

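A sketch of what this looks like in Python (the `merge_sort` wrapper / `_merge_sort` helper split is my naming, not the post's): `temp` is allocated once, and every value copied out of `A[i]` lands in `temp[i]`, the same index.

```python
def merge_sort(A):
    temp = [0] * len(A)              # allocate working space once
    _merge_sort(A, 0, len(A) - 1, temp)

def _merge_sort(A, start, end, temp):
    if end <= start:
        return
    mid = (start + end) // 2
    _merge_sort(A, start, mid, temp)
    _merge_sort(A, mid + 1, end, temp)
    # Copy the sorted left half into the same positions of temp.
    for i in range(start, mid + 1):
        temp[i] = A[i]
    i, j, k = start, mid + 1, start
    # Merge temp (left half) with the right half still sitting in A.
    while i < mid + 1 and j < end + 1:
        if temp[i] <= A[j]:
            A[k] = temp[i]
            i += 1
        else:
            A[k] = A[j]
            j += 1
        k += 1
    while i < mid + 1:               # leftovers of the left half
        A[k] = temp[i]
        i += 1
        k += 1
    # Leftovers of the right half are already in place.
```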

#### Optimization 4

Before any copy prior to a merge, test if the lists are already ordered:

- Potentially decreases copies and comparisons
- In the worst case, adds $n - 1$ comparisons to the runtime
- In the best case, the algorithm now takes $\Theta(n)$

Another optimization sort of overlaps the idea of the first two, although it can be put into place independently from them. Sometimes we might run into the case that the two lists we're going to merge are *already* sorted; that is, the largest element from the leftmost list is no bigger than the smallest element from the rightmost list. In that case, we don't need to copy the lists out and merge them — they're already forming a sorted list when combined. So we can check for that before we perform the merge.

This optimization has a bit of a tradeoff. For each of the merges that takes place in the algorithm ($n-1$ of them for a sort of $n$ items), we're adding an extra comparison to see if the merge is needed. If the elements are random, then we can avoid approximately half of the $n/2$ merges of one-item lists. For the $n/4$ merges of two size-$2$ lists, we can avoid approximately $1/6$ of the merges. Those avoided merges should pay for the extra comparisons. But without really careful analysis, or better yet, *testing*, it's hard to be sure just how much we'll gain from this optimization on random inputs.

But even if we don't gain much on random inputs, it gets us something in some special cases. Our basic merge sort, even with the first three optimizations, will always take time proportional to $n\lg n$: best case, worst case, every case. With this fourth optimization, if the input is already sorted or even almost sorted, then merge sort can take linear time to run, improving the asymptotic best-case performance of the algorithm. This is a way we can help those special cases, and it shouldn't *hurt* the random case. In the worst case, it will add just $n-1$ comparisons to the algorithm, which isn't that big of a deal for an $n\lg n$ algorithm anyway.

Here's an animation that shows the third and fourth optimizations at work:

The code needed to implement this final optimization to top-down merge sort is as simple as adding the following `if` statement before attempting to merge:

```python
# ...
if A[mid] > A[mid + 1]:
    # ...
```

That is, the first list goes from `A[start]` to `A[mid]`, inclusive, and the second list goes from `A[mid + 1]` to `A[end]`, inclusive. If `A[mid]` is greater than `A[mid + 1]`, then the two lists do need to be merged; otherwise, no merging is necessary.
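Putting optimizations 1 through 4 together, a final top-down merge sort might look like the following Python sketch (again, the wrapper/helper split is my own naming):

```python
def merge_sort(A):
    temp = [0] * len(A)              # optimization 3: allocate space once
    _merge_sort(A, 0, len(A) - 1, temp)

def _merge_sort(A, start, end, temp):
    if end <= start:
        return
    mid = (start + end) // 2
    _merge_sort(A, start, mid, temp)
    _merge_sort(A, mid + 1, end, temp)
    if A[mid] <= A[mid + 1]:
        return                       # optimization 4: halves already in order
    for i in range(start, mid + 1):  # optimization 1: copy only the left half
        temp[i] = A[i]
    i, j, k = start, mid + 1, start
    while i < mid + 1 and j < end + 1:
        if temp[i] <= A[j]:
            A[k] = temp[i]
            i += 1
        else:
            A[k] = A[j]
            j += 1
        k += 1
    while i < mid + 1:               # copy leftovers of the left half
        A[k] = temp[i]
        i += 1
        k += 1
    # optimization 2: right-half leftovers are already in place
```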

#### Optimization summary

- **Optimization 0:** Only copy lists *after* they are sorted, not beforehand (less space, especially for the naive parallelized version)
- **Optimization 1:** Only copy the first sorted list, leave the second sorted list in-place (decreases copies by about 25%)
- **Optimization 2:** Once the first list is merged, the remainder of the second list is already in-place and doesn't need to be copied (decreases copies; amount depends on data)
- **Optimization 3:** Only allocate working space once (decreases allocations and deallocations)
- **Optimization 4:** Before any copy prior to a merge, test if the lists are already ordered (decreases copies, amount depending on data; gives a $\Theta(n)$ best-case runtime)

### Naive parallelization

There's an almost trivial way to partially parallelize this algorithm.

#### 0th version

Once we say that we're going to separately sort the first and second half of the array, we can do those in parallel. Below, the animation shows all of the sub-calls for the zeroth version of the unoptimized algorithm happening simultaneously:

Each of the merge operations is done by one processor, and the last merge operation takes linear time all by itself. The two merges at the level below that are each done in parallel, so that level takes only half as much time, and each level takes half as much time as the level above it. So it takes linear total time to do the $\Theta(n\lg n)$ total work of the algorithm, using up to $n/2$ processors at a time.

The animation above actually shows the problem with our original approach of copying out unsorted values and then recursively sorting them before the merge. It takes a *lot* of space. Each recursive level uses linear space to hold unsorted data, but it needs to wait for deeper recursive calls to complete in order to overwrite that data. That version would need $\Theta(n\lg n)$ space. So we shouldn't copy the data into new space until *after* it's been sorted and is ready to merge (like our standard recursive model).

But what about the four optimizations we made from that model? All four of them work in parallel too.

#### 4th version

Use the original array as our unsorted data instead of copying it out. Don't copy the second of the two sorted arrays to be merged. When the left array is done merging, don't re-copy the right array. Just allocate one array for space (ever). And check to see if the two arrays are already in order before merging. We can do all of those in parallel. The problem there is, if we animate it, so much is going on at once it's hard to see what's happening:

But at least it happens quickly! It will take $n/2$ processors, $\Theta(n)$ runtime, and $\Theta(n)$ space. For perfectly sorted data, it would take logarithmic time, but generally we'd almost always expect it to take linear time.

There are also ways to parallelize the merge itself, and a fully parallel merge sort can run in roughly logarithmic time, but we won't get into that in this post.

### Unwinding the recursion

Let's now take a look at what happens within the recursive calls with a smaller example of just eight items:

Consider the comparisons and merges done throughout the entire algorithm. We'll use the 0th version of the algorithm to visualize this (because it doesn't erase anything, it shows a nice trace of everything the algorithm does, so we can see which lists were merged to make which list):

At the very bottom of the recursion, we see the unsorted values:

We just interpret each element as a sorted list of length `1`, the base case of the recursion. One level up, we see that we will quickly have a sorted list of the first two items, to be merged with the sorted list of the second two items. Eventually, a sorted list of the third pair of items gets merged with a sorted list of the last two items:

```
merge 17 with -22: compare (17, -22)
merge 15 with 20: compare (15, 20)
merge -22, 17 with 15, 20: compare (-22, 15), (17, 15), (17, 20)
merge 33 with 25: compare (33, 25)
merge 22 with 19: compare (22, 19)
merge 25, 33 with 19, 22: compare (25, 19), (25, 22)
merge -22, 15, 17, 20 with 19, 22, 25, 33:
    compare (-22, 19), (15, 19), (17, 19), (20, 19), (20, 22)
```

### Bottom-up merge sort (iterative)

Another version of merge sort, bottom-up (or *iterative*), doesn't use recursion. Instead, it just starts with an interpretation of the list as a bunch of length 1 sorted lists, and it merges the first two of those, and the next two, and the next two, etc., until it's passed through the entire array once, merging sorted lists of one item each into sorted lists of two items each. Then it goes through the entire array *again*, this time merging sorted lists of size two into lists of size four. It keeps going and each pass through the array it merges lists that are twice as large as they were for the previous pass.

That's the idea of bottom-up merge sort. Starting with an unsorted set of $n$ numbers, interpreted as $n$ sorted lists of size $1$ each, pass over the list, merging pairs of lists, leaving us with $n/2$ lists of size $2$ each, and then again leaving $n/4$ lists of size $4$ each. And continue that until we have one list of all items.

Let's consider again the example we used above:

Bottom-up merge sort sorts the array above in the following manner:

The following list of merges and comparisons can be helpful to track with the animation above:

```
merge 17 with -22: compare (17, -22)
merge 15 with 20: compare (15, 20)
merge 33 with 25: compare (33, 25)
merge 22 with 19: compare (22, 19)
merge -22, 17 with 15, 20: compare (-22, 15), (17, 15), (17, 20)
merge 25, 33 with 19, 22: compare (25, 19), (25, 22)
merge -22, 15, 17, 20 with 19, 22, 25, 33:
    compare (-22, 19), (15, 19), (17, 19), (20, 19), (20, 22)
```

We'll compare the bottom-up approach to the top-down approach in the next two sections, but the bottom-up implementation is included towards the end.

### Comparison to top-down (8 elements)

The example above is somewhat of a special case because $n=8$, which is a perfect power of $2$. It turns out that when we're sorting a list comprised of a number of items that is a perfect power of $2$, the comparisons done by top-down and bottom-up merge sort will exactly match. For example, we can compare the bottom-up comparisons we just did in the example above (first listing) with the comparisons made in the unwinding-the-recursion section (second listing):

Bottom-up:

```
merge 17 with -22: compare (17, -22)
merge 15 with 20: compare (15, 20)
merge 33 with 25: compare (33, 25)
merge 22 with 19: compare (22, 19)
merge -22, 17 with 15, 20: compare (-22, 15), (17, 15), (17, 20)
merge 25, 33 with 19, 22: compare (25, 19), (25, 22)
merge -22, 15, 17, 20 with 19, 22, 25, 33:
    compare (-22, 19), (15, 19), (17, 19), (20, 19), (20, 22)
```

Top-down:

```
merge 17 with -22: compare (17, -22)
merge 15 with 20: compare (15, 20)
merge -22, 17 with 15, 20: compare (-22, 15), (17, 15), (17, 20)
merge 33 with 25: compare (33, 25)
merge 22 with 19: compare (22, 19)
merge 25, 33 with 19, 22: compare (25, 19), (25, 22)
merge -22, 15, 17, 20 with 19, 22, 25, 33:
    compare (-22, 19), (15, 19), (17, 19), (20, 19), (20, 22)
```

As can be seen above, the *order* can change, but the actual comparisons will be the same ones. That won't be the case if we look at this example with $13$ items, which is clearly not a perfect power of $2$.

### Comparison to top-down (13 elements)

Let's now consider an example array with $n = 13$ items, a number that is very clearly *not* a power of $2$:

To sort $13$ items bottom-up, we start in the same way: compare the first two items, then the next two, and continue until we have $6$ sorted lists of size $2$ each, where the last item has nothing to merge with so it just gets left as a size $1$ list. On the next pass, we start merging those longer lists. The first two lists of length $2$ get merged into a list of length $4$. Then the next two lists of length $2$ get merged into a list of length $4$. Then another. But that last item is *still* on its own:

After our $i$th pass through the numbers merging lists, we'll be left with a bunch of sorted lists that are all of length $2^i$. Except the last list! After the first pass, that last list will be smaller than all other lists (unless we started with a number of items that is a perfect power of $2$). Above, we paused the process after the second pass, where we had three lists of length $2^2 = 4$ each and then one list of length $1$:

The next pass will be the first pass where we have an even number of lists, so that last list will finally get merged with something, specifically the length $4$ list to its left. In our final pass, we'll merge a length $8$ list on the left with a length $5$ list on the right for an overall length $13$ sorted list:

Note that in the top-down version of merge sort we *never* merge lists of such different sizes. The lists always need to be within size $1$ of each other. But in the bottom-up case, the last list can be very different in size while all the other lists are exactly the same size. Also note that the bottom-up version can have really different comparisons from the top-down version:

Bottom-up:

```
1st pass:
    merge 17 with -22
    merge 15 with 20
    merge -50 with 25
    merge 22 with 19
    merge 10 with 40
    merge 44 with 50
2nd pass:
    merge -22, 17 with 15, 20
    merge -50, 25 with 19, 22
    merge 10, 40 with 44, 50
3rd pass:
    merge -22, 15, 17, 20 with -50, 19, 22, 25
    merge 10, 40, 44, 50 with 25
4th pass:
    merge -50, -22, 15, 17, 19, 20, 22, 25 with 10, 25, 40, 44, 50
```

Top-down:

```
merge 17 with -22
merge 15 with 20
merge -22, 17 with 15, 20
merge -50 with 25
merge -50, 25 with 22
merge -22, 15, 17, 20 with -50, 22, 25
merge 19 with 10
merge 10, 19 with 40
merge 44 with 50
merge 44, 50 with 25
merge 10, 19, 40 with 25, 44, 50
merge -50, -22, 15, 17, 20, 22, 25 with 10, 19, 25, 40, 44, 50
```

Unlike when we sorted $8$ items, the comparisons above aren't just the same comparisons in different order. For example, in the bottom-up example, $19$ never gets compared to $10$, but that would be the first comparison done to recursively sort the second half of the elements in the top-down version.

On the other hand, the bottom-up version *will* compare $19$ and $22$ as its fourth comparison. In the top-down version, because they're in separate halves of the original array, it's the $20$ from the left half that gets compared to the $19$ on the right, and the $19$ won't ever get compared to the $22$.

The two versions give different sets of comparisons even though they are asymptotically the same. The bottom-up version makes $\lg n$ passes through all the data making merges, with each pass taking linear time, so it also takes $\Theta(n\lg n)$ time. The comparisons are different because *the lists being merged are different*. We can take a look at individual comparisons if we really want to make sure we understand all of the details.

### History lesson

Why would we want the bottom-up version instead of the top-down version? Some people like to avoid recursion, and some older languages might not even allow recursion, but either approach will work in most situations today. That may not always have been the case in the past, though.

In the past, the standard way to store lots of data was on reel-to-reel tapes. Those tapes didn't really have random access. They worked best just reading data sequentially from the beginning of the tape. Recursive merge sort doesn't work so well on those. It jumps around a lot, but if we had our data split between two tapes, with half of the lists on one tape and half of the lists on the other tape, then we could play both of those tapes from the start, reading off the first item from each. We could merge those two lists into a list on a third tape, and then we could merge the next list from the two tapes on to a fourth tape. After going through all of the lists on the first two tapes, alternating which tape we merged them on to, we'd now have all of our data in half as many lists, split between third and fourth tapes. Next, we'd take those two tapes and merge their data back onto the first two tapes. This allows us to merge while only having to rewind the tapes a logarithmic number of times.

People used to have to worry about stuff like minimizing tape rewinds. More recently, external memory merge sort makes a special attempt to minimize how many times we need to access a disk when we have too much data to fit in main memory.

Let's recap:

- Data was stored on reel-to-reel tapes. Random access was poor. Sequential access was better.
- For merge sort:
- Store half of the lists on one tape, half of the lists on a second tape.
- Merge the first list from each tape onto a third tape, and the second list from each tape onto a fourth.
- Alternate which tape you merge onto between the 3rd and 4th tapes.
- After completing a pass through all data, now you have all data on the 3rd and 4th tapes; merge them onto tapes 1 and 2.

### A different optimization (bottom-up)

Let's take a look at one optimization for the bottom-up version, different from those we used for the top-down recursive version. Maybe we want to just allocate a second array once, but this time we won't copy sorted values out and then merge them back; instead, we'll use the sorted lists in the original array and merge them onto the other array, then use the sorted lists in that second array and merge them back into the original array, going back and forth between the two. If we happen to need an odd number of total passes, then we'll need to copy the data back into the original array when we finish, but we get rid of all of the copy operations right before each merge for the entire algorithm.

We can visualize this optimization, which mimics details from the history lesson section:

The algorithm can be implemented in Python as follows:

`Click "Run Code" to see the output here`
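As a plain-text sketch of the ping-pong idea described above (the helper names `merge_into` and `merge_sort_pingpong` are mine, not from the post):

```python
def merge_into(src, dst, start, mid, end):
    """Merge sorted runs src[start..mid] and src[mid+1..end] into dst[start..end]."""
    i, j = start, mid + 1
    for k in range(start, end + 1):
        if i <= mid and (j > end or src[i] <= src[j]):
            dst[k] = src[i]
            i += 1
        else:
            dst[k] = src[j]
            j += 1

def merge_sort_pingpong(A):
    """Bottom-up merge sort that alternates between A and one extra array."""
    n = len(A)
    src, dst = A, A[:]          # allocate the second array only once
    size = 1
    while size < n:
        for start in range(0, n, 2 * size):
            mid = min(start + size - 1, n - 1)
            end = min(start + 2 * size - 1, n - 1)
            merge_into(src, dst, start, mid, end)
        src, dst = dst, src     # ping-pong: swap roles each pass
        size *= 2
    if src is not A:            # odd number of passes: one copy back at the end
        A[:] = src
```

Note that every pass writes every index of `dst`, so the two arrays stay consistent without any per-merge copying.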

### Bottom-up implementation (optimized)

Recalling details from the section on bottom-up merge sort, along with the different optimizations discussed for top-down merge sort, what would a decent implementation of bottom-up merge sort actually look like in code? There won't be much of a change in the actual merging of sublists:

```python
# merge left list and right list back into A
def merge(A, start, mid, end, temp):
    for i in range(start, mid + 1):  # copy left subarray into temp
        temp[i] = A[i]

    i = start    # index into temp (left subarray)
    j = mid + 1  # index into right subarray in A
    k = start    # index into merged array in A

    # merge temp (left subarray) and A (right subarray) back into A
    while i < mid + 1 and j < end + 1:
        if temp[i] <= A[j]:
            A[k] = temp[i]
            i += 1
        else:
            A[k] = A[j]
            j += 1
        k += 1

    # merge remaining elements from temp back into A
    while i < mid + 1:
        A[k] = temp[i]
        i += 1
        k += 1

    # remaining elements of right list in A are already in-place (no need for merging)

def merge_sort(A, ...):
    # ???
```

Specifically, the left sublist will range from `start` to `mid`, inclusive; furthermore, the left sublist is copied into another list, `temp`, while the right sublist, which ranges from `mid + 1` to `end`, inclusive, remains in-place. If we recall the various optimizations, then we see the code outline above hints at optimizations 0 through 3:

- **Optimization 0:** Only copy lists *after* they are sorted, not beforehand (less space, especially for the naive parallelized version)
- **Optimization 1:** Only copy the first sorted list; leave the second sorted list in-place (decreases copies by 25%)
- **Optimization 2:** Once the first list is merged, the remainder of the second list is already in-place and doesn't need to be copied (decreases copies, amount depends on data)
- **Optimization 3:** Only allocate working space once (decreases allocations and deallocations)
- **Optimization 4:** Before any copy prior to merge, test if lists are already ordered (decreases copies, amount depends on data, gives $\Theta(n)$ best case runtime)

Optimization 4 should be simple to implement within the body of the `merge_sort` function. The trickier stuff comes down to effectively handling the *mechanics*. For example, each pass through the array needs to involve merging lists twice the size of the previous pass (except the base case, which just merges lists of size $1$), and we also need to effectively handle cases where the size of the input array is *not* a perfect power of $2$ (e.g., see the example where the input array had $13$ items). Let's handle these concerns separately.

#### Doubling the size of lists to be merged on each pass

Let `size` denote the size or length of the lists to be merged for a given pass over all items. The bottom-up implementation starts from the *bottom* (i.e., the base cases, which are trivially sorted lists of length $1$ each) and works *up* (i.e., toward the entire input array of size $n$). Thus, it makes sense we should initialize `size = 1` to indicate we are starting with the base case. When should we stop? Maybe it's easier to think about when we should continue: while `size < n` (this means whatever list we've built up so far by sorting and merging is not yet complete). We can outline the body of the `merge_sort` function as follows:

```python
def merge_sort(A):
    n = len(A)
    temp = [0] * n  # allocate working space only once
    size = 1        # size of sublists to sort and merge
    while size < n:
        for start in range(0, n, 2 * size):
            mid = ?
            end = ?
            ? merge(A, start, mid, end, temp) ?
        size *= 2
```

- Line `4` indicates our `size = 1` initialization.
- Line `5` indicates we're going to continue attempting to sort and merge lists until `size >= n`, at which point we should stop.
- Line `6` indicates the `start` position for the left list of whatever left/right list pair we consider for sorting and merging. The `start` position should begin at `0` for the very first left list, and `start` should clearly never go beyond the last index of the input array (i.e., `start` can go up to, but not include, `n`). Importantly, the `2 * size` argument to `range` specifies the step size by which `start` grows from one left/right pair of lists to the next (in Python, the third argument to `range` specifies the step size by which the first argument should grow). For example, when `size = 1`, the step size is `2 * size = 2 * 1 = 2`, which means we have the following `start` values (i.e., first endpoint of each left list for pairs of left/right lists): `0, 2, 4, 6, ...`. In general, the following illustration shows the structure of how left/right pairs of sublists are arranged: $\overbrace{ [\;\overbrace{\overbrace{[\texttt{start},\ldots,\texttt{mid}]}^{\texttt{size}},\overbrace{[\texttt{mid+1},\ldots,\texttt{end}]}^{\texttt{size}}}^{2\,*\,\texttt{size}}, \ldots, \underbrace{\overbrace{\overbrace{[\texttt{start},\ldots,\texttt{mid}]}^{\texttt{size}},\overbrace{[\texttt{mid+1},\ldots,\texttt{end}]}^{\texttt{size}}}^{2\,*\,\texttt{size}}}_{\substack{\text{caution must be exercised with the last}\\\text{left/right pair of sublists because}\\\text{the right sublist may not even exist}}}\;] }^{n}$ The cautionary note under the last pair of left/right sublists will be discussed and resolved in the next section, but for now it's helpful to simply note we will need to deal with cases where a left sublist exists and a right sublist does not exist, indicating we should not attempt a merge.
- Line `10` indicates we should double the size of sublists we're attempting to merge on each full pass.

Given the details outlined above, how should `mid` and `end` be calculated (lines `7` and `8`, respectively)? Since these variables denote array *index values*, they need to be computed so as to ensure each left and right sublist has a length of `size` (except possibly the last right sublist).

- `mid`: If the left sublist starts at index `start` and ends at index `mid`, inclusive, then we must have `mid - start + 1 == size` to accurately capture the fact that `mid` is included and the entire sublist is of length `size`. For example, if we had `start = x` and we're about to try to merge sublists of size `4`, then we have `size = 4`, but going from `start = x` to `start + size = x + 4`, inclusive, would be a mistake because then we'd have five items, not four, specifically at the following indexes: `x`, `x + 1`, `x + 2`, `x + 3`, `x + 4`. We need to subtract out the last one. Consequently, we should ultimately have `mid = start + (size - 1)`.
- `end`: The right sublist begins at `mid + 1` and ends at `end`. If we're not dealing with the very last right sublist, then all other right sublists should have a length of `size`, which can be obtained just like above: `end - (mid + 1) + 1 == size`. That is, we should have `end = (mid + 1) + (size - 1)`. If we are dealing with the very last right sublist, then note that the `end` value should not exceed the last index of the input array, `n - 1`. Hence, we should really have `end = min((mid + 1) + (size - 1), n - 1)`.

#### Handling edge cases concerning the last sublist(s)

Our work in the previous section gives us the following outline of the `merge_sort` function:

```python
def merge_sort(A):
    n = len(A)
    temp = [0] * n  # allocate working space only once
    size = 1        # size of sublists to sort and merge
    while size < n:
        for start in range(0, n, 2 * size):
            mid = start + (size - 1)
            end = min((mid + 1) + (size - 1), n - 1)
            ? merge(A, start, mid, end, temp) ?
        size *= 2
```

Our final concern is highlighted above: how and when should we actually attempt to merge sublists? When they're not sorted, of course! We can easily go ahead and implement the fourth and final optimization from the top-down approach:

```python
# ...
if A[mid] > A[mid + 1]:
    merge(A, start, mid, end, temp)
# ...
```

Recall that `mid` is the right endpoint of the left sublist, and `mid + 1` is the left endpoint of the right sublist. If `A[mid] > A[mid + 1]`, then it must be the case that the current pair of sublists we're considering is *not* in order. We need to sort and merge them.

But what if `A[mid + 1]` doesn't actually exist? Consider again the example of the array input with $13$ items:

On the first pass, where `size = 1`, note that `start` takes on the following values: `0`, `2`, `4`, `6`, `8`, `10`, `12`. Notably, the last `start` value corresponds to the last element in the array: `A[12] = 25`. We have the following:

- `start = 12`
- `mid = start + (size - 1) = 12 + (1 - 1) = 12`
- `end = min((mid + 1) + (size - 1), n - 1) = min(12 + 1 + 1 - 1, 13 - 1) = min(13, 12) = 12`

This is a problem. Even checking the condition `A[mid] > A[mid + 1]` would throw an error because, in the case of the example array pictured above, `A[mid + 1] = A[12 + 1] = A[13]` does not exist. We'd get an index out of bounds error. As previously mentioned and illustrated via animation, when there's nothing for the last sublist to be sorted and/or merged with (i.e., the right sublist does not exist), we simply skip it because it's trivially sorted:

Only when the last left sublist can be merged with another sublist do we actually attempt a sort and merge (even though the lists may be of very different sizes):

When does it make sense to attempt a sort and merge between two lists? When we actually have two lists! The condition `A[mid] > A[mid + 1]` is simply an optimization, but the condition `mid < end` should be checked first, before the optimization and certainly before any attempted merge; that is, if `mid >= end`, then we don't actually have a left list and a right list, only a left list. This observation lets us complete the body of the `merge_sort` function:

```python
def merge_sort(A):
    n = len(A)
    temp = [0] * n  # allocate working space only once
    size = 1        # size of sublists to sort and merge
    while size < n:
        for start in range(0, n, 2 * size):
            mid = start + (size - 1)
            end = min((mid + 1) + (size - 1), n - 1)
            # `mid < end` ensures we have two lists for sorting and merging;
            # `A[mid] > A[mid + 1]` ensures we don't sort and merge the two
            # lists unnecessarily (optimization 4)
            if mid < end and A[mid] > A[mid + 1]:
                merge(A, start, mid, end, temp)
        size *= 2
```

#### Implementation

Using the observations detailed above, we can stitch the `merge` and `merge_sort` functions together to complete a decently optimized bottom-up implementation of merge sort:

`Click "Run Code" to see the output here`
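As a plain, non-interactive sketch, stitching the two functions together might look like the following (this mirrors the `merge` and `merge_sort` code developed above):

```python
# merge left list and right list back into A (only the left list is copied out)
def merge(A, start, mid, end, temp):
    for i in range(start, mid + 1):
        temp[i] = A[i]          # copy left subarray into temp
    i, j, k = start, mid + 1, start
    while i < mid + 1 and j < end + 1:
        if temp[i] <= A[j]:
            A[k] = temp[i]
            i += 1
        else:
            A[k] = A[j]
            j += 1
        k += 1
    while i < mid + 1:          # flush the remainder of the left run
        A[k] = temp[i]
        i += 1
        k += 1
    # the remainder of the right run is already in place

def merge_sort(A):
    n = len(A)
    temp = [0] * n              # allocate working space only once
    size = 1                    # size of sublists to sort and merge
    while size < n:
        for start in range(0, n, 2 * size):
            mid = start + (size - 1)
            end = min((mid + 1) + (size - 1), n - 1)
            if mid < end and A[mid] > A[mid + 1]:
                merge(A, start, mid, end, temp)
        size *= 2
```

The `mid < end` check short-circuits before `A[mid]` is ever read with an out-of-range `mid`, handling the lone-sublist edge case discussed above.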

### Timsort ideas

There are, of course, other optimizations that have not been discussed. For instance, insertion sort can sort a small number of elements faster than merge sort, so we could use that to change merge sort's base case. A sort called Timsort does that and adds other optimizations, some of which are extensions of the optimizations we've already seen. It aims to use less time and less space even if it isn't asymptotically different for most cases.

Recap:

- Use insertion sort for a bigger base case (it works better than merge sort on a small number of elements)
- Timsort:
    - Also use large increasing/decreasing runs for the base case
    - Copy only part of the left or right array before a merge (whichever is smaller)
    - Optimize when many items come from one list in a row

## Quick sort and quick select

### Introduction

Quick sort is a recursive algorithm that sorts data in expected $O(n\lg n)$ time with excellent real-world performance. It was introduced by Tony Hoare in 1960, and then Robert Sedgewick put out a 300-page dissertation analyzing the Knuth out of it in 1975.

### Quick sort concept

Especially if we already understand recursion, the idea behind quick sort is really simple. First, separate the items into one group with small values and another group with large values. Next, sort each of those two groups separately. The initial splitting of the set into two parts is called *partitioning* the set. Traditionally, we partition the set by picking an element from the array, called the *pivot* (or bound), and compare every element against it.

For example, suppose we have the following array, and we choose the *last* element, `25`, to be the pivot:

`[40, 41, 17, -22, 25, 55, -18, 35, 10, 25, 33, 19, 44, 51, 25]`

Let's identify the elements that are larger than the pivot and those that are smaller than or equal to it:

$[ \overbrace{40}^{\text{larger}}, \overbrace{41}^{\text{larger}}, \underbrace{17}_{\text{smaller}}, \underbrace{-22}_{\text{smaller}}, \underbrace{25}_{\substack{\text{smaller}\\\text{or equal}}}, \overbrace{55}^{\text{larger}}, \underbrace{-18}_{\text{smaller}}, \overbrace{35}^{\text{larger}}, \underbrace{10}_{\text{smaller}}, \underbrace{25}_{\substack{\text{smaller}\\\text{or equal}}}, \overbrace{33}^{\text{larger}}, \underbrace{19}_{\text{smaller}}, \overbrace{44}^{\text{larger}}, \overbrace{51}^{\text{larger}}, 25 ]$

For the sake of clarity, let's color the pivot in $\color{gray}{\text{gray}}$, the values *strictly larger* than the pivot in $\color{magenta}{\text{magenta}}$, and the values less than or equal to the pivot in $\color{cyan}{\text{cyan}}$:

Ideally, we could just move the elements smaller than or equal to the pivot to the lower index positions of the array and then move the elements larger than the pivot to the higher index positions of the array, leaving the pivot between those blocks in its correct, final location:

$[ \overbrace{\color{cyan}{17}, \color{cyan}{-22}, \color{cyan}{25}, \color{cyan}{-18}, \color{cyan}{10}, \color{cyan}{25}, \color{cyan}{19}}^{\text{smaller or equal block}}, \overbrace{\color{gray}{25}}^{\text{pivot}}, \overbrace{\color{magenta}{40}, \color{magenta}{41}, \color{magenta}{55}, \color{magenta}{35}, \color{magenta}{33}, \color{magenta}{44}, \color{magenta}{51}}^{\text{larger block}} ]$

This process can be animated in the following way:

Next, we just need to recursively sort the small and large sets/blocks, which are *completely* independent from each other, so they don't need to interact:

Let's recap what we have so far for the concept of quick sort:

- Separate elements into two groups: small keys, large keys
- Recursively sort the group with small keys, then the group with large keys
- Use one element (the bound, or *pivot*) to separate the small vs. large groups

### Quick sort pseudocode (1)

To flesh out a few more details: like a lot of recursive programs, after the user calls the high-level sorting routine, *it* calls recursive quick sort, which uses a couple of extra parameters to mark the start and end of the part of the array we're currently sorting. When we make our recursive calls, those parameters allow us to specify which of the two partitions we want to recursively sort:

```
QuickSort(A[], start, end)
    if (start >= end) return
    pivotIndex = Partition(A, start, end)  # Partition returns an index
    QuickSort(A, start, pivotIndex - 1)
    QuickSort(A, pivotIndex + 1, end)
```

The other big detail to flesh out is how we actually partition.

### Partitioning (CLRS/Lomuto partition)

We'll start with the version in CLRS [20] (by Nico Lomuto). For the part of the array we're currently sorting, it uses the *last* element as its pivot and compares it against the other elements in the block being sorted, left to right. It will separate the array into two blocks. The first block gets elements that are smaller than or equal to the pivot, and the second block gets elements that are larger than the pivot. While it's running (i.e., while the `Partition` routine is executing), there's a third block: the elements we haven't looked at yet.

Technically, as CLRS notes, as the `Partition` procedure runs, each element falls into exactly one of four regions, some of which may be empty:

- Elements known to be smaller than or equal to the pivot (first block described above)
- Elements known to be greater than the pivot (second block described above)
- Elements not yet placed into either side of the partition because their comparative value to the pivot is not yet known (third block described above)
- The pivot itself

So when we start, the third region above (i.e., the third block) contains everything except the pivot:

So if the value we see is larger than the pivot, then it's a really simple case: we just extend the block of bigger elements in $\color{magenta}{\text{magenta}}$ past the one we just saw. This is the case for the first two elements above, namely $40$ and $41$. These elements are both greater than the pivot, $25$; thus, we extend the region colored in magenta:

If, however, the new value is smaller than or equal to the pivot, then we need to make room for the small partition (in $\color{cyan}{\text{cyan}}$) to grow. We do that by swapping the new value with the *first* item in the partition of larger elements. Above, this means swapping $17$ with $40$ because

- $17$ is smaller than or equal to the pivot (i.e., $17\leq 25$), and
- $40$ is the first element of the partition of larger elements

We can see this as follows:

This actually happens for the next two elements as well, $-22$ and $25$, because they are both smaller than or equal to our pivot:

The single swap we see when we encounter an element smaller than or equal to the pivot essentially slides the partition or "window" of larger elements over by one. But notice that this changes the relative order of elements in the large block. We can easily see this if we revisit the last three swaps in our example:

Note how the large block changes from $[40,41]$ to $[41,40]$ after encountering $17$, then back to $[40,41]$ after encountering $-22$, and then again back to $[41,40]$ after encountering $25$.

Shuffling those items isn't a big problem, but this means that quick sort is not a *stable* sort: if two elements have the same key, then quick sort might switch which one comes first.

After the partition completes, the pivot never needs to be moved again:

It has its location in the final sorted array. All of the elements to its left are less than or equal to it, and all of the elements to its right are strictly larger than it.

We can recursively sort the block of small items with smaller indices and then the block of larger items with larger indices. The following animation shows the rest of the sort:

Let's recap our partitioning strategy so far:

- Pivot is rightmost element
- Compare pivot against elements, left to right
- Block of small elements, then large elements, then untested elements
- Unstable sort: elements with equal keys might switch order
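The recap above translates almost line for line into code. Here is a minimal Python sketch of the CLRS/Lomuto scheme (the function name `partition` is just a placeholder):

```python
def partition(A, start, end):
    """Lomuto partition: pivot is A[end]; returns the pivot's final index."""
    pivot = A[end]
    i = start                        # next slot for the "small or equal" block
    for j in range(start, end):      # scan the untested block, left to right
        if A[j] <= pivot:
            A[i], A[j] = A[j], A[i]  # grow the small block by one (may reorder equals)
            i += 1
    A[i], A[end] = A[end], A[i]      # drop the pivot between the two blocks
    return i
```

Running it on the example array above leaves the pivot `25` with every smaller-or-equal element to its left and every strictly larger element to its right.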

### Worst case performance

If our data has almost all distinct values, in random order, then quicksort has good expected runtime. But rewatch the last part of the animation above — as soon as we finish recursively sorting the small items with smaller indices and move on to the larger items with larger indices, we see a case where quick sort does not work very well:

If the data is already sorted, then each time we pivot, we pick the largest item, compare it against all of the other items, and then sort again on all of the items except that large pivot. In the worst case, if we keep picking really bad pivots, then quick sort takes $\Theta(n^2)$ time.

Excluding the recursion, it's really good on space: it just needs a constant number of indices and room to swap elements. But if we consider the program call stack, the same worst case example above looks really bad. If we keep picking the largest element as the pivot and we always make our first recursive call on that first partition with almost all of the elements from our current subproblem, then we can end up needing linear depth for our program call stack.

### Quick sort pseudocode (2)

To fix that, make the first recursive call on whichever partition has fewer elements:

```
QuickSort(A[], start, end)
    if (start >= end) return
    pivotIndex = Partition(A, start, end)       # Partition returns an index
    if (pivotIndex - start < end - pivotIndex)  # first partition has fewer elements
        QuickSort(A, start, pivotIndex - 1)
        QuickSort(A, pivotIndex + 1, end)
    else                                        # second partition has fewer elements
        QuickSort(A, pivotIndex + 1, end)
        QuickSort(A, start, pivotIndex - 1)
```

This will "procrastinate" the majority of our work until later, letting us empty the small subproblem from our program stack before we get to our real work.

Assuming that our compiler optimizes tail recursion, that will reduce our worst-case program stack to logarithmic depth.

### Quick sort pseudocode (3)

Even if our compiler doesn't optimize tail recursion, *we can*. First, we manually make the recursive call to the subproblem with fewer elements. Next, because that second recursive call was the last thing that we needed to do, we can unwind the recursion, getting rid of the second recursion by putting the entire procedure inside of a loop:

```
QuickSort(A[], start, end)
    while (start < end)
        pivotIndex = Partition(A, start, end)  # Partition returns an index
        if (pivotIndex - start < end - pivotIndex)
            QuickSort(A, start, pivotIndex - 1)
            start = pivotIndex + 1
        else
            QuickSort(A, pivotIndex + 1, end)
            end = pivotIndex - 1
```

This will improve the worst case space requirements of the algorithm. If we're not quite sure what tail recursion is (a Stack Overflow post might be helpful too), then we don't need to worry about it too much. Just know the pseudocodes discussed above are iterations on improving the *space* requirement for quick sort, not the *time* requirement.
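A Python sketch of this loop-based, smaller-side-first version might look like the following (names are my own; the partition is the Lomuto scheme from earlier):

```python
def partition(A, start, end):
    # Lomuto partition (pivot is A[end]); returns the pivot's final index
    pivot = A[end]
    i = start
    for j in range(start, end):
        if A[j] <= pivot:
            A[i], A[j] = A[j], A[i]
            i += 1
    A[i], A[end] = A[end], A[i]
    return i

def quick_sort(A, start=0, end=None):
    """Recurse only on the smaller partition; loop on the larger one."""
    if end is None:
        end = len(A) - 1
    while start < end:
        p = partition(A, start, end)
        if p - start < end - p:   # first partition has fewer elements
            quick_sort(A, start, p - 1)
            start = p + 1         # keep looping on the second partition
        else:                     # second partition has fewer elements
            quick_sort(A, p + 1, end)
            end = p - 1           # keep looping on the first partition
```

Since each recursive call handles at most half of the current block, the stack can only get $O(\lg n)$ deep, even on already-sorted input.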

Let's recap what we've got so far for the worst case:

- **Time:** $T(n) = T(n-1) + n = \Theta(n^2)$
- **Space:** $\Theta(1)$ references, $\Theta(\lg n)$ each, excluding the program stack, *but the program stack grows to $\Theta(n)$ depth*: $\Theta(n)$ space in the uniform cost model, $\Theta(n\lg n)$ in the logarithmic cost model.
- Making the first recursive call on the group with fewer elements gives worst case $\Theta(\lg n)$ program stack depth: $\Theta(\lg n)$ space in the uniform cost model, $\Theta(\lg^2 n)$ in the logarithmic cost model.

### Expected performance

Above, we talked about worst-case performance, which is pretty bad. But what about average-case performance? Well, average over what exactly? What inputs do we *expect* to be given, and what order are they in? Generally, we don't know ahead of time.

To analyze expected behavior, first we'll assume the array doesn't have too many duplicate items, ideally none like in the following example array:

But the items above are almost perfectly sorted, which we previously saw was our worst case.

### Randomized quick sort

So we need one minor change in the algorithm. Pick a *random* element from the block we're sorting as our pivot. We can use the same `Partition` code we used previously, but first swap a random pivot from the block we're sorting to the back of that block, where the `Partition` code uses it as a pivot.

For example, with the array above, when we first start, "the block we're sorting" is the entire array; hence, we pick *any* element at random in the entire array, say `12` in the example above, and swap it with the item at the back of the block, `59`, and then carry on with our swaps as usual:

Now we need to recursively sort the lower block and then the higher block. Starting with the lower block, we pick an element at random from `[-23, -18, -15, 9, -10]`, say `-18`, and we swap that element with `-10`:

The entire array gets sorted in this manner. The full animation of the sort is shown below, where the pivot used in each block sort is random:

Let's recap what we've discussed so far about randomized quick sort:

- Pick a random element from the partition as a pivot
- Swap that pivot into the last position of the block
- Run the rest of the partitioning just like before
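The random-pivot tweak is a two-line preamble to the Lomuto partition from before (function names are my own):

```python
import random

def randomized_partition(A, start, end):
    """Swap a random element into the last slot, then Lomuto-partition as usual."""
    r = random.randint(start, end)
    A[r], A[end] = A[end], A[r]      # the random pivot moves to the back
    pivot = A[end]
    i = start
    for j in range(start, end):
        if A[j] <= pivot:
            A[i], A[j] = A[j], A[i]
            i += 1
    A[i], A[end] = A[end], A[i]
    return i

def randomized_quick_sort(A, start=0, end=None):
    if end is None:
        end = len(A) - 1
    if start >= end:
        return
    p = randomized_partition(A, start, end)
    randomized_quick_sort(A, start, p - 1)
    randomized_quick_sort(A, p + 1, end)
```

The input order no longer matters: the expected runtime is now over the algorithm's own coin flips, not over assumptions about the data.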

### Expected performance (continued)

Now, regardless of the initial order of the elements, we can talk about the expected performance of the algorithm over those random pivot choices. The analysis gets pretty rough, which is why it's broken out into its own entire section. But the expected runtime is $O(n\lg n)$ with a logarithmic depth program stack. It's considered to be an in-place algorithm — it doesn't use a lot of extra space. CLRS calls this "randomized quick sort" and, in general, it's probably a good idea in real life to randomize our pivots.

Let's recap:

- Assumption: no (or few) duplicate values
- Expected $O(n\lg n)$ runtime
- Expected $O(\lg n)$ stack depth; $O(\lg n)$ space

### Partitioning with duplicates

What if we don't want to make the assumption that there aren't many duplicates? Randomized or not, if everything has the same key, then we end up in a worst-case situation:

Even if we had some special-case code to check if all keys are equal, inputs where *almost* everything is equal, except a constant number of smaller values, will still take expected $\Theta(n^2)$ time. There will also be expected linear stack depth unless we call the subproblem with fewer elements first (e.g., the second and third pseudocodes discussed previously).

If we're *really* worried about duplicates, then we can modify the `Partition` procedure to partition into three blocks, so that after partitioning, we have items *less* than the pivot, items *equal* to the pivot, and then items *greater* than the pivot.

There's a trade-off here. Maybe for elements that are less than the pivot we still have only one comparison, but now we need a *three*-way rotation instead of a swap (this is remarked on in greater detail in the implementation of a more optimized version of randomized quick sort), and for other elements we have an extra comparison to check for equality. The payoff is that when we make our recursive calls, we get to ignore the *entire* block of elements equal to the pivot. Our previous worst-case input, where we had lots of duplicate values, took $\Theta(n^2)$ time, but having many duplicate values is now a best-case input, which would result in a linear runtime.

The animation below illustrates all of this (for the sake of clarity, the choice of pivot has not been randomized):

When our pivot *is* randomized, if one element is repeated a lot, then we have a higher chance to pick it and get rid of *all* of those values from our recursive subproblems.

Let's recap:

- Many duplicate values force expected $\Theta(n^2)$ runtime
- Instead of two blocks ($\leq$, $>$), partition into three blocks ($<$, $==$, $>$)
- $\Theta(n\lg n)$ expected and best-case runtime if there are no duplicates
- Duplicates decrease the runtime, $\Theta(n)$ best-case runtime
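One standard way to implement the three-block partition is the Dutch national flag scheme sketched below (this is an assumption on my part; the rotation-based variant the post alludes to may differ in its details):

```python
def partition3(A, start, end):
    """Three-way partition around pivot A[end].
    Returns (lt, gt) so that A[lt..gt] holds the values equal to the pivot,
    everything before lt is smaller, and everything after gt is larger."""
    pivot = A[end]
    lt, i, gt = start, start, end
    while i <= gt:
        if A[i] < pivot:
            A[lt], A[i] = A[i], A[lt]
            lt += 1
            i += 1
        elif A[i] > pivot:
            A[i], A[gt] = A[gt], A[i]
            gt -= 1                  # don't advance i: the new A[i] is unseen
        else:
            i += 1
    return lt, gt

def quick_sort_3way(A, start=0, end=None):
    if end is None:
        end = len(A) - 1
    if start >= end:
        return
    lt, gt = partition3(A, start, end)
    quick_sort_3way(A, start, lt - 1)  # the whole equal block is skipped
    quick_sort_3way(A, gt + 1, end)
```

On an all-equal array, `partition3` returns `(start, end)`, both recursive calls are empty, and the sort finishes in linear time.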

### Optimizations

#### Hoare's partition

Let's now talk about some optimizations, starting with a different way to partition that's much closer to Hoare's original method from around 1960. To switch things up, we'll use the *middle* element as the pivot value instead of the last:

That value tends to work well if the data is almost sorted when we start. We just copy its value and compare against *that* copy without needing to swap it to the last position.

This strategy will also start a block of small elements on the left, comparing the pivot value against values and expanding that block until it runs into something that is greater than or equal to the pivot:

The change is that next it starts the block of large elements on the *right* side of the block, comparing against elements with high indices, expanding *that* block until it runs into something smaller than or equal to the pivot:

After both of those progressions have stopped, we know that the lower index references an element not smaller than the pivot (i.e., $55$ in the example above), and the larger index references an element not larger than the pivot (i.e., $25$ in the example above). So we swap those two items (i.e., we swap $55$ and $25$ in our example above), extending the block on each side:

Then, we continue again from the lower index, repeating all of that until those two blocks meet:

This way of doing things can be summarized as follows:

- Use pivot from middle (not required)
- Partition of small elements on left (extend this partition until we run into a large element)
- Partition of large elements on right (extend this partition until we run into a small element)
- When both partitions are blocked, swap those elements and continue

Two things to note here. The first issue is that after we finish partitioning, the pivot might not be in its final location. It gets absorbed or swapped into one of the two blocks. It might not be on the boundary between the two blocks.

For example, this happened above in our example, where the pivot, $33$, ended up in the right block and not on the boundary between the two blocks:

What this means is that when we make our two recursive calls, those two calls, combined, include *all* of the elements of our current block — they don't get to exclude the pivot.

The second issue is that no matter if we are looking at the block on the left or on the right, if we have a value *equal* to the pivot, then *that* value will stop *that* block from progressing:

And it will be swapped into the other block:

That can make us do some weird things, like above, where we swap a $25$ for a $25$. We can see the rest of the sort play out in the animation below:

The observation of doing weird things like swapping a $25$ for a $25$ stands in sharp contrast to the intuition we gained in our previous approaches; specifically, our intuition previously was to let equal values stay in whatever block we were trying to extend, *not stopping* on those values, *not swapping* them, and that would let our partition finish as quickly as possible.

That was also the approach suggested by Hoare in one of his early papers. But why is that seemingly more optimized approach a mistake? Why stop each block on values equal to the pivot? Isn't it good to avoid having extra swaps? If we again consider extreme cases with one value duplicated for almost the entire array, it could be easy to just extend the partition from one side to include them all — that might finish *partitioning* without any swaps at all, but it can also bring us back to the worst-case $\Theta(n^2)$ performance of our first partition when we didn't worry about duplicates:

Instead, if we stop extending each block when we hit an equal value, then that extreme duplicate case works really nicely. The duplicate values should get divided nicely between the two partitions, and we don't end up with worst-case performance for the algorithm. We avoid that worst-case performance without any other changes needed.

It seems a little weird to call this an "optimization" over the other partition when this one came first (historically). But what makes it faster? Both partitions use about the same number of comparisons, but the Hoare partition uses a lot fewer *swaps*. The Lomuto partition will basically require one swap for each element less than or equal to the pivot. We expect about $n/2$ swaps to partition $n$ elements. Each swap in the Hoare partition causes *both* blocks to get bigger, expanding towards each other, so the most we can get is $n/2$ swaps, but randomly distributed distinct values will give about $n/6$ expected swaps, only one-third as many.

So if the Hoare partition came first, runs faster on random data (because it needs fewer swaps), and works better in the case of duplicate data, then why is the other partition (i.e., Lomuto) the one that's most commonly taught?

The Hoare partition is really easy to mess up, especially the edge cases (e.g., maybe because we don't automatically remove the pivot itself from the subproblems we call). The code might seem to work well for large problems with a random pivot, but when the subproblems get smaller, then it's easy for those edge cases to bite us. Even if our code is correct, it's harder to prove it.

Let's recap some of the observations above about the Hoare partition:

- Pivot gets included in one partition
- Values equal to the pivot can be split between partitions (no special block needed for duplicates)
- Random pivot on distinct values requires $n/6$ expected swaps (Lomuto expects about $n/2$)
- Harder to code, teach, debug, and prove correct

#### Thinking about quick sort (pedagogical simulation)

Many things discussed above are all relatively low-level implementation details for partitions that don't really play into the big picture *idea* of quick sort. For David's class, if he wants students to simulate the default vanilla quick sort, then he lets students use the Lomuto partition, but he instructs them to *pretend* it's stable.

That is, pick the last element as the pivot, imagine that it gets compared to all of the other elements, left to right, and then the small elements slide left, the large elements slide right, and the pivot moves to its proper location, all without changing the relative order of elements within each group:

We then recurse on the small elements, then the large elements:

That makes it really easy to simulate quick sort and to think about its recursion as a whole, while letting us ignore the lower-level details of how swaps change the ordering of elements within the blocks for some particular partitioning algorithm.

#### Pivot picking

Beyond trying to decrease the number of swaps, like Hoare's partition does, we really do want to pick a good pivot. Randomizing the choice of pivot is good, but it still leaves a reasonable chance of picking a bad pivot. And when we do pick a bad pivot, one of our subproblems will be almost as large as the one we started with.

So if we're partitioning 10,000 items, there's a 2% chance that one of our partitions is really big, 9,900 items or more. There's a 20% chance that we get a partition with at least 9,000 items. Maybe for that many items it's worth just a little more time to try to avoid a bad pivot.

If we pick three random elements and then use the median of them as our pivot, then the odds of picking a bad pivot fall by a factor of over 33 for that extreme case. It helps us avoid really bad pivot picks. For large blocks, it's probably worth that minor overhead. Even without randomizing, consider finding the median from among the first, last, and middle element of the block. If the block is in random order, then that should work well. And if the block is somewhat sorted, then that middle element will be a good pivot.

Let's recap:

- Probability of picking a "bad" pivot for large $n$:
- 2% chance of picking a pivot in top or bottom 1%
- 20% chance of picking a pivot in top or bottom 10%

- If we pick three random elements and take median as pivot:
- 0.0596% chance of picking a pivot in top or bottom 1%
- 5.6% chance of picking a pivot in top or bottom 10%

- For non-randomized version, consider using the median of the first, last, and middle items from a block
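The deterministic median-of-three rule above can be sketched in a few lines of Python (the helper name `median_of_three_index` is mine, not from the video):

```python
def median_of_three_index(A, start, end):
    """Return the index of the median of A[start], A[mid], A[end].

    Using the first, last, and middle elements avoids the worst case
    on already-sorted (or reverse-sorted) input, where always taking
    A[end] as the pivot degrades quick sort to Theta(n^2).
    """
    mid = (start + end) // 2
    candidates = sorted([(A[start], start), (A[mid], mid), (A[end], end)])
    return candidates[1][1]  # index of the middle value

# On sorted input, the middle element is a perfect pivot:
print(median_of_three_index([10, 20, 30, 40, 50], 0, 4))  # -> 2
```

A quick sort would then swap `A[median_of_three_index(A, start, end)]` into the usual pivot slot before partitioning.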

#### Insertion sort (small subproblems)

For small enough subproblems, insertion sort is faster than quick sort. Similar to what was mentioned for merge sort, once the block we are sorting in quick sort becomes small enough, we can just use insertion sort or we could leave those small problems unsorted and run insertion sort on the entire array after the rest of quick sort finishes. Caching issues supposedly make the first approach faster though.

Recap:

- Insertion sort runs faster than quick sort on small inputs
- Similar to merge sort — we can use it for small subproblems
- Unlike merge sort, we could instead skip the small subproblems and call insertion sort once after we are done with quick sort
- Insertion sort works well on "almost sorted" arrays
- The first approach may work better due to caching issues
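The first approach (handing small blocks straight to insertion sort) might look like the following sketch, where the cutoff of 16 is a typical but arbitrary choice and all names are mine:

```python
import random

CUTOFF = 16  # blocks at or below this size go to insertion sort (tunable)

def insertion_sort(A, start, end):
    # Sort A[start..end] (inclusive) in place; fast on tiny blocks.
    for i in range(start + 1, end + 1):
        key = A[i]
        j = i - 1
        while j >= start and A[j] > key:
            A[j + 1] = A[j]
            j -= 1
        A[j + 1] = key

def hybrid_quicksort(A, start, end):
    # Small block: stop recursing and let insertion sort finish it.
    if end - start + 1 <= CUTOFF:
        insertion_sort(A, start, end)
        return
    # Otherwise: randomized Lomuto partition, then recurse on both sides.
    pivot_index = random.randint(start, end)
    A[pivot_index], A[end] = A[end], A[pivot_index]
    pivot, i = A[end], start - 1
    for j in range(start, end):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[end] = A[end], A[i + 1]
    hybrid_quicksort(A, start, i)
    hybrid_quicksort(A, i + 2, end)
```

The second approach would instead `return` on small blocks and make one `insertion_sort(A, 0, len(A) - 1)` pass at the very end.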

#### Multiple pivots

We can also try using two or three pivots to partition into more than two subproblems. Sedgewick looked at this in the 70s, and it didn't look like it would outperform using just a single pivot. But new analysis and experiments since 2009 have shown that 2 or 3 pivots might be faster, and dual-pivot quick sort was actually used to sort primitives in Java 7. Some papers say it's due to fewer swaps, others think it's due to cache efficiencies, and others think it may be architecture dependent.

Recap:

- 2 or 3 Pivots
- 3 or 4 Subproblems
- Sedgewick, 1975: not worth it
- Yaroslavskiy, 2009: dual pivot is faster (experimentally)
- Papers in the 2010s...
- Fewer swaps?
- Caching efficiency?
- Architecture dependent?

### Naive parallelization

We should also mention parallelism. When we have two recursive subproblems to handle, if we have multiple processors, then that's a really simple place we could run those two recursive calls in parallel, similar to what we saw in the naive parallel merge sort.

Simple parallelism would take quick sort down to expected linear time and logarithmic depth in the stack, but its worst-case performance wouldn't improve over a regular quick sort.

There is a more complex way to parallelize the partitioning, giving a polylogarithmic parallel quicksort.

Recap: Naive quick sort parallelization would give expected

- $\Theta(n)$ runtime
- $\Theta(\lg n)$ stack depth
- $\Theta(\lg n)$ space in the uniform cost model
- $\Theta(\lg n)$ space in logarithmic cost model

by just running subproblems in parallel, but there is also a faster, more complicated parallel quick sort that parallelizes the partitioning.

### Selection

Let's now cover a different problem so closely related to quick sort that it would seem wrong not to consider it, along with the algorithm that solves it: quick select.

Quick select is an algorithm for *selection*: given a set of $n$ numbers, it's easy to pick the maximum item or minimum item in linear time. But what if we want to select the median element or the $k$th smallest element?

One possible approach would be to sort the numbers first. And then just take the correct one after they're sorted.

To recap the simple selection problem:

- Examples:
- Find the maximum element
- Find the minimum element
- Find the median element
- Find the $k$th smallest element

- Possible approach:
- Sort the elements
- Take the correct one

### Quick select

Imagine we take the sorting approach mentioned above. Furthermore, suppose we use quick sort as our sorting algorithm of choice. Below, imagine that we're looking for the 10th smallest element:

The 10th smallest element in the example array above, $35$, is highlighted for the sake of clarity.

For such a small example input array, we can also just present the sorted array to confirm this before illustrating how quick sort figures into things:

```
#  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
[-45, -18, 10, 17, 19, 25, 25, 25, 33, 35, 40, 41, 44, 51, 55]
```

Of course, in general, we don't know ahead of time what the $k$th smallest value will be. It just helps to have the element highlighted in order to make the animations clearer.

We use the regular version of quick sort, and we begin sorting the numbers by using the last element as the pivot, $25$. Let's see how things work after the first partition:

It's business as usual: all values smaller than or equal to the pivot end up in the partition to the left of the pivot, and all values greater than the pivot end up in the partition to the right of the pivot. After we finish this first partitioning, quick sort expects us to recursively sort that partition of values less than or equal to the pivot, $25$. But why bother with them? We know that there are `8` elements with value $25$ or less:

```
#  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
[17, 25, -45, -18, 19, 10, 25, 25, 51, 35, 40, 33, 44, 55, 41]
```

But we're looking for the `10`th smallest value! So we should just ignore that entire left partition of smaller (or equal) items and look only at the items in the partition of larger items. Above, we can see $35$ is the second smallest element in the partition of larger items but still the 10th smallest overall.

Let's now recursively look in the partition of larger items:

As we see above, when we step into that block of larger items (i.e., from after the first partition), we pick the last element of the partition, $41$, as our pivot, and we partition around that pivot, leaving us with the following arrangement of items after the partitioning:

```
#  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
[17, 25, -45, -18, 19, 10, 25, 25, 35, 40, 33, 41, 44, 55, 51]
```

We see that $41$ is the fourth smallest element from the block we just partitioned and the 12th smallest overall. Thus, the element we're looking for must come from the partition of values smaller than $41$, namely $35$, $40$, $33$. We thus recurse on that partition of smaller values, choosing $33$ as our pivot:

We see that our pivot ends up being the smallest element from the block we just partitioned and the 9th smallest overall:

```
#  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
[17, 25, -45, -18, 19, 10, 25, 25, 33, 40, 35, 41, 44, 55, 51]
```

The element we're looking for must come from the partition of values larger than $33$, namely $40$ and $35$. We recurse on that partition of larger values, choosing $35$ as our pivot:

After the partitioning, we see that our pivot ends up being the smallest element from the block we just partitioned, and it's the 10th smallest overall:

```
#  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
[17, 25, -45, -18, 19, 10, 25, 25, 33, 35, 40, 41, 44, 55, 51]
```

This is what we wanted, and we can finally return its value as our answer. Note how the final array above isn't close to being sorted. We have `17, 25, -45, ...` at the start, and *that is okay*. The goal of quick select isn't to sort the array; it's to select some value in expected linear time, and quick sort's partitioning scheme is well suited to the task.

Let's recap: if we're using quick sort, and our pivot is

- less than the element we are looking for: ignore the partition of smaller elements
- larger than the element we are looking for: ignore the partition of bigger elements
- the element we are looking for: stop looking

#### Quick select pseudocode (1)

The pseudocode for quick select looks a lot like that for quick sort:

```
QuickSelect(A[], start, end, rank)
    if(start == end)                        # implies rank == start
        return A[start]
    pivotIndex = Partition(A, start, end)   # Partition returns an index
    if(pivotIndex == rank)
        return A[pivotIndex]
    if(pivotIndex < rank)
        return QuickSelect(A, pivotIndex + 1, end, rank)    # desired value is in the partition of larger values
    else
        return QuickSelect(A, start, pivotIndex - 1, rank)  # desired value is in the partition of smaller values
```

But note that `QuickSelect` above has only one recursive call (into the partition that must contain our desired value). Furthermore, `QuickSelect` needs one more parameter than `QuickSort`, namely `rank`. Since `Partition` returns an absolute index, it's easiest to treat `rank` as absolute too: we're looking for the `rank`th smallest value of the whole array, and every call maintains the invariant `start <= rank <= end`. When `start == end`, that invariant forces `rank == start`, so we can just return `A[start]`.

After partitioning, we'll need at most one recursive call, into either the smaller or larger partition, unless we get lucky: if the rank of the pivot equals the rank we seek, then we stop searching.

#### Quick select pseudocode (2)

Also, notice above that because there's only one recursive call, and it's tail recursion, we can use the same technique we used on quick sort to get rid of the tail recursion, giving us a version of quick select with no recursion:

```
QuickSelect(A[], start, end, rank)
    while(start <= end)
        if(start == end)                        # implies rank == start
            return A[start]
        pivotIndex = Partition(A, start, end)   # Partition returns an index
        if(pivotIndex == rank)
            return A[pivotIndex]
        if(pivotIndex < rank)
            start = pivotIndex + 1              # desired value is in the partition of larger values
        else
            end = pivotIndex - 1                # desired value is in the partition of smaller values
```

#### Conclusion

All of the quick sort partitioning algorithm choices are still present in both pseudocode implementations of quick select. We probably want to pick our pivot randomly (unlike what's presented in the pseudocode).


### Implementations

#### Quick sort pseudocode (1)

If we revisit the quick sort pseudocode (1) section and fill out the details for the `Partition` procedure (it's helpful to use [20] for this purpose), then we end up with pseudocode that looks something like the following:

```
Partition(A[], start, end)
    pivot = A[end]                   # the pivot
    i = start - 1                    # highest index into the low side (empty at first)
    for j = start to end - 1         # process each element other than the pivot
        if A[j] <= pivot             # does element A[j] belong on the low side?
            i = i + 1                # index of a new slot in the low side
            exchange A[i] with A[j]  # place element A[j] there
    exchange A[i + 1] with A[end]    # pivot goes just to the right of the low side
    return i + 1                     # new index of the pivot

QuickSort(A[], start, end)
    if(start < end)
        pivotIndex = Partition(A, start, end)  # partition the subarray around the pivot, which ends up in A[pivotIndex]
        QuickSort(A, start, pivotIndex - 1)    # recursively sort the low side
        QuickSort(A, pivotIndex + 1, end)      # recursively sort the high side
```

The pseudocode above can be implemented almost exactly as written in Python.

#### Quick sort pseudocode (2)

If we revisit the quick sort pseudocode (2) section, where we made a first attempt at reducing the space requirement by recursing on the smaller partition first, then we get pseudocode like the following, where the `Partition` procedure remains the same (i.e., the space requirement improves a little, but the time requirement, largely dictated by the untouched `Partition` procedure, remains the same):

```
Partition(A[], start, end)
    pivot = A[end]                   # the pivot
    i = start - 1                    # highest index into the low side (empty at first)
    for j = start to end - 1         # process each element other than the pivot
        if A[j] <= pivot             # does element A[j] belong on the low side?
            i = i + 1                # index of a new slot in the low side
            exchange A[i] with A[j]  # place element A[j] there
    exchange A[i + 1] with A[end]    # pivot goes just to the right of the low side
    return i + 1                     # new index of the pivot

QuickSort(A[], start, end)
    if(start >= end) return
    pivotIndex = Partition(A, start, end)      # Partition returns an index
    if(pivotIndex - start < end - pivotIndex)  # first partition has fewer elements
        QuickSort(A, start, pivotIndex - 1)
        QuickSort(A, pivotIndex + 1, end)
    else                                       # second partition has fewer elements
        QuickSort(A, pivotIndex + 1, end)
        QuickSort(A, start, pivotIndex - 1)
```

We can implement the pseudocode above in Python in much the same way as before.

#### Quick sort pseudocode (3)

If we revisit the quick sort pseudocode (3) section, where we tried to further reduce the space requirement by eliminating the tail recursion ourselves, then we end up with pseudocode that looks something like the following (as before, the `Partition` procedure remains the same, meaning we're improving our space requirement but not our time requirement):

```
Partition(A[], start, end)
    pivot = A[end]                   # the pivot
    i = start - 1                    # highest index into the low side (empty at first)
    for j = start to end - 1         # process each element other than the pivot
        if A[j] <= pivot             # does element A[j] belong on the low side?
            i = i + 1                # index of a new slot in the low side
            exchange A[i] with A[j]  # place element A[j] there
    exchange A[i + 1] with A[end]    # pivot goes just to the right of the low side
    return i + 1                     # new index of the pivot

QuickSort(A[], start, end)
    while(start < end)
        pivotIndex = Partition(A, start, end)  # Partition returns an index
        if(pivotIndex - start < end - pivotIndex)
            QuickSort(A, start, pivotIndex - 1)
            start = pivotIndex + 1
        else
            QuickSort(A, pivotIndex + 1, end)
            end = pivotIndex - 1
```

Again, the pseudocode above translates directly into Python.

#### Randomized quick sort

If we revisit the randomized quick sort section, as well as CLRS [20], then we see choosing a random pivot at the outset results in updating our pseudocode in the following way:

```
RandomizedPartition(A[], start, end)
    i = Random(start, end)
    exchange A[end] with A[i]
    return Partition(A, start, end)

Partition(A[], start, end)
    pivot = A[end]                   # the pivot
    i = start - 1                    # highest index into the low side (empty at first)
    for j = start to end - 1         # process each element other than the pivot
        if A[j] <= pivot             # does element A[j] belong on the low side?
            i = i + 1                # index of a new slot in the low side
            exchange A[i] with A[j]  # place element A[j] there
    exchange A[i + 1] with A[end]    # pivot goes just to the right of the low side
    return i + 1                     # new index of the pivot

QuickSort(A[], start, end)
    if(start < end)
        pivotIndex = RandomizedPartition(A, start, end)  # pivot ends up in A[pivotIndex]
        QuickSort(A, start, pivotIndex - 1)              # recursively sort the low side
        QuickSort(A, pivotIndex + 1, end)                # recursively sort the high side
```

Note that the `Partition` procedure has been left unchanged, but we now call the new `RandomizedPartition` procedure within `QuickSort`. It's worth noting that the `QuickSort` layout in the pseudocode above reverts back to the first pseudocode for quick sort (i.e., no attempts at space optimization).

We can implement the pseudocode above in Python by using the `randint` function from Python's `random` module (i.e., `random.randint(a, b)` returns a random integer $N$ such that $a\leq N\leq b$).

#### Randomized quick sort (optimization with three partitions)

TBD

#### Quick sort with Hoare's partition

The following pseudocode is adapted from CLRS [20] and shows the original partitioning algorithm due to C. A. R. Hoare:

```
HoarePartition(A[], start, end)
    pivot = A[start]
    i = start - 1
    j = end + 1
    while TRUE
        repeat
            j = j - 1
        until A[j] <= pivot
        repeat
            i = i + 1
        until A[i] >= pivot
        if i < j
            exchange A[i] with A[j]
        else return j
```

The pseudocode above may be implemented in Python in the following way; the point of emphasis is the `partition` function, not the `quicksort` function (which may be implemented with or without tail-recursion elimination).

#### Quick select (randomized)

If we look at the pseudocode in the first section on quick select, and we follow the advice to randomize our pivots, then we end up with pseudocode that looks something like the following:

```
RandomizedPartition(A[], start, end)
    i = Random(start, end)
    exchange A[end] with A[i]
    return Partition(A, start, end)

Partition(A[], start, end)
    pivot = A[end]                   # the pivot
    i = start - 1                    # highest index into the low side (empty at first)
    for j = start to end - 1         # process each element other than the pivot
        if A[j] <= pivot             # does element A[j] belong on the low side?
            i = i + 1                # index of a new slot in the low side
            exchange A[i] with A[j]  # place element A[j] there
    exchange A[i + 1] with A[end]    # pivot goes just to the right of the low side
    return i + 1                     # new index of the pivot

QuickSelect(A[], start, end, rank)
    if(start == end)                                 # implies rank == start
        return A[start]
    pivotIndex = RandomizedPartition(A, start, end)  # Partition returns an index
    if(pivotIndex == rank)
        return A[pivotIndex]
    if(pivotIndex < rank)
        return QuickSelect(A, pivotIndex + 1, end, rank)    # desired value is in the partition of larger values
    else
        return QuickSelect(A, start, pivotIndex - 1, rank)  # desired value is in the partition of smaller values
```

The pseudocode above carries over to Python directly.

#### Quick select (randomized with tail recursion)

If we look at the pseudocode in the second section on quick select, and we follow the advice to randomize our pivots, then we end up with pseudocode that looks something like the following:

```
RandomizedPartition(A[], start, end)
    i = Random(start, end)
    exchange A[end] with A[i]
    return Partition(A, start, end)

Partition(A[], start, end)
    pivot = A[end]                   # the pivot
    i = start - 1                    # highest index into the low side (empty at first)
    for j = start to end - 1         # process each element other than the pivot
        if A[j] <= pivot             # does element A[j] belong on the low side?
            i = i + 1                # index of a new slot in the low side
            exchange A[i] with A[j]  # place element A[j] there
    exchange A[i + 1] with A[end]    # pivot goes just to the right of the low side
    return i + 1                     # new index of the pivot

QuickSelect(A[], start, end, rank)
    while(start <= end)
        if(start == end)                                 # implies rank == start
            return A[start]
        pivotIndex = RandomizedPartition(A, start, end)  # Partition returns an index
        if(pivotIndex == rank)
            return A[pivotIndex]
        if(pivotIndex < rank)
            start = pivotIndex + 1   # desired value is in the partition of larger values
        else
            end = pivotIndex - 1     # desired value is in the partition of smaller values
```

As before, the pseudocode above carries over to Python directly.

## Runtime analysis for quick sort and quick select

### Introduction

We'll start by looking at the intuition behind, and formal analysis of, quick select. Then we'll move on to quick sort.

### Quick select analysis

In quick select, when we partition the array, it takes linear time, and afterwards we have one partition comprised of elements smaller than the pivot and one partition comprised of elements larger than the pivot. If one partition has 90% of the original set, and the number we're looking for is in *that* partition, it seems like we got a bit unlucky. We'd hope to do better than that.

What happens if we get unlucky like that every time? That is, if our pivot only eliminates 10% of the elements for each pass, then we get a simple recurrence relation:

$T(n) = T(9n/10) + n$

Any of the three analytical methods we saw previously can be used to prove that this recurrence relation is linear.

We could *guess* that it's linear and use the substitution method, expand it using the recursion tree method, or use the master method:

- Substitution method:
- Inductive hypothesis: for $k < n$, $T(k)\leq Ck$
- $T(n) = T(9n/10) + n\stackrel{\text{i.h.}}{\leq} C9n/10 + n\stackrel{?}{\leq} Cn$
- True for $C=10$

- Recursion tree:
- Prove $T(n) = T((9/10)^i n) + \sum_{j=0}^{i-1}(9/10)^jn$ by induction
- For $i = \lg_{10/9}n$, $T(n) = T(1) + \sum_{j=0}^{(\lg_{10/9}n)-1}(9/10)^j n\leq T(1) + \sum_{j=0}^\infty(9/10)^j n\leq T(1)+10n$

- Master Theorem:
- $a = 1, b = 10/9, f(n) = n$
- $\lg_b a = \lg_{10/9}1 = 0$
- $n=\Omega(n^{0+\epsilon})$ for $\epsilon=1$, case 3, $T(n) = O(f(n)) = O(n)$

Even if we're most pessimistic and imagine that we only get rid of 1% of the set each time we partition, changing our recurrence to $T(n) = T(99n/100) + n$:

- Substitution method:
- Inductive hypothesis: for $k < n$, $T(k)\leq Ck$
- $T(n) = T(99n/100) + n\stackrel{\text{i.h.}}{\leq} C99n/100 + n\stackrel{?}{\leq} Cn$
- True for $C=100$

- Recursion tree:
- Prove $T(n) = T((99/100)^i n) + \sum_{j=0}^{i-1}(99/100)^jn$ by induction
- For $i = \lg_{100/99}n$, $T(n) = T(1) + \sum_{j=0}^{(\lg_{100/99}n)-1}(99/100)^j n\leq T(1) + \sum_{j=0}^\infty(99/100)^j n\leq T(1)+100n$

- Master Theorem:
- $a = 1, b = 100/99, f(n) = n$
- $\lg_b a = \lg_{100/99}1 = 0$
- $n=\Omega(n^{0+\epsilon})$ for $\epsilon=1$, case 3, $T(n) = O(f(n)) = O(n)$

Or maybe 1% of the set each 5 times we partition, changing our recurrence to $T(n) = T(99n/100) + 5n$:

- Substitution method:
- Inductive hypothesis: for $k < n$, $T(k)\leq Ck$
- $T(n) = T(99n/100) + 5n\stackrel{\text{i.h.}}{\leq} C99n/100 + 5n\stackrel{?}{\leq} Cn$
- True for $C=500$

- Recursion tree:
- Prove $T(n) = T((99/100)^i n) + \sum_{j=0}^{i-1}(99/100)^jn$ by induction
- For $i = \lg_{100/99}n$, $T(n) = T(1) + \sum_{j=0}^{(\lg_{100/99}n)-1}(99/100)^j 5n\leq T(1) + \sum_{j=0}^\infty(99/100)^j 5n\leq T(1)+500n$

- Master Theorem:
- $a = 1, b = 100/99, f(n) = 5n$
- $\lg_b a = \lg_{100/99}1 = 0$
- $5n=\Omega(n^{0+\epsilon})$ for $\epsilon=1$, case 3, $T(n) = O(f(n)) = O(5n) = O(n)$

We *still* get a linear runtime. It's just that the *constants* get worse, as seen above. That's a good intuition, but what if we want to make it a bit more formal and accurate?
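As a quick numeric sanity check of that intuition (a small script of my own, not from the video), we can evaluate $T(n) = T(\lfloor 99n/100\rfloor) + n$ directly and watch the ratio $T(n)/n$ stay below the constant $100$:

```python
def T(n):
    # Evaluate T(n) = T(floor(99n/100)) + n with T(1) = 1, iteratively.
    # This models always recursing into a partition holding 99% of the items.
    total = 0
    while n > 1:
        total += n
        n = 99 * n // 100
    return total + 1

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, T(n) / n)  # the ratio stays below 100 at every scale
```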

### Complexity of precision

We're assuming that we're looking for the $k$th smallest item from $n$ distinct values when we perform quick select, where the 1st smallest item is the minimum and the $n$th smallest item is the maximum.

Our runtime might then depend not only on $n$ but also on $k$. Can we write down an accurate recurrence relation for that? It takes linear time to partition, and for any $i$ from $1$ to $n$, there's an equal probability of picking the $i$th smallest item from the array as our pivot. If $i$ is less than $k$, then we need to recursively search all elements with rank larger than $i$ for the $(k-i)$th smallest among them. If $i$ equals $k$, then we're done. And if $i$ is greater than $k$, then we need to search all elements with rank less than $i$ for the $k$th smallest. This gives us the following recurrence:

$T(n, k) = n - 1 + \frac{1}{n}\Bigl(\sum_{i=1}^{k-1} T(n-i, k-i) + 0 + \sum_{i=k+1}^n T(i-1,k)\Bigr)$

Maybe the formulation above is a bit too precise. It looks complex enough that we really don't want to solve it. We'll leave it as an exercise for the reader. Ha.

### Formal analysis

Instead, we'll make some pessimistic assumptions that aren't quite as clumsy as the ones we started with. Imagine we're running the algorithm against an adversary that gets to change the rank we're seeking after each partition, as long as it doesn't change it to something we've already discarded. That lets it force us to always search the bigger partition (i.e., the adversary only has to tell us whether the rank is larger or smaller than each pivot we pick):

- Pick a random pivot from $n$ distinct items
- Adversary won't let us finish until there is only one item left
- Adversary always picks the larger partition for us to recursively search

Thus, given $n$ distinct items, we still assume that we pick a random item as our pivot, but unless there's just one item left, we never get lucky enough to pick the item we want, and we always have to recursively step into whichever partition has more elements. If our pivot has rank less than $n/2$, then we need to search everything that's larger than it, and if the pivot has rank larger than $n/2$, then we need to search everything smaller:

$T(n) = n + \frac{1}{n}\Bigl(\sum_{i=1}^{n/2} T(n-i) + \sum_{i=n/2+1}^n T(i-1)\Bigr) = n+\frac{2}{n}\sum_{i=n/2}^{n-1} T(i)$

We can combine the repeated terms (both sums run over the same values), and the recurrence relation above is pessimistic but also more realistic than our first pessimistic recurrence relation. It allows for the possibility that we get a really bad partition (like picking the absolute maximum value when we're looking for something else), but that possibility is weighted by a realistic probability.

So how do we solve that recurrence relation? Well, this is a recurrence relation where we should be glad to have a guess to start. Guessing that it's linear, we use the substitution method, plug in for the arithmetic series, and see that linear works.

Specifically, the inductive hypothesis is that for $k < n$, $T(k)\leq Ck$, and we have

$\begin{align*} T(n) &= n+\frac{2}{n}\sum_{i=n/2}^{n-1} T(i)\\ &\stackrel{\text{i.h.}}{\leq} n + \frac{2}{n}\sum_{i=n/2}^{n-1} Ci\\ &= n + \frac{2}{n}\cdot C\cdot\frac{n}{2}\cdot\frac{n/2 + n - 1}{2}\\ &= n + C\frac{3}{4}n - C/2\\ &\stackrel{?}{\leq} Cn \end{align*}$

which is true for $C = 8$ and $n\geq 4=n_0$.

The math above makes the simplifying assumption that $n$ is even, but the result holds for odd $n$ too.

### Quick sort analysis

Let's start with the same intuitive approach we used for quick select. If we break exactly in the middle, we get a well-known recurrence relation that gives us order $n\lg n$ runtime:

$T(n) = 2T(n/2) + n = \Theta(n\lg n)$

But what if we go back to our assumption that we get a pivot 90% of the way through the block? We can use the substitution method to show that this still gives an order $n\lg n$ recurrence:

- What if we pick 90% pivot each pass?
- $T(n) = T(9n/10) + T(n/10) + n$
- Substitution method:
- Inductive hypothesis: for $k < n$, $T(k)\leq Ck\lg k$
- $T(n) = T(9n/10) + T(n/10) + n \stackrel{\text{i.h.}}{\leq} C(9n/10)\lg(9n/10) + C(n/10)\lg(n/10) + n = C(9n/10)(\lg n - \lg(10/9)) + C(n/10)(\lg n - \lg 10) + n = Cn\lg n + n - C(9n/10)\lg(10/9) - C(n/10)\lg 10\stackrel{?}{\leq} Cn\lg n$
- True for $C = 10$

But already that's pretty ugly looking. We could also prove that with the Akra-Bazzi extension of the Master Theorem.

### Original approach

If we want more precision, then, unlike quick select, we only have one parameter here. So let's again go back to write out the recurrence relation, assuming that we pick the $i$th smallest element as our pivot. And we'll assume that we're using the Lomuto partition on distinct elements. We'll have two recursive calls, one on the $i - 1$ elements smaller than the pivot and one on the $n - i$ elements larger than the pivot:

$T(n) = n - 1 + \frac{1}{n}\sum_{i=1}^n \bigl(T(i-1) + T(n-i)\bigr) = n - 1 + \frac{2}{n}\sum_{i=0}^{n-1} T(i)$

Not very pretty. The above was the analysis in the first edition of CLRS. A much more graceful analysis has since been used.

Below is a really small array:

And instead of asking *how long* it takes to run, let's ask a much more limited question: if we run randomized quick sort, what's the probability that `3` will get compared to `17`? If we happen to pick `9` as our first pivot, then it will separate `3` and `17` into different partitions, and during the entire rest of the sort they won't get compared:

Similarly, if we had chosen `7`, `8`, or `15`, then `3` and `17` also wouldn't ever get compared. Only if we choose `3` or `17` as our *first* pivot will they get compared to each other. Because `3` and `17` are the minimum and maximum items, we will know after picking just one pivot whether those values will get compared: if one of them is chosen first, then they get compared, and if one of the four values between them is chosen first, then they don't.

There's a $1/3$ chance they get compared: of the six relevant values ($3$, $17$, and the four values between them), exactly two choices lead to a comparison.

Let's recap: for randomized quick sort on the small example array above, what is the probability that `3` gets compared to `17`?

- If we pick $9$ as our first pivot, then $3$ and $17$ will never be compared.
- Similarly, if $7$, $8$, or $15$ is our first pivot, then $3$ and $17$ will never be compared.
- If we pick $3$ or $17$ as our first pivot, then $3$ and $17$ will be compared.
- Four items between the two extreme values:
- $4/(4 + 2) = 2/3$ chance they do not get compared.
- $2/(4 + 2) = 1/3$ chance they get compared.

Now let's consider a bigger array:

To keep things simple, the array holds the integers `1` through `20`, inclusive. What's the probability of comparing the number `2` against the number `7` during quick sort? To make it a bit easier to see what's going on, we'll add a second view of the same values, but in sorted order:

Like before, there are four values between `2` and `7` in the sorted order, but now there are a lot more values that we could pick as a pivot. If we happen to pick `11` as our first pivot, then it gets compared to everything, and both `2` and `7` fall into the partition of smaller elements. Unlike before, we've picked our first pivot, but we still don't know whether those two will ever get compared to each other:

The array after the first partition is as follows:

Stepping into that partition on the left, if we pick `1` as our next pivot, then it gets compared to the numbers `2` through `10`, but both `7` and `2` are larger than the pivot, `1`, so they both fall into the same partition again:

The partition containing `7` and `2` after this second partitioning still needs to be sorted, and we still don't know if they are going to be compared:

As long as we keep picking pivots either larger than `7` or smaller than `2`, then `7` and `2` will both fall into the same partition, and we won't know yet if we're going to compare them. But, at some point, we'll pick a pivot of `2`, `7`, or something between them. If we pick `2` or `7` before anything between them, then they get compared to each other, and if we pick something between them first, then they get separated into different partitions, and they won't ever be compared (below, `7` is picked):

The result after this third partitioning, where we use `7` as the pivot, shows how `7` and `2` get compared and are then separated thereafter:

Even though everything above *looks* different from the first small array with just $6$ values, the probability that `2` and `7` get compared is still $1/3$. The added values larger than `7`, or smaller than `2`, don't change the probability of comparing those two items. The important thing is that there are four values between them.

Let's recap our observations from using this bigger array — what is the probability that (non-extreme) elements `2` and `7` get compared?

- For pivot $> 7$, both fall into the same partition, maybe compared later.
- For pivot $< 2$, both fall into the same partition, maybe compared later.
- Eventually, $2$, $7$, or something between them will be chosen as a pivot.
  - If $2$ or $7$ is chosen before $x$, for $2 < x < 7$, they get compared.
  - If $x$, for $2 < x < 7$, is chosen before $2$ or $7$, they do not get compared.
- Four items between the two chosen values:
  - $4/(4 + 2) = 2/3$ chance they do not get compared.
  - $2/(4 + 2) = 1/3$ chance they get compared.
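The same $1/3$ can also be checked empirically. Below is a small Monte Carlo sketch (my own, not from the video): an instrumented randomized quick sort records every pair of values the pivot gets compared against, and we estimate how often `2` and `7` end up compared while sorting `1` through `20`. The trial count and seed are arbitrary choices:

```python
import random

def quicksort(items, compared):
    """Randomized quicksort; records each pair of values compared in `compared`."""
    if len(items) <= 1:
        return items
    pivot = random.choice(items)  # uniformly random pivot
    for x in items:
        if x != pivot:
            compared.add(frozenset((pivot, x)))  # pivot is compared to x
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller, compared) + [pivot] + quicksort(larger, compared)

random.seed(0)
trials = 20_000
hits = 0
for _ in range(trials):
    compared = set()
    quicksort(list(range(1, 21)), compared)
    hits += frozenset((2, 7)) in compared
print(hits / trials)  # should land near 1/3
```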

More generally, we can calculate the probability that the $i$th smallest value is ever compared to the $j$th smallest value. If we want to know how many comparisons take place during the entire algorithm, then we can sum that probability over all possible $i$ and $j$ values. That might look a little complex, but it's simpler than the original recurrence relation we were considering.

We manipulate our sum so that the inner summation becomes the harmonic series, and it works out to order $n\lg n$ comparisons. Because quick sort takes time proportional to the number of comparisons it performs, this analysis gives us an order $n\lg n$ expected runtime for quick sort.

In summary:

- For $i < j$, there are $j - i + 1$ elements from the $i$th to the $j$th smallest
- Probability $i$th smallest element is compared to the $j$th smallest element: $2/(j - i + 1)$
- Over all $i < j$ pairs, the expected number of comparisons is given by the following:
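The summation those bullets lead up to (reconstructed here from the probability $2/(j - i + 1)$; the exact bounding steps may differ from the video) looks like this:

$$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{2}{j - i + 1} = \sum_{i=1}^{n-1} \sum_{k=2}^{n-i+1} \frac{2}{k} \le 2(n-1)(H_n - 1) = O(n \lg n),$$

where $H_n = \sum_{k=1}^{n} 1/k$ is the $n$th harmonic number, with $H_n = O(\lg n)$.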

## Lower bounds for comparison based sorting - decision trees

### Introduction

This section will be about lower bounds for comparison-based sorting, using decision trees. We should have already seen some of the following basic comparison-based sorting algorithms:

- Bubble sort
- Selection sort
- Insertion sort
- Merge sort
- Heap sort
- Quick sort

We'll use insertion sort the most. We'll see what a comparison-based sort is, and we'll specifically look at how it acts on $3$ elements. We'll also introduce decision trees and how they relate back to the algorithm, and then show trees for each of the comparison-based sorts listed above. We use those trees to make a lower bound argument for sorting $3$ items, and then generalize that to prove a lower bound for sorting $n$ items.

### Sorting ($n^2$ to $n \lg n$ to ?)

Let's say that the first sorting algorithm we learn is insertion sort. It seems great at first. But then we learn merge sort, and we realize that insertion sort isn't that great. Insertion sort's $O(n^2)$ worst-case runtime doesn't look so good once we know that we can do better and sort the same array in $O(n\lg n)$ using merge sort. But is $O(n\lg n)$ good? Well, can we do better?

### Insertion sort

Here's the insertion sort algorithm in pseudocode:

```
InsertionSort(array A)
  for(i = 1 to A.length - 1)
    tmp = A[i]
    j = i
    while(j > 0 && tmp < A[j - 1])
      A[j] = A[j - 1]
      j--
    A[j] = tmp
```
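A direct Python translation of that pseudocode (a sketch; it mirrors the shifting approach above rather than using swaps):

```python
def insertion_sort(a):
    """Sorts list a in place, mirroring the pseudocode above."""
    for i in range(1, len(a)):
        tmp = a[i]
        j = i
        # Shift larger elements one slot right until tmp's position is found.
        while j > 0 and tmp < a[j - 1]:
            a[j] = a[j - 1]
            j -= 1
        a[j] = tmp
    return a

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```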

Let's consider running the algorithm above on an array with just three items:

```
InsertionSort(array A[0..2])
  for(i = 1 to 2)
    tmp = A[i]
    j = i
    while(j > 0 && tmp < A[j - 1])
      A[j] = A[j - 1]
      j--
    A[j] = tmp
```

If the code above is running, then the outcomes of the comparisons between elements from `A` tell us everything there is to know about what is happening in the code. So, if the first comparison (the `while` condition, line `5`) comes out true, then the first two items of the array will get swapped; otherwise, they won't get swapped. Either way, we won't enter that while loop a second time until we are in the second pass of the outer for loop.

An ordered list of true/false outcomes for the comparisons tells us everything that there is to know about what happens in the code, even without a list of what was compared. The first comparison is between the first two elements, and the second comparison is between the third element and whichever of the first two elements is larger, assuming they are different. If that second comparison comes out false, then there isn't a third comparison. If the first two comparisons are false, then the numbers given to us were already in increasing order.

This is what we're talking about when we say a *comparison-based* sorting algorithm. The important, data-driven *forks* in the code happen because two different elements are compared against each other, and the outcome of that comparison determines what will happen next in the code. Insertion, bubble, selection, heap, merge, and quick sorts are all comparison-based sorting algorithms. And there are more. They are general sorting algorithms that can sort any inputs as long as input values can be compared to each other.

With just $3$ items, we know that we go through that outer loop twice. Let's get rid of that loop:

```
InsertionSort(array A[0..2])
  i = 1
  tmp = A[i]
  j = i
  while(j > 0 && tmp < A[j - 1])
    A[j] = A[j - 1]
    j--
  A[j] = tmp
  i = 2
  tmp = A[i]
  j = i
  while(j > 0 && tmp < A[j - 1])
    A[j] = A[j - 1]
    j--
  A[j] = tmp
```

What happens inside the while loops depends on the outcome of their comparisons. Let's get rid of those loops too, keeping the same comparisons between the same elements in the same order, even though the rest of the code will look different:

```
InsertionSort(array A[0..2])
  x = A[0], y = A[1], z = A[2]
  if(x <= y)
    if(y <= z)
      A[0] = x, A[1] = y, A[2] = z   // x <= y <= z
    else
      if(x <= z)
        A[0] = x, A[1] = z, A[2] = y   // x <= z < y
      else
        A[0] = z, A[1] = x, A[2] = y   // z < x <= y
  else
    if(x <= z)
      A[0] = y, A[1] = x, A[2] = z   // y < x <= z
    else
      if(y <= z)
        A[0] = y, A[1] = z, A[2] = x   // y <= z <= x
      else
        A[0] = z, A[1] = y, A[2] = x   // z < y < x
```

The code above is ugly, but we could write it once we know `if`/`else` statements, without even knowing loops. It has the original comparisons, and they will be executed in the same order as the original code, starting with the first two items. But all comparisons are now written explicitly.
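A quick way to convince ourselves the loop-free version is correct: translate it to Python (a sketch, not the video's code) and check it against every permutation of three values:

```python
from itertools import permutations

def sort3(a):
    """Loop-free insertion sort for exactly 3 items, as a chain of ifs."""
    x, y, z = a
    if x <= y:
        if y <= z:
            return [x, y, z]    # x <= y <= z
        elif x <= z:
            return [x, z, y]    # x <= z < y
        else:
            return [z, x, y]    # z < x <= y
    elif x <= z:
        return [y, x, z]        # y < x <= z
    elif y <= z:
        return [y, z, x]        # y <= z <= x
    else:
        return [z, y, x]        # z < y < x

# Every permutation of three distinct values reaches a correctly sorted leaf.
assert all(sort3(list(p)) == [1, 2, 3] for p in permutations([1, 2, 3]))
```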

If we consider running insertion sort on `[1, 3, 2]`, then we can see the three comparisons the code will perform:

```
InsertionSort(array A = [1, 3, 2])
  x = 1, y = 3, z = 2
  if(1 <= 3)
    if(3 <= 2)
      A[0] = x, A[1] = y, A[2] = z   // x <= y <= z
    else
      if(1 <= 2)
        A[0] = 1, A[1] = 2, A[2] = 3   // 1 <= 2 < 3
      else
        A[0] = z, A[1] = x, A[2] = y   // z < x <= y
  else
    if(x <= z)
      A[0] = y, A[1] = x, A[2] = z   // y < x <= z
    else
      if(y <= z)
        A[0] = y, A[1] = z, A[2] = x   // y <= z <= x
      else
        A[0] = z, A[1] = y, A[2] = x   // z < y < x
```

Like any comparison-based sort, it's the relative sizes of the inputs that are important, not the absolute values. So, insertion sort will go through the same steps to sort `[11, 13, 12]` as it does for `[1, 3, 2]`. The problem here is that this is just a horrific way to see what's happening. So let's introduce decision trees.

### Decision tree for insertion sort

Below is a decision tree for insertion sort on $3$ items:

It might look like some kind of data structure, but it isn't. It's just a more visual representation of which elements get compared by the code. The non-leaf nodes of the decision tree represent the comparisons, and the two children for a node represent the two possible outcomes of its comparison. The leaves represent final outcomes for the program.

So we can again look at what happens on input `[1, 3, 2]`, and see which leaf we get to:

The path above, from the decision tree root to a leaf, represents running the algorithm on some particular set of data. Each inner node on the path represents a comparison that the algorithm executes, in that order. The length of the path is the number of comparisons the algorithm executes for that input.

It's easy to see above that while `[1, 3, 2]` takes $3$ comparisons, other inputs like `[2, 1, 3]` would only take two.
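We can count those path lengths directly by instrumenting insertion sort (a sketch, not from the video) to tally element comparisons:

```python
def insertion_sort_count(a):
    """Insertion sort returning (sorted list, number of element comparisons)."""
    a = list(a)
    comparisons = 0
    for i in range(1, len(a)):
        tmp = a[i]
        j = i
        while j > 0:
            comparisons += 1          # evaluating tmp < a[j - 1] is one comparison
            if not tmp < a[j - 1]:
                break
            a[j] = a[j - 1]
            j -= 1
        a[j] = tmp
    return a, comparisons

print(insertion_sort_count([1, 3, 2]))  # ([1, 2, 3], 3)
print(insertion_sort_count([2, 1, 3]))  # ([1, 2, 3], 2)
```

Each count matches the length of the corresponding root-to-leaf path in the decision tree.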

### Other decision trees

Different sorting algorithms have different decision trees. We just saw the decision tree for insertion sort:

Let's look at the decision tree for bubble sort:

We see from the decision tree above that bubble sort *always* performs $3$ comparisons, although for two possible inputs, the last comparison doesn't tell us anything we didn't already know:

Let's consider now the decision tree for selection sort:

Selection sort is only broadly defined in [20], but for our implementation (loop from the front, select the minimum element, swap it to the front), we have another occurrence of a comparison that doesn't tell us anything we didn't already know:

We also get this weird $7$th outcome: if the first two elements are equal but larger than the third, the array gets sorted, but unstably, with the second item placed before the first:

Weird stuff isn't exclusive to quadratic sorts. Heap sort has two of those useless comparisons and a bunch of unstable outputs:

And there's a strange leaf node with two labels, `x <= z < y` and `x < z = y`, which we hit if the first value is strictly less than the second and the third is somewhere between those two; but if all three values are equal, then we go to a completely different node (the far-right leaf node with label `y <= z <= x`).

We wouldn't *choose* to make redundant comparisons if we were making a decision tree for $3$ items from scratch, but to write code that cleanly works on any number of inputs using natural operations on a heap, some comparisons can be duplicated.

Let's now consider the decision tree for merge sort:

The tree above is the decision tree for both top-down and bottom-up merge sort. Those are different algorithms, with different trees for $5$ or more elements, but for $4$ or fewer, they have the same trees. Meanwhile, for just $2$ items, if we are only doing one comparison, then there are pretty much only $2$ possible trees, depending on whether or not we are stable on $2$ equal items.

Finally, let's look at the decision tree for quick sort:

We can see its instability in the output above (leaf node on bottom with label `x < y <= x`).

### Bound for 3 items

Let's recall all of the decision trees remarked on above:

Some of these trees *always* take $3$ comparisons, while others sometimes use $2$ and other times use $3$. The real question is whether or not we can make a decision tree with at most $2$ comparisons. And the answer is *no*.

Each of the trees above has at least $6$ leaf nodes, because if we give the program $3$ distinct values like `1`, `2`, and `3`, then there are $6$ possible permutations that we can use to order the input. Each comparison has at most $2$ children, for the two possible outcomes of that comparison. If we try to make a binary tree with at most $2$ comparisons, then we get a tree with height at most $2$, and thus at most $4$ leaves. To get $6$ permutations into $4$ leaves, at least $2$ permutations must go to one leaf, and the algorithm will do the exact same thing for both of those permutations. If we reorder $2$ different permutations in the same way, such as "keep the first element first but swap the other two," at most one of those two permutations can be sorted afterwards. So every permutation needs to get to its own leaf, and it boils down to the problem that $6$ is not less than or equal to $4$. No matter what our comparison-based sorting algorithm is, for $3$ items, its decision tree needs height at least $3$, and some inputs will require $3$ comparisons.
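The same counting argument generalizes: a binary tree of height $h$ has at most $2^h$ leaves, so sorting $n$ distinct items needs a tree of height at least $\lceil\lg(n!)\rceil$. A tiny sketch of that arithmetic (the function name is my own):

```python
import math

def min_comparisons(n):
    """Least height of a binary decision tree with at least n! leaves."""
    return math.ceil(math.log2(math.factorial(n)))

print(min_comparisons(3))   # 3: a height-2 tree has only 4 leaves, but 3! = 6
print(min_comparisons(10))  # 22
```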

Let's recap these seemingly dense observations — to sort $3$ items:

- $3$ distinct inputs have $3! = 6$