Find duplicates using Python
The exercise of finding duplicates in a list is a classical example of a didactic problem that you may find during your journey through learning algorithms. During this post, we will try solving this problem using different algorithms, showing the differences between each one.
This article has been mainly written for fun and as a writing exercise. Its main purpose is to show different solutions to a single problem, explaining the reasoning behind certain decisions.
Finding duplicates in a list is a trivial problem for an experienced Software Engineer. Maybe it is one of the first algorithms you developed at the University and you should not push yourself so hard on this.
The statement for this problem is the following:
Given an array of n elements that contains elements from 0 to n-1, with any of these numbers appearing any number of times. Find this repeating number and return it.
For the lack of simplicity, we will consider only lists of integers. Some of the algorithms shown, however, can be used also for other types.
In this post, we will see different solutions for this problem. Every solution has some requirements and uses a given amount of space and time (and a degree of coolness). We will use Python to implement them.
To generate our random lists we can use the following functions:
The first one generate_random_list(n,k)
generates an unsorted list of n
random integers from 0
to k
.
Instead, the second one generate_random_list_pigeon(n)
generates a list of n
random elements from 1
to n-1
. It is named after the Pigeonhole principle so that we can ensure that only one element is duplicated. This function will come in handy at some point.
Dummy solution
The dummy solution has no special prerequisites and does not require a detailed explanation.
It iterates over the elements in the list from the first element in the list and checks if there is a duplicate.
When the outer for is at its first iteration, the inner one will iterate over the other elements, comparing the element pointed by i and the element pointed by j. When both the elements at index i and j are the same, it will return this element. Then the index i will be moved to the next element and so on so forth. Easy!
Analysis
Constraints: The algorithm works for every list of elements that are comparable in which there is at least a duplicate. The input list will not be modified.
Space complexity: O(1). We only allocated two variables i and j.
Time complexity: in the worst case, the algorithm compares every pair of elements that can be obtained only once. So, i can assume values from 0 to n-2, and for each value of i, j can assume n-2-i values. Indeed, the first time j iterates from 1 to n-1 (n-2 times), the second time from 2 to n-2 (n-3 times), etc. So, the time complexity is approximately O(n²).
Sort and find
The sort and find algorithm is based on a simple principle. If we sort our list, duplicated elements will be adjacent, so we don’t need to look for every pair, we just need to inspect every adjacent pair of elements.
The first iteration is in-place sorting. This can be achieved using the method sort of Python. After the sorting, elements can be compared two-by-two, checking for adjacent ones. This is straightforward because once the list is sorted, equal elements are adjacent.
The following animation shows the algorithm on our example list:
Analysis
Constraints: the input list must be modifiable for the algorithm to be correct. The method sort()
of Python does in-place sorting. We could have used the sorted
method to return a brand new sorted list, but this would have required O(n)
additional space to a new list;
Space complexity: O(1) if sorting is in-place, O(n) if we use a new sorted list;
Time complexity: O(n log n) because we require at least O(n log n) to sort our list and O(n) to traverse it since every element is checked at most twice.
Using a HashMap
Another cool solution for finding a duplicate in a list would require the use of a HashMap.
A hash map is a data structure that can map keys to values. The average cost for inserting, searching, or deleting a key is constant.
Using this structure, we can mark elements that have been seen during our traversal and, if an element is already present in the seen map, it is a duplicated element.
Also in that case, the implementation is pretty straightforward: a dictionary is used as the map of the seen values. For each element of the list, we check out if it is already present in our map. If so, it is the duplicated we were looking for, otherwise, we insert it on the map.
Analysis
Constraints: elements in the list have to be hashable to be used as keys in the map;
Space complexity: O(n). The extra space is for the hash map.
Time complexity: O(n) since we check every element once and for each element we do a lookup and an insert in the map (that costs O(1) in the average case).
Using Math
Math is almost always the key for solving algorithmic problems. Every software engineer should be skilled on math.
The following code shows a very simple implementation of the algorithm to solve our problem. However, this has many limitations, but is a good starting point and works well in some cases.
We know that the sum of the first n natural number can be achieved using the following equation:
We can apply another step and find the sum of consecutive natural numbers from k to n that equals to:
In our code we define a lambda function to get this formula, then we apply it to the first interval composed of numbers less than our minimum (numbers before our interval), eventually, we do the same for natural numbers from 1 to our maximum.
By subtracting the sum outside our interval from our complete sum, we would obtain the expected sum of natural numbers from our minimum to our maximum.
After getting the real sum of the elements inside our list, we can subtract the above mentioned value to this one, obtaining the element that is present twice in our list.
It can be easily proven that this algorithm has many constraints and does perform correctly in a single condition. Let’s analyze this.
Analysis
Constraints: the algorithm performs correctly when the list contains integers and only a single value is repeated a single time. It is strict enough, isn’t it?
Space complexity: O(1). We only need to keep 3 values in memory.
Time complexity: O(n). The two sums for obtaining the expected sum in the interval would require constant time, the sum of all the elements in the list would require O(n).
I hope you enjoyed reading this article as much as I enjoyed writing it.