Python Sets: What, Why and How
Python comes equipped with several built-in data types to help us organize our data. These structures include lists, dictionaries, tuples and sets.
A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.
In this article, we are going to review every one of the elements listed in the above definition. Let’s start right away and see how we can create them.
Initializing a Set
There are two ways to create a set: one is to use the built-in function set() and pass a list of elements, and the other is to use the curly braces {}.
Initializing a set using the set() built-in function
>>> s1 = set([1, 2, 3])
>>> s1
{1, 2, 3}
>>> type(s1)
<class 'set'>
Initializing a set using curly braces {}
>>> s2 = {3, 4, 5}
>>> s2
{3, 4, 5}
>>> type(s2)
<class 'set'>
To create an empty set, use set() rather than empty curly braces {}, or you will get an empty dictionary instead.
>>> s = {}
>>> type(s)
<class 'dict'>
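For contrast, here is a minimal sketch of creating an empty set with set(). The variable names are only illustrative, and the second half shows that set() takes any iterable, not just a list.

```python
# Empty curly braces create a dict; set() with no arguments creates an empty set.
empty_dict = {}
empty_set = set()

print(type(empty_dict).__name__)  # dict
print(type(empty_set).__name__)   # set

# set() accepts any iterable, for example a string.
letters = set("hello")            # the duplicate 'l' is stored only once
print(len(letters))               # 4
```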
It’s a good moment to mention that, for the sake of simplicity, all the examples provided in this article use single-digit integers, but sets can hold any hashable data type that Python supports. In other words, integers, strings and tuples, but not mutable items like lists or dictionaries:
>>> s = {1, 'coffee', [4, 'python']}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
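As a rough sketch of what "hashable" means in practice, the snippet below stores a tuple and a frozenset as elements and shows one common workaround for a list: converting it to a tuple first. The sample values are arbitrary.

```python
# Tuples of hashable values and frozensets are hashable, so they can be set elements.
mixed = {1, "coffee", (4, "python"), frozenset({5, 6})}
print(len(mixed))  # 4

# A list is mutable and unhashable, so it cannot be stored in a set.
unhashable = [4, "python"]
try:
    {1, unhashable}
except TypeError as error:
    print(error)  # unhashable type: 'list'

# Converting the list to a tuple makes it usable as an element.
as_tuple = tuple(unhashable)
ok = {1, as_tuple}
print((4, "python") in ok)  # True
```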
Now that you know how to create a set and what type of elements it can have, let’s continue and see why we should always have them in our arsenals.
Why You Should Use Them
We can write code in more than one way. Some ways are considered pretty bad, and others clear, concise and maintainable, or “pythonic”.
Let’s start exploring the way that Python sets can help us not just with readability, but also with our program’s execution time.
Unordered collection of elements
First things first: you can’t access a set object using indexes.
>>> s = {1, 2, 3}
>>> s[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable
Or slice them:
>>> s[0:2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable
However, if what we need is to remove duplicates or do mathematical operations like combining collections (unions), sets are usually the right tool.
I have to mention that lists outperform sets when iterating over their elements, so prefer lists if that is what you need. Why? Well, this article does not intend to explain the inner workings of sets, but here are a couple of links where you can read about it:
- Time Complexity
- How is set() implemented?
- Python Sets vs Lists
- Is there any advantage or disadvantage to using sets over list comps to ensure a list of unique entries?
No duplicate items
While writing this, I cannot stop thinking about all the times I used a for loop and an if statement to check for and remove duplicate elements in a list. My face turns red remembering that, more than once, I wrote something like this:
>>> my_list = [1, 2, 3, 2, 3, 4]
>>> no_duplicate_list = []
>>> for item in my_list:
...     if item not in no_duplicate_list:
...         no_duplicate_list.append(item)
...
>>> no_duplicate_list
[1, 2, 3, 4]
Or used a list comprehension:
>>> my_list = [1, 2, 3, 2, 3, 4]
>>> no_duplicate_list = []
>>> [no_duplicate_list.append(item) for item in my_list if item not in no_duplicate_list]
[None, None, None, None]
>>> no_duplicate_list
[1, 2, 3, 4]
But it’s OK, none of that matters anymore because we now have sets:
>>> my_list = [1, 2, 3, 2, 3, 4]
>>> no_duplicate_list = list(set(my_list))
>>> no_duplicate_list
[1, 2, 3, 4]
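One caveat worth a quick sketch: a set is unordered, so list(set(...)) makes no promise about the order of the result. If the original order matters, one common alternative (not part of the approach above) is dict.fromkeys(), since dictionary keys are unique and keep insertion order in Python 3.7 and later.

```python
my_list = [3, 1, 3, 2, 1, 4]

# list(set(...)) removes duplicates but makes no promise about element order.
unique_any_order = list(set(my_list))

# dict keys are unique and keep insertion order (guaranteed since Python 3.7),
# so this keeps the first occurrence of each item in its original position.
unique_in_order = list(dict.fromkeys(my_list))

print(sorted(unique_any_order))  # [1, 2, 3, 4]
print(unique_in_order)           # [3, 1, 2, 4]
```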
Sets performance
Now let’s use the timeit module and see the execution time of lists and sets when removing duplicates:
>>> from timeit import timeit
>>> def no_duplicates(items):
...     no_duplicate_list = []
...     [no_duplicate_list.append(item) for item in items if item not in no_duplicate_list]
...     return no_duplicate_list
...
>>> # first, let's see how the list performs:
>>> print(timeit('no_duplicates([1, 2, 3, 1, 7])', globals=globals(), number=1000))
0.0018683355819786227
>>> from timeit import timeit
>>> # and the set:
>>> print(timeit('list(set([1, 2, 3, 1, 2, 3, 4]))', number=1000))
0.0010220493243764395
>>> # faster and cleaner =)
Not only do we write fewer lines of code with sets than with list comprehensions, we also get more readable and performant code.
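The inputs above are tiny, so here is a hedged sketch of how the same comparison could be run on a larger list, with both approaches receiving identical data. The function names and the test data are made up for illustration, and the timings will vary from machine to machine, so no expected numbers are shown.

```python
from timeit import timeit

def dedupe_with_list(items):
    """Remove duplicates with a list; each 'not in' check scans the result list."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

def dedupe_with_set(items):
    """Remove duplicates by passing through a set (order is not guaranteed)."""
    return list(set(items))

# Hypothetical test data: 2,000 values with only 500 distinct ones.
data = [n % 500 for n in range(2000)]

# Both approaches agree on which values are unique.
assert sorted(dedupe_with_list(data)) == sorted(dedupe_with_set(data))

# timeit accepts a callable as well as a string of code.
list_time = timeit(lambda: dedupe_with_list(data), number=20)
set_time = timeit(lambda: dedupe_with_set(data), number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```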
From the Zen of Python:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Flat is better than nested.
Aren’t sets just Beautiful, Explicit, Simple, and Flat?
Membership tests
Every time we use an if statement to check whether an element is in, for example, a list, we are doing a membership test:
>>> my_list = [1, 2, 3]
>>> if 2 in my_list:
...     print('Yes, this is a membership test!')
...
# Yes, this is a membership test!
And sets are more performant than lists when doing them:
>>> from timeit import timeit
>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> timeit('in_test(iterable)', setup="from __main__ import in_test; iterable = list(range(1000))", number=1000)
# 12.459663048726043
>>> from timeit import timeit
>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> timeit('in_test(iterable)', setup="from __main__ import in_test; iterable = set(range(1000))", number=1000)
# 0.12354438152988223
The above tests come from this Stack Overflow thread.
So if you are doing comparisons like this on massive lists, converting the list into a set should speed things up a good bit.
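A minimal sketch of that conversion, assuming the list is checked many times: build the set once, then run every membership test against it. The function name count_known and the sample data are hypothetical.

```python
def count_known(candidates, known_items):
    """Count how many candidates appear in known_items."""
    # One-time conversion; each later lookup takes constant time on average.
    known = set(known_items)
    return sum(1 for candidate in candidates if candidate in known)

known_ids = list(range(0, 10_000, 2))   # hypothetical data: even numbers below 10,000
incoming = [3, 4, 10, 11, 9_998, 10_000]

print(count_known(incoming, known_ids))  # 3
```

The conversion itself walks the whole list once, so it pays off when the same collection is tested repeatedly rather than a single time.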
Adding Elements
Depending on the number of elements to add, we will have to choose between the add() and update() methods.
add() will add a single element:
>>> s = {1, 2, 3}
>>> s.add(4)
>>> s
{1, 2, 3, 4}
And update() adds multiple ones:
>>> s = {1, 2, 3}
>>> s.update([2, 3, 4, 5, 6])
>>> s
{1, 2, 3, 4, 5, 6}
Remember, sets remove duplicates.
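A few details of these two methods are easy to miss, so here is a small sketch with arbitrary sample values: add() returns None, update() takes any number of iterables, and a string passed to update() is treated as an iterable of characters.

```python
s = {1, 2, 3}

# add() mutates the set in place and returns None.
result = s.add(3)                   # 3 is already present, so nothing changes
print(result, s == {1, 2, 3})       # None True

# update() accepts any number of iterables: lists, tuples, ranges, other sets...
s.update([4, 5], (5, 6), range(6, 8))
print(s == {1, 2, 3, 4, 5, 6, 7})   # True

# A string is an iterable of characters, so update() adds each character.
words = {"tea"}
words.update("ab")
print(words == {"tea", "a", "b"})   # True
```

To add a whole string as a single element, use add() instead of update().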
Removing Elements
If you want to be alerted when your code tries to remove an element that is not in the set, use remove(). Otherwise, discard() provides a suitable alternative:
>>> s = {1, 2, 3}
>>> s.remove(3)
>>> s
{1, 2}
>>> s.remove(3)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# KeyError: 3
discard() won’t raise any errors:
>>> s = {1, 2, 3}
>>> s.discard(3)
>>> s
{1, 2}
>>> s.discard(3)
>>> # nothing happens!
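When the code needs to know whether the element was actually there, the KeyError from remove() can be caught. The helper below, remove_strict, is a hypothetical name used only for this sketch.

```python
def remove_strict(items, value):
    """Remove value from the set and report whether it was present."""
    try:
        items.remove(value)
    except KeyError:
        return False
    return True

s = {1, 2, 3}
print(remove_strict(s, 3))  # True
print(remove_strict(s, 3))  # False: 3 was already gone

s.discard(42)               # no error, even though 42 was never in the set
print(s == {1, 2})          # True
```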
We can also use pop() to remove and return an arbitrary element:
>>> s = {1, 2, 3, 4, 5}
>>> s.pop() # removes an arbitrary element
1
>>> s
{2, 3, 4, 5}
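Because the element returned by pop() is arbitrary, code should not rely on which one comes out. A typical use is draining a set one item at a time, sketched here with made-up values. The last lines show that pop() on an empty set raises KeyError.

```python
pending = {"a", "b", "c"}
processed = []

# pop() removes and returns some element; which one is an implementation detail.
while pending:
    processed.append(pending.pop())

print(sorted(processed))    # ['a', 'b', 'c']

# Calling pop() on an empty set raises KeyError.
try:
    pending.pop()
except KeyError:
    print("the set is empty")
```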
Or clear() to remove all the values from a set:
>>> s = {1, 2, 3, 4, 5}
>>> s.clear() # discard all the items
>>> s
set()
The union() method
union() or | will create a new set that contains all the elements from the sets we provide:
>>> s1 = {1, 2, 3}
>>> s2 = {3, 4, 5}
>>> s1.union(s2) # or 's1 | s2'
{1, 2, 3, 4, 5}
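The method and the operator are not fully interchangeable. As a short sketch with arbitrary values: union() accepts any iterables, | requires sets on both sides, and |= updates the left-hand set in place.

```python
s1 = {1, 2, 3}

# union() accepts any iterables and leaves s1 untouched.
combined = s1.union([3, 4], (5,))
print(combined == {1, 2, 3, 4, 5})   # True
print(s1 == {1, 2, 3})               # True

# The | operator only works between sets.
try:
    s1 | [3, 4]
except TypeError:
    print("| needs a set on both sides")

# |= updates the set in place, like update().
s1 |= {9}
print(s1 == {1, 2, 3, 9})            # True
```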
The intersection() method
intersection() or & will return a set containing only the elements that are common to all of them:
>>> s1 = {1, 2, 3}
>>> s2 = {2, 3, 4}
>>> s3 = {3, 4, 5}
>>> s1.intersection(s2, s3) # or 's1 & s2 & s3'
{3}
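To make this more concrete, here is a sketch with hypothetical article tags. It also uses isdisjoint(), which reports whether two sets have nothing in common, and &=, the in-place form of the operator.

```python
# Hypothetical data: tags attached to three articles.
article_a = {"python", "sets", "performance"}
article_b = {"python", "lists", "performance"}
article_c = {"python", "dicts"}

shared_by_all = article_a & article_b & article_c
print(shared_by_all)                            # {'python'}

# isdisjoint() is True when two sets have an empty intersection.
print(article_c.isdisjoint({"rust"}))           # True

# &= keeps only the common elements, in place.
article_a &= article_b
print(article_a == {"python", "performance"})   # True
```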
The difference() method
difference() or - creates a new set with the values that are in “s1” but not in “s2”:
>>> s1 = {1, 2, 3}
>>> s2 = {2, 3, 4}
>>> s1.difference(s2) # or 's1 - s2'
{1}
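Unlike union and intersection, the order of the operands matters here. A brief sketch using the same two sets:

```python
s1 = {1, 2, 3}
s2 = {2, 3, 4}

print(s1 - s2)   # {1}
print(s2 - s1)   # {4}

# difference() also accepts several iterables at once.
print(s1.difference([2], (3,)))   # {1}
```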
The symmetric_difference() method
symmetric_difference() or ^ will return all the values that are in exactly one of the two sets:
>>> s1 = {1, 2, 3}
>>> s2 = {2, 3, 4}
>>> s1.symmetric_difference(s2) # or 's1 ^ s2'
{1, 4}
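One way to build intuition for this operation is to express it with the previous ones. The sketch below checks two equivalent formulations on the same sets.

```python
s1 = {1, 2, 3}
s2 = {2, 3, 4}

sym = s1 ^ s2

# Same result as "everything in either set, minus what they share".
print(sym == (s1 | s2) - (s1 & s2))    # True

# Also the same as joining both one-sided differences.
print(sym == (s1 - s2) | (s2 - s1))    # True

# Unlike difference, the operand order does not matter.
print(s1 ^ s2 == s2 ^ s1)              # True
```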
Conclusions
I hope that after reading this article you know what a set is, how to manipulate its elements and which operations it supports. Knowing when to use a set will help you write cleaner code and speed up your programs.