Python Concurrency: An Intro to Threads

Python has a number of different concurrency constructs, such as threading, queues and multiprocessing. The threading module used to be the primary way of accomplishing concurrency. The multiprocessing module was added to the standard library in Python 2.6. This article focuses on the threading module, though.

Getting Started

We will start with a simple example that just demonstrates how threads work. We will sub-class the Thread class and make it print out its name to stdout. Let’s get coding!

import random
import time

from threading import Thread

########################################################################
class MyThread(Thread):
    """
    A threading example
    """

    #----------------------------------------------------------------------
    def __init__(self, name):
        """Initialize the thread"""
        Thread.__init__(self)
        self.name = name
        self.start()
        
    #----------------------------------------------------------------------
    def run(self):
        """Run the thread"""
        amount = random.randint(3, 15)
        time.sleep(amount)
        msg = "%s has finished!" % self.name
        print(msg)
        
#----------------------------------------------------------------------
def create_threads():
    """
    Create a group of threads
    """
    for i in range(5):
        name = "Thread #%s" % (i+1)
        my_thread = MyThread(name=name)
    
if __name__ == "__main__":
    create_threads()

In the code above, we import Python’s random module, the time module and the Thread class from the threading module. Next we sub-class Thread and override its __init__ method to accept an argument we label “name”. To start a thread, you have to call its start() method, so we do that at the end of the init. When you start a thread, it will automatically call its run method in the new thread. We have overridden the run method to make it choose a random amount of time to sleep. The random.randint call here will cause Python to randomly choose a whole number from 3 to 15, inclusive. Then we make the thread sleep for that many seconds to simulate it actually doing something. Finally we print out the name of the thread to let the user know that the thread has finished.

The create_threads function will create 5 threads, giving each of them a unique name. If you run this code, you should see something like this:

Thread #2 has finished!
Thread #1 has finished!
Thread #3 has finished!
Thread #4 has finished!
Thread #5 has finished!

The order of the output will be different each time. Try running the code a few times to see the order change.
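The script above never waits for its threads explicitly. Here is a minimal sketch of how you might do that, assuming you want the main program to block until every thread is done and then look at the order in which they finished. The run_threads and worker names, the lock-protected list and the shortened sleep times are my own assumptions for illustration. It also passes a function to Thread through its target argument instead of sub-classing.

```python
import random
import time
from threading import Lock, Thread

def run_threads(count=5):
    """Start `count` threads, wait for all of them, return the finish order"""
    finished = []      # names, in the order the threads finished
    lock = Lock()      # guards the shared list

    def worker(name):
        # Shorter sleep than the article's 3-15 seconds so the demo is quick
        time.sleep(random.uniform(0.01, 0.05))
        with lock:
            finished.append(name)

    threads = []
    for i in range(count):
        thread = Thread(target=worker, args=("Thread #%s" % (i+1),))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()  # blocks until that thread's run() has returned
    return finished

if __name__ == "__main__":
    for name in run_threads():
        print("%s has finished!" % name)
```

Because join() only returns once a thread is done, the list is complete by the time run_threads returns, even though its order still changes from run to run.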

Writing a Threaded Downloader

The previous example wasn’t very useful other than as a tool to explain how threads work. So in this example, we will create a Thread class that can download files from the internet. The U.S. Internal Revenue Service publishes lots of PDF forms that citizens use to file their taxes. We will use this free resource for our demo. Here’s the code:

# Use this version for Python 2
import os
import urllib2

from threading import Thread

########################################################################
class DownloadThread(Thread):
    """
    A threading example that can download a file
    """

    #----------------------------------------------------------------------
    def __init__(self, url, name):
        """Initialize the thread"""
        Thread.__init__(self)
        self.name = name
        self.url = url
        
    #----------------------------------------------------------------------
    def run(self):
        """Run the thread"""
        handle = urllib2.urlopen(self.url)
        fname = os.path.basename(self.url)
        with open(fname, "wb") as f_handler:
            while True:
                chunk = handle.read(1024)
                if not chunk:
                    break
                f_handler.write(chunk)
        msg = "%s has finished downloading %s!" % (self.name,
                                                   self.url)
        print(msg)
        
#----------------------------------------------------------------------
def main(urls):
    """
    Run the program
    """
    for item, url in enumerate(urls):
        name = "Thread %s" % (item+1)
        thread = DownloadThread(url, name)
        thread.start()
    
if __name__ == "__main__":
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040a.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040ez.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040es.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040sb.pdf"]
    main(urls)

This is basically a complete rewrite of the first script. In this one we import the os and urllib2 modules as well as the Thread class from the threading module. We will be using urllib2 to do the actual downloading inside the thread class. The os module is used to extract the name of the file we’re downloading so we can create a file with the same name on our machine. In the DownloadThread class, we set up the __init__ to accept a url and a name for the thread. Unlike the first example, the thread does not start itself; main() calls start() on each one. In the run method, we open up the url, extract the filename and then use that filename to create the file on disk. Then we use a while loop to download the file a kilobyte at a time and write it to disk. Once the file is finished saving, we print out the name of the thread and which url has finished downloading.
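If you want to try the filename extraction and the chunked read loop without hitting the network, something like the following sketch should work. The save_in_chunks function is a hypothetical helper that I made up for illustration, and the io.BytesIO objects are stand-ins for the handle returned by urlopen and for the file on disk.

```python
import io
import os

def save_in_chunks(source, target, chunk_size=1024):
    """Copy source to target chunk_size bytes at a time; return bytes copied"""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty result means there is no more data
            break
        target.write(chunk)
        total += len(chunk)
    return total

if __name__ == "__main__":
    url = "http://www.irs.gov/pub/irs-pdf/f1040.pdf"
    print(os.path.basename(url))                 # f1040.pdf

    fake_download = io.BytesIO(b"x" * 2500)      # stand-in for the url handle
    saved = io.BytesIO()                         # stand-in for the file on disk
    print(save_in_chunks(fake_download, saved))  # 2500
```

Reading a kilobyte at a time means the whole file never has to sit in memory at once, which matters more as the downloads get bigger.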

Update:

Python 3’s version of the code is slightly different. You have to import urllib.request instead of urllib2 and call urllib.request.urlopen instead of urllib2.urlopen. Here’s the Python 3 version:

# Use this version for Python 3
import os
import urllib.request
 
from threading import Thread
 
########################################################################
class DownloadThread(Thread):
    """
    A threading example that can download a file
    """
 
    #----------------------------------------------------------------------
    def __init__(self, url, name):
        """Initialize the thread"""
        Thread.__init__(self)
        self.name = name
        self.url = url
 
    #----------------------------------------------------------------------
    def run(self):
        """Run the thread"""
        handle = urllib.request.urlopen(self.url)
        fname = os.path.basename(self.url)
        with open(fname, "wb") as f_handler:
            while True:
                chunk = handle.read(1024)
                if not chunk:
                    break
                f_handler.write(chunk)
        msg = "%s has finished downloading %s!" % (self.name,
                                                   self.url)
        print(msg)
 
#----------------------------------------------------------------------
def main(urls):
    """
    Run the program
    """
    for item, url in enumerate(urls):
        name = "Thread %s" % (item+1)
        thread = DownloadThread(url, name)
        thread.start()
 
if __name__ == "__main__":
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040a.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040ez.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040es.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040sb.pdf"]
    main(urls)

Wrapping Up

Now you know how to use threads both in theory and in practice. Threads are especially useful when you are creating a user interface and you want to keep it responsive. Without threads, the user interface would become unresponsive and would appear to hang while you did a large file download or ran a big query against a database. To keep that from happening, you do the long-running work in threads and then communicate back to your interface when you are done.
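One common way to do that communicating is with a queue. The sketch below is only an illustration under a few assumptions: there is no real user interface here, the polling loop stands in for whatever timer or idle event your toolkit provides, and the function names are made up. It is written for Python 3, where the module is called queue (in Python 2 it is named Queue).

```python
import queue
import time
from threading import Thread

def long_running_task(results):
    """Pretend to do slow work, then hand the result to the main thread"""
    time.sleep(0.2)                  # stand-in for a download or a query
    results.put("download complete")

def run_with_polling():
    """Start the task and poll for its result the way a UI timer might"""
    results = queue.Queue()
    Thread(target=long_running_task, args=(results,)).start()
    ticks = 0
    while True:
        try:
            # get_nowait never blocks, so the "interface" stays responsive
            message = results.get_nowait()
            return message, ticks
        except queue.Empty:
            ticks += 1               # free to handle other events here
            time.sleep(0.05)

if __name__ == "__main__":
    message, ticks = run_with_polling()
    print("%s after %s idle ticks" % (message, ticks))
```

The worker thread only ever touches the queue, which is safe to share between threads, so the main thread remains the only one that updates the interface.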


yacc

February 25, 2014 at 3:22 am

> Unfortunately, threads in Python are severely limited by the Global Interpreter Lock (GIL) which causes all your threads to run on the same core.

This is wrong in multiple ways:

* Simply speaking, the GIL does not ensure in any way that the program gets run on the same core.

* Next one is that the GIL is not a limitation, it’s an optimization that gives cpython overall an incredible single thread performance. Notice that all attempts to remove the GIL and replace it with something with comparable safety have led to such a significant slowdown in single thread speed that a QuadCore CPU running the work in 4 threads was still slower than a single threaded version.

* the GIL is a critical part of the safety of the language, e.g. Python in practically all circumstances does raise an exception, instead of a core dump (or for the Windows developers, a dialog asking them if they’d like to start Visual Studio to debug python.exe). Now for many complex data structures (e.g. a python dict) manipulating it without exclusivity is a sure way to get a hard bug. Achieving that exclusivity without a GIL, automatically (as in not changing Python semantics), with acceptable performance is hard.

* Python exposes parallelism in a number of ways, all of them with some pains associated. E.g. multiprocessing (you need to think about communication overhead), or C-level module (where the developer can release the GIL, but the developer is responsible for the correctness of the code).

Basically, before complaining about the GIL, one should consider:

* the GIL is here after 2 decades.

* the Python community is quite aware of it.

* there have been a number of attempts to get rid of it. Some died because the developer quit early. But the usual place where the attempts died was when a GIL-enabled standard cpython ran circles around the “properly multithreaded” experiment, usually without even starting to sweat. For a current example see http://morepypy.blogspot.co.at/2013/10/update-on-stm.html

hunterji

February 25, 2014 at 12:27 am

I was getting errors with the first example code. If I move “name” from the Thread.__init__(self) to the class’ def __init__(self, name) it seems to do the trick.