Python 201: An Intro to Generators

Posted by Mike on January 27th, 2014 filed in Cross-Platform, Python

The topic of generators has been covered numerous times before. However, it’s still a topic that a lot of new programmers have trouble with, and I would hazard a guess that even experienced users don’t use them much.

Python generators allow developers to lazily evaluate data. This is very helpful when you are dealing with so-called “big data”. Their main use is for generating values and for doing so in an efficient manner. In this article, we will go over how to use a generator and take a look at generator expressions. Hopefully by the end you will be comfortable using generators in your own projects.

The canonical use case for a generator is to show how to read a large file in a series of chunks or lines. There’s nothing wrong with that idea, so let’s use that for our first example too. To create a generator, all we need to do is use Python’s yield keyword. A function that contains a yield statement becomes a generator function: calling it returns a generator, which is a kind of iterator. In the simplest case, all you have to do to change a regular function into a generator is to replace the return statement with a yield statement. Let’s take a look at an example:

#----------------------------------------------------------------------
def read_large_file(file_object):
    """
    Uses a generator to read a large file lazily
    """
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data
 
#----------------------------------------------------------------------
def process_file(path):
    """Open the file at path and print it line by line"""
    try:
        with open(path) as file_handler:
            for line in read_large_file(file_handler):
                # process line
                print(line)
    except (IOError, OSError):
        print("Error opening / processing file")
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    path = "TB_burden_countries_2014-01-23.csv"
    process_file(path)

To make testing easier, I went to the World Health Organization’s (WHO) site and downloaded a CSV file on Tuberculosis. Specifically I grabbed the “WHO TB burden estimates [csv 890kb]” file from here. If you already have a big file to play with, feel free to edit the code appropriately. Anyway, in this code we create a function named read_large_file and turn it into a generator by making it yield back its data.

Here’s how the magic works: We create a for loop that loops over the generator returned by our generator function. On each iteration, the generator yields one line of data and the for loop processes it. In this case, the “process” is just to print the line to stdout, but you can modify that to do whatever you need. In a real program, you would probably be saving data to a database or creating a PDF or other report with the data. When the generator yields a value, it suspends the state of execution in the function so that local variables are preserved. This allows us to continue on the next loop without losing our place.
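
To see the suspended state in isolation, here is a minimal sketch that doesn’t touch a file at all. The count_up_to function is a hypothetical helper made up for this illustration. Each call to next() resumes the function right after the previous yield, with the local variable current still intact.

```python
def count_up_to(limit):
    """Hypothetical helper: yield the integers 1 through limit, one at a time."""
    current = 1
    while current <= limit:
        yield current      # execution pauses here until another value is requested
        current += 1       # resumes here, with current still holding its old value


counter = count_up_to(3)   # calling the function only creates the generator object
first = next(counter)      # runs the body up to the first yield
second = next(counter)     # picks up where it left off
remaining = list(counter)  # drains whatever is left

print(first, second, remaining)
```

Note that no code inside count_up_to runs until the first next() call; creating the generator is essentially free.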

Anyway, when the generator function runs out of data, we break out of it so that the loop doesn’t continue on indefinitely. The generator allows us to process only one chunk of data at a time, which saves a lot of memory.
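
As a rough illustration of the memory difference, the sketch below compares a fully built list with a generator that produces the same values. The doubled function is a hypothetical helper, and sys.getsizeof only reports the size of the container object itself, so treat the numbers as an indication and not a precise measurement.

```python
import sys


def doubled(limit):
    """Hypothetical helper: lazily yield 0, 2, 4, ... for limit values."""
    for n in range(limit):
        yield n * 2


as_list = list(doubled(100000))   # every value exists in memory at once
as_generator = doubled(100000)    # nothing is computed until we iterate

list_size = sys.getsizeof(as_list)        # size of the list object only
gen_size = sys.getsizeof(as_generator)    # size of the generator object

print("list:", list_size, "bytes")
print("generator:", gen_size, "bytes")
```

The generator object stays the same small size no matter how many values it will eventually produce, because it only ever holds its current position.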

Update 2014/01/28: One of my readers pointed out that files return lazy iterators to begin with, which is something I thought they did. Oddly enough, everyone and their dog recommends using generators for reading files, but just iterating over the file is enough. So let’s rewrite the example above to utilize this concept:

#----------------------------------------------------------------------
def process_file_differently(path):
    """
    Process the file line by line using the file's returned iterator
    """
    try:
        with open(path) as file_handler:
            while True:
                print(next(file_handler))
    except (IOError, OSError):
        print("Error opening / processing file")
    except StopIteration:
        pass
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    path = "TB_burden_countries_2014-01-23.csv"
    process_file_differently(path)

In this code, we create an infinite loop that will call Python’s next function on the file handler object. This will cause Python to return the file back to us line-by-line. When the file runs out of data, the StopIteration exception is raised, so we make sure we catch it and ignore it.
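
The same idea can be written more compactly with a plain for loop, which calls next() and handles StopIteration behind the scenes. The following is a minimal sketch: count_lines is a hypothetical function name, and the tiny temporary file with made-up contents is only there so the snippet runs without the WHO data set.

```python
import os
import tempfile


def count_lines(path):
    """Hypothetical variant: iterate over the file object directly."""
    line_count = 0
    with open(path) as file_handler:
        for line in file_handler:   # the file object is its own lazy iterator
            # process line (here we just print it without its newline)
            print(line.rstrip("\n"))
            line_count += 1
    return line_count


# Write a small throwaway file with placeholder contents
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    sample_path = tmp.name

result = count_lines(sample_path)
os.remove(sample_path)   # clean up the throwaway file
print(result)
```

Only one line is held in memory at a time here, just as with the hand-written generator.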

Generator Expressions

Python has the concept of generator expressions. The syntax for a generator expression is very similar to a list comprehension. Let’s take a look at both to see the difference:

# list comprehension
lst = [ord(i) for i in "ABCDEFGHI"]
 
# equivalent generator expression
gen = list(ord(i) for i in "ABCDEFGHI")

This example is based on one found in Python’s HOWTO section on generators and frankly I find it a bit obtuse. The main difference between a generator expression and a list comprehension is in what encloses the expression. For a list comprehension, it is square brackets; for the generator expression, it is regular parentheses. Let’s create the generator expression itself without turning it into a list:

gen = (ord(i) for i in "ABCDEFGHI")
while True:
    print(next(gen))

If you run this code, you will see it print out each ordinal value for each member of the string and then you’ll see a traceback stating that a StopIteration has occurred. That means that the generator has exhausted itself (i.e. it’s empty). So far, I have not found a use for the generator expression in my own work, but I would be interested to know what you’re using it for.
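
For what it’s worth, one pattern that shows up a lot is passing a generator expression straight to a function that consumes an iterable, such as sum(). The sketch below is only an illustration of that pattern. It also shows that a for loop drains the generator without the traceback, and that an exhausted generator has nothing left to give.

```python
# Sum the ordinals without building an intermediate list
total = sum(ord(i) for i in "ABCDEFGHI")

# A for loop handles StopIteration for us, so no traceback appears
gen = (ord(i) for i in "ABCDEFGHI")
values = []
for value in gen:
    values.append(value)

leftover = list(gen)   # the generator is exhausted, so this is empty

print(total)
print(values)
print(leftover)
```

When a generator expression is the only argument to a function call, the extra pair of parentheses can be left off, as in the sum() call above.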

Wrapping Up

Now you know what a generator is for and one of its most popular uses. You have also learned about the generator expression and how it works. I have personally used a generator for parsing data files that are supposed to become “big data”. What have you used these for?


  • yacc

    Hint, the whole generator is unnecessary, file objects in python happen to be lazy iterators returning the file line by line.

    E.g.

    with open("/etc/passwd") as file_obj:
        print(next(file_obj))

    I think that should be mentioned somewhere in the article, that writing a readline yielding generator is kind of not idiomatic to Python.

  • glenfant

    Hi,
    Another great thing about generators is how easily they let you reuse complicated looping logic over elements for various purposes (printing, recording to a CSV file, HTML rendering, …) instead of dirty copy/paste code (DRY principle).
    Newbie readers can have a look at the itertools module of the standard library that provides common use cases of generators.

  • http://www.blog.pythonlibrary.org/ Mike Driscoll

    I thought that was what happened when I read a file line by line, but everyone says that using a generator is the way to go with “Big” files.

  • http://www.blog.pythonlibrary.org/ Mike Driscoll

    I have updated the article with another example. Thanks for the heads up!

  • yacc

    Reading the whole file:

    fo = open(…)

    # as a complete byte string:
    fo.read()
    # as list of lines with line endings in the strings.
    list(fo)

    # as list of EOL-free strings:
    # (that reads the whole things as one huge
    # string, and splits it, memory usage more
    # than 2x file size)

    fo.read().splitlines()

    # as list of EOL-free strings
    # that does not read the whole file at once
    [x.rstrip("\n") for x in fo]

    # The last one can be a generator expression
    # too, which is nearly O(m) memory usage,
    # m being the longest line in the file.

    (x.rstrip("\n") for x in fo) # that creates a generator

  • http://sourcecode.net.br/ Rafael Santos

    Nice Article! thanks to this I also learned that in python3 the .next() method was renamed to __next__().

    http://stackoverflow.com/questions/1073396/is-generator-next-visible-in-python-3-0

  • http://www.blog.pythonlibrary.org/ Mike Driscoll

    That’s odd. I just tested with Python 3.3 and it has the next() function available. http://docs.python.org/3.3/library/functions.html#next – I think that article is misleading. The “dunder” method “__next__” is what the next() function itself calls under the covers. At least, that’s my understanding.
