Tuesday, March 06, 2007

The beauty of Python Generators

Say you had to read a file line by line and change the delimiter from comma to semicolon. The Python code would look like this:
f = open("c:\\testfile.txt")
for line in f.readlines():
    ';'.join(line.split(','))
Now say you wanted to do the same thing, but only to the first 5 lines of the file. The impulse for a Python newbie would be to write the following code:
f = open("c:\\testfile.txt")
counter = 0
for line in f.readlines():
    if counter < 5:
        ';'.join(line.split(','))
    counter += 1
A more savvy Python newbie would write the following code, taking advantage of the beautiful enumerate() function:
f = open("c:\\testfile.txt")
for counter, line in enumerate(f.readlines()):
    if counter < 5:
        ';'.join(line.split(','))
The problem with this code is that it has two concepts intermingled: one to read 5 lines, and the other for the actual process of changing delimiters. If we wanted to keep the concepts separate, we would have to be able to write something like this:
f = open("c:\\testfile.txt")
for line in f.readfirst5lines():
    ';'.join(line.split(','))
Here the logic for stopping after 5 lines is encapsulated in the readfirst5lines() method of the file object, and the loop body only changes the delimiters, as before. Code like this can actually work! We can write it thanks to Python's generators. readfirst5lines will look like this:
import itertools

class myfile(file):  # 'file' is the Python 2 built-in file type
    def readfirst5lines(self):
        for i in itertools.count():
            if i == 5:
                break
            line = self.readline()
            if not line:  # stop early at end of file
                break
            yield line
This is of course more lines of code, but the concept is abstracted away nicely: we are separating the conditions for processing the file from the actual processing. What's more, the method can be changed slightly to take the number of lines as an argument. So if you want to read the first 2, 3, 5, or however many lines, the method will look like this:
import itertools

class myfile(file):  # 'file' is the Python 2 built-in file type
    def readfirstfewlines(self, n):
        for i in itertools.count():
            if i == n:
                break
            line = self.readline()
            if not line:  # stop early at end of file
                break
            yield line
And you can use it like this:
f = myfile("c:\\testfile.txt")
for line in f.readfirstfewlines(3):
    ';'.join(line.split(','))
This will process the first 3 lines.
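
The 'file' built-in that myfile inherits from exists only in Python 2. A minimal sketch of the same idea for Python 3 follows, written as a plain generator function rather than a subclass. The names first_lines, convert and sample are made up for this example, and io.StringIO stands in for the file on disk:

```python
import io

def first_lines(stream, n):
    """Yield at most n lines from a file-like object (hypothetical helper)."""
    for _ in range(n):
        line = stream.readline()
        if not line:  # readline() returns '' at end of file
            return
        yield line

def convert(line):
    """Change the delimiter from comma to semicolon."""
    return ';'.join(line.split(','))

sample = io.StringIO("a,b\nc,d\ne,f\ng,h\n")  # stands in for the file on disk
converted = [convert(line) for line in first_lines(sample, 3)]
print(converted)  # ['a;b\n', 'c;d\n', 'e;f\n']
```

Only three lines are read from the stream here; the fourth is still waiting to be read after the loop finishes.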

(Note that we used the 'myfile' constructor instead of the 'open' function to open the file, since we need an object of type 'myfile' and not 'file'. There are other ways to downcast in Python, but this is probably the simplest in this case.)

Python generators go a long way toward making code more elegant and encapsulating separate concepts. I will blog about the mechanics of how generators work and their other uses as and when I learn more about them. For Rubyists, generators are roughly similar to "blocks".
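
As a rough illustration of that separation, the standard library's itertools.islice can play the role of readfirstfewlines, while a second generator does the delimiter change. This is only a sketch: change_delimiter and the sample list are invented for the example, and any iterable of lines (such as an open file) could be used in place of the list.

```python
import itertools

def change_delimiter(lines, old=',', new=';'):
    """Lazily rewrite the delimiter in each line (hypothetical helper)."""
    for line in lines:
        yield new.join(line.split(old))

lines = ["1,2\n", "3,4\n", "5,6\n", "7,8\n"]  # any iterable of lines would do
first_two = itertools.islice(lines, 2)        # the "which lines" concern
result = list(change_delimiter(first_two))    # the "what to do" concern
print(result)  # ['1;2\n', '3;4\n']
```

Neither piece knows about the other, so the line limit and the conversion can each be changed independently.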

2 comments:

Ryan Ginstrom said...

Nice! You could also do this with a function:

def read_n_lines(inbuffer, num_lines):
    for i, line in enumerate(inbuffer):
        if i >= num_lines:
            return
        yield line

And call it like this:
read_n_lines( open("/python25/readme.txt"), 5 )

And make a read_5 function like this:
read_5 = lambda inbuffer : read_n_lines(inbuffer, 5)

The cool thing about the function version is that it abstracts away the file part -- you could pass any iterable object (such as cStringIO for testing). The class version is simpler, though, because it wraps the file-open mechanics.

chandrakant said...

Thanks for the input, that's slick indeed! :)