PyDev of the Week: Stefan van der Walt

This week we welcome Stefan van der Walt (@stefanvdwalt) as our PyDev of the Week! Stefan is the creator of scikit-image, which is a collection of algorithms for image processing. You can see some of the projects that he is a part of on GitHub or on Berkeley’s website. Stefan also has his own website, which is worth checking out. Let’s take a few moments to get to know Stefan better!

Can you tell us a little about yourself (hobbies, education, etc.)?

I am currently a researcher at the Berkeley Institute for Data Science (BIDS) at the University of California, Berkeley. I was born and raised in the university town of Stellenbosch, South Africa—renowned for its beautiful nature and world-class wines—where I studied electronic engineering, computer science, and applied mathematics. Growing up there, it was easy to fall in love with nature: I love running and hiking in the mountains, and exploring in general. Nowadays, most of my hobby time is spent with my two children, aged 1 and 3.

Why did you start using Python?

I’ve always been drawn to new languages, and enjoy tinkering with them to see what constructs they provide, and how they allow you to express familiar problems in novel ways. So, while I dabbled with Python in high school (for little projects like organizing my music collection), it was really during a summer internship that I learned it inside out. They gave me two weeks to learn Python, after which I had to solve some database-related problems. Those first two weeks were great! Later at university, I did most of my work in Octave, but switched when my advisor got inspired by Python. Those were early days in the scientific Python ecosystem, but I was just too happy that I could use and develop open source software as part of my work.

What other programming languages do you know and which is your favorite?

I feel like “knowing” a language means having developed intuition around it, instinctively knowing how to best express yourself. I spent several years learning C++, but never truly felt comfortable with it. There’s this great book by Scott Meyers where he shows code snippets, and asks you to figure out what’s wrong with them. You often can’t see it, but when he shows you, it turns out to be some BIG issue. This had me worried: do I really want to spend so much time learning a language that easily hides catastrophically bad behavior? In that regard, I think C++ has improved a lot since, so that nowadays it is easier to program safely, but I haven’t gone back.

Day-to-day, I use JavaScript—to build scientific web portals for machine learning and astronomy—and elisp, because I practically live inside of emacs and org-mode. It’s hard to pick favorites: each serves a purpose, and has its own beauty and warts.

There are a lot of others I still wish to explore: Haskell, to understand its type system; Rust, to see what a modern systems language looks like; and C# and .NET, to see why users are so excited about their library support and documentation.

What projects are you working on now?

At BIDS, I lead a team of three programmers who work with the community to develop NumPy, the numerical array library. This was the first Python library I ever contributed to, because it was so fundamental to all the numerical work we did.

I also spend a bunch of time on scikit-image (the image processing library, for which we just released v0.15) and SkyPortal (an astronomy data web portal).

And then there’s the software we write for research purposes. At the moment, I’m working with the Natural History Museum in London to help accelerate their image processing pipelines for digitizing their vast collection of insects.

Which Python libraries are your favorite (core or 3rd party)?

There are so many great ones to choose from! The Python 3 standard library is fantastic. I use IPython all the time, and I love the docstrings of the scientific libraries; we really should populate these for the standard library too.

My most imported 3rd party libraries are numpy, scipy, and matplotlib, and I think the dask project is exciting for scaling work from a laptop to larger systems.

How did scikit-image come about?

The research for my master’s and PhD projects involved a lot of image processing, and over the years I built up a collection of algorithms. The mailing list was the marketplace for these, but wasn’t the best venue to host code. So, in 2009, I packaged up what I had into a library.

I took pride in the implementation of these algorithms, so I found it surprising when the first users uncovered several bugs and unsupported corner cases. It was here that I learned that, under the right circumstances, groups of people are able to produce significantly higher-quality work than individuals. Of course there are exceptions, but I try to remind myself of this when I struggle to let go of ownership of a piece of code.

What are the top three things you feel you have learned from working in open source?

  1. Focus on people. Technical sophistication only takes you so far; in the end, success depends crucially on collaboration.
  2. There are many smarter people than you out there. That’s OK; learn from them, and enjoy working with them.
  3. When we share, there is more to share. Building together, we can build more, and better.

Why is Python so popular in data science?

Data science is about building complex data pipelines. You need to ingest data from disparate sources, visualize it, do numerical computation (often at scale), and publish. Ideally, this would happen in a uniform environment, and Python checks all the boxes. Furthermore, Python is easy to read and learn, and has an active developer and user community.

It is easy to convince yourself that, because Python is so successful in this field, it is the best tool for the job. But there are some great libraries out there produced by other communities too, and we would do well to build connections with them and learn from their experiences.

That said, my faith in the Python ecosystem lies as much in its community as it does in the tools and technologies. It is an extraordinary group of people who, having the necessary skills and attitudes, will build the next generation of high-quality tools that data science needs.

Is there anything else you’d like to say?

I feel immensely lucky to have been a part of the scientific Python community. It changed my life, in no small measure, and has given me many lifelong friends. The community taught me a lot, technically, but also about the importance of treating people kindly and with respect. In the end, that is much more important than any of the work we do.

Thanks for doing the interview, Stefan!