PyDev of the Week: Tom Augspurger

This week we welcome Tom Augspurger (@TomAugspurger) as our PyDev of the Week! Tom is a core developer of the pandas, dask and distributed Python packages. You can see what Tom is up to by checking out his blog or over on GitHub. Let’s spend some time getting to know Tom better!

Can you tell us a little about yourself (hobbies, education, etc):

I became interested in the financial crisis that rolled through the world in 2008. This pushed me into studying economics, to try to figure out what was going on and how to fix it. I went on to study economics in graduate school at the University of Iowa. It took me 3 years to realize that I wasn’t suited for a PhD, so I left the program early with a master’s.

My hobbies used to revolve around data analysis and contributing to open source projects. Since having a son last year, raising him has become my one and only hobby 🙂

Why did you start using Python?

I learned Python about 5 years ago. In graduate school, I needed something to help with data analysis and simulations. They taught us a bit of Matlab, which was OK, but I pretty quickly went searching for alternatives that offered more from a language design standpoint. I enjoyed the programming and software engineering side of research as much as I enjoyed the analysis itself.

What other programming languages do you know and which is your favorite?

SQL is probably the only other language I could actually claim to know. I’ve picked up bits and pieces of R, Matlab, JavaScript, Haskell, C, and C++, but Python is far and away my favorite. I do think it’s important to interact with other communities so we can take inspiration from how they solve problems. Some of my most important contributions have come from copying the API design of Hadley Wickham, an R developer.

What projects are you working on now?

I help maintain pandas, dask, and distributed, and I’m neglecting a few side projects I’ve written like engarde and stitch.

Pandas provides some high-performance data structures that are useful for data analysis, and a bunch of methods for operating on those data structures. It’s Python’s go-to implementation of the dataframe that was popularized by R.
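To give a flavor of what that looks like, here is a minimal sketch of the pandas DataFrame; the column names and values are made up for illustration:

```python
# A tiny, made-up example of the pandas DataFrame and a group-by aggregation.
import pandas as pd

df = pd.DataFrame({
    "state": ["IA", "IA", "MN"],
    "population": [3.1, 3.2, 5.5],
})

# Split-apply-combine: mean population per state.
print(df.groupby("state")["population"].mean())
```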

Dask and distributed are a related pair of projects for parallelizing the existing scientific Python stack. Dask provides APIs for working with Arrays or DataFrames that look like NumPy or Pandas, but operate on larger-than-memory datasets in parallel. Dask will use all of the cores on your single machine; distributed does the same but for an entire cluster of machines.
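As a rough sketch of how the pandas-like API carries over (the file pattern and column names below are hypothetical):

```python
# A minimal sketch of dask.dataframe mirroring the pandas API.
import dask.dataframe as dd

# Lazily treat many CSV files as one logical DataFrame that can exceed memory.
df = dd.read_csv("data/2017-*.csv")

# Looks like pandas, but the work is broken into tasks and run in parallel;
# .compute() triggers execution (across local cores, or a cluster with distributed).
result = df.groupby("state")["amount"].mean().compute()
print(result)
```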

Which Python libraries are your favorite (core or 3rd party)?

After dask and pandas of course 🙂 I think seaborn deserves a mention here. It’s a library for statistical visualization of datasets that builds on top of Matplotlib. Pandas has some plotting functionality built in, but most of the time I tell people to just use seaborn because it does things better.
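For example, a minimal sketch of the kind of statistical plot seaborn makes easy, using one of its bundled demo datasets:

```python
# A small seaborn example on top of Matplotlib, using the bundled "tips" dataset.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # demo dataset shipped with seaborn

# One call draws the scatter plot, the linear fit, and the confidence band.
sns.lmplot(x="total_bill", y="tip", data=tips)
plt.show()
```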

How did you get started with the pandas and dask projects?

For pandas, I started with a small pull request to fix the missing value indicator that a remote data provider used. Despite it being a 3-4 line change, I managed to completely mess up the git commands to make a clean PR. After the maintainers calmly walked me through the process of fixing everything, I knew this was a community I wanted to be part of. I kept helping out on issues, submitting PRs, and submitting answers on StackOverflow.

Dask has a tight relationship with other projects in the scientific Python ecosystem. I was able to submit PRs to implement features that worked in pandas, but weren’t implemented in Dask yet. Now that I work at Anaconda, some of my work time goes to Dask and pandas.

What kinds of challenges have you faced working in open source?

Early on, it was difficult putting my code out in the open for everyone to see. I was new to programming in general, not just Python. For me, at least, that dread of having others judge your code (and you) has faded away.

As a maintainer, the hardest part is probably the scale of issues pandas has. As of this writing we have 2,038 open issues and 100 open Pull Requests. It can be difficult to know how much attention to give to each issue.

Do you have any advice for new developers that would like to join an open source project?

Find a problem that you’re interested in. Altruism alone wouldn’t have motivated me to donate my time to write code or triage issues. I needed to be invested in the broader goal of building tools for doing data analysis (and using those tools myself).

Is there anything else you’d like to say?

Pandas and dask are always open to new contributors, so if you’re passionate about data science, feel free to claim an issue on GitHub.

I’m currently thinking about “scalable machine learning” for Python. Basically, taking the workflows data scientists use for day-to-day machine learning projects and scaling them up to more complex models and larger datasets. I’m blogging about that (part 1 is at http://tomaugspurger.github.io/scalable-ml-01.html). If you’ve thought about that, please do reach out.

Thanks for doing the interview!