PyDev of the Week: Thomas Fan

This week we welcome Thomas Fan (@thomasjpfan) as our PyDev of the Week! Thomas is a core developer of scikit-learn, a machine learning package for Python.

If you’d like to see what else Thomas is up to, you can check out his website or his GitHub profile.

Let’s take a few moments to get to know Thomas better!

Can you tell us a little about yourself (hobbies, education, etc.)?

I am a Staff Software Engineer at Quansight Labs, which aims to sustain and grow community-driven PyData open-source projects. My academic background includes mathematics and physics, where I conducted Quantum Computing research. Outside of software, I enjoy walking around NYC, swimming, and reading non-fiction books.

Why did you start using Python?

In graduate school, I became interested in Python to assist with my Quantum Computing research. Specifically, I started using the Python library QuTiP to simulate the dynamics of quantum systems.

What other programming languages do you know and which is your favorite?

I have learned C, C++, Cython, JavaScript, Go, R, and Rust through the years. Each language has its strengths and weaknesses in expressiveness and its target domain. Nowadays, I consider other programming languages as a gateway to accelerate computation in Python. For example, NumPy uses C and Fortran, PyTorch uses C++, scikit-learn uses Cython, and Polars uses Rust. Python is still my favorite language, but Rust is a close second.

What projects are you working on now?

Currently, I am working on improving how scikit-learn works on heterogeneous data. We recently released v1.2, which allows scikit-learn transformers to return Pandas DataFrames. Additionally, I am working on adding more machine-learning methods for encoding categorical features and improving the user experience for tree-based models, such as our Histogram-based Gradient Boosting Trees.

Which Python libraries are your favorite (core or 3rd party)?

From the core library, my favorite modules are functools and itertools. My favorite third-party libraries are Polars for fast DataFrames and PyTorch for deep learning.

How did you get involved with the scikit-learn project?

In 2018, I was unsatisfied with how scikit-learn worked with Pandas, so I started contributing to improving Pandas compatibility. Later that year, I joined Andreas Mueller at Columbia University to work on scikit-learn and related projects. A few months into the position, I flew to Paris to meet with scikit-learn maintainers during a development sprint. A few weeks after the development sprint, I was nominated and voted in to be a maintainer.

What are some of the most surprising things you’ve seen scikit-learn used for?

At a conference, I was surprised to learn that an attendee used the SplineTransformer and Generalized Linear Models to construct Generalized Additive Models for time series analysis. These features were relatively new then, so hearing about their quick adoption was surprising.

Is there anything else you’d like to say?

In 2016, while expanding my Python skills, I found Mike’s Python 201 book beneficial for learning intermediate topics. Therefore, I want to thank Mike for his incredible work in teaching and promoting Python.

Thanks for doing the interview, Thomas!