PyDev of the Week: Roni Kobrosly

This week we welcome Roni Kobrosly as our PyDev of the Week! Roni is the creator of the causal-curve package. You can catch up with Roni on Roni’s website. If you want to see some of Roni’s code, you can check it out on GitHub.

Let’s spend a few moments getting to know Roni better!

Can you tell us a little about yourself (hobbies, education, etc):

I grew up in Austin, TX, back when it was a small, hippie, college town known for its live music and not yet known as the “Silicon Hills”. My parents and brother still live there. My wife and I have a cute 5-week-old baby girl named Cora so my current primary hobbies are hanging out with her and napping, but before that I loved cooking elaborate meals, biking, climbing, doing digital art (using Procreate, Vectornator, and Pixaki), and traveling. I have a PhD in Epidemiology and in a former life I was doing research in environmental health.

Why did you start using Python?

Writing code for ETL and statistical modeling is something I’ve been doing since 2006, but until 2015 it had been in R. In 2015 I joined the now-defunct Insight Data Science Fellowship in NYC to transition into the data industry, and after hearing so many industry folks talk about Python’s benefits (e.g. a versatile general purpose language, very readable, lots of support for data and math projects) I set out to learn it through web tutorials and such. Almost immediately I loved working with it and I voraciously watched and worked through tutorials on its various packages.

I’m one of those python people who didn’t start learning python with an engineering perspective or background in place. That is to say, I initially wasn’t thinking about PEP8, the principles of solid application design, what a pure function was, OOP vs functional programming, what a virtual environment was, what a container was, etc. So over time I had to learn to complement my data analysis and statistical technical skills with engineering ones. I developed my software engineering skills while using python, so I have a particular fondness for it (while also completely getting that a language is just a tool).

What other programming languages do you know and which is your favorite?

R, Go, Scala, and a modest amount of JS. Python is my favorite, hands down, though I’ve come to really appreciate how readable, opinionated, and fast Go can be. In learning a new language, I would look for associations between python’s built-in functions and data structures and those in the new language (e.g. “It sounds like a Go `map` is like a python dictionary…”). So in that sense, python is pretty hard-wired into my head!

What projects are you working on now?

Two active projects, but only one of them involves writing python code.

I’m putting the finishing touches on a 3-hour O’Reilly Media live training session entitled “An introduction to causal inference using python”, which is set to occur on March 16. It’s a workshop that will involve slides, Q&A, and Google Colab notebooks with python-based exercises for attendees. Part of this workshop will cover the causal-curve python package that I rolled out to PyPI back in 2020.

Beyond that, I’ve been working on building out an extensive public collection of articles, posts, and videos pertaining to data team leadership. I love the feeling of getting into the python programming “flow state” but these days I lead data teams, so I spend more time thinking about how I can create the best conditions for coders. I spend a lot of time searching for suggestions around different aspects of management, and I thought a giant aggregation of this content could be helpful to others. It follows the awesome-list format, so it’s in the form of a GitHub README file.

Which Python libraries are your favorite (core or 3rd party)?

It’s hard to pick a few! I would say I’m an enormous fan of the typical python data stack, by which I mean libraries like NumPy, SciPy, Pandas, Polars, pyspark, Dask, statsmodels, PyMC3, scikit-learn, to name a few. They’re tools, and like all tools they have their strengths and limitations, but I’ve always appreciated their fantastic documentation and the communities behind them.

I would also add the standard library module `pdb`. I love the `pdb` debugger and it makes me sad when I switch over to another language and can’t find a tool that works equally well…

How did you get involved with the causal-curve project?

I started causal-curve in the dead of the first COVID summer in 2020. Everyone I spoke with at the time was scared, confused, isolated, and bored. Primarily, it was a project to give my mind something to focus on, but I genuinely noticed a gap in a subfield known as “causal inference” within the data science / data analysis python space. It was a problem I regularly came upon in my professional work involving causal inference, and yet there didn’t seem to be any established tools for addressing it in python (there were and are many papers in the academic literature around this topic).

I’m not arrogant enough to assume your readers would know about `causal-curve`, so I’ll briefly describe what it does.

In industry and in academia, the best tool we have for determining the effect some treatment has on some outcome is to do an experiment where you have a “treated” group and some sort of control group. In the medical field, for example, they will randomize the folks in some population to either receive a new drug (the treatment) or get a sugar pill (the control), and then follow up with them to see whether their cholesterol levels (for example) have improved after 6 weeks. The experiment allows you to determine the true causal effect of the treatment (assuming the experiment was run properly). Experiments aren’t always feasible though, particularly at a company in the tech industry; they can be resource-intensive or tricky to run. You can’t simply look at raw correlations and averages in observational, non-experimental data, because they are subject to all sorts of biases (e.g. confounding). “Causal inference” methods are a set of tools for taking observational, non-experimental data, and making clean causal estimates from them, like you would have gotten from an experiment if you were able to run one.
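
As a rough illustration of the confounding problem described above, the sketch below simulates observational data in which a hidden trait drives both who gets treated and the outcome. Everything in it is an assumption made up for the example (the function names, the effect size of 2.0, the 80%/20% treatment probabilities); it uses simple stratification on the confounder, which is only one of many adjustment techniques.

```python
import random
from statistics import mean


def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulate observational data with one binary confounder (hypothetical setup)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                 # confounder, e.g. "already a heavy user"
        p_treat = 0.8 if z else 0.2            # confounder drives treatment uptake...
        t = rng.random() < p_treat
        y = true_effect * t + 3.0 * z + rng.gauss(0, 1)  # ...and also the outcome
        rows.append((z, t, y))
    return rows


def naive_difference(rows):
    """Raw difference in mean outcome between treated and untreated rows."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def stratified_difference(rows):
    """Compare treated vs untreated within each confounder level, then
    average the per-stratum differences weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [(t, y) for z, t, y in rows if z == level]
        treated = [y for t, y in stratum if t]
        control = [y for t, y in stratum if not t]
        total += (mean(treated) - mean(control)) * len(stratum) / len(rows)
    return total


rows = simulate()
print(f"naive estimate:      {naive_difference(rows):.2f}")
print(f"stratified estimate: {stratified_difference(rows):.2f}")
```

In this simulated setup the naive comparison overstates the effect, because treated rows are disproportionately drawn from the group that would have had higher outcomes anyway, while the stratified estimate lands close to the simulated effect of 2.0. Real data rarely comes with a single, perfectly measured confounder, so this is a toy case only.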

Typically causal inference methods assume the “treatment” is binary (e.g. you saw an ad vs you didn’t), but there are tons of scenarios where a treatment could be continuous in nature (e.g. price of a product, minutes per week of exercise someone does, customer service phone wait times in minutes). This python package allows one to estimate the “causal curve”, or the causal relationship between some continuous treatment and some outcome.
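
To give a feel for what a “causal curve” for a continuous treatment looks like, here is a minimal standard-library sketch. It is not the causal-curve API and not necessarily the method the package uses internally; it is a plain standardization (g-computation) approach under the strong assumptions that the single confounder is measured and the outcome model is correctly specified. All names and numbers are hypothetical.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(features, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    k = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * yi for f, yi in zip(features, y)) for i in range(k)]
    return solve(xtx, xty)


def design(t, x):
    """Outcome-model features: intercept, treatment, treatment squared, confounder."""
    return [1.0, t, t * t, x]


def simulate(n=5000, seed=1):
    """Hypothetical data: confounder x raises both the treatment dose and the outcome."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        t = 5.0 + 1.5 * x + rng.gauss(0, 1)
        y = 2.0 * t - 0.1 * t * t + 3.0 * x + rng.gauss(0, 1)
        data.append((t, x, y))
    return data


def causal_curve(data, grid):
    """For each treatment value, set everyone's treatment to that value,
    predict the outcome with the fitted model, and average the predictions."""
    beta = ols([design(t, x) for t, x, _ in data], [y for _, _, y in data])
    curve = []
    for t_value in grid:
        preds = [sum(b * f for b, f in zip(beta, design(t_value, x))) for _, x, _ in data]
        curve.append(sum(preds) / len(preds))
    return curve


data = simulate()
for t_value, estimate in zip([3.0, 5.0, 7.0], causal_curve(data, [3.0, 5.0, 7.0])):
    print(f"treatment={t_value:.0f}  estimated mean outcome={estimate:.2f}")
```

The output is a set of points along the curve: the estimated average outcome if everyone had received each treatment level. In the simulation the underlying curve is 2t - 0.1t², so the estimates can be checked against it; with real data there is no such ground truth, which is why the modeling assumptions matter so much.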

What do you love about causal-curve?

Primarily, I love hearing from folks that they found causal-curve to be useful 🙂

Is there anything else you’d like to say?

I don’t think I have to say this as the python community is a friendly lot but just in case… a friendly reminder for the more experienced python developers out there: be kind to and mentor junior folks!

Thanks for doing the interview, Roni!