PyDev of the Week: Marcin Kozak

This week we welcome Marcin Kozak as our PyDev of the Week! Marcin maintains perftester and makepackage. You can see what else Marcin is working on over on GitHub.

Let’s spend some time getting to know Marcin better!

Can you tell us a little about yourself (hobbies, education, etc.):

I graduated in agriculture from the Warsaw University of Life Sciences – SGGW. My main interest was in statistical applications in agriculture and biology, and I pursued this topic in my M.Sc., Ph.D. and habilitation (in the Polish science system, a scientific degree above the Ph.D.), and then my professorship (the highest scientific title in Poland, conferred by the president of Poland). I spent over 20 years of my academic career involved with statistics, data analysis, and data visualization across various scientific disciplines. But the truth is that I am both an interdisciplinary and a multidisciplinary researcher. I have worked on various topics related to biology, scientometrics, economics, social sciences, and academic writing. Currently, I work at the University of Information Technology and Management in Rzeszów, Poland. But academia is just one part of my career; the other is data science. I cooperate with 7N, an outsourcing company, as a senior data scientist and a Python developer. The two worlds, academia and data science, intermingle in my life, and both require the same amount of creativity; hence I find joy in pursuing both paths and in realizing myself in both worlds.

As for my hobbies, you can say that my life is my hobby, at least these days. For most of my life, I dreamt of living in a wooden house, close to nature and the woods, in low mountains, far from the rush and cars and people. And for eight years, I’ve been living in a wooden house, close to nature and the woods, in low mountains, far from the rush and cars and people. I spend much of my free time (which I don’t have too much of, unfortunately) with my wife and two great kids. I also read a lot, play chess, and go to the woods with our dogs and cats. Yes, you heard it right: our cats do like walks in the woods!

Why did you start using Python?

Because I had to. It was four years ago. After I joined a project as a data scientist and an R developer, I noticed some .py files in the project’s repository. The truth is (shame on me!), I didn’t know what sort of extension it was. Of course, it did not take me more than a minute to learn that it was Python, but it did take me more time to accept that I had to learn Python, in a very short time, to work with these scripts, and the project called for a lot of changes to them. The project was planned for three months, so that’s not much if you have to learn Python, while at the same time having a lot of other tasks to do, such as building machine-learning models, validating and tuning them, creating dashboards and the like — and all of that during a half-time project. So, I could either give up the project or continue.

I decided to continue and to learn Python. This was one of the best decisions in my career! It took me a couple of days to fall in love with Python, even though (or maybe because?) I had been using R for 16 years. The truth is, I have never really liked R as a programming language. I still consider it a great tool (for some tasks, perhaps better than Python) for exploratory data analysis and visualization, but that’s it. When I found Python, I decided it was time for me to return to real programming, which I had left… 24 years earlier, when I graduated from high school!

What other programming languages do you know and which is your favorite?

I started learning programming back in the early 1990s in high school, where I attended a computer-science class. My dad was a cybernetics engineer and later a programmer; he used assembly language, C, Algol and Fortran. My mum was a machine operator of Zama-41 and Odra computers. So, since my early years I knew what a computer was and what programming meant. In high school, we learned mainly Pascal, and after three difficult years, we knew it quite well. Today, however, I remember nothing at all about this language, nothing.

When I returned to programming, which was when I started my job at the Central Statistical Office of Poland back in 2003, I picked up R. I didn’t have much to do at work, so I was spending full working days learning survey statistics and R. I have been using it ever since, now for almost 20 years. I used it for a lot of statistics, data analysis and data visualization, and for developing software products — but as I mentioned, I have never liked R as a programming language.

Then, in 2018, out of the blue I fell in love with Python, my heart went back to programming, and my life started to change for the better.

I quickly realized, however, that neither R nor Python enables one to understand the essence of programming. I gained some general programming knowledge back in high school, but I needed to recall it, and to learn even more. So, I decided to start learning C. I consider C a great language to learn for any developer (particularly for Python developers who do not know any other programming language), as it helps in understanding the deeper aspects of programming. And indeed, C helped me open my eyes to advanced Python.

I have never wanted to use C outside of Python extensions and Cython, so I wanted to learn yet another programming language, one that I could use in my projects. This is why I picked up Go. And, just as I fell in love with Python from the beginning, I fell in love with Go. Ever since, I have liked both languages equally. As part of my learning process, I wrote an open-source Go package, check. Nonetheless, after some time I realized that in the near future I would not have many opportunities to use Go, as I was expected to use Python and R in all my projects. Thus, since my time was limited (what an understatement!), I decided to focus on Python. Now it’s my main programming language, and whenever I have some time to spare, I develop my Python skills.

Recently, I heard about the V programming language. V’s authors claim that if you know Go, you know 80% of V. So, you can say that I know V quite well — maybe 50%, 60% of it? — even though I have not used it. (Am I joking? Most likely. But it depends on interpreting the 80% claim of the authors of V.)

What projects are you working on now?

These days, I focus mainly on perftester and makepackage.

`perftester` is a Python package for lightweight performance testing of Python functions. You can use it for performance benchmarking and testing in terms of execution time and memory. It can be used as part of `pytest`, but its main role is to serve as an independent performance testing framework that is used in a similar way to `pytest`. I am currently working on implementing decorators that will enable the developer to profile code, but also to log profiling details of subsequent calls to the function in the production environment. This is nothing particularly novel: writing a decorator to measure time is simple, and for in-depth profiling of memory usage one can use `memory_usage()` and the `@profile` decorator from the `memory_profiler` package. However, I want to do this to make `perftester` complete. The decorator-based functionality will enable developers to profile execution time and memory usage in a similar way to the rest of `perftester`; it can also be used to log the detailed performance of each call of the decorated functions. This is quite a big coding task, but I find a lot of pleasure in this work.
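A simple timing decorator of the kind mentioned above can be sketched with the standard library alone. This is only an illustration, not `perftester`’s implementation or API: the `timed` decorator, its `durations` attribute, and the `sum_of_squares` function are hypothetical names made up for this example.

```python
import functools
import time


def timed(func):
    """Record the wall-clock duration of every call to ``func``.

    Hypothetical helper for illustration; not part of perftester.
    """
    durations = []

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Store the elapsed time even if the call raises.
            durations.append(time.perf_counter() - start)

    # Expose the collected timings on the decorated function.
    wrapper.durations = durations
    return wrapper


@timed
def sum_of_squares(n):
    return sum(i * i for i in range(n))


result = sum_of_squares(1_000)
```

After the call, `sum_of_squares.durations` holds one elapsed time in seconds. In production, the same hook could write to a logger instead of a list.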

I am also working on new functionality for `makepackage`. For the moment, you can use it to create a Python package that uses `pytest` for unit testing. I am going to add a new ability to create what I call a ‘mini-package’. This is a minimalistic and flat structure of a Python package, with only one module, located in the root, and with testing based on `doctest`. The user can later add `pytest`, if they prefer. Such a minimalistic structure is a great way to create a Python package that is very simple to develop, and one that does not even have to be installed from PyPI: the module’s file can be copy-pasted into a project’s source folder.

Lately, I have also been spending quite some time thinking about easycheck. It’s a package I wrote with a colleague, Mikołaj Kobyliński, to enable a developer to make functionalized and highly customizable assertion-like checks, without the actual use of `assert`; if a check fails, the developer can either raise an error or issue a warning. The package offers rich functionality, but is currently too slow, so I want to improve that. I am afraid that we may have to decrease its readability and compromise its style to achieve that, but hopefully this will improve the package’s performance. We are also working on a way to disable all checks in an application’s code, which can be helpful when deploying an application to a production environment.
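The idea of an assertion-like check that either raises an error or issues a warning can be sketched as follows. This is not easycheck’s actual API; `check_condition` and its parameters are hypothetical names used only to illustrate the concept.

```python
import warnings


def check_condition(condition, handle_with=ValueError, message="Check failed"):
    """Hypothetical assert-like check; NOT easycheck's actual API.

    If ``condition`` is falsy, issue a warning when ``handle_with`` is a
    Warning subclass, and raise ``handle_with`` otherwise.
    """
    if condition:
        return
    if issubclass(handle_with, Warning):
        warnings.warn(message, handle_with)
    else:
        raise handle_with(message)


# A failed check that raises an exception.
try:
    check_condition(-1 > 0, ValueError, "x must be positive")
except ValueError as err:
    error_text = str(err)

# A failed check that only issues a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_condition(False, UserWarning, "value looks suspicious")
```

Unlike `assert` statements, a function-based check like this is not stripped when Python runs with the `-O` flag, so switching the checks off needs a mechanism of its own.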

As you can see, there’s a lot to do, but I do not have enough time for all that, since I work on all these open-source projects during my free time. Still, I do it because I love the open-source philosophy. I love it in science publishing, and I like it in programming languages. At the end of the day, however, someone has to pay the bills, so my involvement in open-source activities depends on many factors. In this regard, my main aim is to keep active the four open-source packages I have authored or co-authored: the three aforementioned and rounder. Even though I do have some ideas for new projects, I try to calm myself down and focus on what needs further development, rather than on creating new packages.

Who knows, maybe one day I will find the time to delve even further into open source, but unfortunately, this would require an unexpected (but highly welcome) source of cash so that I can buy a new Ford Mustang. Nah, that’s not true: I don’t need to buy a new Mustang, because my 11-year-old son, Tymek, promised me he would buy me one (of course, when he becomes financially independent). But that’s not all. My 15-year-old daughter, Tosia, did a similar thing: she promised me she would buy me an old Mustang (maybe even a 1964 model). So, as you see, my future is set, and I can now focus on my open-source activities, and I will someday become a happy owner of two Ford Mustangs.

Which Python libraries are your favorite (core or 3rd-party)?

I like quite a few Python packages, such as `pytest` (as I love writing tests), `itertools`, `multiprocessing` and `pathos` (I use parallelization a lot), and `functools` (I love functional programming), but I could make this list longer.

I do have, however, a _favorite_ Python package. It helps me a lot at various stages of development, in both PoC/PoV and production projects. It’s `doctest`. I think it’s heavily underappreciated. Most Python developers know it and some use it, but I don’t think it has the reputation it deserves.

I use it in various circumstances:

  • to write documentation tests in module/function/class/class method docstrings;
  • to write unit tests, in docstrings but also in dedicated files (I usually use Markdown files for doctests); such unit tests have limitations and are rather simple, but they do constitute a fantastic approach to regression testing. I think that more often than not they are much more readable than unit tests written in other frameworks (including `pytest`, which I really like and use in almost all my projects);
  • to write testable documentation files, including READMEs; and
  • to use in test-driven development (TDD).

Note that I distinguish two different purposes of `doctest`: writing documentation tests, or testable documentation; and writing unit tests. These two are not the same. When I write documentation files using `doctest` (usually in Markdown files), I put forth much effort to make the text, and the tests, readable. Their purpose is to explain, not to test, and `doctest` serves as a safeguard that the code is correct and up-to-date. But when I use `doctest` to write unit tests (also in Markdown files), I do not put so much effort into readability. These are unit tests, but thanks to Markdown, they can be readable so that the reader understands the essence of the tests. I can put edge cases and some boring stuff there, something I would seldom (if ever at all) put in a documentation file. You can also use the `assert` statements in such tests. I am also of the opinion that `doctest` enables one to write nice and readable integration and end-to-end tests, but such tasks are best achieved in dedicated files.
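A minimal sketch of `doctest`-based unit tests in a docstring, including an edge case that would rarely belong in user-facing documentation. The `mean` function is a made-up example; the tests are collected and run explicitly so that the snippet works wherever it is executed.

```python
import doctest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence.

    >>> mean([1, 2, 3])
    2.0
    >>> mean([])
    Traceback (most recent call last):
        ...
    ValueError: values must not be empty
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


# Collect the examples from the docstring and run them explicitly.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(mean, name="mean", globs={"mean": mean}):
    runner.run(test)
results = runner.summarize(verbose=False)
```

In a regular module, the same docstring tests can be run with `python -m doctest module.py`, and tests kept in a dedicated text or Markdown file can be run with `doctest.testfile()`.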

Since `doctest` aims for simplicity, however, it does not offer many advanced features. This is why more advanced Python documentation testing approaches have been proposed, such as byexample. I do prefer the simplicity and directness of `doctest`, however, and I don’t think I will replace it with more advanced tools. Besides, `doctest` is part of the standard library, which is quite an advantage.

How did makepackage come about?

When I work on a Python app, I almost always package the code. It makes things a lot easier, even if I do not want to upload the package to PyPI. But the packaging step is not only boring, it can also be quite tricky. I wanted to make this step simpler, so I thought of using Cookiecutter, but I quickly realized that it was not the tool for me: I simply wanted to create a simple package structure, and Cookiecutter seemed to be overkill for such a simple task. This led me to create a package template. I was using it for all my projects, but frankly, after some time, I got tired of copy-pasting the template. So, I created a script for this, and life got easier. Then I thought, if I created a script to create a package, why not create a package for this? That’s how I decided to package the app’s code. At first, I didn’t even plan to publish it on PyPI, for the simple reason that I did not believe anyone would use it. Fortunately, I quickly changed my mind and decided to publish the package, which I called `makepackage`.

It’s quite a simple package, consisting of simple code. My main purpose was to make it as simple as possible to use. For this very reason, `makepackage` does not offer too much: you can package your code, with or without a command-line interface. That’s all. The API could be made richer, but this would go against the purpose of simplicity. This is why, for example, when creating a package, you cannot choose a license type other than MIT. But you can change that after creating the package, can’t you? Likewise, `makepackage` does not ask for your name, so you need to fill in the corresponding fields in the setup.py and LICENSE files yourself. Soon, however, the package will have the mini-package functionality, as I mentioned before.

`makepackage` can be used by beginner, intermediate and advanced Pythonistas, as it simply helps save time. However, I decided that `makepackage` can also help beginner and intermediate developers in another way. The package created with `makepackage` not only has a working structure, but also examples of `pytest` tests (they are parametrized and use fixtures defined in a conftest file) and docstrings with `doctest` tests. In addition, the resulting package’s README offers some content, which will be overwritten during development. After being populated by `makepackage`, this README contains some relevant information on developing a package in a virtual environment, building the package and uploading it to PyPI, and the like. The information is customized, as it uses the package’s name in the commands. I hope some users will find this information helpful.

Why do you enjoy writing documentation?

I see two reasons behind this.

First, I like writing. Unlike most scientists, I really like writing articles. It is one of the things I enjoy most in my academic work.

Second, I am tired of projects that are poorly documented. I have been a part of several projects in which it was very difficult to understand anything. Sure, I agree that code should be self-standing; but code documentation and project documentation are two different things. When a project consists of five thousand lines of code and one line of documentation, I think something’s wrong. If someone has spent so much time to write so much code, why no documentation? Maybe the code is not yet complete? Would you leave that big of a project undocumented?

Thus, whenever I see poor documentation that does not help me understand a project (an app or a package), I do not feel like using it at all. That’s why I try to write rich documentation. I know that my documentation is often very rich, including a long README and several docs files, so some time ago I started adding short “TL;DR” sections to package READMEs; after reading such a short section, users can quickly learn how to use the package. If they need to learn more, they can read the full documentation.

I consider documentation a significant part of any project. A person claiming that their code is self-standing should remember that a user would have to read the entire code to understand what can be done with it. Is that a joke? On GitHub, I have seen big packages that did not have a single line of documentation, despite thousands of lines of code. Do such authors think that we will read those thousands of lines of code to learn what we can do with the package? That’s a totally crazy assumption.

Is there anything else you’d like to say?

Actually, yes. There is one thing I’d like to say to Python beginners. I have noticed that many of them read a lot of content available online, such as blogs. I am afraid some of them treat online resources as the main source of their Python knowledge, and I do not consider this a good learning approach for beginners, as they can seldom distinguish good resources from poor ones. Unfortunately, these days the net is full of bad or inaccurate information, and Python is no exception.

A funny example of such a misguided online recommendation is one saying that you should use `while` loops instead of `for` loops in order to improve performance. Had the author even checked this before writing such a recommendation? (I won’t even comment on how Pythonic such code would be, with `while` loops everywhere.) I don’t believe he did. What’s even funnier is that the same author suggested, immediately after the replace-`for`-with-`while` recommendation, that to increase performance, one should use a list comprehension instead of a `for` loop. What? So, to increase performance, should I use a `while` loop or a list comprehension? Another ‘trick’ to improve performance I learned from the internet is to use multiple assignments in one line (based on tuple unpacking) instead of assignments in separate lines; it is enough to run simple benchmarks to see that this is far from true. These are, of course, trivial and funny examples, but they are just a drop in the ocean.
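Claims like these are easy to check with the standard library’s `timeit` module. The sketch below compares three equivalent ways of building a list of squares; the function names are made up for this example, and the actual timings depend on the machine and the Python version, so none are quoted here.

```python
import timeit


def squares_for(n):
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_while(n):
    result = []
    i = 0
    while i < n:
        result.append(i * i)
        i += 1
    return result


def squares_comprehension(n):
    return [i * i for i in range(n)]


candidates = (squares_for, squares_while, squares_comprehension)

# All variants must agree before their timings are comparable.
expected = squares_for(1_000)
assert all(func(1_000) == expected for func in candidates)

# Take the best of several repeats to reduce noise from other processes.
timings = {
    func.__name__: min(timeit.repeat(lambda: func(1_000), number=200, repeat=5))
    for func in candidates
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s for 200 calls")
```

The same pattern works for the tuple-unpacking ‘trick’: write both variants as small functions, confirm they produce the same result, and compare the best timings.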

Thanks to its popularity, Python is likely one of the richest programming languages in terms of both web resources and web garbage. So, if you want to learn Python, get a good book for beginners. Return to web resources when you feel you’re ready to tell good Python from bad Python.