Let’s take some time to get to know Joshua better!
Can you tell us a little about yourself (hobbies, education, etc.)?
Professionally, I own a data science training company, Sharp Sight.
My background is somewhat unrelated. I graduated from Cornell with a degree in Physics, but I decided that I didn’t want to be a physicist.
Through a series of fortunate events, I had an opportunity to join the Marketing Analytics department of Bank of America. This was way before data science was popular and it was just dumb luck that I sort of fell into the opportunity.
Early in my career, I still wasn’t sure what I wanted to do, and I wasn’t sure that I would stay in the analytics industry. But at some point around 2011 or 2012, it became obvious that data science and analytics were going to be huge, so I doubled down and started learning everything I could about all of the emerging tools and techniques.
Through another series of fortunate events, I landed a job at Apple as a data scientist. I stayed there for a while, and then left to start my own company.
Why did you start using Python?
When I got my start in the analytics industry, we were using SAS and SQL almost exclusively.
If you’re not familiar, SAS is a statistical programming language used for data analytics. Back then, it was practically the only language used for data analytics, and certainly the analytics language of choice for large banks and more traditional Fortune 500 companies (e.g., ad agencies and pharmaceutical companies).
Around 2012, the industry began to change. More companies in Silicon Valley began focusing on data (although some, like Google, had been data-driven for much longer). I started seeing new job postings in San Francisco and Silicon Valley for analytics jobs, but they were starting to call it something different: data science.
And whereas the industry had previously used SAS, around 2012, you started seeing more job posts and articles that talked about R and Python.
It was obvious to me that SAS was likely to become a dinosaur. Big tech companies started building data science teams, and almost all of them used Python, and to a lesser extent, R.
It was clear that the industry was moving to R and Python, so I started learning them.
What other programming languages do you know and which is your favorite?
Python and R are my main programming languages right now. I also know quite a bit of SQL, although I use it less now that I’m no longer in a big enterprise setting.
I’m not sure that I have an overall favorite. I think that different languages are toolkits that are good at different things.
When I want to wrangle, analyze, and visualize my data, Python and R are both excellent.
If I want to automate or build a system, I prefer Python, because the syntax is easy to write and easy to read. Compared to R, in my opinion, Python is much better for more traditional software building.
Python is also better overall for machine learning and deep learning.
So Python and R are the two big languages that I use, but I also have a few other tricks up my sleeve.
What projects are you working on now?
I’m mostly working on building more data science courses.
At Sharp Sight, we teach data science. In particular, I’ve developed a training system that helps people memorize syntax and become “fluent” in writing code.
If you’ve ever forgotten a piece of syntax, and had to google it, you know that forgetting syntax is a major drag on productivity. I created a training system to solve this by:
- breaking syntax down into small learnable units
- giving people a training system to memorize those syntax units
- showing students how to put the pieces back together to do real work
To give credit where credit is due, this system is largely based on the metalearning system described by Tim Ferriss in The 4-Hour Chef (a book actually about learning, instead of cooking). It also incorporates insights from cognitive psychology to “hack” your memory so you can memorize syntax. I’ve spent many years studying “how to learn” and I’ve applied good learning principles to our courses for mastering data science.
With all of that said, I’m currently in the process of creating a course on Plotly (for data visualization in Python) and a course on scikit-learn (for machine learning in Python).
Which Python libraries are your favorite (core or 3rd party)?
Oh man, Python has lots to choose from.
At my core, I’m a data scientist, so my favorite Python libraries are data-related.
Pandas is the one that I probably use the most, and it’s really great in many ways. Having said that, it pains me to look at most Pandas code. Unfortunately, many people use it in ways that make data wrangling and data analysis complicated, hard to understand, and hard to debug.
Over the years, I’ve developed a somewhat unique style in how I use Pandas that allows you to do complicated data transformations in relatively compact blocks of code (this technique is a lot like dplyr pipes in R, or pipes in Unix). If you use Pandas this way, it’s a lot of fun to use.
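A minimal sketch of what this piped style can look like, assuming it refers to Pandas method chaining. The data, column names, and variable names below are invented for illustration, and this is not necessarily Joshua's exact approach:

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "North"],
    "units": [10, 4, 7, 12, 3],
    "unit_price": [2.5, 2.5, 3.0, 3.0, 4.0],
})

# Each method returns a new DataFrame, so the steps read top to bottom,
# much like a dplyr or Unix pipe.
revenue_by_region = (
    sales
    .assign(revenue=lambda df: df["units"] * df["unit_price"])  # add a column
    .query("units >= 4")                                        # filter rows
    .groupby("region", as_index=False)                          # group
    .agg(total_revenue=("revenue", "sum"))                      # summarize
    .sort_values("total_revenue", ascending=False)              # order
    .reset_index(drop=True)
)

print(revenue_by_region)
```

Wrapping the chain in parentheses lets each step sit on its own line, so a step can be commented out or reordered while debugging without touching intermediate variables.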
I’m also a really big proponent of data visualization. I think that data visualization is dramatically underrated. For static data visualizations, I really like Seaborn. The syntax is much easier to learn and use than Matplotlib, but it’s also built on top of Matplotlib. So Seaborn gives you the power of Matplotlib, but with added simplicity.
Additionally, I’m increasingly interested in Plotly. It’s very powerful, it has a clean syntax, and it also provides a toolkit for building dashboards and interactive charts.
You have multiple data science courses on your site. How did you decide which courses to make?
I think that data science is going to be very important over the next few decades, but learning data science is often very challenging. I created our courses to be the fastest, most efficient way to learn data science, without wasting time and money.
That said, there’s a lot inside of that statement, so let me unpack it.
It’s obviously a bit trite to say that “X is going to change the world,” but I really do believe that data science and machine learning will change almost everything.
In a somewhat recent Forbes article, Jeff Bezos said:
“The most interesting thing about machine learning, as opposed to a lot of other technologies, is just how horizontal it’s going to be … There’s not a single category of business or government or anything, really, that can’t improve itself.”
Machine learning and data science will impact everything. In turn, they will probably be very valuable for individuals to learn.
But there are a lot of bad books and bad advice out there on how to get started and what to focus on.
I’ve been somewhat lucky, in that I had an analytics job really early on. And early in my career, I had good mentorship about some parts of the data science process.
But since I started in the data industry over a decade ago, it has changed substantially. I needed to learn new tools and I didn’t always have good mentors for those things. I wasted a lot of time on things that I didn’t really need and many of the resources that I used were confusing.
So, I faced many challenges as I learned and upgraded my skills.
But my industry experience – along with the challenges I’ve had upskilling and learning data science – has given me a unique perspective on what students need to learn, and how to learn it.
I created courses that help people learn the right things, in the right order, and I show my students how to practice so that they remember all of the important syntax permanently.
In some sense, I simply created the courses that I wish I had years ago.
With all that said, if any of your readers are interested in learning data science here is what I recommend:
- Focus on the fundamentals. Focus on data wrangling (Pandas, NumPy), data visualization (Seaborn, Plotly), and data analysis (how to use wrangling + visualization together to find insights).
- Avoid advanced math in the beginning (unless you’re in academia). Most math is overrated for beginners in industry.
- Once you’ve mastered the basics, use that as a foundation to learn intermediate to advanced skills, like machine learning, deep learning, and geospatial visualization.
And if you want to connect or have a question, just reach out to me on Twitter at @Josh_Ebner.
Is there anything else you’d like to say?
Over the next decade or two, I think we’re going to see a major shift in the software and tech industry.
Increasingly, I think that almost all software is going to be data-driven software.
In this regard, I don’t mean that data will be incidental to software or merely part of how it operates, but rather that data will be central to most software.
Another way of saying this is that most software will involve machine learning. We’re going to increasingly build machine learning elements into almost all software.
Andrej Karpathy, the director of AI at Tesla, recently called this “Software 2.0.”
In a blog post, Karpathy described it like this:
“Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software. They are Software 2.0.
… we are witnessing a massive transition across the industry where of a lot of 1.0 code is being ported into 2.0 code. Software (1.0) is eating the world, and now AI (Software 2.0) is eating software.”
If this is the case, then going forward, most software will have machine learning built in as a critical component. This Software 2.0 will actively learn from data streams. And it will become even more important as we digitize the world and add sensors to almost everything (a subject discussed in the book The Second Machine Age, by Brynjolfsson and McAfee).
In turn, I think that data science and machine learning will become increasingly important for developers. Most developers will need to know how to build, train, and maintain machine learning systems … or at least, build the infrastructure around them.
The next few decades will be really exciting, but also challenging. Many Python developers will need to skill up and learn some data science and ML.
The question is, how? Where do you start?
I firmly believe that the foundation of machine learning and data science is:
- data wrangling
- data visualization
- data analysis
I strongly believe that math is overrated for data science and machine learning beginners, but data wrangling, data visualization, and data analysis are used for almost everything, including machine learning. So those three are the essential foundations that you need to know if you want to get started with machine learning or data science more broadly.
So to sum up, if you want to learn foundational data science in Python, here are my recommendations:
- learn Pandas for data wrangling
- learn Seaborn or Plotly for data visualization
- learn to combine data wrangling and data visualization to analyze data. (Most data analysis is simply using data wrangling + data visualization to “find insights in data”)
Those are the foundations. Once you know those foundations, then you can move on to scikit-learn for machine learning. But make sure that you don’t get “shiny object” syndrome and try to jump to the sexiest stuff first. Master the foundations first, and you’ll be much better prepared once you start learning machine learning, deep learning, and other advanced topics.
And finally, if you want to save yourself the time and frustration of trying to figure it all out yourself, consider one of our courses at sharpsightlabs.com/course-directory/.
Thanks for doing the interview, Joshua!