How to pick a machine learning project that would get you hired

I’ve mentored about 165 projects during the last five years at the companies where I work. The people who built those projects all got jobs (mostly). The more I talk to companies interviewing today, the more apparent it is: A portfolio project is decisive when making hiring judgments. Jeremy Howard recommends it. Andrew Ng recommends it. Why? It’s far better at discriminating talent than any other proxy (CVs don’t work; pedigree doesn’t work; puzzles don’t work).

And yet, the proportion of people who finish MOOCs AND have a portfolio project is tiny. What’s the catch?

Well, it could be that these are orthogonal skills. Finishing something as demanding as a deep learning MOOC (or a uni degree) shows that you have grit. You can start and finish difficult, demanding projects. But what hiring managers want to know is: “What did you do when nobody told you what to do?”, “Can you find ways to generate value (that we may have overlooked)?”, “Can you work with other people?”. Many people can prepare for, and pass, the ‘typical developer interview.’ That one with the whiteboard, and the typical questions that you never encounter in real life. However, what everyone, on both sides, hopes for is to meet someone who can work well beyond ‘the standard.’ What separates you from the rest is a substantial portfolio project. Not that many people can display the creativity and grit to get a killer portfolio project off the ground.

Because this post is about how to pick an ML project, you have to make a decision: you need to focus on machine learning (ML) or on data management to start your career. The second option is plenty lucrative and doesn’t require you to know much ML. But then you will be stuck with the dreaded ‘data cleaning’ (A data scientist spends 80% of her time cleaning data, and 20% of her time bitching about cleaning data). You may realize that after working extra-hard to become competent in machine learning, you are not doing all that much of it once you get a job.

There’s a trick to avoid too much data cleaning: go to a company that has a good team of data engineers. These are rare, though. So it’s safe to assume that in a ‘normal’ company you will have to do some data cleaning, even if you want to stay away from it. I have more to say about companies; the last bit of this post will cover how to choose which companies to apply for. But now let’s focus on the project.

How do you pick a good project? (Proof Of Concept)

Morten T. Hansen’s book ‘Great at Work’ shows that exceptional performers tend to ‘Do less, then obsess.’ He presents pretty convincing empirical evidence. So I’m going to recommend: Do one project, not many that are half-assed. If the hiring manager looks at your GitHub, there must be a clear ‘wow’ story there. Something she still remembers after an entire day looking at resumes.

You want to have something to show that is a Proof of Concept (POC). While you will not have the resources to build an entire product, your GitHub should have something that demonstrates that such a product is feasible. The interviewer must think, ‘ok, if he came to my company and did something like this, he would be adding serious value.’

Good problem

Picking the right problem is a great skill, and it takes years of practice to do it reliably. It’s a mixture of ‘Fermi-style’ estimations (‘how many piano tuners are there in a city?’), time management, and general technical know-how.

The good news is that in ML there’s lots of low-hanging fruit. There’s lots of code already, so many pretrained networks, and so few solved-with-ML problems! No matter how many papers per year we get (too many to track even in a reduced domain), there’s plenty of opportunity in ML right now. Applications of technology this complex have never been more accessible! If you want to practice this skill, tweet 2-3 project-worthy problems a day, and ask people for feedback. If you tweet them at me (@quesada), I’ll do my best to rate them with a boolean for plausibility (in case lots of people take up this suggestion!)

We data scientists have the luxury to pick problems. And you should practice that skill! Knowing what is possible today is the first step. Twitter is a great place to pick up what’s happening in ML. You must have heard of most of the breakthroughs, and have an internal ranking of methods and applications. For example, everybody seems to have watched David Ha’s ‘Everybody dances’ demo. If you didn’t, stop what you are doing and watch it now. Your immediate reaction should be: ‘how can I apply this in a way that generates business value?’

Over a year, at least a couple of dozen papers make a big splash in the scene. Your job is to keep a catalog and match it to business problems. Medium has several writers who are amazing at implementing the latest algos fast. If you want the most exhaustive collection of implementations, try To navigate arXiv, try (this is good for picking up trends; I don’t recommend making paper reading a priority if you want to be a practitioner.) About videos: has now videos for most talks. ‘Processing’ NeurIPS is a serious job, so it’s easier to read summaries from people soon after they attended. Remember, you are doing this to pick up “what’s possible,” not to become a learned scholar. It’s tempting to get lost in the rabbit hole of awesomeness.

Which problem you pick says a lot about your maturity as a data scientist and your creativity. You are demonstrating your taste and your business acumen. Passion projects are ok if your passion is not too obscure (for example, I wouldn’t work on detecting fake Cohiba cigars, no matter how passionate you are about those). If knowing the domain can give you an unfair advantage, by all means, use it. A good idea gets people saying ‘I would use that’ or ‘How can I help?’ If you don’t get these reactions, keep looking.

Deep learning (DL) has two good advantages over ‘classical’ machine learning: 1/ it can use unstructured data (images, audio, text) which is both flashier and more abundant than tabular data, and 2/ You can use pretrained networks! If there’s a pretrained network that does something remotely similar to what you want to do, you have just saved yourself a lot of time. Because of transfer learning and lots of pretrained networks, demonstrating value should be easy. Note that taking what you build to production would be problematic if you depend on GPUs, but this is beyond the scope of your proof of concept. Just be prepared to defend your choices in an interview.

If you don’t want to do DL, there’s another useful trick: pick a topic companies care about. For example, if you want to impress companies in the EU, you can do a POC that uses the company’s proprietary datasets while protecting privacy. Impossible, until recently: Andrew Trask is building a great framework to train models on data you cannot access at once. His book ‘Grokking Deep Learning’ (great stuff!) has an entire chapter on privacy, secure aggregation and homomorphic encryption. The keyword you want is ‘federated learning.’ Google is doing quite a bit of research about it because it could save the company from a frontal attack coming from regulators.

How long should you work on a ‘good problem’? At least 1.5 months. This length seems to work fine at the companies I work for. Anything shorter can look like a ‘weekend project’; you want to avoid that at all costs.


Rather than having one of the popular datasets, for example, ‘mammograms,’ or worse ‘Titanic,’ pick some original data and problem. Remember you are trying to impress a hiring manager that sees hundreds of CVs and many have the same old datasets. For example, MOOCs and bootcamps that ask everybody to do a project on the same topic produce reams of CVs that all look the same. Now imagine you are the hiring manager, and you have seen dozens in the last month; will you be impressed?

You are going to do all that alone (or in a small team of two). In a company, accepting a project is something that goes through a more substantial group:

  • Product defines the problem and creates a user story that describes the need.
  • The research team reads relevant articles and looks for similar problems which other data science teams have faced.
  • The data team collects relevant data, and someone (internal or external) needs to label it.
  • The team tries some algorithms and approaches and returns with some kind of baseline.

Because you are working on a project to showcase your skills, you would have to do this without a large team. But is it impossible? No. Here are some examples:

Malaria Microscope

Malaria kills about 400k people per year, mostly children. It’s curable, but detecting it is not trivial, and it happens in parts of the world where hospitals and doctors are not very accessible. Malaria parasites are quite big and a simple microscope can show them; the standard diagnostic method involves a doctor counting them.

It turns out you can attach a USB microscope to a mobile phone, and run DL code on the phone that counts parasites with accuracy comparable to a human. Eduardo Peire, a DSR alumnus, started with a public malaria dataset. While small, this was enough to demonstrate value to people, and his crowdfunding campaign got him enough funds to fly to the Amazon and collect more samples. This is a story for another day; you can follow their progress here:

Wheelchair helper

Information about pedestrian road surface type and roughness is essential for vehicles that need smoothness to operate. Especially for wheelchair users, scooters, or bicycle riders, road surface roughness significantly affects the comfort of the ride. A large number of street images are available on the web, but for pedestrian roads, they typically do not have roughness information or surface labels.

Masanori Kanazu and Dmitry Efimenko classify road surface type and predict roughness value using only a smartphone camera image. They mount a smartphone to a wheeled device and record video and accelerometer data simultaneously. From the accelerometer data, they calculate a roughness index and train a classifier to predict the roughness of the road and surface type.

Their system can be utilized to annotate crowdsourcing street maps like OpenStreetMap with roughness values and surface types. Also, their approach demonstrates that one can map the road surface with an off-the-shelf device such as a smartphone camera and accelerometer.

These two projects are original. They demonstrate what a small team can do with modern DL and … well, cheap sensors that already come with a phone or you can buy online. If I were a hiring manager, these projects would make me pay attention.


A project can be original, but useless. Example: translating Klingon to Dothraki (languages from TV series that don’t exist in reality). You may laugh, but after the more than 165 projects I’ve mentored, I’ve seen people generating ideas that fail to meet the relevance criterion.

How do you know what is relevant?

If you are part of an industry, you should know of pain points. What is something trivial that looks automatable? What is something everyone in your sector loves to complain about?

Imagine you are not part of any industry. If you have a hobby, you can do a portfolio project on it IF:
1/ It’s damn interesting and memorable even if way outside the range of interest of the industry. Example: “recommender system for draft beer.” Many people will pay attention to beer.

2/ It showcases your skills: You found or created an exceptionally good dataset. You integrated multiple data sources. You made something that on first thought, most people wouldn’t believe it’s possible. You showed serious technical chops.

Good data

You may complain that there’s not that much data available out there, or that the massive, nice datasets are all inside companies.

The good news is that you don’t need big data to deliver value with ML! No matter how much hype there’s around the importance of large datasets. Take it from Andrew Ng:

Ok, so where do you get data?
Generate it yourself (see the previous examples; they both generated their own data) or beg anyone who has data to contribute it (hospitals that deal with infectious diseases, for the Malaria microscope).

Some hiring managers may not care much about how you fitted which model; they have seen it before. But if the dataset you generated or found is new to them, that may impress them.

Clean code

Have a clean project on GitHub, something the interviewer can run on his computer. If you are lucky enough to be dealing with an interviewer who checks code, the last thing you want is to disappoint her with something that is hard to run or doesn’t run at all.

Most people have ‘academic quality’ code on their GitHubs, i.e., something that works just enough to get you a plot. It’s damn hard to replicate, even putting in some effort (something a hiring manager will not do).

If you have clean code, it will feel refreshing. A good readme, appropriate comments, modules in python, good commit messages, a requirements.txt file, and … a simple way to replicate your results, such as ‘python’. That’s all it takes to be above 90% of the code most people learning data science have on their GitHub. If you can defend your library or model choices in an interview, that’s moving you up the ranking too.

Do not obsess with performance

I call this the “Kaggle Mirage.” You have seen Kagglers making real money and raking in the prestige just by moving a decimal point. Then you may think “I need to get to 90% accuracy because my manager (not a data scientist) set it as a goal.” Don’t let that happen to you without asking some important questions. What would happen if you only got to 85%? Are there nonalgorithmic ways to compensate for the missing 5%? What is the baseline? What is the algorithm? What is the current accuracy? Is 90% a reasonable goal? Is it worth the effort if it’s significantly harder than getting to 85%?

In all but a few real projects, moving performance a decimal point may not bring that much business value.
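One way to make the ‘What is the baseline?’ question concrete is to compare your model against always predicting the most common class. Here is a minimal sketch; the labels and predictions are made up for illustration, and the function names are my own.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy you would get by always predicting the most frequent class."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)


# Hypothetical labels and model predictions (illustration only)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

baseline = majority_baseline(y_true)
model_acc = accuracy(y_true, y_pred)
print(f"baseline={baseline:.2f} model={model_acc:.2f}")
```

In this toy case the model is no better than the trivial baseline, which is exactly the kind of thing to find out before anyone sets a 90% target.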

Now, if you have been working on a problem for a week and performance didn’t change… is it time to stop? Maybe your peers on a standup notice the pattern and are encouraging you to drop that project and join another one (coincidentally, theirs). Expectation management for data scientists would be a great post.

In data science, progress is not linear. That is, you could spend days working on something that doesn’t move the needle, and then, one day, ‘boom,’ you have gained 5 points in accuracy. Companies not used to working with data science will have a hard time understanding that.

I recommend not doing Kaggle (or not only Kaggle!) if your goal is to build a portfolio project to get hired.

Deciding whether to apply and preparing for an interview

Even if you do have a great project to show off, you still have to (1) find good companies to apply to, and (2) prepare. I estimate about 10% of companies are a slam dunk: They have a team that writes blog posts, they are using reasonable algos for their problems, they have a good team already, and AI is a core component of their product. Startups with funding and big N companies land in this box. For the remaining 90% of companies, it’s far harder to estimate whether they are a good match for someone building an AI career. I have quite a few heuristics, but this is worth a different post.

What doesn’t discriminate between excellent and bad AI-career-building companies? The job ad.

It is difficult to write up a job description for a data scientist role; more so if there isn’t already a DS team in the company to write it! Even when there is one, HR could have mangled the job ad by adding criteria.

As a candidate, you can use this rule of thumb: ignore requirements and apply. If you only have 50% of the bullet points, that might be better than most other applicants. Knowing this would add some companies to your list, ones for which you would otherwise have disqualified yourself.

When companies don’t have an easy way to tell if a candidate is any good, they revert to ‘correlates’: pedigree, paper certifications, titles. Terrible choice, because so many great data scientists today are self-taught. Those companies asking for a Ph.D. or Masters are missing out on great talent that cut their teeth on and Kaggle.

Before you start carpet bombing LinkedIn, … Have you prepared for interviews? Because it’s not a walk in the park! Don’t try to go interviewing without preparing; your competitors will.

The standard software developer interview in the Valley takes about three months of preparation. If you don’t believe me, go to this Reddit. That goes equally for fresh CS graduates (who should still remember all those ‘algorithms and data structures’ classes) and for more senior people (who have all the rights in the world to have forgotten them).

In less competitive places, it may not be that important to know all these CS tricks. In fact, preparing for a DS and CS interview at the same time may be overkill. Some positions (machine learning engineer) may require you to be a senior engineer on top of knowing machine learning.

For data science positions today, you don’t necessarily need to prepare ‘algorithms and data structures’ questions. But the trend is that companies ask for more and more engineering from their data scientists. You can know in advance by checking the profiles of the company’s data scientists on LinkedIn. Are they very engineer-y? Then get the book ‘Cracking the Coding Interview,’ and do a few exercises.


Work on one project. Spend 1.5 months on it. Make sure it’s impressive to a hiring manager who sees hundreds of CVs for each position. Make sure you pick a ‘good problem’, one that is original, relevant, and uses good data.

Make sure you produce clean code, pick decent companies before you mass-apply, and prepare for interviews.

HandySpark: bringing pandas-like capabilities to Spark DataFrames

Originally posted on Towards Data Science.


HandySpark is a new Python package designed to improve PySpark user experience, especially when it comes to exploratory data analysis, including visualization capabilities.

Try it yourself using Google Colab:

Check the repository:


Apache Spark is the most popular cluster computing framework. It is listed as a required skill by about 30% of job listings.

The majority of Data Scientists use Python and Pandas, the de facto standard for manipulating data. Therefore, it is only logical that they will want to use PySpark, the Spark Python API, and, of course, Spark DataFrames.

But, the transition from Pandas to Spark DataFrames may not be as smooth as one could hope…


I’ve been teaching Applied Machine Learning using Apache Spark at Data Science Retreat to more than 100 students over the course of 2 years.

My students were quite often puzzled with some of the quirks of PySpark and, some other times, baffled by the lack of some functionalities Data Scientists take for granted while using the traditional Pandas/Scikit-Learn combo.

I decided to address these problems by developing a Python package that would make exploratory data analysis much easier in PySpark.

Introducing HandySpark

HandySpark is really easy to install and to integrate into your PySpark workflow. It takes only 3 steps to make your DataFrame a HandyFrame:

  1. Install HandySpark using pip install handyspark
  2. Import HandySpark with from handyspark import *
  3. Make your DataFrame a HandyFrame with hdf = df.toHandy()

After importing HandySpark, the method toHandy is added to Spark’s DataFrame as an extension, so you’re able to call it straight away.

Let’s take a quick look at everything you can do with HandySpark 🙂

1. Fetching Data

No more cumbersome column selection, collection and manual extraction from Row objects!

Now you can fetch data just like you do in Pandas, using cols:


0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Much, much easier, right? The result is a Pandas Series!

Just keep in mind that, due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given HandyFrame — so, no, you still cannot do things like [3:5] or [-1] and so on… only [:N].

There are also other pandas-like methods available:

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

If you haven’t guessed yet, the examples above (and all others in this post) are built using the famous Titanic dataset 🙂

2. Plotting Data

The lack of an easy way of visualizing data always puzzled my students. And, when one searches the web for examples of plotting data using PySpark, it is even worse: many, many tutorials simply convert the WHOLE dataset to Pandas and then plot it the traditional way.

Please, DON’T EVER DO IT! It will surely work with toy datasets, but it would fail miserably if used with a really big dataset (the ones you likely handle if you’re using Spark).

HandySpark addresses this problem by properly computing statistics using Spark’s distributed computing capabilities and only then turning the results into plots. It then becomes as easy as this:

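import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(12, 4))
# Binned scatterplot of Fare vs. Age, drawn on the last of the four axes
hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])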

Plotting with HandySpark!

Yes, there is even a scatterplot! How is that possible?! HandySpark splits both features into 30 bins each, computes frequencies for each and every one of the 900 combinations and plots circles which are sized accordingly.
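To make the binning idea concrete, here is a minimal sketch in plain Python. It is not HandySpark’s actual implementation (there the counting happens in Spark, and only the cell counts reach the driver). The helper names and the sample values are made up for illustration.

```python
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to a bin index in 0..n_bins-1."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value falls into the last bin


def binned_counts(xs, ys, n_bins=30):
    """Count points per (x_bin, y_bin) cell; at most n_bins * n_bins cells."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    return Counter(
        (bin_index(x, x_lo, x_hi, n_bins), bin_index(y, y_lo, y_hi, n_bins))
        for x, y in zip(xs, ys)
    )


# Made-up (Fare, Age) pairs, for illustration only
fares = [7.25, 71.28, 7.92, 53.10, 8.05, 8.05]
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 35.0]
counts = binned_counts(fares, ages)
print(counts)
```

Each key is one of the 30 x 30 cells and each count would set the size of one circle, so the plot needs at most 900 numbers no matter how many rows the dataset has.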

3. Stratify

What if you want to perform stratified operations, using a split-apply-combine approach? The first idea that may come to mind is to use a groupby operation… but groupby operations trigger the dreaded data shuffling in Spark, so they should be avoided.

HandySpark handles this issue by filtering rows accordingly, performing computations on each subset of the data and then combining the results. For instance:

Pclass  Embarked
1       C            85
        Q             2
        S           127
2       C            17
        Q             3
        S           164
3       C            66
        Q            72
        S           353
Name: value_counts, dtype: int64
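The filter, compute and combine idea can be sketched in plain Python as follows. This is only an illustration of the approach described above, not HandySpark’s code; the function name and the five sample rows are made up.

```python
from collections import Counter


def stratified_value_counts(rows, strata_col, target_col):
    """Filter rows per stratum, count values in each subset, combine the results."""
    strata = sorted({row[strata_col] for row in rows})
    combined = {}
    for stratum in strata:
        subset = [row for row in rows if row[strata_col] == stratum]  # filter
        counts = Counter(row[target_col] for row in subset)           # compute
        for value, n in sorted(counts.items()):                       # combine
            combined[(stratum, value)] = n
    return combined


# Made-up rows, for illustration only
rows = [
    {"Pclass": 1, "Embarked": "C"},
    {"Pclass": 1, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "S"},
    {"Pclass": 3, "Embarked": "Q"},
]
result = stratified_value_counts(rows, "Pclass", "Embarked")
print(result)
```

The trade-off is one pass over the data per stratum, which is reasonable when the number of strata is small.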

You can also stratify by non-categorical columns by using either Bucket or Quantile objects, and then use the result in a stratified plot:

hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist()

Stratified histogram

4. Imputing Missing Values

“Thou shall impute missing values”

First things first, though. How many missing values are there?

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
Name: missing(ratio), dtype: float64
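The ratio is simply the number of missing values divided by the number of rows. Here is a small sketch in plain Python, assuming missing values are represented as None; the function name and the sample rows are made up.

```python
def missing_ratio(rows, columns):
    """Fraction of rows in which each column is missing (None)."""
    n_rows = len(rows)
    return {col: sum(row[col] is None for row in rows) / n_rows for col in columns}


# Made-up rows, for illustration only
rows = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": 26.0, "Cabin": None, "Embarked": "S"},
    {"Age": 35.0, "Cabin": None, "Embarked": None},
]
ratios = missing_ratio(rows, ["Age", "Cabin", "Embarked"])
print(ratios)

# The Embarked counts printed earlier (644 + 168 + 77, plus 2 missing)
# give the ratio reported for that column.
embarked_ratio = 2 / (644 + 168 + 77 + 2)
print(round(embarked_ratio, 6))
```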

OK, now we know there are 3 columns with missing values. Let’s drop Cabin (after all, 77% of its values are missing) and focus on the imputation of values for the other two columns: Age and Embarked.

The imputation of missing values could not be integrated into a Spark pipeline until version 2.2.0, when the Imputer transformer was released. But it still does not handle categorical variables (like Embarked), let alone stratified imputation…

Let’s see how HandySpark can help us with this task:

hdf_filled = hdf.fill(categorical=['Embarked'])
hdf_filled = (hdf_filled.stratify(['Pclass', 'Sex'])
              .fill(continuous=['Age'], strategy=['mean']))

First, it uses the most common value to fill missing values of our categorical column. Then, it stratifies the dataset according to Pclass and Sex to compute the mean value for Age, which is going to be used in the imputation.

Which values did it use for the imputation?

{'Embarked': 'S',
 'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
 'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
 'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
 'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
 'Pclass == "3" and Sex == "female"': {'Age': 21.75},
 'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}
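For intuition, the two imputation steps can be sketched in plain Python. This is a simplified illustration rather than what HandySpark does internally; the helper names and the five sample rows are made up, and missing values are assumed to be None.

```python
from collections import Counter, defaultdict
from statistics import mean


def fill_categorical(rows, col):
    """Fill missing values of `col` with its most common value."""
    mode = Counter(r[col] for r in rows if r[col] is not None).most_common(1)[0][0]
    return [{**r, col: mode if r[col] is None else r[col]} for r in rows], mode


def fill_stratified_mean(rows, col, strata_cols):
    """Fill missing values of `col` with the mean of the row's stratum.

    Assumes every stratum has at least one non-missing value.
    """
    groups = defaultdict(list)
    for r in rows:
        if r[col] is not None:
            groups[tuple(r[c] for c in strata_cols)].append(r[col])
    means = {key: mean(values) for key, values in groups.items()}
    filled = []
    for r in rows:
        key = tuple(r[c] for c in strata_cols)
        filled.append({**r, col: means[key] if r[col] is None else r[col]})
    return filled, means


# Made-up rows, for illustration only
rows = [
    {"Pclass": 1, "Sex": "female", "Age": 30.0, "Embarked": "S"},
    {"Pclass": 1, "Sex": "female", "Age": 40.0, "Embarked": "C"},
    {"Pclass": 1, "Sex": "female", "Age": None, "Embarked": "S"},
    {"Pclass": 3, "Sex": "male", "Age": 20.0, "Embarked": None},
    {"Pclass": 3, "Sex": "male", "Age": None, "Embarked": "S"},
]
rows, embarked_mode = fill_categorical(rows, "Embarked")
rows, age_means = fill_stratified_mean(rows, "Age", ["Pclass", "Sex"])
print(embarked_mode, age_means)
```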

So far, so good! Time to integrate it into a Spark pipeline, generating a custom transformer with transformers:

imputer = hdf_filled.transformers.imputer()

The imputer object is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your pipeline and save / load at will 🙂

5. Detecting Outliers

“You shall not pass!”

How many outliers should we not allow to pass?

hdf_filled.outliers(method='tukey', k=3.)
PassengerId      0.0
Survived         0.0
Pclass           0.0
Age              1.0
SibSp           12.0
Parch          213.0
Fare            53.0
dtype: float64

Currently, only Tukey’s method is available (I am working on Mahalanobis distance!). This method takes an optional k argument, which you can set to larger values (like 3) to allow for a looser detection.

Take the Fare column, for instance. There are, according to Tukey’s method, 53 outliers. Let’s fence them!

hdf_fenced = hdf_filled.fence(['Fare'])

What are the lower and upper fence values?

{'Fare': [-26.7605, 65.6563]}

Remember that, if you want to, you can also perform a stratified fencing 🙂
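If you have never seen Tukey’s method spelled out, here is a minimal sketch in plain Python. It assumes that “fencing” means clipping values to the fences rather than dropping rows, and it uses exact quartiles from the standard library, so the numbers will not match what Spark computes on the real data. The function names and sample fares are made up.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def fence(values, k=1.5):
    """Clip every value to the Tukey fences instead of dropping it."""
    lower, upper = tukey_fences(values, k)
    return [min(max(v, lower), upper) for v in values]


# Made-up fares, for illustration only
fares = [7.25, 7.92, 8.05, 8.46, 13.0, 21.08, 30.07, 51.86, 263.0]
lower, upper = tukey_fences(fares)
fenced = fence(fares)
print(lower, upper)
print(fenced)
```

A larger k widens the fences, so fewer values count as outliers.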

As you’ve probably guessed already, you can also integrate this step into your pipeline, generating the corresponding transformer:

fencer = hdf_fenced.transformers.fencer()

6. Pandas Functions

In Spark 2.3, Pandas UDFs were released! This turned out to be a major improvement for us, PySpark users, as we could finally overcome the performance bottleneck imposed by traditional User Defined Functions (UDFs). Awesome!

HandySpark takes it one step further, by doing all the heavy lifting for you 🙂 You only need to use its pandas object and voilà — lots of functions from Pandas are immediately available!

For instance, let’s use isin as you’d use with a regular Pandas Series:

some_ports = hdf_fenced.pandas['Embarked'].isin(values=['C', 'Q'])
Column<b'udf(Embarked) AS `<lambda>(Embarked,)`'>

But, remember Spark has lazy evaluation, so the result is a column expression which leverages the power of Pandas UDFs. The only thing left to do is to actually assign the results to a new column, right?

hdf_fenced = hdf_fenced.assign(is_c_or_q=some_ports)
# What's in there?
0     True
1    False
2    False
3     True
4     True
Name: is_c_or_q, dtype: bool

You got that right! HandyFrame has a very convenient assign method, just like in Pandas!

And this is not all! Both specialized str and dt objects from Pandas are available as well! For instance, what if you want to find if a given string contains another substring?

col_mrs = hdf_fenced.pandas['Name'].str.find(sub='Mrs.')
hdf_fenced = hdf_fenced.assign(is_mrs=col_mrs > 0)

For a complete list of all supported functions, please check the repository.

7. Your Own UDFs

The sky is the limit! You can create regular Python functions and use assign to create new columns 🙂 And they will be turned into Pandas UDFs for you!

The arguments of your function (or lambda) should have the names of the columns you want to use. For instance, to take the log of Fare:

import numpy as np
hdf_fenced = hdf_fenced.assign(logFare=lambda Fare: np.log(Fare + 1))

You can also use functions that take multiple columns as arguments. Keep in mind that the default return type, that is, the data type of the new column, will be the same as the first column used (Fare, in the example).
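How can a library know which columns to pass to your function? One plausible mechanism is to read the argument names from the function’s signature. The sketch below shows that idea on a list of dicts; it is an assumption about how such a feature could work, not HandySpark’s implementation, and it runs row by row instead of on Pandas Series. The standalone assign function and the sample rows are made up.

```python
import inspect
import math


def assign(rows, **new_columns):
    """Add columns computed by functions whose argument names are column names."""
    out = [dict(row) for row in rows]
    for new_name, func in new_columns.items():
        # The parameter names tell us which columns the function needs
        arg_names = list(inspect.signature(func).parameters)
        for row in out:
            row[new_name] = func(*(row[name] for name in arg_names))
    return out


# Made-up rows, for illustration only
rows = [
    {"Fare": 7.25, "SibSp": 1, "Parch": 0},
    {"Fare": 71.28, "SibSp": 1, "Parch": 2},
]
result = assign(
    rows,
    logFare=lambda Fare: math.log(Fare + 1),
    family_size=lambda SibSp, Parch: SibSp + Parch + 1,
)
print(result)
```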

It is also possible to specify different return types — please check the repository for examples on that.

8. Nicer Exceptions

Spark exceptions are loooong… whenever something breaks, the error bubbles up through a seemingly infinite number of layers!

I always advise my students to scroll all the way down and work their way up trying to figure out the source of the problem… but, not anymore!

HandySpark will parse the error and show you a nice and bold red summary at the very top 🙂 It may not be perfect, but it will surely help!

Handy Exception

9. Safety First

Some dataframe operations, like collect or toPandas will trigger the retrieval of ALL rows of the dataframe!

To prevent the undesirable side effects of these actions, HandySpark implements a safety mechanism! It will automatically limit the output to 1,000 rows:

Safety mechanism in action!

Of course, you can specify a different limit using set_safety_limit or throw caution to the wind and tell your HandyFrame to ignore the safety using safety_off. Turning the safety mechanism off is good for a single action, though, as it will kick back in after returning the requested unlimited result.
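The behavior described above can be mimicked with a toy class. The names set_safety_limit and safety_off come from this post; the SafeRows class and everything inside it are made up to illustrate the “one unlimited action, then back to safe” logic, and say nothing about how HandySpark implements it.

```python
class SafeRows:
    """Toy container that caps how many rows `collect` returns."""

    def __init__(self, rows, safety_limit=1000):
        self._rows = list(rows)
        self._limit = safety_limit
        self._safety_on = True

    def set_safety_limit(self, limit):
        self._limit = limit

    def safety_off(self):
        """Disable the cap for the next action only; returns self for chaining."""
        self._safety_on = False
        return self

    def collect(self):
        if self._safety_on:
            return self._rows[: self._limit]
        self._safety_on = True  # the safety kicks back in after one unlimited result
        return list(self._rows)


data = SafeRows(range(5000))
limited = data.collect()
unlimited = data.safety_off().collect()
limited_again = data.collect()
print(len(limited), len(unlimited), len(limited_again))
```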

Final Thoughts

My goal is to improve PySpark user experience and allow for a smoother transition from Pandas to Spark DataFrames, making it easier to perform exploratory data analysis and visualize the data. Needless to say, this is a work in progress, and I have many more improvements already planned.

If you are a Data Scientist using PySpark, I hope you give HandySpark a try and let me know your thoughts on it 🙂

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Understanding a Machine Learning workflow through food

Originally posted on Towards Data Science.

Through food?!

Yes, you got that right, through food! 🙂

Imagine yourself ordering a pizza and, after a short while, getting that nice, warm and delicious pizza delivered to your home.

Have you ever wondered about the workflow behind getting such a pizza delivered to your home? I mean, the full workflow, from the sowing of tomato seeds to the bike rider buzzing at your door! It turns out, it is not so different from a Machine Learning workflow.

Really! Let’s check it out!

This post draws inspiration from a talk given by Cassie Kozyrkov, Chief Decision Scientist at Google, at the Data Natives Conference in Berlin.

1. Sowing

The farmer sows the seeds that will grow to become some of the ingredients to our pizza, like the tomatoes.

This is equivalent to the data generating process, be it a user action, be it movement, heat or noise triggering a sensor, for instance.

2. Harvesting

Then comes the time for the harvest, that is, when the vegetables or fruits are ripe.

This is equivalent to the data collection, meaning the browser or sensor will translate the user action or the event that triggered the sensor into actual data.

3. Transporting

After the harvest, the products must be transported to their destination to be used as ingredients in our pizza.

This is equivalent to ingesting the data into a repository from which it is going to be fetched later, like a database or data lake.

4. Choosing Appliances and Utensils

For every ingredient, there is the most appropriate utensil for handling it. If you need to slice, use a knife. If you need to stir, a spoon. The same reasoning is valid for the appliances: if you need to bake, use an oven. If you need to fry, a stove. You can also use a more sophisticated appliance like a microwave, with many, many more available options for setting it up.

Sometimes, it is even better to use a simpler appliance — have you ever seen a restaurant advertise “microwaved pizzas”?! I haven’t!

In Machine Learning, utensils are techniques for preprocessing the data, while the appliances are the algorithms, like a Linear Regression or a Random Forest. You can also use a microwave, I mean, Deep Learning. The different options available are the hyper-parameters. There are only a few in simple appliances, I mean, algorithms. But there are many, many more in a sophisticated one. Besides, there is no guarantee a sophisticated algorithm will deliver a better performance (or do you like microwaved pizzas better?!). So, choose your algorithms wisely.

Photo by S O C I A L . C U T on Unsplash

5. Choosing a Recipe

It is not enough to have ingredients and appliances. You also need a recipe, which has all the steps you need to follow to prepare your dish.

This is your model. And no, your model is not the same as your algorithm. The model includes all the pre- and post-processing required by your algorithm. And, speaking of pre-processing…

Photo by Caroline Attwood on Unsplash

6. Preparing the Ingredients

I bet you the first instructions in most recipes are like: “slice this”, “peel that” and so on. They don’t tell you to wash the vegetables, because that’s a given — no one wants to eat dirty vegetables, right?

Well, the same holds true for data. No one wants dirty data. You have to clean it, that is, handle missing values and outliers. And then you have to peel it and slice it, I mean, pre-process it, like encoding categorical variables (male or female, for instance) into numeric ones (0 or 1).
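To make the washing and slicing concrete, here is a minimal sketch in plain Python. The records, the field names and the choice of filling missing values with the median are all made up for illustration, and outlier handling is left out; it is not meant as a recipe for real data.

```python
from statistics import median

def clean_and_encode(rows):
    """Fill missing ages with the median and encode a two-valued category as 0/1."""
    known_ages = [row["age"] for row in rows if row["age"] is not None]
    fill_value = median(known_ages)          # assumption: median imputation
    mapping = {"male": 0, "female": 1}       # arbitrary but consistent encoding
    return [
        {
            "age": row["age"] if row["age"] is not None else fill_value,
            "sex": mapping[row["sex"]],
        }
        for row in rows
    ]

# Hypothetical "dirty" records: one age is missing
rows = [
    {"age": 34, "sex": "female"},
    {"age": None, "sex": "male"},
    {"age": 52, "sex": "male"},
]
cleaned = clean_and_encode(rows)
print(cleaned)
```

In a real project you would more likely reach for a data-frame library, but the two steps stay the same: fix what is missing first, then turn categories into numbers.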

No one likes that part. Neither the data scientists nor the cooks (I guess).

Photo by Bonnie Kittle on Unsplash

7. Special Preparations

Sometimes you can get creative with your ingredients to achieve either a better taste or a more sophisticated presentation.

You can dry-age a steak for a different flavor or carve a carrot to look like a rose and place it on top of your dish 🙂

This is feature engineering! It is an important step that may substantially improve the performance of your model, if done in a clever way.

Pretty much every data scientist enjoys that part. I guess the cooks like it too.

Photo by Clem Onojeghuo on Unsplash

8. Cooking

The fundamental step — without actually cooking, there is no dish. Obviously. You put the prepared ingredients into the appliance, adjust the heat and wait a while before checking it again.

This is the training of your model. You feed the data to your algorithm, adjust its hyper-parameters and wait a while before checking it again.

Photo by Icons8 team on Unsplash

9. Tasting

Even if you follow a recipe to the letter, you cannot guarantee everything is exactly right. So, how do you know if you got it right? You taste it! If it is not good, you may add more salt to try and fix it. You may also change the temperature. But you keep on cooking!

Unfortunately, sometimes your pizza is going to burn, or taste horrible no matter what you do to try to salvage it. You throw it in the garbage, learn from your mistakes and start over.

Hopefully, persistence and a bit of luck will produce a delicious pizza 🙂

Tasting is evaluating. You need to evaluate your model to check if it is doing alright. If not, you may need to add more features. You may also change a hyper-parameter. But you keep on training!

Unfortunately, sometimes your model is not going to converge to a solution, or make horrible predictions no matter what you do to try to salvage it. You discard your model, learn from your mistakes and start over.

Hopefully, persistence and a bit of luck will result in a high-performing model 🙂

Photo by Kai Pilger on Unsplash

10. Delivering

From the point of view of the cook, his/her work is done. He/she cooked a delicious pizza. Period.

But if the pizza does not get delivered nicely and in time to the customer, the pizzeria is going out of business and the cook is losing his/her job.

After the pizza is cooked, it must be promptly packaged to keep it warm and carefully handled so it does not look all squishy when it reaches the hungry customer. If the bike rider doesn’t reach his/her destination, loses the pizza along the way or shakes it beyond recognition, all the cooking effort is good for nothing.

Delivering is deployment. Not pizzas, but predictions. Predictions, like pizzas, must be packaged, not in boxes, but as data products, so they can be delivered to the eager customers. If the pipeline fails, breaks along the way or modifies the predictions in any way, all model training and evaluation is good for nothing.

That’s it! Machine Learning is like cooking food — there are several people involved in the process and it takes a lot of effort, but the final result can be delicious!

Just a few takeaways:

  • if ingredients are bad, the dish is going to be bad: no recipe can fix that and, certainly, no appliance either;
  • if you are a cook, never forget that, without delivering, there is no point in cooking, as no one will ever taste your delicious food;
  • if you are a restaurant owner, don’t try to impose appliances on your cook (sometimes microwaves are not the best choice), and remember that you’ll get a very unhappy cook if he/she spends all his/her time washing and slicing ingredients.

I don’t know about you, but I feel like ordering a pizza now! 🙂

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Understanding binary cross-entropy / log loss: a visual explanation

Photo by G. Crescoli on Unsplash

Originally posted on Towards Data Science.


If you are training a binary classifier, chances are you are using binary cross-entropy / log loss as your loss function.

Have you ever thought about what exactly it means to use this loss function? The thing is, given the ease of use of today’s libraries and frameworks, it is very easy to overlook the true meaning of the loss function used.


I was looking for a blog post that would explain the concepts behind binary cross-entropy / log loss in a visually clear and concise manner, so I could show it to my students at Data Science Retreat. Since I could not find any that would fit my purpose, I took the task of writing it myself 🙂

A Simple Classification Problem

Let’s start with 10 random points:

x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]

This is our only feature: x.

Figure 0: the feature

Now, let’s assign some colors to our points: red and green. These are our labels.

Figure 1: the data

So, our classification problem is quite straightforward: given our feature x, we need to predict its label: red or green.

Since this is a binary classification, we can also pose this problem as: “is the point green” or, even better, “what is the probability of the point being green”? Ideally, green points would have a probability of 1.0 (of being green), while red points would have a probability of 0.0 (of being green).

In this setting, green points belong to the positive class (YES, they are green), while red points belong to the negative class (NO, they are not green).

If we fit a model to perform this classification, it will predict a probability of being green for each one of our points. Given what we know about the color of the points, how can we evaluate how good (or bad) the predicted probabilities are? This is the whole purpose of the loss function! It should return high values for bad predictions and low values for good predictions.

For a binary classification like our example, the typical loss function is the binary cross-entropy / log loss.

Loss Function: Binary Cross-Entropy / Log Loss

If you look this loss function up, this is what you’ll find:
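In LaTeX notation, and assuming the common convention of indexing the N points with i and writing the loss as H_p(q), it reads:

```latex
H_p(q) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \cdot \log\big(p(y_i)\big) + (1 - y_i) \cdot \log\big(1 - p(y_i)\big) \Big]
```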

Binary Cross-Entropy / Log Loss

where y is the label (1 for green points and 0 for red points) and p(y) is the predicted probability of the point being green for all N points.

Reading this formula, you can see that, for each green point (y=1), it adds log(p(y)) to the loss, that is, the log probability of it being green. Conversely, it adds log(1-p(y)), that is, the log probability of it being red, for each red point (y=0). Not necessarily difficult, sure, but not so intuitive either…

Besides, what does entropy have to do with all this? Why are we taking log of probabilities in the first place? These are valid questions and I hope to answer them in the “Show me the math” section below.
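The same reading can be written as a few lines of plain Python. This is only a sketch: the labels and probabilities are hypothetical, natural logarithms are assumed, and the small clipping constant is an addition of mine to avoid taking log(0).

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-15):
    """Mean negative log probability of the true class.

    labels: 1 for green points, 0 for red points.
    probs:  predicted probability of each point being green.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0 (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

labels = [0, 1, 1]          # hypothetical: one red point, two green points
probs = [0.2, 0.9, 0.6]     # hypothetical predicted probabilities of being green
loss = binary_cross_entropy(labels, probs)
print(f"loss = {loss:.4f}")
```

Each green point contributes log(p), each red point contributes log(1-p), and the minus sign in front of the mean turns the result into a positive number.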

But, before going into more formulas, let me show you a visual representation of the formula above…

Computing the Loss — the visual way

First, let’s split the points according to their classes, positive or negative, like the figure below:

Figure 2: splitting the data!

Now, let’s train a Logistic Regression to classify our points. The fitted regression is a sigmoid curve representing the probability of a point being green for any given x. It looks like this:

Figure 3: fitting a Logistic Regression

Then, for all points belonging to the positive class (green), what are the predicted probabilities given by our classifier? These are the green bars under the sigmoid curve, at the x coordinates corresponding to the points.

Figure 4: probabilities of classifying points in the POSITIVE class correctly

OK, so far, so good! What about the points in the negative class? Remember, the green bars under the sigmoid curve represent the probability of a given point being green. So, what is the probability of a given point being red? The red bars ABOVE the sigmoid curve, of course 🙂

Figure 5: probabilities of classifying points in the NEGATIVE class correctly

Putting it all together, we end up with something like this:

Figure 6: all probabilities put together!

The bars represent the predicted probabilities associated with the corresponding true class of each point!

OK, we have the predicted probabilities… time to evaluate them by computing the binary cross-entropy / log loss!

These probabilities are all we need, so, let’s get rid of the x axis and bring the bars next to each other:

Figure 7: probabilities of all points

Well, the hanging bars don’t make much sense anymore, so let’s reposition them:

Figure 8: probabilities of all points — much better 🙂

Since we’re trying to compute a loss, we need to penalize bad predictions, right? If the probability associated with the true class is 1.0, we need its loss to be zero. Conversely, if that probability is low, say, 0.01, we need its loss to be HUGE!

It turns out, taking the (negative) log of the probability suits us well enough for this purpose (since the log of values between 0.0 and 1.0 is negative, we take the negative log to obtain a positive value for the loss).

Actually, the reason we use log for this comes from the definition of cross-entropy; please check the “Show me the math” section below for more details.

The plot below gives us a clear picture: as the predicted probability of the true class gets closer to zero, the loss grows without bound, and it does so ever more steeply:

Figure 9: Log Loss for different probabilities
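A few values make the shape of that curve concrete. This is just a quick sketch with natural logarithms and hand-picked probabilities (the helper name is made up for illustration):

```python
import math

def point_loss(prob_true_class):
    """Loss of one point: negative log of the probability given to its true class."""
    return -math.log(prob_true_class)

for prob in (1.0, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {prob:<4} -> loss = {point_loss(prob):.3f}")
```

A perfect prediction costs nothing, while every step closer to zero costs more than the previous one.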

Fair enough! Let’s take the (negative) log of the probabilities — these are the corresponding losses of each and every point.

Finally, we compute the mean of all these losses.

Figure 10: finally, the loss!

Voilà! We have successfully computed the binary cross-entropy / log loss of this toy example. It is 0.3329!

Show me the code

If you want to double check the value we found, just run the code below and see for yourself 🙂

[gist id=”c8cf1c6bbbb4422d082bfd77074bb257″]

Show me the math (really?!)

Jokes aside, this post is not intended to be very mathematically inclined… but for those of you, my readers, looking to understand the role of entropy and logarithms in all this, here we go 🙂

If you want to go deeper into information theory, including all these concepts — entropy, cross-entropy and much, much more — check Chris Olah’s post out, it is incredibly detailed!


Let’s start with the distribution of our points. Since y represents the classes of our points (we have 3 red points and 7 green points), this is what its distribution, let’s call it q(y), looks like:

Figure 11: q(y), the distribution of our points


Entropy is a measure of the uncertainty associated with a given distribution q(y).

What if all our points were green? What would be the uncertainty of that distribution? ZERO, right? After all, there would be no doubt about the color of a point: it is always green! So, entropy is zero!

On the other hand, what if we knew exactly half of the points were green and the other half, red? That’s the worst case scenario, right? We would have absolutely no edge on guessing the color of a point: it is totally random! For that case, entropy is given by the formula below (we have two classes (colors)— red or green — hence, 2):

H(q) = log(2)

Entropy for a half-half distribution

For every other case in between, we can compute the entropy of a distribution, like our q(y), using the formula below, where C is the number of classes:
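That formula is the negative sum, over the C classes, of q(y) times log(q(y)). A minimal Python sketch of it, assuming natural logarithms (the function name is only illustrative), which also reproduces the two extreme cases discussed above:

```python
import math

def entropy(q):
    """Entropy of a discrete distribution given as a list of class probabilities."""
    # Classes with zero probability contribute nothing, so they are skipped
    return -sum(qc * math.log(qc) for qc in q if qc > 0)

all_green = entropy([0.0, 1.0])     # no uncertainty at all
half_half = entropy([0.5, 0.5])     # worst case for two classes
our_points = entropy([0.3, 0.7])    # 3 red and 7 green points out of 10

print(all_green, half_half, our_points)
```

The entropy of our q(y) lands between the two extremes, as expected.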


So, if we know the true distribution of a random variable, we can compute its entropy. But, if that’s the case, why bother training a classifier in the first place? After all, we KNOW the true distribution…

But, what if we DON’T? Can we try to approximate the true distribution with some other distribution, say, p(y)? Sure we can! 🙂


Let’s assume our points follow this other distribution p(y). But we know they are actually coming from the true (unknown) distribution q(y), right?

If we compute entropy like this, we are actually computing the cross-entropy between both distributions:
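In symbols (a reconstruction: C is the number of classes as before, and H_p(q) is notation assumed here for the cross-entropy), the only change from the entropy is that the logarithm is applied to p(y) instead of q(y):

```latex
H_p(q) = -\sum_{c=1}^{C} q(y_c) \cdot \log\big(p(y_c)\big)
```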


If we, somewhat miraculously, match p(y) to q(y) perfectly, the computed values for both cross-entropy and entropy will match as well.

Since this is unlikely ever to happen, cross-entropy will have a BIGGER value than the entropy computed on the true distribution.

Cross-Entropy minus Entropy

It turns out, this difference between cross-entropy and entropy has a name…

Kullback-Leibler Divergence

The Kullback-Leibler Divergence, or “KL Divergence” for short, is a measure of dissimilarity between two distributions:

KL Divergence

This means that, the closer p(y) gets to q(y), the lower the divergence and, consequently, the cross-entropy, will be.
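The relationship between the three quantities can be checked numerically. Below is a small sketch, assuming natural logarithms and two hypothetical candidate distributions for p(y); the true q(y) is the 3-red / 7-green split of our points.

```python
import math

def entropy(q):
    return -sum(qc * math.log(qc) for qc in q if qc > 0)

def cross_entropy(q, p):
    """Cross-entropy of the approximation p relative to the true distribution q."""
    return -sum(qc * math.log(pc) for qc, pc in zip(q, p) if qc > 0)

def kl_divergence(q, p):
    """KL divergence computed as cross-entropy minus entropy."""
    return cross_entropy(q, p) - entropy(q)

q = [0.3, 0.7]           # true distribution: 3 red and 7 green out of 10
p_far = [0.5, 0.5]       # hypothetical poor approximation
p_near = [0.35, 0.65]    # hypothetical better approximation

for name, p in (("far", p_far), ("near", p_near), ("exact", q)):
    print(name, round(cross_entropy(q, p), 4), round(kl_divergence(q, p), 4))
```

The closer p(y) is to q(y), the smaller both numbers get, and the divergence only reaches zero when the two distributions match.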

So, we need to find a good p(y) to use… but, this is what our classifier should do, isn’t it?! And indeed it does! It looks for the best possible p(y), which is the one that minimizes the cross-entropy.

Loss Function

During its training, the classifier uses each of the N points in its training set to compute the cross-entropy loss, effectively fitting the distribution p(y)! Since the probability of each point is 1/N, cross-entropy is given by:
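Written out (a reconstruction, where p(y_i) stands for the probability the classifier assigns to the true class of point i):

```latex
q(y_i) = \frac{1}{N} \quad\Rightarrow\quad H_p(q) = -\frac{1}{N} \sum_{i=1}^{N} \log\big(p(y_i)\big)
```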

Cross-Entropy —point by point

Remember Figures 6 to 10 above? We need to compute the cross-entropy on top of the probabilities associated with the true class of each point. It means using the green bars for the points in the positive class (y=1) and the red hanging bars for the points in the negative class (y=0) or, mathematically speaking:

Mathematical expression corresponding to Figure 10 🙂

The final step is to compute the average of all points in both classes, positive and negative:

Binary Cross-Entropy — computed over positive and negative classes

Finally, with a little bit of manipulation, we can take any point, either from the positive or negative classes, under the same formula:

Binary Cross-Entropy — the usual formula

Voilà! We got back to the original formula for binary cross-entropy / log loss 🙂

Final Thoughts

I truly hope this post was able to shine some new light on a concept that is quite often taken for granted, that of binary cross-entropy as a loss function. Moreover, I also hope it served to show you a little bit of how Machine Learning and Information Theory are linked together.

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Hyper-parameters in Action! Weight Initializers

Photo by Jesper Aggergaard on Unsplash

Originally posted on Towards Data Science.


This is the second post of my series on hyper-parameters. In this post, I will show you the importance of properly initializing the weights of your deep neural network. We will start with a naive initialization scheme and work out its issues, like the vanishing / exploding gradients, till we (re)discover two popular initialization schemes: Xavier / Glorot and He.

I am assuming you’re already familiar with some key concepts (Z-values, activation functions and their gradients), which I covered in the first post of this series.

The plots illustrating this post were generated using my package, DeepReplay, which you can find on GitHub and learn more about in this post.


On my quest to have a deeper understanding of the effects of each and every one of the different hyper-parameters on training a deep neural network, it is time to investigate the weight initializers.

If you have ever searched for this particular topic, you’ve likely run into some common initialization schemes:

  • Random
  • Xavier / Glorot
  • He

If you dug a little bit deeper, you’ve likely also found out that one should use Xavier / Glorot initialization if the activation function is a Tanh, and that He initialization is the recommended one if the activation function is a ReLU.

By the way, just to clear something up: Xavier Glorot and Yoshua Bengio are the authors of the “Understanding the difficulty of training deep feedforward neural networks” paper, which outlines the initialization scheme that takes either the first (Xavier) or the last (Glorot) name of its first author. So, sometimes this scheme will be referred to as Xavier initialization, and some other times (like in Keras), it will be referred to as Glorot initialization. Don’t be confused by this, as I was the first time I learned about this topic.

Having cleared that up, I ask you once again: have you ever wondered what exactly is going on under the hood? Why is initialization so important? What is the difference between the initialization schemes? I mean, not only their different definitions for what the variance should be, but the overall effect of using one or the other while training a deep neural network!

Before diving into it, I want to give credit where it is due: the plots were heavily inspired by Andre Perunicic’s awesome post on this very same topic.

OK, NOW let’s dive into it!


Make sure you’re using Keras 2.2.0 or newer—older versions had an issue, generating sets of weights with variance lower than expected!

In this post, I will use a model with 5 hidden layers with 100 units each, and a single unit output layer as in a typical binary classification task (that is, using a sigmoid as activation function and binary cross-entropy as loss). I will refer to this model as the BLOCK model, as it has consecutive layers of the same size. I used the following code to build my model:

[gist id=”62a7e4242bae75c19c7b92452efa42c4″]

Model builder function

BLOCK model

This is the architecture of the BLOCK model, regardless of the activation function and/or initializer I will be using to build the plots.


The inputs are 1,000 random points drawn from a 10-dimensional ball (this seems fancier than it actually is, you can think of it as a dataset with 1,000 samples with 10 features each) such that the samples have zero mean and unit standard deviation.

In this dataset, points situated within half of the radius of the ball are labeled as negative cases (0), while the remaining points are labeled positive cases (1).

[gist id=”2efcb07268b5be8fc35a6b4c937f1778″]

Loading 10-dimensional ball dataset using DeepReplay

Naive Initialization Scheme

At the very beginning, there was a sigmoid activation function and randomly initialized weights. And training was hard, convergence was slow, and results were not good.

But, WHY?

Back then, the usual procedure was to draw random values from a Normal distribution (zero mean, unit standard deviation) and multiply them by a small number, say 0.01. The result would be a set of weights with a standard deviation of approximately 0.01. And this led to some problems…
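The effect of that multiplication is easy to verify. Here is a tiny sketch in plain Python; the number of weights and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)

# Draw from a standard Normal (zero mean, unit standard deviation), then scale by 0.01
naive_weights = [0.01 * random.gauss(0.0, 1.0) for _ in range(10_000)]

weights_std = statistics.pstdev(naive_weights)
print(f"standard deviation of the weights: {weights_std:.4f}")
```

Scaling a random variable by a constant scales its standard deviation by the same constant, which is why the result sits right around 0.01.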

Before going a bit deeper (just a bit, I promise!) into the mathematical reason why this was a bad initialization scheme, let me show what would be the result of using it in the BLOCK model with sigmoid activation function:

[gist id=”3a2909b23669790ddcc527b43b12c5ee”]

Code to build the plots! Just change the initializer and have fun! 🙂

Figure 1. BLOCK model using sigmoid and naive initialization — don’t try this at home!

This does NOT look good, right? Can you spot everything that is going bad?

  • Both Z-values (remember, these are the outputs BEFORE applying the activation function) and Activations are within a narrow range;
  • Gradients are pretty much zero;
  • And what is the deal with the weird distributions on the left column?!

Unfortunately, this network is likely not learning much anytime soon, should we choose to try and train it. Why? Because its gradients VANISHED!

And we have not even started the training process yet! This is Epoch 0! What happened so far, then? Let’s see:

  1. Weights were initialized using the naive scheme (upper right subplot)
  2. 1,000 samples were used in a forward pass through the network and generated Z-values (upper left subplot) and Activations (lower left subplot) for all the layers (output layer was not included in the plot)
  3. The loss was computed against the true labels and backpropagated through the network, generating the gradients for all layers (lower right subplot)

That is it! One single pass through the network!

Next, we should update the weights accordingly and repeat the process, right? But, wait… if gradients are pretty much zero, the updated weights are going to be pretty much the same, right?

What does it mean? It means that our network is borderline useless, as it cannot LEARN anything (that is, update its weights to perform the proposed classification task) in a reasonable amount of time.

Welcome to an extreme case of vanishing gradients!
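To get a feeling for how fast gradients shrink, here is a rough simulation in plain Python. It is a simplification, not the code behind the plots: a single random sample, 100 inputs instead of 10, no output layer, and the backward pass starts from a gradient of 1.0 at the output of the last hidden layer. All names are illustrative.

```python
import math
import random

random.seed(0)
n_units, n_layers, std = 100, 5, 0.01   # BLOCK-like sizes, naive initialization

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# weights[l][j][i]: weight from unit i of the previous layer to unit j of layer l
weights = [[[random.gauss(0.0, std) for _ in range(n_units)]
            for _ in range(n_units)] for _ in range(n_layers)]

# Forward pass for one random input
a = [random.gauss(0.0, 1.0) for _ in range(n_units)]
activations = []
for W in weights:
    z = [sum(w * x for w, x in zip(row, a)) for row in W]
    a = [sigmoid(v) for v in z]
    activations.append(a)

# Backward pass, from the last hidden layer to the first one
grad = [1.0] * n_units
grad_scale = []
for W, act in zip(reversed(weights), reversed(activations)):
    # Gradient with respect to the Z-values: multiply by the sigmoid derivative
    delta = [g * s * (1.0 - s) for g, s in zip(grad, act)]
    grad_scale.append(max(abs(d) for d in delta))
    # Gradient with respect to the activations of the previous layer
    grad = [sum(W[j][i] * delta[j] for j in range(n_units)) for i in range(n_units)]
grad_scale.reverse()   # index 0 is now the first hidden layer

for layer, scale in enumerate(grad_scale, start=1):
    print(f"hidden layer {layer}: largest gradient magnitude = {scale:.2e}")
```

Every layer multiplies the gradients by tiny weights and by a sigmoid derivative of at most 0.25, so the magnitudes drop by orders of magnitude on the way back to the first layer.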

You may be thinking: “yeah, sure, the standard deviation was too low, no way it could work like that”. So, how about trying different values, say, 10x or 100x bigger?

Standard deviation 10x bigger = 0.10

Figure 2. BLOCK model using 10x bigger standard deviation

OK, this looks a bit better… Z-values and Activations are within a decent range, gradients for the second to last hidden layers are showing some improvement, but still vanishing towards the initial layers.

Maybe going even BIGGER can fix the gradients, let’s see…

Standard deviation 100x bigger = 1.00

Figure 3. BLOCK model using 100x bigger standard deviation

OK, it seems we made some progress on the vanishing gradients problem, as the ranges of all layers got more similar to each other. Yay! But… we ruined the Z-values and Activations altogether… Z-values now exhibit a range that is too wide, forcing the Activations pretty much into binary mode.

Trying a different Activation Function

If you read my first post on hyper-parameters, you’ll recall that a Sigmoid activation function has this fundamental issue of being centered around 0.5. So, let’s keep following the evolution path of neural networks and use a Tanh activation function instead!

Figure 4. BLOCK model using Tanh and naive initialization

Replacing the Sigmoid with a Tanh activation function, while keeping the naive initialization scheme with small random values from a Normal distribution, led us to yet another vanishing gradients situation. It may not look like it, since the gradients are similar along all the layers, but check the scale: gradients have vanished in the last layer already! This is accompanied by vanishing Z-values and vanishing Activations (just to be clear, these two are not real terms)! Definitely not the way to go!

Let’s go all the way to using a BIG standard deviation and see how it goes (yes, I am saving the best for last…).

Figure 5. BLOCK model using Tanh and a BIG standard deviation

In this setup, we can observe the exploding gradients problem, for a change. See how gradient values grow bigger and bigger as we backpropagate from the last to the first hidden layer? Besides, just as happened with the Sigmoid activation function, Z-values have a range that is too wide and Activations are collapsing into the extremes of the Tanh, either -1 or 1, for the most part. Again, not good!

And, as promised, the winner is… Tanh with a standard deviation of 0.10!

Figure 6. BLOCK model looking good!

Why is this one the winner? Let’s check its features:

  • First, gradients are reasonably similar along all the layers (and within a decent scale— about 20x smaller than the weights)
  • Second, Z-values are within a decent range (-1, 1) and are reasonably similar along all the layers (though some shrinkage is noticeable)
  • Third, Activations have not collapsed into binary mode, and are reasonably similar along all the layers (again, with some shrinkage)

If you haven’t noticed yet, being similar along all the layers is a big deal! Putting it simply, it means we can stack yet another layer at the end of the network and expect a similar distribution of Z-values, Activations and, of course, gradients. We definitely do NOT like collapsing, vanishing or exploding behaviors in our network, no, Sir!

But, being similar along all the layers is the effect, not the cause… as you probably guessed already, the key is the standard deviation of the weights!

So, we need an initialization scheme that uses the best possible standard deviation for drawing random weights! Enter Xavier Glorot and Yoshua Bengio…

Xavier / Glorot Initialization Scheme

Glorot and Bengio devised an initialization scheme that tries to keep all the winning features listed above, that is, gradients, Z-values and Activations similar along all the layers. Another way of putting it: keeping variance similar along all the layers.

How did they pull this off, you ask? We’ll see that in a moment, we just need to do a really brief recap on a fundamental property of the variance.

Really Brief Recap

Let’s say we have x values (either inputs or activation values from a previous layer) and W weights. The variance of the product of two independent variables is given by the formula below:
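For independent x and W, the standard identity (written here in LaTeX, with E denoting the expected value) is:

```latex
\mathrm{Var}(xW) = \mathbb{E}[x]^2 \, \mathrm{Var}(W) + \mathbb{E}[W]^2 \, \mathrm{Var}(x) + \mathrm{Var}(x) \, \mathrm{Var}(W)
```

The first two terms are the ones that depend on the means of x and W.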

Then, let’s assume both x and W have zero mean. The expression above turns into a simple product of both variances of x and W.

There are two important points to make here:

  1. The inputs should have zero mean for this to hold in the first layer, so always scale and center your inputs!
  2. Sigmoid activation function poses a problem for this, as the activation values will have a mean of 0.5, NOT zero! For more details on how to compensate for this, please check this post.

Given the point #2, it only makes sense to stick with Tanh, right? So, that is exactly what we’ll do! Now, it is time to apply this knowledge to a tiny example, so we arrive (hopefully!) at the same conclusion as Glorot and Bengio.

Figure 7. Two hidden layers of a network

Tiny Example

This example is made of two hidden layers, X and Y, fully connected (I am throwing usual conventions to the wind to keep mathematical notation to a bare minimum!).

We only care about the weights connecting these two layers and, guess what, the variances of both activations and gradients.

For the activations, we need to go through a forward pass in the network. For the gradients, we need to backpropagate.

Figure 8. Tanh activation function

And, for the sake of keeping the math simple, we will assume that the activation function is linear (instead of a Tanh) in the equations, meaning that the activation values are the same as Z-values.

Although this may seem a bit of a stretch, Figure 8 shows us that a Tanh is roughly linear in the interval [-1, 1], so the results should hold, at least in this interval.

Figure 9. Forward Pass

Forward Pass

So, instead of using a vectorized approach, we’ll be singling out ONE unit, y1, and working through the math for it.

Figure 9 provides a clear-cut picture of the parts involved, namely:

  • three units in the previous layer (fan-in)
  • weights (w11, w21 and w31)
  • the unit we want to compute variance for, y1

Assuming that x and W are independent and identically distributed, we can work out some simple math for the variance of y1:

Remember the brief recap on variance? Time to put it to good use!

OK, almost there! Remember, our goal is to keep variance similar along all the layers. In other words, we should aim for making the variance of x the same as the variance of y.

For our single unit, y1, this can be accomplished by choosing the variance of its connecting weights to be:

And, generalizing for all the connecting weights between hidden layers X and Y, we have:
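Collecting the steps above in one place (a reconstruction from the text, using the three-unit fan-in of Figure 9 and assuming zero-mean, independent and identically distributed x and W):

```latex
\begin{aligned}
\mathrm{Var}(y_1) &= \mathrm{Var}(x_1 w_{11} + x_2 w_{21} + x_3 w_{31}) \\
                  &= \sum_{i=1}^{3} \mathrm{Var}(x_i w_{i1}) = 3 \, \mathrm{Var}(x) \, \mathrm{Var}(W) \\
\mathrm{Var}(y_1) = \mathrm{Var}(x) \quad &\Rightarrow \quad \mathrm{Var}(W) = \frac{1}{3} \\
\text{in general:} \quad \mathrm{Var}(W) &= \frac{1}{\text{fan\_in}}
\end{aligned}
```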

By the way, this is the variance to be used if we’re drawing random weights from a Normal distribution!
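A quick simulation can confirm that this choice keeps the variance where we want it. The sketch below follows the same simplifications as the math (linear activation, zero-mean inputs with unit variance) and uses a fan-in of 100; the sample size and seed are arbitrary.

```python
import random
import statistics

random.seed(7)
fan_in = 100
weight_std = (1.0 / fan_in) ** 0.5   # standard deviation for Var(W) = 1 / fan_in

def one_unit_output():
    # y1 = sum of x_i * w_i1, with fresh zero-mean inputs and weights on every draw
    return sum(random.gauss(0.0, 1.0) * random.gauss(0.0, weight_std)
               for _ in range(fan_in))

outputs = [one_unit_output() for _ in range(2000)]
output_variance = statistics.pvariance(outputs)
print(f"Var(x) = 1.0, Var(y1) is approximately {output_variance:.3f}")
```

With a larger or smaller weight variance, the output variance would grow or shrink by the same factor at every layer.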

What if we want to use a Uniform distribution? We just have to compute the (symmetric) lower and upper limits, as shown below:
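A Uniform distribution on the interval [-limit, limit] has a variance of limit² / 3, so the limit for a desired variance is the square root of three times that variance. A small sketch, with an illustrative function name and a fan-in of 100:

```python
import math
import random
import statistics

def uniform_limit(variance):
    """Half-width of a symmetric Uniform distribution with the given variance."""
    return math.sqrt(3.0 * variance)

fan_in = 100
limit = uniform_limit(1.0 / fan_in)   # forward-pass variance: 1 / fan_in

# Empirical check: weights drawn from Uniform(-limit, limit)
random.seed(3)
sample = [random.uniform(-limit, limit) for _ in range(50_000)]
sample_variance = statistics.pvariance(sample)
print(f"limit = {limit:.4f}, sample variance = {sample_variance:.5f}")
```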

Are we done?! Not yet… don’t forget the backpropagation, we also want to keep the gradients similar along all the layers (its variance, to be more precise).

Figure 10. Backward Pass

Backward Pass (Backpropagation)

Again, let’s single out ONE unit, x1, for the backward pass.

Figure 10 provides a clear-cut picture of the parts involved, namely:

  • five units in the following layer (fan-out)
  • weights (w11, w12, w13, w14 and w15)
  • the unit x1, with respect to which we want to compute the variance of the gradients

Basically, we will make the same assumptions and follow the same steps as in the forward pass. For the variance of the gradients with respect to x1, we can work out the math the same way:

Using what we learned in the brief recap once again:

And, to keep the variance of gradients similar along all the layers, we find the needed variance of its connecting weights to be:

OK, we have gone a long way already! The inverse of the “fan in” gives us the desired variance of the weights for the forward pass, while the inverse of the “fan out” gives us the desired variance of the (same!) weights for the backpropagation.

But… what if “fan in” and “fan out” have VERY different values?

Reconciling Forward and Backward Passes

Can’t decide which one to choose, “fan in” or “fan out”, for calculating the variance of your network weights? No problem, just take the average!

So, we finally arrive at the expression for the variance of the weights, as in Glorot and Bengio, to be used with a Normal distribution:

And, for the Uniform distribution, we compute the limits accordingly:
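Taking the average of both “fans” means using a variance of 2 / (fan_in + fan_out). For the Uniform case, the limits follow from the fact that a Uniform distribution on [-limit, limit] has a variance of limit² / 3. A minimal sketch (the function names are mine, not Keras’):

```python
import math

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation for Normal weights: Var(W) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in, fan_out):
    """Limit for Uniform(-limit, limit) weights with the same variance."""
    return math.sqrt(6.0 / (fan_in + fan_out))

# The hidden layers of the BLOCK model have 100 units on both sides
std = glorot_normal_std(100, 100)
limit = glorot_uniform_limit(100, 100)
print(f"std = {std:.4f}, limit = {limit:.4f}")
```

When fan-in and fan-out are equal, the average changes nothing and the variance is simply the inverse of either one.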

Congratulations! You (re)discovered the Xavier / Glorot initialization scheme!

But, there is still one small detail, should you choose a Normal distribution to draw the weights from…

Truncated Normal and Keras’ Variance Scaling

When it comes to the weights of a neural network, we want them to be neatly distributed around zero and, even more than that, we don’t want any outliers! So, we truncate it!

What does it mean to truncate it? Just get rid of any values farther than twice the standard deviation! So, if you use a standard deviation of 0.1, a truncated normal distribution will have absolutely no values below -0.2 or above 0.2 (as in the left plot of Figure 11).

The thing is, once you cut out the tails of the normal distribution, the remaining values have a slightly lower standard deviation… 0.87962566103423978 of the original value, to be precise.
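That factor can be derived from the standard formula for the variance of a Normal distribution truncated symmetrically at a given number of standard deviations. A short sketch using only Python’s math module (the function name is illustrative):

```python
import math

def truncated_normal_std_factor(cutoff=2.0):
    """Standard deviation of a standard Normal truncated to [-cutoff, cutoff]."""
    # Density of the standard Normal at the cutoff
    pdf = math.exp(-0.5 * cutoff ** 2) / math.sqrt(2.0 * math.pi)
    # Probability mass that survives the truncation: P(|z| <= cutoff)
    mass = math.erf(cutoff / math.sqrt(2.0))
    variance = 1.0 - 2.0 * cutoff * pdf / mass
    return math.sqrt(variance)

factor = truncated_normal_std_factor(2.0)
print(f"{factor:.10f}")
```

Dividing the desired standard deviation by this factor before drawing the weights compensates for the truncation.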

In Keras, before version 2.2.0, this difference in a truncated normal distribution was not taken into account in the Variance Scaling initializer, which is the base for Glorot and He initializers. So, it is possible that, in deeper models, initializers based on uniform distributions would have performed better than their normal counterparts, which suffered from a slowly shrinking variance layer after layer…

As of today, this is not an issue anymore, and we can observe the effect of compensating for the truncation in the right plot of Figure 11, where the distribution of the Variance Scaling initializer is clearly wider.

Figure 11. Truncated normal and Keras’ Variance Scaling

Some plots, PLEASE!

Thank you very much for bearing with me through the more mathematical parts. Your patience is going to be rewarded with plots galore!

Let’s see how the Glorot initializer (as it is called in Keras) performs, using both Normal and Uniform distributions.

Figure 12. BLOCK model with Glorot Normal initializer

Figure 13. BLOCK model with Glorot Uniform initializer

It looks like we have two winners!

Do you remember our previous winner, the BLOCK model using the naive initialization scheme with a standard deviation of 0.1 in Figure 6? The results are incredibly similar, right?

Well, it turns out that a standard deviation of 0.1 we used there is exactly the right value, according to the Glorot initialization scheme, when we have “fan in” and “fan out” equal to 100. It was not using a truncated normal distribution, though…
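A quick sanity check of that claim, as a minimal sketch: it assumes the usual Glorot formulas (variance 2 / (fan_in + fan_out) for the Normal, limits of ±sqrt(6 / (fan_in + fan_out)) for the Uniform), and the function names are hypothetical, not Keras’.

```python
import math


def glorot_normal_std(fan_in, fan_out):
    """Standard deviation of the weights under the Glorot scheme (Normal)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from Uniform(-limit, limit) under the Glorot scheme."""
    return math.sqrt(6.0 / (fan_in + fan_out))


std_100 = glorot_normal_std(100, 100)
limit_100 = glorot_uniform_limit(100, 100)

# A Uniform(-a, a) distribution has variance a^2 / 3, so both options
# end up with the same variance for the weights.
uniform_variance = limit_100 ** 2 / 3.0
print(std_100, limit_100, uniform_variance)
```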

So, this initialization scheme solves our issues with vanishing and exploding gradients… but does it work with an activation function other than the Tanh? Let’s see…

Rectified Linear Unit (ReLU) Activation Function

Can we stick with the same initialization scheme and use a ReLU as activation function instead?

Figure 14. BLOCK model with ReLU and Glorot Normal initializer — they don’t mix well…

The answer is: NO!

Back to square one… we need a new and improved initialization scheme. Enter He et al., with their “Delving Deep into Rectifiers” paper…

He Initialization Scheme

Luckily, everything we got while (re)discovering the Glorot initialization scheme still holds. There is only one tiny adjustment we need to make… multiply the variance of the weights by 2! Really, that’s all that it takes!

Figure 15. ReLU activation function

Simple enough, right? But, WHY?

The reason is also pretty straightforward: the ReLU turns half of the Z-values (the negative ones) into zeros, effectively removing about half of the variance. So, we need to double the variance of the weights to compensate for it.
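Here is a small NumPy experiment to make this concrete. It is only a sketch: the Z-values are assumed to be zero-mean with unit variance, the layer width of 100 is arbitrary, and, strictly speaking, the quantity the ReLU halves is the mean of the squared values, which is what drives the variance of the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 100

# Zero-mean Z-values, then ReLU: the negative half turns into zeros.
z = rng.normal(0.0, 1.0, size=(20_000, fan_in))
a = np.maximum(z, 0.0)

# Fraction of the mean squared value that survives the ReLU (about one half).
kept_fraction = np.mean(a ** 2) / np.mean(z ** 2)

# Next layer: weights with variance 1 / fan_in, and the same weights
# with twice that variance (scaling by sqrt(2) doubles the variance).
w_plain = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_in))
w_doubled = w_plain * np.sqrt(2.0)

ms_plain = np.mean((a @ w_plain) ** 2)      # roughly half of the original scale
ms_doubled = np.mean((a @ w_doubled) ** 2)  # back to roughly the original scale

print(kept_fraction, ms_plain, ms_doubled)
```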

Since we know that the Glorot initialization scheme preserves variance (1), how to compensate for the variance halving effect of the ReLU (2)? The result (3), as expected, is doubling the variance.

So, the expression for the variance of the weights, as in He et al., to be used with a Normal distribution is:

And, for the Uniform distribution, we compute the limits accordingly:
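In code, both expressions can be sketched as follows. This assumes the usual statement of the scheme, that is, a variance of 2 / fan_in for the Normal and limits of ±sqrt(6 / fan_in) for the Uniform; the function names are mine, not Keras’.

```python
import math


def he_normal_std(fan_in):
    """Standard deviation of the weights: twice the variance 1 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def he_uniform_limit(fan_in):
    """Weights are drawn from Uniform(-limit, limit)."""
    return math.sqrt(6.0 / fan_in)


he_std = he_normal_std(100)
he_limit = he_uniform_limit(100)

# Uniform(-a, a) has variance a^2 / 3, which matches the Normal case.
he_uniform_variance = he_limit ** 2 / 3.0
print(he_std, he_limit, he_uniform_variance)
```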

Congratulations! You (re)discovered the He initialization scheme!

But… what about the backpropagation? Shouldn’t we use the average of both “fans” once again? Actually, there is no need for it. He et al. showed in their paper that, for common network designs, if the initialization scheme scales the activation values during the forward pass, it does the trick for the backpropagation as well! Moreover, it works both ways, so we could even use “fan out” instead of “fan in”.

Now, it is time for more plots!

Figure 15. BLOCK model with ReLU and He Normal initializer

Figure 16. BLOCK model with ReLU and He Uniform initializer

Again, two more winners! When it comes to the distributions of Z-values, they look remarkably similar along all the layers! As for the gradients, they look a bit more “vanishy” now than when we used the Tanh/Glorot duo… Does this mean that Tanh/Glorot is better than ReLU/He? We know this is not true…

But, then, why did its gradients not look so good in Figure 16? Well, once again, don’t forget to look at the scale! Even though the variance of the gradients decreases as we backpropagate through the network, its values are nowhere near vanished (if you remember Figure 4, it was the other way around: the variance was similar along the layers, but its scale was around 0.0000001!).

So, we need not only a similar variance along all the layers, but also a proper scale for the gradients. The scale is quite important, as it will, together with the learning rate, define how fast the weights are going to be updated. If the gradients are way too small, the learning (that is, the update of the weights) will be extremely slow.

How small is too small, you ask? As always, it depends… on the magnitude of the weights. So, too small is not an absolute measure, but a relative one.

If we compute the ratio between the variance of the gradients and the variance of the corresponding weights (or their standard deviations, for that matter), we can roughly compare the learning speed of different initialization schemes and their underlying distributions (assuming a constant learning rate).
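A minimal sketch of that ratio, using standard deviations. The arrays below are made-up stand-ins for the weights and gradients of one layer, not values taken from the BLOCK model, and the function name is hypothetical.

```python
import numpy as np


def gradient_to_weight_ratio(gradients, weights):
    """Std of the gradients divided by the std of the corresponding weights."""
    return float(np.std(gradients) / np.std(weights))


rng = np.random.default_rng(0)
layer_weights = rng.normal(0.0, 0.1, size=(100, 100))      # hypothetical weights
layer_gradients = rng.normal(0.0, 0.001, size=(100, 100))  # hypothetical gradients

ratio = gradient_to_weight_ratio(layer_gradients, layer_weights)
print(ratio)
```

The bigger the ratio, the bigger each update is relative to the weights it changes, for the same learning rate.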

So, it is time for the…

Showdown — Normal vs Uniform and Glorot vs He!

To be honest, Glorot vs He actually means Tanh vs ReLU and we all know the answer to this match (spoiler alert!): ReLU wins!

And what about Normal vs Uniform? Let’s check the plot below:

Figure 17. How big are the gradients, after all?

And the winner is… Uniform! It is clear that, at least for our particular BLOCK model and inputs, using a Uniform distribution yields relatively bigger gradients than using a Normal distribution.

Moreover, as expected, using a ReLU yields relatively bigger gradients than using a Tanh. For our particular example, this doesn’t hold true for the first layer because its “fan in” is only 10 (the dimension of the inputs). Had we used 100-dimensional inputs, the gradients for a ReLU would have been bigger for that layer as well.

And, even though it may seem like gradients are kinda “vanishing” when using a ReLU, just take a look at the tiny purple bar to the very right of Figure 17… I slipped the naively initialized and Sigmoid activated network into the plot to highlight how bad true vanishing gradients are 🙂 And, if the plot still cannot convince you, I am also throwing in the corresponding table:

Ratio: standard deviation of gradients over standard deviation of weights

In summary, for a ReLU activated network, the He initialization scheme using a Uniform distribution is a pretty good choice 😉

There are many, many more ways to analyze the effects of choosing a particular initialization scheme… we could try different network architectures (like “funnel” or “hourglass” shaped), deeper networks, changing the distribution of the labels (and, therefore, the loss)… I tried LOTS of combinations, and He/Uniform always outperformed the other initialization schemes, but this post is too long already!

Final Thoughts

This was a looong post, especially for a topic so taken for granted as weight initializers! But I felt that, for one to really appreciate their importance, one should follow the steps and bump into the issues that led to the development of the schemes used nowadays.

Even though, as a practitioner, you know the “right” combinations to use for initializing your network, I really hope this post was able to give you some insights into what is really happening and, most importantly, why that particular combination is the “right” one 🙂

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Hyper-parameters in Action! Introducing DeepReplay


Originally posted on Towards Data Science.


In my previous post, I invited you to wonder what exactly is going on under the hood when you train a neural network. Then I investigated the role of activation functions, illustrating the effect they have on the feature space using plots and animations.

Now, I invite you to play an active role in the investigation!

It turns out these plots and animations drew quite some attention. So I decided to organize my code and structure it into a proper Python package, so you can plot and animate your own Deep Learning models!

What do they look like, you ask? Well, if you haven’t checked the original post yet, here is a quick peek:

This is what animating with DeepReplay looks like 🙂

So, without further ado, I present to you… DeepReplay!


The package is called DeepReplay because this is exactly what it allows you to do: REPLAY the process of training your Deep Learning Model, plotting and animating several aspects of it.

The process is simple enough, consisting of five steps:

  1. It all starts with creating an instance of a callback!
  2. Then, business as usual: build and train your model.
  3. Next, load the collected data into Replay.
  4. Finally, create a figure and attach the visualizations to it.
  5. Plot and/or animate it!

Let’s go through each one of these steps!

1. Creating an instance of a callback

The callback should be an instance of ReplayData.

[gist id=”61394f6733e33ec72522a58614d1425a” /]

The callback takes, as arguments, the model inputs (X and y), as well as the filename and group name where you want to store the collected training data.

Two things to keep in mind:

  • For toy datasets, it is fine to use the same X and y as in your model fitting. These are the examples that will be plotted, so, if you are using a bigger dataset, you can choose a random subset of it to keep computation times reasonable.
  • The data is stored in an HDF5 file, and you can use the same file several times over, but never the same group! If you try running it twice using the same group name, you will get an error.

2. Build and train your model

Like I said, business as usual, nothing to see here… just don’t forget to add your callback instance to the list of callbacks when fitting!

[gist id=”86591c9796731c21f920e01ed2376b23″ /]

3. Load collected data into Replay

So, the part that gives the whole thing its name… time to replay it!

It should be straightforward enough: create an instance of Replay, providing the filename and the group name you chose in Step 1.

[gist id=”019637d6d041fdbd269db9a78a2311b6″ /]

4. Create a figure and attach visualizations to it

This is the step where things get interesting, actually. Just use Matplotlib to create a figure, as simple as the one in the example, or as complex as subplot2grid allows you to make it, and start attaching visualizations from your Replay object to the figure.

[gist id=”ba49bdca40a2abaa68af39922e78a556″ /]

The example above builds a feature space based on the output of the layer named, suggestively, hidden.

But there are five types of visualizations available:

  • Feature Space: plot representing the twisted and turned feature space, corresponding to the output of a hidden layer (only 2-unit hidden layers supported for now), including grid lines for 2-dimensional inputs;
  • Decision Boundary: plot of a 2-D grid representing the original feature space, together with the decision boundary (only 2-dimensional inputs supported for now);
  • Probability Histogram: two histograms of the resulting classification probabilities for the inputs, one for each class, corresponding to the model output (only binary classification supported for now);
  • Loss and Metric: line plot for both the loss and a chosen metric, computed over all the inputs you passed as arguments to the callback;
  • Loss Histogram: histogram of the losses computed over all the inputs you passed as arguments to the callback (only binary cross-entropy loss supported for now).

5. Plot and/or animate it!

For this example, with a single visualization, you can use its plot and animate methods directly. These methods will return, respectively, a figure and an animation, which you can then save to a file.

[gist id=”83ef91da63de149f5a58f6e428ab37f3″ /]

If you decide to go with multiple simultaneous visualizations, there are two helper methods that return composed plots and animations, respectively: compose_plots and compose_animations.

To illustrate these methods, here is a gist that comes from the “canonical” example I used in my original post. There are four visualizations and five plots (Probability Histogram has two plots, for negative and positive cases).

The animated GIF at the beginning of this post is actually the result of this composed animation!

[gist id=”6ad78608f5ae7ebe2c31f84f9b001625″ /]


At this point, you probably noticed that the two coolest visualizations, Feature Space and Decision Boundary, are limited to two dimensions.

I plan on adding support for visualizations in three dimensions as well, but most datasets and models have either more inputs or hidden layers with many more units.

So, these are the options you have:

  • 2D inputs, 2-unit hidden layer: Feature Space with optional grid (check the Activation Functions example);
  • 3D+ inputs, 2-unit hidden layer: Feature Space, but no grid;
  • 2D inputs, hidden layer with 3+ units: Decision Boundary with optional grid (check the Circles example);
  • nothing is two dimensional: well… there is always a workaround, right?

Working around multidimensionality

What do we want to achieve? Since we can only do 2-dimensional plots, we want 2-dimensional outputs — simple enough.

How to get 2-dimensional outputs? Adding an extra hidden layer with two units, of course! OK, I know this is suboptimal, as it is actually modifying the model (did I mention this is a workaround?!). We can then use the outputs of this extra layer for plotting.

You can check either the Moons or the UCI Spambase notebooks, for examples on adding an extra hidden layer and plotting it.

NOTE: The following part is a bit more advanced: it delves deeper into the reasoning behind adding the extra hidden layer and what it represents. Proceed at your own risk 🙂

What are we doing with the model, anyway? By adding an extra hidden layer, we can think of our model as having two components: an encoder and a decoder. Let’s dive just a bit deeper into those:

  • Encoder: the encoder goes from the inputs all the way to our extra hidden layer. Let’s consider its 2-dimensional output as features and call them f1 and f2.
  • Decoder: the decoder, in this case, is just a plain and simple logistic regression, which takes two inputs, say, f1 and f2, and outputs a classification probability.

Let me try to make it more clear with a network diagram:

Encoder / Decoder after adding an extra hidden layer

What do we have here? A 9-dimensional input, an original hidden layer with 5 units, an extra hidden layer with two units, its corresponding two outputs (features) and a single unit output layer.

So, what happens with the inputs along the way? Let’s see:

  1. Inputs (x1 through x9) are fed into the encoder part of the model.
  2. The original hidden layer twists and turns the inputs. The outputs of the hidden layer can also be thought of as features (these would be the outputs of units h1 through h5 in the diagram), but these are assumed to be n-dimensional and therefore not suited for plotting. So far, business as usual.
  3. Then comes the extra hidden layer. Its weight matrix has shape (n, 2) (in the diagram, n = 5 and we can count 10 arrows between h and e nodes). If we assume a linear activation function, this layer is actually performing an affine transformation, mapping points from an n-dimensional to a 2-dimensional feature space. These are our features, f1 and f2, the output of the encoder part.
  4. Since we assumed a linear activation function for the extra hidden layer, f1 and f2 are going to be directly fed to the decoder (output layer), that is, to a single unit with a sigmoid activation function. This is a plain and simple logistic regression.
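The four steps above can be sketched in plain NumPy. This is just an illustration with random, untrained weights: the shapes follow the diagram (9 inputs, 5 hidden units, 2 extra units, 1 output), while the tanh in the original hidden layer and all the names are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_hidden, n_extra = 9, 5, 2

# Random (untrained) parameters, only to show the shapes involved.
W_h, b_h = rng.normal(size=(n_inputs, n_hidden)), np.zeros(n_hidden)
W_e, b_e = rng.normal(size=(n_hidden, n_extra)), np.zeros(n_extra)
W_o, b_o = rng.normal(size=(n_extra, 1)), np.zeros(1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encoder(x):
    """Inputs -> original hidden layer -> extra layer with linear activation."""
    h = np.tanh(x @ W_h + b_h)  # original hidden layer (tanh is an arbitrary pick)
    return h @ W_e + b_e        # affine map from 5-D to 2-D: features f1 and f2


def decoder(f):
    """A plain logistic regression on the two features."""
    return sigmoid(f @ W_o + b_o)


x = rng.normal(size=(4, n_inputs))  # four made-up examples, x1 through x9
features = encoder(x)               # shape (4, 2): these are what gets plotted
probabilities = decoder(features)   # shape (4, 1)
print(features.shape, probabilities.shape, W_e.size)
```

Notice that W_e holds exactly the 10 weights counted as arrows between the h and e nodes.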

What does it all mean? It means that our model is also learning a latent space with two latent factors (f1 and f2) now! Fancy, uh?! Don’t get intimidated by the fanciness of these terms, though… it basically means the model learned to best compress the information to only two features, given the task at hand — a binary classification.

This is the basic underlying principle of auto-encoders, the major difference being the fact that the auto-encoder’s task is to reconstruct its inputs, not classify them in any way.

Final Thoughts

I hope this post enticed you to try DeepReplay out 🙂

If you come up with nice and cool visualizations for different datasets, or using different network architectures or hyper-parameters, please share them in the comments section. I am considering starting a Gallery page, if there is enough interest in it.

For more information about the DeepReplay package, like installation, documentation, examples and notebooks (which you can play with using Google Colab), please go to my GitHub repository:

Have fun animating your models! 🙂


Hyper-parameters in Action! Activation Functions


This is the first of a series of posts aiming at presenting visually, in a clear and concise way, some of the fundamental moving parts of training a neural network: the hyper-parameters.

Originally posted on Towards Data Science.


Deep Learning is all about hyper-parameters! Maybe this is an exaggeration, but having a sound understanding of the effects of different hyper-parameters on training a deep neural network is definitely going to make your life easier.

While studying Deep Learning, you’re likely to find lots of information on the importance of properly setting the network’s hyper-parameters: activation functions, weight initializer, optimizer, learning rate, mini-batch size, and the network architecture itself, like the number of hidden layers and the number of units in each layer.

So, you learn all the best practices, you set up your network, define the hyper-parameters (or just use their default values), start training and monitor the progress of your model’s losses and metrics.

Perhaps the experiment doesn’t go as well as you’d expect, so you iterate over it, tweaking the network, until you find the set of values that will do the trick for your particular problem.

Looking for a deeper understanding (no pun intended!)

Have you ever wondered what exactly is going on under the hood? I did, and it turns out that some simple experiments may shed quite some light on this matter.

Take activation functions, for instance, the topic of this post. You and I know that the role of activation functions is to introduce a non-linearity, otherwise the whole neural network could be simply replaced by a corresponding affine transformation (that is, a linear transformation, such as rotating, scaling or skewing, followed by a translation), no matter how deep the network is.
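To see why depth alone does not help, here is a minimal sketch with two stacked layers and no activation function in between. The shapes and the random values are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" without any activation function between them.
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

x = rng.normal(size=(5, 2))
two_layers = (x @ W1 + b1) @ W2 + b2

# The very same mapping, written as a single affine transformation.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```

The same argument applies to any number of layers, since composing affine transformations always yields another affine transformation.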

A neural network having only linear activations (that is, no activation!) would have a hard time handling even a quite simple classification problem like this (each line has 1,000 points, generated for x values equally spaced between -1.0 and 1.0):

Figure 1: in this two-dimensional feature space, the blue line represents the negative cases (y = 0), while the green line represents the positive cases (y = 1).

If the only thing a network can do is to perform an affine transformation, this is likely what it would be able to come up with as a solution:

Figure 2: linear boundary — doesn’t look so good, right?

Clearly, this is not even close! Some examples of much better solutions are:

Figure 3: Non-linearities to the rescue!

These are three fine examples of what non-linear activation functions bring to the table! Can you guess which one of the images corresponds to a ReLU?

Non-linear boundaries (or are they?)

How do these non-linear boundaries come to be? Well, the actual role of the non-linearity is to twist and turn the feature space so much so that the boundary turns out to be… LINEAR!

OK, things are getting more interesting now (at least, I thought so the first time I laid eyes on it in Chris Olah’s awesome blog post, from which I drew my inspiration to write this). So, let’s investigate it further!

Next step is to build the simplest possible neural network to tackle this particular classification problem. There are two dimensions in our feature space (x1 and x2), and the network has a single hidden layer with two units, so we preserve the number of dimensions when it comes to the outputs of the hidden layer (z1 and z2).

Figure 4: diagram of a simple neural network with a single 2-unit hidden layer

Up to this point, we are still in the realm of affine transformations… so, it is time for a non-linear activation function, represented by the Greek letter sigma, resulting in the activation values (a1 and a2) for the hidden layer.

These activation values represent the twisted and turned feature space I referred to in the first paragraph of this section. This is a preview of what it looks like, when using a sigmoid as activation function:

Figure 5: two-dimensional feature space: twisted and turned!

As promised, the boundary is LINEAR! By the way, the plot above corresponds to the left-most solution with a non-linear boundary on the original feature space (Figure 3).

Neural network’s basic math recap

Just to make sure you and I are on the same page, I am showing you below four representations of the very basic matrix arithmetic performed by the neural network up to the hidden layer, BEFORE applying the activation function (that is, just an affine transformation such as xW + b)

Basic matrix arithmetic: 4 ways of representing the same thing in the network

Time to apply the activation function, represented by the Greek letter sigma on the network diagram.

Activation function: applied on the results of the affine transformations

Voilà! We went from the inputs to the activation values of the hidden layer!
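The same two steps in NumPy, as a minimal sketch: the points, weights and biases below are made-up numbers, only meant to show the shapes for a 2-input, 2-unit hidden layer, and the sigmoid is just one possible choice for sigma.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Three hypothetical points in the original feature space (x1, x2).
X = np.array([[-1.0, 0.5],
              [0.0, 0.0],
              [1.0, -0.5]])

# Hypothetical parameters: 2 inputs, 2 hidden units.
W = np.array([[0.5, -1.0],
              [2.0, 0.3]])
b = np.array([0.1, -0.2])

Z = X @ W + b   # affine transformation: z1 and z2 for every point
A = sigmoid(Z)  # activation values: a1 and a2 for every point

print(Z)
print(A)
```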

Implementing the network in Keras

For the implementation of this simple network, I used Keras’ Sequential model API. Apart from distinct activation functions, every trained model used the very same hyper-parameters:

  • weight initializers: Glorot (Xavier) normal (hidden layer) and random normal (output layer);
  • optimizer: Stochastic Gradient Descent (SGD);
  • learning rate: 0.05;
  • mini-batch size: 16;
  • number of hidden layers: 1;
  • number of units (in the hidden layer): 2.

Given that this is a binary classification task, the output layer has a single unit with a sigmoid activation function and the loss is given by binary cross-entropy.

[gist id=”e2536b9f45c4884f90d20d68e1b3d8c3″]

Code: simple neural network with a single 2-unit hidden layer

Activation functions in action!

Now, for the juicy part — visualizing the twisted and turned feature space as the network trains, using a different activation function each time: sigmoid, tanh and ReLU.

In addition to showing changes in the feature space, the animations also contain:

  • histograms of predicted probabilities for both negative (blue line) and positive cases (green line), with misclassified cases shown in red bars (using threshold = 0.5);
  • line plots of accuracy and average loss;
  • histogram of losses for every element in the dataset.


Let’s start with the most traditional of the activation functions, the sigmoid, even though, nowadays, its usage is pretty much limited to the output layer in classification tasks.

Figure 6: sigmoid activation function and its gradient

As you can see in Figure 6, a sigmoid activation function “squashes” the input values into the range (0, 1) (the same range probabilities can take, which is the reason why it is used in the output layer for classification tasks). Also, remember that the activation values of any given layer are the inputs of the following layer and, given the range for the sigmoid, the activation values are going to be centered around 0.5, instead of zero (as is usually the case for normalized inputs).

It is also possible to verify that its gradient peak value is 0.25 (for z = 0) and that it is already close to zero as |z| reaches a value of 5.
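Both numbers are easy to reproduce. A minimal sketch, using the well-known expression for the derivative of the sigmoid, s(z) · (1 − s(z)):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1.0 - s)


peak = sigmoid_gradient(0.0)  # the maximum of the gradient, at z = 0
at_5 = sigmoid_gradient(5.0)  # already well below 0.01

print(peak, at_5)
```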

So, how does using a sigmoid activation function work for this simple network? Let’s take a look at the animation:

Sigmoid in action!

There are a couple of observations to be made:

  • epochs 15–40: the typical sigmoid “squashing” is noticeable on the horizontal axis;
  • epochs 40–65: the loss stays at a plateau, and there is a “widening” of the transformed feature space on the vertical axis;
  • epoch 65: at this point, negative cases (blue line) are all correctly classified, even though their associated probabilities are still distributed up to 0.5, while the positive cases on the edges are still misclassified;
  • epochs 65–100: the aforementioned “widening” becomes more and more intense, up to the point where pretty much all of the feature space is covered again, while the loss falls steadily;
  • epoch 103: thanks to the “widening”, all positive cases are now lying within the proper boundary, although some still have probabilities barely above the 0.5 threshold;
  • epochs 100–150: there is now some “squashing” happening on the vertical axis as well, the loss falls a bit more to what seems to be a new plateau and, except for a few of the positive edge cases, the network is pretty confident in its predictions.

So, the sigmoid activation function succeeds in separating both lines, but the loss declines slowly, while staying at plateaus for a significant portion of the training time.

Can we do better with a different activation function?


The tanh activation function was the evolution of the sigmoid, as it outputs values with a zero mean, unlike its predecessor.

Figure 7: tanh activation function and its gradient

As you can see in Figure 7, the tanh activation function “squashes” the input values into the range (-1, 1). Therefore, being centered at zero, the activation values are already (somewhat) normalized inputs for the next layer.

Regarding the gradient, it has a much bigger peak value of 1.0 (again, for z = 0), but its decrease is even faster, approaching zero for values of |z| as low as 3. This is the underlying cause of what is referred to as the problem of vanishing gradients, which causes the training of the network to be progressively slower.
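Again, a minimal sketch to check those values, using the derivative of the tanh, 1 − tanh²(z). The five-layer product at the end is only a crude illustration of how such small factors compound during backpropagation; it ignores the weights entirely.

```python
import numpy as np


def tanh_gradient(z):
    return 1.0 - np.tanh(z) ** 2


peak = tanh_gradient(0.0)  # the maximum of the gradient, at z = 0
at_3 = tanh_gradient(3.0)  # already below 0.01

# Five saturated units in a row: the factors multiply along the way.
compounded = at_3 ** 5

print(peak, at_3, compounded)
```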

Now, for the corresponding animation, using tanh as activation function:

Tanh in action!

There are a couple of observations to be made:

  • epochs 10–40: there is a tanh “squashing” happening on the horizontal axis, though it is less pronounced, while the loss stays at a plateau;
  • epochs 40–55: there is still no improvement in the loss, but there is a “widening” of the transformed feature space on the vertical axis;
  • epoch 55: at this point, negative cases (blue line) are all correctly classified, even though their associated probabilities are still distributed up to 0.5, while the positive cases on the edges are still misclassified;
  • epochs 55–65: the aforementioned “widening” quickly reaches the point where pretty much all of the feature space is covered again, while the loss falls abruptly;
  • epoch 69: thanks to the “widening”, all positive cases are now lying within the proper boundary, although some still have probabilities barely above the 0.5 threshold;
  • epochs 65–90: there is now some “squashing” happening on the vertical axis as well, the loss keeps falling until reaching a new plateau and the network exhibits a high level of confidence for all predictions;
  • epochs 90–150: only small improvements in the predicted probabilities happen at this point.

OK, it seems a bit better… the tanh activation function reached a correct classification for all cases faster, with the loss also declining faster (when declining, that is), but it also spends a lot of time in plateaus.

What if we get rid of all the “squashing”?


Rectified Linear Units, or ReLUs for short, are the commonplace choice of activation function these days. A ReLU addresses the problem of vanishing gradients so common in its two predecessors, while also being the fastest to compute gradients for.

Figure 8: ReLU activation function and its gradient

As you can see in Figure 8, the ReLU is a totally different beast: it does not “squash” the values into a range — it simply preserves positive values and turns all negative values into zero.

The upside of using a ReLU is that its gradient is either 1 (for positive values) or 0 (for negative values) — no more vanishing gradients! This pattern leads to a faster convergence of the network.

On the other hand, this behavior can lead to what is called a “dead neuron”, that is, a neuron whose inputs are consistently negative and, therefore, always has an activation value of zero.
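Both sides of the coin fit in a few lines. This is a minimal sketch with made-up Z-values, and it takes the gradient at exactly zero to be 0, which is a common convention rather than the only one.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def relu_gradient(z):
    # 1 for positive values, 0 otherwise (the value at z = 0 is a convention)
    return (z > 0).astype(float)


z = np.array([-2.0, -0.5, 0.5, 2.0])
activations = relu(z)         # negative values become zero, positive ones pass through
gradients = relu_gradient(z)  # either 0 or 1, no squashing involved

# A "dead neuron": its Z-values are negative for every single example,
# so both its activations and its gradients are always zero.
dead_z = np.array([-3.0, -1.2, -0.1])
dead_activations = relu(dead_z)
dead_gradients = relu_gradient(dead_z)

print(activations, gradients, dead_activations, dead_gradients)
```

With all gradients at zero, nothing flows back through that unit, so its weights stop being updated.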

Time for the last of the animations, which is quite different from the previous two, thanks to the absence of “squashing” in the ReLU activation function:

ReLU in action!

There are a couple of observations to be made:

  • epochs 0–10: the loss falls steadily from the very beginning;
  • epoch 10: at this point, negative cases (blue line) are all correctly classified, even though their associated probabilities are still distributed up to 0.5, while the positive cases on the edges are still misclassified;
  • epochs 10–60: the loss falls until reaching a plateau, all cases have been correctly classified since epoch 52, and the network already exhibits a high level of confidence for all predictions;
  • epochs 60–150: only small improvements in the predicted probabilities happen at this point.

Well, no wonder the ReLUs are the de facto standard for activation functions nowadays. The loss kept falling steadily from the beginning and only plateaued at a level close to zero, reaching correct classification for all cases in about 75% of the time it took tanh to do it.


The animations are cool (ok, I am biased, I made them!), but not very handy to compare the overall effect of each and every different activation function on the feature space. So, to make it easier for you to compare them, here they are, side by side:

Figure 9: linear boundaries on transformed feature space (top row), non-linear boundaries on original feature space (bottom row)

What about side-by-side accuracy and loss curves, so I can also compare the training speeds? Sure, here we go:

Figure 10: accuracy and loss curves for each activation function

Final Thoughts

The example I used to illustrate this post is almost as simple as it could possibly be, and the patterns depicted in the animations are intended to give you just a general idea of the underlying mechanics of each one of the activation functions.

Besides, I got “lucky” with my initialization of the weights (maybe using 42 as seed is a good omen?!) and all three networks learned to classify all the cases correctly within 150 epochs of training. It turns out, training is VERY sensitive to the initialization, but this is a topic for a future post.

Nonetheless, I truly hope this post and its animations can give you some insights and maybe even some “a-ha!” moments while learning about this fascinating topic that is Deep Learning.

Training your empathy as a cure for cofounder problems, the #1 startup killer

I’m Jose; I run Data Science Retreat and AI Deep Dive, a 3-month school that takes pretty advanced machine learners and gets them to work together with mentors to produce a killer deep learning portfolio project. We want these projects to have social impact (example: a malaria microscope).

I’ve been doing startups for 10 years. I’m a solo founder now, and successful (bootstrapped two companies). I was miserable when I cofounded. I’ve experienced my share of hard conversations, cofounders leaving, jealousy, bickering for equity, etc. I’m happy to risk a drop in happiness cofounding again because I want to do the most good I can, and cofounders are the surest way of scaling impact.

I’m still scared of having cofounders, a single point of failure in your life. Or I was! I think I’ve found a hack that makes me feel confident that I will deal with any disagreement in a non-stressful way.

In this post, I’ll tell you about ‘Non-Violent communication’ (NVC), a technique that makes you better at ‘empathy.’ This is what I did, warts and all. It may, or may not, work for you. It’s not a surefire solution, but it is a beautiful hack, and I’m excited to see if it works for anyone else.

The hack

How could I prevent any stress that comes from human interactions in a startup? It feels like an impossible task. But I’ve been here before (we all have!): a task that feels impossible blown wide open by a simple, elegant solution. The true definition of a hack. The last time I can remember finding a hack this powerful was when I was suffering from RSI. I thought my career in front of a keyboard was over. A horrible thought. Nothing standard medicine knew worked. Then, I found a solution (that may deserve a different post, ‘how a tennis ball in a socket saved my career’).

This book: The Trigger Point Therapy Workbook (Clair and Amber Davis) was enough to help me understand the way out. There are trigger points, and you have the ability to massage yourself. You know better where it hurts. Boom. Solved problem.

When you find something that effective (and rare), you want to tell others. This is exactly why I’m writing now. I’ve found another hack that is as effective, solving a problem as large (or larger!). You can train your empathy, and have better relationships with everyone around you. Including difficult people. Including cofounders (non-overlapping sets, with any luck).

It’s all in a single book, Marshall Rosenberg’s ‘Non-Violent communication,’ aka NVC.

What is empathy?

Empathy is hard to define. For me, it’s about

1- Being present, in the moment

2- Deep curiosity for what’s going on in the other person’s life

Empathy is a language. The components are simple (like a grammar), but the implementation is not. It takes lots of practice to learn a new language, and this is also true for empathy, at least the NVC variety.

Empathy is not compassion. You can have compassion for a group of people on a different continent who are suffering. Empathy happens one on one, with a person in close proximity. This is my definition, and I’m not an expert. Fortunately, there’s a much better definition of this language that is extremely actionable: NVC. This is how it works.

Empathy, the NVC way

Rosenberg’s method is simple and elegant, like Lisp. It has the beauty of internal consistency.

An empathic conversation should go through four steps: observation, feeling, need, and request. In this post, I’m going to use a simpler version without the request at the end.


Just observe what the other person is doing that is causing you trouble, and state it. Do not provide evaluation. Otherwise, it feels like a judgment. Avoid judging people at all costs. Judging and blaming have become second nature in our culture, and you can see how extremely toxic that is. We all get intuitively that judging is bad, yet we do it all the time.


‘I have noticed that you haven’t been able to work very long the last two days because you went to a festival.’

When we judge others, we contribute to violence. Intellectual analysis is often received as criticism. Resist the urge to provide it. The ‘because’ part may be already too much.

Rosenberg has a really nice rule. Do not use ‘but.’ “Rather than putting your ‘but’ in the face of an angry person, empathize.”


Express your feelings. This is hard; some cultures make it harder than needed. There are only about a hundred emotion words; make sure you use one of them. Many sentences using ‘I feel’ are not about emotions at all. You can use simple words, like ‘sad,’ ‘angry,’ ‘disappointed.’ No need for poetry here.


‘I feel disappointed because I’m working harder than you.’

Make it clear to the other person how you feel, but don’t make her responsible for your feelings. Only you are.

Rosenberg says ‘speak your pain nakedly without blame.’ Nobody said NVC is easy.

That should be enough for this step. Outside an NVC convo, take a guess at someone else’s feelings, and tell them to see if you are right. Many people work hard at hiding their feelings and may feel exposed and uncomfortable talking about them. Others will feel elated and impressed that you nailed them. Learn to identify feelings in the stream of communication that goes on around you.


Express your need. What is ‘alive’ in you that the other person could satisfy? Note that needs create emotions. The other person doesn’t create the feeling, you do (through your needs). The other person may not be aware of your needs if you never communicated them. Rosenberg says ‘If we express our needs, we have a better chance of getting them.’ So simple, so true, so uncommon.


“I need to feel reassured that we are both totally committed to the success of our company.”

Unlike feelings, needs come in all forms; there is no shortlist. This one is a need to feel something. It could also be something more mundane, like a need for cleanliness (together with a request that you do the dishes).

Most people never communicate their needs. Instead, they expect you to infer them. This is ineffective; don’t be that person.

In the example, you were the one communicating a need. But you could be the listener. Stay in the moment, no matter how uncomfortable the silence may be, no matter how much the other person is struggling, till she has been heard.

You will ‘see’ it when they feel heard: there’s a body change, the shoulders drop, maybe a sigh, the flow of words stops. The conversation tone changes. You feel closer.

Don’t worry about the last part, the request. If you manage to see that body change you are already on your way to an agreement. Just make sure that your request is not a demand, is actionable, and worded in positive terms (i.e., don’t use ‘not’).

The more we hear them, the more they’ll hear us.

‘Ok, but does it work at all? Does it work for my particular conflict, at least?’ you may be thinking.

The challenge

I read voraciously, so I consumed all of Rosenberg’s books. I tried it with my six-year-old son and my girlfriend, with varying degrees of success.

Son: crying because his football cards fell and are out of order

Me: Are you crying because you have a need for order and the cards are on the floor?

Son: nods, feeling heard. Crying stops in a too-good-to-be-true way

One other story:

Me: (some request). To make sure you understood, could you repeat what I said to me? (part of the method)

Girlfriend: Are you running this NVC thing on me? (Angry)

You can see it didn’t really work in the second story. This was a bad experience I didn’t want to repeat. As much as I had read about it, I was not any good at NVC. I was still not any better at empathy.

People write a lot about how NVC in the hands of a newbie sounds robotic as hell and not at all like receiving empathy. This robotic delivery may backfire. There are variants, for example, ‘street NVC,’ that make it more informal. But the point is that you need to be present and curious about the other person. No scaffolding can turn a dull, uninspiring conversation into an empathic one. Don’t try to give empathy if you don’t really care about the other person.

The practice

So how do you get out of ‘robotic NVC’? Practice. The best way I’ve found is to get together with others who are also trying to get better. A meetup is ideal. In Berlin, where I lived, there was none, so I created one. We met at the DSR office, which made for a fascinating contrast (people door to door discussing hardcore algorithms for deep learning vs. people practicing empathy).

There are lots of meetups all over, but if there’s none where you live, I suggest you start one. It was extremely beneficial for me, even on my tight schedule.

NVC works really well in ‘pair programming’ mode: one person is the one who receives empathy, one gives empathy (the one with hands on the keyboard, so to speak), and one observes and gives hints. Then you rotate.

It’s not impossible to practice NVC while alone. Self-empathy is a thing. We are terribly judgmental with ourselves (self-criticism). For example, try to avoid the word ‘should’ (to yourself or to others). This word has enormous power to create blame. “Don’t do anything that isn’t play” says Rosenberg.

What else can you practice alone?

1- Observation. Practice translating judgment into something neutral.

2- Connect your feeling with your need: “I feel… because I need …”

3- Saying thank you in NVC: “This is what you did; this is how I felt; this is the need of mine that was met.” Compliments are not great. They are actually judgments, positive ones but judgments. It’s much better to know how you enriched the life of the person praising you so that you can do more of it.

4- Monitor how you talk, stop yourself if you are using judgments, ‘buts’ or ‘shoulds.’ “From the moment people begin talking about what they need rather than what’s wrong with one another, the possibility of finding ways to meet everybody’s needs is greatly increased,” says Rosenberg.

So, to answer the ‘does it work’ question… I’m a total beginner. I can tell you they are not kidding when they say NVC is like a language.

I need to practice more.

Have I experienced any benefits? I became vegan around the same time I started learning NVC. That may mean I’m improving my health. I have a much deeper appreciation of who my outstanding girlfriend is; I ‘see’ her in ways that I didn’t before. I truly love this woman. I became involved with the ‘effective altruism’ movement. I constantly think about what is violent and how we can remove violence from our communication. I come up with better ideas to use deep learning for social impact. I feel more connected to the world around me. I meet more people by chance who are in the same mental space (the DLR scholarships are an example).

But to the point: do I feel capable of starting something with cofounders? Yes, for the first time in many years. I’m confident any difficult conversation will be resolved, even with my primitive empathy training.

NVC feels as effective as that tennis ball in a sock that saved my career once.

Getting your first job as a Deep Learning engineer: The current state

This post should give you an insider view of how it feels to be in the market for a deep learning engineer job. I have interviewed thousands of people in machine learning in the last five years; for deep learning, only a few dozen in the last year. I’ve been paying attention to the market: who goes where, salaries, etc. It’s enough for me to form an impression.

How to pick your first job or company: intelligence compounds

If you go and take a relaxed, well-paid 9–5 job early in your career, you are getting paid to forgo the growth in your capabilities. Ok, that’s fine, many people take that deal. More so in Germany, the number one country on the risk-aversion scale (well, it would be off the scale if there were one). You may have better things to do with your time than just getting smarter.

But then, if intelligence compounds (as in ‘compound interest’)… Your choice has a dramatic effect in the mid or long term. Your employer should pay you a lot more for that opportunity that you are forgoing. If you take a job at, say, Deutsche Bank (I could not think of a shittier company) and stay, say, seven years, you are virtually accepting that you are never going to be competitive, even if you may have been at the beginning of those seven years.

Someone who used these seven years in situations with high growth opportunities will be very well positioned after that to solve really significant problems.

Because intelligence compounds, it’s essential to pick the right starting job in your career. Or the right startup to found if you are planning your life as an entrepreneur.

Because we spend so much of our time at work, one of the most powerful leverage points for increasing our learning rate is our choice of a work environment. How can we translate this observation into tactics?

Optimize for growth

Everyone who is any good at anything optimized for growth early in their career. Sometimes without knowing.

Small, but constant, daily growth leads to dramatic changes long term. The good news is that you don’t have to think about the long-term plan. Just make sure that today was not wasted, and that you are learning at say a 1% rate per week. That is enough. It’s vital that you measure your improvement. Track the time it takes you to finish simple programming tasks. Then plot progress every year or so. Measuring progress doesn’t work well for creative things like designing architecture or debugging some difficult bug. But it’s worth trying. You can also ask others for honest feedback. Am I getting better at this? Often people won’t tell you to your face when you are getting ‘kinda good’, but they talk to each other. It’s only by chance that you realize you are growing a reputation.
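To make the compounding concrete, here is a rough sketch of the arithmetic. It assumes that ‘skill’ can be squeezed into a single number that grows by a fixed percentage each week, which is obviously a simplification; the function name compound_growth and the 52-weeks-per-year figure are just illustrative choices, not something from a library.

```python
def compound_growth(weekly_rate: float, weeks: int) -> float:
    """Return the multiple of your starting level after `weeks` of compounding.

    Assumption: growth is a constant `weekly_rate` (0.01 means 1% per week)
    applied to whatever level you reached the week before.
    """
    return (1 + weekly_rate) ** weeks


WEEKS_PER_YEAR = 52  # illustrative; ignores holidays and bad weeks

# Compare a 1% weekly learner with someone who is standing still.
for years in (1, 3, 7):
    weeks = years * WEEKS_PER_YEAR
    growing = compound_growth(0.01, weeks)
    flat = compound_growth(0.0, weeks)
    print(f"{years} year(s): growing x{growing:.2f}, standing still x{flat:.2f}")
```

Under these assumptions, 1% a week comes out at roughly 1.7 times your starting level after one year and roughly 37 times after seven, while the person who is standing still stays at 1. The exact numbers don’t matter much; the gap between the two curves is the point.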


A work environment that iterates quickly provides a faster feedback cycle and enables you to learn at a quicker rate. Lengthy release cycles, formalized product approvals, and indecisive leadership slow down iteration speed; automation tools, lightweight approval processes, and a willingness to experiment accelerate progress. Anything that works at ‘Deutsche Bank speed’ is to be avoided.

Do people help each other to get better? Or is the culture one of ‘one-upmanship’? Do the guys who have mastery share what they know? Are there others in your team with complementary skillsets?


Empathy is a skill like any other; it improves with practice. Reading fiction improves empathy. But the best book I can think of is Marshall Rosenberg’s Nonviolent Communication (NVC). If you can speak NVC you will be a fantastic manager; probably a tremendous father, partner, anything. Bradford Cross recommends it, and I can see why. I’ve read it, and I’m extremely impressed by Rosenberg’s ingenuity. It does take practice to get it to work. Another technologist who recommends improving empathy is Chad Fowler.

Bosses: make sure you have a good one, and that you understand what each other needs. People leave bosses, not companies. If you have a bad feeling during interviews, just say no. This single person can have an outsized effect on your growth and happiness.


Do good ideas get killed in front of everyone by management?

Do bad ideas get the green light by sheer politics?

Are you treated with respect when you deliver results? Do people admire that you are effective?

Is incompetence tolerated? If so, run away. Having the freedom to fuck up and not be crucified is fantastic, but there are different types of fuck ups. Trying innovative things is bound to produce plenty of failures. But executing at world class level and failing is one thing. Failing to perform because of sheer incompetence: No. Everyone in the company should be able to tell the difference. The problem with tolerating incompetence is that it deters good people from doing their best. In fact, very likely they will leave. So finding the incompetent and ousting him or her should be everyone’s priority.

One exception to this rule: polymaths, or well-balanced generalists. They are not world-class at any one single skill. A specialist may consider their competence not up to scratch. Generalists abound in early-stage companies. In fact, without them, there would be no startups, as everyone tends to wear many hats. The guy who can code, sell, and do strategy is worth his weight in gold even if you wince when you see his commit messages. Eventually, generalists get replaced as the company grows. This is painful for the generalist, so be gentle.

People who can think strategically (e.g., find the intersection of which products can be built with current tech, and what the market is willing to pay for) are enormously valuable. More so in areas like machine learning, where everyone is trying to figure out what is plausible. If your company culture seems to be toxic for these profiles, it’s time to run.