Artificial Neural Network Deep Dive

Activation Functions in NN

Learning Outcome

  1. Neural Networks Without Activation Functions
  2. Mathematical Intuition Behind Non-Linearity
  3. Vanishing Gradient Problem
  4. ReLU and the Deep Learning Revolution
  5. Improving ReLU with Leaky ReLU

Recall

In the Perceptron model, a neuron calculates the net input using the formula:
Net Input (z) = (Weights × Inputs) + Bias

The neuron calculates a weighted sum of inputs and adds a bias to get the net input value.

After calculating the sum, the neuron must decide what to do with the value.

No matter how large the result is, the neuron should not pass the raw value directly to the next layer.

A filter decides whether and how strongly the neuron activates.
This filter is called the Activation Function.


Sigmoid (The S-Curve)

The Sigmoid function is one of the earliest activation functions used in neural networks.

Its job is to take any input value and convert it into a number between 0 and 1.

So no matter how big or small the input is, the output will always be:

0 < Output < 1

Because of this property, it is often interpreted as a probability.

So the model can answer Yes/No type questions.

These are called binary classification problems.

The Sigmoid function is called smooth because the curve changes gradually and continuously without any sudden jumps or sharp corners.

Example:

Input     Sigmoid Output
10        0.99995
100       ≈ 1.00000
1000      ≈ 1.00000

Even though the inputs are very different, the outputs become almost the same.

So the curve at the edges becomes flat.

When the curve is flat, the "gradient" (the learning signal) becomes zero.

During Backpropagation, this near-zero gradient is multiplied layer by layer, so the earliest layers receive almost no learning signal and the network effectively stops learning. Deep networks "starve".
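The flattening can be checked numerically. The sketch below (my own illustration, not from the slides) computes the sigmoid and its derivative, which is the gradient used during backpropagation:

```python
import math

def sigmoid(z):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0, 10, 100]:
    print(f"z={z:<4} sigmoid={sigmoid(z):.5f} gradient={sigmoid_grad(z):.2e}")
```

The gradient peaks at 0.25 (at z = 0) and collapses toward zero on the flat edges of the curve.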

Tanh (The Zero-Centered S-Curve)

The Tanh (Hyperbolic Tangent) activation function is very similar to the Sigmoid function, but with one important improvement.

  • Sigmoid output range: 0 → 1
  • Tanh output range: −1 → 1

So instead of compressing values only into positive numbers, Tanh allows both negative and positive outputs.

Think of it like this:

Input Value                     Tanh Output
Very large positive number      Close to +1
0                               0
Very large negative number      Close to −1

So Tanh squishes any number into the range −1 to 1

Why Researchers Created Tanh?

Sigmoid had a problem: outputs are always positive.

Example:

Input     Sigmoid Output
-5        0.007
0         0.5
5         0.993

Notice something:

All outputs are positive numbers.

During gradient descent optimization, these all-positive outputs cause unbalanced, zig-zagging weight updates.

Researchers solved this by creating Tanh, which is zero-centered.
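A quick numeric comparison (illustrative values only) shows why Tanh is called zero-centered while Sigmoid is not:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Sigmoid stays strictly positive; tanh is symmetric around zero.
for z in [-2, 0, 2]:
    print(f"z={z:+d}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):+.3f}")
```

Sigmoid maps −2, 0, 2 to 0.119, 0.5, 0.881 (all positive), while tanh maps them to −0.964, 0, +0.964 (balanced around zero).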

Before ReLU – The Problem

Early activation functions like Sigmoid and Tanh had a big issue called the Vanishing Gradient Problem.

When networks became very deep, the learning signal became almost zero.

Because of this, deep neural networks could not learn properly.

This slow progress period in AI is often called the AI Winter.

The Revolution: ReLU (Rectified Linear Unit)

Researchers introduced a very simple activation function called ReLU, and it changed Deep Learning.

The rule is extremely simple:

  • If the input is negative → output = 0
  • If the input is positive → output = same value

The mathematical function is:

f(x) = max(0, x)

Why ReLU Became a Superpower

Look at the positive side of the ReLU graph:

  • It is a straight diagonal line
  • The gradient is always 1
  • The gradient never becomes zero

Because of this:

  • Gradients do not vanish
  • Deep networks continue learning
  • Models can now have hundreds of layers

ReLU solved the Vanishing Gradient Problem, but it introduced a new issue called the "Dying ReLU Problem."

Think of a neuron like a switch:

  • If the input is positive → it works normally
  • If the input is negative → ReLU outputs 0

Over time during training, some neurons can get stuck in the negative region: their output and gradient are then permanently zero, so they stop learning. The neuron has "died".
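Both rules fit in a few lines. Leaky ReLU (listed in the outline as the fix) differs only on the negative side, where a small slope (commonly α = 0.01) keeps the gradient from being exactly zero. This is a minimal sketch, not a specific library's implementation:

```python
def relu(z):
    # Negative inputs are cut to 0 -- this is where a neuron can "die".
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # A small slope on the negative side keeps a tiny gradient alive.
    return z if z > 0 else alpha * z

print(relu(-3.0), leaky_relu(-3.0))  # ReLU gives 0.0; Leaky ReLU gives a small negative value
```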

Summary

  1. Hidden Layers: Use ReLU because it is fast and reduces the vanishing gradient problem.
  2. Alternative: Use Leaky ReLU if neurons stop learning (Dying ReLU problem).
  3. Binary Classification Output: Use Sigmoid (outputs a probability between 0 and 1).
  4. Multi-Class Classification Output: Use Softmax (converts outputs into probabilities that sum to 1).
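Since the summary mentions Softmax for multi-class outputs, here is a minimal sketch of it (my own illustration; the scores are made-up values):

```python
import math

def softmax(scores):
    # Subtracting the max before exponentiating is a standard
    # numerical-stability trick; it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # the probabilities sum to 1
```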

Quiz

Why was the Sigmoid activation function abandoned in deep neural network hidden layers?

A. It outputs only negative numbers

B. It is purely linear and cannot learn complex patterns

C. Large inputs flatten the curve, causing the Vanishing Gradient problem

D. It causes the Dying ReLU problem

Quiz-Answer

Why was the Sigmoid activation function abandoned in deep neural network hidden layers?

Correct answer: C. Large inputs flatten the curve, causing the Vanishing Gradient problem.

By Content ITV