Neural Network Basics for Beginners: Build and Train Your First MLP
_Ujjwal_

Posted on Jun 3, 2025 | AIML

Key Terms in Deep Learning – Explained Simply

Neuron: A single computing unit that processes input by applying weights, adding a bias, and passing the result through an activation function.
Activation Function: Makes the neuron's output non-linear (squashing, switching, or shaping the signal). Examples: ReLU, Sigmoid, Tanh.
Weights: Numbers the network learns that decide how important each input feature is.
Bias: An extra value added to shift a neuron's output, helping the model fit the data better.
Loss Function: A formula that measures how far the prediction is from the true answer (like a penalty score).
Learning Rate (η): A small number that controls how big the weight updates are during learning. Too high = unstable; too low = slow.
Epoch: One full cycle of training in which the model sees all the training data once.
Batch: A smaller group of training examples used in one update step (e.g., 32 or 64 samples).
Mini-Batch Gradient Descent: Updating weights using a small batch of data instead of the whole dataset or a single sample.
Forward Pass: Data moves through the network, layer by layer, producing an output.
Backpropagation: The process of calculating how much each weight contributed to the error and adjusting the weights accordingly.
Gradient: The direction and amount by which each weight needs to change to reduce the loss.
Optimizer: An algorithm that adjusts the weights to minimize the loss. Examples: SGD, Adam, RMSprop.
Overfitting: When the model learns the training data too well (including noise) but performs poorly on new, unseen data.
Underfitting: When the model is too simple to capture the underlying pattern in the data.
Regularization: Techniques such as dropout or L2 that prevent overfitting by discouraging the model from relying too heavily on any one feature.
Dropout: Randomly turning off neurons during training to make the model more robust.
Epoch vs Iteration: Epoch: one pass over the full dataset. Iteration: one update step (per batch).
Loss vs Accuracy: Loss: how wrong the model is. Accuracy: how many predictions were correct.
Training Set: The data used to teach the model.
Validation Set: Data used to tune hyperparameters and check model performance during training.
Test Set: Data never seen during training, used to evaluate final model performance.
Softmax: Turns raw output scores into probabilities for multi-class classification.
One-Hot Encoding: Represents class labels as vectors (e.g., class index 2 of 4 → [0, 0, 1, 0]).
Confusion Matrix: A table showing how many predictions were correct or incorrect for each class.
Precision & Recall: Metrics for evaluating classification. Precision: how many predicted positives were correct. Recall: how many actual positives were found.
F1 Score: The harmonic mean of precision and recall, balancing the two.
Hyperparameters: Settings chosen before training (e.g., learning rate, batch size, number of layers).
Model Parameters: The weights and biases learned by the model.
Gradient Clipping: Capping gradients so they cannot grow large enough to destabilize training.
Vanishing Gradient: When gradients become too small for early layers to keep learning; common in deep networks.
Exploding Gradient: When gradients grow too large and cause unstable updates.
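
To make a few of these terms concrete, here is a minimal NumPy sketch (the logits and the true class are made up purely for illustration) showing how softmax, one-hot encoding, and a cross-entropy-style loss fit together:

import numpy as np

# Raw output scores (logits) for one sample in a 4-class problem (made-up numbers)
logits = np.array([2.0, 0.5, -1.0, 0.1])

# Softmax: turn logits into probabilities that sum to 1
probs = np.exp(logits) / np.sum(np.exp(logits))

# One-hot encoding of the true class (class index 0 of 4 -> [1, 0, 0, 0])
true_class = 0
one_hot = np.eye(4)[true_class]

# Cross-entropy loss: penalizes putting low probability on the true class
loss = -np.sum(one_hot * np.log(probs))

print("probabilities:", np.round(probs, 3))
print("loss:", round(loss, 3))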

Neural Network Basics: A Hands-On Guide to MLPs, Activation Functions, and Learning Dynamics

Introduction

Neural networks have become foundational to modern machine learning, powering everything from image recognition to language models. In this post, we'll break down the basics of neural networks, focusing on perceptrons, multi-layer perceptrons (MLPs), activation functions, and core training principles. We'll also build a simple MLP in PyTorch, explore the role of weight initialization, and visualize decision boundaries to understand what our networks learn.


1. Perceptron: The Building Block

The perceptron is the simplest type of neural network—a single neuron. It performs a weighted sum of inputs and passes the result through an activation function.

Equation:

y = φ(wᵀx + b)

where:
  • x: input vector
  • w: weight vector
  • b: bias
  • φ: activation function (e.g., step function)

Limitations: A single-layer perceptron can only solve linearly separable problems like AND and OR—not XOR.
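
As a quick illustration, here is a minimal sketch of a single perceptron with a step activation; the weights and bias are hand-picked to implement AND, and no such choice exists for XOR:

import numpy as np

def perceptron(x, w, b):
    # Weighted sum plus bias, then a step activation function
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked weights that implement the AND gate
w, b = np.array([1.0, 1.0]), -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
# Prints 0, 0, 0, 1: AND is linearly separable.
# XOR is not, so no single perceptron can compute it.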


2. Multi-Layer Perceptron (MLP)

An MLP is a neural network with one or more hidden layers. It allows for the modeling of complex, non-linear functions.

Structure:

  • Input layer: Receives raw input.
  • Hidden layers: Transform input using learned weights and non-linear activation functions.
  • Output layer: Produces the final prediction.

Each layer applies:
h = ϕ(Wx + b)

MLPs are universal function approximators: with enough hidden neurons, they can approximate any continuous function to arbitrary accuracy.
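
To make the per-layer computation h = φ(Wx + b) concrete, here is a small sketch that applies two layers by hand; the weight values are arbitrary, chosen only for illustration, not learned:

import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, -1.0])               # input vector (2 features)

# Layer 1: 2 inputs -> 3 hidden units (arbitrary illustrative weights)
W1 = np.array([[ 0.2, -0.4],
               [ 0.7,  0.1],
               [-0.5,  0.3]])
b1 = np.array([0.1, 0.0, -0.2])
h = relu(W1 @ x + b1)                   # h = phi(W1 x + b1)

# Layer 2: 3 hidden units -> 1 output (no activation on the output here)
W2 = np.array([[0.6, -0.3, 0.8]])
b2 = np.array([0.05])
y = W2 @ h + b2

print("hidden:", h, "output:", y)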


3. Activation Functions

Activation functions introduce non-linearity, allowing networks to learn complex mappings.

  • Sigmoid: σ(x) = 1 / (1 + e^(−x)). Used for binary classification; suffers from vanishing gradients.
  • Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)). Zero-centered and better than sigmoid, but still vanishes.
  • ReLU: ReLU(x) = max(0, x). Fast and sparse; the default for hidden layers.
  • Softmax: Softmax(x_i) = e^(x_i) / Σ_j e^(x_j). Used in the output layer for multi-class classification.
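
Here is a minimal NumPy sketch of these four functions, evaluated on a few arbitrary sample values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", np.round(sigmoid(x), 3))
print("tanh:   ", np.round(np.tanh(x), 3))
print("relu:   ", np.round(relu(x), 3))
print("softmax:", np.round(softmax(x), 3))   # sums to 1 across the vector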

4. Forward Pass and Backpropagation

Forward Pass: How Neural Networks Make Predictions

Imagine you’re feeding an image of a cat into a neural network. What happens next?

In an MLP:

  • Input data is multiplied by weights and passed through an activation function (like ReLU or sigmoid).
  • This process happens layer by layer, from input to output — this is called the forward pass.
  • The final layer gives us a prediction (e.g., “80% chance this is a cat”).

Mathematically:

z = W*x + b
a = activation(z)

Where:

  • x = input
  • W = weight matrix
  • b = bias
  • a = activation output
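
A minimal PyTorch sketch of this one-layer forward pass, using random weights and a random input purely for illustration:

import torch

torch.manual_seed(0)

x = torch.randn(4)          # input vector with 4 features
W = torch.randn(3, 4)       # weight matrix: 4 inputs -> 3 neurons
b = torch.randn(3)          # one bias per neuron

z = W @ x + b               # weighted sum plus bias
a = torch.relu(z)           # activation output

print("z:", z)
print("a:", a)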

Backpropagation: Learning From Mistakes

Now that the network made a prediction, we compare it to the true label and calculate the loss.

Backpropagation is how the network learns:

  1. Compute the gradient of the loss w.r.t. each weight using the chain rule.
  2. Adjust the weights slightly in the direction that reduces the loss.
W = W - learning_rate * ∇(loss)

The error (loss) is calculated, and gradients are computed using the chain rule so the weights can be updated via gradient descent.

High-level Steps:

  1. Compute the loss (e.g., cross-entropy)
  2. Compute the gradients of the loss w.r.t. the weights
  3. Update the weights: W = W - learning_rate * ∇(loss) (see the sketch below)
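
To make the update rule concrete, here is a minimal sketch of a single gradient-descent step on one weight using PyTorch autograd; the training example and learning rate are made up:

import torch

# One made-up training example: input 2.0, target 4.0
x, y_true = torch.tensor(2.0), torch.tensor(4.0)

w = torch.tensor(1.0, requires_grad=True)   # a single learnable weight
learning_rate = 0.1

y_pred = w * x                              # forward pass
loss = (y_pred - y_true) ** 2               # squared-error loss

loss.backward()                             # backprop: compute d(loss)/dw via the chain rule

with torch.no_grad():
    w -= learning_rate * w.grad             # W = W - learning_rate * gradient
    w.grad.zero_()

print(w)  # the weight has moved toward the value that reduces the loss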

Vanishing and Exploding Gradients

Backpropagation works well — until it doesn’t.

In deep networks, gradients can become:

  • Very small (vanishing) ➜ early layers learn very slowly
  • Very large (exploding) ➜ weights blow up, training becomes unstable

This is especially problematic with activation functions like sigmoid, which squash outputs into a small range.
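
One way to see the vanishing-gradient effect is to stack many sigmoid layers and compare the gradient magnitudes reaching the first and last layers. This is a minimal sketch; the depth and layer sizes are chosen only for illustration:

import torch
import torch.nn as nn

torch.manual_seed(0)

# A deliberately deep stack of small sigmoid layers
layers = []
for _ in range(20):
    layers += [nn.Linear(8, 8), nn.Sigmoid()]
deep_net = nn.Sequential(*layers)

x = torch.randn(1, 8)
out = deep_net(x).sum()
out.backward()

first = deep_net[0].weight.grad.abs().mean().item()
last = deep_net[-2].weight.grad.abs().mean().item()
print(f"mean |grad| first layer: {first:.2e}, last layer: {last:.2e}")
# The first layer's gradient is typically orders of magnitude smaller.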


Solution: Weight Initialization

Smart weight initialization helps keep gradients in a good range. Two popular strategies:

  1. Xavier (Glorot) Initialization – best for tanh activations
  2. He Initialization – ideal for ReLU

These methods set initial weights with a variance that depends on the number of inputs/outputs per layer.
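
In PyTorch, both strategies are available through torch.nn.init. A minimal sketch of applying them to freshly created linear layers (the layer sizes are arbitrary):

import torch.nn as nn

layer_tanh = nn.Linear(256, 128)
layer_relu = nn.Linear(256, 128)

# Xavier/Glorot initialization, commonly paired with tanh
nn.init.xavier_uniform_(layer_tanh.weight)
nn.init.zeros_(layer_tanh.bias)

# He/Kaiming initialization, commonly paired with ReLU
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity='relu')
nn.init.zeros_(layer_relu.bias)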


5. Build an MLP in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate toy data
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

# Split data, then convert to tensors
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)

# Define model
class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 2)
        )

    def forward(self, x):
        return self.net(x)

model = SimpleMLP()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
---
Thanks for Reading
---
