
Demystifying Neural Networks: Building One From Scratch (No Magic Involved!)
Table of Contents
- The Theory Behind the Magic: How Neural Networks Learn
- Key Concepts for Further Learning
- Code Walkthrough: Our Tiny Neural Network
- Explanation of the Code
- Beyond the Basics: Important Considerations
Have you ever wondered how machines "learn"? How do they recognize faces, understand speech, or even recommend your next favorite movie? The answer, in many cases, lies in the fascinating world of Neural Networks.
While often portrayed as complex black boxes, the fundamental principles of neural networks are surprisingly intuitive. At their core, they are inspired by the human brain, aiming to mimic how our neurons fire and connect to process information.
In this blog post, we'll strip away the layers of abstraction and build a simple neural network from the ground up using Python and NumPy. Our goal isn't to create a state-of-the-art AI, but rather to illuminate the core mechanics that drive even the most sophisticated deep learning models.
The Theory Behind the Magic: How Neural Networks Learn
Imagine a neural network as a series of interconnected "neurons" organized into layers:
- Input Layer: This is where our data enters the network. Each input feature corresponds to a neuron.
- Hidden Layers: These layers are the "thinking" part of the network. Neurons in hidden layers perform calculations on the inputs they receive from the previous layer. A network can have one or many hidden layers.
- Output Layer: This layer produces the network's final prediction or decision.
How do these neurons work?
Each connection between neurons has a weight associated with it. Think of weights as the "strength" of the connection. Additionally, each neuron has a bias, which is an extra value added to the weighted sum of its inputs.
When data flows through the network (this is called feedforward), each neuron in a hidden or output layer performs two main steps:
- Weighted Sum: It takes the outputs from the previous layer, multiplies them by their respective weights, and adds its bias.
- Activation Function: The result of the weighted sum is then passed through an activation function. This function introduces non-linearity into the network, allowing it to learn complex patterns. Common activation functions include Sigmoid, ReLU, and Tanh. The Sigmoid function, which squashes values between 0 and 1, is a classic choice for understanding the basics.
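To make the two steps concrete, here is a minimal sketch of a single neuron. The input values, weights, and bias are made up purely for illustration; they are not taken from the network we build later.

```python
import numpy as np

def sigmoid(x):
    # Squash any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values, chosen only so the arithmetic is easy to follow.
inputs = np.array([2.0, 3.0])    # outputs of the previous layer
weights = np.array([0.0, 1.0])   # one weight per incoming connection
bias = 4.0

# Step 1: weighted sum -> 0*2 + 1*3 + 4 = 7
weighted_sum = np.dot(weights, inputs) + bias

# Step 2: activation function
activation = sigmoid(weighted_sum)

print(weighted_sum)            # 7.0
print(round(activation, 3))    # 0.999
```

Every neuron in the hidden and output layers repeats this same pattern, just with its own weights and bias.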
So, how does the network "learn"?
Learning in a neural network is an iterative process of adjusting the weights and biases to minimize the difference between its predictions and the actual true values. This is where backpropagation comes into play.
- Loss Function: We first need a way to measure how "wrong" our predictions are. This is done using a loss function (also known as a cost function). A popular choice for regression problems is the Mean Squared Error (MSE), which calculates the average of the squared differences between predicted and true values.
- Gradient Descent: The core idea behind learning is to find the set of weights and biases that minimize this loss. We use an optimization algorithm called Gradient Descent. Imagine the loss function as a mountainous landscape; gradient descent helps us find the lowest point (the minimum loss) by taking small steps in the direction of steepest descent.
- Backpropagation: To know which direction to take those steps, we need to calculate the gradient of the loss function with respect to each weight and bias. This is precisely what backpropagation does. It efficiently calculates these gradients by propagating the error backwards through the network, from the output layer to the input layer. The chain rule of calculus is fundamental here.
- Weight and Bias Updates: Once we have the gradients, we update the weights and biases by subtracting a small fraction of their respective gradients. This fraction is determined by the learning rate, a hyperparameter that controls how big our steps are during optimization. A carefully chosen learning rate is crucial for effective learning.
This entire process – feedforward, calculating loss, backpropagation, and updating weights – is repeated for many epochs (passes through the entire dataset) until the network's predictions are sufficiently accurate.
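Before applying this to a full network, it can help to watch gradient descent on the simplest possible case. The sketch below minimizes a toy one-parameter loss; the loss function, starting point, and step count are assumptions made for illustration only.

```python
def loss(w):
    # A toy loss whose minimum sits at w = 3 (hypothetical example).
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0            # arbitrary starting point
learn_rate = 0.1   # step size

for step in range(50):
    # Step against the gradient, i.e. downhill on the loss landscape.
    w -= learn_rate * grad(w)

print(w, loss(w))
```

After 50 steps, w ends up very close to 3 and the loss is close to zero. A neural network does the same thing, only with many parameters at once and with gradients supplied by backpropagation.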
Key Concepts for Further Learning
Our simple neural network touches upon several fundamental concepts in machine learning and deep learning. To deepen your understanding, consider exploring:
- Different Activation Functions: ReLU, Leaky ReLU, Tanh, Softmax – each has its advantages and disadvantages.
- Optimization Algorithms: Beyond basic Gradient Descent, learn about Adam, RMSprop, SGD with momentum, etc., which often lead to faster and more stable training.
- Regularization Techniques: L1/L2 regularization, Dropout – methods to prevent overfitting, where the model performs well on training data but poorly on unseen data.
- Network Architectures: Convolutional Neural Networks (CNNs) for image processing, Recurrent Neural Networks (RNNs) for sequential data, Transformers for natural language processing.
- Hyperparameter Tuning: Understanding how to select the best learning rate, number of epochs, network size, etc.
- Frameworks: Libraries like TensorFlow and PyTorch abstract away much of the low-level implementation, allowing you to build and train complex networks more efficiently.
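As a starting point for the first item, here is a rough sketch of a few of those activation functions in NumPy. The slope alpha=0.01 for Leaky ReLU is a commonly used default, not something prescribed by this post.

```python
import numpy as np

def relu(x):
    # Pass positive values through, clamp negatives to zero.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negatives keep a small slope (alpha is an assumed default).
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))
print(leaky_relu(x))
print(np.tanh(x))      # Tanh squashes values into (-1, 1)
print(softmax(x))      # non-negative values that sum to 1
```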
Code Walkthrough: Our Tiny Neural Network
import numpy as np

# --- Activation Functions and Loss Function ---

def sigmoid(x):
    # Sigmoid activation function: f(x) = 1 / (1 + e^(-x))
    # This function squashes any input value between 0 and 1.
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # Derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
    # This is crucial for backpropagation to calculate gradients.
    fx = sigmoid(x)
    return fx * (1 - fx)

def mse_loss(y_true, y_pred):
    # Mean Squared Error loss function.
    # Measures the average of the squared differences between true and predicted values.
    return ((y_true - y_pred) ** 2).mean()

# --- Our Neural Network Class ---

class OurNeuralNetwork:
    '''
    A neural network with:
      - 2 inputs
      - a hidden layer with 2 neurons (h1, h2)
      - an output layer with 1 neuron (o1)

    *** DISCLAIMER ***:
    The code below is intended to be simple and educational, NOT optimal.
    Real neural net code looks nothing like this. DO NOT use this code.
    Instead, read/run it to understand how this specific network works.
    '''

    def __init__(self):
        # Initialize weights and biases with random normal values.
        # These are the parameters the network will learn.
        self.w1 = np.random.normal()
        self.w2 = np.random.normal()
        self.w3 = np.random.normal()
        self.w4 = np.random.normal()
        self.w5 = np.random.normal()
        self.w6 = np.random.normal()
        self.b1 = np.random.normal()
        self.b2 = np.random.normal()
        self.b3 = np.random.normal()

    def feedforward(self, x):
        # The feedforward process: calculating outputs for given inputs.
        # x is a numpy array with 2 elements (our two input features).

        # Hidden neuron h1
        sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
        h1 = sigmoid(sum_h1)

        # Hidden neuron h2
        sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
        h2 = sigmoid(sum_h2)

        # Output neuron o1
        sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
        o1 = sigmoid(sum_o1)
        return o1

    def train(self, data, all_y_trues):
        '''
        - data is a (n x 2) numpy array, n = # of samples in the dataset.
        - all_y_trues is a numpy array with n elements.
          Elements in all_y_trues correspond to those in data.
        '''
        learn_rate = 0.1  # How much we adjust weights/biases in each step.
        epochs = 1000     # Number of times to loop through the entire dataset.

        for epoch in range(epochs):
            for x, y_true in zip(data, all_y_trues):
                # --- Feedforward for the current sample (x, y_true) ---
                # We need to store these intermediate sums and activated values
                # because they are used in the backpropagation step.
                sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
                h1 = sigmoid(sum_h1)

                sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
                h2 = sigmoid(sum_h2)

                sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
                o1 = sigmoid(sum_o1)
                y_pred = o1  # Our network's prediction

                # --- Backpropagation: Calculate partial derivatives ---
                # This is where the error is propagated backward to find how much
                # each weight and bias contributed to the error.

                # Partial derivative of Loss with respect to y_pred (output)
                d_L_d_ypred = -2 * (y_true - y_pred)

                # Neuron o1 gradients
                # d_ypred_d_w5: How much y_pred changes with respect to w5
                # (chain rule: d_ypred_d_sum_o1 * d_sum_o1_d_w5)
                d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
                d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
                d_ypred_d_b3 = deriv_sigmoid(sum_o1)

                # d_ypred_d_h1: How much y_pred changes with respect to h1 (output of hidden neuron 1)
                d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
                d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

                # Neuron h1 gradients (with respect to its weights and bias)
                d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
                d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
                d_h1_d_b1 = deriv_sigmoid(sum_h1)

                # Neuron h2 gradients
                d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
                d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
                d_h2_d_b2 = deriv_sigmoid(sum_h2)

                # --- Update weights and biases ---
                # Each weight/bias is updated by subtracting learn_rate * (total gradient)
                # The total gradient is found by multiplying the chain of partial derivatives.

                # Neuron h1 updates
                self.w1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
                self.w2 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
                self.b1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1

                # Neuron h2 updates
                self.w3 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
                self.w4 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
                self.b2 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2

                # Neuron o1 updates
                self.w5 -= learn_rate * d_L_d_ypred * d_ypred_d_w5
                self.w6 -= learn_rate * d_L_d_ypred * d_ypred_d_w6
                self.b3 -= learn_rate * d_L_d_ypred * d_ypred_d_b3

            # --- Monitor Loss ---
            # Periodically print the total loss to see how the network is learning.
            if epoch % 10 == 0:
                y_preds = np.apply_along_axis(self.feedforward, 1, data)
                loss = mse_loss(all_y_trues, y_preds)
                print("Epoch %d loss: %.3f" % (epoch, loss))

# --- Dataset for demonstration ---
# This small dataset aims to classify people based on two features (e.g., age and weight)
# into two categories (e.g., "likes tea" vs "doesn't like tea").
data = np.array([
    [-2, -1],    # Alice (e.g., young, slim -> likes tea)
    [25, 6],     # Bob (e.g., old, heavy -> doesn't like tea)
    [17, 4],     # Charlie (e.g., middle-aged, average -> doesn't like tea)
    [-15, -6],   # Diana (e.g., very young, very slim -> likes tea)
])
all_y_trues = np.array([
    1,  # Alice likes tea
    0,  # Bob doesn't like tea
    0,  # Charlie doesn't like tea
    1,  # Diana likes tea
])

# --- Train and Test ---
network = OurNeuralNetwork()
network.train(data, all_y_trues)

# --- Make some predictions after training ---
print("\n--- Making new predictions ---")

# Example: What about someone similar to Alice?
emily = np.array([-7, -3])  # Emily
print("Emily ([-7, -3]): %.3f" % network.feedforward(emily))  # Should be closer to 1

# Example: What about someone similar to Bob?
frank = np.array([20, 2])  # Frank
print("Frank ([20, 2]): %.3f" % network.feedforward(frank))  # Should be closer to 0

Explanation of the Code:
sigmoid(x) and deriv_sigmoid(x):
sigmoid: This is our chosen activation function. It squashes any input value into a range between 0 and 1, making it suitable for binary classification outputs or for internal neuron activations.
deriv_sigmoid: The derivative of the sigmoid function is crucial for backpropagation. It tells us the slope of the sigmoid curve at any given point, which is essential for calculating how much to adjust the weights and biases.
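If you want to convince yourself that the derivative formula is right, one quick sanity check is to compare it against a numerical slope. The sketch below does that with a central finite difference; the test point x = 0.5 and the step size eps are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

x = 0.5      # arbitrary test point
eps = 1e-6   # small step for the finite-difference approximation

# Slope estimated from two nearby points on the sigmoid curve.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = deriv_sigmoid(x)

print(abs(numeric - analytic) < 1e-8)   # True
print(deriv_sigmoid(0.0))               # 0.25, the steepest point of the curve
```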
mse_loss(y_true, y_pred):
This function calculates the Mean Squared Error. It quantifies the difference between the network's predictions (y_pred) and the actual target values (y_true). Our goal during training is to minimize this loss.
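As a quick worked example, suppose a completely untrained network predicted 0 for every sample. The predictions here are hypothetical; the labels match the ones used in our dataset.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])   # hypothetical predictions

# Squared errors are [1, 0, 0, 1], so the mean is 2 / 4.
print(mse_loss(y_true, y_pred))   # 0.5
```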
OurNeuralNetwork Class:
__init__(self):
This is the constructor. It initializes the network's weights (w1 to w6) and biases (b1 to b3) with random values drawn from a standard normal distribution (np.random.normal() defaults to mean 0 and standard deviation 1). Random initialization is important because it breaks symmetry: if both hidden neurons started with identical weights, they would receive identical updates and could never learn different features.
Our network has two input neurons, two hidden neurons (h1, h2), and one output neuron (o1). The weights connect neurons between layers (e.g., w1 connects input 1 to h1, w5 connects h1 to o1).
feedforward(self, x):
This method implements the forward pass of the network. Given an input x (a 2-element array), it calculates the activated outputs of h1, h2, and finally o1.
Each hidden neuron's output is calculated as sigmoid(weighted_sum_of_inputs + bias).
The output neuron's calculation is similar, taking inputs from the hidden layer.
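The class spells out every weight by name to keep things readable. The same forward pass can also be written with matrices, which is closer to how real libraries do it. The sketch below is an alternative formulation, not the code from this post, and the parameter values are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical parameter values, arranged as arrays.
W_hidden = np.array([[0.0, 1.0],    # w1, w2 (feed h1)
                     [0.0, 1.0]])   # w3, w4 (feed h2)
b_hidden = np.array([0.0, 0.0])     # b1, b2
w_out = np.array([0.0, 1.0])        # w5, w6 (feed o1)
b_out = 0.0                         # b3

def feedforward(x):
    # Both hidden neurons are computed in a single matrix-vector product.
    h = sigmoid(W_hidden @ x + b_hidden)
    return sigmoid(w_out @ h + b_out)

x = np.array([2.0, 3.0])
print(round(feedforward(x), 4))   # 0.7216
```

With these values, both hidden neurons output sigmoid(3), and the output neuron then applies sigmoid to that value.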
train(self, data, all_y_trues):
This is the core learning loop.
- learn_rate: A critical hyperparameter. It determines the step size taken during gradient descent. A too-large learning rate can overshoot the minimum, while a too-small one can make training very slow.
- epochs: The number of times the network iterates over the entire dataset. More epochs generally lead to better learning, but too many can lead to overfitting.
Inner Loop (for x, y_true in zip(data, all_y_trues)):
For each training example (x, y_true):
- Feedforward: It first performs a feedforward pass to get the y_pred for the current input. It also keeps the intermediate sum_h1, sum_h2, and sum_o1 values, because the sigmoid derivative has to be evaluated at exactly those points during backpropagation.
- Backpropagation (Partial Derivatives Calculation): This is the most mathematically intensive part. It calculates how much the loss changes with respect to each weight and bias. This involves applying the chain rule repeatedly, working backward from the output layer to the hidden layer.
  - d_L_d_ypred: The derivative of the squared error for the current sample with respect to the predicted output.
  - d_ypred_d_wX, d_ypred_d_bX: These represent how the output neuron's activation changes with respect to its weights and bias.
  - d_ypred_d_hX: How the output neuron's activation changes with respect to the hidden neuron's activation. This is where the error propagates back.
  - d_hX_d_wX, d_hX_d_bX: How the hidden neuron's activation changes with respect to its own weights and bias.
- Weight and Bias Updates: Each weight and bias is then updated using the calculated gradients and the learn_rate. The formula self.wX -= learn_rate * total_gradient is the essence of gradient descent. Because the update happens after every single sample rather than after the whole dataset, this variant is known as stochastic gradient descent.
Loss Monitoring: Every 10 epochs, after the inner loop finishes, the code calculates and prints the MSE loss across the entire dataset. This helps us observe the network's learning progress. Ideally, the loss should decrease over time.
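Hand-derived gradients are easy to get wrong, so a common sanity check is to compare one of them against a numerical estimate. The sketch below does this for w1 only, using the same chain-rule product as the train method. The parameter values and the sample are made up, and the helper functions predict and loss are introduced here just for the check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

def predict(params, x):
    # Same 2-2-1 architecture as OurNeuralNetwork, with parameters in one array.
    w1, w2, w3, w4, w5, w6, b1, b2, b3 = params
    h1 = sigmoid(w1 * x[0] + w2 * x[1] + b1)
    h2 = sigmoid(w3 * x[0] + w4 * x[1] + b2)
    return sigmoid(w5 * h1 + w6 * h2 + b3)

def loss(params, x, y_true):
    # Squared error for a single sample.
    return (y_true - predict(params, x)) ** 2

# Hypothetical parameters (w1..w6, b1..b3) and one training sample.
params = np.array([0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.0, 0.1, -0.1])
x, y_true = np.array([-2.0, -1.0]), 1.0

# Analytic gradient for w1 via the chain rule, as in train().
w1, w2, w3, w4, w5, w6, b1, b2, b3 = params
sum_h1 = w1 * x[0] + w2 * x[1] + b1
sum_h2 = w3 * x[0] + w4 * x[1] + b2
h1, h2 = sigmoid(sum_h1), sigmoid(sum_h2)
sum_o1 = w5 * h1 + w6 * h2 + b3
y_pred = sigmoid(sum_o1)
d_L_d_ypred = -2 * (y_true - y_pred)
d_ypred_d_h1 = w5 * deriv_sigmoid(sum_o1)
d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
analytic = d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1

# Numerical gradient: nudge w1 up and down and measure the change in loss.
eps = 1e-6
up, down = params.copy(), params.copy()
up[0] += eps
down[0] -= eps
numeric = (loss(up, x, y_true) - loss(down, x, y_true)) / (2 * eps)

print(abs(analytic - numeric) < 1e-8)   # True
```

If the two numbers disagreed, that would point to a mistake somewhere in the chain of partial derivatives.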
Dataset and Training:
A small data array represents our training examples, and all_y_trues are the corresponding true labels.
An instance of OurNeuralNetwork is created, and the train method is called to start the learning process.
Making New Predictions:
After training, you can use the feedforward method to get predictions for new, unseen data, demonstrating what the network has learned.
Beyond the Basics: Important Considerations
1. The Importance of Data: Garbage In, Garbage Out
Quantity and Quality: Neural networks are data-hungry. They need a substantial amount of data to learn meaningful patterns, and the quality of that data is as important as the quantity. Noisy, incomplete, or biased data will lead to poor model performance.
Data Preprocessing: Raw data almost always needs preprocessing (e.g., scaling, normalization, handling missing values, encoding categorical data) before a network can consume it effectively. In our simplified example, the data is already "clean."
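As one example of preprocessing, here is a minimal sketch of standardization: rescaling each feature to zero mean and unit standard deviation. The raw values and their units are invented for illustration.

```python
import numpy as np

# Hypothetical raw features: age in years, weight in kg.
raw = np.array([
    [23.0, 55.0],
    [50.0, 90.0],
    [42.0, 78.0],
    [10.0, 30.0],
])

# Compute statistics per column (i.e., per feature).
mean = raw.mean(axis=0)
std = raw.std(axis=0)

# Standardize: every feature now has mean 0 and standard deviation 1.
scaled = (raw - mean) / std

print(scaled)
```

When predicting on new data, you would reuse the mean and std computed from the training set rather than recomputing them.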
2. Overfitting and Underfitting: Common Pitfalls
Underfitting: Underfitting occurs when the model is too simple to capture the underlying patterns in the data. It performs poorly on both training and new data.
(Analogy: like trying to learn complex calculus with only basic arithmetic).
Overfitting: Overfitting occurs when the model learns the training data too well, including its noise and idiosyncrasies. It performs very well on training data but poorly on unseen data.
(Analogy: like memorizing answers to a test without understanding the concepts: you'll fail a slightly different test).
How to Combat (Briefly):
- More Data: The best solution if available.
- Regularization: Techniques such as L1/L2 regularization and Dropout, listed under "Key Concepts for Further Learning" above.
- Early Stopping: Stopping training when performance on a validation set starts to degrade, even if the training loss is still decreasing.
- Cross-Validation: A technique to get a more robust estimate of model performance and detect overfitting.
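To illustrate early stopping, the sketch below decides when to stop given a list of validation losses, one per epoch. The function name find_stopping_epoch, the patience parameter, and the loss values are all hypothetical; in a real setup the losses would come from evaluating the network on held-out data after each epoch.

```python
def find_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # New best: remember it and reset the counter.
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Validation loss has not improved for `patience` epochs.
                return epoch, best_epoch

    # Never triggered: training ran for all epochs.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving at first, then degrading.
val_losses = [0.50, 0.40, 0.33, 0.30, 0.31, 0.32, 0.35, 0.40]
stop_epoch, best_epoch = find_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)   # 6 3
```

In practice you would also keep a copy of the weights from best_epoch and restore them after stopping.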
3. Hyperparameters vs. Parameters
Parameters: Weights and biases are the parameters of the model. These are the values the network learns during training.
Hyperparameters: The learning rate, number of epochs, and network architecture (number of hidden layers, number of neurons per layer) are hyperparameters. These are values that you, the human, set before training begins, and they significantly influence how well the network learns. Choosing good hyperparameters often involves experimentation or more systematic hyperparameter tuning techniques.
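The difference is easy to see on a toy problem. In the sketch below, w is the parameter (it gets learned), while learn_rate and steps are hyperparameters (we pick them). The toy loss and the three learning rates are assumptions chosen to show the effect.

```python
def final_distance(learn_rate, steps=20):
    # Minimize the toy loss L(w) = w**2, whose minimum is at w = 0.
    w = 1.0   # the parameter: updated by gradient descent
    for _ in range(steps):
        w -= learn_rate * 2.0 * w   # gradient of w**2 is 2*w
    # Distance from the minimum after training.
    return abs(w)

# The hyperparameter: chosen by us before training starts.
for learn_rate in (0.01, 0.1, 1.1):
    print(learn_rate, final_distance(learn_rate))
```

With the smallest rate, w is still far from 0 after 20 steps; with 0.1 it gets close; with 1.1 every step overshoots and w moves further away instead of converging.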
4. Different Types of Neural Networks (Briefly)
A few common architectures and their general use cases:
- Convolutional Neural Networks (CNNs): Excellent for image processing (e.g., image recognition, object detection).
- Recurrent Neural Networks (RNNs) / LSTMs: Ideal for sequential data like text, speech, and time series (e.g., natural language processing, speech recognition).
- Generative Adversarial Networks (GANs): Used for generating new data (e.g., realistic images, music).
This is only a glimpse into the vast landscape of neural network applications beyond our simple feedforward example.
5. The "Black Box" Problem (and Interpretability)
While neural networks are powerful, their decision-making process can be opaque, which is often referred to as the "black box" problem.
Interpretability is an active area of research, as understanding why a network makes a certain prediction is crucial in sensitive applications (e.g., medical diagnosis, legal decisions).
6. Computational Resources
Training complex neural networks, especially deep ones, requires significant computational resources (GPUs, TPUs). This is why cloud computing platforms are often used. Our simple network trains quickly on a CPU, but larger models do not.
7. Ethics and Bias
AI models, including neural networks, can inherit biases present in their training data. This can lead to unfair or discriminatory outcomes. Ethical considerations in AI development are paramount.