<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Data Science/Engineering Insights]]></title><description><![CDATA[Data Science/Engineering Insights]]></description><link>https://hddatascience.tech</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 12:40:22 GMT</lastBuildDate><atom:link href="https://hddatascience.tech/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The LLM Council and the Human Mind]]></title><description><![CDATA[Weeks ago, Andrej Karpathy (former Director of AI at Tesla) launched LLM Council. The concept is brilliant but simple: instead of asking a single AI model (like ChatGPT) to answer a question, you create a "council" of different models. You have one model...]]></description><link>https://hddatascience.tech/the-llm-council-and-the-human-mind</link><guid isPermaLink="true">https://hddatascience.tech/the-llm-council-and-the-human-mind</guid><category><![CDATA[AI]]></category><category><![CDATA[meditation]]></category><category><![CDATA[chain of thought]]></category><category><![CDATA[psychology]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Tue, 16 Dec 2025 00:43:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765845668409/e3d3da8d-70ec-4518-a0fc-240f9199ba0a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Weeks ago, Andrej Karpathy (former Director of AI at Tesla) launched <strong>LLM Council</strong>. The concept is brilliant but simple: instead of asking a single AI model (like ChatGPT) to answer a question, you create a "council" of different models. You have one model draft an answer, another critique it, and a "Chairman" model make the final decision.</p>
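<p>To make the idea concrete, here is a minimal sketch of such a pipeline. The <code>ask()</code> helper and the model names are hypothetical stand-ins, not Karpathy's actual implementation; a real version would call LLM APIs instead of returning stub strings.</p>

```python
# Hypothetical sketch of a "council" pipeline. ask() and the model
# names are invented stand-ins; a real version would call LLM APIs.
def ask(model, prompt):
    # Stub: in practice this would be an API call to the named model.
    return f"[{model}] response to: {prompt}"

def council(question, members, chairman):
    # 1. Each council member drafts an independent answer.
    drafts = {m: ask(m, question) for m in members}
    # 2. Each member critiques the full set of drafts.
    critiques = {m: ask(m, f"Critique these drafts: {sorted(drafts.values())}")
                 for m in members}
    # 3. The Chairman reads everything and makes the final decision.
    summary = f"Question: {question}\nDrafts: {drafts}\nCritiques: {critiques}"
    return ask(chairman, summary)

answer = council("What is consciousness?", ["model-a", "model-b"], "chairman")
```

<p>The point is structural: the final answer is conditioned on drafts <em>and</em> critiques, not on a single model's first guess.</p>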
<p>Generative AI is known to hallucinate every now and then: it makes things up or, at worst, replies with outright nonsense. But when you force it to debate, reflect, and critique, the quality of its answers skyrockets.</p>
<p>As I read the documentation, I realized this isn't just a new way to code. This is exactly how a healthy human mind works.</p>
<h2 id="heading-the-internal-negotiation"><strong>The Internal Negotiation</strong></h2>
<p>Again, this reminded me of the lessons Jordan Peterson teaches on meditation and prayer. The way we progress as individuals is through managing our thoughts. “What is it that I truly want?” He highlighted that we should learn to think deeply about what we aspire to be, or what we feel the greatest good in the world is, and plan our actions accordingly.</p>
<p>It goes without saying that we should learn to negotiate with ourselves; nothing good comes out of tyranny. That means negotiating a fair reward system for our own efforts.</p>
<p>We think of ourselves as a single person, but we are actually a noisy room of internal agents. And just like software, if we don't generate "logs" or if we don't slow down to meditate or pray, we crash.</p>
<h2 id="heading-internal-agents">Internal Agents</h2>
<p>If you look inside your own head, you rarely find a single opinion. You find a negotiation.</p>
<ul>
<li><p><strong>The Fear Agent (The Amygdala):</strong> You might recognize this as the internal voice you had as a child, alone in a dark part of the house, or the voice that speaks up whenever you are about to send that crucial work email.</p>
</li>
<li><p><strong>The Dopamine Agent:</strong> This part of you wants the short-term reward. It wants the sweets, the fast money, the scroll on TikTok. You know it’s successfully taken over the moment you choose video games over your work or schoolwork. It optimizes for immediate gratification.</p>
</li>
<li><p><strong>The Long-Term Agent:</strong> This is the part of you that wants deep success, health, and meaning.</p>
</li>
</ul>
<p>Most people live their lives on autopilot. They let the loudest agent (usually fear or dopamine) act without taking time to reflect. They react immediately. In AI terms, this can be thought of as <strong>Hallucination</strong>, a confident but wrong output. I may be reaching a bit in comparing hallucination to the small, wrong decisions we make day to day, but the idea stands: we do the wrong thing because we never thought it through.</p>
<h2 id="heading-meditation">Meditation</h2>
<p>I view meditation as a form of <strong>First Principles Thinking</strong>. It is the act of clearing the "Context Window".</p>
<p>When life gets overwhelming, our internal RAM (random access memory) gets full of noise, stress, opinions, social media, and many other useless (or useful) things. If you try to make a decision in that state, you will fail. Meditation can be thought of as hitting the reset button. It wipes the cache.</p>
<p>It allows me to switch from "Zero-Shot" reacting to <strong>Chain of Thought</strong> reasoning. I can sit back and look at my thoughts from a third-person perspective. I can "judge the judger." I can ask: <em>Why am I afraid? Is there an actual danger? Can I do it afraid anyway?</em></p>
<p>It’s not about deleting the existing agents. As Jordan Peterson says, you can't tyrannize yourself. If you try to crush your fear or starve your desires, they will rebel. You have to negotiate. You have to be the Chairman of the Council, listening to the fear, acknowledging it, but ultimately deciding to follow the Long-Term Agent.</p>
<h2 id="heading-journaling">Journaling</h2>
<p>If a program crashes and you didn't set up a logging system, you will have no idea what happened. You can't fix the bug. You will just keep crashing in the same way, over and over.</p>
<p>Humans do this too. We repeat the same toxic patterns, the same bad habits, the same anxious spirals. Why? <strong>Because we never generated the logs.</strong></p>
<p>We didn't slow down. We didn't meditate, pray, or journal.</p>
<p>Meditation and Prayer are the tools we use to generate logs of our existence. They force us to stop the execution of the code and look at the logic, or, in real life, to evaluate the decisions we made throughout the day.</p>
<ul>
<li><p><em>What triggered that anger?</em></p>
</li>
<li><p><em>Why did I chase that fast money?</em></p>
</li>
<li><p><em>What truly matters to me right now?</em></p>
</li>
</ul>
<p>If we don't slow down to "log it out," we are just autonomous agents running trash code, reacting to the world with no direction until we burn out.</p>
<p>But if we take the time to convene the Council, to meditate on first principles and pray for guidance, we stop reacting, and we start living.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>For years, AI development optimized for speed. But we reached a point where speed wasn't the problem, reasoning was. The solution wasn't to go faster, it was to introduce agents that think, critique, and work together.</p>
<p>The same applies to us. We cannot build a good life through speed or autopilot reactions. We need an architecture of thought. Chain of thought was a good starting point, but the broader architecture of agents thinking together can still improve. I’m left wondering: what would the perfect agentic chain-of-thought architecture look like?</p>
<p>We must define our own "Philosophical Guide" agent. By taking the time to convene the Council, meditate on first principles, and log our internal states, we stop merely reacting to the code and we start writing it.</p>
<p>Which raises the follow-up question: “How do we define an agent that is a philosophical guide?”</p>
]]></content:encoded></item><item><title><![CDATA[One-Shot Trauma: When Reinforcement Learning and Human Minds Overcorrect]]></title><description><![CDATA[The Day My Internal Agent Received a -1,000,000 Penalty
It only took a second to rewire my brain.
By early 2022, I was just your average joe, living life day by day. Eating was one of my daily tasks that was necessary, automatic, and unconscious. It ...]]></description><link>https://hddatascience.tech/one-shot-trauma-when-reinforcement-learning-and-human-minds-overcorrect</link><guid isPermaLink="true">https://hddatascience.tech/one-shot-trauma-when-reinforcement-learning-and-human-minds-overcorrect</guid><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[psychology]]></category><category><![CDATA[jordan peterson]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Mon, 17 Nov 2025 14:03:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763388189147/2ef23053-d2be-45c8-852d-33f758b828e6.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-the-day-my-internal-agent-received-a-1000000-penalty"><strong>The Day My Internal Agent Received a -1,000,000 Penalty</strong></h3>
<p>It only took a second to rewire my brain.</p>
<p>By early 2022, I was just your average Joe, living life day by day. Eating was one of those daily tasks that was necessary, automatic, and unconscious, a background process that required no thought. Then, one day, I failed at that supposedly automatic task. I choked on my food.</p>
<p>It wasn't just a moment of discomfort, it was a primal, terrifying alert that flooded my entire system. The world narrowed to the single, desperate need for air. My heart hammered against my ribs, adrenaline surged, and in that moment, my brain registered a single, blaring data point: <em>This is death. This is how you die.</em></p>
<p>Even after the danger passed, the damage was done. For weeks and months, I had a debilitatingly difficult time eating solid foods. Every sensation in my throat felt like a potential prelude to disaster, triggering panic attacks that, in a cruel feedback loop, caused GERD, which in turn created more throat sensations.</p>
<p>What I didn’t realize at the time was that my brain was running a perfect, albeit terrifying, simulation of a core problem in artificial intelligence. I had become a reinforcement learning agent that had just received a penalty so massive, so disproportionate to all my previous experiences, that my entire operating policy had been corrupted.</p>
<h3 id="heading-a-crash-course-in-reinforcement-learning"><strong>A Crash Course in Reinforcement Learning</strong></h3>
<p>Before we get to the catastrophe, let’s quickly define the terms. Reinforcement Learning (RL) is a field of AI where we teach an "agent" to make decisions. Think of it like training a dog, but with algorithms.</p>
<p>The basic components are simple:</p>
<ul>
<li><p><strong>The Agent:</strong> The learner and decision-maker (the AI, the dog, or in my case, me).</p>
</li>
<li><p><strong>The Environment:</strong> The world the agent operates in (a video game, a maze, or the dinner table).</p>
</li>
<li><p><strong>The State:</strong> A snapshot of the agent's current situation ("I am at a crossroad," "My plate is full of solid food").</p>
</li>
<li><p><strong>The Action:</strong> Something the agent can do ("Turn left," "Take a bite").</p>
</li>
<li><p><strong>The Reward/Penalty:</strong> The feedback the agent gets from the environment after an action (+1 for finding cheese, -1 for hitting a wall).</p>
</li>
</ul>
<p>The agent’s goal is to learn a <strong>policy</strong>, a strategy or a map of which actions to take in which states to maximize its total cumulative reward over time. It does this through trial and error, gradually updating its policy as it explores the world. For 99.9% of its life, this process is gradual and iterative.</p>
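<p>The gradual, iterative part can be sketched with a single tabular Q-learning update. This is a toy sketch: the state and action names, the rewards, and the learning rate below are invented for illustration.</p>

```python
# Tabular Q-learning update: nudge the value estimate for a
# (state, action) pair a small step toward the observed return.
def q_update(q, state, action, reward, next_q_max, alpha=0.1, gamma=0.9):
    old = q.get((state, action), 0.0)
    target = reward + gamma * next_q_max  # observed reward + discounted future
    q[(state, action)] = old + alpha * (target - old)

q = {}
# Small, repeated rewards refine the policy gradually: +1, +5, +2 ...
for r in [1, 5, 2]:
    q_update(q, "crossroad", "turn_left", r, next_q_max=0.0)
```

<p>Each update moves the estimate only a fraction (<code>alpha</code>) of the way toward the target, which is exactly why normal learning is gradual.</p>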
<h3 id="heading-the-catastrophe"><strong>The Catastrophe</strong></h3>
<p>Now, let's return to our RL agent. It’s happily exploring its environment, collecting small rewards: +1, +5, +2. Its policy is getting better and better.</p>
<p>Then, it wanders into an unknown territory and takes an action. The environment's response isn't a small penalty. It's a catastrophic, system-shocking <strong>-1,000,000</strong>.</p>
<p>From a technical standpoint, the value assigned to that state-action pair plummets. The agent's algorithm, designed to maximize reward, now sees any path leading to that state as unimaginably bad. The policy updates instantly and brutally: "Whatever you do, <em>never go there again</em>."</p>
<p>This is precisely what happened in my brain.</p>
<ul>
<li><p><strong>State:</strong> "Eating solid food."</p>
</li>
<li><p><strong>Action:</strong> "Swallowing."</p>
</li>
<li><p><strong>Penalty:</strong> The choking experience, a neurological -1,000,000.</p>
</li>
</ul>
<p>My internal policy was updated in a flash. The value of that action became catastrophic. My brain’s simple new rule was: <em>Avoid this state at all costs. It is not worth the risk.</em></p>
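<p>The asymmetry is easy to demonstrate with toy numbers (the -1,000,000 figure and the update rule below are illustrative, not a model of real neurology):</p>

```python
# One catastrophic penalty can swamp a lifetime of small positive rewards.
def q_update(q, key, reward, alpha=0.1):
    old = q.get(key, 0.0)
    q[key] = old + alpha * (reward - old)

q = {}
for _ in range(1000):                          # years of uneventful meals
    q_update(q, ("solid_food", "swallow"), reward=+1)
q_update(q, ("solid_food", "swallow"), reward=-1_000_000)  # one choking incident

q[("solid_food", "avoid")] = 0.0               # doing nothing scores zero
best = max(["swallow", "avoid"], key=lambda a: q[("solid_food", a)])
```

<p>After a thousand +1 experiences the estimate sits near 1.0; a single -1,000,000 drags it below -99,000, and the greedy policy flips to "avoid".</p>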
<p>A reinforcement learning agent reacts to a huge penalty much the way humans react to real-life trauma. Some trauma responses are key to survival, and most were shaped by evolution: humans were built to panic in the wild, for example when encountering lions or other predators in the forest, but that part of our brain wasn’t optimized for the modern human experience.</p>
<h3 id="heading-the-flawed-policy-and-the-dragon-of-chaos"><strong>The Flawed Policy and The Dragon of Chaos</strong></h3>
<p>This is where a core AI challenge, the <strong>Exploration-Exploitation Dilemma</strong>, collides with human psychology. An agent must balance <em>exploiting</em> known good strategies with <em>exploring</em> new ones to find even better rewards.</p>
<p>After a catastrophic penalty, this dilemma is shattered. The agent stops exploring. It retreats into a tiny, "safe" corner of its world, only performing actions it <em>knows</em> won't lead to disaster. It has sacrificed growth and opportunity for the illusion of total safety.</p>
<p>This is where the ideas of psychologist Jordan Peterson become incredibly relevant. Peterson often frames the world as a duality of <strong>Order</strong> and <strong>Chaos</strong>.</p>
<ul>
<li><p><strong>Order</strong> is the realm of the known, the predictable, the safe. It's your home, your routine, your settled knowledge.</p>
</li>
<li><p><strong>Chaos</strong> is the unknown, the unexpected. It is the place of both terrifying dragons and undiscovered treasure.</p>
</li>
</ul>
<p>My normal life of eating was Order. The choking incident was a violent, sudden immersion into Chaos. My response, my agent's response, was to retreat and drastically shrink the walls of my known, safe Order. Solid food, a previously mundane part of Order, was now re-categorized as Chaos. It was a territory on my internal map suddenly marked, "Here be dragons."</p>
<p>But here's the flaw in the policy, for both me and the AI: you can’t just skip eating. The AI agent, by walling off a huge part of its environment, might be dooming itself to a sub-optimal existence, missing out on the vast rewards that lie just beyond that one terrifying spot.</p>
<p>A well-trained reinforcement learning agent, like a well-lived human life, balances a good amount of order with a few bits of chaos. It is in order that we find peace in the world, and in facing chaos that we learn to adapt to an ever-changing one. A reinforcement learning agent, like a human, would paradoxically be unsafe in a perfectly ordered environment, because it would never be prepared for chaos.</p>
<h3 id="heading-recalibration"><strong>Recalibration</strong></h3>
<p>So, how do you fix a policy that has been broken by a single, traumatic data point? You can't just delete the memory. The agent, and the human, needs new countervailing data.</p>
<p>Peterson's prescription for this is not to ignore Chaos, but to <strong>confront it voluntarily</strong>. You don't wait for the dragon to find you again. You approach its lair on your own terms, in small, manageable steps.</p>
<p>In psychology, this is the foundation of <strong>exposure therapy</strong>. For me, it meant I couldn't go back to eating a steak dinner. But I could start with something soft. I could eat a piece of well-chewed bread. I was voluntarily taking a small step back into the "dangerous" territory. I was telling my internal agent, "See? We took an action in this state-space, and the penalty was 0, not -1,000,000."</p>
<p>Each successful, non-choking bite was a small, positive reward (+1) that began to slowly, painstakingly, update my flawed policy.</p>
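<p>In the same toy terms, exposure therapy is just countervailing data: many small, safe outcomes slowly pulling a catastrophic estimate back up (the numbers and learning rate are invented for illustration):</p>

```python
# Exposure therapy as data: repeated safe trials pull a catastrophic
# value estimate back toward the true (mildly positive) value.
def q_update(value, reward, alpha=0.1):
    return value + alpha * (reward - value)

value = -1_000_000.0            # estimate right after the choking incident
history = [value]
for _ in range(200):            # two hundred safe, well-chewed bites (+1 each)
    value = q_update(value, reward=+1)
    history.append(value)
```

<p>Note the shape of the recovery: it is geometric, so the early exposures close the largest share of the gap, but only if you keep showing up.</p>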
<p>We can apply this same logic to building more resilient AI:</p>
<ol>
<li><p><strong>Curriculum Learning:</strong> Don't throw the agent into the most chaotic environment at once. Start it in a simple, safe version and gradually increase the complexity, the AI equivalent of starting with soft foods.</p>
</li>
<li><p><strong>Reward Shaping:</strong> Can we design systems that give small rewards for "bravery", for cautiously re-exploring a territory with a known high penalty? This encourages the agent not to write it off forever.</p>
</li>
<li><p><strong>Decaying Memory:</strong> Perhaps the memory of a massive penalty shouldn't be permanent. It could slowly decay over time if not reinforced, allowing the agent to become cautiously curious once more.</p>
</li>
</ol>
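<p>The third idea is the easiest to sketch: if a stored penalty is never reinforced, let it shrink a little each step (the decay rate here is an arbitrary illustrative choice):</p>

```python
# "Decaying memory": an unreinforced penalty estimate drifts back
# toward neutral, restoring cautious curiosity over time.
def decay_toward_zero(value, rate=0.01):
    return value * (1 - rate)       # shrink the stored value by 1% per step

value = -1_000_000.0
for _ in range(1000):               # many steps with no new bad outcome
    value = decay_toward_zero(value)
```

<p>After a thousand quiet steps the remembered catastrophe has shrunk from -1,000,000 to roughly -43: still a warning, no longer a wall.</p>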
<p>At first, I ate only soft foods, starting with oatmeal and yogurt. I eventually worked up to sandwiches, and then finally to meat with rice. It was an experience I never expected to go through. I monitored my progress every step of the way and gave myself a pat on the back whenever I faced a fear I had been very hesitant to confront.</p>
<h3 id="heading-conclusion-building-agents-with-digital-courage"><strong>Conclusion: Building Agents with Digital Courage</strong></h3>
<p>My experience taught me that humans and our most advanced learning algorithms share a fundamental vulnerability: we are profoundly shaped by our worst moments. A single, catastrophic failure can create a brittle, over-cautious policy that prioritizes avoiding pain over seeking growth.</p>
<p>The path to recovery and optimal performance, for both man and machine, isn't about erasing that bad memory. It’s about courageously and methodically gathering new data to prove that the catastrophe was an outlier, not the rule.</p>
<p>Perhaps the next frontier in AI isn't just about bigger models or faster processing. It’s about instilling the digital equivalent of courage, the ability to face the remembered dragon, learn from failure, and refuse to let a single scar define the entire map of one's world.</p>
<p>At some point, technical progress in AI (reinforcement learning included) may come down to how well we imitate lessons and patterns from psychology. Learning comes from a good amount of order to stand on, and a small amount of chaos to learn from.</p>
]]></content:encoded></item><item><title><![CDATA[What is AI?]]></title><description><![CDATA[I find it a little funny that I'm only getting to this article now. As an applied mathematician, I have a deep-seated need for rigor, for building arguments from first principles. In a field as dynamic and hype-driven as Artificial Intelligence, a ri...]]></description><link>https://hddatascience.tech/what-is-ai</link><guid isPermaLink="true">https://hddatascience.tech/what-is-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sun, 02 Nov 2025 08:06:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762069768389/92a83401-8bca-4851-b413-28b03b47df13.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I find it a little funny that I'm only getting to this article now. As an applied mathematician, I have a deep-seated need for rigor, for building arguments from first principles. In a field as dynamic and hype-driven as Artificial Intelligence, a rigorous definition can feel elusive. This article is my attempt to provide one. Not in the form of a dense mathematical proof, but through a structured framework that is both intuitive and fundamentally sound: viewing the core learning mechanisms of AI through the lens of child development.</p>
<p>My own "AI genesis" story wasn't seeing a robot, but coding a neural network from scratch and realizing the 'magic' was just calculus and matrix transformations. That revelation is the core of this post: AI is not a magic box. It is a set of mathematical tools. And to truly understand them, we must first understand how they "grow up."</p>
<h3 id="heading-learning-through-guidance-and-imitation-ml-amp-dl">Learning Through Guidance and Imitation: ML &amp; DL</h3>
<p>A child's first and most essential way of learning is by observing the ordered world their parents create for them. This is the domain of Machine Learning (ML) and its powerful subfield, Deep Learning (DL).</p>
<h4 id="heading-machine-learning-as-cultural-conditioning">Machine Learning as Cultural Conditioning</h4>
<p>Think of how a child learns the specific, non-negotiable rules of a Filipino household. They are taught to say "po" and "opo" to elders. They learn to take off their shoes or slippers the moment they step inside. This isn't learned through abstract reasoning; it's learned through direct instruction and imitation.</p>
<p>This is a perfect parallel for supervised Machine Learning. The model is given a massive dataset of specific inputs (an elder speaks to you) and the correct, labeled outputs ("opo"). It learns the function to map one to the other, perfectly mimicking the "correct" behavior it was shown.</p>
<h4 id="heading-deep-learning-as-internalizing-values">Deep Learning as Internalizing Values</h4>
<p>A child doesn't just parrot rules forever. Eventually, they move beyond mimicry and grasp the underlying <em>concept</em> of respect. They begin to apply it in novel situations, showing deference to other figures of authority even if they were never explicitly told to.</p>
<p>This is Deep Learning. The neural network's layers allow it to learn not just the surface-level pattern, but the deeper, abstract principles behind the data. It builds an internal model of "respect," allowing for a more flexible and intuitive application of the learned rules.</p>
<h3 id="heading-learning-through-consequence-reinforcement-learning-rl">Learning Through Consequence: Reinforcement Learning (RL)</h3>
<p>But not everything can be learned from a guiding hand. A child must eventually face the world on their own and learn from its direct, unfiltered feedback. This is the world of Reinforcement Learning (RL), and it is a process of conquering chaos.</p>
<h4 id="heading-reinforcement-learning-as-learning-to-walk">Reinforcement Learning as Learning to Walk</h4>
<p>The best analogy for RL is a toddler learning to walk. There is no instruction manual. No parent can perfectly explain the infinite micro-adjustments of balance and muscle control. The child must learn through brutal trial and error.</p>
<ul>
<li><p>The <strong>agent</strong> is the toddler.</p>
</li>
<li><p>The <strong>environment</strong> is the physical world, governed by the unforgiving laws of gravity.</p>
</li>
<li><p>The <strong>action</strong> is attempting to take a step.</p>
</li>
<li><p>The <strong>penalty</strong> is the immediate, painful feedback of falling.</p>
</li>
<li><p>The <strong>reward</strong> is the exhilarating success of staying upright and moving forward. (and probably the applause of your parents)</p>
</li>
</ul>
<p>The toddler is not trying to imitate a perfect "walk" from a dataset. They are developing their <em>own</em> strategy to maximize reward and minimize punishment, building a robust understanding directly from the consequences of their actions. This is how RL agents master complex games and robotic controls—by bravely confronting the chaos of their environment and structuring it through experience.</p>
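<p>A toy loop captures the analogy. The <code>skill</code> variable and its dynamics are invented for illustration; this is not a real RL algorithm, just the trial-and-error shape of one.</p>

```python
import random

# Trial and error: every fall (penalty) slightly improves balance,
# so staying upright (reward) becomes more likely over time.
random.seed(42)
skill = 0.1                              # probability of staying upright
falls = 0
for attempt in range(500):
    upright = random.random() < skill    # the outcome of taking a step
    if not upright:
        falls += 1                       # the painful feedback of falling
        skill = min(1.0, skill + 0.005)  # each fall teaches a little balance
```

<p>No dataset of "perfect walks" is involved; the improvement comes entirely from consequences.</p>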
<h3 id="heading-conclusion-think-like-a-mathematician-not-a-movie-director">Conclusion: Think Like a Mathematician, Not a Movie Director</h3>
<p>So when we ask, "What is AI?", the answer is multifaceted. It’s the carefully guided student, learning the cultural rules of its environment (ML/DL). And it’s the determined toddler, courageously facing gravity to learn to stand on its own two feet (RL).</p>
<p>The next time you interact with an AI, I challenge you to see past the code and think of its upbringing. Was it taught by the book, or did it learn from the school of hard knocks? Understanding its developmental journey demystifies its capabilities. Because beneath it all, whether it's a child internalizing respect or a toddler learning to walk, the engine is the same: a mathematical function, optimizing for a goal, and turning the unknown chaos of the world into the ordered structure of knowledge.</p>
]]></content:encoded></item><item><title><![CDATA[From "It Works" to "Why It Works": A Call for Deeper Understanding in Data Science]]></title><description><![CDATA[Sometimes, the most valuable lessons come from unexpected moments. I was attending a data science workshop recently, and a brief discussion served as a powerful reminder of a crucial question we must ask ourselves: are we content with knowing that so...]]></description><link>https://hddatascience.tech/from-it-works-to-why-it-works-a-call-for-deeper-understanding-in-data-science</link><guid isPermaLink="true">https://hddatascience.tech/from-it-works-to-why-it-works-a-call-for-deeper-understanding-in-data-science</guid><category><![CDATA[ConvolutionalNeuralNetworks]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sun, 05 Oct 2025 06:02:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759643496786/e4be8da5-25e5-4e7b-8381-e6e4e3799bab.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sometimes, the most valuable lessons come from unexpected moments. I was attending a data science workshop recently, and a brief discussion served as a powerful reminder of a crucial question we must ask ourselves: are we content with knowing <em>that</em> something works, or do we strive to understand <em>why</em> it works? It's the difference between being a technician and an engineer, and it is crucial for building robust and reliable solutions.</p>
<p>This question feels more relevant than ever. It's never been easier to get amazing results in data science. We can build powerful models that were cutting-edge just a few years ago with only a few lines of code. But this ease of use brings a hidden risk. We're often tempted to treat these powerful tools like "black boxes," focusing only on the final accuracy score without really knowing what’s happening inside.</p>
<h2 id="heading-dont-skip-the-why-the-soul-of-the-cnn">Don't Skip the "Why": The Soul of the CNN</h2>
<p>Let's use a classic example, the Convolutional Neural Network (CNN). Too often, tutorials and talks jump straight into the architecture, talking about layers, filters, and code, but they skip the most important question of all: why do we even use them?</p>
<p>The reason we use CNNs for images instead of a standard neural network comes down to a couple of brilliant ideas:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759643572054/9064720c-c2a2-4d2d-9838-bcb0bbe3ce39.webp" alt class="image--center mx-auto" /></p>
<ol>
<li><p><strong>Translation Invariance:</strong> A picture of a cat is still a picture of a cat, whether the cat is in the top left or the bottom right. A basic neural network would struggle with this, needing to learn what a "top-left cat" and a "bottom-right cat" are separately. This is incredibly inefficient. CNNs solve this by using sliding filters that spot features no matter where they are in the image.</p>
</li>
<li><p><strong>Parameter Efficiency:</strong> By using these sliding filters, a CNN reuses the same weights across the entire image. This drastically cuts down on the number of parameters the model has to learn, which means it trains faster and is less likely to overfit.</p>
</li>
</ol>
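<p>The parameter-efficiency point is easy to check with back-of-the-envelope arithmetic. The layer sizes below are arbitrary but typical, and only weights are counted (biases ignored):</p>

```python
# Weight counts for one layer on a 32x32 RGB image.
h, w, c = 32, 32, 3

# Dense: every one of the 3,072 inputs connects to each of 256 hidden units.
dense_params = (h * w * c) * 256      # 786,432 weights

# Conv: 64 filters of size 3x3x3, reused at every position in the image.
conv_params = 64 * (3 * 3 * c)        # 1,728 weights

ratio = dense_params // conv_params   # the dense layer is ~455x larger
```

<p>Weight reuse is the whole trick: the same 27 numbers per filter scan the entire image instead of being relearned at every position.</p>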
<p>Understanding this "why" isn't just for textbooks. It’s the very soul of the architecture. It helps you make better design choices and explain your work with real confidence.</p>
<h2 id="heading-the-anatomy-of-a-convolution-more-than-just-guesswork">The Anatomy of a Convolution: More Than Just Guesswork</h2>
<p>This need for understanding goes all the way down to the basic building blocks. When we set up a convolutional layer, we have to pick its kernel size, padding, and stride. These are not just random numbers to guess. They are key design decisions that have a huge impact on what your model learns.</p>
<p>Let's quickly break them down:</p>
<ul>
<li><p><strong>Kernel Size:</strong> Think of the kernel as the network's magnifying glass.</p>
<ul>
<li><p><strong>A small kernel (like 3x3)</strong> is great for spotting fine details like sharp edges and textures. Most modern models use these to build up a complex picture from small pieces.</p>
</li>
<li><p><strong>A large kernel (like 7x7)</strong> sees bigger patterns at once, like the general shape of an object. It’s less common now but can be useful for capturing broader strokes.</p>
</li>
</ul>
</li>
<li><p><strong>Padding:</strong> This means adding a border of pixels around the image.</p>
<ul>
<li><p><strong>Without padding,</strong> the image gets smaller with every layer, and information at the edges can get lost.</p>
</li>
<li><p><strong>With padding,</strong> you can keep the image size the same. This lets you build deeper networks and makes sure the features at the borders are treated fairly.</p>
</li>
</ul>
</li>
<li><p><strong>Stride:</strong> This is the step size the kernel takes as it moves across the image.</p>
<ul>
<li><p><strong>A stride of 1</strong> is very thorough, moving one pixel at a time. It captures the most information but is computationally slower.</p>
</li>
<li><p><strong>A stride of 2 or more</strong> makes the kernel jump, shrinking the output size quickly. It’s a fast way to down-sample and helps the network see the bigger picture, but you lose some fine-grained detail.</p>
</li>
</ul>
</li>
</ul>
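<p>These three choices combine in one standard formula for the output size of a convolutional layer, which is worth internalizing:</p>

```python
# Output spatial size of a conv layer: floor((n + 2p - k) / s) + 1.
def conv_out(size, kernel, padding=0, stride=1):
    return (size + 2 * padding - kernel) // stride + 1

same   = conv_out(32, kernel=3, padding=1, stride=1)  # 32: "same" padding
shrunk = conv_out(32, kernel=3, padding=0, stride=1)  # 30: edges eaten away
halved = conv_out(32, kernel=3, padding=1, stride=2)  # 16: stride 2 down-samples
```

<p>Reading your chosen kernel, padding, and stride through this formula is the quickest sanity check that your layers actually fit together.</p>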
<p>Choosing these values is an act of engineering, not a lucky guess. You are actively deciding how your model sees the world.</p>
<h2 id="heading-the-case-for-building-from-the-ground-up">The Case for Building from the Ground Up</h2>
<p>So, how do we get this deeper knowledge? We can do this by fighting the urge to always use the fanciest, most automated tools first. This is why I'm a huge believer in trying to build models from a more fundamental level.</p>
<p>A framework like <strong>PyTorch</strong> is perfect for this. While it handles the heavy-lifting of calculus for you, it doesn’t hide everything. You still have to define your network layer by layer and write the training loop yourself, which includes the forward pass, calculating the loss, the backward pass, and updating the model.</p>
<p>Going through this process connects you directly to the mechanics. You see how the data changes shape as it flows through the network. You finally understand why certain steps are necessary. Your model stops being a magic box and becomes a logical system you created.</p>
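<p>Those four steps are worth seeing stripped of any framework at all. Here is the same loop on a one-parameter linear model in plain Python: a conceptual sketch with a made-up dataset and learning rate, not PyTorch code.</p>

```python
# Forward pass -> loss -> backward pass -> update, by hand.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples from y = 2x
w = 0.0                                       # the model's single weight
lr = 0.05                                     # learning rate

for epoch in range(200):
    for x, y in data:
        y_hat = w * x                 # forward pass
        loss = (y_hat - y) ** 2       # squared-error loss
        grad = 2 * (y_hat - y) * x    # backward pass: d(loss)/d(w)
        w -= lr * grad                # update step
```

<p>The weight converges to 2.0. PyTorch automates the <code>grad</code> line for you (autograd) but still makes you write the other three steps yourself, which is exactly the pedagogical point.</p>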
<h2 id="heading-conclusion">Conclusion</h2>
<p>At the end of the day, our job is to solve problems with tools that are reliable and that we can explain. That kind of work isn’t built on trial and error. It’s built on rigor, intention, and a real curiosity to learn.</p>
<p>So, the next time you start a project, I encourage you to ask "why." Why this model? Why this setting? The best models, and the best data scientists, are made when we step away from the easy abstractions and get our hands dirty with the fundamentals.</p>
]]></content:encoded></item><item><title><![CDATA[Building Intuition for Convolutional Neural Networks]]></title><description><![CDATA[The motivation behind Convolutional Neural Networks (CNNs) comes from the inability of traditional dense neural networks to perform well on image classification tasks. Why is that? A dense network, also known as a fully-connected network, treats an i...]]></description><link>https://hddatascience.tech/building-intuition-for-convolutional-neural-networks</link><guid isPermaLink="true">https://hddatascience.tech/building-intuition-for-convolutional-neural-networks</guid><category><![CDATA[CNNs (Convolutional Neural Networks)]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sun, 24 Aug 2025 03:54:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755998342928/d9ec60de-76f5-4a8f-bc65-1503f764f453.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The motivation behind Convolutional Neural Networks (CNNs) comes from the inability of traditional dense neural networks to perform well on image classification tasks. Why is that? A dense network, also known as a fully-connected network, treats an image as a flat vector of pixels. If you flatten a 32x32 pixel image, you get a 1024-dimensional vector. This process discards all spatial information. The network has no inherent understanding that a pixel is "next to" another. This makes it difficult to learn concepts like edges, textures, or shapes, and it completely fails to grasp <strong>translation invariance</strong>, the idea that a cat is still a cat whether it's in the top-left or bottom-right corner of the image.</p>
<p>This is where CNNs shine. They are specifically designed to process pixel data by creating better feature maps out of raw images. Instead of flattening the input, they use small filters (kernels) that slide across the image, recognizing patterns like edges, corners, and textures. These initial patterns are then combined in deeper layers to form more complex features like eyes, wheels, or wings.</p>
<p>In this post, we'll build a CNN from scratch using PyTorch to understand its core components. We'll train it on the popular CIFAR-10 dataset and see how it learns to classify images into one of ten categories.</p>
<p>Let's break down the process step-by-step.</p>
<h3 id="heading-phase-1-importing-dependencies">Phase 1: Importing Dependencies</h3>
<p>First, we import all the necessary libraries. We'll be using torch and its nn module for building the network, torchvision for the dataset and image transformations, and PIL for handling our own custom images later.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> torch <span class="hljs-keyword">import</span> nn

<span class="hljs-keyword">import</span> torch.nn.functional <span class="hljs-keyword">as</span> F
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader

<span class="hljs-keyword">import</span> torchvision
<span class="hljs-keyword">from</span> torchvision <span class="hljs-keyword">import</span> datasets, transforms
</code></pre>
<h3 id="heading-phase-2-data-transformation-and-loading">Phase 2: Data Transformation and Loading</h3>
<p>Before we can feed images to our network, we need to preprocess them. This is done using torchvision.transforms.</p>
<ul>
<li><p>transforms.ToTensor(): This converts the image from a PIL Image format (with pixel values from 0-255) to a PyTorch tensor (with values from 0.0 to 1.0).</p>
</li>
<li><p>transforms.Normalize(): This standardizes the pixel values. The arguments (0.5, 0.5, 0.5) are the mean and standard deviation for each of the three (R, G, B) channels. This normalization helps the network train faster and more stably by centering the data around zero.</p>
</li>
</ul>
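<p>It's worth spelling out what these particular arguments do: Normalize computes (x - mean) / std per channel, so a mean of 0.5 and a standard deviation of 0.5 map pixel values from [0, 1] to [-1, 1]. A quick NumPy check (my own illustration, separate from the pipeline below):</p>
<pre><code class="lang-python">import numpy as np

# Normalize applies (x - mean) / std to every pixel in a channel
pixels = np.array([0.0, 0.5, 1.0])
normalized = (pixels - 0.5) / 0.5
print(normalized)  # [-1.  0.  1.]
</code></pre>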
<pre><code class="lang-python">transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>), (<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>))
])
</code></pre>
<p>With our transformation pipeline ready, we can load the CIFAR-10 dataset. We also wrap our datasets in a DataLoader, which is a handy utility that provides batches of data, shuffles it for each epoch, and can even use multiple workers to load data in parallel.</p>
<pre><code class="lang-python">train_data = torchvision.datasets.CIFAR10(root=<span class="hljs-string">'./data'</span>, train=<span class="hljs-literal">True</span>, transform=transform, download=<span class="hljs-literal">True</span>)
test_data = torchvision.datasets.CIFAR10(root=<span class="hljs-string">'./data'</span>, train=<span class="hljs-literal">False</span>, transform=transform, download=<span class="hljs-literal">True</span>)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=<span class="hljs-number">32</span>, shuffle=<span class="hljs-literal">True</span>, num_workers=<span class="hljs-number">2</span>)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=<span class="hljs-number">32</span>, shuffle=<span class="hljs-literal">True</span>, num_workers=<span class="hljs-number">2</span>)

class_names = [<span class="hljs-string">'plane'</span>, <span class="hljs-string">'car'</span>, <span class="hljs-string">'bird'</span>, <span class="hljs-string">'cat'</span>, <span class="hljs-string">'deer'</span>, <span class="hljs-string">'dog'</span>, <span class="hljs-string">'frog'</span>, <span class="hljs-string">'horse'</span>, <span class="hljs-string">'ship'</span>, <span class="hljs-string">'truck'</span>]
</code></pre>
<p>The CIFAR-10 images are 3-channel (RGB) images of size 32x32 pixels. Let's confirm this:</p>
<pre><code class="lang-python">image, label = train_data[<span class="hljs-number">0</span>]
print(image.size())
<span class="hljs-comment"># Output: torch.Size([3, 32, 32])</span>
</code></pre>
<h3 id="heading-phase-3-defining-the-neural-network-architecture">Phase 3: Defining the Neural Network Architecture</h3>
<p>This is the core of our project. We'll define a class NeuralNet that inherits from nn.Module.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">NeuralNet</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super().__init__()
        <span class="hljs-comment"># Input: (3, 32, 32)</span>
        self.conv1 = nn.Conv2d(<span class="hljs-number">3</span>, <span class="hljs-number">16</span>, <span class="hljs-number">5</span>, padding=<span class="hljs-number">2</span>) <span class="hljs-comment"># 32x32 -&gt; 32x32</span>
        self.pool1 = nn.MaxPool2d(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)            <span class="hljs-comment"># 32x32 -&gt; 16x16</span>
        <span class="hljs-comment"># Shape: (16, 16, 16)</span>

        self.conv2 = nn.Conv2d(<span class="hljs-number">16</span>, <span class="hljs-number">32</span>, <span class="hljs-number">3</span>, padding=<span class="hljs-number">1</span>) <span class="hljs-comment"># 16x16 -&gt; 16x16</span>
        self.pool2 = nn.MaxPool2d(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)             <span class="hljs-comment"># 16x16 -&gt; 8x8</span>
        <span class="hljs-comment"># Shape: (32, 8, 8)</span>

        self.conv3 = nn.Conv2d(<span class="hljs-number">32</span>, <span class="hljs-number">64</span>, <span class="hljs-number">3</span>, padding=<span class="hljs-number">1</span>) <span class="hljs-comment"># 8x8 -&gt; 8x8</span>
        self.pool3 = nn.MaxPool2d(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)             <span class="hljs-comment"># 8x8 -&gt; 4x4</span>
        <span class="hljs-comment"># Shape: (64, 4, 4)</span>

        <span class="hljs-comment"># IMPORTANT: Calculate the new flattened size</span>
        self.fc1 = nn.Linear(<span class="hljs-number">64</span> * <span class="hljs-number">4</span> * <span class="hljs-number">4</span>, <span class="hljs-number">256</span>)
        self.fc2 = nn.Linear(<span class="hljs-number">256</span>, <span class="hljs-number">128</span>)
        self.fc3 = nn.Linear(<span class="hljs-number">128</span>, <span class="hljs-number">10</span>)
</code></pre>
<p>Let's break down how the shape of our data changes as it flows through the network:</p>
<h4 id="heading-convolutional-layers-nnconv2d">Convolutional Layers (nn.Conv2d)</h4>
<p>The shape of the output from a convolutional layer depends on the input size, kernel size, stride, and padding. The formula is:<br />Output_Size = (Input_Size - Kernel_Size + 2 * Padding) / Stride + 1</p>
<ul>
<li><p>self.conv1 = nn.Conv2d(3, 16, 5, padding=2)</p>
<ul>
<li><p>in_channels=3: We start with a 3-channel (RGB) image.</p>
</li>
<li><p>out_channels=16: The layer will produce 16 feature maps.</p>
</li>
<li><p>kernel_size=5: The filter is a 5x5 matrix.</p>
</li>
<li><p>padding=2: We add a 2-pixel border around the image.</p>
</li>
<li><p><strong>Shape Change</strong>: (32 - 5 + 2*2) / 1 + 1 = 32. With this padding, the height and width are preserved. Our shape goes from (3, 32, 32) to (16, 32, 32).</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-pooling-layers-nnmaxpool2d">Pooling Layers (nn.MaxPool2d)</h4>
<p>Pooling layers are used to down-sample the feature maps. This reduces the computational load and makes the detected features more robust to their exact location in the image.</p>
<ul>
<li><p>self.pool1 = nn.MaxPool2d(2, 2)</p>
<ul>
<li><p>This takes a 2x2 window and keeps only the maximum value, effectively halving the height and width.</p>
</li>
<li><p><strong>Shape Change</strong>: The input (16, 32, 32) becomes (16, 16, 16).</p>
</li>
</ul>
</li>
</ul>
<p>We repeat this pattern. After conv3 and pool3, our final feature map has a shape of (64, 4, 4).</p>
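<p>The whole shape pipeline can be checked with a few lines of plain Python using the output-size formula (a sketch; the helper names are mine):</p>
<pre><code class="lang-python">def conv_out(size, kernel, stride=1, padding=0):
    # (Input - Kernel + 2 * Padding) / Stride + 1
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, window=2):
    # A 2x2 max pool with stride 2 halves the spatial size
    return size // window

size = 32
size = pool_out(conv_out(size, 5, padding=2))  # conv1 + pool1: 16
size = pool_out(conv_out(size, 3, padding=1))  # conv2 + pool2: 8
size = pool_out(conv_out(size, 3, padding=1))  # conv3 + pool3: 4
print(size)              # 4
print(64 * size * size)  # 1024, the flattened input size for fc1
</code></pre>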
<h4 id="heading-the-flattening-and-fully-connected-layers-nnlinear">The Flattening and Fully-Connected Layers (nn.Linear)</h4>
<p>The convolutional layers have done their job of extracting spatial features. Now, we need to feed these features into a standard dense network to perform the final classification. To do this, we must "flatten" our 3D feature map (64, 4, 4) into a 1D vector.</p>
<p>The size of this vector is channels × height × width, which is 64 × 4 × 4 = 1024.</p>
<p>This is why our first fully-connected layer, fc1, is defined as nn.Linear(64 * 4 * 4, 256). It takes the 1024 features from our flattened map and transforms them into 256 features. The final layer, fc3, outputs 10 values, one for each class in CIFAR-10.</p>
<p>To understand how fully connected layers work, <a target="_blank" href="https://hddatascience.tech/building-a-neural-network-from-scratch-in-python-and-numpy">here’s a link explaining the math and intuition behind it as we do it from scratch.</a></p>
<h3 id="heading-phase-4-defining-the-forward-propagation">Phase 4: Defining the Forward Propagation</h3>
<p>The forward method defines the actual path our data takes through the layers. We apply a ReLU activation function after each convolution and after the first two fully-connected layers to introduce non-linearity, which is crucial for learning complex patterns.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))

        x = torch.flatten(x, <span class="hljs-number">1</span>) <span class="hljs-comment"># Flatten all dimensions except batch</span>

        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        <span class="hljs-keyword">return</span> x
</code></pre>
<h3 id="heading-phase-5-optimizer-and-loss-function">Phase 5: Optimizer and Loss Function</h3>
<p>To train the network, we need two things:</p>
<ol>
<li><p><strong>Loss Function</strong>: Measures how wrong the model's predictions are.</p>
</li>
<li><p><strong>Optimizer</strong>: Updates the model's weights to reduce the loss.</p>
</li>
</ol>
<pre><code class="lang-python">net = NeuralNet()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=<span class="hljs-number">0.001</span>, momentum=<span class="hljs-number">0.9</span>)
</code></pre>
<h4 id="heading-why-crossentropyloss">Why CrossEntropyLoss?</h4>
<p>nn.CrossEntropyLoss is the standard choice for multi-class classification problems like this one. It's particularly effective because it combines two operations: LogSoftmax and NLLLoss (Negative Log Likelihood Loss). Internally, it takes the raw output scores (logits) from our final layer, converts them into probabilities using a softmax function, and then calculates the loss. It heavily penalizes the model for being confident in the wrong prediction, which makes it a very effective teacher during training.</p>
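<p>That combination is easy to sketch by hand. The following few lines of NumPy mimic what CrossEntropyLoss does for a single example (an illustration of the idea, not PyTorch's actual implementation):</p>
<pre><code class="lang-python">import numpy as np

def cross_entropy(logits, target):
    # Softmax: exponentiate (shifted by the max for numerical stability)
    # and divide by the sum, turning raw scores into probabilities
    exps = np.exp(logits - np.max(logits))
    probs = exps / np.sum(exps)
    # The loss is the negative log-probability of the true class
    return -np.log(probs[target])

logits = np.array([2.0, 0.5, 0.1])
print(cross_entropy(logits, 0))  # small loss: the model already favors class 0
print(cross_entropy(logits, 2))  # large loss: confidently wrong
</code></pre>
<p>Notice how the loss blows up when a high score goes to the wrong class; that steep penalty is what pushes the weights to change quickly early in training.</p>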
<h3 id="heading-phase-6-the-training-loop">Phase 6: The Training Loop</h3>
<p>Here, we iterate through our training data for a set number of epochs. In each step, we perform the standard training routine:</p>
<ol>
<li><p>Get a batch of inputs and labels.</p>
</li>
<li><p>Clear previous gradients with optimizer.zero_grad().</p>
</li>
<li><p>Make a prediction (outputs = net(inputs)).</p>
</li>
<li><p>Calculate the loss.</p>
</li>
<li><p>Perform backpropagation to calculate gradients (loss.backward()).</p>
</li>
<li><p>Update the network's weights (optimizer.step()).</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">30</span>):
    print(<span class="hljs-string">f"Training Epoch: <span class="hljs-subst">{epoch}</span>"</span>)
    running_loss = <span class="hljs-number">0.0</span>

    <span class="hljs-keyword">for</span> i, data <span class="hljs-keyword">in</span> enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)

        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{running_loss/len(train_loader):<span class="hljs-number">.4</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-phase-7-saving-the-model-and-evaluating-performance">Phase 7: Saving the Model and Evaluating Performance</h3>
<p>After training, we save the learned weights (the model's "state") to a file. Then, we load these weights into a fresh instance of our network and evaluate its performance on the test dataset, which it has never seen before.</p>
<p>We switch the network to evaluation mode with net.eval(). This is important as it disables certain layers like Dropout that behave differently during training and inference. We use torch.no_grad() to tell PyTorch not to calculate gradients, which saves memory and computation.</p>
<pre><code class="lang-python">torch.save(net.state_dict(), <span class="hljs-string">'trained_net.pth'</span>)

net = NeuralNet()
net.load_state_dict(torch.load(<span class="hljs-string">'trained_net.pth'</span>))

correct = <span class="hljs-number">0</span>
total = <span class="hljs-number">0</span>

net.eval()
<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, <span class="hljs-number">1</span>)
        total += labels.size(<span class="hljs-number">0</span>)
        correct += (predicted == labels).sum().item()

accuracy = <span class="hljs-number">100</span> * correct / total
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy}</span>%"</span>)
</code></pre>
<p>This will give us a final accuracy score, showing how well our CNN learned to generalize.</p>
<h3 id="heading-phase-8-testing-with-our-own-image">Phase 8: Testing with Our Own Image</h3>
<p>Finally, the fun part! Let's see how our trained model performs on a completely new image from the web. We create a simple function to load, resize, and transform an image to match the input format our network expects.</p>
<p>Note the image.unsqueeze(0) step. Our network was trained on <em>batches</em> of images. This adds a "batch dimension" of size 1, so the tensor shape becomes (1, 3, 32, 32), which is what the network expects.</p>
<pre><code class="lang-python">new_transform = transforms.Compose([
    transforms.Resize((<span class="hljs-number">32</span>, <span class="hljs-number">32</span>)),
    transforms.ToTensor(),
    transforms.Normalize((<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>), (<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>))
])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_image</span>(<span class="hljs-params">image_path</span>):</span>
    image = Image.open(image_path)
    image = image.convert(<span class="hljs-string">'RGB'</span>)
    image = new_transform(image)
    image = image.unsqueeze(<span class="hljs-number">0</span>) <span class="hljs-comment"># Add batch dimension</span>
    <span class="hljs-keyword">return</span> image

<span class="hljs-comment"># Replace with the path to your image</span>
image_paths = [<span class="hljs-string">'path/to/your/image.png'</span>] 
images = [load_image(img) <span class="hljs-keyword">for</span> img <span class="hljs-keyword">in</span> image_paths]

net.eval()
<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> image <span class="hljs-keyword">in</span> images:
        output = net(image)
        _, predicted = torch.max(output, <span class="hljs-number">1</span>)
        print(<span class="hljs-string">f"Prediction: <span class="hljs-subst">{class_names[predicted.item()]}</span>"</span>)
</code></pre>
<h3 id="heading-conclusion">Conclusion</h3>
<p>We've successfully built, trained, and tested a Convolutional Neural Network. We saw how convolutional and pooling layers work together to extract meaningful features from raw pixels, and how these features are then used by a classifier to make a final prediction. This ability to learn spatial hierarchies of patterns is what makes CNNs the powerhouse behind modern computer vision. From here, you can experiment by changing the architecture, adding more layers, or trying different optimizers to see how it affects performance.</p>
<p>Python Code Link: <a target="_blank" href="https://github.com/HarvsDucs/hashnode_python_scripts/tree/main/Building%20Intuition%20for%20Convolutional%20Neural%20Networks">https://github.com/HarvsDucs/hashnode_python_scripts/tree/main/Building%20Intuition%20for%20Convolutional%20Neural%20Networks</a></p>
]]></content:encoded></item><item><title><![CDATA[Building a Neural Network from Scratch in Python and NumPy]]></title><description><![CDATA[Ever looked at a neural network formula and felt a disconnect from what it's actually doing? You're not alone. Frameworks like TensorFlow and PyTorch are powerful, but their high-level abstractions can hide the beautiful, intuitive mathematics that m...]]></description><link>https://hddatascience.tech/building-a-neural-network-from-scratch-in-python-and-numpy</link><guid isPermaLink="true">https://hddatascience.tech/building-a-neural-network-from-scratch-in-python-and-numpy</guid><category><![CDATA[neural networks]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Handwritten Digit Recognition]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 08 Aug 2025 09:23:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754644489741/32e7b7fc-826c-428b-a3c1-7f16bb7bea41.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever looked at a neural network formula and felt a disconnect from what it's <em>actually doing</em>? You're not alone. Frameworks like TensorFlow and PyTorch are powerful, but their high-level abstractions can hide the beautiful, intuitive mathematics that make learning possible.</p>
<p>Today, we're going to build a neural network from scratch using only Python and NumPy. But more importantly, at every step we'll stop and connect the code to the core mathematical ideas.</p>
<p>Our goal isn't just to make it work, it's to understand <em>how</em> a collection of matrix multiplications and derivatives can learn to recognize something as complex as a handwritten digit.</p>
<p>Let's translate the math into intuition.</p>
<hr />
<h2 id="heading-files-and-dependencies">Files and Dependencies</h2>
<p>Every learning process starts with information. For our network, that information is the MNIST dataset.</p>
<p>Our project relies on this main library:</p>
<ul>
<li>NumPy: This is our mathematical workhorse. It will perform the vector and matrix operations that form the very language of neural networks.</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre>
<p>The MNIST dataset contains 70,000 images (28x28 pixels) of handwritten digits. We load it from a NumPy-native .npz file.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Define the path to your dataset file</span>
file_path = <span class="hljs-string">'/kaggle/input/mnist-from-scratch/mnist.npz'</span>

data = np.load(file_path)
print(<span class="hljs-string">f"Arrays in the file: <span class="hljs-subst">{data.files}</span>"</span>)

<span class="hljs-comment"># Unpack the data into training and testing sets</span>
x_train, y_train = data[<span class="hljs-string">'x_train'</span>], data[<span class="hljs-string">'y_train'</span>]
x_test, y_test = data[<span class="hljs-string">'x_test'</span>], data[<span class="hljs-string">'y_test'</span>]
</code></pre>
<p>Our network doesn't see images; it sees numbers. To make these numbers easier to work with, we perform <strong>normalization</strong>.</p>
<p><strong>The Math:</strong> We take pixel values from [0, 255] and scale them to [0, 1].<br /><strong>The Intuition:</strong> This ensures that all our input features are on a similar scale. During training, the gradients (which tell us how to update our weights) are sensitive to the scale of the input. Normalization prevents gradients from becoming too large and unstable, leading to a smoother, more predictable learning process.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Normalize pixel values to the [0, 1] range</span>
x_train = x_train / <span class="hljs-number">255.0</span>
x_test = x_test / <span class="hljs-number">255.0</span>
</code></pre>
<hr />
<h2 id="heading-building-the-neural-network">Building the Neural Network</h2>
<p>We are building a brain with a specific structure: an input layer, two hidden layers, and an output layer. The "learning" happens in the connections between these layers. These connections are defined by <strong>weights</strong> and <strong>biases</strong>.</p>
<h3 id="heading-initializing-weights-and-biases">Initializing Weights and Biases</h3>
<p><strong>The Math:</strong> We create matrices of random numbers. The shape of each matrix is (neurons_in_layer, neurons_in_previous_layer).</p>
<p><strong>The Intuition:</strong></p>
<ul>
<li><p><strong>Weights (w):</strong> Think of a weight as the <strong>strength or importance of a connection</strong>. A large weight means the signal from a previous neuron has a strong influence. We initialize them randomly to <strong>break symmetry</strong>. If all weights started at zero, every neuron in a layer would learn the exact same thing, and our network would be no better than a single neuron. Randomness ensures they each start on a unique path to find different features.</p>
</li>
<li><p><strong>Biases (b):</strong> Think of a bias as an <strong>"activation threshold."</strong> It's a value that determines how easy it is for a neuron to "fire" (output a high value). A neuron with a large negative bias will require a very strong input signal to become active.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># --- Weights: The strength of connections ---</span>
<span class="hljs-comment"># w_i_h1 connects 784 input neurons to 64 hidden layer 1 neurons</span>
w_i_h1 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">64</span>, <span class="hljs-number">784</span>))
<span class="hljs-comment"># w_h1_h2 connects 64 L1 neurons to 32 L2 neurons</span>
w_h1_h2 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">32</span>, <span class="hljs-number">64</span>))
<span class="hljs-comment"># w_h2_o connects 32 L2 neurons to 10 output neurons</span>
w_h2_o = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">10</span>, <span class="hljs-number">32</span>))

<span class="hljs-comment"># --- Biases: The neuron's activation threshold ---</span>
b_i_h1 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">64</span>, <span class="hljs-number">1</span>))
b_h1_h2 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">32</span>, <span class="hljs-number">1</span>))
b_h2_o = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">10</span>, <span class="hljs-number">1</span>))
</code></pre>
<p>On a side note, I’ve set up 64 neurons for the first hidden layer and 32 for the second. These sizes are somewhat arbitrary; many other choices would perform comparably. I find 64 and 32 the easiest to reason about for an example. Note also that both the weights and the biases were initialized from a uniform distribution over [-0.5, 0.5].</p>
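<p>One thing these choices do affect is the model's capacity. Each layer contributes (outputs × inputs) weights plus one bias per output neuron, so the parameter count is easy to tally (my own quick arithmetic):</p>
<pre><code class="lang-python"># weights + biases for each of the three weight matrices above
params = (64 * 784 + 64) + (32 * 64 + 32) + (10 * 32 + 10)
print(params)  # 52650 trainable parameters
</code></pre>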
<hr />
<h3 id="heading-training-loop"><strong>Training Loop</strong></h3>
<p>The training loop is where the magic happens. It's a cycle of guessing, checking, and correcting that, when repeated thousands of times, allows the network to learn. We will dissect this process into four distinct stages that occur for <em>every single image</em> in our training data.</p>
<p>Here's the full code block for context. We will break it down piece by piece below.</p>
<pre><code class="lang-python"><span class="hljs-comment"># --- The Full Training Loop ---</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(epochs):
    <span class="hljs-comment"># ... (code for tracking error and accuracy)</span>
    <span class="hljs-keyword">for</span> image, label <span class="hljs-keyword">in</span> zip(x_train, y_train):
        <span class="hljs-comment"># STAGE A: Data Preparation</span>
        image = image.reshape(<span class="hljs-number">784</span>, <span class="hljs-number">1</span>)
        label_vec = np.zeros((<span class="hljs-number">10</span>, <span class="hljs-number">1</span>))
        label_vec[label] = <span class="hljs-number">1</span>

        <span class="hljs-comment"># STAGE B: Forward Propagation</span>
        h1_pre = w_i_h1 @ image + b_i_h1
        h1 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h1_pre))
        h2_pre = w_h1_h2 @ h1 + b_h1_h2
        h2 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h2_pre))
        o_pre = w_h2_o @ h2 + b_h2_o
        exps = np.exp(o_pre - np.max(o_pre))
        o = exps / np.sum(exps)
        o = np.clip(o, epsilon, <span class="hljs-number">1</span> - epsilon)

        <span class="hljs-comment"># STAGE C: Loss Calculation</span>
        error = -np.sum(label_vec * np.log(o))
        <span class="hljs-comment"># ... (update total error and accuracy)</span>

        <span class="hljs-comment"># STAGE D: Backpropagation</span>
        delta_o = o - label_vec
        w_h2_o += -learning_rate * delta_o @ np.transpose(h2)
        b_h2_o += -learning_rate * delta_o

        delta_h2 = np.transpose(w_h2_o) @ delta_o * (h2 * (<span class="hljs-number">1</span> - h2))
        w_h1_h2 += -learning_rate * delta_h2 @ np.transpose(h1)
        b_h1_h2 += -learning_rate * delta_h2

        delta_h1 = np.transpose(w_h1_h2) @ delta_h2 * (h1 * (<span class="hljs-number">1</span> - h1))
        w_i_h1 += -learning_rate * delta_h1 @ np.transpose(image)
        b_i_h1 += -learning_rate * delta_h1
</code></pre>
<hr />
<h3 id="heading-data-preparation"><strong>Data Preparation</strong></h3>
<p>Before we can feed data to our network, we must format it correctly.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reshape the input image from a 28x28 matrix to a 784x1 vector</span>
image = image.reshape(<span class="hljs-number">784</span>, <span class="hljs-number">1</span>)

<span class="hljs-comment"># Create a "one-hot" encoded vector for the label</span>
label_vec = np.zeros((<span class="hljs-number">10</span>, <span class="hljs-number">1</span>))
label_vec[label] = <span class="hljs-number">1</span>
</code></pre>
<p><strong>1. Flattening the Image:</strong><br />Our network's input layer has 784 neurons, arranged as a single line. The raw image is a 28x28 grid of pixels. By reshaping it into a (784, 1) column vector, we are "unspooling" the grid into a flat list that can be directly multiplied by our first weight matrix (w_i_h1), which has a shape of (64, 784).</p>
<p><strong>2. Vectorizing the Label (One-Hot Encoding):</strong><br />A human understands the label 7, but our network's output is a vector of 10 probabilities. To measure the error, we need a "ground truth" target in the same format. <strong>One-hot encoding</strong> does this. For the label 7, it creates a vector that is zero everywhere except for a 1 at index 7, representing "100% confidence that the digit is 7."</p>
<hr />
<h3 id="heading-forward-propagation"><strong>Forward Propagation</strong></h3>
<p>Here, data flows forward through the network, from the input pixels to the final probability outputs.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Hidden Layer 1</span>
h1_pre = w_i_h1 @ image + b_i_h1
h1 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h1_pre))  <span class="hljs-comment"># Sigmoid activation</span>

<span class="hljs-comment"># Hidden Layer 2</span>
h2_pre = w_h1_h2 @ h1 + b_h1_h2
h2 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h2_pre))  <span class="hljs-comment"># Sigmoid activation</span>

<span class="hljs-comment"># Output Layer</span>
o_pre = w_h2_o @ h2 + b_h2_o
exps = np.exp(o_pre - np.max(o_pre)) <span class="hljs-comment"># Stable Softmax</span>
o = exps / np.sum(exps)
o = np.clip(o, epsilon, <span class="hljs-number">1</span>-epsilon)
</code></pre>
<p>At each layer, we do two things:</p>
<ol>
<li><p><strong>Linear Combination:</strong> Z = W @ A_prev + b. This is a weighted sum. The matrix multiplication (@) aggregates signals from the previous layer, and the bias (b) shifts the result.</p>
</li>
<li><p><strong>Activation:</strong> A = g(Z). We pass the result through a non-linear activation function.</p>
</li>
</ol>
<h4 id="heading-a-closer-look-at-our-activation-functions"><strong>A Closer Look at Our Activation Functions</strong></h4>
<ul>
<li><p><strong>Sigmoid (for Hidden Layers):</strong> σ(z) = 1 / (1 + e⁻ᶻ)</p>
<ul>
<li><p><strong>The Math:</strong> This function takes any real number z and "squashes" it into a range between 0 and 1.</p>
</li>
<li><p><strong>The Intuition:</strong> It acts like a dimmer switch or a "gate." A value close to 1 means the neuron is highly "active" and passing its signal forward. A value close to 0 means it's "inactive." Crucially, it's <strong>non-linear</strong>. Without non-linearity, stacking layers would be pointless; the entire network would collapse into a single, less powerful linear transformation.</p>
</li>
</ul>
</li>
<li><p><strong>Softmax (for the Output Layer):</strong> S(zᵢ) = eᶻᵢ / Σ eᶻⱼ</p>
<ul>
<li><p><strong>The Math:</strong> It exponentiates each input score (making them all positive) and then divides by the sum of all exponentiated scores.</p>
</li>
<li><p><strong>The Intuition:</strong> This is why we use Softmax for the output layer in a classification problem. Its output has two beautiful properties:</p>
<ol>
<li><p>All output values are between 0 and 1.</p>
</li>
<li><p>All output values <strong>sum to 1</strong>.<br /> This transforms the network's raw final scores (o_pre) into a <strong>probability distribution</strong>. We can interpret the output o as the network's <em>confidence</em> in each digit. We can't use a function like ReLU (max(0,z)) here because its outputs don't sum to 1 and can't be interpreted as probabilities.</p>
</li>
</ol>
</li>
</ul>
</li>
</ul>
<p>Finally, notice that the value of “o” is clipped by epsilon. This prevents the computation from breaking: the loss function uses np.log(), and log(0) is negative infinity.</p>
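<p>These properties, and the reason for the epsilon clip, are easy to verify numerically. A minimal NumPy sketch:</p>

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Subtracting the max before exp() is the "stable" trick from the forward pass
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

s = sigmoid(np.array([-5.0, 0.0, 5.0]))  # values near 0, exactly 0.5, and near 1
o = softmax(np.array([2.0, 1.0, 0.1]))   # a valid probability distribution

# Clipping keeps a later np.log(o) finite even if a probability underflows to 0
epsilon = 1e-12
o = np.clip(o, epsilon, 1 - epsilon)
```

<p>Every sigmoid output lands strictly between 0 and 1, and the softmax outputs sum to 1, which is exactly what lets us read them as probabilities.</p>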
<hr />
<h3 id="heading-cross-entropy-loss"><strong>Cross-Entropy Loss</strong></h3>
<p>Now that we have a prediction (o) and a ground truth (label_vec), we can quantify how wrong the network was.</p>
<pre><code class="lang-python">error = -np.sum(label_vec * np.log(o)) <span class="hljs-comment"># Cross-Entropy Loss</span>
</code></pre>
<ul>
<li><p><strong>The Math:</strong> L = -Σ yᵢ log(pᵢ), where y is the true label (our one-hot label_vec) and p is the prediction (o). Since y is 1 for the correct class and 0 for all others, this simplifies to L = -log(p_correct).</p>
</li>
<li><p><strong>The Intuition:</strong> This measures "surprise." If the network predicts a high probability for the correct digit (e.g., p_correct = 0.95), then -log(0.95) is a very small error. If the network is confidently wrong (e.g., p_correct = 0.01), then -log(0.01) is a very large error. This loss function heavily penalizes confident mistakes.</p>
</li>
</ul>
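<p>Plugging numbers in makes the asymmetry concrete:</p>

```python
import numpy as np

def cross_entropy(label_vec, o):
    # L = -sum(y * log(p)); with a one-hot y this reduces to -log(p_correct)
    return float(-np.sum(label_vec * np.log(o)))

y = np.zeros((10, 1))
y[7] = 1  # ground truth: digit 7

# Confidently right: 0.95 on the true class, the rest spread over the others
confident_right = np.full((10, 1), 0.05 / 9)
confident_right[7] = 0.95

# Confidently wrong: only 0.01 on the true class
confident_wrong = np.full((10, 1), 0.99 / 9)
confident_wrong[7] = 0.01

low_loss = cross_entropy(y, confident_right)   # -log(0.95), a small error
high_loss = cross_entropy(y, confident_wrong)  # -log(0.01), a huge error
```

<p>The confident mistake costs roughly ninety times more than the confident correct answer, which is precisely the pressure that pushes the network away from overconfident wrong predictions.</p>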
<hr />
<h3 id="heading-backpropagation"><strong>Backpropagation</strong></h3>
<p>This is the most mathematically rich part. We use the error to figure out how to adjust every single weight and bias. The goal is to calculate the <strong>gradient</strong> of the loss function with respect to each parameter (∂L/∂W, ∂L/∂b). The gradient tells us the direction of steepest ascent for the loss, so we take a small step in the <strong>opposite</strong> direction to reduce the error. This entire process is a practical application of the <strong>Chain Rule</strong> from calculus.</p>
<p><strong>Step 1: The Output Error Gradient</strong></p>
<pre><code class="lang-python">delta_o = o - label_vec
</code></pre>
<ul>
<li><strong>The Math:</strong> This is the derivative of the Loss with respect to the pre-activation output scores, ∂L/∂o_pre. For the specific combination of Softmax and Cross-Entropy Loss, the calculus simplifies beautifully to this intuitive form: (prediction - actual). This vector tells us the magnitude and direction of the error for each output neuron.</li>
</ul>
<p><strong>Step 2: Update Output Layer Weights &amp; Biases</strong></p>
<pre><code class="lang-python">w_h2_o += -learning_rate * delta_o @ np.transpose(h2)
b_h2_o += -learning_rate * delta_o
</code></pre>
<ul>
<li><p><strong>The Math:</strong> To get the gradient for the weights (∂L/∂w_h2_o), the Chain Rule states: ∂L/∂w_h2_o = (∂L/∂o_pre) * (∂o_pre/∂w_h2_o).</p>
</li>
<li><p><strong>Connecting to Code:</strong> We already have ∂L/∂o_pre (it's delta_o). The term ∂o_pre/∂w_h2_o is simply the activation of the layer that fed into it, h2. Therefore, the full gradient is delta_o @ np.transpose(h2). We then take a small step (-learning_rate) in the opposite direction of this gradient. The bias update is even simpler as its derivative is just 1.</p>
</li>
</ul>
<p><strong>Step 3: Propagate Error to Hidden Layer 2</strong></p>
<pre><code class="lang-python">delta_h2 = np.transpose(w_h2_o) @ delta_o * (h2 * (<span class="hljs-number">1</span> - h2))
</code></pre>
<ul>
<li><p><strong>The Math:</strong> Now we need the error for the <em>next</em> layer back, ∂L/∂h2_pre. The Chain Rule expands: ∂L/∂h2_pre = (∂L/∂o_pre) × (∂o_pre/∂h2) × (∂h2/∂h2_pre).</p>
</li>
<li><p><strong>Connecting to Code:</strong></p>
<ul>
<li><p>(∂L/∂o_pre) * (∂o_pre/∂h2): This is the output error (delta_o) propagated backward through the weights (w_h2_o). This is np.transpose(w_h2_o) @ delta_o. It tells us how much each h2 neuron contributed to the final output error.</p>
</li>
<li><p>(∂h2/∂h2_pre): This is the derivative of the Sigmoid activation function, which conveniently is σ(z) × (1 - σ(z)), or in our code, h2 * (1 - h2).</p>
</li>
<li><p>The term h2 * (1-h2) has a powerful intuition: neurons that were very certain (output near 0 or 1) have a small derivative and are changed very little. Neurons that were uncertain (output near 0.5) have the largest derivative and are updated the most. The network focuses its learning on its points of uncertainty!</p>
</li>
</ul>
</li>
</ul>
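<p>A quick numerical check of that intuition about uncertainty:</p>

```python
def sigmoid_derivative(a):
    # Derivative of the sigmoid expressed in terms of its own output a = sigmoid(z)
    return a * (1 - a)

certain_off = sigmoid_derivative(0.01)  # neuron confidently inactive
certain_on = sigmoid_derivative(0.99)   # neuron confidently active
uncertain = sigmoid_derivative(0.5)     # neuron on the fence
```

<p>The derivative peaks at 0.25 for an output of exactly 0.5 and collapses toward zero at either extreme, so the gradient updates concentrate on the undecided neurons.</p>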
<p><strong>Steps 4-6: Continue the Process Backward</strong><br />The remaining lines repeat this exact pattern. The error delta_h2 is used to update w_h1_h2 and b_h1_h2, and then it's propagated further back to create delta_h1, which is used to update the final set of weights and biases. This "chain" is what gives the Chain Rule its name.</p>
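<p>Spelled out under the same variable names, steps 4–6 look like this (a runnable sketch with toy shapes: the 784-input and 64-unit sizes come from the article, while the 16-unit second hidden layer is an assumption for illustration):</p>

```python
import numpy as np

# Toy shapes so the sketch runs standalone; real training would reuse the
# activations and delta_h2 computed in the forward and backward passes above.
rng = np.random.default_rng(0)
learning_rate = 0.01
image = rng.random((784, 1))
h1 = rng.random((64, 1))
delta_h2 = rng.standard_normal((16, 1))
w_h1_h2 = rng.standard_normal((16, 64)) * 0.1
b_h1_h2 = np.zeros((16, 1))
w_i_h1 = rng.standard_normal((64, 784)) * 0.1
b_i_h1 = np.zeros((64, 1))

# Step 4: update the weights feeding hidden layer 2 (same pattern as step 2)
w_h1_h2 += -learning_rate * delta_h2 @ np.transpose(h1)
b_h1_h2 += -learning_rate * delta_h2

# Step 5: propagate the error back to hidden layer 1 (same pattern as step 3)
delta_h1 = np.transpose(w_h1_h2) @ delta_h2 * (h1 * (1 - h1))

# Step 6: update the first layer's weights and biases
w_i_h1 += -learning_rate * delta_h1 @ np.transpose(image)
b_i_h1 += -learning_rate * delta_h1
```

<p>Each step reuses the delta from the layer in front of it, which is the "chain" in action.</p>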
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>We took a blank file, a handful of mathematical principles, and a collection of pixels, and we forged a system that can learn. But the most valuable thing we built today wasn't a digit classifier, it was <strong>intuition</strong>.</p>
<p>In a world dominated by powerful frameworks like PyTorch and TensorFlow, it's tempting to jump straight to the high-level commands.</p>
<p>These tools are incredible for productivity and are essential for building state-of-the-art models. But used without a foundational understanding, they can become "magic black boxes" that work in mysterious ways. When they break, we don't know why. When we need to innovate, we don't know how.</p>
<p>This project was our refusal to accept the magic box.</p>
<p>Take this blog with a grain of salt, but I’d rather share a project built from scratch and grounded in intuition (even if that makes it more prone to mistakes) as a way to express my creativity in applied mathematics and to further the discussion around AI.</p>
<p>code link: <a target="_blank" href="https://github.com/HarvsDucs/mnist_from_scratch/blob/main/main.py">https://github.com/HarvsDucs/mnist_from_scratch/blob/main/main.py</a></p>
<p>I also highly suggest checking out 3blue1brown’s video about backpropagation calculus: <a target="_blank" href="https://www.youtube.com/watch?v=tIeHLnjs5U8">https://www.youtube.com/watch?v=tIeHLnjs5U8</a></p>
]]></content:encoded></item><item><title><![CDATA[Why ML Fails at Stock/Crypto Prediction]]></title><description><![CDATA[The use of ML (machine learning) has been on the rise for the past 10 years. The idea of machine learning has been a trend for quite a while now and many people coming from different domains try to grasp the topic in hopes for a better opportunity ca...]]></description><link>https://hddatascience.tech/why-ml-fails-at-stockcrypto-prediction</link><guid isPermaLink="true">https://hddatascience.tech/why-ml-fails-at-stockcrypto-prediction</guid><category><![CDATA[Cryptocurrency]]></category><category><![CDATA[stockmarket]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 20 Jun 2025 04:57:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750394368627/78f2c587-0a5d-4dc6-b468-0bf31e39aa84.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750393230185/747a5f66-f473-46af-8a1a-9736274415ce.png" alt class="image--center mx-auto" /></p>
<p>The use of ML (machine learning) has been on the rise for the past 10 years. The idea of machine learning has been a trend for quite a while now and many people coming from different domains try to grasp the topic in hopes for a better opportunity career-wise and financial-wise.</p>
<p>From my personal view, there has been lots of side-projects made by newcomers in the field, of analysis that involves stock price prediction or crypto price prediction. Don’t get me wrong, I’m not against the idea of using ML for these use-cases, but I just want to point out how traditional ML techniques are misused a lot for this specific domain.</p>
<h2 id="heading-beyond-price-and-time-missing-market-signals">Beyond price and time: Missing market signals</h2>
<p>Almost all (if not all) of the side projects I’ve seen that try to predict stock/crypto prices settle for open/close prices over time with some added lagged features to predict future prices. As someone who’s been a crypto trader for more than 5 years and a data scientist for more than 3, I know there are far more factors affecting crypto prices. These include market structure, technical, adoption, regulatory, and market sentiment factors.</p>
<p>Honestly speaking, there are more factors than what I have listed, and incorporating them as features or parameters will only make your model better, if more complex. If you could build a model that captures all of these factors well (assuming you have a way of representing them well enough), then maybe you really have a shot at predicting future stock/crypto prices.</p>
<h2 id="heading-trading-against-quant-traders-hedge-funds-and-banks">Trading against Quant Traders, Hedge Funds, and Banks</h2>
<p>Using a very simple model to predict stock/crypto prices is almost disrespectful to the years of domain experience professional traders have. They know of factors we don’t even know exist, and those factors are key to predicting stock/crypto prices.</p>
<p>I’m not saying we can’t beat professional traders with traditional ML techniques; maybe we can build a better model. But at the end of the day, these players have the ‘data advantage’. Having been in data science for quite a while now, I know that nothing beats good, clean data for modelling. My personal long-term goal is to build an algorithm that could aid my trading journey. I know it’s a long and continuous process, but it’s one I’m willing to be part of.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The reality is that when you trade, you are going against institutions with years of domain experience, a superbly developed market intuition, and deep pockets to buy whatever data advantage they need to build a better model. The gap between side projects that predict stock/crypto prices and professional trading systems is wider than most people expect. This doesn’t mean that developing your own prediction model as an individual or ‘retail’ trader will never work; it means there is room for improvement and many chances to make whatever model you have better. In a game where your opponents have billion-dollar budgets and decades of experience, a little humility that pushes you to keep learning is a good thing. That is a lesson worth internalizing before risking your capital on your model.</p>
]]></content:encoded></item><item><title><![CDATA[Reinforcement Learning is the inevitable]]></title><description><![CDATA[The internet is full of bad data, and that is where the training data for LLMs are coming from. Some are polluting the internet with bad data out of spite. Most just spam the internet of AI generated data for profit One day, we might see a future whe...]]></description><link>https://hddatascience.tech/reinforcement-learning-is-the-inevitable</link><guid isPermaLink="true">https://hddatascience.tech/reinforcement-learning-is-the-inevitable</guid><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Wed, 18 Jun 2025 01:34:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750207255359/fe7bf9b5-020d-4873-8e72-4e6ad97a7a13.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The internet is full of bad data, and that is where the training data for LLMs are coming from. Some are polluting the internet with bad data out of spite. Most just spam the internet of AI generated data for profit One day, we might see a future where an outside training data is necessary in order for these models to train as reinforcement learning will be the new standard for model training. Feels odd, doesn’t it?</p>
<h2 id="heading-why-reinforcement-learning">Why reinforcement learning</h2>
<p>AGI (Artificial General Intelligence) is the talk of the town right now, especially among top executives working closely with AI. I sure ain’t got much, but I’m willing to bet a lot on AGI emerging from reinforcement learning. You can imagine reinforcement learning as a brute-force algorithm (compared to the traditional neural network architecture) that tries every possible solution in a given environment space, with the most optimal one chosen according to the rewards and punishments you set.</p>
<h2 id="heading-what-made-reinforcement-learning-different">What made reinforcement learning different</h2>
<p>Reinforcement learning has proven time and time again that models trained with it find the most efficient way to reach their goal, often surpassing human intuition and assumptions about the domain. An autonomous helicopter even learned to fly inverted through reinforcement learning, its only goals being to fly, stay above the surface, and not crash. Isn’t it kind of funny that we humans had never thought of flying that way?</p>
<h2 id="heading-types-of-reinforcement-learning-algorithms">Types of Reinforcement Learning Algorithms</h2>
<ul>
<li><p>Value-Based</p>
</li>
<li><p>Policy-Based</p>
</li>
<li><p>Model-Based</p>
</li>
<li><p>Actor-Critic Methods</p>
</li>
</ul>
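<p>As a taste of the value-based family, here is tabular Q-learning on a toy five-state corridor where the only reward is for reaching the rightmost state. The environment and every number here are my own illustration, not something from the post:</p>

```python
import numpy as np

# Toy corridor: states 0..4, actions 0 (left) / 1 (right), reward at state 4.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = np.ones((n_states, n_actions))  # optimistic init encourages exploration
rng = np.random.default_rng(42)

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for _ in range(500):  # episodes
    s = 0
    for _ in range(20):  # steps per episode
        # Epsilon-greedy action selection: mostly exploit, occasionally explore
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: nudge Q[s, a] toward reward + discounted best next value
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

greedy_policy = np.argmax(Q, axis=1)  # learned action per state
```

<p>After training, the greedy policy walks right in every non-terminal state, found purely through trial, error, and the reward signal.</p>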
<h2 id="heading-conclusion">Conclusion</h2>
<p>Reinforcement learning isn’t getting much hype right now, and understandably so: it isn’t as feasible as we think it is. It is far more compute- and memory-intensive than the traditional way of training neural networks and machine learning models. As compute becomes less expensive, we will see a world where reinforcement learning is a more prominent way to train AI.</p>
]]></content:encoded></item><item><title><![CDATA[PCA as a Last Resort]]></title><description><![CDATA[Introduction
Principal Component Analysis (PCA) is often the first dimensionality reduction technique that data scientists reach for when faced with high-dimensional data. While PCA is powerful and mathematically elegant, treating it as a default fir...]]></description><link>https://hddatascience.tech/pca-as-a-last-resort</link><guid isPermaLink="true">https://hddatascience.tech/pca-as-a-last-resort</guid><category><![CDATA[Pca]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Mon, 28 Apr 2025 14:34:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745850827997/19b40501-c74f-48d8-9263-f027a1d672dc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Principal Component Analysis (PCA) is often the first dimensionality reduction technique that data scientists reach for when faced with high-dimensional data. While PCA is powerful and mathematically elegant, treating it as a default first step can lead to missed opportunities and suboptimal models. This post explores why feature engineering and feature removal should be your first considerations before applying PCA.</p>
<h2 id="heading-understanding-the-limitations-of-pca">Understanding the Limitations of PCA</h2>
<p>PCA transforms your original features into new components that capture maximum variance. However, this mathematical transformation comes with significant tradeoffs:</p>
<ol>
<li><p>Loss of interpretability - Principal components are linear combinations of original features, making them difficult to explain to stakeholders</p>
</li>
<li><p>Domain knowledge is discarded - PCA is a purely statistical technique that ignores valuable domain expertise</p>
</li>
<li><p>Non-linear relationships are missed - Standard PCA only captures linear relationships between features</p>
</li>
</ol>
<h2 id="heading-feature-engineering-creating-meaningful-representations">Feature Engineering: Creating Meaningful Representations</h2>
<p>Before reducing dimensions, consider creating more informative features:</p>
<ul>
<li><p>Ratio features that capture relationships between variables (e.g., debt-to-income ratio)</p>
</li>
<li><p>Interaction terms that represent how features work together</p>
</li>
<li><p>Domain-specific transformations based on expert knowledge</p>
</li>
<li><p>Polynomial features to capture non-linear relationships</p>
</li>
</ul>
<p>These engineered features often provide more predictive power than abstract principal components while maintaining interpretability.</p>
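<p>A minimal sketch of these four ideas on a hypothetical loan dataset (all column names and values are invented for illustration):</p>

```python
import pandas as pd

# Hypothetical loan data, just to illustrate the transformations
df = pd.DataFrame({
    "monthly_debt": [500.0, 1200.0, 300.0],
    "monthly_income": [4000.0, 3000.0, 6000.0],
    "loan_amount": [10_000.0, 25_000.0, 5_000.0],
    "term_months": [36, 60, 24],
})

# Ratio feature: captures a relationship no single raw column expresses
df["debt_to_income"] = df["monthly_debt"] / df["monthly_income"]

# Interaction term: how two features work together
df["amount_x_term"] = df["loan_amount"] * df["term_months"]

# Domain-specific transformation: rough payment burden per month of the term
df["payment_per_month"] = df["loan_amount"] / df["term_months"]

# Polynomial feature: lets a linear model capture a non-linear income effect
df["income_squared"] = df["monthly_income"] ** 2
```

<p>Each new column keeps a plain-English meaning a stakeholder can follow, which is exactly what a principal component gives up.</p>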
<h2 id="heading-feature-removal-the-simplest-form-of-dimensionality-reduction">Feature Removal: The Simplest Form of Dimensionality Reduction</h2>
<p>Feature removal should be your first dimensionality reduction approach because:</p>
<ul>
<li><p>It preserves the original meaning of remaining features</p>
</li>
<li><p>It forces critical thinking about which variables truly matter</p>
</li>
<li><p>It simplifies your model and reduces overfitting</p>
</li>
</ul>
<p>Methods for informed feature removal include:</p>
<ul>
<li><p>Correlation analysis to identify redundant features</p>
</li>
<li><p>Feature importance rankings from tree-based models</p>
</li>
<li><p>Filter methods like variance thresholds and mutual information</p>
</li>
<li><p>Wrapper methods such as recursive feature elimination</p>
</li>
</ul>
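<p>Two of the simplest of these methods, a variance threshold and correlation-based pruning, can be sketched with pandas alone (synthetic data and invented column names for illustration):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 * 2 + rng.normal(scale=0.01, size=n),  # nearly redundant with x1
    "x2": rng.normal(size=n),                            # independent signal
    "const": np.ones(n),                                 # zero variance: useless
})

# Filter method: drop (near-)zero-variance columns
variances = df.var()
df = df.loc[:, variances > 1e-8]

# Correlation analysis: drop one column from each highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
kept = df.drop(columns=to_drop)
```

<p>The surviving columns still mean what they meant before, so any downstream model stays interpretable.</p>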
<h2 id="heading-when-pca-makes-sense">When PCA Makes Sense</h2>
<p>PCA becomes valuable after you've exhausted feature engineering and removal options, particularly when:</p>
<ul>
<li><p>You still have high dimensionality after careful feature selection</p>
</li>
<li><p>Multicollinearity remains a significant issue</p>
</li>
<li><p>Computational efficiency is a critical concern</p>
</li>
<li><p>You're using specific algorithms that benefit from orthogonal features</p>
</li>
<li><p>Visualization of high-dimensional data is needed</p>
</li>
</ul>
<h2 id="heading-a-better-workflow-for-dimensionality-reduction">A Better Workflow for Dimensionality Reduction</h2>
<p>Instead of immediately applying PCA, follow this approach:</p>
<ol>
<li><p>Start with domain knowledge to engineer meaningful features</p>
</li>
<li><p>Apply feature selection techniques to remove redundant or irrelevant variables</p>
</li>
<li><p>Use PCA only on the remaining features if dimensionality is still problematic</p>
</li>
<li><p>Consider non-linear dimensionality reduction techniques (t-SNE, UMAP) if linear PCA performs poorly</p>
</li>
</ol>
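<p>Step 3 can be sketched with scikit-learn, assuming steps 1–2 left you with, say, 20 engineered features (synthetic data for illustration): ask PCA for just enough components to explain 95% of the variance.</p>

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for features that survived engineering and selection:
# 20 observed columns driven by 3 underlying factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 20))

# A float n_components asks for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

<p>Because the dimensionality was already tamed before this step, the handful of components PCA keeps is far easier to work with than components squeezed out of hundreds of raw features.</p>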
<h2 id="heading-conclusion">Conclusion</h2>
<p>While PCA is a valuable tool in the data scientist's toolkit, it should rarely be your first choice for dimensionality reduction. By prioritizing feature engineering and thoughtful feature removal, you'll create models that are not only more accurate but also more interpretable and actionable. Save PCA for when you truly need it—as a last resort after you've leveraged your domain knowledge and simpler techniques.</p>
]]></content:encoded></item><item><title><![CDATA[Chunking Methods for RAG: What and Why]]></title><description><![CDATA[The Day My RAG System Failed
Meet Charlie, a developer who learned a valuable lesson about RAG systems the hard way. (Let’s just call him Charlie but we all really know who he is. 😉)
Last year, Charlie built what he thought was the perfect RAG (Retr...]]></description><link>https://hddatascience.tech/chunking-methods-for-rag-what-and-why</link><guid isPermaLink="true">https://hddatascience.tech/chunking-methods-for-rag-what-and-why</guid><category><![CDATA[RAG ]]></category><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Mon, 03 Mar 2025 06:17:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740977490305/079d57bc-da14-42c1-beac-f61150865622.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-day-my-rag-system-failed">The Day My RAG System Failed</h2>
<p>Meet Charlie, a developer who learned a valuable lesson about RAG systems the hard way. (Let’s just call him Charlie but we all really know who he is. 😉)</p>
<p>Last year, Charlie built what he thought was the perfect RAG (Retrieval-Augmented Generation) system for a Solar Company. The architecture was elegant, the UI sleek, and the LLM integration seamless. It was perfect for consuming PDFs and being a central source of truth for onboarding new employees.</p>
<p>Then came the testing portion. An employee asked the system a specific question about sales technique. The system confidently responded with completely wrong information that contradicted their report. The room went silent.</p>
<p>That was the day Charlie learned that a RAG system is only as good as its chunking strategy, and he had chosen the wrong one.</p>
<p>You might think chunking documents is just about splitting text into smaller pieces. But in reality, it's the strategic foundation that can make or break your entire RAG system's effectiveness.</p>
<h2 id="heading-why-most-rag-systems-fail-despite-using-advanced-models">Why Most RAG Systems Fail Despite Using Advanced Models</h2>
<p>When engineers build RAG systems, they often focus on the fancy parts, the latest embedding models, vector stores, and prompt engineering techniques. Yet many overlook the humble chunking step, treating it as a trivial preprocessing task.</p>
<p>This is a big mistake.</p>
<p>No matter how advanced your retrieval algorithms or language models are, if your chunks don't properly preserve context and semantic meaning, your RAG system will inevitably deliver hallucinations and irrelevant responses.</p>
<h2 id="heading-the-three-chunking-methods-you-need-to-know">The Three Chunking Methods You Need to Know</h2>
<h3 id="heading-1-fixed-size-chunking-the-default-trap">1. Fixed-Size Chunking: The Default Trap</h3>
<p>Most engineers start with fixed-size chunking, slicing documents into equal segments of token or character counts. It's simple and conventional.</p>
<p>But here's the shocking truth: fixed-size chunking regularly destroys the semantic cohesion of your content. When you arbitrarily split text every 512 tokens, you're likely cutting right through important concepts, breaking relationships between sentences, and fragmenting contextual information.</p>
<h3 id="heading-example-scenario">Example Scenario</h3>
<p>Imagine we're feeding a RAG system these two sentences:</p>
<ul>
<li><p>"The brown dog jumps over the lazy fox."</p>
</li>
<li><p>"A brown dog jumps over the lazy fox quickly."</p>
</li>
</ul>
<p>If we use fixed-size chunks (e.g., 20 characters), we might get these chunks:</p>
<ul>
<li><p>Sentence 1: "The brown dog jumps o" and "ver the lazy fox."</p>
</li>
<li><p>Sentence 2: "A brown dog jumps ove" and "r the lazy fox quickly."</p>
</li>
</ul>
<p>Now, if a user asks "What animal jumps over the lazy fox?", neither chunk perfectly captures the key information. The query terms are split across chunks due to the slight shift ("The" vs. "A"). The RAG system might miss the crucial link, even though both sentences clearly answer the question. This demonstrates how fixed-size chunking can break semantic context and hurt retrieval accuracy.</p>
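<p>A fixed-size splitter is a one-liner, which is exactly why it is so tempting. This sketch uses a 20-character window (so the breakpoints differ slightly from the approximate chunks quoted above):</p>

```python
def fixed_size_chunks(text, size):
    # Slice the text every `size` characters, with no regard for word boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]

s1 = "The brown dog jumps over the lazy fox."
s2 = "A brown dog jumps over the lazy fox quickly."
chunks1 = fixed_size_chunks(s1, 20)
chunks2 = fixed_size_chunks(s2, 20)
```

<p>The one-character shift between "The" and "A" is enough to cut "over" in half in the second sentence, which is precisely the fragility described above.</p>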
<h3 id="heading-2-recursive-character-text-splitting">2. Recursive Character Text Splitting</h3>
<p>Recursive character splitting aims to create smart chunks, but it can still <em>fragment context</em>.</p>
<h3 id="heading-example-scenario-1">Example Scenario</h3>
<p>Recursive character splitting breaks text by punctuation (periods, etc.), then characters if needed. Sounds good, but it can still <em>fragment context</em>.</p>
<p><strong>Example:</strong></p>
<p>Consider: "Climate change threatens coasts. Rising sea levels are a problem. Communities rely on fishing and tourism. Reducing emissions is crucial."</p>
<p>Recursive splitting (aiming for ~100-character chunks) might give:</p>
<ul>
<li><p>"Climate change threatens coasts. Rising sea levels are a problem."</p>
</li>
<li><p>"Communities rely on fishing and tourism. Reducing emissions is crucial."</p>
</li>
</ul>
<p>If someone asks, "How do we protect coastal communities?", the link between climate change <em>specifically</em> and the need for emissions reductions is weakened. It's now less clear that reducing emissions directly addresses the threats <em>caused</em> by climate change.</p>
<p><strong>The problem?</strong> Even with punctuation-based splitting, <em>semantic relationships</em> between chunks can be lost. Don't assume it's perfect; experiment and consider smarter chunking for optimal RAG!</p>
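<p>A minimal recursive splitter can be sketched in a few lines. This is my own simplified implementation of the idea, not a library call; libraries such as LangChain ship production versions of the same concept:</p>

```python
def recursive_split(text, max_len, separators=("\n\n", ". ", " ")):
    """Greedily pack pieces into chunks of at most max_len characters,
    trying coarser separators first and finer ones on oversized chunks."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            parts = [p + sep for p in text.split(sep)]
            parts[-1] = parts[-1][:-len(sep)]  # the last part has no trailing sep
            chunks, current = [], ""
            for piece in parts:
                if current and len(current) + len(piece) > max_len:
                    chunks.append(current.rstrip())
                    current = ""
                current += piece
            if current:
                chunks.append(current.rstrip())
            # Any chunk still too long is re-split with the finer separators
            return [c for ch in chunks for c in recursive_split(ch, max_len, separators)]
    # No separator found at all: fall back to hard slicing
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

text = ("Climate change threatens coasts. Rising sea levels are a problem. "
        "Communities rely on fishing and tourism. Reducing emissions is crucial.")
chunks = recursive_split(text, max_len=100)
```

<p>On the climate example this reproduces the two chunks shown above: the splits land neatly on sentence boundaries, yet the causal link between the two chunks is still lost.</p>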
<h3 id="heading-3-semantic-chunking-the-contextual-approach">3. Semantic Chunking: The Contextual Approach</h3>
<p>Unlike fixed-size chunking, semantic chunking preserves meaning by respecting natural boundaries in the text, paragraphs, sections, or semantic units.</p>
<h3 id="heading-example-scenario-2">Example Scenario</h3>
<p>"The use of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP). LLMs, trained on massive datasets, demonstrate impressive capabilities in text generation, translation, and question answering. However, LLMs also present challenges. One significant concern is the potential for generating biased or harmful content. Careful data curation and bias mitigation techniques are crucial. Another challenge is the computational cost associated with training and deploying these models. Research into more efficient architectures is ongoing. Despite these challenges, the benefits of LLMs are undeniable, and their applications are rapidly expanding across various industries."</p>
<p><strong>Fixed-Size Chunking (Problem):</strong></p>
<p>If we used a fixed-size chunk of, say, 200 characters, we might get chunks like:</p>
<ul>
<li><p>"The use of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP). LLMs, trained on massive datasets, demonstrate impressive capabilities in text generation, translation, and quest"</p>
</li>
<li><p>"ion answering. However, LLMs also present challenges. One significant concern is the potential for generating biased or harmful content. Careful data curation and bias mitigation techniques are cru"</p>
</li>
<li><p>"cial. Another challenge is the computational cost associated with training and deploying these models. Research into more efficient architectures is ongoing. Despite these challenges, the benefits"</p>
</li>
<li><p>" of LLMs are undeniable, and their applications are rapidly expanding across various industries."</p>
</li>
</ul>
<p>Notice how the chunks break in the middle of sentences and thoughts.</p>
<p><strong>Semantic Chunking (Solution):</strong></p>
<p>A semantic chunking approach might produce the following chunks, identifying logical breaks between topics:</p>
<ul>
<li><p><strong>Chunk 1:</strong> "The use of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP). LLMs, trained on massive datasets, demonstrate impressive capabilities in text generation, translation, and question answering." (This chunk focuses on the positive impact of LLMs)</p>
</li>
<li><p><strong>Chunk 2:</strong> "However, LLMs also present challenges. One significant concern is the potential for generating biased or harmful content. Careful data curation and bias mitigation techniques are crucial." (This chunk focuses on the bias challenge and mitigation)</p>
</li>
<li><p><strong>Chunk 3:</strong> "Another challenge is the computational cost associated with training and deploying these models. Research into more efficient architectures is ongoing." (This chunk focuses on the computational cost challenge)</p>
</li>
<li><p><strong>Chunk 4:</strong> "Despite these challenges, the benefits of LLMs are undeniable, and their applications are rapidly expanding across various industries." (This chunk serves as a conclusion, summarizing the overall value of LLMs).</p>
</li>
</ul>
<p><strong>Why Semantic Chunking Wins:</strong></p>
<ul>
<li><p><strong>If the user asks:</strong> "What are the advantages of LLMs?", Chunk 1 is a perfect match.</p>
</li>
<li><p><strong>If the user asks:</strong> "What are the challenges with LLMs?", Chunks 2 and 3 provide detailed answers to different challenges.</p>
</li>
<li><p><strong>If the user asks:</strong> "Are LLMs useful despite their problems?", Chunk 4 provides the concluding perspective.</p>
</li>
</ul>
<p>Traditional chunking methods (fixed-size, punctuation-based) often fragment context, hurting RAG performance. Semantic chunking aims for <em>meaningful</em> chunks, leading to better retrieval and generation.</p>
<p><strong>Why This Works (A Simplified Analogy):</strong></p>
<p>Imagine each word has a "location" in a semantic space. You could imagine the location as a single point in a vector space described by the values of the embedding vector. Semantic chunking tries to group words into chunks where the "average location" (centroid) of all words in the chunk is close to each individual word. The closer the words are, the more coherent and semantically related the chunk is. This minimizes "semantic distance" within each chunk, maximizing its relevance. Fixed chunking ignores this all together.</p>
<p><strong>How it Works (Simply):</strong></p>
<p>Instead of rigidly sticking to character counts or punctuation, semantic chunking tries to <em>understand</em> the text. It identifies logical boundaries based on the content itself. This might involve:</p>
<ul>
<li><p>Looking for topic shifts.</p>
</li>
<li><p>Identifying clear beginnings and endings of arguments.</p>
</li>
<li><p>Using more sophisticated NLP techniques to recognize semantic similarity within a chunk.</p>
</li>
</ul>
<p>A visual example:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740981358348/a3dd1bee-aa3d-4ae3-bb52-f4cd5e91390f.png" alt class="image--center mx-auto" /></p>
<p><em>Note that the sentences and the vectors that represent them on the graph are arbitrary, and only exist to show what it means to semantically chunk.</em></p>
<p>The moment the next sentence's embedding is far off, it is automatically treated as the start of a new chunk. One could loosely compare this to deduplicating records in a database to save storage and compute, although here "similar" means a high cosine similarity between vectors, not exact equality.</p>
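<p>That boundary rule can be sketched in a few lines (the <code>embed</code> function and the 0.7 threshold are placeholder assumptions; a real pipeline would call an embedding model and tune the cutoff):</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_chunks(sentences, embed, threshold=0.7):
    # Greedy pass: start a new chunk whenever the next sentence's
    # embedding drifts too far from the previous one.
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sent])        # far off -> new chunk
        else:
            chunks[-1].append(sent)      # still on topic -> same chunk
        prev_vec = vec
    return chunks

# Invented embeddings: the first two sentences share a "topic direction".
fake_embeddings = {
    "LLMs are powerful.": [1.0, 0.1],
    "They generate text.": [0.9, 0.2],
    "Training costs are high.": [0.1, 1.0],
}
print(semantic_chunks(list(fake_embeddings), fake_embeddings.get))
```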
<h3 id="heading-4-agentic-chunking-the-advanced-solution">4. Agentic Chunking: The Advanced Solution</h3>
<p>A related technique is worth mentioning first: for complex documents with nested structure, hierarchical chunking creates multiple levels of granularity (section-level, paragraph-level, and sentence-level chunks).</p>
<p>Even the best semantic chunking has limitations. It's <em>static</em>: it analyzes the data once and creates fixed chunks, which breaks down on messy data like scraped websites or complex PDFs.</p>
<p><strong>The Problem: Unstructured Data &amp; Varying Queries</strong></p>
<ul>
<li><p><strong>Websites:</strong> Noisy HTML, ads, irrelevant disclaimers.</p>
</li>
<li><p><strong>PDFs:</strong> Complex formatting, embedded images breaking text flow.</p>
</li>
<li><p><strong>Long Text:</strong> Subtle topic shifts, making uniform chunking ineffective.</p>
</li>
<li><p><strong>Query Variance:</strong> Some questions need broad context, others specific details. Static chunks can't adapt.</p>
</li>
</ul>
<p><strong>Agentic RAG: Dynamic &amp; Adaptive Chunking to the Rescue!</strong></p>
<p>Agentic RAG uses an "agent" (often a smaller LLM) to <em>dynamically</em> analyze the data and <em>adapt</em> the chunking strategy based on the source and the user's query.</p>
<p><strong>Example: Scraping a Product Review Website</strong></p>
<ul>
<li><p><strong>Static Chunking:</strong> You scrape a product review page and try to split by HTML structure. You end up with chunks containing navigation menus, ads, and user comments alongside the actual review.</p>
</li>
<li><p><strong>Agentic RAG Approach:</strong></p>
<ol>
<li><p><strong>Agent Identifies Core Content:</strong> The agent identifies the main review text, ignoring irrelevant parts. It might use rules like, "Find the longest text block within the <code>&lt;article&gt;</code> tag" or "Identify the section with the highest concentration of keywords related to the product."</p>
</li>
<li><p><strong>Content-Aware Chunking:</strong> Now that the agent has the main article, it can do semantic chunking on <em>just</em> the review content, prioritizing sections with headings like "Pros," "Cons," or "Performance."</p>
</li>
<li><p><strong>Query-Aware Chunking:</strong> If the user asks, "What are the drawbacks of this product?", the agent could <em>re-chunk</em> the review, focusing specifically on sentences containing keywords related to "drawbacks," "cons," "problems," or "issues," creating highly targeted chunks.</p>
</li>
</ol>
</li>
</ul>
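<p>The first and third steps above can be sketched with plain string handling (the page snippet, keyword list, and helper names are hypothetical; a production agent would use a proper HTML parser and an LLM for the heavy lifting):</p>

```python
import re

def extract_core_content(html):
    # Step 1 (sketch): take the longest text block inside an <article> tag,
    # ignoring navigation, ads, and footers outside it.
    blocks = re.findall(r"<article>(.*?)</article>", html, flags=re.S)
    paragraphs = [p for b in blocks for p in re.findall(r"<p>(.*?)</p>", b, flags=re.S)]
    return max(paragraphs, key=len, default="")

def query_aware_chunks(text, keywords):
    # Step 3 (sketch): keep only sentences mentioning the query's keywords.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

page = (
    "<nav>Home | Deals</nav>"
    "<article><p>Great battery life. The main drawback is the noisy fan.</p></article>"
    "<footer>ads</footer>"
)
review = extract_core_content(page)
print(query_aware_chunks(review, ["drawback", "cons", "issue"]))
```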
<p><strong>Benefits of Agentic RAG:</strong></p>
<ul>
<li><p><strong>Noise Reduction:</strong> Filters irrelevant content before chunking.</p>
</li>
<li><p><strong>Contextual Understanding:</strong> Adapts to different data types (webpages, PDFs, etc.).</p>
</li>
<li><p><strong>Query Optimization:</strong> Tailors chunk sizes and content to answer the user's specific question.</p>
</li>
<li><p><strong>PDF Mastery:</strong> Handles PDFs by first extracting text, identifying headings, and chunking structurally.</p>
</li>
<li><p><strong>Long Text Savvy:</strong> For long texts, employs techniques like sliding windows or hierarchical summarization to maintain context across large distances.</p>
</li>
</ul>
<p><strong>Cons of Agentic RAG:</strong></p>
<ul>
<li><p><strong>Complexity Overload</strong>: Designing intelligent agents adds layers of code and complexity. You'll need stronger programming and NLP skills than with simple chunking.</p>
</li>
<li><p><strong>Higher Costs</strong>: Running agents, especially those powered by LLMs, consumes more computational resources. Expect increased latency and potentially higher cloud bills.</p>
</li>
<li><p><strong>Prompt Engineering</strong>: As with other applications built on LLMs, prompt engineering plays a critical role. If the prompts that drive the agents are off, performance will not meet requirements.</p>
</li>
<li><p><strong>Over-Engineering Trap</strong>: It's tempting to over-engineer your agents. Start with simple agents and add complexity only when it demonstrably improves results.  </p>
</li>
</ul>
<p>Agentic RAG isn't just chunking; it's <em>intelligent</em> chunking. By dynamically adapting to the data and the user's needs, it unlocks far better accuracy and relevance in RAG systems compared to static approaches. If you're serious about RAG, you need to explore agentic strategies. But then again, it might be overkill to use Agentic RAG for simple use cases.</p>
<h2 id="heading-the-impact-on-your-rag-systems-performance">The Impact on Your RAG System's Performance</h2>
<p>Choosing the right chunking method isn't just a technical decision; it's a business-critical one. Here's what happens when you get it right:</p>
<ol>
<li><p><strong>Reduced Hallucinations</strong>: Proper chunks preserve context, giving the LLM less reason to "fill in the gaps" with fabricated information</p>
</li>
<li><p><strong>Improved Relevance</strong>: Better chunks mean more precise retrieval, ensuring responses actually address the user's query</p>
</li>
<li><p><strong>Enhanced Context Window Utilization</strong>: Strategic chunking makes better use of limited context windows in LLMs</p>
</li>
<li><p><strong>Lower Operational Costs</strong>: Better retrieval means fewer tokens processed and less computational overhead</p>
</li>
</ol>
<h2 id="heading-implementing-the-right-chunking-strategy-today">Implementing the Right Chunking Strategy Today</h2>
<p>The most successful RAG engineers I've worked with follow this process:</p>
<ol>
<li><p>Analyze your document structure and content type</p>
</li>
<li><p>Experiment with multiple chunking strategies on a test dataset</p>
</li>
<li><p>Measure retrieval effectiveness using precision, recall, and answer relevance metrics</p>
</li>
<li><p>Implement a hybrid approach tailored to your specific knowledge base</p>
</li>
</ol>
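<p>For step 3, precision and recall can be computed directly from the sets of retrieved and relevant chunk IDs; a minimal sketch (the IDs below are invented):</p>

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved chunks that were relevant.
    # Recall: fraction of relevant chunks that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the retriever returned chunks 1, 2, 7; chunks 1, 2, 3 were relevant.
p, r = precision_recall([1, 2, 7], [1, 2, 3])
print(p, r)
```

<p>Run this per test query and average; comparing the averages across chunking strategies is what turns "which chunker is best?" from a guess into a measurement.</p>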
<p>Remember, what works for general web content may fail spectacularly for legal documents, code bases, or scientific papers.</p>
<h2 id="heading-conclusion-the-decision-that-will-define-your-rag-system">Conclusion: The Decision That Will Define Your RAG System</h2>
<p>As AI engineers, we're often drawn to the exciting parts of RAG: the latest models, complex retrievers, and advanced prompting techniques. But I've seen time and again that the engineers who master the seemingly mundane art of chunking are the ones who build systems that actually work when it matters most. For most cases, semantic chunking just might be enough.</p>
<p>What chunking method are you using in your RAG system today? And more importantly, are you absolutely certain it's the right one?</p>
<hr />
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
<p>Links:</p>
<p><a target="_blank" href="https://www.youtube.com/@hddatascience">https://www.youtube.com/@hddatascience</a></p>
<p><a target="_blank" href="https://harveyducay.blog/">https://harveyducay.blog/</a></p>
<p><a target="_blank" href="https://github.com/harvsDucs/">https://github.com/harvsDucs/</a></p>
]]></content:encoded></item><item><title><![CDATA[Semantic Search Data Engineering Pipeline: RAG Without the AI]]></title><description><![CDATA[Building a Semantic Document Search System
flowchart TD
    %% Color definitions
    classDef default fill:#2c3e50,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef processing fill:#3498db,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    cl...]]></description><link>https://hddatascience.tech/semantic-search-data-engineering-pipeline-rag-without-the-ai</link><guid isPermaLink="true">https://hddatascience.tech/semantic-search-data-engineering-pipeline-rag-without-the-ai</guid><category><![CDATA[RAG ]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><category><![CDATA[semantic search]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 28 Feb 2025 03:56:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740714846146/6336f61b-a2a2-4727-8f2b-3f4626dc3857.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-building-a-semantic-document-search-system">Building a Semantic Document Search System</h1>
<pre><code class="lang-mermaid">flowchart TD
    %% Color definitions
    classDef default fill:#2c3e50,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef processing fill:#3498db,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef storage fill:#9b59b6,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef embedding fill:#2ecc71,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef query fill:#f1c40f,stroke:#34495e,stroke-width:2px,color:#34495e
    classDef display fill:#e74c3c,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1

    A["PDF Documents"] --&gt; B["Supabase Storage Upload"]
    B --&gt; C["File Parsing via Llama Index"]
    C --&gt; D["Text Semantic Chunking via LangChain Flask API (Vercel)"]
    D --&gt; E["Text Embedding Generation via Nomic-Embed-Text Flask API (Vercel)"]

    E --&gt; F["Supabase Upload Text per Embedding ID"]
    E --&gt; G["Pinecone Upload Embedding per Embedding ID"]

    H["User Query"] --&gt; I["Convert Query to Embedding via Nomic-Embed-Text Flask API (Vercel)"]
    I --&gt; J["Compare Embeddings via Pinecone Query API (Return Top 2 References)"]
    J --&gt; K["Display References in UI Show Source Information"]
    K --&gt; L["Display Results Based on Retrieved References"]

    %% Styling nodes by category
    class A default
    class B,C storage
    class D processing
    class E,F,G,I embedding
    class H,J query
    class K,L display

    subgraph Pipeline1["Document Processing Pipeline"]
        A
        B
        C
        D
        E
        F
        G
    end

    subgraph Pipeline2["Query Processing Pipeline"]
        H
        I
        J
        K
        L
    end
</code></pre>
<p>In today's data-driven world, organizations are drowning in unstructured information. PDF documents, reports, manuals, and other text-based resources contain valuable knowledge, but accessing this information efficiently remains challenging. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are gaining popularity, not every solution requires the generative AI component.</p>
<p>In this post, I'll walk through how I built a powerful semantic search system for documents that captures the "retrieval" part of RAG without the "generation" component - providing accurate document references without synthesizing new content.</p>
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Our system consists of two primary pipelines:</p>
<h3 id="heading-document-processing-pipeline">Document Processing Pipeline</h3>
<p>This pipeline handles the ingestion and processing of documents:</p>
<ol>
<li><p><strong>PDF Document Collection</strong>: The starting point is a repository of PDF documents containing the information we want to make searchable.</p>
</li>
<li><p><strong>Supabase Storage Upload</strong>: Documents are uploaded to Supabase storage, providing a centralized location for all our documents.</p>
</li>
<li><p><strong>File Parsing via Llama Index</strong>: We utilize Llama Index to extract and structure the content from our PDFs. This tool effectively transforms unstructured documents into structured content.</p>
</li>
<li><p><strong>Text Semantic Chunking</strong>: Using LangChain's Flask API (hosted on Vercel), we divide the document content into semantic chunks - logical sections that preserve context rather than arbitrary splits.</p>
</li>
<li><p><strong>Text Embedding Generation</strong>: Each chunk is processed through Nomic-Embed-Text Flask API to generate vector embeddings. These embeddings capture the semantic meaning of text in a mathematical format.</p>
</li>
<li><p><strong>Dual Storage Strategy</strong>:</p>
<ul>
<li><p>We store the text chunks in Supabase, indexed by unique embedding IDs.</p>
</li>
<li><p>We upload the vector embeddings to Pinecone, a vector database optimized for similarity search.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-query-processing-pipeline">Query Processing Pipeline</h3>
<p>This pipeline handles user interactions:</p>
<ol>
<li><p><strong>User Query</strong>: The process begins when a user submits a text query seeking information.</p>
</li>
<li><p><strong>Query Embedding</strong>: The user's query is converted into an embedding using the same Nomic-Embed-Text model, ensuring compatibility with our document embeddings.</p>
</li>
<li><p><strong>Embedding Comparison</strong>: Pinecone's Query API compares the query embedding with stored document embeddings, returning the top 2 most semantically similar text chunks.</p>
</li>
<li><p><strong>Reference Display</strong>: The system displays these references in the UI along with source information, helping users understand where the information originated.</p>
</li>
<li><p><strong>Results Display</strong>: Finally, the system presents the retrieved information based on semantic relevance rather than keyword matching.</p>
</li>
</ol>
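<p>The comparison step can be sketched in plain Python (the vectors and IDs below are toy values; in the actual system this lookup is delegated to Pinecone's Query API):</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, index, k=2):
    # Return the k embedding IDs whose stored vectors are most similar
    # to the query vector.
    ranked = sorted(index, key=lambda eid: cosine(query_vec, index[eid]), reverse=True)
    return ranked[:k]

index = {  # embedding ID -> stored chunk vector (toy values)
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.1],
    "chunk-c": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))
```

<p>This brute-force scan is fine for a handful of chunks; vector databases like Pinecone exist precisely because it stops scaling, replacing the linear scan with approximate nearest-neighbor indexes.</p>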
<h2 id="heading-technical-implementation-details">Technical Implementation Details</h2>
<p>For this implementation, I leveraged several key technologies:</p>
<ul>
<li><p><strong>Embedding Model</strong>: Nomic-Embed-Text provides high-quality embeddings for both document chunks and user queries.</p>
</li>
<li><p><strong>Vector Database</strong>: Pinecone stores and efficiently searches through vector embeddings.</p>
</li>
<li><p><strong>Storage Solution</strong>: Supabase stores both the original documents and the text chunks.</p>
</li>
<li><p><strong>Processing Tools</strong>: Llama Index for document parsing and LangChain for semantic chunking.</p>
</li>
<li><p><strong>Deployment</strong>: All API components are deployed on Vercel for reliable scaling.</p>
</li>
</ul>
<h2 id="heading-the-benefits-of-this-approach">The Benefits of This Approach</h2>
<p>By implementing a "RAG without the AI" approach, we gain several advantages:</p>
<ol>
<li><p><strong>Reference Transparency</strong>: Users receive direct references to relevant documents rather than AI-generated summaries that might contain hallucinations.</p>
</li>
<li><p><strong>Semantic Understanding</strong>: Unlike traditional keyword search, this system understands the meaning behind queries, returning contextually relevant results.</p>
</li>
<li><p><strong>Source Verification</strong>: Each result links directly to its source document, enabling users to verify information.</p>
</li>
<li><p><strong>Reduced Complexity</strong>: Without the generative component, the system is simpler to implement, debug, and maintain.</p>
</li>
<li><p><strong>Lower Computational Requirements</strong>: Vector similarity search requires fewer resources than running large language models.</p>
</li>
</ol>
<h2 id="heading-real-world-applications">Real-World Applications</h2>
<p>This system is particularly valuable for:</p>
<ul>
<li><p><strong>Legal Firms</strong>: Searching through case law and precedents</p>
</li>
<li><p><strong>Healthcare Organizations</strong>: Finding relevant medical documentation</p>
</li>
<li><p><strong>Financial Institutions</strong>: Locating specific regulatory guidance</p>
</li>
<li><p><strong>Research Organizations</strong>: Discovering relevant papers and findings</p>
</li>
<li><p><strong>Educational Institutions</strong>: Connecting students with relevant learning materials</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a semantic document search system using embedding-based retrieval provides organizations with a powerful tool to unlock the value hidden in their unstructured data. By focusing on the retrieval component without the generative AI aspect, we create a system that:</p>
<ul>
<li><p>Delivers accurate, source-verified information</p>
</li>
<li><p>Understands the semantic meaning behind user queries</p>
</li>
<li><p>Scales efficiently with growing document collections</p>
</li>
<li><p>Maintains transparency in information retrieval</p>
</li>
</ul>
<p>For organizations with large collections of documents that need to be searchable by meaning rather than just keywords, this approach offers significant value. It bridges the gap between traditional search and full RAG systems, providing a practical solution for making institutional knowledge accessible without the complexity and potential pitfalls of generative AI.</p>
<p>The next time you're considering implementing a document search solution, remember that sometimes you don't need the "G" in RAG to deliver transformative results.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Self-Learning AI: Does Reinforcement Learning Really Eliminate Data Engineering?]]></title><description><![CDATA[Picture a machine learning model that learns like a child, through trial and error, with no need for massive pre-existing datasets. That's the allure of reinforcement learning (RL), a branch of artificial intelligence that's revolutionizing everythin...]]></description><link>https://hddatascience.tech/self-learning-ai-does-reinforcement-learning-really-eliminate-data-engineering</link><guid isPermaLink="true">https://hddatascience.tech/self-learning-ai-does-reinforcement-learning-really-eliminate-data-engineering</guid><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Tue, 07 Jan 2025 04:43:01 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExbHkyamJhZGpnamlvMGF5NDh2ODcyYWQwN2t0OTEydDFjNTEwaTk5eCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ZZl8MetzKTl5tLGEUI/giphy.gif" alt="Robot Walking GIF by Sandia National Labs" class="image--center mx-auto" /></p>
<p>Picture a machine learning model that learns like a child, through trial and error, with no need for massive pre-existing datasets. That's the allure of reinforcement learning (RL), a branch of artificial intelligence that's revolutionizing everything from game-playing robots to industrial automation. While it's true that RL agents generate their own training data through interaction, the popular belief that this eliminates the need for data engineering might be too good to be true. Let's dive into the reality of data engineering in reinforcement learning and uncover whether this compelling promise holds up in practice.</p>
<h2 id="heading-the-case-for-reduced-data-engineering-in-rl">The Case for Reduced Data Engineering in RL</h2>
<h3 id="heading-self-generating-data-through-interaction">Self-Generating Data Through Interaction</h3>
<p>One of the most compelling arguments for reduced data engineering in RL is its ability to generate training data through direct interaction with environments. Unlike traditional supervised learning approaches, where data must be collected, cleaned, and labeled beforehand, RL agents learn through experience, creating their own training examples along the way.</p>
<h3 id="heading-the-power-of-the-reward-signal">The Power of the Reward Signal</h3>
<p><img src="https://deepsense.ai/wp-content/uploads/2023/02/Figure-2-Classic-reinforcement-learning-training-loop.png" alt="Reinforcement Learning from Human Feedback (RLHF) for LLMs - deepsense.ai" class="image--center mx-auto" /></p>
<p>Reinforcement learning's reliance on reward signals rather than labeled examples presents another potential reduction in data engineering overhead. Instead of requiring extensive human annotation, RL systems learn from simple feedback signals that indicate the success or failure of actions. This fundamental shift can significantly reduce the traditional data preparation burden.</p>
<h3 id="heading-leveraging-synthetic-environments">Leveraging Synthetic Environments</h3>
<p><img src="https://uwaterloo.ca/scholar/sites/ca.scholar/files/styles/os_files_xxlarge/public/ajlobbez/files/gazebo_and_real.jpg?m=1651521617&amp;itok=jRmUy7M2" alt="Robotic and Gazebo Control" class="image--center mx-auto" /></p>
<p>Many RL applications begin their training journey in simulated environments, providing a controlled and readily available data source. This approach can substantially reduce the initial data engineering requirements typically associated with real-world data collection and processing.</p>
<h2 id="heading-lunar-landing-reinforcement-learning">Lunar Landing Reinforcement Learning</h2>
<p>One of the best ways to appreciate how little code is needed to train a reinforcement learning model is the example below: with just these few lines, a lunar lander learned to land safely on its target position.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> gymnasium <span class="hljs-keyword">as</span> gym

<span class="hljs-keyword">from</span> stable_baselines3 <span class="hljs-keyword">import</span> PPO
<span class="hljs-keyword">from</span> stable_baselines3.common.env_util <span class="hljs-keyword">import</span> make_vec_env

vec_env = make_vec_env(<span class="hljs-string">"LunarLander-v3"</span>)

model = PPO(<span class="hljs-string">"MlpPolicy"</span>, vec_env, verbose=<span class="hljs-number">1</span>)
model.learn(total_timesteps=<span class="hljs-number">750000</span>)
model.save(<span class="hljs-string">"LunarLander-v3-750k-ts"</span>)
</code></pre>
<h3 id="heading-no-training">No Training</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224913885/4a7703f0-ca63-48bc-b2ff-98bd0d25ef31.gif" alt class="image--center mx-auto" /></p>
<h3 id="heading-250k-timesteps-results">250k Timesteps Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224649946/3097d439-3608-42cd-b4bc-211ff4e8c64d.gif" alt class="image--center mx-auto" /></p>
<h3 id="heading-500k-timesteps-results">500k Timesteps Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224633242/f425386d-85a6-4327-bb16-a729a03ed404.gif" alt class="image--center mx-auto" /></p>
<h3 id="heading-750k-timesteps-results">750k Timesteps Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224642287/ad802982-395c-492c-a414-02864561ff9c.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-hidden-data-engineering-challenges">The Hidden Data Engineering Challenges</h2>
<h3 id="heading-complex-environment-engineering">Complex Environment Engineering</h3>
<p>While RL might reduce certain aspects of traditional data engineering, it introduces its own set of challenges. Creating and maintaining effective training environments requires sophisticated engineering work, including:</p>
<ul>
<li><p>Designing accurate state representations</p>
</li>
<li><p>Defining appropriate action spaces</p>
</li>
<li><p>Crafting meaningful reward functions</p>
</li>
<li><p>Developing realistic simulators</p>
</li>
</ul>
<h3 id="heading-managing-interaction-histories">Managing Interaction Histories</h3>
<p>The need to store and process interaction histories introduces significant data management challenges. Each training episode generates sequences of state-action-reward tuples that must be efficiently stored, accessed, and analyzed. This becomes particularly demanding in applications with extended training periods or complex environmental interactions.</p>
<h3 id="heading-specialized-data-pipeline-requirements">Specialized Data Pipeline Requirements</h3>
<p>RL systems often require specialized data pipeline components to handle unique requirements such as:</p>
<ul>
<li><p>Experience replay mechanisms for efficient learning</p>
</li>
<li><p>Data synchronization in distributed training setups</p>
</li>
<li><p>Storage and processing of historical policy data</p>
</li>
<li><p>Real-time monitoring and debugging capabilities</p>
</li>
</ul>
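<p>As a concrete taste of the first item, an experience replay buffer is a classic piece of RL data plumbing; a minimal sketch (the capacity and batch size below are arbitrary choices):</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the left

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a training minibatch of past transitions,
        # breaking the temporal correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):               # only the most recent 100 survive
    buf.add(t, 0, 1.0, t + 1)
batch = buf.sample(32)
print(len(buf.buffer), len(batch))
```

<p>Even this toy version hints at the engineering questions lurking underneath: how big should the buffer be, where does it live when training is distributed, and how do you persist it across runs?</p>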
<h2 id="heading-the-reality-different-rather-than-less">The Reality: Different Rather Than Less</h2>
<p>The relationship between reinforcement learning and data engineering isn't about reduction, it's about transformation. While RL might minimize certain traditional data engineering tasks, it introduces new challenges that require equally sophisticated solutions.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>While reinforcement learning offers exciting alternatives to traditional machine learning approaches, it doesn't eliminate the need for data engineering, it transforms it. Success in RL projects requires understanding and embracing these unique data engineering challenges rather than assuming they don't exist.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Navigating Your Path to Data Engineering: A Comprehensive Guide to Breaking Into the Field]]></title><description><![CDATA[The Data Dilemma: From Frustrated Coder to Strategic Problem Solver
Let me be honest—when I first started my journey in data science, I was that developer who could barely string together a machine learning model without feeling like I was trying to ...]]></description><link>https://hddatascience.tech/navigating-your-path-to-data-engineering-a-comprehensive-guide-to-breaking-into-the-field</link><guid isPermaLink="true">https://hddatascience.tech/navigating-your-path-to-data-engineering-a-comprehensive-guide-to-breaking-into-the-field</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Wed, 18 Dec 2024 08:20:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734509769932/fd7d7606-8e71-45f4-a9c0-dfaeffe5ebcf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-data-dilemma-from-frustrated-coder-to-strategic-problem-solver">The Data Dilemma: From Frustrated Coder to Strategic Problem Solver</h2>
<p>Let me be honest—when I first started my journey in data science, I was that developer who could barely string together a machine learning model without feeling like I was trying to solve a Rubik's cube blindfolded. Machine learning felt like an intricate dance where I constantly had two left feet.</p>
<p>Then, data engineering entered my life like a superhero with a Swiss Army knife of technological solutions. Suddenly, those complicated data pipelines that used to make me want to throw my laptop out the window became... manageable. Dare I say, even enjoyable?</p>
<h2 id="heading-the-growing-demand-why-data-engineering-is-your-golden-ticket">The Growing Demand: Why Data Engineering is Your Golden Ticket</h2>
<p><img src="https://www.bigdatawire.com/wp-content/uploads/2020/02/Hired_engineer_19.png" alt="Demand for Data Engineers Up 50%, Report Says" /></p>
<p>Data has become the new oil, and data engineers are the drilling experts of the 21st century. With top companies processing terabytes of information daily and platforms like LinkedIn showcasing thousands of data engineering positions, this field isn't just a career—it's a technological revolution.</p>
<p>💡 Fun Fact: The average data engineer earns approximately $130,000 annually. That's not just a salary; that's a "buy-a-Tesla-and-still-have-money-for-artisan-coffee" kind of income!</p>
<h2 id="heading-from-chaos-to-clarity">From Chaos to Clarity</h2>
<p>Imagine being the person who transforms raw, messy data into crystal-clear insights that help businesses make game-changing decisions. That's not just a job—it's almost like being a data wizard.</p>
<p>When I help a business understand its customer behavior, reduce inefficiencies, or predict market trends, I'm not just moving numbers around. I'm helping create stories from seemingly random data points, turning complexity into comprehensible narratives.</p>
<h2 id="heading-foundational-skills">Foundational Skills</h2>
<h3 id="heading-1-master-the-core-technologies">1. Master the Core Technologies</h3>
<p>To build a solid foundation in data engineering, focus on three fundamental technologies:</p>
<ol>
<li><p><strong>Python</strong>: An open-source language with extensive third-party libraries and robust virtual environment capabilities. It's like the friendly neighborhood superhero of coding—flexible, powerful, and always ready to save the day.</p>
</li>
<li><p><strong>SQL</strong>: More than just a declarative language, SQL offers advanced transaction properties that make data manipulation efficient. Think of it as a precise dance of data manipulation, where every query is a carefully choreographed move. Key advanced topics to master include:</p>
<ul>
<li><p>Group by functions</p>
</li>
<li><p>Window functions</p>
</li>
<li><p>Complex querying techniques</p>
</li>
</ul>
</li>
<li><p><strong>Command Line Tools</strong>: Like the stage managers of your data engineering theater, these help facilitate data pipeline interactions and improve productivity.</p>
</li>
</ol>
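<p>To make the SQL bullet concrete, here is a small runnable example of a window function using Python's built-in <code>sqlite3</code> module (the sales table and figures are invented; window functions require SQLite 3.25 or newer), ranking each employee's sales within their region:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, employee TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 'ana', 300), ('north', 'ben', 500),
        ('south', 'cai', 200), ('south', 'dee', 400);
""")

# RANK() with PARTITION BY gives each region its own leaderboard,
# something a plain GROUP BY cannot express without self-joins.
rows = conn.execute("""
    SELECT region, employee, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()
for row in rows:
    print(row)
```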
<h3 id="heading-2-data-storage-and-orchestration">2. Data Storage and Orchestration</h3>
<p>Understanding data storage is crucial for data engineers. Focus on:</p>
<ul>
<li><p><strong>Object Stores</strong>: Ideal for unstructured data like images, audio, and text</p>
</li>
<li><p><strong>Relational Databases</strong>: Often the solution to most data engineering challenges</p>
</li>
<li><p><strong>Data Orchestration</strong>: Learn Extract, Transform, Load (ETL) processes</p>
</li>
<li><p><strong>Apache Airflow</strong>: The industry-standard tool for workflow management</p>
</li>
</ul>
<h3 id="heading-3-advanced-data-processing-techniques">3. Advanced Data Processing Techniques</h3>
<p>Differentiate yourself by understanding:</p>
<ul>
<li><p><strong>Batch Processing</strong>: Utilizing tools like Apache Spark to handle large-scale data</p>
</li>
<li><p><strong>Stream Processing</strong>: Learning frameworks like Apache Kafka for real-time data handling</p>
</li>
<li><p><strong>Distributed Systems</strong>: Understanding concepts like map-reduce and parallel processing</p>
</li>
</ul>
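<p>The map-reduce idea behind tools like Spark can be sketched in plain Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This toy word count is a single-machine illustration of the concept only:</p>

```python
from collections import defaultdict

docs = ["big data big pipelines", "big data tools"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group independently -- this is the step a
# cluster can run in parallel across machines.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'pipelines': 1, 'tools': 1}
```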
<h2 id="heading-learning-strategies-turning-passion-into-profession">Learning Strategies: Turning Passion into Profession</h2>
<h3 id="heading-the-no-pressure-learning-approach">The "No Pressure" Learning Approach</h3>
<ul>
<li><p>Take at least three months</p>
</li>
<li><p>Build projects that make your heart sing</p>
</li>
<li><p>Choose resources that don't make you want to fall asleep</p>
<p>  Pro Tip: If a learning resource feels more boring than watching paint dry, it's time to find a new one!</p>
</li>
</ul>
<h2 id="heading-real-world-impact-beyond-the-code">Real-World Impact: Beyond the Code</h2>
<p><img src="https://www.altexsoft.com/media/2019/06/word-image-48.png" alt="Data Engineering: Data Warehouse, Data Pipeline and Data Eng" /></p>
<p>Data engineering isn't just about technical skills. It's about:</p>
<ul>
<li><p>Helping businesses make smarter decisions</p>
</li>
<li><p>Transforming complex data into actionable insights</p>
</li>
<li><p>Creating value that goes beyond lines of code</p>
</li>
</ul>
<h2 id="heading-conclusion-your-strategic-roadmap-to-data-engineering-success">Conclusion: Your Strategic Roadmap to Data Engineering Success</h2>
<p>Some days, you'll feel like a coding genius. Other days, you'll wonder if you accidentally signed up for technological self-torture. Spoiler alert: It's totally worth it. The journey into data engineering is more than a career choice—it's a strategic investment in your professional future. As businesses increasingly rely on data-driven decision-making, the role of a data engineer has transformed from a technical support position to a critical strategic partner in organizational success.</p>
]]></content:encoded></item><item><title><![CDATA[Uncovering Semantic Relationships with the Universal Sentence Encoder]]></title><description><![CDATA[As the amount of text data we interact with on a daily basis continues to grow, the ability to quickly identify meaningful connections between pieces of information becomes increasingly valuable. This is where semantic similarity models can be incred...]]></description><link>https://hddatascience.tech/uncovering-semantic-relationships-with-the-universal-sentence-encoder</link><guid isPermaLink="true">https://hddatascience.tech/uncovering-semantic-relationships-with-the-universal-sentence-encoder</guid><category><![CDATA[TensorFlow]]></category><category><![CDATA[embedding]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Thu, 05 Dec 2024 08:41:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733383113591/84a5d284-f132-49b6-8462-92addc31290a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As the amount of text data we interact with on a daily basis continues to grow, the ability to quickly identify meaningful connections between pieces of information becomes increasingly valuable. This is where semantic similarity models can be incredibly useful, by capturing the underlying meaning of text, rather than just looking at surface-level similarities.</p>
<p>One powerful tool for building semantic similarity models is the Universal Sentence Encoder, provided by the TensorFlow Hub library. In this article, I'll walk through how you can leverage this pre-trained model to uncover interesting relationships in your own text data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733383056788/b5ff7f65-2073-4f0e-b079-d198727fe3d4.webp" alt class="image--center mx-auto" /></p>
<h2 id="heading-getting-started-with-the-universal-sentence-encoder">Getting Started with the Universal Sentence Encoder</h2>
<p>The Universal Sentence Encoder is a machine learning model that has been trained on a large corpus of text to produce high-quality vector representations of sentences and phrases. These vector embeddings encode the semantic meaning of the input, allowing you to easily compare the relatedness of different pieces of text.</p>
<p>To get started, you'll first need to import the necessary libraries:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> absl <span class="hljs-keyword">import</span> logging

<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

<span class="hljs-keyword">import</span> tensorflow_hub <span class="hljs-keyword">as</span> hub
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
</code></pre>
<p>With the imports set up, you can then load the pre-trained Universal Sentence Encoder model from TensorFlow Hub:</p>
<pre><code class="lang-python">module_url = <span class="hljs-string">"https://tfhub.dev/google/universal-sentence-encoder/4"</span> <span class="hljs-comment">#@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]</span>
model = hub.load(module_url)
<span class="hljs-keyword">print</span> (<span class="hljs-string">"module %s loaded"</span> % module_url)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">embed</span>(<span class="hljs-params">input</span>):</span>
  <span class="hljs-keyword">return</span> model(input)
</code></pre>
<p>This model can now be used to encode your text data into semantic embeddings, which you can then use to compute similarity scores. An embed helper function is also defined for convenience.</p>
<p>Next, we define the functions needed for plotting. Plotting lets us visualize the similarities in the semantics calculated by the model. The code snippet is shown below:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_similarity</span>(<span class="hljs-params">labels, features, rotation</span>):</span>
  corr = np.inner(features, features)
  sns.set(font_scale=<span class="hljs-number">1.2</span>)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=<span class="hljs-number">0</span>,
      vmax=<span class="hljs-number">1</span>,
      cmap=<span class="hljs-string">"YlOrRd"</span>)
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title(<span class="hljs-string">"Semantic Textual Similarity"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_and_plot</span>(<span class="hljs-params">messages_</span>):</span>
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, <span class="hljs-number">90</span>)
</code></pre>
<h2 id="heading-computing-semantic-similarity">Computing Semantic Similarity</h2>
<p>Let's say you have a dataset of customer messages, and you want to identify which messages are discussing similar topics. You can use the Universal Sentence Encoder to generate embeddings for each message, and then calculate the pairwise cosine similarity between those embeddings.</p>
<p>The resulting similarity matrix will contain values roughly between 0 and 1, where a score near 1 indicates that two messages are semantically almost identical, and a score near 0 indicates they are essentially unrelated.</p>
<p>You can then use this matrix to cluster messages, identify outliers, or visualize the semantic relationships between your data points.</p>
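<p>Cosine similarity itself is just a normalized dot product. A minimal pure-Python sketch, using tiny 3-dimensional toy vectors that stand in for the real 512-dimensional USE embeddings:</p>

```python
import math

def cosine_similarity(a, b):
    # cosine = dot(a, b) / (norm(a) * norm(b)); for unit-length
    # embeddings this reduces to the plain inner product.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, illustration only.
emb_a = [0.9, 0.1, 0.0]
emb_b = [0.8, 0.2, 0.1]
emb_c = [0.0, 0.1, 0.9]

print(round(cosine_similarity(emb_a, emb_b), 3))  # high: similar direction
print(round(cosine_similarity(emb_a, emb_c), 3))  # low: nearly orthogonal
```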
<h2 id="heading-exploring-and-evaluating-the-model">Exploring and Evaluating the Model</h2>
<p>One way to get a better understanding of how the Universal Sentence Encoder is capturing semantic meaning is to examine the similarity scores for a few sample messages. For example:</p>
<pre><code class="lang-python">messages = [
    <span class="hljs-comment"># Smartphones</span>
    <span class="hljs-string">"I like my phone"</span>,
    <span class="hljs-string">"My phone is not good."</span>,
    <span class="hljs-string">"Your cellphone looks great."</span>,

    <span class="hljs-comment"># Weather</span>
    <span class="hljs-string">"Will it snow tomorrow?"</span>,
    <span class="hljs-string">"Recently a lot of hurricanes have hit the US"</span>,
    <span class="hljs-string">"Global warming is real"</span>,

    <span class="hljs-comment"># Food and health</span>
    <span class="hljs-string">"An apple a day, keeps the doctors away"</span>,
    <span class="hljs-string">"Eating strawberries is healthy"</span>,
    <span class="hljs-string">"Is paleo better than keto?"</span>,

    <span class="hljs-comment"># Asking about age</span>
    <span class="hljs-string">"How old are you?"</span>,
    <span class="hljs-string">"what is your age?"</span>,
]

run_and_plot(messages)
</code></pre>
<p>By reviewing these examples, you can start to get a sense of how well the model is performing and where it may be struggling. Additionally, you can manually review high and low similarity pairs to further evaluate the model's effectiveness for your specific use case.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733387473021/f7d118df-91b0-4a5b-9ef9-4ac50185f964.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The Universal Sentence Encoder is a powerful tool that can help unlock the semantic insights hidden within your text data. By leveraging this pre-trained model, you can quickly generate high-quality vector representations of your content and uncover meaningful relationships that would be difficult to spot through manual review alone.</p>
<p>Of course, as with any machine learning model, it's important to carefully evaluate its performance and understand its limitations. But with a bit of experimentation and exploration, you can harness the power of semantic similarity to drive valuable discoveries in your data.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Why should you have a Data Science Team for your business?]]></title><description><![CDATA[Let's be honest, if you've ever tried to make sense of a massive Excel spreadsheet at 3 AM, desperately searching for that one insight that could make or break your quarterly presentation, you know the pain. Trust me, I've been there, frantically goo...]]></description><link>https://hddatascience.tech/why-should-you-have-a-data-science-team-for-your-business</link><guid isPermaLink="true">https://hddatascience.tech/why-should-you-have-a-data-science-team-for-your-business</guid><category><![CDATA[Data Science]]></category><category><![CDATA[business]]></category><category><![CDATA[Cryptocurrency]]></category><category><![CDATA[decision making]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sat, 23 Nov 2024 23:57:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732375057545/c2643ccf-2a9b-4d0b-b7ae-02e05f750ae0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let's be honest, if you've ever tried to make sense of a massive Excel spreadsheet at 3 AM, desperately searching for that one insight that could make or break your quarterly presentation, you know the pain. Trust me, I've been there, frantically googling "how to pivot table" while my coffee got cold. But what if I told you there's a better way?</p>
<h2 id="heading-a-personal-story-from-crypto-chaos-to-data-driven-success">A Personal Story: From Crypto Chaos to Data-Driven Success</h2>
<p>Before we dive deep into business data science, let me share a story that might resonate with you. Back in 2020, I was like many others, trying to navigate the volatile crypto markets with nothing but gut feelings and Twitter threads. Every trade felt like a game of chance. Should I buy the dip? Is this the top? My portfolio looked like a roller coaster designed by someone who'd had too much coffee.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732404446086/947c5041-8c10-4d7a-b43e-cfdc86939020.png" alt class="image--center mx-auto" /></p>
<p>Then I discovered the power of data science. I began analyzing market patterns, building predictive models, and suddenly, those seemingly random price movements started making sense. My decisions became calculated rather than emotional. Sure, the market is still volatile, but now I sleep better knowing my trades are backed by data, not just hopes and dreams.</p>
<p>This same transformation can happen for your business. Whether you're in retail, finance, healthcare, or selling artisanal pet rocks online (hey, no judgment!), data science can turn uncertainty into clarity.</p>
<p>Everybody wants confidence in their decision making. Everybody wants the peace of mind of knowing they made the right business call when there is a lot at stake. Sometimes domain experience isn’t enough and you’re going to make mistakes in decision making; hence the existence of data science.</p>
<h2 id="heading-how-i-learned-to-stop-worrying-and-love-the-algorithm">How I Learned to Stop Worrying and Love the Algorithm</h2>
<p>Remember when making business decisions felt like throwing darts blindfolded? Yeah, those were not the good old days. As someone who started their journey as a data engineer, I can tell you firsthand that a lot of people underestimate the value of data.</p>
<p>When a business doesn’t have a data science team, most decisions come from executives working on intuition alone. Having objective reasoning behind every decision not only steers business growth in a more predictable direction, but also gives stakeholders peace of mind.</p>
<p>For business employees, having a data science team can mean a lower chance of the business going bankrupt and of being laid off. For stockholders, it can mean peace of mind knowing their holdings are more stable because business decisions are backed by data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732405244665/63c03846-72ea-41cd-b274-6fe9fa6286df.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-human-learning">Human Learning</h2>
<p>Don’t get me wrong though, I don’t discount the capabilities of people who have been in their domain for a long time. The intuition business leaders build through years of experience often proves valuable. What works best is augmenting those leaders’ judgment with the objective conclusions of data, so that decisions become more reliable and stable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732404690937/d72f36a6-7e11-46b7-9a3a-4caaecdc5bf3.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-building-your-dream-team">Building Your Dream Team</h2>
<ol>
<li><p><strong>Data Scientists</strong>: They're like the Tony Stark of your team, brilliant minds who turn complex problems into elegant solutions. Just don't expect them to build an Iron Man suit... :). They are the ones who do the math-heavy work: data pre-processing, machine learning, deep learning, and more.</p>
</li>
<li><p><strong>Data Engineers</strong>: The unsung heroes who build and maintain your data infrastructure. We're like the plumbers of the digital world, nobody thinks about us until something goes wrong, and then we're everyone's best friend. A data team should start off with a data engineer or a data engineering team, because that is what gives the business the capability to use its data at all.</p>
</li>
<li><p><strong>Data Analysts</strong>: The storytellers who turn numbers into narratives. They're the ones who make sure your executives don't fall asleep during presentations. Data analysts bridge the gap between the objective outputs of data science and what those outputs mean for decision making, communicating results to the non-technical people who are often the decision makers.</p>
</li>
</ol>
<p>On a more serious note, an illustration below might help you understand the roles better.</p>
<p><img src="https://www.dmbi.org/wp-content/uploads/Data-science-analyst-engineer-2-1.jpeg" alt /></p>
<h2 id="heading-the-trading-parallel-your-business-decisions">The Trading Parallel: Your Business Decisions</h2>
<p>Just like how data transformed my crypto trading from a guessing game into a strategic operation, it can revolutionize your business decisions. Remember:</p>
<ul>
<li><p>Without data: "This product might sell well because my cousin Linda likes it"</p>
</li>
<li><p>With data: "Our analysis shows a 78% probability of market success based on consumer behavior patterns"</p>
</li>
</ul>
<p>Allotting resources to a data science team costs little compared to its potential upside for your business. It’s time to take action and start learning how data can help you in your domain.</p>
<p><img src="https://kentrix.in/wp-content/uploads/2023/08/Screenshot-2023-08-30-125434.png" alt="Screenshot 2023 08 30 125434" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The truth is, your business leaders' experience and intuition are invaluable, but in today's complex market they shouldn't have to navigate alone. A data science team turns gut feelings into validated decisions and transforms uncertainty into measurable outcomes. Think of it as adding a high-powered telescope to your captain's decades of sailing experience. You still need both to chart the best course forward.</p>
<p>So whether you're aiming to boost sales, reduce costs, or simply sleep better knowing your decisions are data-backed, consider this: in a world where everyone has access to data, the real advantage lies in how well you use it. If you weren’t using data to your advantage now, then when?</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Top 5 Common Machine Learning Mistakes Beginners Do]]></title><description><![CDATA[Machine learning has become an essential tool in data science, but it's surprisingly easy to make fundamental mistakes that can severely impact your model's performance. In fact my motivation for this blog came out of peers underestimating the comple...]]></description><link>https://hddatascience.tech/top-5-common-machine-learning-mistakes-beginners-do</link><guid isPermaLink="true">https://hddatascience.tech/top-5-common-machine-learning-mistakes-beginners-do</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Beginner Developers]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 15 Nov 2024 13:41:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731676965781/a18b642e-b88a-40ec-89a9-4ce17144303e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Machine learning has become an essential tool in data science, but it's surprisingly easy to make fundamental mistakes that can severely impact your model's performance. In fact my motivation for this blog came out of peers underestimating the complexities of Linear Regression as a way to model data. In this guide, we'll explore the most common pitfalls and how to avoid them.</p>
<h2 id="heading-1-the-just-throw-it-into-a-model-syndrome">1. The "Just Throw It Into a Model" Syndrome</h2>
<p>One of the most prevalent mistakes is treating machine learning like a magic black box. Many newcomers simply load their dataset into scikit-learn's LinearRegression() and expect meaningful results. This approach ignores crucial preprocessing steps and can lead to severely underperforming models.</p>
<h3 id="heading-key-problems">Key Problems:</h3>
<ul>
<li><p>No train-test split</p>
</li>
<li><p>Missing data preprocessing</p>
</li>
<li><p>Lack of feature engineering</p>
</li>
<li><p>Ignoring data leakage</p>
</li>
</ul>
<h2 id="heading-2-data-preprocessing-oversights">2. Data Preprocessing Oversights</h2>
<h3 id="heading-feature-scaling">Feature Scaling</h3>
<p>Not normalizing or standardizing features is a common oversight that can significantly impact model performance. Different scales across features can cause:</p>
<ul>
<li><p>Gradient descent algorithms to converge slowly</p>
</li>
<li><p>Some features to dominate others unnecessarily</p>
</li>
<li><p>Poor performance in distance-based algorithms like k-NN</p>
</li>
</ul>
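<p>As a quick illustration of why scale matters, here is a minimal z-score standardization in plain Python (scikit-learn's StandardScaler performs the same transformation in practice); the income and age values are made up:</p>

```python
import statistics

# Two features on wildly different scales: income (dollars) and age (years).
income = [30_000, 60_000, 90_000]
age = [25, 35, 45]

def standardize(values):
    # z-score: subtract the mean and divide by the standard deviation,
    # so every feature ends up centered at 0 with comparable spread.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

print(standardize(income))  # roughly [-1.22, 0.0, 1.22]
print(standardize(age))     # same scale despite very different raw units
```

<p>After standardizing, neither feature can dominate a distance computation or a gradient update simply because its raw numbers are larger.</p>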
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731677851847/2a6b2c9a-0222-46ef-8f27-073ac8dfd0cd.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-dimensionality-issues">Dimensionality Issues</h3>
<p>Many practitioners fail to address the curse of dimensionality. High-dimensional data often needs:</p>
<ul>
<li><p>Principal Component Analysis (PCA)</p>
</li>
<li><p>Feature selection methods</p>
</li>
<li><p>Other dimensionality reduction techniques</p>
</li>
</ul>
<h2 id="heading-3-evaluation-metric-mismatches">3. Evaluation Metric Mismatches</h2>
<p>Choosing the wrong evaluation metric is like using a ruler to measure weight. Different problems require different metrics:</p>
<h3 id="heading-classification-metrics">Classification Metrics</h3>
<ul>
<li><p><strong>Imbalanced Data</strong>: Using accuracy for imbalanced datasets can be misleading</p>
</li>
<li><p><strong>False Positives vs. False Negatives</strong>: Not considering the business impact of different types of errors</p>
</li>
<li><p><strong>Common Solutions</strong>:</p>
<ul>
<li><p>F1-score for balanced precision and recall</p>
</li>
<li><p>Area Under ROC Curve (AUC-ROC)</p>
</li>
<li><p>Precision for minimizing false positives</p>
</li>
<li><p>Recall for minimizing false negatives</p>
</li>
</ul>
</li>
</ul>
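<p>These metrics are simple to compute once you have the confusion-matrix counts. The counts below are made up to show how accuracy can mislead on imbalanced data while precision, recall, and F1 reveal the problem:</p>

```python
# Imbalanced toy example (1000 samples, only 10 positives):
# 5 true positives, 5 false negatives, 10 false positives, 980 true negatives.
tp, fn, fp, tn = 5, 5, 10, 980

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")   # 0.985 -- looks great...
print(f"precision: {precision:.3f}")  # 0.333
print(f"recall:    {recall:.3f}")     # 0.500
print(f"f1:        {f1:.3f}")         # 0.400 -- ...but the model is weak
```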
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731677919592/9f407f4d-a06b-421d-8c29-12fecfe2e70b.webp" alt class="image--center mx-auto" /></p>
<h2 id="heading-4-validation-vulnerabilities">4. Validation Vulnerabilities</h2>
<h3 id="heading-cross-validation-mistakes">Cross-Validation Mistakes</h3>
<p>Simple train-test splits aren't enough. Common issues include:</p>
<ul>
<li><p>Not using k-fold cross-validation</p>
</li>
<li><p>Applying cross-validation incorrectly</p>
</li>
<li><p>Ignoring temporal aspects in time-series data</p>
</li>
</ul>
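<p>A bare-bones k-fold split can be written in a few lines. In practice you would reach for scikit-learn's KFold, but this sketch shows the mechanics: every sample lands in exactly one validation fold:</p>

```python
def k_fold_indices(n_samples, k):
    # Partition sample indices into k contiguous folds; each fold serves
    # once as the validation set, the remaining folds as training data.
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = n_samples if i == k - 1 else start + fold_size
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, val))
    return folds

for train_idx, val_idx in k_fold_indices(10, 5):
    print(val_idx)  # each index appears in exactly one validation fold
```

<p>Note that for time-series data you should not shuffle or reuse future folds; validation folds must come after the training data in time.</p>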
<h3 id="heading-data-leakage">Data Leakage</h3>
<p>Subtle forms of data leakage can creep in through:</p>
<ul>
<li><p>Preprocessing before splitting the data</p>
</li>
<li><p>Using future information in time-series</p>
</li>
<li><p>Including target-related features</p>
</li>
</ul>
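<p>The classic leakage bug is fitting a preprocessing step on the full dataset before splitting, which lets test-set statistics bleed into training. A safe ordering, sketched with made-up numbers and plain Python in place of a scikit-learn pipeline:</p>

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is a held-out test point

# Split FIRST...
train, test = data[:4], data[4:]

# ...then fit the scaler's statistics on the training set only.
mean = statistics.mean(train)    # 2.5 -- untouched by the test outlier
stdev = statistics.pstdev(train)

scaled_test = [(x - mean) / stdev for x in test]
print(mean, scaled_test)
```

<p>Had the mean been computed on all five values, the outlier in the test set would have shifted the training features, quietly inflating evaluation scores.</p>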
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731678002421/0bd98b0f-67ef-48b3-a2e4-254224884863.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-5-overcomplicating-solutions">5. Overcomplicating Solutions</h2>
<p>Sometimes simpler is better. Common overcomplications include:</p>
<ul>
<li><p>Using deep learning when linear regression would suffice</p>
</li>
<li><p>Adding unnecessary features without validation</p>
</li>
<li><p>Over-tuning hyperparameters without significant gains</p>
</li>
</ul>
<h2 id="heading-best-practices-checklist">Best Practices Checklist</h2>
<ul>
<li><p>Start with data exploration and visualization (EDA)</p>
</li>
<li><p>Implement proper train-test splits</p>
</li>
<li><p>Apply appropriate preprocessing techniques</p>
</li>
<li><p>Choose metrics based on business objectives</p>
</li>
<li><p>Use cross-validation for robust evaluation</p>
</li>
<li><p>Monitor for data leakage</p>
</li>
<li><p>Start simple and iterate based on results</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Avoiding these common mistakes can significantly improve your machine learning models' performance. Remember that machine learning is not about throwing data at algorithms – it's about understanding your data, choosing appropriate methods, and carefully validating your results.</p>
<p>Would you like to build better models? Start by auditing your current practices against these common pitfalls. Your future self (and your models) will thank you.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>As a versatile data professional, I have expertise in both data engineering (my most recent job experience) and data science (my undergrad), including machine learning and AI. I'd be excited to collaborate on an interesting project that leverages my diverse skillset.</p>
<p>Also, I do a little bit of Next.JS on the side 😉.</p>
<p>Connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a>, and let's discuss potential opportunities.</p>
]]></content:encoded></item><item><title><![CDATA[What is ETL in Data Engineering?]]></title><description><![CDATA[So imagine you're making a smoothie - that's basically what ETL is in the data world. First, you Extract all your ingredients (data) from different places, like grabbing berries from the fridge, bananas from the counter, and yogurt from the store - j...]]></description><link>https://hddatascience.tech/what-is-etl-in-data-engineering</link><guid isPermaLink="true">https://hddatascience.tech/what-is-etl-in-data-engineering</guid><category><![CDATA[etl-pipeline]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Wed, 13 Nov 2024 05:16:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731473795411/ad56211d-b81b-4945-97b3-1679edc75c04.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So imagine you're making a smoothie - that's basically what ETL is in the data world. First, you Extract all your ingredients (data) from different places, like grabbing berries from the fridge, bananas from the counter, and yogurt from the store - just like pulling data from different systems and files. Then comes the Transform part, where you wash the fruit, cut it up, and make sure everything's ready to blend - similar to cleaning up messy data, fixing errors, and making sure everything fits together nicely. Finally, you Load it all into your blender (or in data terms, your final database or warehouse) where it becomes something useful that everyone can consume!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731474331933/1b6e1ed0-3f6d-4c3c-b880-d586d78952d9.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-e-t-l-meaning">“E-T-L” meaning</h2>
<h3 id="heading-extract">Extract</h3>
<p>Data extraction is the first crucial step in the ETL process, involving the gathering of data from various source systems. This phase focuses on pulling raw data from multiple sources and preparing it for the transformation phase. The extraction process can be as simple as copying data from a single database or as complex as gathering information from dozens of disparate systems.</p>
<h3 id="heading-transform">Transform</h3>
<p>The transformation phase is where raw data becomes valuable business information. This critical stage involves converting extracted data into a format that's suitable for analysis and storage. During transformation, data undergoes various operations to ensure it meets business rules, quality standards, and technical requirements of the target system.</p>
<h3 id="heading-load">Load</h3>
<p>The loading phase represents the final step where transformed data is written into the target system. This phase requires careful planning to ensure data is loaded efficiently while maintaining system performance and data integrity.</p>
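<p>The three phases map naturally onto a few lines of Python. This sketch extracts rows from an in-memory CSV (standing in for a real file, API, or upstream database), transforms them by cleaning and filtering, and loads them into SQLite; the schema and values are invented for the example:</p>

```python
import csv
import io
import sqlite3

# Extract: read raw rows from the source.
raw = "name,amount\nalice,100\nbob,\ncarol,250\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: drop records with missing amounts and normalize types.
clean = [
    {"name": r["name"].title(), "amount": int(r["amount"])}
    for r in rows
    if r["amount"]
]

# Load: write the transformed rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (:name, :amount)", clean)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

<p>Real pipelines add error handling, incremental loads, and scheduling on top, but the extract-transform-load skeleton stays the same.</p>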
<h2 id="heading-the-three-critical-pillars-of-etls-business-impact">The Three Critical Pillars of ETL's Business Impact</h2>
<h2 id="heading-1-data-driven-decision-making">1. Data-Driven Decision Making</h2>
<h3 id="heading-strategic-advantage">Strategic Advantage</h3>
<ul>
<li><p>Consolidates data from multiple sources into a single source of truth</p>
</li>
<li><p>Integrates sales, marketing, financial, and operational data</p>
</li>
<li><p>Enables real-time business intelligence and reporting</p>
</li>
</ul>
<h3 id="heading-business-impact">Business Impact</h3>
<ul>
<li><p>Faster, more accurate decision making</p>
</li>
<li><p>Reduced analysis time and effort</p>
</li>
<li><p>Better resource allocation based on actual data</p>
</li>
<li><p>Improved forecasting and planning capabilities</p>
</li>
</ul>
<h2 id="heading-2-operational-excellence">2. Operational Excellence</h2>
<h3 id="heading-process-optimization">Process Optimization</h3>
<ul>
<li><p>Automates manual data processing tasks</p>
</li>
<li><p>Standardizes data handling across the organization</p>
</li>
<li><p>Eliminates redundant data entry and processing</p>
</li>
<li><p>Ensures consistent data quality and formatting</p>
</li>
</ul>
<h3 id="heading-cost-benefits">Cost Benefits</h3>
<ul>
<li><p>Reduces manual labor costs by 40-60% on average</p>
</li>
<li><p>Minimizes errors and associated correction costs</p>
</li>
<li><p>Improves employee productivity</p>
</li>
<li><p>Speeds time-to-market for data-dependent projects</p>
</li>
</ul>
<h2 id="heading-3-customer-experience-enhancement">3. Customer Experience Enhancement</h2>
<h3 id="heading-customer-understanding">Customer Understanding</h3>
<ul>
<li><p>Creates comprehensive 360-degree customer views</p>
</li>
<li><p>Combines data from all customer touchpoints</p>
</li>
<li><p>Enables personalized marketing and service delivery</p>
</li>
<li><p>Supports predictive customer behavior analysis</p>
</li>
</ul>
<h3 id="heading-business-growth">Business Growth</h3>
<ul>
<li><p>Improves customer retention through better service</p>
</li>
<li><p>Increases cross-selling and upselling opportunities</p>
</li>
<li><p>Enables targeted marketing campaigns</p>
</li>
<li><p>Supports customer satisfaction monitoring and improvement</p>
</li>
</ul>
<h2 id="heading-how-can-etl-go-wrong">How can ETL go wrong?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731474863226/586e7703-f80d-49f5-ac83-7be379820bfe.gif" alt class="image--center mx-auto" /></p>
<p>Data extraction fails when source systems unexpectedly change formats or experience downtime, breaking automated processes. Legacy systems with outdated security protocols can become inaccessible, forcing rushed workarounds. Poor connectivity leads to incomplete datasets.</p>
<p>Transformation can silently corrupt data when business rules aren't properly maintained, while faulty loading processes create database deadlocks. Multiple competing ETL jobs can bring production systems to a halt during peak hours, causing business disruption.</p>
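<p>In code, the usual defense against flaky extraction is a retry with backoff, plus a sanity check on what came back before anything gets loaded. Here's a minimal sketch – the function and source names are hypothetical, not from any particular framework:</p>

```python
import time

def extract_with_retry(fetch, max_attempts=3, backoff_seconds=1.0):
    """Retry a flaky extraction step with exponential backoff.

    `fetch` is any zero-argument callable that returns rows, or raises
    when the source is down or has changed format.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # give up loudly so the orchestrator can alert
            time.sleep(backoff_seconds * 2 ** (attempt - 1))

def validate_row_count(rows, expected_min):
    """Guard against silently incomplete extracts before loading."""
    if len(rows) < expected_min:
        raise ValueError(f"got {len(rows)} rows, expected at least {expected_min}")
    return rows

# Simulate a source that fails twice, then recovers
attempts = {"n": 0}
def flaky_source():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("source temporarily down")
    return [{"id": 1}, {"id": 2}]

rows = validate_row_count(
    extract_with_retry(flaky_source, backoff_seconds=0.01),
    expected_min=1,
)
```

<p>Orchestrators like Airflow ship retry mechanisms that handle this for you; the point is the principle – fail loudly, and never load a partial extract in silence.</p>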
<h2 id="heading-conclusion">Conclusion</h2>
<p>ETL, or Extract, Transform, Load, is a crucial data integration process that consolidates data from multiple sources into a central repository, enabling organizations to perform analytics and drive informed decision-making. While ETL provides significant advantages such as improved data quality, automation, and enhanced business intelligence, it can also encounter challenges related to data quality issues, complex transformations, and performance bottlenecks. To mitigate potential pitfalls, businesses should follow best practices, such as implementing robust error handling, ensuring data quality, and utilizing automation to streamline and optimize the ETL process.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[How to Learn Machine Learning?]]></title><description><![CDATA[What is Machine Learning?
Machine learning is basically teaching computers to learn from data - kind of like how we humans learn from experience, except computers don't get tired or need coffee breaks! It's a branch of artificial intelligence that's ...]]></description><link>https://hddatascience.tech/how-to-learn-machine-learning</link><guid isPermaLink="true">https://hddatascience.tech/how-to-learn-machine-learning</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Mathematics]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Thu, 07 Nov 2024 23:41:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731021856997/e1a7acee-722a-406f-b8c3-7d63096f8db1.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-what-is-machine-learning"><strong>What is Machine Learning?</strong></h2>
<p>Machine learning is basically teaching computers to learn from data - kind of like how we humans learn from experience, except computers don't get tired or need coffee breaks! It's a branch of artificial intelligence that's taking over pretty much every industry you can think of, from helping doctors detect diseases to recommending that next Netflix show you'll probably binge-watch.</p>
<p>As someone who's been a data engineer for years now, I've seen countless people get overwhelmed when starting their machine learning journey. Trust me, I've been there - staring at mathematical equations that looked more like ancient hieroglyphics than something I'd ever understand. But don't worry, I'll break down the different paths you can take, depending on your goals and how much math you're willing to tolerate (just kidding... kind of).</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*2nH4-iMWzxMa9_Xpi1LkkQ.png" alt="Starting Machine Learning Journey At Different Career Levels | by ..." /></p>
<h2 id="heading-different-ways-to-learn-machine-learning"><strong>Different Ways to Learn Machine Learning</strong></h2>
<p>Let me tell you something funny - there are actually two types of people in the machine learning world: those who dive straight into coding with libraries like scikit-learn (the "just make it work" crowd), and those who start with calculus textbooks (the "but why does it work?" crowd). Both approaches are totally valid!</p>
<p>As someone who's tried both paths (and crashed and burned a few times), here's my honest breakdown:</p>
<h2 id="heading-the-quick-and-dirty-way"><strong>The Quick and Dirty Way</strong></h2>
<p>Want to start building ML models ASAP? This is what I like to call the "scikit-learn and pray" approach. You can:</p>
<ul>
<li><p>Learn Python (it's friendlier than it sounds, I promise!)</p>
</li>
<li><p>Jump straight into machine learning libraries</p>
</li>
<li><p>Start building models without diving too deep into the math</p>
</li>
</ul>
<p>I actually started this way when my boss needed a prediction model "by yesterday." Did I fully understand what was happening under the hood? Nope! Did it work? Well... eventually!</p>
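<p>To make the "quick and dirty" path concrete, here's roughly what that "by yesterday" model can look like – a few lines of scikit-learn on synthetic stand-in data (assuming scikit-learn is installed):</p>

```python
# "Just make it work": fit a model and check it generalizes, no math required
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for "house prices" style tabular data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score on held-out data - the one habit worth keeping even in a hurry
score = r2_score(y_test, model.predict(X_test))
```

<p>Swap in your own CSV and you have a working baseline in an afternoon. Understanding <em>why</em> it works comes later.</p>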
<p><img src="https://www.springboard.com/blog/wp-content/uploads/2021/04/tensorflow-vs-scikit-learn-how-do-they-compare.png" alt="TensorFlow vs. Scikit-Learn: How Do They Compare?" /></p>
<h2 id="heading-the-deep-dive-approach"><strong>The Deep Dive Approach</strong></h2>
<p>This is for the brave souls who want to understand every single detail. You'll need:</p>
<ul>
<li><p>Calculus (yes, those derivatives are coming back to haunt you)</p>
</li>
<li><p>Linear Algebra (matrices are your new best friends)</p>
</li>
<li><p>Statistics (probability distributions will be your breakfast reading)</p>
</li>
</ul>
<p>I remember spending years in school with these math concepts, fueled by energy drinks and questionable life choices. But I'll tell you what - once it clicks, it's like having a superpower!</p>
<p>You can get a long way doing machine learning with pre-existing libraries such as scikit-learn, but at some point you'll feel the knowledge debt: gaps between what you're doing and what you actually understand.</p>
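<p>Here's a taste of what's hiding under the hood – plain linear regression fit by gradient descent in nothing but NumPy. The calculus shows up in the gradient of the mean squared error, and the linear algebra in the matrix products:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: 200 samples, 3 features, known true weights plus a little noise
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Gradient descent on mean squared error: gradient = (2/n) * X^T (Xw - y)
w = np.zeros(3)
learning_rate = 0.1
for _ in range(500):
    gradient = (2 / len(y)) * X.T @ (X @ w - y)
    w -= learning_rate * gradient
```

<p>After 500 steps, <code>w</code> lands very close to <code>true_w</code>. Every <code>model.fit()</code> you call runs some cousin of this loop – and when training diverges, it's this math that tells you why.</p>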
<p><img src="https://media.licdn.com/dms/image/v2/D5612AQFPqMC0F-cj3A/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1695092840155?e=2147483647&amp;v=beta&amp;t=ZXk1l73MQ4EgG63-SzV7T5C9Mi_e44A6ULt_zTiEdOo" alt="ML 1.8 Why We Need Math Knowledge for Machine Learning" /></p>
<h2 id="heading-why-should-you-care"><strong>Why Should You Care?</strong></h2>
<p>Look, I get it - learning machine learning can feel like trying to eat an elephant. But here's the thing: the field is exploding faster than my coffee addiction (and that's saying something). Every day I see companies scrambling to hire people who understand this stuff. Whether you're a fresh graduate or a seasoned developer, this knowledge is becoming as essential as knowing how to use a spreadsheet was in the '90s. If you’re here, well, I know you know how much machine learning engineers are getting paid by the hour ;).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731022828832/eccc1135-6c3d-4816-a7d0-d33a4667b467.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-tips-from-someone-whos-been-there"><strong>Tips From Someone Who's Been There</strong></h3>
<ol>
<li><p><strong>Start Small</strong>: My first ML project was predicting house prices. It wasn't revolutionary, but hey, it worked! And I didn't cry... much. Starting small keeps you from getting overwhelmed – and, more importantly, it gets you started at all.</p>
</li>
<li><p><strong>Join Communities</strong>: Trust me, you'll need people to commiserate with when your model's accuracy is lower than your high school math grades. Getting feedback in public is just as crucial as the time you spend studying.</p>
</li>
<li><p><strong>Build Real Projects</strong>: Theory is great, but nothing beats the thrill (and frustration) of building something real. I learned more from my failed projects than from any tutorial. Most learning comes from our failures rather than our successes.</p>
</li>
</ol>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Whether you choose to dive straight into coding or take the scenic route through math town, remember that everyone starts somewhere. I went from barely understanding what ML meant to building production models that actually work (most of the time). If I can do it, so can you!</p>
<p>And hey, if you're feeling overwhelmed, just remember: even the most sophisticated ML models sometimes make predictions that are about as accurate as my weather app - and we still keep trying!</p>
<p>For more nerdy data science content and occasional attempts at humor, check out my other articles on getting started with Python and data science fundamentals. Trust me, they're marginally more entertaining than watching paint dry! 😉</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>After years of stumbling through machine learning, I've found that learning is always better when done together. If you're stuck on a concept, need guidance, or just want to chat about ML, feel free to reach out to me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[What is The Scatterplot Data Generator?]]></title><description><![CDATA[Introduction
Picture this: It's 3 AM, you're on your fifth cup of coffee, and you're staring at your screen thinking, "If only I had the perfect dataset to test this clustering algorithm..." Trust me, I've been there – we've all been there. As a data...]]></description><link>https://hddatascience.tech/what-is-the-scatterplot-data-generator</link><guid isPermaLink="true">https://hddatascience.tech/what-is-the-scatterplot-data-generator</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sat, 02 Nov 2024 01:52:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730511897787/f29f1157-0fbc-43ab-afbf-e3ad9280e2fc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730511867292/527736b4-0d07-492d-acc1-e3fb17558c86.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-introduction">Introduction</h2>
<p>Picture this: It's 3 AM, you're on your fifth cup of coffee, and you're staring at your screen thinking, "If only I had the perfect dataset to test this clustering algorithm..." Trust me, I've been there – we've all been there. As a data engineer who's spent countless nights wrestling with algorithms that just won't behave (much like my neighbor's cat), I've discovered a tool that's become my secret weapon: the Scatterplot Data Generator.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730508541968/88b1d499-23e4-4def-ab6b-6b67d2f679fc.webp" alt class="image--center mx-auto" /></p>
<p>In this post, we'll explore what this magical tool is, why it's a game-changer for data scientists and engineers, and how it can save you from those late-night data hunting expeditions.</p>
<h2 id="heading-what-is-the-scatterplot-data-generator">What is The Scatterplot Data Generator?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730508635907/dec33b38-b257-44eb-bbb8-7963c48b4095.png" alt class="image--center mx-auto" /></p>
<p>Scatterplot Data Generator is a web-based tool that lets you literally draw your data points into existence. Think of it as MS Paint meets data science (minus the questionable artistic results we all created in the '90s). It allows users to draw points of different colors on a coordinate system, which are then converted into actual numerical data that you can use for machine learning models, testing, or educational purposes.</p>
<h2 id="heading-why-is-scatterplot-data-generator-important">Why is Scatterplot Data Generator Important?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730508610592/f57b9f2d-f1c4-4fba-9a77-1058a33189fb.webp" alt class="image--center mx-auto" /></p>
<p>Let me tell you a story that might sound familiar. Last year, I was working on a multi-class classification model that needed very specific data patterns to test edge cases. After hours of searching through Kaggle and various datasets (and possibly losing a bit of my sanity), I realized I was doing it the hard way.</p>
<h2 id="heading-simplest-use-case-for-scatterplot-data-generator">Simplest Use Case for Scatterplot Data Generator</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730518848095/1c689fd5-5c62-40cb-ac24-0d73e8147bc5.png" alt class="image--center mx-auto" /></p>
<ol>
<li><h3 id="heading-visual-pattern-recognition">Visual Pattern Recognition:</h3>
</li>
</ol>
<ul>
<li><p>The tool shows two distinct plots: one with color-coded points (blue and red) and another with black points</p>
</li>
<li><p>This helps learners understand how clustering algorithms identify and separate data points into groups based on their spatial relationships</p>
</li>
</ul>
<ol start="2">
<li><h3 id="heading-interactive-learning-features">Interactive Learning Features:</h3>
</li>
</ol>
<ul>
<li><p>The interface has color selection options (Blue, Red, Green)</p>
</li>
<li><p>A "Reset" button to start fresh</p>
</li>
<li><p>"Download CSV" functionality to export the data</p>
</li>
<li><p>These features allow hands-on experimentation with different data patterns</p>
</li>
</ul>
<ol start="3">
<li><h3 id="heading-educational-benefits">Educational Benefits:</h3>
</li>
</ol>
<ul>
<li><p>Learners can create custom data distributions to test clustering scenarios</p>
</li>
<li><p>The tool demonstrates how points that are closer together tend to form clusters</p>
</li>
<li><p>The right-side plot shows how raw data looks before classification/clustering</p>
</li>
<li><p>The left-side plot shows how clustering algorithms might separate the data into distinct groups</p>
</li>
</ul>
<ol start="4">
<li><h3 id="heading-practical-applications">Practical Applications:</h3>
</li>
</ol>
<ul>
<li><p>Users can generate synthetic datasets for testing clustering algorithms like K-means or DBSCAN</p>
</li>
<li><p>They can experiment with different data patterns and see how clustering algorithms might perform</p>
</li>
<li><p>The CSV export feature lets them use the generated data in actual ML tools and frameworks</p>
</li>
</ul>
<p>This tool essentially bridges the gap between theoretical understanding and practical application in machine learning clustering concepts.</p>
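<p>For example, once you've hit "Download CSV", feeding the points into K-means takes only a few lines. The column names (<code>x</code>, <code>y</code>, <code>color</code>) are my assumption about the export layout, so check them against your actual file:</p>

```python
import io
import pandas as pd
from sklearn.cluster import KMeans

# Inline stand-in for the exported file; in practice: pd.read_csv("points.csv")
csv_text = """x,y,color
1.0,1.1,blue
0.9,1.0,blue
1.2,0.8,blue
5.0,5.1,red
5.2,4.9,red
4.8,5.0,red
"""
df = pd.read_csv(io.StringIO(csv_text))

# Cluster on the coordinates only; the colors you drew act as ground truth
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["x", "y"]])
df["cluster"] = kmeans.labels_
```

<p>With well-separated blobs like these, each cluster recovers exactly one drawn color – then you can start drawing spirals and watch K-means fall apart.</p>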
<h2 id="heading-real-examples-of-scatterplot-data-generator-in-action">Real Examples of Scatterplot Data Generator in Action</h2>
<h3 id="heading-1-the-classification-conundrum">1. The Classification Conundrum</h3>
<p><img src="https://www.researchgate.net/publication/341699548/figure/fig1/AS:896044741181440@1590645122163/Scatter-plot-of-XOR-Spiral-and-Circle-Benchmarks.ppm" alt="Scatter plot of XOR, Spiral and Circle Benchmarks" /></p>
<p>Picture this: I was working with a peer who couldn't understand why their beautiful linear classifier was failing miserably. Rather than diving into complex math, I fired up Scatterplot Data Generator and drew a simple XOR pattern – you know, that classic "cross" shape that makes linear classifiers cry themselves to sleep. Five minutes of interactive demonstration showed what would have taken an hour to explain with equations. The best part? They immediately started experimenting with their own patterns, creating increasingly diabolical datasets to break various classifiers. It's all fun and games until someone creates a spiral pattern!</p>
<h3 id="heading-2-the-edge-case-emergency">2. The Edge Case Emergency</h3>
<p>It was Sunday night (because production issues never happen on a Tuesday afternoon, right?), and our anomaly detection system was throwing false positives. We needed to test edge cases, and fast. Using Scatterplot Data Generator, we created datasets with specific outlier patterns that mimicked our production scenarios. Within an hour, we had a suite of test cases that would have taken days to find in real data. The best part? We could tweak the patterns in real-time as we discovered new edge cases. Our Monday morning post-mortem turned into a "look how we nailed it" presentation!</p>
<h2 id="heading-workflows-for-scatterplot-data-generator">Workflows for Scatterplot Data Generator</h2>
<pre><code class="lang-mermaid">flowchart LR
    A[Draw Points] --&gt;|Click &amp; Drag| B[Generate Data]
    B --&gt; C[Download CSV]
    B --&gt; D[Export Visual]
    C --&gt; E[Use in ML Pipeline]
    D --&gt; F[Use in Documentation]

    style A fill:#5b9aa0
    style B fill:#5b9aa0
    style C fill:#5b9aa0
    style D fill:#5b9aa0
    style E fill:#5b9aa0
    style F fill:#5b9aa0
</code></pre>
<h2 id="heading-tips-and-reminders-for-using-scatterplot-data-generator">Tips and Reminders for Using Scatterplot Data Generator</h2>
<h3 id="heading-1-plan-your-pattern">1. Plan Your Pattern</h3>
<p>Before diving in, spend five minutes sketching your intended pattern. Trust me, I learned this the hard way after creating what I thought would be a perfect Gaussian distribution but ended up looking more like my failed attempt at drawing a cat. Quick tip: Use graph paper for your sketches – your coordinates will thank you later!</p>
<h3 id="heading-2-save-everything">2. Save Everything</h3>
<p>I cannot stress this enough: Save. Your. Work. Name your files descriptively (not "test1_final_final_REALLY_FINAL.csv"). Keep both the visual and the data. Document your patterns. I once spent three hours recreating a "perfect" dataset because I forgot to save the original. Learn from my pain!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Just as every great artist needs their canvas, every data scientist needs their tools. Scatterplot Data Generator bridges the gap between imagination and implementation, between "I wish I had this data" and "I created exactly what I needed." Whether you're a seasoned data scientist battling with edge cases, a teacher illuminating the mysteries of machine learning, or a beginner trying to understand why your neural network has trust issues, this tool transforms the abstract into the tangible. Remember: in a world where data is the new oil, being able to generate exactly what you need makes you not just a data scientist, but a data artist. And sometimes, the best datasets are the ones we draw ourselves – even if they occasionally end up looking like abstract art!</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together">P.S. Let's Build Something Cool Together!</h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/">Linkedin</a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[What is Data Engineering?: Everything You Need to Know]]></title><description><![CDATA[Ever found yourself drowning in a sea of data, trying to make sense of countless Excel sheets while your computer fan sounds like it's about to take off? Trust me, I've been there. As someone who once tried to train a machine learning model on my lap...]]></description><link>https://hddatascience.tech/what-is-data-engineering-everything-you-need-to-know</link><guid isPermaLink="true">https://hddatascience.tech/what-is-data-engineering-everything-you-need-to-know</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[netflix]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Thu, 24 Oct 2024 23:25:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729810715552/296aa8c6-d3ba-4f84-a514-4dd270bfc995.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever found yourself drowning in a sea of data, trying to make sense of countless Excel sheets while your computer fan sounds like it's about to take off? Trust me, I've been there. As someone who once tried to train a machine learning model on my laptop with 100GB of unstructured data (spoiler alert: it didn't end well), I learned the hard way why data engineering is the unsung hero of the data world.</p>
<p>My laptop trying to process 100GB of data…</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729810987018/9ded2596-63f4-4097-8273-c28e2d6576e0.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-introduction">Introduction</h2>
<p>In today's digital age, data is the new gold – but just like raw gold, raw data needs refining before it becomes valuable. That's where data engineering comes in. Whether you're a startup trying to make sense of your customer data or a large enterprise handling petabytes of information, data engineering is the foundation that makes modern data science and analytics possible.</p>
<h2 id="heading-the-data-pipeline-journey">The Data Pipeline Journey</h2>
<p><em>Raw Data → Data Engineering → Clean Data → Analysis → Insights</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811103420/0a08d75a-e8ff-4e79-8928-0276163ab434.png" alt class="image--center mx-auto" /></p>
<p>In this post, we'll define data engineering, explore its crucial role in the data ecosystem, and provide practical insights into how it can transform your business's data operations. We'll also look at real-world examples and best practices that can help you get started on your data engineering journey.</p>
<h2 id="heading-what-is-data-engineering">What is Data Engineering?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811313597/836c4287-bf3b-4394-a023-1fe18efeccc2.webp" alt class="image--center mx-auto" /></p>
<p>Data engineering is the practice of designing, building, and maintaining the infrastructure and systems needed to collect, store, process, and deliver data for analysis. Think of data engineers as the architects and plumbers of the data world – they build the pipelines and systems that ensure data flows smoothly from source to destination, arriving clean and ready for analysis.</p>
<p><em>A sample python BigQuery snippet</em></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> google.cloud <span class="hljs-keyword">import</span> bigquery
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">BigQueryETL</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, project_id: str</span>):</span>
        <span class="hljs-string">"""Initialize BigQuery client"""</span>
        self.client = bigquery.Client(project=project_id)
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract</span>(<span class="hljs-params">self, query: str</span>) -&gt; pd.DataFrame:</span>
        <span class="hljs-string">"""Extract data from BigQuery"""</span>
        <span class="hljs-keyword">try</span>:
            df = self.client.query(query).to_dataframe()
            self.logger.info(<span class="hljs-string">f"Extracted <span class="hljs-subst">{len(df)}</span> rows"</span>)
            <span class="hljs-keyword">return</span> df
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Extraction failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform</span>(<span class="hljs-params">self, df: pd.DataFrame</span>) -&gt; pd.DataFrame:</span>
        <span class="hljs-string">"""Apply transformations to the data"""</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Convert dates</span>
            date_cols = df.select_dtypes(include=[<span class="hljs-string">'datetime64[ns]'</span>]).columns
            <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> date_cols:
                df[col] = pd.to_datetime(df[col])

            <span class="hljs-comment"># Handle missing values in numeric columns</span>
            num_cols = df.select_dtypes(include=[<span class="hljs-string">'float64'</span>, <span class="hljs-string">'int64'</span>]).columns
            df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

            <span class="hljs-comment"># Add time-based features if timestamp exists</span>
            <span class="hljs-keyword">if</span> <span class="hljs-string">'timestamp'</span> <span class="hljs-keyword">in</span> df.columns:
                df[<span class="hljs-string">'hour'</span>] = df[<span class="hljs-string">'timestamp'</span>].dt.hour
                df[<span class="hljs-string">'is_weekend'</span>] = df[<span class="hljs-string">'timestamp'</span>].dt.dayofweek.isin([<span class="hljs-number">5</span>, <span class="hljs-number">6</span>]).astype(int)

            <span class="hljs-keyword">return</span> df.drop_duplicates()
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Transformation failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load</span>(<span class="hljs-params">self, df: pd.DataFrame, table_id: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""Load data into BigQuery"""</span>
        <span class="hljs-keyword">try</span>:
            job_config = bigquery.LoadJobConfig(
                write_disposition=<span class="hljs-string">'WRITE_TRUNCATE'</span>,
                schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]
            )

            load_job = self.client.load_table_from_dataframe(
                df, table_id, job_config=job_config
            )
            load_job.result()  <span class="hljs-comment"># Wait for job to complete</span>

            self.logger.info(<span class="hljs-string">f"Loaded <span class="hljs-subst">{len(df)}</span> rows to <span class="hljs-subst">{table_id}</span>"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Load failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_pipeline</span>(<span class="hljs-params">self, query: str, destination_table: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""Execute the full ETL pipeline"""</span>
        start_time = datetime.now()
        <span class="hljs-keyword">try</span>:
            df = self.extract(query)
            df_transformed = self.transform(df)
            self.load(df_transformed, destination_table)

            duration = datetime.now() - start_time
            self.logger.info(<span class="hljs-string">f"Pipeline completed in <span class="hljs-subst">{duration}</span>"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Pipeline failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

<span class="hljs-comment"># Example usage</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># Initialize pipeline</span>
    etl = BigQueryETL(<span class="hljs-string">"your-project-id"</span>)

    <span class="hljs-comment"># Example query</span>
    query = <span class="hljs-string">"""
    SELECT user_id, timestamp, activity_type, duration
    FROM `your-project-id.dataset.user_activity`
    WHERE DATE(timestamp) &gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    """</span>

    <span class="hljs-comment"># Run pipeline</span>
    etl.run_pipeline(query, <span class="hljs-string">"your-project-id.dataset.processed_activity"</span>)
</code></pre>
<h2 id="heading-why-is-data-engineering-important">Why is Data Engineering Important?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811666863/e71dda39-9305-4aca-acb0-26a3baa8c08f.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-data-volume-175-zettabytes-by-2025">1. Data Volume: 175 Zettabytes by 2025</h3>
<p>By 2025, global data creation will hit 175 zettabytes (that's 175 billion terabytes!). To put this in perspective, if you burned this data onto DVDs, the stack would stretch from the Earth to the moon well over 100,000 times. This explosive growth, driven by IoT devices, social media, and streaming services, makes robust data engineering not just important, but critical for business survival.</p>
<h3 id="heading-2-decision-speed-25-faster">2. Decision Speed: 25% Faster</h3>
<p>Organizations with proper data engineering make decisions 25% faster than their competitors. Think retail making inventory decisions in hours instead of days, or healthcare reducing patient diagnosis time by a third. This speed comes from automated data pipelines, real-time analytics, and streamlined access to clean, reliable data.</p>
<h3 id="heading-3-cost-reduction-up-to-70">3. Cost Reduction: Up to 70%</h3>
<p>Companies can slash data-related costs by up to 70% through data engineering. How? Through smart infrastructure optimization (30% savings), automated processes (20% savings), and better resource allocation (20% savings). Instead of throwing money at storing and processing messy data, proper engineering means you spend less while getting better results.</p>
<h2 id="heading-real-examples-of-data-engineering">Real Examples of Data Engineering</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811991194/adb0f65c-30d9-4b70-8936-e86d0e2d7496.jpeg" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-netflixs-data-pipeline">1. Netflix's Data Pipeline</h3>
<pre><code class="lang-mermaid">graph TD
    A[User Interactions] --&gt;|Streaming Events| B[Kafka]
    B --&gt;|Real-time Processing| C[Apache Flink]
    B --&gt;|Batch Processing| D[Spark]
    C --&gt;|Hot Data| E[Cassandra]
    D --&gt;|Cold Data| F[S3 Data Lake]
    E --&gt; G[Feature Store]
    F --&gt; G
    G --&gt;|ML Training| H[Model Training]
    H --&gt;|Model Serving| I[Recommendation Service]
    I --&gt;|Personalization| J[User Interface]
</code></pre>
<p>Netflix processes a staggering 450+ billion events per day through their data pipeline. Here's how their architecture works:</p>
<ol>
<li><p><strong>Data Collection Layer</strong></p>
<ul>
<li><p>Captures user interactions (clicks, views, pauses, ratings)</p>
</li>
<li><p>Records viewing quality metrics</p>
</li>
<li><p>Tracks device-specific information</p>
</li>
<li><p>Processes content metadata</p>
</li>
</ul>
</li>
<li><p><strong>Processing Layer</strong></p>
<ul>
<li><p>Real-time processing for immediate recommendations</p>
</li>
<li><p>Batch processing for deeper insights</p>
</li>
<li><p>A/B testing data for feature optimization</p>
</li>
<li><p>Content performance analytics</p>
</li>
</ul>
</li>
<li><p><strong>Storage Layer</strong></p>
<ul>
<li><p>Hot data in Cassandra for real-time access</p>
</li>
<li><p>Cold data in S3 for historical analysis</p>
</li>
<li><p>Feature store for ML model training</p>
</li>
<li><p>Redis cache for quick access to recommendations</p>
</li>
</ul>
</li>
</ol>
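<p>The hot/cold split at the heart of this architecture can be sketched in a few lines of Python. To be clear, this is a toy stand-in, not Netflix's actual code: the event kinds, field names, and routing rule are illustrative assumptions, with in-memory lists standing in for Cassandra and the S3 data lake.</p>

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    user_id: str
    kind: str        # e.g. "click", "view", "pause", "rating"
    payload: dict

@dataclass
class PipelineSinks:
    hot: list = field(default_factory=list)    # stand-in for Cassandra
    cold: list = field(default_factory=list)   # stand-in for the S3 data lake

# Event kinds that feed immediate recommendations (illustrative choice)
REALTIME_KINDS = {"click", "view"}

def route_event(event: Event, sinks: PipelineSinks) -> str:
    """Fan an event out the way the diagram above does: real-time
    kinds take the hot path, and every event also lands in cold
    storage for batch processing."""
    sinks.cold.append(event)            # everything reaches the data lake
    if event.kind in REALTIME_KINDS:
        sinks.hot.append(event)         # low-latency path for recommendations
        return "hot+cold"
    return "cold"

sinks = PipelineSinks()
print(route_event(Event("u1", "click", {}), sinks))           # hot+cold
print(route_event(Event("u1", "rating", {"stars": 5}), sinks)) # cold
```

<p>In production the same fan-out happens through Kafka consumers feeding Flink (hot path) and Spark (cold path), but the routing decision itself is conceptually this simple.</p>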
<p>The result? Those eerily accurate "Because you watched..." recommendations that keep us binge-watching!</p>
<h3 id="heading-2-ubers-real-time-analytics">2. Uber's Real-Time Analytics</h3>
<p><em>Simplified view of Uber's real-time system:</em></p>
<pre><code class="lang-mermaid">graph TD
    A[Rider/Driver Apps] --&gt;|Events| B[Apache Kafka]
    B --&gt;|Stream Processing| C[Apache Flink]
    B --&gt;|Batch Processing| D[Apache Spark]
    C --&gt;|Real-time Metrics| E[Apache AthenaX]
    D --&gt;|Historical Data| F[Hudi Data Lake]
    E --&gt;|Current State| G[Redis]
    F --&gt;|Analytics| H[Presto]
    G --&gt;|Real-time Decisions| I[Matching Service]
    H --&gt;|Business Intelligence| J[Analytics Dashboard]
</code></pre>
<p>Uber's real-time data pipeline handles millions of events per second. Here's their architecture breakdown:</p>
<ol>
<li><p><strong>Real-time Processing Layer</strong></p>
<ul>
<li><p>Processes GPS coordinates every 4 seconds</p>
</li>
<li><p>Handles surge pricing calculations</p>
</li>
<li><p>Manages driver-rider matching</p>
</li>
<li><p>Monitors service health</p>
</li>
</ul>
</li>
<li><p><strong>Storage Layer</strong></p>
<ul>
<li><p>Temporal data in Redis for immediate access</p>
</li>
<li><p>Historical data in Apache Hudi</p>
</li>
<li><p>Geospatial indexing for location services</p>
</li>
<li><p>Cached frequently accessed routes</p>
</li>
</ul>
</li>
<li><p><strong>Analytics Layer</strong></p>
<ul>
<li><p>Real-time city demand forecasting</p>
</li>
<li><p>Dynamic pricing algorithms</p>
</li>
<li><p>Driver supply optimization</p>
</li>
<li><p>Route optimization based on traffic patterns</p>
</li>
</ul>
</li>
</ol>
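<p>The geospatial core of driver-rider matching is easy to sketch. The snippet below is a deliberately naive illustration, not Uber's algorithm (which also weighs ETA, traffic, and driver supply): it simply picks the nearest available driver by great-circle distance.</p>

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_driver(rider, drivers):
    """Toy matching service: nearest driver wins. `drivers` is a list
    of (driver_id, (lat, lon)) tuples."""
    return min(drivers, key=lambda d: haversine_km(*rider, *d[1]))

drivers = [("d1", (14.60, 120.98)), ("d2", (14.55, 121.05)), ("d3", (14.58, 121.00))]
rider = (14.57, 121.01)
print(match_driver(rider, drivers)[0])
```

<p>At Uber's scale this lookup runs against a geospatial index (so you never compare against every driver), but the objective being minimized is the same idea.</p>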
<p>The result is a system that can match you with a driver in seconds while optimizing for countless variables in real-time!</p>
<hr />
<h2 id="heading-conclusion-the-future-is-data-driven">Conclusion: The Future is Data-Driven</h2>
<p>Looking at these real-world examples, it's clear that data engineering isn't just about moving data from point A to point B – it's the backbone of modern digital experiences we take for granted. From Netflix knowing exactly what show you'll love next to Uber finding you the perfect driver in seconds, data engineering makes the impossible possible.</p>
<p>Remember when I mentioned my laptop meltdown trying to process 100GB of data? That's like trying to deliver packages on a bicycle when you need a fleet of trucks. Modern data engineering is that fleet of trucks, complete with GPS, route optimization, and real-time tracking.</p>
<p>As we move toward an even more data-intensive future, the role of data engineering will only grow. Whether you're a startup processing your first thousand users' worth of data or an enterprise handling petabytes, the principles remain the same:</p>
<ul>
<li><p>Build scalable, resilient pipelines</p>
</li>
<li><p>Automate everything you can</p>
</li>
<li><p>Monitor religiously</p>
</li>
<li><p>Plan for growth</p>
</li>
</ul>
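<p>The "automate" and "monitor" principles above fit in a short sketch. Everything here is a hypothetical example rather than a real library's API: a decorator that retries a flaky pipeline step and keeps crude call/failure counters that a real system would ship to a metrics backend.</p>

```python
import functools
import time

def monitored(max_retries=3, backoff_s=0.0):
    """Retry a flaky pipeline step and record simple run metrics."""
    def wrap(fn):
        metrics = {"calls": 0, "failures": 0}
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(max_retries):
                metrics["calls"] += 1
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    metrics["failures"] += 1
                    if attempt == max_retries - 1:
                        raise       # out of retries: surface the failure
                    time.sleep(backoff_s)
        inner.metrics = metrics     # expose counters for monitoring
        return inner
    return wrap

attempts = {"n": 0}

@monitored(max_retries=3)
def flaky_load():
    """Simulated load step that fails twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient failure")
    return "loaded"

print(flaky_load(), flaky_load.metrics)
```

<p>In practice you would get this from an orchestrator such as Airflow (retries, alerting, and metrics built in), but the principle is the same: every step is automated, every step is observed.</p>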
]]></content:encoded></item></channel></rss>