<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Data Science/Engineering Insights]]></title><description><![CDATA[Data Science/Engineering Insights]]></description><link>https://hddatascience.tech</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 12:40:22 GMT</lastBuildDate><atom:link href="https://hddatascience.tech/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The LLM Council and the Human Mind]]></title><description><![CDATA[Weeks ago, Andrej Karpathy (former Director of AI at Tesla) launched LLM Council. The concept is brilliant but simple: instead of asking a single AI model (like ChatGPT) to answer a question, you create a "council" of different models. You have one model...]]></description><link>https://hddatascience.tech/the-llm-council-and-the-human-mind</link><guid isPermaLink="true">https://hddatascience.tech/the-llm-council-and-the-human-mind</guid><category><![CDATA[AI]]></category><category><![CDATA[meditation]]></category><category><![CDATA[chain of thought]]></category><category><![CDATA[psychology]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Tue, 16 Dec 2025 00:43:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765845668409/e3d3da8d-70ec-4518-a0fc-240f9199ba0a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Weeks ago, Andrej Karpathy (former Director of AI at Tesla) launched <strong>LLM Council</strong>. The concept is brilliant but simple: instead of asking a single AI model (like ChatGPT) to answer a question, you create a "council" of different models. You have one model draft an answer, another critique it, and a "Chairman" model make the final decision.</p>
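<p>To make the idea concrete, here is a minimal sketch of such a pipeline. The <code>ask()</code> helper and the model names are hypothetical stand-ins, not Karpathy's actual implementation; a real version would call LLM APIs instead of returning stub strings.</p>

```python
# Hypothetical sketch of a "council" pipeline. ask() and the model
# names are invented stand-ins; a real version would call LLM APIs.
def ask(model, prompt):
    # Stub: in practice this would be an API call to the named model.
    return f"[{model}] response to: {prompt}"

def council(question, members, chairman):
    # 1. Each council member drafts an independent answer.
    drafts = {m: ask(m, question) for m in members}
    # 2. Each member critiques the full set of drafts.
    critiques = {m: ask(m, f"Critique these drafts: {sorted(drafts.values())}")
                 for m in members}
    # 3. The Chairman reads everything and makes the final decision.
    summary = f"Question: {question}\nDrafts: {drafts}\nCritiques: {critiques}"
    return ask(chairman, summary)

answer = council("What is consciousness?", ["model-a", "model-b"], "chairman")
```

<p>The point is structural: the final answer is conditioned on drafts <em>and</em> critiques, not on a single model's first guess.</p>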
<p>Generative AI is known to hallucinate every now and then: it makes things up or, at worst, replies with outright nonsense. But when you force it to debate, reflect, and critique, the quality of its answers skyrockets.</p>
<p>As I read the documentation, I realized this isn't just a new way to code. This is exactly how a healthy human mind works.</p>
<h2 id="heading-the-internal-negotiation"><strong>The Internal Negotiation</strong></h2>
<p>Again, this reminded me of the lessons Jordan Peterson teaches on meditation and prayer. The way we progress as individuals is through managing our thoughts. “What is it that I truly want?” He highlighted that we should learn to think deeply about what we aspire to be, or what we feel the greatest good in the world is, and plan our actions accordingly.</p>
<p>It goes without saying that we should learn to negotiate with ourselves; nothing good comes out of tyranny. That means negotiating a fair reward system for our own efforts.</p>
<p>We think of ourselves as a single person, but we are actually a noisy room of internal agents. And just like software, if we don't generate "logs" or if we don't slow down to meditate or pray, we crash.</p>
<h2 id="heading-internal-agents">Internal Agents</h2>
<p>If you look inside your own head, you rarely find a single opinion. You find a negotiation.</p>
<ul>
<li><p><strong>The Fear Agent (The Amygdala):</strong> You might recognize this as the internal voice you had as a child, alone in a dark part of the house, or the voice that speaks up whenever you are about to send that crucial work email.</p>
</li>
<li><p><strong>The Dopamine Agent:</strong> This part of you wants the short-term reward. It wants the sweets, the fast money, the scroll on TikTok. You know it’s successfully taken over the moment you choose video games over your work or schoolwork. It optimizes for immediate gratification.</p>
</li>
<li><p><strong>The Long-Term Agent:</strong> This is the part of you that wants deep success, health, and meaning.</p>
</li>
</ul>
<p>Most people live their lives on autopilot. They let the loudest agent (usually fear or dopamine) act without taking time to reflect. They react immediately. In AI terms, this can be thought of as <strong>Hallucination</strong>, a confident but wrong output. I may be reaching a bit in comparing hallucination to the small, wrong decisions we make day to day, but the idea stands: we do the wrong thing because we never thought it through.</p>
<h2 id="heading-meditation">Meditation</h2>
<p>I view meditation as a form of <strong>First Principles Thinking</strong>. It is the act of clearing the "Context Window".</p>
<p>When life gets overwhelming, our internal RAM (random access memory) gets full of noise, stress, opinions, social media, and many other useless (or useful) things. If you try to make a decision in that state, you will fail. Meditation can be thought of as hitting the reset button. It wipes the cache.</p>
<p>It allows me to switch from "Zero-Shot" reacting to <strong>Chain of Thought</strong> reasoning. I can sit back and look at my thoughts from a third-person perspective. I can "judge the judger." I can ask: <em>Why am I afraid? Is there an actual danger? Can I do it afraid anyway?</em></p>
<p>It’s not about deleting the existing agents. As Jordan Peterson says, you can't tyrannize yourself. If you try to crush your fear or starve your desires, they will rebel. You have to negotiate. You have to be the Chairman of the Council, listening to the fear, acknowledging it, but ultimately deciding to follow the Long-Term Agent.</p>
<h2 id="heading-journaling">Journaling</h2>
<p>If a program crashes and you didn't set up a logging system, you will have no idea what happened. You can't fix the bug. You will just keep crashing in the same way, over and over.</p>
<p>Humans do this too. We repeat the same toxic patterns, the same bad habits, the same anxious spirals. Why? <strong>Because we never generated the logs.</strong></p>
<p>We didn't slow down. We didn't meditate, pray, or journal.</p>
<p>Meditation and Prayer are the tools we use to generate logs of our existence. They force us to stop the execution of the code and look at the logic, or, in real life, to evaluate the decisions we made throughout the day.</p>
<ul>
<li><p><em>What triggered that anger?</em></p>
</li>
<li><p><em>Why did I chase that fast money?</em></p>
</li>
<li><p><em>What truly matters to me right now?</em></p>
</li>
</ul>
<p>If we don't slow down to "log it out," we are just autonomous agents running trash code, reacting to the world with no direction until we burn out.</p>
<p>But if we take the time to convene the Council, to meditate on first principles and pray for guidance, we stop reacting, and we start living.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>For years, AI development optimized for speed. But we reached a point where speed wasn't the problem, reasoning was. The solution wasn't to go faster, it was to introduce agents that think, critique, and work together.</p>
<p>The same applies to us. We cannot build a good life through speed or autopilot reactions. We need an architecture of thought. Chain of thought was a good starting point, but the broader architecture of agents thinking together can still improve. I’m left wondering: what would the perfect agentic chain-of-thought architecture look like?</p>
<p>We must define our own "Philosophical Guide" agent. By taking the time to convene the Council, meditate on first principles, and log our internal states, we stop merely reacting to the code and we start writing it.</p>
<p>Which raises the follow-up question: “How do we define an agent that is a philosophical guide?”</p>
]]></content:encoded></item><item><title><![CDATA[One-Shot Trauma: When Reinforcement Learning and Human Minds Overcorrect]]></title><description><![CDATA[The Day My Internal Agent Received a -1,000,000 Penalty
It only took a second to rewire my brain.
By early 2022, I was just your average joe, living life day by day. Eating was one of my daily tasks that was necessary, automatic, and unconscious. It ...]]></description><link>https://hddatascience.tech/one-shot-trauma-when-reinforcement-learning-and-human-minds-overcorrect</link><guid isPermaLink="true">https://hddatascience.tech/one-shot-trauma-when-reinforcement-learning-and-human-minds-overcorrect</guid><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[psychology]]></category><category><![CDATA[jordan peterson]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Mon, 17 Nov 2025 14:03:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763388189147/2ef23053-d2be-45c8-852d-33f758b828e6.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-the-day-my-internal-agent-received-a-1000000-penalty"><strong>The Day My Internal Agent Received a -1,000,000 Penalty</strong></h3>
<p>It only took a second to rewire my brain.</p>
<p>By early 2022, I was just your average Joe, living life day by day. Eating was one of those daily tasks that was necessary, automatic, and unconscious, a background process that required no thought. Then, one day, I failed at that supposedly automatic task. I choked on my food.</p>
<p>It wasn't just a moment of discomfort, it was a primal, terrifying alert that flooded my entire system. The world narrowed to the single, desperate need for air. My heart hammered against my ribs, adrenaline surged, and in that moment, my brain registered a single, blaring data point: <em>This is death. This is how you die.</em></p>
<p>Even after the danger passed, the damage was done. For weeks and months, I had a debilitatingly difficult time eating solid foods. Every sensation in my throat felt like a potential prelude to disaster, triggering panic attacks that, in a cruel feedback loop, caused GERD, which in turn created more throat sensations.</p>
<p>What I didn’t realize at the time was that my brain was running a perfect, albeit terrifying, simulation of a core problem in artificial intelligence. I had become a reinforcement learning agent that had just received a penalty so massive, so disproportionate to all my previous experiences, that my entire operating policy had been corrupted.</p>
<h3 id="heading-a-crash-course-in-reinforcement-learning"><strong>A Crash Course in Reinforcement Learning</strong></h3>
<p>Before we get to the catastrophe, let’s quickly define the terms. Reinforcement Learning (RL) is a field of AI where we teach an "agent" to make decisions. Think of it like training a dog, but with algorithms.</p>
<p>The basic components are simple:</p>
<ul>
<li><p><strong>The Agent:</strong> The learner and decision-maker (the AI, the dog, or in my case, me).</p>
</li>
<li><p><strong>The Environment:</strong> The world the agent operates in (a video game, a maze, or the dinner table).</p>
</li>
<li><p><strong>The State:</strong> A snapshot of the agent's current situation ("I am at a crossroad," "My plate is full of solid food").</p>
</li>
<li><p><strong>The Action:</strong> Something the agent can do ("Turn left," "Take a bite").</p>
</li>
<li><p><strong>The Reward/Penalty:</strong> The feedback the agent gets from the environment after an action (+1 for finding cheese, -1 for hitting a wall).</p>
</li>
</ul>
<p>The agent’s goal is to learn a <strong>policy</strong>, a strategy or a map of which actions to take in which states to maximize its total cumulative reward over time. It does this through trial and error, gradually updating its policy as it explores the world. For 99.9% of its life, this process is gradual and iterative.</p>
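<p>The gradual, iterative part can be sketched with a single tabular Q-learning update. This is a toy sketch: the state and action names, the rewards, and the learning rate below are invented for illustration.</p>

```python
# Tabular Q-learning update: nudge the value estimate for a
# (state, action) pair a small step toward the observed return.
def q_update(q, state, action, reward, next_q_max, alpha=0.1, gamma=0.9):
    old = q.get((state, action), 0.0)
    target = reward + gamma * next_q_max  # observed reward + discounted future
    q[(state, action)] = old + alpha * (target - old)

q = {}
# Small, repeated rewards refine the policy gradually: +1, +5, +2 ...
for r in [1, 5, 2]:
    q_update(q, "crossroad", "turn_left", r, next_q_max=0.0)
```

<p>Each update moves the estimate only a fraction (<code>alpha</code>) of the way toward the target, which is exactly why normal learning is gradual.</p>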
<h3 id="heading-the-catastrophe"><strong>The Catastrophe</strong></h3>
<p>Now, let's return to our RL agent. It’s happily exploring its environment, collecting small rewards: +1, +5, +2. Its policy is getting better and better.</p>
<p>Then, it wanders into an unknown territory and takes an action. The environment's response isn't a small penalty. It's a catastrophic, system-shocking <strong>-1,000,000</strong>.</p>
<p>From a technical standpoint, the value assigned to that state-action pair plummets. The agent's algorithm, designed to maximize reward, now sees any path leading to that state as unimaginably bad. The policy updates instantly and brutally: "Whatever you do, <em>never go there again</em>."</p>
<p>This is precisely what happened in my brain.</p>
<ul>
<li><p><strong>State:</strong> "Eating solid food."</p>
</li>
<li><p><strong>Action:</strong> "Swallowing."</p>
</li>
<li><p><strong>Penalty:</strong> The choking experience, a neurological -1,000,000.</p>
</li>
</ul>
<p>My internal policy was updated in a flash. The value of that action became catastrophic. My brain’s simple new rule was: <em>Avoid this state at all costs. It is not worth the risk.</em></p>
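<p>The asymmetry is easy to demonstrate with toy numbers (the -1,000,000 figure and the update rule below are illustrative, not a model of real neurology):</p>

```python
# One catastrophic penalty can swamp a lifetime of small positive rewards.
def q_update(q, key, reward, alpha=0.1):
    old = q.get(key, 0.0)
    q[key] = old + alpha * (reward - old)

q = {}
for _ in range(1000):                          # years of uneventful meals
    q_update(q, ("solid_food", "swallow"), reward=+1)
q_update(q, ("solid_food", "swallow"), reward=-1_000_000)  # one choking incident

q[("solid_food", "avoid")] = 0.0               # doing nothing scores zero
best = max(["swallow", "avoid"], key=lambda a: q[("solid_food", a)])
```

<p>After a thousand +1 experiences the estimate sits near 1.0; a single -1,000,000 drags it below -99,000, and the greedy policy flips to "avoid".</p>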
<p>A reinforcement learning agent reacts to a huge penalty much the way humans react to real-life trauma. Some trauma responses are key to survival, and most were shaped by evolution: humans were built to panic in the wild, for example when encountering lions or other predators in the forest, but that part of our brain wasn’t optimized for the modern human experience.</p>
<h3 id="heading-the-flawed-policy-and-the-dragon-of-chaos"><strong>The Flawed Policy and The Dragon of Chaos</strong></h3>
<p>This is where a core AI challenge, the <strong>Exploration-Exploitation Dilemma</strong>, collides with human psychology. An agent must balance <em>exploiting</em> known good strategies with <em>exploring</em> new ones to find even better rewards.</p>
<p>After a catastrophic penalty, this dilemma is shattered. The agent stops exploring. It retreats into a tiny, "safe" corner of its world, only performing actions it <em>knows</em> won't lead to disaster. It has sacrificed growth and opportunity for the illusion of total safety.</p>
<p>This is where the ideas of psychologist Jordan Peterson become incredibly relevant. Peterson often frames the world as a duality of <strong>Order</strong> and <strong>Chaos</strong>.</p>
<ul>
<li><p><strong>Order</strong> is the realm of the known, the predictable, the safe. It's your home, your routine, your settled knowledge.</p>
</li>
<li><p><strong>Chaos</strong> is the unknown, the unexpected. It is the place of both terrifying dragons and undiscovered treasure.</p>
</li>
</ul>
<p>My normal life of eating was Order. The choking incident was a violent, sudden immersion into Chaos. My response, my agent's response, was to retreat and drastically shrink the walls of my known, safe Order. Solid food, a previously mundane part of Order, was now re-categorized as Chaos. It was a territory on my internal map suddenly marked, "Here be dragons."</p>
<p>But here's the flaw in the policy, for both me and the AI: you can’t just skip eating. The AI agent, by walling off a huge part of its environment, might be dooming itself to a sub-optimal existence, missing out on the vast rewards that lie just beyond that one terrifying spot.</p>
<p>A well-trained reinforcement learning agent, like a well-lived human life, balances a good amount of order with a few bits of chaos. It is in order that we find peace in the world, and in facing chaos that we learn to adapt to an ever-changing one. A reinforcement learning agent, like a human, would paradoxically be unsafe in a perfectly ordered environment, because it would never be prepared for chaos.</p>
<h3 id="heading-recalibration"><strong>Recalibration</strong></h3>
<p>So, how do you fix a policy that has been broken by a single, traumatic data point? You can't just delete the memory. The agent, and the human, needs new countervailing data.</p>
<p>Peterson's prescription for this is not to ignore Chaos, but to <strong>confront it voluntarily</strong>. You don't wait for the dragon to find you again. You approach its lair on your own terms, in small, manageable steps.</p>
<p>In psychology, this is the foundation of <strong>exposure therapy</strong>. For me, it meant I couldn't go back to eating a steak dinner. But I could start with something soft. I could eat a piece of well-chewed bread. I was voluntarily taking a small step back into the "dangerous" territory. I was telling my internal agent, "See? We took an action in this state-space, and the penalty was 0, not -1,000,000."</p>
<p>Each successful, non-choking bite was a small, positive reward (+1) that began to slowly, painstakingly, update my flawed policy.</p>
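<p>In the same toy terms, exposure therapy is just countervailing data: many small, safe outcomes slowly pulling a catastrophic estimate back up (the numbers and learning rate are invented for illustration):</p>

```python
# Exposure therapy as data: repeated safe trials pull a catastrophic
# value estimate back toward the true (mildly positive) value.
def q_update(value, reward, alpha=0.1):
    return value + alpha * (reward - value)

value = -1_000_000.0            # estimate right after the choking incident
history = [value]
for _ in range(200):            # two hundred safe, well-chewed bites (+1 each)
    value = q_update(value, reward=+1)
    history.append(value)
```

<p>Note the shape of the recovery: it is geometric, so the early exposures close the largest share of the gap, but only if you keep showing up.</p>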
<p>We can apply this same logic to building more resilient AI:</p>
<ol>
<li><p><strong>Curriculum Learning:</strong> Don't throw the agent into the most chaotic environment at once. Start it in a simple, safe version and gradually increase the complexity, the AI equivalent of starting with soft foods.</p>
</li>
<li><p><strong>Reward Shaping:</strong> Can we design systems that give small rewards for "bravery", for cautiously re-exploring a territory with a known high penalty? This encourages the agent not to write it off forever.</p>
</li>
<li><p><strong>Decaying Memory:</strong> Perhaps the memory of a massive penalty shouldn't be permanent. It could slowly decay over time if not reinforced, allowing the agent to become cautiously curious once more.</p>
</li>
</ol>
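<p>The third idea is the easiest to sketch: if a stored penalty is never reinforced, let it shrink a little each step (the decay rate here is an arbitrary illustrative choice):</p>

```python
# "Decaying memory": an unreinforced penalty estimate drifts back
# toward neutral, restoring cautious curiosity over time.
def decay_toward_zero(value, rate=0.01):
    return value * (1 - rate)       # shrink the stored value by 1% per step

value = -1_000_000.0
for _ in range(1000):               # many steps with no new bad outcome
    value = decay_toward_zero(value)
```

<p>After a thousand quiet steps the remembered catastrophe has shrunk from -1,000,000 to roughly -43: still a warning, no longer a wall.</p>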
<p>At first, I ate only soft foods, starting with oatmeal and yogurt. I eventually worked up to sandwiches, and then finally to meat with rice. It was an experience I never expected to go through. I monitored my progress every step of the way and gave myself a pat on the back whenever I faced a fear I had been very hesitant to confront.</p>
<h3 id="heading-conclusion-building-agents-with-digital-courage"><strong>Conclusion: Building Agents with Digital Courage</strong></h3>
<p>My experience taught me that humans and our most advanced learning algorithms share a fundamental vulnerability: we are profoundly shaped by our worst moments. A single, catastrophic failure can create a brittle, over-cautious policy that prioritizes avoiding pain over seeking growth.</p>
<p>The path to recovery and optimal performance, for both man and machine, isn't about erasing that bad memory. It’s about courageously and methodically gathering new data to prove that the catastrophe was an outlier, not the rule.</p>
<p>Perhaps the next frontier in AI isn't just about bigger models or faster processing. It’s about instilling the digital equivalent of courage, the ability to face the remembered dragon, learn from failure, and refuse to let a single scar define the entire map of one's world.</p>
<p>At some point, technical progress in AI (reinforcement learning included) may come down to how well we imitate lessons and patterns from psychology. Learning comes from a good amount of order to stand on, and a small amount of chaos to learn from.</p>
]]></content:encoded></item><item><title><![CDATA[What is AI?]]></title><description><![CDATA[I find it a little funny that I'm only getting to this article now. As an applied mathematician, I have a deep-seated need for rigor, for building arguments from first principles. In a field as dynamic and hype-driven as Artificial Intelligence, a ri...]]></description><link>https://hddatascience.tech/what-is-ai</link><guid isPermaLink="true">https://hddatascience.tech/what-is-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sun, 02 Nov 2025 08:06:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762069768389/92a83401-8bca-4851-b413-28b03b47df13.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I find it a little funny that I'm only getting to this article now. As an applied mathematician, I have a deep-seated need for rigor, for building arguments from first principles. In a field as dynamic and hype-driven as Artificial Intelligence, a rigorous definition can feel elusive. This article is my attempt to provide one. Not in the form of a dense mathematical proof, but through a structured framework that is both intuitive and fundamentally sound: viewing the core learning mechanisms of AI through the lens of child development.</p>
<p>My own "AI genesis" story wasn't seeing a robot, but coding a neural network from scratch and realizing the 'magic' was just calculus and matrix transformations. That revelation is the core of this post: AI is not a magic box. It is a set of mathematical tools. And to truly understand them, we must first understand how they "grow up."</p>
<h3 id="heading-learning-through-guidance-and-imitation-ml-amp-dl">Learning Through Guidance and Imitation: ML &amp; DL</h3>
<p>A child's first and most essential way of learning is by observing the ordered world their parents create for them. This is the domain of Machine Learning (ML) and its powerful subfield, Deep Learning (DL).</p>
<h4 id="heading-machine-learning-as-cultural-conditioning">Machine Learning as Cultural Conditioning</h4>
<p>Think of how a child learns the specific, non-negotiable rules of a Filipino household. They are taught to say "po" and "opo" to elders. They learn to take off their shoes or slippers the moment they step inside. This isn't learned through abstract reasoning; it's learned through direct instruction and imitation.</p>
<p>This is a perfect parallel for supervised Machine Learning. The model is given a massive dataset of specific inputs (an elder speaks to you) and the correct, labeled outputs ("opo"). It learns the function to map one to the other, perfectly mimicking the "correct" behavior it was shown.</p>
<h4 id="heading-deep-learning-as-internalizing-values">Deep Learning as Internalizing Values</h4>
<p>A child doesn't just parrot rules forever. Eventually, they move beyond mimicry and grasp the underlying <em>concept</em> of respect. They begin to apply it in novel situations, showing deference to other figures of authority even if they were never explicitly told to.</p>
<p>This is Deep Learning. The neural network's layers allow it to learn not just the surface-level pattern, but the deeper, abstract principles behind the data. It builds an internal model of "respect," allowing for a more flexible and intuitive application of the learned rules.</p>
<h3 id="heading-learning-through-consequence-reinforcement-learning-rl">Learning Through Consequence: Reinforcement Learning (RL)</h3>
<p>But not everything can be learned from a guiding hand. A child must eventually face the world on their own and learn from its direct, unfiltered feedback. This is the world of Reinforcement Learning (RL), and it is a process of conquering chaos.</p>
<h4 id="heading-reinforcement-learning-as-learning-to-walk">Reinforcement Learning as Learning to Walk</h4>
<p>The best analogy for RL is a toddler learning to walk. There is no instruction manual. No parent can perfectly explain the infinite micro-adjustments of balance and muscle control. The child must learn through brutal trial and error.</p>
<ul>
<li><p>The <strong>agent</strong> is the toddler.</p>
</li>
<li><p>The <strong>environment</strong> is the physical world, governed by the unforgiving laws of gravity.</p>
</li>
<li><p>The <strong>action</strong> is attempting to take a step.</p>
</li>
<li><p>The <strong>penalty</strong> is the immediate, painful feedback of falling.</p>
</li>
<li><p>The <strong>reward</strong> is the exhilarating success of staying upright and moving forward. (and probably the applause of your parents)</p>
</li>
</ul>
<p>The toddler is not trying to imitate a perfect "walk" from a dataset. They are developing their <em>own</em> strategy to maximize reward and minimize punishment, building a robust understanding directly from the consequences of their actions. This is how RL agents master complex games and robotic controls—by bravely confronting the chaos of their environment and structuring it through experience.</p>
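<p>A toy loop captures the analogy. The <code>skill</code> variable and its dynamics are invented for illustration; this is not a real RL algorithm, just the trial-and-error shape of one.</p>

```python
import random

# Trial and error: every fall (penalty) slightly improves balance,
# so staying upright (reward) becomes more likely over time.
random.seed(42)
skill = 0.1                              # probability of staying upright
falls = 0
for attempt in range(500):
    upright = random.random() < skill    # the outcome of taking a step
    if not upright:
        falls += 1                       # the painful feedback of falling
        skill = min(1.0, skill + 0.005)  # each fall teaches a little balance
```

<p>No dataset of "perfect walks" is involved; the improvement comes entirely from consequences.</p>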
<h3 id="heading-conclusion-think-like-a-mathematician-not-a-movie-director">Conclusion: Think Like a Mathematician, Not a Movie Director</h3>
<p>So when we ask, "What is AI?", the answer is multifaceted. It’s the carefully guided student, learning the cultural rules of its environment (ML/DL). And it’s the determined toddler, courageously facing gravity to learn to stand on its own two feet (RL).</p>
<p>The next time you interact with an AI, I challenge you to see past the code and think of its upbringing. Was it taught by the book, or did it learn from the school of hard knocks? Understanding its developmental journey demystifies its capabilities. Because beneath it all, whether it's a child internalizing respect or a toddler learning to walk, the engine is the same: a mathematical function, optimizing for a goal, and turning the unknown chaos of the world into the ordered structure of knowledge.</p>
]]></content:encoded></item><item><title><![CDATA[From "It Works" to "Why It Works": A Call for Deeper Understanding in Data Science]]></title><description><![CDATA[Sometimes, the most valuable lessons come from unexpected moments. I was attending a data science workshop recently, and a brief discussion served as a powerful reminder of a crucial question we must ask ourselves: are we content with knowing that so...]]></description><link>https://hddatascience.tech/from-it-works-to-why-it-works-a-call-for-deeper-understanding-in-data-science</link><guid isPermaLink="true">https://hddatascience.tech/from-it-works-to-why-it-works-a-call-for-deeper-understanding-in-data-science</guid><category><![CDATA[ConvolutionalNeuralNetworks]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sun, 05 Oct 2025 06:02:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759643496786/e4be8da5-25e5-4e7b-8381-e6e4e3799bab.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sometimes, the most valuable lessons come from unexpected moments. I was attending a data science workshop recently, and a brief discussion served as a powerful reminder of a crucial question we must ask ourselves: are we content with knowing <em>that</em> something works, or do we strive to understand <em>why</em> it works? It's the difference between being a technician and an engineer, and it is crucial for building robust and reliable solutions.</p>
<p>This question feels more relevant than ever. It's never been easier to get amazing results in data science. We can build powerful models that were cutting-edge just a few years ago with only a few lines of code. But this ease of use brings a hidden risk. We're often tempted to treat these powerful tools like "black boxes," focusing only on the final accuracy score without really knowing what’s happening inside.</p>
<h2 id="heading-dont-skip-the-why-the-soul-of-the-cnn">Don't Skip the "Why": The Soul of the CNN</h2>
<p>Let's use a classic example, the Convolutional Neural Network (CNN). Too often, tutorials and talks jump straight into the architecture, talking about layers, filters, and code, but they skip the most important question of all: why do we even use them?</p>
<p>The reason we use CNNs for images instead of a standard neural network comes down to a couple of brilliant ideas:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759643572054/9064720c-c2a2-4d2d-9838-bcb0bbe3ce39.webp" alt class="image--center mx-auto" /></p>
<ol>
<li><p><strong>Translation Invariance:</strong> A picture of a cat is still a picture of a cat, whether the cat is in the top left or the bottom right. A basic neural network would struggle with this, needing to learn what a "top-left cat" and a "bottom-right cat" are separately. This is incredibly inefficient. CNNs solve this by using sliding filters that spot features no matter where they are in the image.</p>
</li>
<li><p><strong>Parameter Efficiency:</strong> By using these sliding filters, a CNN reuses the same weights across the entire image. This drastically cuts down on the number of parameters the model has to learn, which means it trains faster and is less likely to overfit.</p>
</li>
</ol>
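<p>The parameter-efficiency point is easy to check with back-of-the-envelope arithmetic. The layer sizes below are arbitrary but typical, and only weights are counted (biases ignored):</p>

```python
# Weight counts for one layer on a 32x32 RGB image.
h, w, c = 32, 32, 3

# Dense: every one of the 3,072 inputs connects to each of 256 hidden units.
dense_params = (h * w * c) * 256      # 786,432 weights

# Conv: 64 filters of size 3x3x3, reused at every position in the image.
conv_params = 64 * (3 * 3 * c)        # 1,728 weights

ratio = dense_params // conv_params   # the dense layer is ~455x larger
```

<p>Weight reuse is the whole trick: the same 27 numbers per filter scan the entire image instead of being relearned at every position.</p>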
<p>Understanding this "why" isn't just for textbooks. It’s the very soul of the architecture. It helps you make better design choices and explain your work with real confidence.</p>
<h2 id="heading-the-anatomy-of-a-convolution-more-than-just-guesswork">The Anatomy of a Convolution: More Than Just Guesswork</h2>
<p>This need for understanding goes all the way down to the basic building blocks. When we set up a convolutional layer, we have to pick its kernel size, padding, and stride. These are not just random numbers to guess. They are key design decisions that have a huge impact on what your model learns.</p>
<p>Let's quickly break them down:</p>
<ul>
<li><p><strong>Kernel Size:</strong> Think of the kernel as the network's magnifying glass.</p>
<ul>
<li><p><strong>A small kernel (like 3x3)</strong> is great for spotting fine details like sharp edges and textures. Most modern models use these to build up a complex picture from small pieces.</p>
</li>
<li><p><strong>A large kernel (like 7x7)</strong> sees bigger patterns at once, like the general shape of an object. It’s less common now but can be useful for capturing broader strokes.</p>
</li>
</ul>
</li>
<li><p><strong>Padding:</strong> This means adding a border of pixels around the image.</p>
<ul>
<li><p><strong>Without padding,</strong> the image gets smaller with every layer, and information at the edges can get lost.</p>
</li>
<li><p><strong>With padding,</strong> you can keep the image size the same. This lets you build deeper networks and makes sure the features at the borders are treated fairly.</p>
</li>
</ul>
</li>
<li><p><strong>Stride:</strong> This is the step size the kernel takes as it moves across the image.</p>
<ul>
<li><p><strong>A stride of 1</strong> is very thorough, moving one pixel at a time. It captures the most information but is computationally slower.</p>
</li>
<li><p><strong>A stride of 2 or more</strong> makes the kernel jump, shrinking the output size quickly. It’s a fast way to down-sample and helps the network see the bigger picture, but you lose some fine-grained detail.</p>
</li>
</ul>
</li>
</ul>
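<p>These three choices combine in one standard formula for the output size of a convolutional layer, which is worth internalizing:</p>

```python
# Output spatial size of a conv layer: floor((n + 2p - k) / s) + 1.
def conv_out(size, kernel, padding=0, stride=1):
    return (size + 2 * padding - kernel) // stride + 1

same   = conv_out(32, kernel=3, padding=1, stride=1)  # 32: "same" padding
shrunk = conv_out(32, kernel=3, padding=0, stride=1)  # 30: edges eaten away
halved = conv_out(32, kernel=3, padding=1, stride=2)  # 16: stride 2 down-samples
```

<p>Reading your chosen kernel, padding, and stride through this formula is the quickest sanity check that your layers actually fit together.</p>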
<p>Choosing these values is an act of engineering, not a lucky guess. You are actively deciding how your model sees the world.</p>
<h2 id="heading-the-case-for-building-from-the-ground-up">The Case for Building from the Ground Up</h2>
<p>So, how do we get this deeper knowledge? We can do this by fighting the urge to always use the fanciest, most automated tools first. This is why I'm a huge believer in trying to build models from a more fundamental level.</p>
<p>A framework like <strong>PyTorch</strong> is perfect for this. While it handles the heavy-lifting of calculus for you, it doesn’t hide everything. You still have to define your network layer by layer and write the training loop yourself, which includes the forward pass, calculating the loss, the backward pass, and updating the model.</p>
<p>Going through this process connects you directly to the mechanics. You see how the data changes shape as it flows through the network. You finally understand why certain steps are necessary. Your model stops being a magic box and becomes a logical system you created.</p>
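<p>Those four steps are worth seeing stripped of any framework at all. Here is the same loop on a one-parameter linear model in plain Python: a conceptual sketch with a made-up dataset and learning rate, not PyTorch code.</p>

```python
# Forward pass -> loss -> backward pass -> update, by hand.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples from y = 2x
w = 0.0                                       # the model's single weight
lr = 0.05                                     # learning rate

for epoch in range(200):
    for x, y in data:
        y_hat = w * x                 # forward pass
        loss = (y_hat - y) ** 2       # squared-error loss
        grad = 2 * (y_hat - y) * x    # backward pass: d(loss)/d(w)
        w -= lr * grad                # update step
```

<p>The weight converges to 2.0. PyTorch automates the <code>grad</code> line for you (autograd) but still makes you write the other three steps yourself, which is exactly the pedagogical point.</p>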
<h2 id="heading-conclusion">Conclusion</h2>
<p>At the end of the day, our job is to solve problems with tools that are reliable and that we can explain. That kind of work isn’t built on trial and error. It’s built on rigor, intention, and a real curiosity to learn.</p>
<p>So, the next time you start a project, I encourage you to ask "why." Why this model? Why this setting? The best models, and the best data scientists, are made when we step away from the easy abstractions and get our hands dirty with the fundamentals.</p>
]]></content:encoded></item><item><title><![CDATA[Building Intuition for Convolutional Neural Networks]]></title><description><![CDATA[The motivation behind Convolutional Neural Networks (CNNs) comes from the inability of traditional dense neural networks to perform well on image classification tasks. Why is that? A dense network, also known as a fully-connected network, treats an i...]]></description><link>https://hddatascience.tech/building-intuition-for-convolutional-neural-networks</link><guid isPermaLink="true">https://hddatascience.tech/building-intuition-for-convolutional-neural-networks</guid><category><![CDATA[CNNs (Convolutional Neural Networks)]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sun, 24 Aug 2025 03:54:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755998342928/d9ec60de-76f5-4a8f-bc65-1503f764f453.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The motivation behind Convolutional Neural Networks (CNNs) comes from the inability of traditional dense neural networks to perform well on image classification tasks. Why is that? A dense network, also known as a fully-connected network, treats an image as a flat vector of pixels. If you flatten a 32x32 pixel image, you get a 1024-dimensional vector. This process discards all spatial information. The network has no inherent understanding that a pixel is "next to" another. This makes it difficult to learn concepts like edges, textures, or shapes, and it completely fails to grasp <strong>translation invariance</strong>, the idea that a cat is still a cat whether it's in the top-left or bottom-right corner of the image.</p>
<p>This is where CNNs shine. They are specifically designed to process pixel data by creating better feature maps out of raw images. Instead of flattening the input, they use small filters (kernels) that slide across the image, recognizing patterns like edges, corners, and textures. These initial patterns are then combined in deeper layers to form more complex features like eyes, wheels, or wings.</p>
<p>In this post, we'll build a CNN from scratch using PyTorch to understand its core components. We'll train it on the popular CIFAR-10 dataset and see how it learns to classify images into one of ten categories.</p>
<p>Let's break down the process step-by-step.</p>
<h3 id="heading-phase-1-importing-dependencies">Phase 1: Importing Dependencies</h3>
<p>First, we import all the necessary libraries. We'll be using torch and its nn module for building the network, torchvision for the dataset and image transformations, and PIL for handling our own custom images later.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> torch <span class="hljs-keyword">import</span> nn

<span class="hljs-keyword">import</span> torch.nn.functional <span class="hljs-keyword">as</span> F
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader

<span class="hljs-keyword">import</span> torchvision
<span class="hljs-keyword">from</span> torchvision <span class="hljs-keyword">import</span> datasets, transforms
</code></pre>
<h3 id="heading-phase-2-data-transformation-and-loading">Phase 2: Data Transformation and Loading</h3>
<p>Before we can feed images to our network, we need to preprocess them. This is done using torchvision.transforms.</p>
<ul>
<li><p>transforms.ToTensor(): This converts the image from a PIL Image format (with pixel values from 0-255) to a PyTorch tensor (with values from 0.0 to 1.0).</p>
</li>
<li><p>transforms.Normalize(): This standardizes the pixel values. The arguments (0.5, 0.5, 0.5) are the mean and standard deviation for each of the three (R, G, B) channels. This normalization helps the network train faster and more stably by centering the data around zero.</p>
</li>
</ul>
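<p>It's worth spelling out what these particular arguments do: Normalize computes (x - mean) / std per channel, so a mean of 0.5 and a standard deviation of 0.5 map pixel values from [0, 1] to [-1, 1]. A quick NumPy check (my own illustration, separate from the pipeline below):</p>
<pre><code class="lang-python">import numpy as np

# Normalize applies (x - mean) / std to every pixel in a channel
pixels = np.array([0.0, 0.5, 1.0])
normalized = (pixels - 0.5) / 0.5
print(normalized)  # [-1.  0.  1.]
</code></pre>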
<pre><code class="lang-python">transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>), (<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>))
])
</code></pre>
<p>With our transformation pipeline ready, we can load the CIFAR-10 dataset. We also wrap our datasets in a DataLoader, which is a handy utility that provides batches of data, shuffles it for each epoch, and can even use multiple workers to load data in parallel.</p>
<pre><code class="lang-python">train_data = torchvision.datasets.CIFAR10(root=<span class="hljs-string">'./data'</span>, train=<span class="hljs-literal">True</span>, transform=transform, download=<span class="hljs-literal">True</span>)
test_data = torchvision.datasets.CIFAR10(root=<span class="hljs-string">'./data'</span>, train=<span class="hljs-literal">False</span>, transform=transform, download=<span class="hljs-literal">True</span>)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=<span class="hljs-number">32</span>, shuffle=<span class="hljs-literal">True</span>, num_workers=<span class="hljs-number">2</span>)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=<span class="hljs-number">32</span>, shuffle=<span class="hljs-literal">True</span>, num_workers=<span class="hljs-number">2</span>)

class_names = [<span class="hljs-string">'plane'</span>, <span class="hljs-string">'car'</span>, <span class="hljs-string">'bird'</span>, <span class="hljs-string">'cat'</span>, <span class="hljs-string">'deer'</span>, <span class="hljs-string">'dog'</span>, <span class="hljs-string">'frog'</span>, <span class="hljs-string">'horse'</span>, <span class="hljs-string">'ship'</span>, <span class="hljs-string">'truck'</span>]
</code></pre>
<p>The CIFAR-10 images are 3-channel (RGB) images of size 32x32 pixels. Let's confirm this:</p>
<pre><code class="lang-python">image, label = train_data[<span class="hljs-number">0</span>]
print(image.size())
<span class="hljs-comment"># Output: torch.Size([3, 32, 32])</span>
</code></pre>
<h3 id="heading-phase-3-defining-the-neural-network-architecture">Phase 3: Defining the Neural Network Architecture</h3>
<p>This is the core of our project. We'll define a class NeuralNet that inherits from nn.Module.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">NeuralNet</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super().__init__()
        <span class="hljs-comment"># Input: (3, 32, 32)</span>
        self.conv1 = nn.Conv2d(<span class="hljs-number">3</span>, <span class="hljs-number">16</span>, <span class="hljs-number">5</span>, padding=<span class="hljs-number">2</span>) <span class="hljs-comment"># 32x32 -&gt; 32x32</span>
        self.pool1 = nn.MaxPool2d(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)            <span class="hljs-comment"># 32x32 -&gt; 16x16</span>
        <span class="hljs-comment"># Shape: (16, 16, 16)</span>

        self.conv2 = nn.Conv2d(<span class="hljs-number">16</span>, <span class="hljs-number">32</span>, <span class="hljs-number">3</span>, padding=<span class="hljs-number">1</span>) <span class="hljs-comment"># 16x16 -&gt; 16x16</span>
        self.pool2 = nn.MaxPool2d(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)             <span class="hljs-comment"># 16x16 -&gt; 8x8</span>
        <span class="hljs-comment"># Shape: (32, 8, 8)</span>

        self.conv3 = nn.Conv2d(<span class="hljs-number">32</span>, <span class="hljs-number">64</span>, <span class="hljs-number">3</span>, padding=<span class="hljs-number">1</span>) <span class="hljs-comment"># 8x8 -&gt; 8x8</span>
        self.pool3 = nn.MaxPool2d(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)             <span class="hljs-comment"># 8x8 -&gt; 4x4</span>
        <span class="hljs-comment"># Shape: (64, 4, 4)</span>

        <span class="hljs-comment"># IMPORTANT: Calculate the new flattened size</span>
        self.fc1 = nn.Linear(<span class="hljs-number">64</span> * <span class="hljs-number">4</span> * <span class="hljs-number">4</span>, <span class="hljs-number">256</span>)
        self.fc2 = nn.Linear(<span class="hljs-number">256</span>, <span class="hljs-number">128</span>)
        self.fc3 = nn.Linear(<span class="hljs-number">128</span>, <span class="hljs-number">10</span>)
</code></pre>
<p>Let's break down how the shape of our data changes as it flows through the network:</p>
<h4 id="heading-convolutional-layers-nnconv2d">Convolutional Layers (nn.Conv2d)</h4>
<p>The shape of the output from a convolutional layer depends on the input size, kernel size, stride, and padding. The formula is:<br />Output_Size = (Input_Size - Kernel_Size + 2 * Padding) / Stride + 1</p>
<ul>
<li><p>self.conv1 = nn.Conv2d(3, 16, 5, padding=2)</p>
<ul>
<li><p>in_channels=3: We start with a 3-channel (RGB) image.</p>
</li>
<li><p>out_channels=16: The layer will produce 16 feature maps.</p>
</li>
<li><p>kernel_size=5: The filter is a 5x5 matrix.</p>
</li>
<li><p>padding=2: We add a 2-pixel border around the image.</p>
</li>
<li><p><strong>Shape Change</strong>: (32 - 5 + 2*2) / 1 + 1 = 32. With this padding, the height and width are preserved. Our shape goes from (3, 32, 32) to (16, 32, 32).</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-pooling-layers-nnmaxpool2d">Pooling Layers (nn.MaxPool2d)</h4>
<p>Pooling layers are used to down-sample the feature maps. This reduces the computational load and makes the detected features more robust to their exact location in the image.</p>
<ul>
<li><p>self.pool1 = nn.MaxPool2d(2, 2)</p>
<ul>
<li><p>This takes a 2x2 window and keeps only the maximum value, effectively halving the height and width.</p>
</li>
<li><p><strong>Shape Change</strong>: The input (16, 32, 32) becomes (16, 16, 16).</p>
</li>
</ul>
</li>
</ul>
<p>We repeat this pattern. After conv3 and pool3, our final feature map has a shape of (64, 4, 4).</p>
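<p>The whole shape pipeline can be checked with a few lines of plain Python using the output-size formula (a sketch; the helper names are mine):</p>
<pre><code class="lang-python">def conv_out(size, kernel, stride=1, padding=0):
    # (Input - Kernel + 2 * Padding) / Stride + 1
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, window=2):
    # A 2x2 max pool with stride 2 halves the spatial size
    return size // window

size = 32
size = pool_out(conv_out(size, 5, padding=2))  # conv1 + pool1: 16
size = pool_out(conv_out(size, 3, padding=1))  # conv2 + pool2: 8
size = pool_out(conv_out(size, 3, padding=1))  # conv3 + pool3: 4
print(size)              # 4
print(64 * size * size)  # 1024, the flattened input size for fc1
</code></pre>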
<h4 id="heading-the-flattening-and-fully-connected-layers-nnlinear">The Flattening and Fully-Connected Layers (nn.Linear)</h4>
<p>The convolutional layers have done their job of extracting spatial features. Now, we need to feed these features into a standard dense network to perform the final classification. To do this, we must "flatten" our 3D feature map (64, 4, 4) into a 1D vector.</p>
<p>The size of this vector is channels × height × width, which is 64 × 4 × 4 = 1024.</p>
<p>This is why our first fully-connected layer, fc1, is defined as nn.Linear(64 * 4 * 4, 256). It takes the 1024 features from our flattened map and transforms them into 256 features. The final layer, fc3, outputs 10 values, one for each class in CIFAR-10.</p>
<p>To understand how fully connected layers work, <a target="_blank" href="https://hddatascience.tech/building-a-neural-network-from-scratch-in-python-and-numpy">here’s a link explaining the math and intuition behind it as we do it from scratch.</a></p>
<h3 id="heading-phase-4-defining-the-forward-propagation">Phase 4: Defining the Forward Propagation</h3>
<p>The forward method defines the actual path our data takes through the layers. We apply a ReLU activation function after each convolution and after the first two fully-connected layers to introduce non-linearity, which is crucial for learning complex patterns.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))

        x = torch.flatten(x, <span class="hljs-number">1</span>) <span class="hljs-comment"># Flatten all dimensions except batch</span>

        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        <span class="hljs-keyword">return</span> x
</code></pre>
<h3 id="heading-phase-5-optimizer-and-loss-function">Phase 5: Optimizer and Loss Function</h3>
<p>To train the network, we need two things:</p>
<ol>
<li><p><strong>Loss Function</strong>: Measures how wrong the model's predictions are.</p>
</li>
<li><p><strong>Optimizer</strong>: Updates the model's weights to reduce the loss.</p>
</li>
</ol>
<pre><code class="lang-python">net = NeuralNet()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=<span class="hljs-number">0.001</span>, momentum=<span class="hljs-number">0.9</span>)
</code></pre>
<h4 id="heading-why-crossentropyloss">Why CrossEntropyLoss?</h4>
<p>nn.CrossEntropyLoss is the standard choice for multi-class classification problems like this one. It's particularly effective because it combines two operations: LogSoftmax and NLLLoss (Negative Log Likelihood Loss). Internally, it takes the raw output scores (logits) from our final layer, converts them into probabilities using a softmax function, and then calculates the loss. It heavily penalizes the model for being confident in the wrong prediction, which makes it a very effective teacher during training.</p>
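<p>That combination is easy to sketch by hand. The following few lines of NumPy mimic what CrossEntropyLoss does for a single example (an illustration of the idea, not PyTorch's actual implementation):</p>
<pre><code class="lang-python">import numpy as np

def cross_entropy(logits, target):
    # Softmax: exponentiate (shifted by the max for numerical stability)
    # and divide by the sum, turning raw scores into probabilities
    exps = np.exp(logits - np.max(logits))
    probs = exps / np.sum(exps)
    # The loss is the negative log-probability of the true class
    return -np.log(probs[target])

logits = np.array([2.0, 0.5, 0.1])
print(cross_entropy(logits, 0))  # small loss: the model already favors class 0
print(cross_entropy(logits, 2))  # large loss: confidently wrong
</code></pre>
<p>Notice how the loss blows up when a high score goes to the wrong class; that steep penalty is what pushes the weights to change quickly early in training.</p>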
<h3 id="heading-phase-6-the-training-loop">Phase 6: The Training Loop</h3>
<p>Here, we iterate through our training data for a set number of epochs. In each step, we perform the standard training routine:</p>
<ol>
<li><p>Get a batch of inputs and labels.</p>
</li>
<li><p>Clear previous gradients with optimizer.zero_grad().</p>
</li>
<li><p>Make a prediction (outputs = net(inputs)).</p>
</li>
<li><p>Calculate the loss.</p>
</li>
<li><p>Perform backpropagation to calculate gradients (loss.backward()).</p>
</li>
<li><p>Update the network's weights (optimizer.step()).</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">30</span>):
    print(<span class="hljs-string">f"Training Epoch: <span class="hljs-subst">{epoch}</span>"</span>)
    running_loss = <span class="hljs-number">0.0</span>

    <span class="hljs-keyword">for</span> i, data <span class="hljs-keyword">in</span> enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)

        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{running_loss/len(train_loader):<span class="hljs-number">.4</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-phase-7-saving-the-model-and-evaluating-performance">Phase 7: Saving the Model and Evaluating Performance</h3>
<p>After training, we save the learned weights (the model's "state") to a file. Then, we load these weights into a fresh instance of our network and evaluate its performance on the test dataset, which it has never seen before.</p>
<p>We switch the network to evaluation mode with net.eval(). This is important as it disables certain layers like Dropout that behave differently during training and inference. We use torch.no_grad() to tell PyTorch not to calculate gradients, which saves memory and computation.</p>
<pre><code class="lang-python">torch.save(net.state_dict(), <span class="hljs-string">'trained_net.pth'</span>)

net = NeuralNet()
net.load_state_dict(torch.load(<span class="hljs-string">'trained_net.pth'</span>))

correct = <span class="hljs-number">0</span>
total = <span class="hljs-number">0</span>

net.eval()
<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> data <span class="hljs-keyword">in</span> test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, <span class="hljs-number">1</span>)
        total += labels.size(<span class="hljs-number">0</span>)
        correct += (predicted == labels).sum().item()

accuracy = <span class="hljs-number">100</span> * correct / total
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy}</span>%"</span>)
</code></pre>
<p>This will give us a final accuracy score, showing how well our CNN learned to generalize.</p>
<h3 id="heading-phase-8-testing-with-our-own-image">Phase 8: Testing with Our Own Image</h3>
<p>Finally, the fun part! Let's see how our trained model performs on a completely new image from the web. We create a simple function to load, resize, and transform an image to match the input format our network expects.</p>
<p>Note the image.unsqueeze(0) step. Our network was trained on <em>batches</em> of images. This adds a "batch dimension" of size 1, so the tensor shape becomes (1, 3, 32, 32), which is what the network expects.</p>
<pre><code class="lang-python">new_transform = transforms.Compose([
    transforms.Resize((<span class="hljs-number">32</span>, <span class="hljs-number">32</span>)),
    transforms.ToTensor(),
    transforms.Normalize((<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>), (<span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.5</span>))
])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_image</span>(<span class="hljs-params">image_path</span>):</span>
    image = Image.open(image_path)
    image = image.convert(<span class="hljs-string">'RGB'</span>)
    image = new_transform(image)
    image = image.unsqueeze(<span class="hljs-number">0</span>) <span class="hljs-comment"># Add batch dimension</span>
    <span class="hljs-keyword">return</span> image

<span class="hljs-comment"># Replace with the path to your image</span>
image_paths = [<span class="hljs-string">'path/to/your/image.png'</span>] 
images = [load_image(img) <span class="hljs-keyword">for</span> img <span class="hljs-keyword">in</span> image_paths]

net.eval()
<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> image <span class="hljs-keyword">in</span> images:
        output = net(image)
        _, predicted = torch.max(output, <span class="hljs-number">1</span>)
        print(<span class="hljs-string">f"Prediction: <span class="hljs-subst">{class_names[predicted.item()]}</span>"</span>)
</code></pre>
<h3 id="heading-conclusion">Conclusion</h3>
<p>We've successfully built, trained, and tested a Convolutional Neural Network. We saw how convolutional and pooling layers work together to extract meaningful features from raw pixels, and how these features are then used by a classifier to make a final prediction. This ability to learn spatial hierarchies of patterns is what makes CNNs the powerhouse behind modern computer vision. From here, you can experiment by changing the architecture, adding more layers, or trying different optimizers to see how it affects performance.</p>
<p>Python Code Link: <a target="_blank" href="https://github.com/HarvsDucs/hashnode_python_scripts/tree/main/Building%20Intuition%20for%20Convolutional%20Neural%20Networks">https://github.com/HarvsDucs/hashnode_python_scripts/tree/main/Building%20Intuition%20for%20Convolutional%20Neural%20Networks</a></p>
]]></content:encoded></item><item><title><![CDATA[Building a Neural Network from Scratch in Python and NumPy]]></title><description><![CDATA[Ever looked at a neural network formula and felt a disconnect from what it's actually doing? You're not alone. Frameworks like TensorFlow and PyTorch are powerful, but their high-level abstractions can hide the beautiful, intuitive mathematics that m...]]></description><link>https://hddatascience.tech/building-a-neural-network-from-scratch-in-python-and-numpy</link><guid isPermaLink="true">https://hddatascience.tech/building-a-neural-network-from-scratch-in-python-and-numpy</guid><category><![CDATA[neural networks]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Handwritten Digit Recognition]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 08 Aug 2025 09:23:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754644489741/32e7b7fc-826c-428b-a3c1-7f16bb7bea41.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever looked at a neural network formula and felt a disconnect from what it's <em>actually doing</em>? You're not alone. Frameworks like TensorFlow and PyTorch are powerful, but their high-level abstractions can hide the beautiful, intuitive mathematics that make learning possible.</p>
<p>Today, we're going to build a neural network from scratch using only Python and NumPy. But more importantly, at every step we'll stop and connect the code to the core mathematical ideas.</p>
<p>Our goal isn't just to make it work, it's to understand <em>how</em> a collection of matrix multiplications and derivatives can learn to recognize something as complex as a handwritten digit.</p>
<p>Let's translate the math into intuition.</p>
<hr />
<h2 id="heading-files-and-dependencies">Files and Dependencies</h2>
<p>Every learning process starts with information. For our network, that information is the MNIST dataset.</p>
<p>Our project relies on this main library:</p>
<ul>
<li>NumPy: This is our mathematical workhorse. It will perform the vector and matrix operations that form the very language of neural networks.</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre>
<p>The MNIST dataset contains 70,000 images (28x28 pixels) of handwritten digits. We load it from a NumPy-native .npz file.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Define the path to your dataset file</span>
file_path = <span class="hljs-string">'/kaggle/input/mnist-from-scratch/mnist.npz'</span>

data = np.load(file_path)
print(<span class="hljs-string">f"Arrays in the file: <span class="hljs-subst">{data.files}</span>"</span>)

<span class="hljs-comment"># Unpack the data into training and testing sets</span>
x_train, y_train = data[<span class="hljs-string">'x_train'</span>], data[<span class="hljs-string">'y_train'</span>]
x_test, y_test = data[<span class="hljs-string">'x_test'</span>], data[<span class="hljs-string">'y_test'</span>]
</code></pre>
<p>Our network doesn't see images; it sees numbers. To make these numbers easier to work with, we perform <strong>normalization</strong>.</p>
<p><strong>The Math:</strong> We take pixel values from [0, 255] and scale them to [0, 1].<br /><strong>The Intuition:</strong> This ensures that all our input features are on a similar scale. During training, the gradients (which tell us how to update our weights) are sensitive to the scale of the input. Normalization prevents gradients from becoming too large and unstable, leading to a smoother, more predictable learning process.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Normalize pixel values to the [0, 1] range</span>
x_train = x_train / <span class="hljs-number">255.0</span>
x_test = x_test / <span class="hljs-number">255.0</span>
</code></pre>
<hr />
<h2 id="heading-building-the-neural-network">Building the Neural Network</h2>
<p>We are building a brain with a specific structure: an input layer, two hidden layers, and an output layer. The "learning" happens in the connections between these layers. These connections are defined by <strong>weights</strong> and <strong>biases</strong>.</p>
<h3 id="heading-initializing-weights-and-biases">Initializing Weights and Biases</h3>
<p><strong>The Math:</strong> We create matrices of random numbers. The shape of each matrix is (neurons_in_layer, neurons_in_previous_layer).</p>
<p><strong>The Intuition:</strong></p>
<ul>
<li><p><strong>Weights (w):</strong> Think of a weight as the <strong>strength or importance of a connection</strong>. A large weight means the signal from a previous neuron has a strong influence. We initialize them randomly to <strong>break symmetry</strong>. If all weights started at zero, every neuron in a layer would learn the exact same thing, and our network would be no better than a single neuron. Randomness ensures they each start on a unique path to find different features.</p>
</li>
<li><p><strong>Biases (b):</strong> Think of a bias as an <strong>"activation threshold."</strong> It's a value that determines how easy it is for a neuron to "fire" (output a high value). A neuron with a large negative bias will require a very strong input signal to become active.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># --- Weights: The strength of connections ---</span>
<span class="hljs-comment"># w_i_h1 connects 784 input neurons to 64 hidden layer 1 neurons</span>
w_i_h1 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">64</span>, <span class="hljs-number">784</span>))
<span class="hljs-comment"># w_h1_h2 connects 64 L1 neurons to 32 L2 neurons</span>
w_h1_h2 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">32</span>, <span class="hljs-number">64</span>))
<span class="hljs-comment"># w_h2_o connects 32 L2 neurons to 10 output neurons</span>
w_h2_o = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">10</span>, <span class="hljs-number">32</span>))

<span class="hljs-comment"># --- Biases: The neuron's activation threshold ---</span>
b_i_h1 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">64</span>, <span class="hljs-number">1</span>))
b_h1_h2 = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">32</span>, <span class="hljs-number">1</span>))
b_h2_o = np.random.uniform(<span class="hljs-number">-0.5</span>, <span class="hljs-number">0.5</span>, (<span class="hljs-number">10</span>, <span class="hljs-number">1</span>))
</code></pre>
<p>On a side note, I’ve set up 64 neurons for the first hidden layer and 32 for the second. These sizes are somewhat arbitrary; many other choices would perform comparably. I find 64 and 32 the easiest to reason about for an example. Note also that both the weights and the biases were initialized from a uniform distribution over [-0.5, 0.5].</p>
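<p>One thing these choices do affect is the model's capacity. Each layer contributes (outputs × inputs) weights plus one bias per output neuron, so the parameter count is easy to tally (my own quick arithmetic):</p>
<pre><code class="lang-python"># weights + biases for each of the three weight matrices above
params = (64 * 784 + 64) + (32 * 64 + 32) + (10 * 32 + 10)
print(params)  # 52650 trainable parameters
</code></pre>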
<hr />
<h3 id="heading-training-loop"><strong>Training Loop</strong></h3>
<p>The training loop is where the magic happens. It's a cycle of guessing, checking, and correcting that, when repeated thousands of times, allows the network to learn. We will dissect this process into four distinct stages that occur for <em>every single image</em> in our training data.</p>
<p>Here's the full code block for context. We will break it down piece by piece below.</p>
<pre><code class="lang-python"><span class="hljs-comment"># --- The Full Training Loop ---</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(epochs):
    <span class="hljs-comment"># ... (code for tracking error and accuracy)</span>
    <span class="hljs-keyword">for</span> image, label <span class="hljs-keyword">in</span> zip(x_train, y_train):
        <span class="hljs-comment"># STAGE A: Data Preparation</span>
        image = image.reshape(<span class="hljs-number">784</span>, <span class="hljs-number">1</span>)
        label_vec = np.zeros((<span class="hljs-number">10</span>, <span class="hljs-number">1</span>))
        label_vec[label] = <span class="hljs-number">1</span>

        <span class="hljs-comment"># STAGE B: Forward Propagation</span>
        h1_pre = w_i_h1 @ image + b_i_h1
        h1 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h1_pre))
        h2_pre = w_h1_h2 @ h1 + b_h1_h2
        h2 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h2_pre))
        o_pre = w_h2_o @ h2 + b_h2_o
        exps = np.exp(o_pre - np.max(o_pre))
        o = exps / np.sum(exps)
        o = np.clip(o, epsilon, <span class="hljs-number">1</span> - epsilon)

        <span class="hljs-comment"># STAGE C: Loss Calculation</span>
        error = -np.sum(label_vec * np.log(o))
        <span class="hljs-comment"># ... (update total error and accuracy)</span>

        <span class="hljs-comment"># STAGE D: Backpropagation</span>
        delta_o = o - label_vec
        w_h2_o += -learning_rate * delta_o @ np.transpose(h2)
        b_h2_o += -learning_rate * delta_o

        delta_h2 = np.transpose(w_h2_o) @ delta_o * (h2 * (<span class="hljs-number">1</span> - h2))
        w_h1_h2 += -learning_rate * delta_h2 @ np.transpose(h1)
        b_h1_h2 += -learning_rate * delta_h2

        delta_h1 = np.transpose(w_h1_h2) @ delta_h2 * (h1 * (<span class="hljs-number">1</span> - h1))
        w_i_h1 += -learning_rate * delta_h1 @ np.transpose(image)
        b_i_h1 += -learning_rate * delta_h1
</code></pre>
<hr />
<h3 id="heading-data-preparation"><strong>Data Preparation</strong></h3>
<p>Before we can feed data to our network, we must format it correctly.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reshape the input image from a 28x28 matrix to a 784x1 vector</span>
image = image.reshape(<span class="hljs-number">784</span>, <span class="hljs-number">1</span>)

<span class="hljs-comment"># Create a "one-hot" encoded vector for the label</span>
label_vec = np.zeros((<span class="hljs-number">10</span>, <span class="hljs-number">1</span>))
label_vec[label] = <span class="hljs-number">1</span>
</code></pre>
<p><strong>1. Flattening the Image:</strong><br />Our network's input layer has 784 neurons, arranged as a single line. The raw image is a 28x28 grid of pixels. By reshaping it into a (784, 1) column vector, we are "unspooling" the grid into a flat list that can be directly multiplied by our first weight matrix (w_i_h1), which has a shape of (64, 784).</p>
<p><strong>2. Vectorizing the Label (One-Hot Encoding):</strong><br />A human understands the label 7, but our network's output is a vector of 10 probabilities. To measure the error, we need a "ground truth" target in the same format. <strong>One-hot encoding</strong> does this. For the label 7, it creates a vector that is zero everywhere except for a 1 at index 7, representing "100% confidence that the digit is 7."</p>
<hr />
<h3 id="heading-forward-propagation"><strong>Forward Propagation</strong></h3>
<p>Here, data flows forward through the network, from the input pixels to the final probability outputs.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Hidden Layer 1</span>
h1_pre = w_i_h1 @ image + b_i_h1
h1 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h1_pre))  <span class="hljs-comment"># Sigmoid activation</span>

<span class="hljs-comment"># Hidden Layer 2</span>
h2_pre = w_h1_h2 @ h1 + b_h1_h2
h2 = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-h2_pre))  <span class="hljs-comment"># Sigmoid activation</span>

<span class="hljs-comment"># Output Layer</span>
o_pre = w_h2_o @ h2 + b_h2_o
exps = np.exp(o_pre - np.max(o_pre)) <span class="hljs-comment"># Stable Softmax</span>
o = exps / np.sum(exps)
o = np.clip(o, epsilon, <span class="hljs-number">1</span>-epsilon)
</code></pre>
<p>At each layer, we do two things:</p>
<ol>
<li><p><strong>Linear Combination:</strong> Z = W @ A_prev + b. This is a weighted sum. The matrix multiplication (@) aggregates signals from the previous layer, and the bias (b) shifts the result.</p>
</li>
<li><p><strong>Activation:</strong> A = g(Z). We pass the result through a non-linear activation function.</p>
</li>
</ol>
<h4 id="heading-a-closer-look-at-our-activation-functions"><strong>A Closer Look at Our Activation Functions</strong></h4>
<ul>
<li><p><strong>Sigmoid (for Hidden Layers):</strong> σ(z) = 1 / (1 + e⁻ᶻ)</p>
<ul>
<li><p><strong>The Math:</strong> This function takes any real number z and "squashes" it into a range between 0 and 1.</p>
</li>
<li><p><strong>The Intuition:</strong> It acts like a dimmer switch or a "gate." A value close to 1 means the neuron is highly "active" and passing its signal forward. A value close to 0 means it's "inactive." Crucially, it's <strong>non-linear</strong>. Without non-linearity, stacking layers would be pointless; the entire network would collapse into a single, less powerful linear transformation.</p>
</li>
</ul>
</li>
<li><p><strong>Softmax (for the Output Layer):</strong> S(zᵢ) = eᶻᵢ / Σ eᶻⱼ</p>
<ul>
<li><p><strong>The Math:</strong> It exponentiates each input score (making them all positive) and then divides by the sum of all exponentiated scores.</p>
</li>
<li><p><strong>The Intuition:</strong> This is why we use Softmax for the output layer in a classification problem. Its output has two beautiful properties:</p>
<ol>
<li><p>All output values are between 0 and 1.</p>
</li>
<li><p>All output values <strong>sum to 1</strong>.<br /> This transforms the network's raw final scores (o_pre) into a <strong>probability distribution</strong>. We can interpret the output o as the network's <em>confidence</em> in each digit. We can't use a function like ReLU (max(0,z)) here because its outputs don't sum to 1 and can't be interpreted as probabilities.</p>
</li>
</ol>
</li>
</ul>
</li>
</ul>
<p>Finally, notice that the value of “o” is clipped by epsilon. This prevents the computation from breaking: the loss function uses np.log(), and log(0) is negative infinity.</p>
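<p>These properties, and the reason for the epsilon clip, are easy to verify numerically. A minimal NumPy sketch:</p>

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Subtracting the max before exp() is the "stable" trick from the forward pass
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

s = sigmoid(np.array([-5.0, 0.0, 5.0]))  # values near 0, exactly 0.5, and near 1
o = softmax(np.array([2.0, 1.0, 0.1]))   # a valid probability distribution

# Clipping keeps a later np.log(o) finite even if a probability underflows to 0
epsilon = 1e-12
o = np.clip(o, epsilon, 1 - epsilon)
```

<p>Every sigmoid output lands strictly between 0 and 1, and the softmax outputs sum to 1, which is exactly what lets us read them as probabilities.</p>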
<hr />
<h3 id="heading-cross-entropy-loss"><strong>Cross-Entropy Loss</strong></h3>
<p>Now that we have a prediction (o) and a ground truth (label_vec), we can quantify how wrong the network was.</p>
<pre><code class="lang-python">error = -np.sum(label_vec * np.log(o)) <span class="hljs-comment"># Cross-Entropy Loss</span>
</code></pre>
<ul>
<li><p><strong>The Math:</strong> L = -Σ yᵢ log(pᵢ), where y is the true label (our one-hot label_vec) and p is the prediction (o). Since y is 1 for the correct class and 0 for all others, this simplifies to L = -log(p_correct).</p>
</li>
<li><p><strong>The Intuition:</strong> This measures "surprise." If the network predicts a high probability for the correct digit (e.g., p_correct = 0.95), then -log(0.95) is a very small error. If the network is confidently wrong (e.g., p_correct = 0.01), then -log(0.01) is a very large error. This loss function heavily penalizes confident mistakes.</p>
</li>
</ul>
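<p>Plugging numbers in makes the asymmetry concrete:</p>

```python
import numpy as np

def cross_entropy(label_vec, o):
    # L = -sum(y * log(p)); with a one-hot y this reduces to -log(p_correct)
    return float(-np.sum(label_vec * np.log(o)))

y = np.zeros((10, 1))
y[7] = 1  # ground truth: digit 7

# Confidently right: 0.95 on the true class, the rest spread over the others
confident_right = np.full((10, 1), 0.05 / 9)
confident_right[7] = 0.95

# Confidently wrong: only 0.01 on the true class
confident_wrong = np.full((10, 1), 0.99 / 9)
confident_wrong[7] = 0.01

low_loss = cross_entropy(y, confident_right)   # -log(0.95), a small error
high_loss = cross_entropy(y, confident_wrong)  # -log(0.01), a huge error
```

<p>The confident mistake costs roughly ninety times more than the confident correct answer, which is precisely the pressure that pushes the network away from overconfident wrong predictions.</p>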
<hr />
<h3 id="heading-backpropagation"><strong>Backpropagation</strong></h3>
<p>This is the most mathematically rich part. We use the error to figure out how to adjust every single weight and bias. The goal is to calculate the <strong>gradient</strong> of the loss function with respect to each parameter (∂L/∂W, ∂L/∂b). The gradient tells us the direction of steepest ascent for the loss, so we take a small step in the <strong>opposite</strong> direction to reduce the error. This entire process is a practical application of the <strong>Chain Rule</strong> from calculus.</p>
<p><strong>Step 1: The Output Error Gradient</strong></p>
<pre><code class="lang-python">delta_o = o - label_vec
</code></pre>
<ul>
<li><strong>The Math:</strong> This is the derivative of the Loss with respect to the pre-activation output scores, ∂L/∂o_pre. For the specific combination of Softmax and Cross-Entropy Loss, the calculus simplifies beautifully to this intuitive form: (prediction - actual). This vector tells us the magnitude and direction of the error for each output neuron.</li>
</ul>
<p><strong>Step 2: Update Output Layer Weights &amp; Biases</strong></p>
<pre><code class="lang-python">w_h2_o += -learning_rate * delta_o @ np.transpose(h2)
b_h2_o += -learning_rate * delta_o
</code></pre>
<ul>
<li><p><strong>The Math:</strong> To get the gradient for the weights (∂L/∂w_h2_o), the Chain Rule states: ∂L/∂w_h2_o = (∂L/∂o_pre) * (∂o_pre/∂w_h2_o).</p>
</li>
<li><p><strong>Connecting to Code:</strong> We already have ∂L/∂o_pre (it's delta_o). The term ∂o_pre/∂w_h2_o is simply the activation of the layer that fed into it, h2. Therefore, the full gradient is delta_o @ np.transpose(h2). We then take a small step (-learning_rate) in the opposite direction of this gradient. The bias update is even simpler as its derivative is just 1.</p>
</li>
</ul>
<p><strong>Step 3: Propagate Error to Hidden Layer 2</strong></p>
<pre><code class="lang-python">delta_h2 = np.transpose(w_h2_o) @ delta_o * (h2 * (<span class="hljs-number">1</span> - h2))
</code></pre>
<ul>
<li><p><strong>The Math:</strong> Now we need the error for the <em>next</em> layer back, ∂L/∂h2_pre. The Chain Rule expands: ∂L/∂h2_pre = (∂L/∂o_pre) × (∂o_pre/∂h2) × (∂h2/∂h2_pre).</p>
</li>
<li><p><strong>Connecting to Code:</strong></p>
<ul>
<li><p>(∂L/∂o_pre) * (∂o_pre/∂h2): This is the output error (delta_o) propagated backward through the weights (w_h2_o). This is np.transpose(w_h2_o) @ delta_o. It tells us how much each h2 neuron contributed to the final output error.</p>
</li>
<li><p>(∂h2/∂h2_pre): This is the derivative of the Sigmoid activation function, which conveniently is σ(z) × (1 - σ(z)), or in our code, h2 * (1 - h2).</p>
</li>
<li><p>The term h2 * (1-h2) has a powerful intuition: neurons that were very certain (output near 0 or 1) have a small derivative and are changed very little. Neurons that were uncertain (output near 0.5) have the largest derivative and are updated the most. The network focuses its learning on its points of uncertainty!</p>
</li>
</ul>
</li>
</ul>
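<p>A quick numerical check of that intuition about uncertainty:</p>

```python
def sigmoid_derivative(a):
    # Derivative of the sigmoid expressed in terms of its own output a = sigmoid(z)
    return a * (1 - a)

certain_off = sigmoid_derivative(0.01)  # neuron confidently inactive
certain_on = sigmoid_derivative(0.99)   # neuron confidently active
uncertain = sigmoid_derivative(0.5)     # neuron on the fence
```

<p>The derivative peaks at 0.25 for an output of exactly 0.5 and collapses toward zero at either extreme, so the gradient updates concentrate on the undecided neurons.</p>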
<p><strong>Steps 4-6: Continue the Process Backward</strong><br />The remaining lines repeat this exact pattern. The error delta_h2 is used to update w_h1_h2 and b_h1_h2, and then it's propagated further back to create delta_h1, which is used to update the final set of weights and biases. This "chain" is what gives the Chain Rule its name.</p>
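<p>Spelled out under the same variable names, steps 4–6 look like this (a runnable sketch with toy shapes: the 784-input and 64-unit sizes come from the article, while the 16-unit second hidden layer is an assumption for illustration):</p>

```python
import numpy as np

# Toy shapes so the sketch runs standalone; real training would reuse the
# activations and delta_h2 computed in the forward and backward passes above.
rng = np.random.default_rng(0)
learning_rate = 0.01
image = rng.random((784, 1))
h1 = rng.random((64, 1))
delta_h2 = rng.standard_normal((16, 1))
w_h1_h2 = rng.standard_normal((16, 64)) * 0.1
b_h1_h2 = np.zeros((16, 1))
w_i_h1 = rng.standard_normal((64, 784)) * 0.1
b_i_h1 = np.zeros((64, 1))

# Step 4: update the weights feeding hidden layer 2 (same pattern as step 2)
w_h1_h2 += -learning_rate * delta_h2 @ np.transpose(h1)
b_h1_h2 += -learning_rate * delta_h2

# Step 5: propagate the error back to hidden layer 1 (same pattern as step 3)
delta_h1 = np.transpose(w_h1_h2) @ delta_h2 * (h1 * (1 - h1))

# Step 6: update the first layer's weights and biases
w_i_h1 += -learning_rate * delta_h1 @ np.transpose(image)
b_i_h1 += -learning_rate * delta_h1
```

<p>Each step reuses the delta from the layer in front of it, which is the "chain" in action.</p>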
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>We took a blank file, a handful of mathematical principles, and a collection of pixels, and we forged a system that can learn. But the most valuable thing we built today wasn't a digit classifier, it was <strong>intuition</strong>.</p>
<p>In a world dominated by powerful frameworks like PyTorch and TensorFlow, it's tempting to jump straight to the high-level commands.</p>
<p>These tools are incredible for productivity and are essential for building state-of-the-art models. But used without a foundational understanding, they can become "magic black boxes" that work in mysterious ways. When they break, we don't know why. When we need to innovate, we don't know how.</p>
<p>This project was our refusal to accept the magic box.</p>
<p>Take this blog with a grain of salt, but I’d rather share a project built from scratch and grounded in intuition (even if that makes it more prone to mistakes) as a way to express my creativity in applied mathematics and to further the discussion around AI.</p>
<p>code link: <a target="_blank" href="https://github.com/HarvsDucs/mnist_from_scratch/blob/main/main.py">https://github.com/HarvsDucs/mnist_from_scratch/blob/main/main.py</a></p>
<p>I also highly suggest checking out 3blue1brown’s video about backpropagation calculus: <a target="_blank" href="https://www.youtube.com/watch?v=tIeHLnjs5U8">https://www.youtube.com/watch?v=tIeHLnjs5U8</a></p>
]]></content:encoded></item><item><title><![CDATA[Why ML Fails at Stock/Crypto Prediction]]></title><description><![CDATA[The use of ML (machine learning) has been on the rise for the past 10 years. The idea of machine learning has been a trend for quite a while now and many people coming from different domains try to grasp the topic in hopes for a better opportunity ca...]]></description><link>https://hddatascience.tech/why-ml-fails-at-stockcrypto-prediction</link><guid isPermaLink="true">https://hddatascience.tech/why-ml-fails-at-stockcrypto-prediction</guid><category><![CDATA[Cryptocurrency]]></category><category><![CDATA[stockmarket]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 20 Jun 2025 04:57:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750394368627/78f2c587-0a5d-4dc6-b468-0bf31e39aa84.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750393230185/747a5f66-f473-46af-8a1a-9736274415ce.png" alt class="image--center mx-auto" /></p>
<p>The use of ML (machine learning) has been on the rise for the past 10 years. The idea of machine learning has been a trend for quite a while now and many people coming from different domains try to grasp the topic in hopes for a better opportunity career-wise and financial-wise.</p>
<p>From my personal view, there has been lots of side-projects made by newcomers in the field, of analysis that involves stock price prediction or crypto price prediction. Don’t get me wrong, I’m not against the idea of using ML for these use-cases, but I just want to point out how traditional ML techniques are misused a lot for this specific domain.</p>
<h2 id="heading-beyond-price-and-time-missing-market-signals">Beyond price and time: Missing market signals</h2>
<p>Almost all (if not all) of the side projects I’ve seen that try to predict stock/crypto prices settle for open/close prices over time with some added lagged features to predict future prices. As someone who’s been a crypto trader for more than 5 years and a data scientist for more than 3, I know there are far more factors affecting crypto prices. These include market structure, technical, adoption, regulatory, and market sentiment factors.</p>
<p>Honestly speaking, there are more factors than what I have listed, and incorporating them as features or parameters will only make your model better, if more complex. If you could build a model that captures all of these factors well (assuming you have a way of representing them well enough), then maybe you really have a shot at predicting future stock/crypto prices.</p>
<h2 id="heading-trading-against-quant-traders-hedge-funds-and-banks">Trading against Quant Traders, Hedge Funds, and Banks</h2>
<p>Using a very simple model to predict stock/crypto prices is almost disrespectful to the years of domain experience professional traders have. They know of factors we don’t even know exist, and those factors are key to predicting stock/crypto prices.</p>
<p>I’m not saying we can’t beat professional traders with traditional ML techniques; maybe we can build a better model. But at the end of the day, these players have the ‘data advantage’. Having been in data science for quite a while now, I know that nothing beats good, clean data for modelling. My personal long-term goal is to build an algorithm that could aid my trading journey. I know it’s a long and continuous process, but it’s one I’m willing to be part of.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The reality is that when you trade, you are going against institutions with years of domain experience, a superbly developed market intuition, and deep pockets to buy whatever data advantage they need to build a better model. The gap between side projects that predict stock/crypto prices and professional trading systems is wider than most people expect. This doesn’t mean that developing your own prediction model as an individual or ‘retail’ trader will never work; it means there is room for improvement and many chances to make whatever model you have better. In a game where your opponents have billion-dollar budgets and decades of experience, a little humility that pushes you to keep learning is a good thing. That is a lesson worth internalizing before risking your capital on your model.</p>
]]></content:encoded></item><item><title><![CDATA[Reinforcement Learning is the inevitable]]></title><description><![CDATA[The internet is full of bad data, and that is where the training data for LLMs are coming from. Some are polluting the internet with bad data out of spite. Most just spam the internet of AI generated data for profit One day, we might see a future whe...]]></description><link>https://hddatascience.tech/reinforcement-learning-is-the-inevitable</link><guid isPermaLink="true">https://hddatascience.tech/reinforcement-learning-is-the-inevitable</guid><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Wed, 18 Jun 2025 01:34:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750207255359/fe7bf9b5-020d-4873-8e72-4e6ad97a7a13.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The internet is full of bad data, and that is where the training data for LLMs are coming from. Some are polluting the internet with bad data out of spite. Most just spam the internet of AI generated data for profit One day, we might see a future where an outside training data is necessary in order for these models to train as reinforcement learning will be the new standard for model training. Feels odd, doesn’t it?</p>
<h2 id="heading-why-reinforcement-learning">Why reinforcement learning</h2>
<p>AGI (Artificial General Intelligence) is the talk of the town right now, especially among top executives working closely with AI. I sure ain’t got much, but I’m willing to bet a lot on AGI emerging from reinforcement learning. You can imagine reinforcement learning as a brute-force algorithm (compared to the traditional neural network architecture) that tries every possible solution in a given environment space, with the most optimal one chosen according to the rewards and punishments you set.</p>
<h2 id="heading-what-made-reinforcement-learning-different">What made reinforcement learning different</h2>
<p>Reinforcement learning has proven time and time again that models trained with it find the most efficient way to reach their goal, often surpassing human intuition and assumptions about the domain. An autonomous helicopter even learned to fly inverted through reinforcement learning, its only goals being to fly, stay above the surface, and not crash. Isn’t it kind of funny that we humans had never thought of flying that way?</p>
<h2 id="heading-types-of-reinforcement-learning-algorithms">Types of Reinforcement Learning Algorithms</h2>
<ul>
<li><p>Value-Based</p>
</li>
<li><p>Policy-Based</p>
</li>
<li><p>Model-Based</p>
</li>
<li><p>Actor-Critic Methods</p>
</li>
</ul>
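<p>As a taste of the value-based family, here is tabular Q-learning on a toy five-state corridor where the only reward is for reaching the rightmost state. The environment and every number here are my own illustration, not something from the post:</p>

```python
import numpy as np

# Toy corridor: states 0..4, actions 0 (left) / 1 (right), reward at state 4.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = np.ones((n_states, n_actions))  # optimistic init encourages exploration
rng = np.random.default_rng(42)

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for _ in range(500):  # episodes
    s = 0
    for _ in range(20):  # steps per episode
        # Epsilon-greedy action selection: mostly exploit, occasionally explore
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: nudge Q[s, a] toward reward + discounted best next value
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

greedy_policy = np.argmax(Q, axis=1)  # learned action per state
```

<p>After training, the greedy policy walks right in every non-terminal state, found purely through trial, error, and the reward signal.</p>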
<h2 id="heading-conclusion">Conclusion</h2>
<p>Reinforcement learning isn’t getting much hype right now, and understandably so: it isn’t as feasible as we think it is. It is far more compute- and memory-intensive than the traditional way of training neural networks and machine learning models. As compute becomes less expensive, we will see a world where reinforcement learning is a more prominent way to train AI.</p>
]]></content:encoded></item><item><title><![CDATA[PCA as a Last Resort]]></title><description><![CDATA[Introduction
Principal Component Analysis (PCA) is often the first dimensionality reduction technique that data scientists reach for when faced with high-dimensional data. While PCA is powerful and mathematically elegant, treating it as a default fir...]]></description><link>https://hddatascience.tech/pca-as-a-last-resort</link><guid isPermaLink="true">https://hddatascience.tech/pca-as-a-last-resort</guid><category><![CDATA[Pca]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Mon, 28 Apr 2025 14:34:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745850827997/19b40501-c74f-48d8-9263-f027a1d672dc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Principal Component Analysis (PCA) is often the first dimensionality reduction technique that data scientists reach for when faced with high-dimensional data. While PCA is powerful and mathematically elegant, treating it as a default first step can lead to missed opportunities and suboptimal models. This post explores why feature engineering and feature removal should be your first considerations before applying PCA.</p>
<h2 id="heading-understanding-the-limitations-of-pca">Understanding the Limitations of PCA</h2>
<p>PCA transforms your original features into new components that capture maximum variance. However, this mathematical transformation comes with significant tradeoffs:</p>
<ol>
<li><p>Loss of interpretability - Principal components are linear combinations of original features, making them difficult to explain to stakeholders</p>
</li>
<li><p>Domain knowledge is discarded - PCA is a purely statistical technique that ignores valuable domain expertise</p>
</li>
<li><p>Non-linear relationships are missed - Standard PCA only captures linear relationships between features</p>
</li>
</ol>
<h2 id="heading-feature-engineering-creating-meaningful-representations">Feature Engineering: Creating Meaningful Representations</h2>
<p>Before reducing dimensions, consider creating more informative features:</p>
<ul>
<li><p>Ratio features that capture relationships between variables (e.g., debt-to-income ratio)</p>
</li>
<li><p>Interaction terms that represent how features work together</p>
</li>
<li><p>Domain-specific transformations based on expert knowledge</p>
</li>
<li><p>Polynomial features to capture non-linear relationships</p>
</li>
</ul>
<p>These engineered features often provide more predictive power than abstract principal components while maintaining interpretability.</p>
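<p>A minimal sketch of these four ideas on a hypothetical loan dataset (all column names and values are invented for illustration):</p>

```python
import pandas as pd

# Hypothetical loan data, just to illustrate the transformations
df = pd.DataFrame({
    "monthly_debt": [500.0, 1200.0, 300.0],
    "monthly_income": [4000.0, 3000.0, 6000.0],
    "loan_amount": [10_000.0, 25_000.0, 5_000.0],
    "term_months": [36, 60, 24],
})

# Ratio feature: captures a relationship no single raw column expresses
df["debt_to_income"] = df["monthly_debt"] / df["monthly_income"]

# Interaction term: how two features work together
df["amount_x_term"] = df["loan_amount"] * df["term_months"]

# Domain-specific transformation: rough payment burden per month of the term
df["payment_per_month"] = df["loan_amount"] / df["term_months"]

# Polynomial feature: lets a linear model capture a non-linear income effect
df["income_squared"] = df["monthly_income"] ** 2
```

<p>Each new column keeps a plain-English meaning a stakeholder can follow, which is exactly what a principal component gives up.</p>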
<h2 id="heading-feature-removal-the-simplest-form-of-dimensionality-reduction">Feature Removal: The Simplest Form of Dimensionality Reduction</h2>
<p>Feature removal should be your first dimensionality reduction approach because:</p>
<ul>
<li><p>It preserves the original meaning of remaining features</p>
</li>
<li><p>It forces critical thinking about which variables truly matter</p>
</li>
<li><p>It simplifies your model and reduces overfitting</p>
</li>
</ul>
<p>Methods for informed feature removal include:</p>
<ul>
<li><p>Correlation analysis to identify redundant features</p>
</li>
<li><p>Feature importance rankings from tree-based models</p>
</li>
<li><p>Filter methods like variance thresholds and mutual information</p>
</li>
<li><p>Wrapper methods such as recursive feature elimination</p>
</li>
</ul>
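<p>Two of the simplest of these methods, a variance threshold and correlation-based pruning, can be sketched with pandas alone (synthetic data and invented column names for illustration):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 * 2 + rng.normal(scale=0.01, size=n),  # nearly redundant with x1
    "x2": rng.normal(size=n),                            # independent signal
    "const": np.ones(n),                                 # zero variance: useless
})

# Filter method: drop (near-)zero-variance columns
variances = df.var()
df = df.loc[:, variances > 1e-8]

# Correlation analysis: drop one column from each highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
kept = df.drop(columns=to_drop)
```

<p>The surviving columns still mean what they meant before, so any downstream model stays interpretable.</p>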
<h2 id="heading-when-pca-makes-sense">When PCA Makes Sense</h2>
<p>PCA becomes valuable after you've exhausted feature engineering and removal options, particularly when:</p>
<ul>
<li><p>You still have high dimensionality after careful feature selection</p>
</li>
<li><p>Multicollinearity remains a significant issue</p>
</li>
<li><p>Computational efficiency is a critical concern</p>
</li>
<li><p>You're using specific algorithms that benefit from orthogonal features</p>
</li>
<li><p>Visualization of high-dimensional data is needed</p>
</li>
</ul>
<h2 id="heading-a-better-workflow-for-dimensionality-reduction">A Better Workflow for Dimensionality Reduction</h2>
<p>Instead of immediately applying PCA, follow this approach:</p>
<ol>
<li><p>Start with domain knowledge to engineer meaningful features</p>
</li>
<li><p>Apply feature selection techniques to remove redundant or irrelevant variables</p>
</li>
<li><p>Use PCA only on the remaining features if dimensionality is still problematic</p>
</li>
<li><p>Consider non-linear dimensionality reduction techniques (t-SNE, UMAP) if linear PCA performs poorly</p>
</li>
</ol>
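<p>Step 3 can be sketched with scikit-learn, assuming steps 1–2 left you with, say, 20 engineered features (synthetic data for illustration): ask PCA for just enough components to explain 95% of the variance.</p>

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for features that survived engineering and selection:
# 20 observed columns driven by 3 underlying factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 20))

# A float n_components asks for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

<p>Because the dimensionality was already tamed before this step, the handful of components PCA keeps is far easier to work with than components squeezed out of hundreds of raw features.</p>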
<h2 id="heading-conclusion">Conclusion</h2>
<p>While PCA is a valuable tool in the data scientist's toolkit, it should rarely be your first choice for dimensionality reduction. By prioritizing feature engineering and thoughtful feature removal, you'll create models that are not only more accurate but also more interpretable and actionable. Save PCA for when you truly need it—as a last resort after you've leveraged your domain knowledge and simpler techniques.</p>
]]></content:encoded></item><item><title><![CDATA[Chunking Methods for RAG: What and Why]]></title><description><![CDATA[The Day My RAG System Failed
Meet Charlie, a developer who learned a valuable lesson about RAG systems the hard way. (Let’s just call him Charlie but we all really know who he is. 😉)
Last year, Charlie built what he thought was the perfect RAG (Retr...]]></description><link>https://hddatascience.tech/chunking-methods-for-rag-what-and-why</link><guid isPermaLink="true">https://hddatascience.tech/chunking-methods-for-rag-what-and-why</guid><category><![CDATA[RAG ]]></category><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Mon, 03 Mar 2025 06:17:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740977490305/079d57bc-da14-42c1-beac-f61150865622.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-day-my-rag-system-failed">The Day My RAG System Failed</h2>
<p>Meet Charlie, a developer who learned a valuable lesson about RAG systems the hard way. (Let’s just call him Charlie but we all really know who he is. 😉)</p>
<p>Last year, Charlie built what he thought was the perfect RAG (Retrieval-Augmented Generation) system for a Solar Company. The architecture was elegant, the UI sleek, and the LLM integration seamless. It was perfect for consuming PDFs and being a central source of truth for onboarding new employees.</p>
<p>Then came the testing portion. An employee asked the system a specific question about sales technique. The system confidently responded with completely wrong information that contradicted their report. The room went silent.</p>
<p>That was the day Charlie learned that a RAG system is only as good as its chunking strategy, and he had chosen the wrong one.</p>
<p>You might think chunking documents is just about splitting text into smaller pieces. But in reality, it's the strategic foundation that can make or break your entire RAG system's effectiveness.</p>
<h2 id="heading-why-most-rag-systems-fail-despite-using-advanced-models">Why Most RAG Systems Fail Despite Using Advanced Models</h2>
<p>When engineers build RAG systems, they often focus on the fancy parts, the latest embedding models, vector stores, and prompt engineering techniques. Yet many overlook the humble chunking step, treating it as a trivial preprocessing task.</p>
<p>This is a big mistake.</p>
<p>No matter how advanced your retrieval algorithms or language models are, if your chunks don't properly preserve context and semantic meaning, your RAG system will inevitably deliver hallucinations and irrelevant responses.</p>
<h2 id="heading-the-three-chunking-methods-you-need-to-know">The Three Chunking Methods You Need to Know</h2>
<h3 id="heading-1-fixed-size-chunking-the-default-trap">1. Fixed-Size Chunking: The Default Trap</h3>
<p>Most engineers start with fixed-size chunking, slicing documents into equal segments of token or character counts. It's simple and conventional.</p>
<p>But here's the shocking truth: fixed-size chunking regularly destroys the semantic cohesion of your content. When you arbitrarily split text every 512 tokens, you're likely cutting right through important concepts, breaking relationships between sentences, and fragmenting contextual information.</p>
<h3 id="heading-example-scenario">Example Scenario</h3>
<p>Imagine we're feeding a RAG system these two sentences:</p>
<ul>
<li><p>"The brown dog jumps over the lazy fox."</p>
</li>
<li><p>"A brown dog jumps over the lazy fox quickly."</p>
</li>
</ul>
<p>If we use fixed-size chunks (e.g., 20 characters), we might get these chunks:</p>
<ul>
<li><p>Sentence 1: "The brown dog jumps o" and "ver the lazy fox."</p>
</li>
<li><p>Sentence 2: "A brown dog jumps ove" and "r the lazy fox quickly."</p>
</li>
</ul>
<p>Now, if a user asks "What animal jumps over the lazy fox?", neither chunk perfectly captures the key information. The query terms are split across chunks due to the slight shift ("The" vs. "A"). The RAG system might miss the crucial link, even though both sentences clearly answer the question. This demonstrates how fixed-size chunking can break semantic context and hurt retrieval accuracy.</p>
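<p>A fixed-size splitter is a one-liner, which is exactly why it is so tempting. This sketch uses a 20-character window (so the breakpoints differ slightly from the approximate chunks quoted above):</p>

```python
def fixed_size_chunks(text, size):
    # Slice the text every `size` characters, with no regard for word boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]

s1 = "The brown dog jumps over the lazy fox."
s2 = "A brown dog jumps over the lazy fox quickly."
chunks1 = fixed_size_chunks(s1, 20)
chunks2 = fixed_size_chunks(s2, 20)
```

<p>The one-character shift between "The" and "A" is enough to cut "over" in half in the second sentence, which is precisely the fragility described above.</p>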
<h3 id="heading-2-recursive-character-text-splitting">2. Recursive Character Text Splitting</h3>
<p>Recursive character splitting aims to create smart chunks, but it can still <em>fragment context</em>.</p>
<h3 id="heading-example-scenario-1">Example Scenario</h3>
<p>Recursive character splitting breaks text by punctuation (periods, etc.), then characters if needed. Sounds good, but it can still <em>fragment context</em>.</p>
<p><strong>Example:</strong></p>
<p>Consider: "Climate change threatens coasts. Rising sea levels are a problem. Communities rely on fishing and tourism. Reducing emissions is crucial."</p>
<p>Recursive splitting (aiming for ~100-character chunks) might give:</p>
<ul>
<li><p>"Climate change threatens coasts. Rising sea levels are a problem."</p>
</li>
<li><p>"Communities rely on fishing and tourism. Reducing emissions is crucial."</p>
</li>
</ul>
<p>If someone asks, "How do we protect coastal communities?", the link between climate change <em>specifically</em> and the need for emissions reductions is weakened. It's now less clear that reducing emissions directly addresses the threats <em>caused</em> by climate change.</p>
<p><strong>The problem?</strong> Even with punctuation-based splitting, <em>semantic relationships</em> between chunks can be lost. Don't assume it's perfect; experiment and consider smarter chunking for optimal RAG!</p>
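<p>A minimal recursive splitter can be sketched in a few lines. This is my own simplified implementation of the idea, not a library call; libraries such as LangChain ship production versions of the same concept:</p>

```python
def recursive_split(text, max_len, separators=("\n\n", ". ", " ")):
    """Greedily pack pieces into chunks of at most max_len characters,
    trying coarser separators first and finer ones on oversized chunks."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            parts = [p + sep for p in text.split(sep)]
            parts[-1] = parts[-1][:-len(sep)]  # the last part has no trailing sep
            chunks, current = [], ""
            for piece in parts:
                if current and len(current) + len(piece) > max_len:
                    chunks.append(current.rstrip())
                    current = ""
                current += piece
            if current:
                chunks.append(current.rstrip())
            # Any chunk still too long is re-split with the finer separators
            return [c for ch in chunks for c in recursive_split(ch, max_len, separators)]
    # No separator found at all: fall back to hard slicing
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

text = ("Climate change threatens coasts. Rising sea levels are a problem. "
        "Communities rely on fishing and tourism. Reducing emissions is crucial.")
chunks = recursive_split(text, max_len=100)
```

<p>On the climate example this reproduces the two chunks shown above: the splits land neatly on sentence boundaries, yet the causal link between the two chunks is still lost.</p>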
<h3 id="heading-3-semantic-chunking-the-contextual-approach">3. Semantic Chunking: The Contextual Approach</h3>
<p>Unlike fixed-size chunking, semantic chunking preserves meaning by respecting natural boundaries in the text, paragraphs, sections, or semantic units.</p>
<h3 id="heading-example-scenario-2">Example Scenario</h3>
<p>"The use of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP). LLMs, trained on massive datasets, demonstrate impressive capabilities in text generation, translation, and question answering. However, LLMs also present challenges. One significant concern is the potential for generating biased or harmful content. Careful data curation and bias mitigation techniques are crucial. Another challenge is the computational cost associated with training and deploying these models. Research into more efficient architectures is ongoing. Despite these challenges, the benefits of LLMs are undeniable, and their applications are rapidly expanding across various industries."</p>
<p><strong>Fixed-Size Chunking (Problem):</strong></p>
<p>If we used a fixed-size chunk of, say, 200 characters, we might get chunks like:</p>
<ul>
<li><p>"The use of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP). LLMs, trained on massive datasets, demonstrate impressive capabilities in text generation, translation, and quest"</p>
</li>
<li><p>"ion answering. However, LLMs also present challenges. One significant concern is the potential for generating biased or harmful content. Careful data curation and bias mitigation techniques are cru"</p>
</li>
<li><p>"cial. Another challenge is the computational cost associated with training and deploying these models. Research into more efficient architectures is ongoing. Despite these challenges, the benefits"</p>
</li>
<li><p>" of LLMs are undeniable, and their applications are rapidly expanding across various industries."</p>
</li>
</ul>
<p>Notice how the chunks break in the middle of sentences and thoughts.</p>
<p><strong>Semantic Chunking (Solution):</strong></p>
<p>A semantic chunking approach might produce the following chunks, identifying logical breaks between topics:</p>
<ul>
<li><p><strong>Chunk 1:</strong> "The use of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP). LLMs, trained on massive datasets, demonstrate impressive capabilities in text generation, translation, and question answering." (This chunk focuses on the positive impact of LLMs)</p>
</li>
<li><p><strong>Chunk 2:</strong> "However, LLMs also present challenges. One significant concern is the potential for generating biased or harmful content. Careful data curation and bias mitigation techniques are crucial." (This chunk focuses on the bias challenge and mitigation)</p>
</li>
<li><p><strong>Chunk 3:</strong> "Another challenge is the computational cost associated with training and deploying these models. Research into more efficient architectures is ongoing." (This chunk focuses on the computational cost challenge)</p>
</li>
<li><p><strong>Chunk 4:</strong> "Despite these challenges, the benefits of LLMs are undeniable, and their applications are rapidly expanding across various industries." (This chunk serves as a conclusion, summarizing the overall value of LLMs).</p>
</li>
</ul>
<p><strong>Why Semantic Chunking Wins:</strong></p>
<ul>
<li><p><strong>If the user asks:</strong> "What are the advantages of LLMs?", Chunk 1 is a perfect match.</p>
</li>
<li><p><strong>If the user asks:</strong> "What are the challenges with LLMs?", Chunks 2 and 3 provide detailed answers to different challenges.</p>
</li>
<li><p><strong>If the user asks:</strong> "Are LLMs useful despite their problems?", Chunk 4 provides the concluding perspective.</p>
</li>
</ul>
<p>Traditional chunking methods (fixed-size, punctuation-based) often fragment context, hurting RAG performance. Semantic chunking aims for <em>meaningful</em> chunks, leading to better retrieval and generation.</p>
<p><strong>Why This Works (A Simplified Analogy):</strong></p>
<p>Imagine each word has a "location" in a semantic space. You could imagine the location as a single point in a vector space described by the values of the embedding vector. Semantic chunking tries to group words into chunks where the "average location" (centroid) of all words in the chunk is close to each individual word. The closer the words are, the more coherent and semantically related the chunk is. This minimizes "semantic distance" within each chunk, maximizing its relevance. Fixed chunking ignores this all together.</p>
<p><strong>How it Works (Simply):</strong></p>
<p>Instead of rigidly sticking to character counts or punctuation, semantic chunking tries to <em>understand</em> the text. It identifies logical boundaries based on the content itself. This might involve:</p>
<ul>
<li><p>Looking for topic shifts.</p>
</li>
<li><p>Identifying clear beginnings and endings of arguments.</p>
</li>
<li><p>Using more sophisticated NLP techniques to recognize semantic similarity within a chunk.</p>
</li>
</ul>
<p>A visual example:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740981358348/a3dd1bee-aa3d-4ae3-bb52-f4cd5e91390f.png" alt class="image--center mx-auto" /></p>
<p><em>Note that the sentences and the vectors that represent them on the graph are arbitrary, and only exist to show what it means to semantically chunk.</em></p>
<p>The moment the next sentence's embedding is far off, it is automatically treated as the start of a new chunk. One could loosely compare this to deduplicating records in a database to save storage and compute, although here "similar" means a high cosine similarity between vectors, not exact equality.</p>
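<p>That boundary rule can be sketched in a few lines (the <code>embed</code> function and the 0.7 threshold are placeholder assumptions; a real pipeline would call an embedding model and tune the cutoff):</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_chunks(sentences, embed, threshold=0.7):
    # Greedy pass: start a new chunk whenever the next sentence's
    # embedding drifts too far from the previous one.
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sent])        # far off -> new chunk
        else:
            chunks[-1].append(sent)      # still on topic -> same chunk
        prev_vec = vec
    return chunks

# Invented embeddings: the first two sentences share a "topic direction".
fake_embeddings = {
    "LLMs are powerful.": [1.0, 0.1],
    "They generate text.": [0.9, 0.2],
    "Training costs are high.": [0.1, 1.0],
}
print(semantic_chunks(list(fake_embeddings), fake_embeddings.get))
```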
<h3 id="heading-4-agentic-chunking-the-advanced-solution">4. Agentic Chunking: The Advanced Solution</h3>
<p>A related technique is worth mentioning first: for complex documents with nested structure, hierarchical chunking creates multiple levels of granularity (section-level, paragraph-level, and sentence-level chunks).</p>
<p>Even the best semantic chunking has limitations. It's <em>static</em>: it analyzes the data once and creates fixed chunks, which breaks down on messy data like scraped websites or complex PDFs.</p>
<p><strong>The Problem: Unstructured Data &amp; Varying Queries</strong></p>
<ul>
<li><p><strong>Websites:</strong> Noisy HTML, ads, irrelevant disclaimers.</p>
</li>
<li><p><strong>PDFs:</strong> Complex formatting, embedded images breaking text flow.</p>
</li>
<li><p><strong>Long Text:</strong> Subtle topic shifts, making uniform chunking ineffective.</p>
</li>
<li><p><strong>Query Variance:</strong> Some questions need broad context, others specific details. Static chunks can't adapt.</p>
</li>
</ul>
<p><strong>Agentic RAG: Dynamic &amp; Adaptive Chunking to the Rescue!</strong></p>
<p>Agentic RAG uses an "agent" (often a smaller LLM) to <em>dynamically</em> analyze the data and <em>adapt</em> the chunking strategy based on the source and the user's query.</p>
<p><strong>Example: Scraping a Product Review Website</strong></p>
<ul>
<li><p><strong>Static Chunking:</strong> You scrape a product review page and try to split by HTML structure. You end up with chunks containing navigation menus, ads, and user comments alongside the actual review.</p>
</li>
<li><p><strong>Agentic RAG Approach:</strong></p>
<ol>
<li><p><strong>Agent Identifies Core Content:</strong> The agent identifies the main review text, ignoring irrelevant parts. It might use rules like, "Find the longest text block within the <code>&lt;article&gt;</code> tag" or "Identify the section with the highest concentration of keywords related to the product."</p>
</li>
<li><p><strong>Content-Aware Chunking:</strong> Now that the agent has the main article, it can do semantic chunking on <em>just</em> the review content, prioritizing sections with headings like "Pros," "Cons," or "Performance."</p>
</li>
<li><p><strong>Query-Aware Chunking:</strong> If the user asks, "What are the drawbacks of this product?", the agent could <em>re-chunk</em> the review, focusing specifically on sentences containing keywords related to "drawbacks," "cons," "problems," or "issues," creating highly targeted chunks.</p>
</li>
</ol>
</li>
</ul>
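<p>The first and third steps above can be sketched with plain string handling (the page snippet, keyword list, and helper names are hypothetical; a production agent would use a proper HTML parser and an LLM for the heavy lifting):</p>

```python
import re

def extract_core_content(html):
    # Step 1 (sketch): take the longest text block inside an <article> tag,
    # ignoring navigation, ads, and footers outside it.
    blocks = re.findall(r"<article>(.*?)</article>", html, flags=re.S)
    paragraphs = [p for b in blocks for p in re.findall(r"<p>(.*?)</p>", b, flags=re.S)]
    return max(paragraphs, key=len, default="")

def query_aware_chunks(text, keywords):
    # Step 3 (sketch): keep only sentences mentioning the query's keywords.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

page = (
    "<nav>Home | Deals</nav>"
    "<article><p>Great battery life. The main drawback is the noisy fan.</p></article>"
    "<footer>ads</footer>"
)
review = extract_core_content(page)
print(query_aware_chunks(review, ["drawback", "cons", "issue"]))
```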
<p><strong>Benefits of Agentic RAG:</strong></p>
<ul>
<li><p><strong>Noise Reduction:</strong> Filters irrelevant content before chunking.</p>
</li>
<li><p><strong>Contextual Understanding:</strong> Adapts to different data types (webpages, PDFs, etc.).</p>
</li>
<li><p><strong>Query Optimization:</strong> Tailors chunk sizes and content to answer the user's specific question.</p>
</li>
<li><p><strong>PDF Mastery:</strong> Handles PDFs by first extracting text, identifying headings, and chunking structurally.</p>
</li>
<li><p><strong>Long Text Savvy:</strong> For long texts, employs techniques like sliding windows or hierarchical summarization to maintain context across large distances.</p>
</li>
</ul>
<p><strong>Cons of Agentic RAG:</strong></p>
<ul>
<li><p><strong>Complexity Overload</strong>: Designing intelligent agents adds layers of code and complexity. You'll need stronger programming and NLP skills than with simple chunking.</p>
</li>
<li><p><strong>Higher Costs</strong>: Running agents, especially those powered by LLMs, consumes more computational resources. Expect increased latency and potentially higher cloud bills.</p>
</li>
<li><p><strong>Prompt Engineering</strong>: As with other applications built on LLMs, prompt engineering plays a critical role. If the prompts that drive the agents are off, performance will not meet requirements.</p>
</li>
<li><p><strong>Over-Engineering Trap</strong>: It's tempting to over-engineer your agents. Start with simple agents and add complexity only when it demonstrably improves results.  </p>
</li>
</ul>
<p>Agentic RAG isn't just chunking; it's <em>intelligent</em> chunking. By dynamically adapting to the data and the user's needs, it unlocks far better accuracy and relevance in RAG systems compared to static approaches. If you're serious about RAG, you need to explore agentic strategies. But then again, it might be overkill to use Agentic RAG for simple use cases.</p>
<h2 id="heading-the-impact-on-your-rag-systems-performance">The Impact on Your RAG System's Performance</h2>
<p>Choosing the right chunking method isn't just a technical decision; it's a business-critical one. Here's what happens when you get it right:</p>
<ol>
<li><p><strong>Reduced Hallucinations</strong>: Proper chunks preserve context, giving the LLM less reason to "fill in the gaps" with fabricated information</p>
</li>
<li><p><strong>Improved Relevance</strong>: Better chunks mean more precise retrieval, ensuring responses actually address the user's query</p>
</li>
<li><p><strong>Enhanced Context Window Utilization</strong>: Strategic chunking makes better use of limited context windows in LLMs</p>
</li>
<li><p><strong>Lower Operational Costs</strong>: Better retrieval means fewer tokens processed and less computational overhead</p>
</li>
</ol>
<h2 id="heading-implementing-the-right-chunking-strategy-today">Implementing the Right Chunking Strategy Today</h2>
<p>The most successful RAG engineers I've worked with follow this process:</p>
<ol>
<li><p>Analyze your document structure and content type</p>
</li>
<li><p>Experiment with multiple chunking strategies on a test dataset</p>
</li>
<li><p>Measure retrieval effectiveness using precision, recall, and answer relevance metrics</p>
</li>
<li><p>Implement a hybrid approach tailored to your specific knowledge base</p>
</li>
</ol>
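<p>For step 3, precision and recall can be computed directly from the sets of retrieved and relevant chunk IDs; a minimal sketch (the IDs below are invented):</p>

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved chunks that were relevant.
    # Recall: fraction of relevant chunks that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the retriever returned chunks 1, 2, 7; chunks 1, 2, 3 were relevant.
p, r = precision_recall([1, 2, 7], [1, 2, 3])
print(p, r)
```

<p>Run this per test query and average; comparing the averages across chunking strategies is what turns "which chunker is best?" from a guess into a measurement.</p>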
<p>Remember, what works for general web content may fail spectacularly for legal documents, code bases, or scientific papers.</p>
<h2 id="heading-conclusion-the-decision-that-will-define-your-rag-system">Conclusion: The Decision That Will Define Your RAG System</h2>
<p>As AI engineers, we're often drawn to the exciting parts of RAG: the latest models, complex retrievers, and advanced prompting techniques. But I've seen time and again that the engineers who master the seemingly mundane art of chunking are the ones who build systems that actually work when it matters most. For most cases, semantic chunking just might be enough.</p>
<p>What chunking method are you using in your RAG system today? And more importantly, are you absolutely certain it's the right one?</p>
<hr />
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
<p>Links:</p>
<p><a target="_blank" href="https://www.youtube.com/@hddatascience">https://www.youtube.com/@hddatascience</a></p>
<p><a target="_blank" href="https://harveyducay.blog/">https://harveyducay.blog/</a></p>
<p><a target="_blank" href="https://github.com/harvsDucs/">https://github.com/harvsDucs/</a></p>
]]></content:encoded></item><item><title><![CDATA[Semantic Search Data Engineering Pipeline: RAG Without the AI]]></title><description><![CDATA[Building a Semantic Document Search System
flowchart TD
    %% Color definitions
    classDef default fill:#2c3e50,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef processing fill:#3498db,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    cl...]]></description><link>https://hddatascience.tech/semantic-search-data-engineering-pipeline-rag-without-the-ai</link><guid isPermaLink="true">https://hddatascience.tech/semantic-search-data-engineering-pipeline-rag-without-the-ai</guid><category><![CDATA[RAG ]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><category><![CDATA[semantic search]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 28 Feb 2025 03:56:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740714846146/6336f61b-a2a2-4727-8f2b-3f4626dc3857.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-building-a-semantic-document-search-system">Building a Semantic Document Search System</h1>
<pre><code class="lang-mermaid">flowchart TD
    %% Color definitions
    classDef default fill:#2c3e50,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef processing fill:#3498db,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef storage fill:#9b59b6,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef embedding fill:#2ecc71,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef query fill:#f1c40f,stroke:#34495e,stroke-width:2px,color:#34495e
    classDef display fill:#e74c3c,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1

    A["PDF Documents"] --&gt; B["Supabase Storage Upload"]
    B --&gt; C["File Parsing via Llama Index"]
    C --&gt; D["Text Semantic Chunking via LangChain Flask API (Vercel)"]
    D --&gt; E["Text Embedding Generation via Nomic-Embed-Text Flask API (Vercel)"]

    E --&gt; F["Supabase Upload Text per Embedding ID"]
    E --&gt; G["Pinecone Upload Embedding per Embedding ID"]

    H["User Query"] --&gt; I["Convert Query to Embedding via Nomic-Embed-Text Flask API (Vercel)"]
    I --&gt; J["Compare Embeddings via Pinecone Query API (Return Top 2 References)"]
    J --&gt; K["Display References in UI Show Source Information"]
    K --&gt; L["Display Results Based on Retrieved References"]

    %% Styling nodes by category
    class A default
    class B,C storage
    class D processing
    class E,F,G,I embedding
    class H,J query
    class K,L display

    subgraph Pipeline1["Document Processing Pipeline"]
        A
        B
        C
        D
        E
        F
        G
    end

    subgraph Pipeline2["Query Processing Pipeline"]
        H
        I
        J
        K
        L
    end
</code></pre>
<p>In today's data-driven world, organizations are drowning in unstructured information. PDF documents, reports, manuals, and other text-based resources contain valuable knowledge, but accessing this information efficiently remains challenging. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are gaining popularity, not every solution requires the generative AI component.</p>
<p>In this post, I'll walk through how I built a powerful semantic search system for documents that captures the "retrieval" part of RAG without the "generation" component - providing accurate document references without synthesizing new content.</p>
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Our system consists of two primary pipelines:</p>
<h3 id="heading-document-processing-pipeline">Document Processing Pipeline</h3>
<p>This pipeline handles the ingestion and processing of documents:</p>
<ol>
<li><p><strong>PDF Document Collection</strong>: The starting point is a repository of PDF documents containing the information we want to make searchable.</p>
</li>
<li><p><strong>Supabase Storage Upload</strong>: Documents are uploaded to Supabase storage, providing a centralized location for all our documents.</p>
</li>
<li><p><strong>File Parsing via Llama Index</strong>: We utilize Llama Index to extract and structure the content from our PDFs. This tool effectively transforms unstructured documents into structured content.</p>
</li>
<li><p><strong>Text Semantic Chunking</strong>: Using LangChain's Flask API (hosted on Vercel), we divide the document content into semantic chunks - logical sections that preserve context rather than arbitrary splits.</p>
</li>
<li><p><strong>Text Embedding Generation</strong>: Each chunk is processed through Nomic-Embed-Text Flask API to generate vector embeddings. These embeddings capture the semantic meaning of text in a mathematical format.</p>
</li>
<li><p><strong>Dual Storage Strategy</strong>:</p>
<ul>
<li><p>We store the text chunks in Supabase, indexed by unique embedding IDs.</p>
</li>
<li><p>We upload the vector embeddings to Pinecone, a vector database optimized for similarity search.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-query-processing-pipeline">Query Processing Pipeline</h3>
<p>This pipeline handles user interactions:</p>
<ol>
<li><p><strong>User Query</strong>: The process begins when a user submits a text query seeking information.</p>
</li>
<li><p><strong>Query Embedding</strong>: The user's query is converted into an embedding using the same Nomic-Embed-Text model, ensuring compatibility with our document embeddings.</p>
</li>
<li><p><strong>Embedding Comparison</strong>: Pinecone's Query API compares the query embedding with stored document embeddings, returning the top 2 most semantically similar text chunks.</p>
</li>
<li><p><strong>Reference Display</strong>: The system displays these references in the UI along with source information, helping users understand where the information originated.</p>
</li>
<li><p><strong>Results Display</strong>: Finally, the system presents the retrieved information based on semantic relevance rather than keyword matching.</p>
</li>
</ol>
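<p>The comparison step can be sketched in plain Python (the vectors and IDs below are toy values; in the actual system this lookup is delegated to Pinecone's Query API):</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, index, k=2):
    # Return the k embedding IDs whose stored vectors are most similar
    # to the query vector.
    ranked = sorted(index, key=lambda eid: cosine(query_vec, index[eid]), reverse=True)
    return ranked[:k]

index = {  # embedding ID -> stored chunk vector (toy values)
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.1],
    "chunk-c": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))
```

<p>This brute-force scan is fine for a handful of chunks; vector databases like Pinecone exist precisely because it stops scaling, replacing the linear scan with approximate nearest-neighbor indexes.</p>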
<h2 id="heading-technical-implementation-details">Technical Implementation Details</h2>
<p>For this implementation, I leveraged several key technologies:</p>
<ul>
<li><p><strong>Embedding Model</strong>: Nomic-Embed-Text provides high-quality embeddings for both document chunks and user queries.</p>
</li>
<li><p><strong>Vector Database</strong>: Pinecone stores and efficiently searches through vector embeddings.</p>
</li>
<li><p><strong>Storage Solution</strong>: Supabase stores both the original documents and the text chunks.</p>
</li>
<li><p><strong>Processing Tools</strong>: Llama Index for document parsing and LangChain for semantic chunking.</p>
</li>
<li><p><strong>Deployment</strong>: All API components are deployed on Vercel for reliable scaling.</p>
</li>
</ul>
<h2 id="heading-the-benefits-of-this-approach">The Benefits of This Approach</h2>
<p>By implementing a "RAG without the AI" approach, we gain several advantages:</p>
<ol>
<li><p><strong>Reference Transparency</strong>: Users receive direct references to relevant documents rather than AI-generated summaries that might contain hallucinations.</p>
</li>
<li><p><strong>Semantic Understanding</strong>: Unlike traditional keyword search, this system understands the meaning behind queries, returning contextually relevant results.</p>
</li>
<li><p><strong>Source Verification</strong>: Each result links directly to its source document, enabling users to verify information.</p>
</li>
<li><p><strong>Reduced Complexity</strong>: Without the generative component, the system is simpler to implement, debug, and maintain.</p>
</li>
<li><p><strong>Lower Computational Requirements</strong>: Vector similarity search requires fewer resources than running large language models.</p>
</li>
</ol>
<h2 id="heading-real-world-applications">Real-World Applications</h2>
<p>This system is particularly valuable for:</p>
<ul>
<li><p><strong>Legal Firms</strong>: Searching through case law and precedents</p>
</li>
<li><p><strong>Healthcare Organizations</strong>: Finding relevant medical documentation</p>
</li>
<li><p><strong>Financial Institutions</strong>: Locating specific regulatory guidance</p>
</li>
<li><p><strong>Research Organizations</strong>: Discovering relevant papers and findings</p>
</li>
<li><p><strong>Educational Institutions</strong>: Connecting students with relevant learning materials</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a semantic document search system using embedding-based retrieval provides organizations with a powerful tool to unlock the value hidden in their unstructured data. By focusing on the retrieval component without the generative AI aspect, we create a system that:</p>
<ul>
<li><p>Delivers accurate, source-verified information</p>
</li>
<li><p>Understands the semantic meaning behind user queries</p>
</li>
<li><p>Scales efficiently with growing document collections</p>
</li>
<li><p>Maintains transparency in information retrieval</p>
</li>
</ul>
<p>For organizations with large collections of documents that need to be searchable by meaning rather than just keywords, this approach offers significant value. It bridges the gap between traditional search and full RAG systems, providing a practical solution for making institutional knowledge accessible without the complexity and potential pitfalls of generative AI.</p>
<p>The next time you're considering implementing a document search solution, remember that sometimes you don't need the "G" in RAG to deliver transformative results.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Self-Learning AI: Does Reinforcement Learning Really Eliminate Data Engineering?]]></title><description><![CDATA[Picture a machine learning model that learns like a child, through trial and error, with no need for massive pre-existing datasets. That's the allure of reinforcement learning (RL), a branch of artificial intelligence that's revolutionizing everythin...]]></description><link>https://hddatascience.tech/self-learning-ai-does-reinforcement-learning-really-eliminate-data-engineering</link><guid isPermaLink="true">https://hddatascience.tech/self-learning-ai-does-reinforcement-learning-really-eliminate-data-engineering</guid><category><![CDATA[Reinforcement Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Tue, 07 Jan 2025 04:43:01 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExbHkyamJhZGpnamlvMGF5NDh2ODcyYWQwN2t0OTEydDFjNTEwaTk5eCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ZZl8MetzKTl5tLGEUI/giphy.gif" alt="Robot Walking GIF by Sandia National Labs" class="image--center mx-auto" /></p>
<p>Picture a machine learning model that learns like a child, through trial and error, with no need for massive pre-existing datasets. That's the allure of reinforcement learning (RL), a branch of artificial intelligence that's revolutionizing everything from game-playing robots to industrial automation. While it's true that RL agents generate their own training data through interaction, the popular belief that this eliminates the need for data engineering might be too good to be true. Let's dive into the reality of data engineering in reinforcement learning and uncover whether this compelling promise holds up in practice.</p>
<h2 id="heading-the-case-for-reduced-data-engineering-in-rl">The Case for Reduced Data Engineering in RL</h2>
<h3 id="heading-self-generating-data-through-interaction">Self-Generating Data Through Interaction</h3>
<p>One of the most compelling arguments for reduced data engineering in RL is its ability to generate training data through direct interaction with environments. Unlike traditional supervised learning approaches, where data must be collected, cleaned, and labeled beforehand, RL agents learn through experience, creating their own training examples along the way.</p>
<h3 id="heading-the-power-of-the-reward-signal">The Power of the Reward Signal</h3>
<p><img src="https://deepsense.ai/wp-content/uploads/2023/02/Figure-2-Classic-reinforcement-learning-training-loop.png" alt="Reinforcement Learning from Human Feedback (RLHF) for LLMs - deepsense.ai" class="image--center mx-auto" /></p>
<p>Reinforcement learning's reliance on reward signals rather than labeled examples presents another potential reduction in data engineering overhead. Instead of requiring extensive human annotation, RL systems learn from simple feedback signals that indicate the success or failure of actions. This fundamental shift can significantly reduce the traditional data preparation burden.</p>
<h3 id="heading-leveraging-synthetic-environments">Leveraging Synthetic Environments</h3>
<p><img src="https://uwaterloo.ca/scholar/sites/ca.scholar/files/styles/os_files_xxlarge/public/ajlobbez/files/gazebo_and_real.jpg?m=1651521617&amp;itok=jRmUy7M2" alt="Robotic and Gazebo Control" class="image--center mx-auto" /></p>
<p>Many RL applications begin their training journey in simulated environments, providing a controlled and readily available data source. This approach can substantially reduce the initial data engineering requirements typically associated with real-world data collection and processing.</p>
<h2 id="heading-lunar-landing-reinforcement-learning">Lunar Landing Reinforcement Learning</h2>
<p>One of the best ways to appreciate how little code is needed to train a reinforcement learning model is the example below: with just these few lines, a lunar lander learned to land safely on its target position.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> gymnasium <span class="hljs-keyword">as</span> gym

<span class="hljs-keyword">from</span> stable_baselines3 <span class="hljs-keyword">import</span> PPO
<span class="hljs-keyword">from</span> stable_baselines3.common.env_util <span class="hljs-keyword">import</span> make_vec_env

vec_env = make_vec_env(<span class="hljs-string">"LunarLander-v3"</span>)

model = PPO(<span class="hljs-string">"MlpPolicy"</span>, vec_env, verbose=<span class="hljs-number">1</span>)
model.learn(total_timesteps=<span class="hljs-number">750000</span>)
model.save(<span class="hljs-string">"LunarLander-v3-750k-ts"</span>)
</code></pre>
<h3 id="heading-no-training">No Training</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224913885/4a7703f0-ca63-48bc-b2ff-98bd0d25ef31.gif" alt class="image--center mx-auto" /></p>
<h3 id="heading-250k-timesteps-results">250k Timesteps Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224649946/3097d439-3608-42cd-b4bc-211ff4e8c64d.gif" alt class="image--center mx-auto" /></p>
<h3 id="heading-500k-timesteps-results">500k Timesteps Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224633242/f425386d-85a6-4327-bb16-a729a03ed404.gif" alt class="image--center mx-auto" /></p>
<h3 id="heading-750k-timesteps-results">750k Timesteps Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736224642287/ad802982-395c-492c-a414-02864561ff9c.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-hidden-data-engineering-challenges">The Hidden Data Engineering Challenges</h2>
<h3 id="heading-complex-environment-engineering">Complex Environment Engineering</h3>
<p>While RL might reduce certain aspects of traditional data engineering, it introduces its own set of challenges. Creating and maintaining effective training environments requires sophisticated engineering work, including:</p>
<ul>
<li><p>Designing accurate state representations</p>
</li>
<li><p>Defining appropriate action spaces</p>
</li>
<li><p>Crafting meaningful reward functions</p>
</li>
<li><p>Developing realistic simulators</p>
</li>
</ul>
<h3 id="heading-managing-interaction-histories">Managing Interaction Histories</h3>
<p>The need to store and process interaction histories introduces significant data management challenges. Each training episode generates sequences of state-action-reward tuples that must be efficiently stored, accessed, and analyzed. This becomes particularly demanding in applications with extended training periods or complex environmental interactions.</p>
<h3 id="heading-specialized-data-pipeline-requirements">Specialized Data Pipeline Requirements</h3>
<p>RL systems often require specialized data pipeline components to handle unique requirements such as:</p>
<ul>
<li><p>Experience replay mechanisms for efficient learning</p>
</li>
<li><p>Data synchronization in distributed training setups</p>
</li>
<li><p>Storage and processing of historical policy data</p>
</li>
<li><p>Real-time monitoring and debugging capabilities</p>
</li>
</ul>
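<p>As a concrete taste of the first item, an experience replay buffer is a classic piece of RL data plumbing; a minimal sketch (the capacity and batch size below are arbitrary choices):</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the left

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a training minibatch of past transitions,
        # breaking the temporal correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):               # only the most recent 100 survive
    buf.add(t, 0, 1.0, t + 1)
batch = buf.sample(32)
print(len(buf.buffer), len(batch))
```

<p>Even this toy version hints at the engineering questions lurking underneath: how big should the buffer be, where does it live when training is distributed, and how do you persist it across runs?</p>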
<h2 id="heading-the-reality-different-rather-than-less">The Reality: Different Rather Than Less</h2>
<p>The relationship between reinforcement learning and data engineering isn't about reduction, it's about transformation. While RL might minimize certain traditional data engineering tasks, it introduces new challenges that require equally sophisticated solutions.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>While reinforcement learning offers exciting alternatives to traditional machine learning approaches, it doesn't eliminate the need for data engineering, it transforms it. Success in RL projects requires understanding and embracing these unique data engineering challenges rather than assuming they don't exist.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Navigating Your Path to Data Engineering: A Comprehensive Guide to Breaking Into the Field]]></title><description><![CDATA[The Data Dilemma: From Frustrated Coder to Strategic Problem Solver
Let me be honest—when I first started my journey in data science, I was that developer who could barely string together a machine learning model without feeling like I was trying to ...]]></description><link>https://hddatascience.tech/navigating-your-path-to-data-engineering-a-comprehensive-guide-to-breaking-into-the-field</link><guid isPermaLink="true">https://hddatascience.tech/navigating-your-path-to-data-engineering-a-comprehensive-guide-to-breaking-into-the-field</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Wed, 18 Dec 2024 08:20:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734509769932/fd7d7606-8e71-45f4-a9c0-dfaeffe5ebcf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-data-dilemma-from-frustrated-coder-to-strategic-problem-solver">The Data Dilemma: From Frustrated Coder to Strategic Problem Solver</h2>
<p>Let me be honest—when I first started my journey in data science, I was that developer who could barely string together a machine learning model without feeling like I was trying to solve a Rubik's cube blindfolded. Machine learning felt like an intricate dance where I constantly had two left feet.</p>
<p>Then, data engineering entered my life like a superhero with a Swiss Army knife of technological solutions. Suddenly, those complicated data pipelines that used to make me want to throw my laptop out the window became... manageable. Dare I say, even enjoyable?</p>
<h2 id="heading-the-growing-demand-why-data-engineering-is-your-golden-ticket">The Growing Demand: Why Data Engineering is Your Golden Ticket</h2>
<p><img src="https://www.bigdatawire.com/wp-content/uploads/2020/02/Hired_engineer_19.png" alt="Demand for Data Engineers Up 50%, Report Says" /></p>
<p>Data has become the new oil, and data engineers are the drilling experts of the 21st century. With top companies processing terabytes of information daily and platforms like LinkedIn showcasing thousands of data engineering positions, this field isn't just a career—it's a technological revolution.</p>
<p>💡 Fun Fact: The average data engineer earns approximately $130,000 annually. That's not just a salary; that's a "buy-a-Tesla-and-still-have-money-for-artisan-coffee" kind of income!</p>
<h2 id="heading-from-chaos-to-clarity">From Chaos to Clarity</h2>
<p>Imagine being the person who transforms raw, messy data into crystal-clear insights that help businesses make game-changing decisions. That's not just a job—it's almost like being a data wizard.</p>
<p>When I help a business understand its customer behavior, reduce inefficiencies, or predict market trends, I'm not just moving numbers around. I'm helping create stories from seemingly random data points, turning complexity into comprehensible narratives.</p>
<h2 id="heading-foundational-skills">Foundational Skills</h2>
<h3 id="heading-1-master-the-core-technologies">1. Master the Core Technologies</h3>
<p>To build a solid foundation in data engineering, focus on three fundamental technologies:</p>
<ol>
<li><p><strong>Python</strong>: An open-source language with extensive third-party libraries and robust virtual environment capabilities. It's like the friendly neighborhood superhero of coding—flexible, powerful, and always ready to save the day.</p>
</li>
<li><p><strong>SQL</strong>: More than just a declarative language, SQL offers advanced transaction properties that make data manipulation efficient. Think of it as a precise dance of data manipulation, where every query is a carefully choreographed move. Key advanced topics to master include:</p>
<ul>
<li><p>Group by functions</p>
</li>
<li><p>Window functions</p>
</li>
<li><p>Complex querying techniques</p>
</li>
</ul>
</li>
<li><p><strong>Command Line Tools</strong>: Like the stage managers of your data engineering theater, these help facilitate data pipeline interactions and improve productivity.</p>
</li>
</ol>
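<p>To make the SQL bullet concrete, here is a small runnable example of a window function using Python's built-in <code>sqlite3</code> module (the sales table and figures are invented; window functions require SQLite 3.25 or newer), ranking each employee's sales within their region:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, employee TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 'ana', 300), ('north', 'ben', 500),
        ('south', 'cai', 200), ('south', 'dee', 400);
""")

# RANK() with PARTITION BY gives each region its own leaderboard,
# something a plain GROUP BY cannot express without self-joins.
rows = conn.execute("""
    SELECT region, employee, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()
for row in rows:
    print(row)
```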
<h3 id="heading-2-data-storage-and-orchestration">2. Data Storage and Orchestration</h3>
<p>Understanding data storage is crucial for data engineers. Focus on:</p>
<ul>
<li><p><strong>Object Stores</strong>: Ideal for unstructured data like images, audio, and text</p>
</li>
<li><p><strong>Relational Databases</strong>: Often the solution to most data engineering challenges</p>
</li>
<li><p><strong>Data Orchestration</strong>: Learn Extract, Transform, Load (ETL) processes</p>
</li>
<li><p><strong>Apache Airflow</strong>: The industry-standard tool for workflow management</p>
</li>
</ul>
<h3 id="heading-3-advanced-data-processing-techniques">3. Advanced Data Processing Techniques</h3>
<p>Differentiate yourself by understanding:</p>
<ul>
<li><p><strong>Batch Processing</strong>: Utilizing tools like Apache Spark to handle large-scale data</p>
</li>
<li><p><strong>Stream Processing</strong>: Learning frameworks like Apache Kafka for real-time data handling</p>
</li>
<li><p><strong>Distributed Systems</strong>: Understanding concepts like map-reduce and parallel processing</p>
</li>
</ul>
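<p>The map-reduce idea behind tools like Spark can be sketched in plain Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This toy word count is a single-machine illustration of the concept only:</p>

```python
from collections import defaultdict

docs = ["big data big pipelines", "big data tools"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group independently -- this is the step a
# cluster can run in parallel across machines.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'pipelines': 1, 'tools': 1}
```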
<h2 id="heading-learning-strategies-turning-passion-into-profession">Learning Strategies: Turning Passion into Profession</h2>
<h3 id="heading-the-no-pressure-learning-approach">The "No Pressure" Learning Approach</h3>
<ul>
<li><p>Take at least three months</p>
</li>
<li><p>Build projects that make your heart sing</p>
</li>
<li><p>Choose resources that don't make you want to fall asleep</p>
<p>  Pro Tip: If a learning resource feels more boring than watching paint dry, it's time to find a new one!</p>
</li>
</ul>
<h2 id="heading-real-world-impact-beyond-the-code">Real-World Impact: Beyond the Code</h2>
<p><img src="https://www.altexsoft.com/media/2019/06/word-image-48.png" alt="Data Engineering: Data Warehouse, Data Pipeline and Data Eng" /></p>
<p>Data engineering isn't just about technical skills. It's about:</p>
<ul>
<li><p>Helping businesses make smarter decisions</p>
</li>
<li><p>Transforming complex data into actionable insights</p>
</li>
<li><p>Creating value that goes beyond lines of code</p>
</li>
</ul>
<h2 id="heading-conclusion-your-strategic-roadmap-to-data-engineering-success">Conclusion: Your Strategic Roadmap to Data Engineering Success</h2>
<p>Some days, you'll feel like a coding genius. Other days, you'll wonder if you accidentally signed up for technological self-torture. Spoiler alert: It's totally worth it. The journey into data engineering is more than a career choice—it's a strategic investment in your professional future. As businesses increasingly rely on data-driven decision-making, the role of a data engineer has transformed from a technical support position to a critical strategic partner in organizational success.</p>
]]></content:encoded></item><item><title><![CDATA[Uncovering Semantic Relationships with the Universal Sentence Encoder]]></title><description><![CDATA[As the amount of text data we interact with on a daily basis continues to grow, the ability to quickly identify meaningful connections between pieces of information becomes increasingly valuable. This is where semantic similarity models can be incred...]]></description><link>https://hddatascience.tech/uncovering-semantic-relationships-with-the-universal-sentence-encoder</link><guid isPermaLink="true">https://hddatascience.tech/uncovering-semantic-relationships-with-the-universal-sentence-encoder</guid><category><![CDATA[TensorFlow]]></category><category><![CDATA[embedding]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Thu, 05 Dec 2024 08:41:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733383113591/84a5d284-f132-49b6-8462-92addc31290a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As the amount of text data we interact with on a daily basis continues to grow, the ability to quickly identify meaningful connections between pieces of information becomes increasingly valuable. This is where semantic similarity models can be incredibly useful, by capturing the underlying meaning of text, rather than just looking at surface-level similarities.</p>
<p>One powerful tool for building semantic similarity models is the Universal Sentence Encoder, provided by the TensorFlow Hub library. In this article, I'll walk through how you can leverage this pre-trained model to uncover interesting relationships in your own text data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733383056788/b5ff7f65-2073-4f0e-b079-d198727fe3d4.webp" alt class="image--center mx-auto" /></p>
<h2 id="heading-getting-started-with-the-universal-sentence-encoder">Getting Started with the Universal Sentence Encoder</h2>
<p>The Universal Sentence Encoder is a machine learning model that has been trained on a large corpus of text to produce high-quality vector representations of sentences and phrases. These vector embeddings encode the semantic meaning of the input, allowing you to easily compare the relatedness of different pieces of text.</p>
<p>To get started, you'll first need to import the necessary libraries:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> absl <span class="hljs-keyword">import</span> logging

<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

<span class="hljs-keyword">import</span> tensorflow_hub <span class="hljs-keyword">as</span> hub
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
</code></pre>
<p>With the imports set up, you can then load the pre-trained Universal Sentence Encoder model from TensorFlow Hub:</p>
<pre><code class="lang-python">module_url = <span class="hljs-string">"https://tfhub.dev/google/universal-sentence-encoder/4"</span> <span class="hljs-comment">#@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]</span>
model = hub.load(module_url)
<span class="hljs-keyword">print</span> (<span class="hljs-string">"module %s loaded"</span> % module_url)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">embed</span>(<span class="hljs-params">input</span>):</span>
  <span class="hljs-keyword">return</span> model(input)
</code></pre>
<p>This model can now be used to encode your text data into semantic embeddings, which you can then use to compute similarity scores. An embed helper function is also defined for convenience.</p>
<p>Next, we define the functions needed for plotting. Plotting lets us visualize the similarities in the semantics calculated by the model. The code snippet is shown below:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_similarity</span>(<span class="hljs-params">labels, features, rotation</span>):</span>
  corr = np.inner(features, features)
  sns.set(font_scale=<span class="hljs-number">1.2</span>)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=<span class="hljs-number">0</span>,
      vmax=<span class="hljs-number">1</span>,
      cmap=<span class="hljs-string">"YlOrRd"</span>)
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title(<span class="hljs-string">"Semantic Textual Similarity"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_and_plot</span>(<span class="hljs-params">messages_</span>):</span>
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, <span class="hljs-number">90</span>)
</code></pre>
<h2 id="heading-computing-semantic-similarity">Computing Semantic Similarity</h2>
<p>Let's say you have a dataset of customer messages, and you want to identify which messages are discussing similar topics. You can use the Universal Sentence Encoder to generate embeddings for each message, and then calculate the pairwise cosine similarity between those embeddings.</p>
<p>The resulting similarity matrix will contain values roughly between 0 and 1, where a score near 1 indicates that two messages are semantically almost identical, and a score near 0 indicates they are essentially unrelated.</p>
<p>You can then use this matrix to cluster messages, identify outliers, or visualize the semantic relationships between your data points.</p>
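<p>Cosine similarity itself is just a normalized dot product. A minimal pure-Python sketch, using tiny 3-dimensional toy vectors that stand in for the real 512-dimensional USE embeddings:</p>

```python
import math

def cosine_similarity(a, b):
    # cosine = dot(a, b) / (norm(a) * norm(b)); for unit-length
    # embeddings this reduces to the plain inner product.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, illustration only.
emb_a = [0.9, 0.1, 0.0]
emb_b = [0.8, 0.2, 0.1]
emb_c = [0.0, 0.1, 0.9]

print(round(cosine_similarity(emb_a, emb_b), 3))  # high: similar direction
print(round(cosine_similarity(emb_a, emb_c), 3))  # low: nearly orthogonal
```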
<h2 id="heading-exploring-and-evaluating-the-model">Exploring and Evaluating the Model</h2>
<p>One way to get a better understanding of how the Universal Sentence Encoder is capturing semantic meaning is to examine the similarity scores for a few sample messages. For example:</p>
<pre><code class="lang-python">messages = [
    <span class="hljs-comment"># Smartphones</span>
    <span class="hljs-string">"I like my phone"</span>,
    <span class="hljs-string">"My phone is not good."</span>,
    <span class="hljs-string">"Your cellphone looks great."</span>,

    <span class="hljs-comment"># Weather</span>
    <span class="hljs-string">"Will it snow tomorrow?"</span>,
    <span class="hljs-string">"Recently a lot of hurricanes have hit the US"</span>,
    <span class="hljs-string">"Global warming is real"</span>,

    <span class="hljs-comment"># Food and health</span>
    <span class="hljs-string">"An apple a day, keeps the doctors away"</span>,
    <span class="hljs-string">"Eating strawberries is healthy"</span>,
    <span class="hljs-string">"Is paleo better than keto?"</span>,

    <span class="hljs-comment"># Asking about age</span>
    <span class="hljs-string">"How old are you?"</span>,
    <span class="hljs-string">"what is your age?"</span>,
]

run_and_plot(messages)
</code></pre>
<p>By reviewing these examples, you can start to get a sense of how well the model is performing and where it may be struggling. Additionally, you can manually review high and low similarity pairs to further evaluate the model's effectiveness for your specific use case.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733387473021/f7d118df-91b0-4a5b-9ef9-4ac50185f964.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The Universal Sentence Encoder is a powerful tool that can help unlock the semantic insights hidden within your text data. By leveraging this pre-trained model, you can quickly generate high-quality vector representations of your content and uncover meaningful relationships that would be difficult to spot through manual review alone.</p>
<p>Of course, as with any machine learning model, it's important to carefully evaluate its performance and understand its limitations. But with a bit of experimentation and exploration, you can harness the power of semantic similarity to drive valuable discoveries in your data.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Why should you have a Data Science Team for your business?]]></title><description><![CDATA[Let's be honest, if you've ever tried to make sense of a massive Excel spreadsheet at 3 AM, desperately searching for that one insight that could make or break your quarterly presentation, you know the pain. Trust me, I've been there, frantically goo...]]></description><link>https://hddatascience.tech/why-should-you-have-a-data-science-team-for-your-business</link><guid isPermaLink="true">https://hddatascience.tech/why-should-you-have-a-data-science-team-for-your-business</guid><category><![CDATA[Data Science]]></category><category><![CDATA[business]]></category><category><![CDATA[Cryptocurrency]]></category><category><![CDATA[decision making]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sat, 23 Nov 2024 23:57:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732375057545/c2643ccf-2a9b-4d0b-b7ae-02e05f750ae0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let's be honest, if you've ever tried to make sense of a massive Excel spreadsheet at 3 AM, desperately searching for that one insight that could make or break your quarterly presentation, you know the pain. Trust me, I've been there, frantically googling "how to pivot table" while my coffee got cold. But what if I told you there's a better way?</p>
<h2 id="heading-a-personal-story-from-crypto-chaos-to-data-driven-success">A Personal Story: From Crypto Chaos to Data-Driven Success</h2>
<p>Before we dive deep into business data science, let me share a story that might resonate with you. Back in 2020, I was like many others, trying to navigate the volatile crypto markets with nothing but gut feelings and Twitter threads. Every trade felt like a game of chance. Should I buy the dip? Is this the top? My portfolio looked like a roller coaster designed by someone who'd had too much coffee.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732404446086/947c5041-8c10-4d7a-b43e-cfdc86939020.png" alt class="image--center mx-auto" /></p>
<p>Then I discovered the power of data science. I began analyzing market patterns, building predictive models, and suddenly, those seemingly random price movements started making sense. My decisions became calculated rather than emotional. Sure, the market is still volatile, but now I sleep better knowing my trades are backed by data, not just hopes and dreams.</p>
<p>This same transformation can happen for your business. Whether you're in retail, finance, healthcare, or selling artisanal pet rocks online (hey, no judgment!), data science can turn uncertainty into clarity.</p>
<p>Everybody wants confidence in their decision making. Everybody wants the peace of mind of knowing they made the right business call when there is a lot at stake. Sometimes domain experience isn’t enough and you’re going to make mistakes in decision making; hence the existence of data science.</p>
<h2 id="heading-how-i-learned-to-stop-worrying-and-love-the-algorithm">How I Learned to Stop Worrying and Love the Algorithm</h2>
<p>Remember when making business decisions felt like throwing darts blindfolded? Yeah, those were not the good old days. As someone who started their journey as a data engineer, I can tell you firsthand that a lot of people underestimate the value of data.</p>
<p>When a business doesn’t have a data science team, most decisions come from executives working on intuition alone. Having objective reasoning behind every decision not only steers business growth in a more predictable direction, but also gives stakeholders peace of mind.</p>
<p>For business employees, having a data science team can mean a lower chance of the business going bankrupt and of being laid off. For stockholders, it can mean peace of mind knowing their holdings are more stable because business decisions are backed by data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732405244665/63c03846-72ea-41cd-b274-6fe9fa6286df.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-human-learning">Human Learning</h2>
<p>Don’t get me wrong though, I don’t discount the capabilities of people who have been in their domain for a long time. The intuition business leaders build through years of experience often proves valuable. What works best is augmenting those leaders’ judgment with the objective conclusions of data, so that decisions become more reliable and stable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732404690937/d72f36a6-7e11-46b7-9a3a-4caaecdc5bf3.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-building-your-dream-team">Building Your Dream Team</h2>
<ol>
<li><p><strong>Data Scientists</strong>: They're like the Tony Stark of your team, brilliant minds who turn complex problems into elegant solutions. Just don't expect them to build an Iron Man suit... :). They are the ones who do the math-heavy work: data pre-processing, machine learning, deep learning, and more.</p>
</li>
<li><p><strong>Data Engineers</strong>: The unsung heroes who build and maintain your data infrastructure. We're like the plumbers of the digital world, nobody thinks about us until something goes wrong, and then we're everyone's best friend. A data team should start off with a data engineer or a data engineering team, because that is what gives the business the capability to use its data at all.</p>
</li>
<li><p><strong>Data Analysts</strong>: The storytellers who turn numbers into narratives. They're the ones who make sure your executives don't fall asleep during presentations. Data analysts bridge the gap between the objective outputs of data science and what those outputs mean for decision making, communicating results to the non-technical people who are often the decision makers.</p>
</li>
</ol>
<p>On a more serious note, an illustration below might help you understand the roles better.</p>
<p><img src="https://www.dmbi.org/wp-content/uploads/Data-science-analyst-engineer-2-1.jpeg" alt /></p>
<h2 id="heading-the-trading-parallel-your-business-decisions">The Trading Parallel: Your Business Decisions</h2>
<p>Just like how data transformed my crypto trading from a guessing game into a strategic operation, it can revolutionize your business decisions. Remember:</p>
<ul>
<li><p>Without data: "This product might sell well because my cousin Linda likes it"</p>
</li>
<li><p>With data: "Our analysis shows a 78% probability of market success based on consumer behavior patterns"</p>
</li>
</ul>
<p>Allotting resources to a data science team costs little compared to its potential upside for your business. It’s time to take action and start learning how data can help you in your domain.</p>
<p><img src="https://kentrix.in/wp-content/uploads/2023/08/Screenshot-2023-08-30-125434.png" alt="Screenshot 2023 08 30 125434" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The truth is, your business leaders' experience and intuition are invaluable, but in today's complex market they shouldn't have to navigate alone. A data science team turns gut feelings into validated decisions and transforms uncertainty into measurable outcomes. Think of it as adding a high-powered telescope to your captain's decades of sailing experience. You still need both to chart the best course forward.</p>
<p>So whether you're aiming to boost sales, reduce costs, or simply sleep better knowing your decisions are data-backed, consider this: in a world where everyone has access to data, the real advantage lies in how well you use it. If you weren’t using data to your advantage now, then when?</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[Top 5 Common Machine Learning Mistakes Beginners Do]]></title><description><![CDATA[Machine learning has become an essential tool in data science, but it's surprisingly easy to make fundamental mistakes that can severely impact your model's performance. In fact my motivation for this blog came out of peers underestimating the comple...]]></description><link>https://hddatascience.tech/top-5-common-machine-learning-mistakes-beginners-do</link><guid isPermaLink="true">https://hddatascience.tech/top-5-common-machine-learning-mistakes-beginners-do</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Beginner Developers]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Fri, 15 Nov 2024 13:41:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731676965781/a18b642e-b88a-40ec-89a9-4ce17144303e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Machine learning has become an essential tool in data science, but it's surprisingly easy to make fundamental mistakes that can severely impact your model's performance. In fact my motivation for this blog came out of peers underestimating the complexities of Linear Regression as a way to model data. In this guide, we'll explore the most common pitfalls and how to avoid them.</p>
<h2 id="heading-1-the-just-throw-it-into-a-model-syndrome">1. The "Just Throw It Into a Model" Syndrome</h2>
<p>One of the most prevalent mistakes is treating machine learning like a magic black box. Many newcomers simply load their dataset into scikit-learn's LinearRegression() and expect meaningful results. This approach ignores crucial preprocessing steps and can lead to severely underperforming models.</p>
<h3 id="heading-key-problems">Key Problems:</h3>
<ul>
<li><p>No train-test split</p>
</li>
<li><p>Missing data preprocessing</p>
</li>
<li><p>Lack of feature engineering</p>
</li>
<li><p>Ignoring data leakage</p>
</li>
</ul>
<h2 id="heading-2-data-preprocessing-oversights">2. Data Preprocessing Oversights</h2>
<h3 id="heading-feature-scaling">Feature Scaling</h3>
<p>Not normalizing or standardizing features is a common oversight that can significantly impact model performance. Different scales across features can cause:</p>
<ul>
<li><p>Gradient descent algorithms to converge slowly</p>
</li>
<li><p>Some features to dominate others unnecessarily</p>
</li>
<li><p>Poor performance in distance-based algorithms like k-NN</p>
</li>
</ul>
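<p>As a quick illustration of why scale matters, here is a minimal z-score standardization in plain Python (scikit-learn's StandardScaler performs the same transformation in practice); the income and age values are made up:</p>

```python
import statistics

# Two features on wildly different scales: income (dollars) and age (years).
income = [30_000, 60_000, 90_000]
age = [25, 35, 45]

def standardize(values):
    # z-score: subtract the mean and divide by the standard deviation,
    # so every feature ends up centered at 0 with comparable spread.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

print(standardize(income))  # roughly [-1.22, 0.0, 1.22]
print(standardize(age))     # same scale despite very different raw units
```

<p>After standardizing, neither feature can dominate a distance computation or a gradient update simply because its raw numbers are larger.</p>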
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731677851847/2a6b2c9a-0222-46ef-8f27-073ac8dfd0cd.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-dimensionality-issues">Dimensionality Issues</h3>
<p>Many practitioners fail to address the curse of dimensionality. High-dimensional data often needs:</p>
<ul>
<li><p>Principal Component Analysis (PCA)</p>
</li>
<li><p>Feature selection methods</p>
</li>
<li><p>Other dimensionality reduction techniques</p>
</li>
</ul>
<h2 id="heading-3-evaluation-metric-mismatches">3. Evaluation Metric Mismatches</h2>
<p>Choosing the wrong evaluation metric is like using a ruler to measure weight. Different problems require different metrics:</p>
<h3 id="heading-classification-metrics">Classification Metrics</h3>
<ul>
<li><p><strong>Imbalanced Data</strong>: Using accuracy for imbalanced datasets can be misleading</p>
</li>
<li><p><strong>False Positives vs. False Negatives</strong>: Not considering the business impact of different types of errors</p>
</li>
<li><p><strong>Common Solutions</strong>:</p>
<ul>
<li><p>F1-score for balanced precision and recall</p>
</li>
<li><p>Area Under ROC Curve (AUC-ROC)</p>
</li>
<li><p>Precision for minimizing false positives</p>
</li>
<li><p>Recall for minimizing false negatives</p>
</li>
</ul>
</li>
</ul>
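<p>These metrics are simple to compute once you have the confusion-matrix counts. The counts below are made up to show how accuracy can mislead on imbalanced data while precision, recall, and F1 reveal the problem:</p>

```python
# Imbalanced toy example (1000 samples, only 10 positives):
# 5 true positives, 5 false negatives, 10 false positives, 980 true negatives.
tp, fn, fp, tn = 5, 5, 10, 980

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")   # 0.985 -- looks great...
print(f"precision: {precision:.3f}")  # 0.333
print(f"recall:    {recall:.3f}")     # 0.500
print(f"f1:        {f1:.3f}")         # 0.400 -- ...but the model is weak
```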
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731677919592/9f407f4d-a06b-421d-8c29-12fecfe2e70b.webp" alt class="image--center mx-auto" /></p>
<h2 id="heading-4-validation-vulnerabilities">4. Validation Vulnerabilities</h2>
<h3 id="heading-cross-validation-mistakes">Cross-Validation Mistakes</h3>
<p>Simple train-test splits aren't enough. Common issues include:</p>
<ul>
<li><p>Not using k-fold cross-validation</p>
</li>
<li><p>Applying cross-validation incorrectly</p>
</li>
<li><p>Ignoring temporal aspects in time-series data</p>
</li>
</ul>
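<p>A bare-bones k-fold split can be written in a few lines. In practice you would reach for scikit-learn's KFold, but this sketch shows the mechanics: every sample lands in exactly one validation fold:</p>

```python
def k_fold_indices(n_samples, k):
    # Partition sample indices into k contiguous folds; each fold serves
    # once as the validation set, the remaining folds as training data.
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = n_samples if i == k - 1 else start + fold_size
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, val))
    return folds

for train_idx, val_idx in k_fold_indices(10, 5):
    print(val_idx)  # each index appears in exactly one validation fold
```

<p>Note that for time-series data you should not shuffle or reuse future folds; validation folds must come after the training data in time.</p>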
<h3 id="heading-data-leakage">Data Leakage</h3>
<p>Subtle forms of data leakage can creep in through:</p>
<ul>
<li><p>Preprocessing before splitting the data</p>
</li>
<li><p>Using future information in time-series</p>
</li>
<li><p>Including target-related features</p>
</li>
</ul>
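<p>The classic leakage bug is fitting a preprocessing step on the full dataset before splitting, which lets test-set statistics bleed into training. A safe ordering, sketched with made-up numbers and plain Python in place of a scikit-learn pipeline:</p>

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is a held-out test point

# Split FIRST...
train, test = data[:4], data[4:]

# ...then fit the scaler's statistics on the training set only.
mean = statistics.mean(train)    # 2.5 -- untouched by the test outlier
stdev = statistics.pstdev(train)

scaled_test = [(x - mean) / stdev for x in test]
print(mean, scaled_test)
```

<p>Had the mean been computed on all five values, the outlier in the test set would have shifted the training features, quietly inflating evaluation scores.</p>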
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731678002421/0bd98b0f-67ef-48b3-a2e4-254224884863.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-5-overcomplicating-solutions">5. Overcomplicating Solutions</h2>
<p>Sometimes simpler is better. Common overcomplications include:</p>
<ul>
<li><p>Using deep learning when linear regression would suffice</p>
</li>
<li><p>Adding unnecessary features without validation</p>
</li>
<li><p>Over-tuning hyperparameters without significant gains</p>
</li>
</ul>
<h2 id="heading-best-practices-checklist">Best Practices Checklist</h2>
<ul>
<li><p>Start with data exploration and visualization (EDA)</p>
</li>
<li><p>Implement proper train-test splits</p>
</li>
<li><p>Apply appropriate preprocessing techniques</p>
</li>
<li><p>Choose metrics based on business objectives</p>
</li>
<li><p>Use cross-validation for robust evaluation</p>
</li>
<li><p>Monitor for data leakage</p>
</li>
<li><p>Start simple and iterate based on results</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Avoiding these common mistakes can significantly improve your machine learning models' performance. Remember that machine learning is not about throwing data at algorithms – it's about understanding your data, choosing appropriate methods, and carefully validating your results.</p>
<p>Would you like to build better models? Start by auditing your current practices against these common pitfalls. Your future self (and your models) will thank you.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>As a versatile data professional, I have expertise in both data engineering (my most recent job experience) and data science (my undergrad), including machine learning and AI. I'd be excited to collaborate on an interesting project that leverages my diverse skillset.</p>
<p>Also, I do a little bit of Next.JS on the side 😉.</p>
<p>Connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a>, and let's discuss potential opportunities.</p>
]]></content:encoded></item><item><title><![CDATA[What is ETL in Data Engineering?]]></title><description><![CDATA[So imagine you're making a smoothie - that's basically what ETL is in the data world. First, you Extract all your ingredients (data) from different places, like grabbing berries from the fridge, bananas from the counter, and yogurt from the store - j...]]></description><link>https://hddatascience.tech/what-is-etl-in-data-engineering</link><guid isPermaLink="true">https://hddatascience.tech/what-is-etl-in-data-engineering</guid><category><![CDATA[etl-pipeline]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Wed, 13 Nov 2024 05:16:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731473795411/ad56211d-b81b-4945-97b3-1679edc75c04.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So imagine you're making a smoothie - that's basically what ETL is in the data world. First, you Extract all your ingredients (data) from different places, like grabbing berries from the fridge, bananas from the counter, and yogurt from the store - just like pulling data from different systems and files. Then comes the Transform part, where you wash the fruit, cut it up, and make sure everything's ready to blend - similar to cleaning up messy data, fixing errors, and making sure everything fits together nicely. Finally, you Load it all into your blender (or in data terms, your final database or warehouse) where it becomes something useful that everyone can consume!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731474331933/1b6e1ed0-3f6d-4c3c-b880-d586d78952d9.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-e-t-l-meaning">“E-T-L” meaning</h2>
<h3 id="heading-extract">Extract</h3>
<p>Data extraction is the first crucial step in the ETL process, involving the gathering of data from various source systems. This phase focuses on pulling raw data from multiple sources and preparing it for the transformation phase. The extraction process can be as simple as copying data from a single database or as complex as gathering information from dozens of disparate systems.</p>
<h3 id="heading-transform">Transform</h3>
<p>The transformation phase is where raw data becomes valuable business information. This critical stage involves converting extracted data into a format that's suitable for analysis and storage. During transformation, data undergoes various operations to ensure it meets business rules, quality standards, and technical requirements of the target system.</p>
<h3 id="heading-load">Load</h3>
<p>The loading phase represents the final step where transformed data is written into the target system. This phase requires careful planning to ensure data is loaded efficiently while maintaining system performance and data integrity.</p>
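<p>The three phases map naturally onto a few lines of Python. This sketch extracts rows from an in-memory CSV (standing in for a real file, API, or upstream database), transforms them by cleaning and filtering, and loads them into SQLite; the schema and values are invented for the example:</p>

```python
import csv
import io
import sqlite3

# Extract: read raw rows from the source.
raw = "name,amount\nalice,100\nbob,\ncarol,250\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: drop records with missing amounts and normalize types.
clean = [
    {"name": r["name"].title(), "amount": int(r["amount"])}
    for r in rows
    if r["amount"]
]

# Load: write the transformed rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (:name, :amount)", clean)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

<p>Real pipelines add error handling, incremental loads, and scheduling on top, but the extract-transform-load skeleton stays the same.</p>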
<h2 id="heading-the-three-critical-pillars-of-etls-business-impact">The Three Critical Pillars of ETL's Business Impact</h2>
<h2 id="heading-1-data-driven-decision-making">1. Data-Driven Decision Making</h2>
<h3 id="heading-strategic-advantage">Strategic Advantage</h3>
<ul>
<li><p>Consolidates data from multiple sources into a single source of truth</p>
</li>
<li><p>Integrates sales, marketing, financial, and operational data</p>
</li>
<li><p>Enables real-time business intelligence and reporting</p>
</li>
</ul>
<h3 id="heading-business-impact">Business Impact</h3>
<ul>
<li><p>Faster, more accurate decision making</p>
</li>
<li><p>Reduced analysis time and effort</p>
</li>
<li><p>Better resource allocation based on actual data</p>
</li>
<li><p>Improved forecasting and planning capabilities</p>
</li>
</ul>
<h2 id="heading-2-operational-excellence">2. Operational Excellence</h2>
<h3 id="heading-process-optimization">Process Optimization</h3>
<ul>
<li><p>Automates manual data processing tasks</p>
</li>
<li><p>Standardizes data handling across the organization</p>
</li>
<li><p>Eliminates redundant data entry and processing</p>
</li>
<li><p>Ensures consistent data quality and formatting</p>
</li>
</ul>
<h3 id="heading-cost-benefits">Cost Benefits</h3>
<ul>
<li><p>Reduces manual labor costs by 40-60% on average</p>
</li>
<li><p>Minimizes errors and associated correction costs</p>
</li>
<li><p>Improves employee productivity</p>
</li>
<li><p>Speeds time-to-market for data-dependent projects</p>
</li>
</ul>
<h2 id="heading-3-customer-experience-enhancement">3. Customer Experience Enhancement</h2>
<h3 id="heading-customer-understanding">Customer Understanding</h3>
<ul>
<li><p>Creates comprehensive 360-degree customer views</p>
</li>
<li><p>Combines data from all customer touchpoints</p>
</li>
<li><p>Enables personalized marketing and service delivery</p>
</li>
<li><p>Supports predictive customer behavior analysis</p>
</li>
</ul>
<h3 id="heading-business-growth">Business Growth</h3>
<ul>
<li><p>Improves customer retention through better service</p>
</li>
<li><p>Increases cross-selling and upselling opportunities</p>
</li>
<li><p>Enables targeted marketing campaigns</p>
</li>
<li><p>Supports customer satisfaction monitoring and improvement</p>
</li>
</ul>
<h2 id="heading-how-can-etl-go-wrong">How can ETL go wrong?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731474863226/586e7703-f80d-49f5-ac83-7be379820bfe.gif" alt class="image--center mx-auto" /></p>
<p>Data extraction fails when source systems unexpectedly change formats or experience downtime, breaking automated processes. Legacy systems with outdated security protocols can become inaccessible, forcing rushed workarounds. Poor connectivity leads to incomplete datasets.</p>
<p>Transformation can silently corrupt data when business rules aren't properly maintained, while faulty loading processes create database deadlocks. Multiple competing ETL jobs can bring production systems to a halt during peak hours, causing business disruption.</p>
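<p>In code, the usual defense against flaky extraction is a retry with backoff, plus a sanity check on what came back before anything gets loaded. Here's a minimal sketch – the function and source names are hypothetical, not from any particular framework:</p>

```python
import time

def extract_with_retry(fetch, max_attempts=3, backoff_seconds=1.0):
    """Retry a flaky extraction step with exponential backoff.

    `fetch` is any zero-argument callable that returns rows, or raises
    when the source is down or has changed format.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # give up loudly so the orchestrator can alert
            time.sleep(backoff_seconds * 2 ** (attempt - 1))

def validate_row_count(rows, expected_min):
    """Guard against silently incomplete extracts before loading."""
    if len(rows) < expected_min:
        raise ValueError(f"got {len(rows)} rows, expected at least {expected_min}")
    return rows

# Simulate a source that fails twice, then recovers
attempts = {"n": 0}
def flaky_source():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("source temporarily down")
    return [{"id": 1}, {"id": 2}]

rows = validate_row_count(
    extract_with_retry(flaky_source, backoff_seconds=0.01),
    expected_min=1,
)
```

<p>Orchestrators like Airflow ship retry mechanisms that handle this for you; the point is the principle – fail loudly, and never load a partial extract in silence.</p>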
<h2 id="heading-conclusion">Conclusion</h2>
<p>ETL, or Extract, Transform, Load, is a crucial data integration process that consolidates data from multiple sources into a central repository, enabling organizations to perform analytics and drive informed decision-making. While ETL provides significant advantages such as improved data quality, automation, and enhanced business intelligence, it can also encounter challenges related to data quality issues, complex transformations, and performance bottlenecks. To mitigate potential pitfalls, businesses should follow best practices, such as implementing robust error handling, ensuring data quality, and utilizing automation to streamline and optimize the ETL process.</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[How to Learn Machine Learning?]]></title><description><![CDATA[What is Machine Learning?
Machine learning is basically teaching computers to learn from data - kind of like how we humans learn from experience, except computers don't get tired or need coffee breaks! It's a branch of artificial intelligence that's ...]]></description><link>https://hddatascience.tech/how-to-learn-machine-learning</link><guid isPermaLink="true">https://hddatascience.tech/how-to-learn-machine-learning</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Mathematics]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Thu, 07 Nov 2024 23:41:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731021856997/e1a7acee-722a-406f-b8c3-7d63096f8db1.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-what-is-machine-learning"><strong>What is Machine Learning?</strong></h2>
<p>Machine learning is basically teaching computers to learn from data - kind of like how we humans learn from experience, except computers don't get tired or need coffee breaks! It's a branch of artificial intelligence that's taking over pretty much every industry you can think of, from helping doctors detect diseases to recommending that next Netflix show you'll probably binge-watch.</p>
<p>As someone who's been a data engineer for years now, I've seen countless people get overwhelmed when starting their machine learning journey. Trust me, I've been there - staring at mathematical equations that looked more like ancient hieroglyphics than something I'd ever understand. But don't worry, I'll break down the different paths you can take, depending on your goals and how much math you're willing to tolerate (just kidding... kind of).</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*2nH4-iMWzxMa9_Xpi1LkkQ.png" alt="Starting Machine Learning Journey At Different Career Levels | by ..." /></p>
<h2 id="heading-different-ways-to-learn-machine-learning"><strong>Different Ways to Learn Machine Learning</strong></h2>
<p>Let me tell you something funny - there are actually two types of people in the machine learning world: those who dive straight into coding with libraries like scikit-learn (the "just make it work" crowd), and those who start with calculus textbooks (the "but why does it work?" crowd). Both approaches are totally valid!</p>
<p>As someone who's tried both paths (and crashed and burned a few times), here's my honest breakdown:</p>
<h2 id="heading-the-quick-and-dirty-way"><strong>The Quick and Dirty Way</strong></h2>
<p>Want to start building ML models ASAP? This is what I like to call the "scikit-learn and pray" approach. You can:</p>
<ul>
<li><p>Learn Python (it's friendlier than it sounds, I promise!)</p>
</li>
<li><p>Jump straight into machine learning libraries</p>
</li>
<li><p>Start building models without diving too deep into the math</p>
</li>
</ul>
<p>I actually started this way when my boss needed a prediction model "by yesterday." Did I fully understand what was happening under the hood? Nope! Did it work? Well... eventually!</p>
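<p>To make the "quick and dirty" path concrete, here's roughly what that "by yesterday" model can look like – a few lines of scikit-learn on synthetic stand-in data (assuming scikit-learn is installed):</p>

```python
# "Just make it work": fit a model and check it generalizes, no math required
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for "house prices" style tabular data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score on held-out data - the one habit worth keeping even in a hurry
score = r2_score(y_test, model.predict(X_test))
```

<p>Swap in your own CSV and you have a working baseline in an afternoon. Understanding <em>why</em> it works comes later.</p>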
<p><img src="https://www.springboard.com/blog/wp-content/uploads/2021/04/tensorflow-vs-scikit-learn-how-do-they-compare.png" alt="TensorFlow vs. Scikit-Learn: How Do They Compare?" /></p>
<h2 id="heading-the-deep-dive-approach"><strong>The Deep Dive Approach</strong></h2>
<p>This is for the brave souls who want to understand every single detail. You'll need:</p>
<ul>
<li><p>Calculus (yes, those derivatives are coming back to haunt you)</p>
</li>
<li><p>Linear Algebra (matrices are your new best friends)</p>
</li>
<li><p>Statistics (probability distributions will be your breakfast reading)</p>
</li>
</ul>
<p>I remember spending years in school with these math concepts, fueled by energy drinks and questionable life choices. But I'll tell you what - once it clicks, it's like having a superpower!</p>
<p>You can get a long way doing machine learning with pre-existing libraries such as scikit-learn, but at some point you'll feel the knowledge debt: gaps between what you're doing and what you actually understand.</p>
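<p>Here's a taste of what's hiding under the hood – plain linear regression fit by gradient descent in nothing but NumPy. The calculus shows up in the gradient of the mean squared error, and the linear algebra in the matrix products:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: 200 samples, 3 features, known true weights plus a little noise
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Gradient descent on mean squared error: gradient = (2/n) * X^T (Xw - y)
w = np.zeros(3)
learning_rate = 0.1
for _ in range(500):
    gradient = (2 / len(y)) * X.T @ (X @ w - y)
    w -= learning_rate * gradient
```

<p>After 500 steps, <code>w</code> lands very close to <code>true_w</code>. Every <code>model.fit()</code> you call runs some cousin of this loop – and when training diverges, it's this math that tells you why.</p>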
<p><img src="https://media.licdn.com/dms/image/v2/D5612AQFPqMC0F-cj3A/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1695092840155?e=2147483647&amp;v=beta&amp;t=ZXk1l73MQ4EgG63-SzV7T5C9Mi_e44A6ULt_zTiEdOo" alt="ML 1.8 Why We Need Math Knowledge for Machine Learning" /></p>
<h2 id="heading-why-should-you-care"><strong>Why Should You Care?</strong></h2>
<p>Look, I get it - learning machine learning can feel like trying to eat an elephant. But here's the thing: the field is exploding faster than my coffee addiction (and that's saying something). Every day I see companies scrambling to hire people who understand this stuff. Whether you're a fresh graduate or a seasoned developer, this knowledge is becoming as essential as knowing how to use a spreadsheet was in the '90s. If you’re here, well, I know you know how much machine learning engineers are getting paid by the hour ;).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731022828832/eccc1135-6c3d-4816-a7d0-d33a4667b467.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-tips-from-someone-whos-been-there"><strong>Tips From Someone Who's Been There</strong></h3>
<ol>
<li><p><strong>Start Small</strong>: My first ML project was predicting house prices. It wasn't revolutionary, but hey, it worked! And I didn't cry... much. Starting small keeps you from getting overwhelmed – and, more importantly, it gets you started at all.</p>
</li>
<li><p><strong>Join Communities</strong>: Trust me, you'll need people to commiserate with when your model's accuracy is lower than your high school math grades. Getting feedback in public is just as crucial as the time you spend studying.</p>
</li>
<li><p><strong>Build Real Projects</strong>: Theory is great, but nothing beats the thrill (and frustration) of building something real. I learned more from my failed projects than from any tutorial. Most learning comes from our failures rather than our successes.</p>
</li>
</ol>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Whether you choose to dive straight into coding or take the scenic route through math town, remember that everyone starts somewhere. I went from barely understanding what ML meant to building production models that actually work (most of the time). If I can do it, so can you!</p>
<p>And hey, if you're feeling overwhelmed, just remember: even the most sophisticated ML models sometimes make predictions that are about as accurate as my weather app - and we still keep trying!</p>
<p>For more nerdy data science content and occasional attempts at humor, check out my other articles on getting started with Python and data science fundamentals. Trust me, they're marginally more entertaining than watching paint dry! 😉</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together"><strong>P.S. Let's Build Something Cool Together!</strong></h3>
<p>After years of stumbling through machine learning, I've found that learning is always better when done together. If you're stuck on a concept, need guidance, or just want to chat about ML, feel free to reach out to me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/"><strong>Linkedin</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[What is The Scatterplot Data Generator?]]></title><description><![CDATA[Introduction
Picture this: It's 3 AM, you're on your fifth cup of coffee, and you're staring at your screen thinking, "If only I had the perfect dataset to test this clustering algorithm..." Trust me, I've been there – we've all been there. As a data...]]></description><link>https://hddatascience.tech/what-is-the-scatterplot-data-generator</link><guid isPermaLink="true">https://hddatascience.tech/what-is-the-scatterplot-data-generator</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Sat, 02 Nov 2024 01:52:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730511897787/f29f1157-0fbc-43ab-afbf-e3ad9280e2fc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730511867292/527736b4-0d07-492d-acc1-e3fb17558c86.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-introduction">Introduction</h2>
<p>Picture this: It's 3 AM, you're on your fifth cup of coffee, and you're staring at your screen thinking, "If only I had the perfect dataset to test this clustering algorithm..." Trust me, I've been there – we've all been there. As a data engineer who's spent countless nights wrestling with algorithms that just won't behave (much like my neighbor's cat), I've discovered a tool that's become my secret weapon: the Scatterplot Data Generator.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730508541968/88b1d499-23e4-4def-ab6b-6b67d2f679fc.webp" alt class="image--center mx-auto" /></p>
<p>In this post, we'll explore what this magical tool is, why it's a game-changer for data scientists and engineers, and how it can save you from those late-night data hunting expeditions.</p>
<h2 id="heading-what-is-the-scatterplot-data-generator">What is The Scatterplot Data Generator?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730508635907/dec33b38-b257-44eb-bbb8-7963c48b4095.png" alt class="image--center mx-auto" /></p>
<p>Scatterplot Data Generator is a web-based tool that lets you literally draw your data points into existence. Think of it as MS Paint meets data science (minus the questionable artistic results we all created in the '90s). It allows users to draw points of different colors on a coordinate system, which are then converted into actual numerical data that you can use for machine learning models, testing, or educational purposes.</p>
<h2 id="heading-why-is-scatterplot-data-generator-important">Why is Scatterplot Data Generator Important?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730508610592/f57b9f2d-f1c4-4fba-9a77-1058a33189fb.webp" alt class="image--center mx-auto" /></p>
<p>Let me tell you a story that might sound familiar. Last year, I was working on a multi-class classification model that needed very specific data patterns to test edge cases. After hours of searching through Kaggle and various datasets (and possibly losing a bit of my sanity), I realized I was doing it the hard way.</p>
<h2 id="heading-simplest-use-case-for-scatterplot-data-generator">Simplest Use Case for Scatterplot Data Generator</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730518848095/1c689fd5-5c62-40cb-ac24-0d73e8147bc5.png" alt class="image--center mx-auto" /></p>
<ol>
<li><h3 id="heading-visual-pattern-recognition">Visual Pattern Recognition:</h3>
</li>
</ol>
<ul>
<li><p>The tool shows two distinct plots: one with color-coded points (blue and red) and another with black points</p>
</li>
<li><p>This helps learners understand how clustering algorithms identify and separate data points into groups based on their spatial relationships</p>
</li>
</ul>
<ol start="2">
<li><h3 id="heading-interactive-learning-features">Interactive Learning Features:</h3>
</li>
</ol>
<ul>
<li><p>The interface has color selection options (Blue, Red, Green)</p>
</li>
<li><p>A "Reset" button to start fresh</p>
</li>
<li><p>"Download CSV" functionality to export the data</p>
</li>
<li><p>These features allow hands-on experimentation with different data patterns</p>
</li>
</ul>
<ol start="3">
<li><h3 id="heading-educational-benefits">Educational Benefits:</h3>
</li>
</ol>
<ul>
<li><p>Learners can create custom data distributions to test clustering scenarios</p>
</li>
<li><p>The tool demonstrates how points that are closer together tend to form clusters</p>
</li>
<li><p>The right-side plot shows how raw data looks before classification/clustering</p>
</li>
<li><p>The left-side plot shows how clustering algorithms might separate the data into distinct groups</p>
</li>
</ul>
<ol start="4">
<li><h3 id="heading-practical-applications">Practical Applications:</h3>
</li>
</ol>
<ul>
<li><p>Users can generate synthetic datasets for testing clustering algorithms like K-means or DBSCAN</p>
</li>
<li><p>They can experiment with different data patterns and see how clustering algorithms might perform</p>
</li>
<li><p>The CSV export feature lets them use the generated data in actual ML tools and frameworks</p>
</li>
</ul>
<p>This tool essentially bridges the gap between theoretical understanding and practical application in machine learning clustering concepts.</p>
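<p>For example, once you've hit "Download CSV", feeding the points into K-means takes only a few lines. The column names (<code>x</code>, <code>y</code>, <code>color</code>) are my assumption about the export layout, so check them against your actual file:</p>

```python
import io
import pandas as pd
from sklearn.cluster import KMeans

# Inline stand-in for the exported file; in practice: pd.read_csv("points.csv")
csv_text = """x,y,color
1.0,1.1,blue
0.9,1.0,blue
1.2,0.8,blue
5.0,5.1,red
5.2,4.9,red
4.8,5.0,red
"""
df = pd.read_csv(io.StringIO(csv_text))

# Cluster on the coordinates only; the colors you drew act as ground truth
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["x", "y"]])
df["cluster"] = kmeans.labels_
```

<p>With well-separated blobs like these, each cluster recovers exactly one drawn color – then you can start drawing spirals and watch K-means fall apart.</p>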
<h2 id="heading-real-examples-of-scatterplot-data-generator-in-action">Real Examples of Scatterplot Data Generator in Action</h2>
<h3 id="heading-1-the-classification-conundrum">1. The Classification Conundrum</h3>
<p><img src="https://www.researchgate.net/publication/341699548/figure/fig1/AS:896044741181440@1590645122163/Scatter-plot-of-XOR-Spiral-and-Circle-Benchmarks.ppm" alt="Scatter plot of XOR, Spiral and Circle Benchmarks" /></p>
<p>Picture this: I was working with a peer who couldn't understand why their beautiful linear classifier was failing miserably. Rather than diving into complex math, I fired up Scatterplot Data Generator and drew a simple XOR pattern – you know, that classic "cross" shape that makes linear classifiers cry themselves to sleep. Five minutes of interactive demonstration showed what would have taken an hour to explain with equations. The best part? They immediately started experimenting with their own patterns, creating increasingly diabolical datasets to break various classifiers. It's all fun and games until someone creates a spiral pattern!</p>
<h3 id="heading-2-the-edge-case-emergency">2. The Edge Case Emergency</h3>
<p>It was Sunday night (because production issues never happen on a Tuesday afternoon, right?), and our anomaly detection system was throwing false positives. We needed to test edge cases, and fast. Using Scatterplot Data Generator, we created datasets with specific outlier patterns that mimicked our production scenarios. Within an hour, we had a suite of test cases that would have taken days to find in real data. The best part? We could tweak the patterns in real-time as we discovered new edge cases. Our Monday morning post-mortem turned into a "look how we nailed it" presentation!</p>
<h2 id="heading-workflows-for-scatterplot-data-generator">Workflows for Scatterplot Data Generator</h2>
<pre><code class="lang-mermaid">flowchart LR
    A[Draw Points] --&gt;|Click &amp; Drag| B[Generate Data]
    B --&gt; C[Download CSV]
    B --&gt; D[Export Visual]
    C --&gt; E[Use in ML Pipeline]
    D --&gt; F[Use in Documentation]

    style A fill:#5b9aa0
    style B fill:#5b9aa0
    style C fill:#5b9aa0
    style D fill:#5b9aa0
    style E fill:#5b9aa0
    style F fill:#5b9aa0
</code></pre>
<h2 id="heading-tips-and-reminders-for-using-scatterplot-data-generator">Tips and Reminders for Using Scatterplot Data Generator</h2>
<h3 id="heading-1-plan-your-pattern">1. Plan Your Pattern</h3>
<p>Before diving in, spend five minutes sketching your intended pattern. Trust me, I learned this the hard way after creating what I thought would be a perfect Gaussian distribution but ended up looking more like my failed attempt at drawing a cat. Quick tip: Use graph paper for your sketches – your coordinates will thank you later!</p>
<h3 id="heading-2-save-everything">2. Save Everything</h3>
<p>I cannot stress this enough: Save. Your. Work. Name your files descriptively (not "test1_final_final_REALLY_FINAL.csv"). Keep both the visual and the data. Document your patterns. I once spent three hours recreating a "perfect" dataset because I forgot to save the original. Learn from my pain!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Just as every great artist needs their canvas, every data scientist needs their tools. Scatterplot Data Generator bridges the gap between imagination and implementation, between "I wish I had this data" and "I created exactly what I needed." Whether you're a seasoned data scientist battling with edge cases, a teacher illuminating the mysteries of machine learning, or a beginner trying to understand why your neural network has trust issues, this tool transforms the abstract into the tangible. Remember: in a world where data is the new oil, being able to generate exactly what you need makes you not just a data scientist, but a data artist. And sometimes, the best datasets are the ones we draw ourselves – even if they occasionally end up looking like abstract art!</p>
<hr />
<h3 id="heading-ps-lets-build-something-cool-together">P.S. Let's Build Something Cool Together!</h3>
<p>Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:</p>
<ul>
<li><p>Makes ETL pipelines behave</p>
</li>
<li><p>Turns data warehouse chaos into zen</p>
</li>
<li><p>Gets ML models from laptop to production</p>
</li>
</ul>
<p>If you find this blog interesting, connect with me on <a target="_blank" href="https://www.linkedin.com/in/harvey-ducay-090157253/">Linkedin</a> and make sure to leave a message!</p>
]]></content:encoded></item><item><title><![CDATA[What is Data Engineering?: Everything You Need to Know]]></title><description><![CDATA[Ever found yourself drowning in a sea of data, trying to make sense of countless Excel sheets while your computer fan sounds like it's about to take off? Trust me, I've been there. As someone who once tried to train a machine learning model on my lap...]]></description><link>https://hddatascience.tech/what-is-data-engineering-everything-you-need-to-know</link><guid isPermaLink="true">https://hddatascience.tech/what-is-data-engineering-everything-you-need-to-know</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[netflix]]></category><dc:creator><![CDATA[Harvey Ducay]]></dc:creator><pubDate>Thu, 24 Oct 2024 23:25:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729810715552/296aa8c6-d3ba-4f84-a514-4dd270bfc995.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever found yourself drowning in a sea of data, trying to make sense of countless Excel sheets while your computer fan sounds like it's about to take off? Trust me, I've been there. As someone who once tried to train a machine learning model on my laptop with 100GB of unstructured data (spoiler alert: it didn't end well), I learned the hard way why data engineering is the unsung hero of the data world.</p>
<p>My laptop trying to process 100GB of data…</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729810987018/9ded2596-63f4-4097-8273-c28e2d6576e0.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-introduction">Introduction</h2>
<p>In today's digital age, data is the new gold – but just like raw gold, raw data needs refining before it becomes valuable. That's where data engineering comes in. Whether you're a startup trying to make sense of your customer data or a large enterprise handling petabytes of information, data engineering is the foundation that makes modern data science and analytics possible.</p>
<h2 id="heading-the-data-pipeline-journey">The Data Pipeline Journey</h2>
<p><em>Raw Data → Data Engineering → Clean Data → Analysis → Insights</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811103420/0a08d75a-e8ff-4e79-8928-0276163ab434.png" alt class="image--center mx-auto" /></p>
<p>In this post, we'll define data engineering, explore its crucial role in the data ecosystem, and provide practical insights into how it can transform your business's data operations. We'll also look at real-world examples and best practices that can help you get started on your data engineering journey.</p>
<h2 id="heading-what-is-data-engineering">What is Data Engineering?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811313597/836c4287-bf3b-4394-a023-1fe18efeccc2.webp" alt class="image--center mx-auto" /></p>
<p>Data engineering is the practice of designing, building, and maintaining the infrastructure and systems needed to collect, store, process, and deliver data for analysis. Think of data engineers as the architects and plumbers of the data world – they build the pipelines and systems that ensure data flows smoothly from source to destination, arriving clean and ready for analysis.</p>
<p><em>A sample python BigQuery snippet</em></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> google.cloud <span class="hljs-keyword">import</span> bigquery
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">BigQueryETL</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, project_id: str</span>):</span>
        <span class="hljs-string">"""Initialize BigQuery client"""</span>
        self.client = bigquery.Client(project=project_id)
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract</span>(<span class="hljs-params">self, query: str</span>) -&gt; pd.DataFrame:</span>
        <span class="hljs-string">"""Extract data from BigQuery"""</span>
        <span class="hljs-keyword">try</span>:
            df = self.client.query(query).to_dataframe()
            self.logger.info(<span class="hljs-string">f"Extracted <span class="hljs-subst">{len(df)}</span> rows"</span>)
            <span class="hljs-keyword">return</span> df
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Extraction failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform</span>(<span class="hljs-params">self, df: pd.DataFrame</span>) -&gt; pd.DataFrame:</span>
        <span class="hljs-string">"""Apply transformations to the data"""</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Convert dates</span>
            date_cols = df.select_dtypes(include=[<span class="hljs-string">'datetime64[ns]'</span>]).columns
            <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> date_cols:
                df[col] = pd.to_datetime(df[col])

            <span class="hljs-comment"># Handle missing values in numeric columns</span>
            num_cols = df.select_dtypes(include=[<span class="hljs-string">'float64'</span>, <span class="hljs-string">'int64'</span>]).columns
            df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

            <span class="hljs-comment"># Add time-based features if timestamp exists</span>
            <span class="hljs-keyword">if</span> <span class="hljs-string">'timestamp'</span> <span class="hljs-keyword">in</span> df.columns:
                df[<span class="hljs-string">'hour'</span>] = df[<span class="hljs-string">'timestamp'</span>].dt.hour
                df[<span class="hljs-string">'is_weekend'</span>] = df[<span class="hljs-string">'timestamp'</span>].dt.dayofweek.isin([<span class="hljs-number">5</span>, <span class="hljs-number">6</span>]).astype(int)

            <span class="hljs-keyword">return</span> df.drop_duplicates()
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Transformation failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load</span>(<span class="hljs-params">self, df: pd.DataFrame, table_id: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""Load data into BigQuery"""</span>
        <span class="hljs-keyword">try</span>:
            job_config = bigquery.LoadJobConfig(
                write_disposition=<span class="hljs-string">'WRITE_TRUNCATE'</span>,
                schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]
            )

            load_job = self.client.load_table_from_dataframe(
                df, table_id, job_config=job_config
            )
            load_job.result()  <span class="hljs-comment"># Wait for job to complete</span>

            self.logger.info(<span class="hljs-string">f"Loaded <span class="hljs-subst">{len(df)}</span> rows to <span class="hljs-subst">{table_id}</span>"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Load failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_pipeline</span>(<span class="hljs-params">self, query: str, destination_table: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""Execute the full ETL pipeline"""</span>
        start_time = datetime.now()
        <span class="hljs-keyword">try</span>:
            df = self.extract(query)
            df_transformed = self.transform(df)
            self.load(df_transformed, destination_table)

            duration = datetime.now() - start_time
            self.logger.info(<span class="hljs-string">f"Pipeline completed in <span class="hljs-subst">{duration}</span>"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            self.logger.error(<span class="hljs-string">f"Pipeline failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

<span class="hljs-comment"># Example usage</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># Initialize pipeline</span>
    etl = BigQueryETL(<span class="hljs-string">"your-project-id"</span>)

    <span class="hljs-comment"># Example query</span>
    query = <span class="hljs-string">"""
    SELECT user_id, timestamp, activity_type, duration
    FROM `your-project-id.dataset.user_activity`
    WHERE DATE(timestamp) &gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    """</span>

    <span class="hljs-comment"># Run pipeline</span>
    etl.run_pipeline(query, <span class="hljs-string">"your-project-id.dataset.processed_activity"</span>)
</code></pre>
<h2 id="heading-why-is-data-engineering-important">Why is Data Engineering Important?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811666863/e71dda39-9305-4aca-acb0-26a3baa8c08f.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-data-volume-175-zettabytes-by-2025">1. Data Volume: 175 Zettabytes by 2025</h3>
<p>By 2025, global data creation will hit 175 zettabytes (that's 175 billion terabytes!). To put this in perspective, if you burned this data onto DVDs, the stack would stretch from the Earth to the moon well over 100,000 times. This explosive growth, driven by IoT devices, social media, and streaming services, makes robust data engineering not just important, but critical for business survival.</p>
<h3 id="heading-2-decision-speed-25-faster">2. Decision Speed: 25% Faster</h3>
<p>Organizations with proper data engineering make decisions 25% faster than their competitors. Think retail making inventory decisions in hours instead of days, or healthcare reducing patient diagnosis time by a third. This speed comes from automated data pipelines, real-time analytics, and streamlined access to clean, reliable data.</p>
<h3 id="heading-3-cost-reduction-up-to-70">3. Cost Reduction: Up to 70%</h3>
<p>Companies can slash data-related costs by up to 70% through data engineering. How? Through smart infrastructure optimization (30% savings), automated processes (20% savings), and better resource allocation (20% savings). Instead of throwing money at storing and processing messy data, proper engineering means you spend less while getting better results.</p>
<h2 id="heading-real-examples-of-data-engineering">Real Examples of Data Engineering</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729811991194/adb0f65c-30d9-4b70-8936-e86d0e2d7496.jpeg" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-netflixs-data-pipeline">1. Netflix's Data Pipeline</h3>
<pre><code class="lang-mermaid">graph TD
    A[User Interactions] --&gt;|Streaming Events| B[Kafka]
    B --&gt;|Real-time Processing| C[Apache Flink]
    B --&gt;|Batch Processing| D[Spark]
    C --&gt;|Hot Data| E[Cassandra]
    D --&gt;|Cold Data| F[S3 Data Lake]
    E --&gt; G[Feature Store]
    F --&gt; G
    G --&gt;|ML Training| H[Model Training]
    H --&gt;|Model Serving| I[Recommendation Service]
    I --&gt;|Personalization| J[User Interface]
</code></pre>
<p>Netflix processes a staggering 450+ billion events per day through their data pipeline. Here's how their architecture works:</p>
<ol>
<li><p><strong>Data Collection Layer</strong></p>
<ul>
<li><p>Captures user interactions (clicks, views, pauses, ratings)</p>
</li>
<li><p>Records viewing quality metrics</p>
</li>
<li><p>Tracks device-specific information</p>
</li>
<li><p>Processes content metadata</p>
</li>
</ul>
</li>
<li><p><strong>Processing Layer</strong></p>
<ul>
<li><p>Real-time processing for immediate recommendations</p>
</li>
<li><p>Batch processing for deeper insights</p>
</li>
<li><p>A/B testing data for feature optimization</p>
</li>
<li><p>Content performance analytics</p>
</li>
</ul>
</li>
<li><p><strong>Storage Layer</strong></p>
<ul>
<li><p>Hot data in Cassandra for real-time access</p>
</li>
<li><p>Cold data in S3 for historical analysis</p>
</li>
<li><p>Feature store for ML model training</p>
</li>
<li><p>Redis cache for quick access to recommendations</p>
</li>
</ul>
</li>
</ol>
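<p>The hot/cold split at the heart of this architecture can be sketched in a few lines of Python. To be clear, this is a toy stand-in, not Netflix's actual code: the event kinds, field names, and routing rule are illustrative assumptions, with in-memory lists standing in for Cassandra and the S3 data lake.</p>

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    user_id: str
    kind: str        # e.g. "click", "view", "pause", "rating"
    payload: dict

@dataclass
class PipelineSinks:
    hot: list = field(default_factory=list)    # stand-in for Cassandra
    cold: list = field(default_factory=list)   # stand-in for the S3 data lake

# Event kinds that feed immediate recommendations (illustrative choice)
REALTIME_KINDS = {"click", "view"}

def route_event(event: Event, sinks: PipelineSinks) -> str:
    """Fan an event out the way the diagram above does: real-time
    kinds take the hot path, and every event also lands in cold
    storage for batch processing."""
    sinks.cold.append(event)            # everything reaches the data lake
    if event.kind in REALTIME_KINDS:
        sinks.hot.append(event)         # low-latency path for recommendations
        return "hot+cold"
    return "cold"

sinks = PipelineSinks()
print(route_event(Event("u1", "click", {}), sinks))           # hot+cold
print(route_event(Event("u1", "rating", {"stars": 5}), sinks)) # cold
```

<p>In production the same fan-out happens through Kafka consumers feeding Flink (hot path) and Spark (cold path), but the routing decision itself is conceptually this simple.</p>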
<p>The result? Those eerily accurate "Because you watched..." recommendations that keep us binge-watching!</p>
<h3 id="heading-2-ubers-real-time-analytics">2. Uber's Real-Time Analytics</h3>
<p><em>Simplified view of Uber's real-time system:</em></p>
<pre><code class="lang-mermaid">graph TD
    A[Rider/Driver Apps] --&gt;|Events| B[Apache Kafka]
    B --&gt;|Stream Processing| C[Apache Flink]
    B --&gt;|Batch Processing| D[Apache Spark]
    C --&gt;|Real-time Metrics| E[Apache AthenaX]
    D --&gt;|Historical Data| F[Hudi Data Lake]
    E --&gt;|Current State| G[Redis]
    F --&gt;|Analytics| H[Presto]
    G --&gt;|Real-time Decisions| I[Matching Service]
    H --&gt;|Business Intelligence| J[Analytics Dashboard]
</code></pre>
<p>Uber's real-time data pipeline handles millions of events per second. Here's their architecture breakdown:</p>
<ol>
<li><p><strong>Real-time Processing Layer</strong></p>
<ul>
<li><p>Processes GPS coordinates every 4 seconds</p>
</li>
<li><p>Handles surge pricing calculations</p>
</li>
<li><p>Manages driver-rider matching</p>
</li>
<li><p>Monitors service health</p>
</li>
</ul>
</li>
<li><p><strong>Storage Layer</strong></p>
<ul>
<li><p>Temporal data in Redis for immediate access</p>
</li>
<li><p>Historical data in Apache Hudi</p>
</li>
<li><p>Geospatial indexing for location services</p>
</li>
<li><p>Cached frequently accessed routes</p>
</li>
</ul>
</li>
<li><p><strong>Analytics Layer</strong></p>
<ul>
<li><p>Real-time city demand forecasting</p>
</li>
<li><p>Dynamic pricing algorithms</p>
</li>
<li><p>Driver supply optimization</p>
</li>
<li><p>Route optimization based on traffic patterns</p>
</li>
</ul>
</li>
</ol>
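<p>The geospatial core of driver-rider matching is easy to sketch. The snippet below is a deliberately naive illustration, not Uber's algorithm (which also weighs ETA, traffic, and driver supply): it simply picks the nearest available driver by great-circle distance.</p>

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_driver(rider, drivers):
    """Toy matching service: nearest driver wins. `drivers` is a list
    of (driver_id, (lat, lon)) tuples."""
    return min(drivers, key=lambda d: haversine_km(*rider, *d[1]))

drivers = [("d1", (14.60, 120.98)), ("d2", (14.55, 121.05)), ("d3", (14.58, 121.00))]
rider = (14.57, 121.01)
print(match_driver(rider, drivers)[0])
```

<p>At Uber's scale this lookup runs against a geospatial index (so you never compare against every driver), but the objective being minimized is the same idea.</p>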
<p>The result is a system that can match you with a driver in seconds while optimizing for countless variables in real-time!</p>
<hr />
<h2 id="heading-conclusion-the-future-is-data-driven">Conclusion: The Future is Data-Driven</h2>
<p>Looking at these real-world examples, it's clear that data engineering isn't just about moving data from point A to point B – it's the backbone of modern digital experiences we take for granted. From Netflix knowing exactly what show you'll love next to Uber finding you the perfect driver in seconds, data engineering makes the impossible possible.</p>
<p>Remember when I mentioned my laptop meltdown trying to process 100GB of data? That's like trying to deliver packages on a bicycle when you need a fleet of trucks. Modern data engineering is that fleet of trucks, complete with GPS, route optimization, and real-time tracking.</p>
<p>As we move toward an even more data-intensive future, the role of data engineering will only grow. Whether you're a startup processing your first thousand users' worth of data or an enterprise handling petabytes, the principles remain the same:</p>
<ul>
<li><p>Build scalable, resilient pipelines</p>
</li>
<li><p>Automate everything you can</p>
</li>
<li><p>Monitor religiously</p>
</li>
<li><p>Plan for growth</p>
</li>
</ul>
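<p>The "automate" and "monitor" principles above fit in a short sketch. Everything here is a hypothetical example rather than a real library's API: a decorator that retries a flaky pipeline step and keeps crude call/failure counters that a real system would ship to a metrics backend.</p>

```python
import functools
import time

def monitored(max_retries=3, backoff_s=0.0):
    """Retry a flaky pipeline step and record simple run metrics."""
    def wrap(fn):
        metrics = {"calls": 0, "failures": 0}
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(max_retries):
                metrics["calls"] += 1
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    metrics["failures"] += 1
                    if attempt == max_retries - 1:
                        raise       # out of retries: surface the failure
                    time.sleep(backoff_s)
        inner.metrics = metrics     # expose counters for monitoring
        return inner
    return wrap

attempts = {"n": 0}

@monitored(max_retries=3)
def flaky_load():
    """Simulated load step that fails twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient failure")
    return "loaded"

print(flaky_load(), flaky_load.metrics)
```

<p>In practice you would get this from an orchestrator such as Airflow (retries, alerting, and metrics built in), but the principle is the same: every step is automated, every step is observed.</p>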
]]></content:encoded></item></channel></rss>