<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ernesto Arias</title>
    <description>The latest articles on DEV Community by Ernesto Arias (@ernesto_arias_148b35bc25d).</description>
    <link>https://dev.clauneck.workers.dev/ernesto_arias_148b35bc25d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2767341%2Fc9163db9-e7b0-4323-a2b4-92de38adcdae.jpg</url>
      <title>DEV Community: Ernesto Arias</title>
      <link>https://dev.clauneck.workers.dev/ernesto_arias_148b35bc25d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.clauneck.workers.dev/feed/ernesto_arias_148b35bc25d"/>
    <language>en</language>
    <item>
      <title>How I Found Out 52% of My Knowledge Graph Was Duplicates (and What I Did About It)</title>
      <dc:creator>Ernesto Arias</dc:creator>
      <pubDate>Thu, 25 Jun 2026 00:50:17 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/ernesto_arias_148b35bc25d/-how-i-found-out-52-of-my-knowledge-graph-was-duplicates-and-what-i-did-about-it-3coh</link>
      <guid>https://dev.clauneck.workers.dev/ernesto_arias_148b35bc25d/-how-i-found-out-52-of-my-knowledge-graph-was-duplicates-and-what-i-did-about-it-3coh</guid>
      <description>&lt;p&gt;I've spent the last several months building &lt;a href="https://github.com/ernestoariasdiaz/animus-ai" rel="noopener noreferrer"&gt;ANIMUS&lt;/a&gt;, an autonomous system in Rust that gives a local LLM persistent memory. The idea is simple: a knowledge graph that grows on its own, cycle after cycle, as the system reads documents, detects gaps in its knowledge, and fills them in.&lt;/p&gt;

&lt;p&gt;For months, the metric I watched most closely was the node count of the graph. It kept climbing. I felt good about that.&lt;/p&gt;

&lt;p&gt;Until I ran a full audit and found out that &lt;strong&gt;52% of those nodes were undetected duplicates&lt;/strong&gt;. Of 1,892 reported nodes, only 911 were actually unique.&lt;/p&gt;

&lt;h2&gt;How did this happen?&lt;/h2&gt;

&lt;p&gt;ANIMUS's autonomous loop actively looks for "gaps" — holes in its knowledge that the system decides to fill on its own. The problem: an overly aggressive filter was excluding certain categories from the gap pool, which trapped the system in a loop of re-exploring the same ~40 topics for thousands of cycles. Each pass generated content that was &lt;em&gt;similar&lt;/em&gt; but not identical to the last — different enough to avoid triggering any exact-duplicate check, but substantially the same information rephrased.&lt;/p&gt;

&lt;p&gt;The node count kept climbing. Actual knowledge, not so much.&lt;/p&gt;
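
&lt;p&gt;To make the failure mode concrete, here is a minimal sketch of a near-duplicate check that catches what an exact-match test misses. It is not the ANIMUS implementation: the function names, the word-set Jaccard measure, and the 0.8 threshold are all assumptions chosen for illustration.&lt;/p&gt;

```rust
use std::collections::BTreeSet;
use std::iter::FromIterator;

/// Jaccard similarity between the lowercase word sets of two texts,
/// from 0.0 (no shared words) to 1.0 (same word set).
/// Takes owned Strings to keep the sketch short; real code would likely borrow.
fn jaccard(a: String, b: String) -> f64 {
    let (a, b) = (a.to_lowercase(), b.to_lowercase());
    let words_a = BTreeSet::from_iter(a.split_whitespace());
    let words_b = BTreeSet::from_iter(b.split_whitespace());
    // Words present in both sets.
    let shared = words_a.iter().filter(|w| words_b.contains(*w)).count();
    // Size of the union, by inclusion-exclusion.
    let total = words_a.len() + words_b.len() - shared;
    if total == 0 {
        return 0.0;
    }
    shared as f64 / total as f64
}

/// Hypothetical rule: treat two node texts as near-duplicates at 0.8 or above.
/// The threshold is an illustrative guess, not a tuned value.
fn is_near_duplicate(a: String, b: String) -> bool {
    jaccard(a, b) >= 0.8
}
```

&lt;p&gt;Two texts that share most of their vocabulary score close to 1.0 under this measure even when the strings differ, so a string-equality check would pass them as distinct. Comparing embeddings instead of raw words would be another option.&lt;/p&gt;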

&lt;h2&gt;The Rust engineering side&lt;/h2&gt;

&lt;p&gt;The fix wasn't magic; it was audit work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reopening the gap filter that had been closed too aggressively, so the system would explore genuinely new topics instead of repeating itself.&lt;/li&gt;
&lt;li&gt;Fixing an ordering bias in the semantic search (&lt;code&gt;Brain::search&lt;/code&gt;): it walked the graph from node 0 with &lt;code&gt;.take(2)&lt;/code&gt;, so it favored the oldest nodes and almost always returned stale content from earlier versions of the system. A simple &lt;code&gt;.rev()&lt;/code&gt; fixed it by walking from the newest node instead.&lt;/li&gt;
&lt;li&gt;Building an "auto-census" process that runs every 37 cycles and generates real statistics about the graph by category — so the system itself (and I) could see with numbers, not intuition, whether it was growing in a healthy way.&lt;/li&gt;
&lt;/ul&gt;
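
&lt;p&gt;The ordering fix can be illustrated with a toy example. This sketch assumes nodes are stored in insertion order and uses a plain substring match in place of the real semantic search; &lt;code&gt;compare_search_orders&lt;/code&gt; and the node texts are hypothetical.&lt;/p&gt;

```rust
use std::iter::FromIterator;

/// Returns (hits when scanning from node 0, hits when scanning from the newest node)
/// for a toy graph, each joined into one string for easy comparison.
fn compare_search_orders() -> (String, String) {
    // Toy graph in insertion order: index 0 is the oldest node.
    let nodes = vec![
        "v1: gap filter notes",
        "v1: gap filter draft",
        "v2: census design",
        "v3: gap filter reopened",
        "v3: gap filter audit",
    ];

    // Before: walk forward from node 0, so the two oldest matches win.
    let before = Vec::from_iter(
        nodes.iter().filter(|n| n.contains("gap filter")).take(2).copied(),
    );

    // After: reverse first, so the two newest matches win.
    let after = Vec::from_iter(
        nodes.iter().rev().filter(|n| n.contains("gap filter")).take(2).copied(),
    );

    (before.join(" | "), after.join(" | "))
}
```

&lt;p&gt;The forward scan only ever returns the two &lt;code&gt;v1&lt;/code&gt; entries, no matter how many newer matches exist. Placing &lt;code&gt;.rev()&lt;/code&gt; ahead of &lt;code&gt;.take(2)&lt;/code&gt; returns the two &lt;code&gt;v3&lt;/code&gt; entries.&lt;/p&gt;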

&lt;p&gt;Along the way, I also migrated the inference engine: from a Python wrapper to a &lt;code&gt;llama-server.exe&lt;/code&gt; launched directly from Rust, and from the original model to a quantized Gemma 4 E2B, running at ~77 tokens/second on a consumer GPU (RTX 3050, 4GB). None of this required the cloud or paid APIs — everything runs locally.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;The most valuable part of this whole episode wasn't fixing the bug. It was realizing that &lt;strong&gt;a metric that only goes up never warns you that something is wrong&lt;/strong&gt;. Node count was a proxy for "the system is learning," but optimizing that one proxy, with nothing to balance it, ended up producing the opposite: inflated content, not new knowledge.&lt;/p&gt;

&lt;p&gt;ANIMUS now runs on several cross-checked signals (verified uniqueness, recency-weighted relevance, source validation) instead of one vanity metric. If two signals start to diverge, the system stops and re-audits instead of continuing to generate.&lt;/p&gt;
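
&lt;p&gt;As a rough sketch of one such cross-check, the functions below compare the reported node count against the verified unique count and flag a re-audit when the duplicate share passes a limit. The names and the limit parameter are illustrative assumptions, not ANIMUS code.&lt;/p&gt;

```rust
/// Share of reported nodes that are duplicates, from 0.0 to 1.0.
fn duplicate_share(reported: usize, unique: usize) -> f64 {
    if reported == 0 {
        return 0.0;
    }
    // saturating_sub avoids underflow if a bad count makes unique exceed reported.
    reported.saturating_sub(unique) as f64 / reported as f64
}

/// Hypothetical divergence rule: request a re-audit once the duplicate
/// share exceeds the caller-supplied limit.
fn should_reaudit(reported: usize, unique: usize, max_duplicate_share: f64) -> bool {
    duplicate_share(reported, unique) > max_duplicate_share
}
```

&lt;p&gt;With the audit numbers from this post, &lt;code&gt;duplicate_share(1892, 911)&lt;/code&gt; is about 0.52, so a limit of, say, 0.10 would have stopped generation long before the graph reached that state.&lt;/p&gt;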

&lt;p&gt;If you're curious about the full picture (architecture, benchmarks, comparison against a simple vector RAG baseline), the technical paper is open access with a DOI: &lt;a href="https://doi.org/10.5281/zenodo.20674981" rel="noopener noreferrer"&gt;10.5281/zenodo.20674981&lt;/a&gt;. Code is on &lt;a href="https://github.com/ernestoariasdiaz/animus-ai" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;ANIMUS is an independent project, developed in Santo Domingo, Dominican Republic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was written with the assistance of an LLM, based on my own project, data, and experience.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rust</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
