<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ude-p</title>
    <description>The latest articles on DEV Community by ude-p (@udep).</description>
    <link>https://dev.clauneck.workers.dev/udep</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3580619%2F4ff21755-a2d4-476c-89e6-da2e1462c891.png</url>
      <title>DEV Community: ude-p</title>
      <link>https://dev.clauneck.workers.dev/udep</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.clauneck.workers.dev/feed/udep"/>
    <language>en</language>
    <item>
      <title>Building a Deployment Platform on Self-Managed Infrastructure with k3s</title>
      <dc:creator>ude-p</dc:creator>
      <pubDate>Wed, 24 Jun 2026 19:49:17 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/udep/building-a-deployment-platform-on-self-managed-infrastructure-with-k3s-1hk4</link>
      <guid>https://dev.clauneck.workers.dev/udep/building-a-deployment-platform-on-self-managed-infrastructure-with-k3s-1hk4</guid>
      <description>&lt;p&gt;At the beginning of this year, our production workloads were spread across AWS, Contabo and GoDaddy.&lt;/p&gt;

&lt;p&gt;Most of the applications had predictable traffic patterns and fairly stable resource requirements. As more products were added, infrastructure costs became harder to ignore, particularly on AWS. The deployment workflows, infrastructure layouts and operational processes supporting those products had also become increasingly inconsistent.&lt;/p&gt;

&lt;p&gt;Part of my responsibility became consolidating that environment, reducing unnecessary infrastructure costs and introducing a more consistent deployment workflow.&lt;/p&gt;

&lt;p&gt;Over the last few months, that work resulted in a deployment platform built around GitHub Organizations, GitHub Actions, Ansible, k3s, SOPS and an internal application deployment operator written in Go.&lt;/p&gt;

&lt;p&gt;This article covers the current version of that platform, the decisions behind it, and some of the problems encountered while building it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Management
&lt;/h2&gt;

&lt;p&gt;Repositories were moved into a GitHub Organization and branch protection became consistent across projects.&lt;/p&gt;

&lt;p&gt;Rulesets were configured to prevent direct pushes to &lt;code&gt;main&lt;/code&gt;, require pull request reviews, enforce successful CI checks before merging and restrict who could bypass those requirements.&lt;/p&gt;

&lt;p&gt;Since deployments would eventually be tied to merges into &lt;code&gt;main&lt;/code&gt;, repository rules became part of the deployment platform itself. A failed CI check blocks a deployment. A missing review blocks a deployment. Direct pushes to &lt;code&gt;main&lt;/code&gt; are no longer possible.&lt;/p&gt;

&lt;p&gt;The result is that every change reaching production passes through the same validation path regardless of which repository it originated from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Integration
&lt;/h2&gt;

&lt;p&gt;The next step was standardizing CI across repositories using GitHub Actions. Most of our services are written in Go, so the workflow was nearly identical from one repository to the next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run formatting checks&lt;/li&gt;
&lt;li&gt;Run linters&lt;/li&gt;
&lt;li&gt;Execute tests&lt;/li&gt;
&lt;li&gt;Build the application&lt;/li&gt;
&lt;li&gt;Build and publish container images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once CI became consistent across repositories, deployments could assume the same validation process had already taken place regardless of which application was being deployed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure Architecture
&lt;/h2&gt;

&lt;p&gt;Most applications had predictable resource consumption and did not require the elasticity that AWS is particularly good at providing.&lt;/p&gt;

&lt;p&gt;A comparison between AWS infrastructure costs and equivalent VPS resources on Contabo made the migration decision fairly straightforward, and several workloads were moved onto VPS infrastructure.&lt;/p&gt;

&lt;p&gt;The resulting architecture was built around Contabo's private networking feature. Each server was connected to a private network and cluster communication was configured to use private IP addresses. Kubernetes control plane traffic, pod networking, database connections and application-to-application communication remained on the private network, while public access was limited to services that actually needed internet exposure.&lt;/p&gt;

&lt;p&gt;The architecture ended up looking something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
    |
Traefik Ingress
    |
k3s Cluster
    |
+---------------+---------------+---------------+
|               |               |               |
Control Plane   Worker 1      Worker 2      Worker N
     \             |             |             /
      \____________Private Network____________/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using private networking meant node-to-node communication never needed to traverse the public internet, and new nodes could join the cluster using private addresses rather than public endpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure Provisioning
&lt;/h2&gt;

&lt;p&gt;As the number of servers increased, infrastructure provisioning became repetitive.&lt;/p&gt;

&lt;p&gt;Every machine required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User creation&lt;/li&gt;
&lt;li&gt;SSH configuration&lt;/li&gt;
&lt;li&gt;Hostname configuration&lt;/li&gt;
&lt;li&gt;Firewall configuration&lt;/li&gt;
&lt;li&gt;Package installation&lt;/li&gt;
&lt;li&gt;Cluster bootstrap steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first few servers were configured manually. I introduced Ansible shortly afterwards.&lt;/p&gt;

&lt;p&gt;Most of the provisioning process eventually became playbooks covering server configuration, SSH setup and k3s installation. The value of this became obvious when one of the worker nodes failed.&lt;/p&gt;

&lt;p&gt;Replacing the node involved provisioning a replacement server, updating inventory and rerunning the playbooks rather than manually rebuilding configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying k3s
&lt;/h2&gt;

&lt;p&gt;I evaluated kubeadm briefly before choosing k3s.&lt;/p&gt;

&lt;p&gt;For the size of infrastructure I was managing, k3s solved most of the problems I cared about without introducing additional operational overhead.&lt;/p&gt;

&lt;p&gt;The control plane, service discovery, ingress controller and storage provisioner were already packaged together.&lt;/p&gt;

&lt;p&gt;I wasn't particularly interested in assembling Kubernetes components individually if a distribution already existed that solved the same problem.&lt;/p&gt;

&lt;p&gt;Within a few hours the cluster was operational and workloads started moving over.&lt;/p&gt;

&lt;p&gt;For a small engineering team operating a handful of products, the defaults provided by k3s have been difficult to argue against.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Automation
&lt;/h2&gt;

&lt;p&gt;Around this period I started getting tired of maintaining Kubernetes manifests.&lt;/p&gt;

&lt;p&gt;Most applications required the same collection of resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployments&lt;/li&gt;
&lt;li&gt;Services&lt;/li&gt;
&lt;li&gt;Ingresses&lt;/li&gt;
&lt;li&gt;ConfigMaps&lt;/li&gt;
&lt;li&gt;Secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The differences between applications were usually small, but they still resulted in maintaining large amounts of repetitive YAML.&lt;/p&gt;

&lt;p&gt;I initially used Helm, but over time I found myself spending more effort maintaining templates and manifests than I wanted to.&lt;/p&gt;

&lt;p&gt;Eventually I moved deployment definitions into Go using the Kubernetes client libraries and controller-runtime.&lt;/p&gt;

&lt;p&gt;Working directly with the Kubernetes API provided a much deeper understanding of Kubernetes resources than maintaining YAML manifests. Deployments, Services, ConfigMaps and Secrets became resources constructed and managed directly in code.&lt;/p&gt;

&lt;p&gt;Writing deployment definitions directly in Go became repetitive over time, so I started building an internal application deployment operator that abstracts much of the Kubernetes resource configuration behind a simpler deployment definition.&lt;/p&gt;

&lt;p&gt;The operator is still evolving, but it has become the primary way applications are deployed into the cluster.&lt;/p&gt;

&lt;p&gt;I'll cover its design and implementation in a future article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secret Management
&lt;/h2&gt;

&lt;p&gt;Keeping environment variables synchronized across applications and environments became increasingly difficult as the platform grew.&lt;/p&gt;

&lt;p&gt;I looked at Vault and Infisical, but both introduced another service to operate. For the size of the team and operational requirements, SOPS with Age encryption was a better fit.&lt;/p&gt;

&lt;p&gt;Environment files are encrypted with SOPS and committed to Git. During deployment, the files are decrypted and converted into Kubernetes ConfigMaps and Secrets.&lt;/p&gt;

&lt;p&gt;The Age private key is stored outside the repository and is only available to trusted deployment environments responsible for decrypting configuration during deployments.&lt;/p&gt;

&lt;p&gt;The deployment workflow looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.env
  |
SOPS + Age Encrypt
  |
Git Repository
  |
Deployment Environment
  |
Age Private Key
  |
Decrypt
  |
Kubernetes Secrets / ConfigMaps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
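
&lt;p&gt;As a rough illustration of the last step, the sketch below parses already-decrypted environment content into key/value maps that could back a Secret and a ConfigMap. It is not the operator's actual code: the SOPS decrypt call and the Kubernetes objects are left out, and the names &lt;code&gt;parseEnv&lt;/code&gt; and &lt;code&gt;splitSensitive&lt;/code&gt;, as well as the suffix rule used to decide what counts as sensitive, are assumptions made for the example.&lt;/p&gt;

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnv turns decrypted KEY=VALUE lines into a map. Blank lines and
// lines starting with '#' are skipped. Quoted and multi-line values are
// not handled in this sketch.
func parseEnv(plaintext string) map[string]string {
	out := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(plaintext))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Split on the first '=' only, so values may contain '='.
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return out
}

// splitSensitive separates entries into Secret data and ConfigMap data.
// The suffix list is a hypothetical naming convention, not a rule from
// the platform described in the article.
func splitSensitive(env map[string]string) (secret, config map[string]string) {
	secret, config = map[string]string{}, map[string]string{}
	for k, v := range env {
		if strings.HasSuffix(k, "_PASSWORD") || strings.HasSuffix(k, "_SECRET") || strings.HasSuffix(k, "_KEY") {
			secret[k] = v
		} else {
			config[k] = v
		}
	}
	return secret, config
}

func main() {
	env := parseEnv("# app settings\nLOG_LEVEL=info\nDB_PASSWORD=s3cr3t\n")
	secret, config := splitSensitive(env)
	fmt.Println(len(secret), len(config))
}
```

&lt;p&gt;Keeping the parsing separate from the Kubernetes client code makes this step easy to test without a cluster.&lt;/p&gt;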



&lt;h2&gt;
  
  
  Kubernetes RBAC
&lt;/h2&gt;

&lt;p&gt;Giving GitHub access to deploy applications introduced another problem: authentication to the cluster.&lt;/p&gt;

&lt;p&gt;Kubernetes typically solves this with service accounts for identity and RBAC for scoping what that identity is allowed to do.&lt;/p&gt;

&lt;p&gt;The implementation worked, but managing service account tokens quickly became an operational concern.&lt;/p&gt;

&lt;p&gt;At the moment I use a combination of restricted Kubernetes identities and trusted administrative machines for deployment operations.&lt;/p&gt;

&lt;p&gt;I am currently evaluating ArgoCD as a GitOps layer on top of the deployment platform, allowing the cluster to reconcile changes from Git rather than accepting deployments pushed directly from GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateful Workloads
&lt;/h2&gt;

&lt;p&gt;Stateful workloads required a different approach from application workloads.&lt;/p&gt;

&lt;p&gt;Running databases as StatefulSets works, but using local-path storage introduces an important limitation: the persistent volume is tied to the node where it was created. If that node goes down, the pod cannot simply be rescheduled elsewhere with its data, because the volume stays on the failed node.&lt;/p&gt;

&lt;p&gt;For now, databases still run as StatefulSets on local-path storage, with backups and restore procedures treated as the primary recovery path. This keeps the architecture simple, but it also means node failure recovery depends on either recovering the original node or restoring the database from backup.&lt;/p&gt;

&lt;p&gt;As operational requirements grow, the usual next steps are dedicated database infrastructure, managed database services or distributed storage solutions such as Longhorn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Architecture
&lt;/h2&gt;

&lt;p&gt;Today the platform runs on a k3s cluster hosted on Contabo infrastructure and connected through private networking.&lt;/p&gt;

&lt;p&gt;GitHub Actions handles CI, Ansible provisions infrastructure, SOPS manages secrets, and applications are deployed through an internal application deployment operator built on top of the Kubernetes API.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>go</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Profiling gorilla/websocket fan-out bottlenecks in Go</title>
      <dc:creator>ude-p</dc:creator>
      <pubDate>Tue, 19 May 2026 21:20:45 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/udep/profiling-gorillawebsocket-fan-out-bottlenecks-in-go-4c9p</link>
      <guid>https://dev.clauneck.workers.dev/udep/profiling-gorillawebsocket-fan-out-bottlenecks-in-go-4c9p</guid>
      <description>&lt;p&gt;I have been working on a realtime multiplayer server in Go.&lt;/p&gt;

&lt;p&gt;The transport layer is WebSocket using &lt;code&gt;gorilla/websocket&lt;/code&gt;, and most outbound payloads are protobuf messages generated with &lt;code&gt;protobuf-go-lite&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The core server loop is simple enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accept client input&lt;/li&gt;
&lt;li&gt;advance simulation&lt;/li&gt;
&lt;li&gt;build world snapshot&lt;/li&gt;
&lt;li&gt;broadcast snapshot to connected clients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each simulation instance runs at 60 ticks per second, so every tick has roughly 16.7 ms available for input processing, simulation, synchronization, and outbound networking.&lt;/p&gt;

&lt;p&gt;At small connection counts, everything felt fine. Once I started testing hundreds of clients connected to the same simulation instance, the problems became obvious. Joining became slower, movement updates started lagging behind, snapshots backed up, and tick duration became unstable.&lt;/p&gt;

&lt;p&gt;At that point the useful question stopped being:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;why is the server slow?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;and became:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;what part of the realtime path gets worse as connection count grows?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer was mostly fan-out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The server shape
&lt;/h2&gt;

&lt;p&gt;Each WebSocket connection owns a session object.&lt;/p&gt;

&lt;p&gt;The session manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the websocket connection&lt;/li&gt;
&lt;li&gt;outbound queues&lt;/li&gt;
&lt;li&gt;heartbeat state&lt;/li&gt;
&lt;li&gt;write synchronization&lt;/li&gt;
&lt;li&gt;inbound rate limiting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All websocket writes are intentionally serialized.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;gorilla/websocket&lt;/code&gt; connection supports at most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one concurrent reader&lt;/li&gt;
&lt;li&gt;one concurrent writer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, it is still easy to accidentally write from multiple goroutines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the normal write loop&lt;/li&gt;
&lt;li&gt;heartbeat responses&lt;/li&gt;
&lt;li&gt;disconnect handlers&lt;/li&gt;
&lt;li&gt;internal events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So every socket write goes through one locked path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;UserSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetWriteDeadline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wsWriteTimeout&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BinaryMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That lock was not the main bottleneck, but it matters for correctness. I do not want multiple goroutines calling &lt;code&gt;WriteMessage&lt;/code&gt; on the same connection.&lt;/p&gt;

&lt;p&gt;The simulation state is owned and mutated by a single goroutine.&lt;/p&gt;

&lt;p&gt;Every simulation instance has one command channel handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;joins&lt;/li&gt;
&lt;li&gt;leaves&lt;/li&gt;
&lt;li&gt;player input&lt;/li&gt;
&lt;li&gt;disconnects&lt;/li&gt;
&lt;li&gt;internal events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each tick roughly looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;drain commands&lt;/li&gt;
&lt;li&gt;advance simulation&lt;/li&gt;
&lt;li&gt;build synchronization state&lt;/li&gt;
&lt;li&gt;broadcast updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last step became the expensive one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape that broke first
&lt;/h2&gt;

&lt;p&gt;The simulation work itself was not the first thing to fail.&lt;/p&gt;

&lt;p&gt;The first bad shape was the broadcast path.&lt;/p&gt;

&lt;p&gt;Each simulation tick builds a snapshot containing replicated world state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;updatedObjects&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scratch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updatedObjects&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;objectID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;object&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;simulation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shouldReplicate&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;updatedObjects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;updatedObjects&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;syncSnapshotObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objectID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only objects marked for network replication are included in the snapshot.&lt;/p&gt;

&lt;p&gt;That snapshot then gets sent to every connected client subscribed to the simulation instance.&lt;/p&gt;

&lt;p&gt;The important detail here is that the snapshot payload is usually identical for every client in that simulation instance (a room).&lt;/p&gt;

&lt;p&gt;The original broadcast path effectively looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipient&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;recipients&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;buildSnapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarshalVT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At small scales, this feels harmless.&lt;/p&gt;

&lt;p&gt;At larger scales, the shape becomes expensive very quickly.&lt;/p&gt;

&lt;p&gt;With 500 clients connected to one simulation instance running at 60 ticks per second:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;500 recipients x 60 snapshots/sec
= 30,000 websocket snapshot writes/sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the bigger problem was not the socket writes themselves.&lt;/p&gt;

&lt;p&gt;The expensive part was repeatedly serializing the same snapshot payload for every connected client.&lt;/p&gt;

&lt;p&gt;That means every tick was doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;protobuf marshaling per recipient&lt;/li&gt;
&lt;li&gt;allocations per recipient&lt;/li&gt;
&lt;li&gt;buffer growth per recipient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;even though the payload itself was identical.&lt;/p&gt;

&lt;p&gt;The first profiling lesson was simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;if the payload is identical for every recipient, serializing it per client is wasted work&lt;/strong&gt;&lt;/p&gt;
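
&lt;p&gt;To put numbers on that lesson, the sketch below counts marshal calls per second for one room under both shapes. The socket writes stay at 30,000 per second either way; only the serialization count changes. &lt;code&gt;marshalsPerSecond&lt;/code&gt; is a helper invented for this example.&lt;/p&gt;

```go
package main

import "fmt"

// marshalsPerSecond counts protobuf marshal calls per second for one room.
// With perRecipient set, the snapshot is serialized once for every client
// on every tick. Otherwise it is serialized once per tick and shared.
func marshalsPerSecond(recipients, tickRate int, perRecipient bool) int {
	if perRecipient {
		return recipients * tickRate
	}
	return tickRate
}

func main() {
	fmt.Println(marshalsPerSecond(500, 60, true))  // 30000
	fmt.Println(marshalsPerSecond(500, 60, false)) // 60
}
```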

&lt;h2&gt;
  
  
  Profiling setup
&lt;/h2&gt;

&lt;p&gt;The server uses Pyroscope for continuous profiling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetMutexProfileFraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetBlockProfileRate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ApplicationName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ProfileTypes&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProfileType&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProfileCPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProfileAllocSpace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProfileInuseSpace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pyroscope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProfileGoroutines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tick duration is also exported through OpenTelemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;tickDuration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tickStartedAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Seconds&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metrics I cared about for this issue were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;active websocket sessions&lt;/li&gt;
&lt;li&gt;room tick p50/p95/p99&lt;/li&gt;
&lt;li&gt;protobuf marshal cost&lt;/li&gt;
&lt;li&gt;allocation pressure&lt;/li&gt;
&lt;li&gt;websocket backlog growth&lt;/li&gt;
&lt;li&gt;goroutine buildup&lt;/li&gt;
&lt;li&gt;socket write duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The profiling run made the problem obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliable vs unreliable messages
&lt;/h2&gt;

&lt;p&gt;One important change was separating outbound traffic by semantics.&lt;/p&gt;

&lt;p&gt;Not every websocket message deserves the same delivery behavior.&lt;/p&gt;

&lt;p&gt;Snapshots are latest-state messages. If the client misses one snapshot, the correct behavior is usually to apply a newer one, not replay old movement history.&lt;/p&gt;

&lt;p&gt;Some messages are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;object created&lt;/li&gt;
&lt;li&gt;object deleted&lt;/li&gt;
&lt;li&gt;session ended&lt;/li&gt;
&lt;li&gt;important gameplay events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are ordered state transitions and should not be dropped.&lt;/p&gt;

&lt;p&gt;The session ended up with two outbound paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;Reliable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unreliable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both paths operate on an &lt;code&gt;OutboundFrame&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;OutboundFrame&lt;/code&gt; is the object the server uses to represent a websocket payload that is ready to be written. In this case, it carries the marshaled protobuf bytes and the release logic needed when those bytes come from a pool.&lt;/p&gt;

&lt;p&gt;Reliable messages use a bounded queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;Outbound&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Snapshots use a single pending slot instead of a queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;LatestUnreliable&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt;
&lt;span class="n"&gt;UnreliableNotify&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The unreliable path intentionally avoids replacing a snapshot that is already waiting to be written:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;UserSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Unreliable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outboundMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LatestUnreliable&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outboundMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LatestUnreliable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outboundMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnreliableNotify&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}{}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This part is easy to get wrong.&lt;/p&gt;

&lt;p&gt;The simulation loop may call &lt;code&gt;Unreliable&lt;/code&gt; 60 times per second. If every new snapshot immediately replaced the pending one, a busy session could keep overwriting the pending frame before the writer loop gets a chance to consume it.&lt;/p&gt;

&lt;p&gt;So the rule is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if no unreliable snapshot is pending, publish one&lt;/li&gt;
&lt;li&gt;if one is already pending, drop the newer one and let the writer catch up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That keeps each session bounded to at most one pending unreliable snapshot.&lt;/p&gt;

&lt;p&gt;Reliable messages queue because they represent state transitions. Unreliable snapshots do not queue because each one only describes the latest world state and is superseded by the next tick's snapshot.&lt;/p&gt;

&lt;p&gt;A slow client may miss movement snapshots, but it should not force the server to retain stale movement history.&lt;/p&gt;
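
&lt;p&gt;The publish-or-drop rule above can be sketched in isolation. The &lt;code&gt;latestSlot&lt;/code&gt; type and its &lt;code&gt;Publish&lt;/code&gt; and &lt;code&gt;Take&lt;/code&gt; methods are hypothetical names, not the server's real API, and the sketch leaves out the writer notification and the frame release that the real method performs.&lt;/p&gt;

```go
package main

import (
    "fmt"
    "sync"
)

// latestSlot holds at most one pending snapshot for a session.
// It is a hypothetical stand-in for the LatestUnreliable field.
type latestSlot struct {
    mu      sync.Mutex
    pending []byte
    dropped int
}

// Publish stores data only if nothing is pending and reports
// whether the snapshot was accepted.
func (s *latestSlot) Publish(data []byte) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.pending != nil {
        s.dropped++
        return false
    }
    s.pending = data
    return true
}

// Take hands the pending snapshot to the writer and clears the slot.
func (s *latestSlot) Take() []byte {
    s.mu.Lock()
    defer s.mu.Unlock()
    data := s.pending
    s.pending = nil
    return data
}

func main() {
    var slot latestSlot
    // Three ticks arrive before the writer runs once.
    for _, tick := range []int{1, 2, 3} {
        slot.Publish([]byte(fmt.Sprintf("snapshot-%d", tick)))
    }
    fmt.Println(string(slot.Take()), "dropped:", slot.dropped)
    // The slot is free again, so the next tick is accepted.
    fmt.Println(slot.Publish([]byte("snapshot-4")))
}
```

&lt;p&gt;Run as written, the writer receives &lt;code&gt;snapshot-1&lt;/code&gt; and the two later snapshots are dropped. The next publish succeeds because the slot is empty again.&lt;/p&gt;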

&lt;h2&gt;
  
  
  Marshaling once per broadcast
&lt;/h2&gt;

&lt;p&gt;The fix was to move protobuf marshaling out of the recipient loop.&lt;/p&gt;

&lt;p&gt;The current broadcast path marshals once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;snapshotFrame&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;MarshalOutboundFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;snapshotEnvelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;snapshotRecipients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It then shares the same frame across sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipient&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;recipients&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unreliable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshotFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That changes the scaling shape from:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;marshal protobuf once per recipient&lt;/strong&gt;&lt;br&gt;
to:&lt;br&gt;
&lt;strong&gt;marshal protobuf once per broadcast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The network work is still multiplied by recipient count.&lt;/p&gt;

&lt;p&gt;The serialization work is not.&lt;/p&gt;
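
&lt;p&gt;To make the difference in shape concrete, the sketch below counts calls to a stand-in encoder for both loop shapes. The &lt;code&gt;marshal&lt;/code&gt; function, the call counter and the two broadcast helpers are illustrative assumptions. The recipient count and tick rate are borrowed from the test setup described later in this post.&lt;/p&gt;

```go
package main

import "fmt"

// marshalCalls counts how often the hypothetical marshal function runs.
var marshalCalls int

// marshal stands in for protobuf encoding; it only copies the payload.
func marshal(payload string) []byte {
    marshalCalls++
    return []byte(payload)
}

// broadcastPerRecipient encodes the same payload inside the recipient loop
// and returns the number of marshal calls for one broadcast.
func broadcastPerRecipient(payload string, recipients int) int {
    marshalCalls = 0
    for i := 0; i != recipients; i++ {
        frame := marshal(payload)
        _ = frame // each recipient would write its own copy
    }
    return marshalCalls
}

// broadcastShared encodes once and hands the same bytes to every recipient.
func broadcastShared(payload string, recipients int) int {
    marshalCalls = 0
    frame := marshal(payload)
    for i := 0; i != recipients; i++ {
        _ = frame // each recipient would write the shared bytes
    }
    return marshalCalls
}

func main() {
    const recipients, ticksPerSecond = 500, 60
    fmt.Println("per recipient:", broadcastPerRecipient("snapshot", recipients)*ticksPerSecond, "marshals/sec")
    fmt.Println("shared:", broadcastShared("snapshot", recipients)*ticksPerSecond, "marshals/sec")
}
```

&lt;p&gt;With 500 recipients at 60 ticks per second, the first shape performs 30000 marshal calls per second and the second performs 60.&lt;/p&gt;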
&lt;h2&gt;
  
  
  Sharing broadcast frames safely
&lt;/h2&gt;

&lt;p&gt;Once protobuf marshaling moved out of the recipient loop, the server needed a safe way to share the same websocket payload across many sessions.&lt;/p&gt;

&lt;p&gt;The broadcast path now produces one shared &lt;code&gt;OutboundFrame&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Data&lt;/span&gt;    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;refs&lt;/span&gt;    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int32&lt;/span&gt;
    &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The frame contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the marshaled websocket payload&lt;/li&gt;
&lt;li&gt;a reference counter&lt;/li&gt;
&lt;li&gt;a cleanup callback for returning pooled buffers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important detail is that the byte slice comes from a pool.&lt;/p&gt;

&lt;p&gt;The server marshals protobuf data into a reusable buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;outboundFrameBufferPool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;writeBuffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarshalToVT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That same byte slice then gets shared across every recipient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipient&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;recipients&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unreliable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshotFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point ownership matters.&lt;/p&gt;

&lt;p&gt;The pooled buffer cannot go back into the pool until every session has either written or dropped the frame.&lt;/p&gt;

&lt;p&gt;So the frame carries a reference count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;OutboundFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refs&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each session releases the frame after writing or dropping it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without ownership tracking, pooled buffers become dangerous very quickly. One session can still be writing bytes while another goroutine has already returned the buffer to the pool for reuse.&lt;/p&gt;
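
&lt;p&gt;The ownership rule can be exercised end to end in a small program. This is a sketch under assumptions, not the server's code: &lt;code&gt;sharedFrame&lt;/code&gt;, &lt;code&gt;newSharedFrame&lt;/code&gt;, &lt;code&gt;framePool&lt;/code&gt; and the &lt;code&gt;onRelease&lt;/code&gt; hook are hypothetical names, the counter is assumed to start at the recipient count, and a plain copy stands in for protobuf marshaling.&lt;/p&gt;

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// writeBuffer and framePool are hypothetical stand-ins for the pooled buffer types.
type writeBuffer struct{ data []byte }

var framePool = sync.Pool{New: func() any { return new(writeBuffer) }}

// sharedFrame mirrors the OutboundFrame shape: payload, counter, cleanup.
type sharedFrame struct {
    Data    []byte
    refs    *atomic.Int32
    release func()
}

// newSharedFrame copies payload into a pooled buffer and starts the
// counter at one reference per recipient (an assumption of this sketch).
func newSharedFrame(payload []byte, recipients int, onRelease func()) sharedFrame {
    buf := framePool.Get().(*writeBuffer)
    buf.data = append(buf.data[:0], payload...)
    refs := new(atomic.Int32)
    refs.Store(int32(recipients))
    return sharedFrame{
        Data: buf.data,
        refs: refs,
        release: func() {
            framePool.Put(buf)
            onRelease()
        },
    }
}

// Release drops one reference; only the last caller returns the buffer.
func (f sharedFrame) Release() {
    if f.refs == nil {
        return
    }
    if f.refs.Add(-1) == 0 {
        if f.release != nil {
            f.release()
        }
    }
}

func main() {
    released := 0
    frame := newSharedFrame([]byte("snapshot"), 3, func() { released++ })

    var wg sync.WaitGroup
    for i := 0; i != 3; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = len(frame.Data) // stand-in for the websocket write
            frame.Release()
        }()
    }
    wg.Wait()
    fmt.Println("release callbacks:", released)
}
```

&lt;p&gt;Three goroutines release the frame, but the cleanup callback runs once, after the last of them finishes. Only then does the buffer return to the pool.&lt;/p&gt;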

&lt;p&gt;The optimized broadcast path eventually became:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build snapshot once per tick
marshal protobuf once per broadcast
write into pooled buffer
share frame across recipients
release pooled buffer after all recipients finish
drop new unreliable snapshots while one is still pending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of these changes remove the cost of socket writes. Every connected client still needs its own websocket write.&lt;/p&gt;

&lt;p&gt;What they remove is the avoidable work around the write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;repeated protobuf marshaling&lt;/li&gt;
&lt;li&gt;repeated buffer allocation&lt;/li&gt;
&lt;li&gt;stale snapshot queue buildup&lt;/li&gt;
&lt;li&gt;unnecessary garbage collector pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Profiling result
&lt;/h2&gt;

&lt;p&gt;This was the test shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;simulation tick rate: 60Hz
connected clients: ~500
transport: gorilla/websocket
payload format: protobuf
snapshot delivery: unreliable latest-only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The profiling result looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Unoptimized&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot builds per tick&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot marshals per tick&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot marshals/sec&lt;/td&gt;
&lt;td&gt;~30,000&lt;/td&gt;
&lt;td&gt;~60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pooled frame buffers&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot delivery&lt;/td&gt;
&lt;td&gt;queued per recipient&lt;/td&gt;
&lt;td&gt;latest-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Room tick p99&lt;/td&gt;
&lt;td&gt;~243ms&lt;/td&gt;
&lt;td&gt;~18–22ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Allocation pressure&lt;/td&gt;
&lt;td&gt;very high&lt;/td&gt;
&lt;td&gt;much lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket backlog growth&lt;/td&gt;
&lt;td&gt;severe during bursts&lt;/td&gt;
&lt;td&gt;reduced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU time in protobuf marshal path&lt;/td&gt;
&lt;td&gt;dominant hotspot&lt;/td&gt;
&lt;td&gt;no longer multiplied by recipient count&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The important part is not just the p99 number.&lt;/p&gt;

&lt;p&gt;The important part is the shape change.&lt;/p&gt;

&lt;p&gt;The optimized version still performs one websocket write per connected client, but it no longer rebuilds and serializes identical snapshot payloads per recipient.&lt;br&gt;
Protobuf encoding and buffer allocation now happen once per broadcast, outside the recipient loop.&lt;/p&gt;
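
&lt;p&gt;For scale, it can help to compare the p99 values in the table with the time a 60Hz loop has per tick. The helper names below are hypothetical, and the calculation is plain arithmetic on the figures already given, not additional profiling data.&lt;/p&gt;

```go
package main

import (
    "fmt"
    "time"
)

// tickBudget returns the time available per tick at the given rate.
func tickBudget(ticksPerSecond int) time.Duration {
    return time.Second / time.Duration(ticksPerSecond)
}

// ticksSpanned reports how many whole tick intervals a single tick of
// the given duration covers, rounded down.
func ticksSpanned(tick time.Duration, ticksPerSecond int) int {
    return int(tick / tickBudget(ticksPerSecond))
}

func main() {
    budget := tickBudget(60)
    fmt.Println("budget per tick:", budget.Round(100*time.Microsecond))
    fmt.Println("intervals covered by a 243ms tick:", ticksSpanned(243*time.Millisecond, 60))
    fmt.Println("intervals covered by a 22ms tick:", ticksSpanned(22*time.Millisecond, 60))
}
```

&lt;p&gt;At 60Hz the budget is roughly 16.7ms per tick, so a 243ms tick covers about 14 tick intervals, while a 22ms tick covers little more than one.&lt;/p&gt;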
&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;The issue was not that &lt;code&gt;gorilla/websocket&lt;/code&gt; is slow.&lt;/p&gt;

&lt;p&gt;The issue was that fan-out multiplies small costs very aggressively.&lt;/p&gt;

&lt;p&gt;At one client:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;protobuf marshaling is noise&lt;/li&gt;
&lt;li&gt;allocations are noise&lt;/li&gt;
&lt;li&gt;queue pressure is noise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At hundreds of clients and 60 ticks per second, those costs become visible very quickly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;
&lt;li&gt;allocations&lt;/li&gt;
&lt;li&gt;websocket backlog growth&lt;/li&gt;
&lt;li&gt;tick latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main changes were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;separate reliable events from snapshots&lt;/li&gt;
&lt;li&gt;make snapshots latest-only&lt;/li&gt;
&lt;li&gt;marshal broadcast payloads once&lt;/li&gt;
&lt;li&gt;share pooled frames across sessions&lt;/li&gt;
&lt;li&gt;use ref-counted ownership for shared buffers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is still a larger scaling problem left.&lt;/p&gt;

&lt;p&gt;Right now every replicated object is still sent to every subscribed client. That means the network shape is still:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;objects x recipients
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is proper interest management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;visibility filtering&lt;/li&gt;
&lt;li&gt;area-of-interest replication&lt;/li&gt;
&lt;li&gt;per-client relevance filtering&lt;/li&gt;
&lt;/ul&gt;
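
&lt;p&gt;As a rough illustration of what area-of-interest filtering could look like, the sketch below keeps only the entities within a radius of the viewer. Nothing here comes from the server described in this post: the &lt;code&gt;entity&lt;/code&gt; type, the &lt;code&gt;visibleTo&lt;/code&gt; function and the circular radius test are all assumptions chosen for brevity.&lt;/p&gt;

```go
package main

import "fmt"

// entity is a hypothetical replicated object with a 2D position.
type entity struct {
    ID   int
    X, Y float64
}

// visibleTo returns the entities within radius of the viewer position.
// It compares squared distances to avoid a square root.
func visibleTo(viewerX, viewerY, radius float64, world []entity) []entity {
    var visible []entity
    for _, e := range world {
        dx, dy := e.X-viewerX, e.Y-viewerY
        if radius*radius >= dx*dx+dy*dy {
            visible = append(visible, e)
        }
    }
    return visible
}

func main() {
    world := []entity{{1, 0, 0}, {2, 3, 4}, {3, 50, 50}}
    for _, e := range visibleTo(0, 0, 10, world) {
        fmt.Println("replicate entity", e.ID)
    }
}
```

&lt;p&gt;One design consequence to keep in mind: once payloads differ per client, a single shared frame no longer fits every recipient, so marshaling once would likely have to apply per group of clients that see the same set of objects.&lt;/p&gt;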

&lt;p&gt;But I would not start there.&lt;/p&gt;

&lt;p&gt;First:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;profile the current system&lt;/li&gt;
&lt;li&gt;remove multiplied work&lt;/li&gt;
&lt;li&gt;fix ownership problems&lt;/li&gt;
&lt;li&gt;stabilize the hot path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then decide whether the architecture actually needs a larger cut.&lt;/p&gt;

</description>
      <category>go</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
