<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevOps AI ToolKit</title>
    <description>The latest articles on DEV Community by DevOps AI ToolKit (devopsaitoolkit).</description>
    <link>https://dev.clauneck.workers.dev/devopsaitoolkit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13604%2F98511a3b-3821-49c1-b918-dca37eae0c17.png</url>
      <title>DEV Community: DevOps AI ToolKit</title>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.clauneck.workers.dev/feed/devopsaitoolkit"/>
    <language>en</language>
    <item>
      <title>Why AI Loves Ansible (And You Should Let It Help)</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 25 Jun 2026 04:36:48 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/why-ai-loves-ansible-and-you-should-let-it-help-3o2p</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/why-ai-loves-ansible-and-you-should-let-it-help-3o2p</guid>
      <description>&lt;p&gt;If you compare how well Claude handles Ansible against how well it handles, say, raw bash or kubectl YAML, Ansible wins by a wide margin. The reason isn't subtle: Ansible's shape — declarative, idempotent, modules-with-arguments — happens to map almost perfectly to how LLMs reason. They're good at producing structured output that fills in a known template, and that's what most Ansible tasks are.&lt;/p&gt;

&lt;p&gt;This means AI-assisted Ansible work is the highest-leverage automation pairing I know of. If you only adopt AI for one infrastructure tool, make it Ansible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Ansible AI-friendly
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modules have published contracts
&lt;/h3&gt;

&lt;p&gt;Every Ansible module has a documented argument spec: what's required, what's optional, what the defaults are. The model can fit your intent into the spec with high accuracy because the spec is finite and known.&lt;/p&gt;

&lt;p&gt;Compare this to shell: there are a thousand ways to "create a user with a specific UID, member of these groups, with this shell, and a home directory in this location." In bash, every distro is slightly different. In Ansible, you use &lt;code&gt;ansible.builtin.user&lt;/code&gt; with named arguments.&lt;/p&gt;

&lt;p&gt;In my experience the model gets this right almost &lt;em&gt;every single time&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotency is the default
&lt;/h3&gt;

&lt;p&gt;When a model generates a Python script, it has to think about "what if this is run twice." When it generates Ansible, most modules handle that for free. The model can write the task, ignore the re-run case, and produce something that works.&lt;/p&gt;

&lt;p&gt;This means the cognitive load on both sides — model and human — is lower. You're describing the target state, not the procedure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roles and structure are predictable
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;roles/foo/{defaults,vars,tasks,handlers,meta}/main.yml&lt;/code&gt;, plus &lt;code&gt;templates/&lt;/code&gt; and &lt;code&gt;files/&lt;/code&gt; for everything else: every Ansible role follows the same layout. The model can scaffold a new role in seconds because that layout is fixed.&lt;/p&gt;

&lt;p&gt;If you ask Claude to "create a new role for installing PostgreSQL 16 on Ubuntu 24.04 with default user &lt;code&gt;postgres&lt;/code&gt; and a tuned &lt;code&gt;postgresql.conf&lt;/code&gt;," you'll get a complete role structure with &lt;code&gt;defaults/main.yml&lt;/code&gt;, &lt;code&gt;tasks/main.yml&lt;/code&gt;, a Jinja template, and &lt;code&gt;handlers/main.yml&lt;/code&gt; — all consistent, all in the right places. The structure is constrained enough that the model rarely improvises.&lt;/p&gt;
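
&lt;p&gt;As a rough sketch of how fixed that layout is, the whole skeleton fits in a few lines of shell. The function name &lt;code&gt;scaffold_role&lt;/code&gt; and the default &lt;code&gt;roles&lt;/code&gt; base directory are illustrative assumptions, not part of Ansible; in practice &lt;code&gt;ansible-galaxy role init&lt;/code&gt; generates a similar skeleton for you.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the conventional role skeleton under BASE/NAME.
# Usage: scaffold_role NAME [BASE]   (BASE defaults to "roles")
scaffold_role() {
  local name="$1"
  local base="${2:-roles}"
  local dir

  # Directories that conventionally hold a main.yml entry point.
  for dir in defaults vars tasks handlers meta; do
    mkdir -p "$base/$name/$dir"
    touch "$base/$name/$dir/main.yml"
  done

  # templates/ and files/ hold arbitrary files, so they get no main.yml.
  mkdir -p "$base/$name/templates" "$base/$name/files"
}
```

&lt;p&gt;Calling &lt;code&gt;scaffold_role postgresql&lt;/code&gt; creates &lt;code&gt;roles/postgresql/&lt;/code&gt; with empty entry points that you (or the model) then fill in.&lt;/p&gt;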

&lt;h2&gt;
  
  
  Use cases where AI shines for Ansible
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generating new roles from scratch
&lt;/h3&gt;

&lt;p&gt;This is the killer app. You can describe a role in two sentences and get a 90%-done implementation. You then refine: add validation, adjust defaults, write a README.&lt;/p&gt;

&lt;p&gt;I now treat "draft a new role with Claude" as the default first step. Even if I rewrite half of it, the structure saves me 20 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Converting shell scripts to playbooks
&lt;/h3&gt;

&lt;p&gt;If you have a legacy bash script that provisions a server, pasting it into Claude with "convert this to an idempotent Ansible playbook using the appropriate modules" produces a usable result. The model knows when to use &lt;code&gt;ansible.builtin.file&lt;/code&gt;, &lt;code&gt;lineinfile&lt;/code&gt;, &lt;code&gt;template&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;You'll need to verify the idempotency manually (run twice, expect 0 changes on the second run), but the conversion is mostly mechanical.&lt;/p&gt;
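
&lt;p&gt;One way to make the "run twice" check less manual is to scan the second run's PLAY RECAP for a non-zero &lt;code&gt;changed=&lt;/code&gt; counter. This is a minimal sketch: the function name is hypothetical, and &lt;code&gt;site.yml&lt;/code&gt; and &lt;code&gt;inventory.ini&lt;/code&gt; in the usage comment are placeholders for your own files.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the given ansible-playbook output
# reports changed=0 for every host in its PLAY RECAP.
recap_has_no_changes() {
  local output="$1"
  # grep succeeds when any recap line shows changed=1 or higher.
  if printf '%s\n' "$output" | grep -Eq 'changed=[1-9]'; then
    return 1
  fi
  return 0
}

# Usage sketch (assumes ansible-playbook is installed):
#   ansible-playbook -i inventory.ini site.yml              # first run applies changes
#   second_run="$(ansible-playbook -i inventory.ini site.yml)"
#   recap_has_no_changes "$second_run" || echo "role is not idempotent"
```

&lt;p&gt;The helper only looks at the &lt;code&gt;changed=&lt;/code&gt; counters, so check the exit status of &lt;code&gt;ansible-playbook&lt;/code&gt; separately to catch runs that failed outright.&lt;/p&gt;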

&lt;h3&gt;
  
  
  Refactoring playbooks to use FQCN
&lt;/h3&gt;

&lt;p&gt;Ansible 2.10+ wants fully-qualified collection names: &lt;code&gt;ansible.builtin.package&lt;/code&gt; instead of &lt;code&gt;package&lt;/code&gt;. Old playbooks have hundreds of short-form references. AI is a perfect fit for this kind of mass refactoring — it knows the mapping and won't get bored.&lt;/p&gt;

&lt;p&gt;Paste a 200-line playbook, ask for it back with FQCN throughout, and you're done in 30 seconds. Verify with &lt;code&gt;ansible-lint&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Molecule tests
&lt;/h3&gt;

&lt;p&gt;Molecule scaffolding is repetitive — same &lt;code&gt;molecule.yml&lt;/code&gt;, same &lt;code&gt;converge.yml&lt;/code&gt;, same &lt;code&gt;verify.yml&lt;/code&gt; structure for most roles. AI is great at generating the boilerplate. You describe what you want to test; the model writes the assertion playbook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jinja template generation
&lt;/h3&gt;

&lt;p&gt;Jinja is just structured-enough that AI handles it well — generating templates for config files (nginx, postgres, sshd) from a description of the desired behavior. The model knows the configuration keys and the conditional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI struggles with Ansible
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Variable precedence
&lt;/h3&gt;

&lt;p&gt;Ansible's variable precedence rules (22 levels in the official documentation) are not intuitive. The model will sometimes suggest putting a variable in &lt;code&gt;vars/main.yml&lt;/code&gt; when you really want it in &lt;code&gt;defaults/main.yml&lt;/code&gt; (the former overrides the latter, and also overrides inventory variables). The result: users of your role can't override the variable they expected to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When the model puts something in &lt;code&gt;vars/&lt;/code&gt;, ask "should this be overridable by the role user?" If yes, move to &lt;code&gt;defaults/&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom facts and &lt;code&gt;set_fact&lt;/code&gt; lifetime
&lt;/h3&gt;

&lt;p&gt;The model sometimes uses &lt;code&gt;set_fact&lt;/code&gt; for values that need to persist across playbook runs, but doesn't add &lt;code&gt;cacheable: true&lt;/code&gt; (which also needs a configured fact cache). Within a single run the fact stays available to later plays for the same host, but it is gone when the run ends, and the next run sees the variable as undefined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When you use &lt;code&gt;set_fact&lt;/code&gt; for a value you need later, verify the lifetime is what you expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vault integration
&lt;/h3&gt;

&lt;p&gt;The model will sometimes generate playbooks that reference &lt;code&gt;vault_db_password&lt;/code&gt; as a variable but don't include the &lt;code&gt;lookup('community.hashi_vault.hashi_vault', ...)&lt;/code&gt; call or the Ansible Vault encrypted file. You have to wire up the secret source separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; For any sensitive variable in a generated playbook, verify there's an actual source for it (Vault encrypted file, external manager lookup, environment variable).&lt;/p&gt;

&lt;h3&gt;
  
  
  Distro-specific paths
&lt;/h3&gt;

&lt;p&gt;The model defaults to Debian/Ubuntu conventions. If you run on RHEL, you'll sometimes get &lt;code&gt;apt&lt;/code&gt; modules in tasks that should be using the &lt;code&gt;package&lt;/code&gt; module (or distro conditionals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When generating playbooks for non-Debian systems, audit for &lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;apt_repository&lt;/code&gt;, &lt;code&gt;dpkg_selections&lt;/code&gt;, and ask for the abstraction (&lt;code&gt;package&lt;/code&gt;) or the distro split.&lt;/p&gt;
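
&lt;p&gt;That audit is easy to script. The sketch below greps task files for the three Debian-only modules named above, in short or fully-qualified form. The function name is a hypothetical helper, and the pattern assumes each module key sits on its own line, as in typical task files.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: list lines that call Debian-only modules.
# Prints matches with line numbers; returns non-zero when nothing matches.
# Usage: find_debian_only_modules FILE...
find_debian_only_modules() {
  grep -nE '^[[:space:]]*(-[[:space:]]+)?(ansible\.builtin\.)?(apt|apt_repository|dpkg_selections):' "$@"
}
```

&lt;p&gt;Run it over &lt;code&gt;roles/*/tasks/*.yml&lt;/code&gt; before targeting RHEL hosts; every hit is a task that needs either &lt;code&gt;package&lt;/code&gt; or a distro conditional.&lt;/p&gt;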

&lt;h2&gt;
  
  
  A workflow that's been working for me
&lt;/h2&gt;

&lt;p&gt;For a new role, my process now looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Describe the role&lt;/strong&gt; to Claude in 2-3 sentences (purpose, target distros, key behaviors).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate the scaffolding&lt;/strong&gt;: &lt;code&gt;defaults/main.yml&lt;/code&gt;, &lt;code&gt;tasks/main.yml&lt;/code&gt;, a template if needed, &lt;code&gt;meta/main.yml&lt;/code&gt; with platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every task.&lt;/strong&gt; Look for the failure modes above (precedence, lifetime, Vault, distros).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Molecule tests.&lt;/strong&gt; Have Claude scaffold &lt;code&gt;molecule/default/&lt;/code&gt;, then write the assertions yourself or ask for them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;ansible-lint&lt;/code&gt; and Molecule.&lt;/strong&gt; Fix what they catch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotence check.&lt;/strong&gt; Run the role twice; second run should report 0 changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refine the README.&lt;/strong&gt; This is the one place I write from scratch — explaining the role to future-me.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This takes maybe 30 minutes for a moderately complex role. Without AI assistance, the same role would take me a couple of hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on safety
&lt;/h2&gt;

&lt;p&gt;Ansible often runs with root privileges on production servers. Whatever the model generates, &lt;em&gt;you&lt;/em&gt; are responsible for what it does. Two patterns I follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
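&lt;strong&gt;Run with &lt;code&gt;--check --diff&lt;/code&gt; before any real run.&lt;/strong&gt; Dry-run the playbook in check mode and verify the diff matches what you expect. Keep in mind that &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;shell&lt;/code&gt; tasks are skipped in check mode, so a clean dry run doesn't cover them.&lt;/li&gt;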
&lt;li&gt;
&lt;strong&gt;Test on a sandbox host first.&lt;/strong&gt; Especially for new roles. Don't trust the model with production until the role has run cleanly on a throwaway VM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the same disciplines that apply to any infrastructure change. AI doesn't change the discipline; it just makes you faster at the parts before the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think Ansible is the right entry point
&lt;/h2&gt;

&lt;p&gt;If you're new to using AI for infrastructure work and want to pick one tool to start with, Ansible is the safest, highest-leverage choice. The structure makes the AI accurate. The idempotency makes mistakes recoverable. The module ecosystem covers most common cases.&lt;/p&gt;

&lt;p&gt;By the time you've used AI to write a dozen Ansible playbooks, you'll have developed the intuition for what AI handles well and what needs human attention. That intuition transfers to harder tools — Terraform, Kubernetes, custom shell — where the cost of AI mistakes is higher.&lt;/p&gt;

&lt;p&gt;For our full set of AI-driven Ansible workflows, see the &lt;a href="https://dev.clauneck.workers.dev/categories/iac/"&gt;IaC category&lt;/a&gt; — including &lt;a href="https://dev.clauneck.workers.dev/prompts/ansible-vault-secrets-management/"&gt;ansible-vault-secrets-management&lt;/a&gt; and &lt;a href="https://dev.clauneck.workers.dev/prompts/ansible-molecule-testing/"&gt;ansible-molecule-testing&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/why-ai-loves-ansible/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ansible</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>AI for GitLab CI Authoring: Save Hours, Avoid Footguns</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sat, 20 Jun 2026 12:15:22 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/ai-for-gitlab-ci-authoring-save-hours-avoid-footguns-3lco</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/ai-for-gitlab-ci-authoring-save-hours-avoid-footguns-3lco</guid>
      <description>&lt;p&gt;GitLab CI YAML is one of those formats where you can stare at it for an hour, get it 95% right, and have it fail with &lt;code&gt;yaml: line 12: did not find expected key&lt;/code&gt; because of a tab character. AI assistants are very fast at this kind of work. They're also confidently wrong about specific GitLab features in ways that waste a lot of time if you don't know what to check.&lt;/p&gt;

&lt;p&gt;After a year of letting Claude write a lot of my pipelines, here's what works and what doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI gets right consistently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard job shapes
&lt;/h3&gt;

&lt;p&gt;"Write me a job that builds a Docker image, pushes to the GitLab Container Registry, and tags with the commit SHA and &lt;code&gt;latest&lt;/code&gt; on the default branch." Type that into Claude and you get a working job in five seconds. The shape is well-established and the model has seen thousands of variations.&lt;/p&gt;

&lt;p&gt;The same is true for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test jobs across languages (pytest, jest, go test, etc.)&lt;/li&gt;
&lt;li&gt;Standard cache configurations&lt;/li&gt;
&lt;li&gt;Standard artifact patterns&lt;/li&gt;
&lt;li&gt;Basic &lt;code&gt;rules:&lt;/code&gt; for branch / tag / MR pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find yourself writing one of these from scratch, you're spending time that you don't need to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Translating from other CIs
&lt;/h3&gt;

&lt;p&gt;GitLab CI has obvious parallels to GitHub Actions, CircleCI, Jenkins declarative pipelines, etc. AI is &lt;em&gt;excellent&lt;/em&gt; at translating between them. The structures rhyme; the model knows the dictionary.&lt;/p&gt;

&lt;p&gt;If you're migrating from Actions to GitLab CI, paste the workflow and ask for the GitLab CI equivalent. You'll get something 80% right that you can refine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reviewing pipelines for inefficiency
&lt;/h3&gt;

&lt;p&gt;This is the underrated use case. Paste your &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; and ask: "what's the critical path of this pipeline, and what's making it slow?" The model will spot things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Your test job downloads node_modules from cache, but install-deps doesn't push to cache — your cache key is broken."&lt;/li&gt;
&lt;li&gt;"Your build and deploy stages are sequential but build's artifacts aren't used by deploy — they can be parallel with &lt;code&gt;needs:&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;"Your &lt;code&gt;rules:changes:&lt;/code&gt; doesn't include &lt;code&gt;package-lock.json&lt;/code&gt;, so dependency changes don't retrigger tests."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real findings I've gotten from Claude on pipelines I thought I'd already optimized. Worth the five-minute review.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI gets wrong — and how to catch it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;rules:&lt;/code&gt; vs &lt;code&gt;only/except&lt;/code&gt; confusion
&lt;/h3&gt;

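&lt;p&gt;The model will sometimes mix them in the same job. GitLab does not allow this: a job that defines both &lt;code&gt;rules:&lt;/code&gt; and &lt;code&gt;only:&lt;/code&gt;/&lt;code&gt;except:&lt;/code&gt; is rejected with a "key may not be used with &lt;code&gt;rules&lt;/code&gt;" error, so the pipeline fails to start instead of doing what you expect.&lt;/p&gt;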

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Are you using &lt;code&gt;rules:&lt;/code&gt; OR &lt;code&gt;only:&lt;/code&gt;/&lt;code&gt;except:&lt;/code&gt; in each job? Pick one. (Use &lt;code&gt;rules:&lt;/code&gt; — &lt;code&gt;only/except&lt;/code&gt; is legacy.)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;$CI_COMMIT_BRANCH&lt;/code&gt; empty on MR pipelines
&lt;/h3&gt;

&lt;p&gt;A common bug: you ask for "this job runs on the default branch" and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_BRANCH == "main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is correct for branch pipelines. But &lt;code&gt;$CI_COMMIT_BRANCH&lt;/code&gt; is &lt;strong&gt;empty&lt;/strong&gt; in MR (&lt;code&gt;merge_request_event&lt;/code&gt;) pipelines, so if you have MR pipelines enabled, the job silently won't run when developers expect it to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Does your pipeline target both push events and MR events? If so, you probably want &lt;code&gt;$CI_MERGE_REQUEST_TARGET_BRANCH_NAME&lt;/code&gt; or to handle both pipeline sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;needs:&lt;/code&gt; referencing hidden jobs
&lt;/h3&gt;

&lt;p&gt;Hidden jobs (prefixed with &lt;code&gt;.&lt;/code&gt;) are templates — they don't execute. If you do &lt;code&gt;needs: [".lint"]&lt;/code&gt;, your job will fail with a confusing error because GitLab thinks you're depending on a job that doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Every &lt;code&gt;needs:&lt;/code&gt; entry should be a real job name, not a template.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-apply rules that don't include the right branches
&lt;/h3&gt;

&lt;p&gt;The model loves writing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_BRANCH == "main"&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;never&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



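&lt;p&gt;This works for branch pipelines on &lt;code&gt;main&lt;/code&gt; (including schedules that target &lt;code&gt;main&lt;/code&gt;) but blocks the job on tags, on MR pipelines, and on every other branch. Sometimes that's what you want. Often it's not. The trailing &lt;code&gt;when: never&lt;/code&gt; is also redundant, since a job that matches no rule is not added to the pipeline anyway.&lt;/p&gt;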

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; What pipeline sources do you expect this job to run in? List them, then verify your rules cover each.&lt;/p&gt;

&lt;h3&gt;
  
  
  Imaginary GitLab features
&lt;/h3&gt;

&lt;p&gt;This is the most expensive AI failure mode. The model will sometimes generate syntax for features that don't exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;condition:&lt;/code&gt; field that's actually OPA/Conftest, not GitLab CI&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;auto_retry:&lt;/code&gt; block that's GitHub Actions, not GitLab&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;before_script:&lt;/code&gt; keyword that does exist but with different semantics than the model claims&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; If you see a keyword you haven't seen before in GitLab docs, verify it exists. The CI lint endpoint (&lt;code&gt;/api/v4/projects/:id/ci/lint&lt;/code&gt;; the older global &lt;code&gt;/api/v4/ci/lint&lt;/code&gt; was removed in GitLab 16.0) catches most of these, but some pass lint and just behave weirdly.&lt;/p&gt;

&lt;h2&gt;
  
  
  A workflow that catches the failures cheaply
&lt;/h2&gt;

&lt;p&gt;I now do this for any non-trivial pipeline change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Draft with AI.&lt;/strong&gt; Describe the desired behavior in plain English; let the model write the YAML.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every line.&lt;/strong&gt; Treat the output as a draft you'd write yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint via the API.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;# PROJECT_ID is a placeholder for your numeric project ID.&lt;/span&gt;
   &lt;span class="c"&gt;# GitLab 16.0+ only offers the project-scoped lint endpoint.&lt;/span&gt;
   curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"PRIVATE-TOKEN: &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;content&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .gitlab-ci.yml | jq &lt;span class="nt"&gt;-Rs&lt;/span&gt; .&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITLAB_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v4/projects/&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;/ci/lint"&lt;/span&gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Run on a sandbox branch.&lt;/strong&gt; Push to a branch that won't trigger deploys; verify the pipeline runs the jobs you expect, when you expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff against the existing pipeline.&lt;/strong&gt; If the AI introduced changes you didn't ask for (a different cache key, a removed &lt;code&gt;interruptible:&lt;/code&gt;), revert them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 5 is the one most people skip. The model is good at writing YAML but not at preserving your previous decisions. If you don't diff, you'll lose your old cache strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical example
&lt;/h2&gt;

&lt;p&gt;Last month I needed to add a job that runs &lt;code&gt;terraform plan&lt;/code&gt; on every MR and posts the output as a comment. Drafted with Claude in two minutes; it produced something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;terraform-plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/terraform:1.9&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform init&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform plan -out=tfplan -no-color&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform show -no-color tfplan &amp;gt; plan.txt&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;curl -X POST -H "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \&lt;/span&gt;
          &lt;span class="s"&gt;-d "body=$(cat plan.txt | jq -Rs .)" \&lt;/span&gt;
          &lt;span class="s"&gt;"$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_PIPELINE_SOURCE == "merge_request_event"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; right. Two issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
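&lt;strong&gt;&lt;code&gt;PRIVATE-TOKEN&lt;/code&gt; as a CI variable&lt;/strong&gt;: using a personal access token for CI is the old pattern, because it ties the pipeline to one user's account and is painful to rotate. A project access token with &lt;code&gt;api&lt;/code&gt; scope, stored as a masked CI/CD variable, is the safer choice here. &lt;code&gt;$CI_JOB_TOKEN&lt;/code&gt; is the usual answer for in-instance API calls, but it only works against a limited set of endpoints, so check the GitLab docs before relying on it for posting MR notes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;terraform init -backend-config&lt;/code&gt;&lt;/strong&gt;: this works if the backend is configured in code, but if you have multiple environments using the same module, you'd want to specify which backend.&lt;/li&gt;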
&lt;/ol&gt;

&lt;p&gt;Both fixes are 30 seconds. Without the AI I'd have spent 15 minutes writing the curl invocation alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;AI doesn't replace knowing GitLab CI. It removes the typing and the boilerplate so you can spend your attention on the parts that matter — the &lt;code&gt;rules:&lt;/code&gt; logic, the cache keys, the secrets, the environment promotion.&lt;/p&gt;

&lt;p&gt;Once you've internalized the failure modes above, the workflow becomes mostly automatic. You stop reading the boilerplate and start reading the rules. That's where the bugs live.&lt;/p&gt;

&lt;p&gt;For the prompt set we use on GitLab CI specifically, see the &lt;a href="https://dev.clauneck.workers.dev/categories/gitlab-cicd/"&gt;GitLab CI/CD category&lt;/a&gt; — particularly &lt;a href="https://dev.clauneck.workers.dev/prompts/gitlab-pipeline-optimization/"&gt;gitlab-pipeline-optimization&lt;/a&gt; and &lt;a href="https://dev.clauneck.workers.dev/prompts/gitlab-ci-rules-debugging/"&gt;gitlab-ci-rules-debugging&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/ai-for-gitlab-ci-authoring/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gitlab</category>
      <category>cicd</category>
      <category>ai</category>
    </item>
    <item>
      <title>Securing AI-Generated Bash Scripts Before You Run Them</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 18 Jun 2026 15:51:56 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/securing-ai-generated-bash-scripts-before-you-run-them-401m</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/securing-ai-generated-bash-scripts-before-you-run-them-401m</guid>
      <description>&lt;p&gt;Bash is the easiest language for AI to write and the easiest language to get devastating output from. A 20-line script that "just cleans up old files" can recursively delete a home directory because the model assumed a variable would always be set. A "simple log shipper" can write your secrets to a remote server because the model used &lt;code&gt;set -x&lt;/code&gt; for debugging and forgot to remove it.&lt;/p&gt;

&lt;p&gt;I have run AI-generated bash that I should not have. Most engineers I know have too. After enough close calls, there's a short checklist that catches the worst of it. This is that checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five things to check before running any AI-generated bash
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Does it start with a strict pragma?
&lt;/h3&gt;

&lt;p&gt;The first lines of any non-trivial bash script should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -e&lt;/code&gt;&lt;/strong&gt;: exit when a command fails. Without this, a failure in line 5 doesn't stop the script from happily running lines 6-50. It has exceptions (for example, commands tested by &lt;code&gt;if&lt;/code&gt; or followed by &lt;code&gt;||&lt;/code&gt;), so treat it as a safety net, not a guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -u&lt;/code&gt;&lt;/strong&gt;: error on undefined variables. This is the one that saves you from &lt;code&gt;rm -rf $UNDEFINED/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -o pipefail&lt;/code&gt;&lt;/strong&gt;: propagate failures through pipes. Without it, &lt;code&gt;failing-command | grep something&lt;/code&gt; reports success whenever grep succeeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IFS=$'\n\t'&lt;/code&gt;&lt;/strong&gt;: sane field splitting. Unquoted expansions no longer split on spaces, which defends against word-splitting bugs in filenames.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the AI-generated script doesn't have these, add them and re-read the script. You'll often discover bugs the pragma now flags.&lt;/p&gt;
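
&lt;p&gt;To see what the pragma changes, here is a small sketch that runs each case in a child &lt;code&gt;bash&lt;/code&gt; so a failure doesn't end your own shell. The function names and the variable &lt;code&gt;DEMO_DIR_NEVER_SET&lt;/code&gt; are made up for the demo and assumed to be unset in your environment.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Prints the exit status of a pipeline whose first command fails.
status_without_pipefail() {
  bash -c 'false | cat; echo "$?"'
}

# Same pipeline, but with pipefail enabled first.
status_with_pipefail() {
  bash -c 'set -o pipefail; false | cat; echo "$?"'
}

# Prints the exit status of a child script that expands an unset
# variable while set -u is active. The echo inside never runs.
unset_expansion_status() {
  local status=0
  bash -c 'set -u; echo "would clean $DEMO_DIR_NEVER_SET/"' || status=$?
  echo "$status"
}

status_without_pipefail   # prints 0: the failure of "false" is hidden by "cat"
status_with_pipefail      # prints 1: the pipeline now reports the failure
unset_expansion_status    # prints a non-zero status; bash reports an unbound variable on stderr
```

&lt;p&gt;The last case is the important one: without &lt;code&gt;set -u&lt;/code&gt; the child script would have printed &lt;code&gt;would clean /&lt;/code&gt; and carried on.&lt;/p&gt;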

&lt;h3&gt;
  
  
  2. Is every variable expansion quoted?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="nv"&gt;$TARGET_DIR&lt;/span&gt;

&lt;span class="c"&gt;# Right&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TARGET_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



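&lt;p&gt;The wrong version is what causes the "I deleted the root directory" stories. If &lt;code&gt;$TARGET_DIR&lt;/code&gt; contains a space, the command becomes &lt;code&gt;rm -rf foo bar&lt;/code&gt; (delete two unintended things). If it is empty, a bare &lt;code&gt;rm -rf $TARGET_DIR&lt;/code&gt; silently does nothing, but the common variant &lt;code&gt;rm -rf $TARGET_DIR/*&lt;/code&gt; becomes &lt;code&gt;rm -rf /*&lt;/code&gt;. Quoting stops the splitting; for the empty case, combine it with &lt;code&gt;set -u&lt;/code&gt; or &lt;code&gt;${TARGET_DIR:?}&lt;/code&gt;.&lt;/p&gt;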

&lt;p&gt;Models default to the wrong version about half the time because the right version is harder to write in chat ("escape the quotes!") and the wrong version is what most blogs show.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; When reading AI bash, mentally check every &lt;code&gt;$VAR&lt;/code&gt; for quotes. Add them if missing. This is the single biggest source of bash disasters.&lt;/p&gt;
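
&lt;p&gt;A harmless way to see the difference is to swap &lt;code&gt;rm&lt;/code&gt; for a function that only counts its arguments. This sketch assumes the default &lt;code&gt;IFS&lt;/code&gt; (so spaces split words); &lt;code&gt;count_args&lt;/code&gt; and the variable names are made up for the demo.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Harmless stand-in for rm: prints how many arguments it received.
count_args() {
  echo "$#"
}

dir_with_space="old logs"
empty_dir=""

count_args $dir_with_space     # prints 2: split into "old" and "logs"
count_args "$dir_with_space"   # prints 1: one path, as intended
count_args $empty_dir          # prints 0: the argument vanished entirely
count_args "$empty_dir"        # prints 1: a single empty argument
```

&lt;p&gt;Every unquoted line is a case where the command would have operated on something other than the path you meant.&lt;/p&gt;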

&lt;h3&gt;
  
  
  3. What happens if a step fails partway through?
&lt;/h3&gt;

&lt;p&gt;The AI will cheerfully write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/new-app
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/new-app
&lt;span class="nb"&gt;tar &lt;/span&gt;xzf &lt;span class="nv"&gt;$TARBALL&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nv"&gt;$TARBALL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens if &lt;code&gt;tar xzf&lt;/code&gt; fails (corrupt tarball, full disk)? With &lt;code&gt;set -e&lt;/code&gt;, the script stops. Good. Without &lt;code&gt;set -e&lt;/code&gt;, it continues to &lt;code&gt;rm $TARBALL&lt;/code&gt; and deletes your tarball with no backup.&lt;/p&gt;

&lt;p&gt;For any state-changing script, ask yourself: at each step, what's the recovery path if the step fails? If the answer is "nothing automated," the script should at least &lt;em&gt;not delete data&lt;/em&gt; before verifying the previous step succeeded.&lt;/p&gt;

&lt;p&gt;The AI almost never thinks about this on its own.&lt;/p&gt;
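
&lt;p&gt;A minimal sketch of the safer shape: tie the destructive step to the success of the step before it, so the script behaves the same with or without &lt;code&gt;set -e&lt;/code&gt;. The function name &lt;code&gt;extract_then_remove&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract TARBALL into DEST and delete the tarball
# only after extraction succeeded.
# Usage: extract_then_remove TARBALL DEST
extract_then_remove() {
  local tarball="$1"
  local dest="$2"

  mkdir -p "$dest"

  # -C extracts into DEST without changing this script's working directory.
  if tar -xzf "$tarball" -C "$dest"; then
    rm -- "$tarball"
  else
    echo "extraction failed; keeping $tarball"
    return 1
  fi
}
```

&lt;p&gt;Using &lt;code&gt;-C&lt;/code&gt; instead of &lt;code&gt;cd&lt;/code&gt; also removes a second failure point: a failed &lt;code&gt;cd&lt;/code&gt; would otherwise leave the following commands running in the wrong directory.&lt;/p&gt;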

&lt;h3&gt;
  
  
  4. Are secrets visible in logs?
&lt;/h3&gt;

&lt;p&gt;The most common way AI-generated bash leaks secrets is via &lt;code&gt;set -x&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# debugging&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://api.example.com/...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;set -x&lt;/code&gt;, every command is printed including the expanded variables. Your API token is now in the script's output, which is in your CI logs, which are visible to anyone with project access.&lt;/p&gt;

&lt;p&gt;The fix is selective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; +x  &lt;span class="c"&gt;# disable trace&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://api.example.com/...
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# re-enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or simply remove &lt;code&gt;set -x&lt;/code&gt; once debugging is done. The model frequently leaves it in.&lt;/p&gt;
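
&lt;p&gt;If tracing has to stay on for the rest of the script, the toggle can live inside a function so it is harder to forget. A sketch under stated assumptions: &lt;code&gt;fetch_with_token&lt;/code&gt; is a hypothetical helper, and the token is already exported as &lt;code&gt;API_TOKEN&lt;/code&gt;. The caller passes only the URL, so the traced call line never contains the token:&lt;/p&gt;

```shell
# Hypothetical helper for illustration: call curl with tracing off,
# then restore whatever trace setting the caller had.
fetch_with_token() {
    local url="$1" trace_was_on=0 rc=0
    # $- lists the active shell options; x means xtrace is on.
    case "$-" in *x*) trace_was_on=1 ;; esac
    set +x
    curl -fsS -H "Authorization: Bearer $API_TOKEN" "$url" || rc=$?
    if [ "$trace_was_on" -eq 1 ]; then set -x; fi
    return "$rc"
}
```

&lt;p&gt;The token is expanded only after &lt;code&gt;set +x&lt;/code&gt; has run, which is the whole point of reading it inside the function instead of passing it as an argument.&lt;/p&gt;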

&lt;h3&gt;
  
  
  5. Does it run as root unnecessarily?
&lt;/h3&gt;

&lt;p&gt;The AI will sometimes write &lt;code&gt;sudo&lt;/code&gt; into every command, even ones that don't need it. Or it'll assume the script runs as root and use absolute paths that require root to write.&lt;/p&gt;

&lt;p&gt;The principle: if a command can run as a non-root user, it should. The smaller the privileged surface, the smaller the blast radius.&lt;/p&gt;

&lt;p&gt;This is especially important for scripts that download and execute code. A common pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dangerous: privileged download + execute&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'curl https://example.com/install.sh | bash'&lt;/span&gt;

&lt;span class="c"&gt;# Safer: review then run&lt;/span&gt;
curl https://example.com/install.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; install.sh
&lt;span class="c"&gt;# READ install.sh&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bash install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the model generates the first pattern, replace it with the second. Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Last month I asked Claude to write a script that cleans up Docker images older than 30 days on a CI runner host. The first draft was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;DOCKER_IMAGES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;docker images &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.ID}} {{.CreatedAt}}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CUTOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DOCKER_IMAGES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read &lt;/span&gt;ID DATE&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$CREATED&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="nv"&gt;$CUTOFF&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;docker rmi &lt;span class="nv"&gt;$ID&lt;/span&gt;
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walking the checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No strict pragma.&lt;/strong&gt; Missing &lt;code&gt;set -euo pipefail&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unquoted &lt;code&gt;$CREATED&lt;/code&gt;, &lt;code&gt;$CUTOFF&lt;/code&gt;, &lt;code&gt;$ID&lt;/code&gt;.&lt;/strong&gt; Each one is a potential bug. (&lt;code&gt;$DOCKER_IMAGES&lt;/code&gt; and &lt;code&gt;$DATE&lt;/code&gt; are already quoted.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure handling.&lt;/strong&gt; &lt;code&gt;docker rmi&lt;/code&gt; fails if an image is in use. The script carries on past every failure and keeps no record of the outcome, so we never know which images were cleaned and which weren't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secrets&lt;/strong&gt; (docker doesn't expose them here), but the script also doesn't log what it's doing, so you can't audit afterward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;sudo&lt;/code&gt;,&lt;/strong&gt; good — assumes the user has Docker socket access, which is reasonable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The hardened version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;

&lt;span class="nv"&gt;CUTOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;REMOVED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="nv"&gt;SKIPPED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Use --format with a '|' delimiter for safer parsing. The loop reads from&lt;/span&gt;
&lt;span class="c"&gt;# process substitution, not a pipe: a piped while-loop runs in a subshell,&lt;/span&gt;
&lt;span class="c"&gt;# and the REMOVED/SKIPPED counters would be lost when it exits.&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'|'&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; ID DATE&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CUTOFF&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        if &lt;/span&gt;docker rmi &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Removed: &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nv"&gt;REMOVED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;REMOVED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Skipped (in use): &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nv"&gt;SKIPPED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;SKIPPED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;fi
    fi
done&lt;/span&gt; &amp;lt; &amp;lt;(docker images &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.ID}}|{{.CreatedAt}}'&lt;/span&gt;)

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Cleanup complete. Removed: &lt;/span&gt;&lt;span class="nv"&gt;$REMOVED&lt;/span&gt;&lt;span class="s2"&gt;, Skipped: &lt;/span&gt;&lt;span class="nv"&gt;$SKIPPED&lt;/span&gt;&lt;span class="s2"&gt;."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took two minutes of editing. Without the checklist, I might have run the original and noticed days later that disk usage hadn't really dropped because half the images were in use.&lt;/p&gt;

&lt;h2&gt;
  
  
  A small note on bash linting
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;shellcheck&lt;/code&gt; catches most of these issues automatically. If you adopt one tool from this article, make it shellcheck:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shellcheck cleanup-images.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



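&lt;p&gt;It will flag unquoted variables, variables modified inside a pipeline subshell, and dozens of other patterns. It does not check for missing strict mode by default, so that item stays on the manual checklist. AI-generated bash usually has at least one shellcheck warning.&lt;/p&gt;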

&lt;p&gt;I now run shellcheck on every script before I run the script itself. It's two seconds and catches things I'd miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the AI gets it right
&lt;/h2&gt;

&lt;p&gt;To be fair: the model is often perfectly capable of producing safe bash. If you prompt it explicitly — "write this with &lt;code&gt;set -euo pipefail&lt;/code&gt;, quote every variable, fail loudly on errors" — you'll get a clean script.&lt;/p&gt;

&lt;p&gt;The problem is that "write me a script that does X" without that prompt gets you the &lt;em&gt;common&lt;/em&gt; form of the script, which is the unsafe form. So the rule of thumb:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always include the safety requirements in the prompt.&lt;/strong&gt; Or: always treat the output as a draft that needs hardening. Don't run any bash the AI wrote without one of those two disciplines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Bash from AI is fast to produce and easy to misread. The checklist is short (strict pragma, quoted expansions, failure paths, secrets in logs, unnecessary privilege), and applying it takes a couple of minutes per script. The cost of skipping it ranges from "minor cleanup mistake" to "career incident." There's no excuse not to do the check.&lt;/p&gt;

&lt;p&gt;For our bash-specific prompts, see &lt;a href="https://dev.clauneck.workers.dev/prompts/bash-script-code-review/"&gt;bash-script-code-review&lt;/a&gt; and the related &lt;a href="https://dev.clauneck.workers.dev/prompts/linux-server-hardening/"&gt;linux-server-hardening&lt;/a&gt; prompt.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/securing-ai-generated-bash-scripts/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>bash</category>
      <category>security</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Best AI Tools for DevOps Engineers in 2026</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Wed, 17 Jun 2026 20:59:44 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/the-best-ai-tools-for-devops-engineers-in-2026-15a9</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/the-best-ai-tools-for-devops-engineers-in-2026-15a9</guid>
      <description>&lt;p&gt;If you spend your day in a terminal, a YAML editor, or a Grafana tab — AI assistants in 2026 are no longer a curiosity. They're a real productivity layer. But not every tool is good at infrastructure work. After a year of daily use across Linux administration, OpenStack operations, Prometheus alert authoring, and Kubernetes debugging, here's the honest shortlist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The criteria
&lt;/h2&gt;

&lt;p&gt;We're not ranking on benchmark scores. We're ranking on &lt;strong&gt;infrastructure usefulness&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning over command output&lt;/strong&gt; — can it actually read &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;kubectl describe&lt;/code&gt;, or &lt;code&gt;journalctl&lt;/code&gt; and find the real problem?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; — does it warn before suggesting destructive commands?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long context&lt;/strong&gt; — can it hold a 1,000-line &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; plus failing logs without losing track?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal integration&lt;/strong&gt; — can you use it without leaving your workflow?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and self-host options&lt;/strong&gt; — for the engineers whose employers care.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The shortlist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Claude (Anthropic)
&lt;/h3&gt;

&lt;p&gt;The current best general assistant for infrastructure reasoning. Long context handles enormous log dumps and Kubernetes manifests in one shot. It is consistently more cautious about destructive commands than alternatives — which matters when you're tired at 2am and tempted to copy-paste straight into prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Linux/OpenStack/Kubernetes troubleshooting, postmortem drafting, code review on infrastructure-as-code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. ChatGPT (OpenAI)
&lt;/h3&gt;

&lt;p&gt;The broadest ecosystem. Strong code generation, plug-in support, and the largest community of shared prompts and patterns. For Ansible and Terraform generation, output quality is excellent. Slightly less cautious by default — you'll want to add safety constraints in your prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Ansible/Terraform generation, ad-hoc scripting, learning new tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cursor
&lt;/h3&gt;

&lt;p&gt;If you live in an IDE, Cursor is what your IDE should have been. Native multi-file context, agent mode for repo-wide refactors, and tab-completion that actually understands your codebase. Especially strong for IaC repositories with many interconnected files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Editing real codebases (Helm charts, Terraform modules, Python operators).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. GitHub Copilot
&lt;/h3&gt;

&lt;p&gt;The lowest-friction option. Inline completion just works, and the chat sidebar is genuinely useful for "explain this regex" or "what's this PromQL doing?" If your org already pays for GitHub, Copilot is essentially free upside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Inline completion while editing YAML, Bash, Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Warp Terminal (with AI features)
&lt;/h3&gt;

&lt;p&gt;The only entry on this list that isn't an AI assistant per se — it's a terminal that has AI built in. The killer feature: natural-language command suggestions in your shell, with safety previews. For Linux admins who don't want to alt-tab to a chat window every five seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Terminal-native workflows where context-switching kills focus.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we don't recommend (yet)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic LLM wrappers that promise "DevOps AI."&lt;/strong&gt; Most are thin layers over the same APIs above, sometimes with worse safety defaults. Use the underlying tools directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything that requires uploading your &lt;code&gt;~/.ssh&lt;/code&gt; directory or production credentials.&lt;/strong&gt; Be skeptical of "AI agents that run commands for you" without a clear sandbox model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to combine them
&lt;/h2&gt;

&lt;p&gt;A pattern that works well in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claude or ChatGPT in a browser&lt;/strong&gt; for deep diagnosis sessions (paste logs, walk through hypotheses, draft postmortems).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor or Copilot in your editor&lt;/strong&gt; for actually writing the fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp&lt;/strong&gt; in the terminal for quick command lookups without switching context.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need one perfect tool. You need a workflow where each tool plays to its strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/blog/claude-linux-troubleshooting/"&gt;How to Use Claude to Troubleshoot Linux Servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/blog/chatgpt-vs-claude-for-infrastructure/"&gt;ChatGPT vs Claude for Infrastructure Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/best-ai-tools-for-devops-engineers/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>tools</category>
      <category>claude</category>
    </item>
    <item>
      <title>Auditing Kubernetes Manifests With AI: A Practical Workflow</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Tue, 16 Jun 2026 04:31:15 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/auditing-kubernetes-manifests-with-ai-a-practical-workflow-4368</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/auditing-kubernetes-manifests-with-ai-a-practical-workflow-4368</guid>
      <description>&lt;p&gt;A senior K8s engineer I work with audits manifests faster than I read them. He's seen so many patterns that "missing readinessProbe on a Deployment that takes 45 seconds to start" jumps off the page. Most of us don't have that pattern library memorized — and increasingly, we don't need to. AI assistants have read more Kubernetes manifests than any human ever will.&lt;/p&gt;

&lt;p&gt;The catch: a generic "review this YAML" prompt produces generic noise. You need to direct the model toward the categories of issues that actually matter in your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two mistakes everyone makes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Asking for "a security review."&lt;/strong&gt; You'll get a bullet list of every possible concern, ranked alphabetically, with no signal about which matter. You'll skim, dismiss, and learn nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Pasting one manifest.&lt;/strong&gt; Real Kubernetes problems live in the interaction between resources — a Deployment's readiness probe and a Service's selector, a NetworkPolicy and the actual app traffic. One YAML in isolation hides most of the bugs.&lt;/p&gt;

&lt;p&gt;The fix for both is the same: give the model a &lt;em&gt;bounded scope&lt;/em&gt; and &lt;em&gt;enough context&lt;/em&gt; to reason about interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  A workflow that works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Pick the audit dimension
&lt;/h3&gt;

&lt;p&gt;Pre-decide what you're checking &lt;em&gt;for&lt;/em&gt;. Different prompts for different dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits &amp;amp; QoS&lt;/strong&gt; — are requests/limits set, does QoS match intent, are limits realistic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probes &amp;amp; lifecycle&lt;/strong&gt; — readiness, liveness, startup, preStop, terminationGracePeriodSeconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security context&lt;/strong&gt; — runAsNonRoot, capabilities, readOnlyRootFilesystem, seccomp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network exposure&lt;/strong&gt; — NetworkPolicy, Service type, Ingress rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt; — PodDisruptionBudget, topology spread, replica count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State &amp;amp; storage&lt;/strong&gt; — PVC access modes, retention policies, backup tags&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mixing dimensions in one review produces wishy-washy output. Pick one, get a clean answer, move on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Paste the manifest + related context
&lt;/h3&gt;

&lt;p&gt;For a workload review, paste:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Deployment / StatefulSet / DaemonSet&lt;/li&gt;
&lt;li&gt;Its Service(s) and Ingress&lt;/li&gt;
&lt;li&gt;Any NetworkPolicies that match its labels&lt;/li&gt;
&lt;li&gt;The HPA if relevant&lt;/li&gt;
&lt;li&gt;The ConfigMaps and Secrets it references (sanitize first)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For YAML this is usually under 500 lines, well within any model's context window. The model can now reason about interactions, not just isolated fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Use a directive prompt
&lt;/h3&gt;

&lt;p&gt;The big difference between "tell me about this YAML" and a useful review is &lt;em&gt;the instruction format&lt;/em&gt;. Compare:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Review this Kubernetes manifest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;versus:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are reviewing a production Deployment + Service + NetworkPolicy bundle. For each finding, give: (1) severity (critical/high/medium/low), (2) the exact field path that's wrong, (3) one sentence on why it matters, (4) the corrected YAML snippet. Focus only on probes, lifecycle, and graceful shutdown. Ignore documentation/comments.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first prompt produces an essay. The second produces a list of fixable issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Verify before applying
&lt;/h3&gt;

&lt;p&gt;This is where most reviews go wrong. The model is right &lt;em&gt;most of the time&lt;/em&gt;. It's wrong some of the time, often in ways that look correct.&lt;/p&gt;

&lt;p&gt;Common AI failure modes in K8s review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated field names&lt;/strong&gt; — &lt;code&gt;spec.template.spec.terminationGracePeriod&lt;/code&gt; (it's &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outdated API versions&lt;/strong&gt; — &lt;code&gt;policy/v1beta1 PodDisruptionBudget&lt;/code&gt; (removed in 1.25)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong defaults claimed&lt;/strong&gt; — claiming &lt;code&gt;failureThreshold&lt;/code&gt; defaults to 1 when it's 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Misreading the use case&lt;/strong&gt; — recommending &lt;code&gt;runAsNonRoot: true&lt;/code&gt; for a workload that legitimately needs root&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every "fix" the model suggests, glance at the official K8s docs for that field. This adds 30 seconds per finding and catches the wrong ones. Without this step, you will apply changes that break things.&lt;/p&gt;
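
&lt;p&gt;The hallucinated-field check can be partly scripted. A minimal sketch, assuming &lt;code&gt;kubectl&lt;/code&gt; is configured against a cluster and using a hypothetical helper named &lt;code&gt;check_field_paths&lt;/code&gt;; it leans on &lt;code&gt;kubectl explain&lt;/code&gt;, which exits non-zero for a field path the API schema does not know:&lt;/p&gt;

```shell
# Hypothetical helper for illustration: confirm that each field path the
# model cited exists in the cluster's API schema.
check_field_paths() {
    local missing=0 path doc
    for path in "$@"; do
        # kubectl explain fails for unknown fields; its text is captured
        # only to keep the report short.
        if doc=$(kubectl explain "$path"); then
            echo "ok: $path"
        else
            echo "NOT IN SCHEMA: $path"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every path was found.
    [ "$missing" -eq 0 ]
}
```

&lt;p&gt;For example, &lt;code&gt;check_field_paths deployment.spec.template.spec.terminationGracePeriod&lt;/code&gt; should report the path as not in the schema. This only covers field names; defaults and use-case judgment still need the docs and a human.&lt;/p&gt;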

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Here's a Deployment I reviewed last week:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;payments&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;payments&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.example.com/payments:v3.1.0&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://payments-db:5432/payments&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;/healthz&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;8080&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I asked Claude to review for probes and graceful shutdown only. The findings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;requests&lt;/code&gt;, only &lt;code&gt;limits&lt;/code&gt;&lt;/strong&gt; → Kubernetes copies the limits into the requests, so the pod lands in &lt;code&gt;Guaranteed&lt;/code&gt; QoS and reserves 2 CPU and 2Gi per replica whether it uses them or not. Set explicit requests sized to real usage, at or below the limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;initialDelaySeconds: 5&lt;/code&gt;&lt;/strong&gt; → Java/Spring apps typically need 30-90 seconds to start. Add &lt;code&gt;startupProbe&lt;/code&gt; with longer threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;livenessProbe&lt;/code&gt;&lt;/strong&gt; → kubelet won't restart if the app deadlocks. Mirror readinessProbe with looser thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;&lt;/strong&gt; → defaults to 30s; for a payment service with in-flight requests, this is borderline. Set to 60s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;preStop&lt;/code&gt; hook&lt;/strong&gt; → SIGTERM hits immediately; load balancers may still send traffic for ~10s after pod marked Terminating. Add &lt;code&gt;sleep 15&lt;/code&gt; preStop.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All five were real, and all five were fixable in two minutes of YAML editing. Apart from the resources note in #1, the model stayed inside the scope I set: "probes and graceful shutdown only."&lt;/p&gt;

&lt;p&gt;The big one — #5 — is something I've personally been bitten by twice. The model wouldn't have prioritized it without the directive prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about Kyverno / OPA / Pod Security Admission?
&lt;/h2&gt;

&lt;p&gt;Yes, you should run those too. They catch consistent issues at admission time. They don't catch issues that require &lt;em&gt;judgment&lt;/em&gt;: "is 30 seconds enough graceful shutdown for this specific service?" Policy enforcement is a floor; AI review is a directed second opinion above that floor.&lt;/p&gt;

&lt;p&gt;I run both. Kyverno catches "no securityContext at all" before it ever lands. AI review catches "readinessProbe path doesn't match what the app exposes" — something only a human (or an AI imitating one) would notice.&lt;/p&gt;

&lt;h2&gt;
  
  
  A starter prompt
&lt;/h2&gt;

&lt;p&gt;If you want a template, here's the one I use most:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are reviewing a Kubernetes workload bundle for production readiness. Focus only on: probes (readiness, liveness, startup), &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;, preStop hooks, and rolling update strategy. For each finding produce: severity, exact field path, why it matters in one sentence, corrected YAML. Ignore everything else (security context, network policies, resource limits — those are separate reviews). The workload is [serves HTTP at /api on port 8080 / consumes from a queue / batch processor that runs N hours].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The bracketed context at the end is what makes the review accurate for &lt;em&gt;your&lt;/em&gt; workload. Without it, the model assumes a generic web service.&lt;/p&gt;

&lt;p&gt;For our full prompt library on Kubernetes review, see the &lt;a href="https://dev.clauneck.workers.dev/categories/kubernetes-helm/"&gt;Kubernetes &amp;amp; Helm category&lt;/a&gt; — especially &lt;a href="https://dev.clauneck.workers.dev/prompts/kubernetes-yaml-security-review/"&gt;kubernetes-yaml-security-review&lt;/a&gt; and &lt;a href="https://dev.clauneck.workers.dev/prompts/kubernetes-resource-limits-tuning/"&gt;kubernetes-resource-limits-tuning&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/auditing-kubernetes-manifests-with-ai/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>yaml</category>
      <category>security</category>
    </item>
    <item>
      <title>How to Use Claude to Troubleshoot Linux Servers</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sun, 14 Jun 2026 21:15:59 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/how-to-use-claude-to-troubleshoot-linux-servers-1fhe</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/how-to-use-claude-to-troubleshoot-linux-servers-1fhe</guid>
      <description>&lt;p&gt;Claude is genuinely useful for production Linux troubleshooting — when you use it right. Here's the workflow that works, after a year of using it on real incidents across Ubuntu, RHEL, and Rocky.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model: Claude is a senior pair, not an oracle
&lt;/h2&gt;

&lt;p&gt;The mistake most engineers make on day one: they paste a 5-line error message and expect a fix. Claude can do better than that — but only if you give it the same context you'd give a senior engineer joining your incident bridge.&lt;/p&gt;

&lt;p&gt;A senior engineer would want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What OS and version?&lt;/li&gt;
&lt;li&gt;What does this server do?&lt;/li&gt;
&lt;li&gt;What changed recently?&lt;/li&gt;
&lt;li&gt;What's the actual symptom?&lt;/li&gt;
&lt;li&gt;What command output have you already gathered?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Give Claude that, and the quality of analysis changes completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Establish context with a system prompt
&lt;/h3&gt;

&lt;p&gt;Use our &lt;a href="https://dev.clauneck.workers.dev/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt; as your system prompt, or paraphrase: &lt;em&gt;"You are a senior Linux sysadmin. Rank root-cause hypotheses by probability. Recommend safe diagnostics first. Label destructive commands as DANGEROUS."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Paste structured context, not noise
&lt;/h3&gt;

&lt;p&gt;Good:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;OS: Ubuntu 22.04, kernel 5.15
Role: production MySQL replica, 64GB RAM, 16 cores
Recent changes: kernel upgrade 6 hours ago
Symptom: server load average 40+, MySQL replication lag growing, queries timing out

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;uptime&lt;/span&gt;
&lt;span class="go"&gt; 14:22:01 up 6:02,  4 users,  load average: 41.23, 38.51, 35.04

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="go"&gt;              total        used        free      shared  buff/cache   available
Mem:           62Gi        58Gi       1.2Gi       128Mi       3.1Gi       1.8Gi

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;iostat &lt;span class="nt"&gt;-xz&lt;/span&gt; 2 3
&lt;span class="go"&gt;[...]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my server is slow can you help
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
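
&lt;p&gt;The "Good" block above can be assembled by a small script rather than by hand. The Python sketch below is one possible approach, not a fixed recipe: &lt;code&gt;run_readonly&lt;/code&gt; and &lt;code&gt;format_context&lt;/code&gt; are hypothetical helpers, the command list is only an example, and the facts (OS, role, recent changes, symptom) still have to come from you.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import shlex
import subprocess


def run_readonly(command, timeout=15):
    """Run one diagnostic command and return its output (or a short error note)."""
    try:
        result = subprocess.run(
            shlex.split(command), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run: {exc}]"
    return (result.stdout or result.stderr).rstrip()


def format_context(facts, outputs):
    """Render facts as 'Key: value' lines, then each command with its output."""
    lines = [f"{key}: {value}" for key, value in facts.items()]
    for command, output in outputs.items():
        lines += ["", f"$ {command}", output]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values copied from the article; replace them with your own.
    facts = {
        "OS": "Ubuntu 22.04, kernel 5.15",
        "Role": "production MySQL replica, 64GB RAM, 16 cores",
        "Recent changes": "kernel upgrade 6 hours ago",
        "Symptom": "load average 40+, replication lag growing",
    }
    commands = ["uptime", "free -h"]  # keep this list read-only
    print(format_context(facts, {c: run_readonly(c) for c in commands}))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Review the result for hostnames, IPs, and credentials before pasting it anywhere.&lt;/p&gt;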



&lt;h3&gt;
  
  
  Step 3: Let it ask follow-up questions
&lt;/h3&gt;

&lt;p&gt;The good prompts in our library tell Claude to &lt;strong&gt;ask for missing data&lt;/strong&gt; before guessing. When it asks "can you share &lt;code&gt;dmesg | tail -50&lt;/code&gt; and &lt;code&gt;vmstat 1 5&lt;/code&gt;?" — that's a feature, not a flaw. Give it the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Validate suggested commands before running
&lt;/h3&gt;

&lt;p&gt;Claude will sometimes suggest a command with subtly wrong syntax, a destructive flag, or a path that doesn't exist on your distro. Read every suggestion before running. &lt;strong&gt;Never paste straight into a root shell.&lt;/strong&gt;&lt;/p&gt;
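
&lt;p&gt;Part of that reading can be mechanical. The Python sketch below checks two of the failure modes mentioned above without executing anything: whether the binary exists on this host, and whether absolute paths in the arguments exist. &lt;code&gt;preflight&lt;/code&gt; is a hypothetical helper. It does not judge syntax or destructive flags, so it supplements reading the command and does not replace it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import os
import shlex
import shutil


def preflight(command):
    """Return human-readable warnings for one suggested command; never runs it."""
    tokens = shlex.split(command)
    if not tokens:
        return ["empty command"]
    warnings = []
    if shutil.which(tokens[0]) is None:
        warnings.append(f"binary not found on this host: {tokens[0]}")
    for token in tokens[1:]:
        # Only absolute paths are checked; relative paths depend on the cwd.
        if token.startswith("/") and not os.path.exists(token):
            warnings.append(f"path does not exist: {token}")
    return warnings


if __name__ == "__main__":
    for suggestion in ["cat /etc/os-release", "iotop -oP -d1"]:
        print(suggestion, preflight(suggestion) or "no warnings")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;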

&lt;h3&gt;
  
  
  Step 5: Keep the conversation alive
&lt;/h3&gt;

&lt;p&gt;Claude's long context means you can run a 30-minute diagnostic session in one thread, paste new output as you gather it, and the model retains the full diagnostic context. This is the single biggest workflow win versus older AI tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude is good at
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reading command output you don't fully understand (&lt;code&gt;strace&lt;/code&gt;, &lt;code&gt;perf&lt;/code&gt;, &lt;code&gt;tcpdump&lt;/code&gt; summaries).&lt;/li&gt;
&lt;li&gt;Drafting &lt;code&gt;awk&lt;/code&gt;/&lt;code&gt;sed&lt;/code&gt;/&lt;code&gt;grep&lt;/code&gt; one-liners for log analysis.&lt;/li&gt;
&lt;li&gt;Explaining what a specific kernel parameter or sysctl does, and why it might be set the way it is.&lt;/li&gt;
&lt;li&gt;Suggesting &lt;em&gt;what to look at next&lt;/em&gt; when you're stuck.&lt;/li&gt;
&lt;li&gt;Drafting the incident summary after you've fixed it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Claude is not good at
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time anything — it can't see your live metrics.&lt;/li&gt;
&lt;li&gt;Distinguishing between two plausible root causes when both fit the symptoms (it'll guess).&lt;/li&gt;
&lt;li&gt;Telling you what's normal for &lt;em&gt;your&lt;/em&gt; environment. You have to provide that baseline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A real-world example
&lt;/h2&gt;

&lt;p&gt;A production server's load average suddenly spiked. We pasted the output of &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;iostat -xz 2 3&lt;/code&gt;, and &lt;code&gt;dmesg | tail -50&lt;/code&gt; into Claude with our prompt template, and it immediately flagged: &lt;em&gt;"&lt;code&gt;%iowait&lt;/code&gt; is 78%, &lt;code&gt;await&lt;/code&gt; on /dev/sda is 320ms, and dmesg shows 'task X blocked for more than 120 seconds.' The disk subsystem is saturated, not CPU. Investigate which process is doing heavy I/O: &lt;code&gt;iotop -oP -d1&lt;/code&gt; will show the writer in 1-second intervals."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's exactly the diagnosis we wanted, framed with the evidence — in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Companion resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/prompts/bash-script-code-review/"&gt;Bash Script Code Review Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/blog/ai-safely-with-bash/"&gt;How to Use AI Safely with Bash Commands&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/claude-linux-troubleshooting/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>claude</category>
      <category>linux</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>How to Choose the Right DevOps as a Service Provider</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sat, 13 Jun 2026 19:26:40 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/how-to-choose-the-right-devops-as-a-service-provider-466l</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/how-to-choose-the-right-devops-as-a-service-provider-466l</guid>
      <description>&lt;p&gt;I've spent 25 years building, breaking, and scaling production infrastructure — long enough to watch "DevOps" go from a conference buzzword to a thing companies now rent by the month. That shift is real, and for a lot of teams it's the right call. But the gap between a great DevOps as a Service provider and a bad one is enormous, and the marketing pages all read the same.&lt;/p&gt;

&lt;p&gt;So this is the article I wish more buyers had: what DevOps as a Service actually means, when it beats hiring, and how to tell — before you sign — whether the people you're talking to have ever been on-call at 3am.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevOps as a Service actually means
&lt;/h2&gt;

&lt;p&gt;DevOps as a Service (DaaS) is outsourcing the engineering function that builds and runs your delivery pipeline and infrastructure, rather than hiring that function in-house. A provider takes ownership of some or all of: your CI/CD, your cloud environments, your observability, your automation, and the on-call response when something breaks.&lt;/p&gt;

&lt;p&gt;It is not a single tool, and it is not a one-time project. A consultancy that drops a Terraform repo and disappears is not DaaS. The "as a Service" part means there's an ongoing operational relationship — someone is responsible for your systems on Tuesday at 2am, not just during the engagement.&lt;/p&gt;

&lt;p&gt;Done well, you get the output of a seasoned platform team — Linux fundamentals, Kubernetes, Docker, infrastructure-as-code, pipelines, monitoring — without carrying that whole team on payroll.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why companies outsource DevOps instead of hiring
&lt;/h2&gt;

&lt;p&gt;Hiring a full in-house DevOps team is the "right" answer that's often the wrong answer in practice. Here's why teams rent it instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; A single senior DevOps/SRE hire in a competitive market is expensive — and you need more than one for real on-call coverage. Add recruiting time, ramp-up, benefits, and the risk of a bad hire, and the fully-loaded number gets large fast. A provider amortizes senior talent across clients, so you pay for the expertise without paying for the bench.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed to maturity.&lt;/strong&gt; A good provider has already built the Terraform modules, the GitLab CI templates, the Prometheus alert libraries, the backup runbooks. You're buying an opinionated, battle-tested baseline instead of inventing it. That can compress a year of platform work into weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-call coverage.&lt;/strong&gt; Sustainable 24/7 on-call needs roughly six to eight engineers in a healthy rotation. Most companies under a certain size simply cannot staff that without burning people out. Providers spread the rotation across a larger team, so nobody's carrying a pager every single night.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard-to-hire seniority.&lt;/strong&gt; The engineers who can debug a gnarly Kubernetes networking issue, reason about etcd, and also write clean Terraform are rare and they know it. They're hard to attract and harder to retain at a non-tech company. DaaS is often the only realistic way for a mid-sized business to get that caliber of person near its infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's usually included
&lt;/h2&gt;

&lt;p&gt;Scope varies, but a full-spectrum provider should be able to own all of these. When you evaluate one, map their offering against this list and find out exactly where the lines are.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD&lt;/strong&gt; — pipeline design, build/test/deploy stages, and crucially, a real rollback path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud infrastructure&lt;/strong&gt; — provisioning and managing your environments as code (Terraform or equivalent), with sane network and IAM design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and observability&lt;/strong&gt; — Prometheus, Grafana, logs, and alert rules that page a human only when a human is actually needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; — configuration management with Ansible, scripted runbooks, and elimination of manual toil.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; — secrets management, least-privilege access, patching, and image scanning baked into the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response&lt;/strong&gt; — a defined process, on-call rotation, and blameless postmortems, not just "we'll look at it."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups and disaster recovery&lt;/strong&gt; — and, more importantly, tested restores. A backup you've never restored is a rumor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; — right-sizing, autoscaling, spot/reserved strategy, and killing the zombie resources nobody owns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Questions to ask before you hire a provider
&lt;/h2&gt;

&lt;p&gt;This is the part that separates the real operators from the slide decks. Don't ask "do you do Kubernetes?" — everyone says yes. Ask for specifics and watch how fast and how concretely they answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Show me your Terraform module structure and how you handle state."&lt;/strong&gt; Real teams have an opinion about remote state, locking, workspace-vs-directory layout, and blast-radius isolation. Vague answers here mean they're winging your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Walk me through a real GitLab CI pipeline you run, including the rollback path."&lt;/strong&gt; A deploy story with no rollback story is half a pipeline. I want to hear how they revert a bad release in minutes, not hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How do you wire Prometheus alert rules to avoid pager fatigue?"&lt;/strong&gt; The right answer involves symptom-based alerting, &lt;code&gt;for:&lt;/code&gt; durations, severity routing, and ruthless deletion of noisy alerts. If every blip pages everyone, nobody responds to the one that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What does your on-call rotation look like, and what's your real response time?"&lt;/strong&gt; Get the rotation size, escalation policy, and the SLA in writing. "We're very responsive" is not an SLA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How do you manage secrets and access?"&lt;/strong&gt; Listen for a vault, short-lived credentials, and least privilege — not secrets in environment files or a shared password manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"When did you last test a restore from backup, and how long did it take?"&lt;/strong&gt; The hesitation tells you everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How do you handle configuration drift?"&lt;/strong&gt; Ansible, immutable images, drift detection — there should be a system, not heroics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What happens to our infrastructure if we leave you?"&lt;/strong&gt; A confident provider hands you clean, documented IaC and walks away gracefully. Lock-in is a choice they make, and you should know it up front.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Who specifically will be on our account, and what's their production background?"&lt;/strong&gt; You're buying judgment. Find out whose judgment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Red flags to avoid
&lt;/h2&gt;

&lt;p&gt;A few patterns that, in my experience, reliably predict pain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buzzword density with no specifics.&lt;/strong&gt; If they can't move from "we leverage cloud-native synergies" to "here's how we structure a Helm chart" in one question, walk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rollback story.&lt;/strong&gt; Anyone can deploy. Operators can un-deploy under pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickOps in the cloud console.&lt;/strong&gt; If they're configuring your production environment by hand instead of in code, you have no reproducibility and no audit trail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything is "automated by AI."&lt;/strong&gt; AI helps. AI does not own your incident at 2am. A provider hiding thin staffing behind AI claims is a serious risk (more on this below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert noise as a feature.&lt;/strong&gt; Hundreds of alerts is not observability; it's a team that's trained itself to ignore the dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No postmortems, or blame-heavy ones.&lt;/strong&gt; A team that doesn't write honest postmortems isn't learning, and you'll pay for the same outage twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They won't show you anything real.&lt;/strong&gt; Sanitized examples are fine. "We can't show you any of our work" usually means there isn't much to show.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep lock-in by design.&lt;/strong&gt; Proprietary wrappers around standard tools, undocumented infra, contracts that punish leaving — all signs they're protecting revenue, not your uptime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why real production experience beats buzzwords
&lt;/h2&gt;

&lt;p&gt;Here's the thing the marketing won't tell you: tools are easy, judgment is hard. Anyone can &lt;code&gt;terraform apply&lt;/code&gt;. The value is in the engineer who knows &lt;em&gt;not&lt;/em&gt; to apply at 4:55pm on a Friday, who recognizes the failure mode three layers down, who's restored a database under pressure and remembers exactly how it went wrong last time.&lt;/p&gt;

&lt;p&gt;That judgment only comes from having run real production systems and felt the consequences. When you evaluate a provider, you're not really buying their Kubernetes skills — those are table stakes. You're buying scar tissue. You want the team that's debugged the keepalived VIP flap, the etcd disk-pressure cascade, the Docker layer that quietly doubled image size and blew out the build cache. Ask for war stories. The good ones light up; the pretenders get vague.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI fits — and where it doesn't
&lt;/h2&gt;

&lt;p&gt;I'm bullish on AI in DevOps, and I build with it daily. Used right, it's a genuine force multiplier: it can summarize a wall of logs faster than any human, draft Terraform and Ansible boilerplate, propose PromQL, correlate a timeline of "what changed," and write the first pass of a postmortem. That's real leverage, and a modern provider should be using it.&lt;/p&gt;

&lt;p&gt;But there's a hard line, and it's the same one I draw on my own systems: &lt;strong&gt;AI reads and reasons; humans run commands.&lt;/strong&gt; During an active incident, AI proposes a risk-classified, safest-first plan and a human executes every step. The model never touches production. If a provider tells you their AI auto-remediates your prod environment unattended, that's not maturity — that's an outage waiting for a confident-but-wrong suggestion.&lt;/p&gt;

&lt;p&gt;The right framing is AI as a very fast, very well-read junior engineer sitting next to a senior who owns the keyboard. It compresses the slow parts of the work without replacing the judgment that keeps your systems up. If you want to see what that looks like in practice, our &lt;a href="https://dev.clauneck.workers.dev/dashboard/incident-response/"&gt;AI incident-response workflows&lt;/a&gt; and &lt;a href="https://dev.clauneck.workers.dev/prompts/"&gt;prompt library&lt;/a&gt; are built around exactly that human-in-the-loop principle.&lt;/p&gt;

&lt;p&gt;So when you evaluate a provider's AI claims, ask the same question you'd ask about any tool: where's the human, and what's the blast radius if the AI is wrong?&lt;/p&gt;

&lt;h2&gt;
  
  
  How a good provider actually pays for itself
&lt;/h2&gt;

&lt;p&gt;The reason this model works isn't just cheaper labor — it's better outcomes in three places that show up directly on your books.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It saves money.&lt;/strong&gt; Cost optimization is continuous work most teams never get to: right-sizing nodes, tuning autoscaling, buying reserved capacity, deleting orphaned volumes and idle environments. A provider doing this routinely often saves more on cloud spend than they cost. The infrastructure-as-code discipline also prevents the expensive mistakes — the hand-clicked resource nobody can reproduce, the security group left wide open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It reduces downtime.&lt;/strong&gt; Better alerting means you catch degradation before customers do. Tested restores mean a disaster is an inconvenience, not a company-ending event. A defined incident process with real on-call coverage means the response starts in minutes. Downtime is one of the most expensive things a business buys without meaning to, and maturity here directly buys it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It speeds up deployments.&lt;/strong&gt; A solid GitLab CI pipeline with automated testing and a clean rollback path turns deploys from a scary quarterly event into a boring daily one. Teams that deploy confidently ship faster, and shipping faster is usually the whole point. The fastest way to slow down engineering is to make every release terrifying; good DevOps makes it dull.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go from here
&lt;/h2&gt;

&lt;p&gt;Be honest with yourself about where your infrastructure actually stands. Can you deploy and roll back in minutes, or does a release ruin someone's afternoon? Do your alerts mean something, or has your team learned to ignore them? If your primary database died right now, do you know — not hope, know — that you can restore it? Is there a real on-call rotation, or one exhausted person who's secretly the single point of failure?&lt;/p&gt;

&lt;p&gt;If those questions made you wince, you're not behind; you're normal. Most teams are far less mature than they think, and they're trying to close that gap by hiring slowly, one expensive senior at a time, while production keeps moving. DevOps as a Service exists precisely so you don't have to win that hiring war before you can move fast.&lt;/p&gt;

&lt;p&gt;Take an honest inventory this week. Score yourself on pipelines, observability, incident response, and recovery. Wherever you find a gap that's quietly costing you money, downtime, or velocity, that's where a good provider earns their fee many times over. The teams that move fastest aren't the ones with the most engineers — they're the ones who got serious about maturity before the outage forced the conversation. Decide which kind you want to be, and move while it's still your choice.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Evaluate any provider against your own systems and constraints. The right answer depends on your scale, your risk tolerance, and how much production maturity you already have in-house.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/how-to-choose-devops-as-a-service-provider/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>manageddevops</category>
      <category>cicd</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>ChatGPT vs Claude for Infrastructure Engineers</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 11 Jun 2026 15:21:51 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/chatgpt-vs-claude-for-infrastructure-engineers-7j5</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/chatgpt-vs-claude-for-infrastructure-engineers-7j5</guid>
      <description>&lt;p&gt;Both ChatGPT and Claude are excellent. But they have different strengths, and infrastructure engineers feel those differences more than most users — because we deal with long logs, multi-file configurations, and operations where being &lt;em&gt;almost right&lt;/em&gt; can mean being very wrong.&lt;/p&gt;

&lt;p&gt;Here's a side-by-side from a year of daily use on real infrastructure work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long-context reasoning over logs and manifests
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude's long context window means you can paste a 2,000-line &lt;code&gt;kubectl describe pod&lt;/code&gt;, the full Deployment manifest, and your last 50 events without losing fidelity. ChatGPT can handle long contexts too, but in practice it's more likely to summarize or "forget" earlier details mid-conversation.&lt;/p&gt;

&lt;p&gt;For diagnostic workflows where you keep pasting more output as you gather it, Claude's behavior is meaningfully better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Safety with destructive commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude (slightly).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without explicit prompting, Claude is more likely to flag destructive commands (&lt;code&gt;rm -rf&lt;/code&gt;, &lt;code&gt;DROP TABLE&lt;/code&gt;, &lt;code&gt;nova reset-state&lt;/code&gt;, &lt;code&gt;kubectl delete&lt;/code&gt;) with caveats. ChatGPT will too — but is more likely to just hand you the command without extra emphasis.&lt;/p&gt;

&lt;p&gt;If you use either tool in production troubleshooting, &lt;strong&gt;bake the safety constraints into your prompt&lt;/strong&gt; (our prompt library does this). Don't rely on default behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code generation: Ansible, Terraform, Bash, Python
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Roughly tied. Different defaults.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; tends toward more "modern" Terraform (newer providers, recent syntax) and is slightly faster to produce a working playbook from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; tends toward more cautious, conventional output with better comments and more attention to idempotency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For infrastructure-as-code review, Claude usually catches more subtle issues. For first-draft generation, ChatGPT is often a hair faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  PromQL and observability queries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Roughly tied.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both can write correct PromQL with &lt;code&gt;rate()&lt;/code&gt;, &lt;code&gt;histogram_quantile()&lt;/code&gt;, and label aggregation. Both occasionally hallucinate metric names if you don't paste your &lt;code&gt;/metrics&lt;/code&gt; output. The deciding factor is your prompt quality, not the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Postmortem drafting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude's prose is consistently more readable, less marketing-flavored, and more naturally blameless. ChatGPT tends to slip into corporate phrasing that engineers find grating ("leveraged our learnings to enhance reliability").&lt;/p&gt;

&lt;h2&gt;
  
  
  Ecosystem and integrations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: ChatGPT.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT has a far larger ecosystem of plugins, GPTs, and shared prompts. If you want a tool that integrates with everything else you use, ChatGPT wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;Both are roughly comparable for individual use. Both offer free tiers with rate limits. Teams pricing varies by org needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which should you use?
&lt;/h2&gt;

&lt;p&gt;The honest answer: &lt;strong&gt;both, for different tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; for diagnostic sessions, postmortems, sensitive prod work, and IaC review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; for fast scaffolding, plugin-heavy workflows, and broad community templates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can only pick one and you do mostly production troubleshooting, pick Claude. If you can only pick one and you do mostly greenfield IaC scaffolding, ChatGPT is fine — your prompt quality matters more than the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Companion resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/blog/best-ai-tools-for-devops-engineers/"&gt;Best AI Tools for DevOps Engineers in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/blog/claude-linux-troubleshooting/"&gt;How to Use Claude to Troubleshoot Linux Servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.clauneck.workers.dev/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/chatgpt-vs-claude-for-infrastructure/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>chatgpt</category>
      <category>claude</category>
      <category>comparison</category>
    </item>
    <item>
      <title>How DevOps Engineers Can Use AI to Triage Production Incidents Faster</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Mon, 08 Jun 2026 19:49:29 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/devopsaitoolkit/how-devops-engineers-can-use-ai-to-triage-production-incidents-faster-3jb6</link>
      <guid>https://dev.clauneck.workers.dev/devopsaitoolkit/how-devops-engineers-can-use-ai-to-triage-production-incidents-faster-3jb6</guid>
      <description>&lt;p&gt;The pager goes off at 02:14. Checkout latency is up, error rate is climbing, and you have three dashboards, a wall of logs, and a half-awake brain. The fix, once you know what's wrong, is usually fast. The expensive part is the triage — the first fifteen minutes of "what is actually broken, and what changed?"&lt;/p&gt;

&lt;p&gt;That triage window is exactly where AI helps most, and exactly where it's most dangerous if you let it run commands. This is how to use it to go faster without handing it the keys to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule that makes AI safe during an incident
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI reads and reasons. Humans run commands.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During an active incident you are sleep-deprived and time-pressured — the worst possible state to paste a command you don't fully understand. So draw a hard line: AI is allowed to look at evidence and propose a plan. It is never allowed to execute anything. Every command it suggests goes through your eyes and your hands.&lt;/p&gt;

&lt;p&gt;In practice that means you treat the model like a very fast, very well-read junior SRE sitting next to you: it can summarize, correlate, hypothesize, and draft commands — but you're the one with the keyboard, and you read each command before it runs.&lt;/p&gt;

&lt;p&gt;If you only take one thing from this article, take that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Turn the firehose into a summary
&lt;/h2&gt;

&lt;p&gt;The first thing AI is genuinely great at is reading more text than you can at 2am. Paste in the raw material and ask for structure, not answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The firing alerts (name, severity, labels, duration)&lt;/li&gt;
&lt;li&gt;A representative slice of error logs&lt;/li&gt;
&lt;li&gt;Recent deploy / change history&lt;/li&gt;
&lt;li&gt;The relevant dashboard values (p99 latency, error rate, saturation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then prompt it deliberately:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Here are the alerts, logs, and recent changes for an active production incident. Summarize what's happening in 5 bullets, list the top 3 hypotheses ordered by likelihood, and for each hypothesis give me the single read-only command that would confirm or rule it out. Do not suggest any command that changes state."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last sentence matters. Left unconstrained, models love to suggest &lt;code&gt;kubectl rollout restart&lt;/code&gt; as step one. You want the diagnostics first.&lt;/p&gt;
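
&lt;p&gt;If you want the prompt assembled the same way every time, a small helper can do it. The following is a minimal sketch, not a finished tool: the function name &lt;code&gt;build_triage_prompt&lt;/code&gt;, the section labels, and the sample evidence are all hypothetical placeholders chosen for illustration.&lt;/p&gt;

```python
# Hypothetical helper: assembles the Step 1 prompt from raw evidence.
# All names and sample values are illustrative, not part of any real tool.

INSTRUCTIONS = (
    "Here are the alerts, logs, and recent changes for an active production "
    "incident. Summarize what's happening in 5 bullets, list the top 3 "
    "hypotheses ordered by likelihood, and for each hypothesis give me the "
    "single read-only command that would confirm or rule it out. "
    "Do not suggest any command that changes state."
)


def build_triage_prompt(alerts, logs, changes, dashboard):
    """Return one prompt string with a labeled section per evidence type."""
    sections = [
        ("ALERTS", alerts),
        ("ERROR LOGS", logs),
        ("RECENT CHANGES", changes),
        ("DASHBOARD VALUES", dashboard),
    ]
    parts = [INSTRUCTIONS]
    for title, lines in sections:
        # An empty section is stated explicitly so the model does not guess.
        body = "\n".join(lines) if lines else "(none provided)"
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder evidence; replace with your own scrubbed data.
    print(build_triage_prompt(
        alerts=["HighLatency severity=page service=checkout for=6m"],
        logs=["upstream timed out while reading response header"],
        changes=["01:58 UTC config change (placeholder)"],
        dashboard=["p99 latency: placeholder", "error rate: placeholder"],
    ))
```

&lt;p&gt;Keeping the instruction text in one constant means the "no state-changing commands" constraint cannot be forgotten when you are typing in a hurry.&lt;/p&gt;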

&lt;h2&gt;
  
  
  Step 2: Make it order commands by blast radius
&lt;/h2&gt;

&lt;p&gt;A good incident AI prompt forces a risk classification on every suggested command. Ask it to label each one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;safe&lt;/strong&gt;: pure read-only, such as &lt;code&gt;kubectl get&lt;/code&gt;, &lt;code&gt;journalctl&lt;/code&gt;, &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;ip addr show&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;promtool query&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;caution&lt;/strong&gt;: shells in or makes a small change, such as &lt;code&gt;kubectl exec&lt;/code&gt;, &lt;code&gt;docker exec&lt;/code&gt;, editing non-prod config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;destructive&lt;/strong&gt;: restarts, deletes, scale-to-zero, firewall changes, migrations, restores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then require it to order them safest-first. You work top-down and you stop the moment you have a diagnosis. The number of incidents that get &lt;em&gt;worse&lt;/em&gt; because someone reached for a destructive "fix" before confirming the cause is depressingly high, so a forced safest-first ordering is a cheap guardrail against that.&lt;/p&gt;
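
&lt;p&gt;You can also double-check the model's labels yourself before you read the plan. The sketch below is one possible way to do that, assuming a simple prefix match on the command's leading words. The prefix lists are adapted from the examples above and are deliberately incomplete, and the function names are hypothetical.&lt;/p&gt;

```python
# Minimal sketch: label a command safe / caution / destructive and sort
# safest-first. Anything unrecognized is treated as destructive (fail closed).
# Bare `ip` is left out of the safe list because some of its subcommands
# change state.

RISK_ORDER = {"safe": 0, "caution": 1, "destructive": 2}

SAFE_PREFIXES = ("kubectl get", "journalctl", "ss", "cat", "grep", "promtool query")
CAUTION_PREFIXES = ("kubectl exec", "docker exec")


def _starts_with_command(command, prefix):
    """True if `command` begins with all the words of `prefix`."""
    words = command.split()
    prefix_words = prefix.split()
    return words[:len(prefix_words)] == prefix_words


def classify(command):
    """Return the risk label for a single command string."""
    for prefix in SAFE_PREFIXES:
        if _starts_with_command(command, prefix):
            return "safe"
    for prefix in CAUTION_PREFIXES:
        if _starts_with_command(command, prefix):
            return "caution"
    return "destructive"  # unknown means "a human decides first"


def safest_first(commands):
    """Stable sort, so ties keep the order the model proposed."""
    return sorted(commands, key=lambda c: RISK_ORDER[classify(c)])
```

&lt;p&gt;Matching on whole words keeps &lt;code&gt;ssh&lt;/code&gt; from being mistaken for &lt;code&gt;ss&lt;/code&gt;. The sketch does not inspect pipes, redirects, or flags, so a command labeled safe still gets read by a human before it runs.&lt;/p&gt;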

&lt;blockquote&gt;
&lt;p&gt;Tip: keep your standard incident prompt in a snippet manager or a prompt library so you're not authoring it at 2am. We keep a set of &lt;a href="https://dev.clauneck.workers.dev/categories/incident-response/"&gt;AI incident-response prompts&lt;/a&gt; for exactly this.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 3: Correlate "what changed" automatically
&lt;/h2&gt;

&lt;p&gt;Most incidents are caused by a change. The model is good at lining up a timeline if you give it the raw inputs: the alert start time, the last few deploys, config changes, and infra events. Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The latency spike started at 02:09 UTC. Here is the deploy log and the config-change history for the last 6 hours. What changed closest to 02:09, and what's the mechanism by which it could cause this symptom?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where AI often beats a tired human: it doesn't get tunnel vision on the service you &lt;em&gt;think&lt;/em&gt; is the problem. Given the full change list, it is more likely to flag the keepalived VIP change, the connection-pool tweak, or the cert that rotated. That is the boring change three layers down that you'd otherwise have found 20 minutes later.&lt;/p&gt;
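
&lt;p&gt;The timeline lookup itself is simple enough to pre-compute before you paste anything. Here is a minimal sketch, assuming change records arrive as &lt;code&gt;(timestamp, description)&lt;/code&gt; pairs in UTC with minute precision. The function names and the timestamp format are assumptions made for illustration.&lt;/p&gt;

```python
from datetime import datetime, timezone

# Minimal sketch: rank change events by how shortly before the symptom
# they landed. The record format is a hypothetical placeholder.


def parse_utc(stamp):
    """Parse 'YYYY-MM-DD HH:MM' as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def changes_before(incident_start, changes, window_hours=6):
    """Return (minutes_before, description) pairs, most recent change first.

    Changes after the incident start, or older than the window, are dropped.
    """
    start = parse_utc(incident_start)
    ranked = []
    for stamp, description in changes:
        minutes = int((start - parse_utc(stamp)).total_seconds() // 60)
        if minutes in range(0, window_hours * 60 + 1):
            ranked.append((minutes, description))
    return sorted(ranked)
```

&lt;p&gt;Proximity is only a heuristic. A change with a slow-burn effect can sit outside the window, so treat the ranked list as the starting point for the "what's the mechanism" question, not the answer to it.&lt;/p&gt;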

&lt;h2&gt;
  
  
  Step 4: Draft comms while you investigate
&lt;/h2&gt;

&lt;p&gt;Incident comms are a tax you pay in attention you don't have. Hand them to the model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Write a status-page update for a degraded-checkout incident, customer-facing, no internal jargon, no root cause speculation, ~3 sentences. Then write a one-line internal update for the incident channel with current severity and what we're checking."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You get a customer update and an internal update in seconds, both in the right register. You skim, adjust a word, post. The investigation never stops to write prose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Let it draft the postmortem from the timeline
&lt;/h2&gt;

&lt;p&gt;Right after the incident is resolved, the timeline is at its freshest and you're most likely to actually write it down. Paste the incident-channel scrollback and the command history and ask for a blameless postmortem draft: summary, timeline, root cause, impact, what went well, what to improve, and action items. You're editing a draft instead of facing a blank page, which is the difference between a postmortem that gets written and one that doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What NOT to do
&lt;/h2&gt;

&lt;p&gt;A few failure modes worth naming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't paste secrets.&lt;/strong&gt; Scrub tokens, passwords, internal hostnames, and customer data before anything goes into a model. Treat the prompt like a screenshot you might accidentally post in a public channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't let it invent metrics.&lt;/strong&gt; If you ask for PromQL and you haven't given it your real metric names, it will confidently make them up. Give it your metric names or tell it to use clearly-marked placeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust a confident command.&lt;/strong&gt; "Confident" and "correct" are unrelated in language models. The safest-first ordering exists precisely so that the first commands you run are read-only, and a wrong-but-confident suggestion costs you a minute instead of an outage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip the human review for "obvious" fixes.&lt;/strong&gt; The obvious fix at 2am is how the incident gets a second act.&lt;/li&gt;
&lt;/ul&gt;
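
&lt;p&gt;The first rule above is the easiest one to automate. The following is a minimal sketch of a pre-paste scrubber. The patterns are illustrative assumptions, not a complete secret detector: they cover sensitive-looking key/value pairs, bearer tokens, and IPv4 addresses, and they do not catch hostnames or customer data.&lt;/p&gt;

```python
import re

# Minimal sketch of a pre-paste scrubber. Patterns are illustrative
# assumptions; extend them for your own stack before relying on this.
PATTERNS = [
    # key=value or key: value pairs whose key looks sensitive
    (re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
    # HTTP bearer tokens
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
    # IPv4 addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def scrub(text):
    """Apply each redaction pattern in order and return the cleaned text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

&lt;p&gt;Run the logs through something like this first, then still skim the result by eye. A regex pass lowers the odds of a leak, but it does not replace the review.&lt;/p&gt;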

&lt;h2&gt;
  
  
  Where this fits in your workflow
&lt;/h2&gt;

&lt;p&gt;You don't need a platform to start — a saved prompt and a scratch buffer get you most of the value tonight. The structure is what matters: summarize the firehose, hypothesize with read-only confirmations, correlate the timeline, draft the comms, and let the human run every command.&lt;/p&gt;

&lt;p&gt;If you want the structured version of this flow — paste your symptoms and logs, get a risk-classified, safest-first plan plus a postmortem draft — that's exactly what we built the &lt;a href="https://dev.clauneck.workers.dev/dashboard/incident-response/"&gt;AI Incident Response Assistant&lt;/a&gt; for. But the technique stands on its own. Steal the prompts, keep the human on the keyboard, and reclaim the first fifteen minutes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Generated incident plans and commands are assistive, not authoritative. Always verify recommendations against your own systems before running anything in production.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/how-devops-engineers-can-use-ai-to-triage-production-incidents-faster/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>incidentresponse</category>
      <category>ai</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
