Henning Reckey

Posted on Jun 24

"VBAF Learning Trail -- From Zero to AI Developer in PowerShell 5.1"

VBAF -- Getting Started

A Guided Trail from Zero to AI Developer

Welcome. You are about to learn how artificial intelligence actually works --
not by reading about it, but by running it, watching it, and breaking it.

VBAF implements neural networks, reinforcement learning and multi-agent
systems from scratch in PowerShell 5.1. Every algorithm is readable.
Every concept is explained in the code comments.

This guide takes you from installation to building your own AI agent.
Follow the camps in order. Do not skip ahead.

Time required: 2-4 hours for Camps 0-3. Camps 4-5 are open-ended.

CAMP 0 -- BASECAMP

Get VBAF installed and your first output on screen

Goal: see "VBAF Framework ready!" on your screen

Step 1 of 5 -- Check your PowerShell version

Open PowerShell (not ISE, not VS Code -- just plain PowerShell for now).

$PSVersionTable.PSVersion

What you will see:

Major  Minor  Build  Revision
-----  -----  -----  --------
5      1      ...    ...

What it means:
VBAF requires PowerShell 5.1. This version ships with every modern
Windows PC. If you see 5.1 -- you are ready. If you see 7.x -- switch
to Windows PowerShell (search "Windows PowerShell" in the Start menu).

If something goes wrong:
On Windows 10 or 11, PowerShell 5.1 is always present.
Search "Windows PowerShell" in the Start menu -- not "PowerShell".
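If you want the check as a reusable snippet, something like the sketch below works. The function name Test-VbafHost is made up for this guide and is not part of VBAF; it relies only on the built-in $PSVersionTable variable.

```powershell
# Hypothetical helper (not part of VBAF): $true only for Windows PowerShell 5.1.
function Test-VbafHost {
    param(
        [version]$Version = $PSVersionTable.PSVersion,
        [string]$Edition  = $PSVersionTable.PSEdition
    )
    # Windows PowerShell 5.1 reports edition 'Desktop'; PowerShell 7.x reports 'Core'.
    return ($Version.Major -eq 5 -and $Version.Minor -eq 1 -and $Edition -eq 'Desktop')
}

if (Test-VbafHost) {
    Write-Host "Windows PowerShell 5.1 detected. Ready for VBAF."
} else {
    Write-Host "This is PowerShell $($PSVersionTable.PSVersion). Switch to Windows PowerShell 5.1."
}
```

The two parameters default to the current session, so calling it with no arguments checks the shell you are in.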

Step 2 of 5 -- Install VBAF from PSGallery

Install-Module VBAF -Scope CurrentUser

What you will see:

Untrusted repository
Are you sure you want to install the modules from 'PSGallery'?
[Y] Yes  [N] No

Type Y and press Enter.

What it means:
PSGallery is the official PowerShell module repository -- the same place
Microsoft publishes its own modules. VBAF is downloaded and installed
in your user profile. Nothing is changed system-wide.

If something goes wrong:
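If the download was interrupted or an older copy is in the way,
force a reinstall:
Install-Module VBAF -Scope CurrentUser -Force
Note that -Force does not fix proxy errors by itself. Behind a
corporate proxy, Install-Module also accepts -Proxy and -ProxyCredential.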
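Another known cause of download failures on older Windows builds is TLS: PSGallery requires TLS 1.2, and Windows PowerShell 5.1 does not always enable it by default. A minimal sketch of the usual workaround, using only built-in .NET types. The setting lasts for the current session only.

```powershell
# Add TLS 1.2 to the protocols this session may use (keeps any that are already enabled).
[Net.ServicePointManager]::SecurityProtocol = [Net.ServicePointManager]::SecurityProtocol -bor [Net.SecurityProtocolType]::Tls12

# Show the active protocols, then retry the install from Step 2.
[Net.ServicePointManager]::SecurityProtocol
# Install-Module VBAF -Scope CurrentUser
```

If the output includes Tls12, run the Install-Module line again.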

Step 3 of 5 -- Navigate to the VBAF folder

cd "$env:USERPROFILE\OneDrive\WindowsPowerShell"

What you will see:
Your prompt changes to show the new folder.

What it means:
VBAF lives in your OneDrive\WindowsPowerShell folder.
All examples and files are here.

If something goes wrong:
If OneDrive is not set up, try:
cd "$env:USERPROFILE\Documents\WindowsPowerShell"
Or wherever you cloned the VBAF repository.
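If neither folder exists, you can look the location up instead of guessing. The sketch below assumes the VBAF scripts ship inside the installed module folder, which this guide does not confirm. Get-VbafFolder is a made-up helper name; Get-Module -ListAvailable is a built-in cmdlet.

```powershell
# Hypothetical helper: returns the install folder of a module, or $null if it is not installed.
function Get-VbafFolder {
    param([string]$ModuleName = 'VBAF')
    $module = Get-Module -ListAvailable -Name $ModuleName |
        Sort-Object Version -Descending |
        Select-Object -First 1
    if ($module) { return $module.ModuleBase }
    return $null
}

$folder = Get-VbafFolder
if ($folder) {
    Set-Location $folder    # same effect as cd
    # Look for the loader script used in Step 4.
    Get-ChildItem -Filter 'VBAF.LoadAll.ps1' -Recurse |
        Select-Object -First 1 -ExpandProperty FullName
} else {
    Write-Host "VBAF not found. Repeat Step 2 or go to the folder where you cloned the repository."
}
```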

Step 4 of 5 -- Load the VBAF framework

. .\VBAF.LoadAll.ps1

What you will see:

  Loading VBAF Framework...

  [Phase 1] Core neural network...
  [Phase 2] Reinforcement learning...
  [Phase 3] Business and multi-agent...
  ...
  VBAF Framework ready!

  LEARNING PATH (run in order):
    1. & .\VBAF.Core.Example-XOR.ps1
    2. & .\VBAF.RL.Example-CastleLearning.ps1
    ...

What it means:
All VBAF classes and functions are now loaded into your session.
NeuralNetwork, QLearningAgent, DQNAgent, PPOAgent, A3CAgent --
all available. You need to run this once per PowerShell session.

If something goes wrong:
Make sure you are in the right folder (Step 3) before running this.
The dot-space-dot at the start is important: . .\VBAF.LoadAll.ps1
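Why the leading dot matters: it is PowerShell's dot-sourcing operator. A script started with & runs in its own scope and its functions disappear when it ends; a script started with . runs in your current scope, so everything it defines stays available. The throwaway script below is only for illustration and has nothing to do with VBAF.

```powershell
# Write a throwaway script that defines one function.
$demo = Join-Path ([IO.Path]::GetTempPath()) 'DotSourceDemo.ps1'
Set-Content -Path $demo -Value 'function Get-DemoGreeting { "loaded" }'

# Call operator: the script runs in its own scope, so the function is gone afterwards.
& $demo
$afterCall = [bool](Get-Command Get-DemoGreeting -ErrorAction SilentlyContinue)

# Dot-sourcing: the script runs in the current scope, so the function stays available.
. $demo
$afterDot = [bool](Get-Command Get-DemoGreeting -ErrorAction SilentlyContinue)

Write-Host "Available after '& script': $afterCall"
Write-Host "Available after '. script': $afterDot"

Remove-Item $demo
```

In a fresh session the first line reports False and the second True. The same thing happens with VBAF.LoadAll.ps1: without the dot, the classes and functions would not survive the script.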

Step 5 of 5 -- Confirm everything loaded

[NeuralNetwork]::new(@(2,3,1), 0.1)

What you will see:

Layers       : {Layer, Layer}
LearningRate : 0.1
Architecture : {2, 3, 1}

What it means:
You just created a neural network with 2 inputs, 3 hidden neurons
and 1 output. It exists in memory. It has random weights.
It knows nothing yet. That is about to change.

CAMP 0 COMPLETE

You have: PowerShell 5.1, VBAF installed, framework loaded.
You can: create neural networks and RL agents from scratch.
Next: watch one learn.

CAMP 1 -- FIRST FIRE

Watch a neural network learn something for the first time

Goal: see "SUCCESS! Network learned XOR!"

Step 6 of 8 -- Understand the problem first

XOR is the simplest problem a single neuron CANNOT solve.

0 XOR 0 = 0   (both same   -> 0)
0 XOR 1 = 1   (different   -> 1)
1 XOR 0 = 1   (different   -> 1)
1 XOR 1 = 0   (both same   -> 0)

Draw these four points on paper:
(0,0) -> label 0
(0,1) -> label 1
(1,0) -> label 1
(1,1) -> label 0

Try to draw ONE straight line that separates the 0-labelled points
from the 1-labelled points. You cannot. That is the problem.

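In 1969, Minsky and Papert showed mathematically that a single-layer
perceptron cannot solve XOR. The result contributed to a sharp drop in
neural network research funding during the 1970s.

The solution: add a hidden layer. Two lines can separate the points
even when one cannot. The Universal Approximation Theorem generalises
this idea: with enough hidden neurons, a network with one hidden layer
can approximate any continuous function on a bounded input range.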
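To see the two-line idea without any training, here is a hand-wired sketch. The weights are picked by hand for illustration; they are not learned and not taken from VBAF. Each hidden neuron draws one line, and the output neuron fires only for points that lie between the two lines.

```powershell
# A neuron with a hard threshold: fires (1) when the weighted sum plus bias is positive.
function Invoke-StepNeuron {
    param([double[]]$Inputs, [double[]]$Weights, [double]$Bias)
    $sum = $Bias
    for ($i = 0; $i -lt $Inputs.Count; $i++) { $sum += $Inputs[$i] * $Weights[$i] }
    if ($sum -gt 0) { return 1 } else { return 0 }
}

# Hand-picked weights (illustrative only).
function Invoke-HandWiredXor {
    param([double]$A, [double]$B)
    $h1 = Invoke-StepNeuron -Inputs @($A, $B) -Weights @(1, 1) -Bias (-0.5)   # line 1: A OR B
    $h2 = Invoke-StepNeuron -Inputs @($A, $B) -Weights @(1, 1) -Bias (-1.5)   # line 2: A AND B
    # Output fires when the point is past line 1 but not past line 2.
    return Invoke-StepNeuron -Inputs @($h1, $h2) -Weights @(1, -1) -Bias (-0.5)
}

foreach ($pair in @(@(0,0), @(0,1), @(1,0), @(1,1))) {
    $out = Invoke-HandWiredXor -A $pair[0] -B $pair[1]
    Write-Host "$($pair[0]) XOR $($pair[1]) = $out"
}
```

Training a network means finding weights like these automatically instead of by hand.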

Step 7 of 8 -- Run the XOR example

cd examples\01-XOR-Network
. .\Run-Example-01.ps1

What you will see:

Training Neural Network...
Architecture : 2 -> 3 -> 1

Epoch     1 / 5000 (  0.0%) -- Error: 0.269
Epoch   500 / 5000 ( 10.0%) -- Error: 0.268
Epoch  1500 / 5000 ( 30.0%) -- Error: 0.082
Epoch  2000 / 5000 ( 40.0%) -- Error: 0.006
Epoch  5000 / 5000 (100.0%) -- Error: 0.000697

Accuracy   : 100.00%
Correct    : 4 / 4

SUCCESS! Network learned XOR!

What it means:
The network started with random weights and knew nothing.
After 5000 passes through 4 training examples, it learned XOR.
Error dropped from 0.269 to 0.0007 -- near perfect.

Watch the error curve:
Epoch 1-1000: barely moves (stuck in a plateau)
Epoch 1500: suddenly drops (escaped the plateau)
Epoch 2000+: smooth descent to near zero

This plateau-then-breakthrough is normal. The network was slowly
repositioning its weights until they crossed a threshold.

If something goes wrong:
If accuracy is below 75% -- run it again. Different random starting
weights sometimes get stuck. This is normal and expected.

Step 8 of 8 -- Look inside what just happened

# Create a network and inspect it
$nn = [NeuralNetwork]::new(@(2,3,1), 0.5)

# See the random starting weights of the first hidden neuron
$nn.Layers[0].Neurons[0].Weights
$nn.Layers[0].Neurons[0].Bias

What you will see:
Two weights (one per input) and one bias: three random numbers
between -0.5 and 0.5.
These are the starting values -- completely random.

# Train it
$data = @(
    @{ Input = @(0.0,0.0); Expected = @(0.0) }
    @{ Input = @(0.0,1.0); Expected = @(1.0) }
    @{ Input = @(1.0,0.0); Expected = @(1.0) }
    @{ Input = @(1.0,1.0); Expected = @(0.0) }
)
$result = $nn.Train($data, 5000)

# See the weights AFTER training
$nn.Layers[0].Neurons[0].Weights
$nn.Layers[0].Neurons[0].Bias

What it means:
The weights changed. Backpropagation moved them from random values
to values that encode the XOR pattern. This is learning.
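At the level of a single neuron, one round of learning looks roughly like the sketch below. It uses the textbook gradient-descent rule for a sigmoid neuron; VBAF's own UpdateWeights method may differ in detail, and every name and starting value here is illustrative.

```powershell
function Get-Sigmoid { param([double]$X) return 1.0 / (1.0 + [math]::Exp(-$X)) }

# One training example for a single neuron with two inputs.
$inputs       = @(0.0, 1.0)
$expected     = 1.0
$weights      = @(0.2, -0.3)    # arbitrary starting values
$bias         = 0.1
$learningRate = 0.5

$errors = @()
for ($epoch = 1; $epoch -le 20; $epoch++) {
    # Forward pass: weighted sum, then sigmoid.
    $sum = $bias
    for ($i = 0; $i -lt $inputs.Count; $i++) { $sum += $weights[$i] * $inputs[$i] }
    $output = Get-Sigmoid $sum

    # Backward pass: error times the sigmoid slope gives the neuron's delta.
    $err   = $expected - $output
    $delta = $err * $output * (1.0 - $output)

    # Move each weight a small step in the direction that reduces the error.
    for ($i = 0; $i -lt $weights.Count; $i++) { $weights[$i] += $learningRate * $delta * $inputs[$i] }
    $bias += $learningRate * $delta

    $errors += [math]::Abs($err)
}
Write-Host ("Error at epoch 1: {0:N4}, at epoch 20: {1:N4}" -f $errors[0], $errors[-1])
```

Notice that the first weight never moves: its input is 0, so it had no part in the error. Backpropagation repeats this calculation for every neuron, layer by layer.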

What to try before moving on

Run the XOR example 3 times -- does it always converge?
Change the architecture to [2, 2, 1] -- can 2 hidden neurons solve it?
Change learning rate to 0.1 -- slower but more stable
Change epochs to 500 -- does it converge fast enough?
Open VBAF.Core.AllClasses.ps1 and find the UpdateWeights method. Read the formula. Match it to what you read in the comments.

CAMP 1 COMPLETE

You have seen: a neural network learn from random weights.
You understand: XOR, hidden layers, backpropagation, error curves.
Next: make an agent learn WITHOUT being told the correct answer.

CAMP 2 -- LEARNING TO HUNT

An agent discovers strategy through trial and error

Goal: watch reward increase as the agent learns

Step 9 of 11 -- Understand the difference

In Camp 1, we gave the network the CORRECT ANSWER for every input.
Input: [0,1] -> Correct answer: 1
This is called SUPERVISED learning.

In Camp 2, we give the agent a REWARD for good outcomes.
Action: choose castle type -> Reward: +2 if varied, -1 if repeated
The agent must figure out what "good" means by itself.
This is called REINFORCEMENT learning.

The difference:
Supervised: "here is the right answer"
Reinforcement: "here is how good that was"

Step 10 of 11 -- Run the castle learning example

cd ..\02-Castle-Learning
. .\Run-Example-02.ps1

What you will see:

Episode   1 | Reward:  12.45 | Epsilon: 1.000 | Q-Table:   0 entries
Episode  10 | Reward:  15.32 | Epsilon: 0.951 | Q-Table:  14 entries
Episode  50 | Reward:  18.67 | Epsilon: 0.779 | Q-Table:  38 entries
Episode 100 | Reward:  21.43 | Epsilon: 0.607 | Q-Table:  52 entries

What to watch:
Epsilon: starts at 1.0 (100% random), decays toward 0.01
Q-Table: grows as the agent visits new states
Reward: should trend upward as the agent learns

What it means:
Episode 1: agent picks randomly -- no knowledge
Episode 10: agent has seen some states -- Q-table forming
Episode 100: agent exploiting learned knowledge -- reward rising
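The three columns come from three small mechanisms, sketched below in generic form. The names, the three actions and the constants (including the 0.995 decay factor) are assumptions for illustration, not VBAF's API or settings.

```powershell
# Illustrative tabular Q-learning pieces.
$qTable  = @{}          # key "state|action" -> estimated value
$alpha   = 0.1          # learning rate
$gamma   = 0.95         # discount factor
$actions = 0..2         # three hypothetical actions

function Get-Q { param($State, $Action)
    $key = "$State|$Action"
    if ($qTable.ContainsKey($key)) { return $qTable[$key] }
    return 0.0
}

# Epsilon-greedy: explore with probability epsilon, otherwise take the best known action.
function Select-Action { param($State, [double]$Epsilon)
    if ((Get-Random -Minimum 0.0 -Maximum 1.0) -lt $Epsilon) { return ($actions | Get-Random) }
    return ($actions | Sort-Object { Get-Q $State $_ } -Descending | Select-Object -First 1)
}

# Q-learning update: nudge Q(s,a) toward reward + discounted best value of the next state.
function Update-Q { param($State, $Action, [double]$Reward, $NextState)
    $bestNext = ($actions | ForEach-Object { Get-Q $NextState $_ } | Measure-Object -Maximum).Maximum
    $old = Get-Q $State $Action
    $qTable["$State|$Action"] = $old + $alpha * ($Reward + $gamma * $bestNext - $old)
}

Update-Q -State 's0' -Action 1 -Reward 2.0 -NextState 's1'
Write-Host "Q(s0,1) after one update: $(Get-Q 's0' 1)"
Write-Host "Greedy choice in s0: $(Select-Action -State 's0' -Epsilon 0.0)"

# Decay epsilon once per episode, never below a floor of 0.01.
$epsilon = 1.0
for ($ep = 1; $ep -le 100; $ep++) { $epsilon = [math]::Max(0.01, $epsilon * 0.995) }
Write-Host ("Epsilon after 100 episodes: {0:N3}" -f $epsilon)
```

Every new "state|action" key is one more Q-table entry, which is why the entry count grows while epsilon shrinks.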

Step 11 -- Train a DQN on CartPole

Now a neural network approximates Q-values instead of a table.
This works for problems too large for a table.

cd "$env:USERPROFILE\OneDrive\WindowsPowerShell"
. .\VBAF.LoadAll.ps1

$agent = (Invoke-DQNTraining -Episodes 100 -PrintEvery 10)[-1]

What you will see:

Ep    10  Reward:    12  Best:    18  e: 0.951  Loss: 0.04521
Ep    20  Reward:    23  Best:    31  e: 0.905  Loss: 0.03891
Ep    50  Reward:    67  Best:    89  e: 0.779  Loss: 0.02341
Ep   100  Reward:   134  Best:   178  e: 0.607  Loss: 0.01123

What to watch:
Reward: random agent gets 10-20. Trained agent gets 100-200.
Epsilon: decays from 1.0 -- less exploration over time.
Loss: should trend downward -- Q-values becoming accurate.

# See the full stats
$agent.PrintStats()

# Get Q-values for a specific state
$state = @(0.1, 0.0, 0.05, 0.0)
$agent.GetQValues($state)

What it means:
GetQValues shows what the neural network thinks each action is worth.
Higher value = agent thinks this action leads to better outcomes.
After training, the values should make intuitive sense:
if the pole is tilting right, pushing right should have the higher value,
because moving the cart right brings it back under the pole.

What to try before moving on

Run DQN with FastMode and compare speed:
$agent = (Invoke-DQNTraining -Episodes 50 -FastMode)[-1]
Run it twice -- does it always converge to the same reward?
Look at the Q-values before and after training:
Create a new agent, get Q-values, train, get Q-values again.
See how they changed.
Open VBAF.RL.DQN.ps1 and find the Replay() method.
Read the Bellman equation comment. Trace through one update.
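Before you open the file, here is the general shape of a replay step as a sketch. It is not copied from VBAF.RL.DQN.ps1. The NextQ field stands in for the Q-values a network would predict for the next state, and all numbers are made up.

```powershell
# Illustrative replay buffer: each entry is one transition the agent experienced.
$buffer = New-Object System.Collections.ArrayList
[void]$buffer.Add(@{ Reward = 1.0; NextQ = @(0.5, 0.8); Done = $false })
[void]$buffer.Add(@{ Reward = 1.0; NextQ = @(0.2, 0.1); Done = $false })
[void]$buffer.Add(@{ Reward = 0.0; NextQ = @(0.9, 0.4); Done = $true  })

$gamma = 0.95

# Bellman target: reward now plus the discounted best value of the next state.
# When the episode ended (Done), there is no next state to look ahead to.
function Get-BellmanTarget { param($Transition, [double]$Gamma)
    if ($Transition.Done) { return $Transition.Reward }
    $bestNext = ($Transition.NextQ | Measure-Object -Maximum).Maximum
    return $Transition.Reward + $Gamma * $bestNext
}

# Sample a random mini-batch (here 2 of 3) and compute one target per transition.
$batch = $buffer | Get-Random -Count 2
foreach ($t in $batch) {
    Write-Host ("Reward {0}  Done {1}  Target {2:N3}" -f $t.Reward, $t.Done, (Get-BellmanTarget $t $gamma))
}
```

The network is then trained so that its Q-value for the action taken moves toward the target. Sampling at random is what breaks the link between consecutive steps.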

CAMP 2 COMPLETE

You have seen: Q-learning and DQN in action.
You understand: rewards, Q-values, epsilon-greedy, experience replay.
Next: compare three different algorithms head to head.

CAMP 3 -- THE ARENA

Three algorithms compete -- you judge the winner

Goal: benchmark DQN vs PPO vs A3C and explain the difference

Step 12 -- Understand the three algorithms

Q-Learning / DQN:
Learns: Q(state, action) = expected future reward
Decides: take the action with the highest Q-value
Memory: experience replay buffer (random batches)
Key innovation: target network for stable training

PPO (Proximal Policy Optimization):
Learns: pi(action|state) = probability of each action
Decides: sample from the probability distribution
Memory: rollout buffer (recent experiences, then discard)
Key innovation: clipped update -- no catastrophic policy changes

A3C (Asynchronous Advantage Actor-Critic):
Learns: policy AND value function in ONE shared network
Decides: sample from policy head
Memory: n-step rollout per worker (no buffer)
Key innovation: parallel workers, shared global network
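Of the three key innovations, PPO's clipped update is the easiest to show in a few lines. The sketch below is the textbook clipped objective for a single sample, not VBAF's implementation; the function name and the 0.2 clip range are assumptions.

```powershell
# Clipped surrogate term for one sample (textbook PPO form).
# Ratio     = new policy probability / old policy probability for the action taken
# Advantage = how much better the action was than expected
function Get-ClippedObjective {
    param([double]$Ratio, [double]$Advantage, [double]$ClipEpsilon = 0.2)
    $clippedRatio = [math]::Min([math]::Max($Ratio, 1.0 - $ClipEpsilon), 1.0 + $ClipEpsilon)
    # Taking the minimum removes any incentive to move the ratio far from 1.
    return [math]::Min($Ratio * $Advantage, $clippedRatio * $Advantage)
}

foreach ($ratio in 0.5, 1.0, 1.5, 3.0) {
    $value = Get-ClippedObjective -Ratio $ratio -Advantage 1.0
    Write-Host ("ratio {0:N1} -> objective {1:N2}" -f $ratio, $value)
}
```

For a good action (positive advantage), the objective stops growing once the ratio passes 1.2. Pushing the policy further earns nothing, which is what keeps a single update from changing the policy too much.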

Step 13 -- Train all three

Write-Host "Training DQN..." -ForegroundColor Cyan
$dqn = (Invoke-DQNTraining -Episodes 100 -FastMode -Quiet)[-1]

Write-Host "Training PPO..." -ForegroundColor Cyan
$ppo = (Invoke-PPOTraining -Episodes 100 -FastMode -Quiet)[-1]

Write-Host "Training A3C..." -ForegroundColor Cyan
$a3c = (Invoke-A3CTraining -Episodes 100 -FastMode -Quiet)[-1]

Write-Host "All three trained!" -ForegroundColor Green

This takes 2-5 minutes. Watch the output -- each algorithm
prints different statistics because they work differently.

Step 14 -- Benchmark them head to head

$env = New-VBAFEnvironment -Name "CartPole" -MaxSteps 200

Invoke-VBAFBenchmark -Agent $null -Environment $env -Episodes 20 -Label "Random baseline"
Invoke-VBAFBenchmark -Agent $dqn  -Environment $env -Episodes 20 -Label "DQN"
Invoke-VBAFBenchmark -Agent $ppo  -Environment $env -Episodes 20 -Label "PPO"
Invoke-VBAFBenchmark -Agent $a3c  -Environment $env -Episodes 20 -Label "A3C"

What you will see:

  Random baseline
  Avg Reward :  14.3
  Max Reward :  28.0

  DQN
  Avg Reward : 143.7
  Max Reward : 200.0

  PPO
  Avg Reward : 167.2
  Max Reward : 200.0

  A3C
  Avg Reward : 112.4
  Max Reward : 200.0

What it means:
Random agent: 14 reward -- barely balances
Trained agents: 100-200 reward -- learned to balance

The winner varies each run because of random weight initialisation.
Run the benchmark 3 times. Which algorithm wins most often?

Step 15 -- Watch four companies compete

cd examples\03-Market-Simulation
. .\Run-Example-03.ps1

What you will see:
Four companies learning business strategy simultaneously.
After 10 simulated years, you see who won and WHY.

Watch for:

Tacit collusion: companies avoid price wars without communicating
Innovation races: R&D emerges as the dominant strategy
Herfindahl index: measures market concentration
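The Herfindahl index is simple enough to compute by hand. A sketch using the standard definition (sum of squared market shares); how the example script scales or reports it is not shown here, and the function name and revenue figures are made up.

```powershell
# Herfindahl index: sum of squared market shares (shares as fractions of 1).
# Ranges from 1/N for N equal firms up to 1.0 for a monopoly.
function Get-HerfindahlIndex {
    param([double[]]$Revenues)
    $total = ($Revenues | Measure-Object -Sum).Sum
    $hhi = 0.0
    foreach ($r in $Revenues) {
        $share = $r / $total
        $hhi += $share * $share
    }
    return $hhi
}

Get-HerfindahlIndex -Revenues @(25, 25, 25, 25)   # four equal companies
Get-HerfindahlIndex -Revenues @(70, 10, 10, 10)   # one dominant company
```

Four equal companies give 0.25; one company with 70% of the market gives about 0.52. A rising index during the simulation means the market is concentrating.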

What it means:
Nobody programmed these behaviours.
They emerged from four Q-learning agents optimising their own rewards.
This is multi-agent reinforcement learning in action.

What to try before moving on

Run the benchmark 3 times -- which algorithm wins most often?
Try GridWorld:
$env = New-VBAFEnvironment -Name "GridWorld" -GridSize 5
Invoke-VBAFBenchmark -Agent $dqn -Environment $env -Episodes 20 -Label "DQN on GridWorld"
Read the stats from each agent:
$dqn.PrintStats()
$ppo.PrintStats()
$a3c.PrintStats()
What is different? What does Entropy mean for PPO and A3C?
Open VBAF.RL.PPO.ps1 and find the ComputeGAE method.
Read the comment about lambda. What happens when lambda = 1.0?
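If you want to experiment with lambda outside the framework, here is a generic sketch of Generalised Advantage Estimation in its textbook form. It is not copied from VBAF's ComputeGAE, and the rewards and critic values are made up.

```powershell
# Values has one more entry than Rewards: the last one is the value after the final step.
function Get-Gae {
    param([double[]]$Rewards, [double[]]$Values, [double]$Gamma = 0.99, [double]$Lambda = 0.95)
    $advantages = New-Object double[] $Rewards.Count
    $running = 0.0
    for ($t = $Rewards.Count - 1; $t -ge 0; $t--) {
        # One-step TD error: how much better this step went than the critic predicted.
        $delta = $Rewards[$t] + $Gamma * $Values[$t + 1] - $Values[$t]
        # Blend in later TD errors, each shrunk by gamma * lambda.
        $running = $delta + $Gamma * $Lambda * $running
        $advantages[$t] = $running
    }
    return $advantages
}

$rewards = @(1.0, 1.0, 1.0)
$values  = @(0.5, 0.5, 0.5, 0.0)    # made-up critic estimates; 0.0 marks the episode end

"lambda 0.0: " + ((Get-Gae $rewards $values -Lambda 0.0) -join ', ')
"lambda 1.0: " + ((Get-Gae $rewards $values -Lambda 1.0) -join ', ')
```

Compare the two lines. With lambda 0 each advantage uses only its own step; with lambda 1 every later reward is folded in.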

CAMP 3 COMPLETE

You have seen: three RL algorithms, head-to-head benchmarking, multi-agent.
You understand: the difference between value-based, policy gradient, and actor-critic.
Next: build something yourself.

CAMP 4 -- YOUR OWN FIRE

Design your own environment and train your own agent

Goal: an agent learns something YOU designed

Step 16 -- Understand what an environment needs

Every VBAF environment needs three methods:

Reset()         # start new episode, return initial state
Step($action)   # apply action, return @{NextState; Reward; Done}
GetState()      # return current state as double array

That is all. Any problem that fits this shape can be learned by
any VBAF agent -- DQN, PPO, A3C or Q-learning.
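As a sketch of that shape, here is a tiny environment written as a plain PowerShell class. It is hypothetical and not part of VBAF; whether VBAF's training functions accept a class like this without a VBAF base class is not confirmed here, so treat it as a picture of the interface only.

```powershell
# Hypothetical environment: walk along a line and try to reach position 0.
class LineWalkEnvironment {
    [int]$Position
    [int]$Steps
    [int]$MaxSteps = 20

    # Start a new episode and return the initial state.
    [double[]] Reset() {
        $this.Position = 3
        $this.Steps = 0
        return $this.GetState()
    }

    # Current state as a double array.
    [double[]] GetState() {
        return @([double]$this.Position)
    }

    # Apply an action (0 = left, 1 = right) and report what happened.
    [hashtable] Step([int]$action) {
        if ($action -eq 0) { $this.Position-- } else { $this.Position++ }
        $this.Steps++
        $atGoal = ($this.Position -eq 0)
        $reward = -0.1
        if ($atGoal) { $reward = 10.0 }
        return @{
            NextState = $this.GetState()
            Reward    = $reward
            Done      = ($atGoal -or $this.Steps -ge $this.MaxSteps)
        }
    }
}

$env = [LineWalkEnvironment]::new()
$state = $env.Reset()
$result = $env.Step(0)
Write-Host "State: $($result.NextState)  Reward: $($result.Reward)  Done: $($result.Done)"
```

The reward design is the interesting part: a small penalty per step and a large bonus at the goal tells the agent what you want without telling it how.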

Step 17 -- Study a simple example

# RandomWalk is the simplest possible environment
$env = New-VBAFEnvironment -Name "RandomWalk"
$env.PrintInfo()

# Run one episode manually
$state = $env.Reset()
Write-Host "Start state: $state"

for ($step = 0; $step -lt 10; $step++) {
    $action = Get-Random -Minimum 0 -Maximum 2   # 0=left, 1=right
    $result = $env.Step($action)
    Write-Host "Action: $action  State: $($result.NextState)  Reward: $($result.Reward)  Done: $($result.Done)"
    if ($result.Done) { break }
}

What it means:
You are manually controlling the agent -- choosing random actions.
Watch how the state changes and reward is assigned.
A trained agent would learn to always move toward 0 (center).

Step 18 -- Run the custom agent example

cd "$env:USERPROFILE\OneDrive\WindowsPowerShell\examples\06-Custom-Agent"
. .\Run-Example-06.ps1

This example shows how to build your own environment from scratch
and train a DQN agent on it. Read the code carefully.

Step 19 -- Modify something

Pick ONE thing to change and observe the effect:

Option A -- Change the reward function in RandomWalk:
Currently: +10 for reaching center, else -(distance * 0.1)
Try: +1 for reaching center, else 0 (no distance penalty)
Question: does the agent still learn? Is it faster or slower?

Option B -- Change DQN hyperparameters:
Currently: Gamma=0.95, LearningRate=0.001, BatchSize=32
Try: Gamma=0.50 (agent ignores future rewards)
Question: does short-sighted learning work for CartPole?

Option C -- Change the architecture:
Currently: [4, 64, 64, 2]
Try: [4, 8, 2]
Question: can a tiny network still solve CartPole?
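For Option B it helps to see what Gamma does to the numbers before you run anything. The sketch below computes a plain discounted return; the function name and the 50-step reward sequence are illustrative, not taken from VBAF.

```powershell
# Discounted return: a reward t steps ahead counts gamma^t as much as an immediate one.
function Get-DiscountedReturn {
    param([double[]]$Rewards, [double]$Gamma)
    $total = 0.0
    for ($t = 0; $t -lt $Rewards.Count; $t++) {
        $total += [math]::Pow($Gamma, $t) * $Rewards[$t]
    }
    return $total
}

# CartPole-style rewards: +1 for every step the pole stays up (50 steps here).
$rewards = @(1.0) * 50

foreach ($gamma in 0.95, 0.50) {
    $g = Get-DiscountedReturn -Rewards $rewards -Gamma $gamma
    Write-Host ("Gamma {0:N2}: return seen from step 0 = {1:N2}" -f $gamma, $g)
}
```

With Gamma 0.50 the total can never exceed 2, no matter how long the pole stays up. Surviving 50 steps then looks barely better than surviving 5. Keep that in mind when you write down your hypothesis.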

What to try before moving on

Run your modified version 3 times. Is the result consistent?
Write down your hypothesis BEFORE running -- then check it.
Open VBAF.RL.Environment.ps1 and read the GridWorld class. Could you adapt it for a different grid-based problem?

CAMP 4 COMPLETE

You have done: run manual episodes, modified a reward function, changed hyperparameters.
You understand: the environment interface, reward shaping, hyperparameter sensitivity.
Next: see where all this leads at enterprise scale.

CAMP 5 -- THE SUMMIT

From foundation to enterprise -- trace the learning ladder

Goal: understand how Phases 1-9 become Phases 10-27

Step 20 -- The learning ladder

VBAF has two layers:

Foundation (Phases 1-9):
Neural networks, Q-learning, DQN, PPO, A3C, multi-agent.
You have learned all of this in Camps 1-4.

Enterprise (Phases 10-27):
14 production-grade automation agents built on the SAME foundation.
These are not teaching examples -- they solve real IT problems.

The ladder:

Phase 1-9:  learn HOW agents learn
Phase 10-27: see WHAT agents can do when they learn well

Step 21 -- Run one enterprise agent

cd "$env:USERPROFILE\OneDrive\WindowsPowerShell"
. .\VBAF.LoadAll.ps1

# Self-Healing Infrastructure -- Phase 14
$result = Invoke-VBAFSelfHealingTraining -Episodes 50 -SimMode

What you will see:
A DQN agent learning to detect and fix system problems.
State: CPU load, memory, disk, error rate, response time.
Actions: Observe, Adjust, Restart, Rebuild.

What it means:
This is the same DQN you trained on CartPole in Camp 2.
Same algorithm. Same Bellman equation. Same experience replay.
Different environment. Different reward function.
The learning mechanism is identical.

Step 22 -- Trace it back to the foundation

# Open the enterprise file and find where NeuralNetwork is used
Select-String "NeuralNetwork" ".\VBAF.Enterprise.SelfHealing.ps1"

# Compare with DQN
Select-String "NeuralNetwork" ".\VBAF.RL.DQN.ps1"

What you will see:
Both files reference NeuralNetwork.
The enterprise agent uses the SAME class you trained in Camp 1.

Trace the chain:
VBAF.Core.AllClasses.ps1 -- defines NeuralNetwork
VBAF.RL.DQN.ps1 -- uses NeuralNetwork for Q-learning
VBAF.Enterprise.SelfHealing.ps1 -- uses DQN for IT automation

Three files. One chain. Same foundation.

Step 23 -- Run the AutoPilot (the crown jewel)

$result = Invoke-VBAFAutoPilotTraining -Episodes 50 -SimMode

What you will see:
AutoPilot orchestrates ALL 13 enterprise pillars simultaneously.
It is an agent that coordinates other agents.
Think of it as meta-level control -- an agent that decides which agents to activate.

This is Phase 27 -- the furthest point VBAF reaches.
But it is built entirely from the same concepts you learned in Camp 1.

Step 24 -- Read one enterprise file properly

Choose any enterprise file that interests you:

VBAF.Enterprise.AnomalyDetector.ps1 -- spots unusual patterns
VBAF.Enterprise.EnergyOptimizer.ps1 -- reduces power consumption
VBAF.Enterprise.PatchIntelligence.ps1 -- risk-aware patch scheduling

Open it and find:

What is the STATE? (what does the agent observe?)
What are the ACTIONS? (what can the agent do?)
What is the REWARD? (what is it optimising for?)

Answer those three questions for any environment and you understand
what the agent will learn to do.

CAMP 5 COMPLETE

You have seen: the full learning ladder from XOR to enterprise AutoPilot.
You understand: how foundation concepts scale to production systems.
You can: read any VBAF file and understand what the agent is learning.

THE VIEW FROM THE TOP

You started at Camp 0 with:

Install-Module VBAF

You are now at Camp 5 with:

Neural networks trained from scratch
Three RL algorithms benchmarked
Multi-agent competition observed
Your own hyperparameters tested
Enterprise agents running

What you can do now:

# Train any algorithm
$agent = (Invoke-DQNTraining -Episodes 200)[-1]
$agent = (Invoke-PPOTraining -Episodes 200)[-1]
$agent = (Invoke-A3CTraining -Episodes 200)[-1]

# On any environment
$env = New-VBAFEnvironment -Name "CartPole"
$env = New-VBAFEnvironment -Name "GridWorld" -GridSize 8
$env = New-VBAFEnvironment -Name "RandomWalk"

# Benchmark anything
Invoke-VBAFBenchmark -Agent $agent -Environment $env -Episodes 50 -Label "My Agent"

# Run any enterprise pillar
Invoke-VBAFSelfHealingTraining -Episodes 100 -SimMode
Invoke-VBAFAutoPilotTraining   -Episodes 100 -SimMode

Where to go from here:

Read the theory: docs/Theory.md
Build your own environment: examples/06-Custom-Agent/
Study the papers referenced in each file
Contribute an example or a new environment

QUICK REFERENCE

The 5 most important commands

# 1. Load everything (run once per session)
. .\VBAF.LoadAll.ps1

# 2. Run the learning path in order
& .\examples\01-XOR-Network\Run-Example-01.ps1
& .\examples\02-Castle-Learning\Run-Example-02.ps1
& .\examples\03-Market-Simulation\Run-Example-03.ps1

# 3. Train an agent
$agent = (Invoke-DQNTraining -Episodes 100 -PrintEvery 10)[-1]

# 4. Benchmark it
$env = New-VBAFEnvironment -Name "CartPole"
Invoke-VBAFBenchmark -Agent $agent -Environment $env -Episodes 20 -Label "My DQN"

# 5. See what it learned
$agent.PrintStats()
$agent.GetQValues(@(0.1, 0.0, 0.05, 0.0))

If something breaks

"Unable to find type"     -> run . .\VBAF.LoadAll.ps1 first
"Cannot find path"        -> check you are in the right folder
"accuracy below 75%"      -> run XOR again (random init sometimes fails)
"reward not increasing"   -> train for more episodes

VBAF -- Visual AI & Reinforcement Learning Framework
github.com/JupyterPS/VBAF
"The best way to understand AI is to build it yourself -- line by line."