Irvan Gerhana Septiyana

Posted on Jun 25

Generating Synthetic Enterprise Datasets for AI Systems

#ai #datascience #automation #showdev

Part 2 of the Building Enterprise AI Automation Systems Series

Introduction

One of the biggest obstacles in enterprise AI is not choosing a model.

It is finding data.

Most tutorials assume that training data already exists.

Reality is very different.

Large organizations rarely share operational datasets.

Financial transactions contain confidential information.

Contracts contain sensitive agreements.

Invoices reveal commercial relationships.

Bank statements expose customer activity.

For legal, regulatory, and competitive reasons, these datasets almost never become public.

This creates a difficult problem for AI engineers.

How do you build intelligent systems when the data you need cannot be accessed?

The answer is synthetic data.

Unfortunately, most synthetic datasets found online are little more than randomly generated CSV files.

They contain names.

Numbers.

Dates.

But they completely ignore something far more important:

Business relationships.

In this article, we'll explore how to design synthetic enterprise datasets that preserve real business logic and can be used for machine learning, automation, benchmarking, and AI engineering.

Random Data Is Not Synthetic Data

Many developers believe synthetic data simply means generating fake values.

For example:

Customer,Invoice,Amount
John,INV001,500
Alice,INV002,1200
Bob,INV003,900

Technically, this is synthetic.

Practically, it is useless.

Why?

Because enterprise systems are built around relationships.

Invoices belong to contracts.

Contracts belong to customers.

Payments reference invoices.

Purchase orders authorize invoices.

Bank transactions settle invoices.

Without these relationships, there is nothing meaningful to learn.

A machine learning model trained on isolated records learns isolated patterns.

Real enterprise automation requires connected data.

Thinking Like an Enterprise System

Before writing a single line of Python, ask one question:

"How does the business actually operate?"

Imagine a manufacturing company.

A customer signs a contract.

The contract defines:

products,
payment schedules,
milestones,
currencies,
pricing.

Invoices are generated from the contract.

Purchase orders authorize procurement.

Eventually, a payment appears in a bank statement.

That payment is never independent.

It always belongs to a business process.

Therefore our synthetic dataset must preserve that process.

Designing the Data Model

Rather than generating random tables, begin by designing business entities.

For this project, the core entities were:

Customer
        │
        ▼
Contract
        │
        ▼
Invoice
        │
        ▼
Bank Transaction

This hierarchy reflects real enterprise operations.

Every entity inherits context from its parent.
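
As a rough sketch, the hierarchy above could be expressed as Python dataclasses. The field names follow the JSON examples in this article; the class names and every field on BankTransaction are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: str
    legal_name: str
    country: str
    industry: str


@dataclass(frozen=True)
class Contract:
    contract_id: str
    customer_id: str       # reference to Customer, not a copy of its fields
    billing_schedule: str
    currency: str


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    contract_id: str       # reference to Contract
    customer_id: str       # inherited from the parent contract
    amount: float


@dataclass(frozen=True)
class BankTransaction:
    transaction_id: str    # hypothetical field
    invoice_number: str    # reference to Invoice
    amount: float
    narrative: str


# Build the chain top-down so every child points at an existing parent.
customer = Customer("CUS-00002", "ALPHABRIDGE SOLUTIONS", "United States", "Manufacturing")
contract = Contract("CNT-2024-587", customer.customer_id, "Monthly", "EUR")
invoice = Invoice("MFG-INV-000157", contract.contract_id, contract.customer_id, 3979.85)
```

Frozen dataclasses are used here so that a generated record cannot be modified by accident after its identifiers have been handed to child entities.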

Customer Master

The customer master acts as the source of truth.

Example:

{
  "customer_id":"CUS-00002",
  "legal_name":"ALPHABRIDGE SOLUTIONS",
  "country":"United States",
  "industry":"Manufacturing"
}

Customers rarely change.

Everything else references them.

Contract Master

Contracts establish commercial relationships.

Example:

{
  "contract_id":"CNT-2024-587",
  "customer_id":"CUS-00002",
  "billing_schedule":"Monthly",
  "currency":"EUR"
}

Notice that contracts reference customers.

Never copy customer attributes such as the legal name or country into the contract.

Reference the customer by identifier instead.

Invoice Master

Invoices inherit context from contracts.

{
  "invoice_number":"MFG-INV-000157",
  "contract_id":"CNT-2024-587",
  "customer_id":"CUS-00002",
  "amount":3979.85
}

Again, relationships matter more than values.
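
One way to guarantee these references is to generate the masters top-down: customers first, then contracts, then invoices. The sketch below assumes that approach. The function name, the placeholder values, and the sequential ID counters are illustrative assumptions; only the ID prefixes come from the examples above.

```python
import random


def generate_masters(n_customers=2, contracts_per_customer=2,
                     invoices_per_contract=3, seed=7):
    """Generate customers, then contracts, then invoices, in that order."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    customers, contracts, invoices = [], [], []
    for c in range(1, n_customers + 1):
        customer_id = f"CUS-{c:05d}"
        customers.append({
            "customer_id": customer_id,
            "legal_name": f"CUSTOMER {c}",  # placeholder name
            "country": rng.choice(["United States", "Germany", "Indonesia"]),
            "industry": "Manufacturing",
        })
        for _ in range(contracts_per_customer):
            contract_id = f"CNT-2024-{len(contracts) + 1:03d}"
            contracts.append({
                "contract_id": contract_id,
                "customer_id": customer_id,  # reference, not a copy
                "billing_schedule": rng.choice(["Monthly", "Quarterly"]),
                "currency": rng.choice(["EUR", "USD"]),
            })
            for _ in range(invoices_per_contract):
                invoices.append({
                    "invoice_number": f"MFG-INV-{len(invoices) + 1:06d}",
                    "contract_id": contract_id,
                    "customer_id": customer_id,  # inherited from the contract
                    "amount": round(rng.uniform(500, 5000), 2),
                })
    return customers, contracts, invoices


customers, contracts, invoices = generate_masters()
print(len(customers), len(contracts), len(invoices))  # 2 4 12
```

Because each child is created inside its parent's loop, an invoice can never point at a contract or customer that does not exist.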

Bank Statements

Only after customers, contracts, and invoices exist should transactions be generated.

Example narrative:

PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157

Notice that the narrative references existing business entities.

This is the difference between realistic synthetic data and random text generation.
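
A minimal sketch of how such a narrative might be assembled from existing records. The function name is hypothetical, the paid amount is an illustrative value, and the plain "PMT" prefix for full payments is an assumption; only the "PART PMT" form appears in the example above.

```python
def build_narrative(customer, invoice, paid_amount):
    """Compose a bank narrative that references existing master records."""
    is_partial = paid_amount < invoice["amount"]
    # "PMT" for a full payment is an assumed convention.
    prefix = "PART PMT" if is_partial else "PMT"
    return f"{prefix} {customer['legal_name']} {invoice['invoice_number']}"


customer = {"customer_id": "CUS-00002", "legal_name": "ALPHABRIDGE SOLUTIONS"}
invoice = {"invoice_number": "MFG-INV-000157", "customer_id": "CUS-00002", "amount": 3979.85}

print(build_narrative(customer, invoice, paid_amount=1500.00))
# PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157
```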

Why Relationships Matter

Suppose a bank narrative references:

MFG-INV-000157

That invoice number should always resolve to exactly one chain:

Customer
        │
        ▼
Contract
        │
        ▼
Invoice

Otherwise:

entity resolution cannot be evaluated,
reconciliation cannot be validated,
ground truth disappears.

Synthetic data must preserve referential integrity.
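
Referential integrity is cheap to verify after generation. The check below is a sketch that assumes records are plain dictionaries using the field names from the JSON examples; the function name and the error message wording are illustrative.

```python
def find_integrity_errors(customers, contracts, invoices):
    """Return a list of human-readable referential integrity violations."""
    customer_ids = {c["customer_id"] for c in customers}
    contract_owner = {c["contract_id"]: c["customer_id"] for c in contracts}
    errors = []
    for contract in contracts:
        if contract["customer_id"] not in customer_ids:
            errors.append(f"{contract['contract_id']}: unknown customer {contract['customer_id']}")
    for invoice in invoices:
        owner = contract_owner.get(invoice["contract_id"])
        if owner is None:
            errors.append(f"{invoice['invoice_number']}: unknown contract {invoice['contract_id']}")
        elif owner != invoice["customer_id"]:
            errors.append(f"{invoice['invoice_number']}: customer does not match contract owner")
    return errors


customers = [{"customer_id": "CUS-00002"}]
contracts = [{"contract_id": "CNT-2024-587", "customer_id": "CUS-00002"}]
invoices = [
    {"invoice_number": "MFG-INV-000157", "contract_id": "CNT-2024-587", "customer_id": "CUS-00002"},
    {"invoice_number": "MFG-INV-000158", "contract_id": "CNT-2024-999", "customer_id": "CUS-00002"},
]

for error in find_integrity_errors(customers, contracts, invoices):
    print(error)
# MFG-INV-000158: unknown contract CNT-2024-999
```

Running a check like this before publishing a dataset catches orphaned records that would otherwise silently corrupt evaluation.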

Building Ground Truth

One advantage of synthetic data is complete control.

Every generated transaction already knows:

which customer owns it,
which contract created it,
which invoice it references,
whether it is a partial payment,
whether reconciliation should succeed.

This hidden knowledge becomes ground truth.

Ground truth enables benchmarking.

Instead of asking:

"Did the model perform well?"

we can ask:

"Did the model recover the correct business relationship?"

This is a much stronger evaluation.
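
As an illustration, relationship recovery can be scored directly against the generator's labels. The sketch below assumes ground truth and predictions are both stored as a mapping from a transaction ID to an invoice number; the function name and the TXN identifiers are hypothetical.

```python
def relationship_accuracy(ground_truth, predictions):
    """Share of transactions whose predicted invoice matches the generator's label."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for txn_id, true_invoice in ground_truth.items()
        if predictions.get(txn_id) == true_invoice  # a missing prediction counts as wrong
    )
    return correct / len(ground_truth)


# Labels written by the generator at creation time.
ground_truth = {
    "TXN-001": "MFG-INV-000157",
    "TXN-002": "MFG-INV-000158",
    "TXN-003": "MFG-INV-000159",
}
# Output of some model under evaluation.
predictions = {
    "TXN-001": "MFG-INV-000157",
    "TXN-002": "MFG-INV-000999",
}

print(round(relationship_accuracy(ground_truth, predictions), 3))  # 0.333
```

The same pattern extends to customer or contract recovery by swapping the label stored in the mapping.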

Simulating Real-World Noise

Real enterprise data is messy.

Invoices are not always written consistently.

Examples:

INV-001
INV001
INV 001
INVOICE-001

Customer names appear in several forms:

ALPHABRIDGE SOLUTIONS
ALPHABRIDGE LTD
ALPHA BRIDGE
ABS

Synthetic datasets should deliberately include this variability.

Otherwise models learn to handle only perfect data and fail on realistic input.

The goal is not to make the dataset clean.

The goal is to make it believable.
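
A small sketch of how invoice references could be perturbed while keeping the clean label. The four variants mirror the invoice examples above; the function names are hypothetical, and returning the canonical ID alongside the noisy text is an assumed design choice.

```python
import random


def invoice_variants(invoice_number):
    """Return common ways the same invoice number might be written."""
    return [
        invoice_number,                               # INV-001
        invoice_number.replace("-", ""),              # INV001
        invoice_number.replace("-", " "),             # INV 001
        invoice_number.replace("INV", "INVOICE", 1),  # INVOICE-001
    ]


def add_noise(invoice_number, rng):
    """Return (noisy_text, canonical_id) so the clean label is never lost."""
    return rng.choice(invoice_variants(invoice_number)), invoice_number


rng = random.Random(0)  # seeded for reproducible noise
noisy, canonical = add_noise("INV-001", rng)
print(invoice_variants("INV-001"))
# ['INV-001', 'INV001', 'INV 001', 'INVOICE-001']
```

Keeping the canonical identifier next to every noisy string is what lets the messy text remain usable as labeled training data.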

Balancing Entity Distribution

Another common mistake is imbalance.

Imagine a dataset containing:

Invoice Labels        : 50,000
Contract Labels       : 35
Purchase Order Labels : 40

A transformer trained on this data will naturally learn to recognize invoices far better than contracts.

The issue is not the model.

It is the dataset.

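With only 35 contract labels, any metric reported for contracts rests on too few examples to be trusted. A balanced entity distribution improves learning quality and produces more reliable evaluation metrics.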

Synthetic generation should therefore control not only volume, but also diversity.
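
One simple way to control the distribution is to decide up front which entity type each generated example should feature. The sketch below assumes an even round-robin split followed by a seeded shuffle; the function name and the label names are illustrative.

```python
import random
from collections import Counter
from itertools import cycle, islice


def plan_entity_mentions(entity_types, total, seed=0):
    """Assign an entity type to each planned example, as evenly as possible."""
    plan = list(islice(cycle(entity_types), total))  # round-robin keeps counts within 1
    random.Random(seed).shuffle(plan)                # avoid a predictable ordering
    return plan


labels = ["CUSTOMER", "INVOICE", "CONTRACT", "PURCHASE_ORDER"]
plan = plan_entity_mentions(labels, total=1000)
counts = Counter(plan)
print(counts["INVOICE"], counts["CONTRACT"])  # 250 250
```

Real data is rarely this uniform, so a weighted plan may be more appropriate in practice; the point is that the generator, not chance, decides the distribution.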

Why Synthetic Data Enables Better AI

Once relationships exist, a single synthetic dataset can support multiple AI tasks.

For example:

Named Entity Recognition

Extract:

Customer
Invoice
Contract
Purchase Order

Entity Resolution

Resolve:

ALPHABRIDGE

↓

CUS-00002

Reconciliation

Determine whether a payment correctly settles an invoice.

Agentic Workflows

Trigger downstream actions:

approve,
escalate,
notify,
reconcile,
update ERP.

The same dataset becomes reusable across multiple machine learning tasks.
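
To make the reuse concrete, the sketch below derives three task-specific views from one generated record. The record layout, the function name, the label names, and the paid amount are all assumptions chosen for illustration.

```python
def task_views(record):
    """Derive NER, entity resolution, and reconciliation targets from one record."""
    narrative = record["narrative"]
    spans = []
    for label, text in (("CUSTOMER", record["customer_name"]),
                        ("INVOICE", record["invoice_number"])):
        start = narrative.find(text)
        if start != -1:  # skip entities that do not appear verbatim
            spans.append({"label": label, "start": start, "end": start + len(text)})
    return {
        "ner": {"text": narrative, "spans": spans},
        "entity_resolution": {
            "mention": record["customer_name"],
            "target": record["customer_id"],
        },
        "reconciliation": {
            "invoice_number": record["invoice_number"],
            "settles_in_full": record["paid_amount"] >= record["invoice_amount"],
        },
    }


record = {
    "narrative": "PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157",
    "customer_name": "ALPHABRIDGE SOLUTIONS",
    "customer_id": "CUS-00002",
    "invoice_number": "MFG-INV-000157",
    "invoice_amount": 3979.85,
    "paid_amount": 1500.00,  # illustrative partial payment
}

views = task_views(record)
print(views["entity_resolution"])
# {'mention': 'ALPHABRIDGE SOLUTIONS', 'target': 'CUS-00002'}
```

Exact substring matching only works for clean narratives; once noise is injected, the generator would need to record span offsets at the moment the text is written.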

Lessons Learned

After generating hundreds of thousands of synthetic enterprise transactions, one lesson became obvious.

Volume alone is meaningless.

Relationships matter.

Business logic matters.

Ground truth matters.

If your synthetic dataset behaves like a real business, your AI system learns to solve real business problems.

If your synthetic dataset behaves like random CSV files, your AI system learns randomness.

Conclusion

Synthetic data is not a shortcut.

It is an engineering discipline.

Well-designed synthetic datasets preserve business logic, entity relationships, referential integrity, and realistic variability.

These characteristics make them valuable not only for machine learning but also for benchmarking, software testing, API validation, and enterprise automation.

In the next article, we'll use this synthetic dataset to build a Financial Named Entity Recognition (NER) pipeline capable of understanding enterprise bank transaction narratives and transforming them into structured business knowledge.

Part 3 — Building a Financial Named Entity Recognition Pipeline Using Doccano and IndoBERT

We'll cover:

Designing a business taxonomy
Automatic pre-labeling
Annotation guidelines
Doccano workflow
BIO tagging
Fine-tuning IndoBERT
Evaluating precision, recall, and F1-score
Preparing data for entity resolution
