
DEV Community

Irvan Gerhana Septiyana


Generating Synthetic Enterprise Datasets for AI Systems

Part 2 of the Building Enterprise AI Automation Systems Series


Introduction

One of the biggest obstacles in enterprise AI is not choosing a model.

It is finding data.

Most tutorials assume that training data already exists.

Reality is very different.

Large organizations rarely share operational datasets.

Financial transactions contain confidential information.

Contracts contain sensitive agreements.

Invoices reveal commercial relationships.

Bank statements expose customer activity.

For legal, regulatory, and competitive reasons, these datasets almost never become public.

This creates a difficult problem for AI engineers.

How do you build intelligent systems when the data you need cannot be accessed?

The answer is synthetic data.

Unfortunately, most synthetic datasets found online are little more than randomly generated CSV files.

They contain names.

Numbers.

Dates.

But they completely ignore something far more important:

Business relationships.

In this article, we'll explore how to design synthetic enterprise datasets that preserve real business logic and can be used for machine learning, automation, benchmarking, and AI engineering.


Random Data Is Not Synthetic Data

Many developers believe synthetic data simply means generating fake values.

For example:

Customer,Invoice,Amount
John,INV001,500
Alice,INV002,1200
Bob,INV003,900

Technically, this is synthetic.

Practically, it is useless.

Why?

Because enterprise systems are built around relationships.

Invoices belong to contracts.

Contracts belong to customers.

Payments reference invoices.

Purchase orders authorize invoices.

Bank transactions settle invoices.

Without these relationships, there is nothing meaningful to learn.

A machine learning model trained on isolated records learns isolated patterns.

Real enterprise automation requires connected data.


Thinking Like an Enterprise System

Before writing a single line of Python, ask one question:

"How does the business actually operate?"

Imagine a manufacturing company.

A customer signs a contract.

The contract defines:

  • products,
  • payment schedules,
  • milestones,
  • currencies,
  • pricing.

Invoices are generated from the contract.

Purchase orders authorize procurement.

Eventually, a payment appears in a bank statement.

That payment is never independent.

It always belongs to a business process.

Therefore, our synthetic dataset must preserve that process.


Designing the Data Model

Rather than generating random tables, begin by designing business entities.

For this project, the core entities were:

Customer
        │
        ▼
Contract
        │
        ▼
Invoice
        │
        ▼
Bank Transaction

This hierarchy reflects real enterprise operations.

Every entity inherits context from its parent.
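One way to make this hierarchy concrete is to write it down as typed records before generating anything. The sketch below is a minimal illustration, not the project's actual schema: the customer, contract, and invoice fields follow the examples used in this article, while the `BankTransaction` fields (`transaction_id`, `narrative`) are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: str
    legal_name: str
    country: str
    industry: str


@dataclass(frozen=True)
class Contract:
    contract_id: str
    customer_id: str  # reference to the parent Customer, not a copy of its fields
    billing_schedule: str
    currency: str


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    contract_id: str  # reference to the parent Contract
    customer_id: str  # inherited from the contract
    amount: float


@dataclass(frozen=True)
class BankTransaction:
    transaction_id: str  # hypothetical field, not shown in the article
    invoice_number: str  # reference to the Invoice being paid
    amount: float
    narrative: str


# Each child is built from its parent, so context flows down the hierarchy.
customer = Customer("CUS-00002", "ALPHABRIDGE SOLUTIONS", "United States", "Manufacturing")
contract = Contract("CNT-2024-587", customer.customer_id, "Monthly", "EUR")
invoice = Invoice("MFG-INV-000157", contract.contract_id, contract.customer_id, 3979.85)
```

Building each record from its parent, rather than filling in IDs by hand, makes it hard to create an invoice that points at the wrong customer.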


Customer Master

The customer master acts as the source of truth.

Example:

{
  "customer_id":"CUS-00002",
  "legal_name":"ALPHABRIDGE SOLUTIONS",
  "country":"United States",
  "industry":"Manufacturing"
}

Customer master records rarely change.

Everything else references them.


Contract Master

Contracts establish commercial relationships.

Example:

{
  "contract_id":"CNT-2024-587",
  "customer_id":"CUS-00002",
  "billing_schedule":"Monthly",
  "currency":"EUR"
}

Notice that contracts reference customers.

Never duplicate customer information.

Use identifiers.
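A contract generator that follows this rule might look like the sketch below. It is an illustration under assumptions: the function name `generate_contracts`, the ID numbering scheme, and the lists of billing schedules and currencies are hypothetical choices, not the generator used for this project.

```python
import random


def generate_contracts(customer_ids, per_customer=2, seed=42):
    """Create contracts that point at customers by ID only."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    contracts = []
    for customer_id in customer_ids:
        for _ in range(per_customer):
            contracts.append({
                # Sequential numbering is an assumption made for this sketch.
                "contract_id": f"CNT-2024-{len(contracts) + 1:03d}",
                "customer_id": customer_id,  # identifier only, no copied name or country
                "billing_schedule": rng.choice(["Monthly", "Quarterly", "Milestone"]),
                "currency": rng.choice(["EUR", "USD"]),
            })
    return contracts


contracts = generate_contracts(["CUS-00001", "CUS-00002"])
```

Because a contract carries only `customer_id`, a later change to the customer's legal name never leaves stale copies behind in the contract table.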


Invoice Master

Invoices inherit context from contracts.

{
  "invoice_number":"MFG-INV-000157",
  "contract_id":"CNT-2024-587",
  "customer_id":"CUS-00002",
  "amount":3979.85
}

Again, relationships matter more than values.
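As a rough sketch of what "inheriting context" can mean in code, the function below derives every invoice from a contract record. The function name, the amount range, and the extra `currency` field on the invoice are assumptions made for illustration; only the ID formats come from the examples above.

```python
import random


def generate_invoices(contract, count, seed=0):
    """Generate invoices whose context is copied from the parent contract."""
    rng = random.Random(seed)
    invoices = []
    for n in range(1, count + 1):
        invoices.append({
            "invoice_number": f"MFG-INV-{n:06d}",
            "contract_id": contract["contract_id"],
            "customer_id": contract["customer_id"],  # inherited, never chosen at random
            "currency": contract["currency"],        # assumed extra field, also inherited
            "amount": round(rng.uniform(500, 5000), 2),  # only the value is random
        })
    return invoices


contract = {
    "contract_id": "CNT-2024-587",
    "customer_id": "CUS-00002",
    "billing_schedule": "Monthly",
    "currency": "EUR",
}
invoices = generate_invoices(contract, count=3)
```

The only random part is the amount. Everything that identifies the invoice's place in the business process comes from the contract.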


Bank Statements

Only after customers, contracts, and invoices exist should transactions be generated.

Example narrative:

PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157

Notice that the narrative references existing business entities.

This is the difference between realistic synthetic data and random text generation.
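A narrative like the one above can be composed from existing records instead of being typed as free text. This is a minimal sketch: the helper name `build_narrative` is hypothetical, and using a plain `PMT` prefix for full payments is an assumption, since the article only shows the partial-payment form.

```python
def build_narrative(customer, invoice, partial=False):
    """Compose a bank-statement narrative from records that already exist."""
    prefix = "PART PMT" if partial else "PMT"  # "PMT" for full payments is an assumption
    return f"{prefix} {customer['legal_name']} {invoice['invoice_number']}"


customer = {"customer_id": "CUS-00002", "legal_name": "ALPHABRIDGE SOLUTIONS"}
invoice = {"invoice_number": "MFG-INV-000157", "customer_id": "CUS-00002"}

print(build_narrative(customer, invoice, partial=True))
# PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157
```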


Why Relationships Matter

Suppose a bank transaction references:

MFG-INV-000157

That invoice should always resolve to:

Customer
↓

Contract
↓

Invoice

Otherwise:

  • Entity Resolution cannot be evaluated.
  • Reconciliation cannot be validated.
  • Ground truth disappears.

Synthetic data must preserve referential integrity.
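Referential integrity is easy to check mechanically after generation. The sketch below assumes each table is a list of dictionaries keyed as in the earlier examples; the function name and the `transaction_id` field are hypothetical.

```python
def find_broken_references(customers, contracts, invoices, transactions):
    """Return (record_type, record_id, problem) for every dangling reference."""
    customer_ids = {c["customer_id"] for c in customers}
    contract_ids = {c["contract_id"] for c in contracts}
    invoice_numbers = {i["invoice_number"] for i in invoices}

    problems = []
    for contract in contracts:
        if contract["customer_id"] not in customer_ids:
            problems.append(("contract", contract["contract_id"], "unknown customer"))
    for invoice in invoices:
        if invoice["contract_id"] not in contract_ids:
            problems.append(("invoice", invoice["invoice_number"], "unknown contract"))
    for txn in transactions:
        if txn["invoice_number"] not in invoice_numbers:
            problems.append(("transaction", txn["transaction_id"], "unknown invoice"))
    return problems


customers = [{"customer_id": "CUS-00002"}]
contracts = [{"contract_id": "CNT-2024-587", "customer_id": "CUS-00002"}]
invoices = [{"invoice_number": "MFG-INV-000157", "contract_id": "CNT-2024-587"}]
transactions = [
    {"transaction_id": "TXN-1", "invoice_number": "MFG-INV-000157"},
    {"transaction_id": "TXN-2", "invoice_number": "MFG-INV-999999"},  # deliberately broken
]

problems = find_broken_references(customers, contracts, invoices, transactions)
print(problems)
# [('transaction', 'TXN-2', 'unknown invoice')]
```

Running a check like this at the end of every generation run catches broken links before they reach a training set.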


Building Ground Truth

One advantage of synthetic data is complete control.

Every generated transaction already knows:

  • which customer owns it,
  • which contract created it,
  • which invoice it references,
  • whether it is a partial payment,
  • whether reconciliation should succeed.

This hidden knowledge becomes ground truth.

Ground truth enables benchmarking.

Instead of asking:

"Did the model perform well?"

we can ask:

"Did the model recover the correct business relationship?"

This is a much stronger evaluation.
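To illustrate, the sketch below stores the ground truth next to each narrative and scores a predictor on whether it recovers the right invoice and customer. Everything here is illustrative: the record layout, the second invoice `MFG-INV-000158`, and the deliberately naive `naive_predict` baseline are assumptions, not part of the original project.

```python
# Each record keeps its answer key next to the text a model will actually see.
records = [
    {
        "narrative": "PART PMT ALPHABRIDGE SOLUTIONS MFG-INV-000157",
        "ground_truth": {
            "customer_id": "CUS-00002",
            "contract_id": "CNT-2024-587",
            "invoice_number": "MFG-INV-000157",
            "is_partial_payment": True,
            "should_reconcile": True,
        },
    },
    {
        "narrative": "PMT ALPHABRIDGE LTD MFGINV000158",  # noisy name and invoice format
        "ground_truth": {
            "customer_id": "CUS-00002",
            "contract_id": "CNT-2024-587",
            "invoice_number": "MFG-INV-000158",  # hypothetical second invoice
            "is_partial_payment": False,
            "should_reconcile": True,
        },
    },
]

INVOICE_TO_CUSTOMER = {"MFG-INV-000157": "CUS-00002", "MFG-INV-000158": "CUS-00002"}


def naive_predict(narrative):
    """Baseline: treat the last token as the invoice number and look up its customer."""
    token = narrative.split()[-1]
    return {"invoice_number": token, "customer_id": INVOICE_TO_CUSTOMER.get(token)}


def relationship_accuracy(records, predict):
    """Share of records where both the invoice and the customer are recovered."""
    if not records:
        return 0.0
    correct = 0
    for record in records:
        truth = record["ground_truth"]
        predicted = predict(record["narrative"])
        if (predicted.get("invoice_number") == truth["invoice_number"]
                and predicted.get("customer_id") == truth["customer_id"]):
            correct += 1
    return correct / len(records)


score = relationship_accuracy(records, naive_predict)
print(score)  # 0.5: the baseline fails on the noisy second narrative
```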


Simulating Real-World Noise

Real enterprise data is messy.

Invoice numbers are not always written consistently.

Examples:

INV-001
INV001
INV 001
INVOICE-001

Customer names vary as well:

ALPHABRIDGE SOLUTIONS
ALPHABRIDGE LTD
ALPHA BRIDGE
ABS

Synthetic datasets should deliberately include this variability.

Otherwise, models learn to handle perfectly formatted data and fail on realistic data.

The goal is not to make the dataset clean.

The goal is to make it believable.
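Noise of this kind can be injected with a few small, seeded functions. The sketch below reproduces the four invoice styles and the name variants listed above; the function names and the way aliases are passed in are assumptions for illustration.

```python
import random


def noisy_invoice_number(invoice_number, rng):
    """Rewrite an invoice number such as 'INV-001' in a randomly chosen style."""
    prefix, _, digits = invoice_number.rpartition("-")
    if not prefix:  # no dash to vary, leave the value unchanged
        return invoice_number
    style = rng.choice(["dash", "none", "space", "long"])
    if style == "none":
        return f"{prefix}{digits}"        # INV001
    if style == "space":
        return f"{prefix} {digits}"       # INV 001
    if style == "long":
        return f"{prefix.replace('INV', 'INVOICE')}-{digits}"  # INVOICE-001
    return f"{prefix}-{digits}"           # INV-001


def noisy_customer_name(legal_name, aliases, rng):
    """Pick the legal name or one of its known variants."""
    return rng.choice([legal_name, *aliases])


rng = random.Random(7)  # fixed seed so the noisy dataset is reproducible
aliases = ["ALPHABRIDGE LTD", "ALPHA BRIDGE", "ABS"]
samples = [
    (noisy_customer_name("ALPHABRIDGE SOLUTIONS", aliases, rng),
     noisy_invoice_number("INV-001", rng))
    for _ in range(5)
]
```

Keep the clean value in the ground truth and write only the noisy variant into the narrative, so the answer key stays exact while the input stays messy.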


Balancing Entity Distribution

Another common mistake is label imbalance.

Imagine a dataset containing:

Invoice Labels        : 50,000
Contract Labels       : 35
Purchase Order Labels : 40

A transformer trained on this dataset will naturally learn to recognize invoices far better than contracts or purchase orders.

The issue is not the model.

It is the dataset.

Balanced entity distribution improves learning quality and produces more reliable evaluation metrics.

Synthetic generation should therefore control not only volume, but also diversity.
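A simple way to act on this is to count labels after each generation run and compute how many more examples each entity type needs. The sketch below uses the counts from the example above; the label names, the `generation_quota` helper, and the choice to top up every label to the same target are assumptions, and topping up is only one possible balancing strategy.

```python
from collections import Counter


def imbalance_ratio(label_counts):
    """Ratio between the most and least frequent label."""
    return max(label_counts.values()) / min(label_counts.values())


def generation_quota(label_counts, target_per_label):
    """How many extra examples to generate per label to reach the target."""
    return {
        label: max(0, target_per_label - count)
        for label, count in label_counts.items()
    }


counts = Counter({"INVOICE": 50_000, "CONTRACT": 35, "PURCHASE_ORDER": 40})

quota = generation_quota(counts, target_per_label=50_000)
print(quota)
# {'INVOICE': 0, 'CONTRACT': 49965, 'PURCHASE_ORDER': 49960}
```

The extra examples should come from new contracts and purchase orders in varied narratives, not from repeating the same 35 contracts, since repetition adds volume without adding diversity.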


Why Synthetic Data Enables Better AI

Once relationships exist, a single synthetic dataset can support multiple AI tasks.

For example:

Named Entity Recognition

Extract:

  • Customer
  • Invoice
  • Contract
  • Purchase Order

Entity Resolution

Resolve:

ALPHABRIDGE

↓

CUS-00002

Reconciliation

Determine whether a payment correctly settles an invoice.
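As a minimal sketch of that decision, the function below compares the sum of payments against the invoice amount. The status names and the tolerance of 0.01 are assumptions for illustration; a production reconciliation would also need to handle currency, fees, and timing.

```python
from decimal import Decimal


def reconcile(invoice_amount, payments, tolerance=Decimal("0.01")):
    """Classify an invoice by comparing the total paid with the amount due."""
    paid = sum(payments, Decimal("0"))
    difference = invoice_amount - paid
    if abs(difference) <= tolerance:
        return "settled"
    if difference > 0:
        return "partial" if paid > 0 else "unpaid"
    return "overpaid"


amount_due = Decimal("3979.85")

print(reconcile(amount_due, [Decimal("2000.00")]))
# partial
print(reconcile(amount_due, [Decimal("2000.00"), Decimal("1979.85")]))
# settled
```

`Decimal` is used instead of `float` so that sums of monetary amounts compare exactly. Because the generator already knows whether each payment was partial, the function's output can be checked directly against the ground truth.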


Agentic Workflows

Trigger downstream actions:

  • approve,
  • escalate,
  • notify,
  • reconcile,
  • update ERP.

The same dataset becomes reusable across multiple machine learning tasks.


Lessons Learned

After I generated hundreds of thousands of synthetic enterprise transactions, one lesson became obvious.

Volume alone is meaningless.

Relationships matter.

Business logic matters.

Ground truth matters.

If your synthetic dataset behaves like a real business, your AI system learns to solve real business problems.

If your synthetic dataset behaves like random CSV files, your AI system learns randomness.


Conclusion

Synthetic data is not a shortcut.

It is an engineering discipline.

Well-designed synthetic datasets preserve business logic, entity relationships, referential integrity, and realistic variability.

These characteristics make them valuable not only for machine learning but also for benchmarking, software testing, API validation, and enterprise automation.

In the next article, we'll use this synthetic dataset to build a Financial Named Entity Recognition (NER) pipeline capable of understanding enterprise bank transaction narratives and transforming them into structured business knowledge.


Next Article

Part 3 — Building a Financial Named Entity Recognition Pipeline Using Doccano and IndoBERT

We'll cover:

  • Designing a business taxonomy
  • Automatic pre-labeling
  • Annotation guidelines
  • Doccano workflow
  • BIO tagging
  • Fine-tuning IndoBERT
  • Evaluating precision, recall, and F1-score
  • Preparing data for entity resolution
