Reproducible evidence that retrieval-aware prompt design — not Q&A conversion in itself — is what makes synthetic Q&A beat raw markdown in RAG.
| KB | Source | Accuracy (n=3) |
|---|---|---|
| mdx_direct | Raw markdown | 72% |
| naive_facts | Q&A from a generic prompt | 75% |
| best_facts | Q&A from a retrieval-aware prompt | 92% |
Three knowledge bases, same source document, same embedding model, same chatbot LLM, same questions, three independent runs. Naive Q&A conversion gives you +3 pt over raw markdown — but that's run variance, not a retrieval gain: the genuine fixes and breaks cancel exactly (Q2/Q6 fixed, Q4/Q12 broken), and the residual +3 pt traces to a single question where the raw-markdown baseline happened to wobble on one run (see results/summary.md). The real gain (+17 pt over naive, +20 pt over raw markdown) comes from the prompt design (chiefly Rules 4–5). Documented in prompt_engineering.md.
Full numbers and per-question breakdown: results/summary.md.
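To put the point gaps in perspective, the arithmetic below shows how much a single answer is worth. It is a sketch under one assumption that results/summary.md should confirm: accuracy is pooled over every (question, run) pair, i.e. 12 questions × 3 runs. The function names are illustrative and not part of this repo.

```python
# Assumption: accuracy is pooled over every (question, run) pair.
QUESTIONS = 12  # size of test_questions.md
RUNS = 3        # n=3 in the table above


def points_per_answer(questions: int = QUESTIONS, runs: int = RUNS) -> float:
    """Percentage points that one answer contributes to pooled accuracy."""
    return 100.0 / (questions * runs)


def answers_for_gap(gap_points: float, questions: int = QUESTIONS, runs: int = RUNS) -> float:
    """Number of flipped answers that a gap of `gap_points` corresponds to."""
    return gap_points / points_per_answer(questions, runs)


if __name__ == "__main__":
    print(f"one answer = {points_per_answer():.2f} pt")
    for label, gap in [("naive vs raw", 3), ("best vs naive", 17), ("best vs raw", 20)]:
        print(f"{label}: ~{answers_for_gap(gap):.1f} answers out of {QUESTIONS * RUNS}")
```

Under that assumption one answer is worth about 2.8 points, so a +3 pt difference is roughly one answer in one run, which is why it reads as run variance rather than a retrieval gain.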
To produce one line of JSONL per Q&A fact from any document, this repo uses a two-stage StructFlow pipeline:
document ─► Stage 1 (segmenter) ─► {sections: [...]}
Splits the document into self-contained sections.
Flatten sections ─► one section per JSONL line
(Stage 1 output expanded for Stage 2 input)
Stage 2 (extractor) ─► {facts: [...]} per section
Creates Q&A pairs from each section.
Flatten facts ─► one fact per JSONL line
(Stage 2 output expanded for the final KB)
The "accordion" name comes from the shape: 1 doc → N sections → M facts, with a flatten step after each Stage to expand array outputs into line-per-record JSONL (sections.jsonl for Stage 2's input; facts.jsonl for the final KB). Stage 1 and Stage 2 are both StructFlow jobs.
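The flatten step can be sketched in a few lines of Python. This is a minimal illustration rather than the code in workflows/dify-accordion.yml; only the `sections` and `facts` array keys come from the diagram above, and the inner field names (`title`, `text`, `question`, `answer`) are hypothetical.

```python
import json
from typing import Iterable, Iterator


def flatten(records: Iterable[dict], array_key: str) -> Iterator[str]:
    """Yield one JSON line per element of record[array_key]."""
    for record in records:
        for item in record.get(array_key, []):
            yield json.dumps(item, ensure_ascii=False)


# Hypothetical Stage 1 output: one object holding an array of sections.
stage1_output = {"sections": [
    {"title": "Overview", "text": "..."},
    {"title": "Authentication", "text": "..."},
]}
sections_jsonl = list(flatten([stage1_output], "sections"))

# Hypothetical Stage 2 outputs: one {facts: [...]} object per section.
stage2_outputs = [
    {"facts": [{"question": "Q1?", "answer": "A1"}, {"question": "Q2?", "answer": "A2"}]},
    {"facts": [{"question": "Q3?", "answer": "A3"}]},
]
facts_jsonl = list(flatten(stage2_outputs, "facts"))

print(len(sections_jsonl), "sections ->", len(facts_jsonl), "facts")
```

The same function serves both stages because the only thing that changes is the array key being expanded.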
The mechanism is format-agnostic — Stage 1's segmentation rule is the only thing that needs to know your input. This demo segments markdown by ## / ### headings, but the same pattern works on HTML sections, PDF chapters, DOCX heading styles, or any structured text. For PDF/DOCX/XLSX/PPTX, pair it with LDX hub's ExtractDoc to extract clean text first, then feed that into Stage 1.
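For intuition, the heading-based segmentation rule can be approximated locally with a regular expression. This is not the Stage 1 StructFlow job (that is driven by prompts/stage1_segmenter.md); it is a deterministic sketch, and the output field names (`level`, `title`, `body`) are assumptions.

```python
import re

HEADING = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")  # matches "## " and "### " headings only
FENCE = "`" * 3  # code-fence marker, built this way to keep the example fence-safe


def segment_markdown(text: str) -> list[dict]:
    """Split markdown into sections at ## / ### headings, ignoring fenced code."""
    sections = []
    current = None
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            current = {"level": len(match.group(1)), "title": match.group(2), "lines": []}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [
        {"level": s["level"], "title": s["title"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Text before the first heading is dropped in this sketch, and deeper headings (`####` and below) stay inside their parent section. A real segmenter also has to make each section self-contained, which is the part the LLM prompt handles.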
The prompts are what matter. See prompts/ for the four prompts used (Stage 1 segmenter, Stage 2 best, Stage 2 naive, and the chatbot system prompt that drives answer generation against the attached KB).
What separates best_facts (92%) from naive_facts (75%):
1. Self-contained answers: each fact stands alone, with no cross-references
2. Developer-friendly question phrasing: `How do I...?`, `What is the default value of...?`
3. Exact preservation of technical identifiers: API names, endpoint paths, parameter names, and enum values stay verbatim
4. Service-specific facts for cross-category information: when a section discusses statuses, errors, or behaviors tied to a specific service, generate a service-scoped fact with the service name in both question and answer (this is what resolves Q12; Q4 is resolved by Rule 5)
5. Deliberate keyword design: 3–7 short terms per fact (service names, parameters, concepts)
In this validation, Rules 4 and 5 produced the entire +17 pt gain over naive (Q12 and Q4 respectively); Rules 1–3 measured zero net contribution because the Stage 2 model already satisfied them at temperature 0. They are robustness insurance for when the model, temperature, or input changes — see prompt_engineering.md for the full attribution.
Each rule is explained, with examples and the failure mode it prevents, in prompt_engineering.md.
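Two of the rules lend themselves to a mechanical check on the generated facts. The sketch below is not part of this repo: it assumes each fact is a JSON object with `question`, `answer`, and `keywords` fields (a hypothetical schema), and the cross-reference phrase list is a rough heuristic.

```python
import re

# Heuristic phrases that suggest an answer leans on surrounding text.
CROSS_REFERENCE = re.compile(
    r"\b(see above|see below|as mentioned|described earlier|the previous section)\b",
    re.IGNORECASE,
)


def lint_fact(fact: dict) -> list[str]:
    """Return a list of rule violations for one fact (empty list means clean)."""
    problems = []
    keywords = fact.get("keywords", [])
    if not 3 <= len(keywords) <= 7:
        problems.append(f"keywords: expected 3-7 terms, got {len(keywords)}")
    if CROSS_REFERENCE.search(fact.get("answer", "")):
        problems.append("answer: contains a cross-reference, so it is not self-contained")
    return problems


# Illustrative facts about a made-up service, not taken from the dataset.
good_fact = {
    "question": "What is the default value of the timeout parameter in ExampleService?",
    "answer": "In ExampleService, the timeout parameter defaults to 30 seconds.",
    "keywords": ["ExampleService", "timeout", "default value"],
}
bad_fact = {
    "question": "What is the default?",
    "answer": "As mentioned, it is 30 seconds.",
    "keywords": ["default"],
}

print(lint_fact(good_fact))
print(lint_fact(bad_fact))
```

A check like this only catches the mechanical part of the rules; whether a question is phrased the way a developer would ask it still needs a human or model review.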
- Document AI: LDX hub (StructFlow — accordion implementation)
- RAG platform: Dify Cloud
- Vector storage: TiDB Cloud Starter (PingCAP case study)
- Embedding: OpenAI `text-embedding-3-large`
- Retrieval: Hybrid Search (Weighted Score, 0.7 semantic / 0.3 keyword)
- LLMs:
  - Chatbot: OpenAI GPT-5.5 (per-query, premium quality)
  - Stage 2 Q&A generation: Google Gemini 3.5 Flash (batch, cost-efficient)
Cross-vendor by design — LDX hub treats LLMs as swappable, so picking the right model per phase is the boring default.
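To see why keyword design matters under the 0.7 / 0.3 retrieval weighting, here is a sketch of a weighted hybrid score. It assumes a plain linear blend of two scores already normalized to [0, 1]; Dify computes its ranking internally and may normalize differently, so treat the function and the sample scores as illustrative only.

```python
def weighted_score(
    semantic: float,
    keyword: float,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> float:
    """Linear blend of a semantic score and a keyword score (both assumed in [0, 1])."""
    return semantic_weight * semantic + keyword_weight * keyword


def rank(candidates: list[dict]) -> list[dict]:
    """Sort candidate chunks by blended score, best first."""
    return sorted(
        candidates,
        key=lambda c: weighted_score(c["semantic"], c["keyword"]),
        reverse=True,
    )


# Made-up scores: chunk A is slightly closer semantically, chunk B matches the query terms.
candidates = [
    {"id": "A", "semantic": 0.80, "keyword": 0.10},
    {"id": "B", "semantic": 0.70, "keyword": 0.90},
]

for chunk in rank(candidates):
    print(chunk["id"], round(weighted_score(chunk["semantic"], chunk["keyword"]), 2))
```

With these sample numbers the keyword-rich chunk B outranks A despite a lower semantic score, which is the effect a fact's keyword terms are meant to produce.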
.
├── README.md ← you are here
├── prompt_engineering.md ← the five rules, with examples
├── test_questions.md ← 12 validation questions
├── prompts/
│ ├── stage1_segmenter.md ← document segmentation prompt
│ ├── stage2_best.md ← retrieval-aware Q&A prompt
│ ├── stage2_naive.md ← generic Q&A prompt (baseline)
│ └── chatbot_system.md ← chatbot system prompt (answer generation)
├── data/
│ ├── test_en_full.mdx ← source document (LDX hub portal intro + API ref; internal links like `/signup` and `/api` are preserved as-is from the portal and do not resolve inside this repository)
│ ├── sections.jsonl ← Stage 1 output, flattened (53 sections, Stage 2 input format)
│ ├── facts_best.txt ← 82 Q&A facts from stage2_best.md
│ └── facts_naive.txt ← 78 Q&A facts from stage2_naive.md
├── workflows/
│ └── dify-accordion.yml ← Dify Workflow template (Stage 1 + Stage 2 + flatten)
└── results/
    ├── summary.md ← final aggregate, key findings
    ├── mdx_direct/ ← raw-markdown KB runs 1–3
    ├── best_facts/ ← best-prompt KB runs 1–3
    └── naive_facts/ ← naive-prompt KB runs 1–3
- Generate the facts: import `workflows/dify-accordion.yml` into Dify and feed `data/test_en_full.mdx` as input. The workflow ships with `prompts/stage2_best.md` already baked into the Stage 2 system prompt; to reproduce the naive baseline, swap it for `prompts/stage2_naive.md` before running. The output is JSONL; rename it to `.txt` for Dify Knowledge upload.
- Build the three KBs: in Dify Cloud, create one Knowledge Base per source (`test_en_full.mdx`, `facts_best.txt`, `facts_naive.txt`) using the chunking and retrieval settings in results/summary.md.
- Run the questions: feed the 12 questions from test_questions.md into a Chatbot app, switching the attached KB between runs. The per-question pattern should match results/summary.md.
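Before renaming the workflow output for upload, it can help to confirm that every line is valid JSON. The helper below is a hypothetical convenience, not a script shipped in this repo; the file names in the usage note are examples.

```python
import json
from pathlib import Path


def jsonl_to_txt(src: Path, dst: Path) -> int:
    """Validate each non-empty line of src as JSON, write them to dst, return the count."""
    lines = [ln for ln in src.read_text(encoding="utf-8").splitlines() if ln.strip()]
    for record_number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"{src}: record {record_number} is not valid JSON ({exc.msg})") from exc
    dst.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `jsonl_to_txt(Path("facts.jsonl"), Path("data/facts_best.txt"))` returns the number of facts written, which can be compared against the counts listed in the file tree.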
LDX hub is a document AI gateway that exposes five services through a single API: StructFlow (structured generation), RefineLoop (XLIFF translation refinement), RenderOCR (OCR conversion with layout), CastDoc (PDF-to-Office without OCR), and ExtractDoc (plain-text extraction). The accordion pattern in this repo is one StructFlow use case among many.
If you want to skip building the workflow yourself, the LDX hub Dify plugin and n8n nodes cover StructFlow as a one-step block. MCP access is also available for use from Claude Desktop and other MCP clients.
The Dify workflow (workflows/dify-accordion.yml) uses two community plugins, both installed automatically when the workflow is imported into Dify:
- ldxhub-io/ldxhub: StructFlow tool. Source: `ldxhub-io/dify-nodes-ldxhub` (MIT).
- kurokobo/file_tools: converts the in-memory string output of each `Code` node into a Dify `File` object that the next StructFlow node can consume (Apache 2.0).
MIT.