rag · May 5, 2026 · 3 min read

When RAG Hurts: Malware Explanation as Signal Extraction

CAI Technology · gelusi · Last reviewed: 5/5/2026

Three in four malware reports a junior analyst opens are noisier after retrieval than before. That is the awkward finding behind a recent empirical study on Retrieval-Augmented Generation for malware explanation, which tested whether feeding LLMs more context from VirusTotal-style reports improves the analyst-facing summary. It usually does not.

The authors evaluated several open and closed LLMs against structured VirusTotal inputs, then injected retrieved context drawn from public threat corpora. The retrieved passages frequently introduced weak associations — wrong family attributions, irrelevant CVEs, or stale campaign names — that the model dutifully wove into its explanation. The conclusion: malware triage is a signal-extraction task, not a knowledge-retrieval task.

Why retrieval degrades a triage prompt

Most RAG pipelines were designed for question answering over a curated corpus. They reward recall. Malware reports are the opposite shape: a single import table, one suspicious mutex, one packer signature. The signal lives in dense structured fields, and the noise lives in the long tail of “related” prose. When a retriever pulls that prose into the window, it dilutes the prompt’s evidence-to-text ratio and the model paraphrases the dilution.
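One rough way to make the dilution visible is to measure how much of the prompt is structured evidence versus retrieved prose. The sketch below is an illustration only: `evidence_ratio` is a hypothetical helper, and counting whitespace-separated words is an assumed proxy, not a metric taken from the study.

```python
def evidence_ratio(structured_fields: list[str], retrieved_passages: list[str]) -> float:
    """Share of prompt words that come from structured report fields.

    Hypothetical proxy: word counts stand in for tokens, and every
    structured field is treated as evidence of equal weight.
    """
    evidence = sum(len(field.split()) for field in structured_fields)
    context = sum(len(passage.split()) for passage in retrieved_passages)
    total = evidence + context
    return evidence / total if total else 0.0


fields = ["imports: VirtualAllocEx WriteProcessMemory", "packer: UPX"]
passages = ["related campaign prose " * 5]  # 15 words of retrieved text

print(evidence_ratio(fields, []))        # structured fields only
print(evidence_ratio(fields, passages))  # same fields plus retrieved prose
```

Tracking a number like this per prompt is a cheap way to spot when a retriever has started to outweigh the report it was meant to support.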

ENISA’s 2024 threat landscape report documents a sharp rise in commodity loaders that share strings across families (ENISA Threat Landscape 2024). A retriever keyed on those strings will return half a dozen unrelated families. NIST SP 800-83 Rev. 1 already warns that string-only attribution is unreliable (NIST SP 800-83 Rev. 1). The study reports an average F1 drop of 0.12 once retrieved passages enter the prompt window, with worst-case drops above 0.20 on polymorphic loader families.

triage_pipeline:
  input_priority:
    virustotal_behaviour: 1.0
    imports_table: 0.9
    yara_match: 0.8
  rag_context:
    enabled: false        # disabled after F1 fell 0.18
    fallback: structured_lookup
  notes: "see /rag/when-not-to-retrieve/"

flowchart TD
    A[VirusTotal report ingested] --> B{Structured fields complete?}
    B -->|yes| C[Direct LLM summary]
    B -->|no| D[Targeted lookup: family, CVE, packer]
    C --> E[Analyst review in 90s]
    D --> F[RAG over whitelisted vendor docs only]
    F --> G[LLM summary with citation IDs]
    G --> E
    classDef good fill:#dcfce7,stroke:#10b981
    classDef bad fill:#fee2e2,stroke:#ef4444
    class C,E,G good
    class D,F bad
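The first decision in the flowchart can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the field names are borrowed from the config above, "complete" is taken to mean "every field is present and non-empty", and `route` is a hypothetical function name rather than part of any real pipeline.

```python
# Field names borrowed from the triage_pipeline config; treating them as
# the completeness criterion is an assumption made for this sketch.
REQUIRED_FIELDS = ("virustotal_behaviour", "imports_table", "yara_match")


def route(report: dict) -> tuple[str, list[str]]:
    """Pick a path for one report and list the fields that forced the choice."""
    missing = [name for name in REQUIRED_FIELDS if not report.get(name)]
    if not missing:
        # All structured evidence is present: summarise directly, no retrieval.
        return "direct_summary", []
    # Something is absent: look up only the gaps before any retrieval step.
    return "targeted_lookup", missing


complete = {
    "virustotal_behaviour": ["creates mutex"],
    "imports_table": ["VirtualAllocEx"],
    "yara_match": ["upx_packed"],
}
partial = {"virustotal_behaviour": ["creates mutex"], "imports_table": []}

print(route(complete))
print(route(partial))
```

Returning the missing fields alongside the route keeps the follow-up lookup narrow: only the gaps are queried, not the whole report.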

What we run instead

At CAI Technology we treat the VirusTotal record as the prompt, not as a query against a corpus. Our grounded generation pattern for SOC outputs keeps the retriever out of the loop until the analyst explicitly asks for context expansion. The NIST AI Risk Management Framework codifies this trade-off as a context integrity control (NIST AI RMF), and VirusTotal’s own field guidance ranks behaviour fields above community comments for attribution (VirusTotal API reference).

The deeper point: retrieval quality dominates retrieval volume once the corpus has any cross-family overlap. A second-stage filter that scores chunks against the structured input, not against the natural-language query, recovers most of the lost F1. We see the same dynamic in explainability for malware triage: smaller evidence, tighter narrative, fewer retracted claims.
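A second-stage filter of this kind might look like the sketch below. It is an illustration under stated assumptions, not the study's method: the scoring rule (the fraction of the report's indicators that a chunk mentions verbatim), the `min_score` threshold, and all function names are hypothetical, and case-insensitive substring matching stands in for whatever matcher a production system would use.

```python
def indicators(report: dict) -> set[str]:
    """Flatten structured report fields into a lowercase set of indicator strings."""
    found: set[str] = set()
    for value in report.values():
        items = value if isinstance(value, (list, tuple, set)) else [value]
        found.update(str(item).lower() for item in items if item)
    return found


def score_chunk(chunk: str, inds: set[str]) -> float:
    """Fraction of the report's indicators that appear verbatim in the chunk."""
    if not inds:
        return 0.0
    text = chunk.lower()
    return sum(1 for ind in inds if ind in text) / len(inds)


def filter_chunks(chunks: list[str], report: dict, min_score: float = 0.5) -> list[str]:
    """Keep chunks scoring at least min_score, best first."""
    inds = indicators(report)
    scored = [(score_chunk(chunk, inds), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored if score >= min_score]


report = {"imports": ["VirtualAllocEx", "WriteProcessMemory"], "packer": "UPX"}
chunks = [
    "An older campaign shipped UPX-packed loaders to retail targets.",
    "The sample is UPX packed and calls VirtualAllocEx, then WriteProcessMemory.",
]
print(filter_chunks(chunks, report))
```

The first chunk shares only the packer name with the report and falls below the threshold. That is the cross-family overlap case: a string match alone is not enough to earn a place in the prompt.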

If your SOC currently routes every alert through the same retriever, switch it off for malware-class events first and measure the F1 delta over one sprint. Walk through it with us in our RAG audit playbook.
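Measuring that delta needs nothing more than analyst-verified labels and the two sets of model outputs. The sketch below assumes each explanation has been reduced to a set of claims (family, packer, behaviours) and scores them with micro-averaged F1. The claim representation and the function names are assumptions for illustration; the study's exact scoring protocol is not reproduced here.

```python
def micro_f1(predicted: list[set[str]], truth: list[set[str]]) -> float:
    """Micro-averaged F1 over per-report claim sets."""
    tp = sum(len(p & t) for p, t in zip(predicted, truth))  # claims confirmed
    fp = sum(len(p - t) for p, t in zip(predicted, truth))  # claims not in truth
    fn = sum(len(t - p) for p, t in zip(predicted, truth))  # truths not claimed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_delta(with_rag: list[set[str]], without_rag: list[set[str]],
             truth: list[set[str]]) -> float:
    """Negative values mean retrieval made the explanations worse."""
    return micro_f1(with_rag, truth) - micro_f1(without_rag, truth)


truth = [{"upx", "process_injection"}, {"keylogger"}]
without_rag = [{"upx", "process_injection"}, {"keylogger"}]
with_rag = [{"upx", "process_injection", "emotet"}, {"keylogger", "cve-2017-0144"}]

print(f1_delta(with_rag, without_rag, truth))
```

In this toy data the retrieved context adds one wrong family and one irrelevant CVE, so precision falls while recall is unchanged, and the delta comes out negative.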

Read further

Embedding strategy for IOC corpora
When not to retrieve
LLM-assisted incident narratives
