On-premise SIEM with a local LLM: AI incident analysis without breaking confidentiality
Datadog and Splunk Cloud are cloud SIEMs: your logs leave your network. For regulated entities, that is often not an option. An open-source stack plus a local LLM delivers enterprise-grade SIEM at a fraction of the cost.
Datadog. Splunk Cloud. Sumo Logic. All excellent SIEMs. All share one feature that is prohibitive if you work in a public institution, a bank, a regulated law firm, or a NIS2 critical operator: your logs (every single one, including leaked credentials, SQL queries with PII, internal traffic, and authentication headers) physically leave your network for the vendor's cloud infrastructure, often in the US.
For a NIS2 operator, that is a compliance problem. For a lawyer, it is a professional secrecy risk. For a public institution, it is a political decision.
This article describes AEGIS, our open-source on-premise SIEM platform with AI analysis through a locally-run LLM. We show the architecture, the log → AI triage → operator flow, and why the cost is roughly one-fifth to one-tenth of a comparable cloud SIEM.
TL;DR
- Cloud SIEMs (Datadog, Splunk, Sumo) require sending your logs to the vendor. For regulated entities = a problem.
- AEGIS uses a modern open-source stack (log aggregation + search engine + metrics + IDS + HIDS) plus a locally-run open-source LLM for incident triage.
- AI does first-line triage; the operator makes every irreversible decision (block IP, isolate host, page on-call).
- Active Response through Wazuh can auto-block attacks at iptables level, with a complete audit trail.
The problem with cloud SIEMs
A modern SIEM (Security Information and Event Management) does three main things:
- Aggregates logs from all sources (servers, applications, firewalls, IDS).
- Indexes them for full-text search and temporal correlation.
- Applies detection rules and alerts on suspicious patterns.
Cloud stacks excel technically. They have good UX, automatic scaling, and ship with thousands of pre-built rules. For a tech startup with no sensitive data, they are the optimal choice.
For a regulated client, the problem comes in two forms:
Schrems II. If the vendor is American (Datadog HQ New York, Splunk HQ San Francisco), shipping logs to them falls under the regime of personal-data transfers to the US. Logs invariably contain personal data: IPs, usernames, queries with PII, session cookies. See our Schrems II analysis for the EDPB rules in detail.
Nonlinear cost. Cloud SIEMs charge by ingested log volume. For an active datacenter — a few thousand servers, thousands of containers — you quickly reach 50,000-200,000 EUR/year. And if you try to economise by reducing log volume, you also reduce detection capability.
AEGIS architecture
AEGIS is a unified observability and SecOps platform for an on-premise datacenter. All components run on the customer's infrastructure; no log byte leaves the network without an explicit decision.
┌──────────────────────────────────────────────────────────────────┐
│                       AEGIS STACK OVERVIEW                       │
│                                                                  │
│  COLLECT         AGGREGATE      METRICS          AI TRIAGE       │
│  Fluent Bit      Graylog +      Prometheus +     FastAPI +       │
│  (forward,       OpenSearch +   Alertmanager +   open-source     │
│  syslog, HTTP)   MongoDB        Grafana +        LLM run         │
│                                 Exporters        locally         │
│                                                  (Qwen3-class)   │
│     ▼               ▼              ▼                ▼            │
│     └───────────────┴──────────────┴────────────────┘            │
│                                                                  │
│  SECURITY                              ACTIVE RESPONSE           │
│  Suricata (IDS) + Zeek (network) +     Wazuh AR                  │
│  Wazuh (HIDS) + Falco (runtime)        (auto-block iptables,     │
│                                        quarantine container)     │
└──────────────────────────────────────────────────────────────────┘
Each component is mature open-source, used in production by large organisations. AEGIS hides the integration complexity — the customer gets a coherent platform, not nine separate products to manage.
The layers
Collection. Fluent Bit is the agent on each host. Very light (~5 MB RAM), supports forward, syslog, HTTP, JSON and plain log parsing. Forwards everything to the aggregation layer.
Log aggregation. Graylog + OpenSearch + MongoDB. Graylog manages stream routing, extraction rule definition, dashboards. OpenSearch handles full-text indexing. MongoDB stores configuration metadata. Stable stack, retention configurable on local storage.
Metrics. Prometheus for time-series metrics. Alertmanager for alert routing. Grafana for visualisation. Exporters for nodes (CPU/RAM/disk), containers, SNMP for switches.
Security detection. Suricata on SPAN ports for network IDS. Zeek for deep traffic analysis (DNS queries, TLS certificates, suspicious flows). Wazuh as HIDS on each host: file integrity monitoring, log analysis, vulnerability scanning. Falco for container runtime security (for example, detecting exec commands inside production containers).
AI triage pipeline. This is the differentiator. A FastAPI service receives alerts from Alertmanager, Suricata, and Wazuh. It aggregates them with full context (who, what, when, relevant logs from the last N minutes) and sends that context to an open-source LLM running locally. The LLM produces a triage: estimated severity, what the alert means, recommended action. The output reaches the operator on a dashboard plus a webhook (Telegram, self-hosted Slack, email).
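To make the "relevant logs from the last N minutes" step concrete, here is a minimal sketch of how a triage service might select context for one alert. The event shape (dicts with `src_ip` and `timestamp` keys), the `relevant_logs` helper and the 30-minute default are assumptions for illustration, not the actual AEGIS code.

```python
from datetime import datetime, timedelta
from typing import Any


def relevant_logs(
    events: list[dict[str, Any]],
    alert_time: datetime,
    src_ip: str,
    minutes: int = 30,  # assumed default for "last N minutes"
) -> list[dict[str, Any]]:
    """Return events involving `src_ip` from the `minutes` before the alert, oldest first."""
    window_start = alert_time - timedelta(minutes=minutes)
    selected = [
        e for e in events
        if e.get("src_ip") == src_ip and window_start <= e["timestamp"] <= alert_time
    ]
    return sorted(selected, key=lambda e: e["timestamp"])
```

In a real deployment this filter would more likely be a query against the log store than an in-memory scan; the sketch only shows the selection logic.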
End-to-end flow of an incident
Let us follow an SSH brute-force attack step by step, as it appears in AEGIS:
T+0s — Attacker probes. IP 198.51.100.42 tries to log in as root@server-vm-50 with a wrong password. SSHd logs Failed password for root from 198.51.100.42.
T+1s — Fluent Bit collects. The Fluent Bit agent on the host parses the log, serialises it as JSON with metadata (host, timestamp, severity), forwards to the aggregation layer.
T+2s — Graylog routes. The ssh-failures stream has an extraction rule that detects the “Failed password” pattern and extracts the attacker IP.
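As a rough illustration of the collection and extraction steps (T+1s and T+2s), the sketch below parses an sshd line and serialises it as JSON with metadata. The regex, the field names and the `parse_ssh_failure` helper are assumptions made for this example; they are not the actual Fluent Bit parser or Graylog extractor configuration.

```python
import json
import re
from typing import Optional

# Hypothetical pattern for OpenSSH "Failed password" lines (illustrative only).
FAILED_PASSWORD = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<src_ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def parse_ssh_failure(line: str, host: str, timestamp: str) -> Optional[str]:
    """Return a JSON event for a failed SSH login, or None if the line does not match."""
    match = FAILED_PASSWORD.search(line)
    if match is None:
        return None
    event = {
        "host": host,
        "timestamp": timestamp,
        "severity": "warning",  # assumed severity label
        "event": "ssh_failed_password",
        "user": match.group("user"),
        "src_ip": match.group("src_ip"),
    }
    return json.dumps(event, sort_keys=True)
```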
T+5s — Wazuh detects escalation. Wazuh on the host monitors auth.log. After 5 failed attempts from the same IP within 60 seconds, it triggers rule 5712 (SSH brute-force) with severity 10.
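The "5 failed attempts from the same IP within 60 seconds" condition is a sliding-window threshold. Wazuh expresses this declaratively through the `frequency` and `timeframe` attributes of a rule; the Python class below is only a sketch of the underlying idea, with hypothetical names, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class BruteForceDetector:
    """Flag a source IP after `threshold` failures within `window_s` seconds."""

    def __init__(self, threshold: int = 5, window_s: float = 60.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        # src_ip -> timestamps (seconds) of recent failures, oldest first
        self._failures: dict[str, deque] = defaultdict(deque)

    def record_failure(self, src_ip: str, ts: float) -> bool:
        """Record one failed login; return True once the threshold is reached."""
        window = self._failures[src_ip]
        window.append(ts)
        # Drop failures that have fallen out of the sliding window.
        while window and ts - window[0] > self.window_s:
            window.popleft()
        return len(window) >= self.threshold
```

Keeping one deque per source IP means memory grows with the number of distinct attackers; a production detector would also expire idle keys.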
T+5s — Active Response. Wazuh AR has a firewall-drop script configured that, on severity ≥ 10 for rule 5712, adds an iptables rule: iptables -I INPUT -s 198.51.100.42 -j DROP with a 24h timeout. The attacker is isolated. Audit trail in the Wazuh manager.
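A timed block is reversible because the unblock step is known in advance. The sketch below plans the block and its matching removal as argument lists without executing anything. In practice the Wazuh firewall-drop script and its timeout setting handle this; `TimedBlock` and `plan_block` are hypothetical names used only to show the shape of the action.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimedBlock:
    src_ip: str
    expires_at: datetime

    def block_cmd(self) -> list[str]:
        # Insert a DROP rule at the top of the INPUT chain.
        return ["iptables", "-I", "INPUT", "-s", self.src_ip, "-j", "DROP"]

    def unblock_cmd(self) -> list[str]:
        # Delete the same rule once the timeout expires.
        return ["iptables", "-D", "INPUT", "-s", self.src_ip, "-j", "DROP"]


def plan_block(src_ip: str, now: datetime,
               ttl: timedelta = timedelta(hours=24)) -> TimedBlock:
    """Validate the address and compute when the block should be lifted."""
    ipaddress.ip_address(src_ip)  # raises ValueError on malformed input
    return TimedBlock(src_ip=src_ip, expires_at=now + ttl)
```

Passing the command as a list rather than a shell string avoids shell injection if the address ever comes from untrusted log content.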
T+5s — Alert to AI triage. The same alert flows through Alertmanager into the AI pipeline. FastAPI aggregates context:
- Who: 198.51.100.42, GeoIP suggested = NL.
- What: 5 SSH brute-force attempts in 60s.
- Earlier: Suricata detected a port scan from the same IP 30 minutes ago.
- Target hosts: server-vm-50 (database server).
- Wazuh action: blocked at iptables.
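The context above maps naturally onto a small structured record that is rendered into the LLM prompt. The following is a minimal sketch under assumptions: the `TriageContext` fields, the `build_prompt` helper and the instruction wording are illustrative and do not reproduce the prompt AEGIS actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class TriageContext:
    src_ip: str
    geoip_country: str
    what: str
    earlier: list[str] = field(default_factory=list)
    target_hosts: list[str] = field(default_factory=list)
    automated_action: str = "none"


def build_prompt(ctx: TriageContext) -> str:
    """Render the structured context as plain text for the local LLM."""
    lines = [
        # Hypothetical instruction line; the real system prompt is not published.
        "You are a SOC triage assistant. Report severity, attack type, state and recommendations.",
        f"Who: {ctx.src_ip} (GeoIP: {ctx.geoip_country})",
        f"What: {ctx.what}",
        "Earlier: " + ("; ".join(ctx.earlier) or "nothing relevant"),
        "Target hosts: " + ", ".join(ctx.target_hosts),
        f"Automated action: {ctx.automated_action}",
    ]
    return "\n".join(lines)
```

Keeping the context as typed fields rather than raw log text makes it easier to cap prompt size and to audit exactly what the model was shown.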
T+8s — LLM analyses. The locally-run open-source LLM (a modern open-source model in the 30-70B parameter class, running on our internal GPU) receives the structured context and produces a short report:
Estimated severity: Medium (automated attack, contained)
Attack type: SSH brute-force after port scan, automated scanner pattern.
State: blocked at iptables level via Wazuh AR.
Recommendation:
1. Check whether the IP probed other services (consult Suricata logs).
2. Add IP to permanent block list if pattern repeats.
3. Review whether SSH root login is enabled — best practice = keys only.
T+10s — Operator sees. On the AEGIS dashboard a card appears with the original alert + AI report + automatic Wazuh action + buttons for next actions. The operator reads, decides. Most often they accept the recommendation (permanent block), but they have the freedom to escalate or mark false positive.
Why the LLM runs locally — and why it is enough
If AI does the automatic triage, why not use a frontier model like GPT-5 or Claude Opus through API?
Two reasons.
Log confidentiality. Logs contain everything described at the start — leaked credentials in error messages, internal IPs, SQL queries with PII, headers with tokens. Sending them to an external API recreates exactly the problem we are avoiding.
Cost. An active datacenter generates ~10K significant alerts/day worth AI triage. At 5K tokens average per context (alert + relevant logs + response), we are at 50M tokens/day. At frontier API rates, that is several thousand EUR/month for triage alone.
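The arithmetic behind that estimate can be checked in a few lines. The alert volume and token count come from the paragraph above; the price per million tokens is an assumption chosen for illustration, not a quoted vendor rate, and real frontier pricing varies by model and by input versus output tokens.

```python
ALERTS_PER_DAY = 10_000
TOKENS_PER_ALERT = 5_000
ASSUMED_EUR_PER_MILLION_TOKENS = 3.0  # assumption, not a quoted vendor price


def monthly_api_cost(alerts_per_day: int, tokens_per_alert: int,
                     eur_per_million: float, days: int = 30) -> float:
    """Estimated monthly API spend in EUR for alert triage."""
    tokens_per_day = alerts_per_day * tokens_per_alert
    return tokens_per_day * days / 1_000_000 * eur_per_million


tokens_per_day = ALERTS_PER_DAY * TOKENS_PER_ALERT  # 50,000,000
cost = monthly_api_cost(ALERTS_PER_DAY, TOKENS_PER_ALERT,
                        ASSUMED_EUR_PER_MILLION_TOKENS)  # 4,500.0 EUR at the assumed rate
```

At higher per-token rates the figure scales linearly, which is why the estimate is given as a range rather than a single number.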
Modern open-source models, run on a single enterprise GPU (A100/H100/H200), produce triage of sufficient quality for this task. We are not asking the LLM to solve a complex attack — we are asking it to recognise patterns (brute-force vs malformed packet vs misconfig), correlate logs, produce a summary for the operator. That is exactly the type of task a 30-70B parameter LLM excels at.
We use enterprise-class open-source models (multilingual Romanian + English), with fine-tuning on our corpus of historical incidents. Average triage latency: 2-5 seconds per alert.
Active Response — intentional limits
Wazuh Active Response can auto-act on incidents: block IP, kill process, isolate container, delete suspicious file. The temptation is to automate as much as possible.
The line we drew: AR auto-acts only on reversible actions with controlled impact.
- iptables block with 24h timeout: reversible.
- Permanent iptables block: operator decision.
- Kill suspicious process: reversible.
- Restart service: operator decision.
- Container network isolation: reversible.
- Delete file: no, never.
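One way to keep this line enforceable is to encode it as an explicit policy table that the response layer consults before acting. The sketch below is an assumption about how that could look: the action names, the `Decision` enum and the choice to route unknown actions to the operator are ours for illustration, not Wazuh configuration.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"          # reversible, Active Response may act on its own
    OPERATOR = "operator"  # a human must approve
    NEVER = "never"        # not available to automation at all


# Hypothetical action identifiers mirroring the rules described above.
POLICY = {
    "iptables_block_24h": Decision.AUTO,
    "iptables_block_permanent": Decision.OPERATOR,
    "kill_process": Decision.AUTO,
    "restart_service": Decision.OPERATOR,
    "isolate_container_network": Decision.AUTO,
    "delete_file": Decision.NEVER,
}


def decide(action: str) -> Decision:
    """Look up an action; anything not listed goes to the operator (assumed default)."""
    return POLICY.get(action, Decision.OPERATOR)
```

Defaulting unknown actions to the operator means a newly added response script cannot auto-run until someone classifies it.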
Auto-action must not take down production if it errs. If rule 5712 produces a false positive that cuts a legitimate IP for 24h, it is unpleasant but reversible. If the production network cuts itself off, that is a disaster.
In current production, 29 Wazuh agents have Active Response active. They blocked ~2,400 brute-force IPs in the last 30 days, with zero false-positive incidents affecting legitimate users (all were automated scanners).
Comparative cost
A complete AEGIS cluster (log aggregation with 90-day retention, metrics with 1-year retention, network IDS, HIDS on ~75 hosts, AI triage with one dedicated GPU) requires:
- Hardware: 5 VMs (40 vCPU + 64 GB RAM + 800 GB cumulative storage) plus 1 enterprise GPU. On existing on-premise infrastructure, amortised cost ~6-12K EUR/year.
- Software: zero licence cost. The stack is open-source or source-available (Apache 2.0, GPL, AGPL, BSD; recent Graylog and MongoDB releases use the SSPL).
- Operations: one part-time FTE for rule tuning, updates, on-call. ~30K EUR/year.
Total: approximately 36-42K EUR/year including ops.
Cloud equivalent (Datadog, Splunk Cloud, Sumo Logic at the log volumes of a datacenter the described size): 200-400K EUR/year.
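The totals and the resulting gap can be reproduced directly from the figures above. This is a small worked calculation, not a TCO model; the variable names are ours, and the ranges are the article's own estimates.

```python
HARDWARE_EUR = (6_000, 12_000)    # amortised hardware, low and high estimate
SOFTWARE_EUR = 0                  # no licence fees
OPERATIONS_EUR = 30_000           # part-time FTE
CLOUD_EUR = (200_000, 400_000)    # cloud SIEM estimate, low and high

aegis_low = HARDWARE_EUR[0] + SOFTWARE_EUR + OPERATIONS_EUR    # 36,000
aegis_high = HARDWARE_EUR[1] + SOFTWARE_EUR + OPERATIONS_EUR   # 42,000

# Compare the cheapest cloud case with the most expensive AEGIS case, and vice versa.
smallest_gap = CLOUD_EUR[0] / aegis_high   # about 4.8x
largest_gap = CLOUD_EUR[1] / aegis_low     # about 11.1x

print(f"AEGIS: {aegis_low}-{aegis_high} EUR/year, "
      f"cloud is {smallest_gap:.1f}x to {largest_gap:.1f}x more")
```

Depending on which ends of the ranges apply, the cloud option costs roughly five to eleven times as much per year.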
The difference is not only money. It is also sovereignty.
Related articles
- AI incident analysis with a local LLM: triage from 30 minutes to 30 seconds
- Graylog vs Splunk for 50-500 server SMBs: 3-year TCO and scaling pain points
- SaaS SIEM vs on-premise TCO: 200 servers and 10k events/sec, 3-year numbers
- NIS2 implementation — operational checklist for essential and important entities
- Pillar AEGIS — on-premise SIEM with local AI
Next steps
If you run an on-premise datacenter or hybrid cluster and you are evaluating a replacement for a cloud SIEM, the AEGIS page carries the full specs, the 30-day install plan, and operational rates. Or contact us for an initial technical discussion.
Related reading: Why not Auth0 — Schrems II (compliance for authentication) · Pre-production hardening (how we test internally).