
Bleeding Llama: Critical Ollama Flaw Leaks AI Server Memory to Unauthenticated Attackers

A critical out-of-bounds read in Ollama's GGUF loader lets attackers siphon API keys, user prompts, and credentials from 300,000+ exposed AI servers — no authentication required.

FyntraLink Team

Security researchers at Cyera have disclosed CVE-2026-7482, a CVSS 9.1 vulnerability in Ollama they've dubbed "Bleeding Llama." The flaw allows any unauthenticated attacker to drain process memory from exposed Ollama instances — leaking API keys, user prompts, environment variables, and personally identifiable information. With over 300,000 Ollama servers accessible on the public internet, this is one of the most consequential AI-infrastructure vulnerabilities disclosed this year.

How Bleeding Llama Works: Out-of-Bounds Heap Read in the GGUF Loader

The vulnerability resides in Ollama's GGUF model loader, specifically in the WriteTo() function within the model creation pipeline. When an attacker submits a specially crafted GGUF file to the /api/create endpoint, they can set tensor dimensions to deliberately oversized values. Ollama processes the file without validating these dimensions against the actual allocated buffer, causing the server to read far beyond its intended memory boundaries. The result is a classic out-of-bounds heap read — similar in spirit to the infamous Heartbleed vulnerability that devastated TLS implementations a decade ago.
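
Cyera has not published the vulnerable code itself, so the following is a hypothetical Go sketch of the bug class rather than Ollama's actual WriteTo() implementation. It shows how a dimension field taken from an untrusted file header can pull adjacent heap bytes into the output when it is never checked against the data actually loaded; the backing string stands in for whatever happens to share the allocation.

```go
// Hypothetical sketch of the bug class, not Ollama's real code: an
// attacker-controlled dimension decides how many bytes are sliced out of
// a buffer, without being validated against the bytes actually loaded.
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Backing allocation: 8 bytes of real tensor data, followed by
	// unrelated data that happens to share the same allocation.
	backing := []byte("TENSORXXsecret-api-key-on-the-heap")
	tensor := backing[:8] // only these 8 bytes belong to the tensor

	// The GGUF header's tensor dimension is attacker-controlled.
	header := make([]byte, 8)
	binary.LittleEndian.PutUint64(header, uint64(len(backing)))
	claimed := int(binary.LittleEndian.Uint64(header))

	// Vulnerable pattern: reslicing past len() but within cap() is legal
	// in Go and exposes adjacent bytes from the same allocation.
	leaked := tensor[:claimed]
	fmt.Printf("bytes past the tensor: %q\n", leaked[8:])

	// Patched pattern: reject dimensions larger than the loaded data.
	if claimed > len(tensor) {
		fmt.Println("rejected: tensor dimensions exceed available data")
	}
}
```

A real exploit targets memory-mapped model data rather than a toy string, but the fix is the same shape: validate every attacker-supplied length against the buffer it indexes before using it.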

What makes this particularly dangerous is the attack's simplicity. No authentication is required. No complex exploit chain is needed. A single malformed HTTP request to an exposed Ollama instance is sufficient to begin extracting heap contents, byte by byte. The attacker receives raw memory dumps containing whatever data happens to reside on the heap at the time of the request.

What Leaks: API Keys, Prompts, Credentials, and PII

The data exposed through Bleeding Llama reads like a threat actor's wish list. Cyera's research confirmed that the following categories of sensitive information are recoverable from leaked heap memory:

  1. System prompts and instruction sets that reveal proprietary logic and guardrails.
  2. Fragments of other users' chat messages and inference queries.
  3. Environment variables, commonly loaded with API keys for OpenAI, Anthropic, Azure, and AWS services.
  4. Database connection strings and cloud service credentials.
  5. Any PII or protected health information flowing through active inference jobs.

For organizations running Ollama as part of internal AI assistants, document-processing pipelines, or customer-facing chatbots, this means an attacker could silently harvest credentials that unlock lateral movement across the entire cloud estate. The leak is not a one-time event — repeated requests yield different memory segments, allowing methodical extraction of the server's full heap over time.

300,000 Exposed Servers: The Scale of the Problem

Ollama has surged in popularity as the go-to framework for running open-source large language models locally and in private cloud environments. Its ease of deployment is both its strength and its Achilles' heel. Internet-wide scans reveal over 300,000 Ollama instances directly accessible without any form of authentication gateway, API proxy, or network-level access control. Many of these deployments were stood up by development teams experimenting with LLMs, then left running in production environments without security review.

The pattern is familiar to anyone who tracked the MongoDB and Elasticsearch exposure waves of the late 2010s: a powerful tool ships with permissive defaults, adoption outpaces security guidance, and attackers eventually notice. Bleeding Llama is that moment for the AI-inference server category.

Impact on Saudi Financial Institutions Under SAMA Oversight

Saudi banks, insurance companies, and fintech firms regulated by SAMA have been aggressively adopting generative AI for fraud detection, customer service automation, regulatory document analysis, and internal knowledge management. Several of these deployments rely on Ollama or similar self-hosted inference frameworks to keep sensitive financial data out of third-party cloud APIs.

Bleeding Llama undermines the very premise of that strategy. If an Ollama instance processing loan applications, KYC documents, or transaction-monitoring alerts is exposed, even briefly, an attacker could extract customer PII, internal model instructions, and cloud credentials without leaving a trace in application-level logs. This creates direct compliance exposure across multiple frameworks:

  1. SAMA's Cyber Security Common Controls (CSCC) mandate strict access control and data protection for systems processing customer data.
  2. NCA's Essential Cybersecurity Controls (ECC) require vulnerability management and secure configuration baselines for all infrastructure, including AI workloads.
  3. The Personal Data Protection Law (PDPL) imposes notification obligations and potential penalties when personal data is exposed due to inadequate technical safeguards.
  4. PCI-DSS v4.0 requires that any system component in the cardholder data environment be hardened and access-restricted; an unauthenticated API endpoint processing payment-related queries fails this requirement categorically.

Why Traditional Scanning Missed This

One reason Bleeding Llama persisted undetected for so long is that conventional vulnerability scanners and endpoint detection tools do not typically cover AI-inference frameworks. Ollama is not a web application in the traditional sense: it exposes a REST API, but that API does not appear in the signature databases of standard web application firewalls. Most organizations treat it as a development tool rather than a production service, exempting it from hardening checklists and penetration testing scope.

This blind spot is not unique to Ollama. The broader AI toolchain — including vLLM, LocalAI, and text-generation-inference — operates in a security governance gap where neither traditional IT security teams nor data science teams take ownership of hardening, patching, and monitoring. For SAMA-regulated entities, this gap must be closed before the next disclosure.

Recommended Actions for Security Teams

  1. Patch immediately. Upgrade all Ollama instances to version 0.17.1 or later, which contains the fix for CVE-2026-7482. Verify the version programmatically across all environments (development, staging, and production); a version-check sketch follows this list.
  2. Deploy an authentication proxy. Never expose Ollama's API directly to any network segment without an authentication and authorization layer. Use an API gateway such as Kong, Envoy, or a cloud-native equivalent that enforces mTLS and token-based access; a minimal proxy sketch follows this list.
  3. Restrict network exposure. Ollama instances should be bound to localhost or internal network interfaces only. Use firewall rules and security groups to block all inbound access from untrusted networks. Conduct an immediate scan for TCP 11434 (Ollama's default port) across your entire IP range; see the discovery sketch after this list.
  4. Audit environment variables. Rotate any API keys, database credentials, or cloud tokens that were loaded as environment variables on systems running vulnerable Ollama versions. Assume compromise if the instance was internet-accessible.
  5. Include AI infrastructure in your VAPT scope. Update penetration testing and vulnerability assessment engagements to explicitly cover AI-inference servers, model-serving endpoints, and associated APIs. This should be reflected in your SAMA CSCC vulnerability management documentation.
  6. Inventory all AI workloads. Many Ollama deployments are shadow IT, spun up by data science teams without security team awareness. Conduct an asset discovery sweep specifically targeting AI-framework signatures and non-standard API ports; the discovery sketch below checks for Ollama's banner.
  7. Monitor for exploitation indicators. Watch for unusual POST requests to /api/create with abnormally large payloads or unexpected GGUF file uploads. Correlate with outbound data transfers that could indicate memory exfiltration; a log-screening sketch follows this list.
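
For step 1, Ollama exposes a GET /api/version endpoint that returns its version as JSON, which makes the check easy to script. A minimal Go sketch, assuming a hand-maintained host inventory (the addresses below are placeholders):

```go
// Sweep a host inventory and report each Ollama version so anything
// below the fixed release (0.17.1 per the advisory) can be flagged.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	hosts := []string{"10.0.1.5:11434", "10.0.2.9:11434"} // placeholder inventory
	client := &http.Client{Timeout: 3 * time.Second}
	for _, h := range hosts {
		resp, err := client.Get("http://" + h + "/api/version")
		if err != nil {
			fmt.Printf("%s: unreachable (%v)\n", h, err)
			continue
		}
		var v struct {
			Version string `json:"version"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
			fmt.Printf("%s: unexpected response\n", h)
		} else {
			fmt.Printf("%s: ollama %s (verify >= 0.17.1)\n", h, v.Version)
		}
		resp.Body.Close()
	}
}
```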
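
For step 2, production deployments should sit behind a hardened gateway such as Kong or Envoy, but the principle is simple enough to sketch as a Go reverse proxy: Ollama stays bound to localhost, and every request must carry a bearer token before it is forwarded. The listen address and token handling are illustrative only.

```go
// Minimal authenticating reverse proxy in front of a localhost-bound
// Ollama instance. Illustrative only: production deployments should use
// a hardened gateway with mTLS and serve TLS on the listener.
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	upstream, err := url.Parse("http://127.0.0.1:11434") // Ollama on localhost only
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	token := os.Getenv("OLLAMA_PROXY_TOKEN") // placeholder secret distribution

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("Authorization")
		want := "Bearer " + token
		// Constant-time comparison avoids leaking the token via timing.
		if token == "" || subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```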
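
For steps 3 and 6, a quick discovery sweep can combine the port check with a banner check: at the time of writing, an Ollama instance answers GET / with the plain-text body "Ollama is running". The subnet below is a placeholder for your own ranges.

```go
// Sweep a subnet for exposed Ollama instances: connect to TCP 11434 and
// check the root endpoint's banner. Replace the subnet with your ranges.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for i := 1; i < 255; i++ {
		addr := fmt.Sprintf("10.0.1.%d:11434", i) // placeholder subnet
		resp, err := client.Get("http://" + addr + "/")
		if err != nil {
			continue // closed port, timeout, or not an HTTP service
		}
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 64))
		resp.Body.Close()
		if string(body) == "Ollama is running" { // default root response
			fmt.Println("exposed Ollama instance:", addr)
		}
	}
}
```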
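
For step 7, a rough screening heuristic, assuming Common Log Format access logs from a proxy in front of Ollama (where the final field is the response size in bytes): flag any POST to /api/create that returns an unusually large body, since memory extraction shows up as oversized responses to create calls. The 1 MiB threshold is an assumption to tune against your own baseline.

```go
// Read proxy access logs on stdin and flag suspicious /api/create calls.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "POST /api/create") {
			continue
		}
		// Common Log Format: host ident user [date] "request" status bytes
		fields := strings.Fields(line)
		respBytes, _ := strconv.Atoi(fields[len(fields)-1])
		if respBytes > 1<<20 { // more than 1 MiB back from a create call
			fmt.Println("possible memory exfiltration:", line)
		}
	}
}
```

The same logic can be wired into a SIEM rule; the point is to treat /api/create as a monitored, high-risk endpoint rather than developer plumbing.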

Conclusion

Bleeding Llama is a wake-up call for every organization deploying self-hosted AI infrastructure. The vulnerability demonstrates that the same classes of memory-safety bugs that plagued web servers and TLS libraries for decades are now emerging in AI-inference frameworks — with the added risk that leaked memory may contain not just credentials, but proprietary model logic, customer conversations, and regulated financial data. For Saudi financial institutions, the message is clear: AI workloads are production workloads, and they demand the same security rigor as any other system in your regulated environment.

Is your organization prepared? Contact Fyntralink for a complimentary SAMA Cyber Maturity Assessment that includes AI-infrastructure security review and compliance gap analysis.