ChatGPT

OpenAI's ChatGPT is a broad-capability multi-turn conversational AI product behind a chatbot interface. An agent loop orchestrates image understanding, image generation, code execution, computer use, RAG/web search, and agent memory, all running in server-side sandboxes.

Capabilities

Trust analysis

ChatGPT operates as a broad-capability agent behind a conversational interface. The agent loop can select from multiple tools per turn - code execution, web search, file search, computer use - with the model deciding which tools to invoke based on the conversation. The sandbox is the critical security boundary: code execution runs in an ephemeral server-side environment with restricted filesystem and network access, and computer use operates in an isolated browser environment.

The code execution tool has no schema boundary - generated code can do anything the sandbox permits, making sandbox quality the sole containment mechanism. User-uploaded files enter the sandbox and become part of the agent's operating environment, where their content can influence reasoning (through the context) and be operated on by generated code. The ephemeral nature of the sandbox limits persistence attacks: generated files and state are discarded after the session.

Image understanding adds a visual prompt injection surface: adversarial content embedded in images (as visible or near-invisible text, QR codes, or steganographic patterns) can redirect model behavior. This is particularly concerning because images appear benign to human reviewers while containing instructions the model processes. User-uploaded images, images encountered during web browsing, and screenshots all become potential injection vectors that bypass text-based input filters.

Image generation (DALL-E) introduces content manipulation risks: the model can produce photorealistic images that could be mistaken for photographs or used to create fabricated visual evidence. Guardrails filter both generation prompts and outputs, but these are probabilistic and subject to guardrail bypass.

Web search results and uploaded documents introduce external content into the agent's context, creating injection surfaces for prompt injection and context poisoning through adversarial web pages or document content. The web search capability also makes ChatGPT a venue for agent SEO: content optimized for agent consumption can influence which sources the model discovers and cites. Agent memory persists information across sessions, creating a durable influence surface - a compromised session can write poisoned memories that affect future sessions.

The multi-turn conversation pattern means the context accumulates across turns, and the model's broad capabilities make the impact of a successful prompt injection high: an attacker who gains influence over the model's behavior can direct code execution, web browsing, image generation, and file operations.

Interaction effects

  • Code execution + uploaded files + sandbox: Uploaded files can contain adversarial content that influences code generation. The sandbox must contain both accidental errors and deliberately malicious code. Since the code execution tool has no schema boundary, the sandbox is the only barrier between the model's output and arbitrary computation.
  • Image understanding + agent loop + web browsing: Images from untrusted sources (user uploads, web pages, screenshots from computer use) can contain visual prompt injection - adversarial content embedded in images that redirects model behavior. This injection surface bypasses text-based input filters and is difficult for human reviewers to detect, making it a particularly stealthy attack vector across the agent loop.
  • Image generation + multi-turn conversation: The model can generate photorealistic images within a conversation, amplifying user manipulation risk. Generated images that appear authentic lend false credibility to claims, and when combined with authoritative conversation and web-sourced information, can create convincing but fabricated visual evidence.
  • Image understanding + image generation: The model can analyze an input image and generate a modified version, enabling content manipulation workflows - such as altering text in screenshots, modifying visual evidence, or creating variations of real photographs - that neither capability alone would support.
  • Web search + agent loop: The agent can be directed (via prompt injection in web content) to search for additional attacker-controlled pages, creating a chain of poisoned context that compounds across the agent loop.
  • Agent memory + multi-turn conversation: Memories persist across sessions and are loaded into future conversations as trusted context, creating a cross-session context poisoning vector. A single compromised interaction can have lasting effects on future sessions.
  • Broad tool set + user trust: The combination of authoritative-seeming conversation, computation results, generated images, and web-sourced information amplifies user manipulation risk - users are more likely to trust outputs that appear backed by code execution, visual evidence, and web sources.

Threats

Threat Relevance Note
Prompt injection Primary Multiple surfaces: conversation, uploaded files, images (visual injection), web search results
Context poisoning Primary Uploaded files, images, web search results, and persisted memories all enter context
Unauthorized code execution Primary Generated code can do anything the sandbox permits; sandbox is sole containment
Persistence attacks Elevated Ephemeral sandbox limits local persistence, but agent memory persists across sessions
User manipulation Elevated Amplified by computation results, generated images, and web-sourced information
Tool misuse Elevated Misleading analysis, incorrect computations, deceptive image generation
Goal manipulation Elevated Uploaded files or web content redirect objectives across broad tool set
Data exfiltration Elevated Code execution network requests, web browsing actions, encoded outputs
Guardrail bypass Elevated Multi-turn jailbreaking; visual prompt injection evades text-based filters
Hallucination exploitation Elevated Incorrect code, misleading analysis, fabricated web search citations
Tool output poisoning Elevated Code execution or web search results hijack subsequent reasoning
Denial of service Elevated Expensive computation, infinite loops, memory exhaustion in sandbox
Privilege compromise Elevated Sandbox escape granting access beyond intended isolation boundary
System prompt extraction Standard Through conversation or tool-mediated interactions
Misaligned model behaviors Standard Compounding over multi-turn sessions across broad capability set
Training data poisoning Standard Baseline risk, no architecture-specific amplifier