System architecture where the AI capability is a first-class dependency, not a feature you call from the application layer. The architectural decisions that determine whether an AI product works at scale are made at this layer, not discovered after launch.
Data model for AI inputs and outputs: the prompt context, retrieved documents, model response, evaluation score, and user feedback action are all first-class entities stored in the database, not transient API responses that disappear after the UI renders. This matters because prompt regression testing, cost attribution, and the improvement cycle all depend on having a retrievable record of what the model was asked and what it returned.
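As an illustration of the data model above, here is a minimal sketch that persists each model call as a row, using SQLite from the Python standard library. The table name `ai_interactions`, the column names, and the helpers `save`, `record_feedback`, and `tokens_by_model` are assumptions made for this example, not a prescribed schema.

```python
import json
import sqlite3
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical schema: one row per model call.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ai_interactions (
    id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL,
    model TEXT NOT NULL,
    prompt_context TEXT NOT NULL,
    retrieved_document_ids TEXT NOT NULL,  -- JSON array of document ids
    response_text TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    evaluation_score REAL,                 -- NULL until an evaluator scores it
    user_feedback TEXT                     -- NULL until the user acts
)
"""


@dataclass
class AIInteraction:
    model: str
    prompt_context: str
    retrieved_document_ids: List[str]
    response_text: str
    input_tokens: int
    output_tokens: int
    evaluation_score: Optional[float] = None
    user_feedback: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def save(conn: sqlite3.Connection, rec: AIInteraction) -> None:
    """Insert one interaction; values follow the column order in SCHEMA."""
    conn.execute(
        "INSERT INTO ai_interactions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (rec.id, rec.created_at, rec.model, rec.prompt_context,
         json.dumps(rec.retrieved_document_ids), rec.response_text,
         rec.input_tokens, rec.output_tokens,
         rec.evaluation_score, rec.user_feedback),
    )
    conn.commit()


def record_feedback(conn: sqlite3.Connection, interaction_id: str, action: str) -> None:
    """Attach the user's feedback action to an already stored interaction."""
    conn.execute(
        "UPDATE ai_interactions SET user_feedback = ? WHERE id = ?",
        (action, interaction_id),
    )
    conn.commit()


def tokens_by_model(conn: sqlite3.Connection) -> dict:
    """Total tokens per model, the raw input for cost attribution."""
    rows = conn.execute(
        "SELECT model, SUM(input_tokens + output_tokens) "
        "FROM ai_interactions GROUP BY model"
    ).fetchall()
    return dict(rows)
```

Because the prompt context and response are stored together, a regression test can replay `prompt_context` against a new prompt version or model and compare the result with the stored `response_text`.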
Latency architecture: for interactive AI products (chat, copilot, real-time analysis), the p95 latency target is defined before the architecture is chosen. As a rough guide, GPT-4o returns short streamed responses in about 1-3s; for non-streaming or long outputs, p95 latency can reach 8-15s. Architecture decisions that affect latency: response streaming (Server-Sent Events from the backend to the browser, so the user sees output arriving rather than waiting for the full response); async job processing for non-interactive tasks (document analysis, batch classification); model selection (GPT-4o mini at roughly 1-2s vs GPT-4o at roughly 3-8s for the same query); and context window management (fewer input tokens mean lower latency and lower cost).
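The streaming and p95 points above can be sketched in a few lines. This sketch shows the Server-Sent Events wire format (an optional `event:` line, one `data:` line per payload line, and a blank line ending each frame) and a nearest-rank p95 calculation. The function names, the `token` and `done` event names, and the JSON payload shape are assumptions for illustration; a real backend would yield these frames from a streaming HTTP response with the `text/event-stream` content type.

```python
import json
from typing import Iterable, Iterator, List, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one message in the Server-Sent Events wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of a multi-line payload needs its own "data:" prefix.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def stream_as_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap model output chunks as SSE frames, followed by a final 'done' frame."""
    for chunk in chunks:
        yield format_sse(json.dumps({"text": chunk}), event="token")
    yield format_sse("{}", event="done")


def p95(latencies_s: List[float]) -> float:
    """Nearest-rank 95th percentile of observed latencies, in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]
```

Measuring `p95` over recorded request durations gives a direct check against the latency target agreed before the architecture was chosen.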
Provider abstraction layer: the application calls a provider-agnostic service interface, not the OpenAI SDK directly. If you switch from OpenAI GPT-4o to Anthropic Claude 3.5 Sonnet or Google Gemini 1.5 Pro, one service adapter changes, not 40 call sites across the codebase. This is critical for AI products because model pricing, capability, and availability evolve faster than the product.
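One way to picture the abstraction layer is the sketch below. The `CompletionProvider` interface, the adapter classes, `build_provider`, and `SummaryService` are hypothetical names, and the adapters return stubbed text instead of calling any vendor SDK; in a real system each adapter would be the only place that imports its provider's client library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass(frozen=True)
class Completion:
    text: str
    provider: str
    model: str


class CompletionProvider(Protocol):
    """Provider-agnostic interface that application code depends on."""

    def complete(self, prompt: str, max_tokens: int) -> Completion: ...


class OpenAIAdapter:
    def __init__(self, model: str = "gpt-4o") -> None:
        self.model = model

    def complete(self, prompt: str, max_tokens: int) -> Completion:
        # Stub: the real vendor SDK call would live here and nowhere else.
        return Completion(f"stubbed reply to: {prompt}", "openai", self.model)


class AnthropicAdapter:
    def __init__(self, model: str = "claude-3-5-sonnet") -> None:
        self.model = model  # illustrative label, not a verified model id

    def complete(self, prompt: str, max_tokens: int) -> Completion:
        # Stub: same interface, different vendor behind it.
        return Completion(f"stubbed reply to: {prompt}", "anthropic", self.model)


# Switching providers means changing one configuration value, not call sites.
PROVIDERS: Dict[str, Callable[[], CompletionProvider]] = {
    "openai": OpenAIAdapter,
    "anthropic": AnthropicAdapter,
}


def build_provider(name: str) -> CompletionProvider:
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None


class SummaryService:
    """Application-layer code: knows the interface, not the vendor."""

    def __init__(self, provider: CompletionProvider) -> None:
        self._provider = provider

    def summarize(self, document: str) -> Completion:
        return self._provider.complete(f"Summarize:\n{document}", max_tokens=256)
```

With this shape, `SummaryService(build_provider("anthropic"))` behaves the same from the caller's point of view as the OpenAI-backed version; only the adapter differs.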
Fallback architecture: when the primary model provider is unavailable or rate-limited, the system falls back to a secondary provider automatically (OpenAI primary, Anthropic secondary) via the abstraction layer. A circuit breaker with a 60-second retry window stops the system from sending further requests to the failing provider until the window elapses, which prevents cascading failures from amplifying latency under load.
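The fallback behaviour can be sketched as a small circuit breaker placed in front of the primary provider. The 60-second retry window comes from the text above; the failure threshold of 3, the `ProviderUnavailable` exception, and the function names are assumptions for this example. Providers are passed in as plain callables so the sketch stays independent of any SDK.

```python
import time
from typing import Callable, Optional


class ProviderUnavailable(Exception):
    """Raised by a provider call on an outage or rate limit (hypothetical)."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,          # assumed value, not from the text
        retry_after_s: float = 60.0,         # the 60-second retry window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.retry_after_s = retry_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the primary may be called (closed, or retry window elapsed)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.retry_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the retry window.
            self._opened_at = self._clock()


def complete_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    secondary: Callable[[str], str],
    breaker: CircuitBreaker,
) -> str:
    """Try the primary unless the breaker is open; otherwise use the secondary."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except ProviderUnavailable:
            breaker.record_failure()
    return secondary(prompt)
```

While the breaker is open, requests go straight to the secondary without waiting on a primary call that is likely to fail or time out, which is what keeps latency from compounding under load.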