Episode 45 — Plan for vendor outages and safe degraded modes in AI systems (Task 17)

In this episode, we’re going to focus on a reality that every organization faces sooner or later: vendors go down, and when they do, your A I system might not fail gracefully unless you planned for that exact moment. For brand new learners, it can be tempting to assume vendors are simply reliable utilities, like electricity, but vendor services are still software systems that can have outages, performance issues, or security incidents of their own. When your A I capabilities depend on a vendor, your business risk is tied to their availability and to how your system behaves when their service is missing or degraded. The goal is not to distrust vendors, but to be honest about dependency and to design safe degraded modes so a disruption does not turn into a dangerous failure. By the end, you should understand what vendor outages look like in practice, why degraded modes must be designed intentionally, and how safe fallbacks protect users, data, and operations when you cannot get the full A I service you want.

Before we continue, a quick note: this audio course has two companion books. The first book is about the exam and provides detailed information on how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

A vendor outage is any event where an external provider cannot deliver the service you depend on at the expected level, and that can include full downtime, partial feature failure, or severe latency that makes the service unusable. In an A I context, vendors might provide model hosting, content filtering, data retrieval, vector search, monitoring, or specialized A I safety controls. Outages may be planned, like maintenance windows, or unplanned, like infrastructure failures, network issues, or emergency shutdowns. A beginner should recognize that an outage is not only the moment a service returns an error; it can also be a slow degradation that causes timeouts, partial responses, or inconsistent behavior. This matters because inconsistent behavior can be more dangerous than clear failure, since users may continue relying on outputs that are incomplete or wrong. Vendor outage planning is about anticipating the spectrum of failure and deciding what your system should do at each stage.

Safe degraded mode means the system continues operating in a limited way that reduces risk while preserving essential outcomes. Degraded mode is not the same as improvising a workaround, and it is not the same as letting the system behave unpredictably. It is a planned state where certain features are disabled, certain data flows are restricted, and certain actions require additional checks. In A I systems, degraded mode might mean disabling high risk integrations, limiting access to trusted users, switching to a simpler model for low risk tasks, or requiring human review before outputs are used. The purpose is to maintain safety and continuity at the same time, even if performance or convenience is reduced. For beginners, it helps to think of degraded mode like driving in a storm: you may slow down, avoid certain roads, and increase caution, but you still aim to reach essential destinations safely.

Planning for vendor outages begins with mapping dependency, because you cannot plan for what you do not understand. You need to know which parts of your A I system depend on vendor services and how the system behaves when those services fail. For example, if a vendor provides model inference, the entire feature may stop if inference is unavailable. If a vendor provides a safety filter, the system might still generate outputs but without protections, which can be far worse than an outage. If a vendor provides retrieval search, the model may still respond but without current context, which can cause confident mistakes. Beginners should see dependency mapping as a safety exercise, not as paperwork, because it reveals where a vendor failure could lead to harmful behavior rather than simple unavailability. The goal is to identify which dependencies can be safely bypassed and which must cause the system to stop.
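For learners following along in text, the dependency map described here can be sketched as a small lookup. This is only an illustration: the dependency names, the two policies, and the plan_for helper are assumptions made for the example, not part of any vendor product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class OnFailure(Enum):
    STOP = "stop"        # fail closed: the dependent feature shuts off
    DEGRADE = "degrade"  # continue in a limited, pre-planned mode


@dataclass(frozen=True)
class Dependency:
    name: str
    provides: str
    on_failure: OnFailure


# Hypothetical dependency map; names and policies are illustrative assumptions.
DEPENDENCIES = [
    Dependency("model_inference", "generates responses", OnFailure.STOP),
    Dependency("safety_filter", "blocks disallowed output", OnFailure.STOP),
    Dependency("retrieval_search", "supplies current context", OnFailure.DEGRADE),
]


def plan_for(failed: Iterable[str]) -> Optional[OnFailure]:
    """Return the strictest action required by the failed dependencies."""
    failed_names = set(failed)
    actions = [d.on_failure for d in DEPENDENCIES if d.name in failed_names]
    if not actions:
        return None  # nothing we depend on has failed
    return OnFailure.STOP if OnFailure.STOP in actions else OnFailure.DEGRADE
```

The design choice worth noticing is that the strictest policy wins: if retrieval and the safety filter fail together, the answer is to stop, not to degrade.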

Once you understand dependencies, you can define failure modes, meaning the specific ways the vendor service can break and the signals you might see. A failure mode could be complete unavailability, repeated timeouts, partial endpoint failure, degraded throughput, inconsistent responses, or safety control misbehavior. Each failure mode implies different risks and different decisions. Complete unavailability may be simple, because you can fail closed, meaning you stop the dependent feature. Partial failure is harder, because some requests may succeed while others fail, which can produce inconsistent user experiences and unpredictable data flows. Safety control failure is especially serious, because it can allow disallowed content or sensitive data leakage to slip through. A beginner should understand that fail closed is often safer than fail open, because fail open can create silent harm. Planning means choosing when to fail closed, when to degrade safely, and what evidence you need to detect the difference.
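To make the link between failure modes and decisions concrete, here is a minimal sketch. The outcome labels, the mode names, and the ACTION policy table are all assumptions chosen for illustration; a real system would derive them from its own monitoring signals.

```python
def classify_failure(outcomes, safety_filter_ok):
    """Map recent request outcomes to a coarse failure mode.

    outcomes: list of "ok", "error", or "timeout" strings (hypothetical labels).
    safety_filter_ok: False when the safety control is known to be misbehaving.
    """
    if not safety_filter_ok:
        return "safety_control_failure"  # most serious case, checked first
    if not outcomes:
        return "unknown"  # no evidence either way
    failed = sum(1 for o in outcomes if o != "ok")
    if failed == len(outcomes):
        return "unavailable"
    if failed > 0:
        return "partial"
    return "healthy"


# Assumed policy: which modes stop the feature and which degrade it.
ACTION = {
    "safety_control_failure": "fail_closed",
    "unavailable": "fail_closed",
    "unknown": "fail_closed",
    "partial": "degrade",
    "healthy": "normal",
}
```

Note that the unknown case fails closed in this sketch, which reflects the idea that missing evidence should not be treated as good health.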

A safe degraded mode design should start with the question of what is essential and what is optional. If the A I feature is a convenience, you might choose to disable it during a vendor outage rather than risking unsafe behavior. If the feature supports critical operations, you might define a limited mode that provides partial functionality, such as answering from approved static sources rather than querying live sensitive systems. You might also limit usage to trained internal users rather than exposing degraded behavior to the public. For example, during a vendor outage, a customer facing A I assistant might switch to a simple menu of known answers or route users to human support, while an internal assistant might still provide drafting help based on non sensitive templates. Beginners should see that the same system can have different degraded modes depending on user group and risk tolerance. Planning means making these decisions before an outage, not during one.
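The idea that different user groups get different degraded modes can be written down ahead of time as a simple table. The group names and behavior labels below are hypothetical, picked only to mirror the example in this episode.

```python
# Hypothetical degraded-mode table, decided before any outage happens.
DEGRADED_BEHAVIOR = {
    "public": "static_answers_or_human_support",
    "internal_trained": "drafting_from_nonsensitive_templates",
}


def degraded_behavior(user_group: str) -> str:
    """Look up the pre-planned degraded behavior for a user group."""
    # Any group that was not planned for gets the most restrictive option.
    return DEGRADED_BEHAVIOR.get(user_group, "feature_disabled")
```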

One of the most important safety decisions is how to handle data access when vendor services are degraded. If the vendor provides retrieval or context building, and that component fails, the model might respond without context, which can lead to errors. The safest approach may be to restrict the system to tasks that do not require live sensitive data, or to explicitly label outputs as limited when context is missing, though labeling alone is not a full control. If the vendor provides the model itself, you must consider whether prompts or data are being sent externally, and whether during outages retries might increase exposure or create logs in unexpected places. Safe degraded mode should limit sensitive data flows, not expand them. A beginner should learn that during outages, systems may behave differently, such as retry storms that resend data, so planning should include rate limiting and careful handling of repeated failures. The goal is to keep data movement predictable and minimal during degraded conditions.
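One common way to avoid a retry storm is to cap the number of attempts and wait longer between each one. The sketch below assumes a hypothetical send function that raises TimeoutError or ConnectionError when the vendor is struggling; the attempt limit and delays are placeholder values.

```python
import time


def call_with_bounded_retries(send, payload, max_attempts=3, base_delay=0.5,
                              sleep=time.sleep):
    """Call send(payload) at most max_attempts times, then give up.

    send is a hypothetical vendor call; sleep is injectable so the
    behavior can be checked without real waiting.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Back off exponentially, but do not wait after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        f"vendor call failed after {max_attempts} attempts"
    ) from last_error
```

Because the payload is sent a bounded number of times, the amount of data leaving your environment during an outage stays predictable, which is the point of the paragraph above.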

Another critical consideration is what happens to safety controls during vendor outages, especially if those controls are provided by the vendor. If content filtering or policy enforcement fails, the system may produce outputs that are unsafe, disallowed, or privacy violating. In this scenario, a safe design often requires failing closed, meaning the system stops responding rather than responding without protections. If you must continue operating, you might use simpler internal safeguards, such as stricter prompt rules, more aggressive refusal behavior, or human review gates, but you should not assume these are equal to normal protection. For beginners, this is a core lesson: if the guardrails are down, the safest road may be to stop driving. Degraded mode is safe only when you have alternate guardrails that are strong enough for the reduced scope. Planning means identifying which guardrails are essential and ensuring there is a safe fallback or a clear shutdown rule.
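Failing closed when a guardrail is down can be expressed in a few lines. In this sketch, generate and check_output stand in for the model call and the vendor safety filter, and FilterUnavailable is a made-up exception type; none of these names come from a real library.

```python
class FilterUnavailable(Exception):
    """Hypothetical error raised when the safety filter cannot be reached."""


REFUSAL = "This feature is temporarily unavailable."


def respond(prompt, generate, check_output):
    """Return the model draft only if the safety filter approved it."""
    draft = generate(prompt)
    try:
        allowed = check_output(draft)
    except FilterUnavailable:
        # Fail closed: with no working filter, the draft is never released.
        return REFUSAL
    return draft if allowed else REFUSAL
```

The fail open alternative would return the draft inside the except branch, which is exactly the silent harm this episode warns about.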

Vendor outage planning also requires operational workflows so people know what to do when signals show degradation. Monitoring should detect vendor health problems, such as rising error rates, rising latency, and failed requests, and it should route alerts to the right team. The workflow should include decision triggers for switching to degraded mode, such as a sustained period of failures or evidence that safety controls are not operating correctly. It should also include clear communication to users about what functionality is available and what to do if they need help. For internal users, communication can include guidance such as using manual processes for certain tasks and reporting any suspicious outputs. For external users, communication may be simpler, such as an availability message and a path to human support. Beginners should see that a degraded mode without a workflow is fragile, because people will not know when it started, what it means, or when it is safe to return to normal.
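A decision trigger based on a sustained period of failures might look like the following sketch. The window size and failure ratio are placeholder assumptions; real thresholds would come from your own monitoring data and risk tolerance.

```python
from collections import deque


class DegradeTrigger:
    """Signal degraded mode when failures dominate a recent window of requests."""

    def __init__(self, window=20, max_failure_ratio=0.5):
        self.outcomes = deque(maxlen=window)  # oldest results drop off automatically
        self.max_failure_ratio = max_failure_ratio

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def should_degrade(self) -> bool:
        # Require a full window so one early failure does not trip the switch.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.max_failure_ratio
```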

A subtle but important aspect is safe return to normal, because switching back too quickly can create instability. When the vendor service returns, your system may need to reestablish connections, rebuild caches, and confirm that controls are functioning. If you rush back to full mode, you may reintroduce problems like stale data, misaligned configurations, or unverified safety filters. A planned approach includes verification steps, such as checking that error rates are normal, checking that safety rules are triggering appropriately, and checking that outputs are consistent with expected behavior. It also includes watching for recurrence, because early restoration periods can be noisy. Beginners should understand that recovery is part of outage planning, not an afterthought, because the transition states are where unexpected behavior often appears. A safe degraded mode plan includes both entering degraded mode and exiting it with confidence.
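The verification steps for returning to normal can be modeled as a gate that only opens after several clean checks in a row. This is a sketch under assumptions: the three check inputs and the required number of passes are illustrative, and how each check is actually measured is left to your monitoring.

```python
class RecoveryGate:
    """Allow exit from degraded mode only after consecutive healthy checks."""

    def __init__(self, required_passes=3):
        self.required_passes = required_passes
        self.passes = 0

    def observe(self, error_rate_ok, safety_filter_ok, outputs_ok) -> bool:
        """Record one verification round; return True when it is safe to exit."""
        if error_rate_ok and safety_filter_ok and outputs_ok:
            self.passes += 1
        else:
            # Any failed check resets the count, which guards against
            # noisy early recovery periods.
            self.passes = 0
        return self.passes >= self.required_passes
```

Resetting the counter on any failed check is what keeps the system from flapping between normal and degraded mode while the vendor is still unstable.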

Vendor relationships matter here because planning is not only internal; it also involves setting expectations with the vendor. This includes understanding what communication the vendor provides during incidents, what service level commitments exist, and what support channels are available. It also includes clarifying what data and logs you can access to understand whether your system was affected by the vendor’s event. Beginners should see that vendor outage planning is a governance activity as well as an engineering activity. You are managing dependency risk by ensuring you are not blind when the vendor struggles. Even if the vendor is excellent, you still need a plan because excellence does not eliminate the possibility of disruption. The plan is your safety net, and the vendor relationship helps you know when to deploy it and when to retract it.

As we close, planning for vendor outages is about recognizing that external dependencies can fail and designing your A I system to respond safely when they do. A vendor outage can be total, partial, slow, or safety related, and each failure mode has different risks. Safe degraded modes are intentionally limited operating states that preserve essential outcomes while reducing risk, often by restricting access, limiting data flows, disabling high risk features, and adding extra checks. The most important safety principle is to fail closed when critical guardrails are missing, because running without protections can cause silent harm. Effective planning includes monitoring, clear triggers, user communication, and safe transitions back to normal once the vendor recovers. When you do this well, outages become manageable disruptions instead of chaotic events, and your A I systems remain trustworthy even under stress.
