Client Handoffs and Hosting Uptime: A Practical, No-Nonsense Guide

1. Why this list matters: what clients actually need from a hosting handoff

If you’re taking ownership of a website or app from a vendor, designer, or internal team, the handoff is where most future headaches are seeded. The goal isn’t to collect credentials and hope for the best. It’s to leave with a clear, actionable state of the system, the expectations for uptime, and a plan for when things go wrong. Think of a handoff like moving into a new apartment: you want the keys, the owner’s manual for the HVAC, receipts for recent repairs, and emergency contacts. For hosting, that translates to credentials, architecture diagrams, backups, monitoring access, and an incident playbook.

This list walks through what to demand, how uptime guarantees translate into real-world minutes of downtime, how to calculate and verify actual availability, and what to do immediately after the handoff so you don’t inherit a ticking time bomb. I’ll give concrete examples, calculations you can use during vendor negotiation, and a 30-day plan to make the new setup reliable. If you value your time and client revenue, these are the checks that pay for themselves.

2. Handoff essentials: the inventory you need before you trust anything

Start the handoff by insisting on a complete inventory. Don’t accept “I’ll send it later.” Ask for a single, exportable source of truth: a spreadsheet or repository that lists every service, credential owner, location, and purpose. Include hosting provider accounts, DNS registrar access, TLS certificate sources, database hosts, storage buckets, CI/CD pipelines, monitoring tools, and backup locations. For each item, require the current credentials or a confirmed transfer process, the contact person, and the last verification date.
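
If you want a starting point for that source of truth, here is a minimal sketch in Python of one possible inventory record exported to CSV. The field names and the vault-reference convention are my assumptions, not a required schema; adapt them to whatever your team already tracks.

    from csv import DictWriter
    from dataclasses import dataclass, asdict, fields

    @dataclass
    class InventoryItem:
        # One row per service, account, or resource covered by the handoff.
        service: str          # e.g. "DNS registrar", "object storage bucket"
        provider: str         # vendor or platform the item lives on
        purpose: str          # what the business uses it for
        owner: str            # contact person responsible for the item
        vault_reference: str  # pointer to the credential in the shared vault, never the secret itself
        last_verified: str    # ISO date access was last confirmed, e.g. "2025-01-15"

    def export_inventory(items, path="handoff_inventory.csv"):
        # Write the inventory to a CSV both parties can review and diff over time.
        with open(path, "w", newline="") as fh:
            writer = DictWriter(fh, fieldnames=[f.name for f in fields(InventoryItem)])
            writer.writeheader()
            for item in items:
                writer.writerow(asdict(item))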

Security matters: never accept credentials in unencrypted email. Require secure vault transfer (1Password, Bitwarden, Secrets Manager) or a one-time secure handover. Get confirmation that two-factor authentication is enabled on all accounts and that you’re added as an admin or owner. Make sure there’s an account recovery plan if a 2FA device is lost. Ask for documented IAM roles and permissions so you don’t suddenly have more privileges than needed - or, worse, fewer than you need.

Also demand recent backups and test restores. A backup exists only if you’ve verified a restore into a staging environment. Get the backup frequency, retention policy, encryption status, and a sample restore log. If they can’t prove a restore worked, treat backups as non-existent until you run one yourself.
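
If the stack happens to run PostgreSQL, the restore check can be a short script like the sketch below. The dump path, staging database name, and the sanity query against an orders table are placeholders; the point is that the restore either completes and returns plausible data, or the backup does not count.

    import subprocess

    def verify_restore(dump_path="backups/latest.dump", staging_db="staging_restore_check"):
        # Restore the latest dump into a throwaway staging database and run a sanity query.
        # Assumes PostgreSQL client tools (createdb, pg_restore, psql) are installed and
        # that connection settings come from the environment (PGHOST, PGUSER, and so on).
        subprocess.run(["createdb", staging_db], check=True)
        subprocess.run(["pg_restore", "--no-owner", "-d", staging_db, dump_path], check=True)
        result = subprocess.run(
            ["psql", "-d", staging_db, "-tAc", "SELECT count(*) FROM orders;"],
            check=True, capture_output=True, text=True,
        )
        print(f"Rows in orders table after restore: {result.stdout.strip()}")

    if __name__ == "__main__":
        verify_restore()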

3. Uptime guarantees decoded: what 99.9% actually means in minutes

Vendors love to quote a percentage. You’ll hear 99%, 99.9%, 99.99% and so on. Those numbers sound precise but are meaningless without converting to time. Here’s the quick math you can use during negotiation and post-handoff verification.

How to calculate downtime from uptime

Formula: downtime = total_time * (1 - uptime_fraction). Use minutes for clarity.

Example conversions (approximate):

Uptime     Allowed downtime per year     Allowed downtime per 30-day month
99%        ~3.65 days (87.6 hours)       ~7.2 hours
99.9%      ~8.76 hours                   ~43 minutes
99.99%     ~52.6 minutes                 ~4.3 minutes
99.999%    ~5.26 minutes                 ~0.43 minutes (26 seconds)
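
If you want to reproduce or extend that table during a negotiation, the same formula is a few lines of Python; the two windows below (a 365-day year and a 30-day month) are the assumptions behind the numbers above.

    def allowed_downtime_minutes(uptime_percent, window_minutes):
        # downtime = total_time * (1 - uptime_fraction), expressed in minutes
        return window_minutes * (1 - uptime_percent / 100)

    YEAR_MINUTES = 365 * 24 * 60   # minutes in a 365-day year
    MONTH_MINUTES = 30 * 24 * 60   # minutes in a 30-day month

    for sla in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(sla, YEAR_MINUTES)
        per_month = allowed_downtime_minutes(sla, MONTH_MINUTES)
        print(f"{sla}%: ~{per_year:.1f} min/year (~{per_year / 60:.1f} h), ~{per_month:.1f} min/month")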

Use these numbers when evaluating SLAs. If your business can’t tolerate eight hours of downtime per year, 99.9% might be fine. If a single hour of downtime costs you thousands of dollars, demand 99.99% or build redundancy that doesn’t rely on one provider’s SLA. Don’t forget partial outages: degraded performance that reduces revenue isn’t always counted as “downtime” by an SLA. Read the definitions section closely.

4. Real-world uptime: how to measure and dispute the vendor’s numbers

Vendors report uptime from their perspective. That often aligns with server reachability from their monitoring points, not your users’ experience. You need independent monitoring from multiple regions, synthetic transactions (login, checkout), and real-user monitoring if possible. Tools like UptimeRobot, Pingdom, Datadog synthetic checks, or running simple cron-based checks from multiple cloud regions can reveal issues the hosting provider’s internal metrics miss.

When calculating observed uptime: keep a time series log of checks, count failures (be cautious of transient network flaps), and compute uptime = (total_checks - failed_checks)/total_checks. For continuous monitoring, do it by time slice: measure total minutes in window minus minutes unavailable, then convert to percentage. Save raw logs and screenshots for disputes.
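
As a sketch of the check-counting approach, here is one way to compute observed uptime from a simple log written by your own external checks. The CSV layout (timestamp, up/down status) and the one-minute check interval are assumptions; adjust them to match whatever your monitoring actually records.

    import csv

    def observed_uptime(log_path="uptime_checks.csv", check_interval_minutes=1):
        # Each row is "ISO timestamp,up|down", written by an external check once per interval.
        # uptime = (total_checks - failed_checks) / total_checks, scaled to a percentage.
        total = failed = 0
        with open(log_path) as fh:
            for timestamp, status in csv.reader(fh):
                total += 1
                if status.strip().lower() != "up":
                    failed += 1
        if total == 0:
            raise ValueError("no checks recorded in the window")
        uptime_percent = (total - failed) / total * 100
        downtime_minutes = failed * check_interval_minutes
        return uptime_percent, downtime_minutes

    if __name__ == "__main__":
        uptime, downtime = observed_uptime()
        print(f"Observed uptime: {uptime:.3f}% (~{downtime} minutes unavailable)")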

Contrarian viewpoint: chasing SLA credits is often a poor use of time. Providers frequently offer service credits with strict rules and caps. If you face repeated, business-impacting downtime, invest in redundancy or switch providers rather than litigate a small percentage credit. Treat SLAs as a floor, not a remedy. Use them to set expectations, not to recover lost revenue.

5. Designing for the outage you’ll actually see: recovery, not mythical perfection

Don’t design for never-failing. Design for fast recovery. Define RTO (recovery time objective) and RPO (recovery point objective) for each system. RTO answers how long your business can tolerate being down; RPO says how much data loss you can accept. For an ecommerce checkout, RTO might be 15 minutes, RPO zero with synchronous replication. For a blog, RTO could be several hours and RPO a day.

Failover strategies matter. DNS-level failover depends on TTLs and propagation, so keep TTLs low if you plan active failover, but be mindful of caching on client DNS resolvers. Use health checks and automated failover for databases and load balancers. Consider multi-region deployments or multi-cloud if your revenue justifies the complexity. Use Infrastructure as Code and immutable deployments to make recovery repeatable. Blue-green or canary releases reduce the blast radius of a bad deployment.
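
To make the health-check idea concrete, here is a minimal sketch of a failover trigger: poll an endpoint and, after a few consecutive failures, hand off to whatever promotion step your runbook documents. The health URL, threshold, and promote_standby() are hypothetical placeholders, not any particular provider’s API.

    import time
    import urllib.request

    HEALTH_URL = "https://example.com/healthz"  # hypothetical health endpoint
    FAILURE_THRESHOLD = 3                       # consecutive failures before failing over
    POLL_SECONDS = 30

    def is_healthy(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def promote_standby():
        # Placeholder: in practice this runs the documented runbook step, e.g. a provider
        # API call, database replica promotion, or a DNS record update.
        print("Failing over: promoting standby")

    failures = 0
    while True:
        failures = 0 if is_healthy(HEALTH_URL) else failures + 1
        if failures >= FAILURE_THRESHOLD:
            promote_standby()
            break
        time.sleep(POLL_SECONDS)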

Also plan for human factors. A robust incident runbook that names roles, escalation paths, and step-by-step recovery commands cuts mean time to repair (MTTR). Practice a disaster recovery drill quarterly. You’ll find gaps in permissions, missing keys, and forgotten steps long before they cost you.

6. The customer-protecting clauses and contract traps to watch for

Contracts sound legal but hide operational realities. Read the SLA fine print: does “uptime” exclude scheduled maintenance windows? How is downtime measured - from the provider’s monitoring, or from your end users? Is there an exclusion for “third-party network outages”? What is the maximum credit and how quickly do you have to report incidents? Many contracts force you to file claims within a short window and cap credits at a fraction of annual fees.

Negotiate practical changes: require an agreed maintenance notification period, limit scheduled maintenance windows, and define “availability” with synthetic transactions tailored to your critical flows. Ask for a runbook handoff clause: the vendor must provide updated runbooks and architecture diagrams at the end of the contract. Include audit rights for logs if you suspect ongoing problems. If you rely on a single-provider PaaS, try to retain account-level control for export of backups and DNS to avoid vendor lock-in during disputes.

Contrarian view: a very strict SLA on paper doesn’t guarantee better uptime. I’ve seen teams with strict SLAs still fail because operational maturity was lacking. Prioritize evidence of good incident management - postmortems, public uptime history, transparent communication - in addition to contract language.

7. Your 30-Day Action Plan: fix the handoff, validate uptime, and reduce risk

This is a practical checklist you can run through in the first month after a handoff. Do these steps in order and log everything. Assign one owner for the plan and set deadlines.

  1. Day 1-3 — Inventory and secure access: collect the full inventory, transfer credentials to a secure vault, add 2FA and backup recovery methods, and confirm admin roles.
  2. Day 4-7 — Backup verification: run a test restore to a staging environment from the most recent backup. Document RPO and adjust backup frequency if needed.
  3. Day 8-12 — Monitoring setup: deploy independent synthetic checks from multiple regions, set alert thresholds, and test paging to on-call people. Log baseline metrics for the next 30 days.
  4. Day 13-16 — SLA and contract review: read the SLA definitions, mark exceptions, and request amendments for maintenance windows and measurement methods if necessary.
  5. Day 17-20 — Incident runbook and roles: create or update a runbook for common failures, name primary and secondary responders, and run a tabletop drill to exercise the process.
  6. Day 21-24 — Failover tests: practice a controlled failover (DNS, database replica promotion, or region failover). Measure RTO and note gaps; a timing sketch follows this list.
  7. Day 25-27 — Performance and load checks: run a load test in staging that simulates peak traffic patterns; validate autoscaling and queue back-pressure behavior.
  8. Day 28-30 — Post-handoff review: compile logs, update documentation, and schedule quarterly DR drills. If issues above a tolerable threshold appeared, start vendor remediation or migration planning.
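
For step 6, the failover drill is only useful if you actually record the time to recovery. Here is a minimal RTO timer, started the moment you trigger the failover; the health URL and polling interval are placeholders for your own critical endpoint.

    import time
    import urllib.request

    def measure_rto(url="https://example.com/healthz", poll_seconds=10):
        # Polls the endpoint until it answers HTTP 200 again and reports the elapsed minutes.
        start = time.monotonic()
        while True:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status == 200:
                        break
            except OSError:
                pass
            time.sleep(poll_seconds)
        rto_minutes = (time.monotonic() - start) / 60
        print(f"Measured RTO: {rto_minutes:.1f} minutes")

    if __name__ == "__main__":
        measure_rto()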

After 30 days you should have: verified backups, independent uptime data, an accessible runbook, tested failover, and a clear contract baseline. If any of those boxes remain unchecked, treat the environment as high risk and limit critical operations until you fix the holes.

Final note: uptime guarantees are useful for setting baseline expectations but they don’t replace good engineering. Protect your users and revenue by demanding transparent metrics, independent monitoring, and practical recovery plans. If a vendor balks at these requests, consider that a stronger signal than any percentage in an SLA.