# Buddy CRM — Incident Response Playbooks

Last updated: 2026-05-06

---

## General Process

```
1. DETECT      → User report, monitoring alert, error in logs
2. ASSESS      → Which service is affected? What's the blast radius?
3. COMMUNICATE → Tell affected users
4. MITIGATE    → Apply a workaround or roll back
5. RESOLVE     → Confirm fix is working
6. POSTMORTEM  → Document what happened and how to prevent recurrence
```

If you don't know which platform is failing yet, run through the **status pages** at the bottom of this doc first.

---

## Playbook 1 — Netlify down

**Symptoms:** entire site returns 5xx or is unreachable; pages don't load (not just API calls); https://www.netlifystatus.com/ shows an incident.

**Impact:** total outage. No one can access any page or API endpoint.

**Diagnosis:**
1. https://www.netlifystatus.com/
2. Try the site from a different network
3. Netlify Dashboard → Deploys for any recent failed deploy

**If Netlify itself is down:**
- Wait it out. The site is hosted entirely on Netlify, so there is nothing to fail over to
- Communicate to users that it's a third-party issue

**If a bad deploy caused it:**
1. Netlify → Deploys → find the last green deploy
2. Open it and click "Publish deploy" (instant rollback)
3. Verify the site

---

## Playbook 2 — Supabase down or paused

**Symptoms:** pages render (HTML/CSS/JS work) but show empty data or errors; API calls return 500; Netlify function logs show Supabase connection errors.

**Impact:** all data-driven features broken. Auth still works (Microsoft SSO is independent).

**Diagnosis:**
1. https://status.supabase.com/
2. Netlify function logs for connection errors
3. Supabase Dashboard → SQL Editor → run `SELECT 1` to confirm the project is reachable

**If platform-wide:** wait for Supabase to resolve. The app shows empty states but doesn't crash.

**If the project is paused:**
- Free-tier Supabase projects pause after 7 days of inactivity
- Supabase Dashboard → Project → click "Restore" — takes a few minutes

**If credentials are wrong:**
1. Compare Netlify env vars `SUPABASE_URL` / `SUPABASE_SERVICE_KEY` against Supabase Dashboard → Settings → API
2. Update if they don't match → trigger a redeploy

---

## Playbook 3 — Microsoft Entra ID down

**Symptoms:** every page shows a blank white screen (the page is hidden during auth and never revealed); browser console shows MSAL errors; https://status.office.com/ shows an Azure AD / Entra ID incident.

**Impact:** total outage. Nobody can authenticate.

**Diagnosis:**
1. https://status.office.com/
2. Try https://login.microsoftonline.com/ directly
3. Browser console for `[Auth]` errors

**If Microsoft is down:** wait. No workaround — SSO is the only auth method. Already-authenticated users with tokens in `sessionStorage` may keep working until tokens expire (~1 hour).

**If the App Registration was modified or deleted:** escalate to Max / Platform team. Verify:
- The app (Client ID `3530576b-7189-4e35-8070-6d23c8f49fc0`) still exists
- Redirect URIs include the deployed Netlify URL
- "ID tokens" is enabled
- The required Microsoft Graph scopes (`User.Read`, `Mail.ReadWrite`, `Calendars.Read`, `offline_access`) are still present and admin-consented

If recreated, update `AZURE_CLIENT_ID` / `AZURE_TENANT_ID` in Netlify env vars and the hardcoded fallbacks in [netlify/functions/supabase-client.js](../netlify/functions/supabase-client.js) and [shared/auth.js](../shared/auth.js).

---

## Playbook 4 — Google Gemini down

**Symptoms:** AI features (Sales Buddy, mobile photo→lead) error or spin forever; Netlify function logs show `gemini.js` errors; non-AI features work fine.

**Impact:** Low — only AI features affected.

**Diagnosis:**
1. https://status.cloud.google.com/
2. Netlify function logs for `gemini.js`
3. Test the API key directly with `curl`, bypassing the Netlify function

**If Gemini is down:** wait. Users can use everything else.

**If the key is invalid:**
1. Google Cloud Console → APIs & Services → Credentials
2. Generate a new key
3. Update `GEMINI_API_KEY` in Netlify → trigger redeploy

---

## Playbook 5 — Mailchimp down or key invalid

**Symptoms:** Audience Buddy fails or returns errors; https://status.mailchimp.com/ shows an incident.

**Impact:** Low — only the audience sync feature.

**Diagnosis:**
1. https://status.mailchimp.com/
2. Netlify function logs for `mailchimp.js`
3. Test the Mailchimp ping endpoint

**If the key is invalid:**
1. Mailchimp → Account → Extras → API keys → generate a new one
2. Update `MAILCHIMP_API_KEY` in Netlify
3. Verify `MAILCHIMP_SERVER_PREFIX` matches the data-center suffix at the end of the new key (e.g. a key ending in `-us1` → `us1`)
4. Trigger redeploy

---

## Playbook 6 — Outlook email/calendar sync broken

This subsystem has the most moving parts. Symptoms vary.

### 6a — Auto-logging stopped working entirely

**Symptoms:** new outbound/inbound emails are no longer landing on Lead/Opp activity timelines; new meetings aren't appearing in the calendar.

**Diagnosis:**
1. Supabase: `SELECT MAX(run_at) FROM email_sync_log;` — if the most recent run is hours stale, the scheduled function isn't running. Check Netlify Functions logs for `email-sync.js` and `calendar-sync.js`
2. Per-BDM: `SELECT email, last_sync_at, last_sync_error FROM user_graph_tokens;` — `last_sync_error` will tell you why a particular BDM is stuck
3. If you see crypto / decryption errors in the logs, the `EMAIL_TOKEN_KEY` env var has likely been changed or lost
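
The first two checks can be run in one pass from the Supabase SQL Editor. This is a minimal sketch that uses only the tables and columns named above. It assumes `run_at` is a timestamp column and that `last_sync_error` is `NULL` for a healthy BDM; neither is confirmed here.

```sql
-- Age of the most recent scheduled sync run.
-- A value of several hours suggests the scheduled function is not running.
SELECT MAX(run_at)         AS last_run,
       NOW() - MAX(run_at) AS time_since_last_run
FROM email_sync_log;

-- BDMs with a recorded sync error, most recently synced first.
-- Assumption: last_sync_error is NULL when the last sync succeeded.
SELECT email, last_sync_at, last_sync_error
FROM user_graph_tokens
WHERE last_sync_error IS NOT NULL
ORDER BY last_sync_at DESC NULLS LAST;
```

If the first query shows a recent run but the second returns rows, the scheduler is probably healthy and the problem is specific to those accounts.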

**If `EMAIL_TOKEN_KEY` was rotated incorrectly:** every encrypted refresh token in `user_graph_tokens` is now unreadable. Either restore the previous key and redeploy, or have every BDM re-OAuth via the auth-outlook flow.

### 6b — Layer 2 webhooks dropped

**Symptoms:** moving a message into the `@Buddy/Auto-Log` folder doesn't surface it in `outlook-sync.html`; subscriptions in `email_subscriptions` show `expires_at` in the past.

**Diagnosis:** `SELECT user_email, expires_at FROM email_subscriptions ORDER BY expires_at;`. Subscriptions expire every ~3 days. The poller renews any with `expires_at` within 24h, so if everything's expired, the renewal step has been broken for >24h.

**Fix:** check Netlify logs for the renewal failure, then either let the next poller tick recover or manually re-subscribe the affected user via `auth-outlook.js`'s flow.
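
To see at a glance which subscriptions have already lapsed and which ones the poller should be picking up, the diagnosis query can be extended with a status column. This is a sketch, assuming `expires_at` is a timestamp and reusing the 24-hour renewal window described above. The `state` labels are illustrative and do not exist in the schema.

```sql
-- Classify each subscription against the poller's 24h renewal window.
SELECT user_email,
       expires_at,
       CASE
         WHEN expires_at < NOW() THEN 'expired'
         WHEN expires_at < NOW() + INTERVAL '24 hours' THEN 'due for renewal'
         ELSE 'healthy'
       END AS state
FROM email_subscriptions
ORDER BY expires_at;
```

Rows that stay `due for renewal` across poller ticks point at the renewal step. If every row is `expired`, renewal has been failing for more than 24 hours.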

### 6c — Calendar v1: meetings aren't rendering on /calendar

This is the active blocker as of 2026-05-06. See `docs/handover-2026-05-06.md` § 1.1 for the full investigation. Likely fix: add `Prefer: outlook.timezone="UTC"` header to the Graph fetch in `calendar-sync.js:152-156`.

---

## Playbook 7 — DNS / SSL issues

**Symptoms:** SSL certificate errors; site unreachable but Netlify status is fine; DNS lookup fails.

**Diagnosis:**
1. Try the Netlify subdomain directly (e.g. `https://<site>.netlify.app`)
2. If using a custom domain, check DNS records at the registrar
3. Netlify → Domain management → HTTPS for cert status

**If SSL expired:**
- Netlify auto-renews Let's Encrypt certs, so an expired cert means auto-renewal failed
- Trigger a manual renewal: Netlify → Domain management → HTTPS → "Renew certificate"

**If DNS is misconfigured:**
- Netlify wants either a CNAME to `<site>.netlify.app` or an A record to Netlify's load balancer
- DNS changes can take up to 48 hours to propagate (usually much faster)

---

## Playbook 8 — Credential compromise / rotation

### `SUPABASE_SERVICE_KEY` compromised
1. Supabase Dashboard → Settings → API. **Note:** Supabase doesn't let you rotate the service key in place. If truly compromised, contact Supabase support, or migrate to a new project
2. Update `SUPABASE_SERVICE_KEY` in Netlify → redeploy

### `GEMINI_API_KEY` compromised
1. Google Cloud Console → APIs & Services → Credentials → delete old → create new
2. Update `GEMINI_API_KEY` → redeploy

### `MAILCHIMP_API_KEY` compromised
1. Mailchimp → Account → Extras → API keys → disable old → create new
2. Update `MAILCHIMP_API_KEY` → redeploy

### `MS_CLIENT_SECRET` compromised
1. Coordinate with Max / Platform team to rotate the secret on the Azure App Registration
2. Update `MS_CLIENT_SECRET` in Netlify → redeploy
3. Existing stored refresh tokens still work — only the auth-code exchange uses this secret

### `BUDDY_SERVICE_KEY` compromised
1. Generate a new random string
2. Update `BUDDY_SERVICE_KEY` in Netlify env vars
3. Update every consumer that holds the old key (e.g. `.mcp.json` in the original Buddy repo, any other service caller)

### `EMAIL_TOKEN_KEY` exposure
This one's load-bearing. If exposed:
- The encrypted refresh tokens stored in `user_graph_tokens` can now be decrypted by anyone holding the old key plus a Supabase row dump
- Best path: write a migration that reads each `user_graph_tokens` row with the old key, re-encrypts with a new key, then update `EMAIL_TOKEN_KEY` and redeploy
- Worst path: clear `user_graph_tokens` and have every BDM re-OAuth
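
Whichever path is chosen, it helps to know which BDMs are affected first. This is a small sketch using only the `user_graph_tokens` columns mentioned in Playbook 6a. It does not read or modify the encrypted token values.

```sql
-- Every BDM with a stored token row. These are the accounts that must be
-- re-encrypted (best path) or asked to re-OAuth (worst path).
SELECT email, last_sync_at
FROM user_graph_tokens
ORDER BY email;
```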

### Azure App Registration compromised
- Buddy CRM uses a public-client flow on the browser (MSAL.js) — no client secret in the SPA
- The server side (`auth-outlook.js`) uses `MS_CLIENT_SECRET`, which is rotatable (above)
- If the entire App Registration is compromised: escalate to Max / Platform team to delete and recreate it, then update the `AZURE_CLIENT_ID` / `AZURE_TENANT_ID` env vars and the hardcoded fallbacks in `shared/auth.js` and `supabase-client.js`

---

## Playbook 9 — Marc unavailable (bus factor)

Marc is the sole developer and primary administrator. Other Now NZ staff can use the company password manager (please contact Marketing for the Account Access and Passwords sheet) to perform basic ops.

### What someone with credential access can do

1. **Roll back a bad deploy** — Netlify → Deploys → previous green deploy → Publish deploy
2. **Check function logs** — Netlify → Functions → click the failing function
3. **Toggle a feature off (emergency)** — Supabase → SQL Editor → `UPDATE feature_flags SET enabled = false WHERE key = 'feature_name';`
4. **Restart Supabase** if paused — Supabase → Project → Restore
5. **Update an env var** — Netlify → Site configuration → Environment variables → edit → trigger redeploy
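
For item 3, it is safer to look at the flags first and confirm the change afterwards. This is a sketch for the Supabase SQL Editor (PostgreSQL) that uses only the `key` and `enabled` columns shown above. `'feature_name'` remains a placeholder for the real flag key.

```sql
-- Current state of every flag.
SELECT key, enabled FROM feature_flags ORDER BY key;

-- Turn one flag off and echo the changed row.
-- If no row comes back, the key did not match anything.
UPDATE feature_flags
SET enabled = false
WHERE key = 'feature_name'
RETURNING key, enabled;
```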

### What they CAN'T do without code access

- Fix bugs in application code
- Add new features
- Change database schema
- Modify auth wiring

### Preparation
- Make sure staff know to contact Marketing for the Account Access and Passwords sheet
- Source code lives on GitHub (private) and is mirrored weekly to [SharePoint](https://nownz.sharepoint.com/:f:/s/Tools/IgCJmDrcuGh5Q4cFoyy7dnVEAdL2H95BDPF2Pi-xFQjCyrM?e=CpyqwP)
- This `docs/` directory contains everything needed to understand the platform — start with `start-here.md` and the relevant playbook above

---

## Playbook 10 — Gmail / notifications down

**Symptoms:** Skill Queue tasks complete or fail but no email is sent; Netlify function logs show SMTP connection errors in `send-notification.js`.

**Impact:** Very low — only notifications affected. App features and skill execution continue normally.

**Diagnosis:**
1. https://www.google.com/appsstatus/dashboard/
2. Verify `GMAIL_USER` + `GMAIL_APP_PASSWORD` are set in Netlify
3. Netlify logs for `send-notification.js`

**If the App Password was revoked or expired:**
1. Sign in to the notifications Gmail account
2. Security → 2-Step Verification → App passwords → generate new
3. Update `GMAIL_APP_PASSWORD` → trigger redeploy

---

## Playbook 11 — Skill queue stuck

**Symptoms:** tasks sit in `pending` or `in_progress` indefinitely; `skill-queue.html` shows stale tasks; no completion notifications.

**Impact:** medium — automated tasks don't execute. Manual features unaffected.

**Diagnosis:**
1. Check `skill-queue.html` for current task statuses
2. Confirm the skill-queue-worker Claude Code agent is running (it's not in this repo)
3. `SELECT status, COUNT(*) FROM skill_queue_tasks GROUP BY status;`

**Tasks stuck in `in_progress`:**
The worker most likely crashed mid-task. Reset any task that has been `in_progress` for more than 30 minutes:
```sql
UPDATE skill_queue_tasks
SET status = 'pending', started_at = NULL
WHERE status = 'in_progress'
  AND started_at < NOW() - INTERVAL '30 minutes';
```
Then restart the worker.
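
Before running the reset, the same filter can be used read-only to preview which tasks it would touch. This is a sketch that relies only on the `status` and `started_at` columns from the statement above.

```sql
-- Preview: tasks the reset would move back to 'pending'.
SELECT status, started_at, NOW() - started_at AS running_for
FROM skill_queue_tasks
WHERE status = 'in_progress'
  AND started_at < NOW() - INTERVAL '30 minutes'
ORDER BY started_at;
```

If a listed task is legitimately long-running, widen the interval in both statements rather than resetting it.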

**Tasks stuck in `pending`:**
The worker isn't running. Start the skill-queue-worker. Verify `BUDDY_SERVICE_KEY` is correct on the worker side.

**A skill is disabled:**
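```sql
-- List every skill and whether it is enabled
SELECT name, enabled FROM skill_queue_skills;
-- Re-enable a skill (replace <skill_name> with the real name)
UPDATE skill_queue_skills SET enabled = true WHERE name = '<skill_name>';
```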

---

## Status Page URLs (bookmark these)

| Service | Status Page |
|---|---|
| Netlify | https://www.netlifystatus.com/ |
| Supabase | https://status.supabase.com/ |
| Microsoft Azure / Entra ID | https://status.office.com/ |
| Google Cloud (Gemini) | https://status.cloud.google.com/ |
| Mailchimp | https://status.mailchimp.com/ |
| Gmail / Google Workspace | https://www.google.com/appsstatus/dashboard/ |
