SwissTennis scheduler — operational runbook

Owner: Casey (DevOps).
Version: 1.0 (2026-05-24).
Status: Adopted. Close-out of GitLab #180 / #181 / #199.
Audience: anyone on-call for the CSD app or the keystone platform fleet.
Related:

  • Spec: docs/specs/swisstennis-api-integration.md (Alex + Rafael).
  • Implementation plan: docs/specs/swisstennis-api-integration-impl-plan.md §5 (Morgan; note the v0.2 correction at the top of §5).
  • Dispatcher source: packages/trpc/src/routers/swisstennisScheduler.ts.
  • Pipeline source: packages/swisstennis/src/pipelines/sync.ts.
  • Observability sources: packages/swisstennis/src/observability/logging.ts, metrics.ts.


1. What it does

The SwissTennis scheduler keeps every CSD tournament with swisstennis_sync_enabled = true in step with the live data on comp.swisstennis.ch.

The substrate is a systemd timer on the keystone platform host, firing every 60 s, that issues an HTTPS POST to a secret-gated tRPC mutation on the CSD app over the public Caddy frontend. The mutation is the heartbeat; all per-tournament decision logic (is this tournament due? in-window or pre-window? final-pull or auto-stop?) lives inside the dispatcher and reads current state from Postgres.

┌─────────────────────────────────────────┐         ┌───────────────────────────────────┐
│  keystone host                          │         │  CSD app (Caddy + Next.js)        │
│                                         │         │                                   │
│  keystone-cron-csd-<env>-               │  HTTPS  │  POST /api/trpc/                  │
│  http-trigger.timer  ─── 60s ──▶ .service ───────▶│   swisstennisScheduler.           │
│                                         │  X-Sch- │   dispatchScheduler               │
│  /srv/apps/csd/<env>/cron/              │  -Secret│                                   │
│  swisstennis-scheduler.sh               │         │  → reads tournaments table        │
│   curl -fsS -X POST                     │         │  → resolves cadence per row       │
│   -H "X-Scheduler-Secret: $SECRET"      │         │  → invokes syncTournament(...)    │
│   -d '{}' https://csd-<env>.wagen.io/…  │         │   for each due tournament         │
└─────────────────────────────────────────┘         └───────────────────────────────────┘

1.1 Cadence rules (one decision per tournament per tick)

The dispatcher resolves each candidate tournament against the table below. Implementation: resolveCadence in the dispatcher source.

| Tournament state | Decision |
|---|---|
| swisstennis_sync_enabled = false OR swisstennis_tournament_id IS NULL OR swisstennis_sync_stopped_at IS NOT NULL | Filtered out before cadence resolution; never considered. |
| Now < start_date 00:00 Europe/Zurich (pre-window) | Due if last_synced_at IS NULL OR (now − last_synced_at) ≥ 15 min; else skip. |
| Now in [start_date 00:00, end_date 23:59] Europe/Zurich (in-window) | Due if last_synced_at IS NULL OR (now − last_synced_at) ≥ 60 s; else skip. |
| Now in (end_date 23:59, end_date 23:59 + 1 h) Europe/Zurich (after-window-grace) | Skip: silent grace window before the final pull. |
| Now ≥ end_date 23:59 + 1 h Europe/Zurich AND not yet stopped | Due once. The pipeline performs one final pull; the next cycle's cadence resolver returns auto_stop and stamps swisstennis_sync_stopped_at = now(). |
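
The table can be read as a pure function of the clock and the tournament row. The following is a minimal sketch, not the actual resolveCadence. The input shape, the "due" label, and the rule used to detect that the final pull already ran (last_synced_at at or after end_date 23:59 + 1 h) are assumptions. skip_cadence and auto_stop are the decision names used elsewhere in this runbook. Window boundaries are assumed to be converted from Europe/Zurich to instants before the call, and the candidate filter in the first table row is assumed to have run already.

```typescript
// Hypothetical sketch of the cadence table; not the dispatcher's actual code.
type CadenceDecision = "due" | "skip_cadence" | "auto_stop";

interface CadenceInput {
  now: Date;
  lastSyncedAt: Date | null;
  windowStart: Date; // start_date 00:00 Europe/Zurich as an instant (assumed pre-converted)
  windowEnd: Date; // end_date 23:59 Europe/Zurich as an instant (assumed pre-converted)
}

const PRE_WINDOW_INTERVAL_MS = 15 * 60 * 1000;
const IN_WINDOW_INTERVAL_MS = 60 * 1000;
const GRACE_MS = 60 * 60 * 1000;

// True when the tournament was never synced or the interval has fully elapsed.
function elapsedAtLeast(now: Date, last: Date | null, intervalMs: number): boolean {
  return last === null || now.getTime() - last.getTime() >= intervalMs;
}

function resolveCadenceSketch(input: CadenceInput): CadenceDecision {
  const { now, lastSyncedAt, windowStart, windowEnd } = input;
  const t = now.getTime();
  const finalPullAt = windowEnd.getTime() + GRACE_MS;

  if (t < windowStart.getTime()) {
    // Pre-window: at most one sync every 15 minutes.
    return elapsedAtLeast(now, lastSyncedAt, PRE_WINDOW_INTERVAL_MS) ? "due" : "skip_cadence";
  }
  if (t <= windowEnd.getTime()) {
    // In-window: at most one sync every 60 seconds.
    return elapsedAtLeast(now, lastSyncedAt, IN_WINDOW_INTERVAL_MS) ? "due" : "skip_cadence";
  }
  if (t < finalPullAt) {
    // Silent grace window before the final pull.
    return "skip_cadence";
  }
  // ASSUMPTION: a sync stamped at or after the final-pull instant means the
  // final pull already ran, so the next decision is auto_stop.
  const finalPullDone = lastSyncedAt !== null && lastSyncedAt.getTime() >= finalPullAt;
  return finalPullDone ? "auto_stop" : "due";
}
```

Keeping the resolver free of I/O means every row of the table can be checked with fixed timestamps, which is the cheapest way to confirm a boundary case (for example, whether end_date 23:59 itself counts as in-window).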

Per-cycle concurrency is capped at MAX_TOURNAMENTS_PER_CYCLE = 5. Excess due tournaments roll over to the next tick (visible in the response body as cycle.rolledOver).
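
As an illustration of the cap, the sketch below splits a list of due tournaments into the slice dispatched this tick and the rolled-over count. The helper name and the first-come ordering are assumptions; this runbook does not document how the dispatcher prioritises due tournaments.

```typescript
// Value taken from this runbook; the helper itself is hypothetical.
const MAX_TOURNAMENTS_PER_CYCLE = 5;

interface CycleSlice<T> {
  dispatched: T[];
  rolledOver: number;
}

// Keeps the first `cap` due tournaments and counts the remainder.
// ASSUMPTION: input order is the dispatch priority.
function applyCycleCap<T>(due: readonly T[], cap: number = MAX_TOURNAMENTS_PER_CYCLE): CycleSlice<T> {
  const dispatched = due.slice(0, cap);
  return { dispatched, rolledOver: due.length - dispatched.length };
}
```

With seven due tournaments, five are dispatched and cycle.rolledOver would read 2. The two leftovers stay due and are picked up on the following tick.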


2. Endpoint contract

| Field | Value |
|---|---|
| URL (test) | https://csd-test.wagen.io/api/trpc/swisstennisScheduler.dispatchScheduler |
| URL (prod) | https://csd.wagen.io/api/trpc/swisstennisScheduler.dispatchScheduler |
| Method | POST |
| Body | {} (the dispatcher accepts an empty object; the production keystone wrapper does not send the optional now field) |
| Authentication header | X-Scheduler-Secret: <value of SWISSTENNIS_SCHEDULER_SECRET on the CSD app> |
| Content type | application/json |
| Expected status | 200 in steady state. The dispatcher signals problems via the response body, not via HTTP status; the one exception is a bad secret, which returns 401 (see §2.1). |
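
For a manual probe from a workstation, the contract above can be assembled as follows. This is a sketch only: buildDispatchRequest and its types are hypothetical names, and the production path remains the shell wrapper with curl.

```typescript
// Hypothetical helper for a manual probe; not part of the CSD codebase.
type CsdEnv = "test" | "prod";

interface DispatchRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// URLs copied from the endpoint contract table.
const DISPATCH_URLS: Record<CsdEnv, string> = {
  test: "https://csd-test.wagen.io/api/trpc/swisstennisScheduler.dispatchScheduler",
  prod: "https://csd.wagen.io/api/trpc/swisstennisScheduler.dispatchScheduler",
};

function buildDispatchRequest(env: CsdEnv, schedulerSecret: string): DispatchRequest {
  if (schedulerSecret.length === 0) {
    throw new Error("scheduler secret must not be empty");
  }
  return {
    url: DISPATCH_URLS[env],
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Scheduler-Secret": schedulerSecret,
      },
      body: JSON.stringify({}), // empty object, as the wrapper sends
    },
  };
}
```

The result can be passed to fetch(url, init) on any runtime that provides fetch. Keeping request construction separate from the network call makes it possible to check the header and body without touching either environment.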

2.1 Response shape

{
  "status": "ok",
  "cycle": {
    "candidatesConsidered": 3,
    "dueThisCycle": 1,
    "succeeded": 1,
    "failed": 0,
    "skippedByCadence": 2,
    "autoStopped": 0,
    "rolledOver": 0
  }
}

Misconfigured state (only when SWISSTENNIS_SCHEDULER_SECRET is unset or empty on the CSD app side):

{ "status": "misconfigured" }

Bad secret returns HTTP 401 with body {"error":{"json":{"message":"Invalid scheduler secret.", ... }}} — this is the only non-200 the keystone wrapper should ever see in steady state.
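
The status and body combinations in this section map onto the troubleshooting entries in §6. A minimal sketch of that mapping follows; the outcome labels and the function name are assumptions, and the body is assumed to have the flat shape shown in §2.1 (if the tRPC envelope wraps it, unwrap it before calling).

```typescript
// Hypothetical classifier; labels are illustrative, not emitted by the app.
type TickOutcome =
  | "ok"
  | "ok_with_failures" // HTTP 200, cycle.failed > 0 (see §6.5)
  | "misconfigured" // HTTP 200, status "misconfigured" (see §5, §6.4)
  | "bad_secret" // HTTP 401 (see §6.1)
  | "router_missing" // HTTP 404 (see §6.2)
  | "server_error" // HTTP 5xx (see §6.3)
  | "unexpected";

function classifyTick(httpStatus: number, body: unknown): TickOutcome {
  if (httpStatus === 401) return "bad_secret";
  if (httpStatus === 404) return "router_missing";
  if (httpStatus >= 500) return "server_error";
  if (httpStatus !== 200 || typeof body !== "object" || body === null) {
    return "unexpected";
  }
  const parsed = body as { status?: unknown; cycle?: { failed?: unknown } };
  if (parsed.status === "misconfigured") return "misconfigured";
  if (parsed.status === "ok") {
    const failed = parsed.cycle?.failed;
    return typeof failed === "number" && failed > 0 ? "ok_with_failures" : "ok";
  }
  return "unexpected";
}
```

Note that an HTTP 200 alone is not a health signal: both "misconfigured" and a cycle with failed > 0 arrive with status 200, so the body has to be inspected.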

2.2 Secret contract

| Side | Where | Source of truth |
|---|---|---|
| CSD app | SWISSTENNIS_SCHEDULER_SECRET in the running container's env | Set via the keystone deploy contract (see gitlab.com/wagen-public/keystone-public/-/blob/main/docs/runbooks/12-platform-deploy-target.md on the keystone side); the value is also written to /srv/apps/csd/<env>/app.env on the host. |
| keystone timer wrapper | sourced from /srv/apps/csd/<env>/app.env at curl time | Same file the container reads: single source of truth. |
| Rotation | re-deploy the CSD app with a new value | The keystone wrapper picks it up on the next tick because it sources the env file on every fire. No restart needed on the timer side. |

The secret is compared in constant time on the CSD side via SHA-256 + timingSafeEqual — see secretsMatch in the dispatcher source.
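
The comparison technique can be sketched as below. This is an illustration of SHA-256 followed by timingSafeEqual using Node's built-in crypto module, not a copy of secretsMatch; the function name and signature are assumptions.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical sketch of the technique; the real secretsMatch may differ.
function secretsMatchSketch(provided: string, expected: string): boolean {
  // Hashing both sides yields two 32-byte buffers regardless of input length.
  // timingSafeEqual throws when the buffers differ in length, so comparing
  // digests avoids that and does not reveal the secret's length.
  const providedDigest = createHash("sha256").update(provided, "utf8").digest();
  const expectedDigest = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

A plain === comparison can return as soon as the first differing character is found, which leaks timing information about how much of the guess was correct. The digest comparison takes the same time for any pair of inputs.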


3. Keystone-side units

| Unit | Purpose |
|---|---|
| keystone-cron-csd-test-http-trigger.timer | Fires every 60 s on csd-test. |
| keystone-cron-csd-test-http-trigger.service | Runs the wrapper script when the timer fires. |
| keystone-cron-csd-prod-http-trigger.timer | Fires every 60 s on csd-prod. |
| keystone-cron-csd-prod-http-trigger.service | Runs the wrapper script when the timer fires. |
| /srv/apps/csd/test/cron/swisstennis-scheduler.sh | The wrapper (curl with the secret header). |
| /srv/apps/csd/prod/cron/swisstennis-scheduler.sh | Same, prod. |

The timer uses OnUnitActiveSec=60s (no pile-up if a tick slows). No retries on the timer side — the wrapper exits non-zero on non-2xx, the next tick is the retry. Cadence-level idempotency means a missed tick costs at most 60 s of staleness.

Provisioning script on the keystone side: scripts/16b-app-http-cron-timer.sh (sibling of the SQL-shaped 16-app-cron-timer.sh). Not maintained from CSD.

3.1 Current deployment state (2026-05-24)

  • csd-test: fully wired. Timer running, wrapper sourcing the secret, endpoint returning {"status":"ok","cycle":{...}} on every fire.
  • csd-prod: timer wired and firing, but the deployed image (f0fed074-prod at time of writing) predates the swisstennisScheduler router. Wrapper logs HTTP 404 until the next manual prod deploy lands the router. No further keystone-side action needed — the next prod deploy resolves it automatically.

4. Inspecting logs

4.1 Keystone-side (timer + wrapper)

From the keystone platform host (kst1.wagen.io):

journalctl -u keystone-cron-csd-test-http-trigger.service --since today
journalctl -u keystone-cron-csd-prod-http-trigger.service --since today

Each tick is one journal entry — typically the curl exit code and the HTTP status. Examples:

| Log line | Meaning |
|---|---|
| curl: 200 (or wrapper logs the body) | Tick succeeded. |
| curl: (22) The requested URL returned error: 401 | Secret mismatch (see §6.1). |
| curl: (22) The requested URL returned error: 404 | App image predates the router (see §6.2). |
| curl: (22) The requested URL returned error: 5xx | App-side bug or app down (see §6.3). |
| HTTP 200 with body {"status":"misconfigured"} | App is up and the secret-header check would pass, but the app-side env var is unset (see §6.4). |

4.2 CSD app-side (dispatcher + pipeline)

The CSD app emits structured logs (single-line JSON) for every dispatcher invocation and every pipeline call. Look for the canonical fields documented in logging.ts:

  • correlation_id: one per pipeline invocation; correlates fetcher / merge / audit lines.
  • pipeline_kind: 'one_shot' | 'event_bracket' | 'sync'.
  • triggered_by: 'cron' for the dispatcher path; user UUID otherwise.
  • tournament_id (CSD UUID) and swisstennis_tournament_id (numeric).
  • side: 'swisstennis' for upstream/network/parser failures, 'csd' for merge/persistence/validation failures. This is the field that splits "they broke" from "we broke" in alerts.

How to tail them depends on the platform's log aggregation; the lines are JSON on stdout from the Next.js process container. See the keystone runbook for the canonical path to app logs.
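
Once the raw stdout lines are in hand, the canonical fields can be used to separate "they broke" from "we broke" for the scheduler path. The sketch below is illustrative: the helper names are hypothetical, only the field names come from this runbook, and it assumes that lines carrying a side field describe failures.

```typescript
// Field names from logging.ts as described above; helpers are hypothetical.
type Side = "swisstennis" | "csd";

interface SchedulerLogLine {
  correlation_id?: string;
  triggered_by?: string;
  side?: Side;
  [field: string]: unknown;
}

// Parses single-line JSON log output and silently skips non-JSON lines
// (framework banners, stack traces).
function parseLogLines(raw: string): SchedulerLogLine[] {
  const parsed: SchedulerLogLine[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue;
    try {
      const value: unknown = JSON.parse(trimmed);
      if (typeof value === "object" && value !== null) {
        parsed.push(value as SchedulerLogLine);
      }
    } catch {
      // Not valid JSON; ignore the line.
    }
  }
  return parsed;
}

// Counts cron-triggered lines per side.
// ASSUMPTION: only failure lines carry a `side` field.
function countCronLinesBySide(lines: readonly SchedulerLogLine[]): Record<Side, number> {
  const counts: Record<Side, number> = { swisstennis: 0, csd: 0 };
  for (const line of lines) {
    if (line.triggered_by !== "cron") continue;
    if (line.side === "swisstennis" || line.side === "csd") {
      counts[line.side] += 1;
    }
  }
  return counts;
}
```

Filtering on triggered_by = 'cron' first keeps manual imports by organizers out of the count, so the result reflects the background scheduler only.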

4.3 In-app forensics (the audit table)

Every sync cycle, manual import, and one-shot import writes one row to swisstennis_import_events. Per-tournament audit history is also surfaced in the UI at /<tournamentId>/settings/swisstennis/history (Taylor's #187). Most operator questions can be answered there without journald:

  • Which endpoints were called and what their body-hashes were.
  • Per-cycle summary counts (players_updated, matches_updated, …) and warnings/errors arrays.
  • consecutive_failures_before / consecutive_failures_after on the metadata payload — the most direct signal of degradation.

5. Misconfigured state

The dispatcher returns {"status":"misconfigured"} (HTTP 200) when SWISSTENNIS_SCHEDULER_SECRET is unset or empty on the CSD app side. This is deliberate — it is not an error:

  • The keystone timer keeps firing; nothing degrades.
  • The wrapper's journal entry on the keystone side carries the misconfigured wording (provided the wrapper logs the response body; see §9), so the on-call sees "the timer is fine, the app env var is the issue" at a glance.
  • Manual UI flows (syncNow, importTournament, importEventBracket) do not depend on this env var — they are gated on the user's auth session, not the scheduler secret. Background sync simply stays paused.

To fix: set SWISSTENNIS_SCHEDULER_SECRET via the keystone deploy contract (see §2.2) and redeploy. The next timer tick picks it up automatically.


6. Troubleshooting

6.1 HTTP 401 in the keystone timer log

Cause: the secret the wrapper passes does not match the value the CSD app holds in SWISSTENNIS_SCHEDULER_SECRET.

Diagnose:

  1. On the keystone host: read /srv/apps/csd/<env>/app.env and grep for SWISSTENNIS_SCHEDULER_SECRET. That is the value the wrapper sends.
  2. In the CSD GitLab CI variables (Settings → CI/CD → Variables): find the same key for the matching environment scope. That is what the deploy job writes into the container.
  3. If they differ, rotate: update one to match the other (typically: take the CI variable as authoritative, push the same value into app.env via the keystone deploy contract), then redeploy.

Fix: rotate the secret on both sides via the keystone deploy contract.

6.2 HTTP 404 from the dispatcher endpoint

Cause: the deployed app image predates the swisstennisScheduler router. There is no /api/trpc/swisstennisScheduler.dispatchScheduler route on this image yet.

Diagnose:

  • Check the deployed image tag (visible from the keystone platform or docker ps on the host).
  • Compare against git log -- packages/trpc/src/routers/swisstennisScheduler.ts on main. If the deployed commit predates 0af9cd0 (the swisstennisScheduler tRPC entrypoint), the router is genuinely absent.

Fix: trigger a fresh deploy. No code change needed. csd-prod is in this state at the time of writing (2026-05-24) — next manual prod deploy resolves it.

6.3 HTTP 5xx from the dispatcher endpoint

Cause: app-side bug or the app process is down. The dispatcher returns 200 even on internal failures (the response body carries the cycle summary including failed); a 5xx means the request didn't reach the procedure body.

Diagnose:

  • Check the CSD app container health on the keystone host.
  • Tail the CSD app's stdout for the stack trace at the same wall-clock minute as the 5xx. Use correlation_id to scope to the offending tick; if the request never reached the procedure body there will be no correlation id, so look for the Next.js / Hono framework error instead.

Fix: triage the stack trace. The dispatcher itself is fairly small (see swisstennisScheduler.ts) — almost all logic that can throw is inside syncTournament, and that's wrapped in try/catch that converts errors into cycle.failed++. A 5xx therefore almost always means the bug is in the dispatcher's pre-procedure machinery (auth context, body parsing, DB connection) — not the pipeline itself.

6.4 Status "misconfigured" returned for many consecutive cycles

Cause: SWISSTENNIS_SCHEDULER_SECRET is unset on the CSD app side. See §5.

Fix: set the env var via the keystone deploy contract and redeploy.

6.5 Tournament not syncing despite the timer ticking and the cycle returning ok

Diagnose:

  1. Is sync enabled on that tournament? Open /<tournamentId>/settings, SwissTennis tab. The "Hintergrund-Sync" toggle must be on. If it is off, that's the cause.
  2. Has sync been auto-disabled? Inspect swisstennis_consecutive_failures on the tournament row (or look at recent rows in swisstennis_import_events; the metadata payload carries consecutive_failures_before/after). At N = 5 the pipeline stops sync and stamps swisstennis_sync_stopped_at. Reset path: organizer toggles sync off then on, or zero the counter manually if the underlying cause is fixed.
  3. Is the tournament in the silent grace window? Between end_date 23:59 and end_date 23:59 + 1 h Europe/Zurich the cadence resolver returns skip_cadence deliberately. Wait for the final pull.
  4. Is the tournament past auto-stop? If swisstennis_sync_stopped_at IS NOT NULL the candidate query filters it out before cadence resolution. Manual "Sync now" still works because it bypasses the dispatcher.
  5. Is the cycle dispatching but failing? Check the response body's cycle.failed counter, and the per-tournament audit row at /<tournamentId>/settings/swisstennis/history for the failure reason. Common reasons: endpoint_deprecated (SwissTennis changed something; see §7's alert thresholds), malformed_payload (parser regression), network (transient).
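
The checklist can also be walked programmatically against a tournament row. The sketch below follows the same order; the column names come from this runbook, while the row shape, the finding labels, and the pre-converted windowEnd instant are assumptions.

```typescript
// Hypothetical triage helper mirroring the five diagnostic steps.
interface TournamentSyncRow {
  swisstennis_sync_enabled: boolean;
  swisstennis_sync_stopped_at: Date | null;
  swisstennis_consecutive_failures: number;
  windowEnd: Date; // end_date 23:59 Europe/Zurich as an instant (assumed pre-converted)
}

type SyncFinding =
  | "sync_disabled" // step 1
  | "auto_disabled_after_failures" // step 2
  | "grace_window" // step 3
  | "stopped" // step 4
  | "check_cycle_failures"; // step 5: inspect cycle.failed and the audit row

const AUTO_DISABLE_THRESHOLD = 5;
const GRACE_WINDOW_MS = 60 * 60 * 1000;

function diagnoseNotSyncing(row: TournamentSyncRow, now: Date): SyncFinding {
  if (!row.swisstennis_sync_enabled) return "sync_disabled";
  if (row.swisstennis_consecutive_failures >= AUTO_DISABLE_THRESHOLD) {
    return "auto_disabled_after_failures";
  }
  const t = now.getTime();
  const end = row.windowEnd.getTime();
  if (t > end && t < end + GRACE_WINDOW_MS) return "grace_window";
  if (row.swisstennis_sync_stopped_at !== null) return "stopped";
  return "check_cycle_failures";
}
```

The function returns the first matching cause only. A tournament can satisfy several conditions at once (for example, auto-disabled and stopped), so the remaining steps are still worth a glance after the first hit is fixed.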

6.6 Cycle reports rolledOver > 0 consistently

Cause: more than five tournaments are due simultaneously every tick. The hard cap (MAX_TOURNAMENTS_PER_CYCLE = 5) is bounding outbound traffic to SwissTennis.

Fix: tune. If the workload is genuinely larger than five active tournaments, raise the constant in swisstennisScheduler.ts. Compute the worst-case outbound rate first: at 16 events per tournament, one tournament costs ~17 requests per cycle, so with one cycle per minute each extra slot adds ~17 req/min to SwissTennis. Below ~200 req/min is fine; above that, talk to the PO before changing.
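
The worst-case arithmetic can be written out as a small calculation. The figures (~17 requests per tournament per cycle, one cycle per minute, ~200 req/min ceiling) are the approximate values quoted above; the function names are hypothetical.

```typescript
// Approximate figures from this runbook; treat results as estimates.
const REQUESTS_PER_TOURNAMENT_PER_CYCLE = 17;
const CYCLES_PER_MINUTE = 1; // the timer fires every 60 s
const SOFT_LIMIT_REQ_PER_MIN = 200;

// Worst case: every slot is filled on every tick.
function worstCaseRequestsPerMinute(maxTournamentsPerCycle: number): number {
  return maxTournamentsPerCycle * REQUESTS_PER_TOURNAMENT_PER_CYCLE * CYCLES_PER_MINUTE;
}

// Largest cap whose worst case stays at or below the limit.
function maxSlotsUnderLimit(limitReqPerMin: number = SOFT_LIMIT_REQ_PER_MIN): number {
  return Math.floor(limitReqPerMin / (REQUESTS_PER_TOURNAMENT_PER_CYCLE * CYCLES_PER_MINUTE));
}
```

With these numbers the current cap of 5 tops out at 85 req/min, a cap of 11 reaches 187 req/min, and a cap of 12 would reach 204 req/min, just past the ceiling. Tournaments with more than 16 events would shift these figures upward.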


7. Alert thresholds (#190 Casey portion)

7.1 Alerting substrate — current state and recommendation

The metrics counters in metrics.ts live in process memory on the Next.js app. There is no scrape endpoint, no push gateway, no aggregator. getMetricsSnapshot() exposes the registry to in-process callers (a future /metrics route handler is the obvious next step) but nothing currently reads it from outside the process.

This is consistent with the rest of the CSD app's posture today: structured JSON logs to stdout + a future Prometheus-style scrape endpoint, both hosted on the keystone platform's existing log/metrics stack. There is no separate metrics aggregator in this repo and no infrastructure ticket open to add one (verified by absence in docs/).

Recommendation for v1: journald-based alerts. Treat the structured-log lines as the primary signal source. The keystone platform's log aggregation (whatever it is — see the keystone runbooks) is the substrate; we write the queries below against the field shape documented in logging.ts. This is an explicit v1 stopgap — when the platform adds a scrape substrate, port the thresholds in §7.2 to the equivalent PromQL.

Follow-up: see §9 below — if a real metrics aggregator is needed before the v1 launch (or if the on-call appetite for "manually check the audit table" is low), file a follow-up issue.

7.2 Thresholds

The metric / log expressions below are written against the field shape exposed by metrics.ts (counter names, label sets) and logging.ts (canonical field set). Implementation is per §7.1 — for v1 these are documentation that the on-call queries by hand against journald or the audit table; for v2 they become alert rules in the platform aggregator.

7.2.1 Critical — page on call

swisstennis_pii_violations_total{caller_layer=*} > 0 over any window.

The invariant is "always zero". The counter is incremented only by defence-in-depth checks inside the merge layer when the PII whitelist would have leaked an unexpected field. A non-zero value means the parser whitelist drifted from the merge-layer guard — a code bug that needs a hotfix because the next import could write PII (email, address, phone, birthdate) into the CSD database.

  • v1 query (journald): journalctl ... | grep '"side":"csd"' | grep 'PiiViolation'
  • v2 alert: sum(swisstennis_pii_violations_total) > 0
  • Page severity: high. Page Sam (backend) and Stefan (PO).

swisstennis_sync_failures_total{reason="endpoint_deprecated"} > 0.

Increments when the fetcher receives an HTTP 301 redirect from comp.swisstennis.ch toward the new mytennis.ch SPA. Means SwissTennis has deprecated the anonymous Advantage servlet we depend on; background sync will fail forever until we update the URL set or rewire to Phase 2 (Calendar/Hasura with organizer credentials).

  • v1 query (journald): journalctl ... | grep 'EndpointDeprecatedError'
  • v2 alert: increase(swisstennis_sync_failures_total{reason="endpoint_deprecated"}[1h]) > 0
  • Page severity: high. Page Sam (backend) and Stefan (PO). Disable background sync globally until resolved (toggle swisstennis_sync_enabled = false on affected tournaments to stop the per-tournament failure counter from auto-disabling each one in turn).

7.2.2 Warning — notify next business day

Per-tournament consecutive_failures ≥ 3 (auto-disable at 5).

The pipeline auto-disables sync at N = 5 consecutive failures. Catching it at 3 gives two ticks of warning before sync stops. Two interpretations:

  1. SwissTennis is intermittently down for this tournament — re-runs at the next tick.
  2. Something tournament-specific is broken (event-mode change, payload regression on one event only).

  • v1 query (audit table): select id, swisstennis_consecutive_failures from tournaments where swisstennis_consecutive_failures >= 3;
  • v2 alert: a gauge per tournament, derived from the audit row's consecutive_failures_after field on the most recent swisstennis_import_events row per tournament.
  • Notify Sam in #swisstennis (or the equivalent comms channel); no page.
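
The v2 derivation (most recent audit row per tournament, then a threshold on consecutive_failures_after) could look like the sketch below. The input shape is an assumption: it presumes the audit rows have already been loaded and flattened, and the createdAt ordering field is hypothetical.

```typescript
// Hypothetical flattened view of a swisstennis_import_events row.
interface ImportEventSummary {
  tournamentId: string;
  createdAt: Date; // ASSUMPTION: some timestamp orders the audit rows
  consecutiveFailuresAfter: number; // from the metadata payload
}

// Returns tournament id -> failure count for every tournament whose most
// recent audit row is at or above the warning threshold.
function tournamentsAtWarning(
  events: readonly ImportEventSummary[],
  threshold: number = 3,
): Map<string, number> {
  const latest = new Map<string, ImportEventSummary>();
  for (const event of events) {
    const seen = latest.get(event.tournamentId);
    if (seen === undefined || event.createdAt.getTime() > seen.createdAt.getTime()) {
      latest.set(event.tournamentId, event);
    }
  }
  const flagged = new Map<string, number>();
  latest.forEach((event, tournamentId) => {
    if (event.consecutiveFailuresAfter >= threshold) {
      flagged.set(tournamentId, event.consecutiveFailuresAfter);
    }
  });
  return flagged;
}
```

Only the latest row per tournament is considered, so a tournament that failed four times and then recovered does not stay flagged.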

dispatcher.status="misconfigured" for > 5 consecutive cycles on csd-prod.

Means a deploy or secret-rotation issue: either SWISSTENNIS_SCHEDULER_SECRET was dropped on the latest deploy, or the keystone-side app.env is out of step. csd-test will sit in this state during development too — alert only on csd-prod.

  • v1 query (keystone-side): journalctl -u keystone-cron-csd-prod-http-trigger.service --since '6 minutes ago' | grep misconfigured | wc -l (≥ 5 = trigger).
  • v2 alert: rate of dispatchScheduler 200-responses with body status="misconfigured" > 0 over a 6-minute window.
  • Notify Casey + Stefan — usually correlates with a recent deploy event.
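
The grep count in the v1 query measures how many misconfigured ticks fall inside the window, not whether they were consecutive. Where the distinction matters, the trailing streak can be computed from the ordered list of per-tick statuses. The sketch below assumes those statuses have already been extracted from the journal, oldest first; the function name is hypothetical and the threshold comparison is left to the caller.

```typescript
// Counts how many of the most recent ticks in a row were "misconfigured".
// Input is ordered oldest to newest.
function trailingMisconfiguredStreak(statuses: readonly string[]): number {
  let streak = 0;
  for (let i = statuses.length - 1; i >= 0 && statuses[i] === "misconfigured"; i--) {
    streak += 1;
  }
  return streak;
}
```

A single "ok" tick resets the streak, so a tournament of alternating ok and misconfigured responses (for example, during a rolling deploy) yields a streak of at most 1 even though the grep count would be higher.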

7.2.3 Info — track, don't alert

swisstennis_fetch_failures_total{error_kind="NetworkError"} baseline rate.

Transient 5xx from upstream is normal background noise. Track the rate as a sanity check (sudden 10× spikes are the canonical "they're having an outage" signal) but don't auto-page on it — the consecutive-failures threshold above catches the case where it stops being transient.

swisstennis_sync_failures_total{reason="merge_conflict"} rate.

Increments when an organizer has locally edited a match result (result_locally_entered = true) and SwissTennis disagrees. This is expected behaviour — it surfaces as a warning in the audit row, not a failure. Track for context; never page.

7.3 What I am explicitly not doing in this pass

  • Wiring up real alert rules. Until the keystone metrics aggregator exists, these thresholds are documentation that the on-call queries by hand. See §9.
  • Adding a /metrics HTTP handler exposing getMetricsSnapshot() in Prometheus exposition format. That's the natural next step once the aggregator is decided; filing as a follow-up rather than committing speculatively.
  • Pushing alerts via webhook from the pipeline code. Considered and rejected — coupling the pipeline to an alert sink means a flaky alert sink can take the pipeline down, and the alert sink itself becomes a new infra dependency for an integration that doesn't otherwise need one.

8. Cross-references

| Where | What |
|---|---|
| docs/specs/swisstennis-api-integration.md §5 | Data-mapping reference: which SwissTennis fields produce which CSD entities. Not scheduler-specific but answers "what does a successful cycle actually write?". |
| docs/specs/swisstennis-api-integration-impl-plan.md §5 (v0.2 note) | Architectural rationale for the scheduler, including the v0.1 → v0.2 substrate change from pg_cron to systemd timer. |
| packages/trpc/src/routers/swisstennisScheduler.ts | Dispatcher source: resolveCadence, the secret-gate, the per-cycle summary shape. |
| packages/swisstennis/src/pipelines/sync.ts | The pipeline the dispatcher invokes: consecutive-failures bookkeeping, final-pull stamping. |
| packages/swisstennis/src/observability/ | Structured log + metrics counter shapes (the alert thresholds in §7 read from here). |
| gitlab.com/wagen-public/keystone-public/-/blob/main/docs/runbooks/12-platform-deploy-target.md (keystone side) | The deploy contract that delivers SWISSTENNIS_SCHEDULER_SECRET to the CSD container and the app.env file on the host. |
| gitlab.com/wagen-public/keystone-public/-/blob/main/scripts/16b-app-http-cron-timer.sh (keystone side) | Provisioning script for the timer + service + wrapper triple on the keystone host. Not maintained from CSD. |

9. Open follow-ups

| Item | Recommendation | Filing |
|---|---|---|
| Decide the metrics-aggregator substrate for the CSD app | Defer to the platform's existing posture (currently: structured logs only). If the on-call appetite for "query the audit table by hand" is too low, add Prometheus scrape support and port the §7.2 thresholds. Not blocking v1. | File a new issue if/when needed; recommended labels: area::devops, type::infrastructure. |
| Add a /metrics HTTP handler exposing getMetricsSnapshot() | Natural next step once the aggregator is decided. Self-contained: no DB / auth changes. | Same; fold into the aggregator decision. |
| Verify the keystone wrapper logs the response body on success (not just the status) | Helpful for the §6 troubleshooting flows (status="misconfigured" is invisible if only the HTTP status is logged). | Coordination with keystone; file on the keystone side if the wrapper currently swallows the body. |

10. Changelog

  • v1.0 (2026-05-24, Casey) — initial runbook, close-out of #199 + #190 (Casey portion). Substrate is the keystone systemd-timer fleet; pg_cron approach in the v0.1 impl plan is superseded (see impl plan §5 v0.2 note).