After 24 sprints the regression suite has grown, and its instability is undermining trust in the results. This sprint catches flaky tests, adds observability (Grafana + Prometheus alerts + RUNBOOK), and certifies the suite with a 10× cert run.

1. tests/regression/find-flaky.sh: 10× run + JSON aggregator → docs/flaky-tests.md (per-test pass/fail sequence + reproduce steps).
2. OrgFactory.signupWithRetry now honors the Retry-After header (api-client.ts:ApiError.retryAfterSec). Stage rate limits raised: RATE_SIGNUP_HOUR=5000, RATE_PER_IP_MIN=5000 (~/food-market-stage/deploy/.env).
3. fullyParallel=true + workers=4 means tests run in a nondeterministic order; isolation holds (OrgFactory per-test).
4. workers=4 gives a **2.4× speedup** (66.6s → 27.7s). Worker-scoped fixture lib/worker-org.ts added as opt-in.
5. deploy/grafana/dashboards/quality-watchdog.json (10 panels: smoke success ratio 7d, incidents, multi-tenant violations, current emoji, p95 by endpoint, step failures, RPS, DB p95, docs posted, disk free) + dashboards/README.md. quality-watchdog.sh writes a Prometheus textfile export to ~/.fm-watchdog/textfile/quality_watchdog.prom for node_exporter.
6. deploy/prometheus/alerts.yml: 10 rules in 4 groups (uptime, errors, database, quality-watchdog). MultiTenantViolation = P0. deploy/prometheus/prometheus.yml: reference config.
7. docs/RUNBOOK.md +178 lines: one action per alert (api-down, rps-drop, http-errors-spike/growing, doc-posting-errors, db-p95-high, disk-free-low, watchdog-red, multi-tenant-violation, watchdog-incident). Junior-friendly, with concrete commands.

**Cert run (10× workers=4):** 420/420 passed, 0 flaky, avg 30.1s/run, total 300.6s (essentially at the 5min / 300s budget).

Changes outside the repo:
- ~/food-market-stage/deploy/.env: RATE_* limits bumped.
- ~/quality-watchdog.sh: added the .prom textfile export.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
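Item 2 above (honoring Retry-After in signupWithRetry) can be illustrated with a minimal sketch. Only the field name ApiError.retryAfterSec comes from the commit message. The class shape, the helper names `nextDelayMs` and `withRetry`, the retry-only-on-429 rule, and the capped exponential fallback are assumptions made for illustration, not the repository's actual code. The sketch also assumes Retry-After was already parsed into seconds (the header may alternatively carry an HTTP date).

```typescript
// Hypothetical shape: the commit only confirms that ApiError has retryAfterSec.
class ApiError extends Error {
  readonly status: number;
  readonly retryAfterSec?: number;

  constructor(message: string, status: number, retryAfterSec?: number) {
    super(message);
    this.status = status;
    this.retryAfterSec = retryAfterSec;
  }
}

// Structural check, so it also works when classes are down-levelled to ES5.
function isApiError(err: unknown): err is ApiError {
  return err instanceof Error && typeof (err as ApiError).status === "number";
}

// Delay before the next attempt: the server's Retry-After when present,
// otherwise exponential backoff. Both are capped (the cap is an assumption,
// chosen so a test run never sleeps for an unbounded time).
function nextDelayMs(err: unknown, attempt: number, baseMs = 500, capMs = 30_000): number {
  if (isApiError(err) && err.retryAfterSec !== undefined) {
    return Math.min(err.retryAfterSec * 1000, capMs);
  }
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Generic retry wrapper. `sleep` is injectable so unit tests need not wait.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Assumption: only HTTP 429 (rate limited) is worth retrying.
      const retryable = isApiError(err) && err.status === 429;
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      await sleep(nextDelayMs(err, attempt));
    }
  }
}
```

A signup helper could then wrap its request as `withRetry(() => api.signup(payload))`, where `api.signup` is a placeholder for whatever call the factory actually makes. Preferring the server's hint over a fixed backoff keeps parallel workers from hammering the rate limiter in lockstep.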
48 lines
1.4 KiB
YAML
# Sprint 26: sample Prometheus config for food-market.
#
# NOT deployed automatically; this is a reference for the operator. For stage:
#
#   docker run -d --name prometheus \
#     -p 9090:9090 \
#     -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
#     -v $PWD/alerts.yml:/etc/prometheus/alerts.yml \
#     prom/prometheus:latest
#
# Then set the Grafana datasource "Prometheus" = http://prometheus:9090.

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    env: stage

rule_files:
  - alerts.yml

scrape_configs:
  # API exposed via /metrics endpoint
  - job_name: food-market-api
    metrics_path: /metrics
    static_configs:
      - targets:
          - test.admin.food-market.kz:443  # stage
          # - api.food-market.kz:443       # prod
    scheme: https
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # quality-watchdog textfile exporter (via node_exporter).
  # Runs on the machine where ~/quality-watchdog.sh lives:
  #   node_exporter --collector.textfile.directory=$HOME/.fm-watchdog/textfile
  - job_name: quality-watchdog
    static_configs:
      - targets:
          - 192.168.1.193:9100  # dev-vm node_exporter

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
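The quality-watchdog job above scrapes whatever node_exporter finds in its textfile directory, so the .prom file has to follow the Prometheus text exposition format. The following is a rough sketch of rendering that format. The real export lives in ~/quality-watchdog.sh (a shell script), and the `Sample` type, `renderProm`, and the metric name in the usage note are hypothetical, invented here for illustration.

```typescript
// One metric sample. Each metric name is expected to appear only once,
// because HELP/TYPE lines may not be repeated for the same name.
interface Sample {
  name: string;
  help: string;
  type: "gauge" | "counter";
  value: number;
  labels?: { [key: string]: string };
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render samples as text exposition format: HELP, TYPE, then the sample line.
function renderProm(samples: Sample[]): string {
  const lines: string[] = [];
  for (const s of samples) {
    const labelMap = s.labels ?? {};
    const labels = Object.keys(labelMap)
      .map((k) => `${k}="${escapeLabelValue(labelMap[k])}"`)
      .join(",");
    lines.push(`# HELP ${s.name} ${s.help}`);
    lines.push(`# TYPE ${s.name} ${s.type}`);
    lines.push(labels ? `${s.name}{${labels}} ${s.value}` : `${s.name} ${s.value}`);
  }
  // The format requires the final line to end with a newline.
  return lines.join("\n") + "\n";
}
```

For example, `renderProm([{ name: "fm_watchdog_smoke_success", help: "1 if the last smoke run passed", type: "gauge", value: 1 }])` yields the three lines a textfile collector expects. Whatever produces the file, writing to a temporary file and renaming it into the textfile directory avoids node_exporter reading a half-written export.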