food-market/deploy/prometheus/prometheus.yml
nns cf760fab10
feat(s26): flaky-test detection + observability dashboards (8/8 ✓ 10/10 cert)
After 24 sprints the regression suite has grown, and its instability
undermines trust. This sprint catches flaky tests, adds observability
(Grafana + Prometheus alerts + RUNBOOK), and certifies the suite with a
10× cert run.

1. tests/regression/find-flaky.sh: 10× run + JSON aggregator →
   docs/flaky-tests.md (per-test pass/fail sequence + reproduce).
2. OrgFactory.signupWithRetry now honors the Retry-After header
   (api-client.ts:ApiError.retryAfterSec). Stage rate limits raised:
   RATE_SIGNUP_HOUR=5000, RATE_PER_IP_MIN=5000 (~/food-market-stage/deploy/.env).
3. fullyParallel=true + workers=4 means tests run in a nondeterministic
   order; isolation holds (OrgFactory per-test).
4. workers=4 gives a **2.4× speedup** (66.6s → 27.7s). The worker-scoped
   fixture lib/worker-org.ts is added as opt-in.
5. deploy/grafana/dashboards/quality-watchdog.json (10 panels:
   smoke success ratio 7d, incidents, multi-tenant violations,
   current emoji, p95 by endpoint, step failures, RPS, DB p95,
   docs posted, disk free) + dashboards/README.md.
   quality-watchdog.sh writes a Prometheus textfile export to
   ~/.fm-watchdog/textfile/quality_watchdog.prom for node_exporter.
6. deploy/prometheus/alerts.yml: 10 rules in 4 groups (uptime,
   errors, database, quality-watchdog). MultiTenantViolation = P0.
   deploy/prometheus/prometheus.yml: reference config.
7. docs/RUNBOOK.md +178 lines: action per alert (api-down,
   rps-drop, http-errors-spike/growing, doc-posting-errors,
   db-p95-high, disk-free-low, watchdog-red, multi-tenant-violation,
   watchdog-incident). Junior-friendly, with concrete commands.

**Cert run (10× workers=4):** 420/420 passed, 0 flaky, avg 30.1s/run,
total 300.6s (< 5min budget).

Changes outside the repo:
- ~/food-market-stage/deploy/.env: RATE_* limits bumped.
- ~/quality-watchdog.sh: added the .prom textfile export.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-08 14:44:19 +05:00

48 lines
1.4 KiB
YAML

# Sprint 26: example Prometheus config for food-market.
#
# NOT deployed automatically; this is a reference for the operator. For stage:
#
#   docker run -d --name prometheus \
#     -p 9090:9090 \
#     -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
#     -v $PWD/alerts.yml:/etc/prometheus/alerts.yml \
#     prom/prometheus:latest
#
# Then set the Grafana datasource "Prometheus" = http://prometheus:9090.
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    env: stage

rule_files:
  - alerts.yml

scrape_configs:
  # API exposed via /metrics endpoint
  - job_name: food-market-api
    metrics_path: /metrics
    static_configs:
      - targets:
          - test.admin.food-market.kz:443  # stage
          # - api.food-market.kz:443  # prod
    scheme: https
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # quality-watchdog textfile exporter (via node_exporter).
  # Runs on the machine that hosts ~/quality-watchdog.sh:
  #   node_exporter --collector.textfile.directory=$HOME/.fm-watchdog/textfile
  - job_name: quality-watchdog
    static_configs:
      - targets:
          - 192.168.1.193:9100  # dev-vm node_exporter

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
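
The config loads its rules from alerts.yml, which is not shown on this page. As a rough sketch only, one rule in the uptime group could look like the fragment below. The group name and the food-market-api job label come from the commit message and the config above, and `up` is the per-target metric Prometheus records on every scrape. The alert name, the `for` duration, the `severity` label and the annotation text are assumptions for illustration, not the repository's actual rule.

```yaml
# Hypothetical sketch; not the contents of the real deploy/prometheus/alerts.yml.
groups:
  - name: uptime                # group name listed in the commit message
    rules:
      - alert: ApiDown          # assumed name, modelled on the RUNBOOK's "api-down" entry
        # `up` is 0 when the last scrape of the target failed.
        expr: up{job="food-market-api"} == 0
        for: 2m                 # assumed: fire only if the target stays down for 2 minutes
        labels:
          severity: P1          # assumed; the source only states MultiTenantViolation = P0
        annotations:
          summary: "food-market API target {{ $labels.instance }} is not answering scrapes"
```

With the 30s evaluation_interval configured above, a rule like this is re-evaluated every 30 seconds and stays pending until the expression has held for the whole `for` window.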