Graceful shutdown and readiness probes: deploys that don't cut requests mid-flight

A deploy lands during peak booking hours. A user clicks “Confirm”: the spinner turns, then times out. The old pod has just received SIGTERM, and Tomcat cut the connection mid-request. The appointment transaction either rolled back, or committed without the client ever receiving the response. The user clicks again, which risks a duplicate booking (this is why the idempotency from lesson 84 is needed).

A healthy deploy is not just “the new image runs”. It also means the old pod exits cleanly.

SIGTERM and the grace period

When Kubernetes terminates a pod, it sends SIGTERM, waits up to terminationGracePeriodSeconds (30s by default), then sends SIGKILL.

Spring Boot 2.3+:

server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

Graceful shutdown:

Stop accepting new requests (the load balancer / K8s removes the pod from endpoints)
Wait for in-flight requests to finish (within the timeout)
Close the connection pool
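The drain sequence above can be sketched in plain Java. This is only an illustration of the idea, not how Spring Boot or Tomcat implement it: the class name GracefulDrainSketch, the pool size, and the sleep standing in for request work are all assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a web server's worker pool (not Spring's implementation).
class GracefulDrainSketch {

  private final ExecutorService workers = Executors.newFixedThreadPool(4);
  private final AtomicInteger completed = new AtomicInteger();

  // Accepts a "request" unless shutdown has started; returns false when rejected.
  boolean submit(long workMillis) {
    try {
      workers.execute(() -> {
        try {
          Thread.sleep(workMillis); // simulated request handling
          completed.incrementAndGet();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // request was cut off by a forced stop
        }
      });
      return true;
    } catch (RejectedExecutionException e) {
      return false; // pool no longer accepts new work
    }
  }

  // Step 1: stop accepting. Step 2: wait for in-flight work up to the timeout.
  // Step 3: force-stop whatever is left (the SIGKILL analogue).
  // Returns true if everything drained within the timeout.
  boolean shutdownGracefully(long timeoutMillis) {
    workers.shutdown();
    boolean drained = false;
    try {
      drained = workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    if (!drained) {
      workers.shutdownNow();
    }
    return drained;
  }

  int completedCount() {
    return completed.get();
  }

  public static void main(String[] args) {
    GracefulDrainSketch server = new GracefulDrainSketch();
    server.submit(100);
    server.submit(100);
    boolean drained = server.shutdownGracefully(2000);
    System.out.println("drained=" + drained + " completed=" + server.completedCount());
    System.out.println("accepted after shutdown=" + server.submit(10));
  }
}
```

A request that needs longer than the timeout is interrupted and never counted as completed, which is the situation the user in the opening scenario runs into.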

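// K8s preStop hook: optional, gives the load balancer time to propagate
// e.g. sleep 5s before SIGTERM is delivered so the endpoint list updates first
// note: preStop time counts against terminationGracePeriodSeconds, so keep sleep + drain timeout below it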

Readiness vs Liveness

Probe	Question	On failure
Liveness	Is the process still alive?	Restart the pod
Readiness	Can it take traffic?	Remove from the Service, no restart

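import javax.sql.DataSource;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.stereotype.Component;

@Component
public class ReadinessHealthIndicator implements HealthIndicator {

  private final DataSource dataSource;
  private final RedisConnectionFactory redis;

  // Constructor injection: Spring supplies both dependencies.
  public ReadinessHealthIndicator(DataSource dataSource, RedisConnectionFactory redis) {
    this.dataSource = dataSource;
    this.redis = redis;
  }

  @Override
  public Health health() {
    try (var conn = dataSource.getConnection();
         var redisConn = redis.getConnection()) {
      // isValid returns false on a dead connection instead of throwing, so check the result.
      if (!conn.isValid(2)) {
        return Health.down().withDetail("db", "connection not valid").build();
      }
      redisConn.ping(); // throws if Redis is unreachable
      return Health.up().build();
    } catch (Exception ex) {
      return Health.down().withException(ex).build();
    }
  }
}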

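# application.yml (Actuator)
management:
  endpoint:
    health:
      probes:
        enabled: true
      group:
        readiness:
          # Add the custom indicator to the readiness group. "readiness" is the name
          # Spring Boot derives from the ReadinessHealthIndicator bean (suffix stripped).
          include: readinessState,readiness
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true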

K8s:

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080

DB maintenance: readiness goes down → traffic moves to other pods, instead of requests being sent into a pod that is about to fail on every query.

Don't run a heavy DB check every second on liveness: a DB blip would restart every pod at once, which is a thundering herd.
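The split between the two probes can be sketched without any framework. This is a minimal illustration under assumptions: ProbeSketch and its method names are made up for the example, and the dependency check is passed in as a Callable rather than being a real DB or Redis call. A real service would reuse one executor instead of creating one per call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical probe logic; in Spring Boot the Actuator health groups play this role.
class ProbeSketch {

  // Liveness: only answers "can this process respond at all?". No dependency calls,
  // so a DB blip can never make the orchestrator restart the pod.
  static boolean liveness() {
    return true;
  }

  // Readiness: runs the dependency check with a hard timeout.
  // Any failure, false result, or timeout means "do not send traffic here".
  static boolean readiness(Callable<Boolean> dependencyCheck, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Boolean> result = executor.submit(dependencyCheck);
      return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException | TimeoutException e) {
      return false;
    } finally {
      executor.shutdownNow(); // also interrupts a check that is still hanging
    }
  }

  public static void main(String[] args) {
    System.out.println("liveness=" + liveness());
    System.out.println("ready (healthy dependency)=" + readiness(() -> true, 500));
    System.out.println("ready (hanging dependency)="
        + readiness(() -> { Thread.sleep(5000); return true; }, 100));
  }
}
```

With this shape, a slow or failing dependency only takes the pod out of rotation; the process keeps running and becomes ready again once the check passes.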

A realistic deploy sequence

New pod reports readiness up (DB OK, migrations done)
K8s adds it to endpoints
Old pod receives SIGTERM and drains gracefully
Old pod exits

A rolling update with maxUnavailable: 0 keeps capacity constant during the swap.

Long-running jobs and shutdown

@Async tasks or a batch PDF export get killed mid-run if they outlast the grace period. Mark the job RUNNING in the DB so another worker can resume it, or raise the grace period, or drain jobs before deploying.
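One way to make a long job resumable is to checkpoint after every step and to check a shutdown flag between steps. The sketch below assumes an in-memory map in place of the jobs table, and the names ResumableJobSketch, JobRow, and requestShutdown are hypothetical. In a real service the flag would be set from a shutdown callback and the row would be persisted in the DB.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

// Hypothetical worker that checkpoints progress so another worker can resume the job.
class ResumableJobSketch {

  enum Status { RUNNING, DONE }

  // Stand-in for one row of a jobs table: status plus the last completed step.
  static final class JobRow {
    final Status status;
    final int lastCompletedStep;

    JobRow(Status status, int lastCompletedStep) {
      this.status = status;
      this.lastCompletedStep = lastCompletedStep;
    }
  }

  private final Map<String, JobRow> store; // shared between workers, like the DB
  private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

  ResumableJobSketch(Map<String, JobRow> store) {
    this.store = store;
  }

  // Would be called from a shutdown callback when SIGTERM arrives.
  void requestShutdown() {
    shuttingDown.set(true);
  }

  // Runs steps starting after the last checkpoint; stops between steps on shutdown.
  JobRow run(String jobId, int totalSteps, IntConsumer work) {
    JobRow row = store.getOrDefault(jobId, new JobRow(Status.RUNNING, 0));
    int step = row.lastCompletedStep;
    while (step < totalSteps && !shuttingDown.get()) {
      step++;
      work.accept(step); // one unit of work, e.g. rendering one page
      store.put(jobId, new JobRow(Status.RUNNING, step)); // checkpoint
    }
    JobRow result = new JobRow(step == totalSteps ? Status.DONE : Status.RUNNING, step);
    store.put(jobId, result);
    return result;
  }

  public static void main(String[] args) {
    Map<String, JobRow> store = new ConcurrentHashMap<>();
    ResumableJobSketch oldPod = new ResumableJobSketch(store);
    oldPod.run("export-1", 10, step -> { if (step == 3) oldPod.requestShutdown(); });
    JobRow finished = new ResumableJobSketch(store).run("export-1", 10, step -> { });
    System.out.println(finished.status + " at step " + finished.lastCompletedStep);
  }
}
```

The job left in RUNNING with a checkpoint is what a second worker picks up. Each step has to be short enough to finish inside the grace period for this to help.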

Outbox worker (lesson 115): its transactions are short, so it carries less risk than a 10-minute report generation.

Local dev vs prod

In dev, Ctrl+C also triggers shutdown, so test graceful shutdown locally before trusting prod. Use kubectl delete pod with a low grace period to simulate a tight termination window.

Takeaway

For production deploys: enable server.shutdown=graceful, make readiness reflect the real state of DB/Redis, and keep liveness lightweight. Terminating a pod must not act as an instant kill switch for a user who is mid-booking. And if timeouts spike exactly in the deploy window, look at the grace period and preStop before blaming the new code.

Next lesson: Zero-downtime migrations with expand-contract.