etcd in Practice

etcd Basic Architecture

  • client
    • load balancing
    • node failover
  • API network (gRPC)
  • Raft Consensus
    • leader election
    • log replication
    • ReadIndex
  • function
    • KVServer
    • MVCC
    • Auth
    • Lease
    • Compactor
  • storage
    • WAL
    • Snapshot
    • boltdb

architecture

+---------------------+
|       Clients       |
+---------------------+
          |
          v
+---------------------+
|       gRPC API      |
+---------------------+
          |
          v
+---------------------+       +---------------------+
|       Leader        | <-->  |      Followers      |
|   (Raft Consensus)  |       |   (Raft Consensus)  |
+---------------------+       +---------------------+
          |
          v
+---------------------+
|     BoltDB (MVCC)   |
+---------------------+
          |
          v
+---------------------+
|  Authentication &   |
|   Authorization     |
+---------------------+
          |
          v
+---------------------+
|       Leases        |
+---------------------+
          |
          v
+---------------------+
|       Watch         |
+---------------------+

Raft Algorithm

  • leader election
    • leader, follower, candidate (preVote, preCandidate)
    • heart-beat interval, election timeout
  • log replication
  • safety
    • term (epoch), nextIndex, commitIndex
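
As a rough illustration of how commitIndex advances during log replication, the sketch below assumes the leader tracks a per-member matchIndex (a standard Raft name that does not appear in the notes above) and treats the highest index stored on a majority as committed. Real Raft additionally requires that the entry belongs to the leader's current term; that check is left out here.

```python
def majority_commit_index(match_index):
    """Return the highest log index replicated on a majority of members.

    match_index: one entry per cluster member (leader included), each the
    highest log index known to be stored on that member.
    """
    if not match_index:
        raise ValueError("cluster must have at least one member")
    ordered = sorted(match_index, reverse=True)
    quorum = len(ordered) // 2 + 1
    # The quorum-th largest value is present on at least `quorum` members.
    return ordered[quorum - 1]


# Hypothetical 5-node cluster: leader at index 10, followers at 9, 7, 4, 3.
print(majority_commit_index([10, 9, 7, 4, 3]))  # 7
```

Index 7 is the highest entry held by three of the five members, so it is the furthest point the leader could safely report as committed in this simplified model.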

Authentication Module

  • user:pass + RBAC
    • bcrypt (Blowfish-based) password hashing, salt, configurable cost (hash iterations)
    • x.509
    • ACL, ABAC, RBAC
    • JWT
    • Interval Tree (range permission checks)
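# create the root user (name:password) and enable authentication
etcdctl user add root:root
etcdctl auth enable
etcdctl put hello world --user root:root

# role: grant readwrite on the key range [hello, helly)
etcdctl role add admin --user root:root
etcdctl role grant-permission admin readwrite hello helly --user root:root
# the user must exist before a role can be granted (alice:alice is an example credential)
etcdctl user add alice:alice --user root:root
etcdctl user grant-role alice admin --user root:root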
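
The grant-permission command above covers a half-open key range: the start key is included and the end key is excluded. The following is a minimal sketch of that membership test only. The function name is hypothetical and this is not etcd's actual permission cache.

```python
def key_in_range(key, start, end=None):
    """Check whether key falls inside a granted key range.

    With no end, only the exact key matches; otherwise the range is
    half-open: start <= key < end, compared as bytes.
    """
    k, s = key.encode(), start.encode()
    if end is None:
        return k == s
    return s <= k < end.encode()


# Range granted above: [hello, helly)
for candidate in ("hello", "hello1", "helly", "help"):
    print(candidate, key_in_range(candidate, "hello", "helly"))
```

Under this model, hello and hello1 are inside the range, while helly (the excluded end) and help (which sorts after helly) are not.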

Lease Module

How to detect whether a process is alive
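
One way to picture lease-based liveness detection: a process holds a lease with a TTL and must keep renewing it, and once renewals stop the lease expires and the process is considered dead. The toy model below is a sketch under that assumption. The class names and the injectable clock are hypothetical and not part of etcd's API.

```python
import time


class Lease:
    """A toy TTL lease: alive until `ttl` seconds pass without a keepalive."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._expires_at = clock() + ttl

    def alive(self):
        return self._clock() < self._expires_at

    def keep_alive(self):
        # An expired lease cannot be renewed; the owner must start over.
        if not self.alive():
            return False
        self._expires_at = self._clock() + self.ttl
        return True


class FakeClock:
    """Manually advanced clock so the example runs without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
lease = Lease(ttl=10, clock=clock)
clock.now = 8
print(lease.keep_alive())  # renewed; now expires at t=18
clock.now = 18
print(lease.alive())       # no renewal arrived in time
```

In a real deployment, keys attached to the lease would be removed on expiry, which is what lets other clients observe that the owner is gone.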

MVCC/Watch Module

TXN

  • WAL, uncommitted transaction/committed transaction (system fail, recover from WAL)
  • MVCC
  • consistent index

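Dirty reads, dirty writes, non-repeatable reads and read skew, phantom reads and write skew, lost updates, snapshot isolation, serializable snapshot isolation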
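
To make the MVCC idea concrete, here is a minimal sketch of a multi-version store: every write gets a new global revision, an in-memory index maps each key to its revisions (loosely analogous to treeIndex), and a backend maps each revision to the stored value (loosely analogous to boltdb). Deletes, tombstones, and compaction are omitted, and all names are hypothetical.

```python
import bisect


class ToyMVCC:
    def __init__(self):
        self.revision = 0   # global, monotonically increasing
        self.index = {}     # key -> ascending list of revisions
        self.backend = {}   # revision -> (key, value)

    def put(self, key, value):
        self.revision += 1
        self.index.setdefault(key, []).append(self.revision)
        self.backend[self.revision] = (key, value)
        return self.revision

    def get(self, key, rev=None):
        """Return the value of key as of revision rev (default: latest)."""
        revs = self.index.get(key, [])
        if rev is None:
            rev = self.revision
        # Find the newest revision of this key that is <= rev.
        pos = bisect.bisect_right(revs, rev)
        if pos == 0:
            return None
        return self.backend[revs[pos - 1]][1]


store = ToyMVCC()
store.put("hello", "world")   # revision 1
store.put("foo", "bar")       # revision 2
store.put("hello", "etcd")    # revision 3
print(store.get("hello"))         # latest value
print(store.get("hello", rev=2))  # value as seen at revision 2
```

Reading at a fixed revision is what gives a reader a stable snapshot while later writes continue to append new revisions.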

Backend / Boltdb

  • member/snap/db
    • key-value
    • lease
    • meta
    • member
    • cluster
    • auth
  • mmap with fsync/fdatasync

four types of page

  • branch page 0x01
  • leaf page 0x02
  • meta page 0x04
  • freelist page 0x10
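
The flag values above can be decoded with a simple lookup. This sketch only maps a flags value to a page-type name. The function name and the format of the "unknown" string are illustrative choices, not a reproduction of boltdb's code.

```python
PAGE_FLAGS = {
    0x01: "branch",
    0x02: "leaf",
    0x04: "meta",
    0x10: "freelist",
}


def page_type(flags):
    """Return the page-type name for a page header's flags value."""
    for bit, name in PAGE_FLAGS.items():
        if flags & bit:
            return name
    return "unknown<%02x>" % flags


for value in (0x01, 0x02, 0x04, 0x10, 0x08):
    print(hex(value), page_type(value))
```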

FAQ

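  • Can the etcd watch mechanism guarantee that no events are lost?
  • What factors cause a cluster leader change?
  • Why can etcd, which is built on Raft, still end up with inconsistent data?
  • Why does the db size stay the same after deleting a large amount of data?
  • Why does the etcd community recommend keeping the db under 8 GB (default quota: 2 GB)?
  • Why can write requests time out even when disk IO latency is low on every node?
  • Why can the etcd process consume several GB of memory when it stores only a single k/v of a few hundred KB?
  • When tens of thousands of pod/crd resources exist in one namespace and specific pod/crd resources are frequently queried by label, why can't api-server and etcd keep up?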

to identify a compromised node

  • monitor node behavior – irregularity
    • heartbeat
    • log replication
    • election behavior
  • audit logs – unusual patterns
    • access logs
    • operation logs
  • data integrity checks
    • checksum verification
    • snapshot comparison
  • security measures
    • authentication and authorization
    • encryption
  • health checks
    • liveness and readiness probes
    • resource monitoring
  • consensus protocol violations
    • protocol adherence
    • quorum verification
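
For the checksum-verification item, one possible approach is to hash each member's key-value data in a canonical order and flag members whose hash differs from the most common one. The sketch below works on plain dictionaries and is purely illustrative. The function names are hypothetical, and the comparison is only meaningful when all members are read at the same revision.

```python
import hashlib
from collections import Counter


def kv_hash(kvs):
    """Hash a dict of key -> value in key-sorted order."""
    h = hashlib.sha256()
    for key in sorted(kvs):
        for part in (key, kvs[key]):
            data = part.encode()
            # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
    return h.hexdigest()


def find_divergent_members(member_kvs):
    """Return sorted names of members whose hash differs from the most common hash."""
    if not member_kvs:
        return []
    hashes = {name: kv_hash(kvs) for name, kvs in member_kvs.items()}
    majority_hash, _ = Counter(hashes.values()).most_common(1)[0]
    return sorted(name for name, h in hashes.items() if h != majority_hash)


# Hypothetical 3-member cluster where member "c" holds a corrupted value.
members = {
    "a": {"hello": "world"},
    "b": {"hello": "world"},
    "c": {"hello": "w0rld"},
}
print(find_divergent_members(members))  # ['c']
```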

Request Latency

how to debug: metrics, etcd log, trace logs, blktrace, pprof

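  • Network quality: RTT between nodes, saturated NIC bandwidth (packet loss)
  • Disk I/O jitter: jitter in WAL persistence or boltdb transaction commits, which can trigger a leader change
  • Expensive requests: large payloads, traversing a large number of keys, password authentication via Authenticate
  • Capacity bottleneck: too many write requests degrade linearizable read performance
  • Node configuration: a busy CPU delays request processing; insufficient memory causes swapping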

High Memory Usage

  • raftlog: to help slow followers catch up, at least the 5000 most recent write requests are kept in memory
  • treeIndex: every key-value pair keeps an index entry in memory
  • boltdb: mmap (compact, defrag)
  • watcher: gRPC watch streams, number of watchers

Performance Tuning / Stability

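  • Expensive IO/CPU work hidden in the Authenticate RPC
    • Prefer certificate-based authentication in production
    • Make sure clients reuse tokens
    • Use version 3.4.9 or later
  • Learner Node
  • Reduce expensive read requests
    • Fetch the full data set once at startup, then use watch for incremental updates
    • Shard and split data, using etcd key prefixes
    • Pagination
  • treeIndex, boltdb lock
  • When committedIndex is more than 5000 ahead of appliedIndex, requests fail with a "too many requests" error
  • Avoid frequent leader elections caused by IO/CPU latency
  • snapshot configuration
    • follower catch-up
    • --snapshot-count
  • Big Value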
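
The pagination idea for reducing expensive reads can be sketched as follows: read a prefix in pages of bounded size, and start each new page at the smallest key strictly after the last key returned. This sketch simulates the key space with a sorted Python list instead of talking to etcd. All function names are hypothetical, and the prefix is assumed to be non-empty ASCII.

```python
import bisect


def prefix_range_end(prefix):
    """Smallest key greater than every key with this prefix (non-empty ASCII assumed)."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def range_page(sorted_keys, start, end, limit):
    """Return up to `limit` keys in the half-open range [start, end)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:min(hi, lo + limit)]


def paginate(sorted_keys, prefix, limit):
    """Yield pages of keys under prefix, at most `limit` keys per page."""
    start, end = prefix, prefix_range_end(prefix)
    while True:
        page = range_page(sorted_keys, start, end, limit)
        if not page:
            return
        yield page
        if len(page) < limit:
            return
        # Appending a zero byte gives the smallest key strictly after the last one.
        start = page[-1] + "\x00"


keys = sorted(["/pods/%03d" % i for i in range(7)] + ["/svc/a"])
for page in paginate(keys, "/pods/", 3):
    print(page)
```

Each page touches only a bounded number of keys, which keeps any single request from becoming an expensive full-prefix traversal.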

Best Practice

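  • Enable etcd's data corruption detection: --experimental-initial-corrupt-check, --experimental-corrupt-check-time
  • Application-level data consistency checks
  • Regular data backups
  • Sound operational practices
    • Run a recent stable version
    • Keep versions consistent across members
    • Roll out changes gradually (canary)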
Licensed under CC BY-NC-SA 4.0