etcd in Practice

etcd Basic Architecture

client
- load balancing
- node failover
API network (gRPC)
Raft Consensus
- leader election
- log replication
- ReadIndex
function
- KVServer
- MVCC
- Auth
- Lease
- Compactor
storage
- WAL
- Snapshot
- boltdb

architecture

+---------------------+
|       Clients       |
+---------------------+
          |
          v
+---------------------+
|       gRPC API      |
+---------------------+
          |
          v
+---------------------+       +---------------------+
|       Leader        | <-->  |      Followers      |
|   (Raft Consensus)  |       |   (Raft Consensus)  |
+---------------------+       +---------------------+
          |
          v
+---------------------+
|     BoltDB (MVCC)   |
+---------------------+
          |
          v
+---------------------+
|  Authentication &   |
|   Authorization     |
+---------------------+
          |
          v
+---------------------+
|       Leases        |
+---------------------+
          |
          v
+---------------------+
|       Watch         |
+---------------------+

Raft Algorithm

leader election
- leader, follower, candidate (preVote, preCandidate)
- heart-beat interval, election timeout
log replication
safety
- term (epoch), nextIndex, commitIndex
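
As a rough illustration of how commitIndex relates to log replication, the sketch below computes the highest log index stored on a majority of members. The function names and the match_index list are hypothetical, and the real Raft rule that the entry must also belong to the leader's current term is left out for brevity.

```python
def quorum(cluster_size):
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def leader_commit_index(match_index):
    """Highest log index replicated on a majority of members.

    match_index holds, for every member including the leader itself,
    the highest log index known to be stored on that member.
    """
    ranked = sorted(match_index, reverse=True)
    # The quorum-th largest value is present on at least a quorum of members.
    return ranked[quorum(len(ranked)) - 1]
```

With three members at indexes 5, 5 and 3, two of them (a majority) hold index 5, so 5 can be committed even though the third member lags.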

Auth Module

username:password + RBAC
- bcrypt password hashing (Blowfish-based): salt, configurable cost (hash iterations)
- X.509 client certificates
- ACL, ABAC, RBAC
- JWT
- interval tree (range permission checks)

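# create the root user (user:password), then turn authentication on
etcdctl user add root:root
etcdctl auth enable
etcdctl put hello world --user root:root

# role: create it, grant read/write on the key range [hello, helly), assign it to a user
etcdctl role add admin --user root:root
etcdctl role grant-permission admin readwrite hello helly --user root:root
# the user must exist before a role can be granted (alice:alice is a placeholder password)
etcdctl user add alice:alice --user root:root
etcdctl user grant-role alice admin --user root:root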
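
The grant-permission command above takes a key and an end key, which etcd treats as the half-open range [key, end). The sketch below is a simplified model of that check, not etcd's implementation: the grants list and the is_permitted helper are hypothetical, and a plain linear scan stands in for the tree-based lookup.

```python
def is_permitted(grants, key, op):
    """Return True if any grant covers `key` for operation `op`.

    grants: list of (start, end, perm) tuples, where the range is [start, end)
            and perm is "read", "write" or "readwrite".
    op:     "read" or "write".
    """
    for start, end, perm in grants:
        if start <= key < end and perm in (op, "readwrite"):
            return True
    return False


# Mirrors: etcdctl role grant-permission admin readwrite hello helly
admin_grants = [(b"hello", b"helly", "readwrite")]
```

Under this model the key hello is writable, while helly itself falls outside the range because the end key is exclusive.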

Lease Module

How to detect whether a process is alive
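
One common answer to the liveness question is a lease with a TTL: the process keeps renewing the lease, and if it dies the renewals stop, the lease expires, and the keys attached to it disappear. The sketch below models only the expiry logic. The Lease class is hypothetical and takes the current time as an explicit argument so the behavior is easy to follow; it is not the etcd client API.

```python
class Lease:
    """Toy lease: alive until `ttl` seconds pass without a keep-alive."""

    def __init__(self, ttl, now):
        self.ttl = ttl
        self.expires_at = now + ttl

    def expired(self, now):
        return now >= self.expires_at

    def keep_alive(self, now):
        """Renew the lease; returns False if it has already expired."""
        if self.expired(now):
            return False
        self.expires_at = now + self.ttl
        return True
```

A process that renews a 10-second lease every 5 seconds stays alive indefinitely; once it stops, observers see the lease expire at most one TTL later.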

MVCC/Watch Module

TXN

WAL, uncommitted/committed transactions (after a system failure, recover from the WAL)
MVCC
consistent index

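Dirty reads, dirty writes, non-repeatable reads and read skew, phantom reads and write skew, lost updates, snapshot isolation, serializable snapshot isolation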
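
To make the link between MVCC and snapshot reads concrete, here is a minimal sketch of a multi-version store with a global revision counter. The MVCCStore class is hypothetical and leaves out deletes, compaction, and the split between an in-memory index and on-disk storage; it only shows that a reader pinned to a revision never observes later writes.

```python
class MVCCStore:
    """Toy multi-version key-value store with a global revision counter."""

    def __init__(self):
        self.revision = 0
        self.history = {}  # key -> list of (revision, value), oldest first

    def put(self, key, value):
        """Append a new version and return the revision that created it."""
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, at_revision=None):
        """Read the newest version visible at `at_revision` (default: latest)."""
        rev = self.revision if at_revision is None else at_revision
        for r, v in reversed(self.history.get(key, [])):
            if r <= rev:
                return v
        return None
```

Reading twice at the same at_revision always returns the same value, which is the property that rules out non-repeatable reads for that reader.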

Backend / Boltdb

member/snap/db
- key-value
- lease
- meta
- member
- cluster
- auth
mmap with fsync/fdatasync

four types of page

branch page 0x01
leaf page 0x02
meta page 0x04
freelist page 0x10
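
The flag values above can be turned into a small decoder. The sketch assumes the 16-byte page header layout of id (uint64), flags (uint16), count (uint16) and overflow (uint32), read as little-endian; the byte order and the helper names are assumptions made for illustration.

```python
import struct

# Flag values taken from the list above.
PAGE_FLAGS = {0x01: "branch", 0x02: "leaf", 0x04: "meta", 0x10: "freelist"}

# Assumed header layout: id, flags, count, overflow (little-endian).
PAGE_HEADER = struct.Struct("<QHHI")


def page_type(flags):
    """Map a page flag value to its name."""
    return PAGE_FLAGS.get(flags, "unknown(0x%02x)" % flags)


def parse_page_header(buf):
    """Decode the first 16 bytes of a page into a dict."""
    page_id, flags, count, overflow = PAGE_HEADER.unpack_from(buf)
    return {"id": page_id, "type": page_type(flags),
            "count": count, "overflow": overflow}
```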

FAQ

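Can the etcd watch mechanism guarantee that no events are lost?
What factors can cause the cluster leader to change?
Why can etcd, which is built on Raft, still end up with inconsistent data?
Why does dbsize stay the same after a large amount of data is deleted?
Why does the etcd community recommend keeping the db under 8 GB (default quota: 2 GB)?
Why can write requests time out even when disk I/O latency is low on every node?
Why can the etcd process use several GB of memory when it stores only a single k/v of a few hundred KB?
When tens of thousands of pod/CRD resources exist in one namespace and specific pod/CRD resources are queried frequently by label, why can't api-server and etcd keep up?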

to identify a compromised node

monitor node behavior – irregularity
- heartbeat
- log replication
- election behavior
audit logs – unusual patterns
- access logs
- operation logs
data integrity checks
- checksum verification
- snapshot comparison
security measures
- authentication and authorization
- encryption
health checks
- liveness and readiness probes
- resource monitoring
consensus protocol violations
- protocol adherence
- quorum verification
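
The checksum verification item can be sketched as follows: compute a checksum over each member's key-value data at the same revision and flag members that disagree with the most common value. The helper names and the in-memory snapshot dicts are hypothetical. CRC32 only detects divergence; it is not tamper-resistant, so a cryptographic hash would be the safer choice against a deliberate attacker.

```python
import zlib
from collections import Counter


def kv_checksum(kvs):
    """CRC32 over all key-value pairs in sorted key order."""
    crc = 0
    for key in sorted(kvs):
        # A zero byte after each field keeps ("ab", "c") distinct from ("a", "bc").
        crc = zlib.crc32(key.encode() + b"\x00", crc)
        crc = zlib.crc32(kvs[key].encode() + b"\x00", crc)
    return crc


def suspect_members(snapshots):
    """Names of members whose checksum differs from the most common one."""
    sums = {name: kv_checksum(kvs) for name, kvs in snapshots.items()}
    majority, _ = Counter(sums.values()).most_common(1)[0]
    return sorted(name for name, s in sums.items() if s != majority)
```

A member reported here is only a candidate for investigation: the comparison is meaningful only if every snapshot was taken at the same revision.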

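Request Latency

how to debug: metrics, etcd logs, trace logs, blktrace, pprof

network quality: RTT between nodes, saturated NIC bandwidth (packet loss)
disk I/O jitter: WAL persistence and boltdb transaction commits stall, which can trigger a leader change
expensive requests: large payloads, scans over many keys, password-based Authenticate
capacity bottleneck: too many write requests degrade linearizable read performance
node configuration: a busy CPU delays request processing; insufficient memory causes swapping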

High Memory Usage

raftlog: to help slow followers catch up, at least the 5000 most recent write requests are kept in memory
treeIndex: every key-value pair keeps an index entry in memory
boltdb: mmap (compact, defrag)
watcher: gRPC watch streams, number of watchers

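Performance Optimization / Stability

the Authenticate RPC implies expensive I/O and CPU work
- in production, prefer certificate-based authentication
- make sure clients reuse the token
- use version 3.4.9 or later
Learner Node
reduce expensive read requests
- fetch the full data set once at startup, then use watch to receive incremental changes
- shard or split the data, e.g. by etcd prefix
- pagination
treeIndex, boltdb lock
committedIndex exceeds appliedIndex by more than 5000: "too many requests" error
avoid frequent leader elections caused by I/O or CPU latency
snapshot configuration
- follower catch-up
- --snapshot-count
Big Value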
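
The pagination item above can be sketched like this: read a bounded number of keys per request and resume from just after the last key returned, instead of scanning the whole prefix in one call. The functions below work on an in-memory sorted list and are hypothetical stand-ins for a range request with a limit. Appending "\x00" to the last key gives the smallest key that sorts after it.

```python
import bisect


def range_page(sorted_keys, start, end, limit):
    """Return up to `limit` keys in [start, end) and the key to resume from.

    The resume key is None when the range is exhausted.
    """
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    page = sorted_keys[lo:min(lo + limit, hi)]
    more = lo + limit < hi
    next_start = page[-1] + "\x00" if more else None
    return page, next_start


def scan_all(sorted_keys, start, end, limit):
    """Walk the whole range one page at a time (limit must be >= 1)."""
    out = []
    while start is not None:
        page, start = range_page(sorted_keys, start, end, limit)
        out.extend(page)
    return out
```

Each call touches at most limit keys, so no single request has to hold the full result set in memory.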

Best Practice

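enable etcd's data corruption checks: --experimental-initial-corrupt-check, --experimental-corrupt-check-time
data consistency checks at the application layer
scheduled data backups
sound operational practices
- run a recent stable version
- keep versions consistent across members
- roll out changes gradually (canary)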