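Background
On the K2 testnet, the block height of the replica nodes fell so far behind the primary node (Sequencer) that they could no longer sync normally. Emergency action was needed to keep the service available.
Environment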
| Node | IP | Role |
|---|---|---|
| Sequencer | 10.7.113.106 | Primary node (sequencer) |
| Replica 1 | 10.7.66.87 | Replica node |
| Replica 2 | 10.7.102.137 | Replica node + Nginx |
| ETH Node | 10.7.95.144 | Self-hosted L1 node |
| Explorer | 10.7.120.156 | Block explorer |
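Symptoms
- Blocks on the primary node advanced normally
- Block height on the replica nodes stalled, and the gap to the primary node kept widening
- Users querying over RPC could receive inconsistent block data
Investigation
1. KeyStore connection problem
- Domain name resolution failed
- A stale DNS cache caused connections to fail
2. P2P sync failure
- The replica nodes could not sync the latest blocks from the Sequencer over the P2P network
- Possible causes: network partition, changed firewall rules, changed node ID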
Emergency Response Plan
Phase 1: Nginx traffic switch (5 minutes)
Goal: route all RPC traffic to the Sequencer node so that users see consistent data
# Log in to the Nginx node (10.7.102.137)
vim /usr/local/openresty/nginx/conf/vhosts/nal/testnet-rpc.nal.network.conf
Configuration before the change:
upstream testnet-rpc {
# Load-balance across the replica nodes
server 10.7.66.87:8545 weight=50 max_fails=2 fail_timeout=30s;
server 10.7.102.137:8545 weight=50 max_fails=2 fail_timeout=30s;
}
Configuration after the change:
upstream testnet-rpc {
# Replica nodes commented out
#server 10.7.66.87:8545 weight=50 max_fails=2 fail_timeout=30s;
#server 10.7.102.137:8545 weight=50 max_fails=2 fail_timeout=30s;
# Point only at the primary node (Sequencer)
server 10.7.113.106:8545;
}
server {
listen 80;
server_name testnet-rpc.nal.network;
index index.html index.htm;
ssl_session_timeout 5m;
ssl_protocols TLSv1.1 TLSv1.2 TLSv1.3;
location / {
proxy_pass http://testnet-rpc;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header X-Forwarded-Port $server_port;
error_log /data/wwwlogs/testnet-rpc-nal-network/http_error.log error;
access_log /data/wwwlogs/testnet-rpc-nal-network/http_access.log access;
}
}
# Test the configuration, then reload (a reload, not a full restart)
/usr/local/openresty/nginx/sbin/nginx -t
/usr/local/openresty/nginx/sbin/nginx -s reload
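The edit-test-reload sequence above can be wrapped so that a failed syntax check never reaches the reload step and the previous configuration is always kept. The following is a minimal sketch, not the procedure used during the incident; the function names `backup_conf` and `safe_reload` and the `NGINX_BIN` override are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Path taken from the steps above; NGINX_BIN is a hypothetical override
# so the helper can be exercised on a machine without OpenResty.
NGINX_BIN=${NGINX_BIN:-/usr/local/openresty/nginx/sbin/nginx}

# backup_conf FILE: keep a timestamped copy of FILE and print its path.
backup_conf() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$1" "$backup" && echo "$backup"
}

# safe_reload: reload Nginx only if the configuration test passes.
safe_reload() {
    if "$NGINX_BIN" -t; then
        "$NGINX_BIN" -s reload
    else
        echo "nginx -t failed; not reloading" >&2
        return 1
    fi
}
```

A typical use would be to run `backup_conf` on `testnet-rpc.nal.network.conf` before editing it, then call `safe_reload` once the upstream block has been changed.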
Phase 2: Data sync (30 minutes)
Goal: copy the complete data set from the Sequencer to the Replica nodes
1. Prepare the data on the Sequencer node
# Log in to the Sequencer (10.7.113.106)
cd /data/deploy/op-geth/datadir/
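# Create the sync directory
mkdir -pv /data/nfs/op/20240813/datadir
# Copy the complete data (chaindata + lightchaindata)
cp -r /data/deploy/op-geth/datadir/* /data/nfs/op/20240813/datadir/
# Sanity-check the copy (listing and size only; this does not compare file contents)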
ls -lh /data/nfs/op/20240813/datadir/geth/
du -sh /data/nfs/op/20240813/datadir/geth/chaindata
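The `ls` and `du` checks above confirm that files arrived, but not that their contents match. If a stronger check is wanted, one option is to compare a checksum of both trees. This is a sketch under assumptions: `dir_digest` and `same_tree` are hypothetical helper names, hashing a large chaindata directory can take a long time, and the digests will only match if nothing is writing to the source directory while it is copied and hashed.

```shell
#!/bin/sh
# dir_digest DIR: print one SHA-256 computed over the sorted per-file
# checksums of every regular file under DIR. Fails if DIR is not accessible.
dir_digest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + \
        | LC_ALL=C sort -k 2 | sha256sum | cut -d ' ' -f 1 )
}

# same_tree SRC DST: succeed only if both trees contain the same
# relative paths with identical contents.
same_tree() {
    src=$(dir_digest "$1") || return 1
    dst=$(dir_digest "$2") || return 1
    [ "$src" = "$dst" ]
}
```

For the copy above, the call would be `same_tree /data/deploy/op-geth/datadir /data/nfs/op/20240813/datadir`.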
2. Rebuild the Replica nodes
Replica 1 (10.7.66.87):
# 1. Stop the services
supervisorctl stop test-replica1-op-geth
supervisorctl stop test-replica1-op-node
# 2. Back up the old data (keeps the failed state intact)
cd /data/deploy/op-geth/
mv datadir datadir_bad_0813
# 3. Copy the primary node's data from NFS
cp -r /data/nfs/op/20240813/datadir ./
# 4. Check the data
du -sh datadir/geth/chaindata
ls datadir/geth/
# 5. Start the services
supervisorctl start test-replica1-op-geth
supervisorctl start test-replica1-op-node
# 6. Check sync status
supervisorctl status
tail -f /data/logs/op-node/test-replica1-op-node-out.log
Replica 2 (10.7.102.137):
# Same procedure as Replica 1
supervisorctl stop test-replica2-op-geth
supervisorctl stop test-replica2-op-node
cd /data/deploy/op-geth/
mv datadir datadir_bad_0813
cp -r /data/nfs/op/20240813/datadir ./
supervisorctl start test-replica2-op-geth
supervisorctl start test-replica2-op-node
Phase 3: Restore load balancing (after verification)
Once the replica nodes have caught up, restore the Nginx load-balancing configuration:
upstream testnet-rpc {
server 10.7.66.87:8545 weight=50 max_fails=2 fail_timeout=30s;
server 10.7.102.137:8545 weight=50 max_fails=2 fail_timeout=30s;
server 10.7.113.106:8545 backup; # Sequencer as backup
}
Verification Commands
Check block height
# Query the Sequencer height
curl -s -H "Content-Type: application/json" -X POST --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' http://10.7.113.106:8545 | jq .result
# Query the Replica 1 height
curl -s -H "Content-Type: application/json" -X POST --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' http://10.7.66.87:8545 | jq .result
# Query the Replica 2 height
curl -s -H "Content-Type: application/json" -X POST --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' http://10.7.102.137:8545 | jq .result
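`eth_blockNumber` returns the height as a 0x-prefixed hex string, which is awkward to compare by eye. The sketch below converts the result to decimal and computes how far a replica is behind. It assumes `curl` and `jq` are available, as in the commands above; the function names `hex_to_dec`, `get_height`, and `block_lag` are hypothetical.

```shell
#!/bin/sh
# hex_to_dec HEX: convert a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

# get_height URL: print the node's current block height in decimal.
# Fails if the node does not answer or returns no result.
get_height() {
    hex=$(curl -s -H "Content-Type: application/json" -X POST \
        --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
        "$1" | jq -r .result)
    [ -n "$hex" ] && [ "$hex" != "null" ] || return 1
    hex_to_dec "$hex"
}

# block_lag SEQ_HEIGHT REPLICA_HEIGHT: number of blocks the replica is behind.
block_lag() {
    echo $(( $1 - $2 ))
}
```

For example, `block_lag "$(get_height http://10.7.113.106:8545)" "$(get_height http://10.7.66.87:8545)"` prints the gap between the Sequencer and Replica 1.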
Check P2P connections
# Replica node log
grep "connected to peer" /data/logs/op-node/test-replica1-op-node-out.log
# Expected output
t=2024-08-13T02:39:40+0000 lvl=info msg="connected to peer" peer=16Uiu2HAmK8oVSbeJjgtfUAVEVfC75GFnSVgMZQ5VfZgm6NX8XWE3 addr=/ip4/10.7.113.106/tcp/9003
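To check specifically for the Sequencer rather than any peer, the log search can be narrowed to the Sequencer's address. This is a small sketch based only on the log line format shown above; `has_peer_connection` is a hypothetical helper, and a match shows that a connection was logged at some point, not that it is still open.

```shell
#!/bin/sh
# has_peer_connection LOG IP: succeed if LOG contains a
# "connected to peer" line whose addr field points at IP.
has_peer_connection() {
    grep -F 'msg="connected to peer"' "$1" | grep -qF "addr=/ip4/$2/"
}
```

On Replica 1 this would be called as `has_peer_connection /data/logs/op-node/test-replica1-op-node-out.log 10.7.113.106`.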
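Retrospective
Root cause:
- A network partition broke the P2P connections
- The replica nodes could not sync for a long time, so the block gap accumulated
- There was no automatic alerting, so the gap was already too large by the time it was noticed
Resolution:
- Switched traffic to the Sequencer as an emergency measure to keep the service available
- Rebuilt the replica nodes from a full data copy over NFS
- Gradually returned to load balancing after recovery
Improvements:
- Monitor the block height gap (alert when it exceeds 50 blocks)
- Configure automatic failover: switch the Nginx upstream automatically when the gap grows too large
- Take regular full backups of datadir to shorten recovery time (target < 10 minutes)
- Add P2P connection health checks with automatic reconnection
- Document the incident-handling SOP, with clear steps and owners for each phase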
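The first improvement, alerting when the gap exceeds 50 blocks, could start as a small check run from cron or a monitoring agent. The sketch below only covers the threshold decision and takes decimal heights as arguments; how the heights are collected and where the alert is delivered are left open. `lag_status` and the `LAG_THRESHOLD` variable are hypothetical names.

```shell
#!/bin/sh
# Threshold from the improvement list: alert when the gap exceeds 50 blocks.
LAG_THRESHOLD=${LAG_THRESHOLD:-50}

# lag_status NAME SEQ_HEIGHT REPLICA_HEIGHT
# Prints one status line. Returns 1 when the replica is more than
# LAG_THRESHOLD blocks behind the Sequencer, 0 otherwise.
lag_status() {
    name=$1
    lag=$(( $2 - $3 ))
    if [ "$lag" -gt "$LAG_THRESHOLD" ]; then
        echo "ALERT $name lag=$lag threshold=$LAG_THRESHOLD"
        return 1
    fi
    echo "OK $name lag=$lag threshold=$LAG_THRESHOLD"
    return 0
}
```

Because the function signals an alert through its exit status, the same check could also gate an automated upstream switch, although that automation is not part of this sketch.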
This article was first published on wr.mrchi.cn. Please credit the source when reposting.