1. Overview
In a distributed Ceph environment, each OSD node runs multiple OSD daemons that store data. When you want to remove an entire OSD node (for example because the hardware has reached end of life, needs an upgrade, or has failed badly), you need to follow the right procedure to keep data safe and avoid impacting the cluster.
This article walks through a standard procedure for removing an OSD node, with a hands-on example, illustrative diagrams and, in particular, a script that gradually lowers (reweights) each OSD's weight before removal.
2. When do you need to remove an OSD node?
✅ Use this procedure when:
- The OSD node has reached end of life and needs to be replaced
- You are consolidating OSDs to shrink the cluster
- The physical node has failed and cannot be repaired
❌ Not appropriate when:
- The node is only temporarily powered off
- The node also runs MON/MGR services – those need separate handling
3. Two situations to distinguish
Node situation | How to handle the OSDs on it |
---|---|
Node still operational | Soft-remove each OSD (using the script) |
Node completely dead | Hard-remove with ceph osd lost |
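A quick way to tell which situation you are in is to check whether cephadm can still reach the host and whether its OSDs are still up. A rough sketch, using this lab's example host name:
HOST="CEPH-LAB-OSD-076"
# If cephadm can no longer reach the host, the STATUS column shows it as Offline
ceph orch host ls | grep "$HOST"
# Are the OSDs on the host still up, or already down?
ceph osd tree | grep "host $HOST" -A 12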
4. Diagrams
Cluster diagram
+-------------------------------+
|          Ceph Cluster         |
+-------------------------------+
                |
        +-------+---------------------+
        |                             |
+---------------+             +----------------+
|    MON/MGR    |             |   OSD Nodes    |
|     Nodes     |             | (data storage) |
+---------------+             +----------------+
        |                             |
+------------------------+    +---------------------------+
| CEPH-LAB-MON-071       |    | CEPH-LAB-OSD-074 (osd)    |
| IP: 10.237.7.71        |    | IP: 10.237.7.74           |
| Roles: _admin, mon,    |    +---------------------------+
|        mgr, osd        |
+------------------------+    +---------------------------+
| CEPH-LAB-MON-072       |    | CEPH-LAB-OSD-075 (osd)    |
| IP: 10.237.7.72        |    | IP: 10.237.7.75           |
| Roles: _admin, mon,    |    +---------------------------+
|        mgr, osd        |
+------------------------+    +---------------------------+
| CEPH-LAB-MON-073       |    | CEPH-LAB-OSD-076 (osd)    |
| IP: 10.237.7.73        |    | IP: 10.237.7.76           |
| Roles: _admin, mon,    |    +---------------------------+
|        mgr, osd        |
+------------------------+
- The MON nodes also act as MGR nodes and host OSDs as well.
- Nodes CEPH-LAB-OSD-074 → 076 are storage-only (osd-only) nodes.
- The _admin label indicates that the node can perform cluster administration tasks (via cephadm).
Workflow for removing one OSD node (with multiple OSDs inside)
+---------------------------+
| Node CEPH-LAB-OSD-076     |
+---------------------------+
| osd.8, osd.11, osd.14,... |
+---------------------------+
              |
              v
1. Reweight each OSD down to 0 (using the script)
              |
              v
2. Remove each OSD with the appropriate procedure
              |
              v
3. Verify: no OSDs left on the node
              |
              v
4. Remove the crash / exporter daemons (if any)
              |
              v
5. Remove the node from the cluster (orch host rm)
5. Reviewing the cluster before the lab
Check the cluster status.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_OK
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 55 osds: 55 up (since 5d), 55 in (since 3w)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 35 GiB used, 515 GiB / 550 GiB avail
pgs: 1 active+clean
Host list
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch host ls
HOST ADDR LABELS STATUS
CEPH-LAB-MON-071 10.237.7.71 _admin,mon,mgr,osd
CEPH-LAB-MON-072 10.237.7.72 _admin,mon,mgr,osd
CEPH-LAB-MON-073 10.237.7.73 _admin,mon,mgr,osd
CEPH-LAB-OSD-074 10.237.7.74 osd
CEPH-LAB-OSD-075 10.237.7.75 osd
CEPH-LAB-OSD-076 10.237.7.76 osd
6 hosts in cluster
OSD list
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 92.54982 root ssd-01
-9 15.29997 host CEPH-LAB-MON-071
12 ssd 1.70000 osd.12 up 1.00000 1.00000
15 ssd 1.70000 osd.15 up 1.00000 1.00000
20 ssd 1.70000 osd.20 up 1.00000 1.00000
25 ssd 1.70000 osd.25 up 1.00000 1.00000
30 ssd 1.70000 osd.30 up 1.00000 1.00000
35 ssd 1.70000 osd.35 up 1.00000 1.00000
40 ssd 1.70000 osd.40 up 1.00000 1.00000
45 ssd 1.70000 osd.45 up 1.00000 1.00000
50 ssd 1.70000 osd.50 up 1.00000 1.00000
-13 15.29997 host CEPH-LAB-MON-072
13 ssd 1.70000 osd.13 up 1.00000 1.00000
17 ssd 1.70000 osd.17 up 1.00000 1.00000
23 ssd 1.70000 osd.23 up 1.00000 1.00000
28 ssd 1.70000 osd.28 up 1.00000 1.00000
33 ssd 1.70000 osd.33 up 0.95001 1.00000
38 ssd 1.70000 osd.38 up 1.00000 1.00000
43 ssd 1.70000 osd.43 up 1.00000 1.00000
48 ssd 1.70000 osd.48 up 1.00000 1.00000
53 ssd 1.70000 osd.53 up 1.00000 1.00000
-7 15.29997 host CEPH-LAB-MON-073
9 ssd 1.70000 osd.9 up 1.00000 1.00000
18 ssd 1.70000 osd.18 up 0.95001 1.00000
22 ssd 1.70000 osd.22 up 1.00000 1.00000
27 ssd 1.70000 osd.27 up 1.00000 1.00000
32 ssd 1.70000 osd.32 up 1.00000 1.00000
36 ssd 1.70000 osd.36 up 0.95001 1.00000
41 ssd 1.70000 osd.41 up 1.00000 1.00000
46 ssd 1.70000 osd.46 up 1.00000 1.00000
51 ssd 1.70000 osd.51 up 1.00000 1.00000
-3 13.59998 host CEPH-LAB-OSD-074
0 ssd 1.70000 osd.0 up 1.00000 1.00000
1 ssd 1.70000 osd.1 up 0.95001 1.00000
2 ssd 1.70000 osd.2 up 1.00000 1.00000
3 ssd 1.70000 osd.3 up 1.00000 1.00000
4 ssd 1.70000 osd.4 up 1.00000 1.00000
5 ssd 1.70000 osd.5 up 1.00000 1.00000
6 ssd 1.70000 osd.6 up 1.00000 1.00000
7 ssd 1.70000 osd.7 up 1.00000 1.00000
-11 15.29997 host CEPH-LAB-OSD-075
10 ssd 1.70000 osd.10 up 1.00000 1.00000
16 ssd 1.70000 osd.16 up 1.00000 1.00000
21 ssd 1.70000 osd.21 up 1.00000 1.00000
26 ssd 1.70000 osd.26 up 1.00000 1.00000
31 ssd 1.70000 osd.31 up 1.00000 1.00000
37 ssd 1.70000 osd.37 up 1.00000 1.00000
42 ssd 1.70000 osd.42 up 1.00000 1.00000
47 ssd 1.70000 osd.47 up 1.00000 1.00000
52 ssd 1.70000 osd.52 up 1.00000 1.00000
-5 17.74995 host CEPH-LAB-OSD-076
8 ssd 1.70000 osd.8 up 1.00000 1.00000
11 ssd 1.70000 osd.11 up 1.00000 1.00000
14 ssd 1.70000 osd.14 up 1.00000 1.00000
19 ssd 1.70000 osd.19 up 1.00000 1.00000
24 ssd 1.70000 osd.24 up 1.00000 1.00000
29 ssd 1.70000 osd.29 up 1.00000 1.00000
34 ssd 1.70000 osd.34 up 1.00000 1.00000
39 ssd 1.70000 osd.39 up 1.00000 1.00000
44 ssd 1.70000 osd.44 up 1.00000 1.00000
49 ssd 1.70000 osd.49 up 1.00000 1.00000
54 ssd 1.70000 osd.54 up 1.00000 1.00000
Note: osd.8 on node CEPH-LAB-OSD-076 was already removed in my previous lab posts.
6. Detailed steps to remove an OSD node
Step 1 – List all OSDs on the node
ceph osd tree | grep CEPH-LAB-OSD-076
➡ Example: the OSDs are 8 11 14 19 24 29 34 39 44 49 54
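If you just want the list of OSD ids (for example to paste into the script's OSDs variable below), ceph osd ls-tree prints the ids under a CRUSH bucket; a quick sketch:
# Print the OSD ids under the CEPH-LAB-OSD-076 host bucket on a single line
ceph osd ls-tree CEPH-LAB-OSD-076 | sort -n | tr '\n' ' '; echo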
Step 2 – Use the script to automatically reweight down to 0
Before marking the OSDs out, you should gradually lower their weight to 0 so that Ceph moves the data to other OSDs safely and under control:
#!/bin/bash
### CONFIG
OSDs="11 14 19 24 29 34 39 44 49 54"
DECREMENT_WEIGHT=0.1
TARGET_WEIGHT=0
CONTROL_BACKFILL=10
INTERVAL_CHECK=2
WORK_DIR=/opt/weight-down
TARGET_HOST="CEPH-LAB-OSD-076"
### Validate numeric values
validate_weight () {
local w=$1
re='^[+-]?[0-9]+\.?[0-9]*$'
if ! [[ $w =~ $re && $DECREMENT_WEIGHT =~ $re && $TARGET_WEIGHT =~ $re && $INTERVAL_CHECK =~ $re ]]; then
echo "xx Not a number" >&2
exit 1
elif (( $(echo "$w < $TARGET_WEIGHT" | bc -l) )); then
echo "xx Weight cannot be smaller than target"
exit 1
elif (( $(echo "$TARGET_WEIGHT > 1.45" | bc -l) )) || (( $(echo "$TARGET_WEIGHT < 0" | bc -l) )); then
echo "xx TARGET_WEIGHT must be in [0;1.45]"
exit 1
elif (( $(echo "$DECREMENT_WEIGHT > 0.5" | bc -l) )) || (( $(echo "$DECREMENT_WEIGHT <= 0" | bc -l) )); then
echo "xx DECREMENT_WEIGHT must be in (0;0.5]"
exit 1
fi
return 0
}
### Pre-check: verify all OSDs and hostnames
for OSD in $OSDs; do
host_name=$(ceph osd find "$OSD" -f json 2>/dev/null | jq -r '.crush_location.host')
if [[ -z "$host_name" ]]; then
echo "!! osd.$OSD: crush_location.host is undefined or OSD does not exist. Aborting."
exit 1
fi
if [[ "$host_name" != "$TARGET_HOST" ]]; then
echo "!! osd.$OSD belongs to different host ($host_name ≠ $TARGET_HOST). Aborting."
exit 1
fi
done
### Init weights
mkdir -p "$WORK_DIR"
for OSD in $OSDs; do
host_name=$(ceph osd find "$OSD" -f json 2>/dev/null | jq -r '.crush_location.host')
if [[ -z "$host_name" || "$host_name" != "$TARGET_HOST" ]]; then
echo "!! osd.$OSD invalid or not on $TARGET_HOST. Skipping."
continue
fi
w=$(ceph osd df | awk -v osd="$OSD" '$1 == osd { print $3 }' | head -n1 | tr -d '[:space:]')
if [[ -z "$w" ]]; then
echo "!! Failed to get initial weight for osd.$OSD"
continue
fi
echo "$w" > "$WORK_DIR/weight.$OSD"
echo "=Init osd.$OSD with weight $w"
validate_weight "$w"
done
### Main Loop
while true; do
sleep 1
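# NOTE: when no PG is in active+remapped+backfilling, the grep below returns
# nothing, the arithmetic test cannot evaluate (its error is hidden by the
# 2>/dev/null), and execution falls through to the else branch. That is why one
# OSD has to be reweighted manually first (the priming step described below).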
if (( $(ceph -s | grep "active+remapped+backfilling" | awk '{print $1}') < $CONTROL_BACKFILL )) 2>/dev/null; then
for OSD in $OSDs; do
host_name=$(ceph osd find "$OSD" -f json 2>/dev/null | jq -r '.crush_location.host')
if [[ -z "$host_name" || "$host_name" != "$TARGET_HOST" ]]; then
echo "!! osd.$OSD is no longer on $TARGET_HOST (actual: ${host_name:-"unknown"}), skipping."
continue
fi
if (( $(ceph -s | grep "active+remapped+backfilling" | awk '{print $1}') < $CONTROL_BACKFILL )) 2>/dev/null; then
echo "+++++ Start: $(date)"
ceph -s | grep backfilling
w=$(ceph osd df | awk -v osd="$OSD" '$1 == osd { print $3 }' | head -n1 | tr -d '[:space:]')
if [[ -z "$w" ]]; then
echo "!! Failed to get current weight for osd.$OSD. Skipping..."
continue
fi
if (( $(echo "$w > $TARGET_WEIGHT" | bc -l) )); then
new_w=$(echo "$w - $DECREMENT_WEIGHT" | bc)
[[ "$new_w" == .* ]] && new_w="0$new_w"
if (( $(echo "$new_w < $TARGET_WEIGHT" | bc -l) )); then
new_w="$TARGET_WEIGHT"
fi
echo "++ osd.$OSD current weight: $w, new weight: $new_w"
echo "++ reweight: $(date)"
ceph osd crush reweight osd.$OSD "$new_w"
fi
sleep $INTERVAL_CHECK
fi
done
else
is_break=1
ceph osd df > /tmp/osd-df
for OSD in $OSDs; do
host_name=$(ceph osd find "$OSD" -f json 2>/dev/null | jq -r '.crush_location.host')
if [[ -z "$host_name" || "$host_name" != "$TARGET_HOST" ]]; then
echo "!! osd.$OSD moved from target host (${host_name:-unknown}), removing from list."
OSDs=$(echo "$OSDs" | sed "s/\b$OSD\b//g")
continue
fi
w=$(awk -v osd="$OSD" '$1 == osd { print $3 }' /tmp/osd-df | head -n1 | tr -d '[:space:]')
if [[ -z "$w" ]]; then
echo "!! OSD $OSD not found in df output. Skipping."
continue
fi
if (( $(echo "$w > $TARGET_WEIGHT" | bc -l) )); then
is_break=0
else
OSDs=$(echo "$OSDs" | sed "s/\b$OSD\b//g")
echo "++ osd.$OSD has reached target weight. Removed from list. Remaining: $OSDs"
fi
done
if (( $is_break == 1 )); then
echo "== All OSDs on $TARGET_HOST reached target weight ($TARGET_WEIGHT). Exiting."
break
fi
fi
done
➡ The script reweights each OSD step by step, lowering the weight by a small amount (0.1, the DECREMENT_WEIGHT value) each time until it reaches 0, and checks the cluster state with ceph -s to make sure backfill does not become overloaded.
You can run the script in the background; below, osd-reweight-down.sh is the script file name and 15607 is the PID of the background process.
shell> nohup ./osd-reweight-down.sh &
[1] 15607
Check the log to verify the initial weights. If the list printed matches the osd.<id> entries you expect, move on to the next step.
root@CEPH-LAB-MON-071:/home/hoanghd3# tail -f ./nohup.out
=Init 11 with weight 1.70000
=Init 14 with weight 1.70000
=Init 19 with weight 1.70000
=Init 24 with weight 1.70000
=Init 29 with weight 1.70000
=Init 34 with weight 1.70000
=Init 39 with weight 1.70000
=Init 44 with weight 1.70000
=Init 49 with weight 1.70000
=Init 54 with weight 1.70000
For the script to start working, you need to prime it by manually reweighting one OSD down a little (pick any OSD in the OSDs list being drained; in this case osd.11).
ceph osd crush reweight osd.11 1.65
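While waiting for the loop to finish, you can watch how much backfill the reweighting generates, independently of the script's log; a simple monitoring sketch:
# Print backfill/recovery related lines from ceph -s every 10 seconds
while true; do
  ceph -s | grep -E "backfilling|misplaced|degraded|recovery" || echo "no backfill in progress"
  sleep 10
done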
The result once the reweighting has finished.
root@CEPH-LAB-MON-071:/home/hoanghd3# tail -f ./nohup.out
=Init 11 with weight 1.70000
=Init 14 with weight 1.70000
=Init 19 with weight 1.70000
=Init 24 with weight 1.70000
=Init 29 with weight 1.70000
=Init 34 with weight 1.70000
=Init 39 with weight 1.70000
=Init 44 with weight 1.70000
=Init 49 with weight 1.70000
=Init 54 with weight 1.70000
++ osd is removed: 11. Current list: 14 19 24 29 34 39 44 49 54
++ osd is removed: 14. Current list: 19 24 29 34 39 44 49 54
++ osd is removed: 19. Current list: 24 29 34 39 44 49 54
++ osd is removed: 24. Current list: 29 34 39 44 49 54
++ osd is removed: 29. Current list: 34 39 44 49 54
++ osd is removed: 34. Current list: 39 44 49 54
++ osd is removed: 39. Current list: 44 49 54
++ osd is removed: 44. Current list: 49 54
++ osd is removed: 49. Current list: 54
++ osd is removed: 54. Current list:
Output of the ceph osd tree command:
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
11 ssd 0 osd.11 up 1.00000 1.00000
14 ssd 0 osd.14 up 1.00000 1.00000
19 ssd 0 osd.19 up 1.00000 1.00000
24 ssd 0 osd.24 up 1.00000 1.00000
29 ssd 0 osd.29 up 1.00000 1.00000
34 ssd 0 osd.34 up 1.00000 1.00000
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
Step 3 – Remove each OSD (once weight = 0 and PGs are active+clean)
Let's look at why, even with the weight already at 0, we still need to care about how each OSD is removed.
So what happens after you have run:
ceph osd crush reweight osd.<id> 0
It means Ceph will no longer place new data on this OSD, but:
- Ceph still considers the OSD a member of the cluster.
- It is still up, still present in the CRUSH map, OSD map and auth, and its container (daemon) is still running.
- The data is still on that OSD, unless you run osd out and let recovery take place.
The obvious question: can the OSD then be removed in any way we like? Not quite. Reweighting to 0 reduces the load up front, but the removal itself still has to match the type of OSD:
Case | Required action |
---|---|
OSD still alive | ceph osd out, stop the daemon, then remove it from CRUSH, the OSD map, auth, and remove the daemon |
OSD dead/failed | Must use ceph osd lost, because the daemon cannot be stopped, then continue the removal |
- Reason:
  - If the OSD is already dead (for example the node will not boot), you cannot stop the daemon or remove the container: systemctl or podman will simply fail.
  - In that case you need ceph osd lost so that Ceph handles its data internally and no longer waits for the OSD to come back.
In short:
- reweight 0 is a good preparation step, but it does not replace the full removal procedure.
- When removing, you still need to distinguish a live OSD from a failed one so you call the right command (stop vs lost).
- Removing it the wrong way can leave Ceph stuck in recovery or raising daemon error warnings.
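Before removing each OSD, also make sure the data movement triggered by the reweighting has finished, i.e. all PGs are back to active+clean. A small wait-loop sketch (assumes jq is installed):
# Block until every PG reports active+clean
while true; do
  not_clean=$(ceph -s -f json | jq '[.pgmap.pgs_by_state[] | select(.state_name != "active+clean") | .count] | add // 0')
  if [[ "$not_clean" -eq 0 ]]; then
    echo "All PGs are active+clean"
    break
  fi
  echo "$not_clean PGs are not active+clean yet, waiting..."
  sleep 10
done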
For an OSD that is still healthy:
ceph osd out <id>
ceph orch daemon stop osd.<id>
ceph osd crush remove osd.<id>
ceph osd rm <id>
ceph auth del osd.<id>
ceph orch daemon rm osd.<id> --force
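Between ceph orch daemon stop and the CRUSH/OSD map removal, it is safer to wait until the monitors actually mark the OSD down (the bonus script's "Waiting for osd... to go down" messages later in this article do the same thing). A possible helper, assuming jq:
# Wait until the cluster reports the given OSD as down
wait_osd_down() {
  local id=$1
  until ceph osd dump -f json | jq -e ".osds[] | select(.osd == $id and .up == 0)" >/dev/null; do
    echo "osd.$id is still up, waiting..."
    sleep 5
  done
  echo "osd.$id is now down"
}
wait_osd_down 11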
Example with osd.11:
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd out 11
marked out osd.11.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch daemon stop osd.11
Scheduled to stop osd.11 on host 'CEPH-LAB-OSD-076'
- Verification of the two commands above:
  - The osd.11 process has stopped.
  - podman ps no longer shows the osd.11 container.
  - systemctl status for osd.11 now reports inactive.
  - ceph osd tree shows osd.11 reweighted to 0 and in the down state.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch ps | grep osd.11
osd.11 CEPH-LAB-OSD-076 stopped 37s ago 3M - 4096M <unknown> <unknown> <unknown>
root@CEPH-LAB-OSD-076:~# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c8d05f66ac9e 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.14 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-14
bdd2e8f078a8 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.54 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-54
baa838967755 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.24 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-24
a1123e7933ea 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.19 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-19
dbf5ac1766ac 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.49 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-49
511cd981a587 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.29 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-29
aa6ff5cb8d81 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.44 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-44
8cebc31d2b90 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.34 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-34
e544461880c9 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n osd.39 -f --se... 2 months ago Up 2 months ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-osd-39
6e4895fed3a1 10.237.7.74:5000/ceph/ceph@sha256:479f0db9298e37defcdedb1edb8b8db25dd0b934afd6b409d610a7ed81648dbc -n client.crash.C... 11 days ago Up 11 days ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-crash-CEPH-LAB-OSD-076
d0f34489f70b 10.237.7.74:5000/prometheus/node-exporter:v1.5.0
--no-collector.ti... 11 days ago Up 11 days ago ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a-node-exporter-CEPH-LAB-OSD-076
root@CEPH-LAB-OSD-076:~# systemctl status ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@osd.11
○ ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@osd.11.service - Ceph osd.11 for 75ac298c-0653-11f0-a2e7-2b96c52a296a
Loaded: loaded (/etc/systemd/system/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2025-06-24 22:19:11 +07; 59s ago
Process: 1947753 ExecStop=/bin/bash -c bash /var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/osd.11/unit.stop (code=exited, status=0/SUCCESS)
Process: 1947905 ExecStopPost=/bin/bash /var/lib/ceph/75ac298c-0653-11f0-a2e7-2b96c52a296a/osd.11/unit.poststop (code=exited, status=0/SUCCESS)
Process: 1948202 ExecStopPost=/bin/rm -f /run/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@osd.11.service-pid /run/ceph-75ac298c-0653-11f0-a2e7-2b96c52a296a@osd.11.service-cid (code=exited, status=0/SUCCESS)
Main PID: 8324 (code=exited, status=0/SUCCESS)
CPU: 4h 52min 5.632s
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
11 ssd 0 osd.11 down 0 1.00000
14 ssd 0 osd.14 up 1.00000 1.00000
19 ssd 0 osd.19 up 1.00000 1.00000
24 ssd 0 osd.24 up 1.00000 1.00000
29 ssd 0 osd.29 up 1.00000 1.00000
34 ssd 0 osd.34 up 1.00000 1.00000
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
Let's continue removing osd.11.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd crush remove osd.11
removed item id 11 name 'osd.11' from crush map
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd rm 11
removed osd.11
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph auth del osd.11
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch daemon rm osd.11 --force
Removed osd.11 from host 'CEPH-LAB-OSD-076'
osd.11 has been removed successfully; here is the current cluster state.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch ps | grep osd.11
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
14 ssd 0 osd.14 up 1.00000 1.00000
19 ssd 0 osd.19 up 1.00000 1.00000
24 ssd 0 osd.24 up 1.00000 1.00000
29 ssd 0 osd.29 up 1.00000 1.00000
34 ssd 0 osd.34 up 1.00000 1.00000
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_OK
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 53 osds: 53 up (since 5m), 53 in (since 7m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 34 GiB used, 496 GiB / 530 GiB avail
pgs: 1 active+clean
For an OSD that has failed (dead node):
ceph osd out <id>
ceph osd lost <id> --yes-i-really-mean-it
ceph osd crush remove osd.<id>
ceph osd rm <id>
ceph auth del osd.<id>
ceph orch daemon rm osd.<id> --force
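Note that ceph osd lost is only accepted once the cluster already sees the OSD as down; otherwise you get the EBUSY error shown in the example below. A small guard, sketched with jq:
# Only mark osd.14 lost if the cluster already reports it as down
id=14
up=$(ceph osd dump -f json | jq ".osds[] | select(.osd == $id) | .up")
if [[ "$up" == "0" ]]; then
  ceph osd lost "$id" --yes-i-really-mean-it
else
  echo "osd.$id is still up; stop its daemon (or confirm the node is dead) before marking it lost"
fi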
Let's use osd.14 as the example.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd out 14
marked out osd.14.
Since osd.14 is not actually broken, the lost option cannot be used yet; I will run ceph orch daemon stop osd.14 to bring osd.14 down and simulate a failed OSD.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd lost 14 --yes-i-really-mean-it
Error EBUSY: osd.14 is not down
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch daemon stop osd.14
Scheduled to stop osd.14 on host 'CEPH-LAB-OSD-076'
Verify that osd.14 is down.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
14 ssd 0 osd.14 down 0 1.00000
19 ssd 0 osd.19 up 1.00000 1.00000
24 ssd 0 osd.24 up 1.00000 1.00000
29 ssd 0 osd.29 up 1.00000 1.00000
34 ssd 0 osd.34 up 1.00000 1.00000
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
Now the lost option can be used for osd.14.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd lost 14 --yes-i-really-mean-it
marked osd lost in epoch 11589
Continue removing osd.14.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd crush remove osd.14
removed item id 14 name 'osd.14' from crush map
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd rm 14
removed osd.14
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph auth del osd.14
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch daemon rm osd.14 --force
Removed osd.14 from host 'CEPH-LAB-OSD-076'
osd.14 has been removed successfully; here is the current cluster state.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
19 ssd 0 osd.19 up 1.00000 1.00000
24 ssd 0 osd.24 up 1.00000 1.00000
29 ssd 0 osd.29 up 1.00000 1.00000
34 ssd 0 osd.34 up 1.00000 1.00000
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_OK
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 52 osds: 52 up (since 3m), 52 in (since 4m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 34 GiB used, 486 GiB / 520 GiB avail
pgs: 1 active+clean
As a bonus, here is a script that removes OSDs automatically.
- --soft: remove a live OSD (stop the daemon, then clean up all the logical entries)
- --hard: remove a dead OSD (use osd lost, no need to stop the daemon)
#!/bin/bash
# === Configuration ===
NODE_NAME="CEPH-LAB-OSD-076" # Node where OSDs will be removed
OSD_LIST="8 11 14 19" # List of OSD IDs to be removed
CLUSTER_HOST_INFO=$(ceph osd tree -f json-pretty)
echo "==> Starting OSD removal on node: $NODE_NAME"
echo "==> Target OSD list: $OSD_LIST"
echo "------------------------------------------"
for OSD_ID in $OSD_LIST; do
echo ""
echo "▶ Processing osd.$OSD_ID ..."
# 1. Check if OSD exists
ceph osd metadata $OSD_ID &>/dev/null
if [[ $? -ne 0 ]]; then
echo "⚠️ osd.$OSD_ID does not exist in the cluster. Skipping."
continue
fi
# 2. Verify the OSD belongs to the target node
HOST=$(ceph osd metadata $OSD_ID -f json | jq -r '.hostname')
if [[ "$HOST" != "$NODE_NAME" ]]; then
echo "⚠️ osd.$OSD_ID does NOT belong to node $NODE_NAME (belongs to $HOST). Skipping."
continue
fi
# 3. Check if OSD has already been removed from CRUSH map
ceph osd crush ls "$NODE_NAME" 2>/dev/null | grep -qw "osd\.$OSD_ID"
if [[ $? -ne 0 ]]; then
echo "ℹ️ osd.$OSD_ID has already been removed from the CRUSH map. Skipping."
continue
fi
echo "✅ Validation complete: osd.$OSD_ID belongs to $NODE_NAME. Proceeding with removal..."
# Begin OSD removal
echo "+ Marking osd.$OSD_ID as out"
ceph osd out $OSD_ID
echo "+ Stopping OSD daemon"
ceph orch daemon stop osd.$OSD_ID
echo "+ Removing from CRUSH map"
ceph osd crush remove osd.$OSD_ID
echo "+ Removing from OSD map"
ceph osd rm $OSD_ID
echo "+ Removing auth/keyring"
ceph auth del osd.$OSD_ID
echo "+ Removing cephadm daemon"
ceph orch daemon rm osd.$OSD_ID --force
echo "✅ Successfully removed osd.$OSD_ID"
done
echo ""
echo "🎉 Finished processing all OSDs on node $NODE_NAME"
Requirement: install jq (apt install jq) if it is not already present, to parse JSON.
How to use: create the osd_removal.env file, make the script executable, then run it with --hard or --soft.
Create the .env file:
cat <<EOF > osd_removal.env
NODE_NAME="CEPH-LAB-OSD-076"
OSD_LIST="19 24 29 34"
EOF
Run the script:
chmod +x remove_osd.sh
./remove_osd.sh --hard # or --soft
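The listing above is a simplified skeleton: it hardcodes NODE_NAME and OSD_LIST and always follows the soft path. The runs below additionally show an osd_removal.env file, a --soft/--hard mode, a confirmation prompt and a wait-for-down step. The exact original script is not reproduced here, but a minimal sketch of how those pieces could be wired in, reusing the same names, looks like this:
#!/bin/bash
# Sketch: load configuration, pick the removal mode, confirm, and wait for "down"
ENV_FILE="./osd_removal.env"
LOG_FILE="./osd_removal.log"

if [[ ! -f "$ENV_FILE" ]]; then
  echo "⚠️ Missing $ENV_FILE file. Please create it with NODE_NAME and OSD_LIST."
  exit 1
fi
source "$ENV_FILE"   # provides NODE_NAME and OSD_LIST

MODE="${1:---soft}"  # --soft for live OSDs, --hard for dead OSDs
MODE="${MODE#--}"
echo "==> Starting OSD removal on node: $NODE_NAME" | tee -a "$LOG_FILE"
echo "==> Target OSD list: $OSD_LIST"               | tee -a "$LOG_FILE"
echo "==> Mode: $MODE"                              | tee -a "$LOG_FILE"

wait_osd_down() {
  local id=$1
  until ceph osd dump -f json | jq -e ".osds[] | select(.osd == $id and .up == 0)" >/dev/null; do
    echo "⏳ Waiting... osd.$id still up"
    sleep 5
  done
  echo "✔️ osd.$id is now down."
}

for OSD_ID in $OSD_LIST; do
  # (the same existence / hostname / CRUSH validation as in the skeleton above goes here)
  read -rp "❓ Proceed to remove osd.$OSD_ID? [y/N]: " answer
  [[ "$answer" =~ ^[Yy]$ ]] || { echo "Skipping osd.$OSD_ID"; continue; }

  ceph osd out "$OSD_ID"
  if [[ "$MODE" == "hard" ]]; then
    ceph osd lost "$OSD_ID" --yes-i-really-mean-it   # dead OSD: its daemon cannot be stopped
  else
    ceph orch daemon stop "osd.$OSD_ID"              # live OSD: stop it and wait until down
    wait_osd_down "$OSD_ID"
  fi
  ceph osd crush remove "osd.$OSD_ID"
  ceph osd rm "$OSD_ID"
  ceph auth del "osd.$OSD_ID"
  ceph orch daemon rm "osd.$OSD_ID" --force
done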
Example of using the script.
Current state of the cluster: node CEPH-LAB-OSD-076 has 4 OSDs in the down state.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_WARN
4 osds down
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 52 osds: 48 up (since 7m), 52 in (since 24m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 34 GiB used, 486 GiB / 520 GiB avail
pgs: 1 active+clean
Get the list of OSDs remaining on node CEPH-LAB-OSD-076. We treat OSDs 19, 24, 29, 34 as failed, and the rest as still healthy.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
19 ssd 0 osd.19 down 1.00000 1.00000
24 ssd 0 osd.24 down 1.00000 1.00000
29 ssd 0 osd.29 down 1.00000 1.00000
34 ssd 0 osd.34 down 1.00000 1.00000
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
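If you prefer to build this list of failed OSDs programmatically instead of reading the tree by eye, a small sketch (assumes jq):
HOST="CEPH-LAB-OSD-076"
# Print the ids of the down OSDs that live under the host's CRUSH bucket
for id in $(ceph osd ls-tree "$HOST"); do
  up=$(ceph osd dump -f json | jq ".osds[] | select(.osd == $id) | .up")
  [[ "$up" == "0" ]] && echo -n "$id "
done
echo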
I will handle the failed OSDs 19, 24, 29, 34 first: pass them in OSD_LIST and set NODE_NAME to CEPH-LAB-OSD-076 so the script is constrained to that host.
NODE_NAME="CEPH-LAB-OSD-076" # Node where OSDs will be removed
OSD_LIST="19 24 29 34" # List of OSD IDs to be removed
If the osd_removal.env file does not exist, you will see this message:
root@CEPH-LAB-MON-071:/home/hoanghd3# ./remove_osd.sh --soft
⚠️ Missing ./osd_removal.env file. Please create it with NODE_NAME and OSD_LIST.
Create the osd_removal.env file and run the script with the --hard flag.
root@CEPH-LAB-MON-071:/home/hoanghd3# ./remove_osd.sh --hard
==> Starting OSD removal on node: CEPH-LAB-OSD-076
==> Target OSD list: 19 24 29 34
==> Mode: hard
==> Log file: ./osd_removal.log
------------------------------------------
▶ Processing osd.19 ...
✅ osd.19 validated. Ready for removal.
❓ Proceed to remove osd.19? [y/N]: y
+ Marking osd.19 as out
osd.19 is already out.
+ Marking osd.19 as lost
marked osd lost in epoch 11595
+ Removing from CRUSH map
removed item id 19 name 'osd.19' from crush map
+ Removing from OSD map
removed osd.19
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.19 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.19
▶ Processing osd.24 ...
✅ osd.24 validated. Ready for removal.
❓ Proceed to remove osd.24? [y/N]: y
+ Marking osd.24 as out
osd.24 is already out.
+ Marking osd.24 as lost
marked osd lost in epoch 11593
+ Removing from CRUSH map
removed item id 24 name 'osd.24' from crush map
+ Removing from OSD map
removed osd.24
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.24 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.24
▶ Processing osd.29 ...
✅ osd.29 validated. Ready for removal.
❓ Proceed to remove osd.29? [y/N]: y
+ Marking osd.29 as out
osd.29 is already out.
+ Marking osd.29 as lost
marked osd lost in epoch 11594
+ Removing from CRUSH map
removed item id 29 name 'osd.29' from crush map
+ Removing from OSD map
removed osd.29
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.29 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.29
▶ Processing osd.34 ...
✅ osd.34 validated. Ready for removal.
❓ Proceed to remove osd.34? [y/N]: y
+ Marking osd.34 as out
osd.34 is already out.
+ Marking osd.34 as lost
marked osd lost in epoch 11596
+ Removing from CRUSH map
removed item id 34 name 'osd.34' from crush map
+ Removing from OSD map
removed osd.34
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.34 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.34
🎉 Finished processing all OSDs on node CEPH-LAB-OSD-076
Result after removing the down OSDs.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
39 ssd 0 osd.39 up 1.00000 1.00000
44 ssd 0 osd.44 up 1.00000 1.00000
49 ssd 0 osd.49 up 1.00000 1.00000
54 ssd 0 osd.54 up 1.00000 1.00000
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_OK
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 48 osds: 48 up (since 15m), 48 in (since 5m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 32 GiB used, 448 GiB / 480 GiB avail
pgs: 1 active+clean
Next, I use the script to remove the remaining OSDs on the node; these simulate OSDs that are still healthy but no longer needed.
cat <<EOF > osd_removal.env
NODE_NAME="CEPH-LAB-OSD-076"
OSD_LIST="39 44 49 54"
EOF
Remove the OSDs.
root@CEPH-LAB-MON-071:/home/hoanghd3# ./remove_osd.sh --soft
==> Starting OSD removal on node: CEPH-LAB-OSD-076
==> Target OSD list: 39 44 49 54
==> Mode: soft
==> Log file: ./osd_removal.log
------------------------------------------
▶ Processing osd.39 ...
✅ osd.39 validated. Ready for removal.
❓ Proceed to remove osd.39? [y/N]: y
+ Marking osd.39 as out
marked out osd.39.
+ Stopping OSD daemon
Scheduled to stop osd.39 on host 'CEPH-LAB-OSD-076'
+ Waiting for osd.44 to go down...
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
✔️ osd.44 is now down.
+ Removing from CRUSH map
removed item id 39 name 'osd.39' from crush map
+ Removing from OSD map
removed osd.39
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.39 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.39
▶ Processing osd.44 ...
✅ osd.44 validated. Ready for removal.
❓ Proceed to remove osd.44? [y/N]: y
+ Marking osd.44 as out
marked out osd.44.
+ Stopping OSD daemon
Scheduled to stop osd.44 on host 'CEPH-LAB-OSD-076'
+ Waiting for osd.44 to go down...
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
⏳ Waiting... osd.44 still up
✔️ osd.44 is now down.
+ Removing from CRUSH map
removed item id 44 name 'osd.44' from crush map
+ Removing from OSD map
removed osd.44
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.44 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.44
▶ Processing osd.49 ...
✅ osd.49 validated. Ready for removal.
❓ Proceed to remove osd.49? [y/N]: y
+ Marking osd.49 as out
marked out osd.49.
+ Stopping OSD daemon
Scheduled to stop osd.49 on host 'CEPH-LAB-OSD-076'
+ Waiting for osd.49 to go down...
⏳ Waiting... osd.49 still up
⏳ Waiting... osd.49 still up
⏳ Waiting... osd.49 still up
⏳ Waiting... osd.49 still up
⏳ Waiting... osd.49 still up
✔️ osd.49 is now down.
+ Removing from CRUSH map
removed item id 49 name 'osd.49' from crush map
+ Removing from OSD map
removed osd.49
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.49 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.49
▶ Processing osd.54 ...
✅ osd.54 validated. Ready for removal.
❓ Proceed to remove osd.54? [y/N]: y
+ Marking osd.54 as out
marked out osd.54.
+ Stopping OSD daemon
Scheduled to stop osd.54 on host 'CEPH-LAB-OSD-076'
+ Waiting for osd.54 to go down...
⏳ Waiting... osd.54 still up
⏳ Waiting... osd.54 still up
⏳ Waiting... osd.54 still up
✔️ osd.54 is now down.
+ Removing from CRUSH map
removed item id 54 name 'osd.54' from crush map
+ Removing from OSD map
removed osd.54
+ Removing auth/keyring
+ Removing cephadm daemon
Removed osd.54 from host 'CEPH-LAB-OSD-076'
✅ Successfully removed osd.54
🎉 Finished processing all OSDs on node CEPH-LAB-OSD-076
Let's review the cluster state again.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_OK
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 44 osds: 44 up (since 7m), 44 in (since 7m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 31 GiB used, 409 GiB / 440 GiB avail
pgs: 1 active+clean
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
Step 4 – Verify that no OSDs remain on the node
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph osd tree | grep CEPH-LAB-OSD-076 -A 10
-5 0 host CEPH-LAB-OSD-076
Step 5 – Remove the auxiliary daemons, if any
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch ps | grep CEPH-LAB-OSD-076
crash.CEPH-LAB-OSD-076 CEPH-LAB-OSD-076 running (11d) 6m ago 11d 8304k - 18.2.4 2bc0b0f4375d 6e4895fed3a1
node-exporter.CEPH-LAB-OSD-076 CEPH-LAB-OSD-076 *:9112 running (11d) 6m ago 11d 21.3M - 1.5.0 0da6a335fe13 d0f34489f70b
The crash and node-exporter daemons are still running on node CEPH-LAB-OSD-076.
Remove each daemon with:
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch daemon rm crash.CEPH-LAB-OSD-076 --force
Removed crash.CEPH-LAB-OSD-076 from host 'CEPH-LAB-OSD-076'
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch daemon rm node-exporter.CEPH-LAB-OSD-076 --force
Removed node-exporter.CEPH-LAB-OSD-076 from host 'CEPH-LAB-OSD-076'
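If a host still has several leftover daemons, you can loop over whatever cephadm reports for that host instead of naming each one; a sketch based on the orch ps JSON output (assumes jq):
HOST="CEPH-LAB-OSD-076"
# Remove every daemon that cephadm still lists on the host
for d in $(ceph orch ps -f json | jq -r ".[] | select(.hostname == \"$HOST\") | \"\(.daemon_type).\(.daemon_id)\""); do
  echo "Removing $d from $HOST"
  ceph orch daemon rm "$d" --force
done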
Verify on node CEPH-LAB-OSD-076.
root@CEPH-LAB-OSD-076:~# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Step 6 – Remove the node from the cluster
We have reached the final step of removing the node, but Ceph reports an error:
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch host rm CEPH-LAB-OSD-076
Error EINVAL: Not allowed to remove CEPH-LAB-OSD-076 from cluster. The following daemons are running in the host:
type id
-------------------- ---------------
crash CEPH-LAB-OSD-076
node-exporter CEPH-LAB-OSD-076
Please run 'ceph orch host drain CEPH-LAB-OSD-076' to remove daemons from host
The reason is that ceph orch host rm does NOT allow removing a node while it still has daemons running (crash, node-exporter).
- You can remove each daemon manually, as described earlier,
- or use the automatic approach Ceph suggests.
The command ceph orch host drain CEPH-LAB-OSD-076 runs successfully and the scheduler queues the removal of the remaining daemons on that node:
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch host drain CEPH-LAB-OSD-076
Scheduled to remove the following daemons from host 'CEPH-LAB-OSD-076'
type id
-------------------- ---------------
crash CEPH-LAB-OSD-076
node-exporter CEPH-LAB-OSD-076
Wait 10–30 seconds for the Ceph orchestrator to remove the remaining daemons (crash, node-exporter) from the node automatically, then check:
ceph orch ps | grep CEPH-LAB-OSD-076
If the output is empty, the node has been drained successfully. ceph orch host ls now shows:
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch host ls
HOST ADDR LABELS STATUS
CEPH-LAB-MON-071 10.237.7.71 _admin,mon,mgr,osd
CEPH-LAB-MON-072 10.237.7.72 _admin,mon,mgr,osd
CEPH-LAB-MON-073 10.237.7.73 _admin,mon,mgr,osd
CEPH-LAB-OSD-074 10.237.7.74 osd
CEPH-LAB-OSD-075 10.237.7.75 osd
CEPH-LAB-OSD-076 10.237.7.76 osd,_no_schedule,_no_conf_keyring
6 hosts in cluster
Finally, remove the node from the cluster.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch host rm CEPH-LAB-OSD-076
Removed host 'CEPH-LAB-OSD-076'
Node CEPH-LAB-OSD-076 no longer appears in the host list.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph orch host ls
HOST ADDR LABELS STATUS
CEPH-LAB-MON-071 10.237.7.71 _admin,mon,mgr,osd
CEPH-LAB-MON-072 10.237.7.72 _admin,mon,mgr,osd
CEPH-LAB-MON-073 10.237.7.73 _admin,mon,mgr,osd
CEPH-LAB-OSD-074 10.237.7.74 osd
CEPH-LAB-OSD-075 10.237.7.75 osd
5 hosts in cluster
Verify the cluster status again.
root@CEPH-LAB-MON-071:/home/hoanghd3# ceph -s
cluster:
id: 75ac298c-0653-11f0-a2e7-2b96c52a296a
health: HEALTH_OK
services:
mon: 3 daemons, quorum CEPH-LAB-MON-071,CEPH-LAB-MON-073,CEPH-LAB-MON-072 (age 3w)
mgr: CEPH-LAB-MON-072.agtskh(active, since 3w), standbys: CEPH-LAB-MON-071.lyxipt, CEPH-LAB-MON-073.holphb
osd: 44 osds: 44 up (since 18m), 44 in (since 18m)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 449 KiB
usage: 31 GiB used, 409 GiB / 440 GiB avail
pgs: 1 active+clean
7. Pros & cons of using the reweight script
✅ Pros:
- Load is shed safely and under control
- Avoids a sudden burst of backfill
- Easy to automate across many OSDs
⚠️ Cons:
- You need to understand the script and its parameter values
- The cluster should be healthy and stable while the script runs
8. Comparison with removing a single OSD
Criterion | Removing 1 OSD | Removing 1 OSD node |
---|---|---|
Complexity | Low | Higher |
Multiple OSDs to handle | ❌ | ✅ |
Extra host rm step required | ❌ | ✅ |
Script support useful | Yes, but not strictly needed | Strongly recommended |
9. Recommendations
- Do not mark out or down-weight too many OSDs at once if the cluster holds a lot of data.
- Always check ceph -s before each important step.
- Prefer the reweight script when removing many OSDs at once.
- After removing the node, re-check the total OSD count, the CRUSH map and the PG states.
- The automatic OSD removal script should not be used in production, to avoid losing control.
10. Conclusion
Removing an entire OSD node in Ceph requires careful, sequential and precise operations. Combining this with the automatic reweight script keeps the process safe and minimizes risk, especially when many OSDs must be removed at once.
Hopefully this walkthrough helps you maintain, upgrade, or remove nodes from your Ceph cluster with more confidence and safety.