1. Fault Description
After a power upgrade in the customer's server room, the physical server was powered back on and came up normally. OpenStack was then started and the virtual machines were brought up. One virtual machine failed to start. Another started but could not write to its file system, and its I/O utilization spiked even though no actual reads or writes were taking place.
---
2. Failure Analysis
1. According to the customer, before the power-down all OpenStack virtual machines had been shut down properly, the OpenStack services and Ceph had been stopped, and the physical server had been shut down safely.
2. Upon re-powering the server, Ceph was started first, followed by OpenStack.
3. Once OpenStack was running, all virtual machines were started manually.
4. No issues were observed at first. Application staff then reported that one virtual machine could not start and that another showed abnormal file system read/write behavior. Inspection of the OpenStack and Ceph management platforms showed one VM in an abnormal state in OpenStack.
---
3. Current Ceph Status
The Ceph status output showed `osd.7` reporting slow operations and one placement group (PG) in the `activating` state.
1. **Determine the OSD Status:**
The status check indicated that `osd.7` belongs to the `ceph03` node.
2. **Determine the PG Status:**
A PG status query showed that PG `7.1d` had already entered a stuck state before the previous shutdown. In Ceph, the `activating` state means the PG has completed peering but has not yet become active.
3. **Check Ceph Logs:**
Examining the Ceph log for node `ceph03` at `/var/log/ceph/ceph-osd.7.log` provided additional details.
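
The report does not record the exact commands used for these checks. As a minimal sketch, assuming the standard `ceph` CLI is available on an admin node, the host of an OSD and the list of stuck PGs can be read from `ceph osd find` and `ceph pg dump_stuck`. The helper names are hypothetical, and the JSON layouts handled below are assumptions: the `dump_stuck` layout differs between Ceph releases, so the parser accepts both a bare list and a wrapped object.

```python
import json
import subprocess


def ceph_json(*args):
    """Run a ceph CLI command and return its parsed JSON output.

    Needs a reachable cluster and a keyring with read access.
    """
    out = subprocess.run(
        ["ceph", *args, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def osd_host(osd_find_output):
    """Return the CRUSH host of an OSD from `ceph osd find <id>` output.

    Assumption: the output carries a `crush_location` object with a
    `host` key. Returns None when that key is missing.
    """
    return osd_find_output.get("crush_location", {}).get("host")


def stuck_pgs(dump_stuck_output):
    """Return (pgid, state) pairs from `ceph pg dump_stuck` output.

    Assumption: the output is either a list of PG records or an object
    wrapping that list under `stuck_pg_stats`.
    """
    if isinstance(dump_stuck_output, dict):
        records = dump_stuck_output.get("stuck_pg_stats", [])
    else:
        records = dump_stuck_output
    return [(r["pgid"], r["state"]) for r in records]
```

On a live cluster, `osd_host(ceph_json("osd", "find", "7"))` and `stuck_pgs(ceph_json("pg", "dump_stuck", "inactive"))` would correspond to the two checks above. For a single PG, `ceph pg 7.1d query` gives more detail on why it is not becoming active.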
---
4. Troubleshooting Steps
1. **Restart the Ceph Monitor Service:**
Restarting the `ceph.mon` service had no effect.
2. **Repair the PG:**
Attempted a PG repair, which was also unsuccessful.
3. **Restart the OSD Service:**
Restarting the `osd` service resolved the issue.
After the Ceph issue was resolved, the affected VM in OpenStack returned to normal.
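
The three attempts above can be sketched as an ordered list of commands that stops as soon as the PG becomes active. This is an illustration under stated assumptions, not the commands actually run during the incident: it assumes a systemd-based deployment with units named `ceph-mon@<host>` and `ceph-osd@<id>` (containerized deployments name them differently), and the function names and the `mon_host` parameter are hypothetical.

```python
import subprocess


def recovery_commands(mon_host, pgid, osd_id):
    """Return the three attempts, in the order tried, as argv lists.

    Assumption: systemd units are named ceph-mon@<host> and ceph-osd@<id>.
    Each systemctl command must run on the node hosting that daemon.
    """
    return [
        ["systemctl", "restart", f"ceph-mon@{mon_host}"],
        ["ceph", "pg", "repair", pgid],
        ["systemctl", "restart", f"ceph-osd@{osd_id}"],
    ]


def pg_is_active(pg_state):
    """True when a state string such as 'active+clean' includes 'active'.

    'activating' does not count: states are compared as whole tokens.
    """
    return "active" in pg_state.split("+")


def run_until_active(commands, get_pg_state, run=subprocess.run):
    """Run each command in turn; return the one after which the PG is active.

    `get_pg_state` is a caller-supplied function returning the current
    state string of the PG. Returns None if no command helped.
    """
    for argv in commands:
        run(argv, check=True)
        if pg_is_active(get_pg_state()):
            return argv
    return None
```

In practice a PG needs some time to re-peer after a daemon restart, so `get_pg_state` should wait or poll before reporting. The value of keeping the steps ordered is that the least disruptive action is tried first and the log shows which one actually cleared the fault.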
---
5. Summary
1. For any Ceph change that requires a shutdown, stop all applications before shutting down Ceph.
2. After re-powering, verify that the Ceph status is normal before starting applications.
3. For routine Ceph operations and maintenance, implement comprehensive monitoring and establish a performance baseline. This will enable effective comparisons and quicker troubleshooting when issues arise.
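
The second recommendation can be enforced with a small gate that blocks application start-up until Ceph reports `HEALTH_OK`. The following is a minimal sketch, assuming the `ceph health` command prints the health code as its first token. The function names, the 600-second timeout, and the 10-second polling interval are illustrative choices, not values from the incident.

```python
import subprocess
import time


def ceph_health():
    """Return the first token of `ceph health`, e.g. HEALTH_OK or HEALTH_WARN."""
    out = subprocess.run(
        ["ceph", "health"], check=True, capture_output=True, text=True
    ).stdout
    tokens = out.split()
    return tokens[0] if tokens else ""


def wait_for_health_ok(get_health=ceph_health, timeout_s=600, interval_s=10,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until Ceph reports HEALTH_OK; return False if the timeout expires.

    `get_health`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a cluster.
    """
    deadline = clock() + timeout_s
    while True:
        if get_health() == "HEALTH_OK":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A start-up script would call `wait_for_health_ok()` after starting Ceph and only proceed to OpenStack and the virtual machines when it returns `True`. In this incident such a gate would have held back the VMs while PG `7.1d` was still `activating`, assuming the cluster reported a non-OK health code for it.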