Several questions have couple up in the last few weeks about offload quarantines which means a blog post on this topic is overdue. We work hard to stress test every new rpm that is released but on rare occasions customers can encounter an issue with the thin database layer that exists in the offload server. This layer is known externally as “Smart Scan” and internally as “FPLIB” (a.k.a. Filter Projection Library).
A crash in the thin database layer could because of either an issue with some aspect of the sql_id (for example, with the predicates) or because of an issue with the data on some region of disk (for example, with the OLTP compression symbol table). The worst, and rarest, form of crashes are where striping leads to every offload server failing simultaneously: these are known colloquially as “Railroad Crashes”). The most important thing is to make sure the retry mechanism doesn’t immediately resubmit the query and re-crash the offload server causing a halt to the business operating. In a hospital, the floor nurse would call a code and the crash team would come running with a crash cart to stabilize the patient. Two members of my family are nurses and I’m reminded that nurses are a lot like technical support engineers in that while doing their job they sometimes have to deal with abuse from frustrated patients (customers): please remember that both groups work hard to resolve your issues and be kind to them!
What are Quarantines?
The option of calling a crash cart is not available to us here so starting in early 22.214.171.124, we created a quarantine system where, after a crash, the exception handler remembers both the sql_id and the disk region being processed and creates a persistent quarantine for both. When a sql_id or a disk region is quarantined any Smart Scan operations on them will be executed in passthru mode.
Currently it is hard to see when this is happening, but an upcoming release has a new stat to make this easier. If an operation has been quarantined you can see its effects by monitoring:
- cell num bytes in passthru due to quarantine
When a quarantine has been created you can look at it in detail using CellCLI:
CellCLI> list quarantine 1 detail name: 1 clientPID: 12798 crashReason: ORA-600 creationTime: 2011-02-07T17:18:13-08:00 dbUniqueID: 2022407934 dbUniqueName: YAMA incidentID: 16 planLineID: 37 quarantineReason: Crash quarantineType: "SQL PLAN" remoteHostName: sclbndb02.us.oracle.com rpmVersion: OSS_126.96.36.199.0_LINUX.X64_1012 sqlID: 1jw05wutgpfyf sqlPlanHashValue: 3386023660
The ‘list detail’ version of the command gives us everything we would need to know about exactly what has been quarantined and why it was quarantined. CellCLI also supports manually creating a quarantine using the attributes shown by ‘list detail’.
This is the topic that has caused the most confusion: if three new quarantines are generated within a 24 hour period the quarantine is escalated to a database quarantine. Using the ‘list detail’ option we would then see:
Note: the number of quarantines in 24 hours before escalation is configurable via a cellinit param: please contact Technical Support if you feel you have a valid need to change this.
The final level of escalation is where if more than one database has been escalated to a database quarantine, the system will escalate to a complete offload quarantine where Smart Scan is disabled completely and all I/O goes through regular block I/O. I’m glad to say that I’ve have never seen this happen.
The next question is how and when are the quarantines removed. Any quarantine can be removed manually using CellCLI. Quarantines are also automatically dropped by certain operations:
- Plan Step Quarantines:
- Dropped on rpm upgrades
- Disk Region Quarantines
- Dropped on rpm upgrades, or on successful writes to the quarantined disk region
- Database quarantines
- Dropped on rpm upgrades
- Offload quarantines
- Dropped on rpm upgrades
What about CDB?
In 12.1, we changed the architecture of cellsrv to support multiple RDBMS versions running at the same time by introducing the concept of offload servers. When a new rpm is installed it typically contains offload servers for 188.8.131.52, 184.108.40.206, 220.127.116.11, (and 18.104.22.168). This is known internally as multi-DB. Any given operation is tagged with the RDBMS version it is coming from and routed to the offload server for that version. A crash in Smart Scan typically means that only the offload server has to restart and not the central cellsrv that maintains Storage Index and does Smart IO. A side effect of this is that all operations for that RDBMS version can revert to Block IO while the offload server restarts minimizing disruption.
The architecture change necessitated a change to the way quarantines are created, checked, and dropped. In multi-DB, installation of a new rpm no longer drops all quarantines. Instead, system created quarantines now record the offload server’s rpm version. Manually created quarantines can also optionally specify offload rpm they are to effect. In multi-DB, a quarantine is observed if the offload rpm specified matches the actual offload rpm the operation will be sent to or if no offload rpm is specified regardless of offloadgroup.
Multi-DB quarantines are dropped if the matching offload rpm is uninstalled or a new rpm installed for that offload version. Multi-DB quarantines with no offload rpm specified must be dropped manually.
Please let me know if you have any questions.