Hi everyone! This is Danny from OCLC’s Service Operations Center (SOC). I am one of four Service Delivery Managers who lead a team of analysts in the SOC all hours of the day and night, 365 days a year. Our primary purpose is to make sure you always have access to the critical services you depend on from OCLC. We constantly monitor the state of OCLC services and act quickly to remediate any disruption we detect. This was a banner year for transformation at the SOC, and we made big changes in how we execute our mission.
Part of our mission is to make sure you never lose access to the OCLC services you need. However, if maintenance or an unexpected disruption does affect an OCLC service, we want you to know that we are aware of it and working on it. For this reason, we worked with other OCLC support teams to launch a service status dashboard at oc.lc/status. The dashboard lets you quickly see when major maintenance or unexpected service degradation affects OCLC products and APIs. It is simple and quick to use, and it is mobile-friendly. If we confirm that a broad service disruption is occurring, we can now tell you about it, and let you know we are on it, from one central location! The best part is that this tool is administered by our 24/7 group, meaning we can always get this critical information to you.
Beyond launching the dashboard, we worked with Monitoring and Tools (another amazing group doing great things for our members) to create enhanced monitoring for our APIs. Rather than relying only on standard systems data to tell us about service health, we decided to be more member-centered. We created a continuously running suite of automated tests that recreates the real user experience of authenticating and executing queries against our systems, and that alerts our group whenever a test fails to return the expected result.
For example, with the WorldCat Metadata API, we first check that we can obtain an access token to authenticate for the service, and then we send a simple HTTP request like this:
We then check for an HTTP 200 OK response code and the following response header:
To accomplish this, we use a set of scheduling tools that distribute these tests to, and execute them from, a number of hosts that serve as “slaves” for the schedulers. This means we check status from multiple hosts, because relying on a single host could cause us to miss critical outages as they occur. As new APIs are released, we will continue to add new tests to ensure the best possible service for the community.
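For readers who want to build something similar, the flow described above (obtain a token, send a simple request, check for 200 OK and an expected header, then compare results across hosts) can be sketched roughly as follows. This is a minimal illustration and not the SOC's actual tooling: the names `run_api_check`, `failing_hosts`, `Response`, and `CheckResult` are hypothetical, the token and request steps are passed in as plain functions so no real OCLC endpoint is assumed, and the expected header name and value are left as parameters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass
class Response:
    """Hypothetical minimal view of an HTTP response."""
    status: int
    headers: Mapping[str, str]


@dataclass
class CheckResult:
    ok: bool
    reason: str


def run_api_check(
    get_token: Callable[[], str],
    send_request: Callable[[str], Response],
    expected_header: str,
    expected_value: str,
) -> CheckResult:
    """Run one synthetic check: authenticate, query, validate the response."""
    # Step 1: obtain an access token; failing to authenticate is itself a failure.
    try:
        token = get_token()
    except Exception as exc:
        return CheckResult(False, f"token request failed: {exc}")
    if not token:
        return CheckResult(False, "empty access token")

    # Step 2: send the simple request using that token.
    try:
        response = send_request(token)
    except Exception as exc:
        return CheckResult(False, f"request failed: {exc}")

    # Step 3: the status code must be 200 OK.
    if response.status != 200:
        return CheckResult(False, f"unexpected status {response.status}")

    # Step 4: the expected header must be present with the expected value.
    # HTTP header names are case-insensitive, so compare in lowercase.
    headers = {name.lower(): value for name, value in response.headers.items()}
    actual = headers.get(expected_header.lower())
    if actual != expected_value:
        return CheckResult(False, f"header {expected_header!r} was {actual!r}")

    return CheckResult(True, "ok")


def failing_hosts(results: Mapping[str, CheckResult]) -> Dict[str, str]:
    """Map each host whose check failed to its failure reason.

    Assumption for this sketch: a failure seen from any host is worth an alert.
    """
    return {host: result.reason for host, result in results.items() if not result.ok}
```

In practice, `get_token` and `send_request` would wrap real HTTP calls made from each monitoring host, and a non-empty result from `failing_hosts` would be what triggers an alert to the on-call group.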
We know that our APIs are critical to the DevNet community. What’s critical to you is critical to us, so we have taken these steps to better serve you.
Service Delivery Manager