Outage October 6th-7th 2024

What Happened

On Sunday October 6th 2024 at 03:28 AM GMT, our system went down. It did not come back online until 10:10 AM on Monday October 7th.

Why did this happen?


Over the past number of years, we have had issues related to our growth. We have added to our features, we have added to our flexibility, we have added more clients and our clients have added more customers. All of this growth puts added strain on our system. For example, every time a page loads, our system needs to access the database for information related to that business and make calculations relating to availability, dates, times, prices, etc. All of this ‘load’ uses computer resources.

Just as we outsource payment processing to Stripe, we outsource the ‘hosting’ of our system to Hosting Ireland. Similar to how Stripe take a commission on every transaction, Hosting Ireland monitor how much load we put on their servers and the more load we place on them, the more they charge us.

At some point a number of years ago, Hosting Ireland added more resources to our account in what was supposed to be a temporary adjustment to ease the strain our system was feeling due to us nearing the extent of our resource limits. However, this ‘temporary’ adjustment was forgotten about and never formalised.

Just after 03:00 AM GMT on the morning of October 6th, a scheduled software patch was applied by Hosting Ireland to a system named ‘cPanel’. As part of this update, ‘cPanel’ recalibrated how resource limits were measured and enforced. At this point, the ‘cPanel’ software observed the resources allocated to BookingHawk.com and compared them to the resources BookingHawk.com was eligible for. It then immediately reduced and strictly enforced our resource limits.

Once the lower limits were enforced, our system did not even have enough resources to start up, let alone serve even a few users. This meant that the Tomcat server kept trying to restart and kept shutting down as it reached its resource limits during startup.

Why did it take so long to come back online?


Although we do have some administration controls available to us on the Hosting Ireland servers, sometimes we need one of their administrators to perform an action on one of the servers or systems that they provide for our use. Hosting Ireland normally do provide support over the weekend, albeit in a much more limited capacity than during office hours.

When we realised that we had done everything we could do, we opened a support request with Hosting Ireland at 09:21 AM on Sunday October 6th. We assigned it the highest possible priority according to their ticketing process.

We sent another request and labeled it critical at 10:37 AM. We tried contact by telephone too. We did not receive any response from Hosting Ireland.

At this point, we had done everything we could from a technical point of view. It was not until Hosting Ireland started their working week on Monday at 9 AM that this matter received their attention.

At 09:15 AM on Monday morning, I phoned them and was satisfied that their strongest engineer was investigating. I asked to speak with him but this request was declined. As a software engineer, I did not find this out of the ordinary; sometimes when you are working on an issue you need to be left alone. As the system was still not back at 10 AM, I phoned again and was granted my request to speak with the engineer. Within a few minutes of working together, we had identified the issue as being related to resource consumption.

Hosting Ireland then granted more resources and the system came back online.

What about Hosting Ireland?


Where we feel let down by Hosting Ireland is the lack of support available when we needed it. We accept that software systems will crash, that they will always need to be updated and that, unfortunately, sometimes these things happen at the least ideal times. However, we cannot accept that when things do go wrong, it takes so long to get them right.

Having spoken with Hosting Ireland, it seems that internally they have had issues recently related to their out-of-hours support rota. They are currently in the process of reviewing it. The engineers and support staff that I spoke to, in no uncertain terms, expressed that they should have done much better responding to our support requests.

BookingHawk.com is not without blame


We did not cover ourselves in glory here either. We cannot just blame Hosting Ireland. We should have been on top of how many resources we were using and ensured that we had a formal agreement in place for them. This would have prevented the ‘cPanel’ update from causing the Tomcat server to get into a restarting loop.
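To illustrate what being on top of resource usage could look like, here is a minimal sketch in Python that compares measured usage against formally agreed limits and flags anything close to or over its limit. Everything in it is an assumption made for illustration: the resource names, the figures, the 80% warning threshold and the `check_usage` function are hypothetical and do not describe our real account or any Hosting Ireland or ‘cPanel’ interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceCheck:
    name: str
    used: float
    limit: float
    ratio: float
    status: str  # "ok", "warning" or "over"


def check_usage(usage, limits, warn_ratio=0.8):
    """Compare measured usage against agreed limits, one result per limit.

    `usage` and `limits` map a resource name to a number in the same unit.
    A resource with no measurement is treated as zero usage.
    """
    results = []
    for name, limit in limits.items():
        if limit <= 0:
            raise ValueError(f"limit for {name!r} must be positive")
        used = usage.get(name, 0.0)
        ratio = used / limit
        if ratio >= 1.0:
            status = "over"
        elif ratio >= warn_ratio:
            status = "warning"
        else:
            status = "ok"
        results.append(ResourceCheck(name, used, limit, ratio, status))
    return results


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show all three statuses.
    agreed = {"memory_mb": 2048, "cpu_percent": 100, "processes": 100}
    measured = {"memory_mb": 1900, "cpu_percent": 45, "processes": 120}
    for check in check_usage(measured, agreed):
        print(f"{check.name}: {check.used}/{check.limit} ({check.status})")
```

Run on a schedule, a check like this would flag usage that is approaching or already above the agreed limits, long before a stricter enforcement of those limits could take the system down.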

We should have had better communication. Firstly, we should have informed our clients of the issues as soon as we became aware of them. Many of our clients first learned of the issue because their customers contacted them to say that they could not make bookings. Secondly, the error page that was served when users attempted to view booking pages or log in was technical and confusing. Ideally, we would have been able to serve an error page with a message saying what was happening and when we expected a fix.

What will be done to prevent this happening again?
We are in the process of putting together a formal agreement with Hosting Ireland regarding resource usage.
We are reviewing the performance of our systems to ensure we optimise resource usage.
We will be seeking a commitment from Hosting Ireland regarding support response times and Service Level Agreements. If we do not find this satisfactory, we will be reviewing our options regarding our hosting needs.
We will be improving the fault alerting in our system to ensure we are made aware of any issues within seconds of something going wrong.
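
As a rough illustration of the fault alerting mentioned in the last point, the sketch below polls a health URL and raises an alert after several consecutive failed checks, then reports recovery once the system answers again. It is not our actual monitoring: the URL, the check interval, the failure threshold and the `notify` callback are all assumptions made for the example.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers HTTP error statuses, refused connections and timeouts.
        return False


class OutageDetector:
    """Turns a stream of probe results into "alert" and "recovered" events."""

    def __init__(self, failures_before_alert=3):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, healthy):
        """Record one probe result; return "alert", "recovered" or None."""
        if healthy:
            was_alerted = self.alerted
            self.consecutive_failures = 0
            self.alerted = False
            return "recovered" if was_alerted else None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_before_alert and not self.alerted:
            self.alerted = True  # alert once per outage, not on every failure
            return "alert"
        return None


def monitor(url, notify, interval_seconds=10.0, check=probe,
            sleep=time.sleep, max_checks=None):
    """Poll `url` and call `notify(event, url)` whenever the state changes.

    `check` and `sleep` are injectable so the loop can be exercised without
    a network; `max_checks=None` means run until interrupted.
    """
    detector = OutageDetector()
    checks = 0
    while max_checks is None or checks < max_checks:
        event = detector.record(check(url))
        if event is not None:
            notify(event, url)
        checks += 1
        sleep(interval_seconds)
```

With these assumed defaults (a check every 10 seconds and three failures before alerting), an outage would be flagged roughly 20 to 30 seconds after it begins. Requiring several failures in a row trades a little speed for fewer false alarms caused by a single slow response.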

In Closing

With all this in mind, and no matter what we do, unfortunately, due to the nature of software, I cannot guarantee that our system will not suffer an outage again. It will, like any system, suffer outages. Even the largest companies in the world, with unlimited resources, are not immune to outages. What I can say is that the same sequence of events will not be allowed to occur again.

As I mentioned in earlier correspondence, Sunday was the worst possible day for this to happen. As a gesture of goodwill, I have applied one month's subscription to all business accounts on BookingHawk.com, effective immediately. I know your business may not feel this, but ours certainly will and it will not be forgotten.

Later in October, we are due to push an upgrade with some much-needed improvements specifically relating to Credit Bundles. When that upgrade goes live, I will be in touch.

Thank you for your patience and understanding over the past few days.

Niall
