Welcome to the AGUK Portal

Recent System Failures November 21st, 2007

Dear Customer,

Over the last few days there have been a number of problems on the AGUK network.

On Sunday 18th November at around 11:30GMT the performance of the server CHIEFWIGGUM started to deteriorate. Despite efforts by engineers the server became unrecoverable. It appears the RAID system had failed and corrupted both disk drives within the machine. At 23:00GMT on the same day it was decided that the system could not be fully recovered and that instead we would begin restoring sites to alternative servers using backup data.

By 08:00GMT Monday 19th November sites had been restored. During the course of Monday we began cleaning up some sites which showed errors. At times it was necessary to restart some services to release locked files. This caused some sites to return Service Unavailable errors or, in the case of PHP, CGI handling errors.

On Tuesday 20th November at 11:45GMT we reported that service had been restored except for control panel bandwidth and disk space statistics, payment facilities through the control panel and web site statistical data. We had stopped both the control panel reporting and payment systems to prevent any errors occurring while we made changes to the control panel set-up. We did not want users being charged incorrectly for their usage nor did we want problems with payments to occur until the control panel was working correctly. The website statistical data took some time to be restored. This is because of the vast amounts of data that had to be restored from backup and the re-allocation of data to the relevant server after the site had been moved.

At 16:45GMT on Tuesday 20th November we had an unrelated hardware failure with one of our internal routers. This router carries internal traffic between servers such as the web servers and database servers. As a result sites using MySQL and MSSQL began to show connection errors. This router was swapped out with another router and connectivity restored. During the replacement of the router all connectivity was disabled for approximately 15 minutes. Service was returned to normal at around 17:45GMT.

Today, Wednesday 21st November, at around 11:45GMT we began seeing some sites hosted on BART show Service Unavailable error messages. We investigated this and found the cause to be a misconfiguration of the Helm Control Panel following the relocation of both the control panel and many sites to alternative servers. We resolved the error but this required a reboot of the server which was performed at 12:45GMT. Following the reboot some corruption was discovered with the PHP set-up. This was verified and resolved by 13:20GMT.

The problems we have faced have been due to hardware failure. There is little we can do when systems fail other than re-allocate resources or replace the hardware, which is what was done in each case. Because of our backup system we were able to restore the data that was corrupted on the CHIEFWIGGUM server.

Over the last few days we have received an enormous number of emails, telephone calls and messages. This is to be expected given the problems. We are only now able to catch up in replying to these messages. We apologise if your email has so far gone unanswered or you have left a telephone message which we have not yet responded to. We are working through these and we will reply as soon as we can.

There are some repeated points being made which we will try and answer below.

The systems affected were our shared hosting systems. Shared hosting means that you share the server with many other websites. In return you receive a lower price on your hosting costs. While websites are split across multiple servers, individual sites are not maintained on multiple servers. To have each website hosted on more than one server in a failover set-up is complex and much more expensive, more expensive than many are willing to pay for shared hosting. Such set-ups are available from us, but they need to be requested and we will supply a quote relevant to the particular website.

Many of our customers have stated that this is a serious decline in the service being offered by AGUK. It is accepted that any downtime is bad for both your business and ours. However, up until the incident this week the overall average uptime for all AGUK services year to date has been 99.765% (including maintenance outages). Again, we accept that the recent problems can be viewed negatively, but it is important to understand that overall we believe the service offered by AGUK to be better than the industry average. We are also continuing to work on improving our network and service; further announcements on future updates will be made in due course.
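
As a rough illustration of what an uptime percentage of this kind represents, the short sketch below converts 99.765% into hours of downtime. The measurement window (1st January 2007 up to the start of Sunday 18th November 2007) is an assumption made for the example, as is the function name; the notice itself only says "year to date".

```python
from datetime import datetime


def downtime_hours(uptime_percent, start, end):
    """Return the hours of downtime implied by an uptime percentage over a period."""
    total_hours = (end - start).total_seconds() / 3600
    return total_hours * (100 - uptime_percent) / 100


# Assumed window: start of 2007 up to the start of 18th November 2007 (321 days).
start = datetime(2007, 1, 1)
end = datetime(2007, 11, 18)

hours = downtime_hours(99.765, start, end)
print(f"Implied downtime: {hours:.1f} hours")
```

Under that assumed window the figure works out to roughly 18 hours of downtime, maintenance included, spread across the year to date.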

AGUK is a small hosting provider when compared to some of the larger scale players in the market. However, this does mean we care significantly about all our customers. Our business model is not about “customer churn” (getting new customers every day to replace those lost); it is about maintaining the customers we have and ensuring they receive the best service we can offer.

Finally, thank you to all customers for your understanding during the recent problems. We apologise for all the inconvenience this must have caused.

AGUK Hosting Team

PS Additional note from Andy Gambles (owner / director / tea boy). Many of you will know me as I do answer a lot of the support tickets we receive. However we also have a number of staff and contractors who also answer support tickets and work in the data centers. Just wanted to say thank you to them for putting up with me being a substantial pain in the ….. for the last few days. The swear box looks pretty healthy at the moment!


6 Responses to “Recent System Failures”
  1. Ian Maynard Says:

    I just wanted to add a quick comment to show my appreciation for all the hard work that has been going on behind the scenes. Well done chaps! Just one small point regarding communications. In future, would it be possible to put more detailed info on www.viewnetworkstatus.net? For instance, if customers knew that there was a major hardware failure, it would lead to less emails and phone queries, and therefore less mopping up when things were back to normal.

    Keep up the good work!

  2. Mark Lyth Says:

    Hiya Andy,

    Yes it's been a stressful time, but let's hope we are out the other side of it now. Thank you for your emails and updates on the situation and doing quick fixes to some of the sites we run to get them going again. It's appreciated.

    Regards
    Mark Lyth

  3. J. Stathatos Says:

    I think what exacerbated the frustration to a very considerable degree was the lack of any source of information for the better part of a day, since all AGUK pages were down. It would be sensible, in future, for you to have a backup internet page of even the most primitive kind which would at least let customers know you were all still alive. Even a simple message saying, in effect, “It’s all gone pear-shaped, but we’re working on it…” would be better than complete silence.

  4. Andy @ AGUK Says:

    We do have a status page hosted on an alternative network. This can be found at http://viewnetworkstatus.net and contained updates during the problems.

  5. Ian Maynard Says:

    But that’s the point I made earlier. viewnetworkstatus is not updated until after the event. It happened again today. The site was down for 4 hours, but there was no update posted.

  6. Andy @ AGUK Says:

    Ian,

    Your comments are noted. We did not update the status page on Thursday when we should have done.
