Full Version: Suggestions from 05/06/03
Hostony Board > General Support > Suggestions
thuskey
So many people have expressed various emotions about today's outage of the server10 web server. Resolution of the issue, although long overdue, has now been achieved. As a new customer of Hostony, I would like to give them another chance. If I were to judge their overall service and performance level on the events that have happened here in my first week, I would be acting totally unfairly. Being a unix/linux administrator myself, I deal with these situations from time to time as well and understand the hurdles you have to overcome to keep your servers functioning.

From experience, every good administrator should do three things after each outage:

1) Research and thoroughly understand the events that led up to the outage.
2) React to those findings by developing and initiating new procedures to avoid repeating the outage.
3) Develop a plan of action to improve resolution times or to make a similar outage virtually transparent to all users.

It's obvious I'm not going to be able to help you research the actual events that led up to today's outage, but I do have a few suggestions on what would make me, your customer, feel better the next time an outage occurs. I encourage all readers of this post to add your own suggestions and show your continued support for Hostony.

1) In case of a server outage, I would appreciate it if my customers were forwarded to a "We are currently experiencing technical difficulties" page, even if it resides on another server. I would rather my customers not see the "Cannot find server" or "404 not found" messages, and especially not the "Internal Server Error" message. This makes a bad impression on my company, which is reflected in my overall sales.

2) Think security: linux viruses and linux hackers can't cause that much damage unless they achieve root-level access. I would expect a little better server lockdown and a little better security monitoring after today's issues.

3) It's time to develop new tools/scripts to streamline the recovery of your customers' web pages. If you don't already have one, set up a separate server to house the support and automation scripts that will be used the next time you experience an outage. An example of an excellent script would be one that recreates a customer's domain on a different server, initiated either automatically or manually, in the event that a customer's domain goes down. Another script might occasionally poll customer resource usage and email the customer when thresholds are nearly exceeded.
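The resource-polling idea in suggestion 3 could be sketched roughly as follows. This is a minimal illustration, not an actual Hostony script: the account names, quotas, and the 90% warning ratio are all made-up assumptions, and a real version would send the messages via SMTP rather than print them.

```python
# Sketch of a resource-usage poller: flag a customer when usage
# approaches a quota. All account data here is hypothetical.

WARN_RATIO = 0.9  # warn at 90% of quota (illustrative threshold)

def near_quota(used_mb, quota_mb, warn_ratio=WARN_RATIO):
    """Return True when usage has crossed the warning threshold."""
    return used_mb >= quota_mb * warn_ratio

def build_warnings(accounts):
    """accounts: list of (name, used_mb, quota_mb) tuples.
    Returns the warning messages that should be emailed out."""
    messages = []
    for name, used, quota in accounts:
        if near_quota(used, quota):
            pct = 100.0 * used / quota
            messages.append(f"{name}: disk usage at {pct:.0f}% of quota")
    return messages

if __name__ == "__main__":
    sample = [("acme", 450, 500), ("widgets", 120, 500)]
    for msg in build_warnings(sample):
        print(msg)  # in production this would go out via sendmail/SMTP
```

Run from cron every few minutes, a poller like this would give customers advance notice instead of a surprise outage when a quota is hit.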

Anyone else have a few good ideas? I'm sure I'll think of more good ones sometime between now and the morning.
sscweb
you know unix.... i don't.... I only know business, and business can't be down for 16 hours.... is that reasonable for unix? I get told every week by my techs how great it is, but in the 4 weeks I've been with hostony I've had 3 weeks of grief....
thuskey
QUOTE
you know unix.... i don't.... I only know business, and business can't be down for 16 hours.... is that reasonable for unix? I get told every week by my techs how great it is, but in the 4 weeks I've been with hostony I've had 3 weeks of grief....


Unix can be great. When properly managed, downtime is measured in minutes, not hours. So far I see two guys fighting fires; without putting the proper tools in place to manage the Hostony domain, the service will never reach the level of being GREAT. I really hope Hostony takes this opportunity to learn from past mistakes and implements an improved level of service.
Serge
Once again, sorry for the downtime you experienced. As you already know, the service restore was delayed because of the tornado warning in Atlanta.
Once the datacenter permitted a tech to go to the box, Mike started to restore the HDD and we moved on with it.

Thank you for your information. Here are answers to your questions.

1) Research and thoroughly understand the events that led up to the outage.

We had a customer who turned out to be a phreaker and had paid with a stolen credit card for 2 years of the profi account. He tried to modify the kernel in memory and install a trojan that would give him root access after the box rebooted.
Cpanel has a trojan detection system that reported it to us, and Alex logged on to the box to see how to remove it. The phreaker had damaged the kernel, making it inoperable, in order to force us to reboot the box; all the services except http were nonfunctional.
Alex decided to leave the box as it was, with only http functional, because there was a tornado warning and he was sure the phreaker also could not log in to the box.
He saw which user did it; after the server restore we terminated his account and refunded the money to the person whose credit card he had stolen.
When Mike was able to work on the server, he found that the damaged kernel had also damaged the HDD, and he needed some time to restore it with fsck. Copying data from the HDD is much faster than restoring it from backup, so it was wise to spend time on the HDD restore while the OS was installing on the other HDD. After the OS and Cpanel were installed on the box, we copied the users' data and restored the configs.

2) React to those findings by developing and initiating new procedures to avoid repeating the outage.
We have jailrooted all shell accounts in order to limit what malicious people can do. But the main source of the problem is users who sign up with stolen credit cards and get access to the box. There is no way to stop them completely. We use 2checkout.com for payment processing, and they check orders against their database of fraudulent cards, emails, and persons, and even call back the numbers provided, but a small fraction of fraudulent orders still get through.

3) Develop a plan of action to improve resolution times or to make a similar outage virtually transparent to all users.
I really cannot see how resolution times could be improved; I could hardly see any big fault during the restore. The only thing is that we stopped making personal answers to users for about 2 hours, when everyone had to concentrate on the configuration restore, but we posted announcements and updates in the forum. The tornado warning also delayed us.


As for your suggestions.
1) Due to the datacenter's network configuration and the anti-abuse procedures they plan to use, this is possible only in rare cases. The datacenter has software that monitors boxes for IP stealing, because some people like to steal IPs by binding them to their own boxes, doing something illegal with them, and then releasing them back to the original owner. Since it is hard to catch such people, the datacenter had to develop a monitor that switches off the abusing box as soon as an IP is stolen, in order to release it back to the owner as quickly as possible.

2) As I stated above, we already have chkrootkit installed on the server, plus the trojan-monitoring software that comes with Cpanel. It had previously stopped 4 hack attempts on other servers with no downtime; in this case it just did not help. Increasing security means limiting the user. The simplest way to keep hackers out is to switch the box off completely, but customers need to work on it and use its various services...
We just installed the jailroot and had numerous complaints from about 20 customers that they now cannot use this or that command, etc.
The security/usability balance is a hard thing.
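The jailing Serge describes confines each shell account to its own directory subtree, which is what breaks commands that reach outside it. The containment rule itself is simple; here is an illustrative userland sketch of the same check (the jail path and file names are hypothetical, and a real chroot enforces this at the kernel level rather than in a script):

```python
import os.path

def inside_jail(jail_root, requested):
    """Return True if `requested`, once resolved, stays under jail_root.
    This mirrors the containment a chroot jail enforces: no amount of
    ../ traversal may escape the jail directory."""
    jail = os.path.realpath(jail_root)
    target = os.path.realpath(os.path.join(jail, requested.lstrip("/")))
    return target == jail or target.startswith(jail + os.sep)

if __name__ == "__main__":
    print(inside_jail("/home/jail", "public_html/index.html"))  # stays inside
    print(inside_jail("/home/jail", "../../etc/passwd"))        # escapes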


3) Automation scripts are good when you are working with robots that do standard things. We work with people, who are individuals, not robots, and most of them require specific configurations, etc. Cpanel already has automated restore scripts; the only thing is that they don't restore custom configurations made to the accounts. That's why we had to restore those manually, using parts of the old configs, in order to bring services back for the major part of the customers the way they had asked for them to be configured before. It did not go smoothly, and some people complained about passwords not working, etc., but at the moment we don't have any complaints that something is not working on this server. Probably we'll get some, but the major part of them were received and resolved within several hours after the restore. For comparison: a year ago, when we had a similar issue, we had to spend 8 days sitting on the server redoing lost custom configs.

Certainly we use automation and scripts to improve the work process, but it is not always the best solution.
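One way to combine the two approaches Serge mentions is to automate the standard restore step while queuing accounts with custom configs for manual attention. A minimal sketch, assuming hypothetical account data and a package-restore command resembling Cpanel's (the names here are illustrations, not the actual tooling used):

```python
# Sketch: automate the standard restore, flag custom accounts for
# manual follow-up. Account list and backup paths are hypothetical.

def plan_restore(accounts):
    """accounts: list of (name, has_custom_config) tuples.
    Returns (commands_to_run, accounts_needing_manual_review)."""
    commands, manual = [], []
    for name, custom in accounts:
        # standard step: restore the packaged account from backup
        commands.append(["restorepkg", f"/backup/{name}.tar.gz"])
        if custom:
            # custom configs are not restored automatically; a human
            # must reapply them from the old config fragments
            manual.append(name)
    return commands, manual

if __name__ == "__main__":
    cmds, manual = plan_restore([("acme", False), ("widgets", True)])
    for c in cmds:
        print(" ".join(c))
    print("manual review:", manual)
```

Splitting the work this way keeps the bulk restore fast without pretending the individual configurations can be scripted away.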

Once again thank you for taking time to write your suggestions. I appreciate it.
sscweb
Alright, here we are, down again. Resolution times must be improved. I am an IT Director and have worked in this industry for a great many years. Often I find that when someone says it can't be done, I end up doing it. I recommend that if you are not implementing disaster recovery, you look into it. That would drastically reduce your rebuild time.

Secondly, businesses cannot be down this often; we are about to lose our clients over the fiasco of this week. So let me break it down: we lose our customers, you lose yours. It's that simple. If you are unwilling to improve service, that's one thing; if you are unable to, that's another. If you're unwilling, let us all know now so we can make decisions that will benefit our clients; if you're unable, then perhaps your corporate services need to be rethought as to what you need to offer.

We hate to be this way, but something must change; we can't keep going through this. I understand things happen, but we have had more go wrong than normal.
Serge
sscweb
It was a 5-minute maintenance downtime, announced in the announcements forum yesterday. The server came back up 5 minutes after we enabled hyperthreading support on the motherboard and pulled out the old HDD.