I just thought I’d share some internal Hybrid Logic correspondence with the world, as it’s pretty exciting…
From: Luke
To: Mike, Rob, Kieran
Hey guys,
I’ve successfully set up our first cross-datacentre web cluster today!
Three servers are in our rack at Telehouse and three are hosted with ElasticHosts (also in London). The most impressive demo of course is turning off an entire data centre, which I just simulated by doing a hard power-off on all three ElasticHosts machines simultaneously *just* after updating a WordPress blog whose database was being hosted there.
I held my breath, counted to ten, clicked refresh, and voila, my latest post was still there.
I’ve also got real failure-resistant DNS set up, no more /etc/hosts hacks. All websites resolve to all six IP addresses (with the current live one first in the list). All modern web browsers fail-over to a new IP within one second of the current IP failing. This is serious redundancy which you usually have to pay a lot of cash for.
More good news – ElasticHosts have just announced a new data centre in San Antonio, USA. Imagine a “Prefer my site to be hosted in the US” checkbox in our control panel… that’s gonna be a reality.
It’s looking good!
Cheers,
Luke
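To make the DNS fail-over idea from the email a bit more concrete, here is a minimal Python sketch of what a client does when a site resolves to several IP addresses: try them in order and use the first one that answers. This is only an illustration, not how any particular browser or the web cluster itself is implemented. The function name first_reachable, the injectable connect parameter and the 192.0.2.x addresses (a range reserved for documentation) are all made up for the example.

```python
import socket
from typing import Callable, Iterable, Optional


def first_reachable(
    addresses: Iterable[str],
    port: int = 80,
    timeout: float = 1.0,
    connect: Optional[Callable[[str, int, float], None]] = None,
) -> Optional[str]:
    """Return the first address that accepts a connection, trying them in order.

    `connect` is a hypothetical hook so the logic can be exercised without a
    network; by default a real TCP connection is attempted.
    """
    if connect is None:
        def connect(host: str, p: int, t: float) -> None:
            # Open a TCP connection and close it straight away on success.
            socket.create_connection((host, p), timeout=t).close()

    for address in addresses:
        try:
            connect(address, port, timeout)
            return address
        except OSError:
            # This IP is down or unreachable; move on to the next one.
            continue
    return None


if __name__ == "__main__":
    # Placeholder addresses: pretend the first two servers are powered off.
    powered_off = {"192.0.2.1", "192.0.2.2"}

    def fake_connect(host: str, port: int, timeout: float) -> None:
        if host in powered_off:
            raise OSError("connection refused")

    print(first_reachable(["192.0.2.1", "192.0.2.2", "192.0.2.3"], connect=fake_connect))
```

Putting the current live server first in the list means the common case succeeds on the first attempt, and the other addresses only matter once something has failed.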
Mike replied with a few questions, which I answered:
I just had a thought. Does this also make it an awesome migration tool? For example if I wanted to move all my client sites to our own web cluster it would be an admin nightmare. However if we used HL to add web cluster machines to our existing network could we then quietly turn off our existing servers and our world keep spinning?
If you’re trying to move from one data centre to another and the web cluster software can be installed at both, then yes, it’s a life-saver. Even if you had to move physical hardware from one location to another, you could do it by simply unplugging one server at a time (everything carries on working), moving them from A to B (also one at a time), and telling the web cluster where to look. The transition would look like this:
(Site A – old data centre, site B – new data centre, each * represents a server)
A * * * * * B
A * * * * B *
A * * * B * *
A * * B * * *
A * B * * * *
A B * * * * *
And voila, you’ve seamlessly *moved data centres* without a minute’s downtime for any of your sites.
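The one-server-at-a-time transition shown in the diagram can be written as a tiny loop. This is just a sketch that reproduces the picture for any number of servers; the function name migration_steps is invented for the example and is not part of the web cluster software.

```python
def migration_steps(total_servers: int) -> list[str]:
    """Return one diagram line per step of moving servers from site A to site B.

    Each '*' is a server. Exactly one server moves between consecutive lines,
    so at most one machine is unplugged at any moment.
    """
    if total_servers < 0:
        raise ValueError("total_servers must be non-negative")

    lines = []
    for moved in range(total_servers + 1):
        at_a = ["*"] * (total_servers - moved)  # servers still at the old site
        at_b = ["*"] * moved                    # servers already at the new site
        lines.append(" ".join(["A", *at_a, "B", *at_b]))
    return lines


if __name__ == "__main__":
    for line in migration_steps(5):
        print(line)
```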
It doesn’t solve the initial problem of moving all my client sites to the new software infrastructure. That will be done manually, and gradually, site-by-site. With the new DNS system now working you’d set up a bunch of nameservers, say ns1.tpj-cloud.com, ns2.tpj-cloud.com which would be hosted by the web cluster itself, and then you’d migrate the sites one-by-one to your new web cluster, at which point they gain the ease of transferability between data centres.
Whether your existing dedicated server host will support FreeBSD is another question… there’s always the Depenguinator!
Also would I be right in assuming that all machines in a web cluster have to have all applications installed on them? Sorry if that’s a Noddy question.
Not at all, it’s a good question. Let me try and explain how it works:
Not every site needs to be copied onto every single server. Suppose you have ten servers and you want a “redundancy guarantee” that you could turn off *any four servers* and not lose any data (perhaps the maximum number of servers in any single data centre is four, and you want to be data-centre-failure-proof).
What this means is that every website has to be copied onto 5 servers. Consider the worst case for some website X: if four of the five servers which had copies of X went down, there’d still be one server left with that data. This is called N+1 redundancy. N is the level of redundancy (or acceptable risk), that is, the number of simultaneous server failures you want to survive, and N+1 is the minimum number of copies of each piece of data (website, database) you need to guarantee that level of redundancy.
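If you want to convince yourself of the N+1 rule, a brute-force check is easy to write. The sketch below is an illustration only, not the web cluster's actual placement logic: it takes a made-up mapping of websites to the servers holding a copy and tries every possible combination of N failed servers. The names survives_any_failures and placement are assumptions for the example.

```python
from itertools import combinations


def survives_any_failures(
    placement: dict[str, set[str]], servers: list[str], n: int
) -> bool:
    """Return True if every site keeps at least one copy after ANY n servers fail."""
    for failed in combinations(servers, n):
        failed_set = set(failed)
        for hosts in placement.values():
            # A site is lost when every server holding a copy has failed.
            if hosts <= failed_set:
                return False
    return True


if __name__ == "__main__":
    servers = [f"server{i}" for i in range(10)]

    five_copies = {"website-x": set(servers[:5])}
    four_copies = {"website-x": set(servers[:4])}

    # Five copies (N+1 with N=4) survive any four failures; four copies do not.
    print(survives_any_failures(five_copies, servers, 4))
    print(survives_any_failures(four_copies, servers, 4))
```

The check boils down to "every site has at least N+1 copies", but enumerating the failures makes the worst case explicit.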
This number N, the redundancy level, is a knob you can tweak within the web cluster. It represents a trade-off between disk and bandwidth usage (replicating websites to lots of servers takes more network and I/O bandwidth) and the redundancy of your web cluster (its resistance to failure). If you want your web cluster to be nuclear-bomb-proof then you need N to be as high as the maximum number of machines in any one city (assuming only one city gets hit at a time!). But if you’re less paranoid about mutually assured destruction and want to make better use of your bandwidth and disk space, you might decide that the worst you need to survive is *any two machines failing simultaneously*, in which case you can set N=2. Then there’ll always be three copies of every single piece of data, so that you can’t possibly lose it when two machines go down.
Note that setting N to be a small constant number with respect to the number of machines in your web cluster enables the scalability of your web cluster. If every website only has to be replicated to 2 machines in a 100 node web cluster, you get 50 nodes’ worth of disk space and capacity to play with. When you scale that same web cluster to 200 machines, you get 100 nodes’ worth of disk space. In other words, you get the holy grail of linear horizontal scalability, while retaining your chosen level of redundancy.
At the other extreme, if every website in that same 100 node web cluster has to be replicated to 50 machines, you only get 2 nodes’ worth of disk space to play with, but you do have an incredibly resilient system!
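The arithmetic behind those numbers is just the number of nodes divided by the number of copies of each site. Here it is as a short sketch. The function name usable_capacity is made up for the example, and it assumes all nodes have the same amount of disk and that data is spread evenly.

```python
def usable_capacity(nodes: int, copies: int) -> float:
    """Return the usable disk space, in 'nodes' worth', of a cluster.

    `copies` is the number of machines each website is replicated to
    (N+1 in the terms used above).
    """
    if copies < 1 or copies > nodes:
        raise ValueError("copies must be between 1 and the number of nodes")
    return nodes / copies


if __name__ == "__main__":
    for nodes, copies in [(100, 2), (200, 2), (100, 50)]:
        print(f"{nodes} nodes, {copies} copies: {usable_capacity(nodes, copies):g} nodes' worth")
```

With the number of copies held constant, doubling the nodes doubles the usable space, which is the linear scaling described above.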
Any questions? Just shout in the comments and I’ll get back to you!