Archive for the ‘Technical details’ Category

Hybrid Cluster — self-healing, auto-scaling & very forgiving

Wednesday, December 28th, 2011

You won’t have heard much from Hybrid Logic recently. With an early-stage tech company this can mean one of two things: either they’ve given up and gone home, or they’re mad busy innovating, building and shipping their product.

I’m pleased to report that in our case it’s the latter ;-)

Hybrid Cluster has had an extraordinary year of development and we’re on the cusp of releasing some very exciting new features for the world to get to grips with. What we’ve done is nothing short of revolutionary — we’re changing the fundamental assumptions about how your servers can co-operate together, how applications and databases can scale, and how companies do business continuity planning across data centres.

In the “old world”, a server is seen as a single entity; one which has its own specific configuration, and which hosts a set of applications and databases. If you’re staying up-to-date with the industry, you’ll have virtualized that server and put its storage in a centralized storage system (a SAN, for example) — now that’s all very well, but the virtual server is still conceptually a single server and can still suffer from these three problems:

  1. Hardware and networks fail
  2. Servers get over-loaded when there are spikes in demand
  3. Users make mistakes

At Hybrid Logic it’s our mission to solve all three of these problems for your existing LAMP applications, and our software — available for license today — solves them by employing a fundamental paradigm shift in industry thinking.

Individual servers and storage systems should not be the unit of concern for you, the developer or administrator. Applications, databases and mailboxes should be — the servers should look after themselves.

Now, if you look a little further down the road, this is the way the industry’s moving. In cloud, the move from IaaS to PaaS is exactly this: developers and sysadmins should not have to think about individual server instances ever again. Their servers should form a cognizant, co-operative group on their own. This is exactly what our software does. It transforms a bunch of dumb, commodity machines, connected by slow and unreliable network connections, into a loosely-coupled distributed cluster in which the failure of an individual server, or even an entire data centre, is healed automatically. The cluster carries on working and keeps your applications, databases and mailboxes online even in the face of catastrophic failure of an entire region.

I’m Luke, the CTO here at Hybrid Logic, and in the next few blog posts I’m going to give you a bit of insight into how we do it ;-)

Happy New Year!

Cheers,
Luke

Announcing txMySQL – native async Twisted MySQL protocol

Tuesday, February 8th, 2011

Hybrid Logic is pleased to announce the release of txMySQL, a native Twisted MySQL protocol implementation at https://github.com/hybridlogic/txMySQL

The bulk of this code is courtesy of _habnabit (thank you!); we just added authentication support and fixed a couple of bugs which were stopping the MySQL protocol parser from working.

This works well enough to .fetchall() basic result sets and .query() any other MySQL statements you care to run. See example.py.

Feel free to fork, tweak, fix, use, report issues, etc.

More performance and stability improvements

Sunday, December 5th, 2010

Hello everyone,

We’ve been working hard over the weekend and have some good core cluster stability and performance improvements to show for it, along with a new internal performance testing tool and a sneak peek of our Hybrid Sites project:

First, a bug in Twisted which was causing the distributed proxying layer to sometimes stop accepting new requests has been worked around. This means you shouldn’t see database connection errors any more. If you do, please report them by posting on the forums!

Second, we’ve now got a new internal performance testing tool which shows us a scatter graph of a cluster’s response time and stability:

This plot shows, for example, an average response time of around 400ms for a WordPress blog and our Control Panel (the CP in red, WordPress in blue). The few outliers show latencies of up to 10 seconds when a server fails! Much better than the usual hours or days of downtime!!

Next, I’ve got a couple of sneak peeks of our Hybrid Sites website and Control Panel. This will be our flagship cloud web hosting platform, perfect for developers, designers and publishers alike.

And here’s the control panel, showing off our whitelabel features:

A massive amount of under-the-hood work has gone on with the Control Panel in readying our incredibly powerful reseller system for Hybrid Sites’ go-live in a couple of weeks.

We are also happy to announce that Hybrid Sites will be launched in association with ElasticHosts, giving their customers access to powerful, simple cloud web hosting — a much easier option than setting up their own Linux box over SSH. Hybrid Sites will in fact be launched across multiple cloud providers, including CloudSigma, to provide impressive cross-cloud redundancy.

We also have a reseller API which presently has 35 commands and growing. This will allow you to set up reseller accounts, take payments, set up websites and databases, and purchase domains, all through a powerful REST API. This will come hand-in-hand with the WordPress plugin which runs our (or your) frontend web hosting company page. Hybrid Sites will be the first to prove this technology tool-chain :-)

Lots more to come this week, but for now, please try to thrash the knackers off your beta clusters, and get in touch if you want to test drive it and you’re not on the beta yet!

Cheers!

Luke Marsden, CTO

Running FreeBSD 8.1 as a Xen HVM DomU on Flexiant

Friday, November 26th, 2010

Just thought I’d share the incantations which were necessary to get the FreeBSD 8.1 XENHVM kernel to work well on Flexiant, with paravirtualised network and disk:

Just before the kernel boots (which you have to be quick to catch with Flexiant’s VNC client) hit F6 on the bootloader and type:

set hw.clflush_disable=1
boot

This will allow you to boot even a GENERIC kernel. Once you’re booted, chuck hw.clflush_disable="1" into /boot/loader.conf to make this permanent.

You’ll then want to build your own kernel for paravirtualised network and disk drivers. Edit the XENHVM kernel config (see the FreeBSD handbook on compiling your own kernel) – comment out the MODULES_OVERRIDE line, which disables building all the modules (assuming you want ZFS support), and also comment out the whole section about WITNESS and INVARIANTS, as having these enabled will slow down your kernel quite significantly.

Then you’ll need to patch the network driver as per this post (manually, since the code has changed a bit), else you get a lot of dropped packets:

http://www.mail-archive.com/freebsd-xen@freebsd.org/msg00598.html

Then just (as root):

cd /usr/src
make buildkernel KERNCONF=XENHVM
make installkernel KERNCONF=XENHVM
shutdown -r now

And enjoy your speedy FreeBSD 8.1 VM in the cloud!

Hybrid Web Cluster now runs on Xen HVM IaaS providers

Friday, November 26th, 2010

The proof:

Lots more IaaS providers, here we come!

FTP support now in Hybrid Web Cluster

Tuesday, November 23rd, 2010

Update: This has now been deployed to your beta web clusters! And check out the new video below…

We now have clustered FTP support! This means you can open an FTP connection to any node on the cluster at any time, and you’ll get internally redirected to the correct backend server for that site, and authenticated against the password you’ve set up in your Control Panel.

FTP is one of the most annoying and broken protocols on the planet (a separate data connection, whyyy?), but it’s also crucial for uploading your website — so it’s rather good that we support it now :-)

This inherits the same nice properties which HTTP and MySQL requests enjoy while traveling through our distributed proxying layer, such as stopping site-juggling from occurring while an FTP connection is happening (so your site won’t get load-balanced if you’re uploading to it). It also means that if your site is being juggled at the same moment you connect to it, you’ll get a slight delay before your FTP connection is established, rather than an error message.

We are nearly feature complete!

Stability, performance and rendering improvements

Friday, November 19th, 2010

It’s been a busy first week of the beta here at Hybrid Logic HQ, and we’re very pleased by the response we’ve had to the start of the beta — thank you! There’s been a buzz of activity on the forums and we love it when you give us feedback, so please carry on experimenting with our software and tell us what you think.

Along with the awesome feedback from yourselves, which we are taking careful note of, we’ve also been making some improvements of our own. Here’s a quick breakdown of the fixes and improvements which have now been deployed across all your clusters:

  • At the deployment level, we can now add new instances to an existing cluster. This means we can unintrusively upgrade a cluster to include new or better spec machines (irrespective of physical location) so that you can scale your hosting operation seamlessly.
  • The “God Pod” has had some significant responsiveness improvements. When we first launched, it wasn’t the most responsive user experience in the world. It’s much quicker and more accurate now, so give it a go!
  • We’ve made significant improvements to the stability of the core web hosting platform. We’ve solved several problems which were causing “Default site on X” error messages where your websites should have been. Another bug was causing databases to sometimes become inaccessible, and we’ve solved that too. Stability is looking a lot better.
  • We’ve improved the intelligence of the core load balancing algorithms, meaning that decisions to move a site from one server to another (due to load) are now a fair bit smarter, and you should see fewer unnecessary load balancing events. As ever, there’s still room for improvement.
  • We’ve enabled swap on all your machines, so that if your 1.4GB memory does ever get fully used up, your instances will just become slow for a few minutes as they recover, rather than falling over or crashing completely.
  • When a site is about to be moved from one server to another, what happens internally is that requests for that site get “paused” by the distributed proxying layer which runs on top of the web and database servers. This pausing happens so that during the transfer of the site or database from one server to another, none of the requests return error messages — rather, the user just experiences a slow page load. The Load Balancing Diagram in the God Pod now shows a dotted line around a site when it is paused. This gives you a better insight into what’s happening within the cluster during the process of moving sites from one server to another to keep your servers healthy and balanced.
  • Performance has been improved massively. Previously, load balancing events caused sites to be blocked for up to 20 seconds. We’ve managed to get this down to 3-6 seconds in most cases, resulting in fewer requests building up. We’ve also made some code changes which have made everything feel a lot snappier. We will be continuing to optimise for performance over the coming weeks and months — this is only the start!
  • Numerous tweaks and improvements to functionality in the Control Panel have also been deployed (more details on this will be posted to our forum in due course).
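
The “pausing” behaviour described in the list above can be illustrated with a toy model. This is only a sketch in Python, not our actual proxying code: the class name PausingProxy, its methods and the server names are all made up for illustration, and a real proxy would hold open network connections rather than strings in a queue.

```python
from collections import deque


class PausingProxy:
    """Toy model: hold requests for a site while it is moved between servers."""

    def __init__(self, routes):
        # routes maps a site name to the server currently hosting it (hypothetical names).
        self.routes = dict(routes)
        self.paused = {}  # site -> queue of requests held during a move

    def handle(self, site, request):
        # While a site is paused, hold the request instead of returning an error.
        if site in self.paused:
            self.paused[site].append(request)
            return None
        return (self.routes[site], request)

    def begin_move(self, site):
        # Start holding requests for this site.
        self.paused.setdefault(site, deque())

    def finish_move(self, site, new_server):
        # Point the site at its new server, then release held requests in order.
        self.routes[site] = new_server
        held = self.paused.pop(site, deque())
        return [(new_server, request) for request in held]


proxy = PausingProxy({"example.com": "server-a"})
first = proxy.handle("example.com", "GET /")         # served by server-a
proxy.begin_move("example.com")
held = proxy.handle("example.com", "GET /about")     # held, no error returned
released = proxy.finish_move("example.com", "server-b")
after = proxy.handle("example.com", "GET /contact")  # served by server-b
```

From the visitor’s point of view the held request simply takes longer, which matches the “slow page load” rather than an error message.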

We can’t wait to see how much better we can make it next week!

Lightning talk at CloudCamp London

Thursday, October 21st, 2010
For those who missed it, here’s the text of my talk at CloudCamp London yesterday. CloudCamp was great fun, thanks Chris!

Slide 1

Hi, I’m Luke from Hybrid Logic and I’m going to talk about filesystem snapshots and how they are useful in cloud computing.

Slide 2

A snapshot is an instantaneous point-in-time copy of your filesystem. The blocks that haven’t changed aren’t needlessly copied so you can store lots of snapshots with less disk space than you’d expect.
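
To make the block-sharing idea concrete, here is a minimal sketch in Python. It is a toy model, not how any real filesystem is implemented: ToyFilesystem and its methods are hypothetical names, and it copies a small path table on each snapshot, whereas a real copy-on-write filesystem avoids even that.

```python
class ToyFilesystem:
    """Toy copy-on-write store: snapshots share unchanged blocks with live data."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes, shared by live data and snapshots
        self.live = {}       # path -> block id for the current filesystem state
        self.snapshots = {}  # snapshot name -> frozen copy of the path table
        self._next_id = 0

    def write(self, path, data):
        # Copy-on-write: never overwrite a block, always allocate a new one.
        self.blocks[self._next_id] = data
        self.live[path] = self._next_id
        self._next_id += 1

    def snapshot(self, name):
        # Only the small path table is copied; no data block is duplicated.
        self.snapshots[name] = dict(self.live)

    def read(self, path, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[path]]


fs = ToyFilesystem()
fs.write("index.html", b"<h1>v1</h1>")
fs.write("logo.png", b"\x89PNG...")
fs.snapshot("monday")
fs.write("index.html", b"<h1>v2</h1>")  # only this file gets a new block

stored_blocks = len(fs.blocks)  # 3 blocks, not the 4 a full copy would need
old_page = fs.read("index.html", snapshot="monday")
new_page = fs.read("index.html")
shared = fs.snapshots["monday"]["logo.png"] == fs.live["logo.png"]
```

The unchanged logo is stored once and referenced by both the snapshot and the live filesystem, which is why lots of snapshots cost less disk space than you’d expect.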

What are snapshots good for? Well, have you ever deleted important files by accident? Keeping snapshots lets you quickly “roll back time”.

Also, if you can copy your snapshots onto a different server, they can act as a great backup which you can recover very quickly from.

Cloud instances aren’t perfect, and data loss/instance failure is not unheard of in public clouds. Whole industries have grown up around dealing with the transient, ephemeral nature of cloud instances.

Being able to take a snapshot of your server and clone it brings a new level of manageability as well. If you’ve ever started up an EC2 instance, then you have – perhaps unwittingly – cloned a snapshot of a disk image.

Slide 3

The cloud storage model

Infrastructure is the underlying compute hardware, whether real or virtualised. With respect to storage, the infrastructure corresponds to the block device exposed by, say, EBS on EC2, or the physical hard disk in a non-cloud data centre.

The platform includes the Operating System and, crucially, the Filesystem which you choose to install on your cloud instances.

My claim is that it’s better to have the snapshotting done at the filesystem level, than to rely on the underlying infrastructure’s snapshotting capabilities, if they exist at all.

Slide 4

The primary benefit of doing this is the removal of vendor lock-in. By having snapshots at the platform level you can replicate data between servers in entirely different cloud infrastructures. For example, you can move data from EC2 to ElasticHosts and back again. Plus you can move snapshots in and out of the cloud entirely, allowing you to build hybrid clouds without expensive, complex virtualisation in your own data centre. In total, this reduces your dependence on any one provider, which reduces your risk of downtime.

Slide 5

Relying on infrastructure for your snapshots brings some other problems too. When you take a snapshot with something like EBS, because the infrastructure can’t communicate “up” to the platform, it has no way of telling the filesystem that the snapshot is about to happen. If the filesystem is mid-way through a write when the snapshot takes place, you’ll end up with a corrupt snapshot.

One solution is to use a “pausable” filesystem, such as XFS, so you can flush it to disk and block the flow of writes during a snapshot. But because you require interaction between the two different layers, the process of pausing the filesystem and taking the snapshot can take a long time, which has been known to crash MySQL.

ZFS allows the unification of these layers. Some Linux kernel hackers have described this as a “rampant layering violation”, but I prefer to think of it as an elegant refactoring, because in fusing these two layers together ZFS becomes faster and smarter, guaranteeing O(1), consistent filesystem snapshots.

Slide 6

Comparison: filesystems with snapshots

XFS on EBS gives you vendor lock-in and so do any other infrastructure-based solutions. You also can’t use it to do live migration of snapshots from one server to another, called send/recv replication.

Btrfs is the Linux answer to the next-gen filesystem but it’s immature and not yet production ready.

Veritas does snapshots, but while it’s mature and stable, it’s very expensive.

This leaves ZFS, which is mature, stable and fast, and which allows you to send incremental changes between snapshots from one server to another. The only thing holding it back from mass adoption is the lack of a performant Linux kernel port. But ZFS for Linux is coming in December. I’ve tested the beta, and it’s promising.

Here’s an example of how to do an incremental send and receive of a snapshot with ZFS to keep a slave up-to-date with the filesystem on a master.

Slide 7

Worked example of incremental ZFS replication

We create a zfs filesystem called “bucket1”. We put some data into that filesystem and then we snapshot it.

Then we send the first snapshot in full over to the slave which receives it and saves it to disk.

Then we change some bytes in the data on the master, snapshot the filesystem again, and send an incremental diff over to the slave.

This means that only the blocks that have changed get sent from one machine to another, so it’s very efficient.
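
In ZFS itself these steps are done with the zfs snapshot, zfs send (with -i for an incremental stream) and zfs receive commands. The idea behind the incremental step can be sketched with a toy model in Python. Everything here is an illustrative assumption: snapshots are modelled as plain dictionaries of block number to contents, and the function names incremental_stream and receive are made up, not part of any ZFS API.

```python
def incremental_stream(old_snap, new_snap):
    """Describe only what changed between two snapshots (toy model)."""
    changed = {
        block: data
        for block, data in new_snap.items()
        if block not in old_snap or old_snap[block] != data
    }
    deleted = [block for block in old_snap if block not in new_snap]
    return {"changed": changed, "deleted": deleted}


def receive(base, stream):
    """Apply a stream on top of the snapshot the slave already holds."""
    result = dict(base)
    result.update(stream["changed"])
    for block in stream["deleted"]:
        result.pop(block, None)
    return result


# First snapshot on the master, sent in full (an increment from "nothing").
master_1 = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
slave = receive({}, incremental_stream({}, master_1))

# Change some bytes on the master, snapshot again, send only the difference.
master_2 = dict(master_1)
master_2[1] = b"BBBB"
stream = incremental_stream(master_1, master_2)
slave = receive(slave, stream)
```

Here the second stream carries a single block out of three, yet the slave ends up identical to the master’s second snapshot.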

Slide 8

We’re doing some cool stuff with this incremental zfs replication. We’ve built an asynchronously replicated cluster filesystem on top of it and we’re using that to build web clusters which have these nice properties. You can kill any machine safely in the knowledge that a 10-second old backup of all its data will be stored safely across the cluster. By mounting many snapshots read-only, you can get horizontal scalability for read-heavy loads. And by picking the latest snapshot and stashing any others after a netsplit, you gain partition tolerance.
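
The “pick the latest snapshot and stash the others” rule can be sketched in a few lines of Python. This is a simplified illustration rather than our cluster code: the function name resolve_netsplit, the partition names and the timestamps are hypothetical, and it assumes each side of the split can be summarised by the timestamp of its newest snapshot.

```python
def resolve_netsplit(heads):
    """Pick a winner after a netsplit.

    heads maps a partition name to the timestamp of its newest snapshot.
    Returns the winning partition and the snapshots to stash (kept, not deleted).
    """
    # The newest snapshot wins; ties are broken by name so the choice is deterministic.
    winner = max(heads, key=lambda name: (heads[name], name))
    stashed = {name: ts for name, ts in heads.items() if name != winner}
    return winner, stashed


winner, stashed = resolve_netsplit({"dc-london": 1287650010, "dc-paris": 1287650040})
```

Stashing rather than discarding the losing snapshots means nothing written on the other side of the split is thrown away.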

Furthermore, the incremental snapshots trick lets us automatically bring offline machines up to date from any timestamp, efficiently sending only the data which has changed between the time the machine went offline to when it came back.
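
As a rough sketch of how that catch-up could be planned, the Python below picks the base and target snapshots for an incremental send. It is an assumption-laden toy: plan_catch_up is a made-up name, snapshots are identified only by integer timestamps, and it assumes the returning machine holds every snapshot taken up to the moment it went offline.

```python
from bisect import bisect_right


def plan_catch_up(master_timestamps, offline_since):
    """Choose (base, target) snapshots for bringing a machine up to date.

    master_timestamps must be sorted ascending. A base of None means the
    machine has nothing usable and needs the target snapshot sent in full.
    """
    if not master_timestamps:
        raise ValueError("master has no snapshots")
    target = master_timestamps[-1]
    # Newest snapshot taken at or before the moment the machine went offline.
    index = bisect_right(master_timestamps, offline_since)
    base = master_timestamps[index - 1] if index > 0 else None
    return base, target


plan = plan_catch_up([100, 110, 120, 130], offline_since=115)
fresh = plan_catch_up([100, 110], offline_since=50)
```

Only the changes between the base and the target need to cross the network, however long the machine was away.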

In conclusion, ZFS lets you do all this, it already runs on FreeBSD (our primary platform) and it’s coming to Linux in December, so check it out.

Slide 9

Thanks!

Follow us on Twitter: @hybridcluster / @lmarsden

Native ZFS on Linux, GA in December 2010: zfs.kqinfotech.com

Parallel spin-up

Friday, October 8th, 2010

Here’s a little taste of where we’re going with automatically spinning up web clusters on our shiny new cloud infrastructure:

It will be yours to play with soon :-)

No longer a critical event

Friday, August 6th, 2010

Does this scare the hell out of you?

It used to scare me too, until I started using Hybrid Web Cluster. Now this isn’t a critical event any more. I can be developing on three virtual machines, one of them crashes, and I don’t even notice! All my websites and databases just carry on running as normal.

Find out more…