Fault tolerance in Manta

Since launching Manta last week, we’ve seen a lot of excitement. Thoughtful readers quickly got to burning questions about the system’s fault tolerance: what happens when backend Manta components go down? In this post, I’ll discuss fault tolerance in Manta in more detail. If anything seems left out, please ask: it’s either an oversight or just seemed too arcane to be interesting.

This is an engineering internals post for people interested in learning how the system is built. If you just want to understand the availability and durability of objects stored in Manta, the simpler answer is that the system is highly available (i.e., service survives system and availability zone outages) and durable (i.e., data survives multiple system and component failures).

First principles

First, Manta is strongly consistent. If you PUT an object, a subsequent GET for the same path will immediately return the object you just PUT. In terms of CAP, that means Manta sacrifices availability for consistency in the face of a complete network partition. We feel this is the right choice for an object store: while other object stores remain available in a technical sense, they can emit 404s or stale data both when partitioned and in steady-state. The client code to deal with eventual consistency is at least as complex as dealing with unavailability, except that there’s no way to distinguish client error from server inconsistency — both show up as 404 or stale data. We feel transparency about the state of the system is more valuable here. If you get a 404, that’s because the object’s not there. If the system’s partitioned and we cannot be sure of the current state, you’ll get a 503, indicating clearly that the service is not available and you should probably retry your request later. Moreover, if desired, it’s possible to build an eventually consistent caching layer on top of Manta’s semantics for increased availability.
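
To make the 404-versus-503 distinction concrete, here’s a minimal client-side sketch in Python (using the requests library; the URL is a placeholder and Manta’s HTTP-signature authentication is omitted). It treats a 404 as an answer and a 503 as a reason to back off and retry:

    import time
    import requests

    def get_with_retry(url, headers=None, attempts=5):
        """GET an object, retrying only on 503 (service unavailable).

        A 404 is returned to the caller immediately: in Manta's model it
        means the object is not there, not that the view might be stale.
        """
        for i in range(attempts):
            resp = requests.get(url, headers=headers)
            if resp.status_code != 503:
                return resp
            if i + 1 < attempts:
                time.sleep(2 ** i)   # back off before retrying
        return resp

    # Hypothetical usage (real requests need Manta's signed auth headers):
    # resp = get_with_retry("https://us-east.manta.joyent.com/$MANTA_USER/stor/foo")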

While CAP tells us that the system cannot be both available and consistent in the face of an extreme network event, that doesn’t mean the system fails in the face of minor failures. Manta is currently deployed in three availability zones in a single region (us-east), and it’s designed to survive any single inter-AZ partition or a complete AZ loss. As expected, availability zones in the region share no physical components (including power) except for being physically close to one another and connected by a high-bandwidth, low-latency interconnect.

Like most complex distributed systems, Manta is itself made up of several different internal services. The only public-facing services are the loadbalancers, which proxy directly to the API services. The API services are clients of other internal services, many of which make use of still other internal services, and so on.

Stateless services

Most of these services are easy to reason about because they’re stateless. These include the frontend loadbalancers, the API servers, authentication caches, job supervisors, and several others.

For each stateless service, we deploy multiple instances in each AZ, each instance on a different physical server. Incoming requests are distributed across the healthy instances using internal DNS with aggressive TTLs. The most common failure here is a software crash. SMF restarts the service, which picks up where it left off.

For the internal services, more significant failures (like a local network partition, power loss, or kernel panic) result in the DNS record expiring and the instance being taken out of service.
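
As a rough illustration of that DNS-based approach (the service name below is hypothetical, and this is a sketch of the idea rather than Manta’s actual internals), a client just resolves the service name and picks one of the instances currently in the answer set; failed instances drop out as their records expire:

    import random
    import socket

    def pick_instance(service_name, port):
        """Resolve an internal service name and pick one instance.

        With aggressive TTLs, instances that have failed (or been
        partitioned away) fall out of the answer set quickly, so new
        requests stop being routed to them.
        """
        answers = socket.getaddrinfo(service_name, port, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in answers})
        return random.choice(addresses)

    # e.g. pick_instance("authcache.manta.example.com", 8080)   # hypothetical name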

Stateful services

Statelessness just pushes the problem around: there must be some service somewhere that ultimately stores the state of the system. In Manta, that state lives in two tiers: the storage tier and the metadata tier.

The storage tier

The availability and durability characteristics of an object are determined in part by its “durability level”. From an API perspective, this indicates the number of independent copies of the object that you want Manta to store. You pay for each copy, and the default value is 2. Copies are always stored on separate physical servers, and the first 3 copies are stored in separate availability zones.
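
For reference, here’s roughly what requesting a durability level looks like from a client, sketched in Python with the requests library (I’m assuming a durability-level request header here; the URL is a placeholder and authentication is omitted):

    import requests

    def put_object(url, data, copies=2, headers=None):
        """PUT an object, asking Manta to store `copies` independent copies.

        Assumes a durability-level request header; the default is 2.
        """
        hdrs = dict(headers or {})
        hdrs["durability-level"] = str(copies)
        hdrs["content-type"] = "application/octet-stream"
        return requests.put(url, data=data, headers=hdrs)

    # put_object("https://us-east.manta.joyent.com/$MANTA_USER/stor/foo", b"...", copies=3)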

Durability of a single copy: Objects in the storage tier are stored on raidz2 storage pools with two hot spares. The machine has to sustain at least three concurrent disk failures before losing any data, and could survive as many as eight. The use of hot spares ensures that the system can begin resilvering data from failed disks onto healthy ones immediately, in order to reduce the likelihood of a second or third concurrent failure. Keith discusses our hardware choices in more depth on his blog.
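
To see why several concurrent failures are required, here’s a back-of-the-envelope sketch (the vdev layout is hypothetical; the point is only that data is lost when a single raidz2 vdev loses more than two disks before resilvering onto a hot spare completes):

    def pool_loses_data(failed_disks_per_vdev):
        """A raidz2 vdev tolerates two failed disks; a third concurrent
        failure in the *same* vdev (before a hot spare finishes
        resilvering) loses data.  Failures spread across vdevs are
        survivable."""
        return any(failed > 2 for failed in failed_disks_per_vdev)

    # Hypothetical pool with three raidz2 vdevs:
    print(pool_loses_data([2, 2, 2]))   # False: six failures, spread out
    print(pool_loses_data([3, 0, 0]))   # True: three failures in one vdev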

Object durability: Because of the above, it’s very unlikely for even a single copy of an object to be lost as a result of storage node failure. If the durability level is greater than 1 (recall that it’s 2 by default), all copies would have to be lost for the object’s data to be lost.

Object availability: When making a request for an object, Manta selects one of the copies and tries to fetch the object from the corresponding storage node. If that node is unavailable, Manta tries another node that has a copy of the object, and it continues doing this until it either finds an available copy or runs out of copies to try. As a result, the object is available as long as the frontend can reach any storage node hosting a copy of the object. As described above, any storage node failure (transient or otherwise) or AZ loss would not affect object availability for objects with at least two copies, though such failures may increase latency as Manta tries to find available copies. Similarly, in the event of an AZ partition, the partitioned AZ’s loadbalancers would be removed from DNS, and the other AZs would be able to service requests for all objects with at least two copies.
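
In pseudocode terms, the read path behaves roughly like this (a sketch of the behavior described above, not Manta’s actual implementation):

    import requests

    def fetch_object(storage_nodes, object_path):
        """Try each storage node holding a copy until one responds.

        The object stays available as long as at least one copy is
        reachable; unavailable nodes only add latency.
        """
        last_error = None
        for node in storage_nodes:                  # nodes holding copies
            try:
                resp = requests.get("http://%s%s" % (node, object_path), timeout=5)
                if resp.status_code == 200:
                    return resp.content
            except requests.RequestException as exc:
                last_error = exc                    # try the next copy
        raise RuntimeError("no copy reachable: %s" % last_error)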

Since it’s much more likely for a single storage node to be temporarily unavailable than for data to be lost, it may be more useful to think of “durability level” as “availability level”. (This property also impacts the amount of concurrency you can get for an object — see Compute Jobs below.)

The metadata tier

The metadata tier records the physical storage nodes where each object is stored. The object namespace is partitioned into several completely independent shards, each of which is designed to survive the usual failure modes (individual component failure, AZ loss, and single-AZ partition).

Each shard is backed by a postgres database using postgres-based replication from the master to both a synchronous slave and an asynchronous slave. Each database instance within the shard (master, synchronous slave, and asynchronous slave) is located in a different AZ, and we use Zookeeper for election of the master.

The shard requires only one available peer for read availability, and requires both the master and the synchronous slave for write availability. Individual failures (or partitions) of the master or synchronous slave can result in transient outages while the system elects a new leader.
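
A compact way to state the shard’s availability rule (this is a sketch of the policy, not the actual election code; after a failure, the “master” and “sync” roles are re-filled from the remaining peers by the Zookeeper-mediated election):

    def shard_can_read(roles_up):
        """Reads need any one peer: master, sync slave, or async slave."""
        return len(roles_up) >= 1

    def shard_can_write(roles_up):
        """Writes need a master plus a synchronous slave, so every
        acknowledged write exists in two AZs before we return success."""
        return "master" in roles_up and "sync" in roles_up

    print(shard_can_write({"master", "sync", "async"}))   # True
    print(shard_can_write({"master", "async"}))           # False until a new sync is established
    print(shard_can_read({"async"}))                      # True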

The mechanics of this component are complex and interesting (as in, we learned a lot of lessons in building this). Look for a subsequent blog post from the team about the metadata tier.

Compute Jobs

Manta’s compute engine is built on top of the same metadata and storage tiers. Like the other supporting services, the compute services themselves are effectively stateless; the real state lives in the metadata tier. The engine is subject to the availability characteristics of the metadata tier, but it retries internal operations as needed to survive the transient outages described above.

If a given storage node becomes unavailable when there are tasks running on it, those tasks will be retried on a node storing another copy of the object. (That’s one source of the “retries” counter you see in “mjob get”.) Manta makes sure that the results of only one of these instances of the task are actually used.
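
A sketch of that retry behavior (the execute callback and node names are hypothetical; Manta’s scheduler is more involved, but the shape is the same):

    def run_task(task, copy_nodes, execute):
        """Run a task on a node holding a copy of the input object,
        falling back to other copies if a node becomes unavailable.
        Only the first successful result is used; each extra attempt is
        what shows up in the job's "retries" counter."""
        retries = 0
        for node in copy_nodes:
            try:
                return execute(node, task), retries
            except ConnectionError:
                retries += 1        # node went away; try another copy
        raise RuntimeError("no storage node holding a copy is available")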

The durability level of an object affects not only its availability for compute (for the same reasons as it affects availability over HTTP as described above), but also the amount of concurrency you can get on an object. That’s because Manta schedules compute tasks to operate on a random copy of an object. All things being equal, if you have two copies of an object instead of one, you can have twice as many tasks operating on the object concurrently (on twice as many physical systems).
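
As a rough sketch of why extra copies buy concurrency (the round-robin here is a simplification; the real scheduler also has to account for locality and load):

    import itertools

    def assign_tasks(tasks, copy_nodes):
        """Spread tasks across the storage nodes holding copies of the
        object.  With two copies on two physical servers, roughly twice
        as many tasks can run against the object at once."""
        nodes = itertools.cycle(copy_nodes)
        return [(task, next(nodes)) for task in tasks]

    # Hypothetical: four tasks over an object stored on two nodes.
    print(assign_tasks(["t0", "t1", "t2", "t3"], ["node-a", "node-b"]))
    # [('t0', 'node-a'), ('t1', 'node-b'), ('t2', 'node-a'), ('t3', 'node-b')]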

Final notes

You’ll notice that I’ve mainly been talking about transient failures, either of software, hardware, or the network. The only non-transient failure in the system is loss of a ZFS storage pool; any other failure mode is recoverable by replacing the affected components. Objects with durability of at least two would be recoverable from the other copies in the event of pool loss, while objects with durability of one that were stored on that pool would be lost. (But remember: storage pool loss as a result of any normal hardware failure, even multiple failures, is unlikely.)

I also didn’t mention anything about replication in the storage tier. That’s because there is none. When you PUT a new object, we dynamically select the storage nodes that will store that object and then funnel the incoming data stream to both nodes synchronously (or more, if durability level is greater than 2). If we lose a ZFS storage pool, we would have to replicate objects to other pools, but that’s not something that’s triggered automatically in response to failure since it’s not appropriate for most failures.
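
A simplified sketch of that write path (in reality the copies are streamed over HTTP to the chosen storage nodes; here the “nodes” are just file-like sinks and error handling is pared down):

    def funnel_stream(source, sinks, chunk_size=64 * 1024):
        """Copy an incoming data stream to every sink synchronously.

        The PUT succeeds only if every sink accepts every chunk, which
        is why a 200 means all copies are durably stored.
        """
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            for sink in sinks:       # every copy gets the same stream
                sink.write(chunk)
        for sink in sinks:
            sink.flush()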

Whether in the physical world or the cloud, infrastructure services have to be highly available. We’re very up front about how Manta works, the design tradeoffs we made, and how it’s designed to survive software failure, hardware component failure, physical server failure, AZ loss, and network partitions. With a three-AZ model, if all three AZs become partitioned from one another, the system chooses strong consistency over availability, which we believe provides a significantly more useful programming model than the alternatives.

For more detail on how we think about building distributed systems, see Mark Cavage’s ACM Queue article “There’s Just No Getting Around It: You’re Building a Distributed System.”

Posted on July 3, 2013 at 8:04 am by dap

11 Responses

  1. Written by Tushar Jain
    on July 3, 2013 at 11:49 am

    Nice post. A couple questions:

    1) Partial failures during a PUT: Let’s say we’re doing a PUT with 2 copies and we get a partial failure. That is, the write to one of the storage nodes succeeds, and the second write fails. I imagine this results in 500 to the client.
    – What happens if the client does a GET for this object now? Will he get a 404 or will you end up returning the object since you have one copy of it?
    – In general, if my durability level is N, do all N writes have to succeed for the PUT to return a 200?

    2) You mentioned that in the event that you lose a storage node, you don’t automatically re-replicate the data that was on that node. Is that correct? If so, how do you get those objects back to their original durability level? The reason to try to maintain objects at their desired durability level is to keep the probability of data loss low and hopefully constant over time. If you don’t re-replicate in the face of failures, doesn’t the probability of data loss for an object increase as time goes on, as long as the probability of storage node loss is non-zero?
    Or am I thinking about this incorrectly?

    Thanks!
    Tushar

  2. Written by dap
    on July 3, 2013 at 12:03 pm

    Good questions.

    (1) The object namespace is updated atomically at the end of the request, after the backend writes finish. So for a failure during a PUT, the namespace is never updated, and a subsequent request will return 404. As for the second part: I’m not sure if we wait for all writes or not — I’ll circle back with Mark.

    (2) We don’t *automatically* replicate the data in the event of a storage node loss (e.g., panic or power off) because it’s much more likely to be a transient loss with no risk to data durability. We’re trying to avoid a cascading failure like the one discussed in http://www.joyent.com/blog/on-cascading-failures-and-amazons-elastic-block-store/, where automatic mechanisms mistake a transient network failure for data loss and make the network problem significantly worse by attempting to repair the data loss. In the event of actual storage pool loss, a human would determine that, and we would invoke a manual operation to replicate objects up to the expected durability level.

    Thanks!

  3. Written by Mark Cavage
    on July 3, 2013 at 1:32 pm

    Hello Tushar Jain,

    (1) We return 200 only upon both copies being durably on disk, and the metadata being durably on disk about where your object is. That is, once you get a 200, you’ve got $DurabilityLevel guaranteed, and the ability to read it. So yes, all N writes have to succeed for you to get a 200.

    As for the read-after-write-on-error case: if you got a 500 on a PUT, we now have some set of “orphan data” on storage nodes, but we won’t have stashed a metadata record for it, so a subsequent GET will return 404 (or the previous version of the object if it was an overwrite). We asynchronously clean that up, but it’s not your problem.

    (2) As Dave said, we don’t automatically re-replicate, but we will do so if we can’t recover the host by operator means (i.e., it’s far less invasive to replace a blown PDU, for example, than to rewrite all the data on a dense storage box). Your data will be returned to the advertised durability level “as fast as we can”. The only exception to this is if you set durability level to one and the storage pool is actually toast, in which case you would have data loss.

  4. Written by Tushar Jain
    on July 3, 2013 at 1:51 pm

    Great answers guys. Thanks.

    One more follow up. If my understanding is correct, then it seems that currently, if I set durability level to 3 and a single AZ is unavailable, then all my PUTs will 500.

    I’m reaching this conclusion based on the following points you guys mentioned:
    * All N writes have to succeed for me to get a 200
    * Manta is currently in 3 AZs and the first 3 copies are stored in separate availability zones.

    So, am I correct in concluding that if I set my durability level to 3, my PUTs cannot withstand a single AZ failure or inter-az partition? GETs should still succeed since you’ll be able to fetch the data from one of the other AZs.

    Thanks!

    • Written by Mark Cavage
      on July 4, 2013 at 8:20 am

      Hello Tushar Jain,

      Yes your understanding is correct.

      We have debated whether or not to just set a “floor of 2” for number of AZs or min(num_azs, num_copies) — right now it’s the latter, as we believe that’s the “intended semantics” of customers when setting number of copies > 2, and since total AZ failure is (very) uncommon.

      If this is a frequent request or problem, I think we’d add a small feature to explicitly let you control the minimum number of AZs, instead of us inferring.

      Lastly, there’s no mention thus far of the durability guarantee (more on that in a future post), but note that 2 copies in Manta is both price-competitive with other object stores and offers an astronomically high durability level. That doesn’t change your questions or the answers, but it seems worth pointing out that most customers don’t need/want > 2 copies in Manta.

      m

  5. Written by Andrew Leonard
    on July 4, 2013 at 7:17 am

    I understand the decision not to automatically re-replicate in the event of a failure – that seems reasonable. I am curious about how you plan to rebalance data for performance reasons – will that be done manually or automatically? The dilemma I see is that manual rebalancing won’t scale, while automatic balancing would likely introduce new complex failure modes.

    (I’m assuming that across your fleet of storage nodes, some will contain hotter data than others, perhaps to the point that some are continually congested with compute jobs, while others sit idle. My experience is that old data is generally colder than new data, which could result in low average compute utilization – without rebalancing, newly deployed systems will be the busiest, while older systems grow increasingly idle.)

    Perhaps rebalancing could be delegated from Joyent operations to the customer – maybe a metric could be exposed to the user when an object could benefit from being rebalanced, and the customer could request it themselves if desired – or a property could be added within job configuration files along the lines of “rebalance my data if it will get my job done more quickly.”

    -Andy

    • Written by Mark Cavage
      on July 4, 2013 at 8:31 am

      Hi Andy,

      Yes, good question. Right now we don’t rebalance data for any reason besides failure.

      What we’ve talked about for this is actually giving customers some sample jobs (and perhaps an extra magical hook) such that you can run a Manta compute job on your data to do any of rebalance/(increase|decrease) durability/reconstitute etc, which is in line with your closing thought on Joyent delegating to the customer.

      Combined with that, we’ve also talked about letting you specify a compute profile when you write data (i.e., you could say you wanted to write to a heavy CPU skew box, or SSD, or …); pricing incentives would dictate to users how they want to write and store.

      The honest answer here is there’s a lot we *can* do, but we’re sort of waiting and seeing what we should/will do from feedback like this question and what we see from actual usage data.

      m

  6. Written by Alfonso
    on July 17, 2013 at 12:53 am

    How do you deal with 2 updates on different replicas of the same object? Imagine a network partition. What happens if the object gets 2 different updates, and then the network partition is restored? Do you have any distributed locking mechanism, so we can guarantee that the “transaction” over an object is ACID?

    • Written by dap
      on July 17, 2013 at 6:59 am

      Alfonso,

      Remember that Manta only supports replacing an existing object with a different one, not updating some part of the original object. Replacing an object is just a PUT of a new object with the same name.

      In the case of a partition, the partitioned AZ will generally be completely removed from service: its frontend loadbalancers will be removed from DNS, and internal services in the other AZs will stop trying to use the services they cannot reach. As a result, incoming requests will hit the non-partitioned AZs and succeed because the metadata tier maintains write availability as long as a majority of instances for each shard are available. (As I alluded to, the way it actually does this is complex. We’re using Zookeeper to figure out what “majority” means, and we wrote software to update the read-write state of the database in response to changes in the topology.) In the unlikely event that a request does hit the partitioned AZ, the request would get a 503 because the metadata tier refuses updates when there’s only one instance of a shard available.

      So there’s nothing to resolve: objects are immutable, and changes to the metadata tier are ACID. Does that answer it?

      Thanks,
      Dave

  7. Written by Alfonso
    on July 18, 2013 at 1:25 am

    Distributed systems are always complex due to the different points of failure ;). Glad to hear you use Zookeeper.

    So if, during a PUT on an object, a network partition happens where the copy is first overwritten (so that AZ is disconnected from the rest), the operation will fail and I would have to try again to update one of the other replicas. When the network partition is fixed, then Zookeeper plays its role, I presume.

    Is the “PUT” operation atomic across all the instances? That is, if one instance is updated, can I not do another “PUT” until all the replicas are replaced? Or is there an ordering of the operations, so that if a new PUT comes in before all the replicas are updated, the later PUT (being a new version of the object) overrides the running one?

    I know that this will not be the “normal” use case for manta.

    Thanks Dave.

    • Written by dap
      on July 18, 2013 at 8:59 am

      Alfonso,

      Yes, if there’s a partition that affects the storage node for an ongoing PUT operation, that operation will completely fail and you’ll have to retry it from the client. The metadata update (i.e., linking /$MANTA_USER/stor/foo to the newly stored object) is only done once the data is stored in the backend, so that step will never happen in this case. It will be like the request never happened.

      The PUT operation is atomic across all instances in a single shard. (We don’t support server-side operations that would modify multiple shards.) You can do as many operations as you want concurrently, with the usual semantics: exactly one of them will happen “last” and become the final state. You can use HTTP conditional PUT (using etags) to avoid multiple clients stomping on each other.
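
      For example, a conditional overwrite looks roughly like this (a sketch in Python using the requests library; the URL and auth are placeholders, and the etag would come from a prior GET or PUT response). If the object changed underneath you, the write is rejected, typically with 412 Precondition Failed, rather than silently stomping on it:

          import requests

          def conditional_put(url, data, etag, headers=None):
              """Overwrite the object only if it still carries `etag`."""
              hdrs = dict(headers or {})
              hdrs["if-match"] = etag
              return requests.put(url, data=data, headers=hdrs)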

      Hope that helps!

      – Dave
