Anywhere Apps: designing for global distribution

💡
This post has been sitting in my drafts for several months, and rather than let it sit around longer, I’m going to publish it even though I don’t think it’s 100% “done”. Feedback is welcome.

I’m attempting to give a name - “Anywhere Apps” - to a type of cloud application that is intentionally designed to be globally distributed. The main goal of an Anywhere App is to minimize the latency between the end user and the server handling their requests. Like 12-Factor Apps, Anywhere Apps are defined by a set of best practices and design decisions rather than the use of a specific framework, technology, or language.

Based on the one-sentence description above, you might say “oh but that’s just Cloudflare Workers!” Well, yes and no; we’ll get into that. I feel the need to create a common definition for these kinds of applications because the term “Edge” has become so muddled that nobody really knows what it means anymore. Edge can mean anything from Anycast routing to a Kubernetes cluster in a Chick-fil-A to a moisture sensor in a factory. I also have a personal need to define things clearly to build better mental models for myself, as it helps me make technology decisions and communicate with other people.

So what are Anywhere Apps?

They can be deployed anywhere, on any server, in any cloud

1) Multi-Geo: An Anywhere App is deployed in at least 2 “geos” (geographically distant regions). Within each geo there exists a set of “machines”, each running a set of “services” which are served traffic by a “gateway”.

2) Fabric: There must exist an authenticated, encrypted communication channel (“fabric”) for inter-geo traffic. Ideally this should allow geos to be hosted using the same infrastructure provider or several different providers.

3) Auto Routing: User requests are automatically routed to the nearest geo, and clients remain entirely unaware of the distributed nature of the app, i.e. no region-specific logic embedded in clients.

Users can connect from anywhere to the nearest geo, and it can handle their entire request

1) Identical Geos: Every geo for a given Anywhere App runs an identical set of services, meaning that every kind of request can be handled immediately and user requests never need to be routed between geos.

2) Local Writes: Any datastore used by an Anywhere App must be fully replicated to every geo, and writes must be handled within the geo that receives a request, i.e. writes should never be routed to a leader or primary replica in another geo.

3) Local Calls: Any inter-service communication within an Anywhere App must be contained within a single geo, i.e. a service should never communicate directly with a service running in a different geo.
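To make Local Writes more concrete, here is a minimal sketch of one way a datastore could accept writes in every geo and still converge: a last-writer-wins key-value map with a logical clock. This is only an illustration under my own assumptions, not how libsdk or any particular database works; `GeoReplica`, `write`, `read`, and `merge` are hypothetical names, and last-writer-wins is just one of several possible conflict resolution strategies.

```python
from dataclasses import dataclass, field


@dataclass
class GeoReplica:
    """One geo's copy of a last-writer-wins key-value store (hypothetical)."""
    geo_id: str
    clock: int = 0  # logical clock, bumped on every local write
    data: dict = field(default_factory=dict)  # key -> (clock, geo_id, value)

    def write(self, key, value):
        # Local write: accepted immediately, no round trip to another geo.
        self.clock += 1
        self.data[key] = (self.clock, self.geo_id, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[2] if entry else None

    def merge(self, other):
        # Replication over the fabric: keep whichever entry has the higher
        # (clock, geo_id) pair, so every replica picks the same winner.
        for key, entry in other.data.items():
            mine = self.data.get(key)
            if mine is None or entry[:2] > mine[:2]:
                self.data[key] = entry
        self.clock = max(self.clock, other.clock)


fra = GeoReplica("fra")
iad = GeoReplica("iad")

# Both geos accept conflicting writes without talking to each other.
fra.write("theme", "dark")
iad.write("theme", "light")
iad.write("lang", "en")

# Later, the replicas exchange state in both directions and converge.
fra.merge(iad)
iad.merge(fra)
```

The trade-off is visible here: one of the two conflicting `theme` writes is silently discarded, which is the price of never routing a write to a primary in another geo.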

The system can fail anywhere, and it will continue to work elsewhere

1) Blast Radius: Infrastructure faults or other service downtime within one geo should never affect the ability of other geos to continue serving requests.

2) Partition Recovery: Temporary loss of inter-geo communication due to a failure of the fabric should never block reads or writes, and all data should become eventually consistent upon repair.

3) Failure Routing: If (and only if) one geo fails or becomes overwhelmed with traffic, a subset of requests should be routed to the next nearest geo to balance load.
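The routing behaviour described by Auto Routing and Failure Routing can be sketched roughly as follows. In practice this decision is usually made by Anycast, DNS, or a gateway rather than application code, so treat this purely as an illustration; the `Geo` type, the `pick_geo` function, the coordinates, and the 0.9 load threshold are all assumptions of mine.

```python
import math
from dataclasses import dataclass


@dataclass
class Geo:
    name: str
    lat: float
    lon: float
    healthy: bool = True
    load: float = 0.0  # fraction of capacity in use, 0.0 to 1.0


def distance_km(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance between two points on Earth.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def pick_geo(geos, user_lat, user_lon, max_load=0.9):
    # Prefer the nearest geo; fall through to the next nearest only if the
    # nearer one is down or over the load threshold.
    ranked = sorted(geos, key=lambda g: distance_km(user_lat, user_lon, g.lat, g.lon))
    for geo in ranked:
        if geo.healthy and geo.load < max_load:
            return geo
    raise RuntimeError("no geo available")


geos = [
    Geo("fra", 50.11, 8.68),    # Frankfurt
    Geo("iad", 38.95, -77.45),  # Northern Virginia
    Geo("sin", 1.35, 103.82),   # Singapore
]

paris = (48.86, 2.35)
normal = pick_geo(geos, *paris)    # nearest geo while everything is healthy

geos[0].healthy = False
failover = pick_geo(geos, *paris)  # next nearest geo once the first has failed
```

A real implementation would shed only a subset of requests from an overloaded geo rather than all of them; the hard threshold here is a simplification.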

To summarize, Anywhere Apps can Run Anywhere, Serve Anywhere, and Fail Anywhere.

You may have guessed that this idea has come out of the work I’ve been doing on libsdk, and that’s correct. I’ve identified an architecture that I hope will be demonstrated by libsdk, but some quick searches will show you that there are other solutions out there to accomplish the same goals. Libsdk and an upcoming sibling project called Weave are meant to help implement an Anywhere App in the simplest possible manner using secure and trusted components.

If we work under the assumption that the above 9 properties are a complete definition of an ideal globally distributed application, I think it would make sense to look at an existing platform (Cloudflare Workers, since it gets the most attention in this space) to score its offerings. To arrive at a score, I’ll give 1 point for each property that it satisfies, 0.5 if it’s satisfied with caveats, or 0 if it is not satisfied (so a perfect score would be 9).

Cloudflare Workers

Multi-Geo: 1 (Yep, it sure is! 300+ PoPs)

Fabric: 1 (Yes, entirely transparent to Worker code)

Auto Routing: 1 (Yes, Anycast BGP)

Identical Geos: 1 (Yes, all workers (and apparently every Cloudflare service) are deployed to every PoP)

Local Writes: 0.5 (Writes for D1 and Durable Objects are routed to a “primary”, but Workers KV allows local writes with eventual consistency, according to Cloudflare blog posts and docs)

Local Calls: 1 (Workers can call one another, and since all workers exist in all PoPs, it’s a reasonable assumption that these calls are local)

Blast Radius: 1 (From this blog post, it appears that individual PoPs going offline do not stop others from serving requests)

Partition Recovery: 0.5 (Same as above: D1 and Durable Objects will fail writes when partitioned, while KV will become eventually consistent)

Failure Routing: 0 (This blog post seems to imply that if a PoP goes offline, the traffic it would normally serve does not get re-routed)

Assuming my research is correct, Cloudflare would get a score of 7/9.
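For anyone who wants to score another platform the same way, the tally is just a sum over the 9 properties. A small sketch, using the Cloudflare Workers values from above (the `scores` dictionary is simply my own way of writing them down):

```python
# 1 = satisfied, 0.5 = satisfied with caveats, 0 = not satisfied
scores = {
    "Multi-Geo": 1,
    "Fabric": 1,
    "Auto Routing": 1,
    "Identical Geos": 1,
    "Local Writes": 0.5,
    "Local Calls": 1,
    "Blast Radius": 1,
    "Partition Recovery": 0.5,
    "Failure Routing": 0,
}

total = sum(scores.values())
print(f"{total:g}/{len(scores)}")  # prints "7/9"
```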

Giving scores to other platforms such as Fly.io, Fastly Compute@Edge, Deno Deploy, and AWS Lambda@Edge is an exercise left to the reader.

I’m no longer on social media, so email me (connor [at] cohix.network) if you have any feedback about this idea; I would like to iterate on it over time. Work has kept me very busy lately, but I hope to release new versions of libsdk and Weave soon.