On the Iceberg’s Waterline

Dec 29, 2020

Before we start talking about extensible API design, we should discuss the hygienic minimum. A huge number of problems would never have happened if API vendors had paid more attention to marking their area of responsibility.

1. Provide a minimal amount of functionality

At any moment in its lifetime, your API is like an iceberg: it comprises an observable (e.g. documented) part and a hidden, undocumented one. If the API is designed properly, these two parts correspond to each other just like the above-water and underwater parts of a real iceberg do, i.e. one to ten. Why so? For two obvious reasons.

Computers exist to make complicated things easy, not vice versa. The code developers write on top of your API must describe a complicated problem’s solution in neat and straightforward sentences. If developers have to write more code than the API itself comprises, then something is rotten here. Probably, this API simply isn’t needed at all.
Revoking API functionality causes losses. If you’ve promised to provide some functionality, you will have to do so ‘forever’ (until this API version’s maintenance period is over). Declaring some functionality deprecated is a tricky thing, potentially alienating your customers.

Rule #1 is the simplest: if some functionality might be withheld, then never expose it. It might be reformulated like this: every entity, every field, every public API method is a product decision. There must be solid product reasons why some functionality is exposed.

2. Avoid gray zones and ambiguities

Your obligations to maintain some functionality must be stated as clearly as possible, especially in environments and platforms that offer no native capability to restrict access to undocumented functionality. Unfortunately, developers tend to consider private features they have found to be eligible for use, presuming that the API vendor will keep them intact. The policy on such ‘findings’ must be articulated explicitly. At the very least, in case of such unauthorized usage of undocumented functionality, you can refer to the docs and be within your rights in the eyes of the community.

However, API developers often legitimize such gray zones themselves, for example, by:

returning undocumented fields in endpoints’ responses;
using private functionality in code examples: in the docs, in responses to support messages, in conference talks, etc.

One cannot make a partial commitment. Either you guarantee this code will always work, or you do not give the slightest hint that such functionality exists.
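One way to keep undocumented fields out of endpoint responses is to build every response from an explicit list of documented fields. The following is a minimal sketch under that assumption; the field names and the toPublicOrder helper are hypothetical and only illustrate the idea.

```javascript
// Hypothetical list of the documented fields of the order entity.
const DOCUMENTED_ORDER_FIELDS = ['id', 'status', 'created_at'];

// Copies only documented fields from the internal representation,
// so that internal details never leak into the public response.
function toPublicOrder(internalOrder) {
    const response = {};
    for (const field of DOCUMENTED_ORDER_FIELDS) {
        if (Object.prototype.hasOwnProperty.call(internalOrder, field)) {
            response[field] = internalOrder[field];
        }
    }
    return response;
}

// Hypothetical internal record with fields the docs never mention.
const publicOrder = toPublicOrder({
    id: '123',
    status: 'created',
    created_at: '2020-12-29T00:35:00+03:00',
    internal_shard: 7,
    debug_trace: 'abc'
});
```

With such a filter in place, exposing a new field becomes an explicit decision (adding it to the list and to the docs) rather than a side effect of a change in internal storage.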

3. Codify implicit agreements

The third principle is much less obvious. Pay close attention to the code you’re suggesting developers write: are there any conventions that you consider evident but never wrote down?

Example #1. Let’s take a look at this order processing SDK example:

// Creates an order
let order = api.createOrder();
// Returns the order status
let status = api.getStatus(order.id);

Let’s imagine that you’re struggling with scaling your service, and at some point you moved to asynchronous replication of the database. As a result, querying for the order status right after creating the order might return 404 if an asynchronous replica hasn’t got the update yet. In fact, we thus abandon a strict consistency policy in favor of an eventual one.

What would be the result? The code above will stop working. A developer creates an order, tries to get its status, and gets an error. It’s very hard to predict what approach developers would implement to tackle this error. Probably, none at all.

You may say something like, ‘But we’ve never promised strict consistency in the first place’, and that is obviously not true. You may say that if, and only if, you have really described the eventual consistency in the createOrder docs, and all your SDK examples look like:

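let order = api.createOrder();
let status;
while (true) {
    try {
        status = api.getStatus(order.id);
        // Success: stop polling
        break;
    } catch (e) {
        // Retry only on 404 (the replica is lagging);
        // give up on any other error or on timeout
        if (e.httpStatusCode !== 404 || timeoutExceeded()) {
            break;
        }
    }
}
if (status) {
    …
}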

We presume we may skip the explanation of why such code must never be written under any circumstances. If you’re really providing a non-strictly consistent API, then either the createOrder operation must be asynchronous and return the result only when all replicas are synchronized, or the retry policy must be hidden inside the getStatus operation implementation.
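The second option, hiding the retry policy inside getStatus, might be sketched as follows. This is only an illustration under stated assumptions: makeApi, fetchStatus, retryDelayMs and timeoutMs are hypothetical names, and getStatus is assumed to return a promise, since waiting between attempts cannot be done in a blocking manner.

```javascript
// fetchStatus is a hypothetical low-level call that may hit a stale replica
// and reject with an error whose httpStatusCode is 404.
function makeApi(fetchStatus, { retryDelayMs = 100, timeoutMs = 2000 } = {}) {
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
    return {
        // Resolves with the order status. While the replica still answers 404,
        // the call is retried internally until the timeout is exceeded;
        // any other error is rethrown immediately.
        async getStatus(orderId) {
            const deadline = Date.now() + timeoutMs;
            while (true) {
                try {
                    return await fetchStatus(orderId);
                } catch (e) {
                    if (e.httpStatusCode !== 404 || Date.now() >= deadline) {
                        throw e;
                    }
                    await sleep(retryDelayMs);
                }
            }
        }
    };
}
```

The client code then stays as short as in the first example (create the order, await its status), and the consistency model can change without developers having to write polling loops of their own.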

If you failed to describe the eventual consistency in the first place, then you simply can’t make these changes in the API. You will effectively break backwards compatibility, which will lead to huge problems with your customers’ apps, intensified by the fact that the problem can’t be easily reproduced.

Example #2. Take a look at the following code:

let resolve;
let promise = new Promise(
    function (innerResolve) {
        resolve = innerResolve;
    }
);
resolve();

This code presumes that the callback function passed to new Promise will be executed synchronously, and the resolve variable will be initialized before the resolve() function is called. But this convention is based on nothing: there are no clues indicating that the new Promise constructor executes the callback function synchronously.

Of course, the developers of the language standard can afford such tricks; but you as an API developer cannot. You must at least document this behavior and make the signatures point to it; actually, good advice is to avoid such conventions, since they are simply unobvious when reading the code. And of course, under no circumstances should you ever change this behavior to an asynchronous one.
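One possible way to make the signature point to the behavior is to return the resolver explicitly, so that the calling code never depends on when the callback runs. The createDeferred helper below is a hypothetical sketch, not an established API; it still relies on the synchronous callback internally, but that reliance is confined to one documented place.

```javascript
// Hypothetical helper: returns the promise together with its resolver
// and rejecter. The reliance on the synchronous execution of the callback
// is encapsulated here and is invisible to the calling code.
function createDeferred() {
    let resolve;
    let reject;
    const promise = new Promise(function (innerResolve, innerReject) {
        resolve = innerResolve;
        reject = innerReject;
    });
    return { promise, resolve, reject };
}

// The calling code reads unambiguously: the resolver is a return value,
// so it is obviously available by the time it is called.
const deferred = createDeferred();
deferred.resolve('done');
```

Newer JavaScript runtimes provide Promise.withResolvers(), which returns an object of the same shape.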

Example #3. Imagine you’re providing an animations API, which includes two independent functions:

// Animates object's width,
// beginning with first value, ending with second
// in a specified time period
object.animateWidth('100px', '500px', '1s');
// Observes object's width changes
object.observe('widthchange', observerFunction);

A question arises: how frequently and at what time fractions will the observerFunction be called? Let's assume in the first SDK version we emulated step-by-step animation at 10 frames per second: then observerFunction will be called 10 times, getting values '140px', '180px', etc., up to '500px'. But then in a new API version we moved to implementing both functions atop the system's native functionality, and so you simply don't know when and how frequently the observerFunction will be called.

Just changing the call frequency might make some code dysfunctional, for example, if the callback function makes some complex calculations and no throttling is implemented, since the developer just relied on your SDK's built-in throttling. And if observerFunction ceases to be called when exactly '500px' is reached because of some system algorithm specifics, some code will be broken without any doubt.

In this example you should document the concrete contract (how often the observer function is called) and stick to it even if the underlying technology is changed.
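A documented contract of this kind can be expressed as a small pure function that the SDK follows regardless of the underlying animation engine. The sketch below assumes the contract is "10 calls per second, the final value is always delivered exactly"; widthSteps and its parameters are hypothetical names.

```javascript
// Hypothetical helper: lists the width values (in pixels) that the observer
// is promised to receive for an animation from `from` to `to`.
function widthSteps(from, to, durationMs, callsPerSecond = 10) {
    const steps = Math.max(1, Math.round((durationMs / 1000) * callsPerSecond));
    const values = [];
    for (let i = 1; i < steps; i++) {
        values.push(from + ((to - from) * i) / steps);
    }
    // The contract guarantees that the final value is reported exactly,
    // without accumulating rounding errors.
    values.push(to);
    return values;
}

// Values for the 100px to 500px animation over one second.
const promisedWidths = widthSteps(100, 500, 1000);
```

For the animation from the example, promisedWidths contains ten values: 140, 180 and so on up to exactly 500. If the native engine reports changes more often or less often, the SDK would have to resample them to this sequence before calling observerFunction.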

Example #4. Imagine that customer orders are passing through a specific pipeline:

GET /v1/orders/{id}/events/history
→
{
    "event_history": [
        {
            "iso_datetime": "2020-12-29T00:35:00+03:00",
            "new_status": "created"
        },
        {
            "iso_datetime": "2020-12-29T00:35:10+03:00",
            "new_status": "payment_approved"
        },
        {
            "iso_datetime": "2020-12-29T00:35:20+03:00",
            "new_status": "preparing_started"
        },
        {
            "iso_datetime": "2020-12-29T00:35:30+03:00",
            "new_status": "ready"
        }
    ]
}

Suppose at some moment we decided to allow trustworthy clients to get their coffee in advance, before the payment is confirmed. So an order will jump straight to “preparing_started”, or even “ready”, without a “payment_approved” event being emitted. It might appear to you that this modification is backwards compatible, since you never really promised that any specific event order would be maintained, but it is not.

Let’s assume that a developer (probably your company’s business partner) wrote some code executing some valuable business procedure, for example, gathering income and expenses analytics. It’s quite logical to expect that this code operates a state machine, which switches from one state to another depending on whether specific events are received. This analytical code will break if the event order changes. In the best-case scenario, a developer will get some exceptions and have to track down the error’s cause; in the worst case, partners will operate on wrong statistics for an indefinite period of time until they find the mistake.

A proper decision would be, first, to document the event order and allowed states; second, to continue generating the “payment_approved” event before “preparing_started” (since you’re making a decision to prepare that order, you’re in fact approving the payment) and add extended payment information.
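Keeping the documented event order when an order skips a step might look like the sketch below. It assumes that the four statuses from the example form the whole documented sequence; appendStatus and STATUS_ORDER are hypothetical names, and the extended payment information is omitted because its format is not defined here.

```javascript
// Documented order of statuses (an assumption based on the example above).
const STATUS_ORDER = ['created', 'payment_approved', 'preparing_started', 'ready'];

// Hypothetical helper: returns a new history with `newStatus` appended,
// first emitting every documented status that was skipped on the way.
function appendStatus(eventHistory, newStatus, isoDatetime) {
    const last = eventHistory.length
        ? eventHistory[eventHistory.length - 1].new_status
        : null;
    const from = STATUS_ORDER.indexOf(last) + 1;
    const to = STATUS_ORDER.indexOf(newStatus);
    if (to < from) {
        // Unknown, repeated or backward statuses violate the contract.
        throw new Error(`Illegal transition: ${last} -> ${newStatus}`);
    }
    const added = STATUS_ORDER.slice(from, to + 1).map((status) => ({
        iso_datetime: isoDatetime,
        new_status: status
    }));
    return [...eventHistory, ...added];
}

// A trusted client's order goes straight to preparation.
const history = appendStatus(
    [{ iso_datetime: '2020-12-29T00:35:00+03:00', new_status: 'created' }],
    'preparing_started',
    '2020-12-29T00:35:20+03:00'
);
```

Here history contains “created”, “payment_approved” and “preparing_started”, in that order, so a partner's state machine written against the documented sequence keeps working.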

This example leads us to the last rule.

4. Product logic must be backwards compatible as well

The state transition graph, the event order, the possible causes of status changes: such critical things must be documented. Not every piece of business logic can be defined in the form of a programmatic contract; some cannot be represented at all.

Imagine that one day you start to take phone calls. A client may contact the call center to cancel an order. You might even make this functionality technically backwards compatible by introducing new fields to the ‘order’ entity. But an end user might simply know the number and call it even if the app wasn’t suggesting anything like that. A partner’s business analytical code might break likewise, or start displaying the weather on Mars, since it was written knowing nothing about the possibility of canceling orders in circumvention of the partner’s systems.

A technically correct decision would be adding a ‘canceling via call center allowed’ parameter to the order creation function. Accordingly, call center operators may only cancel those orders which were created with this flag set. But that would be a bad decision from a product point of view. The only ‘good’ decision in this situation is to foresee the possibility of external order cancellations in the first place. If you haven’t foreseen it, your only option is the ‘Serenity Notepad’, to be discussed in Section III.

This is the draft for the future ‘HTTP API’ section of the book; the work continues on GitHub. I’d appreciate it if you shared it on Reddit, for I personally can’t do that.


Written by Sergey Konstantinov