Erlang in Anger
How to Dive into a Code Base
There are three main types of Erlang code bases you’ll encounter in the wild: raw Erlang code bases, OTP applications, and OTP releases [1].
Raw Erlang
If you encounter a raw Erlang code base, you’re pretty much on your own. These rarely follow any specific standard, and you have to dive in the old way to figure out whatever happens in there.
OTP Applications
Figuring out OTP applications [2] is usually rather simple. They usually all share a directory structure that looks like:

    doc/
    ebin/
    src/
    test/
    LICENSE.txt
    README.md
    rebar.config

There might be slight differences, but the general structure will be the same.
Library Applications
Library applications will usually have modules named appname_something, and one module named appname. This will usually be the interface module that's central to the library and contains a quick way into most of the functionality provided.
Regular Applications
The higher a process resides in the tree, the more likely it is to be vital to the survival of the application. You can also estimate how important a process is by the order it is started (all children in the supervision tree are started in order, depth-first). If a process is started later in the supervision tree, it probably depends on processes that were started earlier.
The supervisor restart strategy reflects the relationship between processes under a supervisor:
- one_for_one and simple_one_for_one are used for processes that are not dependent upon each other directly, although their failures will collectively be counted towards total application shutdown.
- rest_for_one will be used to represent processes that depend on each other in a linear manner.
- one_for_all is used for processes that entirely depend on each other.
This structure means it is easiest to navigate OTP applications in a top-down manner by exploring supervision subtrees.
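As a rough illustration of how the restart strategy and the child order encode dependencies, here is a minimal supervisor sketch. The module name demo_sup and the workers demo_db, demo_cache and demo_api are hypothetical and only stand in for "a process that later siblings depend on"; the restart limits are arbitrary values chosen for the example.

```erlang
-module(demo_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% rest_for_one: if demo_db dies, demo_cache and demo_api (started
    %% after it) are restarted as well; if demo_api dies, only demo_api
    %% is restarted.
    SupFlags = #{strategy => rest_for_one,
                 intensity => 3,   % at most 3 restarts...
                 period => 10},    % ...within 10 seconds
    %% Children are started in list order, so dependencies go first.
    Children = [worker(demo_db), worker(demo_cache), worker(demo_api)],
    {ok, {SupFlags, Children}}.

%% Hypothetical workers: each module is assumed to export start_link/0.
worker(Mod) ->
    #{id => Mod, start => {Mod, start_link, []}, type => worker}.
```

When reading an unfamiliar code base, the init/1 callback of each supervisor is usually the quickest place to find this information.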
For each worker process supervised, the behaviour it implements will give a good clue about its purpose:
- a gen_server holds resources and tends to follow client/server patterns (or more generally, request/response patterns).
- a gen_fsm will deal with a sequence of events or inputs and react depending on them, as a Finite State Machine. It will often be used to implement protocols.
- a gen_event will act as an event hub for callbacks, or as a way to deal with notifications of some sort.
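On a running node, the top-down exploration and the behaviour lookup described above can be partly automated. The following is a sketch only: the module name sup_walk is made up for this example, and it assumes that the callback modules were compiled with their behaviour attributes intact. It relies on supervisor:which_children/1 and on the attributes returned by module_info/1.

```erlang
-module(sup_walk).
-export([tree/1, behaviours/1]).

%% Describes the supervision tree rooted at Sup (a pid or registered
%% name), in the order reported by the supervisor.
tree(Sup) ->
    case supervisor:which_children(Sup) of
        Children when is_list(Children) ->
            [describe(Child) || Child <- Children];
        Other ->
            %% Some children declared as supervisors (for instance
            %% supervisor_bridge processes) may not answer with a list.
            {not_a_supervisor, Other}
    end.

describe({Id, Pid, supervisor, _Mods}) when is_pid(Pid) ->
    {supervisor, Id, tree(Pid)};
describe({Id, _NotRunning, supervisor, _Mods}) ->
    {supervisor, Id, not_running};
describe({Id, _Pid, worker, dynamic}) ->
    %% Callback modules are not known statically (common for gen_event).
    {worker, Id, dynamic};
describe({Id, _Pid, worker, Mods}) ->
    {worker, Id, lists:flatmap(fun behaviours/1, Mods)}.

%% Behaviours declared by a module, accepting both spellings of the
%% attribute.
behaviours(Mod) ->
    Attrs = Mod:module_info(attributes),
    lists:append([Bs || {Key, Bs} <- Attrs,
                        Key =:= behaviour orelse Key =:= behavior]).
```

Calling sup_walk:tree(SomeSup) from a shell attached to the node gives a first map of which workers exist and which behaviours they implement; it does not replace reading the code, but it narrows down where to start.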
Dependencies
All applications have dependencies, and these dependencies will have their own dependencies. OTP applications usually share no state between them, so it's possible to know what bits of code depend on what other bits of code by looking at the app file only, assuming the developer wrote them in a mostly correct manner.
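The dependencies declared in an app file can also be queried from a shell. The sketch below (the module name app_deps is hypothetical) reads the applications key of the app file through application:get_key/2 and follows it transitively. It only sees what the developer declared, so it is subject to the same caveat as reading the file by hand.

```erlang
-module(app_deps).
-export([direct/1, all/1]).

%% Direct dependencies, as declared under `applications` in the app file.
direct(App) ->
    %% load/1 returns an error tuple if the app is already loaded or
    %% cannot be found; both cases are handled by get_key/2 below.
    _ = application:load(App),
    case application:get_key(App, applications) of
        {ok, Deps} -> Deps;
        undefined  -> []      % application could not be loaded
    end.

%% Transitive dependencies, each listed once.
all(App) ->
    lists:usort(walk(direct(App), [])).

walk([], Seen) ->
    Seen;
walk([App | Rest], Seen) ->
    case lists:member(App, Seen) of
        true  -> walk(Rest, Seen);
        false -> walk(direct(App) ++ Rest, [App | Seen])
    end.
```

For example, app_deps:direct(stdlib) includes kernel, since stdlib declares kernel as a dependency.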
OTP Releases
OTP releases are not a lot harder to understand than most OTP applications you'll encounter in the wild. A release is a set of OTP applications packaged in a production-ready manner so it boots and shuts down without needing to manually call application:start/2 for any app. Compiled releases may contain their own copy of the Erlang virtual machine with more or less libraries than the default distribution, and can be ready to run standalone. Of course there’s a bit more to releases than that, but generally, the same discovery process used for individual OTP applications will be applicable here.
Building Open Source Erlang Software
OTP applications are the vast majority of the open source code people will encounter. In fact, many people who would need to build an OTP release would do so as one umbrella OTP application.
If what you're writing is a stand-alone piece of code that could be used by someone building a product, it's likely an OTP application. If what you're building is a product that stands on its own and should be deployed by users as-is (or with a little configuration), what you should be building is an OTP release.
Project Structure
OTP Releases
For releases, the structure can be a bit different. Releases are collections of applications, and their structures may reflect that.
Instead of having a top-level app alone in src, applications can be nested one level deeper in an apps or lib directory:

    _build/
    apps/
        - myapp1/
            - src/
        - myapp2/
            - src/
    doc/
    LICENSE.txt
    README.md
    rebar.config
    rebar.lock

This structure lends itself to generating releases where multiple OTP applications under your control live in a single code repository. Both rebar3 and erlang.mk rely on the relx library to assemble releases.
A relx configuration tuple (within rebar.config) for the directory structure above would look like:

    {relx, [
        {release, {demo, "1.0.0"},
         [myapp1, myapp2, ..., recon]},
        {include_erts, false} % will use local Erlang install
    ]}.

Supervisors and start_link Semantics
In complex production systems, most faults and errors are transient, and retrying an operation is a good way to do things: Jim Gray's paper quotes Mean Times Between Failures (MTBF) of systems handling transient bugs being better by a factor of 4 when doing this. Still, supervisors aren’t just about restarting.
One very important part of Erlang supervisors and their supervision trees is that their start phases are synchronous. Each OTP process has the potential to prevent its siblings and cousins from booting. If the process dies, it’s retried again, and again, until it works, or fails too often.
Many Erlang developers end up arguing in favor of a supervisor that has a cooldown period. I strongly oppose the sentiment for one simple reason: it’s all about the guarantees.
It's About the Guarantees
Restarting a process is about bringing it back to a stable, known state. From there, things can be retried. When the initialization isn’t stable, supervision is worth very little. An initialized process should be stable no matter what happens. That way, when its siblings and cousins get started later on, they can be booted fully knowing that the rest of the system that came up before them is healthy.
(…)
Supervised processes provide guarantees in their initialization phase, not a best effort. This means that when you're writing a client for a database or service, you shouldn't need a connection to be established as part of the initialization phase unless you’re ready to say it will always be available no matter what happens.
You could force a connection during initialization if you know the database is on the same host and should be booted before your Erlang system, for example. Then a restart should work. In case of something incomprehensible and unexpected that breaks these guarantees, the node will end up crashing, which is desirable: a pre-condition to starting your system hasn’t been met. It’s a system-wide assertion that failed.
(…)
In this case, the only guarantee you can make in the client process is that your client will be able to handle requests, but not that it will communicate to the database. It could return {error, not_connected} on all calls during a net split, for example.
(…)
If you expect failure to happen on an external service, do not make its presence a guarantee of your system. We're dealing with the real world here, and failure of external dependencies is always an option.
Side Effects
Of course, the libraries and processes that call such a client will then error out if they don't expect to work without a database. That's an entirely different issue in a different problem space, one that depends on your business rules and what you can or can’t do to a client, but one that is possible to work around. (…)
The difference in both initialization and supervision approaches is that the client's callers make the decision about how much failure they can tolerate, not the client itself. That's a very important distinction when it comes to designing fault-tolerant systems. Yes, supervisors are about restarts, but they should be about restarts to a stable known state.
Example: Initializing without guaranteeing connections
The following code attempts to guarantee a connection as part of the process’ state:

    init(Args) ->
        Opts = parse_args(Args),
        {ok, Port} = connect(Opts),
        {ok, #state{sock=Port, opts=Opts}}.

    [...]

    handle_info(reconnect, S = #state{sock=undefined, opts=Opts}) ->
        %% try reconnecting in a loop
        case connect(Opts) of
            {ok, New} -> {noreply, S#state{sock=New}};
            _ -> self() ! reconnect, {noreply, S}
        end;

Instead, consider rewriting it as:

    init(Args) ->
        Opts = parse_args(Args),
        %% you could try connecting here anyway, for a best
        %% effort thing, but be ready to not have a connection.
        self() ! reconnect,
        {ok, #state{sock=undefined, opts=Opts}}.

    [...]

    handle_info(reconnect, S = #state{sock=undefined, opts=Opts}) ->
        %% try reconnecting in a loop
        case connect(Opts) of
            {ok, New} -> {noreply, S#state{sock=New}};
            _ -> self() ! reconnect, {noreply, S}
        end;

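To see how the second version behaves from the caller's side, here is a self-contained sketch of such a client. Several parts are assumptions made for the example and are not in the original listing: the module name db_client, the query/2 API, the connection function passed in as a fun, and the placeholder reply sent when connected. It also deviates from the listing in one respect: a failed attempt is retried after a fixed delay with erlang:send_after/3 instead of immediately.

```erlang
-module(db_client).
-behaviour(gen_server).

-export([start_link/1, query/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {sock, connect_fun}).

-define(RETRY_MS, 1000).   % arbitrary delay between attempts

%% ConnectFun is a hypothetical fun() -> {ok, Sock} | {error, Reason}.
start_link(ConnectFun) ->
    gen_server:start_link(?MODULE, ConnectFun, []).

query(Pid, Req) ->
    gen_server:call(Pid, {query, Req}).

%% init/1 never blocks on the external service: the only guarantee is
%% that the process exists and can answer calls.
init(ConnectFun) ->
    self() ! reconnect,
    {ok, #state{sock = undefined, connect_fun = ConnectFun}}.

handle_call({query, _Req}, _From, S = #state{sock = undefined}) ->
    {reply, {error, not_connected}, S};
handle_call({query, Req}, _From, S = #state{sock = Sock}) ->
    %% Placeholder: a real client would send Req over Sock here.
    {reply, {ok, {sent, Sock, Req}}, S}.

handle_cast(_Msg, S) ->
    {noreply, S}.

handle_info(reconnect, S = #state{sock = undefined, connect_fun = Connect}) ->
    case Connect() of
        {ok, Sock} ->
            {noreply, S#state{sock = Sock}};
        _ ->
            erlang:send_after(?RETRY_MS, self(), reconnect),
            {noreply, S}
    end;
handle_info(_Other, S) ->
    {noreply, S}.
```

With this shape, a caller that gets {error, not_connected} decides for itself whether to retry, degrade, or crash, which is the distinction the previous section insists on.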
Planning for Overload
By far, the most common cause of failure I've encountered in real-world scenarios is due to the node running out of memory. Furthermore, it is usually related to message queues going out of bounds. There are plenty of ways to deal with this, but knowing which one to use will require a decent understanding of the system you’re working on.
(…)
Determining what queue blew up is not necessarily hard. This is information that can be found in a crash dump. Finding out why it blew up is trickier. Based on the role of the process or run-time inspection, it’s possible to figure out whether causes include fast flooding, blocked processes that won’t process messages fast enough, and so on.
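For the run-time inspection side, a quick way to spot a growing mailbox on a live node is to rank processes by message queue length. The sketch below uses only erlang:processes/0 and erlang:process_info/2; the module name queue_top is hypothetical. It scans every process, so on nodes with very large process counts it should be used sparingly.

```erlang
-module(queue_top).
-export([top/1]).

%% Returns up to N {QueueLen, Pid} pairs, longest mailbox first.
top(N) when is_integer(N), N >= 0 ->
    Lens = [{Len, Pid}
            || Pid <- erlang:processes(),
               %% process_info/2 returns undefined for a process that
               %% died in the meantime; the pattern below skips those.
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).
```

Once a suspicious pid is found, its role in the supervision tree usually hints at whether it is being flooded or is blocked on something else.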
Footnotes:
[1] See Release (OTP).
[2] See Application (OTP).