Commit b63bdcc

Merge commit '40e4c4a4d4393444a5488e8e03590305788f23d8'
* commit '40e4c4a4d4393444a5488e8e03590305788f23d8':
  Further tutorial improvements
  wiki changes

Conflicts:
  tutorials/5ch.md
2 parents 5566323 + 40e4c4a commit b63bdcc

File tree

10 files changed

+154
-47
lines changed

_layouts/site.html

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -21,7 +21,7 @@
2121
<div class="carousel-caption">
2222
<h1>Get Started</h1>
2323
<p class="lead">Learn how to build concurrent, distributed programs with Cloud Haskell</p>
24-
<a class="btn btn-large btn-primary" href="/tutorials/ch1.html">Learn more</a>
24+
<a class="btn btn-large btn-primary" href="/tutorials/1ch.html">Learn more</a>
2525
</div>
2626
</div>
2727
</div>

img/OTP-Diagrams.png

54.7 KB

img/one-for-all-left-to-right.png

41 KB

img/one-for-all.png

51.5 KB

img/one-for-one.png

26.6 KB

img/sup1.png

10.2 KB

tutorials/5ch.md

Lines changed: 137 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,36 +1,152 @@
11
---
22
layout: tutorial
33
categories: tutorial
4-
sections: ['Introduction']
5-
title: Supervision Principles
4+
sections: ['Introduction', 'Quis custodiet ipsos custodes', 'Isolated Restarts', 'All or nothing restarts']
5+
title: 5. Supervision Principles
66
---
77

88
### Introduction
99

1010
In the previous tutorial, we looked at utilities for linking processes together
1111
and monitoring their lifecycle as it changes. Linking and monitoring are
12-
foundational tools for building _reliable_ systems, and are the bedrock principles
12+
foundational tools for building _reliable_ systems and are the bedrock principles
1313
on which Cloud Haskell's supervision capabilities are built.
1414

15-
The [`Supervisor`][1] provides a means to manage a set of _child processes_ and to construct
16-
a tree of processes, where some children are workers (e.g., regular processes) and
17-
others are themselves supervisors.
15+
A `Supervisor` manages a set of _child processes_ throughout their entire lifecycle,
16+
from birth (spawning) till death (exiting). Supervision is a key component in building
17+
fault tolerant systems, providing applications with a structured way to recover from
18+
isolated failures without the whole system crashing. Supervisors allow us to structure
19+
our applications as independently managed subsystems, each with its own dependencies
20+
(and inter-dependencies with other subsystems) and specify various policies determining
21+
the fashion in which these subsystems are to be started, stopped (i.e., terminated)
22+
and how they should behave at each level in case of failures.
1823

19-
The supervisor process is started with a list of _child specifications_, which
20-
tell the supervisor how to interact with its children. Each specification provides
21-
the supervisor with the following information about the child process:
24+
Supervisors also provide a convenient means to shut down a system (or subsystem) in a
25+
controlled fashion, since supervisors will always terminate their children before
26+
exiting themselves and do so based on the policies supplied when they were initially
27+
created.
2228

23-
1. [`ChildKey`][2]: used to identify the child once it has been started
24-
2. [`ChildType`][3]: indicating whether the child is a worker or another (nested) supervisor
25-
3. [`RestartPolicy`][4]: tells the supervisor under what circumstances the child should be restarted
26-
4. [`ChildTerminationPolicy`][5]: tells the supervisor how to terminate the child, should it need to
27-
5. [`ChildStart`][6]: provides a means for the supervisor to start/spawn the child process
29+
### Quis custodiet ipsos custodes
2830

29-
TBC
31+
Supervisors can be used to construct a tree of processes, where some children are
32+
workers (e.g., regular processes) and others are themselves supervisors. Each supervisor
33+
is responsible for monitoring its children and handling child failures by policy, as
34+
well as deliberately terminating children when instructed to do so (either explicitly
35+
per child, or when the supervisor is itself told to terminate).
36+
37+
Each supervisor is started with a list of _child specifications_, which tell the supervisor
38+
how to interact with its children. Each specification provides the supervisor with the
39+
following information about the corresponding child process:
40+
41+
1. `ChildKey`: used to identify the child specification and process (once it has started)
42+
2. `ChildType`: indicates whether the child is a worker or another (nested) supervisor
43+
3. `RestartPolicy`: tells the supervisor under what circumstances the child should be restarted
44+
4. `ChildTerminationPolicy`: tells the supervisor how to terminate the child, should it need to
45+
5. `ChildStart`: provides a means for the supervisor to start/spawn the child process
46+
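The five fields listed above can be pictured as a record. The following is a minimal sketch, not the library's actual definition: the record name `ChildSpec`, its field names, the `Int` timeout and the use of `IO ()` in place of `ChildStart` are all assumptions made for illustration.

```haskell
-- Simplified, hypothetical stand-ins for the types named in the list above.
type ChildKey = String

data ChildType = Worker | Supervisor
  deriving (Eq, Show)

data RestartPolicy = Permanent | Temporary | Transient | Intrinsic
  deriving (Eq, Show)

-- The timeout is modelled as plain milliseconds (an assumption).
data ChildTerminationPolicy = TerminateTimeout Int | TerminateImmediately
  deriving (Eq, Show)

-- One specification per child; field names are illustrative only.
data ChildSpec = ChildSpec
  { childKey     :: ChildKey                -- identifies the spec and the running child
  , childType    :: ChildType               -- worker or nested supervisor
  , childRestart :: RestartPolicy           -- when to restart the child
  , childStop    :: ChildTerminationPolicy  -- how to terminate the child
  , childStart   :: IO ()                   -- stand-in for ChildStart
  }

-- Two example specifications: a worker and a nested supervisor.
exampleSpecs :: [ChildSpec]
exampleSpecs =
  [ ChildSpec "logger" Worker Permanent (TerminateTimeout 5000)
      (putStrLn "starting logger")
  , ChildSpec "cache-sup" Supervisor Transient TerminateImmediately
      (putStrLn "starting cache-sup")
  ]
```

Loading this into GHCi and evaluating `map childKey exampleSpecs` lists the keys a supervisor would use to address each child.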
47+
The `RestartPolicy` determines the circumstances under which a child should be
48+
restarted when the supervisor detects that it has exited. A `Permanent` child will
49+
always be restarted, whilst a `Temporary` child is never restarted. `Transient` children
50+
are only restarted if they exit abnormally (i.e., the `DiedReason` the supervisor sees for
51+
the child is `DiedException` rather than `DiedNormal`). `Intrinsic` children behave
52+
exactly like `Transient` ones, except that if they terminate normally, the whole
53+
supervisor (i.e., all the other children) exits normally as well, as if someone had
54+
triggered the shutdown/terminate sequence for the supervisor's process explicitly.
55+
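The restart rules can be summarised as a small decision table. This is a sketch under stated assumptions rather than the supervisor's real code: the `Exit` and `Decision` types and the `decide` function are hypothetical names, exits are reduced to "normal" and "abnormal", and a `Transient` child is assumed to be restarted only after an abnormal exit (the usual OTP-style reading).

```haskell
-- Simplified stand-in for the policy described above.
data RestartPolicy = Permanent | Temporary | Transient | Intrinsic
  deriving (Eq, Show)

-- How the child exited, reduced to two cases (hypothetical type).
data Exit = ExitedNormally | ExitedAbnormally
  deriving (Eq, Show)

-- What the supervisor does in response (hypothetical type).
data Decision = Restart | LeaveStopped | StopSupervisor
  deriving (Eq, Show)

-- Map a child's policy and the way it exited to the supervisor's response.
decide :: RestartPolicy -> Exit -> Decision
decide Permanent _                = Restart         -- always restarted
decide Temporary _                = LeaveStopped    -- never restarted
decide Transient ExitedAbnormally = Restart
decide Transient ExitedNormally   = LeaveStopped
decide Intrinsic ExitedAbnormally = Restart         -- behaves like Transient...
decide Intrinsic ExitedNormally   = StopSupervisor  -- ...but a normal exit stops the supervisor
```

Writing the rules as total pattern matches makes it easy to see that `Intrinsic` differs from `Transient` only in the normal-exit case.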
56+
When a supervisor is told directly to terminate a child process, it uses the
57+
`ChildTerminationPolicy` to determine whether the child should be terminated
58+
_gracefully_ or _brutally killed_. This _shutdown protocol_ is used throughout
59+
[distributed-process-platform][dpp] and in order for a child process to be managed
60+
effectively by its supervisor, it is imperative that it understands the protocol.
61+
When a _graceful_ shutdown is required, the supervisor will send an exit signal to the
62+
child process, with the `ExitReason` set to `ExitShutdown`, whence the child process is
63+
expected to perform any required cleanup and then exit with the same `ExitReason`,
64+
indicating that the shutdown happened cleanly/gracefully. On the other hand, when
65+
the `ChildTerminationPolicy` is set to `TerminateImmediately`, the supervisor will not send
66+
an exit signal at all, calling the `kill` primitive instead of the `exit` primitive.
67+
This immediately kills the child process without giving it the opportunity to clean
68+
up its internal state at all. The graceful shutdown mode, `TerminateTimeout`, must
69+
provide a timeout value. The supervisor attempts a _graceful_ shutdown initially,
70+
however if the child does not exit within the given time window, the supervisor will
71+
automatically revert to a _brutal kill_ using `TerminateImmediately`. If the
72+
timeout value is set to `Infinity`, the supervisor will wait indefinitely for the
73+
child to exit cleanly.
74+
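The shutdown protocol described above can be modelled as a pure function. This is only a sketch: the `Delay`, `Action` and `Outcome` types and the `shutdown` function are hypothetical, the timeout is modelled as plain milliseconds, and the child's behaviour is reduced to how long it needs to clean up (`Nothing` meaning it never exits on its own).

```haskell
-- Hypothetical, simplified timeout: a bound in milliseconds, or no bound.
data Delay = Millis Int | Infinity
  deriving (Eq, Show)

data ChildTerminationPolicy = TerminateTimeout Delay | TerminateImmediately
  deriving (Eq, Show)

-- Steps the supervisor takes (hypothetical names).
data Action = SendExitShutdown | AwaitExit Delay | Kill
  deriving (Eq, Show)

data Outcome = ExitedCleanly | BrutallyKilled | StillWaiting
  deriving (Eq, Show)

-- Second argument: cleanup time the child needs, or Nothing if it never exits.
shutdown :: ChildTerminationPolicy -> Maybe Int -> ([Action], Outcome)
shutdown TerminateImmediately _ = ([Kill], BrutallyKilled)
shutdown (TerminateTimeout d) cleanup =
  case (d, cleanup) of
    (Infinity, Nothing) -> (graceful, StillWaiting)   -- waits indefinitely
    (Infinity, Just _)  -> (graceful, ExitedCleanly)
    (Millis limit, Just t)
      | t <= limit      -> (graceful, ExitedCleanly)
    _                   -> (graceful ++ [Kill], BrutallyKilled)  -- timed out
  where
    -- A graceful attempt always starts with the exit signal and a wait.
    graceful = [SendExitShutdown, AwaitExit d]
```

The model makes the fallback explicit: a `Kill` step is only ever appended when a finite timeout expires, or issued alone when the policy is `TerminateImmediately`.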
75+
When a supervisor detects a child exit, it will attempt a restart. Whilst explicitly
76+
terminating a child will **only** terminate the specified child process, unexpected
77+
child exits can trigger a _branch restart_, where other (sibling) child processes are
78+
restarted along with the child that failed. How the supervisor goes about this
79+
_branch restart_ is governed by the `RestartStrategy` given when the supervisor is
80+
first started.
81+
82+
------
83+
> ![Info: ][info] Whenever a `RestartStrategy` causes multiple children to be restarted
84+
> in response to a single child failure, a _branch restart_ incorporating some (possibly
85+
> a subset) of the supervisor's remaining children will be triggered. The exceptions
86+
> to this rule are `Temporary` children and `Transient` children that exit normally,
87+
> therefore **not** triggering a restart. The basic rule of thumb is that, if a child
88+
> should be restarted and the `RestartStrategy` is not `RestartOne`, then a _branch_
89+
> containing some other children will be restarted as well.
90+
------
91+
92+
### Isolated Restarts
93+
94+
The `RestartOne` strategy is very simple. When one child fails, only that individual
95+
child is restarted and its siblings are left running. Use `RestartOne` whenever the
96+
processes being supervised are completely independent of one another, or a child can
97+
be restarted and lose its state without adversely affecting its siblings.
98+
99+
-------
100+
![Sup1: ][sup1]
101+
-------
102+
103+
### All or nothing restarts
104+
105+
The `RestartAll` strategy is used when our children are all inter-dependent and it's
106+
necessary to restart them all whenever one child crashes. This strategy triggers one of
107+
those _branch restarts_ we mentioned earlier, which in this case means that **all** the
108+
supervisor's children are restarted if any child fails.
109+
110+
The order and manner in which the surviving children are restarted depends on the chosen
111+
`RestartMode` which parameterises the `RestartStrategy`. This comes in three flavours:
112+
113+
1. `RestartEach`: stops then starts each child sequentially
114+
2. `RestartInOrder`: stops all children first (in order), then restarts them sequentially
115+
3. `RestartRevOrder`: stops all children in one order, then restarts them sequentially in the opposite order
116+
117+
Each `RestartMode` is further parameterised by its `RestartOrder`, which is either left
118+
to right, or right to left. To illustrate, we will consider three alternative configurations
119+
here, starting with `RestartEach` and `LeftToRight`.
120+
121+
-------
122+
![Sup2: ][sup2]
123+
-------
124+
125+
There are times when we need to shut down all the children first, before restarting them.
126+
The `RestartInOrder` mode will do this, shutting the children down according to our chosen
127+
`RestartOrder` and then starting them up in the same way. Here's an example demonstrating
128+
`RestartInOrder` using `LeftToRight`.
129+
130+
-------
131+
![Sup3: ][sup3]
132+
-------
133+
134+
If we'd chosen `RightToLeft`, the children would have been stopped from right to left (i.e.,
135+
starting with child-3, then child-2, etc) and then restarted in the same order.
136+
137+
The astute reader might've noticed that so far, we've yet to demonstrate the behaviour that's
138+
default in [Erlang/OTP's Supervisor][erlsup], and it's a default for good reason. It is not
139+
uncommon for children to depend on one another and therefore need to be started in the correct
140+
order. Since these children rely on their siblings to function, we must stop them in the opposite
141+
order, otherwise the dependent children might crash whilst we're restarting other processes they
142+
rely on. It follows that, in this setup, we cannot subsequently (re)start the children in the
143+
same order we stopped them either.
144+
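The three restart modes and two orders can be compared by computing the sequence of stop and start steps each one produces. The following is a sketch with simplified stand-in types, not the supervisor's implementation; the `Step` type and the `ordered` and `plan` functions are hypothetical names introduced for illustration.

```haskell
-- Simplified stand-ins for the types discussed above.
data RestartOrder = LeftToRight | RightToLeft
  deriving (Eq, Show)

data RestartMode
  = RestartEach RestartOrder
  | RestartInOrder RestartOrder
  | RestartRevOrder RestartOrder
  deriving (Eq, Show)

-- A single action taken on a child during a branch restart (hypothetical).
data Step c = Stop c | Start c
  deriving (Eq, Show)

-- Arrange the children (given left to right) according to the chosen order.
ordered :: RestartOrder -> [c] -> [c]
ordered LeftToRight = id
ordered RightToLeft = reverse

-- The sequence of steps for restarting all children under a given mode.
plan :: RestartMode -> [c] -> [Step c]
plan (RestartEach o) cs =
  concatMap (\c -> [Stop c, Start c]) (ordered o cs)      -- stop then start, one by one
plan (RestartInOrder o) cs =
  map Stop (ordered o cs) ++ map Start (ordered o cs)     -- stop all, start in same order
plan (RestartRevOrder o) cs =
  map Stop (ordered o cs) ++ map Start (reverse (ordered o cs))  -- start in opposite order
```

For children that were started left to right and depend on earlier siblings, `plan (RestartRevOrder RightToLeft)` stops the most dependent child first and then starts the children again from the left, which matches the dependency argument made above.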
145+
[dpp]: https://github.com/haskell-distributed/distributed-process-platform
146+
[sup1]: /img/one-for-one.png
147+
[sup2]: /img/one-for-all.png
148+
[sup3]: /img/one-for-all-left-to-right.png
149+
[alert]: /img/alert.png
150+
[info]: /img/info.png
151+
[erlsup]: http://www.erlang.org/doc/man/supervisor.html
30152

31-
[1]: /static/doc/distributed-process-platform/Control-Distributed-Process-Platform-Supervisor.html
32-
[2]: /static/doc/distributed-process-platform/Control-Distributed-Process-Platform-Supervisor.html
33-
[3]: /static/doc/distributed-process-platform/Control-Distributed-Process-Platform-Supervisor.html
34-
[4]: /static/doc/distributed-process-platform/Control-Distributed-Process-Platform-Supervisor.html
35-
[5]: /static/doc/distributed-process-platform/Control-Distributed-Process-Platform-Supervisor.html
36-
[6]: /static/doc/distributed-process-platform/Control-Distributed-Process-Platform/Supervisor.html

wiki/contributing.md

Lines changed: 13 additions & 12 deletions
@@ -24,11 +24,10 @@ We have a rather full backlog, so your help will be most welcome assisting
2424
us in clearing that. You can view the existing open issues on the
2525
[jira issue tracker](https://cloud-haskell.atlassian.net/issues/?filter=10001).
2626

27-
If you wish to submit an issue there, you can do so without logging in,
28-
although you obviously won't get any email notifications unless you create
29-
an account and provide your email address.
27+
If you wish to submit a new issue there, you cannot do so without
28+
creating an account (by providing your email address) and logging in.
3029

31-
It is also important to work out which component or sub-system should be
30+
It is also helpful to work out which component or sub-system should be
3231
changed. You may wish to email the maintainers to discuss this first.
3332

3433
### __2. Make sure your patch merges cleanly__
@@ -47,7 +46,8 @@ local branch. For example:
4746

4847
$ git checkout -b bugfix-issue123
4948

50-
## make, add and commit your changes
49+
## add and commit your changes
50+
## base them on master for bugfixes or development for new features
5151

5252
$ git checkout master
5353
$ git remote add upstream git://github.com/haskell-distributed/distributed-process.git
@@ -70,9 +70,9 @@ conventions page [here](http://hackage.haskell.org/trac/ghc/wiki/WorkingConventi
7070

7171
1. try to make small patches - the bigger they are, the longer the pull request QA process will take
7272
2. strictly separate all changes that affect functionality from those that just affect code layout, indentation, whitespace, filenames etc
73-
3. always include the issue number (of the form `fixes #N`) in the final commit message for the patch - pull requests without an issue are unlikely to have been discussed (see above)
73+
3. always include the issue number (of the form `PROJECT_CODE #resolve Fixed`) in the final commit message for the patch - pull requests without an issue are unlikely to have been discussed (see above)
7474
4. use Unix conventions for line endings. If you are on Windows, ensure that git handles line-endings sanely by running `git config --global core.autocrlf false`
75-
5. make sure you have setup git to use the correct name and email for your commits - see the [github help guide](https://help.github.com/articles/setting-your-email-in-git)
75+
5. make sure you have setup git to use the correct name and email for your commits - see the [github help guide](https://help.github.com/articles/setting-your-email-in-git) - otherwise you won't be attributed in the scm history!
7676

7777
### __4. Make sure all the tests pass__
7878

@@ -171,7 +171,7 @@ import Data.Blah
171171
import Data.Boom (Typeable)
172172
{% endhighlight %}
173173

174-
Personally I don't care *that much* about alignment for other things,
174+
We generally don't care *that much* about alignment for other things,
175175
but as always, try to follow the convention in the file you're editing
176176
and don't change things just for the sake of it.
177177

@@ -186,18 +186,18 @@ punctuation.
186186

187187
Comment every top level function (particularly exported functions),
188188
and provide a type signature; use Haddock syntax in the comments.
189-
Comment every exported data type. Function example:
189+
Comment every exported data type. Function example:
190190

191191
{% highlight haskell %}
192-
-- | Send a message on a socket. The socket must be in a connected
193-
-- state. Returns the number of bytes sent. Applications are
192+
-- | Send a message on a socket. The socket must be in a connected
193+
-- state. Returns the number of bytes sent. Applications are
194194
-- responsible for ensuring that all data has been sent.
195195
send :: Socket -- ^ Connected socket
196196
-> ByteString -- ^ Data to send
197197
-> IO Int -- ^ Bytes sent
198198
{% endhighlight %}
199199

200-
For functions the documentation should give enough information to
200+
For functions, the documentation should give enough information to
201201
apply the function without looking at the function's definition.
202202

203203
### Naming
@@ -214,3 +214,4 @@ abbreviation. For example, write `HttpServer` instead of
214214
Use singular when naming modules e.g. use `Data.Map` and
215215
`Data.ByteString.Internal` instead of `Data.Maps` and
216216
`Data.ByteString.Internals`.
217+

wiki/maintainers.md

Lines changed: 3 additions & 11 deletions
@@ -117,17 +117,9 @@ What's good for the goose...
117117

118118
#### Making API documentation available on the website
119119

120-
Currently this is a manual process. If you don't sed/awk out the
121-
reference/link paths, it'll be a mess. We will add a script to
122-
handle this some time soon. I tend to only update the static
123-
documentation for d-p and d-p-platform, at least until the process has
124-
been automated. I also do this *only* for mainline branches (i.e.,
125-
for development and master), although again, automation could solve
126-
a lot of issues there.
127-
128-
There is also an open ticket to set up nightly builds, which will
129-
update the HEAD haddocks (on the website) and produce an 'sdist'
130-
bundle and add that to the website too.
120+
There is an open ticket to set up nightly builds, which will update
121+
the HEAD haddocks (on the website) and produce an 'sdist' bundle and
122+
add that to the website too.
131123

132124
See https://cloud-haskell.atlassian.net/browse/INFRA-1 for details.
133125

wiki/reliability.md

Lines changed: 0 additions & 2 deletions
@@ -28,8 +28,6 @@ child processes. A supervisors *children* can be either worker processes or
2828
supervisors, which allows us to build hierarchical process structures (called
2929
supervision trees in Erlang parlance).
3030

31-
The supervision APIs are a work in progress.
32-
3331
[1]: http://en.wikipedia.org/wiki/Open_Telecom_Platform
3432
[2]: http://www.erlang.org/doc/design_principles/sup_princ.html
3533
[3]: http://www.erlang.org/doc/man/supervisor.html
