Note: the code is a bit of a mess; I wouldn't teach this as-is. But at least
it's tiny, and from scratch!

==== USAGE ====

Quickstart:

    $ make
    $ cat xor_input | ./main
    Read 4 training points.
    [1.000000, ]
    [0.999998, ]
    [0.000002, ]
    [0.000001, ]
    [0.146817, ]
    [0.154475, ]
    [0.914473, ]

The `main` program trains a neural net with 2 inputs and 1 output. You provide
the training set, hyperparameters, and test set on stdin. The input format is
as follows:

    $ cat xor_input
    in size 2
    out size 1
    hidden size 10
    n layers 3
    0 0 0
    0 1 1
    1 0 1
    1 1 0
    train 150 0.05
    0 1
    1 0
    0 0
    1 1
    0.8 0.99
    0.95 0.86
    0.95 0.1

The first few lines describe the network. Any lines before the "train"
statement give training points: each training point line is the input values
followed by the output values for a single point. The `train [iters] [lr]`
statement then trains the neural net on those points for the given number of
iterations at the given learning rate. Finally, lines after that are treated
as test inputs: the trained neural net is run on each one and its output is
printed on stdout.

==== HOW? ====

The wrapper code is in main.c; the core operations are in nn.c. The naming
scheme is:

- v_...: vector operation
- vv_...: vector-vector operation
- mv_...: matrix-vector operation
- ..._bp...: backward operation (computes a derivative)

Note that when computing the derivative of, say, a mat-vec multiplication, we
can ask for the derivative with respect to either the matrix (weights) or the
vector (input). These are called mv_bp_m and mv_bp_v, respectively.

The forward operations are exactly what you would expect. For the backward
operations, the intuitive thing to remember is just "everything is linear, so
derivatives add up."

I've tried to decompose all of the backprop operations into smaller
operations: a mat-vec multiplication is just one vec-vec dot product per row,
so the backprop of a mat-vec multiplication can be built out of the backprop
of vec-vec dot products.

Meanwhile, vec-vec backprop can be understood intuitively:

    v.w = v1*w1 + v2*w2 + ... + vn*wn

The derivative of v.w with respect to v1 is w1 (and likewise wi for each vi),
and then we multiply by "how much we want it to change", i.e., the value
backpropagated from the layer above (the previous step of the backward pass).
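
To make the decomposition concrete, here is a minimal, self-contained C sketch
of the idea. The names mv_bp_m and mv_bp_v come from the naming scheme above,
but the signatures, the row-major layout, and the helpers vv_dot / vv_bp are
my assumptions for illustration, not the actual nn.c API:

    /* Sketch only: signatures and layout are assumed, not copied from nn.c. */
    #include <stdio.h>

    /* Forward: dot product v.w over n elements. */
    static float vv_dot(const float *v, const float *w, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += v[i] * w[i];
        return acc;
    }

    /* Backward: d(v.w)/dv[i] = w[i], scaled by the upstream gradient g.
     * Gradients accumulate (+=): everything is linear, so derivatives add up. */
    static void vv_bp(const float *w, float g, float *dv, int n) {
        for (int i = 0; i < n; i++)
            dv[i] += g * w[i];
    }

    /* Forward: out = M v, with M stored row-major as rows x cols.
     * Each output element is a vec-vec dot with one row of M. */
    static void mv_mul(const float *m, const float *v, float *out,
                       int rows, int cols) {
        for (int r = 0; r < rows; r++)
            out[r] = vv_dot(&m[r * cols], v, cols);
    }

    /* Backward w.r.t. the vector (input): each row contributes a vv_bp
     * call weighted by that row's upstream gradient. */
    static void mv_bp_v(const float *m, const float *g, float *dv,
                        int rows, int cols) {
        for (int r = 0; r < rows; r++)
            vv_bp(&m[r * cols], g[r], dv, cols);
    }

    /* Backward w.r.t. the matrix (weights): d(out[r])/dM[r][c] = v[c],
     * again scaled by the upstream gradient for that row. */
    static void mv_bp_m(const float *v, const float *g, float *dm,
                        int rows, int cols) {
        for (int r = 0; r < rows; r++)
            vv_bp(v, g[r], &dm[r * cols], cols);
    }

    int main(void) {
        /* Tiny check: 2x2 matrix times a 2-vector, then backprop ones. */
        float m[4] = {1, 2, 3, 4}, v[2] = {5, 6}, out[2] = {0, 0};
        float g[2] = {1, 1}, dv[2] = {0, 0}, dm[4] = {0, 0, 0, 0};
        mv_mul(m, v, out, 2, 2);
        mv_bp_v(m, g, dv, 2, 2);
        mv_bp_m(v, g, dm, 2, 2);
        printf("out = [%f, %f]\n", out[0], out[1]);   /* [17, 39] */
        printf("dv  = [%f, %f]\n", dv[0], dv[1]);     /* [4, 6]   */
        printf("dm  = [%f, %f, %f, %f]\n",
               dm[0], dm[1], dm[2], dm[3]);           /* [5, 6, 5, 6] */
        return 0;
    }

Note how both backward passes accumulate with += into the gradient buffers:
each row's dot product contributes its own piece, and since everything is
linear, the pieces just add up.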