A reader (Barry Watson) supplied this updated version of the Vespa
structural Verilog implementation.  The key changes, as explained by
Barry, are:

1. The original D-type flip-flop uses blocking assignment (=) instead
of non-blocking assignment (<=). This causes a problem in stage 4,
where IR5 is always updated to the same value as IR4 on each clock
tick. The IR flip-flops in the other stages are fed into multiplexers,
so the 6 time unit propagation delay means that blocking assignment
works for them.
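
As a rough illustration of the point above, here is a minimal sketch of
a register written with non-blocking assignment, and of two such
registers chained directly the way IR4 feeds IR5. The module names
(dff_nb, ir_chain), the port names, and the 32-bit width are assumptions
made for this example; they are not taken from the Vespa source.

```verilog
// Hypothetical sketch: module names, port names and width are assumptions.
module dff_nb #(parameter WIDTH = 32) (
    input  wire             clk,
    input  wire [WIDTH-1:0] d,
    output reg  [WIDTH-1:0] q
);
    // Non-blocking assignment: d is sampled at the clock edge, and q is
    // updated only after every always block triggered by that same edge
    // has sampled its own inputs.
    always @(posedge clk)
        q <= d;
endmodule

// Two registers back to back with no logic (and so no delay) in between.
module ir_chain (
    input  wire        clk,
    input  wire [31:0] ir_in,
    output wire [31:0] ir4,
    output wire [31:0] ir5
);
    dff_nb #(32) ir4_reg (.clk(clk), .d(ir_in), .q(ir4));
    dff_nb #(32) ir5_reg (.clk(clk), .d(ir4),   .q(ir5));
endmodule
```

With non-blocking assignment, ir5 takes the value ir4 held before the
clock edge, so the two registers stay one cycle apart. With blocking
assignment the result depends on the order in which the simulator runs
the two always blocks, and ir5 can end up with the new value of ir4.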

2. The decoding of reg_write in stage 5 leads to problems. If we take
ex3.asm as an example:

ldi r2,#-1
sub r2,r2,r2
hlt

This correctly writes 0 into r2. However, stage 5 does the following:

and #(gate_delay) gen_jmpl(jmpl, jmp5, IR5[16]);
or #(gate_delay) gen_write_or(reg_write, add5, sub5, and5, or5,
                              not5, xor5, jmpl, ld5, ldi5, ldx5);

assign a3 = IR5[26:22];

Here a3 is a plain assign with no delay, while reg_write passes through
gates with gate_delay. The update of a3 (which is 5'b00000 for the hlt
instruction) therefore happens before reg_write changes from 1'b1 to
1'b0, so r0 also has 0 written to it. I fixed this by decoding
reg_write and halt in stage 2 and letting them flow through the
pipeline along with everything else decoded at stage 2.
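
The idea of carrying reg_write and halt through the pipeline can be
sketched roughly as follows. This is only an illustration under stated
assumptions: the module name, the stage-suffixed signal names
(reg_write2, halt2, IR2 and so on), and the 32-bit IR width are
hypothetical, and the sketch ignores any stall or flush logic the real
pipeline may have.

```verilog
// Hypothetical sketch: all names except a3, reg_write and IR5 are assumptions.
module reg_write_pipe (
    input  wire        clk,
    input  wire [31:0] IR2,         // instruction word in stage 2
    input  wire        reg_write2,  // decoded in stage 2
    input  wire        halt2,       // decoded in stage 2
    output wire [4:0]  a3,
    output wire        reg_write,
    output wire        halt
);
    reg [31:0] IR3, IR4, IR5;
    reg        reg_write3, reg_write4, reg_write5;
    reg        halt3, halt4, halt5;

    // The decoded bits move in lockstep with the instruction word, so
    // IR5 and reg_write5 always belong to the same instruction.
    always @(posedge clk) begin
        IR3 <= IR2;  reg_write3 <= reg_write2;  halt3 <= halt2;
        IR4 <= IR3;  reg_write4 <= reg_write3;  halt4 <= halt3;
        IR5 <= IR4;  reg_write5 <= reg_write4;  halt5 <= halt4;
    end

    // No gate delay between the stage 5 registers and either signal.
    assign a3        = IR5[26:22];
    assign reg_write = reg_write5;
    assign halt      = halt5;
endmodule
```

Because a3 and reg_write now both come straight from registers clocked
on the same edge, there is no window in which a3 already names r0 while
reg_write is still 1'b1 from the previous instruction.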