I can guarantee that any device using floating-point math will not, in fact, deliver bit-identical results for mathematically equal expressions. Try the equivalent of the following MATLAB on your various platforms:
```matlab
a = [repmat(1, 10000, 1); 1e16];
format long
sum(a)
sum(flipud(a))
```
Results:

```
1.000000000001000e+16
1.000000000000000e+16
```
Addition is commutative, so the two expressions are mathematically equivalent. But order matters in the floating-point world. Adding 10000 ones in sequence presents no problem while the accumulator is small. But once the accumulator has "floated" up to 1e16, the spacing between adjacent doubles is 2 (eps(1e16) == 2 in MATLAB), so adding 1 rounds right back to 1e16, and the ones are effectively never added.
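The same experiment is easy to reproduce outside MATLAB. Here's a sketch in Python, used as a hypothetical stand-in for whatever your target platforms are; the built-in `sum` accumulates left to right with one IEEE 754 double rounding per step, just like the MATLAB loops above:

```python
# Reproduce the order-dependence of floating-point summation.
# sum() accumulates left to right, rounding to double after each addition.
ones_first = [1.0] * 10000 + [1e16]   # accumulate the ones while they still count
big_first  = [1e16] + [1.0] * 10000   # each +1.0 rounds straight back to 1e16

print(sum(ones_first))  # 1.000000000001e+16
print(sum(big_first))   # 1e+16
```

Same two expressions, same two different answers.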
Here's an extra wrinkle: the x87 FPU computes internally in 80-bit extended precision. So x87 code can give you a different answer if the compiler decides to keep intermediate results inside the FPU registers. If it instead spills intermediates to the stack, they're rounded back to 64 bits. And if it decides to compute using the SSE instructions, you're back to 64 bits as well.
The various MATLAB tricks suggested in the other answers might get you close enough for your problem. But if you're really after a perfect model of your system, you'll probably need a more controllable simulation framework. Perhaps use vpa (from the Symbolic Math Toolbox), converting to the correct number of bits at each step. Or switch to C or C++, paying extremely careful attention to the optimizer settings.
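As a sketch of what "converting to the correct number of bits at each step" might look like, here's a hypothetical Python model that forces every intermediate result through IEEE 754 single precision, using only the standard library (`struct` round-trips the value through a 32-bit float); the function names are my own invention:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE 754 single (binary32)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(values):
    """Accumulate with a single-precision rounding after every addition,
    modeling hardware that carries a 32-bit accumulator."""
    acc = to_f32(0.0)
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc

# Single precision saturates much sooner: the spacing of binary32
# floats is already 2 at 2**24, so ones stop registering past 16777216.
print(f32_sum([1.0] * 10000))      # 10000.0 -- still exact
print(f32_sum([16777216.0, 1.0]))  # 16777216.0 -- the 1 is lost
```

The same pattern extends to any step-by-step rounding rule your target hardware applies, which is the controllability the MATLAB one-liners don't give you.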
Or mathematically derive a bound on the error from the input scaling, and verify that your answer always stays below that bound.
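For the error-bound route, a standard result (see Higham's Accuracy and Stability of Numerical Algorithms) bounds the error of left-to-right summation of n doubles by roughly (n-1) * u * sum(|x_i|), where u = 2^-53 is the unit roundoff. A hypothetical Python check of that bound against the example above, using math.fsum as a correctly rounded reference:

```python
import math

def naive_sum_error_bound(values):
    """First-order bound on |naive_sum - exact| for left-to-right
    double-precision summation: (n - 1) * u * sum(|x_i|)."""
    u = 2.0 ** -53  # unit roundoff for IEEE 754 binary64
    return (len(values) - 1) * u * sum(abs(v) for v in values)

values = [1.0] * 10000 + [1e16]
exact = math.fsum(values)  # correctly rounded sum, independent of order
bound = naive_sum_error_bound(values)

for order in (values, list(reversed(values))):
    err = abs(sum(order) - exact)
    assert err <= bound, (err, bound)
print("both orderings within bound:", bound)  # bound is about 1.1e4 here
```

Both orderings land within the bound (errors of 0 and 1e4 against a bound of roughly 1.1e4), so even though the answers differ, each is provably acceptable.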