I am implementing a piecewise linear function to linearize the sqrt operation, using breakpoints from 0 to 100,000. I have placed the breakpoints more densely at low values (where sqrt changes most rapidly) and more sparsely at high values.
I implemented it first in Python:
def piecewise_linear_sqrt(x, coefficients, breakpoints, scale_factor):
    """Approximate square root using piecewise linear functions with integer coefficients."""
    if x < 0:
        return 0  # Return 0 for negative values
    for i, (m, n) in enumerate(coefficients):
        if breakpoints[i] <= x < breakpoints[i + 1]:
            result = (m * x + n) / scale_factor
            return result
    # Use the last segment for any value at or beyond the last breakpoint
    m, n = coefficients[-1]
    return (m * x + n) / scale_factor
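For reference, this is roughly how I build the breakpoints and the integer-scaled coefficients (a sketch: the breakpoint spacing and scale_factor shown here are placeholders, my real table is denser):

```python
import math

def build_segments(breakpoints, scale_factor):
    """Fit one line y = m*x + n per segment of sqrt, then scale to integers.
    Sketch only: each segment interpolates the endpoints; a least-squares
    fit per segment would give a smaller worst-case error."""
    coeffs = []
    for b0, b1 in zip(breakpoints, breakpoints[1:]):
        m = (math.sqrt(b1) - math.sqrt(b0)) / (b1 - b0)
        n = math.sqrt(b0) - m * b0
        # Quantize the float coefficients to integers at the chosen scale
        coeffs.append((round(m * scale_factor), round(n * scale_factor)))
    return coeffs

# Placeholder spacing: denser breakpoints near 0, where sqrt changes fastest
breakpoints = [0, 1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 100000]
scale_factor = 2 ** 16
coefficients = build_segments(breakpoints, scale_factor)
```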
and in C:
float piecewise_linear_sqrt(int x) {
    if (x < 0) {
        return 0;
    }
    for (int i = 0; i < num_coefficients - 1; ++i) {
        if (x >= breakpoints[i] && x < breakpoints[i + 1]) {
            float result = (coefficients[i][0] * x + coefficients[i][1]) / scale_factor;
            return result;
        }
    }
    if (x >= breakpoints[num_coefficients - 1]) {
        float result = (coefficients[num_coefficients - 1][0] * x + coefficients[num_coefficients - 1][1]) / scale_factor;
        return result;
    }
    return 0;
}
However, when I run the two implementations with the same input, I get different behaviour.

I suspect a precision problem from what I see, but I don't know how to get better results. I tried increasing the scaling of the coefficients used in the C version, because I want to move to a fixed-point implementation later.
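To show what I mean, here is the precision effect simulated in Python for a single placeholder segment (a sketch, assuming the fixed-point version does all-integer arithmetic, so its division truncates while the Python version divides in float):

```python
import math

def fit_segment(b0, b1, scale):
    """Integer-scaled line through (b0, sqrt(b0)) and (b1, sqrt(b1)).
    Placeholder segment; the real table has many of these."""
    m = (math.sqrt(b1) - math.sqrt(b0)) / (b1 - b0)
    n = math.sqrt(b0) - m * b0
    return round(m * scale), round(n * scale)

def eval_float(x, m, n, scale):
    return (m * x + n) / scale   # float division, like the Python version

def eval_int(x, m, n, scale):
    return (m * x + n) // scale  # truncating integer division, like fixed-point

# Same segment and input, evaluated at increasing coefficient scales:
# quantization error in (m, n) shrinks, but the final truncation remains.
for scale in (2 ** 4, 2 ** 8, 2 ** 16):
    m, n = fit_segment(64, 256, scale)
    print(scale, eval_float(100, m, n, scale), eval_int(100, m, n, scale))
```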
Any ideas on how to improve the accuracy of this code, or any reference/code that I can follow or borrow?
