
I am working on a project where I need to implement a CRC (Cyclic Redundancy Check) on a Xilinx Alveo U280 FPGA. I am considering two approaches for the CRC calculation and would like to understand which one would be faster:

Custom Algorithm: Implementing the CRC calculation using custom logic that leverages the parallel processing capabilities of the FPGA.
Lookup Table: Precomputing the CRC values for all possible inputs and storing them in a lookup table for quick retrieval.

Here are the details and constraints of my project:

The FPGA model is Xilinx Alveo U280.
The FPGA has a sufficient amount of logic resources and memory.
The data sizes can vary, ranging from small (8-bit) to large (potentially multi-kilobyte streams).
Speed is a critical factor, and I need the CRC computation to be as fast as possible.
Memory usage should be efficient, but I am willing to allocate a reasonable amount of memory for performance gains.

I would appreciate insights on the following points:

Which approach is generally faster for CRC computation on an FPGA, specifically the Xilinx Alveo U280?

How do the two methods compare in terms of scalability and resource usage on this FPGA?

Are there any hybrid approaches or optimizations that could combine the benefits of both methods?

Any advice, examples, or references to relevant resources would be greatly appreciated. Thank you!

  • Why is this question tagged as C++ and HLSL (Microsoft's High-Level Shader Language for DirectX)? Are either of those involved with your question? Also, just as a heads-up, Stack Overflow is generally for computer programming questions. For FPGAs, you might be better served by the Electrical Engineering SE site, as FPGA gateware design is generally very different from computer programming and much more like digital circuit design. Commented Jun 21, 2024 at 2:03
  • @reirab Yeah, sorry, by HLS, I meant High-level synthesis. Commented Jun 21, 2024 at 2:41

1 Answer


I'm not seeing how they are not both custom logic. Everything in an FPGA is custom logic.

What you might mean by your first alternative would be a classic shift-register implementation, which would process one bit of input per cycle.
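That shift-register (LFSR) approach can be modeled in software. The sketch below is a C model, not HDL: the inner loop corresponds to one fabric cycle per input bit. It assumes the common reflected CRC-32 (polynomial 0xEDB88320, init and final xor of all ones); the function name is illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of a bit-serial (LFSR) CRC-32: each inner-loop
   iteration corresponds to one "cycle" of a shift-register
   implementation, consuming one input bit. */
uint32_t crc32_bitserial(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;               /* init: all ones */
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)           /* one bit per "cycle" */
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;                              /* final xor: all ones */
}
```

At one bit per cycle, a multi-kilobyte stream takes thousands of cycles, which is why wider-per-cycle variants exist.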

For your second alternative, you could implement a table lookup that processes eight bits of input per cycle, using a table of 256 entries by the number of bits in the CRC.
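A software model of that byte-wide lookup, again assuming CRC-32: the 256-entry table (256 x 32 bits, a small ROM/BRAM in fabric) lets each step consume a full byte, so one loop iteration corresponds to one cycle at eight bits per cycle. Function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* 256-entry lookup table: table[b] holds the CRC-32 contribution of
   byte b. In hardware this is a 256 x 32-bit ROM/BRAM. */
static uint32_t crc_table[256];

static void crc32_init_table(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (uint32_t)-(int)(c & 1));
        crc_table[n] = c;
    }
}

/* One loop iteration = one byte = one "cycle" at 8 bits/cycle. */
uint32_t crc32_table(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc_table[(crc ^ data[i]) & 0xFF];
    return ~crc;
}
```

Wider datapaths follow the same pattern: processing N bytes per cycle can use N tables (or, in fabric, a fixed XOR network derived from the polynomial, since the table contents are constants the synthesizer can flatten into LUTs).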

You need to be quantitative about the speed you require. "As fast as possible" is not a valid requirement. Is one bit per cycle fast enough? If not, how about eight? If not that, then what does your application require?


10 Comments

Finally, for the lookup-table method, how feasible is it to scale this up to handle larger data sizes without excessive memory usage?
Slower with fewer gates, vs. faster with more gates.
Stop saying things like "as fast as possible", and "maximize". How many bits do you need to process per cycle? One? A hundred? A million? You need a number.
@Arash - custom versus table - I'm wondering how native instructions like x86's CRC32 are implemented in hardware, and how these would compare to a hardware-based table lookup. Maybe Mark Adler would know this.
@Arash Correct. You can keep adding more and more gates, processing more of the data in parallel, increasing the number of bits processed per cycle to as many as you like. There are ways to combine the CRCs from multiple paths to compute the total CRC.
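The combining step mentioned in that comment can be sketched in software. The model below follows the GF(2) matrix approach used by zlib's crc32_combine: it advances crc1 across len2 zero-fed bytes and then xors in crc2, so the CRCs of two independently processed lanes merge into the CRC of the concatenated stream. Function names here are illustrative, and a bitwise crc32 is included so the example is self-contained.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise reference CRC-32 (polynomial 0xEDB88320, zlib-compatible). */
static uint32_t crc32(uint32_t crc, const unsigned char *buf, size_t len) {
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;
}

/* Multiply a 32x32 GF(2) matrix (array of 32 column vectors) by a vector. */
static uint32_t gf2_times(const uint32_t *mat, uint32_t vec) {
    uint32_t sum = 0;
    while (vec) {
        if (vec & 1) sum ^= *mat;
        vec >>= 1;
        mat++;
    }
    return sum;
}

/* square = mat * mat over GF(2). */
static void gf2_square(uint32_t *square, const uint32_t *mat) {
    for (int n = 0; n < 32; n++) square[n] = gf2_times(mat, mat[n]);
}

/* Combine CRCs of two concatenated blocks (modeled on zlib crc32_combine):
   advance crc1 over len2 zero-fed bytes, then xor with crc2. */
uint32_t crc32_combine(uint32_t crc1, uint32_t crc2, size_t len2) {
    uint32_t even[32], odd[32], row = 1;
    if (len2 == 0) return crc1;
    odd[0] = 0xEDB88320u;                 /* operator for one zero bit */
    for (int n = 1; n < 32; n++) { odd[n] = row; row <<= 1; }
    gf2_square(even, odd);                /* two zero bits */
    gf2_square(odd, even);                /* four zero bits */
    do {                                  /* apply len2 zero BYTES */
        gf2_square(even, odd);
        if (len2 & 1) crc1 = gf2_times(even, crc1);
        len2 >>= 1;
        if (!len2) break;
        gf2_square(odd, even);
        if (len2 & 1) crc1 = gf2_times(odd, crc1);
        len2 >>= 1;
    } while (len2);
    return crc1 ^ crc2;
}
```

In fabric the same idea appears as a fixed XOR network per lane, since the "advance by N zero bytes" operator is a constant matrix for a fixed lane width.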
