How can I schedule multiple inputs into an instantiated SystemVerilog module?

Question

I am trying to build a module that takes a 32-bit input (the width is parameterised) and outputs the cube of the input. The naive approach would be the following:

module cuber #(
    BW = 32
) (
    input logic [BW-1:0] in0,
    output logic [BW*2-1:0] cubed_op
);

logic [BW*2-1:0] inter_l;

fast_mul #(.BW(BW))
fm_inst_1 (
  .input1(in0),
  .input2(in0),
  .product(inter_l)
);

fast_mul #(.BW(BW))
fm_inst_2 (
  .input1(inter_l),
  .input2(in0),
  .product(cubed_op)
);

endmodule

But I want to know if I can reuse the fm_inst_1 multiplier to perform both multiplications.

I am trying to use a FIFO to schedule the inputs but I can't wrap my head around how the multiplier would perform these multiplications. I then tried writing the first multiplication's output back to an intermediate register, hoping the multiplier would reuse it, but I am sure there is a better way to do this.

Hey @Mikef, the fast_mul is a Karatsuba-Ofman multiplier. I picked it up from this repo: github.com/JC-S/Karatsuba_multiplier_HDL/blob/master/rtl/…
– Suhas, Commented Jun 12, 2024 at 5:20

Mikef · Accepted Answer · 2024-07-07 03:47:17Z

"I am sure there is a better way to do this."

There is no need for multiple instances of a structural multiply in the context you have presented.
Behavioral modeling works in simulation and synthesis workflows.

A synthesis workflow will typically infer FPGA DSP blocks for the two multiplies, arranged roughly in cascade.

"I am trying to use a FIFO to schedule the inputs"

A FIFO has nothing to do with the multiply. Use a FIFO when you need the buffering/delay/accumulation behavior of a FIFO. Use a register to store intermediate values if needed.

If you prefer structural design, create two instances of the multiplier and you have the basic structure for the cuber. Most FPGAs have several DSP blocks available. Instantiating two modules will not re-use one physical resource.

If you want to re-use a single multiplier, you will need a state machine to act as a controller, some registers, and a multiplexer at the input that selects whether the second multiplier input comes from the DUT module input or from the registered intermediate value. The design would also require some sort of flag or strobe telling it when a new value arrives.

You have at least one error in what you posted.
Cubing N bits produces 3N bits, not 2N.

If you prefer to perform the muls using structural modeling, the output of the first is 32 bits times 32 bits which is 64 bits. The output of the second is 32 bits times 64 bits which is 96 bits.
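A minimal sketch of those structural widths, in case it helps. The module `mul_ab` is a hypothetical stand-in with independent operand widths (the `fast_mul` in your question takes a single BW parameter, so it would need adapting), and `cuber_struct`, `u_square` and `u_cube` are names I made up for illustration.

```systemverilog
// Hypothetical stand-in for a structural multiplier with independent operand
// widths; the fast_mul in the question takes a single BW parameter instead.
module mul_ab #(
    parameter int AW = 32,
    parameter int BW = 32
) (
    input  logic [AW-1:0]    a,
    input  logic [BW-1:0]    b,
    output logic [AW+BW-1:0] product
);
  assign product = a * b;  // product is AW+BW bits wide, so nothing is truncated
endmodule

module cuber_struct #(
    parameter int BW = 32
) (
    input  logic [BW-1:0]   in0,
    output logic [BW*3-1:0] cubed_op
);
  logic [BW*2-1:0] squared;  // BW x BW -> 2*BW bits

  mul_ab #(.AW(BW), .BW(BW)) u_square (
    .a(in0), .b(in0), .product(squared)
  );

  // 2*BW x BW -> 3*BW bits
  mul_ab #(.AW(BW*2), .BW(BW)) u_cube (
    .a(squared), .b(in0), .product(cubed_op)
  );
endmodule
```

With BW = 32 this gives the 64-bit intermediate and the 96-bit result described above.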

If the problem definition needed a parameterized power, then a structural model might be better because you could use a generate loop to create as many multiplier stages as the power parameter requires.
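A rough sketch of that generate-loop idea, under my own assumptions: the module name `power`, the exponent parameter `P` (assumed to be at least 1) and the `stage` array are all hypothetical, and each stage uses the behavioral * operator where a structural multiplier instance could be dropped in.

```systemverilog
// Hypothetical sketch: raise in0 to the power P using P-1 chained multiplies.
module power #(
    parameter int BW = 32,  // input width in bits
    parameter int P  = 3    // exponent, assumed >= 1
) (
    input  logic [BW-1:0]   in0,
    output logic [BW*P-1:0] pow_op
);

  // stage[i] holds in0 ** (i + 1). Every stage is declared at the full
  // output width for simplicity; the upper bits of early stages are zero.
  logic [BW*P-1:0] stage [P];

  assign stage[0] = in0;  // zero-extended to BW*P bits

  for (genvar i = 1; i < P; i++) begin : g_mul
    // Both operands are extended to the BW*P-bit width of the target
    // before the multiply, so the product is not truncated.
    assign stage[i] = stage[i-1] * in0;
  end

  assign pow_op = stage[P-1];

endmodule
```

Instantiating it with P = 3 gives the cuber; the loop then creates two multiply stages.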

If the design had a high-speed clock, then a structural model might be better because the output of each stage could be registered using flip-flops for timing closure/performance.
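Something along these lines is what I mean by registering each stage. This is only a sketch: `cuber_pipe`, `sq_q` and `in_q` are hypothetical names, there is no reset or valid signal, and whether the registers end up inside the DSP blocks depends on your tool and device.

```systemverilog
// Hypothetical sketch: two-stage pipelined cuber with registered stages.
module cuber_pipe #(
    parameter int BW = 32
) (
    input  logic            clk,
    input  logic [BW-1:0]   in0,
    output logic [BW*3-1:0] cubed_op
);

  logic [BW*2-1:0] sq_q;  // registered square, 2*BW bits
  logic [BW-1:0]   in_q;  // input delayed one clock to stay aligned with sq_q

  always_ff @(posedge clk) begin
    sq_q     <= in0 * in0;    // stage 1: square
    in_q     <= in0;
    cubed_op <= sq_q * in_q;  // stage 2: cube of the input sampled one edge earlier
  end

endmodule
```

A new input can be applied on every clock, and each result appears after the second clock edge following its input.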

The best model depends on the context.

Here is a behavioral model of the unsigned cuber, which I like better than what you did in the context you presented because it's more concise.

module cuber #(
    BW = 32
) (
  input  logic [BW       -  1:0] in0,
  output logic [(BW * 3) -  1:0] cubed_op
);

  always_comb
    cubed_op = in0 * in0 * in0;

endmodule

A small sim of this produces

time = 0, in0 =        2, cubed_op =               8, log2 cubed =   3
time = 1, in0 =        4, cubed_op =              64, log2 cubed =   6
time = 2, in0 =        8, cubed_op =             512, log2 cubed =   9
time = 3, in0 =       16, cubed_op =            4096, log2 cubed =  12
time = 4, in0 =      256, cubed_op =        16777216, log2 cubed =  24
time = 5, in0 =     2048, cubed_op =      8589934592, log2 cubed =  33
time = 6, in0 = 4294967295, cubed_op = 79228162458924105385300197375, log2 cubed =  96

I printed the log base 2 to show roughly how many bits the cube needs.

The last vector, at time 6, is the max value of the input (2**32 - 1), so that you can see it works for big numbers and takes 3N bits.

Here is the state machine version, which uses only a single multiply. The mul performs the square in the first clock, then the cube in the second. The design accepts a new input every other clock (a 50% duty cycle).

module cuber
   (input logic [7:0] data_in,
    input logic val_in,
    input logic clk,
    input logic rst,
    output logic val_out,
    // NOTE: 16 bits only holds cubes up to 65535; a full 8-bit cube needs 24 bits
    output logic [15:0] mul_out);
  
  // locals
  typedef enum logic [1:0] {SQUARE=2'b00,CUBE=2'b01} T_SM_ENUM;
  //  
  logic [31:0] mul_in_32; 
  logic [63:0] mul_in_64;
  logic [95:0] mul_out_96;
  logic        val_del1,val_del2;
  logic        mux_sel_out64_nout96;
  T_SM_ENUM    current_state, next_state;
  
  // rename
  assign mul_in_32 = data_in;
  
  // mux: select 1 feeds data_in (square), select 0 feeds the registered product (cube)
  always_comb
    if(mux_sel_out64_nout96)
      mul_in_64 = data_in;
    else
      mul_in_64 = mul_out_96[63:0];
  
  // SM Combinational proc
  always_comb begin :SM
      // outputs
      mux_sel_out64_nout96 = 0;
      // NS
      next_state = current_state;
      
      case(current_state)
        SQUARE: begin
          // outputs
          if(val_in)
            mux_sel_out64_nout96 = 1;
          // NS
          if(val_in)
            next_state = CUBE;
        end
        
        CUBE: begin
          // outputs: select stays 0 so the mux feeds the registered square
          // NS: the cube takes one clock, so always return to SQUARE
          next_state = SQUARE;
        end
        
        default:
          next_state = SQUARE;
      endcase
    end :SM
  
  // sync proc general use
  always_ff @(posedge clk)
    if(rst) begin
      current_state <= SQUARE;  // nonblocking, like every other assignment here
      val_del1 <= 0;
      val_del2 <= 0;
    end
    else begin
      current_state <= next_state;
      val_del1 <= val_in;
      val_del2 <= val_del1;
    end
      
  // sync proc for single mul
  always_ff @(posedge clk)
    if(rst) 
       mul_out_96 <= '0;
     else       
       mul_out_96 <= mul_in_32 * mul_in_64;
  
  // rename (mul_out keeps only the low 16 bits of the 96-bit product)
  assign mul_out = mul_out_96;
  assign val_out = val_del2;
  
endmodule

And a '$monitor' of the results:

# time =   0, reset = 1,val_in = 0, data_in =  x, mul_out =     x, val_out = x
# time =   5, reset = 0,val_in = 0, data_in =  x, mul_out =     0, val_out = 0
# time =  15, reset = 0,val_in = 0, data_in =  x, mul_out =     x, val_out = 0
# time =  35, reset = 0,val_in = 1, data_in =  4, mul_out =     x, val_out = 0
# time =  45, reset = 0,val_in = 0, data_in =  4, mul_out =    16, val_out = 0
# time =  55, reset = 0,val_in = 1, data_in = 16, mul_out =    64, val_out = 1
# time =  65, reset = 0,val_in = 0, data_in = 16, mul_out =   256, val_out = 0
# time =  75, reset = 0,val_in = 0, data_in = 16, mul_out =  4096, val_out = 1
# time =  85, reset = 0,val_in = 0, data_in = 16, mul_out =     0, val_out = 0

The test drives 4 as a vector, and the DUT produces 64 two clocks later. It then drives 16, and the DUT produces 4096 two clocks later.

Ah yes, a state machine makes sense. The reason I am looking to reuse the same instance is that this cuber is part of a larger algorithm that operates on larger bit sizes (>256 bits). I could use the simple * operator, but for large bit sizes it takes up too much of the DSP resources and I am looking to use them as efficiently as possible. But thank you for the response, I will look into creating a controller using a state machine and get back to you. Have a nice day!
