
I have a badly optimized palette-cycling function in my background shader:

Shader "Background/Earthbound"
{
// 6 things to solve
// X palette cycling
// X background scrolling
// X horizontal oscillation
// X vertical oscillation
// X interleaved oscillation
// X transparency

Properties
{
    [Toggle] _Blend("Blend?", int) = 0

    [Header(Texture A)]
    _TexA ("Texture", 2D) = "white" {}      // ensure "Repeat" wrap mode
    _PaletteA("Palette Cycle", 2D) = "white" {} // ensure "Clamp" wrap mode
    [Enum(None,0,Horizontal,1,Interleaved,2,Vertical,3)] _OscillationVariantA("Oscillation Variant", int) = 0
    _ScrollDirXA("Scroll Direction X", float) = 1
    _ScrollDirYA("Scroll Direction Y", float) = 1
    _ScrollSpeedA("Scroll Speed", float) = 0
    _OscillationSpeedA("Oscillation Speed", float) = 1
    _OscillationAmplitudeA("Oscillation Amplitude", int) = 32
    _OscillationDelayA("Oscillation Delay", int) = 1

    [Header(Texture B)]
    _TexB("Texture", 2D) = "white" {}
    _PaletteB("Palette Cycle", 2D) = "white" {}
    [Enum(None,0,Horizontal,1,Interleaved,2,Vertical,3)] _OscillationVariantB("Oscillation Variant", int) = 0
    _ScrollDirXB("Scroll Direction X", float) = 1
    _ScrollDirYB("Scroll Direction Y", float) = 1
    _ScrollSpeedB("Scroll Speed", float) = 0
    _OscillationSpeedB("Oscillation Speed", float) = 1
    _OscillationAmplitudeB("Oscillation Amplitude", int) = 32
    _OscillationDelayB("Oscillation Delay", int) = 1
}
SubShader
{
    Tags { "RenderType"="Opaque" }
    LOD 100
        ...
        ...
        ...
        // palette cycling (too expensive right now...)
        float4 paletteCycle(float4 inCol, sampler2D paletteCycle, float paletteCount)
        {
            float4 outCol = inCol;

            int paletteIndex = -1;
            for (int i = 0; i < paletteCount; i++)
            {
                if (inCol.a == tex2D(paletteCycle, float2(i / paletteCount, 0)).a) // match alpha values (greyscale)
                {
                    paletteIndex = i;
                }
            }
            if (paletteIndex >= 0)
            {
                int paletteOffset = (paletteIndex + _Time.y * 12) % paletteCount;
                outCol = tex2D(paletteCycle, float2(paletteOffset / paletteCount, 0));
            }
            return outCol;
        }
}

I use two grayscale sprites for the background animation: the main background (256x256) with the "Repeat" wrap mode and the palette (17x1) with the "Clamp" wrap mode.

How can I optimize it?

Unity Version: 2020.

  • paletteCount is a float, while I think it should be an integer. Note that division by a non-constant variable is quite expensive, and modulo is even more expensive. Can't you use a compile-time constant? If you cannot, then computing 1.0f / paletteCount once and using multiplications in the loop should be faster. Commented Apr 7 at 18:44
  • Be aware that conditionals are rather expensive on GPUs, especially when divergence happens (I don't know how to avoid them here). Can't you use a break in the loop (assuming the condition is often taken)? If you need the last match, you could iterate in reverse order. Commented Apr 7 at 18:52
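The two comments' suggestions (hoist the reciprocal out of the loop, then walk backwards and stop at the first hit) could be combined roughly as below. This is a sketch assuming the count stays a runtime value in a hypothetical float uniform `_PaletteCount` and that the palette is the question's `_PaletteA`. Sampling at texel centres, `(i + 0.5) / count`, is an addition not in the question; with point sampling it keeps each lookup inside texel `i`. Whether the `break` actually saves time depends on how many neighbouring pixels exit together; the answer below argues that on SIMD hardware it usually won't.

```hlsl
sampler2D _PaletteA;   // palette texture from the Properties block
float _PaletteCount;   // hypothetical uniform holding the entry count, e.g. 17

// Returns the index of the last palette entry whose alpha equals grey, or -1.
int findLastPaletteIndex(float grey)
{
    // One division per pixel instead of one per iteration.
    float invCount = 1.0 / _PaletteCount;
    int paletteIndex = -1;
    // Walking backwards, the first hit is the last match of the forward loop.
    [loop]
    for (int i = (int)_PaletteCount - 1; i >= 0; i--)
    {
        float u = (i + 0.5) * invCount; // centre of texel i
        // tex2Dlod samples an explicit mip (0), so no gradients are needed in the loop
        if (grey == tex2Dlod(_PaletteA, float4(u, 0.5, 0, 0)).a)
        {
            paletteIndex = i;
            break;
        }
    }
    return paletteIndex;
}
```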

1 Answer


1st: for loops are bad unless you only do a small number of iterations. You can cap them with [unroll(max number of iterations)]. Don't pass paletteCount into the function; use a constant instead, which makes it clearer to the compiler that it can optimize the loop as a fixed-length one.
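A sketch of point 1, assuming the palette size is fixed at 17 and baked into a hypothetical `static const` named `PALETTE_COUNT`. With a constant trip count a plain `[unroll]` is enough, and the per-iteration division becomes a literal. `tex2Dlod` with level 0 is used so the unrolled samples need no screen-space derivatives.

```hlsl
sampler2D _PaletteA; // palette texture from the Properties block

// Hypothetical compile-time palette size (the question's 17x1 palette).
static const int PALETTE_COUNT = 17;

// Returns the index of the last palette entry whose alpha equals grey, or -1.
int findPaletteIndexUnrolled(float grey)
{
    int paletteIndex = -1;
    // Constant trip count: the compiler can fully unroll the loop and turn
    // every (i + 0.5) / PALETTE_COUNT into a constant.
    [unroll]
    for (int i = 0; i < PALETTE_COUNT; i++)
    {
        float u = (i + 0.5) / PALETTE_COUNT; // centre of texel i
        float entry = tex2Dlod(_PaletteA, float4(u, 0.5, 0, 0)).a;
        paletteIndex = (grey == entry) ? i : paletteIndex;
    }
    return paletteIndex;
}
```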

2nd: a comment suggested using break in the loop. I respect the desire to help, but honestly that isn't the case here and it won't help: fast-path (early-exit) optimizations generally don't pay off on a GPU (there are exceptions, but only at workgroup scale in compute shaders). A GPU is a SIMD device, so you should measure the end of the task by its slowest possible thread.

3rd: texture sampling isn't especially fast, particularly when you sample the same texture many times manually, as in your example. You also use sampler2D, which bundles the texture and the sampler together; use a separate Texture2D and SamplerState instead, because the number of samplers is limited (16 per shader stage on Direct3D 11, far fewer than the number of textures), so at least make it a habit to use the better way.
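As an illustration of point 3, assuming Unity on a Direct3D 11-class target (separate texture and sampler objects are not available on GLES2): the four textures from the question's Properties block are declared without samplers and share two sampler states. The sampler names follow Unity's inline sampler-state convention, where the filter and wrap mode are read from the name; `sampleBackgrounds` is a hypothetical helper.

```hlsl
// Texture objects only; names match the question's Properties block.
Texture2D _TexA;
Texture2D _TexB;
Texture2D _PaletteA;
Texture2D _PaletteB;

// Unity "inline" sampler states: filter and wrap mode are taken from the name.
// Two sampler slots serve all four textures.
SamplerState point_repeat_sampler; // tiled 256x256 backgrounds
SamplerState point_clamp_sampler;  // 17x1 palettes

// Hypothetical helper: both background layers share one sampler.
void sampleBackgrounds(float2 uvA, float2 uvB, out float4 colA, out float4 colB)
{
    colA = _TexA.Sample(point_repeat_sampler, uvA);
    colB = _TexB.Sample(point_repeat_sampler, uvB);
}
```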

4th: sample at an explicit texture LOD. For a background, the appropriate LOD clearly depends on your screen resolution. tex2D is not optimal here because it computes the mip level from screen-space derivatives and, with trilinear filtering, samples the two best mip levels and interpolates between them, which doesn't work well for palette lookups. Use a point-clamp sampler and Texture2D.SampleLevel() instead.
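A sketch of point 4 for the palette lookup, assuming a `Texture2D`/`SamplerState` pair declared as in Unity's inline-sampler convention; `paletteEntry` is a hypothetical helper. Passing an explicit level of 0 skips the derivative-based mip selection, and point filtering keeps neighbouring palette entries from being blended.

```hlsl
Texture2D _PaletteA;
SamplerState point_clamp_sampler; // Unity inline sampler: point filter, clamp wrap

// Fetch palette entry `index` from mip 0 with point filtering:
// no derivative-based LOD selection and no blending of neighbouring entries.
float4 paletteEntry(int index, float paletteCount)
{
    float u = (index + 0.5) / paletteCount; // centre of texel `index`
    return _PaletteA.SampleLevel(point_clamp_sampler, float2(u, 0.5), 0);
}
```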

5th: reduce your texture format. Use the smallest one possible: if the texture can be just a mask where each specific color is bound to one of 256 values, that's ideal, because you can pack it into an 8-bit single-channel texture that is efficient to sample.
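Taken one step further, the 8-bit channel can store the palette index itself, so the search loop disappears and a single sample yields the index. A minimal sketch, assuming the background is re-authored so each pixel's byte equals its palette index (0 to 16) and imported as an uncompressed Alpha8 texture without mipmaps; `paletteIndexAt` is a hypothetical helper.

```hlsl
Texture2D _TexA;                   // assumed Alpha8 import: byte value = palette index
SamplerState point_repeat_sampler; // Unity inline sampler: point filter, repeat wrap

// Recover the integer palette index stored in an 8-bit UNORM channel.
// The hardware returns byte / 255, so scale back and round to the nearest integer.
int paletteIndexAt(float2 uv)
{
    // Mip 0 only: averaging two indices would produce a meaningless third one.
    float stored = _TexA.SampleLevel(point_repeat_sampler, uv, 0).a;
    return (int)round(stored * 255.0);
}
```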

6th: use fewer comparisons. They're just not good; it won't change much, but still: if the sentinel value is -1, check != -1 rather than >= 0. I replaced all ifs with ?: expressions, because that makes it clearer what will actually happen on the GPU.

I've put some of my thoughts into code; check it out, maybe it will make things clearer. You also have a time dependency, and I'm not sure how it should work in your case: why not increase the speed of time instead of doing iterations? And why implement a search algorithm inside the shader at all? Maybe you can pass something in from outside, such as a frame-specific value, and sample only one texture.
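One way to drop the search entirely, assuming each pixel's palette index is already known (for example from an index texture as in point 5): the time dependency reduces to a single integer shift, and only one palette sample is needed. `PALETTE_COUNT`, `CYCLE_RATE` and `cyclePalette` are hypothetical names; the rate of 12 steps per second comes from the question's `_Time.y * 12`. This assumes the built-in render pipeline, where `_Time` comes from UnityCG.cginc.

```hlsl
#include "UnityCG.cginc" // provides _Time (built-in pipeline)

Texture2D _PaletteA;
SamplerState point_clamp_sampler; // Unity inline sampler: point filter, clamp wrap

static const int PALETTE_COUNT = 17;  // hypothetical: entries in the 17x1 palette
static const float CYCLE_RATE = 12.0; // palette steps per second, as in the question

// Shift a known palette index by the elapsed number of cycle steps
// and fetch the resulting colour: one sample, no search loop.
float4 cyclePalette(int index)
{
    int shift = (int)floor(_Time.y * CYCLE_RATE);
    int cycled = (index + shift) % PALETTE_COUNT; // both operands are non-negative
    float u = (cycled + 0.5) / PALETTE_COUNT;      // centre of texel `cycled`
    return _PaletteA.SampleLevel(point_clamp_sampler, float2(u, 0.5), 0);
}
```

If you would rather pass the shift in from a script, the `_Time` computation can be replaced by a uniform updated once per frame.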

Sorry if I made any typos or other mistakes:

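...
SamplerState pointClampSampler;
uint paletteCount;

float4 paletteCycle(float4 inCol, Texture2D paletteCycle, uint lod) {
    int paletteIndex = -1;
    // float reciprocal: with a uint paletteCount, i / paletteCount would be integer division (always 0)
    float invCount = 1.0 / paletteCount;
    // every iteration is a texture sample, so keep the palette small;
    // the cap must still be >= paletteCount (17 in the question) or some entries are never checked
    [unroll(17)]
    for (uint i = 0; i < paletteCount; i++) {
        // match alpha values (greyscale); might fail if inCol is calculated dynamically and differs slightly
        float entry = paletteCycle.SampleLevel(pointClampSampler, float2((i + 0.5) * invCount, 0.5), lod).a;
        paletteIndex = (inCol.a == entry) ? (int)i : paletteIndex;
    }

    uint paletteOffset = (uint)(paletteIndex + _Time.y * 12) % paletteCount;
    // Texture2D has no tex2D overload: sample it with a SamplerState
    float4 outCol = paletteCycle.SampleLevel(pointClampSampler, float2((paletteOffset + 0.5) * invCount, 0.5), lod);
    return (paletteIndex != -1) ? outCol : inCol;
}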

4 Comments

First of all, thanks for such a detailed answer. I tried this solution but got an error: Shader error in 'Background/EarthboundOPT': 'tex2D': no matching 2 parameter intrinsic function; Possible intrinsic functions are: tex2D(sampler2D, float2|half2|min10float2|min16float2) tex2D(sampler2D, float2|half2|min10float2|min16float2, float2|half2|min10float2|min16float2, float2|half2|min10float2|min16float2) at line 141 (on d3d11)
That's because tex2D operates on a sampler2D, which we replaced; use Texture2D.Sample() instead. Sorry, my bad, I didn't notice that statement for some reason. learn.microsoft.com/en-us/windows/win32/direct3dhlsl/…
Well, I fixed it, and it works nicely and is better optimized, but the image doesn't repeat when I change the tiling values...
For the tiled background texture, use SamplerState pointRepeatSampler instead of SamplerState pointClampSampler.
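A sketch of this fix for the tiled layer, assuming Unity's usual `_TexA_ST` tiling/offset vector (the one `TRANSFORM_TEX` reads) and a hypothetical helper `sampleTiledBackground`. The palette keeps its clamp sampler; only the background needs Repeat, so UVs scaled past 1 wrap around instead of stretching the edge texels.

```hlsl
Texture2D _TexA;
float4 _TexA_ST;                  // tiling (xy) and offset (zw) from the material inspector
SamplerState pointRepeatSampler;  // Unity inline sampler: point filter, repeat wrap

float4 sampleTiledBackground(float2 uv)
{
    // Same arithmetic as TRANSFORM_TEX(uv, _TexA).
    float2 tiledUV = uv * _TexA_ST.xy + _TexA_ST.zw;
    return _TexA.Sample(pointRepeatSampler, tiledUV);
}
```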
