
Discussion

HuggingFace accelerate's init_empty_weights() properly loads all text encoders I tested to the PyTorch meta device and consumes no apparent memory or disk space while loaded.

However, it worked differently with both HuggingFace diffusers models I tried (Flux and Stable Diffusion XL). They loaded to either the CPU or CUDA device, and memory was consumed, as seen in Windows 11 Performance Monitor.

Is init_empty_weights() implemented differently, incorrectly, or not at all for diffusers as compared with text encoders?
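For context, here is a minimal pure-PyTorch sketch of the behavior init_empty_weights() is supposed to produce. The toy nn.Linear is just an illustration, not one of the models above; inside the context, newly created parameters land on the meta device and occupy no real memory.

```python
import torch
import torch.nn as nn

# Roughly what init_empty_weights() arranges: parameters created inside
# this context live on the "meta" device and consume no RAM or VRAM.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096)

print(layer.weight.device)   # meta
print(layer.weight.is_meta)  # True
```

This is what I expected from_pretrained() to preserve for the diffusers models as well.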

Code:

init_empty_weights() Works For Text Encoder

import torch
from accelerate import init_empty_weights
from transformers import T5EncoderModel

with init_empty_weights():
    text_encoder_2 = T5EncoderModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="text_encoder_2",
        torch_dtype=torch.float32
    )

text_encoder_2.device

Jupyter Notebook Response:

device(type='meta')

As expected, the model is loaded only to the meta device and Windows 11 Performance Monitor shows no additional RAM or VRAM usage.

init_empty_weights() Doesn't Seem to Work For Diffusers

init_empty_weights() Doesn't Seem to Work For Flux

from diffusers import FluxTransformer2DModel

with init_empty_weights():
    transformer = FluxTransformer2DModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        torch_dtype=torch.bfloat16
    )

transformer.device

Jupyter Notebook Response:

device(type='cpu')

Unexpectedly (to me), the model was loaded to CPU (not meta), and Windows 11 Performance Monitor shows a corresponding increase in RAM usage.

init_empty_weights() Doesn't Seem to Work For SDXL

from diffusers import StableDiffusionXLPipeline

with init_empty_weights():
    pipeline = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True
    )

pipeline.unet.device

Jupyter Notebook Response:

device(type='cpu')

Unexpectedly (to me), the model was loaded to CPU (not meta), and Windows 11 Performance Monitor shows a corresponding increase in RAM usage.

Background

If it's helpful to question answerers: I ask because I want to initialize models with empty weights so I can pass them to HuggingFace accelerate's infer_auto_device_map(), letting accelerate make a best guess as to which device each model layer should be loaded on. Loading the full models merely to obtain their shapes is slow. It is possible (although inconvenient) to load a full model, obtain its inferred device map, dump that device map to text, restart the Python kernel, assign the dumped text to a new device map, and finally use the new device map when loading the model a second time. An awkward workaround.
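The underlying idea can be shown with a toy pure-PyTorch sketch (the nn.Sequential here is a hypothetical stand-in for a large model, not Flux or SDXL): a meta-initialized model still exposes parameter shapes and dtypes, which is all a device-map planner like infer_auto_device_map() needs.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a large model, built on the meta device
# so no RAM is allocated for its weights.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

# Parameter shapes (and therefore byte sizes) are still queryable,
# which is exactly the information a device-map planner needs.
bytes_per_module = {
    name: sum(p.numel() * p.element_size() for p in module.parameters())
    for name, module in model.named_children()
}
print(bytes_per_module)  # {'0': 4198400, '1': 4198400}
```

Each Linear holds 1024*1024 weights plus 1024 biases at 4 bytes apiece, hence 4,198,400 bytes, with zero actual allocation.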

2 Comments

  • device_map="auto" isn't the reason that it's going to cuda, right? – micsthepick, Aug 17 at 23:27
  • @micsthepick Yes, you're right. That errant device_map="auto" caused it to go to CUDA rather than CPU. But still not meta, which I thought was the correct behavior. Edited with the correct call and response. – Aug 18 at 20:11

1 Answer


I ran into the same issue; here are some of my findings.
When I use from_pretrained, as in:
import torch
from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2",  # or "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

the device is 'cpu'.
However, when I instead initialize the model with from_config:

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "BAAI/Emu2",
    trust_remote_code=True
)

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

the device is 'meta', and CPU RAM usage does not increase noticeably.
One caveat to be aware of: regardless of whether you pass torch_dtype=torch.float16 to AutoConfig.from_pretrained, the model's weights still default to torch.float32. Therefore, if you need torch.float16, specify dtype=torch.float16 in infer_auto_device_map() so that it estimates memory correctly (if you intend to use that function), and then also specify dtype=torch.float16 when loading the checkpoints. Like so:

from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch

# specify dtype in infer_auto_device_map
device_map = infer_auto_device_map(
    model,
    max_memory={0: '18GiB', 1: '20GiB', 2: '20GiB', 3: '20GiB'},
    dtype=torch.float16,
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])

# specify dtype in load_checkpoint_and_dispatch
model = load_checkpoint_and_dispatch(
    model,
    "./BAAI/Emu2",
    device_map=device_map,
    dtype=torch.float16).eval()
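The float32 default is plain PyTorch behavior rather than anything specific to transformers, as a toy check shows (plain nn.Linear on the meta device, not Emu2):

```python
import torch
import torch.nn as nn

# Freshly constructed parameters use PyTorch's default dtype (float32),
# even on the meta device -- which is why infer_auto_device_map() would
# overestimate memory unless you pass dtype=torch.float16 explicitly.
with torch.device("meta"):
    toy = nn.Linear(8, 8)

print(toy.weight.dtype)                    # torch.float32
print(toy.weight.to(torch.float16).dtype)  # torch.float16
```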

1 Comment

You're absolutely correct about needing to insert the dtype into infer_auto_device_map(). When I do, I can specify a higher VRAM number without crashing the inference. And that higher VRAM utilization decreases inference time. That doesn't stand out in the API documentation: huggingface.co/docs/accelerate/v1.10.1/en/package_reference/…. However, they do make clear that dtype is an accepted parameter.
