Failed to run in mk mode when batch_size is greater than 1 #2

@zhendonghua

Labels: bugs, help needed

Issue Description

I can't run the benchmark script in mk mode when batch_size is greater than 1. I am using Llama-3.2-1B-Instruct with batch_size=2; all other ScriptConfig parameters are left at their default values.
For example, the following command fails:

python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100 batch_size=2          

The traceback is shown below.

Traceback (most recent call last):
  File "/root/Megakernels/megakernels/scripts/generate.py", line 211, in <module>
    pydra.run(main)
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 146, in run
    return _apply_overrides_and_call(fn, first_arg_type, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 118, in _apply_overrides_and_call
    return fn(config)
           ^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/scripts/generate.py", line 174, in main
    gen.generate(output_tokens, prompt_len, config.ntok - 1)
  File "/root/Megakernels/megakernels/generators.py", line 165, in generate
    output_ids = self.run(input_ids, pos_id=pos_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/generators.py", line 132, in run
    self.schedule.globs.hidden_states[:] = hiddens.squeeze(1)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: expand(CUDABFloat16Type{[2, 2048]}, size=[2048]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

Potential cause

I think the problem lies in the shape of BaseGlobals.hidden_states, which is initialized in the make_global() function of Megakernels/megakernels/demos/latency/scheduler.py:

hidden_states=make_buffer(config.hidden_size)

Because config.hidden_size is a model-specific constant (call it hidden_size), hidden_states has only one dimension, with shape (hidden_size,). If the batch size is n > 1, then in the run function of MK_Generator, input_ids has shape (n, 1) and hiddens has shape (n, 1, hidden_size). After squeeze(1), hiddens has shape (n, hidden_size), which cannot be copied into self.schedule.globs.hidden_states, whose shape is (hidden_size,).
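
The shape mismatch can be reproduced outside the repository with plain PyTorch on CPU. The sketch below is only an illustration: try_assign and the torch.zeros buffer are hypothetical stand-ins for globs.hidden_states and the embedding output, not the actual Megakernels code, and 2048 is taken from the shape in the traceback.

```python
import torch

HIDDEN_SIZE = 2048  # taken from the [2, 2048] shape in the traceback


def try_assign(buffer_shape, batch_size):
    """Return None if the copy succeeds, else the RuntimeError message."""
    # Hypothetical stand-in for globs.hidden_states
    # (the real buffer is created by make_buffer in scheduler.py).
    buffer = torch.zeros(buffer_shape, dtype=torch.bfloat16)
    # Hypothetical stand-in for the embedding output,
    # shape (batch_size, 1, hidden_size).
    hiddens = torch.ones(batch_size, 1, HIDDEN_SIZE, dtype=torch.bfloat16)
    try:
        # Same assignment pattern as generators.py line 132.
        buffer[:] = hiddens.squeeze(1)
    except RuntimeError as err:
        return str(err)
    return None


# 1-D buffer, batch of 2: a (2, 2048) tensor cannot be copied into (2048,).
err_1d = try_assign((HIDDEN_SIZE,), batch_size=2)
# 2-D buffer with a leading batch dimension: shapes match exactly.
err_2d = try_assign((2, HIDDEN_SIZE), batch_size=2)

print("1-D buffer:", err_1d)
print("2-D buffer:", err_2d)
```

A buffer with a leading batch dimension accepts the assignment at the PyTorch level, but that alone may not be a fix: the kernel that reads hidden_states may itself assume a single sequence, which this sketch does not check.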

Environment

  • GPU: H800
  • OS: Linux x86_64
  • CUDA: 12.8
  • Python: 3.12
