Beginner integration questions: variant choice, instruction location, bistream behavior #1895

@Loopylo

Hi team, thanks a lot for CosyVoice 3.0 — the quality is really impressive.

We've been integrating Fun-CosyVoice3-0.5B-2512 into a Pipecat realtime pipeline on H100, and we've hit a few things that confused us during integration. Quite possibly we missed something obvious in the docs or examples — any clarification would be really appreciated, and we'd be happy to contribute back (docs PR, examples) once we understand the intended usage.

1. Which inference method should we use for our case?

We need: voice cloning (from a short reference) + streaming + some style/instruction control. We ended up choosing inference_zero_shot(..., stream=True, zero_shot_spk_id=...) after reading the source, but we're not 100% sure it's the right one vs inference_sft / inference_instruct / inference_cross_lingual. Did we pick correctly? Is there a recommendation matrix somewhere we missed?

2. Where should we pass the instruction/style prompt for a zero-shot voice?

We didn't find an instruction parameter on inference_zero_shot. Our current approach is to bake it into prompt_text at add_zero_shot_spk() time:

"You are a helpful assistant. {instruction}<|endofprompt|>{transcript}"

This means that changing the instruction forces a re-registration of the speaker, which re-runs feature extraction on the reference audio. Is this the intended pattern, or is there a per-call instruction mechanism for zero-shot that we missed?
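To make the pattern concrete, here is a minimal sketch of how the prompt string could be built and how re-registration could be limited to calls where the instruction actually changes. The names `build_prompt_text`, `PromptRegistry`, and the `register` callable are hypothetical and are not part of CosyVoice; `register` stands in for whatever wrapper ends up calling add_zero_shot_spk().

```python
END_OF_PROMPT = "<|endofprompt|>"


def build_prompt_text(instruction: str, transcript: str) -> str:
    """Compose prompt_text in the format quoted above (hypothetical helper)."""
    return f"You are a helpful assistant. {instruction}{END_OF_PROMPT}{transcript}"


class PromptRegistry:
    """Re-register a speaker only when its prompt_text changes (hypothetical)."""

    def __init__(self, register):
        # `register` is an assumed callable (spk_id, prompt_text) -> None that
        # would wrap the real registration call; it is not a CosyVoice API.
        self._register = register
        self._registered = {}

    def ensure(self, spk_id: str, instruction: str, transcript: str) -> bool:
        """Return True if a (re-)registration was triggered."""
        prompt_text = build_prompt_text(instruction, transcript)
        if self._registered.get(spk_id) == prompt_text:
            return False  # same instruction as before, nothing to redo
        self._register(spk_id, prompt_text)
        self._registered[spk_id] = prompt_text
        return True


# Usage with a stand-in registration function that only records its calls.
calls = []
registry = PromptRegistry(lambda spk_id, text: calls.append((spk_id, text)))
first = registry.ensure("agent", "Speak calmly.", "Hello there.")
repeat = registry.ensure("agent", "Speak calmly.", "Hello there.")
changed = registry.ensure("agent", "Speak cheerfully.", "Hello there.")
```

This only avoids redundant registrations; any change to the instruction still pays the full registration cost, which is the part we are asking about.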

3. Bistream sometimes produces zero audio chunks — likely user error on our side

We've seen the bistream path yield no audio in two situations, both probably our mistake:

  • When tts_text is an instance of an iterator class rather than a generator object returned by a function that uses yield (it looks like isinstance(tts_text, collections.abc.Generator) in frontend.py returns False for iterator classes)
  • When prompt_text in spk2info doesn't contain token 151646 (<|endofprompt|>)
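The first situation can be reproduced with the standard library alone. The sketch below shows that a plain iterator class fails the `collections.abc.Generator` check and that wrapping it in a small generator function passes it. `SentenceIterator`, `as_generator`, and `has_end_of_prompt` are hypothetical names used for illustration; the last one checks the literal marker string only and assumes that this is equivalent to the token 151646 check, which we have not verified.

```python
import collections.abc


class SentenceIterator:
    """A plain iterator class: defines __iter__/__next__ but not send/throw/close."""

    def __init__(self, sentences):
        self._sentences = list(sentences)
        self._index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._index >= len(self._sentences):
            raise StopIteration
        sentence = self._sentences[self._index]
        self._index += 1
        return sentence


def as_generator(iterable):
    """Wrap any iterable so the result is a real generator object."""
    yield from iterable


def has_end_of_prompt(prompt_text: str) -> bool:
    """String-level check for the marker (assumed to map to token 151646)."""
    return "<|endofprompt|>" in prompt_text


it = SentenceIterator(["Hello. ", "How are you?"])
is_gen_before = isinstance(it, collections.abc.Generator)  # iterator class

wrapped = as_generator(it)
is_gen_after = isinstance(wrapped, collections.abc.Generator)  # generator object
chunks = list(wrapped)
```

With this wrapper the text source itself does not need to change; only the object handed over as tts_text does.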

If these are known constraints, would a short note in the docstring be welcome? We'd gladly send a docs PR if helpful.

Bonus question: any update on Fun-CosyVoice 3.5 open-source plans (#1840)? Asking because the FreeStyle instruction control sounds perfect for our realtime conversational use case.

Thanks again — we know maintainer time is limited, so any pointer is appreciated.
