Beginner integration questions: variant choice, instruction location, bistream behavior

Hi team, thanks a lot for CosyVoice 3.0 — the quality is really impressive.

We've been integrating `Fun-CosyVoice3-0.5B-2512` into a Pipecat realtime pipeline on an H100, and a few things confused us along the way. Quite possibly we missed something obvious in the docs or examples. Any clarification would be really appreciated, and we'd be happy to contribute back (docs PR, examples) once we understand the intended usage.

**1. Which inference method should we use for our case?**

We need: voice cloning (from a short reference) + streaming + some style/instruction control. We ended up choosing `inference_zero_shot(..., stream=True, zero_shot_spk_id=...)` after reading the source, but we're not 100% sure it's the right one vs `inference_sft` / `inference_instruct` / `inference_cross_lingual`. Did we pick correctly? Is there a recommendation matrix somewhere we missed?

**2. Where should we pass the instruction/style prompt for a zero-shot voice?**

We didn't find an `instruction` parameter on `inference_zero_shot`. Our current approach is to bake it into `prompt_text` at `add_zero_shot_spk()` time:

```
"You are a helpful assistant. {instruction}<|endofprompt|>{transcript}"
```

This means that changing the instruction forces a re-registration via `add_zero_shot_spk()`, which re-runs feature extraction on the reference. Is this the intended pattern, or is there a per-call instruction mechanism for zero-shot that we missed?
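To make the pattern concrete, here is a minimal sketch of what we do today, under stated assumptions. `build_prompt_text` reproduces the template above; `SpeakerRegistry` and its `register` callback are hypothetical names of ours (not CosyVoice APIs) standing in for whatever wraps `add_zero_shot_spk()`. The only point is that registration runs once per (voice, instruction) pair rather than on every call.

```python
END_OF_PROMPT = "<|endofprompt|>"


def build_prompt_text(instruction: str, transcript: str) -> str:
    """Compose prompt_text using the template quoted above."""
    return f"You are a helpful assistant. {instruction}{END_OF_PROMPT}{transcript}"


class SpeakerRegistry:
    """Hypothetical helper: registers a speaker once per (voice, instruction)."""

    def __init__(self, register):
        # `register` is an assumed callable(prompt_text, spk_id) that would
        # wrap the real registration step in an actual integration.
        self._register = register
        self._ids = {}

    def ensure(self, voice: str, instruction: str, transcript: str) -> str:
        key = (voice, instruction)
        if key not in self._ids:
            # New instruction for this voice: build the prompt and register it.
            spk_id = f"{voice}-{len(self._ids)}"
            self._register(build_prompt_text(instruction, transcript), spk_id)
            self._ids[key] = spk_id
        return self._ids[key]


# Usage with a stand-in callback that only records what would be registered.
calls = []
registry = SpeakerRegistry(lambda prompt_text, spk_id: calls.append((spk_id, prompt_text)))
first = registry.ensure("alice", "Speak calmly.", "Hello there.")
again = registry.ensure("alice", "Speak calmly.", "Hello there.")
other = registry.ensure("alice", "Speak cheerfully.", "Hello there.")
```

Caching like this hides the cost for repeated instructions, but every new instruction still pays for a full registration, which is why we are asking about a per-call mechanism.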

**3. Bistream sometimes produces zero audio chunks — likely user error on our side**

We've seen the bistream path yield no audio in two situations, both probably our mistake:
- When `tts_text` is an instance of an iterator class rather than a generator object returned by a function that uses `yield`. It looks like `isinstance(tts_text, collections.abc.Generator)` in `frontend.py` returns `False` for iterator classes.
- When the `prompt_text` stored in `spk2info` doesn't contain token 151646 (`<|endofprompt|>`).

If these are known constraints, would a short note in the docstring be welcome? We'd gladly send a docs PR if helpful.
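For reference, a minimal sketch of the first case and the workaround we use, in plain Python with no CosyVoice imports. `TextChunks`, `as_generator`, and `has_end_of_prompt` are hypothetical names of ours. The string-level marker check is only an approximation of the token-level condition (token 151646), and it assumes the tokenizer maps the literal `<|endofprompt|>` to that token.

```python
import collections.abc

END_OF_PROMPT = "<|endofprompt|>"


class TextChunks:
    """Hypothetical iterator class that hands out text pieces one by one."""

    def __init__(self, pieces):
        self._it = iter(pieces)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def as_generator(iterable):
    """Wrap any iterable so the result is a real generator object."""
    yield from iterable


def has_end_of_prompt(prompt_text: str) -> bool:
    """Text-level pre-flight check (assumption: the marker maps to one token)."""
    return END_OF_PROMPT in prompt_text


chunks = TextChunks(["Hello, ", "world."])
# An iterator class lacks send/throw/close, so it is not a Generator.
is_gen_before = isinstance(chunks, collections.abc.Generator)

wrapped = as_generator(chunks)
is_gen_after = isinstance(wrapped, collections.abc.Generator)
collected = list(wrapped)
```

Wrapping the iterator this way made the bistream path produce audio for us, and running the marker check before registration caught the second case early.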

**Bonus question**: any update on Fun-CosyVoice 3.5 open-source plans (#1840)? Asking because the FreeStyle instruction control sounds perfect for our realtime conversational use case.

Thanks again — we know maintainer time is limited, so any pointer is appreciated.

Beginner integration questions: variant choice, instruction location, bistream behavior #1895
