
Commit bfa0706

osanseviero, Wauplin, and merveenoyan authored
Add streaming guide (#858)
Co-authored-by: Lucain <[email protected]>
Co-authored-by: Merve Noyan <[email protected]>
1 parent bce5e22 commit bfa0706

File tree

3 files changed: +153, -1 lines changed


docs/source/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -18,3 +18,7 @@
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
   title: Tutorials
+ - sections:
+   - local: conceptual/streaming
+     title: Streaming
+   title: Conceptual Guides
docs/source/conceptual/streaming.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# Streaming

## What is Streaming?

Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif"
    />
</div>
With token streaming, the server can return tokens one by one before the whole response has been generated. Users can get a sense of the generation's quality before it has finished. This has several positive effects:

* Users can get results orders of magnitude earlier for extremely long queries.
* Seeing something in progress allows users to stop the generation if it's not going in the direction they expect.
* Perceived latency is lower when results are shown in the early stages.
* When used in conversational UIs, the experience feels more natural.

For example, consider a system that generates 100 tokens per second. If it generates 1000 tokens, with the non-streaming setup users need to wait 10 seconds to get any result. With the streaming setup, users get initial results immediately, and although end-to-end latency is the same, they can see half of the generation after five seconds. Below you can see an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.
<div class="block dark:hidden">
    <iframe
        src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=light"
        width="850"
        height="350"
    ></iframe>
</div>
<div class="hidden dark:block">
    <iframe
        src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=dark"
        width="850"
        height="350"
    ></iframe>
</div>
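To make the arithmetic of that example explicit, here is a quick back-of-the-envelope calculation using the same assumed numbers (100 tokens per second, 1000 tokens); the variable names are purely illustrative.

```python
# Back-of-the-envelope numbers from the example above (assumed: 100 tokens/s, 1000 tokens).
tokens_per_second = 100
total_tokens = 1000

time_to_full_response = total_tokens / tokens_per_second           # 10 s before anything is shown without streaming
time_to_first_token = 1 / tokens_per_second                        # ~0.01 s until the first token appears with streaming
time_to_half_generation = (total_tokens / 2) / tokens_per_second   # 5 s until half of the text is visible

print(time_to_full_response, time_to_first_token, time_to_half_generation)
```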
## How to use Streaming?

### Streaming with Python

To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate over the response.
```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:8080")
for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)

# To
# make
# cheese
#,
# you
# need
# to
# start
# with
# milk
#.
```
If you want additional details, you can add `details=True`. In this case, you get a `TextGenerationStreamResponse`, which contains additional information such as the tokens and their log-probabilities. For the final response in the stream, it also returns the full generated text.

```python
for details in client.text_generation("How do you make cheese?", max_new_tokens=12, details=True, stream=True):
    print(details)

# TextGenerationStreamResponse(token=Token(id=193, text='\n', logprob=-0.007358551, special=False), generated_text=None, details=None)
# TextGenerationStreamResponse(token=Token(id=2044, text='To', logprob=-1.1357422, special=False), generated_text=None, details=None)
# TextGenerationStreamResponse(token=Token(id=717, text=' make', logprob=-0.009841919, special=False), generated_text=None, details=None)
# ...
# TextGenerationStreamResponse(token=Token(id=25, text='.', logprob=-1.3408203, special=False), generated_text='\nTo make cheese, you need to start with milk.', details=StreamDetails(finish_reason=<FinishReason.Length: 'length'>, generated_tokens=12, seed=None))
```
The `huggingface_hub` library also comes with an `AsyncInferenceClient` in case you need to handle the requests concurrently.

```python
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient("http://127.0.0.1:8080")
async for token in await client.text_generation("How do you make cheese?", stream=True):
    print(token)

# To
# make
# cheese
#,
# you
# need
# to
# start
# with
# milk
#.
```
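Since the async client is meant for handling requests concurrently, here is a minimal sketch that streams several prompts at once with `asyncio.gather`. It assumes the same local TGI endpoint as above; the `stream_one` helper and the prompt list are illustrative.

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient("http://127.0.0.1:8080")

async def stream_one(prompt: str) -> str:
    # Collect the streamed tokens for a single prompt into a string.
    text = ""
    async for token in await client.text_generation(prompt, max_new_tokens=12, stream=True):
        text += token
    return text

async def main():
    prompts = ["How do you make cheese?", "How do you make butter?"]
    # Both streaming requests run concurrently against the same server.
    results = await asyncio.gather(*(stream_one(p) for p in prompts))
    for prompt, result in zip(prompts, results):
        print(prompt, "->", result)

asyncio.run(main())
```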
### Streaming with cURL

To use the `generate_stream` endpoint with curl, you can add the `-N` flag, which disables curl's default buffering and shows data as it arrives from the server.
```bash
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
### Streaming with JavaScript

First, we need to install the `@huggingface/inference` library.

`npm install @huggingface/inference`

If you're using the free Inference API, you can use `HfInference`. If you're using Inference Endpoints, you can use `HfInferenceEndpoint`. Let's create a `HfInferenceEndpoint`, providing our endpoint URL and credential.
```js
import { HfInferenceEndpoint } from '@huggingface/inference'

const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')

// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'

const stream = hf.textGenerationStream({ inputs: prompt })
for await (const r of stream) {
    // print the generated token
    process.stdout.write(r.token.text)
}
```
## How does Streaming work under the hood?

Under the hood, TGI uses Server-Sent Events (SSE). In an SSE setup, a client sends a request with the data, opening an HTTP connection and subscribing to updates. Afterward, the server sends data to the client. There is no need for further requests; the server will keep sending the data. SSE is unidirectional, meaning the client does not send other requests to the server. SSE sends data over HTTP, making it easy to use.
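To make the SSE flow concrete, below is a minimal sketch that reads the raw event stream from the `/generate_stream` endpoint with the `requests` library. It assumes a local TGI server as in the curl example above, and that each event payload follows the response structure shown earlier (`token`, `generated_text`, `details`).

```python
import json

import requests

# Open a streaming HTTP connection; this is the same request as the curl example above.
response = requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
    stream=True,
)

for line in response.iter_lines():
    # SSE frames each event as a line prefixed with "data:", with blank lines in between.
    if line.startswith(b"data:"):
        payload = json.loads(line[len(b"data:"):])
        print(payload["token"]["text"], end="", flush=True)
```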
SSEs are different from:
* Polling: where the client keeps calling the server to get data. This means that the server might return empty responses and cause overhead.
* Webhooks: where there is a bi-directional connection. The server can send information to the client, but the client can also send data to the server after the first request. Webhooks are more complex to operate as they don't only use HTTP.
One of the limitations of Server-Sent Events is that they limit how many concurrent requests can be handled by the server. Instead of timing out when there are too many SSE connections, TGI returns an HTTP error with an `overloaded` error type (`huggingface_hub` raises an `OverloadedError`). This allows the client to manage the overloaded server (e.g., it could display a busy message to the user or retry with a new request). To configure the maximum number of concurrent requests, you can specify `--max-concurrent-requests`, allowing clients to handle backpressure.
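As a rough illustration of handling an overloaded server on the client side, here is a hedged sketch that retries a streaming request with a simple backoff. The `from huggingface_hub.errors import OverloadedError` path is an assumption that may vary across `huggingface_hub` versions, and the retry counts and delays are arbitrary.

```python
import time

from huggingface_hub import InferenceClient
# Assumption: recent huggingface_hub versions expose OverloadedError here;
# older releases may keep it in a different module.
from huggingface_hub.errors import OverloadedError

client = InferenceClient("http://127.0.0.1:8080")

def stream_with_retry(prompt: str, max_retries: int = 3, delay: float = 1.0) -> None:
    for attempt in range(max_retries):
        try:
            for token in client.text_generation(prompt, max_new_tokens=12, stream=True):
                print(token, end="", flush=True)
            return
        except OverloadedError:
            # The server hit --max-concurrent-requests; back off and try again.
            time.sleep(delay * (attempt + 1))
    print("Server is busy, please try again later.")

stream_with_retry("How do you make cheese?")
```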

docs/source/quicktour.md

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ curl 127.0.0.1:8080/generate \
To see all possible deploy flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more.

```bash
-docker run ghcr.io/huggingface/text-generation-inference:1.0.0 --help
+docker run ghcr.io/huggingface/text-generation-inference:1.0.1 --help
```

</Tip>
