From 1d7bbbffb6a024a33859b48a7a656d0455dc0be1 Mon Sep 17 00:00:00 2001
From: zhifu gao <zhifu.gzf@alibaba-inc.com>
Date: Mon, 16 Oct 2023 11:47:59 +0800
Subject: [PATCH] Update README.md
---
funasr/runtime/docs/websocket_protocol.md | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/funasr/runtime/docs/websocket_protocol.md b/funasr/runtime/docs/websocket_protocol.md
index 94823a2..cc4052f 100644
--- a/funasr/runtime/docs/websocket_protocol.md
+++ b/funasr/runtime/docs/websocket_protocol.md
@@ -1,6 +1,8 @@
 ([简体中文](./websocket_protocol_zh.md)|English)
# WebSocket/gRPC Communication Protocol
+This protocol is the communication protocol for the FunASR software package, which includes offline file transcription ([deployment document](./SDK_tutorial.md)) and real-time speech recognition ([deployment document](./SDK_tutorial_online.md)).
+
## Offline File Transcription
### Sending Data from Client to Server
#### Message Format
@@ -8,7 +10,7 @@
#### Initial Communication
The message (which needs to be serialized in JSON) is:
```text
-{"mode": "offline", "wav_name": "wav_name", "is_speaking": True,"wav_format":"pcm"}
+{"mode": "offline", "wav_name": "wav_name", "wav_format": "pcm", "is_speaking": True, "hotwords": "阿里巴巴 达摩院 阿里云", "itn": true}
```
Parameter explanation:
```text
@@ -17,6 +19,8 @@
`wav_format`: the audio and video file extension, such as pcm, mp3, mp4, etc.
`is_speaking`: False indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file
`audio_fs`: when the input audio is in PCM format, the audio sampling rate parameter needs to be added
+`hotwords`: if the AM is a hotword model, hotword data needs to be sent to the server as a string, with " " as the separator between hotwords, e.g. "阿里巴巴 达摩院 阿里云"
+`itn`: whether to apply inverse text normalization (ITN); defaults to true (enabled), false disables it
```
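As a sketch of the initial handshake described above, the helper below (hypothetical, not part of FunASR) assembles the offline init message from the documented fields; `build_offline_init` and its defaults are assumptions for illustration:

```python
import json

# Hypothetical helper: builds the offline init message described in this
# protocol. Field names follow the document; defaults are assumptions.
def build_offline_init(wav_name, wav_format="pcm", hotwords="", itn=True, audio_fs=None):
    msg = {
        "mode": "offline",
        "wav_name": wav_name,
        "wav_format": wav_format,
        "is_speaking": True,  # stays True until the end of the audio
        "itn": itn,
    }
    if hotwords:
        msg["hotwords"] = hotwords  # space-separated hotword string
    if wav_format == "pcm" and audio_fs is not None:
        msg["audio_fs"] = audio_fs  # sampling rate is required for raw PCM
    return json.dumps(msg, ensure_ascii=False)

print(build_offline_init("test.wav", audio_fs=16000))
```

The client would send this JSON text frame first, then start streaming audio bytes.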
#### Sending Audio Data
@@ -32,7 +36,7 @@
#### Sending Recognition Results
The message (serialized in JSON) is:
```text
-{"mode": "offline", "wav_name": "wav_name", "text": "asr ouputs", "is_final": True}
+{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": True, "timestamp": "[[100,200], [200,500]]"}
```
Parameter explanation:
```text
@@ -40,12 +44,13 @@
`wav_name`: the name of the audio file to be transcribed
`text`: the text output of speech recognition
`is_final`: indicating the end of recognition
+`timestamp`: if the AM is a timestamp model, this field is returned, containing the timestamps in the format "[[100,200], [200,500]]"
```
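Note that `timestamp` arrives as a JSON-encoded string inside the result message, so it must be decoded twice. A minimal sketch (the `parse_timestamps` helper is an assumption, not part of FunASR):

```python
import json

# Decode a result message and turn its "timestamp" field (a JSON string
# of [start_ms, end_ms] pairs) into a list of tuples.
def parse_timestamps(result_message: str):
    result = json.loads(result_message)
    raw = result.get("timestamp")
    if raw is None:
        return []  # non-timestamp models omit the field
    return [tuple(pair) for pair in json.loads(raw)]

msg = '{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": true, "timestamp": "[[100,200], [200,500]]"}'
print(parse_timestamps(msg))  # [(100, 200), (200, 500)]
```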
## Real-time Speech Recognition
### System Architecture Diagram
-<div align="left"><img src="images/2pass.jpg" width="400"/></div>
+<div align="left"><img src="images/2pass.jpg" width="600"/></div>
### Sending Data from Client to Server
#### Message Format
@@ -54,7 +59,7 @@
#### Initial Communication
The message (which needs to be serialized in JSON) is:
```text
-{"mode": "2pass", "wav_name": "wav_name", "is_speaking": True, "wav_format":"pcm", "chunk_size":[5,10,5]
+{"mode": "2pass", "wav_name": "wav_name", "is_speaking": True, "wav_format": "pcm", "chunk_size": [5,10,5], "hotwords": "阿里巴巴 达摩院 阿里云", "itn": true}
```
Parameter explanation:
```text
@@ -64,6 +69,8 @@
`is_speaking`: False indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file
`chunk_size`: indicates the latency configuration of the streaming model, `[5,10,5]` indicates that the current audio is 600ms long, with a 300ms look-ahead and look-back time.
`audio_fs`: when the input audio is in PCM format, the audio sampling rate parameter needs to be added
+`hotwords`: if the AM is a hotword model, hotword data needs to be sent to the server as a string, with " " as the separator between hotwords, e.g. "阿里巴巴 达摩院 阿里云"
+`itn`: whether to apply inverse text normalization (ITN); defaults to true (enabled), false disables it
```
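The `chunk_size` arithmetic above can be sketched as follows, assuming each unit corresponds to 60 ms of audio (an assumption inferred from the 600 ms / 300 ms figures in the text):

```python
# chunk_size [look_back, current, look_ahead] in 60 ms units (assumed
# from the document's [5,10,5] -> 600 ms / 300 ms example).
FRAME_MS = 60

def chunk_latency(chunk_size):
    look_back, current, look_ahead = chunk_size
    return {
        "current_ms": current * FRAME_MS,
        "look_back_ms": look_back * FRAME_MS,
        "look_ahead_ms": look_ahead * FRAME_MS,
    }

print(chunk_latency([5, 10, 5]))
# {'current_ms': 600, 'look_back_ms': 300, 'look_ahead_ms': 300}
```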
#### Sending Audio Data
 Directly send the audio data, removing the header information and sending only the raw bytes. Supported audio sampling rates are 8000 (which must be specified as `audio_fs` in the initial message) and 16000.
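A sketch of slicing raw PCM bytes into fixed-size pieces before sending them over the socket (header already stripped). The 1920-byte figure is an assumption: 60 ms of 16 kHz, 16-bit mono audio; FunASR itself does not mandate this exact size:

```python
# Split header-stripped PCM bytes into fixed-size chunks for streaming.
# 1920 bytes = 60 ms at 16 kHz, 16-bit mono (illustrative assumption).
CHUNK_BYTES = 1920

def pcm_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

chunks = list(pcm_chunks(b"\x00" * 4800))
print([len(c) for c in chunks])  # [1920, 1920, 960]
```

Each chunk would be sent as a binary websocket frame between the init message and the final `{"is_speaking": False}` message.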
@@ -77,7 +84,7 @@
The message (serialized in JSON) is:
```text
-{"mode": "2pass-online", "wav_name": "wav_name", "text": "asr ouputs", "is_final": True}
+{"mode": "2pass-online", "wav_name": "wav_name", "text": "asr outputs", "is_final": True, "timestamp": "[[100,200], [200,500]]"}
```
Parameter explanation:
```text
@@ -85,4 +92,5 @@
`wav_name`: the name of the audio file to be transcribed
`text`: the text output of speech recognition
`is_final`: indicating the end of recognition
-```
\ No newline at end of file
+`timestamp`: if the AM is a timestamp model, this field is returned, containing the timestamps in the format "[[100,200], [200,500]]"
+```
--
Gitblit v1.9.1