python/FunASR-XL.git

			@@ -1,6 +1,8 @@
			([简体中文](./websocket_protocol_zh.md)\|English)

			# WebSocket/gRPC Communication Protocol
			This document describes the communication protocol used by the FunASR software package, covering offline file transcription ([deployment document](./SDK_tutorial.md)) and real-time speech recognition ([deployment document](./SDK_tutorial_online.md)).

			## Offline File Transcription
			### Sending Data from Client to Server
			#### Message Format
			@@ -8,7 +10,7 @@
			#### Initial Communication
			The message (which needs to be serialized in JSON) is:
			```text
			{"mode": "offline", "wav_name": "wav_name", "is_speaking": True,"wav_format":"pcm"}
			{"mode": "offline", "wav_name": "wav_name", "wav_format": "pcm", "is_speaking": True, "hotwords": "阿里巴巴 达摩院 阿里云"}
			```
			Parameter explanation:
			```text
			@@ -17,6 +19,7 @@
			`wav_format`: the audio and video file extension, such as pcm, mp3, mp4, etc.
			`is_speaking`: False indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file
			`audio_fs`: when the input audio is in PCM format, the audio sampling rate parameter needs to be added
			`hotwords`: if the acoustic model (AM) is a hotword model, the hotwords need to be sent to the server as a single string, with a space (" ") separating the hotwords. For example: "阿里巴巴 达摩院 阿里云"
			```
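As a rough illustration of the parameters above, the initial offline message could be assembled and serialized as follows. This is only a sketch: the helper name `build_offline_start_message` and the sample values (`"demo"`, `16000`) are assumptions made for the example and are not part of FunASR. Note that Python's `True` becomes `true` once the dictionary is serialized to JSON.

```python
import json


def build_offline_start_message(wav_name, wav_format="pcm", audio_fs=None, hotwords=None):
    """Return the initial offline-mode message as a JSON string (hypothetical helper)."""
    message = {
        "mode": "offline",
        "wav_name": wav_name,
        "wav_format": wav_format,
        "is_speaking": True,  # serialized as JSON `true`
    }
    # audio_fs is only needed when the input audio is raw PCM.
    if audio_fs is not None:
        message["audio_fs"] = audio_fs
    # Hotwords are sent as one string, separated by single spaces.
    if hotwords:
        message["hotwords"] = " ".join(hotwords)
    # ensure_ascii=False keeps non-ASCII hotwords readable in the payload.
    return json.dumps(message, ensure_ascii=False)


# Sample values below are illustrative assumptions.
payload = build_offline_start_message(
    "demo", audio_fs=16000, hotwords=["阿里巴巴", "达摩院", "阿里云"]
)
print(payload)
```

The resulting string would then be sent as the first text frame over whatever WebSocket client is in use.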

			#### Sending Audio Data
			@@ -32,7 +35,7 @@
			#### Sending Recognition Results
			The message (serialized in JSON) is:
			```text
			{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": True}
			{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": True, "timestamp":"[[100,200], [200,500]]"}
			```
			Parameter explanation:
			```text
			@@ -40,12 +43,13 @@
			`wav_name`: the name of the audio file to be transcribed
			`text`: the text output of speech recognition
			`is_final`: indicating the end of recognition
			`timestamp`: returned only if the AM is a timestamp model; it carries the timestamps as a string in the format "[[100,200], [200,500]]"
			```
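A client-side sketch of reading such a result is shown below. It assumes the server sends valid JSON and that `timestamp`, when present, is a string containing a nested list, as described above, so it is decoded a second time. The function name `parse_offline_result` is hypothetical and used only for illustration.

```python
import json


def parse_offline_result(raw):
    """Decode a recognition result into (text, is_final, segments). Hypothetical helper."""
    result = json.loads(raw)
    timestamp = result.get("timestamp")
    if isinstance(timestamp, str):
        # The field is documented as a string, so it needs its own decoding step.
        segments = json.loads(timestamp)
    else:
        # The field is absent when the AM is not a timestamp model.
        segments = timestamp or []
    return result["text"], bool(result.get("is_final")), [tuple(s) for s in segments]


raw = (
    '{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", '
    '"is_final": true, "timestamp": "[[100,200], [200,500]]"}'
)
text, is_final, segments = parse_offline_result(raw)
print(text, is_final, segments)
```

Each entry in `segments` is a pair of values taken directly from the `timestamp` field.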

			## Real-time Speech Recognition
			### System Architecture Diagram

			<div align="left"><img src="images/2pass.jpg" width="400"/></div>
			<div align="left"><img src="images/2pass.jpg" width="600"/></div>

			### Sending Data from Client to Server
			#### Message Format
			@@ -54,7 +58,7 @@
			#### Initial Communication
			The message (which needs to be serialized in JSON) is:
			```text
			{"mode": "2pass", "wav_name": "wav_name", "is_speaking": True, "wav_format":"pcm", "chunk_size":[5,10,5]
			{"mode": "2pass", "wav_name": "wav_name", "is_speaking": True, "wav_format":"pcm", "chunk_size":[5,10,5]}
			```
			Parameter explanation:
			```text
			@@ -85,4 +89,4 @@
			`wav_name`: the name of the audio file to be transcribed
			`text`: the text output of speech recognition
			`is_final`: indicating the end of recognition
			```
			```