From 1a0de67a08b4407497dee0f3dcc1339d7b3e6e3a Mon Sep 17 00:00:00 2001
From: 游雁 <zhifu.gzf@alibaba-inc.com>
Date: Thu, 07 Nov 2024 13:58:20 +0800
Subject: [PATCH] SenseVoice docs

---
 examples/industrial_data_pretraining/sense_voice/README_zh.md |  405 ++++++++++++++++++++
 examples/industrial_data_pretraining/sense_voice/finetune.sh  |    2 
 examples/industrial_data_pretraining/sense_voice/README_ja.md |  358 +++++++++++++++++
 examples/industrial_data_pretraining/sense_voice/README.md    |  381 +++++++++++++++++++
 4 files changed, 1145 insertions(+), 1 deletion(-)

diff --git a/examples/industrial_data_pretraining/sense_voice/README.md b/examples/industrial_data_pretraining/sense_voice/README.md
new file mode 100644
index 0000000..746986c
--- /dev/null
+++ b/examples/industrial_data_pretraining/sense_voice/README.md
@@ -0,0 +1,381 @@
+([简体中文](./README_zh.md)|English|[日本語](./README_ja.md))
+
+
+# Introduction
+
+SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
+
+<div align="center">  
+<img src="image/sensevoice2.png">
+</div>
+
+[//]: # (<div align="center"><img src="image/sensevoice.png" width="700"/> </div>)
+
+<div align="center">  
+<h4>
+<a href="https://funaudiollm.github.io/"> Homepage </a>
+｜<a href="#What's News"> What's New </a>
+｜<a href="#Benchmarks"> Benchmarks </a>
+｜<a href="#Install"> Install </a>
+｜<a href="#Usage"> Usage </a>
+｜<a href="#Community"> Community </a>
+</h4>
+
+Model Zoo:
+[modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
+
+Online Demo:
+[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
+
+
+</div>
+
+
+<a name="Highlights"></a>
+# Highlights 🎯
+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+- **Multilingual Speech Recognition:** Trained with over 400,000 hours of data, supporting more than 50 languages, the recognition performance surpasses that of the Whisper model.
+- **Rich transcription:** 
+  - Excellent emotion recognition capabilities, matching or surpassing the best current emotion recognition models on test data.
+  - Sound event detection capabilities, covering common human-computer interaction events such as background music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient Inference:** The SenseVoice-Small model utilizes a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. It requires only 70 ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large.
+- **Convenient Finetuning:** Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
+- **Service Deployment:** Offers a service deployment pipeline supporting concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
+
+<a name="What's News"></a>
+# What's New 🔥
+- 2024/7: Added Export Features for [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py), as well as Python Version Runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/)
+- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) voice understanding model is open-sourced, which offers high-precision multilingual speech recognition, emotion recognition, and audio event detection capabilities for Mandarin, Cantonese, English, Japanese, and Korean and leads to exceptionally low inference latency.  
+- 2024/7: CosyVoice, a model for natural speech generation with multilingual, timbre, and emotion control, is released. CosyVoice excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
+- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR.
+
+<a name="Benchmarks"></a>
+# Benchmarks 📝
+
+## Multilingual Speech Recognition
+We compared the multilingual speech recognition performance of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. The SenseVoice-Small model has clear advantages in Chinese and Cantonese recognition.
+
+<div align="center">  
+<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
+</div>
+
+## Speech Emotion Recognition
+
+Due to the current lack of widely-used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English, and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to achieve and exceed the performance of the current best speech emotion recognition models.
+
+<div align="center">  
+<img src="image/ser_table.png" width="1000" />
+</div>
+
+Furthermore, we compared multiple open-source speech emotion recognition models on the test sets, and the results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of the datasets.
+
+<div align="center">  
+<img src="image/ser_figure.png" width="500" />
+</div>
+
+## Audio Event Detection
+
+Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification ESC-50 dataset against the widely used industry models BEATS and PANN. The SenseVoice model achieved commendable results on these tasks. However, due to limitations in training data and methodology, its event classification performance has some gaps compared to specialized AED models.
+
+<div align="center">  
+<img src="image/aed_figure.png" width="500" />
+</div>
+
+## Computational Efficiency
+
+The SenseVoice-Small model employs a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a parameter count similar to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large.
+
+<div align="center">  
+<img src="image/inference.png" width="1000" />
+</div>
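In real-time factor (RTF) terms, the latency quoted above works out as follows. A quick sanity check using only the figures stated in this section (70 ms of compute for 10 s of audio):

```python
def real_time_factor(infer_seconds: float, audio_seconds: float) -> float:
    """Inference time divided by audio duration; lower is faster."""
    return infer_seconds / audio_seconds

# 70 ms to process 10 s of audio, as stated above.
rtf_small = real_time_factor(0.070, 10.0)
print(f"RTF = {rtf_small:.3f}")  # RTF = 0.007, i.e. ~143x faster than real time
```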
+
+
+<a name="Install"></a>
+# Requirements
+
+```shell
+pip install -r requirements.txt
+```
+
+<a name="Usage"></a>
+# Usage
+
+## Inference
+
+Supports input of audio in any format and of any duration.
+
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+
+model = AutoModel(
+    model=model_dir,
+    trust_remote_code=True,
+    remote_code="./model.py",    
+    vad_model="fsmn-vad",
+    vad_kwargs={"max_single_segment_time": 30000},
+    device="cuda:0",
+)
+
+# en
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size_s=60,
+    merge_vad=True,
+    merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
+
+<details><summary>Parameter Description (Click to Expand)</summary>
+
+- `model_dir`: The name of the model, or the path to the model on the local disk.
+- `trust_remote_code`:
+  - When `True`, it means that the model's code implementation is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory). It supports absolute paths, relative paths, and network URLs.
+  - When `False`, it indicates that the model's code implementation is the integrated version within [FunASR](https://github.com/modelscope/FunASR). At this time, modifications made to `model.py` in the current directory will not be effective, as the version loaded is the internal one from FunASR. For the model code, [click here to view](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `vad_model`: This enables VAD (voice activity detection). The purpose of VAD is to split long audio into shorter clips. In this case, the reported inference time covers both VAD and SenseVoice, i.e. the end-to-end latency. If you wish to measure the SenseVoice model's inference time separately, the VAD model can be disabled.
+- `vad_kwargs`: Specifies the configurations for the VAD model. `max_single_segment_time`: denotes the maximum duration for audio segmentation by the `vad_model`, with the unit being milliseconds (ms).
+- `use_itn`: Whether the output result includes punctuation and inverse text normalization.
+- `batch_size_s`: Indicates the use of dynamic batching, where the total duration of audio in the batch is measured in seconds (s).
+- `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Whether to ban the output of the `emo_unk` token.
+</details>
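Before `rich_transcription_postprocess` is applied, the raw `text` field returned by `generate` carries special tokens for language, emotion, event, and ITN state. A minimal sketch of separating those tags from the plain text (the sample string below is illustrative, not actual model output):

```python
import re

# Special tokens have the form <|...|>, e.g. <|en|>, <|NEUTRAL|>, <|Speech|>.
TAG = re.compile(r"<\|(.+?)\|>")

def split_tags(raw: str):
    """Return (list of special tokens, remaining plain text)."""
    return TAG.findall(raw), TAG.sub("", raw)

tags, text = split_tags("<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello world.")
print(tags)  # ['en', 'NEUTRAL', 'Speech', 'withitn']
print(text)  # Hello world.
```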
+
+If all inputs are short audios (<30s), and batch inference is needed to speed up inference efficiency, the VAD model can be removed, and `batch_size` can be set accordingly.
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="zh", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    batch_size=64, 
+)
+```
+
+For more usage, please refer to [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md)
+
+### Direct inference
+
+Supports input of audio in any format, with an input duration limit of 30 seconds or less.
+
+```python
+from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
+m.eval()
+
+res = m.inference(
+    data_in=f"{kwargs['model_path']}/example/en.mp3",
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    ban_emo_unk=False,
+    **kwargs,
+)
+
+text = rich_transcription_postprocess(res[0][0]["text"])
+print(text)
+```
+
+### Export and Test
+<details><summary>ONNX and Libtorch Export</summary>
+
+#### ONNX
+```python
+# pip3 install -U funasr funasr-onnx
+from pathlib import Path
+from funasr_onnx import SenseVoiceSmall
+from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
+
+# inference
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The ONNX model is exported to the original model directory.
+
+#### Libtorch
+```python
+from pathlib import Path
+from funasr_torch import SenseVoiceSmall
+from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
+
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The Libtorch model is exported to the original model directory.
+</details>
+
+## Service
+
+### Deployment with FastAPI
+```shell
+export SENSEVOICE_DEVICE=cuda:0
+fastapi run --port 50000
+```
+
+## Finetune
+
+### Requirements
+
+```shell
+git clone https://github.com/alibaba/FunASR.git && cd FunASR
+pip3 install -e ./
+```
+
+### Data preparation
+
+Data examples
+
+```text
+{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
+{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
+```
+
+For the full file, refer to `data/train_example.jsonl`.
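A quick way to sanity-check a manifest in this format is to confirm that every line parses as JSON and carries the expected fields. A minimal sketch (the field list follows the example above; the helper is illustrative, not part of FunASR):

```python
import json

# Fields expected per line, following the data example in this README.
REQUIRED = {
    "key", "source", "source_len", "target", "target_len",
    "text_language", "emo_target", "event_target", "with_or_wo_itn",
}

def validate_jsonl_line(line: str) -> dict:
    """Parse one manifest line and check that all required fields exist."""
    record = json.loads(line)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

sample = (
    '{"key": "utt1", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>",'
    ' "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>",'
    ' "target": "hello", "source": "/path/utt1.wav",'
    ' "target_len": 1, "source_len": 100}'
)
record = validate_jsonl_line(sample)
print(record["key"])  # utt1
```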
+
+<details><summary>Data Prepare Details</summary>
+
+Description:
+- `key`: unique ID of the audio file
+- `source`: path to the audio file
+- `source_len`: number of fbank frames of the audio file
+- `target`: transcription
+- `target_len`: length of the target
+- `text_language`: language ID of the audio file
+- `emo_target`: emotion label of the audio file
+- `event_target`: event label of the audio file
+- `with_or_wo_itn`: whether the target includes punctuation and inverse text normalization
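For reference, `source_len` can be estimated from the clip duration: assuming the common 25 ms window with a 10 ms frame shift (an assumption — check your actual feature configuration), the frame count is roughly the duration in milliseconds divided by 10:

```python
def fbank_frames(duration_s: float, shift_ms: float = 10.0) -> int:
    """Approximate fbank frame count for a clip of the given duration."""
    return int(round(duration_s * 1000 / shift_ms))

print(fbank_frames(1.4))  # 140 -> a ~1.4 s clip yields a source_len around 140
```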
+
+
+`train_text.txt`
+
+
+```bash
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些>也许对
+ID0012W0014 he tried to think how it could be
+```
+
+`train_wav.scp`
+
+
+
+```bash
+BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
+BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
+asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
+ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+```
+
+`train_text_language.txt`
+
+The language IDs include `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`.
+
+```bash
+BAC009S0764W0121 <|zh|>
+BAC009S0916W0489 <|zh|>
+asr_example_cn_en <|zh|>
+ID0012W0014 <|en|>
+```
+
+`train_emo.txt`
+
+The emotion labels include `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`.
+
+```bash
+BAC009S0764W0121 <|NEUTRAL|>
+BAC009S0916W0489 <|NEUTRAL|>
+asr_example_cn_en <|NEUTRAL|>
+ID0012W0014 <|NEUTRAL|>
+```
+
+`train_event.txt`
+
+The event labels include `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`.
+
+```bash
+BAC009S0764W0121 <|Speech|>
+BAC009S0916W0489 <|Speech|>
+asr_example_cn_en <|Speech|>
+ID0012W0014 <|Speech|>
+```
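Conceptually, the `sensevoice2jsonl` command below joins these per-utterance files on their shared keys, producing one record per utterance. A rough pure-Python sketch of that join (field names follow the data example earlier; the real tool also computes `source_len` and `target_len`):

```python
def parse_kv(lines):
    """Parse 'key value' lines, the format of the scp/txt files shown above."""
    return dict(line.split(maxsplit=1) for line in lines)

wav = parse_kv(["utt1 /path/utt1.wav"])
text = parse_kv(["utt1 hello world"])
lang = parse_kv(["utt1 <|en|>"])
emo = parse_kv(["utt1 <|NEUTRAL|>"])
event = parse_kv(["utt1 <|Speech|>"])

# One record per key, joining all five files.
records = [
    {"key": k, "source": wav[k], "target": text[k],
     "text_language": lang[k], "emo_target": emo[k], "event_target": event[k]}
    for k in wav
]
print(records[0]["target"])  # hello world
```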
+
+`Command`
+```shell
+# generate train.jsonl and val.jsonl from train_wav.scp, train_text.txt, train_text_language.txt, train_emo.txt, and train_event.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
+++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl"
+```
+
+If `train_text_language.txt`, `train_emo.txt`, and `train_event.txt` are not provided, the language, emotion, and event labels will be predicted automatically using the `SenseVoice` model.
+```shell
+# generate train.jsonl and val.jsonl from train_wav.scp and train_text.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++data_type_list='["source", "target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl" \
+++model_dir='iic/SenseVoiceSmall'
+```
+</details>
+
+### Finetune
+
+Make sure to set `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation directory you set up earlier.
+
+```shell
+bash finetune.sh
+```
+
+## WebUI
+
+```shell
+python webui.py
+```
+
+<div align="center"><img src="image/webui.png" width="700"/> </div>
+
+
+## Remarkable Third-Party Work
+- Triton (GPU) Deployment Best Practices: Using Triton + TensorRT, tested with FP32, achieving an acceleration ratio of 526 on V100 GPU. FP16 support is in progress. [Repository](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
+- Sherpa-onnx Deployment Best Practices: Supports using SenseVoice in 10 programming languages: C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, and Dart. Also supports deploying SenseVoice on platforms like iOS, Android, and Raspberry Pi. [Repository](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
+- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp): inference of SenseVoice in pure C/C++ based on GGML, supporting 3-bit, 4-bit, 5-bit, and 8-bit quantization, with no third-party dependencies.
+- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice) processes inference in chunks. To achieve pseudo-streaming, it employs a truncated attention mechanism, sacrificing some accuracy. It also supports CTC prefix beam search and hot-word boosting.
+- [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice) is optimized for lightning-fast inference and batch processing.
+
+<a name="Community"></a>
+# Community
+If you encounter problems in use, you can raise issues directly on the GitHub page.
+
+You can also scan the following DingTalk group QR code to join the community group for communication and discussion.
+
+|                          FunASR                          |
+|:--------------------------------------------------------:|
+| <img src="image/dingding_funasr.png" width="250"/> |
+
+
diff --git a/examples/industrial_data_pretraining/sense_voice/README_ja.md b/examples/industrial_data_pretraining/sense_voice/README_ja.md
new file mode 100644
index 0000000..c586899
--- /dev/null
+++ b/examples/industrial_data_pretraining/sense_voice/README_ja.md
@@ -0,0 +1,358 @@
+# SenseVoice
+
+「[简体中文](./README_zh.md)」|「[English](./README.md)」|「日本語」
+
+SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and acoustic event classification (AEC) or acoustic event detection (AED). This project introduces the SenseVoice model, along with benchmarks on multiple task test sets and the environment installation and inference methods needed to try the model.
+
+<div align="center">  
+<img src="image/sensevoice2.png">
+</div>
+[//]: # (<div align="center"><img src="image/sensevoice2.png" width="700"/> </div>)
+ 
+<div align="center">  
+<h4>
+<a href="https://funaudiollm.github.io/"> Homepage </a>
+｜<a href="#What's News"> What's New </a>
+｜<a href="#Benchmarks"> Benchmarks </a>
+｜<a href="#Install"> Install </a>
+｜<a href="#Usage"> Usage </a>
+｜<a href="#Community"> Community </a>
+</h4>
+
+Model Zoo:
+[modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
+
+Online Demo:
+[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
+
+</div>
+
+<a name="Highlights"></a>
+# Highlights 🎯
+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+- **Multilingual Speech Recognition:** Trained with over 400,000 hours of data, supporting more than 50 languages, with recognition performance surpassing the Whisper model.
+- **Rich transcription:** 
+  - Excellent emotion recognition capabilities, matching or surpassing the best current emotion recognition models on test data.
+  - Sound event detection capabilities, covering common human-computer interaction events such as background music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient Inference:** The SenseVoice-Small model utilizes a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. It requires only 70 ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large.
+- **Convenient Finetuning:** Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
+- **Service Deployment:** Offers a complete service deployment pipeline supporting multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
+
+<a name="What's News"></a>
+# What's New 🔥
+- 2024/7: Added export features for [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py), as well as Python runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/).
+- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) multilingual speech understanding model is open-sourced, offering high-precision multilingual speech recognition, emotion recognition, and audio event detection for Mandarin, Cantonese, English, Japanese, and Korean, with extremely low inference latency.
+- 2024/7: CosyVoice, a model for natural speech generation with multilingual, timbre, and emotion control, is released. CosyVoice excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
+- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit offering speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-talker ASR.
+
+<a name="Benchmarks"></a>
+# Benchmarks 📝
+
+## Multilingual Speech Recognition
+
+We compared the multilingual speech recognition performance and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. The SenseVoice-Small model has clear advantages in Chinese and Cantonese recognition.
+
+<div align="center">  
+<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
+</div>
+
+## Speech Emotion Recognition
+
+Due to the current lack of widely used benchmarks and methods for speech emotion recognition, we evaluated various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets include data in both Chinese and English, covering multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice matched or exceeded the performance of the current best speech emotion recognition models.
+
+<div align="center">  
+<img src="image/ser_table.png" width="1000" />
+</div>
+
+銇曘倝銇�併儐銈广儓銈汇儍銉堛仹瑜囨暟銇偑銉笺儣銉炽偨銉笺偣銇劅鎯呰獚璀樸儮銉囥儷銈掓瘮杓冦仐銆佺祼鏋溿伅SenseVoice-Large銉€儑銉亴銇汇伡銇欍伖銇︺伄銉囥兗銈裤仹鏈�鑹伄鍔规灉銈掗仈鎴愩仐銆丼enseVoice-Small銉€儑銉倐澶氭暟銇儑銉笺偪銈汇儍銉堛仹浠栥伄銈兗銉椼兂銈姐兗銈广儮銉囥儷銈掍笂鍥炪倠鍔规灉銈掗仈鎴愩仐銇熴亾銇ㄣ倰绀恒仐銇︺亜銇俱仚銆�
+
+<div align="center">  
+<img src="image/ser_figure.png" width="500" />
+</div>
+
+## Audio Event Detection
+
+Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification ESC-50 dataset against the widely used industry models BEATS and PANN. The SenseVoice model achieved commendable results on these tasks, but due to limitations in training data and methodology, its event classification performance still has some gap compared to specialized AED models.
+
+<div align="center">  
+<img src="image/aed_figure.png" width="500" />
+</div>
+
+## Computational Efficiency
+
+The SenseVoice-Small model employs a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a parameter count similar to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. In addition, the inference time of SenseVoice-Small does not increase noticeably as the audio duration grows.
+
+<div align="center">  
+<img src="image/inference.png" width="1000" />
+</div>
+
+<a name="Install"></a>
+# Install 🐍
+
+```shell
+pip install -r requirements.txt
+```
+
+<a name="Usage"></a>
+# Usage 🛠️
+
+## Inference
+
+Supports input of audio in any format and of any duration.
+
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+
+model = AutoModel(
+    model=model_dir,
+    trust_remote_code=True,
+    remote_code="./model.py",  
+    vad_model="fsmn-vad",
+    vad_kwargs={"max_single_segment_time": 30000},
+    device="cuda:0",
+)
+
+# en
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size_s=60,
+    merge_vad=True,
+    merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
+
+<details><summary>Parameter Description (Click to Expand)</summary>
+
+- `model_dir`: The name of the model, or its path on the local disk.
+- `trust_remote_code`:
+  - When `True`, the model's code implementation is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory). Absolute paths, relative paths, and network URLs are supported.
+  - When `False`, the model's code implementation is the version integrated in [FunASR](https://github.com/modelscope/FunASR). In this case, modifications to `model.py` in the current directory have no effect, because the internal FunASR version is loaded. For the model code, [click here to view](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `vad_model`: This enables VAD (voice activity detection). The purpose of VAD is to split long audio into shorter clips. In this case, the reported inference time covers both VAD and SenseVoice, i.e. the end-to-end latency. If you wish to measure the SenseVoice model's inference time separately, the VAD model can be disabled.
+- `vad_kwargs`: Configuration for the VAD model. `max_single_segment_time`: the maximum duration of audio segments produced by the `vad_model`, in milliseconds (ms).
+- `use_itn`: Whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: Indicates the use of dynamic batching, where the total duration of audio in a batch is measured in seconds (s).
+- `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Whether to ban the output of the `emo_unk` token.
+</details>
+
+If all inputs are short audios (<30s), and batch inference is needed to speed up inference efficiency, the VAD model can be removed, and `batch_size` can be set accordingly.
+
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size=64, 
+)
+```
+
+For more usage, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
+
+### Direct inference
+
+Supports input of audio in any format, with an input duration limit of 30 seconds or less.
+
+```python
+from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
+m.eval()
+
+res = m.inference(
+    data_in=f"{kwargs['model_path']}/example/en.mp3",
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    ban_emo_unk=False,
+    **kwargs,
+)
+
+text = rich_transcription_postprocess(res[0][0]["text"])
+print(text)
+```
+
+## Service Deployment
+
+To be completed.
+
+### Export and Test
+<details><summary>ONNX and Libtorch Export</summary>
+
+#### ONNX
+```python
+# pip3 install -U funasr funasr-onnx
+from pathlib import Path
+from funasr_onnx import SenseVoiceSmall
+from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
+
+# inference
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The ONNX model is exported to the original model directory.
+
+#### Libtorch
+```python
+from pathlib import Path
+from funasr_torch import SenseVoiceSmall
+from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
+
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The Libtorch model is exported to the original model directory.
+
+</details>
+
+### Deployment
+
+### Deployment with FastAPI
+```shell
+export SENSEVOICE_DEVICE=cuda:0
+fastapi run --port 50000
+```
+
+## Finetune
+
+### Installing the training environment
+
+```shell
+git clone https://github.com/alibaba/FunASR.git && cd FunASR
+pip3 install -e ./
+```
+
+### Data preparation
+
+Data examples
+```text
+{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
+{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
+```
+For the full file, refer to `data/train_example.jsonl`.
+
+<details><summary>Data Preparation Details</summary>
+
+Description:
+- `key`: unique ID of the audio file
+- `source`: path to the audio file
+- `source_len`: number of fbank frames of the audio file
+- `target`: transcription
+- `target_len`: length of the target (transcription)
+- `text_language`: language ID of the audio file
+- `emo_target`: emotion label of the audio file
+- `event_target`: event label of the audio file
+- `with_or_wo_itn`: whether the target includes punctuation and inverse text normalization
+
+`train_text.txt`
+```bash
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些>也许对
+ID0012W0014 he tried to think how it could be
+```
+`train_wav.scp`
+```bash
+BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
+BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
+asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
+ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+```
+`train_text_language.txt`
+The language IDs include `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`.
+```bash
+BAC009S0764W0121 <|zh|>
+BAC009S0916W0489 <|zh|>
+asr_example_cn_en <|zh|>
+ID0012W0014 <|en|>
+```
+`train_emo.txt`
+The emotion labels include `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`.
+```bash
+BAC009S0764W0121 <|NEUTRAL|>
+BAC009S0916W0489 <|NEUTRAL|>
+asr_example_cn_en <|NEUTRAL|>
+ID0012W0014 <|NEUTRAL|>
+```
+`train_event.txt`
+The event labels include `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`.
+```bash
+BAC009S0764W0121 <|Speech|>
+BAC009S0916W0489 <|Speech|>
+asr_example_cn_en <|Speech|>
+ID0012W0014 <|Speech|>
+```
+`Command`
+```shell
+# generate train.jsonl and val.jsonl from train_wav.scp, train_text.txt, train_text_language.txt, train_emo.txt, and train_event.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
+++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl"
+```
+If `train_text_language.txt`, `train_emo_target.txt`, and `train_event_target.txt` are not provided, the language, emotion, and event labels are predicted automatically with the `SenseVoice` model.
+```shell
+# generate train.jsonl and val.jsonl from wav.scp and text.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++data_type_list='["source", "target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl" \
+++model_dir='iic/SenseVoiceSmall'
+```
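The merge that `sensevoice2jsonl` performs can be sketched in plain Python. This is an illustrative stand-in, not the actual FunASR implementation: the helper name `merge_scp_to_jsonl` and the reduced field set are assumptions.

```python
import json

def merge_scp_to_jsonl(scp_texts, data_types):
    """Join parallel Kaldi-style "<key> <value>" listings into jsonl
    records keyed on the shared utterance ID (illustrative sketch)."""
    records = {}
    for text, dtype in zip(scp_texts, data_types):
        for line in text.strip().splitlines():
            key, value = line.split(maxsplit=1)
            records.setdefault(key, {"key": key})[dtype] = value
    return [json.dumps(r, ensure_ascii=False) for r in records.values()]

wav_scp = "ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav"
text_txt = "ID0012W0014 he tried to think how it could be"
for line in merge_scp_to_jsonl([wav_scp, text_txt], ["source", "target"]):
    print(line)
```

The real command additionally computes `source_len`/`target_len` from the audio and transcription and writes separate train/val files.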
+</details>
+
+### Start training
+
+Be sure to change `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` inside the FunASR installation directory set up earlier.
+
+```shell
+bash finetune.sh
+```
+
+## WebUI
+
+```shell
+python webui.py
+```
+
+<div align="center"><img src="image/webui.png" width="700"/> </div>
+
+## Notable third-party work
+- Triton (GPU) deployment best practice: with Triton + TensorRT, tested in FP32, an acceleration ratio of 526 is achieved on a V100 GPU; FP16 support is in progress. [repo](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
+- sherpa-onnx deployment best practice: run SenseVoice from 10 programming languages (C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, Dart) and deploy it on platforms such as iOS, Android, and Raspberry Pi. [repo](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
+- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp): pure C/C++ inference of SenseVoice based on GGML, supporting 3-bit, 4-bit, 5-bit, and 8-bit quantization, with no third-party dependencies.
+- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice): chunk-based inference. To achieve pseudo-streaming it uses truncated attention, trading off some accuracy; it also supports CTC prefix beam search and hotword boosting.
+- [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice): optimized for lightning-fast inference and batch processing.
+
+# Contact
+
+If you encounter problems in use, you can raise an Issue directly on the GitHub page. If you are interested in speech technology, scan the DingTalk group QR code below to join the community group for communication and discussion.
+
+|                          FunASR                          |
+|:--------------------------------------------------------:|
+| <img src="image/dingding_funasr.png" width="250"/></div> |
+
diff --git a/examples/industrial_data_pretraining/sense_voice/README_zh.md b/examples/industrial_data_pretraining/sense_voice/README_zh.md
new file mode 100644
index 0000000..9dd907b
--- /dev/null
+++ b/examples/industrial_data_pretraining/sense_voice/README_zh.md
@@ -0,0 +1,405 @@
+# SenseVoice
+
+「简体中文」|「[English](./README.md)」|「[日本語](./README_ja.md)」
+
+SenseVoice is a speech foundation model with audio understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event classification (AEC) or audio event detection (AED). This project presents the SenseVoice model, its benchmarks on test sets for multiple tasks, and the environment setup and inference methods needed to try the model.
+
+<div align="center">  
+<img src="image/sensevoice2.png">
+</div>
+
+<div align="center">  
+<h4>
+<a href="https://funaudiollm.github.io/"> Homepage </a>
+｜<a href="#whats-new"> What's New </a>
+｜<a href="#Benchmarks"> Benchmarks </a>
+｜<a href="#install"> Install </a>
+｜<a href="#usage"> Usage </a>
+｜<a href="#contact"> Contact </a>
+
+</h4>
+
+Model zoo: [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
+
+Online demo:
+[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
+
+</div>
+
+<a name="highlights"></a>
+
+# Highlights 🎯
+
+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+
+- **Multilingual recognition:** trained on over 400,000 hours of data, supports more than 50 languages, and outperforms the Whisper model in recognition accuracy.
+- **Rich transcription:**
+  - Excellent emotion recognition, matching or exceeding the current best emotion recognition models on test data.
+  - Audio event detection for many common human-computer interaction events, such as background music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient inference:** the SenseVoice-Small model uses a non-autoregressive end-to-end framework with extremely low inference latency; transcribing 10 s of audio takes only 70 ms, 15 times faster than Whisper-Large.
+- **Convenient finetuning:** handy finetuning scripts and strategies make it easy to fix long-tail recognition issues for specific business scenarios.
+- **Service deployment:** a complete service deployment pipeline supporting multi-concurrency requests, with clients available in Python, C++, HTML, Java, C#, and more.
+
+<a name="whats-new"></a>
+
+# What's New 🔥
+
+- 2024/7: Added [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py) export, as well as Python runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/).
+- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) multilingual speech understanding model is open-sourced, supporting multilingual speech recognition for Mandarin, Cantonese, English, Japanese, and Korean, along with emotion recognition and event detection, at very low inference latency.
+- 2024/7: CosyVoice focuses on natural speech generation, supporting multiple languages, timbre, and emotion control; it excels at multilingual speech generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice online demo](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
+- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit offering automatic speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-speaker ASR.
+
+<a name="Benchmarks"></a>
+
+# Benchmarks 📝
+
+## Multilingual speech recognition
+
+We compared the multilingual speech recognition accuracy and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. SenseVoice-Small shows a clear advantage in Chinese and Cantonese recognition.
+
+<div align="center">  
+<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
+</div>
+
+## Speech emotion recognition
+
+Since widely used benchmarks and metrics for emotion recognition are still lacking, we evaluated multiple metrics on several test sets and performed a comprehensive comparison with many results from recent benchmarks. The selected test sets cover both Chinese and English and include multiple styles such as acted performance, film and TV drama, and natural conversation. Without finetuning on the target data, SenseVoice matches or exceeds the performance of the current best emotion recognition models on the test data.
+
+<div align="center">  
+<img src="image/ser_table.png" width="1000" />
+</div>
+
+We also compared several open-source emotion recognition models on the test sets. The results show that SenseVoice-Large achieves the best results on almost all datasets, while SenseVoice-Small also outperforms the other open-source models on the majority of datasets.
+
+<div align="center">  
+<img src="image/ser_figure.png" width="500" />
+</div>
+
+## Audio event detection
+
+Although SenseVoice is trained only on speech data, it can still be used as a standalone event detection model. We compared it with the widely used BEATs and PANN models on the ESC-50 environmental sound classification dataset. SenseVoice achieves good results on these tasks but, limited by its training data and training method, still lags behind specialized event detection models in event classification.
+
+<div align="center">  
+<img src="image/aed_figure.png" width="500" />
+</div>
+
+## Inference efficiency
+
+The SenseVoice-Small model uses a non-autoregressive end-to-end architecture with very low inference latency. With a parameter count comparable to the Whisper-Small model, it is 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. In addition, its inference time does not increase noticeably as the audio gets longer.
+
+<div align="center">  
+<img src="image/inference.png" width="1000" />
+</div>
+
+<a name="install"></a>
+
+# Install 🐍
+
+```shell
+pip install -r requirements.txt
+```
+
+<a name="usage"></a>
+
+# Usage 🛠️
+
+## Inference
+
+### Inference with funasr
+
+Supports audio input in any format and of any duration.
+
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+
+model = AutoModel(
+    model=model_dir,
+    trust_remote_code=True,
+    remote_code="./model.py",  
+    vad_model="fsmn-vad",
+    vad_kwargs={"max_single_segment_time": 30000},
+    device="cuda:0",
+)
+
+# en
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size_s=60,
+    merge_vad=True,
+    merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
+
+<details><summary>Parameter descriptions (click to expand)</summary>
+
+- `model_dir`: the model name, or the model path on local disk.
+- `trust_remote_code`:
+  - `True` means the model code is loaded from `remote_code`, which specifies its exact location (for example, `model.py` in the current directory); absolute paths, relative paths, and network URLs are supported.
+  - `False` means the model code is the version integrated into [FunASR](https://github.com/modelscope/FunASR); modifying `model.py` in the current directory then has no effect, since the internal FunASR version is loaded. See the model code [here](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `vad_model`: enables VAD, which splits long audio into short clips. With VAD on, the reported inference time covers both VAD and SenseVoice, i.e. the end-to-end pipeline latency; to benchmark the SenseVoice model alone, turn the VAD model off.
+- `vad_kwargs`: VAD model configuration; `max_single_segment_time` is the maximum duration of a segment produced by `vad_model`, in milliseconds (ms).
+- `use_itn`: whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: enables dynamic batching; the total audio duration in a batch, in seconds (s).
+- `merge_vad`: whether to merge the short audio clips produced by the VAD model, up to a merged length of `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: bans the emo_unk label; once banned, every sentence is assigned an emotion label. Default: `False`.
+
+</details>
+
+If all inputs are short audio clips (under 30 s) and batch inference is needed for speed, the VAD model can be removed and `batch_size` set instead:
+
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size=64, 
+)
+```
+
+For more detailed usage, see the [documentation](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
+
+### Inference directly
+
+Supports audio input in any format; the input duration must be under 30 s.
+
+```python
+from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
+m.eval()
+
+res = m.inference(
+    data_in=f"{kwargs['model_path']}/example/en.mp3",
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    ban_emo_unk=False,
+    **kwargs,
+)
+
+text = rich_transcription_postprocess(res[0][0]["text"])
+print(text)
+```
+
+## Service deployment
+
+To be updated.
+
+### Export and test
+
+<details><summary>ONNX and Libtorch export</summary>
+
+#### ONNX
+
+```python
+# pip3 install -U funasr funasr-onnx
+from pathlib import Path
+from funasr_onnx import SenseVoiceSmall
+from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
+
+# inference
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+
+Note: the exported ONNX model is saved in the original model directory.
+
+#### Libtorch
+
+```python
+from pathlib import Path
+from funasr_torch import SenseVoiceSmall
+from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
+
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+
+Note: the exported Libtorch model is saved in the original model directory.
+
+</details>
+
+### Deployment
+
+#### Deployment with FastAPI
+
+```shell
+export SENSEVOICE_DEVICE=cuda:0
+fastapi run --port 50000
+```
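A minimal client sketch for the service started above. Note that the endpoint path `/api/v1/asr` and the form fields `keys`/`lang` are assumptions about the bundled `api.py`; check the server code for the exact contract before relying on them.

```python
def build_asr_request(audio_path, lang="auto", host="127.0.0.1", port=50000):
    """Assemble the URL and form fields for one recognition call.
    The endpoint path and field names here are assumptions, not a verified API."""
    url = f"http://{host}:{port}/api/v1/asr"
    data = {"keys": audio_path, "lang": lang}
    return url, data

url, data = build_asr_request("example/en.mp3")
# To actually send the request (needs the `requests` package and a running server):
# import requests
# with open("example/en.mp3", "rb") as f:
#     print(requests.post(url, data=data, files=[("files", f)]).json())
```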
+
+## Finetune
+
+### Install training environment
+
+```shell
+git clone https://github.com/alibaba/FunASR.git && cd FunASR
+pip3 install -e ./
+```
+
+### Data preparation
+
+The data format must include the following fields:
+
+```text
+{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
+{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
+```
+
+For details, see `data/train_example.jsonl`.
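As a quick sanity check before training, each manifest line can be parsed and its fields verified. The snippet below is a standalone sketch (its required-field list simply mirrors the example above); it is not part of FunASR.

```python
import json

REQUIRED_FIELDS = {"key", "source", "source_len", "target", "target_len",
                   "text_language", "emo_target", "event_target", "with_or_wo_itn"}

def check_jsonl_line(line):
    """Parse one manifest line and report any missing required fields."""
    record = json.loads(line)
    missing = sorted(REQUIRED_FIELDS - record.keys())
    return record, missing

line = ('{"key": "ID0012W0014", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", '
        '"event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", '
        '"target": "he tried to think how it could be", '
        '"source": "asr_example_en.wav", "target_len": 8, "source_len": 160}')
record, missing = check_jsonl_line(line)
print(missing)  # an empty list: every required field is present
```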
+
+<details><summary>Data preparation details</summary>
+
+- `key`: unique ID of the sample
+- `source`: path to the audio file
+- `source_len`: number of fbank frames of the audio file
+- `target`: transcription of the audio file
+- `target_len`: length of the transcription
+- `text_language`: language label of the audio file
+- `emo_target`: emotion label of the audio file
+- `event_target`: event label of the audio file
+- `with_or_wo_itn`: whether the transcription includes punctuation and inverse text normalization
+
+The jsonl files can be generated with the `sensevoice2jsonl` command from train_wav.scp, train_text.txt, train_text_language.txt, train_emo_target.txt, and train_event_target.txt, prepared as follows:
+
+`train_text.txt`
+
+The left column is the unique sample ID, which must match the corresponding `ID` in `train_wav.scp`.
+The right column is the transcription of the audio file, formatted as follows:
+
+```bash
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些 > 也许对
+ID0012W0014 he tried to think how it could be
+```
+
+`train_wav.scp`
+
+The left column is the unique sample ID, which must match the corresponding `ID` in `train_text.txt`.
+The right column is the path to the audio file, formatted as follows:
+
+```bash
+BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
+BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
+asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
+ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+```
+
+`train_text_language.txt`
+
+The left column is the unique sample ID, which must match the corresponding `ID` in `train_wav.scp`.
+The right column is the language label of the audio file; supported labels are `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`, formatted as follows:
+
+```bash
+BAC009S0764W0121 <|zh|>
+BAC009S0916W0489 <|zh|>
+asr_example_cn_en <|zh|>
+ID0012W0014 <|en|>
+```
+
+`train_emo.txt`
+
+The left column is the unique sample ID, which must match the corresponding `ID` in `train_wav.scp`.
+The right column is the emotion label of the audio file; supported labels are `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`, formatted as follows:
+
+```bash
+BAC009S0764W0121 <|NEUTRAL|>
+BAC009S0916W0489 <|NEUTRAL|>
+asr_example_cn_en <|NEUTRAL|>
+ID0012W0014 <|NEUTRAL|>
+```
+
+`train_event.txt`
+
+The left column is the unique sample ID, which must match the corresponding `ID` in `train_wav.scp`.
+The right column is the event label of the audio file; supported labels are `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`, formatted as follows:
+
+```bash
+BAC009S0764W0121 <|Speech|>
+BAC009S0916W0489 <|Speech|>
+asr_example_cn_en <|Speech|>
+ID0012W0014 <|Speech|>
+```
+
+Command
+
+```shell
+# generate train.jsonl and val.jsonl from wav.scp, text.txt, text_language.txt, emo_target.txt, event_target.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
+++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl"
+```
+
+If train_text_language.txt, train_emo_target.txt, and train_event_target.txt are not provided, the language, emotion, and event labels are predicted automatically using the `SenseVoice` model.
+
+```shell
+# generate train.jsonl and val.jsonl from wav.scp and text.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++data_type_list='["source", "target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl" \
+++model_dir='iic/SenseVoiceSmall'
+```
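The command above writes both `train.jsonl` and `val.jsonl`. If you assemble manifests by hand, a held-out validation split can be carved out in a few lines; the helper below is an illustrative standalone sketch (the function name and split ratio are assumptions), not FunASR code.

```python
import random

def split_manifest(lines, val_ratio=0.05, seed=0):
    """Shuffle jsonl lines and split off a small validation set."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_ratio))
    return lines[n_val:], lines[:n_val]  # (train, val)

manifest = [f'{{"key": "utt{i:03d}"}}' for i in range(100)]
train, val = split_manifest(manifest)
print(len(train), len(val))  # 95 5
```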
+
+</details>
+
+### Start training
+
+Make sure to change `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation directory set up earlier.
+
+```shell
+bash finetune.sh
+```
+
+## WebUI
+
+```shell
+python webui.py
+```
+
+<div align="center"><img src="image/webui.png" width="700"/> </div>
+
+## Notable third-party work
+
+- Triton (GPU) deployment best practice: with Triton + TensorRT, tested in FP32, an acceleration ratio of 526 is achieved on a V100 GPU; FP16 support is in progress. [repo](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
+- sherpa-onnx deployment best practice: run SenseVoice from 10 programming languages (C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, Dart) and deploy it on platforms such as iOS, Android, and Raspberry Pi. [repo](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
+- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp): pure C/C++ inference of SenseVoice based on GGML, supporting 3-bit, 4-bit, 5-bit, and 8-bit quantization, with no third-party dependencies.
+- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice): chunk-based inference; to achieve pseudo-streaming it uses truncated attention, trading off some accuracy, and it also supports CTC prefix beam search and hotword boosting.
+- [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice): a lightweight inference library supporting batch inference.
+
+# Contact
+
+If you encounter problems in use, you can raise an Issue directly on the GitHub page. If you are interested in speech technology, scan the DingTalk group QR code below to join the community group for communication and discussion.
+
+|                          FunASR                          |
+|:--------------------------------------------------------:|
+| <img src="image/dingding_funasr.png" width="250"/></div> |
diff --git a/examples/industrial_data_pretraining/sense_voice/finetune.sh b/examples/industrial_data_pretraining/sense_voice/finetune.sh
index 0003909..081b77b 100644
--- a/examples/industrial_data_pretraining/sense_voice/finetune.sh
+++ b/examples/industrial_data_pretraining/sense_voice/finetune.sh
@@ -43,7 +43,7 @@
 echo $DISTRIBUTED_ARGS
 
 # funasr trainer path
-train_tool=`dirname $(which funasr)`/train_ds.py
+train_tool=../../../funasr/bin/train_ds.py
 
 torchrun $DISTRIBUTED_ARGS \
 ${train_tool} \

--
Gitblit v1.9.1