From 1a0de67a08b4407497dee0f3dcc1339d7b3e6e3a Mon Sep 17 00:00:00 2001
From: 游雁 <zhifu.gzf@alibaba-inc.com>
Date: Thu, 07 Nov 2024 13:58:20 +0800
Subject: [PATCH] SenseVoice docs
---
examples/industrial_data_pretraining/sense_voice/README_zh.md | 405 ++++++++++++++++++++
examples/industrial_data_pretraining/sense_voice/finetune.sh | 2
examples/industrial_data_pretraining/sense_voice/README_ja.md | 358 +++++++++++++++++
examples/industrial_data_pretraining/sense_voice/README.md | 381 +++++++++++++++++++
 4 files changed, 1145 insertions(+), 1 deletion(-)
diff --git a/examples/industrial_data_pretraining/sense_voice/README.md b/examples/industrial_data_pretraining/sense_voice/README.md
new file mode 100644
index 0000000..746986c
--- /dev/null
+++ b/examples/industrial_data_pretraining/sense_voice/README.md
@@ -0,0 +1,381 @@
+([简体中文](./README_zh.md)|English|[日本語](./README_ja.md))
+
+
+# Introduction
+
+SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
+
+<div align="center">
+<img src="image/sensevoice2.png">
+</div>
+
+[//]: # (<div align="center"><img src="image/sensevoice.png" width="700"/> </div>)
+
+<div align="center">
+<h4>
+<a href="https://funaudiollm.github.io/"> Homepage </a>
+｜<a href="#What's News"> What's News </a>
+｜<a href="#Benchmarks"> Benchmarks </a>
+｜<a href="#Install"> Install </a>
+｜<a href="#Usage"> Usage </a>
+｜<a href="#Community"> Community </a>
+</h4>
+
+Model Zoo:
+[modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
+
+Online Demo:
+[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
+
+
+</div>
+
+
+<a name="Highlights"></a>
+# Highlights 🎯
+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+- **Multilingual Speech Recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, with recognition performance surpassing the Whisper model.
+- **Rich Transcription:**
+  - Excellent emotion recognition, matching and surpassing the best current emotion recognition models on test data.
+  - Sound event detection, covering common human-computer interaction events such as background music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient Inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework with exceptionally low inference latency: it needs only 70 ms to process 10 seconds of audio, 15 times faster than Whisper-Large.
+- **Convenient Finetuning:** Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues in their business scenarios.
+- **Service Deployment:** Offers a service deployment pipeline supporting multi-concurrent requests, with client-side support for Python, C++, HTML, Java, C#, and other languages.
+
+<a name="What's News"></a>
+# What's New 🔥
+- 2024/7: Added Export Features for [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py), as well as Python Version Runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/)
+- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) voice understanding model is open-sourced, offering high-precision multilingual speech recognition, emotion recognition, and audio event detection for Mandarin, Cantonese, English, Japanese, and Korean, with exceptionally low inference latency.
+- 2024/7: CosyVoice, a model for natural speech generation with multi-language, timbre, and emotion control, is open-sourced. CosyVoice excels at multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
+- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR.
+
+<a name="Benchmarks"></a>
+# Benchmarks 📝
+
+## Multilingual Speech Recognition
+We compared the performance of multilingual speech recognition between SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. In terms of Chinese and Cantonese recognition, the SenseVoice-Small model has advantages.
+
+<div align="center">
+<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
+</div>
+
+## Speech Emotion Recognition
+
+Due to the current lack of widely-used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English, and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to achieve and exceed the performance of the current best speech emotion recognition models.
+
+<div align="center">
+<img src="image/ser_table.png" width="1000" />
+</div>
+
+Furthermore, we compared multiple open-source speech emotion recognition models on the test sets, and the results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of the datasets.
+
+<div align="center">
+<img src="image/ser_figure.png" width="500" />
+</div>
+
+## Audio Event Detection
+
+Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification dataset ESC-50 against the widely used industry models BEATS and PANN. SenseVoice achieved commendable results on these tasks; however, due to limitations in training data and methodology, its event classification performance still lags behind specialized AED models.
+
+<div align="center">
+<img src="image/aed_figure.png" width="500" />
+</div>
+
+## Computational Efficiency
+
+The SenseVoice-Small model uses a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a parameter count similar to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large.
+
+<div align="center">
+<img src="image/inference.png" width="1000" />
+</div>
+
+
+<a name="Install"></a>
+# Requirements
+
+```shell
+pip install -r requirements.txt
+```
+
+<a name="Usage"></a>
+# Usage
+
+## Inference
+
+Supports input of audio in any format and of any duration.
+
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+
+model = AutoModel(
+ model=model_dir,
+ trust_remote_code=True,
+ remote_code="./model.py",
+ vad_model="fsmn-vad",
+ vad_kwargs={"max_single_segment_time": 30000},
+ device="cuda:0",
+)
+
+# en
+res = model.generate(
+ input=f"{model.model_path}/example/en.mp3",
+ cache={},
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=True,
+ batch_size_s=60,
+ merge_vad=True,  # merge short segments split by the VAD model
+ merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
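Before postprocessing, SenseVoice emits its transcript interleaved with special tokens for language, emotion, event, and ITN state (the `<|en|>`-style labels listed later in this README). `rich_transcription_postprocess` handles these in FunASR; the sketch below only illustrates the tag format — it is not the library's implementation:

```python
import re

# Tag pattern follows the label format shown in this README (e.g. <|en|>,
# <|NEUTRAL|>, <|Speech|>, <|withitn|>). Illustrative sketch only, not
# FunASR's actual rich_transcription_postprocess.
TAG = re.compile(r"<\|[^|]+\|>")

def strip_rich_tags(text: str) -> str:
    """Remove SenseVoice special tokens, keeping the plain transcript."""
    return TAG.sub("", text).strip()

raw = "<|en|><|NEUTRAL|><|Speech|><|withitn|>The weather is nice today."
print(strip_rich_tags(raw))  # -> The weather is nice today.
```

The real postprocess also maps emotion and event tokens to emoji and merges segments; use it rather than this sketch in practice.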
+
+<details><summary>Parameter Description (Click to Expand)</summary>
+
+- `model_dir`: The name of the model, or the path to the model on the local disk.
+- `trust_remote_code`:
+ - When `True`, it means that the model's code implementation is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory). It supports absolute paths, relative paths, and network URLs.
+ - When `False`, it indicates that the model's code implementation is the integrated version within [FunASR](https://github.com/modelscope/FunASR). At this time, modifications made to `model.py` in the current directory will not be effective, as the version loaded is the internal one from FunASR. For the model code, [click here to view](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `vad_model`: Enables VAD (Voice Activity Detection). VAD splits long audio into shorter clips. In this case, the reported inference time includes both VAD and SenseVoice, i.e. the end-to-end latency. To measure the SenseVoice model's inference time alone, disable the VAD model.
+- `vad_kwargs`: Specifies the configurations for the VAD model. `max_single_segment_time`: denotes the maximum duration for audio segmentation by the `vad_model`, with the unit being milliseconds (ms).
+- `use_itn`: Whether the output result includes punctuation and inverse text normalization.
+- `batch_size_s`: Indicates the use of dynamic batching, where the total duration of audio in the batch is measured in seconds (s).
+- `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Whether to ban the output of the `emo_unk` token.
+</details>
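As a concrete picture of `merge_vad`/`merge_length_s`, the sketch below greedily merges consecutive VAD segments until a merged chunk would exceed the limit. This is a simplification for intuition, not FunASR's internal merging logic:

```python
# Illustrative sketch of merge_vad behaviour (not FunASR's internal code):
# greedily concatenate consecutive VAD segments (start_ms, end_ms) until a
# merged chunk would exceed merge_length_s seconds.
def merge_segments(segments, merge_length_s=15):
    limit_ms = merge_length_s * 1000
    merged = []
    for start, end in segments:
        if merged and (end - merged[-1][0]) <= limit_ms:
            merged[-1] = (merged[-1][0], end)  # extend the current chunk
        else:
            merged.append((start, end))
    return merged

segments = [(0, 4000), (4500, 9000), (9500, 16000), (16500, 20000)]
print(merge_segments(segments))  # -> [(0, 9000), (9500, 20000)]
```

Merging short fragments this way reduces per-segment overhead and keeps batches closer to the `merge_length_s` budget.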
+
+If all inputs are short audios (<30 s) and batch inference is needed to speed up inference, the VAD model can be removed and `batch_size` set accordingly.
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+ input=f"{model.model_path}/example/en.mp3",
+ cache={},
+ language="zh", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=False,
+ batch_size=64,
+)
+```
+
+For more usage, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
+
+### Inference directly
+
+Supports input of audio in any format, with an input duration limit of 30 seconds or less.
+
+```python
+from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
+m.eval()
+
+res = m.inference(
+ data_in=f"{kwargs['model_path']}/example/en.mp3",
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=False,
+ ban_emo_unk=False,
+ **kwargs,
+)
+
+text = rich_transcription_postprocess(res[0][0]["text"])
+print(text)
+```
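Since `m.inference` here expects clips of 30 seconds or less, it can help to check durations up front. A minimal duration check for WAV input using only the standard library (the 30 s limit is the one stated above; other formats would need an audio library):

```python
import os
import struct
import tempfile
import wave

def wav_duration_s(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# Demo on a synthetic 1-second, 16 kHz, mono, 16-bit file:
tmp = os.path.join(tempfile.mkdtemp(), "clip.wav")
with wave.open(tmp, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(struct.pack("<16000h", *([0] * 16000)))

assert wav_duration_s(tmp) <= 30  # safe to pass to m.inference
print(wav_duration_s(tmp))  # -> 1.0
```

Clips longer than 30 s should go through the VAD pipeline shown earlier instead.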
+
+### Export and Test
+<details><summary>ONNX and Libtorch Export</summary>
+
+#### ONNX
+```python
+# pip3 install -U funasr funasr-onnx
+from pathlib import Path
+from funasr_onnx import SenseVoiceSmall
+from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
+
+# inference
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The ONNX model is exported to the original model directory.
+
+#### Libtorch
+```python
+from pathlib import Path
+from funasr_torch import SenseVoiceSmall
+from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
+
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The Libtorch model is exported to the original model directory.
+</details>
+
+## Service
+
+### Deployment with FastAPI
+```shell
+export SENSEVOICE_DEVICE=cuda:0
+fastapi run --port 50000
+```
+
+## Finetune
+
+### Requirements
+
+```shell
+git clone https://github.com/alibaba/FunASR.git && cd FunASR
+pip3 install -e ./
+```
+
+### Data preparation
+
+Data examples:
+
+```text
+{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
+{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
+```
+
+See `data/train_example.jsonl` for the full format.
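The manifest fields above can be sanity-checked before training. A small validator over one jsonl line (field names taken from the example above; this helper is illustrative, not part of FunASR):

```python
import json

# Fields this recipe's jsonl manifest is expected to carry, per the example.
REQUIRED = {"key", "text_language", "emo_target", "event_target",
            "with_or_wo_itn", "target", "source", "target_len", "source_len"}

def check_manifest_line(line: str) -> dict:
    """Parse one jsonl line and verify the fields this recipe expects."""
    item = json.loads(line)
    missing = REQUIRED - item.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return item

line = ('{"key": "utt1", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", '
        '"event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", '
        '"target": "hello world", "source": "/tmp/utt1.wav", '
        '"target_len": 2, "source_len": 100}')
item = check_manifest_line(line)
print(item["target"])  # -> hello world
```

Running such a check over the whole file catches malformed lines early, before they fail mid-training.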
+
+<details><summary>Data Prepare Details</summary>
+
+Description:
+- `key`: unique ID of the audio file
+- `source`: path to the audio file
+- `source_len`: number of fbank frames of the audio file
+- `target`: transcription
+- `target_len`: length of the target transcription
+- `text_language`: language ID of the audio file
+- `emo_target`: emotion label of the audio file
+- `event_target`: event label of the audio file
+- `with_or_wo_itn`: whether the transcription includes punctuation and inverse text normalization
+
+
+`train_text.txt`
+
+
+```bash
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些>也许对
+ID0012W0014 he tried to think how it could be
+```
+
+`train_wav.scp`
+
+
+
+```bash
+BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
+BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
+asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
+ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+```
+
+`train_text_language.txt`
+
+The language ids include `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`.
+
+```bash
+BAC009S0764W0121 <|zh|>
+BAC009S0916W0489 <|zh|>
+asr_example_cn_en <|zh|>
+ID0012W0014 <|en|>
+```
+
+`train_emo.txt`
+
+The emotion labels include `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`.
+
+```bash
+BAC009S0764W0121 <|NEUTRAL|>
+BAC009S0916W0489 <|NEUTRAL|>
+asr_example_cn_en <|NEUTRAL|>
+ID0012W0014 <|NEUTRAL|>
+```
+
+`train_event.txt`
+
+The event labels include `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`.
+
+```bash
+BAC009S0764W0121 <|Speech|>
+BAC009S0916W0489 <|Speech|>
+asr_example_cn_en <|Speech|>
+ID0012W0014 <|Speech|>
+```
+
+`Command`
+```shell
+# generate train.jsonl and val.jsonl from wav.scp, text.txt, text_language.txt, emo_target.txt, event_target.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
+++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl"
+```
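Conceptually, `sensevoice2jsonl` joins the per-key list files by their first column. A rough sketch of that pairing logic (illustrative only — the real tool also computes `source_len`/`target_len` and reads the audio):

```python
# Rough sketch of how sensevoice2jsonl pairs the kaldi-style list files by
# key (first column). Illustrative only: the real tool also computes
# source_len/target_len from the audio and text.
import json

def pair_lists(scp_texts, data_types):
    tables = []
    for text in scp_texts:
        table = {}
        for line in text.strip().splitlines():
            key, value = line.split(maxsplit=1)
            table[key] = value
        tables.append(table)
    entries = []
    for key in tables[0]:
        entry = {"key": key}
        for dtype, table in zip(data_types, tables):
            entry[dtype] = table[key]
        entries.append(entry)
    return [json.dumps(e, ensure_ascii=False) for e in entries]

wav_scp = "utt1 /data/utt1.wav"
text_txt = "utt1 he tried to think how it could be"
print(pair_lists([wav_scp, text_txt], ["source", "target"])[0])
```

All list files must therefore share the same set of keys, which is why the `scp_file_list` and `data_type_list` arguments are given in matching order.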
+
+If `train_text_language.txt`, `train_emo.txt`, and `train_event.txt` are not provided, the language, emotion, and event labels will be predicted automatically by the `SenseVoice` model.
+```shell
+# generate train.jsonl and val.jsonl from wav.scp and text.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++data_type_list='["source", "target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl" \
+++model_dir='iic/SenseVoiceSmall'
+```
+</details>
+
+### Finetune
+
+Make sure to change `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation directory set up earlier.
+
+```shell
+bash finetune.sh
+```
+
+## WebUI
+
+```shell
+python webui.py
+```
+
+<div align="center"><img src="image/webui.png" width="700"/> </div>
+
+
+## Remarkable Third-Party Work
+- Triton (GPU) deployment best practices: using Triton + TensorRT, tested with FP32, achieving an acceleration ratio of 526 on a V100 GPU. FP16 support is in progress. [Repository](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
+- Sherpa-onnx Deployment Best Practices: Supports using SenseVoice in 10 programming languages: C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, and Dart. Also supports deploying SenseVoice on platforms like iOS, Android, and Raspberry Pi. [Repository](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
+- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp). Inference of SenseVoice in pure C/C++ based on GGML, supporting 3-bit, 4-bit, 5-bit, 8-bit quantization, etc. with no third-party dependencies.
+- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice) processes inference in chunks. To achieve pseudo-streaming, it employs a truncated attention mechanism, sacrificing some accuracy. Additionally, this technology supports CTC prefix beam search and hot-word boosting features.
+- [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice) is optimized for lightning-fast inference and batch processing.
+
+<a name="Community"></a>
+# Community
+If you encounter problems in use, you can raise issues directly on the GitHub page.
+
+You can also scan the DingTalk group QR code below to join the community for discussion.
+
+| FunASR |
+|:--------------------------------------------------------:|
+| <img src="image/dingding_funasr.png" width="250"/></div> |
+
+
diff --git a/examples/industrial_data_pretraining/sense_voice/README_ja.md b/examples/industrial_data_pretraining/sense_voice/README_ja.md
new file mode 100644
index 0000000..c586899
--- /dev/null
+++ b/examples/industrial_data_pretraining/sense_voice/README_ja.md
@@ -0,0 +1,358 @@
+# SenseVoice
+
+「[简体中文](./README_zh.md)」|「[English](./README.md)」|「日本語」
+
+SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event classification (AEC) or audio event detection (AED). This project introduces the SenseVoice model, together with benchmarks on several task test sets and the environment setup and inference instructions needed to try the model.
+
+<div align="center">
+<img src="image/sensevoice2.png">
+</div>
+[//]: # (<div align="center"><img src="image/sensevoice2.png" width="700"/> </div>)
+
+<div align="center">
+<h4>
+<a href="https://funaudiollm.github.io/"> Homepage </a>
+｜<a href="#What's News"> What's New </a>
+｜<a href="#Benchmarks"> Benchmarks </a>
+｜<a href="#Install"> Install </a>
+｜<a href="#Usage"> Usage </a>
+｜<a href="#Community"> Community </a>
+</h4>
+
+Model Zoo: [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
+
+Online Demo:
+[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
+
+</div>
+
+<a name="Highlights"></a>
+# Highlights 🎯
+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+- **Multilingual Speech Recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, with recognition performance surpassing the Whisper model.
+- **Rich Transcription:**
+  - Excellent emotion recognition, matching and surpassing the best current emotion recognition models on test data.
+  - Sound event detection, covering common human-computer interaction events such as background music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient Inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework with exceptionally low inference latency: it needs only 70 ms to process 10 seconds of audio, 15 times faster than Whisper-Large.
+- **Convenient Finetuning:** Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues in their business scenarios.
+- **Service Deployment:** Offers a complete service deployment pipeline supporting multi-concurrent requests, with client-side support for Python, C++, HTML, Java, C#, and other languages.
+
+<a name="What's News"></a>
+# What's New 🔥
+- 2024/7: Added export features for [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py), as well as Python runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/).
+- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) multilingual speech understanding model is open-sourced, supporting high-precision multilingual speech recognition, emotion recognition, and audio event detection for Mandarin, Cantonese, English, Japanese, and Korean, with exceptionally low inference latency.
+- 2024/7: CosyVoice, a model for natural speech generation with multi-language, timbre, and emotion control, is open-sourced. CosyVoice excels at multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
+- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit offering speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-talker ASR.
+
+<a name="Benchmarks"></a>
+# Benchmarks 📝
+
+## Multilingual Speech Recognition
+
+We compared the speech recognition performance and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. For Chinese and Cantonese recognition, the SenseVoice-Small model has a clear advantage.
+
+<div align="center">
+<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
+</div>
+
+## Speech Emotion Recognition
+
+Due to the current lack of widely used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets cover both Chinese and English data and multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice matched and exceeded the performance of the best current speech emotion recognition models.
+
+<div align="center">
+<img src="image/ser_table.png" width="1000" />
+</div>
+
+Furthermore, we compared multiple open-source speech emotion recognition models on the test sets. The results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of them.
+
+<div align="center">
+<img src="image/ser_figure.png" width="500" />
+</div>
+
+## Audio Event Detection
+
+Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification dataset ESC-50 against the widely used industry models BEATS and PANN. SenseVoice achieved commendable results on these tasks, but due to limitations in training data and methodology, its event classification performance still lags behind specialized AED models.
+
+<div align="center">
+<img src="image/aed_figure.png" width="500" />
+</div>
+
+## Computational Efficiency
+
+The SenseVoice-Small model uses a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a parameter count similar to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. In addition, the inference time of SenseVoice-Small does not increase significantly with audio duration.
+
+<div align="center">
+<img src="image/inference.png" width="1000" />
+</div>
+
+<a name="Install"></a>
+# Install 🐍
+
+```shell
+pip install -r requirements.txt
+```
+
+<a name="Usage"></a>
+# Usage 🛠️
+
+## Inference
+
+Supports audio input in any format and of any duration.
+
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+
+model = AutoModel(
+ model=model_dir,
+ trust_remote_code=True,
+ remote_code="./model.py",
+ vad_model="fsmn-vad",
+ vad_kwargs={"max_single_segment_time": 30000},
+ device="cuda:0",
+)
+
+# en
+res = model.generate(
+ input=f"{model.model_path}/example/en.mp3",
+ cache={},
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=True,
+ batch_size_s=60,
+ merge_vad=True,  # merge short segments split by the VAD model
+ merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
+
+<details><summary>Parameter Description (Click to Expand)</summary>
+
+- `model_dir`: The name of the model, or the path to the model on the local disk.
+- `trust_remote_code`:
+  - When `True`, the model's code implementation is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory). Absolute paths, relative paths, and network URLs are supported.
+  - When `False`, the model's code implementation is the version integrated within [FunASR](https://github.com/modelscope/FunASR). In this case, modifications to `model.py` in the current directory take no effect, since the internal FunASR version is loaded. For the model code, [click here to view](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `vad_model`: Enables VAD (Voice Activity Detection). VAD splits long audio into shorter clips. In this case, the reported inference time includes both VAD and SenseVoice, i.e. the end-to-end latency. To measure the SenseVoice model's inference time alone, disable the VAD model.
+- `vad_kwargs`: Configuration for the VAD model. `max_single_segment_time`: the maximum duration of an audio segment produced by the `vad_model`, in milliseconds (ms).
+- `use_itn`: Whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: Enables dynamic batching, where the total duration of audio in a batch is measured in seconds (s).
+- `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Whether to ban the output of the `emo_unk` token.
+</details>
+
+銇欍伖銇︺伄鍏ュ姏銇岀煭銇勯煶澹帮紙30绉掓湭婧�锛夈仹銇傘倞銆併儛銉冦儊鎺ㄨ珫銇屽繀瑕併仾鍫村悎銆佹帹璜栧姽鐜囥倰鍚戜笂銇曘仜銈嬨仧銈併伀VAD銉€儑銉倰鍓婇櫎銇椼�乣batch_size`銈掕ō瀹氥仹銇嶃伨銇欍��
+
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+ input=f"{model.model_path}/example/en.mp3",
+ cache={},
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=True,
+ batch_size=64,
+)
+```
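For intuition, `batch_size_s` in the earlier example groups utterances by total audio duration rather than by count. A simplified sketch of that bucketing (an illustration only, not FunASR's actual scheduler):

```python
# Simplified sketch of duration-based dynamic batching (batch_size_s):
# group utterances so each batch's total duration stays within the budget.
# Not FunASR's actual scheduler, just an illustration.
def dynamic_batches(durations_s, batch_size_s=60):
    batches, current, total = [], [], 0.0
    for i, dur in enumerate(durations_s):
        if current and total + dur > batch_size_s:
            batches.append(current)
            current, total = [], 0.0
        current.append(i)
        total += dur
    if current:
        batches.append(current)
    return batches

print(dynamic_batches([20, 25, 30, 10, 50], batch_size_s=60))  # -> [[0, 1], [2, 3], [4]]
```

Batching by duration keeps GPU memory usage roughly constant even when clip lengths vary widely, which is why `batch_size_s` is preferred over a fixed `batch_size` for mixed-length inputs.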
+
+For more usage, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
+
+### Inference directly
+
+Supports audio input in any format, with an input duration limit of 30 seconds or less.
+
+```python
+from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
+m.eval()
+
+res = m.inference(
+ data_in=f"{kwargs['model_path']}/example/en.mp3",
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=False,
+ ban_emo_unk=False,
+ **kwargs,
+)
+
+text = rich_transcription_postprocess(res[0][0]["text"])
+print(text)
+```
+
+## Service Deployment
+
+To be completed.
+
+### Export and Test
+<details><summary>ONNX and Libtorch Export</summary>
+
+#### ONNX
+```python
+# pip3 install -U funasr funasr-onnx
+from pathlib import Path
+from funasr_onnx import SenseVoiceSmall
+from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
+
+# inference
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The ONNX model is exported to the original model directory.
+
+#### Libtorch
+```python
+from pathlib import Path
+from funasr_torch import SenseVoiceSmall
+from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
+
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+Note: The Libtorch model is exported to the original model directory.
+
+</details>
+
+### 灞曢枊
+
+### FastAPI銈掍娇銇c仧灞曢枊
+```shell
+export SENSEVOICE_DEVICE=cuda:0
+fastapi run --port 50000
+```
+
+## Finetune
+
+### Requirements
+
+```shell
+git clone https://github.com/alibaba/FunASR.git && cd FunASR
+pip3 install -e ./
+```
+
+### Data preparation
+
+Data examples:
+```text
+{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
+{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
+```
+See `data/train_example.jsonl` for the full format.
+
+<details><summary>Data Preparation Details</summary>
+
+Description:
+- `key`: unique ID of the audio file
+- `source`: path to the audio file
+- `source_len`: number of fbank frames of the audio file
+- `target`: transcription
+- `target_len`: length of the target transcription
+- `text_language`: language ID of the audio file
+- `emo_target`: emotion label of the audio file
+- `event_target`: event label of the audio file
+- `with_or_wo_itn`: whether the transcription includes punctuation and inverse text normalization
+
+`train_text.txt`
+```bash
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些>也许对
+ID0012W0014 he tried to think how it could be
+```
+`train_wav.scp`
+```bash
+BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
+BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
+asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
+ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+```
+`train_text_language.txt`
+The language ids include `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`.
+```bash
+BAC009S0764W0121 <|zh|>
+BAC009S0916W0489 <|zh|>
+asr_example_cn_en <|zh|>
+ID0012W0014 <|en|>
+```
+`train_emo.txt`
+The emotion labels include `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`.
+```bash
+BAC009S0764W0121 <|NEUTRAL|>
+BAC009S0916W0489 <|NEUTRAL|>
+asr_example_cn_en <|NEUTRAL|>
+ID0012W0014 <|NEUTRAL|>
+```
+`train_event.txt`
+The event labels include `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`.
+```bash
+BAC009S0764W0121 <|Speech|>
+BAC009S0916W0489 <|Speech|>
+asr_example_cn_en <|Speech|>
+ID0012W0014 <|Speech|>
+```
+`Command`
+```shell
+# generate train.jsonl and val.jsonl from wav.scp, text.txt, text_language.txt, emo_target.txt, event_target.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
+++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl"
+```
+If `train_text_language.txt`, `train_emo_target.txt`, and `train_event_target.txt` are not provided, the language, emotion, and event labels will be predicted automatically by the `SenseVoice` model.
+```shell
+# generate train.jsonl and val.jsonl from wav.scp and text.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++data_type_list='["source", "target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl" \
+++model_dir='iic/SenseVoiceSmall'
+```
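The ID-matching requirement described above can be checked mechanically before running `sensevoice2jsonl`. A minimal sketch (inline strings stand in for the actual list files; `read_ids` is an illustrative helper, not part of FunASR):

```python
def read_ids(text: str) -> set:
    # IDs are the first whitespace-separated column of each non-empty line.
    return {line.split(maxsplit=1)[0] for line in text.splitlines() if line.strip()}

# Inline stand-ins for train_wav.scp and train_emo.txt.
wav_scp = "BAC009S0764W0121 a.wav\nID0012W0014 b.wav"
emo_txt = "BAC009S0764W0121 <|NEUTRAL|>"

# IDs present in wav.scp but missing from the emotion list file.
missing = read_ids(wav_scp) - read_ids(emo_txt)
print(sorted(missing))  # -> ['ID0012W0014']
```

The same check applies to every pair of list files passed via `++scp_file_list`.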
+</details>
+
+### Start training
+
+Be sure to change `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` under the FunASR directory installed earlier.
+
+```shell
+bash finetune.sh
+```
+
+## WebUI
+
+```shell
+python webui.py
+```
+
+<div align="center"><img src="image/webui.png" width="700"/> </div>
+
+## Notable third-party work
+- Triton (GPU) deployment best practice: using Triton + TensorRT, tested with FP32, it achieves a speedup ratio of 526 on a V100 GPU; FP16 support is in progress. [repo](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
+- sherpa-onnx deployment best practice: SenseVoice can be used in 10 programming languages (C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, Dart) and can be deployed on platforms such as iOS, Android, and Raspberry Pi. [repo](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
+- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp): pure C/C++ inference of SenseVoice based on GGML, supporting 3-bit, 4-bit, 5-bit, and 8-bit quantization, with no third-party dependencies.
+- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice): runs inference chunk by chunk. To achieve pseudo-streaming, it uses truncated attention, sacrificing some accuracy; it also supports CTC prefix beam search and hotword boosting.
+- [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice): optimized for ultra-fast inference and batch processing.
+
+# Contact Us
+
+If you encounter problems during use, you can raise Issues directly on the GitHub page. If you are interested in speech technology, you are welcome to scan the DingTalk group QR code below to join the community group for exchange and discussion.
+
+| FunASR |
+|:--------------------------------------------------------:|
+| <img src="image/dingding_funasr.png" width="250"/> |
+
diff --git a/examples/industrial_data_pretraining/sense_voice/README_zh.md b/examples/industrial_data_pretraining/sense_voice/README_zh.md
new file mode 100644
index 0000000..9dd907b
--- /dev/null
+++ b/examples/industrial_data_pretraining/sense_voice/README_zh.md
@@ -0,0 +1,405 @@
+# SenseVoice
+
+(简体中文 | [English](./README.md) | [日本語](./README_ja.md))
+
+SenseVoice is a speech foundation model with audio understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event classification (AEC) or audio event detection (AED). This project provides an introduction to the SenseVoice model, benchmarks on multiple task test sets, and the environment setup and inference methods needed to try the model.
+
+<div align="center">
+<img src="image/sensevoice2.png">
+</div>
+
+<div align="center">
+<h4>
+<a href="https://funaudiollm.github.io/"> Homepage </a>
+｜<a href="#What's New"> What's New </a>
+｜<a href="#Benchmarks"> Benchmarks </a>
+｜<a href="#Install"> Install </a>
+｜<a href="#Usage"> Usage </a>
+｜<a href="#Contact Us"> Contact Us </a>
+
+</h4>
+
+Model Zoo: [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
+
+Online Demo:
+[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
+
+</div>
+
+<a name="Highlights"></a>
+
+# Highlights 🎯
+
+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+
+- **Multilingual recognition:** trained on over 400,000 hours of data, supports more than 50 languages, and outperforms the Whisper model in recognition accuracy.
+- **Rich transcription:**
+  - Excellent emotion recognition, matching or surpassing the best current emotion recognition models on test data.
+  - Audio event detection for a variety of common human-computer interaction events, such as background music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient inference:** the SenseVoice-Small model uses a non-autoregressive end-to-end architecture with extremely low inference latency: 10 s of audio takes only 70 ms, 15 times faster than Whisper-Large.
+- **Convenient finetuning:** convenient finetuning scripts and strategies make it easy to fix long-tail sample issues for specific business scenarios.
+- **Service deployment:** a complete service deployment pipeline supporting multi-concurrency requests, with client-side languages including Python, C++, HTML, Java, and C#.
+
+<a name="What's New"></a>
+
+# What's New 🔥
+
+- 2024/7: Added [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py) export, as well as Python runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/)
+- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) multilingual speech understanding model is open-sourced, supporting multilingual speech recognition for Chinese, Cantonese, English, Japanese, and Korean, along with emotion recognition and event detection, with extremely low inference latency.
+- 2024/7: CosyVoice is dedicated to natural speech generation, supporting multiple languages, timbre and emotion control; it excels at multilingual speech generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice online demo](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
+- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit offering automatic speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-speaker ASR.
+
+<a name="Benchmarks"></a>
+
+# Benchmarks 📝
+
+## Multilingual speech recognition
+
+We compared SenseVoice and Whisper for multilingual speech recognition accuracy and inference efficiency on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. SenseVoice-Small shows a clear accuracy advantage on Chinese and Cantonese.
+
+<div align="center">
+<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
+</div>
+
+## Emotion recognition
+
+Since there is currently a lack of widely used benchmarks and methods for emotion recognition, we evaluated multiple metrics on several test sets and made a comprehensive comparison with many results from recent benchmarks. The selected test sets cover both Chinese and English and include multiple styles, such as acted performances, film and TV drama, and natural conversation. Without finetuning on target data, SenseVoice matches or surpasses the performance of the current best emotion recognition models on the test data.
+
+<div align="center">
+<img src="image/ser_table.png" width="1000" />
+</div>
+
+We also compared several open-source emotion recognition models on these test sets. The results show that SenseVoice-Large achieves the best results on almost all datasets, while SenseVoice-Small also outperforms the other open-source models on the majority of datasets.
+
+<div align="center">
+<img src="image/ser_figure.png" width="500" />
+</div>
+
+## Event detection
+
+Although SenseVoice is trained only on speech data, it can still be used on its own as an event detection model. We compared it with the widely used industry models BEATs and PANN on the ESC-50 environmental sound classification dataset. SenseVoice achieves decent results on these tasks but, limited by its training data and method, still lags behind specialized event detection models in event classification performance.
+
+<div align="center">
+<img src="image/aed_figure.png" width="500" />
+</div>
+
+## Inference efficiency
+
+The SenseVoice-Small model uses a non-autoregressive end-to-end architecture with extremely low inference latency. With a parameter count comparable to the Whisper-Small model, it is 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. Moreover, its inference time does not increase significantly as the audio duration grows.
+
+<div align="center">
+<img src="image/inference.png" width="1000" />
+</div>
+
+<a name="Install"></a>
+
+# Install 🐍
+
+```shell
+pip install -r requirements.txt
+```
+
+<a name="Usage"></a>
+
+# Usage 🛠️
+
+## Inference
+
+### Inference with funasr
+
+Supports audio input in any format and of any duration.
+
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+
+model = AutoModel(
+ model=model_dir,
+ trust_remote_code=True,
+ remote_code="./model.py",
+ vad_model="fsmn-vad",
+ vad_kwargs={"max_single_segment_time": 30000},
+ device="cuda:0",
+)
+
+# en
+res = model.generate(
+ input=f"{model.model_path}/example/en.mp3",
+ cache={},
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=True,
+ batch_size_s=60,
+ merge_vad=True,
+ merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
+
+<details><summary>Parameter descriptions (click to expand)</summary>
+
+- `model_dir`: the model name, or the model path on local disk.
+- `trust_remote_code`:
+  - `True` means the model code is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory) and supports absolute paths, relative paths, and network URLs.
+  - `False` means the model code is the version integrated inside [FunASR](https://github.com/modelscope/FunASR); in this case, modifying `model.py` in the current directory has no effect, since the FunASR internal version is loaded. See the model code [here](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `vad_model`: enables VAD, which splits long audio into short clips. The reported inference time then covers both VAD and SenseVoice, i.e., the end-to-end latency of the pipeline; to benchmark the SenseVoice model alone, disable the VAD model.
+- `vad_kwargs`: configuration for the VAD model; `max_single_segment_time` is the maximum duration of audio segmented by `vad_model`, in milliseconds (ms).
+- `use_itn`: whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: enables dynamic batching, where the batch size is the total audio duration in the batch, in seconds (s).
+- `merge_vad`: whether to merge the short audio fragments produced by the VAD model, with a merged length of `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: whether to ban the emo_unk token; when banned, every sentence is assigned an emotion label. Default: `False`.
+
+</details>
+
+If all inputs are short audio clips (under 30 s) and batch inference is needed to speed things up, the VAD model can be removed and `batch_size` set instead:
+
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+ input=f"{model.model_path}/example/en.mp3",
+ cache={},
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=True,
+ batch_size=64,
+)
+```
+
+For more detailed usage, see the [documentation](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
+
+### Direct inference
+
+Supports audio input in any format, with input duration limited to under 30 s.
+
+```python
+from model import SenseVoiceSmall
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
+m.eval()
+
+res = m.inference(
+ data_in=f"{kwargs['model_path']}/example/en.mp3",
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+ use_itn=False,
+ ban_emo_unk=False,
+ **kwargs,
+)
+
+text = rich_transcription_postprocess(res[0][0]["text"])
+print(text)
+```
+
+## Service deployment
+
+To be done.
+
+### Export and test
+
+<details><summary>ONNX and Libtorch export</summary>
+
+#### ONNX
+
+```python
+# pip3 install -U funasr funasr-onnx
+from pathlib import Path
+from funasr_onnx import SenseVoiceSmall
+from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
+
+# inference
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+
+Note: the ONNX model is exported into the original model directory.
+
+#### Libtorch
+
+```python
+from pathlib import Path
+from funasr_torch import SenseVoiceSmall
+from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess
+
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
+
+wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]
+
+res = model(wav_or_scp, language="auto", use_itn=True)
+print([rich_transcription_postprocess(i) for i in res])
+```
+
+Note: the Libtorch model is exported into the original model directory.
+
+</details>
+
+### Deployment
+
+#### Deploy with FastAPI
+
+```shell
+export SENSEVOICE_DEVICE=cuda:0
+fastapi run --port 50000
+```
+
+## Finetuning
+
+### Set up the training environment
+
+```shell
+git clone https://github.com/alibaba/FunASR.git && cd FunASR
+pip3 install -e ./
+```
+
+### Data preparation
+
+The data format needs to include the following fields:
+
+```text
+{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
+{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
+```
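A prepared jsonl file can be sanity-checked by parsing each line and verifying that the training fields are present. A minimal sketch (the record is abridged from the example above; `check_record` is an illustrative helper, not part of FunASR):

```python
import json

# Fields each jsonl record is expected to carry for SenseVoice finetuning.
REQUIRED = {"key", "source", "source_len", "target", "target_len",
            "text_language", "emo_target", "event_target", "with_or_wo_itn"}

def check_record(raw: str) -> dict:
    # Parse one jsonl line and fail loudly if a training field is missing.
    record = json.loads(raw)
    missing = REQUIRED - record.keys()
    assert not missing, f"missing fields: {sorted(missing)}"
    return record

# Abridged from the format example above.
raw = ('{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", '
       '"emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", '
       '"with_or_wo_itn": "<|woitn|>", '
       '"target": "there is a tendency to identify the self or take interest in what one has got used to", '
       '"source": "audio/AUD0000001556_S0007580.wav", '
       '"target_len": 18, "source_len": 360}')
record = check_record(raw)
print(record["key"], record["target_len"])  # -> AUD0000001556_S0007580 18
```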
+
+For details, see `data/train_example.jsonl`.
+
+<details><summary>Data preparation details</summary>
+
+- `key`: unique ID of the data item
+- `source`: path to the audio file
+- `source_len`: number of fbank frames of the audio file
+- `target`: transcription of the audio file
+- `target_len`: length of the transcription
+- `text_language`: language label of the audio file
+- `emo_target`: emotion label of the audio file
+- `event_target`: event label of the audio file
+- `with_or_wo_itn`: whether the transcription includes punctuation and inverse text normalization
+
+The `sensevoice2jsonl` command can generate the jsonl from `train_wav.scp`, `train_text.txt`, `train_text_language.txt`, `train_emo.txt`, and `train_event.txt`, prepared as follows:
+
+`train_text.txt`
+
+The left column is the unique data ID, which must correspond one-to-one with the IDs in `train_wav.scp`;
+the right column is the transcription of the audio file, formatted as follows:
+
+```bash
+BAC009S0764W0121 甚至出现交易几乎停滞的情况
+BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
+asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些 也许对
+ID0012W0014 he tried to think how it could be
+```
+
+`train_wav.scp`
+
+The left column is the unique data ID, which must correspond one-to-one with the IDs in `train_text.txt`;
+the right column is the path of the audio file, formatted as follows:
+
+```bash
+BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
+BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
+asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
+ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+```
+
+`train_text_language.txt`
+
+The left column is the unique data ID, which must correspond one-to-one with the IDs in `train_wav.scp`;
+the right column is the language label of the audio file, supporting `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`, formatted as follows:
+
+```bash
+BAC009S0764W0121 <|zh|>
+BAC009S0916W0489 <|zh|>
+asr_example_cn_en <|zh|>
+ID0012W0014 <|en|>
+```
+
+`train_emo.txt`
+
+The left column is the unique data ID, which must correspond one-to-one with the IDs in `train_wav.scp`;
+the right column is the emotion label of the audio file, supporting `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`, formatted as follows:
+
+```bash
+BAC009S0764W0121 <|NEUTRAL|>
+BAC009S0916W0489 <|NEUTRAL|>
+asr_example_cn_en <|NEUTRAL|>
+ID0012W0014 <|NEUTRAL|>
+```
+
+`train_event.txt`
+
+The left column is the unique data ID, which must correspond one-to-one with the IDs in `train_wav.scp`;
+the right column is the event label of the audio file, supporting `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`, formatted as follows:
+
+```bash
+BAC009S0764W0121 <|Speech|>
+BAC009S0916W0489 <|Speech|>
+asr_example_cn_en <|Speech|>
+ID0012W0014 <|Speech|>
+```
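Since every list file must cover exactly the same set of IDs, a quick consistency check before generating the jsonl can save a failed run. A minimal sketch (inline strings stand in for the files on disk; `read_ids` is an illustrative helper, not part of FunASR):

```python
def read_ids(text: str) -> set:
    # IDs are the first whitespace-separated column of each non-empty line.
    return {line.split(maxsplit=1)[0] for line in text.splitlines() if line.strip()}

# Inline stand-ins for the list files; in practice, read them from disk.
files = {
    "train_wav.scp":  "BAC009S0764W0121 a.wav\nID0012W0014 b.wav",
    "train_text.txt": "BAC009S0764W0121 甚至出现交易几乎停滞的情况\nID0012W0014 he tried to think how it could be",
    "train_emo.txt":  "BAC009S0764W0121 <|NEUTRAL|>\nID0012W0014 <|NEUTRAL|>",
}
reference = read_ids(files["train_wav.scp"])
mismatched = [name for name, text in files.items() if read_ids(text) != reference]
print(mismatched)  # -> []
```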
+
+`Command`
+
+```shell
+# generate train.jsonl and val.jsonl from wav.scp, text.txt, text_language.txt, emo_target.txt, event_target.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
+++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl"
+```
+
+If `train_text_language.txt`, `train_emo.txt`, and `train_event.txt` are not provided, the language, emotion, and event labels are automatically predicted with the `SenseVoice` model.
+
+```shell
+# generate train.jsonl and val.jsonl from wav.scp and text.txt
+sensevoice2jsonl \
+++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++data_type_list='["source", "target"]' \
+++jsonl_file_out="../../../data/list/train.jsonl" \
+++model_dir='iic/SenseVoiceSmall'
+```
+
+</details>
+
+### Start training
+
+Be sure to change `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` under the FunASR directory installed earlier.
+
+```shell
+bash finetune.sh
+```
+
+## WebUI
+
+```shell
+python webui.py
+```
+
+<div align="center"><img src="image/webui.png" width="700"/> </div>
+
+## Notable third-party work
+
+- Triton (GPU) deployment best practice: using Triton + TensorRT, tested with FP32, it achieves a speedup ratio of 526 on a V100 GPU; FP16 support is in progress. [repo](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
+- sherpa-onnx deployment best practice: SenseVoice can be used in 10 programming languages, namely C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, and Dart, and can be deployed on platforms such as iOS, Android, and Raspberry Pi. [repo](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
+- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp): pure C/C++ inference of SenseVoice based on GGML, supporting 3-bit, 4-bit, 5-bit, and 8-bit quantization, with no third-party dependencies.
+- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice): runs inference chunk by chunk. To achieve pseudo-streaming, it uses truncated attention, sacrificing some accuracy; it also supports CTC prefix beam search and hotword boosting.
+- [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice): a lightweight inference library supporting batch inference.
+
+# Contact Us
+
+If you encounter problems during use, you can raise Issues directly on the GitHub page. If you are interested in speech technology, you are welcome to scan the DingTalk group QR code below to join the community group for exchange and discussion.
+
+| FunASR |
+|:--------------------------------------------------------:|
+| <img src="image/dingding_funasr.png" width="250"/> |
diff --git a/examples/industrial_data_pretraining/sense_voice/finetune.sh b/examples/industrial_data_pretraining/sense_voice/finetune.sh
index 0003909..081b77b 100644
--- a/examples/industrial_data_pretraining/sense_voice/finetune.sh
+++ b/examples/industrial_data_pretraining/sense_voice/finetune.sh
@@ -43,7 +43,7 @@
echo $DISTRIBUTED_ARGS
# funasr trainer path
-train_tool=`dirname $(which funasr)`/train_ds.py
+train_tool=../../../funasr/bin/train_ds.py
torchrun $DISTRIBUTED_ARGS \
${train_tool} \
--
Gitblit v1.9.1