From a836eca98e30fa67d45167dac40f359ae42d42ec Mon Sep 17 00:00:00 2001
From: 游雁 <zhifu.gzf@alibaba-inc.com>
Date: Wed, 17 Jul 2024 10:16:19 +0800
Subject: [PATCH] update
---
README.md | 40 +++++++++++++++++++++++++++++++++++++++-
1 files changed, 39 insertions(+), 1 deletions(-)
diff --git a/README.md b/README.md
index 66242f0..525b563 100644
--- a/README.md
+++ b/README.md
@@ -93,7 +93,7 @@
| Model Name | Task Details | Training Data | Parameters |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------:|:--------------------------------:|:----------:|
-| SenseVoiceSmall <br> ([⭐](https://www.modelscope.cn/models/iic/SenseVoiceSmall) [🤗](https://huggingface.co/FunAudioLLM/SenseVoiceSmall) ) | multiple speech understanding capabilities, including ASR, LID, SER, and AED. | 400000 hours | 330M |
+| SenseVoiceSmall <br> ([⭐](https://www.modelscope.cn/models/iic/SenseVoiceSmall) [🤗](https://huggingface.co/FunAudioLLM/SenseVoiceSmall) ) | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports languages such as zh, yue, en, ja, ko | 300000 hours | 234M |
| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-zh) ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> | speech recognition, streaming | 60000 hours, Mandarin | 220M |
| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
@@ -129,6 +129,44 @@
Notes: Supports recognition of a single audio file, as well as a file list in Kaldi-style wav.scp format: `wav_id wav_path`
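
To make the Kaldi-style list format concrete, here is a minimal sketch of reading such a file into a mapping from `wav_id` to `wav_path`. The helper name `parse_wav_scp` is hypothetical and is not part of FunASR; it only assumes one whitespace-separated `wav_id wav_path` pair per line.

```python
def parse_wav_scp(text: str) -> dict[str, str]:
    """Parse Kaldi-style wav.scp text into a {wav_id: wav_path} mapping.

    Hypothetical helper for illustration only; not a FunASR API.
    """
    entries = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so the path may contain spaces.
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'wav_id wav_path', got {raw!r}")
        wav_id, wav_path = parts
        entries[wav_id] = wav_path
    return entries
```

Calling `parse_wav_scp(open("wav.scp", encoding="utf-8").read())` would yield the mapping for a file on disk, assuming a file named `wav.scp` exists.
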
### Speech Recognition (Non-streaming)
+#### SenseVoice
+```python
+from funasr import AutoModel
+from funasr.utils.postprocess_utils import rich_transcription_postprocess
+
+model_dir = "iic/SenseVoiceSmall"
+
+model = AutoModel(
+    model=model_dir,
+    vad_model="fsmn-vad",
+    vad_kwargs={"max_single_segment_time": 30000},
+    device="cuda:0",
+)
+
+# en
+res = model.generate(
+    input=f"{model.model_path}/example/en.mp3",
+    cache={},
+    language="auto",  # or "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=True,
+    batch_size_s=60,
+    merge_vad=True,  # merge short segments cut by the VAD model
+    merge_length_s=15,
+)
+text = rich_transcription_postprocess(res[0]["text"])
+print(text)
+```
+Parameter Descriptions:
+- `model_dir`: The name of the model, or the model's path on the local disk.
+- `trust_remote_code`:
+  - `True`: the model's code implementation is loaded from the location given by `remote_code` (for example, `model.py` in the current directory), which may be an absolute path, a relative path, or a network URL.
+  - `False`: the implementation built into [FunASR](https://github.com/modelscope/FunASR) is used, so changes to `model.py` in the current directory have no effect. The model code is [available here](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
+- `max_single_segment_time`: The maximum duration of a single audio segment produced by `vad_model`, in milliseconds (ms).
+- `use_itn`: Whether the output includes punctuation and inverse text normalization (ITN).
+- `batch_size_s`: Dynamic batch size, expressed as the total audio duration in a batch, in seconds (s).
+- `merge_vad`: Whether to merge short audio segments cut by the VAD model; the merged length is `merge_length_s`, in seconds (s).
+
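The interaction between `merge_vad`, `merge_length_s`, and `batch_size_s` may be easier to see in code. The sketch below is an illustration under assumptions, not FunASR's actual implementation: it assumes VAD output is a list of `(start_ms, end_ms)` tuples, merges neighbours greedily, and then groups segments so that each batch stays within a total-duration budget. The function names `merge_segments` and `batch_by_duration` are hypothetical.

```python
def merge_segments(segments_ms, merge_length_s=15):
    """Greedily merge adjacent (start_ms, end_ms) segments.

    A segment is folded into the previous one as long as the merged span
    stays within merge_length_s. Illustrative only; not FunASR's code.
    """
    max_ms = merge_length_s * 1000
    merged = []
    for start, end in segments_ms:
        if merged and end - merged[-1][0] <= max_ms:
            merged[-1] = (merged[-1][0], end)  # extend the previous span
        else:
            merged.append((start, end))  # start a new span
    return merged


def batch_by_duration(segments_ms, batch_size_s=60):
    """Group segments so each batch holds at most batch_size_s of audio.

    A single segment longer than the budget still gets its own batch.
    """
    budget_ms = batch_size_s * 1000
    batches, current, used = [], [], 0
    for start, end in segments_ms:
        duration = end - start
        if current and used + duration > budget_ms:
            batches.append(current)
            current, used = [], 0
        current.append((start, end))
        used += duration
    if current:
        batches.append(current)
    return batches
```

With this sketch, the number of segments per batch varies with segment length: many short segments share a batch, while a few long ones fill it quickly, which is the sense in which the batch size is dynamic.
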
+#### Paraformer
```python
from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
--
Gitblit v1.9.1