python/FunASR-XL.git

parent: 37fc6ad9

游雁

2024-07-22 c2575f022df4d125c9bd1e2b25142417c0a277b5

docs

5 files changed

	README.md	1
	README_zh.md	1
	examples/README.md	9
	examples/README_zh.md	9
	examples/industrial_data_pretraining/sense_voice/demo.py	22

 README.md

@@ -163,6 +163,7 @@
 - `use_itn`: Whether the output result includes punctuation and inverse text normalization.
 - `batch_size_s`: Indicates the use of dynamic batching, where the total duration of audio in the batch is measured in seconds (s).
 - `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Whether to ban the output of the `emo_unk` token.

 #### Paraformer
 ```python

 README_zh.md

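@@ -162,6 +162,7 @@
 - `use_itn`: Whether the output includes punctuation and inverse text normalization.
 - `batch_size_s`: Uses dynamic batching; the value is the total audio duration in a batch, in seconds (s).
 - `merge_vad`: Whether to merge the short audio fragments cut by the VAD model; the merged length is `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Disables the `emo_unk` tag; when it is disabled, every sentence is assigned an emotion label.

 #### Paraformer
 ```python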

 examples/README.md

@@ -96,6 +96,15 @@
 text = rich_transcription_postprocess(res[0]["text"])
 print(text)
 ```
+Notes:
+- `model_dir`: The name of the model, or the path to the model on the local disk.
+- `vad_model`: This indicates the activation of VAD (Voice Activity Detection). The purpose of VAD is to split long audio into shorter clips. In this case, the inference time includes both VAD and SenseVoice total consumption, and represents the end-to-end latency. If you wish to test the SenseVoice model's inference time separately, the VAD model can be disabled.
+- `vad_kwargs`: Specifies the configurations for the VAD model. `max_single_segment_time`: denotes the maximum duration for audio segmentation by the `vad_model`, with the unit being milliseconds (ms).
+- `use_itn`: Whether the output result includes punctuation and inverse text normalization.
+- `batch_size_s`: Indicates the use of dynamic batching, where the total duration of audio in the batch is measured in seconds (s).
+- `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Whether to ban the output of the `emo_unk` token.
+
 ##### Paraformer
 ```python
 from funasr import AutoModel
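
As a rough illustration of the `merge_vad` / `merge_length_s` note added in this hunk, the sketch below greedily merges adjacent VAD segments as long as the merged span stays within a target length. It is not FunASR's implementation: the helper name `merge_short_segments`, the `(start_ms, end_ms)` tuple layout, and the greedy rule are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms); layout assumed for illustration


def merge_short_segments(segments: List[Segment], merge_length_s: float) -> List[Segment]:
    """Greedily merge adjacent segments while the merged span fits in merge_length_s.

    Hypothetical helper, not part of FunASR.
    """
    max_len_ms = merge_length_s * 1000
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_len_ms:
            # The span from the current chunk's start to this segment's end
            # still fits, so extend the current chunk.
            merged[-1] = (merged[-1][0], end)
        else:
            # Otherwise start a new chunk with this segment.
            merged.append((start, end))
    return merged


segments = [(0, 4000), (4200, 9000), (9500, 16000), (16500, 18000)]
print(merge_short_segments(segments, merge_length_s=15))
print(merge_short_segments(segments, merge_length_s=0.1))
```

Under this assumed rule, a target of 15 s collapses the four segments into two chunks, while a very small target such as 0.1 s leaves every segment on its own.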

 examples/README_zh.md

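@@ -97,6 +97,15 @@
 text = rich_transcription_postprocess(res[0]["text"])
 print(text)
 ```
+Parameter notes:
+- `model_dir`: The model name, or the path to the model on the local disk.
+- `vad_model`: Enables VAD. VAD splits long audio into short clips; in this case the inference time covers both VAD and SenseVoice and is the end-to-end latency. To measure the SenseVoice model's inference time on its own, disable the VAD model.
+- `vad_kwargs`: Configuration for the VAD model. `max_single_segment_time`: the maximum duration of an audio segment cut by `vad_model`, in milliseconds (ms).
+- `use_itn`: Whether the output includes punctuation and inverse text normalization.
+- `batch_size_s`: Uses dynamic batching; the value is the total audio duration in a batch, in seconds (s).
+- `merge_vad`: Whether to merge the short audio fragments cut by the VAD model; the merged length is `merge_length_s`, in seconds (s).
+- `ban_emo_unk`: Disables the `emo_unk` tag; when it is disabled, every sentence is assigned an emotion label.
+
 ##### Paraformer
 ```python
 from funasr import AutoModel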

 examples/industrial_data_pretraining/sense_voice/demo.py

@@ -1,13 +1,12 @@
 #!/usr/bin/env python3
 # -*- encoding: utf-8 -*-
-# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
+# Copyright FunASR (https://github.com/FunAudioLLM/SenseVoice). All Rights Reserved.
 #  MIT License  (https://opensource.org/licenses/MIT)

-
 from funasr import AutoModel
 from funasr.utils.postprocess_utils import rich_transcription_postprocess

-model_dir = "/Users/zhifu/Downloads/modelscope_models/SenseVoiceSmall"  # "iic/SenseVoiceSmall"
+model_dir = "iic/SenseVoiceSmall"


 model = AutoModel(
@@ -19,30 +18,17 @@

 # en
 res = model.generate(
-    input="/Users/zhifu/Downloads/8_output.wav",
+    input=f"{model.model_path}/example/en.mp3",
     cache={},
     language="auto",  # "zn", "en", "yue", "ja", "ko", "nospeech"
     use_itn=True,
     batch_size_s=60,
     merge_vad=True,  #
-    merge_length_s=0.1,
-)
-text = rich_transcription_postprocess(res[0]["text"])
-print(text)
-
-# en
-res = model.generate(
-    input="/Users/zhifu/Downloads/8_output.wav",
-    cache={},
-    language="auto",  # "zn", "en", "yue", "ja", "ko", "nospeech"
-    use_itn=True,
-    batch_size_s=60,
-    merge_vad=False,  #
     merge_length_s=15,
 )
 text = rich_transcription_postprocess(res[0]["text"])
 print(text)
-raise "exit"
+
 # zh
 res = model.generate(
     input=f"{model.model_path}/example/zh.mp3",
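
The `batch_size_s=60` argument in the calls above is described in the README notes as dynamic batching by total audio duration. A minimal sketch of that idea, assuming clips are grouped so that each batch holds at most `batch_size_s` seconds of audio, could look like the following. The helper name `batch_by_duration` and the sort-then-fill strategy are hypothetical; FunASR's internal batching may differ.

```python
from typing import List


def batch_by_duration(durations_s: List[float], batch_size_s: float) -> List[List[int]]:
    """Group clip indices so each batch holds at most batch_size_s seconds of audio.

    Hypothetical helper, not part of FunASR. A clip longer than the budget
    ends up in a batch of its own.
    """
    # Visit clips from shortest to longest so clips of similar length share a batch.
    order = sorted(range(len(durations_s)), key=lambda i: durations_s[i])
    batches: List[List[int]] = []
    current: List[int] = []
    current_total = 0.0
    for i in order:
        if current and current_total + durations_s[i] > batch_size_s:
            # Adding this clip would exceed the budget: close the current batch.
            batches.append(current)
            current, current_total = [], 0.0
        current.append(i)
        current_total += durations_s[i]
    if current:
        batches.append(current)
    return batches


durations = [12.0, 30.0, 8.0, 25.0, 40.0, 5.0]
for batch in batch_by_duration(durations, batch_size_s=60):
    print(batch, sum(durations[i] for i in batch))
```

The number of clips per batch varies with clip length, which is the practical difference from a fixed-count batch size.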
