### Speech Recognition Models
#### Paraformer Models

| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:----------:|:--------:|:-------------:|:----------:|:---------:|:--------------:|:------|
| [Paraformer-large](https://huggingface.co/funasr/paraformer-large) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |

### Voice Activity Detection Models

| Model Name | Training Data | Parameters | Sampling Rate | Notes |
|:----------:|:-------------:|:----------:|:-------------:|:------|
| [FSMN-VAD](https://huggingface.co/funasr/FSMN-VAD) | Alibaba Speech Data (5,000 hours) | 0.4M | 16000 | |

### Punctuation Restoration Models

| Model Name | Training Data | Parameters | Vocab Size | Offline/Online | Notes |
|:----------:|:-------------:|:----------:|:----------:|:--------------:|:------|
| [CT-Transformer](https://huggingface.co/funasr/CT-Transformer-punc) | Alibaba Text Data | 70M | 272727 | Offline | Offline punctuation model |

.. toctree::
   :maxdepth: 1
   :caption: Model Zoo

   ./modelscope_models.md
   ./huggingface_models.md

.. toctree::
   :maxdepth: 1

   ./benchmark/benchmark_onnx_cpp.md
   ./benchmark/benchmark_libtorch.md

.. toctree::
   :maxdepth: 1
   :caption: Funasr Library

   ./build_task.md

.. toctree::
   :maxdepth: 1
   :caption: Papers

| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:----------:|:--------:|:-------------:|:----------:|:---------:|:--------------:|:------|
| [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |
| [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Can handle input wav of arbitrary length |
| [Paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Supports hotword customization based on incentive enhancement, improving the recall and precision of hotwords |
| [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50,000 hours) | 8358 | 68M | Offline | Duration of input wav <= 20s |
| [Paraformer-online](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50,000 hours) | 8404 | 68M | Online | Can handle streaming input |
| [Paraformer-tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary) | CN | Alibaba Speech Data (200 hours) | 544 | 5.2M | Offline | Lightweight Paraformer model supporting Mandarin command-word recognition |

# Speech Recognition

> **Note**:
> The modelscope pipeline supports all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take typical models as examples to demonstrate the usage.

## Inference

##### Define pipeline
- `task`: `Tasks.auto_speech_recognition`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intra-op parallelism on CPU
- `output_dir`: `None` (Default), the output path of the results if set
- `batch_size`: `1` (Default), batch size when decoding
##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, `e.g.`: asr_example.wav
  - wav.scp, kaldi-style wav list (`wav_id wav_path`), `e.g.`:
    ```
    asr_example1  data/test/audios/asr_example1.wav
    asr_example2  data/test/audios/asr_example2.wav
    ```
    In this case of `wav.scp` input, `output_dir` must be set to save the output results
- `audio_fs`: audio sampling rate, only set when audio_in is pcm audio
- `output_dir`: `None` (Default), the output path of the results if set
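
Putting the two parameter lists together, a minimal sketch of defining and running the pipeline, assuming the standard ModelScope `pipeline` API and the Paraformer-large model name from the model zoo table above:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define the pipeline: decode on GPU (ngpu=1) with batch size 1
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    ngpu=1,
    batch_size=1,
)

# Infer: a single wav file as input
rec_result = inference_pipeline(audio_in='asr_example.wav')
print(rec_result)
```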

### Inference with multi-threaded CPUs or multiple GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs.

# Voice Activity Detection

> **Note**:
> The modelscope pipeline supports all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take the FSMN-VAD model as an example to demonstrate the usage.

## Inference

##### Define pipeline
- `task`: `Tasks.voice_activity_detection`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intra-op parallelism on CPU
- `output_dir`: `None` (Default), the output path of the results if set
- `batch_size`: `1` (Default), batch size when decoding
##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, `e.g.`: asr_example.wav
  - wav.scp, kaldi-style wav list (`wav_id wav_path`), `e.g.`:
    ```
    asr_example1  data/test/audios/asr_example1.wav
    asr_example2  data/test/audios/asr_example2.wav
    ```
    In this case of `wav.scp` input, `output_dir` must be set to save the output results
- `audio_fs`: audio sampling rate, only set when audio_in is pcm audio
- `output_dir`: `None` (Default), the output path of the results if set
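
Analogously, a minimal sketch of the VAD pipeline, assuming the standard ModelScope `pipeline` API and the FSMN-VAD model name from the model zoo table above:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define the VAD pipeline; ngpu=0 decodes on CPU
inference_pipeline = pipeline(
    task=Tasks.voice_activity_detection,
    model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    ngpu=0,
)

# Infer: returns detected speech segments for the input wav
segments_result = inference_pipeline(audio_in='asr_example.wav')
print(segments_result)
```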

### Inference with multi-threaded CPUs or multiple GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/vad/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs.

## Install `funasr_onnx`

Install from pip:
```shell
pip install -U funasr_onnx
```

### Speech Recognition
#### Paraformer

```python
from funasr_onnx import Paraformer

model_dir = "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
# quantize=True loads model_quant.onnx instead of model.onnx
model = Paraformer(model_dir, batch_size=1, quantize=True)

wav_path = ['./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported formats: `str, np.ndarray, List[str]`

Output: `List[str]`: recognition result
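
Since `np.ndarray` input is also supported, waveform samples can be passed directly. A small sketch, assuming `soundfile` is available here only as a way to load the wav into an array:

```python
import numpy as np
import soundfile as sf  # assumed helper for reading the wav into an np.ndarray

speech, sample_rate = sf.read(
    "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav"
)
# Same model instance as in the example above
result = model(speech.astype(np.float32))
print(result)
```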

#### Paraformer-online

```python
result = model(wav_path)
print(result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported formats: `str, np.ndarray, List[str]`

Output: `List[str]`: recognition result

#### FSMN-VAD-online
```python
if segments_result:
    print(segments_result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported formats: `str, np.ndarray, List[str]`

Output: `List[str]`: recognition result
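
For completeness, a minimal offline VAD sketch in the same style as the Paraformer example; the `Fsmn_vad` class name, export path, and example file are assumptions and should be checked against the `funasr_onnx` package:

```python
from funasr_onnx import Fsmn_vad  # class name assumed; verify against funasr_onnx

model_dir = "./export/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch"  # hypothetical export path
model = Fsmn_vad(model_dir, quantize=True)

wav_path = ["./vad_example.wav"]  # hypothetical example file
result = model(wav_path)  # detected speech segments
print(result)
```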

### Punctuation Restoration
#### CT-Transformer

```python
result = model(text_in)
print(result[0])
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: `str`, raw text of asr result

Output: `List[str]`: recognition result
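
A self-contained sketch of the offline punctuation example above; the `CT_Transformer` class name and the model path are assumptions modeled on the `funasr_onnx` naming conventions:

```python
from funasr_onnx import CT_Transformer  # class name assumed; verify against funasr_onnx

model_dir = "./export/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"  # assumed export path
model = CT_Transformer(model_dir)

text_in = "跨境河流是养育沿岸人民的生命之源"  # illustrative raw ASR output without punctuation
result = model(text_in)
print(result[0])  # the punctuated text
```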

#### CT-Transformer-online
```python
print(rec_result_all)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: `str`, raw text of asr result

Output: `List[str]`: recognition result

## Performance benchmark