### Speech Recognition Models
#### Paraformer Models

| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:----------:|:--------:|:-------------:|:----------:|:---------:|:--------------:|:------|
| [Paraformer-large](https://huggingface.co/funasr/paraformer-large) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |

### Voice Activity Detection Models

| Model Name | Training Data | Parameters | Sampling Rate | Notes |
|:----------:|:-------------:|:----------:|:-------------:|:------|
| [FSMN-VAD](https://huggingface.co/funasr/FSMN-VAD) | Alibaba Speech Data (5,000 hours) | 0.4M | 16000 | |

### Punctuation Restoration Models

| Model Name | Training Data | Parameters | Vocab Size | Offline/Online | Notes |
|:----------:|:-------------:|:----------:|:----------:|:--------------:|:------|
| [CT-Transformer](https://huggingface.co/funasr/CT-Transformer-punc) | Alibaba Text Data | 70M | 272727 | Offline | Offline punctuation model |

.. toctree::
   :maxdepth: 1
   :caption: Model Zoo

   ./modelscope_models.md
   ./huggingface_models.md

.. toctree::
   :maxdepth: 1

   ./benchmark/benchmark_onnx_cpp.md
   ./benchmark/benchmark_libtorch.md

.. toctree::
   :maxdepth: 1
   :caption: Funasr Library

   ./build_task.md

.. toctree::
   :maxdepth: 1
   :caption: Papers

| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:----------:|:--------:|:-------------:|:----------:|:---------:|:--------------:|:------|
| [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |
| [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Can handle input wav of arbitrary length |
| [Paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) | CN & EN | Alibaba Speech Data (60,000 hours) | 8404 | 220M | Offline | Supports hotword customization based on incentive enhancement, improving the recall and precision of hotwords |
| [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50,000 hours) | 8358 | 68M | Offline | Duration of input wav <= 20s |
| [Paraformer-online](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50,000 hours) | 8404 | 68M | Online | Can handle streaming input |
| [Paraformer-tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary) | CN | Alibaba Speech Data (200 hours) | 544 | 5.2M | Offline | Lightweight Paraformer model supporting Mandarin command-word recognition |

# Speech Recognition

> **Note**:
> The modelscope pipeline supports all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take typical models as examples to demonstrate the usage.

## Inference

##### Define pipeline
- `task`: `Tasks.auto_speech_recognition`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intra-op parallelism on CPU
- `output_dir`: `None` (Default), the output path of the results if set
- `batch_size`: `1` (Default), batch size when decoding
##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, `e.g.`: asr_example.wav
  - wav.scp, kaldi-style wav list (`wav_id wav_path`), `e.g.`:
    ```
    asr_example1  data/test/audios/asr_example1.wav
    asr_example2  data/test/audios/asr_example2.wav
    ```
    In this case of `wav.scp` input, `output_dir` must be set to save the output results
- `audio_fs`: audio sampling rate, only set when audio_in is pcm audio
- `output_dir`: `None` (Default), the output path of the results if set
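
Putting the two parameter lists together, a minimal sketch of defining and running the pipeline, assuming the standard ModelScope `pipeline` API and the Paraformer-large model name from the model zoo table above:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define the pipeline: decode on GPU (ngpu=1) with batch size 1
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    ngpu=1,
    batch_size=1,
)

# Infer: a single wav file as input
rec_result = inference_pipeline(audio_in='asr_example.wav')
print(rec_result)
```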

### Inference with multi-threaded CPUs or multiple GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs.

# Voice Activity Detection

> **Note**:
> The modelscope pipeline supports all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take the FSMN-VAD model as an example to demonstrate the usage.

## Inference

##### Define pipeline
- `task`: `Tasks.voice_activity_detection`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intra-op parallelism on CPU
- `output_dir`: `None` (Default), the output path of the results if set
- `batch_size`: `1` (Default), batch size when decoding
##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, `e.g.`: asr_example.wav
  - wav.scp, kaldi-style wav list (`wav_id wav_path`), `e.g.`:
    ```
    asr_example1  data/test/audios/asr_example1.wav
    asr_example2  data/test/audios/asr_example2.wav
    ```
    In this case of `wav.scp` input, `output_dir` must be set to save the output results
- `audio_fs`: audio sampling rate, only set when audio_in is pcm audio
- `output_dir`: `None` (Default), the output path of the results if set
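
Analogously, a minimal sketch of the VAD pipeline, assuming the standard ModelScope `pipeline` API and the FSMN-VAD model name from the model zoo table above:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define the VAD pipeline; ngpu=0 decodes on CPU
inference_pipeline = pipeline(
    task=Tasks.voice_activity_detection,
    model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    ngpu=0,
)

# Infer: returns detected speech segments for the input wav
segments_result = inference_pipeline(audio_in='asr_example.wav')
print(segments_result)
```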

### Inference with multi-threaded CPUs or multiple GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/vad/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs.

## Install `funasr_onnx`

Install from pip:
```shell
pip install -U funasr_onnx
```

### Speech Recognition
#### Paraformer

```python
from funasr_onnx import Paraformer

model_dir = "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
# quantize=True loads model_quant.onnx instead of model.onnx
model = Paraformer(model_dir, batch_size=1, quantize=True)

wav_path = ['./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported formats: `str, np.ndarray, List[str]`

Output: `List[str]`: recognition result
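
Since `np.ndarray` input is also supported, waveform samples can be passed directly. A small sketch, assuming `soundfile` is available here only as a way to load the wav into an array:

```python
import numpy as np
import soundfile as sf  # assumed helper for reading the wav into an np.ndarray

speech, sample_rate = sf.read(
    "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav"
)
# Same model instance as in the example above
result = model(speech.astype(np.float32))
print(result)
```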

#### Paraformer-online

```python
result = model(wav_path)
print(result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported formats: `str, np.ndarray, List[str]`

Output: `List[str]`: recognition result

#### FSMN-VAD-online
```python
if segments_result:
    print(segments_result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported formats: `str, np.ndarray, List[str]`

Output: `List[str]`: recognition result
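
For completeness, a minimal offline VAD sketch in the same style as the Paraformer example; the `Fsmn_vad` class name, export path, and example file are assumptions and should be checked against the `funasr_onnx` package:

```python
from funasr_onnx import Fsmn_vad  # class name assumed; verify against funasr_onnx

model_dir = "./export/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch"  # hypothetical export path
model = Fsmn_vad(model_dir, quantize=True)

wav_path = ["./vad_example.wav"]  # hypothetical example file
result = model(wav_path)  # detected speech segments
print(result)
```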

### Punctuation Restoration
#### CT-Transformer

```python
result = model(text_in)
print(result[0])
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: `str`, raw text of asr result

Output: `List[str]`: recognition result
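
A self-contained sketch of the offline punctuation example above; the `CT_Transformer` class name and the model path are assumptions modeled on the `funasr_onnx` naming conventions:

```python
from funasr_onnx import CT_Transformer  # class name assumed; verify against funasr_onnx

model_dir = "./export/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"  # assumed export path
model = CT_Transformer(model_dir)

text_in = "跨境河流是养育沿岸人民的生命之源"  # illustrative raw ASR output without punctuation
result = model(text_in)
print(result[0])  # the punctuated text
```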

#### CT-Transformer-online
```python
print(rec_result_all)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (please make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set `True`, load the model of `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: `str`, raw text of asr result

Output: `List[str]`: recognition result

## Performance benchmark