Merge pull request #389 from alibaba-damo-academy/main
update dev_lyh
42 files changed
11 files added
14 files deleted
| | |
| | | with: |
| | | docs-folder: "docs/" |
| | | pre-build-command: "pip install sphinx-markdown-tables nbsphinx jinja2 recommonmark sphinx_rtd_theme" |
| | | - uses: ammaraskar/sphinx-action@master |
| | | with: |
| | | docs-folder: "docs_cn/" |
| | | pre-build-command: "pip install sphinx-markdown-tables nbsphinx jinja2 recommonmark sphinx_rtd_theme" |
| | | |
| | | - name: deploy copy |
| | | if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/dev_wjm' || github.ref == 'refs/heads/dev_lyh' |
| | |
| | | mkdir public/en |
| | | touch public/en/.nojekyll |
| | | cp -r docs/_build/html/* public/en/ |
| | | mkdir public/cn |
| | | touch public/cn/.nojekyll |
| | | cp -r docs_cn/_build/html/* public/cn/ |
| | | mkdir public/m2met2 |
| | | touch public/m2met2/.nojekyll |
| | | cp -r docs_m2met2/_build/html/* public/m2met2/ |
| | |
| | | *.pyc |
| | | .eggs |
| | | MaaS-lib |
| | | .gitignore |
| | | .egg* |
| | | dist |
| | | build |
| | | funasr.egg-info |
| | |
| | | [**News**](https://github.com/alibaba-damo-academy/FunASR#whats-new) |
| | | | [**Highlights**](#highlights) |
| | | | [**Installation**](#installation) |
| | | | [**Docs_EN**](https://alibaba-damo-academy.github.io/FunASR/en/index.html) |
| | | | [**Docs_CN**](https://alibaba-damo-academy.github.io/FunASR/cn/index.html) |
| | | | [**Tutorial**](https://github.com/alibaba-damo-academy/FunASR/wiki#funasr%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C) |
| | | | [**Papers**](https://github.com/alibaba-damo-academy/FunASR#citations) |
| | | | [**Runtime**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime) |
| | | | [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/modelscope_models.md) |
| | | | [**Contact**](#contact) |
| | | |
| | | | |
| | | [**M2MET2.0 Guidance_CN**](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/index.html) |
| | | | [**M2MET2.0 Guidance_EN**](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html) |
| | | |
| New file |
| | |
| | | # FAQ |
| | | |
| | | ## How to use the VAD model with the modelscope pipeline |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/236) |
| | | |
| | | ## How to use the punctuation model with the modelscope pipeline |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/238) |
| | | |
| | | ## How to use the Paraformer model for streaming with the modelscope pipeline |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/241) |
| | | |
| | | ## How to use the VAD, ASR and punctuation models with the modelscope pipeline |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/278) |
| | | |
| | | ## How to combine the VAD, ASR, punctuation and NNLM models inside the modelscope pipeline |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/134) |
| | | |
| | | ## How to combine the timestamp prediction model with the modelscope pipeline |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/246) |
| | | |
| | | ## How to switch decoding mode between online and offline for the UniASR model |
| | | Refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/151) |
| New file |
| | |
| | | ../../funasr/runtime/python/benchmark_libtorch.md |
| New file |
| | |
| | | ../../funasr/runtime/python/benchmark_onnx.md |
| | |
| | | ./modescope_pipeline/punc_pipeline.md |
| | | ./modescope_pipeline/tp_pipeline.md |
| | | ./modescope_pipeline/sv_pipeline.md |
| | | ./modescope_pipeline/sd_pipeline.md |
| | | ./modescope_pipeline/lm_pipeline.md |
| | | |
| | | .. toctree:: |
| | |
| | | |
| | | .. toctree:: |
| | | :maxdepth: 1 |
| | | :caption: Benchmark and Leaderboard |
| | | |
| | | ./benchmark/benchmark_onnx.md |
| | | ./benchmark/benchmark_libtorch.md |
| | | |
| | | .. toctree:: |
| | | :maxdepth: 1 |
| | | :caption: Papers |
| | | |
| | | ./papers.md |
| | | |
| | | .. toctree:: |
| | | :maxdepth: 1 |
| | | :caption: FAQ |
| | | |
| | | ./FQA.md |
| | | |
| | | |
| | | Indices and tables |
| | |
| | | | [Xvector](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary) | CNCeleb (1,200 hours) | 17.5M | 3465 | Xvector, speaker verification, Chinese | |
| | | | [Xvector](https://www.modelscope.cn/models/damo/speech_xvector_sv-en-us-callhome-8k-spk6135-pytorch/summary) | CallHome (60 hours) | 61M | 6135 | Xvector, speaker verification, English | |
| | | |
| | | ### Speaker Diarization Models |
| | | |
| | | | Model Name | Training Data | Parameters | Notes | |
| | | |:----------------------------------------------------------------------------------------------------------------:|:-------------------:|:----------:|:------| |
| | |
| | | # Speech Recognition |
| | | |
| | | > **Note**: |
| | | > The modelscope pipeline supports inference and finetuning with all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope). Here we take the Paraformer and Paraformer-online models as examples to demonstrate the usage. |
| | | |
| | | ## Inference |
| | | |
| | | ### Quick start |
| | | #### [Paraformer model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) |
| | | ```python |
| | | from modelscope.pipelines import pipeline |
| | | from modelscope.utils.constant import Tasks |
| | | |
| | | # build the inference pipeline |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', |
| | | ) |
| | | |
| | | # run inference on your audio |
| | | rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav') |
| | | print(rec_result) |
| | | ``` |
| | | #### [Paraformer-online model](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/summary) |
| | | ```python |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online', |
| | | ) |
| | | import soundfile |
| | | speech, sample_rate = soundfile.read("example/asr_example.wav") |
| | | |
| | | # streaming inference with a persistent cache |
| | | param_dict = {"cache": dict(), "is_final": False} |
| | | chunk_stride = 7680  # 480ms |
| | | # first chunk, 480ms |
| | | speech_chunk = speech[0:chunk_stride] |
| | | rec_result = inference_pipeline(audio_in=speech_chunk, param_dict=param_dict) |
| | | print(rec_result) |
| | | # next chunk, 480ms |
| | | speech_chunk = speech[chunk_stride:chunk_stride+chunk_stride] |
| | | rec_result = inference_pipeline(audio_in=speech_chunk, param_dict=param_dict) |
| | | print(rec_result) |
| | | ``` |
| | | For the full demo code, please refer to the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/241) |
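The two chunked calls above generalize to a loop over the whole waveform. Below is a minimal sketch of the chunking logic (the `iter_chunks` helper is ours, not part of the modelscope API), with the last chunk flagged via `is_final` so the pipeline can flush its cache:

```python
def iter_chunks(speech, chunk_stride):
    """Split a waveform into fixed-stride chunks for streaming inference.

    Yields (chunk, is_final) pairs; is_final is True only for the last chunk.
    """
    for start in range(0, len(speech), chunk_stride):
        chunk = speech[start:start + chunk_stride]
        yield chunk, start + chunk_stride >= len(speech)

# usage sketch, with inference_pipeline and speech as defined above:
# param_dict = {"cache": dict(), "is_final": False}
# for chunk, is_final in iter_chunks(speech, 7680):  # 480ms chunks at 16 kHz
#     param_dict["is_final"] = is_final
#     print(inference_pipeline(audio_in=chunk, param_dict=param_dict))
```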
| | | |
| | | #### [UniASR model](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-common-vocab3445-pytorch-online/summary) |
| | | There are three decoding modes for the UniASR model (`fast`, `normal`, `offline`); for more model details, please refer to the [docs](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-common-vocab3445-pytorch-online/summary) |
| | | ```python |
| | | decoding_model = "fast" # "fast", "normal" or "offline" |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model='damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825', |
| | | param_dict={"decoding_model": decoding_model}) |
| | | |
| | | rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav') |
| | | print(rec_result) |
| | | ``` |
| | | The `fast` and `normal` modes decode in a streaming fashion, while the `offline` mode gives the best accuracy. |
| | | For the full demo code, please refer to the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/151) |
| | | #### [RNN-T-online model]() |
| | | To be added. |
| | | |
| | | #### API-reference |
| | | ##### define pipeline |
| | | - `task`: `Tasks.auto_speech_recognition` |
| | | - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk |
| | | - `ngpu`: 1 (default), decode on GPU; if ngpu=0, decode on CPU |
| | | - `ncpu`: 1 (default), the number of threads used for intra-op parallelism on CPU |
| | | - `output_dir`: None (default), the output path for results if set |
| | | - `batch_size`: 1 (default), batch size for decoding |
| | | ##### infer pipeline |
| | | - `audio_in`: the input to decode, which could be: |
| | | - wav_path, `e.g.`: asr_example.wav, |
| | | - pcm_path, `e.g.`: asr_example.pcm, |
| | | - audio bytes stream, `e.g.`: bytes data from a microphone |
| | | - audio sample points, `e.g.`: `audio, rate = soundfile.read("asr_example_zh.wav")`, with dtype numpy.ndarray or torch.Tensor |
| | | - wav.scp, a kaldi-style wav list (`wav_id \t wav_path`), `e.g.`: |
| | | ``` |
| | | asr_example1 ./audios/asr_example1.wav |
| | | asr_example2 ./audios/asr_example2.wav |
| | | ``` |
| | | In the case of `wav.scp` input, `output_dir` must be set to save the output results |
| | | - `audio_fs`: audio sampling rate, set only when audio_in is raw pcm audio |
| | | - `output_dir`: None (default), the output path for results if set |
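Since a raw `pcm` file or byte stream carries no header, the pipeline needs `audio_fs` to know the sampling rate. As a rough sketch of what such input looks like, here is how 16-bit little-endian PCM bytes map to float samples (the helper name is ours, for illustration only):

```python
import array

def pcm16_to_float(data: bytes):
    """Interpret raw bytes as signed 16-bit PCM and scale to [-1.0, 1.0)."""
    samples = array.array("h")  # 'h' = signed 16-bit integers
    samples.frombytes(data)
    return [s / 32768.0 for s in samples]

# e.g. pcm16_to_float(open("asr_example.pcm", "rb").read()), passed to the
# pipeline together with audio_fs=16000
```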
| | | |
| | | ### Inference with multi-thread CPUs or multi GPUs |
| | | FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs. |
| | | |
| | | - Setting parameters in `infer.sh` |
| | | - <strong>model:</strong> # model name on ModelScope |
| | | - <strong>data_dir:</strong> # the dataset dir needs to include `${data_dir}/wav.scp`. If `${data_dir}/text` also exists, CER will be computed |
| | | - <strong>output_dir:</strong> # result dir |
| | | - <strong>batch_size:</strong> # batch size for inference |
| | | - <strong>gpu_inference:</strong> # whether to perform GPU decoding; set false for CPU decoding |
| | | - <strong>gpuid_list:</strong> # the GPUs to use, e.g., gpuid_list="0,1" |
| | | - <strong>njob:</strong> # the number of jobs for CPU decoding; if `gpu_inference`=false, decoding runs on CPU and `njob` must be set |
| | | |
| | | - Decode with multi GPUs: |
| | | ```shell |
| | | bash infer.sh \ |
| | | --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \ |
| | | --data_dir "./data/test" \ |
| | | --output_dir "./results" \ |
| | | --batch_size 64 \ |
| | | --gpu_inference true \ |
| | | --gpuid_list "0,1" |
| | | ``` |
| | | - Decode with multi-thread CPUs: |
| | | ```shell |
| | | bash infer.sh \ |
| | | --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \ |
| | | --data_dir "./data/test" \ |
| | | --output_dir "./results" \ |
| | | --gpu_inference false \ |
| | | --njob 64 |
| | | ``` |
| | | |
| | | - Results |
| | | |
| | | The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes recognition results of each sample and the CER metric of the whole test set. |
| | | |
| | | If you decode the SpeechIO test sets, you can apply text normalization with `stage`=3; `DETAILS.txt` and `RESULTS.txt` record the results and CER after text normalization. |
| | | |
| | | |
| | | ## Finetune with pipeline |
| | | |
| | | ### Quick start |
| | | [finetune.py](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/finetune.py) |
| | | ```python |
| | | import os |
| | | from modelscope.metainfo import Trainers |
| | | from modelscope.trainers import build_trainer |
| | | from modelscope.msdatasets.audio.asr_dataset import ASRDataset |
| | | |
| | | def modelscope_finetune(params): |
| | | if not os.path.exists(params.output_dir): |
| | | os.makedirs(params.output_dir, exist_ok=True) |
| | | # dataset split ["train", "validation"] |
| | | ds_dict = ASRDataset.load(params.data_path, namespace='speech_asr') |
| | | kwargs = dict( |
| | | model=params.model, |
| | | data_dir=ds_dict, |
| | | dataset_type=params.dataset_type, |
| | | work_dir=params.output_dir, |
| | | batch_bins=params.batch_bins, |
| | | max_epoch=params.max_epoch, |
| | | lr=params.lr) |
| | | trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs) |
| | | trainer.train() |
| | | |
| | | |
| | | if __name__ == '__main__': |
| | | from funasr.utils.modelscope_param import modelscope_args |
| | | params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch") |
| | | params.output_dir = "./checkpoint" # path to save the model |
| | | params.data_path = "speech_asr_aishell1_trainsets" # dataset path: a dataset uploaded to modelscope, or a local dataset |
| | | params.dataset_type = "small" # use "small" for small datasets; for more than 1000 hours of data, use "large" |
| | | params.batch_bins = 2000 # batch size: with dataset_type="small", batch_bins counts fbank feature frames; with dataset_type="large", batch_bins is in milliseconds |
| | | params.max_epoch = 50 # maximum number of training epochs |
| | | params.lr = 0.00005 # learning rate |
| | | |
| | | modelscope_finetune(params) |
| | | ``` |
| | | |
| | | ```shell |
| | | python finetune.py &> log.txt & |
| | | ``` |
| | | |
| | | ### Finetune with your data |
| | | |
| | | - Modify finetune training related parameters in [finetune.py](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/finetune.py) |
| | | - <strong>output_dir:</strong> # result dir |
| | | - <strong>data_dir:</strong> # the dataset dir needs to include files: `train/wav.scp`, `train/text`; `validation/wav.scp`, `validation/text` |
| | | - <strong>dataset_type:</strong> # for dataset larger than 1000 hours, set as `large`, otherwise set as `small` |
| | | - <strong>batch_bins:</strong> # batch size. If dataset_type is `small`, `batch_bins` is the number of feature frames; if dataset_type is `large`, `batch_bins` is the duration in ms |
| | | - <strong>max_epoch:</strong> # the number of training epochs |
| | | - <strong>lr:</strong> # learning rate |
| | | |
| | | - Then you can run the pipeline to finetune with: |
| | | ```shell |
| | | python finetune.py |
| | | ``` |
| | | If you want to finetune with multiple GPUs, you can run: |
| | | ```shell |
| | | CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node 2 finetune.py > log.txt 2>&1 |
| | | ``` |
| | | ## Inference with your finetuned model |
| | | - Modify inference related parameters in [infer_after_finetune.py](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer_after_finetune.py) |
| | | - <strong>modelscope_model_name: </strong> # model name on ModelScope |
| | | - <strong>output_dir:</strong> # result dir |
| | | - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed |
| | | - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb` |
| | | - <strong>batch_size:</strong> # batch size for inference |
| | | |
| | | - Then you can run the pipeline to infer with: |
| | | ```shell |
| | | python infer_after_finetune.py |
| | | ``` |
| | |
| | | # Language Models |
| | | |
| | | ## Inference with pipeline |
| | | ### Quick start |
| | | ### Inference with your data |
| | | ### Inference with multi-threads on CPU |
| | | ### Inference with multi GPU |
| | | |
| | | ## Finetune with pipeline |
| | | ### Quick start |
| | |
| | | |
| | | ### Quick start |
| | | |
| | | ### Inference with your data |
| | | |
| | | ### Inference with multi-threads on CPU |
| | | |
| | | ### Inference with multi GPU |
| | | |
| | | ## Finetune with pipeline |
| | | |
| | |
| | | |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.speech_timestamp, |
| | | model='damo/speech_timestamp_prediction-v1-16k-offline', |
| | | output_dir='./tmp') |
| | | |
| | | rec_result = inference_pipeline( |
| | | audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_timestamps.wav', |
| | |
| | | # speaker verification |
| | | rec_result = inference_sv_pipline(audio_in=('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav','https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav')) |
| | | print(rec_result["scores"][0]) |
| | | ``` |
| | | |
| | | ### Speaker diarization |
| | | #### SOND |
| | | ```python |
| | | from modelscope.pipelines import pipeline |
| | | from modelscope.utils.constant import Tasks |
| | | |
| | | inference_diar_pipline = pipeline( |
| | | mode="sond_demo", |
| | | num_workers=0, |
| | | task=Tasks.speaker_diarization, |
| | | diar_model_config="sond.yaml", |
| | | model='damo/speech_diarization_sond-en-us-callhome-8k-n16k4-pytorch', |
| | | sv_model="damo/speech_xvector_sv-en-us-callhome-8k-spk6135-pytorch", |
| | | sv_model_revision="master", |
| | | ) |
| | | |
| | | audio_list=[ |
| | | "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record.wav", |
| | | "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_A.wav", |
| | | "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B.wav", |
| | | "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B1.wav" |
| | | ] |
| | | |
| | | results = inference_diar_pipline(audio_in=audio_list) |
| | | print(results) |
| | | ``` |
| | | |
| | | ### FAQ |
| | | #### How to switch device from GPU to CPU with pipeline |
| | | |
| | | The pipeline defaults to decoding with GPU (`ngpu=1`) when GPU is available. If you want to switch to CPU, you could set `ngpu=0` |
| | | ```python |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', |
| | | ngpu=0, |
| | | ) |
| | | ``` |
| | | |
| | | #### How to infer from local model path |
| | | Download the model to a local dir with the modelscope SDK: |
| | | |
| | | ```python |
| | | from modelscope.hub.snapshot_download import snapshot_download |
| | | |
| | | local_dir_root = "./models_from_modelscope" |
| | | model_dir = snapshot_download('damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', cache_dir=local_dir_root) |
| | | ``` |
| | | |
| | | Or download the model to a local dir with git lfs: |
| | | ```shell |
| | | git lfs install |
| | | # git clone https://www.modelscope.cn/<namespace>/<model-name>.git |
| | | git clone https://www.modelscope.cn/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git |
| | | ``` |
| | | |
| | | Infer with the local model path: |
| | | ```python |
| | | local_dir_root = "./models_from_modelscope/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model=local_dir_root, |
| | | ) |
| | | ``` |
| | | |
| | | ## Finetune with pipeline |
| | |
| | | ```shell |
| | | python finetune.py &> log.txt & |
| | | ``` |
| | | |
| | | ### FAQ |
| | | ### Multi GPUs training and distributed training |
| | | |
| | | If you want to finetune with multiple GPUs, you can run: |
| | | ```shell |
| | | CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node 2 finetune.py > log.txt 2>&1 |
| New file |
| | |
| | | # Speaker Diarization |
| | | |
| | | ## Inference with pipeline |
| | | |
| | | ### Quick start |
| | | |
| | | ### Inference with your data |
| | | |
| | | ### Inference with multi-threads on CPU |
| | | |
| | | ### Inference with multi GPU |
| | | |
| | | ## Finetune with pipeline |
| | | |
| | | ### Quick start |
| | | |
| | | ### Finetune with your data |
| | | |
| | | ## Inference with your finetuned model |
| | | |
| | |
| | | |
| | | ### Quick start |
| | | |
| | | ### Inference with your data |
| | | |
| | | ### Inference with multi-threads on CPU |
| | | |
| | | ### Inference with multi GPU |
| | | |
| | | ## Finetune with pipeline |
| | | |
| | |
| | | |
| | | ### Quick start |
| | | |
| | | ### Inference with your data |
| | | |
| | | ### Inference with multi-threads on CPU |
| | | |
| | | ### Inference with multi GPU |
| | | |
| | | ## Finetune with pipeline |
| | | |
| | |
| | | # Voice Activity Detection |
| | | |
| | | ## Inference with pipeline |
| | | > **Note**: |
| | | > The modelscope pipeline supports inference and finetuning with all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope). Here we take the FSMN-VAD model as an example to demonstrate the usage. |
| | | |
| | | ## Inference |
| | | |
| | | ### Quick start |
| | | #### [FSMN-VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) |
| | | ```python |
| | | from modelscope.pipelines import pipeline |
| | | from modelscope.utils.constant import Tasks |
| | | |
| | | # build the VAD inference pipeline |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.voice_activity_detection, |
| | | model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch', |
| | | ) |
| | | |
| | | # run inference on your audio |
| | | segments_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav') |
| | | print(segments_result) |
| | | ``` |
| | | #### [FSMN-VAD-online model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) |
| | | ```python |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.voice_activity_detection, |
| | | model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch', |
| | | ) |
| | | import soundfile |
| | | speech, sample_rate = soundfile.read("example/asr_example.wav") |
| | | |
| | | # streaming inference with a persistent cache |
| | | param_dict = {"in_cache": dict(), "is_final": False} |
| | | chunk_stride = 1600  # 100ms |
| | | # first chunk, 100ms |
| | | speech_chunk = speech[0:chunk_stride] |
| | | rec_result = inference_pipeline(audio_in=speech_chunk, param_dict=param_dict) |
| | | print(rec_result) |
| | | # next chunk, 100ms |
| | | speech_chunk = speech[chunk_stride:chunk_stride+chunk_stride] |
| | | rec_result = inference_pipeline(audio_in=speech_chunk, param_dict=param_dict) |
| | | print(rec_result) |
| | | ``` |
| | | For the full demo code, please refer to the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/236) |
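The offline VAD pipeline returns the detected speech segments as `[start_ms, end_ms]` pairs. To cut the detected speech out of the original waveform, those millisecond boundaries can be converted to sample indices; a small sketch (the helper name is ours, and it assumes the segment format above):

```python
def segments_to_samples(segments_ms, sample_rate=16000):
    """Convert [start_ms, end_ms] VAD segments to (start, end) sample indices."""
    per_ms = sample_rate // 1000  # samples per millisecond
    return [(int(s) * per_ms, int(e) * per_ms) for s, e in segments_ms]

# usage sketch, with speech loaded via soundfile as above:
# slices = [speech[a:b] for a, b in segments_to_samples(segments, sample_rate)]
```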
| | | |
| | | |
| | | #### API-reference |
| | | ##### define pipeline |
| | | - `task`: `Tasks.voice_activity_detection` |
| | | - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk |
| | | - `ngpu`: 1 (default), decode on GPU; if ngpu=0, decode on CPU |
| | | - `ncpu`: 1 (default), the number of threads used for intra-op parallelism on CPU |
| | | - `output_dir`: None (default), the output path for results if set |
| | | - `batch_size`: 1 (default), batch size for decoding |
| | | ##### infer pipeline |
| | | - `audio_in`: the input to decode, which could be: |
| | | - wav_path, `e.g.`: asr_example.wav, |
| | | - pcm_path, `e.g.`: asr_example.pcm, |
| | | - audio bytes stream, `e.g.`: bytes data from a microphone |
| | | - audio sample points, `e.g.`: `audio, rate = soundfile.read("asr_example_zh.wav")`, with dtype numpy.ndarray or torch.Tensor |
| | | - wav.scp, a kaldi-style wav list (`wav_id \t wav_path`), `e.g.`: |
| | | ``` |
| | | asr_example1 ./audios/asr_example1.wav |
| | | asr_example2 ./audios/asr_example2.wav |
| | | ``` |
| | | In the case of `wav.scp` input, `output_dir` must be set to save the output results |
| | | - `audio_fs`: audio sampling rate, set only when audio_in is raw pcm audio |
| | | - `output_dir`: None (default), the output path for results if set |
| | | |
| | | ### Inference with multi-thread CPUs or multi GPUs |
| | | FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs. |
| | | |
| | | - Setting parameters in `infer.sh` |
| | | - <strong>model:</strong> # model name on ModelScope |
| | | - <strong>data_dir:</strong> # the dataset dir needs to include `${data_dir}/wav.scp`. If `${data_dir}/text` also exists, CER will be computed |
| | | - <strong>output_dir:</strong> # result dir |
| | | - <strong>batch_size:</strong> # batch size for inference |
| | | - <strong>gpu_inference:</strong> # whether to perform GPU decoding; set false for CPU decoding |
| | | - <strong>gpuid_list:</strong> # the GPUs to use, e.g., gpuid_list="0,1" |
| | | - <strong>njob:</strong> # the number of jobs for CPU decoding; if `gpu_inference`=false, decoding runs on CPU and `njob` must be set |
| | | |
| | | - Decode with multi GPUs: |
| | | ```shell |
| | | bash infer.sh \ |
| | | --model "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch" \ |
| | | --data_dir "./data/test" \ |
| | | --output_dir "./results" \ |
| | | --gpu_inference true \ |
| | | --gpuid_list "0,1" |
| | | ``` |
| | | - Decode with multi-thread CPUs: |
| | | ```shell |
| | | bash infer.sh \ |
| | | --model "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch" \ |
| | | --data_dir "./data/test" \ |
| | | --output_dir "./results" \ |
| | | --gpu_inference false \ |
| | | --njob 64 |
| | | ``` |
| | | |
| | | - Results |
| | | |
| | | The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes recognition results of each sample and the CER metric of the whole test set. |
| | | |
| | | If you decode the SpeechIO test sets, you can apply text normalization with `stage`=3; `DETAILS.txt` and `RESULTS.txt` record the results and CER after text normalization. |
| | | |
| | | |
| | | ## Finetune with pipeline |
| | | |
| New file |
| | |
| | | # Training a Model from Scratch |
| | | Here we take "training a paraformer model from scratch on the AISHELL-1 dataset" as an example to introduce how to use FunASR. Following this example, users can similarly employ other datasets (such as AISHELL-2) to train other models (such as conformer, transformer, etc.). |
| | | |
| | | ## Overall Introduction |
| | | We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on AISHELL-1 dataset. This recipe consists of five stages, supporting training on multiple GPUs and decoding by CPU or GPU. Before introducing each stage in detail, we first explain several parameters which should be set by users. |
| | | - `CUDA_VISIBLE_DEVICES`: visible gpu list |
| | | - `gpu_num`: the number of GPUs used for training |
| | | - `gpu_inference`: whether to use GPUs for decoding |
| | | - `njob`: for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU |
| | | - `data_aishell`: the raw path of AISHELL-1 dataset |
| | | - `feats_dir`: the path for saving processed data |
| | | - `nj`: the number of jobs for data preparation |
| | | - `speed_perturb`: the range of speed perturbation |
| | | - `exp_dir`: the path for saving experimental results |
| | | - `tag`: the suffix of experimental result directory |
| | | |
| | | ## Stage 0: Data preparation |
| | | This stage processes raw AISHELL-1 dataset `$data_aishell` and generates the corresponding `wav.scp` and `text` in `$feats_dir/data/xxx`. `xxx` means `train/dev/test`. Here we assume users have already downloaded AISHELL-1 dataset. If not, users can download data [here](https://www.openslr.org/33/) and set the path for `$data_aishell`. The examples of `wav.scp` and `text` are as follows: |
| | | * `wav.scp` |
| | | ``` |
| | | BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav |
| | | BAC009S0002W0123 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0123.wav |
| | | BAC009S0002W0124 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0124.wav |
| | | ... |
| | | ``` |
| | | * `text` |
| | | ``` |
| | | BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购 |
| | | BAC009S0002W0123 也 成 为 地 方 政 府 的 眼 中 钉 |
| | | BAC009S0002W0124 自 六 月 底 呼 和 浩 特 市 率 先 宣 布 取 消 限 购 后 |
| | | ... |
| | | ``` |
| | | Both files have two columns: the first column contains wav ids and the second contains the corresponding wav paths or label tokens. |
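Files in this two-column format are straightforward to load. Below is a minimal sketch (the helper name is ours) that splits each line on the first whitespace run, which also works for `text` since all label tokens stay together in the value:

```python
def load_kaldi_map(path):
    """Parse a two-column Kaldi-style file (id <whitespace> value) into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, value = line.split(maxsplit=1)  # split only on the first gap
            mapping[key] = value
    return mapping
```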
| | | |
| | | ## Stage 1: Feature Generation |
| | | This stage extracts FBank features from `wav.scp` and applies speed perturbation as data augmentation according to `speed_perturb`. Users can set `nj` to control the number of jobs for feature generation. The generated features are saved in `$feats_dir/dump/xxx/ark` and the corresponding `feats.scp` files are saved as `$feats_dir/dump/xxx/feats.scp`. An example of `feats.scp` is as follows: |
| | | * `feats.scp` |
| | | ``` |
| | | ... |
| | | BAC009S0002W0122_sp0.9 /nfs/funasr_data/aishell-1/dump/fbank/train/ark/feats.16.ark:592751055 |
| | | ... |
| | | ``` |
| | | Note that samples in this file have already been shuffled randomly. This file contains two columns. The first column is wav ids while the second column is kaldi-ark feature paths. Besides, `speech_shape` and `text_shape` are also generated in this stage, denoting the speech feature shape and text length of each sample. The examples are shown as follows: |
| | | * `speech_shape` |
| | | ``` |
| | | ... |
| | | BAC009S0002W0122_sp0.9 665,80 |
| | | ... |
| | | ``` |
| | | * `text_shape` |
| | | ``` |
| | | ... |
| | | BAC009S0002W0122_sp0.9 15 |
| | | ... |
| | | ``` |
| | | These two files have two columns. The first column is wav ids and the second column is the corresponding speech feature shape and text length. |
| | | |
| | | ## Stage 2: Dictionary Preparation |
| | | This stage processes the dictionary, which is used as a mapping between label characters and integer indices during ASR training. The processed dictionary file is saved as `$feats_dir/data/$lang_token_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows: |
| | | * `tokens.txt` |
| | | ``` |
| | | <blank> |
| | | <s> |
| | | </s> |
| | | 一 |
| | | 丁 |
| | | ... |
| | | 龚 |
| | | 龟 |
| | | <unk> |
| | | ``` |
| | | * `<blank>`: indicates the blank token for CTC |
| | | * `<s>`: indicates the start-of-sentence token |
| | | * `</s>`: indicates the end-of-sentence token |
| | | * `<unk>`: indicates the out-of-vocabulary token |
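During training, `tokens.txt` is read into a token-to-index mapping, with the line number serving as the integer id. A sketch of that lookup (function names are ours, for illustration), mapping out-of-vocabulary characters to `<unk>`:

```python
def load_token_map(path):
    """Map each token in tokens.txt to its 0-based line index."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

def encode(text, token2id):
    """Encode a character sequence, falling back to <unk> for unknown characters."""
    unk = token2id["<unk>"]
    return [token2id.get(ch, unk) for ch in text]
```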
| | | |
| | | ## Stage 3: Training |
| | | This stage trains the specified model. To start training, users should manually set `exp_dir`, `CUDA_VISIBLE_DEVICES` and `gpu_num`, which have been explained above. By default, the best `$keep_nbest_models` checkpoints on the validation set are averaged to generate a better model, which is then used for decoding. |
| | | |
| | | * DDP Training |
| | | |
| | | We support DistributedDataParallel (DDP) training; details can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). To enable DDP training, please set `gpu_num` greater than 1. For example, if you set `CUDA_VISIBLE_DEVICES=0,1,5,6,7` and `gpu_num=3`, the GPUs with ids 0, 1 and 5 will be used for training. |
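The selection rule in the example above, i.e. the first `gpu_num` entries of `CUDA_VISIBLE_DEVICES`, can be written out explicitly; this helper is purely illustrative and not part of the recipe:

```python
def gpus_used(cuda_visible_devices: str, gpu_num: int):
    """Return the physical GPU ids that gpu_num training workers will occupy."""
    ids = [int(x) for x in cuda_visible_devices.split(",") if x.strip()]
    return ids[:gpu_num]  # workers take the first gpu_num visible devices
```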
| | | |
| | | * DataLoader |
| | | |
We support an optional iterable-style DataLoader based on [Pytorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large datasets; users can set `dataset_type=large` to enable it.
| | | |
| | | * Configuration |
| | | |
The training parameters, including model, optimization and dataset settings, can be specified by a YAML file in the `conf` directory. Users can also set parameters directly in the `run.sh` recipe. Please avoid setting the same parameter in both the YAML file and the recipe.
| | | |
| | | * Training Steps |
| | | |
Two parameters control the training length: `max_epoch` specifies the total number of training epochs, while `max_update` specifies the total number of training steps. If both are specified, training stops as soon as either limit is reached.
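The stopping rule can be sketched as follows (illustrative only, not the actual trainer code):

```python
def should_stop(epoch: int, step: int, max_epoch: int, max_update: int) -> bool:
    """Stop when the current epoch reaches max_epoch OR the global
    update count reaches max_update, whichever happens first."""
    return epoch >= max_epoch or step >= max_update
```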
| | | |
| | | * Tensorboard |
| | | |
Users can use TensorBoard to monitor the loss, learning rate, etc. by running the following command:
| | | ``` |
| | | tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train |
| | | ``` |
| | | |
| | | ## Stage 4: Decoding |
| | | This stage generates the recognition results and calculates the `CER` to verify the performance of the trained model. |
| | | |
| | | * Mode Selection |
| | | |
Since FunASR supports Paraformer, UniASR, Conformer and other models, the `mode` parameter should be set to `asr/paraformer/uniasr` according to the trained model.
| | | |
| | | * Configuration |
| | | |
We support CTC decoding, attention decoding and hybrid CTC-attention decoding in FunASR, selected via `ctc_weight` in a YAML file in the `conf` directory. Specifically, `ctc_weight=1.0` selects CTC decoding, `ctc_weight=0.0` selects attention decoding, and `0.0<ctc_weight<1.0` selects hybrid CTC-attention decoding.
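Conceptually, hybrid decoding interpolates the two log-probabilities per hypothesis; a simplified sketch of the combination (ignoring beam search and length penalties):

```python
def hybrid_score(ctc_logprob: float, att_logprob: float, ctc_weight: float) -> float:
    """Combine CTC and attention log-probabilities for one hypothesis.

    ctc_weight=1.0 reduces to pure CTC scoring, ctc_weight=0.0 to pure
    attention scoring, and intermediate values give the hybrid score.
    """
    assert 0.0 <= ctc_weight <= 1.0
    return ctc_weight * ctc_logprob + (1.0 - ctc_weight) * att_logprob
```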
| | | |
| | | * CPU/GPU Decoding |
| | | |
We support CPU and GPU decoding in FunASR. For CPU decoding, set `gpu_inference=False` and set `njob` to specify the total number of CPU decoding jobs. For GPU decoding, set `gpu_inference=True`; additionally set `gpuid_list` to indicate which GPUs are used for decoding and `njob` to indicate the number of decoding jobs on each GPU.
| | | |
| | | * Performance |
| | | |
We adopt `CER` to measure the performance. The results are saved in `$exp_dir/exp/$model_dir/$decoding_yaml_name/$average_model_name/$dset` as `text.cer` and `text.cer.txt`: `text.cer` records the comparison between the recognized text and the reference text, while `text.cer.txt` records the final `CER` result. The following is an example of `text.cer`:
| | | * `text.cer` |
| | | ``` |
| | | ... |
| | | BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00% |
| | | ref: 构 建 良 好 的 旅 游 市 场 环 境 |
| | | res: 构 建 良 好 的 旅 游 市 场 环 境 |
| | | ... |
| | | ``` |
| | | |
| New file |
| | |
| | | import os |
| | | |
| | | from modelscope.metainfo import Trainers |
| | | from modelscope.trainers import build_trainer |
| | | |
| | | from funasr.datasets.ms_dataset import MsDataset |
| | | from funasr.utils.modelscope_param import modelscope_args |
| | | |
| | | |
| | | def modelscope_finetune(params): |
| | | if not os.path.exists(params.output_dir): |
| | | os.makedirs(params.output_dir, exist_ok=True) |
| | | # dataset split ["train", "validation"] |
| | | ds_dict = MsDataset.load(params.data_path) |
| | | kwargs = dict( |
| | | model=params.model, |
| | | data_dir=ds_dict, |
| | | dataset_type=params.dataset_type, |
| | | work_dir=params.output_dir, |
| | | batch_bins=params.batch_bins, |
| | | max_epoch=params.max_epoch, |
| | | lr=params.lr) |
| | | trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs) |
| | | trainer.train() |
| | | |
| | | |
| | | if __name__ == '__main__': |
| | | params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch", data_path="./data") |
params.output_dir = "./checkpoint"              # path to save the model
params.data_path = "./example_data/"            # path to the data
params.dataset_type = "small"                   # use "small" for small datasets; use "large" if the data exceeds 1000 hours
params.batch_bins = 2000                        # batch size; with dataset_type="small" the unit is fbank feature frames, with dataset_type="large" the unit is milliseconds
params.max_epoch = 50                           # maximum number of training epochs
params.lr = 0.00005                             # learning rate
| | | |
| | | modelscope_finetune(params) |
| New file |
| | |
| | | import os |
| | | import argparse |
| | | from modelscope.pipelines import pipeline |
| | | from modelscope.utils.constant import Tasks |
| | | |
| | | def modelscope_infer(args): |
| | | os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpuid) |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model=args.model, |
| | | output_dir=args.output_dir, |
| | | batch_size=args.batch_size, |
| | | ) |
| | | inference_pipeline(audio_in=args.audio_in) |
| | | |
| | | if __name__ == "__main__": |
| | | parser = argparse.ArgumentParser() |
| | | parser.add_argument('--model', type=str, default="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch") |
| | | parser.add_argument('--audio_in', type=str, default="./data/test/wav.scp") |
| | | parser.add_argument('--output_dir', type=str, default="./results/") |
| | | parser.add_argument('--batch_size', type=int, default=64) |
| | | parser.add_argument('--gpuid', type=str, default="0") |
| | | args = parser.parse_args() |
| | | modelscope_infer(args) |
| New file |
| | |
| | | #!/usr/bin/env bash |
| | | |
| | | set -e |
| | | set -u |
| | | set -o pipefail |
| | | |
| | | stage=1 |
| | | stop_stage=2 |
| | | model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" |
| | | data_dir="./data/test" |
| | | output_dir="./results" |
| | | batch_size=64 |
| | | gpu_inference=true # whether to perform gpu decoding |
| | | gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1" |
njob=4           # the number of CPU decoding jobs (used when gpu_inference=false)
| | | |
| | | . utils/parse_options.sh || exit 1; |
| | | |
if [ "${gpu_inference}" == "true" ]; then
| | | nj=$(echo $gpuid_list | awk -F "," '{print NF}') |
| | | else |
| | | nj=$njob |
| | | batch_size=1 |
| | | gpuid_list="" |
| | | for JOB in $(seq ${nj}); do |
| | | gpuid_list=$gpuid_list"-1," |
| | | done |
| | | fi |
| | | |
| | | mkdir -p $output_dir/split |
| | | split_scps="" |
| | | for JOB in $(seq ${nj}); do |
| | | split_scps="$split_scps $output_dir/split/wav.$JOB.scp" |
| | | done |
| | | perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps} |
| | | |
| | | if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then |
| | | echo "Decoding ..." |
| | | gpuid_list_array=(${gpuid_list//,/ }) |
| | | for JOB in $(seq ${nj}); do |
| | | { |
| | | id=$((JOB-1)) |
| | | gpuid=${gpuid_list_array[$id]} |
| | | mkdir -p ${output_dir}/output.$JOB |
| | | python infer.py \ |
| | | --model ${model} \ |
| | | --audio_in ${output_dir}/split/wav.$JOB.scp \ |
| | | --output_dir ${output_dir}/output.$JOB \ |
| | | --batch_size ${batch_size} \ |
| | | --gpuid ${gpuid} |
| | | }& |
| | | done |
| | | wait |
| | | |
| | | mkdir -p ${output_dir}/1best_recog |
| | | for f in token score text; do |
| | | if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then |
| | | for i in $(seq "${nj}"); do |
| | | cat "${output_dir}/output.${i}/1best_recog/${f}" |
| | | done | sort -k1 >"${output_dir}/1best_recog/${f}" |
| | | fi |
| | | done |
| | | fi |
| | | |
| | | if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then |
| | | echo "Computing WER ..." |
| | | cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc |
| | | cp ${data_dir}/text ${output_dir}/1best_recog/text.ref |
| | | python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer |
| | | tail -n 3 ${output_dir}/1best_recog/text.cer |
| | | fi |
| | | |
| | | if [ $stage -le 3 ] && [ $stop_stage -ge 3 ];then |
| | | echo "SpeechIO TIOBE textnorm" |
| | | echo "$0 --> Normalizing REF text ..." |
| | | ./utils/textnorm_zh.py \ |
| | | --has_key --to_upper \ |
| | | ${data_dir}/text \ |
| | | ${output_dir}/1best_recog/ref.txt |
| | | |
| | | echo "$0 --> Normalizing HYP text ..." |
| | | ./utils/textnorm_zh.py \ |
| | | --has_key --to_upper \ |
| | | ${output_dir}/1best_recog/text.proc \ |
| | | ${output_dir}/1best_recog/rec.txt |
| | | grep -v $'\t$' ${output_dir}/1best_recog/rec.txt > ${output_dir}/1best_recog/rec_non_empty.txt |
| | | |
| | | echo "$0 --> computing WER/CER and alignment ..." |
| | | ./utils/error_rate_zh \ |
| | | --tokenizer char \ |
| | | --ref ${output_dir}/1best_recog/ref.txt \ |
| | | --hyp ${output_dir}/1best_recog/rec_non_empty.txt \ |
| | | ${output_dir}/1best_recog/DETAILS.txt | tee ${output_dir}/1best_recog/RESULTS.txt |
| | | rm -rf ${output_dir}/1best_recog/rec.txt ${output_dir}/1best_recog/rec_non_empty.txt |
| | | fi |
| | | |
| New file |
| | |
| | | import os |
| | | import shutil |
| | | |
| | | from modelscope.pipelines import pipeline |
| | | from modelscope.utils.constant import Tasks |
| | | from modelscope.hub.snapshot_download import snapshot_download |
| | | |
| | | from funasr.utils.compute_wer import compute_wer |
| | | |
| | | def modelscope_infer_after_finetune(params): |
| | | # prepare for decoding |
| | | |
try:
    pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except Exception:
    raise RuntimeError("Please download the pretrained model from ModelScope first.")
| | | shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb")) |
| | | decoding_path = os.path.join(params["output_dir"], "decode_results") |
| | | if os.path.exists(decoding_path): |
| | | shutil.rmtree(decoding_path) |
| | | os.mkdir(decoding_path) |
| | | |
| | | # decoding |
| | | inference_pipeline = pipeline( |
| | | task=Tasks.auto_speech_recognition, |
| | | model=pretrained_model_path, |
| | | output_dir=decoding_path, |
| | | batch_size=params["batch_size"] |
| | | ) |
| | | audio_in = os.path.join(params["data_dir"], "wav.scp") |
| | | inference_pipeline(audio_in=audio_in) |
| | | |
# compute CER if ground-truth text is provided
| | | text_in = os.path.join(params["data_dir"], "text") |
| | | if os.path.exists(text_in): |
| | | text_proc_file = os.path.join(decoding_path, "1best_recog/text") |
| | | compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer")) |
| | | |
| | | |
| | | if __name__ == '__main__': |
| | | params = {} |
| | | params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" |
| | | params["output_dir"] = "./checkpoint" |
| | | params["data_dir"] = "./data/test" |
| | | params["decoding_model_name"] = "valid.acc.ave_10best.pb" |
| | | params["batch_size"] = 64 |
| | | modelscope_infer_after_finetune(params) |
| New file |
| | |
| | | ../../../egs/aishell/transformer/utils |
| | |
| | | |
| | | from funasr.utils.compute_wer import compute_wer |
| | | |
| | | def modelscope_infer_core(output_dir, split_dir, njob, idx): |
| | | output_dir_job = os.path.join(output_dir, "output.{}".format(idx)) |
| | | gpu_id = (int(idx) - 1) // njob |
| | |
| | | |
| | | - Setting parameters in `infer.sh` |
| | | - <strong>model:</strong> # model name on ModelScope |
- <strong>data_dir:</strong> # the dataset dir, which needs to include `${data_dir}/wav.scp`; if `${data_dir}/text` also exists, the CER will be computed
| | | - <strong>output_dir:</strong> # result dir |
- <strong>batch_size:</strong> # batch size of inference
| | | - <strong>gpu_inference:</strong> # whether to perform gpu decoding, set false for cpu decoding |
| | | - <strong>gpuid_list:</strong> # set gpus, e.g., gpuid_list="0,1" |
- <strong>njob:</strong> # the number of jobs for CPU decoding (used when `gpu_inference`=false)
| | | |
- Then you can run the pipeline to infer with:
```shell
bash infer.sh
```

- Decode with multiple GPUs:
| | | ```shell |
| | | bash infer.sh \ |
| | | --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \ |
| | | --data_dir "./data/test" \ |
| | | --output_dir "./results" \ |
| | | --batch_size 64 \ |
| | | --gpu_inference true \ |
| | | --gpuid_list "0,1" |
| | | ``` |
| | | |
- Decode with multiple CPU threads:
| | | ```shell |
| | | bash infer.sh \ |
| | | --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \ |
| | | --data_dir "./data/test" \ |
| | | --output_dir "./results" \ |
| | | --gpu_inference false \ |
| | | --njob 64 |
| | | ``` |
| | | |
| | | - Results |
| | | |
The decoding results can be found in `${output_dir}/1best_recog/text.cer`, which includes recognition results of each sample and the CER metric of the whole test set.
| | | |
If you decode the SpeechIO test sets, you can apply text normalization with `stage`=3; `DETAILS.txt` and `RESULTS.txt` then record the results and CER after text normalization.
| | | |
| | |
| | | gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1" |
njob=4           # the number of CPU decoding jobs (used when gpu_inference=false)
| | | |
| | | . utils/parse_options.sh || exit 1; |
| | | |
if [ "${gpu_inference}" == "true" ]; then
| | | nj=$(echo $gpuid_list | awk -F "," '{print NF}') |
| | | else |
| | | nj=$njob |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if word_lm_train_config is not None: |
| | |
| | | #!/usr/bin/env python3 |
| | | # Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved. |
| | | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if word_lm_train_config is not None: |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if word_lm_train_config is not None: |
| | | raise NotImplementedError("Word LM is not implemented") |
| | | if ngpu > 1: |
| | |
| | | export_mode = param_dict.get("export_mode", False) |
| | | else: |
| | | hotword_list_or_file = None |
| | | |
| | | |
| | | if kwargs.get("device", None) == "cpu": |
| | | ngpu = 0 |
| | | if ngpu >= 1 and torch.cuda.is_available(): |
| | | device = "cuda" |
| | | else: |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if word_lm_train_config is not None: |
| | | raise NotImplementedError("Word LM is not implemented") |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if word_lm_train_config is not None: |
| | | raise NotImplementedError("Word LM is not implemented") |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if word_lm_train_config is not None: |
| | | raise NotImplementedError("Word LM is not implemented") |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if word_lm_train_config is not None: |
| | |
| | | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved. |
| | | # MIT License (https://opensource.org/licenses/MIT) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if ngpu > 1: |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | logging.basicConfig( |
| | | level=log_level, |
| | | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", |
| | | ) |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | |
| | | if ngpu >= 1 and torch.cuda.is_available(): |
| | | device = "cuda" |
| | |
| | | #!/usr/bin/env python3 |
| | | # Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved. |
| | | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | #!/usr/bin/env python3 |
| | | # Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved. |
| | | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | logging.basicConfig( |
| | | level=log_level, |
| | | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", |
| | | ) |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if ngpu >= 1 and torch.cuda.is_available(): |
| | | device = "cuda" |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if ngpu > 1: |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if ngpu > 1: |
| | |
| | | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved. |
| | | # MIT License (https://opensource.org/licenses/MIT) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | assert check_argument_types() |
| | | # 1. Build ASR model |
| | | tp_model, tp_train_args = ASRTask.build_model_from_file( |
timestamp_infer_config, timestamp_model_file, device=device
| | | ) |
| | | if 'cuda' in device: |
| | | tp_model = tp_model.cuda() # force model to cuda |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if ngpu > 1: |
| | |
| | | #!/usr/bin/env python3 |
| | | # Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved. |
| | | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if ngpu > 1: |
| | |
| | | #!/usr/bin/env python3 |
| | | # Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved. |
| | | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) |
| | | |
| | | import torch |
| | | torch.set_num_threads(1) |
| | | |
| | | import argparse |
| | | import logging |
| | |
| | | **kwargs, |
| | | ): |
| | | assert check_argument_types() |
| | | ncpu = kwargs.get("ncpu", 1) |
| | | torch.set_num_threads(ncpu) |
| | | |
| | | if batch_size > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | if ngpu > 1: |
| | |
# CPU Benchmark (Libtorch)
| | | |
| | | ## Configuration |
| | | ### Data set: |
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.
| | | |
| | | ### Tools |
| | | #### Install Requirements |
| | | Install ModelScope and FunASR |
| | | ```shell |
| | | pip install -U modelscope funasr |
| | | # For the users in China, you could install with the command: |
| | | #pip install -U funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple |
| | | ``` |
| | | |
| | | Install requirements |
| | | ```shell |
| | | git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR |
| | | cd funasr/runtime/python/utils |
| | | pip install -r requirements.txt |
| | | ``` |
| | | |
| | | #### Recipe |
| | | |
Set the model, data path and output_dir, then run:

```shell
nohup bash test_rtf.sh &> log.txt &
```
| | | |
| | | |
| | | ## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) |
| | |
# CPU Benchmark (ONNX)
| | | |
| | | ## Configuration |
| | | ### Data set: |
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.
| | | |
| | | ### Tools |
| | | #### Install Requirements |
| | | Install ModelScope and FunASR |
| | | ```shell |
| | | pip install -U modelscope funasr |
| | | # For the users in China, you could install with the command: |
| | | #pip install -U funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple |
| | | ``` |
| | | |
| | | Install requirements |
| | | ```shell |
| | | git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR |
| | | cd funasr/runtime/python/utils |
| | | pip install -r requirements.txt |
| | | ``` |
| | | |
| | | #### Recipe |
| | | |
Set the model, data path and output_dir, then run:

```shell
nohup bash test_rtf.sh &> log.txt &
```
| | | |
| | | |
| | | ## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) |
| | |
| | | # pip install -e ./ -i https://mirror.sjtu.edu.cn/pypi/web/simple |
| | | ``` |
| | | |
## Inference with runtime

### Speech Recognition
#### Paraformer
```python
from funasr_onnx import Paraformer

model_dir = "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1)

wav_path = ['./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)
```
- Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- Input: wav format file, supported input types: `str, np.ndarray, List[str]`
- Output: `List[str]`: recognition result

#### Paraformer-online

### Voice Activity Detection
#### FSMN-VAD
```python
from funasr_onnx import Fsmn_vad

| | | model_dir = "./export/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch" |
| | | wav_path = "./export/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/example/vad_example.wav" |
| | | model = Fsmn_vad(model_dir) |
| | | |
| | | result = model(wav_path) |
| | | print(result) |
| | | ``` |
| | | - Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` |
- Input: wav format file, supported input types: `str, np.ndarray, List[str]`
- Output: detected voice activity segments (start/end times)
| | | |
| | | #### FSMN-VAD-online |
| | | ```python |
| | | from funasr_onnx import Fsmn_vad_online |
| | | import soundfile |
| | | |
| | | |
| | | model_dir = "./export/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch" |
| | | wav_path = "./export/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/example/vad_example.wav" |
| | | model = Fsmn_vad_online(model_dir) |
| | | |
| | | |
| | | ##online vad |
| | | speech, sample_rate = soundfile.read(wav_path) |
| | | speech_length = speech.shape[0] |
| | | # |
| | | sample_offset = 0 |
| | | step = 1600 |
| | | param_dict = {'in_cache': []} |
| | | for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)): |
| | | if sample_offset + step >= speech_length - 1: |
| | | step = speech_length - sample_offset |
| | | is_final = True |
| | | else: |
| | | is_final = False |
| | | param_dict['is_final'] = is_final |
| | | segments_result = model(audio_in=speech[sample_offset: sample_offset + step], |
| | | param_dict=param_dict) |
| | | if segments_result: |
| | | print(segments_result) |
| | | ``` |
| | | - Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` |
- Input: wav format file, supported input types: `str, np.ndarray, List[str]`
- Output: detected voice activity segments (start/end times)
| | | |
| | | ### Punctuation Restoration |
| | | #### CT-Transformer |
| | | ```python |
| | | from funasr_onnx import CT_Transformer |
| | | |
| | | model_dir = "./export/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch" |
| | | model = CT_Transformer(model_dir) |
| | | |
| | | text_in="跨境河流是养育沿岸人民的生命之源长期以来为帮助下游地区防灾减灾中方技术人员在上游地区极为恶劣的自然条件下克服巨大困难甚至冒着生命危险向印方提供汛期水文资料处理紧急事件中方重视印方在跨境河流问题上的关切愿意进一步完善双方联合工作机制凡是中方能做的我们都会去做而且会做得更好我请印度朋友们放心中国在上游的任何开发利用都会经过科学规划和论证兼顾上下游的利益" |
| | | result = model(text_in) |
| | | print(result[0]) |
| | | ``` |
- Model_dir: the model path, which contains `model.onnx`, `config.yaml`
- Input: text to be punctuated, `str`
- Output: punctuated text
| | | |
| | | #### CT-Transformer-online |
| | | ```python |
| | | from funasr_onnx import CT_Transformer_VadRealtime |
| | | |
| | | model_dir = "./export/damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727" |
| | | model = CT_Transformer_VadRealtime(model_dir) |
| | | |
| | | text_in = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助下游地区防灾减灾中方技术人员|在上游地区极为恶劣的自然条件下克服巨大困难甚至冒着生命危险|向印方提供汛期水文资料处理紧急事件中方重视印方在跨境河流>问题上的关切|愿意进一步完善双方联合工作机制|凡是|中方能做的我们|都会去做而且会做得更好我请印度朋友们放心中国在上游的|任何开发利用都会经过科学|规划和论证兼顾上下游的利益" |
| | | |
| | | vads = text_in.split("|") |
| | | rec_result_all="" |
| | | param_dict = {"cache": []} |
| | | for vad in vads: |
| | | result = model(vad, param_dict=param_dict) |
| | | rec_result_all += result[0] |
| | | |
| | | print(rec_result_all) |
| | | ``` |
- Model_dir: the model path, which contains `model.onnx`, `config.yaml`
- Input: text to be punctuated, `str`; `param_dict` carries the punctuation cache across segments
- Output: punctuated text
| | | |
| | | ## Performance benchmark |
| | | |
| | |
| | | logging.warning("No keep_nbest_models is given. Change to [1]") |
| | | trainer_options.keep_nbest_models = [1] |
| | | keep_nbest_models = trainer_options.keep_nbest_models |
| | | |
# batch_interval must be set and greater than 0
| | | assert trainer_options.batch_interval > 0 |
| | | |
| | | output_dir = Path(trainer_options.output_dir) |
| | | reporter = Reporter() |