Merge pull request #392 from alibaba-damo-academy/main
update dev_lyh
11 files modified
8 files added
3 files renamed

## Citations

```bibtex
@inproceedings{gao2020universal,
title={Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model},
author={Gao, Zhifu and Zhang, Shiliang and Lei, Ming and McLoughlin, Ian},
booktitle={arXiv preprint arXiv:2010.14099},
year={2020}
}

@inproceedings{gao2022paraformer,
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
booktitle={INTERSPEECH},
year={2022}
}

@inproceedings{Shi2023AchievingTP,
title={Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model},
author={Xian Shi and Yanni Chen and Shiliang Zhang and Zhijie Yan},
New file

## Audio Cut

## Realtime Speech Recognition

## Audio Chat

New file

# Docker

## Install Docker

### Ubuntu
```shell
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```
### Debian
```shell
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```

### CentOS
```shell
curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun
```

### macOS
```shell
brew install --cask --appdir=/Applications docker
```

### Windows
Refer to the [docs](https://docs.docker.com/desktop/install/windows-install/).

## Start Docker
```shell
sudo systemctl start docker
```
## Download image

### Image
#### CPU
`registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py37-torch1.11.0-tf1.15.5-1.5.0`

#### GPU

`registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.5.0`

### Pull Image
```shell
sudo docker pull <image-name>:<tag>
```

### Check Image
```shell
sudo docker images
```

## Run Docker
```shell
sudo docker run -itd --name funasr <image-name>:<tag> bash
sudo docker exec -it funasr bash
```

## Stop Docker
```shell
exit
sudo docker ps
sudo docker stop funasr
```

   :caption: Installation

   ./installation.md
   ./docker.md

.. toctree::
   :maxdepth: 1
   :caption: Recipe

   ./recipe/asr_recipe.md
   ./recipe/sv_recipe.md
   ./recipe/punc_recipe.md
   ./recipe/vad_recipe.md
   ./recipe/sd_recipe.md

.. toctree::
   :maxdepth: 1

   ./modescope_pipeline/tp_pipeline.md
   ./modescope_pipeline/sv_pipeline.md
   ./modescope_pipeline/sd_pipeline.md
   ./modescope_pipeline/lm_pipeline.md

.. toctree::
   :maxdepth: 1
   :caption: Huggingface pipeline

.. toctree::
   :maxdepth: 1

.. toctree::
   :maxdepth: 1
   :caption: Application

   ./application.md

.. toctree::
   :maxdepth: 1
   :caption: FQA

   ./FQA.md
New file

../../egs_modelscope/asr/TEMPLATE/README.md

# Quick Start

> **Note**:
> The modelscope pipeline supports inference and finetuning with all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope). Here we take typical models as examples to demonstrate the usage.


## Inference with pipeline

### Speech Recognition
#### Paraformer Model
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
)

rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(rec_result)
# {'text': '欢迎大家来体验达摩院推出的语音识别模型'}
```

### Voice Activity Detection
#### FSMN-VAD Model
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.voice_activity_detection,
    model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
)

segments_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav')
print(segments_result)
# {'text': [[70, 2340], [2620, 6200], [6480, 23670], [23950, 26250], [26780, 28990], [29950, 31430], [31750, 37600], [38210, 46900], [47310, 49630], [49910, 56460], [56740, 59540], [59820, 70450]]}
```
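
The segment list returned above is a sequence of `[start_ms, end_ms]` pairs in milliseconds. As a small post-processing sketch (not part of the pipeline API), the total voiced duration can be derived directly from it:

```python
# First three segments from the example output above, in milliseconds.
segments = [[70, 2340], [2620, 6200], [6480, 23670]]

# Total voiced time in seconds: sum of (end - start) over all segments.
speech_seconds = sum(end - start for start, end in segments) / 1000.0
print(speech_seconds)  # 23.04
```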

### Punctuation Restoration
#### CT_Transformer Model
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Model name as listed in the model zoo; adjust to the punctuation model you use.
inference_pipeline = pipeline(
    task=Tasks.punctuation,
    model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
)

rec_result = inference_pipeline(text_in='我们都是木头人不会讲话不会动')
print(rec_result)
# {'text': '我们都是木头人,不会讲话,不会动。'}
```

### Timestamp Prediction
#### TP-Aligner Model
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Model name as listed in the model zoo; adjust to the timestamp model you use.
inference_pipeline = pipeline(
    task=Tasks.speech_timestamp,
    model='damo/speech_timestamp_prediction-v1-16k-offline',
)

rec_result = inference_pipeline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_timestamps.wav',
    text_in='一 个 东 太 平 洋 国 家 为 什 么 跑 到 西 太 平 洋 来 了 呢',
)
print(rec_result)
# {'text': '<sil> 0.000 0.380;一 0.380 0.560;个 0.560 0.800;东 0.800 0.980;太 0.980 1.140;平 1.140 1.260;洋 1.260 1.440;国 1.440 1.680;家 1.680 1.920;<sil> 1.920 2.040;为 2.040 2.200;什 2.200 2.320;么 2.320 2.500;跑 2.500 2.680;到 2.680 2.860;西 2.860 3.040;太 3.040 3.200;平 3.200 3.380;洋 3.380 3.500;来 3.500 3.640;了 3.640 3.800;呢 3.800 4.150;<sil> 4.150 4.440;', 'timestamp': [[380, 560], [560, 800], [800, 980], [980, 1140], [1140, 1260], [1260, 1440], [1440, 1680], [1680, 1920], [2040, 2200], [2200, 2320], [2320, 2500], [2500, 2680], [2680, 2860], [2860, 3040], [3040, 3200], [3200, 3380], [3380, 3500], [3500, 3640], [3640, 3800], [3800, 4150]]}
```
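
The `text` field above packs `token start end` triples separated by semicolons (times in seconds, `<sil>` marking silences). A small parsing sketch, independent of the pipeline itself:

```python
def parse_timestamps(text):
    """Split a 'token start end;token start end;...' string into tuples."""
    items = []
    for piece in text.strip().rstrip(';').split(';'):
        token, start, end = piece.split()
        items.append((token, float(start), float(end)))
    return items

parsed = parse_timestamps('<sil> 0.000 0.380;一 0.380 0.560;个 0.560 0.800;')
print(parsed[1])  # ('一', 0.38, 0.56)
```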

### Speaker Verification
#### X-vector Model
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Model name as listed in the model zoo; adjust to the speaker-verification model you use.
inference_sv_pipeline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',
)

# speaker verification
rec_result = inference_sv_pipeline(audio_in=('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav', 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav'))
print(rec_result["scores"][0])
# 0.8540499500025098
```
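
The score printed above is a similarity between the enrollment and test utterances; turning it into a same/different-speaker decision is just a threshold comparison. The `0.5` threshold below is purely illustrative, not a value recommended by the model card:

```python
def same_speaker(score, threshold=0.5):
    # Illustrative decision rule: the threshold is an assumption, tune it on your data.
    return score >= threshold

print(same_speaker(0.8540499500025098))  # True
```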

### Speaker Diarization
#### SOND Model
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
```

## Finetune with pipeline
### Speech Recognition
#### Paraformer Model

finetune.py
New file

../../egs_modelscope/vad/TEMPLATE/README.md
File was renamed from egs_modelscope/asr/TEMPLATE/infer_after_finetune.py

import json
import os
import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer

def modelscope_infer_after_finetune(params):
    # prepare for decoding


if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
    params["batch_size"] = 64
    modelscope_infer_after_finetune(params)

import sys

model = sys.argv[1]
checkpoint_dir = sys.argv[2]
checkpoint_name = sys.argv[3]

try:
    pretrained_model_path = snapshot_download(model, cache_dir=checkpoint_dir)
except Exception:
    raise RuntimeError("Please download the pretrained model from ModelScope first.")
shutil.copy(os.path.join(checkpoint_dir, checkpoint_name), os.path.join(pretrained_model_path, "model.pb"))
New file

# Copyright 2021 NPU, ASLP Group (Author: Qijie Shao)

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# process_opus.py: segmentation and downsampling of opus audio

# usage: python3 process_opus.py wav.scp segments output_wav.scp

from pydub import AudioSegment
import sys
import os


def read_file(wav_scp, segments):
    wav_scp_dict = {}
    with open(wav_scp, "r", encoding="UTF-8") as fin:
        for line_str in fin:
            wav_id, path = line_str.strip().split()
            wav_scp_dict[wav_id] = path

    utt_list = []
    seg_path_list = []
    start_time_list = []
    end_time_list = []
    with open(segments, "r", encoding="UTF-8") as fin:
        for line_str in fin:
            arr = line_str.strip().split()
            assert len(arr) == 4
            utt_list.append(arr[0])
            seg_path_list.append(wav_scp_dict[arr[1]])
            start_time_list.append(float(arr[2]))
            end_time_list.append(float(arr[3]))
    return utt_list, seg_path_list, start_time_list, end_time_list


# TODO(Qijie): Fix the process logic
def output(output_wav_scp, utt_list, seg_path_list, start_time_list, end_time_list):
    num_utts = len(utt_list)
    with open(output_wav_scp, "w", encoding="UTF-8") as fout:
        previous_wav_path = ""
        for i in range(num_utts):
            utt_id = utt_list[i]
            current_wav_path = seg_path_list[i]
            output_dir = (os.path.dirname(current_wav_path)).replace(
                "audio", "audio_seg"
            )
            seg_wav_path = os.path.join(output_dir, utt_id + ".wav")

            os.makedirs(output_dir, exist_ok=True)
            if current_wav_path != previous_wav_path:
                source_wav = AudioSegment.from_file(current_wav_path)
                previous_wav_path = current_wav_path

            start = int(start_time_list[i] * 1000)
            end = int(end_time_list[i] * 1000)
            target_audio = source_wav[start:end].set_frame_rate(16000).set_sample_width(2)
            target_audio.export(seg_wav_path, format="wav")

            fout.write("{} {}\n".format(utt_id, seg_wav_path))
            if i % 200 == 0:
                # report progress as a percentage of processed utterances
                print("seg wav finished: {}%".format(int(i / num_utts * 100)))


def main():
    wav_scp = sys.argv[1]
    segments = sys.argv[2]
    output_wav_scp = sys.argv[3]

    utt_list, seg_path_list, start_time_list, end_time_list = read_file(
        wav_scp, segments
    )
    output(output_wav_scp, utt_list, seg_path_list, start_time_list, end_time_list)


if __name__ == "__main__":
    main()
File was renamed from docs/modescope_pipeline/asr_pipeline.md

# Speech Recognition

> **Note**:
> The modelscope pipeline supports inference and finetuning with all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope). Here we take the Paraformer and Paraformer-online models as examples to demonstrate the usage.

## Inference

### Quick start
#### [Paraformer Model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
)

rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(rec_result)
```
#### [Paraformer-online Model](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/summary)
```python
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
)
```
For the full demo code, please refer to the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/241).

#### [UniASR Model](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-common-vocab3445-pytorch-online/summary)
There are three decoding modes for the UniASR model (`fast`, `normal`, `offline`); for more model details, please refer to the [docs](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-common-vocab3445-pytorch-online/summary).
```python
decoding_model = "fast"  # "fast", "normal", "offline"

rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(rec_result)
```
The `fast` and `normal` decoding modes are fake streaming, which can be used to evaluate recognition accuracy.
For the full demo code, please refer to the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/151).
#### [RNN-T-online Model]()
To be added.

#### API-reference
##### Define pipeline
- `task`: `Tasks.auto_speech_recognition`
- `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU
- `output_dir`: `None` (Default), the output path of results if set
- `batch_size`: `1` (Default), batch size when decoding
##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, e.g.: asr_example.wav
  - pcm_path, e.g.: asr_example.pcm
  - audio bytes stream, e.g.: bytes data from a microphone
  - audio sample point, e.g.: `audio, rate = soundfile.read("asr_example_zh.wav")`, the dtype is numpy.ndarray or torch.Tensor
  - wav.scp, kaldi-style wav list (`wav_id \t wav_path`), e.g.:
```text
asr_example1 ./audios/asr_example1.wav
asr_example2 ./audios/asr_example2.wav
```
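
For reference, a kaldi-style `wav.scp` such as the one above is just `wav_id` and `wav_path` separated by whitespace; a few lines suffice to read it (a standalone sketch, not a FunASR API):

```python
import os
import tempfile

def read_wav_scp(path):
    """Parse a kaldi-style wav.scp into {wav_id: wav_path}."""
    wav_dict = {}
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            wav_id, wav_path = line.strip().split(maxsplit=1)
            wav_dict[wav_id] = wav_path
    return wav_dict

# Demo with a temporary file holding the two example entries above.
with tempfile.NamedTemporaryFile("w", suffix=".scp", delete=False) as f:
    f.write("asr_example1 ./audios/asr_example1.wav\n")
    f.write("asr_example2 ./audios/asr_example2.wav\n")
scp = read_wav_scp(f.name)
os.remove(f.name)
print(scp["asr_example1"])  # ./audios/asr_example1.wav
```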

FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-thread CPUs or multiple GPUs.

- Setting parameters in `infer.sh`
    - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
    - `data_dir`: the dataset dir needs to include `wav.scp`. If `${data_dir}/text` also exists, CER will be computed
    - `output_dir`: output dir of the recognition results
    - `batch_size`: `64` (Default), batch size of inference on gpu
    - `gpu_inference`: `true` (Default), whether to perform gpu decoding, set false for CPU inference
    - `gpuid_list`: `0,1` (Default), which gpu_ids are used to infer
    - `njob`: only used for CPU inference (`gpu_inference`=`false`), `64` (Default), the number of jobs for CPU decoding
    - `checkpoint_dir`: only used for inferring finetuned models, the path dir of finetuned models
    - `checkpoint_name`: only used for inferring finetuned models, `valid.cer_ctc.ave.pb` (Default), which checkpoint is used to infer
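
When `${data_dir}/text` is present the recipe reports CER, i.e. character-level edit distance divided by the reference length (FunASR computes it with `funasr.utils.compute_wer`). A minimal self-contained sketch of the metric:

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance over reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))  # DP row for the empty reference prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution or match
            prev = cur
    return d[n] / m

print(cer("欢迎大家来体验", "欢迎大家体验"))  # one deletion over 7 chars: 0.14285714285714285
```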

- Decode with multi GPUs:
```shell
bash infer.sh \
    --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
    --data_dir "./data/test" \
    --output_dir "./results" \
    --batch_size 64 \
    --gpu_inference true \
    --gpuid_list "0,1"
```
### Finetune with your data

- Modify finetune training related parameters in [finetune.py](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/finetune.py)
    - `output_dir`: result dir
    - `data_dir`: the dataset dir needs to include files: `train/wav.scp`, `train/text`; `validation/wav.scp`, `validation/text`
    - `dataset_type`: for datasets larger than 1000 hours, set as `large`, otherwise set as `small`
    - `batch_bins`: batch size. If dataset_type is `small`, `batch_bins` indicates the number of feature frames; if dataset_type is `large`, `batch_bins` indicates the duration in ms
    - `max_epoch`: number of training epochs
    - `lr`: learning rate

- Then you can run the pipeline to finetune with:
```shell
CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node 2 finetune.py > log.txt 2>&1
```
## Inference with your finetuned model
- Modify inference related parameters in [infer_after_finetune.py](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer_after_finetune.py)
    - `modelscope_model_name`: model name on ModelScope
    - `output_dir`: result dir
    - `data_dir`: the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - `decoding_model_name`: set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
    - `batch_size`: batch size of inference

- Then you can run the pipeline to infer with:
```shell
python infer_after_finetune.py
```
- Setting parameters in [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) is the same as in the [docs](https://github.com/alibaba-damo-academy/FunASR/tree/main/egs_modelscope/asr/TEMPLATE#inference-with-multi-thread-cpus-or-multi-gpus)

- Decode with multi GPUs:
```shell
bash infer.sh \
    --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
    --data_dir "./data/test" \
    --output_dir "./results" \
    --batch_size 64 \
    --gpu_inference true \
    --gpuid_list "0,1" \
    --checkpoint_dir "./checkpoint" \
    --checkpoint_name "valid.cer_ctc.ave.pb"
```
- Decode with multi-thread CPUs:
```shell
bash infer.sh \
    --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
    --data_dir "./data/test" \
    --output_dir "./results" \
    --gpu_inference false \
    --njob 64 \
    --checkpoint_dir "./checkpoint" \
    --checkpoint_name "valid.cer_ctc.ave.pb"
```

batch_size=64
gpu_inference=true   # whether to perform gpu decoding
gpuid_list="0,1"     # set gpus, e.g., gpuid_list="0,1"
njob=64              # the number of jobs for CPU decoding, only used when gpu_inference=false
checkpoint_dir=
checkpoint_name="valid.cer_ctc.ave.pb"

. utils/parse_options.sh || exit 1;

done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}

if [ -n "${checkpoint_dir}" ]; then
    python utils/prepare_checkpoint.py ${model} ${checkpoint_dir} ${checkpoint_name}
    model=${checkpoint_dir}/${model}
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
    echo "Decoding ..."
    gpuid_list_array=(${gpuid_list//,/ })
File was renamed from docs/modescope_pipeline/vad_pipeline.md

For the full demo code, please refer to the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/236).



#### API-reference
##### Define pipeline
- `task`: `Tasks.voice_activity_detection`
- `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU
- `output_dir`: `None` (Default), the output path of results if set
- `batch_size`: `1` (Default), batch size when decoding
##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, e.g.: asr_example.wav
  - pcm_path, e.g.: asr_example.pcm
  - audio bytes stream, e.g.: bytes data from a microphone
  - audio sample point, e.g.: `audio, rate = soundfile.read("asr_example_zh.wav")`, the dtype is numpy.ndarray or torch.Tensor
  - wav.scp, kaldi-style wav list (`wav_id \t wav_path`), e.g.:
```text
asr_example1 ./audios/asr_example1.wav
asr_example2 ./audios/asr_example2.wav
```
- `output_dir`: `None` (Default), the output path of results if set

### Inference with multi-thread CPUs or multi GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/vad/TEMPLATE/infer.sh) to decode with multi-thread CPUs or multiple GPUs.

- Setting parameters in `infer.sh`
    - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
    - `data_dir`: the dataset dir needs to include `wav.scp`
    - `output_dir`: output dir of the recognition results
    - `batch_size`: `64` (Default), batch size of inference on gpu
    - `gpu_inference`: `true` (Default), whether to perform gpu decoding, set false for CPU inference
    - `gpuid_list`: `0,1` (Default), which gpu_ids are used to infer
    - `njob`: only used for CPU inference (`gpu_inference`=`false`), `64` (Default), the number of jobs for CPU decoding
    - `checkpoint_dir`: only used for inferring finetuned models, the path dir of finetuned models
    - `checkpoint_name`: only used for inferring finetuned models, `valid.cer_ctc.ave.pb` (Default), which checkpoint is used to infer

- Decode with multi GPUs:
```shell
bash infer.sh \
    --model "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch" \
    --data_dir "./data/test" \
    --output_dir "./results" \
    --batch_size 64 \
    --gpu_inference true \
    --gpuid_list "0,1"
```
- Decode with multi-thread CPUs:
```shell
bash infer.sh \
    --model "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch" \
    --data_dir "./data/test" \
    --output_dir "./results" \
    --gpu_inference false \
    --njob 64
```

- Results

The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes the recognition results of each sample and the CER metric of the whole test set.

If you decode the SpeechIO test sets, you can use textnorm with `stage`=3, and `DETAILS.txt`, `RESULTS.txt` record the results and CER after text normalization.


## Finetune with pipeline

New file

import os
import shutil
import argparse
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def modelscope_infer(args):
    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpuid)
    inference_pipeline = pipeline(
        task=Tasks.voice_activity_detection,
        model=args.model,
        output_dir=args.output_dir,
        batch_size=args.batch_size,
    )
    inference_pipeline(audio_in=args.audio_in)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # default model matches the VAD task above
    parser.add_argument('--model', type=str, default="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch")
    parser.add_argument('--audio_in', type=str, default="./data/test/wav.scp")
    parser.add_argument('--output_dir', type=str, default="./results/")
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--gpuid', type=str, default="0")
    args = parser.parse_args()
    modelscope_infer(args)
| New file |
| | |
| | | #!/usr/bin/env bash |
| | | |
| | | set -e |
| | | set -u |
| | | set -o pipefail |
| | | |
| | | stage=1 |
| | | stop_stage=2 |
| | | model="damo/speech_fsmn_vad_zh-cn-16k-common" |
| | | data_dir="./data/test" |
| | | output_dir="./results" |
| | | batch_size=64 |
| | | gpu_inference=true # whether to perform gpu decoding |
| | | gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1" |
| | | njob=64 # the number of jobs for CPU decoding, if gpu_inference=false, use CPU decoding, please set njob |
| | | checkpoint_dir= |
| | | checkpoint_name="valid.cer_ctc.ave.pb" |
| | | |
| | | . utils/parse_options.sh || exit 1; |
| | | |
| | | if ${gpu_inference} == "true"; then |
| | | nj=$(echo $gpuid_list | awk -F "," '{print NF}') |
| | | else |
| | | nj=$njob |
| | | batch_size=1 |
| | | gpuid_list="" |
| | | for JOB in $(seq ${nj}); do |
| | | gpuid_list=$gpuid_list"-1," |
| | | done |
| | | fi |
| | | |
| | | mkdir -p $output_dir/split |
| | | split_scps="" |
| | | for JOB in $(seq ${nj}); do |
| | | split_scps="$split_scps $output_dir/split/wav.$JOB.scp" |
| | | done |
| | | perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps} |
| | | |
| | | if ${checkpoint_dir}; then |
| | | python utils/prepare_checkpoint.py ${model} ${checkpoint_dir} ${checkpoint_name} |
| | | model=${checkpoint_dir}/${model} |
| | | fi |
| | | |
| | | if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then |
| | | echo "Decoding ..." |
| | | gpuid_list_array=(${gpuid_list//,/ }) |
| | | for JOB in $(seq ${nj}); do |
| | | { |
| | | id=$((JOB-1)) |
| | | gpuid=${gpuid_list_array[$id]} |
| | | mkdir -p ${output_dir}/output.$JOB |
| | | python infer.py \ |
| | | --model ${model} \ |
| | | --audio_in ${output_dir}/split/wav.$JOB.scp \ |
| | | --output_dir ${output_dir}/output.$JOB \ |
| | | --batch_size ${batch_size} \ |
| | | --gpuid ${gpuid} |
| | | }& |
| | | done |
| | | wait |
| | | |
| | | mkdir -p ${output_dir}/1best_recog |
| | | for f in token score text; do |
| | | if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then |
| | | for i in $(seq "${nj}"); do |
| | | cat "${output_dir}/output.${i}/1best_recog/${f}" |
| | | done | sort -k1 >"${output_dir}/1best_recog/${f}" |
| | | fi |
| | | done |
| | | fi |
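The job-count logic at the top of the script (GPU decoding: one job per device in `gpuid_list`; CPU decoding: `njob` jobs, each pinned to device `-1`) can be sketched standalone. Variable names mirror the script; the values are example settings:

```shell
# Standalone sketch of the script's job-count logic.
gpu_inference=true
gpuid_list="0,1"
njob=4

if [ "${gpu_inference}" == "true" ]; then
  # one decoding job per GPU id in the comma-separated list
  nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
  # CPU decoding: njob jobs, each assigned the dummy device id -1
  nj=$njob
  gpuid_list=""
  for JOB in $(seq ${nj}); do
    gpuid_list=$gpuid_list"-1,"
  done
fi
echo "nj=$nj gpuid_list=$gpuid_list"   # prints "nj=2 gpuid_list=0,1"
```

With `gpu_inference=false` the same snippet would instead yield `nj=4` and `gpuid_list="-1,-1,-1,-1,"`.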
| | | |
| New file |
| | |
| | | ../../../egs/aishell/transformer/utils |
| | |
| | | text_name="text", |
| | | non_linguistic_symbols=train_args.non_linguistic_symbols, |
| | | ) |
| | | print("start decoding!!!") |
| | | |
| | | @torch.no_grad() |
| | | def __call__(self, text: Union[list, str], split_size=20): |
| | |
| | | result, _ = text2punc(line) |
| | | item = {'key': key, 'value': result} |
| | | results.append(item) |
| | | print(results) |
| | | return results |
| | | |
| | | for inference_text, _, _ in data_path_and_name_and_type: |
| | |
| | | # Export models |
| | | |
| | | ## Environments |
| | | torch >= 1.11.0 |
| | | modelscope >= 1.2.0 |
| | | torch-quant >= 0.4.0 (required for exporting a quantized TorchScript model)
| | | # pip install torch-quant -i https://pypi.org/simple |
| | | |
| | | ## Install modelscope and funasr
| | | |
| | | The installation is the same as [funasr](https://github.com/alibaba-damo-academy/FunASR/blob/main/README.md#installation) |
| | | ```shell |
| | | # pip3 install torch torchaudio |
| | | pip install -U modelscope funasr |
| | | # For users in China, you can install with the command:
| | | # pip install -U modelscope funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple |
| | | ``` |
| | | ### Install the quantization tools |
| | | ```shell |
| | | pip install torch-quant # Optional, for torchscript quantization |
| | | pip install onnx onnxruntime # Optional, for onnx quantization |
| | | ``` |
| | | |
| | | ## Export model |
| | | `Tip`: requires torch>=1.11.0
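For illustration, the `funasr.export` entry point is typically invoked as below. The model name is an example from ModelScope and the exact flags may differ between FunASR versions; check `python -m funasr.export.export_model --help` for the version you installed:

```shell
# Example export invocation (example model name; flags may vary by version)
python -m funasr.export.export_model \
    --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch \
    --export-dir ./export \
    --type onnx \
    --quantize True
```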
| | |
| | | Install modelscope and funasr
| | | |
| | | ```shell |
| | | pip install -U modelscope funasr
| | | # For users in China, you can install with the command:
| | | # pip install -U modelscope funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple
| | | git clone https://github.com/alibaba/FunASR.git && cd FunASR |
| | | pip install --editable ./ |
| | | ``` |
| | | |
| | | Install the requirements |
| | |
| | | ### Install [modelscope and funasr](https://github.com/alibaba-damo-academy/FunASR#installation) |
| | | |
| | | ```shell |
| | | # pip3 install torch torchaudio
| | | pip install -U modelscope funasr
| | | # For users in China, you can install with the command:
| | | # pip install -U modelscope funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple
| | | pip install torch-quant # Optional, for torchscript quantization |
| | | pip install onnx onnxruntime # Optional, for onnx quantization |
| | | ``` |
| | | |
| | | ### Export [onnx model](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) |
| | |
| | | ### Install [modelscope and funasr](https://github.com/alibaba-damo-academy/FunASR#installation) |
| | | |
| | | ```shell |
| | | # pip3 install torch torchaudio
| | | pip install -U modelscope funasr
| | | # For users in China, you can install with the command:
| | | # pip install -U modelscope funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple
| | | pip install torch-quant # Optional, for torchscript quantization |
| | | pip install onnx onnxruntime # Optional, for onnx quantization |
| | | ``` |
| | | |
| | | ### Export [onnx model](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) |
| | |
| | | Install modelscope and funasr
| | | |
| | | ```shell |
| | | pip install -U modelscope funasr
| | | # For users in China, you can install with the command:
| | | # pip install -U modelscope funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple
| | | git clone https://github.com/alibaba/FunASR.git && cd FunASR |
| | | pip install --editable ./ |
| | | ``` |
| | | |
| | | Install the requirements for server |
| | |
| | | if num_batch_updates % batch_interval == 0:
| | |     if options.use_pai and options.oss_bucket is not None:
| | |         buffer = BytesIO()
| | |         # unwrap DistributedDataParallel before saving the checkpoint
| | |         if hasattr(model, "module"):
| | |             torch.save(model.module.state_dict(), buffer)
| | |         else:
| | |             torch.save(model.state_dict(), buffer)
| | |         options.oss_bucket.put_object(os.path.join(output_dir, f"{num_batch_updates}step.pb"), buffer.getvalue())
| | |     else:
| | |         if hasattr(model, "module"):
| | |             torch.save(model.module.state_dict(), os.path.join(output_dir, f"{num_batch_updates}step.pb"))
| | |         else:
| | |             torch.save(model.state_dict(), os.path.join(output_dir, f"{num_batch_updates}step.pb"))
| | | |
| | | if distributed: |
| | | torch.distributed.all_reduce(iterator_stop, ReduceOp.SUM) |