Merge pull request #112 from alibaba-damo-academy/dev_wjm
update github.io
| New file |
| | |
| | | # Build custom tasks |
FunASR, similar to ESPnet, uses `Task` as the general interface to implement the training and inference of models. Each `Task` is a class inherited from `AbsTask`, and its corresponding code can be found in `funasr/tasks/abs_task.py`. The main functions of `AbsTask` are as follows:
| | | ```python |
| | | class AbsTask(ABC): |
| | | @classmethod |
| | | def add_task_arguments(cls, parser: argparse.ArgumentParser): |
| | | pass |
| | | |
| | | @classmethod |
| | | def build_preprocess_fn(cls, args, train): |
| | | (...) |
| | | |
| | | @classmethod |
| | | def build_collate_fn(cls, args: argparse.Namespace): |
| | | (...) |
| | | |
| | | @classmethod |
| | | def build_model(cls, args): |
| | | (...) |
| | | |
| | | @classmethod |
| | | def main(cls, args): |
| | | (...) |
| | | ``` |
- `add_task_arguments`: add the parameters required by a specific `Task`
- `build_preprocess_fn`: define how to preprocess samples
- `build_collate_fn`: define how to combine multiple samples into a `batch`
- `build_model`: define the model
- `main`: the training entry point; training is started through `Task.main()`
| | | |
Next, we take speech recognition as an example to introduce how to define a new `Task`; for the corresponding code, please see `ASRTask` in `funasr/tasks/asr.py`. Defining a new `Task` is essentially the process of redefining the above functions according to the requirements of the specific task.
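To make the hook-based design concrete, here is a minimal sketch of defining a new task. The `AbsTask` below is a stub standing in for FunASR's real `funasr/tasks/abs_task.py` class (which also handles the training loop, checkpointing, etc.), and `ToyTask` is a hypothetical task, not part of FunASR:

```python
import argparse
from abc import ABC

# Stub standing in for funasr.tasks.abs_task.AbsTask (illustration only;
# the real class also implements the training loop and main()).
class AbsTask(ABC):
    @classmethod
    def add_task_arguments(cls, parser: argparse.ArgumentParser):
        pass

    @classmethod
    def build_model(cls, args):
        raise NotImplementedError

# A hypothetical new task: defining it means overriding the hooks above.
class ToyTask(AbsTask):
    @classmethod
    def add_task_arguments(cls, parser):
        group = parser.add_argument_group(description="Task related")
        group.add_argument("--token_list", type=str, default=None,
                           help="A text mapping int-id to token")

    @classmethod
    def build_model(cls, args):
        # A real task would build encoder/decoder modules here.
        return {"token_list": args.token_list}

parser = argparse.ArgumentParser()
ToyTask.add_task_arguments(parser)
args = parser.parse_args(["--token_list", "tokens.txt"])
model = ToyTask.build_model(args)
print(model["token_list"])  # tokens.txt
```

The same pattern scales to the real `ASRTask`: each classmethod stays independent, so tasks can override only the hooks they need.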
| | | |
| | | - add_task_arguments |
| | | ```python |
| | | @classmethod |
| | | def add_task_arguments(cls, parser: argparse.ArgumentParser): |
| | | group = parser.add_argument_group(description="Task related") |
| | | group.add_argument( |
| | | "--token_list", |
| | | type=str_or_none, |
| | | default=None, |
| | | help="A text mapping int-id to token", |
| | | ) |
| | | (...) |
| | | ``` |
For speech recognition tasks, the required task-specific parameters include `token_list`, etc. According to the specific requirements of different tasks, users can define the corresponding parameters in this function.
| | | |
| | | - build_preprocess_fn |
| | | ```python |
| | | @classmethod |
| | | def build_preprocess_fn(cls, args, train): |
| | | if args.use_preprocessor: |
| | | retval = CommonPreprocessor( |
| | | train=train, |
| | | token_type=args.token_type, |
| | | token_list=args.token_list, |
| | | bpemodel=args.bpemodel, |
| | | non_linguistic_symbols=args.non_linguistic_symbols, |
| | | text_cleaner=args.cleaner, |
| | | ... |
| | | ) |
| | | else: |
| | | retval = None |
| | | return retval |
| | | ``` |
This function defines how samples are preprocessed. Specifically, the input of a speech recognition task includes speech and text. For speech, optional operations such as adding noise and reverberation are supported. For text, optional operations such as BPE-based tokenization and mapping text to token ids are supported. Users can choose which preprocessing operations to apply to the samples; for the detailed implementation, please refer to `CommonPreprocessor`.
| | | |
| | | - build_collate_fn |
| | | ```python |
| | | @classmethod |
| | | def build_collate_fn(cls, args, train): |
| | | return CommonCollateFn(float_pad_value=0.0, int_pad_value=-1) |
| | | ``` |
This function defines how multiple samples are combined into a `batch`. For speech recognition tasks, `padding` is employed to obtain equal-length data from speech and text of different lengths. Specifically, we use `0.0` as the default padding value for speech and `-1` as the default padding value for text. Users can define different batching operations here; for the detailed implementation, please refer to `CommonCollateFn`.
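The padding behavior can be sketched in plain Python. This is a simplified stand-in for `CommonCollateFn` (which works on tensors), assuming `0.0` for float features and `-1` for integer token ids:

```python
def collate(speech_batch, text_batch, float_pad_value=0.0, int_pad_value=-1):
    """Pad variable-length samples so each list becomes a rectangular batch."""
    def pad(batch, value):
        width = max(len(seq) for seq in batch)
        return [list(seq) + [value] * (width - len(seq)) for seq in batch]
    return pad(speech_batch, float_pad_value), pad(text_batch, int_pad_value)

speech, text = collate([[0.1, 0.2, 0.3], [0.4]], [[5, 6], [7, 8, 9]])
print(speech)  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
print(text)    # [[5, 6, -1], [7, 8, 9]]
```

Padding text with `-1` (rather than a valid token id) lets the loss function mask out the padded positions.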
| | | |
| | | - build_model |
| | | ```python |
| | | @classmethod |
| | | def build_model(cls, args, train): |
| | | with open(args.token_list, encoding="utf-8") as f: |
| | | token_list = [line.rstrip() for line in f] |
| | | vocab_size = len(token_list) |
| | | frontend = frontend_class(**args.frontend_conf) |
| | | specaug = specaug_class(**args.specaug_conf) |
| | | normalize = normalize_class(**args.normalize_conf) |
| | | preencoder = preencoder_class(**args.preencoder_conf) |
| | | encoder = encoder_class(input_size=input_size, **args.encoder_conf) |
| | | postencoder = postencoder_class(input_size=encoder_output_size, **args.postencoder_conf) |
| | | decoder = decoder_class(vocab_size=vocab_size, encoder_output_size=encoder_output_size, **args.decoder_conf) |
| | | ctc = CTC(odim=vocab_size, encoder_output_size=encoder_output_size, **args.ctc_conf) |
| | | model = model_class( |
| | | vocab_size=vocab_size, |
| | | frontend=frontend, |
| | | specaug=specaug, |
| | | normalize=normalize, |
| | | preencoder=preencoder, |
| | | encoder=encoder, |
| | | postencoder=postencoder, |
| | | decoder=decoder, |
| | | ctc=ctc, |
| | | token_list=token_list, |
| | | **args.model_conf, |
| | | ) |
| | | return model |
| | | ``` |
This function defines the details of the model. Different speech recognition models can usually share the same speech recognition `Task`; all that remains is to define the specific model in this function. The example above shows a speech recognition model with a standard encoder-decoder structure: it first builds each module of the model, including the encoder, decoder, etc., and then combines these modules into a complete model. In FunASR, the model needs to inherit `AbsESPnetModel` (see `funasr/train/abs_espnet_model.py`); the main function to implement is `forward`.
| | |
| | | # Get Started |
Here we take "Training a paraformer model from scratch on the AISHELL-1 dataset" as an example to introduce how to use FunASR. Following this example, users can similarly employ other datasets (such as AISHELL-2) to train other models (such as conformer, transformer, etc.).
| | | |
| | | ## Overall Introduction |
We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on the AISHELL-1 dataset. This recipe consists of five stages, supporting training on multiple GPUs and decoding by CPU or GPU. Before introducing each stage in detail, we first explain several parameters which should be set by users.
- `CUDA_VISIBLE_DEVICES`: the visible GPU list
- `gpu_num`: the number of GPUs used for training
- `gpu_inference`: whether to use GPUs for decoding
- `njob`: for CPU decoding, the total number of CPU jobs; for GPU decoding, the number of jobs on each GPU
- `data_aishell`: the path of the raw AISHELL-1 dataset
- `nj`: the number of jobs for data preparation
- `speed_perturb`: the range of speed perturbation
- `feats_dir`: the path for saving processed data
- `exp_dir`: the path for saving experimental results
- `tag`: the suffix of the experimental result directory
| | | |
| | | ## Stage 0: Data preparation |
This stage processes the raw AISHELL-1 dataset `$data_aishell` and generates the corresponding `wav.scp` and `text` files in `$feats_dir/data/xxx`, where `xxx` means `train/dev/test`. Here we assume users have already downloaded the AISHELL-1 dataset. If not, users can download the data [here](https://www.openslr.org/33/) and set the path for `$data_aishell`. The examples of `wav.scp` and `text` are as follows:
* `wav.scp`
```
BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
...
```
* `text`
```
BAC009S0002W0124 自 六 月 底 呼 和 浩 特 市 率 先 宣 布 取 消 限 购 后
...
```
These two files both have two columns: the first column is the wav id and the second column is the corresponding wav path / label tokens.
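The two-column format described above can be parsed with a few lines of Python. This is a sketch for illustration (the truncated `/nfs/...` path below is a placeholder, and FunASR's own data loaders handle this internally):

```python
def read_scp(lines):
    """Parse two-column Kaldi-style files (wav.scp / text) into a dict:
    the first column is the utterance id, the rest of the line is the value."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        uid, value = line.split(maxsplit=1)
        entries[uid] = value
    return entries

wav_scp = read_scp(["BAC009S0002W0122 /nfs/.../BAC009S0002W0122.wav"])
text = read_scp(["BAC009S0002W0124 自 六 月 底"])
print(wav_scp["BAC009S0002W0122"])
print(text["BAC009S0002W0124"])  # 自 六 月 底
```

Note that `split(maxsplit=1)` keeps the spaces inside the label tokens, which matters for the `text` file.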
| | | |
| | | ## Stage 1: Feature Generation |
This stage extracts FBank features from `wav.scp` and applies speed perturbation as data augmentation according to `speed_perturb`. Users can set `nj` to control the number of jobs for feature generation. The generated features are saved in `$feats_dir/dump/xxx/ark` and the corresponding `feats.scp` files are saved as `$feats_dir/dump/xxx/feats.scp`. An example of `feats.scp` is as follows:
| | | * `feats.scp` |
| | | ``` |
| | | ... |
BAC009S0002W0122_sp0.9 /nfs/funasr_data/aishell-1/dump/fbank/train/ark/feats.16.ark:592751055
| | | ... |
| | | ``` |
Note that the samples in this file have already been randomly shuffled. This file contains two columns: the first column is the wav id and the second column is the kaldi-ark feature path. Besides, `speech_shape` and `text_shape` are also generated in this stage, denoting the speech feature shape and text length of each sample. Examples are shown as follows:
* `speech_shape`
```
...
```
* `text_shape`
```
...
BAC009S0002W0122_sp0.9 15
...
```
These two files have two columns: the first column is the wav id and the second column is the corresponding speech feature shape or text length.
| | | |
| | | ## Stage 2: Dictionary Preparation |
This stage prepares the dictionary, which is used as a mapping between label characters and integer indices during ASR training. The generated dictionary file is saved as `$feats_dir/data/$lang_token_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows:
* `tokens.txt`
```
<blank>
...
```
* `<unk>`: indicates the out-of-vocabulary token
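Such a `tokens.txt` can be turned into the token-to-id mapping used during training with a few lines. This is a sketch under the usual convention that the line number is the id (the real mapping is built inside the task code; the three example tokens are illustrative):

```python
def load_token_list(lines):
    """Map each token to an integer id: the line number is the id."""
    tokens = [line.rstrip("\n") for line in lines if line.strip()]
    return {tok: idx for idx, tok in enumerate(tokens)}

token2id = load_token_list(["<blank>\n", "<unk>\n", "好\n"])
print(token2id["<blank>"])  # 0
print(token2id["好"])       # 2
```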
| | | |
| | | ## Stage 3: Training |
This stage performs the training of the specified model. To start training, users should manually set `exp_dir`, `CUDA_VISIBLE_DEVICES` and `gpu_num`, which have been explained above. By default, the best `$keep_nbest_models` checkpoints on the validation set will be averaged to generate a better model, which is then adopted for decoding.
| | | |
| | | * DDP Training |
| | | |
| | |
| | | |
| | | * DataLoader |
| | | |
| | | [comment]: <> (We support two types of DataLoaders for small and large datasets, respectively. By default, the small DataLoader is used and you can set `dataset_type=large` to enable large DataLoader. For small DataLoader, ) |
We support an optional iterable-style DataLoader based on [PyTorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large datasets; users can set `dataset_type=large` to enable it.
| | | |
| | | * Configuration |
| | | |
The training parameters, including model, optimization, dataset, etc., can be set via a YAML file in the `conf` directory. Users can also set the parameters directly in the `run.sh` recipe. Please avoid setting the same parameter in both the YAML file and the recipe.
| | | |
| | | * Training Steps |
| | | |
We support two parameters to control the length of training, namely `max_epoch` and `max_update`. `max_epoch` indicates the total number of training epochs while `max_update` indicates the total number of training steps. If both parameters are specified, training stops as soon as either limit is reached.
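The stopping rule can be sketched as follows (hypothetical function and variable names; the actual check lives inside the trainer):

```python
def should_stop(epoch, update, max_epoch=None, max_update=None):
    """Stop as soon as either limit is reached, whichever comes first."""
    if max_epoch is not None and epoch >= max_epoch:
        return True
    if max_update is not None and update >= max_update:
        return True
    return False

print(should_stop(epoch=10, update=5000, max_epoch=50, max_update=5000))  # True
print(should_stop(epoch=10, update=4999, max_epoch=50, max_update=5000))  # False
```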
| | | |
| | | * Tensorboard |
| | | |
Users can use TensorBoard to observe the loss, learning rate, etc. Please run the following command:
| | | ``` |
| | | tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train |
| | | ``` |
| | | |
| | | ## Stage 4: Decoding |
This stage generates the recognition results with acoustic features as input and calculates the `CER` to verify the performance of the trained model.
| | | |
| | | * Mode Selection |
| | | |
As FunASR supports paraformer, uniasr, conformer and other models with different inference interfaces, a `mode` parameter should be specified as `asr/paraformer/uniasr` according to the trained model.
| | | |
| | | * Configuration |
| | | |
| | |
.. toctree::
   :maxdepth: 1

   ./installation.md
   ./papers.md
   ./get_started.md
   ./build_task.md
| | |
| | | # Installation |
FunASR is easy to install and is mainly distributed as Python packages. The detailed installation steps are as follows:
| | | |
| | | |
- Install Conda and create a virtual environment:
```sh
| | | wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh |
| | | sh Miniconda3-latest-Linux-x86_64.sh |
| | | source ~/.bashrc |
| | | conda create -n funasr python=3.7 |
| | | conda activate funasr |
| | | ``` |
| | | |
- Install PyTorch (version >= 1.7.0):

| cuda | command |
|:-----:| --- |
| 9.2 | conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch |
| 10.2 | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch |
| 11.1 | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch |

Alternatively, install via pip:
```sh
pip install torch torchaudio
```
| | | |
| | | For more versions, please see [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally) |
| | | |
| | | - Install ModelScope |
| | | |
For users in China, you can configure the following mirror source to speed up downloading:
| | | ``` sh |
| | | pip config set global.index-url https://mirror.sjtu.edu.cn/pypi/web/simple |
| | | ``` |
| | | Install or update ModelScope |
| | | ```sh |
| | | pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html |
| | | ``` |
| | | |
- Clone the repo and install other packages:
| | | ``` sh |
| | | git clone https://github.com/alibaba/FunASR.git && cd FunASR |
| | | pip install --editable ./ |
| | | ``` |
| New file |
| | |
# Quick Start with ModelScope
ModelScope is an open-source model-as-a-service platform supported by Alibaba, which provides flexible and convenient model usage for users in academia and industry. For specific usages and open-source models, please refer to [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition). In the speech domain, we provide autoregressive/non-autoregressive speech recognition, speech pre-training, punctuation prediction and other models for users.
| | | |
| | | ## Overall Introduction |
We provide the usages of different models under the `egs_modelscope` directory, which supports directly employing our provided models for inference, as well as finetuning our provided models as pre-trained initial models. Next, we introduce the files provided in the `egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch` directory, including `infer.py`, `finetune.py` and `infer_after_finetune.py`. The corresponding functions are as follows:
| | | - `infer.py`: perform inference on the specified dataset based on our provided model |
- `finetune.py`: employ our provided model as the initial model for finetuning
| | | - `infer_after_finetune.py`: perform inference on the specified dataset based on the finetuned model |
| | | |
| | | ## Inference |
We provide `infer.py` to achieve inference. Based on this file, users can perform inference on the specified dataset with our provided model and obtain the corresponding recognition results. If the transcript is given, the `CER` will be calculated at the same time. Before performing inference, users can set the following parameters to modify the inference configuration:
* `data_dir`: the dataset directory. The directory should contain the wav list file `wav.scp` and the transcript file `text` (optional). For the format of these two files, please refer to the instructions in [Quick Start](./get_started.md). If the `text` file exists, the CER will be calculated accordingly; otherwise it will be skipped.
* `output_dir`: the directory for saving the inference results
* `batch_size`: the batch size during inference
* `ctc_weight`: some models contain a CTC module; users can set this parameter to specify the weight of the CTC module during inference
| | | |
| | | In addition to directly setting parameters in `infer.py`, users can also manually set the parameters in the `decoding.yaml` file in the model download directory to modify the inference configuration. |
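For intuition on `ctc_weight`: joint CTC/attention decoding typically interpolates the two scores. The sketch below shows the usual interpolation formula, not FunASR's exact decoding code:

```python
def joint_score(att_score, ctc_score, ctc_weight=0.3):
    """Interpolate attention and CTC log-probabilities for one hypothesis."""
    return (1.0 - ctc_weight) * att_score + ctc_weight * ctc_score

print(joint_score(att_score=-2.0, ctc_score=-4.0, ctc_weight=0.5))  # -3.0
```

With `ctc_weight=0.0` only the attention decoder is used; with `1.0` only the CTC score matters.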
| | | |
| | | ## Finetuning |
We provide `finetune.py` to achieve finetuning. Based on this file, users can finetune our provided model as the initial model on the specified dataset to achieve better performance in a specific domain. Before finetuning, users can set the following parameters to modify the finetuning configuration:
* `data_path`: the dataset directory. This directory should contain the `train` directory for the training set and the `dev` directory for the validation set. Each directory needs to contain the wav list file `wav.scp` and the transcript file `text`
* `output_dir`: the directory for saving the finetuning results
* `dataset_type`: for small datasets, set it as `small`; for datasets larger than 1000 hours, set it as `large`
* `batch_bins`: the batch size; if `dataset_type` is set as `small`, the unit of `batch_bins` is the number of fbank feature frames; if `dataset_type` is set as `large`, the unit of `batch_bins` is milliseconds
* `max_epoch`: the maximum number of training epochs
| | | |
The following parameters can also be set. If there is no special requirement, users can ignore them and directly use the default values we provide:
* `accum_grad`: the number of gradient accumulation steps
* `keep_nbest_models`: select the `keep_nbest_models` best-performing checkpoints and average their parameters to get a better model
* `optim`: set the optimizer
* `lr`: set the learning rate
* `scheduler`: set the learning rate adjustment strategy
* `scheduler_conf`: set the related parameters of the learning rate adjustment strategy
* `specaug`: set the spectral augmentation
* `specaug_conf`: set the related parameters of the spectral augmentation
| | | |
| | | In addition to directly setting parameters in `finetune.py`, users can also manually set the parameters in the `finetune.yaml` file in the model download directory to modify the finetuning configuration. |
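The checkpoint averaging performed for `keep_nbest_models` can be sketched as follows, with plain dicts of floats standing in for PyTorch state dicts (the real implementation averages tensors):

```python
def average_checkpoints(state_dicts):
    """Average parameters elementwise across the n best checkpoints."""
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}

avg = average_checkpoints([{"w": 1.0, "b": 0.0},
                           {"w": 3.0, "b": 2.0}])
print(avg)  # {'w': 2.0, 'b': 1.0}
```

Averaging the best checkpoints usually smooths out training noise and gives a slightly better model than any single checkpoint.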
| | | |
| | | ## Inference after Finetuning |
We provide `infer_after_finetune.py` to achieve inference based on the model finetuned by users. Based on this file, users can perform inference on the specified dataset with the finetuned model and obtain the corresponding recognition results. If the transcript is given, the `CER` will be calculated at the same time. Before performing inference, users can set the following parameters to modify the inference configuration:
* `data_dir`: the dataset directory. The directory should contain the wav list file `wav.scp` and the transcript file `text` (optional). If the `text` file exists, the CER will be calculated accordingly; otherwise it will be skipped.
* `output_dir`: the directory for saving the inference results
* `batch_size`: the batch size during inference
* `ctc_weight`: some models contain a CTC module; users can set this parameter to specify the weight of the CTC module during inference
* `decoding_model_name`: the name of the model used for inference
| | | |
The following parameters can also be set. If there is no special requirement, users can ignore them and directly use the default values we provide:
* `modelscope_model_name`: the name of the initial model used for finetuning
* `required_files`: the files required for inference when using the ModelScope interface
| | | |
| | | ## Announcements |
| | | Some models may have other unique parameters during the finetuning and inference. The specific usages of these parameters can be found in the `README.md` file in the corresponding directory. |
| New file |
| | |
# Build custom tasks
FunASR, similar to ESPnet, uses `Task` as the general interface to implement the training and inference of models. Each `Task` is a class inheriting from `AbsTask`, and its corresponding code can be found in `funasr/tasks/abs_task.py`. The main functions of `AbsTask` are as follows:
| | | ```python |
| | | class AbsTask(ABC): |
| | | @classmethod |
| | | def add_task_arguments(cls, parser: argparse.ArgumentParser): |
| | | pass |
| | | |
| | | @classmethod |
| | | def build_preprocess_fn(cls, args, train): |
| | | (...) |
| | | |
| | | @classmethod |
| | | def build_collate_fn(cls, args: argparse.Namespace): |
| | | (...) |
| | | |
| | | @classmethod |
| | | def build_model(cls, args): |
| | | (...) |
| | | |
| | | @classmethod |
| | | def main(cls, args): |
| | | (...) |
| | | ``` |
- `add_task_arguments`: add the parameters required by a specific `Task`
- `build_preprocess_fn`: define how to preprocess samples
- `build_collate_fn`: define how to combine multiple samples into a `batch`
- `build_model`: define the model
- `main`: the training entry point; training is started through `Task.main()`
| | | |
Next, we take speech recognition as an example to introduce how to define a new `Task`; for the corresponding code, please see `ASRTask` in `funasr/tasks/asr.py`. Defining a new `Task` is essentially the process of redefining the above functions according to the requirements of the specific task.
| | | - add_task_arguments |
| | | ```python |
| | | @classmethod |
| | | def add_task_arguments(cls, parser: argparse.ArgumentParser): |
| | | group = parser.add_argument_group(description="Task related") |
| | | group.add_argument( |
| | | "--token_list", |
| | | type=str_or_none, |
| | | default=None, |
| | | help="A text mapping int-id to token", |
| | | ) |
| | | (...) |
| | | ``` |
For speech recognition tasks, the required task-specific parameters include `token_list`, etc. According to the specific requirements of different tasks, users can define the corresponding parameters in this function.
| | | |
| | | - build_preprocess_fn |
| | | ```python |
| | | @classmethod |
| | | def build_preprocess_fn(cls, args, train): |
| | | if args.use_preprocessor: |
| | | retval = CommonPreprocessor( |
| | | train=train, |
| | | token_type=args.token_type, |
| | | token_list=args.token_list, |
| | | bpemodel=args.bpemodel, |
| | | non_linguistic_symbols=args.non_linguistic_symbols, |
| | | text_cleaner=args.cleaner, |
| | | ... |
| | | ) |
| | | else: |
| | | retval = None |
| | | return retval |
| | | ``` |
This function defines how samples are preprocessed. Specifically, the input of a speech recognition task includes speech and text. For speech, optional operations such as adding noise and reverberation are supported. For text, optional operations such as BPE-based tokenization and mapping text to token ids are supported. Users can choose which preprocessing operations to apply to the samples; for the detailed implementation, please refer to `CommonPreprocessor`.
| | | |
| | | - build_collate_fn |
| | | ```python |
| | | @classmethod |
| | | def build_collate_fn(cls, args, train): |
| | | return CommonCollateFn(float_pad_value=0.0, int_pad_value=-1) |
| | | ``` |
This function defines how multiple samples are combined into a `batch`. For speech recognition tasks, `padding` is employed to obtain equal-length data from speech and text of different lengths. Specifically, we use `0.0` as the default padding value for speech and `-1` as the default padding value for text. Users can define different batching operations here; for the detailed implementation, please refer to `CommonCollateFn`.
| | | |
| | | - build_model |
| | | ```python |
| | | @classmethod |
| | | def build_model(cls, args, train): |
| | | with open(args.token_list, encoding="utf-8") as f: |
| | | token_list = [line.rstrip() for line in f] |
| | | vocab_size = len(token_list) |
| | | frontend = frontend_class(**args.frontend_conf) |
| | | specaug = specaug_class(**args.specaug_conf) |
| | | normalize = normalize_class(**args.normalize_conf) |
| | | preencoder = preencoder_class(**args.preencoder_conf) |
| | | encoder = encoder_class(input_size=input_size, **args.encoder_conf) |
| | | postencoder = postencoder_class(input_size=encoder_output_size, **args.postencoder_conf) |
| | | decoder = decoder_class(vocab_size=vocab_size, encoder_output_size=encoder_output_size, **args.decoder_conf) |
| | | ctc = CTC(odim=vocab_size, encoder_output_size=encoder_output_size, **args.ctc_conf) |
| | | model = model_class( |
| | | vocab_size=vocab_size, |
| | | frontend=frontend, |
| | | specaug=specaug, |
| | | normalize=normalize, |
| | | preencoder=preencoder, |
| | | encoder=encoder, |
| | | postencoder=postencoder, |
| | | decoder=decoder, |
| | | ctc=ctc, |
| | | token_list=token_list, |
| | | **args.model_conf, |
| | | ) |
| | | return model |
| | | ``` |
This function defines the details of the model. Different speech recognition models can usually share the same speech recognition `Task`; all that remains is to define the specific model in this function. The example above shows a speech recognition model with a standard encoder-decoder structure: it first builds each module of the model, including the encoder, decoder, etc., and then combines these modules into a complete model. In FunASR, the model needs to inherit `AbsESPnetModel` (see `funasr/train/abs_espnet_model.py`); the main function to implement is `forward`.
| | |
This stage decodes to obtain the recognition results and calculates the CER to verify the performance of the trained model.
| | | |
* Mode Selection

Since we provide paraformer, uniasr, conformer and other models, the corresponding decoding mode needs to be specified at decoding time. The parameter is `mode`, and the available settings include `asr/paraformer/uniasr`, etc.
| | | |
| | | * Configuration |
| | | |
| | |
.. toctree::
   :maxdepth: 1

   ./installation.md
   ./papers.md
   ./get_started.md
   ./build_task.md
| | |
ModelScope is an open-source model-as-a-service platform launched by Alibaba, which provides flexible and convenient model usage for users in academia and industry. For specific usages and open-source models, please refer to [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition). In the speech domain, we provide autoregressive/non-autoregressive speech recognition, speech pre-training, punctuation prediction and other models for users.
| | | |
## Overall Introduction
We provide the usages of different models under the `egs_modelscope` directory, which supports directly employing our provided models for inference, as well as finetuning our provided models as pre-trained initial models. Next, we introduce the files provided in the `egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch` directory, including `infer.py`, `finetune.py` and `infer_after_finetune.py`. The corresponding functions are as follows:
- `infer.py`: perform inference on the specified dataset based on our provided model
- `finetune.py`: employ our provided model as the initial model for finetuning
- `infer_after_finetune.py`: perform inference on the specified dataset based on the finetuned model
| | | |
## Inference
We provide `infer.py` to achieve inference. Based on this file, users can perform inference on the specified dataset with our provided model and obtain the corresponding recognition results. If the transcript is given, the `CER` will be calculated at the same time. Before performing inference, users can set the following parameters to modify the inference configuration:
* `data_dir`: the dataset directory. The directory should contain the wav list file `wav.scp` and the transcript file `text` (optional). For the format of these two files, please refer to the instructions in [Quick Start](./get_started.md). If the `text` file exists, the CER will be calculated accordingly; otherwise it will be skipped.
* `output_dir`: the directory for saving the inference results
* `batch_size`: the batch size during inference
| | |
* `data_path`: the dataset directory. This directory should contain the `train` directory for the training set and the `dev` directory for the validation set. Each directory needs to contain the wav list file `wav.scp` and the transcript file `text`
* `output_dir`: the directory for saving the finetuning results
* `dataset_type`: for small datasets, set it as `small`; for datasets larger than 1000 hours, set it as `large`
* `batch_bins`: the batch size; if `dataset_type` is set as `small`, the unit of `batch_bins` is the number of fbank feature frames; if `dataset_type` is set as `large`, the unit of `batch_bins` is milliseconds
* `max_epoch`: the maximum number of training epochs
| | | |
The following parameters can also be set. If there is no special requirement, users can ignore them and directly use the default values we provide:
* `accum_grad`: the number of gradient accumulation steps
* `keep_nbest_models`: select the `keep_nbest_models` best-performing checkpoints and average their parameters to get a better model
* `optim`: set the optimizer
* `lr`: set the learning rate
* `scheduler`: set the learning rate adjustment strategy
* `scheduler_conf`: set the related parameters of the learning rate adjustment strategy
* `specaug`: set the spectral augmentation
| | |
In addition to directly setting parameters in `finetune.py`, users can also manually modify the parameters in the `finetune.yaml` file in the model download directory to modify the finetuning configuration.
| | | |
## Inference after Finetuning
We provide `infer_after_finetune.py` to achieve inference based on the model finetuned by users. Based on this file, users can perform inference on the specified dataset with the finetuned model and obtain the corresponding recognition results. If the transcript is given, the `CER` will be calculated at the same time. Before performing inference, users can set the following parameters to modify the inference configuration:
* `data_dir`: the dataset directory. The directory should contain the wav list file `wav.scp` and the transcript file `text` (optional). If the `text` file exists, the CER will be calculated accordingly; otherwise it will be skipped.
* `output_dir`: the directory for saving the inference results
* `batch_size`: the batch size during inference
| | |
* `decoding_model_name`: the name of the model used for inference
| | | |
The following parameters can also be set. If there is no special requirement, users can ignore them and directly use the default values we provide:
* `modelscope_model_name`: the name of the initial model used for finetuning
* `required_files`: the files required for inference when using the ModelScope interface
| | | |
## Announcements
Some models may have unique parameters during finetuning and inference. The specific usages of these parameters can be found in the `README.md` file in the corresponding directory.