python/FunASR-XL.git

parent: 2a38e81a

Merge pull request #552 from alibaba-damo-academy/dev_wjm2

hnluo

2023-05-25 e0a8c4b00631ed636418f4280964e473f05d5002

update ASR recipe

20 files changed

2 files added

docs/academic_recipe/asr_recipe.md | 106
egs/aishell/conformer/run.sh | 21
egs/aishell/data2vec_paraformer_finetune/run.sh | 22
egs/aishell/data2vec_transformer_finetune/run.sh | 21
egs/aishell/paraformer/run.sh | 23
egs/aishell/paraformerbert/run.sh | 22
egs/aishell/transformer/run.sh | 21
egs/aishell/transformer/utils/compute_cmvn.py | 24
egs/aishell/transformer/utils/compute_cmvn.sh | 13
egs/aishell/transformer/utils/gen_modelscope_configuration.py | 118
egs/aishell2/conformer/run.sh | 20
egs/aishell2/data2vec_pretrain/run.sh | 4
egs/aishell2/paraformer/run.sh | 22
egs/aishell2/paraformerbert/run.sh | 22
egs/aishell2/transformer/run.sh | 21
egs/aishell2/transformer/utils/compute_cmvn.sh | 2
egs/aishell2/transformer/utils/gen_modelscope_configuration.py | 118
egs/aishell2/transformerLM/conf/train_lm_transformer.yaml | 2
egs/librispeech/conformer/run.sh | 6
egs/librispeech_100h/conformer/run.sh | 6
funasr/bin/train.py | 8
funasr/models/e2e_asr_paraformer.py | 2

 docs/academic_recipe/asr_recipe.md

@@ -8,25 +8,66 @@
```sh
cd egs/aishell/paraformer
```

Then you can directly start the recipe as follows:
```sh
conda activate funasr
. ./run.sh
```
The training log files are saved in `exp/*_train_*/log/train.log.*` and the inference results are saved in `exp/*_train_*/decode_asr_*`.

The training log files are saved in `${exp_dir}/exp/${model_dir}/log/train.log.*`, which can be viewed using the following command:
```sh
vim exp/*_train_*/log/train.log.0
```

Users can observe the training loss, prediction accuracy, and other training information, as follows:
```text
... 1epoch:train:751-800batch:800num_updates: ... loss_ctc=106.703, loss_att=86.877, acc=0.029, loss_pre=1.552 ...
... 1epoch:train:801-850batch:850num_updates: ... loss_ctc=107.890, loss_att=87.832, acc=0.029, loss_pre=1.702 ...
```
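
To track these values outside of a text editor, the `name=value` pairs can be pulled out of each log line. The following is a minimal sketch, assuming the log lines keep the format shown in the excerpt above; `parse_metrics` is a hypothetical helper, not part of FunASR.

```python
import re

# Matches "name=value" pairs such as "loss_ctc=106.703".
# The metric names themselves come from the log excerpt above.
METRIC_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def parse_metrics(line):
    """Return a dict mapping metric name to float for one log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


line = (
    "... 1epoch:train:751-800batch:800num_updates: ... "
    "loss_ctc=106.703, loss_att=86.877, acc=0.029, loss_pre=1.552 ..."
)
metrics = parse_metrics(line)
print(metrics["loss_ctc"], metrics["acc"])
```

Applying the helper to every line of `train.log.0` gives a per-interval series that can be plotted or compared across runs.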

Users can also use TensorBoard to monitor this training information with the following command:
```sh
tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
```

At the end of each epoch, the evaluation metrics are calculated on the validation set, as follows:
```text
... [valid] loss_ctc=99.914, cer_ctc=1.000, loss_att=80.512, acc=0.029, cer=0.971, wer=1.000, loss_pre=1.952, loss=88.285 ...
```

The inference results are saved in `${exp_dir}/exp/${model_dir}/decode_asr_*/$dset`. The two main files are `text.cer` and `text.cer.txt`. `text.cer` records the comparison between the recognized text and the reference text, as follows:
```text
...
BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
ref:    构 建 良 好 的 旅 游 市 场 环 境
res:    构 建 良 好 的 旅 游 市 场 环 境
...
```
`text.cer.txt` saves the final results, as follows:
```text
%WER ...
%SER ...
Scored ... sentences, ...
```
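
The header line of each `text.cer` entry carries the raw edit counts, so the per-utterance CER can be recomputed from it. The sketch below assumes the usual definition, CER = (ins + del + sub) / nwords, and the header format shown above; `cer_from_header` is a hypothetical helper, and the scoring script used by the recipe may differ in detail.

```python
import re

# Header format taken from the text.cer excerpt above:
# <utt_id>(nwords=..,cor=..,ins=..,del=..,sub=..) corr=..%,cer=..%
HEADER_RE = re.compile(
    r"^([^\s(]+)\(nwords=(\d+),cor=(\d+),ins=(\d+),del=(\d+),sub=(\d+)\)"
)


def cer_from_header(line):
    """Return (utt_id, CER in percent) recomputed from the edit counts."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("not a text.cer header line: %r" % line)
    utt_id = match.group(1)
    nwords, _cor, ins, dele, sub = (int(x) for x in match.groups()[1:])
    # Assumed definition: errors divided by reference length.
    cer = 100.0 * (ins + dele + sub) / nwords if nwords else 0.0
    return utt_id, cer


header = "BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%"
print(cer_from_header(header))
```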

## Introduction
We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on AISHELL-1 dataset. This recipe consists of five stages, supporting training on multiple GPUs and decoding by CPU or GPU. Before introducing each stage in detail, we first explain several parameters which should be set by users.
- `CUDA_VISIBLE_DEVICES`: visible gpu list
- `gpu_num`: the number of GPUs used for training
- `gpu_inference`: whether to use GPUs for decoding
- `njob`: for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
- `CUDA_VISIBLE_DEVICES`: `0,1` (Default), visible gpu list
- `gpu_num`: `2` (Default), the number of GPUs used for training
- `gpu_inference`: `true` (Default), whether to use GPUs for decoding
- `njob`: `1` (Default), for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
- `raw_data`: the raw path of AISHELL-1 dataset
- `feats_dir`: the path for saving processed data
- `nj`: the number of jobs for data preparation
- `speed_perturb`: the range of speech perturbed
- `token_type`: `char` (Default), indicate how to process text
- `type`: `sound` (Default), set the input type
- `scp`: `wav.scp` (Default), set the input file
- `nj`: `64` (Default), the number of jobs for data preparation
- `speed_perturb`: `"0.9, 1.0 ,1.1"` (Default), the speed perturbation factors applied to the speech
- `exp_dir`: the path for saving experimental results
- `tag`: the suffix of experimental result directory
- `tag`: `exp1` (Default), the suffix of experimental result directory
- `stage`: `0` (Default), start the recipe from the specified stage
- `stop_stage`: `5` (Default), stop the recipe at the specified stage

### Stage 0: Data preparation
This stage processes raw AISHELL-1 dataset `$raw_data` and generates the corresponding `wav.scp` and `text` in `$feats_dir/data/xxx`. `xxx` means `train/dev/test`. Here we assume users have already downloaded AISHELL-1 dataset. If not, users can download data [here](https://www.openslr.org/33/) and set the path for `$raw_data`. The examples of `wav.scp` and `text` are as follows:
@@ -47,11 +88,10 @@
These two files both have two columns, while the first column is wav ids and the second column is the corresponding wav paths/label tokens.

### Stage 1: Feature and CMVN Generation
This stage computes CMVN based on `train` dataset, which is used in the following stages. Users can set `nj` to control the number of jobs for computing CMVN. The generated CMVN file is saved as `$feats_dir/data/train/cmvn/cmvn.mvn`.
This stage computes CMVN based on `train` dataset, which is used in the following stages. Users can set `nj` to control the number of jobs for computing CMVN. The generated CMVN file is saved as `$feats_dir/data/train/cmvn/am.mvn`.
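
As a rough illustration of what this stage computes, global mean and variance can be derived from running sums of the features and their squares, which is the same kind of accumulation that `compute_cmvn.py` performs per job. The sketch below uses plain Python lists in place of NumPy arrays and tiny made-up "fbank" frames; the function names are hypothetical, and how the recipe combines per-job statistics into `am.mvn` is not shown here.

```python
def accumulate(frames, mean_stats, var_stats):
    """Add per-dimension sums and sums of squares; return the frame count."""
    for frame in frames:
        for d, x in enumerate(frame):
            mean_stats[d] += x
            var_stats[d] += x * x
    return len(frames)


def finalize(mean_stats, var_stats, total_frames):
    """Turn accumulated sums into per-dimension mean and variance."""
    means = [s / total_frames for s in mean_stats]
    # Var[x] = E[x^2] - (E[x])^2
    variances = [q / total_frames - m * m for q, m in zip(var_stats, means)]
    return means, variances


dim = 2  # toy feature dimension; the recipes use feats_dim (80 by default)
mean_stats = [0.0] * dim
var_stats = [0.0] * dim
total_frames = 0

# Two made-up utterances with 2 and 1 frames respectively.
utterances = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
for frames in utterances:
    total_frames += accumulate(frames, mean_stats, var_stats)

means, variances = finalize(mean_stats, var_stats, total_frames)
print(means, variances)
```

Because only sums are accumulated, the statistics from several parallel jobs can be added together before the final division, which is what makes the `nj` split possible.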

### Stage 2: Dictionary Preparation
This stage processes the dictionary, which is used as a mapping between label characters and integer indices during ASR training. The processed dictionary file is saved as `$feats_dir/data/${lang}_token_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows:
* `tokens.txt`
```
<blank>
<s>
@@ -63,38 +103,26 @@
龟
<unk>
```
* `<blank>`: indicates the blank token for CTC
* `<s>`: indicates the start-of-sentence token
* `</s>`: indicates the end-of-sentence token
* `<unk>`: indicates the out-of-vocabulary token
* `<blank>`: indicates the blank token for CTC, must be in the first line
* `<s>`: indicates the start-of-sentence token, must be in the second line
* `</s>`: indicates the end-of-sentence token, must be in the third line
* `<unk>`: indicates the out-of-vocabulary token, must be in the last line
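
The position requirements above are easy to verify before training. The following is a small sketch of such a check; `check_token_order` is a hypothetical helper and is not provided by FunASR.

```python
def check_token_order(tokens):
    """Return a list of problems with the special-token positions (empty if OK)."""
    problems = []
    # <blank>, <s>, </s> must occupy the first three lines, in this order.
    for index, expected in enumerate(["<blank>", "<s>", "</s>"]):
        if len(tokens) <= index or tokens[index] != expected:
            problems.append("line %d should be %s" % (index + 1, expected))
    # <unk> must be the last line.
    if not tokens or tokens[-1] != "<unk>":
        problems.append("last line should be <unk>")
    return problems


good = ["<blank>", "<s>", "</s>", "龟", "<unk>"]
bad = ["<s>", "<blank>", "</s>", "<unk>", "龟"]
print(check_token_order(good))
print(check_token_order(bad))
```

In practice the list would be read from `tokens.txt`, one token per line, for example with `[line.rstrip("\n") for line in open(path, encoding="utf-8")]`.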

### Stage 3: LM Training

### Stage 4: ASR Training
This stage achieves the training of the specified model. To start training, users should manually set `exp_dir`, `CUDA_VISIBLE_DEVICES` and `gpu_num`, which have already been explained above. By default, the best `$keep_nbest_models` checkpoints on validation dataset will be averaged to generate a better model and adopted for decoding.
This stage trains the specified model. To start training, users should manually set `exp_dir` to specify the path for saving experimental results. By default, the best `$keep_nbest_models` checkpoints on the validation dataset will be averaged to generate a better model, which is then adopted for decoding. FunASR implements `train.py` for training different models, and users can configure the following parameters if necessary.

* DDP Training

We support the DistributedDataParallel (DDP) training and the detail can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). To enable DDP training, please set `gpu_num` greater than 1. For example, if you set `CUDA_VISIBLE_DEVICES=0,1,5,6,7` and `gpu_num=3`, then the gpus with ids 0, 1 and 5 will be used for training.

* DataLoader

We support an optional iterable-style DataLoader based on [Pytorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large dataset and users can set `dataset_type=large` to enable it. 

* Configuration

The parameters of the training, including model, optimization, dataset, etc., can be set by a YAML file in `conf` directory. Also, users can directly set the parameters in `run.sh` recipe. Please avoid to set the same parameters in both the YAML file and the recipe.

* Training Steps

We support two parameters to specify the training steps, namely `max_epoch` and `max_update`. `max_epoch` indicates the total training epochs while `max_update` indicates the total training steps. If these two parameters are specified at the same time, once the training reaches any one of these two parameters, the training will be stopped.

* Tensorboard

Users can use tensorboard to observe the loss, learning rate, etc. Please run the following command:
```
tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
```
* `task_name`: `asr` (Default), specify the task type of the current recipe
* `gpu_num`: `2` (Default), specify the number of GPUs for training. When `gpu_num > 1`, DistributedDataParallel (DDP, the detail can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)) training will be enabled. Correspondingly, `CUDA_VISIBLE_DEVICES` should be set to specify which ids of GPUs will be used.
* `use_preprocessor`: `true` (Default), specify whether to use pre-processing on each sample
* `token_list`: the path of token list for training
* `dataset_type`: `small` (Default). FunASR supports `small` dataset type for training small datasets. Besides, an optional iterable-style DataLoader based on [Pytorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large datasets is supported and users can specify `dataset_type=large` to enable it.
* `data_dir`: the path of data. Specifically, the data for training is saved in `$data_dir/data/$train_set` while the data for validation is saved in `$data_dir/data/$valid_set`
* `data_file_names`: `"wav.scp,text"` specify the speech and text file names for ASR
* `cmvn_file`: the path of cmvn file
* `resume`: `true`, whether to enable "checkpoint training"
* `config`: the path of configuration file, which is usually a YAML file in `conf` directory. In FunASR, the parameters of the training, including model, optimization, dataset, etc., can also be set in this file. Note that if the same parameters are specified in both recipe and config file, the parameters of recipe will be employed
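
The precedence rule described for `config` (a value set in the recipe wins over the same value in the YAML file) can be pictured as a dictionary merge. This is only an illustration of the rule under that assumption, not the actual logic in `train.py`; the keys and values are made up.

```python
def merge_config(yaml_config, recipe_args):
    """Overlay recipe arguments on top of values loaded from the YAML file."""
    merged = dict(yaml_config)
    # Only arguments that were actually given on the command line override.
    merged.update({k: v for k, v in recipe_args.items() if v is not None})
    return merged


# Hypothetical values, for illustration only.
yaml_config = {"max_epoch": 50, "dataset_type": "small"}
recipe_args = {"dataset_type": "large", "max_epoch": None}

merged = merge_config(yaml_config, recipe_args)
print(merged)
```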

### Stage 5: Decoding
This stage generates the recognition results and calculates the `CER` to verify the performance of the trained model. 
@@ -114,7 +142,6 @@
* Performance

We adopt `CER` to verify the performance. The results are in `$exp_dir/exp/$model_dir/$decoding_yaml_name/$average_model_name/$dset`, namely `text.cer` and `text.cer.txt`. `text.cer` saves the comparison between the recognized text and the reference text while `text.cer.txt` saves the final `CER` results. The following is an example of `text.cer`:
* `text.cer`
```
...
BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
@@ -140,6 +167,9 @@
. ./run.sh --stage 3 --stop_stage 5
```

* Training Steps
FunASR supports two parameters to specify the training steps, namely `max_epoch` and `max_update`. `max_epoch` indicates the total training epochs while `max_update` indicates the total training steps. If these two parameters are specified at the same time, once the training reaches any one of these two parameters, the training will be stopped.
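
The "whichever limit is reached first" behaviour can be written as a simple predicate. The sketch below assumes `epoch` counts completed epochs and `num_updates` counts completed optimizer steps; `should_stop` is a hypothetical name, not a FunASR function.

```python
def should_stop(epoch, num_updates, max_epoch=None, max_update=None):
    """Return True once either configured limit has been reached."""
    if max_epoch is not None and epoch >= max_epoch:
        return True
    if max_update is not None and num_updates >= max_update:
        return True
    return False


# With both limits set, the update limit triggers first in this example.
print(should_stop(epoch=3, num_updates=10000, max_epoch=50, max_update=10000))
```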

* Change the configuration of the model

The configuration of the model is set in the config file `conf/train_*.yaml`. Specifically, the default encoder configuration of paraformer is as follows:

 egs/aishell/conformer/run.sh

@@ -85,7 +85,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -136,7 +136,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -186,7 +186,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -207,4 +207,19 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode asr \
        --model_name conformer \
        --dataset aishell \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --tag $tag
fi

 egs/aishell/data2vec_paraformer_finetune/run.sh

@@ -88,7 +88,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -141,7 +141,7 @@
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --init_param ${init_param} \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
                --config $asr_config \
@@ -190,7 +190,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
@@ -212,4 +212,20 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode paraformer \
        --model_name data2vec_finetune_paraformer \
        --dataset aishell \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --nat _nat \
        --tag $tag
fi

 egs/aishell/data2vec_transformer_finetune/run.sh

@@ -88,7 +88,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -141,7 +141,7 @@
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --init_param ${init_param} \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -191,7 +191,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -212,4 +212,19 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode asr \
        --model_name data2vec_finetune_transformer \
        --dataset aishell \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --tag $tag
fi

 egs/aishell/paraformer/run.sh

@@ -85,7 +85,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -132,11 +132,12 @@
                --use_preprocessor true \
                --token_type char \
                --token_list $token_list \
                --dataset_type small \
                --data_dir ${feats_dir}/data \
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -186,7 +187,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -207,4 +208,20 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode paraformer \
        --model_name paraformer \
        --dataset aishell \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --nat _nat \
        --tag $tag
fi

 egs/aishell/paraformerbert/run.sh

@@ -89,7 +89,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -147,7 +147,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text,embeds.scp" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -197,7 +197,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -218,4 +218,20 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode paraformer \
        --model_name paraformer_bert \
        --dataset aishell \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --nat _nat \
        --tag $tag
fi

 egs/aishell/transformer/run.sh

@@ -85,7 +85,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -136,7 +136,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -186,7 +186,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -207,4 +207,19 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode asr \
        --model_name transformer \
        --dataset aishell \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --tag $tag
fi

 egs/aishell/transformer/utils/compute_cmvn.py

@@ -5,6 +5,7 @@
import numpy as np
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import yaml


def get_parser():
@@ -24,6 +25,11 @@
        required=True,
        type=str,
        help="the path of wav scps",
    )
    parser.add_argument(
        "--config",
        type=str,
        help="the config file for computing cmvn",
    )
    parser.add_argument(
        "--idx",
@@ -82,11 +88,27 @@
    #         mean_stats += np.sum(mat, axis=0)
    #         var_stats += np.sum(np.square(mat), axis=0)
    #         total_frames += mat.shape[0]

    with open(args.config) as f:
        configs = yaml.safe_load(f)
        frontend_configs = configs.get("frontend_conf", {})
        num_mel_bins = frontend_configs.get("n_mels", 80)
        frame_length = frontend_configs.get("frame_length", 25)
        frame_shift = frontend_configs.get("frame_shift", 10)
        window_type = frontend_configs.get("window", "hamming")
        resample_rate = frontend_configs.get("fs", 16000)
        assert num_mel_bins == args.dim

    with open(wav_scp_file) as f:
        lines = f.readlines()
        for line in lines:
            _, wav_file = line.strip().split()
            fbank = compute_fbank(wav_file, num_mel_bins=args.dim)
            fbank = compute_fbank(wav_file,
                                  num_mel_bins=args.dim,
                                  frame_length=frame_length,
                                  frame_shift=frame_shift,
                                  resample_rate=resample_rate,
                                  window_type=window_type)
            mean_stats += np.sum(fbank, axis=0)
            var_stats += np.sum(np.square(fbank), axis=0)
            total_frames += fbank.shape[0]

 egs/aishell/transformer/utils/compute_cmvn.sh

@@ -2,15 +2,19 @@

. ./path.sh || exit 1;
# Begin configuration section.
fbankdir=$1
nj=32
cmd=./utils/run.pl
feats_dim=80
config=
scale=1.0

echo "$0 $@"

. utils/parse_options.sh || exit 1;

fbankdir=$1
# shellcheck disable=SC2046
head -n $(awk -v lines="$(wc -l < ${fbankdir}/wav.scp)" -v scale="$scale" 'BEGIN { printf "%.0f\n", lines*scale }') ${fbankdir}/wav.scp > ${fbankdir}/wav.scp.scale

split_dir=${fbankdir}/cmvn/split_${nj};
mkdir -p $split_dir
@@ -18,17 +22,18 @@
for n in $(seq $nj); do
    split_scps="$split_scps $split_dir/wav.$n.scp"
done
utils/split_scp.pl ${fbankdir}/wav.scp $split_scps || exit 1;
utils/split_scp.pl ${fbankdir}/wav.scp.scale $split_scps || exit 1;

logdir=${fbankdir}/cmvn/log
$cmd JOB=1:$nj $logdir/cmvn.JOB.log \
    python utils/compute_cmvn.py \
      --dim ${feats_dim} \
      --wav_path $split_dir \
      --idx JOB
      --config $config \
      --idx JOB

python utils/combine_cmvn_file.py --dim ${feats_dim} --cmvn_dir $split_dir --nj $nj --output_dir ${fbankdir}/cmvn

python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/cmvn.mvn
python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/am.mvn

echo "$0: Succeeded compute global cmvn"

 egs/aishell/transformer/utils/gen_modelscope_configuration.py

New file
@@ -0,0 +1,118 @@
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--task",
        type=str,
        default="auto-speech-recognition",
        help="task name",
    )
    parser.add_argument(
        "--type",
        type=str,
        default="generic-asr",
    )
    parser.add_argument(
        "--am_model_name",
        type=str,
        default="model.pb",
        help="model file name",
    )
    parser.add_argument(
        "--mode",
        type=str,
        default="paraformer",
        help="mode for decoding",
    )
    parser.add_argument(
        "--lang",
        type=str,
        default="zh-cn",
        help="language",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        default=1,
        help="batch size",
    )
    parser.add_argument(
        "--am_model_config",
        type=str,
        default="config.yaml",
        help="config file",
    )
    parser.add_argument(
        "--mvn_file",
        type=str,
        default="am.mvn",
        help="cmvn file",
    )
    parser.add_argument(
        "--model_name",
        type=str,
        help="model name",
    )
    parser.add_argument(
        "--pipeline_type",
        type=str,
        default="asr-inference",
        help="pipeline type",
    )
    parser.add_argument(
        "--vocab_size",
        type=int,
        help="vocab_size",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        help="dataset name",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        help="output path",
    )
    parser.add_argument(
        "--nat",
        type=str,
        default="",
        help="nat",
    )
    parser.add_argument(
        "--tag",
        type=str,
        default="exp1",
        help="model name tag",
    )
    args = parser.parse_args()

    model = {
        "type": args.type,
        "am_model_name": args.am_model_name,
        "model_config": {
            "type": "pytorch",
            "code_base": "funasr",
            "mode": args.mode,
            "lang": args.lang,
            "batch_size": args.batch_size,
            "am_model_config": args.am_model_config,
            "mvn_file": args.mvn_file,
            "model": "speech_{}_asr{}-{}-16k-{}-vocab{}-pytorch-{}".format(args.model_name, args.nat, args.lang,
                                                                           args.dataset, args.vocab_size, args.tag),
        }
    }
    pipeline = {"type": args.pipeline_type}
    json_dict = {
        "framework": "pytorch",
        "task": args.task,
        "model": model,
        "pipeline": pipeline,
    }

    with open(os.path.join(args.output_dir, "configuration.json"), "w") as f:
        json.dump(json_dict, f, indent=4)
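
To make the generated `model` field concrete, the format string from the script can be evaluated on its own. The sketch below reuses that format string with the arguments passed by `egs/aishell/paraformer/run.sh`; the vocabulary size `4234` is a hypothetical value chosen for illustration (the recipe derives it from the line count of `tokens.txt`), and `modelscope_model_name` is a hypothetical wrapper.

```python
def modelscope_model_name(model_name, nat, lang, dataset, vocab_size, tag):
    """Build the model identifier with the same format string as the script."""
    return "speech_{}_asr{}-{}-16k-{}-vocab{}-pytorch-{}".format(
        model_name, nat, lang, dataset, vocab_size, tag
    )


# vocab_size=4234 is an assumed value, not taken from the recipe.
name = modelscope_model_name("paraformer", "_nat", "zh-cn", "aishell", 4234, "exp1")
print(name)
```

With `--nat` left at its empty default, as in the conformer and transformer recipes, the `_nat` part simply disappears from the name.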

 egs/aishell2/conformer/run.sh

@@ -87,7 +87,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -138,7 +138,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --dataset_type $dataset_type \
                --resume true \
@@ -189,7 +189,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -212,5 +212,19 @@
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode asr \
        --model_name conformer \
        --dataset aishell2 \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --tag $tag
fi



 egs/aishell2/data2vec_pretrain/run.sh

@@ -66,7 +66,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -109,7 +109,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --dataset_type $dataset_type \
                --resume true \

 egs/aishell2/paraformer/run.sh

@@ -87,7 +87,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -138,7 +138,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --dataset_type $dataset_type \
                --resume true \
@@ -189,7 +189,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -210,4 +210,20 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode paraformer \
        --model_name paraformer \
        --dataset aishell2 \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --nat _nat \
        --tag $tag
fi

 egs/aishell2/paraformerbert/run.sh

@@ -90,7 +90,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -148,7 +148,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text,embeds.scp" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --dataset_type $dataset_type \
                --resume true \
@@ -199,7 +199,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -220,4 +220,20 @@
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi

# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode paraformer \
        --model_name paraformer_bert \
        --dataset aishell2 \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --nat _nat \
        --tag $tag
fi

 egs/aishell2/transformer/run.sh

@@ -87,7 +87,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
@@ -138,7 +138,7 @@
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --data_file_names "wav.scp,text" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --dataset_type $dataset_type \
                --resume true \
@@ -189,7 +189,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
@@ -212,5 +212,18 @@
    done
fi


# Prepare files for ModelScope fine-tuning and inference
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "stage 6: ModelScope Preparation"
    cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
    vocab_size=$(cat ${token_list} | wc -l)
    python utils/gen_modelscope_configuration.py \
        --am_model_name $inference_asr_model \
        --mode asr \
        --model_name transformer \
        --dataset aishell2 \
        --output_dir $exp_dir/exp/$model_dir \
        --vocab_size $vocab_size \
        --tag $tag
fi


 egs/aishell2/transformer/utils/compute_cmvn.sh

@@ -29,6 +29,6 @@

python utils/combine_cmvn_file.py --dim ${feats_dim} --cmvn_dir $split_dir --nj $nj --output_dir ${fbankdir}/cmvn

python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/cmvn.mvn
python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/am.mvn

echo "$0: Succeeded compute global cmvn"

 egs/aishell2/transformer/utils/gen_modelscope_configuration.py

New file
@@ -0,0 +1,118 @@
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--task",
        type=str,
        default="auto-speech-recognition",
        help="task name",
    )
    parser.add_argument(
        "--type",
        type=str,
        default="generic-asr",
    )
    parser.add_argument(
        "--am_model_name",
        type=str,
        default="model.pb",
        help="model file name",
    )
    parser.add_argument(
        "--mode",
        type=str,
        default="paraformer",
        help="mode for decoding",
    )
    parser.add_argument(
        "--lang",
        type=str,
        default="zh-cn",
        help="language",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        default=1,
        help="batch size",
    )
    parser.add_argument(
        "--am_model_config",
        type=str,
        default="config.yaml",
        help="config file",
    )
    parser.add_argument(
        "--mvn_file",
        type=str,
        default="am.mvn",
        help="cmvn file",
    )
    parser.add_argument(
        "--model_name",
        type=str,
        help="model name",
    )
    parser.add_argument(
        "--pipeline_type",
        type=str,
        default="asr-inference",
        help="pipeline type",
    )
    parser.add_argument(
        "--vocab_size",
        type=int,
        help="vocab_size",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        help="dataset name",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        help="output path",
    )
    parser.add_argument(
        "--nat",
        type=str,
        default="",
        help="nat",
    )
    parser.add_argument(
        "--tag",
        type=str,
        default="exp1",
        help="model name tag",
    )
    args = parser.parse_args()

    model = {
        "type": args.type,
        "am_model_name": args.am_model_name,
        "model_config": {
            "type": "pytorch",
            "code_base": "funasr",
            "mode": args.mode,
            "lang": args.lang,
            "batch_size": args.batch_size,
            "am_model_config": args.am_model_config,
            "mvn_file": args.mvn_file,
            "model": "speech_{}_asr{}-{}-16k-{}-vocab{}-pytorch-{}".format(args.model_name, args.nat, args.lang,
                                                                           args.dataset, args.vocab_size, args.tag),
        }
    }
    pipeline = {"type": args.pipeline_type}
    json_dict = {
        "framework": "pytorch",
        "task": args.task,
        "model": model,
        "pipeline": pipeline,
    }

    with open(os.path.join(args.output_dir, "configuration.json"), "w") as f:
        json.dump(json_dict, f, indent=4)
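
As a quick illustration of the `model` field written above, the following minimal sketch reproduces the format string on its own. The helper name `build_model_id` is hypothetical, and the vocabulary size `5000` is a placeholder rather than a value taken from the recipes; the other arguments mirror the paraformer recipe for aishell2 and the script defaults.

```python
def build_model_id(model_name, nat, lang, dataset, vocab_size, tag):
    # Same format string as in gen_modelscope_configuration.py
    return "speech_{}_asr{}-{}-16k-{}-vocab{}-pytorch-{}".format(
        model_name, nat, lang, dataset, vocab_size, tag
    )


# "_nat" is passed only by the paraformer recipes; other recipes leave it empty.
model_id = build_model_id("paraformer", "_nat", "zh-cn", "aishell2", 5000, "exp1")
print(model_id)
# speech_paraformer_asr_nat-zh-cn-16k-aishell2-vocab5000-pytorch-exp1
```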

 egs/aishell2/transformerLM/conf/train_lm_transformer.yaml

@@ -13,7 +13,7 @@
batch_type: numel
batch_bins: 6000000
accum_grad: 1
max_epoch: 15  # 15epoch is enougth
max_epoch: 15  # 15epoch is enough

optim: adam
optim_conf:

 egs/librispeech/conformer/run.sh

@@ -97,7 +97,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
@@ -150,7 +150,7 @@
                --data_dir ${feats_dir}/data \
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -201,7 +201,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \

 egs/librispeech_100h/conformer/run.sh

@@ -93,7 +93,7 @@

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature and CMVN Generation"
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
    utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
fi

token_list=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
@@ -146,7 +146,7 @@
                --data_dir ${feats_dir}/data \
                --train_set ${train_set} \
                --valid_set ${valid_set} \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --speed_perturb ${speed_perturb} \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
@@ -197,7 +197,7 @@
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
                --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \

 funasr/bin/train.py

@@ -272,8 +272,8 @@
    parser.add_argument(
        "--init_param",
        type=str,
        action="append",
        default=[],
        nargs="*",
        help="Specify the file path used for initialization of parameters. "
             "The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', "
             "where file_path is the model file path, "
@@ -519,6 +519,12 @@
        dtype=getattr(torch, args.train_dtype),
        device="cuda" if args.ngpu > 0 else "cpu",
    )
    for t in args.freeze_param:
        for k, p in model.named_parameters():
            if k.startswith(t + ".") or k == t:
                logging.info(f"Setting {k}.requires_grad = False")
                p.requires_grad = False

    optimizers = build_optimizer(args, model=model)
    schedulers = build_scheduler(args, optimizers)
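
The prefix test in the freezing loop above can be tried in isolation. This is a sketch in plain Python (no torch), and the parameter names are made up for illustration. It shows why the check uses `t + "."` rather than a bare `startswith(t)`: a sibling module such as `encoder2` is not frozen by the prefix `encoder`.

```python
def is_frozen(param_name, freeze_prefixes):
    # Mirrors the condition: k.startswith(t + ".") or k == t
    return any(
        param_name == t or param_name.startswith(t + ".")
        for t in freeze_prefixes
    )


# Hypothetical parameter names, as returned by model.named_parameters()
names = ["encoder.embed.weight", "encoder2.weight", "decoder.output.bias", "encoder"]
frozen = [n for n in names if is_frozen(n, ["encoder"])]
print(frozen)  # ['encoder.embed.weight', 'encoder']
```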


 funasr/models/e2e_asr_paraformer.py

@@ -236,7 +236,7 @@
            loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att + loss_pre * self.predictor_weight

        if self.use_1st_decoder_loss and pre_loss_att is not None:
            loss = loss + pre_loss_att
            loss = loss + (1 - self.ctc_weight) * pre_loss_att

        # Collect Attn branch stats
        stats["loss_att"] = loss_att.detach() if loss_att is not None else None
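
The effect of this change on the total loss can be checked with plain numbers. The sketch below copies only the two expressions visible in the hunk (the combined-loss branch and the first-decoder term); the function name and all numeric values are hypothetical.

```python
def combine_loss(loss_ctc, loss_att, loss_pre, pre_loss_att,
                 ctc_weight, predictor_weight, use_1st_decoder_loss=True):
    # Combined CTC / attention / predictor loss, as in the context line of the hunk
    loss = ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att + loss_pre * predictor_weight
    if use_1st_decoder_loss and pre_loss_att is not None:
        # After the change, the first-decoder loss is scaled like loss_att
        loss = loss + (1 - ctc_weight) * pre_loss_att
    return loss


total = combine_loss(loss_ctc=10.0, loss_att=8.0, loss_pre=2.0, pre_loss_att=6.0,
                     ctc_weight=0.5, predictor_weight=1.0)
print(total)  # 14.0 (before the change the same inputs would give 17.0)
```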

			@@ -8,25 +8,66 @@
			```sh
			cd egs/aishell/paraformer
			```

			Then you can directly start the recipe as follows:
			```sh
			conda activate funasr
			. ./run.sh
			```
			The training log files are saved in `exp/_train_/log/train.log.` and the inference results are saved in `exp/_train_/decode_asr_`.

			The training log files are saved in `${exp_dir}/exp/${model_dir}/log/train.log.*`, which can be viewed using the following command:
			```sh
			vim exp/_train_/log/train.log.0
			```

			Users can observe the training loss, prediction accuracy and other training information, as follows:
			```text
			... 1epoch:train:751-800batch:800num_updates: ... loss_ctc=106.703, loss_att=86.877, acc=0.029, loss_pre=1.552 ...
			... 1epoch:train:801-850batch:850num_updates: ... loss_ctc=107.890, loss_att=87.832, acc=0.029, loss_pre=1.702 ...
			```

			Users can also use TensorBoard to view this training information with the following command:
			```sh
			tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
			```

			At the end of each epoch, the evaluation metrics are calculated on the validation set, as follows:
			```text
			... [valid] loss_ctc=99.914, cer_ctc=1.000, loss_att=80.512, acc=0.029, cer=0.971, wer=1.000, loss_pre=1.952, loss=88.285 ...
			```

			The inference results are saved in `${exp_dir}/exp/${model_dir}/decode_asr_*/$dset`. The two main files are `text.cer` and `text.cer.txt`. `text.cer` saves the comparison between the recognized text and the reference text, as follows:
			```text
			...
			BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
			ref: 构建良好的旅游市场环境
			res: 构建良好的旅游市场环境
			...
			```
			`text.cer.txt` saves the final results, as follows:
			```text
			%WER ...
			%SER ...
			Scored ... sentences, ...
			```
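
To make the per-utterance line in `text.cer` concrete, here is a minimal sketch that parses the counts and recomputes the error rate. It assumes the usual definition CER = (ins + del + sub) / nwords; the scoring script itself is not shown in this change, so treat the formula and the helper name `parse_cer_line` as assumptions.

```python
import re


def parse_cer_line(line):
    # Extract the integer counts; "corr=" and "cer=" are not matched by this pattern
    counts = {k: int(v) for k, v in re.findall(r"(nwords|cor|ins|del|sub)=(\d+)", line)}
    errors = counts["ins"] + counts["del"] + counts["sub"]
    return counts, 100.0 * errors / counts["nwords"]


line = "BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%"
counts, cer = parse_cer_line(line)
print(counts["nwords"], f"{cer:.2f}%")  # 11 0.00%
```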

			## Introduction
			We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on AISHELL-1 dataset. This recipe consists of five stages, supporting training on multiple GPUs and decoding by CPU or GPU. Before introducing each stage in detail, we first explain several parameters which should be set by users.
			- `CUDA_VISIBLE_DEVICES`: visible gpu list
			- `gpu_num`: the number of GPUs used for training
			- `gpu_inference`: whether to use GPUs for decoding
			- `njob`: for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
			- `CUDA_VISIBLE_DEVICES`: `0,1` (Default), visible gpu list
			- `gpu_num`: `2` (Default), the number of GPUs used for training
			- `gpu_inference`: `true` (Default), whether to use GPUs for decoding
			- `njob`: `1` (Default), for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
			- `raw_data`: the raw path of AISHELL-1 dataset
			- `feats_dir`: the path for saving processed data
			- `nj`: the number of jobs for data preparation
			- `speed_perturb`: the range of speech perturbed
			- `token_type`: `char` (Default), indicate how to process text
			- `type`: `sound` (Default), set the input type
			- `scp`: `wav.scp` (Default), set the input file
			- `nj`: `64` (Default), the number of jobs for data preparation
			- `speed_perturb`: `"0.9, 1.0 ,1.1"` (Default), the speed perturbation factors applied to the speech
			- `exp_dir`: the path for saving experimental results
			- `tag`: the suffix of experimental result directory
			- `tag`: `exp1` (Default), the suffix of experimental result directory
			- `stage`: `0` (Default), start the recipe from the specified stage
			- `stop_stage`: `5` (Default), stop the recipe at the specified stage

			### Stage 0: Data preparation
			This stage processes raw AISHELL-1 dataset `$raw_data` and generates the corresponding `wav.scp` and `text` in `$feats_dir/data/xxx`. `xxx` means `train/dev/test`. Here we assume users have already downloaded AISHELL-1 dataset. If not, users can download data [here](https://www.openslr.org/33/) and set the path for `$raw_data`. The examples of `wav.scp` and `text` are as follows:
			@@ -47,11 +88,10 @@
			Both files have two columns: the first column contains the wav ids and the second column contains the corresponding wav paths (`wav.scp`) or label tokens (`text`).

			### Stage 1: Feature and CMVN Generation
			This stage computes CMVN based on `train` dataset, which is used in the following stages. Users can set `nj` to control the number of jobs for computing CMVN. The generated CMVN file is saved as `$feats_dir/data/train/cmvn/cmvn.mvn`.
			This stage computes CMVN based on `train` dataset, which is used in the following stages. Users can set `nj` to control the number of jobs for computing CMVN. The generated CMVN file is saved as `$feats_dir/data/train/cmvn/am.mvn`.
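
As a rough sketch of what "computing CMVN" over `nj` parallel jobs involves: each job accumulates per-dimension sums, and the partial statistics are then combined into a global mean and inverse standard deviation. The names `mean_stats`, `var_stats` and `total_frames` follow the commented-out code in `compute_cmvn.py`; the dictionary layout and the function `combine_cmvn` are assumptions, and the actual on-disk format of `am.mvn` is not reproduced here.

```python
import math


def combine_cmvn(job_stats):
    # job_stats: one dict per job with per-dimension sums and a frame count
    total_frames = sum(s["total_frames"] for s in job_stats)
    dim = len(job_stats[0]["mean_stats"])
    means, inv_stds = [], []
    for d in range(dim):
        sum_x = sum(s["mean_stats"][d] for s in job_stats)   # sum of features
        sum_x2 = sum(s["var_stats"][d] for s in job_stats)   # sum of squared features
        mean = sum_x / total_frames
        var = sum_x2 / total_frames - mean * mean
        means.append(mean)
        inv_stds.append(1.0 / math.sqrt(max(var, 1e-20)))    # floor avoids division by zero
    return means, inv_stds


# Two jobs over a 1-dimensional feature: frames [1, 3] and [5, 7]
jobs = [
    {"mean_stats": [4.0], "var_stats": [10.0], "total_frames": 2},
    {"mean_stats": [12.0], "var_stats": [74.0], "total_frames": 2},
]
means, inv_stds = combine_cmvn(jobs)
print(means)  # [4.0]
```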

			### Stage 2: Dictionary Preparation
			This stage processes the dictionary, which is used as a mapping between label characters and integer indices during ASR training. The processed dictionary file is saved as `$feats_dir/data/${lang}_token_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows:
			* `tokens.txt`
			```
			<blank>
			<s>
			@@ -63,38 +103,26 @@
			龟
			<unk>
			```
			* `<blank>`: indicates the blank token for CTC
			* `<s>`: indicates the start-of-sentence token
			* `</s>`: indicates the end-of-sentence token
			* `<unk>`: indicates the out-of-vocabulary token
			* `<blank>`: indicates the blank token for CTC; must be on the first line
			* `<s>`: indicates the start-of-sentence token; must be on the second line
			* `</s>`: indicates the end-of-sentence token; must be on the third line
			* `<unk>`: indicates the out-of-vocabulary token; must be on the last line
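
The position requirements listed above can be verified with a few lines of Python. This is an illustrative sketch only: `check_token_list` is a hypothetical helper, not part of FunASR, and the token `一` is a made-up entry.

```python
def check_token_list(tokens):
    """Return a list of problems; an empty list means the layout is as described."""
    if len(tokens) < 4:
        return ["token list needs at least <blank>, <s>, </s> and <unk>"]
    expected = [(0, "<blank>"), (1, "<s>"), (2, "</s>"), (len(tokens) - 1, "<unk>")]
    return [
        f"line {idx + 1}: expected {tok!r}, found {tokens[idx]!r}"
        for idx, tok in expected
        if tokens[idx] != tok
    ]


tokens = ["<blank>", "<s>", "</s>", "一", "龟", "<unk>"]
print(check_token_list(tokens))  # []
print(len(tokens))               # 6, the line count that `wc -l` reports as vocab_size
```

In practice the list would be read from `tokens.txt`, one token per line.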

			### Stage 3: LM Training

			### Stage 4: ASR Training
			This stage achieves the training of the specified model. To start training, users should manually set `exp_dir`, `CUDA_VISIBLE_DEVICES` and `gpu_num`, which have already been explained above. By default, the best `$keep_nbest_models` checkpoints on validation dataset will be averaged to generate a better model and adopted for decoding.
			This stage trains the specified model. To start training, users should manually set `exp_dir` to specify the path for saving experimental results. By default, the best `$keep_nbest_models` checkpoints on the validation dataset are averaged to generate a better model, which is then used for decoding. FunASR implements `train.py` for training different models, and users can configure the following parameters if necessary.

			* DDP Training

			We support the DistributedDataParallel (DDP) training and the detail can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). To enable DDP training, please set `gpu_num` greater than 1. For example, if you set `CUDA_VISIBLE_DEVICES=0,1,5,6,7` and `gpu_num=3`, then the gpus with ids 0, 1 and 5 will be used for training.
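
The GPU selection rule in the example above (the first `gpu_num` ids of `CUDA_VISIBLE_DEVICES` are used) can be written as a short sketch. The function `select_gpus` is hypothetical and only restates the documented behavior; it is not the recipe's actual implementation.

```python
def select_gpus(cuda_visible_devices, gpu_num):
    # Split the comma-separated id list and keep the first gpu_num entries
    ids = [x.strip() for x in cuda_visible_devices.split(",") if x.strip()]
    if gpu_num > len(ids):
        raise ValueError("gpu_num exceeds the number of visible GPUs")
    return ids[:gpu_num]


print(select_gpus("0,1,5,6,7", 3))  # ['0', '1', '5']
```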

			* DataLoader

			We support an optional iterable-style DataLoader based on [Pytorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large dataset and users can set `dataset_type=large` to enable it.

			* Configuration

			The parameters of the training, including model, optimization, dataset, etc., can be set by a YAML file in `conf` directory. Also, users can directly set the parameters in `run.sh` recipe. Please avoid to set the same parameters in both the YAML file and the recipe.

			* Training Steps

			We support two parameters to specify the training steps, namely `max_epoch` and `max_update`. `max_epoch` indicates the total training epochs while `max_update` indicates the total training steps. If these two parameters are specified at the same time, once the training reaches any one of these two parameters, the training will be stopped.

			* Tensorboard

			Users can use tensorboard to observe the loss, learning rate, etc. Please run the following command:
			```
			tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
			```
			* `task_name`: `asr` (Default), specify the task type of the current recipe
			* `gpu_num`: `2` (Default), specify the number of GPUs for training. When `gpu_num > 1`, DistributedDataParallel (DDP) training is enabled (details can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)). Correspondingly, `CUDA_VISIBLE_DEVICES` should be set to specify which GPU ids are used.
			* `use_preprocessor`: `true` (Default), specify whether to use pre-processing on each sample
			* `token_list`: the path of token list for training
			* `dataset_type`: `small` (Default). FunASR supports `small` dataset type for training small datasets. Besides, an optional iterable-style DataLoader based on [Pytorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large datasets is supported and users can specify `dataset_type=large` to enable it.
			* `data_dir`: the path of data. Specifically, the data for training is saved in `$data_dir/data/$train_set` while the data for validation is saved in `$data_dir/data/$valid_set`
			* `data_file_names`: `"wav.scp,text"`, specify the speech and text file names for ASR
			* `cmvn_file`: the path of cmvn file
			* `resume`: `true`, whether to resume training from an existing checkpoint
			* `config`: the path of the configuration file, which is usually a YAML file in the `conf` directory. In FunASR, the training parameters, including model, optimization, dataset, etc., can also be set in this file. Note that if the same parameter is specified in both the recipe and the config file, the value from the recipe takes precedence
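
The precedence rule for `config` described above (recipe values win over YAML values) amounts to a simple dictionary merge. The sketch below is an illustration under that assumption; `resolve_params` and the sample keys and values are hypothetical and do not reflect how `train.py` is actually implemented.

```python
def resolve_params(config_values, recipe_values):
    # Start from the YAML config, then let explicitly set recipe values override it
    merged = dict(config_values)
    merged.update({k: v for k, v in recipe_values.items() if v is not None})
    return merged


config_values = {"max_epoch": 50, "accum_grad": 1}
recipe_values = {"max_epoch": 20, "accum_grad": None}  # None: not set in run.sh
print(resolve_params(config_values, recipe_values))
# {'max_epoch': 20, 'accum_grad': 1}
```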

			### Stage 5: Decoding
			This stage generates the recognition results and calculates the `CER` to verify the performance of the trained model.
			@@ -114,7 +142,6 @@
			* Performance

			We adopt `CER` to verify the performance. The results are in `$exp_dir/exp/$model_dir/$decoding_yaml_name/$average_model_name/$dset`, namely `text.cer` and `text.cer.txt`. `text.cer` saves the comparison between the recognized text and the reference text while `text.cer.txt` saves the final `CER` results. The following is an example of `text.cer`:
			* `text.cer`
			```
			...
			BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
			@@ -140,6 +167,9 @@
			. ./run.sh --stage 3 --stop_stage 5
			```

			* Training Steps
			FunASR supports two parameters to specify the training steps, namely `max_epoch` and `max_update`. `max_epoch` indicates the total number of training epochs while `max_update` indicates the total number of training steps. If both parameters are specified, training stops as soon as either limit is reached.
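
The stopping rule for `max_epoch` and `max_update` can be summarized as a small predicate. This is a sketch of the described behavior only; `should_stop` is a hypothetical name and the trainer's real loop is not shown in this change.

```python
def should_stop(epoch, num_updates, max_epoch=None, max_update=None):
    # Training ends when either configured limit has been reached
    if max_epoch is not None and epoch >= max_epoch:
        return True
    if max_update is not None and num_updates >= max_update:
        return True
    return False


print(should_stop(epoch=3, num_updates=1000, max_epoch=15, max_update=1000))  # True
print(should_stop(epoch=3, num_updates=999, max_epoch=15, max_update=1000))   # False
```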

			* Change the configuration of the model

			The configuration of the model is set in the config file `conf/train_*.yaml`. Specifically, the default encoder configuration of paraformer is as follows:

			@@ -85,7 +85,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
			@@ -136,7 +136,7 @@
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--data_file_names "wav.scp,text" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--resume true \
			--output_dir ${exp_dir}/exp/${model_dir} \
			@@ -186,7 +186,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
			@@ -207,4 +207,19 @@
			tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
			cat ${_dir}/text.cer.txt
			done
			fi

			# Prepare files for ModelScope fine-tuning and inference
			if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
			echo "stage 6: ModelScope Preparation"
			cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
			vocab_size=$(cat ${token_list} | wc -l)
			python utils/gen_modelscope_configuration.py \
			--am_model_name $inference_asr_model \
			--mode asr \
			--model_name conformer \
			--dataset aishell \
			--output_dir $exp_dir/exp/$model_dir \
			--vocab_size $vocab_size \
			--tag $tag
			fi

			@@ -88,7 +88,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
			@@ -141,7 +141,7 @@
			--valid_set ${valid_set} \
			--data_file_names "wav.scp,text" \
			--init_param ${init_param} \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--resume true \
			--output_dir ${exp_dir}/exp/${model_dir} \
			--config $asr_config \
			@@ -190,7 +190,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			@@ -212,4 +212,20 @@
			tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
			cat ${_dir}/text.cer.txt
			done
			fi

			# Prepare files for ModelScope fine-tuning and inference
			if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
			echo "stage 6: ModelScope Preparation"
			cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
			vocab_size=$(cat ${token_list} | wc -l)
			python utils/gen_modelscope_configuration.py \
			--am_model_name $inference_asr_model \
			--mode paraformer \
			--model_name data2vec_finetune_paraformer \
			--dataset aishell \
			--output_dir $exp_dir/exp/$model_dir \
			--vocab_size $vocab_size \
			--nat _nat \
			--tag $tag
			fi

			@@ -89,7 +89,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
			@@ -147,7 +147,7 @@
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--data_file_names "wav.scp,text,embeds.scp" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--resume true \
			--output_dir ${exp_dir}/exp/${model_dir} \
			@@ -197,7 +197,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
			@@ -218,4 +218,20 @@
			tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
			cat ${_dir}/text.cer.txt
			done
			fi

			# Prepare files for ModelScope fine-tuning and inference
			if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
			echo "stage 6: ModelScope Preparation"
			cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
			vocab_size=$(cat ${token_list} | wc -l)
			python utils/gen_modelscope_configuration.py \
			--am_model_name $inference_asr_model \
			--mode paraformer \
			--model_name paraformer_bert \
			--dataset aishell \
			--output_dir $exp_dir/exp/$model_dir \
			--vocab_size $vocab_size \
			--nat _nat \
			--tag $tag
			fi

			@@ -5,6 +5,7 @@
			import numpy as np
			import torchaudio
			import torchaudio.compliance.kaldi as kaldi
			import yaml


			def get_parser():
			@@ -24,6 +25,11 @@
			required=True,
			type=str,
			help="the path of wav scps",
			)
			parser.add_argument(
			"--config",
			type=str,
			help="the config file for computing cmvn",
			)
			parser.add_argument(
			"--idx",
			@@ -82,11 +88,27 @@
			# mean_stats += np.sum(mat, axis=0)
			# var_stats += np.sum(np.square(mat), axis=0)
			# total_frames += mat.shape[0]

			with open(args.config) as f:
			configs = yaml.safe_load(f)
			frontend_configs = configs.get("frontend_conf", {})
			num_mel_bins = frontend_configs.get("n_mels", 80)
			frame_length = frontend_configs.get("frame_length", 25)
			frame_shift = frontend_configs.get("frame_shift", 10)
			window_type = frontend_configs.get("window", "hamming")
			resample_rate = frontend_configs.get("fs", 16000)
			assert num_mel_bins == args.dim

			with open(wav_scp_file) as f:
			lines = f.readlines()
			for line in lines:
			_, wav_file = line.strip().split()
			fbank = compute_fbank(wav_file, num_mel_bins=args.dim)
			fbank = compute_fbank(wav_file,
			num_mel_bins=args.dim,
			frame_length=frame_length,
			frame_shift=frame_shift,
			resample_rate=resample_rate,
			window_type=window_type)
			mean_stats += np.sum(fbank, axis=0)
			var_stats += np.sum(np.square(fbank), axis=0)
			total_frames += fbank.shape[0]
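
The loop in this hunk accumulates per-dimension sums, sums of squares, and a frame count over all utterances. A minimal pure-Python sketch of that accumulation follows. The names `accumulate_cmvn` and `finalize_cmvn` are hypothetical, plain lists stand in for the fbank matrices, and the finalization step is the standard mean/standard-deviation formula, assumed here because the combine and convert scripts are not shown in this diff.

```python
import math


def accumulate_cmvn(utterances, dim):
    """Sum features and squared features over every frame of every utterance."""
    mean_stats = [0.0] * dim
    var_stats = [0.0] * dim
    total_frames = 0
    for frames in utterances:  # one utterance = list of feature frames
        for frame in frames:
            # Same idea as `assert num_mel_bins == args.dim` in the hunk above.
            assert len(frame) == dim
            for d, x in enumerate(frame):
                mean_stats[d] += x
                var_stats[d] += x * x
        total_frames += len(frames)
    return mean_stats, var_stats, total_frames


def finalize_cmvn(mean_stats, var_stats, total_frames):
    """Assumed finalization: mean = sum/N, std = sqrt(E[x^2] - mean^2)."""
    mean = [s / total_frames for s in mean_stats]
    std = [
        math.sqrt(max(v / total_frames - m * m, 0.0))
        for v, m in zip(var_stats, mean)
    ]
    return mean, std


# Toy data: two utterances with 2-dimensional "features".
utts = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
sums, sq_sums, n = accumulate_cmvn(utts, dim=2)
mean, std = finalize_cmvn(sums, sq_sums, n)
```

Because only sums are kept, the per-job results can be added together afterwards, which is presumably why the recipe can split `wav.scp` across `nj` jobs.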

			@@ -2,15 +2,19 @@

			. ./path.sh || exit 1;
			# Begin configuration section.
			fbankdir=$1
			nj=32
			cmd=./utils/run.pl
			feats_dim=80
			config=
			scale=1.0

			echo "$0 $@"

			. utils/parse_options.sh || exit 1;

			fbankdir=$1
			# shellcheck disable=SC2046
			head -n $(awk -v lines="$(wc -l < ${fbankdir}/wav.scp)" -v scale="$scale" 'BEGIN { printf "%.0f\n", lines*scale }') ${fbankdir}/wav.scp > ${fbankdir}/wav.scp.scale

			split_dir=${fbankdir}/cmvn/split_${nj};
			mkdir -p $split_dir
			@@ -18,17 +22,18 @@
			for n in $(seq $nj); do
			split_scps="$split_scps $split_dir/wav.$n.scp"
			done
			utils/split_scp.pl ${fbankdir}/wav.scp $split_scps || exit 1;
			utils/split_scp.pl ${fbankdir}/wav.scp.scale $split_scps || exit 1;

			logdir=${fbankdir}/cmvn/log
			$cmd JOB=1:$nj $logdir/cmvn.JOB.log \
			python utils/compute_cmvn.py \
			--dim ${feats_dim} \
			--wav_path $split_dir \
			--idx JOB
			--config $config \
			--idx JOB \

			python utils/combine_cmvn_file.py --dim ${feats_dim} --cmvn_dir $split_dir --nj $nj --output_dir ${fbankdir}/cmvn

			python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/cmvn.mvn
			python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/am.mvn

			echo "$0: Succeeded compute global cmvn"
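
The new `--scale` option keeps only the first `lines*scale` entries of `wav.scp` (rounded by awk's `%.0f`) before the CMVN jobs are split. The sketch below mirrors that selection in Python. `scaled_subset` is a hypothetical helper, and Python's `round` is assumed to be close enough to awk's `printf "%.0f"` for illustration, although the two may differ on exact ties depending on the platform.

```python
def scaled_subset(entries, scale):
    """Keep the first round(len(entries) * scale) entries, like `head -n`."""
    keep = round(len(entries) * scale)
    return entries[:keep]


# Hypothetical wav.scp lines: "<utt_id> <wav_path>".
scp = ["utt{} /data/wav/utt{}.wav".format(i, i) for i in range(10)]

half = scaled_subset(scp, 0.5)   # first 5 entries
full = scaled_subset(scp, 1.0)   # all 10 entries, the value the recipes pass
```

With `--scale 1.0`, as passed by every `run.sh` in this commit, the subset is the whole file, so the statistics are unchanged; a smaller scale trades accuracy of the statistics for speed.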

New file
			@@ -0,0 +1,118 @@
			import argparse
			import json
			import os

			if __name__ == '__main__':
			parser = argparse.ArgumentParser()
			parser.add_argument(
			"--task",
			type=str,
			default="auto-speech-recognition",
			help="task name",
			)
			parser.add_argument(
			"--type",
			type=str,
			default="generic-asr",
			)
			parser.add_argument(
			"--am_model_name",
			type=str,
			default="model.pb",
			help="model file name",
			)
			parser.add_argument(
			"--mode",
			type=str,
			default="paraformer",
			help="mode for decoding",
			)
			parser.add_argument(
			"--lang",
			type=str,
			default="zh-cn",
			help="language",
			)
			parser.add_argument(
			"--batch_size",
			type=int,
			default=1,
			help="batch size",
			)
			parser.add_argument(
			"--am_model_config",
			type=str,
			default="config.yaml",
			help="config file",
			)
			parser.add_argument(
			"--mvn_file",
			type=str,
			default="am.mvn",
			help="cmvn file",
			)
			parser.add_argument(
			"--model_name",
			type=str,
			help="model name",
			)
			parser.add_argument(
			"--pipeline_type",
			type=str,
			default="asr-inference",
			help="pipeline type",
			)
			parser.add_argument(
			"--vocab_size",
			type=int,
			help="vocab_size",
			)
			parser.add_argument(
			"--dataset",
			type=str,
			help="dataset name",
			)
			parser.add_argument(
			"--output_dir",
			type=str,
			help="output path",
			)
			parser.add_argument(
			"--nat",
			type=str,
			default="",
			help="nat",
			)
			parser.add_argument(
			"--tag",
			type=str,
			default="exp1",
			help="model name tag",
			)
			args = parser.parse_args()

			model = {
			"type": args.type,
			"am_model_name": args.am_model_name,
			"model_config": {
			"type": "pytorch",
			"code_base": "funasr",
			"mode": args.mode,
			"lang": args.lang,
			"batch_size": args.batch_size,
			"am_model_config": args.am_model_config,
			"mvn_file": args.mvn_file,
			"model": "speech_{}_asr{}-{}-16k-{}-vocab{}-pytorch-{}".format(args.model_name, args.nat, args.lang,
			args.dataset, args.vocab_size, args.tag),
			}
			}
			pipeline = {"type": args.pipeline_type}
			json_dict = {
			"framework": "pytorch",
			"task": args.task,
			"model": model,
			"pipeline": pipeline,
			}

			with open(os.path.join(args.output_dir, "configuration.json"), "w") as f:
			json.dump(json_dict, f, indent=4)
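
To make the `model` field concrete, the sketch below applies the same format string to the arguments that the stage 6 blocks in this commit pass. `modelscope_model_name` is a hypothetical wrapper, and the vocabulary sizes are made-up values; the real one comes from `wc -l` on the token list.

```python
def modelscope_model_name(model_name, nat, lang, dataset, vocab_size, tag):
    """Same format string as in gen_modelscope_configuration.py."""
    return "speech_{}_asr{}-{}-16k-{}-vocab{}-pytorch-{}".format(
        model_name, nat, lang, dataset, vocab_size, tag
    )


# paraformerbert recipe: --nat _nat is passed (vocab size is illustrative).
nat_name = modelscope_model_name(
    "paraformer_bert", "_nat", "zh-cn", "aishell", 4234, "exp1"
)

# conformer recipe: --nat is omitted, so the default "" is used.
ar_name = modelscope_model_name(
    "conformer", "", "zh-cn", "aishell2", 5000, "exp1"
)

print(nat_name)
print(ar_name)
```

The `--nat` suffix is therefore the only part of the name that distinguishes the Paraformer-style recipes from the `--mode asr` ones.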

			@@ -87,7 +87,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
			@@ -138,7 +138,7 @@
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--data_file_names "wav.scp,text" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--dataset_type $dataset_type \
			--resume true \
			@@ -189,7 +189,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
			@@ -212,5 +212,19 @@
			done
			fi

			# Prepare files for ModelScope fine-tuning and inference
			if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
			echo "stage 6: ModelScope Preparation"
			cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
			vocab_size=$(cat ${token_list} | wc -l)
			python utils/gen_modelscope_configuration.py \
			--am_model_name $inference_asr_model \
			--mode asr \
			--model_name conformer \
			--dataset aishell2 \
			--output_dir $exp_dir/exp/$model_dir \
			--vocab_size $vocab_size \
			--tag $tag
			fi

			@@ -66,7 +66,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
			@@ -109,7 +109,7 @@
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--data_file_names "wav.scp" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--dataset_type $dataset_type \
			--resume true \

			@@ -90,7 +90,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
			@@ -148,7 +148,7 @@
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--data_file_names "wav.scp,text,embeds.scp" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--dataset_type $dataset_type \
			--resume true \
			@@ -199,7 +199,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
			@@ -220,4 +220,20 @@
			tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
			cat ${_dir}/text.cer.txt
			done
			fi

			# Prepare files for ModelScope fine-tuning and inference
			if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
			echo "stage 6: ModelScope Preparation"
			cp ${feats_dir}/data/${train_set}/cmvn/am.mvn ${exp_dir}/exp/${model_dir}/am.mvn
			vocab_size=$(cat ${token_list} | wc -l)
			python utils/gen_modelscope_configuration.py \
			--am_model_name $inference_asr_model \
			--mode paraformer \
			--model_name paraformer_bert \
			--dataset aishell2 \
			--output_dir $exp_dir/exp/$model_dir \
			--vocab_size $vocab_size \
			--nat _nat \
			--tag $tag
			fi

			@@ -29,6 +29,6 @@

			python utils/combine_cmvn_file.py --dim ${feats_dim} --cmvn_dir $split_dir --nj $nj --output_dir ${fbankdir}/cmvn

			python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/cmvn.mvn
			python utils/cmvn_converter.py --cmvn_json ${fbankdir}/cmvn/cmvn.json --am_mvn ${fbankdir}/cmvn/am.mvn

			echo "$0: Succeeded compute global cmvn"

			@@ -13,7 +13,7 @@
			batch_type: numel
			batch_bins: 6000000
			accum_grad: 1
			max_epoch: 15 # 15epoch is enougth
			max_epoch: 15 # 15epoch is enough

			optim: adam
			optim_conf:

			@@ -97,7 +97,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
			@@ -150,7 +150,7 @@
			--data_dir ${feats_dir}/data \
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--resume true \
			--output_dir ${exp_dir}/exp/${model_dir} \
			@@ -201,7 +201,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			--asr_model_file "${asr_exp}"/"${inference_asr_model}" \

			@@ -93,7 +93,7 @@

			if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
			echo "stage 1: Feature and CMVN Generation"
			utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} ${feats_dir}/data/${train_set}
			utils/compute_cmvn.sh ${feats_dir}/data/${train_set} --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} --config $asr_config --scale 1.0
			fi

			token_list=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
			@@ -146,7 +146,7 @@
			--data_dir ${feats_dir}/data \
			--train_set ${train_set} \
			--valid_set ${valid_set} \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--speed_perturb ${speed_perturb} \
			--resume true \
			--output_dir ${exp_dir}/exp/${model_dir} \
			@@ -197,7 +197,7 @@
			--njob ${njob} \
			--gpuid_list ${gpuid_list} \
			--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/cmvn.mvn \
			--cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
			--key_file "${_logdir}"/keys.JOB.scp \
			--asr_train_config "${asr_exp}"/config.yaml \
			--asr_model_file "${asr_exp}"/"${inference_asr_model}" \

			@@ -272,8 +272,8 @@
			parser.add_argument(
			"--init_param",
			type=str,
			action="append",
			default=[],
			nargs="*",
			help="Specify the file path used for initialization of parameters. "
			"The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', "
			"where file_path is the model file path, "
			@@ -519,6 +519,12 @@
			dtype=getattr(torch, args.train_dtype),
			device="cuda" if args.ngpu > 0 else "cpu",
			)
			for t in args.freeze_param:
			for k, p in model.named_parameters():
			if k.startswith(t + ".") or k == t:
			logging.info(f"Setting {k}.requires_grad = False")
			p.requires_grad = False

			optimizers = build_optimizer(args, model=model)
			schedulers = build_scheduler(args, optimizers)
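
The freezing loop added above matches each `freeze_param` entry either as an exact parameter name or as a module prefix followed by a dot. The sketch below isolates that matching rule without PyTorch. `frozen_names` and the parameter names are hypothetical and only illustrate which names the condition would select.

```python
def frozen_names(param_names, freeze_param):
    """Return the names the condition in train.py would mark as frozen."""
    frozen = []
    for t in freeze_param:
        for k in param_names:
            # Same test as in the hunk: module prefix or exact name.
            if k.startswith(t + ".") or k == t:
                frozen.append(k)
    return frozen


names = [
    "encoder.embed.weight",
    "encoder2.embed.weight",
    "decoder.bias",
    "ctc.ctc_lo.weight",
]

by_module = frozen_names(names, ["encoder"])
by_exact = frozen_names(names, ["decoder.bias"])
```

Appending the dot before the prefix test is what keeps `encoder` from also freezing a sibling module such as `encoder2`.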

			@@ -236,7 +236,7 @@
			loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att + loss_pre * self.predictor_weight

			if self.use_1st_decoder_loss and pre_loss_att is not None:
			loss = loss + pre_loss_att
			loss = loss + (1 - self.ctc_weight) * pre_loss_att

			# Collect Attn branch stats
			stats["loss_att"] = loss_att.detach() if loss_att is not None else None