Revert "shfit to shift (#2266)" (#2336)
This reverts commit 1367973f9818d8e15c7bf52ad6ffba4ddb6ac2b2.
| | |
| | | funasr-1.x.x was designed to make model integration easier. Its core features are the registry and AutoModel: |
| | | |
| | | * The registry lets models be plugged in like building blocks during development, and is compatible with a variety of tasks; |
| | | |
| | | |
| | | * The newly designed AutoModel interface unifies the modelscope, huggingface, and funasr inference and training interfaces, and lets users freely choose which hub to download from; |
| | | |
| | | |
| | | * Supports model export, demo-level service deployment, and industrial-grade multi-concurrency service deployment; |
| | | |
| | | |
| | | * Unified inference and training scripts for academic and industrial models; |
| | | |
| | | |
| | | |
| | | # Quick start |
| | | |
| | |
| | | ``` |
| | | |
| | | * `model`(str): model name in the [model zoo](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo), or a path to a model on local disk |
| | | |
| | | |
| | | * `device`(str): `cuda:0` (default, GPU 0), use the GPU for inference; if `cpu`, inference runs on the CPU |
| | | |
| | | |
| | | * `ncpu`(int): `4` (default), the number of threads used for intra-op parallelism on the CPU |
| | | |
| | | |
| | | * `output_dir`(str): `None` (default); if set, the path where results are written |
| | | |
| | | |
| | | * `batch_size`(int): `1` (default), the number of samples per batch during decoding |
| | | |
| | | |
| | | * `hub`(str): `ms` (default), download the model from modelscope; if `hf`, download it from huggingface. |
| | | |
| | | |
| | | * `**kwargs`(dict): any parameter in `config.yaml` can be specified directly here, e.g., the maximum segment length in the VAD model, `max_single_segment_time=6000` (milliseconds). |
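The `**kwargs` pass-through above amounts to overlaying caller-supplied values on the defaults loaded from `config.yaml`. A minimal stand-in for that merge (illustrative only, not FunASR's actual internals):

```python
def build_kwargs(config, **overrides):
    """Overlay call-site overrides on config.yaml defaults; overrides win."""
    merged = dict(config)
    merged.update(overrides)
    return merged

# e.g. shortening the VAD model's maximum segment length
defaults = {"max_single_segment_time": 60000, "batch_size": 1}
kwargs = build_kwargs(defaults, max_single_segment_time=6000)
print(kwargs["max_single_segment_time"])  # 6000
```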
| | | |
| | | |
| | | |
| | | #### AutoModel inference |
| | | |
| | |
| | | ``` |
| | | |
| | | * wav file path, for example: asr\_example.wav |
| | | |
| | | |
| | | * pcm file path, for example: asr\_example.pcm; in this case you need to specify the audio sampling rate fs (default 16000) |
| | | |
| | | |
| | | * audio byte stream, for example: byte data from a microphone |
| | | |
| | | |
| | | * wav.scp, a kaldi-style wav list (`wav_id \t wav_path`), for example: |
| | | |
| | | |
| | | |
| | | ```plaintext |
| | | asr_example1 ./audios/asr_example1.wav
| | |
| | | In this input |
| | | |
| | | * audio samples, for example: `audio, rate = soundfile.read("asr_example_zh.wav")`, of type numpy.ndarray. Batch input is supported as a list: `[audio_sample1, audio_sample2, ..., audio_sampleN]` |
| | | |
| | | |
| | | * fbank input, batching supported; shape is \[batch, frames, dim\], type is torch.Tensor, for example |
| | | |
| | | |
| | | * `output_dir`: None (default); if set, the path where results are written |
| | | |
| | | |
| | | * `**kwargs`(dict): model-specific inference parameters, e.g., `beam_size=10`, `decoding_ctc_weight=0.1`. |
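The kaldi-style `wav.scp` input described above is just whitespace-separated `wav_id` / `wav_path` pairs, one per line; a minimal parser sketch (not FunASR's own loader):

```python
def parse_wav_scp(text):
    """Parse kaldi-style wav.scp lines (wav_id <sep> wav_path) into a dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        wav_id, wav_path = line.split(maxsplit=1)
        entries[wav_id] = wav_path
    return entries

scp = "asr_example1 ./audios/asr_example1.wav\nasr_example2 ./audios/asr_example2.wav"
print(parse_wav_scp(scp)["asr_example1"])  # ./audios/asr_example1.wav
```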
| | | |
| | | |
| | | |
| | | Detailed documentation: [https://github.com/modelscope/FunASR/blob/main/examples/README\_zh.md](https://github.com/modelscope/FunASR/blob/main/examples/README_zh.md) |
| | | |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | |
| | |
| | | "model": {"type" : "funasr"}, |
| | | "pipeline": {"type":"funasr-pipeline"}, |
| | | "model_name_in_hub": { |
| | | "ms":"", |
| | | "ms":"", |
| | | "hf":""}, |
| | | "file_path_metas": { |
| | | "init_param":"model.pt", |
| | | "init_param":"model.pt", |
| | | "config":"config.yaml", |
| | | "tokenizer_conf": {"bpemodel": "chn_jpn_yue_eng_ko_spectok.bpe.model"}, |
| | | "frontend_conf":{"cmvn_file": "am.mvn"}} |
| | |
| | | def forward( |
| | | self, |
| | | **kwargs, |
| | | ): |
| | | |
| | | def inference( |
| | | self, |
| | |
| | | ## Principles of Registration |
| | | |
| | | * Model: models are independent of each other. Each model needs its own new directory under funasr/models/. Do not use class inheritance!!! Do not import from other model directories; put everything you need into your own model directory!!! Do not modify existing model code!!! |
| | | |
| | | |
| | | * dataset, frontend, tokenizer: if an existing one can be reused, reuse it directly; if not, register a new one and then modify that copy. Do not modify the originals!!! |
| | | |
| | | |
| | | |
| | | # Standalone repositories |
| | | |
| | |
| | | model = AutoModel( |
| | | model="iic/SenseVoiceSmall", |
| | | trust_remote_code=True, |
| | | remote_code="./model.py", |
| | | ) |
| | | ``` |
| | | |
| | |
| | | print(text) |
| | | ``` |
| | | |
| | | Fine-tuning reference: [https://github.com/FunAudioLLM/SenseVoice/blob/main/finetune.sh](https://github.com/FunAudioLLM/SenseVoice/blob/main/finetune.sh) |
| | |
| | | funasr-1.x.x was designed to [**make model integration simpler**]; its core features are the registry and AutoModel: |
| | | |
| | | * The registry lets models be plugged in like building blocks during development, and is compatible with a variety of tasks; |
| | | |
| | | |
| | | * The newly designed AutoModel interface unifies the modelscope, huggingface, and funasr inference and training interfaces, and lets users freely choose which hub to download from; |
| | | |
| | | |
| | | * Supports model export, demo-level service deployment, and industrial-grade multi-concurrency service deployment; |
| | | |
| | | |
| | | * Unified inference and training scripts for academic and industrial models; |
| | | |
| | | |
| | | |
| | | # Quick start |
| | | |
| | |
| | | ``` |
| | | |
| | | * `model`(str): model name in the [model zoo](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo), or a path to a model on local disk |
| | | |
| | | |
| | | * `device`(str): `cuda:0` (default, GPU 0), use the GPU for inference; if `cpu`, inference runs on the CPU |
| | | |
| | | |
| | | * `ncpu`(int): `4` (default), the number of threads used for intra-op parallelism on the CPU |
| | | |
| | | |
| | | * `output_dir`(str): `None` (default); if set, the path where results are written |
| | | |
| | | |
| | | * `batch_size`(int): `1` (default), the number of samples per batch during decoding |
| | | |
| | | |
| | | * `hub`(str): `ms` (default), download the model from modelscope; if `hf`, download it from huggingface. |
| | | |
| | | |
| | | * `**kwargs`(dict): any parameter in `config.yaml` can be specified directly here, e.g., the maximum segment length in the VAD model, `max_single_segment_time=6000` (milliseconds). |
| | | |
| | | |
| | | |
| | | #### AutoModel inference |
| | | |
| | |
| | | ``` |
| | | |
| | | * wav file path, for example: asr\_example.wav |
| | | |
| | | |
| | | * pcm file path, for example: asr\_example.pcm; in this case the audio sampling rate fs must be specified (default 16000) |
| | | |
| | | |
| | | * audio byte stream, for example: byte data from a microphone |
| | | |
| | | |
| | | * wav.scp, a kaldi-style wav list (`wav_id \t wav_path`), for example: |
| | | |
| | | |
| | | |
| | | ```plaintext |
| | | asr_example1 ./audios/asr_example1.wav |
| | |
| | | With this input |
| | | |
| | | * audio samples, for example: `audio, rate = soundfile.read("asr_example_zh.wav")`, of type numpy.ndarray. Batch input is supported as a list: `[audio_sample1, audio_sample2, ..., audio_sampleN]` |
| | | |
| | | |
| | | * fbank input, batching supported; shape is \[batch, frames, dim\], type is torch.Tensor, for example |
| | | |
| | | |
| | | * `output_dir`: None (default); if set, the path where results are written |
| | | |
| | | |
| | | * `**kwargs`(dict): model-specific inference parameters, e.g., `beam_size=10`, `decoding_ctc_weight=0.1`. |
| | | |
| | | |
| | | |
| | | Detailed documentation: [https://github.com/modelscope/FunASR/blob/main/examples/README\_zh.md](https://github.com/modelscope/FunASR/blob/main/examples/README_zh.md) |
| | | |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | |
| | |
| | | "model": {"type" : "funasr"}, |
| | | "pipeline": {"type":"funasr-pipeline"}, |
| | | "model_name_in_hub": { |
| | | "ms":"", |
| | | "ms":"", |
| | | "hf":""}, |
| | | "file_path_metas": { |
| | | "init_param":"model.pt", |
| | | "init_param":"model.pt", |
| | | "config":"config.yaml", |
| | | "tokenizer_conf": {"bpemodel": "chn_jpn_yue_eng_ko_spectok.bpe.model"}, |
| | | "frontend_conf":{"cmvn_file": "am.mvn"}} |
| | |
| | | def forward( |
| | | self, |
| | | **kwargs, |
| | | ): |
| | | |
| | | def inference( |
| | | self, |
| | |
| | | ## Principles of registration |
| | | |
| | | * Model: models are independent of each other. Each model needs its own new directory under funasr/models/. Do not use class inheritance!!! Do not import from other model directories; put everything you need into your own model directory!!! Do not modify existing model code!!! |
| | | |
| | | |
| | | * dataset, frontend, tokenizer: if an existing one can be reused, reuse it directly; if not, register a new one and then modify that copy. Do not modify the originals!!! |
| | | |
| | | |
| | | |
| | | # Standalone repositories |
| | | |
| | |
| | | # trust_remote_code: `True` means the model code implementation is loaded from `remote_code`; `remote_code` specifies where the `model` code lives (for example, `model.py` in the current directory), supporting absolute paths, relative paths, and network URLs. |
| | | model = AutoModel( |
| | | model="iic/SenseVoiceSmall", |
| | | trust_remote_code=True, |
| | | remote_code="./model.py", |
| | | ) |
| | | ``` |
| | | |
| | |
| | | print(text) |
| | | ``` |
| | | |
| | | Fine-tuning reference: [https://github.com/FunAudioLLM/SenseVoice/blob/main/finetune.sh](https://github.com/FunAudioLLM/SenseVoice/blob/main/finetune.sh) |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | # frontend related |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | chunk_size: |
| | | - 16 |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | # decoder |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | |
| | | predictor: CifPredictorV3 |
| | | predictor_conf: |
| | |
| | | concat_after: bool = False, |
| | | att_layer_num: int = 6, |
| | | kernel_size: int = 21, |
| | | sanm_shfit: int = 0, |
| | | ): |
| | | super().__init__( |
| | | vocab_size=vocab_size, |
| | |
| | | |
| | | self.att_layer_num = att_layer_num |
| | | self.num_blocks = num_blocks |
| | | if sanm_shfit is None: |
| | | sanm_shfit = (kernel_size - 1) // 2 |
| | | self.decoders = repeat( |
| | | att_layer_num - 1, |
| | | lambda lnum: DecoderLayerSANM( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit |
| | | ), |
| | | MultiHeadedAttentionCrossAtt( |
| | | attention_heads, attention_dim, src_attention_dropout_rate |
| | |
| | | self.last_decoder = ContextualDecoderLayer( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit |
| | | ), |
| | | MultiHeadedAttentionCrossAtt( |
| | | attention_heads, attention_dim, src_attention_dropout_rate |
| | |
| | | lambda lnum: DecoderLayerSANM( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=0 |
| | | ), |
| | | None, |
| | | PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate), |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | |
| | | predictor: CifPredictorV2 |
| | | predictor_conf: |
| | |
| | | ctc_type: builtin |
| | | reduce: true |
| | | ignore_nan_grad: true |
| | | normalize: null |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | padding_idx: 0 |
| | | |
| | |
| | | def __init__(self, *args, **kwargs): |
| | | super().__init__(*args, **kwargs) |
| | | |
| | | def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | q_h, k_h, v_h, v = self.forward_qkv(x) |
| | | fsmn_memory = self.forward_fsmn(v, mask[0], mask_shfit_chunk) |
| | | q_h = q_h * self.d_k ** (-0.5) |
| | | scores = torch.matmul(q_h, k_h.transpose(-2, -1)) |
| | | att_outs = self.forward_attention(v_h, scores, mask[1], mask_att_chunk_encoder) |
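The attention core in this hunk is standard scaled dot-product attention, with the FSMN memory computed as a parallel branch. The query-scaling step alone can be sketched as follows (numpy standing in for torch; shapes illustrative):

```python
import numpy as np

def scaled_scores(q_h, k_h, d_k):
    """Scale queries by d_k**-0.5, then take dot products, as in forward()."""
    return (q_h * d_k ** -0.5) @ k_h.T

q = np.ones((2, 4))  # (time1, d_k)
k = np.ones((3, 4))  # (time2, d_k)
print(scaled_scores(q, k, 4).shape)  # (2, 3)
```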
| | |
| | | self.stochastic_depth_rate = stochastic_depth_rate |
| | | self.dropout_rate = dropout_rate |
| | | |
| | | def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | """Compute encoded features. |
| | | |
| | | Args: |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ), |
| | | ), |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | if not self.normalize_before: |
| | | x = self.norm2(x) |
| | | |
| | | return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder |
| | | |
| | | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0): |
| | | """Compute encoded features. |
| | |
| | | interctc_layer_idx: List[int] = [], |
| | | interctc_use_conditioning: bool = False, |
| | | kernel_size: int = 11, |
| | | sanm_shfit: int = 0, |
| | | selfattention_layer_type: str = "sanm", |
| | | ): |
| | | super().__init__() |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | ) |
| | | |
| | | encoder_selfattn_layer_args = ( |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | ) |
| | | |
| | | self.encoders0 = repeat( |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 5 |
| | | selfattention_layer_type: sanm |
| | | padding_idx: 0 |
| | | |
| | | tokenizer: CharTokenizer |
| | | tokenizer_conf: |
| | | unk_symbol: <unk> |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | predictor: CifPredictorV3 |
| | |
| | | ctc_type: builtin |
| | | reduce: true |
| | | ignore_nan_grad: true |
| | | |
| | | |
| | | normalize: null |
| | |
| | | concat_after: bool = False, |
| | | att_layer_num: int = 6, |
| | | kernel_size: int = 21, |
| | | sanm_shfit: int = 0, |
| | | lora_list: List[str] = None, |
| | | lora_rank: int = 8, |
| | | lora_alpha: int = 16, |
| | |
| | | |
| | | self.att_layer_num = att_layer_num |
| | | self.num_blocks = num_blocks |
| | | if sanm_shfit is None: |
| | | sanm_shfit = (kernel_size - 1) // 2 |
| | | self.decoders = repeat( |
| | | att_layer_num, |
| | | lambda lnum: DecoderLayerSANM( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit |
| | | ), |
| | | MultiHeadedAttentionCrossAtt( |
| | | attention_heads, |
| | |
| | | lambda lnum: DecoderLayerSANM( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=0 |
| | | ), |
| | | None, |
| | | PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate), |
| | |
| | | for _ in range(cache_num) |
| | | ] |
| | | return (tgt, memory, pre_acoustic_embeds, cache) |
| | | |
| | | |
| | | def is_optimizable(self): |
| | | return True |
| | | |
| | | |
| | | def get_input_names(self): |
| | | cache_num = len(self.model.decoders) + len(self.model.decoders2) |
| | | return ['tgt', 'memory', 'pre_acoustic_embeds'] \ |
| | | + ['cache_%d' % i for i in range(cache_num)] |
| | | |
| | | |
| | | def get_output_names(self): |
| | | cache_num = len(self.model.decoders) + len(self.model.decoders2) |
| | | return ['y'] \ |
| | | + ['out_cache_%d' % i for i in range(cache_num)] |
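The ONNX export helpers above just enumerate one cache tensor per decoder layer; the naming scheme in isolation (layer counts illustrative):

```python
def cache_io_names(num_decoders, num_decoders2):
    """Mirror get_input_names/get_output_names: one cache per decoder layer."""
    cache_num = num_decoders + num_decoders2
    inputs = ["tgt", "memory", "pre_acoustic_embeds"] + ["cache_%d" % i for i in range(cache_num)]
    outputs = ["y"] + ["out_cache_%d" % i for i in range(cache_num)]
    return inputs, outputs

ins, outs = cache_io_names(15, 1)
print(len(ins), len(outs))  # 19 17
```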
| | | |
| | | |
| | | def get_dynamic_axes(self): |
| | | ret = { |
| | | 'tgt': { |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | # decoder |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | |
| | | predictor: CifPredictorV2 |
| | | predictor_conf: |
| | |
| | | mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor( |
| | | encoder_out, |
| | | ys_pad, |
| | |
| | | mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, pre_peak_index = self.predictor( |
| | | encoder_out, |
| | | None, |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | chunk_size: |
| | | - 12 |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 5 |
| | | |
| | | predictor: CifPredictorV2 |
| | | predictor_conf: |
| | |
| | | n_feat, |
| | | dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit=0, |
| | | lora_list=None, |
| | | lora_rank=8, |
| | | lora_alpha=16, |
| | |
| | | ) |
| | | # padding |
| | | left_padding = (kernel_size - 1) // 2 |
| | | if sanm_shfit > 0: |
| | | left_padding = left_padding + sanm_shfit |
| | | right_padding = kernel_size - 1 - left_padding |
| | | self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0) |
| | | |
| | | def forward_fsmn(self, inputs, mask, mask_shfit_chunk=None): |
| | | b, t, d = inputs.size() |
| | | if mask is not None: |
| | | mask = torch.reshape(mask, (b, -1, 1)) |
| | | if mask_shfit_chunk is not None: |
| | | mask = mask * mask_shfit_chunk |
| | | inputs = inputs * mask |
| | | |
| | | x = inputs.transpose(1, 2) |
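The asymmetric padding computed in the hunk above determines how far the FSMN memory block looks into the past versus the future; it can be checked in isolation (the values below are the ones appearing in the configs in this diff):

```python
def fsmn_padding(kernel_size, sanm_shfit):
    """Left/right padding for the FSMN memory convolution, as computed above."""
    left_padding = (kernel_size - 1) // 2
    if sanm_shfit > 0:
        left_padding += sanm_shfit
    right_padding = kernel_size - 1 - left_padding
    return left_padding, right_padding

print(fsmn_padding(11, 0))  # (5, 5)  symmetric context
print(fsmn_padding(11, 5))  # (10, 0) shifted fully toward past frames
```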
| | |
| | | |
| | | return self.linear_out(x) # (batch, time1, d_model) |
| | | |
| | | def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | """Compute scaled dot product attention. |
| | | |
| | | Args: |
| | |
| | | |
| | | """ |
| | | q_h, k_h, v_h, v = self.forward_qkv(x) |
| | | fsmn_memory = self.forward_fsmn(v, mask, mask_shfit_chunk) |
| | | q_h = q_h * self.d_k ** (-0.5) |
| | | scores = torch.matmul(q_h, k_h.transpose(-2, -1)) |
| | | att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder) |
| | |
| | | |
| | | """ |
| | | |
| | | def __init__(self, n_feat, dropout_rate, kernel_size, sanm_shfit=0): |
| | | """Construct an MultiHeadedAttention object.""" |
| | | super().__init__() |
| | | |
| | |
| | | # padding |
| | | left_padding = (kernel_size - 1) // 2 |
| | | if sanm_shfit > 0: |
| | | left_padding = left_padding + sanm_shfit |
| | | right_padding = kernel_size - 1 - left_padding |
| | | self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0) |
| | | self.kernel_size = kernel_size |
| | | |
| | | def forward(self, inputs, mask, cache=None, mask_shfit_chunk=None): |
| | | """ |
| | | :param x: (#batch, time1, size). |
| | | :param mask: Mask tensor (#batch, 1, time) |
| | |
| | | if mask is not None: |
| | | mask = torch.reshape(mask, (b, -1, 1)) |
| | | # logging.info("in fsmn, mask: {}, {}".format(mask.size(), mask[0:100:50, :, :])) |
| | | if mask_shfit_chunk is not None: |
| | | # logging.info("in fsmn, mask_fsmn: {}, {}".format(mask_shfit_chunk.size(), mask_shfit_chunk[0:100:50, :, :])) |
| | | mask = mask * mask_shfit_chunk |
| | | # logging.info("in fsmn, mask_after_fsmn: {}, {}".format(mask.size(), mask[0:100:50, :, :])) |
| | | # print("in fsmn, mask", mask.size()) |
| | | # print("in fsmn, inputs", inputs.size()) |
| | |
| | | concat_after: bool = False, |
| | | att_layer_num: int = 6, |
| | | kernel_size: int = 21, |
| | | sanm_shfit: int = None, |
| | | concat_embeds: bool = False, |
| | | attention_dim: int = None, |
| | | tf2torch_tensor_name_prefix_torch: str = "decoder", |
| | |
| | | |
| | | self.att_layer_num = att_layer_num |
| | | self.num_blocks = num_blocks |
| | | if sanm_shfit is None: |
| | | sanm_shfit = (kernel_size - 1) // 2 |
| | | self.decoders = repeat( |
| | | att_layer_num, |
| | | lambda lnum: DecoderLayerSANM( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit |
| | | ), |
| | | MultiHeadedAttentionCrossAtt( |
| | | attention_heads, |
| | |
| | | attention_dim, |
| | | self_attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit=sanm_shfit, |
| | | ), |
| | | None, |
| | | PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate), |
| | |
| | | self.stochastic_depth_rate = stochastic_depth_rate |
| | | self.dropout_rate = dropout_rate |
| | | |
| | | def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | """Compute encoded features. |
| | | |
| | | Args: |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ), |
| | | ), |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | if not self.normalize_before: |
| | | x = self.norm2(x) |
| | | |
| | | return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder |
| | | |
| | | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0): |
| | | """Compute encoded features. |
| | |
| | | interctc_layer_idx: List[int] = [], |
| | | interctc_use_conditioning: bool = False, |
| | | kernel_size: int = 11, |
| | | sanm_shfit: int = 0, |
| | | lora_list: List[str] = None, |
| | | lora_rank: int = 8, |
| | | lora_alpha: int = 16, |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | lora_list, |
| | | lora_rank, |
| | | lora_alpha, |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | lora_list, |
| | | lora_rank, |
| | | lora_alpha, |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | # decoder |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | |
| | | |
| | | |
| | |
| | | stride: tuple = (10,), |
| | | pad_left: tuple = (0,), |
| | | encoder_att_look_back_factor: tuple = (1,), |
| | | shfit_fsmn: int = 0, |
| | | decoder_att_look_back_factor: tuple = (1,), |
| | | ): |
| | | |
| | |
| | | encoder_att_look_back_factor, |
| | | decoder_att_look_back_factor, |
| | | ) |
| | | self.shfit_fsmn = shfit_fsmn |
| | | self.x_add_mask = None |
| | | self.x_rm_mask = None |
| | | self.x_len = None |
| | | self.mask_shfit_chunk = None |
| | | self.mask_chunk_predictor = None |
| | | self.mask_att_chunk_encoder = None |
| | | self.mask_shift_att_chunk_decoder = None |
| | |
| | | stride, |
| | | pad_left, |
| | | encoder_att_look_back_factor, |
| | | chunk_size + self.shfit_fsmn, |
| | | decoder_att_look_back_factor, |
| | | ) |
| | | return ( |
| | |
| | | chunk_size, stride, pad_left, encoder_att_look_back_factor, chunk_size_pad_shift = ( |
| | | self.get_chunk_size(ind) |
| | | ) |
| | | shfit_fsmn = self.shfit_fsmn |
| | | pad_right = chunk_size - stride - pad_left |
| | | |
| | | chunk_num_batch = np.ceil(x_len / stride).astype(np.int32) |
| | | x_len_chunk = ( |
| | | (chunk_num_batch - 1) * chunk_size_pad_shift |
| | | + shfit_fsmn |
| | | + pad_left |
| | | + 0 |
| | | + x_len |
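The padded-length formula above can be exercised numerically (values illustrative):

```python
import math

def chunked_len(x_len, stride, chunk_size_pad_shift, shfit_fsmn, pad_left):
    """Total frame count after splitting x_len frames into shift-padded chunks."""
    chunk_num = math.ceil(x_len / stride)
    return (chunk_num - 1) * chunk_size_pad_shift + shfit_fsmn + pad_left + x_len

# e.g. 100 frames with stride 10 -> 10 chunks
print(chunked_len(100, 10, 16, 5, 0))  # 249
```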
| | |
| | | max_len_for_x_mask_tmp = max(chunk_size, x_len_max + pad_left) |
| | | x_add_mask = np.zeros([0, max_len_for_x_mask_tmp], dtype=dtype) |
| | | x_rm_mask = np.zeros([max_len_for_x_mask_tmp, 0], dtype=dtype) |
| | | mask_shfit_chunk = np.zeros([0, num_units], dtype=dtype) |
| | | mask_chunk_predictor = np.zeros([0, num_units_predictor], dtype=dtype) |
| | | mask_shift_att_chunk_decoder = np.zeros([0, 1], dtype=dtype) |
| | | mask_att_chunk_encoder = np.zeros([0, chunk_num * chunk_size_pad_shift], dtype=dtype) |
| | | for chunk_ids in range(chunk_num): |
| | | # x_mask add |
| | | fsmn_padding = np.zeros((shfit_fsmn, max_len_for_x_mask_tmp), dtype=dtype) |
| | | x_mask_cur = np.diag(np.ones(chunk_size, dtype=np.float32)) |
| | | x_mask_pad_left = np.zeros((chunk_size, chunk_ids * stride), dtype=dtype) |
| | | x_mask_pad_right = np.zeros((chunk_size, max_len_for_x_mask_tmp), dtype=dtype) |
| | |
| | | x_add_mask = np.concatenate([x_add_mask, x_add_mask_fsmn], axis=0) |
| | | |
| | | # x_mask rm |
| | | fsmn_padding = np.zeros((max_len_for_x_mask_tmp, shfit_fsmn), dtype=dtype) |
| | | padding_mask_left = np.zeros((max_len_for_x_mask_tmp, pad_left), dtype=dtype) |
| | | padding_mask_right = np.zeros((max_len_for_x_mask_tmp, pad_right), dtype=dtype) |
| | | x_mask_cur = np.diag(np.ones(stride, dtype=dtype)) |
| | |
| | | x_rm_mask = np.concatenate([x_rm_mask, x_rm_mask_cur_fsmn], axis=1) |
| | | |
| | | # fsmn_padding_mask |
| | | pad_shfit_mask = np.zeros([shfit_fsmn, num_units], dtype=dtype) |
| | | ones_1 = np.ones([chunk_size, num_units], dtype=dtype) |
| | | mask_shfit_chunk_cur = np.concatenate([pad_shfit_mask, ones_1], axis=0) |
| | | mask_shfit_chunk = np.concatenate([mask_shfit_chunk, mask_shfit_chunk_cur], axis=0) |
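Each per-chunk FSMN mask built above is zeros over the shifted-in padding rows followed by ones over the real chunk rows; in isolation (sizes illustrative):

```python
import numpy as np

def mask_shfit_chunk(shfit_fsmn, chunk_size, num_units=1):
    """Zeros over FSMN padding rows, ones over real chunk rows, as built above."""
    pad = np.zeros((shfit_fsmn, num_units))
    ones = np.ones((chunk_size, num_units))
    return np.concatenate([pad, ones], axis=0)

m = mask_shfit_chunk(5, 16)
print(m.shape)  # (21, 1)
```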
| | | |
| | | # predictor mask |
| | | zeros_1 = np.zeros([shfit_fsmn + pad_left, num_units_predictor], dtype=dtype) |
| | | ones_2 = np.ones([stride, num_units_predictor], dtype=dtype) |
| | | zeros_3 = np.zeros( |
| | | [chunk_size - stride - pad_left, num_units_predictor], dtype=dtype |
| | |
| | | ) |
| | | |
| | | # encoder att mask |
| | | zeros_1_top = np.zeros([shfit_fsmn, chunk_num * chunk_size_pad_shift], dtype=dtype) |
| | | |
| | | zeros_2_num = max(chunk_ids - encoder_att_look_back_factor, 0) |
| | | zeros_2 = np.zeros([chunk_size, zeros_2_num * chunk_size_pad_shift], dtype=dtype) |
| | | |
| | | encoder_att_look_back_num = max(chunk_ids - zeros_2_num, 0) |
| | | zeros_2_left = np.zeros([chunk_size, shfit_fsmn], dtype=dtype) |
| | | ones_2_mid = np.ones([stride, stride], dtype=dtype) |
| | | zeros_2_bottom = np.zeros([chunk_size - stride, stride], dtype=dtype) |
| | | zeros_2_right = np.zeros([chunk_size, chunk_size - stride], dtype=dtype) |
| | |
| | | ones_2 = np.concatenate([zeros_2_left, ones_2, zeros_2_right], axis=1) |
| | | ones_2 = np.tile(ones_2, [1, encoder_att_look_back_num]) |
| | | |
| | | zeros_3_left = np.zeros([chunk_size, shfit_fsmn], dtype=dtype) |
| | | ones_3_right = np.ones([chunk_size, chunk_size], dtype=dtype) |
| | | ones_3 = np.concatenate([zeros_3_left, ones_3_right], axis=1) |
| | | |
| | |
| | | ) |
| | | |
| | | # decoder fsmn_shift_att_mask |
| | | zeros_1 = np.zeros([shfit_fsmn, 1]) |
| | | ones_1 = np.ones([chunk_size, 1]) |
| | | mask_shift_att_chunk_decoder_cur = np.concatenate([zeros_1, ones_1], axis=0) |
| | | mask_shift_att_chunk_decoder = np.concatenate( |
| | |
| | | self.x_len_chunk = x_len_chunk |
| | | self.x_rm_mask = x_rm_mask[:x_len_max, :x_len_chunk_max] |
| | | self.x_len = x_len |
| | | self.mask_shfit_chunk = mask_shfit_chunk[:x_len_chunk_max, :] |
| | | self.mask_chunk_predictor = mask_chunk_predictor[:x_len_chunk_max, :] |
| | | self.mask_att_chunk_encoder = mask_att_chunk_encoder[:x_len_chunk_max, :x_len_chunk_max] |
| | | self.mask_shift_att_chunk_decoder = mask_shift_att_chunk_decoder[:x_len_chunk_max, :] |
| | |
| | | self.x_len_chunk, |
| | | self.x_rm_mask, |
| | | self.x_len, |
| | | self.mask_shfit_chunk, |
| | | self.mask_chunk_predictor, |
| | | self.mask_att_chunk_encoder, |
| | | self.mask_shift_att_chunk_decoder, |
| | |
| | | x = torch.from_numpy(x).type(dtype).to(device) |
| | | return x |
| | | |
| | | def get_mask_shfit_chunk( |
| | | self, chunk_outs=None, device="cpu", batch_size=1, num_units=1, idx=4, dtype=torch.float32 |
| | | ): |
| | | with torch.no_grad(): |
| | |
| | | concat_after: bool = False, |
| | | att_layer_num: int = 6, |
| | | kernel_size: int = 21, |
| | | sanm_shfit: int = None, |
| | | concat_embeds: bool = False, |
| | | attention_dim: int = None, |
| | | tf2torch_tensor_name_prefix_torch: str = "decoder", |
| | |
| | | |
| | | self.att_layer_num = att_layer_num |
| | | self.num_blocks = num_blocks |
| | | if sanm_shfit is None: |
| | | sanm_shfit = (kernel_size - 1) // 2 |
| | | self.decoders = repeat( |
| | | att_layer_num, |
| | | lambda lnum: DecoderLayerSANM( |
| | | attention_dim, |
| | | MultiHeadedAttentionSANMDecoder( |
| | | attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit |
| | | ), |
| | | MultiHeadedAttentionCrossAtt( |
| | | attention_heads, |
| | |
| | | attention_dim, |
| | | self_attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit=sanm_shfit, |
| | | ), |
| | | None, |
| | | PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate), |
| | |
| | | self.stochastic_depth_rate = stochastic_depth_rate |
| | | self.dropout_rate = dropout_rate |
| | | |
| | | def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | """Compute encoded features. |
| | | |
| | | Args: |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ), |
| | | ), |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | if not self.normalize_before: |
| | | x = self.norm2(x) |
| | | |
| | | return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder |
| | | |
| | | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0): |
| | | """Compute encoded features. |
| | |
| | | interctc_layer_idx: List[int] = [], |
| | | interctc_use_conditioning: bool = False, |
| | | kernel_size: int = 11, |
| | | sanm_shfit: int = 0, |
| | | selfattention_layer_type: str = "sanm", |
| | | chunk_size: Union[int, Sequence[int]] = (16,), |
| | | stride: Union[int, Sequence[int]] = (10,), |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | ) |
| | | |
| | | encoder_selfattn_layer_args = ( |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | ) |
| | | self.encoders0 = repeat( |
| | | 1, |
| | |
| | | assert 0 < min(interctc_layer_idx) and max(interctc_layer_idx) < num_blocks |
| | | self.interctc_use_conditioning = interctc_use_conditioning |
| | | self.conditioning_layer = None |
| | | shfit_fsmn = (kernel_size - 1) // 2 |
| | | self.overlap_chunk_cls = overlap_chunk( |
| | | chunk_size=chunk_size, |
| | | stride=stride, |
| | | pad_left=pad_left, |
| | | shfit_fsmn=shfit_fsmn, |
| | | encoder_att_look_back_factor=encoder_att_look_back_factor, |
| | | decoder_att_look_back_factor=decoder_att_look_back_factor, |
| | | ) |
| | |
| | | else: |
| | | xs_pad = self.embed(xs_pad) |
| | | |
| | | mask_shfit_chunk, mask_att_chunk_encoder = None, None |
| | | if self.overlap_chunk_cls is not None: |
| | | ilens = masks.squeeze(1).sum(1) |
| | | chunk_outs = self.overlap_chunk_cls.gen_chunk_mask(ilens, ind) |
| | | xs_pad, ilens = self.overlap_chunk_cls.split_chunk(xs_pad, ilens, chunk_outs=chunk_outs) |
| | | masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device) |
| | | mask_shfit_chunk = self.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | chunk_outs, xs_pad.device, xs_pad.size(0), dtype=xs_pad.dtype |
| | | ) |
| | | mask_att_chunk_encoder = self.overlap_chunk_cls.get_mask_att_chunk_encoder( |
| | | chunk_outs, xs_pad.device, xs_pad.size(0), dtype=xs_pad.dtype |
| | | ) |
| | | |
| | | encoder_outs = self.encoders0(xs_pad, masks, None, mask_shfit_chunk, mask_att_chunk_encoder) |
| | | xs_pad, masks = encoder_outs[0], encoder_outs[1] |
| | | intermediate_outs = [] |
| | | if len(self.interctc_layer_idx) == 0: |
| | | encoder_outs = self.encoders( |
| | | xs_pad, masks, None, mask_shfit_chunk, mask_att_chunk_encoder |
| | | ) |
| | | xs_pad, masks = encoder_outs[0], encoder_outs[1] |
| | | else: |
| | | for layer_idx, encoder_layer in enumerate(self.encoders): |
| | | encoder_outs = encoder_layer( |
| | | xs_pad, masks, None, mask_shfit_chunk, mask_att_chunk_encoder |
| | | ) |
| | | xs_pad, masks = encoder_outs[0], encoder_outs[1] |
| | | if layer_idx + 1 in self.interctc_layer_idx: |
| | |
| | | mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor( |
| | | encoder_out, |
| | | ys_out_pad, |
| | |
| | | mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor( |
| | | encoder_out, |
| | | ys_out_pad, |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | # decoder |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | |
| | | predictor: CifPredictorV2 |
| | | predictor_conf: |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | # decoder |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | |
| | | # seaco decoder |
| | | seaco_decoder: ParaformerSANMDecoder |
| | |
| | | self_attention_dropout_rate: 0.1 |
| | | src_attention_dropout_rate: 0.1 |
| | | kernel_size: 21 |
| | | sanm_shfit: 0 |
| | | use_output_layer: false |
| | | wo_input_layer: true |
| | | |
| | |
| | | n_feat, |
| | | dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit=0, |
| | | lora_list=None, |
| | | lora_rank=8, |
| | | lora_alpha=16, |
| | |
| | | ) |
| | | # padding |
| | | left_padding = (kernel_size - 1) // 2 |
| | | if sanm_shfit > 0: |
| | | left_padding = left_padding + sanm_shfit |
| | | right_padding = kernel_size - 1 - left_padding |
| | | self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0) |
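A standalone sketch of the padding rule above (pure arithmetic, no torch needed): the `kernel_size - 1` padding frames are centered, then `sanm_shfit` extra frames are moved to the left side when it is positive:

```python
def fsmn_padding(kernel_size, sanm_shfit=0):
    # Same rule as above: center the (kernel_size - 1) padding, then
    # shift sanm_shfit extra frames onto the left side
    left_padding = (kernel_size - 1) // 2
    if sanm_shfit > 0:
        left_padding = left_padding + sanm_shfit
    right_padding = kernel_size - 1 - left_padding
    return left_padding, right_padding

print(fsmn_padding(11))     # (5, 5)
print(fsmn_padding(21, 5))  # (15, 5)
```

Note that the left and right paddings always sum to `kernel_size - 1`, so the output length of the padded convolution matches the input length.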
| | | |
| | | def forward_fsmn(self, inputs, mask, mask_shfit_chunk=None): |
| | | b, t, d = inputs.size() |
| | | if mask is not None: |
| | | mask = torch.reshape(mask, (b, -1, 1)) |
| | | if mask_shfit_chunk is not None: |
| | | mask = mask * mask_shfit_chunk |
| | | inputs = inputs * mask |
| | | |
| | | x = inputs.transpose(1, 2) |
| | |
| | | |
| | | return self.linear_out(x) # (batch, time1, d_model) |
| | | |
| | | def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | """Compute scaled dot product attention. |
| | | |
| | | Args: |
| | |
| | | |
| | | """ |
| | | q_h, k_h, v_h, v = self.forward_qkv(x) |
| | | fsmn_memory = self.forward_fsmn(v, mask, mask_shfit_chunk) |
| | | q_h = q_h * self.d_k ** (-0.5) |
| | | scores = torch.matmul(q_h, k_h.transpose(-2, -1)) |
| | | att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder) |
| | |
| | | self.stochastic_depth_rate = stochastic_depth_rate |
| | | self.dropout_rate = dropout_rate |
| | | |
| | | def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None): |
| | | """Compute encoded features. |
| | | |
| | | Args: |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ), |
| | | ), |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | self.self_attn( |
| | | x, |
| | | mask, |
| | | mask_shfit_chunk=mask_shfit_chunk, |
| | | mask_att_chunk_encoder=mask_att_chunk_encoder, |
| | | ) |
| | | ) |
| | |
| | | if not self.normalize_before: |
| | | x = self.norm2(x) |
| | | |
| | | return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder |
| | | |
| | | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0): |
| | | """Compute encoded features. |
| | |
| | | positionwise_conv_kernel_size: int = 1, |
| | | padding_idx: int = -1, |
| | | kernel_size: int = 11, |
| | | sanm_shfit: int = 0, |
| | | selfattention_layer_type: str = "sanm", |
| | | **kwargs, |
| | | ): |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | ) |
| | | encoder_selfattn_layer_args = ( |
| | | attention_heads, |
| | |
| | | output_size, |
| | | attention_dropout_rate, |
| | | kernel_size, |
| | | sanm_shfit, |
| | | ) |
| | | |
| | | self.encoders0 = nn.ModuleList( |
| | |
| | | right_padding = kernel_size - 1 - left_padding |
| | | self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0) |
| | | |
| | | def forward(self, inputs, mask, mask_shfit_chunk=None): |
| | | b, t, d = inputs.size() |
| | | if mask is not None: |
| | | mask = torch.reshape(mask, (b, -1, 1)) |
| | | if mask_shfit_chunk is not None: |
| | | mask = mask * mask_shfit_chunk |
| | | |
| | | inputs = inputs * mask |
| | | x = inputs.transpose(1, 2) |
| | |
| | | mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor( |
| | | encoder_out, |
| | | ys_out_pad, |
| | |
| | | mask_chunk_predictor = self.encoder2.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder2.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor2( |
| | | encoder_out, |
| | | ys_out_pad, |
| | |
| | | mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor( |
| | | encoder_out, |
| | | ys_out_pad, |
| | |
| | | mask_chunk_predictor = self.encoder2.overlap_chunk_cls.get_mask_chunk_predictor( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | mask_shfit_chunk = self.encoder2.overlap_chunk_cls.get_mask_shfit_chunk( |
| | | None, device=encoder_out.device, batch_size=encoder_out.size(0) |
| | | ) |
| | | encoder_out = encoder_out * mask_shfit_chunk |
| | | pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor2( |
| | | encoder_out, |
| | | ys_out_pad, |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | chunk_size: |
| | | - 20 |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 21 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | chunk_size: |
| | | - 45 |
| | |
| | | private float _src_attention_dropout_rate = 0.1F; |
| | | private int _att_layer_num = 16; |
| | | private int _kernel_size = 11; |
| | | private int _sanm_shfit = 0; |
| | | |
| | | public int attention_heads { get => _attention_heads; set => _attention_heads = value; } |
| | | public int linear_units { get => _linear_units; set => _linear_units = value; } |
| | |
| | | public float src_attention_dropout_rate { get => _src_attention_dropout_rate; set => _src_attention_dropout_rate = value; } |
| | | public int att_layer_num { get => _att_layer_num; set => _att_layer_num = value; } |
| | | public int kernel_size { get => _kernel_size; set => _kernel_size = value; } |
| | | |
| | | public int sanm_shfit { get => _sanm_shfit; set => _sanm_shfit = value; } |
| | | |
| | | } |
| | | } |
| | |
| | | private string _pos_enc_class = "SinusoidalPositionEncoder"; |
| | | private bool _normalize_before = true; |
| | | private int _kernel_size = 11; |
| | | private int _sanm_shfit = 0; |
| | | private string _selfattention_layer_type = "sanm"; |
| | | |
| | | public int output_size { get => _output_size; set => _output_size = value; } |
| | |
| | | public string pos_enc_class { get => _pos_enc_class; set => _pos_enc_class = value; } |
| | | public bool normalize_before { get => _normalize_before; set => _normalize_before = value; } |
| | | public int kernel_size { get => _kernel_size; set => _kernel_size = value; } |
| | | public int sanm_shfit { get => _sanm_shfit; set => _sanm_shfit = value; } |
| | | public string selfattention_layer_type { get => _selfattention_layer_type; set => _selfattention_layer_type = value; } |
| | | } |
| | | } |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | chunk_size: |
| | | - 12 |
| | |
| | | src_attention_dropout_rate: 0.1 |
| | | att_layer_num: 16 |
| | | kernel_size: 11 |
| | | sanm_shfit: 5 |
| | | predictor: cif_predictor_v2 |
| | | predictor_conf: |
| | | idim: 512 |
| | |
| | | pos_enc_class: SinusoidalPositionEncoder |
| | | normalize_before: true |
| | | kernel_size: 11 |
| | | sanm_shfit: 0 |
| | | selfattention_layer_type: sanm |
| | | |
| | | |