Merge pull request #1411 from alibaba-damo-academy/dev_gzf
Dev gzf
9 files modified
7 files added
1 file renamed
15 files deleted
| | |
| | | | fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M | |
| | | | fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) | timestamp prediction | 5000 hours, Mandarin | 38M | |
| | | | cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) | speaker verification/diarization | 5000 hours | 7.2M | |
| | | | whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [🤗]() ) | speech recognition, with timestamps, non-streaming | multilingual | 1G | |
| | | |
| | | |
| | | |
| | |
| | | (Note: ⭐ indicates a ModelScope model repository link, 🤗 indicates a Huggingface model repository link) |
| | | |
| | | |
| | | | Model Name | Task Details | Training Data | Parameters | |
| | | |:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------:|:------------:|:----:| |
| | | | paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-tp) ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M | |
| | | | paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) ) | speech recognition, streaming | 60000 hours, Mandarin | 220M | |
| | | | paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M | |
| | | | conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M | |
| | | | ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) | punctuation restoration | 100M, Mandarin and English | 1.1G | |
| | | | fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) | voice activity detection, streaming | 5000 hours, Mandarin and English | 0.4M | |
| | | | fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) | character-level timestamp prediction | 50000 hours, Mandarin | 38M | |
| | | | cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) | speaker verification/diarization | 5000 hours | 7.2M | |
| | | | whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [🤗]() ) | speech recognition, with timestamps, non-streaming | multilingual | 1G | |
| | | |
| | | |
| | | <a name="快速开始"></a> |
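| | | 
| | | As a quick orientation, a minimal sketch of loading one of the models above through `AutoModel` (assuming the shorthand names in the table resolve to the linked ModelScope repositories, as in the ModelScope name-mapping table touched later in this PR):
| | | 
| | | from funasr import AutoModel
| | | 
| | | model = AutoModel(model="paraformer-zh")  # shorthand name from the table above
| | | res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav")
| | | print(res)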
| New file |
| | |
| | | #!/usr/bin/env python3 |
| | | # -*- encoding: utf-8 -*- |
| | | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved. |
| | | # MIT License (https://opensource.org/licenses/MIT) |
| | | |
| | | from funasr import AutoModel |
| | | |
| | | model = AutoModel(model="iic/speech_whisper-large_asr_multilingual", |
| | | model_revision="v2.0.4", |
| | | ) |
| | | |
| | | res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav", language=None) |
| | | print(res) |
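| | | 
| | | # res is expected to be a list of {"key": ..., "text": ...} dicts. Judging from WhisperWarp.inference
| | | # in this PR, the `language` kwarg is forwarded to whisper.DecodingOptions, so passing e.g.
| | | # language="zh" to generate() should force Mandarin decoding instead of auto-detection.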
| New file |
| | |
| | | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved. |
| | | # MIT License (https://opensource.org/licenses/MIT) |
| | | |
| | | # method1, inference from model hub |
| | | |
| | | # for more input types, please refer to readme.md
| | | input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav" |
| | | |
| | | output_dir="./outputs/debug" |
| | | |
| | | model="iic/speech_whisper-large_asr_multilingual" |
| | | model_revision="v2.0.4" |
| | | |
| | | device="cuda:0" # "cuda:0" for gpu0, "cuda:1" for gpu1, "cpu" |
| | | |
| | | python -m funasr.bin.inference \ |
| | | ++model=${model} \ |
| | | ++model_revision=${model_revision} \ |
| | | ++input="${input}" \ |
| | | ++output_dir="${output_dir}" \ |
| | | ++device="${device}" \ |
| New file |
| | |
| | | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved. |
| | | # MIT License (https://opensource.org/licenses/MIT) |
| | | |
| | | # method2, inference from local model |
| | | |
| | | # for more input types, please refer to readme.md
| | | input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav" |
| | | |
| | | output_dir="./outputs/debug" |
| | | |
| | | workspace=`pwd` |
| | | |
| | | # download model |
| | | local_path_root=${workspace}/modelscope_models |
| | | mkdir -p ${local_path_root} |
| | | local_path=${local_path_root}/speech_whisper-large_asr_multilingual |
| | | git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path} |
| | | |
| | | device="cuda:0" # "cuda:0" for gpu0, "cuda:1" for gpu1, "cpu" |
| | | |
| | | config="config.yaml" |
| | | init_param="${local_path}/large-v2.pt" |
| | | |
| | | python -m funasr.bin.inference \ |
| | | --config-path "${local_path}" \ |
| | | --config-name "${config}" \ |
| | | ++init_param="${init_param}" \ |
| | | ++input="${input}" \ |
| | | ++output_dir="${output_dir}" \ |
| | | ++device="${device}" \ |
| | | |
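| | | # Note: given the DatadirWriter usage elsewhere in this PR, the recognition text should end up
| | | # under the output directory, e.g. ${output_dir}/1best_recog/text.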
| | | |
| | | |
| | | |
| | |
| | | |
| | | kwargs["token_list"] = tokenizer.token_list if hasattr(tokenizer, "token_list") else None |
| | | kwargs["token_list"] = tokenizer.get_vocab() if hasattr(tokenizer, "get_vocab") else kwargs["token_list"] |
| | | vocab_size = len(kwargs["token_list"]) if kwargs["token_list"] is not None else -1
| | | else: |
| | | vocab_size = -1 |
| | | |
| | | # build frontend |
| | | frontend = kwargs.get("frontend", None) |
| | | kwargs["input_size"] = None |
| | | if frontend is not None: |
| | | frontend_class = tables.frontend_classes.get(frontend) |
| | | frontend = frontend_class(**kwargs["frontend_conf"]) |
| | | kwargs["frontend"] = frontend |
| | | kwargs["input_size"] = frontend.output_size() |
| | | kwargs["input_size"] = frontend.output_size() if hasattr(frontend, "output_size") else None |
| | | |
| | | # build model |
| | | model_class = tables.model_classes.get(kwargs["model"]) |
| | |
| | | |
| | | outputs[key] = torch.nn.utils.rnn.pad_sequence(data_list, batch_first=True, padding_value=pad_value) |
| | | return outputs |
| | | |
| | | |
| | | @tables.register("dataset_classes", "AudioLLMARDataset") |
| | | class AudioLLMARDataset(torch.utils.data.Dataset): |
| | | """ |
| | | AudioLLMDataset |
| | | """ |
| | | |
| | | def __init__(self, |
| | | path, |
| | | index_ds: str = None, |
| | | frontend=None, |
| | | tokenizer=None, |
| | | int_pad_value: int = -1, |
| | | float_pad_value: float = 0.0, |
| | | **kwargs): |
| | | super().__init__() |
| | | index_ds_class = tables.index_ds_classes.get(index_ds) |
| | | self.index_ds = index_ds_class(path, **kwargs) |
| | | preprocessor_speech = kwargs.get("preprocessor_speech", None) |
| | | if preprocessor_speech: |
| | | preprocessor_speech_class = tables.preprocessor_classes.get(preprocessor_speech) |
| | | preprocessor_speech = preprocessor_speech_class(**kwargs.get("preprocessor_speech_conf", {})) |
| | | self.preprocessor_speech = preprocessor_speech |
| | | preprocessor_text = kwargs.get("preprocessor_text", None) |
| | | if preprocessor_text: |
| | | preprocessor_text_class = tables.preprocessor_classes.get(preprocessor_text) |
| | | preprocessor_text = preprocessor_text_class(**kwargs.get("preprocessor_text_conf", {})) |
| | | self.preprocessor_text = preprocessor_text |
| | | |
| | | self.frontend = frontend |
| | | self.fs = 16000 if frontend is None else frontend.fs |
| | | self.data_type = "sound" |
| | | self.tokenizer = tokenizer |
| | | |
| | | self.float_pad_value = float_pad_value |
| | | self.prompt = kwargs.get("prompt", "Transcribe speech to text.") |
| | | self.prompt_pre = "USER: \nINSTRUCTION: {}\nINPUT: ".format( |
| | | self.prompt) # "USER: \nINSTRUCTION: {}\nnINPUT: {}\nASSISTANT: " |
| | | self.prompt_af = "" |
| | | self.IGNORE_INDEX = kwargs.get("IGNORE_INDEX", -100) |
| | | self.int_pad_value = self.IGNORE_INDEX |
| | | |
| | | def get_source_len(self, index): |
| | | item = self.index_ds[index] |
| | | return self.index_ds.get_source_len(item) |
| | | |
| | | def get_target_len(self, index): |
| | | item = self.index_ds[index] |
| | | return self.index_ds.get_target_len(item) |
| | | |
| | | def __len__(self): |
| | | return len(self.index_ds) |
| | | |
| | | def __getitem__(self, index): |
| | | item = self.index_ds[index] |
| | | # import pdb; |
| | | # pdb.set_trace() |
| | | source = item["source"] |
| | | data_src = load_audio_text_image_video(source, fs=self.fs) |
| | | if self.preprocessor_speech: |
| | | data_src = self.preprocessor_speech(data_src, fs=self.fs) |
| | | speech, speech_lengths = extract_fbank(data_src, data_type=self.data_type, frontend=self.frontend, |
| | | is_final=True) # speech: [b, T, d] |
| | | speech = speech.squeeze(0) |
| | | |
| | | target = item["target"] |
| | | if self.preprocessor_text: |
| | | target = self.preprocessor_text(target) |
| | | |
| | | prompt_ids_pre = self.tokenizer.encode(self.prompt_pre) # [bos,prompt] |
| | | prompt_pre_length = len(prompt_ids_pre) |
| | | |
| | | prompt_input = "{}{}".format(self.prompt_pre, target) |
| | | prompt_input_ids = self.tokenizer.encode(prompt_input) |
| | | audio_length = len(prompt_input_ids) - prompt_pre_length |
| | | input_ids = prompt_input_ids + [self.tokenizer.pad_token_id] |
| | | input_ids = torch.tensor(input_ids, dtype=torch.int64) # [bos, prompt, input, pad] |
| | | input_ids[prompt_pre_length:] = -1 # [bos, prompt,-1,-1] |
| | | attention_mask = input_ids.ge(-1) # [true, true, true, true], length mask |
| | | |
| | | prompt_answer = "{}{}".format(self.prompt_pre, target) |
| | | prompt_answer_ids = self.tokenizer.encode(prompt_answer) |
| | | answer_length = len(prompt_answer_ids) - prompt_pre_length |
| | | labels_ids = copy.deepcopy(prompt_input_ids) + [self.tokenizer.eos_token_id] |
| | | labels_ids = torch.tensor(labels_ids, dtype=torch.int64) # [bos, prompt, input, eos] |
| | | labels_ids[:prompt_pre_length] = -1 # [-1, -1, input, eos] |
| | | label_mask = labels_ids.ge(0) # [False,False,True,True] |
| | | labels_ids[~label_mask] = self.IGNORE_INDEX # [-100,-100,input,eos] |
| | | |
| | | audio_mask = [0] * prompt_pre_length + [1] * audio_length + [0] |
| | | audio_mask = torch.tensor(audio_mask, dtype=torch.float32) |
| | | |
| | | ids = self.tokenizer.encode(target) # token ids is different from labels_ids |
| | | text = torch.tensor(ids, dtype=torch.int64) |
| | | text_lengths = torch.tensor([len(ids)], dtype=torch.int32) |
| | | |
| | | return {"speech": speech, |
| | | "speech_lengths": speech_lengths, |
| | | "text": text, |
| | | "text_lengths": text_lengths, |
| | | "input_ids": input_ids, |
| | | "attention_mask": attention_mask, |
| | | "labels_ids": labels_ids, |
| | | "label_mask": label_mask, |
| | | "audio_mask": audio_mask, |
| | | } |
| | | |
| | | def collator(self, samples: list = None): |
| | | outputs = {} |
| | | for sample in samples: |
| | | for key in sample.keys(): |
| | | if key not in outputs: |
| | | outputs[key] = [] |
| | | outputs[key].append(sample[key]) |
| | | |
| | | for key, data_list in outputs.items(): |
| | | if isinstance(data_list[0], torch.Tensor): |
| | | if data_list[0].dtype == torch.int64: |
| | | |
| | | pad_value = self.int_pad_value |
| | | else: |
| | | pad_value = self.float_pad_value |
| | | |
| | | outputs[key] = torch.nn.utils.rnn.pad_sequence(data_list, batch_first=True, padding_value=pad_value) |
| | | return outputs |
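| | | 
| | | 
| | | if __name__ == "__main__":
| | |     # Standalone sketch (hypothetical tensors, not part of this PR) of the padding rule the
| | |     # collator above applies: int64 tensors are padded with IGNORE_INDEX (-100) so the loss
| | |     # ignores the padding, float tensors are padded with 0.0.
| | |     ids = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]  # e.g. labels_ids
| | |     feats = [torch.rand(3, 80), torch.rand(5, 80)]         # e.g. fbank features
| | |     print(torch.nn.utils.rnn.pad_sequence(ids, batch_first=True, padding_value=-100))
| | |     print(torch.nn.utils.rnn.pad_sequence(feats, batch_first=True, padding_value=0.0).shape)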
| | |
| | | model_or_path = name_maps_ms[model_or_path] |
| | | model_revision = kwargs.get("model_revision") |
| | | if not os.path.exists(model_or_path): |
| | | model_or_path = get_or_download_model_dir(model_or_path, model_revision, is_training=kwargs.get("is_training"), check_latest=kwargs.get("check_latest", True))
| | | kwargs["model_path"] = model_or_path |
| | | |
| | | if os.path.exists(os.path.join(model_or_path, "configuration.json")): |
| | | with open(os.path.join(model_or_path, "configuration.json"), 'r', encoding='utf-8') as f: |
| | | conf_json = json.load(f) |
| | | |
| | | cfg = {} |
| | | if "file_path_metas" in conf_json:
| | |     add_file_root_path(model_or_path, conf_json["file_path_metas"], cfg)
| | | cfg.update(kwargs) |
| | | config = OmegaConf.load(cfg["config"]) |
| | | kwargs = OmegaConf.merge(config, cfg) |
| | |
| | | "ct-punc-c": "damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", |
| | | "fa-zh": "damo/speech_timestamp_prediction-v1-16k-offline", |
| | | "cam++": "damo/speech_campplus_sv_zh-cn_16k-common", |
| | | "whisper-large-v2": "iic/speech_whisper-large_asr_multilingual", |
| | | } |
| | | |
| | | name_maps_hf = { |
| | |
| | | def __init__( |
| | | self, |
| | | fs: int = 16000, |
| | | whisper_model: str = "large-v3", |
| | | whisper_model: str = None, |
| | | do_pad_trim: bool = True, |
| | | n_mels: int = 80, |
| | | ): |
| | | super().__init__() |
| | | assert fs == 16000 |
| | |
| | | self.pad_samples = N_SAMPLES |
| | | self.frame_shift = self.hop_length |
| | | self.lfr_n = 1 |
| | | self.n_mels = n_mels
| | | |
| | | self.mel_filters = whisper.audio.mel_filters |
| | | self.do_pad_trim = do_pad_trim |
| | | if do_pad_trim: |
| | | self.pad_or_trim = whisper.pad_or_trim |
| | | |
| | | # assert whisper_model in whisper.available_models()
| | | |
| | | def output_size(self) -> int: |
| | | return self.n_mels |
| New file |
| | |
| | | import torch |
| | | import torch.nn as nn |
| | | |
| | | from funasr.register import tables |
| | | |
| | | @tables.register("adaptor_classes", "Linear") |
| | | class Linear(nn.Module): |
| | | def __init__(self, downsample_rate, encoder_dim, llm_dim, ffn_dim: int = 2048, **kwargs): |
| | | super().__init__() |
| | | self.k = downsample_rate |
| | | self.encoder_dim = encoder_dim |
| | | self.llm_dim = llm_dim |
| | | self.linear1 = nn.Linear(self.encoder_dim * self.k, ffn_dim) |
| | | self.relu = nn.ReLU() |
| | | self.linear2 = nn.Linear(ffn_dim, self.llm_dim) |
| | | |
| | | def forward(self, x): |
| | | batch_size, seq_len, dim = x.size() |
| | | num_frames_to_discard = seq_len % self.k |
| | | if num_frames_to_discard > 0: |
| | | x = x[:, :-num_frames_to_discard, :] |
| | | seq_len = x.size(1) |
| | | |
| | | x = x.contiguous() |
| | | x = x.view(batch_size, seq_len // self.k, dim * self.k) |
| | | x = self.linear1(x) |
| | | x = self.relu(x) |
| | | x = self.linear2(x) |
| | | return x |
| | | |
| | | @tables.register("adaptor_classes", "QFormer") |
| | | class EncoderProjectorQFormer(nn.Module): |
| | | def __init__(self, downsample_rate, encoder_dim, llm_dim, ffn_dim: int = 2048, **kwargs): |
| | | super().__init__() |
| | | self.encoder_dim = encoder_dim |
| | | self.llm_dim = llm_dim |
| | | from transformers import Blip2QFormerConfig, Blip2QFormerModel |
| | | configuration = Blip2QFormerConfig() |
| | | configuration.encoder_hidden_size = self.encoder_dim |
| | | configuration.num_hidden_layers = 2 |
| | | |
| | | self.query_len = 64 |
| | | self.query = nn.Parameter(torch.zeros(1, self.query_len, configuration.hidden_size)) |
| | | self.query.data.normal_(mean=0.0, std=1.0) |
| | | self.qformer = Blip2QFormerModel(configuration) |
| | | |
| | | self.linear = nn.Linear(configuration.hidden_size, self.llm_dim) |
| | | self.norm = nn.LayerNorm(self.llm_dim, eps=1e-5) |
| | | |
| | | def forward(self, x, atts): |
| | | query = self.query.expand(x.shape[0], -1, -1) |
| | | |
| | | query_output = self.qformer( |
| | | query_embeds=query, |
| | | encoder_hidden_states=x, |
| | | encoder_attention_mask=atts, |
| | | return_dict=True, |
| | | ) |
| | | |
| | | query_proj = self.norm(self.linear(query_output.last_hidden_state)) |
| | | |
| | | return query_proj |
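| | | 
| | | 
| | | if __name__ == "__main__":
| | |     # Minimal shape check for the Linear adaptor above (illustrative only; the
| | |     # downsample_rate/encoder_dim/llm_dim values here are hypothetical, not from a shipped config).
| | |     adaptor = Linear(downsample_rate=4, encoder_dim=512, llm_dim=4096)
| | |     x = torch.rand(2, 103, 512)  # (batch, frames, encoder_dim)
| | |     y = adaptor(x)               # the 3 trailing frames (103 % 4) are discarded
| | |     print(y.shape)               # torch.Size([2, 25, 4096])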
| New file |
| | |
| | | import logging |
| | | from typing import Union, Dict, List, Tuple, Optional |
| | | |
| | | import time |
| | | import torch |
| | | import torch.nn as nn |
| | | import torch.nn.functional as F |
| | | from torch.cuda.amp import autocast |
| | | |
| | | from funasr.models.scama.utils import sequence_mask |
| | | from funasr.losses.label_smoothing_loss import LabelSmoothingLoss |
| | | from funasr.models.ctc.ctc import CTC |
| | | from funasr.models.transformer.utils.add_sos_eos import add_sos_eos |
| | | from funasr.metrics.compute_acc import th_accuracy, compute_accuracy |
| | | # from funasr.models.e2e_asr_common import ErrorCalculator |
| | | from funasr.train_utils.device_funcs import force_gatherable |
| | | from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank |
| | | from funasr.utils import postprocess_utils |
| | | from funasr.utils.datadir_writer import DatadirWriter |
| | | from funasr.register import tables |
| | | |
| | | |
| | | @tables.register("model_classes", "LLMASR") |
| | | class LLMASR(nn.Module): |
| | | """ """ |
| | | |
| | | def __init__( |
| | | self, |
| | | specaug: str = None, |
| | | specaug_conf: dict = None, |
| | | normalize: str = None, |
| | | normalize_conf: dict = None, |
| | | encoder: str = None, |
| | | encoder_conf: dict = None, |
| | | decoder: str = None, |
| | | decoder_conf: dict = None, |
| | | ctc: str = None, |
| | | ctc_conf: dict = None, |
| | | ctc_weight: float = 0.5, |
| | | llm: str = None, |
| | | llm_conf: dict = None, |
| | | adaptor: str = None, |
| | | adaptor_conf: dict = None, |
| | | input_size: int = 80, |
| | | vocab_size: int = -1, |
| | | ignore_id: int = -1, |
| | | blank_id: int = 0, |
| | | sos: int = 1, |
| | | eos: int = 2, |
| | | lsm_weight: float = 0.0, |
| | | length_normalized_loss: bool = False, |
| | | report_cer: bool = True, |
| | | report_wer: bool = True, |
| | | sym_space: str = "<space>", |
| | | sym_blank: str = "<blank>", |
| | | # extract_feats_in_collect_stats: bool = True, |
| | | share_embedding: bool = False, |
| | | # preencoder: Optional[AbsPreEncoder] = None, |
| | | # postencoder: Optional[AbsPostEncoder] = None, |
| | | **kwargs, |
| | | ): |
| | | |
| | | super().__init__() |
| | | |
| | | if specaug is not None: |
| | | specaug_class = tables.specaug_classes.get(specaug) |
| | | specaug = specaug_class(**specaug_conf) |
| | | if normalize is not None: |
| | | normalize_class = tables.normalize_classes.get(normalize) |
| | | normalize = normalize_class(**normalize_conf) |
| | | |
| | | # audio encoder |
| | | hub = encoder_conf.get("hub", None) |
| | | if hub == "funasr": |
| | | from funasr import AutoModel |
| | | init_param_path = encoder_conf.get("init_param_path", "iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch") |
| | | model = AutoModel(model=init_param_path, model_revision="v2.0.4") |
| | | # frontend = model.kwargs.get("frontend") |
| | | model.model.decoder = None |
| | | |
| | | self.audio_encoder = model.model |
| | | # self.frontend = frontend |
| | | |
| | | elif hub == "hf": |
| | | pass |
| | | else: |
| | | encoder_class = tables.encoder_classes.get(encoder) |
| | | encoder = encoder_class(input_size=input_size, **encoder_conf) |
| | | encoder_output_size = encoder.output_size() |
| | | |
| | | # llm |
| | | hub = llm_conf.get("hub", "hf") |
| | | self.llm = None |
| | | if hub == "hf": |
| | | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig |
| | | |
| | | init_param_path = llm_conf.get("init_param_path", "vicuna-7b-v1.5") |
| | | model = AutoModelForCausalLM.from_pretrained( |
| | | init_param_path, |
| | | load_in_8bit=None, |
| | | device_map=None, |
| | | use_cache=None, |
| | | ) |
| | | freeze = llm_conf.get("freeze", True) |
| | | if freeze: |
| | | for name, param in model.named_parameters(): |
| | | param.requires_grad = False |
| | | model.eval() |
| | | self.llm = model |
| | | |
| | | # adaptor |
| | | adaptor_class = tables.adaptor_classes.get(adaptor) |
| | | adaptor = adaptor_class(**adaptor_conf) |
| | | |
| | | self.adaptor = adaptor |
| | | |
| | | |
| | | self.blank_id = blank_id |
| | | self.sos = sos if sos is not None else vocab_size - 1 |
| | | self.eos = eos if eos is not None else vocab_size - 1 |
| | | self.vocab_size = vocab_size |
| | | self.ignore_id = ignore_id |
| | | self.specaug = specaug |
| | | self.normalize = normalize |
| | | self.encoder = encoder |
| | | |
| | | |
| | | self.criterion_att = LabelSmoothingLoss( |
| | | size=vocab_size, |
| | | padding_idx=ignore_id, |
| | | smoothing=lsm_weight, |
| | | normalize_length=length_normalized_loss, |
| | | ) |
| | | # |
| | | # if report_cer or report_wer: |
| | | # self.error_calculator = ErrorCalculator( |
| | | # token_list, sym_space, sym_blank, report_cer, report_wer |
| | | # ) |
| | | # |
| | | self.error_calculator = None |
| | | |
| | | self.length_normalized_loss = length_normalized_loss |
| | | self.beam_search = None |
| | | |
| | | def forward( |
| | | self, |
| | | speech: torch.Tensor, |
| | | speech_lengths: torch.Tensor, |
| | | text: torch.Tensor, |
| | | text_lengths: torch.Tensor, |
| | | input_ids: torch.Tensor, |
| | | attention_mask:torch.Tensor, |
| | | labels_ids: torch.Tensor, |
| | | label_mask: torch.Tensor, |
| | | audio_mask: torch.Tensor, |
| | | **kwargs, |
| | | ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]: |
| | | """Encoder + Decoder + Calc loss |
| | | Args: |
| | | speech: (Batch, Length, ...) |
| | | speech_lengths: (Batch, ) |
| | | text: (Batch, Length) |
| | | text_lengths: (Batch,) |
| | | """ |
| | | # import pdb; |
| | | # pdb.set_trace() |
| | | if len(text_lengths.size()) > 1: |
| | | text_lengths = text_lengths[:, 0] |
| | | if len(speech_lengths.size()) > 1: |
| | | speech_lengths = speech_lengths[:, 0] |
| | | |
| | | batch_size = speech.shape[0] |
| | | |
| | | # audio encoder |
| | | encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, audio_mask=audio_mask) |
| | | |
| | | # adaptor |
| | | encoder_out = self.adaptor(encoder_out) |
| | | |
| | | if input_ids is not None: |
| | | input_ids[input_ids == -1] = 0 |
| | | input_ids[input_ids == -100] = 0 |
| | | if hasattr(self.llm.model, "embed_tokens"): |
| | | inputs_embeds = self.llm.model.embed_tokens(input_ids) |
| | | elif hasattr(self.llm.model.model, "embed_tokens"): |
| | | inputs_embeds = self.llm.model.model.embed_tokens(input_ids) |
| | | else: |
| | | inputs_embeds = self.llm.model.model.model.embed_tokens(input_ids) |
| | | |
| | | if audio_mask is not None: |
| | | batch_size, token_num, dims = inputs_embeds.shape |
| | | _, l, _ = encoder_out.shape |
| | | encoder_outs_pad = F.pad(encoder_out, (0, 0, token_num-l-1, 1, 0, 0), value=0.0) |
| | | inputs_embeds = encoder_outs_pad * audio_mask[:, :, None] + inputs_embeds * (1.0-audio_mask[:, :, None]) |
| | | inputs_embeds = F.pad(inputs_embeds[:, 1:, :], (0, 0, 0, 1, 0, 0), value=0.0) |
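| | |                 # splice the adaptor outputs into the placeholder positions (audio_mask == 1) while
| | |                 # keeping the prompt/text embeddings elsewhere; the final F.pad drops the first
| | |                 # position and zero-pads the tail so the length still matches attention_mask.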
| | | |
| | | model_outputs = self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels_ids) |
| | | loss = model_outputs.loss |
| | | |
| | | |
| | | stats = {} |
| | | with torch.no_grad(): |
| | | preds = torch.argmax(model_outputs.logits, -1) |
| | | acc_att = compute_accuracy(preds[:, :-1], labels_ids[:, 1:], ignore_label=-100) |
| | | stats["acc"] = acc_att |
| | | |
| | | stats["loss"] = torch.clone(loss.detach()) |
| | | |
| | | # force_gatherable: to-device and to-tensor if scalar for DataParallel |
| | | if self.length_normalized_loss: |
| | | batch_size = int((text_lengths + 1).sum()) |
| | | loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device) |
| | | return loss, stats, weight |
| | | |
| | | def encode( |
| | | self, speech: torch.Tensor, speech_lengths: torch.Tensor, **kwargs, |
| | | ) -> Tuple[torch.Tensor, torch.Tensor]: |
| | | |
| | | audio_mask = kwargs.get("audio_mask", None) |
| | | audio_token_lengths = audio_mask.sum(-1) if audio_mask is not None else None |
| | | |
| | | batch = {"speech": speech, "speech_lengths": speech_lengths} |
| | | enc, enc_lens = self.audio_encoder.encode(**batch) |
| | | with autocast(False): |
| | | enc_mask = sequence_mask(enc_lens, enc.size(1), device=enc.device)[:, None, :] |
| | | pre_acoustic_embeds, pre_token_length, _, _ = self.audio_encoder.predictor(enc, |
| | | mask=enc_mask, |
| | | target_label_length=audio_token_lengths, |
| | | ) |
| | | |
| | | return pre_acoustic_embeds, pre_token_length |
| | | |
| | | |
| | | def inference(self, |
| | | data_in, |
| | | data_lengths=None, |
| | | key: list = None, |
| | | tokenizer=None, |
| | | frontend=None, |
| | | **kwargs, |
| | | ): |
| | | |
| | | prompt = kwargs.get("prompt", "Transcribe speech to text.") |
| | | |
| | | if kwargs.get("batch_size", 1) > 1: |
| | | raise NotImplementedError("batch decoding is not implemented") |
| | | |
| | | |
| | | |
| | | meta_data = {} |
| | | if isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank": # fbank |
| | | speech, speech_lengths = data_in, data_lengths |
| | | if len(speech.shape) < 3: |
| | | speech = speech[None, :, :] |
| | | if speech_lengths is None: |
| | | speech_lengths = speech.shape[1] |
| | | else: |
| | | # extract fbank feats |
| | | time1 = time.perf_counter() |
| | | audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs, audio_fs=kwargs.get("fs", 16000), |
| | | data_type=kwargs.get("data_type", "sound"), |
| | | tokenizer=tokenizer) |
| | | time2 = time.perf_counter() |
| | | meta_data["load_data"] = f"{time2 - time1:0.3f}" |
| | | speech, speech_lengths = extract_fbank(audio_sample_list, data_type=kwargs.get("data_type", "sound"), |
| | | frontend=frontend) |
| | | time3 = time.perf_counter() |
| | | meta_data["extract_feat"] = f"{time3 - time2:0.3f}" |
| | | meta_data["batch_data_time"] = speech_lengths.sum().item() * frontend.frame_shift * frontend.lfr_n / 1000 |
| | | |
| | | speech = speech.to(device=kwargs["device"]) |
| | | speech_lengths = speech_lengths.to(device=kwargs["device"]) |
| | | |
| | | # Encoder |
| | | encoder_out, encoder_out_lens = self.encode(speech, speech_lengths) |
| | | |
| | | # adaptor |
| | | encoder_out = self.adaptor(encoder_out) |
| | | |
| | | |
| | | prompt_pre = "USER: \nINSTRUCTION: {}\nINPUT: ".format(prompt) |
| | | prompt_ids = tokenizer.encode(prompt_pre) |
| | | prompt_length = len(prompt_ids) |
| | | prompt_ids = torch.tensor(prompt_ids, dtype=torch.int64).to(kwargs["device"]) |
| | | |
| | | |
| | | if hasattr(self.llm.model, "embed_tokens"): |
| | | inputs_embeds = self.llm.model.embed_tokens(prompt_ids) |
| | | elif hasattr(self.llm.model.model, "embed_tokens"): |
| | | inputs_embeds = self.llm.model.model.embed_tokens(prompt_ids) |
| | | else: |
| | | inputs_embeds = self.llm.model.model.model.embed_tokens(prompt_ids) |
| | | |
| | | inputs_embeds = torch.cat((inputs_embeds[None, :, :], encoder_out), dim=1) # [prompt, audio] |
| | | attention_mask = torch.ones(inputs_embeds.size()[:-1], dtype=torch.long).to(kwargs["device"]) |
| | | |
| | | # model_outputs = self.llm.generate( |
| | | # inputs_embeds=inputs_embeds, |
| | | # max_length=kwargs.get("max_length", 200), |
| | | # max_new_tokens=kwargs.get("max_new_tokens", 200), |
| | | # num_beams=kwargs.get("num_beams", 4), |
| | | # do_sample=kwargs.get("do_sample", False), |
| | | # min_length=kwargs.get("min_length", 1), |
| | | # top_p=kwargs.get("top_p", 1.0), |
| | | # repetition_penalty=kwargs.get("repetition_penalty", 1.0), |
| | | # length_penalty=kwargs.get("length_penalty", 1.0), |
| | | # temperature=kwargs.get("temperature", 1.0), |
| | | # attention_mask=attention_mask, |
| | | # bos_token_id=tokenizer.bos_token_id, |
| | | # eos_token_id=tokenizer.eos_token_id, |
| | | # pad_token_id=tokenizer.pad_token_id |
| | | # ) |
| | | |
| | | |
| | | model_outputs = self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=None) |
| | | preds = torch.argmax(model_outputs.logits, -1) |
| | | text = tokenizer.batch_decode(preds, add_special_tokens=False, skip_special_tokens=True) |
| | | |
| | | text = text[0].split(': ')[-1] |
| | | text = text.strip() |
| | | |
| | | # preds = torch.argmax(model_outputs.logits, -1) |
| | | |
| | | ibest_writer = None |
| | | if kwargs.get("output_dir") is not None: |
| | | if not hasattr(self, "writer"): |
| | | self.writer = DatadirWriter(kwargs.get("output_dir")) |
| | | ibest_writer = self.writer[f"{0 + 1}best_recog"] |
| | | |
| | | results = [] |
| | | result_i = {"key": key[0], "text": text} |
| | | results.append(result_i) |
| | | |
| | | if ibest_writer is not None: |
| | | ibest_writer["text"][key[0]] = text |
| | | |
| | | |
| | | |
| | | |
| | | return results, meta_data |
| | | |
| | |
| | | from dataclasses import dataclass |
| | | from typing import Dict |
| | | from typing import Iterable, Optional |
| | | |
| | | import time |
| | | import numpy as np |
| | | import torch |
| | | import torch.nn.functional as F |
| | | from torch import Tensor |
| | | from torch import nn |
| | | import whisper |
| | | from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank |
| | | |
| | | |
| | | from funasr.models.whisper.utils.decoding import detect_language as detect_language_function, decode as decode_function |
| | | from funasr.register import tables |
| | | |
| | | |
| | | @dataclass |
| | | class ModelDimensions: |
| | | n_mels: int |
| | | n_audio_ctx: int |
| | | n_audio_state: int |
| | | n_audio_head: int |
| | | n_audio_layer: int |
| | | n_vocab: int |
| | | n_text_ctx: int |
| | | n_text_state: int |
| | | n_text_head: int |
| | | n_text_layer: int |
| | | |
| | | |
| | | class LayerNorm(nn.LayerNorm): |
| | | def forward(self, x: Tensor) -> Tensor: |
| | | return super().forward(x.float()).type(x.dtype) |
| | | |
| | | |
| | | class Linear(nn.Linear): |
| | | def forward(self, x: Tensor) -> Tensor: |
| | | return F.linear( |
| | | x, self.weight.to(x.dtype), None if self.bias is None else self.bias.to(x.dtype) |
| | | ) |
| | | |
| | | |
| | | class Conv1d(nn.Conv1d): |
| | | def _conv_forward(self, x: Tensor, weight: Tensor, bias: Optional[Tensor]) -> Tensor: |
| | | return super()._conv_forward( |
| | | x, weight.to(x.dtype), None if bias is None else bias.to(x.dtype) |
| | | ) |
| | | |
| | | |
| | | def sinusoids(length, channels, max_timescale=10000): |
| | | """Returns sinusoids for positional embedding""" |
| | | assert channels % 2 == 0 |
| | | log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1) |
| | | inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2)) |
| | | scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :] |
| | | return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1) |
| | | |
| | | |
| | | class MultiHeadAttention(nn.Module):
| | |     def __init__(self, n_state: int, n_head: int):
| | |         super().__init__()
| | |         self.n_head = n_head
| | |         self.query = Linear(n_state, n_state)
| | |         self.key = Linear(n_state, n_state, bias=False)
| | |         self.value = Linear(n_state, n_state)
| | |         self.out = Linear(n_state, n_state)
| | | 
| | |     def forward(
| | |         self,
| | |         x: Tensor,
| | |         xa: Optional[Tensor] = None,
| | |         mask: Optional[Tensor] = None,
| | |         kv_cache: Optional[dict] = None,
| | |     ):
| | |         q = self.query(x)
| | | 
| | |         if kv_cache is None or xa is None or self.key not in kv_cache:
| | |             # hooks, if installed (i.e. kv_cache is not None), will prepend the cached kv tensors;
| | |             # otherwise, perform key/value projections for self- or cross-attention as usual.
| | |             k = self.key(x if xa is None else xa)
| | |             v = self.value(x if xa is None else xa)
| | |         else:
| | |             # for cross-attention, calculate keys and values once and reuse in subsequent calls.
| | |             k = kv_cache[self.key]
| | |             v = kv_cache[self.value]
| | | 
| | |         wv, qk = self.qkv_attention(q, k, v, mask)
| | |         return self.out(wv), qk
| | | 
| | |     def qkv_attention(self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None):
| | |         n_batch, n_ctx, n_state = q.shape
| | |         scale = (n_state // self.n_head) ** -0.25
| | |         q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3) * scale
| | |         k = k.view(*k.shape[:2], self.n_head, -1).permute(0, 2, 3, 1) * scale
| | |         v = v.view(*v.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
| | | 
| | |         qk = q @ k
| | |         if mask is not None:
| | |             qk = qk + mask[:n_ctx, :n_ctx]
| | |         qk = qk.float()
| | | 
| | |         w = F.softmax(qk, dim=-1).to(q.dtype)
| | |         return (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2), qk.detach()
| | | |
| | | class ResidualAttentionBlock(nn.Module): |
| | | def __init__(self, n_state: int, n_head: int, cross_attention: bool = False): |
| | | super().__init__() |
| | | |
| | | self.attn = MultiHeadAttention(n_state, n_head) |
| | | self.attn_ln = LayerNorm(n_state) |
| | | |
| | | self.cross_attn = MultiHeadAttention(n_state, n_head) if cross_attention else None |
| | | self.cross_attn_ln = LayerNorm(n_state) if cross_attention else None |
| | | |
| | | n_mlp = n_state * 4 |
| | | self.mlp = nn.Sequential(Linear(n_state, n_mlp), nn.GELU(), Linear(n_mlp, n_state)) |
| | | self.mlp_ln = LayerNorm(n_state) |
| | | |
| | | def forward( |
| | | self, |
| | | x: Tensor, |
| | | xa: Optional[Tensor] = None, |
| | | mask: Optional[Tensor] = None, |
| | | kv_cache: Optional[dict] = None, |
| | | ): |
| | | x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)[0] |
| | | if self.cross_attn: |
| | | x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0] |
| | | x = x + self.mlp(self.mlp_ln(x)) |
| | | return x |
| | | |
| | | |
| | | |
| | | @tables.register("encoder_classes", "WhisperEncoder") |
| | | class AudioEncoder(nn.Module): |
| | | def __init__(self, n_mels: int, n_ctx: int, n_state: int, n_head: int, n_layer: int): |
| | | super().__init__() |
| | | self.conv1 = Conv1d(n_mels, n_state, kernel_size=3, padding=1) |
| | | self.conv2 = Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1) |
| | | self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)) |
| | | |
| | | self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList( |
| | | [ResidualAttentionBlock(n_state, n_head) for _ in range(n_layer)] |
| | | ) |
| | | self.ln_post = LayerNorm(n_state) |
| | | |
| | | def forward(self, x: Tensor): |
| | | """ |
| | | x : torch.Tensor, shape = (batch_size, n_mels, n_ctx) |
| | | the mel spectrogram of the audio |
| | | """ |
| | | x = F.gelu(self.conv1(x)) |
| | | x = F.gelu(self.conv2(x)) |
| | | x = x.permute(0, 2, 1) |
| | | |
| | | assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape" |
| | | x = (x + self.positional_embedding).to(x.dtype) |
| | | |
| | | for block in self.blocks: |
| | | x = block(x) |
| | | |
| | | x = self.ln_post(x) |
| | | return x |
| | | |
| | | @tables.register("decoder_classes", "WhisperDecoder") |
| | | class TextDecoder(nn.Module): |
| | | def __init__(self, n_vocab: int, n_ctx: int, n_state: int, n_head: int, n_layer: int): |
| | | super().__init__() |
| | | |
| | | self.token_embedding = nn.Embedding(n_vocab, n_state) |
| | | self.positional_embedding = nn.Parameter(torch.empty(n_ctx, n_state)) |
| | | |
| | | self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList( |
| | | [ResidualAttentionBlock(n_state, n_head, cross_attention=True) for _ in range(n_layer)] |
| | | ) |
| | | self.ln = LayerNorm(n_state) |
| | | |
| | | mask = torch.empty(n_ctx, n_ctx).fill_(-np.inf).triu_(1) |
| | | self.register_buffer("mask", mask, persistent=False) |
| | | |
| | | def forward(self, x: Tensor, xa: Tensor, kv_cache: Optional[dict] = None): |
| | | """ |
| | | x : torch.LongTensor, shape = (batch_size, <= n_ctx) |
| | | the text tokens |
| | | xa : torch.Tensor, shape = (batch_size, n_mels, n_audio_ctx) |
| | | the encoded audio features to be attended on |
| | | """ |
| | | offset = next(iter(kv_cache.values())).shape[1] if kv_cache else 0 |
| | | x = self.token_embedding(x) + self.positional_embedding[offset : offset + x.shape[-1]] |
| | | x = x.to(xa.dtype) |
| | | |
| | | for block in self.blocks: |
| | | x = block(x, xa, mask=self.mask, kv_cache=kv_cache) |
| | | |
| | | x = self.ln(x) |
| | | logits = (x @ torch.transpose(self.token_embedding.weight.to(x.dtype), 0, 1)).float() |
| | | |
| | | return logits |
| | | |
| | | @tables.register("model_classes", "Whisper") |
| | | class Whisper(nn.Module): |
| | | def __init__(self, dims: dict): |
| | | super().__init__() |
| | | dims = ModelDimensions(**dims) |
| | | self.dims = dims |
| | | self.sos = 1 |
| | | self.eos = 1 |
| | | self.encoder = AudioEncoder( |
| | | self.dims.n_mels, |
| | | self.dims.n_audio_ctx, |
| | | self.dims.n_audio_state, |
| | | self.dims.n_audio_head, |
| | | self.dims.n_audio_layer, |
| | | ) |
| | | self.decoder = TextDecoder( |
| | | self.dims.n_vocab, |
| | | self.dims.n_text_ctx, |
| | | self.dims.n_text_state, |
| | | self.dims.n_text_head, |
| | | self.dims.n_text_layer, |
| | | ) |
| | | |
| | | def embed_audio(self, mel: torch.Tensor): |
| | | return self.encoder(mel) |
| | | |
| | | def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor): |
| | | return self.decoder(tokens, audio_features) |
| | | |
| | | def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> Dict[str, torch.Tensor]: |
| | | return self.decoder(tokens, self.encoder(mel)) |
| | | |
| | | @property |
| | | def device(self): |
| | | return next(self.parameters()).device |
| | | |
| | | @property |
| | | def is_multilingual(self): |
| | | return self.dims.n_vocab == 51865 |
| | | |
| | | def install_kv_cache_hooks(self, cache: Optional[dict] = None): |
| | | """ |
| | | The `MultiHeadAttention` module optionally accepts `kv_cache` which stores the key and value |
| | | tensors calculated for the previous positions. This method returns a dictionary that stores |
| | | all caches, and the necessary hooks for the key and value projection modules that save the |
| | | intermediate tensors to be reused during later calculations. |
| | | |
| | | Returns |
| | | ------- |
| | | cache : Dict[nn.Module, torch.Tensor] |
| | | A dictionary object mapping the key/value projection modules to its cache |
| | | hooks : List[RemovableHandle] |
| | | List of PyTorch RemovableHandle objects to stop the hooks to be called |
| | | """ |
| | | cache = {**cache} if cache is not None else {} |
| | | hooks = [] |
| | | |
| | | def save_to_cache(module, _, output): |
| | | if module not in cache or output.shape[1] > self.decoder.positional_embedding.shape[0]: |
| | | cache[module] = output # save as-is, for the first token or cross attention |
| | | else: |
| | | cache[module] = torch.cat([cache[module], output], dim=1).detach() |
| | | return cache[module] |
| | | |
| | | def install_hooks(layer: nn.Module): |
| | | if isinstance(layer, MultiHeadAttention): |
| | | hooks.append(layer.key.register_forward_hook(save_to_cache)) |
| | | hooks.append(layer.value.register_forward_hook(save_to_cache)) |
| | | |
| | | self.decoder.apply(install_hooks) |
| | | return cache, hooks |
| | | |
| | | detect_language = detect_language_function |
| | | decode = decode_function |
| | | 
| | | 
| | | @tables.register("model_classes", "WhisperWarp")
| | | class WhisperWarp(nn.Module):
| | |     def __init__(self, whisper_dims: dict, **kwargs):
| | |         super().__init__()
| | |         hub = kwargs.get("hub", "funasr")
| | |         if hub == "openai":
| | |             init_param_path = kwargs.get("init_param_path", "large-v3")
| | |             model = whisper.load_model(init_param_path)
| | |         else:
| | |             dims = whisper.model.ModelDimensions(**whisper_dims)
| | |             model = whisper.model.Whisper(dims=dims)
| | | 
| | |         self.model = model
| | | 
| | |     def forward(self, ):
| | |         pass
| | | 
| | |     def inference(self,
| | |                   data_in,
| | |                   data_lengths=None,
| | |                   key: list = None,
| | |                   tokenizer=None,
| | |                   frontend=None,
| | |                   **kwargs,
| | |                   ):
| | |         if kwargs.get("batch_size", 1) > 1:
| | |             raise NotImplementedError("batch decoding is not implemented")
| | | 
| | |         meta_data = {}
| | |         if isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank":  # fbank
| | |             speech, speech_lengths = data_in, data_lengths
| | |             if len(speech.shape) < 3:
| | |                 speech = speech[None, :, :]
| | |             if speech_lengths is None:
| | |                 speech_lengths = speech.shape[1]
| | |         else:
| | |             # extract fbank feats
| | |             time1 = time.perf_counter()
| | |             audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs, audio_fs=kwargs.get("fs", 16000),
| | |                                                             data_type=kwargs.get("data_type", "sound"),
| | |                                                             tokenizer=tokenizer)
| | |             time2 = time.perf_counter()
| | |             meta_data["load_data"] = f"{time2 - time1:0.3f}"
| | |             speech, speech_lengths = extract_fbank(audio_sample_list, data_type=kwargs.get("data_type", "sound"),
| | |                                                    frontend=frontend)
| | |             time3 = time.perf_counter()
| | |             meta_data["extract_feat"] = f"{time3 - time2:0.3f}"
| | |             frame_shift = frontend.frame_shift if hasattr(frontend, "frame_shift") else 10
| | |             lfr_n = frontend.lfr_n if hasattr(frontend, "lfr_n") else 1
| | |             meta_data["batch_data_time"] = speech_lengths.sum().item() * frame_shift * lfr_n / 1000
| | | 
| | |         speech = speech.to(device=kwargs["device"])[0, :, :]
| | |         speech_lengths = speech_lengths.to(device=kwargs["device"])
| | | 
| | |         # detect the spoken language
| | |         _, probs = self.model.detect_language(speech)
| | |         print(f"Detected language: {max(probs, key=probs.get)}")
| | | 
| | |         # decode the audio
| | |         options = whisper.DecodingOptions(language=kwargs.get("language", None), fp16=False)
| | |         result = whisper.decode(self.model, speech, options)
| | | 
| | |         results = []
| | |         result_i = {"key": key[0], "text": result.text}
| | |         results.append(result_i)
| | | 
| | |         return results, meta_data
| | | |
| New file |
| | |
| | | # This is an example that demonstrates how to configure a model file. |
| | | # You can modify the configuration according to your own requirements. |
| | | |
| | | # to print the register_table: |
| | | # from funasr.register import tables |
| | | # tables.print() |
| | | |
| | | # network architecture |
| | | model: WhisperWarp |
| | | model_conf: |
| | | lsm_weight: 0.1 |
| | | length_normalized_loss: true |
| | | hub: funasr # openai |
| | | init_param_path: null # large-v2 or large-v3 if hub == "openai" |
| | | |
| | | |
| | | |
| | | # only used when hub == funasr;
| | | # if hub == openai, whisper_dims is downloaded automatically
| | | whisper_dims: |
| | | 'n_mels': 80 |
| | | 'n_vocab': 51865 |
| | | 'n_audio_ctx': 1500 |
| | | 'n_audio_state': 1280 |
| | | 'n_audio_head': 20 |
| | | 'n_audio_layer': 32 |
| | | 'n_text_ctx': 448 |
| | | 'n_text_state': 1280 |
| | | 'n_text_head': 20 |
| | | 'n_text_layer': 32 |
| | | |
| | | # frontend related |
| | | frontend: WhisperFrontend |
| | | frontend_conf: |
| | | fs: 16000 |
| | | n_mels: 80 |
| | | do_pad_trim: true |
| | | |
| | | tokenizer: WhisperTokenizer |
| | | tokenizer_conf: |
| | | language: null |
| | | task: transcribe |
| | | is_multilingual: true |
| | | num_languages: 99 |
| | | |
| | | scope_map: ['none', "model."] |
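| | | 
| | | # Note: the whisper_dims above correspond to Whisper large-v2 (n_mels: 80; large-v3 uses 128 mel
| | | # bins). With hub: funasr the weights themselves are supplied at inference time via ++init_param,
| | | # e.g. large-v2.pt as in the local-model demo script above.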
| New file |
| | |
| | | |
| | | try: |
| | | from whisper.tokenizer import get_tokenizer |
| | | except ImportError:
| | |     print("If you want to use the whisper tokenizer, please `pip install -U openai-whisper`")
| | | |
| | | from funasr.register import tables |
| | | |
| | | @tables.register("tokenizer_classes", "WhisperTokenizer") |
| | | def WhisperTokenizer(**kwargs): |
| | | |
| | | language = kwargs.get("language", None) |
| | | task = kwargs.get("task", "transcribe") |
| | | is_multilingual = kwargs.get("is_multilingual", True) |
| | | num_languages = kwargs.get("num_languages", 99) |
| | | tokenizer = get_tokenizer( |
| | | multilingual=is_multilingual, |
| | | num_languages=num_languages, |
| | | language=language, |
| | | task=task, |
| | | ) |
| | | |
| | | return tokenizer |
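| | | 
| | | 
| | | if __name__ == "__main__":
| | |     # Illustrative round-trip with the registered tokenizer (assumes the openai-whisper package
| | |     # is installed; the example values below are hypothetical).
| | |     tok = WhisperTokenizer(language=None, task="transcribe", is_multilingual=True, num_languages=99)
| | |     ids = tok.encode("hello world")
| | |     print(ids, tok.decode(ids))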
| | | |
| | |
| | | else: |
| | | buffer = BytesIO(oss_bucket.get_object(path).read()) |
| | | src_state = torch.load(buffer, map_location=map_location) |
| | | if "state_dict" in src_state: |
| | | src_state = src_state["state_dict"] |
| | | |
| | | |
| | | src_state = src_state["state_dict"] if "state_dict" in src_state else src_state |
| | | src_state = src_state["model_state_dict"] if "model_state_dict" in src_state else src_state |
| | | src_state = src_state["model"] if "model" in src_state else src_state |
| | | |
| | | if isinstance(scope_map, str): |