雾聪
2023-09-28 a49f2c6411637d696e787605ec611f05667e8935
Merge branch 'main' of https://github.com/alibaba-damo-academy/FunASR into main
40 files changed, 1 file added, 558 lines changed
Changed files:
README.md  2
README_zh.md  3
docs/images/wechat.png
docs/m2met2/Dataset.md  4
docs/m2met2/index.rst  1
egs_modelscope/asr/TEMPLATE/README.md  11
egs_modelscope/asr/TEMPLATE/README_zh.md  11
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/demo.py  2
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/demo_online.py  2
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/demo_online_v2.py  44
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/finetune.py  2
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/infer.py  2
fun_text_processing/inverse_text_normalization/en/taggers/money.py  2
funasr/bin/asr_inference_launch.py  68
funasr/datasets/large_datasets/datapipes/batch.py  2
funasr/datasets/large_datasets/datapipes/filter.py  4
funasr/datasets/large_datasets/datapipes/map.py  2
funasr/models/decoder/sanm_decoder.py  89
funasr/models/encoder/sanm_encoder.py  72
funasr/modules/attention.py  67
funasr/runtime/docs/SDK_advanced_guide_offline.md  4
funasr/runtime/docs/SDK_advanced_guide_offline_zh.md  20
funasr/runtime/docs/SDK_advanced_guide_online.md  4
funasr/runtime/docs/SDK_advanced_guide_online_zh.md  7
funasr/runtime/docs/SDK_tutorial_online_zh.md  19
funasr/runtime/docs/SDK_tutorial_zh.md  10
funasr/runtime/ios/paraformer_online/paraformer_online.xcodeproj/project.pbxproj  16
funasr/runtime/onnxruntime/include/funasrruntime.h  3
funasr/runtime/onnxruntime/include/model.h  2
funasr/runtime/onnxruntime/include/offline-stream.h  4
funasr/runtime/onnxruntime/include/tpass-stream.h  4
funasr/runtime/onnxruntime/src/funasrruntime.cpp  13
funasr/runtime/onnxruntime/src/offline-stream.cpp  3
funasr/runtime/onnxruntime/src/paraformer.cpp  2
funasr/runtime/onnxruntime/src/precomp.h  4
funasr/runtime/onnxruntime/src/tpass-stream.cpp  5
funasr/runtime/python/websocket/funasr_client_api.py  5
funasr/runtime/python/websocket/funasr_wss_client.py  11
funasr/runtime/python/websocket/funasr_wss_server.py  8
funasr/runtime/readme.md  12
funasr/runtime/readme_cn.md  12
README.md
@@ -60,7 +60,7 @@
|DingTalk group |                     WeChat group                      |
|:---:|:-----------------------------------------------------:|
|<div align="left"><img src="docs/images/dingding.jpg" width="250"/> | <img src="docs/images/wechat.png" width="232"/></div> |
|<div align="left"><img src="docs/images/dingding.jpg" width="250"/> | <img src="docs/images/wechat.png" width="215"/></div> |
## Contributors
README_zh.md
@@ -55,9 +55,10 @@
## 联系我们
如果您在使用中遇到问题,可以直接在github页面提Issues。欢迎语音兴趣爱好者扫描以下的钉钉群或者微信群二维码加入社区群,进行交流和讨论。
|                                  钉钉群                                  |                          微信                           |
|:---------------------------------------------------------------------:|:-----------------------------------------------------:|
| <div align="left"><img src="docs/images/dingding.jpg" width="250"/>   | <img src="docs/images/wechat.png" width="232"/></div> |
| <div align="left"><img src="docs/images/dingding.jpg" width="250"/>   | <img src="docs/images/wechat.png" width="215"/></div> |
## 社区贡献者
docs/images/wechat.png

docs/m2met2/Dataset.md
@@ -21,4 +21,6 @@
The three training datasets mentioned above can be downloaded at [OpenSLR](https://openslr.org/resources.php). Participants can download them via the following links. In particular, the baseline provides convenient data preparation scripts for the AliMeeting corpus.
- [AliMeeting](https://openslr.org/119/)
- [AISHELL-4](https://openslr.org/111/)
- [CN-Celeb](https://openslr.org/82/)
- [CN-Celeb](https://openslr.org/82/)
Now, the new test set is available [here](https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Test_2023_Ali.tar.gz)
docs/m2met2/index.rst
@@ -9,6 +9,7 @@
To advance the current state-of-the-art in multi-talker automatic speech recognition, the M2MeT2.0 challenge proposes a speaker-attributed ASR task, comprising two sub-tracks: fixed and open training conditions.
To facilitate reproducible research, we provide a comprehensive overview of the dataset, challenge rules, evaluation metrics, and baseline systems. 
Now the new test set, containing about 10 hours of audio, is available. You can download it from `here <https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Test_2023_Ali.tar.gz>`_
.. toctree::
   :maxdepth: 1
egs_modelscope/asr/TEMPLATE/README.md
@@ -27,15 +27,18 @@
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.6',
    model_revision='v1.0.7',
    update_model=False,
    mode='paraformer_streaming'
    )
import soundfile
speech, sample_rate = soundfile.read("example/asr_example.wav")
chunk_size = [5, 10, 5] #[5, 10, 5] 600ms, [8, 8, 4] 480ms
param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size}
chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size,
              "encoder_chunk_look_back": encoder_chunk_look_back, "decoder_chunk_look_back": decoder_chunk_look_back}
chunk_stride = chunk_size[1] * 960 # 600ms、480ms
# first chunk, 600ms
speech_chunk = speech[0:chunk_stride]
@@ -55,7 +58,7 @@
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.6',
    model_revision='v1.0.7',
    update_model=False,
    mode="paraformer_fake_streaming"
)
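The 600 ms and 480 ms figures quoted in the comments of the excerpt above come from the model's 16 kHz sample rate: each unit of `chunk_size[1]` maps to 960 samples (60 ms). A quick sanity check of that arithmetic, assuming 16 kHz input as in the example:

```python
# Sanity check for the latency figures in the README example above.
# Assumes 16 kHz audio, where one chunk unit == 960 samples (60 ms).
SAMPLE_RATE = 16000
SAMPLES_PER_UNIT = 960

for chunk_size in ([0, 10, 5], [0, 8, 4]):
    stride = chunk_size[1] * SAMPLES_PER_UNIT     # samples fed per streaming step
    latency_ms = 1000 * stride / SAMPLE_RATE      # duration of one decoding chunk
    print(chunk_size, f"{latency_ms:.0f} ms")     # -> 600 ms and 480 ms
```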
egs_modelscope/asr/TEMPLATE/README_zh.md
@@ -27,15 +27,18 @@
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.6',
    model_revision='v1.0.7',
    update_model=False,
    mode='paraformer_streaming'
    )
import soundfile
speech, sample_rate = soundfile.read("example/asr_example.wav")
chunk_size = [5, 10, 5] #[5, 10, 5] 600ms, [8, 8, 4] 480ms
param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size}
chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size,
              "encoder_chunk_look_back": encoder_chunk_look_back, "decoder_chunk_look_back": decoder_chunk_look_back}
chunk_stride = chunk_size[1] * 960 # 600ms、480ms
# first chunk, 600ms
speech_chunk = speech[0:chunk_stride]
@@ -55,7 +58,7 @@
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.6',
    model_revision='v1.0.7',
    update_model=False,
    mode="paraformer_fake_streaming"
)
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/demo.py
@@ -4,7 +4,7 @@
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.6',
    model_revision='v1.0.7',
    update_model=False,
    mode="paraformer_fake_streaming"
)
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/demo_online.py
@@ -14,7 +14,7 @@
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.6',
    model_revision='v1.0.7',
    update_model=False,
    mode="paraformer_streaming"
)
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/demo_online_v2.py
New file
@@ -0,0 +1,44 @@
import os
import logging
import torch
import soundfile
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
os.environ["MODELSCOPE_CACHE"] = "./"
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.7',
    update_model=False,
    mode="paraformer_streaming"
)
model_dir = os.path.join(os.environ["MODELSCOPE_CACHE"], "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online")
speech, sample_rate = soundfile.read(os.path.join(model_dir, "example/asr_example.wav"))
speech_length = speech.shape[0]
sample_offset = 0
chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
stride_size =  chunk_size[1] * 960
param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size,
              "encoder_chunk_look_back": encoder_chunk_look_back, "decoder_chunk_look_back": decoder_chunk_look_back}
final_result = ""
for sample_offset in range(0, speech_length, min(stride_size, speech_length - sample_offset)):
    if sample_offset + stride_size >= speech_length - 1:
        stride_size = speech_length - sample_offset
        param_dict["is_final"] = True
    rec_result = inference_pipeline(audio_in=speech[sample_offset: sample_offset + stride_size],
                                    param_dict=param_dict)
    if len(rec_result) != 0:
        final_result += rec_result['text']
        print(rec_result)
print(final_result)
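The loop above pushes chunks as fast as the pipeline returns them; to mimic a live audio feed, one option is to sleep for a chunk's duration between calls. A minimal sketch of that pacing, assuming the 16 kHz example audio:

```python
import time

# Optional pacing for the streaming loop: sleep one chunk's duration so audio
# is fed in roughly real time (600 ms per chunk for chunk_size = [0, 10, 5]).
chunk_size = [0, 10, 5]
stride_size = chunk_size[1] * 960        # samples per chunk at 16 kHz
time.sleep(stride_size / 16000)          # ~0.6 s; call at the end of each iteration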
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/finetune.py
@@ -14,7 +14,7 @@
    ds_dict = MsDataset.load(params.data_path)
    kwargs = dict(
        model=params.model,
        model_revision='v1.0.6',
        model_revision='v1.0.7',
        update_model=False,
        data_dir=ds_dict,
        dataset_type=params.dataset_type,
egs_modelscope/asr/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/infer.py
@@ -11,7 +11,7 @@
        model=args.model,
        output_dir=args.output_dir,
        batch_size=args.batch_size,
        model_revision='v1.0.6',
        model_revision='v1.0.7',
        update_model=False,
        mode="paraformer_fake_streaming",
        param_dict={"decoding_model": args.decoding_mode, "hotword": args.hotword_txt}
fun_text_processing/inverse_text_normalization/en/taggers/money.py
@@ -53,7 +53,7 @@
            + pynini.union(
                pynutil.add_weight(((DAMO_SIGMA - "one") @ cardinal_graph), -0.7) @ add_leading_zero_to_double_digit
                + delete_space
                + pynutil.delete("cents"),
                + (pynutil.delete("cents") | pynutil.delete("cent")),
                pynini.cross("one", "01") + delete_space + pynutil.delete("cent"),
            )
            + pynutil.insert("\"")
funasr/bin/asr_inference_launch.py
@@ -842,37 +842,72 @@
            data = yaml.load(f, Loader=yaml.Loader)
        return data
    def _prepare_cache(cache: dict = {}, chunk_size=[5, 10, 5], batch_size=1):
    def _prepare_cache(cache: dict = {}, chunk_size=[5, 10, 5], encoder_chunk_look_back=0,
                       decoder_chunk_look_back=0, batch_size=1):
        if len(cache) > 0:
            return cache
        config = _read_yaml(asr_train_config)
        enc_output_size = config["encoder_conf"]["output_size"]
        feats_dims = config["frontend_conf"]["n_mels"] * config["frontend_conf"]["lfr_m"]
        cache_en = {"start_idx": 0, "cif_hidden": torch.zeros((batch_size, 1, enc_output_size)),
                    "cif_alphas": torch.zeros((batch_size, 1)), "chunk_size": chunk_size, "last_chunk": False,
                    "cif_alphas": torch.zeros((batch_size, 1)), "chunk_size": chunk_size,
                    "encoder_chunk_look_back": encoder_chunk_look_back, "last_chunk": False, "opt": None,
                    "feats": torch.zeros((batch_size, chunk_size[0] + chunk_size[2], feats_dims)), "tail_chunk": False}
        cache["encoder"] = cache_en
        cache_de = {"decode_fsmn": None}
        cache_de = {"decode_fsmn": None, "decoder_chunk_look_back": decoder_chunk_look_back, "opt": None, "chunk_size": chunk_size}
        cache["decoder"] = cache_de
        return cache
    def _cache_reset(cache: dict = {}, chunk_size=[5, 10, 5], batch_size=1):
    def _cache_reset(cache: dict = {}, chunk_size=[5, 10, 5], encoder_chunk_look_back=0,
                     decoder_chunk_look_back=0, batch_size=1):
        if len(cache) > 0:
            config = _read_yaml(asr_train_config)
            enc_output_size = config["encoder_conf"]["output_size"]
            feats_dims = config["frontend_conf"]["n_mels"] * config["frontend_conf"]["lfr_m"]
            cache_en = {"start_idx": 0, "cif_hidden": torch.zeros((batch_size, 1, enc_output_size)),
                        "cif_alphas": torch.zeros((batch_size, 1)), "chunk_size": chunk_size, "last_chunk": False,
                        "feats": torch.zeros((batch_size, chunk_size[0] + chunk_size[2], feats_dims)),
                        "tail_chunk": False}
                        "cif_alphas": torch.zeros((batch_size, 1)), "chunk_size": chunk_size,
                        "encoder_chunk_look_back": encoder_chunk_look_back, "last_chunk": False, "opt": None,
                        "feats": torch.zeros((batch_size, chunk_size[0] + chunk_size[2], feats_dims)), "tail_chunk": False}
            cache["encoder"] = cache_en
            cache_de = {"decode_fsmn": None}
            cache_de = {"decode_fsmn": None, "decoder_chunk_look_back": decoder_chunk_look_back, "opt": None, "chunk_size": chunk_size}
            cache["decoder"] = cache_de
        return cache
    #def _prepare_cache(cache: dict = {}, chunk_size=[5, 10, 5], batch_size=1):
    #    if len(cache) > 0:
    #        return cache
    #    config = _read_yaml(asr_train_config)
    #    enc_output_size = config["encoder_conf"]["output_size"]
    #    feats_dims = config["frontend_conf"]["n_mels"] * config["frontend_conf"]["lfr_m"]
    #    cache_en = {"start_idx": 0, "cif_hidden": torch.zeros((batch_size, 1, enc_output_size)),
    #                "cif_alphas": torch.zeros((batch_size, 1)), "chunk_size": chunk_size, "last_chunk": False,
    #                "feats": torch.zeros((batch_size, chunk_size[0] + chunk_size[2], feats_dims)), "tail_chunk": False}
    #    cache["encoder"] = cache_en
    #    cache_de = {"decode_fsmn": None}
    #    cache["decoder"] = cache_de
    #    return cache
    #def _cache_reset(cache: dict = {}, chunk_size=[5, 10, 5], batch_size=1):
    #    if len(cache) > 0:
    #        config = _read_yaml(asr_train_config)
    #        enc_output_size = config["encoder_conf"]["output_size"]
    #        feats_dims = config["frontend_conf"]["n_mels"] * config["frontend_conf"]["lfr_m"]
    #        cache_en = {"start_idx": 0, "cif_hidden": torch.zeros((batch_size, 1, enc_output_size)),
    #                    "cif_alphas": torch.zeros((batch_size, 1)), "chunk_size": chunk_size, "last_chunk": False,
    #                    "feats": torch.zeros((batch_size, chunk_size[0] + chunk_size[2], feats_dims)),
    #                    "tail_chunk": False}
    #        cache["encoder"] = cache_en
    #        cache_de = {"decode_fsmn": None}
    #        cache["decoder"] = cache_de
    #    return cache
    def _forward(
            data_path_and_name_and_type,
@@ -901,24 +936,34 @@
        is_final = False
        cache = {}
        chunk_size = [5, 10, 5]
        encoder_chunk_look_back = 0
        decoder_chunk_look_back = 0
        if param_dict is not None and "cache" in param_dict:
            cache = param_dict["cache"]
        if param_dict is not None and "is_final" in param_dict:
            is_final = param_dict["is_final"]
        if param_dict is not None and "chunk_size" in param_dict:
            chunk_size = param_dict["chunk_size"]
        if param_dict is not None and "encoder_chunk_look_back" in param_dict:
            encoder_chunk_look_back = param_dict["encoder_chunk_look_back"]
            if encoder_chunk_look_back > 0:
                chunk_size[0] = 0
        if param_dict is not None and "decoder_chunk_look_back" in param_dict:
            decoder_chunk_look_back = param_dict["decoder_chunk_look_back"]
        # 7 .Start for-loop
        # FIXME(kamo): The output format should be discussed about
        raw_inputs = torch.unsqueeze(raw_inputs, axis=0)
        asr_result_list = []
        cache = _prepare_cache(cache, chunk_size=chunk_size, batch_size=1)
        cache = _prepare_cache(cache, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back,
                               decoder_chunk_look_back=decoder_chunk_look_back, batch_size=1)
        item = {}
        if data_path_and_name_and_type is not None and data_path_and_name_and_type[2] == "sound":
            sample_offset = 0
            speech_length = raw_inputs.shape[1]
            stride_size = chunk_size[1] * 960
            cache = _prepare_cache(cache, chunk_size=chunk_size, batch_size=1)
            cache = _prepare_cache(cache, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back,
                                   decoder_chunk_look_back=decoder_chunk_look_back, batch_size=1)
            final_result = ""
            for sample_offset in range(0, speech_length, min(stride_size, speech_length - sample_offset)):
                if sample_offset + stride_size >= speech_length - 1:
@@ -939,7 +984,8 @@
        asr_result_list.append(item)
        if is_final:
            cache = _cache_reset(cache, chunk_size=chunk_size, batch_size=1)
            cache = _cache_reset(cache, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back,
                                 decoder_chunk_look_back=decoder_chunk_look_back, batch_size=1)
        return asr_result_list
    return _forward
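With this change `_prepare_cache` seeds both look-back settings into the encoder and decoder caches, and `_cache_reset` rebuilds the same structure once `is_final` is seen. A minimal sketch of the resulting cache layout, using placeholder values where the real code reads the training-config YAML (`encoder_conf.output_size`, `frontend_conf.n_mels * frontend_conf.lfr_m`):

```python
import torch

# Placeholder values standing in for config["encoder_conf"]["output_size"] and
# config["frontend_conf"]["n_mels"] * config["frontend_conf"]["lfr_m"].
enc_output_size, feats_dims, batch_size = 512, 560, 1
chunk_size = [0, 10, 5]

cache = {
    "encoder": {
        "start_idx": 0,
        "cif_hidden": torch.zeros((batch_size, 1, enc_output_size)),
        "cif_alphas": torch.zeros((batch_size, 1)),
        "chunk_size": chunk_size,
        "encoder_chunk_look_back": 4,   # chunks of self-attention context kept by the encoder
        "last_chunk": False,
        "opt": None,                    # per-layer k/v cache, filled on the first forward pass
        "feats": torch.zeros((batch_size, chunk_size[0] + chunk_size[2], feats_dims)),
        "tail_chunk": False,
    },
    "decoder": {
        "decode_fsmn": None,            # per-layer FSMN memory
        "decoder_chunk_look_back": 1,   # encoder chunks of cross-attention context
        "opt": None,
        "chunk_size": chunk_size,
    },
}
```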
funasr/datasets/large_datasets/datapipes/batch.py
@@ -39,7 +39,7 @@
        self.batch_mode = batch_mode
    def set_epoch(self, epoch):
        self.epoch = epoch
        self.datapipe.set_epoch(epoch)
    def __iter__(self):
        buffer = []
funasr/datasets/large_datasets/datapipes/filter.py
@@ -13,7 +13,7 @@
        self.fn = fn
    def set_epoch(self, epoch):
        self.epoch = epoch
        self.datapipe.set_epoch(epoch)
    def __iter__(self):
        assert callable(self.fn)
@@ -21,4 +21,4 @@
            if self.fn(data):
                yield data
            else:
                continue
                continue
funasr/datasets/large_datasets/datapipes/map.py
@@ -14,7 +14,7 @@
        self.fn = fn
    def set_epoch(self, epoch):
        self.epoch = epoch
        self.datapipe.set_epoch(epoch)
    def __iter__(self):
        assert callable(self.fn)
funasr/models/decoder/sanm_decoder.py
@@ -105,7 +105,7 @@
        return x, tgt_mask, memory, memory_mask, cache
    def forward_chunk(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
    def forward_one_step(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
        """Compute decoded features.
        Args:
@@ -147,6 +147,47 @@
        return x, tgt_mask, memory, memory_mask, cache
    def forward_chunk(self, tgt, memory, fsmn_cache=None, opt_cache=None, chunk_size=None, look_back=0):
        """Compute decoded features.
        Args:
            tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
            tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
            memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
            cache (List[torch.Tensor]): List of cached tensors.
                Each tensor shape should be (#batch, maxlen_out - 1, size).
        Returns:
            torch.Tensor: Output tensor(#batch, maxlen_out, size).
            torch.Tensor: Mask for output tensor (#batch, maxlen_out).
            torch.Tensor: Encoded memory (#batch, maxlen_in, size).
            torch.Tensor: Encoded memory mask (#batch, maxlen_in).
        """
        residual = tgt
        if self.normalize_before:
            tgt = self.norm1(tgt)
        tgt = self.feed_forward(tgt)
        x = tgt
        if self.self_attn:
            if self.normalize_before:
                tgt = self.norm2(tgt)
            x, fsmn_cache = self.self_attn(tgt, None, fsmn_cache)
            x = residual + self.dropout(x)
        if self.src_attn is not None:
            residual = x
            if self.normalize_before:
                x = self.norm3(x)
            x, opt_cache = self.src_attn.forward_chunk(x, memory, opt_cache, chunk_size, look_back)
            x = residual + x
        return x, memory, fsmn_cache, opt_cache
class FsmnDecoderSCAMAOpt(BaseTransformerDecoder):
    """
@@ -397,7 +438,7 @@
        for i in range(self.att_layer_num):
            decoder = self.decoders[i]
            c = cache[i]
            x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
            x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_one_step(
                x, tgt_mask, memory, memory_mask, cache=c
            )
            new_cache.append(c_ret)
@@ -407,13 +448,13 @@
                j = i + self.att_layer_num
                decoder = self.decoders2[i]
                c = cache[j]
                x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
                x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_one_step(
                    x, tgt_mask, memory, memory_mask, cache=c
                )
                new_cache.append(c_ret)
        for decoder in self.decoders3:
            x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
            x, tgt_mask, memory, memory_mask, _ = decoder.forward_one_step(
                x, tgt_mask, memory, None, cache=None
            )
@@ -837,6 +878,7 @@
        lora_rank: int = 8,
        lora_alpha: int = 16,
        lora_dropout: float = 0.1,
        chunk_multiply_factor: tuple = (1,),
        tf2torch_tensor_name_prefix_torch: str = "decoder",
        tf2torch_tensor_name_prefix_tf: str = "seq2seq/decoder",
    ):
@@ -929,6 +971,7 @@
        )
        self.tf2torch_tensor_name_prefix_torch = tf2torch_tensor_name_prefix_torch
        self.tf2torch_tensor_name_prefix_tf = tf2torch_tensor_name_prefix_tf
        self.chunk_multiply_factor = chunk_multiply_factor
    def forward(
        self,
@@ -1020,35 +1063,43 @@
            cache_layer_num = len(self.decoders)
            if self.decoders2 is not None:
                cache_layer_num += len(self.decoders2)
            new_cache = [None] * cache_layer_num
            fsmn_cache = [None] * cache_layer_num
        else:
            new_cache = cache["decode_fsmn"]
            fsmn_cache = cache["decode_fsmn"]
        if cache["opt"] is None:
            cache_layer_num = len(self.decoders)
            opt_cache = [None] * cache_layer_num
        else:
            opt_cache = cache["opt"]
        for i in range(self.att_layer_num):
            decoder = self.decoders[i]
            x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
                x, None, memory, None, cache=new_cache[i]
            x, memory, fsmn_cache[i], opt_cache[i] = decoder.forward_chunk(
                x, memory, fsmn_cache=fsmn_cache[i], opt_cache=opt_cache[i],
                chunk_size=cache["chunk_size"], look_back=cache["decoder_chunk_look_back"]
            )
            new_cache[i] = c_ret
        if self.num_blocks - self.att_layer_num > 1:
            for i in range(self.num_blocks - self.att_layer_num):
                j = i + self.att_layer_num
                decoder = self.decoders2[i]
                x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
                    x, None, memory, None, cache=new_cache[j]
                x, memory, fsmn_cache[j], _  = decoder.forward_chunk(
                    x, memory, fsmn_cache=fsmn_cache[j]
                )
                new_cache[j] = c_ret
        for decoder in self.decoders3:
            x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
                x, None, memory, None, cache=None
            x, memory, _, _ = decoder.forward_chunk(
                x, memory
            )
        if self.normalize_before:
            x = self.after_norm(x)
        if self.output_layer is not None:
            x = self.output_layer(x)
        cache["decode_fsmn"] = new_cache
        cache["decode_fsmn"] = fsmn_cache
        if cache["decoder_chunk_look_back"] > 0 or cache["decoder_chunk_look_back"] == -1:
            cache["opt"] = opt_cache
        return x
    def forward_one_step(
@@ -1082,7 +1133,7 @@
        for i in range(self.att_layer_num):
            decoder = self.decoders[i]
            c = cache[i]
            x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
            x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_one_step(
                x, tgt_mask, memory, None, cache=c
            )
            new_cache.append(c_ret)
@@ -1092,14 +1143,14 @@
                j = i + self.att_layer_num
                decoder = self.decoders2[i]
                c = cache[j]
                x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
                x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_one_step(
                    x, tgt_mask, memory, None, cache=c
                )
                new_cache.append(c_ret)
        for decoder in self.decoders3:
            x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
            x, tgt_mask, memory, memory_mask, _ = decoder.forward_one_step(
                x, tgt_mask, memory, None, cache=None
            )
funasr/models/encoder/sanm_encoder.py
@@ -114,8 +114,44 @@
        if not self.normalize_before:
            x = self.norm2(x)
        return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder
    def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0):
        """Compute encoded features.
        Args:
            x_input (torch.Tensor): Input tensor (#batch, time, size).
            mask (torch.Tensor): Mask tensor for the input (#batch, time).
            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
        Returns:
            torch.Tensor: Output tensor (#batch, time, size).
            torch.Tensor: Mask tensor (#batch, time).
        """
        residual = x
        if self.normalize_before:
            x = self.norm1(x)
        if self.in_size == self.size:
            attn, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back)
            x = residual + attn
        else:
            x, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back)
        if not self.normalize_before:
            x = self.norm1(x)
        residual = x
        if self.normalize_before:
            x = self.norm2(x)
        x = residual + self.feed_forward(x)
        if not self.normalize_before:
            x = self.norm2(x)
        return x, cache
class SANMEncoder(AbsEncoder):
    """
@@ -841,7 +877,6 @@
                      xs_pad: torch.Tensor,
                      ilens: torch.Tensor,
                      cache: dict = None,
                      ctc: CTC = None,
                      ):
        xs_pad *= self.output_size() ** 0.5
        if self.embed is None:
@@ -852,34 +887,25 @@
            xs_pad = to_device(cache["feats"], device=xs_pad.device)
        else:
            xs_pad = self._add_overlap_chunk(xs_pad, cache)
        encoder_outs = self.encoders0(xs_pad, None, None, None, None)
        xs_pad, masks = encoder_outs[0], encoder_outs[1]
        intermediate_outs = []
        if len(self.interctc_layer_idx) == 0:
            encoder_outs = self.encoders(xs_pad, None, None, None, None)
            xs_pad, masks = encoder_outs[0], encoder_outs[1]
        if cache["opt"] is None:
            cache_layer_num = len(self.encoders0) + len(self.encoders)
            new_cache = [None] * cache_layer_num
        else:
            for layer_idx, encoder_layer in enumerate(self.encoders):
                encoder_outs = encoder_layer(xs_pad, None, None, None, None)
                xs_pad, masks = encoder_outs[0], encoder_outs[1]
                if layer_idx + 1 in self.interctc_layer_idx:
                    encoder_out = xs_pad
            new_cache = cache["opt"]
                    # intermediate outputs are also normalized
                    if self.normalize_before:
                        encoder_out = self.after_norm(encoder_out)
        for layer_idx, encoder_layer in enumerate(self.encoders0):
            encoder_outs = encoder_layer.forward_chunk(xs_pad, new_cache[layer_idx], cache["chunk_size"], cache["encoder_chunk_look_back"])
            xs_pad, new_cache[0] = encoder_outs[0], encoder_outs[1]
                    intermediate_outs.append((layer_idx + 1, encoder_out))
                    if self.interctc_use_conditioning:
                        ctc_out = ctc.softmax(encoder_out)
                        xs_pad = xs_pad + self.conditioning_layer(ctc_out)
        for layer_idx, encoder_layer in enumerate(self.encoders):
            encoder_outs = encoder_layer.forward_chunk(xs_pad, new_cache[layer_idx+len(self.encoders0)], cache["chunk_size"], cache["encoder_chunk_look_back"])
            xs_pad, new_cache[layer_idx+len(self.encoders0)] = encoder_outs[0], encoder_outs[1]
        if self.normalize_before:
            xs_pad = self.after_norm(xs_pad)
        if cache["encoder_chunk_look_back"] > 0 or cache["encoder_chunk_look_back"] == -1:
            cache["opt"] = new_cache
        if len(intermediate_outs) > 0:
            return (xs_pad, intermediate_outs), None, None
        return xs_pad, ilens, None
    def gen_tf2torch_map_dict(self):
funasr/modules/attention.py
@@ -456,6 +456,44 @@
        att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder)
        return att_outs + fsmn_memory
    def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0):
        """Compute scaled dot product attention.
        Args:
            query (torch.Tensor): Query tensor (#batch, time1, size).
            key (torch.Tensor): Key tensor (#batch, time2, size).
            value (torch.Tensor): Value tensor (#batch, time2, size).
            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
                (#batch, time1, time2).
        Returns:
            torch.Tensor: Output tensor (#batch, time1, d_model).
        """
        q_h, k_h, v_h, v = self.forward_qkv(x)
        if chunk_size is not None and look_back > 0 or look_back == -1:
            if cache is not None:
                k_h_stride = k_h[:, :, :-(chunk_size[2]), :]
                v_h_stride = v_h[:, :, :-(chunk_size[2]), :]
                k_h = torch.cat((cache["k"], k_h), dim=2)
                v_h = torch.cat((cache["v"], v_h), dim=2)
                cache["k"] = torch.cat((cache["k"], k_h_stride), dim=2)
                cache["v"] = torch.cat((cache["v"], v_h_stride), dim=2)
                if look_back != -1:
                    cache["k"] = cache["k"][:, :, -(look_back * chunk_size[1]):, :]
                    cache["v"] = cache["v"][:, :, -(look_back * chunk_size[1]):, :]
            else:
                cache_tmp = {"k": k_h[:, :, :-(chunk_size[2]), :],
                             "v": v_h[:, :, :-(chunk_size[2]), :]}
                cache = cache_tmp
        fsmn_memory = self.forward_fsmn(v, None)
        q_h = q_h * self.d_k ** (-0.5)
        scores = torch.matmul(q_h, k_h.transpose(-2, -1))
        att_outs = self.forward_attention(v_h, scores, None)
        return att_outs + fsmn_memory, cache
class MultiHeadedAttentionSANMwithMask(MultiHeadedAttentionSANM):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
@@ -667,6 +705,35 @@
        scores = torch.matmul(q_h, k_h.transpose(-2, -1))
        return self.forward_attention(v_h, scores, memory_mask)
    def forward_chunk(self, x, memory, cache=None, chunk_size=None, look_back=0):
        """Compute scaled dot product attention.
        Args:
            query (torch.Tensor): Query tensor (#batch, time1, size).
            key (torch.Tensor): Key tensor (#batch, time2, size).
            value (torch.Tensor): Value tensor (#batch, time2, size).
            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
                (#batch, time1, time2).
        Returns:
            torch.Tensor: Output tensor (#batch, time1, d_model).
        """
        q_h, k_h, v_h = self.forward_qkv(x, memory)
        if chunk_size is not None and look_back > 0:
            if cache is not None:
                k_h = torch.cat((cache["k"], k_h), dim=2)
                v_h = torch.cat((cache["v"], v_h), dim=2)
                cache["k"] = k_h[:, :, -(look_back * chunk_size[1]):, :]
                cache["v"] = v_h[:, :, -(look_back * chunk_size[1]):, :]
            else:
                cache_tmp = {"k": k_h[:, :, -(look_back * chunk_size[1]):, :],
                             "v": v_h[:, :, -(look_back * chunk_size[1]):, :]}
                cache = cache_tmp
        q_h = q_h * self.d_k ** (-0.5)
        scores = torch.matmul(q_h, k_h.transpose(-2, -1))
        return self.forward_attention(v_h, scores, None), cache
class MultiHeadSelfAttention(nn.Module):
    """Multi-Head Attention layer.
funasr/runtime/docs/SDK_advanced_guide_offline.md
@@ -36,9 +36,9 @@
Use the following command to pull and launch the Docker image for the FunASR runtime-SDK:
```shell
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.1
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.2
sudo docker run -p 10095:10095 -it --privileged=true -v /root:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.1
sudo docker run -p 10095:10095 -it --privileged=true -v /root:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.2
```
Introduction to command parameters: 
funasr/runtime/docs/SDK_advanced_guide_offline_zh.md
@@ -22,9 +22,12 @@
通过下述命令拉取并启动FunASR runtime-SDK的docker镜像:
```shell
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.1
sudo docker pull \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.2
mkdir -p ./funasr-runtime-resources/models
sudo docker run -p 10095:10095 -it --privileged=true -v ./funasr-runtime-resources/models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.1
sudo docker run -p 10095:10095 -it --privileged=true \
  -v ./funasr-runtime-resources/models:/workspace/models \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.2.2
```
如果您没有安装docker,可参考[Docker安装](#Docker安装)
@@ -100,18 +103,20 @@
若想直接运行client进行测试,可参考如下简易说明,以python版本为例:
```shell
python3 wss_client_asr.py --host "127.0.0.1" --port 10095 --mode offline --audio_in "../audio/asr_example.wav" --output_dir "./results"
python3 funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode offline \
        --audio_in "../audio/asr_example.wav" --output_dir "./results"
```
命令参数说明:
```text
--host 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,需要改为部署机器ip
--host 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,
       需要改为部署机器ip
--port 10095 部署端口号
--mode offline表示离线文件转写
--audio_in 需要进行转写的音频文件,支持文件路径,文件列表wav.scp
--thread_num 设置并发发送线程数,默认为1
--ssl 设置是否开启ssl证书校验,默认1开启,设置为0关闭
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (could be: 阿里巴巴 达摩院)
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串(阿里巴巴 达摩院)
--use_itn 设置是否使用itn,默认1开启,设置为0关闭
```
@@ -124,10 +129,11 @@
命令参数说明:
```text
--server-ip 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,需要改为部署机器ip
--server-ip 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,
            需要改为部署机器ip
--port 10095 部署端口号
--wav-path 需要进行转写的音频文件,支持文件路径
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (could be: 阿里巴巴 达摩院)
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (阿里巴巴 达摩院)
--use-itn 设置是否使用itn,默认1开启,设置为0关闭
```
funasr/runtime/docs/SDK_advanced_guide_online.md
@@ -8,9 +8,9 @@
Use the following command to pull and start the FunASR software package docker image:
```shell
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.1
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2
mkdir -p ./funasr-runtime-resources/models
sudo docker run -p 10095:10095 -it --privileged=true -v ./funasr-runtime-resources/models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.1
sudo docker run -p 10095:10095 -it --privileged=true -v ./funasr-runtime-resources/models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2
```
If you do not have Docker installed, please refer to [Docker Installation](https://alibaba-damo-academy.github.io/FunASR/en/installation/docker.html)
funasr/runtime/docs/SDK_advanced_guide_online_zh.md
@@ -11,9 +11,12 @@
通过下述命令拉取并启动FunASR软件包的docker镜像:
```shell
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.1
sudo docker pull \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2
mkdir -p ./funasr-runtime-resources/models
sudo docker run -p 10095:10095 -it --privileged=true -v ./funasr-runtime-resources/models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.1
sudo docker run -p 10095:10095 -it --privileged=true \
  -v ./funasr-runtime-resources/models:/workspace/models \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2
```
如果您没有安装docker,可参考[Docker安装](https://alibaba-damo-academy.github.io/FunASR/en/installation/docker_zh.html)
funasr/runtime/docs/SDK_tutorial_online_zh.md
@@ -67,34 +67,39 @@
命令参数说明:
```text
--host 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,需要改为部署机器ip
--host 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,
       需要改为部署机器ip
--port 10095 部署端口号
--mode:`offline`表示推理模式为一句话识别;`online`表示推理模式为实时语音识别;`2pass`表示为实时语音识别,并且说话句尾采用离线模型进行纠错。
--mode:`offline`表示推理模式为一句话识别;`online`表示推理模式为实时语音识别;`2pass`表示为实时语音识别,
       并且说话句尾采用离线模型进行纠错。
--chunk_size:表示流式模型latency配置`[5,10,5]`,表示当前音频解码片段为600ms,并且回看300ms,右看300ms。
--audio_in 需要进行转写的音频文件,支持文件路径,文件列表wav.scp
--thread_num 设置并发发送线程数,默认为1
--ssl 设置是否开启ssl证书校验,默认1开启,设置为0关闭
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (could be: 阿里巴巴 达摩院)
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (阿里巴巴 达摩院)
--use_itn 设置是否使用itn,默认1开启,设置为0关闭
```
### cpp-client
进入samples/cpp目录后,可以用cpp进行测试,指令如下:
```shell
./funasr-wss-client-2pass --server-ip 127.0.0.1 --port 10095 --mode 2pass --wav-path ../audio/asr_example.wav
./funasr-wss-client-2pass --server-ip 127.0.0.1 --port 10095 --mode 2pass \
   --wav-path ../audio/asr_example.wav
```
命令参数说明:
```text
--server-ip 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,需要改为部署机器ip
--server-ip 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,
            需要改为部署机器ip
--port 10095 部署端口号
--mode:`offline`表示推理模式为一句话识别;`online`表示推理模式为实时语音识别;`2pass`表示为实时语音识别,并且说话句尾采用离线模型进行纠错。
--mode:`offline`表示推理模式为一句话识别;`online`表示推理模式为实时语音识别;`2pass`表示为实时语音识别,
        并且说话句尾采用离线模型进行纠错。
--chunk-size:表示流式模型latency配置`[5,10,5]`,表示当前音频解码片段为600ms,并且回看300ms,右看300ms。
--wav-path 需要进行转写的音频文件,支持文件路径
--thread-num 设置并发发送线程数,默认为1
--is-ssl 设置是否开启ssl证书校验,默认1开启,设置为0关闭
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (could be: 阿里巴巴 达摩院)
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (阿里巴巴 达摩院)
--use-itn 设置是否使用itn,默认1开启,设置为0关闭
```
funasr/runtime/docs/SDK_tutorial_zh.md
@@ -69,13 +69,14 @@
命令参数说明:
```text
--host 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,需要改为部署机器ip
--host 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,
        需要改为部署机器ip
--port 10095 部署端口号
--mode offline表示离线文件转写
--audio_in 需要进行转写的音频文件,支持文件路径,文件列表wav.scp
--thread_num 设置并发发送线程数,默认为1
--ssl 设置是否开启ssl证书校验,默认1开启,设置为0关闭
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (could be: 阿里巴巴 达摩院)
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (阿里巴巴 达摩院)
--use_itn 设置是否使用itn,默认1开启,设置为0关闭
```
@@ -88,12 +89,13 @@
命令参数说明:
```text
--server-ip 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,需要改为部署机器ip
--server-ip 为FunASR runtime-SDK服务部署机器ip,默认为本机ip(127.0.0.1),如果client与服务不在同一台服务器,
            需要改为部署机器ip
--port 10095 部署端口号
--wav-path 需要进行转写的音频文件,支持文件路径
--thread_num 设置并发发送线程数,默认为1
--ssl 设置是否开启ssl证书校验,默认1开启,设置为0关闭
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (could be: 阿里巴巴 达摩院)
--hotword 如果模型为热词模型,可以设置热词: *.txt(每行一个热词) 或者空格分隔的热词字符串 (阿里巴巴 达摩院)
--use-itn 设置是否使用itn,默认1开启,设置为0关闭
```
funasr/runtime/ios/paraformer_online/paraformer_online.xcodeproj/project.pbxproj
@@ -90,6 +90,8 @@
        1A7F0DBE2A2F221C00A6EEB7 /* AudioCapture.mm in Sources */ = {isa = PBXBuildFile; fileRef = 1A7F0DBB2A2F221C00A6EEB7 /* AudioCapture.mm */; };
        1A7F0DBF2A2F221C00A6EEB7 /* AudioRecorder.m in Sources */ = {isa = PBXBuildFile; fileRef = 1A7F0DBD2A2F221C00A6EEB7 /* AudioRecorder.m */; };
        1A7F0DC32A2F312D00A6EEB7 /* model in Resources */ = {isa = PBXBuildFile; fileRef = 1A7F0DC22A2F312D00A6EEB7 /* model */; };
        1ACBFB692AB99D55002FC7C7 /* seg_dict.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 1ACBFB672AB99D55002FC7C7 /* seg_dict.cpp */; };
        1ACBFB6C2AB9A086002FC7C7 /* encode_converter.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 1ACBFB6B2AB9A086002FC7C7 /* encode_converter.cpp */; };
        59C4114F365C8D714BD515FB /* Pods_paraformer_online.framework in Frameworks */ = {isa = PBXBuildFile; fileRef = EA7D0713E60886A787BAA0EA /* Pods_paraformer_online.framework */; };
/* End PBXBuildFile section */
@@ -324,6 +326,10 @@
        1AB8E1EE2AA086F200F4F795 /* model.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = model.h; sourceTree = "<group>"; };
        1AB8E1EF2AA086F200F4F795 /* offline-stream.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = "offline-stream.h"; sourceTree = "<group>"; };
        1AB8E1F02AA086F200F4F795 /* vad-model.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = "vad-model.h"; sourceTree = "<group>"; };
        1ACBFB672AB99D55002FC7C7 /* seg_dict.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = seg_dict.cpp; sourceTree = "<group>"; };
        1ACBFB682AB99D55002FC7C7 /* seg_dict.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = seg_dict.h; sourceTree = "<group>"; };
        1ACBFB6A2AB9A086002FC7C7 /* encode_converter.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = encode_converter.h; sourceTree = "<group>"; };
        1ACBFB6B2AB9A086002FC7C7 /* encode_converter.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = encode_converter.cpp; sourceTree = "<group>"; };
        B9ED2A36675364C815C03C96 /* Pods-paraformer_online.debug.xcconfig */ = {isa = PBXFileReference; includeInIndex = 1; lastKnownFileType = text.xcconfig; name = "Pods-paraformer_online.debug.xcconfig"; path = "Target Support Files/Pods-paraformer_online/Pods-paraformer_online.debug.xcconfig"; sourceTree = "<group>"; };
        EA7D0713E60886A787BAA0EA /* Pods_paraformer_online.framework */ = {isa = PBXFileReference; explicitFileType = wrapper.framework; includeInIndex = 0; path = Pods_paraformer_online.framework; sourceTree = BUILT_PRODUCTS_DIR; };
/* End PBXFileReference section */
@@ -355,6 +361,8 @@
                1A6C92FB2A84D64E007E36DC /* ct-transformer.cpp */,
                1A6C93032A84D64E007E36DC /* ct-transformer.h */,
                1A6C92F92A84D64E007E36DC /* e2e-vad.h */,
                1ACBFB6B2AB9A086002FC7C7 /* encode_converter.cpp */,
                1ACBFB6A2AB9A086002FC7C7 /* encode_converter.h */,
                1A6C92F72A84D64E007E36DC /* fsmn-vad-online.cpp */,
                1A6C92E92A84D64E007E36DC /* fsmn-vad-online.h */,
                1A6C92E82A84D64E007E36DC /* fsmn-vad.cpp */,
@@ -371,6 +379,8 @@
                1A6C93022A84D64E007E36DC /* punc-model.cpp */,
                1A6C92ED2A84D64E007E36DC /* resample.cpp */,
                1A6C92E32A84D64E007E36DC /* resample.h */,
                1ACBFB672AB99D55002FC7C7 /* seg_dict.cpp */,
                1ACBFB682AB99D55002FC7C7 /* seg_dict.h */,
                1A6C93012A84D64E007E36DC /* tensor.h */,
                1A6C92F02A84D64E007E36DC /* tokenizer.cpp */,
                1A6C92EF2A84D64E007E36DC /* tokenizer.h */,
@@ -917,6 +927,7 @@
                1A6C93F72A84D66E007E36DC /* symbolize.cc in Sources */,
                1A6C93062A84D64E007E36DC /* util.cpp in Sources */,
                1A6C94222A84D66E007E36DC /* nodebuilder.cpp in Sources */,
                1ACBFB692AB99D55002FC7C7 /* seg_dict.cpp in Sources */,
                1A6C94132A84D66E007E36DC /* exp.cpp in Sources */,
                1A6C930A2A84D64E007E36DC /* vocab.cpp in Sources */,
                1A6C94012A84D66E007E36DC /* logging.cc in Sources */,
@@ -931,6 +942,7 @@
                1A6C940F2A84D66E007E36DC /* emitter.cpp in Sources */,
                1A6C93DE2A84D66E007E36DC /* fftsg.c in Sources */,
                1A6C940B2A84D66E007E36DC /* ostream_wrapper.cpp in Sources */,
                1ACBFB6C2AB9A086002FC7C7 /* encode_converter.cpp in Sources */,
                1A6C93E12A84D66E007E36DC /* log.cc in Sources */,
                1A6C94092A84D66E007E36DC /* exceptions.cpp in Sources */,
                1A6C94152A84D66E007E36DC /* node.cpp in Sources */,
@@ -1108,7 +1120,7 @@
                    "@executable_path/Frameworks",
                );
                MARKETING_VERSION = 1.0;
                PRODUCT_BUNDLE_IDENTIFIER = "com.qiuwei.paraformer-online";
                PRODUCT_BUNDLE_IDENTIFIER = "com.qiuwei.paraformer-online1";
                PRODUCT_NAME = "$(TARGET_NAME)";
                SWIFT_EMIT_LOC_STRINGS = YES;
                TARGETED_DEVICE_FAMILY = "1,2";
@@ -1147,7 +1159,7 @@
                    "@executable_path/Frameworks",
                );
                MARKETING_VERSION = 1.0;
                PRODUCT_BUNDLE_IDENTIFIER = "com.qiuwei.paraformer-online";
                PRODUCT_BUNDLE_IDENTIFIER = "com.qiuwei.paraformer-online1";
                PRODUCT_NAME = "$(TARGET_NAME)";
                SWIFT_EMIT_LOC_STRINGS = YES;
                TARGETED_DEVICE_FAMILY = "1,2";
funasr/runtime/onnxruntime/include/funasrruntime.h
@@ -105,7 +105,10 @@
_FUNASRAPI FUNASR_RESULT    FunOfflineInfer(FUNASR_HANDLE handle, const char* sz_filename, FUNASR_MODE mode, 
                                            QM_CALLBACK fn_callback, const std::vector<std::vector<float>> &hw_emb, 
                                            int sampling_rate=16000, bool itn=true);
#if !defined(__APPLE__)
_FUNASRAPI const std::vector<std::vector<float>> CompileHotwordEmbedding(FUNASR_HANDLE handle, std::string &hotwords, ASR_TYPE mode=ASR_OFFLINE);
#endif
_FUNASRAPI void                FunOfflineUninit(FUNASR_HANDLE handle);
//2passStream
funasr/runtime/onnxruntime/include/model.h
@@ -17,7 +17,7 @@
    virtual std::string Rescoring() = 0;
    virtual void InitHwCompiler(const std::string &hw_model, int thread_num){};
    virtual void InitSegDict(const std::string &seg_dict_model){};
    virtual std::vector<std::vector<float>> CompileHotwordEmbedding(std::string &hotwords){};
    virtual std::vector<std::vector<float>> CompileHotwordEmbedding(std::string &hotwords){return std::vector<std::vector<float>>();};
};
Model *CreateModel(std::map<std::string, std::string>& model_path, int thread_num=1, ASR_TYPE type=ASR_OFFLINE);
funasr/runtime/onnxruntime/include/offline-stream.h
@@ -7,7 +7,9 @@
#include "model.h"
#include "punc-model.h"
#include "vad-model.h"
#if !defined(__APPLE__)
#include "itn-model.h"
#endif
namespace funasr {
class OfflineStream {
@@ -18,7 +20,9 @@
    std::unique_ptr<VadModel> vad_handle= nullptr;
    std::unique_ptr<Model> asr_handle= nullptr;
    std::unique_ptr<PuncModel> punc_handle= nullptr;
#if !defined(__APPLE__)
    std::unique_ptr<ITNModel> itn_handle = nullptr;
#endif
    bool UseVad(){return use_vad;};
    bool UsePunc(){return use_punc;}; 
    bool UseITN(){return use_itn;};
funasr/runtime/onnxruntime/include/tpass-stream.h
@@ -7,7 +7,9 @@
#include "model.h"
#include "punc-model.h"
#include "vad-model.h"
#if !defined(__APPLE__)
#include "itn-model.h"
#endif
namespace funasr {
class TpassStream {
@@ -18,7 +20,9 @@
    std::unique_ptr<VadModel> vad_handle = nullptr;
    std::unique_ptr<Model> asr_handle = nullptr;
    std::unique_ptr<PuncModel> punc_online_handle = nullptr;
#if !defined(__APPLE__)
    std::unique_ptr<ITNModel> itn_handle = nullptr;
#endif
    bool UseVad(){return use_vad;};
    bool UsePunc(){return use_punc;}; 
    bool UseITN(){return use_itn;};
funasr/runtime/onnxruntime/src/funasrruntime.cpp
@@ -285,10 +285,12 @@
            string punc_res = (offline_stream->punc_handle)->AddPunc((p_result->msg).c_str());
            p_result->msg = punc_res;
        }
#if !defined(__APPLE__)
        if(offline_stream->UseITN() && itn){
            string msg_itn = offline_stream->itn_handle->Normalize(p_result->msg);
            p_result->msg = msg_itn;
        }
#endif
        return p_result;
    }
@@ -364,13 +366,16 @@
            string punc_res = (offline_stream->punc_handle)->AddPunc((p_result->msg).c_str());
            p_result->msg = punc_res;
        }
#if !defined(__APPLE__)
        if(offline_stream->UseITN() && itn){
            string msg_itn = offline_stream->itn_handle->Normalize(p_result->msg);
            p_result->msg = msg_itn;
        }
#endif
        return p_result;
    }
#if !defined(__APPLE__)
    _FUNASRAPI const std::vector<std::vector<float>> CompileHotwordEmbedding(FUNASR_HANDLE handle, std::string &hotwords, ASR_TYPE mode)
    {
        if (mode == ASR_OFFLINE){
@@ -394,7 +399,7 @@
        }
        
    }
#endif
    // APIs for 2pass-stream Infer
    _FUNASRAPI FUNASR_RESULT FunTpassInferBuffer(FUNASR_HANDLE handle, FUNASR_HANDLE online_handle, const char* sz_buf, 
@@ -450,13 +455,13 @@
                    string online_msg = ((funasr::ParaformerOnline*)asr_online_handle)->online_res;
                    string msg_punc = punc_online_handle->AddPunc(online_msg.c_str(), punc_cache[0]);
                    p_result->tpass_msg = msg_punc;
#if !defined(__APPLE__)
                    // ITN
                    if(tpass_stream->UseITN() && itn){
                        string msg_itn = tpass_stream->itn_handle->Normalize(msg_punc);
                        p_result->tpass_msg = msg_itn;
                    }
#endif
                    ((funasr::ParaformerOnline*)asr_online_handle)->online_res = "";
                    p_result->msg += msg;
                }else{
@@ -501,10 +506,12 @@
                msg_punc += "。";
            }
            p_result->tpass_msg = msg_punc;
#if !defined(__APPLE__)
            if(tpass_stream->UseITN() && itn){
                string msg_itn = tpass_stream->itn_handle->Normalize(msg_punc);
                p_result->tpass_msg = msg_itn;
            }
#endif
            if(frame != NULL){
                delete frame;
funasr/runtime/onnxruntime/src/offline-stream.cpp
@@ -84,7 +84,7 @@
            use_punc = true;
        }
    }
#if !defined(__APPLE__)
    // Optional: ITN, here we just support language_type=MandarinEnglish
    if(model_path.find(ITN_DIR) != model_path.end() && model_path.at(ITN_DIR) != ""){
        string itn_tagger_path = PathAppend(model_path.at(ITN_DIR), ITN_TAGGER_NAME);
@@ -100,6 +100,7 @@
            use_itn = true;
        }
    }
#endif
}
OfflineStream *CreateOfflineStream(std::map<std::string, std::string>& model_path, int thread_num)
funasr/runtime/onnxruntime/src/paraformer.cpp
@@ -684,7 +684,7 @@
                return "";
            }
            //PrintMat(hw_emb, "input_clas_emb");
            const int64_t hotword_shape[3] = {1, hw_emb.size(), hw_emb[0].size()};
            const int64_t hotword_shape[3] = {1, static_cast<int64_t>(hw_emb.size()), static_cast<int64_t>(hw_emb[0].size())};
            embedding.reserve(hw_emb.size() * hw_emb[0].size());
            for (auto item : hw_emb) {
                embedding.insert(embedding.end(), item.begin(), item.end());
funasr/runtime/onnxruntime/src/precomp.h
@@ -24,6 +24,8 @@
#else
#include "onnxruntime_run_options_config_keys.h"
#include "onnxruntime_cxx_api.h"
#include "itn-model.h"
#include "itn-processor.h"
#endif
#include "kaldi-native-fbank/csrc/feature-fbank.h"
@@ -38,11 +40,9 @@
#include "model.h"
#include "vad-model.h"
#include "punc-model.h"
#include "itn-model.h"
#include "tokenizer.h"
#include "ct-transformer.h"
#include "ct-transformer-online.h"
#include "itn-processor.h"
#include "e2e-vad.h"
#include "fsmn-vad.h"
#include "encode_converter.h"
funasr/runtime/onnxruntime/src/tpass-stream.cpp
@@ -89,7 +89,7 @@
            use_punc = true;
        }
    }
#if !defined(__APPLE__)
    // Optional: ITN, here we just support language_type=MandarinEnglish
    if(model_path.find(ITN_DIR) != model_path.end()){
        string itn_tagger_path = PathAppend(model_path.at(ITN_DIR), ITN_TAGGER_NAME);
@@ -105,6 +105,7 @@
            use_itn = true;
        }
    }
#endif
      
}
@@ -114,4 +115,4 @@
    mm = new TpassStream(model_path, thread_num);
    return mm;
}
} // namespace funasr
} // namespace funasr
funasr/runtime/python/websocket/funasr_client_api.py
@@ -51,7 +51,8 @@
        stride = int(60 *  chunk_size[1]/  chunk_interval / 1000 * 16000 * 2)
        chunk_num = (len(audio_bytes) - 1) // stride + 1
       
        message = json.dumps({"mode":  mode, "chunk_size":  chunk_size, "chunk_interval":  chunk_interval,
        message = json.dumps({"mode": args.mode, "chunk_size": args.chunk_size, "encoder_chunk_look_back": 4,
                              "decoder_chunk_look_back": 1, "chunk_interval": args.chunk_interval,
                              "wav_name": wav_name, "is_speaking": True})
 
        self.websocket.send(message)
@@ -131,4 +132,4 @@
    print("text",text)
 
    
funasr/runtime/python/websocket/funasr_wss_client.py
@@ -29,6 +29,14 @@
                    type=str,
                    default="5, 10, 5",
                    help="chunk")
parser.add_argument("--encoder_chunk_look_back",
                    type=int,
                    default=4,
                    help="number of chunks to lookback for encoder self-attention")
parser.add_argument("--decoder_chunk_look_back",
                    type=int,
                    default=1,
                    help="number of encoder chunks to lookback for decoder cross-attention")
parser.add_argument("--chunk_interval",
                    type=int,
                    default=10,
@@ -99,7 +107,8 @@
                    input=True,
                    frames_per_buffer=CHUNK)
    message = json.dumps({"mode": args.mode, "chunk_size": args.chunk_size, "chunk_interval": args.chunk_interval,
    message = json.dumps({"mode": args.mode, "chunk_size": args.chunk_size, "encoder_chunk_look_back": args.encoder_chunk_look_back,
                          "decoder_chunk_look_back": args.decoder_chunk_look_back, "chunk_interval": args.chunk_interval,
                          "wav_name": "microphone", "is_speaking": True})
    #voices.put(message)
    await websocket.send(message)
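For reference, the opening handshake the updated client sends now carries the two look-back fields alongside `chunk_size`; a minimal sketch with the new defaults (mode, chunk_size value, and wav name are placeholders, and the real client parses its `--chunk_size` argument into a list of ints):

```python
import json

# Minimal sketch of the first websocket message sent by funasr_wss_client.py
# with the new look-back defaults. Concrete values are placeholders.
message = json.dumps({
    "mode": "online",
    "chunk_size": [5, 10, 5],
    "encoder_chunk_look_back": 4,   # chunks of encoder self-attention context
    "decoder_chunk_look_back": 1,   # encoder chunks of decoder cross-attention context
    "chunk_interval": 10,
    "wav_name": "microphone",
    "is_speaking": True,
})
print(message)
# await websocket.send(message)  # inside the client's asyncio coroutine
```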
funasr/runtime/python/websocket/funasr_wss_server.py
@@ -103,8 +103,8 @@
    model=args.asr_model_online,
    ngpu=args.ngpu,
    ncpu=args.ncpu,
    model_revision='v1.0.4',
    update_model='v1.0.4',
    model_revision='v1.0.7',
    update_model='v1.0.7',
    mode='paraformer_streaming')
print("model loaded! only support one client at the same time now!!!!")
@@ -159,6 +159,10 @@
                    websocket.wav_name = messagejson.get("wav_name")
                if "chunk_size" in messagejson:
                    websocket.param_dict_asr_online["chunk_size"] = messagejson["chunk_size"]
                if "encoder_chunk_look_back" in messagejson:
                    websocket.param_dict_asr_online["encoder_chunk_look_back"] = messagejson["encoder_chunk_look_back"]
                if "decoder_chunk_look_back" in messagejson:
                    websocket.param_dict_asr_online["decoder_chunk_look_back"] = messagejson["decoder_chunk_look_back"]
                if "mode" in messagejson:
                    websocket.mode = messagejson["mode"]
            if len(frames_asr_online) > 0 or len(frames_asr) > 0 or not isinstance(message, str):
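On the server side the same fields are copied straight from the websocket JSON into `param_dict_asr_online` before the streaming pipeline is called; a condensed sketch of that hand-off (message values are placeholders):

```python
# Condensed sketch of the server-side hand-off in funasr_wss_server.py:
# look-back settings received from the client land in param_dict_asr_online.
param_dict_asr_online = {"cache": dict(), "is_final": False}
messagejson = {"chunk_size": [5, 10, 5], "encoder_chunk_look_back": 4,
               "decoder_chunk_look_back": 1}   # as parsed from the client message
for key in ("chunk_size", "encoder_chunk_look_back", "decoder_chunk_look_back"):
    if key in messagejson:
        param_dict_asr_online[key] = messagejson[key]
print(param_dict_asr_online)
```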
funasr/runtime/readme.md
@@ -30,9 +30,9 @@
### latest version & image ID
| image version                |  image ID | INFO |
|------------------------------|-----|------|
| funasr-runtime-sdk-cpu-0.2.1 |  1ad3d19e0707   |      |
| image version                       |  image ID | INFO |
|-------------------------------------|-----|------|
| funasr-runtime-sdk-online-cpu-0.1.2 |  7222c5319bcf   |      |
## File Transcription Service, Mandarin (CPU)
@@ -53,6 +53,6 @@
The documentation mainly targets advanced developers who require modifications and customization of the service. It supports downloading model deployments from modelscope and also supports deploying models that users have fine-tuned. For detailed information, please refer to the documentation available by [docs](./docs/SDK_advanced_guide_offline.md)
### latest version & image ID
|  image version   |  image ID | INFO |
|-----|-----|------|
|   funasr-runtime-sdk-online-cpu-0.1.1  |  bdbdd0b27dee   |      |
| image version                |  image ID | INFO |
|------------------------------|-----|------|
| funasr-runtime-sdk-cpu-0.2.2 |  2c5286be13e9   |      |
funasr/runtime/readme_cn.md
@@ -31,9 +31,9 @@
### 最新版本及image ID
| image version                |  image ID | INFO |
|------------------------------|-----|------|
| funasr-runtime-sdk-cpu-0.2.1 |   1ad3d19e0707  |      |
| image version                       |  image ID | INFO |
|-------------------------------------|-----|------|
| funasr-runtime-sdk-online-cpu-0.1.2 |   7222c5319bcf  |      |
## 中文离线文件转写服务(CPU版本)
@@ -55,6 +55,6 @@
文档介绍了背后技术原理,识别准确率,计算效率等,以及核心优势介绍:便捷、高精度、高效率、长音频链路,详细文档参考([点击此处](https://mp.weixin.qq.com/s/DHQwbgdBWcda0w_L60iUww))
### 最新版本及image ID
|  image version   |  image ID | INFO |
|-----|-----|------|
|   funasr-runtime-sdk-online-cpu-0.1.1  |  bdbdd0b27dee   |      |
| image version                |  image ID | INFO |
|------------------------------|-----|------|
| funasr-runtime-sdk-cpu-0.2.2 |  2c5286be13e9   |      |