python/FunASR-XL.git

parent: e144cb9d | 补丁 | 提交 | ignore whitespace

zhifu gao

2024-03-05 8ab670afc3b7a9ae4da4043e21e89024321d5b5b

Dev gzf (#1424)

* fixbug

* qwenaudio

* qwenaudio whisper-openai v1.0.12

* qwenaudio whisper-openai

7个文件已修改

	README.md	26 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史
	README_zh.md	26 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史
	examples/industrial_data_pretraining/whisper/demo.py	2 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史
	examples/industrial_data_pretraining/whisper/demo_from_openai.py	4 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史
	examples/industrial_data_pretraining/whisper/infer_from_local.sh	12 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史
	funasr/download/name_maps_from_hub.py	3 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史
	funasr/models/whisper/model.py	15 ●●●●● 补丁 \| 查看 \| 原始文档 \| blame \| 历史

 README.md

@@ -27,6 +27,7 @@

<a name="whats-new"></a>
## What's new:
- 2024/03/05：Added support for the Whisper-large-v3 model, a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. It can be downloaded from the[modelscope](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary), and [openai](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/whisper).
- 2024/03/03: Offline File Transcription Service 4.4, Offline File Transcription Service of English 1.5，Real-time Transcription Service 1.9 released，Docker image supports ARM64 platform；([docs](runtime/readme.md))
- 2024/01/30：funasr-1.0 has been released ([docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319))
- 2024/01/30：emotion recognition models are new supported. [model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary), modified from [repo](https://github.com/ddlBoJack/emotion2vec).
@@ -67,20 +68,21 @@
## Model Zoo
FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models, for more models please refer to the [Model Zoo]().

(Note: ⭐ represents the ModelScope model zoo link, 🤗 represents the Huggingface model zoo link)
(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)


|                                                                                                         Model Name                                                                                                         |                    Task Details                    |          Training Data           | Parameters |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------:|:--------------------------------:|:----------:|
|          paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) )           | speech recognition, with timestamps, non-streaming |      60000 hours, Mandarin       |    220M    |
| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> |           speech recognition, streaming            |      60000 hours, Mandarin       |    220M    |
|               paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) )                | speech recognition, with timestamps, non-streaming |       50000 hours, English       |    220M    |
|                            conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) )                             |         speech recognition, non-streaming          |       50000 hours, English       |    220M    |
|                               ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) )                               |              punctuation restoration               |    100M, Mandarin and English    |    1.1G    | 
|                                   fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) )                                   |              voice activity detection              | 5000 hours, Mandarin and English |    0.4M    | 
|                                     fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) )                                     |                timestamp prediction                |       5000 hours, Mandarin       |    38M     | 
|                                       cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) )                                        |        speaker verification/diarization            |            5000 hours            |    7.2M    | 
|                                                 whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🤗]() )                                                   | speech recognition, with timestamps, non-streaming |          multilingual            |     1G     |
|                                                                                                         Model Name                                                                                                         |                     Task Details                      |          Training Data           | Parameters |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------:|:--------------------------------:|:----------:|
|          paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) )           |  speech recognition, with timestamps, non-streaming   |      60000 hours, Mandarin       |    220M    |
| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> |             speech recognition, streaming             |      60000 hours, Mandarin       |    220M    |
|               paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) )                | speech recognition, without timestamps, non-streaming |       50000 hours, English       |    220M    |
|                            conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) )                             |           speech recognition, non-streaming           |       50000 hours, English       |    220M    |
|                               ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) )                               |                punctuation restoration                |    100M, Mandarin and English    |    1.1G    | 
|                                   fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) )                                   |               voice activity detection                | 5000 hours, Mandarin and English |    0.4M    | 
|                                     fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) )                                     |                 timestamp prediction                  |       5000 hours, Mandarin       |    38M     | 
|                                       cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) )                                        |           speaker verification/diarization            |            5000 hours            |    7.2M    | 
|                                                  Whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🍀](https://github.com/openai/whisper) )                                                  |  speech recognition, with timestamps, non-streaming   |          multilingual            |    1.5G    |
|                                                Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary)  [🍀](https://github.com/openai/whisper) )                                                 |  speech recognition, with timestamps, non-streaming   |          multilingual            |    1.5G    |




 README_zh.md

@@ -29,6 +29,7 @@

<a name="最新动态"></a>
## 最新动态
- 2024/03/05：新增加Whisper-large-v3模型支持，多语言语音识别/翻译/语种识别，支持从[modelscope](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary)仓库下载，也支持从[openai](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/whisper)仓库下载模型。
- 2024/03/03: 中文离线文件转写服务 4.4、英文离线文件转写服务 1.5、中文实时语音听写服务 1.9 发布，docker镜像支持arm64平台；详细信息参阅([部署文档](runtime/readme_cn.md))
- 2024/01/30：funasr-1.0发布，更新说明[文档](https://github.com/alibaba-damo-academy/FunASR/discussions/1319)
- 2024/01/30：新增加情感识别 [模型链接](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary)，原始模型 [repo](https://github.com/ddlBoJack/emotion2vec).
@@ -69,20 +70,21 @@

FunASR开源了大量在工业数据上预训练模型，您可以在[模型许可协议](./MODEL_LICENSE)下自由使用、复制、修改和分享FunASR模型，下面列举代表性的模型，更多模型请参考[模型仓库]()。

（注：⭐ 表示ModelScope模型仓库链接，🤗 表示Huggingface模型仓库链接）
（注：⭐ 表示ModelScope模型仓库，🤗 表示Huggingface模型仓库，🍀表示OpenAI模型仓库）


|                                         模型名字                                                                                                                 |      任务详情       |     训练数据     | 参数量  |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------:|:------------:|:----:|
| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) ) | 语音识别，带时间戳输出，非实时 |  60000小时，中文  | 220M |
|   paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )   |     语音识别，实时     |  60000小时，中文  | 220M |
|      paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) )      |    语音识别，非实时     |  50000小时，英文  | 220M |
|                  conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) )                   |    语音识别，非实时     |  50000小时，英文  | 220M |
|                  ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) )                   |      标点恢复       |  100M，中文与英文  | 1.1G | 
|                       fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) )                       |    语音端点检测，实时    | 5000小时，中文与英文 | 0.4M | 
|                       fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) )                        |    字级别时间戳预测     |  50000小时，中文  | 38M  |
|                           cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) )                            |    说话人确认/分割     |    5000小时    | 7.2M | 
| whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🤗]() ) | 语音识别，带时间戳输出，非实时 |     多语言      |  1G  |
|                                                                                                     模型名字                                                                                                      |      任务详情       |     训练数据     | 参数量  | 
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------:|:------------:|:----:|
|    paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) )    | 语音识别，带时间戳输出，非实时 |  60000小时，中文  | 220M |
| paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) ) |     语音识别，实时     |  60000小时，中文  | 220M |
|         paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) )         |    语音识别，非实时     |  50000小时，英文  | 220M |
|                      conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) )                      |    语音识别，非实时     |  50000小时，英文  | 220M |
|                        ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) )                         |      标点恢复       |  100M，中文与英文  | 1.1G | 
|                            fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) )                             |    语音端点检测，实时    | 5000小时，中文与英文 | 0.4M | 
|                              fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) )                               |    字级别时间戳预测     |  50000小时，中文  | 38M  |
|                                 cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) )                                 |    说话人确认/分割     |    5000小时    | 7.2M | 
|                           Whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🍀](https://github.com/openai/whisper) )                           | 语音识别，带时间戳输出，非实时 |     多语言      |  1G  |
|                         Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary)  [🍀](https://github.com/openai/whisper) )                          | 语音识别，带时间戳输出，非实时 |     多语言      |  1G  |


<a name="快速开始"></a>

 examples/industrial_data_pretraining/whisper/demo.py

@@ -5,7 +5,7 @@

from funasr import AutoModel

model = AutoModel(model="iic/speech_whisper-large_asr_multilingual",
model = AutoModel(model="iic/Whisper-large-v3",
                  model_revision="v2.0.4",
                  )


 examples/industrial_data_pretraining/whisper/demo_from_openai.py

@@ -7,8 +7,8 @@

# model = AutoModel(model="Whisper-small", hub="openai")
# model = AutoModel(model="Whisper-medium", hub="openai")
model = AutoModel(model="Whisper-large-v2", hub="openai")
# model = AutoModel(model="Whisper-large-v3", hub="openai")
# model = AutoModel(model="Whisper-large-v2", hub="openai")
model = AutoModel(model="Whisper-large-v3", hub="openai")

res = model.generate(
    language=None,

 examples/industrial_data_pretraining/whisper/infer_from_local.sh

@@ -13,13 +13,19 @@
# download model
local_path_root=${workspace}/modelscope_models
mkdir -p ${local_path_root}
local_path=${local_path_root}/speech_whisper-large_asr_multilingual
git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path}
#Whisper-large-v2
#local_path=${local_path_root}/speech_whisper-large_asr_multilingual
#git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path}
#init_param="${local_path}/large-v2.pt"
#Whisper-large-v3
local_path=${local_path_root}/Whisper-large-v3
git clone https://www.modelscope.cn/iic/Whisper-large-v3.git ${local_path}
init_param="${local_path}/large-v3.pt"

device="cuda:0" # "cuda:0" for gpu0, "cuda:1" for gpu1, "cpu"

config="config.yaml"
init_param="${local_path}/large-v2.pt"


python -m funasr.bin.inference \
--config-path "${local_path}" \

 funasr/download/name_maps_from_hub.py

@@ -8,7 +8,8 @@
    "ct-punc-c": "damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    "fa-zh": "damo/speech_timestamp_prediction-v1-16k-offline",
    "cam++": "damo/speech_campplus_sv_zh-cn_16k-common",
    "whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
    "Whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
    "Whisper-large-v3": "iic/Whisper-large-v3",
}

name_maps_hf = {

 funasr/models/whisper/model.py

@@ -24,7 +24,7 @@
@tables.register("model_classes", "Whisper-large-v1")
@tables.register("model_classes", "Whisper-large-v2")
@tables.register("model_classes", "Whisper-large-v3")
@tables.register("model_classes", "Whisper-WhisperWarp")
@tables.register("model_classes", "WhisperWarp")
class WhisperWarp(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
@@ -35,8 +35,8 @@
                model_or_path = model_or_path.replace("Whisper-", "")
            model = whisper.load_model(model_or_path)
        else:
            whisper_dims = kwargs.get("whisper_dims", {})
            dims = whisper.model.ModelDimensions(**whisper_dims)
            dims = kwargs.get("dims", {})
            dims = whisper.model.ModelDimensions(**dims)
            model = whisper.model.Whisper(dims=dims)
        
        self.model = model
@@ -55,6 +55,13 @@
        if kwargs.get("batch_size", 1) > 1:
            raise NotImplementedError("batch decoding is not implemented")

        if frontend is None and not hasattr(self, "frontend"):
            frontend_class = tables.frontend_classes.get("WhisperFrontend")
            frontend = frontend_class(n_mels=self.model.dims.n_mels, do_pad_trim=kwargs.get("do_pad_trim", True))
            self.frontend = frontend
        else:
            frontend = frontend if frontend is not None else self.frontend

        meta_data = {}
        if isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank":  # fbank
            speech, speech_lengths = data_in, data_lengths
@@ -65,7 +72,7 @@
        else:
            # extract fbank feats
            time1 = time.perf_counter()
            audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs, audio_fs=kwargs.get("fs", 16000),
            audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs if hasattr(frontend, "fs") else 16000, audio_fs=kwargs.get("fs", 16000),
                                                            data_type=kwargs.get("data_type", "sound"),
                                                            tokenizer=tokenizer)
            time2 = time.perf_counter()

			@@ -27,6 +27,7 @@

			<a name="whats-new"></a>
			## What's new:
			- 2024/03/05：Added support for the Whisper-large-v3 model, a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. It can be downloaded from the[modelscope](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary), and [openai](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/whisper).
			- 2024/03/03: Offline File Transcription Service 4.4, Offline File Transcription Service of English 1.5，Real-time Transcription Service 1.9 released，Docker image supports ARM64 platform；([docs](runtime/readme.md))
			- 2024/01/30：funasr-1.0 has been released ([docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319))
			- 2024/01/30：emotion recognition models are new supported. [model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary), modified from [repo](https://github.com/ddlBoJack/emotion2vec).
			@@ -67,20 +68,21 @@
			## Model Zoo
			FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models, for more models please refer to the [Model Zoo]().

			(Note: ⭐ represents the ModelScope model zoo link, 🤗 represents the Huggingface model zoo link)
			(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)


			\| Model Name \| Task Details \| Training Data \| Parameters \|
			\|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:\|:--------------------------------------------------:\|:--------------------------------:\|:----------:\|
			\| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-tp) ) \| speech recognition, with timestamps, non-streaming \| 60000 hours, Mandarin \| 220M \|
			\| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> \| speech recognition, streaming \| 60000 hours, Mandarin \| 220M \|
			\| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) \| speech recognition, with timestamps, non-streaming \| 50000 hours, English \| 220M \|
			\| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) \| speech recognition, non-streaming \| 50000 hours, English \| 220M \|
			\| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) \| punctuation restoration \| 100M, Mandarin and English \| 1.1G \|
			\| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) \| voice activity detection \| 5000 hours, Mandarin and English \| 0.4M \|
			\| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) \| timestamp prediction \| 5000 hours, Mandarin \| 38M \|
			\| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) \| speaker verification/diarization \| 5000 hours \| 7.2M \|
			\| whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [🤗]() ) \| speech recognition, with timestamps, non-streaming \| multilingual \| 1G \|
			\| Model Name \| Task Details \| Training Data \| Parameters \|
			\|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:\|:-----------------------------------------------------:\|:--------------------------------:\|:----------:\|
			\| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-tp) ) \| speech recognition, with timestamps, non-streaming \| 60000 hours, Mandarin \| 220M \|
			\| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> \| speech recognition, streaming \| 60000 hours, Mandarin \| 220M \|
			\| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) \| speech recognition, without timestamps, non-streaming \| 50000 hours, English \| 220M \|
			\| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) \| speech recognition, non-streaming \| 50000 hours, English \| 220M \|
			\| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) \| punctuation restoration \| 100M, Mandarin and English \| 1.1G \|
			\| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) \| voice activity detection \| 5000 hours, Mandarin and English \| 0.4M \|
			\| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) \| timestamp prediction \| 5000 hours, Mandarin \| 38M \|
			\| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) \| speaker verification/diarization \| 5000 hours \| 7.2M \|
			\| Whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [🍀](https://github.com/openai/whisper) ) \| speech recognition, with timestamps, non-streaming \| multilingual \| 1.5G \|
			\| Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary) [🍀](https://github.com/openai/whisper) ) \| speech recognition, with timestamps, non-streaming \| multilingual \| 1.5G \|

			@@ -29,6 +29,7 @@

			<a name="最新动态"></a>
			## 最新动态
			- 2024/03/05：新增加Whisper-large-v3模型支持，多语言语音识别/翻译/语种识别，支持从[modelscope](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary)仓库下载，也支持从[openai](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/whisper)仓库下载模型。
			- 2024/03/03: 中文离线文件转写服务 4.4、英文离线文件转写服务 1.5、中文实时语音听写服务 1.9 发布，docker镜像支持arm64平台；详细信息参阅([部署文档](runtime/readme_cn.md))
			- 2024/01/30：funasr-1.0发布，更新说明[文档](https://github.com/alibaba-damo-academy/FunASR/discussions/1319)
			- 2024/01/30：新增加情感识别 [模型链接](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary)，原始模型 [repo](https://github.com/ddlBoJack/emotion2vec).
			@@ -69,20 +70,21 @@

			FunASR开源了大量在工业数据上预训练模型，您可以在[模型许可协议](./MODEL_LICENSE)下自由使用、复制、修改和分享FunASR模型，下面列举代表性的模型，更多模型请参考[模型仓库]()。

			（注：⭐ 表示ModelScope模型仓库链接，🤗 表示Huggingface模型仓库链接）
			（注：⭐ 表示ModelScope模型仓库，🤗 表示Huggingface模型仓库，🍀表示OpenAI模型仓库）


			\| 模型名字 \| 任务详情 \| 训练数据 \| 参数量 \|
			\|:------------------------------------------------------------------------------------------------------------------------------------------------------------:\|:---------------:\|:------------:\|:----:\|
			\| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-tp) ) \| 语音识别，带时间戳输出，非实时 \| 60000小时，中文 \| 220M \|
			\| paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) ) \| 语音识别，实时 \| 60000小时，中文 \| 220M \|
			\| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) \| 语音识别，非实时 \| 50000小时，英文 \| 220M \|
			\| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) \| 语音识别，非实时 \| 50000小时，英文 \| 220M \|
			\| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) \| 标点恢复 \| 100M，中文与英文 \| 1.1G \|
			\| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) \| 语音端点检测，实时 \| 5000小时，中文与英文 \| 0.4M \|
			\| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) \| 字级别时间戳预测 \| 50000小时，中文 \| 38M \|
			\| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) \| 说话人确认/分割 \| 5000小时 \| 7.2M \|
			\| whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [🤗]() ) \| 语音识别，带时间戳输出，非实时 \| 多语言 \| 1G \|
			\| 模型名字 \| 任务详情 \| 训练数据 \| 参数量 \|
			\|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:\|:---------------:\|:------------:\|:----:\|
			\| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-tp) ) \| 语音识别，带时间戳输出，非实时 \| 60000小时，中文 \| 220M \|
			\| paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) ) \| 语音识别，实时 \| 60000小时，中文 \| 220M \|
			\| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) \| 语音识别，非实时 \| 50000小时，英文 \| 220M \|
			\| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) \| 语音识别，非实时 \| 50000小时，英文 \| 220M \|
			\| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) \| 标点恢复 \| 100M，中文与英文 \| 1.1G \|
			\| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) \| 语音端点检测，实时 \| 5000小时，中文与英文 \| 0.4M \|
			\| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) \| 字级别时间戳预测 \| 50000小时，中文 \| 38M \|
			\| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) \| 说话人确认/分割 \| 5000小时 \| 7.2M \|
			\| Whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [🍀](https://github.com/openai/whisper) ) \| 语音识别，带时间戳输出，非实时 \| 多语言 \| 1G \|
			\| Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary) [🍀](https://github.com/openai/whisper) ) \| 语音识别，带时间戳输出，非实时 \| 多语言 \| 1G \|


			<a name="快速开始"></a>

			@@ -5,7 +5,7 @@

			from funasr import AutoModel

			model = AutoModel(model="iic/speech_whisper-large_asr_multilingual",
			model = AutoModel(model="iic/Whisper-large-v3",
			model_revision="v2.0.4",
			)

			@@ -7,8 +7,8 @@

			# model = AutoModel(model="Whisper-small", hub="openai")
			# model = AutoModel(model="Whisper-medium", hub="openai")
			model = AutoModel(model="Whisper-large-v2", hub="openai")
			# model = AutoModel(model="Whisper-large-v3", hub="openai")
			# model = AutoModel(model="Whisper-large-v2", hub="openai")
			model = AutoModel(model="Whisper-large-v3", hub="openai")

			res = model.generate(
			language=None,

			@@ -13,13 +13,19 @@
			# download model
			local_path_root=${workspace}/modelscope_models
			mkdir -p ${local_path_root}
			local_path=${local_path_root}/speech_whisper-large_asr_multilingual
			git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path}
			#Whisper-large-v2
			#local_path=${local_path_root}/speech_whisper-large_asr_multilingual
			#git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path}
			#init_param="${local_path}/large-v2.pt"
			#Whisper-large-v3
			local_path=${local_path_root}/Whisper-large-v3
			git clone https://www.modelscope.cn/iic/Whisper-large-v3.git ${local_path}
			init_param="${local_path}/large-v3.pt"

			device="cuda:0" # "cuda:0" for gpu0, "cuda:1" for gpu1, "cpu"

			config="config.yaml"
			init_param="${local_path}/large-v2.pt"


			python -m funasr.bin.inference \
			--config-path "${local_path}" \

			@@ -8,7 +8,8 @@
			"ct-punc-c": "damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
			"fa-zh": "damo/speech_timestamp_prediction-v1-16k-offline",
			"cam++": "damo/speech_campplus_sv_zh-cn_16k-common",
			"whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
			"Whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
			"Whisper-large-v3": "iic/Whisper-large-v3",
			}

			name_maps_hf = {

			@@ -24,7 +24,7 @@
			@tables.register("model_classes", "Whisper-large-v1")
			@tables.register("model_classes", "Whisper-large-v2")
			@tables.register("model_classes", "Whisper-large-v3")
			@tables.register("model_classes", "Whisper-WhisperWarp")
			@tables.register("model_classes", "WhisperWarp")
			class WhisperWarp(nn.Module):
			def __init__(self, args, *kwargs):
			super().__init__()
			@@ -35,8 +35,8 @@
			model_or_path = model_or_path.replace("Whisper-", "")
			model = whisper.load_model(model_or_path)
			else:
			whisper_dims = kwargs.get("whisper_dims", {})
			dims = whisper.model.ModelDimensions(**whisper_dims)
			dims = kwargs.get("dims", {})
			dims = whisper.model.ModelDimensions(**dims)
			model = whisper.model.Whisper(dims=dims)

			self.model = model
			@@ -55,6 +55,13 @@
			if kwargs.get("batch_size", 1) > 1:
			raise NotImplementedError("batch decoding is not implemented")

			if frontend is None and not hasattr(self, "frontend"):
			frontend_class = tables.frontend_classes.get("WhisperFrontend")
			frontend = frontend_class(n_mels=self.model.dims.n_mels, do_pad_trim=kwargs.get("do_pad_trim", True))
			self.frontend = frontend
			else:
			frontend = frontend if frontend is not None else self.frontend

			meta_data = {}
			if isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank": # fbank
			speech, speech_lengths = data_in, data_lengths
			@@ -65,7 +72,7 @@
			else:
			# extract fbank feats
			time1 = time.perf_counter()
			audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs, audio_fs=kwargs.get("fs", 16000),
			audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs if hasattr(frontend, "fs") else 16000, audio_fs=kwargs.get("fs", 16000),
			data_type=kwargs.get("data_type", "sound"),
			tokenizer=tokenizer)
			time2 = time.perf_counter()