python/FunASR-XL.git

New file
			@@ -0,0 +1,105 @@
			---
			tasks:
			- audio-visual-speech-recognition
			domain:
			- audio, visual
			model-type:
			- Autoregressive
			frameworks:
			- pytorch
			backbone:
			- transformer/conformer
			metrics:
			- WER/B-WER
			license: Apache License 2.0
			language:
			- en
			tags:
			- FunASR
			- Alibaba
			- ICASSP 2024
			- Audio-Visual
			- Hotword
			- Long-Context Biasing
			datasets:
			train:
			- SlideSpeech corpus
			test:
			- dev and test of SlideSpeech corpus
			indexing:
			results:
			- task:
			name: Audio-Visual Speech Recognition
			dataset:
			name: SlideSpeech corpus
			type: audio # optional
			args: 16k sampling rate, 5002 bpe units # optional
			metrics:
			- type: WER
			value: 18.8% # float
			description: beamsearch search, withou lm, avg.
			args: default

			widgets:
			- task: audio-visual-speech-recognition
			inputs:
			- type: audio
			name: input
			title: 音频
			- type: text
			name: input
			title: OCR识别文本
			finetune-support: True
			---


			# Paraformer-large模型介绍

			## Highlights
			- 热词版本：[Paraformer-large热词版模型](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary)支持热词定制功能，基于提供的热词列表进行激励增强，提升热词的召回率和准确率。
			- 长音频版本：[Paraformer-large长音频模型](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)，集成VAD、ASR、标点与时间戳功能，可直接对时长为数小时音频进行识别，并输出带标点文字与时间戳。

			## <strong>[FunASR开源项目介绍](https://github.com/alibaba-damo-academy/FunASR)</strong>
			<strong>[FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong>希望在语音识别的学术研究和工业应用之间架起一座桥梁。通过发布工业级语音识别模型的训练和微调，研究人员和开发人员可以更方便地进行语音识别模型的研究和生产，并推动语音识别生态的发展。让语音识别更有趣！

			[github仓库](https://github.com/alibaba-damo-academy/FunASR)
			\| [最新动态](https://github.com/alibaba-damo-academy/FunASR#whats-new)
			\| [环境安装](https://github.com/alibaba-damo-academy/FunASR#installation)
			\| [服务部署](https://www.funasr.com)
			\| [模型库](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo)
			\| [联系我们](https://github.com/alibaba-damo-academy/FunASR#contact)


			## 模型原理介绍

			随着在线会议和课程越来越普遍，如何利用视频幻灯片中丰富的文本信息来改善语音识别（Automatic Speech Recognition， ASR）面临着新的挑战。视频中的幻灯片与语音实时同步，相比于统一的稀有词列表，能够提供更长的上下文相关信息。因此，我们提出了一种创新的长上下文偏置网络（LCB-net），用于音频-视觉语音识别（Audio-Visual Speech Recognition，AVSR），以更好地利用视频中的长时上下文信息。

			<p align="center">
			<img src="fig/lcbnet1.png" alt="AVSR整体流程框架" width="500" />
			<p align="center">
			<img src="fig/lcbnet2.png" alt="LCB-NET模型结构" width="500" />


			具体来说，我们首先使用OCR技术来检测和识别幻灯片中的文本内容，其次我们采用关键词提取技术来获取文本内容中的关键词短语。最后，我们将关键词拼接成长上下文文本和音频同时输入到我们的LCB-net模型中进行识别。而LCB-net模型采用了双编码器结构，同时建模音频和长上下文文本信息。此外，我们还引入了一个显式的偏置词预测模块，通过使用二元交叉熵（BCE）损失函数显式预测长上下文文本中在音频中出现的关键偏置词。此外，为增强LCB-net的泛化能力和稳健性，我们还采用了动态的关键词模拟策略。实验证明，我们提出的LCB-net热词模型，不仅能够提升关键词的识别效果，同时也能够提升非关键词的识别效果。具体实验结果如下所示：

			<p align="center">
			<img src="fig/lcbnet3.png" alt="实验结果" width="500" />


			更详细的细节见：
			- 论文： [LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition](https://arxiv.org/abs/2401.06390)





			## 相关论文以及引用信息

			```BibTeX
			@inproceedings{yu2024lcbnet,
			title={LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition},
			author={Fan Yu, Haoxu Wang, Xian Shi, Shiliang Zhang},
			booktitle={ICASSP},
			year={2024}
			}
			```