Acknowledge.md
File was deleted Contribution.md
File was deleted README.md
File was deleted SECURITY.md
File was deleted START_USE.md
New file @@ -0,0 +1,51 @@
# Using this tool for training (non-development)

## I. Preparation

#### 1. Start the backend dataset-collection program
```
First run java -version and make sure Java 21 is in use; if it is not, run sudo update-alternatives --config java to switch.
Backend:
cd /home/boying/IdeaProjects/asr_datasets
nohup java -jar AutoLabelASR > output.log 2>&1 &
Frontend:
Open the project with idea from the console and start the AutoLabelASR project (remember Node must be 20.19; you can run nvm use 20.19.5), then npm run dev brings up the HTTPS endpoint.
```

#### 2. Clear leftover files
```
The asr_datasets database on 192.168.0.5 must be emptied before each retraining run, otherwise it still holds duplicate data that was already trained on.
Likewise cd /home/boying/IdeaProjects/asr_datasets: the upload folder next to the project startup files must be emptied.
```

#### 3. Collect data
```
Open https://192.168.0.5:1443 in a browser, enter the correct text to be recognized, click Start Recording, and record the speech for that text.
Click Save to Database (at least 100 entries per session is recommended), then click Export Excel.
```

## II. Training workflow

#### 1. File generation
```
Open the FunASRxl-0313 project in VS Code, copy the exported Excel into the project root and name it "音频标注数据.xlsx", open a terminal (toggle panel at the bottom right), and run in order:
conda activate fun_asr_xl   # switch virtual environment
python gen_funasr_file.py   # generate the files
When it finishes, check whether train_text.txt and train_wav.scp were generated under the data/train directory.
Copy the audio files from the project's upload folder into the data/train/wav directory.
```

#### 2. Start training
```
There are two models to train, paraformer and sensvoice (whisper and nano-2512 may be added later); the procedure is the same for all of them, using paraformer as the example:
cd /home/boying/IdeaProjects/FunASRxl-0313/examples/industrial_data_pretraining/paraformer
./finetune.sh
The console prints that FunASR training has started. When it ends, the model is saved under the xxx directory, which holds many model checkpoints (one per training epoch) plus a model.pt.best (the best model); that one is all you need.
To keep a run, save the whole folder to another directory (the next training run overwrites this one) and keep only the best checkpoint among the model files.
```
data/list/train.jsonl
File was deleted data/list/train_emo.txt
File was deleted data/list/train_event.txt
File was deleted data/list/train_text.txt
@@ -1,4 +1,2 @@
-BAC009S0764W0121 甚至出现交易几乎停滞的情况
-BAC009S0916W0489 湖北一公司以员工名义贷款数千万员工负债千万
-asr_example_cn_en ææåªè¦å¤ç data ä¸ç®¡ä½ æ¯å machine learning å deep learning å data analytics å data science ä¹å¥½ scientist ä¹å¥½ ééé½è¦é½åçåºæ¬ååé£ again å å 对æä¸äºä¹è®¸å¯¹
-ID0012W0014 he tried to think how it could be
+96ed5ed7-2602-46c5-b5cb-52e737b6c19e 数据集标注系统
+f09683b9-7e1f-4cca-a1d5-7b17f5503116 æµæµåå
data/list/train_text_language.txt
File was deleted data/list/train_wav.scp
@@ -1,4 +1,2 @@
-BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
-BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
-asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
-ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
+96ed5ed7-2602-46c5-b5cb-52e737b6c19e wav/96ed5ed7-2602-46c5-b5cb-52e737b6c19e.wav
+f09683b9-7e1f-4cca-a1d5-7b17f5503116 wav/f09683b9-7e1f-4cca-a1d5-7b17f5503116.wav
data/list/val.jsonl
File was deleted data/train/train.jsonl
data/train/train_emo.txt
New file @@ -0,0 +1,2 @@
BAC009S0764W0121 <|NEUTRAL|>
BAC009S0916W0489 <|NEUTRAL|>
data/train/train_event.txt
New file @@ -0,0 +1,2 @@
BAC009S0764W0121 <|Speech|>
BAC009S0916W0489 <|Speech|>
data/train/train_text.txt
New file @@ -0,0 +1,2 @@
BAC009S0764W0121 甚至出现交易几乎停滞的情况
BAC009S0916W0489 湖北一公司以员工名义贷款数千万员工负债千万
data/train/train_text_language.txt
New file @@ -0,0 +1,2 @@
BAC009S0764W0121 <|zh|>
BAC009S0916W0489 <|zh|>
data/train/train_wav.scp
New file @@ -0,0 +1,2 @@
BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
data/val/val.jsonl
data/val/val_text.txt
data/val/val_wav.scp
demo1.py
New file @@ -0,0 +1,21 @@
import os

# [Key change 1] Pin the model cache directory in code.
# Note: point this at the 'models' parent directory, not at a specific model folder,
# because the program looks for the model under <cache>/iic/<model_name>.
os.environ['MODELSCOPE_CACHE'] = '/home/boying/IdeaProjects/FunASRxl-0313/models'

from funasr import AutoModel

model = AutoModel(model=r"iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                  vad_model=None,
                  punc_model=None,
                  disable_download=True,
                  disable_update=True,
                  )
res = model.generate(
    input=r"1.wav",
    batch_size_s=300,
    hotword='贷款'
)
print('Recognition result:', res)
examples/industrial_data_pretraining/paraformer/finetune.sh
old mode 100644
new mode 100755
@@ -2,15 +2,15 @@
 # MIT License (https://opensource.org/licenses/MIT)

 workspace=`pwd`
+export MODELSCOPE_CACHE="/home/boying/IdeaProjects/FunASRxl-0313/models/"

 # which gpu to train or finetune
-export CUDA_VISIBLE_DEVICES="0,1"
+export CUDA_VISIBLE_DEVICES="0"
 gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

 # model_name from model_hub, or model_dir in local path

 ## option 1, download model automatically
-model_name_or_model_dir="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
+model_name_or_model_dir="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"

 ## option 2, download model by git
 #local_path_root=${workspace}/modelscope_models
@@ -20,26 +20,29 @@
 # data dir, which contains: train.json, val.json
-data_dir="../../../data/list"
+data_dir="../../../data"

-train_data="${data_dir}/train.jsonl"
-val_data="${data_dir}/val.jsonl"
+train_data="${data_dir}/train/train.jsonl"
+val_data="${data_dir}/val/val.jsonl"

 # generate train.jsonl and val.jsonl from wav.scp and text.txt
 scp2jsonl \
-++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
+++scp_file_list='["../../../data/train/train_wav.scp", "../../../data/train/train_text.txt"]' \
 ++data_type_list='["source", "target"]' \
 ++jsonl_file_out="${train_data}"

 scp2jsonl \
-++scp_file_list='["../../../data/list/val_wav.scp", "../../../data/list/val_text.txt"]' \
+++scp_file_list='["../../../data/val/val_wav.scp", "../../../data/val/val_text.txt"]' \
 ++data_type_list='["source", "target"]' \
 ++jsonl_file_out="${val_data}"

 # exp output dir
-output_dir="./outputs"
+output_dir="/home/boying/IdeaProjects/FunASRxl-0313/exp/paraformer_train"
 log_file="${output_dir}/log.txt"

+BATCH_SIZE=16
+LR=0.0005
+
 deepspeed_config=${workspace}/../../deepspeed_conf/ds_stage1.json
@@ -56,6 +59,14 @@
 echo $DISTRIBUTED_ARGS
+echo "=========================================="
+echo "Starting FunASR training..."
+echo "Train data: $train_data"
+echo "Output dir: $output_dir"
+echo "Pretrained model: $model_name_or_model_dir"
+echo "Batch Size: $BATCH_SIZE, LR: $LR, Epochs: $MAX_EPOCH"
+echo "=========================================="
+
 torchrun $DISTRIBUTED_ARGS \
 ../../../funasr/bin/train_ds.py \
 ++model="${model_name_or_model_dir}" \
@@ -65,7 +76,7 @@
 ++dataset_conf.index_ds="IndexDSJsonl" \
 ++dataset_conf.data_split_num=1 \
 ++dataset_conf.batch_sampler="BatchSampler" \
-++dataset_conf.batch_size=6000 \
+++dataset_conf.batch_size="${BATCH_SIZE}" \
 ++dataset_conf.sort_size=1024 \
 ++dataset_conf.batch_type="token" \
 ++dataset_conf.num_workers=4 \
@@ -78,5 +89,10 @@
 ++train_conf.avg_nbest_model=10 \
 ++train_conf.use_deepspeed=false \
 ++train_conf.deepspeed_config=${deepspeed_config} \
-++optim_conf.lr=0.0002 \
-++output_dir="${output_dir}" &> ${log_file}
+++optim_conf.lr="${LR}" \
+++output_dir="${output_dir}" &> ${log_file}
+
+echo "=========================================="
+echo "Training finished; model saved at: $output_dir"
+echo "=========================================="
examples/industrial_data_pretraining/paraformer/infer_from_local.sh
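The scp2jsonl calls in finetune.sh pair each wav.scp entry with the text line that shares its utterance ID. A rough pure-Python equivalent of that pairing, for reference; the exact JSONL field names FunASR writes (`key`/`source`/`target` below) are an assumption, not taken from the tool's output:

```python
import json

def scp_to_jsonl(wav_scp_lines, text_lines):
    """Pair Kaldi-style wav.scp and text lines by utterance ID into JSONL records."""
    wavs = dict(line.split(maxsplit=1) for line in wav_scp_lines if line.strip())
    texts = dict(line.split(maxsplit=1) for line in text_lines if line.strip())
    records = []
    for utt_id, wav in wavs.items():
        if utt_id in texts:  # drop IDs that lack a transcript
            records.append(json.dumps(
                {"key": utt_id, "source": wav, "target": texts[utt_id]},
                ensure_ascii=False))
    return records
```

This also makes the failure mode visible: an utterance present in only one of the two files simply produces no record.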
@@ -2,11 +2,11 @@
 # MIT License (https://opensource.org/licenses/MIT)

 # method2, inference from local model
+export MODELSCOPE_CACHE="/home/boying/IdeaProjects/FunASRxl-0313/models/"

 # for more input type, please ref to readme.md
 input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
-output_dir="./outputs/debug"
+output_dir="/home/boying/IdeaProjects/FunASRxl-0313/exp/paraformer_train/debug"

 workspace=`pwd`
@@ -22,7 +22,7 @@
 cmvn_file="${local_path}/am.mvn"
 config="config.yaml"
-init_param="${local_path}/model.pt"
+init_param="${local_path}/model.pt.best"

 python -m funasr.bin.inference \
 --config-path "${local_path}" \
gen_funasr_file.py
New file @@ -0,0 +1,100 @@
import pandas as pd
import os
import uuid as uuid_lib  # aliased to avoid clashing with variable names


def validate_uuid(uuid_str):
    """Check whether a UUID string is well-formed."""
    try:
        uuid_lib.UUID(uuid_str)
        return True
    except ValueError:
        return False


def process_excel_to_funasr_files(excel_path, output_dir="."):
    """
    Generate the train_text.txt and train_wav.scp files needed for FunASR training from an Excel sheet.
    :param excel_path: path to the Excel file
    :param output_dir: output directory (defaults to the current directory)
    """
    # 1. Read the Excel/CSV file
    try:
        if excel_path.endswith(".xlsx"):
            df = pd.read_excel(excel_path, engine="openpyxl")
        elif excel_path.endswith(".csv"):
            df = pd.read_csv(excel_path)
        else:
            raise ValueError("Only .xlsx and .csv files are supported")
    except FileNotFoundError:
        print(f"Error: file not found: {excel_path}")
        return
    except Exception as e:
        print(f"Failed to read file: {str(e)}")
        return

    # 2. Column names (adjust these to match the actual headers in your Excel file)
    uuid_col = "音频唯一标识 (UUID)"  # audio UUID
    text_col = "音频对应的文本内容"    # transcript text
    path_col = "音频保存的路径"        # saved audio path

    # Check that the required columns exist
    required_cols = [uuid_col, text_col, path_col]
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"Error: Excel is missing required columns: {missing_cols}")
        return

    # 3. Data cleaning
    # Drop rows with empty values
    df_clean = df.dropna(subset=required_cols).copy()
    # Deduplicate by UUID
    df_clean = df_clean.drop_duplicates(subset=[uuid_col], keep="first")
    # Filter out rows with malformed UUIDs
    df_clean["uuid_valid"] = df_clean[uuid_col].apply(validate_uuid)
    invalid_uuid_rows = df_clean[~df_clean["uuid_valid"]]
    if not invalid_uuid_rows.empty:
        print(f"Warning: found {len(invalid_uuid_rows)} rows with malformed UUIDs; filtered out:")
        print(invalid_uuid_rows[uuid_col].tolist())
    df_clean = df_clean[df_clean["uuid_valid"]].drop(columns=["uuid_valid"])

    # 4. Generate the output files
    os.makedirs(output_dir, exist_ok=True)
    text_file_path = os.path.join(output_dir, "train_text.txt")
    scp_file_path = os.path.join(output_dir, "train_wav.scp")

    # Generate train_text.txt
    with open(text_file_path, "w", encoding="utf-8") as f_text:
        for _, row in df_clean.iterrows():
            uuid = str(row[uuid_col]).strip()
            text = str(row[text_col]).strip()
            # Skip empty transcripts
            if text:
                f_text.write(f"{uuid} {text}\n")

    # Generate train_wav.scp (keep only the last path segment, uuid.wav)
    with open(scp_file_path, "w", encoding="utf-8") as f_scp:
        for _, row in df_clean.iterrows():
            uuid = str(row[uuid_col]).strip()
            full_path = str(row[path_col]).strip()
            # Extract uuid.wav (normalize \ to / first so both separators work)
            wav_file = os.path.basename(full_path.replace("\\", "/"))
            # Make sure it is a .wav file
            if wav_file.endswith(".wav"):
                f_scp.write(f"{uuid} wav/{wav_file}\n")
            else:
                print(f"Warning: audio path {full_path} for {uuid} is not a .wav file; skipped")

    # 5. Print summary statistics
    print("\n=== Done ===")
    print(f"Original rows: {len(df)}")
    print(f"Rows after cleaning: {len(df_clean)}")
    print(f"Generated train_text.txt: {text_file_path}")
    print(f"Generated train_wav.scp: {scp_file_path}")


# ===================== Entry point =====================
if __name__ == "__main__":
    # Change this to the path of your Excel file
    EXCEL_FILE = "音频标注数据.xlsx"
    # Output directory (defaults to the current directory; can be an absolute path such as "/home/boying/funasr_data")
    OUTPUT_DIR = "./data/list/"

    process_excel_to_funasr_files(EXCEL_FILE, OUTPUT_DIR)
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/.mdl
Binary files differ
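The core filtering in gen_funasr_file.py (UUID validation plus wav.scp line shaping) can be exercised on its own, without pandas or a real spreadsheet. A sketch with made-up row values; `scp_line` is my own name for the combined check:

```python
import os
import uuid as uuid_lib

def scp_line(row_uuid, audio_path):
    """Shape one train_wav.scp line the way the script does, or return None if the row is filtered."""
    try:
        uuid_lib.UUID(row_uuid)  # malformed UUIDs are filtered out
    except ValueError:
        return None
    # keep only the basename; normalize Windows separators first
    wav_file = os.path.basename(audio_path.replace("\\", "/"))
    if not wav_file.endswith(".wav"):
        return None  # non-.wav rows are skipped
    return f"{row_uuid} wav/{wav_file}"
```

Note the script writes `wav/<basename>` rather than the original absolute path, matching the relative entries seen in data/list/train_wav.scp.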
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/.msc
Binary files differ
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/.mv
New file @@ -0,0 +1 @@
Revision:master,CreatedAt:1706753553
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/README.md
New file @@ -0,0 +1,411 @@
---
tasks:
- auto-speech-recognition
domain:
- audio
model-type:
- Non-autoregressive
frameworks:
- pytorch
backbone:
- transformer/conformer
metrics:
- CER
license: Apache License 2.0
language:
- cn
tags:
- FunASR
- Paraformer
- Alibaba
- INTERSPEECH 2022
datasets:
  train:
  - 60,000 hour industrial Mandarin task
  test:
  - AISHELL-1 dev/test
  - AISHELL-2 dev_android/dev_ios/dev_mic/test_android/test_ios/test_mic
  - WenetSpeech dev/test_meeting/test_net
  - SpeechIO TIOBE
  - 60,000 hour industrial Mandarin task
indexing:
  results:
  - task:
      name: Automatic Speech Recognition
    dataset:
      name: 60,000 hour industrial Mandarin task
      type: audio    # optional
    args: 16k sampling rate, 8404 characters    # optional
    metrics:
      - type: CER
        value: 8.53%    # float
        description: greedy search, without LM, avg.
        args: default
      - type: RTF
        value: 0.0251    # float
        description: GPU inference on V100
        args: batch_size=1
widgets:
  - task: auto-speech-recognition
    model_revision: v2.0.4
    inputs:
      - type: audio
        name: input
        title: audio
    examples:
      - name: 1
        title: Example 1
        inputs:
          - name: input
            data: git://example/asr_example.wav
    inferencespec:
      cpu: 8          # CPU cores
      memory: 4096
finetune-support: True
---

# Highlights

- The Paraformer-large long-audio model integrates VAD, ASR, punctuation and timestamping, and can transcribe audio several hours long directly, outputting text with punctuation and timestamps.
- ASR model: the [Paraformer-large model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) is a non-autoregressive speech recognition model that achieves SOTA results on several public Chinese datasets; it can be quickly fine-tuned, customized and used for inference on ModelScope.
- Hotword version: the [Paraformer-large hotword model](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) supports hotword customization: it boosts the provided hotword list to improve hotword recall and accuracy.

## <strong>[About the FunASR open-source project](https://github.com/alibaba-damo-academy/FunASR)</strong>

<strong>[FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong> aims to build a bridge between academic research on speech recognition and industrial applications. By releasing industrial-grade speech recognition models and their training and fine-tuning recipes, it lets researchers and developers study and productionize ASR models more conveniently and helps advance the field. Have fun with speech recognition!

[**GitHub repo**](https://github.com/alibaba-damo-academy/FunASR) | [**What's new**](https://github.com/alibaba-damo-academy/FunASR#whats-new) | [**Installation**](https://github.com/alibaba-damo-academy/FunASR#installation) | [**Service deployment**](https://www.funasr.com) | [**Model zoo**](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo) | [**Contact us**](https://github.com/alibaba-damo-academy/FunASR#contact)

## Model description

Paraformer is an efficient non-autoregressive end-to-end speech recognition framework proposed by the DAMO Academy speech team. This project is the Chinese general-purpose Paraformer model, trained on tens of thousands of hours of labeled industrial audio, which gives it strong general-purpose recognition performance. The model can be applied to scenarios such as speech input methods, voice navigation and smart meeting minutes.

<p align="center">
<img src="fig/struct.png" alt="Paraformer model architecture" width="500" />

The Paraformer architecture, shown above, consists of five parts: Encoder, Predictor, Sampler, Decoder and the loss function. The Encoder can use various network structures, e.g. self-attention, conformer or SAN-M. The Predictor is a two-layer FFN that predicts the number of target tokens and extracts the acoustic embedding corresponding to each token. The Sampler is a parameter-free module that, from the acoustic embeddings and target embeddings, produces feature vectors enriched with semantic information. The Decoder is similar to an autoregressive decoder except that it is bidirectional (autoregressive decoders are unidirectional). Besides cross-entropy (CE) and the MWER discriminative objective, the loss also includes the Predictor's MAE objective.

Its main innovations are:
- Predictor module: a Continuous Integrate-and-Fire (CIF) based predictor extracts the acoustic feature vector corresponding to each target token, allowing more accurate prediction of the number of tokens in the utterance.
- Sampler: through sampling, it turns acoustic feature vectors and target token embeddings into feature vectors carrying semantic information, which, combined with the bidirectional Decoder, strengthens the model's ability to exploit context.
- MWER training based on negative-sample sampling.

More details:
- Paper: [Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition](https://arxiv.org/abs/2206.08317)
- Paper walkthrough: [Paraformer: a single-pass non-autoregressive end-to-end ASR model with high accuracy and high efficiency](https://mp.weixin.qq.com/s/xQ87isj5_wxWiQs4qUXtVw)

#### Inference with ModelScope

- Supported audio input formats:
  - wav file path, e.g.: data/test/audios/asr_example.wav
  - pcm file path, e.g.: data/test/audios/asr_example.pcm
  - wav file URL, e.g.: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav
  - wav binary data (bytes), e.g. bytes read directly from a file or recorded from a microphone
  - decoded audio, e.g. audio, rate = soundfile.read("asr_example_zh.wav"), of type numpy.ndarray or torch.Tensor
  - a wav.scp file, with the following format:

```sh
cat wav.scp
asr_example1  data/test/audios/asr_example1.wav
asr_example2  data/test/audios/asr_example2.wav
...
```

- For a wav file URL input, the API can be called as follows:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision="v2.0.4")

rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_vad_punc_example.wav')
print(rec_result)
```

- For pcm input, pass the audio sampling rate, e.g.:

```python
rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_vad_punc_example.pcm', fs=16000)
```

- For wav input:

```python
rec_result = inference_pipeline('asr_vad_punc_example.wav')
```

- For a wav.scp input (note: the file name must end with .scp), add the output_dir parameter to write the recognition results to files:

```python
inference_pipeline("wav.scp", output_dir='./output_dir')
```

The result directory is laid out as follows:

```sh
tree output_dir/
output_dir/
└── 1best_recog
    ├── score
    └── text

1 directory, 4 files
```

score: recognition path scores
text: speech recognition result file

- For decoded audio input:

```python
import soundfile

waveform, sample_rate = soundfile.read("asr_vad_punc_example.wav")
rec_result = inference_pipeline(waveform)
```

- Free combination of ASR, VAD and PUNC models: the VAD and punctuation models can be combined as needed, as follows:

```python
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision="v2.0.4",
    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    vad_model_revision="v2.0.4",
    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    punc_model_revision="v2.0.3",
    # spk_model="iic/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2",
)
```

To skip the PUNC model, set punc_model="" or omit the punc_model parameter. To add an LM, add lm_model='damo/speech_transformer_lm_zh-cn-common-vocab8404-pytorch' and set the lm_weight and beam_size parameters.

## Inference with FunASR

Below is a quick-start tutorial (test audio: [Chinese](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav)):

### Command line

Run in a terminal:

```shell
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=vad_example.wav
```

Note: single audio files are supported, as are file lists in Kaldi-style wav.scp format: `wav_id wav_path`

### Python examples

#### Non-streaming speech recognition

```python
from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
                  # spk_model="cam++", spk_model_revision="v2.0.2",
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
                     batch_size_s=300,
                     hotword='魔搭')
print(res)
```

Note: `model_hub` selects the model repository: `ms` downloads from modelscope, `hf` from huggingface.

#### Streaming speech recognition

```python
from funasr import AutoModel

chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960 # 600ms

cache = {}
total_chunk_num = int(len((speech)-1)/chunk_stride+1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)
```

Note: `chunk_size` is the streaming-latency configuration. `[0,10,5]` means the real-time output granularity is `10*60=600ms` with a lookahead of `5*60=300ms`. Each inference call takes `600ms` of input (`16000*0.6=9600` samples) and outputs the corresponding text. For the last audio chunk, set `is_final=True` to force the model to emit the remaining text.

#### Voice activity detection (non-streaming)

```python
from funasr import AutoModel

model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
res = model.generate(input=wav_file)
print(res)
```

#### Voice activity detection (streaming)

```python
from funasr import AutoModel

chunk_size = 200 # ms
model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

import soundfile

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int(len((speech)-1)/chunk_stride+1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
```

#### Punctuation restoration

```python
from funasr import AutoModel

model = AutoModel(model="ct-punc", model_revision="v2.0.4")

res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
```

#### Timestamp prediction

```python
from funasr import AutoModel

model = AutoModel(model="fa-zh", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
```

More detailed usage: [examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining)

## Fine-tuning

Detailed usage: [examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining)

## Benchmark

Optimized with big data and large models, Paraformer achieves current SOTA results on several speech recognition benchmarks. Below are results on the academic datasets AISHELL-1, AISHELL-2 and WenetSpeech, and on the public SpeechIO TIOBE evaluation. On common Chinese academic ASR evaluation tasks its performance far exceeds results published on current public leaderboards and in papers, and beats models trained on a single academic dataset alone. These are results of the [Paraformer-large model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch/summary) tested without VAD and punctuation models.

### AISHELL-1

| AISHELL-1 test | w/o LM | w/ LM |
|:---:|:---:|:---:|
| Espnet | 4.90 | 4.70 |
| Wenet | 4.61 | 4.36 |
| K2 | - | 4.26 |
| Blockformer | 4.29 | 4.05 |
| Paraformer-large | 1.95 | 1.68 |

### AISHELL-2

| | dev_ios | test_android | test_ios | test_mic |
|:---:|:---:|:---:|:---:|:---:|
| Espnet | 5.40 | 6.10 | 5.70 | 6.10 |
| WeNet | - | - | 5.39 | - |
| Paraformer-large | 2.80 | 3.13 | 2.85 | 3.06 |

### WenetSpeech

| | dev | test_meeting | test_net |
|:---:|:---:|:---:|:---:|
| Espnet | 9.70 | 15.90 | 8.80 |
| WeNet | 8.60 | 17.34 | 9.26 |
| K2 | 7.76 | 13.41 | 8.71 |
| Paraformer-large | 3.57 | 6.97 | 6.74 |

### [SpeechIO TIOBE](https://github.com/SpeechColab/Leaderboard)

Paraformer-large combined with the Transformer-LM model via shallow fusion achieves current SOTA results on the public SpeechIO TIOBE evaluation; the [Transformer-LM model](https://modelscope.cn/models/damo/speech_transformer_lm_zh-cn-common-vocab8404-pytorch/summary) is open-sourced on ModelScope. Below are results without LM and with the Transformer-LM:

- Decode config w/o LM:
  - Decode without LM
  - Beam size: 1
- Decode config w/ LM:
  - Decode with [Transformer-LM](https://modelscope.cn/models/damo/speech_transformer_lm_zh-cn-common-vocab8404-pytorch/summary)
  - Beam size: 10
  - LM weight: 0.15

| testset | w/o LM | w/ LM |
|:---:|:---:|:---:|
| SPEECHIO_ASR_ZH00001 | 0.49 | 0.35 |
| SPEECHIO_ASR_ZH00002 | 3.23 | 2.86 |
| SPEECHIO_ASR_ZH00003 | 1.13 | 0.80 |
| SPEECHIO_ASR_ZH00004 | 1.33 | 1.10 |
| SPEECHIO_ASR_ZH00005 | 1.41 | 1.18 |
| SPEECHIO_ASR_ZH00006 | 5.25 | 4.85 |
| SPEECHIO_ASR_ZH00007 | 5.51 | 4.97 |
| SPEECHIO_ASR_ZH00008 | 3.69 | 3.18 |
| SPEECHIO_ASR_ZH00009 | 3.02 | 2.78 |
| SPEECHIO_ASR_ZH00010 | 3.35 | 2.99 |
| SPEECHIO_ASR_ZH00011 | 1.54 | 1.25 |
| SPEECHIO_ASR_ZH00012 | 2.06 | 1.68 |
| SPEECHIO_ASR_ZH00013 | 2.57 | 2.25 |
| SPEECHIO_ASR_ZH00014 | 3.86 | 3.08 |
| SPEECHIO_ASR_ZH00015 | 3.34 | 2.67 |

## Usage and scope

Runtime environments:
- Supports Linux-x86_64, Mac and Windows.

Usage modes:
- Direct inference: decode input audio directly and output the target text.
- Fine-tuning: load the trained model and continue training on private or open-source data.

Target scenarios:
- Suited to offline speech recognition, such as transcribing recorded files; GPU inference works best; input audio duration is unlimited and can be several hours long.

## Model limitations and possible bias

Differences in the feature-extraction pipeline and training tooling can cause small differences in CER (<0.1%); differences in the GPU inference environment lead to differences in the RTF figure.

## Papers and citation

```BibTeX
@inproceedings{gao2022paraformer,
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
  booktitle={INTERSPEECH},
  year={2022}
}
```
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/am.mvn
New file @@ -0,0 +1,8 @@
[Kaldi-style global CMVN statistics: an <Nnet> containing <Splice> 560 560, a fixed mean-offset <AddShift> 560 560, and a fixed variance-scale <Rescale> 560 560 (both with <LearnRateCoef> 0). The same 80-dimensional mean and scale vectors repeat seven times, matching the 7-frame LFR-stacked 560-dimensional input. The raw numeric matrices are omitted here.]
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/config.yaml
New file @@ -0,0 +1,134 @@
# This is an example that demonstrates how to configure a model file.
# You can modify the configuration according to your own requirements.

# to print the register_table:
# from funasr.register import tables
# tables.print()

# network architecture
#model: funasr.models.paraformer.model:Paraformer
model: BiCifParaformer
model_conf:
    ctc_weight: 0.0
    lsm_weight: 0.1
    length_normalized_loss: true
    predictor_weight: 1.0
    predictor_bias: 1
    sampling_ratio: 0.75

# encoder
encoder: SANMEncoder
encoder_conf:
    output_size: 512
    attention_heads: 4
    linear_units: 2048
    num_blocks: 50
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: pe
    pos_enc_class: SinusoidalPositionEncoder
    normalize_before: true
    kernel_size: 11
    sanm_shfit: 0
    selfattention_layer_type: sanm

# decoder
decoder: ParaformerSANMDecoder
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 16
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1
    att_layer_num: 16
    kernel_size: 11
    sanm_shfit: 0

predictor: CifPredictorV3
predictor_conf:
    idim: 512
    threshold: 1.0
    l_order: 1
    r_order: 1
    tail_threshold: 0.45
    smooth_factor2: 0.25
    noise_threshold2: 0.01
    upsample_times: 3
    use_cif1_cnn: false
    upsample_type: cnn_blstm

# frontend related
frontend: WavFrontend
frontend_conf:
    fs: 16000
    window: hamming
    n_mels: 80
    frame_length: 25
    frame_shift: 10
    lfr_m: 7
    lfr_n: 6

specaug: SpecAugLFR
specaug_conf:
    apply_time_warp: false
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    lfr_rate: 6
    num_freq_mask: 1
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 12
    num_time_mask: 1

train_conf:
    accum_grad: 1
    grad_clip: 5
    max_epoch: 150
    val_scheduler_criterion:
        - valid
        - acc
    best_model_criterion:
    -   - valid
        - acc
        - max
    keep_nbest_models: 10
    log_interval: 50

optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 30000

dataset: AudioDataset
dataset_conf:
    index_ds: IndexDSJsonl
    batch_sampler: DynamicBatchLocalShuffleSampler
    batch_type: example # example or length
    batch_size: 1 # if batch_type is example, batch_size is the number of samples; if length, batch_size is source_token_len+target_token_len
    max_token_length: 2048 # filter samples if source_token_len+target_token_len > max_token_length
    buffer_size: 500
    shuffle: True
    num_workers: 0

tokenizer: CharTokenizer
tokenizer_conf:
    unk_symbol: <unk>
    split_with_space: true

ctc_conf:
    dropout_rate: 0.0
    ctc_type: builtin
    reduce: true
    ignore_nan_grad: true
normalize: null
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/configuration.json
New file @@ -0,0 +1,17 @@
{
  "framework": "pytorch",
  "task": "auto-speech-recognition",
  "model": {"type": "funasr"},
  "pipeline": {"type": "funasr-pipeline"},
  "vad_model": "iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
  "punc_model": "iic/punc_ct-transformer_cn-en-common-vocab471067-large",
  "lm_model": "iic/speech_transformer_lm_zh-cn-common-vocab8404-pytorch",
  "model_name_in_hub": {
    "ms": "iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    "hf": ""},
  "file_path_metas": {
    "init_param": "model.pt",
    "config": "config.yaml",
    "tokenizer_conf": {"token_list": "tokens.json", "seg_dict_file": "seg_dict"},
    "frontend_conf": {"cmvn_file": "am.mvn"}}
}
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav Binary files differ
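The configuration.json above wires the auxiliary models together and, via file_path_metas, maps logical roles (weights, config, tokenizer files, CMVN) to files inside the model directory. As a rough illustration of how that mapping resolves to concrete paths (a minimal sketch, not FunASR's actual loader; the JSON is abridged to the keys used, and /path/to/model is a hypothetical directory):

```python
import json

# Abridged copy of the configuration.json shown above.
CONFIG_JSON = """
{
  "framework": "pytorch",
  "file_path_metas": {
    "init_param": "model.pt",
    "config": "config.yaml",
    "tokenizer_conf": {"token_list": "tokens.json", "seg_dict_file": "seg_dict"},
    "frontend_conf": {"cmvn_file": "am.mvn"}
  }
}
"""

def resolve_file_metas(metas, model_dir):
    """Recursively turn the file_path_metas tree into full paths."""
    resolved = {}
    for key, value in metas.items():
        if isinstance(value, dict):
            resolved[key] = resolve_file_metas(value, model_dir)
        else:
            resolved[key] = f"{model_dir}/{value}"
    return resolved

cfg = json.loads(CONFIG_JSON)
paths = resolve_file_metas(cfg["file_path_metas"], "/path/to/model")
print(paths["init_param"])                  # /path/to/model/model.pt
print(paths["frontend_conf"]["cmvn_file"])  # /path/to/model/am.mvn
```

This is why model.pt, config.yaml, tokens.json, seg_dict, and am.mvn must all sit in the same directory when the training folder is copied or archived.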
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/fig/struct.png
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt Binary files differ
models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/seg_dict
New file Diff too large models/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/tokens.json
New file Diff too large 音频标注数据.xlsx ("audio annotation data", the export described in START_USE.md) Binary files differ
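The frontend settings in the config.yaml above also explain the 560 dimension of the CMVN statistics earlier in this diff: with low-frame-rate (LFR) stacking, lfr_m consecutive 80-mel frames are concatenated into one encoder input vector, advancing lfr_n frames per step. A minimal sketch of that arithmetic (plain Python, no FunASR dependency; values copied from frontend_conf):

```python
# Feature arithmetic implied by frontend_conf in config.yaml.
n_mels = 80          # mel filterbank bins per 25 ms analysis frame
lfr_m = 7            # LFR: stack 7 consecutive frames into one vector
lfr_n = 6            # LFR: advance 6 frames between stacked vectors
frame_shift_ms = 10  # hop between raw frames

input_dim = n_mels * lfr_m                 # dimension the encoder (and CMVN file) sees
output_shift_ms = frame_shift_ms * lfr_n   # effective hop of one LFR frame

print(input_dim)        # 560 -- matches the 560-dim AddShift/Rescale stats
print(output_shift_ms)  # 60 (ms per encoder input frame)
```

So each row of the mean/scale vectors in the CMVN file normalizes one of the 7 x 80 stacked mel coefficients, which is why the same 80-dimensional statistics repeat seven times.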