Speaker-Attributed Automatic Speech Recognition (SA-ASR) is a task that addresses the question of "who spoke what". Specifically, the goal of SA-ASR is not only to obtain multi-speaker transcriptions, but also to identify the corresponding speaker for each utterance. The method used in this example is based on the paper: End-to-End Speaker-Attributed ASR with Transformer.
First, you need to install FunASR and ModelScope (installation).
After FunASR and ModelScope are installed, you must manually download and unpack the AliMeeting corpus and place it in the `./dataset` directory. The `./dataset` directory should be organized as follows:

```shell
dataset
|—— Eval_Ali_far
|—— Eval_Ali_near
|—— Test_Ali_far
|—— Test_Ali_near
|—— Train_Ali_far
|—— Train_Ali_near
```
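As a quick sanity check before launching the recipe, a small shell loop (using the directory names from the layout above) can confirm that all six sub-directories are in place:

```shell
# Check that every expected AliMeeting sub-directory exists under ./dataset;
# prints one warning line per missing directory, and nothing when all are present
for d in Eval_Ali_far Eval_Ali_near Test_Ali_far Test_Ali_near Train_Ali_far Train_Ali_near; do
  [ -d "./dataset/$d" ] || echo "missing: ./dataset/$d"
done
```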
Then you can run this recipe with:

```shell
bash run.sh --stage 0 --stop-stage 6
```
There are 8 stages in run.sh:

```shell
stage 0: Data preparation; remove audio that is too long or too short.
stage 1: Speaker profile and CMVN generation.
stage 2: Dictionary preparation.
stage 3: LM training (not supported).
stage 4: ASR training.
stage 5: SA-ASR training.
stage 6: Inference.
stage 7: Inference with Test_2023_Ali_far.
```
Place the test audio in ./dataset/Test_2023_Ali_far/ and put the wav.scp, segments, utt2spk, and spk2utt files in ./data/org/Test_2023_Ali_far/. The test_2023 variable in run.sh should be set to Test_2023_Ali_far. Then run run.sh as follows:

```shell
# Prepare the test_2023 set
bash run.sh --stage 0 --stop-stage 1
# Decode the test_2023 set
bash run.sh --stage 7 --stop-stage 7
```
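The four files follow the standard Kaldi data-directory conventions (wav.scp: recording-id to wav path; segments: utterance-id, recording-id, start, end; utt2spk: utterance-id to speaker-id; spk2utt: the inverse mapping). A minimal sketch with made-up IDs — the real utterance and speaker IDs come from your own segmentation — might look like:

```shell
# Create Kaldi-style data files with illustrative (hypothetical) IDs
mkdir -p data/org/Test_2023_Ali_far
cat > data/org/Test_2023_Ali_far/wav.scp <<'EOF'
R2023_Meeting_1 ./dataset/Test_2023_Ali_far/R2023_Meeting_1.wav
EOF
cat > data/org/Test_2023_Ali_far/segments <<'EOF'
R2023_Meeting_1-0000000-0000500 R2023_Meeting_1 0.00 5.00
R2023_Meeting_1-0000600-0001200 R2023_Meeting_1 6.00 12.00
EOF
cat > data/org/Test_2023_Ali_far/utt2spk <<'EOF'
R2023_Meeting_1-0000000-0000500 SPK_A
R2023_Meeting_1-0000600-0001200 SPK_A
EOF
# spk2utt is just utt2spk inverted: each speaker followed by all its utterance IDs
awk '{u[$2]=u[$2]" "$1} END{for(s in u) print s u[s]}' \
  data/org/Test_2023_Ali_far/utt2spk > data/org/Test_2023_Ali_far/spk2utt
```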
Finally, you need to submit a file called text_spk_merge with the following format:

```shell
Meeting_1 text_spk_1_A$text_spk_1_B$text_spk_1_C ...
Meeting_2 text_spk_2_A$text_spk_2_B$text_spk_2_C ...
...
```
Here, text_spk_1_A represents the full transcription of speaker_A of Meeting_1 (merged in chronological order), and $ is the separator symbol. There is no need to worry about the speaker permutation, as the optimal permutation will be computed at the end. For more information, please refer to the results generated after executing the baseline code.
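As an illustration of the separator, here is a hypothetical shell snippet that joins three per-speaker transcripts into one submission line for a single meeting (the transcript contents are made up):

```shell
# Hypothetical per-speaker transcripts for Meeting_1 (illustrative content only)
spk_a="hello everyone"
spk_b="good morning"
spk_c="let us begin"
# Join the speaker transcriptions with the "$" separator symbol
printf 'Meeting_1 %s$%s$%s\n' "$spk_a" "$spk_b" "$spk_c"
```

This prints `Meeting_1 hello everyone$good morning$let us begin`, i.e. one line per meeting with per-speaker transcripts separated by `$`, matching the format above.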
N. Kanda, G. Ye, Y. Gaur, X. Wang, Z. Meng, Z. Chen, and T. Yoshioka, "End-to-end speaker-attributed ASR with transformer," in Interspeech. ISCA, 2021, pp. 4413–4417.