# Datasets

## Overview of training data
In the fixed training condition, the training data are restricted to three publicly available corpora: AliMeeting, AISHELL-4, and CN-Celeb. To evaluate the performance of models trained on these datasets, we will release a new test set, called Test-2023, for scoring and ranking. We describe the AliMeeting dataset and the Test-2023 set in detail below.
## Detail of AliMeeting corpus
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval), and 10 hours for testing (Test), the last of which is used for scoring and ranking. Specifically, the Train, Eval, and Test sets contain 212, 8, and 20 sessions, respectively. Each session consists of a 15- to 30-minute discussion among a group of participants. The total numbers of participants in the Train, Eval, and Test sets are 456, 25, and 60, respectively, with balanced gender coverage.
The dataset was collected in 13 meeting venues, categorized into three room types: small, medium, and large, with sizes ranging from 8 m$^{2}$ to 55 m$^{2}$. The different rooms provide a variety of acoustic properties and layouts, and the detailed parameters of each meeting venue will be released together with the Train data. The wall materials of the venues include cement, glass, etc.; other furnishings include sofas, TVs, blackboards, fans, air conditioners, plants, etc. During recording, the participants sit around a microphone array placed on the table and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise, including but not limited to clicking, keyboard typing, door opening/closing, fan, and babble noise, occur naturally. For both the Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval sets. An example of a recording venue from the Train set is shown in Fig. 1.

![meeting room](images/meeting_room.png)
The number of participants within one meeting session ranges from 2 to 4. To ensure coverage of different overlap ratios, we selected various meeting topics during recording, including medical treatment, education, business, organization management, industrial production, and other daily routine meetings. The average speech overlap ratios of the Train, Eval, and Test sets are 42.27%, 34.76%, and 42.8%, respectively. More details of AliMeeting are shown in Table 1, and a detailed overlap-ratio distribution of meeting sessions with different numbers of speakers in the Train, Eval, and Test sets is shown in Table 2.
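For concreteness, the speech overlap ratio can be computed from segment timestamps as the fraction of total speech time during which two or more speakers are active at once. The sketch below illustrates one common definition using a sweep line over segment boundaries; the function name and implementation are ours, not part of the corpus tooling:

```python
def overlap_ratio(segments):
    """Fraction of total speech time in which >= 2 speakers are active.

    segments: list of (start, end) tuples in seconds, pooled over all
    speakers in a session.
    """
    # Sweep-line events: +1 at each segment start, -1 at each end.
    # At equal timestamps, ends (-1) sort before starts (+1), so
    # back-to-back segments do not count as overlapped.
    events = sorted([(s, 1) for s, e in segments] +
                    [(e, -1) for s, e in segments])
    active = 0              # number of simultaneously active speakers
    speech = overlap = 0.0
    prev = None
    for t, delta in events:
        if prev is not None and active > 0:
            speech += t - prev           # someone is talking in [prev, t)
            if active > 1:
                overlap += t - prev      # at least two talkers overlap
        prev = t
        active += delta
    return overlap / speech if speech else 0.0

# Two speakers, segments (0, 4) and (2, 6): 2 s of overlap
# out of 6 s of total speech -> ratio 1/3.
print(overlap_ratio([(0.0, 4.0), (2.0, 6.0)]))
```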
![dataset detail](images/dataset_details.png)

The Test-2023 set consists of 20 sessions recorded in an acoustic setting identical to that of the AliMeeting corpus. Each meeting session in Test-2023 comprises 2 to 4 participants, and thus shares a similar configuration with the AliMeeting Test set.
We also record the near-field signal of each participant using a headset microphone and ensure that only the participant's own speech is recorded and transcribed. It is worth noting that the far-field audio recorded by the microphone array and the near-field audio recorded by the headset microphones will be synchronized to a common timeline.
All transcriptions of the speech data are prepared in TextGrid format for each session. Each file contains the session duration, speaker information (number of speakers, speaker IDs, gender, etc.), the total number of segments per speaker, and the timestamp and transcription of each segment.
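As a sketch of how such annotations can be consumed, the snippet below pulls labeled intervals out of a TextGrid file in the standard long ("ooTextFile") format. It is a minimal illustration, not the challenge's official tooling: the regular expressions only handle interval tiers with unescaped quotes, and the sample tier name `SPK_001` and transcript are made up. A real pipeline would more likely use a dedicated TextGrid library.

```python
import re

# Minimal sample in the long TextGrid format; the tier name and
# transcript below are illustrative, not taken from AliMeeting.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 10
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "SPK_001"
        xmin = 0
        xmax = 10
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 4.5
            text = "hello"
        intervals [2]:
            xmin = 4.5
            xmax = 10
            text = ""
'''

INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')

def read_intervals(textgrid_text):
    """Return (tier_name, xmin, xmax, text) for every interval."""
    out = []
    # Split on tier headers such as 'item [1]:'; chunk 0 is the file header.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name = re.search(r'name = "([^"]*)"', chunk).group(1)
        for m in INTERVAL_RE.finditer(chunk):
            out.append((name, float(m.group(1)), float(m.group(2)), m.group(3)))
    return out

for tier, start, end, text in read_intervals(SAMPLE):
    if text:  # skip silence/empty intervals
        print(f"{tier} [{start:.2f}-{end:.2f}]: {text}")
```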
## Get the data
The three training datasets mentioned above can be downloaded from [OpenSLR](https://openslr.org/resources.php). Participants can download them via the following links. In particular, the baseline provides convenient data preparation scripts for the AliMeeting corpus.

- [AliMeeting](https://openslr.org/119/)
- [AISHELL-4](https://openslr.org/111/)
- [CN-Celeb](https://openslr.org/82/)