# Datasets

## Overview of training data

In the fixed training condition, the training data is restricted to three publicly available corpora: AliMeeting, AISHELL-4, and CN-Celeb. To evaluate models trained on these datasets, we will release a new test set called Test-2023 for scoring and ranking. The AliMeeting dataset and the Test-2023 set are described in detail below.

## Detail of AliMeeting corpus

AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval), and 10 hours as a test set (Test) for scoring and ranking. The Train, Eval, and Test sets contain 212, 8, and 20 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in the Train, Eval, and Test sets is 456, 25, and 60, respectively, with balanced gender coverage.

The dataset was collected in 13 meeting venues, categorized into three types: small, medium, and large rooms, with sizes ranging from 8 m$^{2}$ to 55 m$^{2}$. The different rooms provide a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. Wall materials in the meeting venues include cement, glass, etc. Other furnishings include sofa, TV, blackboard, fan, air conditioner, plants, etc. During recording, the participants sit around the microphone array, which is placed on the table, and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise occur naturally, including but not limited to clicking, keyboard, door opening/closing, fan, bubble noise, etc. For both the Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval sets. An example of a recording venue from the Train set is shown in Fig. 1.

![meeting room](images/meeting_room.png)

The number of participants within one meeting session ranges from 2 to 4. To cover different overlap ratios, we selected various meeting topics during recording, including medical treatment, education, business, organization management, industrial production, and other daily routine meetings. The average speech overlap ratios of the Train, Eval, and Test sets are 42.27%, 34.76%, and 42.8%, respectively. More details of AliMeeting are shown in Table 1. A detailed overlap ratio distribution of meeting sessions with different numbers of speakers in the Train, Eval, and Test sets is shown in Table 2.

![dataset detail](images/dataset_details.png)

The Test-2023 set consists of 20 sessions that were recorded in an acoustic setting identical to that of the AliMeeting corpus. Each meeting session in the Test-2023 set has between 2 and 4 participants, so its configuration is similar to that of the AliMeeting test set.

We also record the near-field signal of each participant using a headset microphone and ensure that only the participant's own speech is recorded and transcribed. Note that the far-field audio recorded by the microphone array and the near-field audio recorded by the headset microphone will be synchronized to a common timeline.

All transcriptions of the speech data are prepared in TextGrid format for each session. Each file contains the session duration, speaker information (number of speakers, speaker ID, gender, etc.), the total number of segments of each speaker, the timestamp and transcription of each segment, etc.
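
To make the overlap ratio figures above more concrete, the following is a minimal sketch of how such a ratio could be computed from per-speaker segment timestamps like those stored in the TextGrid transcriptions. It assumes the ratio is the fraction of total speech time during which two or more speakers are active; this page does not give the exact definition used for the reported numbers, so treat that as an assumption. The function name `overlap_ratio` and the input layout (a dict mapping speaker ID to a list of `(start, end)` pairs in seconds) are hypothetical and not part of FunASR or the baseline scripts.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def overlap_ratio(segments_by_speaker: Dict[str, List[Segment]]) -> float:
    """Return overlapped speech time divided by total speech time.

    Assumption: "overlap" means two or more speakers are active at once,
    and the segments of a single speaker do not overlap each other.
    """
    # Turn every segment into a +1 event at its start and a -1 event at its end.
    events = []
    for segments in segments_by_speaker.values():
        for start, end in segments:
            if end > start:  # ignore empty or malformed segments
                events.append((start, 1))
                events.append((end, -1))

    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so segments that merely touch are not counted as overlapping.
    events.sort()

    active = 0        # number of speakers currently talking
    speech = 0.0      # time with at least one active speaker
    overlapped = 0.0  # time with at least two active speakers
    prev_time = 0.0
    for time, delta in events:
        span = time - prev_time
        if active >= 1:
            speech += span
        if active >= 2:
            overlapped += span
        active += delta
        prev_time = time

    return overlapped / speech if speech > 0 else 0.0
```

For example, if speaker `A` talks from 0 s to 10 s and speaker `B` from 5 s to 15 s, there are 15 s of speech of which 5 s are overlapped, so `overlap_ratio({"A": [(0.0, 10.0)], "B": [(5.0, 15.0)]})` returns one third.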

## Get the data

The three training datasets mentioned above can be downloaded from [OpenSLR](https://openslr.org/resources.php). Participants can use the links below. The baseline also provides data preparation scripts for the AliMeeting corpus.

- [AliMeeting](https://openslr.org/119/)
- [AISHELL-4](https://openslr.org/111/)
- [CN-Celeb](https://openslr.org/82/)