python/FunASR-XL.git

			@@ -89,7 +89,7 @@
			</ul>
			</li>
			<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track & Evaluation</a><ul>
			<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
			<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
			<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
			<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
			</ul>
			@@ -131,7 +131,7 @@
			</section>
			<section id="detail-of-alimeeting-corpus">
			<h2>Detail of AliMeeting corpus<a class="headerlink" href="#detail-of-alimeeting-corpus" title="Permalink to this heading">¶</a></h2>
			<p>AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train and Eval sets contain 212 and 8 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train and Eval sets is 456 and 25, respectively, with balanced gender coverage.</p>
			<p>AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train, Eval and Test sets contain 212, 8 and 20 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train, Eval and Test sets is 456, 25 and 60, respectively, with balanced gender coverage.</p>
			<p>The dataset is collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms with sizes ranging from 8 m<span class="math notranslate nohighlight">\(^{2}\)</span> to 55 m<span class="math notranslate nohighlight">\(^{2}\)</span>. Different rooms give us a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. The type of wall material of the meeting venues covers cement, glass, etc. Other furnishings in meeting venues include sofa, TV, blackboard, fan, air conditioner, plants, etc. During recording, the participants of the meeting sit around the microphone array which is placed on the table and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise including but not limited to clicking, keyboard, door opening/closing, fan, bubble noise, etc., are made naturally. For both Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval set. An example of the recording venue from the Train set is shown in Fig 1.</p>
			<p><img alt="meeting room" src="_images/meeting_room.png" /></p>
			<p>The number of participants within one meeting session ranges from 2 to 4. To ensure the coverage of different overlap ratios, we select various meeting topics during recording, including medical treatment, education, business, organization management, industrial production and other daily routine meetings. The average speech overlap ratio of Train, Eval and Test sets are 42.27%, 34.76% and 42.8%, respectively. More details of AliMeeting are shown in Table 1. A detailed overlap ratio distribution of meeting sessions with different numbers of speakers in the Train, Eval and Test set is shown in Table 2.</p>