From 28ccfbfc51068a663a80764e14074df5edf2b5ba Mon Sep 17 00:00:00 2001
From: kongdeqiang <kongdeqiang960204@163.com>
Date: Fri, 13 Mar 2026 17:41:41 +0800
Subject: [PATCH] Update tutorial README: mps/xpu devices, train_ds.py entry point, large-scale data training

---
 docs/tutorial/README.md |   32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/docs/tutorial/README.md b/docs/tutorial/README.md
index 74febcd..3a3886e 100644
--- a/docs/tutorial/README.md
+++ b/docs/tutorial/README.md
@@ -38,7 +38,7 @@
 model = AutoModel(model=[str], device=[str], ncpu=[int], output_dir=[str], batch_size=[int], hub=[str], **kwargs)
 ```
 - `model`(str): model name in the [Model Repository](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo), or a model path on local disk.
-- `device`(str): `cuda:0` (default gpu0) for using GPU for inference, specify `cpu` for using CPU.
+- `device`(str): `cuda:0` (default, GPU 0) to run inference on an NVIDIA GPU; `cpu` to run on the CPU; `mps` to run on Apple Silicon (M-series) Macs via the MPS backend; `xpu` to run on an Intel GPU.
 - `ncpu`(int): `4` (default), sets the number of threads for CPU internal operations.
 - `output_dir`(str): `None` (default), set this to specify the output path for the results.
 - `batch_size`(int): `1` (default), the number of samples per batch during decoding.
@@ -211,7 +211,7 @@
 ### Detailed Parameter Description:
 
 ```shell
-funasr/bin/train.py \
+funasr/bin/train_ds.py \
 ++model="${model_name_or_model_dir}" \
 ++train_data_set_list="${train_data}" \
 ++valid_data_set_list="${val_data}" \
@@ -252,7 +252,7 @@
 gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
 
 torchrun --nnodes 1 --nproc_per_node ${gpu_num} \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
 ```
 --nnodes represents the total number of participating nodes, while --nproc_per_node indicates the number of processes running on each node.
 
@@ -264,7 +264,7 @@
 gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
 
 torchrun --nnodes 2 --node_rank 0 --nproc_per_node ${gpu_num} --master_addr=192.168.1.1 --master_port=12345 \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
 ```
 On the worker node (assuming the IP is 192.168.1.2), you need to ensure that the MASTER_ADDR and MASTER_PORT environment variables are set to match those of the master node, and then run the same command:
 
@@ -273,7 +273,7 @@
 gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
 
 torchrun --nnodes 2 --node_rank 1 --nproc_per_node ${gpu_num} --master_addr=192.168.1.1 --master_port=12345 \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
 ```
 
 --nnodes indicates the total number of nodes participating in the training, --node_rank represents the ID of the current node, and --nproc_per_node specifies the number of processes running on each node (usually corresponds to the number of GPUs).
@@ -321,6 +321,28 @@
 ++jsonl_file_in="../../../data/list/train.jsonl"
 ```
 
+
+#### Large-Scale Data Training
+When training on large datasets (e.g., 50,000 hours or more), loading the full data index can exhaust memory, especially in multi-GPU experiments. To avoid this, split the JSONL files into slices, list them in a plain text file (one slice per line), and set `data_split_num`. For example:
+```shell
+train_data="/root/data/list/data.list"
+
+funasr/bin/train_ds.py \
+++train_data_set_list="${train_data}" \
+++dataset_conf.data_split_num=256
+```
+**Details:**
+- `data.list`: a plain text file listing the split JSONL files. For example, the content of `data.list` might be:
+  ```bash
+  data/list/train.0.jsonl
+  data/list/train.1.jsonl
+  ...
+  ```
+- `data_split_num`: the number of slice groups. For instance, if `data.list` contains 512 lines and `data_split_num=256`, the data is divided into 256 groups of 2 JSONL files each, so only 2 JSONL files are loaded at a time, reducing memory usage during training. Note: groups are formed sequentially, in the order the files appear in `data.list`.
+
+**Recommendation:**
+If the dataset is extremely large and contains heterogeneous data types, balance the data across slices when splitting so that each group has a similar distribution.
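+
+The grouping described above can be sketched as follows. This is an illustrative snippet only, not FunASR's internal implementation; the function name and chunking logic are assumptions based on the description of `data_split_num`:
+```python
+# Illustrative sketch (hypothetical helper, not the FunASR API):
+# partition the paths from data.list into data_split_num sequential groups.
+def split_data_list(paths, data_split_num):
+    """Partition `paths` into `data_split_num` equal, sequential groups."""
+    group_size = len(paths) // data_split_num  # e.g., 512 // 256 = 2
+    return [paths[i:i + group_size] for i in range(0, len(paths), group_size)]
+
+paths = [f"data/list/train.{i}.jsonl" for i in range(512)]
+groups = split_data_list(paths, data_split_num=256)
+# 256 groups of 2 files each; only one group needs to be in memory at a time.
+```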
+
 #### Training log
 
 ##### log.txt

--
Gitblit v1.9.1