python/FunASR-XL.git

parent: 57e023e5

游雁

2025-02-25 82e5ca37a8bd80f56c99f9d790a03b458ced716b

Large-Scale Data Training

2 files changed

	docs/tutorial/README.md	30
	docs/tutorial/README_zh.md	27

 docs/tutorial/README.md

@@ -211,7 +211,7 @@
### Detailed Parameter Description:

```shell
-funasr/bin/train.py \
+funasr/bin/train_ds.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
@@ -252,7 +252,7 @@
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes 1 --nproc_per_node ${gpu_num} \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
```
--nnodes represents the total number of participating nodes, while --nproc_per_node indicates the number of processes running on each node.

@@ -264,7 +264,7 @@
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes 2 --node_rank 0 --nproc_per_node ${gpu_num} --master_addr=192.168.1.1 --master_port=12345 \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
```
On the worker node (assuming its IP is 192.168.1.2), make sure that MASTER_ADDR and MASTER_PORT match those of the master node, then run the same command with `--node_rank 1`:

@@ -273,7 +273,7 @@
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes 2 --node_rank 1 --nproc_per_node ${gpu_num} --master_addr=192.168.1.1 --master_port=12345 \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
```

--nnodes indicates the total number of nodes participating in the training, --node_rank represents the ID of the current node, and --nproc_per_node specifies the number of processes running on each node (usually corresponds to the number of GPUs).
@@ -321,6 +321,28 @@
++jsonl_file_in="../../../data/list/train.jsonl"
```


#### Large-Scale Data Training  
When training on very large datasets (e.g., 50,000 hours or more), you may run out of memory, especially in multi-GPU experiments. To avoid this, split the JSONL file into slices, list the slice paths in a plain text file (one slice per line), and set `data_split_num`. For example:  
```shell  
train_data="/root/data/list/data.list"  

funasr/bin/train_ds.py \  
++train_data_set_list="${train_data}" \  
++dataset_conf.data_split_num=256  
```  
**Details:**  
- `data.list`: A plain text file listing the split JSONL files. For example, the content of `data.list` might be:  
  ```bash  
  data/list/train.0.jsonl  
  data/list/train.1.jsonl  
  ...  
  ```  
- `data_split_num`: Specifies the number of slice groups. For instance, if `data.list` contains 512 lines and `data_split_num=256`, the data will be divided into 256 groups, each containing 2 JSONL files. This ensures that only 2 JSONL files are loaded for training at a time, reducing memory usage during training. Note: Groups are created sequentially.  
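
The sequential grouping described above can be sketched in a few lines of Python. This is only an illustration: `group_sequentially` is a hypothetical helper, not a FunASR function, and the handling of a line count that is not divisible by `data_split_num` is an assumption, since that case is not described here.

```python
def group_sequentially(paths, data_split_num):
    """Split `paths` into `data_split_num` groups of consecutive entries.

    Assumption: when len(paths) is not divisible by data_split_num,
    the first groups receive one extra file each.
    """
    if data_split_num <= 0:
        raise ValueError("data_split_num must be positive")
    base, extra = divmod(len(paths), data_split_num)
    groups, start = [], 0
    for i in range(data_split_num):
        size = base + (1 if i < extra else 0)
        groups.append(paths[start:start + size])
        start += size
    return groups


# Mirror the example in the text: 512 listed slices, data_split_num=256.
paths = [f"data/list/train.{i}.jsonl" for i in range(512)]
groups = group_sequentially(paths, 256)
print(len(groups), groups[0])
```

With 512 entries and 256 groups, every group holds two consecutive slices, so the order of lines in `data.list` decides which slices are loaded together.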

**Recommendation:**  
If the dataset is extremely large and contains heterogeneous data types, perform **data balancing** during splitting to ensure uniformity across groups.
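
One possible way to produce the slices and `data.list` is sketched below. It is not part of FunASR: `split_jsonl` is a hypothetical helper, the `key` field in the demo records is a placeholder, and round-robin assignment is just one simple way to spread heterogeneous data evenly across slices.

```python
import json
import os
import tempfile


def split_jsonl(jsonl_path, out_dir, num_slices, list_name="data.list"):
    """Distribute the lines of one JSONL file round-robin over `num_slices`
    slice files and write a list file with one slice path per line."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(jsonl_path))[0]
    slice_paths = [
        os.path.join(out_dir, f"{stem}.{i}.jsonl") for i in range(num_slices)
    ]
    # All slice files stay open at once; for a very large num_slices,
    # check the open-file limit of your system first.
    outputs = [open(p, "w", encoding="utf-8") for p in slice_paths]
    try:
        with open(jsonl_path, encoding="utf-8") as src:
            index = 0
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                outputs[index % num_slices].write(line.rstrip("\n") + "\n")
                index += 1
    finally:
        for handle in outputs:
            handle.close()
    list_path = os.path.join(out_dir, list_name)
    with open(list_path, "w", encoding="utf-8") as f:
        f.write("\n".join(slice_paths) + "\n")
    return list_path, slice_paths


# Small demo in a temporary directory with 10 placeholder records.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "train.jsonl")
    with open(src_path, "w", encoding="utf-8") as f:
        for i in range(10):
            f.write(json.dumps({"key": f"utt{i}"}) + "\n")
    list_path, slice_paths = split_jsonl(src_path, os.path.join(tmp, "list"), 4)
    with open(list_path, encoding="utf-8") as f:
        listed = f.read().splitlines()
    counts = []
    for path in slice_paths:
        with open(path, encoding="utf-8") as f:
            counts.append(sum(1 for _ in f))
print(counts)
```

Because records are interleaved line by line, each slice receives a similar mix of whatever sources appear in the input file. If the sources live in separate JSONL files, concatenating or interleaving them before splitting would be needed to get the same effect.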

#### Training log

##### log.txt

 docs/tutorial/README_zh.md

@@ -213,7 +213,7 @@
### Detailed Parameter Description

```shell
-funasr/bin/train.py \
+funasr/bin/train_ds.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
@@ -258,7 +258,7 @@
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes 1 --nproc_per_node ${gpu_num} \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
```
--nnodes is the total number of participating nodes, and --nproc_per_node is the number of processes running on each node.

@@ -270,7 +270,7 @@
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes 2 --node_rank 0 --nproc_per_node ${gpu_num} --master_addr 192.168.1.1 --master_port 12345 \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
```
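On the worker node (assuming its IP is 192.168.1.2), make sure that the MASTER_ADDR and MASTER_PORT environment variables match those set on the master node, then run the same command: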
```shell
@@ -278,7 +278,7 @@
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

torchrun --nnodes 2 --node_rank 1 --nproc_per_node ${gpu_num} --master_addr 192.168.1.1 --master_port 12345 \
-../../../funasr/bin/train.py ${train_args}
+../../../funasr/bin/train_ds.py ${train_args}
```

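--nnodes is the total number of participating nodes, --node_rank is the ID of the current node, and --nproc_per_node is the number of processes running on each node (usually the number of GPUs).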
@@ -331,6 +331,25 @@
++jsonl_file_in="../../../data/list/train.jsonl"
```

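#### Large-Scale Data Training
When the dataset is very large (e.g., more than 50,000 hours), it is easy to run out of memory, especially in multi-GPU experiments. In that case, split the JSONL file into slices, write the slice paths to a text file (one slice per line), and set `data_split_num`. For example: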
```shell
train_data="/root/data/list/data.list"

funasr/bin/train_ds.py \
++train_data_set_list="${train_data}" \
++dataset_conf.data_split_num=256
```
Where:
`data.list`: a plain text file that lists the split JSONL files. For example, the content of `data.list` is:
```bash
data/list/train.0.jsonl
data/list/train.1.jsonl
...
```
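`data_split_num`: the number of groups the slices are divided into. For example, if `data.list` has 512 lines and `data_split_num=256`, the slices are divided into 256 groups of 2 JSONL files each, so only 2 JSONL files are loaded for training at a time, which reduces memory usage during training. Note that groups are formed in order.
If the dataset is extremely large and the data types differ considerably, it is recommended to balance the data when splitting.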

#### Viewing Training Logs

##### Viewing the Experiment Log
