python/FunASR-XL.git - Gitblit

FUNASR training

parent: 3cee2214

游雁

2023-11-27 54a91194901ad72562d5cb5856ee8c302d93fb0e

dataloader

3 files changed

	funasr/datasets/data_sampler.py	4
	funasr/datasets/dataloader_fn.py	2
	funasr/datasets/dataset_jsonl.py	7

 funasr/datasets/data_sampler.py

@@ -46,8 +46,8 @@

                idx_map = self.shuffle_idx[idx]
                # prompt = self.dataset.indexed_dataset[idx_map]["prompt"]
-                sample_len_cur = self.dataset.indexed_dataset[idx_map]["source_len"] + \
-                                 self.dataset.indexed_dataset[idx_map]["target_len"]
+                sample_len_cur = self.dataset.indexed_dataset.get_source_len(self.dataset.indexed_dataset[idx_map]) + \
+                                 self.dataset.indexed_dataset.get_target_len(self.dataset.indexed_dataset[idx_map])

                datalen_with_index.append([idx, sample_len_cur])
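
The change above routes the length lookup through accessor methods on the indexed dataset instead of indexing the record directly. A minimal, self-contained sketch of that pattern follows. `ToyIndexedDataset`, `collect_lengths`, and the sample records are hypothetical stand-ins rather than FunASR code; only `get_source_len`, `get_target_len`, `shuffle_idx`, and `datalen_with_index` mirror names that appear in this diff.

```python
class ToyIndexedDataset:
    """Hypothetical stand-in for the indexed dataset used by the sampler."""

    def __init__(self, contents):
        self.contents = contents

    def __len__(self):
        return len(self.contents)

    def __getitem__(self, index):
        return self.contents[index]

    def get_source_len(self, data_dict):
        return data_dict["source_len"]

    def get_target_len(self, data_dict):
        # Records without a target contribute zero length.
        return data_dict["target_len"] if "target_len" in data_dict else 0


def collect_lengths(indexed_dataset, shuffle_idx):
    """Pair each sampler position with its combined source + target length."""
    datalen_with_index = []
    for idx in range(len(shuffle_idx)):
        idx_map = shuffle_idx[idx]
        item = indexed_dataset[idx_map]
        sample_len_cur = (indexed_dataset.get_source_len(item)
                          + indexed_dataset.get_target_len(item))
        datalen_with_index.append([idx, sample_len_cur])
    return datalen_with_index


# Hypothetical records; the second one has no "target_len" key.
records = [
    {"source_len": 120, "target_len": 14},
    {"source_len": 80},
    {"source_len": 200, "target_len": 31},
]
lengths = collect_lengths(ToyIndexedDataset(records), shuffle_idx=[2, 0, 1])
print(lengths)
```

With direct indexing (`item["target_len"]`), the second record in this sketch would raise `KeyError`; with the accessor it counts as length 0.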
            

 funasr/datasets/dataloader_fn.py

@@ -47,7 +47,7 @@
                                                collate_fn=dataset.collator,
                                                batch_sampler=batch_sampler,
                                                shuffle=False,
-                                                num_workers=8,
+                                                num_workers=0,
                                                pin_memory=True)
    
    print(len(dataset))

 funasr/datasets/dataset_jsonl.py

@@ -78,6 +78,13 @@
    
    def __getitem__(self, index):
        return self.contents[index]
+
+    def get_source_len(self, data_dict):
+        return data_dict["source_len"]
+
+    def get_target_len(self, data_dict):
+
+        return data_dict["target_len"] if "target_len" in data_dict else 0


class AudioDataset(torch.utils.data.Dataset):
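
The `get_target_len` accessor added in this hunk falls back to 0 when a record has no `"target_len"` key. As a small sketch, the conditional form from the diff can be compared with `dict.get`, which gives the same result for plain dictionaries. The free functions and the sample records below are hypothetical, written outside the class only for brevity.

```python
def get_target_len(data_dict):
    # Form used in the diff: explicit membership test with a default of 0.
    return data_dict["target_len"] if "target_len" in data_dict else 0


def get_target_len_compact(data_dict):
    # Hypothetical alternative: dict.get with the same default.
    return data_dict.get("target_len", 0)


# Hypothetical records: one without and one with a target length.
samples = [{"source_len": 80}, {"source_len": 120, "target_len": 14}]
results = [(get_target_len(s), get_target_len_compact(s)) for s in samples]
print(results)
```

Both forms agree on these records, so choosing between them is a matter of style rather than behavior.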