PyTorch: "Unable to open object (bad object header version number)" when num_workers > 1
Problem description:
KeyError: 'Unable to open object (bad object header version number)'
Cause:
num_workers > 1 is set and the Dataset reads from an h5py file. The usual explanation is that h5py files do not support multi-threaded reads, but that is not quite the real cause: the h5py.File handle opened in the main process is copied into every DataLoader worker process, and sharing that handle's internal state across processes corrupts reads.
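For context, here is a minimal sketch of the failing pattern (the file name img.hdf5 and the dataset name 'dataset' are placeholders):

import h5py
import torch

class BrokenDataset(torch.utils.data.Dataset):
    def __init__(self):
        # The handle is created in the main process and later copied
        # into every DataLoader worker when the workers fork.
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item: int):
        return self.dataset[item]

# With num_workers > 1, the copied handles share internal state and
# reads can fail with "Unable to open object (bad object header
# version number)".
loader = torch.utils.data.DataLoader(BrokenDataset(), num_workers=2)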
Solution:
This issue can be solved, and the solution is simple (two steps):
- Do not open the hdf5 file inside __init__.
- Open the hdf5 file at the first data iteration (i.e., in the first __getitem__ call).
Example fix:
import h5py
import torch

class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        """Do not open the hdf5 file here!"""

    def open_hdf5(self):
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want a specific dataset

    def __len__(self):
        with h5py.File('img.hdf5', 'r') as f:  # no handle is cached before the fork
            return len(f['dataset'])

    def __getitem__(self, item: int):
        # Lazily open the file on first access, so each DataLoader
        # worker process creates its own handle after the fork.
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()
        img0 = self.img_hdf5['dataset'][0]  # do the loading here
        img1 = self.dataset[1]
        return img0, img1
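A usage sketch under the same assumptions (img.hdf5 and its 'dataset' entry are placeholders); each worker opens its own handle on its first __getitem__ call, so reads no longer clash:

loader = torch.utils.data.DataLoader(LXRTDataLoader(), batch_size=32, num_workers=4)
for img0, img1 in loader:
    pass  # each worker process now holds its own h5py.File handle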
Explanation:
The multiprocessing actually happens when you create the data iterator (e.g., when calling "for datum in dataloader:").
In short, it creates multiple worker processes that "copy" the state of the current process. Thus, if we open the hdf5 file at the first data iteration, the opened file object is dedicated to each subprocess.
If you instead create the hdf5 file handle in __init__ and set num_workers > 0, it might cause two issues:
- The writing behavior is non-deterministic. (We do not need to write to the hdf5 file, so this issue can be ignored.)
- The state of the hdf5 handle is copied into each worker, which might not faithfully reflect the current state of the file.
With the approach above, we bypass these two issues.
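As a note, PyTorch's DataLoader also accepts a worker_init_fn callback that runs once inside each worker process; opening the file there is an equivalent, more explicit variant of the same idea. A sketch, reusing the hypothetical img.hdf5 from above:

import h5py
import torch

def open_hdf5_in_worker(worker_id):
    # Runs inside each worker process after the fork, so every worker
    # ends up with its own h5py.File handle.
    info = torch.utils.data.get_worker_info()
    ds = info.dataset  # this worker's copy of the Dataset object
    ds.img_hdf5 = h5py.File('img.hdf5', 'r')
    ds.dataset = ds.img_hdf5['dataset']

loader = torch.utils.data.DataLoader(
    LXRTDataLoader(), batch_size=32, num_workers=4,
    worker_init_fn=open_hdf5_in_worker)

Note that with num_workers=0 the callback never runs, so the lazy hasattr check in __getitem__ still covers the single-process case.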