
Training with my own data #48

Closed
cszer opened this issue Jun 21, 2020 · 9 comments

cszer commented Jun 21, 2020

Hello, thanks for this awesome project. I have a strange issue. I prepared my own dataset with images of size 542x1024, and when training starts I always get:

```
N/A% (0 of 200) | | Elapsed Time: 0:00:00 ETA: --:--:--
N/A% (0 of 946) | | Elapsed Time: 0:00:00 ETA: --:--:--
[torch.Size([2, 256, 34, 64]), torch.Size([2, 256, 34, 64])]
[torch.Size([2, 128, 68, 128]), torch.Size([2, 128, 68, 128])]
[torch.Size([2, 64, 136, 256]), torch.Size([2, 64, 136, 256])]
[torch.Size([2, 32, 272, 512]), torch.Size([2, 64, 271, 512])]
Dimension error when torch.cat(x,1)
```

Maybe it's a stride/padding issue. Please help me.


JiawangBian commented Jun 21, 2020

Often we require that the image width and height be divisible by 64. Here 1024/64 = 16, which is fine, but 542/64 ≈ 8.47 is not. So I suggest that you cut the top border of the image to make the height 8*64 = 512. Also, do not forget to change the intrinsic parameters (c_y = c_y - offset_y).
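A minimal sketch of that crop-plus-intrinsics adjustment (the helper name and the example intrinsics values are made up for illustration; this function is not part of the repo):

```python
import numpy as np

def crop_to_multiple(img, K, multiple=64):
    """Crop rows off the top of `img` so its height is a multiple of
    `multiple`, and shift the principal point c_y accordingly.
    img: H x W (x C) array; K: 3x3 intrinsics matrix."""
    h = img.shape[0]
    new_h = (h // multiple) * multiple
    offset_y = h - new_h              # rows removed from the top
    img = img[offset_y:]              # keep the bottom new_h rows
    K = K.copy()
    K[1, 2] -= offset_y               # c_y = c_y - offset_y
    return img, K

# Example with the 542x1024 images from this issue (intrinsics are dummies):
img = np.zeros((542, 1024, 3), dtype=np.uint8)
K = np.array([[720.0, 0.0, 512.0],
              [0.0, 720.0, 271.0],
              [0.0, 0.0, 1.0]])
cropped, K2 = crop_to_multiple(img, K)
# cropped height: 512; c_y: 271 - 30 = 241
```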


cszer commented Jun 22, 2020

Thank you, it works! But there is a new issue now, a problem with nn.DataParallel: `RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)`. When I use only one card everything is fine, but it's impossible to train on a single 2070 Super.


cszer commented Jun 22, 2020

This issue occurs at the decoder stage, in every network (disp net, pose net).

JiawangBian (Owner) commented:

I suggest that you train the model on one GPU, because the batch size (4) is small. You can also downsample your images to half resolution, i.e., 256x512. If you want to try multi-GPU training, I suggest that you replace the DepthDecoder with the following parallel version.

```python
import numpy as np
import torch
import torch.nn as nn

# ConvBlock, Conv3x3, and upsample come from the repo's layers module.

class DepthDecoder_parallel(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder_parallel, self).__init__()

        self.alpha = 10
        self.beta = 0.01

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder: build plain lists first, then convert them to nn.ModuleList
        # below so that every conv is registered and replicated by DataParallel
        self.upconvs0 = []
        self.upconvs1 = []
        self.dispconvs = []
        self.i_to_scaleIdx_conversion = {}

        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            self.upconvs0.append(ConvBlock(num_ch_in, num_ch_out))

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.upconvs1.append(ConvBlock(num_ch_in, num_ch_out))

        for cnt, s in enumerate(self.scales):
            self.dispconvs.append(Conv3x3(self.num_ch_dec[s], self.num_output_channels))
            if s in range(4, -1, -1):
                self.i_to_scaleIdx_conversion[s] = cnt

        self.upconvs0 = nn.ModuleList(self.upconvs0)
        self.upconvs1 = nn.ModuleList(self.upconvs1)
        self.dispconvs = nn.ModuleList(self.dispconvs)
        self.sigmoid = nn.Sigmoid()

    def init_weights(self):
        return

    def forward(self, input_features):
        self.outputs = []

        # decoder
        x = input_features[-1]
        for cnt, i in enumerate(range(4, -1, -1)):
            x = self.upconvs0[cnt](x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)
            x = self.upconvs1[cnt](x)
            if i in self.scales:
                idx = self.i_to_scaleIdx_conversion[i]
                self.outputs.append(self.alpha * self.sigmoid(self.dispconvs[idx](x)) + self.beta)

        self.outputs = self.outputs[::-1]
        return self.outputs
```

JiawangBian (Owner) commented:

and replace the PoseDecoder with:

```python
import torch
import torch.nn as nn

class PoseDecoder_Parallel(nn.Module):
    def __init__(self, num_ch_enc, num_input_features=1, num_frames_to_predict_for=1, stride=1):
        super(PoseDecoder_Parallel, self).__init__()

        self.num_ch_enc = num_ch_enc
        self.num_input_features = num_input_features

        if num_frames_to_predict_for is None:
            num_frames_to_predict_for = num_input_features - 1
        self.num_frames_to_predict_for = num_frames_to_predict_for

        self.conv_squeeze = nn.Conv2d(self.num_ch_enc[-1], 256, 1)

        self.convs_pose = []
        self.convs_pose.append(nn.Conv2d(num_input_features * 256, 256, 3, stride, 1))
        self.convs_pose.append(nn.Conv2d(256, 256, 3, stride, 1))
        self.convs_pose.append(nn.Conv2d(256, 6 * num_frames_to_predict_for, 1))

        self.relu = nn.ReLU()

        # register the convs so DataParallel replicates them
        self.convs_pose = nn.ModuleList(self.convs_pose)

    def forward(self, input_features):
        last_features = [f[-1] for f in input_features]

        cat_features = [self.relu(self.conv_squeeze(f)) for f in last_features]
        cat_features = torch.cat(cat_features, 1)

        out = cat_features
        for i in range(3):
            out = self.convs_pose[i](out)
            if i != 2:
                out = self.relu(out)

        out = out.mean(3).mean(2)
        pose = 0.01 * out.view(-1, 6)
        return pose
```
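As an aside, the earlier suggestion to downsample images to half resolution also requires rescaling the camera intrinsics. A minimal sketch (the helper name and example values are hypothetical, not from the repo):

```python
import numpy as np

def downsample_intrinsics(K, sx, sy):
    """Scale a 3x3 intrinsics matrix when resizing the image
    by factors (sx, sy) in width and height."""
    K = K.copy()
    K[0] *= sx   # f_x and c_x scale with the width
    K[1] *= sy   # f_y and c_y scale with the height
    return K

# Halving a 512x1024 image in both dimensions (dummy intrinsics):
K = np.array([[720.0, 0.0, 512.0],
              [0.0, 720.0, 271.0],
              [0.0, 0.0, 1.0]])
K_half = downsample_intrinsics(K, 0.5, 0.5)
# f_x: 720 -> 360, c_x: 512 -> 256, c_y: 271 -> 135.5
```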


cszer commented Jun 22, 2020

Thanks, I simply rewrote your code and the issue is solved.
[Two screenshots attached: 2020-06-22 17-07-54 and 2020-06-22 17-07-51]


cszer commented Jun 22, 2020

I think it was an OrderedDict issue.
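For context, this is likely why the ModuleList-based rewrite above works: submodules stored in a plain Python container (a list or an OrderedDict) are not registered with the parent nn.Module, so nn.DataParallel never replicates them to the other GPUs and their weights stay on device 0. A minimal demonstration:

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = [nn.Conv2d(3, 8, 3)]  # NOT registered as a submodule

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv2d(3, 8, 3)])  # registered

# Parameters hidden in a plain list are invisible to .parameters(), .to(),
# .cuda(), and to nn.DataParallel's per-GPU replication.
print(len(list(PlainList().parameters())))   # 0
print(len(list(Registered().parameters())))  # 2 (weight + bias)
```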

@cszer cszer closed this as completed Jun 22, 2020
@JiawangBian JiawangBian added the good first issue Good for newcomers label Jun 26, 2020
@JiawangBian JiawangBian pinned this issue Jun 26, 2020
zhengmiao1996 commented:

> Hello, thanks for this awesome project. I have a strange issue. I prepared my own dataset with images of size 542x1024, and when training starts I always get:
> ```
> N/A% (0 of 200) | | Elapsed Time: 0:00:00 ETA: --:--:--
> N/A% (0 of 946) | | Elapsed Time: 0:00:00 ETA: --:--:--
> [torch.Size([2, 256, 34, 64]), torch.Size([2, 256, 34, 64])]
> [torch.Size([2, 128, 68, 128]), torch.Size([2, 128, 68, 128])]
> [torch.Size([2, 64, 136, 256]), torch.Size([2, 64, 136, 256])]
> [torch.Size([2, 32, 272, 512]), torch.Size([2, 64, 271, 512])]
> Dimension error when torch.cat(x,1)
> ```
> Maybe it's a stride/padding issue. Please help me.

I have trouble preparing my own data. Can you show me your code for preparing your own data? Thanks!

JiawangBian (Owner) commented:

The image resolution should be divisible by 32, so you can change the resolution to 512x1024.
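A quick, hypothetical helper for finding the nearest compatible resolution (rounding down; not part of the repo):

```python
def nearest_valid(h, w, multiple=32):
    """Round height and width down to the nearest multiple,
    e.g. as a target size before resizing or cropping."""
    return (h // multiple) * multiple, (w // multiple) * multiple

print(nearest_valid(542, 1024))  # (512, 1024)
```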
