0%

3 Bugs - Pytorch, Flownet2-Pytorch and vid2vid

Haven't successfully tested three packages (all related to PyTorch), PyTorch, FlowNet2-Pytorch and vid2vid. Looking forward to assistance...

PyTorch

The Bug

After having successfully installed PyTorch current version 1.1, I still failed to import torch. Please refer to the following ERROR.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
➜  ~ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/.local/lib/python3.6/site-packages/torch/__init__.py", line 84, in <module>
from torch._C import *
ImportError: ~/.local/lib/python3.6/site-packages/torch/lib/libcaffe2.so: undefined symbol: _ZTIN3c1010TensorImplE
>>> import caffe2
>>> caffe2.__version__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'caffe2' has no attribute '__version__'
>>> caffe2.__file__
'~/.local/lib/python3.6/site-packages/caffe2/__init__.py'

In order to have PyTorch successully imported, I've got to remove the manually installed PyTorch v 1.1, but had it installed by pip

1
pip3 install https://download.pytorch.org/whl/cu100/torch-1.0.0-cp36-cp36m-linux_x86_64.whl

This is PyTorch v1.0, which seems NOT come with caffe2, and of course should NOT be compatible with the installed caffe2 built with PyTorch v1.1. Can anybody help to solve this issue? Please also refer to Github issue.

Solution

Remove anything/everything related to your previously installed PyTorch. In my case, file /usr/local/lib/libc10.s0 is to be removed. In order to analyze which files are possibly related to the concerned package, we can use the command ldd.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
➜  lib ldd libcaffe2.so
linux-vdso.so.1 (0x00007ffcf3dc9000)
libc10.so => /usr/local/lib/libc10.so (0x00007fca41b88000)
libmpi.so.12 => /opt/intel/mpi/intel64/lib/libmpi.so.12 (0x00007f8acdad9000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f8acd8d1000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8acd6b2000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f8acd4ae000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8acd296000)
libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f8acc765000)
libmkl_gnu_thread.so => /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so (0x00007f8acaf2c000)
libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007f8ac6df3000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8ac6a55000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8ac66cc000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f8ac649d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8ac60ac000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8ad250d000)
libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f8ac5ea1000)
libfabric.so.1 => /usr/lib/x86_64-linux-gnu/libfabric.so.1 (0x00007f8ac5bf4000)
librdmacm.so.1 => /usr/lib/x86_64-linux-gnu/librdmacm.so.1 (0x00007f8ac59de000)
libibverbs.so.1 => /usr/lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007f8ac57c8000)
libpsm_infinipath.so.1 => /usr/lib/x86_64-linux-gnu/libpsm_infinipath.so.1 (0x00007f8ac556f000)
libnl-route-3.so.200 => /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f8ac52fa000)
libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f8ac50da000)
libinfinipath.so.4 => /usr/lib/x86_64-linux-gnu/libinfinipath.so.4 (0x00007f8ac4ecb000)
libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x00007f8ac4cc4000)

FlowNet2-Pytorch

Installation

It's not hard to have FlowNet2-Pytorch installed by one line of commands:

1
➜  flownet2-pytorch git:(master) ✗ ./install.sh

After installation, there will be 3 packages installed under folder ~/.local/lib/python3.6/site-packages:

  • correlation-cuda
  • resample2d-cuda
  • channelnorm-cuda
1
2
3
4
5
6
7
8
9
10
➜  site-packages ls -lsd correlation*
4 drwxrwxr-x 4 jiapei jiapei 4096 Jan 7 00:07 correlation_cuda-0.0.0-py3.6-linux-x86_64.egg
➜ site-packages ls -lsd channelnorm*
4 drwxrwxr-x 4 jiapei jiapei 4096 Jan 7 00:07 channelnorm_cuda-0.0.0-py3.6-linux-x86_64.egg
➜ site-packages ls -lsd resample2d*
4 drwxrwxr-x 4 jiapei jiapei 4096 Jan 7 00:07 resample2d_cuda-0.0.0-py3.6-linux-x86_64.egg
➜ site-packages ls -lsd flownet2*
zsh: no matches found: flownet2*
➜ site-packages pwd
~/.local/lib/python3.6/site-packages

That is to say: you should NEVER import flownet2, nor correlation, nor channelnorm, nor resampled2d, but

  • correlation_cuda
  • resample2d_cuda
  • channelnorm_cuda

Current Bug

Here comes the ERROR:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
➜  ~ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import correlation_cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: ~/.local/lib/python3.6/site-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE
>>> import channelnorm_cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: ~/.local/lib/python3.6/site-packages/channelnorm_cuda-0.0.0-py3.6-linux-x86_64.egg/channelnorm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE
>>> import resample2d_cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: ~/.local/lib/python3.6/site-packages/resample2d_cuda-0.0.0-py3.6-linux-x86_64.egg/resample2d_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE
>>>

I've already posted an issue on github. Had anybody solved this problem?

Solution

import torch FIRST.

1
2
3
4
5
6
7
8
➜  ~ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import correlation_cuda
>>> import resample2d_cuda
>>> import channelnorm_cuda

Still Buggy

FlowNet2-Pytorch Inference

Where can we find and download checkpoints? Please refer to Inference on FlowNet2-Pytorch:

1
python main.py --inference --model FlowNet2 --save_flow --inference_dataset MpiSintelClean --inference_dataset_root /path/to/mpi-sintel/clean/dataset --resume /path/to/checkpoints

FlowNet2-Pytorch Training

Training on FlowNet2-Pytorch gives the following ERROR:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
➜  flownet2-pytorch git:(master) ✗ python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4 --training_dataset MpiSintelFinal --training_dataset_root /path/to/mpi-sintel/training/final/mountain_1  --validation_dataset MpiSintelClean --validation_dataset_root /path/tompi-sintel/training/clean/mountain_1
Parsing Arguments
[0.013s] batch_size: 8
[0.013s] crop_size: [256, 256]
[0.013s] fp16: False
[0.014s] fp16_scale: 1024.0
[0.014s] gradient_clip: None
[0.014s] inference: False
[0.014s] inference_batch_size: 1
[0.014s] inference_dataset: MpiSintelClean
[0.014s] inference_dataset_replicates: 1
[0.014s] inference_dataset_root: ./MPI-Sintel/flow/training
[0.014s] inference_n_batches: -1
[0.014s] inference_size: [-1, -1]
[0.014s] log_frequency: 1
[0.014s] loss: L1Loss
[0.014s] model: FlowNet2
[0.014s] model_batchNorm: False
[0.014s] model_div_flow: 20.0
[0.014s] name: run
[0.014s] no_cuda: False
[0.014s] number_gpus: 1
[0.014s] number_workers: 8
[0.014s] optimizer: Adam
[0.014s] optimizer_amsgrad: False
[0.014s] optimizer_betas: (0.9, 0.999)
[0.014s] optimizer_eps: 1e-08
[0.014s] optimizer_lr: 0.0001
[0.014s] optimizer_weight_decay: 0
[0.014s] render_validation: False
[0.014s] resume:
[0.014s] rgb_max: 255.0
[0.014s] save: ./work
[0.014s] save_flow: False
[0.014s] schedule_lr_fraction: 10
[0.014s] schedule_lr_frequency: 0
[0.014s] seed: 1
[0.014s] skip_training: False
[0.014s] skip_validation: False
[0.014s] start_epoch: 1
[0.014s] total_epochs: 10000
[0.014s] train_n_batches: -1
[0.014s] training_dataset: MpiSintelFinal
[0.014s] training_dataset_replicates: 1
[0.014s] training_dataset_root: ....../mpi-sintel/training/final/mountain_1
[0.014s] validation_dataset: MpiSintelClean
[0.014s] validation_dataset_replicates: 1
[0.014s] validation_dataset_root: ....../mpi-sintel/training/clean/mountain_1
[0.014s] validation_frequency: 5
[0.014s] validation_n_batches: -1
[0.016s] Operation finished

Source Code
Current Git Hash: b'ac1602a72f0454f65872126b70665a596fae8009'

Initializing Datasets
[0.003s] Operation failed

Traceback (most recent call last):
File "main.py", line 139, in <module>
train_dataset = args.training_dataset_class(args, True, **tools.kwargs_from_args(args, 'training_dataset'))
File "....../flownet2-pytorch/datasets.py", line 112, in __init__
super(MpiSintelFinal, self).__init__(args, is_cropped = is_cropped, root = root, dstype = 'final', replicates = replicates)
File "....../flownet2-pytorch/datasets.py", line 66, in __init__
self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape
IndexError: list index out of range

Therefore, I posted a further issue on Github.

vid2vid

Preparation

There are 4 scripts under folder scripts to run before testing vid2vid.

1
2
3
4
5
➜  vid2vid git:(master) ✗ ll scripts/*.py
-rwxrwxrwx 1 jiapei jiapei 322 Jan 5 22:04 scripts/download_datasets.py
-rwxrwxrwx 1 jiapei jiapei 579 Jan 5 22:04 scripts/download_flownet2.py
-rwxrwxrwx 1 jiapei jiapei 1.3K Jan 5 22:04 scripts/download_gdrive.py
-rwxrwxrwx 1 jiapei jiapei 257 Jan 5 22:04 scripts/download_models_flownet2.py

However, ONLY 3 of the scripts can be successfully run, but scripts/download_flownet2.py failed to run, as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
➜  vid2vid git:(master) python scripts/download_flownet2.py 
Compiling correlation kernels by nvcc...
rm: cannot remove '../_ext': No such file or directory
Traceback (most recent call last):
File "build.py", line 3, in <module>
import torch.utils.ffi
File "~/.local/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 1, in <module>
raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.
Compiling resample2d kernels by nvcc...
rm: cannot remove 'Resample2d_kernel.o': No such file or directory
rm: cannot remove '../_ext': No such file or directory
In file included from Resample2d_kernel.cu:1:0:
~/.local/lib/python3.6/site-packages/torch/lib/include/THC/THC.h:4:10: fatal error: THC/THCGeneral.h: No such file or directory
#include <THC/THCGeneral.h>
^~~~~~~~~~~~~~~~~~
compilation terminated.
Traceback (most recent call last):
File "build.py", line 3, in <module>
import torch.utils.ffi
File "~/.local/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 1, in <module>
raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.
Compiling channelnorm kernels by nvcc...
rm: cannot remove 'ChannelNorm_kernel.o': No such file or directory
rm: cannot remove '../_ext': No such file or directory
In file included from ChannelNorm_kernel.cu:1:0:
~/.local/lib/python3.6/site-packages/torch/lib/include/THC/THC.h:4:10: fatal error: THC/THCGeneral.h: No such file or directory
#include <THC/THCGeneral.h>
^~~~~~~~~~~~~~~~~~
compilation terminated.
Traceback (most recent call last):
File "build.py", line 3, in <module>
import torch.utils.ffi
File "~/.local/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 1, in <module>
raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.

Testing is ALSO Buggy

If we run Testing on vid2vid homepage, it gives the following ERROR:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
➜  vid2vid git:(master) python test.py --name label2city_2048 --label_nc 35 --loadSize 2048 --n_scales_spatial 3 --use_instance --fg --use_single_G
------------ Options -------------
add_face_disc: False
aspect_ratio: 1.0
basic_point_only: False
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/
dataset_mode: temporal
debug: False
densepose_only: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
gpu_ids: [0]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 2048
load_features: False
load_pretrain:
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 3
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_2048
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_canny_edge: False
no_dist_map: False
no_first_img: False
no_flip: False
no_flow: False
norm: batch
ntest: inf
openpose_only: False
output_nc: 3
phase: test
random_drop_prob: 0.05
random_scale_points: False
remove_face_labels: False
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
start_frame: 0
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
Traceback (most recent call last):
File "test.py", line 25, in <module>
model = create_model(opt)
File "....../vid2vid/models/models.py", line 7, in create_model
from .vid2vid_model_G import Vid2VidModelG
File "....../vid2vid/models/vid2vid_model_G.py", line 13, in <module>
from . import networks
File "....../vid2vid/models/networks.py", line 12, in <module>
from .flownet2_pytorch.networks.resample2d_package.resample2d import Resample2d
File "....../vid2vid/models/flownet2_pytorch/networks/resample2d_package/resample2d.py", line 2, in <module>
from .functions.resample2d import Resample2dFunction
File "....../vid2vid/models/flownet2_pytorch/networks/resample2d_package/functions/resample2d.py", line 3, in <module>
from .._ext import resample2d
ModuleNotFoundError: No module named 'models.flownet2_pytorch.networks.resample2d_package._ext'

This is still annoying me. A detailed github issue has been posted today. Please give me a hand. Thanks...