What's the meaning of "self.base_model=base_model"?

Steve_Wong · October 3, 2020, 1:45am

I have the classes “aggregator” and “encoder”, whose code is as follows. It seems that the statement “self.base_model = base_model” has no effect, since self.base_model is not used in “encoder”. But when I remove the statement, an error is raised. Does “base_model” have a special function?

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init


class Encoder(nn.Module):
    def __init__(self, features, feature_dim, embed_dim, adj_lists, aggregator, device,
                 num_sample=10, base_model=None, gcn=False):
        super(Encoder, self).__init__()

        self.features = features
        self.feat_dim = feature_dim
        self.embed_dim = embed_dim
        self.adj_lists = adj_lists
        self.aggregator = aggregator
        self.num_sample = num_sample
        if base_model is not None:
            self.base_model = base_model
        self.gcn = gcn
        self.device = device

        self.weight = nn.Parameter(torch.FloatTensor(
            self.feat_dim if self.gcn else 2 * self.feat_dim, embed_dim))
        init.xavier_uniform_(self.weight)

    def forward(self, nodes):
        neigh_feats = self.aggregator.forward(
            nodes, [self.adj_lists[int(node)] for node in nodes], self.num_sample)

        if not self.gcn:
            self_feats = self.features(torch.LongTensor(nodes).to(self.device))
            combined = torch.cat([self_feats, neigh_feats], dim=1)  # (batch, 2 * feat_dim)
        else:
            combined = neigh_feats

        # (batch, feat_dim) x (feat_dim, embed_dim) -> (batch, embed_dim)
        combined = F.relu(combined.mm(self.weight))
        return combined

In main(), I call “encoder” like this:

agg1 = MeanAggregator(features, device)
enc1 = Encoder(features, num_feat, num_embed, adj_lists, agg1, device, gcn=True)
agg2 = MeanAggregator(lambda nodes: enc1(nodes), device)
enc2 = Encoder(lambda nodes: enc1(nodes), num_embed, num_embed, adj_lists, agg2, device, base_model=enc1, gcn=True)

If I remove the “base_model” statements in main() and “encoder”, a runtime error occurs:

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 223, in <module>
    run_cora()
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 119, in run_cora
    loss.backward()
  File "C:\Users\Administrator\Anaconda3\envs\pytorch\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Administrator\Anaconda3\envs\pytorch\lib\site-packages\torch\autograd\__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at ..\aten\src\ATen\cuda\CublasHandlePool.cpp:8 (most recent call first):
00007FFA8DED75A200007FFA8DED7540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFA2CA6AEA800007FFA2CA69E70 torch_cuda.dll!at::cuda::getCurrentCUDASparseHandle [<unknown file> @ <unknown line number>]
00007FFA2CA6A7D800007FFA2CA69E70 torch_cuda.dll!at::cuda::getCurrentCUDASparseHandle [<unknown file> @ <unknown line number>]
00007FFA2CA6B66700007FFA2CA6B1A0 torch_cuda.dll!at::cuda::getCurrentCUDABlasHandle [<unknown file> @ <unknown line number>]
00007FFA2CA6B24700007FFA2CA6B1A0 torch_cuda.dll!at::cuda::getCurrentCUDABlasHandle [<unknown file> @ <unknown line number>]
00007FFA2CA6320700007FFA2CA624B0 torch_cuda.dll!at::native::sparse_mask_cuda [<unknown file> @ <unknown line number>]
00007FFA2BF6CA9700007FFA2BF6B990 torch_cuda.dll!at::native::lerp_cuda_tensor_out [<unknown file> @ <unknown line number>]
00007FFA2BF6E4D200007FFA2BF6DF60 torch_cuda.dll!at::native::addmm_out_cuda [<unknown file> @ <unknown line number>]
00007FFA2BF6F64300007FFA2BF6F560 torch_cuda.dll!at::native::mm_cuda [<unknown file> @ <unknown line number>]
00007FFA2CAD1B0F00007FFA2CA6E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFA2CAC1B2200007FFA2CA6E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFA24B8D94900007FFA24B88FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFA24BC057700007FFA24BC0520 torch_cpu.dll!at::mm [<unknown file> @ <unknown line number>]
00007FFA25F1EC7900007FFA25E2E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFA246D715700007FFA246D6290 torch_cpu.dll!at::indexing::TensorIndex::boolean [<unknown file> @ <unknown line number>]
00007FFA24B8D94900007FFA24B88FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFA24CA210700007FFA24CA20B0 torch_cpu.dll!at::Tensor::mm [<unknown file> @ <unknown line number>]
00007FFA25DBB9BD00007FFA25DBA760 torch_cpu.dll!torch::autograd::profiler::Event::kind [<unknown file> @ <unknown line number>]
00007FFA25D91CF000007FFA25D91B30 torch_cpu.dll!torch::autograd::generated::MmBackward::apply [<unknown file> @ <unknown line number>]
00007FFA25D67E9100007FFA25D67B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FFA262CF9BA00007FFA262CF300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FFA262D03AD00007FFA262CFFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FFA262D4FE200007FFA262D4CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FFA262D4C4100007FFA262D4BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FF9EE2708F700007FF9EE249F80 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FFA262CBF1400007FFA262CB780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FFA96F7FA9500007FFA96F7F9F0 ucrtbase.dll!iswascii [<unknown file> @ <unknown line number>]
00007FFA996D37E400007FFA996D37D0 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFA99DCCB6100007FFA99DCCB40 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]

Tell me if there is any other information I can provide that would be helpful. Thanks for any suggestions.
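
As background for the question above, here is a minimal sketch of how attribute assignment behaves in nn.Module in general. The name base_model is not special, but assigning any nn.Module instance to an attribute registers it as a child module, so its parameters show up in parameters() and follow calls such as .to(device) on the parent. A module reached only through a lambda is not registered. The classes Inner and Outer are hypothetical stand-ins, not the code from this thread, and whether this behavior is related to the error reported here is not established.

```python
import torch
import torch.nn as nn


class Inner(nn.Module):
    """Hypothetical stand-in for a first-layer encoder."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(3, 2))

    def forward(self, x):
        return x.mm(self.weight)


class Outer(nn.Module):
    """Hypothetical stand-in for a second-layer encoder."""

    def __init__(self, features, base_model=None):
        super().__init__()
        self.features = features  # plain callable: stored, but not registered
        if base_model is not None:
            self.base_model = base_model  # nn.Module: registered as a child
        self.weight = nn.Parameter(torch.zeros(2, 2))


inner = Inner()
with_base = Outer(lambda x: inner(x), base_model=inner)
without_base = Outer(lambda x: inner(x))

# Parameter names visible from each outer module.
names_with = [name for name, _ in with_base.named_parameters()]
names_without = [name for name, _ in without_base.named_parameters()]
print(names_with)
print(names_without)
```

In the second case the inner weight is invisible to the outer module, so an optimizer built from without_base.parameters() or a call to without_base.to(device) would not touch it.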

ptrblck · October 4, 2020, 8:33am

No, base_model is not an internal attribute and since it’s unused you could remove it.
If I understand your issue correctly, the error is only raised if you remove self.base_model?

Could you rerun the code with CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace here?
Also, could you check if you are running out of memory on the GPU in use?
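
The VAR=value prefix syntax is specific to Unix-like shells, so on Windows the variable has to be set another way. A minimal sketch, assuming the script is launched from a small Python wrapper: the helper names blocking_env and run_script_blocking are hypothetical, and the script arguments are placeholders. Setting os.environ["CUDA_LAUNCH_BLOCKING"] = "1" at the very top of the script, before torch is imported, is the in-process alternative.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING set to "1"."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_script_blocking(script_args):
    """Run `python <script_args>` in a child process with synchronous CUDA launches."""
    cmd = [sys.executable] + list(script_args)
    return subprocess.run(cmd, env=blocking_env())
```

For example, run_script_blocking(["script.py", "args"]) would start the script with the variable already present in its environment, so the stack trace points at the call that actually failed.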

Steve_Wong · October 5, 2020, 11:52am

Thanks for your response. I’ve added os.environ['CUDA_LAUNCH_BLOCKING'] = "1" before running main(), and here is the stack trace.

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 227, in <module>
    run_cora()
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 120, in run_cora
    loss = graphsage.loss(batch_nodes, torch.LongTensor(labels[np.array(batch_nodes)]).to(device))
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 40, in loss
    scores = self.forward(nodes)
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 34, in forward
    embeds = self.enc(nodes)  # (num_nodes, embed_dim)
  File "C:\Users\Administrator\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\Desktop\mycode\deep-learning\graphsage-simple-master\graphsage\encoders.py", line 38, in forward
    neigh_feats = self.aggregator.forward(
  File "C:\Users\Administrator\Desktop\mycode\deep-learning\graphsage-simple-master\graphsage\aggregators.py", line 65, in forward
    embed_matrix = self.features(torch.LongTensor(unique_nodes_list).to(self.device))
  File "C:/Users/Administrator/Desktop/mycode/deep-learning/graphsage-simple-master/graphsage/model.py", line 98, in <lambda>
    agg2 = MeanAggregator(lambda nodes: enc1(nodes), device)
  File "C:\Users\Administrator\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\Desktop\mycode\deep-learning\graphsage-simple-master\graphsage\encoders.py", line 49, in forward
    combined = F.relu(combined.mm(self.weight))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

When I remove the “base_model” statement, it runs without error. So I don’t think it’s running out of GPU memory.

Steve_Wong · October 5, 2020, 12:41pm

Here is the source code.
When I remove base_model=enc1 in model.py, it raises the error above. Thanks for any suggestions.

ptrblck · October 5, 2020, 10:54pm

Thanks for the code.
I cannot reproduce the error using this command and your repository:

python -m graphsage.model
0 1.9449124336242676
1 1.9262391328811646
2 1.9075403213500977
3 1.869475245475769
4 1.8472117185592651
5 1.8181393146514893
6 1.8005340099334717
7 1.746289849281311
8 1.6870187520980835
9 1.6369025707244873
10 1.5767720937728882
11 1.4736994504928589
12 1.4498679637908936
13 1.4377225637435913
14 1.4284756183624268
15 1.3335570096969604
16 1.2711284160614014
17 1.1341012716293335
18 1.1241765022277832
19 1.0289225578308105
20 0.9355937242507935
21 0.9255473613739014
22 0.8900326490402222
23 0.806100606918335
24 0.8072971701622009
25 0.8284955620765686
26 0.6966670751571655
27 0.6289700269699097
28 0.6647626757621765
29 0.8228136897087097
30 0.747298538684845
31 0.714832067489624
32 0.5909189581871033
33 0.5408354997634888
34 0.5010537505149841
35 0.5862414836883545
36 0.5422082543373108
37 0.4259440004825592
38 0.44785746932029724
39 0.48034438490867615
40 0.43961235880851746
41 0.4774445593357086
42 0.41150081157684326
43 0.36814427375793457
44 0.35858404636383057
45 0.3964439332485199
46 0.3785482347011566
47 0.45746365189552307
48 0.6183907985687256
49 0.7749939560890198
50 0.9017699956893921
51 0.4618622958660126
52 0.37373557686805725
53 0.34176552295684814
54 0.3155927062034607
55 0.35520535707473755
56 0.3165806531906128
57 0.31519120931625366
58 0.32234856486320496
59 0.2658422887325287
60 0.2659823000431061
61 0.3343285620212555
62 0.25045305490493774
63 0.2747704088687897
64 0.22687111794948578
65 0.3643914461135864
66 0.263433039188385
67 0.26506614685058594
68 0.25543108582496643
69 0.2547190189361572
70 0.2573583424091339
71 0.21613654494285583
72 0.273402601480484
73 0.1756642758846283
74 0.21497443318367004
75 0.31738513708114624
76 0.2398620843887329
77 0.22140288352966309
78 0.22747889161109924
79 0.26642531156539917
80 0.2145370990037918
81 0.21441781520843506
82 0.16217008233070374
83 0.19470077753067017
84 0.1829792857170105
85 0.24118557572364807
86 0.21673281490802765
87 0.1997360736131668
88 0.20058968663215637
89 0.1757909208536148
90 0.21877747774124146
91 0.20824533700942993
92 0.17878247797489166
93 0.21536171436309814
94 0.22572994232177734
95 0.17374852299690247
96 0.2164204865694046
97 0.20531226694583893
98 0.24563446640968323
99 0.19727684557437897
Validation F1: 0.858
Test F1: 0.857
Average batch time: 0.036545968055725096

Which PyTorch and CUDA version and which GPU are you using?
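
A minimal sketch for collecting these details from within Python. The function name describe_setup is hypothetical. Note that torch.version.cuda reports the CUDA version the PyTorch binary was built with, not the version of the installed NVIDIA driver.

```python
import torch


def describe_setup():
    """Collect the PyTorch version, its CUDA build version, and the GPU name."""
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu": None,
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


print(describe_setup())
```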

Steve_Wong · October 6, 2020, 1:22am

PyTorch 1.6.0 with CUDA 10.2, and my GPU is a 1050 Ti.
It occurs to me that I’ve installed CUDA drivers with versions 11.0 and 9.0. Are they incompatible with PyTorch 1.6?

ptrblck · October 6, 2020, 5:49am

NVIDIA drivers have version numbers such as 450, and this table shows which drivers are compatible with each CUDA toolkit version.
However, if your driver is too old, your code should not execute at all.

EDIT: given that you are using a 1050 Ti with 4GB, I guess you might be running out of memory by creating the additional unused module, and the CUBLAS error is raised wrongly in place of an out-of-memory error (as has also been reported in the past).
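
One way to test the out-of-memory guess is to print memory usage just before the failing call. A minimal sketch, assuming a single GPU at index 0: the function name gpu_memory_report is hypothetical, and the counters only cover memory managed by PyTorch's caching allocator in the current process, not memory held by other processes or by the CUDA context itself.

```python
import torch


def gpu_memory_report(device_index=0):
    """Return total, allocated, and reserved GPU memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "total_mib": total / mib,
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
    }


print(gpu_memory_report())
```

If the reserved figure is close to the total shortly before the crash, memory pressure becomes a more plausible explanation for the CUBLAS error.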

Steve_Wong · October 7, 2020, 11:44am

It sounds reasonable. Thank you so much.