
[ONNX] Support large attribute and subgraph for large model #38793

Closed
wants to merge 6 commits into from


BowenBao
Collaborator

Previously, large tensor data in attributes and subgraphs was not stored externally, so ONNX could not serialize a model when its total size reached 2GB or more. This PR enables external storage for that data so such models can be exported.
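As a rough illustration of the idea in the description, the sketch below walks a toy graph and picks out large tensors found in node attributes, recursing into subgraphs. It is not the exporter's actual code: the `Tensor`, `Node`, and `Graph` classes, the `collect_external_tensors` helper, and the 1024-byte threshold are all hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Hypothetical cutoff for illustration only; the PR text does not state one.
EXTERNAL_THRESHOLD_BYTES = 1024


@dataclass
class Tensor:
    name: str
    nbytes: int


@dataclass
class Graph:
    nodes: List["Node"] = field(default_factory=list)


@dataclass
class Node:
    op: str
    # An attribute value is either a tensor or a nested subgraph (e.g. a loop body).
    attributes: Dict[str, Union[Tensor, Graph]] = field(default_factory=dict)


def collect_external_tensors(graph, threshold=EXTERNAL_THRESHOLD_BYTES):
    """Return names of attribute tensors at or above the threshold, including those in subgraphs."""
    external = []
    for node in graph.nodes:
        for value in node.attributes.values():
            if isinstance(value, Tensor):
                if value.nbytes >= threshold:
                    external.append(value.name)
            elif isinstance(value, Graph):
                # Subgraphs can hold large tensors too, so recurse into them.
                external.extend(collect_external_tensors(value, threshold))
    return external


inner = Graph([Node("Constant", {"value": Tensor("loop_weight", 4096)})])
outer = Graph([
    Node("Constant", {"value": Tensor("small_bias", 16)}),
    Node("Constant", {"value": Tensor("big_table", 2048)}),
    Node("Loop", {"body": inner}),
])
print(collect_external_tensors(outer))  # ['big_table', 'loop_weight']
```

The point of the recursion is that a tensor nested inside a subgraph attribute counts toward the serialized size just like a top-level one, so it needs the same treatment.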

@BowenBao BowenBao requested a review from apaszke as a code owner May 20, 2020 17:38
@facebook-github-bot facebook-github-bot added the oncall: jit label May 20, 2020
@dr-ci

dr-ci bot commented May 20, 2020

💊 CI failures summary and remediations

As of commit 4b0a1a1 (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_build (1/3)

Step: "Build"

C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found

MM -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\queue\blobs_queue_db.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\queue\blobs_queue_db.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

X2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/quantization/server/fbgemm_fp16_pack_op.cc.obj  
X2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C2039: 'runtime_error': is not a member of 'std'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.26.28801\include\string(24): note: see declaration of 'std'
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

 -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_pack_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

ION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fully_connected_dnnlowp_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fully_connected_dnnlowp_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_windows_vs2019_py36_cpu_build (2/3)

Step: "Build"

C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found

X2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fully_connected_fake_lowp_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fully_connected_fake_lowp_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

FINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/quantization/server/fbgemm_fp16_pack_op.cc.obj  
FINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C2039: 'runtime_error': is not a member of 'std'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.26.28801\include\string(24): note: see declaration of 'std'
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

experimental -DNDEBUG -DUSE_FBGEMM -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\sgd\wngrad_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\sgd\wngrad_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

E_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fully_connected_dnnlowp_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fully_connected_dnnlowp_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (3/3)

Step: "Run tests"

May 27 18:35:40 ERROR: test_accurracy (__main__.TrainMnist)
May 27 18:35:37 Core 0 got rendezvous! 
May 27 18:35:37 + python3 /var/lib/jenkins/workspace/xla/test/test_mp_save.py 
May 27 18:35:38 2020-05-27 18:35:38.281900: W tensorflow/compiler/jit/xla_device.cc:398] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(experimental_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation. 
May 27 18:35:38 + python3 /var/lib/jenkins/workspace/xla/test/test_mp_mesh_reduce.py 
May 27 18:35:39 Running MNIST Test 
May 27 18:35:39 + echo 'Running MNIST Test' 
May 27 18:35:39 + python test/test_train_mnist.py --tidy 
May 27 18:35:40  0it [00:00, ?it/s]   0%|          | 0/9912422 [00:00<?, ?it/s]E   0%|          | 8192/9912422 [00:00<07:08, 23098.72it/s] 
May 27 18:35:40  
May 27 18:35:40 ====================================================================== 
May 27 18:35:40 ERROR: test_accurracy (__main__.TrainMnist) 
May 27 18:35:40 ---------------------------------------------------------------------- 
May 27 18:35:40 Traceback (most recent call last): 
May 27 18:35:40   File "test/test_train_mnist.py", line 186, in test_accurracy 
May 27 18:35:40     self.assertGreaterEqual(train_mnist(), FLAGS.target_accuracy) 
May 27 18:35:40   File "test/test_train_mnist.py", line 74, in train_mnist 
May 27 18:35:40     transforms.Normalize((0.1307,), (0.3081,))])) 
May 27 18:35:40   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 71, in __init__ 
May 27 18:35:40     self.download() 
May 27 18:35:40   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 138, in download 
May 27 18:35:40     download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5) 

This comment was automatically generated by Dr. CI.


@ailzhang ailzhang requested a review from lara-hdr May 21, 2020 16:38
@ailzhang ailzhang added the triaged label May 21, 2020
Contributor

@neginraoof neginraoof left a comment


LGTM, thanks!
About tests: since these tests won't fail even without your changes, is there a way to check external data export or files, maybe in test_utility?
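One way such a check might look is sketched below. It assumes the exporter writes externally stored tensors as separate files in the same directory as the model file; the helper name `external_data_files` and the file names `model.onnx` and `weight_0` are hypothetical, and the real export call is replaced here by a stand-in that just writes placeholder files.

```python
import os
import tempfile


def external_data_files(model_path):
    """List files that sit next to the model file, i.e. candidate external tensor files."""
    model_dir = os.path.dirname(os.path.abspath(model_path))
    model_name = os.path.basename(model_path)
    return sorted(f for f in os.listdir(model_dir) if f != model_name)


with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "model.onnx")
    # Stand-in for the real export call: write a model file plus one external tensor file.
    for name in ("model.onnx", "weight_0"):
        with open(os.path.join(tmpdir, name), "wb") as fh:
            fh.write(b"\0")
    found = external_data_files(model_path)
    # A test would fail here if no external files had been produced.
    assert found == ["weight_0"]
```

A test built this way would fail when the export silently keeps everything inline, which is the gap the comment above points out.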

@BowenBao
Collaborator Author

BowenBao commented Jun 2, 2020

@houseroad please take a look, thanks!

@BowenBao
Collaborator Author

@houseroad please take a look at this PR.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@houseroad has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Member

@houseroad houseroad left a comment


Looks good.

@facebook-github-bot
Contributor

@houseroad merged this pull request in eaa9107.

Labels
Merged, oncall: jit, open source, triaged

7 participants