Template instantiations are killing me

part 2

May 2, 2017

I have perhaps overused templates in Bertini2. It's true. I am in love with templates, and just keep using them. But compile time is killing me. It is taking forever to compile Bertini2, and I want it to be faster. 30-minute build-test times on my Jenkins servers are unacceptable.

So, here I go!

First attempt: -ftime-report

My first attempt to understand which templates are taking so long is to get timing information from GCC/Clang during compilation.

In this SO post, u/Schamp asks about a different but similar scenario, and /u/anon replies with the suggestion to use -ftime-report to get some information. I did not find it useful at all. Maybe I don't understand correctly how to read the output?

What I want to see is which particular template instantiations are taking so long. Is it stuff from Boost.Multiprecision? Eigen? My own crazy large templates? Let's try another tool.

Second attempt -- Templight

Another SO post has an answer from u/Mikael Persson, suggesting a tool called Templight, of which they are an author. So I figured I'd give it a try.

0. Compiling Templight

It's not a simple brew install templight sadly, but instead requires building it with LLVM/Clang. I've been readying myself mentally to use LLVM, so this seems like as good a time as any to just install it from source. Ok, so I use the install instructions for Clang, and those for Templight on their readme. I download the SVN for Clang and LLVM, versions 301787 and 301782 respectively. The patching step for Templight fails. Bummer.

I viewed the issues for Templight, and found that [#38](https://github.com/mikael-s-persson/templight/issues/38) gives a known-good pairing: Templight commit 0738fa1 with Clang revision 289544. With those, the patch applied cleanly and the project built smoothly.

Then, I run the tests for Clang.

Ok, 90 failed. Something about capital letters.

1. ./configure-ing b2's core to use Templight

I think things are going well, so I try using the compiled Clang and Templight with my configure script for b2's core:

No dice. Look in config.log

The first fail is expected. It's checking whether it's c++14 by default, and it isn't, so fail is ok. The next checks fail for a stupid reason. The type_traits file is missing. Hmm. Where is it?

It's not in /usr/include. It's not in /usr/include/c++/4.2.1/. Adding -stdlib=libc++ doesn't change anything.
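For context, here is a rough sketch of the kind of probe that was failing. It is a hypothetical stand-in, not the actual test program generated by my ./configure (that one lives in config.log), and the name ProbeCxx14 is made up. The point is that any check which includes type_traits dies at the first line if the compiler cannot find its standard library headers, regardless of whether the C++14 support itself is fine.

```cpp
// Hypothetical stand-in for a configure-style C++14 probe.
// The real conftest program produced by ./configure is not reproduced here.
#include <type_traits>  // the header the freshly built clang could not locate

// std::remove_const_t is a C++14 library alias, so this line checks the library side.
static_assert(std::is_same<std::remove_const_t<const int>, int>::value,
              "C++14 type_traits aliases are available");

// A generic lambda is a C++14 language feature, so this checks the compiler side.
inline int ProbeCxx14() {
    auto twice = [](auto x) { return x + x; };
    return twice(21);
}
```

If the include path is wrong, the compiler stops at the #include and never reaches the C++14 parts, which matches the "missing file" failure rather than a "wrong standard" failure.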

Ok, re-configure using Xcode's clang. Apparently the needed file is in /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1. Why doesn't the new clang look there? Somehow it needs to be told to. Adding -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk was insufficient for some reason. Bummer. But manually adding the location of type_traits to the include path did allow my ./configure to succeed. Here's the command:

./configure CXX="/Users/ofloveandhate/software/usr/local/bin/templight++ -std=c++14 -stdlib=libc++ -I/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1"

Configure is slow to run with this clang. I think that's because it is compiled for debug, not release. Whatever. I can re-compile in release mode if I need to. I do expect compilation to be slower, but hope it doesn't change too much which templates take forever to compile. After all, that's my goal -- to understand which templates are the most costly, so I can hopefully move them to a .cpp and manually instantiate them.
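To make "move them to a .cpp and manually instantiate them" concrete, here is a minimal sketch. Evaluator and SumOfSquares are hypothetical names, not Bertini2 classes, and the three pieces that would normally be separate files are squeezed into one listing so it compiles on its own. The header only declares the member function, the .cpp holds the body plus explicit instantiation definitions, and client code then links against those instead of instantiating the body itself.

```cpp
// Sketch only: Evaluator is a hypothetical class, not part of Bertini2.
#include <complex>
#include <vector>

// ---- would live in evaluator.hpp: declaration only, no member bodies ----
template <typename T>
class Evaluator {
public:
    T SumOfSquares(const std::vector<T>& values) const;
};

// ---- would live in evaluator.cpp: the body, compiled once ----
template <typename T>
T Evaluator<T>::SumOfSquares(const std::vector<T>& values) const {
    T total{};
    for (const T& v : values) total += v * v;
    return total;
}

// Explicit instantiation definitions: the only types clients may use.
template class Evaluator<double>;
template class Evaluator<std::complex<double>>;

// ---- would live in a client .cpp that includes only evaluator.hpp ----
double Demo() {
    Evaluator<double> e;
    return e.SumOfSquares({1.0, 2.0, 3.0});
}
```

The trade-off is that any type not listed in the .cpp gives a link error, so this only suits templates with a small, known set of argument types.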

Adding some templight flags to the compiler name causes ./configure to fail:

./configure CXX="/Users/ofloveandhate/software/usr/local/bin/templight++ -std=c++14 -stdlib=libc++ -I/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1 -Xtemplight -profiler -Xtemplight -ignore-system"

yields

C++ preprocessor "/lib/cpp" fails sanity check

Weak. The problem in config.log:

/lib/cpp conftest.cpp
./configure: line 1950: /lib/cpp: No such file or directory

Getting closer. Finally,

./configure CXX="/Users/ofloveandhate/software/usr/local/bin/templight++ -stdlib=libc++" CPPFLAGS="-I/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1" CXXFLAGS="-Xtemplight -profiler -Xtemplight -ignore-system -g -O0"

was successful, and led to a successful compile of the core library. Yay!

3. Compiling the core

Compilation was straightforward. I told it to make and it did. It took forever. Maybe I'll give it another try with Release mode clang...

4. Understanding the profiling data

So now I have a bunch of binary data files (I think) from templight++, with names like bla.o.trace.pbf. Great.

Templight documentation suggests I use Templight tools. Here goes.

Clone to computer, no problem. mkdir build && cd build && cmake ../ no problem. make ran just fine. Install and add to path, check. This is easy.

Challenge time. How to interpret the .pbf files?

The command is templight-convert, which will transform from a Google protobuf file into another format. Let's try graphviz, because I have had good experiences with it.

My first struggle was getting the options right. Here's a line which produced a file:

templight-convert --format graphviz -o bla.gv system.o.trace.pbf

The file extension was not obvious to me. Nor is what to do with this file now that I have it. I can call dot on it, but it doesn't seem to want to do anything productive with it... :(

Ok, let's move on to another format, since at the end of the templight-tools readme, there is a claim that KCacheGrind is going to be the best way to view this data. I have had GREAT experiences with it. On Mac OS, it's qcachegrind, and you can install it with brew. Cool.

templight-convert --format graphviz-cg -o bla.gc.out system.o.trace.pbf

Once again, I am guessing on this file extension. Let's try opening it in QCacheGrind. Nope. It doesn't want to open it with extension cg.out, nor with .callgrind.out or other variants. How do I open this file? Changing the format to callgrind lets it open. Here's the first screen I was presented with:

[Screenshot: first QCacheGrind view of the converted trace]

Groovy, now we're getting somewhere. Let's step up to a more challenging compile, the endgames testing code for the adaptive multiprecision tracker. Here are some stats and a figure:

[Figure: Bertini2 AMP tracker template instantiation statistics, 2017-05-02]

and the caller map for a cycle instantiated a whopping 12407 times during compilation of this single file:

[Figure: AMP tracker caller map for <cycle 1>]
[Figure: AMP tracker numerical data]

5. Lessons

Ok, what's the lesson here?

  • I have simply GOT to reduce the number of instantiations of Boost.Multiprecision's number template. There are probably a few common ones I can mark extern, and let the compiler do the rare ones. But 8264 instantiations of one form of it, that's gotta go.
  • I need to understand this <cycle 1> instantiation. 12k times is ludicrous.
  • My own templates are not the problem here, but the fact that Eigen is templated, and Boost.Multiprecision is templated, is causing the bottleneck.
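The "mark the common ones extern" idea from the first bullet could look roughly like the sketch below. Number and FixedBackend are hypothetical stand-ins for Boost.Multiprecision's number template and one of its backends, kept tiny so the listing is self-contained. I have not yet tried this on the real types. An extern template declaration in a shared header asks every translation unit to skip implicit instantiation of the members, and a single .cpp provides the one explicit instantiation definition.

```cpp
// Sketch only: Number and FixedBackend are hypothetical stand-ins,
// not the real boost::multiprecision types.

template <unsigned Digits>
struct FixedBackend {
    long double value = 0;
};

template <typename Backend>
class Number {
public:
    Number() = default;
    explicit Number(long double v) { backend_.value = v; }
    Number operator+(const Number& rhs) const {
        return Number(backend_.value + rhs.backend_.value);
    }
    long double Value() const { return backend_.value; }
private:
    Backend backend_;
};

// ---- would live in a shared header, after the template definition ----
// Explicit instantiation declaration: do not implicitly instantiate the
// members of this common specialization in every translation unit.
extern template class Number<FixedBackend<30>>;

// ---- would live in exactly one .cpp ----
// Explicit instantiation definition: the single place the members are emitted.
template class Number<FixedBackend<30>>;

// ---- ordinary client code; rarer specializations are still implicit ----
long double AddCommon(long double a, long double b) {
    Number<FixedBackend<30>> x(a), y(b);
    return (x + y).Value();
}
```

How much this saves is an open question until I measure it: the compiler still has to instantiate the class definition itself to know its layout, and it may still instantiate inline members for optimization, so the gain depends on how much of the cost sits in member function bodies.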

6. Conclusions

Wrapping up my first ever Clang-profiling session with Templight and its companion Templight-tools, I am absolutely sold. I have identified pretty clearly the culprit of my compile time woes as being a ludicrous number of instantiations of Boost.Multiprecision's number<...>, and have a tentative path forward with extern. I am hopeful and positive about being able to bring compile time down significantly.