R package profiling with Google

I want to speed up R packages written with the Rcpp and RcppArmadillo packages.

gprof is not an option because R is not compiled with -pg, and just re-compiling the shared libraries of the packages is insufficient.

valgrind can generate some output files, but it does not generate any results that are remotely informative about the shared libraries accessed by R.

sprof was suggested by some, but I can’t make it work on Debian Linux. I can get to the point where it produces an output file, but the dynamic linker will not analyze the data that was created. It has been that way since 2009, apparently I fought with it even back then when I posted about it in one of the R lists.

On the other hand, I do get relatively helpful information from the Google’s C++ profiler, which was suggested by Dirk Eddelbuettel.

Here are web pages about it

http://code.google.com/p/gperftools/wiki/GooglePerformanceTools

http://code.google.com/p/gperftools/?redir=1

I found these instructions the most helpful:

http://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html

I think these are older, still OK

http://goog-perftools.sourceforge.net/doc/cpu_profiler.html

The Debian Linux OS has packages for it now (2013-01-06) that are very up to date.

Observe:

$ dpkg -l | grep perftools
ii  google-perftools      2.0-2       all          command line utilities to analyze the performance of C++ programs
ii  libgoogle-perftools-dev  2.0-2 amd64        libraries for CPU and heap analysis
ii  libgoogle-perftools4  2.0-2    amd64     libraries for CPU and heap analysis, plus an efficient thread-caching malloc

I eventually found that I could make this profiler cooperate with the package shared library in all 3 of the ways that are suggested in the documentation. I’ll include the HOWTO notes at the end. I think, before you bother to try to do this, you need to know what you get from it.

When the program is run, the command line is like so:

$ CPUPROFILE="myprof.log" R -f my-program.R

When that’s finished, the current working directory has a file “myprof.log”. We can’t read that without the help of special tools.

There are 2 kinds of output that are the most helpful. First, there is a text output that has 5 columns, as describedon the Google webpage about the program. This output shows that the program IS calling the Armadillo matrix routines, it indicated that quite a bit of the program’s runtime is spend in arma::subview_elem2. The weird part about this output is that it does not show any of the functions called in the shared library that I’m trying to profile.

In the output, the right column is the function name. Most are unintelligible.

$ google-pprof --text /usr/lib/R/bin/exec/R myprof.log | more
Using local file /usr/lib/R/bin/exec/R.
Using local file myprof.log.
Removing _L_unlock_15 from all stack traces.
Total: 863 samples
      62   7.2%   7.2%       65   7.5% _int_malloc
      58   6.7%  13.9%       59   6.8% ATL_dJIK0x0x0NN0x0x0_aX_bX
      35   4.1%  18.0%       78   9.0% arma::subview_elem2::extract
      30   3.5%  21.4%      142  16.5% arma::subview_elem2::inplace_op
      26   3.0%  24.4%       26   3.0% ATL_dJIK0x0x48TN48x48x0_a1_bX
      24   2.8%  27.2%       24   2.8% ATL_dJIK0x0x0TN0x0x0_aX_bX
      23   2.7%  29.9%       23   2.7% ATL_dreftrmvUNN
      22   2.5%  32.4%       31   3.6% *__GI___libc_free
      20   2.3%  34.8%       79   9.2% *__GI___libc_malloc
      17   2.0%  36.7%       31   3.6% arma_check (inline)
      16   1.9%  38.6%       17   2.0% ATL_ddot_xp0yp0aXbX
      16   1.9%  40.4%       55   6.4% ATL_dgemv
      16   1.9%  42.3%       24   2.8% arma::Mat::at (inline)
      14   1.6%  43.9%       14   1.6% ATL_dgezero
      14   1.6%  45.5%       14   1.6% _int_free
      12   1.4%  46.9%       12   1.4% dgemv_
      11   1.3%  48.2%       11   1.3% ATL_dgemvN_a1_x1_b1_y1
      11   1.3%  49.5%       11   1.3% __memcpy_ssse3
      10   1.2%  50.6%       10   1.2% ATL_1dsplit
      10   1.2%  51.8%       15   1.7% helper (inline)
       9   1.0%  52.8%        9   1.0% ATL_dNCmmJIK
       8   0.9%  53.8%        8   0.9% ATL_dtrmvUN
       8   0.9%  54.7%       10   1.2% arma::eglue_core::apply
       7   0.8%  55.5%        7   0.8% 00007f8335cb77b7
       6   0.7%  56.2%        6   0.7% *__GI___strcoll_l
       6   0.7%  56.9%        6   0.7% ATL_dJIK0x0x24NN0x0x0_aX_bX
       6   0.7%  57.6%        9   1.0% ATL_ddot
       6   0.7%  58.3%        6   0.7% ATL_dptgemm_nt
       6   0.7%  59.0%        6   0.7% ATL_dscal_xp0yp0aXbX
       6   0.7%  59.7%        6   0.7% arma::arrayops::copy_big
       6   0.7%  60.4%        6   0.7% dtrmv_
       5   0.6%  61.0%        5   0.6% ATL_daxpy
       5   0.6%  61.5%      128  14.8% apply (inline)
       5   0.6%  62.1%        5   0.6% atl_f77wrap_dgemv_
       5   0.6%  62.7%       18   2.1% copy (inline)
       5   0.6%  63.3%        5   0.6% crc32_combine64
       5   0.6%  63.8%        5   0.6% lsame_
       4   0.5%  64.3%        4   0.5% 00007f8335cb892a
       4   0.5%  64.8%        4   0.5% ATL_dJIK0x0x24TN0x0x0_aX_bX
       4   0.5%  65.2%        4   0.5% ATL_dNCmmIJK
       4   0.5%  65.7%        4   0.5% ATL_daxpy_xp0yp0aXbX
       4   0.5%  66.2%        4   0.5% ATL_dgecopy
       4   0.5%  66.6%        4   0.5% ATL_dgemvN_a1_x1_bX_y1
       4   0.5%  67.1%        4   0.5% ATL_dptgemm
       4   0.5%  67.6%        4   0.5% ATL_dtrmv
       4   0.5%  68.0%        4   0.5% __parse_one_specmb
       4   0.5%  68.5%        9   1.0% arma::op_strans::apply_noalias
       4   0.5%  68.9%        4   0.5% malloc_consolidate
       4   0.5%  69.4%       52   6.0% operator new
       4   0.5%  69.9%        4   0.5% pthread_join
       3   0.3%  70.2%        3   0.3% 00007f8335cb64fd
       3   0.3%  70.6%        3   0.3% 00007f8335cb77bb
       3   0.3%  70.9%        3   0.3% 00007f8335cbab87
       3   0.3%  71.3%        3   0.3% ATL_dcopy_xp0yp0aXbX
       3   0.3%  71.6%        7   0.8% ATL_join_tree
       3   0.3%  72.0%        8   0.9% _IO_vfprintf_internal
       3   0.3%  72.3%       53   6.1% acquire (inline)
       3   0.3%  72.7%       42   4.9% arma::Mat::init_warm
       3   0.3%  73.0%        5   0.6% arma::gemm::apply_blas_type
       3   0.3%  73.3%        5   0.6% arma::op_symmat::apply
       3   0.3%  73.7%       15   1.7% arma::subview::extract
       3   0.3%  74.0%        5   0.6% arma::subview::operator=
       3   0.3%  74.4%        3   0.3% arma::unwrap_check_mixed::~unwrap_check_mixed
       2   0.2%  74.6%        2   0.2% 00007f833095a1b4
       2   0.2%  74.9%        2   0.2% 00007f83309865d9
       2   0.2%  75.1%        2   0.2% 00007f8330987005
       2   0.2%  75.3%        2   0.2% 00007f8335c49fcb
       2   0.2%  75.6%        2   0.2% 00007f8335c4adf1
       2   0.2%  75.8%        2   0.2% 00007f8335c59e20
       2   0.2%  76.0%        2   0.2% 00007f8335c5f390
       2   0.2%  76.2%        2   0.2% 00007f8335cb7fd6
       2   0.2%  76.5%        2   0.2% 00007f8335cb8015
       2   0.2%  76.7%        2   0.2% 00007f8335cb8397
       2   0.2%  76.9%        2   0.2% 00007f8335cb891d
       2   0.2%  77.2%        2   0.2% 00007f8335cb8a49
       2   0.2%  77.4%        2   0.2% 00007f8335cb930d
       2   0.2%  77.6%        2   0.2% 00007f8335cb9a3c
       2   0.2%  77.9%        2   0.2% ATL_dJIK0x0x0TN0x0x0_a1_bX
       2   0.2%  78.1%        2   0.2% ATL_dcpsc
       2   0.2%  78.3%        2   0.2% ATL_ddot_xp1yp1aXbX
       2   0.2%  78.6%        4   0.5% ATL_thread_exit
       2   0.2%  78.8%      151  17.5% Mat (inline)
       2   0.2%  79.0%        3   0.3% __clone
       2   0.2%  79.3%        2   0.2% __getdents64
       2   0.2%  79.5%        2   0.2% __memcmp_sse4_1
       2   0.2%  79.7%        3   0.3% __pthread_attr_init_2_1
       2   0.2%  80.0%        2   0.2% __pthread_attr_setscope
       2   0.2%  80.2%        2   0.2% __strcpy_ssse3
       2   0.2%  80.4%        3   0.3% _gfortran_compare_string
       2   0.2%  80.6%        2   0.2% _gfortran_concat_string
       2   0.2%  80.9%        2   0.2% _init@efd8

We can force it to show functions going into and out of a target function. In this case, here’s the “emcore” function, which I know is important and is called a lot in this example.

$ google-pprof --text --focus=emcore /usr/lib/R/bin/exec/R myprof.log | more
Using local file /usr/lib/R/bin/exec/R.
Using local file myprof.log.
Removing _L_unlock_15 from all stack traces.
Total: 863 samples
      34  11.7%  11.7%       37  12.8% _int_malloc
      34  11.7%  23.4%       77  26.6% arma::subview_elem2::extract
      30  10.3%  33.8%      140  48.3% arma::subview_elem2::inplace_op
      17   5.9%  39.7%       31  10.7% arma_check (inline)
      16   5.5%  45.2%       24   8.3% arma::Mat::at (inline)
      11   3.8%  49.0%       48  16.6% *__GI___libc_malloc
      10   3.4%  52.4%       15   5.2% helper (inline)
       9   3.1%  55.5%       13   4.5% *__GI___libc_free
       8   2.8%  58.3%        8   2.8% __memcpy_ssse3
       8   2.8%  61.0%       10   3.4% arma::eglue_core::apply
       6   2.1%  63.1%        6   2.1% arma::arrayops::copy_big
       5   1.7%  64.8%        5   1.7% _int_free
       5   1.7%  66.6%      126  43.4% apply (inline)
       5   1.7%  68.3%       18   6.2% copy (inline)
       3   1.0%  69.3%       42  14.5% arma::Mat::init_warm
       3   1.0%  70.3%        5   1.7% arma::gemm::apply_blas_type
       3   1.0%  71.4%        8   2.8% arma::op_strans::apply_noalias
       3   1.0%  72.4%        5   1.7% arma::op_symmat::apply
       3   1.0%  73.4%        3   1.0% arma::unwrap_check_mixed::~unwrap_check_mixed
       3   1.0%  74.5%        3   1.0% malloc_consolidate
       2   0.7%  75.2%        2   0.7% 00007f833095a1b4
       2   0.7%  75.9%        2   0.7% 00007f83309865d9
       2   0.7%  76.6%        2   0.7% 00007f8330987005
       2   0.7%  77.2%      148  51.0% Mat (inline)
       2   0.7%  77.9%        3   1.0% _gfortran_compare_string
       2   0.7%  78.6%        2   0.7% _gfortran_concat_string
       2   0.7%  79.3%        2   0.7% _init@efd8
       2   0.7%  80.0%       52  17.9% acquire (inline)
       2   0.7%  80.7%        3   1.0% arma::Mat::colptr (inline)
       2   0.7%  81.4%        4   1.4% arma::Proxy::at (inline)
       2   0.7%  82.1%        2   0.7% arma::eGlue::at (inline)
       2   0.7%  82.8%       12   4.1% arma::glue_times::apply
       2   0.7%  83.4%       14   4.8% arma::subview::extract
       2   0.7%  84.1%        4   1.4% arma::subview::operator=
       2   0.7%  84.8%        2   0.7% arma::subview::operator[] (inline)
       2   0.7%  85.5%        2   0.7% inplace_plus (inline)
       2   0.7%  86.2%        2   0.7% inplace_set (inline)
       2   0.7%  86.9%       50  17.2% operator new
       1   0.3%  87.2%        1   0.3% 00007f83308aba8d
       1   0.3%  87.6%        1   0.3% 00007f833095a182
       1   0.3%  87.9%        1   0.3% 00007f833095a3a4
       1   0.3%  88.3%        1   0.3% 00007f833095a7f5
       1   0.3%  88.6%        1   0.3% 00007f833098655e
       1   0.3%  89.0%        1   0.3% 00007f83309865eb
       1   0.3%  89.3%        1   0.3% 00007f8330986fe0
       1   0.3%  89.7%        1   0.3% 00007f8330a56362
       1   0.3%  90.0%        1   0.3% 00007f8330a563e6
       1   0.3%  90.3%        1   0.3% 00007f8330a56c08
       1   0.3%  90.7%        1   0.3% 00007f8330a56c83
       1   0.3%  91.0%        1   0.3% 00007f8330a6065b
       1   0.3%  91.4%        1   0.3% 00007f8330a6067d
       1   0.3%  91.7%        1   0.3% 00007f8330a6096a
       1   0.3%  92.1%        1   0.3% 00007f8330a60b30
       1   0.3%  92.4%        1   0.3% Op (inline)
       1   0.3%  92.8%        1   0.3% _IO_vfprintf_internal
       1   0.3%  93.1%        1   0.3% __memcmp_sse4_1
       1   0.3%  93.4%        1   0.3% accumulate (inline)
       1   0.3%  93.8%       50  17.2% apply
       1   0.3%  94.1%        6   2.1% arma::Mat::Mat
       1   0.3%  94.5%       13   4.5% arma::Mat::init_cold
       1   0.3%  94.8%        1   0.3% arma::Mat::is_vec (inline)
       1   0.3%  95.2%        1   0.3% arma::Proxy::get_n_elem (inline)
       1   0.3%  95.5%       32  11.0% arma::auxlib::inv_sympd
       1   0.3%  95.9%       25   8.6% arma::op_find::apply
       1   0.3%  96.2%        2   0.7% arma::op_find::helper
       1   0.3%  96.6%        8   2.8% arma::op_strans::apply
       1   0.3%  96.9%       19   6.6% arma::op_strans::apply_proxy
       1   0.3%  97.2%        4   1.4% arma::op_sum::apply
       1   0.3%  97.6%        1   0.3% arma::subview::colptr (inline)
       1   0.3%  97.9%        1   0.3% arma::unwrap_check_mixed::unwrap_check_mixed
       1   0.3%  98.3%        1   0.3% dgemm_
       1   0.3%  98.6%      290 100.0% emcore
       1   0.3%  99.0%        1   0.3% gemm (inline)
       1   0.3%  99.3%        1   0.3% primitive_range_wrap__impl__nocast (inline)
       1   0.3%  99.7%        1   0.3% std::num_put::_M_insert_int
       1   0.3% 100.0%       12   4.1% ~Mat (inline)
       0   0.0% 100.0%        3   1.0% 000000000000001b
       0   0.0% 100.0%        1   0.3% 000000000000006b
       0   0.0% 100.0%      290 100.0% 000000000040086a
       0   0.0% 100.0%      290 100.0% 000000000040089c
       0   0.0% 100.0%        1   0.3% 00007f83308aba8c
       0   0.0% 100.0%        1   0.3% 00007f833095a181
       0   0.0% 100.0%        2   0.7% 00007f833095a1b3
       0   0.0% 100.0%        1   0.3% 00007f833095a3a3
       0   0.0% 100.0%        2   0.7% 00007f833095a4ab
       0   0.0% 100.0%        1   0.3% 00007f833095a7f4
       0   0.0% 100.0%        3   1.0% 00007f833095a80d
       0   0.0% 100.0%        1   0.3% 00007f833098655d
       0   0.0% 100.0%        2   0.7% 00007f83309865d8
       0   0.0% 100.0%        1   0.3% 00007f83309865ea
       0   0.0% 100.0%        1   0.3% 00007f8330986611

Frankly, that does not help me too much to understand what is going on. I really wish I could use gprof, that has the best profiler output I’ve ever seen, but I can’t get it, so I soldier on.

Second, the same information can be converted into a graphviz image, showing the functions that call each other. This is not perfect, but pretty good. This result is presented in postscript, but I converted to PDF with ps2pdf. I’ve uploade a copy here:

http://pj.freefaculty.org/blog-attachments/pprof5994.0-LD_PRELOAD.pdf

That’s a really wide image, you have to scroll it about. The boxes at the top are functions, but we don’t know their names. Well, we can see their machine names like “00007f8335c78029”, but that’s useless. I’ve installed the debugging symbols that go with the R packages, but this program does not pick them up automatically. That surprised me.

But as you track down into the graph, you eventually come to functions that are in the shared library, such as “emcore”. There’s quite a bit of time being spent in the function called “sweep”. The percentages in the boxes are the percent of total program time and percent of time within the “local” sequence of calls. If we add the command line “–focus=emcore” then the percentages are re-calculated in an interesting way. emcore is then assigned 100%, and functions leading into it and out of it are shown, with their percentages of time. That is interesting, maybe.

Now, HOW CAN YOU DO THIS FOR YOURSELF?

1. The easy way. May work, probably will.

Do not recompile anything, do not change any R packages. Just follow two steps. First, export to the environment the variable LD_PRELOAD that points at the shared library provided by the Google perftools package. My shell is always BASH, or something like it. I run

$ export LD_PRELOAD=/usr/lib/libprofiler.so

Then run the target program, but PREPEND to it an enviroment variable for the output file

$ CPUPROFILE="myprof.log" R -f amelia-faster-1.7.5.2.R

While that is running, the R output whirs by on the screen. But you know it was profiling because when it is done, you see this:

PROFILE: interrupts/evictions/bytes = 870/96/208320

Look in the working directory, and the output file “myprof.log” is sitting there. Then analyze that with the text or graph tools.

Recall the problem that, in the output, many functions do not have human readable names. The have names like 02324252543. The only symbols we can read are in the target package, which is not great. I thought it was missing the R function names because I did not have the R dbg package installed. But I installed it, same result.

Then I tried fiddling with the command line. I thought I would get better symbols by telling it to look in the actual R executable. /usr/bin/R is not actually a binary executable, it is a script that sets some environment variables. I think /usr/lib/R/bin/exec/R might be more helpful. Nope. I tried several variants of this, but did not find a way that produced better output than the others. So I define R_HOME and try the direct approach:

$ R_HOME = /usr/lib/R
$ CPUPROFILE="myprof.log" /usr/lib/R/bin/exec/R -f amelia-faster-1.7.5.2.R

No difference. No legible names on functions in the R libraries themselves.

2. Recompile the R packages for which profiling is desired, after adding a Makefile variable that changes the linker.

In the example I’m profiling, like all R packages with C or C++ code, there is a folder src and in there, Makefile that can be edited. All we need to do is insert -lprofiler in the linker statement. Change something like this:

PKG_LIBS = `$(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()"` $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

to

PKG_LIBS = `$(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()"` $(LAPACK_LIBS) -lprofiler $(BLAS_LIBS) $(FLIBS)

Then make a new tarball of the package, and then rebuild the package.

$ tar czvf myRpackage.tar.gz my-package-dir
$ R CMD INSTALL myRpackage.tar.gz

Of course, on Linux that’s a standard process, if you are stuck in Windows, why the heck are you reading these notes anyway? Honestly, I really doubt this is relevant to Windows at all.

After that, start a NEW terminal session with a “clean” environment, with no LD_PRELOAD. I did not see this warning in the documenation, but the -lprofiler linker option is doing the same thing as the LD_PRELOAD statement, BUT, and this is a big BUT, neither works right if you are using the other. So if you do the -lprofiler option, don’t set the LD_PRELOAD environment variable.

The Google docs say that the linker option -lprofiler has no effect unless you add the CPUPROFILE output element in the command line. I believe them, and I’m considering re-compiling R with that flag set to find out if the profiling output is more readable.

3. Recompile the C or C++ code in the package for which you want the profile. In some ways, this was the most fun. Here’s the idea.

In the cpp file, at the top add the include line for the Google profiler, which is something like

#include <gperftools /profiler.h >

I did this with C++, but was assuming that C would be OK too. But you would have to test and tell me.

Here’s the part I thought was neat. We can turn on the profiler before a certain function is called, and be careful to include the name of the output file in

ProfilerStart("/full/path/to/profile.output.file.log") 
//your functions here
ProfilerStop()

Presumably, the command you are running that is causing the big slowdown is sandwiched between these.

Make sure the link statements in the Makefile still have -lprofiler and then retar and rebuild the package.

After taking those steps, DO NOT include the output log file in the command line. Just run like usual.

$ R -f my-prog.R

Incidentally, I tried to insert only “ProfilerStart()” in the source code, but that won’t compile, and then I simply made a blind guess that it was asking for a file name as an argument. And that worked.

About pauljohn

Paul E. Johnson is a Professor of Political Science at the University of Kansas. He is an avid Linux User, an adequate system administrator and C programmer, and humility is one of his greatest strengths.
This entry was posted in R and tagged , . Bookmark the permalink.