Tuesday, July 23, 2013

Offline data, a visualizer and profiling tools help write efficient code

In the past few months I made great progress analyzing point clouds from Velodyne sensors. One of the challenges is writing super fast algorithms. Looking back, I realize that a few things made my life easier:
  • offline data. I spent quite some time writing tools that allow playing back data recorded from the live sensor. I could play it back at real-time speed, or as fast as possible, and the latter was really useful when I wanted to profile my code. Besides, it allowed me to work from home and save a lot of commuting time.
  • a visualizer, tightly integrated with my code, so that it can access much of the internal data. With the visualizer I could look at the data, change the value of some parameters and see their effect in real time, pause, zoom, and really interrogate my code with questions such as "why is this point considered foreground?"
  • a profiler. I've been using Google's gperftools. Besides, I have easy-to-use timers that I can quickly drop into my code to measure the execution time of a particular block (a minimal sketch follows this list). These timers print their results to stdout, and I wrote a script to parse that output and display the stats neatly in a table (mean, standard deviation, min, max). They were useful to cover for some of the shortcomings of the profiler (it has to be run in debug mode).
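
For illustration, here is a minimal sketch of what such a scoped timer could look like. The names, the output format and the use of std::chrono are my assumptions, not the actual code:

#include <chrono>
#include <cstdio>

// sketch of a scoped timer: prints one line per measured block to stdout,
// which a script can then parse to compute mean / stddev / min / max
struct ScopedTimer {
    const char *label;
    std::chrono::steady_clock::time_point start;

    explicit ScopedTimer(const char *label_)
        : label(label_), start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("TIMER %s %.3f ms\n", label, elapsed.count());
    }
};

Measuring a block is then just a matter of giving the timer a scope, e.g. { ScopedTimer t("label_points"); label_points(scan); }.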

Friday, May 24, 2013

issues in profiling

I'm trying to profile a piece of code, using Google perftools. I have 4 nested loops:


1:  for( ... ) {  
2:    for( ... ) {  
3:      for( /*loop over some values of b2*/ ) {  
4:        for( vector<xxx>::const_iterator it=some_vector[b2].begin(); it!=some_vector[b2].end(); ++it ) {  
5:          do_something_heavy(it);  
6:        }  
7:      }  
8:    }  
9:  }  

I built this with debugging symbols and compiler optimization turned off, and ran it with Google's profiler. It told me that there were many samples on line 5, which I expected of course, but also many samples on line 4 (about 500 each), which I expected less. Of course, line 4 being the 4th nested loop, it gets executed a lot, but all it does is increment the iterator and compare it with the end iterator.

To understand it a bit better, I broke it down and cached the end iterator:

1:  for( ... ) {  
2:    for( ... ) {  
3:      for( /*loop over some values of b2*/ ) {  
4:        const vector<xxx>::const_iterator end = some_vector[b2].end();  
5:        vector<xxx>::const_iterator it = some_vector[b2].begin();  
6:        while( it != end ) {  
7:          do_something_heavy(it);  
8:          ++it;  
9:        }  
10:      }  
11:    }  
12:  }  


What the profiler tells me now is that line 6 gets about 200 samples, lines 4 and 5 get almost nothing, and line 8 (++it) gets 50. In other words, most of the looping time is spent comparing the two iterators...

I then tried using an integer index rather than an iterator:


1:  for( ... ) {  
2:    for( ... ) {  
3:      for( /*loop over some values of b2*/ ) {  
4:        const vector<xxx> & vector_b2 = some_vector[b2];  
5:        const unsigned N = vector_b2.size();  
6:        unsigned n = 0;  
7:        while( n != N ) {  
8:          do_something_heavy(vector_b2[n]);  
9:          ++n;  
10:        }  
11:      }  
12:    }  
13:  }  

What the profiler tells me now is that very few samples are associated with lines other than the do_something_heavy line.


BUT !

when I turn compiler optimization on and time the execution of the code, the 3 versions show no significant difference in execution time. Which is quite reassuring, in a way.

Conclusion:
- it sucks to have to run the profiler with compiler optimization turned off
- it's good to know (or rather, confirm) that iterators are not slower than indexing

If somebody knows of a reliable method to profile code, I'd be happy to hear it.

Wednesday, May 8, 2013

running a script when the network connection status changes

There are a number of things I'd like to do when my network status changes: for instance, use a specific .ssh/config or a specific synergy profile. Thanks to this post, I learned that NetworkManager makes that easy. The trick is to place scripts in /etc/NetworkManager/dispatcher.d/ and use a switch-case approach to detect which interface (eth0, wlan0, etc.) has changed and what the new status is (up, down, etc.). This is documented in NetworkManager's man page.

Incidentally, I also learned about the logger command, which allows writing messages to the /var/log/syslog log file.
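
Putting the two together, a minimal dispatcher script could look like the following sketch (the file name and the handled actions are illustrative; NetworkManager passes the interface as $1 and the action as $2):

#!/bin/sh
# example: /etc/NetworkManager/dispatcher.d/99-example (hypothetical name)
IFACE="$1"
ACTION="$2"

case "$IFACE" in
    eth0|wlan0)
        case "$ACTION" in
            up)
                logger "network: $IFACE is up"
                # e.g. put the matching .ssh/config or synergy profile in place here
                ;;
            down)
                logger "network: $IFACE is down"
                ;;
        esac
        ;;
esac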

running a script on resume / suspend

I wrote a script to generate my .ssh/config file according to my network location: using route I can detect which network I am on, and use that to select the appropriate ssh config options. But I had to run that script manually, until I found a way to run it automatically when my session resumes.

The trick is to place scripts in /etc/pm/sleep.d. This is documented in the pm-utils man pages.
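
A minimal hook could look like this sketch (the file name is hypothetical; pm-utils passes the action as $1):

#!/bin/sh
# example: /etc/pm/sleep.d/10-ssh-config (hypothetical name)
case "$1" in
    suspend|hibernate)
        logger "going to sleep"
        ;;
    resume|thaw)
        logger "resumed, regenerating .ssh/config"
        # e.g. call the script that rewrites ~/.ssh/config here
        ;;
esac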

Saturday, April 20, 2013

tools to format code before posting it online

gist.github.com
Snippets created as gists can easily be shared with others, who can also fork them, etc. It comes with code highlighting. To embed a gist in your blog/site you have to pull in a javascript block, which might not be allowed everywhere (e.g. I think google-sites does not allow scripts).


codeformatter.blogspot.com
The result comes as HTML code that can be copied and embedded on any web page.

Wednesday, February 13, 2013

obstacle detection from velodyne, a story of code optimization

I started working on this algorithm for detecting obstacles in a Velodyne scan (about 140,000 points per revolution across 64 beams). The algorithm is as follows:

1:    for each beam b  
2:        pb ← partner beam (a beam slightly further)
3:        for each point P in beam pb  
4:            y ← distance(P)  
5:            for each point Q in beam b, around P (+/- few degrees)  
6:                x ← distance(Q)  
7:                E ← expected distance of P, as a function of x  
8:                if y < E  
9:                    P is an obstacle  

That can be up to a million distance tests (line 8), depending on the situation. Velodyne data comes in once every 100ms, and we need to do a lot more work once the points have been labeled, so this step should run in about 10ms. When I started coding it, it was running in a couple of seconds. This is the story of how I brought it down to 12ms on a single core of an average laptop (without using a GPU or OpenMP).

Since the structure of the Velodyne is fixed, we can pre-compute the beam association (beam partnering) and the formula for the expected distance, which is a function of the vertical angles of the beams.
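
For instance, the pre-computed data could live in a small per-beam table along these lines (a sketch: the names, and the assumption that the expected distance reduces to a pre-computable factor applied to x, are mine):

struct BeamInfo {
    int    partner;  // index of the partner beam (slightly further out)
    double coeff;    // pre-computed from the two vertical angles,
                     // so that at runtime E = coeff * x
};

BeamInfo beam_info[64];  // one entry per beam, filled once at startup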

Then I realized that once a point had been labeled as an obstacle, there was no need to compare it with its other neighbors. This cut the execution time down from a few seconds to about 500ms. In code, that is a simple early exit from the inner loop, as sketched below.
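
Here is a sketch of that early exit, with hypothetical types and helpers (y plays the same role as in the pseudocode above):

#include <vector>

// hypothetical stand-ins for the real types and helpers
struct Point { double range; bool is_obstacle; };
double distance(const Point &p) { return p.range; }
double expected_distance(const Point &q);  // the pre-computed formula

void label_point(Point &P, const std::vector<Point> &neighbors)
{
    const double y = distance(P);
    for (const Point &Q : neighbors) {
        if (y < expected_distance(Q)) {  // the test from line 8
            P.is_obstacle = true;
            return;  // once labeled, skip the remaining neighbors
        }
    }
}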

I organized the data in a 2D array, with the beam index as the first dimension. For each point I store its angular position, so that I can search for neighbors based on angular distance. After running this through a profiler (gperftools) I realized that this organization was slowing down the algorithm a lot: the algorithm didn't really know where to look for neighboring points, so some searching was necessary, and comparing angles took too long.

So I reorganized the points into a 3D array, binning them into small groups according to their angular position in encoder ticks (an integer). I am using 700 bins, which leads to a couple of points per bin. Finding neighboring points then becomes straightforward: knowing the encoder value of a point, simply divide it by the max encoder value and multiply by the number of bins (see the sketch below). This led to a huge speedup, down to about 35ms.
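
The bin lookup itself is one line; in this sketch the max encoder value is a placeholder, since it depends on the sensor:

const unsigned NUM_BINS  = 700;    // bins per revolution, as above
const unsigned MAX_TICKS = 36000;  // hypothetical max encoder value

// assumes encoder_ticks is in [0, MAX_TICKS); multiplying before
// dividing keeps the integer arithmetic exact
inline unsigned bin_index(unsigned encoder_ticks)
{
    return encoder_ticks * NUM_BINS / MAX_TICKS;
}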

Again, I inspected the code with the profiler and noticed that it was spending a lot of time accessing points in the array. So I cached a reference when the same location was used several times:

 Obj &p = table[i][j];  
 for( ... )  
   do_something_with( p );  

is faster than

 for( ... )  
   do_something_with( table[i][j] );  

This brought down the execution time to about 20ms.

Letting the compiler optimize the code (until now I was running in debug mode) brought the execution time down to 12ms!

ps: code formatting from http://codeformatter.blogspot.com/