```
1: for each beam b
2:   pb ← partner beam (a beam aimed slightly further)
3:   for each point P in beam pb
4:     y ← distance(P)
5:     for each point Q in beam b, around P (± a few degrees)
6:       x ← distance(Q)
7:       E[P] ← expected distance of P, as a function of x
8:       if y < E[P]
9:         P is an obstacle
```

That can amount to up to a million distance tests (line 8), depending on the scene. Velodyne data arrives once every 100ms, and we need to do a lot more work once the points have been labeled, so this step should run in about 10ms. When I first coded it, it took a couple of seconds. This is the story of how I brought it down to 12ms on a single core of an average laptop (without using a GPU or OpenMP).

Since the structure of the Velodyne is fixed, we can pre-compute the beam association (beam partnering) and the formula for the expected distance, which is a function of the vertical angles of the beams.
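A minimal sketch of that precomputation might look like the following. The pairing rule (each beam with the next one up) and the flat-ground ratio model are illustrative assumptions; `precompute` and `BeamTables` are hypothetical names, not the actual code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Tables that depend only on the sensor's fixed geometry,
// so they can be filled once at startup.
struct BeamTables {
    std::vector<int> partner;    // partner[b] = index of the beam paired with b
    std::vector<double> ratio;   // flat-ground distance ratio between b and partner[b]
};

// verticalAngle[b] = angle of beam b measured from vertical (radians).
BeamTables precompute(const std::vector<double>& verticalAngle) {
    BeamTables t;
    int n = static_cast<int>(verticalAngle.size());
    t.partner.resize(n);
    t.ratio.resize(n);
    for (int b = 0; b < n; ++b) {
        // Pair each beam with the next one up (illustrative choice).
        int pb = (b + 1 < n) ? b + 1 : b;
        t.partner[b] = pb;
        // On flat ground, range scales like tan(angle from vertical),
        // so the partner beam should see the ground slightly further out.
        t.ratio[b] = std::tan(verticalAngle[pb]) / std::tan(verticalAngle[b]);
    }
    return t;
}
```

At runtime the inner loop then only does a table lookup and a multiplication instead of evaluating trigonometry per point.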

Then I realized that once a point has been labeled as an obstacle, there is no need to compare it with its remaining neighbors. This cut the execution time from a few seconds down to about 500ms.
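Combining the pseudocode with this early exit, a minimal C++ sketch could look like this. The `Point` layout, the flat-ground `ratio` model and the function names are assumptions for illustration, not the actual implementation:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point {
    double distance;        // range measured by the sensor
    double angle;           // horizontal angular position (radians)
    bool obstacle = false;
};

// Hypothetical flat-ground model: the partner beam, aimed slightly
// further, should see the ground at `ratio` times the distance seen
// by the reference beam (ratio depends only on the vertical angles).
double expectedDistance(double x, double ratio) { return x * ratio; }

// Label points of the partner beam that come in closer than the
// flat-ground model predicts from nearby points of beam b.
void labelObstacles(const std::vector<Point>& beam,
                    std::vector<Point>& partner,
                    double ratio, double window) {
    for (Point& p : partner) {
        for (const Point& q : beam) {
            if (std::abs(q.angle - p.angle) > window) continue;  // not a neighbor
            if (p.distance < expectedDistance(q.distance, ratio)) {
                p.obstacle = true;
                break;  // already labeled: skip the remaining neighbors
            }
        }
    }
}
```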

I organized the data in a 2D array, with the beam index as the first dimension. For each point I store its angular position, so that I can search for neighbors by angular distance. After running the code through a profiler (gperftools), I realized that this organization was slowing the algorithm down a lot: the algorithm didn't really know where to look for neighboring points, so some searching was necessary, and the angle comparisons were expensive.

So I reorganized the points into a 3D array, binning them into small groups according to their angular position expressed in encoder ticks (an integer). I am using 700 bins, which leaves a couple of points per bin. Finding neighboring points then becomes straightforward: take the encoder value of the point, divide it by the maximum encoder value, and multiply by the number of bins. This led to a huge speedup, down to about 35ms.
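The bin lookup is then a single integer expression. The maximum encoder value below (36000 ticks per revolution) is an assumption for the example; only the 700-bin count comes from the text:

```cpp
#include <cassert>

const int NUM_BINS = 700;
const int MAX_ENCODER = 36000;  // assumed ticks per revolution

// Map an encoder reading to its bin: divide by the max value and
// multiply by the number of bins, in pure integer arithmetic
// (reordered so the division comes last and keeps precision).
inline int binIndex(int encoder) {
    return encoder * NUM_BINS / MAX_ENCODER;
}
```

To gather the neighbors of a point within a few degrees, it is enough to scan its own bin and the adjacent ones, since each bin spans roughly half a degree.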

Again I inspected the code with the profiler, and noticed that a lot of time was spent accessing the points in the array. So I cached a reference whenever the same location was used several times:

```
Obj &p = table[i][j];   // look up the location once
for (...)
    doSomething(p);
```

is faster than

```
for (...)
    doSomething(table[i][j]);   // indexing repeated at every iteration
```

This brought down the execution time to about 20ms.

Letting the compiler optimize the code (until now I had been timing debug builds) brought the execution time down to 12ms!
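Concretely, the difference is just the build flags (the file name here is illustrative):

```shell
# Debug build: what I had been timing until now
g++ -g -O0 -o detect detect.cpp

# Optimized build: same code, from about 20ms down to 12ms
g++ -O2 -o detect detect.cpp
```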

ps: code formatting from http://codeformatter.blogspot.com/