March 27th 2012
I’ve looked on github for a good C++ implementation of Cover Trees for nearest-neighbors search, but I didn’t find one. I may have overlooked some repositories, but in the end, implementing it myself wasn’t that difficult.
Contrary to kdtrees, the cover tree nodes always represent one point. So you have as many nodes as you have data points. Each of them can have several children, sorted by “level”, or distance hierarchy. Each level halves the distance between the node and its children. When constructing the tree instance, the distance you use is given as parameters (and its type is a template parameter, as well as the point type and the data type used for comparisons).
Construction of the tree is done by searching in the tree for the closest node and the lowest hierarchy possible. In all time, the min and max levels are tracked in the tree. On insertion, the max level is tried, and if it fails, the max level is increased by one. The failure occurs if the new point is furthest of the root point that the maximum distance on the max level. At each level, the set of the nodes that are closer to the new point than the current maximum distance are kept (function populate_set_from_node). If the least distance is greater than the next level maximum level (function find_min_dist, the recursion stops and the new node is inserted in the first node in the set.
When doing a search, a list of distances and points (of type NearestNodesStructure) is created and its size is kept at the number of neighbors k. On entrance of a new level (level_traversal), all nodes that have children that could be close enough (current distance less than maximum level distance + distance to the current k-th neighbors found with find_k_dist) are added to the list. In order to speed search, a map is not used, but a partial sort is done at each iteration (which makes find_k_dist work without sorting the container again). Without this, the performance is worse than a linear knn search.
I didn’t implement other functions like removal or traversal. I don’t need them, but feel free to add them if you need. Sidenote: I will try to find time to refactor the code, because it is not very clean at the moment (especially the knn method).
So some figures. I’ve populated a tree with 100k elements, and then made a 10-neighbors search with the cover tree, and with a standard linear search. Without the construction time, the search is about ten times faster than a linear one. Of course, it all depends on the cover tree balance. You could design a cover tree that has too many levels (but also the data would be unusual) and that would result in an almost linear time.
I’ve done a profile with callgrind, but the majority of the time is in the tree construction. No point in checking the search and displaying it here as the following timings show:
Build time 00:09:38.530397 Out time (linear) 00:00:00.238989 Out time (cover_tree) 00:00:00.012518
What must be optimized is clearly the build time.
The test code is in the main.cpp file on github.
No Comments yet »