August 28th 2012
Yes, because Cover Trees are sometimes too slow. In fact, I asked myself this question, not for the build time, but for the search time if the data has a structure. Imagine, what would happen if your data was more a less a regular grid? When I tried that, starting with a point at (0,0), then (1,0)… the first node (0,0) had references to all the last points (9,9), (9,8)… And I figured, it would be slower than a tree search. So I decided to give kd-trees a shot for this kind of search on a regular grid.
The implementation is not optimized, the only noticeable thing is that I don’t create subnodes for all points, I only split nodes when they have more than n points, and then according to the best dimension (best is defined by the biggest distance on one axis in the node, the root distance being set by hand). The search is of course really close to the cover tree implementation.
Also, when doing the search, I can override the original distance function with a new one. This is because when doing the search, some dimensions may be of greater interest than others. At least, this is the case I have. When creating the tree, I don’t have access to the actual distance, and it could even change during the process.
OK, some comparisons now. I’ve changed the number of nodes to 1 million. I’ve relaunched the cover tree test, and this is the result:
Build time 12:20:49.484375 Out time (linear) 00:00:04.500000 Out time (cover_tree) 00:00:00.062500
The first figure is true, it is more than 12 hours. You have to search a lot to be profitable to use a cover tree.
Build time 00:00:01.890625 Out time (linear) 00:00:02.921875 Out time (kd-tree) 00:00:00.875000
Strangely the timing for the linear search is a bit different, but I guess this is due to the long time the program had to run. The timing is not as good, but it is only one test. I would need to test far more points, but due to the high cover tree build time, I didn’t try yet.
Now, if we use another distance (infinite instead of L2):
Out time (linear) 00:00:04.718750 Out time (kd-tree) 00:00:00.890625
The linear time is a bit higher, like the one from the cover tree. What is interesting is that the kdtree time is comparable to the one from the L2 search time. And this is what I wanted to check.
Kd-trees are not something earth-shaking, but sometimes it is still very relevant. Of course, I didn’t optimize any of the cover tree or kd tree implementation, so there is place for improvement. What I really like for the kdtree is that if you change the distance function, it can still work, whereas a cover tree, being implemented on a specific distance, cannot easily.
Kd-tree code on Github is here.
2 Comments »