Sunday, December 14, 2008

Does data have velocity?

While reading I Am a Strange Loop[1] by Douglas Hofstadter, I came to the part where he tries to find an appropriate metaphor for his notion of a single human "mind/soul" being distributed over multiple human brains (somewhat as a country is distributed over its many scattered embassies).  It made me muse on the boundary between an actual distributed mind/soul and the other mind/souls that are merely affected or influenced by it.  This is, of course, a particular instance of the general problem of determining the boundary of a diffuse object.  The boundary of a solid asteroid is easy to determine, whereas the borderline between one planetary ring and an adjacent ring is harder.  Any individual "rock" residing in the region where two rings overlap could be part of either ring.

A data example of this problem is clustering lots of individual names/addresses into identities even though the names/addresses for any one identity vary.  There are cases where it is ambiguous which identity "owns" a particular name/address, because the fuzzy blob of one identity cluster overlaps the fuzzy blob of another.  How do we tell which one it belongs with?  Why do we even think that there are two overlapping blobs instead of just one oddly shaped blob?

AHA - Look at velocity!

The problem of determining which points belong to which overlapping fuzzy regions is hard when looking at a static picture, but it is easy when there is movement.  To decide which stars belong to which of two colliding galaxies, we look at each star's velocity to see which galaxy it is moving with.
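To make the galaxy picture concrete, here is a minimal Python sketch.  Everything in it is made up for illustration: the two galaxies, their centers and mean velocities, the star, and the helper names are all hypothetical, and "nearest mean velocity" is just one simple way to decide what "moving with" means.

```python
import math

# Hypothetical numbers: two galaxies whose positions overlap
# but whose bulk motions (vx, vy) are very different.
galaxy_center = {"A": (0.0, 0.0), "B": (1.0, 0.0)}
galaxy_velocity = {"A": (5.0, 0.0), "B": (-4.0, 1.0)}

def distances(point, references):
    """Euclidean distance from point to each labeled reference vector."""
    return {label: math.dist(point, ref) for label, ref in references.items()}

def nearest(point, references):
    """Label of the reference vector closest to point."""
    d = distances(point, references)
    return min(d, key=d.get)

# A star sitting exactly halfway between the two centers.
star_position = (0.5, 0.0)
star_velocity = (-3.5, 0.8)

# Static picture: both distances are the same, so position cannot decide.
position_distances = distances(star_position, galaxy_center)

# Moving picture: the star is clearly travelling with one galaxy.
by_velocity = nearest(star_velocity, galaxy_velocity)

print(position_distances)
print(by_velocity)
```

The position distances tie, which is the ambiguity of the static picture; the velocity comparison is not close at all, which is why movement makes the assignment easy.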

So, can this be applied to data?  Is there some "velocity" that can be determined for each data point such that it can be associated with the "proper" data cluster?  Is there a velocity associated with a name/address instance?

[1] Douglas Hofstadter, "I Am a Strange Loop", Basic Books, 2007