Sunday, December 14, 2008

Does data have velocity?

While reading I am a Strange Loop[1] by Doug Hofstadter, in which he tries to come up with an appropriate metaphor for his notion of a single human "mind/soul" being distributed over multiple human brains (somewhat like a country is distributed over its many scattered embassies), I found myself musing on the boundary between an actual distributed mind/soul and other mind/souls that are merely affected or influenced by it.  This is, of course, a particular instance of the general problem of determining the boundary of a diffuse object.  The boundary of a solid asteroid is easy to determine, whereas the borderline between one planetary ring and an adjacent ring is harder.  Any individual "rock" residing in the region where two rings overlap could be part of either ring.

A data example of this problem arises when lots of individual name/address records need to be clustered into identities even though the names and addresses vary.  When the fuzzy blob of one identity cluster overlaps the fuzzy blob of another, it can be ambiguous which identity "owns" a particular name/address. How do we tell which one it belongs with? And why do we even think there are two overlapping blobs rather than just one oddly shaped blob?


AHA - Look at velocity!

The problem of determining which points belong to which overlapping fuzzy regions is hard when looking at a static picture; however, it becomes easy when there is movement.  When deciding which stars belong to which of two colliding galaxies, we look at each star's velocity to see which galaxy it is moving with.

So, can this be applied to data?  Is there some "velocity" that can be determined for each data point such that it can be associated with the "proper" data cluster?  Is there a velocity associated with a name/address instance?
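
One speculative way to make this concrete (a sketch of my own, not something from the post, and all function and data names below are hypothetical): if each name/address record is kept as a numeric feature vector with timestamped snapshots, the change between snapshots acts as a crude "velocity", and a record can be assigned to whichever identity cluster's centroid is drifting in the most similar direction.

```javascript
// Speculative sketch: treat the change in a record's feature vector between two
// snapshots as its "velocity", then assign the record to whichever cluster's
// centroid is moving in the most similar direction. All names are hypothetical.

function velocity(older, newer) {
  // component-wise difference of two numeric feature vectors
  return newer.map((v, i) => v - older[i]);
}

function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = vec => Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// Given a record's velocity and each cluster's centroid velocity (computed the
// same way from earlier vs. later snapshots), pick the cluster moving "with" it.
function assignByVelocity(recordVel, clusterVels) {
  let best = -1, bestScore = -Infinity;
  clusterVels.forEach((centroidVel, i) => {
    const score = cosineSimilarity(recordVel, centroidVel);
    if (score > bestScore) { bestScore = score; best = i; }
  });
  return best; // index of the cluster whose motion best matches the record's
}
```

The point is only to illustrate the galaxy analogy: two blobs that are indistinguishable in a snapshot can separate cleanly once each point's direction of change over time is taken into account.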

[1] "I am a Strange Loop",2007, Hofstadter, Basic Books

Wednesday, October 22, 2008

Silver Bullet: Model the world, not the application

DISCLAIMER: Ok, I admit it...this is cut/pasted directly from my brain fart notebook, i.e. not ready for prime time...but dammit Jim, it's just a blog!

Fighting off unsuccessful software development projects will take not one silver bullet but a whole clip of them.  One of those silver bullets, I believe, is modeling the world more accurately by applying some knowledge of Philosophy.

There is a great struggle between getting everything "right" up front versus doing "just enough" specification and design.  When trying to balance "make it flexible" (in order to support future reuse) against XP mandates like "don't design what isn't needed today", it is hard to know (or justify) where to draw the line.  Due to "changing requirements", those "flexible reuse" features (that were merely contingent at design time) often become mandatory before the original development cycle is even complete.

WELL, lots of requirements don't change THAT much if you are modeling correctly in the first place.

Humans haven't changed appreciably in millennia, even if the roles they play do.  So, if "humans" are modeled separately from "employees", it is that much less work when you later need to integrate them with "customers". [Theme here is "promote roles programming", the justification of which is made more obvious when taking essentialism to heart.]

In general, the foundation of one's data/domain/business/object/entity-relationship model is solid and unchanging, if all "domain objects", "business objects", etc are modeled based on a clear understanding of the essential versus accidental aspects of the "real world", and NOT based on the requirements description of a particular computer system or application.  Modeling based on "just what is needed now according to this requirements document today" is too brittle, both for future changes, and especially for integrating with other systems and data models.

After all, adding properties and relationships to entities is fairly easy if the entities themselves are correctly identified.  It is much harder to change the basic palette of entities once a system design is built upon them.  That is all the more reason to be careful not to confuse entities with the roles they can take on.

Example: I don't have to wonder who I might have to share employee data with if I realize that an "employee" is actually just a role that a person takes on.  If I model the essentials of the person separately from the attributes of the employee role, it will be much easier to integrate that data with, say, a "customer" database later.  If the customer data model recognizes that "customer" is just a role that a person takes on, its Person table is much more likely to be compatible with my Person table than would be the case with my naive Customer and their Employee tables (and still other Patient tables, etc, etc.)
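
As a rough illustration of this separation (a minimal sketch of my own, not code from any of the systems mentioned; the entity and role names are assumptions), the essential attributes live on a Person, and "employee" and "customer" are just roles attached to that person:

```javascript
// Minimal sketch: model the essential entity (Person) separately from the
// roles it can take on (employee, customer). All names here are illustrative.

class Person {
  constructor(name, birthDate) {
    this.name = name;            // essential attributes of the human being
    this.birthDate = birthDate;
    this.roles = {};             // roles are attached, not baked in as subclasses
  }
  addRole(roleName, roleData) {
    this.roles[roleName] = roleData;
  }
  hasRole(roleName) {
    return roleName in this.roles;
  }
}

// The same person can be an employee in one system and a customer in another
// without duplicating (or forking) the underlying Person record.
const pat = new Person("Pat Smith", "1970-01-01");
pat.addRole("employee", { employeeId: "E-123", hireDate: "2008-10-01" });
pat.addRole("customer", { customerId: "C-456", accountOpened: "2006-05-15" });

console.log(pat.hasRole("employee")); // true
console.log(pat.hasRole("patient"));  // false
```

Because the Person carries only the essentials, integrating it later with someone else's Person-plus-roles model is mostly a matter of reconciling role attributes, not redefining the core entity.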

Wednesday, April 30, 2008

Your Pipe Inventory Record is not a Pipe

In this blog, I recently posted Reality is the System of Record. Unfortunately, I only just found a misplaced reminder to myself about a really good introductory example to use. Even though it didn't make it into the original post, it seems worth mentioning here anyway...

La trahison des images is a famous painting by Magritte which seems to pose a riddle.  It contains nothing but a picture of a pipe and the phrase "Ceci n'est pas une pipe" ("This is not a pipe"). However, it only seems enigmatic because the solution is too obvious.

As Art critic Robert Hughes explains in The Shock of the New, "This, indeed, is not a pipe. It is a painting; a work of art; a sign that denotes an object and triggers memory". As Magritte himself once remarked, "Of course it's not a pipe. Just try to fill it with tobacco".

In other words, it is a representation of a pipe, not to be mistaken for an actual one. And as obvious as that seems, computer system developers make the same mistake all the time. They do so when they forget that their Customer data table & business domain objects are not actual customers, but only representations, i.e. memories, i.e. copies of information.  So, as with all cached copies of external data, it is the duty of your so-called "system of record" to keep in sync with the real customer, in the real world, because Reality is the System of Record.

Thursday, April 17, 2008

Reality is the System of Record

"The system of record is the place where there is a definitive value for some unit of data... If you have no system of record for your bank account or if you have multiple systems of record for the same account, something is fundamentally wrong."
Bill Inmon, father of the data warehouse.[1]
Over my years of consulting in large enterprise environments, I've heard arguments over whose system is the "system of record" for some particular piece of data.  I have seen programmers officially acknowledge another system as the SOR, all the while building their own system as if it were.  I've also heard corporate developers say that something is a customer if and only if it has a record in the SOR (never mind what the customer thinks, nor whether the record literally has the name "Donald Duck").  I've seen that same customer information system built with so little regard for mirroring the real world that it defined Frankenstein customers, some of whom were composed of parts from multiple actual people.  Owning an SOR seems to breed a certain lack of humility which (IMHO) could benefit from learning a little Philosophy.  Philosophy will teach you that your system is never really the system of record, but instead merely a "faint copy" of one of many idiosyncratic conceptions of the world.

Representationalism

In modern Cognitive Science, there is the assumption that "the mind has mental representations analogous to computer data structures". The idea in Philosophy that "the mind perceives only mental images (representations) of material objects, not the objects themselves"[2] is called Representationalism (and more generally indirect realism), and it can trace its roots across 2400 years of Philosophy of Mind.  Socrates says (in Plato's Parable of the Cave) that most people only see shadows of puppets instead of reality. Aristotle said thoughts are likenesses of things, and words refer to things indirectly through thoughts.  Rene Descartes proposed that all sensory information is transmitted by the nerves to a central "theatre", where the soul makes contact with the physical body and watches it.  John Locke said, "The mind represents the external world, but does not duplicate it."  David Hume thought that ideas were "faint copies" of physical sensations.

Skepticism (i.e. the notion that we don't know what we think we know) is one of the oldest ideas in Philosophy, and the raison d'être for the entire branch called Epistemology (which asks: how can we be sure we know what we think we know?).  Both were born from the ancient realization that our formation of ideas, concepts, and representations out of our sensory input (hidden behind a "veil of perception") is a very inexact process, complete with optical illusions, dreams, hallucinations, color-blindness, double vision, etc, etc.  In his Critique of Pure Reason, Immanuel Kant argues that because our minds are hardwired to perceive the world in certain ways, they actively shape experience rather than passively record perceptions.  His claim was that our sense perception is effectively a pair of tinted glasses that we can't take off.  Because we've never seen the world without them, it takes effort to see the world as it really is.  One of the very reasons to practice Philosophy is to understand the true nature of things and avoid the "naive realism" of "common sense". Guides to this understanding are found in the branches of Metaphysics and Ontology, which explore how to understand and how to model the world, respectively.  One of the tactics of Epistemology that aids in this effort is the doctrine of Verificationism, which says that a statement has no meaning if there is no way to verify its truth.

For programmers, the big epiphany here should be that their business database is really just one particular representation of reality, not reality itself.  Reality is the System of Record. A developer should be humble and realize that it is difficult to model the world accurately enough that it integrates with the data models of other systems.  They should also be skeptical that their actual data is both accurate and up-to-date, developing ongoing mechanisms to actively verify each.

Syncing the Systems of Record

When any system keeps copies of data for which it is not the system of record, that system is really just a cache of that external data.  And like any cache, it has the responsibility to keep track of whether its copy is "stale", and to implement mechanisms to ensure "cache coherence". Once one realizes that reality is the system of record, it becomes clear that it is not enough to keep databases in sync with each other; they ALL must chase after the ever-changing state of the world.
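
A minimal sketch of that idea (my own illustration; the field names, the staleness window, and the reverifyAgainstReality hook are assumptions, not from any particular system): treat each locally held record as a cache entry with a last-verified timestamp, and re-check it against the real world when it goes stale.

```javascript
// Sketch: treat a locally stored customer record as a cached copy of reality.
// Names (lastVerifiedAt, MAX_AGE_MS, reverifyAgainstReality) are hypothetical.

const MAX_AGE_MS = 90 * 24 * 60 * 60 * 1000; // e.g. consider data stale after 90 days

function isStale(record, now = Date.now()) {
  return now - record.lastVerifiedAt > MAX_AGE_MS;
}

// reverifyAgainstReality stands in for whatever out-of-band check applies:
// contacting the customer, querying a postal-address service, etc.
async function refreshIfStale(record, reverifyAgainstReality) {
  if (!isStale(record)) return record;
  const verified = await reverifyAgainstReality(record);
  return { ...verified, lastVerifiedAt: Date.now() };
}
```

The design point is that verification is an ongoing responsibility of the so-called system of record, not a one-time data load.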

In a large organization, each database is just one of many competing representations. In the same way that different people have different mental representations of their shared reality, different databases will each have their own slightly (or largely) incompatible data models, and sets of data values, for the same domain entities.  A side-effect of this is that many systems keep copies of external data that they've transformed in some way for their own use.  When systems don't realize that their SOR data represents the same real-world entities as other SORs do (e.g. a patient DB versus an employee DB), the eyesight of the entire enterprise goes out of focus as the data sets drift apart.

CASE STUDY: ChoicePoint

ChoicePoint was an independent company prior to being bought out by Reed Elsevier in 2008.  It collected and combined data about businesses and individuals from a wide variety of sources, selling access to both private and public (i.e. government) clients.  While consulting there, I saw firsthand how data was effectively verified only if someone phoned in to complain about inaccuracies. I was also informed by management that no access control mechanisms were to be included in the design of a new system, despite Congress exploring requirements for exactly such controls.  They said that ChoicePoint wanted to be able to protest the cost of adding the controls after the fact, in the hope that the expense would defeat the requirements in the first place. There has been a whole series of lawsuits and government actions against ChoicePoint for out-of-date data, inaccurate data, and selling data to unauthorized buyers, which cost it so much money that it had to be sold.

[1] The System of Record in the Global Data Warehouse, Bill Inmon, Information Management Magazine, May 2003
http://www.information-management.com/issues/20030501/6645-1.html

[2] Representationalism, Encyclopædia Britannica
http://www.britannica.com/EBchecked/topic/498476/representationism


Thursday, March 20, 2008

The Logical Positivists were Test-Infected
(Is that a good thing?)

While reading Looking at Philosophy[1], I came across the section about the Logical Positivists[2], whom I immediately recognized as "test infected"[3]! The Logical Positivists (circa 1910s-1930s) promoted an idea that the British philosopher A. J. Ayer named "the principle of verification"[8][9], i.e. "the meaning of a proposition is its method of verification". In other words, a statement is meaningless if it can't be objectively tested. This is the same big idea as Test Driven Development[5] and, more specifically, Test Driven Specifications[6] (aka Test Driven Requirements[7]); namely, don't create a requirement (or interface definition) that is so vague, subjective, or contradictory that an automated computer program can't be written to test for compliance.

In other words, all that verbage[12] that usually passes for a specification/requirements document is really just commentary on the REAL specification, which is encoded in a comprehensive compliance test suite. [BTW, this isn't just a software concept, since IC chips have long had not only "testbed" circuit boards to test chips, but also ICE (in-circuit emulators) to test circuit boards via simulated chips (an idea software copied with "mock objects").]

The goal of both Logical Positivists and test-driven engineers is to weed out statements that are so poorly conceived and worded as to be effectively meaningless. The way to ensure this is to submit each statement to this rigorous method and rework any statements that come up short.
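
To make the parallel concrete (a minimal sketch of my own, not from any of the cited sources; the requirement wording, limits, and function names are invented for illustration): a vague requirement like "the system should respond quickly" is effectively meaningless, while "a lookup returns the right record within 200 ms for a 1,000-record table" can be encoded directly as an executable check.

```javascript
// Sketch: turning a requirement into its "method of verification".
// The requirement text, limits, and lookup() function are hypothetical examples.

// Vague (untestable): "The system should respond quickly."
// Verifiable:         "lookup() returns the right record within 200 ms
//                      for a 1,000-record table."

function lookup(table, key) {
  return table.get(key); // stand-in for the real implementation under test
}

function testLookupIsFastEnough() {
  const table = new Map();
  for (let i = 0; i < 1000; i++) table.set("key" + i, { id: i });

  const start = Date.now();
  const result = lookup(table, "key500");
  const elapsed = Date.now() - start;

  console.assert(result && result.id === 500, "lookup returned the wrong record");
  console.assert(elapsed <= 200, "lookup took longer than the specified 200 ms");
}

testLookupIsFastEnough();
```

The compliance suite then becomes the operational meaning of the specification; the prose around it is commentary.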

What did Logical Positivists have to say about "how" to do tests?

My whole excitement about finding philosophy movements that parallel aspects of software engineering is the notion that "top men"[11] have already thought and argued long & hard about this stuff, and therefore we programmers can benefit from what they've already learned the hard way. So, what method did they advocate? Basically, it is the idea that all statements should be broken down into "protocol sentences" plus logical conclusions built on top of them. Protocol sentences were to be simply observation reports (based on firsthand, direct sensory experience). This was ultimately unworkable for humans, but everything in a computer test IS based on its firsthand, direct sensory experience (of electrical signals, anyway).

Karl Popper, among the most influential philosophers of science of the 20th century, said that claims could only be considered "scientific" if they were falsifiable, meaning that one could specify some observation or experiment whose outcome, if it actually occurred, would prove the claim false.

Logical Positivists claimed that unverifiable statements were LITERALLY meaningless, and therefore they dismissed entire disciplines like metaphysics, morality, and ethics. Other, more moderate philosophers held that untestable propositions were merely unproductive to work with. Popper claimed that his demand for falsifiability was not meant as a theory of meaning, but rather as a methodological norm for the sciences.

Uh oh, Logical Positivism was considered a failure...Will Test Driven be too?!

I was all happy about a philosophical foundation to being test driven (since I was test infected years ago), until I got to the part where Logical Positivism has been so devalued by other philosophers that one wrote "Logical positivism is one of the very few philosophical positions which can be easily shown to be dead wrong, and that is its principle claim to fame."[10] OUCH!

By saying that so-called metaphysical claims were meaningless, they unwittingly implied that the very claim "metaphysical claims are meaningless" was itself meaningless! I.E. there was no way to verify the statement that "only verifiable statements have meaning".

Lessons for the Test Driven Approach

Luckily, Logical Positivism was only an extreme position in the spectrum of ideas under the Verificationism[4] umbrella. In the conclusion of "Verificationism: Its History and Prospects"[8], some points were noted...
  • one can't reduce every statement to be a logically equivalent statement about sensory experiences. (i.e. not every thought comes under the umbrella of "natural science")
  • one can say that "a statement lacks legitimacy or objectivity if there would be no evidence for or against it; if it is insulated from reason, where reason is linked to the possibility of public evidence for or against the statement from other, already established statements"
  • one must admit that some more abstract statements are understandable even if not testable.
So, the bottom line seems to be that the Test Driven approach is a good one if you...
  1. concentrate on testing what can be tested, 
  2. create tests that would actively try to prove statements false (see the sketch after this list), and
  3. know that you probably can't prove (with tests) that the test driven approach itself is correct.
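
As a rough illustration of point 2 (a sketch of my own; the claimed invariant, the input generator, and all names are invented for illustration), a test can go hunting for counterexamples rather than merely confirming a few friendly inputs:

```javascript
// Sketch: a falsification-style test that actively searches for inputs
// that would disprove a claimed invariant. All names here are hypothetical.

// Claim under test: "normalizing a name twice gives the same result as once."
function normalizeName(name) {
  return name.trim().replace(/\s+/g, " ").toLowerCase();
}

function randomName() {
  const parts = ["  Ada", "LOVELACE ", "von", "  Neumann", "o'brien", "  "];
  const count = 1 + Math.floor(Math.random() * 3);
  return Array.from({ length: count },
    () => parts[Math.floor(Math.random() * parts.length)]).join(" ");
}

function testNormalizationIsIdempotent(trials = 1000) {
  for (let i = 0; i < trials; i++) {
    const input = randomName();
    const once = normalizeName(input);
    const twice = normalizeName(once);
    // A single counterexample falsifies the claim.
    console.assert(once === twice,
      `counterexample: ${JSON.stringify(input)} -> ${JSON.stringify(once)} -> ${JSON.stringify(twice)}`);
  }
}

testNormalizationIsIdempotent();
```

Such a test can never prove the invariant true for all inputs; it can only fail to falsify it, which is exactly Popper's point.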

[1] Looking at Philosophy, Donald Palmer, 4th Ed. 2005

[2] ibid, pg 327

[6] Acceptance Tests and the Test Driven Specification

[7] Test Driven Development or Test Driven Requirements?

[8] Verificationism: Its History and Prospects, C. J. Misak, 1995

[9] Language, Truth and Logic, Alfred Jules Ayer, 1952

[10] Prolegomena to Philosophy, Jon Wheatley, 1970

[11] "top men", Raiders of the Lost Arc

[12] verbage vs verbiage


Tuesday, March 11, 2008

Scott is Scott

While reading Looking at Philosophy[1], I came across its discussion of Bertrand Russell's Theory of Descriptions[2]. It was a proposed solution to several problems that occur when logical propositions are made about an entity's existence or identity. One of the problems being solved was that a statement like "Scott is the author of the novel Waverley", if it is true, reduces to the statement "Scott is Scott". This is because in traditional logic, two terms that denote the same object can be interchanged without affecting the meaning or truth of a statement. But since "Scott is Scott" doesn't seem to be equivalent to "Scott is the author of Waverley", Russell proposed that a better way to state the latter was with the template:

There is an entity C, such that the sentence "X is Y" is true, iff X=C.

This would produce the sentence "There is an entity C, such that "X wrote Waverley" is true, if and only if X=C; moreover, C is Scott". What Russell was getting at was that there was a problem with the traditional view that a definite description could be exchanged with a proper name. A phrase describing something (e.g. "the author of Waverley") means something different than a name (e.g. Scott) because while it is true that "George IV wanted to know if Scott was the author of Waverley", it is false that "George IV wanted to know if Scott was Scott". Russell's template solved this problem along with others like making claims about non-existent things. E.G. "The present king of France is bald" is problematic because there is no present king of France. If it were considered false then "The present king of France is not bald" would have to be considered true. This is avoided by saying instead: There is an entity C, such that the sentence 'X is French, bald, and kingly' is true iff X=C. THAT statement is false (because there is no entity that fits that description), and its opposite is true (i.e. There is NOT an entity...).
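
For reference, here is the standard first-order rendering of Russell's analysis (the predicate names are my own shorthand): a definite description "the F is G" unpacks into an existence claim, a uniqueness claim, and the predication itself.

```latex
% "The author of Waverley is Scott"
\exists x \bigl( \mathit{Wrote}(x, \mathit{Waverley})
  \land \forall y\,(\mathit{Wrote}(y, \mathit{Waverley}) \rightarrow y = x)
  \land x = \mathit{Scott} \bigr)

% "The present king of France is bald" -- false, because the existence claim fails
\exists x \bigl( \mathit{KingOfFrance}(x)
  \land \forall y\,(\mathit{KingOfFrance}(y) \rightarrow y = x)
  \land \mathit{Bald}(x) \bigr)
```

Because the definite description is no longer treated as a name, "George IV wanted to know whether Scott was the author of Waverley" no longer collapses into "George IV wanted to know whether Scott was Scott".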

[1] pg 322, "Looking At Philosophy: The Unbearable Heaviness of Philosophy Made Lighter", 2005, Palmer
[2] Existence and Description, Bertrand Russell from "Metaphysics: an anthology" by Jaegwon Kim, Ernest Sosa - 1999

  

Thursday, February 7, 2008

Is there anything left after all the roles are stripped away?

While reading Looking at Philosophy[1], I came across its discussion of Kierkegaard saying that there is some existential being that is left when all the "roles" of that being are stripped away.  Given my advocacy in Existential Programming of promoting "roles" over "is-a subclassing", this raises a deep question: if we have factored all state and behavior out of a class, moving them into various roles, how do we know when it is time to create a new instance of a "thing" to which roles will be attached?


For example, in a language like JavaScript, one can create an empty object instance and dynamically add attributes and methods to it later.  This capability can be used to implement "mixin" classes that each encapsulate the properties and behavior associated with some role.
While it is true that we can program the operations of instantiating an empty (i.e. essence-less) object and grafting in the mixin classes for each role we expect it to fulfill, the Philosophical question is: how could we know that a new blank instance is needed if we didn't have a particular "thing" in mind in the first place? I.E. doesn't a thing still have to start as a particular "kind" of thing in order to know when it is time to create a new one?
In still other words, is there an actual case of a "thing" that consists only of roles? Are we sure that one of those roles isn't a thing itself?  Does the technology give us the capability to do something that makes no ontological sense?
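
A minimal sketch of the JavaScript technique described above (my own illustration; the role mixins and their members are hypothetical): start from an essence-less empty object and graft role mixins onto it.

```javascript
// Sketch: an "essence-less" empty object that acquires everything from role mixins.
// The EmployeeRole/CustomerRole mixins and their members are illustrative only.

const EmployeeRole = {
  hire(employer) { this.employer = employer; },
  describeJob() { return `works for ${this.employer}`; }
};

const CustomerRole = {
  openAccount(vendor) { this.vendor = vendor; },
  describeAccount() { return `buys from ${this.vendor}`; }
};

// Start with a blank instance -- but note the philosophical catch: we already had
// to decide that "something" warranted a new object before any role was attached.
const something = {};
Object.assign(something, EmployeeRole, CustomerRole);

something.hire("Acme Corp");
something.openAccount("Widgets-R-Us");
console.log(something.describeJob());     // "works for Acme Corp"
console.log(something.describeAccount()); // "buys from Widgets-R-Us"
```

The code happily creates a thing that is nothing but its roles; whether that makes ontological sense is exactly the question above.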

[1] "Looking At Philosophy: The Unbearable Heaviness of Philosophy Made Lighter",2005, Palmer

Thursday, January 24, 2008

Word Matters: Words Matter

In my earlier blog entries, I noted the debate over whether words have meanings of their own, or are only references to things. It was noted that while "rose" and "gulaab" work equally well for speakers of English and Urdu, "Superman" was not interchangeable with "Clark Kent". This was because "Superman" and "Clark Kent" each referenced different aspects of the same entity, rather than both referencing the same entity as a whole.

However, the actual words still didn't matter in that discussion. While Clark Kent was different than Superman, the actual word "Superman" didn't matter...it could have been "Clark Kent is different than Foo" (if everyone called the man of steel "Foo" instead of Superman).

After reading Salt: A World History[1], I've decided that words DO have their own meaning because of all the connotations, and rich web of connections with history, culture, and language, that go along with each. The book is chock full of nuggets like the following (paraphrased):

The first of the Roman roads, the Via Salaria (i.e. Salt Road), was built to bring salt not only to Rome but across Italy. It was important because at times Roman soldiers were paid in salt, which was the origin of the word salary and the expressions "worth his salt" and "earning his salt". The Latin word for salt (sal) became the French word for pay (solde), and that was the origin of the word soldier.

The book makes it obvious that the word Salt couldn't really be much different given its linguistic lineage. Words DO matter because each has a whole slew of connections to other things, and a past history, because each is part of a naturally evolved language. Words evolve from other words; they were not picked out of the air. There is a family tree of related words just like there is a family tree of organisms with related DNA.

Knowing the rich set of interconnections of words, their origins, history, etc makes a word like Salary carry many connotations that "foo" would not. A whole semantic network of related concepts lights up in my mind with the word Salary that would not have if the word for salt was foo.

[1] Salt: A World History, by Mark Kurlansky, 2002

[Ed. Note - 12/11/12: as per my disclaimers, once I start looking for my epiphanies on the net, I find them. E.G. in this case, see "Lecture 26: Culture, Hermeneutics, and Structuralism". Congrats Bruce, you've just discovered what Gadamer said a century ago: language itself is a historical accumulation, each term carrying its past usages which are part of its meaning.]