Showing posts with label fuzzy.
Sunday, August 8, 2010
Neural Nets, Vagueness, and Mob Behavior
In response to the following question on a philosophy discussion board, I wrote the short essay below, which I reproduce here.
"It was then that it became apparent to me that these dilemmas – and indeed, many others – are manifestations of a more general problem that affects certain kinds of decision-making. They are all instances of the so-called ‘Sorites’ problem, or ‘the problem of the heap’. The problem is this: if you have a heap of pebbles, and you start removing pebbles one at a time, exactly at what point does the heap cease to be a heap?"VAGUE CONCEPTS
This leads to the entire philosophy of "vagueness". i.e. are there yes/no questions that don't have a yes/no answer? Are some things like baldness vague in essence, or, is our knowledge merely incomplete? e.g. we don't know the exact number of hairs on your head, and/or, we don't know/agree on the exact number of hairs that constitutes the "bald" / "not bald" boundary?
NEURAL NETS
My personal conclusion is that there ARE many vague concepts that we have created that are tied to the way our brains learn patterns (and, as a side effect, how we put things into categories). In contrast to rational thought (i.e. being able to demonstrate logically step by step our conclusions), we "perceive" (ala Locke/Hume/Kant) many things without being able to really explain how we did it.
In Artificial Intelligence, there are "neural network" computer programs that simulate this brain-neuron style of learning. They are the programs that learn how to recognize all different variations of a hand-written letter "A" for example. They do not accumulate a list of shapes that are definitely (or are definitely not) an "A", but rather develop a "feel" for "A"-ness with very vague boundaries. They (like our brains) grade a letter as being more or less A-like. It turns out that this technique works much better than attempting to make rational true/false rules to decide. This is the situation that motivates "fuzzy logic" where instead of just true or false answers (encoded as 1 or 0), one can have any number in-between, e.g. 0.38742 (i.e. 38.7% likely to be true).
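As a toy illustration of such degrees of truth, here is a minimal sketch (in Java) of a membership function for a vague predicate like "bald"; the class name and hair-count thresholds are made-up assumptions, not empirical values.

public final class FuzzyBaldness {

    // 1.0 = definitely bald, 0.0 = definitely not bald; values in between are the fuzzy zone.
    // The thresholds below are illustrative assumptions, not empirical values.
    public static double degreeOfBaldness(int hairCount) {
        final int definitelyBaldBelow = 1_000;
        final int definitelyNotBaldAbove = 100_000;
        if (hairCount <= definitelyBaldBelow) return 1.0;
        if (hairCount >= definitelyNotBaldAbove) return 0.0;
        // Linear ramp between the two thresholds.
        return 1.0 - (hairCount - definitelyBaldBelow)
                   / (double) (definitelyNotBaldAbove - definitelyBaldBelow);
    }

    public static void main(String[] args) {
        System.out.println(degreeOfBaldness(500));     // 1.0
        System.out.println(degreeOfBaldness(50_000));  // roughly 0.505
        System.out.println(degreeOfBaldness(150_000)); // 0.0
    }
}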
WISDOM OF THE CROWD?
Because each person has their own individually-trained "neural net" for a particular perception (e.g. baldness, redness, how many beans are in that jar?), we each come up with a different answer when asked about it. However, the answers do cluster (in a bell-curve-like fashion) around the correct answer for things like "how many beans". This is what led Galton to originally think that there was "wisdom in the crowd". This idea has been hailed as one of the inspirations for the new World Wide Web (aka Web 2.0). The old idea was that McDonalds should ask you if "you want fries with that?" to spur sales. The new Web 2.0 idea is that Amazon should ask you if you want this OTHER book based on what other people bought when they bought the book you are about to buy. I.e. the crowd of Amazon customers knows what to ask you better than Amazon itself.
The problem is that there are many failures of "crowd wisdom" (as mentioned in that Wikipedia page in the link above). My conclusion is that most people advocating crowd wisdom have not realized that it is limited to "perceptions". Many Web 2.0 sites are instead asking the crowd about rational judgments, expecting them to come up with a better answer than individuals. The idea of democracy (i.e. giving you the right to vote) has been confused with voting guaranteeing the best answer, no matter the question. In fact, Kierkegaard wrote against "the crowd" over 160 years ago, recognizing that individuals act like witnesses to an event, whereas people speaking to (or as a part of) a crowd speak what we would now call "bullshit" because they are self-consciously part of a crowd. We can see this in the different results of an election primary (a collection of individuals in private voting booths) versus caucuses where people vote in front of each other. So, Web 2.0 sites (Facebook, MySpace, blog Tag Clouds, etc) that allow people to see the effect on other people of what they are saying are chronicling mob mentality rather than collecting reliable witness reports.
BTW, I have written several blog posts related to vagueness, for example:
http://existentialprogramming.blogspot.com/2010/03/model-entities-not-just-their-parts.html
Labels:
fuzzy,
philosophy,
POSTSCRIPT
Sunday, June 6, 2010
Fuzzy Unit Testing, Performance Unit Testing
The topic of testing came up in my recent post "Is Morality Eating Your Own Dogfood?", which prompted me to finally publish the following notebook entry from 2007...
In reading Philosophy 101, about Truth with a capital "T" and the non-traditional logics that use new notions of truth, we of course arrive at Fuzzy Logic, with its departure from simple binary true/false values and its embrace of an arbitrarily wide range of values in between.
Contemplating this gave me a small AHA moment: Unit Testing is an area where there is an implicit assumption that "Test Passes" has either a true or false value. How about Fuzzy Unit Testing, where some numeric value in the 0...1 range reports a degree of pass/fail-ness, i.e. a percentage pass/fail for each test? For example, tests of algorithms that predict something could be given a percentage pass/fail based on how well the prediction matched the actual value. Stock market predictions, bank customer credit default prediction, etc. come to mind. This sort of testing of predictions about future defaults (i.e. credit grades) is just the sort of thing that the Basel II accords are forcing banks to start doing.
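As a sketch of what such a fuzzy test could look like, here is a minimal Java example that scores a prediction on a 0...1 scale and still makes a thresholded pass/fail decision for reporting; the class name, method names, and numbers are hypothetical illustrations, not part of any existing test framework.

public final class FuzzyAssert {

    // Map the relative error between a prediction and the actual value onto a
    // 0..1 "degree of pass" (1.0 = perfect prediction). Other scoring functions
    // could obviously be substituted.
    public static double fuzzyScore(double predicted, double actual) {
        if (actual == 0.0) {
            return predicted == 0.0 ? 1.0 : 0.0;
        }
        double relativeError = Math.abs(predicted - actual) / Math.abs(actual);
        return Math.max(0.0, 1.0 - relativeError);
    }

    // The score itself is what gets reported/archived; the threshold only decides
    // whether to flag this test run as a failure.
    public static void assertAtLeast(String testName, double score, double threshold) {
        System.out.printf("%s: degree of pass = %.3f%n", testName, score);
        if (score < threshold) {
            throw new AssertionError(testName + " scored " + score + ", below " + threshold);
        }
    }

    public static void main(String[] args) {
        double predictedDefaultRate = 0.042;  // illustrative model output
        double observedDefaultRate  = 0.050;  // illustrative actual outcome
        assertAtLeast("creditDefaultPrediction",
                fuzzyScore(predictedDefaultRate, observedDefaultRate), 0.80);
    }
}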
Another great idea (if I do say so myself) that I had a few years ago was the notion that there is extra meta-data that could/should be gathered as a part of running unit test suites; specifically, the performance characteristics of each test run. The fact that a test still passes, but is 10 times slower than the previous test run, is a very important piece of information that we don't usually get. Archiving and reporting on this meta-data about each test run can give very interesting metrics on how the code changes are improving/degrading performance on various application features/behavior over time. I can now see that this comparative performance data would be a form of fuzzy testing.
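A minimal sketch of capturing that per-test performance meta-data might look like the following, assuming each test body can be wrapped in a Runnable; the CSV file name and format are illustrative assumptions only.

import java.io.FileWriter;
import java.io.IOException;
import java.time.Instant;

public final class TimedTestRunner {

    public static void runAndRecord(String testName, Runnable testBody) throws IOException {
        long start = System.nanoTime();
        boolean passed = true;
        try {
            testBody.run();
        } catch (AssertionError e) {
            passed = false;
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // Append one row per test run; comparing rows across runs is what would
        // reveal a test that still passes but has become, say, 10 times slower.
        try (FileWriter out = new FileWriter("test-run-history.csv", true)) {
            out.write(String.format("%s,%s,%b,%d%n", Instant.now(), testName, passed, elapsedMillis));
        }
    }

    public static void main(String[] args) throws IOException {
        runAndRecord("exampleTest", () -> {
            long sum = 0;                       // stand-in for a real unit test body
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new AssertionError("unexpected overflow");
        });
    }
}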
Labels:
fuzzy,
POSTSCRIPT,
test driven,
testing
Thursday, March 4, 2010
Model Entities, not just their parts
One of the oldest puzzles in Philosophy is the paradox of how something can change and yet still be considered the same thing. After all, if “same” is defined as “identical; not different; unchanged”, then how can it “change”? On the other hand, even if I lose that hand (pun intended), I am still the same me. In chapter 5 of Peter Cave’s new book, “this sentence is false”[1], there is a collection of example paradoxes that illustrate how our intuitions about “sameness” are inconsistent. Some paradoxes involve entities (or properties) whose definition is "vague", as in “How many cows make up a herd?” or “At what weight does adding a pound change you into being ‘fat’?” However, here I will be focusing on the change paradoxes involving things with a well defined set of parts. They illustrate the problem with defining something as merely the collection of its parts (unless of course “it” is truly only a collection, and not an entity in its own right).
George Washington's axe
Harry: I have here the very axe with which George Washington chopped down the cherry tree. It’s been used by my family for generations.
Sally: But this says “Made in China”!
Harry: Well, over the years, the handle was replaced each time it wore out. Oh, and the blade’s been replaced a couple of times too.
Sally: But those are the only two parts…that’s not the same axe at all then!!
Ship of Theseus
(original paradox by Plutarch)
Theseus had a ship whose parts were replaced over time such that, at a certain point, no original pieces were left.
How can the latter ship be said to be the same ship as the original if they have no parts in common?
(sequel paradox by Hobbes)
Suppose that those old parts were stockpiled as they were being replaced, and later they were reassembled to make a ship.
NOW, which ship is the same as the original ship; the one with the original parts, or, the one with the replacement parts?
More than the sum of its parts
At the bottom of these paradoxes is the question of whether a thing-made-up-of-parts is the same as the collection of all its parts. I.e., can everything that can be said of the whole thing be equally said of the collection of all its parts, and vice-versa? For 2500 years, western philosophers including Socrates, Plato, and Aristotle, right through to the 21st century, have been debating this question, generating whole libraries of books and papers. In fact, Mereology is an entire field of study that is just about the relationship between parts and their respective wholes.
What does it mean to be an individual?
As discussed (at great length) in the book Parts[2], there is a whole spectrum of things in between “individuals” and “groups”, and they are referred to in everyday language by singular terms (e.g. person), plural terms (e.g. feet), and some words that could mean either (e.g. hair). There are individuals (say, a car), parts of individuals that are themselves individuals (say, a wheel), parts of individuals that are NOT themselves individuals (say, the paint), collections that do not form an individual (say, “the wheels of that car”), collections that DO constitute an individual (say, the car parts that comprise the engine where the engine is itself an individual), and so on, and so on.
A key to distinguishing whether a thing being referred to is truly a thing in its own right (and not just a plural reference masquerading as a single thing) is what sorts of things can be said about it. Orchestra is an ambiguous term because it can be used as a singular or a plural as in “the orchestra IS playing” vs the equally grammatical “the orchestra ARE playing”. If it is considered an individual then we can say things about its creation, its history, etc, whereas the plural use simply denotes a collection of players where not much can be said about “it” apart from the count of players, their average age, etc. Relational Database programmers will recognize individuals as those that get their own record in some entity table, and plurals/sets/collections as equivalent to the result set from some arbitrary query. SQL aggregate functions (like count, average, minimum, maximum, etc) are the only things that can be said about the result set as a whole. Result sets do not get primary keys because they are not a “thing”, whereas real individuals do (or should!) get their own personal identity key. Even when an arbitrary query is made to look like an entity by defining a “view”, it is not always possible to perform updates against the search results because the view is not a real entity.
What does it mean to be the same?
A big problem is that there are many different flavors of “sameness” when we say that A is the same as B. Right off the bat there is a difference between qualitative identity and numerical identity. Two things are qualitatively identical if they are duplicates, like a pair of dice. Two things are numerically identical if they are one and the same thing, like the Morning Star and the Evening Star (both of which are, in fact, really the planet Venus). They are “numerically” identical in that when counting things they only count as one thing. Another complication is the difference between identity right this second and identity over time, which deals with the whole question of how something can be different at two different times and yet still be considered the same thing. For example, you are still considered numerically identical to the you of your youth even though you have clearly changed…although this gets into the even more involved topic of Personal Identity [which may or may not apply to an axe ;-) ]. Traditionally, if x was identical to y, and y was identical to z, then x had to be identical to z. Relative Identity has been proposed such that this need not be true, thus allowing both the morning and evening stars to be identical to Venus but not to each other.
When specifically asking whether the paradoxical ships and axes are numerically identical, as Peter Cave points out, two of our usual criteria for being “one and the same thing” are in conflict. They are (a) being composed of largely the same set of parts, and (b) being appropriately continuous through some region of space and time. The continually refurbished ship meets (b) but the reassembled original parts meet (a).
In traditional logic, as formulated in Leibniz’s Law, two things are the “same” only if everything that can be said about one thing can also be said about the other. In other words, all the properties of each object/entity need to be equal if they are one and the same. By this token, the two axes and the various ships are not the same. Of course, this means that ANY change to ANY property causes the new thing to not be “the same” as the old. To avoid this, others have said that only essential and not accidental properties should be compared. This means that the definitions of “ship” and “axe” should distinguish between those properties that must remain the same throughout the lifetime of the object versus those properties that may change over time.
Java Programmers can relate to the philosophical meanings of "essential" and "accidental" in the following way. [To keep this sidebar simple, think of "entity beans" where only one bean/object/instance is allowed to represent a particular real world entity (e.g. {name=Joe Blow, ssn=123456789})…i.e. there are never multiple object instances in RAM simultaneously representing Joe.] Class definitions could have "essential" properties implemented via constants (i.e. final instance variables initialized in the constructor, à la the Immutable design pattern), while "accidental" properties are implemented via normal instance members.
The essential properties must be final because if their values were different then they would have to be a different individual. E.G. If an instance of class Person has a constant DNA_Fingerprint_Code with value of 1234567890, it would not be correct to change that value on that same object because a person’s DNA both defines them and never changes; i.e. “essential” in the Philosophy sense. The correct procedure would be to create a new instance of Person because it must truly be a different person if it has different DNA. [Of course, this brings up the whole separate topic of the difference between changing a property’s value because it has a truly new value versus merely correcting a mistaken value. Normally, computer software has not been designed to make this distinction even though it would make some systems much more robust, and able to reflect reality better if they did.]
The putative method IsTheSame(Object o) would compare either all properties, or only essential properties, of this and o depending on your philosophy. [This also brings up the whole separate topic of the Java equals() method, and the many potential meanings of “equals” apparent when thinking Philosophically.]
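A minimal sketch of the idea described in this sidebar might look like the following (plain Java, no real entity-bean framework; the class and property names are illustrative only):

public final class Person {

    // "Essential" property: final, set once at construction, never changed.
    // A different DNA fingerprint would have to mean a different person, i.e. a new instance.
    private final String dnaFingerprintCode;

    // "Accidental" property: may change over time without changing identity.
    private String hairColor;

    public Person(String dnaFingerprintCode, String hairColor) {
        this.dnaFingerprintCode = dnaFingerprintCode;
        this.hairColor = hairColor;
    }

    public void setHairColor(String hairColor) {  // accidental change is allowed
        this.hairColor = hairColor;
    }

    // One possible reading of "the same": compare only essential properties, so two
    // records that differ in hair color can still describe one and the same person.
    public boolean isTheSame(Person other) {
        return other != null && this.dnaFingerprintCode.equals(other.dnaFingerprintCode);
    }

    public static void main(String[] args) {
        Person p = new Person("1234567890", "brown");
        p.setHairColor("gray");                       // same person, accidental change
        Person q = new Person("1234567890", "bald");  // another record with the same DNA
        System.out.println(p.isTheSame(q));           // true
    }
}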
So, the particular individual parts of a thing need not all be “essential” properties of that thing, and hence they may change without affecting that thing’s identity. (You are still you even if you lose a leg or lung, but not a head). Well then, what are some potential essential properties of an individual thing? Many advocate taking a look at Aristotle’s “four causes” of a thing, where he defined “cause” as anything that was involved in the creation of that thing. His two main varieties of causes were intrinsic, for causes that are “in the object”, and extrinsic, for those that are not. The two sub-varieties of intrinsic causes were material cause (the material the thing consists of) and formal cause (the thing’s form [OOP programmers think Class]). The two sub-varieties of extrinsic causes were efficient cause (the “who” or “what” that made it happen, or “how”) and final cause (the goal, or purpose, or “why”).
By analyzing the paradoxes using Aristotle’s causes it can be argued that the Ship of Theseus is the same ship, because the form does not change, even though the material used to construct it may vary with time. Also, the Ship of Theseus would have the same purpose, that is, transporting Theseus, even though its material would change with time. The builders and tools used may, or may not, have been the same, therefore, depending on how important the efficient cause is to you, it would make more or less of a difference. So, giving priority in definitions to some causes over other causes can answer riddles like these.
Further more, analyzing the “causes” of a thing’s creation, forces one to agree on when a thing actually comes into and out of existence, how to tell it apart from other similar things, how to count them, how to recognize it again in the future, and so forth. Circularly, Causes also provide justifications for those agreements. These criteria for identity help define the sortal definition of the thing (i.e. knowing how to sort these sorts of things from other sorts of things, and being able to count them on the way).
Case Studies: BigBank “Facilities” and “Customers”
I worked on some projects at "BigBank" (a recently defunct Top-5-in-the-USA bank) where these Philosophy-inspired techniques would have really helped. Here are two case studies that illustrate the problems of modeling the parts but not the wholes.
In the first case study, BigBank (in order to meet new international banking standards) needed to retrofit its computer systems to record and report on its track record in guessing whether loans would eventually be paid off. Each guess took the form of a “default grade” for a package of loans known as a “facility”.
A major problem was that their various systems did not agree on the basic definition of “facility”. This was because the definition of a “facility” went so without saying that no one actually said (in a rigorous way) what it was. Everyone interviewed knew intuitively what one was but couldn’t quite put it into words, and when pressed, it turned out that they all had different definitions from each other. As a result, the various systems around the bank were built with different ontologies (i.e. models of the world). A key problem was that many of BigBank’s systems assumed that Facilities were no more than the collection of their parts, and so only the parts were recorded with no standard place to say things about each Facility as a whole. As a result, it came as a surprise to everyone that there had never been any agreement as to when which parts belonged to which wholes, nor even when any particular whole Facility came into or out of existence. Consequently, BigBank had several different “Facility ID”s, none of which agreed with each other, and hence there was no way to definitively report on the history of any particular Facility.
CASE STUDY: At BigBank, credit grades are calculated for "facilities". A facility is a collection of "obligations" (i.e. loans, lines of credit) that are being considered together as a single deal and graded as a whole. The particular set of obligations grouped into each "facility" changes over time as individual obligations get paid off or expire. Plus, changed or not, the facilities are supposed to be re-graded from time to time. Unfortunately, some key BigBank databases only had records for individual obligations. There was no Facility entity table.
So, for example, whenever a "facility" was (re)graded, in reality, only a set of obligation records were updated, all with the same single “facility-grade”. In fact, other than the loan officer's neurons, there was no record of which obligations had been associated with which "facility" over time. So, when there was a new requirement to store for each facility all its grading documents, there was no place to put them. Even worse, since a Facility entity had never been formally defined, the analysis had never been done to make sure everyone had the same definition of a "facility" (which they didn't). There was no agreement on what the thing being graded actually was! For some, each individual grading event was considered a "facility" (along with its own "facility ID") because "the grading sheet is what is graded".
A second case study (which I detailed back in 2006) involves BigBank's treatment of customer information. Some BigBank systems defined Customer entities and assigned a single ID for each one, but other systems gave the same person or corporation a different ID in each state and called them Obligors. Once again, some systems modeled only the wholes (i.e. customers) and other systems only modeled the parts (i.e. obligors). And once again, because the systems working at the parts level did not tie them together as a whole, there was disagreement about which obligors belonged to which customers. It had become so bad that the data model had to be changed to allow multiple customers to be tied to a single obligor, lest conflicting data feeds go unprocessed. It was like having Person records and BodyPart records, but needing to kludge in the ability to have multiple people associated with the same particular foot!
[1] chapter 5, this sentence is false, Peter Cave, 2009, Continuum, ISBN: 9781847062208
[2] Parts, Peter Simons, 1987, Oxford University Press
[3] Introducing Aristotle, Rupert Woodfin and Judy Groves, 2001
Labels:
accidental/essential,
case study,
database,
definitions,
equals,
fuzzy,
identity,
language,
ontologies,
parts,
philosophy,
POSTSCRIPT,
programming,
views
Saturday, November 10, 2007
Subjective, Objective, Relative, Existential
In the imposing, but handy, Oxford Companion to Philosophy[1], there are entries about "objectivism and subjectivism"[2], and "relativism, epistemological"[3] that lead to the following observations:
- Objectivism says that some statements are objective, in that they are true independent of anyone's opinion. e.g. This ball is red. Alternatively, values assigned to properties can be dependent on other factors and be described by a function rather than a simple value. e.g. The color of Ayer's Rock is F(time-of-day, weather).
- Subjectivism says that (potentially all) statements are subjective, in that they are dependent on the opinion of the person making the statement. e.g. This cake is delicious.
- Relativism says that statements are always subjective even when the decider thinks he's made an objective evaluation. I.E. no evaluations are objective because man is always biased by his particular cultural, historical, religious, etc viewpoint, and no particular viewpoint is "the right one". e.g. This society is primitive.
- Existential Programming philosophy says that even if something is supposedly scalar & objective, and even if one does not subscribe to relativism (which would imply there is no need to ascribe a particular data source to values), the reliability of any particular data source is never perfect, and thus one needs to model data as if relativism were true. I.e., keep track of "says who" for each "fact", and therefore be prepared to simultaneously handle/store multiple values for everything, tagging them with "says who", "said when", etc. So, in effect, there are no scalar values, only functions with at least a data source as a parameter (see the sketch below).
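A minimal sketch of what keeping track of "says who" could look like in code follows (Java; the class and field names are illustrative assumptions, not an existing library):

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public final class AttributedFact {
    final String subject;    // e.g. "AyersRock"
    final String property;   // e.g. "color"
    final String value;      // the claimed value
    final String saysWho;    // the data source making the claim
    final Instant saidWhen;  // when the claim was recorded

    AttributedFact(String subject, String property, String value,
                   String saysWho, Instant saidWhen) {
        this.subject = subject;
        this.property = property;
        this.value = value;
        this.saysWho = saysWho;
        this.saidWhen = saidWhen;
    }

    public static void main(String[] args) {
        // Multiple, possibly conflicting, values for the same property are all kept,
        // each tagged with its source, instead of a single scalar value.
        List<AttributedFact> facts = new ArrayList<>();
        facts.add(new AttributedFact("AyersRock", "color", "red", "touristBrochure", Instant.now()));
        facts.add(new AttributedFact("AyersRock", "color", "orange", "satelliteImageFeed", Instant.now()));
        facts.forEach(f -> System.out.println(
                f.subject + "." + f.property + " = " + f.value + " (says " + f.saysWho + ")"));
    }
}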
[2] ibid, pg 667
[3] ibid, pg 800
Labels:
database,
existential programming,
fuzzy,
ontologies,
philosophy
Tuesday, October 16, 2007
Relativism, Absolutism, and Existential Programming
In the handy Philosopher's Toolkit book[1], there is a section[2] explaining the difference between relative statements and absolute statements (and similarly relativism and absolutism). As a prototypical example, it explains how before Einstein, the "time" an event occurred was considered an absolute statement. In other words, the whole universe would know what it meant because time was the same everywhere (just different time zones). However, Einstein revealed that time is relative to the location and speed of the observer and can't be the same everywhere. Plus, since there is no place and speed that could/should be considered the "official" one, all "times" are equally valid.
Because there are many aspects of reality and opinion that are considered relative by some number of people, Existential Programming counts this as yet another reason to embrace/support multiple ontologies simultaneously. Absolute vs Relative points of view are yet another aspect of modeling the world that traditional object-oriented and relational database modeling make assumptions about.
[1] The Philosopher's Toolkit, Julian Baggini and Peter S. Fosl, Blackwell Publishers, 2003, ISBN: 0631228748
[2] ibid, section 4.2
Sunday, September 9, 2007
Quantum Math for Fuzzy Ontologies
In my earlier post "Existential Programming as Quantum States", I mused that objects that were simultaneously carrying properties from multiple ontologies (i.e. multiple class hierarchies or data models) were like Quantum States in Quantum Physics. This led me later to wonder what math had been developed to work with quantum states...i.e. is there some sort of quantum algebra that might be applicable to Existential Programming? It is needed because, in Existential Programming, a property of an object might carry multiple conflicting values simultaneously, each with varying degrees of certainty or confidence or error margins.
I found the Wikipedia page on Quantum indeterminacy, which looks applicable:
Quantum indeterminacy can be quantitatively characterized by a probability distribution on the set of outcomes of measurements of an observable. The distribution is uniquely determined by the system state, and moreover quantum mechanics provides a recipe for calculating this probability distribution.
Indeterminacy in measurement was not an innovation of quantum mechanics, since it had been established early on by experimentalists that errors in measurement may lead to indeterminate outcomes. However, by the later half of the eighteenth century, measurement errors were well understood and it was known that they could either be reduced by better equipment or accounted for by statistical error models. In quantum mechanics, however, indeterminacy is of a much more fundamental nature, having nothing to do with errors or disturbance.
AHA! It dawns on me that going beyond the mere fuzzy logic idea of values having a probability or certainty factor, Existential Programming could have a fuzziness value for the property as a whole...as in "it is not certain that this property even applies to this object"...and even further it could mean "it is not certain that this property even applies to the entire Class". A FUZZY ONTOLOGY: a method of associating attributes/relationships with entities where the entity each attribute belongs to is not conclusively known. The value of a property may be certain (i.e. not vague or probabilistic), but whether that property belongs to this object is fuzzy.
Why would you want that ability? How about data mining web pages where several people's names and a single birth-date (or phone number, address, etc) are found. Even though it isn't known which person's name is associated with the birthday, one could associate the birth-date with each person with some fractional probability. With enough out of focus wisps of data like this, from many web pages, the confidence factor of the right birthdate with the right person would rise to the top of the list of all possible dates (analogous to the way that very long range telescopes must accumulate lots of individual, seemingly random, photons to build up a picture of the stars/galaxies being imaged). The fractional probability assigned could be calculated with heuristics like "lexical-distance-between-age-and-name is proportional to the probability assigned". This could make the "value" of a scalar property (like birth-date), in reality, the summarization of a complete histogram of values-by-source-web-pages.
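A minimal sketch of that accumulation idea might look like the following (Java; the 1/(1+distance) weighting and all names are illustrative assumptions, not any real data-mining library):

import java.util.HashMap;
import java.util.Map;

public final class FuzzyBirthdateAccumulator {

    // person -> (candidate birth-date -> accumulated evidence weight)
    private final Map<String, Map<String, Double>> scores = new HashMap<>();

    // One web page mentions a person at some lexical distance from a birth-date;
    // closer mentions contribute more weight (1 / (1 + distance) is just one heuristic).
    public void observe(String person, String birthDate, int lexicalDistance) {
        double weight = 1.0 / (1.0 + lexicalDistance);
        scores.computeIfAbsent(person, p -> new HashMap<>())
              .merge(birthDate, weight, Double::sum);
    }

    // The candidate whose accumulated weight has "risen to the top of the list".
    public String mostLikelyBirthDate(String person) {
        return scores.getOrDefault(person, Map.of()).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        FuzzyBirthdateAccumulator acc = new FuzzyBirthdateAccumulator();
        acc.observe("Joe Blow", "1970-01-01", 3);   // page 1: date appears near Joe's name
        acc.observe("Joe Blow", "1971-05-05", 40);  // page 2: date appears far from Joe's name
        acc.observe("Joe Blow", "1970-01-01", 8);   // page 3: corroborating wisp of evidence
        System.out.println(acc.mostLikelyBirthDate("Joe Blow"));  // prints 1970-01-01
    }
}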
Labels:
existential programming,
fuzzy,
logic,
mathematics,
ontologies,
POSTSCRIPT,
quantum
Friday, August 31, 2007
Fuzzy Unit Testing, Performance Unit Testing
In reading Philosophy 101, about Truth with a capital "T", and the non-traditional logics that use new notions of truth, we of course arrive at Fuzzy Logic with its departure from simple binary true/false values, and embrace of an arbitrarily wide range of values in between.
Contemplating this gave me a small AHA moment: Unit Testing is an area where there is an implicit assumption that "Test Passes" has either a true or false value. How about Fuzzy Unit Testing where there is some numeric value in the 0...1 range which reports a degree of pass/fail-ness? i.e. a percentage pass/fail for each test. For example, testing algorithms that predict something could be given a percentage pass/fail based on how well the prediction matched the actual value. Stock market predictions, bank customer credit default prediction, etc come to mind. This sort of testing of predictions about future defaults (i.e. credit grades) is just the sort of thing that the Basel II accords are forcing banks to start doing.
Another great idea (if I do say so myself) that I had a few years ago was the notion that there is extra meta-data that could/should be gathered as a part of running unit test suites; specifically, the performance characteristics of each test run. The fact that a test still passes, but is 10 times slower than the previous test run, is a very important piece of information that we don't usually get. Archiving and reporting on this meta-data about each test run can give very interesting metrics on how the code changes are improving/degrading performance on various application features/behavior over time. I can now see that this comparative performance data would be a form of fuzzy testing.
Labels:
fuzzy,
test driven,
testing
Sunday, January 21, 2007
Imaginary Numbers paradigm for Existential Programming
It occurs to me that one of the things Existential Programming hopes to enable is the ability to keep working with data that is vague, fuzzy, or semi-inconsistent, instead of screeching to a halt as would happen with a strongly-typed implementation of a single ontology.
An analog to this is the invention (discovery?) of Imaginary Numbers in mathematics. The imaginary number "i" is defined to be the square root of -1. Now the mildly mathematical reader will note that no real number can be the square root of a negative number, because squaring any real number never gives a negative result. So, when early mathematicians came to a point in their formulas where a square root of a negative number was required, they were stuck. By creating a way to talk about and manipulate numbers that "can't exist" (i.e. imaginary numbers), formulas could be worked through such that "real" answers could eventually emerge.
By developing techniques to work with data that is not consistent with a single ontology (i.e. existential programming), programs can get past the "that's not legal data" stage and work their way to answers that ultimately do result in "legal data".
Labels:
epiphanies,
existential programming,
fuzzy,
mathematics
Monday, July 17, 2006
Existential Programming as Quantum States
In reading about Quantum States in Wikipedia...
"In quantum physics, a quantum state is a mathematical object that fully describes a Quantum system. One typically imagines some experimental apparatus and procedure which "prepares" this quantum state; the mathematical object then reflects the setup of the apparatus. Quantum states can be statistically mixed, corresponding to an experiment involving a random change of the parameters. States obtained in this way are called mixed states, as opposed to pure states, which cannot be described as a mixture of others. When performing a certain measurement on a quantum state, the result generally described by a probability distribution, and the form that this distribution takes is completely determined by the quantum state and the observable describing the measurement. However, unlike in classical mechanics, the result of a measurement on even a pure quantum state is only determined probabilistically. This reflects a core difference between classical and quantum physics.
Mathematically, a pure quantum state is typically represented by a vector in a Hilbert space. In physics, bra-ket notation is often used to denote such vectors. Linear combinations (superpositions) of vectors can describe interference phenomena. Mixed quantum states are described by density matrices."
...I was struck by the analogy with Existential Programming which proposes that objects hold multiple values for various properties (and in fact multiple sets of properties, hence, multiple ontologies) simultaneously.
Unlike Quantum States however, reading one set of values doesn't make the other sets vanish! ;-)
Labels:
epiphanies,
existential programming,
fuzzy,
mathematics,
ontologies,
quantum,
vector space
Wednesday, June 7, 2006
Ontology Mismatch Case Study: Customers & Obligors
Here is a real world example (from a major bank) of the problems of ontology mismatches between different silo systems whose data must nevertheless be integrated. Some systems have the concept of "customer" and implement a customer entity and customer key. Other systems (which do not talk to each other, i.e. there is no universal "system of record" for person or legal entity) have a customer concept, but they are distributed geographically and have a different key for each state or regional location, and there they are called "obligors". So, customers-to-obligors should be a simple one-to-many relationship.
However, since errors are made by the automated contact address parsing algorithms that try to figure out which customer is associated with which obligor, multiple customers can be associated with a single obligor. Hence, customers and obligors have a many-to-many relationship, and therefore, customers are many-to-many within themselves! Obligors are many-to-many within themselves! Customers not only have duplicates for the same person, they don't always represent a definite person or even set of definite people. They are vague and refer to parts of multiple people. Customers are effectively anything with a customer ID! Very existential.
A particular obligor (which, again, should be a particular customer in a particular location) was linked with three customers: JoeBlow, JaneBlow, a-customer-with-Jane's-name-and-Joe's-SSN! To make things worse, the attempt to clean up customers by defining them as a role of a "legal entity" didn't work in this case because the "customer" was really a married-household which was not a "legal entity" because it doesn't have its own tax id! Even worse, the rationale that legal entities are those things that are separately liable for money demands ignores the fact that both parties in a married household are liable (but even then differing on a state by state basis). Whew!
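To make the shape of the problem concrete, here is a minimal sketch (class and field names are mine, not the bank's) of how the intended one-to-many model degenerates into the many-to-many tangle described above once the address-parsing errors creep in.

    import java.util.*;

    // What the model was supposed to be: one customer, many regional obligors.
    // What the parsing errors actually produce: a many-to-many tangle, so a "customer"
    // is effectively just whatever hangs off a customer ID.
    class Customer {
        final String customerId;                       // the only thing guaranteed about a customer
        final Set<Obligor> obligors = new HashSet<>(); // should be "all of MY regional records"...
        Customer(String customerId) { this.customerId = customerId; }
    }

    class Obligor {
        final String obligorId;
        final String region;
        final Set<Customer> linkedCustomers = new HashSet<>(); // ...but one obligor gets linked to several "customers"
        Obligor(String obligorId, String region) { this.obligorId = obligorId; this.region = region; }
    }

In the JoeBlow/JaneBlow case, a single Obligor ends up with three entries in linkedCustomers, which is exactly the vagueness described: the only thing a "customer" reliably is, is whatever hangs off a customer ID.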
Labels:
bank,
case study,
existential programming,
fuzzy,
ontologies,
roles
Object Orientation's Ontological Assumptions
Once one realizes that Object Oriented Programming is isomorphic with Semantic Networks[1][2], and one is cognizant of the meta-data it takes to represent imperfect data from a variety of sources (e.g. data mining the WWW), it becomes clear that OOP makes several large assumptions when modeling the world. These assumptions lie at the root of many problems mapping OO models to relational E/R data models.
The Class hierarchy defined in an OO program represents a model of entities, their attributes, and their relationships with other entities; i.e. an Ontology. Unlike modern semantic network approaches, where it is clear that a multiplicity of ontologies must be recognized and mediated between, OO Classes implicitly assume that they are "the only model", "the correct model", "the universal model". Some assumptions of OO, as normally practiced, are...
- Only a single ontology is supported. OOP needs a way of mixing Class hierarchies where each is a different perspective on the same "thing(s)".
- No model exists for describing the author of the ontology. It is potentially implied by its [Java] package name (when that concept applies), but beyond that there is no way to represent the "reliability" of the author, of this particular model, or of a particular set of data values associated with this model.
- No model exists of whether particular values of Object attributes are "true", "up to date", "not vague", or "not fuzzy" (i.e. clusters of possible points, each with its own probability).
- No concept of object instances overlapping; each object either exists or not; objects don't "partially overlap" each other; objects exist in a single place in a single "copy". In other words, OOP doesn't distinguish between "a thing" and some number of (potentially imprecise) "representations" of that thing.
- The Class hierarchy is assumed to be the only way to classify/divide the world into "things" (or at least the "things" that those classes model).
- An instance of Class X is assumed to be a member of the set of all Xs in the world. I.E. OOP doesn't have a way to say "I've created an object instance, but whether it is a member of the class of all Xs is not settled by the fact that it was instantiated from Class X at birth." OOP doesn't support an agnostic attitude towards class/type membership. In still other words, Essence precedes Existence! (A small sketch of the alternative follows this list.)
- The values of all entity attributes (aka an object instance) are assumed to be available in a single contiguous location. I.E. OOP can't normally handle attribute values being spread all over creation (as would be the case for data mined about someone via web page searches). OOP can't normally handle taking widely different amounts of time to retrieve different attributes (as would be the case in data mining operations).
[2] http://www.semanticresearch.com/semantic/index.php
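One way to picture the "Essence precedes Existence" point above is this minimal sketch (all names are hypothetical): an object is just a bag of attributes, and "being a Customer" is a test that a viewer chooses to apply at read time, rather than a fact fixed forever at construction.

    import java.util.*;

    // Sketch: a schema-less "thing" plus a viewer-chosen "view" test.
    // Membership in a class is decided when someone looks, not when the object is built.
    class Thing {
        final Map<String, Object> attributes = new HashMap<>();
    }

    interface View {
        boolean appliesTo(Thing t);   // "is this thing an X, as far as I am concerned?"
    }

    class CustomerView implements View {
        // One viewer's notion of "customer": it has a customer ID and a name.
        public boolean appliesTo(Thing t) {
            return t.attributes.containsKey("customerId") && t.attributes.containsKey("name");
        }
    }

Traditional OO bakes that test into the constructor call; most of the assumptions listed above are, one way or another, consequences of that early binding.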
Labels:
accidental/essential,
equals,
existential programming,
fuzzy,
ontologies
Three Levels of "Existential-ness" Support?
In thinking about how one would build "a language" and/or tools to support Existential Programming, there seemed to be three increasing levels into which to sort features. (A small code sketch of the Level I mapping idea follows the list below.)
Level I - Model Mapping
- Make it easy to map Object-Oriented models to Entity-Relationship models to Semantic-Network models. I.E. implement an OO persistence layer in the style of the EAV approach to semantic network databases. Implement auto-translation of data in traditional E/R tables into EAV records. Implement auto-loading of data into the OO model from arbitrary EAV tuples (and therefore arbitrary relational tables). In other words, automated persistence with automatic data mapping.
- Make it easy to accept ontologies and data from multiple sources; i.e. not just relational database. Example data sources could be: Web searches, Enterprise Silo systems, etc. In other words, build common adapters and mediators to broaden the reach of the "language" beyond structured local databases.
- "Consider the source". Make it easy to associate fuzzy logic factors to data-assertions and ontology-assertions of all granularities, based on the source of the data, the ontology, and even the assertions themselves. Examples are: for any given attribute value, "say's who?", "said when", "how reliable is this source?", "how reliable is this source for this attribute?", "who says that this attribute even applies to this class of thing", "how reliable is the source about the ontology definitions?". I want to be able to encode: "Sam is 89% trustworthy about colors", "Joe lies about AGEs", "Harry is 100% reliable when he says that Joe lies about AGEs", etc.
- Make it easy to handle attribute values that are themselves fuzzy. I.E. Probabilistic attribute values, conflicting values, cluster values, vague values, time varying values, outdated values, missing values, values whose availability is defined by some set of limits on the effort expended in finding the value (e.g. find all values of phone for joe blow that can be found within 10 seconds real time).
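Here is a minimal sketch of the first bullet, using only the standard library (EAVTuple and EavMapper are hypothetical names): flatten any object into Entity-Attribute-Value tuples via reflection, so the same persistence code works for any class, and any relational row rendered as tuples can flow the other way.

    import java.lang.reflect.Field;
    import java.util.*;

    // Hypothetical Level I mapper: any object -> a list of (entity, attribute, value) tuples.
    record EAVTuple(String entityId, String attribute, Object value) {}

    class EavMapper {
        static List<EAVTuple> toTuples(String entityId, Object obj) throws IllegalAccessException {
            List<EAVTuple> tuples = new ArrayList<>();
            for (Field f : obj.getClass().getDeclaredFields()) {
                f.setAccessible(true);                       // read private fields too
                tuples.add(new EAVTuple(entityId, f.getName(), f.get(obj)));
            }
            return tuples;
        }
    }

Going the other direction (arbitrary EAV rows back into objects) is where the "consider the source" bullet starts to matter, since two tuples about the same entity may disagree.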
Labels:
existential programming,
fuzzy,
language,
tools
Monday, June 5, 2006
The Original Epiphanies of Existential Programming
The items below are a summary of the several AHA! moments I had over the May/June 2006 time frame. [see my std disclaimers]
It began with contemplating how Object-oriented modeling, and Entity-relationship modeling, and Semantic Network modeling are all isomorphisms of each other. Next I realized that O/O and E/R models are way too rigid because they expect a single "correct" model to work, whereas Semantic modelers pretty much know it is futile to expect everyone to use a single ontology! So, where would it take us to explore doing O/O and database development with that in mind? Next I had the intuition that Philosophy (with a capital P) probably had something to say about this topic and so I started reading Philosophy 101 books to learn at age 50 what I never took in college. It quickly became obvious that Philosophy has SO MUCH to say about these topics that it is criminal how little explicit reference to it there is in the software engineering literature.
- When mapping Object Oriented classes to semantic networks I realized that CLASSES/SUBCLASSES etc were the same as sets of semantic-relationship-triples (Entity-Attribute-Value aka EAV records) and therefore a class hierarchy formed an ontology (as used in the semantic network/web/etc world). AHA! It is futile to get everyone to agree upon ONE ontology (from my experience), SO, that is why it is a false assumption of O/O that there can/should be a single Class hierarchy. But, all O/O languages fundamentally assume this which is why they are hard to map to relational databases. Databases explicitly provide for multiple "views" of data. And in Enterprise settings, where there are often multiple models (from different stovepipe systems) of the same basic data, this causes even more of a mismatch with the single object model.
- Mapping O/O Class hierarchies to DB E/R models to Semantic Networks brings up questions about the meaning of Identity (with a capital I) and Essential vs Accidental properties. AHA! This sounds like Philosophy (which had I not started reading about before transcribing these notes into a blog, I would have not known terms like Essential and Accidental and Identity with a capital I to even use them here), SO, it would be worth learning Philosophy to improve my Software Engineering and Computer Science skills.
- Having now worked with both Java and Javascript deeply enough to understand class versus prototype based languages (see my AJAX articles), I see that Java is like Plato's view of the world, and Javascript is more like Existentialism (where an object can be instantiated without saying what "type" it is).
- Web pages can be thought of as a database whose data model/ontology is implied. Data mining can be done on it where the URL and the "time of last update" are added to each EAV tuple extracted from the page, extending a normal EAV "fact" with a "says who?" dimension and a temporal dimension. In order to really capture all the nuances of the data mined from the web, a standard data model ala O/O or E/R models also has to add some model of the following (a small sketch follows this sub-list):
- completeness
- accuracy
- different values at different points of time
- not only "say's who?" but "say's how?" i.e. which ontology is being used implicitly or explicitly
- only some attributes of a "thing" are being defined on any given URL
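A minimal sketch of such an extended tuple (field names are my own invention): the plain (entity, attribute, value) fact picks up a "says who?" dimension, a "says how?" dimension (which ontology), a temporal dimension, and a crude accuracy estimate.

    import java.time.Instant;

    // Hypothetical web-mined fact: a plain EAV triple extended with provenance and time.
    record MinedFact(
            String entityId,       // which thing this fact is about (per OUR identity criteria)
            String attribute,      // e.g. "color"
            Object value,          // e.g. "green"
            String ontologyId,     // "says how?" -- whose definitions of entity/attribute are in play
            String sourceUrl,      // "says who?"
            Instant lastUpdated,   // when the page claimed it
            Instant retrievedAt,   // when we scraped it
            double confidence      // crude accuracy estimate for this one assertion, 0..1
    ) {}

Completeness, by contrast, is a property of the whole set of facts gathered about an entity rather than of any single tuple.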
- O/O languages could/should be extended to make it easier to work with arbitrary sets of semantic network relationships/tuples such that it could handle integration of various (E/R, Enterprise, web page, data mining) data models.
- Google, Homeland Security, Corporate data warehouses all would benefit from being able to work with "everything we know about X". This could be a good technique to integrate disparate data sources.
- O/O languages need to be more like Javascript in letting any set of attributes be associated with an object, where "classes" are more like "roles" or interfaces that the VIEWER chooses instead of tightly coupling the attribute set to a predefined list. The VIEW chosen by the viewer/programmer can still be type-safe once chosen, BUT it can't assume the source of the data used the same "view".
- "View" (see above) includes all aspects of traditional classes PLUS parameters for deciding trustworthiness, deciding the "identity" of the thing that attributes are known about, and all other "unassumable" things. An O/O language could set defaults for these parameters to match the assumptions of traditional programming languages.
- Searching the web and trying to integrate the data is much like trying to integrate the data from disparate silo systems into a single enterprise data model or data warehouse. They both need to take into account where each data value came from, how accurate/reliable those sources are, and how their ontologies map to each other and accumulate attributes from different sources about the same entity.
- When dealing with the sort of non-precise, non-reliable values of object properties as found on the web, the following are needed as a part of the "ontology" defined to work with that data (a small sketch follows this sub-list):
- Equality test should return a decimal probability (0..1) rather than a true/false value
- Find/Search operations should allow specification of thresholds to filter results
- Property "getters" become the same as "find" operations
- The result of a get/find is a set of values, each of which includes a source-of-record & time/space region, i.e. says who?, and when and where was this true?
- Property "setters" should accept parameters for source-of-record-spec, time/space region, data freshness, as well as probability factor, or other means of specifying cluster values, vague values, etc.
- Multiple levels of granularity with regard to setting probability of truth values for entire source-of-record as well as for individual "fact"
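Here is a minimal sketch of the first two bullets (the scoring below is a toy placeholder, not a real record-linkage algorithm): equality returns a probability in 0..1, and find takes a caller-supplied threshold.

    import java.util.*;
    import java.util.stream.Collectors;

    // Sketch: "equals" as a 0..1 score, and "find" filtered by a caller-supplied threshold.
    class FuzzyPerson {
        final String name;
        final String ssn;
        FuzzyPerson(String name, String ssn) { this.name = name; this.ssn = ssn; }

        // Toy scoring: name match and SSN match each contribute half. A real system
        // would use proper record-linkage scoring; this only shows the shape of the API.
        double probablyEquals(FuzzyPerson other) {
            double score = 0.0;
            if (name.equalsIgnoreCase(other.name)) score += 0.5;
            if (ssn.equals(other.ssn)) score += 0.5;
            return score;
        }

        static List<FuzzyPerson> find(List<FuzzyPerson> candidates, FuzzyPerson probe, double threshold) {
            return candidates.stream()
                             .filter(c -> c.probablyEquals(probe) >= threshold)
                             .collect(Collectors.toList());
        }
    }

The same threshold parameter is what turns a property "getter" into a "find" operation, per the third bullet.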
- How to handle deciding what a thing is? What "level" of abstraction/reality is it on? E.G. an asteroid is a loose collection of pebbles, but that means that the parts of something don't always "touch" the thing. i.e. What is the real difference between the following:
- x is a part of y
- x is touching y
- x and y are in the set S
- How are attribute values of null to be interpreted? What is the difference between "definitely has no value" and "don't know the value"? Attributes of X (according to some given ontology) fall into one of the following categories (a small sketch follows this sub-list):
- Identity Criteria
- Required as Essential
- known as possible (but optional)
- unanticipated/unknown (but a value was found)
- unanticipated and not found (i.e. not conceived of)
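A minimal sketch of that distinction (names hypothetical): give each attribute slot an explicit role and an explicit value-status, so "definitely has no value" and "don't know the value" stop collapsing into the same null.

    // Sketch: make the "kind" of an attribute (per a given ontology) and the meaning
    // of a missing value explicit, instead of overloading null.
    enum AttributeRole { IDENTITY_CRITERION, ESSENTIAL, OPTIONAL_KNOWN, UNANTICIPATED_BUT_FOUND }

    enum ValueStatus { HAS_VALUE, DEFINITELY_NO_VALUE, UNKNOWN, NOT_YET_SEARCHED }

    record AttributeSlot(String name, AttributeRole role, ValueStatus status, Object value) {
        static AttributeSlot unknown(String name, AttributeRole role) {
            return new AttributeSlot(name, role, ValueStatus.UNKNOWN, null);
        }
    }

Note that the last category above ("unanticipated and not found, i.e. not conceived of") has no slot at all, by definition, which is exactly why it is the hardest case.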
- It is a big deal to understand the borderline between the set of "thing"s (aka entity, object) and the set of "value"s (e.g. 1,2,3,a,b,c,true,false,etc) especially when many OO languages represent them all with "objects".
- It is a big deal to handle the problem where ontologies mismatch each other with regard to "what is a thing" and "where does one thing end and another one begin". E.G.
- parts of A == parts of B but A<>B
- overlapping things like jigsaw puzzle pieces vs the objects in the completed puzzle picture
- a de facto Customer record that does not equal a "person" because the name belonged to one person but the SSN belonged to another. On the other hand, if the "customer" can really be "a married household" but the system can't handle that, then this customer record is not overlapping people, it is just incomplete. On the other other hand, how do the customer records for the husband and wife jibe with the "household"?
- There are attributes of an entity and there are "meta-attributes", e.g. an EAV tuple of an attribute could be (object123,color,green) [where "color" and "green" should be defined in the ontology in question.] Meta-attributes could be...
- "which ontology is this based on?", (i.e. "whose definitions are we using?")
- "says who?", (source of the data)
- "and when was it said?", (date source was queried)
- "over what period of time was it green?" (because values change over time)
- If objects can have arbitrary collections of attributes, and they are not any definite "thing", then how do you know what a "thing" is, or when to create a new instance of one? And where does one "thing" end and the next one begin?
- Intuitively, people agree on when one person begins and another person ends even if we can't define how/why. This is not true of abstract concepts. Modeling should find the easy-to-recognize real-world entities and use them in preference to concepts (which are often roles anyway, like customer or prospect or employee).
- People "know" other people (i.e. recognize them later) via shared "events" which both can verify to each other. [Just like the shared PIN# secret between you and the bank. And now increasingly asking all sorts of personal questions like whats your favorite movie?]
Labels:
accidental/essential,
epiphanies,
existential programming,
fuzzy,
origins,
POSTSCRIPT,
roles