Saturday, June 24, 2006

Notes on Ontology Tools

While reading about ontology tools[2], I found a tool (Hozo) that supports "roles" explicitly (see #17 of the original Existential Programming epiphanies, and a role case study). Hozo separates "role concepts" from "basic concepts"; however, the tool allows each role attribute to be mapped to a basic concept attribute. I see that an existential programming language should allow "roles" to "inherit" from their "roleholder" ala subclass inheritance without actually being a subclass. [Hmmm... in an existential programming language, where all "classes" are effectively mixins anyway, how would roles be different?]

From[3], seeing CYC's concepts of #$isa versus #$genls reminds me of a discussion I had back in 2002 with the Protege 2000 folks at Stanford who produced a Wine ontology, where I wanted to have no distinction between classes and instances because I wanted a hierarchy like wine->reds->shiraz->Rosemont->vintage94->bottle#123. I.E. something considered a leaf on the tree might later be a node with children itself. Protege would only allow variables to take on values that were "instances" and I wanted to put "chardonnay" (a subclass) as the value of a "wine variety" property. SO, is there no difference between classes and objects, or should the "value" of an attribute be able to contain a "class" reference??

From[4], seeing the Semantic Web's layer cake, I see that my ideas about recording "says who?" and "how reliable are you?" seem similar to the "trust layer". [Ed. note 11-23-07: like maybe you read this stuff years ago and it was the subliminal seed of this "says who" epiphany?]

[1] Tutorial on Ontological Engineering: Part 2: Ontology Development, Tools and Languages, Riichiro Mizoguchi, 2004

[2] ibid, Page 14, Fig. 2.
[3] ibid, Page 15, Section 3.1
[4] ibid, Page 23

Saturday, June 17, 2006

Philosopher's Toolkit

In the handy Philosopher's Toolkit book[1], there is a section[2] explaining the difference between "categorical" statements and "modal" statements. In reading it, I see that some of my intuitions about the assumptions implicit in the object oriented programming model (e.g. "what time period was this data true?", "says who?", etc) were actually a recognition that OO models contain "categorical" assertions and do not (without explicit programming) support "modalities" such as temporal modality, intensional logics, etc. E.G. there is no "date range" associated with each attribute's value.
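
As a rough sketch of what a temporally qualified attribute might look like in Python (the names `TimedValue` and `value_on`, and the sample addresses, are made up for illustration and come from no book or library): instead of one bare "categorical" value, each value carries the date range over which it is asserted to hold.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TimedValue:
    """One assertion about an attribute, valid over [start, end)."""
    value: str
    start: date
    end: Optional[date] = None  # None means "still true as far as we know"

    def holds_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day < self.end)


def value_on(history: list[TimedValue], day: date) -> Optional[str]:
    """Return the value asserted for the given day, or None if nothing is known."""
    for item in history:
        if item.holds_on(day):
            return item.value
    return None


# A plain OO field would keep only the latest address; this keeps the history.
address = [
    TimedValue("12 Elm St", date(2001, 5, 1), date(2004, 8, 15)),
    TimedValue("9 Oak Ave", date(2004, 8, 15)),
]
```

Asking `value_on(address, date(2003, 1, 1))` then answers "what was true then?" rather than "what is the value?". Other modalities ("says who?") would need further fields in the same spirit.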

In another section[3], Leibniz's law of identity (which says that A is the same "thing" as B if all attributes of A are equal to their corresponding attributes in B) relates to my epiphany #5. But which set of properties is necessary/sufficient to claim a match? It depends on the ontology. Consider "cross temporal identity"...the river of today vs the river of yesterday..."molecules" vs "water". For people, the properties are often not used to identify them, but instead a "continuity of memory" connects yesterday's YOU vs today's YOU.

In another section[4], the difference between "types" and "tokens" is discussed. Type is an analog of "class". Token is an analog of "object". Type-identical is an analog of "instanceOf". Token-identical is an analog of "address-of(A) == address-of(B)".
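
The type/token analogy can be seen in a few lines of Python. This is only an illustration; the `Coin` class is hypothetical. Python's `is` operator compares object identity, which plays the part of the address comparison here.

```python
class Coin:
    """A hypothetical class standing in for a philosophical 'type'."""

    def __init__(self, denomination: int) -> None:
        self.denomination = denomination


a = Coin(25)
b = Coin(25)   # a second token of the same type
c = a          # another name for the very same token

# Type-identical: both tokens are instances of the same type.
type_identical = type(a) is type(b)

# Token-identical: both names refer to the one and the same object.
token_identical_ab = a is b
token_identical_ac = a is c
```

Here `a` and `b` are type-identical but not token-identical, while `a` and `c` are token-identical.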

[1] The Philosopher's Toolkit, Julian Baggini and Peter S. Fosl, Blackwell Publishers, 1st Ed., 2003, ISBN: 0631228748
[2] ibid, section 4.4
[3] ibid, section 3.6
[4] ibid, section 4.17

Wednesday, June 7, 2006

Ontology Merging Strategy

Summarizing the posts from yesterday, there is a general problem of "things" in one ontology/data model not mapping (in a definite way) to "things" in another model. How to support (or even automate) mapping from one model to another? I.E. how to facilitate "transformation" from one "basis" to another?

A strategy at the heart of an existential programming language could be to reduce entities to their most atomic level: semantic relations between an entity and a single attribute. Use "identity" algorithms to reconstitute these atoms into "things".
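
To make the strategy a little more concrete, here is a minimal Python sketch under stated assumptions: `Atom`, `to_atoms`, and `reconstitute` are invented names, the sample records are made up, and "share a tax id" is merely a stand-in for whatever "identity" algorithm a real system would plug in.

```python
from collections import defaultdict
from typing import Callable, NamedTuple


class Atom(NamedTuple):
    source: str      # which system made the assertion
    entity: str      # that system's local key for the entity
    attribute: str
    value: str


def to_atoms(source: str, key: str, record: dict[str, str]) -> list[Atom]:
    """Break one record down into one atom per attribute."""
    return [Atom(source, key, attr, val) for attr, val in record.items()]


def reconstitute(
    atoms: list[Atom],
    identity_key: Callable[[dict[str, str]], str],
) -> dict[str, dict[str, set[str]]]:
    """Regroup atoms into "things" using a pluggable identity function."""
    # Step 1: rebuild each source's local view of an entity.
    local: dict[tuple[str, str], dict[str, str]] = defaultdict(dict)
    for atom in atoms:
        local[(atom.source, atom.entity)][atom.attribute] = atom.value
    # Step 2: merge the local views that the identity function says are one thing.
    things: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for view in local.values():
        key = identity_key(view)
        for attribute, value in view.items():
            things[key][attribute].add(value)
    return things


atoms = (
    to_atoms("crm", "C-17", {"tax_id": "000-00-0001", "name": "Joe Blow"})
    + to_atoms("loans", "OB-4", {"tax_id": "000-00-0001", "name": "J. Blow", "state": "OH"})
)
# Simplification: every view is assumed to carry a tax_id attribute.
things = reconstitute(atoms, identity_key=lambda view: view["tax_id"])
```

The two source records collapse into one "thing" that keeps both spellings of the name. Swapping in a different `identity_key` yields a different carving of the same atoms into things.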

A new language that did this and integrated multiple sources of data (OO data, E/R relational data, semantic networks, web-search-results) could create a single seamless framework and data continuum.
[Ed. Note: as found 10/29/07, others have had similar ideas.]

Ontology Mismatch Case Study: Customers & Obligors

Here is a real world example (from a major bank) of the problems of ontology mismatches between different silo systems whose data must nevertheless be integrated. Some systems have the concept of "customer" and implement a customer entity and customer key. Other systems (which do not talk to each other, i.e. there is no universal "system of record" for person or legal entity) have a customer concept, but they are distributed geographically and they have a different key for each state or regional location, and they are called "obligors". So, customers should have a simple one-to-many relationship to obligors.

However, since errors are made by the automated contact address parsing algorithms that try to figure out which customer is associated with which obligor, multiple customers can be associated with a single obligor. Hence, customers and obligors have a many-to-many relationship, and therefore, customers are many-to-many within themselves! Obligors are many-to-many within themselves! Customers not only have duplicates for the same person, they don't always represent a definite person or even set of definite people. They are vague and refer to parts of multiple people. Customers are effectively anything with a customer ID! Very existential.

A particular obligor (which, again, should be a particular customer in a particular location) was linked with three customers: JoeBlow, JaneBlow, a-customer-with-Jane's-name-and-Joe's-SSN! To make things worse, the attempt to clean up customers by defining them as a role of a "legal entity" didn't work in this case because the "customer" was really a married-household which was not a "legal entity" because it doesn't have its own tax id! Even worse, the rationale that legal entities are those things that are separately liable for money demands ignores the fact that both parties in a married household are liable (but even then differing on a state by state basis). Whew!

What is "Identity" in OOP & ER & SN Data Models?

When trying to map Object Oriented Programming models to Entity Relationship models to Semantic Network models, how does the philosophical concept of "identity" get handled? I.E. how is a "thing" identified in each model? [And the following assumes that incorrect criteria are not used, e.g. using the "name" of the thing as its "identifier".]
  • OOP models assume that the "object pointer" (or object "handle") is a globally unique identifier (GUID) for the "thing" represented by that object instance of that class.
  • E/R models assume that there is either an opaque key (ala sequence numbers) or some set of attributes whose combined values form a GUID for the "thing" represented by that row of that table.
  • S/N models assume that there is some explicit or internal key associated with each "entity".

Object Orientation's Ontological Assumptions

Once one realizes that Object Oriented Programming is isomorphic with Semantic Networks[1][2], and one is cognizant of the meta-data it takes to represent imperfect data from a variety of sources (e.g. data mining the WWW), it becomes clear that OOP makes several large assumptions when modeling the world. These assumptions lie at the root of many problems mapping OO models to relational E/R data models.

The Class hierarchy defined in an OO program represents a model of entities, their attributes, and their relationships with other entities; i.e. an Ontology. Unlike modern semantic network approaches, where it is clear that a multiplicity of ontologies must be recognized and mediated between, OO Classes implicitly assume that they are "the only model", "the correct model", "the universal model". Some assumptions of OO, as normally practiced, are...
  • Only a single ontology is supported. OOP needs a way of mixing Class hierarchies where each is a different perspective on the same "thing(s)".
  • No model exists for describing the author of the ontology. It is potentially implied by its [Java] package name (when that concept applies), but as far as other attributes of the author, there is no way to represent the "reliability" of the author, or of this particular model, or of a particular set of data values associated with this model.
  • No model of whether particular values of Object attributes are "true", "up to date", "not vague", "not fuzzy" (i.e. clusters of possible points with probabilities for each point).
  • No concept of object instances overlapping; each object either exists or not; objects don't "partially overlap" each other; objects exist in a single place in a single "copy". In other words, OOP doesn't distinguish between "a thing" and some number of (potentially imprecise) "representations" of that thing.
  • The Class hierarchy is assumed to be the only way to classify/divide the world into "things" (anyway, at least the "things" that those classes model).
  • An instance of Class X is assumed to be a member of the set of all Xs in the world. I.E. OOP doesn't have a way to say, I've created an object instance, but whether it is a member of the class of all X is not tied to whether it was created as an instance of Class X at birth. OOP doesn't support an agnostic attitude towards class/type membership. In still other words, Essence precedes Existence!
  • The values of all entity attributes (aka an object instance) are assumed to be available in a single contiguous location. I.E. OOP can't normally handle attribute values being spread all over creation (as would be the case for data mined about someone via web page searches). OOP can't normally handle taking widely different amounts of time to retrieve different attributes (as would be the case in data mining operations).

Three Levels of "Existential-ness" Support?

In thinking about how one would build "a language" and/or tools to support Existential Programming, there seemed to be three increasing levels to sort features into.

Level I - Model Mapping
  • Make it easy to map Object-Oriented models to Entity-Relationship models to Semantic-Network models. I.E. implement OO persistence layer in the style of the EAV approach to semantic network databases. Implement auto-translation of data in traditional E/R tables into EAV records. Implement auto loading of data into OO model from arbitrary EAV tuples (and therefore arbitrary relational tables). In other words, automated persistence with automatic data mapping.
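
A minimal Python sketch of the Level I idea, assuming a table row arrives as a dict (the function names `row_to_eav` and `load_object`, the "table:key" entity naming, and the sample row are all illustrative assumptions):

```python
from types import SimpleNamespace


def row_to_eav(table: str, key_column: str, row: dict) -> list[tuple]:
    """Translate one relational row into (entity, attribute, value) tuples."""
    entity = f"{table}:{row[key_column]}"
    return [
        (entity, column, value)
        for column, value in row.items()
        if column != key_column
    ]


def load_object(entity: str, triples: list[tuple]) -> SimpleNamespace:
    """Build an object from whatever EAV tuples mention the given entity."""
    obj = SimpleNamespace(entity=entity)
    for subject, attribute, value in triples:
        if subject == entity:
            setattr(obj, attribute, value)
    return obj


row = {"id": 42, "name": "Rosemont Shiraz", "vintage": 1994}
triples = row_to_eav("wine", "id", row)
wine = load_object("wine:42", triples)
```

The object ends up with whichever attributes the tuples supply (`wine.name`, `wine.vintage`), with no class declaring them in advance.
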
Level II - Data Source Spanning
  • Make it easy to accept ontologies and data from multiple sources; i.e. not just relational databases. Example data sources could be: Web searches, Enterprise Silo systems, etc. In other words, build common adapters and mediators to broaden the reach of the "language" beyond structured local databases.
Level III - Fuzzy Models/Values/Sources
  • "Consider the source". Make it easy to associate fuzzy logic factors to data-assertions and ontology-assertions of all granularities, based on the source of the data, the ontology, and even the assertions themselves. Examples are: for any given attribute value, "says who?", "said when?", "how reliable is this source?", "how reliable is this source for this attribute?", "who says that this attribute even applies to this class of thing?", "how reliable is the source about the ontology definitions?". I want to be able to encode: "Sam is 89% trustworthy about colors", "Joe lies about AGEs", "Harry is 100% reliable when he says that Joe lies about AGEs", etc.
  • Make it easy to handle attribute values that are themselves fuzzy. I.E. Probabilistic attribute values, conflicting values, cluster values, vague values, time varying values, outdated values, missing values, values whose availability is defined by some set of limits on the effort expended in finding the value (e.g. find all values of phone for joe blow that can be found within 10 seconds real time).
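
One possible encoding of "consider the source", sketched in Python. Only the 0.89 for Sam and the idea that Joe is untrustworthy about ages come from the examples above; the 0.7 and 0.5 figures, the lookup scheme, and all names are assumptions for illustration. Assertions about other sources' reliability (the "Harry" case) are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    entity: str
    attribute: str
    value: str
    source: str  # "says who?"


# Trust per (source, attribute); a None attribute is that source's general rating.
TRUST = {
    ("Sam", "color"): 0.89,  # "Sam is 89% trustworthy about colors"
    ("Joe", "age"): 0.0,     # "Joe lies about AGEs"
    ("Joe", None): 0.7,      # assumed general rating for Joe
}
DEFAULT_TRUST = 0.5          # assumed rating for sources we know nothing about


def trust_in(source: str, attribute: str) -> float:
    """Most specific rating wins: per attribute, then per source, then default."""
    if (source, attribute) in TRUST:
        return TRUST[(source, attribute)]
    return TRUST.get((source, None), DEFAULT_TRUST)


def weigh(assertions: list[Assertion]) -> list[tuple[float, Assertion]]:
    """Pair each assertion with its trust factor, most trusted first."""
    scored = [(trust_in(a.source, a.attribute), a) for a in assertions]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = weigh([
    Assertion("car1", "color", "teal", "Joe"),
    Assertion("car1", "color", "green", "Sam"),
    Assertion("car1", "color", "blue", "Ann"),
])
```

Conflicting values are kept rather than overwritten; the caller decides what to do with the ranking.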

Tuesday, June 6, 2006

Identity() versus Equals()

To expand on item 5 in my original entry, it seems that object oriented languages need to be extended to support the following notions.
  • Identity() as a separate model-definable function rather than using a single "key" in the form of an object pointer or reference. It would define whether multiple "things" are the "same thing".
  • Equals() is different from Identity() because two objects being equal is not the same as "the thing this object represents" being the same as "the thing that object represents".
  • Determining the membership of "object 123" in the "set of all instances of class X" could/should be via an explicit list (along with "says who?", "as of when?", etc) rather than an intrinsic property of that object.
  • Class definitions are in the mind of the "viewer" and can be applied to any object. Therefore, one should be able to use a mixture of many ontologies.
  • Attributes of objects should be stored independently so that they are available to all "views", "classes", "entity types", EAV tuples, etc, etc.
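
A small Python sketch of three notions that usually get conflated: pointer identity (`is`), value equality (`==`), and a model-defined Identity(). The `CustomerRecord` class, the `same_thing` function, and the "same tax id means same person" rule are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    system: str
    local_id: str
    name: str
    tax_id: str


def same_thing(a: CustomerRecord, b: CustomerRecord) -> bool:
    """Model-defined Identity(): do both records represent one real-world thing?"""
    return a.tax_id == b.tax_id


r1 = CustomerRecord("crm", "C-17", "Joe Blow", "000-00-0001")
r2 = CustomerRecord("loans", "OB-4", "J. Blow", "000-00-0001")
r3 = CustomerRecord("crm", "C-17", "Joe Blow", "000-00-0001")

pointer_identical = r1 is r3       # False: two separate objects in memory
value_equal = r1 == r3             # True: dataclass compares field by field
records_equal = r1 == r2           # False: the fields differ
represent_same = same_thing(r1, r2)  # True under this model's identity rule
```

A different ontology could supply a different `same_thing` without touching the records themselves.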

Monday, June 5, 2006

The Original Epiphanies of Existential Programming

The items below are a summary of the several AHA! moments I had over the May/June 2006 time frame. [see my std disclaimers]
It began with contemplating how Object-oriented modeling, and Entity-relationship modeling, and Semantic Network modeling are all isomorphisms of each other. Next I realized that O/O and E/R models are way too rigid because they expect a single "correct" model to work, whereas Semantic modelers pretty much know it is futile to expect everyone to use a single ontology! So, where would it take us to explore doing O/O and database development with that in mind? Next I had the intuition that Philosophy (with a capital P) probably had something to say about this topic and so I started reading Philosophy 101 books to learn at age 50 what I never took in college. It quickly became obvious that Philosophy has SO MUCH to say about these topics that it is criminal how little explicit reference to it there is in the software engineering literature.

  1. When mapping Object Oriented classes to semantic networks I realized that CLASSES/SUBCLASSES etc were the same as sets of semantic-relationship-triples (Entity-Attribute-Value aka EAV records) and therefore a class hierarchy formed an ontology (as used in the semantic network/web/etc world). AHA! It is futile to get everyone to agree upon ONE ontology (from my experience), SO, that is why it is a false assumption of O/O that there can/should be a single Class hierarchy. But, all O/O languages fundamentally assume this which is why they are hard to map to relational databases. Databases explicitly provide for multiple "views" of data. And in Enterprise settings, where there are often multiple models (from different stovepipe systems) of the same basic data, this causes even more of a mismatch with the single object model.
  2. Mapping O/O Class hierarchies to DB E/R models to Semantic Networks brings up questions about the meaning of Identity (with a capital I) and Essential vs Accidental properties. AHA! This sounds like Philosophy (had I not started reading about it before transcribing these notes into a blog, I would not have known terms like Essential, Accidental, and Identity with a capital I to use them here), SO, it would be worth learning Philosophy to improve my Software Engineering and Computer Science skills.
  3. Having now worked with both Java and JavaScript deeply enough to understand class versus prototype based languages (see my AJAX articles), I see that Java is like Plato's view of the world, and JavaScript is more like Existentialism (where an object can be instantiated without saying what "type" it is).
  4. Web pages can be thought of as a database whose data model/ontology is implied. Data mining can be done on it where the URL and the "time of last update" are added to each EAV tuple extracted from the page to extend a normal EAV "fact" with a "says who?" dimension and a temporal dimension to the database. In order to really capture all the nuances of the data mined from the web, a standard data model (ala O/O or E/R models) has to also add some model of:
    • completeness
    • accuracy
    • different values at different points of time
    • not only "says who?" but "says how?" i.e. which ontology is being used implicitly or explicitly
    • only some attributes of a "thing" are being defined on any given URL
  5. O/O languages could/should be extended to make it easier to work with arbitrary sets of semantic network relationships/tuples such that it could handle integration of various (E/R, Enterprise, web page, data mining) data models.
  6. Google, Homeland Security, Corporate data warehouses all would benefit from being able to work with "everything we know about X". This could be a good technique to integrate disparate data sources.
  7. O/O languages need to be more like JavaScript in letting any set of attributes be associated with an object, and "classes" are more like "roles" or interfaces that the VIEWER chooses instead of tightly coupling the attribute set to a predefined list. The VIEW chosen by the viewer/programmer can still be type-safe once chosen BUT it can't assume the source of data used the same "view".
  8. "View" (see above) includes all aspects of traditional classes PLUS parameters for deciding trustworthiness, deciding the "identity" of the thing that attributes are known about, and all other "unassumable" things. An O/O language could set defaults for these parameters to match the assumptions of traditional programming languages.
  9. Searching the web and trying to integrate the data is much like trying to integrate the data from disparate silo systems into a single enterprise data model or data warehouse. They both need to take into account where each data value came from, how accurate/reliable those sources are, and how their ontologies map to each other and accumulate attributes from different sources about the same entity.
  10. When dealing with the sort of non-precise, non-reliable values of object properties as found on the web, the following are needed as a part of the "ontology" defined to work with that data:
    • Equality test should return a decimal probability (0..1) rather than a true/false value
    • Find/Search operations should allow specification of thresholds to filter results
    • Property "getters" become the same as "find" operations
    • The result of a get/find is a set of values, each of which includes a source-of-record & time/space region, i.e. says who?, when and where was this true?
    • Property "setters" should accept parameters for source-of-record-spec, time/space region, data freshness, as well as probability factor, or other means of specifying cluster values, vague values, etc.
    • Multiple levels of granularity with regard to setting probability of truth values for entire source-of-record as well as for individual "fact"
  11. How to handle deciding what a thing is? What "level" of abstraction/reality is it on? E.G. an asteroid is a loose collection of pebbles, but that means that the parts of something don't always "touch" the thing. i.e. What is the real difference between the following:
    • x is a part of y
    • x is touching y
    • x and y are in the set S
  12. How are attribute values of null to be interpreted? What is the difference between "has definitely no value" and "don't know the value"? Attributes of X (according to some given ontology) are either:
    • Identity Criteria
    • Required as Essential
    • known as possible (but optional)
    • unanticipated/unknown (but a value was found)
    • unanticipated and not found (i.e. not conceived of)
  13. It is a big deal to understand the borderline between the set of "thing"s (aka entity, object) and the set of "value"s (e.g. 1,2,3,a,b,c,true,false,etc) especially when many OO languages represent them all with "objects".
  14. It is a big deal to handle the problem where ontologies mismatch each other with regard to "what is a thing" and "where does one thing end and another one begin". E.G.
    • parts of A == parts of B but A<>B
    • overlapping things like jigsaw puzzle pieces vs the objects in the completed puzzle picture
    • a de facto Customer record that does not equal a "person" because the name belonged to one person but the SSN belonged to another. On the other hand, if the "customer" can really be "a married household" but the system can't handle that, then this customer record is not overlapping people, it is just incomplete. On the other other hand, how do the customer records for the husband and wife jibe with the "household"?
  15. There are attributes of an entity and there are "meta-attributes", e.g. an EAV tuple of an attribute could be (object123,color,green) [where "color" and "green" should be defined in the ontology in question.] Meta-attributes could be...
    • "which ontology is this based on?", (i.e. "whose definitions are we using?")
    • "says who?", (source of the data)
    • "and when was it said?", (date source was queried)
    • "over what period of time was it green?" (because values change over time)
  16. If objects can have arbitrary collections of attributes, and they are not any definite "thing", then how do you know what-is-a/when-to-create-a-new-instance-of-the "thing"?? And where does one "thing" end and the next one begin?
  17. Intuitively, people agree on when one person begins and another person ends even if we can't define how/why. This is not true of abstract concepts. Modeling should find the easy to recognize real-world entities and use them in preference to concepts (which are often roles anyway like customer or prospect or employee).
  18. People "know" other people (i.e. recognize them later) via shared "events" which both can verify to each other. [Just like the shared PIN# secret between you and the bank. And now increasingly asking all sorts of personal questions like what's your favorite movie?]
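
Several of the requirements in item 10 can be sketched together in Python. Everything named here (`Fact`, `FuzzyObject`, `equality`, the sample phone numbers and sources) is hypothetical. The string-similarity ratio from the standard library's `difflib` is used only as a stand-in for a real probability of equality.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Fact:
    value: str
    source: str         # "says who?"
    as_of: str          # "said when?"
    probability: float  # how likely the source is right, 0..1


def equality(a: str, b: str) -> float:
    """Return a score in 0..1 rather than True/False (similarity as a proxy)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class FuzzyObject:
    """An object whose getters behave like find operations over recorded facts."""

    def __init__(self) -> None:
        self._facts: dict[str, list[Fact]] = {}

    def set(self, attribute: str, value: str, *, source: str, as_of: str,
            probability: float = 1.0) -> None:
        # A setter never overwrites; it records one more sourced assertion.
        self._facts.setdefault(attribute, []).append(
            Fact(value, source, as_of, probability)
        )

    def get(self, attribute: str, threshold: float = 0.0) -> list[Fact]:
        # A getter returns every recorded value that clears the threshold.
        return [f for f in self._facts.get(attribute, [])
                if f.probability >= threshold]


joe = FuzzyObject()
joe.set("phone", "555-0100", source="bank-crm", as_of="2006-06-01", probability=0.9)
joe.set("phone", "555-0199", source="web-page", as_of="2004-02-10", probability=0.4)
likely_phones = [f.value for f in joe.get("phone", threshold=0.8)]
```

With no threshold, `joe.get("phone")` returns both facts along with their sources; with `threshold=0.8` only the bank's value survives.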