Why OWL triples matter
The Portable Ontology Revolution in Domain Knowledge Representation
Stu Baurmann - July 17, 2005
As a provocation, I'll offer my opinion that the movement towards representing
knowledge with triples will
turn out to be a particularly important trend in the modern history of
computing, as demonstrated by the
current emergence into the mainstream of RDF and OWL-enabled knowledge
infrastructure such as
Protege, SWOOP, Jena, RDFGateway, Kowari, and many other tools, both open-source
and commercial.
Why is "knowledge" on the large scale really happening this time? Isn't this
the same rosy
immediate future we've heard about for 50+ years in both the popular culture and
theoretical
computing, "artificial intelligence" in the form of emergently intelligent,
sentient computers that
have personalities and want to be your friend and/or destroy all humans, blah
blah blah?
Yes and no. I think what has changed is that a convergence of various
storylines is leading
us into an age where we are able to be realistic and practical about some
elementary forms
of knowledge encoding which strike a balance between power of expression and
practical,
manageable applicability. One storyline regards our continued improvement in
understanding
of domain-specific knowledge representation techniques applied as part of a
value-generating
process, when approached cautiously and with metrics in hand. Another storyline
is the
day-by-day improvement in quality and number of integratable web resources such
as Wikipedia,
Google, Amazon, and countless others, with their broad variety of business
models. A third, parallel
storyline concerns the general maturation of what we call "information
technology", both within the
profession and in its role as a principal segment of the North American economy
and culture.
Neil Postman would probably say that information technology has been the prime
driver of the
world economy and culture since at least Gutenberg, but that's another story
(any Posties out there?).
Okay, so given those storylines as context, what I'm saying is that we're
currently advancing by an order of
magnitude in our ability to represent and share human knowledge, which is
progress
in our power as a society (perhaps not exactly what our "society" most needs
right now, but
technology is currently on its own calendar to a great extent) that is
independent of our computers'
capability for gee-whowzers so-called Artificial Intelligence features.
Confusing these two subjects -
"Knowledge Management" and "Artificial Intelligence" - is a source of great
current misunderstanding
in our profession's relationship to the econoculture at large.
AI, that is computer inference of new information based on inputs, is a nifty
idea that inflames the
imagination to the point of triggering various positive/negative fantasies.
That said, limited AI features
can be implemented today, with appropriate investment in a limited
knowledge-engineering domain.
But regardless of what one thinks about AI, I think we should carefully
consider the importance
of human knowledge representation to solving our currently gigantic problems
with complexity,
accuracy, and productivity of software applications for organized human
activities. I say this
in part out of painful personal experience as a software+process consultant for
many years.
The pervasive task in my consulting career has been helping organizations
overcome bottlenecks in
the understanding of themselves - in every case this turns out to dwarf
the complexity of the particular
technical problem my client sub-unit is attempting to address. I am writing
about this now because
I feel the need to declare for the benefit of my naturally suspicious peers:
This stuff works.
I say so having in recent years led the adoption of semantic technology for a few
large corporate projects,
and having thereby confirmed in my own mind that this toolset and methodology
indicate a promising direction for practical future work in addressing a large
slate of thorny business-of-information problems,
though they are certainly not a panacea in themselves. I see this semantic
triples stuff as not just fun, interesting,
powerful technology, but as an important enabler for structural reform of many
organizational processes requiring flexible handling of a diversity of
circumstances.
OK, so why am I hammering on this point? Data warehouses and business
intelligence dashboards
and expert systems and whatnot have been around for a while, right? What's
changing now? Why all
the sweet-smelling hot laundry about Knowledge Representation? Who am I
shilling for? Why
LogicU, now, Mr Westerner Guy?
Well, the short answer is that triples, 3-ary relationships, turn out to hit
a kind of sweet spot in the
squishy underbelly of today's organizational knowledge beast. The importance or
goodness of
RDF as a technology and knowledge model can be pontificated about theoretically
on any
side you like, but my point is that, in my experience, adopting
triple-store representations
is a close-to-optimal extension of the presently conventional software
construction
methodologies. This optimality (within certain assumptions) arises because this
adoption is both:
- A large enough change in expressiveness to permit a true quantum improvement
in system capability and quality (this I've seen firsthand, perhaps it's easier
to show than tell)
- A small enough change in structure to be digestible culturally within the
profession and in our user community (witness W3C semantic-web standards
activity)
Either of those points can be challenged, and I'm interested to see which is
more contentious among
my peers, so have at it!
Alright, so I've told you what my point is. Now, to flesh it out, let's go
back and understand triples a
little better. What's so great about 'em compared to the way most of us work
with our information
sets today?
Well, if you've studied a bit of algebraic topology (it's OK if you haven't,
ask me or someone else to
explain sometime, or ask Wikipedia), you know that the dimension of a space is a
crucial parameter in
determining what can be represented in it. For a long time, we've gotten by in
our imperative computer
programs with an abundance of 2-ary relationships, that is, variations on the
name-value pair. If you've
bothered to read this far, I'm sure you know what I am talking about, and you
are well familiar with
the abundance of NV-pair-like-things in the software you have worked on. But
wait, what about
abstract datatypes, objects, and relational databases, aren't those n-ary
relationships? Well, yes
and no. Yes in absolute, theoretical, structural terms, but No
in terms of the expressive semantics
generally available to the programmer.
Huh? Beespresso Demantics? What you talkin bout, Hot Laundry Man? OK, OK,
settle down, kiddos.
Think about what an object instance or an SQL row represents in its fields or
columns: it is a set of
name-value pairs. Each name is the name of a field or column, and each value is
either a primitive or a
compound construct (sub-object, array, etc) which is itself addressable as
name-value pairs.
This may not be the representation in memory, but that is irrelevant to my
point, which is that
a conventional imperative program must be described in terms of navigation of
contained name-value
pairs in order to execute and do useful operations.
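Here's a minimal Python sketch of that picture, offered as an illustration only. The record, its field names, and the helper row_to_triples are all hypothetical. It shows one row as a bag of name-value pairs, and then the same facts restated as triples, where the row's identity becomes an explicit third element instead of being implied by which container you happen to be holding.

```python
# A hypothetical customer record, as a conventional program would see it:
# a set of name-value pairs hanging off one implicit "thing".
row = {"id": "cust42", "name": "Ada", "city": "Boulder"}

def row_to_triples(record, key="id"):
    """Flatten one record into (subject, predicate, object) triples.

    The value of the key field becomes the subject; every other
    name-value pair becomes one triple about that subject.
    """
    subject = record[key]
    return [(subject, name, value)
            for name, value in record.items()
            if name != key]

triples = row_to_triples(row)
for triple in triples:
    print(triple)
```

Nothing is gained yet in this sketch; the payoff only shows up once statements about the field names themselves can sit in the same collection.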
More importantly, the program is always constructed in terms of some
expectation about the
types of the value of most of the pairs. Some variation in this type is
generally permitted
(e.g. with object-oriented inheritance, or a field/column called "type", etc.),
and it is management
of this variation across the program's subsystems and workflow and GUI that
provides the
sustenance of the modern programmer (or UML modeler). However, it is only in
advanced,
more research-oriented environments that a programmer has full access to the
typing model
of all information at runtime. For example, Java objects and C++ objects cannot
easily "change
type" in a running program. On the other hand, with today's OWL technology, it
is in fact
possible to derive/calculate the type of an acquired piece of information, while
remaining within a
JVM running on Linux or a Windows/C#/.NET environment or similar robust and
conventionally
deployed software platforms. This dynamically computed type can then be used to
drive behavior
as declaratively configured by mortal engineers using standards-based technology
(not mysteriously
evoked by that one brilliant programmer who really understands your magical
code-generator!).
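As a rough illustration of "computing" a type, here is a small Python sketch. It is not a real OWL reasoner and uses no real library; it hand-rolls just two RDFS-style rules (a property's domain implies the subject's type, and a subclass implies its superclasses). The vocabulary, the individuals, and the handlers table are all made up for the example, and it assumes each property has a single declared domain.

```python
# Hypothetical mini-vocabulary plus one data statement, all as triples.
triples = {
    ("worksFor", "domain", "Employee"),    # whoever worksFor something is an Employee
    ("Employee", "subClassOf", "Person"),  # every Employee is also a Person
    ("stu", "worksFor", "acme"),           # plain data; note there is no "type" statement
}

def infer_types(triples):
    """Return {subject: set of class names} derived from domain/subClassOf."""
    domains = {s: o for s, p, o in triples if p == "domain"}
    parents = {}
    for s, p, o in triples:
        if p == "subClassOf":
            parents.setdefault(s, set()).add(o)

    types = {}
    for s, p, o in triples:
        if p in domains:                       # the property has a declared domain
            found = types.setdefault(s, set())
            pending = [domains[p]]
            while pending:                     # walk up the subclass chain
                cls = pending.pop()
                if cls not in found:
                    found.add(cls)
                    pending.extend(parents.get(cls, ()))
    return types

# Behavior keyed on the computed type, declared as data rather than code paths.
handlers = {"Employee": lambda who: f"issue badge to {who}"}

types = infer_types(triples)
actions = [handlers[t](subject)
           for subject, classes in types.items()
           for t in sorted(classes) if t in handlers]
print(actions)
```

The thing to notice is that nobody ever asserted that stu is an Employee; the type falls out of the model at runtime, and the behavior follows from the type.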
The "triples + kernel" approach is simply a generalization of imperative
techniques so that the
useful behaviour currently described by program instructions (or UML sequence
diagrams) is
instead provided by a small generic kernel which is configured by a knowledge
model containing
assertions about relationships between entities. This knowledge model is
manageable with techniques
that, with proper architecture and planning, are orders of magnitude more efficient in
expressing and confirming
human intentions than are sets of UML models and Java/C# code with their
attendant RDBMS
configuration and so on.
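To give a feel for "small generic kernel, configured by a knowledge model", here is a toy Python sketch under heavy assumptions: the order workflow, the "onEvent:" predicate naming, and kernel_step are all invented for illustration, and a real kernel would do far more than look up one triple.

```python
# Hypothetical knowledge model: an order workflow described purely as triples
# of the form (current state, "onEvent:<event>", next state).
model = [
    ("new", "onEvent:pay", "paid"),
    ("paid", "onEvent:ship", "shipped"),
    ("new", "onEvent:cancel", "cancelled"),
]

def kernel_step(model, state, event):
    """Generic kernel: find the assertion that says where `event` leads from `state`."""
    for subject, predicate, obj in model:
        if subject == state and predicate == f"onEvent:{event}":
            return obj
    return state  # no assertion applies, so the state is unchanged

state = "new"
for event in ["pay", "ship"]:
    state = kernel_step(model, state, event)
print(state)

# Changing system behavior means adding an assertion, not editing the kernel.
model.append(("paid", "onEvent:refund", "refunded"))
refunded = kernel_step(model, "paid", "refund")
print(refunded)
```

The kernel never mentions orders, payments, or refunds; all of that lives in the model, which is the part a domain expert can read and check.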
There are many qualifications applicable to these statements, but I hope
you'll see my basic point.
BTW, the use of the term "kernel" implies a useful analogy with operating system
construction, which I
hope is apparent. Let me know if it ain't.
Recapitulating, the knowledge oriented approach to software system
development is an application of
mathematical insight informed by understanding of human engineering dynamics.
This understanding
has been gleaned from observation and participation in the current
heavy-treading industrialized
approach to software definition, which involves establishment of numerous
choke-points for expression
of intent, as well-motivated defense against the horrific risks that software
projects have always
faced: incomplete and inconsistent requirements, inadequate testing, etc.
So, to boil it down: By committing to an architecture based on a small,
simple kernel of operations
driven by a human-knowledge-containing triple store, we can build software
systems that are more
accurate reflections of our needs and are more automatically testable. These
two improvements
together give a quantum improvement in system quality as perceived by the
end-users.
I further submit that the extent to which these improvements are undertaken as
evolutionary or
revolutionary changes in an organization can be shaped by those managing the
knowledge-oriented
project, depending on their priorities.
Now, to be clear about what is new-ish here, a general n-ary expressive power
is available implicitly
in the SQL information model, no doubt about it. In fact, comprehending the
role of SQL in modern
systems development is a key to understanding where we're going with this
semantic technology.
When you are filtering sets of rows by using SQL WHERE clauses, you are, in
fact, working with
an n-ary tuple model that is properly abstracted and accessible. Projecting
this power forward
towards the user through your layers of MVC and ASPs/JSPs/EJBs and so forth is
what much of
current system engineering is about, yes? My point is that the triples approach
is an appropriate
modern grounding formalism for broad system engineering efforts beyond
the transactional data
store, and can in fact be implemented in orthogonal harmony with the SQL
approach. The parameters
of coexistence can be understood this way: triples can be formulated as
simultaneously a limitation (in
dimension) and extension (in practical expressive power) of the modern
conventional SQL-grounded
approach.
The limitation is this: Since all the triples in an RDF model can
easily be stored in a single SQL table
with 3 columns, then all RDF operations are inherently emulatable as SQL
operations, so there is a
kind of inherent backward-compatibility here, and we can see RDF as simply a
subset of things SQL
already does just fine. But the extension comes in here: We're
re-conceptualizing the form of the
information representation so that all relevant meta-data is now expressible
within the same single
3-column table as the data-data, and both are mutable by the same core model
read/write operations.
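The single-table idea is easy to try out. Below is a minimal sketch using Python's standard sqlite3 module with an in-memory database. The table name, the column names, the sample vocabulary, and the match helper are my own assumptions for illustration, not any standard triple-store schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

rows = [
    # meta-data: statements about the vocabulary itself
    ("Employee", "subClassOf", "Person"),
    ("worksFor", "domain", "Employee"),
    # data-data: statements about individuals
    ("stu", "type", "Employee"),
    ("stu", "worksFor", "acme"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)

def match(conn, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", subject),
                          ("predicate", predicate),
                          ("object", obj)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = ("SELECT subject, predicate, object FROM triples"
           + where + " ORDER BY subject, predicate, object")
    return conn.execute(sql, params).fetchall()

print(match(conn, subject="stu"))        # data about an individual
print(match(conn, subject="Employee"))   # meta-data, via the very same query

# Extending the schema is just another INSERT into the same table.
conn.execute("INSERT INTO triples VALUES (?, ?, ?)",
             ("Person", "subClassOf", "Agent"))
print(match(conn, predicate="subClassOf"))
```

Compare that last INSERT with what a schema change costs in a one-table-per-entity design: there it is an ALTER TABLE or a migration, here it is the same write operation used for ordinary data.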
That's it, folks. That's the key! You haveta get that last point in order to
understand what I'm talking
about here. In a conventional SQL approach, we start with an ERD that
identifies all the entities
in your model, and we implement that model in a bunch of RDBMS tables, one for
each entity type.
Now, in any particular RDBMS you are able to work with metadata by querying
system tables and so
on. Thus we can see an RDBMS platform as being equivalent to or even a superset of
a simple
triple-store system, which makes sense since we know that an RDBMS is
functionally sufficient
to implement just about any abstract model understandable by more than
a handful of math-lovers.
However, now we must recognize some practical concerns: An RDBMS is generally a
self-contained
entity that must be interfaced with rather than used as a total
solution platform, unless you completely
commit to a particular solution framework (even your own) and thereby sacrifice
portability and
interoperability.
The point of triple-based semantic technology is that it allows you to move
knowledge models around
freely between programs and platforms, work within regular standards-based
XML-oriented web software
environments, and so forth. Thus you get the power of working with
type-flexible and mathematically
expressive 3-ary relationships, without having to "hit the database" or
"program in the database" every
time you need this power. This is perhaps a subtle and highly-qualified point,
but it turns out to
have huge potential ramifications in both development efficiency and system
quality.
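A small Python sketch of that portability, with loud caveats: JSON is used here purely as a stand-in for a real RDF serialization such as RDF/XML, and the two models, their contents, and the helper names are hypothetical. The point it tries to show is that a triple model can be shipped as text and merged with another model by plain set union, with no database in the loop.

```python
import json

def dump_model(triples):
    """Serialize a set of triples to text (sorted so the output is stable)."""
    return json.dumps(sorted(triples))

def load_model(text):
    """Rebuild the set of triples from text produced by dump_model."""
    return {tuple(t) for t in json.loads(text)}

# Two hypothetical models held by two different programs.
model_a = {("stu", "worksFor", "acme"), ("acme", "locatedIn", "Boulder")}
model_b = {("Employee", "subClassOf", "Person"), ("stu", "worksFor", "acme")}

# Program A ships its model as text; program B loads it and merges.
received = load_model(dump_model(model_a))
merged = received | model_b   # duplicates collapse, nothing needs remapping

print(len(merged))
```

Because every statement has the same three-part shape, merging needs no schema negotiation between the two sides, which is much harder to say about two arbitrary sets of RDBMS tables.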