Semantic databases

Traditional relational databases have well-understood properties for performance and locking, with indexes, and work well for well-defined, homogeneous data where you usually search for selections from one specific table.

I have always been drawn to more flexible data structures. A simple address book would have a name, email, phone and address. But some people have several names, or several types of names. Several emails. Several phone numbers. And so on. Some of it is temporal. Some of it is domain segmentation, like work/home. With a relational database schema, you would convert the single table to a main person table with relations to one table for phone, one for email and so on. But then you come to all the extra things you would like to know about that connected information: when the email address was last updated, whether it is connected to a specific job, the status of the email account with regard to bounces, and what the canonical and formatted forms are (the latter could include the full name and sometimes a title).

Now consider the similarity between people, departments, offices and businesses. They may all have emails, phone numbers, visiting addresses and names. The name may be different depending on context or language. There is also a lot of metadata that should be connected to each individual value, like the source of the data, information about when it was updated or confirmed to still be valid, who is allowed to see it, and in what contexts it is intended to be the preferred value among its alternatives. You may also want to keep track of the reliability, quality and precision of each value. A normal datetime field may have a value stored with millisecond precision, but may have been entered with a precision of a minute, a day or even just a year. You may want to store more information about how it was imported, so as to account for the kind of dirt it carries.
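
As a rough sketch of how per-value metadata could be modeled (the field names here are my own illustration, not from any particular system):

```typescript
// Sketch: a single stored value wrapped with the kind of metadata discussed above.
// All names are illustrative assumptions, not a fixed schema.

type Precision = "millisecond" | "minute" | "day" | "year";

interface ValueMetadata {
  source: string;            // where the value came from (import, user input, ...)
  updatedAt: Date;           // when the value was last changed
  confirmedAt?: Date;        // when it was last confirmed to still be valid
  precision?: Precision;     // the precision it was actually entered with
  visibleTo: string[];       // who is allowed to see it
  preferredIn?: string[];    // contexts where this is the preferred alternative
}

interface EmailValue {
  canonical: string;         // "jonas@example.org"
  formatted?: string;        // "Jonas Liljegren <jonas@example.org>"
  context?: "work" | "home";
  bounceStatus?: "ok" | "soft-bounce" | "hard-bounce";
  meta: ValueMetadata;
}

// A person can then hold any number of such values.
interface Person {
  names: { value: string; meta: ValueMetadata }[];
  emails: EmailValue[];
}
```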

On top of that, you have the tri-temporal data that should be accessible by all parts of the frontend and backend, including considerations of changes in data structure, application logic and which agents have authorization to read or write. The tri-temporal data consists of a traditional change log along with changes over time in the modeled domain, as in a person changing name, phone number, email or living address. An updated value may be a correction of a wrongful input, or reflect that the previous value was correct until recently. You also want a third dimension of decision time, for changes that can go through the stages of draft, proposed and accepted.
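
A minimal sketch of those three time dimensions, with made-up field names:

```typescript
// Sketch of tri-temporal versioning: transaction time (when the database learned it),
// valid time (when it was true in the modeled domain) and decision time (the
// draft/proposed/accepted workflow). Field names are assumptions for illustration.

type DecisionStatus = "draft" | "proposed" | "accepted";

interface TemporalValue<T> {
  value: T;
  transactionTime: Date;         // when this version was recorded
  validFrom: Date;               // when it became true in the real world
  validTo?: Date;                // undefined = still valid
  decision: DecisionStatus;
  decidedAt?: Date;              // when the decision status last changed
}

// Example: a name change recorded later than it actually happened.
const names: TemporalValue<string>[] = [
  {
    value: "Anna Svensson",
    transactionTime: new Date("2018-03-01"),
    validFrom: new Date("2010-06-15"),
    validTo: new Date("2021-09-01"),
    decision: "accepted",
  },
  {
    value: "Anna Karlsson",
    transactionTime: new Date("2021-10-05"), // recorded a month after the fact
    validFrom: new Date("2021-09-01"),
    decision: "accepted",
  },
];
```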

SQL was not made for data structures that can conditionally spread out forever. You don't want to make the schema more complicated or detailed than the application needs. All the facets have a cost in speed, storage and complexity. But expanding the schema will require an abstraction layer; you wouldn't want to refactor old SQL statements for every addition. Though, I don't think there is any existing abstraction library that will handle all the things I mention here, in addition to the deep reactive propagation through layers of composition that I will cover in another article.

NoSQL

Going back to basics, you can define lookup tables with offsets or pointers for stored binary data, defining your own data types. It can be strings of text holding things like JSON, or other types of objects. It's similar to a directory of files, with indexes using hashes or binary trees.

Types or features of databases include key-value stores, hierarchical data and document stores. Some databases have added features for handling and searching XML, JSON, geospatial or ML vector data.

The next level is the Entity–attribute–value (EAV) data model. It's the easiest way to support ad-hoc data. WordPress uses it in MySQL meta tables with columns for the entity id, key and value, covering users, posts and comments.
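
The shape of an EAV store is simple enough to sketch directly (the names below are illustrative, loosely inspired by the WordPress meta tables):

```typescript
// Sketch of an entity–attribute–value store: one row per (entity, key, value).

interface EavRow {
  entityId: number;   // e.g. a user, post or comment id
  key: string;        // attribute name, e.g. "phone"
  value: string;      // everything stored as text
}

class EavStore {
  private rows: EavRow[] = [];

  set(entityId: number, key: string, value: string): void {
    this.rows.push({ entityId, key, value });
  }

  // All values for one attribute of one entity (an entity can have several).
  get(entityId: number, key: string): string[] {
    return this.rows
      .filter((r) => r.entityId === entityId && r.key === key)
      .map((r) => r.value);
  }
}

const store = new EavStore();
store.set(42, "email", "anna@example.org");
store.set(42, "email", "anna@work.example");
console.log(store.get(42, "email")); // ["anna@example.org", "anna@work.example"]
```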

Add support for the values being pointers to other entities, and you end up with a graph database.

Triplestores are a specific type of graph database, built for Resource Description Framework (RDF) data. RDF defines the triple as subject–predicate–object, where the object can be another resource or a literal. Subjects and predicates are identified by globally unique URIs.
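
In code, a triple can be as small as this (a sketch, not any particular library's types):

```typescript
// Sketch of RDF-style triples: the object is either another resource (a URI)
// or a literal value with a datatype.

type Uri = string;

interface Literal {
  value: string;
  datatype: Uri;   // e.g. "http://www.w3.org/2001/XMLSchema#string"
}

interface Triple {
  subject: Uri;
  predicate: Uri;
  object: Uri | Literal;
}

const triples: Triple[] = [
  {
    subject: "https://example.org/person/anna",
    predicate: "http://xmlns.com/foaf/0.1/name",
    object: { value: "Anna Svensson", datatype: "http://www.w3.org/2001/XMLSchema#string" },
  },
  {
    subject: "https://example.org/person/anna",
    predicate: "http://xmlns.com/foaf/0.1/knows",
    object: "https://example.org/person/jonas",   // points to another resource
  },
];
```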

There is no dominant standard similar to SQL for querying graph databases. SPARQL was created for RDF databases. There has been a process for creating a standard Graph Query Language (GQL), based on the Cypher language used with Neo4j. Meanwhile, the completely unrelated GraphQL (from Facebook) has become very popular.

The Gremlin query language has some support among graph databases. It's a form of graph traversal that treats each node as its own object. The selections and groupings are done as method calls rather than plain text strings sent to a server as in SQL. This is the type of API that will work best with the kind of multi-granular, flexible complexity I have described here.
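
A rough sketch of that method-call style, written against a toy in-memory graph rather than Gremlin itself:

```typescript
// Sketch of a Gremlin-like fluent traversal. The API here is an illustration of
// the method-call style, not the actual Gremlin/TinkerPop API.

interface Node {
  id: string;
  labels: string[];
  props: Record<string, unknown>;
  out: Record<string, Node[]>;   // outgoing edges grouped by edge label
}

class Traversal {
  constructor(private nodes: Node[]) {}

  hasLabel(label: string): Traversal {
    return new Traversal(this.nodes.filter((n) => n.labels.includes(label)));
  }

  has(key: string, value: unknown): Traversal {
    return new Traversal(this.nodes.filter((n) => n.props[key] === value));
  }

  out(edge: string): Traversal {
    return new Traversal(this.nodes.flatMap((n) => n.out[edge] ?? []));
  }

  values(key: string): unknown[] {
    return this.nodes.map((n) => n.props[key]);
  }
}

// Usage: "names of everyone employed by the Malmö office"
const jonas: Node = { id: "p1", labels: ["Person"], props: { name: "Jonas" }, out: {} };
const office: Node = {
  id: "o1",
  labels: ["Office"],
  props: { city: "Malmö" },
  out: { employs: [jonas] },
};

const result = new Traversal([office, jonas])
  .hasLabel("Office")
  .has("city", "Malmö")
  .out("employs")
  .values("name");
console.log(result); // ["Jonas"]
```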

ML Vector databases

Any database application will have some sort of search functionality. A full-text search index is expected. But you should also handle alternative spellings and typos. Results should be prioritized and grouped based on what would be most useful in the specific context. Text data can be indexed using phonemes or tokens, allowing related spellings to match.

There are a lot of full-text extensions to existing databases. But the recent improvements in Machine Learning (ML) and Large Language Models (LLMs) allow for more flexible search capabilities, using vector databases. Each document or record is converted to an embedding and stored in the vector database. This DB can then be used for finding the most relevant documents for any free-form text search.
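
A minimal sketch of that idea, assuming an external embed() function that turns text into a vector (in practice an ML model or API call):

```typescript
// Sketch of vector search: documents are embedded once, stored, and queries are
// matched by cosine similarity. embed() is a stand-in for a real embedding model.

type Vector = number[];

declare function embed(text: string): Promise<Vector>;  // assumed external model/API

function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface StoredDoc { id: string; text: string; vector: Vector; }

class VectorIndex {
  private docs: StoredDoc[] = [];

  async add(id: string, text: string): Promise<void> {
    this.docs.push({ id, text, vector: await embed(text) });
  }

  async search(query: string, limit = 5): Promise<StoredDoc[]> {
    const q = await embed(query);
    return [...this.docs]
      .sort((a, b) => cosineSimilarity(b.vector, q) - cosineSimilarity(a.vector, q))
      .slice(0, limit);
  }
}
```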

An LLM can also be used in search even without a custom vector database. You can simply ask it to translate the free-form query into one or more formal search queries, get the result from those queries, and then ask the LLM to sort and return the most relevant results.
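
As a sketch of that flow, with hypothetical askLlm() and runQuery() helpers standing in for whatever LLM API and database are used:

```typescript
// Sketch of using an LLM as a query translator and re-ranker.
// askLlm() and runQuery() are hypothetical helpers, not a specific API.

declare function askLlm(prompt: string): Promise<string>;
declare function runQuery(formalQuery: string): Promise<string[]>;

async function smartSearch(freeText: string): Promise<string[]> {
  // 1. Let the model translate the free-form question into formal queries.
  const queriesJson = await askLlm(
    `Translate this question into a JSON array of search queries: ${freeText}`
  );
  const queries: string[] = JSON.parse(queriesJson);

  // 2. Run the formal queries against the database.
  const results = (await Promise.all(queries.map(runQuery))).flat();

  // 3. Let the model sort the combined results by relevance.
  const rankedJson = await askLlm(
    `Sort these results by relevance to "${freeText}" and return them as a JSON array:\n` +
      JSON.stringify(results)
  );
  return JSON.parse(rankedJson);
}
```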

A vector database can also be used for clustering, grouping similar data together to make it easier to navigate, and for classifying data, such as automated tagging.

Artificial Intelligence (AI), Generative Pre-trained Transformers (GPT) and LLMs are not an alternative to structured data. But they will make it easier to search and parse all available data.

Symbolic AI vs Machine Learning

AI has traditionally been synonymous with what now falls under Symbolic AI. Symbolic AI uses Knowledge Representation (KR) with ontologies and the rules of logic for step-by-step reasoning. This has been the image of AI in science fiction for a long time, depicting it as emotionless, rigid, mathematical thinking.

On the other hand, we have machine learning, deep learning, Generative Pre-trained Transformers (GPT) and LLMs, which use fuzzy pattern-matching statistics in a way that is closer to creativity and feeling. Though, there have been amazing improvements in reliability for many tasks.

There has been a lot of work in the symbolic AI field trying to build up a knowledge base that would have the ability to reason across all domains. All those models have failed to handle the fuzziness of human language and thinking, and the complexities of the real world. Trying to handle all the edge cases can make the logic complex and hard to understand. The resulting reasoning will be rather far removed from how people think in the corresponding situation, and will still fail in many simple scenarios.

All computer programs work by algorithmic processing of structured data. KR is used together with ML, regardless of whether symbolic AI is involved. One of the most popular uses for LLMs is to search databases, parse text and output JSON. KR in and KR out. Humans use tools and logic in addition to neural-network associations.

There needs to be a symbiosis between feeling and thinking. Neuro-symbolic AI. This is happening in many ways now in how LLMs can use APIs for looking up or validating information, such as code syntax, or integrations like Wolfram Alpha.

Semantic databases

Most traditional databases restrict the encoding of real-world information in a way that often loses important detail. People will often knowingly input wrong information, since that is the only thing the application accepts. That is one of the many ways "dirt" is introduced into the data.

A semantic database refers to the storage and DB functions for using semantic data. The point of semantic data is to have enough information to know how to use the data in all places where it could be relevant. Only having a table definition with data types and some constraints is not enough. The promise is to be able to generalize functionality in a way where you don't have to hand-code special cases for how the data is used in each new system. It will make integrations and expanded functionality easier. For example, consider what a calendar program would need to know to actually help you with when you are busy and when it's time to prepare for the next part of your day. And consider how much of this it could deduce from available information, so as not to force you to provide the information again.

There are many ways to build a semantic database. This is how I would do it. RDF was constructed to be the simplest way to represent semantic data. A single table of statements with an id and the triple subject–predicate–object is enough, since metadata can be added by letting the subject of a new statement refer to another statement. But some metadata is important enough to be added directly to each statement, like the origin, agent, transaction id, the tri-temporal transaction time, activation time and valid time, the object type, sorting weight and low-level metadata for authorization and inference.
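
A sketch of what such an extended statement record could look like; the field names are my illustration of the metadata listed above, not a fixed format:

```typescript
// Sketch of a statement table row with per-statement metadata.

type ResourceId = number;   // statements, origins and agents are all resources

interface LiteralValue {
  objectType: string;       // where/how the literal is stored, e.g. "text", "int"
  value: string;
}

interface Statement {
  id: ResourceId;
  subject: ResourceId;
  predicate: ResourceId;
  object: ResourceId | LiteralValue;

  origin: ResourceId;       // source of the data (import, user, inference, ...)
  agent: ResourceId;        // who asserted it
  transactionId: number;

  transactionTime: Date;    // when the statement was recorded
  activationTime?: Date;    // decision time: when it became the accepted version
  validFrom?: Date;         // valid time in the modeled domain
  validTo?: Date;

  weight?: number;          // sorting weight among alternatives
  authFlags?: number;       // low-level metadata for authorization and inference
}
```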

Each piece of data in RDF is called a resource. All resources can have properties, by using the resource as the subject in a statement. Statements are also resources. With my extended metadata, the origin and agent are also resources. Resources can also be literals. A literal is a value with a data type, such as a text or a number. The object type metadata is used to refer to where and how the literal is stored in the database.

Predicates may have an inferred type for their objects. Subjects can have one or more type statements, referring to the class that defines the resource. Each class can have a schema defining its relationship to other resources. For example, names, phone numbers and email addresses may all be stored as text strings. The type can be implicit from the predicate or explicit from another type statement. If there is a label for the phone number, it can be given in an additional statement with the literal as the subject.

The power of the semantic database comes when the type is augmented with programming code used to compose the properties of the resource. This gives a uniform and flexible interface, regardless of how the data is stored in the database. It allows for multi-granular or flexible specificity of the data. You don't have to consider whether the address is stored as a single text block or divided into several fields. Just use the dynamic properties of the address resource.
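
A sketch of what such type-attached composition code could look like, with an invented Address class as the illustration:

```typescript
// Sketch: the class attached to a type composes a uniform property from whatever
// underlying statements exist. The Address class here is invented for illustration.

interface PropertyBag {
  get(predicate: string): string | undefined;
}

class Address {
  constructor(private props: PropertyBag) {}

  // Callers always read `formatted`, regardless of whether the address is stored
  // as a single text block or as street/postalCode/city in separate statements.
  get formatted(): string {
    const block = this.props.get("addressText");
    if (block) return block;

    const parts = [
      this.props.get("street"),
      [this.props.get("postalCode"), this.props.get("city")].filter(Boolean).join(" "),
      this.props.get("country"),
    ];
    return parts.filter(Boolean).join("\n");
  }
}

// Usage with a minimal in-memory property bag holding fine-grained fields.
const fineGrained = new Address({
  get: (p) =>
    ({ street: "Storgatan 1", postalCode: "211 34", city: "Malmö" } as Record<string, string>)[p],
});
console.log(fineGrained.formatted); // "Storgatan 1\n211 34 Malmö"
```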

Data Integrity and Consistency

Graph databases have the flexibility of adding data and relations with no constraints. This will also make it easier to create and evolve the model. Constraints can be introduced gradually, with optional and required properties, relationship cardinality and the expected data type.

I prefer to define the actual schema with the class rather than with the rest of the data, in order to keep the schema in sync with all the code for dynamic properties, methods and validations. The class can list mixins and inheritance. Properties will be declared with min and max cardinality for the domain and range. A property can be defined to point to a literal type that may be a subclass of another literal, compatible and mapped to XML datatypes and the primitive types stored in the DB.
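
A sketch of what declaring the schema next to the class could look like; the declaration format and names are illustrative only:

```typescript
// Sketch: the schema is declared together with the class so that dynamic properties,
// methods and validations stay in sync with it.

interface PropertySpec {
  range: string;           // literal type or class the property points to
  minCardinality?: number; // default 0
  maxCardinality?: number; // default unlimited
}

interface ClassSpec {
  name: string;
  mixins?: string[];       // other classes to compose behavior from
  inherits?: string[];
  properties: Record<string, PropertySpec>;
}

const PersonClass: ClassSpec = {
  name: "Person",
  mixins: ["HasContactInfo"],
  properties: {
    name: { range: "Text", minCardinality: 1 },            // at least one name
    email: { range: "EmailLiteral" },                       // a literal subclass of Text
    employer: { range: "Organization", maxCardinality: 1 },
  },
};
```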

The system should handle state transitions with transactions. Traditional RDB transactions will usually not allow multi-stage updates spanning client and server. The temporal data structure allows you to set up a future transaction that exists alongside the active version, but is only used within that agent's transaction context.

Each class will have authorization checks for reading or writing. There will also be hooks for creating, updating and deleting data. They will validate the change, notify dependent data, do the change, validate the resulting state, and then commit or throw an exception based on the result. A commit will return a list of all resources updated.
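
A sketch of that hook sequence for an update, with invented names for the steps:

```typescript
// Sketch of the update flow: authorize, validate the change, notify dependents,
// apply it, validate the resulting state, then commit or fail. Names are invented.

interface Change {
  resourceId: number;
  predicate: string;
  newValue: unknown;
}

interface Hooks {
  authorizeWrite(agent: string, change: Change): boolean;
  validateChange(change: Change): void;            // throws on invalid input
  notifyDependents(change: Change): void;          // e.g. invalidate composed values
  apply(change: Change): void;
  validateState(resourceId: number): void;         // throws if resulting state is invalid
}

function update(hooks: Hooks, agent: string, changes: Change[]): number[] {
  if (!changes.every((c) => hooks.authorizeWrite(agent, c))) {
    throw new Error("not authorized");
  }
  for (const c of changes) {
    hooks.validateChange(c);
    hooks.notifyDependents(c);
    hooks.apply(c);
  }
  // Validate the resulting state; a thrown exception here would trigger a rollback
  // in a real implementation.
  for (const c of changes) hooks.validateState(c.resourceId);

  // A commit returns the list of all updated resources.
  return [...new Set(changes.map((c) => c.resourceId))];
}
```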

Performance and Scalability

A semantic database is inherently slower than a traditional DB. But you will still have the usual ways to optimize the performance. Inferred and composed data can be indexed and stored for quick lookup. Those lookup tables may be created dynamically or set up by hand as part of optimization work for the application that needs it.

Searches in the DB should go through the process of planning the query based on the available indexes and the number of resources returned by each constraint. Start with the lookup that returns the fewest resources, then go through the results and filter out those that do not satisfy the remaining constraints.
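
A sketch of that planning step: estimate the cardinality of each constraint, start from the most selective one and filter with the rest:

```typescript
// Sketch of constraint ordering: look up the constraint expected to return the
// fewest resources first, then filter the candidates with the remaining constraints.

interface Constraint {
  estimateCount(): number;                 // from index statistics
  lookup(): Set<number>;                   // resource ids matching this constraint
  matches(resourceId: number): boolean;    // cheap per-resource check
}

function planAndRun(constraints: Constraint[]): number[] {
  if (constraints.length === 0) return [];

  // Start with the most selective constraint.
  const ordered = [...constraints].sort((a, b) => a.estimateCount() - b.estimateCount());
  const [first, ...rest] = ordered;

  const candidates = first.lookup();
  return [...candidates].filter((id) => rest.every((c) => c.matches(id)));
}
```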

Data can be replicated to several servers for read transactions. Write transactions can be put in an event log that is used (by tailing the file) to update the other servers. The application needs to handle updates in a way where expected changes may be denied or may take a while to commit. The data versioning with transaction id and time plays a part in this.

I have been using syllogism rules for inferring statements such as types through subClassOf, memberships, locations and the like. Instead of searching for resources having a specific inferred property, the inferred statements can be stored directly on creation, among the other dynamic properties, with the origin marked as inferred. This will make lookups much faster.
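
A sketch of materializing one such rule (type propagation through subClassOf), storing the inferred statements with an "inferred" origin; the data shapes are illustrative:

```typescript
// Sketch: materialize inferred type statements through subClassOf chains when a
// resource is created, marking their origin as inferred so they can be told apart
// (and retracted) later.

interface TypedStatement {
  subject: string;
  predicate: "type";
  object: string;
  origin: "asserted" | "inferred";
}

// subClassOf edges: child -> parents
const subClassOf: Record<string, string[]> = {
  Employee: ["Person"],
  Person: ["Agent"],
};

function inferTypes(subject: string, assertedType: string): TypedStatement[] {
  const statements: TypedStatement[] = [
    { subject, predicate: "type", object: assertedType, origin: "asserted" },
  ];
  // Walk up the class hierarchy and store each superclass as an inferred type.
  const queue = [...(subClassOf[assertedType] ?? [])];
  const seen = new Set<string>();
  while (queue.length > 0) {
    const cls = queue.shift()!;
    if (seen.has(cls)) continue;
    seen.add(cls);
    statements.push({ subject, predicate: "type", object: cls, origin: "inferred" });
    queue.push(...(subClassOf[cls] ?? []));
  }
  return statements;
}

console.log(inferTypes("anna", "Employee").map((s) => `${s.object} (${s.origin})`));
// ["Employee (asserted)", "Person (inferred)", "Agent (inferred)"]
```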

The parts of the database that are more uniform can be stored in a more traditional way. With the API being the same and the dynamic properties handled by the class, it doesn't matter how the data is stored. It can be a local cache, a remote database or anything else. Just map the resources and let the origin state the source of the data.

Data Manipulation

Everything here is something I have worked with during the last 25 years of semantic database development, except for the "valid time" of the temporal data, which models slowly changing dimensions. The query language used is similar to Gremlin, where you search, traverse, collate and modify data through property and method calls on objects.

I have created five different semantic databases. See my article 30 years in the Web for more details. I would like to publish more from the latest cutting-edge semantic database, named Gad, written in isomorphic JavaScript. That system was set up to overcome many of the problems of the older RDF::Base.

[[https://www.facebook.com/aigan/posts/10159702481267393]]
[[https://twitter.com/aigan/status/1727295041893450098]]

Written by Jonas Liljegren
Building modern web components on reactive state semantic graphs. Passionate about exploring unconventional methods in technology development to shape a better future.