The Dream of Semantic Web

When I first started learning web development, there was a huge emphasis on semantics. Back then there was no HTML 5, and XHTML was all the hype. If I remember correctly, it was around early 2006. Everyone seemed to be talking about semantics, and content vs. presentation. CSS was on everyone’s mind, and there was a huge push to move JavaScript out of HTML. I learned HTML and some PHP back then, and then I stepped away from the web scene. I used to take my text editor out every once in a while and write some code, but none of it ever did anything useful. By the end of 2009, I lost my job at a club, and was fortunate enough to get my first client almost right away. My first gig as a web programmer was to develop an online comic viewer which pre-loaded the 3 pages following the one you were currently viewing so that the experience was as smooth as possible. I wish I still had the source code.

By the time I came back to the web scene in January 2010, a lot had changed. HTML 5 was beginning to make a name for itself, and the talk of semantics was not as loud anymore. I don’t believe semantics ever really took off. Even now there are people trying to push semantics into the web, but honestly, I don’t feel there is a lot of interest from most developers. I am not talking about the famous people who give talks around the world. I am talking about the vast majority of developers; the ones who build the site for the local supermarket. These developers care about getting the job done and getting paid. For them, semantics is a pain in the neck.

Why Semantics

I believe it is sad we don’t care about semantics. Semantics are a very important part of web authoring. They give meaning to what is otherwise a bunch of markup and text. For you and me, it may be easy to look at a phone number on the screen and know that it is a phone number, and to know to whom it is related. Our brains are powerful enough to make the connection, to identify patterns, and to know that what they are looking at is a phone number, but computers are dumb. They need semantics to make the connection. They may be able to identify that something is a phone number based on patterns, but they don’t know what that phone number means. They don’t know to whom it belongs, or what in the page is related to that phone number.
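To make that gap concrete, here is a minimal Python sketch. The pattern and the sample text are my own assumptions for illustration (a simplified North American format), not a general phone number parser. The program can spot the numbers by their shape, but nothing in its output says whose numbers they are.

```python
import re

# Hypothetical, simplified pattern: matches shapes like 555-123-4567
# or (555) 123-4567. Real-world phone formats vary far more than this.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def find_phone_numbers(text):
    """Return every substring that looks like a phone number."""
    return PHONE_PATTERN.findall(text)


# Made-up sample text with two numbers belonging to different parties.
sample = (
    "Call the store at 555-123-4567. "
    "For press inquiries, reach the president at (555) 987-6543."
)

numbers = find_phone_numbers(sample)
print(numbers)  # a flat list of strings, with no notion of ownership
```

A human reading the sample knows which number is the store's and which is the president's. The list the program returns has lost that relationship entirely.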

Another example would be a sale. A human can visit an online store and see that there is a big banner announcing a sale. A human will know that there is a sale, but a computer won’t, unless there is some kind of semantic bit that tells it that the banner is for a sale.

I hope you see why semantics are important. However, make no mistake, I’m not here trying to convince you to care about semantics. On the contrary, I’m just here to rant about how hard it is to actually implement semantics on the web, and ultimately to argue that we may be wrong in our efforts towards a semantic web.

I appreciate the semantics of the web, not as a developer, but as a web user. When I say web user, I don’t mean a web parasite: someone who just goes to YouTube, Facebook, and Twitter, or who spends all day on funny sites looking at memes and cats. What I mean by web user is someone who uses the web to gather information, do research, and collect data. When you do these things, you appreciate the importance of semantics, because better semantics means you can automate your search for information. When there are good semantics, a computer can talk to a website and understand what it is saying without human interaction. That is the dream.

Imagine a web where you can sit and ask the computer, “What is the phone number of my closest pizza parlor?” The computer then connects to the internet, searches for pizza parlors close to your current location, grabs the phone number, and gives it back to you. We have things like this today, or at least very similar. The mobile industry is pushing hard towards this kind of interaction with the machine, and the only way this can be possible today is by incorporating semantics so that computers can talk to each other and identify the information they need. Semantics tell the computer what a piece of information means.

Maybe it is because when I learned web development there was a huge emphasis on semantics that I believe that a good developer implements semantics. However, as a developer I have internal debates all the time regarding semantics, mostly because semantics get in the way of me developing happily.

Classes for Semantics

The problem, I believe, is how we try to implement semantics. The most basic step towards the semantic web is in the elements we use. Once we have used the semantically correct elements, we rely mostly on classes to give meaning to the data within those elements’ tags. However, classes are also the main means of specifying styles. Since most of us care more about getting the job done than we do about semantics, we usually end up creating classes like “inner_content”, “main_content”, “column”, “sidebar”, and others like them. Some of these classes are arguably semantic, and this uncovers one of the main problems with semantics: semantics are very subjective.

Because semantics are basically the meaning of things, they depend a lot on the interpreter. Something can be semantically correct for one person, but not so much for another. Let’s take, for example, the class name “main_content”. One could argue that it is semantically correct, since it specifies that the content of the element is the main content in the document, but it could also be argued that it provides no useful information on what this content is about, which is basically what semantics should do. Another example could be the class name “phone_number”. It is very semantic, because it specifies that the content of the element is a phone number, but at the same time, it does not provide any information regarding whose phone number it is. We could argue that if we want real semantics, we should use the class name “company_phone_number”, or “president_phone_number”, or “store_phone_number”.

However, if we use many class names like these, we end up with a document that is hard to style. A quick solution would be to use two class names for each phone number. One would specify that the element represents a phone number, and the other one would specify what the phone number means, or to whom it is related. But if we do that, we may end up with documents that have thousands of different class names. To overcome this issue, we have created standards such as microformats.
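The two-class idea can be sketched from the consuming side. The following Python example is only an illustration: the markup, the “store” and “president” class names, and the PhoneNumberExtractor helper are all hypothetical, and this is not the microformats vocabulary. It uses the standard library's html.parser to pull out each phone number together with whatever other classes the element carries.

```python
from html.parser import HTMLParser


class PhoneNumberExtractor(HTMLParser):
    """Collect (other classes, text) for elements with the phone_number class.

    Simplifying assumption: a phone_number element contains only text,
    with no child elements inside it.
    """

    def __init__(self):
        super().__init__()
        self.results = []
        self._owners = None  # non-None only while inside a phone_number element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "phone_number" in classes:
            # Every class other than phone_number is treated as "whose number".
            self._owners = [c for c in classes if c != "phone_number"]

    def handle_data(self, data):
        if self._owners is not None and data.strip():
            self.results.append((self._owners, data.strip()))

    def handle_endtag(self, tag):
        self._owners = None


# Hypothetical markup using one class for "what" and one for "whose".
sample = (
    '<p>Store: <span class="phone_number store">555-123-4567</span></p>'
    '<p>President: <span class="phone_number president">555-987-6543</span></p>'
)

parser = PhoneNumberExtractor()
parser.feed(sample)
parser.close()
print(parser.results)
```

The catch is visible in the code itself: the program only works because the author and the consumer agreed in advance on what “phone_number”, “store”, and “president” mean. Without a shared vocabulary, every site would need its own extractor.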

False Semantics

The number one rule I learned regarding class names is that a class should specify what the element is about, not how it should look. Following that principle, I try to avoid using class names such as “blue_button”, “floating_bar”, “left_column”, and the like. But this is not always possible. There are times when you need to design an element without knowing what content is actually going to be placed in it. Think, for example, of templates. We design templates, and we intend that some element should ideally contain a certain kind of content, but we don’t know what is actually going to be in there. This can create false semantics.

Imagine that you develop a template for a portfolio. The template has a section for portfolio items. These items have an image, a title, and a description. You give these elements a class name of “portfolio_item”. Someone sees the template, and thinks that the portfolio section would be great for creating a directory. The portfolio item image could serve as the person’s picture, the title as the person’s name, and the description could be used to enter the person’s details. There is now a site that has false semantics. This does happen in real life.

This raises the question: is class-based semantics really a good idea? Regardless of the answer to that question, one of the most widely adopted initiatives toward a semantic web, microformats, uses classes extensively to provide semantic meaning to data. To be fair, regardless of how we implement semantics in our documents, there will always be the risk of false semantics when working with templates. However, one of my main concerns with class names for semantics is that classes are used to style elements.

Class Names to Describe Structure

As I mentioned earlier, one of the main rules regarding class names is that they should describe the contents of the element, not the visual representation. However, sometimes you need elements purely for the sake of design. Think of instances when you need to have columns available. You will most likely create some sort of markup to use every time you need columns. This markup needs some kind of styling, which means you will most likely add class names to the tags in the markup, but these classes describe the visual representation of the element, or at the very least they describe the type of visual structure that these elements create. For example, some of these classes may be “columns_container”, “column_one_of_four”, or “column_two_of_three”. These class names have no real semantic meaning. They simply describe how the elements are rendered in the page.

What is the Difference?

Thinking of that example, you may ask what the difference between a class name of “column” and one of “phone_number” is. For a person like me, who has invested a lot of time thinking about semantics, the difference is obvious, but for some people it may not be.

The easiest way to identify whether a class name is semantic is to see if you can guess what the element contains just by looking at the class name. If you cannot tell exactly what content the element has, then the class name is not semantic enough. The class name “phone_number” tells you right away that the content is going to be a phone number. The class name “column” doesn’t tell you what the content is, only that the content will most likely be displayed in a column. Thinking like this makes it really easy to see why a class name of “main_content” is not very semantic. It tells you that the content is probably the most important part of the page, but you don’t really know what that content is about.

However, we need those kinds of non-semantic classes if we want to come up with beautiful websites with complex structures. Back when the web was still a baby, there was not much need for beauty in sites. The web was mostly a way to share information. Websites were really simple, usually consisting of only text. It was mostly meant for the exchange of scientific information. However, as the web became available to a wider range of the population with different needs and likes, it was necessary to implement ways for developers to add beauty to the web. But the web was initially not meant to be like it is today. It evolved on its own, and it may have done so in the wrong way.

I believe that if we were to invent the web today, with the knowledge that we’ve gained, we would do it much differently. However, this is one of those situations where we need the experience we gained from the mistakes we made.

A Step in the “Right” Direction

I think the best move we’ve made so far in regards to semantics was the creation of XML. XML is hated by a lot of people, and for good reasons. XML is hard. I’m sure many of you will say that XML is very easy. What most people know about XML is:

It is a markup language.
You create your own tags.

I can see why they think it is easy. But if you actually read the XML documentation, you will see that it is not as easy as it seems. XML is hard to work with and hard to parse, you often end up writing DTDs, it is very verbose and hard for humans to read, and it is easy to make mistakes in it, among many other problems. But the one thing it got right is that you define your own tags. This means you can tackle semantics really well. But the closest we’ve come to XML on the web was XHTML, and that didn’t end very well.
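Here is a rough Python sketch of what “define your own tags” buys you. The element and attribute names (store, phone, owner, sale, ends) and the data are invented for this example and belong to no standard vocabulary. The parsing uses xml.etree.ElementTree from the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every tag and attribute name here is made up.
document = """
<store>
  <name>Corner Pizza</name>
  <phone owner="store">555-123-4567</phone>
  <sale ends="2010-01-31">Half price on large pizzas</sale>
</store>
""".strip()

root = ET.fromstring(document)

# The tag names carry the meaning, so there is nothing to guess from patterns.
phone = root.find("phone")
sale = root.find("sale")

print(phone.get("owner"), phone.text)
print(sale.get("ends"), sale.text)
```

The program knows that there is a sale and whose phone number it is looking at, because the markup says so directly. The same caveat as with classes applies, though: it only works if whoever reads the document already knows what a “sale” tag is supposed to mean.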

Semantics is a Dream

Semantics is just a dream. We like to think that somewhere, somehow, there are people who have the key to a semantic web, and to turning iron into gold. The truth is that semantics is most likely not possible to accomplish with the current approaches and technology. That is not to say you should not care at all. A little semantics is better than no semantics. But a pure semantic web is far from our reach.

A New Hope

One of the main problems with semantics is that it leaves the low-level work to the author. Programmers are almost inherently lazy when it comes to doing repetitive tasks. The sound of semantics doesn’t really appeal to a programmer because it requires manually specifying the meaning of data. It is true that the web is not entirely built by programmers. There are a lot of people out there who only know HTML and CSS, and they are building websites. Moreover, there is also a vast number of people who don’t really know HTML and CSS but who are also building websites. Those who know HTML and CSS may be willing to implement semantics, but they mostly care only about the paycheck. Those who don’t know them don’t care at all about any of this. They haven’t even cared enough to learn to create proper markup. So, the 3 kinds of people who build the web are not really interested in investing time in low-level, repetitive tasks such as implementing semantics.

I believe it is at this point that we should look for other alternatives such as machine learning, natural language processing, and content recognition algorithms. If our goal is really to create a method for computers to understand data without human interaction, the approach should not be to create languages that computers can understand, but to create computers that understand human language. While semantics may be a dream, creating such computers is the future.