On DISCOVERABILITY

If you're a person with any curiosity at all, you don't just want to CONSUME CONTENT online, you want to know ABOUT the things that you read and the images that you see.

Who made this? When? Where can I find more? What inspired this? Is there something similar? What else do I need to know to understand this? Can I trust this? Can I ask the author a question?

Hyperlinks on Web pages will sometimes lead us to answers for questions like these – but sometimes not. And what about other types of resources? How do we find out more information about images? Text files? Audio recordings? What are the means of DISCOVERY?

TRANSCLUSION

The chief benefit of a hypertext system built around transclusion is that it facilitates discovery of a media fragment's origin. In order for transclusion to work, a transcluding document must contain a reference to a transclusion source – and users need only follow that reference back to the source media to learn more about it, such as its authorship or original context ...right?

This works very well when the transclusion source contains all of this information. For example, a Project Gutenberg etext is a wonderful transclusion source: it contains a published work in its entirety, the author's or editor's name(s), the original publication date of the digitized source, the publisher's name, and in many cases the full front matter/indicia of the book, along with information about who digitized the work and the usual Project Gutenberg license. Sadly, not all transclusion sources will be so richly outfitted.

Browsing around textfiles.com gives a good sampling of how widely plain texts online vary in the metadata they carry. In some cases, you're lucky to get an author's email address. License, publication date, and other context are often absent – I'm as guilty as anyone of this with my own plain texts (take a look at the source of this explainer and see what you can glean from it!).

When dealing with other transcluded media such as images, the situation is often even worse. This is something most of us are already familiar with, because images have always been transcluded on the Web. Right-clicking on an image and downloading it or opening it in a new tab often reveals no more information about the image than was apparent to you when you first saw it in its transcluding Web page. There MIGHT be something useful in the file name – but I wouldn't count on it. There MIGHT be some metadata embedded in the image, in the form of EXIF, IPTC, XMP, or any of various other embeddings – but do you know off the top of your head how to access that information? The browser is certainly not offering it up to you.

One thing that COULD be very informative about a media resource is its LOCATION.

HIERARCHIES AND INDEXES

In the earliest generation of the Web, sites were structured in a very consistent way; they were essentially file systems exposed over HTTP. The "path" of directories after the server name in a Web address corresponds to the named, nested "folders" you store files in on your personal computer. The advantage of this predictable hierarchy was that you could move "up" the path to find more information about a resource and the site that hosted it.

For example, let's say you are reading a Web page about early model Volkswagen beetles at:

http://zorgonscars.com/vw/beetle/early_models.html

And you see a particularly good mechanical drawing on that page. Upon inspection of the image, you see that it's being hot-linked from:

http://www.oldautomanuals.com/type-1/1940s/drawings/04.jpg

Now, at the end of that URL, we have the name of an image file, "04.jpg", and the name of the directory it's contained in, "drawings". It would be a matter of a few keystrokes to remove the file name and send your Web browser to the parent directory:

http://www.oldautomanuals.com/type-1/1940s/drawings/
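This kind of path-trimming is simple enough to do by hand, but it can also be done programmatically. A small sketch using Python's standard library, with the (fictional) URLs from the example above:

```python
from urllib.parse import urljoin

def parent_dir(url: str) -> str:
    """Resolve '.' against the URL, stripping the final path segment."""
    return urljoin(url, ".")

# Strip the file name to reach the enclosing "drawings" directory:
print(parent_dir("http://www.oldautomanuals.com/type-1/1940s/drawings/04.jpg"))
# -> http://www.oldautomanuals.com/type-1/1940s/drawings/

# Resolving '..' moves one level further up the hierarchy:
print(urljoin("http://www.oldautomanuals.com/type-1/1940s/drawings/", ".."))
# -> http://www.oldautomanuals.com/type-1/1940s/
```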

In the early days of the Web, this would almost certainly have presented you with an INDEX PAGE listing the contents of the "drawings" directory, which, in addition to containing some other images, might also contain information ABOUT the images contained in that directory. Moving up the hierarchy to the "1940s" directory would give you another index page where you could see what other resources, aside from drawings, were available. The path of a URL was like a series of doors leading back to the front entrance of a site, and you could look into each of the rooms on the path and DISCOVER all sorts of wonderful resources, and information about those resources.

In today's Web, resources such as image files are so often retrieved via a nondescript API endpoint that this kind of browsability is no longer available. For example, you can learn nothing from this Wikimedia Commons URL aside from the fact that there is an image file somewhere in the Wikimedia Commons named "ATMS_52_-_Volkswagen.jpg".

https://upload.wikimedia.org/wikipedia/commons/f/fe/ATMS_52_-_Volkswagen.jpg

Trying to move up the path of that URL just gives errors about a failed regular expression match. We can't discover anything. The path is a series of doors to nowhere.

At least it can be read by a human being, though – unlike the impenetrable walls of base-36 query strings in your average CDN resource URL...

METADATA

HTTP transactions can carry a pretty rich metadata payload via Link headers. Servers COULD deliver information about authorship and licensing, as well as links to original or alternative publication contexts with each server response. But, by and large, they don't.
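To give a concrete sense of what such a response COULD look like: a Link header packs one or more <target>; rel="relation" pairs into a single header value. Here is a sketch of a simplified parser for that syntax (the author and license URLs are hypothetical examples, not from any real server):

```python
import re

def parse_link_header(value: str) -> dict[str, str]:
    """Parse an HTTP Link header into {rel: target} (simplified RFC 8288)."""
    links = {}
    for target, params in re.findall(r'<([^>]*)>\s*((?:;[^,]*)*)', value):
        rel = re.search(r'rel="?([^";]+)"?', params)
        if rel:
            links[rel.group(1)] = target
    return links

# A hypothetical Link header a server COULD send with any resource:
header = ('<https://example.net/authors/jrc>; rel="author", '
          '<https://creativecommons.org/licenses/by/4.0/>; rel="license"')
print(parse_link_header(header))
```

(This ignores quoted commas and other RFC 8288 corner cases; it is only meant to show the shape of the mechanism.)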

Metadata systems like Open Graph are used here and there online to offer some of this information in HTML documents (they could be used in HTTP headers, but I've never really seen this in the wild), and that's just dandy for HTML documents – but what about everything else?
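For the HTML case, Open Graph metadata is just a set of <meta property="og:..."> tags in the document head, which makes it easy to harvest. A minimal sketch using only Python's standard library (the page content here is invented for the example):

```python
from html.parser import HTMLParser

class OGParser(HTMLParser):
    """Collect Open Graph <meta property="og:..."> tags from an HTML page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property", "")
            if prop.startswith("og:"):
                self.og[prop] = d.get("content", "")

page = """<html><head>
<meta property="og:title" content="Early Model Beetles" />
<meta property="og:type" content="article" />
</head><body>...</body></html>"""

p = OGParser()
p.feed(page)
print(p.og)
```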

ALPH

In the design of the Alph system, I'm not just interested in bringing "back" the surfability of the old Web, but extending it.

Transclusion is, of course, a core component of the system. The Docuplextron is designed around the creation of text-transcluding HTML documents, and "GET SOURCE MEDIA" is one mouse-click away on all transcluded resources.

The Alph server generates useful index pages for all directories. They may not be terribly pretty, but they're informative. Users can override the server-generated index pages by providing their own index.html files, but these can be bypassed. In addition, by creating documents like this one, I hope this project contributes to a rethinking of Web site design best practices, and I intend to advocate for the benefits of a discoverable Web, including the use of explorable paths and useful index/landing pages.

Finally, on the metadata front: all media resources (text, images, audio, video, everything) being served from Alph.py advertise themselves (in an HTTP Link header) as Linked Data Platform Resources. While the server is nowhere near a complete implementation of the Linked Data Platform spec, it at least allows user agents to send an HTTP GET request to a resource with the Accept header set to an appropriate RDF serialization format ('application/ld+json', 'application/rdf+xml', 'application/n-triples', 'text/n3', and 'text/turtle' are supported at the moment) to receive a document that DESCRIBES the resource. For more information about this aspect of the server, check out the Alph.py explainer.
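From the client side, this content negotiation amounts to an ordinary GET with a different Accept header. A sketch of how a user agent could build such a request (the server URL is hypothetical, and no network request is actually sent here):

```python
import urllib.request

def describe_request(resource_url: str,
                     rdf_format: str = "text/turtle") -> urllib.request.Request:
    """Build a GET whose Accept header asks the server for a DESCRIPTION
    of the resource, in an RDF serialization, rather than the media itself."""
    return urllib.request.Request(resource_url,
                                  headers={"Accept": rdf_format})

# Hypothetical resource on a hypothetical Alph server:
req = describe_request("https://example.alph.server/media/photo.jpg")
print(req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen() would then return the Turtle (or JSON-LD, etc.) description instead of the JPEG bytes, assuming the server honors the negotiation.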

This is not metadata for machines and search engine optimization; this is metadata for humans. The Docuplextron makes it easy to retrieve it on all transcluded resources, if it's available. If an author and/or title for a work is present in the metadata, it appears in the lower-right corner of the browser as soon as the user's pointer hovers over the resource. If the author/publisher has provided any presentationContext links in the metadata, the user will have the option to OPEN CONTEXT DOCUMENTs when right-clicking on media fragments. And if they want to inspect all available metadata: right-click > SOURCE INFO. Poof.

And finally, resources on an Alph server will try to keep a record of documents that link to and transclude from themselves. This store of in-links can be queried just as easily as the resource's Linked Data representation. The implementation of this is rather crude at the moment, but a rough mechanism is in place.
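Conceptually, such an in-link store is a mapping from each resource to the set of documents known to reference it. The sketch below is purely illustrative – none of these names come from Alph.py itself – but it shows the shape of the record/query cycle:

```python
from collections import defaultdict

class InLinkStore:
    """Hypothetical sketch: record which documents link to or transclude
    from each resource, and answer queries about those in-links."""

    def __init__(self):
        self._inlinks = defaultdict(set)

    def record(self, resource: str, referrer: str) -> None:
        """Note that `referrer` links to / transcludes from `resource`."""
        self._inlinks[resource].add(referrer)

    def query(self, resource: str) -> set:
        """Return every known document that references `resource`."""
        return set(self._inlinks[resource])

store = InLinkStore()
store.record("/drawings/04.jpg",
             "http://zorgonscars.com/vw/beetle/early_models.html")
print(store.query("/drawings/04.jpg"))
```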

The hypertextuality of the Web has, in the past 15 years, been obscured by the dominance of ever narrower content-delivery channels. Commercial interests have done a remarkable job of turning it into a platform for user-facing apps fed by arcane services, with decontextualized media stored away in proprietary databases. When you find something intriguing or inspiring online, if the app that served it to you doesn't frame it with relevant information, you're left with one recourse: Google.

The necessity of search engines will never go away, but does the experience of using the Web really have to be: peruse an app, think of a question, ask Google, ...repeat? Who does this serve? And does that usage cycle even produce good outcomes?

Media made by humans should be published at named locations that humans can read, and those locations should be explorable by humans. The names we give things, and the places where we put them, convey a lot of meaning. Furthermore, media resources themselves should be able to respond to queries from humans about their name, their origin, and their situatedness in the Web. That's the philosophy of discoverability in Alph.