WHY ANNOTATING THE WEB IS HARD

And Why A Xanalogical Approach Would Help

Part 3

TRANSIENT CONTENT

Now, what I'm about to describe here isn't a problem with HTML itself. It's a problem with the way we tend to use HTML today: as a transient delivery format rather than as anything intended to last.

Let us imagine that the fates have been kind to us and we're trying to annotate an article on a news site that at least does us the favor of wrapping the content portion of its pages in the appropriate semantic tag, <article>, and giving that element an id.

So here's our hypothetical, and very short, article:

<html>
  <head>
   ...stylesheets, scripts, meta tags...
  </head>
  <body>
    <header>
      Company Logo
      <nav>Navigation links...</nav>
    </header>
    <article id="content_section">
      <h1>Animal Encounter</h1>
      <p>A population of a rare subspecies of wolf once thought extinct was confirmed to exist on Vorvos Island this week. The wolves were spotted by a high-school student who was on a class trip to the Zornkorn Wildlife Park.</p>
      <p>The student posted photos of the wolves on social media after becoming isolated from her group. The photos were later confirmed by biologists to show a group of canis lupus superdupus, a species not known to have been encountered by humans in over a century.</p>
    </article>
    <footer>©2025 NewsWorldSite dot News</footer>
  </body>
</html>


Let's imagine we've got something to say about the passage that runs from "on a class trip..." to "...from her group." This selection crosses the boundary between two paragraphs, but that's not a problem: a range is just a pair of pointers, and it can span any number of elements. Let's look at the graph that we'll be dealing with:

            <article#content_section>
                      │
  ┌───┬─────┬────┬────┴──┬─────────┬──────┐
  │1  │2    │3   │4      │5        │6     │7 
  #  <h1>   #   <p>      #        <p>     # 
      │1         │1                │1 
     #Animal..  #A population...  #The student....

Because the article element has been given a name (its id, "content_section"), we can use it as the root node of our graph. If there's a sidebar with trending articles, some clickbait footer snippets, whatever else, we're not worried about it. The article content is isolated in the <article> element.
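If you'd rather not take the diagram on faith, here is a minimal Python sketch that lists the children of the article element the way the graph numbers them. It assumes the markup is well-formed enough for the standard library's XML parser (xml.dom.minidom), which a real news page often isn't, and the indentation inside ARTICLE is simplified. The names ARTICLE, label and show are hypothetical helpers for this illustration only.

```python
from xml.dom import minidom

# The sample article from above, reduced to the <article> element.
ARTICLE = (
    '<article id="content_section">\n'
    "  <h1>Animal Encounter</h1>\n"
    "  <p>A population of a rare subspecies of wolf once thought extinct was "
    "confirmed to exist on Vorvos Island this week. The wolves were spotted by "
    "a high-school student who was on a class trip to the Zornkorn Wildlife "
    "Park.</p>\n"
    "  <p>The student posted photos of the wolves on social media after "
    "becoming isolated from her group. The photos were later confirmed by "
    "biologists to show a group of canis lupus superdupus, a species not known "
    "to have been encountered by humans in over a century.</p>\n"
    "</article>"
)

root = minidom.parseString(ARTICLE).documentElement


def label(node):
    """Return '#...' for a text node (as in the diagram) or '<tag>' for an element."""
    if node.nodeType == node.TEXT_NODE:
        return "#" + node.data.strip()[:12]
    return f"<{node.tagName}>"


def show(node, depth=0):
    """Print every child with its 1-based position among its siblings."""
    for index, child in enumerate(node.childNodes, start=1):
        print("  " * depth + f"{index} {label(child)}")
        if child.nodeType == child.ELEMENT_NODE:
            show(child, depth + 1)


show(root)
```

The whitespace between tags counts as text nodes of its own, which is why the two paragraphs land at positions 4 and 6 rather than 2 and 3.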

The area that we want to address can be specified with a range, so we need two pointers. You can count the characters yourself or just trust me that it's 171 codepoints into the first paragraph for our first pointer, giving us a child sequence that looks like "content_section"/4/1.171

That reads as: start at the named element, take its fourth child (the first paragraph), take that paragraph's first child (its text node), and move 171 codepoints in.

For the second pointer of the range, which ends 95 codepoints into the second paragraph, we have a sequence that looks like: "content_section"/6/1.95

And so we create an annotation document that refers to this section of the article:

Oh my gosh, my class is going there next week! I hope this happens to me: https://newsworldsite.news/article/202501090257#range(/"content_section"/4/1.171, /"content_section"/6/1.95)
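To make the mechanics concrete, here is a rough Python sketch of what an annotation tool might do with such a range: find the two offsets by searching for the quoted phrases, print the fragment, and pull out the text between the pointers. It is an illustration under assumptions, not a real pointer implementation. It uses the standard library's xml.dom.minidom on a well-formed copy of the article, represents a pointer as a hypothetical (child path, offset) tuple, and the helpers resolve, text_nodes, text_between and fragment are invented for this example.

```python
from xml.dom import minidom

ARTICLE = (
    '<article id="content_section">\n'
    "  <h1>Animal Encounter</h1>\n"
    "  <p>A population of a rare subspecies of wolf once thought extinct was "
    "confirmed to exist on Vorvos Island this week. The wolves were spotted by "
    "a high-school student who was on a class trip to the Zornkorn Wildlife "
    "Park.</p>\n"
    "  <p>The student posted photos of the wolves on social media after "
    "becoming isolated from her group. The photos were later confirmed by "
    "biologists to show a group of canis lupus superdupus, a species not known "
    "to have been encountered by humans in over a century.</p>\n"
    "</article>"
)

root = minidom.parseString(ARTICLE).documentElement


def resolve(root, pointer):
    """Follow 1-based child indices from the root; return (node, offset)."""
    path, offset = pointer
    node = root
    for index in path:
        node = node.childNodes[index - 1]
    return node, offset


def text_nodes(node):
    """Yield all text nodes under `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def text_between(root, start, end):
    """Collect the text that falls between two pointers."""
    start_node, start_offset = resolve(root, start)
    end_node, end_offset = resolve(root, end)
    pieces, inside = [], False
    for node in text_nodes(root):
        lo, hi = 0, len(node.data)
        if node is start_node:
            inside, lo = True, start_offset
        if node is end_node:
            hi = end_offset
        if inside:
            pieces.append(node.data[lo:hi])  # Python slices by codepoint
        if node is end_node:
            break
    return "".join(pieces)


def fragment(pointer):
    """Format a pointer in the child-sequence notation used in this article."""
    path, offset = pointer
    return '/"content_section"/' + "/".join(map(str, path)) + f".{offset}"


# Locate the selection by searching for the phrases quoted in the text.
first_text = root.childNodes[3].firstChild.data   # child 4, counting from 1
second_text = root.childNodes[5].firstChild.data  # child 6, counting from 1
end_phrase = "from her group."
start = ((4, 1), first_text.index("on a class trip"))
end = ((6, 1), second_text.index(end_phrase) + len(end_phrase))

print(f"#range({fragment(start)}, {fragment(end)})")
print(" ".join(text_between(root, start, end).split()))
```

Notice that nothing in the pointer says what the selected text is. It only says where to look.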

This works!

Until the article is revised so that the second paragraph leads with "The student was devoured by the wolves."

Sure, a respectable news agency will add a footer on the article that tells everyone the article has been revised – but that doesn't change the fact that when YOU go back and look at your annotation, it now seems to say that you look forward to being eaten by wolves. Why did you write that?!

This is a totally absurd example, but the principle is illustrated well enough, I think: even if your pointers are still valid, the content that falls between them may well be different when you revisit it than it was on the day you annotated it, because HTML is the delivery mechanism for transient content on the Web, not its long-term storage medium.

(The more likely scenario here is that the news site would insert an advertisement somewhere inside of the <article> element that would end up throwing off your pointers.)
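Both failure modes are easy to reproduce. The sketch below, in Python and assuming well-formed markup that the standard library's xml.dom.minidom can parse, records the two pointers against the original article and then resolves the same pointers against a revised copy and against a copy with an ad inserted between the paragraphs. The ad markup (a div with class "ad" and its text) and the helpers article, text_nodes and selected_text are invented for illustration.

```python
from xml.dom import minidom

P1 = (
    "A population of a rare subspecies of wolf once thought extinct was "
    "confirmed to exist on Vorvos Island this week. The wolves were spotted by "
    "a high-school student who was on a class trip to the Zornkorn Wildlife Park."
)
P2 = (
    "The student posted photos of the wolves on social media after becoming "
    "isolated from her group. The photos were later confirmed by biologists to "
    "show a group of canis lupus superdupus, a species not known to have been "
    "encountered by humans in over a century."
)


def article(second_paragraph=P2, extra=""):
    """Build the sample article, optionally revised or with extra markup."""
    return (
        '<article id="content_section">\n'
        "  <h1>Animal Encounter</h1>\n"
        f"  <p>{P1}</p>\n"
        f"{extra}"
        f"  <p>{second_paragraph}</p>\n"
        "</article>"
    )


def text_nodes(node):
    """Yield all text nodes under `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def selected_text(markup, start, end):
    """Resolve two (child path, offset) pointers and return the text between."""
    root = minidom.parseString(markup).documentElement

    def resolve(pointer):
        path, offset = pointer
        node = root
        for index in path:
            node = node.childNodes[index - 1]  # paths are 1-based
        return node, offset

    (start_node, start_offset), (end_node, end_offset) = resolve(start), resolve(end)
    pieces, inside = [], False
    for node in text_nodes(root):
        lo, hi = 0, len(node.data)
        if node is start_node:
            inside, lo = True, start_offset
        if node is end_node:
            hi = end_offset
        if inside:
            pieces.append(node.data[lo:hi])
        if node is end_node:
            break
    return " ".join("".join(pieces).split())


# Pointers as they would have been recorded on the day of the annotation.
END_PHRASE = "from her group."
START = ((4, 1), P1.index("on a class trip"))
END = ((6, 1), P2.index(END_PHRASE) + len(END_PHRASE))

versions = {
    "original": article(),
    "revised": article("The student was devoured by the wolves. " + P2),
    "with ad": article(extra='  <div class="ad">Hypothetical ad copy</div>\n'),
}
for name, markup in versions.items():
    print(f"{name}: {selected_text(markup, START, END)}")
```

In all three cases the pointers resolve without complaint. In the revised copy the selection now takes in the new opening sentence, and in the copy with the ad, child 6 is no longer the second paragraph at all.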

So... we're back to screenshots and print-outs, I guess!