November 1, 2003

Why page numbers fail us

I keep running into a deep information habit that has never worked well for its intended purpose: the page number, which has been an information curse. Printed documents use page numbers, which are intended as a reference point (not the bragging rights often claimed with Harry Potter and Neal Stephenson books - I am on page 674 and you are on page 233). All of us are familiar with this problem from high school and college, if we happened to have a different printed copy of a classic text. Page 75 of Hemingway's The Old Man and the Sea was not the same in everybody's copy.

Even modern books fail when trying to reference pages. Just look at the mass market edition of Cryptonomicon, with 1168 pages, and the hardcover edition of Cryptonomicon, with 928 pages of the same text. Using a page number as a reference across editions does absolutely no good.

Now we try to reference information on the Web, which should not be chunked up by page count, but by logical information breaks. These breaks are often made by chapter or heading, and rightly so, as this most often helps the reader with context. Documents are often placed on the Internet in paginated form for two purposes: the ability to print and the ability to keep the page numbers. Having information broken up for a print presentation makes some sense if it is going to be printed and read in that manner, but more and more electronic information is being read on electronic devices and not printed. The Adobe reader does not easily flow from page to page, which is a complaint I often hear from readers trying to read page-delimited PDF files.

So if page numbers fail us in the printed world and are even more abysmal in the electronic medium, what do we use? One option is to use natural information breaks, which are chapters, headers, and paragraphs. These breaks in the information occur in every medium and would cause problems for readers and for the information's structure if they were missing.

If we remove page numbers, essentially going native, as books and documents did not have page numbers originally (Gutenberg's Bible did not rely on page numbers; in fact, page numbers in any Bible are almost never used for Biblical reference), then we can easily place small paragraph numbers in the margins to the left and right. In books, journals, and periodicals with tables of contents or article jumps, the page numbers can remain for the document's self-references. The external reference would then have a solid means of reference that actually worked.

Electronic media do not necessarily need page numbers for self-references within the document, as the medium uses hyperlinking to perform the same task appropriately. To reference a document externally, one would use the chapter, header, and paragraph to point the reader to the exact location of the text or microcontent. In (X)HTML each paragraph tag could carry an incremented "id" attribute. This could be scripted to display in the presentation as well as be used as a hyperlink directly to the content, using the "id" as an anchor.
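
As a rough illustration of the incremented "id" idea, here is a minimal Python sketch. The function name number_paragraphs, the "p" prefix, and the "para-num" class are all made up for this example, and it assumes simple markup in which no paragraph already has an id. It adds the id to each opening paragraph tag and puts a small visible number at the start of the paragraph that links to that same anchor.

```python
import re

# Matches an opening <p> tag with optional attributes, but not <pre>, <param>, etc.
P_OPEN = re.compile(r"<p(\s[^>]*)?>", re.IGNORECASE)


def number_paragraphs(html, prefix="p"):
    """Give every <p> an incremented id and a visible, linkable number.

    Assumption: the incoming paragraphs do not carry an id attribute yet.
    """
    counter = 0

    def add_id(match):
        nonlocal counter
        attrs = match.group(1) or ""
        counter += 1
        pid = f"{prefix}{counter}"
        # The visible marker doubles as a hyperlink to the paragraph itself.
        marker = f'<a class="para-num" href="#{pid}">{counter}</a> '
        return f'<p id="{pid}"{attrs}>{marker}'

    return P_OPEN.sub(add_id, html)


sample = '<h2>Intro</h2><p>First.</p><pre>x</pre><p class="note">Second.</p>'
numbered = number_paragraphs(sample)
print(numbered)
```

An outside reference to the second paragraph would then be the document's address followed by #p2. A real document would probably want a proper parser rather than a regular expression, but the numbering logic would stay the same.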

I guess the next question is what to do about "blockquote" and "table" tags, etc., which are block-level elements. One option is not to use id attributes in these tags, as they are not paragraphs and may be placed in different locations in the various presentation media the document is published in. The other option is to include the id attribute, but then the ease of creating the reference information for each document type is eliminated.
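
The two options can be compared with a small Python sketch. The function name number_blocks and the id scheme (tag name plus a counter) are hypothetical. Keeping a separate counter for each tag name is my own assumption, not something settled above; it is one possible way to keep paragraph numbers the same whether or not the other block-level elements get ids.

```python
import re


def number_blocks(html, tags=("p",)):
    """Add id="<tag><n>" to each listed tag, counting each tag name separately.

    Assumption: none of the listed tags already has an id attribute.
    """
    counters = {tag: 0 for tag in tags}
    names = "|".join(re.escape(tag) for tag in tags)
    pattern = re.compile(r"<(%s)(\s[^>]*)?>" % names, re.IGNORECASE)

    def add_id(match):
        tag = match.group(1).lower()
        attrs = match.group(2) or ""
        counters[tag] += 1
        return f'<{tag} id="{tag}{counters[tag]}"{attrs}>'

    return pattern.sub(add_id, html)


doc = (
    "<p>One.</p>"
    "<blockquote><p>Quoted.</p></blockquote>"
    "<table><tr><td>x</td></tr></table>"
    "<p>Two.</p>"
)

# Option one: only paragraphs get an id.
paragraphs_only = number_blocks(doc)

# Option two: blockquote and table get ids as well.
all_blocks = number_blocks(doc, tags=("p", "blockquote", "table"))

print(paragraphs_only)
print(all_blocks)
```

With per-tag counters the paragraph ids come out the same under both options, so a reference to a paragraph would not shift if a table moved in another presentation of the document. A single shared counter would not have that property.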

We need references in our documents that are not failures from the beginning.

Other ideas?

Posted Comments

Tom, very interesting topic you have brought up here. This falls along the same lines as what I am doing at work with the publications our team has to do. I'm definitely going to put some thought into this and get back to you. This could have some viable outcome for the work we do on our contract. I wish you had brought this up earlier, so I could have applied this to the current publication I just finished. To be continued...

For the paragraph tagging to work it would also need to be in the publication that is printed. Often the printed material is developed first and the Web components come last, even if the reports are never printed. This idea has been around before. The client does not think about these things early enough to act.

This is a problem which you'll find is a raging fire within the RDF and Topic Maps (and anyone else wanting web identity to function) these days. Now, the topic map people have a good solution, while it is a blatant flaw in the RDF. The issue comes after you've got your anchors in place. Say you want to address something at If you reference that, are you talking about the subject of that URL, the URL itself, the page the URL sits at, or a subject at that page at that anchor? This is the problem with identity; the subject indicator doesn't lie in the identifier itself. In topic maps we distinguish the subject reference from the indicator, but RDF does not. Anyways, a few pointers there.


This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License.