Information Architecture: Taxonomy and Metadata Revisited

Origins of Taxonomy

Modern taxonomy was invented by Carl Linnaeus in the 1730s to describe the ways in which living creatures (plants and animals) were both similar to and different from each other. He used his powers of observation to identify the physiological details and method of reproduction on which he based his classification of plants. The scheme as it is taught today has seven levels: kingdom, phylum, class, order, family, genus, species. Linnaeus himself used kingdom, class, order, genus, and species; the ranks of phylum and family were added later. He classified living creatures relative to each other according to the degree of their similarities.

The Linnaean taxonomy was a decomposition in kind, in which each subordinate node (taxon) was a special case of its parent. Each plant and animal was classified as one species and thus had one and only one taxonomic parentage, distinct from that of its fellows.

Classification schemes had been devised before, but Linnaeus' was the broadest in scope and the most audacious. Many sighed in relief at its publication.

Since then, species have been reclassified as new information arose, but the concept of the scheme remains unchanged.

As a mathematician I like the precision of Linnaeus' taxonomy.

Library Science

Libraries, as their collections of books and other materials have grown, have also turned to classification schemes. The father of library science is generally regarded as Sirkali Ramamrita Ranganathan (1892–1972), a university librarian and professor of library science in India who published The Five Laws of Library Science in 1931. He developed the colon classification system, the first ever faceted classification; faceted classes are mutually exclusive. The Dewey Decimal System of classifying books was created in 1876 by Melvil Dewey (1851–1931) and is still widely used. The second widely used classification scheme is the Library of Congress Classification, invented by Herbert Putnam (1861–1955) in 1897. Both of these systems are hierarchical.

Electronic Libraries

Electronic document libraries (collections of electronic documents) emerged, at least in my experience, in the 1990s, enabled by software to author, store, access, and render text documents. The hypertext link and the document viewer were key tools.

The Challenges of the Researcher

The challenge for the researcher has been, and remains, finding the right document at the right time. To this end, documents have been classified and cataloged, and navigation tools have been provided that access the catalogs.

Classification Schemes for Electronic Libraries: Taxonomy

These classification schemes involve a subject taxonomy and metadata. The taxonomy reflects the subject domain and is constructed from available documents. Care is needed to also identify taxa for which no documents currently exist, which is often the case.

The discussion that follows is specific to a subject taxonomy of documents. This taxonomy differs in a number of ways from a taxonomy of different objects (such as plants).

A taxonomy applies to a domain, i.e., the classification scheme is specific to a group of documents that have some relationship to each other based on their subject and a shared usage. It is composed of a hierarchy of taxa in parent/child relationships in which each child represents a subdivision or decomposition (by type or kind) of its parent. The top-level taxon, called the root, refers to the domain. Each taxon in the hierarchy may be called a node.

The classified, or classed, documents are not taxa, but exist separately from their taxonomy. They are classified by an association with one taxon, which could exist at any level in the hierarchy (in contrast with plants that are associated only with species, at the lowest level in the botanical taxonomy). A document's subject has a context composed of a hierarchy of taxa.

For me, a subject taxonomy differs from a plain hierarchical organization scheme by three additional characteristics: (1) it applies only to documents, (2) it is a decomposition in kind, and (3) documents can be classed with any taxon, regardless of its position in the hierarchy.
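
The structure described above can be sketched in code. This is a minimal sketch under stated assumptions, not a definitive implementation: the `Taxon` class, its `add_child` and `context` methods, and the cooking-related taxa and document titles are all hypothetical names invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Taxon:
    """One node in a subject taxonomy (hypothetical class for illustration)."""
    name: str
    parent: Optional[Taxon] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)
    documents: list = field(default_factory=list)  # titles classed directly here

    def add_child(self, name: str) -> Taxon:
        """Create a child taxon that is a subdivision in kind of this one."""
        child = Taxon(name, parent=self)
        self.children.append(child)
        return child

    def context(self) -> list:
        """Return the taxon names from the root down to this taxon."""
        path = []
        node = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


# Hypothetical domain: the root taxon names the domain itself.
root = Taxon("Cooking")
baking = root.add_child("Baking")
bread = baking.add_child("Bread")

# Documents exist apart from the taxa and may be classed at any level.
baking.documents.append("Oven Temperatures Explained")
bread.documents.append("Sourdough Starter Basics")

print(bread.context())   # ['Cooking', 'Baking', 'Bread']
print(baking.context())  # ['Cooking', 'Baking']
```

The `context` method returns the hierarchy of taxa that gives a document's subject its context, whether the document is classed at a leaf or at an interior node.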

There are two kinds of subject taxonomies:

one-dimensional taxonomy
This is a simple hierarchy as described above.
multi-dimensional taxonomy
This is a taxonomy with more than one hierarchy, each with its own root taxon. This scheme is used when no dimension is a decomposition of another and when the documents can be classed in one or more dimensions. Each dimension can have a different number of levels. These dimensions correspond to the facets described by Ranganathan. A multi-dimensional taxonomy is useful in representing what in software are called entity-relationship (E-R) diagrams, where the entities are taxa and the relation between any two taxa is a verb, e.g., "produces," "provides," or "enables"; each such taxon is in its own dimension-hierarchy.
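
A multi-dimensional taxonomy can be sketched as several independent hierarchies, with each document classed in one or more of them. The sketch below is only one possible representation: the dimension names (`topic`, `region`), the taxa, the document titles, and the helpers `path_in` and `contexts` are hypothetical.

```python
# Each dimension is its own hierarchy, stored as a child -> parent map.
# A parent of None marks that dimension's root taxon.
# The dimensions deliberately have different numbers of levels.
DIMENSIONS = {
    "topic": {"Cooking": None, "Baking": "Cooking", "Bread": "Baking"},
    "region": {"World": None, "France": "World"},
}

# A document is classed in one or more dimensions, one taxon per dimension.
DOCUMENT_CLASSES = {
    "Baguette Techniques": {"topic": "Bread", "region": "France"},
    "Oven Temperatures Explained": {"topic": "Baking"},
}


def path_in(dimension: str, taxon: str) -> list:
    """Return the taxa from the dimension's root down to the given taxon."""
    parents = DIMENSIONS[dimension]
    if taxon not in parents:
        raise KeyError(f"{taxon!r} is not a taxon in dimension {dimension!r}")
    path = []
    node = taxon
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


def contexts(title: str) -> dict:
    """Return the document's context in every dimension where it is classed."""
    return {
        dimension: path_in(dimension, taxon)
        for dimension, taxon in DOCUMENT_CLASSES[title].items()
    }


print(contexts("Baguette Techniques"))
# {'topic': ['Cooking', 'Baking', 'Bread'], 'region': ['World', 'France']}
print(contexts("Oven Temperatures Explained"))
# {'topic': ['Cooking', 'Baking']}
```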

A document library is specific to a single subject domain. A website can host one or more document libraries and thus one or more subject domains. Each domain has its own taxonomy. Should a website have only one catalog, then it can contain entries for more than one domain, i.e., more than one taxonomy.

Classification Schemes: Metadata

Metadata is the second part of a document classification scheme.

Metadata is loosely defined as data about data; this is how database administrators use it. In the context of document libraries, it is the properties (characteristics) that describe each document. Metadata is typically chosen to be consistent with the library contents, its audience, and their needs. Consequently not all the things that can be said about a document need to be declared.

Metadata can describe the file, its intended use, and use limitations; can describe the content and the nature of the content; and can establish references to other documents.

A metadata property can have more than one level. For example, document type may be further qualified by document sub-type.

Commonly useful metadata are: title (this is a special case), author, date created, date last updated, provenance, owner, security (e.g., public, confidential, top secret), document type, subject, description, summary, primary audience. Additional metadata can be used that is descriptive of objects in the particular domain. In some examples I have seen of faceted classifications, the facets that do not fit into a hierarchy can be considered as metadata.

When metadata includes subject, that value should correspond to a taxon in the document's taxonomy. This association can place the document within a larger context.
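
As an illustration, a metadata record and the check that its subject matches a taxon might look like the following. This is a sketch only: the `DocumentMetadata` fields are a small selection from the list above, and the `TAXA` set, the sample values, and the `subject_is_classed` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical taxa of one subject domain.
TAXA = {"Cooking", "Baking", "Bread"}


@dataclass
class DocumentMetadata:
    """A few commonly useful properties of one document."""
    title: str
    author: str
    date_created: date
    document_type: str
    subject: str
    document_subtype: Optional[str] = None  # second level of document type
    security: str = "public"


def subject_is_classed(meta: DocumentMetadata, taxa: set) -> bool:
    """Report whether the subject value corresponds to a taxon."""
    return meta.subject in taxa


good = DocumentMetadata(
    title="Sourdough Starter Basics",
    author="A. Baker",
    date_created=date(2014, 5, 15),
    document_type="guide",
    document_subtype="how-to",
    subject="Bread",
)
stray = DocumentMetadata(
    title="Pruning Roses",
    author="G. Green",
    date_created=date(2014, 5, 15),
    document_type="guide",
    subject="Gardening",
)

print(subject_is_classed(good, TAXA))   # True
print(subject_is_classed(stray, TAXA))  # False
```

A check of this kind, run when a document is cataloged, would catch subject values that place the document outside the taxonomy.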

Relationships

The only relationships that can exist between subjects are their taxonomic ones.

Relationships primarily exist between documents. This is often stated as "see also." Such relationships can be included in the content of each document. They can also be declared in metadata. In this case the metadata property can have zero, one, or more text values.
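
A "see also" property declared in metadata could be held as a list of titles per document, where the list may be empty. The sketch below assumes that representation; the titles and the `dangling_references` helper are hypothetical.

```python
# "See also" values per document: zero, one, or more text values.
SEE_ALSO = {
    "Sourdough Starter Basics": ["Oven Temperatures Explained", "Flour Types"],
    "Oven Temperatures Explained": ["Sourdough Starter Basics"],
    "Knife Skills": [],
}


def dangling_references(see_also: dict) -> list:
    """List (source, target) pairs whose target is not a known document."""
    known = set(see_also)
    return [
        (source, target)
        for source, targets in see_also.items()
        for target in targets
        if target not in known
    ]


print(dangling_references(SEE_ALSO))
# [('Sourdough Starter Basics', 'Flour Types')]
```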

Challenges for Authors

The challenges for authors are to organize the material with headings, a table of contents, and perhaps an index so that the key concepts are clearly visible, and to phrase titles and headings so that they reflect the subject matter in the personality of the document itself. The text should include, even introduce, key words, concepts, and acronyms, and place them in context. Documents can begin with a statement of purpose.

Navigation

I feel strongly that a strict classification scheme can differ from the site navigation scheme. The classification scheme reflects the subject domain and the documents' contents. The navigation scheme reflects the audience and their needs. Navigation needs should not affect the classification.

The classification scheme is held in a catalog, either real or virtual.

Navigational devices include the catalog, table of contents, index, site map, and search.

Catalog

Users can browse the catalog. If it is presented in tabular format, the column headings can be sortable and the unique values clearly seen. It can also be used by the search.
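
A tabular catalog with sortable columns and visible unique values can be sketched as follows. The rows, column names, and the helpers `sort_by` and `unique_values` are hypothetical, and a real catalog would carry more metadata than these three columns.

```python
# Hypothetical catalog: one row per document, one column per property.
CATALOG = [
    {"title": "Sourdough Starter Basics", "author": "A. Baker", "subject": "Bread"},
    {"title": "Oven Temperatures Explained", "author": "C. Cook", "subject": "Baking"},
    {"title": "Knife Skills", "author": "C. Cook", "subject": "Cooking"},
]


def sort_by(rows: list, column: str) -> list:
    """Return the rows ordered by one column, ignoring letter case."""
    return sorted(rows, key=lambda row: row[column].lower())


def unique_values(rows: list, column: str) -> list:
    """Return the distinct values that appear in one column."""
    return sorted({row[column] for row in rows})


print([row["title"] for row in sort_by(CATALOG, "title")])
# ['Knife Skills', 'Oven Temperatures Explained', 'Sourdough Starter Basics']
print(unique_values(CATALOG, "author"))
# ['A. Baker', 'C. Cook']
```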

The colon classification of Ranganathan uses codes consisting of numbers and letters to represent the object's classes; each class code is separated from the others by a colon. The Dewey Decimal System uses codes consisting of a three-digit number, sometimes followed by a decimal point and additional digits, to represent a class. The Library of Congress Classification uses a combination of letters and numbers, in parts separated by periods, to represent a class.
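
The code formats can be illustrated with a small parser. This sketch follows only the simplified descriptions above, not the full rules of either system; the function names and the sample code strings are hypothetical and chosen only to show the shape of each notation.

```python
import re

# Three digits, optionally followed by a decimal point and more digits.
DEWEY_PATTERN = re.compile(r"(\d{3})(?:\.(\d+))?")


def parse_dewey(code: str) -> tuple:
    """Split a Dewey-style code into its 3-digit class and decimal extension."""
    match = DEWEY_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a Dewey-style code: {code!r}")
    return match.group(1), match.group(2)


def parse_colon(code: str) -> list:
    """Split a colon-style code into its individual class codes."""
    return code.split(":")


print(parse_dewey("512.5"))    # ('512', '5')
print(parse_dewey("512"))      # ('512', None)
print(parse_colon("B2:7:N3"))  # ['B2', '7', 'N3']
```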

The navigation demands on the catalog will likely require that the classes be named in English.

Table of Contents

This can be a different presentation of the catalog, or it can be organized to match the audience needs.

Index

An index is an alphabetical list of things. There could be more than one index, perhaps one by document title and another by document title within taxon.
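
The two indexes mentioned above could be built from the same list of classed documents. In this sketch the document titles and taxa are hypothetical, and each document is assumed to be classed with a single taxon.

```python
from collections import defaultdict

# Hypothetical (title, taxon) pairs taken from a catalog.
DOCUMENTS = [
    ("Sourdough Starter Basics", "Bread"),
    ("Oven Temperatures Explained", "Baking"),
    ("Baguette Techniques", "Bread"),
]

# Index 1: every document title in alphabetical order.
title_index = sorted(title for title, _ in DOCUMENTS)

# Index 2: titles in alphabetical order within each taxon.
titles_by_taxon = defaultdict(list)
for title, taxon in DOCUMENTS:
    titles_by_taxon[taxon].append(title)
taxon_index = {
    taxon: sorted(titles) for taxon, titles in sorted(titles_by_taxon.items())
}

print(title_index)
# ['Baguette Techniques', 'Oven Temperatures Explained', 'Sourdough Starter Basics']
print(taxon_index)
# {'Baking': ['Oven Temperatures Explained'], 'Bread': ['Baguette Techniques', 'Sourdough Starter Basics']}
```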

Site Map

There are many different ways in which site maps are done. They can include some or all of the documents. They can be organized by taxonomy or by directory structure or by audience use cases. They can be in alphabetical order, or not.

Search

Search is often the navigation tool of last resort; that is, it is used only when a cursory browse of the other navigation devices fails to turn up the desired document. The most effective search will include the catalog entities (taxa and metadata) as well as a full-text search. The most useful search results list will include the parent taxa and metadata of the selected documents.
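
A search of this kind can be sketched as a simple substring match over three sources: the taxa in a document's parentage, its metadata, and its full text. Everything here is an assumption made for illustration: the `PARENTS` map, the `CATALOG` entries, and the `parentage` and `search` functions are hypothetical, and a real search would use a proper text index rather than substring matching.

```python
# Hypothetical one-dimensional taxonomy as a child -> parent map.
PARENTS = {"Cooking": None, "Baking": "Cooking", "Bread": "Baking"}

# Hypothetical catalog entries: metadata plus the document text.
CATALOG = [
    {"title": "Sourdough Starter Basics", "author": "A. Baker", "subject": "Bread",
     "text": "Feed the starter with flour and water every day."},
    {"title": "Oven Temperatures Explained", "author": "C. Cook", "subject": "Baking",
     "text": "Most bread bakes well in a hot oven."},
    {"title": "Knife Skills", "author": "C. Cook", "subject": "Cooking",
     "text": "Keep the blade sharp."},
]


def parentage(taxon: str) -> list:
    """Return the taxa from the root down to the given taxon."""
    path = []
    node = taxon
    while node is not None:
        path.append(node)
        node = PARENTS[node]
    return path[::-1]


def search(term: str) -> list:
    """Match the term against taxa, metadata, and full text of each entry."""
    needle = term.lower()
    hits = []
    for entry in CATALOG:
        path = parentage(entry["subject"])
        matched_in = []
        if any(needle in taxon.lower() for taxon in path):
            matched_in.append("taxonomy")
        if needle in entry["title"].lower() or needle in entry["author"].lower():
            matched_in.append("metadata")
        if needle in entry["text"].lower():
            matched_in.append("full text")
        if matched_in:
            # Each result carries the parentage and metadata, not just the title.
            hits.append({
                "title": entry["title"],
                "author": entry["author"],
                "parentage": " > ".join(path),
                "matched_in": matched_in,
            })
    return hits


for hit in search("bread"):
    print(hit["title"], "|", hit["parentage"], "|", hit["matched_in"])
# Sourdough Starter Basics | Cooking > Baking > Bread | ['taxonomy']
# Oven Temperatures Explained | Cooking > Baking | ['full text']
```

The first result is found only because its subject taxon matches, and the second only because of its text, which is why searching the catalog entities alongside the full text turns up more of the relevant documents than either alone.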


Written: 7-18-2012. Revised 5-15-2014.