Zthes

The Zthes abstract model for thesaurus representation, version 1.0

1st May 2006

1. Introduction

This document describes an abstract model for representing thesauri - semantic hierarchies of terms as described in ISO 2788 (monolingual thesauri) and ISO 5964 (multilingual thesauri). This abstract model may also be used to store and transmit dictionaries, glossaries, notation-based classification schemes, taxonomies, simple ontologies, and authority files of personal, corporate and geographic names.

2. Scope

This specification is concerned solely with the abstract representation of thesaurus terms and how they may be searched; separate specifications describe the representation of these abstract concepts in XML, and how they may be searched in SRU/SRW and Z39.50.

The abstract model described here is intended to be sufficiently general that it can also be implemented by data formats other than XML, and accessed using protocols other than Z39.50, SRU and SRW. For example, a YAML schema might be defined for representing terms, and a profile might be defined for searching and navigating Zthes databases using A9 OpenSearch.

Because the model described in this specification is abstract, it will generally have little or no effect on the data model used by a server's local representation of its thesaurus. In particular, pre-existing thesaurus databases may well represent a more complete model of a term than what is described by this abstract model. That's fine: the additional information is not representable in the Zthes model, but may be used in any number of other ways beyond the scope of the Zthes specifications.

This model does not mandate any relationship between a thesaurus and any other database. The model is that terms from any thesaurus database may be used to search any other database (called a target database). However, postings information for one or more databases may optionally be included along with terms.

3. Abstract model

3.1. Overview

This model represents a thesaurus as a database of inter-linked terms. A single thesaurus may contain multiple vocabularies (which can be thought of as "virtual thesauri"), each consisting of terms drawn from a specific set whose name is specified by its terms' termVocabulary elements (see below). Links may exist between terms in different vocabularies, or even in different physical thesauri.

Each individual term in a thesaurus is represented by a record in the database. In the interests of simplicity and orthogonality, even non-preferred terms must be represented by their own records.

Term records consist of an initial part describing the term itself (with information such as its unique identifier, scope note, etc.), together with sub-records (that is, named sections within the main record) briefly describing related terms. The primary means of navigation from one term to another is by searching for the unique identifiers of the terms related to the first one.

In addition to the term records, a thesaurus may contain a single additional special record describing the thesaurus as a whole, including for example information about who compiled it and who owns the copyright.

3.2. Schema

In the element tables that follow, the occurrence columns describe whether the elements are mandatory and/or repeatable as follows:

Value   Meaning
1       mandatory, not repeatable
1+      mandatory, repeatable
[0,1]   optional, not repeatable
0+      optional, repeatable
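
As an illustration, the four occurrence codes can be read as lower and upper bounds on how many times an element appears. The sketch below is not part of the model; the names OCCURRENCE_BOUNDS and occurrence_ok are hypothetical.

```python
# Hypothetical helper, not part of the Zthes model: each occurrence code is
# mapped to (minimum, maximum), where None means "no upper limit".
OCCURRENCE_BOUNDS = {
    "1": (1, 1),        # mandatory, not repeatable
    "1+": (1, None),    # mandatory, repeatable
    "[0,1]": (0, 1),    # optional, not repeatable
    "0+": (0, None),    # optional, repeatable
}


def occurrence_ok(code: str, count: int) -> bool:
    """Return True if `count` instances of an element satisfy `code`."""
    low, high = OCCURRENCE_BOUNDS[code]
    return count >= low and (high is None or count <= high)
```

For instance, occurrence_ok("1", 0) is False because a mandatory element is missing, while occurrence_ok("0+", 5) is True.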

The top level term record is composed of the following elements:

Name Occurrence Description
termId 1 An opaque string of characters which uniquely identifies the term within the thesaurus.
termUpdate [0,1] The update status of the record, which may be add or delete. This is used when a collection of Zthes terms is used to specify an update to a thesaurus. In this scenario, terms in the update set that have termUpdate=add are added to the destination thesaurus, with each new term replacing any existing term that has the same termId; and terms in the update set that have termUpdate=delete cause the terms in the destination thesaurus with the corresponding termId to be deleted. Since the remainder of every term record that has termUpdate=delete is ignored, such records may omit all other elements apart from the termId.
termName 1 The name of the term in a form which may be displayed to a user or used as a search term in a target database.
termQualifier [0,1] An additional string which, if supplied, qualifies the term name such that the combination of term and qualifier is unique within the thesaurus.
termType [0,1] An indication of the type of the term, chosen from the controlled vocabulary described below.
termLanguage [0,1] The language of the term.
termVocabulary [0,1] An indication of the vocabulary the term is drawn from, if the thesaurus contains multiple vocabularies.
termCategory 0+ Identifies a term as belonging to a particular topical subset ("microthesaurus"). As this element is repeatable, a term may belong to zero or more categories. This element differs from termVocabulary in being much lower-level: every term belongs unambiguously to (at most) one particular vocabulary, but any number of categories. Categories may span vocabularies within a single thesaurus, i.e. terms with the same value of termCategory may have different values of termVocabulary (and of course vice versa).
termStatus [0,1] The deletion status of the term, which may be active, deactivated or deleted. In general, only active terms should be used to guide searching. The difference between deactivated and deleted terms is that a user should be prevented from adding a term identical to a deactivated term, instead reinstating the suppressed term (which may have history notes explaining why it was withdrawn); whereas deleted terms can be ignored for all purposes except reconstructing thesaurus history.
termApproval [0,1] An indication of whether the term has been approved for inclusion in the thesaurus, has merely been proposed and awaits approval, or has been considered and rejected. Acceptable values are candidate, approved and rejected. termStatus and termApproval are orthogonal, despite their surface similarity: the three values that each can take make up nine possible combinations, and all of these are distinct.
termSortkey [0,1] An explicit sort key for the term, based on language- and application-specific sorting rules that may remove leading articles, parse number names, etc., to produce natural order not attainable through a strict alphanumeric sort.
termNote 0+ A note about the term: that is, arbitrary prose clarifying the meaning and scope of the term. Multiple notes may be included. Each note may carry a label indicating its role, e.g. scope, source. In the common case where only a scope note is required, no label need be included.

Each note may also carry an indication of a constraint on the vocabulary of its content, e.g. indicating that the content of a specific note must be drawn from another Zthes thesaurus, or a plain list. The precise expression of this constraint is dependent on the encoding of the Zthes records: see for example the discussion of this constraint in the XML specification.

termCreatedDate [0,1] The date on which the record defining the term was created.
termCreatedBy [0,1] The name of the person who created the record defining the term.
termModifiedDate [0,1] The date on which the record defining the term was last modified.
termModifiedBy [0,1] The name of the person who last modified the record defining the term.
postings 0+ A sub-record, in the format described below, indicating the frequency with which the term occurs in a target database.
relation 0+ A sub-record, in the format described below, briefly describing a term related to this one. Each relation sub-record may carry a weight indicating how strongly the related term is related to the main term.
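
The termUpdate semantics described in the table above can be sketched as follows. This is only an illustration under stated assumptions: term records are stood in for by plain dictionaries, the destination thesaurus is a dictionary keyed by termId, and the function name apply_update is hypothetical. The model does not say what to do with records in an update set that carry no termUpdate value, nor whether termUpdate is stored with the term; skipping such records and dropping the element are assumptions of this sketch.

```python
def apply_update(thesaurus: dict, update_set: list) -> dict:
    """Return a copy of `thesaurus` (termId -> record) with updates applied.

    Assumption: records are plain dicts; the real encoding is defined elsewhere.
    """
    result = dict(thesaurus)
    for term in update_set:
        action = term.get("termUpdate")
        if action == "add":
            # A new term replaces any existing term with the same termId.
            # Dropping termUpdate from the stored record is an assumption.
            stored = {k: v for k, v in term.items() if k != "termUpdate"}
            result[term["termId"]] = stored
        elif action == "delete":
            # Only termId matters here; the rest of the record is ignored.
            result.pop(term["termId"], None)
        # Assumption: records without termUpdate are left out of the update.
    return result
```

Returning a new dictionary rather than modifying the destination in place keeps the sketch easy to test; an implementation could equally update in place.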

In many thesauri there is no explicit unique identifier field, and the term itself, perhaps in combination with the qualifier, uniquely identifies a record. Thesauri such as these must nevertheless provide a termId field, which may be automatically generated simply by combining the term and qualifier.
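
One possible way of generating such an identifier is sketched below. The function name make_term_id and the "name (qualifier)" layout are assumptions made for illustration; the model only requires that the resulting string be unique within the thesaurus and treats it as opaque.

```python
from typing import Optional


def make_term_id(term_name: str, term_qualifier: Optional[str] = None) -> str:
    """Build a termId from the term name and, if present, its qualifier.

    The "name (qualifier)" layout is an assumption, not mandated by the model.
    """
    if term_qualifier:
        return f"{term_name} ({term_qualifier})"
    return term_name
```

With this layout, the term Mercury qualified by planet would receive the identifier "Mercury (planet)". Clients should still treat the value as opaque and not parse it.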

The termType element may take the following values:

PT
Preferred term (also known as a descriptor)
ND
Non-descriptor: that is, a non-preferred term.
NL
Node label: that is, a dummy term not assigned to documents when indexing, but inserted into the hierarchy to indicate the logical basis on which a category has been divided - for example, by function. Also known as a guide term or a facet indicator.

Applications may use other values of termType at their discretion. It is recommended that such extension values begin with the string X-.
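
A receiving application might sort termType values into the standard ones, recommended-style extensions, and everything else. The sketch below is illustrative only; the function name classify_term_type and the three labels it returns are hypothetical.

```python
STANDARD_TERM_TYPES = {"PT", "ND", "NL"}


def classify_term_type(value: str) -> str:
    """Classify a termType value (the labels returned are hypothetical)."""
    if value in STANDARD_TERM_TYPES:
        return "standard"
    if value.startswith("X-"):
        # Follows the recommended prefix for extension values.
        return "extension"
    # Permitted by the model, but does not follow the recommendation.
    return "unprefixed extension"
```

Because the X- prefix is a recommendation rather than a requirement, the sketch does not reject unprefixed values; it merely labels them.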

Each postings sub-record is composed of the following elements:

Name Occurrence Description
sourceDb 1 Details identifying a service that provides the target database in which the term may be found.
fieldName [0,1] If specified, the name of a field in the target database in which the term may be found; otherwise, the sub-record represents a postings count across the entire target database.
hitCount 1 The number of occurrences of the term in the target database (in the nominated field only, if specified).

If a server wishes to communicate separate postings counts for a term in more than one field, then multiple postings sub-records with the same value of sourceDb should be used.
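
The use of several postings sub-records for one target database might look like the following sketch. The Postings class, the hits_by_field helper and the sample database and field names are all hypothetical; only the three element names come from the table above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Postings:
    """Stand-in for a postings sub-record; field names mirror the model."""
    sourceDb: str
    hitCount: int
    fieldName: Optional[str] = None  # None: count across the whole database


def hits_by_field(postings: List[Postings], source_db: str) -> Dict[Optional[str], int]:
    """Collect the hit counts reported for one target database, by field."""
    return {p.fieldName: p.hitCount for p in postings if p.sourceDb == source_db}


# Hypothetical data: one whole-database count plus two per-field counts.
sample = [
    Postings("example-db", 120),
    Postings("example-db", 45, "title"),
    Postings("example-db", 80, "subject"),
]
counts = hits_by_field(sample, "example-db")
```

Here counts maps None to the whole-database figure and each field name to its own figure. The model does not state how the per-field counts relate to the whole-database count, so the sketch makes no attempt to reconcile them.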

Each relation sub-record is composed of the following elements:

Name Occurrence Description
relationType 1 An indication of the type of the relation, chosen from the controlled vocabulary described below.
sourceDb [0,1] If specified, details identifying a service that provides the target database in which the related term is found; otherwise, the related term is in the same database as the current one.
termId 1 The unique identifier of the related term within its database.
termName 1 The name of the related term.
termQualifier [0,1] The qualifier of the related term.
termType [0,1] The type of the related term.
termLanguage [0,1] The language of the related term.

The relationType element may take the following values:

NT
Narrower term: that is, the related term is more specific than the current one.
BT
Broader term: that is, the related term is more general than the current one.
USE
Use instead: that is, the related term should be used in preference to the current one.
UF
Use for: that is, the current term should be used in preference to the related one.
RT
Related term.
LE
Linguistic equivalent: the current term and the related term are preferred terms representing the same concept - or "sufficiently close" concepts - in different languages.

Servers may return other values of relationType at their discretion. It is recommended that such extension values begin with the string X-.

With a single exception, this model deliberately restricts its set of supported relations to those discussed in ISO 2788, in the belief that it is better for a small set of relations to be used interoperably than for a larger set to be specified, with different servers and clients in practice using different subsets.

That sole exception is the addition to the standard relation types of LE, introduced to model the multilingual links described in ISO 5964.

The NT and BT relationships are reciprocal; so are USE and UF; and RT and LE are symmetric. That is:

  • When any term T1 points to another T2 using the relation NT, T2 should point back to T1 using BT and vice versa;
  • when T1 points to T2 using the relation USE, T2 should point back to T1 using UF and vice versa; and
  • when T1 points to T2 using the relation RT or LE, T2 should point back to T1 using the same relation.
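
The back-link rules above can be checked mechanically. The following is a minimal sketch, assuming a thesaurus held in memory as a dictionary from termId to a list of (relationType, related termId) pairs; that layout and the names INVERSE and missing_back_links are hypothetical.

```python
# The relation each standard type expects to find pointing back.
INVERSE = {"NT": "BT", "BT": "NT", "USE": "UF", "UF": "USE", "RT": "RT", "LE": "LE"}


def missing_back_links(records: dict) -> list:
    """List (termId, relationType, related termId) links that should exist.

    `records` maps termId -> list of (relationType, related termId) pairs.
    """
    missing = []
    for term_id, relations in records.items():
        for rel_type, other_id in relations:
            inverse = INVERSE.get(rel_type)
            if inverse is None or other_id not in records:
                # Extension relation, or a term held in another database:
                # nothing can be checked locally.
                continue
            if (inverse, term_id) not in records[other_id]:
                missing.append((other_id, inverse, term_id))
    return missing
```

Since the model says only that the related term should point back, a checker of this kind is better suited to reporting warnings than to rejecting data.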

The termType element in a relation sub-record may take the same values as in the top-level record.

If a whole-thesaurus descriptive record is provided, it must consist of Dublin Core elements (which are all optional and repeatable), together with an optional, repeatable, label-bearing thesNote element analogous to termNote.

3.3. Searching

The following searches must be supported:

  • A search which finds the single record with a specified termId at the top level (i.e. not within a relation sub-record).
  • A search which finds records with a specified termName at the top level.
  • A search which finds records with a specified termQualifier at the top level. (This is useful primarily in conjunction with the termName search.)
  • A search which finds words occurring anywhere in the term record.
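
The mandatory searches can be sketched over an in-memory list of records. This is an illustration only: records are stood in for by nested dictionaries, all function names are hypothetical, and exact matching on names together with case-insensitive, whitespace-delimited word matching are assumptions. How the searches are actually expressed is left to the separate protocol specifications.

```python
def find_by_term_id(records, term_id):
    """Return the record whose top-level termId matches, or None."""
    for record in records:
        if record["termId"] == term_id:  # sub-record termIds are not consulted
            return record
    return None


def find_by_name(records, term_name, term_qualifier=None):
    """Match top-level termName, optionally narrowed by termQualifier."""
    return [
        r for r in records
        if r["termName"] == term_name
        and (term_qualifier is None or r.get("termQualifier") == term_qualifier)
    ]


def _all_text(value):
    """Yield every leaf value in a nested record as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from _all_text(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _all_text(item)
    else:
        yield str(value)


def find_by_word(records, word):
    """Find records containing `word` anywhere, including sub-records."""
    needle = word.lower()
    return [
        r for r in records
        if any(needle in text.lower().split() for text in _all_text(r))
    ]
```

The qualifier search is folded into find_by_name as an optional argument, reflecting the observation above that it is useful primarily in conjunction with the name search.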

Support for additional searches, including the following, may be useful.

  • A search which locates all the records considered suitable as starting points for browsing.
  • A search for terms of a specified type.
  • A search for terms in a specified language.
  • Searches for terms in a specified vocabulary or category.
  • Searches for terms with a specified status or approval.
  • A search for terms whose termNote contains specified words.
  • Searches for terms created and/or modified at a specified time or by a specified person.
  • A search which finds all the records in a specified relation to the term with a specified termId. (This is useful for finding the narrower terms of a very broad term without needing to retrieve a very large record.)
  • A search which finds a special record describing not a term but the thesaurus as a whole - authorship, version information, etc.
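
The relation-based search in the list above might be sketched as follows, assuming records held in a dictionary keyed by termId, each with an optional list of relation sub-records represented as dictionaries. The function name related_records is hypothetical, and skipping relations that carry a sourceDb (because they point into another database) is an assumption of this sketch.

```python
def related_records(records: dict, term_id: str, relation_type: str) -> list:
    """Return the full records standing in `relation_type` to `term_id`.

    `records` maps termId -> record dict with an optional "relation" list.
    """
    source = records.get(term_id)
    if source is None:
        return []
    wanted = [
        rel["termId"]
        for rel in source.get("relation", [])
        # A relation with sourceDb points into another database (skipped here).
        if rel["relationType"] == relation_type and "sourceDb" not in rel
    ]
    return [records[i] for i in wanted if i in records]
```

A server offering this search lets a client page through, say, the narrower terms of a very broad term instead of fetching the single large record that lists them all.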

4. Acknowledgements

This document is derived from the historical Zthes Z39.50 profile, which until version 0.5 contained the specification of the abstract model as its Section 2. That profile represents the consensual outcome of extensive discussions between the members of the informally convened Zthes working group.

5. Bibliography

  1. International Organization for Standardization. ISO 2788: Guidelines for the establishment and development of monolingual thesauri, 2nd ed. Geneva: ISO, 1986.
  2. International Organization for Standardization. ISO 5964: Guidelines for the establishment and development of multilingual thesauri. Geneva: ISO, 1985.

For some inexplicable and inexcusable reason, ISO standards are not generally available on-line.