The Zthes abstract model for thesaurus representation, version 1.0
1st May 2006
This document describes an abstract model for representing
thesauri - semantic hierarchies of terms as described in
ISO 2788 (monolingual thesauri) and ISO 5964 (multilingual thesauri).
This abstract model may also be used to store and transmit
dictionaries, glossaries, notation-based classification schemes,
taxonomies, simple ontologies, and authority files of personal,
corporate and geographic names.
This specification is concerned solely with the abstract
representation of thesaurus terms and how they may be searched;
separate specifications describe the representation of these
abstract concepts in XML, and how they may be searched in
SRU/SRW and Z39.50.
The abstract model described here is intended to be sufficiently
general that it can also be implemented by data formats other
than XML, and accessed using protocols other than Z39.50, SRU
and SRW. For example, a YAML schema might be defined for
representing terms, and a profile might be defined for searching
and navigating Zthes databases using A9 OpenSearch.
Because the model described in this specification is abstract,
it will generally have little or no effect on the data model
used by a server's local representation of its thesaurus. In
particular, pre-existing thesaurus databases may well represent
a more complete model of a term than what is described by this
abstract model. That's fine: the additional information is not
representable in the Zthes model, but may be used in any number
of other ways beyond the scope of the Zthes specifications.
This model does not mandate any relationship between a thesaurus
and any other database. The model is that terms from any
thesaurus database may be used to search any other database
(called a target database). However, postings
information for one or more databases may optionally be included
along with terms.
This model represents a thesaurus as a database of inter-linked
terms. A single thesaurus
may contain multiple vocabularies (which can be thought of as
``virtual thesauri''), each consisting of terms drawn from a
specific set whose name is specified by its terms'
termVocabulary elements (see below). Links may exist
between terms in different vocabularies, or even in different
physical thesauri.
Each individual term in a thesaurus is represented by a record
in the database. In the interests of simplicity and
orthogonality, even non-preferred terms must be represented by
their own records.
Term records consist of an initial part describing the term
itself (with information such as its unique identifier, scope
note, etc.), together with sub-records (that is, named
sections within the main record) briefly describing related
terms. The primary means of navigation from one term to another
is by searching for the unique identifiers of the terms related
to the first one.
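As a rough illustration of this record shape and of navigation by
identifier, the following Python sketch keeps term records in a
dictionary keyed by termId. The class names, the in-memory store and
the sample terms are assumptions made for illustration only; the
model itself does not prescribe any concrete representation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Relation:
    """Sub-record briefly describing a related term."""
    relation_type: str  # e.g. "NT" or "BT"
    term_id: str        # unique identifier of the related term
    term_name: str


@dataclass
class Term:
    """Top-level term record: the term itself plus relation sub-records."""
    term_id: str
    term_name: str
    relations: list[Relation] = field(default_factory=list)


# Hypothetical in-memory thesaurus, keyed by termId.
thesaurus = {
    "1": Term("1", "animals", [Relation("NT", "2", "mammals")]),
    "2": Term("2", "mammals", [Relation("BT", "1", "animals")]),
}


def follow(term: Term, relation_type: str) -> list[Term]:
    """Navigate by looking up the termId held in each matching sub-record."""
    return [
        thesaurus[rel.term_id]
        for rel in term.relations
        if rel.relation_type == relation_type and rel.term_id in thesaurus
    ]


narrower = follow(thesaurus["1"], "NT")
print([t.term_name for t in narrower])
```

Because each sub-record carries the related term's name as well as
its identifier, a client can display the related terms before
deciding which of them to fetch in full.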
In addition to the term records, a thesaurus may contain a
single additional special record describing the thesaurus as a
whole, including for example information about who compiled it
and who owns the copyright.
In the element tables that follow, the occurrence columns describe
whether the elements are mandatory and/or repeatable as follows:
Value | Meaning
1 | mandatory, not repeatable
1+ | mandatory, repeatable
[0,1] | optional, not repeatable
0+ | optional, repeatable
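The occurrence values can be read as lower and upper bounds on how
many times an element appears. The sketch below is one possible way
to check a count against such a value; the names OCCURRENCE_BOUNDS
and occurrence_ok are hypothetical and not part of the Zthes
specifications.

```python
# Occurrence values mapped to (minimum, maximum) counts.
# A maximum of None means the element may repeat without limit.
OCCURRENCE_BOUNDS = {
    "1": (1, 1),       # mandatory, not repeatable
    "1+": (1, None),   # mandatory, repeatable
    "[0,1]": (0, 1),   # optional, not repeatable
    "0+": (0, None),   # optional, repeatable
}


def occurrence_ok(code: str, count: int) -> bool:
    """Return True if `count` occurrences satisfy the occurrence value."""
    minimum, maximum = OCCURRENCE_BOUNDS[code]
    return count >= minimum and (maximum is None or count <= maximum)
```

For example, occurrence_ok("[0,1]", 2) is False, because an optional,
non-repeatable element may appear at most once.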
The top level term record is composed of the following elements:
Name | Occurrence | Description
termId | 1 | An opaque string of characters which uniquely identifies the term within the thesaurus.
termUpdate | [0,1] | The update status of the record, which may be add or delete. This is used when a collection of Zthes terms is used to specify an update to a thesaurus. In this scenario, terms in the update set that have termUpdate=add are added to the destination thesaurus, with each new term replacing any existing term that has the same termId; and terms in the update set that have termUpdate=delete cause the terms in the destination thesaurus with the corresponding termId to be deleted. Since the remainder of every term record that has termUpdate=delete is ignored, such records may omit all other elements apart from the termId.
termName | 1 | The name of the term in a form which may be displayed to a user or used as a search term in a target database.
termQualifier | [0,1] | An additional string which, if supplied, qualifies the term name such that the combination of term and qualifier is unique within the thesaurus.
termType | [0,1] | An indication of the type of the term, chosen from the controlled vocabulary described below.
termLanguage | [0,1] | The language of the term.
termVocabulary | [0,1] | An indication of the vocabulary the term is drawn from, if the thesaurus contains multiple vocabularies.
termCategory | 0+ | Identifies a term as belonging to a particular topical subset (``microthesaurus''). As this element is repeatable, a term may belong to zero or more categories. This element differs from termVocabulary in being much lower-level: every term belongs unambiguously to (at most) one particular vocabulary, but any number of categories. Categories may span vocabularies within a single thesaurus, i.e. terms with the same value of termCategory may have different values of termVocabulary (and of course vice versa).
termStatus | [0,1] | The deletion status of the term, which may be active, deactivated or deleted. In general, only active terms should be used to guide searching. The difference between deactivated and deleted terms is that a user should be prevented from adding a term identical to a deactivated term, instead reinstating the suppressed term (which may have history notes explaining why it was withdrawn); whereas deleted terms can be ignored for all purposes except reconstructing thesaurus history.
termApproval | [0,1] | An indication of whether the term has been approved for inclusion in the thesaurus, has merely been proposed and awaits approval, or has been considered and rejected. Acceptable values are candidate, approved and rejected. termStatus and termApproval are orthogonal, despite their surface similarity: the three values that each can take make up nine possible combinations, and all of these are distinct.
termSortkey | [0,1] | An explicit sort key for the term, based on language- and application-specific sorting rules that may remove leading articles, parse number names, etc., to produce natural order not attainable through a strict alphanumeric sort.
termNote | 0+ | A note about the term: that is, arbitrary prose clarifying the meaning and scope of the term. Multiple notes may be included. Each note may carry a label indicating its role, e.g. scope, source. In the common case where only a scope note is required, no label need be included. Each note may also carry an indication of a constraint on the vocabulary of its content, e.g. indicating that the content of a specific note must be drawn from another Zthes thesaurus, or a plain list. The precise expression of this constraint is dependent on the encoding of the Zthes records: see for example the discussion of this constraint in the XML specification.
termCreatedDate | [0,1] | The date on which the record defining the term was created.
termCreatedBy | [0,1] | The name of the person who created the record defining the term.
termModifiedDate | [0,1] | The date on which the record defining the term was last modified.
termModifiedBy | [0,1] | The name of the person who last modified the record defining the term.
postings | 0+ | A sub-record, in the format described below, indicating the frequency with which the term occurs in a target database.
relation | 0+ | A sub-record, in the format described below, briefly describing a term related to this one. Each relation sub-record may carry a weight indicating how strongly the related term is related to the main term.
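The update behaviour described for termUpdate can be sketched as
follows. This is a minimal illustration that assumes records are
plain Python dictionaries keyed by element name and that the
thesaurus is a dictionary keyed by termId. The function name
apply_update is hypothetical, and two choices here go beyond what
the model states: termUpdate is dropped from stored records, and
records without a termUpdate value are skipped.

```python
def apply_update(thesaurus: dict, update_set: list) -> dict:
    """Return a copy of `thesaurus` with an update set applied."""
    result = dict(thesaurus)
    for record in update_set:
        term_id = record["termId"]
        status = record.get("termUpdate")
        if status == "add":
            # The new term replaces any existing term with the same termId.
            # Assumption: termUpdate itself is not stored.
            result[term_id] = {
                key: value
                for key, value in record.items()
                if key != "termUpdate"
            }
        elif status == "delete":
            # Everything except termId is ignored for deletions.
            result.pop(term_id, None)
        # Assumption: records with no termUpdate value are left alone.
    return result


current = {
    "1": {"termId": "1", "termName": "cats"},
    "2": {"termId": "2", "termName": "dogs"},
}
updates = [
    {"termId": "1", "termUpdate": "add", "termName": "domestic cats"},
    {"termId": "2", "termUpdate": "delete"},
    {"termId": "3", "termUpdate": "add", "termName": "ferrets"},
]
updated = apply_update(current, updates)
print(sorted(updated))
```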
In many thesauri there is no explicit unique identifier field,
and the term itself, perhaps in combination with the qualifier,
uniquely identifies a record. Thesauri such as these must
nevertheless provide a termId field, which may be
automatically generated simply by combining the term and
qualifier.
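One way such an identifier might be generated is shown below. The
function name make_term_id and the "name (qualifier)" layout are
assumptions chosen for illustration; any scheme is acceptable as
long as the result is unique within the thesaurus.

```python
from typing import Optional


def make_term_id(term_name: str, term_qualifier: Optional[str] = None) -> str:
    """Build an identifier from the term name and optional qualifier.

    The "name (qualifier)" layout is an illustrative assumption, not
    something the abstract model requires.
    """
    if term_qualifier:
        return f"{term_name} ({term_qualifier})"
    return term_name


print(make_term_id("mercury", "planet"))
print(make_term_id("mercury", "metal"))
```

The two calls yield different identifiers for the same term name,
which is exactly what the qualifier is there to guarantee.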
The termType element may take the following values:
- PT: Preferred term (also known as a descriptor).
- ND: Non-descriptor, that is, a non-preferred term.
- NL: Node label, that is, a dummy term not assigned to
  documents when indexing, but inserted into the hierarchy to
  indicate the logical basis on which a category has been
  divided - for example, by function. Also known as a guide term
  or a facet indicator.
Applications may use other values of termType at their
discretion. It is recommended that such extension values begin with
the string X-.
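A client that wants to tell standard values from extension values
could use a check along these lines. The function name and the three
result labels are hypothetical. Note that a value without the X-
prefix is still permitted by the model; it merely does not follow
the recommendation.

```python
STANDARD_TERM_TYPES = {"PT", "ND", "NL"}


def classify_term_type(value: str) -> str:
    """Classify a termType value (labels are illustrative assumptions)."""
    if value in STANDARD_TERM_TYPES:
        return "standard"
    if value.startswith("X-"):
        return "extension"
    # Allowed at the application's discretion, but not prefixed as recommended.
    return "unprefixed extension"
```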
Each postings sub-record is composed of the following
elements:
Name | Occurrence | Description
sourceDb | 1 | Details identifying a service that provides the target database in which the term may be found.
fieldName | [0,1] | If specified, the name of a field in the target database in which the term may be found; otherwise, the sub-record represents a postings count across the entire target database.
hitCount | 1 | The number of occurrences of the term in the target database (in the nominated field only, if specified).
If a server wishes to communicate separate postings counts for a term
in more than one field, then multiple postings sub-records
with the same value of sourceDb should be used.
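For instance, a term might carry three postings sub-records for one
target database: one per field and one for the database as a whole.
The sketch below assumes sub-records are plain dictionaries and that
sourceDb is a simple string; the database name, field names and
counts are invented for illustration.

```python
# Hypothetical postings sub-records for a single term.
postings = [
    {"sourceDb": "catalogue", "fieldName": "title", "hitCount": 12},
    {"sourceDb": "catalogue", "fieldName": "subject", "hitCount": 30},
    {"sourceDb": "catalogue", "hitCount": 41},  # no fieldName: whole database
]


def hit_count(postings, source_db, field_name=None):
    """Return the hitCount for a database and optional field, or None."""
    for sub_record in postings:
        same_db = sub_record["sourceDb"] == source_db
        same_field = sub_record.get("fieldName") == field_name
        if same_db and same_field:
            return sub_record["hitCount"]
    return None


print(hit_count(postings, "catalogue", "title"))
print(hit_count(postings, "catalogue"))
```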
Each relation sub-record is composed of the following elements:
Name | Occurrence | Description
relationType | 1 | An indication of the type of the relation, chosen from the controlled vocabulary described below.
sourceDb | [0,1] | If specified, details identifying a service that provides the target database in which the related term is found; otherwise, the related term is in the same database as the current one.
termId | 1 | The unique identifier of the related term within its database.
termName | 1 | The name of the related term.
termQualifier | [0,1] | The qualifier of the related term.
termType | [0,1] | The type of the related term.
termLanguage | [0,1] | The language of the related term.
The relationType element may take the following values:
- NT: Narrower term, that is, the related term is more specific
  than the current one.
- BT: Broader term, that is, the related term is more general
  than the current one.
- USE: Use instead, that is, the related term should be used in
  preference to the current one.
- UF: Use for, that is, the current term should be used in
  preference to the related one.
- RT: Related term.
- LE: Linguistic equivalent: the current term and the related
  term are preferred terms representing the same concept - or
  ``sufficiently close'' concepts - in different languages.
Servers may return other values of relationType at their
discretion. It is recommended that such extension values begin with
the string X-.
With a single exception,
this model deliberately restricts its set of supported relations to
those discussed in ISO 2788, in the belief that
it is better for a small set of relations to be used interoperably
than for a larger set to be specified, with different servers and
clients in practice using different subsets.
That sole exception is the addition to the standard relation types of
LE, introduced to model the multilingual links described in
ISO 5964.
The NT and BT relationships are reciprocal; so are
USE and UF; and RT and LE are symmetric. That
is:
- When any term T1 points to another T2 using
  the relation NT, T2 should point back to T1
  using BT and vice versa;
- when T1 points to T2
  using the relation USE, T2 should point back to
  T1 using UF and vice versa; and
- when T1 points
  to T2 using the relation RT or LE, T2
  should point back to T1 using the same relation.
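These back-link rules lend themselves to a simple consistency check.
The sketch below assumes the relations are held as a dictionary that
maps each termId to a set of (relationType, related termId) pairs,
and that all terms live in the same database. The names INVERSE and
missing_back_links are hypothetical.

```python
# The relation expected on the related term for each standard relation type.
INVERSE = {
    "NT": "BT", "BT": "NT",
    "USE": "UF", "UF": "USE",
    "RT": "RT", "LE": "LE",
}


def missing_back_links(relations):
    """List (termId, relationType, related termId) links that are absent.

    `relations` maps termId -> set of (relationType, related termId).
    Extension relation types are skipped, since their inverses are unknown.
    """
    missing = []
    for source, links in relations.items():
        for relation_type, target in links:
            inverse = INVERSE.get(relation_type)
            if inverse is None:
                continue
            if (inverse, source) not in relations.get(target, set()):
                missing.append((target, inverse, source))
    return sorted(missing)


relations = {
    "T1": {("NT", "T2")},
    "T2": {("BT", "T1"), ("RT", "T3")},
    "T3": set(),
}
print(missing_back_links(relations))
```

In this sample data T3 lacks the RT link back to T2, so that single
link is reported.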
The termType element in a relation sub-record may
take the same values as in the top-level record.
If a whole-thesaurus descriptive record is provided, it must
consist of Dublin Core elements (which are all optional and
repeatable), together with an optional, repeatable, label-bearing
thesNote element analogous to termNote.
The following searches must be supported:
- A search which finds the single record with a specified
  termId at the top level (i.e. not within a relation
  sub-record).
- A search which finds records with a specified termName at the
  top level.
- A search which finds records with a specified termQualifier
  at the top level. (This is useful primarily in conjunction with the
  termName search.)
- A search which finds words occurring anywhere in the term record.
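As a minimal sketch of these four searches, the following Python
code runs them against an in-memory list of records. It assumes
records are plain dictionaries, exact matching on element values,
and case-insensitive word matching. The function names and sample
records are hypothetical; the real matching semantics are defined by
the protocol-specific specifications.

```python
import re

records = [
    {"termId": "1", "termName": "mercury", "termQualifier": "planet"},
    {
        "termId": "2",
        "termName": "mercury",
        "termQualifier": "metal",
        "termNote": ["A liquid element."],
    },
]


def by_term_id(records, term_id):
    """Find the single record with this termId at the top level."""
    for record in records:
        if record["termId"] == term_id:
            return record
    return None


def by_element(records, element, value):
    """Find records whose top-level element (e.g. termName) equals value."""
    return [record for record in records if record.get(element) == value]


def all_words(value):
    """Yield lower-cased words from strings nested in dicts and lists."""
    if isinstance(value, str):
        yield from re.findall(r"\w+", value.lower())
    elif isinstance(value, dict):
        for item in value.values():
            yield from all_words(item)
    elif isinstance(value, list):
        for item in value:
            yield from all_words(item)


def by_word(records, word):
    """Find records containing the word anywhere, including sub-records."""
    return [r for r in records if word.lower() in all_words(r)]
```

Combining by_element on termName with a filter on termQualifier
narrows the two "mercury" records down to one, which is the use the
termQualifier search is mainly intended for.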
Support for additional searches, including the following, may be
useful.
- A search which locates all the records considered suitable as
  starting points for browsing.
- A search for terms of a specified type.
- A search for terms in a specified language.
- Searches for terms in a specified vocabulary or category.
- Searches for terms with a specified status or approval.
- A search for terms whose termNote contains specified words.
- Searches for terms created and/or modified at a specified time or by a
  specified person.
- A search which finds all the records in a specified relation to the
  term with a specified termId. (This is useful for finding
  the narrower terms of a very broad term without needing to retrieve a
  very large record.)
- A search which finds a special record describing not a term but the
  thesaurus as a whole - authorship, version information, etc.
This document is derived from the historical Zthes Z39.50
profile, which until version 0.5 contained the specification of
the abstract model as its Section 2. That profile represents
the consensual outcome of extensive discussions between the
members of the informally convened Zthes working group.
- International Organization for Standardization. ISO 2788:
  Guidelines for the establishment and development of
  monolingual thesauri, 2nd ed. Geneva: ISO, 1986.
- International Organization for Standardization. ISO 5964:
  Guidelines for the establishment and development of
  multilingual thesauri. Geneva: ISO, 1985.
For some inexplicable and inexcusable reason, ISO standards are
not generally available on-line.