The CIC Metadata Portal

Resource Aggregation: The Filtering Process

This page describes how CIC metadata is processed. Each institution provides its content in a different way, even though the descriptive elements are similar. To use this content within the different "views" (geographic, search engine, ...) of our metadata aggregation, a series of transformations is applied to harvested metadata records prior to final indexing. This process was informed by procedures developed by the National Science Digital Library (NSDL metadata primer).

Selection according to a collection development policy

what the record describes
keeps only descriptive records
where the resource comes from
from CIC institutions only

Trace the record: the provenance element

repository URL and identifier
 
the harvest date
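
As an illustration of the provenance step, the following minimal sketch attaches the repository URL, the record identifier and the harvest date to a record. The dictionary layout, the `provenance` key and the function name `add_provenance` are assumptions made for this example; the portal's actual provenance element is not specified on this page.

```python
from datetime import date


def add_provenance(record, repository_url, identifier, harvest_date=None):
    """Return a copy of the record carrying a provenance entry.

    The field names below are hypothetical; they only mirror the three
    pieces of information listed above.
    """
    stamped = dict(record)
    stamped["provenance"] = {
        "repository": repository_url,
        "identifier": identifier,
        # Default to today's date when no harvest date is supplied.
        "harvestDate": (harvest_date or date.today()).isoformat(),
    }
    return stamped
```

Called with a record, a repository URL, an OAI identifier and `date(2006, 1, 15)`, the function stores the harvest date as the string `2006-01-15`.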

Clean the record

removes extra white space
runs of multiple white spaces, and white space at the beginning or the end of a string
removes incorrect characters at the beginning or the end of a metadata value
for example, strings beginning or ending with a semicolon
removes metadata with no content
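
The cleaning rules above can be sketched as follows. This is a minimal illustration, assuming a record is represented as a dictionary mapping element names to lists of string values; the names `EDGE_CHARS`, `clean_value` and `clean_record` are hypothetical, and the set of stripped characters is limited to the semicolon mentioned above plus white space.

```python
import re

# Assumed set of characters to strip from the edges of a value:
# the semicolon from the example above, plus white space.
EDGE_CHARS = "; \t\r\n"


def clean_value(value):
    """Collapse runs of white space, then strip stray edge characters."""
    collapsed = re.sub(r"\s+", " ", value)
    return collapsed.strip(EDGE_CHARS)


def clean_record(record):
    """Return a cleaned copy of the record, dropping empty values
    and elements that end up with no content."""
    cleaned = {}
    for element, values in record.items():
        kept = [v for v in (clean_value(v) for v in values) if v]
        if kept:
            cleaned[element] = kept
    return cleaned
```

An element whose values are all empty after cleaning is removed from the record entirely.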

Duplicate the record

(create a version of the record in a local namespace optimized for indexing for resource discovery)

splits elements on delimiter characters
when they contain several values, for example subjects separated by semicolons
removes metadata with content meaningless for information retrieval
such as "no date"
removes illegal characters in URIs
such as "&&" in a URL
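
A possible shape for this indexing copy is sketched below. The element names, the `SPLIT_ELEMENTS` and `MEANINGLESS` sets, and the choice to collapse "&&" into a single "&" are all assumptions made for illustration; the page does not list the actual delimiters, placeholder values or URI repairs used by the portal.

```python
import re

# Hypothetical configuration: which elements are split, and on what.
SPLIT_ELEMENTS = {"dc:subject"}
SEPARATOR = ";"
# Placeholder values treated as meaningless for retrieval (assumed list).
MEANINGLESS = {"no date"}


def split_values(value):
    """Split a multi-valued string and drop empty fragments."""
    return [part.strip() for part in value.split(SEPARATOR) if part.strip()]


def repair_url(url):
    """Collapse repeated ampersands such as '&&' into a single '&'."""
    return re.sub(r"&{2,}", "&", url)


def to_index_record(record):
    """Build the indexing copy of a record without modifying the original."""
    indexed = {}
    for element, values in record.items():
        kept = []
        for value in values:
            if element in SPLIT_ELEMENTS:
                parts = split_values(value)
            else:
                parts = [value.strip()]
            for part in parts:
                if not part or part.lower() in MEANINGLESS:
                    continue
                if element == "dc:identifier":
                    part = repair_url(part)
                kept.append(part)
        if kept:
            indexed[element] = kept
    return indexed
```

Splitting is restricted to selected elements in this sketch because a semicolon can legitimately appear inside other values, such as URLs.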

"Enrich" Unqualified Dublin Core records

This stage renames elements, adds elements and adds xsi:type attributes so that Qualified Dublin Core concepts can be recognized.

Identifier
Recognizes the scheme and sets an xsi:type when it corresponds to a URI scheme.
Type
Recognizes the DCMI Type according to a mapping table based on values found in CIC records. When the record describes an analog-only resource, it adds a dcmi:type of PhysicalObject, based on either the collection or a defined set of tests.
Format
Recognizes the IMT (Internet Media Type) format according to a mapping table based on values found in CIC records. When the content is not an IMT format but fits a corresponding DC Terms concept, it replaces the element with dct:medium and dc:extent.
Language
Recognizes ISO 639-2 languages according to a mapping table based on the original standard and values found in CIC records
Date
Checks for the presence of a DC Terms refinement such as "created:xxx" and creates the corresponding Qualified Dublin Core element. Recognizes a YYYY value as a W3C-DTF date (xsi:type).
Coverage and Subject
Checks for a known spatial value (US states and countries) and, where one is found, renames the element to dct:spatial with the equivalent xsi:type. For several collections, a geographic coverage is applied to all records (e.g. Aerial photos of Illinois).
Relation
Rebuilds relations between records (for example a finding aid and an image collection on the same physical items). When it recognizes a collection, it adds a collection element.
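
Three of the enrichment rules above (identifier, date and spatial coverage) can be sketched as a single function. This is an illustration under stated assumptions, not the portal's code: `KNOWN_PLACES` and `DATE_REFINEMENTS` are tiny hypothetical stand-ins for the real mapping tables, the `dct:URI` and `dct:W3CDTF` labels are the DCMI encoding scheme names written with the `dct:` prefix used on this page, and no xsi:type is set for spatial values because the page does not name the scheme used.

```python
import re

# Hypothetical, heavily abbreviated lookup tables.
KNOWN_PLACES = {"illinois", "indiana", "france"}
DATE_REFINEMENTS = {"created", "issued", "modified"}


def enrich(element, value):
    """Return (element, value, xsi_type) for one unqualified DC statement."""
    if element == "dc:identifier":
        if re.match(r"^(https?|ftp|urn):", value, re.IGNORECASE):
            return (element, value, "dct:URI")
        return (element, value, None)

    if element == "dc:date":
        # A value such as "created: 1923" becomes a dct:created element.
        prefix, sep, rest = value.partition(":")
        if sep and prefix.strip().lower() in DATE_REFINEMENTS:
            element = "dct:" + prefix.strip().lower()
            value = rest.strip()
        # A bare four-digit year is a valid W3C-DTF date.
        if re.fullmatch(r"\d{4}", value):
            return (element, value, "dct:W3CDTF")
        return (element, value, None)

    if element in ("dc:coverage", "dc:subject"):
        if value.strip().lower() in KNOWN_PLACES:
            # Encoding scheme left unset: not specified in the source.
            return ("dct:spatial", value, None)

    return (element, value, None)
```

Values that match none of the rules are returned unchanged, so the function can be applied to every statement in a record.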

Specific CIC elements for the CIC portal

This stage defines concepts that are used by the CIC interfaces.

Type
Recognizes a CIC Type (extended version of the DC Type) according to a mapping table based on values found in CIC records.
Relation:isPartOf
Systematically adds an isPartOf element with a "ui:code" collection code. Each record belongs to a collection, based on the repository, the OAI set or a sub-set automatically identified.
accessRights
Adds a "restricted" element when the collection is not freely available on the Web.
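
The collection and access rules can be sketched as a lookup keyed by repository and OAI set. Everything named here is hypothetical: the `COLLECTIONS` registry, its example URLs and codes, and the element names `dct:isPartOf` and `dct:accessRights` are assumptions chosen to mirror the description above, and the automatic identification of sub-sets is not covered.

```python
# Hypothetical registry: (repository URL, OAI set) -> collection details.
COLLECTIONS = {
    ("http://repository.example.edu/oai", "aerial"): {
        "code": "ui:aerial",
        "open": True,
    },
    ("http://repository.example.edu/oai", "theses"): {
        "code": "ui:theses",
        "open": False,
    },
}


def add_cic_elements(record, repository, oai_set):
    """Return a copy of the record with collection membership and,
    for collections that are not freely available, an access restriction."""
    collection = COLLECTIONS[(repository, oai_set)]
    enriched = dict(record)
    enriched["dct:isPartOf"] = [collection["code"]]
    if not collection["open"]:
        enriched["dct:accessRights"] = ["restricted"]
    return enriched
```

Records from an open collection receive only the isPartOf element; the access restriction is added solely when the registry marks the collection as not freely available.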