This page describes how CIC metadata is processed. Each institution provides its content in a different way, even though the descriptive elements are similar. To use this content in the different "views" (geographic, search engine, ...) of our metadata aggregation, a series of transformations is applied to harvested metadata records before final indexing. This process was informed by procedures developed by the National Science Digital Library (NSDL metadata primer).
Selection according to a collection development policy
- what the record describes
- only keeps descriptive records
- where the resource comes from
- from CIC institutions only
Trace the record: the provenance element
- repository URL and identifier
- the harvest date
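As a minimal sketch of the provenance step, the three pieces of information listed above can be attached to a record as one small structure. The names `Provenance` and `trace`, the dictionary layout of a record, and the example URL and identifier are all assumptions made for illustration; they are not the CIC implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    """Where a harvested record came from (hypothetical structure)."""
    repository_url: str
    identifier: str
    harvest_date: date


def trace(record, repository_url, identifier, harvest_date):
    """Return a copy of the record with a provenance entry attached."""
    traced = dict(record)
    traced["provenance"] = Provenance(repository_url, identifier, harvest_date)
    return traced


# Example with made-up values
traced = trace(
    {"dc:title": ["Aerial photo"]},
    "http://example.org/oai",
    "oai:example.org:1",
    date(2008, 1, 15),
)
```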
Clean the record
- removes extra whitespace
- collapses runs of multiple spaces and trims whitespace at the beginning and end of a string
- removes incorrect characters at the beginning or the end of a metadata value
- for example strings ending or beginning with a semi-colon
- removes metadata with no content
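The cleaning rules above could be sketched as follows. This is an illustration only: the record is assumed to be a dictionary mapping element names to lists of string values, and the set of stray characters (`STRAY_EDGE_CHARS`) is an assumption, since the page only names the semi-colon.

```python
import re

# Characters treated as stray punctuation at either end of a value.
# Assumed set; the page only gives the semi-colon as an example.
STRAY_EDGE_CHARS = ";,"


def clean_value(value: str) -> str:
    """Collapse runs of whitespace and trim stray characters at the edges."""
    collapsed = re.sub(r"\s+", " ", value).strip()
    # Strip stray punctuation together with any whitespace it exposes.
    return collapsed.strip(STRAY_EDGE_CHARS + " ")


def clean_record(record: dict) -> dict:
    """Clean every value and drop elements left with no content."""
    cleaned = {}
    for element, values in record.items():
        kept = [v for v in (clean_value(v) for v in values) if v]
        if kept:
            cleaned[element] = kept
    return cleaned
```

For instance, `clean_value("; Illinois   history;")` gives `"Illinois history"`, and an element whose only values are blank or a lone semi-colon is dropped from the record.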
Duplicate the record
(creates a version of the record, in a local namespace, optimized for indexing in support of resource discovery)
- splits elements into separate values on delimiter characters
- when they contain several values, for example subjects separated by semi-colons
- removes metadata with content meaningless for information retrieval
- such as "no date"
- removes illegal characters in URIs
- such as "&&" in a URL
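A hedged sketch of these indexing-oriented steps follows. The function names are hypothetical, the `MEANINGLESS` set contains only one value taken from this page ("no date") plus assumed extras, and collapsing repeated ampersands is just one possible reading of how a "&&" in a URL is repaired.

```python
import re

# Placeholder values with no retrieval value. Only "no date" comes from
# this page; the other entries are assumed for illustration.
MEANINGLESS = {"no date", "n.d.", "unknown"}


def split_values(value: str, separator: str = ";") -> list:
    """Split a multi-valued string and drop empty or meaningless parts."""
    parts = (part.strip() for part in value.split(separator))
    return [p for p in parts if p and p.lower() not in MEANINGLESS]


def repair_url(url: str) -> str:
    """Collapse repeated ampersands such as "&&" into a single "&"."""
    return re.sub(r"&{2,}", "&", url)
```

With these definitions, `split_values("Maps; Illinois ; ; no date")` keeps only `"Maps"` and `"Illinois"`.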
"Enrich" Unqualified Dublin Core records
This stage renames elements, adds elements, and adds xsi:type attributes so that Qualified Dublin Core concepts can be recognized.
- Identifier
- Recognizes the scheme and sets an xsi:type when it corresponds to a URI scheme
- Type
- Recognizes the DC Type according to a mapping table based on values found in CIC records. When the record describes an analog-only resource, it adds a dcmi:type PhysicalObject, based on either the collection or defined criteria (a set of tests).
- Format
- Recognizes the IMT format according to a mapping table based on values found in CIC records. When the content is not an IMT format but fits a corresponding DC Terms concept, it replaces the element with dct:medium or dct:extent.
- Language
- Recognizes ISO 639-2 languages according to a mapping table based on the original standard and values found in CIC records
- Date
- Checks for the presence of a DC Terms refinement such as "created:xxx" and creates the corresponding Qualified Dublin Core element. Recognizes a YYYY value as a W3C-DTF date (xsi:type)
- Coverage and Subject
- Checks for a known spatial value (US states and countries) and, when one is found, may rename the element to dct:spatial with the corresponding xsi:type. For several collections, a geographic coverage is applied to all records (e.g. Aerial photos of Illinois).
- Relation
- Rebuilds relations between records (for example a finding aid and an image collection on the same physical items). When it recognizes a collection, it adds a collection element.
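Two of the enrichment rules above, for Identifier and Date, can be sketched as small functions that return the element name, an optional xsi:type, and the value. This is a sketch under stated assumptions: the list of URI schemes, the subset of date refinements, and the function names are illustrative, and `dct:URI` and `dct:W3CDTF` are used as the xsi:type values on the assumption that the `dct:` prefix on this page denotes the DCMI Terms namespace.

```python
import re
from urllib.parse import urlparse

URI_SCHEMES = {"http", "https", "ftp"}                # assumed list of schemes
DATE_REFINEMENTS = {"created", "issued", "modified"}  # assumed subset of DC Terms


def enrich_identifier(value):
    """Return (element, xsi_type, value); xsi_type is None if no URI scheme is found."""
    scheme = urlparse(value).scheme.lower()
    xsi_type = "dct:URI" if scheme in URI_SCHEMES else None
    return ("dc:identifier", xsi_type, value)


def enrich_date(value):
    """Promote a "created:xxx" style prefix to a refined element and type YYYY dates."""
    element = "dc:date"
    prefix, sep, rest = value.partition(":")
    if sep and prefix.strip().lower() in DATE_REFINEMENTS:
        element = "dct:" + prefix.strip().lower()
        value = rest.strip()
    # A bare four-digit year is a valid W3C-DTF date.
    xsi_type = "dct:W3CDTF" if re.fullmatch(r"\d{4}", value) else None
    return (element, xsi_type, value)
```

Here `enrich_date("created: 1923")` yields a `dct:created` element typed as W3C-DTF, while a free-text value such as `"circa 1900"` stays an untyped `dc:date`. The mapping-table rules (Type, Format, Language, Coverage) would follow the same shape with a lookup dictionary in place of the pattern test.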
Specific CIC elements for the CIC portal
This stage defines concepts that are used by the CIC interfaces.
- Type
- Recognizes a CIC Type (extended version of the DC Type) according to a mapping table based on values found in CIC records.
- Relation:isPartOf
- Systematically adds an isPartOf element with a "ui:code" collection code. Each record belongs to a collection, based on the repository, the OAI set or a sub-set automatically identified.
- accessRights
- Adds an element "restricted" when the collection is not freely available on the Web.
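The isPartOf and accessRights rules above might look like the following sketch. The function name, the dictionary layout of a record, the `dct:` element names, and the example code `"ui:aerial"` are hypothetical; the collection code is treated as an opaque string that an earlier step has already determined.

```python
def add_cic_elements(record, collection_code, freely_available):
    """Return a copy of the record with the CIC portal elements added."""
    enriched = {element: list(values) for element, values in record.items()}
    # Every record is attached to exactly one collection code.
    enriched.setdefault("dct:isPartOf", []).append(collection_code)
    # Only collections that are not freely available get an access flag.
    if not freely_available:
        enriched.setdefault("dct:accessRights", []).append("restricted")
    return enriched


# Example with a made-up collection code
restricted = add_cic_elements(
    {"dc:title": ["Aerial photo"]}, "ui:aerial", freely_available=False
)
```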