The CIC Metadata Portal

Metadata Mining

Analysis of metadata provided by CIC metadata providers

Using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), the CIC metadata portal aggregates more than 500,000 metadata records from 11 universities, totaling 187 collections (January 2006). Metadata records are harvested in simple Dublin Core, Qualified Dublin Core, or Metadata Object Description Schema (MODS) format. Each standard is the result of unique metadata creation practices.

Metadata quality and shareability allow metadata to be taken out of its original context and integrated into an aggregation database, such as the CIC metadata portal. Relevant work in these areas was notably led by Bill Moen, Diane Hillmann, and others. The Digital Library Federation/National Science Digital Library best practices for shareable metadata were developed from the experience of major service and data providers in the United States.

The Metadata Mining experiments were intended to assess the behavior of specific metadata collections in a generalist aggregation. Collections were determined by the data providers, based on OAI repositories, OAI sets, or OAI subsets. The analyses from these experiments follow.

Metadata records similarity

Similarities between metadata records were calculated using two methods: Resemblance, calculated on shingles of 4 terms, and edit Distance, calculated with the Levenshtein algorithm using terms as the unit and weighted by record length. Two indicators were determined: the Average Resemblance (R) of records in the collection, and the Term-Based Levenshtein Distance Ratio (D). Both indicators range between 0 and 1. Collections with very similar records have a high R and a low D. Metadata values from 176 metadata collections were loaded into a database and concatenated in order of property, then alphabetically by value (unlike traditional deduplication algorithms, which analyze individual properties).
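
The two measures can be sketched as follows. This is a minimal illustration, not the code used in the experiment: it assumes that Resemblance is the Jaccard overlap of the 4-term shingle sets, and that the distance ratio divides the term-based edit distance by the length of the longer record. The function names and the sample records are hypothetical.

```python
from typing import List, Set, Tuple


def shingles(terms: List[str], size: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous `size`-term shingles of a record."""
    if len(terms) < size:
        # Assumption: a record shorter than one shingle forms a single shingle.
        return {tuple(terms)} if terms else set()
    return {tuple(terms[i:i + size]) for i in range(len(terms) - size + 1)}


def resemblance(a: List[str], b: List[str], size: int = 4) -> float:
    """Jaccard overlap of the two shingle sets, between 0 and 1."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def term_levenshtein(a: List[str], b: List[str]) -> int:
    """Edit distance where each insertion, deletion, or substitution is one term."""
    previous = list(range(len(b) + 1))
    for i, term_a in enumerate(a, start=1):
        current = [i]
        for j, term_b in enumerate(b, start=1):
            cost = 0 if term_a == term_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_ratio(a: List[str], b: List[str]) -> float:
    """Term-based edit distance divided by the longer record length (assumed weighting)."""
    longest = max(len(a), len(b))
    return term_levenshtein(a, b) / longest if longest else 0.0


# Hypothetical records, already concatenated and split into terms.
r1 = "photograph of the main library reading room 1925".split()
r2 = "photograph of the main library entrance hall 1925".split()
print(resemblance(r1, r2), distance_ratio(r1, r2))  # prints: 0.25 0.25
```

In this example the two records share 2 of 8 distinct shingles and differ by 2 term substitutions out of 8 terms, so both indicators equal 0.25. Identical records would give a Resemblance of 1 and a distance ratio of 0.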

Pairwise comparisons were performed on up to 100 randomly selected records in each collection. The average Resemblance and Distance for CIC collections are reported in the following document: Similarity of full metadata records.
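
The sampling step can be sketched as below, assuming that every pair within the sample is compared and the scores are averaged. The helper names are hypothetical, and `term_overlap` is only a stand-in score so that the block runs on its own. Any pairwise measure between 0 and 1 could be passed instead.

```python
import random
from itertools import combinations
from typing import Callable, Sequence


def average_pairwise(records: Sequence[str],
                     score: Callable[[str, str], float],
                     sample_size: int = 100,
                     seed: int = 0) -> float:
    """Average `score` over all pairs drawn from a random sample of the records."""
    rng = random.Random(seed)
    if len(records) <= sample_size:
        sample = list(records)
    else:
        sample = rng.sample(list(records), sample_size)
    pairs = list(combinations(sample, 2))
    if not pairs:
        return 0.0  # fewer than two records: nothing to compare
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def term_overlap(a: str, b: str) -> float:
    """Stand-in score: Jaccard overlap of the term sets of two records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


# Hypothetical three-record collection.
collection = ["map of ohio", "map of iowa", "map of ohio"]
print(average_pairwise(collection, term_overlap))
```

A sample of 100 records yields 4,950 pairs, which keeps the cost per collection bounded regardless of collection size.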

A similar analysis was performed on a subset of Dublin Core properties. This subset is often used in snippets (lists of results), which allow users to assess the relevance of each resource. The analysis covered the properties that differentiate one resource from another: Title, Subject, Description, and Creator. The results are reported in the following document: Similarity of partial metadata records.

Metadata language: Variety of terms used in metadata records

The total number of terms and the number of unique terms (excluding stop words) used in the metadata records were calculated for all records in the 176 collections. For collections with few records, the rate of unique terms per record was not a useful indicator. However, the objective of this experiment was to assess the additional chance of a collection being discovered through one of its records (any new term adds a potential match on one or more items in the collection). The experiment should also show a considerable difference from the full text of a resource, thus revealing potential performance concerns in applying full-text retrieval algorithms to metadata records. The detail per collection is available in the following document.

Percentage of unique terms in collection
% unique terms / total number of terms | Number of collections
<1% | 15
1-5% | 90
5-10% | 55
10-20% | 15
>20% | 1
Total | 176

Percentage of unique terms per record
Average # of terms per record | Number of collections
<1 term | 14
1-10 terms | 142
10-20 terms | 14
>20 terms | 6
Total | 176
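
The two indicators tabulated above can be computed along the following lines. This is a sketch under stated assumptions: terms are split on whitespace, the stop-word list is hypothetical (the list used in the experiment is not given), and the per-record figure is read as the number of unique terms in the collection divided by the number of records.

```python
from typing import Dict, Iterable, List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(record: str) -> List[str]:
    """Lower-case a record, split it on whitespace, and drop stop words."""
    return [t for t in record.lower().split() if t not in STOP_WORDS]


def term_statistics(records: Iterable[str]) -> Dict[str, float]:
    """Total terms, unique terms, and the two derived rates for one collection."""
    total = 0
    unique = set()
    n_records = 0
    for record in records:
        terms = tokenize(record)
        total += len(terms)
        unique.update(terms)
        n_records += 1
    return {
        "total_terms": total,
        "unique_terms": len(unique),
        "unique_ratio": len(unique) / total if total else 0.0,
        "unique_per_record": len(unique) / n_records if n_records else 0.0,
    }


# Hypothetical collection of three short records.
stats = term_statistics([
    "Map of the Ohio River",
    "Map of the Iowa River",
    "Photograph of the Ohio River",
])
print(stats)
```

Here the collection holds 9 terms after stop-word removal, of which 5 are unique, so the unique-term ratio is 5/9. Long collections of near-identical records push this ratio toward the lowest bands of the first table.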

The collection discovery experiment

The experiment included 176 metadata collections containing 10 to 85,000 records each. The as-harvested records were loaded into a database.

A total of 10,403 logs from the generalist service OAIster were collected 2 days a month over a 13-month period. From these logs, 5,192 unique authentic user queries (not considering stop words and redundant terms) were extracted and run against the complete metadata records, regardless of the field to which the original query was directed.
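
One way to reduce raw log entries to unique queries is sketched below. The stop-word list, the whitespace tokenization, and the choice to treat queries with the same terms in a different order as distinct are all assumptions. The actual log format and extraction rules are not described here.

```python
from typing import Iterable, List, Tuple

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def normalize_query(raw: str) -> Tuple[str, ...]:
    """Lower-case, drop stop words, and drop repeated terms, keeping first-seen order."""
    kept: List[str] = []
    for term in raw.lower().split():
        if term not in STOP_WORDS and term not in kept:
            kept.append(term)
    return tuple(kept)


def unique_queries(raw_queries: Iterable[str]) -> List[Tuple[str, ...]]:
    """Distinct normalized queries, in order of first appearance."""
    known = set()
    result = []
    for raw in raw_queries:
        query = normalize_query(raw)
        if query and query not in known:  # skip queries made only of stop words
            known.add(query)
            result.append(query)
    return result


# Hypothetical query strings already extracted from the logs.
sample = [
    "history of the civil war",
    "Civil War history",
    "history civil war war",
    "the of",
]
print(unique_queries(sample))
```

With these inputs the first and third strings collapse into the same query, and the last one is discarded because it contains only stop words.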

Approximately half of the queries did not originally address the full metadata records. Nonetheless, the purpose of this experiment was to assess the potential redundancy of information contained in metadata records from similar collections, for the purpose of discovery in a generalist aggregation. To analyze the actual redundancy, the queries would need to be directed to specific fields (which also depend on the specific interface and mapping choices of the service providers).

The following table shows, for the queries that found a collection through one or more of its items, the percentage that also found 10% to 100% of the items in the collection (full matches only):

Percentage of records from the collection returned for each query having at least one record match in the collection. For example, any query finding one item of the lib.umich.edu.brutbib collection always found all items in that same collection. In contrast, 95.57% of the queries that discovered the lib.umich.edu.archivision1icbib collection found less than 10% of the items in that collection.

Affinity between metadata records. From this analysis, an indicator of the Affinity of records in the collection was determined as the average percentage of the collection items appearing together in a list of results.
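
The Affinity indicator can be illustrated with the sketch below. It assumes that a full match means every query term occurs somewhere in the record, and that the average is taken only over queries that return at least one record from the collection. The record identifiers, terms, and queries are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def is_full_match(query: Iterable[str], record_terms: Set[str]) -> bool:
    """Assumed definition of a full match: every query term occurs in the record."""
    return all(term in record_terms for term in query)


def collection_shares(queries: Iterable[Tuple[str, ...]],
                      collection: Dict[str, Set[str]]) -> List[float]:
    """For each query that finds the collection, the share of its records returned."""
    shares = []
    for query in queries:
        hits = sum(1 for terms in collection.values() if is_full_match(query, terms))
        if hits:
            shares.append(hits / len(collection))
    return shares


def affinity(queries: Iterable[Tuple[str, ...]],
             collection: Dict[str, Set[str]]) -> float:
    """Average share of the collection returned together by a matching query."""
    shares = collection_shares(queries, collection)
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical collection: record identifier -> set of terms in the full record.
maps = {
    "r1": {"map", "ohio", "river"},
    "r2": {"map", "iowa", "river"},
    "r3": {"map", "ohio", "canal"},
    "r4": {"map", "erie", "canal"},
}
queries = [("map",), ("ohio",), ("ohio", "river"), ("portrait",)]
print(collection_shares(queries, maps), affinity(queries, maps))
```

The three matching queries return 100%, 50%, and 25% of the collection, so the Affinity is about 0.58. A collection whose records all carry the same text would score 1, as every matching query would return all of its items.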

Collection performance. The collection performance document reports the number of queries retrieving records from each collection, the number of matches among the top 40 matches, and the number of queries retrieving a collection through 1 or more items appearing in the top 40 matches of a query.

Figure: Skewed distribution of collections by number of matches per record

Metadata properties

Collections typically harvested in the MODS format were excluded, and records harvested in Qualified DC were mapped to simple DC. The analyses of metadata characteristics aimed to determine which properties were systematically used in all records from the same collection, how often they had the same value, and how often user queries could find a match in those properties.

Figure: Number of properties used inconsistently in the collection

In some collections, certain properties may be not applicable or unknown for several items, whereas this is not the case in other collections. If a cataloger adapts the semantics used to individual items, the item-level records are indeed "customized." The difference may also depend on the property considered. Typically, when providing context to individual metadata records, the cataloger adds the same values to every record (e.g., in the Relation property).

Only 40 of the 176 collections always have the same metadata properties (23% of the total number of collections considered) [Figure 1]. This suggests that case-by-case cataloging was performed for 77% of the collections to adapt relevant concepts to individual resources.

Usage of Dublin Core properties in the CIC metadata aggregation
Dublin Core property | # collections having property at least once | # collections using property all the time | # collections using property irregularly | % of irregular use of property
Contributor | 36 | 11 | 25 | 69.44%
Coverage | 17 | 4 | 13 | 76.47%
Creator | 139 | 44 | 95 | 68.35%
Date | 146 | 77 | 69 | 47.26%
Description | 160 | 107 | 53 | 33.13%
Format | 160 | 147 | 13 | 8.13%
Identifier | 175 | 173 | 2 | 1.14%
Language | 135 | 118 | 17 | 12.59%
Publisher | 136 | 106 | 30 | 22.06%
Relation | 58 | 54 | 4 | 6.90%
Rights | 158 | 148 | 10 | 6.33%
Source | 61 | 43 | 18 | 29.51%
Subject | 144 | 82 | 62 | 43.06%
Title | 166 | 145 | 21 | 12.65%
Type | 159 | 145 | 14 | 8.81%
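
The distinction between properties used all the time and properties used irregularly within one collection can be sketched as follows. The record structure (a mapping from property name to a list of values) and the function name are assumptions made for illustration.

```python
from typing import Dict, List

Record = Dict[str, List[str]]  # property name -> list of values


def property_usage(records: List[Record]) -> Dict[str, str]:
    """Label each property seen in a collection as used 'always' or 'irregularly'."""
    counts: Dict[str, int] = {}
    for record in records:
        for prop, values in record.items():
            if values:  # a property with no value counts as absent
                counts[prop] = counts.get(prop, 0) + 1
    return {prop: ("always" if n == len(records) else "irregularly")
            for prop, n in counts.items()}


# Hypothetical three-record collection.
records = [
    {"title": ["Barn, north side"], "creator": ["Smith, J."], "rights": ["Public domain"]},
    {"title": ["Barn, south side"], "rights": ["Public domain"]},
    {"title": ["Silo"], "creator": [], "rights": ["Public domain"]},
]
print(property_usage(records))
```

In this example Title and Rights are used in every record, while Creator is used irregularly. Across collections, the last column of the table is the number of collections using a property irregularly divided by the number having it at least once, for example 25/36 = 69.44% for Contributor.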

The "differentiation potential" of a property in a collection was calculated as the ratio between the number of distinct values of the property and the number of records having the property. If a distinct value existed for each record, the ratio was 1. Repetition of identical values in the collection decreased the ratio. The average differentiation potential was calculated across the CIC collections, with 1 as the maximum. Type, Rights, and Language were clearly the most consistent properties (likely to be similar for all records in a collection).

Differentiation potential of Dublin Core properties in the CIC aggregation
Property | Average differentiation potential across CIC collections
Contributor | 0.34
Coverage | 0.41
Creator | 0.42
Date | 0.23
Description | 0.62
Format | 0.06
Identifier | 1.00
Language | 0.02
Publisher | 0.16
Relation | 0.07
Rights | 0.03
Source | 0.60
Subject | 0.51
Title | 0.83
Type | 0.01
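
The differentiation potential defined above can be computed for one property in one collection as in the sketch below. It assumes that multiple values of the same property in a record are sorted and concatenated into a single value before comparison. The record structure and sample data are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, List[str]]  # property name -> list of values


def differentiation_potential(records: List[Record], prop: str) -> float:
    """Distinct values of `prop` divided by the number of records that have it."""
    values = [" ".join(sorted(record[prop])) for record in records if record.get(prop)]
    return len(set(values)) / len(values) if values else 0.0


# Hypothetical three-record collection.
records = [
    {"identifier": ["id-1"], "type": ["image"], "creator": ["Smith, J."]},
    {"identifier": ["id-2"], "type": ["image"], "creator": ["Smith, J."]},
    {"identifier": ["id-3"], "type": ["image"]},
]
for prop in ("identifier", "type", "creator"):
    print(prop, differentiation_potential(records, prop))
```

Identifier reaches the maximum of 1 because every record has its own value, Type drops to 1/3 because all three records share one value, and Creator is 1/2 because only the two records that have the property are counted.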

The average property length was also calculated for each collection, measured on the concatenation of values for the same element.

The analysis of matches in the collection discovery experiment, together with the presence of specific metadata properties, showed that, even without considering queries targeted to specific fields, metadata fields differ in their tendency to contain the terms searched by users.

Frequency of contribution to a match
Property | # times the property contributed to a match | # times the property contained a full match | # times the property contributed to a full match | # records having property | Average # times a property contributes to a full match | Average # times a property has a full match
contributor | 1635947 | 42985 | 57767 | 38325 | 1.12 | 1.51
coverage | 372998 | 12219 | 21312 | 26716 | 0.46 | 0.80
creator | 2727506 | 56662 | 74055 | 365478 | 0.16 | 0.20
date | 642042 | 10386 | 12246 | 428614 | 0.02 | 0.03
description | 19747320 | 484394 | 599466 | 377135 | 1.28 | 1.59
language | 693470 | 35 | 145 | 322410 | 0.00 | 0.00
relation | 15485385 | 432487 | 602217 | 224675 | 1.92 | 2.68
rights | 27993333 | 677919 | 802348 | 384035 | 1.77 | 2.09
source | 6001142 | 178232 | 233902 | 137971 | 1.29 | 1.70
subject | 10714888 | 252841 | 315455 | 320915 | 0.79 | 0.98
title | 9200338 | 185802 | 258668 | 521687 | 0.36 | 0.50
type | 5510554 | 248435 | 270507 | 409842 | 0.61 | 0.66

Correlations

A clear and unsurprising correlation was observed between the size of a collection (number of records and record length, measured as the total number of terms in the collection) and the number of times its records were retrieved by user queries run against the full metadata record.

The Pearson correlations show that the amount of data (the number of terms in the collection, which reflects both the number of records and the length of records) is strongly correlated with the number of matches. However, the richness or variety of terms in the collection (the number of unique terms) is more important in ensuring that a collection will be discovered.

Correlation between records length and form of language and collection discovery
Indicator | # partial matches found on collection items | # full matches found on collection items | # queries with at least one item having a partial match | # queries with at least one item having a full match | # queries with at least one item in the top 40 results
# unique terms in collection | 0.68 | 0.60 | 0.77 | 0.85 | 0.74
# terms in collection | 0.97 | 0.92 | 0.55 | 0.57 | 0.53
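
For reference, a Pearson correlation coefficient such as those reported above can be computed as in this minimal sketch. The per-collection figures in the example are invented for illustration and are not taken from the experiment.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Invented figures, one value per collection.
unique_terms = [120, 450, 900, 2000, 5200]
queries_with_full_match = [3, 10, 22, 41, 60]
print(round(pearson(unique_terms, queries_with_full_match), 2))
```

Each cell in the table corresponds to one such coefficient, computed over the collections with one collection-level indicator as `x` and one discovery indicator as `y`.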

Correlations linking indicators of similarity between metadata records from the same collections were also calculated. The following table indicates the correlation between metadata records similarity indicators in the various collections (as described above), metadata language indicators (as described above), and metadata affinity indicators as calculated on the basis of the collection discovery experiment (as described above).

Indicator | # distinct terms per record | rate distinct terms | Tbld distance ratio | Resemblance | % queries gathering below 10% items in collection with partial matches | % queries gathering 100% items in collection with partial matches | % queries gathering below 10% items in collection with full matches | % queries gathering 100% items in collection with full matches | Affinity
# distinct terms per record | - | 0.76 | 0.44 | -0.32 | -0.11 | -0.13 | -0.02 | -0.11 | -0.03
rate distinct terms | 0.76 | - | 0.36 | -0.34 | -0.05 | -0.16 | -0.03 | -0.11 | -0.03
Tbld distance ratio | 0.44 | 0.36 | - | -0.93 | 0.40 | -0.60 | 0.47 | -0.55 | -0.52
Resemblance | -0.32 | -0.34 | -0.93 | - | -0.51 | 0.69 | -0.55 | 0.64 | 0.60
% queries gathering below 10% items in collection with partial matches | -0.11 | -0.05 | 0.40 | -0.51 | - | -0.91 | 0.96 | -0.90 | -0.94
% queries gathering 100% items in collection with partial matches | -0.13 | -0.16 | -0.60 | 0.69 | -0.91 | - | -0.89 | 0.96 | 0.94
% queries gathering below 10% items in collection with full matches | -0.02 | -0.03 | 0.47 | -0.55 | 0.96 | -0.89 | - | -0.93 | -0.98
% queries gathering 100% items in collection with full matches | -0.11 | -0.11 | -0.55 | 0.64 | -0.90 | 0.96 | -0.93 | - | 0.98
Affinity | -0.03 | -0.03 | -0.52 | 0.60 | -0.94 | 0.94 | -0.98 | 0.98 | -

Some of the results presented here will be published soon in Information Processing & Management.