Metadata Mining
Analysis of metadata provided by CIC metadata providers
Using the Open Archives Initiative Protocol for Metadata Harvesting, the CIC metadata portal aggregates more than 500,000 metadata records from 11 universities, totaling 187 collections (January 2006). Metadata records are harvested in simple Dublin Core, Qualified Dublin Core, or Metadata Object Description Schema (MODS). Each standard is the result of unique metadata creation practices. Metadata quality and shareability are what allow the metadata to be parsed out of the original context and integrated into an aggregation database, such as the CIC metadata portal. Relevant work in these areas was notably led by Bill Moen, Diane Hillmann, and others. The Digital Library Federation/National Science Digital Library best practices for shareable metadata build on the experience of major service and data providers in the United States.
The Metadata Mining experiments were intended to assess the behavior of specific metadata collections (as determined by data providers on the basis of OAI repositories, OAI sets, or OAI subsets) in a generalist aggregation. The analyses follow.
Metadata records similarity
Similarities between metadata records were calculated using two methods: Resemblance, calculated from shingles of 4 terms, and edit Distance, calculated with the Levenshtein algorithm using terms as the unit and weighted by record length. Two indicators were determined: the Average Resemblance (R) of records in the collection, and the Term-Based Levenshtein Distance Ratio (D). Both indicators range between 0 and 1. Collections with very similar records have a high R and a low D. Metadata values from 176 metadata collections were loaded into a database and concatenated by property, then by value in alphabetical order (unlike traditional deduplication algorithms, which analyze individual properties).
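The two similarity indicators described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: it assumes Resemblance is the Jaccard ratio of the two records' 4-term shingle sets, that the term-based Levenshtein distance is divided by the term count of the longer record, and that a record shorter than 4 terms counts as a single shingle. All function names are hypothetical.

```python
from itertools import combinations

def shingles(terms, k=4):
    """Return the set of k-term shingles (contiguous windows of terms)."""
    if len(terms) < k:
        # Assumption: a record shorter than k terms is one shingle.
        return {tuple(terms)} if terms else set()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def resemblance(a, b, k=4):
    """Jaccard ratio of the shingle sets of two records (0 to 1)."""
    sa, sb = shingles(a.split(), k), shingles(b.split(), k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def term_levenshtein(a, b):
    """Levenshtein distance between two term lists, one term per edit."""
    prev = list(range(len(b) + 1))
    for i, term_a in enumerate(a, start=1):
        cur = [i]
        for j, term_b in enumerate(b, start=1):
            cost = 0 if term_a == term_b else 1
            cur.append(min(prev[j] + 1,          # delete a term
                           cur[j - 1] + 1,       # insert a term
                           prev[j - 1] + cost))  # substitute a term
        prev = cur
    return prev[-1]

def distance_ratio(a, b):
    """Term-based distance divided by the longer record's length (assumed weighting)."""
    terms_a, terms_b = a.split(), b.split()
    longest = max(len(terms_a), len(terms_b))
    return term_levenshtein(terms_a, terms_b) / longest if longest else 0.0

def collection_indicators(records, k=4):
    """Average R and D over all record pairs; None if fewer than two records."""
    pairs = list(combinations(records, 2))
    if not pairs:
        return None
    r = sum(resemblance(a, b, k) for a, b in pairs) / len(pairs)
    d = sum(distance_ratio(a, b) for a, b in pairs) / len(pairs)
    return r, d
```

Under these assumptions, a collection of identical records yields R = 1 and D = 0, which matches the "high R and low D" reading given for very similar records.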
Pairwise comparisons were performed on up to 100 randomly selected records in each collection. The average Resemblance and Distance for CIC collections are reported in the following document: Similarity of full metadata records.
A similar analysis was performed on a subset of Dublin Core properties: Title, Subject, Description, and Creator. These properties are often used in snippets (lists of results), which allow users to assess the relevance of each resource and to differentiate one resource from another. The results are reported in the following document: Similarity of partial metadata records.
Metadata language: Variety of terms used in metadata records
The total number of terms and the number of unique terms (excluding stop words) used in the metadata records were calculated for all records in the 176 collections. For collections with few records, the rate of unique terms per record was not a useful indicator. However, the objective of this experiment was to assess the additional chance for a collection to be discovered through one of its records (any new term helps a potential match on one or more items in the collection). The experiment should also show a considerable difference from the full text of a resource, thus revealing potential performance concerns in applying full-text retrieval algorithms to metadata records. The detail per collection is available in the following document.
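The term counts described above can be sketched as follows. The stop-word list and the tokenization rule (lowercased runs of letters and digits) are illustrative assumptions; the source does not say which were used.

```python
import re

# Illustrative stop-word list (assumption, not the list used in the experiment).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "for", "on", "by", "with"})

def tokenize(text):
    """Lowercase the text, split it into terms, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def term_variety(records):
    """Count total terms, unique terms, and unique terms per record for a collection."""
    total = 0
    unique = set()
    for record in records:
        terms = tokenize(record)
        total += len(terms)
        unique.update(terms)
    count = len(records)
    return {
        "total_terms": total,
        "unique_terms": len(unique),
        "unique_per_record": len(unique) / count if count else 0.0,
    }
```

For example, `term_variety(["The history of Illinois", "Illinois maps and history"])` counts 5 terms in total, 3 of them unique.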
The collection discovery experiment
176 metadata collections containing 10 to 85,000 records were included in the experiment, and the as-harvested records were loaded into a database.
10,403 logs from the generalist service OAIster were collected 2 days a month over a 13-month period. 5,192 unique authentic user queries (not considering stop words and redundant terms) were extracted from the logs and launched against the complete metadata records, whichever field the original query was directed to. Approximately half of the queries did not originally address the full metadata records. Nonetheless, the purpose of this experiment was to assess the potential redundancy of information contained in metadata records from similar collections, for the purpose of discovery in a generalist aggregation. To analyze the actual redundancy, the queries would have to be directed to specific fields (which also depend on the specific interface and mapping choices of the service providers).
The following table illustrates, for the queries that found a collection through one or more of its items, the percentage that also found 10% to 100% of the items in the collection (full matches only): Percentage of records from the collection returned for each query having at least one record match in the collection. For example, any query finding one item of the lib.umich.edu.brutbib collection always found all items in that same collection. In contrast, 95.57% of the queries that discovered the lib.umich.edu.archivision1icbib collection found less than 10% of the items in that collection.
Affinity between metadata records
From this analysis, an indicator of the Affinity of records in a collection was determined as the average percentage of the collection's items appearing together in a list of results.
Collection performance
The collection performance document reports the number of queries retrieving records from each collection, the number of matches among the top 40 matches, and the number of queries retrieving a collection through 1 or more items appearing in the top 40 matches of a query.
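The Affinity indicator described above can be sketched as follows. This is an illustration under stated assumptions: a "full match" is taken to mean that every query term occurs in the record, terms are split on whitespace, and the sample records and queries in the usage note are invented for the example.

```python
def affinity(records, queries):
    """Average share of a collection's records returned together,
    over the queries that match at least one record in the collection."""
    record_terms = [set(r.lower().split()) for r in records]
    shares = []
    for query in queries:
        query_terms = set(query.lower().split())
        if not query_terms:
            continue
        # Assumed full-match rule: all query terms occur in the record.
        hits = sum(1 for terms in record_terms if query_terms <= terms)
        if hits:
            shares.append(hits / len(record_terms))
    return sum(shares) / len(shares) if shares else 0.0
```

With the hypothetical records `["civil war letters", "civil war maps", "railroad maps"]` and queries `["civil war", "maps", "poetry"]`, two queries reach the collection and each returns two of the three records, so the Affinity is 2/3. A collection like lib.umich.edu.brutbib, where any matching query returns all items, would score 1 under this definition.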
Metadata properties
Collections typically harvested in the MODS format were excluded. The records harvested in qualified DC were mapped to simple DC. Analyses of the metadata characteristics aimed to determine which properties were systematically used in all records from the same collection, how often they had the same value, and how often user queries could find a match in those properties.
In some collections, a property may be either not applicable or unknown for several items, while this is not the case in other collections. If a cataloger adapts the semantics used to individual items, the item-level records are indeed "customized." The difference may also depend on the property considered. Typically, when providing context to individual metadata records, the cataloger adds the same values to every individual record (e.g., the Relation property). Only 40 of the 176 collections always have the same metadata properties (23% of the total number of collections considered) [Figure 1]. This suggests that case-by-case cataloging was performed for 77% of the collections to adapt relevant concepts to individual resources.
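The check for collections that "always have the same metadata properties" can be sketched as follows. Records are assumed to be dictionaries mapping a simple DC property name to its value, with empty values treated as absent; this representation is an assumption made for illustration.

```python
def has_uniform_properties(records):
    """True if every record in the collection uses exactly the same set of properties."""
    property_sets = {
        frozenset(name for name, value in record.items() if value)
        for record in records
    }
    return len(property_sets) <= 1

def share_uniform(collections):
    """Fraction of collections whose records all share one property set."""
    if not collections:
        return 0.0
    uniform = sum(1 for records in collections if has_uniform_properties(records))
    return uniform / len(collections)
```

Applied to the figures reported above, 40 uniform collections out of 176 gives 40 / 176 ≈ 0.227, which rounds to the 23% quoted.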
The "differentiation potential" of a property in a collection was calculated as the ratio between the number of distinct values of the property and the number of records having the property. If a distinct value existed for each record, the ratio was 1. Repeated identical values in the collection decreased the ratio. The average differentiation potential was calculated for the various CIC collections, with 1 as the maximum differentiation potential. Type, Rights, and Language were clearly the most consistent properties (likely to be similar for all records in a collection).
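The ratio defined above can be written directly. In this sketch, records are assumed to be dictionaries mapping a property name to its (concatenated) value, and a record "has" a property when the value is non-empty; the sample data in the usage note is hypothetical.

```python
def differentiation_potential(records, prop):
    """Distinct values of `prop` divided by the number of records having `prop`.

    Returns None when no record in the collection has the property.
    """
    values = [record[prop] for record in records if record.get(prop)]
    if not values:
        return None
    return len(set(values)) / len(values)
```

For three hypothetical records with distinct titles and the same Type value, the Title ratio is 1 and the Type ratio is 1/3, which illustrates why properties such as Type, Rights, and Language score low.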
The Average property length was also calculated for each collection, with the values of the same element concatenated. The Analysis of matches in the collection discovery experiment and presence of specific metadata properties showed that, even without considering queries targeted to specific fields, metadata fields differ in their tendency to contain the terms searched by users.
Correlations
A clear and unsurprising correlation was observed between the number of records and the record length (total number of terms in the collection) on the one hand, and the number of times the records were retrieved by user queries performed on the overall metadata record on the other. The Pearson correlations indicate that the amount of data (the number of terms in the collection, which reflects both the number of records and the length of records) is obviously very important in ensuring that a collection will be discovered, but that the richness or variety of terms in the collection (the number of unique terms) is more important.
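For reference, the Pearson correlation between a per-collection indicator (such as the number of unique terms) and the number of matches can be computed as in the sketch below. The function uses the standard formula; the numbers in the usage note are invented and are not results from the experiment.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(variance_x * variance_y)
```

A call such as `pearson(unique_terms_per_collection, matches_per_collection)`, with one value per collection in each list (both names hypothetical), returns a coefficient between -1 and 1; comparing it with the coefficient obtained for total terms is one way to express the "more important" claim above.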
Correlations between the indicators of similarity among metadata records from the same collection were also calculated. The following table indicates the correlation between the metadata records similarity indicators, the metadata language indicators, and the metadata affinity indicators calculated from the collection discovery experiment (all as described above).
Some of the results presented here will be published soon in Information Processing & Management.
A collaborative project of the Committee on Institutional Cooperation.
Hosted by Grainger Engineering Library and Information Center and the University of Illinois Library at Urbana-Champaign. Questions or comments about this web page? Contact mfoulonn@uiuc.edu. © 2004 University of Illinois.