Harvestable Metadata Services Development: Analysis of Use Cases from the World Data System

Minimally, a research data repository exists to make a collection of data assets available to potential users. If a dataset cannot be discovered and found, it cannot be reused (Garnett et al. 2017). Harvestable metadata catalogues are a key strategy for achieving greater global findability of data assets, as they create a surveyable access point to discover data products within large data collections. Such catalogues can be especially effective if they are tailored for interoperability with feature-rich infrastructures (e.g. meta-catalogues, see Kapiszewski & Karcher 2020; CRFCB 2014) that are highly visible and widely used, and also themselves integrated within the larger ecosystem of research infrastructures. This study offers insight into a set of World Data System (WDS) research data repositories' ongoing and successful implementations of harvestable metadata services, which apply established and emerging research data standards and practices to fit global, local and domain-specific interoperability contexts. Establishing a harvestable metadata service involves making choices in a space where standards and technologies are continuously evolving. The repositories in this study leverage the resources they have, within the policy and funding constraints of their institution, to


INTRODUCTION
Harvestable metadata services are an effective, established and widely-used approach to promoting data discovery and sharing across broad communities of potential data users, across multiple disciplines (Lokers et al. 2016; Valentine et al. 2020). For the purpose of this study, we understand harvestable metadata as a set of metadata records in a standardized format and schema that is shared with aggregation services by means of specific protocols for metadata transfer, which are also standardized. In this paper, we describe examples of ongoing and successful implementations of harvestable metadata services, which apply emerging and established standards and community practices to fit local and domain-specific research data management contexts. These use cases originated from the Harvestable Metadata Services Working Group (HMetS-WG), 1 which met frequently in a series of working sessions over 6 months during 2020, followed by occasional meetings during 2021. The study offers an overview of the infrastructures, standards and communities of the repositories that were members of the HMetS-WG, as well as offering a wider-ranging discussion of challenges that repositories may face when developing data services, such as harvestable metadata.
Taking a qualitative approach, this study explores issues for implementing harvestable metadata services at repositories. We start with a description of use cases, focusing on each repository's technical features, along with the challenges encountered in pursuit of repository-defined and community-oriented service development goals. Repositories are also characterized by the subject and disciplinary areas covered, targeted user groups, and services offered. The full-length profiles for each repository are described as use cases by Urquidi Diaz et al. (2022). After examining the use cases within the context of the current literature on recommended practices for metadata syndication and pathways toward interoperability, we present a set of common characteristics and challenges described by the repositories in this study. These experiences involved making decisions about which technologies to develop for an often heterogeneous, dynamic user base, within an evolving technological landscape, in order to implement data and metadata services that fit within the resource and policy constraints of the repository.

METADATA HARVESTING, STANDARDS AND PROTOCOLS
In a typical metadata sharing process, a research data repository will share a catalogue of assets: a collection of metadata records that describe each dataset, which are typically accessible through a search interface on the repository's portal. The repository may also share a set of standardized metadata records via additional access points (or harvestable metadata services), using a metadata transfer protocol through which aggregation services, such as harvesters, obtain the metadata (see Figure 1). Persistent links to data landing pages at the host repository are typically contained in those records. An aggregator may then convert (reformat or cross-walk) the acquired records into a unified display standard, to be disseminated by means of a federated metadata catalogue or a federated search engine. Examples of metadata harvesters that target research data include the Canadian Federated Research Data Repository (FRDR), 2 and B2FIND (Europe). 3
The adherence to shared standards and community practices is a key tenet for successful digital research infrastructure (DRI) integration and interoperability (Dietze et al. 2018; Waide, Brunt & Servilla 2017; Yu et al. 2021). Common standards for harvestable metadata include the Dublin Core (DCMI 2020), DataCite (DataCite Metadata Working Group 2021) and ISO 19115 (International Standards Office 2019) metadata schemas, as well as protocols for transferring metadata, like the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (Lagoze et al. 2005) and the Open Geospatial Consortium Catalogue Service for the Web (OGC-CSW) (Nebert, Voges & Bigagli 2016). Metadata records are usually transferred as eXtensible Markup Language (XML) or JavaScript Object Notation (JSON, and JSON-LD, for linking data within semantic metadata) files. Another approach to syndicating metadata uses semantic metadata tags, such as Schema.org, 4 that are placed in the HTML of dataset landing pages on a repository's web portal, or in separate metadata files. This strategy relies on web crawlers (such as Google) parsing the semantic metadata to aggregate and index the landing pages for search engine retrieval. Even though this approach to metadata sharing can complement harvestable metadata services, the semantic strategy was not pursued by the HMetS-WG. Instead, WDS-ITO engaged members of various communities in a separate initiative to develop semantic metadata using Schema.org (Payne & Verhey 2022).
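The harvesting step described above can be illustrated with a minimal Python sketch, which builds an OAI-PMH `ListRecords` request and extracts a few Dublin Core fields from a response. It is an illustration under stated assumptions rather than the workflow of any repository in this study: the endpoint `https://repository.example.org/oai`, the sample record, and the helper names `list_records_url` and `parse_records` are all hypothetical, and a real harvester would also need to fetch the URL and follow resumption tokens.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core (oai_dc).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a (hypothetical) OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hand-written sample response (assumption: not taken from a real repository).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:dataset-1</identifier>
        <datestamp>2021-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:identifier>https://repository.example.org/datasets/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, title and landing page from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "landing_page": rec.findtext(".//dc:identifier", namespaces=NS),
        })
    return records


if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

An aggregator would typically map the extracted fields onto its own display schema (the cross-walk step) before indexing them.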
As the importance of reusing data is increasingly recognized across disciplines, data repositories have proliferated to meet this demand, with the number of data repositories listed in repository registries, such as the Registry of Research Data Repositories (Re3Data), 5 growing rapidly (Culina et al. 2018). Also, with each data repository offering additional open data products, finding a particular dataset of interest becomes challenging for potential data users (Kramer, Klas & Hausstein 2018; Plante et al. 2021). A recognized approach to address this challenge is for repositories to establish capabilities for harvesting metadata to facilitate searchability and global discoverability (Culina et al. 2018; Plante et al. 2021; Wu et al. 2019). Furthermore, the availability of harvestable metadata is an indicator of dataset findability (CoreTrustSeal 2022; FAIR Data Maturity Model WG 2020) and contributes to repository TRUST-worthiness as it improves integration with the wider data management community (Lin et al. 2020: 3).

DATA COLLECTION
In September 2019, the WDS-ITO invited WDS member repositories to participate in the HMetS-WG and to (optionally) serve as use cases for the study. The invitation was sent to 35 unique WDS member organizations that had previously expressed interest in being informed about new WDS initiatives. Nine WDS member repositories participated in the group (Table 1), and seven (Table 2) were adopted as use cases. Over the course of the group's sessions, the repositories presented an overview of their infrastructure, data holdings, and services. All participating repositories also provided the group with a schematic overview of their features and subsequently, three repositories (NSSDC, INTERMAGNET and SEDAC) also completed the implementation plan template, described below, and shared these with the WDS-ITO.

5 The Registry of Research Data Repositories, https://www.re3data.org/, is a global registry of research data repositories.

Figure 1 The metadata harvesting process. Standardized metadata is harvested from repository catalogues, then processed by an aggregation service. The service disseminates the metadata records through a search and discovery portal and/or by serving it to further aggregation services for distribution.

Downs et al. Data Science Journal DOI: 10.5334/dsj-2023-020

The group's work agenda was initially guided by a workflow structure proposed by WDS-ITO (Figure 2), which represents harvestable metadata services development as a set of discrete, successive steps. As group discussions progressed, WDS-ITO provided members with a Harvestable Metadata Services Implementation Plan template (Urquidi Diaz 2021b) to describe their implementation plans. The template was inspired by and borrowed heavily from the CESSDA-Saw guidance package (Bornatici et al. 2017) and the JISC project plan templates (JISC 2011) for data service planning, which were designed to be adapted to specific use cases for a single service or a subset of services, such as harvestable metadata services. Both of these resources also include guidance for drafting implementation plans for these types of services (Bornatici et al. 2017; JISC 2011), which also informed the development of the template. Supporting information resources also included a Twine interactive narrative/storyfied walkthrough of the implementation plan flowchart (Urquidi Diaz, Li & Payne 2021), and a Zotero library with resources related to harvestable metadata services (Urquidi Diaz 2021a). While the questions derived from the workflow structure guided initial HMetS-WG discussions, the availability of these additional resources, along with the individual repository overviews and implementation plans, facilitated broader discussions of implementation issues among the HMetS-WG repositories.

DATA PROCESSING AND ANALYSIS
Building on the initial discussions of the workflow questions, the subsequent broader discussions among the HMetS-WG repositories further contributed to the development of detailed repository profiles, which are accessible online as use cases (Urquidi Diaz et al. 2022). Where available, the profiles reference the repositories' technical documentation and other relevant publications to provide informative use cases. Urquidi Diaz et al. (2022) described the following characteristics of the repositories within the use cases:
1. Institutional overview: Brief description of the repository's institutional context: its governance, history, mandate, mission, memberships, and other organizational features.
2. User community: Target communities for repository services.
3. Infrastructure overview: Description of repository's data holdings and technical infrastructure for service provision.
4. Current state of metadata: Metadata formats, standards used, and metadata services (if any).
5. Planned development: Plans for future development.
6. Resources: Description of repositories' sources of support and financing.
7. Challenges: Initially, each repository described the challenges they have faced in developing harvestable metadata and other data services on their platforms.
Discussing the institutional overviews and implementation plans, as well as the compiled information resources, in terms of their applicability to repository practices contributed to understanding the current state of the repositories' implementation issues. While differences across the repositories were observed, discussions of the issues associated with developing harvestable metadata services identified similarities among the challenges faced by the repositories represented in the HMetS-WG. Recognition of these similarities led to the emergence of a consensus on the challenges that the participating repositories face in developing and deploying harvestable metadata services.

REPOSITORIES PARTICIPATING IN THE WORKING GROUP
The host institutions of the participating repositories (see Table 1) were based in China, the UK, the US, and France. Seven repositories were Regular 6 Members of the WDS and two were Network 7 Members (WDS Scientific Committee 2016).

RESEARCH AREAS AND TARGET USER COMMUNITIES
The research areas served by the repositories represent a predominant Earth and planetary sciences orientation. Social sciences, including environmental and economic sciences, are also strongly represented. As described by Urquidi Diaz et al. (2022)

REPOSITORY FEATURES
Table 3 gives an overview of each repository's technical features: the type of repository platform and catalogue service used, metadata standards and protocols, and a list of any current, known aggregators of their metadata assets. Figure 3 presents the metadata exchange protocols utilized by the repositories studied, in the context of those of the larger WDS membership, as surveyed in 2019 by the WDS-ITO (Payne & Urquidi Diaz 2020). Relative to WDS members previously surveyed, the repositories in the use cases have, or plan to develop, more OGC-CSW and OpenSearch, and fewer OAI-PMH services (Urquidi Diaz et al. 2022). It also should be noted that the WDS member survey data reported by Payne and Urquidi Diaz (2022) does not distinguish between protocols residing within repositories and those that are provided by aggregators, such as the Earth Observing System Data and Information System (EOSDIS) and the Global Earth Observation System of Systems (GEOSS), that disseminate metadata on behalf of repositories.

Participation in research data networks
As described in the sections below, it appears that participation in national, regional, as well as subject-specific networks has generally shaped the repositories' infrastructure, particularly in the ways that their adoption of harvestable metadata services has developed or is being planned for development.

CHALLENGES
As described in the Methodology section, analysis and discussion of the challenges that the HMetS-WG repositories face in developing and deploying harvestable metadata services led to a consensus on what those challenges have in common. This consensus revealed three overarching themes: changing user needs, sustainability, and evolving technologies.
The three themes are closely linked to each other. Developing a good understanding of current and evolving technology trends and changing user needs, in light of existing and projected capabilities and resources, can help repositories to identify a sustainable approach for their new development efforts, and reduce the risk of having to pay for expensive corrective measures in the future.

CHANGING USER NEEDS
The first major theme reflects repositories' efforts to identify and meet the changing needs of the user communities that they serve. Such efforts include adopting standards that maximize metadata interoperability and deploying metadata schemas that are widely used, but also versatile and extensible enough to address the changing needs of the user community. Serving the needs of repository users, including data producers and data reusers, is one of the primary objectives of research data repositories. Continuing to provide services that meet users' needs as those needs change is a key indicator of repository success.
Minimally, a research data repository exists to make a collection of data assets available to a designated community of users. Deploying harvestable metadata catalogues is a key strategy for reaching users, as these services can inform potential users and increase awareness of repository holdings. Such catalogues can be especially effective if they are tailored for interoperability with infrastructures (e.g. metacatalogues) 20 that are highly visible, feature-rich, widely-used, and also themselves integrated within the larger ecosystem of research infrastructures. (2016). At INTERMAGNET, participants are volunteer magnetic observatories which, following standards defined by the network, seek to share and confidently reuse geomagnetic data within the community. ISGI's participants, in contrast, are institutes whose official task is defined by the International Association of Geomagnetism and Aeronomy (IAGA): to derive and make available officially endorsed data products. In recent years, the geomagnetism community has sought to achieve interoperability with other scientific fields of Earth and environmental observation, and to keep up with current trends to make data more usable, and also more useful, to a larger group of users, not only geomagnetism specialists.

New users, new challenges
As we shall see below, both organizations have needed to factor in these developments when selecting their data and metadata sharing technologies.
Post-pandemic, data-driven regional economic development efforts involve new challenges, especially in rural areas, mountain regions, and small islands. In order to help such regional stakeholders, including decision makers and small businesses, GCdataPR initiated the Geographical Indications Environment & Sustainability (GIES) program. By opening quality datasets, data papers, and metadata (physical geographical data, agriculture products data, socio-economic data and local culture information, as well as in situ timely ecosystem monitoring data), the geographical indications or specific agriculture products could be used by consumers. The GIES case clusters and practices demonstrated that this is an effective way for the repository to serve local people in attaining the 2030 Sustainable Development Goals (SDGs) (Liu, Gong & Liu, et al. 2021).

Stakeholder engagement, user outreach, adaptation of services
The repositories in this study have shown a clear user orientation, and most report an intent to serve diverse user communities: from the general public to industry data users, to researchers in highly specialized knowledge areas (see Table 2). Concerted outreach is regularly carried out among multiple groups of users and stakeholders, including current and potential users. Also, without exception, each of the repositories participates actively in sundry working groups and opportunities to exchange knowledge, within grassroots, top-down, or federated organizations. Some of these include the WDS, the Research Data Alliance (RDA), the International Science Council's Committee on Data for Science and Technology (CODATA), the American Geophysical Union (AGU), the European Open Science Cloud's (EOSC) EPOS ERIC, the Group on Earth Observations (GEO), China-GEOSS, and the ESDIS system at NASA. At IGS, for example, data services are being developed to meet the needs of new and established users (Ventura-Traveset, Navarro & Romero 2019), such as those found within IGS itself, including product coordinators and participants in working groups, pilot projects or analysis centers (Villiger & Dach 2019a: 139). But because all users of modern mapping, orientation and navigation systems are beneficiaries of the work done by IGS, the IGS Central Bureau has established various channels for outreach and communication (ibid.: 18) with the public and individuals, enterprises, non-profits, institutions and government actors worldwide. These channels include social media outlets like Twitter, where IGS uses the #GNSS4impact hashtag to tweet about common applications of GNSS data. Part of the aim is to make the general public aware of this foundational yet invisible infrastructure. Making IGS's work visible to the general public in ways that can be measured, such as through citation of IGS data, products, and other published outputs, helps IGS advocate for the organization and make a strong case to its supporting partners and funders (IGS Central Bureau 2023).
Repositories will also need to adapt the metadata that they distribute as the needs of the user communities that they serve change. In addition to revising the repository services offered, such as recommended uses and data formats, it may be necessary to adopt metadata standards and enhance metadata harvesting capabilities to reflect the knowledge and research interests of the new community segments and domains that are being served. For example, a repository may discover changes in the disciplines of its users by identifying the disciplines of publications and authors that are currently citing the repository's data holdings. Learning about such changes can enable the repository to identify additional metadata standards, particular metadata elements, specific vocabularies and harvesters that can serve the needs of the new communities as the disciplines of users change. Recent developments, such as those described by Musen et al. (2022), include metadata templates, discipline-specific ontologies, and metadata evaluation software tools that enable rich FAIR-compliant metadata to be produced for distribution to particular communities and across communities of data users.

Repository usage metrics and citation counts
To some extent, repositories can keep track of their efforts to increase data discovery and, ultimately, usage through counters that measure user engagement with repository assets (e.g. clicks, downloads, searches, turnaways), which can reveal fluctuations and patterns in a repository's engagement and usage. A current standard for repository metrics is embodied in the COUNTER Code of Practice (Fenner et al. 2018). Some repositories, such as NSSDC and SEDAC, employ a simple user authentication requirement, via a single log-in or registration with an e-mail address, to gain insight into data usage patterns beyond raw metrics, shedding light onto the frequency of usage for each item and the types of users who may be accessing data assets. In contrast, GCdataPR reports using IP addresses and real-time usage statistics to keep track of the repository's international visits, in a way that is consistent with GCdataPR's stated goal of reaching a broader international user base. But, while potentially useful for tracking users' online interactions with the repository, these alternative metrics also have limitations as indicators of actual dataset reuse (Ramachandran, Bugbee & Murphy 2021).
Alternatively, data citation tracking, despite its limitations, 21 is increasingly becoming a tool that can be used to estimate the scientific impact of a repository's data assets and to facilitate some types of bibliometric analysis of data usage. Among our use cases, GCdataPR, SEDAC and WDC-RRE report tracking data citations. SEDAC's platform has also implemented a searchable online database that contains references to citations of the repository's datasets (Socioeconomic Data and Applications Center 2023a).

SUSTAINABILITY
The second set of challenges for developing and deploying harvestable metadata services concerns the limited opportunities that repositories have for ensuring the sustainability of their services, especially when considering resource and policy constraints. Sustainable services are needed to provide continuous operations while facing the combined challenges of meeting the changing needs of users with technology that is evolving. Furthermore, with limited resources for technical development, repositories must consider the costs of establishing new services while providing and maintaining existing services.
Securing continual support for sustainable repository development and maintenance is a fundamental management challenge, especially for small- and medium-scale research facilities. Our group of repositories has faced these challenges by gaining support within their host institutions and finding support through partnerships.

Sustainable growth and operations
In research organizations without a strong culture of research data management (RDM), it may take time to build support for expanding data services with initiatives such as a new metadata service. For example, the Göttingen eResearch Alliance (Dierkes & Wuttke 2016) built institutional support by engaging with the organization's key decision makers and stakeholders. Alternatively, SEDAC and the three WDS members in China have been able to build support for their data centers within their host institutions and their national data infrastructures, and this is reflected in the repositories' maturity status. These examples also underscore that collaboration among community stakeholders fosters efforts to attain data repository interoperability, as reported by Gries et al. (2018). For less hierarchical organizations like research networks and data federations, the most salient challenges involve coordinating the development of a common standard or application profile, or coordinating the adoption of an existing technology (Yarmey & Baker 2013). The two WDS network members among our use cases, IGS and INTERMAGNET, are different examples of established, international data federations that managed to create impressive infrastructures on the basis of voluntary member participation, through many decades of collaborative work.
The voluntary, federated character of IGS relies on decentralized funding schemes for projects and initiatives, usually by public institutions, governments or other research organizations. To maintain its reliable service provision, IGS must rely on system redundancy and on multiyear support commitments from the institutions that host the key elements of the system (IGS Central Bureau 2023; Villiger & Dach 2019b). To marshal support for a project, repository partners have to be able to envision the positive and tangible ways in which the project will impact funding partners and their constituencies, and how it will benefit the institution and society as a whole. In particular, IGS public outreach and communication initiatives reflect the organization's keen understanding of that fact.

Resource constraints
It is also useful to bear in mind that open-source software (OSS) is being produced and made available on a regular basis, some of which is intended to help repositories implement harvesting protocols at lower investment cost. For example, harvesting protocols can be implemented as modules in bespoke repository platforms by means of Viringo, an OAI-PMH API created by DataCite and further developed at FRDR, or pycsw, a Python implementation of the OGC-CSW protocol that is used by WDC-RRE for its Catalogue Service. A minimal implementation of harvestable metadata may consist of a web-accessible folder (WAF), sitemap, or publicly accessible XML file of machine-readable metadata.
While a discussion of the advantages and disadvantages of OSS lies outside of the scope of this paper (see Trappler 2009, for a discussion of OSS pros and cons), it bears mentioning that repository managers will need to weigh the benefits of OSS against potential trade-offs (e.g. increased labor costs, community vs. corporate support services, etc.). Nevertheless, software solutions implemented with OSS may offer advantages for adoption if technological compatibility and software reusability are possible.
Independent of the decision to select a particular approach for implementing an enhancement, such as harvestable metadata capabilities, additional sources of support may be needed to sustainably develop and deploy improvements to data repository infrastructure. If the costs of enhancements are not absorbed by operating budgets, such costs may need to be supported separately. In such cases, data repositories may need to initiate projects and secure additional support for improvements to their services as part of their approach to providing sustainable data stewardship (Downs & Chen 2016).
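To make the minimal option concrete, the following Python sketch generates a sitemap that lists the URLs of a repository's metadata records so that an external service can discover them. It is only an illustration of the idea, using the standard sitemaps.org namespace; the record URLs, dates and the function name `build_sitemap` are hypothetical and do not describe any repository in this study.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return sitemap XML for an iterable of (url, last_modified) pairs."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url, last_modified in records:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical metadata record locations in a web-accessible folder.
    example_records = [
        ("https://repository.example.org/waf/dataset-1.xml", "2021-01-01"),
        ("https://repository.example.org/waf/dataset-2.xml", "2021-02-15"),
    ]
    print(build_sitemap(example_records))
```

The resulting file could be published at a stable URL and regenerated whenever records are added or updated; whether a given aggregator accepts sitemaps as a harvesting source would need to be confirmed with that aggregator.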

EVOLVING TECHNOLOGIES
The third theme reflects the set of challenges for making strategic decisions and associated investments in a landscape of evolving technologies and changing standards. Weighing the factors that influence such decisions presents a significant challenge for repository managers. Repositories must assess the potential of a technology or standard to meet current and future needs, as well as its maturity, to determine whether and when it can be adopted.
The repositories in this study represent established data-sharing communities that have been sharing scientific data (in analogue and digital formats) since long before the advent of the internet. Considering the ever-changing technological landscape, the 'ideal' constellation of technologies and services may seem like a moving target: over the past few decades, these repositories have experienced multiple waves of technical innovation, which have time and again transformed the ways in which data is obtained, documented and shared with other researchers.

Metadata and open data access policies
In general, repositories may be hesitant to expose metadata for protected datasets and/or collections. Although none of our repositories reported hosting private or confidential data, some assets in the NSSDC repository are embargoed for a short time period, which is deemed long enough to ensure that data owners' rights and interests are protected. NSSDC's approach
The ability to permanently and uniquely reference arbitrary data subsets and subsequent versions of a dataset is key to safeguarding the reproducibility of scientific studies that rely on shared data.
To tackle the technical challenge involved, groups such as DataCite (DataCite Metadata Working Group 2021) and the Research Data Alliance's (RDA) Data Versioning Working Group (Klump et al. 2021; Klump et al. 2020) have developed approaches and recommendations to implement dataset versioning and dynamic data citation. In 2015 the latter group released an RDA recommendation describing the dynamic assignment of PIDs to every new, unique data query that produced a given data subset (Rauber et al. 2015). With this approach, when a dataset changes due to updates or reprocessing (Klump et al. 2021), or when a subset of data is extracted from a larger dataset, or republished within a larger data collection (as described in Klump, Huber & Diepenbroek 2016), these unique products can themselves be reconstructed, identified, referenced, cited and reused. These RDA recommendations have been implemented in various data repositories that enable citation of time-stamped versions of subsetted dynamic datasets with persistent identifiers, facilitating retrieval, across sundry data types, for reuse (Rauber et al. 2021).
Of our present set of use cases, only the WDC-RRE repository reported having already implemented a system to assign PIDs to versioned datasets (WDC-RRE 2016), in which identifiers are coded to refer back to data queries executed on specific, time-stamped dataset versions. Two others, INTERMAGNET and ISGI, expressed an interest in developing a PID versioning system in future stages of their repositories' development. This approach would expedite the release of non-definitive datasets of geomagnetic observations, making these very detailed and highly valuable data assets available sooner to the scientific community.
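The idea of tying an identifier to a query executed on a specific, time-stamped dataset version can be sketched in a few lines of Python. This is a hypothetical illustration of the general principle, not the scheme used by WDC-RRE or prescribed by the RDA recommendation: the identifier layout, the `pid:example` prefix and the function name `query_pid` are assumptions, and a production system would also store the query and register the identifier with a PID service.

```python
import hashlib
import json


def query_pid(prefix, dataset_id, version_timestamp, query):
    """Derive a reproducible identifier for a query on a dataset version.

    The query is normalized (sorted keys, compact separators) so that two
    semantically identical queries yield the same identifier.
    """
    normalized = json.dumps(query, sort_keys=True, separators=(",", ":"))
    payload = f"{dataset_id}|{version_timestamp}|{normalized}"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}/{dataset_id}.{digest}"


if __name__ == "__main__":
    # Hypothetical subset query against a hypothetical dataset version.
    subset = {"station": "ABC", "start": "2020-01-01", "end": "2020-01-31"}
    print(query_pid("pid:example", "geomag-minute", "2021-03-01T00:00:00Z", subset))
```

Because the version timestamp is part of the hashed payload, re-running the same query after the dataset has been updated produces a different identifier, while repeating it against the same version reproduces the original one.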
Another recent and well-documented example of metadata versioning from a WDS Member repository is Project MINTED at Ocean Networks Canada (Jenkyns & Ridsdale 2020;Jenkyns 2019), who also had an active role in developing the RDA's Data Versioning WG's outputs.To determine how much an existing repository infrastructure can achieve, and to pursue new development opportunities accordingly, an ongoing and thorough assessment of a repository's infrastructure is recommended.To support a repository's initial self-assessment, the global RDM community has produced instruments to assess the maturity and trustworthiness of a data repository and the data assets, including metadata records, it contains (Downs 2021;Peng 2018).Some practical, up-to-date frameworks for reviewing a repository's current state are the most recent version of the CoreTrustSeal requirements for trustworthy data repositories (CoreTrustSeal 2022), the RDAs new FAIR data maturity model (FAIR Data Maturity Model WG 2020), the CARE Principles for Indigenous Data Governance (Carroll et al. 2020), and the TRUST Principles for digital repositories (Lin et al. 2020).Data repositories also need to continually assess the technology landscape to identify opportunities for improving capabilities to serve their designated communities.Cooperating with other repositories, within and across disciplines, helps with such assessments, especially when cooperating repositories share adoption stories and lessons-learned.
Two cases in our study reflect an interplay between changing user needs, evolving technologies, and resource constraints. The two geomagnetism data repositories, INTERMAGNET and ISGI, contain data assets with enormous potential for innovative, interdisciplinary research, but their metadata formats and services have not been updated to current standards. For each repository, the challenge lies in finding a strategy that will allow it to exploit its data's potential to serve current (known) users as well as future (known and unknown) ones. This involves balancing general-purpose against use-case-based repository developments, including metadata standards and exchange protocols. ISGI and INTERMAGNET have reported different strategies, based on different priorities, to respond to this challenge. INTERMAGNET has reported having to weigh the advantages of general-purpose, extensive standards that can open future (yet unknown) avenues of research and collaboration against use-case-based approaches that tailor new developments to better support each new case. In contrast, the existence of concrete opportunities for interdisciplinary collaboration (for example, between ISGI and researchers in the biological sciences) may justify an approach that tailors a repository's developments to a set of concrete use cases, taking a chance on their potential for future extensibility. HMetS-WG repositories also recognize the tension between the two fundamental principles of investing in future-proof technologies and maximizing user engagement with the data over time. In practice, repositories will usually attempt to balance both principles when designing their development plans.

LIMITATIONS OF A HARVESTING STRATEGY FOR DATASET DISCOVERY
In many of the cases described in these reports, the development strategy for harvestable metadata services has been very thorough. To varying degrees, the SEDAC, WDC-RRE and GCdataPR use cases hint at the limits of a discovery/findability strategy based on harvestable metadata services alone. These repositories, in particular, have motivated the ITO's decision to create an inventory of metadata aggregation services (Li & Payne 2021) that will allow repository managers to find aggregators outside their community's beaten path. As mentioned above, and motivated in part by inclusion in Google Dataset Search, SEDAC already has a semantic metadata strategy underway. WDC-RRE and GCdataPR have likewise expressed interest in receiving ITO support to develop a semantic metadata strategy.
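Google Dataset Search, mentioned above, reads dataset descriptions embedded in landing pages as schema.org `Dataset` markup. The sketch below shows one way such markup could be generated as JSON-LD. It is an illustration only and not any repository's actual implementation: the function names and all field values are hypothetical, and only a small subset of schema.org properties is shown.

```python
import json


def dataset_jsonld(name: str, description: str, identifier: str,
                   license_url: str, creator_name: str) -> str:
    """Serialize a minimal schema.org Dataset description as JSON-LD."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "license": license_url,
        "creator": {"@type": "Organization", "name": creator_name},
    }
    return json.dumps(record, indent=2)


def as_script_tag(jsonld: str) -> str:
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Hypothetical values, for illustration only
markup = dataset_jsonld(
    name="Example gridded dataset",
    description="Placeholder description of an example dataset.",
    identifier="https://doi.org/10.0000/example",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Example Data Center",
)
snippet = as_script_tag(markup)
```

Embedding the description in the landing page complements, rather than replaces, a harvesting endpoint: crawlers discover the record by visiting the page instead of querying a catalogue protocol.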

CONCLUSIONS
The experiences reported in this study frame the socio-technical dimensions of research service development, where success depends largely on meeting the diverse needs of stakeholders within the designated communities of the repositories studied. Within each repository, moreover, the users may reflect different research perspectives in terms of interests and methods, or they may even employ different epistemological and ontological approaches (Poirier & Costelloe-Kuehn 2019). In effect, developing repository services, including harvestable metadata, involves identifying, adopting, and developing continuously evolving technologies that demonstrably serve the changing needs of heterogeneous user communities, within the policy and funding constraints of the institution. While the 'ideal' constellation of technologies and services may seem like a moving target, finding the right balance for their unique use case appears to be an attainable goal for most repositories. When developing new services using cross-domain recommendations and policies, the 'need for standardization and interoperability' must be balanced 'against the need for flexibility and discipline-specific nuance' (Goddard et al. 2021). Which standards and technologies will best serve the original producers and established users of datasets, as well as the larger user community, including new and future data users? Nearly all of our repositories conduct some level of market research and intelligence gathering to inform their service development in general, and harvestable metadata services in particular: gathering usage data and data citation counts and characteristics is necessary to monitor how data is queried and used. Other common practices involve engaging in designated community outreach and participating in cross-domain and/or international working groups, as well as having dedicated working groups with diverse stakeholders, or engaging with current and prospective users directly, such as via interdisciplinary research collaborations.
Downs et al. Data Science Journal DOI: 10.5334/dsj-2023-020
Strategies for project sustainability vary according to the repositories' institutional structure. For repositories embedded in centralized and hierarchical institutions (such as research centers, or national digital infrastructure projects), attaining long-term sustainability is contingent on continued support by parent organizations. In these settings, some key strategies include sustained engagement with the organization's key decision makers and stakeholders to seek strategic alignment, and maximizing opportunities to build support for data centers within their host institutions. For repositories embedded in decentralized organizations, like research networks and data federations, the main sustainability challenge is one of coordination and community development. Among our use cases, IGS and INTERMAGNET represent examples of data infrastructures that leverage voluntary member participation and decades of collaborative work to develop and maintain their services over time.
Lastly, the results from this study strongly suggest that participation in, and integration into, technical networks (national, regional or subject-specific) can be a driver of technological development in member repositories. In all cases, the intermediating entity (a network, community or institution) effectively functions as a catalyst for service development and standards implementation, as well as an incubator that connects repositories' local ecosystems with global research data sharing spaces. The three themes identified in this study for the challenges of developing and deploying harvestable metadata services also have implications for the challenges that repositories face more generally, and in terms of other capabilities, as they try to improve their services while meeting the changing needs of users with evolving technology in a sustainable manner. Such implications may be considerations for future research and theory development.

Table 1

Table 2
Subject areas represented by repositories and target user groups.

Table 3
The repositories have been guided or supported by a larger entity while developing harvestable metadata services: INTERMAGNET and ISGI have participated in the European Open Science Cloud's (EOSC) EPOS ERIC project, while WDC-RRE, NSSDC and GCdataPR have developed with support from Chinese research data institutions. One of the data sources for the GCdataPR comes from cooperation with journals to enable discovery. GCdataPR has operated a tri-journal program since 2015 to facilitate dataset publication, data paper publication and science discovery publication. The three journals worked closely with authors to publish discovery papers as well as datasets and data papers. Finally, both SEDAC's and IGS's infrastructures have been supported by the National Aeronautics and Space Administration (NASA) EOSDIS community and its extensive collections of knowledge and technical resources.
Within the European geomagnetism community, the European Plate Observing System European Research Infrastructure Consortium (EPOS ERIC) has played a major role in promoting the uptake of 21st-century technologies and standards to create more granular and robust metadata and dataset documentation (Chambodut et al. 2018; Flower and TGS Geomagnetic Observations 2019). Following EPOS ERIC's leadership, ISGI plans to migrate the repository's metadata records into an interoperable schema that will allow the repository to serve metadata to European aggregators like OpenAIRE. Currently, ISGI is considering implementing CERIF-, DataCite-, and/or DCAT-compliant metadata. Since 2013, INTERMAGNET has been publishing yearly definitive data through the GFZ (GeoForschungsZentrum) Data Service, which serves dataset metadata to aggregators using various metadata standards and sharing protocols. Furthermore, a metadata development project is underway to gather metadata for all observatories recording geomagnetic data worldwide. This includes the INTERMAGNET geomagnetic observatories metadata combined with metadata records held by the WDC for Geomagnetism, Edinburgh.
Outside of the WDS, the Chinese repositories contribute to the larger Chinese digital research infrastructure, as part of 20 Chinese Data Centers organized under the National Science and Technology Infrastructure Center of China.18 The 20 national data centers provide their metadata collections on a regular basis to a unified metadata search portal operated by the National Science and Technology Data Sharing Network of China (National Science and Technology Infrastructures 2016). These records must comply with the Chinese Science and Technology Infrastructure Resource Core Metadata standard (GB/T 30523-2014, China National Institute of Standardization 2014). Furthermore, all metadata records held by the 20 national data centers, including NSSDC, must be registered in accordance with the Science and Technology Resource Identification (CSTR) standard, GB/T 32843-2016 (China National Institute of Standardization 2016), so that these metadata records can be discovered in the CSTR Identification platform.
Another class of data repository publishes peer-reviewed datasets through a digital journal. The Global Change Data Repository is a digital journal (ISSN 2096-868X), which is issued monthly and is compatible with the Journal of Global Change Data & Discovery (ISSN 2096-3645), a journal for publishing data papers. The two journals and the data and knowledge hub (metadata-based links for specific applications) are part of the Global Change Research Data Publishing & Repository (GCdataPR). Through its publication methodology and procedures, the GCdataPR maintains long-term preservation and public availability of timely, quality and informative datasets. Both WDC-RRE and NSSDC also maintain custom metadata profiles that integrate local and international interoperability features. In addition, the China National Knowledge Infrastructure (CNKI) is also aggregating metadata from GCdataPR and WDC-RRE.
Two of the repositories, SEDAC and IGS, are (at least partially) based in the United States, and they receive support from NASA's infrastructure. As one of NASA's Distributed Active Archive Centers (DAACs), SEDAC participates actively in initiatives stewarded by the Earth Science Data and Information System (ESDIS) project, and SEDAC metadata is provided to NASA's EOSDIS Common Metadata Repository (CMR). The CMR is the back-end of Earthdata Search, the Global Change Master Directory (GCMD), and the International Directory Network (IDN), the latter of which transfers SEDAC metadata into GEOSS. The complete collection of IGS data is distributed across data centers, and one of its two complete mirrors is hosted by a NASA EOSDIS data center, the Crustal Dynamics Data Information System (CDDIS) (the second mirror is hosted by the European Space Agency).19 Thus, at present, metadata records for SEDAC datasets and for IGS collections are served in metadata search/retrieval endpoints at the CMR (Noll and Michael 2019), and they are available in multiple established metadata formats, specifically: DIF 10, ECHO 10, ISO 19115-2:2009 (MENDS and SMAP dialects), and UMM-C (Reiter and llincione 2019).
*Includes schema.org.
The Chinese research data infrastructure. GCdataPR, NSSDC and WDC-RRE were among the original Chinese data repositories that joined the ICSU system of World Data Centers in 1988. In 2008, to promote collaboration between the eight Chinese repositories at the WDS, the WDS China Common Clearinghouse was created (Wang et al. 2020). The prototype for the WDS China's unified metadata search portal was constructed with Pycsw, a Python implementation of the OGC's Catalogue Services for the Web (CSW) specification (Wang et al. 2020). This initiative, led by WDC-RRE, encouraged and supported WDS members to develop harvestable metadata services based on similar spatial data interoperability standards, notably ISO 19115/19139/19119 metadata and the OGC CSW protocol.
18 Many of which also maintain a close collaboration with WDS as non-members.
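As a rough illustration of how an aggregator might page through a CSW catalogue such as the one described above, the sketch below builds CSW 2.0.2 `GetRecords` requests in key-value form and asks for ISO 19139 (`gmd`) output. The endpoint URL is a placeholder and the helper names are hypothetical; whether a given server accepts this exact combination of `typeNames` and `outputSchema` depends on its configuration, so this should be read as an assumption-laden sketch rather than a description of the WDS China portal.

```python
from urllib.parse import urlencode

# Namespace commonly used to request ISO 19139 (gmd) encoded records
ISO_GMD_SCHEMA = "http://www.isotc211.org/2005/gmd"


def csw_getrecords_url(endpoint: str, start: int = 1, page_size: int = 10) -> str:
    """Build a CSW 2.0.2 GetRecords request URL (key-value-pair encoding)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        "outputSchema": ISO_GMD_SCHEMA,
        "startPosition": start,   # CSW record positions are 1-based
        "maxRecords": page_size,
    }
    return f"{endpoint}?{urlencode(params)}"


def page_starts(total_records: int, page_size: int) -> list[int]:
    """Return the startPosition value for each page needed to cover all records."""
    return list(range(1, total_records + 1, page_size))


# Hypothetical endpoint and record count, for illustration only
ENDPOINT = "https://catalogue.example.org/csw"
urls = [csw_getrecords_url(ENDPOINT, start=s, page_size=10)
        for s in page_starts(total_records=25, page_size=10)]
```

A harvester would fetch each URL in turn and parse the returned XML; in practice the total record count would come from the first response rather than being known in advance.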
ISGI and INTERMAGNET provide good examples of how users' growing diversity may pose challenges to repositories, even those with well-established data-sharing cultures. Open Data and sharing have always been essential for the geomagnetism community, as earth-observation research can rarely be done without data from multiple countries. In fact, geomagnetism's established data-sharing tradition is evidenced by over 50 years of collaborative data practices, which have included yearly data publications and established, shared standards, e.g. the IAGA2002 Data Exchange Format.
... is compatible with the requirement that data be as open as allowable, but as restricted as necessary. SEDAC favors the use of open data licenses (mainly CC BY 4.0),22 'unless there are extenuating circumstances such as data restrictions inherited from input data' (Socioeconomic Data and Applications Center 2023b). Wherever relevant, necessary consideration must also be given to data sharing practices and principles (beyond FAIR) that focus on various ethical concerns, such as the First Nations Principles of OCAP (First Nations Information Governance Centre 2014), and the CARE Principles for Indigenous Data Governance (Carroll et al. 2020). This means investing in the technical solutions that embody those principles: differentiated access policies and secure data storage, with trustworthy capabilities for offering selective data access under distinct protection classifications, or for providing access only to authorized users. Machine-readable data licenses in metadata (Creative Commons 2002) can instruct search engines and automated software to display and filter content according to their licensing, which can in turn remind users of the freedoms and obligations (e.g. proper attribution) associated with the dataset.
4.3.2. PIDs, DOIs, and identifiers for dynamic datasets
Persistent, unique identifiers (PIDs) for digital objects can enhance and enable a range of interoperability features, from automatic metadata retrieval for bibliographic references in tools like RefWorks and Zotero, to deduplicated aggregation of dataset metadata into federated catalogues, to the analysis and visualization of networks of scholarly communication and collaboration like OpenAIRE's Research Graph (Manghi et al. 2019). The Digital Object Identifier (DOI) standard (Paskin 1999), which emerged in the 1990s, as well as newer PIDs like the Research Organization Registry (ROR)23 and Open Researcher and Contributor IDs (ORCID)24 have opened new avenues for automating links between metadata records, and for creating new digital research services. The growing use of the ROR identifier in dataset metadata is a case in point. Since implementing ROR tags in 2020, national aggregation platforms like the Federated Research Data Repository (FRDR) have the option to selectively harvest Canadian data from non-Canadian repositories when at least one of the authors is affiliated with a Canadian research organization (Digital Research Alliance 2023). Similarly, ORCIDs make it easier to track the scholarly output of individual researchers.
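The selective harvesting described above can be sketched as a simple filter over harvested records. This is a minimal illustration and not FRDR's implementation: the record layout (`creators`, `affiliations`, `ror`) is a simplified, hypothetical structure, and the ROR identifiers shown are placeholders rather than real registry entries.

```python
def has_affiliation_in(record: dict, ror_ids: set[str]) -> bool:
    """Return True if any creator affiliation carries one of the given ROR IDs."""
    for creator in record.get("creators", []):
        for affiliation in creator.get("affiliations", []):
            if affiliation.get("ror") in ror_ids:
                return True
    return False


def select_records(records: list[dict], ror_ids: set[str]) -> list[dict]:
    """Keep only records with at least one creator affiliated with a listed organization."""
    return [record for record in records if has_affiliation_in(record, ror_ids)]


# Placeholder identifiers standing in for the organizations of interest
NATIONAL_ROR_IDS = {"https://ror.org/EXAMPLE-A", "https://ror.org/EXAMPLE-B"}

# Hypothetical harvested records, for illustration only
harvested = [
    {"title": "Dataset 1",
     "creators": [{"name": "Author One",
                   "affiliations": [{"ror": "https://ror.org/EXAMPLE-A"}]}]},
    {"title": "Dataset 2",
     "creators": [{"name": "Author Two",
                   "affiliations": [{"ror": "https://ror.org/EXAMPLE-Z"}]}]},
    {"title": "Dataset 3",
     "creators": [{"name": "Author Three", "affiliations": []}]},
]

selected_titles = [record["title"] for record in select_records(harvested, NATIONAL_ROR_IDS)]
```

Because the match is made on the identifier rather than on a free-text affiliation string, spelling variants and translations of an organization's name do not affect the result, provided the source repository records the ROR ID.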