ISB response to NIH RFI: NIH Strategic Plan for Data Science

On behalf of the International Society for Biocuration (ISB), we provide the following response to the Request for Information: NIH STRATEGIC PLAN FOR DATA SCIENCE, which describes NIH’s overarching goals, strategic objectives, and implementation tactics for modernizing the NIH-funded biomedical data-resource ecosystem.

We are a community highly involved in the development and maintenance of biological and biomedical databases and in the task of biocuration: the translation and integration of information relevant to biology into a database, enabling the integration of the scientific literature as well as large data sets (distilling data into knowledge). The International Society for Biocuration (ISB) community includes, among others, biocurators, software developers, bioinformaticians, and standards developers. We are thus familiar with the pitfalls of current funding mechanisms for databases and recognize the importance of developing a different model, which is what the strategic plan for data science intends to address. In this response, we focus exclusively on selected aspects of Goal 2: Promote Modernization of the Data-Resources Ecosystem, and Goal 4: Enhance Workforce Development for Biomedical Data Science.

Information requested:

* The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them:
Goal 2: Promote Modernization of the Data-Resources Ecosystem
Whilst the ISB is generally supportive of the statements made in this RFI, we feel that some of the terminology needs to be improved. The RFI uses the terms databases and repositories interchangeably. It should be noted that database is an overarching term: both repositories and knowledgebases are types of databases. We see the key separation as being between primary data repositories, such as members of the INSDC (http://www.insdc.org/), which have set submission criteria and apply minimal subsequent expert curation of the data (biocuration), and knowledgebases [1]. We suggest that the terms database, repository and knowledgebase be clearly defined. Here are our proposed definitions and changes to the text:

A database is a computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data [2].

Data Repositories and Knowledgebases: What’s the Difference?
Data repositories and knowledgebases are both types of databases which store, organize, validate, and make accessible the core data related to a particular system or set of technologies. In the case of a data repository, the data is deposited by researchers following a set of guidelines and, other than ensuring the guidelines are adhered to, receives minimal subsequent input or modification.

Knowledgebases accumulate, organize, and link growing bodies of information related to the deposited data. A knowledgebase may contain information about gene models, transcript/protein expression patterns, splicing variants, localization, and protein-protein interaction and pathway networks related to an organism or set of organisms. Knowledgebases typically require significant semi-automated as well as manual biocuration by domain experts (e.g., literature-based gene ontology and phenotype annotations) beyond the quality assurance/quality control and annotation needed for data repositories.

We propose that the following definition of biocuration be added to the glossary.

Biocuration is the extraction of knowledge from unstructured biological data (typically but not limited to publications) into a structured, computable form. Biocurators are typically Ph.D.-level biologists, often with lab bench experience, coupled with specialized expertise in computational knowledge representation. Their work entails the synthesis and integration of information from multiple sources, including, for example, peer-reviewed papers, large-scale projects, or conference abstracts. They contact authors directly for clarification, digest supplemental information, and resolve identifiers, in order to accurately capture a researcher’s conclusion and their evidence for that conclusion. Biocurators strive to distill the current ‘best view’ from conflicting sources and ensure that their resources provide data that is not only Findable, Accessible, Interoperable, and Reusable (FAIR), but also Traceable, appropriately Licensed, and inter-Connected (collectively, the FAIR-TLC principles) [3].

Goal 4: Enhance Workforce Development for Biomedical Data Science
Again, the ISB is in favor of this proposed goal, as training different stakeholders in data science is key for the NIH to achieve the stewardship goals outlined in the NIH-wide strategic plan. However, the enhancement of the workforce is only discussed in terms of data scientists, and we believe biocurators are relevant stakeholders as well.
Section 4.1 states: “In addition, NIH will recruit a cohort of data scientists and others with expertise in areas such as project management, systems engineering, and computer science from the private sector and academia for short-term (1- to 3-year) national service sabbaticals. These “NIH Data Fellows” will be embedded within a range of high-profile, transformative NIH projects such as All of Us, the Cancer Moonshot℠ and the BRAIN initiative and will serve to provide innovation and expertise not readily available within the federal government.”
We think that biocurators would offer a unique perspective to these NIH projects, given their training in formulating and using standards, in data analysis and integration, and in working with a variety of research communities on the adoption of FAIR principles [3]. We suggest that biocurators be explicitly listed and considered as potential “NIH Data Fellows”.
One of the ISB's goals is to train the next generation of biocurators, and the Society has developed and collected training materials that could be used by NIH for training grant reviewers (https://www.biocuration.org/dissemination/biocuration-training-materials/).

* Opportunities for NIH to partner in achieving these goals:
NIH should establish a closer interaction with the International Society for Biocuration (ISB) to learn about biocuration and data science. ISB could collect/prepare training materials that could contribute to NIH training goals. ISB members could serve as NIH Data Fellows.
NIH should consult FAIRsharing (a catalogue of data preservation, management and sharing policies from international funding agencies, regulators and journals) and the BioDBcore guidelines [4-5], a community-defined, uniform, generic description of the core attributes of biological databases, to help ensure consistency and interoperability between resources.
NIH should encourage R01 and R21 proposal writers to budget correctly for data sharing, and provide guidance on how to do so. Depositing data into a repository is not trivial: it takes time to deposit data with adequate accompanying information. There need to be clear instructions to grant recipients to submit structured data to journals and/or databases. The biocuration community could help identify a few examples of how such structured data can be submitted. In addition, minimal common standards for databases are already described in the BioDBcore guidelines, mentioned in the previous point.
There should be more emphasis on how NIH intramural researchers could collaborate with external groups to link resources. The plan discusses linking all NIH data resources in detail. However, there is also a need to link NIH resources to external resources and vice versa.

* Additional concepts that should be included in the plan:
We propose that the definitions of database and biocuration be added to the glossary.

* Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections:
No comments at this point.

* Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan:

Sustained long-term funding for key resources. Whilst we appreciate that resources need to be constantly re-evaluated and shown to be keeping pace with the demands of new technologies and new use cases, constantly moving from one short-term grant to another, with no guarantee of renewed funding, is not beneficial to the growth of a resource or to the user community that relies on it.

References:
1. Gabella C, Durinx C, Appel R. Funding knowledgebases: Towards a sustainable funding model for the UniProt use case. F1000Res. 2017 Nov 27;6. pii: ELIXIR-2051. doi: 10.12688/f1000research.12989.1. eCollection 2017. PubMed PMID: 29333230; PubMed Central PMCID: PMC5747334.

2. Mount D. Bioinformatics: Sequence and Genome Analysis, Second Edition (2004). Chapter 2. Cold Spring Harbor Laboratory Press.

3. International Society for Biocuration. Biocuration: Distilling data into knowledge. PLOS Biology (2018) in press.

4. Gaudet P, Bairoch A, Field D, Sansone SA, Taylor C, Attwood TK, Bateman A, Blake JA, Bult CJ, Cherry JM, Chisholm RL, Cochrane G, Cook CE, Eppig JT, Galperin MY, Gentleman R, Goble CA, Gojobori T, Hancock JM, Howe DG, Imanishi T, Kelso J, Landsman D, Lewis SE, Karsch Mizrachi I, Orchard S, Ouellette BF, Ranganathan S, Richardson L, Rocca-Serra P, Schofield PN, Smedley D, Southan C, Tan TW, Tatusova T, Whetzel PL, White O, Yamasaki C; BioDBCore Working Group. Towards BioDBcore: a community-defined information specification for biological databases. Database (Oxford). (2011) baq027. doi:10.1093/database/baq027. Print 2011. PubMed PMID: 21205783; PubMed Central PMCID: PMC3017395.

5. Gaudet P, Bairoch A, Field D, Sansone SA, Taylor C, Attwood TK, Bateman A, Blake JA, Bult CJ, Cherry JM, Chisholm RL, Cochrane G, Cook CE, Eppig JT, Galperin MY, Gentleman R, Goble CA, Gojobori T, Hancock JM, Howe DG, Imanishi T, Kelso J, Landsman D, Lewis SE, Mizrachi IK, Orchard S, Ouellette BF, Ranganathan S, Richardson L, Rocca-Serra P, Schofield PN, Smedley D, Southan C, Tan TW, Tatusova T, Whetzel PL, White O, Yamasaki C; BioDBCore Working Group. Towards BioDBcore: a community-defined information specification for biological databases. (2011) Nucleic Acids Res. 39(Database issue):D7-10. doi:10.1093/nar/gkq1173. Epub 2010 Nov 18. PubMed PMID: 21097465; PubMed Central PMCID: PMC3013734.

Additional information requested:
Name: Cecilia Arighi, Nicole Vasilevsky and Sandra Orchard
Work Email: intsocbio@gmail.com
Name of Organization: International Society for Biocuration (ISB) (www.biocuration.org)

 

For members of advocacy groups or professional societies (optional): Please indicate your role and indicate whether you are responding on behalf of your organization.
Cecilia Arighi is the Chair of the Society, Nicole Vasilevsky is the Secretary, and Sandra Orchard is the Treasurer. This response to the RFI is submitted on behalf of the ISB.

Sent April 01, 2018
