
[Deadline Extended!] NIDDK's Data-Centric Challenge: Advancing AI-Ready Datasets for Diabetes Research

The NIDDK Central Repository has announced the NIDDK Data-Centric Challenge: enhancing NIDDK datasets for future Artificial Intelligence (AI) applications! Apply now to be part of this groundbreaking initiative focusing on Type 1 Diabetes datasets and help enhance datasets for artificial intelligence, shaping the future of healthcare. Don't miss the opportunity to make a significant impact; apply today and be a part of this transformative journey!

Here is the information from the NIDDK Central Repository:




"NIDDK Central Repository Data-Centric Challenge: Enhancing NIDDK datasets for future Artificial Intelligence (AI) applications.

Background:

The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository (https://repository.niddk.nih.gov/home/) is conducting a Data Centric Challenge aimed at augmenting existing Repository data for future secondary research, including data-driven discovery by artificial intelligence (AI) researchers. The NIDDK Central Repository (NIDDK-CR) program strives to increase the utilization and impact of the resources under its guardianship. However, lack of standardization and consistent metadata within and across studies limit the ability of secondary researchers to easily combine datasets from related studies to generate new insights using data science methods.

In the fall of 2021, the NIDDK-CR began implementing approaches to augment data quality to improve AI-readiness by making research data FAIR (findable, accessible, interoperable, and reusable) via a small pilot project utilizing Natural Language Processing (NLP) to tag study variables. In 2022, the NIDDK-CR revised internal processes and implemented industry-leading data standards to improve technical data and metadata quality to align with FAIR and TRUST (transparency, responsibility, user focus, sustainability, and technology) principles.

Capitalizing on these accomplishments, and to further promote visibility of resources and increase potential for reuse in innovative research, the NIDDK-CR has established a data challenge platform to conduct a series of data-centric challenges that will build on one another to develop tools, approaches, models, and/or methods to increase data interoperability and usability for artificial intelligence and machine learning applications. These enhanced NIDDK-CR practices and AI-ready datasets would then be used in subsequent data challenges focusing on hypothesis generation and new analysis methodology in NIDDK mission areas.

Challenge Overview:

NIDDK is seeking innovative approaches to enhance the utility of NIDDK datasets for AI applications. Towards this, the goals of the NIDDK-CR Data Centric Challenge will be to 1) generate an “AI-ready” dataset that can be used for future data challenges, and 2) produce methods that can be used to enhance the AI-readiness of NIDDK data. Participants will enhance data from the following longitudinal studies focused on Type 1 Diabetes (T1D), which have been de-identified and made available through the NIDDK-CR:

  • The Environmental Determinants of Diabetes in the Young (TEDDY) study
  • Four studies from the Type 1 Diabetes TrialNet (TrialNet) network (see the intermediate-level participation description below for further details).


Participation in this challenge will be tiered based on the challenge applicants’ self-described experience with data science and analytics (i.e., beginner, intermediate, or advanced). Participants will be instructed to 1) prepare a single dataset by aggregating all data files associated with one or more longitudinal studies on T1D listed above, and 2) augment the single dataset to ensure AI-readiness. One winner from each group will be selected:

  • Beginner – For the beginner-level challenge, the goal for challenge participants will be to aggregate the 50+ datasets from the TEDDY study into a single unified and machine-readable dataset harmonized by participant ID (MaskID). TEDDY follows children with and without a family history of T1D. Newborns less than four months old with either a high-risk human leukocyte antigen (HLA) haplotype or a first-degree relative with T1D are followed for 15 years or until the first appearance of islet cell autoantibodies or development of T1D. The primary objectives of the study include identifying environmental factors that trigger or protect against the development of T1D. Consistent with the primary study outcome measures of first appearance of autoantibodies or diagnosis with T1D, an appropriate machine learning method would likely be a time-to-event type. Since NIDDK cannot know what other study designs may arise in the future, or what discoveries could be pursued when combining the TEDDY dataset with other datasets, AI-readiness will require aggregation of all 50+ dataset files into a single tabular (i.e., spreadsheet or rectangular) .csv file type; data enhancement steps that do not meaningfully alter the original data; and preparation of dataset documentation that is both human- and machine-readable.
  • Intermediate – For the intermediate-level challenge, the goal for challenge participants will be to harmonize the four studies listed above within the TrialNet set of studies. TN01 is a screening and monitoring prospective cohort study established to provide a source of participants for enrollment into prevention trials. Participants in TN01 range in age from 1 to 45, have an immediate or extended family member with T1D, and have been screened for pancreatic autoantibodies. Participants who do not develop T1D while in TN01 are prospectively observed to 1) assess predictive value of existing and novel risk markers for T1D, and 2) examine demographic, immunologic, and metabolic characteristics of individuals at risk for developing T1D. Due to the design of TN01, participants can be followed longitudinally from a pre-disease state into an early disease state within subsequent TrialNet studies, and TN01 can serve as a source of control participants for retrospective case-control studies. The subsequent TrialNet studies included in this challenge are TN16 (Long-Term Investigative Follow-Up, a prospective observation study that aims to observe the long-term effects of treatments used in other TrialNet studies), TN19 (ATG-GSCF in New Onset T1D, a clinical trial following two potential prophylaxes), and TN20 (Immune Effects of Oral Insulin in Relatives at Risk for Type 1 Diabetes Mellitus, a clinical trial to learn more about the immune effects of oral insulin). Since NIDDK cannot predict the varied ways AI researchers could construct epidemiologic studies using this or any other harmonized TrialNet dataset, these four studies will help understand the feasibility of data harmonization within TrialNet studies. For data challenge participants, it will be important to harmonize study participants by MaskID and to retain those study participants who appear in TN16, TN19, or TN20 but did not also appear in TN01. 
    AI-readiness for this task will require data aggregation for all dataset files within each study into a single tabular (i.e., spreadsheet or rectangular) .csv file type; data enhancement steps that do not meaningfully alter the original datasets; and harmonization of study participants across TrialNet studies, which may additionally require de-duplication of records. Challenge participants will also need to prepare dataset documentation that is both human- and machine-readable.
  • Advanced – For the advanced-level challenge, the goal for challenge participants will be to 1) aggregate the 50+ files in TEDDY, 2) harmonize the four studies in TrialNet, and 3) fuse the aggregated TEDDY dataset with the harmonized TrialNet dataset. In addition to the steps outlined above, data fusion will require detailed documentation of the newly created study population, harmonization of data elements when possible, and documentation of data elements that cannot be harmonized. The resulting dataset could be used for many epidemiologic study designs, such as retrospective longitudinal case-control studies, and exploratory machine learning studies to group study participants by disease progression or protective factors. AI-readiness for this task will require all previously described steps and consideration of the issues and challenges already mentioned, as well as careful consideration of those elements that can and cannot be fused between the TEDDY and TrialNet studies.
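The aggregation and harmonization steps described above can be sketched with pandas. The tiny in-memory tables below, and every column name besides MaskID, are hypothetical stand-ins for the real TEDDY and TrialNet files; the point is the outer-merge pattern that keeps all participants, including those who appear in TN16/TN19/TN20 but never enrolled in TN01.

```python
# Minimal sketch of the beginner and intermediate tasks, using synthetic
# stand-in tables (all columns besides MaskID are hypothetical).
import pandas as pd
from functools import reduce

# Stand-ins for several per-domain TEDDY data files, all keyed by MaskID.
teddy_files = [
    pd.DataFrame({"MaskID": [1, 2, 3], "hla_risk": ["high", "low", "high"]}),
    pd.DataFrame({"MaskID": [1, 3], "autoantibody_pos": [0, 1]}),
]

# Beginner task: outer-merge every file on MaskID into one rectangular table.
teddy = reduce(
    lambda left, right: left.merge(right, on="MaskID", how="outer"),
    teddy_files,
)

# Intermediate task: harmonize TrialNet studies by MaskID. The outer join
# retains participants who appear in a follow-on study but not in TN01.
tn01 = pd.DataFrame({"MaskID": [10, 11], "screen_age": [12, 30]})
tn16 = pd.DataFrame({"MaskID": [11, 12], "followup_years": [5, 3]})
trialnet = tn01.merge(tn16, on="MaskID", how="outer")  # MaskID 12 is kept

# Each result is submitted as a single tabular .csv file.
teddy.to_csv("teddy_raw.csv", index=False)
trialnet.to_csv("trialnet_raw.csv", index=False)
```

With real data, the merge key and join strategy would need care: repeated measures per participant may require suffixing or reshaping before a single rectangular file is possible.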


The Data Centric Challenge will be split into two phases: a registration phase and a competition phase. During Phase 1, interested applicants will register for the Challenge by completing and submitting the NIDDK-CR Data Use Agreement (DUA), NIDDK-CR Data Challenge Addendum, and the Data Challenge Registration Form to Challenge.gov (see Data Challenge Registration Instructions). Approved participants will then proceed to Phase 2, where they will receive access to the NIDDK study datasets as a series of .csv files within a secure NIDDK-provided workbench environment hosted by Amazon Web Services (AWS; Research Service Workbench). This workbench provides analytic tools for participants to perform data aggregation, augmentation, and enhancement tasks using SageMaker Jupyter Notebooks (Python or R) to generate the code that makes the NIDDK data AI-ready.

To learn more about the studies and datasets in the NIDDK-CR that will be utilized for the Data-Centric Challenge, please see the links below, which contain information about the studies and their corresponding documentation:


[Note: Certain variables have been redacted from the original data files available in the NIDDK-CR. Please refer to the Data Challenge-specific data dictionary for TEDDY and the TrialNet studies available on their respective study overview pages using the links provided.]

Submission Requirements:

Final submissions for NIDDK judging will include:

  1. A single “Raw” dataset resulting from data aggregation, harmonization, and/or fusion of all study data files (from TEDDY, TrialNet, or both – as per level of participation) that has not otherwise been altered. This dataset must be represented as a single rectangular file (i.e., tabular, spreadsheet, or matrix) in .csv file format.
  2. An “AI-ready” version of the aggregated, harmonized, and/or fused dataset that has been enhanced for AI-readiness (see below). This dataset must be represented as a single rectangular file (i.e., tabular, spreadsheet, or matrix) in .csv file format.
  3. The code script used to generate the Raw and AI-ready files (Python or R Programming Language) submitted to your private GitHub repository.
  4. A human-readable data dictionary/codebook documenting the AI-ready dataset (Excel format preferred), with the following information included at a minimum: variable name, variable label/description, variable type, measurement unit as applicable (e.g., pounds, kilograms), and corresponding code lists as needed (e.g., 0 = No, 1 = Yes).
  5. Challenge Solution Submission Form, describing the 1) AI-ready dataset, 2) methods for preparing the AI-ready dataset, and 3) potential use cases for the prepared dataset as it relates to T1D, or other disease areas of interest to NIDDK.
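The minimum data dictionary fields in item 4 can be sketched as a small table; the two variables shown are hypothetical examples, and the field names simply follow the list above.

```python
# Sketch of a data dictionary/codebook table with the minimum fields
# listed in the submission requirements (the two rows are hypothetical).
import pandas as pd

codebook = pd.DataFrame(
    [
        {
            "variable_name": "MaskID",
            "variable_label": "De-identified participant ID",
            "variable_type": "integer",
            "measurement_unit": "",
            "code_list": "",
        },
        {
            "variable_name": "t1d_diagnosis",
            "variable_label": "Diagnosed with T1D during the study",
            "variable_type": "integer",
            "measurement_unit": "",
            "code_list": "0 = No, 1 = Yes",
        },
    ]
)

# Excel is the preferred delivery format; with openpyxl installed this
# would be codebook.to_excel("codebook.xlsx", index=False).
codebook.to_csv("codebook.csv", index=False)
```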


Detailed instructions on how to submit solutions will be provided to applicants that have been approved to proceed to Phase 2 of the Challenge. 

Guidance regarding the “AI-ready” dataset submission: “AI-readiness” here refers to data that are machine-readable, reliable, accurate, explainable, predictive and accessible for future AI applications. AI-readiness will include pre-processing steps such as addressing errant values, handling of missing values, relabeling of data elements (aka columns, variables, features, or attributes) and recoding of element values during harmonization to ensure consistent and standardized formatting of dates, laboratory values, measurements, etc. Participants should attempt to retain as much information as possible within the data by creating new data elements that are transforms of existing elements without deleting or overwriting existing elements. For example, when harmonizing datasets from two separate studies, it is observed that one study collected participant weight and height while the other study only classified participants by body mass index (BMI) category. Best practice would be to create two new data elements in the first dataset: one that calculates exact BMI from weight and height, and a second new element that classifies exact BMI using the same cut-off values that were used in the second study. The two studies can then be harmonized on BMI category while retaining weight and height information in the first study that may be useful later.
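The BMI harmonization example can be sketched in pandas as follows. The cut-off values used here are the common WHO adult BMI categories and stand in for whatever cut-offs the second study actually used; column names are hypothetical.

```python
# Sketch of the BMI harmonization example: derive exact BMI and a BMI
# category as NEW columns, without deleting or overwriting weight/height.
import pandas as pd

study1 = pd.DataFrame({
    "MaskID": [1, 2],
    "weight_kg": [70.0, 95.0],
    "height_m": [1.75, 1.80],
})

# New derived elements; the original columns are retained, not replaced.
study1["bmi_exact"] = study1["weight_kg"] / study1["height_m"] ** 2
study1["bmi_category"] = pd.cut(
    study1["bmi_exact"],
    bins=[0, 18.5, 25, 30, float("inf")],  # assumed cut-offs (WHO adult)
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)
```

The two studies can then be joined on the shared category column, while `weight_kg`, `height_m`, and `bmi_exact` remain available for later analyses.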

Guidance regarding human-readable data dictionary/codebook documentation submission: Human reviewers will assess descriptiveness and informativeness of the Challenge Solution Submission Form, code script, and data dictionary/codebook, which should include documentation of data enhancements and methods for preparing the AI-ready dataset; documentation of all data elements including name, description, data type such as numeric or free text; documentation of handling missingness (see also the data quality dimension of "completeness"); potential use cases for the prepared data, and any other important information to convey about the methods or procedures used. Special attention should be given to created/derived variables so future users can understand 1) the original data from which the new element(s) were generated and methodology used to create, and 2) the intended purpose of newly generated elements (e.g., right-censor indicator useful for time-to-event analysis derived from missing values in T1D diagnosis, which indicate that a participant was not diagnosed with T1D during the study period). 
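The right-censor indicator mentioned in parentheses can be sketched like this; the column names are hypothetical, and a missing diagnosis value is interpreted as "not diagnosed during the study period," exactly as described above.

```python
# Sketch of a derived right-censor indicator for time-to-event analysis.
# Missing t1d_diagnosis_day means no diagnosis occurred during follow-up,
# so the observation is censored at the last follow-up day.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MaskID": [1, 2, 3],
    "t1d_diagnosis_day": [350.0, np.nan, 120.0],  # hypothetical columns
    "last_followup_day": [400.0, 500.0, 120.0],
})

# New derived elements; the original diagnosis column is left untouched.
df["censored"] = df["t1d_diagnosis_day"].isna().astype(int)
df["event_time"] = df["t1d_diagnosis_day"].fillna(df["last_followup_day"])
```

The codebook entry for `censored` would then document both the source column it was derived from and its intended use in time-to-event models.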

Key Dates:


  • (Phase 1) Registration:
    • Registration Submission Start: 9/20/2023 9:00 AM EDT
    • Registration Submission Ends: 11/30/2023 5:00 PM EDT
  • (Phase 2) Data Enhancement:  
    • Challenge Start: 12/1/2023 9:00 AM EDT
    • Challenge End: 01/19/2024 5:00 PM EDT
  • Judging Start/End: 01/19/2024 through 02/05/2024
  • Winner Announced: 02/2024"




Source and more information: https://www.challenge.gov/?challenge=niddk-central-repository-data-centric-challenge


