The NIDDK Central Repository has announced NIDDK's Data-Centric Challenge: Enhancing NIDDK datasets for future Artificial Intelligence (AI) applications! Apply now to be part of this groundbreaking initiative focused on Type 1 Diabetes datasets and help enhance datasets for artificial intelligence, shaping the future of healthcare. Don't miss the opportunity to make a significant impact: apply today and be a part of this transformative journey!
Here is the information from the NIDDK Central Repository:
NIDDK's Data-Centric Challenge: Enhancing NIDDK datasets for future Artificial Intelligence (AI) applications.
The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository (https://repository.niddk.nih.gov/home/) is conducting a Data-Centric Challenge aimed at augmenting existing Repository data for future secondary research, including data-driven discovery by artificial intelligence (AI) researchers. The NIDDK Central Repository (NIDDK-CR) program strives to increase the utilization and impact of the resources under its guardianship. However, a lack of standardization and consistent metadata within and across studies limits the ability of secondary researchers to easily combine datasets from related studies and generate new insights using data science methods.

In the fall of 2021, the NIDDK-CR began implementing approaches to improve AI-readiness by making research data FAIR (findable, accessible, interoperable, and reusable), starting with a small pilot project that used Natural Language Processing (NLP) to tag study variables. In 2022, the NIDDK-CR revised internal processes and implemented industry-leading data standards to improve technical data and metadata quality in alignment with FAIR and TRUST (transparency, responsibility, user focus, sustainability, and technology) principles.

Capitalizing on these accomplishments, and to further promote the visibility of its resources and increase their potential for reuse in innovative research, the NIDDK-CR has established a data challenge platform to conduct a series of data-centric challenges that build on one another to develop tools, approaches, models, and/or methods that increase data interoperability and usability for artificial intelligence and machine learning applications. These enhanced NIDDK-CR practices and AI-ready datasets would then be used in subsequent data challenges focusing on hypothesis generation and new analysis methodology in NIDDK mission areas.
NIDDK is seeking innovative approaches to enhance the utility of NIDDK datasets for AI applications. To that end, the goals of the NIDDK-CR Data-Centric Challenge are to 1) generate an “AI-ready” dataset that can be used for future data challenges, and 2) produce methods that can be used to enhance the AI-readiness of NIDDK data. Participants will enhance data from the following longitudinal studies focused on Type 1 Diabetes (T1D), which have been de-identified and made available through the NIDDK-CR:
Participation in this challenge will be tiered based on applicants’ self-described experience with data science and analytics (i.e., beginner, intermediate, or advanced). Participants will be instructed to 1) prepare a single dataset by aggregating all data files associated with one or more of the longitudinal T1D studies listed above, and 2) augment that dataset to ensure AI-readiness. One winner from each group will be selected:
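The aggregation step might look like the following sketch in pandas. The column names and values here are hypothetical stand-ins, not the actual study variables; in the Challenge workbench the tables would be loaded from the provided .csv files with pd.read_csv() instead:

```python
import pandas as pd

# Hypothetical stand-ins for data files from two studies; in the
# workbench these would come from pd.read_csv() on the provided .csv files.
study_one = pd.DataFrame({
    "participant_id": ["T001", "T002"],
    "visit_age_months": [12, 24],
    "hba1c": [5.4, 5.9],
})
study_two = pd.DataFrame({
    "participant_id": ["N101", "N102"],
    "visit_age_months": [18, 36],
    "glucose_mgdl": [95, 110],
})

# Tag each record with its source study so provenance survives the merge,
# then stack the files into one table; concat keeps every column and
# fills non-overlapping variables with NaN rather than dropping them.
study_one["source_study"] = "StudyOne"
study_two["source_study"] = "StudyTwo"
combined = pd.concat([study_one, study_two], ignore_index=True, sort=False)

print(combined.shape)  # (4, 5)
```

Keeping a source-study column makes it straightforward to trace any harmonized value back to its originating file.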
The Data-Centric Challenge will be split into two phases: a registration phase and a competition phase. During Phase 1, interested applicants register for the Challenge by completing and submitting the NIDDK-CR Data and Resources Agreement (DUA), the NIDDK-CR Data Challenge Addendum, and the Data Challenge Registration Form to Challenge.gov (see Data Challenge Registration Instructions). Approved participants then proceed to Phase 2, where they receive access to the NIDDK study datasets as a series of .csv files within a secure NIDDK-provided workbench environment hosted on Amazon Web Services (AWS; Research Service Workbench). The workbench provides analytic tools for data aggregation, augmentation, and enhancement tasks via SageMaker Jupyter notebooks (Python or R), in which participants write the code that makes the NIDDK data AI-ready.
To learn more about the studies and datasets in the NIDDK-CR that will be utilized for the Data-Centric Challenge, please see the links below, which contain information about the study and corresponding documentation:
[Note: Certain variables have been redacted from the original data files available in the NIDDK-CR. Please refer to the Data Challenge-specific data dictionary for TEDDY and the TrialNet studies available on their respective study overview pages using the links provided.]
Final submissions for NIDDK judging will include:
Detailed instructions on how to submit solutions will be provided to applicants who have been approved to proceed to Phase 2 of the Challenge.
Guidance regarding the “AI-ready” dataset submission: “AI-readiness” here refers to data that are machine-readable, reliable, accurate, explainable, predictive, and accessible for future AI applications. AI-readiness includes pre-processing steps such as addressing errant values, handling missing values, relabeling data elements (also known as columns, variables, features, or attributes), and recoding element values during harmonization to ensure consistent and standardized formatting of dates, laboratory values, measurements, etc.

Participants should attempt to retain as much information as possible by creating new data elements that are transforms of existing elements, without deleting or overwriting the existing elements. For example, suppose that when harmonizing datasets from two separate studies, one study collected participant weight and height while the other classified participants only by body mass index (BMI) category. Best practice would be to create two new data elements in the first dataset: one that calculates exact BMI from weight and height, and a second that classifies the exact BMI using the same cut-off values used in the second study. The two studies can then be harmonized on BMI category while the first study's weight and height information, which may be useful later, is retained.
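The BMI example above can be sketched in pandas as follows. The data, column names, and the WHO-style cut-offs are assumptions for illustration; an actual submission should reuse whatever cut-off values the second study applied:

```python
import pandas as pd

# Hypothetical Study A records with raw weight/height; Study B (not
# shown) only recorded a BMI category, so matching elements are derived
# here without touching the original columns.
study_a = pd.DataFrame({
    "participant_id": ["A01", "A02", "A03"],
    "weight_kg": [58.0, 82.5, 95.0],
    "height_cm": [160.0, 175.0, 168.0],
})

# New derived element 1: exact BMI. weight_kg and height_cm stay intact.
study_a["bmi"] = study_a["weight_kg"] / (study_a["height_cm"] / 100) ** 2

# New derived element 2: BMI category. These are standard WHO adult
# bands, used here only as a placeholder for Study B's actual cut-offs.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]
study_a["bmi_category"] = pd.cut(study_a["bmi"], bins=bins,
                                 labels=labels, right=False)

print(study_a[["participant_id", "bmi", "bmi_category"]])
```

Both derived elements are added alongside the originals, so the dataset can be harmonized with Study B on bmi_category while exact BMI, weight, and height remain available for later analyses.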
Guidance regarding the human-readable data dictionary/codebook documentation submission: Human reviewers will assess the descriptiveness and informativeness of the Challenge Solution Submission Form, code script, and data dictionary/codebook, which should include documentation of the data enhancements and methods used to prepare the AI-ready dataset; documentation of all data elements, including name, description, and data type (e.g., numeric or free text); documentation of how missingness was handled (see also the data quality dimension of “completeness”); potential use cases for the prepared data; and any other important information about the methods or procedures used. Special attention should be given to created/derived variables so that future users can understand 1) the original data from which the new element(s) were generated and the methodology used to create them, and 2) the intended purpose of the newly generated elements (e.g., a right-censor indicator useful for time-to-event analysis, derived from missing values in T1D diagnosis, which indicate that a participant was not diagnosed with T1D during the study period).
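The right-censor example can be sketched as follows. The records and column names are hypothetical; the sketch assumes, as the guidance describes, that a missing diagnosis value means the participant was never diagnosed during the study:

```python
import pandas as pd
import numpy as np

# Hypothetical follow-up records: a missing t1d_diagnosis_age_months
# means no T1D diagnosis occurred during the study (right-censored).
df = pd.DataFrame({
    "participant_id": ["P1", "P2", "P3"],
    "t1d_diagnosis_age_months": [54.0, np.nan, np.nan],
    "last_followup_age_months": [54.0, 120.0, 96.0],
})

# Derived element 1: event indicator (1 = diagnosed, 0 = right-censored).
df["t1d_event"] = df["t1d_diagnosis_age_months"].notna().astype(int)

# Derived element 2: time-to-event, usable directly in survival models;
# censored participants contribute their last observed follow-up time.
df["t1d_time_months"] = df["t1d_diagnosis_age_months"].fillna(
    df["last_followup_age_months"])

print(df[["participant_id", "t1d_event", "t1d_time_months"]])
```

The codebook entry for each derived element would then record the source columns, the missing-means-undiagnosed assumption, and the intended time-to-event use case.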
Source and more information: https://www.challenge.gov/?challenge=niddk-central-repository-data-centric-challenge