Scikit-learn awarded third CZI grant for model tools
Scikit-learn has received its third grant from the Chan Zuckerberg Initiative’s Essential Open Source Software for Science (EOSS) program, funded by the Wellcome Trust. This new funding focuses on enhancing scikit-learn’s capabilities for evaluating and inspecting predictive models, a critical step in building effective machine learning pipelines for biomedical research. The grant aims to improve existing tools and develop new interactive displays that support rigorous statistical analysis and model explainability, directly addressing longstanding challenges faced by researchers.
What improvements target predictive model evaluation
The grant emphasizes advancing the evaluation and inspection of predictive models within scikit-learn. Currently, scikit-learn offers foundational building blocks for model evaluation, but their raw outputs often require expert interpretation before they can be turned into meaningful reports. Recent community efforts introduced visual displays for easier communication, but these are still at an early stage and make little use of scikit-learn’s wider statistical toolkit, such as cross-validation. The project will expand these displays to integrate best statistical practices, enabling researchers to produce clearer, more reliable analyses. It also plans to add new visualization tools for common tasks not yet covered, giving deeper insight into model performance.
How will model inspection be enhanced during training
Another key focus is model inspection during pipeline training, an area where scikit-learn has historically offered little support. The grant proposes implementing a callback framework that allows users to monitor internal model parameters in real time. This feature is crucial for understanding model behavior as training progresses, helping researchers detect issues early and adjust accordingly. Additionally, scikit-learn aims to improve the interactive experience for users working in environments like Jupyter Notebooks. Although initial steps have been taken to display pipelines visually, further enhancements will target better user interaction and accessibility, making model inspection more intuitive and engaging.
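To make the idea of monitoring a model during training concrete, here is a minimal sketch of what a user can already do by hand today. It is not the callback framework the grant proposes, whose API is not described here. It assumes a synthetic dataset, an `SGDClassifier` trained incrementally with `partial_fit`, and a hypothetical `history` list that records the coefficient norm and validation accuracy after each pass over the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)

# Hand-written monitoring loop: one partial_fit call per pass over the data,
# followed by a snapshot of an internal parameter and a validation score.
history = []
for epoch in range(5):
    clf.partial_fit(X_train, y_train, classes=classes)
    record = {
        "epoch": epoch,
        "coef_norm": float(np.linalg.norm(clf.coef_)),
        "val_accuracy": float(clf.score(X_val, y_val)),
    }
    history.append(record)
    print(record)
```

A loop like this only works for estimators that expose `partial_fit`, which is one reason a general callback mechanism would be useful.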
What steps improve model explainability for users
Model explainability is vital for scientific validity and trust in machine learning outcomes. Scikit-learn, being a widely adopted reference package, plans to improve the documentation and usability of its explainability tools. The grant includes a plan to write a scikit-learn enhancement proposal (SLEP) that defines a consistent API for model explainability. This standardization will help users apply the right tools for their specific contexts and ensure a uniform experience across different models and tasks. By doing so, scikit-learn fosters transparency and interpretability, which are essential for biomedical and other research areas relying on machine learning.
Who will carry out the project work and how
To execute these improvements, scikit-learn will engage Lucy Liu from Quansight Labs to work half-time on display enhancements and feature importance. The project also plans to hire two full-time interns for six-month periods over the next two years, focusing on other aspects of the grant. This approach not only accelerates development but also promotes diversity by providing opportunities to underrepresented groups in machine learning and data science. Previous initiatives funded by NumFOCUS Small Development Grants have successfully supported similar goals, reinforcing the project’s community-driven ethos.
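Feature importance, one of the work areas named above, can already be estimated with scikit-learn's existing `permutation_importance` function. The sketch below shows that current tooling on synthetic data. The dataset, the choice of a random forest, and the variable names are assumptions for illustration, and it does not represent the consistent explainability API that the planned SLEP would define.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False, the informative features are the first two columns.
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Report features from most to least important, with the spread over repeats.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

Computing the importance on a held-out set, rather than on the training data, avoids rewarding features the model has merely memorized.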

What previous CZI grants supported scikit-learn development
This new grant builds on two earlier CZI EOSS awards that significantly advanced scikit-learn’s functionality. The first grant contributed to creating HistGradientBoostingClassifier and HistGradientBoostingRegressor, estimators comparable to popular gradient boosting implementations like LightGBM and XGBoost. These models have boosted scikit-learn’s competitiveness in gradient boosting tasks. The second grant enhanced scikit-learn’s ability to handle missing values and categorical data across multiple estimators, addressing common real-world data challenges. Together, these grants have maintained and extended scikit-learn’s robustness and usability for the scientific community.

Why the grant matters for scientific machine learning
Machine learning pipelines require not only accuracy but also interpretability and rigorous evaluation to answer scientific questions effectively. By improving evaluation displays, interactive inspection, and explainability, this grant helps researchers build better models and trust them. The integration of advanced statistical tools and user-friendly interfaces will promote best practices and wider adoption of scikit-learn in biomedical research and beyond. As machine learning grows increasingly central to science, investments like this help ensure that essential open-source tools remain reliable, accessible, and cutting-edge.
