Stanford Research Computing Spotlight: Q&A with Jeanne Shen, MD
Jeanne Shen is an associate professor in the Department of Pathology at Stanford’s School of Medicine.
What is the research problem you are trying to solve?
Pathologists play a crucial role in cancer diagnosis and treatment by examining patient samples under a microscope. Artificial intelligence can assist in this microscopic review process by quickly and consistently identifying different tissue types present within a sample, such as tumors and various normal tissue types like muscle, fat, and blood.
However, AI models require large, well-balanced, high-quality training datasets to be reliable. Pathology training datasets not only demand enormous manual labor to collect and label; they also often fail to cover the full range of how tissues can look, over-represent some tissue types while under-representing others, and lack appropriate quality control.
As a result, low-quality or mislabeled images end up in training datasets, which limits how well pathology AI models trained on them work in real-world settings.
How are you addressing these challenges?
Our lab developed a scalable, flexible, semi-automated framework, called DeepCluster++, for efficiently constructing large-scale, diverse, and balanced datasets for pathology AI model training and validation.
DeepCluster++ cuts down on the labor and inherent bias of manual dataset creation. As a pilot use case, we used DeepCluster++ to build a large-scale colorectal cancer dataset containing 630,000 microscopic images representing nine clinically important tissue types.
We trained many different kinds of AI models on this dataset and compared their performance with that of the same models trained on publicly available datasets. We observed that models trained on our DeepCluster++-generated dataset were much better at handling new, real-world examples they hadn't encountered during training.
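The details of the DeepCluster++ pipeline are not described in this interview, but the general idea behind cluster-based dataset balancing can be sketched briefly: cluster image feature embeddings, then sample roughly the same number of images from each cluster so that no visual pattern dominates the training set. The sketch below is illustrative only (the function names, the simple k-means step, and all parameters are assumptions, not the lab's actual implementation):

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster assignment for each feature vector."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance from every point to every center, then nearest-center assignment.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign

def balanced_sample(features, k, per_cluster, seed=0):
    """Pick up to `per_cluster` examples from each cluster to balance a dataset."""
    assign = kmeans(features, k, seed=seed)
    rng = np.random.default_rng(seed)
    picked = []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        take = min(per_cluster, len(idx))
        picked.extend(rng.choice(idx, size=take, replace=False).tolist())
    return sorted(picked)

# Toy demo: a deliberately imbalanced pool of 2-D "embeddings"
# (90 points near one mode, 10 near another).
rng = np.random.default_rng(1)
pool = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(90, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(10, 2)),
])
selected = balanced_sample(pool, k=2, per_cluster=10)
```

In this toy setup the rare mode contributes as many selected examples as the common one, which is the balancing effect described above; a real pathology pipeline would of course use learned embeddings and add quality-control filtering on top.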

How has Stanford Research Computing helped achieve your goals?
For this work, we used the Carina and Marlowe clusters at Stanford Research Computing. The availability of multiple H100 GPUs with large memory capacity on Marlowe allowed us to load and train large models on our publicly available datasets without running into memory issues. Marlowe's large storage capacity was also very helpful for saving large datasets and intermediate outputs.
The Marlowe community Slack channel (#marlowe-researchers) and the Marlowe support team are helpful resources for troubleshooting common issues.
What support and offerings would you like in the future?
One thing we noticed while using Marlowe is that contention for shared resources can arise when multiple users are active on the same server. Also, given the large datasets we work with, dedicated data-transfer tools would be helpful.
It would be great if Marlowe could also be used for processing protected health information (PHI) data in the future, similar to Carina and Nero. There would be a lot of interest from School of Medicine researchers and others at Stanford who work on healthcare AI applications if this were to become the case.
Learn more about the systems managed and supported by Stanford Research Computing.
Marlowe is supported by Stanford Data Science, the Vice Provost and Dean of Research, and Stanford Research Computing.
