How the University of South Carolina advanced research into climate change by optimizing their workflow on the cloud

Using Google Cloud, the Research Computing team helps the Molecular Microbial Biology Lab cut their data processing time from three months to sixteen hours—and gives them tools and methods to catch up on a year’s backlog of data.

The University of South Carolina’s Research Computing (RC) team serves as a central resource for all research computing on campus. Paul Sagona, the team’s Interim Executive Director, describes their work as “research facilitation”: they show faculty and students how to use existing tools or develop innovative solutions, enabling them to focus on their research. Sometimes faculty ask for help; other times the team analyzes patterns of technology usage to identify inefficiencies and improve the process to save money, time, or resources.

"Bioinformatics has lots of new tools coming out, but it’s a huge gap for scientists to learn them. As we move toward cloud infrastructure there will be a smaller gap in training as scientists will have less need to code. It will become easier to move from on prem to cloud."
Behzad Torkian, Senior Applications Scientist, University of South Carolina

The challenge: how to redesign their workflow to speed data analysis

In March 2018, the team at the Molecular Microbial Biology Lab, led by Dr. Sean Norman, reached out to the RC team with a problem: their researchers were collecting much more metagenomics data than they could easily process and use, which had created a large backlog. Metagenomics studies genetic material from environmental sites to build a better picture of the world’s complex and diverse microbial ecology. USC’s team was gathering environmental samples at a coastal pond in the Bahamas to help understand how climate change affects ecosystems. After collecting each sample, they had to amplify and sequence the genetic material, then run the analysis. With tens of terabytes of data per sample and hundreds of terabytes to compute, the research was computationally intensive and expensive.

The solution: moving the workflow onto Google Cloud

To solve this challenge, the RC team turned to Google Cloud. For Behzad Torkian, Senior Applications Scientist, the biggest factors were the global reach and speed of its network and the control it gives researchers over their own work: “You need to move the data between the nodes and you want the freedom to use the tools the way you want, when you want,” he said. Another important factor was the opportunity to work with Google’s support team, whom they met during a campus IT visit. “Our collaborative process with Google was fantastic,” Sagona says. Bob Doran, Application Scientist at USC, adds, “They were super at communicating and excited about what we were doing.”

To start, the team came up with a plan to mimic their existing high performance computing (HPC) cluster on Google Cloud, with flexible storage options and dynamically installed software that would scale with the data. They attached a read-only persistent solid-state disk holding the reference database and metagenomic samples, and started with single tests and small runs through a compute stack on Google Compute Engine distributed across 32-core instances. The results were written to Google Cloud Storage buckets. Integrating Slurm in Google Cloud enabled them to schedule, scale, and resubmit the jobs automatically.

After troubleshooting and verifying results along the way, they moved the whole job to Google Cloud. “We ran the job on 124,352 cores concurrently. We ran on 3,886 nodes. And we did that in 16.5 hours,” Sagona reports. “The transition to Google Cloud was enormously successful and greatly enhanced this research. It demonstrated that we could process these massive data sets in just a fraction of the time.”

The difference was dramatic: a month’s worth of new samples that would have taken seven years to process on a personal computer, or three months on a local cluster, took about sixteen hours on Google Cloud. By the team’s estimate, a year’s worth of data backlog that would take fifty years on a personal computer or two years on a local cluster will take just three days on Google Cloud. “That’s a big deal,” Torkian states, in part because it allows researchers to demonstrate more progress within the short time frames of typical research grants.
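The scale of that run is easier to appreciate with the arithmetic written out. The short Python sketch below is illustrative only: it uses the node count, instance size, and wall-clock time quoted above, and it assumes (this is not stated by the team) that "three months" on the local cluster means roughly 90 days of continuous running. The function names are made up for this example.

```python
# Figures quoted in the article.
NODES = 3_886
CORES_PER_NODE = 32        # the run used 32-core instances
WALL_CLOCK_HOURS = 16.5

# Assumption (not from the article): "three months" on the local
# cluster is treated as 90 days of continuous running.
LOCAL_CLUSTER_HOURS = 90 * 24


def total_cores(nodes: int, cores_per_node: int) -> int:
    """Number of cores running concurrently."""
    return nodes * cores_per_node


def core_hours(cores: int, hours: float) -> float:
    """Aggregate compute consumed: cores multiplied by wall-clock hours."""
    return cores * hours


def speedup(baseline_hours: float, new_hours: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


if __name__ == "__main__":
    cores = total_cores(NODES, CORES_PER_NODE)
    print(f"Concurrent cores: {cores:,}")
    print(f"Core-hours used:  {core_hours(cores, WALL_CLOCK_HOURS):,.0f}")
    ratio = speedup(LOCAL_CLUSTER_HOURS, WALL_CLOCK_HOURS)
    print(f"Approx. speedup vs. local cluster: {ratio:.0f}x")
```

The node count multiplied by the instance size reproduces the 124,352 cores Sagona quotes. The speedup figure is only as good as the 90-day assumption, so it should be read as an order-of-magnitude estimate.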

"We ran the job on 124,352 cores concurrently. We ran on 3,886 nodes. And we did that in 16.5 hours. The transition to Google Cloud was enormously successful and greatly enhanced our research."
Paul Sagona, Interim Executive Director, Research Computing, University of South Carolina

The benefits beyond the cloud: reproducible results

Moving to the cloud proved to have other benefits as well. Saving time saves money, and running jobs in parallel was more cost-effective. The next run will use containers to take advantage of Google Cloud’s modular organization, so that the workflow will be easily transferable and reproducible for other researchers. Torkian adds that “bioinformatics has lots of new tools coming out, but it’s a huge gap for scientists to learn them. As we move toward cloud infrastructure there will be a smaller gap in training as scientists will have less need to code. It will become easier to move from on prem to cloud.” According to Doran, new methods could also set new algorithmic standards for validating studies, which could change how science is conducted. The USC team also has high hopes for STRIDES, Google’s new partnership with the National Institutes of Health to share public biomedical datasets: better collaboration and better access to data will help their researchers make even more progress. Sagona speaks for the team at USC when he says, “We’re really looking forward to the future.”