University of Pisa leans into the I/O challenge AI applications create

At a time when workloads that employ machine and deep learning algorithms are being built and deployed more frequently, organizations need to optimize I/O throughput in a way that enables those workloads to cost-effectively share the expensive GPU resources used to train AI models. Case in point: the University of Pisa, which has been steadily expanding the number of GPUs it makes accessible to AI researchers in a green datacenter optimized for high performance computing (HPC) applications.

The challenge the university has encountered as it deployed AI is that machine learning and deep learning algorithms tend to make more frequent I/O requests to a larger number of smaller files than traditional HPC applications, said University of Pisa CTO Maurizio Davini. To accommodate that, the university has deployed NVMesh software from Excelero that can access more than 140,000 small files per second on Nvidia DGX A100 GPU servers.
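To make the small-file access pattern concrete, the sketch below measures how many small files per second a storage path can serve to a single reader. It is a minimal illustration, not the university's or Excelero's benchmark: the function names, file count, and 4 KB file size are all assumptions chosen for the example, and it has no connection to the NVMesh software itself.

```python
import os
import tempfile
import time


def make_small_files(directory, count, size_bytes):
    """Create `count` files of `size_bytes` each and return their paths."""
    paths = []
    payload = b"x" * size_bytes
    for i in range(count):
        path = os.path.join(directory, f"sample_{i:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)
    return paths


def measure_files_per_second(paths):
    """Read every file once; return (files_read, bytes_read, files_per_second)."""
    start = time.perf_counter()
    total_bytes = 0
    for path in paths:
        with open(path, "rb") as handle:
            total_bytes += len(handle.read())
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very fast runs.
    rate = len(paths) / elapsed if elapsed > 0 else float("inf")
    return len(paths), total_bytes, rate


def run_benchmark(count=1000, size_bytes=4096):
    """Hypothetical helper: write small files to a temp directory, then time reading them."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = make_small_files(workdir, count, size_bytes)
        return measure_files_per_second(paths)


if __name__ == "__main__":
    files, total, rate = run_benchmark()
    print(f"read {files} files ({total} bytes) at {rate:,.0f} files/sec")
```

Because the files are read immediately after being written, the operating system's page cache will likely serve most of them, so the figure reported here overstates what cold storage would deliver. A realistic test would point the reader at a dataset on the storage system under evaluation and use many concurrent readers.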

While Davini said he generally views AI applications as just another type of HPC workload, the way AI workloads access compute and storage resources requires a specialized approach. The NVMesh software addresses that approach by offloading the increasingly frequent I/O requests, freeing up additional compute resources on the Nvidia servers for training AI models, said Davini.

“We wanted to provide our AI researchers with a better experience,” Davini said.

Above: University of Pisa CTO Maurizio Davini

Excelero is among a bevy of companies that are moving to address the I/O challenges that IT teams will encounter when trying to make massive amounts of data available to AI models. As the number of AI models that organizations build and maintain starts to grow, legacy storage systems can’t keep pace. The University of Pisa deployed Excelero to make sure the overall IT experience of its AI researchers remains satisfactory, said Davini.

Of course, more efficient approaches to managing I/O only begin to solve the data management issues organizations that build their own AI models will encounter. IT teams have tended to manage data as an extension of the application employed to create it. That approach is the primary reason there are so many data silos strewn across the enterprise.

Even more problematic is the fact that much of the data in those silos conflicts, either because different applications have rendered a company name differently or because some simply have not been updated with the most recent transaction data. Having a single source of truth about a customer or event at any specific moment in time remains elusive.
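The company-name problem can be illustrated with a short sketch that reduces differently rendered names to one comparison key. This is an assumption-laden example rather than a description of any tool named in this article: the suffix list, the function names, and the sample records are all hypothetical, and real entity resolution typically needs far more than string cleanup.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "ltd", "limited", "llc", "co", "company",
}


def normalize_company_name(name):
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    tokens = [token for token in cleaned.split() if token not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_by_normalized_name(records):
    """Group (source_system, raw_name) pairs that share a normalized name."""
    groups = defaultdict(list)
    for source, raw_name in records:
        groups[normalize_company_name(raw_name)].append((source, raw_name))
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("crm", "Acme, Inc."),
        ("billing", "ACME Corporation"),
        ("support", "Globex LLC"),
    ]
    for key, members in group_by_normalized_name(sample).items():
        print(key, "<-", members)
```

Grouping on a normalized key only surfaces candidate duplicates. Deciding which record holds the most recent transaction data still requires timestamps or another authoritative signal from the source systems.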

AI models, however, require massive amounts of accurate data to be trained properly. Otherwise, the AI models will generate recommendations that are based on inaccurate assumptions because the data the machine learning algorithms were exposed to was either inconsistent or unreliable. IT organizations are addressing that issue by first investing heavily in massive data lakes to normalize all their data and then applying DataOps best practices, as outlined in a manifesto that describes how to automate as many data preparation and management tasks as possible.

Legacy approaches to managing data based on manual copy-and-paste processes are one of the primary reasons it takes so long to build an AI model. Data science teams are lucky if they can roll out two AI models a year. Cloud service providers such as Amazon Web Services (AWS) offer products such as Amazon SageMaker to automate the construction of AI models, which should increase the rate at which AI models are created in the months ahead.

Not every organization, however, will commit to building AI models in the cloud. That requires storing data in an external platform, which creates a range of potential compliance issues an organization might rather avoid. The University of Pisa, for example, finds it easier to convince officials to allocate budget to a local datacenter than to give permission to access an external cloud, Davini noted.

Ultimately, the goal is to eliminate the data management friction that has long been a plague on IT by adopting a set of DataOps processes that are similar in nature to the DevOps best practices widely employed to streamline application development and deployment. However, all the best practices in the world won’t make much of a difference if the underlying storage platform is simply too slow to keep up.

