Data storage as a key component for efficient AI workflows

A guest contribution by Uwe Kemmer* | Translated by AI

Artificial intelligence offers companies new opportunities to optimize their business processes and save resources. However, the success of an AI workflow does not only depend on powerful algorithms and GPUs. Efficient storage solutions are crucial.

The starting point for any AI workflow is the collection of large data sets. (Image: freely licensed / Pixabay)

Uwe Kemmer is Director EMEAI Field Engineering at Western Digital Corporation.

Artificial intelligence has ushered in a new era of technological achievements, from the impressive performance of intelligent language models to generative AI's ability to create images from text. Companies also benefit from these new opportunities: with technical know-how, they are able to develop and train their own AI models. There are hardly any limitations on which business processes can be analyzed and optimized by AI.

When it comes to AI hardware, GPUs with their enormous computing power are often the focus. However, they alone cannot realize an AI workflow. The reason is the data involved. Whether for analysis, training, or rapid decision-making, AI requires and generates huge amounts of information at every step. Storage systems—at the edge, on-premises, or in the cloud—serve as infrastructure to collect, store, and manage these datasets.

Choosing the right storage technology can be crucial for using AI efficiently and ultimately achieving the desired results. It is all the more important to know where and how data is used throughout an AI workflow. Fortunately, despite its seemingly high complexity, the process can be broken down into four fundamental steps: data collection, model creation, training, and deployment.

The four phases of the AI workflow

1. Data Collection: The starting point for any AI workflow is the collection of large amounts of data. During collection, raw data is generated from sources such as sensors, cameras, and databases. The storage solutions used in this phase must efficiently capture and organize structured and unstructured formats such as images, texts, and videos. This is fundamental to the entire AI process. Typically, raw data lands on a local storage platform but can also be gradually uploaded to the cloud for analysis.
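As an illustration only (the local directory, bucket name, and S3-style object-store API below are assumptions, not part of this article), a gradual upload of newly collected raw data from a local landing zone to cloud object storage could look roughly like this:

    # Sketch: incrementally upload newly collected raw data to cloud object storage.
    # Bucket name, local path, and key prefix are illustrative assumptions.
    import json
    from pathlib import Path

    import boto3  # works against S3-compatible object stores

    RAW_DATA_DIR = Path("/data/raw")            # local landing zone (assumption)
    BUCKET = "example-ai-raw-data"              # hypothetical bucket
    STATE_FILE = Path("uploaded_files.json")    # remembers what is already uploaded

    s3 = boto3.client("s3")
    uploaded = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

    for path in sorted(RAW_DATA_DIR.rglob("*")):
        if path.is_file() and str(path) not in uploaded:
            key = f"raw/{path.relative_to(RAW_DATA_DIR)}"
            s3.upload_file(str(path), BUCKET, key)  # gradual, file-by-file upload
            uploaded.add(str(path))

    STATE_FILE.write_text(json.dumps(sorted(uploaded)))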

In some special cases, physical data transport devices—such as an external hard drive or a rugged edge server—are required to transport large amounts of information to the data center. This method is usually employed when a network upload would take too long or be too costly. Robust edge solutions can also ensure seamless data collection in extreme environments such as a desert or ocean, where an internet connection is not possible.

2. Model Creation: With a clearly defined problem in mind, AI experts in this phase engage in various processing steps to refine the algorithms and extract the desired insights from the data. Model creation and the training phase are the most compute-intensive processes in the AI workflow. Choosing the right storage media is particularly important here and is not necessarily limited to fast all-flash arrays. Hard disk drives (HDDs) play a crucial role in storing large datasets and snapshots for future training. Machine learning algorithms repeatedly process these datasets to optimize the model. While HDDs provide cost-effective mass storage, flash delivers the speed that allows training and model development to proceed without delay.
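A minimal sketch of this hybrid idea, assuming hypothetical mount points for an HDD pool and an NVMe scratch volume: the full dataset and its snapshots stay on inexpensive bulk storage, and only the working set is staged onto flash before a run.

    # Sketch: stage the working set from HDD-backed bulk storage onto fast flash
    # before training. All paths are illustrative assumptions.
    import shutil
    from pathlib import Path

    BULK_STORE = Path("/mnt/hdd_pool/datasets/images_v3")  # cost-effective mass storage
    FLASH_SCRATCH = Path("/mnt/nvme_scratch/images_v3")    # fast working copy

    def stage_to_flash(src: Path, dst: Path) -> Path:
        """Copy the dataset to the flash tier only if it is not already staged."""
        if not dst.exists():
            shutil.copytree(src, dst)
        return dst

    working_set = stage_to_flash(BULK_STORE, FLASH_SCRATCH)
    print(f"Training reads from: {working_set}")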

3. Training: During training, the previously refined model is applied to and tested on a comprehensive dataset. Training times vary greatly: the most popular large language models took up to a year to train, while other models need anywhere from hours to days or months; the duration always depends on the problem and the dataset used. Essentially, every AI model operates in iterative loops, with the process optimized before each run. The required GPU performance is immense, and the resulting data must be accessible for the next round of training. At first glance, a pure flash setup seems ideal for the training phase. In reality, however, an AI should always be able to draw on the largest possible data pool so that insights from past iterations continue to feed into the training algorithm. As with model creation, a hybrid approach of HDDs and flash is therefore optimal.
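How the storage tiers can interact with these iterative loops is sketched below; the paths and the placeholder training step are assumptions for illustration, not a prescribed implementation. Each iteration writes its checkpoint to flash, and older checkpoints are moved to the HDD pool so that the full history remains part of the data pool.

    # Sketch: iterative training with checkpoints on flash and an HDD archive.
    import pickle
    import shutil
    from pathlib import Path

    FLASH_CKPT = Path("/mnt/nvme_scratch/checkpoints")        # fast tier (assumption)
    BULK_ARCHIVE = Path("/mnt/hdd_pool/checkpoint_archive")   # bulk tier (assumption)
    FLASH_CKPT.mkdir(parents=True, exist_ok=True)
    BULK_ARCHIVE.mkdir(parents=True, exist_ok=True)

    model_state = {"weights": [0.0], "iteration": 0}

    for iteration in range(1, 6):
        # placeholder for the actual GPU-bound training step
        model_state["weights"] = [w + 0.1 for w in model_state["weights"]]
        model_state["iteration"] = iteration

        ckpt = FLASH_CKPT / f"model_iter_{iteration:04d}.pkl"
        ckpt.write_bytes(pickle.dumps(model_state))            # fast write to flash

        # keep only the newest checkpoint on flash, archive the rest to HDD
        for old in sorted(FLASH_CKPT.glob("model_iter_*.pkl"))[:-1]:
            shutil.move(str(old), str(BULK_ARCHIVE / old.name))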

4. Deployment: Once the training process is completed and the algorithm is finalized, it can be deployed. The most common method is to host the algorithm in the cloud and make it available via web-based services. For example, companies can utilize an algorithm across multiple locations or offer it as a service. In combination with edge locations, this can also be supplemented with real-time data analysis. Of course, deployment on a smaller scale is just as feasible. In the case of SMEs, the algorithm can reside on the local server and be accessible throughout the corporate network.
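As a rough sketch of the web-based variant (the endpoint name, port, and stand-in model function are assumptions, not part of the article), a minimal service could look like this:

    # Sketch: expose the finished algorithm as an HTTP service.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # In practice the trained model would be loaded here, e.g. from the local
    # server or from object storage; this stand-in keeps the sketch self-contained.
    def run_model(features):
        return sum(features)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json() or {}
        return jsonify({"prediction": run_model(payload.get("features", []))})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)  # reachable throughout the corporate network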

Individual storage strategy for each AI workflow

Based on the described phases of the AI workflow, important insights can be gained for choosing the right storage. Fundamentally, there is no single solution. Rather, the optimal storage strategy depends on the specific use case. It is important to be aware of the individual requirements of the AI model and not to lose sight of the desired goal:

  • Data collection strategy: Is the basic approach bulk transfer or incremental upload? In some scenarios, a physical data transfer device or a rugged edge server may be necessary.

  • Training environment: Is the training conducted in the cloud, on the company's own system, or possibly directly with an external provider offering a pre-trained model? Each option has its own advantages and necessary trade-offs.

  • Application: Who is intended to use the final algorithm, and how is it accessible? If the goal is, for example, edge inferencing, it must be ensured that the hardware requirements are met at every edge location.
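Purely as an illustration (the field names and values are assumptions, not a prescribed format), the answers to these questions can be written down as a simple per-phase storage plan:

    # Sketch: capturing the storage strategy per workflow phase.
    storage_plan = {
        "data_collection": {
            "approach": "incremental_upload",   # or "bulk_transfer" via physical media
            "landing_zone": "local storage platform",
            "edge": "rugged edge server for remote sites",
        },
        "model_creation_and_training": {
            "environment": "on_premises",       # or "cloud" / external pre-trained model
            "bulk_tier": "HDD pool for datasets and snapshots",
            "fast_tier": "NVMe flash for the active working set",
        },
        "deployment": {
            "target": "cloud web service",      # or local server / edge inferencing
            "access": "company-wide network",
        },
    }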

AI and data go hand in hand

Artificial intelligence is here to stay. The extent of the changes for society is still difficult to foresee. However, it is already clear that data plays a crucial role. It is no coincidence that the new SI prefixes ronna (10^27) and quetta (10^30) were introduced in 2022 to keep the exponentially growing global data volumes quantifiable. For AI, mere storage is secondary; far more important are the speed and efficiency with which models can access information and operate on it. An ill-considered storage strategy creates an avoidable bottleneck in the long run.

When introducing artificial intelligence, companies should therefore keep a close eye on the interplay of data collection, model creation, training, and deployment, because the choice of suitable storage solutions is ultimately crucial to the success or failure of an AI workflow.