How to build a system to connect with any chip and test AI models on HAI

Feb 28, 2024 by Arthur Shaikhatarov, Martin d’Ippolito

 

  

In brief

  • The purpose of this project was to create a demo, deployed on Azure Cloud, to be presented at CES 2024. Docker images allowed us to run the client’s models not only in their proprietary format but also as ONNX files and as quantized versions of those models 
  • Our client wanted to test one of their chips. With all software built on the Azure cloud, Zoreza Global was responsible not only for deploying the resources but also for building the MLOps infrastructure 
  • Testing and running AI/ML models on individual chips requires substantial environments: a UI for uploading models and checking their results, plus a pipeline for optimization, quantization and running on HAI chips 

  

A major semiconductor manufacturer asked Zoreza Global’s GEMS team (graphics, embedded, machine learning and simulation) to develop and implement an MLOps infrastructure that would allow them to run machine learning models and demonstrate the results. 

In particular, the client wanted to test one of their boards (chips). All software was built on the Azure cloud, and Zoreza Global was responsible for deploying the resources and building the MLOps infrastructure. Building the Docker images locally allowed the team to start the development process. 

  

Keeping on schedule

 

We built Docker images for Apache Airflow, a tool that enabled us to schedule and orchestrate the data and machine learning pipelines. Airflow was triggered by a POST request for each job that had to be done. Airflow then took care of data management, ensured we had the correct parameters for testing the client’s models and, most importantly, reached the client’s chip and the other isolated environments running Docker containers for testing. 
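The exact trigger depends on the deployment, but as a rough sketch, a job request against Airflow’s stable REST API could look something like this (the host, credentials, DAG name and parameters below are placeholders, not the client’s actual setup):

```python
import requests

# Hypothetical Airflow webserver and DAG name -- placeholders for illustration only
AIRFLOW_API = "http://airflow.example.internal:8080/api/v1"
DAG_ID = "chip_model_test"

# Trigger a DAG run, passing the job parameters through "conf",
# e.g. which model file to test and which target device to run it on
response = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("airflow_user", "airflow_password"),  # basic auth, assuming that backend is enabled
    json={"conf": {"model_uri": "models/resnet50.onnx", "target": "client_chip"}},
    timeout=30,
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```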

The team created the Airflow infrastructure from the default images, which we customized accordingly. Our first move was to build the DAGs (directed acyclic graphs, the mechanism Airflow uses to define workflows). The DAGs allowed the team to orchestrate tasks and create step dependencies, so an operation could only start once the previous step had completed. In that way, we could either perform steps one by one or parallelize suitable tasks. 
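As an illustration only (the real DAGs and task names belong to the client), a toy DAG in which one preparation step fans out into two parallel test tasks might look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_data(**_):
    print("fetch and validate the test data")


def run_onnx_test(**_):
    print("run the ONNX model on the target")


def run_quantized_test(**_):
    print("run the quantized model on the target")


with DAG(
    dag_id="chip_model_test",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # Airflow 2.4+; triggered externally via the REST API
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
    onnx_test = PythonOperator(task_id="run_onnx_test", python_callable=run_onnx_test)
    quantized_test = PythonOperator(task_id="run_quantized_test", python_callable=run_quantized_test)

    # prepare_data must finish before the two test tasks, which can then run in parallel
    prepare >> [onnx_test, quantized_test]
```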

Ultimately, all these resources were further abstracted and automated by our DevOps engineer. Terraform and Helm charts were used to deploy to the cloud automatically. Cloud resources, mainly Kubernetes pods, were generated efficiently and automatically by building all the necessary container images. 

  

Tracking meta and artifacts

 

The second tool we used was MLflow. MLflow lets you save each experiment and persist the output values received, such as accuracy, recall and configuration files. As well as recording the parameters and data used, it lets you save artifacts, i.e., the data used to train or test the model. 
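A minimal sketch of that kind of tracking, with invented run names, values and file paths, might look like this:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # hypothetical tracking server
mlflow.set_experiment("chip_model_tests")                       # hypothetical experiment name

with mlflow.start_run(run_name="resnet50_onnx_on_chip"):
    # Parameters and configuration of this test run
    mlflow.log_param("model_format", "onnx")
    mlflow.log_param("target", "client_chip")

    # Output values of the experiment (illustrative numbers only)
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("latency_ms", 12.4)

    # Artifacts: configuration files and the data used for the test
    mlflow.log_artifact("config.yaml")
    mlflow.log_artifacts("test_data/")
```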

MLflow Model Registry is another vital tool. It allows you to register models that can then be deployed as web services, etc. In this case, we experimented with the Registry to see whether there was a chance of deploying something, but we mainly used MLflow for tracking. 
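Registering a model from a tracked run is a single call; the run ID and registry name below are placeholders:

```python
import mlflow

# Register the model logged under a given run into the MLflow Model Registry.
# The run ID and model name are placeholders for illustration.
result = mlflow.register_model(
    model_uri="runs:/abc123def456/model",
    name="chip_demo_model",
)
print(result.name, result.version)
```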

Airflow triggered the pipelines for testing the models, and we captured and registered accuracy and latency (the model’s execution time) in MLflow. These were the client’s two main metrics. 

   

Demoing at CES 2024

 

The purpose of this project was to create a demo, deployed on Azure Cloud, to be presented at CES 2024. Docker images allowed us to run the client’s models not only in their proprietary format but also as ONNX files and as quantized versions of those models. 

The purpose of the demo, primarily, was to show that the client can run everything on the cloud. We’re seeing rising interest in cloud resources because they’re an efficient, cost-saving way to maintain best practices: tasks you’d otherwise have to implement yourself, at considerable cost in time and money, are delegated to the cloud platform. 

Once we had created and tested the local resources, our team deployed them on Azure using a Kubernetes cluster, which allows containers to be scaled. The team maintained and scaled the resources to make sure there were no hardware management problems, then ran everything on Kubernetes. 

 

Planning our endgame

 

Having engineered the back-end ML section, we connected a UI web page — still deployed in the Kubernetes cluster — so everything was within an efficient and effective infrastructure. The front-end UI, developed in Angular, sent requests to the back end. 

The project was developed mainly in Python, Airflow, MLflow and Docker, using a component that allowed us to deploy containers in Kubernetes. Connecting with a novel way of deploying machine learning models added to the team’s collective experience. 
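The article doesn’t name that component, but a common choice for launching containers from Airflow into a Kubernetes cluster is the KubernetesPodOperator from the cncf.kubernetes provider; assuming something along those lines, a task definition might look like this (image, namespace and command are placeholders):

```python
# Newer cncf.kubernetes provider versions expose the operator under operators.pod;
# older versions import it from operators.kubernetes_pod
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Launch a test container as a Kubernetes pod from within a DAG
# (shown outside its DAG context for brevity; all values are illustrative).
run_model_in_pod = KubernetesPodOperator(
    task_id="run_model_in_pod",
    name="chip-model-test",
    namespace="ml-testing",
    image="registry.example.internal/chip-model-runner:latest",
    cmds=["python", "run_test.py"],
    arguments=["--model", "models/resnet50.onnx", "--target", "client_chip"],
    get_logs=True,  # stream container logs back into the Airflow task log
)
```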

 

  

Targeting sharper accuracy

 

Apache TVM enables users to optimize a machine learning model for a target machine. This was vital because one purpose of the project was to show we could achieve excellent accuracy with an ONNX model. Quantization reduces a model’s footprint (the memory it takes up), so it can be deployed on less powerful hardware, e.g., a smartphone or another device that isn’t primarily a personal computer, with little difference from an execution-time (latency) perspective. 
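As a rough sketch of that workflow (not the client’s actual pipeline), compiling an ONNX model with TVM’s Relay frontend for a chosen target could look like this; the file name, input shape and target are placeholders:

```python
import onnx
import tvm
from tvm import relay

# Load the ONNX model and import it into TVM's Relay IR.
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}  # illustrative input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Choose the target machine: "llvm" compiles for a generic CPU,
# but this could equally be a GPU or an embedded target.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Export the compiled library so it can be shipped to the target device.
lib.export_library("model_optimized.so")
```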

The ONNX model performed like the original, but accuracy was reduced, so we needed to determine when it was most productive to use this model. 

 

Cutting your losses

 

From the outset, we achieved acceptable accuracy for the application. Inevitably, you lose a certain degree of accuracy with quantization, but by using the best algorithm, we minimized the loss. We just needed to know how much it was likely to be. 
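The article doesn’t say which quantization tooling was used, but as one example of the technique, dynamic quantization of an ONNX model with ONNX Runtime takes a single call; the file names are placeholders:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the weights to 8-bit integers to shrink the model's footprint.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)

# The same test set is then run on both model.onnx and model_int8.onnx,
# and the accuracies compared to measure how much quantization costs.
```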

ONNX is the standard format for deployment. It allows you to install a runtime and run the model. Consequently, developers, ML engineers and data scientists can build the model in whichever framework they want: PyTorch, TensorFlow, Keras and many others. 
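For instance, once a model has been exported to ONNX, running it requires only the ONNX Runtime package; the model path, input name and shape below are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Load the exported model with ONNX Runtime and run a single prediction.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative shape
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```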

ONNX conversion provided a unified way of deploying the model. TVM goes one step further: ONNX itself is not optimized for any particular hardware, whereas TVM optimization provides a faster runtime, especially for larger, more complex models. And most importantly, you can choose the target machine. 

For instance, say you have a specific CPU or GPU that you want to deploy the model on. TVM significantly reduces the execution time, a crucial feature for applications demanding a high volume of predictions. 

 

Everything with chips

 

HAI (human-AI interaction in ML) encompasses running specific ML models on small chips, like those in smartphones, standalone devices, some cameras, tablets and so on. Samsung, Apple, Qualcomm, LG and our client all develop HAI chips for running certain AI models in their products. 

Testing and running AI/ML models on individual chips requires substantial environments: a UI for uploading models and checking their results, plus a pipeline for optimization, quantization and running on HAI chips. 

The project required a comparatively large team to implement the whole process — back-end, front-end, DevOps, MLOps and ML engineers. Unsurprisingly, building an environment is an expensive process. 

However, we can now offer other clients our fully developed system for connecting with any chip and testing AI models on HAI. 

Find out more

 

To learn more about how Zoreza Global’s GEMS team can help you implement a system to connect with any chip and test models on HAI, visit our website or contact us. 

  

 

 

Arthur Shaikhatarov, AI/ML Project Manager, Automotive, Zoreza Global

Arthur has served as project manager specializing in AI/ML initiatives at Zoreza Global for 4 years. His journey in artificial intelligence and machine learning began as an engineer, where he honed his skills and knowledge. Arthur is responsible for team management and formation, presales activities for ML projects, the orchestration and execution of AI/ML proofs-of-concept and the growth of AI/ML activities within Zoreza Global.

Martin d’Ippolito, Data Scientist and ML Engineer, Automotive, Zoreza Global

Martin is an accomplished data scientist and machine learning engineer, with five years’ expertise in the field. His knowledge and experience go beyond computer vision and natural language processing (NLP), to include the development of end-to-end web applications. This unique combination of talents helps his team craft outstanding proofs-of-concept to engage clients.
