Personalized deep-learning models can enable artificial intelligence chatbots that adapt to understand a user’s accent, or smart keyboards that continuously update to better predict the next word based on a user’s typing history. This personalization requires constantly fine-tuning a machine-learning model with new data.
User data are usually transferred to cloud servers where the model is updated, as smartphones and other edge devices do not have the memory and processing capacity required for this fine-tuning process. However, transmitting data requires a lot of energy, and there is a security risk when transferring private user information to a cloud server.
Researchers at MIT, the MIT-IBM Watson AI Lab, and elsewhere have devised a method that allows deep-learning models to adapt efficiently to new sensor data directly on an edge device.
Their on-device training technique, called PockEngine, determines which parts of a large machine-learning model need to be updated to improve accuracy, and stores and computes with only those parts. It performs most of these computations while the model is being prepared, before runtime, which reduces computational overhead and speeds up fine-tuning.
Compared with previous methods, PockEngine significantly accelerated on-device training, running up to 15 times faster on some hardware platforms, and it did so without causing models to lose accuracy. The researchers also found that their fine-tuning method enabled a well-known AI chatbot to answer challenging questions more accurately.
On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can.
Song Han, Study Senior Author and Associate Professor, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Han was joined on the study by lead author and EECS graduate student Ligeng Zhu, along with collaborators from MIT, the MIT-IBM Watson AI Lab, and the University of California San Diego. The study was presented at the most recent IEEE/ACM International Symposium on Microarchitecture.
Layer by Layer
Deep-learning models are built on neural networks, which are made up of many interconnected layers of nodes, or “neurons,” that process data to make a prediction. During inference, an input such as an image is passed from layer to layer until the prediction (perhaps the image label) is output at the end. Once a layer has processed the input during inference, it no longer needs to be stored.
During training and fine-tuning, however, the model goes through a process known as backpropagation. The output is compared with the correct answer, and the model is then run in reverse. Each layer is updated so that the model’s output moves closer to the correct answer.
Because each layer may need to be updated, the entire model and its intermediate results must be stored, which makes fine-tuning far more memory-intensive than inference.
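To make the memory gap concrete, here is a minimal Python sketch. It assumes a purely sequential network, and the activation sizes are made-up numbers rather than measurements from the study. It also ignores weights, gradients, and optimizer state, which add further to training memory.

```python
def inference_peak(activation_sizes):
    """Peak activation memory when a layer's input is freed once its output exists."""
    # Only one layer's input and output need to be alive at the same time.
    return max(a + b for a, b in zip(activation_sizes, activation_sizes[1:]))


def training_peak(activation_sizes):
    """Peak activation memory when every activation is kept for the backward pass."""
    return sum(activation_sizes)


# Hypothetical sizes in KB: the input, then the output of each of four layers.
sizes_kb = [600, 800, 400, 200, 10]

print("inference:", inference_peak(sizes_kb), "KB")
print("training: ", training_peak(sizes_kb), "KB")
```

With these illustrative numbers, inference peaks at 1,400 KB while full backpropagation holds 2,010 KB, and the gap widens as the network gets deeper.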
However, not all layers in a neural network are important for improving accuracy. Even for layers that are important, the entire layer may not need to be updated. Layers and portions of layers that are not updated do not need to be stored. In addition, the backward pass may not need to go all the way back to the first layer to improve accuracy; it can stop somewhere in the middle.
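The idea of updating only some layers, only part of a layer, and stopping the backward pass early can be sketched on a toy chain of scalar layers of the form y = w * x + b. This is an illustration under stated assumptions, not PockEngine’s implementation: the `plan` dictionary, the mode names, and the choice of a bias-only update as the “portion of a layer” are all hypothetical.

```python
def forward(x, layers, plan):
    """Run the chain y = w * x + b, saving a layer's input only if its weight is trained."""
    saved = {}
    for i, (w, b) in enumerate(layers):
        if plan.get(i) == "weight+bias":
            saved[i] = x  # the weight gradient needs this layer's input
        x = w * x + b
    return x, saved


def sparse_backward(grad_out, layers, plan, saved):
    """Walk backward from the last layer and stop at the earliest layer in the plan."""
    grads = {}
    first = min(plan)  # no layer before this one is updated
    for i in range(len(layers) - 1, first - 1, -1):
        w, _ = layers[i]
        mode = plan.get(i)
        if mode == "weight+bias":
            grads[i] = {"w": grad_out * saved[i], "b": grad_out}
        elif mode == "bias":
            grads[i] = {"b": grad_out}  # the bias gradient needs no stored input
        grad_out = grad_out * w  # gradient with respect to this layer's input
    return grads


# Hypothetical four-layer chain; only the last two layers are fine-tuned.
layers = [(0.5, 0.0), (2.0, 1.0), (1.5, -1.0), (1.0, 0.5)]
plan = {2: "bias", 3: "weight+bias"}

y, saved = forward(2.0, layers, plan)
target = 3.0
grads = sparse_backward(y - target, layers, plan, saved)  # loss = 0.5 * (y - target) ** 2
print(grads, "stored activations:", len(saved))
```

In this toy setup only one activation is stored instead of four, and the backward pass never visits layers 0 and 1.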
PockEngine makes use of these characteristics to reduce the amount of computing and memory needed while accelerating the fine-tuning process.
The system first fine-tunes each layer, one at a time, on a particular task and measures the accuracy improvement after each individual layer. In this way, PockEngine identifies the contribution of each layer, as well as the trade-offs between accuracy and fine-tuning cost, and automatically determines the percentage of each layer that needs to be fine-tuned.
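One way to picture this trade-off is as a selection problem: given a measured accuracy gain and a cost for each layer, pick the layers worth updating within a budget. The article does not describe the search procedure PockEngine actually uses, so the greedy rule below is a stand-in, and every number and name is hypothetical.

```python
def select_layers(gains, costs, budget):
    """Greedily pick layers with the best accuracy gain per unit cost within a budget."""
    ranked = sorted(gains, key=lambda name: gains[name] / costs[name], reverse=True)
    chosen, spent = [], 0
    for name in ranked:
        if gains[name] > 0 and spent + costs[name] <= budget:
            chosen.append(name)
            spent += costs[name]
    return chosen, spent


# Hypothetical per-layer measurements: accuracy gain (points) and memory cost (KB).
gains = {"layer1": 0.2, "layer2": 0.1, "layer3": 1.5, "layer4": 2.4}
costs = {"layer1": 40, "layer2": 30, "layer3": 25, "layer4": 20}

chosen, spent = select_layers(gains, costs, budget=50)
print(chosen, spent)
```

Here the two later layers deliver most of the gain for a fraction of the cost, so the earlier layers are left frozen.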
Han added, “This method matches the accuracy very well compared to full back propagation on different tasks and different neural networks.”
A Pared-Down Model
Typically, the backpropagation graph is generated at runtime, which involves a great deal of computation. PockEngine instead does this at compile time, while the model is being prepared for deployment.
PockEngine deletes pieces of code to remove unnecessary layers or portions of layers, producing a pared-down graph of the model to be used at runtime. It then applies further optimizations to this graph to improve efficiency even more.
Because all of this needs to be done only once, it reduces computational overhead at runtime.
“It is like before setting out on a hiking trip. At home, you would do careful planning — which trails are you going to go on, which trails are you going to ignore. So then at execution time, when you are actually hiking, you already have a very careful plan to follow,” Han added.
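The planning step can be sketched as dead-code elimination on a backward graph: decide once, before deployment, which gradient computations are needed, and ship only those. The graph format, node names, and `prune_graph` helper below are hypothetical and far simpler than a real compiler’s intermediate representation.

```python
def prune_graph(graph, wanted):
    """Keep only the nodes that the wanted outputs depend on (dead-code elimination)."""
    keep, stack = set(), list(wanted)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(graph[node])
    # Preserve the original order of the full graph, which is already topological.
    return {node: deps for node, deps in graph.items() if node in keep}


# Hypothetical backward graph for a three-layer model: node -> nodes it reads from.
# grad_wN is the gradient of layer N's weights; grad_xN is the gradient of its input.
full_backward = {
    "grad_loss": [],
    "grad_w3": ["grad_loss"],
    "grad_x3": ["grad_loss"],
    "grad_w2": ["grad_x3"],
    "grad_x2": ["grad_x3"],
    "grad_w1": ["grad_x2"],
}

# "Compile time": done once. Only layers 2 and 3 are fine-tuned in this example.
runtime_plan = prune_graph(full_backward, wanted=["grad_w3", "grad_w2"])
print(list(runtime_plan))
```

The pruned plan drops `grad_x2` and `grad_w1` entirely, so the device never computes them during fine-tuning.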
When the researchers applied PockEngine to deep-learning models on various edge devices, including Apple M1 chips and the digital signal processors common in many smartphones and Raspberry Pi computers, it performed on-device training up to 15 times faster without sacrificing accuracy. PockEngine also drastically reduced the amount of memory needed for fine-tuning.
The team also applied the technique to the large language model Llama-V2. With large language models, fine-tuning involves providing many examples, and it is critical that the model learns how to interact with users, Han said. The process is also important for models tasked with solving complex problems or reasoning about solutions.
On an NVIDIA Jetson Orin edge GPU platform, PockEngine reduced the time required for each fine-tuning iteration from approximately seven seconds to less than one second.
In the future, the researchers hope to use PockEngine to fine-tune even larger models designed to process text and images together.
This work addresses growing efficiency challenges posed by the adoption of large AI models such as LLMs across diverse applications in many different industries. It not only holds promise for edge applications that incorporate larger models but also for lowering the cost of maintaining and updating large AI models in the cloud.
Ehry MacRostie, Senior Manager, Artificial General Intelligence Division, Amazon
MacRostie was not involved in this study but collaborates with MIT on related AI research through the MIT-Amazon Science Hub.
This study was funded in part by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT-Amazon Science Hub, the National Science Foundation (NSF), and the Qualcomm Innovation Fellowship.