Microsoft's LAMs are trained using supervised fine-tuning, imitation learning, and reinforcement learning.
Large language models (LLMs) have been at the forefront of rapid progress in AI, powering chatbots, text generation, and even code writing.
Although LLMs are good at understanding and generating text, they struggle to perform tasks in real-world environments.
Researchers at Microsoft have created what they call a Large Action Model (LAM), an AI model that can operate Windows programs on its own.
Large Action Models (LAMs) represent a significant advance in artificial intelligence, enabling AI systems to execute complex tasks based on human instructions. They mark the shift from AI models that can only talk to models that can actually act.
What are LAMs?
Traditional AI models mainly process and generate text, but LAMs take things a step further: they turn user requests into real actions, ranging from operating software to controlling robots.
The concept itself is not new; LAM is simply the first model specifically trained to work with Microsoft Office products.
LAMs as a concept gained prominence in the first half of 2024, when Rabbit launched its AI device with an assistant that could interact with mobile applications on the user's behalf.
LAMs can understand inputs such as text, voice, or images, and convert these requests into detailed step-by-step plans.
They can also adjust their approach in real time. In short, LAMs are AIs designed not just to understand but to act.
According to the research paper "Large Action Models: From Inception to Implementation," these models are designed to interact with both digital and physical environments.
Think of it this way: instead of asking an AI how to create a PowerPoint presentation, you could ask it to open the app, create the slides, and format them to your preferences.
At their core, LAMs combine three capabilities: understanding intent, meaning they interpret user commands accurately; action generation, the ability to plan actionable steps; and dynamic adaptation, adjusting based on feedback from the environment.
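To make those three pieces concrete, here is a minimal sketch of that loop in Python. Every name in it is illustrative, not taken from Microsoft's implementation: a toy interpreter maps the request to a known task, a planner expands it into steps, and a simple retry on failure stands in for dynamic adaptation.

```python
def interpret_intent(request: str) -> str:
    """Map a free-form user request to a known task (toy version)."""
    return "create_presentation" if "presentation" in request.lower() else "unknown"


def generate_plan(task: str) -> list[str]:
    """Expand a task into ordered, high-level steps."""
    plans = {"create_presentation": ["open_powerpoint", "add_slide", "format_slide"]}
    return plans.get(task, [])


def execute(step: str) -> bool:
    """Stand-in for a real GUI action; reports whether the step succeeded."""
    print(f"executing: {step}")
    return True


def run(request: str) -> None:
    for step in generate_plan(interpret_intent(request)):
        for _attempt in range(2):  # dynamic adaptation, reduced here to one retry
            if execute(step):
                break
        else:
            print(f"'{step}' failed twice; stopping")
            return


run("Create a PowerPoint presentation about LAMs")
```

In a real LAM the interpreter and planner would be the model itself, and the failure handling would involve replanning from the observed UI state rather than a blind retry.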
How are LAMs Built?
Compared with LLMs, building a LAM is far more complex, involving five stages. Data is the foundation of any AI model, and LAMs require two types: task-plan data, the high-level steps for a task such as opening a Word document and highlighting text, and task-action data, the specific, executable steps that carry the task out.
For training, these models undergo supervised fine-tuning, imitation learning, and reinforcement learning.
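As an illustration, the two data types might look like the following records. The field names here are assumptions made for this sketch, not the paper's actual schema.

```python
# Task-plan data: high-level steps a human might list for a request.
task_plan_example = {
    "request": "Highlight the title in the Word document",
    "plan": [
        "Open the document in Word",
        "Select the title text",
        "Apply highlighting",
    ],
}

# Task-action data: concrete, executable UI operations for the same request.
task_action_example = {
    "request": "Highlight the title in the Word document",
    "actions": [
        {"action": "launch", "target": "WINWORD.EXE"},
        {"action": "select_text", "target": "title"},
        {"action": "click", "target": "Home > Text Highlight Color"},
    ],
}
```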
Before deployment, the models are tested in controlled environments and then integrated into agent systems, such as Windows GUI agents, so they can interact with real environments.
Finally, the model is tested in live scenarios to gauge its adaptability and performance.
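A rough, hypothetical sketch of what such an integration could look like: an agent class exposes observe/act methods, and the model's chosen actions are dispatched through it. None of these class or method names come from Microsoft's system.

```python
class WindowsGuiAgent:
    """Hypothetical wrapper that routes model-issued actions to the UI."""

    def observe(self) -> dict:
        # A real agent would capture UI state here (accessibility tree,
        # screenshot, etc.); this stub returns a canned observation.
        return {"focused_app": "PowerPoint", "controls": ["New Slide", "Layout"]}

    def act(self, action: dict) -> bool:
        # A real implementation would drive the mouse/keyboard or call
        # UI Automation; here we only log the dispatched action.
        print(f"dispatching {action['action']} on {action['target']}")
        return True


agent = WindowsGuiAgent()
print(agent.observe())
agent.act({"action": "click", "target": "New Slide"})
```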
LAMs mark a big evolutionary leap, from text generation to action-driven AI agents. From automating workflows to assisting people with disabilities, LAMs are not just smarter AI but AI that can be more useful in everyday life.
As the technology evolves, LAMs may soon become a standard AI system across sectors.