Reinforcement Learning systems are ready to make the jump from the research bench to real-world applications.
Wondering why you are still seeing advertisements for holidays in your browser during a global pandemic? It's because most companies that place ads use traditional Machine Learning (ML) models that make predictions based on past behavior. When the temperature drops, these models suggest a trip to some far-flung destination because that has worked in the past, so the same ads are placed even though consumers are unlikely to be going anywhere under lockdown restrictions.
The issue is that ML systems rely heavily on past experience and can't adapt to changing circumstances or shifting customer preferences; they only respond to new stimuli if they are specifically retrained to do so. That could be about to change, however.
Microsoft is introducing Reinforcement Learning (RL) systems to developers: services that 'learn' more like human beings, rather than being trained once on a fixed body of data.
Just like a human being, an RL system learns through trial and error, with its decision processes improved by feedback, reinforcement, and rewards. The system then balances exploiting what it already knows against the reward of exploring new options¹.
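The exploration-versus-exploitation trade-off described above can be illustrated with a toy epsilon-greedy bandit. This is a minimal sketch, not Microsoft's implementation; the two 'arms' stand in for hypothetical options, such as two ads with different click-through rates.

```python
import random

def epsilon_greedy_bandit(reward_fns, steps=1000, epsilon=0.1, seed=0):
    """Balance exploiting the best-known option against exploring the others."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    values = [0.0] * len(reward_fns)  # running average reward per option
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(reward_fns))  # explore: try a random option
        else:
            arm = max(range(len(reward_fns)), key=values.__getitem__)  # exploit
        r = reward_fns[arm](rng)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # update the average
    return values, counts

# Two hypothetical ads: one pays off 30% of the time, the other 60%.
values, counts = epsilon_greedy_bandit(
    [lambda r: 1.0 if r.random() < 0.3 else 0.0,
     lambda r: 1.0 if r.random() < 0.6 else 0.0])
```

With occasional exploration, the system discovers that the second option rewards more often and shifts most of its choices there, without ever being 'retrained' on a new data set.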
One of the obstacles to adopting RL systems has always been that they require large amounts of data to function. But Microsoft says its new systems can operate on data sets smaller than those used in traditional ML.
Micro-Choices and Minecraft
The reliance on feedback to learn means that RL systems are well suited to making small choices in repeated situations, adjusting behavior and learning from the results.
The foundational work in developing RL can be traced back as far as 1992² and over recent years it has been used to teach computers how to play video games such as Minecraft that have in-built rewards. Now techniques developed by Microsoft are emerging from the lab and moving into production.
Perhaps the most significant embodiment of this development is Personalizer, part of Azure Cognitive Services on the Azure AI Platform³. The service was originally devised in conjunction with MSN and Bing to personalize the news received by users.
Personalizer's Application Programming Interface (API), software that allows applications to communicate with each other, prioritizes relevant content and layouts. The result is a more suitable set of options each time a user opens an app, rather than the same fixed selection.
This customization delivered an impressive boost in 'clicks' on MSN, increased engagement with products on the Microsoft homepage, and can be applied by any website that adopts the API.
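In outline, a client calls Personalizer's Rank route with the current context and candidate actions, shows the action the service picks, then reports back a reward (such as a click). The sketch below assumes the documented v1.0 REST routes; the endpoint name and key are placeholders, and error handling is omitted.

```python
import json
import urllib.request
import uuid

ENDPOINT = "https://my-personalizer.cognitiveservices.azure.com"  # hypothetical resource
KEY = "<subscription-key>"  # placeholder

def build_rank_request(context_features, actions, event_id):
    """Assemble the JSON body for a Personalizer Rank call."""
    return {"eventId": event_id,
            "contextFeatures": context_features,
            "actions": actions}

def post_json(path, body):
    """POST a JSON body to the service and return the decoded response."""
    req = urllib.request.Request(
        ENDPOINT + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
        return json.loads(data) if data else {}

def rank(context_features, actions):
    """Ask Personalizer which action to show; returns (event id, chosen action id)."""
    event_id = str(uuid.uuid4())
    result = post_json("/personalizer/v1.0/rank",
                       build_rank_request(context_features, actions, event_id))
    return event_id, result["rewardActionId"]

def reward(event_id, value):
    """Report how well the chosen action worked, e.g. 1.0 for a click."""
    post_json(f"/personalizer/v1.0/events/{event_id}/reward", {"value": value})
```

A caller might rank two hypothetical layouts for a morning mobile user, display the winner, and send `reward(event_id, 1.0)` if the user clicks; the service folds that feedback into future rankings.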
Problem Solving and 'Nudging' User Habits
As well as providing personalized options for a user, RL systems can also make decisions in a range of other areas. This results in on-the-spot problem-solving. An example that the Microsoft research team is working on is systems that can assess a virtual machine (VM) encountering problems and decide whether to fix the issues or simply reboot it.
Microsoft's Anomaly Detector depends on RL and is already being used to scan Windows, Office, Bing, and over 200 other Microsoft products for spikes and dips in activity, as well as unexpected changes in trends. Building upon this is the new Metrics Advisor, another RL-based system that delves even deeper than Anomaly Detector, gathering more data and suggesting courses of action.
This could be of immense use to current applications by adding features to monitor business metrics, automate computer operations, and predict when maintenance will have to be performed.
RL systems could also go further than making their own decisions, however, even influencing the user to make choices. This has already been demonstrated by getting a user to select a new app or game from the Microsoft or Xbox Live homepage, but it could have more positive effects too.
The Microsoft team is looking at ways that RL could 'nudge' users into adopting healthy behavior — for example, a chatbot could encourage users to walk more by displaying different messages.
Applying Reinforcement Learning to the 'Real World'
Just as feedback is crucial for learning in RL systems, it is also vital for the researchers developing them. Whilst RL is refined enough to be employed by Microsoft in certain applications, when it comes to 'real world' circumstances work still needs to be done.
The Microsoft team points out that the key to this adaptation is getting the reinforcements and 'rewards' right. An important element of this is eliminating bias, not an easy task in RL. Bias in ML can be easily seen in the data sets 'fed' to a system by an operator, but bias in RL is much subtler.
Much of this could be related to teaching RL systems to distinguish between short and long-term rewards. As an example, filling Bing with a huge volume of ads could provide short-term rewards in the form of more click-throughs. Yet, ultimately the long-term effect would likely be increased use of ad-blockers or even the adoption of another search engine.
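The short- versus long-term tension can be expressed with a standard discounted-return calculation. The per-period reward numbers below are invented purely for illustration: a heavy-ads policy that clicks well now but drives users away, versus a lighter policy with steady engagement.

```python
def discounted_return(rewards, gamma):
    """Sum of per-period rewards, each weighted by gamma^t for its delay t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Hypothetical per-period engagement rewards for two ad policies.
heavy_ads = [10, 8, 4, 1, 0, 0]   # big early clicks, then users leave
light_ads = [4, 4, 4, 4, 4, 4]    # steady engagement throughout

# A myopic system (gamma = 0.3) barely values the future; a far-sighted
# one (gamma = 0.99) weighs later periods almost as much as the present.
myopic = (discounted_return(heavy_ads, 0.3), discounted_return(light_ads, 0.3))
farsighted = (discounted_return(heavy_ads, 0.99), discounted_return(light_ads, 0.99))
```

Under the myopic discount the heavy-ads policy looks better; under the far-sighted one the ordering flips, which is exactly why tuning how rewards are weighted over time matters so much.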
Thus, whilst it's easy to teach an RL system with immediate rewards, adapting it to work with delayed rewards and to comprehend how the environment it occupies operates is important. Doing this could make RL systems primed for use in various forms of recognition software and personalized interfaces.
Ultimately, this could require teaching RL systems a modicum of imagination.
1. Sutton, R. S., and Barto, A. G., 'Reinforcement Learning: An Introduction,' MIT Press, [http://incompleteideas.net/book/RLbook2020.pdf]
2. Williams, R. J., 'Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,' Machine Learning, [https://doi.org/10.1007/BF00992696]
3. Azure AI, [https://azure.microsoft.com/en-us/overview/ai-platform/]