Reinforcement Learning Comes Closest to AI Self-Learning – Unravelling the Labyrinth of AI Myths

Discover our AI blog series that debunks misconceptions about the brain power of AI. This is the third blog in the series, but you can also read “AI, Machine Learning and Deep Learning Share Genetic DNA, but Are No Clones” and “AI Does Not Learn by Itself.”

In the previous blog, we addressed supervised and unsupervised learning to dispel the myth that AI doesn’t need any training in order to learn and can just operate as a human brain, learning on its own.

To summarize the points we made, AI technology needs to be trained, whether training is undertaken by the vendor, end user or someone else, and it learns through unsupervised learning or supervised learning. Supervised learning happens when AI is given pre-classified data to examine and categorize. Unsupervised learning is the kind of learning where AI analyzes input data to identify patterns within the data itself.

Who knew reinforcement learning was math?

In this blog post, we’re going to look at reinforcement learning, a third type of AI learning, which is the closest to “self-learning.” Reinforcement learning is, as the name suggests, a kind of learning where AI records the feedback it receives on the actions it takes. It uses that feedback to classify those actions as desirable (and hence to be repeated or improved) or undesirable. This classification involves an algorithm that registers whether an action is rewarded or punished in a specific environment.

An example is Google DeepMind’s AlphaGo, the computer program that plays the board game Go. In simplified terms, when AlphaGo places a white stone at a location and that stone is then surrounded by black stones, it loses the stone and is punished for making that move. After losing several times, the system adjusts to avoid making that move again in similar circumstances and improves its ability to win. Essentially, the system is learning by trial and error based on actions that are rewarded or punished.

Therefore, reinforcement learning helps build systems that use incentivized trial and error to learn the sequence of actions that leads to a bigger, long-term reward, i.e. winning the game rather than winning any single move.
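
To make “incentivized trial and error” a little more concrete, here is a minimal sketch of one common reinforcement learning technique, tabular Q-learning. It is not how AlphaGo works, and everything in it is an assumption made for illustration: a toy “game” played on a line of five positions, where stepping onto position 0 is punished, stepping onto position 4 is rewarded, and the names (step, train, q_table) and the parameter values are hypothetical.

```python
import random

N_STATES = 5          # positions 0..4 on a line (toy environment, assumed)
PIT, GOAL = 0, 4      # terminal positions: one punished, one rewarded
ACTIONS = (-1, +1)    # index 0 = step left, index 1 = step right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    nxt = state + action
    if nxt == GOAL:
        return nxt, 1.0, True     # reward: the "game" is won
    if nxt == PIT:
        return nxt, -1.0, True    # punishment: the "game" is lost
    return nxt, 0.0, False        # most moves give no feedback at all


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values by repeated trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 2, False    # every episode starts in the middle
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                       # explore
            else:
                a = 0 if q[state][0] > q[state][1] else 1  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            # Value of a move = immediate reward + discounted future value
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


q_table = train()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[1:4]]
print(policy)  # the preferred move at positions 1, 2 and 3
```

Note that only the final move of each episode is rewarded or punished directly. The discount factor (gamma, assumed to be 0.9 here) is what lets the earlier moves inherit some of that value, which is the “long-term reward” idea in miniature. The numbers themselves were still chosen by a person.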

Conditions required for reinforcement learning to be effective

Learning by trial and error becomes more complicated when one considers that success in any complex task is the sum of all actions. For instance, in the context of Go, not placing a stone at location 1 on move 1 might be the correct action for that move, but several moves later it may prove to be the decision that lost the game. As a result of this complexity, the following conditions need to be met to optimize reinforcement learning.

  • The system needs to be able to quantify the environment’s variables at every step in the learning process. This is tricky for most real-world problems, but it is necessary so that the information can be abstracted via simulation. For example, to use reinforcement learning to have a system learn how to conduct a contract review, all the variables that go into a contract need to be mapped out comprehensively so that the AI system can be incentivized to produce accurate results. This is a challenging exercise, given the nature, breadth and scope of contracts, the types of review and the inherent ambiguity of contract language.
  • The system needs a concrete reward function, which someone has to design. This again poses a significant challenge. For example, in the context of self-driving cars, complicated ethics come into play. In a road emergency, should the AI system be rewarded for crashing the car and killing the driver to save a pedestrian, or should it be rewarded for killing the pedestrian and saving the driver? Returning to the context of contract review, codifying the “correct” vs. the “incorrect” interpretation of a contract clause into a reward function is extremely tricky because it involves not only interpreting the contract language, but also taking into account various legal rules of interpretation. Lawyer A may understand a sentence in a clause to mean one thing, while lawyer B may understand it in a different way. How can a reward be set up for the right interpretation when there is no single, agreed-upon right interpretation?
  • The system needs to be allowed to make mistakes in order to learn. Even when a reasonable quantification of variables has been achieved and a reward function has been successfully designed, the system still has to make mistakes to learn further. For example, self-driving cars are trained both on real roads with a human supervisor and in simulated environments, where various dangerous scenarios are run so that the system can learn from mistakes without posing a real threat.
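
The difficulty with the second condition shows up even in a toy reward function. The sketch below is purely illustrative and makes several assumptions: the reward function, the sample clause and the interpretation labels are all hypothetical, and a real contract review system would be far more involved.

```python
def reward(predicted, reference):
    """Return +1 if the system's reading matches the chosen reference reading, else -1."""
    return 1.0 if predicted == reference else -1.0


# Hypothetical clause and two hypothetical, equally defensible readings of it
clause = "Either party may terminate on 30 days' notice."
lawyer_a = "notice_period_starts_on_sending"
lawyer_b = "notice_period_starts_on_receipt"

# What the AI system produced for this clause (also hypothetical)
system_output = "notice_period_starts_on_receipt"

reward_under_a = reward(system_output, lawyer_a)  # punished if lawyer A is the reference
reward_under_b = reward(system_output, lawyer_b)  # rewarded if lawyer B is the reference
print(reward_under_a, reward_under_b)
```

The same output is punished under one lawyer’s reading and rewarded under the other’s. The algorithm cannot settle that disagreement. Whoever writes the reward function has to.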

AI technology and the mathematical and algorithmic roots of machine learning use ambiguous nomenclature (e.g., “supervised,” “unsupervised,” “reinforcement,” “learn” and “train”). In addition, the media get caught up in the excitement of artificial general intelligence (e.g., Terminator, I, Robot) and project the ability to self-learn onto what we call narrow or practical AI, the AI that is currently available to businesses. The truth of the matter is that, no matter which kind of current AI learning you examine (supervised, unsupervised or reinforcement), there’s no magic to it, only mathematical algorithms that still need human input.
