120103_001.MP3.txt
The Blue Pill. Self-improving AI. Self-improving AI is a meme that has been circulating since
the 1980s. Current proponents of the idea include Bostrom and Omohundro. My own summary goes
something like this. If we get any kind of AGI going, no matter how slow it is and how
buggy it is, we can give it access to its own source code and let it analyze it, clean
up and fix the bugs, and then rewrite its code to be as good as it can make it.
We then start up the slightly smarter AGI and repeat the process until the AGIs get
superintelligent. On the surface, this is irrefutable. We already have examples of systems
improving themselves. We can buy a cheap 3D printer and then quite cheaply print out parts
for a much better 3D printer. Or we can make computer chips that go into computers that design
better computer chips. Not to mention the evolution of all species in nature. I look at it from an
epistemologist's point of view and say: that's a hard-line reductionist idea that should
not have made it out of the 20th century. The idea, at its inception, imagined an AGI
as something that was written by teams of human programmers using software development
tools and mathematical equations. I think the only outcome that even approximates this
scenario is that the code ends up perfect, and humans as well as machines all agree there are no
more improvements to be made. And the resulting AGIs are still not superintelligent. The
most likely outcome is that we all realize the folly in this argument and won't even
try. It's not about the code. The number of lines of code in AI-related projects has been
declining rapidly. 2012: 34,000 lines, Krizhevsky et al., for ImageNet. 2013: 1,571 lines of
Lua to play Atari games. 2017: 196 lines of Keras to implement Deep Dream. 2018: less
than 100 lines of Keras for research-paper-level results. And all of these, except Cyc, included
as the most famous example of a 20th-century reductionist AI system, demonstrate new levels
of power of machine learning. The limits to intelligence are not in the code. In fact,
they are not even technological. The limit of intelligence is the complexity of the
world. Omniscience is unavailable. The main purpose of intelligence is to guess, to jump
to conclusions on scant evidence, and to do it well, based on a large set of historical
patterns of problems and their solutions, or events and their consequences. Scant
evidence is all we will ever have; we don't even know what goes on behind our backs. And
because our intelligence is guessing, I have repeatedly claimed that all intelligences
are fallible. We are already making machines that are better than humans in some aspects
of guessing. Protein folding and playing Go are examples of this. And these machines
will get bigger and better at what they do and will be superhuman in various ways and
in many problem domains, simply based on a larger capacity to hold, look up, or search useful
patterns. The code doing that can be hand-optimized to the point where any AI improvement
would be insignificant. My own code in the inner loop for understanding any language
on the planet, once it has learned it, in inference mode, is about 90 lines of Java.
We can expect at best minor improvements to efficiency and speed. It comes down to the
corpus. In my domain, NLU, simple tests can be scored at 100% after a few minutes of learning
on a laptop. Continued learning for days and weeks would provide a larger sample set of
vocabulary in appropriate contexts, which would mainly correct misunderstandings in
corner cases. But these corpora are not comparable, by several orders of magnitude, to the gathered
life experience of a human at age 25. The main limit of intelligence is corpus size
in an ML situation. Future artificial intelligences will be nothing like what AGI fans have been
fearmongering about. Those are 20th-century reductionist AI ideas, and their proponents are
blind to the most fundamental basics of epistemology. Reductionist, good old-fashioned AI has been
demonstrated to be inferior in its own domains to even semi-trivial machine learning
methods. We need AGL, not AGI. Machines learning to code. As of this writing, there are a handful
of available code-writing systems based on ML technology that have learned from large
quantities of open source code, for example GitHub Copilot, OpenAI Codex, and Amazon
CodeWhisperer. They have not yet surpassed human programmers. But it's not about writing code
either. AIs writing code is about as silly as AI magazine covers with pictures of robots
typing, wink wink. In the future, if we want the computer to do something, we will have
a conversation, speaking and listening, with the computer. The conversation is at the level
of discussing a problem with a competent coworker or professional. It may spontaneously ask
clarifying questions. I call this continuously rolling topic, mixed-initiative dialogue; others
talk of these bots as dialogue agents. But this will go beyond Siri or Alexa: when
the computer understands exactly what you want done, it just does it. Why would reductionist-style
programming be a necessary step? Yes, there will still be lots of places where we
want to use code. But whether that code is written by humans or AIs will make much
less of a difference than we might expect, based on today's use of computers.
The Pink Pill. The Wisdom Salon. Wisdom Salon is an online World Café. The World Café protocol
is a recipe for organizing conversations that matter on a large scale. Thousands of people
can cooperate in order to bring clarity to complex issues. This is a post-mortem summary
for my interrupted Wisdom Salon project. I have all the code in an archive, but it requires
a complete rewrite in order to fix the two biggest problems: the switch from Flash
to HTML5 for video, and the cost of video connections. I know how to fix these, but I'm
busy working on understanding machines. At the moment, I am looking for someone to take
this over. I also observe that there is a need for something like this. I see things discussed
on Quora that would make good topics for a Wisdom Salon. I happen to believe video and
spoken words are important components, for many reasons.
Wisdom. Knowledge and information can easily be found on the web. But what about wisdom?
Intelligence is based on gathered knowledge. Wisdom is based on gathered experience. To
get wiser, seek out more experiences. Engage yourself. Do more stuff. Travel. Talk to people
and share their experiences. Conversation with others is the easiest way to gain wisdom.
But not all conversations are equal. We want conversations that matter. The World Café
Protocol. The World Café Protocol is a recipe for organizing such conversations that matter
on a large scale. Thousands of people can cooperate in order to bring clarity to complex
issues. To find out more, buy the book or study the World Café website. But this is
how it typically works. In some conference facility or gymnasium, the organizers provide
dozens to hundreds of square tables. Each has four chairs, a box of crayons, and a piece
of butcher paper as a tablecloth. Stakeholders from all walks of life get invited and sit
down at the tables. This could be a mixture of farmers, teachers, and politicians. In corporate
environments, sometimes this is everybody in the company. Organizers now unveil a carefully
phrased focusing question as the topic of the conversations. It is important that the
question is positive and focusing. For education reform, don't ask, what is wrong with our
education system? Instead, ask, what could a great school also be? The four people at
each table now start a conversation around the question. Everyone takes notes on the
butcher paper, using the crayons. After 20 minutes, a gong rings. Three people at each table
(everyone except South, in duplicate bridge terms) get up and move to other tables
at random. Three fresh random people sit down at each table. South now first explains
to the newcomers what the notes on the tablecloth mean. This provides a kind of lightweight
continuity from the previous conversation at this table. The three newcomers comment
on these notes and add fresh comments: the best parts of what was said at their previous
tables. These conversations unfold very naturally. Four strangers can easily have a friendly
conversation about complex things that matter. They don't even have to introduce themselves.
They contribute their wisdom and experiences, not their resumes. Conversations now continue
for another 20 minutes. The gong rings again, and the shuffling repeats. After two to three
hours, the session is over, and the butcher papers are gathered by the organizers into
what is called the harvest. They are summarized some time later. Perhaps after lunch,
the results are shared with all the stakeholders. Why does this work so well? Someone pushing a
bad idea of theirs at every table can spam, at worst, 27 people in three hours. A good
idea, introduced at the first table and repeated by all participants at subsequent tables, will
reach over 100,000 people or the majority of the audience, whichever is smaller. This
is the filtering power of the World Café protocol.
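A rough sketch of the arithmetic behind those two numbers, assuming nine 20-minute rounds in three hours, four seats per table, and perfect retransmission of the good idea (an idealization, of course):

    # Back-of-the-envelope model of idea spread in a World Café session.
    rounds = (3 * 60) // 20      # a three-hour session has 9 rounds
    bad_reach = 3 * rounds       # a spammer only reaches 3 listeners per round: 27
    good_reach = 4 ** rounds     # carriers of a good idea roughly quadruple each round
    print(bad_reach, good_reach) # 27 262144, i.e. well over 100,000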
Wisdom Salon is an online World Café. Sadly, the Wisdom Salon project has been suspended
because of changing infrastructure and cost structure for online video transmissions, and
because of lack of time on my part. It is possible to restart the project using current
video technology, and with funding and a larger team. If interested in contributing to this,
please get in touch. What follows is the original high-level design specification, written
in the present tense. Design specification. The Wisdom Salon is a 24/7 online World Café
implemented as a video chat site. Conversations have four participants, but each conversation
can also have a passive and quiet audience of any size. All conversations are always public.
All conversation participants are known by their login identities. Why would anyone want
to participate? The main purpose of Wisdom Salon is increased wisdom and improved clarity
in complex issues for the participants. This is your main benefit. This is why you would want
to participate. You will not get paid, but you might earn a local currency, called Influence,
that you can selectively use to extend your influence.
Goal. The goal is specifically not to find the best grains of wisdom in the harvest.
The grains are there mainly to provide continuity and shorten the time to get to talking about
things that matter. The system is there to provide the users a chance to analyze large
and complex issues with others, in conversation and in exchange of experiences. Do not
underestimate how different an interactive conversation is from a web search or reading a book.
Have you ever spent days studying something without getting it, only to have someone set you
straight in two minutes of conversation? Have you ever been in a meeting where the resolution
is something none of the participants even understood when the meeting started?
Sample questions. What kinds of questions demonstrate the power of the Wisdom Salon?
Consider these samples. I am considering a midlife career change; what matters? Where
should I retire, and why there? Should I pursue a career in engineering or medicine?
Lifestyle design in interesting times. What is the true promise of genetics research, and
why should I care? What movies should I let my children watch, and why?
Musical education for my child: what matters, what instruments, and why? What is it really
like to be a soldier in places like Afghanistan and Iraq? Should I retire in Costa Rica? User
experience. People arrive when they want and leave when they want. They can engage in multiple
ways. Upon entering the site, users are presented with the currently most popular conversation,
the one with the largest audience. Below the conversation, there will be a list of other
popular conversations, headed by conversations and topics the user may have watched or previously
participated in. They can browse all ongoing conversations, much like watching talk shows
on television. They can select from hundreds of questions to find something that interests
them, or add their own. Instead of a butcher paper, they can leave notes on each question,
known as grains of wisdom, to provide the lightweight continuity from table to table.
They can vote on these grains of wisdom so that the better results rise to the top. Results
are immediately visible to all. They can observe what other people say and how they behave,
and modify their own social graph to improve their chances of interaction with the best
people. A local currency is earned by passive engagement, per hour; more of it is earned
by participating in conversations; and the currency is used to pay for the privilege
of posting a comment. Because posting costs currency, spamming the grains of wisdom will
be limited. A topic without currently active conversations still allows you to browse the
grains of wisdom on the topic, and if you have Influence, you can vote on the grains or notes
that you like or otherwise agree with, and you can restart the topic by creating a table
and hoping others will join.
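A minimal sketch of these currency rules; the class name, earning rates, and posting cost below are hypothetical placeholders, not part of the original design:

    # Hypothetical sketch of the Influence currency rules described above.
    class Account:
        def __init__(self):
            self.influence = 0.0
        def watch(self, hours, rate=1.0):      # passive engagement earns a little
            self.influence += rate * hours
        def converse(self, hours, rate=5.0):   # active participation earns more
            self.influence += rate * hours
        def post_grain(self, cost=10.0):       # posting costs currency, limiting spam
            if self.influence < cost:
                raise ValueError("not enough Influence to post")
            self.influence -= cost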
Four main uses of Wisdom Salon. The site enables, but doesn't enforce, the World Café
protocol. You can use the site for several different purposes. As entertainment and education,
passively watching conversations among your peers, much like flipping channels on television.
To get both factual information and broad-ranging personalized advice from experts. To share
your expertise in fields you understand. To do micro-mentoring. To find an audience for
storytelling and sharing personal experiences from your life. To gain wisdom and personal
clarity in complex issues. To debate the major issues of the day, in person, in productively
selected and well-behaved groups. To find new, interesting, and competent friends by observing
their behavior and then befriending them, much like on other social media. Any active
conversation starts a 20-minute clock bar moving. You can leave anytime; the system provides
some incentive to stay the full 20 minutes. On the other hand, you don't have to leave after
20 minutes. If you like, you can continue the conversation as long as you want. But we expect
a large fraction of people to adhere to the protocol. We believe this maximizes the wisdom
gained per session. Without the right people, the system is worthless.
Do not be discouraged. Facebook would be worthless with only 10 people on it. Wisdom Salon really
requires at least 50 people to be on the system before you are likely to find a conversation
around a question you actually care about anytime you join. Nobody knows if this
will work or not, and it may take a while before the system matures enough to attract
a sufficient repeat audience to become what I designed it for. If you don't like it
at first, please try again. It might well improve, and you might get lucky and get into
an amazing conversation when you least expect it. Welcome to my experiment.
The Lavender Pill. Model-free AI. Don't model the world. Just model the mind. It's a lot
easier. With some poetic freedom, I'd like to claim:
1. Model the world: 10 billion lines of code.
2. Model the brain: 10 million lines of code.
3. Model the mind: 10,000 lines of code.
Number one is regular programming. We make computers perform actions in a context
that matches the programmer's mental model of some relevant parts of the world. Number
two is neuroscience-based models of neurons, synapses, and other biological structures and
systems in brains. Number three is epistemology-based models of learning, understanding,
reasoning, prediction, abstraction, and other holistic and emergent phenomena. Epistemology-based
methods require a rather minimal infrastructure to support whatever operations these concepts
require. I put models within irony quotes because they are, strictly speaking, metamodels,
because they are used at meta scales. They are not about skills, such as English or folding
proteins. They are about how to acquire such skills by learning from our mistakes.
The Purple Pill. Corpus congruence. Understanding in brains and machines can be defined and
measured as corpus congruence, and corpus congruence as a metric spans almost all of NLP.
Let's consider this in the machine learning sense. If a machine is model-free, holistic,
as all general understanders have to be in order to not get trapped in a limited model,
then all it ever knows comes from the corpus it was trained on. And all it really can say
is: this is more like my corpus than that. Or: this is more like these documents in my corpus
than those. This metric spans almost all of NLP because most of NLP is doc-sim in various
guises. Given two documents A and B in some corpus, a classifier can say that an unknown
document, which we can call U, is more like A than B. Given this capability, we can build:
classification and clustering, by using A, B, up to N as defining classes; filtering, by
using A, wanted dox, and B, unwanted dox; sentiment analysis, by using A, negative dox,
and B, positive dox; entity extraction, by softly matching terms against lists of known
entities; and doc-sim: find me more documents like this one. Reductionist NLP uses all of
these at the bag-of-words or word-count level, for things like web search, spam filtering,
and clustering.
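A minimal sketch of the "is U more like A than B" primitive at the bag-of-words level, using plain cosine similarity; this is an illustration only, not the author's code:

    # Bag-of-words "is U more like A than B?" using cosine similarity.
    from collections import Counter
    import math

    def bow(text):
        return Counter(text.lower().split())

    def cosine(c1, c2):
        dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
        norm = math.sqrt(sum(v * v for v in c1.values())) * \
               math.sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0

    def more_like(u, a, b):
        return "A" if cosine(bow(u), bow(a)) >= cosine(bow(u), bow(b)) else "B"

    a = "cats purr and chase mice around the house"
    b = "stocks rose sharply as markets rallied"
    print(more_like("the cats chase mice", a, b))   # A

Classification, filtering, sentiment, and doc-sim are all variations on this one comparison.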
Holistic NLU aims to do the same based on the meanings expressed in sentences and paragraphs.
But semantic corpus congruence is still corpus congruence. Common sense now becomes: is the
proposition before me congruent with my entire world model, as acquired by learning
from my training corpus? If it is well known, then we can likely ignore it this time. If
it is not, then the next question will be: is it close enough that it might be worthwhile
extending the world model with this information? If the answer is no, then the
input is, by this definition, nonsense. Otherwise it is either a new fact or a lie, but since
we cannot tell, we have to accept it, possibly with a note that this is fresh, untested knowledge
that may turn out to be irrelevant, false, counterproductive, or noise.
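That decision procedure can be sketched as code; the congruence function and both thresholds are hypothetical stand-ins for whatever the understander actually computes:

    # Hypothetical sketch of the common-sense acceptance rule described above.
    KNOWN, EXTEND = 0.9, 0.5    # illustrative congruence thresholds

    def consider(proposition, world_model):
        c = world_model.congruence(proposition)   # assumed to return 0.0 .. 1.0
        if c >= KNOWN:
            return "ignore"                       # already well known
        if c >= EXTEND:
            world_model.add(proposition, status="fresh, untested")
            return "accept provisionally"         # new fact or lie; cannot tell
        return "nonsense"                         # incongruent with everything known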
Next we can note that it doesn't matter whether the documents are text or images, or input
from a point cloud from robot or autonomous vehicle sensors. And finally we can note that this
definition also holds for humans, if we take our corpus to be everything we have experienced
since birth.
Monika's Little Pills
Chapter 1
Why AI Works
Intelligence equals understanding plus reasoning. Interest in artificial intelligence is exploding,
and for good reasons: computers in cars, in phone apps, and on the web can do amazing
things that we simply could not do before 2012. What's going on? This is an attempt
to explain the current state of AI to a general audience, without using mathematics, computer
science, or neuroscience; discussions at those levels focus on how AI works. Here I
will discuss this at the level of epistemology, and will try to explain why it works. Epistemology
sounds scary, but it really isn't. It's mostly scary because it is unknown; it is not taught
in schools anymore, which is a problem, because we now desperately need this branch of philosophy
to guide our AI development. Epistemology discusses things like reasoning, understanding, learning,
novelty, problem solving in the abstract, how to create models of the world, etc. These
are all concepts one would think would be useful when working with artificial intelligences,
but most practitioners enter the field of AI without any exposure to epistemology, which
makes their work more mysterious and frustrating than it has to be. I think of epistemology
as the general base for everything related to knowledge and problem solving. Science forms
a small, special-case subset domain, where we solve well-formed problems of the kind that
science is best at. In the epistemology outside of science, we are free to also productively
discuss pre-scientific problem-solving strategies, which is what brains are using most of the
time. More later. Intelligence equals understanding plus reasoning. In his book, Thinking,
Fast and Slow, Daniel Kahneman discusses the idea that human minds use two different and
complementary processes, two different modes of thinking, which we call understanding and
reasoning. The idea has been discussed for decades and has been verified using psychological
studies and by neuroscience.
Subconscious intuitive understanding is the full name of fast thinking, or System 1
thinking. It is fast because the brain can perform many parts of this task in parallel.
The brain spends a lot of effort on this task. Conscious logical reasoning is the full name
of slow thinking, or System 2 thinking. To many people's surprise, this is very rarely used
in practice. My soundbite for this is: you can make breakfast without reasoning. Almost
everything we do on a daily basis, in our rich mundane reality, is done without a need to
reason about it. We just repeat whatever worked last time we performed this task. It is
real-experience driven. Intuitive means that the system can very quickly provide solutions
to very complex problems, but those solutions may not be correct every time. Logical means
that answers are always correct as long as input data is correct and sufficient, which is
not true in our rich mundane reality. It can only be true in a mathematically pure model
space. If you like logic, you must also like models. Subconscious means we have no conscious,
introspective access to these processes. You are reading this sentence and you understand
it fully, but you cannot explain to anyone, including yourself, how or why you understand it.
Conscious means we are aware of the thought; we can access it through introspection, and we
may find reasons for why we believe a certain idea. Expensive is on the list because brains
spend most of their effort on this understanding part. We really shouldn't be surprised
that AI now requires very powerful computers. More later.
In contrast, reasoning is efficient. It is most useful when you are stuck in a novel
situation or experience and understanding doesn't help you. Or perhaps you need to plan ahead,
or need to find reasons for why something happened, after the fact. It is used at a formal level
in the sciences. Reasoning is important, but just rarely needed or used. Finally, understanding
is model-free and reasoning is model-based. This is likely the most important distinction
to people who are implementing intelligent systems, since it provides a way to keep the
implementation on the correct path when the going gets rough. We cannot discuss these
issues quite yet, but if you are curious you can watch the videos at Vimeo.com, which discuss
this distinction at length. Think of its appearance in this table as a kind of foreshadowing.
All of this groundwork allows me to state the main point of this section. We have known
for a long time that brains use these two modes. But the AI research community has been spending
overmuch effort on the reasoning part and has been ignoring the understanding part for
60 years. We had several good reasons for this. Until quite recently, our machines were too
small to run any useful-sized neural network. And also, we didn't have a clue about how
to implement this understanding. But that is exactly what changed in 2012, when a group
of AI researchers from Toronto effectively demonstrated that deep neural networks could
provide a simple kind of shallow and hollow proto-understanding. Well, they didn't call
it that, but I do. I will look just a little into the future, and overstate this just a
little in order to make it more memorable: deep neural networks can provide understanding.
This new phase of AI took decades to develop, but it would never have happened without people
like the group led by Geoffrey Hinton at the University of Toronto, who spent 34-plus years
developing the deep neural network technology we now call deep learning. A number of breakthroughs
from 1997 to 2006 led to a number of successful demonstrations, including first prizes in
AI competitions in 2012. We therefore count this year as the birth year of machine
understanding. To an outsider, it may look like an older program or phone app might be
understanding whatever the app is doing, but that understanding really only happened in
the mind of the programmer creating the app. The programmer first simplified the problem
in their own head by discarding a lot of irrelevant detail, using the programmer's understanding.
The simplified mental model of the problem domain could then be explained to a computer
in the form of a computer program. What is changing is that computers are now making
these models themselves.
The first bullet point describes regular programming, including old-style AI programs. AI
has, since 1955, provided many novel and brilliant algorithms that we now use in programs
everywhere. But when you contrast old-style AI to understanding systems, the old kind of AI
is basically indistinguishable from any other kind of programming we do nowadays. The second
bullet point describes the recent developments. Deep neural networks are so different from
regular programs that we have to acknowledge them as a different computational paradigm.
This is why they took almost four decades to develop. And the paradigm, being pre-scientific
and model-free, is difficult to grasp if you received a solid reductionist and model-based
education. It takes a long time for an established AI practitioner or experienced programmer
to switch. People who are just starting out in AI have an easier time assimilating this new
paradigm, since they haven't had a full career's worth of experience and success using
old-style AI techniques. The amount of work we have to do to get a deep neural network to
understand is surprisingly small, and companies like Google and Syntience are working on
eliminating the remaining effort of programming neural networks. This is where things will
get really weird. When the deep neural network, DNN, understands enough about the world and
about the problem it is faced with, then we no longer need a programmer to acquire this
understanding.
Let me elaborate. Programmers are employed to bridge two different domains. They first have
to study whatever application domain they are working on. For instance, if they are writing
an airline ticket reservation system, they will have to learn a lot of detailed information
about airlines, airline tickets, flights, luggage, etc., and then know to provide features
for unusual cases, such as cancelled flights. And then the programmer uses their understanding
of the problem domain to explain to a computer how it can reason about these things. But the
programmer cannot make the system understand; they can only put in the hollow and fragile
kind of reasoning, as a program with many if-then cases, and any misunderstandings the
programmer has about the problem domain will become bugs in the computer program. Notice
the shift in terminology. More later.
But today, for certain classes of moderately complex problems, we can use a DNN to automatically
learn for itself how to understand the problem, which means we no longer need a programmer
to understand the problem. We have delegated our understanding to a machine, and if you
think about that for a minute, you will see that that's exactly what an AI should be doing.
It should understand all kinds of things, so that we humans won't have to. And there
are two common situations where this will be a really good idea. One is when we have
a problem we cannot understand ourselves. We know a lot of those, starting with cellular
biology. The other common case will be when we understand the problem well, but making
the machine understand it well enough to get the job done is cheaper and easier than any
alternative. Roombas accomplish this level of understanding using old-style AI methods, but
I predict we will one day be flooded with similar, but DNN-based, devices that understand
several aspects of domestic maintenance as well as we do.
Do machines really understand? If we give a picture like this to a DNN trained on images,
it will identify the important objects in the image and provide rectangles, called bounding
boxes, as approximations of where the objects are. The text on the right says, woman
in white dress standing with tennis racket, two people in green behind her, which is not
a bad description of the image. It could be used as the basis for a test of English skill
level for adult education placement. For all practical purposes, this is understanding. We
had no idea how to make our computers do this before 2012. This is a really big deal. This
feat requires not only a new algorithm, it requires a new computational paradigm. An image,
to a computer, is a single long sequence of numbers denoting values for red, green, and blue
colors, with values from 0 to 255. It also knows how wide the image is.
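To make that concrete, here is a tiny sketch of what an image looks like from the computer's side; the pixel values are made up:

    # An image, to a computer: one long flat sequence of 0..255 color values.
    import numpy as np

    flat = np.array([255, 0, 0,    0, 255, 0,     # two pixels: red, green
                     0, 0, 255,    255, 255, 0],  # two more: blue, yellow
                    dtype=np.uint8)
    width = 2                            # the one extra fact the computer has
    image = flat.reshape(-1, width, 3)   # rows x columns x (R, G, B)
    print(image.shape)                   # (2, 2, 3)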
How does it get from this very low-level representation to knowing that there is a woman
with a tennis racket in the image? This is what William Calvin has called a river that flows
uphill. There are very few mechanisms that can go in this direction, from low levels to high
levels. Calvin used the term to describe evolution, and I can use this quote to describe
understanding. I like to think of evolution as nature's understanding, because the phenomena
are very similar at several levels. Evolution of species can bring forth advanced species
starting from simpler species, in the same manner that understanding is the discovery and
reuse of high-level concepts in low-level input.
In contrast, reasoning proceeds by breaking problems into sub-problems and solving those,
which is a flowing-downhill kind of strategy. In mathematics we accept, and many mathematicians
only accept this reluctantly, that we need to use induction to move uphill in abstractions,
and that's a very limited uphill movement at that. Epistemology allows for much stronger
uphill moves. This is known as jumping to conclusions on scant evidence, and it's allowed
in epistemology-based, pre-scientific systems. As an aside, here's a pretty deep related
thought. In nature, evolution reuses anything that works. I like to think that understanding
is a spandrel of evolution itself. Neural Darwinism certainly straddles this gap. Could
be coincidence, or the only answer that will work at all. More later.
We doubled our AI toolkit in 2012. We can now use these deep neural networks as components
in our systems to provide understanding of certain things, like vision, speech, and other
problems that require that we discover high-level concepts in low-level data. The technical,
epistemology-level name for this uphill flow in processes is reduction, and we'll be using
that term later, after we explain what it means. Let's look at what the industry is doing
with their newfound toys. This is my view of what I think Tesla is doing, based on public
sources, in their self-driving, Autopilot, cars. Cameras feed vision-understanding components
based on deep learning, and radar feeds radar-understanding components. These supply bounding
boxes in 2D or 3D, with additional information like, there's a woman with a tennis racket
ahead, to a traffic-reasoning component that uses regular programming, or some old-style
AI like a rule-based system, to actually control the car based on the vision and radar
inputs and the driver's desires.
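A skeletal sketch of that hybrid architecture, understanding components feeding a reasoning component; every type, label, and rule here is invented for illustration and is not Tesla's actual design:

    # Hypothetical sketch: DNN understanding components feed a rule-based reasoner.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class BoundingBox:
        label: str          # e.g. "pedestrian", "car"
        distance_m: float

    def vision_dnn(camera_frame) -> List[BoundingBox]:
        # Stand-in for a trained network; returns what the DNN "understands".
        return [BoundingBox("pedestrian", 25.0)]

    def traffic_reasoner(boxes: List[BoundingBox], driver_wants: str) -> str:
        # Old-style, rule-based reasoning over the understood scene.
        if any(b.label == "pedestrian" and b.distance_m < 30 for b in boxes):
            return "brake"
        return driver_wants

    print(traffic_reasoner(vision_dnn(None), "cruise"))   # brake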
But this is not the only possible configuration. George Hotz at Comma.ai, a team at NVIDIA
Corporation, and the DeepTesla class at MIT are using a simpler architecture, with just a
neural network that implements lane following and other simple driving behaviors directly,
in one single deep neural network. There's room for improvement, but it's a big step in the
direction we want to move in. Future automotive systems will likely integrate everything
about driving into one single neural network, or something that effectively behaves as one:
vision, traffic, the car itself, including various functionality like windscreen wipers,
lights, and entertainment, how to drive in a safe and polite manner, and understanding the
driver's or car owner's desires. And if we've gotten that far, then it is a given that we
will have speech input and output, so that the driver can have a conversation with the car
while driving, and can just advise it in case it does something wrong. We are nowhere close
to this today. But after a DNN breakthrough or two, who knows how quickly these kinds of
systems become available. We can already see an increasing stream of new features built
using understanding components.
This article, and the next, are expansions of a talk given on June 10, 2017, at the San
Francisco BIL Conference. A decade ago I created artificialintuition.com. I now have
a lot more to say, but I need to split this meme package into digestible chunks. This
takes a lot of effort to get right. If you liked this article and would like to see more
like it, then you can support my writing and my research in many ways, small to large.
Like and share these ideas with someone who might want to invest in Syntience Incorporated,
or might be otherwise interested in my research on a novel language understanding technology
called Organic Learning. More on that later. I do not receive external funding from any
investors for this research. You can support my research and writing directly at the donation
section at artificialintuition.com.
Chapter 2
Our Greatest Invention, Model-Based Problem Solving
The first chapter, Why AI Works, provided the big picture of AI and understanding
machines. Next we will focus on how to actually implement understanding in a computer. But
before we can attack that core issue, we need to simplify the journey a bit by defining
four important words and concepts. I'll define one in this section, two in the next, and
the concept of reduction after that. We can then discuss the epistemology-level algorithm
for understanding itself. If you are already familiar with these concepts, just check the
for understanding itself. If you are already familiar with these concepts, just check the
headings and definitions that follow in order to ensure we are using these words roughly
the way you use them. You may have noticed I write certain, sometimes common words,
such as model, with an uppercase first letter. This means I am using the word in a technical,
well-defined, unchanging sense. I will define all such technical terms over time and I will
try not to use these terms until I have defined them. We define 11 such terms in the first
chapter, starting with understanding and reasoning. A dictionary of defined terms is in the works.
Models are simplifications of reality. In epistemology and science, models are simplifications
of reality. Our rich mundane reality is too complex to lend itself directly to computation.
In old TV science fiction shows, we would sometimes hear: and then we fed all the information
into the computer, and this is what came out. Well, not anymore. Audiences now know that's
not how regular computers work. Consider an automobile. It consists of thousands of parts,
each with properties like materials, size, color, function, and sometimes complex interactions
with other parts. What's all the information here? We can't just feed all of those properties
and measurements and facts into a computer and expect to get an answer. We need to ask
a question, and we also need to simplify the problem so that we can feed in just the facts
or numbers that matter, so that our question can be answered with minimum effort.
How do we do that? We must identify or create, first in our minds, a very simple model of
some sort of a generic automobile, and use that model for our computation. After we get the
answer for the pure and simple model case, we apply the answer, with some care, back
to our complex reality, where the real automobile and the problem exist. What kind of model
we choose depends on our goals. As an example of a model, Newton's second law states that
force equals mass times acceleration, F = ma. This equation is a classical scientific
model. If we measure the mass and acceleration of a car, then we can estimate how many
horsepower the engine has. To use this equation, we engineers would model, in our minds,
the car as a single small point mass, with all the mass of the car in that point. Because
if we don't, then we'd have to worry about the car rotating, and other problems.
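As a worked example, with made-up numbers: a 1,500 kg car accelerating at 3 meters per second squared while moving at 20 meters per second needs 4,500 newtons of force and about 90 kilowatts at that instant, roughly 120 horsepower:

    # Worked example: estimating engine power from F = m*a (made-up numbers).
    m = 1500.0        # car modeled as a point mass, kg
    a = 3.0           # measured acceleration, m/s^2
    v = 20.0          # speed at the moment of measurement, m/s

    F = m * a         # Newton's second law: 4500 N
    P = F * v         # instantaneous power: 90000 W
    print(P / 745.7)  # about 121 horsepower, ignoring friction and drag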
This is how model-based science works. One or more scientists somehow derive a model for
some phenomenon. The model is published as an equation, a formula, or a computer program.
Scientists and engineers anywhere can now use this equation, program, or model, treating it
as a quick shortcut that works every time, as long as they have correct input data and are
competently applying the formula to a suitable problem in their reality.
Our greatest invention. Model-based problem solving, aka reductionism, is the greatest
invention our species has ever made. The general strategy of simplifying problems before
solving them must be tens of thousands of years old. In some sense, it is a prerequisite
for all other inventions, including the use of fire. If you see a forest fire, then you need
to first imagine the utility of fire, as a model, before you can figure out that it might
be useful to carry home a burning branch. We don't think of this problem-solving strategy
as an invention because it is already ubiquitous in our lives. We are all taught how to use
model-based problem solving in school, when we start solving story problems in math class,
but most people never learn the names of these strategies and are missing the big
epistemology-level picture. This rarely matters, until you start working with AI, where lack
of an epistemological grounding may lead you astray, into failing strategies. These little
pills are an attempt to remedy that.
Model-based methods were examined and refined into scientific methods over the past 450
years. Science is now a collection of thousands of models that, taken together, allow
science-competent people to solve problems quickly and efficiently, without having to redo
all the work that scientists, like Newton, put into creating these models in the first place.
And the sum total of those models covers many problems we want to solve scientifically,
such as how to build a bridge or travel to the moon. This reuse is what makes science
so effective. But not all sciences can benefit equally from this model-making. It works well
for physics, chemistry, and most of biochemistry. As I'm fond of saying, physics is for simple
problems. But as you get to more and more complex sciences, as you get further away
from physics and closer to life, it gets harder to make decent models. The models used by,
for instance, psychology, ecology, physiology, and medicine are generally more complex but
also less powerful than models in physics. Given some solid data, a physicist can compute
the mass of the proton to six decimal places, but we would have a harder time predicting
the number of muskrats in New England next summer, because that outcome depends on millions
of parameters. The life sciences base many of their models on statistics. Statistical
models are among the weakest models used in science. These statistical models are used when
more powerful models with better predictive capabilities cannot be used, for complexity reasons.
Models are: hypotheses, which are unverified models; scientific theories, which are models
verified by peer review; equations and formulas; complex scientific models, such as simulations
of climate, weather, etc.; naive models that we create to simplify our own lives; and computer
programs. And what is mathematics? It is a system that allows us to manipulate our models to
cover more cases. Mathematics is the purest, most context-free of all scientific disciplines.
As such, its greatest value to humanity is in its role as a helper discipline to all other
disciplines. Einstein's famous E = mc squared model was derived using mathematical manipulation
of other models known to Einstein at the time. But perhaps mathematics isn't as much a
scientific discipline as an epistemological one. I may explore this aside later. Model use
requires understanding.
A good model is context-free, since that maximizes the number of contexts it can be applied in.
Newton's second law, F = ma, works pretty much everywhere we have forces, masses, and
accelerations. The trade-off for this flexibility is that we ourselves need to understand the
problem domain. In rocket science, when maneuvering in space, F = ma will often work perfectly.
But when you are applying it to the acceleration of your car, you need to account for lots of
effects, like friction between the road and the wheels, wind resistance, and the like.
So F = ma, applied naively, would give you the wrong answer if friction is involved.
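Continuing the earlier car example with made-up numbers: if drag and rolling resistance consume, say, 900 newtons, the naive F = ma estimate understates what the engine must supply:

    # Naive vs. friction-aware use of F = m*a (illustrative numbers only).
    m, a, v = 1500.0, 3.0, 20.0    # kg, m/s^2, m/s
    losses = 900.0                 # assumed drag + rolling resistance, N

    F_naive = m * a                # 4500 N: what the bare model says
    F_engine = F_naive + losses    # 5400 N: what the engine actually supplies
    print(F_engine * v / 745.7)    # about 145 hp instead of about 121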
This demonstrates the main disadvantage of models. They require that both the model maker,
scientists like Newton, and the model users, STEM-competent people everywhere, understand
enough about the problem domain to know whether the model is applicable or not, and how to
use it. This understanding is the expensive part of science, since using science requires
first getting a solid science education, in order to avoid mistakes when using models.
And since models require understanding, they cannot be used to create understanding.
This is a major problem for AI implementers.
Chapter 3
Two Dirty Words
Reductionism is the use of models. Holism is the avoidance of models.
Models are scientific theories, hypotheses, formulas, equations, superstitions,
and most computer programs.
Reductionism and Holism. After having sorted out what models are, we can now discuss two
complementary problem-solving strategies, or perhaps meta-strategies. They are in many ways
each other's opposites, but the classification can become an argument about levels and
definitions. I will initially pretend the division is clear and obvious, and will elaborate later.
Reductionism is the use of models. In this series we will use exactly the above definition of the
word, reductionism. If you look up the definition elsewhere, you may find that some sources divide
the strategy into sub-strategies. They also seem to miss the most important sub-strategy,
which we'll discuss later. But what all these sub-strategies have in common is that they all
provide ways to simplify observations of fragments of our rich mundane reality into much simpler
models, which we use for reasoning, computation, and sharing. Reductionism is so central to how
we do science, with its heavy reliance on models such as theories, equations, and formulas in
physics, chemistry, etc., that we can speak of model-based sciences, or reductionist sciences,
where such model-making is easy and effective. This classification excludes those sciences,
like psychology, where such model-making is difficult and less often rewarded with reliable
results. After considering all the advantages of models, we might wonder why we even bother
discussing anything else. To many people, especially those with a solid STEM, science, technology,
engineering, and mathematics, education, it may well look like the only choice.
But there's also the other strategy.
Holism is the avoidance of models. This is where the questions start. This is where the
paradoxes surface. This is where your worldview may get shaken up. Seriously, especially if
you are a scientist or engineer with a solid STEM education and decades of professional success
using science and models. In some sense, the goal of this entire series is to demonstrate that
we need to use both problem-solving strategies when creating our artificial intelligences,
because that is what it is going to take. We need holistic understanding. We established that
in the first chapter. As a sample of the new ideas that we will have to deal with, I will
just mention:
Reasoning is reductionist. Understanding is holistic.
Neural networks are holistic.
Holistic systems can jump to conclusions on scant evidence.
Holistic systems can themselves know what is important and what isn't.
Holistic systems can solve problems we ourselves cannot or don't care to understand.
Holistic systems are model-free. They do not use any a priori models of any problem domain.
Reasoning systems inherit all problems and benefits of reductionism.
Understanding systems inherit all problems and benefits of holism.
Humans are born holistic. Humans each solve thousands of little
problems every day, and we are solving almost all of these problems holistically, using
understanding, and without a need to reason at all. This includes fluent language use.
A STEM education instills a strict reductionist discipline in order to mitigate problems
with the fallibility of holistic human minds. Our intelligences are fallible.
These claims all deserve individual treatments, and we'll get to all of them in later sections.
But the major theme is clear: humans are mainly holistic problem solvers.
This must be true for our artificial intelligences as well. We had several reasons for focusing
on reductionist methods, models, and reasoning during the first 60 years of AI. Our computers
were too small to make neural networks work at all. But there were also ideological reasons.
AI was born out of the math and computer science departments of our universities, and therefore
most of the people working on AI were solidly oriented towards the goal of creating a
logic-based, reductionist, infallible artificial mind. To build early AIs, like expert systems,
we entered rules or programmed in lots of facts to reason about. But this was building
reductionist castles in the air, comprised of unanchored facts that didn't tie to any
understanding whatsoever. The troubles with classical AI, such as brittleness, the tendency
to make spectacular and expensive mistakes at the edges of their competence, can be directly
traced to the lack of foundational understanding to support these attempts at reasoning.
Understanding machines will not suffer from this brittleness, but will fail gracefully at the
edges of their competence, much like humans. Most of the time they will know the answer; beyond
that, they will guess, and the guesses they make are based on a lifetime of experience, gained
through learning from a large corpus, and so they have a good chance of being at least a
workable choice, if not perfect.
How can anyone solve problems without using models? A lot of people coming from a STEM
background cannot even imagine how to solve problems without using models. But it's not
hard, once you understand the difference. Mostly, it's a matter of doing what worked last time.
The problem is now figuring out whether we are in a situation that's similar enough that it will
work again. This is mostly a pattern matching problem.
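A minimal sketch of doing what worked last time, rendered as nearest-neighbor pattern matching over remembered situations; the feature encoding and the remembered cases are made up:

    # Model-free problem solving as nearest-neighbor lookup over experience.
    import math

    memory = [   # (situation features, action that worked) - hypothetical data
        ((0.9, 0.1), "take umbrella"),
        ((0.1, 0.8), "wear sunscreen"),
    ]

    def act(situation):
        # Reuse the action from the most similar remembered situation.
        features, action = min(memory, key=lambda m: math.dist(m[0], situation))
        return action

    print(act((0.8, 0.2)))   # "take umbrella": a fast, fallible guess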
More later. What's the result? The holistic answer is a quick guess at the best action,
based on experience with similar situations. Most of the time it's correct, sometimes it's
a little wrong, and every now and then there's a noticeable mistake. And if we get things
a little wrong, we may notice the outcome and correct the action. We learn from our mistakes.
If we practice something a lot, we will start doing it effectively and perfectly every time.
Do we learn faster if we make more mistakes? Should we make mistakes on purpose? More later.
In situations where you cannot use models, which are more common than many realize, the
holistic guess may also be your only option. Conversely, if you have an adequately well-working
model-based solution, just use that. My video, Model-Free Methods Workshop, demonstrates how
a group solves four different problems, at a high level, using both reductionist and
holistic methods.
Why are these dirty words? Well, they are not dirty to epistemologists. Reductionism has been
the default problem-solving paradigm because it's the one that has to be taught. We are born
with a holistic problem-solving apparatus, but reductionist science doesn't come naturally.
Therefore, it has to be taught in schools, practiced, and carried out according to certain
rules. Perhaps that's why the sciences are called disciplines: because following the ideal
scientific method requires practice and constant vigilance. Jan Smuts' book, Holism and
Evolution, 1926, established the terminology in the epistemological literature. Erwin
Schrödinger wrote What is Life?, 1944, questioning the power of physics to provide useful
explanations to the life sciences. Robert Pirsig's Zen and the Art of Motorcycle Maintenance,
1974, contrasted something very holistic, Zen Buddhism, with something very reductionist,
motorcycle maintenance. So the chasm between the strategies was identified a long time ago.
The strategies are each other's opposites. Holism-based strategies for understanding can
handle many important kinds of complexity and can quickly provide a guessed answer. But these
guesses are fallible, and often more expensive to compute. Reductionist education and
strategies brought the benefits of cheap model reuse and formal rigor to improve correctness,
but they cannot handle complexity and are therefore dependent on an external understander
to determine applicability in real-world, complexity-rich situations. And as part of that
education, we are told that holistic methods, such as jumping to conclusions on scant evidence,
are bad, in spite of the fact that our brains use holistic methods thousands of times each day
to successfully understand the environment we live in. We can all use either strategy as
appropriate. If we don't have a STEM education, we will still sometimes make naive models.
But sometimes there is a choice, and different people may prefer one or the other.
When playing pool, some people estimate and compute bouncing angles, and some people shoot
by feel. But we have our preferences, and it may be tempting to label a person with an overly
strong preference as a holist or a reductionist. This is sometimes received badly, if perceived
as a limitation. Some dictionaries even flag reductionist as derogatory. And yet, some people
use it as a self-assigned label. I try to use these terms only as shorthand for a person with a
stated strong preference for holistic or reductionist methods. The two terms were very useful
in epistemology. But then someone invented the concept of holistic medicine.
Instead of just treating a single medical problem, you analyze the patient's entire situation,
attempting to account for diet, exercise, sleep, work habits, stress levels, allergies, family,
friends, and environmental poisons. A good idea, in general. But the wide scope was unmanageable
by the traditionally reductionist medical establishment, and the idea faded away. Instead, the
whole idea of holism became tainted as woo-woo when the term holistic medicine became associated
with woo-woo merchants selling crystals and aromatherapy. As explained above, holism is the
avoidance of models, or, better phrased, holism is the meta-strategy of avoiding a priori models
of the problem domain. That extra precision rarely matters. There's nothing woo-woo about it.
It does say, science not required. But then, you can make breakfast without reasoning. It is
important to note that holistic methods are based on a lifetime of experience, in humans, and
a training corpus worth of experience, in neural networks. When you're making breakfast, you
are relying on this experience, mostly repeating whatever worked yesterday. Some people claim
they use reasoning while making breakfast, but they can make their breakfast while speaking to
someone else on the phone. And as they hang up, they find themselves suddenly sitting at the
breakfast table with their coffee and hot oatmeal. Same thing when driving to work. You may get
lost in thought, and then you find yourself parked at work. You didn't need to reason, since all
sub-problems that occur in driving had been solved multiple times, during years of driving.
Subconscious understanding is used for simple things, like sequencing our leg muscles as we
walk. You have no idea how you are doing that; it just works. Same thing with vision. You
understand that you are looking at a chair, but you do not have conscious access to the 15th
rod or cone pixel to the left of your center of vision, and you have no idea how this
understanding works. Same thing with understanding and generating language. You do not have
any explanation for how you are able to understand the meaning of this sentence.
Understanding is subconscious
and holistic. So for the majority of things we do every day, we do not need reasoning or
reductionist methods. Some people would like to think they are logical thinkers, immune to
most cognitive fallacies, but whether they are or not, at the lower levels, everyone is solving
most of their problems holistically. I claim that reductionist reasoning requires holistic
understanding. In other words, I need to understand the problem domain at hand before I can create
and reuse models to enable me to reason about the domain. So holistic understanding is much
more important than reductionist reasoning because it is the most used strategy, by far,
and the former is also a prerequisite for the latter. But the fallibility of holistic understanding
forced us to create reductionist science and to teach it in STEM education. It is as if the purpose
of science is to keep holistic guessing in check, but this aversion to fallibility has a cost,
because it means complexity-bound and irreducible problems cannot be solved, like language
understanding, global resource allocation, and social interactions. Reductionism and model-based
science appeared around 1650, after a century of gestation. Excluding minor Romantic interludes,
it has held its position as the dominant paradigm for about 400 years. This is changing.
The reductionist train is running out of track. The remaining hard problems facing humanity
are problems of irreducible complexity in domains where reductionist model-based methods
simply cannot work. Whether we like the idea or not, we need to accept these holistic methods
into our AI toolkits. Starting now, we will use these methods either in their raw form,
as model-free methods, or as understanding machines at any level from component to robot
co-worker. Chapter 4. Reduction. Epistemic reduction is a process that discovers higher-level
abstractions in lower-level data by discarding everything at the lower layer that it recognizes
as irrelevant. We have seen the power of models. We have introduced the two problem-solving
meta-strategies of reductionism and holism. We also noted that the creation and use of models
requires an intelligent agent that understands the problem domain. Someone or something has to
perform the reduction. I will now discuss reduction in some detail. Until 2012, only humans and other
animals with brains could perform reduction. Now our deep neural networks, DNN, can perform
limited reduction. How do brains and DNNs accomplish this? And how can we improve these algorithms?
This may be, to some readers, the most rewarding part of this series, because it provides you
the opportunity to learn a new and useful skill. Most people never think about the world at this
level. Knowledge of reduction provides a new point of view that you can use to better understand
your environment, other intelligent agents around you, and modern AI systems.
Definition of reduction. Reduction is a process that discovers higher-level abstractions
in lower-level data. We will initially note that reduction is exactly the same as abstraction.
Why do we need a new word? Because the term abstraction is mostly used
by scientists already operating in a pure model space, seeking a higher level of abstraction
in that space. But to them, abstraction is something that just magically happens in their
heads, since there are no scientific theories for how abstraction works. There cannot be,
since abstraction is a concept in epistemology, not science. AI researchers are starting from
something much closer to a rich mundane reality, where there is a lot of confounding context.
We are solving the meta-problem of how to move from there into a space that is sufficiently
abstract to solve the problem at hand. Here, reduction is a much more appropriate term.
We can abstract the red pixel or the letter B, but we can reduce a rich context containing
that pixel or letter into a higher-level concept. We are swimming in reduction.
Paradoxically, one of the hardest things about teaching reduction is that we don't see the
need to learn about it because we all do it all the time, every millisecond, and the resulting
reductions, models, become available to our conscious minds as if by magic. Brains reduce
away 99.999% of their sensory input, but this process is subconscious and hence invisible to us.
The situation is much like that of the proverbial fish swimming in water. We are all masters of reduction,
but we don't know how we do it or that we even do it. We didn't know this would ever matter.
And generally, it doesn't. Well, it matters in epistemology, and it matters in AI,
since we need to actually implement that magic. We as epistemologists must know how abstraction
is actually performed, and we give the epistemology-level equivalent of abstraction the name
reduction, because that's the recipe for how to accomplish it. We reduce our rich mundane
reality by discarding, reducing away, what's irrelevant. And by using the name reduction,
we, as AI epistemologists, keep reminding ourselves how it is properly done.
Consider the following descriptions of a car. The slide is meant to be read from the bottom up,
to match abstraction levels from low to high. If I'm driving to work, I had better be driving my car.
If the police are looking for a stolen car, they would be looking for a red 2010 Toyota Celica.
If I'm buying a new car, then I might be looking for just a new Toyota Celica.
And a self-driving car would likely only need to understand whether an obstacle is a vehicle or
not, in order to model maximum speed for future movement. We see that we want to pick the appropriate
level of abstraction to deal with the same object, or topic, in different situations.
But more importantly, we see that we can get from a more detailed description,
at the bottom, to a more generic one, higher up, by simply discarding some detail.
I hasten to point out that reduction is more complicated than this simple example of decreasing
specificity shows. But we need to start somewhere, and this image allows us to form intuitions that
will serve for a while. True reduction involves operations like shifting from syntax to semantics
or from instance to type. The appearance of car as an abstraction of Toyota, and the step from
my Toyota to a Toyota, illustrate these steps. Algorithms for these things are known.
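To make the discarding operation concrete, here is a minimal sketch in Python. It is entirely my own framing of the car example, not anything from the original slide; the property names are invented for illustration.

```python
# A toy sketch of reduction as plain discarding of detail.
# Each level of abstraction keeps fewer properties of the same car.

description = {"owner": "mine", "color": "red", "year": 2010,
               "make": "Toyota", "model": "Celica", "kind": "vehicle"}

def reduce_description(desc, keep):
    """Discard every property not listed in keep."""
    return {k: v for k, v in desc.items() if k in keep}

# Bottom-up levels of abstraction, as in the example above:
driving_to_work   = description                                         # my red 2010 Toyota Celica
stolen_car_search = reduce_description(description,
                                       {"color", "year", "make", "model"})
car_shopping      = reduce_description(description, {"make", "model"})  # a Toyota Celica
obstacle_check    = reduce_description(description, {"kind"})           # just: a vehicle
```

Each step up simply throws information away; nothing is added.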
Salience. Part of the trick is to know what to discard. At each level of abstraction,
something can typically be identified as the least important property. Red and Celica are more
significant than 2010 for anyone looking for a car. If we had started from my red 2010 Toyota
truck, then the word truck would not be discarded until the top level. Reduction requires understanding
what's relevant. In reduction we keep that which is salient. More later. Partial reductions.
Most of the time we do not perform reduction all the way to models. I cannot stress this enough.
We discuss reduction to models for pedagogical reasons. It is easy to initially see the context
free model as the goal of reduction. In reality, in brains, we can stop reducing the moment we
recognize that we have a working answer or response, such as a command to contract some muscle or
having understood the meaning of a sentence subconsciously. At this point, there is still
some residual context but we use that context productively rather than discard it to move
to higher levels. Some people claim we use models for all our thinking, but I'm using capital-M
Model only to describe a completely context-free abstraction. F equals ma is an example of that.
There is no need to check whether a car is a red car or a Toyota. The equation works not only for
all cars but for all forces, masses, and accelerations. We might come up with a special equation for
the acceleration of Tesla cars, which would require different inputs like battery charge level
and software settings. That would not be a context-free model, since it would not work on a Toyota.
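A minimal sketch of this contrast, where the Tesla-specific function and its inputs are invented for illustration and are not a real model of any car:

```python
def acceleration(force_newtons, mass_kg):
    """Context-free: F = ma rearranged. Works for any car,
    and indeed for any force and any mass whatsoever."""
    return force_newtons / mass_kg

def tesla_acceleration(battery_charge, sport_mode):
    """Context-bound: a made-up model that only makes sense for
    one kind of car. Feeding it a Toyota is meaningless."""
    boost = 1.2 if sport_mode else 1.0
    return 4.0 * battery_charge * boost
```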
For almost all tasks, basically everything except science, and even there only rarely,
we perform only as much reduction as is necessary to get the job done. When learning to ski,
you only figure out how you yourself need to perform given your body and equipment.
We do not need to parameterize our skiing skills for someone with twice the body mass
because that would be useless to us for the purpose of our own skiing. But a scientist would
have to go that far in order to parameterize away one more piece of context from the model
they are creating, for instance, when creating a skiing video game or designing a new ski.
If we consider the enormous amount of subconscious activity that happens in the brain,
we can safely say that partial reductions are the most common reductions. For instance,
when we take a step forward, our subconscious has analyzed our posture and velocity by using
reduction based on low-level nerve signals, and is commanding leg muscles to contract in
a precisely timed sequence. This activity is something we are unaware of. Most of us don't
even know what leg muscles we have. And there would be no time to perform reduction all the way to
models. That process takes a minimum of a half second and you don't have that kind of time
available to respond to an imbalance when walking or skiing. Reduction in society.
Most of us get paid to understand whatever we need to understand in order to perform our jobs.
In other words, most of us get paid to do reduction. If you are approving building permits,
you reduce a stack of forms to a one-bit verdict of approved or rejected. We excel at reduction,
and this is the main reason most of us haven't been replaced by robots.
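As a toy sketch of that one-bit reduction, with entirely hypothetical form fields:

```python
def approve_building_permit(forms):
    """Reduce a whole stack of forms to a single bit: approved or rejected."""
    return (forms.get("zoning_ok", False)
            and forms.get("fire_code_ok", False)
            and forms.get("fees_paid", False))
```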
But we see that when future understanding machines can perform reduction by themselves,
then we are unlikely to get paid for it. Levels of reduction.
Suppose a young man and a young woman fall in love, something happens to mess it all up,
and then they sort this out and reunite. This is what happened in the man's
rich mundane reality. Suppose the man wants to share this experience, because there was some
moral to the story that he thinks would be interesting to others and possibly important.
He could analyze what happened and figure out which were the key events in the saga and then
have actors on a stage re-enact the story as a play. This is a reduction because the boring parts
of the story would not be part of the play. They are discarded as irrelevant, but the story would
be acted out by real people in front of a live audience. If you are in the audience, you can move
your head to see behind any actor on the stage and you can clearly see everything on the stage,
not just one actor speaking at a time. He could make a movie about it. Now your point of view
is pre-defined by the camera angle and cropping. You can no longer see behind an actor, and you
can often only see those actors that are involved in the main action. He could write a book about it.
We no longer can see even the people described in the book, except in our imagination.
A critic reviewing the theater play may reduce it to: boy meets girl, boy loses girl, boy gets
girl. A drama school graduate may summarize it as a double reversal plot. This is a description
so free from context, it doesn't even specify boys or girls, that it could be argued it qualifies
to be called a model. Plays, movies, books, stories, tropes, etc. are all partial reductions of
reality, and some are more reduced than others. Just like in the red Toyota case, we need to find
the appropriate level of abstraction to work with. The young man in the example, when writing a
book or a screenplay, has much in common with a scientist trying to describe something in nature
in a reusable context free manner by reducing it to a model. They are model makers, or are at
least performing partial reduction. They are discarding the irrelevant bits. The opposite of
reduction. We also need to be able to move in the opposite direction, from models to reality,
or at least from more abstract partial models to partial models closer to reality. When an actor
is given a screenplay, they know it only contains rough directions for what to do and what lines
to say. The actor's job is to give a little of themselves to flesh out the screenplay to actual
actions, including creating, synthesizing, the appropriate display of emotions, tone of voice,
and body language. They use their experience as people and as actors. They use elements of their
past lives and skills they have acquired by training to create something people in the audience
might relate to. For example, they may repurpose a personal experience: he is as sad as when my
hamster died. They use things they learned in drama school, such as speaking, singing, dancing,
and swordplay; cues from other actors, what would Bogart do; and material from fiction, from other
movies and plays, etc. The actor is an artist who conveys whatever the script intends to convey:
emotions, a morality cookie,
a political position, titillation, surprise, and so on. Starting from the simple model,
the screenplay, their job is similar to an engineer's when they are faced with a problem
and use a model to solve it. The engineer would use their experience to decide that
M is the mass of the car and not the tire pressure. The actor decides that sadness
is more appropriate than grief for a certain scene, etc. I call this process, which is the
opposite of reduction, by the name used in problem solving: application. We use a model to
simplify a problem situation, moving it into an abstract and pure model space. We solve the
problem there, by performing math perhaps, and then apply the answer back in our rich reality
to the problem we are trying to solve. Many of you may recognize the word application or
its abbreviation, app. That's not as far-fetched as it might seem. Apps are software-based models.
Reduction and application in brains. Back to the issue of partial reductions.
Consider the actor reading a screenplay. They are using their eyes to gather pixels of color
and orientation. The brain then performs pattern matching, reduction, from these low-level signals
to letters, words, to language, to high-level concepts like love and separation, and eventually
to a high-level understanding of the playwright's intents. The actor then takes this high-level
understanding and by performing application, they add their own experience to the script
to get closer to reality in their performance. Our brains are capable of moving up and down
many levels of abstraction at once. Perhaps the brain tracks all of them simultaneously,
keeping layers of abstraction separate. This is a clue for why deep neural networks
perform better than shallow ones, which is what we'll discuss next.
Chapter 5. Why Deep Learning Works. Deep learning performs epistemic reduction.
A math-free, computer-science-free description of why deep learning works. We have now built
a base of theory for why AI works, what models are, and how to create them, what reductionism
and holism are, and what the process of reduction is. These are the fundamentals of AI epistemology.
This base allows us to discuss various strategies to move towards understanding machines in a
well-understood and controlled manner. We are now ready to discuss why deep learning,
DL, works. This is the fifth and last entry in the AI epistemology primer. Deep learning
performs reduction. This is an unsurprising claim, considering the preceding chapters.
There are several mutually compatible theories for how deep learning works. But just as in
the first chapter, we will now discuss the epistemological aspects, why it works,
from several viewpoints and levels, starting from the bottom. We will use examples from the
TensorFlow system and its API library as a stand-in for the whole deep learning family of algorithms
and TF programs, because the available API functions heavily shape and constrain the solutions
that can be implemented in this space. And the generalization should be straightforward enough.
Consider the following illustration of image understanding using Keras, an excellent
abstraction layer on top of TensorFlow. I like to refer to the input layer as being
on the bottom rather than at the far left as in this image. When viewing it my way,
the low to high dimension we use in my rotated version of the image can be mentally mapped
to a low to high stack of abstraction levels. I'm not the only one using this dimension this way.
I hope this rotation isn't too confusing. We can see that there is an obvious data reduction
and an obvious complexity reduction. Can we determine whether the system is also performing
what I'd like to call the epistemic reduction? Is it reducing away that which is unimportant?
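Before answering, it may help to see roughly what such a stack looks like in code. Here is a minimal sketch of a typical Keras image classifier; the layer counts and sizes are my own choices, not taken from the illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),        # raw pixels: 12,288 numbers
    layers.Conv2D(16, 3, activation="relu"),  # low-level features, such as edges
    layers.MaxPooling2D(2),                   # keep only the strongest signal per 2x2 window
    layers.Conv2D(32, 3, activation="relu"),  # higher-level features
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # ten class labels
])
model.summary()  # the shrinking output shapes show the data reduction, layer by layer
```

With that concrete picture in mind, back to the question.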
And if so, how does it accomplish this? How does an operator in a deep learning stack
know what makes something important, salient? A naive data reduction of sorts could be
accomplished by compression schemes or even random deletion. This is undesirable. We need to discard
the non-salient parts so that in the end, we are left with what is salient. Some people have not
understood the importance of salience-based reduction and promote the lossless compression power of
reversible algorithms as a measurement of intelligence, which is no more useful than
believing a simple video camera can understand what it sees. So let me conjure up, a bit like in
the movie Inside Out, a fairy tale of what goes on in a deep learning network, except we'll do it
bottom up. Suppose we have built a system for finding faces in an image with the intent of
incorporating that as a feature in a camera. Many cameras have this feature already,
so this is not a far-fetched example. We implement an image understanding neural network,
show the system many kinds of images for a few days, perhaps using so-called supervised learning
in order to improve this story, and then we show it an image of a family having a picnic in a park
and ask the system to outline where the faces are so that the camera can focus sharply on them.
The input image is converted from RGB color values to an input array and the data in this array is
then shuffled through many layers of operators. And for many of these layers, there are fewer
outputs than there are inputs, as you can see above, which means some things have to be discarded
by the processing. Each layer initially receives signals from below, that is, from the input
or from lower levels of abstraction, and produces some reduced output to send to the next layer
operator above. To continue in more detail: at some early level, some operator is given a few
adjacent pixels and determines that there is a vertical, slightly curved line dividing the
darker green area from the lighter green area. So it passes the operator above a simple line-
or color-based description, using some encoding we don't really care about. The operator at the
level above might have gotten another matching curve and says, these match what I saw a lot of
when the label blade of grass was given as a ground truth label during supervised learning.
If no label is known, then we again assume some other uninteresting representation.
It is okay to propagate results without human-labeled signals because whatever signaling scheme is
used will be learned by the level above. The operator above that says, when I get lots of
blades-of-grass signals, I reduce all of that to a lawn signal as I send it upward.
And eventually we reach the higher operator layers, and someone there says, we are a face-finder
application, we are completely uninterested in lawns, and discards the lawn as non-salient.
What remains after you discard all non-faces are the faces. You cannot discard anything
until you know what it is, or can at least estimate whether it's worth learning. Specifically,
until you understand it at the level of abstraction you are operating at. The low-level blade of
grass recognizers could not discard the grass because they had no clue about the high-level
saliencies of lawn-or-not and face-or-not that the higher layers specialize in. You can only tell
what is salient or not, important or not, at the level of understanding and abstraction you are
operating at. Each layer receives lower-level descriptions from below, discards what it
recognizes as irrelevant, and sends its own version of higher-level descriptions upward
until we reach someone who knows what we are really looking for. This is of course why deep
learning is deep. This idea itself is not new. It was discussed by Oliver Selfridge in 1959.
He described an idea called Pandemonium, which was largely ignored by the AI community because of
its radical departure from the logic-based AI promoted by people like John McCarthy and Marvin
Minsky. But Pandemonium presaged, by almost 60 years, the layer-by-layer architecture with
signals passing up and down that is used today in all deep neural networks. This is the reason my
online handle is @Pandemonica. So, do any TensorFlow operators support this reduction?
Let's start by examining the pooling operators. There are a few in the diagram. They are conceptually
simple. There are over 50 pooling operators in TensorFlow. One of them is the 2x2 max pool operator.
In the diagram, it is used four times. It is given four inputs with varying values and propagates the
highest value of those as its only output. Close to the input layer, these four values may be four
adjacent pixels, where their values might be a brightness in some color channel, but higher up they
mean whatever they mean. In effect, the 2x2 max pool discards the least important 75% of its input
data, preserving and propagating only the one highest value. In the case of pixels, it might mean the
brightest color value. In the case of blades of grass, it might mean there is at least one blade of
grass here. The interpretation of what is discarded depends on the layer, because in a very real
sense, layers represent levels of reduction, abstraction levels, if you prefer that term. And we
should now clearly see one of the most important ideas in deep neural networks: the reduction has to
be done at multiple levels of abstraction. Each set of decisions about what is reduced away as
irrelevant and what is kept as possibly relevant can only be made at an appropriate abstraction
level. We cannot yet abstract away the lawn if all we know is that there are dark and light green
areas. This is a simplification.
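Before moving on, here is a minimal numeric sketch of the 2x2 max pool just described; the input values are made up for illustration.

```python
import tensorflow as tf

# A 4x4 single-channel "image"; in a real network these would be
# activations arriving from a lower layer.
x = tf.constant([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 2.],
                 [1., 0., 2., 6.]])
x = tf.reshape(x, [1, 4, 4, 1])  # [batch, height, width, channels]

# 2x2 max pooling with stride 2: each non-overlapping 2x2 window is
# reduced to its single highest value; the other 75% is discarded.
pooled = tf.nn.max_pool2d(x, ksize=2, strides=2, padding="VALID")
print(tf.reshape(pooled, [2, 2]).numpy())
# [[4. 2.]
#  [1. 6.]]
```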
Decisions made in this manner will be heeded only if they have contributed to positive outcomes in
learning. Unreliable and useless decision makers will be ignored using any of several mechanisms
that we may apply during learning. More later. For now, we continue by examining the most popular
subset of all TensorFlow operators: the convolution family. The TensorFlow manual
notes that although these ops are called convolution, they are strictly speaking cross-correlation.
Convolution layers discover cross-correlations and co-occurrences of various kinds.
Co-occurrences of known patterns in the image at various locations. Spatial relationships
within an image itself, like Geoff Hinton's recent example of the mouth normally being found below
the nose. And more obviously, in the supervised learning case, correlations between discovered
patterns and the available meta-information, the tags and labels that correlate with the patterns
the system may discover. This is what allows an image understander to tag the occurrence of a
nose in an image with the text string nose. Beyond this, such systems may learn to understand
concepts like behind and under. The information that is propagated to the higher levels in the
network now describes these correlations. Uncorrelated information is viewed as non-salient
and is discarded. In the Keras diagram, this discarding is done by a max pooling layer after
the convolution plus ReLU layers. ReLU is a kind of layer operator that discards negative values,
introducing a non-linearity that is important for DL but not really important for our analysis.
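Here is a minimal sketch of the cross-correlation such a layer computes. The kernel is a hand-built vertical-edge pattern rather than a learned one, and all values are made up for illustration.

```python
import tensorflow as tf

# A vertical-edge detector: dark on the left, light on the right.
kernel = tf.constant([[-1., 1.],
                      [-1., 1.]])
kernel = tf.reshape(kernel, [2, 2, 1, 1])  # [height, width, in_ch, out_ch]

# An image whose left half is dark (0) and right half is light (9).
image = tf.constant([[0., 0., 9., 9.],
                     [0., 0., 9., 9.],
                     [0., 0., 9., 9.],
                     [0., 0., 9., 9.]])
image = tf.reshape(image, [1, 4, 4, 1])    # [batch, height, width, channels]

# tf.nn.conv2d computes cross-correlation: the kernel is slid over the
# input without flipping, and each position scores how well it matches.
response = tf.nn.conv2d(image, kernel, strides=1, padding="VALID")
print(tf.reshape(response, [3, 3]).numpy())
# The middle column responds strongly: that is exactly where the edge sits.
```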
This pattern of three layers, convolution, then ReLU, then a pooling layer, is quite popular
because this combination performs one reliable reduction step. These three layer types, in this
packaged sequence, may appear many times in a DL computational graph. Each of these three-layer
packages is reducing away things that the levels below had no chance of evaluating for salience,
because they didn't understand their input at the correct level. Again, this is why deep learning
is deep: you can only do reduction by discarding the irrelevant if you understand what is relevant
and irrelevant at each different level of abstraction. Is deep learning science or not?
While the deep learning process can be described using mathematical
notation, mostly using linear algebra, the process itself isn't scientific. We cannot explain how
this system is capable of forming any kind of understanding by just staring at these equations,
since understanding is an emergent effect of repeated reductions over many layers.
Consider the convolution operators. As the TF manual quote clearly states, convolution layers discover
correlations. Many blades of grass together typically mean a lawn. In TF, a lot of cycles
are spent on discovering these correlations. Once found, a correlation leads to some
adjustments of some weight, to make the correct reduction more likely to be rediscovered
the next round, because this reduction is done multiple times. But in essence,
all correlations are forgotten and have to be rediscovered in every pass through the deep
learning loop of upward signaling and downward gradient descent with minute adjustments to
erring variables. This system is in effect learning from its mistakes, which is a good sign,
since that may well be the only way to learn anything. At least at these levels.
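As a minimal sketch of that loop, using the standard TensorFlow 2 training idiom with a made-up model and made-up data:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal([32, 8])  # an arbitrary batch of inputs
y = tf.random.normal([32, 1])  # arbitrary targets

for step in range(10):
    with tf.GradientTape() as tape:
        pred = model(x)          # upward signaling, input to output
        loss = loss_fn(y, pred)  # how wrong were we this pass?
    grads = tape.gradient(loss, model.trainable_variables)
    # Downward pass: minute adjustments to the erring variables.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```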
This up and down may be repeated many times for each image in the learning set. This up and down
makes some sense for image understanding. Some are using the same algorithms for text.
Fortunately, in the text case, there are very efficient alternatives to this ridiculously
expensive algorithm. For starters, we can represent the discovered correlations explicitly,
using regular pointers or object references in our programming languages.
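A minimal sketch of what such an explicit representation might look like; the class and its names are hypothetical, not an existing library.

```python
class SoftwareNeuron:
    """A node that stores discovered correlations as plain references."""
    def __init__(self, name):
        self.name = name
        self.correlations = []  # list of (other_neuron, strength) pairs

    def correlate_with(self, other, strength):
        # The "synapse": an explicit reference saying this neuron
        # correlates with that neuron, this strongly.
        self.correlations.append((other, strength))

blade = SoftwareNeuron("blade of grass")
lawn = SoftwareNeuron("lawn")
blade.correlate_with(lawn, 0.9)  # many blades of grass typically mean a lawn
```

Once stored, the correlation never needs to be rediscovered.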
Or, as synapses in brains. This software neuron correlates with that software neuron, says a
synapse or reference connecting this to that. We shall discuss such systems in the section on
organic learning, which is coming up next. Neither the deep learning family of algorithms
nor organic learning is scientific in any meaningful way. They jump to conclusions on
scant evidence and trust correlations without insisting on provable causality. This is disallowed
in scientific theory, where absolutely reliable causality is the coin of the realm.
F equals ma, or go home. Most deep neural network programming is uncomfortably close to
trial and error, with only minor clues about how to improve the system when reaching mediocre results.
Adding more layers doesn't always help. These kinds of problems are the everyday reality for
most practitioners of deep neural networks. With no a priori models, there will be no a priori
guarantees. The best estimate of the reliability and correctness of any deep neural network,
or even any holistic system we can ever devise, is going to be extensive testing.
More on this later. Why would we ever use engineered systems that cannot be guaranteed
to provide the correct answer? Because we have no choice. We only use holistic methods when the
reliable reductionist methods are unavailable, as is the case when the task requires the ability
to perform autonomous reduction of context-rich slices of our rich, complex reality as a whole,
when the task requires understanding. Don't we have an alternative to these unreliable machines?
Sure we do. There are billions of humans on the planet that are already masters of this complex
task because they live in the rich world and have skills that are unavailable with reductionist methods,
starting with low level things like object permanence. So you can replace a well performing
but theoretically unproven contraption, a holistic understanding machine built out of deep neural
networks, with a well performing human being using a deeply mystical kind of understanding
hidden in their opaque heads, who earns much more per hour. This doesn't look like much of an
improvement. The machine cannot be proven correct because it doesn't function like normal computers.
It is performing reduction, a skill formerly restricted to animals. A holistic skill. My
favorite soundbite is a mere corollary to the frame problem by McCarthy and Hayes. You have seen
it and you will see it again, since it is one of the stronger results of AI epistemology.
But we will, in but a few years, agree on a definition of intelligence that makes autonomous
reduction a requirement. This once semi-heretical soundbite will then be obvious to all, if it
isn't already: our intelligences are fallible. Chapter 6. Experimental Epistemology for AI.
We can now create computer-based experimental implementations of epistemology-level theories
in order to test them and learn from the outcomes. Experimental epistemology is the use of the
experimental methods of the cognitive sciences to shed light on debates within epistemology,
the philosophical study of knowledge and rationally justified belief. Some skeptics contend that
experimental epistemology or experimental philosophy more generally is an oxymoron.
If you are doing experiments, they say, you are not doing philosophy. You are doing psychology
or some other scientific activity. It is true that the part of experimental philosophy that is
devoted to carrying out experiments and performing statistical analyses on the data obtained is
primarily a scientific rather than a philosophical activity. However, because the experiments are
designed to shed light on debates within philosophy, the experiments themselves grow out of mainstream
philosophical debate and their results are injected back into the debate, with an eye toward
moving the debate forward. This part of experimental philosophy is indeed philosophy,
not philosophy as usual perhaps, but philosophy nonetheless. Experimental epistemology, by James
R. Beebe. Traditional experimental epistemology conducted experiments using interviews and psychological
tests on human volunteers or relied on population statistics. As one of the newer branches of
cognitive science, machine learning has now provided us with a very different approach
to this domain. We can now create computer-based experimental implementations of epistemology-level
theories in order to test them and learn from the outcomes. In machine learning, the most
important epistemology level concepts and hypotheses are about reasoning, understanding,
learning, epistemic reduction, abstraction, creativity, prediction, attention, instincts,
intuitions, concepts, resiliency, models, reductionism, holism, and other things, all
sharing these features. One, science has no equations, formulas, or other models for how
they work. They're epistemology level concepts, not science level concepts. Two, our theories
about these concepts have to be sufficiently solid and detailed to allow for computer implementations.
This is because science itself is built on top of epistemology level concepts, and practitioners
need to be aware of this or they will experience cognitive dissonance-induced confusion and stress.
The red pill of machine learning confronts the elephant in the room of machine learning.
Machine learning is not scientific. What can we learn from AI epistemology? An excerpt from the
red pill: consider the following statements from the domain of epistemology, and how each of them
can be viewed as an implementation hint for AI designers. We are already able to measure
their effects on system competence. You can only learn that which you already almost know.
Patrick Winston, MIT. Our intelligences are fallible. Monica Anderson. In order to detect
that something is new, you need to recognize everything old. Monica Anderson. You cannot
reason about that which you do not understand. Monica Anderson. You are known by the company
you keep, a simple version of the Yoneda Lemma from category theory and the justification for embeddings
in deep learning. All useful novelty in the universe is due to processes of variation and
selection. The Selectionist Manifesto. Selectionism is the generalization of Darwinism. This is
why genetic algorithms work. Science has no equations for concepts like understanding, reasoning,
learning, abstraction, or modeling since they are all epistemology level concepts.
We cannot even start using science until we have decided what model to use. We must use our
experience to perform epistemic reductions, discarding the irrelevant, starting from the messy
real world problem situation until we are left with a scientific model we can use, such as an
equation. The focus in AI research should be on exactly how we can get our machines to perform
this pre-scientific epistemic reduction by themselves and the answer to that cannot be found inside
of science. Chapter 7. The Red Pill of Machine Learning. Reductionism is the use of models.
Holism is the avoidance of models. Models are scientific models, theories, hypotheses, formulas,
equations, naive models based on personal experiences, superstitions, and traditional
computer programs. The deep learning revolution of 2012 changed how we think about artificial
intelligence, machine learning, and deep neural networks. What changed, and what does this mean
going forward? The new cognitive capabilities in our machines are the result of a shift in the way
we think about problem solving. It is the most significant change ever in artificial intelligence
AI, if not in science as a whole. Machine learning, ML based systems are successfully
attacking both simple and complex problems using novel methods that only became available after
2012. We are experiencing a revolution at the level of epistemology which will affect much more
than just the field of machine learning. We want to add more of these novel methods to our
standard problem solving toolkit, but we need to understand the trade-offs and the conflict.
I argue that understanding deep neural networks, DNNs, and other ML technologies requires that
practitioners adopt a holistic stance which is, at important levels, blatantly incompatible with
the reductionist stance of modern science. As ML practitioners we have to make hard choices
that seemingly contradict many of our core scientific convictions. As a result we may get
the feeling something is wrong. The conflict is real and important and the seemingly counter-intuitive
choices make sense only when viewed in the light of epistemology. Improved clarity in these matters
should alleviate the cognitive dissonance experienced by some ML practitioners and should
accelerate progress in these fields. The title refers to the eye-opening clarity
some machine learning practitioners achieve when adopting a holistic stance. Parallel dichotomies.
At Syntience Inc, our research is natural language understanding, NLU. We are creating novel
systems that allow computers to learn to understand human natural languages, any one of them.
We use deep neural networks of our own design. The goal is to achieve some kind of human-like
but not necessarily human-level understanding. This is very different from traditional natural
language processing, NLP, which relies on human-made models of some language, such as English,
and perhaps models of fragments of the world. The NLP and NLU disciplines have chosen
opposite answers to their difficult two-way choices. They are now defined by these choices,
and we can use their stances to highlight the main conflict. The split is so deep
that it cuts through many layers of our reality. The following dichotomies are all manifestations
of this incompatibility at different levels, listed by impact, but discussed in no particular order.
The main one: science, versus the complex, including the mundane. Epistemology: reductionism,
versus holism. Meanings: reasoning, versus understanding. Problem solving: plan it, then do it,
versus just do it. Artificial intelligence: 20th century good old-fashioned AI, versus machine
learning and deep neural networks. Natural language and computers: NLP, versus NLU. The problem-solving level
provides many familiar examples of these issues. In our mundane lives, we solve many kinds of
problems every day but our strategies for solving them fall into just those two categories.
For any complicated problem, we had better have a plan before we start, but most problems
the brain deals with every day are things we never have to think about because we do not need to plan
or reason about them. These are the millions of low-level problems we encounter in our
mundane lives every day, and this is the world that our AIs will have to operate in.
Consider someone walking across the floor. Their brain signals their leg muscles to contract in
the correct cadence. Do they need to consciously plan each step? Do they reason about how to
maintain their balance? No. They probably don't even know what leg muscles they have.
Consider understanding this sentence. Did you use reasoning? Did you use grammar?
If you are a fluent speaker, you do not need grammars to understand or produce language,
and you do not have time to reason about language while hearing it spoken. Reasoning is slow,
but understanding is instantaneous. Consider someone braking for a stoplight.
How hard should they push on the brake pedal? Do they compute the required differential equation?
Should such equations be part of the driver's license test?