Image recognition technology — once geeky, back-end and technical — has emerged as a driving force behind much of the recent innovation in digital photo apps as companies market their products with mysterious buzzwords such as "machine learning" and "computer vision."
But what do these words really mean, and why are they suddenly a thing? Well, just look at your photo collection. Today, the average consumer with an iPhone or Samsung Galaxy shoots hundreds, if not thousands, of images and videos a year. Compare that with the predigital era, when the average person might have taken only a few dozen images during that same time period.
Now, instead of going in a shoe box or picture frame, an overwhelming number of casual photos are saved — but seldom viewed — in local or online repositories such as Apple Photos, iCloud, Flickr or Google Photos. Their owners are desperately seeking an easy way to organize them, which typically involves manually tagging, adding keywords and cataloging them by subject, location, time and face. Despite those efforts, it remains a frustrating challenge for most people to locate a particular photo.
Increasingly sophisticated artificial-intelligence technologies are helping photo apps learn to manage vast and growing digital photo collections automatically. But even these systems are imperfect, for now.
Here are simplified, plain-English translations of the high-level concepts behind these buzzwords, so that anyone, even people without a computer-science or math degree, can get the picture.
1. Machine learning
Machine learning is exactly what it sounds like: computers being trained to recognize and discriminate among specific types of things, like images or even speech.
With machine learning, computers learn from training data rather than by being programmed to recognize specific things. When it comes to images, the data comprise millions of images of cats, sunsets, beaches, cars and so forth, which are clearly labeled. Those data allow the computer to recognize the content in hundreds of millions of photos and tag each photo with descriptive keywords, to help you quickly find the image you want. Each photo app with this sort of technology uses its own proprietary algorithm. That means none of them will generate the exact same keywords, but the results should be similar.
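To make "learning from labeled examples" concrete, here is a minimal sketch in Python. It is not how any particular photo app works; it assumes each photo has already been boiled down to three numbers (its average red, green and blue values), and the `train` and `predict` helpers and the sample data are hypothetical. The "learning" is simply averaging the examples for each label, and tagging a new photo means finding the closest average.

```python
from collections import defaultdict
from math import dist


def train(labeled_examples):
    """Average the feature vectors seen for each label (the 'learning' step)."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled_examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(model, features):
    """Tag a new photo with the label whose average is closest."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training data: (average red, green, blue) plus a human-supplied label.
training_data = [
    ((250, 120, 40), "sunset"),
    ((240, 100, 60), "sunset"),
    ((60, 140, 230), "beach"),
    ((80, 160, 220), "beach"),
]

model = train(training_data)
tag = predict(model, (245, 110, 50))  # a new, unlabeled photo
print(tag)
```

Nothing in the code mentions sunsets or beaches by rule; the labels come entirely from the examples, which is the point of the technique.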
"When Flickr photos are uploaded, they are automatically tagged and indexed for search," said Gerry Pesavento, a Yahoo senior product manager who leads its vision and machine learning team.
In addition to more than 2,000 auto-tags and aesthetic-quality calculations, Flickr provides analysis of faces, duplication, similarity, adult content, logos, color and text, all designed to promote more efficient search, organization and discovery.
But it's not like these examples are just sitting around on call. Rather, the algorithms "tune" themselves based on this training data. That's why apps that use machine learning get "smarter" over time. As users correct mistakes, they further train the algorithm to better identify and tag content.
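One classic way an algorithm can "tune" itself is the perceptron rule, sketched below in Python. This is an illustration under stated assumptions, not the method any named app uses: each photo is reduced to two made-up numbers (how red and how blue it is, from 0 to 1), and the function names are hypothetical. Every time the current settings produce a wrong answer, the weights are nudged toward the correct one.

```python
def train_perceptron(examples, epochs=20, learning_rate=0.1):
    """Adjust weights a little after every mistake; leave them alone otherwise."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in examples:  # target is 1 or 0
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction  # 0 when the guess was right
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


def classify(weights, bias, features):
    """Apply the tuned weights to a new example."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


# Hypothetical features: (redness, blueness); label 1 = "sunset", 0 = "beach".
examples = [
    ((0.9, 0.2), 1),
    ((0.8, 0.1), 1),
    ((0.2, 0.9), 0),
    ((0.1, 0.8), 0),
]

weights, bias = train_perceptron(examples)
print(classify(weights, bias, (0.85, 0.15)))
print(classify(weights, bias, (0.15, 0.85)))
```

A user correcting a wrong tag plays the same role as the `error` term here: it supplies one more labeled example for the weights to adjust to.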
Flip Phillips, a professor of psychology and neuroscience at Skidmore College, emphasized the strong correlation between machine learning and generalization. "You could have a template match system that can only match a specific thing to another specific thing (such as a particular view of the Mona Lisa to only that particular view of the Mona Lisa). A machine learning version could match all views of the Mona Lisa, and maybe even learn to generalize and recognize all of Leonardo [da Vinci]'s paintings."
But it can also do the job too well, Phillips said. "There is a particularly interesting phenomenon called 'overlearning' or 'overfitting' in that the algorithm has a hard time generalizing, something like a classical pianist who has a hard time improvising because that just isn't part of the practice."
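The contrast Phillips draws can be sketched in a few lines of Python. The example is deliberately extreme and entirely hypothetical: photos are reduced to a single brightness number, the `memorizer` stands in for a template matcher (or a badly overfitted model) that only knows the exact examples it has seen, and the `threshold_learner` learns one general rule instead.

```python
# Hypothetical data: one brightness value (0-255) per photo, plus a label.
training = [(200, "day"), (220, "day"), (30, "night"), (50, "night")]
unseen = [(210, "day"), (40, "night")]


def memorizer(examples):
    """Template-style matcher: recognizes only inputs it has seen exactly."""
    table = dict(examples)
    return lambda x: table.get(x, "unknown")


def threshold_learner(examples):
    """Learns a single number: the midpoint between the two class averages."""
    day = [x for x, label in examples if label == "day"]
    night = [x for x, label in examples if label == "night"]
    cutoff = (sum(day) / len(day) + sum(night) / len(night)) / 2
    return lambda x: "day" if x > cutoff else "night"


def accuracy(classify, examples):
    """Fraction of examples the classifier labels correctly."""
    return sum(classify(x) == label for x, label in examples) / len(examples)


memorized = memorizer(training)
generalized = threshold_learner(training)

print(accuracy(memorized, training), accuracy(memorized, unseen))
print(accuracy(generalized, training), accuracy(generalized, unseen))
```

Both approaches look perfect on the training photos; only the one that generalizes copes with photos it has never seen.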
Though machine learning has come into vogue recently, it is not new. An early successful, large-scale commercial example of the technology came from none other than Google, which used a machine-learning algorithm to power its search results. Today, the tech is being employed in a wide range of consumer applications, including audio, credit card data and buying patterns, Netflix watching patterns, and Amazon and TiVo suggestions.
2. Artificial neural networks (ANNs)
ANNs are mathematical constructs designed to roughly imitate the structure and function of neurons in the human brain. In case you missed that eighth-grade biology lesson, neurons form an interconnected web of brain cells that work together by processing and transmitting information via electrochemical signals. In an ANN, that function is imitated mathematically via layers of interconnected processing elements.
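For the curious, the "layers of interconnected processing elements" idea can be sketched in Python as follows. This is a toy with two layers and hand-picked numbers, offered only as an illustration: the weights, biases and input values are all made up, whereas a real network would have many more units and would learn its weights from training data.

```python
from math import exp


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + exp(-x))


def layer(inputs, weights, biases):
    """Each 'neuron' sums its weighted inputs, adds a bias and squashes the result."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical, hand-picked weights; a real network learns these from data.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]  # 2 neurons, 3 inputs each
hidden_biases = [0.0, 0.1]
output_weights = [[1.2, -0.7]]                          # 1 neuron, 2 inputs
output_biases = [0.05]

pixel_features = [0.9, 0.1, 0.4]  # e.g., three pixel values scaled to 0-1

hidden = layer(pixel_features, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)
print(hidden, output)
```

The output of one layer becomes the input of the next, which is the "interconnected" part; the final number can be read as a score for some label.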
ANNs, a cornerstone of deep learning, can perform a wide variety of image-recognition tasks. If you feed them vast amounts of data (for example, photos of kids on the beach), these networks learn to recognize similar image content.
In addition to machine learning, computer vision and a variety of methods within its vision pipeline for photo and video intelligence, Flickr also uses deep convolutional neural networks to detect your photos' objects and scenes. "We are discovering that deep neural networks not only work well for objects, scenes and shapes, but even aesthetic qualities such as serene, minimalism, surreal and others," Pesavento said.
3. Deep Learning
With "learning" as part of the name, it's no surprise that deep learning is related to, works in tandem with and is often considered part of machine learning. In fact, it is just one of many variations of this learn-by-example algorithmic technique. Deep learning is used to accurately identify objects and faces in photos because the computer understands what it sees and can pinpoint correct features — with a little help from a programmer.
Because deep learning involves ANNs, which resemble the structure and function of neurons in the brain, the combination of neural networks and machine learning mimics brain functions by sharing data and learning patterns. While deep learning has been around since the 1980s, it's been only recently that computers have been powerful enough to perform its high-level computing tasks.
Deep learning teaches computers to pick and choose features from data (in this case, images) to find the right answers. A large sample data set containing millions of random images must be scanned by an algorithm, which examines the arrangement of pixels in each photo, seeking objects with a similar shape. The more data and computation time, the better the results.
"The thing that separates deep learning from general ANN approaches is that it tries to discover these features on its own (usually in an unsupervised fashion) rather than by having the algorithm designer specify what they think is important," Phillips said.
Deep-learning generalization is the reason Apple's Photos app sometimes wrongly identifies objects as faces when they merely resemble faces — a phenomenon called pareidolia.
That's because human intuition about what is important in an image rarely matches the judgment of deep-learning algorithms. "Knowing that trees are 'green things' doesn't really help us discriminate between tree types, and doesn't help us figure out if a tree is the same or different than a frog," Phillips said.
4. Computer vision
As a branch of artificial intelligence, computer vision's ultimate goal is to use computers to emulate human vision. To understand how computer vision works, remember that, to the computer, images comprise electronically stored numbers where each image contains a certain number of pixels, with numerical values for red, green and blue.
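Here is roughly what "an image is just numbers" looks like in Python. The tiny 2-by-2 image and the simple averaging formula for brightness are assumptions chosen for illustration; real photos have millions of pixels, and real software often weights the three colors differently.

```python
# A hypothetical 2x2 image: each pixel is (red, green, blue), each from 0 to 255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],  # bottom row: a blue pixel, a white pixel
]

height = len(image)
width = len(image[0])


def brightness(pixel):
    """Simple average of the three color values."""
    return sum(pixel) / 3


# The same image reduced to one brightness number per pixel.
gray = [[brightness(pixel) for pixel in row] for row in image]
print(width, height, gray)
```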
Computer vision uses algorithms to evaluate those numbers to get additional information from that image. It analyzes and interprets image components, such as edges, lines, corners or curves, and is useful when you want to extract meaningful information from an image, for tasks such as object recognition and motion analysis.
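A bare-bones version of edge finding can be sketched in Python like this. It assumes the image has already been reduced to one brightness number per pixel, and the `horizontal_edges` helper and its threshold of 50 are hypothetical choices; production systems use more robust filters, but the underlying idea of looking for sharp jumps between neighboring pixels is the same.

```python
def horizontal_edges(gray, threshold=50):
    """Mark spots where brightness jumps sharply between side-by-side pixels."""
    return [
        [abs(row[x + 1] - row[x]) > threshold for x in range(len(row) - 1)]
        for row in gray
    ]


# Hypothetical brightness values: dark on the left, bright on the right.
gray = [
    [10, 12, 200, 205],
    [11, 13, 198, 202],
]

edges = horizontal_edges(gray)
print(edges)
```

Each `True` marks the boundary between the dark and bright regions, which is the kind of low-level clue that later steps combine into lines, corners and shapes.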
Overall, think of computer vision as an application of deep learning that is subject to some of the same errors humans make. "The image-recognition algorithm, in some cases, makes the exact same mistakes humans make," Phillips said. "This suggests that the algorithms are on to something about how our perceptual system works, and that they're possibly using the same features and classifications that we humans use."
Flickr has been using computer vision for quite some time, and back in 2014, it released a lighthearted experimental web app called Park or Bird, in which it demoed the concept. But the technology has progressed since then, according to Pesavento. "Flickr has a full photo and video intelligence pipeline in place now, and is transitioning to a state-of-the-art pixel serving and storage platform," Pesavento said.
Over the past year, the folks at Flickr have noticed improvements in the computer-vision field and, more specifically, with machine accuracy for visual recognition. "It is starting to surpass human accuracy in several areas," Pesavento said. "Access to training data, computer infrastructure and neural network techniques are constantly advancing."
5. Big data
Photo apps from Google, Flickr and Apple all know your dog. They know everyone's dog. How? Through hundreds of thousands of dog photos identified and fed into an algorithm. That's data, and for data in machine learning, bigger is better. Large data sets offer more information to feed and train algorithms, and the resulting algorithms perform better than those trained on smaller data sets. But the most important reason to use big data for photo analysis is to extract meaning (what computer scientists call "semantic content") from specific, recognizable objects.
For example, a golden retriever may appear as a golden- to red-colored furry beast of 50 to 80 pounds, with ears, a tail and a nose. But semantic content might also include characteristics such as domestic, dog, friendly and goofy, and possibly even name the specific breed.
There's also statistical content, which represents more quantitative aspects of the image: how big it is, how many red pixels it has or whether it contains an octagon.
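Statistical content is the easy part to compute, as this short Python sketch suggests. The sample image, the `is_red` rule (red must beat green and blue by a margin of 50) and the `statistics` helper are all hypothetical; they simply show that such measurements are counts and sizes, with no understanding of what the photo depicts.

```python
def is_red(pixel, margin=50):
    """Call a pixel 'red' when red clearly dominates green and blue."""
    red, green, blue = pixel
    return red > green + margin and red > blue + margin


def statistics(image):
    """Collect simple, purely numerical facts about an image."""
    pixels = [pixel for row in image for pixel in row]
    return {
        "width": len(image[0]),
        "height": len(image),
        "red_pixels": sum(is_red(pixel) for pixel in pixels),
    }


# Hypothetical 3x2 image of (red, green, blue) pixels.
image = [
    [(255, 0, 0), (200, 30, 40), (10, 10, 10)],
    [(0, 0, 255), (250, 240, 230), (220, 60, 20)],
]

stats = statistics(image)
print(stats)
```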
Flickr claims a competitive data advantage because of its unique training data and social signals around the very large Flickr database of photos. "For example, if we wanted to build a classifier to detect 'jump' in a photo, it requires several thousand photos with jump action in them for positive training data," Pesavento said. "On Flickr, such data is readily available, as there are over 12.4 billion photos and 27 million images uploaded daily."
While the terms machine learning, deep learning, artificial neural networks, computer vision and big data have emerged recently in common image-app parlance, they will eventually become less mysterious. Gnarly privacy issues have already come up as people question the role that servers play in the photo ID process, raising questions about who actually "sees" your images.
Apple's new version of Photos, for example, conducts its deep-learning magic entirely on the device, without server involvement. Aesthetic algorithms in photo apps such as The Roll are also becoming more popular in determining photographic quality and helping people decide which images are worth sharing. As these technologies improve over time, we will be able to increasingly rely on them to take over tedious tasks such as searching, sorting and photo discovery. Eventually, today's perplexing artificial intelligence buzzwords will morph into everyday conversation.
See more Wolfram examples, such as "puppy or bagel?" and "kitten or ice cream?", on Flip Phillips' blog.