Unlike biological intelligence, artificial intelligence (AI) is intelligence created by humans using technology, especially computer science. The technology domains most closely related to creating AI are machine learning, big data, cloud computing, and high-performance computing.
Machine learning is one technological approach to creating intelligent machines, that is, to achieving AI. As its name indicates, the goal of machine learning is to teach machines human intelligence and capabilities, such as recognizing objects in pictures, understanding human languages, or playing board games.
There are different teaching paradigms, including unsupervised learning, supervised learning, and self-supervised learning, as well as the student-teacher paradigm, in which a stronger, more general AI model teaches weaker models to perform a specific task.
Machine learning is a long-standing field with many learning algorithms within each teaching paradigm.
Neural networks are one category of learning algorithms within machine learning. They were proposed in the 1950s, with deep roots in mathematics dating back to the 1870s.
Nowadays, the term refers specifically to architectures in which layers of artificial neurons are stacked and linked together, passing numbers back and forth until they settle on stable “weights”, which are then used in prediction tasks.
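To make the idea of "passing numbers back and forth" concrete, here is a minimal sketch of a network with one hidden layer, written with the Python standard library only. The toy task (logical OR), the layer size, the learning rate, and all function names are assumptions chosen for illustration; real networks are far larger and are built with dedicated libraries.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Pass the inputs forward through one hidden layer and one output neuron."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def train(epochs=2000, lr=0.5, n_hidden=3, seed=0):
    """Train on a toy task (logical OR) and return the mean loss of each epoch."""
    rng = random.Random(seed)
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
    w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            # Forward pass: numbers flow from the inputs to the prediction.
            hidden, out = forward(x, w_hidden, b_hidden, w_out, b_out)
            error = out - target
            total += error ** 2
            # Backward pass: the error flows back to adjust every weight.
            # (The constant factor 2 of the squared error is folded into lr.)
            grad_out = error * out * (1.0 - out)
            grad_hidden = [
                grad_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(n_hidden)
            ]
            for j in range(n_hidden):
                w_out[j] -= lr * grad_out * hidden[j]
                b_hidden[j] -= lr * grad_hidden[j]
                for i in range(len(x)):
                    w_hidden[j][i] -= lr * grad_hidden[j] * x[i]
            b_out -= lr * grad_out
        losses.append(total / len(data))
    return losses


losses = train()
print(f"loss in first epoch: {losses[0]:.4f}, in last epoch: {losses[-1]:.4f}")
```

The loss shrinks as the weights settle, which is the "stable weights" idea described above; once training stops, only the forward pass is needed to make predictions.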
In contrast to today’s narrow AI, where a machine learning model is good at one or a few tasks (for example, predicting which movie to watch on Netflix), AGI (artificial general intelligence) models are good at significantly more tasks all at once.
For example, OpenAI’s ChatGPT is good at open-domain question answering, and it can also handle many specific language tasks, such as writing simple code snippets, understanding sentiment, and proposing ideas or outlines for an article. Each of these would have required a separately trained model in previous generations of AI.
As model capabilities increase, the definition of AGI will evolve as well. But the eventual definition will be “an AI model that performs as many tasks as a human can, if not more.”
If such a model can do more things than humans and beat humans at each of them, it becomes a strong AGI model. “Strong AGI” is also still being defined by academia as we speak.
A straightforward definition of generative AI is models that can generate content in multimedia formats, such as pictures, music, videos, and articles. But there’s more to it.
In machine learning, we usually categorize models into two types: generative models and discriminative models. They solve the same problems from two different angles. Say a bank wants to build a model to decide credit card approvals. A discriminative model would learn how to “draw a boundary” through the available data points to distinguish good credit card candidates from bad ones. A generative model would instead learn a probability distribution (a “probabilistic space”) by observing all the data points. When a new candidate’s record is presented, the model sees where in that space the record falls and makes a prediction based on that.
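The credit card example can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the single feature (an income-to-debt ratio), the ten data points, the use of logistic regression as the discriminative model, and the use of one Gaussian per class as the generative model.

```python
import math

# Hypothetical training data: one feature per applicant (income-to-debt ratio).
good = [3.0, 3.5, 4.0, 4.5, 5.0]
bad = [0.5, 1.0, 1.5, 2.0, 2.5]


def fit_discriminative(good, bad, lr=0.1, steps=5000):
    """Logistic regression: learn the boundary w * x + b = 0 directly."""
    data = [(x, 1.0) for x in good] + [(x, 0.0) for x in bad]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def predict_discriminative(model, x):
    """Approve if the applicant falls on the 'good' side of the boundary."""
    w, b = model
    return w * x + b > 0


def fit_generative(good, bad):
    """Describe each class with a Gaussian (mean, variance) plus a class prior."""
    def gaussian(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        return mean, var

    total = len(good) + len(bad)
    return {
        "good": (*gaussian(good), len(good) / total),
        "bad": (*gaussian(bad), len(bad) / total),
    }


def log_joint(x, mean, var, prior):
    """log p(x, class) = log prior + log Gaussian density."""
    return math.log(prior) - 0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_generative(model, x):
    """Approve if the record is more probable under the 'good' class."""
    return log_joint(x, *model["good"]) > log_joint(x, *model["bad"])


disc = fit_discriminative(good, bad)
gen = fit_generative(good, bad)
for applicant in (4.2, 1.2):
    print(applicant, predict_discriminative(disc, applicant), predict_generative(gen, applicant))
```

Both models reach a decision, but only the generative one has learned what good and bad applicants look like, which is why the same idea can be turned around to produce new data points.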
The most recent generative AI models work in a similar fashion. For example, vision diffusion models are neural networks trained on billions of pictures to learn this “probabilistic space”. When asked to generate a picture of a dog, the model samples a point (a vector of numbers) in that space and propagates the numbers through the network to expand the vector into a matrix, which is how pictures are represented in computers.
Similar mechanisms are used in models for other domains. Notable generative AI models include the GPT series, Stable Diffusion, AudioGen, BLOOM, and DALL-E 2.
The CNN, or convolutional neural network, enabled the huge performance gain on the 2012 ImageNet object recognition benchmark and has been the go-to architecture for many computer vision (CV) tasks ever since. The success of CNNs in CV tasks also encouraged the adoption of other neural architectures for other tasks, such as RNN/LSTM for language or speech and Transformers for multimodal tasks.
RNN (recurrent neural network) and LSTM (long short-term memory) are architectures used to model sequential data, for tasks such as language understanding, speech recognition, and text-to-speech. They were the main way to teach machines these tasks from 2012 to 2017. Their major shortcoming, however, is a limited ability to attend to earlier concepts in a “long” sequence.
Transformers are the architecture that enabled the most recent AI progress, with their signature “attention” mechanism. After being proposed in 2017, transformers first took over sequence learning and then computer vision tasks. Much of the recent high-profile progress in AI, such as GPT-3, DALL-E 2, and Stable Diffusion, builds on transformer architectures.
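As a rough illustration of the “attention” mechanism, the sketch below implements single-head scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real transformers use several heads and learned projection matrices for the queries, keys, and values, none of which are shown here, and the tiny vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How relevant is each position in the sequence to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is a weighted mix of the values at all positions.
        output = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical three-token sequence with two-dimensional vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
queries = [[4.0, 0.0]]  # points in the same direction as the first key

outputs, weights = attention(queries, keys, values)
print("attention weights:", [round(w, 3) for w in weights[0]])
print("output:", [round(o, 3) for o in outputs[0]])
```

Because every position is scored against every other position directly, distance in the sequence does not weaken the connection, which is what lets attention handle the long contexts that RNN/LSTM models struggle with.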
In 2012, AlexNet won the ImageNet challenge by a big margin over the previous generation of computer vision algorithms. It was the first time that neural network-based methods showed their potential in large-scale machine learning tasks.
In 2014, the VGG network was among the top performers in that year’s ImageNet challenge (first in localization and second in classification) with its simple and unified layer design. Such design principles inspired a lot of research that would later change the world of AI completely.
Also in 2014, the GAN (generative adversarial network) was proposed, and it was the first time a neural network-based generative model showed great potential in generating multimedia data. Although it pales in comparison with today’s generative AI models, the results were quite impressive at the time and inspiring to AI researchers.
In 2015, ResNet was proposed. It was the deepest network at the time, yet very efficient in training and inference thanks to its simple design. ResNet became the backbone of many models for vision, language, and many other tasks.
In 2017, the Transformer architecture was proposed. It features a multi-head attention mechanism and an encoder-decoder architecture within the network. It is significantly different from the CNN and RNN models that were mainstream at the time, and the attention mechanism is the key that enables the model to memorize or understand longer context in a sequence, or to understand the relation between local features and global semantics in vision tasks. The Transformer is by far the most important architectural breakthrough.
Also in 2017, a model called VQ-VAE was proposed. It is a generative model based on the variational autoencoder (VAE). Its vector quantization mechanism would inspire many later generative models.
In 2020, GPT-3 was released as the largest language model at the time, and it demonstrated an incredible understanding of human language.
Also in 2020, one of the first successful applications of the Transformer to computer vision was accomplished by the ViT (Vision Transformer) model.
Also in 2020, VQ-GAN was proposed, greatly boosting the ability of GANs to generate content. Its methods and tricks would later be adopted in diffusion models to achieve the impressive generative AI performance we saw in late 2022. So this was one of the important seeds of the generative AI hype.
Also in 2020, the diffusion model was proposed in its modern form. Although its performance was not yet very impressive, the learning paradigm was new. Unlike other generative models, diffusion models learn by adding noise to the data and trying to recover the original from it. It was a new self-supervised method that suits large models and large training sets with billions of instances. Diffusion models sparked successors such as Stable Diffusion, and they were also a very important seed for the generative AI we saw in 2022.
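The "add noise, then try to recover" idea can be sketched as follows. This is only an illustration under assumptions: the blending formula and the name alpha_bar follow a common formulation of diffusion models rather than anything stated in this article, the data is a made-up four-number "image", and the true noise stands in for the prediction a trained neural network would supply.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Blend clean data with Gaussian noise.

    alpha_bar close to 1 keeps the data almost clean;
    alpha_bar close to 0 leaves almost pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(x0, noise)
    ]
    return xt, noise


def recover(xt, predicted_noise, alpha_bar):
    """Undo the blend given an estimate of the noise that was added."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(xt, predicted_noise)
    ]


rng = random.Random(0)
clean = [0.2, 0.8, 0.5, 0.1]  # hypothetical "image" of four pixel values
noisy, true_noise = add_noise(clean, alpha_bar=0.5, rng=rng)

# During training, a network sees `noisy` and is scored on how well it predicts
# `true_noise`. No human labels are needed, hence "self-supervised".
restored = recover(noisy, true_noise, alpha_bar=0.5)
print("noisy:   ", [round(v, 3) for v in noisy])
print("restored:", [round(v, 3) for v in restored])
```

With a perfect noise prediction the clean data comes back exactly; generation then amounts to starting from pure noise and repeatedly removing the predicted noise.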
In 2021, Swin Transformer was proposed. Not only was it a very successful transformer architecture for vision tasks, but it also mixed ideas from CNNs into the transformer. It was another paradigm shift. This architecture-mixing trick inspired many subsequent state-of-the-art models, and nowadays the majority of big models are some mixture of previous architectures.
Also in 2021, OpenAI released DALL-E, CLIP, and GLIDE, three models that connect text and images. They were OpenAI’s experiments with three different methodologies. DALL-E built on the ideas of VQ-VAE and generates images from text prompts. CLIP, based on the transformer architecture, does not generate images itself; it learns to match images with text and can be used to rank or guide generated images. And GLIDE was a large diffusion model. Although each was impressive in its own way, we can see OpenAI’s strategy of covering all the major breakthroughs at the time.
Also in 2021, GitHub Copilot was released. AI could generate reasonably good-quality code from prompts such as the comments and descriptions that appear in regular computer programs. The implication was profound, because one can control computers through code.
In 2022, Stable Diffusion was released. Although training required a large cluster of GPUs and the original model was quite large, its design principle was to make generative power accessible to the public. So after being stripped down, the model fits in the memory of a reasonable single-GPU machine. More importantly, unlike models from OpenAI or other AI superpowers, Stable Diffusion was completely open-sourced by Stability.ai, the company behind the model, whose mission is to make AI truly accessible to the public.
Also in 2022, Whisper and AudioGen were released. Whisper is a speech recognition model that turns speech into text, while AudioGen was among the first text-to-audio generation models. Unlike text-to-speech, text-to-audio takes a text prompt describing a scene and generates audio as if the sound came from that scene. For example, an input could be “fire truck going through noisy road with bird singing around it”, and the generated audio would sound like that.
Software 1.0 means software in the traditional, or pre-AI, sense, where a computer program consists of predefined rules. For example, a traditional database query for the median income of people in a certain demographic range returns a number based on the records of the people in that database who fall within that demographic. If there are no such records in the database, the query returns an error or an empty result.
Software 2.0 means software in the era of AI, where a computer program consists of a combination of predefined rules and AI modules with learned weights and ever-evolving models. A “software 2.0” database would return the best estimate of the median income, based on a machine learning model, if the requested demographic doesn’t appear in the database.
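The median income example can be sketched as two query functions. All names and records below are hypothetical, the demographic is reduced to an age range, and a simple least-squares line stands in for the machine learning model; a real system would use a richer model and a real database.

```python
from statistics import median

# Hypothetical records: (age, annual income).
RECORDS = [(25, 40000), (27, 44000), (35, 60000), (38, 66000), (55, 90000), (58, 96000)]


def median_income_v1(records, min_age, max_age):
    """Software 1.0: answer only from stored rows; None if nothing matches."""
    incomes = [income for age, income in records if min_age <= age <= max_age]
    return median(incomes) if incomes else None


def fit_line(records):
    """Least-squares fit of income = slope * age + intercept (the stand-in model)."""
    n = len(records)
    mean_age = sum(age for age, _ in records) / n
    mean_income = sum(income for _, income in records) / n
    cov = sum((age - mean_age) * (income - mean_income) for age, income in records)
    var = sum((age - mean_age) ** 2 for age, _ in records)
    slope = cov / var
    return slope, mean_income - slope * mean_age


def median_income_v2(records, min_age, max_age):
    """Software 2.0: use stored rows when they exist, otherwise estimate."""
    exact = median_income_v1(records, min_age, max_age)
    if exact is not None:
        return exact
    slope, intercept = fit_line(records)
    midpoint = (min_age + max_age) / 2
    return slope * midpoint + intercept


print(median_income_v1(RECORDS, 25, 30))  # rows exist, so both versions agree
print(median_income_v1(RECORDS, 44, 48))  # no rows in this range: empty result
print(round(median_income_v2(RECORDS, 44, 48)))  # model-based estimate instead
```

The second version never returns an empty result, but its answer for the missing age range is an estimate whose quality depends on the model, which is the trade-off that software 2.0 introduces.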
The core functions of the product are made possible only by the very frontier of AI technology. Everything, from what data to collect and how to store it all the way to API and UI/UX design, has to be put together in a specific way to unlock the potential of cutting-edge AI models. Examples of such products include GitHub Copilot, OpenAI’s ChatGPT, and Huski.ai’s trademark prosecution and brand protection services.
When AI can build AI, the singularity will arrive. A commonly cited expectation is that the singularity will arrive around the year 2045, but it may come sooner.