Google's Advanced Language Model, PaLM 2, Trained on a Record 3.6 Trillion Tokens

Last week, Google announced the release of its new large language model (LLM), PaLM 2, which was trained on almost five times as much data as its predecessor, launched in 2022. The larger training set lets the model handle more advanced coding, math, and creative writing tasks.
Unveiled at Google I/O, PaLM 2 (short for Pathways Language Model) is trained on a staggering 3.6 trillion tokens, a significant leap from the original PaLM, which was trained on 780 billion tokens. Tokens, the word and subword pieces that training text is broken into, are the building blocks of LLM training: the model learns by predicting the next token in a sequence.
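To make that concrete, here is a minimal sketch in plain Python of how text becomes tokens and how next-token training pairs are framed. The toy_tokenize helper is a hypothetical word-level stand-in; production models use subword schemes such as byte-pair encoding, but the (context, next token) framing is the same.

```python
# Toy illustration of tokenization and next-token prediction pairs.
# toy_tokenize is a hypothetical word-level tokenizer; real LLMs use
# subword tokenizers (e.g., BPE or SentencePiece).

def toy_tokenize(text: str) -> list[str]:
    return text.lower().split()

corpus = "the model learns to predict the next token in a sequence"
tokens = toy_tokenize(corpus)

# Each training example pairs a context with the token that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples[:3]:
    print(f"context={context} -> target={target!r}")
```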
Google, while keen to demonstrate its AI technology's capabilities and its integration into services like search, email, word processing, and spreadsheets, has not publicly shared the specifics of its training data. Similarly, OpenAI, backed by Microsoft and creator of ChatGPT, has been secretive about the particulars of its latest LLM, GPT-4. Both companies attribute the secrecy to the competitive nature of the industry, as they race to attract users who would rather search for information with conversational chatbots than with traditional search engines. However, the lack of disclosure is facing pushback from the research community, which is calling for more transparency in this burgeoning AI race.
Although Google has revealed that PaLM 2 is smaller than its predecessor, the new model handles more sophisticated tasks, indicating improved efficiency. PaLM 2 has 340 billion parameters, the learned weights that measure a model's size and complexity, compared with the original PaLM's 540 billion.
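As a rough illustration of what a parameter count measures, the sketch below (PyTorch, with an arbitrary toy architecture that bears no relation to PaLM 2's) sums the learned weights and biases of a small network.

```python
# Minimal sketch of what "parameters" counts: the learned weights and
# biases of a network. The toy layer sizes here are arbitrary assumptions,
# orders of magnitude smaller than any production LLM.
import torch.nn as nn

toy_model = nn.Sequential(
    nn.Embedding(1000, 64),   # 1000 * 64       =  64,000 weights
    nn.Linear(64, 256),       # 64*256 + 256    =  16,640
    nn.ReLU(),
    nn.Linear(256, 1000),     # 256*1000 + 1000 = 257,000
)

total = sum(p.numel() for p in toy_model.parameters())
print(f"{total:,} parameters")  # 337,640 for this toy configuration
```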
Google has touted PaLM 2 as using a "new technique" called "compute-optimal scaling," which makes the LLM more efficient and improves overall performance, including faster inference, fewer parameters to serve, and a lower serving cost.
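Google has not published the details of its recipe, but the name echoes the publicly known compute-optimal scaling-law work (Hoffmann et al., 2022, the "Chinchilla" paper), which argues for growing training data in step with parameter count. The sketch below illustrates that general idea under the commonly cited approximations, training compute C ≈ 6·N·D FLOPs and roughly 20 tokens per parameter; the budget figure is an arbitrary assumption, not PaLM 2's.

```python
# Rough illustration of compute-optimal scaling, using the widely cited
# Chinchilla heuristics: training compute C ~= 6 * N * D FLOPs, with the
# token count D grown in step with the parameter count N (~20 tokens per
# parameter). This is not Google's published method for PaLM 2.

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into parameters N and tokens D,
    solving C = 6 * N * D with D = tokens_per_param * N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 1e24  # arbitrary example FLOP budget
n, d = compute_optimal_split(budget)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```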
The new model, already deployed across 25 features and products, including the experimental chatbot Bard, can process 100 languages and perform a variety of tasks. It is available in four sizes: Gecko, Otter, Bison, and Unicorn, from smallest to largest.
Measured by disclosed training-data size, PaLM 2 exceeds any other publicly described model. For comparison, Meta's LLM, LLaMA, announced in February, was trained on 1.4 trillion tokens, and the last training size OpenAI disclosed was GPT-3's 300 billion tokens. OpenAI launched GPT-4 in March, claiming "human-level performance" on many professional tests.
The rapid adoption of new AI applications has intensified debate over transparency and control of the technology. Recent flashpoints include the resignation of El Mahdi El Mhamdi, a senior Google Research scientist, over Google's lack of transparency, and the testimony of OpenAI CEO Sam Altman at a Senate Judiciary subcommittee hearing on privacy and technology, where he agreed with lawmakers that a new system for governing AI is needed.