Training AI large language models (LLMs) like those currently making waves in the enterprise software market — ChatGPT, LLaMA 2, Claude 2, Bard, Falcon 180B, etc. — typically requires extensive and specialized compute power. Little wonder, then, that it has been relegated to larger, well-funded organizations like OpenAI, Meta, Cohere, Google, Technology Innovation Institute in Abu Dhabi, etc.
However, Sebastien Bubeck, leader of the Machine Learning Foundation team at Microsoft Research, believes this could change soon thanks to the team’s research on open source, resource-efficient models such as its new, non-commercial phi-1.5.
By generating curated, high quality, synthetic data using existing LLMs (in this case, OpenAI’s ChatGPT) and training a new model on this, the researchers are able to achieve results comparable to leading LLMs at a fraction of the cost and training time.
The evolution of AI training
Announced in a paper this week, phi-1.5 is an evolution of the phi-1 code generation model Bubeck unveiled this June in the “Textbooks Are All You Need” paper. Building on their experience with code generation, Bubeck’s team sought to make a lean and efficient language model. To accomplish this, the team created a source of textbook-like content in ChatGPT and then they used that synthetic data to train the phi-1.5 model.
The phi-1.5 model uses 1 billion parameters, small compared with other models that have more than 100 billion parameters, but it has already demonstrated some exciting emergent abilities normally found in the larger models.
As phi-1.5 is solely trained on synthetic data via the “Textbooks” approach, it does not need to leverage web scraping or the usual data sources fraught with copyright issues.
When asked about their goals for phi-1.5, Bubeck explained they wanted to “make it available everywhere.” By focusing on a model with just 1 billion parameters, “now anybody can go and play and you know, it becomes just much more democratized that way,” he said in a call with VentureBeat.
Training phi-1.5 required only two weeks on eight A100 GPUs, and Bubeck noted: “renting eight GPUs for one week, it’s $1,000. Basically, any individual can get this level of compute.”
This stands in contrast to other models, which require massive GPU resources costing millions of dollars.
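As a rough back-of-the-envelope check on those figures, the sketch below combines the quoted training time with the quoted rental price. It assumes the weekly price stays constant and the GPUs run around the clock. Both are simplifications for illustration, not details from Bubeck or the paper.

```python
# Figures quoted in the article.
GPUS = 8                    # eight A100 GPUs
WEEKLY_RENTAL_USD = 1_000   # "renting eight GPUs for one week, it's $1,000"
TRAINING_WEEKS = 2          # reported training time

# Assumption: GPUs are rented continuously, 24 hours a day.
HOURS_PER_WEEK = 7 * 24

gpu_hours = GPUS * TRAINING_WEEKS * HOURS_PER_WEEK
estimated_cost_usd = WEEKLY_RENTAL_USD * TRAINING_WEEKS
implied_rate = WEEKLY_RENTAL_USD / (GPUS * HOURS_PER_WEEK)  # USD per GPU-hour

print(gpu_hours)               # 2688
print(estimated_cost_usd)      # 2000
print(round(implied_rate, 2))  # 0.74
```

Under these assumptions, the full run comes to about 2,700 GPU-hours and roughly $2,000 in rental fees.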
Cracking open the textbooks
The “Textbooks Are All You Need” methodology aims to democratize AI by extracting reasoning abilities from smaller models. As Bubeck described, “if you want to teach your kid something you don’t just give them a bunch of random internet pages about this topic. You actually carefully curate some material for them to go through.”
When discussing how they ensured diversity in the synthetic textbooks created to train phi-1.5, Bubeck drew comparisons to the “Tiny Stories” work by Ronen Eldan, another researcher at Microsoft, and Carnegie Mellon University professor Yuanzhi Li. The pair was able to get a transformer with only 10 million parameters to output children’s stories.
“They wrote a list of 3000 words. Then what they did is, every time they wanted to produce a short story, they picked three at random. And they asked ChatGPT to write a short story for kids, which includes those three words.”
By introducing seed words into the data in this way, the researchers were able to achieve “many, many different very different looking stories,” Bubeck said. This combinatorial approach resulted in a vast expansion of the possible output from the model.
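The seed-word idea Bubeck describes can be sketched in a few lines. This is only an illustration. The vocabulary is a placeholder, the function name `make_story_prompt` and the prompt wording are hypothetical, and no LLM is actually called.

```python
import math
import random


def make_story_prompt(vocabulary, rng, k=3):
    """Pick k distinct seed words at random and build a story prompt."""
    seeds = rng.sample(vocabulary, k)
    prompt = (
        "Write a short story for kids that includes the words: "
        + ", ".join(seeds)
        + "."
    )
    return seeds, prompt


# Placeholder vocabulary: the real 3,000-word list is not given in the article.
vocabulary = [f"word{i}" for i in range(3000)]

rng = random.Random(0)  # seeded so the sketch is repeatable
seeds, prompt = make_story_prompt(vocabulary, rng)
print(prompt)

# Number of distinct unordered three-word combinations from 3,000 words.
num_combinations = math.comb(len(vocabulary), 3)
print(num_combinations)  # 4495501000
```

With 3,000 words there are about 4.5 billion possible three-word combinations, which is why a short word list can seed such a wide variety of stories.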
In turn, the “Textbooks” approach is more sophisticated, but the link is clear between the two techniques.
Bubeck also noted that creating training data through the “textbooks” methodology ensures that reasoning tokens are much more common in the model’s training data. This means that robust LLM output can be achieved without needing to process the immense amount of information found in classical training data sets.
Benchmarks, while handy, need to evolve
In the course of development, phi-1.5 has already produced some exciting benchmark figures: 74% on Winogrande (common sense reasoning, 5 percentage points higher than Llama2-7B), 37% on OpenbookQA (reading comprehension, 6 points higher than Llama2-7B) and 34% on HumanEval (coding, 20 points higher than Llama2-7B).
Despite these strong figures, traditional benchmarks have come under scrutiny, says Bubeck, who advocates moving to more nuanced evaluation methods. “Benchmarks are not telling us a story of what’s going on with LLMs,” he stated. He sees limitations in static tests, saying they cannot capture model interactions or the full range of a model’s abilities.
Instead of benchmarks, Bubeck suggested “a different way to test models” is needed. Specifically, methods based on playing with the model through direct conversations: “The power of those LLMs is that they can interact with you. You can have a back and forth, you can modify the premise, you can see how robust it is to variation, etc,” said Bubeck.
By releasing phi-1.5 under a research license (not for commercial purposes), others can now “ask their own question and see what the model replies,” said Bubeck. This “ultimate decontamination” allows more flexible, nuanced evaluation than benchmarks alone can provide.
Through developing models that can learn from focused, high-quality synthetic data rather than vast web corpora, AI may soon be within reach of many more individuals and organizations. Bubeck believes their approach “opens the door to many, many new types of applications” no longer restricted to tech giants. If successful, it could truly usher in a new era of decentralized, democratic AI development.