San Francisco-based company Databricks has released Dolly 2.0, a large language model (LLM) and the successor to the model the company released two weeks earlier. It is similar to the LLMs that underpin chatbots like ChatGPT.
“Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human-generated instruction following dataset, crowdsourced among Databricks employees,” the company said in a blog post.
The company claims it is the first open-source, instruction-following LLM fine-tuned on a freely available dataset. Databricks also said the model is open for commercial use without paying for API access or sharing data with third parties.
The company is also releasing databricks-dolly-15k, the dataset on which Dolly 2.0 was fine-tuned. It is a corpus of more than 15,000 records generated by gathering responses from its 5,000 employees across 40 countries. The dataset has been vetted for quality, an effort that cost the company millions of dollars, chief executive officer Ali Ghodsi said. As per the company, it is the “first open-source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT.”
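To make the shape of such an instruction-following corpus concrete, here is a minimal sketch of one record and how it might be formatted into a training prompt. The field names (instruction, context, response, category) follow the published dataset card for databricks-dolly-15k; the example text and the prompt template are illustrative assumptions, not Databricks' actual pipeline.

```python
# Illustrative record in the shape used by databricks-dolly-15k.
# Field names follow the public dataset card; the values are made up.
record = {
    "instruction": "When was Databricks founded?",
    "context": "Databricks is a data and AI company headquartered in San Francisco.",
    "response": "Databricks was founded in 2013.",
    "category": "closed_qa",
}

def to_prompt(rec: dict) -> str:
    """Format one instruction-following record into a single training prompt.

    A minimal sketch; real fine-tuning pipelines use their own templates.
    """
    parts = [f"### Instruction:\n{rec['instruction']}"]
    if rec.get("context"):  # context is optional in the dataset
        parts.append(f"### Context:\n{rec['context']}")
    parts.append(f"### Response:\n{rec['response']}")
    return "\n\n".join(parts)

print(to_prompt(record))
```

Concatenating prompts like this over all 15,000 records is, in broad strokes, how an instruction-tuning dataset is fed to a base model such as Pythia.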
Ghodsi acknowledged that the dataset is not perfect, since it was produced solely by Databricks’ employee base, which skews male. However, users will be able to examine the training data themselves, which is not possible with OpenAI’s ChatGPT or Google’s Bard, whose training data has not been made public.
Notably, Databricks released the original Dolly two weeks earlier. That model could not be used in commercial products because its training data was generated with ChatGPT, whose terms of service restrict users from using its output to develop commercial AI systems that could compete with OpenAI.