I have made the repo public now that there is enough content to give a general sense of the project's direction. I have not uploaded any of my models, datasets, or other larger code.
I welcome all contributions from the OHDSI community.
Upfront:
- Why is this not under the OHDSI GitHub? My hope is that the models, their reproduction, weights, and datasets will eventually be housed under the OHDSI GitHub, so that they can be integrated into other tooling. However, that will require evaluation and other considerations. These will carry the same Apache 2.0 license OHDSI generally uses.
- Why the GNU Affero General Public License? This license is used on the main Polyphemus repo simply because it deals with SaaS. The intent is to have the models deployed and interacting on K3s or K8s. While we will absolutely seek benefactors for cloud hosting and training, we want the resources to be usable by the entire OHDSI community. The AGPL prevents a private entity from taking the developed code and hosting a SaaS without providing some beneficence back to the community.
A few additional thoughts:
- LangChain is incredibly useful for amplifying fine-tuned 7B LLaMA models. There have been some interesting GUI implementations as well.
- Quantization makes these models feasible to run on minimal hardware. So far I have seen effective 4-bit (q4) models, and some work on 3-bit (q3); see the inference sketch after this list.
- Fine-tuning the models takes significantly more resources. I struggle with a 3070 Ti, so if you are hoping to do anything locally on your own machine I would suggest a 3090/4090-class card with at least 12 GB of VRAM. However, services like Google Colab and Lambda Labs give access to the necessary hardware, and most training can be done for under $5 on the 7B models; see the LoRA sketch after this list.
- We should follow OpenLLaMA closely, as its licensing is more permissive. Larger model runs are planned for the end of the week.
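To make the quantization point concrete: below is a minimal sketch of CPU inference on a 4-bit quantized 7B model using the llama-cpp-python bindings. The model path and prompt are illustrative assumptions, not project artifacts; any ggml-format q4 LLaMA derivative should work.

```python
# Minimal sketch: CPU inference on a 4-bit quantized 7B model via llama-cpp-python.
# The model path is an assumption; substitute any ggml q4 LLaMA-derived file.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # hypothetical path

output = llm(
    "Q: What does the OMOP Common Data Model standardize? A:",
    max_tokens=128,
    stop=["Q:"],  # stop before the model invents a follow-up question
)
print(output["choices"][0]["text"])
```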
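Likewise, for the fine-tuning point, a minimal sketch of attaching LoRA adapters to a 7B model with Hugging Face PEFT. The base-model id and hyperparameters are assumptions for illustration, not project settings; training only the small adapter matrices is what keeps costs in the sub-$5 range on rented GPUs.

```python
# Minimal sketch: wrap a 7B LLaMA-family model with LoRA adapters via PEFT.
# Model id and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "openlm-research/open_llama_7b"  # assumed OpenLLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")

lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA blocks
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights
# From here, a standard transformers Trainer loop fine-tunes only the adapters.
```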
The following is from the README, to give a better sense of the project:
Introduction
The infamous Cyclops, Polyphemus, the son of Poseidon in Greek mythology, met his downfall due to the cunning of the hero, Odysseus. When asked for his name, Odysseus cleverly responded with ‘Nobody.’ Unable to pinpoint the culprit behind his blinding, Polyphemus could not seek revenge, his actions limited to hurling a boulder aimlessly into the sea. It wasn’t until Odysseus revealed his true name that the cyclops was able to enlist his father Poseidon’s divine wrath, highlighting the power of a name.
Similarly, in the world of systems integration, the naming of objects is crucial. Each object must have a consistent name across various systems. While the characteristics of the object might change over space and time, the name serves as a constant identifier. With a proper name, actions can be directed effectively, whereas without it, one can only attempt hit-or-miss strategies.
The importance of naming extends to the realm of artificial intelligence (AI) systems. Large Language Models (LLMs) such as GPT-3, the driving force behind ChatGPT and OpenAI’s API, exemplify this. These models can perform tasks but lack the ability to recognize specific names within specialized knowledge domains. This limitation hampers their wider application, especially in fields such as observational medical research.
Introducing Polyphemus, a network of specialty-trained Large Language Models (LLMs), designed to overcome this limitation. Polyphemus models can not only act but also identify object names within their specialized domain. This project, a collaborative effort between the Observational Health Data Sciences and Informatics (OHDSI) community and Polyphemus, focuses on the exploration of LLMs’ application in open science and observational medical research.
Recent Developments in Open Source Large Language Models
On March 3, 2023, a significant event occurred in the field of artificial intelligence. Meta’s Large Language Model Meta AI (LLaMA) was leaked and subsequently shared via a torrent file on the internet. This foundational large language model (fLLM) spurred a rapid proliferation of open source projects. Unlike its counterparts, such as OpenAI’s GPT-4 and Google’s PaLM 2, LLaMA became freely accessible rather than being locked behind authorized web applications or application programming interfaces (APIs). In a noteworthy example of the model’s accessibility, the smaller 7B-parameter version was successfully deployed on a Raspberry Pi by Artem Andreenko.
A few days later, on March 13, researchers from Stanford fine-tuned the 7B LLaMA model using a technique known as Self-Instruct. They open-sourced the instruction data, thereby creating a model trained for both instruction following and conversation. Shortly after, Eric Wang released Alpaca-LoRA, a reproduction of Stanford’s Alpaca utilizing low-rank adaptation (LoRA). Alpaca-LoRA made it possible to reproduce Stanford’s methods on consumer hardware. As a result, anyone with a high-performance gaming computer or a modest cloud computing budget could train their own generalized or specialized LLaMA-derived model. The potential implications of these developments are far-reaching, to say the least. Continued performance work by Georgi Gerganov on 4-bit quantization even made it feasible to run LLaMA using only a CPU.
As of mid-March, open-source and open science projects were starting to gain momentum:
- March 19: The Vicuna project trained a model on high-quality ChatGPT dialogues sourced from sites such as ShareGPT. The training cost was approximately $400.
- March 25: GPT4All trained a model and implemented the first ecosystem uniting other models, like Vicuna, in a single location. The training cost was around $100.
- March 28: Cerebras trained the GPT-3 architecture using the optimal compute schedule implied by Chinchilla and the optimal scaling implied by μ-parameterization, producing the first from-scratch model to outperform existing open GPT-3 reproductions. Consequently, the open source community gained full access to two fLLMs that rival GPT-4 and PaLM 2.
- March 28: The LLaMA-Adapter project introduced instruction tuning and multimodality with a mere hour of training, establishing a new state of the art (SOTA) for Science Q&A.
- April 3: The Koala project released a model trained on open source data. When evaluated against ChatGPT with human subjects, over 50% of users either preferred Koala’s responses or expressed no preference between the two. The training cost was approximately $100.
- April 15: The Open Assistant Conversations project released a model and dataset for alignment via Reinforcement Learning from Human Feedback (RLHF). Their model closely competed with ChatGPT in terms of human preference (48.3% vs. 51.7%). The dataset can also be applied to Pythia-12B, and a complete open-source technology stack was provided to run the model. This publicly available dataset made RLHF feasible for individual users.
The Era of Democratized AI: Developments Since April 15th
Since April 15, the trend of democratizing AI has continued to accelerate. The following are some of the most notable developments in open-source AI since then:
- LangChain (sketch below) - a library for building data-aware, agentic LLM applications that combine one or more LLMs with other sources of information and computation. Examples include Q&A over specific documents, databases, or APIs; chatbots that reason before responding; and agents that control system environments to perform tasks. Key prompting concepts from the LangChain documentation include:
  - Chain of Thought (CoT) / Step-by-Step - a prompting technique that encourages the model to generate a series of intermediate reasoning steps.
  - Action Plan Generation (e.g., WebGPT, SayCan) - a prompting technique that uses a language model to generate actions to take. The results of these actions can then be fed back into the language model to generate a subsequent action.
  - ReAct - a prompting technique that combines Chain-of-Thought prompting with action plan generation, inducing the model to think about what action to take, then take it.
  - More concepts are described in the LangChain documentation.
- LlamaIndex (sketch below) - a comprehensive data framework that enhances LLMs by letting users incorporate their private data effectively. Users can easily connect their existing data sources and formats, and LlamaIndex offers an advanced retrieval and query interface for fetching context and augmenting knowledge-based output.
- GPTCache (sketch below) - a semantic cache for storing LLM responses.
- LocalAI (sketch below) - a drop-in replacement REST API, compatible with the OpenAI API specification, for local inferencing.
- PandasAI (sketch below) - a Python library that adds generative artificial intelligence capabilities to Pandas.
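To illustrate the LangChain and ReAct items above, here is a minimal sketch of a zero-shot ReAct agent using the 0.0.x-era LangChain API. The OpenAI wrapper and the llm-math tool are assumptions; any LangChain LLM wrapper, including one backed by a local LLaMA derivative, could be substituted.

```python
# Minimal sketch: a zero-shot ReAct agent in LangChain (0.0.x-era API).
from langchain.llms import OpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

llm = OpenAI(temperature=0)                # assumed LLM wrapper; swap in a local model
tools = load_tools(["llm-math"], llm=llm)  # a calculator tool backed by the LLM

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # CoT plus action plan generation
    verbose=True,  # prints the Thought/Action/Observation trace
)
agent.run("If a cohort has 1,284 patients and 37% are exposed, how many are exposed?")
```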
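A minimal LlamaIndex sketch under the same caveats (mid-2023 API; the ./data directory is an assumption): index a folder of local documents, then query them with retrieved context.

```python
# Minimal sketch: index a folder of documents and query it with LlamaIndex.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # assumed local folder
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("Summarize the inclusion criteria in these documents.")
print(response)
```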
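A minimal GPTCache sketch (early 2023 API): the adapter wraps the openai module so repeated prompts are served from the cache. Note the default init is exact-match caching; semantic similarity requires configuring an embedding model and vector store.

```python
# Minimal sketch: cache LLM responses with GPTCache's OpenAI adapter.
# Default cache.init() is exact-match; a semantic cache needs an embedding
# function and vector store configured in init().
from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper around the openai module

cache.init()
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the OMOP CDM?"}],
)
print(response["choices"][0]["message"]["content"])
```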
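A minimal LocalAI sketch: because LocalAI mimics the OpenAI REST specification, the stock openai client (0.x API) can simply be pointed at a local endpoint. The URL and model name are assumptions about a particular deployment.

```python
# Minimal sketch: use the openai client (0.x API) against a LocalAI endpoint.
import openai

openai.api_key = "not-needed-locally"         # LocalAI ignores the key by default
openai.api_base = "http://localhost:8080/v1"  # assumed LocalAI address

response = openai.ChatCompletion.create(
    model="ggml-gpt4all-j",  # whichever model this LocalAI instance serves
    messages=[{"role": "user", "content": "Hello from OHDSI!"}],
)
print(response["choices"][0]["message"]["content"])
```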
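Finally, a minimal PandasAI sketch (early 0.x API; the DataFrame contents and API token are placeholders): natural-language questions asked directly of a Pandas DataFrame.

```python
# Minimal sketch: natural-language queries over a DataFrame with PandasAI (0.x API).
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

df = pd.DataFrame({
    "drug": ["A", "B", "C"],
    "patients": [1200, 950, 430],  # illustrative data
})

llm = OpenAI(api_token="YOUR_API_TOKEN")  # placeholder token
pandas_ai = PandasAI(llm)
print(pandas_ai.run(df, prompt="Which drug has the most patients?"))
```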