Important Things
1. DSPy
From the authors of ColBERT, DSPy is a framework for building and improving language model (LM) pipelines. It can be thought of as an improvement on LangChain: rather than only letting application developers compose the parts of an LM pipeline together, it can also optimize the prompts within each component.
DSPy stands for “Demonstrate–Search–Predict”.
2. Components
There are three main components in DSPy:
- Signatures: a signature defines the inputs and outputs of a module, which represents one or more LLM calls. An example would be `question -> answer`.
- Modules: modules are the main building blocks, each with a signature defined on it. Examples of modules are `Predict`, `ChainOfThought`, `MultiChainComparison`, and `ReAct` (see the sketch below this list).
- Teleprompters: these are what the authors call the optimizers that optimize the prompts for a given module. To use a teleprompter, there must be an evaluation dataset and a metric to measure model performance (e.g., exact match or F1).
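A minimal sketch of what signatures and modules look like in code (the exact API surface may vary a bit across DSPy versions):

```python
import dspy

# A signature can be declared inline as a string...
qa = dspy.Predict("question -> answer")

# ...or as a class, which lets you describe each field.
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Modules wrap a signature with a prompting strategy.
predict = dspy.Predict(BasicQA)
cot = dspy.ChainOfThought(BasicQA)  # adds an intermediate reasoning step

# Calling a module runs the underlying LM call(s); an LM must be configured
# first, e.g. via dspy.settings.configure(lm=...).
# prediction = cot(question="Where is the Eiffel Tower?")
# print(prediction.answer)
```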
In the question answering example above, a teleprompter would be used to generate few-shot examples that help the model answer the question more effectively.
With this design, modules can be composed together and optimized by teleprompters.
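To make the composition and compilation steps concrete, here is a hedged sketch; the SimpleQA pipeline, the training examples, and the metric are made up for illustration (a real pipeline would typically retrieve context between the two steps):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# A tiny composed program: rewrite the question into a search query, then answer.
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.rewrite = dspy.ChainOfThought("question -> search_query")
        self.respond = dspy.ChainOfThought("question, search_query -> answer")

    def forward(self, question):
        query = self.rewrite(question=question).search_query
        return self.respond(question=question, search_query=query)

# Made-up training examples and a simple exact-match metric.
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare").with_inputs("question"),
]

def exact_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# The teleprompter "compiles" the program: it optimizes the prompts of every
# module inside it, here by bootstrapping few-shot demonstrations.
compiled_qa = BootstrapFewShot(metric=exact_match).compile(SimpleQA(), trainset=trainset)
```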
3. Example of a Teleprompter
They included pseudocode for a few teleprompters in the appendix. Let’s look at the one for `BootstrapFewShot`:
The goal of this teleprompter is to come up with some few-shot examples. This is done as follows:
- Take examples from the train set and have a teacher (typically a DSPy program that already performs reasonably well on the task) generate answers for them. If no teacher exists, use the student program itself.
- If the teacher gets the answer right according to the metric, the trace is accepted as a new few-shot example.
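This is not the paper’s pseudocode, just a heavily simplified Python sketch of the loop described above, assuming a `question -> answer` task:

```python
def bootstrap_few_shot(student, trainset, metric, teacher=None, max_demos=4):
    """Simplified sketch of BootstrapFewShot's core loop (illustrative only)."""
    teacher = teacher or student          # fall back to the student program
    demos = []
    for example in trainset:
        prediction = teacher(question=example.question)  # teacher attempts the problem
        if metric(example, prediction):                   # keep only traces the metric accepts
            demos.append((example.question, prediction.answer))
        if len(demos) >= max_demos:
            break
    # The accepted traces become the few-shot demonstrations in the student's prompts.
    return demos
```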
Most Glaring Deficiency
I expected the paper to go into detail on the algorithms used to optimize the prompts in the pipeline, but none were provided beyond the fairly simple few-shot example generation ones. As a reader, that would have been the most interesting part of DSPy, which made the paper a somewhat disappointing read.
Conclusions for Future Work
Abstract the “how” away from the “what” to build composable programs. In this case, how each LLM module works (the underlying prompt) is abstracted away from what the module is supposed to achieve (its signature).