Promptimize: A Step Towards the Future

Ajith Shetty
6 min read · Apr 26, 2023


Photo by Andrew Neel on Unsplash

The whole world was taken aback when ChatGPT was launched, and with it so many new possibilities were unlocked.

New use cases and innovations have sprung up in every part of the organization.

The use cases for AI and ChatGPT are many, and every data-driven company is making use of these new technologies to learn more about its organization and take relevant decisions at the right time.

Among these possibilities, it gave birth to a new breed of engineers, called prompt engineers, who help the organization enrich and analyze the AI's output.

Prompt Engineering

Prompt engineering is the practice of designing and refining the inputs (prompts) given to large language models so that they produce accurate and useful output.

We have long used machine learning models to predict the future, training them on historical datasets.

And there are hundreds of ways to test those models to make them more and more accurate.

But in the world of AI prompts, we need a similar mechanism: a toolkit that lets us write test cases as code and gives us an accuracy report.

Promptimize

Promptimize is an evaluation and testing toolkit for prompt engineers.

With promptimize, you can:

  • Define your “prompt cases” (think “test cases” but specific to evaluating prompts) as code and associate them with evaluation functions
  • Generate prompt variations dynamically
  • Execute and rank prompt test suites across different engines/models/temperatures/settings and compare results, bringing the hyperparameter-tuning mindset to prompt engineering
  • Get reports on your prompts’ performance as you iterate. Answer questions around how different prompt suites are performing against one another. Which individual cases or categories of cases improved? Regressed?
  • Minimize API calls! Only re-assess what changed as you change it
  • Perform human reviews if and where needed, introspect failed cases, override false negatives

In essence, promptimize provides a programmatic way to execute and fine-tune your prompts and evaluation functions in Python, allowing you to iterate quickly and with confidence.

Source: https://github.com/preset-io/promptimize

To install, follow the steps below:

pip install promptimize
pip install pandas
pip install openai
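
You can verify the installation by asking the CLI for its help text (either entry point should work, as we will see below):

promptimize --help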

Let’s set up an OpenAI account at https://platform.openai.com/ and generate an API key.

export OPENAI_API_KEY=sk-{{ REDACTED }}
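
Before running anything, a quick sanity check that the key is visible to Python can save a confusing failure later (a tiny snippet of my own, not part of promptimize):

import os

# Fail fast if the key was not exported in the current shell
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"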

Clone the project:

git clone git@github.com:preset-io/promptimize.git
cd promptimize

There are pre-built examples, but for our test let's build our own test case.

"""
Some basic examples for promptimize.

to run, simply execute `p9e ./examples/set_of_standard_ques.py`
"""
# Brining some "prompt generator" classes
from promptimize.prompt_cases import PromptCase

# Bringing some useful eval function that help evaluating and scoring responses
# eval functions have a handle on the prompt object and are expected
# to return a score between 0 and 1
from promptimize import evals

# Promptimize will scan the target folder and find all Prompt objects
# and derivatives that are in the python modules
simple_prompts = [
# Prompting "hello there" and making sure there's "hi" or "hello"
# somewhere in the answer
PromptCase("hello on the other side!", lambda x: evals.any_word(x.response, ["heyy", "hey"])),
PromptCase(
"name the top 10 cricketers!",
lambda x: evals.any_word(x.response, ["sachin", "don bradman"]),
weight=2,
category="cricket"
),
PromptCase(
"top 10 countries in the world by gdp",
lambda x: evals.any_word(x.response, ["Germany", "Italy"]),
weight=2,
category="world"
),
]

At the beginning we import the PromptCase class, which will be used to define our test cases.

The evals module is used to evaluate the model's output against our expected output.
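
To make the evaluation step concrete, here is a minimal sketch of what a word-matching eval like any_word could look like. This is an illustration of the idea, not the library's actual source; the name and the substring-matching behavior are my assumptions:

# Hypothetical sketch of a word-matching eval function.
# Real promptimize eval functions return a score between 0 and 1.
def any_word_sketch(response: str, words: list) -> float:
    # Score 1.0 if any expected word or phrase appears in the
    # response (case-insensitive), otherwise 0.0
    lowered = response.lower()
    return 1.0 if any(w.lower() in lowered for w in words) else 0.0

print(any_word_sketch("Hi there! How can I help you?", ["heyy", "hey"]))  # 0.0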

We define the question, such as “top 10 countries in the world by gdp”.

We then evaluate the response by matching it against our expected output.

Based on the matching criteria, each case is given a score from 0 to 1.

The weight lets us prioritize certain cases over others based on our needs.

We define a category so that the results can be grouped in the report.
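
To see how weight and score combine, here is a small back-of-the-envelope calculation. I am assuming the report's percentage is a weighted average of the case scores, which matches the numbers we get below:

# Assumed weighted-average scoring for our three cases:
# weights 1, 2, 2 and scores 0.0, 1.0, 1.0 (taken from the run below)
weights = [1, 2, 2]
scores = [0.0, 1.0, 1.0]
weighted_score = sum(w * s for w, s in zip(weights, scores))  # 4.0
total_weight = sum(weights)                                   # 5.0
print(f"{100 * weighted_score / total_weight:.2f}%")          # 80.00%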

To execute, we can use either “promptimize” or “p9e”; the two are interchangeable.

p9e run ./examples/set_of_standard_ques.py --verbose --output ./report.yaml
💡 ¡promptimize! 💡
# ----------------------------------------
# (1/3) [RUN] prompt: prompt-f502b83f
# ----------------------------------------
key: prompt-f502b83f
user_input: hello on the other side!
prompt_hash: f502b83f
prompt: hello on the other side!
category: null
response: Hi there! How can I help you?
execution:
  api_call_duration_ms: 994.3192005157471
  run_at: '2023-04-26T14:09:05.493281'
score: 0.0

# ----------------------------------------
# (2/3) [RUN] prompt: prompt-c5b9fb83
# ----------------------------------------
key: prompt-c5b9fb83
user_input: name the top 10 cricketers!
prompt_hash: c5b9fb83
prompt: name the top 10 cricketers!
category: cricket
response: |-
  1. Sachin Tendulkar
  2. Virat Kohli
  3. Brian Lara
  4. Shane Warne
  5. Jacques Kallis
  6. Muttiah Muralitharan
  7. Ricky Ponting
  8. Imran Khan
  9. Rahul Dravid
  10. Wasim Akram
weight: 2
execution:
  api_call_duration_ms: 2597.066879272461
  run_at: '2023-04-26T14:09:08.102307'
score: 1.0

# ----------------------------------------
# (3/3) [RUN] prompt: prompt-eb7d2b9a
# ----------------------------------------
key: prompt-eb7d2b9a
user_input: top 10 countries in the world by gdp
prompt_hash: eb7d2b9a
prompt: top 10 countries in the world by gdp
category: world
response: |-
  1. United States
  2. China
  3. Japan
  4. Germany
  5. India
  6. United Kingdom
  7. France
  8. Brazil
  9. Italy
  10. Canada
weight: 2
execution:
  api_call_duration_ms: 1658.2231521606445
  run_at: '2023-04-26T14:09:09.763306'
score: 1.0

# ----------------------------------------
# Suite summary
# ----------------------------------------
suite_score: 0.4
git_info:
  sha: b79bc3406656
  branch: main
  dirty: false

We can then evaluate the output:

promptimize report report.yaml
# Reading report @ report.yaml
+--------+-------+
| weight |  5.00 |
| score  |  4.00 |
| perc   | 80.00 |
+--------+-------+
+------------+----------+---------+--------+
| category   |   weight |   score |   perc |
|------------+----------+---------+--------|
| cricket    |        2 |    2.00 | 100.00 |
| world      |        2 |    2.00 | 100.00 |
+------------+----------+---------+--------+

If you run the same command again without any changes, promptimize will skip the unchanged prompts to minimize API calls. You can override this with the --force or --repair flags.
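
For example, using the same test file as above:

# re-run everything, even unchanged prompt cases
p9e run ./examples/set_of_standard_ques.py --force

# only re-run the cases that previously failed
p9e run ./examples/set_of_standard_ques.py --repair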

The full set of options:

Options:
  -v, --verbose             Trigger more verbose output
  -f, --force               Force run, do not skip
  -h, --human               Human review, allowing a human to review and force
                            pass/fail each prompt case
  -r, --repair              Only re-run previously failed
  -x, --dry-run             DRY run, don't call the API
  --shuffle                 Shuffle the prompts in a random order
  -s, --style [json|yaml]   json or yaml formatting
  -m, --max-tokens INTEGER  max_tokens passed to the model
  -l, --limit INTEGER       limit how many prompt cases to run in a single
                            batch
  -t, --temperature FLOAT   temperature passed to the model
  -e, --engine TEXT         model as accepted by the openai API
  -k, --key TEXT            The keys to run
  -o, --output PATH
  -s, --silent
  --help                    Show this message and exit.
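
These options can be combined; for example, to run the suite against a specific model with a fixed temperature and token budget (the engine name below is just a placeholder for whichever OpenAI model you have access to):

p9e run ./examples/set_of_standard_ques.py \
  --engine text-davinci-003 --temperature 0.5 --max-tokens 256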

Conclusion

Promptimize is one of the coolest toolkits I have seen in the recent past, and it can help you and your prompt engineers build more sophisticated test cases and run them.

Promptimize is much more than what has been discussed here; we have merely shown how it works, not how extensively it can be used.

I strongly recommend visiting the pages below to learn more about the creator and the documentation.

Full Credits

Maxime Beauchemin

https://preset.io/blog/introducing-promptimize/

Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


If you are interested in getting a weekly newsletter on big data analytics around the world, do subscribe to my weekly newsletter: Just Enough Data.
