Promptimize: Step towards the Future

Ajith Shetty
6 min read · Apr 26


Photo by Andrew Neel on Unsplash

The whole world was taken aback when ChatGPT launched, and with it many new possibilities were unlocked.

New use cases and innovations have sprung up in every part of the organization.

The use cases for AI and ChatGPT are many, and data-driven companies are using these new technologies to learn more about their organizations and make relevant decisions at the right time.

These possibilities gave birth to a new kind of engineer, the prompt engineer, who helps the organization enrich and analyze the AI's output.

Prompt Engineering

According to ChatGPT,

We have long used machine learning models to predict the future, training them on historical datasets.

And there are hundreds of ways to test these models and make them more accurate.

But for prompts sent to AI models, we still need a comparable mechanism.

We need a toolkit that lets us write test cases as code and gives us an accuracy report.


Promptimize is an evaluation and testing toolkit for prompt engineers.

With promptimize, you can:

  • Define your “prompt cases” (think “test cases” but specific to evaluating prompts) as code and associate them with evaluation functions
  • Generate prompt variations dynamically
  • Execute and rank prompt test suites across different engines/models/temperature/settings and compare results, bringing the hyperparameter tuning mindset to prompt engineering
  • Get reports on your prompts’ performance as you iterate. Answer questions about how different prompt suites are performing against one another. Which individual cases or categories of cases improved? Which regressed?
  • Minimize API calls! Only re-assess what changed as you change it
  • Perform human review if and where needed, introspecting failed cases and overriding false negatives

In essence, promptimize provides a programmatic way to execute and fine-tune your prompts and evaluation functions in Python, allowing you to iterate quickly and with confidence.


To install, follow the steps below:

pip install promptimize
pip3 install pandas
pip3 install openai

Let’s set up an OpenAI account and generate an API key.
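The key has to be available when the prompts are run. Here is a minimal sketch of a pre-flight check, assuming the key is exported as OPENAI_API_KEY (the environment variable the openai Python package reads by default). The helper name has_openai_key is hypothetical and is not part of promptimize.

```python
import os


def has_openai_key(env=None):
    """Return True when a non-empty OPENAI_API_KEY is present.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to try without touching real credentials.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_openai_key():
        print("OPENAI_API_KEY is set")
    else:
        print("OPENAI_API_KEY is missing; export it before running p9e")
```

Running a check like this before the suite avoids discovering a missing key only after the first API call fails.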


Clone the project

git clone
cd promptimize

There are pre-built examples, but for our test let’s build our own test case.

"""
Some basic examples for promptimize.

To run, simply execute `p9e run ./examples/`
"""
# Bringing in some "prompt generator" classes
from promptimize.prompt_cases import PromptCase

# Bringing in some useful eval functions that help evaluate and score responses.
# Eval functions have a handle on the prompt object and are expected
# to return a score between 0 and 1
from promptimize import evals

# Promptimize will scan the target folder and find all Prompt objects
# and derivatives that are in the Python modules
simple_prompts = [
    # Prompting "hello on the other side!" and checking for "heyy" or "hey"
    # somewhere in the answer
    PromptCase(
        "hello on the other side!",
        lambda x: evals.any_word(x.response, ["heyy", "hey"]),
    ),
    # weight and category match the values shown in the run output
    PromptCase(
        "name the top 10 cricketers!",
        lambda x: evals.any_word(x.response, ["sachin", "don bradman"]),
        weight=2,
        category="cricket",
    ),
    PromptCase(
        "top 10 countries in the world by gdp",
        lambda x: evals.any_word(x.response, ["Germany", "Italy"]),
        weight=2,
        category="world",
    ),
]

First we import the PromptCase class, which is used to define our test cases.

The evals module provides functions that compare the response against our expected output.

We define a question such as “top 10 countries in the world by gdp”.

And we evaluate the response by matching it against our expected words.

Based on the matching criteria, each case is given a score from 0 to 1.
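To make the scoring idea concrete, here is a minimal sketch of what an "any word" evaluator might do. The function any_word_score is a hypothetical stand-in, not the actual evals.any_word implementation, and the case-insensitive substring match is an assumption. The responses are shortened copies of the ones in the run output shown in this post.

```python
def any_word_score(response, words):
    """Return 1.0 if any of `words` occurs in `response`, else 0.0.

    Hypothetical stand-in for evals.any_word; matching here is a
    case-insensitive substring test, which is an assumption.
    """
    text = response.lower()
    return 1.0 if any(word.lower() in text for word in words) else 0.0


# Responses copied (and shortened) from the run output in this post
greeting = "Hi there! How can I help you?"
cricketers = "1. Sachin Tendulkar\n2. Virat Kohli\n3. Brian Lara"

print(any_word_score(greeting, ["heyy", "hey"]))
print(any_word_score(cricketers, ["sachin", "don bradman"]))
```

Under these assumptions the greeting case fails because the model answered "Hi" rather than "hey", which shows how sensitive a word-matching evaluator is to the expected words you choose.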

The weight lets us prioritize some cases over others based on our needs.

The category lets us group the results in the report.

To execute, we can use either “promptimize” or “p9e”; the two commands are interchangeable.

p9e run ./examples/ --verbose --output ./report.yaml
💡 ¡promptimize! 💡
# ----------------------------------------
# (1/3) [RUN] prompt: prompt-f502b83f
# ----------------------------------------
key: prompt-f502b83f
user_input: hello on the other side!
prompt_hash: f502b83f
prompt: hello on the other side!
category: null
response: Hi there! How can I help you?
api_call_duration_ms: 994.3192005157471
run_at: '2023-04-26T14:09:05.493281'
score: 0.0

# ----------------------------------------
# (2/3) [RUN] prompt: prompt-c5b9fb83
# ----------------------------------------
key: prompt-c5b9fb83
user_input: name the top 10 cricketers!
prompt_hash: c5b9fb83
prompt: name the top 10 cricketers!
category: cricket
response: |-
1. Sachin Tendulkar
2. Virat Kohli
3. Brian Lara
4. Shane Warne
5. Jacques Kallis
6. Muttiah Muralitharan
7. Ricky Ponting
8. Imran Khan
9. Rahul Dravid
10. Wasim Akram
weight: 2
api_call_duration_ms: 2597.066879272461
run_at: '2023-04-26T14:09:08.102307'
score: 1.0

# ----------------------------------------
# (3/3) [RUN] prompt: prompt-eb7d2b9a
# ----------------------------------------
key: prompt-eb7d2b9a
user_input: top 10 countries in the world by gdp
prompt_hash: eb7d2b9a
prompt: top 10 countries in the world by gdp
category: world
response: |-
1. United States
2. China
3. Japan
4. Germany
5. India
6. United Kingdom
7. France
8. Brazil
9. Italy
10. Canada
weight: 2
api_call_duration_ms: 1658.2231521606445
run_at: '2023-04-26T14:09:09.763306'
score: 1.0

# ----------------------------------------
# Suite summary
# ----------------------------------------
suite_score: 0.4
sha: b79bc3406656
branch: main
dirty: false

We can then view a summary of the results:

promptimize report report.yaml
# Reading report @ report.yaml
| weight   |  5.00 |
| score    |  4.00 |
| perc     | 80.00 |

| category | weight | score | perc   |
| cricket  | 2      | 2.00  | 100.00 |
| world    | 2      | 2.00  | 100.00 |
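The totals in the report can be reproduced by hand. The sketch below assumes the score is a weighted sum and that the uncategorized case carries a weight of 1, which is what the total weight of 5 implies. The function names are hypothetical and this is not promptimize's own reporting code.

```python
from collections import defaultdict

# (category, weight, score) for the three cases in the run above;
# the weight of 1 for the uncategorized case is inferred from the total of 5
cases = [
    (None, 1, 0.0),
    ("cricket", 2, 1.0),
    ("world", 2, 1.0),
]


def summarize(rows):
    """Return (total weight, weighted score, percentage) for `rows`."""
    weight = sum(w for _, w, _ in rows)
    score = sum(w * s for _, w, s in rows)
    perc = 100.0 * score / weight if weight else 0.0
    return weight, score, perc


def summarize_by_category(rows):
    """Group rows by category and summarize each group separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {cat: summarize(group) for cat, group in groups.items()}


print(summarize(cases))
print(summarize_by_category(cases))
```

With these inputs the overall figures come out as weight 5, score 4.0 and 80 percent, and both named categories reach 100 percent, matching the tables above.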

If you run the same command again without changing anything, promptimize skips the cases it has already run in order to minimize API calls. To override this, pass --force to re-run everything or --repair to re-run only the previously failed cases.
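The skip behaviour can be pictured as a lookup keyed on a hash of the prompt text. The following is only a sketch of that idea: the use of SHA-256, the report layout, and the rule that a score below 1.0 counts as a failure are all assumptions, and the function names are hypothetical rather than promptimize internals.

```python
import hashlib


def prompt_hash(prompt):
    """Short, stable fingerprint of the prompt text (8 hex characters)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]


def cases_to_run(prompts, previous_report, force=False, repair=False):
    """Pick the prompts that need a fresh API call.

    previous_report maps a prompt hash to the score from the last run.
    """
    selected = []
    for prompt in prompts:
        key = prompt_hash(prompt)
        if force or key not in previous_report:
            selected.append(prompt)  # forced, new, or changed prompt
        elif repair and previous_report[key] < 1.0:
            selected.append(prompt)  # assumed meaning of "previously failed"
    return selected


prompts = ["hello on the other side!", "name the top 10 cricketers!"]
report = {prompt_hash(prompts[0]): 0.0, prompt_hash(prompts[1]): 1.0}

print(cases_to_run(prompts, report))               # nothing changed
print(cases_to_run(prompts, report, repair=True))  # only the failed case
print(cases_to_run(prompts, report, force=True))   # everything
```

Because the key is derived from the prompt text, editing a prompt produces a new hash, so the edited case is picked up on the next run without any extra flags.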

Full set of options.

  -v, --verbose             Trigger more verbose output
  -f, --force               Force run, do not skip
  -h, --human               Human review, allowing a human to review and force
                            pass/fail each prompt case
  -r, --repair              Only re-run previously failed
  -x, --dry-run             DRY run, don't call the API
  --shuffle                 Shuffle the prompts in a random order
  -s, --style [json|yaml]   json or yaml formatting
  -m, --max-tokens INTEGER  max_tokens passed to the model
  -l, --limit INTEGER       limit how many prompt cases to run in a single
  -t, --temperature FLOAT   max_tokens passed to the model
  -e, --engine TEXT         model as accepted by the openai API
  -k, --key TEXT            The keys to run
  -o, --output PATH
  -s, --silent
  --help                    Show this message and exit.


promptimize is one of the coolest toolkits I have seen in the recent past. It can help you and your prompt engineers build more sophisticated test cases and run them.

promptimize can do much more than what has been discussed here. We have merely shown how promptimize works, not how extensively it can be used.

I strongly recommend visiting the pages below to learn more about the creator and the documentation.

Full Credits

Maxime Beauchemin

Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in a weekly newsletter on big data analytics from around the world? Subscribe to my weekly newsletter, Just Enough Data.


