💰 Compensation

This interview project may**$^*$** offer an honorarium of up to $1,000 (maximum, regardless of total hours spent), paid at a rate of $155 per hour spent directly engineering deliverables. Specifics must be agreed between you & your interviewer before starting (i.e. while interviewing).

**$^*$**The availability, amount, timing, and other terms of any honorarium are at the sole discretion of Spindle AI, and may change with or without notice. Honoraria are subject to complete & truthful time logging, timely & applicable invoicing, full confidentiality, degree of completion, good faith, standard conflict of interest principles, and Spindle’s sole discretion. Spindle AI reserves all customary rights, remedies, and indemnities. Interviewees can elect to forgo the honorarium.

🎯 Your Mission

The Setup: First, create ≥5 distinct tools that an LLM-based agent could sequence to help solve user-provided math problems (e.g. SUM, DELTA, PRODUCT, QUOTIENT, MODULO, POWER, ABS, LOG, TRIG, SQRT, AVG, MODE, ROUND, UNION, INTERSECT, DIFFERENTIATE, INTEGRATE, FACTORIZE, … — the specific tools are entirely up to you).
1. Modify 1-2 of the most basic tools to intentionally but silently throw errors (and/or silently give incorrect answers) 30%-50% of the time the tool is called.
2. You may also want to include a basic GET_USER_INPUT tool for requesting input/clarification from a human user.
3. You can organize all tools in some form of “toolbox” if you want, but we’d prefer you do not hardcode a string listing all the tools, their docs, their usage examples, etc. in a single prompt string/file anywhere in the project.
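One possible way to read the Setup items above is a decorator-based registry, where the catalog an agent sees is built at runtime from each function's signature and docstring rather than from a hand-written prompt string. This is only a sketch under assumptions: `TOOLBOX`, `tool`, `flaky`, and `describe_tools` are hypothetical names, the 40% corruption rate is an arbitrary pick from the 30%-50% range, and Python is used purely for illustration.

```python
import functools
import inspect
import random
from typing import Callable, Dict, List

# Hypothetical registry: tool name -> callable.
TOOLBOX: Dict[str, Callable[..., float]] = {}


def tool(name: str):
    """Register the decorated function in TOOLBOX under `name`."""
    def register(fn):
        TOOLBOX[name] = fn
        return fn
    return register


def flaky(rate: float, rng: random.Random):
    """Silently corrupt the wrapped tool's result with probability `rate`."""
    def wrap(fn):
        @functools.wraps(fn)  # keeps the docstring and signature discoverable
        def inner(*args):
            result = fn(*args)
            if rng.random() < rate:
                return result + rng.randint(1, 100)  # wrong answer, no exception
            return result
        return inner
    return wrap


@tool("SUM")
@flaky(rate=0.4, rng=random.Random(0))  # assumed rate; the unreliable tool
def sum_tool(a: float, b: float) -> float:
    """Return a + b."""
    return a + b


@tool("DELTA")
def delta_tool(a: float, b: float) -> float:
    """Return a - b."""
    return a - b


@tool("PRODUCT")
def product_tool(a: float, b: float) -> float:
    """Return a * b."""
    return a * b


@tool("QUOTIENT")
def quotient_tool(a: float, b: float) -> float:
    """Return a / b."""
    return a / b


@tool("POWER")
def power_tool(a: float, b: float) -> float:
    """Return a raised to the power b."""
    return a ** b


def describe_tools() -> List[dict]:
    """Build the tool catalog at call time from the registry itself."""
    return [
        {"name": name, "signature": str(inspect.signature(fn)), "doc": inspect.getdoc(fn)}
        for name, fn in sorted(TOOLBOX.items())
    ]
```

An agent could call `describe_tools()` to discover what is available. Because the unreliable tool's docstring does not mention the corruption, the failure stays silent from the agent's point of view.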
The Architecture: Prototype a multi-agent system with at least 2 agents and at most 5 agents (for whatever definition of “agent” you believe makes sense for your project), that discovers which tools are available and sequences tool calls to reliably solve basic user-provided math problems (or if you prefer, mathy word problems). The agents can only use the available tools (including the unreliable tool[s]), i.e. no LLM-hallucinated arithmetic should be used for user-facing answers (even if that arithmetic is correct, as is increasingly the case among frontier models).
1. You might choose to include planning, reasoning, and/or task decomposition agents or layers in your prototype — but unless you have a compelling justification, all user-facing outputs (and most intermediate outputs) should be structured or semistructured, not unstructured.
- Don’t hesitate to ask us for an OpenAI API key or Anthropic API key. Otherwise, we’re happy to reimburse these costs after submission (within reason/at Spindle’s discretion).
The Twist: When your prototype identifies a sequence of tool calls that reliably or fairly reliably solves a certain class of math problem(s) based on successful execution(s), it should learn to (e.g.) memoize or semantically cache that sequence of tool calls as a single, idempotent new VirtualTool (i.e. some learning behavior akin to “bundling” the tool calls into a single new idempotent tool, to which a single call can be made, which can be reliably invoked next time a math problem of the same or similar form is encountered). Some of you may recognize a Voyager-/Odyssey-like flavor to this.
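To make the bundling idea concrete, here is a minimal sketch of one way it could work: record each successful tool sequence as data, and promote it to a `VirtualTool` once the same sequence has succeeded a few times. Everything here is an assumption for illustration: the `Step` and `SequenceLearner` names, the `"$name"` reference convention, the promotion threshold of 3, and the result cache used to approximate idempotence. The brief leaves all of these choices to you.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple, Union

Arg = Union[str, float]  # "$name" refers to an input or an earlier step's output


@dataclass(frozen=True)
class Step:
    tool: str               # name of the underlying tool to call
    args: Tuple[Arg, ...]   # literals or "$name" references
    out: str                # name under which the result is stored


@dataclass
class VirtualTool:
    name: str
    inputs: Tuple[str, ...]
    steps: Tuple[Step, ...]
    _cache: Dict[tuple, float] = field(default_factory=dict)

    def __call__(self, tools: Dict[str, Callable[..., float]], *values: float) -> float:
        if values in self._cache:  # same inputs always yield the same output
            return self._cache[values]
        env = dict(zip(self.inputs, values))
        for step in self.steps:
            resolved = [env[a[1:]] if isinstance(a, str) else a for a in step.args]
            env[step.out] = tools[step.tool](*resolved)
        result = env[self.steps[-1].out]
        self._cache[values] = result
        return result


class SequenceLearner:
    """Promote a tool sequence to a VirtualTool after `threshold` successes."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.successes: Counter = Counter()
        self.learned: Dict[str, VirtualTool] = {}

    def record_success(self, problem_class: str, inputs: Tuple[str, ...],
                       steps: Tuple[Step, ...]) -> Optional[VirtualTool]:
        key = (problem_class, inputs, steps)
        self.successes[key] += 1
        if self.successes[key] >= self.threshold and problem_class not in self.learned:
            self.learned[problem_class] = VirtualTool(problem_class, inputs, steps)
        return self.learned.get(problem_class)


# Reliable stand-in tools, used only to keep this example deterministic.
TOOLS = {"SUM": lambda a, b: a + b, "QUOTIENT": lambda a, b: a / b}

# "Mean of two numbers" expressed as SUM followed by QUOTIENT.
MEAN_STEPS = (Step("SUM", ("$a", "$b"), "s"), Step("QUOTIENT", ("$s", 2), "m"))
```

After three calls to `record_success("MEAN_OF_TWO", ("a", "b"), MEAN_STEPS)`, the learner returns a `VirtualTool` that can be invoked as `vt(TOOLS, 4, 6)`. Caching by input is a deliberately simple stand-in for idempotence: it guarantees repeatable answers but not correct ones, so a real prototype would want to verify a result before caching it.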
The Finish Line: Prove programmatically that your prototype works reasonably well (or at least that it could be completed to work reasonably well, if short on time).
1. Bonus points for using real evals to demonstrate this.
  1. (If you’re an “evals-focused” candidate, consider reframing/approaching the entire task through the lens of an evals system instead, i.e. evals-driven development. Just tell us to judge your quality vs. emphasis vs. completion accordingly.)
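As a rough illustration of what programmatic proof might look like, the sketch below measures a pass rate over repeated trials and compares a raw unreliable tool against a wrapper that calls it several times and keeps the most common answer. The names `evaluate`, `flaky_sum`, and `voted_sum`, the test cases, the 40% failure rate, and the majority-vote strategy are all assumptions chosen for the example, not requirements of the brief.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple

Case = Tuple[Tuple[float, ...], float]  # (inputs, expected answer)


def evaluate(solver: Callable[..., float], cases: Sequence[Case],
             trials: int = 200, tol: float = 1e-9) -> float:
    """Return the fraction of (case, trial) pairs the solver answers correctly."""
    passed = 0
    for inputs, expected in cases:
        for _ in range(trials):
            try:
                ok = abs(solver(*inputs) - expected) <= tol
            except Exception:
                ok = False  # a raised error counts as a failed trial
            passed += ok
    return passed / (len(cases) * trials)


_rng = random.Random(1)  # seeded so the run is repeatable


def flaky_sum(a: float, b: float) -> float:
    """Stand-in unreliable tool: silently wrong about 40% of the time."""
    result = a + b
    if _rng.random() < 0.4:
        return result + _rng.randint(1, 100)
    return result


def voted_sum(a: float, b: float, calls: int = 7) -> float:
    """Call the flaky tool several times and keep the most common answer.

    The wrapper only compares results for equality; it does no arithmetic itself.
    """
    votes = Counter(flaky_sum(a, b) for _ in range(calls))
    return votes.most_common(1)[0][0]


CASES = [((1, 2), 3), ((10, -4), 6), ((0.5, 0.25), 0.75)]

baseline = evaluate(flaky_sum, CASES)
voted = evaluate(voted_sum, CASES)
print(f"baseline pass rate: {baseline:.2f}, voted pass rate: {voted:.2f}")
```

Majority voting works here only because the simulated wrong answers rarely agree with each other. A tool that fails the same way every time would need a different check, such as verifying the result with an inverse operation.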
Bonus Points:
1. Create the math toolbox/interfaces in a non-Python language (ideally Rust, Go, or TypeScript).
2. If you decide to use a vector database anywhere for any reason, consider prototyping your own vector DB or VDB-like utility. (Not if this takes up all your time, though. It’s not the most important part.)

If you don’t have enough time for a project like this, or have alternate ideas, please let us know so we can find a path forward that we all feel good about! Either way, we look forward to seeing you through these next steps.

🛠️ Does it matter “how” I accomplish this?

Not really: The priority of this project is to ship useful software appropriate to the Job Description in a way that shows off some creativity and your strengths (at Spindle AI we “hire for strengths, not for lack of weaknesses” and “hire people, not roles”), within certain constraints.
You can use an LLM to help you write any code or evals you want, but you should still be familiar with every line of code: we might ask you why you wrote it that way!
There is no single “correct” answer or “correct” approach; there are only tradeoffs (and we expect you to be able to discuss tradeoffs in depth, either in your README or when reviewing the results together).
If you are inspired to go a different direction and feel confident it will result in a more compelling outcome, feel free to do so. This works as long as you’re confident we’ll be able to evaluate your project using many of the same criteria as we’d otherwise use.
- Rescope the work based on what you can fully accomplish (at least steel thread with demoable functionality) in the time available. We’d ask that you don’t spend an excessive or unusual amount of time (we’re not trying to test how much free time you have). If you’re feeling pressed for time, you might focus on a narrower scope! If you’re feeling more adventurous, you might push it further! (We trust your judgement either way, but the consequences are yours to embrace.)
We encourage you to ask us clarifying questions (including about the project or about possible compensation) anytime.

📬 Delivering your project

If live-coding this with us today, ignore this section. Otherwise:

When ready, please invite these GitHub users as collaborators on your GitHub repo.