💰 Compensation
This interview project may**$^*$** offer an honorarium of up to $1,000 (maximum, regardless of total hours spent), paid at a rate of $155 per hour spent directly engineering deliverables. Specifics to be discussed & agreed between you & your interviewer before starting (while interviewing).
**$^*$**The availability, amount, timing, and other terms of any honorarium are at the sole discretion of Spindle AI, and may change with or without notice. Honoraria are subject to complete & truthful time logging, timely & applicable invoicing, full confidentiality, degree of completion, good faith, standard conflict of interest principles, and Spindle’s sole discretion. Spindle AI reserves all customary rights, remedies, and indemnities. Interviewees can elect to forego the honorarium.
🎯 Your Mission
This mission represents a stripped-down but realistic “toy version” of the kind of multi-agent system Spindle AI is engineering (including some actual challenges we’ve already faced):
- The Setup: First, create ≥5 distinct, simple, deterministic tools that an LLM-based agent could call to help solve user-provided math problems (e.g. SUM, DELTA, PRODUCT, QUOTIENT, MODULO, POWER, ABS, LOG, TRIG, SQRT, AVG, MODE, ROUND, UNION, INTERSECT, DIFFERENTIATE, INTEGRATE, FACTORIZE, …; the specific tools are entirely up to you).
- Modify 1-2 of the most basic tools to intentionally but silently throw errors (and/or silently give incorrect answers) 30%-50% of the time the tool is called. You may also want to include a basic GET_USER_INPUT tool for requesting input/clarification from a human user. (You can organize all tools in some form of “toolbox” if you want, but we’d prefer you do not hardcode a string listing all the tools, their docs, and their usage examples in a single prompt file or prompt mega-string anywhere in the project.)
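As one illustration of the setup above (a minimal sketch, not a required design), the toolbox could be a small registry whose tool docs are derived from the functions themselves rather than from a hardcoded prompt string. Python is used here only for brevity. The names `TOOLBOX`, `tool`, `flaky`, and `describe_tools` are hypothetical, and the silent off-by-one failure mode and the 40% rate are assumptions chosen from within the 30%-50% range.

```python
import functools
import random
from typing import Callable, Dict

# Hypothetical registry: maps a tool name to its callable.
TOOLBOX: Dict[str, Callable[..., float]] = {}


def tool(name: str) -> Callable:
    """Register the decorated function in TOOLBOX under `name`."""
    def decorator(fn: Callable) -> Callable:
        TOOLBOX[name] = fn
        return fn
    return decorator


def flaky(failure_rate: float, rng: random.Random) -> Callable:
    """Make a tool silently return a wrong answer `failure_rate` of the time."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)  # keep the docstring so discovery still works
        def wrapper(*args: float) -> float:
            result = fn(*args)
            if rng.random() < failure_rate:
                # Assumed failure mode: a silent off-by-one, no exception raised.
                return result + rng.choice([-1, 1])
            return result
        return wrapper
    return decorator


RNG = random.Random(0)  # seeded so experiments are repeatable


@tool("SUM")
@flaky(0.4, RNG)  # registered version is the unreliable wrapper
def sum_tool(a: float, b: float) -> float:
    """SUM(a, b): return a + b."""
    return a + b


@tool("PRODUCT")
def product_tool(a: float, b: float) -> float:
    """PRODUCT(a, b): return a * b."""
    return a * b


def describe_tools() -> Dict[str, str]:
    """Discovery: build tool docs from the registry at call time."""
    return {name: fn.__doc__ or "" for name, fn in TOOLBOX.items()}
```

An agent could call `describe_tools()` at runtime to learn what is available. Because `SUM` fails silently, nothing in its return value reveals an error, so any reliability has to come from the calling side.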
- The Architecture: Prototype a multi-agent system with at least 2 agents and at most 5 agents (for whatever definition of “agent” you believe makes sense in this context), that discovers which tools are available and sequences tool calls to reliably solve basic user-provided math problems (or if you prefer, mathy word problems). The agents can only use the available tools (including the unreliable tool[s]), i.e. no LLM-hallucinated arithmetic should be used for user-facing answers (even if that arithmetic is correct, as is increasingly the case among frontier models).
- You might well choose to include a lightweight planning, reasoning, and/or task decomposition layer in your prototype — but unless you have a compelling justification, all user-facing outputs (and most intermediate outputs) should be structured or semistructured, not unstructured.
- Don’t hesitate to ask us for an OpenAI API key or Anthropic API key. Otherwise, we’re happy to reimburse these costs after submission (within reason/at Spindle’s discretion).
- The Twist: When your prototype identifies a sequence of tool calls that reliably or fairly reliably solves a certain class of math problem(s) based on successful execution(s), it should learn to (e.g.) memoize or semantically cache that sequence of tool calls as a single, idempotent new VirtualTool (i.e. some learning behavior akin to “bundling” the tool calls into a single new idempotent tool, to which a single call can be made, and which can be reliably invoked the next time a math problem of the same or similar form is encountered).
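One way to picture the bundling described above is the sketch below, which is an assumption about a possible shape and not the expected solution. A recorded plan is stored as data, and after a configurable number of successes it is promoted to a callable `VirtualTool`. The `PlanCache` class, the `"$i"` / `"#i"` argument convention, and the `promote_after` threshold are all hypothetical. The sketch keys plans by an exact problem-class label, so semantic matching of "similar" problems is left out, and it assumes the underlying tools behave reliably when the bundle is replayed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Tuple, Union

# Hypothetical argument convention:
#   "$i" -> i-th user input, "#i" -> result of step i, anything else -> literal.
Arg = Union[str, float]
Step = Tuple[str, Tuple[Arg, ...]]


def _resolve(arg: Arg, inputs: Sequence[float], results: Sequence[float]) -> float:
    if isinstance(arg, str) and arg.startswith("$"):
        return inputs[int(arg[1:])]
    if isinstance(arg, str) and arg.startswith("#"):
        return results[int(arg[1:])]
    return arg


@dataclass(frozen=True)
class VirtualTool:
    """A fixed sequence of tool calls exposed as a single call."""
    name: str
    steps: Tuple[Step, ...]

    def __call__(self, tools: Dict[str, Callable[..., float]], *inputs: float) -> float:
        results: list = []
        for tool_name, args in self.steps:
            resolved = [_resolve(a, inputs, results) for a in args]
            results.append(tools[tool_name](*resolved))
        return results[-1]


class PlanCache:
    """Counts successful plans per problem class and promotes repeat winners."""

    def __init__(self, promote_after: int = 3) -> None:
        self.promote_after = promote_after
        self.successes: Dict[Tuple[str, Tuple[Step, ...]], int] = {}
        self.virtual_tools: Dict[str, VirtualTool] = {}

    def record_success(self, problem_class: str, steps: Sequence[Step]) -> Optional[VirtualTool]:
        key = (problem_class, tuple(steps))
        self.successes[key] = self.successes.get(key, 0) + 1
        promoted = problem_class in self.virtual_tools
        if self.successes[key] >= self.promote_after and not promoted:
            self.virtual_tools[problem_class] = VirtualTool(problem_class, tuple(steps))
        return self.virtual_tools.get(problem_class)


# Usage: "mean of two numbers" as SUM followed by QUOTIENT.
tools = {"SUM": lambda a, b: a + b, "QUOTIENT": lambda a, b: a / b}
plan = (("SUM", ("$0", "$1")), ("QUOTIENT", ("#0", 2)))
cache = PlanCache(promote_after=2)
cache.record_success("mean_of_two", plan)           # not yet promoted
mean = cache.record_success("mean_of_two", plan)    # promoted on 2nd success
print(mean(tools, 4, 6))  # 5.0
```

Storing the plan as plain data (tool names plus argument references) keeps the bundle inspectable and serializable, which may make it easier to show in a README why a given `VirtualTool` was created.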
- The Finish Line: Prove programmatically that your prototype works reasonably well (or at least that it could be completed to work reasonably well, if short on time).
- Bonus points for using actual evals to show this.
- (If you’re an “evals-focused” candidate, consider reframing/approaching the entire task through the lens of an evals system instead, i.e. evals-driven development. Just tell us to judge your quality vs. emphasis vs. completion accordingly.)
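For the finish line above, a very small eval harness might look like the following sketch. It assumes the prototype can be wrapped as a `solve(problem) -> float` function, which is a hypothetical interface; `EvalCase`, `run_evals`, `stub_solver`, and the sample cases are likewise made up for illustration and stand in for a real system and a real problem set.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class EvalCase:
    problem: str
    expected: float  # reference answer computed independently of the system


def run_evals(solve: Callable[[str], float], cases: Sequence[EvalCase],
              tol: float = 1e-9) -> Dict[str, object]:
    """Run every case through `solve` and summarize pass/fail counts."""
    failures: List[str] = []
    for case in cases:
        try:
            ok = math.isclose(solve(case.problem), case.expected, abs_tol=tol)
        except Exception:
            # A crash counts as a failed case rather than aborting the run.
            ok = False
        if not ok:
            failures.append(case.problem)
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "accuracy": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


# Stand-in for the real multi-agent system (hypothetical).
def stub_solver(problem: str) -> float:
    answers = {"2 + 3": 5.0, "4 * 5": 20.0, "10 / 4": 3.0}  # last one is wrong
    return answers[problem]  # raises KeyError for unknown problems


CASES = [
    EvalCase("2 + 3", 5.0),
    EvalCase("4 * 5", 20.0),
    EvalCase("10 / 4", 2.5),
    EvalCase("7 - 2", 5.0),
]

report = run_evals(stub_solver, CASES)
print(report["accuracy"], report["failures"])  # 0.5 ['10 / 4', '7 - 2']
```

Running the same case set before and after the learning behavior kicks in (for example, comparing accuracy and number of tool calls) is one possible way to show that the system improves over time.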
- Bonus Points:
- Create the math toolbox/interfaces in a non-Python language (ideally Rust, Go, or Typescript).
- If you decide to use a vector database anywhere, consider prototyping your own vector DB or VDB-like utility. (Not if this takes up all your time, though. It’s not the most important part.)
- If you don’t have enough time for a project like this, or have alternate ideas, please let us know so we can find a path forward that we all feel good about! Either way, we really look forward to seeing you through these next steps.
🛠️ Does it matter “how” I accomplish this?
- Not really: The priority of this project is to ship useful software appropriate to the Job Description in a way that shows off some creativity and your strengths (at Spindle AI we “hire for strengths, not for lack of weaknesses” and “hire people, not roles”), within certain constraints.
- You can use an LLM to help you write any code you want. No bonus points for writing all your code by hand. (Probably you should write some of it by hand though 😅)
- There is no single “correct” answer or “correct” approach; there are only tradeoffs (and, we expect you to be able to discuss tradeoffs in depth, either in your README or when reviewing the results together).
- If you are inspired to go a different direction and feel confident it will result in a more compelling outcome, feel free to do so. This works as long as you’re confident we’ll be able to evaluate your project using many of the same criteria we’d apply to the originally proposed scope.
- Rescope the work based on what you can fully accomplish (at least steel thread with demoable functionality) in the time available. We’d ask that you don’t spend an excessive or unusual amount of time (we’re not trying to test how much free time you have). If you’re feeling pressed for time, you might focus on a narrower scope! If you’re feeling more adventurous, you might push it further! (We trust your judgement either way, but the consequences are yours to own.)
- We encourage you to ask us clarifying questions (including about the project or about possible compensation) anytime!
📬 Delivering your project
If live-coding this with us today, ignore this section. Otherwise: