💰 Compensation

This interview project may**$^*$** offer an honorarium of up to $1,000 (maximum, regardless of total hours spent), paid at a rate of $155 per hour spent directly engineering deliverables. Specifics are to be discussed & agreed between you & your interviewer during the interview process, before you start the project.

**$^*$**The availability, amount, timing, and other terms of any honorarium are at the sole discretion of Spindle AI, and may change with or without notice. Honoraria are subject to complete & truthful time logging, timely & applicable invoicing, full confidentiality, degree of completion, good faith, standard conflict of interest principles, and Spindle’s sole discretion. Spindle AI reserves all customary rights, remedies, and indemnities. Interviewees can elect to forgo the honorarium.

🎯 Your Mission

This mission represents a stripped-down but realistic “toy version” of the kind of multi-agent system Spindle AI is engineering (including some actual challenges we’ve already faced):

  1. The Setup: First, create ≥5 distinct, simple, deterministic tools that an LLM-based agent could call to help solve user-provided math problems (e.g. SUM, DELTA, PRODUCT, QUOTIENT, MODULO, POWER, ABS, LOG, TRIG, SQRT, AVG, MODE, ROUND, UNION, INTERSECT, DIFFERENTIATE, INTEGRATE, FACTORIZE; the specific tools are entirely up to you).

    1. Modify 1-2 of the most basic tools to intentionally but silently throw errors (and/or silently give incorrect answers) 30%-50% of the time the tool is called. You may also want to include a basic GET_USER_INPUT tool for requesting input/clarification from a human user. (You can organize all tools in some form of “toolbox” if you want, but we’d prefer you do not hardcode a string listing all the tools, their docs, and their usage examples in a single prompt file or prompt mega-string anywhere in the project.)
  2. The Architecture: Prototype a multi-agent system with at least 2 agents and at most 5 agents (for whatever definition of “agent” you believe makes sense in this context), that discovers which tools are available and sequences tool calls to reliably solve basic user-provided math problems (or if you prefer, mathy word problems). The agents can only use the available tools (including the unreliable tool[s]), i.e. no LLM-hallucinated arithmetic should be used for user-facing answers (even if that arithmetic is correct, as is increasingly the case among frontier models).

    1. You might well choose to include a lightweight planning, reasoning, and/or task decomposition layer in your prototype — but unless you have a compelling justification, all user-facing outputs (and most intermediate outputs) should be structured or semistructured, not unstructured.
  3. The Twist: When your prototype identifies a sequence of tool calls that reliably or fairly reliably solves a certain class of math problem(s) based on successful execution(s), it should learn to (e.g.) memoize or semantically cache that sequence of tool calls as a single, idempotent new VirtualTool (i.e. some learning behavior akin to “bundling” the tool calls into a single new idempotent tool that can be reliably invoked with a single call the next time a math problem of the same or similar form is encountered).

  4. The Finish Line: Prove programmatically that your prototype works reasonably well (or at least that it could be completed to work reasonably well, if short on time).

    1. Bonus points for using actual evals to show this.
      1. (If you’re an “evals-focused” candidate, consider reframing/approaching the entire task through the lens of an evals system instead, i.e. evals-driven development. Just tell us to judge your quality vs. emphasis vs. completion accordingly.)
  5. Bonus Points:

    1. Create the math toolbox/interfaces in a non-Python language (ideally Rust, Go, or TypeScript).
    2. If you decide to use a vector database anywhere, consider prototyping your own vector DB or VDB-like utility. (Not if this takes up all your time, though. It’s not the most important part.)
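
To make the Setup and Twist steps above concrete, here is one possible shape for them. It is a minimal sketch under stated assumptions, not a reference solution. Python is used only for brevity (the brief gives bonus points for Rust, Go, or TypeScript). The names `TOOLBOX`, `tool`, `flaky`, `describe_tools`, and the slot-index encoding of `VirtualTool.steps` are all hypothetical, and the "off-by-one" failure mode is just one way a tool could silently give incorrect answers.

```python
import functools
import inspect
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: tool name -> callable. Docs live in each docstring.
TOOLBOX: Dict[str, Callable[..., float]] = {}


def tool(name: str):
    """Register the decorated callable in TOOLBOX under `name`."""
    def register(fn):
        TOOLBOX[name] = fn
        return fn
    return register


def flaky(failure_rate: float, rng: random.Random):
    """Silently return a wrong answer on roughly `failure_rate` of calls."""
    def wrap(fn):
        @functools.wraps(fn)  # keeps the name, docstring, and signature visible
        def inner(*args):
            result = fn(*args)
            # Assumed failure mode: an off-by-one result with no error raised.
            return result + 1 if rng.random() < failure_rate else result
        return inner
    return wrap


@tool("SUM")
@flaky(failure_rate=0.4, rng=random.Random())
def sum_two(a: float, b: float) -> float:
    """Return a + b."""
    return a + b


@tool("PRODUCT")
def product(a: float, b: float) -> float:
    """Return a * b."""
    return a * b


@tool("POWER")
def power(a: float, b: float) -> float:
    """Return a raised to the power b."""
    return a ** b


def describe_tools() -> List[Dict[str, str]]:
    """Build tool docs from the registry at call time (no hardcoded prompt string)."""
    return [
        {"name": name, "signature": str(inspect.signature(fn)), "doc": inspect.getdoc(fn) or ""}
        for name, fn in sorted(TOOLBOX.items())
    ]


@dataclass(frozen=True)
class VirtualTool:
    """A learned bundle of tool calls, replayed as one call."""
    name: str
    # Each step is (tool name, slots); a slot indexes the inputs followed by
    # the results of earlier steps, in order.
    steps: Tuple[Tuple[str, Tuple[int, ...]], ...]

    def __call__(self, *inputs: float) -> float:
        values = list(inputs)
        for tool_name, slots in self.steps:
            values.append(TOOLBOX[tool_name](*(values[i] for i in slots)))
        return values[-1]


# (a * b) ** c, as it might be captured after a successful PRODUCT -> POWER run.
product_then_power = VirtualTool(
    "PRODUCT_THEN_POWER", (("PRODUCT", (0, 1)), ("POWER", (3, 2)))
)
TOOLBOX[product_then_power.name] = product_then_power
```

With this sketch, `TOOLBOX["PRODUCT_THEN_POWER"](2, 3, 2)` returns 36, and `describe_tools()` lists the new virtual tool next to the primitives, so an agent can discover it the same way. A bundle that includes the flaky `SUM` would inherit its silent failures; how to detect and work around those is deliberately left open here.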

🛠️ Does it matter “how” I accomplish this?

📬 Delivering your project

If live-coding this with us today, ignore this section. Otherwise: