Field Notes · How We Build · March 2026

How We Built an AI That Reads Architectural Drawings, Not Just Text

There's a version of this story where I say we used AI to "read" building plans and leave it at that. But the word "read" is doing a lot of work, and what's actually happening underneath is more interesting and more specific than the general claim suggests.

Here's what we actually built, and why the distinction matters.

The problem with "AI reads documents"

Most AI document tools are really text extraction tools. You give them a PDF, they pull out the words, and they do something useful with those words. That works well for contracts, invoices, emails, anything where the meaningful content is in the text.

Building plans are different. Yes, they have text on them: room labels, dimensions, material callouts, notes in the margins. But the information that matters for a NatHERS energy assessment isn't mostly in the text. It's in the drawings.

Which direction does the building face? That's the north arrow on the site plan, not a label that says "orientation: north-west." How deep are the eaves? That's a measurement on the elevation drawing. What's the window-to-wall ratio on the north facade? You need to look at the elevations and do some spatial reasoning. What construction type is the external wall? Sometimes it's in a schedule, sometimes it's a detail drawing, sometimes it's implied by what's shown.

An AI that only reads text would miss most of this. And missing it means an assessor still has to read the drawings manually, which is the whole problem we were trying to solve.

What we needed the AI to actually do

The brief, in plain terms: look at a building plan the way a trained assessor would, extract the specification data relevant to a NatHERS assessment, and pre-populate the job form with what it finds.

That means the AI needs to handle:

  • Floor plans, to understand the building's orientation and layout
  • Elevation drawings, to determine eave depths, window head heights, and wall construction
  • Window schedules, which are sometimes tables and sometimes just notations on the drawings
  • Construction notes, which can appear in multiple places in multiple formats depending on who drew the plans and when

And it needs to do this across plans that vary enormously in style, quality, level of detail, and format. Plans from a boutique architect look nothing like plans from a volume builder. Both need to work.

The approach: vision, not extraction

We use Claude's vision capability, which means we're not trying to extract text from the PDF and then parse it. We're sending the actual page images to the model and asking it to look at them the way a person would.

This is the part that changes what's possible. A vision model doesn't need the information to be in a text layer. It can look at a north arrow and understand orientation. It can look at an elevation and read a dimension string. It can look at a detail drawing and infer wall construction type.

We break each plan set into pages, send the relevant pages to the model with a structured prompt that specifies exactly what to look for and how to report it, and map the responses back to the fields in the job form.
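As a rough sketch of what one page submission looks like: the message shape below follows Anthropic's Messages API, where images go in as base64 content blocks alongside the text prompt. The function name and the surrounding pipeline are illustrative, not our production code.

```python
import base64

def build_page_message(page_png_bytes: bytes, instructions: str) -> dict:
    """Package one rendered plan page plus the extraction prompt as a
    single user message, using a base64 image content block."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(page_png_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": instructions},
        ],
    }

# In the real pipeline this message would be sent to the model via the
# API client; here we only show the payload structure.
msg = build_page_message(b"\x89PNG...", "Report the north arrow direction.")
```

The point is that the model sees the rendered page image itself, not a text layer extracted from the PDF, so anything drawn on the page is fair game.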

The prompt engineering here is where most of the work lives. Getting the model to return consistent, structured data across wildly varying inputs requires being extremely specific about what you want, what format you want it in, and how to handle ambiguity.
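To give a flavour of that specificity, here is a trimmed-down example of the kind of extraction prompt we mean. The field names, units, and wording are illustrative, not our production prompt; the important parts are the fixed output shape, the explicit units, and the rule that the model must return null rather than guess.

```python
# Illustrative extraction prompt: fixed JSON shape, explicit units,
# and explicit instructions for ambiguity and missing data.
EXTRACTION_PROMPT = """\
You are reading one page of a residential building plan for a NatHERS
energy assessment. Report ONLY what this page shows, as JSON:

{
  "orientation_deg": <degrees clockwise from true north, or null>,
  "eave_depth_mm": <number or null>,
  "external_wall_construction": <string or null>,
  "confidence": {<field name>: "high" | "medium" | "low"}
}

Rules:
- If a value is not shown on this page, use null. Never guess.
- If the drawing and a written note conflict, report the noted value
  and mark that field's confidence "low".
- Report dimensions in millimetres, exactly as dimensioned.
"""
```

Sending the same fixed shape with every page is what lets the form-mapping code downstream treat the responses uniformly, however different the plans look.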

The confidence layer

This is the piece I'm most glad we built in from the start.

The model doesn't always know. Sometimes the plans don't show what we need. Sometimes the information is ambiguous. Sometimes what's drawn doesn't match what's noted. A model that fills in its best guess and presents it as fact would cause real problems in a compliance context.

So every extracted value comes back with a confidence indicator. High confidence means the model found clear, unambiguous information. Lower confidence means it had to infer the value, found conflicting sources, or couldn't find the information at all.

Anything below a confidence threshold gets flagged for human review. The assessor doesn't skip that field. They look at the plan and make the call themselves.
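The review rule itself is simple. A minimal sketch of the triage step, assuming the per-field (value, confidence) shape from earlier; the threshold and field names are illustrative:

```python
CONFIDENCE_RANK = {"high": 2, "medium": 1, "low": 0}
MIN_RANK = 2  # illustrative cutoff: only "high" pre-fills without a flag

def triage(extracted: dict) -> tuple[dict, dict]:
    """Split extracted fields into values safe to pre-fill and values
    that must be flagged for the assessor to check against the plans."""
    prefill, flagged = {}, {}
    for field, (value, confidence) in extracted.items():
        if value is not None and CONFIDENCE_RANK[confidence] >= MIN_RANK:
            prefill[field] = value
        else:
            flagged[field] = (value, confidence)
    return prefill, flagged

prefill, flagged = triage({
    "eave_depth_mm": (450, "high"),
    "external_wall_construction": ("brick veneer", "medium"),
    "orientation_deg": (None, "low"),
})
# the eave depth pre-fills; the other two land in the review queue
```

A flagged field isn't discarded, note: the model's best reading and its confidence travel with the flag, so the assessor sees what the model saw before making the call.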

This is what makes the system trustworthy enough to use in a real compliance workflow. It's not removing the assessor's judgment. It's reserving that judgment for the cases where it's genuinely needed, instead of applying it to every field on every job.

What the assessor's job looks like now

Before: read the plans, type the specs, repeat for every field, hope nothing got missed.

Now: open the pre-populated form, check the flagged fields, confirm or correct the rest, submit.

The cognitive load is completely different. Reviewing a filled form for accuracy is easier than building it from scratch. Flagged fields draw attention to exactly the things that need it. The assessor is still doing skilled work. They're just doing the skilled part, not the clerical part.

That's what we were actually trying to build. Not an AI that replaces the assessor. An AI that gives the assessor their time back for the work only they can do.

Building something with a similar challenge? Let's talk about it.