Michael R. Bock
@michaelrbock.com
co-founder of Column Tax // michaelrbock.com
michaelrbock.com
4/ next up?

adding tool use (code execution & web search) to see how that helps models calculate tax returns

also testing Claude Opus 4.1 and GPT-5 mini & nano

follow here: github.com/column-tax/...
column-tax/tax-calc-bench
Code & data for TaxCalcBench. Contribute to column-tax/tax-calc-bench development by creating an account on GitHub.
github.com
michaelrbock.com
3/ GPT-5 is impressive in many ways

especially because its knowledge cutoff is still September 2024

but it's not the leader in tax calculation today

(even with maximal test time compute)
michaelrbock.com
2/ back in July, we published the first-ever eval for US personal income tax calculations

x.com/michaelrboc...
michaelrbock.com
1/ GPT-5 is worse than Gemini 2.5 Pro at filing your taxes (but it's really close and neither can do it yet)

we proved it via our tax calculation benchmark:
michaelrbock.com
I got married last month.🤵‍♂️👰‍♀️

Here's what it taught me about B2B2C tax software:

Just kidding :) but I do really recommend getting married to the love of your life with all your friends & family around!
michaelrbock.com
no one had even heard of git worktrees before claude code
michaelrbock.com
amazing ChatGPT Agent Mode use case: find & validate coupon codes without having to test them yourself
michaelrbock.com
9/ This work wouldn’t have been possible without the hard work of our Tax Analyst team over the past 4 years & the success of our commercial product: you can’t buy this dataset on Scale or Surge.

View the dataset and testing harness here:

github.com/column-tax/...
GitHub - column-tax/tax-calc-bench: Code & data for TaxCalcBench
michaelrbock.com
8/ Models are also inconsistent:

using pass^k (a measure of how reliably a model succeeds across multiple runs of the same task), performance degrades with additional runs, meaning models mess up in new & surprising ways when calculating tax returns.
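
A minimal sketch of how pass^k can be computed. It assumes the common estimator where, for a task with c successes out of n runs, the chance that k runs drawn from those n all succeed is C(c, k) / C(n, k), averaged over tasks. The thread does not say which estimator TaxCalcBench uses, and the function name and sample counts below are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_runs, k):
    """Estimate pass^k: the average, over tasks, of the probability that
    k runs sampled without replacement from n_runs all succeeded.

    successes_per_task: number of successful runs for each task (each <= n_runs).
    Assumption: this estimator is illustrative, not confirmed as TaxCalcBench's.
    """
    if not 1 <= k <= n_runs:
        raise ValueError("k must be between 1 and n_runs")
    total = comb(n_runs, k)
    # comb(c, k) is 0 when c < k, so a task with too few successes contributes 0.
    return sum(comb(c, k) / total for c in successes_per_task) / len(successes_per_task)


# Hypothetical data: three returns, four runs each.
successes = [4, 3, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(successes, n_runs=4, k=k), 4))
```

With these made-up counts the score falls as k grows, because a return that fails even once stops counting once k reaches the number of runs.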
michaelrbock.com
7/ For some models, performance improves with increased inference-time compute (thinking budget tokens)

but not for the best model (Gemini 2.5 Pro), suggesting alternative techniques/scaffolding/orchestration are required to get AI to do this tax calculation task.
michaelrbock.com
6/ Models consistently:

1. Misuse tax tables
2. Make calculation errors

For example, models will hallucinate line numbers on Forms or use incorrect eligibility limits.
michaelrbock.com
5/ Takeaway: models can’t calculate tax returns reliably today.

Even on this simplified dataset and allowing the models to output to a simplified format, the best model only calculates 32.35% of returns correctly.
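
A rough sketch of what an all-or-nothing "percent of returns correct" score could look like. It assumes a return counts as correct only when every expected line matches exactly. The line names, amounts, and function names are hypothetical and are not taken from the TaxCalcBench harness.

```python
def is_return_correct(expected, actual):
    """True only if every expected line is present in the model output
    with exactly the expected amount (assumed strict scoring rule)."""
    return all(actual.get(line) == amount for line, amount in expected.items())


def percent_correct(cases):
    """cases: list of (expected, actual) dict pairs, one pair per tax return."""
    correct = sum(is_return_correct(expected, actual) for expected, actual in cases)
    return 100.0 * correct / len(cases)


# Hypothetical returns with made-up line names and amounts.
cases = [
    ({"total_income": 50000, "total_tax": 4000}, {"total_income": 50000, "total_tax": 4000}),
    ({"total_income": 72000, "total_tax": 8100}, {"total_income": 72000, "total_tax": 8090}),
    ({"total_income": 31000, "total_tax": 1900}, {"total_income": 31000}),
]
print(round(percent_correct(cases), 2))
```

Under a rule like this, a single wrong or missing line fails the whole return, which is why small calculation slips weigh so heavily on the headline number.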
michaelrbock.com
4/ TaxCalcBench is a dataset of 51 pairs of user inputs and the expected tax return output + a testing harness.

We made the task easy for the models. We provide:
- all of the data (e.g. W-2s) needed to file a return
- the expected output in IRS XML format
michaelrbock.com
3/ Tax calculation means taking a user’s "inputs" (W-2s, 1099s) and outputting the Form 1040 in the IRS XML format.

75k pages of English text define the transformations required to do this.

Companies like @ColumnTax use deterministic tax engines to do these calculations.
michaelrbock.com
2/ Today, we’re releasing TaxCalcBench: a first-ever benchmark dataset & eval framework for testing AI’s ability to calculate US personal income tax returns.

Tax is a secretive industry, so we’re proud to release a research paper sharing our findings:

arxiv.org/abs/2507.16126
TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully...
arxiv.org
michaelrbock.com
1/ Can AI file your taxes? Not yet.

We tested the latest frontier models and the results were full of catastrophic errors.

Letting AI do your taxes would mean IRS rejections, audits, and penalties:
michaelrbock.com
this is the wildest cold twitter dm opener i've ever received
michaelrbock.com
this is what founder <> founder private text messages look like (and what makes the job so fun)
michaelrbock.com
why is everyone complaining about a GPU shortage if it turns out you can just buy them on amazon ;)
michaelrbock.com
11/ Thanks to the folks who worked on Direct File. We have a lot of gratitude.
michaelrbock.com
10/ For that reason, I've decided to dedicate my life to working on this problem directly and making incremental progress every day, instead of waiting for another group—like the government—to solve it. It might not be the perfect idealistic solution, but I think it's worth fighting for nonetheless.
michaelrbock.com
9/ But we live in a world where it seems like progress from the federal government stalls every four years as administrations change.
michaelrbock.com
8/ I'm sad to see Direct File go. It was great to have more options for taxpayers in the market. And I hope it comes back. I've taken a very pragmatic approach in my career to the same problem Direct File was tackling: tax filing should be easy and automatic.