Michael R. Bock
michaelrbock.com
@michaelrbock.com
co-founder of Column Tax // michaelrbock.com
We’re so confident that we’re publishing an internal roadmap document: our “secret” master plan to automate tax filing (just between you & me).
October 29, 2025 at 1:48 PM
And now the combination of the latest AI progress and our expert team & large proprietary eval datasets means we’re the group that can finally fully automate tax filing and save people time & money.
October 29, 2025 at 1:48 PM
4/ next up?

adding tool use (code execution & web search) to see how that helps models calculate tax returns

also testing Claude Opus 4.1 and GPT-5 mini & nano

follow here: github.com/column-tax/...
column-tax/tax-calc-bench
Code & data for TaxCalcBench. Contribute to column-tax/tax-calc-bench development by creating an account on GitHub.
github.com
September 18, 2025 at 5:39 PM
3/ GPT-5 is impressive in many ways

especially because its knowledge cutoff is still September 2024

but it's not the leader in tax calculation today

(even with maximal test time compute)
September 18, 2025 at 5:39 PM
2/ back in July, we published the first-ever eval for US personal income tax calculations

x.com/michaelrboc...
September 18, 2025 at 5:38 PM
10/ Read more about the work, research, and results here:

www.columntax.com/blog/taxcal...
TaxCalcBench: Can AI file your taxes?
AI can’t do your taxes on its own (yet).
www.columntax.com
July 23, 2025 at 3:18 PM
9/ This work wouldn’t have been possible without the hard work of our Tax Analyst team over the past 4 years & the success of our commercial product: you can’t buy this dataset on Scale or Surge.

View the dataset and testing harness here:

github.com/column-tax/...
GitHub - column-tax/tax-calc-bench: Code & data for TaxCalcBench
July 23, 2025 at 3:18 PM
8/ Models are also inconsistent:

using pass^k (a measure of how reliably a model gets the same task right across multiple runs), performance degrades as runs are added, meaning models mess up in new & surprising ways when calculating tax returns.
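
To make pass^k concrete, here is a minimal sketch. It assumes the commonly used estimator in which, for a task attempted n times with c correct runs, pass^k is the chance that k runs drawn without replacement are all correct: C(c, k) / C(n, k). The function names and run counts below are illustrative and not taken from the TaxCalcBench harness.

```python
from math import comb


def pass_hat_k(num_runs: int, num_correct: int, k: int) -> float:
    """Chance that k runs drawn (without replacement) from num_runs are ALL correct."""
    if not 1 <= k <= num_runs:
        raise ValueError("k must be between 1 and num_runs")
    if not 0 <= num_correct <= num_runs:
        raise ValueError("num_correct must be between 0 and num_runs")
    # math.comb returns 0 when k > num_correct, so the estimate drops to 0.
    return comb(num_correct, k) / comb(num_runs, k)


def benchmark_pass_hat_k(correct_counts: list[int], num_runs: int, k: int) -> float:
    """Average the per-task estimate over every task in the benchmark."""
    if not correct_counts:
        raise ValueError("no tasks")
    return sum(pass_hat_k(num_runs, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical data: 3 tasks, 4 runs each, with 4, 2 and 0 correct runs.
    counts = [4, 2, 0]
    for k in range(1, 5):
        print(k, benchmark_pass_hat_k(counts, num_runs=4, k=k))
```

A task that is right only some of the time contributes less and less as k grows, which is why the score can only stay flat or fall with more runs.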
July 23, 2025 at 3:18 PM
7/ For some models, performance improves with increased inference-time compute (thinking budget tokens)

but not for the best model (Gemini 2.5 Pro), suggesting that alternative techniques, scaffolding, or orchestration are required to get AI to do this tax calculation task.
July 23, 2025 at 3:18 PM
6/ Models consistently:

1. Misuse tax tables
2. Make calculation errors

For example, models will hallucinate line numbers on Forms or use incorrect eligibility limits.
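
One way a tax-table mistake can arise is sketched below. This is an assumed illustration of the failure mode, not something the thread spells out: applying the bracket percentages directly to taxable income gives a different answer than a table lookup, because a table row covers a $50 range and uses the tax at the row's midpoint, rounded to whole dollars. The bracket figures are assumed single-filer values for tax year 2024 and the function names are hypothetical; check both against the official instructions before relying on them.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed figures: (lower bound, rate) for a single filer, tax year 2024.
# Only the brackets needed for taxable income under $100,000 are listed.
BRACKETS = [
    (Decimal("0"), Decimal("0.10")),
    (Decimal("11600"), Decimal("0.12")),
    (Decimal("47150"), Decimal("0.22")),
]


def bracket_formula_tax(taxable_income: Decimal) -> Decimal:
    """Apply the marginal rates directly to the exact taxable income."""
    tax = Decimal("0")
    for i, (lower, rate) in enumerate(BRACKETS):
        if taxable_income <= lower:
            break
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else None
        top = taxable_income if upper is None else min(taxable_income, upper)
        tax += (top - lower) * rate
    return tax


def tax_table_tax(taxable_income: Decimal) -> Decimal:
    """Simplified table lookup: tax at the midpoint of the $50 row, in whole dollars."""
    if not Decimal("3000") <= taxable_income < Decimal("100000"):
        raise ValueError("this sketch only covers the $50-wide table rows")
    row_start = (taxable_income // 50) * 50
    midpoint = row_start + 25
    return bracket_formula_tax(midpoint).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    income = Decimal("40010")
    print("formula:", bracket_formula_tax(income))
    print("table:  ", tax_table_tax(income))
```

Under these assumptions the two methods disagree by a couple of dollars for the same income, which is enough to fail a strict comparison against the expected return.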
July 23, 2025 at 3:18 PM
5/ Takeaway: models can’t calculate tax returns reliably today.

Even on this simplified dataset, and with the models allowed to output a simplified format, the best model only calculates 32.35% of returns correctly.
July 23, 2025 at 3:18 PM
4/ TaxCalcBench is a dataset of 51 pairs of user inputs and the expected tax return output + a testing harness.

We made the task easy for the models. We provide:
- all of the data (e.g. W-2s) needed to file a return
- the expected output in IRS XML format
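
As a rough sketch of what scoring a pair of inputs and expected output can look like: a return counts as correct only if every expected line matches exactly. The dictionary layout, line names, and function names here are hypothetical and are not the repository's actual harness or the IRS XML schema.

```python
def return_matches(expected: dict[str, int], actual: dict[str, int]) -> bool:
    """Strict check: every expected line must be present with the exact amount."""
    return all(actual.get(line) == amount for line, amount in expected.items())


def fraction_correct(cases: list[tuple[dict[str, int], dict[str, int]]]) -> float:
    """Share of (expected, actual) pairs where the whole return matches."""
    if not cases:
        raise ValueError("no test cases")
    return sum(return_matches(exp, act) for exp, act in cases) / len(cases)


if __name__ == "__main__":
    # Hypothetical line names and amounts, for illustration only.
    expected = {"taxable_income": 40010, "tax": 4571}
    model_ok = {"taxable_income": 40010, "tax": 4571}
    model_off = {"taxable_income": 40010, "tax": 4569}
    print(fraction_correct([(expected, model_ok), (expected, model_off)]))
```

With an all-or-nothing check like this, a return that is off by a single dollar on one line counts as a miss.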
July 23, 2025 at 3:17 PM
3/ Tax calculation means taking a user’s "inputs" (W-2s, 1099s) and outputting the Form 1040 in the IRS XML format.

75k pages of English text define the transformations required to do this.

Companies like @ColumnTax use deterministic tax engines to do these calculations.
July 23, 2025 at 3:17 PM
2/ Today, we’re releasing TaxCalcBench: the first-ever benchmark dataset & eval framework for testing AI’s ability to calculate US personal income tax returns.

Tax is a secretive industry, so we’re proud to release a research paper sharing our findings:

arxiv.org/abs/2507.16126
TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully...
arxiv.org
July 23, 2025 at 3:17 PM