adding tool use (code execution & web search) to see how that helps models calculate tax returns
also testing Claude Opus 4.1 and GPT-5 mini & nano
follow here: github.com/column-tax/...
especially because its knowledge cutoff is still September 2024
but it's not the leader in tax calculation today
(even with maximal test time compute)
x.com/michaelrboc...
View the dataset and testing harness here:
github.com/column-tax/...
using pass^k (a measure of how reliably a model succeeds across multiple runs on the same task), performance degrades with additional runs, meaning models mess up in new & surprising ways when calculating tax returns.
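A minimal sketch of how pass^k can be estimated, assuming the usual definition (the chance that k runs drawn from the n recorded runs of a task all succeed, averaged over tasks). The function name and the sample counts are hypothetical and are not taken from the dataset or testing harness.

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k from per-task success counts.

    successes_per_task: for each task, how many of its n runs were correct.
    n: number of runs recorded per task.
    k: number of runs that must ALL be correct.
    """
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) / comb(n, k) is the fraction of k-run subsets that are all correct.
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical example: 4 tax returns, 4 runs each.
# Two returns are always right, one is right 3 times out of 4, one is never right.
counts = [4, 3, 0, 4]
for k in (1, 2, 4):
    print(k, pass_hat_k(counts, n=4, k=k))
```

With these made-up counts the estimate falls from 0.6875 at k=1 to 0.5 at k=4: the return that is right only 3 times out of 4 stops counting once every run has to be correct.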
but not for the best model (Gemini 2.5 Pro), suggesting alternative techniques/scaffolding/orchestration are required to get AI to do this tax calculation task.
1. Misuse tax tables
2. Make calculation errors
For example, models will hallucinate line numbers on Forms or use incorrect eligibility limits.
Even on this simplified data set and allowing the models to output to a simplified format, the best model only calculates 32.35% of returns correctly.
We made the task easy for the models. We provide:
- all of the data (e.g. W-2s) needed to file a return
- the expected output in IRS XML format
75k pages of English text define the transformations required to do this.
Companies like @ColumnTax use deterministic tax engines to do these calculations.
Tax is a secretive industry, so we’re proud to release a research paper sharing our findings:
arxiv.org/abs/2507.16126