Recent headlines lauding the success of natural language processing systems like BERT suggest that true natural language understanding is just around the corner. The fine print, on the other hand, tells a different story. In this talk, we’ll dive into the challenge of defining quality benchmarks for mission-critical human language technology tasks, looking to economics, standardized testing, and animal psychology to help us better define advancement and success.