Statistical Analysis of Fuel Economy Claims

Updated August 28, 2023


I’m subtitling this post “don’t believe all the data you read.”
 
There’s a lot of stock put in fuel economy tests to verify changes in drag. Someone makes an aerodynamic change to their car and runs an “A-B” test, one run (sometimes more) in each of two opposing directions while measuring fuel economy, to “prove” that drag has been reduced. These results are then posted online uncritically; I’ve done it myself in the past.
 
There are a couple of things wrong with these tests.
 
The Normal Distribution
 
Statistics help us evaluate the truth of real-world claims. One of the foundations of this branch of mathematics is the “normal distribution”: the observation that natural systems follow the same pattern of variability, with a predictable percentage of measured results falling within a given number of standard deviations of the simple average (roughly 68% within one standard deviation, and about 95% within two). For example, over the years I’ve owned my Prius, I have kept track of both the fuel economy displayed on the factory gauge and the MPG calculated from gallons pumped and miles driven. Last year, I put the difference between displayed and calculated economy for each tank in a spreadsheet and plotted it:
 
The mean was 3.0 MPG, right at the top of the curve.
 
You can see that the data form a nice curve. Most of the results are in the middle, close to the mean. As you get further from the mean, above and below, there are fewer data. This is the normal distribution, and it is characteristic of just about every kind of measurement of, well, anything.

Yes, anything.

The same is true of something like MPG measured over the same distance several times; if you run the car a bunch of times in the same configuration, the natural variability in the system will give a normal distribution centered around the simple average. Put another way, doing one run may or may not tell you anything about the actual average result because it may be sitting out at one “tail” of the normal distribution, far away from the actual mean. To continue the example above, if I pick a tank at random, the difference between displayed and actual MPG might be 3.0 (exactly at the mean), but it might also be 5.2 or 0.7, and if that is my only datum I will grossly over- or underestimate the actual mean. Note that this is unlikely, but we have no way of knowing with only one datum!
 
For example, I once drove from Pennsylvania to Illinois, westbound on Interstate 70. The weather that day was quite windy, and it was blowing generally out of the east. My average fuel economy for the trip kept going up and up, and when I pulled into a gas station in Illinois the gauge read 70.7 MPG! That is nowhere close to the actual average economy of my car over the years, which is around 53 MPG. If the only datum I have, though, is that one tank it will lead me to erroneous conclusions.

Record of tank MPG (odometer miles driven divided by gallons filled) for my 2013 Prius, as of August 2023. Here's that pesky normal distribution again! The outlier tank I described above, which had an actual mileage of 67 MPG, lies at the extreme right of this chart.
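
If you keep a similar fuel log, a plot like this takes only a few lines to produce. Here is a minimal sketch in Python; the file name and column names are hypothetical placeholders for whatever your own spreadsheet actually contains.

```python
# A minimal sketch of generating a tank-MPG histogram from a fuel log.
# "prius_fuel_log.csv", "miles_driven", and "gallons_filled" are hypothetical
# stand-ins for your own spreadsheet's layout.
import pandas as pd
import matplotlib.pyplot as plt

tanks = pd.read_csv("prius_fuel_log.csv")               # one row per fill-up
mpg = tanks["miles_driven"] / tanks["gallons_filled"]   # calculated tank MPG

print(f"mean tank MPG: {mpg.mean():.1f}")

plt.hist(mpg, bins=15)
plt.xlabel("Tank MPG (miles driven / gallons filled)")
plt.ylabel("Number of tanks")
plt.show()
```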

So, the first problem is this: you need far more than one datum for each configuration. Statisticians often use n = 15 as a minimum sample size; with fewer tests than that, you need much more robust, consistent results to assign the same confidence to a claim.
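
To see why a single run is so risky, here is a small simulation; the “true” average of 52 MPG and run-to-run standard deviation of 2 MPG are made-up numbers, purely for illustration.

```python
# Compare how often a single run, versus the average of 15 runs, lands more
# than 2 MPG away from the true mean. Numbers are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_mean, run_to_run_sd = 52.0, 2.0

single_runs = rng.normal(true_mean, run_to_run_sd, size=10_000)
means_of_15 = rng.normal(true_mean, run_to_run_sd, size=(10_000, 15)).mean(axis=1)

print("single run off by >2 MPG:     ", (abs(single_runs - true_mean) > 2).mean())  # ~0.32
print("mean of 15 runs off by >2 MPG:", (abs(means_of_15 - true_mean) > 2).mean())  # ~0.0001
```

A single run misses the true average by more than 2 MPG about a third of the time; the average of 15 runs almost never does.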
 
Confidence
 
One of the common uses of the normal distribution is to establish confidence about claims that a parameter has changed. Statisticians do this by testing the data at a chosen significance level, which is the acceptable risk that the test will return a false positive, e.g., the test suggests that adding a wing to your car improved its MPG when in reality it did not (in statistics-speak, rejecting the null hypothesis when it was, in fact, true). The significance level is always set small, 0.15 or less, and the standard choice is 0.05. At a level this small there is very little chance that the test will return that kind of erroneous result, and consequently you can have confidence that your conclusions actually reflect reality.
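
To make that number concrete, here is a small simulation with made-up figures: if a modification does nothing at all, a test run at α = 0.05 will still “detect” an improvement roughly 5% of the time, and no more. (scipy’s ttest_ind with equal_var=False is the non-pooled two-sample test discussed later.)

```python
# False-alarm rate of a one-sided, non-pooled two-sample t-test at alpha = 0.05
# when there is NO real difference between configurations. All numbers are
# illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, runs_per_config, trials = 0.05, 6, 20_000

false_alarms = 0
for _ in range(trials):
    without = rng.normal(52.0, 2.0, runs_per_config)    # baseline runs
    with_mod = rng.normal(52.0, 2.0, runs_per_config)   # "modified", but no real change
    p = stats.ttest_ind(with_mod, without, equal_var=False,
                        alternative="greater").pvalue
    false_alarms += p < alpha

print("false-alarm rate:", false_alarms / trials)   # ~0.05
```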
 
The second problem with simple A-B testing, then, is this: you must test your results to ensure that you can make claims about them. Without confidence testing, you are shooting in the dark. That initial A test may have been wrong because of a passing car or a gust of wind or because the temperature was rising or whatever. Accepting that result uncritically means you will never know if your tests actually show what was happening.
 
When Claims Go Wrong
 
Here’s an example of what not to do. Two years ago, I did A-B testing with and without air curtain ducts on my Prius. I thought I was doing things better than I typically saw online: I tested in the middle of the night, when temperatures were fairly consistent and winds were light, and I measured fuel economy on the car’s computer over a 3-mile stretch of flat, straight road rather than the typical 1-mile. I also did 3 runs (in each direction, so 6 total for each configuration). These were the results:
 
 
Comparing the simple averages, I concluded that the ducts were improving MPG and posted about it online. Success!
 
Well, not quite. Later that year I took a statistics course and decided to revisit these results, applying what I had learned in the class.
 
To conduct a test, you first need to formulate a hypothesis. In this case, my null or default hypothesis (the hypothesis I would assume true for purposes of the test) was that the average MPG without ducts was the same as with ducts, and my alternate hypothesis (the claim I actually wanted to test) was that the average MPG with ducts was greater than without. In mathematical terms,
 
H0: μ_without = μ_with
H1: μ_without < μ_with
 
I then ran a two-sample non-pooled (Welch) t-test, a choice based on the small number of data I had and the fact that I wanted to compare two sets of data with unknown standard deviations, at a significance level of α = 0.05. The test compares the two sample means, scaled by their variability, and spits out a probability, the p-value: the chance of seeing a difference at least this large if the null hypothesis were true. In this case, p = 0.292. Since that is far higher than α = 0.05, I had to fail to reject the null hypothesis: there is not significant evidence to conclude that adding the ducts improved MPG.

Extraordinary claims require extraordinary evidence, and this evidence did not meet a high enough standard to support the claims I had made. Additionally, test results from others online that I had taken as gospel, but which displayed similar spread in the data, were in all likelihood BS. People (myself included) are incredibly adept at seeing what they want to see in data; some objective test of test results is necessary before making claims about aerodynamic changes based on something as variable as fuel economy.
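
For reference, this is what that test looks like in code. Since the run-by-run numbers are not reproduced here, the MPG values below are hypothetical placeholders rather than my actual data; scipy’s ttest_ind with equal_var=False and alternative="greater" implements the one-sided, non-pooled (Welch) two-sample test described above.

```python
# Sketch of the one-sided, two-sample non-pooled (Welch) t-test. The MPG values
# are hypothetical placeholders, not the actual runs from my duct test.
from scipy import stats

mpg_without = [51.8, 53.1, 52.4, 52.9, 51.5, 52.2]   # 6 runs, no ducts
mpg_with    = [52.6, 52.0, 53.4, 52.3, 53.0, 52.1]   # 6 runs, with ducts

# H1: mean MPG with ducts is greater than without, tested at alpha = 0.05
result = stats.ttest_ind(mpg_with, mpg_without, equal_var=False,
                         alternative="greater")
print(f"p-value = {result.pvalue:.3f}")

if result.pvalue < 0.05:
    print("Significant evidence that the ducts improved MPG.")
else:
    print("Fail to reject the null hypothesis: no significant evidence of improvement.")
```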
 
Lesson learned.
