# NavList:

## A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding

**Re: Rejecting outliers**

**From:** Peter Hakel

**Date:** 2011 Jan 3, 18:38 -0800

Re: 1) Yes, your attachment contains the math that leads to equations used to compute the weighted linear fit. The iterations in my spreadsheet occur due to the need for a self-consistent determination of the weights and the fit.

Re: 2) You made a good point about how much information is really needed to plot an LOP. I characterize the entire fit (including the slope) so that users have the option of choosing any UT in range (for whatever reason) to get their averaged altitude.

Re: 6) I am fitting both the slope and the intercept, so chi_squared should be near N - 2.

For Gary LaPook's Venus data set #1 (6 data points) I calculate final chi_squared as 3.3 (ideally 4 = 6 - 2).

For Peter Fogg's Canopus data (9 data points) I calculate final chi_squared as 5.8 (ideally 7 = 9 - 2).
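As a quick worked check of the two results above, dividing each chi_squared by its N - 2 degrees of freedom gives a "reduced" value that should sit near 1 for a well-behaved fit. The symbol \chi^2_\nu is notation introduced here for illustration; it does not appear in the spreadsheet.

```latex
\chi^2_\nu = \frac{\chi^2}{N - 2}, \qquad
\text{Venus: } \frac{3.3}{6 - 2} = 0.825, \qquad
\text{Canopus: } \frac{5.8}{9 - 2} \approx 0.83
```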

I will think about adding this result to inform the user how well the fit is doing. Thank you for reminding me of this property; it may provide a rather rigorous way of determining the "Scatter" parameter and therefore maybe allay George Huxtable's concerns about "magic."

Re: 7) I am not sure why you would object to my Eq3, since that is the very same definition of chi_squared included in your attachment. Perhaps you see it as problematic in connection with my Eq2, which, as I acknowledged, has been introduced in lieu of unavailable bona fide standard deviations. These "effective" uncertainties are not allowed to drop below a certain threshold controlled by the "Scatter" parameter, no matter how close to any data point the current fit may pass. Initially I selected this "Scatter" value to be 0.1', consistent with the number of decimals usually given in CelNav angular data. This choice worked OK for Gary LaPook's Venus data, but for Peter Fogg's Canopus data (apparently taken under more adverse conditions) I ended up using 2.5'.
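Written out, the floored "effective" uncertainty might look like the following. This is a reconstruction from the description in this thread, not a copy of Eq2: the symbols h_i and t_i (measured altitude and time) are assumed names, a and b are the slope and intercept, and the threshold is assumed to equal "Scatter" itself, which the 1/Scatter^2 weight ceiling suggests.

```latex
\mathrm{diff}_i = h_i - (a\,t_i + b), \qquad
\sigma_i = \max\bigl(|\mathrm{diff}_i|,\ \mathrm{Scatter}\bigr), \qquad
w_i = \frac{1}{\sigma_i^2} \le \frac{1}{\mathrm{Scatter}^2}
```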

You are also concerned about chi_squared becoming equal only to the number of measurements. Again, with appropriately chosen "Scatter" some (but not all!) data points will hit the 1/Scatter^2 ceiling for their weights (represented by 1.000 in the yellow column D, and overriding the weight = 1 / diff^2 relation), so the substitution of Eq2 into Eq3 will not actually result in chi_squared = N (also, see above in Re: 6).
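The self-consistent loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet itself: the function names, the fixed iteration count, the unweighted starting fit, and the demo numbers (synthetic, not the Venus or Canopus sights) are all hypothetical, and the floor on sigma is taken to be "Scatter" itself.

```python
import numpy as np

def weighted_line_fit(t, h, w):
    """Closed-form weighted least-squares line h = a*t + b for fixed weights w."""
    t_bar = np.sum(w * t) / np.sum(w)
    h_bar = np.sum(w * h) / np.sum(w)
    a = np.sum(w * (t - t_bar) * (h - h_bar)) / np.sum(w * (t - t_bar) ** 2)
    return a, h_bar - a * t_bar

def self_consistent_fit(t, h, scatter, n_iter=100):
    """Alternate between fitting the line and resetting the weights.

    Assumption: sigma_i = max(|diff_i|, scatter), so no weight can
    exceed 1/scatter**2. The iteration count is an arbitrary choice.
    """
    t = np.asarray(t, dtype=float)
    h = np.asarray(h, dtype=float)
    w = np.ones_like(h)                      # start from an unweighted fit
    for _ in range(n_iter):
        a, b = weighted_line_fit(t, h, w)
        diff = h - (a * t + b)
        sigma = np.maximum(np.abs(diff), scatter)
        w = 1.0 / sigma**2
    # Final fit and residuals evaluated with the last set of weights.
    a, b = weighted_line_fit(t, h, w)
    diff = h - (a * t + b)
    sigma = np.maximum(np.abs(diff), scatter)
    chi2 = float(np.sum((diff / sigma) ** 2))
    return a, b, sigma, chi2

# Synthetic demo data (hypothetical): minutes of time, arcminutes of altitude,
# with the fifth point deliberately placed far off the trend.
t_demo = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
h_demo = [10.0, 10.5, 11.1, 11.4, 14.0, 12.5]

a, b, sigma, chi2 = self_consistent_fit(t_demo, h_demo, scatter=0.3)
print("slope, intercept:", a, b)
print("effective sigmas:", sigma)
print("chi_squared:", chi2, "for N =", len(t_demo))
```

In this sketch every point lying farther than "Scatter" from the line contributes exactly 1 to chi_squared, while every point sitting at the floor contributes less than 1, so the total stays below N whenever at least one point is at the floor.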

I realize that the reasonableness of Eq2 is debatable, so if there is a more sensible way to replace the unknown sigmas then all the better! I wanted to detect a deviation from the prevailing linear trend which has the same units as the measured quantity, hence I came up with sigma_i -> | diff_i |. I will think about other monotonically increasing functions sigma -> f( | diff | ) as possible, more appropriate replacements.

Peter Hakel

**From:** George B <gbrandenburg@rcn.com>

**To:** NavList@fer3.com

**Sent:** Mon, January 3, 2011 3:14:37 PM

**Subject:** [NavList] Re: Rejecting outliers

In response to posts by Antoine, George H, Peter H, and John H, I'd like to make a few comments about the least squares solution to the problem of several sights taken at different times. As has been noted, in this case the altitude measurements can be assumed to have a linear dependence on the time, provided the time interval isn't too great.

1. This is a simple example of the case where the function describing the data is linear in its parameters. Here the two parameters are the slope (a) and the intercept (b). In this case the solution can easily be found algebraically - there is no need to perform any iterations. If you are interested in the algebra see the attachment.

2. For the case where both the slope and the intercept are to be determined, the result can be most easily summarized (as Antoine noted on Jan 1) by noting that the fit line goes through the "pivot" of the data, namely the point defined by the average altitude measurement and the average time of the measurements. If you also want the slope you need to make additional calculations (see attachment), but this isn't necessary to determine the LOP.

3. If you assume the value for the slope is defined and only find the intercept using the least squares method, the result is exactly the same as in point 2, the fit line goes through the pivot.

4. The least squares method is the best estimator of the fit line; in fact it is equivalent to the maximum likelihood method with Gaussian errors assumed for the measurements. But as pointed out in 2 and 3, because of the assumed linear dependence of the altitude on the time, it is also equivalent to the intuitive method of finding the pivot point by simply averaging the altitudes and times (assuming you don't require the slope value).

5. The averages that are taken in the calculation should be weighted averages, where the weights are given by 1/standard deviation^2 (as in Eq 1 from Peter H on Dec 31). The standard deviation of each measurement is the best estimate of the uncertainty in each altitude measurement. If all the measurements are estimated to have the same uncertainty, which should usually be the case, then the weights are all the same, and the fitted result will not depend on them. However, the propagated uncertainty on the fit result and the value of chi square will depend on the estimated measurement uncertainties.

6. As I noted in a "cocked hat" post, the chi square value for the fit to the measurements can be calculated and could in principle be a useful estimator of the "quality" of the set of measurements. In particular if the slope is assumed to be fixed, then for an "average" set of measurements chi square should have a value near N-1, where N is the number of measurements. If chi square is much larger than N-1 then either the measurement uncertainties were underestimated or the set of measurements includes some particularly unlucky (or poorly done) points. (If both slope and intercept are calculated then the average value of chi square should be N-2.)

7. I seriously disagree with Eqs 2 & 3 from Peter H on Dec 31. First of all, the least squares method is only valid if the weights are based on sensible estimates of the measurement uncertainty. (This can be seen by its connection with the maximum likelihood method as noted in point 4.) Second, PH's Eq 2, which equates the standard deviation to the "residual" or the distance of the measurement from the fit line, is a very bad estimator of the uncertainty of a specific measurement. In particular, if the measurement happens to fall right on the fit line it assigns it an absurd uncertainty of zero (and infinite weight). Finally, when PH's Eq 2 is substituted into chi square in Eq 3, the numerator and denominator for each term cancel, with the result that chi square is equal to the number of measurements and nothing else. I may have badly misunderstood what is intended here, but nonetheless I conclude that the correct best estimate is just the simple one given in points 2 and 3 above.

8. And last but not least I agree with my experimental physicist colleagues George H and John H that discarding measurements is a dangerous business and should be done only when there is reason to believe that a measurement was faulty. And this should preferably be done before looking at the results.
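Points 1, 2, 3, 5 and 6 above can be checked numerically with a short Python sketch. It uses the standard normal-equation solution for a weighted straight-line fit, which may differ in notation from the attachment; the function names and the demo numbers are hypothetical and the data are synthetic.

```python
import numpy as np

def fit_line(t, h, sigma):
    """Non-iterative weighted least-squares fit of h = a*t + b.

    sigma may be a single number (same uncertainty for every sight)
    or one value per sight. Returns a, b, the pivot, and chi square.
    """
    t = np.asarray(t, dtype=float)
    h = np.asarray(h, dtype=float)
    w = 1.0 / np.broadcast_to(np.asarray(sigma, dtype=float), t.shape) ** 2
    S, St, Sh = w.sum(), (w * t).sum(), (w * h).sum()
    Stt, Sth = (w * t * t).sum(), (w * t * h).sum()
    det = S * Stt - St**2
    a = (S * Sth - St * Sh) / det            # slope
    b = (Stt * Sh - St * Sth) / det          # intercept
    pivot = (St / S, Sh / S)                 # weighted mean time and altitude
    chi2 = float((w * (h - (a * t + b)) ** 2).sum())
    return a, b, pivot, chi2

def fit_intercept_fixed_slope(t, h, sigma, slope):
    """Least-squares intercept when the slope is taken as known (point 3)."""
    t = np.asarray(t, dtype=float)
    h = np.asarray(h, dtype=float)
    w = 1.0 / np.broadcast_to(np.asarray(sigma, dtype=float), t.shape) ** 2
    return float((w * (h - slope * t)).sum() / w.sum())

# Synthetic demo data (hypothetical): minutes elapsed, arcminutes of altitude change.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
h = [0.0, 9.8, 20.3, 29.9, 40.1]

a, b, (t_bar, h_bar), chi2 = fit_line(t, h, sigma=0.5)
print("fitted line at the mean time:", a * t_bar + b, "mean altitude:", h_bar)

# Point 5: doubling a common sigma leaves the line unchanged but rescales chi square.
a2, b2, _, chi2_wide = fit_line(t, h, sigma=1.0)
print("slopes:", a, a2, "chi square:", chi2, chi2_wide)

# Point 3: with a fixed slope the fitted line still passes through the pivot.
b_fixed = fit_intercept_fixed_slope(t, h, sigma=0.5, slope=10.0)
print("fixed-slope line at the mean time:", 10.0 * t_bar + b_fixed)
```

The chi square returned here is the quantity to compare against N-2 when both parameters are fitted, or against N-1 for the fixed-slope case, as described in point 6.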

Sorry to run on so long - I hope this clarifies a few things without having further muddied the water.

Happy New Year!

George B
