A critique of the statistical protocol in ISO 20072 for aerosol drug delivery device design verification. CUED/F-INFENG/TR.703 October 8, 2016 M.S. Christescu, M.A. Tirlea, D.J.C. MacKay ISSN 0951-9211 University of Cambridge Department of Engineering Trumpington Street Cambridge CB2 1PZ United Kingdom Email: msc64 / mat58 @cam.ac.uk 1 A Critiquy of thy gtutisticul Protocol in Igc 2DD72 for Ayrosol Drug Dylivyry Dyvicy Dysign Vyri cution AUhei S. Chfighegci, AUfiig A. TifleU Tfibihm Ccllege, CUabfidge 8Uvid J. C. AUcKUm∗ 8edUfhaebh cf Ebgibeefibg, Ubivefgihm cf CUabfidge Cchcbef ,, 2016 Avstruwt ISO 20072 is an international standard detailing the methods and testing protocols required to ensure the quality of aerosol drug delivery devices (ADDDs), for use in humans [1]. Our paper examined the pitfalls and errors in this standard, and the difficulties in its application to ADDD-producing factories. We find that the likelihood of many imperfect factories which satisfy the requirements listed in the standard actually passing the prescribed testing is minuscule. We suggest other testing protocols, which achieve the stated goals of the standard. In the specific example we consider, they also reduce the workload by around 25-fold compared with the methods specified in ISO 20072, as well as giving crucial additional advantages. hvyly Dvviy bvxKvy FghA gegius erofessor of EngineeringA yiey in April 2EFK yuring the prepvrvtion of this reportC 2 Contznts 1 Introxuwtion 6 1.1 Overview of the paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 F hhy Dyviwy Funwtionulity dro ly (DFd) 8 2.1 Example of a DFP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Rationale behind the various basic items in the DFP . . . . . . . . . . . . . . . . . . 8 3 hywhniwulitiys 1D 3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 hhy Igc FDD7F tysting protowol 13 4.1 The approach prescribed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.1.1 Two-sided, real variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.1.2 One-sided, real variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.1.3 Probability parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.1.4 Poisson rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.2 Erroneous values in ISO 20072 table of 2-sided values of k . . . . . . . . . . . . . . . 14 4.3 A comment about two DFP basic items . . . . . . . . . . . . . . . . . . . . . . . . . 16 I Fryquyntist tysting 17 5.1 Details of an approach to designing a frequentist test compliant with ISO 20072 . . . 17 5.1.1 Continuous parameter with 2-sided constraints . . . . . . . . . . . . . . . . . 18 5.1.2 Continuous parameter with one-sided constraint . . . . . . . . . . . . . . . . 18 3 5.1.3 Probability parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5.1.4 Poisson rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 5.2 Summary of effort required for frequentist test . . . . . . . . . . . . . . . . . . . . . 22 5.3 Cheating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 5.3.1 An example of an ignorant attempt to be honest . . . . . . . . . . . . . . . . 24 5.4 The frequentist results are in fact irrelevant . . . . . . . . . . . . . . . . . . . . . . . 25 5.4.1 Fconfidence misbehaves when applied to combinations of items . . . . . . . . 25 5.4.2 Frequentist testing considers the wrong denominator . . . . . . . . . . . . . . 27 6 Vuyysiun tysting F8 6.1 Sequential Bayesian testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 6.1.1 Description of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 6.1.2 Validity of the Bayesian sequential method . . . . . . . . . . . . . . . . . . . 31 6.1.3 Further useful properties of the sequential Bayesian approach . . . . . . . . . 32 6.2 Mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 6.3 Priors on K and q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 6.4 Markov Chain and Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 6.5 Sampling and code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 6.6 Sample sizes needed for each type of DFP item . . . . . . . . . . . . . . . . . . . . . 35 6.6.1 Continuous variable with two-sided constraint . . . . . . . . . . . . . . . . . . 36 6.6.2 Continuous parameter with one-sided constraint . . . . . . . . . . . . . . . . 40 6.6.3 Probability parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.6.4 Poisson rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4 6.7 Summary of effort needed in the Bayesian approach . . . . . . . . . . . . . . . . . . 49 7 Wompurison of thy fryquyntist unx Vuyysiun upprouwhys ID 7.1 Workload required . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 7.2 Comparison of incentives to cheat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 7.3 Comparison of the incentives to produce a high quality factory . . . . . . . . . . . . 51 7.4 Pseudo-Bayesian approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 7.5 Discussion of choice of prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 8 Wommynts muxy in ryviyw vy othyrs II 9 Wonwlusion I6 A kuys in whiwh thy vyhuviour of fryquyntist won xynwy fuils to mutwh thy intuitivy vyhuviour yfipywtyx I8 V droof thut Vuyysiun puss provuvility of u goox fuwtory upprouwhys 1 us thy umount of xutu inwryusys 6D W Worrywt vuluys of k to rypluwy thosy in Igc FDD7F tuvly D1 63 5 F Introyuxtion In 2009, the International Standards Organization published a standard, ISO 20072 (see [1]), which describes a protocol for specifying and verifying the properties of a factory producing inhalers. A paper prepared by the International Pharmaceutical Aerosol Consortium on Regulation and Sci- ence (IPACRS), the European Pharmaceutical Aerosol Group (EPAG) and other experts (see [2]) criticizes the 2007 draft of ISO 20072 in the following statement: “the values for [the minimum probability content] p proposed in ISO DIS 20072:2007 are set so high that meeting the statis- tical requirements will be challenging or even impossible for many ADDD-based products.”. We investigated whether or not this statement is true. The intentions of ISO 20072 are noble. The aim of the standard is to require aerosol drug delivery device (ADDD) manufacturers to comply with stringent testing protocols, in order to ensure that the ADDDs are of a high quality. However, there are two problems with the standard: a minor error, which is easily correctable but has had the effect of satisfactory factories being rejected by the standard, and a major flaw, which renders the entire standard unfit for purpose. The second issue leads to manufacturers being tempted to manipulate their test data in order for their factories to be deemed compliant with this standard. FCF dvzrvizfi of thz pvpzr In sections 2, 3, and 4, we explain what a “Device Functionality Profile” is and define some terms we will use in this report, then using an example Device Functionality Profile, discuss the testing protocol mandated by ISO 20072. We then explain the minor error which occurs in table D1, Annex D of ISO 20072. This table contains the values of k to be used in a particular two-sided test of a continuous variable, under the assumption that this variable follows a Gaussian distribution. This table contains values which are often incorrect to the first decimal place, and even before for some values of n. For instance, one of the correct k values is 48% lower than the incorrect one listed in table D1 of ISO 20072. We discuss this issue further in section 4.2. However, this error can be simply remedied by replacing the incorrect table of k values with a correct one, which is why we consider it to be a minor problem. In section 5 we examine the sample sizes needed, which leads us in sections 5.2, 5.3, and 5.4 to describe the major problem, which is that the statistical tests in the ISO 20072 protocol are deeply flawed - so flawed that it is essentially impossible for many factories that are in fact satisfactory to pass all non-trivial tests1, unless the tests are rigged (for example, by censoring unfavourable results). The public is then seen to be in the following highly unsatisfactory situation: all devices that are stated to have complied with ISO 20072 and have non-trivial DFP requirements are extremely likely to have done so by a testing process that in fact did not comply with the ISO standard: 1For exvmples of non-trivivl requirements of the DFeA see sextion 2CF 6 some of the test results will have been censored or falsified. This not only implies that patients are at risk, but also that ISO 20072 is not fit for purpose. Note that we do not imply that the censoring or falsification of the results is done with malicious intent; rather, we note in section 5.3.1 that the workers in question may be bending over backwards to try to reconcile the commercial pressure to pass the test with the difficulty of doing so, but have insufficient knowledge of statistics to understand that what they are doing constitutes falsification of the results. The root of the problem with the standard is that it is based on classical “frequentist” statistical tests, which calculate a parameter called the “(frequentist) confidence” of the test, and which do not pay attention to the properties that we actually care about, such as the probability that a “good” factory might fail the test, or the probability that a factory that has passed the test is actually “bad”. In section 6 we therefore propose an alternative approach, aiming for the same high standard of quality control, but using Bayesian statistics. In section 7 we see that this results in both enormously reduced workload and in a much higher probability of a compliant factory passing the test. In section 8 we look briefly at what may be motivating regulators, based on the comments of others, before concluding in section 9. 7 G ihz Dzvixz Funxtionvlity ero lz (DFe) GCF Exvmplz of v DFe In order to understand how the major flaw in ISO 20072 arises, we must first consider how the this standard refers to a device functionality profile (DFP), developed by the manufacturer in conjunction with the requirements of ISO 14971. This DFP is a guide to the intended properties and acceptable limits of the parameters of the ADDD. An example DFP is shown in table 1: this DFP is for an inhaler which, in addition to its drug delivery functions, also comes to and reports judgments about whether the device has been correctly used, by monitoring the vibrations occurring in the device during use. We make various comments on the specific DFP items in table 2. GCG gvtionvlz wzhiny thz vvrious wvsix itzms in thz DFe From a manufacturer’s perspective, compliance with the items in the DFP shows that the device is working and safe to use. Therefore, a manufacturer is faced with a trade-off: to include many items in the DFP, thus enhancing the safety of the patients and at the same time increasing the amount of testing needed, or to sacrifice safety of the patients by reducing the number of items in the DFP, which in turn reduces the amount of testing needed. We note several properties of the example DFP: 1. There are 16 basic items in the DFP to be tested within ISO 20072, since item 9 adds three separate basic items to the DFP. 2. There are four types of variable in the DFP: (a) Directly observable, real-valued, continuous variables, constrained to be between finite lower and upper bounds, l and u, respectively (e.g. item 1). (b) Directly observable, real-valued, continuous variables, constrained either to be below a finite upper bound u or to be above a finite lower bound l (e.g. item 3). (c) A probability, which cannot be directly observed (e.g. item 2). (d) A Poisson rate, which cannot be directly observed (e.g. item 8). 3. Each variable specified within the DFP has some distribution across the population of devices. There are several additional specifications related to the example DFP: 1. Temperature range for correct operation must be at least −15◦C to +45◦C, with relative humidity between 5% and 95%. 2. Storage temperature range must be at least −30◦C to +60◦C. 3. Operating and storage atmospheric pressure range must be at least 0.5 to 1.2 atmospheres. 8 Itym Dyswription hypy of vuriuvly 1 The force needed to remove the cap must be not less than 8N, and not more than 12N. Real, two-sided 2 The probability, given a particular device, which has not warned of low battery, must be at least 0.99 that power comes on within 5 seconds of removal of the cap. Probability parameter 3 The force needed to replace the cap must be not more than 15N. Real, one-sided 4 The probability, given a particular device, must be at least 0.99 that power turns off within 5 seconds of replacing the cap. Probability parameter 5 The pressure drop to achieve 60 l/min flow must be between 0.9 and 1.2 kPa. Real, two-sided 6 The probability, given a particular device, must be at least 0.93 that, after a correct inhalation, it gives the response appropriate to correct use. Probability parameter 7 The probability, given a particular device, must be at least 0.93 that, after an incorrect inhalation, it gives the response appropriate to incorrect use. Probability parameter 8 The Poisson rate of false detection of in- halations, given a particular device, must be less than 1 per hour of power-on time. Poisson rate 9 The device must still operate correctly (i.e. satisfy items 6, 7, and 8 in the DFP) when 1m in front of the standard acoustic back- ground noise generator (as 6, 7, 8) 10 The battery must last for at least 2 hours of cap-off time before needing replacement. Real, one-sided 11 The device must give a low-battery warn- ing between 10 and 20 minutes of power-on time before the battery will expire if not turned off. Real, two-sided 12 Peak current drain from the battery must not exceed 500A. Real, one-sided 13 The device must have mass not exceeding 12 g, including the battery. Real, one-sided 14 The mean particle radius in the delivered airflow during a correct inhalation must not exceed that in the dose of drug powder put into the device by a factor of more than 1.2 Real, one-sided Table 1: Example DFP. The third column indicates which type of variable the respective item in the DFP is referring to. Note that drug dose delivered is outside the scope of ISO 20072, although of course it does in practice also need to be tested. 9 DFd itym Wommynts 1 Lower bound required so that the cap does not fall off accidentally, and upper bound required so that a physically weaker user will never be unable to remove the cap 2 Required since, otherwise, a user may not be able to use the ADDD. 3 See comment on item 1. 14 This might happen if the device uninten- tionally charged the particles with static electricity. Table 2: Comments on some of the items in the example DFP. 3 izxhnixvlitizs We now turn to various definitions and assumptions that are necessary to make sense of the rest of this report. 3CF Dz nitions For the avoidance of ambiguity we here define some specific terms for use in the rest of the report. By a fuwtory we mean whatever combination of product design, manufacturing method, and manufacturing facility is used to produce a population of inhalers whose properties are to be tested. A myusurymynt is a single experiment on a single device, done exactly once. Thus, for example, dropping a device once and seeing whether it breaks is a measurement. Applying a single inhalation to a single device and determining whether it gives the correct or incorrect result is a measurement. A vusiw itym is one numbered item in the DFP defined therein without reference to ISO 20072. Thus for example “The pressure drop to achieve 60 l/min flow must be between 0.9 and 1.2 kPa” is a basic item. An itym is one of the sets of measurements that results from combining a basic item defined in the DFP with one of the environmental and/or insulting test conditions defined in ISO 20072. Thus, for example, “The pressure drop to achieve 60 l/min flow must be between 0.9 and 1.2 kPa when testing at standard atmosphere after exposure to cold storage atmosphere” is an item. A tyst is a procedure that involves testing all the different items invoked by the combination of the DFP and ISO 20072, the outcome of which is that the factory either passes or fails the test. In this case a factory passes a Bayesian test, using a suitable prior, if we confirm for each item that given the data the probability that the item’s constraints hold for the relevant fraction (95% or 97.5% depending on environmental conditions) of produced devices is at least 0:95 . We refer 10 to this posterior probability as the won xynwy that an item is complied with (see the definition of fconfidence below for some justification for this unusual nomenclature). A factory passes a frequentist test if, for each item, we achieve 95% fwon xynwy that the item’s constraints hold for the relevant fraction of produced devices. Note that, to avoid confusion be- tween the Bayesian and frequentist concepts of confidence, we refer to the frequentist concept as fwon xynwy. As justification for this unusual terminology, we note that the Bayesian concept of “confidence” corresponds much more closely to the intuitive concept of “confidence” than does fconfidence; we will expand on this point in section 5.1 and appendix A below. We also mention in passing that the fconfidence level of frequentist hypothesis tests is often misinterpreted. The fconfidence level is widely thought of as “how likely it is that the tested devices comply with the requirements”, or “the degree of belief that one should have that the tested devices are compliant”, but this is a incorrect interpretation. If one wants to know thvt probability then one needs to calculate the Bayesian posterior probability, which is what we define here as “confidence” (without the f). We say that a device is goox for an item if the true parameter in question is within the specification limits for the item. For example, if the force required to remove the cap when testing at standard atmosphere after after exposure to cold storage atmosphere is between 8 and 12 N then the device is good for that item. We say that a factory is goox for an item if the actual distribution of the parameter over devices satisfies the probability content requirement. For example, if the fraction of the devices produced by the factory whose cap removal force under normal use conditions lies between 8 and 12 N is 97.5% or more, then the factory is good for that item. We say that a factory is goox for for the test (or for all items), or that it is wompliunt, if for each item it is good for that item. The meaning of vux in each case is that the factory or device is not good. We define a supyrlutivy factory to be a factory such that: 1. For each item in the DFP constraining a probability to be at least some value p, all devices produced by the factory have that probability exactly equal to 1− 1−p10 . 2. For each item in the DFP constraining a Poisson rate to be at most , all devices produced by the factory have that Poisson rate exactly equal to 10 . 3. For each item in the DFP constraining 2-sidedly a continuously distributed parameter that is intrinsically positive, that parameter of devices produced by the factory is log-Gaussianly distributed with median equal to the geometric mean of the two constraints and standard deviation such that the fraction of the devices with that parameter outside the range of constraint is one tenth of the maximum fraction permitted. 4. For each item in the DFP constraining 1-sidedly a continuously distributed parameter that is intrinsically positive, that parameter of devices produced by the factory is log-Gaussianly distributed with median strictly inside the constraint and standard deviation such that the fraction of the devices with that parameter outside the range of constraint is one tenth of the maximum fraction permitted. 11 5. For each item in the DFP constraining a parameter not so far discussed, i.e. a continuous parameter that is not intrinsically positive, the distribution of that parameter of devices produced by the factory is as required above for the exponential of that parameter (which is intrinsically positive). By a pyrfywt factory we mean one such that the probability of passing the test is in practice indistinguishable from 1. We will use the symvol ∝ in the following precise sense: We will write e (VPB | XPD) ∝ expression1 ∝ epression2 to mean that at both ∝ symbols there is a ‘constant’ of proportional- ity that does not vary with any of the variables to the left of the | sign (here V and B), but may vary with any of the symbols to the right of the | sign (here X and D). We will describe an item as insultyx if it is performed under such (adverse) environmental con- ditions that ISO 20072 prescribes 95% minimum probability content, and as uninsultyx if 97.5% minimum probability content is prescribed. 3CG Assumptions In order to simplify the analysis we make the following assumptions, which are of varying levels of realism, but in practice necessary. 1. The devices do not wear out, indeed each device has the same probability distribution of behaviours when it is new as at any other stage of its life other than “end of life” (though these distributions may be different from device to device). Note that this does not exempt us from end-of-life testing according to ISO 20072, and at that point we do not assume that the distribution is as it was at the start. 2. In each case the parameter specified is constant for the device, but may vary from device to device; in some cases the parameter specified is a probability, in other cases it is a force, pressure, duration, etc. 3. The distributions of each of the specified parameters are mutually independent. 4. Where a continuous parameter that ranges over (0P∞) is considered, we will consider instead the logarithm of that parameter when assumptions of Gaussianity etc. are considered. 12 I ihz Ihd GEE7G tzsting protoxol Once a manufacturer has established the DFP, ISO 20072 lists multiple differing environmental conditions under which each basic item must be tested. There are 14 such sets of conditions2 listed, in each of which a hypothesis test must be passed for each of the 16 basic items of the DFP, giving 224 = 16× 14 items to be tested in total. In this section, we discuss the attempts made by ISO 20072 to prescribe a valid testing protocol. ICF ihz vpprovxh przsxriwzy Under ISO 20072, the compliance of each of the four types of parameter is tested with a frequentist hypothesis testing protocol. This section describes the testing protocols laid out in ISO 20072 for each type of parameter. ISO 20072 requires that all frequentist hypothesis tests are performed at a fconfidence level of 95%, i.e. that the probability of any specific non-compliant factory passing any given item is at most 5%. All of the hypothesis testing protocols in ISO 20072 involve a “probability content”, p, which is a proportion of the underlying population, either 95% or 97.5%. Each hypothesis test aims to determine whether a proportion p of the underlying distribution of the given parameter is within the given acceptable bounds. 4B1B1 hwoAsixyx, ryul vuriuvly For the case that a real continuous variable follows an underlying normal distribution, and the DFP specifies both upper and lower bounds l < u for that variable, ISO 20072 gives a specific test, performed as follows. Exactly one measurement is taken from each of N devices, and the sample mean, x¯ = N∑ n=1 xn N , and sample standard deviation, s = √ N∑ n=1 (xn−x¯)2 N−1 , are calculated. These two values are then used to calculate x¯ − ks and x¯ + ks, where the value of k is obtained from table D1, in Annex D of ISO 20072, depending on the value of N and the probability content required. The two aforementioned values are subsequently compared with l and u, respectively, to determine whether the factory ”passes” the test. The critical region is the event (x¯− ks S l) ∩ (x¯+ ks < u) 2ihe 7free fvll7 xonyition in Ihd 2EE72 stvtes thvt the test must we performey with the xvp on vny the xvp offC le xonsiyer the EbC environmentvl xonyition to involve one testC 13 4B1BF cnyAsixyx, ryul vuriuvly The testing protocol in this case is very similar to the two-sided real variable. The only two differences are as follows: either l = −∞ or u = ∞, since there is either only a lower or only an upper bound on our parameter; additionally, the values of k to be used differ from those to be used in the two-sided real parameter test. These values of k are not included in ISO 20072. The reader is instead referred to a corresponding table in ISO 16269-6:2005. 4B1B3 drovuvility purumytyr For this type of parameter, a testing protocol is not explicitly stipulated in ISO 20072. However, the standard requires every test to have a fconfidence level of 95%, i.e. states that the probability of any specific non-compliant factory passing must be ≤ 0:05. We suggest a frequentist test protocol in section 5.1.3 below. 4B1B4 doisson ruty Similarly to the probability parameter, a testing protocol for the Poisson rates is not explicitly mentioned in ISO 20072. However, we suggest a frequentist testing protocol in section 5.1.4 below. ICG Erronzous vvluzs in Ihd GEE7G tvwlz of GBsiyzy vvluzs of k ISO 20072 refers to ISO 16269-6:2005 for how to calculate the k values used in its two-sided testing of real continuous variables. ISO 20072 invites the user to assume that the value of each continuous variable (e.g. the force needed, in Newtons, to remove the cap from a device) (or alternatively its logarithm) has a Gaussian distribution. Given this assumption, the testing protocol consists of obtaining sample values of the variable from N different devices, and evaluating the sample mean x¯ and standard deviation, s. This is then followed by performing a frequentist hypothesis test at a fconfidence level of 95%. The null and alternative hypotheses are H0 : ∫ u l f(x)dx ≤ p H1 : ∫ u l f(x)dx S pP where f(x) is the population probability density function of the continuous variable, and l and u are 14 the pre-determined lower and upper bounds, respectively, within which we wish a pre-determined proportion p (e.g. 97.5%) of our underlying population to lie. The factory passes the hypothesis test, i.e. H0 is rejected, and we say with 95% fconfidence that the factory is suitable, if x¯− ks S l x¯+ ks < u The values of k used to perform this test are “. . . determined based upon the confidence level (95%), probability content, p, and the number of measurements, n. . . ” (see [1]). These k values, contained within Annex D of ISO 20072, are copied from the table which appears in ISO 16269-6. However, the values within ISO 16269-6 have been misused. ISO 16269-6:2005 states that the values of k are to be used in order to determine an interval (x¯− ksP x¯+ ks) such that “...the probability that [the] interval constructed in the prescribed manner will contain at least a proportion p of the population [is 95%]”. In the hypothesis test, we wish to use a value of k such that the probability of falsely rejecting H0 is less than 5%. These two definitions of k produce values which differ to the first decimal place for n < 100, and, even for n = 1500, differ at the second decimal place. The disparity between these values of k result in compliant factories being rejected, since the correct values of k are lower than the values provided in ISO 20072. In ISO 16269-6:2005, it is stated that “The tolerance limits discussed... can be used to compare the natural capability of a process with one or two given specification limits, either an upper one U or a lower one a or both...”. This statement is a misleading indicator that the k values contained within table E.4 of this standard are appropriate for use in a hypothesis test, which is wasteful of good factories. The second edition of this standard, ISO 16269-6:2014, does not correct this statement sufficiently. ISO 16269-6:2014 suggests that “...the appropriateness of the given specification limits U and L can be compared with the actual properties of the process”, which is again a wasteful protocol for performing a frequentist hypothesis test. However, the table of k values pertaining to the case in which we know neither the mean, , nor the standard deviation, , of our underlying population is not included in the 2014 edition of ISO 16269-6. It is possible that, if ISO 20072 had been revised upon the publication of ISO 16269- 6:2014, the requirement to calculate the values of k could have resulted in the values of k being corrected. The corrected table of k values is provided in Appendix C. 15 IC3 A xommznt vwout tfio DFe wvsix itzms Now that we have seen the testing protocol proposed by ISO 20072, we can comment about our choice of two of the basic items in the example DFP, which we consider in the uninsulted case where the desired probability content is 0.975 . Firstly, let us compare DFP 2 with the following statement: “The device shall switch on within 5 seconds of the cap being removed” (*) If we suppose 95% of devices have the probability parameter equal to 1, and the remaining 5% of devices have probability parameter equal to 0.5, then the population of devices does not satisfy DFP item 2, but does satisfy the ISO 20072 interpretation of (*) because the mean is 0.975 (and no device is tested more than once). Indeed, any population satisfying our DFP 2 will also satisfy (*), since 0:975× 0:99 + 0:025× 0 S 0:95. Therefore our choice of basic item 2 in the DFP is stronger than (*) in the uninsulted case. Secondly, let us compare DFP 6 with the following statement, again in the uninsulted case: “The device shall signal a correct inhalation as correct.” (**) If we consider a population with all devices having the probability parameter equal to 0.94, then the population satisfies our DFP item 6; however it does not satisfy the ISO 20072 interpretation of (**) because the mean is 0:94 < 0:975. Therefore our choice of basic item 6 is not stronger than (**). However, if we consider a population with 90% of devices each with the probability parameter equal to 1, and another 10% of devices with probability parameter equal to 0.9, the mean will be 0:99 S 0:975, hence it will satisfy the ISO 20072 interpretation of (**) despite not satisfying our DFP item 6. Therefore, our DFP item 6 is also not weaker than (**). Therefore, our DFP item 6 is neither weaker nor stronger than (**). So we should ask “Why would we use DFP 2 rather than (*) ?”, and “Why would we use DFP 6 rather than (**) ?”. In the DFP 2 case, the answer is that it is important that a very high proportion of devices turn on with very high probability, and it is not acceptable to permit 5% of devices to be only turn on half the time (which would be permitted by (*)). In the DFP 6 case, (**) requires a mean probability of correct answer that may be unachievable within the available technological constraints, so needs to be set weaker. However, ISO 20072’s interpretation makes it impossible to specify a DFP item that merely requires the mean probability to exceed 0.93; and if one did that, it would then be permissible for up to 14% of devices to give random correct or incorrect answers with probability 0:5, which would be unacceptable. 16 5 Frzquzntist tzsting In this section, we propose a frequentist approach to doing testing that is in accordance with ISO 20072. We start by explaining the rationale that leads to this approach, taking into account the manufacturer’s perspective. 5CF Dztvils of vn vpprovxh to yzsigning v frzquzntist tzst xomplivnt fiith Ihd GEE7G In the previous sections we have described in outline the extent to which ISO 20072 prescribes how statistical testing is to be done. Before designing the remaining details of such a procedure, it is also necessary to examine the perspective of an ADDD manufacturer who will have to comply with those requirements. ISO 20072 does not stipulate how many devices should be examined in each test of each required item; it simply requires that every item should have an fconfidence level of 95%, i.e. the probability that any specific non-compliant factory passes the hypothesis test must be ≤ 5%. However, any factory owner who has made a good factory cares about another probability – the probability that their good factory might fail one of the 224 items. Failure of any one of the items would lead to a manufacturer being unable to produce any ADDDs for human use, resulting in a loss of the millions of dollars used to design and build an ADDD-producing facility. Retaking a test is not an option for a manufacturer, on account of the confidence level required by ISO 20072. Consider, for example, a specific non-compliant factory which has a probability of 5% of passing one item given a single run of the relevant testing procedure, but which is allowed to take the test a second time if it fails the first. The probability that this item either passes the first time, or fails the first time and subsequently passes a second one, is 9.75%, which implies that the confidence level of the extended ”double” test is now less than 95% (it is, in fact, equal to 90.25%). We strongly believe that a manufacturer would prefer that a goox factory should have at least 0.95 probability to pass the ISO 20072 protocol. In order to satisfy this, the frequentist approach must be designed such that a good factory passes each of the 224 items with probability substantially higher than 0.95: indeed, if the design is such that each item has the same probability of failing, then the probability of passing a single item needs to be at least 0.9998 (to 4 decimal places), so that the aforementioned good factory will pass the whole protocol with probability 0:9998224, which is about 0.95. While this is not required by ISO 20072, it is essential for a manufacturer of ADDDs. From their point of view, if their factory is entirely compliant with their DFP in the manner required by ISO 20072, they would wish their probability of passing any required testing protocols to be as close to 1 as possible. Now, it is possible to design a testing plan giving a probability of 0.95 of passing the entire test in many different ways — provided that one knows the true parameters of the factory. For simplicity, in what follows, we will assume that the design is such that each item is assigned the same pass 17 probability of 0:951=224, or approximately 0.9998, but it is likely that significant savings can be made on the overall number of tests by allocating the pass probabilities non-uniformly; for our present purposes, however, the workload of doing so optimally would be excessive. Furthermore, whether one designs for equal pass probability for each item or not, even the idea of doing so assumes that one knows the true parameters of the factory, which in practice one doesn’t. We will return to this topic later, but for now we note that for the following sample size calculations wy will ussumy thut thy fuwtory is supyrlutivy according to the definitions in section 3.1 above; intuitively then, we will be assuming that the factory is ten times better than it needs to be (in a carefully defined sense). IB1B1 Wontinuous purumytyr with FAsixyx wonstruints We have seen in section 4.1.1 how to test each item involving a continuous parameter that is (log-)Gaussianly distributed with a two-sided constraint. We will examine only DFP basic item 1, as Gaussian and log-Gaussian distributions can be linearly transformed to match the other basic items in the DFP that contain a two-sided constraint. Therefore, our data x will come from a log-Gaussian distribution logN(P ), i.e. log x will be Gaussian with median  and standard deviation . The lower and upper constraints on log x are respectively log(8) and log(12). The parameters, according to section 3.1 and knowing that the median is e, are as follows: 1. For probability content p = 0:95,  = 2:2822 and  = 0:0722. 2. For probability content p = 0:975,  = 2:2822 and  = 0:0671. Taking into account what we discussed in section 5.1, the minimum number of measurements pyr itym needed when testing a superlative factory in order to have 0.9998 probability of passing each item is: 1. For probability content p = 0:95, at least 108 measurements are needed. 2. For probability content p = 0:975, at least 154 measurements are needed. Therefore, as we have three DFP basic items with two-sided constraints and ISO 20072 mentions 8 environmental conditions for 0.95 content, and a further 6 conditions for 0.975 content, the total number of measurements needed for passing the 2-sided basic items will be 3×(108×8+154×6) = 5364. IB1BF Wontinuous purumytyr with onyAsixyx wonstruint As we have seen in section 4.1.2, this is very similar to the previous subsection. 18 Again, we will examine only DFP basic item 3, as Gaussian and log-Gaussian distributions can be linearly transformed to match the other basic items in the DFP that contain a one-sided constraint and a superlative factory. Therefore, our data x will come from a log-Gaussian distribution logN(P ) and the lower and upper constraints on log x are −∞ and log(15) respectively. The parameters, according to section 3.1 and choosing an arbitrary value below the upper constraing for the median e, are as follows: 1. For probability content p = 0:95,  = 1:3540 and  = 0:5256. 2. For probability content p = 0:975,  = 1:3540 and  = 0:4823. Taking into account what we discussed in section 5.1, the minimum number of measurements pyr itym needed when testing a superlative factory in order to have 0.9998 probability of passing each item is: 1. For probability content p = 0:95, at least 93 measurements are needed. 2. For probability content p = 0:975, at least 135 measurements are needed. Therefore, as we have 5 DFP basic items with one-sided constraints and ISO 20072 mentions 8 environmental conditions for 0.95 content, and other 6 conditions for 0.975 content, the total number of measurements needed for passing the one-sided basic items will be 5× (93× 8+135× 6) = 7770. IB1B3 drovuvility purumytyr On account of what we discussed in sections 5.1 and 4.1.3, we suggest the following frequentist testing protocol. We iterate the following procedure over a suitably wide range of values of N and Nn, picking the version giving the minimum total number of measurements. We consider testing N devices, collecting Nn 0/1 measurements from device n; we consider only the possibility that Nn is the same for all n, although clearly there are other possibilities. We need to determine a critical region, i.e. a subset of {0P 1P : : : P Nn}N which gives a frequentist fconfidence level of 95%. As usual there are an enormous number of possible critical regions; we consider only those of the form “At most Nmax devices give more than nmax measurements that are 0”, and consider what values of Nmax and nmax result in this defining a critical region for 95% fconfidence. The null and alternative hypotheses are as follows, where q is the lower constraint on the proba- bility parameter, p is the required probability content and F (x0) = e (x ≤ x0) is the cumulative distribution function of the probability parameter x: 19 H0 : F (q) ≥ 1− p H1 : F (q) < 1− pP The probability that the data lies in a critical region of the form being considered is maximised over h ∈ H0 by the “critical distribution” given by F (x) =  0 if x < qP 1− p if q ≤ x < 1P 1 if x = 1: and we therefore can determine, for each set of values of NPNnP NmaxP and nmax, whether or not the probability that data generated from the critical distribution lies in the critical region is less than 0.05 as required. Having picked the set of values of NPNnP NmaxP and nmax minimising N ×Nn over those satisfying this requirement on the critical region, we can then check whether if data is instead generated from the a superlative factory we get a sufficiently high probability (at least 0.9998) that thvt data lies in the critical region. If not, we discard that set of parameters and try the next best. Eventually we come to the set of parameters minimising the total number of measurements among those that are permissible. We then measure each of the N devices Nn times. The hypothesis test for this item is then passed, i.e. the null hypothesis H0 is rejected and we say, with 95% fconfidence, that the factory is good for this item, if the data produced lies within the critical region. When testing a superlative factory (see section 3.1), we have found the following optimal minimal numbers of measurements and hypotheses needed in order to achieve what we discussed in section 5. For items that with a lower constraint on the probability parameter of 0.99, we will say that a device “passes” if we observe at most 8 fails per device and the critical region is “every device passes”, i.e. Nmax = 0P nmax = 8. The following numbers are pyr itym: 1. For content p = 0:95, it is optimal to test N = 71 devices 1171 times each, giving 83141 measurements. 2. For content p = 0:975, it is optimal to test N = 141 devices 1192 times each, giving 168072 measurements. Therefore, as we have 2 DFP basic items with target 0.99 and ISO 20072 mentions 8 environmental conditions for 0.95 content, and another 6 conditions for 0.975 content, the total number of measure- ments needed for passing the target 0.99 basic items will be 2× (83141×8+168072×6) = 3347120. With British understatement, we note that this number is rather large, particularly in a context where manufacturers are hoping to use no more than a couple of hundred devices to test and to test each only a handful of times. We note also that we have paid no attention to whether a device would in practice survive being tested over one thousand times. 20 For items with a lower constraint on the probability parameter of 0.93, we will say that a device “passes” if we observe at most 9 fails per device and the critical region is “every device passes”, i.e. Nmax = 0P nmax = 9. The following numbers are pyr itym: 1. For content p = 0:95, it is optimal to test N = 68 devices 190 times each, giving 12920 measurements. 2. For content p = 0:975, it is optimal to test N = 142 devices 184 times each, giving 26128 measurements. Therefore, as we have 4 DFP basic items with target 0.93 and ISO 20072 mentions 8 environmental conditions for 0.95 content, and other 6 conditions for 0.975 content, the total number of measure- ments needed for passing the target 0.93 basic items will be 4× (12920× 8+26128× 6) = 1040512. Note that, while lower that the previous one, this number is still rather large. IB1B4 doisson ruty As in the previous case, we search for parameters N the number of devices to be tested, Tn the length of time for which each the response of device n is to be recorded, Nmax and nmax specifying the critical region of the form “Not more than Nmax devices gave more than nmax events while being observed”. Again, we consider only the case where all Tn are the same, and we search for the minimum number of total hours of observation N × Tn compatible with a critical region giving a fconfidence level of 95% and a sufficiently high pass probability for a superlative factory. The null and alternative hypotheses are as follows, where  is the upper constraint on the Poisson rate, p is the probability content and F (x0) = e (x ≤ x0) is the cumulative distribution function of the Poisson rate x: H0 : F (−) ≤ p H1 : F (−) S p: The probability that the data lies in a critical region of the form being considered is maximised over h ∈ H0 by the “critical distribution” given by F (x) = { p if x < P 1 if  ≤ : and we therefore can determine, for each set of values of NPTnP NmaxP and nmax, whether or not the probability that data generated from the critical distribution lies in the critical region is less than 0.05 as required. Having picked the set of values of NPTnP NmaxP and nmax minimising N × Tn over those satisfying this requirement on the critical region, we can then check whether if data is instead 21 generated from the a superlative factory we get a sufficiently high probability (at least 0.9998) that thvt data lies in the critical region. If not, we discard that set of parameters and try the next best. Eventually we come to the set of parameters minimising the total number of measurements among those that are permissible. We then observe each of the N devices for a time Tn. The hypothesis test for this item is then passed, i.e. the null hypothesis H0 is rejected and we say, with 95% fconfidence, that the factory is compliant in the environmental condition in which the item was tested, if the data produced lies within the critical region. When testing a superlative factory (see section 3.1), we have found the following optimal minimal device-hours of observation time needed in order to achieve the goals of section 5. 1. For content p = 0:95, each device “passes” if we register no more than 9 false inhalations during a testing time of 11.1 hours. The optimal testing time we found is 11.1 hours per device, and a total of 76 devices. This gives a total of 843 device-hours of observation. 2. For content p = 0:975. each device “passes” if we register no more than 8 false inhalations during a testing time of 9.9 hours. The optimal testing time we found is 9.9 hours per device, and a total of 182 devices. This gives a total of 1801 device-hours of observation. Therefore, as we have 2 DFP basic items with upper constraint 1/hour and ISO 20072 mentions 8 environmental conditions for 0.95 content, and another 6 conditions for 0.975 content, the total number of device-hours of observation needed for passing the Poisson basic items will be 2× (843× 8 + 1801× 6) = 35100 device-hours. 5CG hummvry of zffort rzquirzy for frzquzntist tzst We now have all the necessary information to assess how much testing needs to be done, in order to have a probability at least 95% that a supyrlutivy factory passes the whole test, that is passes all items. Note that a superlative factory is an almost perfect one, and in reality such a factory may be almost impossible to design. Therefore, in reality, all the numbers are likely to be even higher. Summing the minimal number of measurements needed for each category of items, we get that at least 3347120 + 1040512 + 7770 + 5364 = 4400766 measurements are required plus 35100 device- hours of observation. If we reckon that a robot could perform one measurement per minute, this gives a total of almost 12.5 years of robot time. This makes it an incredibly difficult task for the reality we live in. But this is just the tip of the iceberg - in reality, the only frequentist way that we could know that our factory was superlative is by doing an amount of testing that is enormously larger still. This wunnot be the right way of doing things ! Sadly, if we are determined to stay with frequentist rather than Bayesian methods, this naturally points the discussion to the topic of cheating and manipulating data, to which frequentist testing provides exquisitely acute temptation. 22 5C3 Chzvting In our opinion, the enormous amount of testing needed in order to honystly pass such a test naturally leads to the phenomenon of cheating and data manipulation. The first and most serious consequence of this is, of course, that the safety of patients is in danger. People with experience in the industry (who understandably wish to remain anonymous) tell us there there are several ways one can cheat while doing frequentist tests, some of which are pretty obvious, but nonetheless hard to check and control: 1. There is no assurance that the critical region was picked before rather than after collecting the data. In some instances, it is done uftyr doing the measurements, which completely invalidates the results, although this is becoming more difficult now that the test plan is required to be recorded in advance. 2. Censoring of data: when the observed data does not fall into the critical region, hence the frequentist approach cannot infer anything, the temptation is to xiswurx some or all of the unfavourable measurements and collect new measurements to take their place. In an extreme case, of course, one can discard every bad measurement and get 100% “correct” results. There are a battery of excuses available for discarding such data: (a) “I haven’t actually started yet” (b) “I didn’t do that one properly” (c) “I wasn’t wearing goggles, so I need to do it again” (d) “That one was faulty” (e) “Operator error” Our suggestion is that if an independent observer, given other data on vll the measurements, but deprived of their results, is able to determine precisely the set of measurements to be discarded, and is in agreement with the operators, then that is reasonable evidence that those measurements may be discarded. In reality, often, the decision to discard is considered only for those measurements which have given an undesirable result. 3. The “Concession Note”: here it is admitted that some of the items did in fact fail, but only in small print and buried in a large report, while the executive summary states that “All the tests passed as described in detail in the body of this document”. Such documents may even succeed in obtaining regulatory approval; we can never tell whether this is because the regulator hasn’t noticed the concession note, or because they realise that it is not realistic to hope that the ISO 20072 test will truly be passed. Because of the unrealistic number of measurements needed to pass a frequentist test of a realistic factory honystly, taking only a few measurements results in obvious cheating that is in principle easy to detect as follows: 1. Calculate a level of performance that although in H1 gives a very low probability of passing the test. 23 2. Collect sufficient new data to show that the performance is indeed that bad. In other words, if a factory passed such a frequentist test and the manufacturer claims that the reason is a good design, then there are a number of possibilities: 1. Their factory is not just superlative, but actually close to pyrfywt. (We may believe that such a perfect factory is yet to be designed, and if so we note that proving that a factory is not perfect is far easier than proving it is perfect.) 2. Their DFP contains very few basic items. Noting that even one basic item gives rise to 14 items, getting a small number of items while still providing adequate control of the parameters of the ADDD is difficult. 3. Their DFP constraints are very slack. This is likely to mean that their ADDDs may not be safe. 4. They have cheated in one of the many possible ways. However, there is also the danger of cheating unintentionally. In the following subsection we will discuss an anecdote brought to our attention by someone in the industry, where the intention of being honest resulted in a false claim to have passed a test. IB3B1 An yfiumply of un ignorunt uttympt to vy honyst Say that we want 95% fconfidence that 97.5% of the devices do not break when dropped. To do this, drop 119 devices. If none break, pass the item. If any break, fail the item. Then the hypothesis space is H = [0P 1], the null hypothesis is H0 = [0P 0:975] and H1 = (0:0975P 1]. Then the data space of how many devices broke is m = {0P 1P · · · 119} and the critical region is X = {0}. Consider h0 ∈ H0. Then e (x ∈ X | h0) = h1190 which achieves its upper bound of 0:975119, which is about 0:05, hence the fconfindence is about 95% – if the data falls in X. However, without planning to do this in advance, what was actually done was that when a device did break (and some did), it was replaced by 3 others and the data on the broken device discarded. While the intention was to be honest (the operator knew that just discarding the broken device and replacing its data with one new device would be dishonest), this procedure xrustiwully reduces the fconfidence as shown in the following calculation. Denote the statement “on 1 drop either 0 break or 1 breaks then 3 more do not” by (E). Then e (E | h0) = h0 + (1 − h0)h30 which achieves its upper bound of 0:975 + 0:025 × 0:9753, which is about 0.9982. This happens 119 times with probability 0:9982119, which is about 0.8043. Therefore the fconfidence is now at most 19.6%, when it was supposed to be 95%. 24 Since this “replace by three” procedure was not planned in advance, the test is invalid anyhow. To see that the situation is actually even worse than it appears from the above calculation, one must ask oneself “If one of the three replacements had also broken, would it have been replaced by a further set of three, and so on iteratively as needed ?”. If the answer to this is Yes, then a further calculation shows that although the item will now pass with probability 1 (so long as h0 S 0), the fconfidence then achieved would be zero. Thus we see that: 1. The fconfidence resulting depends on “what would have been done if...”, which is why any test whose procedure and critical region have not been fixed in advance cannot be regarded as valid. 2. Those who are intending to be honest, but whose mathematics is not up to the job, are likely to falsify the results due to the extreme temptation posed by the conflict between the commercial pressure for the test to pass and the impossibility of getting it to pass honestly. 5CI ihz frzquzntist rzsults vrz in fvxt irrzlzvvnt Apart from the huge numbers of devices required and the temptation to cheat, the frequentist approach has some other major flaws that makes its results in our view irrelevant. IB4B1 Fwon xynwy misvyhuvys whyn uppliyx to womvinutions of ityms To see this, consider a simplified DFP consisting of just one basic item of the form “When dropped the device must not break”, to be tested under only two environmental conditions (impossible under ISO 20072, of course). To make the diagrams drawable, suppose also that the probability content required is only 0.9, and that we require only 65% fconfidence of that. We then make the following observations. 1. If a factory passes the test, we have fconfidence 65% that the factory is good on each item taken one at a time. Any reasonable intuitive understanding of “confidence” would require that our confidence that the factory is simultaneously good on both items be not 65% but 0:652 = 0:4225 or 42.25%. This is illustrated in Figure 1. 2. However, if we consider the combined simultaneous frequentist test of both items, with the alternate hypothesis being “the factory is simultaneously good on both items” and the null hypothesis therefore being “the factory is not good on at least one item”, then the null hypothesis is the shaded area, i.e. everything apart from the top right small square of Figure 2. But now e (fi ∈ X | h0) is maximized over h0 ∈ H0 not by taking h0 = (0:9P 0:9), but by taking h0 to be either of the points (0.9, 1) or (1, 0.9) (indicated by the red stars), which makes that probability 0:910 × 110 = 0:35 approximately, so again we have 65% fconfidence, not 42.25% fconfidence as we would expect from 1 above. Thus to think of fconfidence as similar to the intuitive concept of “confidence” is misleading. 3. Similarly, doing a test according to ISO 20072 with 224 items and the null hypothesis “the factory is bad on at least one item”, given that the factory passes all items, the fconfidence that the factory is simultaneously good for all items is still 95%, where any reasonable person 25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 1: How the intuitive idea of “confidence” behaves when two independent items are combined. If we are 65% confident that a point chosen yesterday from the unit square uniform randomly is to the left of the red line, and 65% confident that it is below the blue line, then the confidence that it is both to the left of the red line and below the blue line should be not 65% but 42.25%. p1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 p2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 2: Illustration of the hypothesis space for the simplified 2-item test discussed in subsection 5.4. The null hypothesis H0 is the shaded area, i.e. everything except for the small square at the top right. The two values of h0 maximizing e (fi ∈ X|h0) are indicated by the red stars. would expect it to be 0:95224, which is approximately zero. Thus frequentist hypothesis tests seem unable to distinguish being, for each of N independent items, fconfident that the factory is good for that item, and being fconfident that the factory is simultaneously good for all N items. 4. Given a factory design that intends to make the factory conform to the limits in the DFP as interpreted by the probability content specifications in ISO 20072, the frequentist approach is overwhelmingly likely to reject such a factory. It will do so not because the factory is bad, but because of “bad luck”, i.e. the data not falling in the critical region. Indeed, rather than calling this “bad luck”, we would consider it to be “bad statistics”. 26 IB4BF Fryquyntist tysting wonsixyrs thy wrong xynominutor When we use a phrase such as “We are 90% confident that the factory satisfies property Z” we are implicitly considering some population a of factories and saying that 90% of that population satisfies Z. We need to consider what population is relevant for us to be thinking of. Once we have collected our data, we would argue that the population a of interest is the subpop- ulation of all possible factories that would have yielded that data - as that is the subpopulation to which we have managed to determine our factory belongs. A frequentist test, however, instead considers the subpopulation of factories that has one of the most difficult to detect non-compliances and is perfect on all other items. Even if a manufacturer wanted to produce such a factory, it would be impossibly difficult. No reasonable person would believe it remotely likely that the factory in question was such a factory. However, any reasonable person fiouly believe that the factory we are dealing with is one that has produced the data we have observed - indeed this is a tautology. This, above all other reasons, is why we believe that a frequentist test is inappropriate for this purpose. However, for yet another way in which fconfidence does not correspond to the intuitive understanding of confidence, see theorem A.2 in the appendix. 27 6 Wvyzsivn tzsting When designing a better testing protocol, the issues to be dealt with are: 1. The current protocol depends on an assumption of Gaussianity, which is very likely to be in- appropriate. The protocol’s statistical assertions are incorrect for non-Gaussian distributions. This assumption needs to be replaced. 2. The current protocol has a high chance of rejecting factories that are actually compliant. A satisfactory protocol must have at most a small probability of rejecting factories that are compliant. Of course, it must also be the case that, after testing, for each item, the probability that a passed factory is good for that item is high. One might also hope that the probability that a passed factory is good for all items is high, but this is not realistic; achieving e.g. 0.95 probability that a factory is good for all items would require achieving, for each item, a 0.9998 probability that the factory was good for that item, which would take a truly immense amount of testing. 3. The current protocol employs sample sizes that are fixed in advance, as we have seen in section 5, and for the protocol to have a good chance of accepting a superlative factory (let alone an arbitrary compliant factory), those sample sizes have to be very large. To reduce the expected cost of the testing process, it would be desirable to use sample sizes that are not fixed in advance, that to be able to stop the test when the evidence for one hypothesis or another is sufficiently strong. 4. The current protocol is designed around knowing that the factory is superlative (and any protocol on the same lines is designed around some assumption on the quality of the factory that is much more restrictive than that the factory is good for all items). It is desirable that the protocol is completely automatically adaptive, and that the design does not depend on any stronger assumption than that any factory that is good for all items should have a high probability of passing. Now, we note that there exist methods of frequentist testing that design critical regions spanning a range of different sample sizes, and that if the precise properties of the factory under test are known, then it is possible to somewhat reduce the required sample size, at the expense of reducing still further the probability that a factory will pass the test that has different but still compliant true properties. However, we have chosen not to explore this route, as the fundamental issue is that true properties of the factory are not known before it is tested. 6CF hzquzntivl Wvyzsivn tzsting We propose a Bayesian iterative method, since Bayesian inference makes full use of all the data, given the assumptions; sequential hypothesis testing is particularly straightforward when a Bayesian approach is taken. We will first describe the general method, then explain in section 6.1.2 why it is valid — as this situation is so unlike that in a frequentist analysis, where repeated analysis and decision to stop when that analysis is favourable in general biases the results. 28 For both description and validity we will take a simpler example, that of determining whether the probability p that a biased coin lands on heads is at least p0. The same principles apply to the ISO 20072 case. 6B1B1 Dyswription of thy mythox If a fixed number N tosses are made, the number r of heads is expected, given p, to be binomially distributed with mean Np. The likelihood function is: e (r | NP p) = ( N r ) pr(1− p)N−r (1) In a sequential approach, the number of tosses N is not fixed, but, whatever the stopping rule, the likelihood function’s dependence on p still has the same form as the binomial distribution (1) given N . We will determine in advance a prior probability distribution on p. For simplicity, we will assume that this distribution has a density function e (p). Let s be the actual sequence of head/tail outcomes s1P s2P : : : sN , with sn = 1 denoting heads and sn = 0 denoting tails. Let h be the number of tosses we make before we decide to stop collecting data, and let hN denote the event that h takes the value N . Whatever the stopping rule, the likelihood function given N is: e (s | pPN) = (ps1(1− p)1−s1) (ps2(1− p)1−s2) : : : (psS (1− p)1−sS) = pr(1− p)(N−r) Suppose we decide to analyse the data after N tosses, where N is chosen independently of the data and of the unknown p, but that we have not necessarily decided to stop collecting data at that point. For example, we might decide to analyse the data after every new toss. Then the posterior probability density of the probability p is: e (p | sP N) = e (s | pPN)e (p | N) e (s | N) = e (s | pPN)e (p) e (s | N) where e (s | N) is the normalising constant e (s | N) = ∫ 1 0 e (s | pPN)e (p) dp: 29 However, if our stopping rule has also decided to stop at this point, we instead need e (p | sP hN P N) = e (sP hN | pPN)e (p) e (sP hN | N) P where e (sP hN |N) is the normalising constant e (sP hN | N) = ∫ 1 0 e (sP hN | pPN)e (p) dp: Note, however, that we have seen an equation for e (s | pPN) but that we need to use e (sP hN | pPN). If h = N is fixed in advance, these two quantities are the same, but otherwise we will need to consider how these two quantities are related. Let us next define the proposed sequential protocol, that is, the rule for deciding whether to stop or continuing testing. At arbitrary intervals during data collection (for example after each new toss or measurement), we compute two posterior probabilities, based only on the sequence of tosses we have seen so far: aN = ∫ 1 p0 e (p | s) dp which is the probability, given the data gathered thus far, that the probability p exceeds the required probability content p0; and bN = ∫ p0 0 e (p | s) dp which is the probability that p is smaller than p0. We stop testing and declare the item to have passed with (Bayesian) confidence aN if and when aN S , and it stops testing and declares the item to have failed with confidence bN , if bN S , where and are protocol parameters. Thus hN occurs if and only if N is the smallest integer such that aN S or bN S . If neither aN S nor bN S then the protocol continues testing. The protocol parameter controls the required confidence level to pass, and in this application will always be 0.95 as prescribed by ISO 20072. If an item passes, then the probability that the factory is bad for that item is 1− a, which is less than 1− . 30 On the other hand, specifies how sure the manufacturer has to be that the factory is bad before he gives up on it. It is usually catastrophic financially to give up on a factory and reject it, and the manufacturer is likely to set very close to 1. In the following we will assume that it is set at exactly 1; in other words, that the manufacturer will continue testing until such time as the factory passes, or otherwise for ever. 6B1BF Vulixity of thy Vuyysiun syquyntiul mythox We now consider why the Bayesian sequential method is valid. This essentially involves a con- sideration of the relationship between e (s | pPN) and e (sP hN | pPN), and hence that between e (p | sP hN P N) and e (p | sP N). But these are actually a very simple relationships. Consider e (sP hN | pPN) = e (s | pPN)e (hN |sP pPN) by the chain rule of probability. Now, e (hN |sP pPN) is either 0 or 1 depending on whether (aN S ) or bN S ) or not — and the formula for aN P bN do not involve p (except as a dummy variable of integration). They do, of course, involve s. Thus e (hN |sP pPN) = e (hN |sP N), and we have e (sP hN | pPN) = e (s | pPN)e (hN |sP N)P and hence e (p | sP hN P N) = e (sP hN | pPN)e (p)∫ e (sP hN | pPN)e (p) dp = e (hN | sP N)e (s | pPN)e (p)∫ e (hN | sP N)e (s | pPN)e (p) dp = e (s | pPN)e (p)∫ e (s | pPN)e (p) dp = e (p | sP N)P so that the decision to stop, made in this way, does not affect the inference we draw about whether or not p S p0. Thus as we have seen, the key point is that the stopping rule is conditionally independent of the unknown parameters given the data. In exactly the same way, this sequential testing and analysis protocol does not influence in any way the inference about whether an item is good or not — in total contrast to the effects of either repeated analysis or repeated data collection under the frequentist paradigm. It is, however, not impossible to make a stopping rule which is not conditionally independent of the unknown parameters given the data, and with such a stopping rule a Bayesian analysis would be invalid without modification. Two examples of such a stopping rule are: 1. Suppose we take a peek at each new data item before including it in our data, and decide to stop if the next data item doesn’t look good, but fiithout including it in the data considered. 31 Correct calculation of the effect of such a stopping rule would unsurprisingly show that the inference resulting would be the same as if the data item had been included. Thus data censoring (which this would be a form of) is illegal under the Bayesian paradigm just as much as under the frequentist paradigm. 2. Suppose we hear that our rival up the road has stopped their data collection on exactly the same subject, and decide to stop ours as a result and rush to publication. Such an event is clearly dependent on our rival’s data, and hence on the unknown parameters, but not via our own data, and stopping on this basis would therefore invalidate our analysis. (In this situation it should be obvious that the right thing to do is likely to be to cooperate with our rival and merge data sets.) 6B1B3 Furthyr usyful propyrtiys of thy syquyntiul Vuyysiun upprouwh In addition to giving us the ability to repeatedly analyse the data and automatically assess whether we have yet achieved the desired confidence level, such Bayesian testing also yields an additional advantage: as shown in appendix B, the probability of passing a good factory for a finite number of items now approaches 1 as the number of measurements approaches infinity (at least, so long as = 1). Of course, that is no guarantee that the necessary number of measurements will be feasible - but it turns out that it is vastly smaller than the corresponding number of measurements needed under the frequentist paradigm. Thus the sequential Bayesian approach solves issues 2, 3, and 4 of the list at the start of section 6. As we will see shortly, it is also able to address issue 1. 6CG bixturz moyzls Before actually reporting the number of measurements needed using this method, we need to address some issues connected with issue 1 of the list at the start of section 6. As motivation, we note that most ADDDs are made of plastic, and that in a factory it is typical to use not just one mould for any particular plastic part, but several, in order to increase the rate of production. In consequence, the population of ADDDs produced is often a mixture of several closely related but non-identical populations. This situation can be modelled using mixtures of distributions, as follows. We will denote the number of data points by N , the data points themselves by xn for n = 1P 2 : : : N , and the (unknown) number of mixture components by K (we will also call them “clusters”). Each such component has its own set of parameters k; depending on which item is being considered, k may be a vector of several scalar parameters. We denote by q the weights of the components, where qk is the weight of component k. Hence qk ≥ 0 for k = 1P 2 : : :K and K∑ k=1 qk = 1. 32 K 1 2 3 4 5 6 7 8 9 10 P( K) 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Figure 3: Probability mass function for K Therefore e (xn) = K∑ k=1 qke (xn | k). We also denote by kn the mixture component from which xn is taken, for n = 1P 2 : : : N , and by k the vector of these indices. Thus e (xn | kn) = e (xn | kn). We will denote by b = (n1P : : : P nK) the number of data points coming from each cluster, i.e. Nk = |{n : kn = k}|. 6C3 eriors on K vny q We choose as prior for the number of clustersK the geometric distribution, i.e. e (K) = (1−)K−1 with  = 0:5. A plot of the probability mass function can be found in figure 3. Because the distribution e (b | NPq) is the Multinomial, we will use its conjugate prior, that is the Dirichlet Distribution, for q, with support on the K-simplex. Therefore, e (q | PK) = Γ( K∑ k=1 k) √ K K∏ k=1 Γ( k) K∏ k=1 q k−1k : For simplicity, we choose to make k = 1 K for all k = 1P 2 : : :K. 6CI bvrkov Chvin vny Giwws hvmpling Using the data and the priors, we will perform Markov Chain Monte Carlo to assess whether the population of devices tested is compliant with the restriction of a particular item. In this section, we will discuss the Gibbs sampling rules of the mixture model variables and how to sample the 33 next element of the Markov Chain. For general information on Markov Chain Monte Carlo and Gibbs Sampling, we recommend the review by Neal in [3] (the Gibbs Sampling method is discussed in section 4.4). We do Gibbs sampling as follows. Starting from: e (fiPkPqPKP) = e (K)e (q | KP )e (k | qPKPN)e (fi | Pk)e () 1. First sample K. Before this, we need to rearrange the ympty clusters to be at the end of the list of clusters. Then the probability mass function of K will be e (K | fiPPk) ∝ P (K)Γ( K∑ k=1 k) √ K K∏ k=1 k √ K K∏ k=1 ( k+Nk) Γ( K∑ k=1 ( k+Nk)) if K is greater than or equal to the old number of nonempty clusters, and e (K) = 0 otherwise. Then, we have three possible situations: (a) The new sampled K is the same as the old one. In this case nothing special needs to be done. (b) The new sampled K is less than the old K. In this case, we randomly select from the empty clusters which ones to discard. (c) The new sampled K is greater than the old K. In this case, we form extra clusters by taking their corresponding  parameters from the priors on , which depend on the item. 2. Sample q from e (q | fiPkPKP) ∝ e (q | KP )e (k | qPKPN). This is again a Dirichlet distribution with new parameters ak = k +Nk for k = 1P 2 : : :K. 3. Sample k from e (k | fiPqPKP) ∝ e (k | qPKPN)e (fi | kP) = N∏ n=1 qkne (xn | kn). Evaluate this for each possible value of kn, then draw from the discrete distribution, for n = 1P 2 : : : N . 4. Sample  from e ( | fiPkPqPK) ∝ e ()e (fi | kP) = K∏ k=1 e (k) N∏ n=1 e (xn | kn). 5. Then to ensure that detailed balance holds for the combination of moves, we make the se- quence palindromic, by repeating all but the last step in reverse order. 6C5 hvmpling vny xoyz In the previous subsection, we have seen that our code samples from several types of distributions: 1. From a discrete (possibly unnormalised) distribution: our code normalises the probability mass function then uses an uniform random variable u in the classical way, i.e. if u < p1, the sample is 1, if p1 + : : :+ pk−1 ≤ u < p1 + : : :+ pk, the sample is k etc. 34 step of MC 0 100 200 300 400 500 600 qv ec 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 4: Behaviour of code when data is sampled from a true distribution with 5 mixture compo- nents (the dotted lines). Note that the magenta and cyan components are almost negligible, and that there is insufficient data to determine for definite whether some of the data points belong to the magenta, cyan, blue, or red clusters. 2. From a Dirichlet distribution, say with parameters 1P 2P : : : P K we want to sample q. Then sample rk from the Gamma distribution Gamma( k, 1) using the standard Matlab function gamrnd() and set qk = rk K∑ j=1 rj for k = 1P 2 : : :K. 3. From a continuous distribution on (0P+∞): here our code transforms the variables from the interval (0P+∞) to (0P 1) using the map x→ xx+1 , then performs “Slice Sampling” according to Neal in [4], before transforming the sample back using x→ x1−x . Similarly for other ranges of support using different transformations. When monitoring the performance of the MCMC algorithm on synthetic data, there is an iden- tification problem on the clusters; there is no hope of the algorithm determining which cluster it has found corresponds to which cluster number in the synthesis (and of course it may currently be considering more or fewer clusters than the true number). In order to ensure that the true and dis- covered clusters were colored with (as far as possible) the correct colors, the Hungarian algorithm [5] was used to associate true cluster numbers with clusters in a fashion maximizing the number of correct assignments of data points to true cluster numbers. A description of the Hungarian (or Munkres) Algorithm can be found in Harold Kuhn’s article in [6]. A plot of the sample trajectory of q from the Markov Chain is shown in figure 4. Convergence was established by constructing synthetic data with a variety of true parameter values and ensuring that the duration of sampling run was sufficient that all these had converged to distributions covering the correct values. 6C6 hvmplz sizzs nzzyzy for zvxh typz of DFe itzm In this section, we will detail the method of determining the number of measurements likely to be needed using the Bayesian approach. Note that we report the myxiun of 20 example runs rather than the mean, as when the distribution of something is unknown and we have only a small number 35 of samples to estimate it from it is much easier to estimate the median than the mean. Clearly when summing these various numbers we must then caution the reader that the sum of medians is not usually equal to the median of the sum of the variables. 6B6B1 Wontinuous vuriuvly with twoAsixyx wonstruint As we discussed before, our approach will be able to simulate reality better by not relying on the assumption of Gaussianity. Our Bayesian software will take into account the possibility of the data coming from a Student distribution, which is a generalisation of a Gaussian. Clearly this generalises to a log-Student instead of a log-Gaussian by taking the log of the data, and of course we will be considering mixtures of these distributions. In combination this makes the Bayesian method vastly more powerful and realistic than one assuming Gaussianity which ISO 20072 tempts the user to do. The Student distribution with shape m (or with “2m degrees of freedom”) is the marginal distri- bution obtained by integrating out s from the following two: e (x | P sP ) = √ s 2.2 e− s 22 (x−)2 where s is Gamma distributed e (s | mP r = m) = mmΓ(m)sm−1e−ms. Then the Student distribution has the form e (x | P Pm) = Γ(m+ 1 2) Γ(m) 1√ 2.m2 (1 + 1 2m (x− )2 2 )−(m+ 1 2 ) We will also denote 1 2 by  . We consider only items that contain DFP basic item 1. As the distributions can be scaled, we will get the same numbers for other items that contain a two-sided constraint. In section 5.1.1 we already saw the parameters for a superlative factory:  = 2:2822,  = 0:0722 when the content is 0.95, and  = 0:0671 when content is 0.975. Also, we set the true m = 1000 because the Student distribution tends to the Normal as m→ +∞. The acceptable region of  and  is plotted in figure 5. We now set reasonable dependent priors on ,  and m and plot them in figures 6, 7 and 8. We choose the prior for  to be a Gamma distribution with shape and scale parameters k0 = 0:9 and t0 = 140. This can be found in figure 6 with some comments. We choose the prior for  given  to be Normal(0P 1√ 0 ) with parameters 0 = 2:2822 and 0 = 0:02. This is actually a conjugate prior to the Gaussian. See the contours of the prior on P  in figure 7. The conjugate prior on m takes the form of a so-called “Pro-Gamma” with parameters a0P b0 S 0: e (m | a0P b0) ∝ mb0mΓ(m)b0 e−(a0+b0)m with a0 = 0:2 and b0 = 0:8. See the plot in figure 8. Therefore in this DFP basic item, the parameters of clusters are k = (kP kPmk) for k = 1P 2 : : :K. 36 mu 2.05 2.1 2.15 2.2 2.25 2.3 2.35 2.4 2.45 2.5 si gm a 0 0.02 0.04 0.06 0.08 0.1 0.12 Figure 5: The acceptable region for  and  for DFP 1, in the special case that m = ∞ (i.e. distribution is (log-)Gaussian), wanting probability content of 0.975 lies below the curve, bearing in mind that the upper and lower limits are log(12) and log(8). tau 0 50 100 150 200 250 300 P( 0< x< tau ) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 6: Plot of the wumulutivy distribution function on  . It was taken such that the probability of  being the in acceptable region in figure 5 to be about 0.2, so as not to bias the Bayesian approach towards what we know to be the true distribution. mu 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 si gm a 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 200 400 600 800 1000 1200 1400 1600 Figure 7: Plot of the contours of the prior on P . If we look at the acceptable region in figure 5, we can see that there is enough probability both inside and outside the acceptable region, so again the prior is not prejudiced in favour of H1. 37 m 0 5 10 15 20 25 30 35 40 45 50 P( m) 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Figure 8: Plot of the selected conjugate prior for m. Next we describe the Gibbs sampling updates which take the place of step 4 of section 6.4, for which we reintroduce s = (sn)n=1P:::PN in order to be able to sample more easily, so that (taking the case that there is only a single cluster for simplicity) e (fiPmP P P s) = e (fi | mPP P s)e (m)e ( | )e ()e (s | m): 1. Sample s from Gamma distributions with shape and scale parameters for sn being Gamma( m+ 12 Pm+  2 (xn − )2 ). 2. Sample m from the new Pro-Gamma with parameters b1 = b0 + N and a1 = a0 + N∑ n=1 (sn − log(sn)) − N . Remark that this is valid since a1P b1 S 0 as a result of the inequality u ≥ 1 + log(u) for u S 0. 3. Sample  from the new Gamma distribution with the new shape and scale parameters k1 = k0 + N+1 2 and t1 = t0 + 0(−0)2 2 + N∑ n=1 sn(xn−)2 2 . 4. Sample  from the Normal(1P 1√  1 ) with the new parameters 1 = 0 + N∑ n=1 sn and 1 = N∑ n=1 snxn+ 00 N∑ i=n sn+ 0 . 5. Repeat all but the last step above in reverse order, to ensure palindromicity and thus detailed balance. An example of the sample trajectories of  and  for a 5-cluster example can be found in figures 9 and 10. Finally, we report the number of measurements needed as the myxiun over 20 runs. 1. For content p = 0:95, we found that 37 devices are needed. 2. For content p = 0:975, we found that 41 devices are needed. 38 step of MC 0 100 200 300 400 500 600 m u 0 2 4 6 8 10 12 14 16 18 20 Figure 9: Markov Chain of  when the data is taken from a true distribution with 5 mixture components (the dotted lines). Note that the blue, magenta and cyan components are negligible by figure 4, so the Markov Chain is focused on the red and green. step of MC 0 100 200 300 400 500 600 si gm a 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 10: Markov Chain of  when the data is taken from a true distribution with 5 mixture components (the dotted lines). Note that the blue, magenta and cyan components are negligible by figure 4, so the Markov Chain is focused on the red and green. 39 mu 0 0.5 1 1.5 2 2.5 3 si gm a 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 Figure 11: The acceptable region of P  in DFP basic item 3 lies below the line. Therefore, as there are 3 basic DFP items with a two-sided constraint, and as there are 8 environ- mental conditions for content 0.95 and another 6 for content 0.975 in ISO 20072, the total number of measurements needed to pass items that contain a two-sided constraint is 3×(37×8+41×6) = 1626. We note that this is smaller than the corresponding frequentist number, even though we have al- lowed for both Student non-Gaussianity and for mixtures of distributions. 6B6BF Wontinuous purumytyr with onyAsixyx wonstruint The only difference from the two-sided problem is that we will draw superlative data from a Normal distribution with  = 1:3540 and  = 0:5257 for 0.95 content, respectively  = 0:4824 for 0.975 content. The corresponding acceptable region is in figure 11. The priors on P  will scale accordingly. Again, we report the myxiun number of measurements needed over 20 runs: 1. For content p = 0:95, we found that 20 devices are needed. 2. For content p = 0:975, we found that 25 devices are needed. Therefore, as there are 5 basic DFP items with a one-sided constraint, and as there are 8 environ- mental conditions for content 0.95 and another 6 for content 0.975 in ISO 20072, the total number of measurements needed to pass items that contain a one-sided constraint is 5×(20×8+25×6) = 1550. 6B6B3 drovuvility purumytyr We will consider the DFP basic items with the target probability 0.99, as the ones with target 0.93 are treated similarly. (Superlative test data will then be generated using a true probability parameter of 0:999 as in section 3.1.) 40 alpha 0 50 100 150 200 250 300 350 400 450 500 be ta 0 0.5 1 1.5 2 2.5 Figure 12: The acceptable region of P in DFP basic item 2, target 0.99, lies below the curve and is unbounded on the right. We then assume a model in which each cluster in a mixture has parameter distributed Beta( , ), then find the posterior distribution on and given some data and a suitable prior. Therefore the parameters of the clusters will be k = ( kP k). The acceptable region of and is plotted in figure 12. We choose a prior on and from the “ProBeta” family, the conjugate distribution to the Beta distribution with respect to the joint parameters ( P ), whose probability density function is given by e ( P | p1P p2P N0) ∝ ( Γ ( + ) Γ( )Γ( ) )N0 · pN0( −1)1 · pN0( −1)2 P where p1, p2, and N0 are independent parameters, constrained by p1P p2P N0 S 0 and p1+ p2 < 1 so the expression has finite integral. The values we take for p1P p2P N0 affect the shapes of the resulting Beta distributions. To avoid prejudicing the outcome, we would like to get all intuitive types of Beta distributions when drawing from the selected prior. We found the prior with p1 = p2 = 0:4525 and N0 = 0:01 satisfactory. See figure 13 of the contour plot and figure 14 of Beta distributions drawn from the prior. We note that the closure of the space of mixture distributions so modelled includes the frequentist critical distribution for this item. In this protocol, we measure the responses of N devices, testing the nth device Nn times and obtaining nn successes. We obtain two N -dimensional vectors, b and n. Since e (bPn | P ) = ∫ e (bPnPp | P ) dp = ∫ e (bPn | p)e (p | P ) dp 41 alpha 50 100 150 200 250 300 350 400 450 500 be ta 0.5 1 1.5 2 2.5 3 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 13: Contour plot of the unnormalised prior on P . Looking at the acceptable region in figure 12, we can see that there is a reasonable amount of probability both inside and outside of the acceptable region. x 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 P( x) 0 5 10 15 20 25 30 35 a = 63, b = 22 a = 0.9, b = 10.7 a = 250.7, b = 198.8 a = 60.4, b = 102.7 a = 363, b = 970.2 Figure 14: Sample Beta distributions drawn from the prior. They have various shapes and sizes, covering both narrow and broad and both those internal to the interval and those jammed up at one end. The values of P are specified in the top-right key. 42 step of MC 0 200 400 600 800 1000 1200 qv ec 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 15: Example of the sample trajectories of the Markov chain for the cluster weights q, when data is sampled from a true distribution with 5 mixture components (the dotted lines). Note that the magenta and red components are almost negligible, and that there is insufficient data to determine whether some of the data points belong to the magenta or red clusters. our posterior probability distribution subsequently takes the following form: e ( P |bPn) = e (bPn | P ) · e ( P ) e (bPn) ∝ ( Γ( + ) Γ( )Γ( ) )N0+N p N0( −1) 1 p N0( −1) 2 N∏ n=1 Γ( +nn)·Γ( +Nn−nn) Γ( + +Nn) This also tells us how to do the Gibbs Sampling step of the new element in the Markov Chain. Plots of example Markov chain sample trajectories along with some comments can be found in figures 15, 16 and 17. We considered only designs where the value of Nn was the same for all n, although exactly the same analysis would apply if one were to choose a variety of values for the various Nns beforehand, or, by a similar analysis to that in section 6.1.2 above, even if one were to decide which device to test next based on the results obtained from the various devices so far. We considered various possible common values for the Nns, and chose the considered value which gave smallest overall number of measurements. Reporting as usual the median of the sample sizes required in 20 runs, we found the following. For DFP basic item 2 with lower constraint 0.99 we needed, pyr itym: 1. For content p = 0:95, testing each device 100 times, we needed to test a median of 45 devices, for a total of 4500 measurements needed. 2. For content p = 0:975, testing each device 200 times, we needed to test a median of 67 devices, for a total of 13400 measurements needed. 43 step of MC 0 200 400 600 800 1000 1200 a lp ha 0 20 40 60 80 100 120 Figure 16: Example of the sample trajectories of the Markov Chain for the parameter . Note that, according to figure 15, the most accurate are again green, blue and cyan, because there is not enough data to represent the almost negligible red and magenta. Note also that there is considerable uncertainty about the precise values of the green and blue cluster parameters, all of which represent quite narrow clusters; a 2-d plot with would show trajectories of ( kP k) moving in and out along radial lines through the origin, representing Beta distributions all of approximately the same mean, but of slightly varying widths. A corresponding plot of is in Figure 17. step of MC 0 200 400 600 800 1000 1200 be ta 0 50 100 150 Figure 17: Markov Chain of . Note that the most accurate are again green, blue and cyan, because as seen in figure 15, there is not enough data to represent the almost negligible red and magenta. 44 m 0 1 2 3 4 5 6 7 8 9 10 r 0 2 4 6 8 10 12 14 16 Figure 18: The acceptable region of (mP r) is above the line in DFP basic item 8 with upper constraint on the Poisson rate of 1 per hour. As in ISO 20072 there are 8 environmental conditions that require content 0.95, and another 6 that require content 0.975, and there are 2 basic items in the DFP with target 0.99, a total number of 2× (4500× 8 + 13400× 6) = 232800 measurements needed. For DFP basic item 6 with lower constraint 0.93 we needed, pyr itym: 1. For content p = 0:95, testing each device 7 times, we needed to test a median of 20 devices, for a total of 140 measurements. 2. For content p = 0:975, testing each device 10 times, we needed to test a median of 21 devices, for a total of 210 measurements needed. As in ISO 20072 there are 8 environmental conditions that require content 0.95, and another 6 that require content 0.975, and there are 4 basic items in the DFP with target 0.99, a total number of 4× (140× 8 + 210× 6) = 9520 measurements needed. 6B6B4 doisson ruty We next consider the two DFP basic items with the Poisson rate constrained to be below 1 per hour. Superlative data will be generated from simulated devices with rate 0.1 per hour, as in section 3.1 above. We then assume a model in which each cluster in a mixture has rate parameter distributed Gamma(mP r), then find the posterior distribution on m, r given some data and a suitable prior. Therefore the parameters of the clusters with be k = (mkP rk). The acceptable region of (mP r) is plotted in figure 18. We choose a prior on (mP r) from the “ProGamma” distribution family, the conjugate distribution 45 m 1 2 3 4 5 6 7 8 9 10 r 1 2 3 4 5 6 7 8 9 10 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 19: Contour plot of the prior on (mP r). Looking at the acceptable region in figure 18, we can see that there is a reasonable amount of probability both inside and outside of the acceptable region. x 0 1 2 3 4 5 6 7 8 9 10 P( x) 0 0.5 1 1.5 2 2.5 3 3.5 m = 69, r = 14 m = 47, r = 27 m = 8, r = 23 m = 12, r = 11 m = 4, r = 1 Figure 20: Sample Gamma distributions drawn from the prior. They have various shapes and sizes. The values of (mP r) are specified in the top-right key. to the Gamma with respect to the joint parameters (mP r), whose probability density function is given by e (mP r | P PN0) ∝ ( rm Γ(m) )N0 mN0e− r e− m NmN00 P where , , and N0 are independent parameters, constrained by P PN0 S 0 so that the expression has finite integral. The values we take for P PN0 affect the shapes of the resulting Gamma distributions. To avoid prejudicing the outcome, we would like to get all intuitive types of Gamma distributions from the selected prior. We found the prior with = 0:03, = 0:1 and N0 = 0:01 satisfactory. See figure 19 for the contour plot and figure 20 for Gamma distributions drawn from the prior. In this protocol, we measure the responses of N devices, testing the nth device for Tn hours and 46 step of MC 0 500 1000 1500 2000 2500 3000 3500 qv ec 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 21: Example Markov chain sample trajectories of q when data is sampled from a true distribution with 5 mixture components (the dotted lines). Note that the blue and green clusters are virtually identical in all parameters (see also Figures 22 and 23), and it is therefore entirely reasonable for them to be interpreted as a single cluster in most samples from the chain. The red and magenta components are almost negligible, and there is insufficient data to determine whether some of the data points might belong to the red or magenta clusters. observing nn failures. We obtain two N -dimensional vectors, h and n. Since e (bPn | P ) = ∫ e (bPnP | P ) d = ∫ e (bPn | )e ( | P ) d our posterior probability distribution subsequently takes the following form: e (mP r | hPn) ∝ ( rm Γ(m) )N+N0 ( N∏ n=1 1 (r + Tn)m+nn )( N∏ n=1 Γ(m+ nn) ) mN0e− r e− m NmN00 This also tells us how to do the Gibbs Sampling step of the new element in the Markov Chain. Plots of example Markov chain sample trajectories along with some comments can be found in figures 21, 22 and 23. We considered only designs where the value of Tn was the same for all n, although exactly the same analysis would apply if one were to choose a variety of values for the various Tns beforehand, or, by a similar analysis to that in section 6.1.2 above, even if one were to decide which device to observe next based on the results obtained from the various devices so far. We considered various possible common values for the Tns, and chose the considered value which gave the smallest overall number of device-hours of observation needed. Reporting as usual the median of the sample sizes required in 20 runs, we found the following. 47 step of MC 0 500 1000 1500 2000 2500 3000 3500 m 0 50 100 150 200 250 Figure 22: Example Markov chain sample trajectories of m. Note that the blue and green clusters are virtually identical in all parameters, and it is therefore entirely reasonable for them to be interpreted as a single cluster in most samples from the chain. As we see in Figure 21, the red and magenta clusters are almost negligible in weight, and there is in consequence not enough data to be sure of their existence. step of MC 0 500 1000 1500 2000 2500 3000 3500 r 0 2 4 6 8 10 12 Figure 23: Example Markov chain sample trajectories of r. Note that the blue and green clusters are virtually identical in all parameters, and it is therefore entirely reasonable for them to be interpreted as a single cluster in most samples from the chain. As we see in Figure 21, the red and magenta clusters are almost negligible in weight, and there is in consequence not enough data to be sure of their existence. 48 For DFP basic item 8 with Poisson rate constrained to be below 1 per hour we obtained, pyr itym: 1. For content p = 0:95, observing each device for 0.1 hours, we needed 60 devices, for a total of 6 device-hours of observation. 2. For content p = 0:975, observing each device for 0.1 hours, we needed 110 devices, for a total of 11 device-hours of observation. As in ISO 20072 there are 8 environmental conditions that require content 0.95, and another 6 that require content 0.975, and there are 2 basic items in the DFP constraining a Poisson rate to be below 1 per hour, a total number of 2× (6× 8 + 11× 6) = 228 device-hours needed. 6C7 hummvry of zffort nzzyzy in thz Wvyzsivn vpprovxh We now have all the necessary information to assess how much testing needs to be done, in order to have a probability at least 95% that a supyrlutivy factory passes the whole test, that is passes all items. Note that a superlative factory is an almost perfect one, and in reality such a factory may be almost impossible to design. Therefore, in reality, these numbers may be higher — but equally, it is likely that for some items the factory will turn out to be better than superlative, while for others it is worse, and the Bayesian test will automatically increase the amount of testing only for those items that are worse than superlative, while it equally will reduce the amount of testing for those that turn out to be better than superlative. Summing the number of measurements needed for each category of items, we get that 232800 + 9520 + 1550 + 1626 = 245496 measurements are required plus 228 device-hours of observation. If we reckon that a robot could perform one measurement per minute, this gives a total of almost 180 days of robot time. This looks entirely feasible, in contrast with the frequentist result. To be even-handed, we make some comments about safety and cheating. The most important feature of the Bayesian approach is, in our opinion, the possibility of improving the quality control of the devices without massively increasing the required amount of testing. That is, a manufacturer may reasonably add several additional basic items to the DFP, thus making its device better and providing a safer product for its customers. It is of course also possiwlz to cheat in a Bayesian approach by censoring data just as in a frequentist approach. The question is whether the method provides an incentive to cheat or not. In section 7.2 below we argue that the incentives to cheat provided by the frequentist approach are not present in the Bayesian approach. 49 7 Compvrison of thz frzquzntist vny Wvyzsivn vpprovxhzs 7CF lorklovy rzquirzy We now compare the numbers of measurements needed when testing with the frequentist method according to ISO 20072 with the number needed with the Bayesian approach we proposed. We have also calculated the approximate time needed to test a factory. See the table in Figure 24 for a summary of these results. type of item cont. 0.95 cont. 0.975 no of items in DFP total measurements total time Probability parameter p = 0.99 B 4500 13400 2 232800 160 days Probability parameter p = 0.99 F 83141 168072 2 3347120 6.3 years Probability parameter p = 0.93 B 140 210 4 9520 7 days Probability parameter p = 0.93 F 12920 26128 4 1040512 2 years One-sided B 20 25 5 1550 1 day One-sided F 93 135 5 7770 5 days Two-sided B 37 41 3 1626 1 day Two-sided F 108 154 3 5364 4 days Total measurements B 245496 169 days Total measurements F 4400766 8.3 years Poisson rate lambda = 1 B 6 11 2 228 hours 10 days Poisson rate lambda = 1 F 843 1801 2 35100 hours 4 years Total time B 180 days Total time F 12.3 years Figure 24: Table comparing the numbers in the frequentist vs Bayesian methods. The total number of measurements was calculated considering that there are 8 items in ISO 20072 that require 0.95 content, and another 6 that require 0.975 content. The total time required was computed assuming a robot that does one measurement per minute is available. It is immediately obvious that the Bayesian method requires vastly fewer measurements, and overall approximately 25 times less robot-time, than the frequentist method. One of the facts which contributes to the enormous difference in workload required for the two methods is that • When testing with the Bayesian approach, the number of measurements needed grows as the sum of the numbers of measurements needed for each item in the DFP, each of which is 50 independent of the existence of other items. • On the other hand, when testing with the frequentist approach, increasing the number of items in the DFP mussivyly increases the number of measurements needed because of the need to increase the pass probability of individual items. 7CG Compvrison of inxzntivzs to xhzvt There are some ways in which the phenomenon of cheating is independent of the testing method, as the urge to cheat varies with personality, and it is possible to cheat in a Bayesian approach by censoring data just as in a frequentist approach. However, we think that the Bayesian method we described lacks the incentive to cheat observed with the frequentist approach for several reasons: 1. The amount of testing needed in the Bayesian approach ismuwh lowyr than what was needed in the frequentist method. Thus, it is feasible to do the testing honystly, without censoring or manipulating the data. 2. The Bayesian method accurately reports the probability, given the data observed so far, that a factory is compliant (or isn’t). Hence if a manufacturer tests a compliant factory, the design will pass with probability 1, while if a factory has not yet passed a test in progress, the manufacturer is continually aware of the probability that the factory is nonetheless compliant, and can make an informed decision on whether to continue testing. 3. Moreover, if the Bayesian method passes a factory, then for each item the probability that that factory is good is high (0.95 under the ISO 20072 conventions). Thus patients are protected under a Bayesian analysis just as under an honestly executed frequentist one — but in addition, they get the benefits that (a) Worthwhile treatments do actually become available to them, rather than be rejected by the minuscule pass probabilities of the frequentist approach, and (b) The probability that the manufacturer has cheated when “passing” their product is no longer close to 1. 7C3 Compvrison of thz inxzntivzs to proyuxz v high quvlity fvxtory We now address an argument frequently raised by some regulatory authorities as justification of the use of frequentist methods. That argument is that a low pass probability for e.g. a superlative factory should incentivize the manufacturer to only design perfect factories. However, there are in our view a number of difficulties with this reasoning: 1. It is usually impossible to design a perfect factory. 2. The manufacturer has no way of knowing whether or not his factory is perfect before embark- ing on testing it. If it fails, he has then no recourse; he cannot retest it from scratch, and in the frequentist paradigm he cannot extend the test to handle the imperfections that he now knows must be present. 51 3. The Bayesian approach we have described also provides incentives to design better factories, as the number of measurements that will be needed for any item will automatically rise as the factory gets closer to the specified constraint. Given that it is thus possible to design tests that both provide an incentive to design better factories, and which will eventually pass good factories, such an approach is surely preferable. 7CI eszuyoBWvyzsivn vpprovxhzs A “compromise” approach has also been described in [7], wherein such an approach is described as “Bayesian”. However, as seen in section 4.8 of [7], this approach actually wants to handicap a Bayesian approach by considering also the Bayesianly-irrelevant probability that a factory for which all items but one are perfect but one is exactly borderline will pass the Bayesian test. We have explained in section 5.4.2 above why we believe such probabilities to be irrelevant to the discussion. A Bayesian test as described in section 6 above will never satisfy restrictions on such a probability other than in rare special cases. Modification of a Bayesian test to satisfy such restrictions in fact turns it into a frequentist test, albeit one somewhat more complicated that the ones we have considered above, and in doing so sacrifices the most important properties of the Bayesian approach discussed above, and in particular sacrifices the probability of one of passing a good factory. 7C5 Disxussion of xhoixz of prior The favourite argument of frequentists and institutions that advocate frequentist methods against the Bayesian approach is that the results of the Bayesian method depend on “the choice of prior” (see section 3.9 of [8]). To illustrate this, we compare the median of the number of measurements needed when testing the Poisson rate in section 6.6.4 for a range of priors. We choose this DFP item because the data consists mainly of periods where no events are observed, and consequently the results are more influenced by the prior than in other cases. Here we will vary only the parameter in the prior, and will consider the case of the desired probability content being 0.975 and the number of hours of observation per device still being 0.1 . 1. As we have seen in section 6.6.4, with = 0:1 the prior puts a reasonable amount of probability both inside and outside the acceptable region; in fact the prior probability that a single cluster factory is good for that item is 0.004, and the MCMC runs showed a median of 110 devices are needed. 2. When we take = 0:05, the prior again puts a reasonable amount of probability both inside and outside the acceptable region, with the prior probability that the factory is good now being about 0.012, as seen in figure 25. Here we obtained a median of 104 devices needed. 3. When we take = 0:01, the prior probability that the factory is good is now 0.12, as seen in figure 26. Here we obtained a median of 70 devices needed. 4. When we take = 0:0001, the prior now puts more than 98% of the probability inside the acceptable region, as seen in figure 27. The prior is now totally prejudiced in favour of the factory being good for this item, and we need collect no data at all (since 0:98 S 0:95). 52 m 1 2 3 4 5 6 7 8 9 10 r 2 4 6 8 10 12 14 16 18 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 25: Prior on mP r when = 0:05. The prior probability that the factory is good for this item is about 0.012 . m 1 2 3 4 5 6 7 8 9 10 r 2 4 6 8 10 12 14 16 18 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 26: Prior on mP r when = 0:01. The prior probabilty that the factory is good for this item is about 0.12 . 5. On the other hand, when we take = 1, the prior probability that the factory is good is about 0.0004, as seen in figure 28. Here we obtained a median of 3000 devices. Therefore, from the above analysis, we can conclude that indeed the Bayesian method depends on the selected prior. However, the important point is that when choosing reasonable priors that put between 0.1% and 5% of the probability inside the acceptable region and are generally well spread out, the required number of measurements is almost constant. The big differences arise when the priors are prejudiced either towards accepting or rejecting the design, but in practice this should nyvyr happen, if the intention is to test honystly. Moreover, it is entirely feasible for a regulator and manufacturer to come to agreement on an appropriate prior for each problem before embarking on data collection or finalising the design of the test plan. 53 m 1 2 3 4 5 6 7 8 9 10 r 2 4 6 8 10 12 14 16 18 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 27: Prior on mP r when = 0:0001. Looking at figure 18, we see that this prior puts more than 98% of the probability inside the acceptable region, and is so prejudiced that the factory is good for this item that we need collect no data at all. m 1 2 3 4 5 6 7 8 9 10 r 2 4 6 8 10 12 14 16 18 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 28: Prior on mP r when = 1. The prior probability that the factory is good for this item is now 0.0004, so the prior is heavily prejudiced against the factory being good, and a lot of data is now needed before one can conclude otherwise. 54 M Commznts mvyz in rzvizfi wy othzrs We note that the Independent Institute, a not for profit, non-partisan, scholarly research organisa- tion, has assessed the actions of one regulator, the U.S. Food and Drug Administration (FDA), as follows [9]: Defining type I and type II errors by 1. “Type I errors occur when the FDA approves a drug that ends up being a net bad for society.” 2. “Type II errors occur when the FDA rejects a drug that would be a net good for society.” they find that the FDA cares more about type I errors mainly because they cause “major media attention, public concern, and congressional action”; their proposed testing procedures aim not to minimize the harm done by both types of errors, but only the harm done by type I error. This supports our concern that better testing protocol is needed, preferably one that deals with both error types. The current protocol, as stated by the FDA, does not take into account the harm done to people that suffered because although a goox treatment was produced, it failed a faulty testing procedure. A further comparison is made in [10] by a so-called “whistle-blower” Henry Miller: a type I error attracts much attention and harms both patients and designers. However, a type II error only harms patients and is less vulnerable to attention, because “doctors, journalists and patients are usually unaware that a drug has been delayed or suppressed and that it would have saved individuals” [9]. Another testimonial from Vincent Kleinfeld, a Washington, D.C., attorney who worked on drug approval affairs in [10], wrote “the courts in this area of the law tend to equate the Food and Drug Administration with God, motherhood, and country”. This status of “untouchable” achieved by institutions are another reason for a better protocol to be implemented, as there is no compensation for the defects of the current protocol. 55 9 Conxlusion In conclusion, our opinion is that in its current form, ISO 20072 is dangerous to patients because of both the frequentist protocol proposed, the way it encourages cheating when carrying out testing, and the risk of not passing good factories which would benefit patients. Instead, a sequential Bayesian protocol should be used. We note that there are many analogies to this situation in the area of testing new drugs and medical devices. In [11] we see for example that in order to get a new drug accredited, drug companies typ- ically run clinical trials with the intention of demonstrating a “statistically significant” benefit, but because frequentist statistical tests use data inefficiently, it is unnecessarily difficult to demonstrate statistical significance, and consequently: 1. It is in the drug company’s interest to compare their new drug to a suboptimal treatment such as a placebo, rather than to the best available treatment - this is not in the interests of the patients in the trial. 2. Unnecessarily large numbers of patients are subjected to trials. 3. It is likely that many good drugs are failing to pass their tests (because frequentist “fconfi- dence” values and p-values do not actually control the probabilities we care about). Examples are also given in [11] of drugs that achieve “statistical significance” in test comparisons with placebos but which are not actually good for patients. Unfortunately, many aspects of medicine are still dominated by classical frequentist statistical methods, so ISO 20072 is not an isolated example [11] of bad statistics driving organisations to behave in a way that is not in the public interest. 56 gzfzrznxzs [1] International Standards Organization, Geneva, Switzerland, “ISO 20072:2009: Aerosol drug delivery device design verification — requirements and test methods.” ettm7//ttt.fpl.lod/ fpl/cataildrb_abtafi.etj?cpkrj_bo:1.656. Retrieved on 28.08.2016. [2] International Pharmaceutical Aerosol Consortium on Regulation and Science (IPAC) and European Pharmaceutical Aerosol Group (EPAG), “Justification of the request for a neg- ative vote on ISO DIS 20072, ‘Aerosol drug delivery device design verification requirements and method’.” ettm7//fmacop.lod/appbtp/rmilaap/cljjbktp/Ok_ISO_/--4/_Nls_-4.mac, 2008. Retrieved on 28.08.2016. [3] Radford M. Neal, “Probabilistic inference using Markov Chain Monte Carlo Methods,” 1NNH, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto. [4] Radford M. Neal, “Slice Sampling,” Annvls of htvtistixs, vol. 31, no. 3, pp. 705–776, 2003. ettm7//molgbctbrcifa.lod/brcifa.alp/.-2323/13., Retrieved on 28.08.2016. [5] Yi Cao, “Munkres Assignment Algorithm.” ettmp7//ttt.jatetlohp.clj/jatia_cbktoai/ cfibbuceakdb//-0/5-jrkhobp-appfdkjbkt-aidloftej, 2008. Retrieved on 28.08.2016. [6] Harold W. Kuhn, “The Hungarian Method for the assignment problem,” cvvvl gzszvrxh aoB gistixs fuvrtzrly, vol. 2, pp. 83–97, 1955. ettm7//lkifkbif_oaov.tfibv.clj/alf/.-..--// kas.05---/-.-6/bmac, Retrieved on 28.08.2016. [7] U.S. Department of Health and Human Services, Food and Drug Administration, Center for Devices and Radiological Health, Office of Division of Biostatistics, Surveillance and Bio- metrics, Center for Biologics Evaluation and Research, “Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials.” ettm7//ttt.caa.dls/MbafcaiDbsfcbp/ DbsfcbObdriatflkakaDrfaakcb/DrfaakcbDlcrjbktp/rcj-4.-4/.etj, 2010. Retrieved on 28.08.2016. [8] Gerry Gray, PhD, Department of Health and Human Services, Food and Drug Administra- tion, Center for Devices and Radiological Health, Division of Biostatistics, Surveillance and Biometrics, “Potential advantages and disadvantages to the Bayesian approach.” ettm7//ttt. caa.dls/leojp/alchbtp/alchbtp/-3a-.6./-3a--.6.-tp----/-doav-sli..mac, 2006. Re- trieved on 28.08.2016. [9] The Independent Institute, “Glossary.” ettm7//ttt.caaobsfbt.lod/.._dilppaov.mem# boolo_tvmbp, 2016. Retrieved on 28.08.2016. [10] The Independent Institute, “Why the FDA has an incentive to delay the introduction of new drugs.” ettm7//ttt.caaobsfbt.lod/-3_fkcbktfsbp.mem, 2016. Retrieved on 28.08.2016. [11] Ben Goldacre, Bvy ehvrmv: Hofi yrug xompvnizs mislzvy yoxtors vny hvrm pvtiznts. Faber and Faber (USA), 2012. [12] G. E. Tauchen, “Maximum likelihood specification tests,” Journvl of Exonomztrixs, vol. 30, pp. 415–443, 1985. Lemma 1. 57 A lvys in fihixh thz wzhvviour of frzquzntist xon yznxz fvils to mvtxh thz intuitivz wzhvviour zxpzxtzy We give two theorems. The first deals with the simultaneous occurrence of a number of hypotheses, each of which is the alternative hypothesis to some null hypothesis. It shows that whereas with the intuitive under- standing of confidence the confidence of the intersection of multiple uncertain events is (usually much) smaller than the confidence of any of the individual events, the fconfidence of the intersection is always equal to the infimum of the fconfidence of the individual events - in other words it is far too large. The second deals with the complementary situation that we are considering the union of a number of hypotheses, each of which is the alternative hypothesis to some null hypothesis. The situation here, though not relevant to the body of the work in this paper, is even worse. Whereas the intuitive confidence that at least one of a number of events has occurred must be at least as large as that of any one of the events, the fconfidence that at least one of multiple events has occurred can be strictly smaller than the fconfidence that a particular one of the events has occurred. In other words we can be less fconfident that a larger event occurred than that a small subset of that event occurred - totally contrary to any reasonable intuition. In the following F (V) denotes the fconfidence that an event V has occurred, it being understood that some particular critical region evident from the context is in use. hhyorym AB1 (Intersection of events)B hupposz thvt (H0P)∈I is v fvmily of zvznts to wz trzvtzy vs null hypothzszsA thvt H0 = ⋃ ∈I H0PA vny thvt H1P vny H1 vrz rzspzxtivzly thz xomplzmznts of H0P vny H0A so thvt H1 = ⋂ ∈I H1P. azt X wz v xritixvl rzgion in somz yvtv spvxz for vll thz H0P (for zxvmplzA X might wz ∏ ∈I X if thz yvtv spvxz is thz proyuxt of onz inyzpznyznt yvtv spvxz for zvxh A or X might wz ⋂ ∈I X). ihzn F (H1) = F ( ⋂ ∈I H1P) = inf ∈I F (H1P). Note that there is no requirement for I to be finite or even countable. eroof. By definition, F (H1P) = 1− sup h∈H0H e (x ∈ X | h), while F (H1) = 1− sup h∈H0 e (x ∈ X | h). But since H0 = ⋃ ∈I H0P, it follows that 1 − F (H1) = sup h∈H0 e (x ∈ X | h) = sup ∈I sup h∈H0H e (x ∈ X | h) = sup ∈I (1− F (H1P)), whence the result. hhyorym ABF (Union of events)B ihzrz zxist zvznts H1P1 vny H1P2 suxh thvt thz fxon yznxz F (H1) thvt H1 = H1P1 ∪H1P2 hvs oxxurrzy is strixtly smvllzr thvn min ∈{1P2} F (H1P)A fihzrz H1P is thz vltzrnvtivz hypothzsis to (vny xomplzmznt of) H0PA vny xritixvl rzgions X vrz in zffzxt for tzsting H0P for  ∈ {1P 2}A so thvt X = X1 ∪ X2 is in zffzxt for tzsting H0 = H0P1 ∩H0P2. 58 Comment: Any reasonable intuitive understanding of fconfidence as confidence would require that F (H1) = F ( ⋃ ∈{1P2} H1P) ≥ max ∈{1P2} F (H1P). eroof. It suffices, of course, to simply provide an example. We take our hypothesis space to be a four point space, H = {h0P1P h1P1}×{h0P2P h1P2}, with the null hypotheses being H0P1 = {h0P1}×{h0P2P h1P2}, H0P2 = {h0P1P h1P1}×{h0P2}, so that H0 = {(h0P1P h0P2)} and H1 = {(h0P1P h1P2)P (h1P1P h0P2)P (h1P1P h1P2)}. Similarly we take our dataspace to be a four point space, m = {x0P1P x1P1} × {x0P2P x1P2}, with the two factors independent, e (x0P1 | h) = 0:95 for all h ∈ H0P1, and e (x0P2 | h) = 0:95 for all h ∈ H0P2. (Note that the frequentist paradigm bizarrely puts no requirement on us to specify e (x | h) for h = (h1P1P h1P2).) Thus for example e ((x1P1P x1P2) | (h0P1P h0P2)) = 0:052 = 0:0025. Our critical regions will be X1 = {x1P1} × {x0P2P x1P2} and X2 = {x0P1P x1P1} × {x1P2}, so that X = {(x0P1P x1P2)P (x1P1P x0P2)P (x1P1P x1P2)}. Then e (x ∈ X1 | h) = 0:05 for all h ∈ H0P1, and similarly e (x ∈ X2 | h) = 0:05 for all h ∈ H0P2, so that X1 and X2 are respectively 95% critical regions for H0P1 and H0P2. However, somewhat unexpectedly, for h = (h0P1P h0P2), we have e (x ∈ X | h) = 1− 0:952 = 0:0975, so that X is only a 90.25% critical region for H1, and 0:9025 = F (H1) < min ∈{1P2} F (H1P) = 0:95 even though H1 ⊃ H1P for all . 59 W eroof thvt Wvyzsivn pvss prowvwility of v gooy fvxtory vpB provxhzs F vs thz vmount of yvtv inxrzvszs We consider a single item only, noting that since the number of items is finite showing the desired result for a single item suffices to prove it for all 224 items. Our proof will use the uniform strong law of large numbers [12], the proof of which carries over to the following more general statement with negligible modification and without invoking the separability condition subsumed in the assumption of regularity in [12]: If • m is a measure space; • K is a compact metric space; • ϕ is a function from m ×K to R; • for all k ∈ K, x 7→ ϕ(xP k) is measurable on m; • there exists an integrable function b : m → R such that for all k ∈ K and x ∈ m, |ϕ(xP k)| ≤ b(x); • for all k ∈ K, for almost all x ∈ m, ϕ is continuous in k at k; • x1P x2P : : : is a sequence of independent identically distributed random variables taking values in m; then • for all k ∈ K, the expectation of ϕ(x1P k) exists and is finite, and • with probability 1, ( 1N N∑ n=1 ϕ(xnP k)→ Eϕ(x1P k) uniformly in k as N →∞). We assume throughout that H = H0 ∪H1, H0 ∩H1 = ∅. hhyorym VB1B hupposz thvt: 1. m is v mzvsurz spvxzP 2. H is v loxvlly xompvxt mztrix spvxz xvrrying v mzvsurz yznotzy  on its Borzl sztsA xompvxt suwszts hvving nitz mzvsurzA vny nonBzmpty opzn szts of H hvving positivz mzvsurzP H. H0 is xompvxtP 4. hP x vrz rvnyom vvrivwlzs on thz prowvwility spvxz Ω tvking vvluzs in HPm rzspzxtivzly posB szssing rzspzxtivzly v yznsity funxtion vny v xonyitionvl yznsity givzn hP J. for vlmost vll x ∈ mA e (x|h) is v xontinuous funxtion of hP 6. e (x|h) is wounyzy on m loxvlly uniformly in hA fihilz e (h) is loxvlly wounyzyP 60 L. for vll h′ ∈ HA (e (x|h1) a.s.= e (x|h′) =⇒ h′ = h1)P 8. thzrz zxists vn opzn suwszt U1 ∋ h1 of H1 suxh thvt e(h) is wounyzy vfivy from zzro on U1P vny N. thzrz zxists v mzvsurvwlz vny vlmost surzly positivz funxtion a : m → R vny v nzighwourhooy U2 of h1 suxh thvt woth Ex|h1 log a(x) zxists vny is nitz vny vlso for vll x ∈ m vny h ∈ H0∪U2A e (x|h) ≥ a(x). Unyzr thzsz xonyitions if x1P x2P : : : P xN vrz inyzpznyzntly yistriwutzy vxxorying to e (x|h1)A thzn e (h ∈ H1|x1P : : : P xN )→ 1 fiBvlmost surzly vs N →∞. Before proving this result, we make a few remarks on the conditions. The basic conditions 1 and 2 on H and m are standard, and are fulfilled in particular by any finite-dimensional real vector spaces. They are also fulfilled by discrete spaces with H0 finite, in which case the continuity and boundedness conditions 5 and 6 become vacuous. The local boundedness conditions 6 can usually be achieved by continuous transformations of H or m if they do not already apply. The compactness condition 3 on H0 can usually be achieved, if does not already apply, either by adjoining a point at infinity or by deleting part of H0 that is extremely unlikely. For example, if h is the standard deviation of a Gaussian likelihood, and H0 is [1P∞), then replacing H by (0P 10100] and H0 by [1P 10 100] achieves compactness of H0 without materially altering the problem. However, note that the result specifically does not cover the case that h1 is on the boundary of H1, for which the theorem does not hold. Without the continuity condition 5 it is very hard to control the union of the uncountable number of zero probability subsets of the underlying probability space Ω that arise as a result of the strong law of large numbers. Condition 7 is required because otherwise the likelihood provides insufficient information to dis- tinguish h1 from other possibilities, while condition 8 is needed to ensure that the prior has not prejudicially excluded the right answer. The remaining condition 9 can usually be achieved if not already present by replacing e (x|h) by (1− )e (x|h) + b(x), where b is a bounded probability density onm such that ∫m (b(x) + e (x|h1)) log b(x)dx exists and is finite and S 0 is a small positive constant. This is intuitively bizarre, as doing so makes the resulting random variable x contain lzss information about h. We therefore feel that some more able mathematician might be able to manage without this condition. eroof. In the following we assume that fiN = (x1P x2P : : : P xN ), that fi means fiN , and that x and all xn are distributed according to e (x|h1). Thus for a fixed value h ∈ H, e (x|h) is a random variable calculated by letting x be distributed according to e (x|h1) then calculating e (x|h). We exclude the null subset of Ω on which e (x|h1) = 0. 61 For any h ∈ H, let us define q(h) = Ex|h1 loge (x|h). From Jensen’s inequality we note that for any h ∈ H \ {h1} the given conditions then imply q(h1) S q(h), with strict inequality. Conditions 5, 6, and 9, and the dominated convergence theorem give us that for some neighbourhood U of h1 in H1, q is continuous on H0 ∪ U . Now the compactness of H0 implies that q attains a maximum b on H0, and we then have that q(h1) S b . Indeed, for some compact neighbourhood k ⊆ U ⊆ H1 of h1 on which e (x|h) is uniformly bounded and for some  S 0 and ϵ S 0, we have e (h) S ϵ and q(h) S b +  and e (x|h) S 0 on k a.s. (by conditions 5 and 9). We note that then on k we also have e (fi|h) S 0 a.s. for all N . Applying the uniform strong law of large numbers with K equal to H0 or k and ϕ(xP k) equal to loge (x|k), we find that almost surely, uniformly in h ∈ k and h0 ∈ H0, e (fi|h0)1=N → eq(h0) and e (fi|h)1=N → eq(h), so (P (x|h0)P (x|h) )1=N → eq(h0)−q(h) < e− < 1, and hence almost surely (P (x|h0)P (x|h) → 0 uniformly in h0 and h as N →∞). We now fix any ! ∈ Ω for which this uniform convergence occurs. It suffices now to show that for that !, ∫ H0 P (h0)P (x|h0)dh0∫ V P (h)P (x|h)dh → 0. Now e (h0) and supx∈m e (x|h0) are locally uniformly bounded on the compact set H0 and therefore uniformly bounded on H0. By rescaling the measures on H and m we may assume that that bound is 1, in which case e (fi|h0) is likewise bounded by 1 for all N . Since also e (h) is bounded away from zero on k , it even suffices to show that ∫ H0 P (x|h0)dh0∫ V P (x|h)dh → 0. We already know that the integrand in the denominator is never zero. Now fix h ∈ k . Since the integrand in the numerator is uniformly bounded and (H0) is finite, by the dominated convergence theorem we have ∫ H0 P (x|h0)dh0 P (x|h) → 0. Let FN = ∫ H0 e (fi|h0)dh0. Then FN=e (fi|h) → 0, so e (fi|h)=FN → +∞. But F−1N ∫ k e (fi|h)dh ≥ ∫ k infN ′≥N (F −1 N ′ e (fiN ′ |h))dh → +∞ by the monotone convergence theorem. Therefore ∫ V P (x|h)dh∫ H0 P (x|h0)dh0 → +∞, and hence ∫ H0 P (x|h0)dh0∫ V P (x|h)dh → 0 as N →∞. Thus almost surely ∫ H0 P (h0)P (x|h0)dh0∫ V P (h)P (x|h)dh → 0, and hence e (h ∈ H0|fi)→ 0, so e (h ∈ H1|fi)→ 1. 62 C Corrzxt vvluzs of k to rzplvxz thosz in Ihd GEE7G tvwlz DF We have computed the correct values for the two-sided tolerance factors k in the tables in figures 29, 30, 31 and 32. 63 N p = 0.75 p = 0.9 p = 0.95 p = 0.975 p = 0.99 p = 0.995 p = 0.999 2 11.759 20.571 26.249 31.247 37.083 41.066 49.295 3 3.805 6.159 7.66 8.996 10.559 11.632 13.869 4 2.619 4.164 5.146 6.02 7.045 7.752 9.223 5 2.151 3.409 4.205 4.913 5.744 6.316 7.507 6 1.904 3.006 3.707 4.33 5.062 5.565 6.614 7 1.775 2.757 3.402 3.973 4.646 5.107 6.068 8 1.695 2.584 3.187 3.724 4.353 4.787 5.688 9 1.637 2.464 3.032 3.543 4.144 4.557 5.417 10 1.592 2.377 2.914 3.404 3.982 4.379 5.205 11 1.557 2.316 2.822 3.292 3.852 4.237 5.037 12 1.528 2.268 2.752 3.204 3.748 4.123 4.903 13 1.504 2.228 2.696 3.131 3.66 4.025 4.787 14 1.483 2.194 2.653 3.073 3.587 3.946 4.693 15 1.466 2.166 2.616 3.023 3.524 3.875 4.61 16 1.45 2.14 2.584 2.982 3.471 3.814 4.536 17 1.437 2.117 2.555 2.948 3.425 3.762 4.473 18 1.425 2.098 2.531 2.919 3.385 3.716 4.416 19 1.415 2.08 2.509 2.893 3.352 3.677 4.367 20 1.405 2.065 2.489 2.869 3.323 3.641 4.321 21 1.396 2.05 2.47 2.847 3.296 3.609 4.279 22 1.388 2.037 2.454 2.827 3.273 3.582 4.243 23 1.381 2.025 2.438 2.809 3.252 3.558 4.21 24 1.374 2.014 2.424 2.793 3.232 3.536 4.18 25 1.368 2.004 2.412 2.778 3.214 3.516 4.153 26 1.362 1.994 2.399 2.763 3.196 3.496 4.128 27 1.356 1.984 2.388 2.749 3.18 3.478 4.104 28 1.352 1.976 2.378 2.737 3.165 3.462 4.084 29 1.347 1.969 2.368 2.726 3.152 3.447 4.065 30 1.343 1.962 2.359 2.715 3.139 3.433 4.048 31 1.339 1.955 2.35 2.705 3.127 3.419 4.031 32 1.335 1.948 2.342 2.695 3.115 3.406 4.016 33 1.331 1.942 2.335 2.686 3.105 3.394 4.001 34 1.328 1.937 2.327 2.677 3.094 3.383 3.988 35 1.324 1.931 2.32 2.669 3.085 3.372 3.974 36 1.321 1.926 2.314 2.661 3.075 3.362 3.962 37 1.318 1.921 2.308 2.654 3.067 3.352 3.951 38 1.315 1.916 2.302 2.647 3.058 3.343 3.939 39 1.313 1.912 2.296 2.64 3.05 3.334 3.929 40 1.31 1.908 2.291 2.634 3.043 3.326 3.918 41 1.307 1.903 2.285 2.628 3.035 3.317 3.908 42 1.305 1.899 2.281 2.622 3.029 3.31 3.899 43 1.303 1.896 2.276 2.617 3.022 3.303 3.891 44 1.301 1.892 2.272 2.611 3.016 3.296 3.882 45 1.299 1.889 2.267 2.606 3.009 3.288 3.873 46 1.296 1.885 2.263 2.601 3.003 3.282 3.865 47 1.294 1.882 2.259 2.596 2.998 3.276 3.858 48 1.293 1.879 2.255 2.592 2.992 3.27 3.85 49 1.291 1.876 2.251 2.587 2.987 3.264 3.843 50 1.289 1.873 2.248 2.583 2.982 3.258 3.837 Figure 29: Part 1 of corrected k table. 64 N p = 0.75 p = 0.9 p = 0.95 p = 0.975 p = 0.99 p = 0.995 p = 0.999 51 1.288 1.871 2.244 2.579 2.977 3.253 3.83 52 1.286 1.868 2.241 2.574 2.972 3.247 3.823 53 1.284 1.865 2.237 2.571 2.967 3.242 3.817 54 1.283 1.863 2.234 2.567 2.963 3.237 3.811 55 1.281 1.86 2.231 2.563 2.958 3.232 3.805 56 1.28 1.858 2.228 2.56 2.955 3.228 3.8 57 1.278 1.856 2.226 2.557 2.951 3.223 3.794 58 1.277 1.854 2.223 2.553 2.947 3.219 3.789 59 1.276 1.851 2.22 2.55 2.943 3.215 3.784 60 1.275 1.849 2.217 2.547 2.939 3.21 3.779 61 1.273 1.847 2.215 2.544 2.936 3.207 3.774 62 1.272 1.845 2.212 2.541 2.932 3.203 3.769 63 1.271 1.843 2.21 2.538 2.929 3.199 3.765 64 1.27 1.841 2.207 2.535 2.925 3.195 3.76 65 1.269 1.84 2.205 2.532 2.922 3.191 3.755 66 1.268 1.838 2.203 2.53 2.919 3.188 3.752 67 1.267 1.836 2.201 2.527 2.916 3.184 3.747 68 1.266 1.835 2.199 2.525 2.913 3.181 3.744 69 1.265 1.833 2.197 2.522 2.91 3.178 3.739 70 1.264 1.831 2.195 2.52 2.907 3.175 3.736 71 1.263 1.83 2.193 2.517 2.904 3.172 3.732 72 1.262 1.828 2.191 2.515 2.901 3.169 3.728 73 1.261 1.827 2.189 2.513 2.899 3.166 3.725 74 1.26 1.825 2.187 2.511 2.896 3.163 3.721 75 1.259 1.824 2.185 2.508 2.893 3.16 3.718 76 1.258 1.822 2.184 2.507 2.891 3.158 3.715 77 1.257 1.821 2.182 2.505 2.889 3.155 3.711 78 1.257 1.82 2.18 2.503 2.886 3.152 3.708 79 1.256 1.818 2.179 2.501 2.884 3.15 3.705 80 1.255 1.817 2.177 2.499 2.882 3.147 3.702 81 1.254 1.816 2.175 2.497 2.88 3.145 3.699 82 1.253 1.815 2.174 2.495 2.877 3.142 3.696 83 1.253 1.813 2.172 2.493 2.875 3.139 3.693 84 1.252 1.812 2.171 2.492 2.873 3.138 3.69 85 1.251 1.811 2.169 2.49 2.871 3.135 3.687 86 1.251 1.81 2.168 2.488 2.869 3.133 3.685 87 1.25 1.809 2.167 2.487 2.867 3.131 3.682 88 1.249 1.808 2.165 2.485 2.865 3.128 3.679 89 1.249 1.807 2.164 2.483 2.864 3.127 3.677 90 1.248 1.805 2.162 2.481 2.861 3.124 3.674 91 1.247 1.805 2.161 2.48 2.86 3.122 3.672 92 1.247 1.804 2.16 2.478 2.858 3.12 3.669 93 1.246 1.802 2.158 2.477 2.856 3.118 3.667 94 1.246 1.802 2.157 2.476 2.854 3.116 3.665 95 1.245 1.801 2.156 2.474 2.853 3.114 3.663 96 1.244 1.8 2.155 2.473 2.851 3.112 3.66 97 1.244 1.799 2.154 2.471 2.849 3.111 3.658 98 1.243 1.798 2.153 2.47 2.848 3.109 3.656 99 1.243 1.797 2.151 2.469 2.846 3.107 3.654 Figure 30: Part 2 of corrected k table. 65 N p = 0.75 p = 0.9 p = 0.95 p = 0.975 p = 0.99 p = 0.995 p = 0.999 100 1.242 1.796 2.151 2.468 2.845 3.106 3.652 102 1.241 1.794 2.148 2.465 2.841 3.102 3.648 104 1.24 1.793 2.146 2.463 2.839 3.099 3.644 106 1.239 1.791 2.144 2.46 2.836 3.096 3.64 108 1.238 1.79 2.142 2.458 2.833 3.093 3.636 110 1.237 1.788 2.14 2.456 2.83 3.09 3.633 112 1.237 1.787 2.138 2.453 2.828 3.087 3.629 114 1.236 1.785 2.137 2.451 2.825 3.084 3.626 116 1.235 1.784 2.135 2.449 2.823 3.081 3.622 118 1.234 1.782 2.133 2.447 2.82 3.079 3.619 120 1.233 1.781 2.131 2.445 2.818 3.076 3.616 122 1.232 1.78 2.13 2.443 2.816 3.074 3.613 124 1.232 1.779 2.128 2.441 2.813 3.071 3.61 126 1.231 1.777 2.127 2.439 2.811 3.068 3.607 128 1.23 1.776 2.125 2.438 2.809 3.066 3.604 130 1.23 1.775 2.124 2.436 2.807 3.064 3.602 132 1.229 1.774 2.123 2.434 2.805 3.062 3.599 134 1.228 1.773 2.121 2.433 2.803 3.06 3.596 136 1.227 1.772 2.12 2.431 2.801 3.057 3.593 138 1.227 1.771 2.118 2.429 2.799 3.055 3.591 140 1.226 1.77 2.117 2.428 2.798 3.054 3.589 142 1.226 1.769 2.116 2.426 2.796 3.051 3.586 144 1.225 1.768 2.115 2.425 2.794 3.05 3.584 146 1.224 1.767 2.113 2.423 2.792 3.047 3.581 148 1.224 1.766 2.112 2.422 2.791 3.046 3.58 150 1.223 1.765 2.111 2.421 2.789 3.044 3.578 152 1.223 1.764 2.11 2.42 2.788 3.042 3.575 154 1.222 1.763 2.109 2.418 2.786 3.041 3.573 156 1.222 1.763 2.108 2.417 2.785 3.039 3.571 158 1.221 1.762 2.107 2.416 2.783 3.037 3.569 160 1.221 1.761 2.106 2.415 2.782 3.036 3.567 162 1.22 1.76 2.105 2.413 2.78 3.034 3.565 164 1.22 1.759 2.104 2.412 2.779 3.033 3.564 166 1.219 1.758 2.103 2.411 2.778 3.031 3.562 168 1.219 1.758 2.102 2.41 2.776 3.03 3.56 170 1.218 1.757 2.101 2.409 2.775 3.028 3.558 172 1.218 1.756 2.1 2.408 2.774 3.027 3.556 174 1.218 1.756 2.099 2.407 2.772 3.025 3.555 176 1.217 1.755 2.098 2.405 2.771 3.024 3.553 178 1.217 1.754 2.097 2.405 2.77 3.022 3.551 180 1.216 1.753 2.097 2.404 2.769 3.021 3.55 185 1.215 1.752 2.094 2.401 2.766 3.018 3.546 190 1.214 1.75 2.093 2.399 2.763 3.015 3.542 195 1.214 1.749 2.091 2.397 2.76 3.012 3.539 200 1.213 1.747 2.089 2.394 2.758 3.009 3.535 205 1.212 1.746 2.087 2.393 2.756 3.007 3.532 210 1.211 1.745 2.086 2.39 2.753 3.004 3.529 215 1.21 1.743 2.084 2.388 2.751 3.001 3.525 220 1.209 1.742 2.082 2.387 2.748 2.999 3.522 Figure 31: Part 3 of corrected k table. 66 N p = 0.75 p = 0.9 p = 0.95 p = 0.975 p = 0.99 p = 0.995 p = 0.999 225 1.209 1.741 2.081 2.385 2.746 2.996 3.52 230 1.208 1.74 2.08 2.383 2.744 2.994 3.517 235 1.207 1.739 2.078 2.381 2.742 2.992 3.514 240 1.207 1.738 2.077 2.38 2.74 2.99 3.512 245 1.206 1.737 2.075 2.378 2.739 2.988 3.509 250 1.205 1.735 2.074 2.377 2.737 2.986 3.507 255 1.205 1.735 2.073 2.375 2.735 2.984 3.505 260 1.204 1.734 2.072 2.374 2.733 2.982 3.502 265 1.204 1.733 2.071 2.373 2.732 2.98 3.5 270 1.203 1.732 2.069 2.371 2.73 2.978 3.498 275 1.203 1.731 2.068 2.37 2.729 2.977 3.496 280 1.202 1.73 2.067 2.369 2.727 2.975 3.494 285 1.202 1.729 2.066 2.368 2.726 2.974 3.492 290 1.201 1.729 2.065 2.366 2.724 2.972 3.49 295 1.201 1.728 2.064 2.365 2.723 2.971 3.488 300 1.2 1.727 2.063 2.364 2.722 2.969 3.487 310 1.199 1.725 2.061 2.362 2.719 2.966 3.483 320 1.199 1.724 2.06 2.36 2.717 2.963 3.48 330 1.198 1.723 2.058 2.358 2.714 2.961 3.477 340 1.197 1.722 2.057 2.356 2.712 2.958 3.474 350 1.196 1.72 2.055 2.354 2.71 2.956 3.471 360 1.196 1.719 2.054 2.353 2.708 2.954 3.468 370 1.195 1.718 2.052 2.351 2.706 2.952 3.465 380 1.194 1.717 2.051 2.35 2.704 2.95 3.463 390 1.194 1.716 2.05 2.348 2.703 2.948 3.461 400 1.193 1.715 2.049 2.347 2.701 2.946 3.458 425 1.192 1.713 2.046 2.343 2.697 2.941 3.453 450 1.191 1.711 2.043 2.34 2.693 2.937 3.448 475 1.189 1.709 2.041 2.337 2.69 2.934 3.444 500 1.188 1.707 2.039 2.335 2.687 2.93 3.44 525 1.187 1.706 2.037 2.332 2.684 2.927 3.436 550 1.187 1.704 2.035 2.33 2.681 2.924 3.432 575 1.186 1.703 2.033 2.328 2.679 2.922 3.429 600 1.185 1.702 2.031 2.326 2.677 2.919 3.426 625 1.184 1.7 2.03 2.324 2.674 2.916 3.423 650 1.183 1.699 2.028 2.323 2.672 2.914 3.42 700 1.182 1.697 2.026 2.32 2.669 2.91 3.415 750 1.181 1.695 2.024 2.317 2.665 2.907 3.411 800 1.18 1.694 2.021 2.314 2.662 2.903 3.407 850 1.179 1.692 2.02 2.312 2.66 2.9 3.403 900 1.178 1.691 2.018 2.31 2.657 2.898 3.4 950 1.177 1.689 2.016 2.308 2.655 2.895 3.397 1000 1.177 1.688 2.015 2.306 2.653 2.893 3.394 1500 1.172 1.68 2.004 2.294 2.638 2.876 3.374 Figure 32: Part 4 of corrected k table. 67