Experimental research into teaching innovations: responding to methodological and ethical challenges

Accepted for publication in Studies in Science Education

Keith S. Taber
Faculty of Education, University of Cambridge
kst24@cam.ac.uk

Submitted: January 2018
Revised: October 2018; May 2019
Accepted: August 2019

Abstract

Experimental studies are often employed to test the effectiveness of teaching innovations such as new pedagogy, curriculum, or learning resources. This article offers guidance on good practice in developing research designs, and in drawing conclusions from published reports. Random control trials potentially support the use of statistical inference, but face a number of potential threats to validity. Research in educational contexts often employs quasi-experiments or natural experiments rather than true experiments, and these types of designs raise additional questions about the equivalence between experimental and control groups and the potential influence of confounding variables. Where it is impractical for experimental studies to employ samples that fully reflect diverse populations, generalisation is limited. Series of small-scale replication studies may be useful here, especially if these are conceptualised as being akin to multiple case studies, and complemented by qualitative studies. Control conditions for experimental studies need to be carefully selected to provide the most appropriate test for a particular intervention, and considering the interests of all participants. Control groups in studies that replicate innovations that have been widely shown to be effective in other settings should experience teaching conditions that reflect good practice and meet expected teaching standards in the research context.
Key words: educational experiments; random control trials; control conditions; teaching interventions; replication; ethical comparison conditions

Introduction

It is common for educational innovations, such as teaching approaches, new curricula, or new learning resources, to be evaluated by an experiment where learning gains or other desired outcomes are compared between an experimental condition involving the innovative experimental ‘treatment’ and some comparison condition where the treatment being evaluated is absent. A small selection of published studies of this kind is listed in Table 1 to give a sense of the potential range of research foci. Such experimental approaches can be very powerful, although there may sometimes be a range of alternative explanations for research outcomes apart from the superiority, or otherwise, of the innovation being tested.

[Table 1 about here]

Table 1: A sample of published experimental studies testing teaching innovations

The present article offers a thematic review of some key issues and challenges that arise in the design and interpretation of experimental studies in education, drawing upon selected illustrative examples of published studies. It is intended that this review will be useful both as guidance for those looking to undertake experimental studies of teaching innovations, and also for those seeking to be informed by reading research reports of such studies. The article considers the particular practical challenges of carrying out experimental studies in education. This analysis highlights some inherent limitations in many small-scale experimental studies which cannot be assumed to generalise to other contexts. The article considers notions of generalisability and replication both to argue for how such studies can best be understood to contribute to our understanding of teaching and learning and to suggest how individual studies can best be designed to add usefully to the literature.
Particular attention is given to the selection of the most informative ‘control’ conditions with which experimental treatments may be compared. The article suggests guidelines for best practice in establishing control conditions for studies that will be both ethical and informative.

The use of random control trials in education

Teaching is a complex and challenging process, and a core focus of educational research is in informing effective teaching (Pring, 2000). Such research draws upon a wide range of theoretical perspectives, and adopts a spread of different methodologies. Different studies address quite different research questions, and so different methods (collecting and analysing different kinds of data) are appropriate in different studies. As the U.S. National Research Council’s Committee on Scientific Principles for Educational Research (2002, p. 3) has noted, “methods can only be judged in terms of their appropriateness and effectiveness in addressing a particular research question” and so a “wide variety of legitimate scientific designs are available for educational research” (p. 6). From this perspective, experimental designs are very suitable for some educational studies, but are not indicated for others (Taber, 2014b). Particular research techniques have specific requirements, without which they are not strictly valid, and a research design that fails to meet the prerequisite conditions of its component techniques may not support robust conclusions. In this article the challenges of undertaking informative experimental research are discussed. Inevitably, then, this review emphasises the limitations of experimental work, and the practical issues that arise in designing valid studies and generalising from them.
This is not intended to suggest that such studies do not make an important contribution; rather, the review offers guidance for evaluating such studies, and, indeed, for considering when experimental research can be productively complemented by other forms of enquiry.

Experimental research and units of analysis

The adoption of an experimental approach is intended to avoid falsely inferring that a treatment brings about an outcome, by employing the most appropriate comparison conditions. An important term used in discussing experimental research is ‘unit of analysis’. An experiment may, for example, be comparing outcomes between different learners, different classes, different year groups, or different schools (see Table 1 for some examples). It is important at the outset of an experimental study to clarify what the unit of analysis is, and this should be explicit in research reports so that readers are aware what is being compared. A random control trial (RCT) is an experiment where the units of analysis are randomly assigned to different conditions, and statistical methods are used to determine whether any overall difference in the measured outcomes in those conditions is (probably) due to the intervention. Statistics can only indicate how likely it is that a measured result would occur by chance (as randomisation of units of analysis to different treatments can only make uneven group composition unlikely, not impossible). The usual convention is that a result is statistically significant when its probability (p) of occurring by chance is less than 5 percent (i.e., p<0.05). The precise statistical test(s) chosen depend upon the research question(s).
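The logic of random assignment and of a chance-based significance judgement can be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions: the function names (randomise, permutation_p_value), the scores, and the choice of a permutation test are all hypothetical and are not taken from any study discussed here. A permutation test is used simply because it makes the ‘how likely by chance’ reasoning explicit; a real study would select its test to suit its research question and unit of analysis.

```python
import random
from statistics import mean


def randomise(units, seed):
    """Randomly split units of analysis into two conditions of (near) equal size.

    Reporting the seed and procedure is one way of documenting how the
    randomisation was carried out.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(treatment, control, n_resamples=10000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    The condition labels are repeatedly reshuffled; the p-value is the
    proportion of reshuffles whose mean difference is at least as large
    as the difference actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treatment]) - mean(pooled[n_treatment:]))
        if diff >= observed:
            hits += 1
    return hits / n_resamples


if __name__ == "__main__":
    # Hypothetical class identifiers and invented outcome scores.
    intervention, comparison = randomise(["A", "B", "C", "D", "E", "F"], seed=2019)
    print(intervention, comparison)
    p = permutation_p_value([62, 70, 68, 75, 71], [60, 58, 66, 63, 61])
    print("significant at the 5 percent level" if p < 0.05 else "not significant")
```

Even when such a test returns p<0.05, it only indicates that the observed difference would be unlikely under chance relabelling; it cannot show that the groups were equivalent in composition or that the innovation itself produced the difference.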
A null hypothesis (that there is no difference between the treatments, which is refuted by a finding that either of the treatments is more effective) is not simply the inverse of the hypothesis that the experimental treatment will be more effective, and researchers should set out the specific question to be tested before designing the research. A RCT is referred to as a ‘true experiment’ because there is randomisation of the ‘units of analysis’ (people, classes, schools, etc.) to conditions. Ben Goldacre, in a position paper on using research evidence in schools that was commissioned by the UK Department for Education, offers a caricature of this type of study:

Where they are feasible, randomised trials are generally the most reliable tool we have for finding out which of two interventions works best. We simply take a group of children, or schools…; we split them into two groups at random; we give one intervention to one group, and the other intervention to the other group; and then we measure how each group is doing, to see if one intervention achieved its supposed outcome any better (Goldacre, 2013, p. 8)

The ‘where feasible’ proviso here is important, and a number of potential challenges in undertaking this kind of study are discussed in this article. RCT are sometimes difficult to arrange in education and other social contexts. ‘Simply’ taking a sample of children or schools and splitting them into two groups at random often raises practical difficulties - and later in this article studies that do not meet the requirements of being a ‘true experiment’ (such as most of those in Table 1) are discussed. Randomisation cannot ensure equivalence between groups (even if it makes any imbalance just as likely to advantage either condition) so “while a substantial imbalance is unlikely to occur in a very large trial, small trials may well be subject to sufficient differences between groups to affect the overall result of the trial” (Moore, Graham, & Diamond, 2003, p.
683). Researchers therefore sometimes seek to classify units (e.g., schools) in a sample into similar groupings and randomise from each of these clusters or ‘blocks’ rather than the complete pool (Moore et al., 2003; Ruthven et al., 2016). This so-called randomised block design requires both identifying what characteristics are pertinent to judging similarity in a particular study (e.g., school size?; location?; curriculum?; selectivity of intake?; gender / ethnic / socio-economic composition of pupils?; etc.) and having accurate measurements of these qualities. Research reports from small-scale studies (such as those comparing outcomes in two classes, see examples in Table 1) rarely inform readers how the randomisation was achieved, and it has been reported that authors sometimes seem unable to provide such information when asked (by journal editors, for example). It has therefore been recommended that the technique for making a random selection should be briefly reported in methodology sections of reports along with other details of techniques used in the study (Taber, 2013c). If the units of analysis are schools, it may be difficult to enrol a large enough number of schools into the sample for the statistical methods to be used - especially in those national contexts that rely on schools responding to invitations to volunteer (this is less of a problem when research access is granted at regional/ district or state level). Ruthven and colleagues (Ruthven et al., 2016) report a project (Effecting Principled Improvement in STEM Education - ‘epiSTEMe’) undertaken in England. The project team were based at a prestigious university that also had extensive and long-standing networks with schools in its region. 
The research was part of an initiative (the Targeted Initiative on Science and Mathematics Education) funded by a national research funding agency (the Economic and Social Research Council) in partnership with the Gatsby Charitable Foundation, the Institute of Physics and the Association for Science Education. Despite these indicators of status, it proved difficult to recruit schools at the level hoped for:

The intention was to recruit 30 schools to participate, together providing 60 teachers/ classes in each [of science and mathematics], so as to yield a structured sample of sufficient size to afford a hierarchical analysis of adequate statistical power. …In particular, while the original stipulation was that schools should nominate two science teachers and two mathematics teachers, it became clear that insisting on this would result in far too few schools participating in the trial. Consequently, both the two-subject and teacher-pair requirements were relaxed. … This yielded 25 participating schools: 12 in the intervention group and 13 in the control. Thus, while the number of schools participating came close to original intentions (25 rather than 30), as a result of the relaxation of participation requirements noted above the number of teachers/classes fell well short (34 in mathematics, 36 in science, rather than 60 in each). (Ruthven et al., 2016, pp. 25-26)

In practice, most published studies are based on a much smaller number of classes, and indeed many are based on comparisons between one intervention class and one control class (see the examples in Table 1).

Potential threats to the validity of findings from RCT

The simplest type of RCT will compare two conditions, and often the treatment in one condition will be an innovation (a new teaching approach, or curriculum, or set of learning resources, etc.)
to be compared with a treatment that is some form of ‘standard’ or ‘typical’ or ‘traditional’ alternative - for example, in the Ruthven study cited above, “teaching via established methods” (Ruthven et al., 2016, p. 26). The choice of different forms of comparison condition (i.e., no educational input versus customary teaching versus recognised good practice) is considered in a later section of this review. Where a RCT has been carefully designed and carried out, and when the actual treatment learners experience reflects the intended treatment - that is, that there is a high degree of ‘intervention fidelity’ (O'Donnell, 2008) - then it is concluded that the teaching innovation gives superior results to the comparison condition if the extent of greater learning gains (or a more positive shift in attitudes, or whatever the desired outcome was) in the innovative condition reaches statistical significance. As the units of analysis were randomly assigned to conditions, this is unlikely to be due to a difference in the composition of the two groups (e.g., that the higher attaining students, or the better-behaved classes, were assigned to the intervention/experimental treatment). However, randomisation cannot allow for systematic differences introduced by other aspects of the study design. If fifty students were randomly assigned to two different classes in the same school - 25 to the experimental group experiencing some teaching innovation, and 25 to the control group experiencing typical teaching - but the classes were taught by two different teachers, in different classrooms, with lessons at different times during the week’s timetable, then there are clearly differences in the treatments (i.e., important variables not controlled so they are the same in both treatments) that can potentially confound any effect of the innovation, despite randomisation.
Whilst it may seem obvious that the ‘teacher variable’ needs to be controlled (that is, the same teacher should teach both classes), this excludes controlling other variables (e.g., the same teacher cannot teach two classes simultaneously) and, as discussed later, the same teacher may not be equally experienced, competent and comfortable in different conditions. Twenty-five students in the same class (even if assigned to the class randomly) cannot be considered to be independent learners as they interact and influence each other’s learning - so student outcomes within a class tend to be more similar than if the students were not taught together, leading to clustering of measurement outcomes within classes (Dorman, 2012). Such variables are less relevant in large scale RCT as there are many different classes in each condition. This is why studies comparing two, or a small number of classes, may not be especially informative individually, even if randomisation of students to classes is possible, as findings may not be generalisable beyond the specific experiment. (How such studies may be seen as part of a programme of research building up a wider picture of an intervention is considered later in this review.) Even if studies have large enough samples for such issues to be likely to only produce ‘noise’ in the data (such that statistical significance testing can reveal a true ‘signal’ above that noise), there may also be systematic differences that simply cannot be avoided as they are inherent to the way human beings relate to innovative experiences (regardless of the qualities of the innovations themselves). Some such common threats to validity are discussed in this section. These will not be relevant to all RCT, but they are all likely to be potentially pertinent to many experimental studies testing innovations in teaching (including quasi-experiments and natural experiments, discussed further below, where the randomisation required for true experiments is not feasible).
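The consequence of outcomes clustering within classes, noted above, can be given a rough numerical feel. The sketch below uses a widely used approximation for equal-sized clusters, the design effect 1 + (m - 1) x ICC, where m is the class size and ICC is the intraclass correlation (the share of outcome variance lying between classes). This formula, the function names and the ICC value are assumptions introduced here for illustration; they are not taken from the studies cited in this article.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation due to clustering (equal-sized clusters assumed)."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_students, cluster_size, icc):
    """Number of independent learners the clustered sample is roughly 'worth'."""
    return n_students / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # Two classes of 25 students, with a hypothetical ICC of 0.2.
    print(round(effective_sample_size(50, 25, 0.2), 1))
```

Under these assumptions, fifty students taught in two classes carry much less statistical information than fifty independent learners would, which is one way of seeing why a two-class comparison supports only limited inference.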
It is good practice in research reporting for such issues to be acknowledged, as this helps those looking to learn from the research to consider whether these issues undermine confidence in drawing conclusions about a direct unmediated causal link between an innovation and positive study outcomes. Often, the reader will judge the findings robust regardless, and transparent reporting supports an informed evaluation.

Participants’ expectations can influence outcomes

A key issue that often arises in studies with human participants is that the outcomes in treatments may in part depend upon participants’ expectations. This is important because of the demonstrated effect of expectations in producing changes in measured outcomes. In medicine, patients may have a lot invested in the promise of a new drug treatment, and those receiving an experimental treatment may be looking for any small sign that the medicine is working for them - whilst those assigned to a control condition may feel disappointed, having enrolled in a trial in the hope of getting the new experimental drug. If clinicians are optimistic about the new drug, their expectations may be inadvertently communicated to patients, or may bias their measurements of effect when rating subjective reports of symptoms for example. This is readily avoided if neither patient nor doctor knows who is getting which treatment (a situation known as double-blind), and the analysts are working with anonymised data. Similar threats to validity are at work in educational settings. This was demonstrated in a study where teachers in a school were told that tests on the children had identified those - ‘growth-spurters’ - who were likely to make higher levels of progress in the following school year (Rosenthal & Jacobson, 1968; Rosenthal & Jacobson, 1970).
These predictions came true: statistically, the identified children did significantly better in school than their classmates after their teachers had been told of their status as growth-spurters. Actually these children had been assigned this label at random, so the results were either an unlikely chance event, or were somehow the outcome of teachers’ expectations mediating classroom processes. That is, either by chance the students identified just happened to be those who were indeed going to make better than average progress in the next school year (and this is not logically ruled out by the statistics, but rather just shown to be very unlikely) or there was a substantive effect due to teachers knowing who had been identified as about to make good progress. The students that teachers expected to do well actually tended to do well even though they had been selected purely by chance. A great many other studies have since replicated effects of this kind (Rosenthal & Rubin, 1978). It is unlikely that such an experiment would be considered ethically acceptable today (British Educational Research Association, 2018). Deceiving study participants (in this case teachers were lied to) should be avoided, and now this ‘Pygmalion effect’, or self-fulfilling prophecy, is well established it would be considered unfair to those children not identified as likely to make progress (i.e., those in the control condition). Researchers and teachers may be optimistic about some new teaching approach or curriculum materials and this could bias their judgements, and change their classroom behaviour. Teachers may subtly communicate their expectations to learners who may also respond to a teacher’s additional enthusiasm for, and commitment to, an intervention, even if they are not directly aware that the teaching is in some way different from the norm. This is clearly a major issue in experimental research in science education. 
If researchers strongly expect co-operative learning, or a flipped classroom, or enquiry-based teaching (e.g., see the discussion of ‘rhetorical’ experiments, below) - or indeed, for that matter, rote learning and drill exercises, or potentially even starting lessons with a ten-minute nap - to be more effective, then this expectation is likely to have an influence even when the intervention (of itself) may not have otherwise been effective. The response to such a threat used in drug trials - doing the research double blind - is seldom an option in education as it is usually obvious to researchers, teachers, and even learners, when they are part of an experimental treatment condition.

Participants can respond to perceived novelty

Students experiencing innovative teaching treatments may well be aware there is something unusual going on. If the intervention only involves an individual teacher changing their teaching sequence or activities in a particular topic, then the students in the class may not be aware that things are being done differently compared with the teacher’s previous practice. Yet, when the intervention involves an obvious change from what has gone before (e.g., an abrupt shift from teacher-centred teaching and silent individual desk work, to activity-based enquiry learning in groups) then they will be aware something unusual is happening, and may simply respond to the novelty. Perhaps some learners are less comfortable with changes of routine, but when students are familiar with a routine that makes classes seem mundane, anything unusual is likely to make them more attentive and alert, and so likely to influence learning, simply because of its novelty. There is a tendency built into our cognitive systems to be aware of anything unusual and to pay it special attention, so we would expect students to pay more attention than usual when there is a change in the way things are carried out.
This is a consideration in some of the science education studies discussed below where it is claimed that students in the population sampled normally experience teacher-centred instruction where they are largely passive, and by contrast the intervention involves enquiry-based practical activities, group-based discussion work, creative activities, co-operative learning, and so forth. For example, a study of ‘active learning’ teaching methods reported that “regular instruction in this high school is commonly teacher-centered with a lecture-type format and students passively participate in the learning process. They only listen to their teacher, write notes, and use textbooks as a learning material” (Sesen & Tarhan, 2011, p. 209). Moreover, if students are involved in theory-directed research (Taber, 2013a) initiated by external researchers (rather than context-directed enquiry undertaken by a single teacher or department as part of the usual ongoing review and development of teaching) then they (and/or their parents) are likely to have been asked to give informed consent for their participation; they may possibly have been involved in completing official looking tests or questionnaires; and their classroom may well have been visited by strangers carrying out observations or making recordings of some kind. All of this is likely to prime students to be more attentive to what is going on in that class. The joint influence of novelty effects and expectancy effects may in part explain why many interventions that seem effective on first testing may seem to lose their efficacy once they are ‘rolled-out’ on a larger scale to become part of normal ways of doing things (Barab & Luehmann, 2003). It seems that when carrying out educational experiments we have to consider that any apparent outcome may be the result of the combination of the particular intervention being tested plus the simple fact of participants experiencing an intervention.
That applies even when the research is relatively large-scale, involving a large number of classes in different schools working with different teachers, and randomly assigned to one of two conditions: when one condition reflects the status quo, and the other condition something noticeably unusual, then a large sample size and the apparent ‘objective’ nature of the outcomes of statistical tests offer no way of separating any effect of (i) the special nature of the novel treatment, from (ii) that of the experience of novelty itself. Despite novelty and expectancy effects being well-recognised, many experimental studies make no reference to these potential threats to validity. One exception is a study that looked at “the effect of reflective science journal writing on students’ self-regulated learning strategies” (Al-Rawahi & Al-Balushi, 2015, p. 367). This did acknowledge that “students in the experimental group spent extra time doing something different or new… This was not the case in the control group” and suggested this could have been mitigated by “a second experimental group … given extra time to do something new” (pp. 377-378). The same study also acknowledges the potential of an expectancy effect, but suggests this was “controlled for” (p. 378) because teachers in both groups took opportunities to offer formal feedback to their students. Yet, it is likely teacher expectations are often communicated in more insidious ways (Rosenthal, 2003). In any case, similar opportunities to express their expectations would only be helpful if it was shown that teachers in both conditions had similar expectations of outcomes from their teaching.

Fair testing should involve teachers in different treatment groups having comparable levels of experience of their assigned teaching ‘treatment’

An important variable in research into the effectiveness of teaching innovations is the teacher.
Teachers have different levels of skill and experience, different strengths and attitudes, different teaching styles and levels of comfort with different pedagogical approaches, and so forth. Outcomes in two different treatments taught by two different teachers will likely be as much influenced by the ‘teacher variable’ as the ‘treatment variable’. Two approaches to addressing the teacher variable might be to either have the same teacher teach in both conditions or to have a sufficiently large sample so that a diverse range of teachers are employed in each condition. Whilst employing the same teacher in different conditions may seem to control for the ‘teacher effect’ a particular teacher’s skill set or pedagogic style may suit them to working more effectively in one way, where the opposite may be the case for another teacher. That is, there will be interactions between the teacher variable and the treatment variable such that having the same teacher in different conditions (whilst, all other things being equal, preferable to comparing across different teachers) does not completely eliminate the teacher variable when seeking to generalise findings from a study context to other teaching contexts (an issue discussed in a later section). In large scale studies there may be enough variation within conditions to allow both for differences between teachers themselves, and the ways particular teachers may engage with different treatments. The approach is likely to be especially valuable when comparing between different treatments that are equally familiar to the teachers in the study. One variable that may be relevant in many educational experiments that seek to investigate teaching innovations is the level of teacher experience with the innovation. This could undermine even a true experiment that uses a randomisation process.
One might consider that the experimental treatment is a new teaching approach, or a new curriculum, or new teaching resources, and the comparison condition comprises a traditional alternative. The hypothesis being tested is that the innovation will support more effective teaching and so greater learning (that being the motivation for the innovation). One could imagine a large-scale trial where perhaps 100 suitable teachers (that is, those teaching the appropriate year group and topic) volunteered to take part, and a randomisation process was used to create two groups: a group of 50 teachers in the intervention group and 50 teachers in the comparison condition. Now it may be that the teachers involved in the study, and the classes they are to teach, and the schools where they work, are diverse in terms of teacher skills, student ability, school catchment area, and indeed any number of other potentially relevant variables. As the teachers (and their classes, and schools) have been assigned to conditions randomly it can be assumed that these factors are likely to cancel out and so inferential statistics that show statistically significant differences between the treatments are probably not confounded by these variables. However, this logic may be undermined by a systematic difference between the two groups. The comparison group consists of teachers who generally have experience of teaching in the way they will be teaching during the experiment - they will generally have taught this topic in the same way to classes of this age several times before. Yet, typically, the teachers in the intervention group are given some materials and training, and then teach using the innovation for the first time. Generally, when teachers first teach in a new way, or using a new scheme of work or new teaching materials, they do not do so in an optimum way.
Teacher Pedagogic Knowledge is to some extent context specific (Park & Oliver, 2008), and usually teachers need to run through a new approach several times before optimising their practice - honing timings and identifying foci for emphasis, finding out how students respond to aspects of the innovative teaching, determining when and how much structure and guidance should be offered during activities, and so forth. Despite whatever prior professional development is offered, teachers teaching in an innovative way for the first time will be learning through the process (van Driel, Beijaard, & Verloop, 2001) and cannot be fairly compared with experienced teachers working in their customary way. There is also a potential interaction effect here with teacher expectancies (discussed above), as teachers’ self-efficacy will usually develop with increasing experience. A teacher who is confident in working in an innovative way may have high expectations for learning outcomes - a teacher who is still adjusting their practice to a new way of working may not. Now, in principle, there is an easy response to this challenge. In this kind of research, data should be collected over several school years and outcomes in the two conditions monitored. It is quite likely that outcomes in the second implementation of the innovation will be better than the first; and outcomes in the third implementation better than the second - but eventually performance will plateau: at which point a comparison between conditions will be fairer. In practice this means running the experiment and collecting and analysing data over a much longer period, which is why this precaution is seldom taken. This approach is also subject to greater potential experimental attrition (where participants drop-out), especially in those teaching contexts where teachers typically only remain in post for a few years before leaving a school.
Participants may make gains during a study due to maturation

Just as research into teaching interventions needs to take account of how teachers develop their skills in applying particular teaching treatments through cycles of implementation, there are parallel considerations about the nature of learners and learning. One issue is the possibility of maturation. As people mature they acquire new cognitive abilities (Goswami, 2008; Piaget, 1970/1972) and so can be expected to achieve more on tests of scientific understanding. One well-known project in science education was known as Cognitive Acceleration in Science Education (Adey, 1999), and involved providing regular teaching inputs designed to help facilitate a shift in intellectual development that lower secondary age students (e.g., 11-13 year olds) were expected to be undergoing. So, in this programme, which in educational terms can be considered a long-term intervention (over several school years) the participants would have been expected to be undergoing changes regardless of the intervention. Therefore, simply reporting that students at the end of the programme appeared to show cognitive development compared with the outset would not have been informative. This was recognised in the reference to cognitive ‘acceleration’ in the programme title. Rather, when evaluating the effectiveness of the programme, what was tested was whether the intervention encouraged faster cognitive development than would otherwise be the case, by comparing the results of school examinations (taken by participants some years after the intervention) for participants with those of comparable groups who had not experienced the intervention. The argument was that effective cognitive acceleration would support more effective student learning over the remainder of their secondary school career, which could be detected in terms of general academic performance at the end of compulsory schooling (Adey & Shayer, 2002).
In that innovation, maturation was a focus of the study. In other research it is possible that gains measured after an intervention could be due to maturation rather than the specific intended teaching input. This is more likely to be the case when an intervention takes place (i) over an extended period, and/or (ii) with young learners who are developing relatively quickly. An example would be a study into the effectiveness of a curriculum designed to help young pupils from age 4 learn about floating and sinking (Leuchter, Saalbach, & Hardy, 2014). Leuchter and colleagues controlled for possible maturation by testing for changes in a comparison group of similar ages to their intervention group.

Learning in many areas has been shown to follow a ‘U-shaped curve’ such that learner performance on objective measures actually dips first, before it subsequently improves (Siegler, 2004). Observing gains in such cases may then depend very much on the time-span between initial and final testing, with effective strategies potentially leading to gains, no change, or even losses, depending on when the final measurement is made. In such a situation, mean post-test results for an experimental group that are not significantly better than mean pre-test performance might still represent a positive outcome if control group learners are found to show decreasing performance from pre-test to post-test.

Participants may learn from pre-tests

Pre-tests, then, offer a benchmark by which to compare post-test measurements. So, for example, a study may involve a pre-test, an intervention, and a post-test. The pre-test and post-test are intended to test the same variable of interest (e.g., knowledge, understanding, attitude, skills) that the teaching intervention is intended to impact on. It is important that the instruments actually test what is intended if they are to offer valid measures.
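The comparison-group logic described above can be made concrete with a small worked sketch. The code below is only an illustration: the function names and all scores are hypothetical (invented for this example, not drawn from any cited study), and a simple difference in mean gains is a simplification that says nothing about whether the two groups were equivalent at the outset.

```python
from statistics import mean


def mean_gain(pre, post):
    """Mean of the paired (post - pre) differences for one group of learners."""
    if len(pre) != len(post):
        raise ValueError("pre and post must list the same learners in the same order")
    return mean(after - before for before, after in zip(pre, post))


def gain_difference(pre_exp, post_exp, pre_ctl, post_ctl):
    """Gain in the experimental group over and above the comparison-group gain."""
    return mean_gain(pre_exp, post_exp) - mean_gain(pre_ctl, post_ctl)


# Hypothetical scores (assumption): both groups improve, as maturation would predict.
pre_exp, post_exp = [10, 12, 9, 11], [16, 17, 15, 16]
pre_ctl, post_ctl = [10, 11, 10, 12], [13, 14, 13, 15]

raw_gain = mean_gain(pre_exp, post_exp)  # still includes any maturation effect
extra_gain = gain_difference(pre_exp, post_exp, pre_ctl, post_ctl)

# Hypothetical 'U-shaped' case: no raw gain in the experimental group,
# but the comparison group has dipped between pre-test and post-test.
u_pre_exp, u_post_exp = [10, 10, 10], [10, 10, 10]
u_pre_ctl, u_post_ctl = [10, 10, 10], [8, 8, 8]
u_extra = gain_difference(u_pre_exp, u_post_exp, u_pre_ctl, u_post_ctl)

print(raw_gain, extra_gain, u_extra)
```

In the first hypothetical data set the raw experimental gain overstates what can be attributed to the teaching, since the comparison group also improved. In the second, a raw gain of zero still corresponds to a positive difference once the comparison group's decline is taken into account.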
Choices also have to be made about how to construct the pre-test and post-test so they are testing the same features. One extreme is to use precisely the same test on both occasions, as then the equivalence of the two tests is assured. An alternative approach is to develop alternative items intended to be equivalent: something that can (where resources allow) be checked by testing the items with a suitable sample of learners from the same general population as the study participants.

The process of completing a pre-test can potentially be a learning experience. Thinking about questions and attempting to provide suitable answers on a pre-test can of itself make it more likely that a person will be more successful on a post-test, especially if precisely the same test items are used on both tests. Even if parallel, but non-identical, items are used, being tested on the first set of questions may trigger thinking processes that lead to learning that then supports a better performance on the post-test items. This is a particular issue because of the nature of how learning about science occurs - the changes that may be triggered by a learning experience are not necessarily immediate, but may continue for some time (days, weeks, or longer) after the initial experience (Taber, 2013b). The brain may continue to process experiences at a preconscious level, which can lead to new (conscious) insights some time later.

The use of a control condition, where learners undertake the same pre-tests and post-tests, can go some way to allowing for this effect. If the experience of undertaking the pre-test directly primes students to do better on the post-test, then this should be experienced in both the experimental and control conditions. There can however also be indirect effects, due to interactions between the experience of taking the pre-test and the subsequent teaching.
Current understanding of memory suggests that each time a memory is activated it is reinforced (Dudai & Eisenberg, 2004) so this may happen if the teaching intervention causes students to bring to mind thinking triggered by the pre-test. If pre-test items do not directly lead to any new science learning, it is still possible they may prime more effective learning from teaching that follows the pre-test. The educational psychologist Ausubel (1978) discussed the notion of an advance organiser, presented before material to be taught, which can help structure the later learning experience. One experimental study of advance organisers in science lessons that used pre-test items (Gidena & Gebeyehu, 2017, p. 2234) suggested that such advance organisers “can take many shapes” (p.2230). In teaching perspectives informed by the developmental/learning theory of Vygotsky (1934/1986, 1978), some types of learning scaffold (Wood, 1988) are employed to help learners bring to mind relevant prior learning, and to orientate them to the scope of the forthcoming teaching (Taber, 2018). Pre-test items that act in this way can indirectly influence post-test scores by facilitating learning from the intervening treatment.

This can be a concern in research designs that compare an intervention with a comparison condition that does not offer a parallel treatment (the rationale for such a design is discussed later in the section on different forms of control group). Such a study may indicate that the intervention is effective, but strictly the pre-test may need to be considered part of the intervention. It is also possible that any interactions between a pre-test and subsequent teaching may occur differentially in experimental and control conditions that involve different ways of teaching the same topic, given that these are inherently different teaching inputs.
As good teaching practice includes the testing of prior learning at the start of a unit, these issues could be somewhat countered by designing and making available pre-test instruments that teachers can then access and adopt as part of their normal teaching (so implementation will reflect this aspect of the tested innovation). Here the development of suitable pre-tests as research instruments has the useful consequence of supporting good teaching practice through the provision of resources.

Deciding when learning is best measured

The issue of the timescale of the learning process, referred to above, also raises the question of when post-tests should best be undertaken. If student consolidation of learning, due to normal brain processes, continues for some time after teaching, then it may be more informative to test students with a deferred post-test rather than one taken immediately after teaching. Less optimistically, studies have also shown that measured immediate gains may not be maintained. So, interventions to challenge common alternative conceptions may bring about immediate changes in student thinking; but then apparent levels of conceptual change may appear to diminish if measured some weeks later - at which point students’ responses may reflect their initial conceptions (Gauld, 1989). Similar effects occur more generally when learners are no longer actively studying topics, where they may revert to patterns of thinking that dominated before learning took place (Taber, 2003).

Teachers are primarily interested in learning that is long-lasting, suggesting deferred post-tests may be more informative than immediate post-tests. However, the greater the delay in measurement, the more (uncontrolled and unknown) additional learning opportunities participants could have experienced in the interim. Post-tests some weeks, but not longer, after a teaching intervention may offer a sensible compromise here.
Measurement instruments may be considered to be biased towards one treatment

There is potential for the tests used to measure experimental outcomes to be biased towards (or indeed against) the experimental intervention, unless existing standard tests are used that are recognised as valid measures of focal learning outcomes. As an example, in the epiSTEMe project teaching modules and assessment instruments were prepared (for 11-12 year old learners) in two science topics - forces (Howe et al., 2014) and electricity (Taber et al., 2015). As well as incorporating principles adopted across the project, in particular a dialogic approach to teaching (Mortimer & Scott, 2003; Ruthven et al., 2016), each module had its own specific features. Within the electricity module there was a focus on teaching about aspects of the nature of science, in particular the use of models and analogies in science, alongside teaching circuit principles (Taber et al., 2016). The forces module had a focus on teaching about proportional relations, something not usually emphasised in teaching physics to this age group.

The project included a measure to check for potential test bias towards the intervention condition in relation to the comparison classes studying the same school curriculum topics. Class teachers rated the suitability of module test items “for this class given its experience of the topic this school year”. In the electricity module the test items only examined understanding of circuit properties and no items on the nature of science teaching objectives were included as it was inappropriate to assume teachers in the control condition would emphasise these ideas. (Nature of science objectives were included in the official curriculum for 11-14 year old students, but they were not linked to specific teaching topics and could have been introduced at any point over three school years).
In the forces module, where there had been emphasis on the use of proportional relations in teaching the physics concepts, learning of this aspect was tested. It was found (Ruthven et al., 2016) (a) that there was (on average) no more learning of circuit principles in the experimental condition than in the control condition when studying electricity - but it was not possible to know if students had developed a better understanding of the nature of science through studying the intervention module as this was not tested; whereas (b) in the forces module there was significantly more progress in learning about forces in the experimental condition, but teacher ratings suggested the tests measuring this were biased towards learning in that condition.

Judgement is needed in deciding whether such bias is problematic or, perhaps, to be welcomed. If traditional teaching is considered to be ineffective in meeting some particular established curriculum aims, and a teaching intervention is intended to address this, then instruments biased towards testing those specific aims may well be appropriate. However, when tests are biased to objectives or outcomes that researchers particularly value, but which do not represent existing official curriculum aims and are not shared by teachers, then such bias may be considered to undermine the findings of the experiment among the teaching community.

Other potential confounds

This section has discussed a number of issues that may complicate experimental research designs by admitting uncontrolled (and unintended) differences between experimental and comparison conditions.
These issues may be pertinent in true experiments (where randomisation is used, as discussed above) as well as in the types of experiment designs discussed in the next section - quasi-experiments and natural experiments - as they operate systematically regardless of randomisation of units of analysis to conditions (as, for example, when being randomly assigned to an innovative learning condition may tend to increase student engagement). There are of course many other possible interactions between the experimental teaching input and other experiences that can seldom be controlled - learners may do self-directed reading, watch documentaries, visit science museums and the like, alongside the teaching inputs. This can happen in both experimental and comparison conditions (and is not something science educators would wish to discourage), and generally such effects should not lead to a systematic difference between the two conditions - at least not in RCTs where there has been randomisation of the units of analysis to the different conditions. However, not all experimental studies of teaching innovations are RCTs, and where randomisation of the units of analysis (e.g., students) to the learning condition is not feasible then it cannot be assumed that the groups in the different conditions are equivalent at the outset, making it more difficult to interpret measured differences at the end of the study.

Quasi-experiments and natural experiments employed when randomisation is not plausible

When experimental research explores classroom teaching in schools, and the units of analysis are individual learners, it is seldom possible (and may not be educationally desirable) to break up existing classes to randomise individual students into new groups for the research. One study included in Table 1 took place in a school designated as a ‘laboratory charter school’ where randomisation “was part of the school’s research mission” (Yin, Tomita, & Shavelson, 2013, p.
538), but more often creating new groupings is not feasible when working with school classes. So one might consider 50 students who were to be part of a study where it was intended to use individual student test results as a measure of learning to explore whether some teaching approach brought about greater learning than some other teaching approach. If it is possible to randomly assign the 50 students into two groups of 25, then there are 25 ‘units of analysis’ in each group. However, if the researchers are required to work with existing classes then the most randomisation that is possible is to assign whole classes to the two conditions. This would mean the units of analysis were whole classes (one in each condition). To consider this a true experiment (meeting the requirement of randomisation, see Figure 1) there would need to be one measure of learning from each class (cf. Figure 5), but it would be difficult to use statistics to infer anything useful when comparing just two values.

Quasi-experiments

In practice, in such situations, researchers tend to treat the individual learners within intact classes as the units of analysis, in order to collect enough data to be able to undertake statistical testing - but as the units of analysis are not randomly assigned (see the examples in Table 1) it is not possible to draw meaningful conclusions simply by calculating how likely the study outcomes are by chance (and comparing this with a cut-off such as p<0.05), as the students were not assigned to conditions by chance. In a quasi-experiment (see Figure 1) then, it is not possible to draw general conclusions by simply comparing the measured outcomes in the two conditions.

Natural experiments

Another term often met in educational research is that of a natural experiment. A natural experiment takes advantage of differences in conditions that already occur, rather than being based on experimental manipulation (see Figure 1). This may be especially useful where researchers are interested in the possible
detrimental effect of some condition, and it would be unethical to create that condition and assign participants to it to test the effect (consider for example a study to find out if victims of bullying make less progress in their science classes - such a study would look to - sensitively - enrol existing students identified as victims rather than experimentally create new victims).

[Figure 1 about here]
Figure 1: Experimental designs may be categorised as true experiments, quasi-experiments and natural experiments

Sometimes ‘natural experiments’ are possible due to some particular set of circumstances that happen to provide the type of comparison researchers are interested in studying. For example, in many countries, schools run through an annual cycle starting at a particular time of year, and students start formal schooling at the start of the school year following a particular birthday. In this situation it is possible to compare the younger and older students in a year group (Morrison, Smith, & Dow-Ehrensberger, 1995) who have experienced the same educational contexts and experiences, but beginning at a different age (i.e., at the earliest grade levels a child starting school at, say, 5 years and 1 day old is substantially younger - and so typically less developed - than a classmate starting school in the same class on the same day, but at, say, 5 years and 351 days old). A natural experiment might be possible where some innovative teaching approach, curriculum, or learning resource is already being adopted by some teachers, offering researchers a ‘natural’ opportunity to test its effectiveness against some other more routine or traditional treatment.
As, again, there is no random assignment to conditions, it is not possible to simply compare outcomes in the two conditions to infer a possible difference in effectiveness (as it may be, for example, that teachers adopting innovative approaches tend to be atypical in having more teaching experience, more skills, more confidence, more cooperative classes, better rapport with their students, etc.).

Testing for equivalence between groups

In quasi-experiments or natural experiments a more complex design than simply comparing outcome measures is needed. For example, researchers have to either demonstrate that despite the lack of randomisation, the distribution of ‘units of analysis’ in the conditions can be considered equivalent prior to the treatment (something often checked even in RCTs, as a random process cannot assure equivalence); or that a difference in outcome seems to be due to the focal variable despite this non-equivalence. In either case this means identifying and measuring any relevant variables.

For example, if (hypothetically) prior knowledge was judged the only relevant variable influencing learning in some study, then a suitable pre-test (see above) might be used to test whether prior learning could be considered equivalent across the two conditions. Often, however, there are other variables, besides the independent variable, which it is recognised could have an effect on outcomes: ‘confounding’ variables. If the social class of students and reading age were also considered relevant then it would need to be shown that valid measures of these were also equivalent. This raises the question of what should be considered as ‘equivalent’.
Equivalence is more than a lack of significant difference

Considering the prior learning variable, if students in two classes were given the same instrument considered to be a valid test of relevant prior learning, and if the mean scores and standard deviations in the scores in the two classes were found to be identical, then this might be considered convincing evidence for equivalence. This is also extremely unlikely to happen (so much so that such a result could look suspiciously convenient). The question then becomes how small a difference between the measurements of prior learning in the two groups must be for it to be assumed to make no practical difference. The account of one study of enquiry-based learning reports,

In order to ensure the equivalence at experimental and control groups, students’ previous year graduate points of achievement (GPA), intelligence fields, the number of students at the groups and pretest results were taken into account. It was found that experimental group was statistically equal to control group. (Abdi, 2014, p. 37)

Most of this data was not reported in the paper, but the “statistically equal” mean pre-test scores were 2.95 for the control group and 3.15 for the experimental group (p.40). In a study testing the use of advance organisers in physics teaching (Gidena & Gebeyehu, 2017), three parallel groups were pre-tested, and the two that were reported to “have equivalent means” were selected for comparison and assigned as experimental (mean score = 6.61) and control (mean score = 6.26) groups.

Although statistical tests can offer some guidance on what counts as equivalent, they need to be interpreted differently than when looking for a statistically significant difference in the outcomes of the experiment (see Figure 2). An initial difference which is substantial, but statistically non-significant, may be sufficient to explain outcome differences that do reach statistical significance (Taber, 2013a, p. 85: Fig. 4.3).
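The point that a non-significant initial difference can still be substantial may be illustrated with a short sketch. This is not the analysis used in any of the studies cited: the scores are invented, the function names are hypothetical, and an exact permutation test is used here purely as a convenient, self-contained stand-in for the t-tests typically reported. The sketch reports both a standardised mean difference (Cohen's d with a pooled standard deviation) and a p-value for the same pre-test data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def standardised_difference(a, b):
    """Cohen's d for mean(b) - mean(a), using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var)


def exact_permutation_p(a, b):
    """Two-sided exact permutation p-value for the difference in group means."""
    pooled = list(a) + list(b)
    total = sum(pooled)
    observed = abs(mean(b) - mean(a))
    relabellings = at_least_as_extreme = 0
    # Enumerate every way of relabelling len(b) of the pooled scores as group b.
    for idx in combinations(range(len(pooled)), len(b)):
        sum_b = sum(pooled[i] for i in idx)
        diff = sum_b / len(b) - (total - sum_b) / len(a)
        relabellings += 1
        if abs(diff) >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / relabellings


# Hypothetical pre-test scores for two small intact classes (assumption).
control_pre = [4, 5, 6, 7, 8, 9]
experimental_pre = [5, 6, 8, 8, 9, 10]

d = standardised_difference(control_pre, experimental_pre)
p = exact_permutation_p(control_pre, experimental_pre)
print(f"standardised difference d = {d:.2f}, permutation p = {p:.2f}")
```

With these invented scores the experimental class starts more than half a pooled standard deviation ahead, yet the p-value is well above 0.05, so a report stating only that there was “no significant difference” would conceal a starting advantage large enough to matter. Enumerating all relabellings is only practical for very small groups such as these.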
If statistical tests are applied to the starting conditions using the usual p<0.05 criterion then they will only flag up differences between the two groups which are very unlikely to be due to chance differences. However, what should be looked for is evidence of close similarity, rather than the absence of evidence of improbable differences. (One might say that testing for equivalence pre-intervention, and for experimental effects post-intervention, involve looking at different tails of a distribution.) Two classes with differences between them that are at a level quite unlikely to occur by chance are certainly not equivalent (at least in the sense that the word is generally employed).

[Figure 2 about here]
Figure 2: Evaluations of equivalence between different groups should be more rigorous than simply excluding differences reaching statistical significance

As an example of good practice here, in their study of the effect of cooperative learning strategies on understanding electrochemistry concepts, Acar and Tarhan (2007) compared treatments in two intact classes of students. Although they only randomised intact classes to conditions, they treated each of the 41 students in the study as a separate unit of analysis (that is, the individual units of analysis were not randomly assigned) and so could not consider this a true experiment. They used a pre-test to compare across the two conditions and reported that “independent t-test analysis showed that there was no statistically significant difference between the mean scores of the experimental and the control groups with respect to (t=0.199, p>.05) the pre-test” (p.360). They quote a probability value, p, of approximately 0.84 (p.361), which suggests the measured initial difference between the patterns of attainment in the two groups is at a level that would be likely to occur by chance (see Figure 2). However, Koksal and Berberoglu (2014, p.
66) report a study designed “to investigate the effectiveness of guided-inquiry approach in science classes…”, where evidence of equivalence was much weaker. In this study, the treatment group comprised five classes in one school, and the control group was composed of nine classes in six other schools “to prevent any interaction between the control group and experimental group students” (p.70). They sought to demonstrate “equivalency of schools” as “evaluated with respect to socio-economic characteristics of the students” (p.70), and they reported that the measure used “did not indicate any significant difference” (p.70). Koksal and Berberoglu quote a value of p of 0.21 which is indeed >0.05 (see Figure 2), but means the degree of difference found is large enough to be likely to occur by chance on only about one occasion in five. That is, the differences in socio-economic backgrounds between the two conditions were not so great as to reach statistical significance, but could not be considered small enough to be at a level of just ‘noise’ in the data.

Using statistics to respond to non-equivalence

When groups in different treatments cannot be considered equivalent, then it is not sufficient to simply compare output measures at the end of the intervention. Rather, some kind of mathematical model (such as the ‘hierarchical analysis’ alluded to in the quotation from Ruthven and colleagues above) is needed, in order to allow for how those differences in starting points for the two groups will influence outcomes. Then it can be judged whether any measured differences after the experiment can be considered as due to the difference in treatment, rather than differences in the measured values of confounding variables.

Koksal and Berberoglu characterise their study (see above) as a “non-equivalent control group quasi-experimental design” (p.69). They explain the variables measured: “Guided-inquiry approach was the independent variable.
While guided-inquiry teaching and learning was implemented in the experimental group, traditional teaching and learning was followed in the control group during the ‘Reproduction, Development, and Growth in Living Things’ (RDGLT) unit. In both groups, the students’ academic achievement, science process skills, and attitudes toward science were defined as dependent variables. And the unit Achievement Test (RDGLT), Science Process Skills Test (SPS), and Attitudes Toward Science Questionnaire (Att) were administered to both experimental and control groups prior to and after the treatment.” (p.69). Given the quasi-experimental design, the researchers used analysis of variance to interrogate the various measures made before and after the intervention in both the experimental and comparison conditions.

Choosing comparison conditions

Whilst all experimental designs have certain commonalities, there are considerable differences in the kinds of educational activity considered appropriate for the control or comparison groups in different studies. Table 2 sets out a simple typology of three levels depending upon the nature of the educational input provided for the learners in a control or comparison group. The activity undertaken with a group of learners that could potentially be educative is here referred to as a ‘treatment’. In experimental work the experimental/intervention group is subject to a treatment that differs in some well-characterised way from the treatment of the control or comparison group. The three levels suggested in Table 2 set different tests for the experimental treatment. These are, in effect,
• does it have any educational value? (level 1);
• is it better than standard educational provision? (level 2);
• how does it compare to what is already recognised as good practice? (level 3).
[Table 2 about here]
Table 2: Distinct levels of control in experimental designs according to the nature of the educational ‘treatment’ experienced by the control or comparison group.

As with most typologies used to analyse complex phenomena, it is not suggested that all relevant studies will fit clearly within one of the three categories, but rather that the typology offers a useful starting point for thinking about this aspect of studies. Some examples of studies that might be categorised according to these levels are summarised in Table 3, and discussed below.

[Table 3 about here]
Table 3: Examples of different ‘levels’ of control condition

Does the experimental treatment have any educational effect?

The first level of experimental design suggested in Table 2 simply looks to see if outcomes on some educational measure are better after some treatment than in a matched group of learners who did not experience any treatment. This level of design is potentially useful when the research question concerns whether there is any value in introducing some new educational provision or resource that would be additional to current provision. That is, this type of study is not concerned with doing something differently, but rather whether there is sufficient value in committing additional resources to do something extra, that is not currently done, to consider recommending it should be added to existing educational provision.

One example of this type of study (see Table 3) was reported by Moore, Graham and Diamond (2003) who conducted “a randomised controlled trial to test the effectiveness of a teacher-led intervention to improve teenagers’ knowledge of emergency contraception” (p.673). The intervention comprised a lesson to be delivered to 14-15 year old students following a two-hour teacher development input.
This intervention was to be given as something additional to existing sex education provision:

the chosen control group treatment for the emergency contraception trial was that control group schools would be asked to continue with their existing sex education provision, whilst those randomised to the intervention group would be asked to continue with normal sex education, and to additionally receive the in-service training and deliver the emergency contraception lesson (p.681, emphasis added)

For Moore and colleagues this was a principled decision:

that based on what is known at the start of the trial, control group participants should not be offered something known to be less effective than (i) what the intervention group receive, or (ii) what they would have received if the trial were not taking place (p.681)

The decision to frame this research in terms of what is described here as a level 1 study means that all Moore and colleagues could test was whether the additional lesson added value over and above existing provision. However, as it was considered that existing provision was deficient (i.e., that students were not effectively learning about an important topic in their standard sex education provision) and so some kind of additional input on this topic was needed to augment standard provision, this was a sufficient and suitable test.

Another ‘level 1’ type experimental study was reported by Hong, Lin, Chen, Wang and Lin (2013). They implemented a 24-hour intervention programme of “inquiry-based aesthetic science activities” over twelve weeks (see Table 3). No special curriculum activity was offered in parallel for the students in the comparison group, so positive outcomes reported by Hong and colleagues (in terms of ‘learning goal orientation’ and attitude to science, p.231) reflect the value added by the intervention as an additional extra-curriculum opportunity.
The study cited earlier by Leuchter, Saalbach and Hardy (2014) testing a curriculum intervention in the topic of floating and sinking (see Table 3) included a “control group that participated in a pre- and post-test, but not in an implementation of the curriculum on floating and sinking” (p.1758). In that study, teachers “were asked to follow their usual curriculum [but not] offer any curriculum on floating and sinking between pre- and posttests” (p.1762). Leuchter, Saalbach and Hardy reported positive results for their study. The group of learners who experienced the learning experiences provided by the curriculum intervention showed significantly better outcomes than the group of learners who had not been provided with any relevant learning experiences.

This kind of design can be valuable where there might be theoretical grounds to doubt whether an educational intervention could have any significant effect. In the context of Leuchter, Saalbach and Hardy’s study such arguments might be that learners of this age are too young to benefit from educational experiences of this kind, or that teachers of the lowest age grades generally lack the necessary specialist knowledge or skills to support learning of abstract scientific concepts. The control condition here acts as a check against the possibility that measured gains in the treatment could be explained by such possible effects as learning from the pre-test, spontaneous learning from general experience, or general cognitive development due to maturation (factors discussed earlier in this review). A useful feature of the report of this study, in common with the work of Moore, Graham and Diamond (2003), is that the report offers a clear rationale for why this level of control (‘level 1’, cf. Table 2) was chosen.

Does the intervention represent an improvement on current practice?
The second type of experimental design represented in Table 2 concerns the testing of an innovation which is conjectured to offer an improved form of educational provision in relation to some specific educational aim(s). In this situation the innovation is compared with what is considered the ‘standard’ or ‘normal’ form of provision. An example of this type of study would be that of Grooms, Sampson and Golden (2014) where the use of enquiry-based undergraduate laboratories was compared to what the researchers considered a traditional (“cookbook”) approach (see Table 3). Grooms, Sampson and Golden compared outcomes (“the quality of students’ arguments”, p.1416) after two groups of students had experienced a semester of laboratory work. The raters who scored the student responses to the instruments used as pre- and post-tests were not aware which sets of responses related to each of the two conditions, an appropriate precaution to avoid any unconscious bias in the analysis. (This was then ‘single blind’: the students themselves would have been aware whether or not they were being taught in a novel condition, as would the teaching staff).

In another study that can be characterised as having a level 2 control group (see Table 3), Bramwell-Lalor and Rainford (2013) ensured that the use of concept maps in the experimental treatment was balanced by equivalent time spent on more customary learning activities in the control condition. When Yin, Tomita and Shavelson (2013) investigated learning progression-aligned formal embedded formative assessment (see Table 3), they set up the teaching in the comparison condition to be as close to that in the experimental condition as possible, to the extent of having the same teacher teach the same activities to both groups.
They even included additional “curriculum-specific extension activities” (p.531) for the comparison students to offer a relevant learning activity to substitute for the formative feedback activities undertaken by the experimental group. Arguably, Yin, Tomita and Shavelson’s study somewhat exceeds the characteristics of a level 2 study (level 2+, perhaps) and approaches the next level, both because it ensured the comparison group were taught as similarly as possible to the innovative treatment group, and because it provided relevant additional learning opportunities for the comparison group learners to balance the specific intervention-relevant activity in the experimental group.

How does an innovation compare with currently recognised good practice?

Yin, Tomita and Shavelson’s study design approaches the third type of experimental design in Table 2, which sets a higher standard for an innovation to be measured against. Whereas at the first level researchers seek to find out if some educational treatment has some effect in comparison to no treatment at all, and at the second level researchers look to see if an innovative approach has a more positive effect than standard provision, at the third level a comparison is made with educational provision considered to reflect good practice. In effect, researchers are asking if an innovation is as good as, or even better than, something that is already considered to be effective. An example of this type of design would be a study reported by Bunterm, Lee, Ng Lan Kong, Srikoon, Vangpoomyai, Rattanavongsa, et al. (2014) who compared learning in classes instructed according to the model of enquiry learning recommended by the Thai national ministry (see Table 3). The researchers provided lesson plans according to this model which were adapted according to either structured or guided enquiry.
That is, the treatments varied in the extent to which the teacher directed student decision-making during the exploration, explanation and elaboration phases of the enquiry cycle. Here, then, the authors compared different implementations of recommended good practice. Another study which might be understood as falling in this category was carried out by Chen and colleagues with undergraduates in electronics (Chen, Chang, Lai, & Tsai, 2014). In this study (see Table 3) both treatments involved (i) using a pre-test to diagnose students’ misconceptions relating to diodes; (ii) providing them with feedback from the pre-test; (iii) providing training in using electronic teaching materials designed to address such misconceptions; and then (iv) use of those learning materials. The difference was in the form the instructional materials took: in one case directly providing remedial information, and in the other engaging students in the P-O-E (Predict-Observe-Explain) sequence in working through the same content. All students experienced aspects of good teaching practice: a diagnostic exercise to check prerequisite learning and instructional materials designed to address identified misconceptions.

Guidance on selecting control conditions: logical considerations

The choice between (a) level 1 control conditions, where a teaching innovation is compared with a treatment without teaching (or where standard teaching that is supplemented by an additional teaching input is compared with only the standard provision), and (b) level 2 and 3 control conditions, which offer an equivalent level of teaching input intended to meet the same educational objectives as the innovatory treatment, will derive from the motivation for the study. In many teaching contexts there will be existing provision which, even if not considered effective, will be assumed to bring about learning objectives to some extent.
In these situations, a level 1 control condition is of limited use as such a study will simply show that the tested teaching treatment produces some level of learning - something that is to be expected (as even mediocre teaching is likely to facilitate some level of learning), and, without a meaningful comparison with existing practice, offers little guidance for teachers. The choice between levels 2 (the comparison treatment being standard provision) and 3 (the comparison treatment being recognised good practice), may depend upon what the innovation is hoped to provide. If existing provision is considered to draw upon too high a resource level, or is found to have some undesirable effects, then seeking an alternative that is just as effective may be well-motivated. So, a hypothetical school level biology course using animal dissection might lead to satisfactory levels of learning of anatomy, but lead to a minority of students declining to take part. In such a situation an experiment to test an alternative to dissection may only be seeking to find an approach that produces learning outcomes that are as good as in the comparison condition. In this situation, current standard practice provides an effective comparison condition and there is a sensible rationale for a ‘level 2’ control (see Table 2). Many published studies argue that the innovation being tested has the potential to be more effective than current standard teaching practice, and seek to demonstrate this by comparing an innovative treatment with existing practice that is not seen as especially effective. This seems logical where the likely effectiveness of the innovation being tested is genuinely uncertain, and the ‘standard’ provision is the only available comparison.
However, often these studies are carried out in contexts where the advantages of a range of innovative approaches have already been well demonstrated, in which case it would be more informative to test the innovation that is the focus of the study against some other approach already shown to be effective. These different situations are summarised in Table 4.

[Table 4 about here]

Table 4: Guidance on the logic of selecting control conditions

Guidance on selecting control conditions: ethical considerations

Education has values at its core, and educational researchers should always pay particular attention to research ethics: the potential consequences that their actions could have for others. Participants (and suitable gatekeepers, when participants are children) in educational research studies should always give voluntary, informed, consent - but researchers retain a major responsibility for the ethics of experiments as participants cannot be assumed to fully understand the background and nature of the research in the way the researchers do. Teachers and educational researchers should in particular seek to avoid doing anything that is likely to harm those they are working with (Taber, 2014a). In most educational research experiments of the type discussed in this article, potential harm is likely to be limited to subjecting students (and teachers) to conditions where teaching may be less effective, and perhaps demotivating. This may happen in experimental treatments with genuine innovations (given the nature of research). It can also potentially occur in control conditions if students are subjected to teaching inputs of low effectiveness when better alternatives were available. This may be judged only a modest level of harm, but - given that the whole purpose of experiments to test teaching innovations is to improve teaching effectiveness - this possibility should be taken seriously.
This leads to two general recommendations.

Firstly, often there will be some scope for interpretation in deciding, on the basis of the logic of a study, whether to set up a research study with level 2 or level 3 control (see Table 3). Where this choice is unclear, the ethical imperative would suggest seeking to set up a level 3 study as this has the most potential to benefit participants. In general, participants in comparison conditions should never be treated merely as sources of data.

Secondly, it is good practice to seek to offer an innovation to the control condition where possible. This may either mean offering this to those assigned to the control condition after the study (Moore et al., 2003; Ruthven et al., 2016), or setting up a design where participants all experience the experimental condition at some point in the study (e.g., see Figure 3). Such a design has methodological as well as ethical strengths. For one thing it offers two discrete tests of the treatment being investigated. It also somewhat mitigates any uncontrolled differences between the two groups. If, by chance, one group would learn more effectively across a wider range of conditions, then this design avoids that group being exclusively either the experimental or comparison group.

[Figure 3 about here]

Figure 3 - A compensatory research design where both groups experience the innovation

One of the questions raised in designing a study is whether the innovation can reasonably be expected to be effective. By the nature of an experimental test this should be unknown at the start of the study, and in the natural sciences ‘bold’ conjectures are said to be potentially the most informative (Popper, 1989). Yet, clearly, it would be ethically questionable to set up a large-scale study to test a genuine innovation were there not some good grounds to hypothesise this would lead to positive outcomes.
There needs to be a balance of considerations between the risks of carrying out experiments with untested teaching approaches based on overly bold conjectures, and of setting up experimental ‘tests’ that will only demonstrate what has already become well accepted to be the case. The former situation risks poor educational outcomes in the experimental treatment. The latter situation uses valuable resources ineffectively, and inconveniences participants despite having little scope for developing new knowledge. Yet, most new studies of teaching innovations are to some degree looking to replicate findings from existing published studies. This reflects the key issue of the extent to which it is possible to generalise from the results of educational experiments.

Generalising from experimental studies

The issues considered so far in this article have in particular concerned the question: How can we be confident that the difference in measured outcomes from an educational experiment reflects differential effectiveness of the treatments compared, rather than some other factor(s)? In this section a rather different question is considered: Assuming we are confident that the difference in measured outcomes from an educational experiment reflects differential effectiveness of the treatments in the context studied, how can we also be confident the differential effectiveness would be found in other contexts? That is, how can we know that the result of an educational experiment can be generalised beyond its original context, to justify recommending that the evaluated innovation should be adopted more widely?

Reporting effect sizes

Even when an educational experiment offers statistically significant results that indicate that an innovation was effective in bringing about desired educational outcomes, this may not be a good enough reason to suggest wider implementation.
Innovations tend to have resource costs - such as retraining teachers or publishing and disseminating new resources - and so it must also be judged that any gains will be ‘cost-effective’. It is in the nature of statistical significance that although it indicates a difference between treatments which is unlikely to occur just by chance, this does not mean the difference is substantial. It is possible (especially where the samples include large numbers of the units of analysis) for a difference that is modest in absolute terms to reach statistical significance. In “education research studies that compare different educational interventions, effect size is the magnitude of the difference between groups” (Sullivan & Feinn, 2012, p. 279), and it is good practice for reports of educational experiments to quote an effect size for statistically significant results. As one example, in the study by Koksal and Berberoglu (2014) discussed above, the researchers reported a significant effect of the treatment on student achievement, process skills, and attitudes, but also reported that although these effects all reached significance, “the effect size in achievement measure is small” (p.75).

We can consider a hypothetical educational experiment in some specific classrooms and schools, to test some teaching innovation, which has been well designed and carried out, and which has reported statistically significant effects with large effect sizes. This suggests the intervention resulted in a substantial effect which seems unlikely to be a statistical fluke, but it poses questions of potential generalisation.

• On what basis can we assume that the results are relevant to other classrooms and schools?
• Is it sensible to recommend changes in other teaching contexts that may be quite different from those involved in the study on the assumption that the same effect will be observed?
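
The distinction drawn above between statistical significance and effect size can be illustrated numerically. The following is a minimal sketch, not drawn from any of the studies cited: the summary statistics are hypothetical, the effect size is computed as a standardised mean difference (Cohen's d, one common choice), and the p-value uses a large-sample normal approximation rather than any particular test used in the literature discussed here.

```python
import math
from statistics import NormalDist


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def two_sided_p(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical test scores: the same 2-point difference in two studies
# that differ only in the number of learners per condition.
small_study = dict(mean_a=52.0, mean_b=50.0, sd_a=10.0, sd_b=10.0, n_a=30, n_b=30)
large_study = dict(mean_a=52.0, mean_b=50.0, sd_a=10.0, sd_b=10.0, n_a=2000, n_b=2000)

for label, study in (("small", small_study), ("large", large_study)):
    d = cohens_d(**study)
    p = two_sided_p(**study)
    print(f"{label} study: effect size d = {d:.2f}, p = {p:.3g}")
```

Under these assumed figures the effect size is identical in both studies, while only the larger study reaches conventional significance. This is why an effect size needs to be reported alongside a p-value.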
The assumption that the results of an experiment should apply beyond the specific sample tested can be based on assumptions about the kind of entities the units of analysis are; or on statistical inference; or may rely on comparisons of similarity between contexts. Each of these possibilities is considered below.

Natural kinds and theoretical generalisation

In the natural sciences, the units of analysis are usually examples of what are called ‘natural kinds’ (LaPorte, 2004), such that in terms of certain ‘essential’ qualities it can be assumed that what is found with one specimen applies to any other specimen of that kind. Science textbooks and data books reflect this assumption when they report ionisation enthalpies for different elements, electrical conductivities of different metals, the charge on any electron, the skeleton structure of (any) frog, and so forth. This is a kind of theoretical generalisation where what is found to be the case for some particular specimen or sample is considered to apply to other specimens based on theoretical considerations about what makes these different specimens to be of the same kind. Life scientists may expect more variation within a natural kind (say, a species) than physical scientists, but even here techniques may be used that work with particular ‘strains’ or genetic lines (Knorr Cetina, 1999) so that different specimens of the same type are very similar in their responses to experimental interventions. It may still be inappropriate to assume that what is found with one mouse or one bacterium can be generalised to all, so a larger number of specimens may be randomly assigned to experimental and control conditions and statistical techniques used to compare outcomes across the two conditions - which is superficially similar to many of the educational studies discussed in this review.
In the natural sciences, then, theoretical considerations allow us to assume that certain measurements made on one specimen will apply to others of the same kind, or at least (in the life sciences) that average differences between conditions would apply to other samples of specimens of that kind. What are considered as natural kinds and which properties are essential qualities of such kinds have to be determined. For example, in many ways the chemical elements and compounds offer prototypical examples of natural kinds. Yet, for certain particular purposes, samples of elements must be considered as mixtures of several kinds (isotopes). The failure to recognise two different kinds in the drug thalidomide (that is, assuming two different enantiomers could be considered to be the same natural kind for the purposes of drug production) led to tragic outcomes (Fabro, Smith, & Williams, 1967). In general, however, this kind of generalisation has been very effective. Just one of myriad examples would be that once the composition and geometry of ammonia molecules has been established, this can be assumed to apply to all ammonia molecules. In educational studies, however, the units of analysis are not considered to be natural kinds that can be taken to share common properties to this extent. Social kinds, such as learners, teachers, classes, and the like, differ from each other in a great many ways, so there are few useful common properties that once measured on one specimen or sample can be assumed to apply more generally across that social kind.

Statistical generalisation

Research in education (and the social sciences more widely) cannot usually assume the units of analysis can be treated as natural kinds: what is found out about this particular 15-year-old learner, this physics class, this novice science teacher, cannot be assumed to apply to any 15-year-old learner, any physics class, or any novice science teacher.
It is known that learners, classes, or teachers vary across a whole range of variables that may impact on teaching and learning - so theoretical generalisations (e.g., something was found to be the case with one biology class so it will be the case for all biology classes) cannot be made on the basis of social kinds given such diversity within the ‘same’ kind (be that biology classes, university chemistry teachers, children attending primary school science clubs, etc.). Instead, a form of statistical generalisation is often used, where the results of an educational experiment tell us something about what is typically the case with, say, 15-year-old learners, physics classes, or novice science teachers. Results therefore offer guidance on what is likely to be the case more generally, more often than not, rather than what has been shown to always be the case with these kinds. Moreover, as explained below, such forms of generalisation strictly rely upon following particular procedures. When the design of an educational experiment cannot support statistical generalisation, then there is greater doubt over whether the results of an educational experiment can offer guidance beyond the specific samples involved in the study to other samples of the same kind. However, in these cases it may be possible to offer what is known as ‘reader generalisability’ supporting what is sometimes labelled analytical generalisation. This will be considered below (see ‘Replication studies’), where the issue of the role of replication of experiments in generalising results is discussed.

Strict conditions for statistical generalisation

The importance of randomisation of units of analysis to the different conditions in true experiments was explained above, as this gives an assurance that differences identified in outcomes are unlikely to be due to chance differences in the make-up of the different groups.
Even if such conditions are met, this does not ensure that valid results from a specific trial are relevant beyond the sample involved in the research. Where statistical generalisability is intended, researchers need to:

a) identify a specific population that the trial is intended to be relevant to
b) ensure that those selected for the experiment comprise a fair sample of the wider population

If the implications of studies are to be clear, it is good practice for research reports to be explicit about precisely what population was sampled. One of the studies listed in Table 1 reports that “the population of the study consists of all 397 pre-service science teachers studying at a state university in Turkey, 121 of which participated in the study making the sample 30% of the population” (Taşlidere, 2013, p. 147). However, many studies have titles or research questions implying a broad population (e.g., ‘students’) where the sample is drawn from a very particular context (see Table 1). Often, it is left to the reader to infer the population that results are intended to generalise to.

Sampling

Ideally statistical generalisation is supported by selecting a random sample of the population of interest, which gives the strongest grounds for considering results from the trial to reflect a general pattern that would be found across the wider population (see Figure 4). Selecting the units of analysis at random from the population (so each unit that is part of the population has an equal chance of being part of the study) avoids the need to understand the diversity of the population (what the relevant variables are, and how they are distributed in the population) in a parallel way to how randomly assigning units to conditions avoids the need to characterise and then show equivalence between the groups in the different conditions.
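
The two distinct uses of randomisation described above (random selection of a sample from the population, then random assignment of the sampled units to conditions) can be sketched in a few lines. This is an illustrative sketch only: the population of 400 classes, the sample size, and the function name `sample_and_assign` are all hypothetical, and a real trial would need a complete list of the units in the population, which is often not available.

```python
import random


def sample_and_assign(population, sample_size, seed=None):
    """Draw a simple random sample, then randomly split it into two conditions."""
    rng = random.Random(seed)
    # Step 1: random selection. Every unit in the population has an equal
    # chance of being included in the study.
    sample = rng.sample(population, sample_size)
    # Step 2: random assignment. Shuffle the sampled units and split them,
    # so that group membership does not depend on any property of a unit.
    rng.shuffle(sample)
    half = sample_size // 2
    treatment, control = sample[:half], sample[half:]
    return treatment, control


# Hypothetical population: identifiers for 400 classes.
population = [f"class_{i:03d}" for i in range(400)]

treatment, control = sample_and_assign(population, sample_size=40, seed=7)
print(len(treatment), "classes in the experimental condition")
print(len(control), "classes in the comparison condition")
```

Step 1 bears on whether results can be generalised to the population. Step 2 bears on whether a difference in outcomes can be attributed to the treatment. A study can have one without the other.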
[Figure 4 about here]

Figure 4: When an experiment tests a sample drawn at random from a wider population, then the findings of the experiment can be assumed to apply (on average) to the population

However, it is often not feasible to identify all units in a population, let alone ensure they are potentially included in a sample. So, whilst this would be the ideal situation, few educational trials achieve it. Alternatively, statistical generalisation could be supported by an argument that a non-random sample is representative of the wider population on those variables most likely to be relevant to outcome - based for example on findings from surveys of the population. As there may be a range of potentially relevant factors, which may interact, building a representative sample can be challenging. In many small-scale studies that only involve a few classes or schools (cf. Table 1), an inherently weaker design is often employed, where units of analysis are chosen to be fairly typical of the wider population, to avoid obvious ‘outliers’, but this does not strictly allow statistical generalisation to a wider population. An example would be Chen and colleagues’ study (see Table 1) where they located their study in a school “ranked around 14 of 28 high schools in Taipei” (Chen et al., 2014, p. 915). Other studies may report having used ‘convenience’ sampling, i.e., where researchers can easily access the research site and necessary permissions are readily forthcoming, such as Yin and colleagues’ work in “a laboratory charter school [that] includes a focus on educational research as part of its charter” (Yin et al., 2013, p. 538). Access to research sites can be elusive, so convenience sampling may be justified, but this approach may not offer the most informative samples (see below).
Variation within a population

Even when statistical generalisation is possible, this does not imply that a teaching innovation found to be advantageous in the experiment would also be universally advantageous if implemented throughout the population sampled, only that it would on average be expected to produce better outcomes (see Figure 4). So, the implications are probabilistic. If a certain approach to teaching natural selection was found to give greater learning outcomes in an RCT based on a random sample of the population of secondary age classes in Florida, then this suggests that if the approach was implemented across Florida, it would (subject to the various caveats discussed earlier in the article) improve average learning outcomes in the state. A teacher in a particular school in Florida working with a particular class cannot be confident the innovation would improve learning gains in her class, but in the absence of any other direct evidence, she could reasonably assume that introducing the innovation will probably lead to greater learning gains. Where probabilistic evidence is all that is available, it can be the best guide for informing action. One discussion of a large-scale education intervention programme for disadvantaged children in the United States (‘Follow Through’) reports how the programme evolved into “a series of ‘planned variations’ of education” that allowed 17 models of schooling for disadvantaged children to be compared (Guthrie, 1977, p. 240). Thirteen of the models offered sufficient data for comparisons to be made based on a “battery of tests…to encompass basic skills, cognitive/conceptual development, and affective factors”, and effectiveness was “judged by whether a model surpasses its control group in a particular site on a particular category of outcome test” (pp. 241-2). This allowed the most and least effective programmes to be identified.
Even though this enabled a form of overall ranking to be produced, it was noted that “We should be alerted to the fact that no program was successful everywhere it was tried…All of the programs were successful in at least one location on at least one class of outcome, indicating that local effects are extremely important” (Guthrie, 1977, p. 243).

It was noted above that in the epiSTEMe project the experimental classes who studied the electricity module did not outperform the comparison classes: indeed the mean of class average learning gains (deferred post-test - pre-test) was slightly greater in the control condition, albeit by a non-significant amount (Ruthven et al., 2016). What is perhaps more noteworthy is the range of outcomes in the two conditions - as Figure 5 shows, there was a wide range of learning gains in both conditions. Indeed, this was wider (including two classes showing reductions in average test score after teaching) in the intervention condition where all the classes were intended to follow the same scheme of work, including prepared teaching slides and common learning activities supported by the same printed learning resources (Taber et al., 2016). Perhaps the most reasonable conclusion to be drawn in this case is that the independent variable (the teaching scheme for studying the topic) appeared to be much less critical for determining learning than other factors that varied between the classes, and their teachers and schools.
[Figure 5 about here]

Figure 5: Results from a randomised trial showing the range of within-condition outcomes (Taber et al., 2016)

Replication studies

It seems, then, that: (a) it may be difficult to set up experimental studies that meet the requirements to allow statistical generalisation of study findings to the wider population of interest, as random sampling of broad populations is seldom feasible, and building representative samples of broad populations (e.g., of secondary schools in England; of graduate chemistry teachers; of freshers on engineering degrees in Australia, etc.) is also challenging (see Figure 6); and (b) there may be such diversity within social kinds such as schools, teachers, or classes, that even when statistical inference is possible in general terms, it is likely that what is true on average for some identified population will not apply to all its members.

[Figure 6 about here]

Figure 6: Many educational experiments do not meet the conditions that allow statistical generalisation to a wider population

It is also useful to bear in mind that statistical significance only implies that an experimental outcome was unlikely to be due to chance, and there is always the possibility of false positives, as a small proportion of statistically significant results will have occurred by chance. A school or teacher considering changing practice in the light of an innovation that has been shown in an experiment to give statistically significantly better outcomes can be assured that, as p<0.05, this result is probably not just a fluke (although even then it could be due to systematic effects that could not be controlled for, as discussed earlier). However, inevitably, a small proportion of positive experimental outcomes are simply due to chance effects that are never absolutely ruled out by the statistics.
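
The point that some ‘significant’ results arise by chance alone can be made concrete with a small simulation. The sketch below is illustrative and rests on simplifying assumptions that are not taken from any study discussed here: scores are normally distributed with a known standard deviation of 1, each simulated experiment has 30 learners per condition, and the treatment has no effect at all, so every significant result is by construction a false positive.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate_false_positive_rate(alpha, n_experiments=2000, n_per_group=30, seed=1):
    """Fraction of experiments with NO true effect that still reach p < alpha."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    # Standard error of a difference in means when both groups have SD = 1.
    standard_error = math.sqrt(2 / n_per_group)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups are drawn from the same distribution: the
        # hypothetical 'innovation' makes no difference to outcomes.
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / standard_error
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_experiments


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        rate = simulate_false_positive_rate(alpha)
        print(f"criterion p < {alpha}: {rate:.3f} of null experiments 'significant'")
```

Under these assumptions the proportion of spurious positives should lie close to the chosen criterion, so a stricter criterion admits fewer of them. The simulation says nothing about genuine effects that are missed, because no genuine effect is simulated.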
Choosing a more stringent confidence level as the criterion for significance (e.g., p<0.01) would reduce the incidence of false positives (see Figure 7), but would also lead to more genuine effects not reaching the cut-off (i.e., more false negatives). Given these various challenges to generalising from educational experiments, replication studies can be informative in building up the evidence base to support research-based practice.

[Figure 7 about here]

Figure 7: Choice of confidence level reflects a balance between admitting false positives (due to chance events) and false negatives (where real effects are not distinguished from chance events)

Replication in the natural sciences

There is a general principle in scientific research that experimental results need to be replicated before they are widely accepted. As suggested earlier, the natural sciences study so-called ‘natural kinds’ (LaPorte, 2004) where it is possible to generalise based on theoretical considerations. Millikan (1999, pp. 48-49) explains that in the case of many sciences, observations need to be made of only one or a very few exemplars of each kind studied in order to determine that certain properties are characteristic of the kind generally. If I have determined the boiling point of diethyl ether on one pure sample, then I have determined the boiling point of diethyl ether. If the experiment needs replication, this is not because some other sample of diethyl ether might have a different boiling point but because I may have made a mistake in measurement. Replication in science then is in part concerned with whether the published report fairly describes the work: was sufficient care taken in carrying out and reporting the research such that readers can take the published account as an accurate description of what happened, and therefore what will happen if the experimental conditions are recreated.
Educational studies might seem to be facing additional challenges, given, as discussed above, researchers cannot automatically assume that findings with one social kind (say, classes of 13-14 year old learners studying mechanics) are generalisable across the kind (from classes studying in Sweden, say, to classes studying in Singapore). Learning can be influenced by a wide range of factors, and teaching contexts vary considerably. Teachers looking to adopt evidence-based teaching practice work in very different institutions with their different norms, with students of different ages, and spreads of attainment (not to mention levels of interest and motivation), in a range of language and cultural contexts. Research that shows a particular technique, approach, or resource, seems to be effective in one classroom cannot be assumed to necessarily imply it should be adopted in other classrooms, with other teachers, working with different groups of students. Testing replicability across teaching contexts is therefore valuable. This seems, prima facie, quite different from the rationale for undertaking replication in the natural sciences. Yet research into scientific practices actually suggests that replication in science is usually subtler than the notion of simply attempting to precisely repeat the original experiment. It has been argued, based on both the examination of historical cases, and observations of contemporary scientific research, that follow-up studies are seldom straight replications (Collins, 1992; Shapin & Schaffer, 2011). Indeed, simple replications may be perceived as lacking the originality expected for reporting in top journals (Franco, Malhotra, & Simonovits, 2014). In practice, it seems replication in science does not necessarily require precise replication of conditions.
In the natural sciences, certainly the physical sciences, replication is more about extending and developing the original findings: can they be reproduced with modified apparatus, or with a wider range of materials, or under broader conditions? This offers a strong parallel with the situation in education.

Replication in a local educational context

Studies undertaken in education to replicate published experimental studies may be of two kinds, which have been labelled as theory-directed and context-directed (Taber, 2013a). As these labels suggest, theory-directed research is primarily intended to contribute generalisable knowledge to the research literature, whereas context-directed studies are concerned with improving the situation in a specific teaching context. Such context-directed studies are often carried out by teachers in their own classrooms, to address recognised issues and problems and improve some aspect of teaching and learning - perhaps using action research approaches (Hammersley, 2004). In context-directed studies, teachers may often adopt ideas from published (i.e., theory-directed) research to test out whether recommendations are transferable to the specific local context - asking questions of the form ‘would that work in this school?’; ‘…with this class?’; ‘…in teaching this topic?’, etcetera. As may be appreciated, the ‘burden of proof’ (i.e., the strength of a case argued from evidence built from the analysis of systematically collected data) is somewhat less demanding when the aim is to see if something works well in a particular teaching context, rather than to argue that it is likely to be effective across a wide range of contexts. In particular, in context-directed research there is no need to make a case for the representativeness or typicality of the classroom(s) where the study was carried out.
Some of the challenges to validity discussed earlier in this article cease to be relevant in context-directed studies. For example, if a teacher is enthusiastic about an innovation, believing it has great potential to improve teaching and learning, then this might bias the outcomes of any trial. However, in that particular context, any positive outcomes from a trial of the innovation reflect the actual conditions where practice will be informed by the trial - and as long as the teacher remains enthusiastic for the innovation, any positive gains observed may well be maintained. The particular context may be atypical - it may consist mainly of gifted learners, or of a high proportion of students studying science in a second language, or of learners in a special unit for school refusers, or of long-term medical patients being schooled in hospital wards… - but what matters is whether an innovation is effective in that context, rather than how likely it is that any results can suggest what might happen elsewhere.

Programmes of replication across diverse contexts

Studies that are theory-directed are intended to contribute to the research literature and seek to offer generalisable findings. Such studies are set up to go beyond finding out if something works in the particular context where the research was undertaken, to instead make a case for the specific findings being relevant more widely. As was suggested above, generalisation beyond the research site can never be simply assumed, but it is possible to design studies to strengthen the case that findings are of wider relevance. When there is a series of studies testing the same innovation, it is most useful if collectively they sample in a way that offers maximum information about the potential range of effectiveness of the innovation. There are clearly many factors that may be relevant.
It may be useful for replication studies of effective innovations to take place with groups of different socio-economic status, or in different countries with different curriculum contexts, or indeed in countries with different cultural norms (and perhaps very different class sizes; different access to laboratory facilities) and languages of instruction (Taber, 2012). It may be useful to test the range of effectiveness of some innovations in terms of the ages of students, or across a range of quite different science topics. Such decisions should be based on theoretical considerations. Given the large number of potentially relevant variables, there will be a great many combinations of possible sets of replication conditions. A large number of replications giving similar results within a small region of this ‘phase space’ means each new study adds little to the field. If all existing studies report positive outcomes, then it is most useful to select new samples that are as different as possible from those already tested. However, if replication contexts all simultaneously vary across a large number of factors, and outcomes vary widely (the innovation being more, less, or not at all effective in different studies), this may also offer limited guidance to teachers hoping to learn from the research. When existing studies suggest the innovation is effective in some contexts but not others, then the characteristics of the samples and contexts of published studies can be used to guide the selection of new samples and contexts (perhaps those judged as offering intermediate cases) that can help illuminate the boundaries of the range of effectiveness of the innovation. Progress in the field will then be best facilitated by a principled programme that deliberately builds upon published studies when selecting the contexts of further replications.
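As a rough illustration of the selection logic described above, the following Python sketch picks, from a set of candidate replication contexts, the one that differs most from its nearest already-tested context. Everything in it is a hypothetical assumption made for illustration: the factor names and values, the `distance` and `most_different_context` helpers, and the use of a simple count of differing factors as the measure of difference. It addresses only the case where existing studies all report positive outcomes; where outcomes are mixed, intermediate cases would be sought instead, and in any case such choices should rest on theoretical considerations rather than a mechanical count.

```python
# Illustrative sketch only: contexts are described by hypothetical categorical factors.

def distance(a, b):
    """Count the factors on which two contexts differ (a simple Hamming distance)."""
    return sum(a[factor] != b[factor] for factor in a)


def most_different_context(candidates, tested):
    """Return the candidate whose nearest already-tested context is furthest away."""
    return max(candidates, key=lambda c: min(distance(c, t) for t in tested))


# Hypothetical contexts of studies already published (all with positive outcomes).
tested = [
    {"age": "14-15", "topic": "electrochemistry", "language": "first", "class_size": "large"},
    {"age": "14-15", "topic": "metallic bonding", "language": "first", "class_size": "large"},
]

# Hypothetical contexts available to a research team planning a replication.
candidates = [
    {"age": "14-15", "topic": "acids and bases", "language": "first", "class_size": "large"},
    {"age": "11-12", "topic": "forces", "language": "second", "class_size": "small"},
]

choice = most_different_context(candidates, tested)
print(choice)
```

In this toy example the first candidate differs from the tested contexts only in topic, whereas the second differs on every factor, so the second is selected. Weighting factors by their theoretical importance, rather than counting them equally, would be an obvious refinement.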
Guidelines for supporting analytical or reader generalisation

This leads to two general guidelines for those seeking to undertake replications into innovations that have already been shown to be effective in published studies. The first concerns the theoretical justification for the importance of the study. So, for example, if an experimental study has already suggested that 11th grade students in one particular geographical location benefit from cooperative learning strategies when studying the topic of electrochemistry (Acar & Tarhan, 2007), then researchers carrying out a replication study in the same city with 9th grade students studying the topic of metallic bonding (Acar & Tarhan, 2008, see Table 1) might be expected to discuss in theoretical terms why this modest degree of shift in the context is likely to be informative. A second recommendation is that contexts need to be well-characterised. If researchers carefully consider the results of previous trials of an innovation in relation to the specific contexts of those studies when planning their own research, then the community of researchers can collectively build up a body of research which incrementally explores the range of effectiveness of different innovations. For this to occur, it is important that reports of teaching experiments are sufficiently detailed, not just in terms of technical matters, but also in terms of the specific teaching and learning contexts where the work takes place. Given that such programmes can only explore the multidimensional extent of the range of effectiveness of a particular innovation incrementally, offering detailed contextual background to such studies can also support what has been labelled reader generalisability.
Teachers reading research reports that offer ‘thick description’ (Geertz, 1973) of the research context are put in a strong position to answer the question ‘how similar is the context of this study to my own teaching situation?’, which may inform a decision about whether to try out the innovation in the teacher’s own classroom (a context-directed study). This is referred to as reader generalisation (Kvale, 1996). This is a point often made in discussions of studies analysing qualitative data, and in particular case studies (Stake, 1995), which do not offer traditional forms of generalisability (Taber, 2000). Part of the inherent logic of the selection of case study methodology is that each case is unique (an idiosyncratic constellation of positions on a wide range of interacting variables) and embedded in a wider context, and so an examination of a single case detailed enough to explore interactions between features can be informative. Where cases are reported in detail, reader generalisation is supported - and the use of carefully selected multiple cases allows comparisons that may reveal general patterns (Stake, 2006).

The argument here, then, is that large-scale RCTs that use representative samples from populations of interest are necessarily rare in education. What are more common are individual small-scale experiments that cannot be considered to offer highly generalisable results. Despite this, where these individual studies are seen as being akin to case studies (and reported in sufficient detail) they can collectively build up a useful account of the range of application of tested innovations. That is, some inherent limitations of small-scale experimental studies can be mitigated across series of studies, but this is most effective when individual studies offer thick description of teaching contexts and when contexts for ‘replication’ studies are selected to best complement previous studies.
Planning ethical comparison conditions in replication studies

This article has reviewed some key themes relating to the challenges in designing experimental studies into teaching innovations. It is clear that whilst experimental studies can be very informative, researchers have to make a wide range of decisions in setting up an experimental study, and justify these decisions when publishing reports of their work. Considering the range of potential threats to the validity of educational experiments, as discussed above, it seems unsurprising that most published studies offer results that are subject to caveats or may offer limited grounds for broad generalisation beyond the original context. Seeing individual studies as part of the incremental build-up of evidence for the general effectiveness of an approach allows users of research to acknowledge the limitations of individual studies, but come to a view based on a wider body of work.

Some of the decision-making required in designing studies is complex and subtle. It is understandable therefore that a reader may conceptualise studies quite differently from their authors, and so may potentially evaluate some of those decisions quite critically. The reader stands outside many practical and contextual considerations that influenced the researchers. Such criticism should therefore be offered with some humility, and understanding, but may still be important where it has potential to beneficially influence future practice. In this regard, I will here argue that in recent years a particular tradition has developed of experimental studies into aspects of science teaching that are being conceptualised in a way which (a) undermines their potential to contribute to the field, and (b) tends to systematically disadvantage participants assigned to control conditions. I will refer to these as ‘rhetorical’ experiments (see Figure 8).
It is hoped that, by drawing attention to this issue, researchers can be persuaded to shift their conceptualisation of these studies, and will modify their designs (as recommended below) when planning future research.

[Figure 8 about here]

Figure 8: Rhetorical experiments are intended to demonstrate that a well-tested teaching approach works in a very specific context

Rhetorical experiments

The labelling of these studies as ‘rhetorical experiments’ can be understood by analogy with many of the ‘experiments’ that school children carry out in school science - those laboratory practical activities labelled ‘experiments’ that are actually demonstrations of well-characterised effects clearly described in the students’ textbooks - as part of learning a “rhetoric of conclusions” (Schwab, 1958). These would be genuine experiments for the students if they had no strong expectations of the outcomes in advance, but often the practicals are undertaken after the relevant theory has been taught, rather than in advance to provide ‘epistemic relevance’ to motivate learning the scientific ideas (Taber, 2015), and the practical may even be entitled ‘an experiment to show …’. I am suggesting that some of the experimental studies reported in the literature are rhetorical in the parallel sense that the researchers clearly expect to demonstrate a well-established effect, albeit in a specific context where it has not previously been demonstrated. The general form of the question ‘will this much-tested teaching approach also work here?’ is clearly set up expecting the answer ‘yes’. Indeed, the control condition may be chosen to give the experiment the best possible chance of producing a positive outcome for the experimental treatment. Clearly all studies have unique elements, but Figure 8 represents the general logic of many of these rhetorical experiments.
In terms of the analysis offered earlier in this article, such studies are replications, but often made without any strong grounds for suspecting that the context chosen for the study provides a real test for the teaching innovation. That is, although the particular innovation may not have been tested in that specific context, given the range of prior studies showing it to be widely effective there is no strong reason to suspect that this particular context is sufficiently different from those where the effectiveness has already been demonstrated to motivate reasonable doubts about the outcome of the new study. This may be clear from the published reports themselves. Some examples of rhetorical studies of this kind are presented in Table 5. What is noteworthy is that as part of the conceptual framework justifying the research readers are told fairly unequivocally that the teaching approach to be tested has already been shown to be clearly superior to (what is sometimes termed) ‘traditional’ teaching, yet the researchers then seek to test this in a specific context where they set up a control treatment that reflects the very traditional conditions that they have already told readers are ineffective for achieving learning objectives.

[Table 5 about here]

Table 5: Some research studies including control conditions that the researchers claim are already known to be ineffective teaching treatments

Avoiding detrimental control conditions

This raises an ethical issue in such studies that, given the current state of knowledge prior to the research, the researchers employ a control treatment that is considered to be of limited educational value. Students in the control condition are expected to be disadvantaged compared to those in the experimental condition.
Authors often justify this by reporting that the suboptimal conditions set up for the control are just what these students would experience anyway, and so they are not disadvantaged compared to not being in the study. That is only so if authors are correct that ‘traditional’ teaching, with no elements of more ‘progressive’ approaches, is endemic in the local context. Whilst studies may present traditional and progressive teaching as being a dichotomy, actual observations of teachers’ classroom practice suggest practice is more nuanced and reflects a blend of these two extremes (Bektas & Taber, 2009). These rhetorical studies nominally have level 2 controls (see Table 2) but if a teacher of a control class is asked to “transmit information to students, who receive and memorise it”, with “no consideration of the students’ existing conceptions”, and where learners are ‘passive’ (see Table 5 for examples) then this may actively prevent teachers from engaging any progressive elements that might be part of their normal teaching repertoires. So, these experiments may in practice be better designated as having ‘level 2- (two minus)’ controls (cf. Table 3).

Quite a few studies of this kind have been reported from Turkey (perhaps unsurprising as it is now one of the most active nations in science education research) where ‘reform’ teaching along constructivist lines has been recommended for many years now (Gözütok, 2013). These recommendations have been supported by government policy, changes in teacher education, and a great many studies demonstrating how reform-based teaching can improve learning outcomes. Despite this, study authors often argue that this has not widely impacted teaching practice, and so employing ‘traditional’ teaching as a control treatment is not detrimental to study participants compared with not taking part in the research.
If this is indeed so, then it seems unlikely that one more study demonstrating the greater effectiveness of some progressive teaching approach will persuade teachers in that context to change teaching practices. If researchers are planning studies of this type because they hope to act as catalysts for change, then this strategy is not working.

Good practice in selecting productive control treatments

The framework for thinking about experimental studies into teaching developed in this article suggests a different approach is indicated. Even if it is accepted that control conditions used in rhetorical experiments of this kind do not offer any less educational value than the teaching the particular learners would experience normally, educational researchers who wish to influence teaching practice should decline to adopt such conditions in their studies. In these rhetorical experiments, teachers assigned the experimental classes are prepared to teach using research-informed approaches aligned with reform policies (‘are prepared to’ both as in ‘are trained up to’, and as in ‘are willing to’), so researchers are certainly able to demonstrate their success in showing individual teachers both that they can teach in these ways, and that such approaches can be effective with their classes. Acar and Tarhan (2007) comment on the teacher in their study that “because she was experienced on active learning, she adapted the study easily” and as part of her preparation for working with the intervention group, she “was informed about the misconceptions related to electrochemistry and told about which activities had been developed to prevent which misconceptions” (p.353). Yet she was asked to teach the parallel control class “without consideration for student misconceptions”.
So, a teacher experienced in reform teaching approaches was asked to restrict her professional practice to the detriment of her students, so as to artificially produce a control condition where learning was likely to be limited. Researchers in these educational contexts should therefore seriously consider looking to abandon testing well-established innovations in new contexts by using nominally level 2 (and perhaps actually level 2-) control conditions, and to instead plan studies with level 3 control conditions (see Table 4). If researchers are working in a context where teachers are expected to adopt ‘reform’ teaching approaches, then researchers should not undermine this by accepting teaching treatments in control conditions that clearly do not meet the expected educational standards (and so simply demonstrate, once again, the substandard nature of such teaching). Rather, educational researchers should act as change agents, training up teachers to offer a range of well-tested teaching approaches in their classes, and then seeking to compare between these to explore which of these superior approaches works best in teaching particular groups of students specific aspects of the curriculum.

Conclusions

This article has reviewed some key issues in designing and interpreting experimental studies intended to test different teaching innovations. Experimental research employing statistical tools is often seen as being more objective than studies based on interpretation of qualitative data, and findings quantified in terms of effect sizes and p values seem to offer definitive results. Yet, all research choices (e.g., how to implement an intervention, how to operationalise a variable, which instruments to use to collect data) involve interpretations, and most studies in education involve some compromises on ideal research designs.
Few experiments in education offer large randomly selected or truly representative samples from clearly defined and identified populations, and even such ideal cases can be subject to some potential threats to validity that randomisation cannot overcome. This certainly does not imply that experiments are not useful, but they are best seen as most informative alongside other types of studies that have complementary strengths and weaknesses (Taber, 2009) - for example studies that collect detailed data exploring classroom processes.

Experimental research of the kind reviewed in this article tests a specific hypothesis about the potential effect of some specific treatment (such as a particular pedagogy or teaching resource). The hypothesis will be based on some theoretical model of how some variable has a causal influence on outcomes of interest (e.g., how pedagogy influences learning). Even when a hypothesis is supported by statistical analysis, that analysis offers no direct support for concluding that the conjectured causal mechanism explains the outcome. Teaching and learning are complex phenomena. As an example, it may be conjectured that implementing a form of problem-based learning could lead to increases in school test scores because students show greater engagement in classes due to higher motivation, or because it allows a level of peer interaction providing scaffolding of learning, or because it involves high-level thinking skills, or because the group work involved facilitates a more productive kind of discourse, or … A simple experimental study comparing teaching treatments and test scores and finding the problem-based learning condition resulted in significantly better outcomes could not distinguish which mechanism was at work.
It is possible several such mechanisms are operating, perhaps synergistically: if students are more motivated and better engaged then they are more open to working outside their existing areas of competence where scaffolding may be effective, and may be more open to productive exploratory discourse - and so forth. Studies that collect data on a wide range of process variables can be used to construct mathematical models using techniques such as structural equation modelling which offer insights into such complex situations (Schreiber, Nora, Stage, Barlow, & King, 2006), but these studies require more extensive quantitative data (as well as expertise in the methods) than simple experiments, and still require advance knowledge of the variables that will be measured and included in a model.

Processes can also be investigated by ‘qualitative’ studies using more interpretivist modes of enquiry. Studies that observe teaching, collect classroom talk, and interview teachers and students, can offer valuable indications of productive educational processes (Duit, Roth, Komorek, & Wilbers, 1998; Petri & Niedderer, 1998). These studies may suffer a complementary weakness to experimental studies: factors identified as salient in qualitative data may not always have a substantive influence on educational outcomes (that needs to be tested); just as showing a specific educational treatment is effective does not imply understanding the causal mechanism at work (an unidentified, confounding, factor could be the cause). Exploratory interpretive studies can be open to considering multiple explanations and to adopting a range of theoretical perspectives to support data analysis (Taber, 2008).
Progressing a research programme may then be supported by complementing experimental studies with more interpretive work that can both suggest hypotheses to test experimentally and also question whether the assumed mechanisms underpinning experimental hypotheses seem feasible in terms of what is actually observed in different treatment conditions.

For readers to fully evaluate the implications of experimental studies it is important that authors offer clarity about the units of analysis, the population sampled, what (if anything) has been assigned randomly, and the method used to achieve any randomisation, as well as detailed accounts of the different treatments. As small-scale studies undertaken in particular contexts offer limited inherent generalisability, these should be planned with careful consideration of how they will add to the body of studies testing that particular type of innovation and so contribute to a better understanding of its range of effectiveness. That decision requires a careful examination of both the outcomes and contexts of existing studies to determine what, if any, patterns can be identified for the range of application of the innovation. When researchers report such studies, they should explain the choice of research site and classroom context to help readers appreciate how the new study adds substantially to those previously reported. Context-directed research carried out by teachers in their own classrooms can be justified by the general research question ‘will this widely-tested innovation be effective in this particular very specific context where I teach?’ (Taber, 2013a), but in published research authors should also explain why the particular context has been chosen to be of theoretical interest. A particular issue arising from the studies reviewed is the choice of control conditions.
Comparing an innovation against standard practice is appropriate when the likely effectiveness of the innovation is genuinely uncertain, but when researchers test an approach that has already been widely demonstrated as effective across a broad range of contexts then it is usually more informative to compare it with a treatment already recognised as good practice. The use of control conditions that reflect teaching that the researchers themselves believe is ineffective, or which is incompatible with local educational policies, should be avoided. Given the current state of knowledge about teaching and learning (Bransford, Brown, & Cocking, 2000; NGSS Lead States, 2013), it seems unlikely that many teachers have classroom practice which fully matches the caricature of ‘traditional’, ‘teacher-centred’ practice. Therefore, asking teachers to teach control groups this way (often whilst simultaneously demonstrating competence in much more progressive practice in teaching an intervention group) is difficult to justify ethically or logically.

It is hoped that this review will provide a framework for reading reports for teachers who may wish to draw upon the research literature to identify innovations that they might consider adopting or testing in their own classrooms, as well as raising some issues that researchers themselves may usefully reflect upon when deciding when to employ an experimental design, or planning an experimental study.

References

Abdi, A. (2014). The Effect of Inquiry-based Learning Method on Students’ Academic Achievement in Science Course. Universal Journal of Educational Research, 2, 37-41. doi:10.13189/ujer.2014.020104
Acar, B., & Tarhan, L. (2007). Effect of Cooperative Learning Strategies on Students' Understanding of Concepts in Electrochemistry.
International Journal of Science and Mathematics Education, 5(2), 349-373. doi:10.1007/s10763-006-9046-7
Acar, B., & Tarhan, L. (2008). Effects of Cooperative Learning on Students’ Understanding of Metallic Bonding. Research in Science Education, 38(4), 401–420. doi:10.1007/s11165-007-9054-9
Adey, P. (1999). The Science of Thinking, and Science For Thinking: a description of Cognitive Acceleration through Science Education (CASE). Geneva: International Bureau of Education (UNESCO).
Adey, P., & Shayer, M. (2002). An exploration of long-term far-transfer effects following an extended intervention program in the high school science curriculum. In C. Desforges & R. Fox (Eds.), Teaching and Learning: The Essential Readings (pp. 173-209). Oxford: Blackwell Publishing.
Al-Rawahi, N. M., & Al-Balushi, S. M. (2015). The Effect of Reflective Science Journal Writing on Students’ Self-Regulated Learning Strategies. International Journal of Environmental & Science Education, 10(3), 367-379.
Ausubel, D. P. (1978). In Defense of Advance Organizers: A Reply to the Critics. Review of Educational Research, 48(2), 251-257.
Barab, S. A., & Luehmann, A. L. (2003). Building sustainable science curriculum: Acknowledging and accommodating local adaptation. Science Education, 87(4), 454-467. doi:10.1002/sce.10083
Bektas, O., & Taber, K. S. (2009). Can science pedagogy in English schools inform educational reform in Turkey? Exploring the extent of constructivist teaching in a curriculum context informed by constructivist principles. Journal of Turkish Science Education, 6(3), 66-80.
Berger, R., & Hänze, M. (2015). Impact of Expert Teaching Quality on Novice Academic Performance in the Jigsaw Cooperative Learning Method. International Journal of Science Education, 37(2), 294-320. doi:10.1080/09500693.2014.985757
Bramwell-Lalor, S., & Rainford, M. (2013). The Effects of Using Concept Mapping for Improving Advanced Level Biology Students' Lower- and Higher-Order Cognitive Skills.
International Journal of Science Education, 36(5), 839-864. doi:10.1080/09500693.2013.829255
Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). (2000). How People Learn: Brain, mind, experience & school (Expanded ed.). Washington DC: National Academy Press.
British Educational Research Association. (2018). Ethical Guidelines for Educational Research (4th ed.). London: British Educational Research Association.
Bunterm, T., Lee, K., Ng Lan Kong, J., Srikoon, S., Vangpoomyai, P., Rattanavongsa, J., & Rachahoon, G. (2014). Do Different Levels of Inquiry Lead to Different Learning Outcomes? A comparison between guided and structured inquiry. International Journal of Science Education, 1-23. doi:10.1080/09500693.2014.886347
Çam, A., & Geban, Ö. (2011). Effectiveness of Case-Based Learning Instruction on Epistemological Beliefs and Attitudes Toward Chemistry. Journal of Science Education and Technology, 20(1), 26-32. doi:10.1007/s10956-010-9231-x
Chen, S., Chang, W.-H., Lai, C.-H., & Tsai, C.-Y. (2014). A Comparison of Students’ Approaches to Inquiry, Conceptual Learning, and Attitudes in Simulation-Based and Microcomputer-Based Laboratories. Science Education, 98(5), 905-935. doi:10.1002/sce.21126
Collins, H. (1992). Changing order: Replication and induction in scientific practice: University of Chicago Press.
Dorman, J. P. (2012). The impact of student clustering on the results of statistical tests. In B. J. Fraser, K. G. Tobin, & C. J. McRobbie (Eds.), Second International Handbook of Science Education (Vol. 2, pp. 1333-1348). Dordrecht: Springer.
Dudai, Y., & Eisenberg, M. (2004). Rites of Passage of the Engram: Reconsolidation and the Lingering Consolidation Hypothesis. Neuron, 44(1), 93-100. doi:10.1016/j.neuron.2004.09.003
Duit, R., Roth, W.-M., Komorek, M., & Wilbers, J. (1998). Conceptual change cum discourse analysis to understand cognition in a unit on chaotic systems: towards an integrative perspective on learning in science.
International Journal of Science Education, 20(9), 1059-1073.
Fabro, S., Smith, R. L., & Williams, R. T. (1967). Toxicity and Teratogenicity of Optical Isomers of Thalidomide. Nature, 215, 296. doi:10.1038/215296a0
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502-1505. doi:10.1126/science.1255484
Gauld, C. (1989). A study of pupils’ responses to empirical evidence. In R. Millar (Ed.), Doing Science: images of science in science education (pp. 62-82). London: The Falmer Press.
Geertz, C. (1973). Thick Description: Toward an Interpretive Theory of Culture. In The Interpretation of Cultures: Selected Essays (pp. 3-30). New York: Basic Books.
Gidena, A., & Gebeyehu, D. (2017). The effectiveness of advance organiser model on students’ academic achievement in learning work and energy. International Journal of Science Education, 39(16), 2226-2242. doi:10.1080/09500693.2017.1369600
Goldacre, B. (2013). Building Evidence into Education. Retrieved from London: http://media.education.gov.uk/assets/files/pdf/b/ben%20goldacre%20paper.pdf
Goswami, U. (2008). Cognitive Development: The Learning Brain. Hove, East Sussex: Psychology Press.
Gözütok, F. D. (2013). Curriculum Studies in Turkey Since 2000. In W. F. Pinar (Ed.), International Handbook of Curriculum Research: Routledge.
Grooms, J., Sampson, V., & Golden, B. (2014). Comparing the Effectiveness of Verification and Inquiry Laboratories in Supporting Undergraduate Science Students in Constructing Arguments Around Socioscientific Issues. International Journal of Science Education, 36(9), 1412-1433. doi:10.1080/09500693.2014.891160
Günter, T., & Alpat, S. K. (2017). What is the Effect of Case-Based Learning on the Academic Achievement of Students on the Topic of “Biochemical Oxygen Demand?”. Research in Science Education. doi:10.1007/s11165-017-9672-9
Guthrie, J. T. (1977).
Research Views: Follow through: A Compensatory Education Experiment. The Reading Teacher, 31(2), 240-244.
Hammersley, M. (2004). Action research: a contradiction in terms? Oxford Review of Education, 30(2), 165-181.
Hong, Z.-R., Lin, H.-s., Chen, H.-T., Wang, H.-H., & Lin, C.-J. (2013). The Effects of Aesthetic Science Activities on Improving At-Risk Families Children's Anxiety About Learning Science and Positive Thinking. International Journal of Science Education, 36(2), 216-243. doi:10.1080/09500693.2012.758394
Howe, C., Ilie, S., Guardia, P., Hofmann, R., Mercer, N., & Riga, F. (2014). Principled Improvement in Science: Forces and proportional relations in early secondary-school teaching. International Journal of Science Education, 37(1), 162-184. doi:10.1080/09500693.2014.975168
Knorr Cetina, K. (1999). Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, Massachusetts: Harvard University Press.
Koksal, E. A., & Berberoglu, G. (2014). The Effect of Guided-Inquiry Instruction on 6th Grade Turkish Students' Achievement, Science Process Skills, and Attitudes Toward Science. International Journal of Science Education, 36(1), 66-78. doi:10.1080/09500693.2012.721942
Kvale, S. (1996). InterViews: An introduction to qualitative research interviewing. Thousand Oaks, California: Sage Publications.
LaPorte, J. (2004). Natural Kinds and Conceptual Change. Cambridge: Cambridge University Press.
Leuchter, M., Saalbach, H., & Hardy, I. (2014). Designing Science Learning in the First Years of Schooling. An intervention study with sequenced learning material on the topic of 'floating and sinking'. International Journal of Science Education, 36(10), 1751-1771. doi:10.1080/09500693.2013.878482
Millikan, R. G. (1999). Historical Kinds and the “Special Sciences”. Philosophical Studies, 95(1), 45-65. doi:10.1023/a:1004532016219
Moore, L., Graham, A., & Diamond, I. (2003).
On the Feasibility of Conducting Randomised Trials in Education: Case Study of a Sex Education Intervention. British Educational Research Journal, 29(5), 673-689. doi:10.2307/1502117 46 Mortimer, E. F., & Scott, P. H. (2003). Meaning Making in Secondary Science Classrooms. Maidenhead: Open University Press. National Research Council Committee on Scientific Principles for Educational Research. (2002). Scientific Research in Education. Washington DC: National Academies Press. Next Generation Science Standards: For States, By States. (2013). The National Academies Press. NGSS Lead States. (2013). Next Generation Science Standards: For States, By States: The National Academies Press. O'Donnell, C. L. (2008). Defining, Conceptualizing, and Measuring Fidelity of Implementation and Its Relationship to Outcomes in K–12 Curriculum Intervention Research. Review of Educational Research, 78(1), 33-84. doi:10.3102/0034654307313793 Park, S., & Oliver, J. S. (2008). Revisiting the Conceptualisation of Pedagogical Content Knowledge (PCK): PCK as a Conceptual Tool to Understand Teachers as Professionals. Research in Science Education, 38(3), 261-284. doi:10.1007/s11165-007-9049-6 Petri, J., & Niedderer, H. (1998). A learning pathway in high-school level quantum atomic physics. International Journal of Science Education, 20(9), 1075-1088. Piaget, J. (1970/1972). The Principles of Genetic Epistemology (W. Mays, Trans.). London: Routledge & Kegan Paul. Popper, K. R. (1989). Conjectures and Refutations: The Growth of Scientific Knowledge, (5th ed.). London: Routledge. Pring, R. (2000). Philosophy of Educational Research. London: Continuum. Rosenthal, R. (2003). Covert Communication in Laboratories, Classrooms, and the Truly Real World. Current Directions in Psychological Science, 12(5), 151-154. doi: 10.1111/1467-8721.t01-1-01250 Rosenthal, R., & Jacobson, L. (1968). Pygmalion in the classroom: teacher expectation and pupils’ intellectual development. 
New York: Holt, Rinehart & Winston. Rosenthal, R., & Jacobson, L. (1970). Teacher's expectations. In L. Hudson (Ed.), The Ecology of Human Intelligence (pp. 177-181). Harmondsworth: Penguin. Rosenthal, R., & Rubin, D. B. (1978). Interpersonal expectancy effects: the first 345 studies. Behavioral and Brain Sciences, 1, 377-386. doi:10.1017/S0140525X00075506 Ruthven, K., Mercer, N., Taber, K. S., Guardia, P., Hofmann, R., Ilie, S., . . . Riga, F. (2016). A research- informed dialogic-teaching approach to early secondary-school mathematics and science: the pedagogical design and field trial of the epiSTEMe intervention. Research Papers in Education, 32(1), 18-40. doi:10.1080/02671522.2015.1129642 Schreiber, J. B., Nora, A., Stage, F. K., Barlow, E. A., & King, J. (2006). Reporting Structural Equation Modeling and Confirmatory Factor Analysis Results: A Review. The Journal of Educational Research, 99(6), 323-338. doi:10.3200/JOER.99.6.323-338 Schwab, J. J. (1958). The Teaching of Science as Inquiry. Bulletin of the Atomic Scientists, 14(9), 374-379. doi:10.1080/00963402.1958.11453895 Sesen, B. A., & Tarhan, L. (2011). Active-learning versus teacher-centered instruction for learning acids and bases. Research in Science & Technological Education, 29(2), 205-226. doi: 10.1080/02635143.2011.581630 Shapin, S., & Schaffer, S. (2011). Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life: Princeton University Press. 47 Siegler, R. S. (2004). U-Shaped Interest in U-Shaped Development-and What It Means. Journal of Cognition and Development, 5(1), 1-10. doi:10.1207/s15327647jcd0501_1 Stake, R. E. (1995). The Art of Case Study Research. Thousand Oaks, California: Sage. Stake, R. E. (2006). Multiple Case Study Analysis. New York: The Guilford Press. Sullivan, G. M., & Feinn, R. (2012). Using Effect Size—or Why the P Value Is Not Enough. Journal of Graduate Medical Education, 4(3), 279-282. doi:10.4300/JGME-D-12-00156.1 Taber, K. S. (2000). 
Case studies and generalisability - grounded theory and research in science education. International Journal of Science Education, 22(5), 469-487. Taber, K. S. (2003). Lost without trace or not brought to mind? - a case study of remembering and forgetting of college science. Chemistry Education: Research and Practice, 4(3), 249-277. Taber, K. S. (2008). Of Models, Mermaids and Methods: The role of analytical pluralism in understanding student learning in science. In I. V. Eriksson (Ed.), Science Education in the 21st Century (pp. 69-106). Hauppauge, New York: Nova Science Publishers. Taber, K. S. (2009). Progressing Science Education: Constructing the scientific research programme into the contingent nature of learning science. Dordrecht: Springer. Taber, K. S. (2012). Vive la différence? Comparing ‘like with like’ in studies of learners’ ideas in diverse educational contexts. Educational Research International, 2012(Article 168741), 1-12. Retrieved from http://www.hindawi.com/journals/edu/2012/168741/ doi: 10.1155/2012/168741 Taber, K. S. (2013a). Classroom-based Research and Evidence-based Practice: An introduction (2nd ed.). London: Sage. Taber, K. S. (2013b). Modelling Learners and Learning in Science Education: Developing representations of concepts, conceptual structure and conceptual change to inform teaching and research. Dordrecht: Springer. Taber, K. S. (2013c). Non-random thoughts about research. Chemistry Education Research and Practice, 14(4), 359-362. doi:10.1039/c3rp90009f Taber, K. S. (2014a). Ethical considerations of chemistry education research involving ''human subjects''. Chemistry Education Research and Practice, 15(2), 109-113. doi:10.1039/C4RP90003K Taber, K. S. (2014b). Methodological issues in science education research: a perspective from the philosophy of science. In M. R. Matthews (Ed.), International Handbook of Research in History, Philosophy and Science Teaching (Vol. 3, pp. 1839-1893). Dordrecht: Springer Netherlands. Taber, K. S. 
(2015). Epistemic relevance and learning chemistry in an academic context. In I. Eilks & A. Hofstein (Eds.), Relevant Chemistry Education: From Theory to Practice (pp. 79-100). Rotterdam: Sense Publishers. Taber, K. S. (2018). Scaffolding learning: principles for effective teaching and the design of classroom resources. In M. Abend (Ed.), Effective Teaching and Learning: Perspectives, strategies and implementation (pp. 1-43). New York: Nova Science Publishers. Taber, K. S., Ruthven, K., Howe, C., Mercer, N., Riga, F., Hofmann, R., & Luthman, S. (2015). Developing a research-informed teaching module for learning about electrical circuits at lower secondary school level: supporting personal learning about science and the nature of science. In Information Resources Management Association (Ed.), K-12 STEM Education: Breakthroughs in Research and Practice (Vol. 1, pp. 1-28). Hershey, Pennsylvania: IGI Global. Taber, K. S., Ruthven, K., Mercer, N., Riga, F., Luthman, S., & Hofmann, R. (2016). Developing teaching with an explicit focus on scientific thinking. School Science Review, 97(361), 75-84. 48 Taşlidere, E. (2013). The Effect of Concept Cartoon Worksheets on Students' Conceptual Understandings of Geometrical Optics. Education & Science/Egitim ve Bilim, 38(167). Tüysüz, C. (2010). The Effect of the Virtual Laboratory on Students' Achievement and Attitude in Chemistry. International Online Journal of Educational Sciences, 2(1), 37-53. van Driel, J. H., Beijaard, D., & Verloop, N. (2001). Professional development and reform in science education: The role of teachers' practical knowledge. Journal of Research in Science Teaching, 38(2), 137-158. doi:doi:10.1002/1098-2736(200102)38:2<137::AID-TEA1001>3.0.CO;2-U Vygotsky, L. S. (1934/1986). Thought and Language (A. Kozulin, E. Hanfmann, & G. Vakar, Trans.). London: MIT Press. Vygotsky, L. S. (1978). Mind in Society: The development of higher psychological processes. Cambridge, Massachusetts: Harvard University Press. 
Wood, D. (1988). How Children Think and Learn: the social contexts of cognitive development. Oxford: Blackwell. Yin, Y., Tomita, M. K., & Shavelson, R. J. (2013). Using Formal Embedded Formative Assessments Aligned with a Short-Term Learning Progression to Promote Conceptual Change and Achievement in Science. International Journal of Science Education, 36(4), 531-552. doi: 10.1080/09500693.2013.787556 Endnote: In this article, the convention in British English spellings (preferred in Studies in Science Education) to use ‘enquiry’ as 1 the normal spelling for the general process of investigating (as the term ‘inquiry’ is usually reserved for formal proceedings) is followed. This usage is different to the convention with American English spellings. Where works cited use the alternative American spelling, ‘inquiry’, this has been retained in direct quotations. 49 Experimental research into teaching innovations: responding to methodological and ethical challenges Tables and figures
Table 1: A sample of published experimental studies testing teaching innovations

| Citation | Independent variable | Dependent variable(s) | Sample | Randomisation |
| --- | --- | --- | --- | --- |
| Abdi, 2014 | Inquiry-based learning | Students’ academic achievement in science | 40 5th-grade students from one primary school in Kermanshah, Iran | Two intact classes (n=20, 20) assigned to conditions (same teacher) |
| Acar & Tarhan, 2007 | Cooperative learning | Understanding of concepts in electrochemistry | 41 11th-grade students from two science classes in a high school in Izmir, in Turkey | Two intact classes (n=20, 21) assigned to two conditions (same teacher) |
| Acar & Tarhan, 2008 | Cooperative learning | Students’ understanding of metallic bonding | 57 9th-grade science students from two science classes in a high school in Izmir, in Turkey | Two intact classes (n=28, 29) assigned to two conditions (same teacher) |
| Al-Rawahi & Al-Balushi, 2015 | Reflective science journal writing | Self-regulated learning strategies | 62 10th-grade students from a public female school in the Ad Dakhiliyah region in Oman | Two intact classes (n=32, 30) assigned to two conditions (same teacher) |
| Berger & Hänze, 2015 | Expert teaching quality (jigsaw teaching) | Novice academic performance | 129 12th-grade students in nine physics classes from 7 schools in Germany | Students assigned to groups - students acted as both novices and experts during project |
| Bramwell-Lalor & Rainford, 2013 | Concept mapping as a formative assessment tool | Advanced level biology students' cognitive skills | 156 A level biology students from (three or more *) schools in Jamaica (* details only provided for experimental group) | None reported. (Intact classes. Three teachers and 90 students in experimental group; five other teachers and 66 students in control condition.) |
| Bunterm, et al., 2014 | Form of guidance provided 5E learning cycle model | Science content and process skills | 183 10th-grade and 56 7th-grade students from three schools in North-Eastern Thailand | Two intact classes assigned in each school (n=42, 44; 49, 48; 27, 29). Within each school, one teacher taught both classes |
| Çam & Geban, 2011 | Effectiveness of case-based learning instruction | Epistemological beliefs and attitudes toward chemistry | 63 11th-grade students from two classes of an urban high school in Turkey | Two intact classes (n=28, 35) assigned to conditions (same teacher) |
| Chen, Chang, Lai & Tsai, 2014 | Form of instructional materials | Physics learning, enquiry behaviours, student enjoyment and engagement | 68 11th-grade students in two physics classes at an urban high school in Taipei, Taiwan | Two intact classes (n=32, 36) randomly assigned to conditions (same teacher) |
| Gidena & Gebeyehu, 2017 | Effectiveness of the advance organiser model | Students’ academic achievement in learning about work and energy | 139 11th-grade natural science students from a preparatory school, in Northern-West zone of Tigray region, Ethiopia | Two intact classes (n=46, 46) assigned to conditions (same teacher) |
| Grooms, Sampson & Golden, 2014 | Enquiry-based undergraduate laboratories in relation to | Students’ abilities to construct arguments relating to socioscientific issues | 73 chemistry undergraduates from a two-year community college; and 79 chemistry undergraduates from a four-year university; in the same city in the Southeastern USA | None. (College students made up intervention; and university students the comparison condition.) |
| Günter & Alpat, 2017 | Case-based learning | Academic achievement of students on the topic of biochemical oxygen demand | 18 4th or 5th year undergraduates attending the chemistry teaching programme in a university in Izmir, Turkey | Students randomly assigned to conditions (n=10, 8) |
| Hong, Lin, Chen, Wang & Lin, 2013 | Aesthetic science activities | At-risk families children’s anxiety about learning science and positive thinking | 133 4th-grade school children from two elementary schools in the Chi-Jin district of Kaohsiung city in Taiwan | 36 children volunteered for the intervention; “97 typical 4th graders were randomly selected as the comparison group” (p. 222) |
| Leuchter, Saalbach & Hardy, 2014 | Structured learning materials | Understanding of floating and sinking | 15 classes (244 children) age 4-9 years plus 2 classes (22 children) as a control group in Central Switzerland | No randomisation reported |
| Moore, Graham & Diamond, 2003 | Teacher-led intervention | Teenagers' knowledge of emergency contraception | 24 schools in Avon, South-West England, that responded to an invitation to all 49 eligible schools to partake in the study | 12 schools assigned to each condition |
| Ruthven et al., 2016 | Design of teaching units | Learning and attitudes | 11-12 year old pupils in 70 intact classes in schools in Eastern England | 25 schools assigned to two conditions |
| Sesen & Tarhan, 2011 | Active-learning versus teacher-centered instruction | Learning acids and bases | 45 [sic] high-school students (average age 17 years) from two different classes in a high school in Turkey | Two intact classes (n=21, 25) assigned to conditions (same teacher) |
| Taşlidere, 2013 | Concept cartoon worksheets | Students' conceptual understanding of geometrical optics | 121 pre-service science teachers, sophomores (2nd year undergraduates), taking General Physics-III at a state university in Turkey | Two intact classes (n=63, 58) were assigned to each condition (same lecturer) |
| Tüysüz, 2010 | Virtual laboratory | Students’ achievement and attitude in chemistry | 341 9th-grade high school students in Turkey | Students divided into two groups (n=174, 167) |
| Yin, Tomita & Shavelson, 2013 | Learning progression-aligned formal embedded formative assessment | Conceptual change and achievement in middle-school science | 52 6th-graders from a university laboratory school in Honolulu, Hawaii | Students assigned to conditions (n=26, 26) |

Table 2: Distinct levels of control in experimental designs according to the nature of the educational ‘treatment’ experienced by the control or comparison group

| Type | Experimental group | Control/comparison group | Purpose |
| --- | --- | --- | --- |
| Level 1: treatment vs. no treatment | A treatment is applied which is hypothesised to have an educational effect | Outcomes for the experimental group are compared with outcomes for a matched group not receiving any relevant educational treatment | To test whether a particular form of treatment leads to educationally desirable outcomes |
| Level 2: innovation vs. standard treatment | An innovative treatment is applied which is hypothesised to have a greater educational effect than the standard treatment | Outcomes for the group subject to the innovation are compared with outcomes for a matched group subject to the relevant standard educational input | To test whether an innovative form of treatment leads to greater educational outcomes than current practice |
| Level 3: innovation vs. enhanced treatment | An innovative treatment of unknown efficacy is applied | Outcomes for the group subject to the innovation are compared with outcomes for a matched group subject to a treatment recognised as good practice | A treatment is tested to see how effective it is compared to another treatment previously shown to be effective |

Table 3: Examples of different ‘levels’ of control condition

| Citation | Focus | Experimental treatment | Comparison condition | Level characterisation |
| --- | --- | --- | --- | --- |
| Moore, Graham and Diamond, 2003 | An intervention to improve teenagers' knowledge of emergency contraception | An extra lesson to be delivered to 14-15 year-old students in addition to existing normal sex education | No supplement to existing sex education provision | Level 1 |
| Hong, Lin, Chen, Wang and Lin, 2013 | Intervention programme of inquiry-based aesthetic science activities | 12 weeks programme of extra-curricular activities: “hands-on pedagogical strategy”, “inquiry teaching theory” and “aesthetic understanding teaching method”; and including “introductory hands-on activities, displays, team competitions, peer tutoring, small group discussions, demonstrations of scientific myths, and aesthetic science activities” (p.222) | No relevant extra-curricular provision | Level 1 |
| Leuchter, Saalbach and Hardy, 2014 | Curriculum intervention in the topic of floating and sinking | “An instructional design with sequenced and problem-based tasks which are supposed to stimulate conceptual change in the area of ‘floating and sinking’ in children in the first years of schooling…[enacted through] a structured and problem-based learning environment…[during] a 4-week experiment-based instruction” (p.1757) | “Usual curriculum” to exclude “any curriculum on floating and sinking between pre- and posttests” (p.1762) | Level 1 |
| Grooms, Sampson and Golden, 2014 | Construct arguments relating to socio-scientific issues | “A series of [six] argument-based lab activities” alongside 5 of “more ‘cookbook’ style” “a chemistry laboratory course aligned with the argument-driven inquiry” (p.1412) that emphasised “scientific argumentation, group collaboration, and peer review” (p.1417) | All eleven laboratory activities followed the “more traditional laboratory approach” (p.1412) “instruction followed a more ‘cookbook’ style, where the students were provided the steps needed to complete each investigation and typically worked as individuals” (p.1417) | Level 2 |
| Bramwell-Lalor and Rainford, 2013 | Concept mapping as a formative assessment tool in developing students’ higher level cognitive skills | Concept mapping added to the teaching of topics by “lectures, discussion and practical work.” | “The same biology curriculum during the period under study. The topics that they were taught was done over the same time period as the treatment groups … [through] lectures, discussion and practical work” (pp.850-851) | Level 2 |
| Yin, Tomita and Shavelson, 2013 | “Learning progression-aligned formal embedded formative assessment on conceptual change and achievement in middle-school science” | Formal formative assessments added to teaching provision | Equal amount of time on the same day gathering additional data and discussing patterns found in their experiment | Level 2+ |
| Bunterm, et al., 2014 | The 5E Learning Cycle Model | Enquiry learning following lesson plans adapted to support guided enquiry | Enquiry learning following lesson plans adapted to support structured enquiry | Level 3 |
| Chen, Chang, Lai, & Tsai, 2014 | Using a pre-test to diagnose student’s misconceptions relating to diodes | Responding to diagnosed alternative conceptions by engaging in the P-O-E (Predict-Observe-Explain) sequence | Responding to diagnosed alternative conceptions by providing students with remedial input | Level 3 |

Table 4: Guidance on the logic of selecting control conditions

| Context of study | Type of control condition |
| --- | --- |
| There are questions about whether the teaching innovation can lead to learning gains in the context (e.g., students may be too young to benefit) | Level 1 - comparison with learners not receiving any teaching |
| It is unclear if it would be beneficial to provide some supplementary input in addition to current standard provision | Level 1 - comparison with learners not receiving any supplement to standard teaching |
| There is genuine uncertainty about the potential of the teaching intervention to lead to learning outcomes as positive as those obtained by current practice (i.e., the innovation has yet to be tested in any reasonably comparable context) | Level 2 - comparison with learners receiving standard teaching |
| An innovation is suspected to offer potential advantages over current practice, and there are no other alternatives already demonstrated to be effective that could feasibly substitute for current practice | Level 2 - comparison with learners receiving standard teaching |
| An innovation is suspected to offer potential advantages over current practice, where there are other alternatives already demonstrated to be effective that could feasibly substitute for current practice | Level 3 - comparison with learners receiving an alternative teaching treatment already demonstrated to be effective |

Table 5: Some research studies including control conditions that the researchers claim are already known to be ineffective teaching treatments

Columns: Citation; Intervention condition; Background assumptions; Control condition.

Abdi, 2014 “Student [sic, students] in the experimental group were instructed with inquiry-based instruction supported 5E learning cycle. In the instruction based on 5E learning cycle method, teaching and learning activities and lesson plans were designed to maximize students active involvement in the learning process.” (p. 39) “The inquiry-based teaching approach is supported on knowledge about the learning process that has emerged from research….
In inquiry-based science education, children become engaged in many of the activities and thinking processes that scientists use to produce new knowledge. (p.37) “the traditional classroom often looks like a one-person show with a largely uninvolved learner. Traditional classes are usually dominated by direct and unilateral instruction. Students are expected to blindly accept the information they are given without questioning the instructor… Traditional approach followers assume that there is a fixed body of knowledge that the student must come to know. … The teacher seeks to transfer thoughts and meanings to the passive student leaving little room for student-initiated questions, independent thought or interaction between students” (p.37) “In the control group, a teacher directed strategy representing the traditional approach was used… where students are completely passive…” (p.39) “The teacher used direct teaching and question and answer methods … In this group, the teacher provided instruction through lecture and discussion methods to teach the concepts. The teacher … wrote notes on the chalkboard about the definition of concepts, and passed out worksheets for students to complete. The primary underlying principle was that knowledge takes the form of information that is transmitted to students. …” (p.39) Acar & Tarhan, 2007 “…cooperative learning instruction based on a constructivist approach” (p.353) “Construction of the knowledge occurs best in an active learning environment. Active learning methods such as cooperative learning encourages students to be active participants in the construction of their own knowledge during the learning process…The benefits of cooperative learning for students’ social and academic skills have been well documented by researchers...Based on the literature it can be said that cooperative learning based on the constructivist approach is effective for remediation of misconceptions” (pp. 351-352). 
“The control group was taught [by the same teacher] with a teacher-centered traditional didactic lecture format. Teaching strategies were dependent on teacher expression without consideration for student misconceptions. …students were required to use their textbooks; students were passive participants and rarely asked questions; they did not benefit from the library or internet sources; activities such as computer animations or brainstorming were not used; generally the teacher wrote the concepts on the board and then explained them; students listened and took notes as the teacher lectured on the content.” (p.358) Acar & Tarhan, 2008 “…newly developed material based on cooperative learning instruction was used in the experimental group” (p.407) “The teacher required students to actively participate in the learning process… asking some key questions such as “What are you doing?” “Why are you doing it?” “How will it help you understanding the subject?” “Why are you researching it?” (p.407) “At the beginning of the instruction, students’ groups were required to activate their prior knowledge” p.408 “…the most important factor that affects learning is the student’s existing conceptions” (p.401) “The benefits of cooperative learning on students’ academic and social skills have been well-documented…” (p.404) “…the control group was taught [by the same teacher] …using teacher-centred traditional didactic lecture format. Teaching strategies were dependent on teacher expression. The students were required to use their textbooks…there are not any student centred active activities [that] depend on constructivism. Students were passive participants during the lessons and they only listened and took notes as the teacher lectured on the content” (pp. 408-409). Çam & Geban, 2011 “The EG [experimental group] was treated with case-based learning instruction by small group format … The instruction was student-centered rather than teacher centered education. 
… Teacher is a facilitator who assists small groups of self-directed students as they work through a case. She kept the groups on track and stimulated the functioning of the groups. She were [sic, did] not lecture or directly teach the students. She taught students to find answers to their own questions and provided students with feedback. (p.29) “…people construct their knowledge by actively creating their own understanding rather [than] receiving knowledge from others” (p.26) “Case based learning instruction … promotes students’ active participation and students could construct their own learning.” (p.26) “Students in CG [control group] were instructed by lecturing method, discussion and sometimes students performed the laboratory activities in that students were passive listeners and teacher’s role was to transmit the facts and concepts to the students.…Teacher did not give emphasis on students’ misconceptions. Students were passive listeners and they were taking notes. In the laboratory activity section, students were required to do experiment by using the handout.…like ‘‘cookbook’’, described the all steps of the experiment (p.29). Sesen & Tarhan, 2011 “…a variety of specific student-centered instructional strategies […including] experimental activities, brain-storming, video presentations, demonstrations, computer animations, and learning together activities that engage active participation of students in the learning process” (p.209). “In an active-learning environment, in contrast to teacher-centered instruction, a teacher acts as a facilitator, engages active participation of students in the learning process, and puts less emphasis on memorizing information and more emphasis on inquiry through which students develop a deeper knowledge and appreciation of the nature of science … when students are actively involved in the learning task, they learn more than when they are passive recipients of instruction” (p.208). 
“…teacher-centered instruction, [where] learning focuses on the mastery of content, with little development of the skills and attitudes necessary for scientific inquiry. The teacher transmits information to students, who receive and memorize it. …The curriculum is loaded with many facts and a large number of vocabulary words, which encourages a lecture format of teaching. (p.216) “…the control group were instructed via teacher-centered didactic lecture format…The students were instructed with regular chemistry textbooks. They listened to the teacher carefully, took notes and solved algorithmic problems” (p.216). Gidena & Gebeyehu, 2017 “The lesson plan for the experimental group was prepared using the AOM. … This lesson was prepared in such a way that those students actively participated with guidance of the teacher in the starter activity, main activity, and concluding activity of the lesson. “ p.2233 “AOM [Advance organiser model] provides support for effective teaching and learning process …provides a framework to enable students to learn new ideas or information by meaningfully linking these ideas to the existing knowledge.” (p.2227) “…theories, concepts, and techniques are better understood when lectures are accompanied with demonstration, hands- on experiments through self-discovery, and questions that require students to ponder what will happen in an experiment and why” (p.2227). 
“…was taught using the lesson plan based on the conventional teaching method” (p.2226) “…the conventional teaching method, which was commonly practised in that school…in which the teacher dominants [sic], whereas the learners remain passive” (p.2233) Taşlidere, 2013 “For the three-week treatment period, the experimental group was instructed the application of concept cartoon worksheets” p.148 “…it is reported that traditional physics instruction is ineffective in helping students develop a scientific view and their conceptual understandings … In general, the approaches encouraging active participation of learners in learning environment are thought to help students construct knowledge meaningfully” (p. 145) “traditional instruction which relied on instructors’ explanations with no consideration of the students’ misconceptions. The instructor used overhead projector to show the definitions of concepts, explained the facts, solved the questions, meanwhile students took notes through the lessons” (p.154) Tüysüz, 2010 “…taught by a constructivist based instructional approach which was enriched by computer animations at the computer laboratory” (p.43) “As accepted throughout the world the idea of using student centred constructivist based instructional methods is widely accepted, since teacher centred, traditional instructional methods has given insufficient opportunities for student to construct their own learning. 
Eliciting students’ individual capabilities, intelligence and creative thinking can only be achieved through student centered instructional methods” (p.37)
“…using chalk and talk method as commonly known name, the traditional method” (p.43)

Figure 1: Experimental designs may be categorised as true experiments, quasi-experiments and natural experiments
[Diagram comparing three designs.
True experiment: random assignment of units of analysis to conditions; intervention treatment and control treatment imposed by researchers.
Quasi-experiment: existing groupings of units of analysis; intervention treatment and control treatment imposed by researchers.
Natural experiment: existing groupings of units of analysis; existing condition of interest and existing comparison condition.]

Figure 2: Evaluations of equivalence between different groups should be more rigorous than simply excluding differences reaching statistical significance
[Diagram: a scale running from 0.0 through 0.5 to 1.0 for p, the probability of measured initial differences between groups occurring (if due to chance events).
Marked regions and examples: statistically significant difference, i.e., p<0.05; measured initial differences at levels unlikely to occur by chance (i.e., p<0.5), e.g., p=0.21 (p>0.05, but not a likely outcome by chance); e.g., p=0.84.]

Figure 3: A compensatory research design where both groups experience the innovation
[Diagram: two groups followed through two phases of instruction.
Group 1, Topic 1 instruction: Topic 1 pre-test; intervention (innovative treatment); Topic 1 post-test.
Group 1, Topic 2 instruction: Topic 2 pre-test; comparison (customary treatment); Topic 2 post-test.
Group 2, Topic 1 instruction: Topic 1 pre-test; comparison (customary treatment); Topic 1 post-test.
Group 2, Topic 2 instruction: Topic 2 pre-test; intervention (innovative treatment); Topic 2 post-test.]

Figure 4: When an experiment tests a sample drawn at random from a wider population, then the findings of the experiment can be assumed to apply (on average) to the population
[Diagram labels.
Specified population of interest, e.g.:
• 14-15 year-olds studying natural selection
• chemistry teachers in Turkey
• secondary schools in New South Wales
• engineering undergraduates
• female school students on biology field trips
• …
Random sample of the population; (random assignment); innovative treatment and control condition, compared using inferential statistics; conclusion of study: inference; statistical generalisation.]

Figure 5: Results from a randomised trial showing the range of within-condition outcomes (Taber et al., 2016)
[Plot. Axis: average class percentage gains (deferred post-test - pre-test), scale 0 to 30.
Intervention condition: range of average learning gains in experimental classes, with the mean of class gains (16 intervention classes).
Comparison condition: range of average learning gains in control classes, with the mean of class gains (12 control classes).
Individual classes are plotted as X marks.]

Figure 6: Many educational experiments do not meet the conditions that allow statistical generalisation to a wider population
[Flowchart of yes/no questions leading to one of two outcomes.
Questions:
Is the sample a random selection of the units of analysis in the population of interest?
Has there been random assignment of units of analysis to the treatment conditions?
Have all the likely confounding variables been identified? (This question appears twice in the flowchart.)
Is the population well described in terms of diversity across the likely confounding variables?
Has a sample been recruited which represents the wider population in terms of diversity across the likely confounding variables?
Have all the likely confounding variables been measured in the treatment groups?
Have the treatment groups been shown to be equivalent in terms of the likely confounding variables?
Has statistical modelling been used to seek to separate out the effects of and interactions between the independent and confounding variables?
Outcomes: inference to the wider population possible; inference to the wider population would not be sound.]

Figure 7: Choice of confidence level reflects a balance between admitting false positives (due to chance events) and false negatives (where real effects are not distinguished from chance events)
[Diagram: a scale of significance thresholds, p = 0.1, p = 0.05, p = 0.01, p = 0.001.
Towards p = 0.1: more likely that chance events will reach significance, but reduces incidence of false negatives.
Towards p = 0.001: reduces incidence of false positives, but more likely that genuine effects will not reach significance.]

Figure 8: Rhetorical experiments are intended to demonstrate that a well-tested teaching approach works in a very specific context
[Diagram: a chain of linked statements.
Traditional teaching (teacher-centred; lecture method; recipe practicals; passive learning; etc.) can be contrasted with progressive/reform teaching (learner-centred; active learning; constructivist; enquiry-based; etc.).
Traditional teaching continues to be the norm: despite research evidence and educational policies, this is how students are taught in this educational context.
For example, progressive teaching approach X has been widely demonstrated to be more effective than traditional teaching across a wide range of contexts,
but progressive teaching approach X has not yet been specifically tested with grade […] students, studying topic […], in town […],
so an experiment is designed:
Experimental group (class taught with progressive teaching approach X) is compared with control group (class taught with traditional teaching approach; any progressive elements excluded for the sake of a clear comparison).
Yet researchers are able to prepare a local teacher to implement teaching approach X in an experimental study. (The same teacher may be asked to teach in both conditions to give a 'fairer' test.)]
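
The trade-off summarised in Figure 7 can be illustrated with a small simulation. The Python sketch below is not drawn from any of the studies discussed. The function names, the class size of 30, the assumed true effect of 0.5 standard deviations, and the use of a two-sided z-test with a known standard deviation are all assumptions made purely for illustration. It repeatedly simulates a two-group experiment, once where the treatment has no effect and once where it has a genuine effect, and reports how often each significance threshold is reached.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means.

    Assumes (for illustration only) that scores are normally distributed
    with a known standard deviation `sigma` in both groups.
    """
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significance_rate(true_effect, alpha, trials=2000, class_size=30, seed=1):
    """Proportion of simulated experiments whose p-value falls below alpha.

    `true_effect` is the hypothetical difference in mean score (in standard
    deviation units) between the treated group and the control group.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(class_size)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(class_size)]
        if two_sided_p(treated, control) < alpha:
            hits += 1
    return hits / trials


for alpha in (0.1, 0.05, 0.01, 0.001):
    # No real effect: every 'significant' result is a false positive.
    false_positive_rate = significance_rate(0.0, alpha)
    # Real effect present: every non-significant result is a false negative.
    false_negative_rate = 1 - significance_rate(0.5, alpha)
    print(f"threshold p = {alpha}: "
          f"false positives {false_positive_rate:.3f}, "
          f"false negatives {false_negative_rate:.3f}")
```

Because the same seed is reused for every threshold, each row is computed from the same simulated experiments, so moving to a stricter threshold can only lower the false positive rate and raise the false negative rate. Real studies would rarely know the standard deviation in advance, and would normally use a test suited to the actual design and unit of analysis.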