The General Standard

Application

This general standard applies to all software engineering studies that collect and analyze data. It should be complemented by more specific guidelines where available.

Initial Checks (Editor)

Reviewers should only be invited for papers with the following attributes. By assigning reviewers, the editor/chair/administrator is confirming that the manuscript meets these criteria:

  • meets venue’s requirements (e.g. length, author-blinding, appropriate keywords)
  • within the venue’s scope
  • meets the minimum level of language quality acceptable to the venue
  • cites other scholarly works
  • presents new analysis not previously published in a peer-reviewed venue (i.e. preprints are fine)
  • does not include unattributed verbatim published text (i.e. plagiarism)

Initial Checks (Reviewer)

Before beginning to review a paper, assigned reviewers should verify the following.

  • reviewer has no conflicts of interest; if unsure, check with the chair or editor
  • reviewer has sufficient expertise; if unsure, check with the chair or editor and clarify what you can(not) evaluate
  • paper is clear enough (in language and presentation) to even review

Specific Attributes

Essential Attributes

  • states a purpose, problem, objective, or research question
  • explains why the purpose, problem, etc. is important (motivation)
  • defines jargon, acronyms and key concepts

  • methodology is appropriate (not necessarily optimal) for stated purpose, problem, etc.
  • describes in detail what, where, when and how data were collected (see the Sampling Supplement)
  • describes in detail how the data were analyzed

  • presents results
  • results directly address research questions
  • enumerates and validates assumptions of statistical tests used (if any)1

  • discusses implications of the results
  • discusses the study’s limitations and threats to validity
  • supports main claims or conclusions with explicit evidence (data/observations) or arguments

  • contributes in some way to the collective body of knowledge
  • language is not misleading; any grammatical problems do not substantially hinder understanding
  • acknowledges and mitigates potential risks, harms, burdens or unintended consequences of the research (see the ethics supplements for Engineering Research, Human Participants, or Secondary Data)
  • visualizations/graphs are not misleading (see the Information Visualization Supplement)
  • complies with all applicable empirical standards

Desirable Attributes

  • states epistemological stance2
  • summarizes and synthesizes a reasonable selection of related work (not every single relevant study)
  • clearly describes relationship between contribution(s) and related work
  • demonstrates appropriate statistical power (for quantitative work) or saturation (for qualitative work)
  • describes reasonable attempts to investigate or mitigate limitations
  • discusses study’s realism, assumptions and sensitivity of the results to its realism/assumptions
  • provides plausibly useful interpretations or recommendations for practice, education or research
  • concise, precise, well-organized and easy-to-read presentation
  • visualizations (e.g. graphs, diagrams, tables) advance the paper’s arguments or contribution
  • clarifies the roles and responsibilities of the researchers (i.e. who did what?)
  • provides an auto-reflection or assessment of the authors’ own work (e.g. lessons learned)
  • publishes the study in two phases: a plan and the results of executing the plan (see the Registered Reports Supplement)
  • uses multiple raters, where philosophically appropriate, for making subjective judgments (see the IRR/IRA Supplement)
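
As an illustration of the statistical power attribute above, the following is a minimal sketch of an a priori sample-size estimate for comparing two groups. It assumes a two-sided test, equal group sizes, a standardized effect size (Cohen's d) and the normal approximation n = 2((z_(1-alpha/2) + z_power) / d)^2. The function name and default values are illustrative assumptions, not part of the standard; an exact t-based calculation from a dedicated power analysis tool typically yields a slightly larger n.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size: float,
                          alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sided,
    two-sample comparison of means (normal approximation)."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)                               # round up to whole participants


if __name__ == "__main__":
    # Hypothetical medium effect (d = 0.5) at the conventional alpha and power.
    print(sample_size_per_group(0.5))
```

A calculation of this kind, reported alongside the assumed effect size, gives reviewers something concrete to assess instead of an arbitrary sample-size threshold.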

Extraordinary Attributes

  • applies two or more data collection or analysis strategies to the same research question (see the Multimethodology Standard)
  • approaches the same research question(s) from multiple epistemological perspectives
  • innovates on research methodology while completing an empirical study

General Quality Criteria

There are no universal quality criteria. Each study should be assessed against quality criteria appropriate for its methodology, as laid out in the specific empirical standards. Avoid applying inappropriate quality criteria (e.g. construct validity to a study with no constructs; internal validity to a study with no causal relationships).

Examples of Acceptable Deviations

A study can only apply an empirical standard if an appropriate standard exists. If no related standards exist, studies should apply published guidance. If no appropriate guidance exists, reviewers should apply the general standard and construct an ad hoc evaluation scheme for the new method.

Good Review Practices

Reviewers evaluate a manuscript’s trustworthiness, importance and clarity. The results must be, primarily, true (trustworthy) and, secondarily, important. A paper that is trustworthy can be accepted even if it is not important. A paper that is not trustworthy cannot be accepted, even if it seems important. Papers that are both trustworthy and important can have priority. Papers must be clear enough to judge their trustworthiness and importance. Reviewers should endeavor to:

  • Reflect on and clearly state their own limitations and biases.
  • Clarify which are necessary and which are suggested changes. Ideally, separate them.
  • Identify parts of the paper that you cannot effectively judge or did not review.

Reviewing Antipatterns

  • Applying empirical standards in a mechanical, inflexible, box-ticking or gotcha-like manner.
  • Rejecting a study because it uses a methodology for which no specific standard is available.
  • Skimming a manuscript instead of carefully reading each word and inspecting each figure and table.
  • Unprofessional or vitriolic tone, ad hominem attacks, disparaging or denigrating comments.
  • Allowing the authors’ identities or affiliations to affect the review.
  • Focusing on superficial details of the paper without engaging with its main claims or results.
  • Requesting additional analysis not directly related to a study’s purpose or research question, leading to results poorly linked to the article’s narrative.
  • Using sub-reviewers when the venue does not explicitly allow it.
  • Using the review to promote the reviewer’s own views, theories, methods, or publications.

Invalid Criticisms

  • Setting arbitrary minimum sample sizes or other data requirements, based on neither power analysis nor theoretical saturation
  • Stating that a study:
    • lacks detail without enumerating missing details;
    • is of low quality without explaining specific problems; or
    • is not new without providing citations to published studies that make practically identical contributions.
  • Rejecting a study because it replicates or reproduces existing work
  • Cross-paradigmatic criticism (e.g. attacking an interpretivist study for not conforming to positivist norms).
  • Criticizing a study for limitations intrinsic to that kind of study or the methodology used (e.g. attacking a case study for low generalizability)
  • Rejecting a study because the reviewer would have used a different methodology or design
  • Rejecting a study because it reports negative results

Research and Reporting Antipatterns

  • Attempting a study without reading, understanding and applying published guidelines for that kind of study.
  • Unreasonably small, underpowered or limited studies.
  • Hypothesizing After Results are Known (HARKing) in ostensibly confirmatory, (post-)positivist research.
  • Reporting only the subset of statistical tests that produce significant results (p-hacking).
  • Reporting—together in one paper—several immature or disjointed studies instead of one fully-developed study.
  • Unnecessarily dividing the presentation of a single study into many papers (salami-slicing).
  • Overreaching conclusions or generalizations; obfuscating, downplaying or dismissing a study’s limitations.
  • Mentioning related work only to dismiss it as irrelevant; listing rather than synthesizing related work.
  • Acknowledging limitations but then writing implications and conclusions as though the limitations don’t exist

1 visual methods of checking assumptions are often as good as or better than statistical tests
2 e.g. positivism, falsificationism, interpretivism, critical realism, postmodernism

Action Research

Empirical research that investigates how an intervention, like the introduction of a method or tool, affects a real-life context

Application

This standard applies to empirical research that meets the following conditions.

  • investigates a primarily social phenomenon within its real-life, organizational context
  • intervenes in the real-life context (otherwise see the Case Study Standard)
  • the change and its observation are an integral part of addressing the research question and contribute to research

If the intervention primarily alters social phenomena (e.g. the organization’s processes, culture, way of working or group dynamics), use this standard. If the intervention is a new technology or technique (e.g. a testing tool, a coding standard, a modeling grammar), especially if it lacks a social dimension, consider the Engineering Research Standard. If the research involves creating a technology and an organizational intervention with a social dimension, consider both standards.

Specific Attributes

Essential Attributes

  • justifies the selection of the site(s) that was(were) studied
  • describes the site(s) in rich detail
  • describes the relationship between the researcher and the host organization3
  • describes the intervention(s) in detail
  • describes how interventions were determined (e.g. by management, researchers, or a participative/co-determination process)
  • explains how the interventions are evaluated4
  • describes the longitudinal dimension of the research design (including the length of the study)
  • describes the interactions between researcher(s) and host organization(s)1
  • explains research cycles or phases, if any, and their relationship to the intervention(s)2

  • presents a clear chain of evidence from observations to findings
  • reports participant or stakeholder reactions to interventions

  • reports lessons learned by the organization
  • researchers reflect on their own possible biases

Desirable Attributes

  • provides supplemental materials such as interview guide(s), coding schemes, coding examples, decision rules, or extended chain-of-evidence tables
  • uses direct quotations extensively
  • validates results using member checking, dialogical interviewing5, feedback from non-participant practitioners or research audits of coding by advisors or other researchers
  • findings plausibly transferable to other contexts
  • triangulation across quantitative and qualitative data

Extraordinary Attributes

  • research team with triangulation across researchers (to mitigate researcher bias)

General Quality Criteria

Example criteria include reflexivity, credibility, resonance, usefulness and transferability (see Glossary). Positivist quality criteria such as internal validity, construct validity, generalizability and reliability typically do not apply.

Examples of Acceptable Deviations

  • In a study of deviations from organizational standards, detailed description of circumstances and direct quotations are omitted to protect participants.
  • The article reports a negative outcome of an intervention and, for example, investigates why a certain method was not applicable.

Antipatterns

  • Forcing interventions that are not acceptable to participants or the host organization.
  • Losing professional distance and impartiality; getting too involved with the community under study.
  • Over-selling a tool or method without regard for participants’ problems, practices or values.
  • Avoiding systematic evaluation; downplaying problems; simply reporting participants’ views of the intervention.

Invalid Criticisms

  • The findings and insights are not valid because the research intervened in the context. Though reflexivity is crucial, the whole point of action research is to introduce a change and observe how participants react.
  • This is merely consultancy or an experience report. Systematic observation and reflection should not be dismissed as consultancy or experience reports. Inversely, consultancy or experiences should not be falsely presented as action research.
  • Lack of quantitative data; causal analysis; objectivity, internal validity, reliability, or generalizability.
  • Sample not representative; lack of generalizability; generalizing from one organization.
  • Lack of replicability or reproducibility; not releasing transcripts.
  • Lack of control group or experimental protocols. An action research study is not an experiment.

Suggested Readings

Richard Baskerville and A. Trevor Wood-Harper. 1996. A critical perspective on action research as a method for information systems research. Journal of Information Technology 11, 3, 235–246.

Peter Checkland and Sue Holwell. 1998. Action Research: Its Nature and Validity. Systemic Practice and Action Research. (Oct. 1997), 9–21.

Yvonne Dittrich. 2002. Doing Empirical Research on Software Development: Finding a Path between Understanding, Intervention, and Method Development. In Social thinking—Software practice. 243–262

Yvonne Dittrich, Kari Rönkkö, Jeanette Eriksson, Christina Hansson and Olle Lindeberg. 2008. Cooperative method development. Empirical Software Engineering. 13, 3, 231-260. DOI: 10.1007/s10664-007-9057-1

Kurt Lewin. 1947. Frontiers in Group Dynamics. Human Relations 1, 2 (1947), 143–153. DOI: 10.1177/001872674700100201

Lars Mathiassen. 1998. Reflective systems development. Scandinavian Journal of Information Systems 10, 1 (1998), 67–118

Lars Mathiassen. 2002. Collaborative practice research. Information, Technology & People. 15,4 (2002), 321–345

Lars Mathiassen, Mike Chiasson, and Matt Germonprez. 2012. Style Composition in Action Research Publication. MIS Quarterly. 36, 2 (2012), 347-363

Miroslaw Staron. Action research in software engineering: Metrics’ research perspective. International Conference on Current Trends in Theory and Practice of Informatics. (2019), 39-49

Maung K. Sein, Ola Henfridsson, Sandeep Purao, Matti Rossi and Rikard Lindgren. 2011. Action design research. MIS quarterly. (2011), 37-56. DOI: 10.2307/23043488

Exemplars

Yvonne Dittrich, Kari Rönkkö, Jeanette Eriksson, Christina Hansson and Olle Lindeberg. 2008. Cooperative method development. Empirical Software Engineering. 13, 3 (Dec. 2007), 231-260. DOI: 10.1007/s10664-007-9057-1

Helle Damborg Frederiksen and Lars Mathiassen. 2005. Information-centric assessment of software metrics practices. IEEE Transactions on Engineering Management. 52, 3 (2005), 350-362. DOI: 10.1109/TEM.2005.850737

Jakob Iversen and Lars Mathiassen. 2003. Cultivation and engineering of a software metrics program. Information Systems Journal. 13, 1 (2006), 3–19

Jakob Iversen. 1998. Problem diagnosis software process improvement. Larsen TJ, Levine L, DeGross JI (eds) Information systems: current issues and future changes.

Martin Kalenda, Petr Hyna, Bruno Rossi. Scaling agile in large organizations: Practices, challenges, and success factors. Journal of Software: Evolution and Process. Wiley Online Library 30, 10 (Oct. 2018), 1954 pages.

Miroslaw Ochodek, Regina Hebig, Wilhem Meding, Gert Frost, Miroslaw Staron. Recognizing lines of code violating company-specific coding guidelines using machine learning. Empirical Software Engineering. 25, 1 (Jan. 2020), 220-65.

Kari Rönkkö, Brita Kilander, Mats Hellman, Yvonne Dittrich. 2004. Personas is not applicable: local remedies interpreted in a wider context. In Proceedings of the eighth conference on Participatory design: Artful integration: interweaving media, materials and practices-Volume 1, Toronto, ON, 112–120.

Thatiany Lima De Sousa, Elaine Venson, Rejane Maria da Costa Figueired, Ricardo Ajax Kosloski, and Luiz Carlos Miyadaira Ribeiro. Using Scrum in Outsourced Government projects: An Action Research. 2016. In 2016 49th Hawaii International Conference on System Sciences (HICSS), January 5, 2016, 5447-5456.

Hataichanok Unphon, Yvonne Dittrich. 2008. Organisation matters: how the organisation of software development influences the introduction of a product line architecture. In Proc. IASTED Int. Conf. on Software Engineering. 2008, 178-183.


1 i.e. who intervened and with which part of the organization?
2 Action research projects are structured in interventions often described as action research cycles, which are often structured in distinct phases. It is a flexible methodology, where subsequent cycles are based on their predecessors.
3 E.g. project financing, potential conflicts of interest, professional relationship leading to access.
4 Can include quantitative evaluation in addition to qualitative evaluation.
5 L. Harvey. 2015. Beyond member-checking: A dialogic approach to the research interview, International Journal of Research & Method in Education, 38, 1, 23–38.

Benchmarking (of Software Systems)

A study in which a software system is assessed using a standard tool (i.e. a benchmark) for competitively evaluating and comparing methods, techniques or systems “according to specific characteristics such as performance, dependability, or security” (Kistowski et al. 2015).

Application

This standard applies to empirical research that meets the following conditions.

  • investigates software systems within a defined context with an automated, repeatable procedure
  • studies the software system’s quality of service under a specific workload or usage profile

If the benchmark experiments primarily study software systems, use this standard. For experiments with human participants, see the Experiments (with Human Participants) Standard. For simulation of models of the software systems, see the Simulation (Quantitative) Standard. If the study is conducted within a real-world context, see the Case Study and Ethnography Standard. Benchmarking is often used with Engineering Research (see the Engineering Research (AKA Design Science) Standard).

Specific Attributes

Essential Attributes

  • EITHER: justifies the selection of an existing, publicly available or standard benchmark (in terms of relevance, timeliness, etc.)
    OR: defines a new benchmark, including:
    (i) the quality to be benchmarked (e.g., performance, availability, scalability, security),
    (ii) the metric(s) to quantify the quality,
    (iii) the measurement method(s) for the metric (if not obvious),
    (iv) the workload, usage profile and/or task sample the system under test is subject to (i.e. what the system is doing when the measures are taken)
    AND justifies the design of the benchmark (in terms of relevance, timeliness, etc.)
    AND reuses existing benchmark design components from established benchmarks or justifies new components
  • describes the experimental setup for the benchmark in sufficient detail to support independent replication (or refers to such a description in supplementary materials)
  • specifies the workload or usage profile in sufficient detail to support independent replication (or refers to such a description in supplementary materials)
  • allows different configurations of a system under test to compete on their merits without artificial limitations
  • assesses stability or reliability using sufficient experiment repetitions and execution duration

  • discusses the construct validity of the benchmark; that is, does the benchmark measure what it is supposed to measure?
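
To make the repetition and stability attribute above concrete, here is a minimal sketch of a measurement loop that discards warm-up runs, records every measured run and reports the coefficient of variation as one possible stability indicator. The function names, the default of 30 repetitions and 5 warm-up runs, and the choice of coefficient of variation are illustrative assumptions; an appropriate number of repetitions depends on the system under test.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_benchmark(workload: Callable[[], object],
                  repetitions: int = 30,
                  warmup: int = 5) -> List[float]:
    """Return the raw wall-clock duration (seconds) of every measured run."""
    if repetitions < 2:
        raise ValueError("need at least two repetitions to estimate variability")
    for _ in range(warmup):            # warm-up runs are executed but not recorded
        workload()
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Summarize raw durations; 'cv' is the coefficient of variation."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)  # sample standard deviation (needs n >= 2)
    cv = stdev / mean if mean > 0 else float("nan")
    return {"n": len(durations), "mean": mean, "stdev": stdev, "cv": cv}


if __name__ == "__main__":
    # Hypothetical workload: sorting a reversed list of integers.
    raw = run_benchmark(lambda: sorted(range(10_000), reverse=True))
    print(summarize(raw))
```

Returning the raw durations rather than only the summary keeps later re-analysis possible, for instance with a different stability measure.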

Desirable Attributes

  • provides supplementary materials including datasets, analysis scripts and (for novel benchmarks) extended documentation
  • provides benchmark(s) in a usable form that facilitates independent replication
  • reports on independent replication of the benchmark results
  • reports on a large community that uses the benchmark
  • reports on an independent organization that maintains the benchmark
  • uses or creates open source benchmarks
  • transparently reports on problems with executing benchmark runs, if any

Extraordinary Attributes

  • provides empirical evidence for the relevance of a benchmark, e.g., using a Systematic Review
  • provides empirical evidence for the usability of a benchmark, e.g., using Action Research or Case Studies

General Quality Criteria

Fairness of measurements, reproducibility of results across experiment repetitions, benchmarking the right aspects, and using a realistic benchmark with a representative workload.

Examples of Acceptable Deviations

  • the nature of the benchmark requires specialized hardware (e.g. a quantum computer) so it is not easy to replicate
  • in a study that replicates published existing work, the description of the experimental setup could be quite brief
  • the study only employs one (or a few) runs because prior work has shown that a single run is sufficient

Antipatterns

  • Tailoring the benchmark for a specific method, technique or tool, which is evaluated with the benchmark.
  • Using benchmarking experiments that are irrelevant for the problem studied to obfuscate weaknesses in the proposed approach
  • Insufficient repetitions or duration to assess stability of results
  • Collecting aggregated measurements instead of persisting all raw results and running an offline analysis
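
The last antipattern can be avoided with very little machinery. The sketch below, under assumed names and an assumed three-column layout (run, request, latency_ms), persists one row per measurement and computes a nearest-rank percentile offline from the stored rows. It is one possible arrangement, not a prescribed format.

```python
import csv
import io
import math
from typing import Iterable, TextIO, Tuple


def write_raw(rows: Iterable[Tuple[int, int, float]], stream: TextIO) -> None:
    """Persist one row per measurement: (run, request, latency_ms)."""
    writer = csv.writer(stream)
    writer.writerow(["run", "request", "latency_ms"])
    writer.writerows(rows)


def percentile_from_raw(stream: TextIO, q: int) -> float:
    """Offline analysis: nearest-rank q-th percentile of the stored latencies."""
    if not 0 < q <= 100:
        raise ValueError("q must be in the range 1..100")
    latencies = sorted(float(row["latency_ms"]) for row in csv.DictReader(stream))
    if not latencies:
        raise ValueError("no measurements found")
    rank = math.ceil(q * len(latencies) / 100)  # 1-based nearest rank
    return latencies[rank - 1]


if __name__ == "__main__":
    # Hypothetical measurements; an in-memory buffer stands in for a results file.
    buffer = io.StringIO()
    write_raw([(1, i, float(i)) for i in range(1, 11)], buffer)
    buffer.seek(0)
    print(percentile_from_raw(buffer, 90))
```

Because the raw rows survive, other statistics (medians, confidence intervals, per-run breakdowns) can be derived later without re-running the benchmark.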

Invalid Criticisms

  • The benchmark is not widely used. It is sufficient to start developing a new benchmark with a small group of researchers as an offer to a larger scientific community. Such a proto-benchmark (Sim et al. 2003) can act as a template to further the discussion of the topic and to initialize the consensus process.
  • No independent replication of the benchmark results is reported.
  • There is no independent organization that maintains the benchmark.

Suggested Readings

David Bermbach, Erik Wittern, and Stefan Tai. 2017. Cloud service benchmarking: Measuring Quality of Cloud Services from a Client Perspective. Springer. DOI: 10.1007/978-3-319-55483-9

Susan Elliott Sim, Steve Easterbrook, and Richard C. Holt. 2003. Using benchmarking to advance research: a challenge to software engineering. In 25th International Conference on Software Engineering. IEEE. DOI: 10.1109/icse.2003.1201189

Jim Gray (Ed.). 1993. The Benchmark Handbook for Database and Transaction Systems (2nd ed.). Morgan Kaufmann.

Wilhelm Hasselbring. 2021. Benchmarking as Empirical Standard in Software Engineering Research. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE 2021). 365-372. DOI: 10.1145/3463274.3463361

Jóakim v. Kistowski, Jeremy A. Arnold, Karl Huppler, Klaus-Dieter Lange, John L. Henning, and Paul Cao. 2015. How to Build a Benchmark. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering. DOI: 10.1145/2668930.2688819

Samuel Kounev, Klaus-Dieter Lange, and Jóakim von Kistowski. 2020. Systems Benchmarking for Scientists and Engineers. Springer. DOI: 10.1007/978-3-030-41705-5

Alessandro Vittorio Papadopoulos, Laurens Versluis, André Bauer, Nikolas Herbst, Jóakim von Kistowski, Ahmed Ali-Eldin, Cristina L. Abad, José Nelson Amaral, Petr Tůma, Alexandru Iosup. 2021. Methodological Principles for Reproducible Performance Evaluation in Cloud Computing. In IEEE Transactions on Software Engineering, vol. 47, no. 8, pp. 1528-1543. DOI: 10.1109/TSE.2019.2927908

Walter F. Tichy. 1998. Should computer scientists experiment more? Computer 31, 5 (May 1998), 32–40. DOI: 10.1109/2.675631

Walter F. Tichy. 2014. Where’s the Science in Software Engineering? Ubiquity Symposium: The Science in Computer Science. Ubiquity 2014 (March 2014), 1–6. DOI: 10.1145/2590528.2590529

Exemplars

Sören Henning and Wilhelm Hasselbring. 2021. Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures. Big Data Research 25 (July 2021), 1–17. DOI: 10.1016/j.bdr.2021.100209

Guenter Hesse, Christoph Matthies, Michael Perscheid, Matthias Uflacker, and Hasso Plattner. 2021. ESPBench: The Enterprise Stream Processing Benchmark. In ACM/SPEC International Conference on Performance Engineering (ICPE ’21). ACM, 201–212. DOI: 10.1145/3427921.3450242

Martin Grambow, Erik Wittern, and David Bermbach. 2020. Benchmarking the Performance of Microservice Applications. ACM SIGAPP Applied Computing Review, vol 20, issue 3, 20-34. DOI: 10.1145/3429204.3429206

Joakim von Kistowski, Simon Eismann, Norbert Schmitt, Andre Bauer, Johannes Grohmann, and Samuel Kounev. 2018. TeaStore: A Micro-Service Reference Application for Benchmarking, Modeling and Resource Management Research. In 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 223–236. DOI: 10.1109/mascots.2018.00030

Christoph Laaber, Joel Scheuner, and Philipp Leitner. 2019. Software microbenchmarking in the cloud. How bad is it really? Empirical Software Engineering 24, 2469–2508. DOI: 10.1007/s10664-019-09681-1

Jan Waller, Nils C. Ehmke, and Wilhelm Hasselbring. 2015. Including Performance Benchmarks into Continuous Integration to Enable DevOps. SIGSOFT Softw. Eng. Notes 40, 2 (March 2015), 1–4. DOI: 10.1145/2735399.2735416

Case Study and Ethnography

“An empirical inquiry that investigates a contemporary phenomenon (the “case”) in depth and within its real-world context, especially when the boundaries between phenomenon and context [are unclear]” (Yin 2017)

Application

This standard applies to empirical research that meets the following conditions.

  • Presents a detailed account of a specific instance of a phenomenon at a site. The phenomenon can be virtually anything of interest (e.g. Unix, cohesion metrics, communication issues). The site can be a community, an organization, a team, a person, a process, an internet platform, etc.
  • Features direct or indirect observation (e.g. interviews, focus groups)—see Lethbridge et al.’s (2005) taxonomy.
  • Is not an experience report (cf. Perry et al. 2004) or a series of shallow inquiries at many different sites.

A case study can be brief (e.g. a week of observation) or longitudinal (if observation exceeds the natural rhythm of the site; e.g., observing a product over many releases). For our purposes, case study subsumes ethnography.

If data collection and analysis are interleaved, consider the Grounded Theory Standard. If the study mentions action research, or intervenes in the context, consider the Action Research Standard. If the study captures a large quantitative dataset with limited context, consider the Exploratory Data Science Standard.

Specific Attributes

Essential Attributes

  • justifies the selection of the case(s) or site(s) that was(were) studied
  • describes the site(s) in rich detail
  • reports the type of case study1
  • describes data sources (e.g. participants’ demographics and work roles)
  • defines unit(s) of analysis or observation

  • presents a clear chain of evidence from observations to findings

Desirable Attributes

  • provides supplemental materials such as interview guide(s), coding schemes, coding examples, decision rules, or extended chain-of-evidence tables
  • triangulates across data sources, informants or researchers
  • cross-checks interviewee statements (e.g. against direct observation or archival records)
  • uses participant observation (ethnography) or direct observation (non-ethnography) and clearly integrates these observations into results2
  • validates results using member checking, dialogical interviewing3, feedback from non-participant practitioners or research audits of coding by advisors or other researchers
  • describes external events and other factors that may have affected the case or site
  • uses quotations to illustrate findings4
  • EITHER: evaluates an a priori theory (or model, framework, taxonomy, etc.) using deductive coding with an a priori coding scheme based on the prior theory
    OR: synthesizes results into a new, mature, fully-developed and clearly articulated theory (or model, etc.) using some form of inductive coding (coding scheme generated from data)
  • researchers reflect on their own possible biases

Extraordinary Attributes

  • multiple, deep, fully-developed cases with cross-case triangulation
  • uses a team-based approach; e.g., multiple raters with analysis of inter-rater reliability (see the IRR/IRA Supplement)
  • published a case study protocol beforehand and made it publicly accessible (see the Registered Reports Supplement)
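
Where a team-based approach with multiple raters is philosophically appropriate, agreement between two raters is often summarized with Cohen's kappa. The following is a minimal sketch for two raters assigning one nominal code per item; the function name and the example codes are illustrative assumptions, and the IRR/IRA Supplement should guide the actual choice of measure.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who each coded the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: based on how often each rater used each code.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters used a single identical code
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical codes assigned to four interview excerpts.
    first = ["process", "process", "tooling", "tooling"]
    second = ["process", "process", "tooling", "process"]
    print(cohens_kappa(first, second))
```

Kappa corrects raw percentage agreement for agreement expected by chance, which is why it is usually preferred over simply counting matches.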

General Quality Criteria

Case studies should be evaluated using qualitative validity criteria such as credibility, multivocality, reflexivity, rigor and transferability (see Glossary). Quantitative quality criteria such as replicability, generalizability and objectivity typically do not apply.

Types of Case Studies

There is no standard way of conducting a case study. Case study research can adopt different philosophies, most notably (post-)positivism (Lee 1989) and interpretivism/constructivism (Walsham 1995), and serve different purposes, including:

  • a descriptive case study describes, in vivid detail, a particular instance of a phenomenon
  • an emancipatory case study identifies social, cultural, or political domination “that may hinder human ability” (Runeson and Höst 2009), commensurate with a critical epistemological stance
  • an evaluative case study evaluates a priori research questions, propositions, hypotheses or technological artifacts
  • an explanatory case study explains how or why a phenomenon occurred, typically using a process or variance theory
  • an exploratory case study explores a particular phenomenon to identify new questions, propositions or hypotheses
  • an historical case study draws on archival data, for instance, software repositories
  • a revelatory case study examines a hitherto unknown or unexplored phenomenon

Antipatterns

  • Relying on a single approach to data collection (e.g. interviews) without corroboration or triangulation
  • Oversimplifying and over-rationalizing complex phenomena; presenting messy complicated things as simple and clean

Invalid Criticisms

  • Does not present quantitative data; only collects a single data type.
  • Sample of 1; findings not generalizable. The point of a case study is to study one thing deeply, not to generalize to a population. Case studies should lead to theoretical generalization; that is, concepts that are transferable in principle.
  • Lack of internal validity. Internal validity only applies to explanatory case studies that seek to establish causality.
  • Lack of reproducibility or a “replication package”; Data are not disclosed (qualitative data are often confidential).
  • Insufficient number or length of interviews. There is no magic number; what matters is that there is enough data that the findings are credible, and the description is deep and rich.

Suggested Readings

Line Dube and Guy Pare. 2003. Rigor in information systems positivist case research: current practices, trends, and recommendations. MIS Quarterly. 27, 4, 597–636. DOI: 10.2307/30036550

Shiva Ebneyamini and Mohammad Reza Sadeghi Moghadam. 2018. Toward Developing a Framework for Conducting Case Study Research. International Journal of Qualitative Methods. 17, 1 (Dec. 2018)

Kilem Gwet. 2002. Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity. Statistical Methods for Inter-Rater Reliability Assessment Series, 2 (May 2002), 9 pages.

Barbara Kitchenham, Lesley Pickard, and Shari Lawrence Pfleeger. 1995. Case studies for method and tool evaluation. IEEE software. 12, 4 (1995), 52–62.

Timothy C. Lethbridge, Susan Elliott Sim, and Janice Singer. 2005. Studying software engineers: Data collection techniques for software field studies. Empirical Software Engineering. 10, 3 (2005), 311–341.

Matthew B. Miles, A. Michael Huberman, and Johnny Saldaña. 2014. Qualitative data analysis: A methods sourcebook. Sage.

Dewayne E. Perry, Susan Elliott Sim, and Steve M. Easterbrook. 2004. Case Studies for Software Engineers. In Proceedings 26th International Conference on Software Engineering. 28 May 2004, Edinburgh, UK, 736–738.

Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering. 14, 2, Article 131.

Per Runeson, Martin Höst, Austen Rainer, and Björn Regnell. 2012. Case study research in software engineering: Guidelines and examples. John Wiley & Sons.

Sarah J. Tracy. 2010. Qualitative Quality: Eight “Big-Tent” Criteria for Excellent Qualitative Research. Qualitative Inquiry. 16, 10, 837–851. DOI: 10.1177/1077800410383121

Geoff Walsham. 1995. Interpretive case studies in IS research: nature and method. European Journal of Information Systems. 4, 2, 74–81.

Robert K. Yin. 2017. Case study research and applications: Design and methods. Sage publications.

Exemplars

Adam Alami, and Andrzej Wąsowski. 2019. Affiliated participation in open source communities. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–11

Michael Felderer and Rudolf Ramler. 2016. Risk orientation in software testing processes of small and medium enterprises: an exploratory and comparative study. Software Quality Journal. 24, 3 (2016), 519–548.

Audris Mockus, Roy T. Fielding, and James D. Herbsleb. 2002. Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology (TOSEM). 11, 3 (2002), 309–346.

Helen Sharp and Hugh Robinson. 2004. An ethnographic study of XP practice. Empirical Software Engineering. 9, 4 (2004), 353–375.

Diomidis Spinellis and Paris C. Avgeriou. 2019. Evolution of the Unix System Architecture: An Exploratory Case Study. IEEE Transactions on Software Engineering.

Klaas-Jan Stol and Brian Fitzgerald. 2014. Two’s company, three’s a crowd: a case study of crowdsourcing software development. In Proceedings of the 36th International Conference on Software Engineering, 187–198.


1 e.g. descriptive, emancipatory, evaluative, explanatory, exploratory, historical, revelatory
2 Direct observation means watching research subjects without getting involved; participant observation means joining in with whatever participants are doing
3 L. Harvey. 2015. Beyond member-checking: A dialogic approach to the research interview, International Journal of Research & Method in Education, 38, 1, 23–38.
4 Quotations should not be the only representation of a finding; each finding should be described independently of supporting quotations

Case Survey (AKA Case Meta-Analysis)

A study that aims to generalize results about a complex phenomenon by systematically converting qualitative descriptions available in published case studies into quantitative data and analyzing the converted data

Application

This standard applies to studies in which:

  • a sample of previously published case studies is obtained;
  • the qualitative case descriptions are converted systematically into quantitative data; and
  • the converted quantitative data is analyzed to reach generalizable results.

This standard does not apply to studies collecting primary data from a large number of instances of a phenomenon; for instance, using interviews (consider the Qualitative Survey Standard) or questionnaires (consider the Questionnaire Survey Standard). For individual case studies use the Case Study Standard. For reviews of other kinds of studies (e.g., experiments) consider the Systematic Review Standard. This standard also does not apply to qualitative synthesis (e.g. meta-ethnography, narrative synthesis).

Specific Attributes

Essential Attributes

  • presents step-by-step, systematic, replicable description of the search process for published case studies (not necessarily in peer-reviewed venues)
  • defines clear inclusion and exclusion criteria for cases1
  • describes the sampling strategy (see the Sampling Supplement)
  • mitigates sampling bias and publication bias, using some (not all) of:
    (i) manual and keyword automated searches;
    (ii) backward and forward snowballing searches;
    (iii) checking profiles of prolific authors in the area;
    (iv) searching both formal databases (e.g. ACM Digital Library) and indexes (e.g. Google Scholar);
    (v) searching for relevant dissertations;
    (vi) searching pre-print servers (e.g. arXiv);
    (vii) soliciting unpublished manuscripts through appropriate listservs or social media;
    (viii) contacting known authors in the area.
  • defines a coding scheme to convert qualitative case descriptions into quantitative variables2
  • EITHER: describes the coding scheme in detail;
    OR: provides the coding scheme in supplementary materials
  • clearly explains how missingness in the dataset was managed

  • draws conclusions based on the quantitative variables derived

  • acknowledges generalizability threats; discusses how case studies reviewed may differ from target population
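
The coding-scheme and missingness attributes above can be illustrated with a minimal Python sketch. The variable names (`team_size`, `outcome`), their ordinal levels, and the rule of recording an unreported variable as `None` are hypothetical assumptions chosen for illustration; they are not prescribed by this standard.

```python
# Hypothetical coding scheme: each variable maps qualitative labels to numbers.
CODING_SCHEME = {
    "team_size": {"small": 1, "medium": 2, "large": 3},
    "outcome": {"failure": 0, "partial": 1, "success": 2},
}


def code_case(description):
    """Convert one case's qualitative labels into numeric codes.

    A variable the case study does not report is coded as None (missing)
    rather than guessed. A label outside the scheme raises KeyError so that
    the coder has to extend the scheme explicitly.
    """
    coded = {}
    for variable, levels in CODING_SCHEME.items():
        label = description.get(variable)
        coded[variable] = None if label is None else levels[label]
    return coded


def missingness(coded_cases):
    """Fraction of cases with a missing value, per variable."""
    n = len(coded_cases)
    return {
        variable: sum(1 for case in coded_cases if case[variable] is None) / n
        for variable in CODING_SCHEME
    }


cases = [
    {"team_size": "small", "outcome": "success"},
    {"team_size": "large"},  # outcome not reported in the primary study
]
coded = [code_case(case) for case in cases]
report = missingness(coded)
```

Reporting the per-variable missingness alongside the rule used to handle it (e.g. exclusion or imputation) lets readers judge how much each conclusion depends on incomplete cases.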

Desirable Attributes

  • provides supplementary materials such as protocol, search terms, search results, selection process results, coding scheme, examples of coding, decision rules, complete dataset, analysis scripts, descriptions of edge cases3
  • explains and justifies the design of the coding scheme
  • uses 2+ independent analysts; analyzes inter-rater reliability (see the IRR/IRA Supplement); explains how discrepancies among coders were resolved4
  • describes contacting authors of primary studies for more information, to check coding accuracy, or resolve coding disagreements
  • assesses quality of primary studies using an a priori scheme (e.g. the Case Study Standard); explains how quality was assessed; models study quality as a moderating variable
  • consolidates results using tables, diagrams, or charts; includes PRISMA flow diagram (cf. Moher et al. 2009)
  • integrates results into prior theory or research; identifies gaps, biases, or future directions
  • presents results as practical, evidence-based guidelines for practitioners, researchers, or educators
  • clearly distinguishes evidence-based results from interpretations and speculation5
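
As one way to analyze agreement between two independent analysts, the sketch below computes Cohen's kappa for nominal codes in plain Python. Cohen's kappa is only one candidate statistic and is an assumption here; the IRR/IRA Supplement should guide the actual choice. The ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal codes to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented codes from two hypothetical coders for the same eight cases.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 1]
coder_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(coder_1, coder_2)
```

Kappa is sensitive to how common each code is, so it is worth reporting raw agreement next to it.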

Extraordinary Attributes

  • uses theory to select and sample cases
  • two or more researchers independently undertake the preliminary search process before finalizing the search scope and search keywords
  • adds data from other sources to the selected cases
  • employs authors of primary studies to conduct the coding or ensure interpretations are correct
  • analyzes if/how studies’ characteristics (e.g., research design, publication venue or date) influence the coding

General Quality Criteria

Reliability, internal validity, external validity, construct validity, conclusion validity

Antipatterns

  • including primary studies that are not case studies (e.g., questionnaires or small amounts of data)
  • including all cases without a critical evaluation of the data available
  • analyzing data from cases using qualitative techniques
  • describing each case in isolation rather than synthesizing findings
  • relying on publication venue as a proxy for quality
  • conflating correlation with causation in findings

Invalid Criticisms

  • the studies were not published in peer-reviewed venues
  • the original studies do not employ a common research design
  • the cases are not a random sample of the phenomenon

Suggested Readings

R.J. Bullock and Mark Tubbs. 1987. “The case meta-analysis method for OD,” Research in organizational change and development. 1, 171–228.

Marlen Jurisch, Petra Wolf, and Helmut Krcmar. 2013. Using the case survey method for synthesizing case study evidence in information systems research. 19th Americas Conference on Information Systems, AMCIS 2013 - Hyperconnected World: Anything, Anywhere, Anytime (2013), 3904–3911.

Rikard Larsson. 1993. Case survey methodology: quantitative analysis of patterns across case studies. Academy of Management Journal. 36, 6 (Dec. 1993), 1515–1546. DOI: 10.2307/256820.

William Lucas. 1974. The Case Survey Method: Aggregating Case Experience. Rand Corporation, Santa Monica, CA.

Jorge Melegati and Xiaofeng Wang. 2020. Case Survey Studies in Software Engineering Research. Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Oct. 2020), 1–12.

Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med 6, 7: e1000097. doi:10.1371/journal.pmed.1000097

Robert Yin and Karen Heald. 1975. Using the Case Survey Method to Analyze Policy Studies. Administrative Science Quarterly. 20, 3 (1975), 371. DOI:10.2307/2391997.

Exemplars

R.J. Bullock and Mark Tubbs. 1990. A case meta-analysis of gainsharing plans as organization development interventions. The Journal of Applied Behavioral Science 26.3 (1990): 383-404.

Åke Grönlund and Joachim Åström. 2009. DoIT right: Measuring effectiveness of different eConsultation designs. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 5694 LNCS, (2009), 90–100. DOI: 10.1007/978-3-642-03781-8_9.

Marlen Jurisch, Christin Ikas, Petra Wolf and Helmut Krcmar. 2013. Key Differences of Private and Public Sector Business Process Change. e-Service Journal. 9, 1 (2013), 3. DOI:10.2979/eservicej.9.1.3.

Marlen Jurisch, Wolfgang Palka, Petra Wolf, and Helmut Krcmar. 2014. Which capabilities matter for successful business process change? Business Process Management Journal. 20, 1 (2014), 47–67. DOI:10.1108/BPMJ-11-2012-0125.

Rikard Larsson and Michael Lubatkin. 2001. Achieving acculturation in mergers and acquisitions: An international case survey. Human Relations. 54, 12 (2001), 1573–1601. DOI: 10.1177/00187267015412002.

Mark de Reuver, Harry Bouwman, and Ian MacInnes. 2007. What drives business model dynamics? A case survey. 8th World Congress on the Management of e-Business, WCMeB 2007 - Conference Proceedings. WCMeB (2007). DOI: 10.1109/WCMEB.2007.95.

Mark de Reuver, Harry Bouwman, and Ian MacInnes. 2009. Business model dynamics: A case survey. Journal of Theoretical and Applied Electronic Commerce Research. 4, 1 (2009), 1–11. DOI:10.4067/S0718-18762009000100002.


1 e.g., reports at least two of the variables in the coding scheme
2 the coding scheme should balance simplicity and information richness
3 do provide data from multiple raters and scripts for calculating inter-rater agreement if possible, but do not provide primary study authors’ responses to any questionnaire unless the authors gave explicit permission
4 by discussion and agreement, voting, adding a tie-breaker, consulting with study authors, etc.
5 Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.

Data Science

Studies that analyze software engineering phenomena or artifacts using data-centric analysis methods such as machine learning or other computational intelligence approaches as well as search-based approaches1

Application

Applies to studies that primarily analyze existing software phenomena using predictive, preemptive or corrective modelling.

  • If the analysis focuses on the toolkit, rather than some new conclusions generated by the toolkit, consider the Artifacts Standard.
  • If the analysis focuses on a single, context-rich setting (e.g., a detailed analysis of a single repository), consider the Case Study Standard.
  • If the temporal dimension is analyzed, consider the Longitudinal Studies Standard.
  • If the data objects are discussions or messages between humans, consider the Discourse Analysis Standard.
  • If data visualizations are used, consider the Information Visualization Supplement. (With large data sets especially, care is needed to keep visualizations legible.)
  • If the analysis selects a subset of available data, consult the Sampling Supplement.

Specific Attributes

Essential Attributes

  • explains why it is timely to investigate the proposed problem using the proposed method

  • explains how and why the data was selected
  • presents the experimental setup (e.g. using a dataflow diagram)2
  • describes the feature engineering approaches3 and transformations that were applied
  • explains how the data was pre-processed, filtered, and categorized
  • EITHER: discusses state-of-art baselines (and their strengths, weaknesses and limitations) OR: explains why no state-of-art baselines exist OR: provides compelling argument that direct comparisons are impractical
  • defines the modeling approach(es) used (e.g. clustering then decision tree learning), typically using pseudocode
  • discusses the hardware and software infrastructure used4
  • justifies all statistics and (automated or manual) heuristics used
  • describes and justifies the evaluation metrics used

  • goes beyond single-dimensional summaries of performance (e.g., average; median) to include measures of variation, confidence, or other distributional information

  • discusses technical assumptions and threats to validity that are specific to data science5
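
To illustrate going beyond a single-number summary, the following sketch reports a median together with its interquartile range and a percentile-bootstrap 95% confidence interval, using only the Python standard library. The function name `summarize`, the number of resamples, the seed, and the scores are illustrative assumptions, and the percentile bootstrap is one of several reasonable interval methods.

```python
import random
import statistics


def summarize(scores, n_boot=2000, seed=0):
    """Median, interquartile range, and a percentile-bootstrap 95% CI of the median."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    # Resample with replacement and record the median of each resample.
    boot = sorted(
        statistics.median(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    low = boot[(n_boot * 25) // 1000]
    high = boot[(n_boot * 975) // 1000 - 1]
    return {
        "median": statistics.median(scores),
        "iqr": (q1, q3),
        "ci95_median": (low, high),
    }


# Invented performance scores from ten hypothetical runs.
scores = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73, 0.75, 0.72, 0.78, 0.70]
summary = summarize(scores)
```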

Desirable Attributes

  • provides a replication package including source code and data set(s), or if data cannot be shared, synthetic data to illustrate the use of the algorithms6
  • data is processed by multiple learners, of different types7
  • data is processed multiple times with different, randomly selected, training/test examples; the results of which are compared via significance tests and effect size tests (e.g. cross-validation)
  • carefully selects the hyperparameters that control the data miners (e.g. via analysis of settings in related work or some automatic hyperparameter optimizer such as grid search)
  • manually inspects some non-trivial portion of the data (i.e. data sanity checks)
  • clearly distinguishes evidence-based results from interpretations and speculation8
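
The repeated-split attribute above can be sketched as follows, assuming two toy learners and synthetic data that are purely hypothetical. Each learner is scored on the same 30 random train/test splits, and the two score distributions are compared with Cliff's delta as an effect size. A significance test would normally accompany the effect size; it is omitted here to keep the sketch dependency-free.

```python
import random
import statistics


def repeated_holdout(rows, fit_and_score, repeats=30, test_fraction=0.3, seed=42):
    """Score a learner on `repeats` random train/test splits of the same rows."""
    rng = random.Random(seed)
    n_test = max(1, round(len(rows) * test_fraction))
    scores = []
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(fit_and_score(train, test))
    return scores


def majority_accuracy(train, test):
    """Baseline learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(y == majority for _, y in test) / len(test)


def threshold_accuracy(train, test):
    """Toy learner: cut halfway between the two class means of the feature."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    cut = (statistics.mean(pos) + statistics.mean(neg)) / 2
    return sum((x > cut) == y for x, y in test) / len(test)


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))


# Synthetic (feature, label) rows with deterministic pseudo-noise near x = 50.
data = [(x, x + (x * 37 % 21 - 10) > 50) for x in range(100)]

# The same seed gives both learners identical splits.
baseline = repeated_holdout(data, majority_accuracy)
candidate = repeated_holdout(data, threshold_accuracy)
delta = cliffs_delta(candidate, baseline)
spread = statistics.stdev(candidate)  # variation across trials, to be reported
```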

Extraordinary Attributes

  • leverages temporal data via longitudinal analyses (see the Longitudinal Studies Standard)
  • triangulates with qualitative data analysis of selected samples of the data
  • triangulates with other data sources, such as surveys or interviews
  • shares findings with and solicits feedback from the creators of the (software) artifacts being studied

Examples of Acceptable Deviations

  • Using lighter and less precise data processing (e.g. keyword matching or random subsampling) if the scale of data is too large for a precise analysis to be practical.
  • Data not shared because it is impractical (e.g. too large) or unethical (e.g. too sensitive). Enough information should be offered to assure the reader that the data is real.
  • Not using temporal analysis techniques such as time series when the data is not easily converted to time series (e.g. some aspects of source code evolution may not be easily modelled as time series).
  • Not all studies need statistics and hypotheses. Some studies can be purely or principally descriptive.
  • Different explanations have different requirements (e.g. hold out sets, cross-validation)9.

Antipatterns

  • Using statistical tests without checking their assumptions.
  • Using Bayesian statistics without motivating priors.
  • Claiming causation without establishing covariation and precedence, eliminating third-variable explanations, and at least hypothesizing a generative mechanism.
  • Applying pre-processing that alters both training and test data; e.g. while it may be useful to adjust training data class distributions via (say) sub-sampling of majority classes, that adjustment should not be applied to the test data (since it is important to assess the learner on the kinds of data that might be seen “in the wild”).
  • Unethical data collection or analysis (see the Ethics (Secondary Data) supplement)
  • Significance tests without effect size tests; effect sizes without confidence intervals.
  • Reporting a median, without any indication of variance (e.g., a boxplot).
  • Conducting multiple trials without reporting variations between trials.
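
The pre-processing antipattern above can be avoided by adjusting class distributions in the training data only. The sketch below is a minimal illustration under assumed names and synthetic data: `undersample_majority` is a hypothetical helper, and the split is fixed rather than random only so that the example is deterministic.

```python
import random
from collections import Counter


def undersample_majority(train, seed=0):
    """Balance classes in the training rows by sub-sampling the larger classes."""
    rng = random.Random(seed)
    by_label = {}
    for row in train:
        by_label.setdefault(row[1], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Synthetic (feature, label) rows; every fifth row is positive (20%).
rows = [(i, i % 5 == 0) for i in range(100)]
train, test = rows[:70], rows[70:]

balanced_train = undersample_majority(train)  # class distribution adjusted
train_counts = Counter(label for _, label in balanced_train)
test_counts = Counter(label for _, label in test)  # test set left as collected
```

The test set keeps its original 20% positive rate, so the learner is assessed on data resembling what it would meet in practice.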

Invalid Criticisms

  • You should have analyzed data ABC. The question reviewers should ask is whether the paper’s main claims are supported by the data that was analyzed, not whether some other data would have been better.
  • Does not have a reproduction package. These are desirable, not essential (yet).
  • Findings are not actionable: not all studies may have directly actionable findings in the short term.
  • “Needs more data” as a generic criticism without a clear, justified reason.
  • Study does not use qualitative data.
  • Study does not make causal claims, when it cannot.
  • Study does not use the most precise data source, unless the data source is clearly problematic for the study at hand. Some data is impractical to collect at scale.

Suggested Readings

  1. Hemmati, Hadi, et al. “The msr cookbook: Mining a decade of research.” 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 2013.
  2. Robles, Gregorio, and Jesus M. Gonzalez-Barahona. “Developer identification methods for integrated data from various sources.” (2005).
  3. Dey, Tapajit, et al. “Detecting and Characterizing Bots that Commit Code.” arXiv preprint arXiv:2003.03172 (2020).
  4. Hora, Andre, et al. “Assessing the threat of untracked changes in software evolution.” Proceedings of the 40th International Conference on Software Engineering. 2018.
  5. Herzig, Kim, and Andreas Zeller. “The impact of tangled code changes.” 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 2013.
  6. Berti-Équille, L. (2007). Measuring and Modelling Data Quality for Quality-Awareness in Data Mining. In F. Guillet & H. J. Hamilton (eds.), Quality Measures in Data Mining, Vol. 43 (pp. 101-126). Springer. ISBN: 978-3-540-44911-9.
  7. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., & Regnell, B. (2012). Experimentation in Software Engineering. Springer. ISBN: 978-3-642-29043-5
  8. Raymond P. L. Buse and Thomas Zimmermann. 2012. Information needs for software development analytics. In Proceedings of the 34th International Conference on Software Engineering (ICSE ‘12). IEEE Press, 987–996.
  9. https://aaai.org/Conferences/AAAI-21/reproducibility-checklist/
  10. Baljinder Ghotra, Shane McIntosh, and Ahmed E. Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE ‘15). IEEE Press, 789–800.
  11. Daniel Russo and Klaas-Jan Stol. In press. PLS-SEM for Software Engineering Research: An Introduction and Survey. ACM Computing Surveys.

Exemplars

  1. A. Barua, S. W. Thomas, A. E. Hassan, What are developers talking about? An analysis of topics and trends in Stack Overflow, Empirical Software Engineering 19 (3) (2014) 619–654.
  2. Bird, C., Rigby, P. C., Barr, E. T., Hamilton, D. J., German, D. M., & Devanbu, P. (2009, May). The promises and perils of mining git. In 2009 6th IEEE International Working Conference on Mining Software Repositories (pp. 1-10). IEEE.
  3. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., Germán, D. M. & Damian, D. E. (2014). The promises and perils of mining GitHub. In P. T. Devanbu, S. Kim & M. Pinzger (eds.), MSR (pp. 92-101). ACM. ISBN: 978-1-4503-2863-0
  4. Herbsleb, J. & Mockus, A. (2003). An Empirical Study of Speed and Communication in Globally Distributed Software Development. IEEE Transactions on Software Engineering, 29, 481-494.
  5. Menzies, T., Greenwald, J., & Frank, A. (2006). Data mining static code attributes to learn defect predictors. IEEE transactions on software engineering, 33(1), 2-13.
  6. Menzies, T., & Marcus, A. (2008, September). Automated severity assessment of software defect reports. In 2008 IEEE International Conference on Software Maintenance (pp. 346-355). IEEE.
  7. Nair, V., Agrawal, A., Chen, J., Fu, W., Mathew, G., Menzies, T., Minku, L. L., Wagner, M. & Yu, Z. (2018). Data-driven search-based software engineering. In A. Zaidman, Y. Kamei & E. Hill (eds.), MSR (pp. 341-352). ACM.
  8. Rahman, F., & Devanbu, P. (2013, May). How, and why, process metrics are better. In 2013 35th International Conference on Software Engineering (ICSE) (pp. 432-441). IEEE.
  9. Tufano, M., Palomba, F., Bavota, G., Oliveto, R., Penta, M. D., Lucia, A. D. & Poshyvanyk, D. (2017). When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away). IEEE Trans. Software Eng., 43, 1063-1088.

1 Dhar, V. (2013). Data Science and Prediction, Communications of the ACM, December 2013, Vol. 56 No. 12, Pages 64-73. https://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext
2 Akidau, Tyler, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety et al. “The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing.” (2015). Proceedings of the VLDB Endowment 8.12
3 cf. Nargesian, Fatemeh, Horst Samulowitz, Udayan Khurana, Elias B. Khalil, and Deepak S. Turaga. “Learning Feature Engineering for Classification.” In IJCAI, pp. 2529-2535. 2017.
4 including GPU/CPU models; amount of memory; operating system; names and versions of relevant software libraries and frameworks
5 For example, failing to address variations in the size or complexity of training, testing and validation data sets. For more, see: Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., & Regnell, B. (2012). Experimentation in Software Engineering. Springer. ISBN: 978-3-642-29043-5
6 Sarkar, T. (2019). Synthetic data generation - a must-have skill for new data scientists. (July 2019). https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae
7 e.g. regression, bayes classifier, decision tree, random forests, SVM (maybe with different kernels); for guidance, see Baljinder Ghotra, Shane McIntosh, and Ahmed E. Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE ‘15). IEEE Press, 789–800.
8 Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.
9 cf. Raymond P. L. Buse and Thomas Zimmermann. 2012. Information needs for software development analytics. In Proceedings of the 34th International Conference on Software Engineering (ICSE ‘12). IEEE Press, 987–996.

Engineering Research (AKA Design Science)

Research that invents and evaluates technological artifacts

Application

This standard applies to manuscripts that propose and evaluate technological artifacts, including algorithms, models, languages, methods, systems, tools, and other computer-based technologies. This standard is not appropriate for:

  • evaluations of pre-existing engineering research approaches (consider the Experiments Standard)
  • experience reports of applying pre-existing engineering research approaches

Specific Attributes

Essential Attributes

  • describes the proposed artifact in adequate detail1
  • justifies the need for, usefulness of, or relevance of the proposed artifact2
  • conceptually evaluates the proposed artifact; discusses its strengths, weaknesses and limitations3

  • empirically evaluates the proposed artifact using:
    action research, in which the researchers intervene in a real organization using the artifact,
    a case study, in which the researchers observe a real organization using the artifact,
    a controlled experiment, in which human participants use the artifact,
    a quantitative simulation, in which the artifact is assessed (usually against a competing artifact) in an artificial environment,
    a benchmarking study, in which the artifact is assessed using one or more benchmarks, or
    another method for which a clear and convincing rationale is provided
  • clearly indicates which of the above empirical methodologies is used
  • EITHER: discusses state-of-art alternatives (and their strengths, weaknesses and limitations)
    OR: explains why no state-of-art alternatives exist
    OR: provides compelling argument that direct comparisons are impractical
  • EITHER: empirically compares the artifact to one or more state-of-the-art alternatives
    OR: empirically compares the artifact to one or more state-of-the-art benchmarks
    OR: provides a clear and convincing rationale for why comparative evaluation is impractical

  • assumptions (if any) are explicit, plausible and do not contradict each other or the contribution’s goals
  • uses notation consistently (if any notation is used)

Desirable Attributes

  • provides supplementary materials including source code (if the artifact is software) or a comprehensive description of the artifact (if not software), and any input datasets (if applicable)
  • justifies any items missing from replication package based on practical or ethical grounds
  • discusses the theoretical basis of the artifact
  • provides correctness arguments for the key analytical and theoretical contributions (e.g. theorems, complexity analyses, mathematical proofs)
  • includes one or more running examples to elucidate the artifact
  • evaluates the artifact in an industry-relevant context (e.g. widely used open-source projects, professional programmers)

Extraordinary Attributes

  • contributes to our collective understanding of design practices or principles
  • presents ground-breaking innovations with obvious real-world benefits

General Quality Criteria

  • Comprehensiveness of proposed artifact description
  • Appropriateness of evaluation methods to the nature, goals, and assumptions of the contribution
  • Relationship of innovativeness to rigorousness: less innovative artifacts require more rigorous evaluations

Antipatterns

  • overstates the novelty of the contribution
  • omits details of key conceptual aspects while focusing exclusively on incidental implementation aspects
  • evaluation consists only of eliciting users’ opinions of the artifact
  • evaluation consists only of quantitative performance data that is not compared to established benchmarks or alternative solutions (see related point in “Invalid Criticisms”)

Invalid Criticisms

  • The paper does not report as ambitious an empirical study as other predominately empirical papers. The more innovative the artifact and more comprehensive the conceptual evaluation, the less we should expect from the empirical study.
  • Too few experimental subjects (e.g. the source code used to evaluate a static analysis technique) if few subjects are available in the contribution’s domain or the experimental evaluation is part of a more comprehensive validation strategy (e.g. formal arguments). Other criteria, such as the variety, realism, availability, and scale of the subjects, should also be considered to assess the quality of the evaluation.
  • No replication package, if there are clear, convincing practical or ethical reasons preventing artifact disclosure.
  • The artifact is not experimentally compared with related approaches that are not publicly available. In other words, before saying “you should have compared this against X,” make sure X is actually available and functional.
  • This is not the first known solution to the identified problem. The novelty of the paper can be in how it achieves scalability, better performance on specific classes of problems, applicability to realistic systems, stronger theoretical guarantees, or other aspects of improvement. Proposed artifacts should outperform existing artifacts on some dimension(s).
  • The contribution is not technically complicated. What matters is that it works. Unnecessary complexity is undesirable.

Suggested Readings 4

Richard Baskerville, Jan Pries-Heje, and John Venable. 2009. Soft design science methodology. In Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology (DESRIST ‘09). Association for Computing Machinery, New York, NY, USA, Article 9, 1–11. DOI: 10.1145/1555619.1555631

Carlo Ghezzi. 2020. Being a researcher - an informatics perspective. Springer Nature.

Alan Hevner and Samir Chatterjee. 2010. Design Research in Information Systems. Integrated Series in Information Systems. Springer, 22, (Mar. 2010), 145–156. DOI: 10.1007/978-1-4419-5653-8_11

Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. 2004. Design Science in Information Systems Research. MIS Quarterly, 28, 1 (Mar. 2004), 75–105. DOI:10.2307/25148625.

Roel Wieringa. 2014. Design science methodology for information systems and software engineering. Springer.

Exemplars

Kihong Heo, Hakjoo Oh and Hongseok Yang. 2019. Resource-aware Program Analysis via Online Abstraction Coarsening. In Proceedings of the 41st International Conference on Software Engineering.

Jianhui Chen and Fei He. 2018. Control Flow-Guided SMT Solving for Program Verification. In Proceedings of the 33rd International Conference on Automated Software Engineering.

Calvin Loncaric, Michael D. Ernst and Emina Torlak. 2018. Generalized Data Structure Synthesis. In Proceedings of the 40th International Conference on Software Engineering.

Nikolaos Tsantalis, Davood Mazinanian and Shahriar Rostami Dovom. 2017. Clone Refactoring with Lambda Expressions. In Proceedings of the 39th International Conference on Software Engineering.

August Shi, Suresh Thummalapenta, Shuvendu Lahiri, Nikolaj Bjorner and Jacek Czerwonka. 2017. Optimizing Test Placement for Module-Level Regression Testing. In Proceedings of the 39th International Conference on Software Engineering.

Magnus Madsen, Frank Tip, Esben Andreasen, Koushik Sen, and Anders Møller. 2016. Feedback-Directed Instrumentation for Deployed JavaScript Applications. In Proceedings of the 38th International Conference on Software Engineering.


1 e.g., does the paper describe the overall workflow of the solution, showing how different techniques work together? Are algorithmic contributions presented in an unambiguous way? Are the key parts of a formal model presented explicitly? Are the novel components of the solution clearly singled out?
2 i.e., is the problem the proposed approach tries to solve specific to a certain domain? If so, why? Why are state-of-the-art approaches not good enough to deal with the problem? How can the technical contribution be beneficial?
3 e.g., time complexity of an algorithm; theoretical.
4 Note: Learning by building innovative artifacts is called engineering research. Some of the following readings incorrectly refer to engineering research as “design science” because the information systems community misappropriated that term circa 2004. Design science has referred to the study of designers and their processes since at least the 1940s, and still does outside of the information systems community.

Experiments (with Human Participants)

A study in which an intervention is deliberately introduced to observe its effects on some aspects of reality under controlled conditions

Application

This standard applies to controlled experiments and quasi-experiments that meet all of the following conditions:

  • manipulates one or more independent variables
  • controls many extraneous variables
  • applies each treatment independently to several experimental units
  • involves human participants

In true experiments, experimental units are randomly allocated across treatments; quasi-experiments lack random assignment. Experiments include between-subjects, within-subjects and repeated measures designs. For experiments without human participants, see the Data Science Standard or the Engineering Research Standard.

Specific Attributes

Essential Attributes

  • states formal hypotheses
  • justifies use of one-sided hypotheses (if any) based on face validity or previous work
  • describes the dependent variable(s) and justifies how they are measured (including units, instruments)
  • describes the independent variable(s) and how they are manipulated or measured
  • describes extraneous variables and how they are controlled, or not
  • describes how characteristics of the phenomenon under investigation relate to experimental constructs
  • describes the research design and protocol including treatments, materials, tasks, design (e.g. 2x2 factorial), participant allocation, period and sequences (for crossover designs), and logistics
  • EITHER: uses random assignment and explains logistics (e.g. how random numbers were generated)
    OR: provides compelling justification for not using random assignment and explains how the threat to validity from unequal groups is mitigated (e.g. using pre-test/post-test and matched subjects design)
  • describes experimental objects (e.g. real or toy system) and their characteristics (e.g. size, type)
  • justifies selection of experimental objects; acknowledges object-treatment confounds, if any1
  • design and protocol appropriate (not optimal) for stated research questions and hypotheses
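The random-assignment attribute above asks authors to explain logistics such as how random numbers were generated. One way to make that step reportable is to script it with a recorded seed. The following is a minimal sketch, not a prescribed procedure: the function name `assign_to_groups`, the participant IDs, and the seed value are all hypothetical.

```python
import random


def assign_to_groups(participant_ids, treatments, seed):
    """Randomly allocate participants to treatments in near-equal groups.

    A dedicated, seeded generator makes the allocation reproducible, so the
    seed can be reported in the paper or its supplementary material.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    allocation = {treatment: [] for treatment in treatments}
    # Deal the shuffled participants round-robin so group sizes differ by at most one.
    for index, participant in enumerate(shuffled):
        allocation[treatments[index % len(treatments)]].append(participant)
    return allocation


# Hypothetical example: twelve participants, two treatments, an arbitrary seed.
participants = [f"P{n:02d}" for n in range(1, 13)]
groups = assign_to_groups(participants, ["control", "treatment"], seed=20240501)
for treatment, members in groups.items():
    print(treatment, sorted(members))
```

Re-running the script with the same seed yields the same allocation, which lets reviewers audit the assignment.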

  • describes participants (e.g. age, gender, education, relevant experience or preferences)
  • reports distribution-appropriate descriptive and inferential statistics; justifies tests used
  • reports effect sizes with confidence intervals (if using frequentist approach)

  • discusses construct, conclusion, internal, and external validity
  • discusses alternative interpretations of results
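To illustrate the attribute on effect sizes with confidence intervals, the sketch below computes Cohen's d for two independent groups together with an approximate interval. It assumes a pooled standard deviation and the common large-sample normal approximation for the standard error of d; other estimators (e.g. bootstrap or noncentral-t intervals) may be more appropriate for small samples. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(group_a, group_b, confidence=0.95):
    """Return (d, lower, upper) for the standardized mean difference b - a."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled sample variance of the two groups.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    d = (mean(group_b) - mean(group_a)) / sqrt(pooled_var)
    # Large-sample approximation of the standard error of d (an assumption of this sketch).
    se = sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d, d - z * se, d + z * se


# Hypothetical task scores for a control and a treatment group.
control = [12, 15, 11, 14, 13, 16, 12, 15]
treatment = [16, 18, 15, 17, 19, 16, 18, 17]
d, lower, upper = cohens_d_with_ci(control, treatment)
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```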

Desirable Attributes

  • provides supplementary material such as complete, algorithmic research protocol; task materials; raw, de-identified dataset; analysis scripts
  • justifies hypotheses and Bayesian priors (if applicable) based on previous studies and theory
  • discusses alternative experimental designs and why they were not used (e.g. validity trade-offs)
  • includes visualizations of data distributions
  • cites statistics research to support any nuanced issues or unusual approaches
  • explains deviations between design and execution, and their implications2
  • names the experimental design (e.g. simple 2-group, 2x2 factorial, randomized block)
  • justifies sample size (e.g. using power analysis)
  • analyzes construct validity of dependent variable
  • reports manipulation checks
  • pre-registration of hypotheses and design (where venue allows)
  • clearly distinguishes evidence-based results from interpretations and speculation3
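The sample-size attribute above mentions power analysis. As a rough illustration, the sketch below estimates the number of participants per group for a two-sided comparison of two independent means. It uses the normal approximation rather than the exact t-based calculation that dedicated tools perform, so it slightly underestimates the required size. The function name and default values are assumptions.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group comparison.

    effect_size is the expected standardized mean difference (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {participants_per_group(d)} participants per group")
```

The expected effect size should itself be justified, for example from previous studies.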

Extraordinary Attributes

  • reports multiple experiments or replications in different cultures or regions
  • uses multiple methods of data collection; data triangulation
  • longitudinal data collection with appropriate time-series analysis (see the Longitudinal Studies Standard)

General Quality Criteria

Conclusion validity, construct validity, internal validity, reliability, objectivity, reproducibility

Antipatterns

  • using bad proxies for dependent variables (e.g. task completion time as a proxy for task complexity)
  • quasi-experiments without a good reason4
  • treatments or response variables are poorly described
  • inappropriate design for the conditions under which the experiment took place
  • data analysis technique used does not correspond to the design chosen or data characteristics (e.g. using an independent samples t-test on paired data)
  • validity threats are simply listed without linking them to results
  • hypotheses are missing
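The antipattern about mismatched analysis techniques (e.g. an independent samples t-test on paired data) can be made concrete with a small calculation. The sketch below computes both test statistics by hand for hypothetical within-subjects data; the numbers are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """t statistic for paired observations (same participant measured twice)."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def welch_t(sample_a, sample_b):
    """t statistic that (wrongly, for paired data) treats the samples as independent."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_b) - mean(sample_a)) / se


# Hypothetical scores of six participants under two conditions.
before = [10, 12, 14, 16, 18, 20]
after = [11, 13.5, 15, 17.5, 19, 21.5]

print(f"paired t      = {paired_t(before, after):.2f}")
print(f"independent t = {welch_t(before, after):.2f}")
```

Because participants differ far more from each other than each participant changes between conditions, the independent-samples statistic is swamped by between-participant variance, while the paired statistic isolates the consistent within-participant change.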

Invalid criticisms

  • participants are students—appropriateness of participant characteristics should be judged based on the context, desired level of control, trade-off choices between internal and external validity, and the specifics of the technology (i.e. method, technique, tool, process, etc.) under evaluation; the choice must be explained in the paper
  • low external validity
  • the experiment is a replication
  • the reviewer would have investigated the topic in some way other than an experiment
  • not enough participants (unless supported by power analysis)

Exemplars

Dag IK Sjøberg, Aiko Yamashita, Bente CD Anda, Audris Mockus, and Tore Dybå. 2012. Quantifying the Effect of Code Smells on Maintenance Effort. IEEE Transactions on Software Engineering. 39, 8 (Dec. 2012), 1144–1156. DOI: 10.1109/TSE.2012.89.

Ayse Tosun, Oscar Dieste, Davide Fucci, Sira Vegas, Burak Turhan, Hakan Erdogmus, Adrian Santos et al. 2017. An industry experiment on the effects of test-driven development on external quality and productivity. Empirical Software Engineering. 22, 6 (Dec. 2017), 2763–2805.

Kai Petersen, Kari Rönkkö, and Claes Wohlin. 2008. The impact of time controlled reading on software inspection effectiveness and efficiency: a controlled experiment. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ‘08), 139–148. DOI:10.1145/1414004.1414029

Eduard P. Enoiu, Adnan Čaušević, Daniel Sundmark, and Paul Pettersson. 2016. A controlled experiment in testing of safety-critical embedded software. In 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST), 11-15 April, Chicago, IL, USA. IEEE. 1-11.

Yang Wang and Stefan Wagner. 2018. Combining STPA and BDD for safety analysis and verification in agile development. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings (ICSE ‘18), 286–287. DOI:10.1145/3183440.3194973

Evrim Itir Karac, Burak Turhan, and Natalia Juristo. 2019. A Controlled Experiment with Novice Developers on the Impact of Task Description Granularity on Software Quality in Test-Driven Development. IEEE Transactions on Software Engineering. DOI: 10.1109/TSE.2019.2920377

Suggested Reading

Donald T. Campbell and Julian C. Stanley. 1963. Experimental and Quasi-experimental Designs For Research. Chicago: R. McNally.

Andreas Jedlitschka, Marcus Ciolkowski, and Dietmar Pfahl. 2008. Reporting Experiments in Software Engineering. Guide to Advanced Empirical Software Engineering. 201-228.

Natalia Juristo and Ana M. Moreno. 2001. Basics of Software Engineering Experimentation. Springer Science & Business Media.

Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer Science & Business Media.

Martín Solari, Sira Vegas, and Natalia Juristo. 2018. Content and structure of laboratory packages for software engineering experiments. Information and Software Technology. 97, 64-79.

Sira Vegas, Cecilia Apa, and Natalia Juristo. 2015. Crossover designs in software engineering experiments: Benefits and perils. IEEE Transactions on Software Engineering. IEEE 42, 2 (2015), 120-135.

Vigdis By Kampenes, Tore Dybå, Jo E. Hannay, and Dag IK Sjøberg. 2009. A systematic review of quasi-experiments in software engineering. Information and Software Technology. 51, 1 (2009), 71-82.

Davide Falessi, Natalia Juristo, Claes Wohlin, Burak Turhan, Jürgen Münch, Andreas Jedlitschka, and Markku Oivo, Empirical Software Engineering Experts on the Use of Students and Professionals in Experiments, Empirical Software Engineering. 23, 1 (2018), 452-489.

Robert Feldt, Thomas Zimmermann, Gunnar R. Bergersen, Davide Falessi, Andreas Jedlitschka, Natalia Juristo, Jürgen Münch et al. 2018. Four commentaries on the use of students and professionals in empirical software engineering experiments. Empirical Software Engineering. 23, 6 (Nov. 2018), 3801-3820.

Kitchenham, Barbara, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong. 2017. Robust statistical methods for empirical software engineering. Empirical Software Engineering. 22, 2 (2017), 579-630.

Andreas Zeller, Thomas Zimmermann, and Christian Bird. 2011. Failure is a four-letter word: a parody in empirical research. In Proceedings of the 7th International Conference on Predictive Models in Software Engineering (Promise ’11). Association for Computing Machinery, New York, NY, USA, Article 5, 1–7. DOI: 10.1145/2020390.2020395


1 For example, in an experiment where the control group applies Test-Driven Development (TDD) with Object 1 while the treatment group applies Test-Last-Development (TLD) with Object 2, the experimental object is confounded with the treatment.
2 e.g. dropouts affecting balance between treatment and control group.
3 Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.
4 Quasi-experiments are appropriate for pilot studies or when assignment is beyond the researcher’s control (e.g. assigning students to two different sections of a course). Simply claiming that a study is “exploratory” is not sufficient justification.

Grounded Theory

A study of a specific area of interest or phenomenon that involves iterative and interleaved rounds of qualitative data collection and analysis, leading to key patterns (e.g. concepts, categories)

Application

This standard applies to empirical inquiries that meet all of the following conditions:

  • Explores a broad area of investigation without specific, up-front research questions.
  • Applies theoretical sampling with iterative and interleaved rounds of data collection and analysis.
  • Reports rich and nuanced findings, typically including verbatim quotes and samples of raw data.

For predominately qualitative inquiries that do not iterate between data collection and analysis or do not use theoretical sampling, consider the Case Study Standard or the Qualitative Survey Standard.

Specific Attributes

Essential Attributes

  • identifies the version of Grounded Theory used/adapted (Glaser, Strauss-Corbin, Charmaz, etc.)
  • explains how data source(s) were selected and accessed (e.g. participant sampling strategy)
  • describes data sources (e.g. participants’ demographics, work roles)
  • explains how the research iterated between data collection and analysis using constant comparison and theoretical sampling
  • provides evidence of saturation; explains how saturation was achieved1
  • explains how key patterns (e.g. categories) emerged from GT steps (e.g. selective coding)

  • provides clear chain of evidence from raw data (e.g. interviewee quotations) to derived codes, concepts, and categories

Desirable Attributes

  • provides supplemental materials such as interview guide(s), coding schemes, coding examples, decision rules, or extended chain-of-evidence tables
  • explains how and why study adapts or deviates from claimed GT version
  • presents a mature, fully-developed theory or taxonomy
  • includes diverse participants and/or data sources (e.g. software repositories, forums)
  • uses direct quotations extensively to support key points
  • explains how memo writing was used to drive the work
  • validates results using member checking, dialogical interviewing, feedback from non-participant practitioners or research audits of coding by advisors or other researchers
  • discusses transferability; characterizes the setting such that readers can assess transferability
  • compares results with (or integrates them into) prior theory or related research
  • explains theoretical sampling vis-à-vis the interplay between the sampling process, the emerging findings, and theoretical gaps perceived therein
  • reflects on how researcher’s biases may have affected their analysis
  • explains the role of literature, especially where an extensive review preceded the GT study

Extraordinary Attributes

  • triangulates with extensive quantitative data (e.g. questionnaires, sentiment analysis)
  • employs a team of researchers and explains their roles

Quality Criteria

Glaser, Strauss, Corbin and Charmaz advance inconsistent quality criteria. Using definitions in our Glossary, reviewers should consider common qualitative criteria such as credibility, resonance, usefulness and the degree to which results extend our cumulative knowledge. Quantitative quality criteria such as internal validity, construct validity, replicability, generalizability and reliability typically do not apply.

Examples of Acceptable Deviations

  • In a study of sexual harassment at a named organization, detailed description of interviewees and direct quotations are omitted to protect participants.

Antipatterns

  • Conducting data collection and data analysis sequentially; applying only analysis techniques of GT.
  • Data analysis focusing on counting words, codes, concepts, or categories instead of interpreting.
  • Presenting a tutorial on grounded theory instead of explaining how the current study was conducted.
  • Small, heterogeneous samples creating the illusion of convergence and theoretical saturation. For example, it is highly unlikely that a full theory can be derived only from interviews with 20 people.
  • Focusing only on interviews without corroborating statements with other evidence (e.g. documents, observation).

Invalid Criticisms

  • Lack of quantitative data; causal analysis; objectivity, internal validity, reliability, or generalizability.
  • Lack of replicability or reproducibility; not releasing transcripts
  • Lack of representativeness (e.g. of a study of Turkish programmers, ‘how does this generalize to America?’)
  • Research questions should have been different
  • Findings should have been presented as a different set of relationships, hypotheses, or a different theory.

Suggested Readings

Steve Adolph, Wendy Hall, and Philippe Kruchten. 2011. Using grounded theory to study the experience of software development. Empirical Software Engineering. 16, 4 (2011), 487–513.

Khaldoun M. Aldiabat and Carole-Lynne Le Navenec. “Data Saturation: The Mysterious Step in Grounded Theory Methodology.” The Qualitative Report, vol. 23, no. 1, 2018, pp. 245-261.

Kathy Charmaz. 2014. Constructing grounded theory. Sage.

Juliet Corbin and Anselm Strauss. 2014. Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage publications.

Dennis A. Gioia, Kevin G. Corley, and Aimee L. Hamilton. “Seeking qualitative rigor in inductive research: Notes on the Gioia methodology.” Organizational Research Methods 16, no. 1 (2013): 15-31.

Janice M. Morse, Barbara J. Bowers, Kathy Charmaz, Adele E. Clarke, Juliet Corbin, Caroline Jane Porr, and Phyllis Noerager Stern (eds). (2021) Developing Grounded Theory: The Second Generation Revisited. Routledge, New York, USA.

Klaas-Jan Stol, Paul Ralph, and Brian Fitzgerald. 2016. Grounded theory in software engineering research: a critical review and guidelines. In Proceedings of the 38th International Conference on Software Engineering (ICSE ‘16). Association for Computing Machinery, New York, NY, USA, 120–131. DOI: 10.1145/2884781.2884833

Exemplars

Barthélémy Dagenais and Martin P. Robillard. 2010. Creating and evolving developer documentation: understanding the decisions of open source contributors. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering (FSE ‘10). Association for Computing Machinery, New York, NY, USA, 127–136. DOI: 10.1145/1882291.1882312

Rashina Hoda, James Noble, and Stuart Marshall. 2012. Self-organizing roles on agile software development teams. IEEE Transactions on Software Engineering. IEEE 39, 3 (May 2012), 422–444. DOI: 10.1109/TSE.2012.30

Todd Sedano, Paul Ralph, and Cécile Péraire. 2017. Software development waste. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). (May. 2017), 130–140. DOI: 10.1109/icse (2017).

Christoph Treude and Margaret-Anne Storey. 2011. Effective communication of software development knowledge through community portals. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering (ESEC/FSE ‘11). Association for Computing Machinery, New York, NY, USA, 91–101. DOI:10.1145/2025113.2025129

Michael Waterman, James Noble, and George Allan. 2015. How much up-front? A grounded theory of agile architecture. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. 1, (May 2015), 347–357.


1 cf. Khaldoun M. Aldiabat and Carole-Lynne Le Navenec. “Data Saturation: The Mysterious Step in Grounded Theory Methodology.” The Qualitative Report, vol. 23, no. 1, 2018, pp. 245-261.

Longitudinal Studies

A study focusing on the changes in and evolution of a phenomenon over time

Application

This standard applies to studies that involve repeated observations of the same variables (e.g., productivity or technical debt) over a period of time. Longitudinal studies include the analysis of datasets over time, such as the analysis of the evolution of the code. Longitudinal studies require at least two data collection waves, and subjects (humans or artifacts) must remain identifiable between waves.

For cross-sectional analysis, consider the Exploratory Data Science Standard or the Experiments Standard (if variables are manipulated).

Specific Attributes

Essential Attributes

  • determines the appropriate number of waves based on the natural oscillation of the research phenomenon1
  • uses at least two data collection waves
  • subjects (humans or artifacts) are identifiable between waves
  • justifies the data analysis strategy3
  • the data analysis strategy is appropriate for the interdependent nature of the data2
  • discusses the critical alpha levels or justifies Bayesian priors4
  • justifies sample size (e.g. using power analysis)5
  • describes data loss throughout the different waves
  • explains how missing data are handled

  • describes the subjects (e.g., demographic information in the case of humans)6

  • discusses the operationalization of the research model (i.e. construct validity)7
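The attribute on critical alpha levels is easier to discuss with numbers in hand. The sketch below shows how the chance of at least one false positive grows with the number of comparisons, and how a Bonferroni-adjusted threshold counters it. Bonferroni is used here only as one familiar correction, not as the adjustment this standard prescribes, and the familywise calculation assumes independent tests. The function names are hypothetical.

```python
def familywise_error_rate(alpha, comparisons):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / comparisons


comparisons = 20  # hypothetical number of tests across variables and waves
unadjusted = familywise_error_rate(0.05, comparisons)
per_test = bonferroni_alpha(0.05, comparisons)
adjusted = familywise_error_rate(per_test, comparisons)

print(f"unadjusted familywise error rate: {unadjusted:.3f}")
print(f"Bonferroni per-test alpha:        {per_test:.4f}")
print(f"adjusted familywise error rate:   {adjusted:.3f}")
```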

Desirable Attributes

  • provides supplementary materials including data sets, data collection scripts or instruments, analytical scripts, a description of how to reproduce the work and any other materials used
  • either builds new theory or tests existing theory
  • investigates causality using the longitudinal nature of the data to establish precedence and statistically controlling for third-variable explanations
  • discusses potential confounding factors (for inferential analyses) that cannot be statistically controlled
  • discusses data (in)consistency across waves (e.g. test-retest reliability)
  • examines differences in distributions between waves (and uses an appropriate data analysis strategy)
  • describes the cost of gathering data and any incentives used
  • addresses survivorship bias8
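For the attribute on data consistency across waves, test-retest reliability is often summarized as the correlation between the same subjects' scores in two waves. The sketch below is one simple way to compute it, assuming scores are stored per subject ID and that only subjects present in both waves are compared. The function names and data are hypothetical, and a Pearson correlation is only one of several possible reliability coefficients.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def retest_reliability(wave_a, wave_b):
    """Correlate scores of the subjects that appear in both waves."""
    shared = sorted(wave_a.keys() & wave_b.keys())
    return pearson_r([wave_a[s] for s in shared], [wave_b[s] for s in shared])


# Hypothetical scale scores keyed by subject ID; S5 dropped out, S6 joined late.
wave_1 = {"S1": 3.0, "S2": 4.5, "S3": 2.0, "S4": 5.0, "S5": 3.5}
wave_2 = {"S1": 3.2, "S2": 4.4, "S3": 2.5, "S4": 4.8, "S6": 4.0}
reliability = retest_reliability(wave_1, wave_2)
print(f"test-retest correlation: {reliability:.3f}")
```

Matching on subject ID is what makes the comparison possible, which is why subjects must remain identifiable between waves.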

Extraordinary Attributes

  • uses a multi-stage selection process to identify the study’s subjects9

  • follows subjects for an exceptionally long period (e.g. more than five years)

General Quality Criteria

Reliability, internal validity, conclusion validity, construct validity, and external validity.

Longitudinal studies exploit the temporal nature of data to maximize internal validity. Other criteria are sometimes sacrificed to improve internal validity.

Antipatterns

  • subject loss between waves is too high, leading to a severely underpowered study
  • the period between waves does not match the phenomenon’s natural cycles
  • treating longitudinal data as cross-sectional
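The first antipattern above concerns subject loss between waves. A short script can report how many baseline subjects remain at each wave, which also supports describing data loss in the paper. The sketch assumes each wave is available as a set of subject IDs; the function name and data are hypothetical.

```python
def retention_by_wave(waves):
    """Return (wave number, retained count, retained share) for each wave.

    waves is an ordered list of sets of subject IDs. A subject counts as
    retained only if it appears in the baseline and every wave so far.
    """
    baseline = waves[0]
    retained = set(baseline)
    report = []
    for number, subject_ids in enumerate(waves, start=1):
        retained &= subject_ids
        report.append((number, len(retained), len(retained) / len(baseline)))
    return report


# Hypothetical study with three waves; S6 joins late and is never counted.
waves = [
    {"S1", "S2", "S3", "S4", "S5"},
    {"S1", "S2", "S3", "S4"},
    {"S1", "S2", "S4", "S6"},
]
for number, count, share in retention_by_wave(waves):
    print(f"wave {number}: {count} subjects retained ({share:.0%})")
```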

Variations

  • Experience sampling provides a highly specific understanding of a phenomenon through multiple repeated measurements per day over a short period (typically one to three weeks). It emphasizes in-the-moment assessment rather than reflective assessment (van Berkel et al. 2017). This standard applies to experience sampling studies.
  • Cohort studies are a type of analytical observational study where researchers investigate the relationship between an independent and dependent variable by observing subjects over time and comparing groups with different levels of exposure. Cohort studies follow stricter rules than those presented here.10

Invalid Criticisms

  • Claiming that the time span between measurements is too short or too long.
  • Claiming that the number of waves is inadequate without a reasoned explanation.
  • Claiming that the sample size is too small without performing a post hoc power calculation.
  • Claiming that a paper with a modest number of comparisons should have used more conservative alphas or adopted a Bayesian approach.
  • Complaining about generalizability when the paper clearly acknowledges limitations to generalizability.

Suggested Readings

  • Franz Faul, et al. Statistical power analyses using G* Power 3.1: Tests for correlation and regression analyses. In Behavior Research Methods. 41,4 (2009), 1149–1160.
  • Flavius Kehr and Tobias Kowatsch. 2015. Quantitative Longitudinal Research: A Review of IS Literature, and a Set of Methodological Guidelines. In Proceedings of the 23rd European Conference on Information Systems (ECIS). Münster, Germany.
  • Hall, Sharon M., et al. "Statistical analysis of randomized trials in tobacco treatment: longitudinal designs with dichotomous outcome." Nicotine & Tobacco Research 3.3 (2001): 193–202.
  • Duncan, Susan C., Terry E. Duncan, and Hyman Hops. "Analysis of longitudinal data within accelerated longitudinal designs." Psychological Methods 1.3 (1996): 236.
  • Langfred, Claus W. "The downside of self-management: A longitudinal study of the effects of conflict on trust, autonomy, and task interdependence in self-managing teams." Academy of Management Journal 50.4 (2007): 885–900.
  • Benner, Mary J., and Michael Tushman. "Process management and technological innovation: A longitudinal study of the photography and paint industries." Administrative Science Quarterly 47.4 (2002): 676–707.
  • Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. In Psychological Science. 22,11 (2011), 1359–1366.
  • Niels van Berkel, Denzil Ferreira, and Vassilis Kostakos. The experience sampling method on mobile devices. In ACM Computing Surveys. 50,6 (2017), 1–40.

Exemplars

  • Daniel Russo, Paul H.P. Hanel, Seraphina Altnickel, and Niels van Berkel. Predictors of Well-being and Productivity among Software Professionals during the COVID-19 Pandemic – A Longitudinal Study. Empirical Software Engineering (2021).
  • Chandrasekar Subramaniam, Sen Ravi, and Matthew L. Nelson. Determinants of open source software project success: A longitudinal study. In Decision Support Systems. 46, 2 (2009), 576–585.
  • Davide Fucci, Simone Romano, Maria Teresa Baldassarre, Danilo Caivano, Giuseppe Scanniello, Burak Turhan, Natalia Juristo. A longitudinal cohort study on the retainment of test-driven development. International Symposium on Empirical Software Engineering and Measurement. (2018), 1-10
  • Jingyue Li, Nils B Moe, and Tore Dybå. 2010. Transition from a plan-driven process to Scrum: a longitudinal case study on software quality. International symposium on empirical software engineering and measurement. (2010), 1-10
  • Laurie McLeod, Stephen MacDonell, and Bill Doolin. Qualitative Research on software development: a longitudinal case study methodology. Empirical software engineering 16, 4 (2011), 430–459.
  • Jari Vanhanen, Casper Lassenius, and Mika V Mantyla. Issues and tactics when adopting pair programming: A longitudinal case study. International Conference on Software Engineering Advances. (2007)
  • Donald E Harter, Chris F Kemerer, and Sandra A Slaughter. 2012. Does software process improvement reduce the severity of defects? A longitudinal field study. IEEE Transactions on Software Engineering 38, 4 (2012), 810–827.

1 On the concept of natural oscillation cf. Kehr & Kowatsch, 2015.
2 Several different statistical approaches are used to analyze longitudinal data (Kehr & Kowatsch provide a partial overview).
3 Although there might not be one best method for a specific problem, it should still be discussed on a subjective level (e.g., why it fits best to the research question) and at an objective level (e.g., data normality).
4 The choice of thresholds (e.g., p-values < 0.05) should be discussed to avoid Type I errors. Typically, longitudinal analyses deal with many variables and multiple comparisons, increasing the likelihood of obtaining results within traditionally acceptable thresholds by chance. For this reason, authors are advised to adjust the critical alpha level (e.g., using p-values < 0.001 as a threshold) or use Bayesian statistics (Simmons et al., 2011).
5 Determining the sample size is of the utmost importance to avoid Type II errors. Thus, authors might define their sample size using a priori power calculations. At the same time, reviewers can check whether the sample size is adequate through a post hoc analysis (Faul et al., 2009).
6 The research design should explicitly state how the sample has been selected and filtered out through a selection process. For example, how are we sure to have included only software engineers when dealing with human subjects? Or, which type of quality controls have been performed on software repositories to ensure the consistency and homogeneity of artifacts?
7 It should be clear which factors are being investigated and how they were selected. Similarly, measurements should show adequate reliability based on literature benchmarks (e.g., Cronbach's alpha, test-retest reliability between waves).
8 For example, if you analyze the past 100 years of stock market performance based on the markets that exist today, you get much higher average returns than if you analyze all the markets that existed 100 years ago.
9 An example of such a selection process can be found in Russo, Daniel, and Klaas-Jan Stol. "Gender differences in personality traits of software engineers." IEEE Transactions on Software Engineering (2020).
10 See: David A. Grimes and Kenneth F. Schulz. Cohort studies: marching towards outcomes. The Lancet 359, no. 9303 (2002): 341-345.

Methodological Guidelines and Meta-Science

A paper that analyses an issue of research methodology or makes recommendations for conducting research

Application

This standard applies to papers that provide analysis of one or more methodological issues, or advice concerning some aspect of research.

  • may or may not include primary or secondary empirical data or analysis.
  • may consider philosophical or practical issues
  • may simultaneously be a methodology paper and an empirical study, to which another standard also applies; for example, if a paper reports a case study and then gives advice about a methodological issue illuminated by the case study, consider both this standard and the Case Study Standard.

Specific Attributes

Essential Attributes

  • presents information that is useful for other researchers
  • makes recommendations for future research
  • presents clear, valid arguments supporting recommendations

Desirable Attributes

  • synthesizes related work from reference disciplines
  • provides insight specifically for software engineering; goes beyond summarizing methodological guidance from existing works or reference disciplines
  • results integrated back into prior theory or research
  • develops helpful artifacts (e.g. checklists, templates, tests, tools, sets of criteria)

Extraordinary Attributes

  • includes an empirical study (e.g. a systematic literature review) that motivates the analysis or guidance
  • quantitative simulation illustrating methodological issues

General Criteria

  • comprehensiveness of analysis or guidance provided
  • usefulness to the research community
  • quality of argumentation supporting analysis or guidance
  • degree of integration with previous work, both in software engineering and in reference disciplines

Antipatterns

  • overreaching; informal logical fallacies (e.g. straw man argument, appeal to popularity, shifting the burden of proof)1
  • discussing an issue without clear conclusions; failing to provide clear guidelines
  • attacking individual studies or researchers; hypothetical examples should be used to avoid engendering animosity

Invalid Criticisms

  • Guidelines are not based on empirical evidence. Empirically testing meta-scientific propositions is typically impractical or impossible. Reviewers should evaluate the face validity, comprehensiveness and usefulness of the guidelines. It is not appropriate to reject methodological guidelines over lack of empirical support.

Notes

Because metascientific claims often cannot be justified empirically:

  • reviewers of methodology papers must themselves be experts in the methodology, so that they can evaluate the reasonableness of the guidelines
  • reviewers should be more pedantic in critiquing the discussion and guidelines than for an empirical paper

Exemplars

Natalia Juristo and Ana M. Moreno. 2001. Basics of Software Engineering Experimentation. Springer Science & Business Media.

Barbara Kitchenham and Stuart Charters. 2007. Guidelines for performing systematic literature reviews in software engineering.

Barbara Kitchenham, Lesley Pickard, and Shari Lawrence Pfleeger. 1995. Case studies for method and tool evaluation. IEEE software. 12, 4 (1995), 52-62.

Paul Ralph. 2019. Toward methodological guidelines for process theories and taxonomies in software engineering. IEEE Transactions on Software Engineering. 45, 7 (Jan. 2018), 712-735.

Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical software engineering. 14, 2, article 131.

Miroslaw Staron. 2019. Action Research in Software Engineering: Theory and Applications. In International Conference on Current Trends in Theory and Practice of Informatics. 39-49.

Klaas-Jan Stol, Paul Ralph, and Brian Fitzgerald. 2016. Grounded theory in software engineering research: a critical review and guidelines. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). Association for Computing Machinery, New York, NY, USA, 120–131. DOI:10.1145/2884781.2884833

Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer Science & Business Media.

Helen Sharp, Yvonne Dittrich, and Cleidson RB De Souza. 2016. "The role of ethnographic studies in empirical software engineering." IEEE Transactions on Software Engineering 42, 8: 786-804.

Yvonne Dittrich, Kari Rönkkö, Jeanette Eriksson, Christina Hansson, & Olle Lindeberg (2008). Cooperative method development. Empirical Software Engineering, 13, 3, 231–260.


1 Citing seminal works is not the “appeal to authority” fallacy.

Multi-Methodology and Mixed Methods Research

Studies that use two or more approaches to data collection or analysis to corroborate, complement and expand research findings (multi-methodology) or combine and integrate inductive research with deductive research (mixed methods), often but not necessarily relying on qualitative and/or quantitative data.

Application

This standard applies to software engineering studies that use two or more data collection or analysis methods. It enumerates criteria related to the mixing of methodologies, not the methodologies themselves. For the latter, refer to method-specific standards. For example, a multi-methodology study combining a case study with an experiment should comply with The Case Study Standard and The Experiment Standard as well as this standard.

Specific Attributes

Essential

  • justifies using multiple methodologies and/or methods
  • provides a purpose statement that conveys the overarching multi or mixed method design intent (why)

  • describes the multi-methodology, multi-method or mixed method design (what)
  • describes which phases of the research study the different methods or methodologies are used in (when)
  • describes how the design aligns with the research question or objective

  • integrates the findings from all methods to address the research question/objective

  • acknowledges the limitations associated with integrating findings1

Desirable

  • defines the multi-methodology or mixed method design used
  • describes and justifies sample reuse (or no reuse, or partial reuse) across methods
  • illustrates the research design using a visual model (diagram)
  • indicates the use of multiple methods or mixed method design in the title
  • (for mixed-methods) includes, in the literature review, a mixture of quantitative, qualitative, and mixed methods related work
  • distinguishes the additional value from using a multi-methodology or mixed method design in terms of corroboration, complementarity, and expansion (breadth and depth)
  • discusses discrepancies and incongruent findings from the use of multiple methods
  • describes the main philosophical, epistemological, and/or theoretical foundations that the authors use and relates those to the planned use of multi or mixed methods in the study
  • describes the challenges faced in the design and how those were or could be mitigated
  • describes how the methods and their findings relate to one or more theories or theoretical frameworks
  • describes ethical issues that may have been presented through the blend of multi- or mixed methods

Extraordinary

  • contributes to the methodological discourse surrounding multi-methodology or mixed-methods

General Quality Criteria

General quality criteria discussed in the guidelines for each method should be considered. In the case of a multi-methodology or multi-method design, the reliability of the findings that are specific to the triangulation goals should be assessed. In the case of a mixed method design, the research may also require a “legitimation step to assess the trustworthiness of both the qualitative and quantitative data and subsequent interpretations” [Johnson/On., P. 22].

Antipatterns

  • Uninvited guest: A research method is not clearly introduced in the paper introduction/methodology and makes an unexpected entrance in the discussion or limitations sections of the paper
  • Smoke and mirrors: Overselling a study as a multi-methodology or mixed method design when one approach at best offers a token or anecdotal contribution to the research motivation or findings
  • Selling your soul: Employing an additional method that does not contribute substantively to the research findings, merely to appeal to a methodological purist during the review process
  • Integration failure: Poor integration of findings from all methods used
  • Limitation shirker: Failure to discuss limitations from all methods used or from their integration
  • Missing the mark: Misalignment of multi- or mixed method design with the research question/objective
  • Cargo cult research: Using methods where the research team lacks expertise in those methods, but hopes they work
  • Design by committee: Failure to agree on a crisp research question/objective (may be induced by different epistemological perspectives or use of heterogeneous methods)
  • Golden hammer: Relying on superficial, typically quantitative analysis of rich qualitative data
  • Sample contamination: A mixed method sequential design in which the same participants are used in multiple, sequential methods without accounting for potential contamination from earlier methods to later ones
  • Ignoring the writing on the wall: In a mixed method sequential design, failing to use findings from an earlier study when forming an instrument for a study in a later phase of the research

Examples of Acceptable Deviations

  • Conference page limits may make it particularly challenging to share sufficient details on all methods used. These details may be available in other publications or supplementary materials.
  • Describing the research in terms of cycles or ongoing parallel processes instead of phases.

Invalid Criticisms

  • The method(ologie)s do not contribute equally (a non-equal design) or the minor method is limited (e.g. few participants).
  • The mixed- or multi-method approach isn’t necessary (when it is beneficial)
  • The method(ologie)s have different philosophical foundations or are otherwise incompatible
  • In an unequal design, the wrong method is dominant (this is a study design choice, not a flaw)
  • The method(ologie)s have inconsistent findings

Notes

  • Multi-methodology research is sometimes referred to as multi-method, blended research, or integrative research (see P. 118 Johnson 2008). Mixed method research is sometimes seen as a special case of multi-methodology research that blends deductive and inductive research, or mixes methods that rely on quantitative and qualitative data.
  • The term “method” may refer to a way of collecting data or a broader process of doing research. In mixed methods research, “method” means the broader process of doing research.
  • The term “methodology” may refer to an approach to doing research or the study of how research is done. In “multi-methodology,” methodology means a process of doing research.
  • For multi-method research, it is important to distinguish between using different methods of collecting data (e.g., as part of an inductive case study) and using different research strategies (e.g., complementing an inductive case study with a deductive experiment in a later stage of the research).
  • Some experts on mixed methods use specific terms to refer to core mixed method designs (e.g., Convergent mixed methods, Explanatory sequential mixed methods, Exploratory sequential mixed methods are defined by Creswell & Creswell (2007) and Quantitatively driven approaches/designs, Qualitatively driven approaches/designs, Interactive or equal status designs are used by Johnson, Onwuegbuzie, & Turner, 2007).

Suggested Readings

  1. Bergman, M. M. (2011). The good, the bad, and the ugly in mixed methods research and design. Journal of Mixed Methods Research, 5(4), 271-275. doi:10.1177/1558689811433236

  2. Creswell, John W., V. L. Plano Clark, Michelle L. Gutmann, and William E. Hanson. “An expanded typology for classifying mixed methods research into designs.” In A. Tashakkori and C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2003): 209-240.

  3. Easterbrook, Steve, Janice Singer, Margaret-Anne Storey, and Daniela Damian. “Selecting empirical methods for software engineering research.” In Guide to advanced empirical software engineering, pp. 285-311. Springer, London, 2008.

  4. Johnson, R., Onwuegbuzie, Anthony, & Turner, Lisa. (2007). Toward a Definition of Mixed Methods Research. Journal of Mixed Methods Research, 1, 112-133. doi:10.1177/1558689806298224

  5. Ladner, S., Mixed Methods: A short guide to applied mixed methods design, 2019. https://www.mixedmethodsguide.com/

  6. O’Cathain, A. (2010). Assessing the quality of mixed methods research: Towards a comprehensive framework. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd edition) (pp. 531-555). Thousand Oaks: Sage.

  7. Pace, R., Pluye, P., Bartlett, G., Macaulay, A., Salsberg, J., Jagosh, J., & Seller, R. (2010). Reliability of a tool for concomitantly appraising the methodological quality of qualitative, quantitative and mixed methods research: a pilot study. 38th Annual Meeting of the North American Primary Care Research Group (NAPCRG), Seattle, USA.

  8. Pluye, P., Gagnon, M.P., Griffiths, F. & Johnson-Lafleur, J. (2009). A scoring system for appraising mixed methods research, and concomitantly appraising qualitative, quantitative and mixed methods primary studies in Mixed Studies Reviews. International Journal of Nursing Studies, 46(4), 529-46.

  9. Storey, MA., Ernst, N.A., Williams, C. et al. The who, what, how of software engineering research: a socio-technical framework. Empir Software Eng 25, 4097–4129 (2020). https://doi.org/10.1007/s10664-020-09858-z

  10. Tashakkori, Abbas, and John W. Creswell. “The new era of mixed methods.” (2007): 3-7.

  11. Teddlie, Charles, and Abbas Tashakkori. “A general typology of research designs featuring mixed methods.” Research in the Schools 13, no. 1 (2006): 12-28.

  12. https://en.wikipedia.org/wiki/Multimethodology

Exemplars

Paper reference Methods used
M. Almaliki, C. Ncube and R. Ali, “The design of adaptive acquisition of users feedback: An empirical study,” 2014 IEEE Eighth International Conference on Research Challenges in Information Science (RCIS), Marrakech, 2014, pp. 1-12, doi: 10.1109/RCIS.2014.6861076. Questionnaire+interviews
Sebastian Baltes and Stephan Diehl. 2019. Usage and attribution of Stack Overflow code snippets in GitHub projects. Empirical Softw. Engg. 24, 3 (June 2019), 1259–1295. DOI:https://doi.org/10.1007/s10664-018-9650-5 [pdf] Mining clones of SO code + qualitative analysis + follow-up online survey
Sebastian Baltes and Stephan Diehl. 2018. Towards a theory of software development expertise. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 187–200. DOI:https://doi.org/10.1145/3236024.3236061 [pdf] Combining deductive and inductive theory building steps
T. Barik, Y. Song, B. Johnson and E. Murphy-Hill, “From Quick Fixes to Slow Fixes: Reimagining Static Analysis Resolutions to Enable Design Space Exploration,” 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Raleigh, NC, 2016, pp. 211-221, doi: 10.1109/ICSME.2016.63. [pdf] [blogpost] This had both a usability tool study but also a heuristic evaluation (!) using expert evaluators.
M. Borg, K. Wnuk, B. Regnell and P. Runeson, “Supporting Change Impact Analysis Using a Recommendation System: An Industrial Case Study in a Safety-Critical Context,” in IEEE Transactions on Software Engineering, vol. 43, no. 7, pp. 675-700, 1 July 2017, doi: 10.1109/TSE.2016.2620458. Spent a month at a site in India. Replayed some development history, instrumented and measured, did interviews.
Chattopadhyay, Souti, Nicholas Nelson, Audrey Au, Natalia Morales, Christopher Sanchez, Rahul Pandita, and Anita Sarma. “A tale from the trenches: cognitive biases and software development.” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 654-665. 2020. Observations + interviews
F. Ebert, F. Castor, N. Novielli and A. Serebrenik, “Confusion in Code Reviews: Reasons, Impacts, and Coping Strategies,” 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 2019, pp. 49-60, doi: 10.1109/SANER.2019.8668024. [pdf] Survey + repository mining on confusion
Egelman, Carolyn D., Emerson Murphy-Hill, Elizabeth Kammer, Margaret Morrow Hodges, Collin Green, Ciera Jaspan, and James Lin. “Predicting developers’ negative feelings about code review.” In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), pp. 174-185. IEEE, 2020. [pdf] Surveys cross-referenced with code review data.
D. Ford, M. Behroozi, A. Serebrenik and C. Parnin, “Beyond the Code Itself: How Programmers Really Look at Pull Requests,” 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), Montreal, QC, Canada, 2019, pp. 51-60, doi: 10.1109/ICSE-SEIS.2019.00014. [pdf] Eye tracking software and interviews with developers to figure out how they evaluate contributions to OSS
Fabian Gilson, Miguel Morales-Trujillo, and Moffat Mathews. 2020. How junior developers deal with their technical debt? In Proceedings of the 3rd International Conference on Technical Debt (TechDebt ‘20). Association for Computing Machinery, New York, NY, USA, 51–61. DOI:https://doi.org/10.1145/3387906.3388624 They looked at how students (proxy for junior devs) deal with their tech debt using a survey, some analysis of their smells (sonarqube + commits) and a focus group.
Austin Z. Henley, Kıvanç Muşlu, Maria Christakis, Scott D. Fleming, and Christian Bird. 2018. CFar: A Tool to Increase Communication, Productivity, and Review Quality in Collaborative Code Reviews. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, Paper 157, 1–13. DOI:https://doi.org/10.1145/3173574.3173731 [pdf] Looked at the effects of an automated code reviewer on team collaboration at Microsoft using a lab study, a field study, and a survey.
Yogeshwar Shastri, Rashina Hoda, Robert Amor, “The role of the project manager in agile software development projects”, Journal of Systems and Software, Volume 173, 2021, 110871, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2020.110871. [pdf] A gentle approach to mixed methods with GT, primarily qual with supplementary quant.
Thomas D. LaToza, Gina Venolia, and Robert DeLine. 2006. Maintaining mental models: a study of developer work habits. In Proceedings of the 28th international conference on Software engineering (ICSE ‘06). Association for Computing Machinery, New York, NY, USA, 492–501. DOI:https://doi.org/10.1145/1134285.1134355 [pdf] survey + interviews + survey
Thomas D. LaToza and Brad A. Myers. 2010. Developers ask reachability questions. Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. Association for Computing Machinery, New York, NY, USA, 185–194. DOI:https://doi.org/10.1145/1806799.1806829 [pdf] lab obs + survey + field obs
C. Omar, Y. S. Yoon, T. D. LaToza and B. A. Myers, “Active code completion,” 2012 34th International Conference on Software Engineering (ICSE), Zurich, 2012, pp. 859-869, doi: 10.1109/ICSE.2012.6227133. [pdf] survey + lab study
Nicolas Mangano, Thomas D. LaToza, Marian Petre, and André van der Hoek. 2014. Supporting informal design with interactive whiteboards. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ‘14). Association for Computing Machinery, New York, NY, USA, 331–340. DOI:https://doi.org/10.1145/2556288.2557411 [pdf] lit review + logs + interviews
Thomas D. LaToza, W. Ben Towne, Christian M. Adriano, and André van der Hoek. 2014. Microtask programming: building software with a crowd. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST ‘14). Association for Computing Machinery, New York, NY, USA, 43–54. DOI:https://doi.org/10.1145/2642918.2647349 [pdf] logs + survey 2x
L. Martie, T. D. LaToza and A. v. d. Hoek, “CodeExchange: Supporting Reformulation of Internet-Scale Code Queries in Context (T),” 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, 2015, pp. 24-35, doi: 10.1109/ASE.2015.51. [pdf] field deployment + lab study
A. Alaboudi and T. D. LaToza, “An Exploratory Study of Live-Streamed Programming,” 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Memphis, TN, USA, 2019, pp. 5-13, doi: 10.1109/VLHCC.2019.8818832. [pdf] web videos + interviews
K. Chugh, A. Y. Solis and T. D. LaToza, “Editable AI: Mixed Human-AI Authoring of Code Patterns,” 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Memphis, TN, USA, 2019, pp. 35-43, doi: 10.1109/VLHCC.2019.8818871. [pdf] lab study + interviews
E. Aghayi, A. Massey and T. LaToza, “Find Unique Usages: Helping Developers Understand Common Usages,” in 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Dunedin, New Zealand, 2020, pp. 1-8. doi: 10.1109/VL/HCC50065.2020.9127285 https://doi.ieeecomputersociety.org/10.1109/VL/HCC50065.2020.9127285 [pdf] formative obs + lab study
Clara Mancini, Keerthi Thomas, Yvonne Rogers, Blaine A. Price, Lukazs Jedrzejczyk, Arosha K. Bandara, Adam N. Joinson, and Bashar Nuseibeh. 2009. From spaces to places: emerging contexts in mobile privacy. In Proceedings of the 11th international conference on Ubiquitous computing (UbiComp ‘09). Association for Computing Machinery, New York, NY, USA, 1–10. DOI:https://doi.org/10.1145/1620545.1620547 [pdf] Extended experience sampling with ‘memory phrases’ to allow for deferred contextual interviews, to elicit mobile privacy reqs
Clara Mancini, Yvonne Rogers, Keerthi Thomas, Adam N. Joinson, Blaine A. Price, Arosha K. Bandara, Lukasz Jedrzejczyk, and Bashar Nuseibeh. 2011. In the best families: tracking and relationships. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ‘11). Association for Computing Machinery, New York, NY, USA, 2419–2428. DOI:https://doi.org/10.1145/1978942.1979296 [pdf] They mixed ‘extended experience sampling’ from above ubicomp’09 work with ‘breaching experiments’ from psychology, to explore progressively uncomfortable invasions of privacy of user, to better understand privacy requirements.
Clara Mancini, Yvonne Rogers, Arosha K. Bandara, Tony Coe, Lukasz Jedrzejczyk, Adam N. Joinson, Blaine A. Price, Keerthi Thomas, and Bashar Nuseibeh. 2010. Contravision: exploring users’ reactions to futuristic technology. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 153–162. DOI:https://doi.org/10.1145/1753326.1753350 [pdf] For privacy reqs elicitation, inspired by cinema (film: Sliding Doors), they created the ‘Contravision’ method with utopian & dystopian narratives (films) of futuristic tech, to elicit reqs from focus groups that watched their film variants
H. S. Qiu, A. Nolte, A. Brown, A. Serebrenik and B. Vasilescu, “Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source,” 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Montreal, QC, Canada, 2019, pp. 688-699, doi: 10.1109/ICSE.2019.00078. [pdf] Surveys + repository mining on gender diversity
Qiu, Huilian Sophie, Yucen Lily Li, Susmita Padala, Anita Sarma, and Bogdan Vasilescu. “The signals that potential contributors look for when choosing open-source projects.” Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 1-29. Interviews + mining
Nischal Shrestha, Colton Botta, Titus Barik, and Chris Parnin. 2020. Here we go again: why is it difficult for developers to learn another programming language? In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE ‘20). Association for Computing Machinery, New York, NY, USA, 691–701. DOI:https://doi.org/10.1145/3377811.3380352 [pdf] Combines Stack Overflow data with qualitative interviews. It’s not so much triangulation but more about seeing the same problem from different lenses.
Siegmund, Janet, Norman Peitek, Sven Apel, and Norbert Siegmund. “Mastering Variation in Human Studies: The Role of Aggregation.” ACM Transactions on Software Engineering and Methodology (TOSEM) 30, no. 1 (2020): 1-40. A literature survey, an in-depth statistical re-analysis, and the training of several classifiers on data from different human studies to demonstrate how aggregation affects results.
Siegmund, Norbert, Nicolai Ruckel, and Janet Siegmund. “Dimensions of software configuration: on the configuration context in modern software development.” In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 338-349. 2020. (https://sws.informatik.uni-leipzig.de/wp-content/uploads/2020/05/Configuration.pdf) Interviews, two related papers, and an application on SE subfields for building a context of research on software configuration.
Stors, N. and Sebastian Baltes. “Constructing Urban Tourism Space Digitally.” Proceedings of the ACM on Human-Computer Interaction 2 (2018): 1 - 29. [pdf] Interdisciplinary study combining data mining and qualitative analysis
Leen Lambers, Daniel Strüber, Gabriele Taentzer, Kristopher Born, and Jevgenij Huebert. 2018. Multi-granular conflict and dependency analysis in software engineering based on graph transformation. In Proceedings of the 40th International Conference on Software Engineering (ICSE ‘18). Association for Computing Machinery, New York, NY, USA, 716–727. DOI:https://doi.org/10.1145/3180155.3180258 They combine 4 research methods to improve the state-of-the-art of a certain software analysis: (1.) a literature survey to identify issues with available techniques (performance- and usability-related); (2.) formal methods to define new concepts and prove that they can be computed in a sound way; (3.) a tool implementation and evaluation to show performance benefits; and (4.) a user experiment to show usability benefits arising from the new concepts.
Bogdan Vasilescu, Daryl Posnett, Baishakhi Ray, Mark G.J. van den Brand, Alexander Serebrenik, Premkumar Devanbu, and Vladimir Filkov. 2015. Gender and Tenure Diversity in GitHub Teams. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 3789–3798. DOI:https://doi.org/10.1145/2702123.2702549 [pdf] Surveys + repository mining on gender diversity
Umme Ayda Mannan, Iftekhar Ahmed, Carlos Jensen, and Anita Sarma. 2020. On the relationship between design discussions and design quality: a case study of Apache projects. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Association for Computing Machinery, New York, NY, USA, 543–555. DOI:https://doi.org/10.1145/3368089.3409707 Data mining + survey
Vidoni, Melina, “Evaluating Unit Testing Practices in R Packages”, to appear at ICSE 2021. [pdf] Combined an MSR (mining software repositories) with a developers’ survey.
A. AlSubaihin, F. Sarro, S. Black, L. Capra and M. Harman, “App Store Effects on Software Engineering Practices,” in IEEE Transactions on Software Engineering (Early Access). DOI:https://doi.org/10.1109/TSE.2019.2891715. [pdf] Interviews + Questionnaires, Deductive, mixed method drawing from survey and case study empirical research methodologies, exploratory + descriptive

1 e.g., samples that are drawn from different populations may introduce limitations when the findings are integrated, or biases may be introduced by the sequential or parallel nature of a mixed design

Optimization Studies in SE (including Search-Based Software Engineering)

Research studies that focus on the formulation of software engineering problems as search problems, and apply optimization techniques to solve such problems1.

Application

This standard applies to empirical studies that meet the following criteria:

  • Formulates a software engineering task2 as an optimization problem, with one or more specified fitness functions3 used to judge success in this task.
  • Applies one or more approaches that generate solutions to the problem in an attempt to maximize or minimize the specified fitness functions.
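As a minimal sketch of the two criteria above, the following Python fragment formulates a toy test-suite minimization task as an optimization problem and applies a seeded random search to it. The coverage data, the 0.01 cost weight, and the names `COVERAGE`, `fitness` and `random_search` are hypothetical and chosen only for illustration; this toy has just 16 candidate solutions and could be enumerated, so it shows the shape of a formulation rather than a problem that would warrant search.

```python
import random

# Hypothetical data: which requirements each test case covers (illustration only).
COVERAGE = {
    "t1": {"r1", "r2"},
    "t2": {"r2", "r3"},
    "t3": {"r1", "r2", "r3"},
    "t4": {"r4"},
}
TESTS = sorted(COVERAGE)
ALL_REQUIREMENTS = set().union(*COVERAGE.values())


def fitness(solution):
    """Fitness to maximize: fraction of requirements covered,
    minus an assumed cost of 0.01 per selected test."""
    selected = [t for t, keep in zip(TESTS, solution) if keep]
    covered = set().union(*(COVERAGE[t] for t in selected))
    return len(covered) / len(ALL_REQUIREMENTS) - 0.01 * len(selected)


def random_search(evaluations, seed):
    """Generate random candidate solutions and keep the fittest one."""
    rng = random.Random(seed)  # explicit seed so the run can be repeated
    best, best_fit = None, float("-inf")
    for _ in range(evaluations):
        # A solution is one boolean per test: keep it in the suite or not.
        candidate = tuple(rng.random() < 0.5 for _ in TESTS)
        candidate_fit = fitness(candidate)
        if candidate_fit > best_fit:
            best, best_fit = candidate, candidate_fit
    return best, best_fit


best, best_fit = random_search(evaluations=200, seed=1)
print(dict(zip(TESTS, best)), best_fit)
```

Here the solution representation (one boolean per test), the fitness function, and the approach that generates solutions are each stated explicitly, which is what the remaining attributes in this standard ask authors to document.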

Specific Attributes

We stress that the use of optimization in SE is still a rapidly evolving field. Hence, the following criteria are approximate and there may exist many exceptions to them. Reviewers should reward sound and novel work and, where possible, support a diverse range of studies.

Essential

  • explains why the problem cannot be optimized manually or by brute force within a reasonable timeframe4
  • EITHER: describes prior state of the art in this area
    OR: carefully motivates and defines the problem tackled and the solution proposed

  • describes the search space (e.g., constraints, independent variable choices)
  • uses realistic and limited simplifications and constraints for the optimization problem; simplifications and constraints do not reduce the search to one where all solutions could be enumerated through brute force
  • justifies the choice of algorithm5 underlying the approach6
  • compares approaches to a justified and appropriate baseline7
  • explicitly defines the solution formulation, including a description of what a solution represents8, how it is represented9, and how it is manipulated
  • explicitly defines all fitness functions, including the type of goals that are optimized and the equations for calculating fitness values
  • explicitly defines evaluated approaches, including the techniques, specific heuristics, and the parameters and their values10
  • EITHER: clearly describes (and follows) a sound process to collect and prepare the datasets used to run and to evaluate the optimization approach
    OR: if the subjects are taken from previous work, fully references the original source and explains whether any transformation or cleaning was applied to the datasets
  • EITHER: makes data publicly available OR: explains why this is not possible11
  • identifies and explains all possible sources of stochasticity12
  • EITHER: executes stochastic approaches or elements multiple times
    OR: explains why this is not possible13
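The last two essential attributes (identifying sources of stochasticity and executing stochastic approaches multiple times) can be illustrated with a short sketch. It assumes a toy bit-counting fitness function, a budget of 100 candidate solutions per run, and 30 seeded trials per approach; these values and all function names are illustrative assumptions, not recommendations from this standard.

```python
import random
import statistics

N_BITS = 30   # toy problem size (assumption)
BUDGET = 100  # candidate solutions generated per run (assumption)


def fitness(bits):
    """Toy fitness to maximize: number of ones (stand-in for a real SE objective)."""
    return sum(bits)


def random_search(seed):
    """Baseline. Source of stochasticity: every candidate is sampled at random."""
    rng = random.Random(seed)
    return max(fitness([rng.randint(0, 1) for _ in range(N_BITS)])
               for _ in range(BUDGET))


def hill_climber(seed):
    """Sources of stochasticity: the initial solution and the bit chosen for flipping."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(N_BITS)]
    for _ in range(BUDGET - 1):
        neighbour = list(current)
        neighbour[rng.randrange(N_BITS)] ^= 1  # flip one random bit
        if fitness(neighbour) >= fitness(current):
            current = neighbour
    return fitness(current)


def repeat(approach, seeds):
    """Run one approach once per seed, so that every trial can be reproduced."""
    return [approach(seed) for seed in seeds]


def summarize(values):
    """Report the spread of the trials, not only a central value."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


SEEDS = range(30)
results = {
    "random_search": repeat(random_search, SEEDS),
    "hill_climber": repeat(hill_climber, SEEDS),
}
for name, values in results.items():
    print(name, summarize(values))
```

Recording the seeds alongside the results lets readers rerun individual trials, and keeping the full list of per-trial values (rather than a single mean) makes later distribution-level comparisons possible.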

Desirable

  • provides a replication package that conforms to SIGSOFT standards for artifacts14.
  • motivates the novelty and soundness of the proposed approach15
  • explains whether the study explores a new problem type (or a new area within an existing problem space), or how it reproduces, replicates, or improves upon prior work
  • explains in detail how subjects or datasets were collected/chosen to mitigate selection bias and improve the generalization of findings
  • describes the main features of the subjects used to run and evaluate the optimization approach(es) and discusses what characterizes the different instances in terms of “hardness”
  • justifies the use of synthetic data (if any); explains why real-world data cannot be used; discusses the extent to which the proposed approach and the findings can apply to the real world
  • (if data cannot be shared) provides a sample dataset that can be shared to illustrate the approach
  • selects a realistic option space for formulating a solution; any values set for attributes should reflect one that might be chosen in a “real-world” solution, and not generated from an arbitrary distribution
  • justifies the parameter values used when executing the evaluated approaches (note that experiments trying a wide range of different parameter values would be extraordinary; see below)
  • samples from data multiple times in a controlled manner (where appropriate and possible)
  • performs multiple trials either as a cross-validation (multiple independent executions) or temporally (multiple applications as part of a timed sequence), depending on the problem at hand
  • provides random data splits (e.g., those used in data-driven approaches) or ensures splits are reproducible
  • compares distributions (rather than means) of results using appropriate statistics
  • compares solutions using appropriate meta-evaluation criteria16; justifies the chosen criteria
  • clearly distinguishes evidence-based results from interpretations and speculation17
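One way to make random data splits reproducible, as the desirable attributes above suggest, is to derive each split from a recorded seed. The sketch below is a minimal illustration under that assumption; the function name `seeded_split`, the 30% test fraction and the project identifiers are hypothetical.

```python
import random


def seeded_split(items, test_fraction, seed):
    """Split items into (train, test) deterministically for a given seed."""
    rng = random.Random(seed)
    # Sort first so the split does not depend on the order the items arrive in.
    shuffled = sorted(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]


projects = [f"project-{i}" for i in range(10)]  # hypothetical subject identifiers
train, test = seeded_split(projects, test_fraction=0.3, seed=42)
print("train:", train)
print("test:", test)
```

Publishing the seed (or the resulting split itself) in the replication package allows others to rerun the evaluation on exactly the same partitions.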

Extraordinary

  • analyzes different parameter choices to the algorithm, indicating how the final parameters were selected18
  • analyzes the fitness landscape for one or more of the chosen fitness functions

General Quality Criteria

The most valuable quality criteria for optimization studies in SE include reliability, replicability, reproducibility, rigor, and usefulness (see Glossary).

Examples of Acceptable Deviations

  • The number of trials can be constrained by available time or experimental resources (e.g. where experiments are time-consuming to repeat or have human elements). In such cases, multiple trials are still ideal, but a limited number of trials can be justified as long as the limitations are disclosed and the possible effects of stochasticity are discussed.
  • The use of industrial case studies is important in demonstrating the real-world application of a proposed technique, but industrial data generally cannot be shared. In such cases, it is recommended that a small open-source example be prepared and distributed as part of a replication package to demonstrate how the approach can be applied.

Antipatterns

  • Reporting significance tests (e.g., Mann-Whitney Wilcoxon test) without effect size tests (see Notes)
  • Conducting multiple trials but failing to disclose or discuss the variation between trials; for instance, reporting a measure of central tendency (e.g. median) without any indication of variance (e.g., a boxplot).
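To illustrate what reporting an effect size alongside a significance test can look like, the sketch below computes the Vargha-Delaney A12 statistic, a nonparametric effect size often paired with the Mann-Whitney test. It is offered as one possible choice, not as the measure this standard mandates; the sample values are made up, and the significance test itself would normally come from a statistics package (for example, SciPy provides `scipy.stats.mannwhitneyu`).

```python
def a12(xs, ys):
    """Vargha-Delaney A12: estimated probability that a value drawn from xs
    is larger than a value drawn from ys, with ties counted as one half."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Made-up fitness values from repeated trials of two approaches.
approach_a = [0.91, 0.88, 0.93, 0.90, 0.89]
approach_b = [0.85, 0.90, 0.84, 0.86, 0.87]

print("A12 =", a12(approach_a, approach_b))
```

A value of 0.5 indicates no tendency for either approach to produce larger values, while values near 0 or 1 indicate that one approach almost always does.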

Invalid Criticisms

  • The paper is unimportant. Be cautious of rejecting papers that seem “unimportant” (in the eyes of a reviewer). Research is exploratory and it is about taking risks. Clearly motivated research and speculative exploration are both important and should be rewarded.
  • The paper just uses older algorithms with no reference to recent work. Using older (and widely understood) algorithms may be valid when they are used, e.g., (1) as part of a larger set that compares many approaches; (2) to offer a “straw man” method that defines the “floor” of the performance (that everything else needs to beat); or (3) as a workbench within which one thing is changed (e.g., the fitness function) but everything else remains constant.
  • That an approach is not benchmarked against an inappropriate or unavailable baseline. If a state-of-the-art approach lacks an available and functional implementation, it is not reasonable to expect the author to recreate that approach for benchmarking purposes.
  • That a multi-objective approach is not compared to a single-objective approach by evaluating each objective separately. This is not a meaningful comparison because, in a multi-objective problem, the trade-off between the objectives is a major factor in result quality. It is more important to consider the Pareto frontiers and quality indicators.
  • That one or very few subjects are used, as long as the paper offers a reasonable justification for why this was the case.
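The point about multi-objective comparisons can be made concrete with a small sketch of Pareto dominance. It assumes that both objectives are minimized and uses made-up (cost, defects) pairs; the function names are illustrative.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (all objectives are assumed to be minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (cost, defects) values for five candidate solutions.
solutions = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
print(pareto_front(solutions))  # [(1, 5), (2, 3), (4, 1)]
```

None of the three surviving solutions is best on both objectives, which is why evaluating each objective separately loses the trade-off information that the front and its quality indicators capture.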

Suggested Readings

  • Shaukat Ali, Lionel C. Briand, Hadi Hemmati, Rajwinder Kaur Panesar-Walawege. 2010. A Systematic Review of the Application and Empirical Investigation of Search-Based Test Case Generation. In IEEE Transactions on Software Engineering, vol. 36, no. 6, pp. 742-762, DOI: https://doi.org/10.1109/TSE.2009.52
  • Andrea Arcuri and Lionel Briand. 2014. A Hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw. Test. Verif. Reliab. 24, 3, pp. 219–250. DOI: https://doi.org/10.1002/stvr.1486
  • Amritanshu Agrawal, Tim Menzies, Leandro L. Minku, Markus Wagner, and Zhe Yu. 2020. Better software analytics via DUO: Data mining algorithms using/used-by optimizers. Empirical Software Engineering 25, no. 3, pp. 2099-2136. DOI: https://doi.org/10.1007/s10664-020-09808-9
  • Efron, Bradley, and Robert J. Tibshirani. An introduction to the bootstrap. CRC press, 1994
  • Mark Harman, Phil McMinn, Jerffeson Teixeira Souza, and Shin Yoo. 2011. Search-Based Software Engineering: Techniques, Taxonomy, Tutorial. Empirical Software Engineering and Verification. Lecture Notes in Computer Science, vol. 7007, pp. 1–59. DOI: https://doi.org/10.1007/978-3-642-25231-0_1
  • Vigdis By Kampenes, Tore Dybå, Jo E. Hannay, and Dag I. K. Sjøberg. 2007. Systematic review: A systematic review of effect size in software engineering experiments. Inf. Softw. Technol. 49, 11–12 (November, 2007), 1073–1086. DOI:https://doi.org/10.1016/j.infsof.2007.02.015
  • M. Li, T. Chen and X. Yao. 2020. How to Evaluate Solutions in Pareto-based Search-Based Software Engineering? A Critical Review and Methodological Guidance. In IEEE Transactions on Software Engineering. DOI: https://doi.org/10.1109/TSE.2020.3036108.
  • Nikolaos Mittas and Lefteris Angelis. Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Trans. Software Eng., 39(4):537–551, 2013.
  • Guenther Ruhe. 2020. Optimization in Software Engineering - A Pragmatic Approach. In Felderer, M. and Travassos, G.H. eds., Contemporary Empirical Methods in Software Engineering, Springer. DOI: https://doi.org/10.1007/978-3-030-32489-6_9

Exemplars

  • Hussein Almulla, Gregory Gay. 2020. Learning How to Search: Generating Exception-Triggering Tests Through Adaptive Fitness Function Selection. In Proceedings of 13th IEEE International Conference on Software Testing (ICST’20). IEEE, 63-73. DOI: https://doi.org/10.1109/ICST46399.2020.00017
  • Jianfeng Chen, Vivek Nair, Rahul Krishna, Tim Menzies. 2019. “Sampling” as a Baseline Optimizer for Search-Based Software Engineering. IEEE Transactions on Software Engineering 45(6). DOI: https://doi.org/10.1109/TSE.2018.279092
  • José Campos, Yan Ge, Nasser Albunian, Gordon Fraser, Marcelo Eler and Andrea Arcuri. 2018. An empirical evaluation of evolutionary algorithms for unit test suite generation. Information and Software Technology. vol. 104, pp. 207–235. DOI: https://doi.org/10.1016/j.infsof.2018.08.010
  • Feather, Martin S., and Tim Menzies. “Converging on the optimal attainment of requirements.” Proceedings IEEE Joint International Conference on Requirements Engineering. IEEE, 2002.
  • G. Mathew, T. Menzies, N. Ernst and J. Klein. 2017. “SHORT”er Reasoning About Larger Requirements Models. In 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, pp. 154-163. doi: 10.1109/RE.2017.3
  • Annibale Panichella, Fitsum Meshesha Kifetew and Paolo Tonella. 2018. Automated Test Case Generation as a Many-Objective Optimisation Problem with Dynamic Selection of the Targets. IEEE Transactions on Software Engineering. vol. 44, no. 2, pp. 122–158. DOI: https://doi.org/10.1109/TSE.2017.2663435
  • Federica Sarro, Filomena Ferrucci, Mark Harman, Alessandra Manna and Jen Ren. 2017. Adaptive Multi-Objective Evolutionary Algorithms for Overtime Planning in Software Projects. IEEE Transactions on Software Engineering, vol. 43, no. 10, pp. 898-917. DOI: https://doi.org/10.1109/TSE.2017.2650914
  • Federica Sarro, Alessio Petrozziello, and Mark Harman. 2016. Multi-objective software effort estimation. In Proceedings of the 38th International Conference on Software Engineering (ICSE’16). Association for Computing Machinery, New York, NY, USA, 619–630. DOI: https://doi.org/10.1145/2884781.2884830
  • Norbert Siegmund, Stefan Sobernig, and Sven Apel. 2017. Attributed variability models: outside the comfort zone. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). Association for Computing Machinery, New York, NY, USA, 268–278. DOI: https://doi.org/10.1145/3106237.3106251

Notes

Regarding the difference between “significance” and “effect size” tests: “significance” tests check whether distributions can be distinguished from each other, while “effect size” tests check whether the difference between distributions is “interesting” (and not just a trivially “small effect”). Both kinds of test can be parametric or non-parametric. For example, code for the parametric t-test/Hedges significance/effect-size tests endorsed by Kampenes et al. can be found at https://tinyurl.com/y4o7ucnx. Code for a parametric Scott-Knott/Cohen test of the kind endorsed by Mittas et al. is available at https://tinyurl.com/y5tg37fp. Code for the non-parametric bootstrap/Cliff’s Delta significance/effect-size tests of the kind endorsed by Efron et al. and Arcuri et al. can be found at https://tinyurl.com/y2ufofgu.
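
To make the distinction concrete, here is a minimal, self-contained Python sketch of one non-parametric pairing: a percentile bootstrap confidence interval for the difference in means (distinguishability) and Cliff's delta (effect size). It is not the code behind the links above; the function names, the resampling settings, and the sample scores are assumptions chosen only for illustration.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def bootstrap_mean_diff_ci(
    xs: Sequence[float],
    ys: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for mean(xs) - mean(ys)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        rx = [rng.choice(xs) for _ in xs]
        ry = [rng.choice(ys) for _ in ys]
        diffs.append(mean(rx) - mean(ry))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical scores from repeated runs of two approaches.
approach_a = [0.91, 0.93, 0.90, 0.94, 0.92, 0.95, 0.93, 0.92]
approach_b = [0.85, 0.88, 0.84, 0.87, 0.86, 0.89, 0.85, 0.86]

delta = cliffs_delta(approach_a, approach_b)
ci_low, ci_high = bootstrap_mean_diff_ci(approach_a, approach_b)
print(f"Cliff's delta = {delta:.2f}, 95% CI for mean difference = [{ci_low:.3f}, {ci_high:.3f}]")
```

A confidence interval that excludes zero suggests the two distributions can be distinguished, while the magnitude of Cliff's delta indicates whether the difference is large enough to be interesting. Thresholds for "small" or "large" effects are a matter of convention and should be justified in the paper.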


1Note that there are many such optimization techniques (e.g., metaheuristics; numerical optimizers; constraint-solving theorem provers such as SAT, SMT, and CSP solvers; and others), some of which are stochastic.
2E.g., test input creation, design refactoring, effort prediction.
3A “fitness function”, or “objective function”, is a numerical scoring function used to indicate the quality of a solution to a defined problem. Optimization approaches attempt to maximize or minimize such functions, depending on whether higher or lower scores indicate success.
4E.g., if the cross-product of the space of options is very large or if the time required to perform a task manually is very long.
5E.g., the numerical optimizer, the specific metaheuristic, the constraint solving method, etc.
6For example, do not use an algorithm such as Simulated Annealing, or even a specific approach such as NSGA-II, to solve an optimization problem unless it is actually appropriate for that problem. While one rarely knows the best approach for a new problem, one should at least consider the algorithms applied to address similar problems and make an informed judgement.
7If the approach addresses a problem never tackled before, then it should be compared - at least - to random search. Otherwise, compare the proposed approach to the existing state of the art.
8E.g., a test suite or test case in test generation.
9E.g., a tree or vector structure.
10Example techniques - Simulated Annealing, Genetic Algorithm. Example heuristic - single-point crossover. Example parameters - crossover and mutation rates.
11E.g., proprietary data, ethics issues, or a Non-Disclosure Agreement.
12For example, stochasticity may arise from the use of randomized algorithms, from the use of a fitness function that measures a random variable from the environment (e.g., a fitness function based on execution time may return different results across different executions), or from the use of data sampling or cross-validation approaches.
13E.g., the approach is too slow, human-in-the-loop.
14Including, for example, source code (of approach, solution representation, and fitness calculations), datasets used as experiment input, and collected experiment data (e.g., output logs, generated solutions).
15For example, if applying a multi-objective optimization approach, then use a criterion that can analyze the Pareto frontier of solutions (e.g., generational distance and inverse generational distance)
16E.g., applying hyperparameter optimization.
17Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.
18E.g., applying hyperparameter optimization.

Qualitative Surveys (Interview Studies)

Research comprising semi-structured or open-ended interviews

Application

This standard applies to empirical inquiries that meet all of the following criteria:

  • Researcher(s) have synchronous conversations with one participant at a time
  • Researchers ask, and participants answer, open-ended questions
  • Participants’ answers are recorded in some way
  • Researchers apply some kind of qualitative data analysis to participants’ answers

If researchers iterated between data collection and analysis, consider the Grounded Theory Standard. If respondents are all from the same organization, consider the Case Study Standard. If researchers collect written text or conversations (e.g. StackExchange threads), consider the Discourse Analysis Standard.

Specific Attributes

Essential Attributes

  • explains how interviewees were selected (see the Sampling Supplement)
  • describes interviewees (e.g. demographics, work roles)
  • describes interviewer(s) (e.g. experience, perspective)

  • presents clear chain of evidence from interviewee quotations to findings (e.g. proposed concepts)
  • clearly answers the research question(s)
  • provides evidence of saturation; explains how saturation was achieved1

  • researchers reflect on their own possible biases

Desirable Attributes

  • provides supplemental materials including interview guide(s), coding schemes, coding examples, decision rules, or extended chain-of-evidence table(s)
  • includes highly diverse participants
  • uses direct quotations extensively to support key points
  • EITHER: evaluates an a priori theory (or model, framework, taxonomy, etc.) using deductive coding with an a priori coding scheme based on the prior theory
    OR: synthesizes results into a new, mature, fully-developed and clearly articulated theory (or model, etc.) using some form of inductive coding (coding scheme generated from data)
  • validates results using member checking, dialogical interviewing, feedback from non-participant practitioners or research audits of coding by advisors or other researchers2
  • discusses transferability; findings plausibly transferable to different contexts
  • compares results with (or integrates them into) prior theory or related research
  • reflects on how researchers’ biases may have affected their analysis

Extraordinary Attributes

  • employs multiple methods of data analysis (e.g. open coding vs. process coding; manual coding vs. automated sentiment analysis) with method-triangulation
  • employs longitudinal design (i.e. each interviewee participates multiple times) and analysis
  • employs probabilistic sampling strategy; statistical analysis of response bias
  • uses multiple coders and analyzes inter-coder reliability (see IRR/IRA Supplement)
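
As one illustration of analyzing inter-coder reliability, the sketch below computes Cohen's kappa for two coders who each assigned one nominal code to the same set of interview excerpts. Cohen's kappa is only one of several agreement statistics; the function name and the example codes are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(coder_a: Sequence[Hashable], coder_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two coders labelling the same items with nominal codes."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same, non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned to ten excerpts by two coders.
coder_1 = ["barrier", "barrier", "enabler", "enabler", "barrier",
           "neutral", "enabler", "barrier", "neutral", "enabler"]
coder_2 = ["barrier", "enabler", "enabler", "enabler", "barrier",
           "neutral", "enabler", "barrier", "barrier", "enabler"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Here the coders agree on 8 of 10 excerpts, but kappa is lower than 0.8 because some of that agreement would be expected by chance.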

General Quality Criteria

An interview study should address appropriate qualitative quality criteria such as: credibility, resonance, usefulness, and transferability (see Glossary). Quantitative quality criteria such as internal validity, construct validity, generalizability and reliability typically do not apply.

Examples of Acceptable Deviations

  • In a study of deaf software developers, the interviews are conducted via text messages.
  • In a study of sexual harassment at named organizations, detailed description of interviewees and direct quotations are omitted to protect participants.
  • In a study of barriers faced by gay developers, participants are all gay (but should be diverse on other dimensions).

Antipatterns

  • Interviewing a small number of similar people, creating the illusion of convergence and saturation.
  • Misrepresenting a qualitative survey as grounded theory or a case study.

Invalid Criticisms

  • Lack of quantitative data; causal analysis; objectivity, internal validity, reliability, or generalizability.
  • Lack of replicability or reproducibility; not releasing transcripts.
  • Lack of probability sampling, statistical generalizability or representativeness unless representative sampling was an explicit goal of the study.
  • Failure to apply grounded theory or case study practices. A qualitative survey is not grounded theory or a case study.

Notes

  • A qualitative survey generally has more interviews than a case study that triangulates across different kinds of data.

Suggested Readings

Khaldoun M. Aldiabat and Carole-Lynne Le Navenec. “Data Saturation: The Mysterious Step in Grounded Theory Methodology.” The Qualitative Report, vol. 23, no. 1, 2018, pp. 245-261.

Michael Quinn Patton. 2002. Qualitative Research and Evaluation Methods. 3rd ed. Sage Publications.

Herbert J. Rubin and Irene S. Rubin. 2011. Qualitative interviewing: The art of hearing data. Sage.

Johnny Saldaña. 2015. The coding manual for qualitative researchers. Sage.

Exemplars

Marian Petre. 2013. UML in practice. In Proceedings of the 35th International Conference on Software Engineering, San Francisco, USA, 722–731.

Paul Ralph and Paul Kelly. 2014. The dimensions of software engineering success. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). Association for Computing Machinery, New York, NY, USA, 24–35. DOI: 10.1145/2568225.2568261

Paul Ralph and Ewan Tempero. 2016. Characteristics of decision-making during coding. In Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering (EASE ‘16). Association for Computing Machinery, New York, NY, USA, Article 34, 1–10. DOI:10.1145/2915970.2915990


1cf. Khaldoun M. Aldiabat and Carole-Lynne Le Navenec. “Data Saturation: The Mysterious Step in Grounded Theory Methodology.” The Qualitative Report, vol. 23, no. 1, 2018, pp. 245-261.
2 L. Harvey. 2015. Beyond member-checking: A dialogic approach to the research interview, International Journal of Research & Method in Education, 38, 1, 23–38.

Simulation (Quantitative)

A study that involves developing and using a mathematical model that imitates a real-world system’s behavior, which often entails problem understanding, data collection, model development, verification, validation, design of experiments, data analysis, and implementation of results.

Application

The standard applies to research studies that use simulation to understand, assess or improve a system or process and its behavior. Use this standard for in silico simulations, i.e., studies representing everything using computational models. For in virtuo simulations, i.e., human participants manipulating simulation models, use the Experiments (with human participants) Standard. For simulations used to assess a new or improved technological artifact, also consider the Engineering Research standard.

Specific Attributes

Essential Attributes

  • justifies that simulation is a suitable method for investigating the problem (or research question, etc.)

  • describes the simulation model (conceptual, implementation, or hybrid abstraction levels), including input parameters and response variables
  • describes the underlying simulation approach1
  • describes simulation packages or tools used to develop and run the simulation model, including version numbers, and computational environments
  • describes the data used for model calibration, the calibration procedures, and contextual information
  • describes how the simulation model was verified and validated at different abstraction levels2
  • describes the study protocol, including independent variables, scenarios, number of runs per scenario (in case of using stochastic simulation), and steady-state or terminating conditions
  • analyzes validity threats considering the supporting data and the simulation model3
  • clearly explicates the assumptions of the simulation model

Desirable Attributes

  • provides supplementary materials including the raw data (for real data) or generation mechanism (for synthetic data) used for model calibration, all simulation models and source code, analysis scripts
  • characterizes reference behaviors for the definition of simulation scenarios with representative and known values or probability distributions for input parameters4
  • separates conceptual and implementation levels of the simulation model
  • reports sensitivity analysis for input parameters or factors
  • clearly distinguishes evidence-based results from interpretations and speculation5
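
A sensitivity analysis for input parameters can take many forms. The sketch below shows one of the simplest, a one-at-a-time analysis that perturbs each parameter by plus and minus 10% around a baseline and reports the relative change in the response variable. The toy effort model, its parameter names, and the 10% step are assumptions made purely for illustration; a real study would substitute its own simulation model and may prefer a factorial or other design.

```python
from typing import Callable, Dict

Params = Dict[str, float]


def toy_effort_model(p: Params) -> float:
    """Hypothetical deterministic model: effort grows with size and rework,
    and shrinks with productivity."""
    return p["size_kloc"] / p["productivity"] * (1 + p["rework_rate"])


def one_at_a_time(model: Callable[[Params], float], baseline: Params,
                  delta: float = 0.10) -> Params:
    """Relative change in output when each input moves from -delta to +delta,
    holding all other inputs at their baseline values.
    Assumes the baseline output is non-zero."""
    base_out = model(baseline)
    sensitivity = {}
    for name, value in baseline.items():
        low = dict(baseline, **{name: value * (1 - delta)})
        high = dict(baseline, **{name: value * (1 + delta)})
        sensitivity[name] = (model(high) - model(low)) / base_out
    return sensitivity


baseline = {"size_kloc": 100.0, "productivity": 2.0, "rework_rate": 0.2}
result = one_at_a_time(toy_effort_model, baseline)
for name, change in result.items():
    print(f"{name}: {change:+.3f}")
```

In this toy model the output is far more sensitive to size and productivity than to the rework rate, which is the kind of finding a sensitivity analysis is meant to surface.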

Extraordinary Attributes

  • describes how stakeholders were involved in developing and validating the simulation model6
  • provides a modular view of the simulation model, allowing reuse in different contexts7

General Quality Criteria

Conclusion validity, construct validity, internal validity (if examining causal relationships), external validity, and reproducibility.

Antipatterns

  • Overfitting8 the simulation model to reproduce a reference behavior.
  • Use of non-standard experimental designs9 without justification.
  • Using a single run instead of multiple runs to experiment with stochastic models.
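
To illustrate the last antipattern, the following sketch runs a stochastic model many times with different seeds and summarizes the response variable with a mean and an approximate 95% confidence interval, instead of reporting a single run. The toy project-duration model, the number of runs, and the normal approximation (z = 1.96) are assumptions for illustration only.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def simulate_project_duration(seed: int, n_tasks: int = 20) -> float:
    """Hypothetical stochastic model: total duration (days) is the sum of
    task durations drawn from a triangular distribution."""
    rng = random.Random(seed)
    # random.triangular takes (low, high, mode).
    return sum(rng.triangular(1.0, 10.0, 3.0) for _ in range(n_tasks))


def replicate(n_runs: int, base_seed: int = 1000) -> List[float]:
    """Run the model n_runs times, each with its own recorded seed."""
    return [simulate_project_duration(base_seed + i) for i in range(n_runs)]


def mean_with_ci(samples: List[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Mean and approximate 95% confidence interval (normal approximation)."""
    m = mean(samples)
    half_width = z * stdev(samples) / sqrt(len(samples))
    return m, m - half_width, m + half_width


runs = replicate(50)
avg, ci_low, ci_high = mean_with_ci(runs)
print(f"mean duration = {avg:.1f} days, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```

Recording the seeds alongside the number of runs per scenario also makes the experiment easier to reproduce.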

Examples of Acceptable Deviations

  • If insufficient data is available (or too costly to collect) to calibrate the model, assumptions can be used to implement parts of the model. These assumptions, however, must be explained and justified.
  • When the translation from a conceptual to implementation model is straightforward, authors may present them together.
  • If the simulation approach used is very common in software engineering (e.g. discrete-event simulation, system dynamics), it is sufficient to indicate which approach is used, citing appropriate references, rather than explaining in full how the approach works.

Invalid Criticisms

  • The mere presence of assumptions in the model is not a valid basis for criticism as long as the assumptions are documented and justified, and their implications for the validity of the simulation are sufficiently addressed. All models make assumptions.
  • Claiming that the model is too abstract without explaining why the level of abstraction is inadequate for the purposes of the study.
  • Claiming that the study is invalid because it uses generated data, secondary data or approximations based on expert opinion, when no appropriate primary data is available.

Suggested Readings

Nauman Bin Ali, Kai Petersen. A consolidated process for software process simulation: State of the art and industry experience. In: 38th Euromicro Conference on Software Engineering and Advanced Applications, 2012, IEEE, pp 327–336.

Nauman Bin Ali, Kai Petersen, Claes Wohlin. A systematic literature review on the industrial use of software process simulation. Journal of Systems and Software, 97, 2014, 65–85.

Dietmar Pfahl. Process Simulation: A Tool for Software Project Managers? In: Günther Ruhe, Claes Wohlin (Eds.) Software Project Management in a Changing World. Springer-Verlag Berlin Heidelberg, 2014, 425-446.

Ivo Babuska, and J. Tinsley Oden. Verification and validation in computational engineering and science: basic concepts. Computer methods in applied mechanics and engineering, 193, 36, 2004, 4057–4066.

Breno Bernard Nicolau de França, Nauman Bin Ali. The Role of Simulation-Based Studies in Software Engineering Research. In: Felderer M., Travassos G. (eds) Contemporary Empirical Methods in Software Engineering. Springer, Cham. 2020. https://doi.org/10.1007/978-3-030-32489-6_10.

Breno Bernard Nicolau de França, Guilherme Horta Travassos. Experimentation with dynamic simulation models in software engineering: planning and reporting guidelines. Empirical Software Engineering, 21, 3, 2016, 1302–1345.

Breno Bernard Nicolau de França, Guilherme Horta Travassos. (2015). Simulation Based Studies in Software Engineering: A Matter of Validity. CLEI Electronic Journal, 18(1), 5.

Houston DX, Ferreira S, Collofello JS, Montgomery DC, Mackulak GT, Shunk DL. Behavioral characterization: finding and using the influential factors in software process simulation models. Journal of Systems and Software, 59, 3, 2001, 259– 270, DOI https://doi.org/10.1016/S0164-1212(01)00067-X.

Kleijnen JPC, Sanchez SM, Lucas TW, Cioppa TM. State-of-the-art review: A user's guide to the brave new world of designing simulation experiments. INFORMS Journal on Computing, 17, 3, 2005, 263–289. DOI 10.1287/ijoc.1050.0136.

Law AM. Simulation modeling and analysis. 5th ed., 2015, McGraw-Hill, New York.

Madachy RJ. Software Process Dynamics, 2008, Wiley-IEEE Press.

Exemplars

Ali NB, Petersen K, de França BBN (2015) Evaluation of simulation-assisted value stream mapping for software product development: Two industrial cases. Information & Software Technology 68:45–61 [an example of a simulation-based study in industrial settings].

Concas, Giulio, Maria Ilaria Lunesu, Michele Marchesi, and Hongyu Zhang. "Simulation of software maintenance process, with and without a work‐in‐process limit." Journal of Software: Evolution and Process 25, no. 12 (2013): 1225-1248. [an example of model description and discussion of threats to validity].

Garousi V, Khosrovian K, Pfahl D (2009) A customizable pattern-based software process simulation model: design, calibration, and application. Software Process: Improvement and Practice 14(3):165–180, DOI 10.1002/spip.411. [an example of a complete report of a simulation-based study].

Smith, Neil, Andrea Capiluppi, and Juan F. Ramil. "A study of open source software evolution data using qualitative simulation." Software Process: Improvement and Practice 10, no. 3 (2005): 287-300. [an example of a simulation study using an unusual simulation approach: qualitative simulation].


1 e.g. discrete-event simulation, system dynamics, agent-based simulation
2 Some verification and validation procedures may be applied to the model at the conceptual level (e.g., validating variables and relationships) down to an implementation level (e.g., using tests, reproducing reference behaviors, or performing simulated experiments).
3 Simulation studies are prone to several validity threats, including non-representative simulation scenarios, insufficient verification and validation, using different datasets (contexts) for model calibration and experimentation, and others (de França and Travassos, 2015).
4 Reference behaviors represent a real-world model (often based on actual measurement of a system or process), which is characterized by data distribution or series of model variables. Usually, these models are used for validating simulation outcomes. For instance, an effort and schedule baseline for software project simulation.
5 Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.
6 Apart from the developers of the simulation model, stakeholders could be data providers (for model calibration), domain experts (who may have hypotheses about causal relationships between model variables) and users of the simulation model. All of these stakeholders could, for example, be involved in checking the plausibility of the model output (face validity check).
7 Understanding simulation models as software, they may become too large and difficult to understand in a single view. So, the idea is to have a composite model, in which each module concerns a particular set of variables. The following book presents an entire model on software projects in a modular perspective: Abdel-Hamid, T. and Madnick, S.E., 1991. Software project dynamics: an integrated approach. Prentice-Hall, Inc.
8 For instance, implementing a specific model capable of only producing desired outcomes.
9 Houston et al. (2001) discuss some common experimental designs for software process simulation, such as (fractional) factorial designs. For a more general view of experimental designs for simulation, we suggest Kleijnen et al. (2005).

Questionnaire Surveys

A study in which a sample of respondents answer a series of (mostly structured) questions, typically through a computerized or paper form

Application

This guideline applies to studies in which:

  • a sample of participants answer predefined, mostly closed-ended questions (typically online or on paper)
  • researchers systematically analyze participants’ answers

This standard does not apply to questionnaires comprising predominately open-ended questions1, literature surveys (see the Systematic Review Standard), longitudinal or repeated measures studies (see the Longitudinal Studies Standard), or the demographic questionnaires typically given to participants in controlled experiments (see the Experiments Standard).

Specific Attributes

Essential Attributes

  • identifies the target population and defines the sampling strategy (see the Sampling Supplement)
  • describes how the questionnaire instrument was created
  • describes how participants were selected or recruited (e.g. sampling frame, advertising, invitations, incentives)
  • step-by-step, systematic, replicable description of data collection and analysis
  • describes how responses were managed/monitored, including contingency actions for non-responses and drop-outs
  • EITHER: measures constructs using (or adapting) validated scales
    OR: analyzes construct validity (including content, convergent, discriminant and predictive validity) ex post3
  • explains handling of missing data (e.g. imputation, weighting adjustments, discarding)

  • analyzes response rates

  • acknowledges generalizability threats; discusses how respondents may differ from target population

  • provides the questionnaire instrument (as an appendix or supplementary materials)
  • the questionnaire design matches the research aims and the target population2

Desirable Attributes

  • provides supplementary materials including instrument(s), code books, analysis scripts and dataset(s)
  • characterizes the target population including demographic information (e.g. culture, knowledge)
  • accounts for the principles of research ethics (e.g. informed consent, re-identification risk)
  • explains and justifies instrument design and choice of scales (e.g. by research objectives or by analogy to similar studies)
  • validates whether the instrument’s items, layout, duration, and technology are appropriate (e.g. using pilots, test-retest, or expert and non-expert reviews)
  • reports how the instrument has evolved through the validation process (if at all)
  • analyzes response bias (quantitatively)
  • applies techniques for improving response rates (e.g. incentives, reminders, targeted advertising)
  • discusses possible effects of incentives (e.g. on voluntariness, response rates, response bias) if used
  • describes the stratification of the analysis (if stratified sampling is used)
  • defines and estimates the size of the population strata (if applicable)
  • clearly distinguishes evidence-based results from interpretations and speculation4

Extraordinary Attributes

  • provides feasibility check of the anticipated data analysis techniques
  • reports on the scale validation in terms of dimensionality, reliability, and validity of measures
  • longitudinal design in which each respondent participates two or more times
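
As one example of reporting scale reliability, the sketch below computes Cronbach's alpha, a common internal-consistency estimate, from a table of item responses. It covers reliability only and says nothing about dimensionality or validity. The function name and the response data are hypothetical.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for one scale.

    responses: one row per respondent, one column per item, no missing values.
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    columns = list(zip(*responses))  # transpose to one tuple per item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical 5-point Likert responses: five respondents, three items.
likert: List[List[int]] = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(likert)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The sketch assumes complete responses; how missing data were handled would need to be explained separately, as the essential attributes above require.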

General Quality Criteria

Survey studies should address quantitative quality criteria such as internal validity, construct validity, external validity, reliability and objectivity (see Glossary).

Variations

  • Descriptive surveys provide a detailed account of the properties of a phenomenon or population.
  • Exploratory surveys generate insights, hypotheses or models for further research.
  • Confirmatory surveys test formal (e.g. causal) propositions to explain a phenomenon.

Examples of Acceptable Deviations

  • Omitting part of a questionnaire instrument from supplementary materials due to copyright issues (in which case the paper should cite the source of the questions)
  • Not describing the handling of drop-outs or missing data because there were none

Invalid Criticism

  • Not reporting response rate for open public subscription surveys (i.e. surveys open to the anonymous public so that everyone with a link, typically broadcast among social networks, can participate).
  • Failure to release full data sets despite the data being sensitive.
  • Claiming the sample size is too small without justifying why the sample size is insufficient to answer the research questions.
  • Criticizing the relevance of a survey on the basis that responses only capture general people’s perceptions.
  • The results are considered controversial or hardly surprising.
  • The results do not accord with the reviewer’s personal experience or previous studies.

Suggested Readings

Don Dillman, Jolene Smyth, and Leah Christian. 2014. Internet, phone, mail, and mixed-mode surveys: the tailored design method. John Wiley & Sons.

Mark Kasunic. 2005. Designing an effective survey. Tech report #CMU/SEI-2005-GB-004, Carnegie-Mellon University, Pittsburgh, USA.

Jefferson Seide Molléri, Kai Petersen, and Emilia Mendes. An empirically evaluated checklist for surveys in software engineering. Information and Software Technology. 119 (2020).

Stefan Wagner, Daniel Mendez, Michael Felderer, Daniel Graziotin, Marcos Kalinowski. Challenges in Survey Research. In: Contemporary Empirical Methods in Software Engineering, Springer, 2020.

Paul Ralph and Ewan Tempero. 2018. Construct Validity in Software Engineering Research and Software Metrics. In Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018 (EASE’18), 13–23. DOI:10.1145/3210459.3210461

Marco Torchiano, Daniel Méndez, Guilherme Horta Travassos, and Rafael Maiani de Mello. 2017. Lessons learnt in conducting survey research. In Proceedings of the 5th International Workshop on Conducting Empirical Studies in Industry (CESI ‘17), 33–39. DOI:10.1109/CESI.2017.5

Marco Torchiano and Filippo Ricca. Six reasons for rejecting an industrial survey paper. In 2013 1st International Workshop on Conducting Empirical Studies in Industry (CESI). (2013), 21–26.

Exemplars

Jingyue Li, Reidar Conradi, Odd Petter Slyngstad, Marco Torchiano, Maurizio Morisio, and Christian Bunse. A State-of-the-Practice Survey on Risk Management in Development with Off-The-Shelf Software Components. In IEEE Transactions on Software Engineering. 34, 2 (2008), 271–286.

D. Méndez Fernández, Stefan Wagner, Marcos Kalinowski, Michael Felderer, Priscilla Mafra, Antonio Vetrò, Tayana Conte et al. Naming the Pain in Requirements Engineering: Contemporary Problems, Causes, and Effects in Practice. In Empirical software engineering. 22, 5 (2016), 2298–2338.

Paul Ralph, Sebastian Baltes, Gianisa Adisaputri, Richard Torkar, Vladimir Kovalenko, Marcos Kalinowski, et al. Pandemic Programming: How COVID-19 affects software developers and how their organizations can help. In Empirical Software Engineering, 25, 6, 2020, 4927–4961. DOI: 10.1007/s10664-020-09875-y

Stefan Wagner, Daniel Méndez Fernández, Michael Felderer, Antonio Vetrò, Marcos Kalinowski, Roel Wieringa, et al. 2019. Status Quo in Requirements Engineering: A Theory and a Global Family of Surveys. ACM Trans. Softw. Eng. Methodol. 28, 2, Article 9 (April 2019), 48 pages. DOI:10.1145/3306607


1 There is currently no standard for predominately open-ended questionnaire surveys. One exemplar readers could draw from is: Daniel Graziotin, Fabian Fagerholm, Xiaofeng Wang, and Pekka Abrahamsson. 2018. “What happens when software developers are (un)happy.” Journal of Systems and Software 140, 32-47.
2 questions are mapped to research objectives and their wording and format is appropriate for their audience
3 For advice on analyzing construct validity, see Ralph, Paul, and Ewan Tempero. “Construct validity in software engineering research and software metrics.” In Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018, pp. 13-23.
4 Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.

Systematic Reviews

A study that appraises, analyses, and synthesizes primary or secondary literature to provide a complete, exhaustive summary of current evidence regarding one or more specific topics or research questions

Application

  • Applies to studies that systematically find and analyze existing literature about a specified topic
  • Applies both to secondary and tertiary studies
  • Does not apply to ad-hoc literature reviews, case surveys or advanced qualitative synthesis methods (e.g. meta-ethnography)

Specific Attributes

Essential Attributes

  • identifies type of review (e.g. scoping review, meta-analysis, systematic mapping study, narrative synthesis, case survey, critical review)

  • presents step-by-step, systematic, replicable description of search process including search terms1
  • defines clear inclusion and exclusion criteria
  • specifies the data extracted from each primary study2; explains relationships to research questions
  • describes in detail how data were extracted and synthesized (can be qualitative or quantitative)
  • describes coding scheme(s) and their use

  • clear chain of evidence from the extracted data to the answers to the research question(s)

  • presents conclusions or recommendations for practitioners/non-specialists

Desirable Attributes

  • provides supplementary materials such as protocol, search terms, search results, selection process results; complete dataset, analysis scripts; coding scheme, examples of coding, decision rules, descriptions of edge cases
  • mitigates sampling bias and publication bias, using some (not all) of:
    (i) manual and keyword automated searches;
    (ii) backward and forward snowballing searches;
    (iii) checking profiles of prolific authors in the area;
    (iv) searching both formal databases (e.g. ACM Digital Library) and indexes (e.g. Google Scholar);
    (v) searching for relevant dissertations;
    (vi) searching pre-print servers (e.g. arXiv);
    (vii) soliciting unpublished manuscripts through appropriate listservs or social media;
    (viii) contacting known authors in the area.
  • demonstrates that the search process is sufficiently rigorous for the systematic review goals3
  • assesses quality of primary studies using an a priori coding scheme (e.g. the Systematic Reviews Standard); explains how quality was assessed; excludes low quality studies (ok) or models study quality as a moderating variable (better)
  • assesses coverage using funnel plots or percentage of known papers found
  • (positivist reviews) uses 2+ independent analysts; analyzes inter-rater reliability (see the IRR/IRA Supplement); explains how discrepancies among coders were resolved4
  • (interpretivist reviews) reflects on how researcher’s biases may have affected their analysis
  • consolidates results using tables, diagrams, or charts; includes a PRISMA flow diagram (cf. Moher et al. 2009)
  • performs analysis through an existing or new conceptual framework (qualitative synthesis)
  • uses meta-analysis methods appropriate for primary studies; does not use vote counting
  • integrates results into prior theory or research; identifies gaps, biases, or future directions
  • presents results as practical, evidence-based guidelines for practitioners, researchers, or educators
  • clearly distinguishes evidence-based results from interpretations and speculation5
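
To illustrate why the list above discourages vote counting, the sketch below contrasts it with a fixed-effect, inverse-variance pooled estimate. It is a minimal illustration under stated assumptions, not a recommended analysis: the effect sizes and standard errors are hypothetical, the function names are invented for this example, and a real meta-analysis would also examine heterogeneity and consider a random-effects model.

```python
import math


def vote_count(effects, std_errors, z_crit=1.96):
    """Count studies whose own result is significant (two-sided, 5% by default)."""
    return sum(abs(e / se) > z_crit for e, se in zip(effects, std_errors))


def fixed_effect_pool(effects, std_errors):
    """Return the inverse-variance pooled effect and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical standardized effects from four small primary studies.
effects = [0.30, 0.30, 0.30, 0.30]
std_errors = [0.20, 0.20, 0.20, 0.20]

significant = vote_count(effects, std_errors)
pooled, pooled_se = fixed_effect_pool(effects, std_errors)
z = pooled / pooled_se
```

With these made-up numbers each study has z = 1.5, so vote counting finds no significant study, while the pooled estimate has z of about 3. Vote counting discards effect magnitude and precision, which is why it can miss an effect that the combined evidence supports.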

Extraordinary Attributes

  • two or more researchers independently undertake the preliminary search process before finalizing the search scope and search keywords
  • contacts primary study authors to ensure interpretations are correct and to elicit additional details not found in the papers, such as access to raw data

Examples of Acceptable Deviations

  • No attempts to mitigate publication bias in a study explicitly examining a specific venue’s (e.g. CACM or ICSE) coverage of a given topic.
  • Using probability sampling on primary studies when there are too many to analyze (i.e. thousands).
  • No recommendations for practitioners in a study of a methodological issue (e.g. representative sampling).
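
Where probability sampling of primary studies is used, drawing the sample from a seeded random number generator lets others reproduce it. The following is a minimal sketch assuming simple random sampling without replacement; the study identifiers, sample size and seed are hypothetical, and stratified sampling may suit some reviews better.

```python
import random


def sample_primary_studies(study_ids, sample_size, seed):
    """Draw a simple random sample without replacement, reproducibly."""
    if sample_size > len(study_ids):
        raise ValueError("sample_size exceeds the number of primary studies")
    rng = random.Random(seed)  # dedicated generator, so the seed fully determines the draw
    return sorted(rng.sample(study_ids, sample_size))


# Hypothetical pool of 3000 candidate primary studies.
study_ids = [f"S{i:04d}" for i in range(1, 3001)]
sample = sample_primary_studies(study_ids, sample_size=300, seed=2024)
```

Reporting the seed, the sampling frame and the sample size alongside the protocol allows readers to regenerate the same subset.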

Antipatterns

  • A laundry-list description of the studies (A found X, B found Y, …), rather than a synthesis of the findings.
  • Relying on characteristics of the publication venues as a proxy for the quality of the primary studies instead of assessing primary studies’ quality explicitly.
  • Reviewing an area in which there are too few high-quality primary studies to draw reliable conclusions.

Suggested Readings

Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med 6, 7: e1000097. doi:10.1371/journal.pmed.1000097

Michael Borenstein, Larry V. Hedges, Julian P.T. Higgins, and Hannah R. Rothstein. 2009. Introduction to Meta-Analysis. John Wiley & Sons Ltd.

Daniela S. Cruzes and Tore Dybå. 2010. Synthesizing evidence in software engineering research. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ’10). Association for Computing Machinery, New York, NY, USA, Article 1, 1–10. DOI:10.1145/1852786.1852788

Barbara Kitchenham and Stuart Charters. 2007. Guidelines for performing Systematic Literature Reviews in Software Engineering.

Matthew B. Miles, A. Michael Huberman, and Johnny Saldaña. 2014. Qualitative Data Analysis: A Methods Sourcebook. Sage Publications Inc.

Kai Petersen, Robert Feldt, Shahid Mujtaba, and Michael Mattsson. 2008. Systematic mapping studies in software engineering. In 12th International Conference on Evaluation and Assessment in Software Engineering (EASE). (Jun. 2008), 1–10.


1 Searches can be manual or automated or a combination of both.
2 Primary studies are the studies that are being reviewed. In a tertiary study, the “primary studies” are themselves reviews.
3 e.g. formal meta-analysis of experiments has higher requirements for completeness than mapping studies of broad topic areas.
4 By discussion and agreement, voting, adding tie-breaker, consulting with study authors, etc.
5 Simply separating results and discussion into different sections is typically sufficient. No speculation in the results section.