Ethics (Engineering Research)

A study in which a novel technological or socio-technical artifact is created and evaluated.

Application

This standard applies to studies that propose and evaluate technological artifacts, including algorithms, models, languages, methods, systems, tools, and other computer-based technologies.

Specific Attributes

Essential Attributes

  • describes anticipated benefits of the artifact for individuals, communities, organizations or society as a whole
  • describes any burdens created by using the artifact, on whom these burdens fall, whether burdens fall disproportionately on vulnerable individuals or groups and, if so, attempts to mitigate such burdens
  • describes all plausible, non-trivial potential harms arising from using the artifact (e.g. in terms of privacy (including surveillance and workplace monitoring), security, exclusion, bias, inequality, accessibility, deception, pollution, environmental destruction and other societal, community, organizational, individual or ecological impacts) and attempts to mitigate such harms
  • considers ways in which the artifact could be abused, misused, hacked, compromised, or otherwise applied to nefarious or unethical purposes, and describes attempts to mitigate such abuses
  • considers unintended consequences of using the artifact and describes attempts to mitigate them
  • considers unethical or otherwise problematic behaviours that could be encouraged by using or misusing the artifact in the anticipated context of use and describes attempts to mitigate such behaviours
  • describes the environmental impact of the work in terms of carbon emissions (for development and use) and other respects (e.g. materials, mineral mining required)
  • justifies the potential burdens, harms, misuses, unintended consequences and environmental impacts of the artifact in terms of its benefits, considering particularly the relationship between who reaps the benefits and who suffers the burdens, harms, etc., and how the latter are mitigated

Desirable Attributes

  • quantifies the carbon emissions generated by (1) developing and (2) using the artifact, especially for machine learning1
  • quantifies carbon emissions over the whole line of research, not just this part of the work (for artifacts emerging from a line of research)1
  • relates ethics discussion to frameworks of professional practice or the UN Sustainable Development Goals
  • discusses relative harms and benefits when describing alternatives in related work
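As a rough illustration of the first two attributes above, a back-of-envelope estimate multiplies the energy drawn by the hardware by the carbon intensity of the electricity supply. The sketch below is a minimal example under stated assumptions: the function name, the power draw, the PUE (power usage effectiveness) and the grid intensity figures are all hypothetical placeholders rather than measurements, and should be replaced with values for the actual hardware and region.

```python
def estimate_co2e_kg(power_draw_watts, hours, grid_kg_co2e_per_kwh, pue=1.0):
    """Estimate emissions (kg CO2e) for a compute job.

    energy (kWh) = power (kW) * time (h) * PUE
    emissions    = energy (kWh) * grid carbon intensity (kg CO2e per kWh)
    """
    if min(power_draw_watts, hours, grid_kg_co2e_per_kwh) < 0 or pue < 1.0:
        raise ValueError("inputs must be non-negative and PUE must be >= 1")
    energy_kwh = (power_draw_watts / 1000.0) * hours * pue
    return energy_kwh * grid_kg_co2e_per_kwh


# Hypothetical figures, for illustration only.
development = estimate_co2e_kg(300, 100, 0.4, pue=1.5)  # (1) developing the artifact
per_use = estimate_co2e_kg(50, 0.01, 0.4, pue=1.5)      # (2) a single use of the artifact

print(f"development: {development:.2f} kg CO2e")
print(f"per use:     {per_use:.6f} kg CO2e")
```

Reporting the two figures separately mirrors the distinction between developing and using the artifact; summing estimates across all experiments in a line of research would address the second attribute.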

Extraordinary Attributes

  • provides complete statement of ethical approval from institutional REC/IRB in relation specifically to the technological artifact (typically as supplementary materials)
  • describes measures taken to reduce the environmental impact of the work such as carbon offsetting

General Quality Criteria

Assessing the ethicality of a new technology is mainly about (1) the degree to which potential harms are acknowledged and mitigated, and (2) juxtaposing harms and benefits. We do not reject a new technology because of the potential for harm—virtually every technology has costs and potential for harms and misuse. Rather, we reject a paper when it evinces a general lack of awareness or consideration of potential harms, obfuscation of harms, when the harms outweigh the benefits, or when benefits accrue to people who don’t really need them (e.g. the rich and powerful) while the costs are borne by different people who cannot afford them (e.g. the poor and vulnerable).

Antipatterns

  • Justifying an artifact as beneficial solely on its own merits without consideration of the broad impacts of its use
  • Justifying an artifact as beneficial solely on the grounds that it is similar to other artifacts that have been useful and without considering the impacts specifically in this case
  • Justifying an artifact as beneficial without considering the potential contexts of use, including beyond those targeted
  • Failure to identify any misuse or risk scenarios

Examples of Acceptable Deviations

  • Justified inability to quantify machine learning carbon costs1

Invalid Criticisms

  • “The study was not approved by an ethical review board” is not grounds for rejection unless there is a substantive ethical concern relating to the artifact that such a board should have prevented.
  • “There exists a potential harm, burden, etc. that the paper did not consider” is not by itself grounds for rejection in a paper that discusses many harms, burdens, etc.; it is normal for reviewers to dream up scenarios that authors have not considered.

Suggested Readings

British Computer Society, Code of Conduct, [https://www.bcs.org/media/2211/bcs-code-of-conduct.pdf]

Don Gotterbarn, Keith Miller, and Simon Rogerson. 1997. Software engineering code of ethics. Commun. ACM 40, 11 (November 1997), 110-118. DOI: 10.1145/265684.265699

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. Quantifying the Carbon Emissions of Machine Learning. NeurIPS2019 at ClimateChangeAI. [https://arxiv.org/abs/1910.09700v2]

Michael D. Myers and John R. Venable. 2014. A set of ethical principles for design science research in information systems. Information & Management 51, 6, 801-809. DOI: 10.1016/j.im.2014.01.002

United Nations. 2015. Sustainable Development Goals. [https://sdgs.un.org/goals]

Caroline Whitbeck. 2011. Ethics in Engineering Practice and Research. Cambridge University Press. ISBN 978-0521723985. Chapters 10 and 11 in particular.


1 The tool at [https://mlco2.github.io/impact/#home] provides an estimate of the carbon impact of machine learning for a wide variety of hardware running in various cloud platforms including private infrastructure, producing results in a publication-friendly format.

Ethics (Studies with Human Participants)

A study in which information and data are gathered directly from human beings

Application

This standard applies to studies in which one or more human beings other than the researchers take part in the research (e.g. Action Research, Case Study, Grounded Theory, Qualitative Surveys, Experiments, Questionnaire Surveys).

Specific Attributes

Essential Attributes

  • describes all plausible, non-trivial potential risks or harms to participants (if any)
  • describes any steps taken to mitigate risks or harms
  • explains how benefits of the research justify any non-trivial risks or harms
  • explains how participants were recruited
  • explains the compensation (if any) offered for participation and how coercion was mitigated
  • explains how participants' free, individual, informed consent to participate was obtained1
  • explains how participants' privacy and reputation were respected in the conduct and reporting of the research2
  • explains how data about or supplied by participants was protected in the conduct of the research3
  • explains the permissions given by participants for publication and sharing of their data by the researchers, and the permissions given by participants for others to use this data in future
  • states any barriers to participation in the research process and EITHER justifies them OR explains how they were minimized4

Desirable Attributes

  • cites (with application number) approval of a national, institutional, or other appropriate scholarly ethical review board
  • cites the reference ethics framework(s) within which the work was conducted5
  • discusses justice, ethics, diversity and inclusion issues arising from eligibility and participation
  • explains how the impact of the research process on participants was considered
  • explains, where relevant, how risks to the researchers themselves were mitigated in the conduct of the research
  • where the research refers to identifiable persons (other than participants), explains how their privacy is respected and protected (since they have not formally consented to participation)

Extraordinary Attributes

  • supplementary materials include a complete application to a scholarly ethical review board and documentation of its approval

General Quality Criteria

Research should balance the anticipated benefits of a study with the potential risks and/or harms to participants, minimizing risk or harm wherever possible. Studies involving human beings should respect participants' rights and abilities to decide whether to participate and/or withdraw, on the basis of full information about the risks and benefits of participating, the measures that will be taken to protect them (including their privacy and reputation), and where appropriate, measures to protect their organization(s).

Antipatterns

  • Willful blindness to risks and harms to participants; justifying risks and harms on the basis that research is generally beneficial rather than in respect of benefits of the particular study; downplaying risks and harms during recruitment, consent or reporting.

Examples of Acceptable Deviations

  • Potentially damaging the reputation of a named organization by responsibly disclosing or reporting impartial, corroborated evidence of illegal, dangerous or unethical behavior.6
  • The requirement to explain informed consent procedures may be modified where deception is a necessary part of the research protocol (although this itself must be explained in full7).
  • Maintaining participants' privacy is not required where it is demonstrated that this was clearly explained and freely agreed to by the participants during recruitment and consent.

Invalid Criticisms

  • "Study was not approved by an ethical review board" is not itself grounds for rejection unless there is a substantive ethical concern that such a board should have prevented.
  • "No consent form was used" is not a valid criticism where consent was given orally or implied (e.g., by completing a questionnaire that is explicitly part of a research project).

Note

Rarely, ethical research may require unlawful action. Ethical frameworks like Menlo include legal compliance so that the ethicality and legality of research can be weighed together. Therefore, the presence of a legally dubious action is not grounds for rejection if and only if the action is ethically justified, including the risks to the researcher.

Suggested Readings

American Anthropological Association. 2020. AAA Statement on Ethics. Retrieved November 6, 2020 from https://www.americananthro.org/LearnAndTeach/Content.aspx?ItemNumber=22869.

Association for Computing Machinery. 2018. ACM Code of Ethics and Professional Conduct. Retrieved November 17, 2020 from https://www.acm.org/code-of-ethics

Sally Dench, Ron Iphofen and Ursula Huws. 2004. RESPECT: An EU Code of Ethics for Socio-Economic Research. The Institute for Employment Studies, Brighton, UK. Retrieved November 6, 2020 from: http://www.respectproject.org/ethics/412ethics.pdf

David Dittrich and Erin Kenneally. 2012. The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Department of Homeland Security: Science and Technology. Retrieved November 6, 2020 from https://www.dhs.gov/sites/default/files/publications/CSD-MenloPrinciplesCORE-20120803_1.pdf

The British Psychological Society. 2014. Code of Human Research Ethics. The British Psychological Society, Leicester, UK. Retrieved November 6, 2020 from https://www.bps.org.uk/sites/www.bps.org.uk/files/Policy/Policy%20-%20Files/BPS%20Code%20of%20Human%20Research%20Ethics.pdf


1 If a gatekeeper was involved in recruitment, also explains how the gatekeeper was recruited, informed, and consented, and how power relationships between gatekeeper and individual participant were mitigated to ensure free individual informed consent.
2 e.g., use of anonymization or pseudonymization during data analysis and reporting, careful consideration of the reidentification potential through others combining multiple data sources (such as social media, or contextual information like company plus job role), clear explanation of the reidentification risks to participants prior to their consent.
3 Including, where appropriate, compliance with relevant data protection legislation (e.g. EU General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA)) in terms of matters such as the lawful basis for data processing, privacy notices offered, and data subject rights.
4 Exclusion from participation should follow from the purposes of the research. For example, an investigation of the experiences of gay software developers need not include interviewing straight software developers. However, the same investigation should not exclude wheelchair users by selecting an inaccessible site for the interviews. Participants should not be excluded accidentally.
5 e.g. AAA Statement of Ethics; ACM code of ethics, BPS, BERA, RESPECT, Menlo Report (see Suggested Readings)
6 Where this information is obtained from individuals who may be identified within an organisation, explains how consideration was given to their protection, alongside any undertakings given to the organisation about this kind of disclosure.
7 For example, where possible, participants must be informed that they are taking part in a study, and the study that is described to them should have similar risks and compensation as the one they actually participate in. Participants must: (1) be debriefed in terms of the data captured and the reasons for deception, and (2) be offered the opportunity to withdraw themselves and their data at that point. Support must be provided where appropriate to mitigate potential for embarrassment or psychological harm.
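Footnote 2 mentions pseudonymization during data analysis and reporting. One way this might be done is sketched below, assuming a keyed hash (HMAC-SHA256) is an acceptable way to derive stable pseudonyms. The field names, the key and the helper functions are hypothetical; this is an illustration, not a prescribed procedure.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Map an identifier to a stable pseudonym with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:length]


def strip_direct_identifiers(record: dict, secret_key: bytes) -> dict:
    """Replace the (hypothetical) 'name' field with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k != "name"}
    cleaned["participant"] = pseudonymize(record["name"], secret_key)
    return cleaned


# Hypothetical key and data, for illustration only. In practice the key
# would be generated randomly and stored separately from the dataset.
SECRET_KEY = b"replace-with-a-randomly-generated-secret"
records = [
    {"name": "Alice Example", "role": "developer", "answer": "I like pair programming"},
    {"name": "Bob Example", "role": "tester", "answer": "I prefer working alone"},
]

pseudonymized = [strip_direct_identifiers(r, SECRET_KEY) for r in records]
for row in pseudonymized:
    print(row)
```

Replacing names does not by itself prevent reidentification: as footnote 2 notes, contextual fields such as company plus job role may still identify a participant when combined with other sources.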

Ethics (Studies with Secondary Human Data)

A study that analyzes data previously gathered from (or supplied by) humans and held by a third party

Application

This standard applies to studies in which data previously gathered from humans and held by an organization is used by the researchers (e.g. Exploratory Data Science). This may involve researchers gathering existing data themselves through technical means (e.g. website scraping) from a site where directly-contributed data is visible (e.g. social media sites, software repositories), or using datasets collected, curated and made available by others (e.g. published research datasets).

Specific Attributes

Essential Attributes

  • describes all plausible, non-trivial potential risks or harms to data custodians and data subjects
  • describes any steps taken to mitigate risks or harms
  • justifies the taking of risks and/or the causing of harm in terms of the claimed benefits of the research.
  • states the grounds on which the data used was legitimately available to the researchers1
  • explains how the research falls within the permissions given by the original data subjects2
  • explains the provenance of the original data in terms of the data custodian's documentation of the conditions of data capture, and whether this capture met the standards required for Studies with Human Participants (particularly in respect of consent for capture and future use).
  • explains how participants' privacy and reputation were respected in the conduct and reporting of the research. In particular, where multiple datasets are combined, explains how any additional privacy risks raised by this were mitigated.
  • explains how the privacy and consent of data subjects were respected during the research with reference to the conditions under which data was made available to the researcher3, and how any further publication of the data (in existing, combined, or transformed form) will respect the expectations and consent given by data subjects.

Desirable Attributes

  • explains how the researchers considered the impact of the research on those whose data was used
  • cites, with application number, approval of a national, institutional, or other appropriate scholarly ethical review board
  • cites the reference ethics framework(s) within which the work was conducted4

Extraordinary Attributes

  • supplementary materials include a complete application to a scholarly ethical review board and documentation of its approval

General Quality Criteria

Research should balance the anticipated benefits of a study with the potential risks and/or harms to participants, minimizing risk or harm wherever possible. In secondary data studies, researchers typically do not have access to the original data subjects and may therefore need to justify research based on information and terms supplied and imposed by intermediaries (to whom researchers also have ethical responsibilities). Researchers' ethical responsibility to the original participants remains and thus they have an ethical duty to seek and verify the provenance of the data they use, verify whether their intended use falls within the reasonable expectation and consent of those whose data it is, and to respect the rights and roles of the intermediaries as the proximate representatives of those for whose data they are responsible. Any use of data that is not legitimately available must be justified on strong and documented ethical grounds.

Antipatterns

  • Justifying data use as acceptable solely because it is technically available (e.g. machine accessible or downloadable) as opposed to legitimately available (i.e. under the custodian's terms of access and use, and with verified provenance permitting the current use under the terms of the original data surrender).5
  • Reidentification (or creating the conditions for reidentification) of the data subjects whose data is analyzed.6
  • Failing to consider potential, harmful, unintended consequences of the research.
  • Accepting (e.g. from a corporation) a dataset that has been collected without adhering to the principles set forth in the Ethics (Studies with Human Participants) Standard, e.g. without participants' consent or appropriate safeguards.

Examples of Acceptable Deviations

  • Using data outside of the norms of legitimate access on strong and documented ethical grounds (preferably citing approval of a national, institutional, or other appropriate scholarly ethical review board).
  • Reidentification of data subjects where there are strong and documented ethical grounds for disclosure of identity (e.g. to expose subject's unethical behavior).
  • Using data with unknown provenance if and only if data provenance information is unavailable from the custodian and the use is justified in terms of the balance between research benefit and risks to those whose data is used.

Invalid Criticisms

  • "Study was not approved by an ethical review board" is not grounds for rejection unless there is a substantive ethical concern that such a board should have prevented.

Note

Rarely, ethical research may require unlawful action. Ethical frameworks like Menlo include legal compliance so that the ethicality and legality of research can be weighed together. Therefore, the presence of a legally dubious action is not grounds for rejection if and only if the action is ethically justified, including the risks to the researcher.

Suggested Readings

Association for Computing Machinery. 2018. ACM Code of Ethics and Professional Conduct. Retrieved November 17, 2020 from https://www.acm.org/code-of-ethics

David Dittrich and Erin Kenneally. 2012. The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Department of Homeland Security: Science and Technology. Retrieved November 6, 2020 from https://www.dhs.gov/sites/default/files/publications/CSD-MenloPrinciplesCORE-20120803_1.pdf

Sally Dench, Ron Iphofen and Ursula Huws. 2004. RESPECT: An EU Code of Ethics for Socio-Economic Research. The Institute for Employment Studies, Brighton, UK. Retrieved November 6, 2020 from: http://www.respectproject.org/ethics/412ethics.pdf

aline shakti franzke, Anja Bechmann, Michael Zimmer, Charles Ess, and the Association of Internet Researchers. 2020. Internet Research: Ethical Guidelines 3.0. Retrieved November 6, 2020 from https://aoir.org/reports/ethics3.pdf

Nicolas Gold and Jens Krinke. 2020. Ethical Mining: A Case Study on MSR Mining Challenges. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR '20). Association for Computing Machinery, New York, NY, USA, 265–276. DOI: 10.1145/3379597.3387462

Lisa Sugiura, Rosemary Wiles, and Catherine Pope. 2016. Ethical challenges in online research: Public/private perceptions. Research Ethics 13, 3–4 (July 2016), 184-199. DOI: 10.1177/1747016116650720


1 Examples: the terms of use of a website explicitly permit scraping for the kind of information gathered and for the purposes of research; the research falls within the scope of usage terms for the retrieval API; a curated dataset is licensed for the purpose of research.
2 e.g.: (1) a public dataset’s website explicates that data was gathered from subjects with their consent for future research use compatible with that reported; (2) a software repository’s policies state that its users agree to their data being used for research and under conditions compatible with the research reported.
3 Including, where appropriate, compliance with relevant data protection legislation (e.g. EU General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA)) in terms of matters such as the basis for data processing, privacy notices offered, and data subject rights.
4 E.g. ACM Code of Conduct, Internet Research: Ethical Guidelines, Menlo Report, RESPECT (see Suggested Readings)
5 Note that there may be a difference between publicly viewable data on a privately-operated website (thus subject to site terms and conditions of access and use) and data that is in the public domain (i.e. publicly owned) or manifestly published.
6 Conditions for reidentification may include quoting ostensibly anonymous data content that can be resolved to its original source (and thus the identity of the originator) through a search engine.

Information Visualization

Diagrams that map quantitative values to visual objects and their visual attributes to aid understanding

Application

This crosscutting guideline applies to visualizations of quantitative data. Graphical visualization consists of diagrams that map quantitative values to visual objects (e.g. segments, points, circles) and to their visual attributes (e.g. length, position, size, color). Their goal is to support the reader in a data understanding task. This guideline does not apply to:

  • Software visualization, such as showing structure or architecture of a software product.
  • Diagrams not encoding quantitative values, e.g. UML activity diagrams, BPMN diagrams, flow charts.
  • Tables that encode data in textual format.

Attributes

  • proportionality
    • the values/measures are reported in a uniformly proportional way: the ratio of two values is equal to the geometrical ratio on paper (or screen) of the corresponding visual attributes (length, area, slope, etc.) [lie factor]
    • the visual attributes provide an accurate perception of the proportion, according to the attribute ranking:
      • position along a common scale
      • position along identical scales
      • length
      • angle/slope
      • area
    • diagrams are rendered in 2D and refrain from 3D perspectives that might alter the perception of dimensions
    • color encodings of ordinal measures use saturation and lightness and avoid rainbow palettes
  • utility

    • all the elements in the diagram convey useful information or support clarity
    • the diagram does not contain chartjunk or over-designed elements that interfere with perception or understanding
    • different colors represent data variation
    • the background is light and uniform
    • there are no decorative 3D effects
    • bright and saturated colors are used solely for emphasis
    • grids are light and do not obscure data
    • annotations are less prominent than data
  • clarity

    • the diagram layout and text annotation support the understanding of the data and make the visualization as self-contained as possible
      • the title (or figure caption) concisely conveys the content of the visualization and the intended message
      • text annotations guide the reader in understanding the message
    • direct labeling is used instead of a separate legend especially when there are more than two color codes
    • color encoding of categorical measures is limited to at most 5 distinct levels; more colors are too difficult to discriminate
    • when data points are very dense, appropriate techniques are applied to mitigate overplotting
    • axes and their tick marks are labelled
    • the text is large enough to be readable
    • rotated or vertical text is avoided
    • the image format is preferably vector; if a raster format is used, it must have sufficient resolution
  • diagram design

    • the diagram contains at most two axes unless a surface (function of two variables) is the standard representation
    • use of logarithmic scales is explicitly highlighted
    • data objects (e.g. bars) are sorted in a meaningful way (e.g. ascending, descending or grouped) to ease comparison
    • non-interactive visualization should serve a single understanding task
    • the type of visualization is appropriate for the visual understanding task that it is intended to support (Table 1)
    • the design is colorblind-friendly – it uses a colorblind-safe palette for color coding or a redundant visual variable (e.g., saturation, pattern)
    • (optional) several data series should usually be reported using multiple small diagrams rather than in a single crowded diagram
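The proportionality attribute above says the ratio of two values should equal the geometrical ratio of the corresponding visual attributes. A minimal sketch of that check follows, in the spirit of the lie factor; the function names and the example values are hypothetical, and the measure shown is simply the visual ratio divided by the data ratio, so 1.0 means the encoding is proportional.

```python
def visual_to_data_ratio(value_a, value_b, visual_a, visual_b):
    """Visual ratio divided by data ratio; 1.0 means a proportional encoding."""
    if value_a == 0 or value_b == 0 or visual_b == 0:
        raise ValueError("data values and the reference visual size must be non-zero")
    return (visual_a / visual_b) / (value_a / value_b)


def bar_length(value, baseline, units_per_value=1.0):
    """Drawn length of a bar on an axis that starts at `baseline`."""
    return (value - baseline) * units_per_value


# Hypothetical data: two measurements that differ by 10%.
a, b = 55.0, 50.0

# Axis starting at zero: bar lengths are proportional to the values.
full_axis = visual_to_data_ratio(a, b, bar_length(a, 0), bar_length(b, 0))

# Axis truncated at 45: the first bar is drawn twice as long as the second.
truncated_axis = visual_to_data_ratio(a, b, bar_length(a, 45), bar_length(b, 45))

print(f"zero baseline:      {full_axis:.2f}")
print(f"truncated baseline: {truncated_axis:.2f}")
```

A result noticeably above or below 1.0 suggests the diagram exaggerates or understates the difference between the two values.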

Table 1: When to use which visualization

| Understanding task | Commonly used type of visualization |
|---|---|
| Comparison | Bar plot, Dot plot, Heatmap, Strip plot, Isotype |
| Correlation | Scatter plot, Bubble chart, Slope chart, Dumbbell plot, Diverging bars |
| Deviation | Bullet graph, Gauge |
| Distribution | Histogram, Frequency polygon, Cumulative density, Quantile-quantile plot, Boxplot, Violin plot |
| Geo location | Choropleth map, Cartogram |
| Part-to-whole | Bar plot, Stacked bars, Treemap, Waffle, Pie |
| Ranking | Bar plot, Dot plot, Lollipop |
| Time series | Line plot, Bar plot, Streamgraph |

Anti-patterns

  • using truncated bars to exaggerate differences, which compromises proportionality, instead of using other representations (e.g. dot plots)
  • using pie charts for more than 5 categories and/or without direct labelling of the slices
  • using dual vertical scales that are difficult to read and lend themselves to ambiguity due to the arbitrary selection of axis ranges
  • using any 3D effect or decoration that may alter perception
  • using line plots to display unordered variables

Suggested Readings

Colin Ware. 2000. Information Visualization: Perception for Design. Morgan Kaufmann Publishers, Inc., San Francisco, California.

David Borland and R. M. Taylor II. 2007. Rainbow Color Map (Still) Considered Harmful. IEEE Computer Graphics and Applications. 27, 2, 14–17.

Financial Times Visual Journalism Team. Visual Vocabulary. Retrieved July 12, 2020 from https://ft.com/vocabulary

Claus O. Wilke. Fundamentals of Data Visualization, O’Reilly, 2019. Retrieved July 12, 2020 from https://serialmentor.com/dataviz/

Simon Fear. Publication quality tables in LaTeX. 2020. Retrieved July 12, 2020 from http://mirrors.ctan.org/macros/latex/contrib/booktabs/booktabs.pdf

Inter-Rater Reliability and Agreement 1

"The extent to which different raters assign the same precise value for each item being rated" (Inter-Rater Agreement; IRA) and "the extent to which raters can consistently distinguish between different items on a measurement scale" (Inter-Rater Reliability; IRR)2

Application

Applies to studies in which one or more human raters (also called judges or coders) rate (or measure, label, judge, rank or categorize) one or more properties of one or more research objects.3

Are multiple raters needed?

There is no universal rule to determine when two or more raters are necessary. The following factors should be considered:

  • Controversiality: the more potentially controversial the judgment, the more multiple raters are needed; e.g. recording the publication year of the primary studies in an SLR is less controversial than evaluating the elegance of a technical solution.
  • Practicality: the less practical it is to have multiple raters, the more reasonable a single-rater design becomes; e.g. multiple raters applying an a priori, deductive coding scheme to some artifacts is more practical than multiple raters inductively coding 2000 pages of interview transcripts.
  • Philosophy: having multiple raters is more important from a realist ontological perspective (characteristic of positivism and falsificationism) than from an idealist ontological perspective (characteristic of interpretivism and constructivism).

Specific Attributes

Essential Attributes

  • clearly states what properties were rated
  • clearly states how many raters rated each property

EITHER:

  • provides reasonable justification for using a single rater4

OR:

  • describes the process by which two or more raters independently rated properties of research objects; AND
  • describes how disagreements were resolved; AND
  • indicates the variable type (nominal, ordinal, interval, ratio) of the ratings; AND
  • reports an appropriate statistical measure of IRR/IRA5

Desirable Attributes

  • provides supplementary materials including: rating scheme(s), decision rules, example disagreements, IRR/IRA broken down by property or wave of analysis.
  • justifies the statistic used6
  • reports established interpretations or thresholds for the statistics used
  • analyzes anomalous results7 in light of the properties of the statistic used (e.g. Cohen's kappa anomalies8)
  • describes the raters' training or experience in the research topic
  • resolves disagreements through discussion (rather than voting)
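To illustrate why anomalous results should be read in light of the statistic's properties, the sketch below computes Cohen's kappa for two hypothetical pairs of rating vectors with identical raw agreement (90%) but different category prevalence. The data are invented for illustration, and the implementation is a minimal one for two raters and nominal categories.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning nominal categories to the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: both pairs agree on 90 of 100 items.
balanced_a = ["yes"] * 45 + ["no"] * 45 + ["yes"] * 5 + ["no"] * 5
balanced_b = ["yes"] * 45 + ["no"] * 45 + ["no"] * 5 + ["yes"] * 5

skewed_a = ["yes"] * 85 + ["no"] * 5 + ["yes"] * 5 + ["no"] * 5
skewed_b = ["yes"] * 85 + ["no"] * 5 + ["no"] * 5 + ["yes"] * 5

kappa_balanced = cohens_kappa(balanced_a, balanced_b)
kappa_skewed = cohens_kappa(skewed_a, skewed_b)

print(f"balanced categories: kappa = {kappa_balanced:.3f}")
print(f"skewed categories:   kappa = {kappa_skewed:.3f}")
```

With the skewed data, chance agreement is high because both raters choose "yes" 90% of the time, so kappa drops well below the balanced case even though raw agreement is unchanged. A low kappa of this kind may reflect prevalence rather than poor rating.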

Extraordinary Attributes

  • employs more than three raters per property
  • reports an iterative process with multiple cycles of (i) rating a subset of the data, (ii) resolving disagreements, and (iii) updating the rating scheme or decision rules until a minimum threshold indicates acceptable reliability/agreement; reports IRR/IRA for each cycle in an iterative process
  • calculates IRR/IRA for internal quality of the research, i.e., as a tool for progressively improving the consistency of rating systems thus improving researchers' reflexivity

Antipatterns

  • Reporting IRR where IRA is more appropriate and vice versa.
  • Pretending that the IRR or IRA statistic indicates good reliability when it is below established thresholds.
  • Calculating multiple IRR/IRA measures and reporting only the most favourable (p-hacking)
  • Pretending that an obviously positivist study adopts an idealist ontology to avoid employing multiple raters

Examples of Acceptable Deviations

  • Resolving disagreements using a tiebreaker instead of discussion because a rater was unavailable.
  • Supplementary materials do not include example disagreements because the data is sensitive.
  • There is no indication of the number of iterations performed to assess / improve the IRR / IRA.

Invalid Criticisms

  • Criticizing use of a single rater where multiple raters would be impractical or inconsistent with the study's underlying philosophy.
  • Criticizing use of a single rater when the data is such that there is no reason to suspect different raters would reach different conclusions.
  • Criticizing use of multiple raters. It is difficult to imagine a scenario in which multiple raters would actively harm a study's credibility.
  • 'IRR/IRA is too low' when there is no evidence-based threshold and reliability threats are clearly acknowledged.

Notes

  • IRR is a measure of correlation that can be calculated using (for example) Pearson's r, Kendall's tau or Spearman's rho
  • IRA is a measure of agreement that can be calculated using (for example) Scott's π, Cohen's κ, Fleiss's κ or Krippendorff's α
  • IRR/IRA analysis not only indicates reliability and objectivity (in positivist research) but also improves reflexivity (in anti-positivist research).
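
The difference between correlation (IRR) and agreement (IRA) can be illustrated with a minimal sketch using only the Python standard library. The ratings are invented for illustration: rater B always scores one point higher than rater A, so the trend in ratings is identical while no individual rating matches.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: how closely two raters' scores move together (IRR)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(xs, ys):
    """Cohen's kappa: chance-corrected exact agreement of two raters (IRA)."""
    n = len(xs)
    observed = sum(x == y for x, y in zip(xs, ys)) / n
    freq_x, freq_y = Counter(xs), Counter(ys)
    # Agreement expected by chance, from each rater's category frequencies.
    expected = sum(freq_x[c] * freq_y[c] for c in freq_x) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical data: rater B is consistently one point above rater A.
rater_a = [1, 2, 3, 4, 1, 2, 3, 4]
rater_b = [2, 3, 4, 5, 2, 3, 4, 5]

r = pearson_r(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"Pearson's r = {r:.2f}, Cohen's kappa = {kappa:.2f}")
```

Here the correlation is perfect while κ is negative (agreement is worse than chance), which is why the choice between the two kinds of measure should follow from whether absolute values or trends matter for the study.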

Suggested Readings

FENG, G. C. 2014. Intercoder reliability indices: disuse, misuse, and abuse. Quality & Quantity, 48 (3), 1803-1815.

GISEV, N., BELL, J. S., & CHEN, T. F. 2013. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, 9 (3), 330-338.

HENRICA C.W. DE VET, CAROLINE B. TERWEE, DIRK L. KNOL, LEX M. BOUTER. 2006. When to use agreement versus reliability measures. Journal of Clinical Epidemiology 59, 1033–1039.

NILI, A., TATE, M., BARROS, A., & JOHNSTONE, D. 2020. An approach for selecting and using a method of inter-coder reliability in information management research. International Journal of Information Management, 54.

O'CONNOR, C., & JOFFE, H. 2020. Intercoder reliability in qualitative research: debates and practical guidelines. International Journal of Qualitative Methods, 19.

Exemplars

JESSICA DIAZ, DANIEL LÓPEZ-FERNÁNDEZ, JORGE PEREZ, ÁNGEL GONZÁLEZ-PRIETO (in press) Why are many businesses instilling a DevOps culture into their organization? Empirical Software Engineering

JORGE PÉREZ, JESSICA DÍAZ, JAVIER GARCÍA-MARTÍN, AND BERNARDO TABUENCA. 2020. Systematic literature reviews in software engineering - enhancement of the study selection process using Cohen's Kappa statistic. Journal of Systems and Software, 168

JORGE PÉREZ, CARMEN VIZCARRO, JAVIER GARCÍA, AURELIO BERMÚDEZ, AND RUTH COBOS. 2017. Development of Procedures to Assess Problem-Solving Competence in Computing Engineering. IEEE Transactions on Education, 60 (1), 22-28

R. MOHANANI, B. TURHAN, P. RALPH, (in press) Requirements framing affects design creativity. IEEE Transactions on Software Engineering. DOI: 10.1109/TSE.2019.2909033

ZAPF, A., CASTELL, S., MORAWIETZ, L., & KARCH, A. 2016. Measuring inter-rater reliability for nominal data–which coefficients and confidence intervals are appropriate? BMC medical research methodology, 16, article 93


1 Assessing consistency among raters, where appropriate, promotes "systematicity, communicability and transparency of the coding process; reflexivity and dialogue within research teams; and helping to satisfy diverse audiences of the trustworthiness of the research" (O'Connor & Joffe 2020).
2 See Gisev, Bell, & Chen (2013)
3 For example: (a) applying selection criteria in an SLR; (b) an experiment in which a dependent variable like code quality is scored by humans; (c) deductive qualitative analysis with an a priori coding scheme in a positivist case study.
4 e.g. the ratings are uncontroversial; multiple raters would be impractical; multiple raters are inconsistent with the philosophical perspective of the work
5 Such as Pearson's r, Kendall's tau, Spearman's rho, Cohen's Kappa, Fleiss Kappa, Krippendorff's Alpha or (rarely) percent agreement
6 E.g. is the absolute value for each item being rated (IRA) or the trend in ratings (IRR) important?
7 E.g. high value of the observed agreement, but low value of the statistic or vice versa
8 See FEINSTEIN, A. R., & CICCHETTI, D. V. 1990. High Agreement But Low Kappa: I. the Problems of Two Paradoxes. J Clin Epidemiol, 43(6), 543–549.
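
The situation described in notes 7 and 8 (high observed agreement but a low value of the statistic) can be reproduced with a small sketch. The counts are invented for illustration: two raters apply an include/exclude criterion to 100 items, and almost every item is excluded.

```python
def kappa_from_table(table):
    """Observed agreement and Cohen's kappa from a square contingency table.

    table[i][j] = number of items rater A put in category i
    and rater B put in category j.
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total**2
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts; categories are [exclude, include].
table = [
    [90, 5],  # rater A: exclude
    [5, 0],   # rater A: include
]
observed, kappa = kappa_from_table(table)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

Because nearly all items fall into one category, the agreement expected by chance is already about 0.905, so κ is close to zero despite 90% raw agreement.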

Open Science

The practice of maximizing the accessibility and transparency of science

Application

The open science supplement applies to all research.

Principle

Artifacts related to a study and the paper itself should, in principle, be made available on the Internet:

  • without any barrier (e.g. paywalls, registration forms, request mechanisms),
  • under an appropriate open license that specifies purposes for re-use and re-purposing,
  • properly archived and preserved,

provided that there are no ethical, legal, technical, economical, or practical barriers preventing their disclosure.

Specific Attributes

Desirable Attributes

  • includes a section named data availability (typically after conclusion)
  • EITHER: links to supplementary materials
    OR: explains why materials cannot be released (reasons for limited disclosure of data should be trusted)
  • includes supplementary materials such as: raw, deidentified or transformed data, extended proofs, analysis scripts, software, virtual machines and containers, or qualitative codebooks.
  • archives supplementary materials on preserved digital repositories such as zenodo.org, figshare.com, softwareheritage.org, osf.io, or institutional repositories
  • releases supplementary material under a clearly-identified open license such as CC0 or CC-BY 4.0

General Criteria

Rather than evaluating reproducibility or replicability in principle, reviewers should focus on the extent to which artifacts that can be released, are released.

Invalid Criticisms

Researchers should not complain that a study involves artifacts which, for good reasons, cannot be released.

Examples of Acceptable Deviations

  • dataset is not released because it cannot be safely deidentified (e.g. interview transcripts; videos of participants)
  • source code is not released because it is closed-source and belongs to industry partner

Notes

  • authors are encouraged to self-archive their pre- and post-prints in open and preserved repositories
  • open science is challenging for qualitative studies; reviewers should welcome qualitative studies which open their artifacts even in a limited way
  • personal or institutional websites, version control systems (e.g. GitHub), consumer cloud storage (e.g. Dropbox), and commercial paper repositories (e.g. ResearchGate; Academia.edu) do not offer properly archived and preserved data.

Suggested Readings

Daniel Graziotin. 2020. Open science policies. Retrieved July 12, 2020 from https://ineed.coffee/open-science-policies/

Daniel Graziotin. 2020. SIGSOFT open science policies. Retrieved July 12, 2020 from https://github.com/acmsigsoft/open-science-policies/

Daniel Graziotin. 2018. How to disclose data for double-blind review and make it archived open data upon acceptance. Retrieved July 12, 2020 from https://ineed.coffee/5205/how-to-disclose-data-for-double-blind-review-and-make-it-archived-open-data-upon-acceptance/

Daniel Méndez, Daniel Graziotin, Stefan Wagner, and Heidi Seibold. 2019. Open science in software engineering. arXiv. https://arxiv.org/abs/1904.06499

GitHub. 2016. Making Your Code Citable. Retrieved July 12, 2020 from https://guides.github.com/activities/citable-code/. (How to automatically archive a GitHub repository to Zenodo)

Figshare. How to connect Figshare with your GitHub account. Retrieved July 12, 2020 from https://knowledge.figshare.com/articles/item/how-to-connect-figshare-with-your-github-account (How to automatically archive a GitHub repository to Figshare)

Registered Reports

An empirical study that is published in two phases: the plan (RR1) and the results of executing the plan (RR2)

Definition

"Registered reports" refers to studies that are conducted in two phases:

  • researchers create and publish a study plan, or phase 1 registered report (RR1), which is accepted in principle by a publishing venue;
  • researchers execute the plan and write up the results, or a phase 2 registered report (RR2), which then receives final acceptance from the same publishing venue.

Pre-registration refers to depositing the RR1 somewhere publicly visible. Pre-registration aims to prevent hypothesising after results are known, to mitigate unconscious researcher bias in data analysis, and combat publication bias.

Application

This standard applies to positivist, confirmatory (i.e. hypothesis-testing) studies with tightly-scoped analysis approaches. Pre-registering interpretivist, qualitative or exploratory research remains controversial.

Specific Attributes (RR1)

Essential Attributes

  • meets all essential criteria in The General Standard except those that require data:
    • does not present results
    • does not validate assumptions of statistical tests
    • does not discuss implications
    • does not contribute to collective body of knowledge
    • does not support conclusions with evidence or arguments
  • meets all essential criteria, in applicable empirical standards, that can be met before data collection¹
  • justifies importance of the purpose, problem, objective, or research question(s)
  • describes the research method in detail sufficient for an independent researcher to exactly replicate the proposed data collection and analysis procedures
  • the stated hypotheses can be tested with the data the researchers propose to collect

Desirable Attributes

  • presents preliminary data (e.g. from a pilot study) to justify the chosen approach (e.g. probability distributions).
  • includes a conditional structure (e.g. pre-specifying different tests for normal and non-normal distributions)
  • explains how the study will change based on the results of data analysis (i.e. conditional analysis)²
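
A conditional structure of this kind could be registered as an explicit decision rule. The sketch below is hypothetical: the function name, the 0.05 threshold, and the choice between Welch's t-test and the Mann-Whitney U test are illustrative assumptions, not requirements of this standard. It takes the p-values of a normality check (e.g. Shapiro-Wilk) for each group as input, so the branch taken is determined by the rule fixed in the RR1 rather than chosen after seeing the results.

```python
def choose_test(normality_p_values, alpha=0.05):
    """Return the pre-registered test for a two-group comparison.

    normality_p_values: p-values from a normality check, one per group.
    alpha: threshold fixed in the RR1 plan (assumed value).
    """
    if all(p >= alpha for p in normality_p_values):
        # No evidence against normality in any group.
        return "Welch's t-test"
    # At least one group departs from normality.
    return "Mann-Whitney U test"


print(choose_test([0.40, 0.22]))  # neither group departs from normality
print(choose_test([0.40, 0.01]))  # the second group does
```

Writing the rule down in executable form also makes it straightforward for RR2 reviewers to check that the reported analysis followed the registered branch.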

Specific Attributes (RR2)³

Essential Attributes

  • meets all essential criteria in The General Standard (no exceptions)
  • introduction, rationale and stated hypotheses are the same as the approved RR1 submission except for improvements based on feedback from RR1 reviews
  • EITHER: adheres precisely to the registered procedures
    OR: thoroughly justifies all deviations and explains how they affect the final analysis.
  • deviations, if any, are not justified based on the data
  • clearly designates as exploratory any unregistered post hoc analyses
  • unregistered post hoc analyses, if any, are justified, methodologically sound, and informative

Desirable Attributes

  • provides evidence that data was collected after the RR1 plan was accepted

Extraordinary Attributes

  • generates novel insights into the concept, process benefits, or limitations of registered reports

Antipatterns

  • deviating from the RR1 plan because it constrains exploratory research, when the plan did not mention exploration
  • postdictive deviations from RR1 plan; i.e., changes made knowing how they would affect the outcome of the study
  • Pre-registrations that are not verifiably committed prior to data collection, e.g. not time-stamped.

Invalid Criticisms

  • RR1: Insisting on complete data collection or detailed analysis and results
  • RR1: Rejecting exploratory or qualitative research: all kinds of research can be pre-registered even if they are not covered here
  • RR2: in hindsight, the RR1 plan was not appropriate (RR2 reviews should not criticize any aspect of the RR1 plan)
  • RR2: results are not statistically significant, novel, relevant or compelling; effect sizes too small

Justifying Deviations

Reviewers should not expect research to go exactly according to plan or authors to foresee every possible problem. Changes are acceptable as long as they are justified and not postdictive (i.e. changes made knowing how they will affect results). For example, researchers might drop a mistranslated question in a multi-lingual questionnaire survey.

Suggested Readings

Center for Open Science. Future-proof your research. Preregister your next study. Retrieved July 12, 2020 from https://www.cos.io/our-services/prereg

Center for Open Science. Registered reports: Peer review before results are known to align scientific values and practices. Retrieved July 12, 2020 from https://cos.io/rr/

Center for Open Science. Template reviewer and author guidelines. Retrieved July 12, 2020 from https://osf.io/pukzy/

Ben Goldacre, Nicholas J DeVito, Carl Heneghan, Francis Irving, Seb Bacon, Jessica Fleminger and Helen Curtis. 2018. Compliance with requirement to report results on the EU Clinical Trials Register: cohort study and web resource. BMJ 2018;362:k3218 DOI: 10.1136/bmj.k3218

Wiseman R, Watt C, Kornbrot D. 2019. Registered reports: an early example and analysis. PeerJ 7:e6232 DOI: 10.7717/peerj.6232


1 e.g. presents power analysis; describes how card sorting will be executed, lists anticipated statistical tests
2 e.g. as a decision tree; while not strictly required, omitting conditional analysis is extraordinarily risky for the authors
3 Adapted from https://osf.io/pukzy/ under CC-BY

Sampling

An empirical study where some of many possible items are selected

Application

This standard applies to empirical research in which the researcher selects smaller groups of items to study (a sample) from a larger group of items of interest (the population) using a usually imperfect population list (the sampling frame). Common items in software engineering research include people (e.g. software developers), code artifacts (e.g. source code files) and non-code artifacts (e.g. online discussions, user stories).

Specific Attributes

Essential Attributes

  • explains the goal of sampling (e.g. aiming for representativeness, identifying exceptional cases)
  • explains the sampling strategy, in particular the different filtering steps involved or the reasons for selecting certain objects
  • explains why the sampling strategy is reasonable (not necessarily optimal) for the sampling goal
  • explains the reasoning behind the selection of study objects (especially qualitative studies)
  • reports the sample size

Essential only if representativeness is a goal

  • states the theoretical population (what would the researcher like to generalize to?)
  • presents a replicable, concise, algorithmic account of how other researchers could derive the same sample
  • explicitly argues for representativeness (e.g. compares sample and population parameters, provides confidence interval and confidence level for sample size)
  • explains how the sample could be biased along the sampling steps
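
One way to give a replicable, algorithmic account of sample selection is to publish the sampling frame, a fixed random seed, and the selection script. The following is a minimal sketch under assumed names: the frame of repository identifiers and the seed are hypothetical, and the sample size comes from Cochran's formula for estimating a proportion (with finite population correction), which is only one of several ways to justify a sample size.

```python
import math
import random
from statistics import NormalDist


def required_sample_size(population_size, margin_of_error=0.05,
                         confidence_level=0.95, proportion=0.5):
    """Cochran's formula with finite population correction.

    proportion=0.5 is the conservative (largest sample) assumption.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z**2 * proportion * (1 - proportion) / margin_of_error**2
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))


def draw_sample(sampling_frame, sample_size, seed):
    """Draw a simple random sample that others can re-derive.

    Sorting first makes the result independent of the order in which
    the frame happened to be stored or downloaded.
    """
    ordered_frame = sorted(sampling_frame)
    rng = random.Random(seed)  # local generator; no global state
    return rng.sample(ordered_frame, sample_size)


# Hypothetical sampling frame of 1,000 repository identifiers.
frame = [f"repo-{i:04d}" for i in range(1000)]
size = required_sample_size(len(frame))
sample = draw_sample(frame, size, seed=20200712)
print(f"{size} of {len(frame)} items selected, e.g. {sample[:3]}")
```

Reporting the seed, the confidence level and the margin of error alongside the script lets other researchers derive the same sample, provided they use the same frame and the same Python version.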

Desirable Attributes

  • reports the approximate or exact sizes of populations and sampling frames
  • provides the sample, sampling frame, and sampling scripts as supplementary material (unless the collected data contains sensitive or protected information)
  • uses more sophisticated sampling strategies where appropriate, e.g.:

    • exploratory research: using purposive rather than convenience sampling for unit of analysis
    • case study: using purposive rather than convenience sampling for site selection
    • repository mining: using probability rather than convenience or purposive sampling (if a sampling frame is available)
    • online survey: using respondent-driven rather than snowball sampling
    • study with identifiable strata: using stratified random rather than simple random sampling
    • theory building: using theoretical rather than convenience sampling
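
For a study with identifiable strata, proportionate stratified random sampling can be sketched as follows. The strata (programming languages), item names, sampling fraction and seed are hypothetical, and selecting at least one item per stratum is one design choice among several.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, fraction, seed):
    """Sample the same fraction from every stratum.

    items: iterable of sampling-frame entries
    stratum_of: function mapping an item to its stratum label
    fraction: share of each stratum to select (0 < fraction <= 1)
    """
    strata = defaultdict(list)
    for item in sorted(items):
        strata[stratum_of(item)].append(item)

    rng = random.Random(seed)
    sample = {}
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))
        sample[label] = rng.sample(members, k)
    return sample


# Hypothetical frame: (project name, main language) pairs.
frame = (
    [(f"java-project-{i}", "Java") for i in range(600)]
    + [(f"python-project-{i}", "Python") for i in range(300)]
    + [(f"rust-project-{i}", "Rust") for i in range(100)]
)
sample = stratified_sample(frame, stratum_of=lambda p: p[1],
                           fraction=0.1, seed=42)
for language, projects in sample.items():
    print(language, len(projects))
```

Unlike a simple random sample of 100 projects, this guarantees that the smallest stratum is represented in proportion to its share of the frame.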

Examples of Acceptable Deviations

  • omitting a detailed account of the sampling strategy because it is explained in previous work using the same data set
  • using a very simple sampling strategy in exceptional circumstances where expediency outweighs representativeness (e.g. research during a disaster)

Antipatterns

  • making claims about a population, based on a sample, without providing an argument for representativeness
  • claiming that a sample is representative of a population because it was randomly selected from a sampling frame, without considering bias in the sampling frame
  • conducting underpowered research; i.e.:
    • quantitative research with a sample size insufficient to detect effects of the expected size¹
    • qualitative research with too little data for plausible saturation
  • justifying the selection of items merely by stating that they come from a "real-world" context, without providing additional reasoning why the selected items are suitable for the study context
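
Whether a quantitative design is underpowered can be checked before data collection. The sketch below uses the common normal approximation for comparing two group means with a two-sided test. The effect sizes (Cohen's d), significance level and target power are illustrative assumptions, and an exact calculation based on the t distribution gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, where d is the
    expected standardized mean difference (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {sample_size_per_group(d)} items per group")
```

The required sample grows quickly as the expected effect shrinks, so an implausibly large expected effect size makes an underpowered design look adequate.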

Invalid Criticisms

  • complaining about lack of representativeness or low external validity in studies where representativeness is not a goal
  • abstractly criticizing generalizability rather than pointing to best practices, e.g.:
    • invalid: 'as most respondents work in app development, the results may not generalize to other settings'
    • valid: 'the researchers should have sent participation reminders to mitigate response bias'
  • for qualitative research, claiming that the sample size is too small without considering how the items were selected (e.g. theoretical sampling) or the authors' argument for saturation.

Suggested Readings

Sebastian Baltes and Paul Ralph. 2020. Sampling in Software Engineering Research: A Critical Review and Guidelines. Empirical Software Engineering. https://arxiv.org/abs/2002.07764

William G. Cochran. 2007. Sampling techniques. Wiley.

Steve Easterbrook, Janice Singer, Margaret-Anne Storey, and Daniela Damian. 2008. Selecting Empirical Methods for Software Engineering Research. In Guide to Advanced Empirical Software Engineering. 285-311.

Barbara Kitchenham and Shari Lawrence Pfleeger. 2002. Principles of survey research: part 5: populations and samples. SIGSOFT Softw. Eng. Notes 27, 5 (September 2002), 17–20. DOI:10.1145/571681.571686

Gary T. Henry. 1990. Practical sampling. Sage 21.

Meiyappan Nagappan, Thomas Zimmermann, and Christian Bird. 2013. Diversity in software engineering research. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013). Association for Computing Machinery, New York, NY, USA, 466–476. DOI:10.1145/2491411.2491415


1 Expected effect sizes should be plausible. For instance, expecting any single factor (e.g. programming language) to explain 50% of the variance in software project success is not plausible.