Frequently Asked Questions

How do I cite the standards?

Please cite the empirical standards report from arXiv, as follows. Please do not cite this repo by URL.

Paul Ralph et al. (2021) Empirical Standards for Software Engineering Research. arXiv:2010.03525 [cs.SE].

For complete author list, see: https://arxiv.org/abs/2010.03525

Where can I find the standards?

You can find the standards, our report and our presentation slides here: https://github.com/acmsigsoft/EmpiricalStandards

Will the standards lead to lower-quality papers as researchers aim to do the bare minimum to get publications?

We have three responses to this question: (1) If everyone did the bare minimum to comply with the standards, overall research quality would skyrocket. (2) Some researchers may do the minimum, but academia is a land of overachievers. For everyone aiming for a bare pass, there are probably two or three determined to get a best paper award. (3) Over time, we can raise the minimum bar by making more attributes “essential.”

How do the standards affect incentives to produce unobjectionable but low-impact research vs. controversial but potentially high-impact research? Do they privilege rigour over relevance, interestingness and importance?

Structured, standards-based review will make it easier to publish controversial research because reviewers won’t be able to reject research just because they don’t agree with the conclusions. Structured, standards-based review will also make it easier to publish relevant research because it will prevent reviewers from misapplying expectations for controlled experiments to in situ approaches like Action Research.

Researchers are not going to stop trying to do meaningful, impactful research because of (1) promotion criteria that are not limited to publication venues; (2) the way research grants work; and (3) scholars’ own egos and ambitions pushing them to “put a ding in the universe.”

How do the standards define Usefulness and Significance? Do you think it is even possible to define such standards?

The general standard simply asks reviewers whether the study contributes to our collective body of knowledge. There is no rating of usefulness or significance.

Reviewers are not oracles and cannot reliably predict how research may be used or affect the world after publication. We take the position that all rigorous research should be published. It’s time to face the reality that usefulness and significance can only be assessed years after work has been published.

Even if a paper follows the standard, a reviewer can still reject it by saying it is poorly written (perhaps because he/she is working on the same topic). Where does writing quality fit in the standards? How can that reviewer behavior be spotted? Is language a legitimate ground for rejection?

The general standard has a simple test for the quality of writing: “any grammatical problems do not substantially hinder understanding”. What matters is whether the reader can understand the paper. If the writing is good enough to be understood, it’s good enough to publish. If the reader can’t understand the paper, then it’s not ready to publish. Structured review means that at least a majority of reviewers would have to feel that the paper was not understandable to lead to a reject decision.

Reviews are not only about acceptance, but about feedback. One difference between good and poor conferences (aside from acceptance rate) is the quality of feedback and improvement suggestions. Doesn’t this proposal make feedback an (optional) afterthought?

“Developmental reviewing” is a double-edged sword. Some reviewer feedback is helpful. A lot of it is destructive, abusive and factually incorrect. The point of the standards is to make expectations clear BEFORE the research is done and the paper is submitted, instead of revealing them afterward, when it’s too late to do the study a different way.

The standards initiative seeks to replace unstructured, unvetted feedback with structured feedback based on community consensus. Revisions will still get feedback like “make the research question clearer” or “provide more details on data collection, including X, Y and Z.” But the standards will prevent abuse (“the author should consider a different career”), cross-paradigmatic criticism (“Despite this being a qualitative paper, it needs more numbers”), and incorrect feedback (“This SLR isn’t systematic because it doesn’t include all relevant papers”). All of these are real examples.

If venues choose to include longform reviewer feedback, that’s up to them, but we discourage it, because so many reviewers have demonstrated time and again that they cannot be trusted to give correct, unbiased, constructive open-ended feedback.

Do the standards support ranking articles that otherwise meet the standards?

The standards allow a rough ranking of papers based on their number of desirable and extraordinary attributes. We are not sure that ranking articles is a good idea.
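
For illustration only, here is a minimal Python sketch of what such a rough ranking could look like, assuming reviewers’ checklist responses are recorded as booleans per paper; the data structure, names and values are invented and are not part of the standards or their tooling.

```python
# Hypothetical sketch (not part of the standards tooling): rank papers that
# meet all essential criteria by how many desirable and extraordinary
# attributes reviewers checked. The data below are invented.

papers = {
    "paper-A": {"essential": [True, True, True],
                "desirable": [True, False, True],
                "extraordinary": [False]},
    "paper-B": {"essential": [True, True, True],
                "desirable": [True, True, True],
                "extraordinary": [True]},
    "paper-C": {"essential": [True, False, True],  # fails an essential criterion
                "desirable": [True, True, True],
                "extraordinary": [True]},
}

def meets_essentials(checks):
    return all(checks["essential"])

def extra_attributes(checks):
    return sum(checks["desirable"]) + sum(checks["extraordinary"])

publishable = {name: c for name, c in papers.items() if meets_essentials(c)}
ranked = sorted(publishable,
                key=lambda name: extra_attributes(publishable[name]),
                reverse=True)
print(ranked)  # ['paper-B', 'paper-A']; paper-C is not publishable
```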

How would this work with conferences and journals that have fewer slots than submissions?

The limited number of “slots” at journals is a side effect of the idiocy of printing articles on dead trees, and we should accept no such limits. Moreover, using acceptance rate as a measure of conference quality is invalid, indefensible and unscientific. Competition is anathema to science.

If there are more accepted papers than speaking slots, perhaps we should let attendees vote for the most exciting papers (which get speaking slots) and the rest are presented as posters but still published as full papers in the proceedings. That’s just one idea. The point is that we should fundamentally rethink the way we run conferences, get rid of all the counterproductive competition and publish all legitimate research.

Could this kind of checklist of review criteria, in combination with enforced page limits, lead to more rejections in the end, since it is not possible to address all points in the guidelines and reviewers may reject a paper if, for example, they do not find any information on a particular criterion?

Arbitrary page limits are stupid for the same reason as slots at journals. That said, the standards should allow us to write shorter, more concise papers because you don’t have to explain as much when things are more standardized and you don’t have to include as much defensive text trying to head off myriad irrelevant criticisms. If you read through the standards, it’s all stuff you would include anyway. The standards also encourage authors to present additional detail in supplementary materials, appendices, replication packages, etc. If we get a lot of reports about not being able to fit all necessary detail in a specific kind of paper, we’ll revisit the standard and make the criteria more concisely addressable.

How can this method keep up with evolving methods and changing standards?

The standards will be hosted in a version control system (GitHub) with an issue tracker and maintainers who will consistently improve the standards based on feedback.

Will the standards make papers even less understandable to non-researchers (i.e., industry people)?

Trying to get professionals to read individual primary studies is probably unrealistic and might even be harmful. To the extent professionals read research, they should read systematic reviews and similar papers that synthesize bodies of work into practical recommendations. Therefore, the SLR standard includes the essential criterion: “presents conclusions or recommendations for practitioners/non-specialists.”

General conferences like ICSE don’t only accept empirical studies. If we only have detailed standards for empirical studies, would it make it more difficult to publish empirical research than non-empirical research?

We don’t think it’s possible to make it any more difficult to publish non-empirical research at ICSE. Years ago, the dominant method of evaluating a new software tool was to simply assert that it worked ‘because I said so.’ Software Engineering overreacted, and now publishing a non-empirical study at ICSE is practically impossible. It seems highly unlikely that the empirical standards, as currently written, could make top software engineering venues swing entirely the other way and become biased against empirical research. If anything, we need to create space to publish high-quality, non-empirical scholarship such as conceptual explorations of methodological and philosophical issues. Perhaps ICSE should have a separate track for non-empirical, foundational and conceptual scholarship.

Will the standards make it more difficult to publish a study using a methodology for which a standard exists than a study with a method that doesn’t have a standard?

No. Standards make it easier to publish because they prevent reviewers from inventing unexpected criteria.

Some papers mix a theoretical contribution (say new algorithm) with a small empirical evaluation. Should these be evaluated against the same standards?

The Engineering Research (AKA Design Science) Standard covers such papers.

What’s the current prospect of SE venues adopting these standards? Have any venues already used any of these standards?

EASE 2021 and ESEM 2021 have adopted the standards. JSS is trying to figure out how to integrate the standards into their Editorial Manager. We are currently focused on finishing the in-progress standards and building the system that generates the review forms. Then we will start working more with journals and conferences. That said, any reviewer can voluntarily use the standards for guidance, and any venue that wants to experiment with the standards is welcome to do so. Further adoption is inevitable: the standards are just so overwhelmingly useful that all the major journals and conferences will eventually adopt them in some way.

How did you ensure that your criteria are well-structured for every subject and field?

The criteria are divided by “methodology” rather than “subject and field.” We ensured that the criteria are reasonable and well-structured by recruiting experts in each method to draft its standard, revising each standard in several rounds and editing all the standards together for consistency. Of course, the standards remain imperfect; hence the focus on ongoing evolution and maintenance.

Why are the “essentials” just Boolean? Isn’t that too simplistic?

Yes/no questions will lead to the highest inter-rater reliability and the least confusion. The criteria are, however, not simplistic. For example, the general criterion “methodology is appropriate (not necessarily optimal) for stated purpose or questions” still requires an intelligent human being to think about the relationship between the paper’s purpose and method and decide whether they fit.

When will reviewers need to make an expert judgment / subjective judgment?

Every time a reviewer checks a criterion, they make a subjective judgment. It is fundamentally impossible to make peer review objective or algorithmic. The purpose of the standards is to direct expert judgment (like legal precedents), not to replace it.

Are the final decisions generated only by the weighted checklist? What about questions like whether the dataset is big enough or not?

There is no weighting of the checklist. All papers that meet the essential criteria are publishable. Venues may create more complex decision rules, but there’s nothing like a weighted average. There are several criteria related to sufficient statistical power (for quantitative work) or saturation (for qualitative work).
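
As a purely illustrative sketch (not the standards’ actual decision procedure), an unweighted venue decision rule might look something like the Python below; the per-criterion majority rule and the reviewer votes are assumptions made for the example.

```python
# Hypothetical sketch of an unweighted decision rule: each essential criterion
# counts as met if a majority of reviewers check "yes", and the paper is
# publishable only if every essential criterion is met. No weighted average.
# Reviewer votes below are invented.

def criterion_met(votes):
    """votes: one boolean per reviewer for a single essential criterion."""
    return sum(votes) > len(votes) / 2

def publishable(votes_per_criterion):
    return all(criterion_met(votes) for votes in votes_per_criterion)

reviews = [
    [True, True, True],   # criterion 1: unanimous
    [True, True, False],  # criterion 2: majority says met
    [True, False, True],  # criterion 3: majority says met
]
print(publishable(reviews))  # True
```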

It seems like there is also diversity in software submitted with papers. What lessons do you think can be transplanted from paper reviews to code reviews?

That is a very good question and we would be happy to facilitate research on this topic. We are working on a standard for artifact evaluation, including assessing software submitted with a paper, or to an artifact track.

Do you see the possibility of “standard hacking” in the future? Since in a way, it’s a bit like an “open book exam” for paper submissions.

We can only hope! The standards are full of research best practices; therefore, trying to hack a study/paper to maximize chance of acceptance by complying with the standards will dramatically improve research quality. The exam is supposed to be open book! The authors should know going in exactly what the reviewers are looking for. Keeping review criteria secret is insane.

What kind of criteria are in the standard for ____?

You can see all the criteria at: https://github.com/acmsigsoft/EmpiricalStandards

Is there any evidence yet that such guidelines do help in all the ways mentioned?

The effectiveness of checklists in professional activities is well established; that’s why surgeons use checklists in operating rooms. Furthermore, the idea of moving to more structured, atomistic, binary criteria to improve inter-rater reliability is well established.

Have the standards been empirically tested? e.g. Have you “back-tested” the standards against historical accepted and rejected papers and compared the outcomes to some ground-truth notion of paper quality?

Not yet because they have only existed for a few months and all of our time is currently consumed with trying to finish the standards that are still in progress. If anyone is interested in organizing empirical testing of the standards, we’d be happy to facilitate—please get in touch.

You mentioned that an experience report is not an empirical study. Is this general, or can there be studies in which the report actually uses data analysis and, implicitly, empirical methods?

I should have said that, at least for now, experience reports are outside the scope of the empirical standards, unless they systematically collect and analyze data, in which case they might count as case studies. The line between “experience report” and “case study” is a bit blurry.

Why aren’t the standards organized by section, e.g. “Introduction, Background, Method … Conclusion”?

We don’t want to micromanage paper organization. We want reviewers to focus on the substance of each paper and the rigor of the study it presents.

Is there any evidence of a broken peer review system?

Oh wow, where do we start? Virtually every empirical study of peer review has concluded that it is totally broken. Peer review is unreliable, inefficient, sexist, racist and biased against new ideas and non-native English speakers. For an overview, see:

Weller, A. (2001). Editorial peer review: Its strengths and weaknesses. Information Today, Inc.

and

Ralph, P. (2016). Practical suggestions for improving scholarly peer review quality and reducing cycle times. Communications of the Association for Information Systems, 38, Article 13.

Aren’t many problems with peer review driven by the lack of monetary compensation from the publishers to the reviewers? Good technical reviews demand a lot of time, and reviewers should get paid for that.

We 100% support initiatives to financially compensate reviewers for their work.

How about the people factor? We all know that there are so many crappy reviews even in top conferences. When somebody submits even an obviously bad review, absolutely nothing happens to the reviewer. The PC chairs don’t do any kind of quality checks; they just let the bad reviews go. There is no penalty for providing a crappy review.

Reviewers are volunteers, so editors and program chairs feel incapable of disciplining them when they misbehave. Associate Editors and meta-reviewers are often unwilling to disagree with reviewers because they don’t want to make enemies. The reviewers who most need correcting are typically the most defensive, so when an editor does correct them, they tend to complain, and then the editor gets labelled as someone who puts down other reviewers and is excluded from future editorial roles. We cannot fix a volunteer organization by applying authority, no matter how legitimate. Instead, we should redesign the review system so that it actively prevents bad reviewing in the first place. We cannot physically force reviewers to read more carefully, but we can prevent them from writing an incompetent essay by reducing free text and increasing structure.

How can the Empirical Standards help with malicious reviewers?

A structured, standards-based review process is better for identifying problematic reviewers: they will have lower levels of agreement with other reviewers, and agreement is easy to measure with structured review, so the system can start flagging problem reviewers.
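
To illustrate, here is a minimal Python sketch, using invented reviewer responses, of how agreement could be measured once reviews are reduced to structured yes/no answers; a real analysis might prefer a chance-corrected statistic such as Cohen’s kappa.

```python
# Hypothetical sketch: with structured yes/no reviews, pairwise agreement
# between reviewers is trivial to compute, and a reviewer who consistently
# agrees far less with their peers can be flagged for closer inspection.
# The responses below are invented.

def percent_agreement(a, b):
    """Fraction of criteria on which two reviewers gave the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

reviewer_1 = [True, True, False, True, True]
reviewer_2 = [True, True, False, True, False]
reviewer_3 = [False, False, True, False, True]  # out of step with both peers

print(percent_agreement(reviewer_1, reviewer_2))  # 0.8
print(percent_agreement(reviewer_1, reviewer_3))  # 0.2
```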

Furthermore, the current, broken system allows one negative reviewer to override two or three positive reviewers, leading to a paper being rejected. The standards don’t work that way. If two reviewers don’t agree on a criterion (like whether or not an experiment has a control group), the question is elevated to a third reviewer or editor. There’s no long-winded argument; the system just asks for another opinion. All disagreements must be resolved one way or the other, and the decision is based on the venue’s decision rules, not any individual reviewer’s beliefs. There’s no more simply saying “one of the reviewers is unhappy.”

ICSE’s move to consensus decision making through reviewer discussion sounds good, but it’s not. Disagreements between reviewers get nasty fast, so everyone walks on eggshells, and that means the loudest, most belligerent reviewer often gets their way. Negative reviewers talk down positive reviewers more often than the other way around, decreasing acceptance rates. Detailed, specific review criteria allow us to go back to voting, which is fairer and more efficient.

Premise: this initiative sounds valuable. Doubt: a “standard” is inflexible by definition, hence risky. For example: random selection of participants is not always (!) a good thing; how do we avoid judging works on the standard and not on the actual specific work?

Our current peer review system is risky. It risks rejecting sound research because of biased reviewers, disenfranchising good scholars so that they quit, and accepting well-written but methodologically shoddy work. Our current system creates division and inequity: scholars get passed over for well-deserved grants, tenure and promotion, and papers get bounced around to five different venues before being published. Standards-based review mitigates these risks.

The standards don’t say things like ‘participants should be randomly selected’. The standards say things like ‘the paper should explain the sampling strategy and why it is appropriate for the study.’

Do the standards discriminate against researchers who invent new methods or apply atypical methods? Do you think the general framework of empirical standards would translate to other disciplines? Does the framework have the flexibility to evaluate interdisciplinary research, or research involving innovative or atypical methods?

The general standard explicitly states that new or atypical studies should not be rejected just because there’s no applicable method-specific standard. The general standard should still apply to new, innovative methods. It says things like “Discusses the importance, implications and limitations (validity threats) of the study.” It’s hard to imagine a new, innovative research method where papers no longer discuss their implications and limitations. However, the standards also emphasize that justified deviations are welcome. Whenever a reviewer feels that a criterion is not satisfied, the system will explicitly ask the reviewer whether a reasonable justification was offered.

The general standard was written for software engineering research. It should apply to just about all empirical software engineering research. It might apply to some other fields with a little tweaking.

The other standards are method-specific. The Questionnaire Survey standard is for questionnaire surveys. It should not be used to assess a different kind of study, like a focus group. Some of the method-specific standards can probably be adapted for other research areas, and people from other communities are welcome to try.

How do other disciplines go about this? Are they less diverse than SE? Do they have empirical standards that work and yield the discussed advantages?

All communities attempt to write criteria for judging scientific work (such as reviewer guidelines), and some scientific communities (e.g., SIGPLAN) have attempted to write things sort of like “empirical standards.” These previous attempts have met with some success despite suffering from several problems, including: (1) lack of detail; (2) trying to force everyone into an overly limited and outdated view of science (like logical positivism); and (3) failing to distinguish between dissimilar methodologies. Our standards are far more detailed and comprehensive, without imposing a singular epistemological or ontological position. We accomplish this by taking different approaches in different standards. Our community’s high level of methodological diversity necessitates this multi-standard approach.

It would be useful to cover method X as well. Are you planning to?

If you’d like to see a standard created for a certain method (e.g. Focus Groups) please lodge an issue on the GitHub repo. You can suggest people who might be able to draft such a standard, including yourself, and content that might be included. We have to balance the need for method-specific guidance against the complexity of having too many standards, but if lots of people want a standard for a certain method and someone is willing to write a first draft, then we’ll include it.

This stuff needs to be used to form the next generation of software engineers. Any plans to integrate this into curricula?

The standards clearly have a lot of pedagogical value for teaching research methods. Paul is using them in his advanced quantitative methods course in Winter 2021. We are not sure if they are appropriate for undergrads, but if anyone wants to try, they are welcome to do so and we would love to hear about their experiences.