Image credit: Photo by Janko Ferlič on Unsplash

Open science and reproducible literature searches

How to create a reproducible and systematic literature search for your research projects

The literature search is the first step in conducting a literature review. There are many tools available to search “the literature”, this ever-growing mass of research articles, commentaries, dissertations, and reviews, among other types of items. There are, however, very few resources focussing on how to search the literature.

While I have often seen research students worry that they might miss an important paper in their literature review, I often have no idea how they came up with the selection of papers included in their final review. As a marker, reviewer, or examiner, I gauge the appropriateness of the articles referenced using heuristics such as:

  • are the older key papers mentioned?
  • is the review up-to-date? (i.e., are most papers referenced published in the past five years?)

These are heuristics because, in all likelihood, I do not know all the key papers or all the recent papers published on the topic I review or supervise. The same goes for published literature reviews, whether they are a short preamble to an empirical study or aim to provide a state-of-the-art overview of a field, such as the reviews published in prestigious journals like the Annual Review of Psychology or Current Directions in Psychological Science. These reviews are a good point of entry on a topic, and can offer food for thought when they are written by leading authorities, but the method for selecting the articles included in these so-called narrative reviews is often opaque, undocumented, and highly subjective.

It is important to pause a moment and consider the difference between a literature search and a literature review. I am not advocating for (or against) narrative reviews. That would be an entirely different blog topic, comparing the relative merits, or rather, purposes of different methods for synthesising the information inside the articles one has selected for reviewing. Instead, my main point in this post is about the need for a more open, systematic, and reproducible method for searching the literature and selecting the articles to synthesise.

Scoping the literature

Whatever the initial source of the idea for a new research project, sooner or later you will need to identify what the topic of your research is and find out what other articles have been published on it. The issue nowadays is that any search might return thousands of articles, so where do you start? How do you know you have identified the most relevant articles? How do you create a reproducible search?

Search engines and bibliographic databases

Researchers use search engines and academic bibliographic databases to find academic peer-reviewed articles. Bibliographic databases generally focus on articles but can also include conference proceedings and books. Unlike library search engines, however, they also include a lot of meta-data such as subject-specific keywords, research areas, and abstracts. Meta-data is extremely important to help you refine your inclusion and exclusion criteria and ultimately arrive at a manageable, reproducible search string (more details on this below!).

There are freely available search engines such as Google Scholar or PubMed which will return a list of articles based on a keyword search. There are also subscription-only databases such as Web of Science, Scopus, or PsycInfo which you can access through a university database list. These databases offer superior meta-data to refine your search. The table below lists a selection of these meta-data filters. These filters are crucial to help you reduce the number of articles returned by your search.

Database | Selected meta-data filters
Google Scholar | Sort by time, relevance; Custom date ranges
PubMed | Sort by time, relevance; Custom date ranges; Article types (e.g., Classical articles, Review, Peer-review…); Text availability (Abstract, free full text…); Species (Humans, Other animals); Subjects (AIDS, Cancer, Systematic reviews…); Languages; Search fields (e.g., MeSH major topic, MeSH subheading or MeSH terms…); Ages (e.g., Child, Infant, Adults, Aged…); Export batch size: 200
Social Sciences Citation Index | Sort by time, relevance, times cited, usage count…; Custom date ranges; Document types (e.g., Articles, Review); Languages; Web of Science Index (e.g., Conference Proceedings Citation Index-Social Sciences and Humanities, Social Sciences Citation Index); Author Keywords (DE); Keywords Plus® (ID); Usage Count Since 2013 (U2); Web of Science Categories (WC); Research Areas (SC); Export batch size: 500
Scopus | Sort by time, relevance, cited references…; Custom date ranges; Document types (e.g., Article, Review…); Subject area (Social sciences, Multidisciplinary…); Keyword; Source type (Journals, Conference Proceedings…); Language; SciVal Topic Prominence; Indexed keywords; Export batch size: 2000
PsycInfo | Sort by time, cited references; Custom date ranges; Publication type (e.g., Peer-reviewed journal…); Subject headings; Human; English language; PsycARTICLES Journals; Export batch size: 200(?)

Identifying key search terms

The first step in scoping the literature is to identify the relevant key search terms for your topic. There are several channels you can use to do so. You can ask experts (e.g., your supervisor and/or your librarian). You can also use the databases.

To illustrate how you can use databases to build a list of keywords, I outline my search for relevant key search terms for our new TORR project.

The research topic of the TORR project is scholarly peer-review. In particular, we want to better understand what information peer-reviewers may use when reviewing a grant proposal (as opposed to a journal article) in the humanities and social sciences. We have already conducted a heuristic search to identify key recent papers, but we are now at the stage where we want to produce a synthesis of past research on this topic. We are ready to start scoping.

The aim of the scoping exercise is to create an overarching research question that will inform how we select articles and what information we retrieve when we read them. Scoping also allows us to try and test our search strategy.

The scoping exercise involves a few basic steps:

  1. Select one database (e.g., Google Scholar, Medline, or if you have access via your uni, Web of Science, Scopus or PsycInfo) and search for keywords and key phrases relevant to your topic.
  2. Run an initial search with an initial search string, screen the results to assess the proportion of relevant articles based on title and keywords, and note down additional relevant keywords or key phrases to refine your search string.
  3. Repeat with (an)other database(s).

The key here is to (a) identify the keywords that other people have used to label your topic, and (b) find a balance between the extent to which your search string is able to capture all relevant articles (sensitivity) and the extent to which it is able to exclude irrelevant articles (specificity).
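
As a rough illustration of this balance, here is a minimal Python sketch that summarises one scoping run. It assumes you have screened the returned records by hand and that you keep a short benchmark list of key papers you already know to be relevant; the function name, the identifiers and the benchmark list are all hypothetical. Because the number of irrelevant articles correctly left out is unknown in practice, the sketch reports the proportion of returned records judged relevant as a practical stand-in for specificity.

```python
def search_metrics(returned_ids, relevant_ids, benchmark_ids):
    """Summarise one scoping run of a search string.

    returned_ids:  identifiers of every record the search returned
    relevant_ids:  returned records judged relevant during screening
    benchmark_ids: key papers already known to be relevant (hypothetical list)
    """
    returned = set(returned_ids)
    relevant = set(relevant_ids) & returned
    benchmark = set(benchmark_ids)

    # Share of the returned records that survived screening.
    proportion_relevant = len(relevant) / len(returned) if returned else 0.0
    # Share of the known key papers that the search string captured.
    sensitivity = len(benchmark & returned) / len(benchmark) if benchmark else 0.0

    return {
        "proportion_relevant": proportion_relevant,
        "benchmark_sensitivity": sensitivity,
    }


# Made-up identifiers, for illustration only.
print(search_metrics(
    returned_ids=["a", "b", "c", "d", "e"],
    relevant_ids=["a", "b"],
    benchmark_ids=["a", "b", "f", "g"],
))
# {'proportion_relevant': 0.4, 'benchmark_sensitivity': 0.5}
```

A low proportion of relevant records suggests the string is too broad; a low benchmark sensitivity suggests it is missing keywords that other authors use for your topic.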

Each of the databases listed above has its own specific strengths and weaknesses, but they all offer ways to identify relevant keywords and key search terms.

Scoping step 1

I like to use Google Scholar for my initial keyword search because when you enter a keyword, you get prompts for popular search expressions.

If you are searching for a compound such as peer reviewers, you should enter it as a search expression enclosed in quote marks (“peer reviewers”) to increase the specificity of your search. Otherwise you will also pick up articles which mention any one of these words separately.

The screenshot below shows the search expressions coming up when I typed "peer-reviewer" in Google Scholar on the 31st of March 2019.

Search expressions suggestions for peer-reviewer in Google Scholar

This gives you several prompts for search expressions which are relevant to our topic such as peer reviewer recommendation or peer reviewer opinion as well as peer reviewer comments.

On the basis of this first scoping, I skim through the articles returned and I start building up a list of keywords. Note that the choices I make are subjective and will ultimately define the set of articles which end up in the review.

You (or a reviewer) may disagree with my choice but it does not matter! The key here is to make those subjective choices explicit so that the process by which the final set of articles was selected is reproducible and transparent. You can always amend it (or not), based on comments and suggestions from third parties.
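
One way to make those choices explicit is to keep a simple search log. The sketch below is only a suggestion: the column names and the `log_search` helper are hypothetical, and it writes to an in-memory buffer so that it runs as-is (a file opened with `open("search_log.csv", "a", newline="")` could be used in its place). The number of hits is left blank because it is not recorded in this post.

```python
import csv
import io

# Hypothetical column names for a search log.
FIELDS = ["date", "database", "search_string", "hits", "notes"]


def log_search(handle, date, database, search_string, hits=None, notes=""):
    """Append one search to a CSV log, writing the header on first use."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    if handle.tell() == 0:  # nothing written yet, so start with the header
        writer.writeheader()
    writer.writerow({
        "date": date,
        "database": database,
        "search_string": search_string,
        "hits": hits,  # None is written as an empty cell
        "notes": notes,
    })


log = io.StringIO()
log_search(
    log,
    date="2019-03-31",
    database="Google Scholar",
    search_string='"peer-reviewer"',
    notes="Initial keyword search; noted suggested search expressions",
)
print(log.getvalue())
```

Sharing a log like this alongside the review lets a third party rerun each search and see where they would have chosen differently.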

Skimming through the search results, I made note of the following further keywords from articles that I would consider to be relevant for our review:

quality, validity, improv?, grad?, assess?, criteria, norm, grant proposals

The character ? at the end of a word indicates that any word starting with the stem would be considered relevant (e.g., improve or improving or improvement). In some search engines, this is replaced by *.
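
To see what a truncated keyword would pick up, you can mimic it on a piece of text with a regular expression. This is only a sketch of the idea, not how any particular database implements truncation; the helper names are hypothetical, and the wildcard character is a parameter so it can be switched to *.

```python
import re


def stem_to_regex(term, wildcard="?"):
    """Turn a keyword such as 'improv?' into a compiled regular expression."""
    if term.endswith(wildcard):
        stem = term[: -len(wildcard)]
        # The stem followed by any number of further word characters.
        pattern = r"\b" + re.escape(stem) + r"\w*"
    else:
        # No wildcard: match the whole word only.
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.compile(pattern, re.IGNORECASE)


def matching_words(term, text, wildcard="?"):
    """List the words in `text` that the keyword would match."""
    return stem_to_regex(term, wildcard).findall(text)


sentence = "Reviews improve when improving criteria; improvement follows."
print(matching_words("improv?", sentence))
# ['improve', 'improving', 'improvement']
```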

I repeated this process in the PubMed database.

After skimming through results from two databases, I came up with the following research question:

What are the *criteria* or *norms* or *standards* or *benchmarks* used by *peer-reviewers* for *assessing* or *reviewing* or *evaluating* the *quality* or *excellence* or *worth* or *value* or *merit* of *grant* *applications* or *funding* *proposals*?

This translates into the following reproducible search string:

(criter? OR norm? OR standard? OR benchmark?) AND (peer-review?) AND (assess? OR review? OR evaluat?) AND (quality OR excellence OR worth OR value OR merit) AND (grant OR funding) AND (proposal OR application)
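
If you keep your keywords as groups of synonyms, a string like this can be generated rather than typed, which makes it easier to amend and rerun. The sketch below assumes one list per concept, joins synonyms with OR and concepts with AND (both in capitals), and wraps multi-word terms in quote marks; the function name `build_search_string` is hypothetical.

```python
def build_search_string(concept_groups):
    """Join synonyms with OR inside each group and groups with AND."""
    def quote(term):
        # Multi-word expressions are searched as exact phrases.
        return f'"{term}"' if " " in term else term

    groups = [
        "(" + " OR ".join(quote(term) for term in group) + ")"
        for group in concept_groups
    ]
    return " AND ".join(groups)


concepts = [
    ["criter?", "norm?", "standard?", "benchmark?"],
    ["peer-review?"],
    ["assess?", "review?", "evaluat?"],
    ["quality", "excellence", "worth", "value", "merit"],
    ["grant", "funding"],
    ["proposal", "application"],
]
print(build_search_string(concepts))
# (criter? OR norm? OR standard? OR benchmark?) AND (peer-review?) AND (assess? OR review? OR evaluat?) AND (quality OR excellence OR worth OR value OR merit) AND (grant OR funding) AND (proposal OR application)
```

Adding or removing a keyword then becomes a one-line change to the list, and the list itself documents which synonyms were considered for each concept.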

Scoping step 2

Once you have a full search string, you can run your search in a database which will give you more freedom to refine and filter the search results. All the other databases listed above have their pros and cons and, again, there is no right or wrong choice. You simply have to decide based on your research objectives. PubMed is orientated towards medical fields, for example, and in our current research we are focusing on the social sciences and the humanities, so I will probably not include it. PsycInfo may be too specific, although we are interested in the cognitive processes involved in peer-reviewing, so we plan to test the string in this database. We will also test run it in Web of Science and Scopus. We may refine the search string depending on the quantity and relevance of the outputs returned by this search.
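
When moving the string from one database to another, the truncation character may need to change. The sketch below swaps a trailing ? for another symbol; the `adapt_wildcard` helper is hypothetical, and the symbols listed for each database are assumptions that should be checked against that database's own help pages before running the search.

```python
import re

# Assumed truncation symbols; verify against each database's documentation.
TRUNCATION_SYMBOL = {
    "Web of Science": "*",
    "Scopus": "*",
}


def adapt_wildcard(search_string, source="?", target="*"):
    """Replace a wildcard that ends a keyword (before a space, ')' or the end)."""
    pattern = re.escape(source) + r"(?=[\s)]|$)"
    return re.sub(pattern, lambda match: target, search_string)


base = "(criter? OR norm?) AND (peer-review?) AND (grant OR funding)"
for database, symbol in TRUNCATION_SYMBOL.items():
    print(database, "->", adapt_wildcard(base, target=symbol))
```

Keeping one base string and deriving the database-specific versions from it means any later refinement only has to be made once.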

To be continued…

This post is a work in progress, so comments are welcome! If you know of other strategies or approaches for identifying appropriate keywords for reproducible systematic literature searches, please do leave a comment below or get in touch. I would love to hear from you!

Further reading

  • James, K. L., Randall, N. P., & Haddaway, N. R. (2016). A methodology for systematic mapping in environmental sciences. Environmental Evidence, 5(7).
  • Click here to see an example of a cumulative literature review and here for a how-to guide to conduct one.
Gaëlle Vallée-Tourangeau
Professor of Behavioural Science