About this series of articles
This article on scientific publications is the fourth in my “sources for technology and business insights” series. In the previous articles, I talked about how you can get technology, business, and innovation insights from venture capital news, market analyses, and patents.
My focus in this article is not on how to deal with individual R&D papers. Rather, I will discuss strategies for identifying likely interesting papers. In other words, I will talk about the following use case here:
“I have hundreds or thousands of papers that are relevant to my field of interest. How should I decide which 5 to 10 papers to read first?”
Then, to wrap up, I will also talk about what you probably won’t find in R&D papers, and what search strategies you can use to get the best possible results.
What exactly I mean by “scientific publications”, and where you can find them
In this article, I use the terms “R&D paper” and “scientific publication” interchangeably. And by these terms, I mean three things:
Peer-reviewed journals are journals to which you submit an article that describes some original research you did. Your article is then reviewed by a group of researchers who work in the same general area as you do (= your peers). Usually, this process is anonymous, meaning that you don’t know who the reviewers are, and the reviewers don’t know who you are. Well-known examples of such journals are Nature and Science. But across subject areas, there are many thousands of journals worldwide.
You can find peer-reviewed journals on their websites (cf. the links to Nature and Science above). Of course, this only works if you know the name of the publication. Other options include services such as PubMed, which focuses on publications in the life sciences.
Conference proceedings are similar to journals, except that the contents of the articles were also presented at a conference or workshop. For conference proceedings, the process usually goes like this:
- You submit your paper.
- Your paper goes through peer review.
- If the peer review recommends accepting your paper, you get to present your paper at a conference or workshop.
- After the conference or workshop, the proceedings with your paper come out.
As with journals, you can find proceedings on their websites, or via services like PubMed.
Preprints are, as the name implies, scientific publications that are published before they go through peer review. Sometimes, but not always, preprints then go through a peer review process and come out in a journal or proceedings. This peer-reviewed version might be a bit different from the preprint because it might address concerns or recommendations by the reviewers.
Preprints have one big advantage over peer-reviewed publications: You save a lot of time. It’s not unusual for the peer review process to take several months. With preprints, you cut this time short.
Some argue that preprints may change the nature of research publishing overall (here and here, for example). Probably the most famous example of how this can play out is COVID research, where the traditional peer review process was just too slow.
You can get preprints from dedicated servers. For example, arXiv has preprints mostly from computational sciences (bioinformatics, computer science, economics, mathematics, physics, etc.). And bioRxiv and medRxiv are preprint servers for biology and health sciences, respectively.
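As a side note, preprint servers like arXiv also offer public APIs, so you can retrieve preprints programmatically. Here is a minimal Python sketch that builds a query URL for arXiv’s public API; the search term is just an example, and fetching and parsing the resulting Atom feed is left out:

```python
from urllib.parse import urlencode

# arXiv's public query API endpoint
ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(search_term: str, max_results: int = 10) -> str:
    """Build a query URL for arXiv's public API.

    The URL can be fetched with any HTTP client; arXiv returns
    an Atom feed of matching preprints.
    """
    params = {
        "search_query": f"all:{search_term}",  # search across all fields
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = arxiv_query_url("solid-state battery", max_results=5)
print(url)
```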
What insights you can expect to get from R&D papers
Really digging into scientific publications requires a certain level of subject matter expertise. If you don’t have training and experience in a field, it’s difficult to go very deep. And going deep is time-consuming: to fully understand a scientific publication and what it might mean for your work, you have to spend the time and work through the details, perhaps even contact the authors and ask them questions about certain aspects of their paper.
My purpose here is not to explain how to study individual papers in depth. Instead, I’ll discuss the process of taking a large set of papers (= hundreds or thousands) and identifying those that are likely to be interesting.
3 simple methods for organizing a collection of scientific publications
So you’ve got a collection of hundreds or thousands of R&D papers on a topic that interests you. For example, all “deep learning” papers from the past year, or all “solid-state battery” papers from the past five years.
Now, out of your hundreds or thousands of papers, which 5 or 10 should you read–or have explained to you–first?
In order to make this selection, you can ask questions like these:
- Is my topic experiencing an increase in activity?
- Who are the most prolific authors? Who are their co-authors?
- What subtopics are there?
Let’s look at each of these questions in more detail.
Is my topic experiencing an increase in activity?
Below is an example of what I mean by “increase in activity”. The topic is telemedicine, where the number of R&D papers rose sharply in the first quarter of 2020 (probably triggered by COVID), as I indicate with the red arrow. Click on the image to see a larger version.
What can you do with this information?
You could use the timeline to zoom in on R&D papers at the onset of the upward trend, for example. This might help you better understand what prompted the rise in publication volume. Or you could look for a review paper a bit further “downstream”. Review papers are a good starting point because they collect and evaluate a lot of the preceding research in an area.
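Under the hood, such a timeline is just a count of publications per time bucket. Here is a minimal Python sketch, assuming you have a publication date for each paper in your collection (the sample dates are invented for illustration):

```python
from collections import Counter
from datetime import date

# Hypothetical publication dates of papers in your collection
pub_dates = [
    date(2019, 11, 3), date(2020, 1, 15), date(2020, 2, 2),
    date(2020, 2, 20), date(2020, 3, 8), date(2020, 3, 30),
]

def quarter(d: date) -> str:
    """Map a date to a 'YYYY-Qn' bucket."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

# Publications per quarter -> the points on the timeline
counts = Counter(quarter(d) for d in pub_dates)
for q in sorted(counts):
    print(q, counts[q])
```

A sharp jump between adjacent buckets, like the telemedicine example above, is then easy to spot, either visually or with a simple threshold on quarter-over-quarter growth.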
Generally, if you look at trends over time across various data sets–scientific publications, venture investments, patents, news, etc.–you can use this information for better estimates of technology maturity, or, more formally, technology readiness level (TRL). Our technical knowledge base has an article that describes a back-of-the-envelope method for estimating TRLs.
Next, let’s see how you can use scientific publications to discover experts and their collaboration networks.
Who are the most prolific authors? Who are their co-authors?
If someone is a prolific author in an area, you should probably look at their papers. Sometimes these authors also start companies based on their research. For instance, in another article in our blog, I described some science-based companies in the area of carbon capture.
Back to finding prolific authors and their co-author network. For example, let’s say you are interested in solid-state battery technology. Below is a screenshot from Mergeflow that shows you the names of R&D publication authors in this field. “Bigger font” means “more publications”.
You can see, for example, that Jürgen Janek has published a lot. You could start by exploring his and his colleagues’ papers (this one, for example).
Apropos colleagues. This brings us to the second part of our question: Who are the co-authors of the most prolific authors? Do they have a wide network? And do any of their co-authors work with other research groups as well?
Mergeflow has interactive network graphs to help you answer such questions. Below is a screenshot that shows some of the co-author networks extracted from solid-state battery R&D papers (click on the image to see a larger version).
Watch the 14-second video below to see how you can use Mergeflow’s network graphs to zoom in on a co-authorship network, and on an author’s publications.
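If you want to approximate such views yourself, the underlying data is simply the author list of each paper. The Python sketch below uses invented placeholder names; it counts publications per author (which drives the tag cloud font size) and co-occurrences of author pairs (which are the edges of the collaboration network):

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper (names are placeholders)
papers = [
    ["Author A", "Author B"],
    ["Author A", "Author B", "Author C"],
    ["Author C", "Author D"],
]

# Publications per author -> tag cloud font size
author_counts = Counter(a for authors in papers for a in authors)

# Co-author pairs -> edges of the collaboration network
edges = Counter()
for authors in papers:
    for pair in combinations(sorted(authors), 2):
        edges[pair] += 1

print(author_counts.most_common(2))
print(edges[("Author A", "Author B")])
```

From the `edges` counter, you can also read off network questions directly: an author with edges into several otherwise unconnected groups is a bridge between research groups.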
With this, let’s move on to our third question, how you can find and use subtopics to decide what publications to examine more closely.
What subtopics are there?
Other technologies as subtopics
Subtopics could be other technologies. For example, below is a tag cloud of emerging technologies, identified by Mergeflow in scientific publications on solid-state battery technology (“emerging” means “shows strong momentum” in Mergeflow, not necessarily “brand-new”):
Bring your own subtopics
Of course, you might also have your own ideas for subtopics. For example, if your general field of interest is deep learning, subtopics could be “computer vision”, “drug discovery”, or “predictive maintenance” as deep learning applications.
In Mergeflow, you could search for…
"deep learning" AND "computer vision"|"drug discovery"|"predictive maintenance"
…and use a tag cloud to see how prominent your topics are (since the tag cloud shows all your query terms, “deep learning” is in it as well):
Next, you could zoom in on each of the subtopics–perhaps then use an “emerging technologies” tag cloud to help you zoom in further still. For the “drug discovery” subtopic, you get…
…and for the “predictive maintenance” subtopic, you get something like this:
These screenshots are from May 2021, so your results will probably be a bit different when you try this (because between now and when you try this, new publications will have come out).
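Outside of Mergeflow, this kind of query (a main topic AND any of several subtopics) is easy to express in code as well. A minimal sketch, assuming each paper is represented by its title plus abstract as one text string; the sample texts are invented:

```python
# Hypothetical paper texts (title + abstract in one string)
papers = [
    "Deep learning for computer vision in autonomous driving",
    "Deep learning accelerates drug discovery pipelines",
    "A survey of reinforcement learning",
]

MAIN = "deep learning"
SUBTOPICS = ["computer vision", "drug discovery", "predictive maintenance"]

def matches(text: str) -> bool:
    """True if the text mentions the main topic AND at least one subtopic."""
    t = text.lower()
    return MAIN in t and any(s in t for s in SUBTOPICS)

hits = [p for p in papers if matches(p)]
print(len(hits))
```

Counting the hits per subtopic would give you the rough equivalent of the tag cloud: a quick sense of how prominent each subtopic is in your collection.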
Subtopics can also be domain-specific, of course. For example, for medical topics, diseases could be relevant subtopics.
As an example, let’s say you are interested in CRISPR-Cas13. Cas13 is a CRISPR system that targets RNA rather than DNA, and that could be used for diagnostics or gene therapy (see this infographic made by April Pawluk, for example). In this context, you might ask, “In the context of what diseases is Cas13 being investigated?”
Mergeflow detects disease names as defined by the ICD-10 convention. You could use these terms for partitioning your set of Cas13 scientific publications. And you could then even ask, “In the context of what diseases is Cas13 being investigated, and by whom?”
A visualization called Sankey graph lets you do this in Mergeflow. Below you can see an example (click on the image to see a larger version). The graph shows, for Cas13 scientific publications, which diseases play a role (left side), and who the authors of the papers are (right side). The screenshot shows what happens when you move your mouse over a disease name: the authors of papers on Cas13 and–in this case–Dengue are highlighted. In other words, the Sankey graph helps you answer the question of who works on Cas13 in the context of what disease.
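The data behind such a Sankey graph is a set of weighted (disease, author) links. Here is a minimal Python sketch with invented paper records and placeholder author names; a plotting library could then render these links as ribbons:

```python
from collections import Counter

# Hypothetical Cas13 papers, each tagged with detected diseases and authors
papers = [
    {"diseases": ["Dengue"], "authors": ["Author A", "Author B"]},
    {"diseases": ["Dengue", "Influenza"], "authors": ["Author B"]},
    {"diseases": ["Influenza"], "authors": ["Author C"]},
]

# (disease, author) link weights -> the ribbons of the Sankey graph
links = Counter()
for p in papers:
    for disease in p["diseases"]:
        for author in p["authors"]:
            links[(disease, author)] += 1

# Hovering over "Dengue" would highlight these authors:
dengue_authors = sorted(a for (d, a) in links if d == "Dengue")
print(dengue_authors)
```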
What you probably won’t find in scientific publications
Scientific publications alone cannot tell you much about the business potential of a method or technology. Just because there is a lot of R&D on a topic doesn’t necessarily mean that commercialization is imminent or even possible.
In most cases, there is a “valley of death” between R&D prototypes and production-ready systems. This is not only true for “hardware topics” such as green hydrogen, or highly regulated areas of innovation such as health care. It often also applies to “software-only” topics–which then turn out not to be software-only. For example, it may be one thing to come up with a new machine learning algorithm. But deploying this algorithm in practical applications may be another thing entirely, if you can’t figure out a way to run it in an energy-efficient manner (energy-efficient machine learning is an R&D topic in its own right, as discussed here, for example).
By saying all this, I am in no way questioning the value of R&D publications. But presenting production-ready innovations simply isn’t their job.
Good search strategies for scientific publications
Yes, scientific publications are technical and use technical language. But this does not necessarily mean that starting off using very specific and technical search terms is the best discovery strategy. In fact, in our experience working with customers from across industries and technology sectors, starting off broadly and then iterating is a much better strategy.
By “starting off broadly”, I don’t mean “use jargon”. Instead, think about your topic at a variety of “flight levels”, which means more-specific and more-general concepts. For example, rather than just searching for “deep reinforcement learning”, you could start more generally, and search for “machine learning” (deep reinforcement learning is a variant of machine learning). After all, it’s possible that another relevant paper uses a method other than deep reinforcement learning, but accomplishes the same goals.
More generally, I recommend a method called associating. The idea behind associating is to take your topic, and either expand it toward more-general concepts (“deep reinforcement learning” -> “machine learning”), or zoom in on subtopics. Zooming in is what we did above. And in most cases, it’s best to start by expanding, and then zooming in, iteratively.
There is another concept I find useful: the first useful approximation. It is meant to prevent overthinking, which can lead you down a rabbit hole. Instead, the idea is to iterate, and to use intermediate results along the way to adapt your discovery strategy.