Using AI to Extract a Knowledge Base of COVID-19 Mechanisms

By Tom Hope, Aida Amini, David Wadden, Madeleine van Zuylen, Eric Horvitz, Roy Schwartz, and Hannaneh Hajishirzi

AI2 Blog

Our knowledge base of mechanisms spans a wide range of activities, functions, and influences extracted from papers related to COVID-19. Scientists can use our tool to explore this important information across domains.

The web of science related to COVID-19 is immense. Scientists in fields ranging from medicine, genetics, microbiology, and zoology all the way to physics, mathematics, computer science, climatology, sociology, and macroeconomics are working to understand different angles of the pandemic and its effects. Can we leverage artificial intelligence to help researchers navigate the eclectic landscape of scientific literature around the disease, which keeps growing by the day?

To help accelerate the pace of discovery, we release our COVID-19 mechanism knowledge base (KB) and online search tool, containing diverse and structured information on causal relations, methods, objectives and activities — coming from any area.

To create our KB, we train AI models to extract information from over 200K scientific papers old and new, with an approach we discuss in our recent paper. We built our tool for scientists to rapidly search and explore the web of COVID-19 science — not only for biomedical phenomena such as mechanisms involved in viral activity or drugs and their effects, but also information on algorithms used for diagnosis, designs for safer air circulation, public policies for pandemic control, models of climatic effects on disease spread, and many more. Current biomedical knowledge bases contain important but limited information on entities such as genes and drugs; in contrast, we’ve designed our KB to have broad reach across all scientific disciplines.

In one important recent example, a group of 239 scientists called attention to the airborne transmissibility of the virus, based on interdisciplinary research spanning virology, aerosol physics, flow dynamics, exposure and epidemiology, medicine, and building engineering. In this scenario, a scientist can use our online search tool to discover, for instance, the use of ceiling-level exhausts for controlling airborne transmission, or optical methods for measuring viral particle size:

Searching our KB for mechanisms, methods and causal links related to airborne transmission.

The same search also reveals computer simulations used to study droplets:

A researcher looking to find out about applications of AI (perhaps seeking AI solutions to their problem, or new opportunities to apply their AI method) can search for algorithms such as convolutional neural networks, with COVID-19 as an objective/target, and retrieve a table of structured results, such as applications of CNN models to COVID-19 detection/testing, along with the original context.

Of course, more “conventional” biomedical mechanisms can be searched, such as the effects of Vitamin D on COVID-19:

Finally, for an example beyond STEM sciences, a researcher can quickly find a list of factors impacting society, such as school closures:

Importantly, our focus is not on finding and displaying papers, but on discovering full lists of structured, pinpointed mechanism relationships. This information is valuable in its own right and, unlike in other search engines, can be targeted directly. It can also help scientists cut through the clutter and mitigate information overload by focusing their attention on the information they need.

In Homo Deus: A Brief History of Tomorrow by Yuval Noah Harari, the author refers to the vast web of interdisciplinary science governing the world:

While some experts are familiar with developments in one field, such as artificial intelligence, nanotechnology, big data or genetics, no one is an expert on everything. No one is therefore capable of connecting all the dots and seeing the full picture. Different fields influence one another in such intricate ways that even the best minds cannot fathom how breakthroughs on artificial intelligence might impact nanotechnology, or vice versa.

By building a knowledge base with diverse mechanisms across fields, we aim to make progress toward connecting those dots, starting with one of the pressing challenges of our time — the COVID-19 pandemic.

What are mechanisms?

We focus on the fundamental concept of mechanisms that captures important knowledge across disciplines, including:

  • ⚙️ Mechanistic activities (e.g., receptor binding).
  • 🧰 Functions (e.g., a protein used for viral binding, AI algorithms used for diagnosis, or public health policies).
  • ⛓️ Influences or associations (e.g., disease effects, drug interactions, or socioeconomic impacts).

Although seemingly intuitive, a definition of what mechanisms exactly are is subject to debate in the philosophy of science, discussed in detail in our paper. However, a simple dictionary definition reveals the generality of the concept:

Mechanism: A natural or established process by which something takes place or is brought about.

In biomedicine, AI-based Information Extraction (IE) tools have been used to extract mentions of entities such as proteins or chemicals and their relations. Some of these relations correspond to our notion of mechanisms (e.g., chemical-protein regulation, or drug-drug interactions), but capture only a fraction of the full breadth and depth of mechanisms in the literature. Our unified view of mechanisms is designed to help generalize and scale the study of these important relations.

COVID-19 Functional Open IE (COFIE)

We train an IE model that automatically extracts mechanism information (functional relations) from scientific papers into a KB. Technically, we define mechanisms as relations between spans of text appearing in the literature (such as in paper abstracts). The spans we use are open and free-form, to strike a balance between expressivity and breadth across domains. We formulate two main types of relations: coarse-grained and fine-grained.

Coarse-grained relations are ordered pairs (tuples) of spans capturing mechanism patterns such as (method, goal), (cause, effect), and (agent, action). The screenshot of our search interface shown earlier includes examples of these pairs, such as (ceiling-level exhausts, controlling airborne transmission).

Fine-grained relations are triples of the form (subject, predicate, object), where the predicate may indicate the type of mechanism as in the following examples:

Fine-grained mechanism relations in our KB.

While more granular, these relations are also less general, because the natural language that scientific papers use to describe mechanisms often does not conform to this more rigid structure. Our paper discusses this trade-off, along with more details on the dataset and models.
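To make the two relation types concrete, here is a minimal sketch of how they could be represented as data structures. The class and field names (`CoarseRelation`, `FineRelation`, `arg0`, `arg1`, `as_pair`) are illustrative assumptions, not the schema used in the released KB, and the "binds" triple is a hand-written example based on the ACE2 statement mentioned later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoarseRelation:
    """An ordered pair of free-form text spans, e.g. (method, goal)."""
    arg0: str  # e.g. the method, cause, or agent
    arg1: str  # e.g. the goal, effect, or action


@dataclass(frozen=True)
class FineRelation:
    """A (subject, predicate, object) triple over free-form text spans."""
    subject: str
    predicate: str
    object: str

    def as_pair(self) -> CoarseRelation:
        # Illustrative convenience only (an assumption, not the authors'
        # method): dropping the predicate leaves an ordered pair of spans.
        return CoarseRelation(self.subject, self.object)


# Coarse-grained example taken from the search screenshot described above.
coarse = CoarseRelation("ceiling-level exhausts", "controlling airborne transmission")

# Hypothetical fine-grained example with an explicit predicate.
fine = FineRelation("SARS-CoV-2", "binds", "ACE2 receptor")
```

The pair form carries no predicate, which is what lets it cover loosely phrased mechanisms; the triple form adds the predicate at the cost of a more rigid structure.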

Neural semantic search for mechanisms

In another example from our tool, a scientist can search for mechanisms referring to cardiovascular effects of COVID-19:

Among the many mechanism results, we discover a complication of COVID-19 associated with arterial disease — “thrombosis of both radial arteries”.

This result is semantically related to the search query of “cardiovascular disease”, even though the result and query do not share any keywords. We find it using a language model fine-tuned for semantic similarity with the excellent sentence-transformers library. The model represents both the query and the KB entries as dense vectors, such that entries with similar meaning have vectors that are close to one another. For fast similarity-based search we use FAISS, a library for efficient similarity search over dense vectors.
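The idea behind embedding-based search can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline behind the tool: the three-dimensional vectors below are hand-made stand-ins for embeddings that a sentence-transformers model would produce, and the brute-force scan stands in for the FAISS index. The function names (`cosine`, `search`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vec, entries, top_k=2):
    """Rank (text, vector) entries by similarity to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings have hundreds of dimensions
# and come from a trained model, not from hand-picked numbers.
kb_entries = [
    ("thrombosis of both radial arteries", [0.9, 0.1, 0.0]),
    ("ceiling-level exhausts", [0.0, 0.2, 0.9]),
    ("convolutional neural networks", [0.1, 0.9, 0.1]),
]

# Stand-in for the embedding of the query "cardiovascular disease".
query_vec = [1.0, 0.0, 0.1]

results = search(query_vec, kb_entries, top_k=1)
best_score, best_text = results[0]
```

Because ranking depends only on vector proximity, the top entry can match the query in meaning without sharing any keywords with it. A linear scan like this does not scale to millions of entries, which is the role a dedicated similarity index plays.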

We can also further filter the retrieved mechanisms by context — for example, taking the above query for cardiovascular effects of COVID-19 and filtering for a context that explicitly mentions “patients”.
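Context filtering of this kind can be approximated with a simple substring check over the sentence each relation was extracted from. The sketch below assumes a hypothetical record layout (`relation`, `context`), and the context sentences are invented placeholders rather than text from real papers; the actual tool may filter differently.

```python
def filter_by_context(results, term):
    """Keep only results whose source context mentions the given term."""
    term = term.lower()
    return [r for r in results if term in r["context"].lower()]


# Hypothetical records; the context sentences are made-up placeholders.
retrieved = [
    {
        "relation": ("COVID-19", "thrombosis of both radial arteries"),
        "context": "Placeholder sentence mentioning patients with this complication.",
    },
    {
        "relation": ("COVID-19", "example effect"),
        "context": "Placeholder sentence about a laboratory assay.",
    },
]

filtered = filter_by_context(retrieved, "patients")
```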

Search experiment — do we find useful mechanisms?

To assess our tool more quantitatively, we recruit annotators with backgrounds in computer science (AI), medicine, biology and materials science. Annotators are given two types of tasks:

  • Biomedical search for SARS-CoV-2 mechanism relations. This task is focused on a set of specific well-known statements or questions regarding the virus (e.g., SARS-CoV-2 binds ACE2 receptor to gain entry into cells).
  • Open-ended cross-domain search for AI applications. This task is focused on discovering diverse ways in which AI research areas or methods are applied in the CORD-19 corpus.

In both tasks, annotators view search results from our KB, with varying degrees of relevance. Overall, our results indicate that relations are both extracted and retrieved accurately:

Precision vs. recall for the search tasks (viral mechanisms, AI methods). Retrieved relations are ranked by embedding-based similarity to a query and compared to gold labels for evaluation.
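As a rough illustration of how precision and recall can be computed for a ranked result list against gold labels, consider the sketch below. The function name, the relation IDs, and the gold set are hypothetical toy values, not data from our experiment, and the exact evaluation protocol is described in the paper rather than here.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k ranked items against a gold set."""
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Toy ranking (most similar first) and toy gold labels (assumptions).
ranked = ["r1", "r2", "r3", "r4", "r5"]
gold = {"r1", "r3", "r6"}

# One (precision, recall) point per cutoff k traces out the curve.
curve = [precision_recall_at_k(ranked, gold, k) for k in range(1, len(ranked) + 1)]
```

Sweeping the cutoff `k` from 1 to the length of the ranking gives one point per cutoff; plotting those points yields a precision vs. recall curve.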

Data, models, and search tool

  • Our COFIE dataset with annotated functional relations over COVID-19 papers, as well as pre-trained models, is available in our GitHub repository.
  • To search our KB with over 2 million mechanisms, check out our search engine.

Our hope is that our framework can support research on COVID-19, and boost knowledge discovery more broadly across the sciences.

Follow @allen_ai on Twitter and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
