A look at how thinking about Web Data and the sources of semantics can help drive decisions on combining latent and explicit knowledge. Examples from Elsevier and lots of pointers to related work.
This document discusses Elsevier's Health Knowledge Graph (H-Graph) which connects Elsevier healthcare products, data, and content to power advanced clinical decision support applications. The H-Graph contains over 400,000 medical concepts with 4.9 million semantic relations extracted from medical literature using natural language processing. It aims to integrate Elsevier's existing products through linked data standards while minimizing impact on current workflows. The document outlines Elsevier's approach to linked data, including the need to control namespaces and prioritize developer experience.
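As a rough illustration of how an NLP-extracted relation might be represented with linked data standards, here is a minimal sketch using rdflib. The namespace, concept identifiers, and relation below are invented placeholders, not the H-Graph's actual vocabulary.

```python
# Minimal sketch: turn one extracted medical relation into RDF triples.
# All identifiers here are hypothetical placeholders, not real H-Graph IRIs.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/h-graph/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Suppose an NLP component extracted the relation ("aspirin", "treats", "headache").
drug, relation, condition = EX["aspirin"], EX["treats"], EX["headache"]

g.add((drug, RDF.type, EX.Drug))
g.add((condition, RDF.type, EX.Condition))
g.add((drug, relation, condition))
g.add((drug, RDFS.label, Literal("aspirin")))

print(g.serialize(format="turtle"))
```

Representing extracted relations this way is what lets separate products reference the same concept identifiers without changing their internal workflows.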
Knowledge graph construction for research & medicine (Paul Groth)
1) Elsevier aims to build knowledge graphs to help address challenges in research and medicine like high drug development costs and medical errors.
2) Knowledge graphs link entities like people, concepts, and events to provide answers by going beyond traditional bibliographic descriptions.
3) Elsevier constructs knowledge graphs using techniques like information extraction from text, integrating data sources, and predictive modeling of large patient datasets to identify statistical correlations.
This document discusses how semantic technologies can help link datasets to publications and institutions to enable new forms of data search and showcasing. It notes that standard schemas and formats are needed to allow linkages between data repositories. Knowledge graphs can help relate entities like papers, authors and institutions to facilitate disambiguation and multi-institutional search capabilities. Semantic technologies are seen as central to efficiently building these linkages at scale across the research data ecosystem.
The literature contains a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data reuse. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data.
The need for a transparent data supply chain (Paul Groth)
1. The document discusses the need for transparency in data supply chains. It notes that data goes through multiple steps as it is collected, modeled, and applied in applications.
2. It illustrates the complexity of data supply chains using examples of how data is reused and integrated from multiple sources to build models and how bias can propagate.
3. The document argues that transparency is important to understand where data comes from, how it has been processed, and help address issues like bias, privacy, or other problems at their source in the data supply chain.
The Roots: Linked data and the foundations of successful Agriculture Data (Paul Groth)
Some thoughts on successful data for the agricultural domain. Keynote at Linked Open Data in Agriculture
MACS-G20 Workshop in Berlin, September 27th and 28th, 2017 https://www.ktbl.de/inhalte/themen/ueber-uns/projekte/macs-g20-loda/lod/
Sources of Change in Modern Knowledge Organization Systems (Paul Groth)
Talk covering how knowledge graphs are making us rethink how change occurs in Knowledge Organization Systems. Based on https://arxiv.org/abs/1611.00217
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge ranging from collection development, to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering are all targeted to helping answer a human’s information need.
However, increasingly demand is for data. Data that is needed not for people’s consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers – professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask the question: Are our knowledge management techniques applicable for serving this new consumer?
Data Communities - reusable data in and outside your organization (Paul Groth)
Description
Data is critical both to facilitate an organization's work and as a product in its own right. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data. I put this in the context of the notion of data communities, which organizations can use to help foster the use of data both within your organization and externally.
Themes and objectives:
To position FAIR as a key enabler to automate and accelerate R&D process workflows
FAIR Implementation within the context of a use case
Grounded in precise outcomes (e.g. faster and bigger science / more reuse of data to enhance value / increased ability to share data for collaboration and partnership)
To make data actionable through FAIR interoperability
Speakers:
Mathew Woodwark, Head of Data Infrastructure and Tools, Data Science & AI, AstraZeneca
Erik Schultes, International Science Coordinator, GO-FAIR
Georges Heiter, Founder & CEO, Databiology
Diversity and Depth: Implementing AI across many long tail domains (Paul Groth)
Presentation at the IJCAI 2018 Industry Day
Elsevier serves researchers, doctors, and nurses. They have come to expect the same AI-based services that they use in everyday life in their work environment, e.g. recommendations, answer-driven search, and summarized information. However, providing these sorts of services over the plethora of low-resource domains that characterize science and medicine is a challenging proposition. (For example, most off-the-shelf NLP components are trained on newspaper corpora and exhibit much worse performance on scientific text.) Furthermore, the level of precision expected in these domains is quite high. In this talk, we overview our efforts to overcome this challenge through the application of four techniques: 1) unsupervised learning; 2) leveraging of highly skilled but low-volume expert annotators; 3) designing annotation tasks for non-experts in expert domains; and 4) transfer learning. We conclude with a series of open issues for the AI community stemming from our experience.
Content + Signals: The value of the entire data estate for machine learning (Paul Groth)
Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you, the difficulty is not in the provision of the content itself but in the production of the annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive, particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
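To make the weak-supervision idea concrete, here is a minimal, self-contained Python sketch: several noisy in-house "signals" (keyword heuristics, a legacy editorial tag) each vote on a label, and the votes are aggregated by majority. The labeling functions, labels, and documents are invented for illustration; production systems use far more sophisticated label models that weight each signal by its estimated accuracy.

```python
# Toy weak supervision: noisy labeling functions vote, majority wins.
from collections import Counter

ABSTAIN = None

def lf_trial_keyword(doc, meta):
    # signal 1: a simple keyword heuristic
    return "CLINICAL" if "randomized trial" in doc.lower() else ABSTAIN

def lf_gene_symbol(doc, meta):
    # signal 2: presence of a known gene symbol
    return "GENOMICS" if "BRCA1" in doc else ABSTAIN

def lf_editorial_tag(doc, meta):
    # signal 3: a coarse subject tag already attached by a publishing workflow
    return meta.get("subject", ABSTAIN)

def weak_label(doc, meta, labeling_functions):
    votes = [lf(doc, meta) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

doc = "A randomized trial of aspirin for migraine prevention."
meta = {"subject": "CLINICAL"}
print(weak_label(doc, meta, [lf_trial_keyword, lf_gene_symbol, lf_editorial_tag]))
```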
More ways of symbol grounding for knowledge graphs? (Paul Groth)
This document discusses various ways to ground the symbols used in knowledge graphs. It describes the traditional "symbol grounding problem" where symbols are defined based only on other symbols. It then outlines several approaches to grounding symbols in non-symbolic ways, such as by linking them to perceptual modalities like images, audio, and simulation. It also discusses grounding symbols via embeddings, relationships to physical entities, and operational semantics. The document argues that richer grounding could help integrate these notions and enhance interoperability, exchange, identity, and reasoning over knowledge graphs.
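As a toy illustration of one of these groundings (embeddings), the sketch below places knowledge-graph symbols in a shared vector space and compares them there. The vectors are made up rather than produced by a trained text, image, or graph encoder.

```python
# Toy grounding of graph symbols in a shared embedding space.
import numpy as np

embeddings = {
    "ex:aspirin":   np.array([0.9, 0.1, 0.3]),
    "ex:ibuprofen": np.array([0.8, 0.2, 0.35]),
    "ex:volcano":   np.array([0.05, 0.9, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Symbols grounded in a common vector space can be compared even when the
# graph itself holds no explicit relation between them.
print(cosine(embeddings["ex:aspirin"], embeddings["ex:ibuprofen"]))  # relatively high
print(cosine(embeddings["ex:aspirin"], embeddings["ex:volcano"]))    # relatively low
```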
Open interoperability standards, tools and services at EMBL-EBI (Pistoia Alliance)
In this webinar Dr Henriette Harmse from EMBL-EBI presents how EMBL-EBI's ontology services are used to scale up the annotation of data and deliver added value through ontologies and semantics to their users.
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine learning-based systems relying on a wide range of different data sources. To be effective, these systems must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
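For readers unfamiliar with link prediction, the following sketch shows the general shape of a translational (TransE-style) scoring function over entity and relation embeddings: a candidate triple (h, r, t) scores well when h + r lands close to t. The random vectors make this purely illustrative; it is not the inductive representation method discussed in the talk.

```python
# Generic TransE-style link prediction sketch with random embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
entities = {e: rng.normal(size=dim) for e in ["drug_A", "disease_X", "gene_Y"]}
relations = {r: rng.normal(size=dim) for r in ["treats", "expressed_in"]}

def score(head, rel, tail):
    # lower distance means a more plausible triple under the TransE assumption
    return -np.linalg.norm(entities[head] + relations[rel] - entities[tail])

candidates = [("drug_A", "treats", t) for t in entities if t != "drug_A"]
for h, r, t in sorted(candidates, key=lambda c: score(*c), reverse=True):
    print(f"{h} --{r}--> {t}: {score(h, r, t):.3f}")
```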
With the explosion of interest in both enhanced knowledge management and open science, the past few years have seen considerable discussion about making scientific data “FAIR” — findable, accessible, interoperable, and reusable. The problem is that most scientific datasets are not FAIR. When left to their own devices, scientists do an absolutely terrible job creating the metadata that describe the experimental datasets that make their way into online repositories. The lack of standardization makes it extremely difficult for other investigators to locate relevant datasets, to re-analyse them, and to integrate those datasets with other data. The Center for Expanded Data Annotation and Retrieval (CEDAR) has the goal of enhancing the authoring of experimental metadata to make online datasets more useful to the scientific community. The CEDAR workbench for metadata management will be presented in this webinar. CEDAR illustrates the importance of semantic technology to driving open science. It also demonstrates a means for simplifying access to scientific datasets and enhancing the reuse of the data to drive new discoveries.
Research Data Sharing: A Basic Framework (Paul Groth)
Some thoughts on thinking about data sharing. Prepared for the 2016 LERU Doctoral Summer School - Data Stewardship for Scientific Discovery and Innovation.
http://www.dtls.nl/fair-data/fair-data-training/leru-summer-school/
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o... (Carole Goble)
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle if not practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from one where results are post-hoc "made reproducible", to pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
Big Data and machine learning are increasingly important in biomedical science and clinical practice. Big Data refers to large and complex datasets that are too large for traditional tools to handle. Machine learning involves algorithms that can recognize patterns in data without being explicitly programmed. Some challenges of working with big data and machine learning include issues with data volume, variety, and veracity. However, techniques like distributed analysis, standards, and validation can help address these challenges.
Federated Learning (FL) is a learning paradigm that enables collaborative learning without centralizing datasets. In this webinar, NVIDIA presents the concept of FL and discusses how it can help overcome some of the barriers seen in the development of AI-based solutions for pharma, genomics and healthcare. Following the presentation, the panel debates other elements that could drive the adoption of digital approaches more widely and help answer currently intractable science and business questions.
On community-standards, data curation and scholarly communication, Stanford M... (Susanna-Assunta Sansone)
This document discusses content standards for better describing scientific data. It notes that while some common features exist across domains, descriptions of experimental context are often inconsistent or duplicated. The author advocates for community-developed content standards to structure, enrich and report dataset descriptions and their experimental context to facilitate discovery, sharing, understanding and reuse of data. Standards should include minimum reporting requirements, controlled vocabularies and conceptual models to allow data to flow between systems. This will help enable better science from better described data.
On community-standards, data curation and scholarly communication - BITS, Ita... (Susanna-Assunta Sansone)
The document discusses the vision of a "connected digital research enterprise" where researchers can more easily find and collaborate with others based on shared data and outputs. It describes a scenario where Researcher X discovers commonalities in data with Researcher Y, views Y's datasets and publications, and initiates a collaboration. Their joint work is captured and indexed, and a company utilizes some of the outputs while providing funding back to the researchers. The vision aims to more closely connect scientific work through shared digital resources.
Reproducibility and Scientific Research: why, what, where, when, who, how (Carole Goble)
This document discusses the importance of reproducibility in scientific research. It makes three key points:
1. For results to be considered valid, scientific publications should provide clear descriptions of methods and protocols so that other researchers can successfully repeat and extend the work.
2. Many factors can undermine reproducibility, such as publication pressures, poor training, disorganization, and outright fraud. Ensuring reproducible research requires transparency across experimental designs, data, software, and computational workflows.
3. Achieving reproducible science is challenging and poorly incentivized due to the resources and time required to prepare materials for independent verification. Overcoming these issues will require collective effort across the research community.
Fairification experience clarifying the semantics of data matrices (Pistoia Alliance)
This webinar presents the Statistics Ontology (STATO), a semantic framework to support the creation of standardized analysis reports that help with the review of results in the form of data matrices. STATO includes a hierarchy of classes and a vocabulary for annotating the statistical methods used in life, natural and biomedical sciences investigations, text mining and statistical analyses.
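A hedged sketch of what annotating one result in a data matrix can look like in RDF follows; the IRIs are placeholders standing in for published STATO term identifiers, which a real annotation would reference directly.

```python
# Placeholder annotation of a statistical result; the STATO namespace below is
# NOT the real ontology IRI, just an illustrative stand-in.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/report/")                # hypothetical report namespace
STATO = Namespace("http://example.org/stato-placeholder/")  # stand-in for real STATO terms

g = Graph()
result = EX["matrix1/cell/p_value_geneA"]
g.add((result, RDF.type, STATO["p-value"]))
g.add((result, EX.computedBy, STATO["t-test"]))
g.add((result, EX.hasValue, Literal(0.032)))

print(g.serialize(format="turtle"))
```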
This document outlines a course on Knowledge Representation (KR) on the Web. The course aims to expose students to challenges of applying traditional KR techniques to the scale and heterogeneity of data on the Web. Students will learn about representing Web data through formal knowledge graphs and ontologies, integrating and reasoning over distributed datasets, and how characteristics such as volume, variety and veracity impact KR approaches. The course involves lectures, literature reviews, and milestone projects where students publish papers on building semantic systems, modeling Web data, ontology matching, and reasoning over large knowledge graphs.
Provenance abstraction for implementing security: Learning Health System and ... (Vasa Curcin)
Discussion of provenance usage in the Learning Health System paradigm, as implemented in the TRANSFoRm project, with focus on security requirements and how they can be addressed using provenance graph abstraction.
Linkages to EHRs and Related Standards. What can we learn from the Parallel U... (Koray Atalag)
This is the prezo I used during the CellML workshop in Waiheke Island, Auckland, New Zealand on 13 April 2015. The aim was to introduce information modelling methods and tools for the purpose of inspiring computational modelling work in the area of semantics and interoperability.
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ... (Amit Sheth)
Talk presented in Spain (WiMS 2013/UAM-Madrid, UMA-Malaga), June 2013.
Replaces earlier version at: http://www.slideshare.net/apsheth/semantic-technology-empowering-real-world-outcomes-in-biomedical-research-and-clinical-practices
Biomedical and translational research as well as clinical practice are increasingly data driven. Activities routinely involve large numbers of devices, data and people, resulting in the challenges associated with volume, velocity (change), variety (heterogeneity) and veracity (provenance, quality). Equally important is the challenge of serving the needs of broader ecosystems of people and organizations, extending beyond traditional stakeholders like drug makers, clinicians and policy makers to increasingly technology-savvy and information-empowered patients. We believe that semantics is becoming the centerpiece of informatics solutions that convert data into meaningful, contextually relevant information and insights that lead to optimal decisions for translational research and 360-degree health, fitness and well-being.
In this talk, I will provide a series of snapshots of efforts in which semantic approach and technology is the key enabler. I will emphasize real-world and in-use projects, technologies and systems, involving significant collaborations between my team and biomedical researchers or practicing clinicians. Examples include:
• Active Semantic Electronic Medical Record
• Semantics and Services enabled Problem Solving Environment for T.cruzi (SPSE)
• Data Mining of Cardiology data
• Semantic Search, Browsing and Literature Based Discovery
• PREscription Drug abuse Online Surveillance and Epidemiology (PREDOSE)
• kHealth: development of knowledge-enhanced sensing and mobile computing applications (using low-cost sensors and smartphones), along with the ability to convert low-level observations into clinically relevant abstractions
Further details are at http://knoesis.org/amit/hcls
Elsevier aims to construct knowledge graphs to help address challenges in research and medicine. Knowledge graphs link entities like people, concepts, and events to provide answers. Elsevier analyzes text and data to build knowledge graphs using techniques like information extraction, machine learning, and predictive modeling. Their knowledge graph integrates data from publications, clinical records, and other sources to power applications that help researchers, medical professionals, and patients. Knowledge graphs are a critical component for delivering value, especially as data volumes and needs accelerate.
Ontologies: What Librarians Need to Know (Barry Smith)
Barry Smith presented on ontologies and what librarians need to know about them. Ontologies provide controlled vocabularies that can be used to tag and annotate data in order to integrate datasets and avoid data silos. The Gene Ontology is highlighted as a successful ontology due to factors such as being developed and maintained by domain experts according to best practices, having over 11 million annotations linking genes to ontology terms, and enabling new types of biological research through analysis and comparison of massive quantities of annotated data. For ontologies to fully realize their potential to remove data silos, they must be prospectively standardized and evolved based on user feedback.
openEHR Approach to Detailed Clinical Models (DCM) Development - Lessons Lear... (Koray Atalag)
Presented at Health Informatics New Zealand (HINZ 2017) Conference, 1-3 Nov 2017, Rotorua, New Zealand. Based on my Masters student Peter Wei's research. Authorship: Ping-Cheng Wei, Koray Atalag and Karen Day from the University of Auckland.
This document provides a summary of the 2012 Translational Bioinformatics conference. It highlights several important papers presented at the conference in areas like systems medicine, finding and defining phenotypes, biomarkers, and genomic infrastructure. The document outlines the goals of the conference, the process used to select papers, caveats about the selection, and thanks various contributors. It then briefly summarizes several key papers from the conference in these areas.
Graphs and Artificial Intelligence have long been a focus for Franz Inc. and currently we are collaborating with Montefiore Health System, Intel, Cloudera, and Cisco to improve a patient’s ability to understand the probabilities of their future health status. By combining artificial intelligence, semantic technologies, big data, graph databases and dynamic visualizations we are deploying a Cognitive Probability Graph concept as a means to help predict future medical events.
The power of Cognitive Probability Graphs stems from the capability to combine the probability space (statistical patient data) with a knowledge base of comprehensive medical codes and a unified terminology system. Cognitive Probability Graphs are remarkable not just because of the possibilities they engender, but also because of their practicality. The confluence of machine learning, semantics, visual querying, graph databases, and big data not only displays links between objects, but also quantifies the probability of their occurrence.
We believe this approach will be transformative for the healthcare field and we see numerous possibilities that exist across business verticals.
During the presentation we will describe the Cognitive Probability Graph concepts, using a distributed graph database on top of Hadoop along with the query language SPARQL to extract feature vectors out of the data, applying R and Spark ML, and then returning the results for further graph processing. #AllegroGraph
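The sketch below illustrates that pattern in Python: query a SPARQL endpoint for per-patient features and hand the resulting vectors to an ML library. The endpoint URL, prefixes, and predicates are hypothetical, not the actual schema used in the project described above.

```python
# Hypothetical SPARQL feature extraction; endpoint and vocabulary are invented.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:10035/repositories/health/sparql")  # placeholder endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX ex: <http://example.org/clinical/>
SELECT ?patient ?age ?diagnosisCode ?readmitted WHERE {
  ?patient ex:age ?age ;
           ex:diagnosis ?diagnosisCode ;
           ex:readmittedWithin30Days ?readmitted .
}
""")

rows = sparql.query().convert()["results"]["bindings"]
features = [
    (float(r["age"]["value"]), r["diagnosisCode"]["value"], r["readmitted"]["value"] == "true")
    for r in rows
]
# `features` can now be handed to Spark ML, scikit-learn, or R for modelling.
print(features[:3])
```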
The document discusses issues biomedical projects face when accessing clinical datasets due to disparate data formats. It presents a proposed solution of annotating clinical datasets with openEHR Archetypes, which are standards-based models of clinical concepts, to enable computer-based discovery of clinical information. The proposed technique involves transforming Archetypes into an "ontology of reality" by identifying clinical concepts and terminology codes to annotate datasets. This would allow complete clinical concepts, rather than just attributes, to be annotated and discovered from datasets.
Enabling Clinical Data Reuse with openEHR Data Warehouse Environments (Luis Marco Ruiz)
Databases for clinical information systems are difficult to design and implement, especially when the design should be compliant with a formal specification or standard. The openEHR specifications offer a very expressive and generic model for clinical data structures, allowing semantic interoperability and compatibility with other standards like HL7 CDA, FHIR, and ASTM CCR. But openEHR is not only for data modeling: it specifies an EHR computational platform designed to create highly modifiable, future-proof EHR systems and to support long-term, economically viable projects, with a knowledge-oriented approach that is independent from specific technologies. Software developers face great complexity in designing openEHR-compliant databases since the specifications do not include any guidelines in that area. The authors of this tutorial are developers who had to overcome these challenges. This tutorial will expose different requirements, design principles, technologies, techniques and main challenges of implementing an openEHR-based clinical database, with examples and lessons learned to help designers and developers overcome the challenges more easily.
Enabling Clinical Data Reuse with openEHR Data Warehouse Environments (Luis Marco Ruiz)
Modern medicine needs methods to enable access to data, captured during health care, for research, surveillance, decision support and other reuse purposes. Initiatives like the National Patient Centered Clinical Research Network in the US and the Electronic Health Records for Clinical Research in the EU are facilitating the reuse of Electronic Health Record (EHR) data for clinical research. One of the barriers for data reuse is the integration and interoperability of different Healthcare Information Systems (HIS). The reason is the differences among the HIS information and terminology models. The use of EHR standards like openEHR can alleviate these barriers, providing a standard, unambiguous, semantically enriched representation of clinical data to enable semantic interoperability and data integration. Few works have been published describing how to drive proprietary data stored in EHRs into standard openEHR repositories. This tutorial provides an overview of the key concepts, tools and techniques necessary to implement an openEHR-based Data Warehouse (DW) environment to reuse clinical data. We aim to provide insights into data extraction from proprietary sources, transformation into openEHR compliant instances to populate a standard repository, and enabling access to it using standard query languages and services.
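As a heavily simplified sketch, the snippet below shows the extract-transform-load shape described in the abstract: reshape a proprietary row into an openEHR-style composition and query it back with AQL. The payload structure is illustrative only and does not follow the full openEHR reference model; the AQL paths may vary by archetype version.

```python
# Simplified ETL into an openEHR-style structure; field names are illustrative.
source_row = {"patient_id": "12345", "sys_bp": 128, "dia_bp": 82,
              "taken_at": "2016-03-01T10:15:00"}

def to_openehr_composition(row):
    return {
        "archetype_id": "openEHR-EHR-OBSERVATION.blood_pressure.v2",  # real archetype, reduced payload
        "subject": {"external_ref": row["patient_id"]},
        "data": {
            "systolic": {"magnitude": row["sys_bp"], "units": "mm[Hg]"},
            "diastolic": {"magnitude": row["dia_bp"], "units": "mm[Hg]"},
            "time": row["taken_at"],
        },
    }

print(to_openehr_composition(source_row))

# Once loaded into a standard repository, an AQL query along these lines can
# retrieve the data back (exact paths depend on the archetype definition):
aql = """
SELECT o/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude AS systolic
FROM EHR e CONTAINS OBSERVATION o [openEHR-EHR-OBSERVATION.blood_pressure.v2]
"""
```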
A Semantic Web based Framework for Linking Healthcare Information with Comput... (Koray Atalag)
Presented at Health Informatics New Zealand (HINZ 2017) Conference, 1-3 Nov 2017, Rotorua, New Zealand. Authorship: Koray Atalag, Reza Kalbasi, David Nickerson
The University of Auckland
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do... (William Gunn)
This document discusses topic modeling on 350 million documents from Mendeley. It describes how topic modeling can be used to categorize documents into topics and subcategories, though categorization is imperfect and topics change over time. It also discusses how topic modeling and metrics can help with fact discovery and reproducibility of research to build more robust datasets.
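A toy-scale sketch of that topic-modeling step follows, using scikit-learn's LDA on a handful of invented abstracts; modeling 350 million documents obviously requires distributed tooling, but the mechanics are the same.

```python
# Toy LDA topic model over a few invented abstracts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene expression analysis of cancer cells",
    "deep learning for image classification",
    "protein folding and molecular dynamics simulation",
    "convolutional neural networks for object detection",
]

vectorizer = CountVectorizer(stop_words="english").fit(docs)
X = vectorizer.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {topic_idx}: {top_terms}")
```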
Machine learning, health data & the limits of knowledge (Paul Agapow)
Lecture for Imperial College London's MSc in Health Data Analytics, critiquing a recent paper on COVID diagnosis and moving out to talk about good practices (& limits) in ML and model building
This document discusses using ontologies to simplify semantic solutions for biomedical applications. It provides examples of how ontologies can be used to integrate medical expertise and knowledge from different sources. It also describes challenges in representing biomedical information with ontologies and introduces MedMaP, a medical management portal that aims to simplify access to ontology-based reasoning and analytics using graphical visualizations and self-service tools. MedMaP allows users to customize their experience and gain insights from subject matter experts.
The Past, Present and Future of Knowledge in Biology (robertstevens65)
This document discusses the past, present, and future of knowledge representation in biology. It covers how ontologies have grown significantly in use over time for organizing biological facts and data. However, ontologies only represent part of biological knowledge, and there is potential to do more by connecting different types of knowledge, generating natural language descriptions, and representing knowledge about experiments and workflows in addition to entities and relationships. The document argues that biological knowledge representation has advanced beyond ontologies alone and could benefit from additional types of knowledge representation and reasoning.
This document discusses lessons learned from analyzing data from the MIMIC database. It makes the following key points:
1) While causality cannot be proven with observational data, large datasets like MIMIC can still provide useful insights, especially when multiple studies find consistent results.
2) Single-center databases are limited; collaborating and sharing data across centers expands what can be learned.
3) Reliable research requires transparent and continuous peer review as well as open sharing of data, methods, and findings.
4) Bringing together different experts in data-driven "datathons" can help ensure robust and impactful analyses.
Co-Constructing Explanations for AI Systems using Provenance (Paul Groth)
Explanation is not a one-off - it's a process where people and systems work together to gain understanding. This idea of co-constructing explanations, or explanation by exploration, is a powerful way to frame the problem of explanation. In this talk, I discuss our first experiments with this approach for explaining complex AI systems by using provenance. Importantly, I discuss the difficulty of evaluation and some of our first approaches to evaluating these systems at scale. Finally, I touch on the importance of explanation to the comprehensive evaluation of AI systems.
Evaluation Challenges in Using Generative AI for Science & Technical Content (Paul Groth)
Evaluation Challenges in Using Generative AI for Science & Technical Content.
Foundation models show impressive results in a wide range of tasks on scientific and legal content, from information extraction to question answering and even literature synthesis. However, standard evaluation approaches (e.g. comparing to ground truth) often don't seem to work: qualitatively the results look great, but quantitative scores do not align with these observations. In this talk, I discuss the challenges we've faced in our lab in evaluation. I then outline potential routes forward.
Data Curation and Debugging for Data Centric AI (Paul Groth)
It is increasingly recognized that data is a central challenge for AI systems - whether training an entirely new model, discovering data for a model, or applying an existing model to new data. Given this centrality of data, there is a need to provide new tools that are able to help data teams create, curate and debug datasets in the context of complex machine learning pipelines. In this talk, I outline the underlying challenges for data debugging and curation in these environments. I then discuss our recent research that both takes advantage of ML to improve datasets and uses core database techniques for debugging in such complex ML pipelines.
Presented at DBML 2022 at ICDE - https://www.wis.ewi.tudelft.nl/dbml2022
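As a minimal sketch of the kind of dataset check such tooling automates, the snippet below validates a small training table with pandas and reports rows that would silently degrade training. The column names and rules are invented for illustration.

```python
# Toy dataset debugging check before a table enters an ML pipeline.
import pandas as pd

df = pd.DataFrame({
    "doc_id": [1, 2, 2, 3],
    "label": ["relevant", "irrelevant", "irrelevant", None],
    "text_length": [532, 48, 48, -7],
})

issues = {
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_labels": int(df["label"].isna().sum()),
    "negative_lengths": int((df["text_length"] < 0).sum()),
}

if any(issues.values()):
    print("Blocking this dataset version:", issues)
else:
    print("Dataset passed basic checks.")
```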
The document discusses knowledge graphs and their future directions. It summarizes a panel discussion on knowledge graphs at ESWC 2020 and references several papers on industry-scale knowledge graphs, weak supervision for knowledge graph construction, and representing entities and identities in knowledge bases. It concludes that knowledge graph construction involves complex pipelines with many components and calls for an updated theory of knowledge engineering to address the demands of modern knowledge graphs at large scale and with continuous changes.
Thoughts on Knowledge Graphs & Deeper Provenance (Paul Groth)
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
The Challenge of Deeper Knowledge Graphs for Science (Paul Groth)
Over the past 5 years, we have seen multiple successes in the development of knowledge graphs for supporting science in domains ranging from drug discovery to social science. However, in order to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through the application of techniques such as unsupervised learning; the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e. experiments) into knowledge graphs.
Progressive Provenance Capture Through Re-computation (Paul Groth)
Provenance capture relies upon instrumentation of processes (e.g. probes or extensive logging). The more instrumentation we can add to processes the richer our provenance traces can be, for example, through the addition of comprehensive descriptions of steps performed, mapping to higher levels of abstraction through ontologies, or distinguishing between automated or user actions. However, this instrumentation has costs in terms of capture time/overhead and it can be difficult to ascertain what should be instrumented upfront. In this talk, I'll discuss our research on using record-replay technology within virtual machines to incrementally add additional provenance instrumentation by replaying computations after the fact.
From Text to Data to the World: The Future of Knowledge Graphs (Paul Groth)
Keynote Integrative Bioinformatics 2018
https://docs.google.com/document/d/1E7D4_CS0vlldEcEuknXjEnSBZSZCJvbI5w1FdFh-gG4/edit
Can we improve research productivity through providing answers stemming from knowledge graphs? In this presentation, I discuss different ways of building and combining knowledge graphs.
1) The document discusses the concept of "transclusion" and how scholarly communication may evolve to include more modular, distributed elements.
2) It notes trends toward atomizing scholarly works into smaller components like claims, annotations, and nanopublications.
3) Persistent identifiers are also becoming more widespread to identify funding, licensing, versions, datasets, and individual knowledge components.
4) The future may see scholarly texts constructed by transcluding and assembling various research objects, data, and background distributed across the scholarly infrastructure. Machines may help generate text for inclusion.
Structured Data & the Future of Educational Material (Paul Groth)
Structured data and linked open data standards can help improve educational materials and enable personalized learning. By tagging educational content with metadata about learning objectives, concepts, and relationships, recommender systems can suggest personalized learning paths and assessments. Elsevier is applying these approaches to nursing education by building structured learning objects, mapping them to a taxonomy of student learning objectives, and using recommender algorithms to sequence content and questions for individual students. Linking educational and research content through metadata also allows discovering new ways for academic research to support education.
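A small sketch of that idea follows: learning objects tagged with concept metadata and a naive rule that recommends the next object covering a concept the student has not yet mastered. The objects, concept tags, and rule are invented for illustration, not Elsevier's actual system.

```python
# Toy learning-object metadata and a naive sequencing rule.
learning_objects = [
    {"id": "lo1", "title": "Measuring blood pressure", "concepts": {"vital signs"}, "difficulty": 1},
    {"id": "lo2", "title": "Interpreting hypertension", "concepts": {"vital signs", "cardiology"}, "difficulty": 2},
    {"id": "lo3", "title": "Medication management", "concepts": {"pharmacology"}, "difficulty": 2},
]

def next_learning_object(mastered_concepts, objects):
    # recommend the easiest object that introduces at least one unmastered concept
    candidates = [o for o in objects if o["concepts"] - mastered_concepts]
    return min(candidates, key=lambda o: o["difficulty"]) if candidates else None

student_mastered = {"vital signs"}
print(next_learning_object(student_mastered, learning_objects)["title"])
```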
Data for Science: How Elsevier is using data science to empower researchers (Paul Groth)
Each month 12 million people use Elsevier’s ScienceDirect platform. The Mendeley social network has 4.6 million registered users. 3500 institutions make use of ClinicalKey to bring the latest in medical research to doctors and nurses. How can we help these users be more effective? In this talk, I give an overview of how Elsevier is employing data science to improve its services, from recommendation systems to natural language processing and analytics. While data science is changing how Elsevier serves researchers, it’s also changing research practice itself. In that context, I discuss the impact that large amounts of open research data are having and the challenges researchers face in making use of it, in particular in terms of data integration and reuse. We are just beginning to see how technology and data are changing science, and correspondingly how this changes the best ways to empower those who practice it.
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
1. 1
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Paul Groth - @pgroth
Elsevier Labs
BigNet : WWW 2018
Thanks to Ron Daniel, Brad Allen & the Labs Team
Empowering Knowledge™
2. 2
Outline
Goal: to tell you our current thinking and to get your feedback
• Why we’re interested
• What we’ve tried
• What we’re missing
• Webby Data
• 2 Sources of Semantics
• State of the art
• What’s missing
Warning: the back half is essentially a (probably incomplete) literature review, so think of it as pointers.
5. 5
EMMeT (Elsevier Merged Medical Taxonomy)
EMMeT is a multilingual, concept-based clinical ontology
• Multilingual: English, French, Spanish
• Concept-based: All terms, synonyms, translations, mappings are
related to a unique identifier (“IMUI”)
• Ontology: Provides semantic relationships between concepts
(symptoms of a disease, treatment procedures of a disease,
complications of a disease or a procedure, etc…)
EMMeT is a controlled reference terminology
• Based on Unified Medical Language System (UMLS), standard clinical
terminologies as well as Elsevier proprietary vocabularies and lists of
acronyms
• Explicitly mapped to international medical standards (SNOMED-CT,
ICD-9-CM, ICD-10-CM, LOINC, RXNorm, CVX, etc.) and Elsevier’s
vocabularies (Gold Standard, EMTREE, etc.)
EMMeT is current
• Continuously updated, and released every 12 weeks for automatic
indexing
• Updated daily and available via an API for manual tagging access
• Maintained by a team of medical terminology experts.
6. 6
Automated Tagging
Manual Tagging/
Data Structuring
Products and platforms using EMMeT
Clinical Solutions
ClinicalKey Global
ClinicalKey ANZ
ClinicalKey France
ClinicalKey Espanol
ClinicalKey Nursing
ClinicalKey German
ClinicalKey Nursing ANZ
ClinicalKey Brazil
Amirsys Decision Point
RP/STMJ
Health Advance
The Lancet
Cell
LexisNexis
MedMal Nav
LN Insight
Legend
In production
In Pilot
In Pipeline
Nursing Education
Mosby’s Dictionary
Clinical Solutions
PoC - Clinical Overviews
ClinicalKey HL7 API
Health Analytics
IDS FHIR API/Apps
Dorland’s Dictionary
Patient Engagement
Gold Standard CP
ERC
Content 2.0
Nursing Education
Sherpath
EMEALAAP
MedEnact
RP/STMJ (SCT)
Health Advance
The Lancet
Cell
8. 8
Rankings of EMMeT’s ontological relationships
• Relationships are ranked according to a 5-tiered ranking model, for simplicity and accessibility:
• 10: best option;
• 9: second option, when a rank of 10 is not applicable;
• 8: given two concepts that are too general to be directly related to a specific disease;
• 7: used as an outlier;
• 6: default / not validated.
Relationship (ranking criteria): 10 | 9 | 8 | 7
has cause: most common | common | sometimes | rare
has clinical finding: most common | common | sometimes | rare
has_complication, severity (disease): severe/death | high | moderate | low morbidity
has_complication, prevalence (disease): strong occurrence/high prevalence | likely occurrence/commonly prevalent | sometimes occurs | rare occurrence
has_complication, severity (procedure): critical/death | major | moderate | minor
has comorbidity: strongly associated | commonly associated | sometimes associated | rarely associated
has screening procedure: best choice | is done | sometimes done | rarely done
has risk factor: strongly associated | commonly associated | sometimes associated | rarely associated
has diagnostic procedure: best choice | commonly done | sometimes done | rarely done
has differential diagnosis: strong occurrence/high prevalence | likely occurrence/commonly prevalent | sometimes occurs/low prevalence | rare occurrence
has drug: best choice | 2nd line | 3rd line | rarely given
has contraindication drug: strongly avoid/black box | commonly avoid | sometimes avoid | rarely avoid
has treatment procedure: best choice | commonly done | sometimes done | rarely done
has prevention: best option | common option | sometimes advised | rarely advised
has physician specialty: specific specialty | general/specialty | broad | rare
has device: standard device | acceptable device | sometimes used | rarely used
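To make the ranking model concrete, here is a minimal, hedged sketch of how a ranked relationship could be carried around in code. The field names and the specific rank values are illustrative assumptions, not EMMeT's actual schema.

# Illustrative only: field names and rank values are assumptions, not EMMeT's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class RankedRelation:
    subject: str      # source concept (identified by its IMUI in EMMeT)
    predicate: str    # e.g. "has_diagnostic_procedure"
    obj: str          # target concept
    rank: int         # 10, 9, 8, 7, or 6 per the tiered model above

triples = [
    RankedRelation("breast cancer", "has_diagnostic_procedure", "breast biopsy", rank=10),
    RankedRelation("breast cancer", "has_treatment_procedure", "radical mastectomy", rank=9),
]

# e.g. surface only the strongest evidence in a clinical decision support view
best_options = [t for t in triples if t.rank >= 9]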
9. 9
From EMMeT to H-Graph
• Based on EMMeT
• Support more complex relations, including patient context (Clinical Overview content + more)
• Flexible and extensible model to support links to content, model treatment strategies, numeric values, temporal
data, etc. Age, sex, weight, … are very simple forms of context.
In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate
or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48
hours or is uncertain. NICE Guideline Atrial Fibrillation: Management
• Continue to support existing indexing pipelines (e.g. ClinicalKey), and tagging use cases (e.g. Clinical Overviews)
From EMMeT… …To H-Graph
11. 11
Universal schemas
• … are a specific technique from the Information Extraction and the Automatic Knowledge Base
Completion literature
• … are an unsupervised method to ‘learn’ by combining text extracts with existing knowledge base
assertions
• Applications:
• Extend a medical knowledge base
• scan incoming literature to suggest new additions to EMMeT and show the
underlying evidence to the taxonomy editor.
• scan literature backlog to find evidence for data already in EMMeT
• Literature Surveillance
• scan incoming literature to find existing facts even if expressed in very different ways
• find new concepts in the literature related to an existing EMMeT concept*. Let taxonomy
editor decide whether to add new concept and relation to EMMeT
12. 12
Open Information Extraction
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems are looking for very specific things, like drug-drug interactions
• Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
• For a broad knowledge base, use Open Information Extraction, which only uses some knowledge of grammar
• One weird trick for open information extraction …
• ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
In addition, brain scans were performed to exclude
other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction
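As a rough illustration of the ReVerb-style pattern above, here is a toy sketch, not the actual ReVerb implementation, that uses NLTK's off-the-shelf tagger (the toy_reverb function and the letter encoding are my own assumptions). It only covers steps 1 and 2 with a crude noun chunking, so on the example sentence it recovers "other causes" rather than the full noun phrase; step 3 (discarding relation phrases seen with too few distinct argument pairs) only makes sense over a whole corpus.

import re
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

def toy_reverb(sentence):
    tokens = nltk.word_tokenize(sentence)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    # One letter per token: V = verb, P = preposition/particle, N = noun/adjective, O = other
    code = "".join(
        "V" if t.startswith("VB") else
        "P" if t in ("IN", "TO", "RP") else
        "N" if t.startswith("NN") or t.startswith("JJ") else "O"
        for t in tags)
    # Step 1: relation phrase starts with a verb and ends with a verb or preposition
    rel = re.search(r"V[VPO]*[VP]|V", code)
    if not rel:
        return None
    # Step 2: nearest noun run before and after the relation phrase
    left = re.search(r"N+(?=[^N]*$)", code[:rel.start()])
    right = re.search(r"N+", code[rel.end():])
    if not (left and right):
        return None
    return (" ".join(tokens[left.start():left.end()]),
            " ".join(tokens[rel.start():rel.end()]),
            " ".join(tokens[rel.end() + right.start():rel.end() + right.end()]))

print(toy_reverb("brain scans were performed to exclude other causes of dementia"))
# expected: roughly ('brain scans', 'were performed to exclude', 'other causes')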
13. 13
ReVerb output
After ReVerb pulls out noun phrases, match them up to EMMeT concepts
Discard rare concepts, relations, or relations that are not used with many different concepts
SD documents scanned: 14,000,000
Extracted ReVerb triples: 473,350,566
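A hedged sketch of this matching-and-filtering step; the label index, identifiers, threshold, and function name below are invented placeholders, not the Labs pipeline.

from collections import defaultdict

LABEL_TO_CONCEPT = {                 # toy stand-in for an EMMeT label index
    "brain scans": "IMUI:0001",      # identifiers are invented
    "other causes": "IMUI:0002",
    "dementia": "IMUI:0003",
}

def link_and_filter(triples, min_pairs=2):
    # triples: iterable of (arg1, relation_phrase, arg2) strings from ReVerb
    linked = []
    pairs_per_relation = defaultdict(set)
    for arg1, rel, arg2 in triples:
        c1 = LABEL_TO_CONCEPT.get(arg1.lower())
        c2 = LABEL_TO_CONCEPT.get(arg2.lower())
        if c1 and c2:                # keep only arguments that map to EMMeT concepts
            linked.append((c1, rel, c2))
            pairs_per_relation[rel].add((c1, c2))
    # drop relation phrases seen with too few distinct concept pairs
    return [t for t in linked if len(pairs_per_relation[t[1]]) >= min_pairs]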
14. 14
Universal schemas - Initialization
• Method to combine ‘facts’ found by
machine reading with stronger
assertions from ontology.
• Build ExR matrix with entity-pairs
as rows and relations as columns.
• Relation columns can come from
EMMeT, or from ReVerb
extractions.
• Cells contain 1.0 if that pair of
entities is connected by that
relation.
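Concretely, the initialization step might look like the following sketch. All concepts, relations, and observed facts are toy placeholders, and a real universal-schema system works over millions of pairs with sparse matrices.

import numpy as np

entity_pairs = [
    ("glaucoma", "biofeedback"),
    ("glaucoma", "trabeculectomy"),
    ("hypertension", "biofeedback"),
]
relations = [
    "emmet:has_treatment_procedure",   # explicit EMMeT relation
    "text:is treated with",            # ReVerb surface pattern
    "text:may be managed by",          # ReVerb surface pattern
]
observed = {                           # toy observations
    ("glaucoma", "trabeculectomy", "emmet:has_treatment_procedure"),
    ("glaucoma", "trabeculectomy", "text:is treated with"),
    ("glaucoma", "biofeedback", "text:may be managed by"),
    ("hypertension", "biofeedback", "text:is treated with"),
    ("hypertension", "biofeedback", "text:may be managed by"),
}

X = np.zeros((len(entity_pairs), len(relations)))
for i, (e1, e2) in enumerate(entity_pairs):
    for j, rel in enumerate(relations):
        if (e1, e2, rel) in observed:
            X[i, j] = 1.0
print(X)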
15. 15
Universal schemas - Prediction
• Factorize matrix to ExK and KxR,
then recombine.
• “Learns” the correlations between
text relations and EMMeT relations,
in the context of pairs of objects.
• Find new triples to go into EMMeT
e.g., (glaucoma,
has_alternativeProcedure,
biofeedback)
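Continuing the sketch, a plain truncated SVD stands in for the factorization; published universal-schema models use logistic or Bayesian ranking losses, so this is only illustrative. Unobserved cells that receive comparatively high reconstructed scores, such as an EMMeT relation column for a pair so far seen only with text patterns, become candidate triples for a taxonomy editor to review.

import numpy as np

# The matrix from the previous sketch: rows = entity pairs, columns = relations
# [(glaucoma, biofeedback), (glaucoma, trabeculectomy), (hypertension, biofeedback)]
X = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])

def low_rank_scores(X, k=2):
    # Factorize into E x K and K x R factors, then recombine (as on the slide)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Cells that were 0.0 but get comparatively high reconstructed scores are
# candidate new triples to show to an editor with their supporting evidence.
print(low_rank_scores(X, k=2))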
17. 17
Paulheim, Heiko. "Knowledge graph refinement: A survey of approaches and evaluation
methods." Semantic web 8.3 (2017): 489-508.
WHERE TO GO?
18. 18
MORE THAN LINK PREDICTION
• Data has deep hierarchy – link prediction flattens this
• Data has hooks into specific content
• Schemas are increasingly richly defined – not just a
single type
• N-ary relations
19. 19
OUR KG’S SHARE PROPERTIES WITH WEB KGS
Ringler, Daniel, and Heiko Paulheim. "One knowledge graph to rule them all? Analyzing
the differences between DBpedia, YAGO, Wikidata & co." Joint German/Austrian
Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2017.
20. 20
The Web of Data
http://webdatacommons.org/structureddata/2017-12/stats/stats.html
http://lodlaundromat.org
24. 24
Pay attention to the underlying data
Paul Groth, Michael Lauruhn, Antony Scerri: "Open Information Extraction on Scientific Text: An Evaluation", 2018. arXiv:1802.05574, http://arxiv.org/abs/1802.05574
25. 25
Embed more
Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types,
descriptions, and context. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (pp. 2681-2690).
26. 26
Embed more
Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer:
Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the
Knowledge Capture Conference. ACM, 2017.
27. 27
Social Semantics?
de Rooij, S., Beek, W., Bloem, P., van Harmelen, F., & Schlobach, S. (2016, October).
Are Names Meaningful? Quantifying Social Meaning on the Semantic Web.
In International Semantic Web Conference (pp. 184-199). Springer, Cham.
• Distributional semantics for
identifiers (NTN)
• But uses the global network
• Could we use the discussion
space as well?
NTN - Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013).
Reasoning with neural tensor networks for knowledge base
completion. In Advances in neural information processing
systems (pp. 926-934).
28. 28
schema:dateModified a rdf:Property ;
    rdfs:label "dateModified" ;
    schema:domainIncludes schema:CreativeWork, schema:DataFeedItem ;
    schema:rangeIncludes schema:Date, schema:DateTime ;
    rdfs:comment "The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed." .

schema:datePublished a rdf:Property ;
    rdfs:label "datePublished" ;
    schema:domainIncludes schema:CreativeWork ;
    schema:rangeIncludes schema:Date ;
    rdfs:comment "Date of first broadcast/publication." .

schema:disambiguatingDescription a rdf:Property ;
    rdfs:label "disambiguatingDescription" ;
    schema:domainIncludes schema:Thing ;
    schema:rangeIncludes schema:Text ;
    rdfs:comment "A sub property of description. A short description of the item used to disambiguate from other, similar items. Information from other properties (in particular, name) may be necessary for the description to be useful for disambiguation." ;
    rdfs:subPropertyOf schema:description .
https://www.w3.org/TR/rdf11-mt/
Rules
29. 29
Injecting Background Knowledge as Constraints
Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation
extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (pp. 1119-1129)
30. 30
Learning Rules
Yang, Fan, Zhilin Yang, and William W. Cohen. "Differentiable learning of logical
rules for knowledge base reasoning." Advances in Neural Information Processing
Systems. 2017.
31. 31
Combining Both – supporting complex reasoning with subsymbolic representations
Rocktäschel, T., & Riedel, S. (2017). End-to-end
differentiable proving. In Advances in Neural Information
Processing Systems (pp. 3791-3803).
32. 32
Future
Welbl, J., Stenetorp, P., & Riedel, S. (2017). Constructing Datasets for
Multi-hop Reading Comprehension Across Documents. arXiv preprint
arXiv:1710.06481.
• Scale
• The knowledge base == text?
• Multi-hop reasoning
• Is everything end-to-end differentiable?
33. 33
Conclusion
• In practice: data is webby data
• Messy
• Interconnected
• Constraints and rules associated
• Semantic Web: semantics can come from multiple different sources
• Explicit & implicit
• Take advantage of those sources
• Knowledge graphs benefit from inference
• Your thoughts?
• Thanks & We’re hiring!
[email protected] | pgroth.com
labs.elsevier.com
35. 35
INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE ,
vol.28, no.5, pp.44,48, Sept.-Oct. 2013 doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g. the mapping-based infobox extractor
• The infobox extractor uses a hand-built ontology based on the 350 most commonly used English-language infoboxes
• Integrates with Yago
• Yago relies on Wikipedia + Wordnet
• Upper ontology from Wordnet and then a mapping to Wikipedia categories based on frequencies
• Wordnet is built by psycholinguists
36. 36
Units & Measurement Annotations
• Time
• Dosage
• Probability
• Percent
• Count
• Not handled yet
Find numbers followed by a unit name or abbreviation (perhaps with scale factor like k, m, G, …). Provide value
normalized to SI units. Also provide type of measurement (time, temperature, length, mass, dosage, etc.) based on
unit. Handling tolerances, ranges, probabilities, and counts adds complexity. Conjunctions not yet handled but very
important.
Current work – identify the property being measured (e.g. dosages of AA, indomethacin, HtE, leptin, etc.)
Additionally at 120 min following glucose administration, the 100 mg/kg 5g and 5e groups had
significantly (P ⩽ 0.005) a greater drop in blood glucose than the 10 and 50 mg/kg groups.
In the mouse xenograft model of LLC cells in C57BL/6J mice, once daily administration of AA (50 and
100 mg/kg) inhibited tumor growth in a dose-dependent manner (Fig. 6A and C).
Groups of Swiss mice (n = 6) were treated (p.o.) with vehicle, indomethacin (10 mg/kg-Roche®) or HtE
(50, 100 or 200 mg/kg) 1 h before administration of carrageenan at 2.5% (Sigma-Aldrich®) injected
subcutaneously into the plantar region of the left hind paw and phosphate buffer saline (PBS) in
right hind paw.
In the experiments designed to study the antidepressant-like effect of the repeated treatment (for
14 days) of EET, the immobility time in the TST and the locomotor activity in the open-field were
assessed in independent groups of mice 24 h after the last daily administration of EET (10–100
mg/kg, p.o.).
Hoppers containing chow were removed from the cages 1 h before the administration of leptin
[depending on studies, 5 mg/kg or 2.5 mg/kg, ip; mouse recombinant leptin obtained from Dr. A.F.
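A toy, hedged sketch of the measurement annotation idea described above: find "<number> <unit>" spans, classify the measurement type by its unit, and normalize the value to a base unit. The unit table, regex, and SI factors are minimal assumptions; the actual Labs annotator handles far more units plus scale prefixes, tolerances, ranges, and the object/property being measured.

import re

# unit -> (measurement type, factor to base unit, base unit); deliberately tiny
UNITS = {
    "min": ("time", 60.0, "s"),
    "h":   ("time", 3600.0, "s"),
    "mg/kg": ("dosage", 1e-6, "kg/kg"),
    "mg":  ("mass", 1e-6, "kg"),
    "g":   ("mass", 1e-3, "kg"),
}

PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg/kg|mg|g|min|h)\b")

def annotate(text):
    for m in PATTERN.finditer(text):
        value, unit = float(m.group(1)), m.group(2)
        mtype, factor, base = UNITS[unit]
        yield {"span": m.group(0), "type": mtype,
               "value_si": round(value * factor, 12), "unit_si": base}

sentence = ("Additionally at 120 min following glucose administration, "
            "the 100 mg/kg group had a greater drop in blood glucose.")
for ann in annotate(sentence):
    print(ann)
# e.g. {'span': '120 min', 'type': 'time', 'value_si': 7200.0, 'unit_si': 's'}
#      {'span': '100 mg/kg', 'type': 'dosage', 'value_si': 0.0001, 'unit_si': 'kg/kg'}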
#8: On the left side we see one concept, breast cancer, and a number of pieces of information about it such as synonyms, parent and child concepts, etc. On the right we see some ontological relations from breast cancer to other concepts, such as
(breast cancer, has diagnostic procedure, breast biopsy).
One of the major differences between EMMeT and what is in UMLS is that we not only provide the basic 3-part relationship, such as (breast cancer, has_treatment, radical mastectomy), we also provide information about the ‘strength’ of that relation according to current medical evidence.
#10: Excerpt from the National Institute for Health and Care Excellence (NICE) guideline: "In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48 hours or is uncertain."
#13: Using EMMeT, and some code and data we already had, he built a quick prototype and tested it. Performance (in terms of accuracy of predictions) was surprisingly high.
Unsupervised is very important because it means the construction of the rough underlying knowledge base is scalable and not limited by the availability of experts.
Raw predictions not good enough for fully automatic operation, but are plenty good enough to help taxonomy editors and other people do their job much faster.
#20: Complex axioms
Messy
Integrates lots of information
#37: One type of NLP annotation Labs is implementing is to mark up measurements – find the quantity, the unit, any tolerances, etc. We also normalize them to SI standards so measurements can be compared and searched. This is not novel research. However, we have not found prior work that attempts to detect the specific object and property being measured. We are using several domain-specific scenarios (mouse cancer, concrete additives, NLP algorithm accuracy, neuronal properties) to find ways that information is expressed. For mouse cancer, it is relatively easy to detect that a measurement is a dosage of a particular drug. But those patterns are of little use in the other scenarios.
This work has application to the h-graph – dosages, ages, weights, etc. are all important properties for the patient context. Cohort size and probability are important for the quality of evidence measures.