When I first read “Building a Knowledge Graph from Scratch: Techniques, Methods and Cases”, it felt a bit theoretical, but after working through it alongside Professor Chen Huajun’s knowledge graph courseware, I found its exposition fairly complete, especially in the first two chapters. This article is my note on the knowledge graph technical system.

The knowledge graph is considered key to making artificial intelligence systems “knowledgeable”: by storing various kinds of structured knowledge, such as RDF triples and graph-form data, it serves as the brain of an AI system.

Representing and modeling existing knowledge is the foundation and preparation for building a knowledge graph, and a prerequisite for constructing a complete, valuable one. By representing and storing knowledge in a well-defined way, computer systems can process and use it more efficiently.

The five roles of knowledge representation:

An abstract substitute for real-world knowledge (knowledge representation can be seen as an abstraction that stands in for real-world knowledge in a form computers can understand, though the substitution is inevitably lossy)

A set of ontological commitments (ontologies abstract real-world concepts and entities into classes and objects, which to some extent serves the same purpose as knowledge representation.

The advantage of abstracting the real world into classes and objects is that users can focus only on the aspects they care about and represent just those, mitigating the problem that knowledge representation can never be a lossless substitute for the real world.)

An incomplete theory of intelligent reasoning (a representation theory of knowledge alone is not enough; it must be combined with other theories, such as reasoning methods, to form a complete theory of reasoning)

A medium for efficient computation

An intermediary of knowledge

Combining the five roles above, we can understand knowledge representation as an incomplete, abstract description of the real world that contains only the aspects humans or computers care about, and that can also serve as middleware for computation and reasoning.

In computer systems there are many knowledge representation methods and formal languages, and different representation methods produce different effects. This creates the need for a widely accepted way of describing the knowledge to be represented, one that is concise yet extensible enough to accommodate the diversity of real-world knowledge.

Such a description has two parts: description logic and description languages.

1: Description logic

Description logic refers to a family of representations built on formalized logical knowledge that allow knowledge to be represented and reasoned about in a structured, easy-to-understand way. Description logic is built on concepts and relations: concepts correspond to the classes and entities in a knowledge graph, and relations can be understood as the relationships between entities.

2: Description language

In the process of knowledge representation, besides a logic for describing knowledge, a suitable language is needed to describe knowledge and convey information according to the prescribed logic. According to the W3C standards, knowledge is typically described using the Resource Description Framework (RDF) and the Web Ontology Language (OWL), both of which can use Extensible Markup Language (XML) as their core syntax.

In RDF, knowledge is encoded as triples, where each triple consists of a subject, a predicate (or attribute), and an object; this makes RDF easy to translate into natural language. The subject and object can be a blank node or an Internationalized Resource Identifier (IRI) that uniquely identifies a resource (an object may also be a literal value), while the predicate must be an IRI.

From a holistic point of view, XML, as a core syntax for knowledge representation, plays a pivotal role across different knowledge representation languages. On top of XML, RDF expresses knowledge as triples and uses IRIs as unique identifiers, making knowledge more readable for both computers and people. On top of RDF, OWL uses the notions of classes and entities to further abstract knowledge into an ontology, so that real-world knowledge can be expressed more completely and hierarchically.
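To make the triple model concrete, here is a minimal sketch using the Python rdflib library; the namespace, entity names, and properties (e.g. ex:worksFor) are made-up examples for illustration, not anything taken from the book.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical namespace and resource names, used only for illustration.
EX = Namespace("http://example.org/kg/")

g = Graph()
g.bind("ex", EX)

# One triple (subject, predicate, object): "ZhangSan works for ExampleCorp".
g.add((EX.ZhangSan, EX.worksFor, EX.ExampleCorp))
# The object may also be a literal value, e.g. a name string.
g.add((EX.ZhangSan, EX.name, Literal("Zhang San")))
# rdf:type links an entity to its class, which OWL builds on.
g.add((EX.ZhangSan, RDF.type, EX.Person))

# Turtle is one of several serialization formats for the same triples.
print(g.serialize(format="turtle"))
```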

Knowledge modeling refers to the process of building computer-interpretable knowledge models. These models can cover a general domain, or serve as the interpretation or specification of a particular product. The point of knowledge modeling is to build a knowledge model that a computer can store and interpret.

These knowledge models are typically stored and expressed using knowledge representation methods. Knowledge modeling is divided into two steps:

1: Knowledge acquisition

This is the process of acquiring and organizing the knowledge a knowledge-base system needs from multiple data sources and from human experts. In the knowledge acquisition stage, the purpose of building the knowledge model must first be clarified, and the domain and scope of the knowledge it covers determined according to that purpose.

2: Knowledge structuring

The tasks of this phase fall into two parts: knowledge extraction and structured knowledge representation. The knowledge extraction part is mainly responsible for extracting knowledge from unstructured or semi-structured sources (usually natural language or something close to it) into a form that facilitates subsequent knowledge representation.

After the knowledge has been extracted into structured data, it still needs to be converted into a computer-readable form; a common practice is to build an ontology and save the knowledge as an RDF or OWL file, as sketched below.
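As a rough illustration of that practice, the sketch below builds a tiny ontology with rdflib and the OWL vocabulary and saves it to a file; the class and property names, and the file name knowledge.owl, are assumptions for illustration only.

```python
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/ontology/")

g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# Ontology part: classes, a class hierarchy, and a relation (object property).
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Company, RDF.type, OWL.Class))
g.add((EX.Employee, RDF.type, OWL.Class))
g.add((EX.Employee, RDFS.subClassOf, EX.Person))
g.add((EX.worksFor, RDF.type, OWL.ObjectProperty))

# Instance data produced by extraction, attached to the ontology.
g.add((EX.ZhangSan, RDF.type, EX.Employee))
g.add((EX.ExampleCorp, RDF.type, EX.Company))
g.add((EX.ZhangSan, EX.worksFor, EX.ExampleCorp))

# Save the knowledge to a file (RDF/XML here; Turtle would also work).
g.serialize(destination="knowledge.owl", format="xml")
```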

The concept of ontology originated in philosophy, where it is concerned with concepts directly related to “being” and with relations related to “existence”. In computing and artificial intelligence, a brief explanation is that an ontology is a conceptualized specification of the real world, that is, an abstract model of knowledge that abstracts the characteristics of different entities and generalizes them into classes and relationships.

Among ontology construction approaches, a classic one is the METHONTOLOGY method, whose steps are:

Determine the purpose of building the ontology, including its intended users, usage scenarios, and scope

Acquire the relevant knowledge

Conceptualize the ontology in order to organize and structure the knowledge acquired from external sources

Integrate existing ontologies as much as possible, so that the ontology being built can be merged and shared with others

Implement the ontology using a formal language

Evaluate the constructed ontology

Knowledge extraction refers to the techniques for extracting knowledge from sources of different kinds and structures, using entity extraction, relationship extraction, event extraction, and so on.

The data sources of a knowledge graph can be divided into three major categories by structure: structured data, semi-structured data, and unstructured data; different types of data call for different knowledge extraction methods.

For semi-structured data, wrappers are a class of techniques that extract data from HTML web pages and restore it to structured form. There are three main ways to implement them: the manual method, wrapper induction, and automatic extraction.

Wrapper induction is a supervised learning method that learns extraction rules from annotated data sets and applies them to pages with the same markup or the same page template. The automatic extraction method first clusters a batch of web pages into several structurally similar groups and then trains a wrapper for each group; the remaining pages to be extracted are passed through the corresponding wrapper to output structured data.
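To make the wrapper idea concrete, here is a hedged sketch of a hand-written (manual-method) wrapper that turns a hypothetical HTML table into structured records with BeautifulSoup; the page structure and field names are assumptions.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical semi-structured page: a table of people and employers.
html = """
<table id="people">
  <tr><td class="name">Zhang San</td><td class="employer">ExampleCorp</td></tr>
  <tr><td class="name">Li Si</td><td class="employer">DemoTech</td></tr>
</table>
"""

def wrapper(page: str):
    """A hand-written wrapper: turn one page template into structured records."""
    soup = BeautifulSoup(page, "html.parser")
    records = []
    for row in soup.select("table#people tr"):
        name = row.select_one("td.name")
        employer = row.select_one("td.employer")
        if name and employer:
            records.append({"name": name.get_text(strip=True),
                            "employer": employer.get_text(strip=True)})
    return records

print(wrapper(html))  # [{'name': 'Zhang San', 'employer': 'ExampleCorp'}, ...]
```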

As for the last category, unstructured data such as text, this is where most knowledge extraction effort goes, and the techniques that implement this task are collectively referred to as information extraction.

The difference between information extraction and knowledge extraction is that information extraction focuses on unstructured data, while knowledge extraction targets all categories of data.

Information extraction consists of three main subtasks, namely entity extraction, relationship extraction, and event extraction.

1: Entity extraction

This problem can be cast as a sequence labeling problem. Features worth considering include features of the word itself, such as its part of speech; affix features, such as the administrative-unit suffixes “province”, “city”, and “county” that appear in place names; and surface characteristics, such as whether the word is a number (a feature sketch follows the list of methods below).

Rule- and dictionary-based extraction methods

Statistical learning methods, including hidden Markov models and conditional random field models

Hybrid extraction methods:

Machine learning models combined with deep learning, such as the LSTM-CRF model
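As a rough sketch of the feature view mentioned above, the function below produces per-token features of the kind a CRF-style sequence labeler might consume; the feature names and the suffix list are illustrative assumptions, not the book's feature set.

```python
def token_features(tokens, i):
    """Features for the i-th token in a sentence, for sequence labeling (e.g. a CRF)."""
    word = tokens[i]
    return {
        "word": word.lower(),            # the word itself
        "is_digit": word.isdigit(),      # numeric indicator
        "is_title": word.istitle(),      # capitalization cue (useful in English)
        "suffix": word[-3:],             # surface suffix
        # Illustrative gazetteer-style cue: administrative-unit suffixes in place names.
        "admin_suffix": word.endswith(("province", "city", "county")),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = ["Zhang", "San", "works", "in", "Hangzhou", "city"]
print(token_features(sentence, 5))
```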

2: Relationship extraction

(1) Rule-based extraction method

For trigger-word-based relationship extraction, you need to define a set of extraction templates summarized from the text to be processed; relationships are then extracted by matching the trigger words.

For extraction based on dependency parsing, the sentence is first preprocessed with a dependency parser, including word segmentation, part-of-speech tagging, entity extraction, and dependency analysis; the parse results are then matched against the rules in a rule base, with each matched rule yielding a triple; finally, the triples are expanded according to extension rules and further processed to obtain the corresponding semantic relationships.
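Here is a minimal sketch of the trigger-word idea using hand-written regular-expression templates; the patterns, trigger words, and relation names are made up for illustration.

```python
import re

# Hypothetical templates: each pattern's trigger word signals a relation type.
PATTERNS = [
    (re.compile(r"(?P<head>\w+) was born in (?P<tail>\w+)"), "bornIn"),
    (re.compile(r"(?P<head>\w+) works for (?P<tail>\w+)"), "worksFor"),
]

def extract_relations(sentence: str):
    """Return (head, relation, tail) triples matched by trigger-word templates."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(sentence):
            triples.append((m.group("head"), relation, m.group("tail")))
    return triples

print(extract_relations("ZhangSan works for ExampleCorp"))
# [('ZhangSan', 'worksFor', 'ExampleCorp')]
```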

(2) Supervised learning methods

Supervised learning methods train a relation extractor on labeled data, where the labels must contain both the relationships and the related entity pairs. The need for a large number of predefined features in the relation classifier is one of the hardest problems with this approach.

(3) Semi-supervised learning methods

These include heuristic methods based on seed data and distant supervision.

Heuristic algorithms based on seed data first prepare a batch of high-quality triples in advance. Using this seed data, the corpus is matched to find the candidate texts that mention the entity pairs and relationships; the candidate texts are semantically analyzed to find strong features that support the relationship; those strong features are then used to find more instances in the corpus, which are added to the seed data; new features are mined from the newly discovered instances, and these steps repeat until a preset threshold is met.

To generate large amounts of training data in a short time, you can use distant supervision, which uses an existing knowledge base to annotate unlabeled data: if two entities are related in the knowledge base, distant supervision assumes that any text containing both entities describes that relationship. In reality, many candidate entity pairs in the text do not express the relationship, and the dataset can be narrowed by manually constructed prior knowledge.

Distant supervision does not require iteratively acquiring data and features, and is an effective way to expand a dataset.
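A toy sketch of the distant-supervision labeling step: a sentence that mentions both entities of a known knowledge-base triple is treated as a (noisy) positive example of that relation. The knowledge base and corpus below are invented for illustration.

```python
# Known triples from an existing knowledge base (illustrative).
KB = {("ZhangSan", "worksFor", "ExampleCorp"), ("LiSi", "bornIn", "Hangzhou")}

corpus = [
    "ZhangSan joined ExampleCorp as an engineer in 2015.",
    "LiSi grew up in Hangzhou before moving away.",
    "ZhangSan visited Hangzhou last year.",   # no KB pair matches -> left unlabeled
]

def distant_supervision(kb, sentences):
    """Label a sentence with relation r if it mentions both entities of some (h, r, t)."""
    labeled = []
    for sent in sentences:
        for head, rel, tail in kb:
            if head in sent and tail in sent:
                labeled.append((sent, head, rel, tail))
    return labeled

for example in distant_supervision(KB, corpus):
    print(example)
```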

3: Event extraction

Event extraction refers to extracting the event information of interest to users from natural language and storing it in structured form; it is widely used in automatic question answering, automatic summarization, and information retrieval.

The event extraction task includes event detection and the identification of event trigger words and event types; it is carried out in multiple stages, so the problem can be transformed into a multi-stage classification problem.

Event extraction tasks can be divided into two categories: meta-event extraction and topic event extraction.

A meta-event indicates the occurrence of an action or a change of state, often triggered by a verb, a noun, or another part of speech that expresses an action.

Meta-event extraction stays at the sentence level and is triggered by a single action or state change, while topic events tend to consist of multiple actions or states scattered across multiple sentences or documents. The key to topic event extraction is therefore identifying and grouping together the documents that describe the same topic.

Knowledge mining refers to the process of mining new entities or entity relationships from text or knowledge bases and associating them with existing knowledge. Knowledge mining is divided into two parts: entity linking and disambiguation, and knowledge rule mining.

1: Entity linking and disambiguation

Entity linking refers to mapping entity mentions in natural language text to the corresponding entities in the knowledge base. Because the same mention may refer to multiple entities and multiple mentions may point to the same entity, the mentions also need to be disambiguated.

The basic process of entity linking and disambiguation has three steps: entity mention recognition, candidate entity generation, and candidate entity ranking.

Entity mention recognition is essentially the same task as entity extraction in knowledge extraction.

Candidate entity generation means producing, from the mentions recognized in the text, the set of candidate entities that they might link to. There are currently three commonly used methods: generation based on an entity-mention dictionary, generation based on search engines, and generation based on surface-form expansion of the mention.

After the candidate set is generated, it often still contains multiple candidates, which must be ranked to pick out the entity that the mention actually refers to. Depending on whether labeled data is needed, candidate entity ranking methods fall into two types: ranking based on supervised learning and ranking based on unsupervised learning.
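A minimal sketch of dictionary-based candidate generation plus a naive unsupervised ranking that mixes a popularity prior with context-keyword overlap; the alias dictionary, priors, and keyword sets are all assumptions.

```python
# Hypothetical alias dictionary: surface mention -> candidate KB entities with prior scores.
ALIAS_DICT = {
    "apple": [("Apple_Inc.", 0.7), ("Apple_(fruit)", 0.3)],
    "jobs": [("Steve_Jobs", 0.9), ("Job_(role)", 0.1)],
}

# Hypothetical context keywords associated with each candidate entity.
KEYWORDS = {"Apple_Inc.": {"iphone", "company"}, "Apple_(fruit)": {"eat", "tree"},
            "Steve_Jobs": {"apple", "founder"}, "Job_(role)": {"hire", "work"}}

def generate_candidates(mention: str):
    """Candidate entity generation from an entity-mention dictionary."""
    return ALIAS_DICT.get(mention.lower(), [])

def rank_candidates(mention: str, context: str):
    """Naive unsupervised ranking: prior popularity plus context keyword overlap."""
    context_words = set(context.lower().split())
    scored = []
    for entity, prior in generate_candidates(mention):
        overlap = len(KEYWORDS.get(entity, set()) & context_words)
        scored.append((entity, prior + overlap))
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_candidates("Apple", "the company released a new iphone"))
```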

2: Knowledge rule mining

Knowledge rule mining works on the structure of knowledge: it uses rules to dig out new knowledge for the existing knowledge system, such as new entities and their relationships. It is divided into mining based on association rules and mining based on statistical relational learning.

(1) Mining based on association rules

An association rule is an implication of the form X → Y, where X and Y are two disjoint itemsets. Its strength can be measured by support and confidence, as in the sketch below. Association-rule-based mining looks for potential connections between categories in the knowledge base, and the discovered connections can be represented as association rules or frequent itemsets.
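Concretely, over N transactions, support(X → Y) = count(X ∪ Y) / N and confidence(X → Y) = support(X ∪ Y) / support(X). A toy computation (the transactions are made up):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    """Conditional strength of the rule X -> Y."""
    return support(X | Y) / support(X)

X, Y = {"milk"}, {"bread"}
print("support:", support(X | Y))      # 2/4 = 0.5
print("confidence:", confidence(X, Y)) # 0.5 / 0.75 ≈ 0.667
```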

(2) Mining based on statistical relational learning

Mining based on statistical relational learning uses the known triples in the knowledge base to predict, through statistical relational learning, how likely unknown triples are to hold, and the results can be used to complete and improve the existing knowledge graph.

Knowledge storage is the process of choosing an appropriate storage method based on the business scenario and the data scale, and storing the structured knowledge in the corresponding database, which enables effective management of and computation over the data.

Graph-structure-based knowledge storage means using a graph database to store the data of the knowledge graph. A graph database is a database that uses graph structures for semantic queries. A semantic query is a query or analysis that takes correlation and context into account; it can use the syntactic, semantic, and structural information contained in the data to retrieve both explicit and implicitly derivable information. Graph databases originate from the graph theory founded by Euler and are also called graph-oriented or graph-based databases. Their basic idea is to store and query data in a data structure called a “graph”.

Graph databases fall into three categories according to their data model: resource description framework (RDF) stores, property graphs, and hypergraphs.

The Resource Description Framework (RDF) is a triple data model in which each piece of knowledge can be broken down into (Subject, Predicate, Object) triples or displayed graphically. Note that RDF is a data model, not a serialization format; the concrete storage representation can be RDF/XML, Turtle, or N-Triples.

Knowledge fusion uses high-level knowledge organization to bring knowledge from different sources under a common framework and specification, applying steps such as heterogeneous data integration, disambiguation, processing, reasoning-based verification, and updating, so as to fuse data, information, methods, experience, and even human thinking into a high-quality knowledge base.

Knowledge fusion technology emerged for two reasons: on the one hand, the results obtained from knowledge extraction and mining may contain a large amount of redundant and erroneous information that needs to be cleaned and integrated; on the other hand, because knowledge comes from many sources, there are problems such as data duplication, uneven quality, and unclear associations.

Knowledge fusion is divided into conceptual layer knowledge fusion and data layer knowledge fusion: the former mainly studies techniques such as ontology matching and cross-language fusion, while the latter mainly studies entity alignment.

1: Conceptual layer knowledge fusion

When there are multiple knowledge sources, each may use a different classification system and attribute system. Conceptual layer knowledge fusion unifies these different classification and attribute systems into a single global system.

Ontology matching refers to establishing relationships between entities from different ontologies, such as similarity values or fuzzy relations between them; it is one of the main tasks of conceptual layer knowledge fusion.

2: Data layer knowledge fusion

Entity alignment, also known as entity matching or entity resolution, is the process of determining whether two entities in the same or different datasets point to the same object in the real world.

Entity alignment currently faces many problems and challenges, and it is one of the main research tasks of data layer knowledge fusion.

By analogy with the human brain, the key to cognition lies in using and processing knowledge. Once the knowledge graph has been constructed, that is, the system has modeled knowledge and stored it in a form that both humans and machines can understand, then when external information arrives, the system first needs to find the knowledge related to that perceived information and process it. The main techniques involved here are knowledge retrieval and knowledge reasoning.

As one of the simplest applications of knowledge graphs, knowledge retrieval aims to return relevant information by querying the knowledge graph according to given conditions or keywords. Compared with traditional queries and retrieval, knowledge retrieval returns not just a flat list of data but information in a structured form, which is consistent with human cognitive processes. Commonly used knowledge retrieval methods include retrieval based on a query language and retrieval based on semantics (i.e., semantic search).

1: Knowledge retrieval based on query language

Query languages typically include at least two subsets: a data definition language and a data manipulation language. The data definition language is used to create, modify, and delete objects in the database, while the data manipulation language is used to query and update the data itself.
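For RDF-based knowledge graphs, the W3C standard query language is SPARQL; here is a small hedged sketch that runs a SPARQL query with rdflib over a toy graph (the namespace and data are invented for illustration):

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/kg/")
g = Graph()
g.add((EX.ZhangSan, EX.worksFor, EX.ExampleCorp))
g.add((EX.LiSi, EX.worksFor, EX.ExampleCorp))

# SPARQL query: who works for ExampleCorp?
query = """
PREFIX ex: <http://example.org/kg/>
SELECT ?person WHERE {
    ?person ex:worksFor ex:ExampleCorp .
}
"""
for row in g.query(query):
    print(row.person)
```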

Another example is the graph database Neo4j, whose query language is Cypher.

2: Semantic search

In fact, no matter how close a graph database's query language is to natural language, it still essentially requires the user to write a structured query to retrieve knowledge from the database. At the same time, traditional search methods can usually only combine relatively simple syntax to meet users' retrieval needs.

Meanwhile, as structured data platforms such as Linked Open Data continue to open up and improve, the number of data sources that can be used to build knowledge graphs keeps growing, and a large number of knowledge graphs that use RDF and OWL as their representation languages continue to be built.

Against this background, companies led by Google began using knowledge graphs to improve search quality and to implement semantics-based knowledge retrieval (i.e., semantic search). Semantic search is, in effect, a further development of query-language-based knowledge retrieval.

The essence of semantic search is to use mathematical methods to move beyond the approximation and imprecision of traditional search, and to reach a clear understanding of the meaning of words and of how they relate to the words the user entered.

In simple terms, semantic search lets the user's input be as close to natural language as possible, while returning more precise answers based on an understanding of that language. Semantic search uses the representational power and expressiveness of knowledge graphs to explore the intrinsic correlation between user needs and data. Compared with traditional query methods, it can also understand and complete more complex queries and give more accurate results.

Semantic search can be divided into lightweight semantics-based information retrieval systems and relatively complex semantic search systems.

A lightweight semantics-based information retrieval system has no complex knowledge representation system like a knowledge graph, so simple models such as dictionaries or classifiers are often used to associate semantic data with the data to be retrieved.

Relatively complex semantic search systems often model semantics and knowledge explicitly using methods such as knowledge graphs or ontologies. When query keywords are searched on top of such a model, the system can find what is truly relevant to the search requirements and return more precise results based on inferred or associated semantics.

Commonly used semantic search methods mainly include keyword queries and natural language queries.

As the most basic semantic search method, keyword query works as follows: for the given keywords, first use an index to find the subgraphs of the knowledge graph that match the keyword definitions, which greatly reduces the overall search space; then search within the relatively small subgraph and finally obtain the results. The main problem in keyword query is how to build the index.

Natural language queries are more complex than keyword queries. When the user enters natural language, the system must understand the sentence: first remove the meaningless components and disambiguate it, then analyze and quantify its syntactic, lexical, and other features. Once a vector representation of the input has been obtained, it is compared with the vectorized representation of the information in the knowledge graph to obtain the exact query result.
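A toy sketch of that final matching step: compare the vectorized query against vectorized knowledge-graph items by cosine similarity. The vectors here are random stand-ins; in practice they would come from a trained encoder over the query and the graph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings; real systems would encode KG facts and the query consistently.
kg_items = {"ZhangSan worksFor ExampleCorp": rng.normal(size=64),
            "LiSi bornIn Hangzhou": rng.normal(size=64)}
query_vec = rng.normal(size=64)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(kg_items.items(), key=lambda kv: cosine(query_vec, kv[1]))
print("best match:", best[0])
```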

Once a complete knowledge graph has been constructed, the computer system is one step closer to being sufficiently knowledgeable. However, merely building a knowledge graph and querying it as needed does not really let the computer system achieve the goal of cognition.

Over the long course of human evolution and development, possessing knowledge and using it to reason has been a key part of cognition. Conceptually, reasoning is the process of deriving new knowledge from what is already known.

The process of reasoning usually involves two kinds of knowledge: existing knowledge and new knowledge that has not yet been possessed.

Applications of knowledge reasoning mainly include knowledge completion, knowledge alignment, and knowledge graph denoising.

Traditional knowledge reasoning mainly relies on rules to reason over the content of the knowledge graph. Rule-based reasoning methods mainly draw on logical reasoning, applying simple rules, constraints, or statistical methods to the data represented in the knowledge graph and then reasoning over these features. Common rule-based reasoning includes categorical reasoning and attribute reasoning.

Rules can also be learned from the knowledge graph itself, with reasoning performed after the new rules are derived. For example, NELL combines first-order logic with probabilistic learning, manually filters the learned rules, and then instantiates them using the entities present in the knowledge graph.

In general, rule-based reasoning over the knowledge graph can construct effective rules through methods such as statistics, manual filtering, or machine learning, and reason with them at high accuracy, with relatively little computation and fast inference.

However, because today's knowledge graphs are relatively large, a great number of entities are needed to verify the rules; and as the scale of the graph keeps growing, abstracting global rules and multi-step rules becomes very difficult. Relying on statistical features in this situation leads to overfitting and poor robustness to data noise. Therefore, with the development of deep learning, reasoning based on representation learning and reasoning based on deep learning have become more advantageous.

Representation learning for knowledge graphs maps elements such as nodes and relations into the same continuous vector space, so that a vector representation of each element can be learned. Knowledge graph representation learning originally grew out of word representation learning methods in natural language processing, and the vectors or matrices obtained also exhibit translational properties in the vector space. The most classic algorithm in this field is TransE.
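TransE models each relation as a translation in the embedding space: for a true triple (h, r, t) it wants h + r ≈ t, scores triples by the distance ||h + r − t||, and trains with a margin-based ranking loss against corrupted triples. A minimal numpy sketch of the score and loss (the embeddings are random stand-ins, not trained vectors):

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 50

# Random stand-in embeddings for a head entity, a relation, and a tail entity.
h, r, t = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
t_corrupt = rng.normal(size=dim)   # tail of a corrupted (negative) triple

def score(h, r, t, norm=1):
    """TransE energy: smaller means the triple is more plausible."""
    return np.linalg.norm(h + r - t, ord=norm)

def margin_loss(pos, neg, margin=1.0):
    """Margin-based ranking loss over one positive/negative pair."""
    return max(0.0, margin + pos - neg)

pos = score(h, r, t)
neg = score(h, r, t_corrupt)
print("positive score:", pos, "negative score:", neg, "loss:", margin_loss(pos, neg))
```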