Hello everyone, I’m Hydra.
Some time ago, my boss asked me to look into knowledge graph technology. Although I have a bit of an NLP background, I still hit plenty of bumps while studying it. After collecting a lot of material, I finally put together a slide deck of more than 50 pages, so today I will share what I learned about knowledge graphs with you.
The concept of the knowledge graph was born in 2012, when Google first proposed it. As we all know, Google is a search engine company, so after introducing the Google Knowledge Graph, the first thing they did with the technology was improve their core search engine.
Note the wording above: although the Knowledge Graph was born in 2012, the idea had an earlier name, namely semantics. So what is semantics? Two sentences from Foundations of Statistical Natural Language Processing answer the question:
Semantics can be divided into two parts: the study of the meaning of individual words (lexical semantics) and the study of how the meanings of individual words combine to form the meaning of a sentence (or a larger unit).
Semantics is the study of the meaning of words, constructions, and utterances.
So, what exactly is the Knowledge Graph?
You can think of it as a knowledge base of the relationships between entities in the real world. It was proposed in order to describe accurately the relationships between people, events, and things.
Academia currently has no uniform definition of the knowledge graph, but the document published by Google gives a clear description: “A knowledge graph is a technical method that uses graph models to describe knowledge and to model the relationships between things in the world.”
Dr. Singhal of Google summed up in three words how search changed once the Knowledge Graph was added:
“Things, not strings.”
These few words capture the core of the knowledge graph. Previously, a query was treated as a string: the engine matched that string against content, ranked the best matches first, and displayed the results in order. With the knowledge graph, the query is no longer regarded as a string but as a thing in the real world, that is, as an individual entity.
For example, when we search for Bill Gates, the search engine does not look for the string “Bill Gates”. It looks up Bill Gates the person and shows the people and things related to him.
In the image above, the encyclopedia panel on the left lists the main facts about Bill Gates, while the right side shows the Microsoft products associated with him and people similar to him, mainly other founders in the IT industry. In this way, a single results page lists Bill Gates’ basic information and his main relationships, which makes it easy for searchers to find the results they are interested in.
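To make the difference between matching strings and looking up things a bit more concrete, here is a minimal Python sketch. The documents, the entity ID `Q_bill_gates`, and the function names are all made up for illustration. A real search engine is of course far more involved.

```python
from __future__ import annotations

# Hypothetical toy corpus, not taken from any real search index.
DOCUMENTS = [
    "Bill Gates co-founded Microsoft in 1975.",
    "The garden gates need a new bill of materials.",
]

# A tiny entity store: one record per real-world thing, with its aliases.
ENTITIES = {
    "Q_bill_gates": {
        "aliases": {"bill gates", "william henry gates iii"},
        "type": "Person",
        "facts": {"co_founded": "Microsoft"},
    },
}


def string_search(query: str) -> list[str]:
    """Return documents containing every query word, ignoring meaning."""
    words = query.lower().split()
    return [doc for doc in DOCUMENTS if all(w in doc.lower() for w in words)]


def entity_search(query: str) -> dict | None:
    """Resolve the query to an entity record through its aliases."""
    key = query.lower().strip()
    for entity_id, record in ENTITIES.items():
        if key in record["aliases"]:
            return {"id": entity_id, **record}
    return None


if __name__ == "__main__":
    print(string_search("bill gates"))   # both documents match the words
    print(entity_search("Bill Gates"))   # one record describing a person
```

The string search returns the sentence about garden gates as well, because it only sees characters. The entity lookup returns a single record whose facts can then be used to show related people and things.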
In the Knowledge Graph, relationships between things are described as a set of triples of the form <entity, relationship, attribute>:
The entities mentioned here are slightly different from entities in the ordinary sense, and they are easier to understand if we borrow the concept of an ontology from NLP:
An ontology defines the basic terms and relations that make up the vocabulary of a subject area, as well as the rules for combining these terms and relations to define extensions to the vocabulary.
For example, when we describe the university domain, faculty, students, and courses are relatively important concepts, and there are relationships between them, such as the one between faculty and students. There are also constraints between objects, for example, the number of faculty members in a department cannot be less than 10.
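The university example can be sketched in a few lines of Python. This is only an illustration: the relation names (`teaches`, `enrolled_in`, `advises`) and the `Department` class are my own assumptions, and only the "at least 10 faculty members" constraint comes from the text above.

```python
from dataclasses import dataclass, field

# Concepts and allowed relationships of a toy "university" ontology.
# The relation names are hypothetical, chosen only for illustration.
CONCEPTS = {"Faculty", "Student", "Course"}
RELATIONS = {
    ("Faculty", "teaches", "Course"),
    ("Student", "enrolled_in", "Course"),
    ("Faculty", "advises", "Student"),
}
MIN_FACULTY_PER_DEPARTMENT = 10  # the constraint quoted in the text


@dataclass
class Department:
    name: str
    faculty: list = field(default_factory=list)


def relation_allowed(subject_type, relation, object_type):
    """True if the ontology defines this relation between the two concepts."""
    return (subject_type, relation, object_type) in RELATIONS


def check_constraints(dept):
    """Return a list of constraint violations (empty if there are none)."""
    violations = []
    if len(dept.faculty) < MIN_FACULTY_PER_DEPARTMENT:
        violations.append(
            f"{dept.name}: {len(dept.faculty)} faculty, "
            f"need at least {MIN_FACULTY_PER_DEPARTMENT}"
        )
    return violations


if __name__ == "__main__":
    physics = Department("Physics", faculty=["f1", "f2", "f3"])
    print(relation_allowed("Faculty", "advises", "Student"))
    print(check_constraints(physics))
```

The point is that an ontology carries both vocabulary (concepts and relations) and rules, so data that breaks a rule can be detected before it enters the graph.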
With the triple above in mind, we can build a relationship like this:
As you can see, the queen and the crown prince are linked through a mother-child relationship, and each has its own attributes.
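Written as data, the queen and crown prince relationship might look like the sketch below. The attribute names and values are placeholders I invented, since the attributes from the figure are not reproduced in the text.

```python
# Each entity carries its own attributes (placeholder values, for illustration).
entities = {
    "queen": {"type": "Person", "title": "Queen"},
    "crown_prince": {"type": "Person", "title": "Crown Prince"},
}

# Relationship triples: (subject entity, relationship, object entity).
relations = [
    ("queen", "mother_of", "crown_prince"),
]


def related(entity_id, relation):
    """Return the entities that entity_id points to through the relation."""
    return [o for s, p, o in relations if s == entity_id and p == relation]


if __name__ == "__main__":
    for child in related("queen", "mother_of"):
        print(child, entities[child])
```

Entities are the nodes, the relationship is the edge between them, and the attributes hang off each node.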
As the number of nodes in the knowledge graph grows, it comes to resemble a chemical structural formula, and a single knowledge graph often contains many types of entities and relationships.
The knowledge graph structures and visualizes knowledge from a nonlinear world in this way, helping people with reasoning, prediction, and classification.
At this point, we can briefly summarize the basic characteristics of the knowledge graph:
As mentioned earlier, a traditional search engine finds the content that best matches the query keywords among a huge amount of content and returns the highest-scoring results to the user. Throughout this process, the search engine does not need to understand what the user typed. Because the system has no reasoning ability, it falls somewhat short when a precise answer is needed. A search engine based on the knowledge graph, in addition to answering the user’s question directly, has a certain semantic reasoning ability, which greatly improves search accuracy.
In a traditional recommendation system, there are two typical problems:
For example, a movie site may contain tens of thousands of movies, but a user may have rated only a few dozen of them on average. Using such a small amount of observed data to predict a large amount of unknown information greatly increases the risk that the algorithm overfits.
Therefore, recommendation algorithms often take additional auxiliary information as input. This enriches the description of users and items and effectively compensates for sparse or missing interaction data. Among the various kinds of auxiliary information, the knowledge graph is an emerging one, and it has been the subject of much research in recent years.
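A rough Python sketch shows how sparse the interaction data is and how knowledge graph neighbours can still relate two items. The numbers follow the movie example above (10,000 movies, about 30 ratings per user, with 1,000 users assumed). The `KG_NEIGHBOURS` data and the use of Jaccard similarity are my own illustrative assumptions, not the method of any particular recommender.

```python
def interaction_density(num_users, num_items, num_interactions):
    """Fraction of the user-item matrix that is actually observed."""
    return num_interactions / (num_users * num_items)


def jaccard(a, b):
    """Overlap between two sets of knowledge graph neighbours."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical knowledge graph neighbours of three movies.
KG_NEIGHBOURS = {
    "movie_a": {"director:X", "genre:sci-fi", "actor:Y"},
    "movie_b": {"director:X", "genre:sci-fi", "actor:Z"},
    "movie_c": {"director:W", "genre:romance"},
}

if __name__ == "__main__":
    # 1,000 users, 10,000 movies, 30 ratings per user.
    print(interaction_density(1_000, 10_000, 1_000 * 30))
    print(jaccard(KG_NEIGHBOURS["movie_a"], KG_NEIGHBOURS["movie_b"]))
    print(jaccard(KG_NEIGHBOURS["movie_a"], KG_NEIGHBOURS["movie_c"]))
```

With these numbers only 0.3% of the matrix is observed. Even if no user has rated both movie_a and movie_b, their shared director and genre in the graph give the recommender something to work with.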
The following is a recommendation example based on the knowledge graph:
Introducing a knowledge graph into a recommendation system brings the following advantages:
In addition, knowledge graph technology is widely used in many other fields, such as question answering and dialogue systems, language understanding, and decision analysis, where it serves as a background knowledge base for these systems. Taken together, these applications reflect the development trend of AI as a whole, which is a progression from perception to cognition.
Knowledge graph construction now has a fairly complete architecture. Take a look at the following diagram first, and then we will walk through it step by step:
In general, the overall process can be divided into the following 5 steps:
Below, we look at some of the core details in depth.
Data is the foundation of the knowledge graph and directly affects the efficiency and quality of its construction. So let’s analyze the strengths and weaknesses of each type of data source:
Entity extraction refers to identifying and extracting entities, together with their attribute and relationship information, from data. Here too, the approach depends on how the data is structured:
Recall the three elements of the knowledge graph mentioned earlier: entities, relationships, and attributes. The result of relation extraction can likewise be represented as a triple in an RDF graph:
Such an (S, P, O) triple breaks a piece of knowledge down into a subject, a predicate, and an object. This SPO structure can serve as the unit of storage in a knowledge graph.
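As a small illustration of SPO as a storage unit, here is a toy in-memory triple store in Python with pattern queries. The class name `TripleStore` and the predicate names are assumptions made for this sketch. Real RDF stores use IRIs and indexes rather than plain strings and a linear scan.

```python
class TripleStore:
    """A toy store that keeps (subject, predicate, object) triples in a set."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


if __name__ == "__main__":
    store = TripleStore()
    store.add("Bill Gates", "co_founded", "Microsoft")
    store.add("Bill Gates", "type", "Person")
    store.add("Microsoft", "type", "Company")
    print(store.match(s="Bill Gates"))          # everything about one subject
    print(store.match(p="type", o="Company"))   # all companies
```

Leaving any of the three positions open is what turns a stored fact into a query: "who co-founded Microsoft?" is simply the pattern (?, co_founded, Microsoft).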
In RDF you can declare rules and use them to derive new relationships from existing ones. These rules are expressed with RDF Schema (RDFS), using vocabulary such as class, subClassOf, type, property, subPropertyOf, domain, range, etc.
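To get a feel for how such rules derive new facts, here is a simplified forward-chaining sketch in Python. It covers only four rules (subClassOf transitivity, type inheritance through subClassOf, domain, and range) and uses plain strings instead of IRIs. The class and instance names are hypothetical, and this is not a full RDFS reasoner.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Apply four RDFS-style rules repeatedly until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))   # subClassOf is transitive
                elif p == SUBCLASS and p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))       # instances inherit superclasses
                elif p == DOMAIN and p2 == s:
                    new.add((s2, TYPE, o))       # subject gets the domain class
                elif p == RANGE and p2 == s:
                    new.add((o2, TYPE, o))       # object gets the range class
        if new <= facts:
            return facts
        facts |= new


if __name__ == "__main__":
    schema_and_data = [
        ("Professor", SUBCLASS, "Faculty"),
        ("Faculty", SUBCLASS, "Person"),
        ("teaches", DOMAIN, "Faculty"),
        ("teaches", RANGE, "Course"),
        ("alice", TYPE, "Professor"),
        ("bob", "teaches", "kg101"),
    ]
    for triple in sorted(rdfs_closure(schema_and_data) - set(schema_and_data)):
        print(triple)
```

Nobody stated that bob is a person or that kg101 is a course. Both follow from the declared domain, range, and class hierarchy.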
In the following example, the relationships between nodes can be understood as the connections in the ontology mentioned earlier. In a knowledge graph, this process of association is called derivation or relational reasoning:
Knowledge fusion mainly involves coreference resolution, entity alignment, entity linking, and so on. Let’s focus on the most important of these, entity alignment (Object Alignment).
After entity extraction is complete, there are cases where entities have different IDs but represent the same real-world object. Knowledge fusion merges these entities into a single, globally unique entity object, which is then added to the knowledge graph.
This process can be represented by the following diagram:
In fact, the merge decision model in this process is one that everyone is familiar with: a binary classifier trained with machine learning.
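The shape of such a merge decision can be sketched as follows. Note the assumptions: the two features, the weights, and the bias are hand-picked stand-ins for what a trained binary classifier would learn from labelled pairs, and the records are invented. Only the overall idea (pair features in, merge or not out) follows the text.

```python
import math
from difflib import SequenceMatcher

# Hand-set parameters standing in for a trained model (assumption).
WEIGHTS = {"name_sim": 6.0, "attr_agree": 4.0}
BIAS = -6.0


def features(a, b):
    """Turn a pair of entity records into numeric features."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    shared = (set(a) & set(b)) - {"id", "name"}
    agree = sum(a[k] == b[k] for k in shared)
    attr_agree = agree / len(shared) if shared else 0.0
    return {"name_sim": name_sim, "attr_agree": attr_agree}


def same_entity_probability(a, b):
    """Logistic score: estimated probability that a and b are the same object."""
    f = features(a, b)
    z = BIAS + sum(WEIGHTS[k] * f[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def should_merge(a, b, threshold=0.5):
    return same_entity_probability(a, b) >= threshold


# Invented records from two hypothetical sources.
A = {"id": "src1:42", "name": "Bill Gates", "born": 1955, "employer": "Microsoft"}
B = {"id": "src2:7", "name": "William Gates", "born": 1955, "employer": "Microsoft"}
C = {"id": "src2:9", "name": "Bill Yates", "born": 1971, "employer": "Acme"}

if __name__ == "__main__":
    print(should_merge(A, B))  # different IDs and names, matching attributes
    print(should_merge(A, C))  # very similar name, conflicting attributes
```

The second pair is the interesting one: the names are almost identical as strings, but the attributes disagree, so the pair is kept apart.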
Knowledge graphs commonly suffer from incompleteness, and the task at this step is to infer missing relationships from the relationships that already exist in the graph.
In the entity network of the knowledge graph below, the yellow arrows indicate relationships that already exist, and the red dotted lines are the missing relationships. We can complete the missing relationship between e3 and e4 based on the relationships between entities.
As for this completion process, there are many ready-made algorithms to choose from, such as path-finding-based methods, reinforcement-learning-based methods, inference-rule-based methods, meta-learning-based methods, and so on.
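The simplest flavour of path-based completion can be sketched like this: if a two-hop path with a known pair of relations connects two entities, propose the relation that the path implies. The relation names and the single rule in `PATH_RULES` are hypothetical, and real path-based methods learn which paths are predictive instead of hard-coding them.

```python
# Hypothetical rule: following born_in and then city_of implies born_in_country.
PATH_RULES = {("born_in", "city_of"): "born_in_country"}


def complete(triples, rules):
    """Propose missing triples implied by two-hop paths in the graph."""
    known = set(triples)
    outgoing = {}
    for s, p, o in known:
        outgoing.setdefault(s, []).append((p, o))

    proposed = set()
    for s, p1, middle in known:
        for p2, o in outgoing.get(middle, []):
            implied = rules.get((p1, p2))
            if implied and (s, implied, o) not in known:
                proposed.add((s, implied, o))
    return proposed


if __name__ == "__main__":
    graph = [
        ("e3", "born_in", "e2"),
        ("e2", "city_of", "e4"),
        ("e1", "born_in", "e2"),
        ("e1", "born_in_country", "e4"),  # already known, so not proposed again
    ]
    print(complete(graph, PATH_RULES))
```

Here the link between e3 and e4 is filled in because e3 reaches e4 through e2 along a path that the rule recognises.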
The storage of the knowledge graph depends on a graph database and its engine, and implementations vary considerably between vendors. Graph databases to choose from include RDF4J, Virtuoso, Neo4j, and others. iQIYI, for example, chose JanusGraph as its graph database engine and built its own distributed JanusGraph engine on top of its cloud platform’s HBase and Elasticsearch (ES) clusters.
Backed by an external storage system and an external indexing system, JanusGraph serves the upstream online query services.
Logically, the triples kept in the underlying storage form the data layer. The schema layer, built on top of the data layer, is the core of the knowledge graph. It is usually managed through an ontology library, which plays a role similar to a “class” in object-oriented programming. The ontology library manages the axioms, rules, and constraints that regulate the relationships among concrete objects such as entities, relationships, and attributes.
Looking at the knowledge graph from different perspectives makes it easier for us to understand it:
The following is an example of a knowledge graph after it has been built, which makes it easier to understand:
Having read this far, do you feel that building a knowledge graph is rather complicated and hard to get started with?
In fact, in recent years the rapid development of deep learning and related natural language processing technologies has made it possible to extract knowledge from unstructured data automatically, with little or even no human involvement, and some cutting-edge techniques for automatic knowledge graph construction have been proposed.
Building on deep learning, researchers from the Allen Institute for Artificial Intelligence and Microsoft drew on the successful pre-trained language models of natural language processing to propose COMET (COMmonsEnse Transformers), a model for automatic knowledge graph construction.
The model automatically generates rich and diverse commonsense descriptions in natural language, based on the content of existing commonsense knowledge bases, and achieves accuracy close to human performance on two classic commonsense knowledge graphs, ATOMIC and ConceptNet. This demonstrates that such methods are a feasible replacement for traditional methods in automatically constructing and completing commonsense knowledge graphs.
Data governance supplies the data that feeds the knowledge graph, and it is the prerequisite and groundwork for building one. Thorough data governance not only ensures that the knowledge graph is built from real and reliable raw data, but also improves the quality of information at the source, improves the accuracy of knowledge, and establishes a pool of data resources that fits the way humans organize knowledge.
However, data governance is a long-standing problem in knowledge graph construction. Knowledge graph applications always revolve around data governance steps such as data labeling, data cleaning, data unification, and data destruction, and application developers often need to invest a lot of time and manpower in early-stage data governance to ensure the authenticity, reliability, availability, and correctness of their data sources.
At present, data governance problems such as inconsistent data standards, noisy data, a lack of domain data sets, and doubtful data credibility still plague knowledge graph developers, and continuing to carry out data governance projects remains an arduous mission and responsibility for everyone in the industry.
At present, the knowledge graph industry as a whole is short of development resources, and the scarcity of industry experts and technical experts is part of the problem.
On the one hand, there is a lack of experts with deep industry experience. Because an industry knowledge graph is so closely tied to its industry, developers need to understand the business and the customer’s needs quickly and build the schema under the guidance of industry experts. If text extraction is involved, industry experts are also needed to annotate data, yet most industries have only a very small number of such experts. Supply-side enterprises therefore need to focus on the industries where they are strong, recruit and train industry experts in advance, and collaborate internally and externally to build up a reserve of industry experts.
On the other hand, there is a lack of experts who span multiple technologies. Producing a knowledge graph application involves not only knowledge graph algorithms but also the underlying graph data storage and data governance, NLP text extraction, and semantic conversion, and every step is permeated by machine learning, the underlying AI technology. This means the whole production process requires engineers from multiple technical fields to work together, and technical experts who know the entire stack are scarce.
Since the knowledge graph is a two-dimensional linked graph structure rather than a table of rows and columns, it needs to be described and stored as graph data. This directly reflects the internal structure of the knowledge graph, facilitates knowledge queries, and can be combined with graph computation algorithms for in-depth mining and reasoning.
The database that meets this storage requirement is the graph database, which has emerged in recent years. Compared with traditional relational databases, a graph database models data as nodes and edges, which can greatly shorten query time for relationship lookups, supports semi-structured data, and can display multi-dimensional relationships. But efficient and convenient new technologies often come with higher barriers to research and development.
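Why nodes and edges help with relationship queries can be seen in a toy Python sketch. With a table of rows, each extra hop usually means another self-join over the whole table. With an adjacency structure, a multi-hop query just follows the edges of the nodes it actually reaches. The data and the function name `within_hops` are invented for illustration, and real graph databases add indexing, storage, and a query language on top.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Breadth-first search: map each node reachable within max_hops to its distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbour in adjacency.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


if __name__ == "__main__":
    # node -> list of (relationship, neighbour) edges; hypothetical data
    adjacency = {
        "alice": [("knows", "bob")],
        "bob": [("knows", "carol")],
        "carol": [("knows", "dave")],
    }
    print(within_hops(adjacency, "alice", 2))
```

Only the neighbourhood of the starting node is touched, regardless of how many other nodes the graph contains.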
Building a knowledge graph still involves various algorithmic difficulties, which fall into two groups: difficulties in the production process and difficulties in algorithm performance. The former shows up in problems such as knowledge acquisition being limited by data sets, the many interfering factors in knowledge fusion, and insufficient data sets and computing power for knowledge computation.
The latter shows up in the algorithms’ lack of generalization ability, lack of robustness, and lack of unified evaluation metrics. Overcoming these difficulties depends on sustained effort from the supply and demand sides, academia, and government together. No single party can succeed alone.
I have put off posting for quite a while, and I don’t know whether anyone has been looking forward to it.
In fact, I have saved up a lot of topics to write about, but work has been really busy recently, and my time after work has basically gone to playing with the small fat sheep, so there has been no time to write more. Even this article was put together on the high-speed train during a business trip, based on the slides from a report I gave a few days ago.
So, how is the scenery along the way? Not bad, right?
Well, that’s all for this session. I’m Hydra, and I’ll see you next time.