Build graph database structure
First, create a new collection named parsed_vacancy.
parsed_vacancy sample script:
db.parsed_vacancy.insertMany([
{_id: 1, vacancy_id: 1, crawler_id: 1, link: "https://www.work.ua/jobs/2981307/", raw_vacancy: ["Java","MySQL","MongoDB","OOP","office","in","the"], status: "NEW", created_date: new Date(), modified_date: new Date()},
{_id: 2, vacancy_id: 2, crawler_id: 1, link: "https://www.work.ua/jobs/2981308/", raw_vacancy: ["MySQL","JSON","PHP","JS","cookies","have","must"], status: "NEW", created_date: new Date(), modified_date: new Date()},
{_id: 3, vacancy_id: 3, crawler_id: 1, link: "https://www.work.ua/jobs/2981309/", raw_vacancy: ["JS","MySQL","PHP","OOP","client","trips"], status: "NEW", created_date: new Date(), modified_date: new Date()},
{_id: 4, vacancy_id: 4, crawler_id: 1, link: "https://www.work.ua/jobs/29813010/", raw_vacancy: ["Java","OOP","MongoDB","JSON","a","friendly"], status: "NEW", created_date: new Date(), modified_date: new Date()}
]);
Each document in this collection contains the fields:
- _id - document id (integer values are used here for readability);
- vacancy_id - foreign key to the vacancy collection (integer values are used here for readability);
- crawler_id - foreign key to the crawler collection;
- link - URL of the web page with the vacancy;
- raw_vacancy - array that contains words from the description field.
The second collection, stop_word, contains words that must be excluded when filling the graph_skill collection.
stop_word sample script:
db.stop_word.insertMany([
{_id:1, key: "office"},
{_id:2, key: "in"},
{_id:3, key: "the"},
{_id:4, key: "must"},
{_id:5, key: "have"},
{_id:6, key: "cookies"},
{_id:7, key: "client"},
{_id:8, key: "trips"},
{_id:9, key: "a"},
{_id:10, key: "friendly"}]);
Each document in this collection contains the fields:
- _id - document id (integer values are used here for readability);
- key - the stop word itself.
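The exclusion step can be sketched in plain JavaScript. This is only an illustration: the helper name removeStopWords is hypothetical, and exact, case-sensitive matching against the key field is an assumption.

```javascript
// Hypothetical helper: drop stop words from one raw_vacancy array.
// stopWords is an array of documents shaped like those in stop_word.
// Assumption: a word is a stop word only on an exact, case-sensitive match.
function removeStopWords(rawVacancy, stopWords) {
  const stopKeys = new Set(stopWords.map((doc) => doc.key));
  return rawVacancy.filter((word) => !stopKeys.has(word));
}

const sampleStopWords = [
  { _id: 1, key: "office" },
  { _id: 2, key: "in" },
  { _id: 3, key: "the" },
];
const sampleRawVacancy = ["Java", "MySQL", "MongoDB", "OOP", "office", "in", "the"];

const skills = removeStopWords(sampleRawVacancy, sampleStopWords);
console.log(skills); // [ 'Java', 'MySQL', 'MongoDB', 'OOP' ]
```

In the mongo shell the stop words could be loaded with db.stop_word.find().toArray() and passed to the same helper.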
The final collection is graph_skill. It contains skills and their connected subskills. Each subskill has a connection weight that represents how many times the two skills were seen in the same vacancy.
graph_skill sample script:
db.graph_skill.insertMany([
{_id: 1, crawler_id: 1, "skill": "Java", "connects":[
{"subskill":"MySQL", "weight": 1, "parser_id":[1]},
{"subskill":"MongoDB", "weight": 2, "parser_id":[1,4]},
{"subskill":"OOP", "weight": 2, "parser_id":[1,4]},
{"subskill":"JSON", "weight": 1, "parser_id":[4]}
], created_date: new Date(), modified_date: new Date()},
{_id: 2, crawler_id: 1, "skill": "MySQL", "connects":[
{"subskill":"Java", "weight": 1, "parser_id":[1]},
{"subskill":"MongoDB", "weight": 1, "parser_id":[1]},
{"subskill":"OOP", "weight": 2, "parser_id":[1,3]},
{"subskill":"JSON", "weight": 1, "parser_id":[2]},
{"subskill":"JS", "weight": 2, "parser_id":[2,3]},
{"subskill":"PHP", "weight": 2, "parser_id":[2,3]}
], created_date: new Date(), modified_date: new Date()},
{_id: 3, crawler_id: 1, "skill": "MongoDB", "connects":[
{"subskill":"Java", "weight": 2, "parser_id":[1,4]},
{"subskill":"MySQL", "weight": 1, "parser_id":[1]},
{"subskill":"OOP", "weight": 2, "parser_id":[1,4]},
{"subskill":"JSON", "weight": 1, "parser_id":[4]}
], created_date: new Date(), modified_date: new Date()},
{_id: 4, crawler_id: 1, "skill": "OOP", "connects":[
{"subskill":"Java", "weight": 2, "parser_id":[1,4]},
{"subskill":"MySQL", "weight": 2, "parser_id":[1,3]},
{"subskill":"MongoDB", "weight": 2, "parser_id":[1,4]},
{"subskill":"JS", "weight": 1, "parser_id":[3]},
{"subskill":"PHP", "weight": 1, "parser_id":[3]},
{"subskill":"JSON", "weight": 1, "parser_id":[4]}
], created_date: new Date(), modified_date: new Date()},
{_id: 5, crawler_id: 1, "skill": "JSON", "connects":[
{"subskill":"MySQL", "weight": 1, "parser_id":[2]},
{"subskill":"PHP", "weight": 1, "parser_id":[2]},
{"subskill":"JS", "weight": 1, "parser_id":[2]},
{"subskill":"Java", "weight": 1, "parser_id":[4]},
{"subskill":"OOP", "weight": 1, "parser_id":[4]},
{"subskill":"MongoDB", "weight": 1, "parser_id":[4]}
], created_date: new Date(), modified_date: new Date()},
{_id: 6, crawler_id: 1, "skill": "PHP", "connects":[
{"subskill":"MySQL", "weight": 2, "parser_id":[2,3]},
{"subskill":"JS", "weight": 2, "parser_id":[2,3]},
{"subskill":"JSON", "weight": 1, "parser_id":[2]},
{"subskill":"OOP", "weight": 1, "parser_id":[3]}
], created_date: new Date(), modified_date: new Date()},
{_id: 7, crawler_id: 1, "skill": "JS", "connects":[
{"subskill":"MySQL", "weight": 2, "parser_id":[2,3]},
{"subskill":"JSON", "weight": 1, "parser_id":[2]},
{"subskill":"PHP", "weight": 2, "parser_id":[2,3]},
{"subskill":"OOP", "weight": 1, "parser_id":[3]}
], created_date: new Date(), modified_date: new Date()}
]);
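Because the weight counts the vacancies in which two skills occur together, hand-written sample data like this can be sanity-checked. The sketch below is a hypothetical check, not part of the project: it assumes that weight should equal the length of parser_id and that every connection should have a reverse connection with the same weight.

```javascript
// Hypothetical consistency check for graph_skill-shaped documents.
// Assumptions: weight === parser_id.length, and connections are symmetric.
function findInconsistencies(graph) {
  const issues = [];
  const bySkill = new Map(graph.map((node) => [node.skill, node]));
  for (const node of graph) {
    for (const edge of node.connects) {
      const label = `${node.skill} -> ${edge.subskill}`;
      if (edge.weight !== edge.parser_id.length) {
        issues.push(`${label}: weight ${edge.weight} but ${edge.parser_id.length} parser ids`);
      }
      // Look up the connection pointing back from the subskill to this skill.
      const other = bySkill.get(edge.subskill);
      const back = other && other.connects.find((c) => c.subskill === node.skill);
      if (!back) {
        issues.push(`${label}: no reverse connection`);
      } else if (back.weight !== edge.weight) {
        issues.push(`${label}: reverse weight is ${back.weight}`);
      }
    }
  }
  return issues;
}

// A deliberately broken two-node graph: PHP -> JS claims weight 3.
const brokenGraph = [
  { skill: "PHP", connects: [{ subskill: "JS", weight: 3, parser_id: [2, 3] }] },
  { skill: "JS", connects: [{ subskill: "PHP", weight: 2, parser_id: [2, 3] }] },
];
console.log(findInconsistencies(brokenGraph));
```

An empty result means the checked documents agree with both assumptions.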
Each document in this collection contains the fields:
- _id - document id (integer values are used here for readability);
- crawler_id - foreign key to the crawler collection;
- skill - a word taken from the raw_vacancy field and written only if it is not a stop word (in a graph view it is the main node);
- connects.subskill - a word taken from the raw_vacancy field (NOTE: each word from raw_vacancy, except the one taken as the skill, must be added as a subskill and written as an embedded document; see the sample script above);
- connects.weight - the number of times the skill and the subskill have appeared together;
- connects.parser_id - array that contains the ids of parsed_vacancy documents where the subskill and the skill appeared in the same raw_vacancy.
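The field rules above can be sketched as an in-memory builder. This is a minimal illustration under stated assumptions, not the project's implementation: the function name buildSkillGraph is hypothetical, a word repeated within one raw_vacancy is counted once, all vacancies are assumed to belong to one crawler, and crawler_id and the date fields are left out for brevity.

```javascript
// Hypothetical builder: derives graph_skill-shaped documents in memory.
function buildSkillGraph(parsedVacancies, stopWords) {
  const stopKeys = new Set(stopWords.map((doc) => doc.key));
  // skill -> Map(subskill -> array of parsed_vacancy ids)
  const nodes = new Map();

  for (const vacancy of parsedVacancies) {
    // Assumption: a word repeated inside one vacancy counts once.
    const skills = [...new Set(vacancy.raw_vacancy)].filter((w) => !stopKeys.has(w));
    for (const skill of skills) {
      if (!nodes.has(skill)) nodes.set(skill, new Map());
      const connects = nodes.get(skill);
      for (const subskill of skills) {
        if (subskill === skill) continue; // a skill is not its own subskill
        if (!connects.has(subskill)) connects.set(subskill, []);
        connects.get(subskill).push(vacancy._id);
      }
    }
  }

  let nextId = 1;
  return [...nodes].map(([skill, connects]) => ({
    _id: nextId++,
    skill,
    connects: [...connects].map(([subskill, parserIds]) => ({
      subskill,
      weight: parserIds.length, // one count per vacancy containing both words
      parser_id: parserIds,
    })),
  }));
}

// Sample data reduced to the fields the builder reads.
const vacancies = [
  { _id: 1, raw_vacancy: ["Java", "MySQL", "MongoDB", "OOP", "office", "in", "the"] },
  { _id: 2, raw_vacancy: ["MySQL", "JSON", "PHP", "JS", "cookies", "have", "must"] },
  { _id: 3, raw_vacancy: ["JS", "MySQL", "PHP", "OOP", "client", "trips"] },
  { _id: 4, raw_vacancy: ["Java", "OOP", "MongoDB", "JSON", "a", "friendly"] },
];
const stopWordDocs = ["office", "in", "the", "must", "have", "cookies", "client", "trips", "a", "friendly"]
  .map((key, index) => ({ _id: index + 1, key }));

const graph = buildSkillGraph(vacancies, stopWordDocs);
const java = graph.find((node) => node.skill === "Java");
console.log(java.connects.find((c) => c.subskill === "MongoDB"));
// { subskill: 'MongoDB', weight: 2, parser_id: [ 1, 4 ] }
```

In the mongo shell the inputs could come from db.parsed_vacancy.find().toArray() and db.stop_word.find().toArray(), and the result could be passed to db.graph_skill.insertMany() after adding crawler_id and the date fields.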