30/04/2022

[FR] Using a Graph Database when dealing with connected data

Relational databases deal poorly with relationships

Do you deal with complex queries that involve many joins (four or more)?
Do you try to find unknown paths through the data?
Do you have a model that evolves frequently?

At Vadis Technologies, our data scientists often face these kinds of challenges.

In partnership with Intys Data, we have been investigating graph databases to:

gain in performance
work agile
get a richer view of the 300M companies and 200M people we analyze

If you work with highly connected data (social networks), recommendations (e-commerce), or pathfinding (how do I know you?), then you will like our findings.

Do not get me wrong. Relational databases are great. But only for a limited number of tables, and when a rigid structure is not an issue. Here is an illustration from Max De Marzi.

A graph is a set of nodes, with relationships that connect them

A graphDB is a database with:

An explicit graph structure;
Nodes that know each of their adjacent nodes;
An index for lookups;
Local steps (hops) whose cost remains constant as the number of nodes increases.
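
To make these properties concrete, here is a minimal Python sketch. It is an illustration only, not how Neo4j or TigerGraph work internally, and the Node class and the sample names are hypothetical. Each node holds direct references to its neighbours, so a hop only reads that node's own adjacency list, whatever the size of the rest of the graph. The dictionary plays the role of the lookup index.

```python
from collections import deque


class Node:
    """A node that keeps direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # adjacency list: direct references, no join needed

    def connect(self, other):
        # Store the relationship on both ends so it can be walked either way.
        self.neighbours.append(other)
        other.neighbours.append(self)


def shortest_path(start, goal):
    """Breadth-first search: answers 'how do I know you?' hop by hop."""
    queue = deque([[start]])
    seen = {start.name}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current is goal:
            return [n.name for n in path]
        for nxt in current.neighbours:  # local step: only this node's links
            if nxt.name not in seen:
                seen.add(nxt.name)
                queue.append(path + [nxt])
    return None  # no connection found


# The index: used once to find the starting node, not at every hop.
index = {name: Node(name) for name in ["Alice", "Bob", "Carol", "Dave"]}
index["Alice"].connect(index["Bob"])
index["Bob"].connect(index["Carol"])

print(shortest_path(index["Alice"], index["Carol"]))  # ['Alice', 'Bob', 'Carol']
print(shortest_path(index["Alice"], index["Dave"]))   # None
```

Alice reaches Carol through Bob, while Dave has no relationships and therefore no path.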

If you are used to SQL:

Rows in tables become nodes;
Foreign keys become relationships;
Link tables become relationships (possibly with properties);
Artificial constructs (extra primary and foreign keys for example) are no longer necessary.
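
The mapping above can be sketched in a few lines of Python. The tables, column names and relationship types (WORKS_FOR, DIRECTOR_OF) are made up for the example, and a real migration would rely on the import tools of your database rather than on plain dictionaries.

```python
# Hypothetical relational data: two tables and one link table.
companies = [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": "Globex"},
]
people = [
    {"id": 10, "name": "Alice", "employer_id": 1},  # foreign key to companies
]
# Link table (many-to-many) with an extra column.
directorships = [
    {"person_id": 10, "company_id": 2, "since": 2019},
]

nodes = {}          # (label, id) -> properties
relationships = []  # (start, type, end, properties)

# Rows in tables become nodes.
for row in companies:
    nodes[("Company", row["id"])] = {"name": row["name"]}
for row in people:
    nodes[("Person", row["id"])] = {"name": row["name"]}
    # Foreign keys become relationships.
    relationships.append(
        (("Person", row["id"]), "WORKS_FOR", ("Company", row["employer_id"]), {})
    )

# Link-table rows become relationships; extra columns become properties.
for row in directorships:
    relationships.append(
        (("Person", row["person_id"]), "DIRECTOR_OF",
         ("Company", row["company_id"]), {"since": row["since"]})
    )

for start, rel_type, end, props in relationships:
    print(nodes[start]["name"], rel_type, nodes[end]["name"], props)
```

Note that the employer_id column is not copied onto the Person node: once the relationship exists, the foreign key has done its job.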

Fraud Detection at Vadis Technologies, a Top 100 #RegTech Company

At Vadis Technologies, we do fraud detection for clients such as public institutions and big banks.

In order to do that, our team harvests and enriches complex business data to offer risk scoring and 360° third-party monitoring to our customers.

The graphDBs we have been investigating are Neo4j and TigerGraph.

Gain in performance

The rise in the connectedness of our data translates into increased joins.

With relational databases, the bigger the dataset, the slower our join-intensive queries become. With graphDBs, performance tends to remain constant, because a query only touches the part of the graph it traverses.

As a benchmark, we created 2M nodes and 4M relationships in 40 seconds in TigerGraph, using an SSD and 16 GB of memory. With Neo4j, building the same graph took about 80 seconds.

Note that you can speed up loading by paying attention to indexes and to how writes are grouped into transactions.
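
One common way to act on the transaction side, and this is my reading rather than a description of our benchmark setup, is to send rows in batches instead of committing them one by one. The sketch below is plain Python: write_batch is a hypothetical callback standing in for whatever bulk-write call your database driver provides, and the batch size of 10,000 is an arbitrary assumption.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load(rows, write_batch, batch_size=10_000):
    """Send rows to the database one batch (one transaction) at a time.

    `write_batch` is a hypothetical callback, not a real driver API.
    """
    transactions = 0
    for chunk in batches(rows, batch_size):
        write_batch(chunk)  # one round trip per chunk, not per row
        transactions += 1
    return transactions


# Stand-in for a real driver: it just records what it was asked to write.
written = []
rows = ({"id": i} for i in range(25_000))
print(load(rows, written.extend, batch_size=10_000))  # 3
print(len(written))                                   # 25000
```

Here 25,000 rows go through in 3 transactions instead of 25,000.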

If you work with a large amount of data and need to scale, you will most likely not be able to hold the whole graph in memory.

I suggest you check this analysis comparing the performance of TigerGraph and Neo4j on a 500GB dataset. It presents metrics on loading time, query performance, and storage size after loading.

Work agile

The cost of change in graphDBs is low, so you can work in an agile way across your whole workflow.

1. Derive the question

There is no need to grasp the whole problem domain in one go and turn that knowledge into one big model.

Pick one concrete question that needs to be solved and that adds value.

The more concrete, the better!

2. Obtain the data

What data is needed to answer your question? Get that data and only that data. If an ID is sufficient to solve the question, then do not get the name and description.

3. Develop a model and ingest the data

There are no fixed rules for creating a good graph. Be creative.

Still, here are a few performance-driven tips and tricks:

Use Nodes for Entities, Relationships for Structure;
Represent facts as nodes. A fact emerges when two or more domain entities interact for a period of time;
In general, use fine-grained relationships instead of generic relationships;
Represent complex value types as nodes.
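
Here is a small Python sketch of what these tips can look like in practice. The Node and Relationship classes, the labels (Mandate, Address) and the relationship types are all invented for the example; they are not our production model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    props: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str
    end: Node


# Entities are nodes.
alice = Node("Person", {"name": "Alice"})
acme = Node("Company", {"name": "Acme"})

# The fact "Alice was a director of Acme from 2015 to 2020" involves two
# entities over a period of time, so it becomes a node of its own.
mandate = Node("Mandate", {"role": "director", "start": 2015, "end": 2020})

# A complex value type (an address) is also modelled as a node.
address = Node("Address", {"city": "Brussels", "country": "BE"})

# Fine-grained relationship types instead of a generic "RELATED_TO".
relationships = [
    Relationship(alice, "HOLDS_MANDATE", mandate),
    Relationship(mandate, "MANDATE_AT", acme),
    Relationship(acme, "REGISTERED_AT", address),
]

for rel in relationships:
    print(rel.start.label, rel.rel_type, rel.end.label)
```

Because the mandate is a node, you can later attach more to it (a source document, a second company) without touching the Person or Company nodes.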

4. Query/prove your model

Write the query that answers your question.

Does it perform within expectations?

YES — Excellent, you are ready for the next iteration
NO — Backtrack to steps 2 and 3 and rethink the model
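
If you want to make "within expectations" explicit, one simple option is to time the query against a budget. The helper below is a hypothetical sketch: sample_query stands in for your real graph query, and the one-second budget is an arbitrary assumption.

```python
import time


def within_budget(query, budget_seconds):
    """Run `query` (a zero-argument callable) and compare its wall-clock
    time with the budget. Returns (result, elapsed_seconds, ok)."""
    started = time.perf_counter()
    result = query()
    elapsed = time.perf_counter() - started
    return result, elapsed, elapsed <= budget_seconds


# Hypothetical stand-in for a real graph query.
def sample_query():
    return sum(range(1000))


result, elapsed, ok = within_budget(sample_query, budget_seconds=1.0)
if ok:
    print("ready for the next iteration")
else:
    print("backtrack to steps 2 and 3 and rethink the model")
```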

Richer picture of the data

GraphDBs make it easier to visualize data and see the links between different entities.

Both Neo4j and TigerGraph have a tool that enables you to navigate through the data and visualize your model.

Here is a graph built with TigerGraph on a dataset containing 280K users (anonymized but with demographic information) providing 1M ratings about 250K books.

You can easily query a user and see which books they have rated. Then you can expand on a rated book and see what other users think of it.
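
That navigation is a two-hop traversal: from a user to the books they rated, then from each book to its other raters. Here is a minimal sketch in plain Python, with a handful of invented ratings rather than the real dataset; in TigerGraph or Neo4j you would write this in the database's own query language.

```python
from collections import defaultdict

# Hypothetical ratings: (user, book, score).
ratings = [
    ("u1", "Dune", 9),
    ("u1", "Emma", 6),
    ("u2", "Dune", 8),
    ("u3", "Dune", 4),
    ("u3", "Ulysses", 7),
]

# Store each RATED relationship on both ends so it can be walked either way.
books_by_user = defaultdict(dict)
users_by_book = defaultdict(dict)
for user, book, score in ratings:
    books_by_user[user][book] = score
    users_by_book[book][user] = score


def other_opinions(user):
    """For each book `user` rated, list what the other users gave it."""
    result = {}
    for book in books_by_user[user]:  # hop 1: user -> rated books
        result[book] = {
            other: score
            for other, score in users_by_book[book].items()  # hop 2: book -> users
            if other != user
        }
    return result


print(other_opinions("u1"))  # {'Dune': {'u2': 8, 'u3': 4}, 'Emma': {}}
```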

It gets even more interesting as the number of types of relationships increases.

GraphDBs are used in many other cases

Here is a non-exhaustive list of graphDB use cases:

Recommendation engine
Network and IT Operations
Search engine
Master Data Management
Identity and access management (internal and external)
Machine learning and analytics
Social networks
Privacy and risk compliance
Email targeting
Knowledge Graph (for asset management, content management, inventory, workflows, cataloging…)