Neo4j Graph Database: When Relationships Beat Tables

Use Neo4j for relationship-heavy data — Cypher query language, graph algorithms, and fraud detection use cases

JusDB Team
March 28, 2023

A social network with 50 million users. A fraud ring spanning dozens of synthetic identities sharing phones, addresses, and bank accounts. A product catalog where recommendations require traversing six degrees of customer behavior. These problems share a common trait: the data is densely connected, and the relationships themselves carry as much meaning as the entities they connect. Relational databases model this with join tables, and both the number of JOINs and the size of the intermediate results grow rapidly as traversal depth increases. Graph databases were designed from the ground up to make relationship traversal a first-class operation whose per-hop cost does not depend on total dataset size, and Neo4j is the production-hardened leader of that category.

TL;DR
  • Graph databases store entities as nodes and connections as relationships, both with arbitrary properties.
  • Neo4j's Cypher query language expresses traversals as ASCII-art patterns, making complex graph queries readable.
  • Relationship-heavy queries (fraud detection, recommendations, knowledge graphs) can run orders of magnitude faster in Neo4j than equivalent multi-JOIN SQL once traversals reach three or more hops.
  • The Graph Data Science (GDS) library adds production-ready PageRank, Louvain community detection, and pathfinding algorithms.
  • AuraDB provides a fully managed Neo4j cloud option; competitors include Amazon Neptune and TigerGraph.
  • Index types (range and full-text) and the APOC library extend Neo4j for operational workloads.

Graph Database Fundamentals

The property graph model underpinning Neo4j has three primitive building blocks: nodes, relationships, and properties.

  • Nodes represent entities — a User, a Product, a Device, an Account. A single node can carry multiple labels (e.g., :Person:Employee) and an arbitrary set of key-value properties.
  • Relationships are directed, named edges between two nodes (e.g., [:PURCHASED], [:SHARES_DEVICE]). Crucially, relationships are first-class citizens stored as physical pointers, not computed at query time. Every relationship also supports its own properties (since, weight, confidence_score).
  • Properties are schemaless key-value pairs on both nodes and relationships, stored natively as primitive types (string, integer, float, boolean, date, list).

This model means that traversing from a node to its neighbors requires following stored pointers rather than scanning index B-trees and constructing hash joins across tables. The traversal cost is proportional to the local neighborhood size, not the total dataset size — a property called index-free adjacency.
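
To make index-free adjacency concrete, here is a minimal in-memory sketch in Python. The `Node` and `Relationship` classes, the `connect` and `neighbors` helpers, and all field names are illustrative assumptions, not Neo4j's storage format. The only point is that each node holds direct references to its own relationships, so finding neighbors touches that node's local list and nothing else.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    labels: set
    props: dict
    # Direct references to this node's outgoing relationships (no global index).
    outgoing: list = field(default_factory=list)


@dataclass(eq=False)
class Relationship:
    rel_type: str
    start: Node
    end: Node
    props: dict = field(default_factory=dict)


def connect(start, rel_type, end, **props):
    """Create a directed relationship and attach it to its start node."""
    rel = Relationship(rel_type, start, end, props)
    start.outgoing.append(rel)
    return rel


def neighbors(node, rel_type):
    """Follow stored references; the cost depends on this node's degree only."""
    return [r.end for r in node.outgoing if r.rel_type == rel_type]


# Illustrative data: a node may carry several labels and arbitrary properties.
alice = Node({"Person", "Employee"}, {"name": "Alice"})
bob = Node({"Person"}, {"name": "Bob"})
laptop = Node({"Device"}, {"fingerprint": "fp-abc123"})

connect(alice, "FOLLOWS", bob, since="2024-03-15")
connect(alice, "USES_DEVICE", laptop)

followed = [n.props["name"] for n in neighbors(alice, "FOLLOWS")]
print(followed)
```

Adding a million unrelated nodes to this structure would not change the work done by `neighbors(alice, "FOLLOWS")`, which is the property the paragraph above describes.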

When Graphs Beat Tables

The break-even point for graph databases versus relational databases is roughly the third or fourth degree of relationship traversal. Consider finding "friends of friends who have purchased the same product category within 30 days." In SQL this requires at minimum four JOINs, with intermediate result sets that can balloon to hundreds of millions of rows before filters reduce them. The query planner has to work hard, and execution time grows super-linearly with dataset size.

In Neo4j the same query is a pattern match with a variable-length path expression. The engine walks the stored pointer chains; intermediate results stay small because pruning happens during traversal, not after materialization.
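
The pruning idea can be sketched in plain Python as a depth-bounded breadth-first walk. The adjacency dictionary and the `within_hops` helper are hypothetical stand-ins for the stored graph, and the sketch tracks shortest hop counts rather than reproducing Cypher's exact path semantics. It only shows that each hop expands from the current frontier and discards already-visited nodes immediately, instead of materializing every combination first.

```python
def within_hops(adjacency, start, max_hops):
    """Return {node: shortest hop count} for nodes 1..max_hops away from start."""
    distances = {}
    visited = {start}
    frontier = [start]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor in visited:
                    continue  # prune during traversal, not afterwards
                visited.add(neighbor)
                distances[neighbor] = hop
                next_frontier.append(neighbor)
        frontier = next_frontier
    return distances


# Toy FOLLOWS graph (illustrative data only).
follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave"],
    "carol": ["erin"],
    "dave": ["alice", "frank"],
    "erin": [],
    "frank": ["grace"],
}

reachable = within_hops(follows, "alice", 3)
print(reachable)
```

Here `grace` is four hops from `alice`, so the walk never reaches her and no work is spent on her neighborhood.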

Warning

Graph databases are not universally faster. For bulk analytical queries that read every row of a large dataset (e.g., monthly revenue summaries, full-table aggregations), a columnar warehouse like ClickHouse or BigQuery will outperform Neo4j significantly. Graph databases excel at local traversal queries, not global scans.

The sweet-spot workloads for Neo4j:

  • Fraud and anomaly detection (shared identity signals across accounts)
  • Real-time recommendation engines (collaborative filtering, content similarity)
  • Knowledge graphs and entity resolution
  • Network and IT infrastructure dependency mapping
  • Supply chain and logistics routing
  • Access control and permission inheritance (RBAC/ABAC hierarchies)

Neo4j Architecture

Neo4j stores the graph using a custom binary format on disk. Each node record holds a pointer to its first relationship. Each relationship record holds pointers to the previous and next relationship for both its start node and its end node, forming two doubly-linked lists, one per endpoint. Traversal follows these linked lists without touching unrelated data.
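
The record layout can be sketched as follows. This is a simplified model under assumed names (`NodeRecord`, `RelRecord`, `NO_REL`): list indices stand in for on-disk record offsets, only the forward pointers of each chain are modelled, and Neo4j's real store format carries more fields than shown here.

```python
from dataclasses import dataclass

NO_REL = -1  # sentinel meaning "end of chain"


@dataclass
class NodeRecord:
    first_rel: int = NO_REL  # id of the first relationship in this node's chain


@dataclass
class RelRecord:
    rel_type: str
    start: int
    end: int
    start_next: int = NO_REL  # next relationship in the start node's chain
    end_next: int = NO_REL    # next relationship in the end node's chain


nodes = []
rels = []


def add_node():
    nodes.append(NodeRecord())
    return len(nodes) - 1


def add_rel(rel_type, start, end):
    """Insert a relationship at the head of both endpoint chains."""
    rel_id = len(rels)
    rels.append(RelRecord(rel_type, start, end,
                          start_next=nodes[start].first_rel,
                          end_next=nodes[end].first_rel))
    nodes[start].first_rel = rel_id
    nodes[end].first_rel = rel_id
    return rel_id


def relationships_of(node_id):
    """Walk one node's chain; records of unrelated nodes are never read."""
    rel_id = nodes[node_id].first_rel
    while rel_id != NO_REL:
        rec = rels[rel_id]
        yield rel_id
        # Follow the pointer that belongs to this node's chain.
        rel_id = rec.start_next if rec.start == node_id else rec.end_next


a, b, c = add_node(), add_node(), add_node()
add_rel("FOLLOWS", a, b)  # relationship 0
add_rel("FOLLOWS", a, c)  # relationship 1
add_rel("FOLLOWS", c, b)  # relationship 2

chain_a = list(relationships_of(a))
chain_b = list(relationships_of(b))
chain_c = list(relationships_of(c))
print(chain_a, chain_b, chain_c)
```

Each relationship sits in two chains at once, which is why a single record can be reached from either endpoint without any lookup structure in between.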

At the cluster level, Neo4j uses a primary-secondary model built on the Raft consensus protocol. A write commits once a majority of primary members has acknowledged it, and it then replicates asynchronously to secondaries (read replicas). The AuraDB managed service (Neo4j's cloud offering) automates provisioning, patching, and scaling across AWS, GCP, and Azure. AuraDB Free, Professional, and Enterprise tiers provide a low-friction entry point without infrastructure management overhead.
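
In Raft, an entry counts as committed once a majority of voting members has stored it. The arithmetic behind that rule is small enough to sketch; the function names below are hypothetical, and the sketch assumes that only primary members vote while read replicas do not.

```python
def quorum_size(voting_members):
    """Smallest number of voting members that forms a majority."""
    return voting_members // 2 + 1


def is_committed(acks, voting_members):
    """A write is committed once a majority has acknowledged it."""
    return acks >= quorum_size(voting_members)


def tolerated_failures(voting_members):
    """Voting members that can be lost while a majority is still reachable."""
    return voting_members - quorum_size(voting_members)


for size in (3, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

This is why clusters are usually sized with an odd number of voting members: going from three to four raises the quorum without tolerating any additional failure.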

Index types supported in current Neo4j versions:

  • Range indexes — B-tree indexes on node/relationship properties supporting equality, range, prefix, and existence lookups. The default index type for most property lookups.
  • Full-text indexes — Lucene-backed inverted indexes for string search, supporting tokenization, fuzzy matching, and relevance scoring. Essential for knowledge graph entity search and NLP-adjacent workloads.
  • Point indexes — Spatial indexes for 2D/3D coordinate properties.
  • Token lookup indexes — Implicit label and relationship-type indexes used internally for label scans.

Cypher Query Language

Cypher is Neo4j's declarative query language. Its defining characteristic is ASCII-art graph pattern syntax: nodes are represented as (), relationships as -[]->, and full patterns as connected expressions. This makes queries visually mirror the data model they describe.

CREATE — inserting nodes and relationships:

cypher
// Create two users and a FOLLOWS relationship
CREATE (alice:User {id: 'u1', name: 'Alice', email: 'alice@example.com'})
CREATE (bob:User {id: 'u2', name: 'Bob', email: 'bob@example.com'})
CREATE (alice)-[:FOLLOWS {since: date('2024-03-15')}]->(bob)

MERGE — upsert semantics (create if not exists):

cypher
// Idempotent device association — safe to run repeatedly
MERGE (d:Device {fingerprint: 'fp-abc123'})
ON CREATE SET d.first_seen = datetime()
ON MATCH  SET d.last_seen  = datetime()
WITH d
MATCH (u:User {id: 'u1'})
MERGE (u)-[:USES_DEVICE]->(d)

MATCH — pattern-based retrieval with variable-length paths:

cypher
// Find all users within 3 hops of Alice in the FOLLOWS network
MATCH path = (alice:User {name: 'Alice'})-[:FOLLOWS*1..3]->(other:User)
WHERE other <> alice
// A user reachable by several paths is reported once, at the shortest distance
WITH other, min(length(path)) AS degrees
RETURN other.name, degrees
ORDER BY degrees ASC
LIMIT 50

Relationship traversal with aggregation:

cypher
// Products frequently co-purchased with a given item
MATCH (p:Product {sku: 'SKU-9921'})<-[:PURCHASED]-(buyer:User)-[:PURCHASED]->(other:Product)
WHERE other <> p
RETURN other.name, count(DISTINCT buyer) AS shared_buyers
ORDER BY shared_buyers DESC
LIMIT 10

Tip

Prefix any Cypher query with EXPLAIN or PROFILE to inspect its execution plan. EXPLAIN shows the plan without running the query; PROFILE executes the query and reports db hits per operator, which is invaluable for identifying missing indexes or Cartesian products introduced by unconnected MATCH clauses.

Real Use Cases

Fraud Detection: Shared Identity Signals

Fraud rings create synthetic identities that inevitably share real-world artifacts: a phone number, a device fingerprint, an IP address, a mailing address. Individually these overlaps look like noise; connected together they form a detectable cluster. This query surfaces accounts that share two or more distinct signals with accounts already flagged as fraud:

cypher
// Identify accounts sharing devices or phone numbers with a known fraudster
MATCH (flagged:Account {status: 'fraud'})-[:USES_DEVICE|REGISTERED_PHONE]->
      (signal)<-[:USES_DEVICE|REGISTERED_PHONE]-(suspect:Account)
WHERE suspect <> flagged
  AND suspect.status <> 'fraud'
WITH suspect, count(DISTINCT signal) AS shared_signals
WHERE shared_signals >= 2
RETURN suspect.id, suspect.email, shared_signals
ORDER BY shared_signals DESC

The equivalent SQL would self-JOIN each signal table, UNION the results, and then GROUP BY to count shared signals, typically touching 6–8 tables with intermediate cardinality explosions along the way. In Neo4j the traversal stays local to the two-hop neighborhood of the flagged accounts.
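
The two-hop logic of the fraud query can be mirrored in a few lines of plain Python. The account and signal identifiers, the `links` list, and the `suspects` helper are made-up illustrations; the sketch only shows how the walk goes from flagged accounts to their signals and back out to other accounts, counting distinct shared signals per suspect.

```python
from collections import defaultdict

# (account, signal) pairs standing in for USES_DEVICE / REGISTERED_PHONE edges.
links = [
    ("acct-1", "device:fp-1"), ("acct-1", "phone:555-0100"),
    ("acct-2", "device:fp-1"), ("acct-2", "phone:555-0100"),
    ("acct-3", "device:fp-1"),
    ("acct-4", "phone:555-0199"),
]
flagged = {"acct-1"}

signals_of = defaultdict(set)   # account -> signals it uses
accounts_of = defaultdict(set)  # signal -> accounts that use it
for account, signal in links:
    signals_of[account].add(signal)
    accounts_of[signal].add(account)


def suspects(flagged_accounts, min_shared=2):
    """Accounts sharing at least min_shared distinct signals with flagged ones."""
    shared = defaultdict(set)
    for bad in flagged_accounts:
        for signal in signals_of[bad]:           # hop 1: flagged -> signal
            for other in accounts_of[signal]:    # hop 2: signal -> account
                if other not in flagged_accounts:
                    shared[other].add(signal)
    return {acct: len(sigs) for acct, sigs in shared.items()
            if len(sigs) >= min_shared}


result = suspects(flagged)
print(result)
```

`acct-3` shares only one device with the flagged account and `acct-4` shares nothing, so neither crosses the threshold; only the accounts adjacent to flagged signals are ever visited.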

Recommendation Engine

Collaborative filtering via graph traversal naturally surfaces "users like you also bought" recommendations without pre-computing a similarity matrix:

cypher
// Collaborative filtering: items purchased by similar users
MATCH (me:User {id: $userId})-[:PURCHASED]->(item:Product)
      <-[:PURCHASED]-(peer:User)-[:PURCHASED]->(rec:Product)
WHERE NOT (me)-[:PURCHASED]->(rec)
  AND peer <> me
RETURN rec.name,
       rec.category,
       count(DISTINCT peer) AS peer_signal,
       collect(DISTINCT item.name)[0..3] AS via_items
ORDER BY peer_signal DESC
LIMIT 20
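
For readers who want to see the counting logic outside Cypher, the same idea is sketched below in Python. The `purchases` dictionary and the `recommend` function are hypothetical; each peer who shares at least one purchase with the user contributes one vote to every product the user has not bought yet, which corresponds to `count(DISTINCT peer)` in the query.

```python
from collections import Counter

# user -> set of purchased products (illustrative data only)
purchases = {
    "me":    {"tent", "stove"},
    "peer1": {"tent", "lantern", "stove"},
    "peer2": {"tent", "lantern", "boots"},
    "peer3": {"kayak"},
}


def recommend(user, purchases, limit=20):
    """Rank products bought by overlapping peers but not by the user."""
    mine = purchases[user]
    peer_signal = Counter()
    for peer, items in purchases.items():
        if peer == user or not (items & mine):
            continue  # keep only peers who share at least one purchase
        for rec in items - mine:
            peer_signal[rec] += 1  # each peer counts once per product
    return peer_signal.most_common(limit)


recs = recommend("me", purchases)
print(recs)
```

`peer3` has no purchase in common with the user, so the kayak never appears among the candidates.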

Knowledge Graph and Entity Resolution

Enterprise knowledge graphs store concepts, documents, and their semantic relationships. Full-text indexes on :Concept nodes enable entity search, while graph traversal resolves synonyms and hierarchical relationships (e.g., "electric vehicle" IS_A "automobile" IS_A "vehicle") without recursive CTEs.
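
A hierarchy walk of this kind can be sketched in Python as pointer chasing. The `is_a` and `synonyms` dictionaries and the `ancestors` helper are illustrative assumptions, and the sketch assumes each concept has at most one parent, which real knowledge graphs often do not guarantee.

```python
# child concept -> parent concept (stands in for IS_A relationships)
is_a = {
    "electric vehicle": "automobile",
    "automobile": "vehicle",
}
# alternative name -> canonical concept (stands in for synonym relationships)
synonyms = {"EV": "electric vehicle", "car": "automobile"}


def ancestors(concept):
    """Resolve a synonym, then follow IS_A links up to the root."""
    concept = synonyms.get(concept, concept)
    chain = []
    seen = {concept}
    while concept in is_a:
        concept = is_a[concept]
        if concept in seen:
            break  # guard against accidental cycles in the data
        seen.add(concept)
        chain.append(concept)
    return chain


print(ancestors("EV"))
```

The loop runs once per level of the hierarchy, so its cost tracks the depth of the concept rather than the size of the graph.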

Scaling Neo4j

Neo4j's horizontal scaling story is primarily read-scale: add read replicas behind a load balancer. Write throughput scales vertically (more RAM for page cache, faster NVMe storage). For workloads demanding write-heavy horizontal scale, consider sharding the graph by domain subgraph or evaluating TigerGraph's native distributed architecture.

The APOC (Awesome Procedures on Cypher) library extends Neo4j with 450+ stored procedures for data import, graph refactoring, date/text utilities, and schema introspection. APOC is a de facto standard in production Neo4j deployments:

cypher
// Batch-import relationships from a JSON list using APOC
CALL apoc.periodic.iterate(
  "UNWIND $events AS e RETURN e",
  "MATCH (u:User {id: e.userId}), (p:Product {sku: e.sku})
   MERGE (u)-[:PURCHASED {ts: e.timestamp, amount: e.amount}]->(p)",
  {batchSize: 1000, params: {events: $eventList}, parallel: false}
)
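
Conceptually, `batchSize` splits the incoming rows into chunks and runs the inner statement once per chunk in its own transaction. The chunking itself looks like the Python sketch below; the `batches` helper and the sample events are hypothetical and stand in for the work APOC does internally.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 2,500 made-up purchase events; each batch would map to one transaction.
events = [{"userId": f"u{i}", "sku": "SKU-9921"} for i in range(2500)]
sizes = [len(batch) for batch in batches(events, 1000)]
print(sizes)
```

Smaller batches keep each transaction's memory footprint low at the cost of more commits, which is the trade-off to tune when choosing `batchSize`.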

The Graph Data Science (GDS) library adds in-memory graph projection and analytical algorithms that run on the graph without blocking transactional queries:

cypher
// Project a subgraph and run PageRank to rank influential users
// (three separate statements, each terminated with a semicolon)
CALL gds.graph.project(
  'user-follows',
  'User',
  {FOLLOWS: {orientation: 'NATURAL'}}
);

CALL gds.pageRank.stream('user-follows', {
  maxIterations: 20,
  dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS user, score
ORDER BY score DESC
LIMIT 25;

// Community detection with Louvain modularity
CALL gds.louvain.stream('user-follows')
YIELD nodeId, communityId
RETURN communityId, count(*) AS members
ORDER BY members DESC
LIMIT 10;

Warning

GDS in-memory graph projections can consume significant heap. For large graphs (100M+ nodes), size the GDS projection carefully and drop it after use with CALL gds.graph.drop('user-follows'). Monitor heap usage via Neo4j's :sysinfo browser command or the metrics endpoint.
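
To show what the `maxIterations` and `dampingFactor` settings in the PageRank call control, here is a textbook power-iteration sketch in Python. It is not the GDS implementation: the `pagerank` function, the edge list, and the normalization (scores sum to 1) are assumptions made for illustration, and GDS scales its scores differently.

```python
def pagerank(edges, damping=0.85, max_iterations=20):
    """Textbook PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / count + damping * dangling / count
                    for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each node splits its damped rank across its outgoing links.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy FOLLOWS edges (illustrative data only).
follows = [
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("bob", "alice"), ("bob", "carol"),
]
ranks = pagerank(follows)
top_user = max(ranks, key=ranks.get)
print(top_user)
```

A higher damping factor lets rank flow further along FOLLOWS chains before it is redistributed, and more iterations bring the scores closer to their fixed point at the cost of more passes over the projected graph.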

Neo4j vs Amazon Neptune vs TigerGraph

| Dimension | Neo4j | Amazon Neptune | TigerGraph |
|---|---|---|---|
| Query language | Cypher (+ GQL) | Gremlin, openCypher, SPARQL | GSQL (SQL-like) |
| Managed option | AuraDB | Native AWS service | TigerGraph Cloud |
| Horizontal write scale | Limited (single primary) | Limited | Native distributed |
| Ecosystem / community | Largest (APOC, GDS, drivers) | AWS ecosystem lock-in | Smaller, enterprise-focused |
| Open source core | Community Edition (GPLv3) | No | No |
| Best fit | General-purpose graph + ML | AWS-native, multi-model | Ultra-large scale analytics |

For teams already running workloads on AWS with strict data residency requirements, Neptune's tight VPC integration and IAM auth can outweigh Neo4j's richer ecosystem. For most greenfield graph projects, Neo4j's Cypher ergonomics, community, and GDS library make it the lower-friction choice.

Tip

Neo4j's Community Edition is free and GPLv3 licensed — suitable for evaluation and internal tooling. Production enterprise features (cluster failover, role-based security, warm standby) require the Enterprise license or AuraDB. Factor license costs into your TCO comparison with Neptune, which charges per I/O request and instance-hour.

Key Takeaways
  • Neo4j's property graph model (nodes + relationships + properties) enables index-free adjacency — traversal cost is proportional to local neighborhood size, not total dataset size.
  • Cypher's ASCII-art pattern syntax makes relationship queries readable and maintainable; MATCH, CREATE, and MERGE cover the majority of operational patterns.
  • Variable-length path expressions ([:REL*1..3]) replace exponential JOIN chains for multi-hop traversal queries.
  • Fraud detection and recommendation engines are canonical graph database workloads where Neo4j can outperform relational alternatives by orders of magnitude once traversal depth exceeds two hops.
  • The APOC library and GDS algorithms (PageRank, Louvain community detection) extend Neo4j from a transactional store to a full graph analytics platform.
  • Range indexes handle property lookups; full-text indexes (Lucene-backed) enable entity search for knowledge graph workloads.
  • AuraDB eliminates infrastructure management; compare it against Neptune on I/O cost and against TigerGraph on write-scale requirements.
  • Graph databases complement, not replace, your relational or document store — federate queries across systems for the data model that fits each workload.

