#4500 Cypher Statement Sequence Validation

#4500 Given a python string of cypher statements and an LLM, how can I validate the cypher statements? Example: """MERGE (s:System {id: 'abw', name: 'Acoustic Barrier Wall', label: 'System', type: 'Structural', description: 'A modular barrier system designed to attenuate sound transmission and vibration.'})

Validating Cypher statements generated by or passed through an LLM is crucial because LLMs are notorious for hallucinating syntax, inventing non-existent relationship directions, or mismatching variables.

To build a robust validation pipeline in Python, you should use a hybrid approach: use deterministic programming libraries for syntax and schema validation, and use the LLM for semantic validation.

Here is a step-by-step framework to achieve this.

1. Syntax Validation (Deterministic)

Before asking an LLM if the Cypher makes sense, you must ensure it is actually valid Cypher. The most reliable way to do this without executing it against a live production database is using a parsing library.

Option A: A Dry Run with `EXPLAIN`

A standalone parser such as libcypher-parser can check syntax offline, but the simplest approach is to send each statement to a Neo4j test instance prefixed with EXPLAIN. EXPLAIN asks the database to parse the query and build an execution plan without running it or altering data.

from neo4j import GraphDatabase
from neo4j.exceptions import Neo4jError

def validate_syntax_with_explain(cypher_string, uri, auth):
    """Validates Cypher syntax without executing the query using EXPLAIN."""
    # Split on blank lines; assumes statements are separated by an empty line
    queries = [q.strip() for q in cypher_string.split('\n\n') if q.strip()]

    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for query in queries:
                try:
                    # Prepend EXPLAIN so the query is planned but never executed;
                    # consume() makes sure any server error surfaces here
                    session.run(f"EXPLAIN {query}").consume()
                except Neo4jError as e:
                    # Connection problems are not caught; they are not syntax errors
                    return False, f"Syntax Error found in query:\n{query}\n\nError: {e}"
    return True, "All queries have valid syntax."
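
Splitting on blank lines is fragile if the statements are terminated with semicolons instead. As a minimal sketch (the function name `split_cypher_statements` is hypothetical, and it assumes the script contains no comments with semicolons in them), a splitter can walk the string and ignore semicolons that sit inside quoted text:

```python
def split_cypher_statements(block: str) -> list[str]:
    """Split a Cypher block on semicolons that are outside quoted text."""
    statements, current = [], []
    quote = None      # the quote character we are currently inside, if any
    escaped = False   # True when the previous character was a backslash
    for ch in block:
        if quote:
            current.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\" and quote != "`":
                # Backslash escapes apply to string literals, not backticks
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif ch == ";":
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    statements.append("".join(current).strip())
    # Drop empty fragments left by trailing semicolons or whitespace
    return [s for s in statements if s]
```

The resulting list can be fed to the EXPLAIN loop in place of the blank-line split.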

2. Schema and Semantic Validation (LLM-Assisted)

Once you know the Cypher code won't crash the parser, you need to validate semantics and schema conformity (e.g., Are we allowed to connect a Part directly to a Component using AFFECTS?).

You can prompt your LLM as a "Cypher Linter" by feeding it your Graph Schema and the generated Cypher string.

The LLM Prompt Design

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def validate_cypher_semantics(cypher_to_test):
    # Define your allowed graph schema rigidly
    graph_schema = """
    Node Labels:
      - System (id, name, label, type, description)
      - Subsystem (id, name, label, type, description)
      - Component (id, name, label, type, description)
      - Part (id, name, label, type, description)
      
    Allowed Relationships:
      - (:System)-[:HAS_SUBSYSTEM]->(:Subsystem)
      - (:Subsystem)-[:HAS_COMPONENT]->(:Component)
      - (:Component)-[:HAS_PART]->(:Part)
      - (:Part)-[:AFFECTS]->(:Part)
      - (:Part)-[:AFFECTS]->(:Component)
      - (:Component)-[:AFFECTS]->(:Component)
      - (:Component)-[:AFFECTS]->(:Part)
    """

    prompt = f"""
    You are an expert Neo4j Cypher QA engineer. Your job is to validate a block of Cypher statements against an official Graph Schema.
    
    CRITICAL VALIDATION RULES:
    1. Look for mismatched variables. If a variable is declared as a `Part` in line 1, it must not be re-declared as a `Component` later.
    2. Ensure relationship directions and types perfectly match the strict allowed schema.
    3. Check for structural redundancy (e.g., MERGE statements redefining the same exact properties in multiple conflicting ways).
    
    Official Schema:
    {graph_schema}
    
    Cypher Statements to Validate:
    \"\"\"{cypher_to_test}\"\"\"
    
    Respond STRICTLY in JSON format with two keys:
    - "valid": boolean (true/false)
    - "errors": a list of strings detailing exactly what violates the schema or syntax logic. If valid is true, this list must be empty.
    """

    response = client.chat.completions.create(
        model="gpt-4o",  # or your preferred LLM
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    
    return response.choices[0].message.content
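
The function above returns the model's reply as a raw string, so the caller still has to parse it. A minimal sketch of that step follows; the helper name `parse_validation_result` is an assumption, and it deliberately fails closed, treating any malformed or contradictory reply as invalid rather than trusting it:

```python
import json

def parse_validation_result(raw: str) -> tuple[bool, list[str]]:
    """Turn the validator's JSON string into (valid, errors), failing closed."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False, ["Validator response was not valid JSON."]
    if not isinstance(data, dict) or not isinstance(data.get("valid"), bool):
        return False, ["Validator response is missing a boolean 'valid' key."]
    errors = data.get("errors") or []
    if not isinstance(errors, list):
        errors = [errors]
    errors = [str(e) for e in errors]
    # A reply that says "valid" but still lists errors is treated as invalid
    valid = data["valid"] and not errors
    if not valid and not errors:
        errors = ["Validator marked the Cypher invalid without giving a reason."]
    return valid, errors
```

Usage would look like `valid, errors = parse_validation_result(validate_cypher_semantics(cypher))`.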

3. Reviewing Your Example Data

If you run your specific example through the LLM validator above, it can catch semantic errors that a basic syntax parser would miss.

For instance, looking at your example string:

In your schema logic, a Part (like p3) is initially merged with properties: MERGE (p3:Part {id: 'abw_sensor_mic_mem', name: 'MEMS Transducer', ...})
But later in the script, you do: MERGE (p5:Part {id: 'abw_thermal_fan_mtr'}) MERGE (c4:Component {id: 'abw_control_cpu'}) MERGE (p5)-[:AFFECTS ...]->(c4)

The LLM can check whether p5 (a Part) affecting c4 (a Component) matches your engineering topology. It can also flag a subtler risk: MERGE matches on the entire pattern, so MERGE (p3:Part {id: '...'}) on id alone matches the existing node, while merging the same id with a different property map creates a second node, or fails if a uniqueness constraint on id is in place.

Summary Checklist for Production

Sanitize & Split: Use Python to trim whitespace and split the string by double newlines (\n\n) to evaluate statements iteratively.
Dry Run (EXPLAIN): Pass queries to a Neo4j test instance utilizing EXPLAIN to catch bad commas, unclosed brackets, or invalid keywords.
LLM Schema Audit: Pass the clean queries along with a strict definition text of your graph schema to the LLM to flag structural architectural errors.

Will validate_syntax_with_explain catch misassignments or misuse of variables in the cypher statements?

No, validate_syntax_with_explain will not catch most misassignments or variable misuses.

EXPLAIN parses and plans the query, so it checks that the query follows the Cypher grammar and passes basic compile-time checks. It ensures parentheses are closed, commas are in the right place, and keywords like MERGE or MATCH are used correctly.

Because Cypher is highly dynamic, a query can be syntactically flawless while still being completely broken logically, structurally, or semantically.

Here are the specific types of misassignments and misuses that EXPLAIN will miss, but that an LLM or database constraints can catch:

What `EXPLAIN` Fails to Catch

1. Variable Type Contradictions (Labels)

If a variable is bound to one label in one statement and to a completely different label in a later statement, EXPLAIN sees nothing wrong with it.

// Statement 1
MERGE (p3:Part {id: '123'})

// Statement 50
MERGE (p3:Component {id: '123'})

Why EXPLAIN misses it: Variables are scoped to a single statement, so when the two MERGE clauses are submitted separately the two p3 variables are unrelated and each statement is valid on its own. (If both clauses sit in the same statement, Neo4j rejects the second one at compile time, because a bound variable cannot be redeclared with labels or properties.)
Why it's a bug: In your domain schema, a node cannot be both a Part and a Component. This is a semantic violation.
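
A deterministic pre-check can complement the LLM for this case. The sketch below assumes node patterns are written as (variable:Label ...) and scans the whole script for variables that appear with more than one label. The regex and the name `find_label_conflicts` are illustrative; this is a text scan, not a Cypher parser, so text inside string literals that happens to look like a node pattern would also be picked up.

```python
import re
from collections import defaultdict

# Matches the start of a node pattern such as "(p3:Part"
NODE_DECL = re.compile(r"\(\s*([A-Za-z_]\w*)\s*:\s*([A-Za-z_]\w*)")

def find_label_conflicts(cypher: str) -> dict[str, set[str]]:
    """Return {variable: labels} for variables declared with 2+ labels."""
    labels_by_var = defaultdict(set)
    for var, label in NODE_DECL.findall(cypher):
        labels_by_var[var].add(label)
    return {var: labels for var, labels in labels_by_var.items() if len(labels) > 1}
```

An empty result means no variable changed label anywhere in the script.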

2. Property Inconsistencies

If you redefine a node with missing or conflicting properties in a later statement, EXPLAIN will pass it.

// Statement 1 (Detailed definition)
MERGE (s:System {id: 'abw', name: 'Acoustic Barrier Wall'})

// Statement 10 (Conflicting definition)
MERGE (s:System {id: 'xyz', name: 'Completely Different Name'})

Why EXPLAIN misses it: Each statement is valid on its own, and the s in statement 10 has no connection to the s in statement 1, so the database simply merges two different System nodes. EXPLAIN objects only when both clauses are in the same statement, where redeclaring s with a property map is a compile-time error. Either way, it cannot tell you which definition was intended.

3. Schema and Relationship Violations

EXPLAIN does not know or care about your business logic or graph data model.

MERGE (p1:Part)-[:HAS_SUBSYSTEM]->(s:System)

Why EXPLAIN misses it: Mechanically, it is perfectly valid Cypher to connect a Part to a System using HAS_SUBSYSTEM.
Why it's a bug: Your schema allows HAS_SUBSYSTEM only from a System to a Subsystem. Here the source is a Part and the target is a System, so both endpoints are wrong.
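
Because the allowed relationships form a fixed list, this check can also be done without an LLM. The following is a rough sketch under several assumptions: relationships are written left to right as (a)-[:TYPE]->(b), node property values contain no parentheses, and a variable's label is taken from its first declaration. The names `ALLOWED` and `find_schema_violations` are hypothetical, and an endpoint whose label cannot be resolved is reported as None.

```python
import re

ALLOWED = {
    ("System", "HAS_SUBSYSTEM", "Subsystem"),
    ("Subsystem", "HAS_COMPONENT", "Component"),
    ("Component", "HAS_PART", "Part"),
    ("Part", "AFFECTS", "Part"),
    ("Part", "AFFECTS", "Component"),
    ("Component", "AFFECTS", "Component"),
    ("Component", "AFFECTS", "Part"),
}

NODE_DECL = re.compile(r"\(\s*([A-Za-z_]\w*)\s*:\s*([A-Za-z_]\w*)")
# A node pattern: optional variable, optional label, optional property map
NODE = r"\(\s*(\w*)\s*(?::\s*(\w+))?[^()]*\)"
# The target node sits in a lookahead so chained patterns are all visited
REL = re.compile(NODE + r"\s*-\s*\[\s*\w*\s*:\s*(\w+)[^\]]*\]\s*->\s*(?=" + NODE + r")")

def find_schema_violations(cypher, allowed=ALLOWED):
    """Return (source_label, rel_type, target_label) triples not in the schema."""
    var_labels = {}
    for var, label in NODE_DECL.findall(cypher):
        var_labels.setdefault(var, label)  # first declaration wins
    violations = []
    for match in REL.finditer(cypher):
        src_var, src_label, rel_type, tgt_var, tgt_label = match.groups()
        triple = (
            src_label or var_labels.get(src_var),
            rel_type,
            tgt_label or var_labels.get(tgt_var),
        )
        if triple not in allowed:
            violations.append(triple)
    return violations
```

A check like this is cheap enough to run before the LLM audit, leaving the model to judge the cases a regex cannot.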

What `EXPLAIN` Will Catch

EXPLAIN is strictly a compiler-level check. It will only throw an error if you do something that breaks the language rules, such as:

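Undefined Variables: Referring to a variable in an expression when it was never bound in that statement (e.g., SET missing_node.name = 'x' or RETURN missing_node). Note that an unbound variable inside a MERGE or MATCH pattern, such as MERGE (missing_node)-[:AFFECTS]->(p1), is not an error; Cypher treats it as a new variable that can match any node.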
Syntax Typos: Writing MEGER instead of MERGE.
Invalid Expressions: Trying to perform mathematical operations on non-compatible data types explicitly declared in the query.

How to Bridge the Gap

To catch the variable misuses that EXPLAIN misses, you need to rely on the second layer of defense:

The LLM Linter (Semantic Validation): As shown in the previous prompt, an LLM excels at reading the entire string sequentially, keeping track of what variable p3 is supposed to mean, and raising a flag if p3 suddenly changes from a Part to a Component.
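Database Constraints (Schema Enforcement): If you want the database itself to reject duplicate nodes at runtime, you must pre-declare Uniqueness Constraints on your database instance before running the script:
CREATE CONSTRAINT FOR (p:Part) REQUIRE p.id IS UNIQUE; CREATE CONSTRAINT FOR (c:Component) REQUIRE c.id IS UNIQUE;
With these constraints active, Neo4j throws a runtime error if a MERGE tries to create a second Part (or a second Component) with an id that already exists, which happens when the same id is merged with a different property map. Uniqueness constraints apply per label, so they do not stop a Component from reusing the id of a Part; that cross-label check still needs the LLM linter or a check of your own.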
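
Before the script ever reaches the database, an id that appears under two different labels can also be spotted with a small text scan. This sketch assumes ids are single-quoted string literals inside the node's property map; `find_cross_label_ids` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Captures the label and the id value from patterns like "(p:Part {id: 'x1'"
ID_DECL = re.compile(r"\(\s*\w*\s*:\s*(\w+)\s*\{[^}]*?\bid\s*:\s*'([^']*)'")

def find_cross_label_ids(cypher: str) -> dict[str, set[str]]:
    """Return {id: labels} for ids that are used under more than one label."""
    labels_by_id = defaultdict(set)
    for label, node_id in ID_DECL.findall(cypher):
        labels_by_id[node_id].add(label)
    return {i: labels for i, labels in labels_by_id.items() if len(labels) > 1}
```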

Rather than splitting the python string into individual cypher statements, can I submit the entire string of 30 or more cypher statements as one call?

Yes, with one caveat. For the LLM, sending the entire string in one call is usually the better approach. The Neo4j driver accepts only one statement per call, so a single-call syntax check works when the string is one Cypher statement (for example, a chain of MERGE clauses with no semicolons between them).

Here is how you can process the entire string as one single call for both methods.

1. Syntax Validation: The Neo4j Multi-Statement Approach

If you pass a string containing multiple Cypher statements separated by semicolons to a Neo4j driver session.run() call, it throws an error because run() expects exactly one statement. session.execute_write() and session.execute_read() do not change this: they wrap a transaction function, and each tx.run() inside it still takes a single statement. Semicolon-separated scripts are split by client tools such as cypher-shell and Neo4j Browser, not by the driver.

What does work in one call is a block with no semicolons. A sequence of MERGE clauses separated only by newlines is a single Cypher statement, so the whole block can be prefixed with EXPLAIN once.

Here is how to validate the entire block over one session, handling both cases:

from neo4j import GraphDatabase
from neo4j.exceptions import Neo4jError

def validate_entire_block_syntax(cypher_block, uri, auth):
    """
    Validates a block of Cypher with EXPLAIN over a single session.
    A block without semicolons is sent as one statement in one call.
    A block with semicolons is split, because the driver accepts only
    one statement per run() call.
    """
    # Naive split: assumes no semicolon appears inside a string literal
    statements = [s.strip() for s in cypher_block.split(';') if s.strip()]

    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for statement in statements:
                try:
                    # EXPLAIN plans the statement without executing it
                    session.run(f"EXPLAIN {statement}").consume()
                except Neo4jError as e:
                    return False, f"Error found in statement:\n{statement}\n\nError: {e}"
    return True, "Entire block is syntactically valid."

Why a single statement is better for `EXPLAIN`:

Variables are scoped to one statement. If statement 1 binds (s:System) and a separately submitted statement 5 uses (s) in a MERGE pattern, there is no error: s is silently treated as a new variable, so the relationship may be matched or created on unrelated nodes. (An expression such as SET s.name = 'x' would fail with a "variable not defined" error instead.) Keeping the clauses in one statement lets the compiler resolve s to the node bound earlier, and it also lets EXPLAIN reject a variable that is redeclared with a different label or property map.

2. LLM Validation: The Single-Call Approach

For the LLM, sending the entire string at once is actually highly recommended. LLMs need global context. If you split the string, the LLM won't know that a variable typo on line 30 conflicts with a definition on line 2.

Your example string is roughly 400 words. Modern LLMs can easily handle tens of thousands of words in a single prompt, so 30+ Cypher statements are a walk in the park.

You can feed the entire string directly into the prompt you saw earlier:

# No splitting required! Just dump the raw triple-quoted string right in.
cypher_payload = """MERGE (s:System {id: 'abw', ...})
MERGE (sub1:Subsystem {id: 'abw_struct', ...})
...
MERGE (c2:Component)-[:AFFECTS]->(c1:Component)"""

# Call the LLM function once
json_validation_result = validate_cypher_semantics(cypher_payload)
print(json_validation_result)

Advantages of the Single LLM Call:

Cross-Reference Checking: The LLM can scan the entire document to ensure p3 didn't accidentally morph from a Part into a Component halfway through the file.
Speed and Cost: Making one API call with a larger prompt is significantly faster and often cheaper than making 30 individual round-trip API calls to an LLM.
Global Context: It can flag architectural issues, like finding isolated "islands" (nodes that don't connect to anything else in the entire dataset).
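
The "islands" check in particular does not have to be left to the LLM. As a rough sketch, assuming every node is declared as (variable:Label ...) and node property values contain no parentheses, the script can be scanned for declared variables that never appear at either end of a relationship. The function name `find_isolated_nodes` is illustrative.

```python
import re

NODE_DECL = re.compile(r"\(\s*([A-Za-z_]\w*)\s*:\s*[A-Za-z_]\w*")
# A node variable immediately followed by a relationship, e.g. "(a)-" or "(a)<-"
LEFT_END = re.compile(r"\(\s*([A-Za-z_]\w*)[^()]*\)\s*<?-")
# A node variable immediately preceded by a relationship, e.g. "->(b" or "-(b"
RIGHT_END = re.compile(r"->?\s*\(\s*([A-Za-z_]\w*)")

def find_isolated_nodes(cypher: str) -> list[str]:
    """Return declared node variables that take part in no relationship."""
    declared = set(NODE_DECL.findall(cypher))
    connected = set(LEFT_END.findall(cypher)) | set(RIGHT_END.findall(cypher))
    return sorted(declared - connected)
```

This only sees relationships created in the script itself; a node that is already connected in the database would still be reported.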

From <https://gemini.google.com/app/> Google Gemini (3.5 Flash)
