#3462 Google Gemini 2.5 Pro: Semantic Search Heat Map
#3462 Python. How can I produce a heat map of topics for a PDF document? I want to identify specific software process requirements in the document and produce a heat map showing their location and frequency
Visualizing Software Requirements: A Python-Powered Heatmap Approach
Transform your PDF software requirement documents into insightful heatmaps with this Python-based guide. This tutorial will walk you through the process of programmatically identifying and locating specific software process requirements within a PDF, culminating in a visually compelling heatmap that reveals their distribution and frequency.
This powerful technique allows for a rapid and comprehensive understanding of key areas within your requirement documents, highlighting clusters of critical information and ensuring no requirement is overlooked. By leveraging the capabilities of Python's robust libraries for PDF manipulation, natural language processing, and data visualization, you can unlock a new level of analysis for your software development lifecycle.
The Four-Step Blueprint to Requirement Insights
Our approach is broken down into four distinct yet interconnected steps, each utilizing specific Python libraries to achieve our goal. We will progress from raw PDF to a polished heatmap that tells a story about your document's content.
Step 1: Unearthing the Text with PyMuPDF
First, we need to extract the raw text from our PDF document while preserving crucial locational information. For this task, we'll use the PyMuPDF library, a powerful tool that not only extracts text but also provides details about its position on the page. This locational awareness is the cornerstone of our heatmap.
import fitz  # PyMuPDF

def extract_text_with_location(pdf_path):
    """
    Extracts text and its location (page number and y-coordinate) from a PDF.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        list: A list of tuples, where each tuple contains the page number,
              the y-coordinate of the text block, and the text itself.
    """
    doc_content = []
    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc):
        blocks = page.get_text("blocks")
        for block in blocks:
            # block[1] is the y-coordinate of the top of the text block
            doc_content.append((page_num + 1, block[1], block[4]))
    return doc_content
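As a quick sanity check, here is a minimal usage sketch; 'sample.pdf' is a placeholder path for your own document:

# Inspect the first few extracted blocks
blocks = extract_text_with_location('sample.pdf')
for page, y, text in blocks[:5]:
    print(f"page {page}, y={y:.1f}: {text[:60]!r}")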
Step 2: Defining and Identifying Requirement-Specific Language with spaCy and regex
With the text extracted, our next challenge is to pinpoint the software process requirements. These are often characterized by specific linguistic patterns, such as the use of modal verbs indicating necessity or obligation (e.g., "shall," "must," "should"). We will employ regular expressions (regex) to find these patterns, with the spaCy library available for more advanced natural language processing should you wish to expand on this foundational approach.
For this guide, we'll define a set of keywords and phrases that are strong indicators of a software requirement.
import re
import pandas as pd

def identify_requirements(text_data):
    """
    Identifies predefined software process requirements in the extracted text.

    Args:
        text_data (list): A list of tuples containing page number, y-coordinate, and text.

    Returns:
        pandas.DataFrame: A DataFrame with the identified requirements and their locations.
    """
    requirement_patterns = {
        'Functional': r'\b(shall|must)\b',
        'Non-Functional': r'\b(should|may|can)\b',
        'Performance': r'\b(performance|response time|throughput)\b',
        'Security': r'\b(security|authenticate|authorize|encrypt)\b',
    }
    identified_requirements = []
    for page_num, y_coord, text in text_data:
        for req_type, pattern in requirement_patterns.items():
            if re.search(pattern, text, re.IGNORECASE):
                identified_requirements.append({
                    'Page': page_num,
                    'Requirement Type': req_type,
                    'Text': text.strip()
                })
    return pd.DataFrame(identified_requirements)
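To see the shape of the output, here is a tiny illustrative run on hand-made input (both sentences are invented for demonstration):

# Illustrative (page, y-coordinate, text) tuples
sample_data = [
    (1, 72.0, "The system shall log all transactions."),
    (2, 110.5, "Response time should be under two seconds."),
]
print(identify_requirements(sample_data))
# Produces one row per (text, matching requirement type) pair,
# with columns: Page, Requirement Type, Text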
Step 3: Quantifying the Propensity - Creating the Data Matrix
To generate a heatmap, we need to transform our qualitative findings into a quantitative matrix. This matrix will represent the frequency of each requirement type on each page of the document. We will use the powerful pandas library to pivot our data into this format.
def create_heatmap_data(requirements_df):
    """
    Creates a matrix of requirement frequencies per page.

    Args:
        requirements_df (pandas.DataFrame): DataFrame of identified requirements.

    Returns:
        pandas.DataFrame: A DataFrame where rows are pages, columns are
                          requirement types, and values are their frequencies.
    """
    if requirements_df.empty:
        return pd.DataFrame()
    heatmap_data = requirements_df.groupby(['Page', 'Requirement Type']).size().unstack(fill_value=0)
    return heatmap_data
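If the groupby/unstack idiom is unfamiliar, this self-contained example with made-up rows shows the pivot it performs:

import pandas as pd

toy = pd.DataFrame([
    {'Page': 1, 'Requirement Type': 'Functional'},
    {'Page': 1, 'Requirement Type': 'Functional'},
    {'Page': 2, 'Requirement Type': 'Security'},
])
print(toy.groupby(['Page', 'Requirement Type']).size().unstack(fill_value=0))
# Requirement Type  Functional  Security
# Page
# 1                          2         0
# 2                          0         1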
Step 4: The Grand Finale - Visualizing with Seaborn and Matplotlib
Finally, we arrive at the visualization stage. With our data perfectly structured, we can now use the elegant Seaborn and Matplotlib libraries to generate a clear and informative heatmap. This visual representation will instantly reveal the "hotspots" of software process requirements within your PDF.
import seaborn as sns
import matplotlib.pyplot as plt

def generate_heatmap(heatmap_data, pdf_name):
    """
    Generates and displays a heatmap of requirement topics.

    Args:
        heatmap_data (pandas.DataFrame): The data for the heatmap.
        pdf_name (str): The name of the PDF file for the title.
    """
    if heatmap_data.empty:
        print("No requirements found to generate a heatmap.")
        return
    plt.figure(figsize=(12, 8))
    sns.heatmap(heatmap_data, annot=True, fmt="d", cmap="viridis")
    plt.title(f'Heatmap of Software Process Requirements in {pdf_name}')
    plt.xlabel('Requirement Type')
    plt.ylabel('Page Number')
    plt.show()
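If you are working in a non-interactive environment (a server, a CI job), a small variation saves the figure instead of displaying it; the filename here is just an example:

# Call before plt.show() (or instead of it): save the figure to disk
plt.savefig('requirements_heatmap.png', dpi=150, bbox_inches='tight')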
Putting It All Together: The Complete Workflow
Now, let's assemble these components into a single, executable script.
import fitz
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def extract_text_with_location(pdf_path):
    ...  # (Insert the function body from Step 1 here)

def identify_requirements(text_data):
    ...  # (Insert the function body from Step 2 here)

def create_heatmap_data(requirements_df):
    ...  # (Insert the function body from Step 3 here)

def generate_heatmap(heatmap_data, pdf_name):
    ...  # (Insert the function body from Step 4 here)

if __name__ == '__main__':
    # Replace with the path to your PDF document
    pdf_path = 'your_document.pdf'
    pdf_name = pdf_path.split('/')[-1]

    # Step 1: Extract text and location
    document_text_data = extract_text_with_location(pdf_path)

    # Step 2: Identify requirements
    requirements_dataframe = identify_requirements(document_text_data)

    # Step 3: Create heatmap data
    heatmap_matrix = create_heatmap_data(requirements_dataframe)

    # Step 4: Generate heatmap
    generate_heatmap(heatmap_matrix, pdf_name)
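One caveat: groupby only emits rows for pages that contain at least one match, so requirement-free pages vanish from the heatmap's y-axis. A small sketch to restore them (assuming at least one requirement was found) is:

# Reindex so every page appears on the y-axis, even with zero matches
num_pages = fitz.open(pdf_path).page_count
heatmap_matrix = heatmap_matrix.reindex(range(1, num_pages + 1), fill_value=0)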
Expanding and Customizing Your Analysis
This guide provides a solid foundation for producing heatmaps of software process requirements. To further enhance your analysis, consider the following:
- Refining Requirement Definitions: Expand the requirement_patterns dictionary with more nuanced regular expressions, or integrate spaCy's part-of-speech and dependency parsing to identify more complex requirement structures (a minimal sketch follows this list).
- Granularity of Location: For a more detailed view, you can modify the extract_text_with_location function to capture the y-coordinate of each line within a text block, offering a finer-grained positional analysis.
- Advanced Topic Modeling: For documents without clear keyword indicators, explore unsupervised learning techniques like Latent Dirichlet Allocation (LDA) using the gensim library to discover latent topics that may correspond to different types of requirements.
- Interactive Visualizations: For an even more engaging experience, consider using libraries like Plotly to create interactive heatmaps that allow users to hover over cells and see the specific requirement text.
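To illustrate the spaCy idea from the first item, here is a minimal sketch using spaCy's rule-based Matcher to flag sentences containing a modal followed by a verb (it assumes the en_core_web_sm model is installed via python -m spacy download en_core_web_sm):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Match a modal followed by a verb, e.g. "shall encrypt", "must provide"
matcher.add('REQUIREMENT_MODAL', [[
    {'LOWER': {'IN': ['shall', 'must', 'should']}},
    {'POS': 'VERB'},
]])

doc = nlp("The system shall encrypt all stored data. The logo is blue.")
for _, start, end in matcher(doc):
    # Print the full sentence containing each match
    print(doc[start:end].sent.text.strip())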
By following this methodology, you can transform your static PDF requirement documents into dynamic and insightful visualizations, empowering you and your team to better understand, analyze, and manage the critical foundations of your software projects.
I plan to apply semantic search to match specified requirements with applicable document text.
Excellent. Applying semantic search is a powerful evolution of the initial idea. Instead of relying on specific keywords, you'll be able to match the meaning and intent behind your specified requirements with the text in your document. This allows you to find relevant content even if it doesn't use the exact same phrasing.
Here is a revised, step-by-step guide on how to produce a heatmap using semantic search in Python.
The Semantic Search Blueprint: From Intent to Insight
We will use the sentence-transformers library, a state-of-the-art framework for generating text embeddings, to understand the semantic content of your document.
Core Concept: We will convert both your specified requirements and the text on each page of your PDF into numerical vectors (embeddings). The "closeness" of these vectors, measured by cosine similarity, indicates how semantically similar they are. A high similarity score means the document text strongly relates to the requirement, even with different wording.
Required Libraries: You will need to install the following libraries:
pip install sentence-transformers torch PyMuPDF pandas seaborn matplotlib
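To make the core concept concrete, here is a minimal sketch (all three sentences are invented examples) showing that semantically related sentences score higher than unrelated ones:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode([
    "All user data must be encrypted at rest.",    # requirement
    "Stored records are protected with AES-256.",  # related wording
    "The quarterly marketing budget increased.",   # unrelated
], convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]).item())  # relatively high score
print(util.cos_sim(emb[0], emb[2]).item())  # relatively low score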
Step 1: Extract and Chunk Text by Page
The first step remains the same: extract text from the PDF. However, for semantic search, it's more effective to group the text by page rather than by small text blocks. This provides more context for generating meaningful embeddings.
import fitz  # PyMuPDF

def extract_text_by_page(pdf_path):
    """
    Extracts text from a PDF, grouping it by page.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        list: A list of strings, where each string is the full text of a page.
    """
    doc = fitz.open(pdf_path)
    page_texts = [page.get_text() for page in doc]
    return page_texts
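If whole pages prove too coarse (a long page can dilute one relevant paragraph), a finer-grained variant chunks each page before embedding. This is only a sketch: it splits on blank lines, and how well that matches real paragraphs depends on the PDF's layout:

def extract_paragraphs_by_page(pdf_path):
    """Sketch: return (page_number, paragraph_text) pairs, splitting pages on blank lines."""
    doc = fitz.open(pdf_path)
    chunks = []
    for page_num, page in enumerate(doc, start=1):
        for para in page.get_text().split("\n\n"):
            if para.strip():
                chunks.append((page_num, para.strip()))
    return chunks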
Step 2: Perform Semantic Search with Sentence Transformers
This is the core of our new approach. We will define our requirements, load a pre-trained model, and then encode both the requirements and the document's page text into vectors to calculate their similarity.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

def perform_semantic_search(requirements_list, document_pages):
    """
    Calculates the semantic similarity between a list of requirements
    and the text of each document page.

    Args:
        requirements_list (list): A list of requirement strings.
        document_pages (list): A list of strings, each being the text of a page.

    Returns:
        pandas.DataFrame: A DataFrame containing pages, requirements, and their
                          cosine similarity scores.
    """
    # Load a pre-trained model. 'all-MiniLM-L6-v2' is a good, fast choice.
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Encode the requirements and the document pages into vectors
    requirement_embeddings = model.encode(requirements_list, convert_to_tensor=True)
    page_embeddings = model.encode(document_pages, convert_to_tensor=True)

    # Calculate cosine similarity between all requirements and all pages.
    # The result is a matrix where rows correspond to requirements and columns to pages.
    cosine_scores = util.cos_sim(requirement_embeddings, page_embeddings)

    # Process the results into a clean DataFrame
    search_results = []
    for i, requirement in enumerate(requirements_list):
        for j, page_text in enumerate(document_pages):
            search_results.append({
                'Page': j + 1,
                'Requirement': requirement,
                'Similarity': cosine_scores[i][j].item()  # .item() gets the Python number
            })
    return pd.DataFrame(search_results)
Step 3: Structure the Data for the Heatmap
Now we need to pivot the data. We'll create a matrix where rows are the page numbers, columns are your specified requirements, and the cell values are the semantic similarity scores. This matrix will be the direct input for our heatmap.
def create_similarity_matrix(search_results_df):
    """
    Creates a matrix of similarity scores for the heatmap.

    Args:
        search_results_df (pandas.DataFrame): DataFrame of search results.

    Returns:
        pandas.DataFrame: A DataFrame where rows are pages, columns are
                          requirements, and values are similarity scores.
    """
    if search_results_df.empty:
        return pd.DataFrame()
    # Use pivot_table to reshape the DataFrame
    similarity_matrix = search_results_df.pivot_table(
        index='Page',
        columns='Requirement',
        values='Similarity'
    )
    return similarity_matrix
Step 4: Generate the Semantic Heatmap
Finally, we visualize the similarity matrix. The color intensity in each cell will now represent the degree of semantic relevance between a page and a requirement.
import seaborn as sns
import matplotlib.pyplot as plt

def generate_semantic_heatmap(similarity_matrix, pdf_name):
    """
    Generates and displays a heatmap of semantic similarity scores.

    Args:
        similarity_matrix (pandas.DataFrame): The data for the heatmap.
        pdf_name (str): The name of the PDF file for the title.
    """
    if similarity_matrix.empty:
        print("Could not generate heatmap, no data available.")
        return
    plt.figure(figsize=(14, 10))
    # Use a continuous colormap like 'YlGnBu' and format annotations as floats
    sns.heatmap(
        similarity_matrix,
        annot=True,
        fmt=".2f",      # Format to 2 decimal places
        cmap="YlGnBu",  # A good colormap for similarity
        linewidths=.5
    )
    plt.title(f'Semantic Similarity Heatmap for {pdf_name}')
    plt.xlabel('Specified Requirements')
    plt.ylabel('Page Number')
    plt.xticks(rotation=45, ha='right')  # Rotate labels for better readability
    plt.tight_layout()  # Adjust layout to make room for labels
    plt.show()
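Because raw cosine scores typically occupy a narrow band well below 1.0, the default color scaling can look washed out. Two optional tweaks, sketched below (the vmin/vmax values are assumptions to tune for your model and corpus), can be swapped into the sns.heatmap call above:

# Option A: pin the color scale so differences in a narrow score band stand out
sns.heatmap(similarity_matrix, annot=True, fmt=".2f", cmap="YlGnBu",
            linewidths=.5, vmin=0.0, vmax=0.6)

# Option B: min-max scale each requirement's column before plotting, so every
# requirement's strongest page reads as "hot" regardless of absolute score
normalized = (similarity_matrix - similarity_matrix.min()) / (
    similarity_matrix.max() - similarity_matrix.min())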
Putting It All Together: The Complete Semantic Workflow
Here is the complete script. Simply define your requirements and point it to your PDF.
import fitz
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import seaborn as sns
import matplotlib.pyplot as plt

def extract_text_by_page(pdf_path):
    doc = fitz.open(pdf_path)
    page_texts = [page.get_text() for page in doc]
    return page_texts

def perform_semantic_search(requirements_list, document_pages):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    requirement_embeddings = model.encode(requirements_list, convert_to_tensor=True)
    page_embeddings = model.encode(document_pages, convert_to_tensor=True)
    cosine_scores = util.cos_sim(requirement_embeddings, page_embeddings)
    search_results = []
    for i, requirement in enumerate(requirements_list):
        for j, page_text in enumerate(document_pages):
            search_results.append({
                'Page': j + 1,
                'Requirement': requirement,
                'Similarity': cosine_scores[i][j].item()
            })
    return pd.DataFrame(search_results)

def create_similarity_matrix(search_results_df):
    if search_results_df.empty:
        return pd.DataFrame()
    similarity_matrix = search_results_df.pivot_table(
        index='Page',
        columns='Requirement',
        values='Similarity'
    )
    return similarity_matrix

def generate_semantic_heatmap(similarity_matrix, pdf_name):
    if similarity_matrix.empty:
        print("Could not generate heatmap, no data available.")
        return
    plt.figure(figsize=(14, 10))
    sns.heatmap(
        similarity_matrix,
        annot=True,
        fmt=".2f",
        cmap="YlGnBu",
        linewidths=.5
    )
    plt.title(f'Semantic Similarity Heatmap for {pdf_name}')
    plt.xlabel('Specified Requirements')
    plt.ylabel('Page Number')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    # 1. DEFINE YOUR REQUIREMENTS
    # These should be descriptive sentences.
    my_requirements = [
        "The system must support single sign-on (SSO) authentication.",
        "All user data must be encrypted at rest and in transit.",
        "The user interface should be responsive and accessible on mobile devices.",
        "The system needs to generate and export monthly performance reports.",
        "User roles and permissions must be configurable by an administrator."
    ]

    # 2. SPECIFY YOUR PDF DOCUMENT
    pdf_path = 'your_document.pdf'
    pdf_name = pdf_path.split('/')[-1]

    # --- Execute the Workflow ---
    # Step 1: Extract text
    pages = extract_text_by_page(pdf_path)

    # Step 2: Perform semantic search
    results_df = perform_semantic_search(my_requirements, pages)

    # Step 3: Create the similarity matrix
    matrix = create_similarity_matrix(results_df)

    # Step 4: Generate the heatmap
    generate_semantic_heatmap(matrix, pdf_name)
This semantic approach provides a far more nuanced and accurate heatmap, showing you not just where keywords appear, but where the concepts of your requirements are discussed, leading to a much deeper understanding of your document's structure and content.
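A natural follow-on question is "which page best covers each requirement?" The similarity matrix answers this directly; here is a minimal sketch (the 0.5 threshold is an arbitrary assumption you should calibrate against your own documents):

# For each requirement (column), report the best-matching page and score
for requirement in matrix.columns:
    best_page = matrix[requirement].idxmax()
    best_score = matrix[requirement].max()
    note = "" if best_score >= 0.5 else "  <-- weak match, review manually"
    print(f"{requirement[:50]:<50} page {best_page} ({best_score:.2f}){note}")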
From <https://gemini.google.com/> Google Gemini 2.5 Pro