#3462 Google Gemini 2.5 Pro: Semantic Search Heat Map
#3462 Python. How can I produce a heat map of topics for a PDF document? I want to identify specific software process requirements in the document and produce a heat map showing their location and frequency
Visualizing Software Requirements: A Python-Powered Heatmap Approach
Transform your PDF software requirement documents into insightful heatmaps with this Python-based guide. This tutorial will walk you through the process of programmatically identifying and locating specific software process requirements within a PDF, culminating in a visually compelling heatmap that reveals their distribution and frequency.
This powerful technique allows for a rapid and comprehensive understanding of key areas within your requirement documents, highlighting clusters of critical information and ensuring no requirement is overlooked. By leveraging the capabilities of Python's robust libraries for PDF manipulation, natural language processing, and data visualization, you can unlock a new level of analysis for your software development lifecycle.
The Four-Step Blueprint to Requirement Insights
Our approach is broken down into four distinct yet interconnected steps, each utilizing specific Python libraries to achieve our goal. We will progress from raw PDF to a polished heatmap that tells a story about your document's content.
Step 1: Unearthing the Text with PyMuPDF
First, we need to extract the raw text from our PDF document while preserving crucial locational information. For this task, we'll use the PyMuPDF library, a powerful tool that not only extracts text but also provides details about its position on the page. This locational awareness is the cornerstone of our heatmap.
import fitz  # PyMuPDF

def extract_text_with_location(pdf_path):
    """
    Extracts text and its location (page number and y-coordinate) from a PDF.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        list: A list of tuples, where each tuple contains the page number,
              the y-coordinate of the text block, and the text itself.
    """
    doc_content = []
    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc):
        blocks = page.get_text("blocks")
        for block in blocks:
            # block[1] is the y-coordinate of the top of the text block
            doc_content.append((page_num + 1, block[1], block[4]))
    return doc_content
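As a quick sanity check, here is a minimal usage sketch; 'sample.pdf' is a placeholder path for your own document:

# Inspect the first few extracted blocks
blocks = extract_text_with_location('sample.pdf')
for page, y, text in blocks[:5]:
    print(f"page {page}, y={y:.1f}: {text[:60]!r}")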
Step 2: Defining and Identifying Requirement-Specific Language with spaCy and regex
With the text extracted, our next challenge is to pinpoint the software process requirements. These are often characterized by specific linguistic patterns, such as the use of modal verbs indicating necessity or obligation (e.g., "shall," "must," "should"). We will employ regular expressions (regex) to find these patterns, with the spaCy library available for more advanced natural language processing should you wish to expand on this foundational approach.
For this guide, we'll define a set of keywords and phrases that are strong indicators of a software requirement.
import re
import pandas as pd

def identify_requirements(text_data):
    """
    Identifies predefined software process requirements in the extracted text.

    Args:
        text_data (list): A list of tuples containing page number, y-coordinate, and text.

    Returns:
        pandas.DataFrame: A DataFrame with the identified requirements and their locations.
    """
    requirement_patterns = {
        'Functional': r'\b(shall|must)\b',
        'Non-Functional': r'\b(should|may|can)\b',
        'Performance': r'\b(performance|response time|throughput)\b',
        'Security': r'\b(security|authenticate|authorize|encrypt)\b',
    }
    identified_requirements = []
    for page_num, y_coord, text in text_data:
        for req_type, pattern in requirement_patterns.items():
            if re.search(pattern, text, re.IGNORECASE):
                identified_requirements.append({
                    'Page': page_num,
                    'Requirement Type': req_type,
                    'Text': text.strip()
                })
    return pd.DataFrame(identified_requirements)
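To see the shape of the output, here is a tiny illustrative run on hand-made input (both sentences are invented for demonstration):

# Illustrative (page, y-coordinate, text) tuples
sample_data = [
    (1, 72.0, "The system shall log all transactions."),
    (2, 110.5, "Response time should be under two seconds."),
]
print(identify_requirements(sample_data))
# Produces one row per (text, matching requirement type) pair,
# with columns: Page, Requirement Type, Text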
Step 3: Quantifying the Propensity - Creating the Data Matrix
To generate a heatmap, we need to transform our qualitative findings into a quantitative matrix. This matrix will represent the frequency of each requirement type on each page of the document. We will use the powerful pandas library to pivot our data into this format.
def create_heatmap_data(requirements_df):
    """
    Creates a matrix of requirement frequencies per page.

    Args:
        requirements_df (pandas.DataFrame): DataFrame of identified requirements.

    Returns:
        pandas.DataFrame: A DataFrame where rows are pages, columns are
                          requirement types, and values are their frequencies.
    """
    if requirements_df.empty:
        return pd.DataFrame()
    heatmap_data = requirements_df.groupby(['Page', 'Requirement Type']).size().unstack(fill_value=0)
    return heatmap_data
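If the groupby/unstack idiom is unfamiliar, this self-contained example with made-up rows shows the pivot it performs:

import pandas as pd

toy = pd.DataFrame([
    {'Page': 1, 'Requirement Type': 'Functional'},
    {'Page': 1, 'Requirement Type': 'Functional'},
    {'Page': 2, 'Requirement Type': 'Security'},
])
print(toy.groupby(['Page', 'Requirement Type']).size().unstack(fill_value=0))
# Requirement Type  Functional  Security
# Page
# 1                          2         0
# 2                          0         1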
Step 4: The Grand Finale - Visualizing with Seaborn and Matplotlib
Finally, we arrive at the visualization stage. With our data perfectly structured, we can now use the elegant Seaborn and Matplotlib libraries to generate a clear and informative heatmap. This visual representation will instantly reveal the "hotspots" of software process requirements within your PDF.
import seaborn as sns
import matplotlib.pyplot as plt

def generate_heatmap(heatmap_data, pdf_name):
    """
    Generates and displays a heatmap of requirement topics.

    Args:
        heatmap_data (pandas.DataFrame): The data for the heatmap.
        pdf_name (str): The name of the PDF file for the title.
    """
    if heatmap_data.empty:
        print("No requirements found to generate a heatmap.")
        return
    plt.figure(figsize=(12, 8))
    sns.heatmap(heatmap_data, annot=True, fmt="d", cmap="viridis")
    plt.title(f'Heatmap of Software Process Requirements in {pdf_name}')
    plt.xlabel('Requirement Type')
    plt.ylabel('Page Number')
    plt.show()
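If you are working in a non-interactive environment (a server, a CI job), a small variation saves the figure instead of displaying it; the filename here is just an example:

# Call before plt.show() (or instead of it): save the figure to disk
plt.savefig('requirements_heatmap.png', dpi=150, bbox_inches='tight')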
Putting It All Together: The Complete Workflow
Now, let's assemble these components into a single, executable script.
import fitz
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def extract_text_with_location(pdf_path):
    ...  # (Insert the function body from Step 1 here)

def identify_requirements(text_data):
    ...  # (Insert the function body from Step 2 here)

def create_heatmap_data(requirements_df):
    ...  # (Insert the function body from Step 3 here)

def generate_heatmap(heatmap_data, pdf_name):
    ...  # (Insert the function body from Step 4 here)

if __name__ == '__main__':
    # Replace with the path to your PDF document
    pdf_path = 'your_document.pdf'
    pdf_name = pdf_path.split('/')[-1]

    # Step 1: Extract text and location
    document_text_data = extract_text_with_location(pdf_path)

    # Step 2: Identify requirements
    requirements_dataframe = identify_requirements(document_text_data)

    # Step 3: Create heatmap data
    heatmap_matrix = create_heatmap_data(requirements_dataframe)

    # Step 4: Generate heatmap
    generate_heatmap(heatmap_matrix, pdf_name)
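One caveat: groupby only emits rows for pages that contain at least one match, so requirement-free pages vanish from the heatmap's y-axis. A small sketch to restore them (assuming at least one requirement was found) is:

# Reindex so every page appears on the y-axis, even with zero matches
num_pages = fitz.open(pdf_path).page_count
heatmap_matrix = heatmap_matrix.reindex(range(1, num_pages + 1), fill_value=0)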
Expanding and Customizing Your Analysis
This guide provides a solid foundation for producing heatmaps of software process requirements. To further enhance your analysis, consider the following:
- Refining Requirement Definitions: Expand the requirement_patterns dictionary with more nuanced regular expressions, or integrate spaCy's part-of-speech and dependency parsing to identify more complex requirement structures (a minimal sketch follows this list).
- Granularity of Location: For a more detailed view, you can modify the extract_text_with_location function to capture the y-coordinate of each line within a text block, offering a finer-grained positional analysis.
- Advanced Topic Modeling: For documents without clear keyword indicators, explore unsupervised learning techniques like Latent Dirichlet Allocation (LDA) using the gensim library to discover latent topics that may correspond to different types of requirements.
- Interactive Visualizations: For an even more engaging experience, consider using libraries like Plotly to create interactive heatmaps that allow users to hover over cells and see the specific requirement text.
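To illustrate the spaCy idea from the first item, here is a minimal sketch using spaCy's rule-based Matcher to flag sentences containing a modal followed by a verb (it assumes the en_core_web_sm model is installed via python -m spacy download en_core_web_sm):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Match a modal followed by a verb, e.g. "shall encrypt", "must provide"
matcher.add('REQUIREMENT_MODAL', [[
    {'LOWER': {'IN': ['shall', 'must', 'should']}},
    {'POS': 'VERB'},
]])

doc = nlp("The system shall encrypt all stored data. The logo is blue.")
for _, start, end in matcher(doc):
    # Print the full sentence containing each match
    print(doc[start:end].sent.text.strip())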
By following this methodology, you can transform your static PDF requirement documents into dynamic and insightful visualizations, empowering you and your team to better understand, analyze, and manage the critical foundations of your software projects.
I plan to apply semantic search to match specified requirements with applicable document text.
Excellent. Applying semantic search is a powerful evolution of the initial idea. Instead of relying on specific keywords, you'll be able to match the meaning and intent behind your specified requirements with the text in your document. This allows you to find relevant content even if it doesn't use the exact same phrasing.
Here is a revised, step-by-step guide on how to produce a heatmap using semantic search in Python.
The Semantic Search Blueprint: From Intent to Insight
We will use the sentence-transformers library, a state-of-the-art framework for generating text embeddings, to understand the semantic content of your document.
Core Concept: We will convert both your specified requirements and the text on each page of your PDF into numerical vectors (embeddings). The "closeness" of these vectors, measured by cosine similarity, indicates how semantically similar they are. A high similarity score means the document text strongly relates to the requirement, even with different wording.
Required Libraries: You will need to install the following libraries:
pip install sentence-transformers torch PyMuPDF pandas seaborn matplotlib
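To make the core concept concrete, here is a minimal sketch (all three sentences are invented examples) showing that semantically related sentences score higher than unrelated ones:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode([
    "All user data must be encrypted at rest.",    # requirement
    "Stored records are protected with AES-256.",  # related wording
    "The quarterly marketing budget increased.",   # unrelated
], convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]).item())  # relatively high score
print(util.cos_sim(emb[0], emb[2]).item())  # relatively low score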
Step 1: Extract and Chunk Text by Page
The first step remains the same: extract text from the PDF. However, for semantic search, it's more effective to group the text by page rather than by small text blocks. This provides more context for generating meaningful embeddings.
import fitz  # PyMuPDF

def extract_text_by_page(pdf_path):
    """
    Extracts text from a PDF, grouping it by page.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        list: A list of strings, where each string is the full text of a page.
    """
    doc = fitz.open(pdf_path)
    page_texts = [page.get_text() for page in doc]
    return page_texts
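If whole pages prove too coarse (a long page can dilute one relevant paragraph), a finer-grained variant chunks each page before embedding. This is only a sketch: it splits on blank lines, and how well that matches real paragraphs depends on the PDF's layout:

def extract_paragraphs_by_page(pdf_path):
    """Sketch: return (page_number, paragraph_text) pairs, splitting pages on blank lines."""
    doc = fitz.open(pdf_path)
    chunks = []
    for page_num, page in enumerate(doc, start=1):
        for para in page.get_text().split("\n\n"):
            if para.strip():
                chunks.append((page_num, para.strip()))
    return chunks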
Step 2: Perform Semantic Search with Sentence Transformers
This is the core of our new approach. We will define our requirements, load a pre-trained model, and then encode both the requirements and the document's page text into vectors to calculate their similarity.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

def perform_semantic_search(requirements_list, document_pages):
    """
    Calculates the semantic similarity between a list of requirements
    and the text of each document page.

    Args:
        requirements_list (list): A list of requirement strings.
        document_pages (list): A list of strings, each being the text of a page.

    Returns:
        pandas.DataFrame: A DataFrame containing pages, requirements, and their
                          cosine similarity scores.
    """
    # Load a pre-trained model. 'all-MiniLM-L6-v2' is a good, fast choice.
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Encode the requirements and the document pages into vectors
    requirement_embeddings = model.encode(requirements_list, convert_to_tensor=True)
    page_embeddings = model.encode(document_pages, convert_to_tensor=True)

    # Calculate cosine similarity between all requirements and all pages.
    # The result is a matrix where rows correspond to requirements and columns to pages.
    cosine_scores = util.cos_sim(requirement_embeddings, page_embeddings)

    # Process the results into a clean DataFrame
    search_results = []
    for i, requirement in enumerate(requirements_list):
        for j, page_text in enumerate(document_pages):
            search_results.append({
                'Page': j + 1,
                'Requirement': requirement,
                'Similarity': cosine_scores[i][j].item()  # .item() gets the Python number
            })
    return pd.DataFrame(search_results)
Step 3: Structure the Data for the Heatmap
Now we need to pivot the data. We'll create a matrix where rows are the page numbers, columns are your specified requirements, and the cell values are the semantic similarity scores. This matrix will be the direct input for our heatmap.
def create_similarity_matrix(search_results_df):
    """
    Creates a matrix of similarity scores for the heatmap.

    Args:
        search_results_df (pandas.DataFrame): DataFrame of search results.

    Returns:
        pandas.DataFrame: A DataFrame where rows are pages, columns are
                          requirements, and values are similarity scores.
    """
    if search_results_df.empty:
        return pd.DataFrame()
    # Use pivot_table to reshape the DataFrame
    similarity_matrix = search_results_df.pivot_table(
        index='Page',
        columns='Requirement',
        values='Similarity'
    )
    return similarity_matrix
Step 4: Generate the Semantic Heatmap
Finally, we visualize the similarity matrix. The color intensity in each cell will now represent the degree of semantic relevance between a page and a requirement.
import seaborn as sns
import matplotlib.pyplot as plt

def generate_semantic_heatmap(similarity_matrix, pdf_name):
    """
    Generates and displays a heatmap of semantic similarity scores.

    Args:
        similarity_matrix (pandas.DataFrame): The data for the heatmap.
        pdf_name (str): The name of the PDF file for the title.
    """
    if similarity_matrix.empty:
        print("Could not generate heatmap, no data available.")
        return
    plt.figure(figsize=(14, 10))
    # Use a continuous colormap like 'YlGnBu' and format annotations as floats
    sns.heatmap(
        similarity_matrix,
        annot=True,
        fmt=".2f",      # Format to 2 decimal places
        cmap="YlGnBu",  # A good colormap for similarity
        linewidths=.5
    )
    plt.title(f'Semantic Similarity Heatmap for {pdf_name}')
    plt.xlabel('Specified Requirements')
    plt.ylabel('Page Number')
    plt.xticks(rotation=45, ha='right')  # Rotate labels for better readability
    plt.tight_layout()  # Adjust layout to make room for labels
    plt.show()
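Because raw cosine scores typically occupy a narrow band well below 1.0, the default color scaling can look washed out. Two optional tweaks, sketched below (the vmin/vmax values are assumptions to tune for your model and corpus), can be swapped into the sns.heatmap call above:

# Option A: pin the color scale so differences in a narrow score band stand out
sns.heatmap(similarity_matrix, annot=True, fmt=".2f", cmap="YlGnBu",
            linewidths=.5, vmin=0.0, vmax=0.6)

# Option B: min-max scale each requirement's column before plotting, so every
# requirement's strongest page reads as "hot" regardless of absolute score
normalized = (similarity_matrix - similarity_matrix.min()) / (
    similarity_matrix.max() - similarity_matrix.min())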
Putting It All Together: The Complete Semantic Workflow
Here is the complete script. Simply define your requirements and point it to your PDF.
import fitz
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import seaborn as sns
import matplotlib.pyplot as plt

def extract_text_by_page(pdf_path):
    doc = fitz.open(pdf_path)
    page_texts = [page.get_text() for page in doc]
    return page_texts

def perform_semantic_search(requirements_list, document_pages):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    requirement_embeddings = model.encode(requirements_list, convert_to_tensor=True)
    page_embeddings = model.encode(document_pages, convert_to_tensor=True)
    cosine_scores = util.cos_sim(requirement_embeddings, page_embeddings)
    search_results = []
    for i, requirement in enumerate(requirements_list):
        for j, page_text in enumerate(document_pages):
            search_results.append({
                'Page': j + 1,
                'Requirement': requirement,
                'Similarity': cosine_scores[i][j].item()
            })
    return pd.DataFrame(search_results)

def create_similarity_matrix(search_results_df):
    if search_results_df.empty:
        return pd.DataFrame()
    similarity_matrix = search_results_df.pivot_table(
        index='Page',
        columns='Requirement',
        values='Similarity'
    )
    return similarity_matrix

def generate_semantic_heatmap(similarity_matrix, pdf_name):
    if similarity_matrix.empty:
        print("Could not generate heatmap, no data available.")
        return
    plt.figure(figsize=(14, 10))
    sns.heatmap(
        similarity_matrix,
        annot=True,
        fmt=".2f",
        cmap="YlGnBu",
        linewidths=.5
    )
    plt.title(f'Semantic Similarity Heatmap for {pdf_name}')
    plt.xlabel('Specified Requirements')
    plt.ylabel('Page Number')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    # 1. DEFINE YOUR REQUIREMENTS
    # These should be descriptive sentences.
    my_requirements = [
        "The system must support single sign-on (SSO) authentication.",
        "All user data must be encrypted at rest and in transit.",
        "The user interface should be responsive and accessible on mobile devices.",
        "The system needs to generate and export monthly performance reports.",
        "User roles and permissions must be configurable by an administrator."
    ]

    # 2. SPECIFY YOUR PDF DOCUMENT
    pdf_path = 'your_document.pdf'
    pdf_name = pdf_path.split('/')[-1]

    # --- Execute the Workflow ---
    # Step 1: Extract text
    pages = extract_text_by_page(pdf_path)

    # Step 2: Perform semantic search
    results_df = perform_semantic_search(my_requirements, pages)

    # Step 3: Create the similarity matrix
    matrix = create_similarity_matrix(results_df)

    # Step 4: Generate the heatmap
    generate_semantic_heatmap(matrix, pdf_name)
This semantic approach provides a far more nuanced and accurate heatmap, showing you not just where keywords appear, but where the concepts of your requirements are discussed, leading to a much deeper understanding of your document's structure and content.
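A natural follow-on question is "which page best covers each requirement?" The similarity matrix answers this directly; here is a minimal sketch (the 0.5 threshold is an arbitrary assumption you should calibrate against your own documents):

# For each requirement (column), report the best-matching page and score
for requirement in matrix.columns:
    best_page = matrix[requirement].idxmax()
    best_score = matrix[requirement].max()
    note = "" if best_score >= 0.5 else "  <-- weak match, review manually"
    print(f"{requirement[:50]:<50} page {best_page} ({best_score:.2f}){note}")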
From <https://gemini.google.com/> Google Gemini 2.5 Pro