Skip to content

Add PDF data extractor demo#206

Merged
ilopezluna merged 4 commits intomainfrom
add-demo-to-extract-data-from-pdf
Oct 10, 2025
Merged

Add PDF data extractor demo#206
ilopezluna merged 4 commits intomainfrom
add-demo-to-extract-data-from-pdf

Conversation

@ilopezluna
Copy link
Copy Markdown
Contributor

@ilopezluna ilopezluna commented Oct 9, 2025

This pull request introduces a new demo application for extracting structured data from PDF files using JSON schemas and AI models. It adds a backend server, project configuration, documentation, and supporting files for the demo.

Given a PDF and a JSON Schema it will:

  1. Parse the PDF
  2. Send the parsed PDF and the expected schema to the LLM
  3. Print the resulting json returned by the LLM

Summary by Sourcery

Add a complete PDF Data Extractor demo featuring a web UI and a Node.js backend to configure AI models, define extraction schemas, upload PDFs, and retrieve structured JSON output

New Features:

  • Add demo HTML interface for configuring API settings, selecting AI models, defining JSON schemas, uploading PDFs, and displaying extraction results
  • Introduce Express server endpoints for health checks, fetching available models, and extracting structured data from PDFs via pdf-data-extractor

Build:

  • Include package.json with demo dependencies and scripts, and add .gitignore for upload artifacts

Documentation:

  • Add README with setup instructions, prerequisites, and usage guide for the demo

@ilopezluna ilopezluna requested review from a team and Copilot October 9, 2025 15:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a complete demo application for extracting structured data from PDF files using JSON schemas and AI models. The demo includes both backend infrastructure and a user-friendly web interface.

Key changes:

  • Backend Express server with PDF upload and processing endpoints
  • Web interface with configuration, schema definition, and file upload capabilities
  • Integration with Docker Model Runner for AI-powered data extraction

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
demos/extractor/server.js Express server providing API endpoints for model listing and PDF data extraction
demos/extractor/package.json Node.js project configuration with required dependencies
demos/extractor/demo.html Frontend web interface for interacting with the PDF extraction service
demos/extractor/README.md Documentation explaining setup, prerequisites, and usage instructions
demos/extractor/.gitignore Git ignore rules for dependencies, uploads, and temporary files
Comments suppressed due to low confidence (1)

demos/extractor/demo.html:1

  • The referenced sample PDF file 'invoice.pdf' is mentioned but not included in the PR. Either include the file or update the documentation to reflect available samples.
<!DOCTYPE html>

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread demos/extractor/server.js Outdated
Comment thread demos/extractor/server.js
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented Oct 9, 2025

🧙 Sourcery has finished reviewing your pull request!


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

sourcery-ai[bot]
sourcery-ai Bot previously requested changes Oct 9, 2025
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • A CSRF middleware was not detected in your express application. Ensure you are either using one such as csurf or csrf (see rule references) and/or you are properly doing CSRF validation in your routes with a token or cookies. (link)

General comments:

  • Consider serving demo.html and its assets via Express's static middleware (instead of opening file://) to avoid CORS/file origin issues and streamline the demo setup.
  • Extract the inline CSS and JavaScript in demo.html into separate .css and .js files to keep the markup clean and improve maintainability.
  • Make the CLIENT_SERVER_URL (and other endpoints) on the client dynamic—e.g., derive from window.location or a config input—so you don't have to manually edit the HTML when pointing to a different server.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider serving demo.html and its assets via Express's static middleware (instead of opening file://) to avoid CORS/file origin issues and streamline the demo setup.
- Extract the inline CSS and JavaScript in demo.html into separate .css and .js files to keep the markup clean and improve maintainability.
- Make the CLIENT_SERVER_URL (and other endpoints) on the client dynamic—e.g., derive from window.location or a config input—so you don't have to manually edit the HTML when pointing to a different server.

## Individual Comments

### Comment 1
<location> `demos/extractor/server.js:16-18` </location>
<code_context>
+app.use(express.json());
+
+// Configure multer for file upload
+const upload = multer({ 
+  dest: 'uploads/',
+  limits: { fileSize: 10 * 1024 * 1024 } // 10MB limit
+});
+
</code_context>

<issue_to_address>
**suggestion (performance):** Uploaded files are stored in a local 'uploads/' directory, which may accumulate files if errors occur before cleanup.

Consider adding a periodic cleanup process or a mechanism to remove orphaned files if the server crashes before cleanup occurs.
</issue_to_address>

### Comment 2
<location> `demos/extractor/server.js:104-110` </location>
<code_context>
+    };
+
+    // Add optional parameters if provided
+    if (temperature !== undefined && temperature !== '') {
+      extractOptions.temperature = parseFloat(temperature);
+    }
+    if (maxTokens !== undefined && maxTokens !== '') {
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Temperature and maxTokens are parsed without validation for NaN or out-of-range values.

Validate 'temperature' and 'maxTokens' to ensure they are numbers and within acceptable ranges before using them.

```suggestion
    // Add optional parameters if provided, with validation
    if (temperature !== undefined && temperature !== '') {
      const tempValue = parseFloat(temperature);
      if (!isNaN(tempValue) && tempValue >= 0 && tempValue <= 2) {
        extractOptions.temperature = tempValue;
      } else {
        console.warn(`Invalid temperature value: ${temperature}. Must be a number between 0 and 2.`);
      }
    }
    if (maxTokens !== undefined && maxTokens !== '') {
      const maxTokensValue = parseInt(maxTokens, 10);
      if (!isNaN(maxTokensValue) && maxTokensValue > 0 && maxTokensValue <= 4096) {
        extractOptions.maxTokens = maxTokensValue;
      } else {
        console.warn(`Invalid maxTokens value: ${maxTokens}. Must be a positive integer up to 4096.`);
      }
    }
```
</issue_to_address>

### Comment 3
<location> `demos/extractor/server.js:112-113` </location>
<code_context>
+    console.log(`Extracting data from PDF using model: ${model}`);
+    const result = await extractor.extract(extractOptions);
+
+    // Clean up uploaded file
+    await fs.unlink(pdfPath);
+    pdfPath = null;
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Uploaded PDF is deleted after extraction, but errors before this point may leave files behind.

Consider using a 'finally' block or centralized cleanup to guarantee file deletion even if errors occur before the current cleanup step.

```suggestion
    let result;
    try {
      console.log(`Extracting data from PDF using model: ${model}`);
      result = await extractor.extract(extractOptions);
    } finally {
      // Clean up uploaded file
      if (pdfPath) {
        try {
          await fs.unlink(pdfPath);
        } catch (cleanupErr) {
          console.error(`Failed to delete uploaded PDF: ${pdfPath}`, cleanupErr);
        }
        pdfPath = null;
      }
    }
```
</issue_to_address>

### Comment 4
<location> `demos/extractor/server.js:154` </location>
<code_context>
+});
+
+// Create uploads directory if it doesn't exist
+const uploadsDir = path.join(__dirname, 'uploads');
+fs.mkdir(uploadsDir, { recursive: true }).catch(console.error);
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Uploads directory is created asynchronously at startup, which may race with incoming requests.

To prevent multer errors, create the uploads directory synchronously before server startup.
</issue_to_address>

### Comment 5
<location> `demos/extractor/server.js:8` </location>
<code_context>
const app = express();
</code_context>

<issue_to_address>
**security (javascript.express.security.audit.express-check-csurf-middleware-usage):** A CSRF middleware was not detected in your express application. Ensure you are either using one such as `csurf` or `csrf` (see rule references) and/or you are properly doing CSRF validation in your routes with a token or cookies.

*Source: opengrep*
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread demos/extractor/server.js
Comment thread demos/extractor/server.js
Comment thread demos/extractor/server.js
Comment thread demos/extractor/server.js Outdated
Comment thread demos/extractor/server.js
Copilot AI review requested due to automatic review settings October 10, 2025 11:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread demos/extractor/demo.html
Comment thread demos/extractor/README.md
@ilopezluna
Copy link
Copy Markdown
Contributor Author

@sourcery-ai dismiss

@sourcery-ai sourcery-ai Bot dismissed their stale review October 10, 2025 12:11

Automated Sourcery review dismissed.

@ilopezluna ilopezluna merged commit 7dcf42e into main Oct 10, 2025
4 of 5 checks passed
@ilopezluna ilopezluna deleted the add-demo-to-extract-data-from-pdf branch October 10, 2025 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants