Add PDF data extractor demo by ilopezluna · Pull Request #206 · docker/model-runner

ilopezluna · 2025-10-09T15:12:36Z

This pull request introduces a new demo application for extracting structured data from PDF files using JSON schemas and AI models. It adds a backend server, project configuration, documentation, and supporting files for the demo.

Given a PDF and a JSON Schema, the demo will:

1. Parse the PDF
2. Send the parsed PDF and the expected schema to the LLM
3. Print the resulting JSON returned by the LLM
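
The second step above can be sketched as a request body for an OpenAI-compatible chat completions endpoint. This is only an illustration: the PR does not show how `pdf-data-extractor` talks to the model, so the `response_format` layout, the prompt wording, the function name `buildExtractionRequest`, and the model name are all assumptions.

```javascript
// Sketch only: builds the JSON body for an OpenAI-compatible
// /chat/completions request. The `response_format` layout follows the
// OpenAI structured-output convention and is an assumption here.
function buildExtractionRequest(pdfText, schema, model) {
  return {
    model,
    messages: [
      {
        role: 'system',
        content: 'Extract the requested fields from the document. Reply with JSON only.',
      },
      { role: 'user', content: pdfText },
    ],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'extraction', schema },
    },
  };
}

// Hypothetical usage: the text, schema, and model name are placeholders.
const body = buildExtractionRequest(
  'Invoice 123, total due: 42.00 EUR',
  { type: 'object', properties: { total: { type: 'number' } }, required: ['total'] },
  'example/model'
);
console.log(JSON.stringify(body, null, 2));
```

Keeping the payload construction in a pure function makes it easy to inspect what is sent to the model before any network call is made.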

Summary by Sourcery

Add a complete PDF Data Extractor demo featuring a web UI and a Node.js backend to configure AI models, define extraction schemas, upload PDFs, and retrieve structured JSON output

New Features:

Add demo HTML interface for configuring API settings, selecting AI models, defining JSON schemas, uploading PDFs, and displaying extraction results
Introduce Express server endpoints for health checks, fetching available models, and extracting structured data from PDFs via pdf-data-extractor

Build:

Include package.json with demo dependencies and scripts, and add .gitignore for upload artifacts

Documentation:

Add README with setup instructions, prerequisites, and usage guide for the demo

Copilot

Pull Request Overview

This PR introduces a complete demo application for extracting structured data from PDF files using JSON schemas and AI models. The demo includes both backend infrastructure and a user-friendly web interface.

Key changes:

Backend Express server with PDF upload and processing endpoints
Web interface with configuration, schema definition, and file upload capabilities
Integration with Docker Model Runner for AI-powered data extraction

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `demos/extractor/server.js` | Express server providing API endpoints for model listing and PDF data extraction |
| `demos/extractor/package.json` | Node.js project configuration with required dependencies |
| `demos/extractor/demo.html` | Frontend web interface for interacting with the PDF extraction service |
| `demos/extractor/README.md` | Documentation explaining setup, prerequisites, and usage instructions |
| `demos/extractor/.gitignore` | Git ignore rules for dependencies, uploads, and temporary files |

Comments suppressed due to low confidence (1)

demos/extractor/demo.html:1

The referenced sample PDF file 'invoice.pdf' is mentioned but not included in the PR. Either include the file or update the documentation to reflect available samples.

<!DOCTYPE html>

sourcery-ai · 2025-10-09T15:28:24Z

🧙 Sourcery has finished reviewing your pull request!

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

A CSRF middleware was not detected in your express application. Ensure you are either using one such as csurf or csrf (see rule references) and/or you are properly doing CSRF validation in your routes with a token or cookies.

General comments:

Consider serving demo.html and its assets via Express's static middleware (instead of opening file://) to avoid CORS/file origin issues and streamline the demo setup.
Extract the inline CSS and JavaScript in demo.html into separate .css and .js files to keep the markup clean and improve maintainability.
Make the CLIENT_SERVER_URL (and other endpoints) on the client dynamic—e.g., derive from window.location or a config input—so you don't have to manually edit the HTML when pointing to a different server.
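
The last suggestion could look roughly like the helper below. It is a sketch under assumptions: the name `resolveServerUrl`, the idea of an override coming from a config input, and the fallback `http://localhost:3000` are hypothetical and not taken from the PR.

```javascript
// Hypothetical helper: picks the backend URL for the demo page.
// `loc` is window.location in a browser; `override` would come from a config input.
function resolveServerUrl(loc, override) {
  if (override && override.trim() !== '') {
    return override.trim().replace(/\/+$/, ''); // drop trailing slashes
  }
  // Pages opened via file:// have no usable origin, so fall back to a default.
  if (!loc || loc.protocol === 'file:') {
    return 'http://localhost:3000'; // assumed default, not from the PR
  }
  return loc.origin;
}

// In the browser this might be used as:
// const CLIENT_SERVER_URL = resolveServerUrl(window.location, serverUrlInput.value);
```

Taking the location object as a parameter, rather than reading `window.location` directly, keeps the function testable outside a browser.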

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Consider serving demo.html and its assets via Express's static middleware (instead of opening file://) to avoid CORS/file origin issues and streamline the demo setup.
- Extract the inline CSS and JavaScript in demo.html into separate .css and .js files to keep the markup clean and improve maintainability.
- Make the CLIENT_SERVER_URL (and other endpoints) on the client dynamic—e.g., derive from window.location or a config input—so you don't have to manually edit the HTML when pointing to a different server.

## Individual Comments

### Comment 1
<location> `demos/extractor/server.js:16-18` </location>
<code_context>
+app.use(express.json());
+
+// Configure multer for file upload
+const upload = multer({ 
+  dest: 'uploads/',
+  limits: { fileSize: 10 * 1024 * 1024 } // 10MB limit
+});
+
</code_context>

<issue_to_address>
**suggestion (performance):** Uploaded files are stored in a local 'uploads/' directory, which may accumulate files if errors occur before cleanup.

Consider adding a periodic cleanup process or a mechanism to remove orphaned files if the server crashes before cleanup occurs.
</issue_to_address>
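
A periodic sweep of the kind suggested in Comment 1 might be sketched as follows. The function name `removeStaleUploads`, the one-hour age limit, and the ten-minute interval are illustrative assumptions; only Node's built-in `fs/promises` and `path` modules are used.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Deletes regular files in `dir` whose modification time is older than `maxAgeMs`.
// Returns the names of the files that were removed.
async function removeStaleUploads(dir, maxAgeMs, now = Date.now()) {
  const removed = [];
  let entries;
  try {
    entries = await fs.readdir(dir, { withFileTypes: true });
  } catch (err) {
    if (err.code === 'ENOENT') return removed; // directory not created yet
    throw err;
  }
  for (const entry of entries) {
    if (!entry.isFile()) continue;
    const filePath = path.join(dir, entry.name);
    try {
      const { mtimeMs } = await fs.stat(filePath);
      if (now - mtimeMs > maxAgeMs) {
        await fs.unlink(filePath);
        removed.push(entry.name);
      }
    } catch (err) {
      // A request handler may have deleted the file in the meantime.
      if (err.code !== 'ENOENT') console.error(`Cleanup failed for ${filePath}`, err);
    }
  }
  return removed;
}

// Hypothetical wiring: sweep every 10 minutes, dropping files older than 1 hour.
// const timer = setInterval(
//   () => removeStaleUploads(path.join(__dirname, 'uploads'), 60 * 60 * 1000).catch(console.error),
//   10 * 60 * 1000
// );
// timer.unref(); // do not keep the process alive just for the sweep
```

An age threshold well above the expected extraction time avoids deleting a file that a request is still processing.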

### Comment 2
<location> `demos/extractor/server.js:104-110` </location>
<code_context>
+    };
+
+    // Add optional parameters if provided
+    if (temperature !== undefined && temperature !== '') {
+      extractOptions.temperature = parseFloat(temperature);
+    }
+    if (maxTokens !== undefined && maxTokens !== '') {
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Temperature and maxTokens are parsed without validation for NaN or out-of-range values.

Validate 'temperature' and 'maxTokens' to ensure they are numbers and within acceptable ranges before using them.

```suggestion
    // Add optional parameters if provided, with validation
    if (temperature !== undefined && temperature !== '') {
      const tempValue = parseFloat(temperature);
      if (!isNaN(tempValue) && tempValue >= 0 && tempValue <= 2) {
        extractOptions.temperature = tempValue;
      } else {
        console.warn(`Invalid temperature value: ${temperature}. Must be a number between 0 and 2.`);
      }
    }
    if (maxTokens !== undefined && maxTokens !== '') {
      const maxTokensValue = parseInt(maxTokens, 10);
      if (!isNaN(maxTokensValue) && maxTokensValue > 0 && maxTokensValue <= 4096) {
        extractOptions.maxTokens = maxTokensValue;
      } else {
        console.warn(`Invalid maxTokens value: ${maxTokens}. Must be a positive integer up to 4096.`);
      }
    }
```
</issue_to_address>

### Comment 3
<location> `demos/extractor/server.js:112-113` </location>
<code_context>
+    console.log(`Extracting data from PDF using model: ${model}`);
+    const result = await extractor.extract(extractOptions);
+
+    // Clean up uploaded file
+    await fs.unlink(pdfPath);
+    pdfPath = null;
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Uploaded PDF is deleted after extraction, but errors before this point may leave files behind.

Consider using a 'finally' block or centralized cleanup to guarantee file deletion even if errors occur before the current cleanup step.

```suggestion
    let result;
    try {
      console.log(`Extracting data from PDF using model: ${model}`);
      result = await extractor.extract(extractOptions);
    } finally {
      // Clean up uploaded file
      if (pdfPath) {
        try {
          await fs.unlink(pdfPath);
        } catch (cleanupErr) {
          console.error(`Failed to delete uploaded PDF: ${pdfPath}`, cleanupErr);
        }
        pdfPath = null;
      }
    }
```
</issue_to_address>

### Comment 4
<location> `demos/extractor/server.js:154` </location>
<code_context>
+});
+
+// Create uploads directory if it doesn't exist
+const uploadsDir = path.join(__dirname, 'uploads');
+fs.mkdir(uploadsDir, { recursive: true }).catch(console.error);
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Uploads directory is created asynchronously at startup, which may race with incoming requests.

To prevent multer errors, create the uploads directory synchronously before server startup.
</issue_to_address>
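
One way to follow the suggestion in Comment 4 is to create the directory with the synchronous `fs.mkdirSync` before the server starts listening. The helper name `ensureUploadsDir` and the startup order shown in the comments are assumptions for illustration, not the code that was merged.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Creates the uploads directory synchronously and returns its path.
// With `recursive: true`, mkdirSync does not throw if the directory already exists.
function ensureUploadsDir(baseDir) {
  const uploadsDir = path.join(baseDir, 'uploads');
  fs.mkdirSync(uploadsDir, { recursive: true });
  return uploadsDir;
}

// Hypothetical startup order: the directory exists before the server listens.
// const uploadsDir = ensureUploadsDir(__dirname);
// app.listen(PORT, () => console.log(`Listening on ${PORT}`));
```

Blocking briefly at startup is usually acceptable because no requests are being served yet, and a failure surfaces immediately instead of on the first upload.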

### Comment 5
<location> `demos/extractor/server.js:8` </location>
<code_context>
const app = express();
</code_context>

<issue_to_address>
**security (javascript.express.security.audit.express-check-csurf-middleware-usage):** A CSRF middleware was not detected in your express application. Ensure you are either using one such as `csurf` or `csrf` (see rule references) and/or you are properly doing CSRF validation in your routes with a token or cookies.

*Source: opengrep*
</issue_to_address>
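
For the "CSRF validation in your routes with a token" option mentioned in Comment 5, a manual token check could be sketched as below. This is not the `csurf` middleware; the function names are hypothetical, and where the expected token is stored (session, cookie) is deliberately left out.

```javascript
const crypto = require('node:crypto');

// Generates a random token to embed in the page and echo back on POST.
function createCsrfToken() {
  return crypto.randomBytes(32).toString('hex');
}

// Constant-time comparison of the expected token and the one sent by the client.
function isValidCsrfToken(expected, received) {
  if (typeof expected !== 'string' || typeof received !== 'string') return false;
  const a = Buffer.from(expected);
  const b = Buffer.from(received);
  // timingSafeEqual throws on length mismatch, so check the length first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

A route handler would reject the request (for example with HTTP 403) when `isValidCsrfToken` returns false.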

Copilot

Pull Request Overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

ilopezluna · 2025-10-10T12:11:01Z

@sourcery-ai dismiss

Automated Sourcery review dismissed.

feat(demo): add PDF data extractor demo with upload and extraction features

15c3594

ilopezluna requested review from a team and Copilot October 9, 2025 15:12

Copilot AI reviewed Oct 9, 2025

Comment thread demos/extractor/server.js Outdated

Comment thread demos/extractor/server.js

sourcery-ai Bot previously requested changes Oct 9, 2025

Comment thread demos/extractor/server.js

Comment thread demos/extractor/server.js

Comment thread demos/extractor/server.js

Comment thread demos/extractor/server.js Outdated

Comment thread demos/extractor/server.js

doringeman approved these changes Oct 10, 2025

ilopezluna added 3 commits October 10, 2025 13:04

fix(pdf-extraction): improve file cleanup and error handling for PDF uploads

782e7ec

fix(pdf-extraction): improve file cleanup and error handling for PDF uploads

2014ca4

Merge branch 'main' into add-demo-to-extract-data-from-pdf

b1e63bf

Copilot AI review requested due to automatic review settings October 10, 2025 11:59

Copilot AI reviewed Oct 10, 2025

Comment thread demos/extractor/demo.html

Comment thread demos/extractor/README.md

ilopezluna merged commit 7dcf42e into main Oct 10, 2025
4 of 5 checks passed

ilopezluna deleted the add-demo-to-extract-data-from-pdf branch October 10, 2025 12:11

ilopezluna merged 4 commits into main from add-demo-to-extract-data-from-pdf

3 participants
