html file to demo multimodal #199
Reviewer's Guide

Introduces a standalone HTML-based demo that captures live webcam frames, embeds them with user text instructions into multimodal chat completion requests, and streams model responses in real time via a configurable UI, accompanied by detailed setup and usage documentation.
Summary of Changes

Hello @ericcurtin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request adds a standalone HTML file that creates a web application to showcase the multimodal support within llama.cpp.
Pull Request Overview
This PR introduces an HTML demo application that showcases multimodal capabilities by integrating webcam functionality with llama.cpp. The demo allows users to capture webcam images and send them with text instructions to a local API endpoint for AI-powered image analysis.
- Adds interactive webcam capture functionality with real-time streaming
- Implements API integration to send multimodal requests (text + images) to llama.cpp
- Provides configurable request intervals and user-friendly controls for starting/stopping the demo
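For orientation, the multimodal request described above can be sketched as follows. The field layout follows the OpenAI-compatible chat-completions format that llama.cpp serves; the model name appears in the diff, but the prompt text and image value here are illustrative placeholders, not the PR's exact code.

```javascript
// Hedged sketch of the request body the demo sends: one user message whose
// content mixes a text part and an image_url part carrying a base64 data URL.
const imageBase64URL = "data:image/jpeg;base64,/9j/..."; // would come from a canvas capture

const body = {
  model: "ai/smolvlm:500M-Q8_0",
  max_tokens: 100,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What do you see in this image?" },
        { type: "image_url", image_url: { url: imageBase64URL } },
      ],
    },
  ],
};
```

The same shape works with any OpenAI-compatible vision endpoint; only the `model` string and base URL change.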
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
This isn't ready yet, but we can continue to iterate on it.
Hey there - I've reviewed your changes - here's some feedback:
- Fix the malformed baseURL input element that is closed with </textarea> — input elements shouldn’t have a </textarea> closing tag.
- isProcessing is used both to track run state and in-flight requests, so overlapping sendData calls can still happen; introduce a separate flag or lock to prevent concurrent fetches.
Code Review
This pull request introduces an HTML file for a webcam-based multimodal demo. The implementation is a good starting point, but I've identified a few significant issues related to HTML validity, potential race conditions with API requests, and unsafe handling of API responses. My review includes suggestions to correct an HTML syntax error, prevent request storms by implementing a locking mechanism, and improve the robustness of response handling to avoid runtime errors.
Demos llama.cpp multimodal support with a webcam. Signed-off-by: Eric Curtin <eric.curtin@docker.com>
d9e9175 to fecff1e
@ericcurtin let us know when it's ready for review.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```js
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'ai/smolvlm:500M-Q8_0',
```
The model name is hardcoded. Consider making it configurable through the UI or at least adding a comment explaining why this specific model is used.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
```js
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'ai/smolvlm:500M-Q8_0',
```
[nitpick] The model name is hardcoded in the JavaScript. Consider making this configurable via a form input to allow users to test with different models without modifying the code.
```js
      },
      body: JSON.stringify({
        model: 'ai/smolvlm:500M-Q8_0',
        max_tokens: 100,
```
[nitpick] The max_tokens value is hardcoded. Consider making this configurable to allow users to adjust response length based on their needs.
What do you prefer?
- hardcoded model
- input with default value
- model selector
The issue with model selector is that there are few models that support multimodal, so most models will throw an error if used here.
Hey there - I've reviewed your changes - here's some feedback:
- Consider moving the inline CSS and JavaScript into separate files (e.g. demo.css and demo.js) to improve maintainability and reduce page clutter.
- Make the model name, max_tokens, and exact API endpoint path configurable from the UI rather than hard-coding them in sendChatCompletionRequest so you can easily swap models or endpoints without editing the HTML.
- Add more robust error and state handling around camera initialization and fetch failures (invalid Base URL, CORS errors, network timeouts) to surface clearer user feedback and avoid silent skips.
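The error-handling suggestion in the last bullet could look like the sketch below. The function name, timeout value, and message wording are assumptions, not the PR's code; the point is aborting slow requests and surfacing distinct errors for timeouts versus network/CORS failures.

```javascript
// Hedged sketch: POST with a timeout via AbortController, mapping the two
// common failure modes to user-readable messages instead of failing silently.
async function postWithTimeout(url, body, timeoutMs = 15000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      signal: controller.signal,
    });
  } catch (err) {
    if (err.name === "AbortError") {
      throw new Error(`Request timed out after ${timeoutMs} ms`);
    }
    // fetch rejects with a TypeError for network failures, CORS blocks,
    // and invalid base URLs -- worth reporting distinctly to the user
    throw new Error(`Network error (check the Base URL and CORS): ${err.message}`);
  } finally {
    clearTimeout(timer);
  }
}
```

The caller can then show `err.message` directly in the response textarea rather than skipping the tick silently.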
## Individual Comments
### Comment 1
<location> `demos/multimodal/demo.html:119-120` </location>
<code_context>
+
+ let stream;
+ let intervalId;
+ let isProcessing = false;
+ let isWaitingForResponse = false;
+
+ // Returns response text (string)
</code_context>
<issue_to_address>
**issue (bug_risk):** Concurrent request control may not prevent overlapping requests reliably.
Since isWaitingForResponse is set before fetch and reset after, multiple requests can still be sent if sendData is triggered again before the previous fetch resolves. Consider tracking the active promise or pausing the interval until the response is received to ensure only one request is in flight.
</issue_to_address>
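One way to implement the "track the active promise" fix from this comment is sketched below, with hypothetical names mirroring the demo's `sendData()`; `fakeRequest()` stands in for the real fetch.

```javascript
// Keep a handle to the in-flight promise; overlapping calls are skipped
// instead of stacking up, so at most one request is ever outstanding.
let inFlight = null;
let completed = 0;

async function fakeRequest() {
  // stands in for the real chat-completions fetch
  await new Promise((resolve) => setTimeout(resolve, 50));
  completed++;
}

async function sendData() {
  if (inFlight) return "skipped"; // a request is already in flight
  inFlight = fakeRequest();
  try {
    await inFlight;
    return "done";
  } finally {
    inFlight = null; // release the lock even if the request throws
  }
}

// Two overlapping calls: only the first actually issues a request;
// Promise.all([sendData(), sendData()]) resolves to ["done", "skipped"].
```

An equivalent alternative is to clear the interval before the fetch and re-arm it in `finally`, which also keeps only one request in flight.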
### Comment 2
<location> `demos/multimodal/demo.html:148` </location>
<code_context>
+ return `Server error: ${response.status} - ${errorData}`;
+ }
+ const data = await response.json();
+ return data.choices[0].message.content;
+ }
+
</code_context>
<issue_to_address>
**suggestion:** No error handling for unexpected API response structure.
Add checks to verify that data.choices and data.choices[0].message.content exist before accessing them, and return a fallback error message if they are missing.
```suggestion
const data = await response.json();
if (
!data.choices ||
!Array.isArray(data.choices) ||
!data.choices[0] ||
!data.choices[0].message ||
typeof data.choices[0].message.content !== "string"
) {
return "API response error: Unexpected response structure.";
}
```
</issue_to_address>
### Comment 3
<location> `demos/multimodal/demo.html:166` </location>
<code_context>
+ }
+
+ function captureImage() {
+ if (!stream || !video.videoWidth) {
+ console.warn("Video stream not ready for capture.");
+ return null;
</code_context>
<issue_to_address>
**nitpick:** Warning message for video not ready may be too generic.
Consider providing distinct warnings for a missing stream and for video.videoWidth being zero to aid in troubleshooting.
</issue_to_address>
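The distinct-warning idea could be factored as below; `stream` and `video` are stand-ins for the demo's globals, and the helper name and message wording are assumptions.

```javascript
// Classify why a capture can't proceed, so the console tells the user
// whether the camera never started or simply hasn't delivered frames yet.
function describeCaptureBlocker(stream, video) {
  if (!stream) {
    return "No camera stream: the camera was never started or has been stopped.";
  }
  if (!video.videoWidth) {
    return "Camera stream is active but no frames have been decoded yet (videoWidth is 0).";
  }
  return null; // ready to capture
}

// captureImage() could then log the specific reason:
// const blocker = describeCaptureBlocker(stream, video);
// if (blocker) { console.warn(blocker); return null; }
```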
### Comment 4
<location> `demos/multimodal/demo.html:185-188` </location>
<code_context>
+ const instruction = instructionText.value;
+ const imageBase64URL = captureImage();
+
+ if (!imageBase64URL) {
+ responseText.value = "Failed to capture image. Stream might not be active.";
+ // Optionally stop processing if image capture fails consistently
+ // handleStop();
+ return;
+ }
</code_context>
<issue_to_address>
**suggestion:** Commented-out code for stopping processing on image capture failure should be removed or handled.
If you intend to stop processing after repeated failures, implement a retry or error threshold. Otherwise, remove the commented-out 'handleStop();' to avoid confusion.
```suggestion
responseText.value = "Failed to capture image. Stream might not be active.";
return;
```
</issue_to_address>
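If the author does want the stop-after-repeated-failures behavior instead of deleting the comment, the threshold logic could look like this (constant and names are assumptions, not the PR's code):

```javascript
// Stop the demo only after several consecutive capture failures, rather
// than on the first one; any success resets the counter.
const MAX_CONSECUTIVE_FAILURES = 3;
let consecutiveFailures = 0;

function recordCaptureResult(ok) {
  if (ok) {
    consecutiveFailures = 0;
    return "continue";
  }
  consecutiveFailures++;
  return consecutiveFailures >= MAX_CONSECUTIVE_FAILURES ? "stop" : "continue";
}

// sendData() would call handleStop() when this returns "stop".
```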
@ericcurtin @p1-0tr @doringeman

> The issue with model selector is that there are few models that support multimodal, so most models will throw an error if used here.
I vote option 2 (input with sensible default)
I would say 3 (but you can leave a little note that this application will only work with multimodal-compatible models). The problem with a sensible default for me is: what is the sensible default? (The one that springs to mind is "", because we don't know what models are available.)
I would vote for 3 and also add a link to Hub with a filter for multimodal models.
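A DOM-free sketch of the selection fallback being discussed in this thread (function name and behavior are assumptions, not the PR's code): prefer the recommended multimodal model when it is available, otherwise fall back to the first listed model.

```javascript
// Pick the default entry for the model selector: the recommended multimodal
// model if installed, else the first available model, else an empty selection.
const RECOMMENDED_MODEL = "ai/smolvlm:500M-Q8_0";

function chooseDefaultModel(models, recommended = RECOMMENDED_MODEL) {
  if (models.includes(recommended)) return recommended;
  return models[0] ?? ""; // empty selection when no models are available
}
```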
Thanks! so I finally used a model selector:
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
```js
// Select the first model
modelSelect.value = models[0];
// Show suggestion about recommended model
modelWarning.innerHTML = `💡 We recommend to run this demo with <code>${RECOMMENDED_MODEL}</code>`;
```
Using innerHTML with template literals containing variables can lead to XSS if the variable contains malicious content. Since RECOMMENDED_MODEL is a constant, this is lower risk, but consider using textContent or proper HTML escaping for consistency.
```suggestion
// Safely set the warning message without using innerHTML
modelWarning.textContent = ""; // Clear previous content
modelWarning.append("💡 We recommend to run this demo with ");
const codeElem = document.createElement("code");
codeElem.textContent = RECOMMENDED_MODEL;
modelWarning.appendChild(codeElem);
```
Demos llama.cpp multimodal support with a webcam.
Summary by Sourcery
Add a new real-time webcam-based multimodal demo for llama.cpp with UI controls and API interaction, along with accompanying documentation.