feat: instrumentation based opentelemetry collection #3361
base: master
Changes from 8 commits
c4d9871
3afef13
3bcb862
3ba871e
c7df026
f77071f
bbb9c00
bb9e5fc
7314d31
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| --- | ||
| id: trace-and-monitor-crawlers | ||
| title: Trace and monitor crawlers | ||
| description: How to use OpenTelemetry to trace and monitor your crawlers | ||
| --- | ||
|
|
||
| import ApiLink from '@site/src/components/ApiLink'; | ||
| import Tabs from '@theme/Tabs'; | ||
| import TabItem from '@theme/TabItem'; | ||
| import CodeBlock from '@theme/CodeBlock'; | ||
|
|
||
| import SetupSource from '!!raw-loader!./trace_and_monitor_setup.ts'; | ||
| import BasicExampleSource from '!!raw-loader!./trace_and_monitor_basic.ts'; | ||
| import WrapWithSpanSource from '!!raw-loader!./trace_and_monitor_wrap_with_span.ts'; | ||
| import CustomInstrumentationSource from '!!raw-loader!./trace_and_monitor_custom.ts'; | ||
|
|
||
| [OpenTelemetry](https://opentelemetry.io/) is a collection of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. You can learn more about its basic concepts in the [OpenTelemetry documentation](https://opentelemetry.io/docs/concepts/). | ||
|
|
||
| In this guide, we'll show you how to set up OpenTelemetry and instrument your Crawlee crawlers to see traces of individual requests as they are processed. OpenTelemetry on its own does not provide visualization tools, so we'll use [Jaeger](https://www.jaegertracing.io/) as our tracing backend. Feel free to use any other OpenTelemetry-compatible backend; check the [OpenTelemetry vendors list](https://opentelemetry.io/ecosystem/vendors/) for more options. | ||
|
|
||
| ## Set up Jaeger | ||
|
|
||
| This guide will show you how to set up the environment locally to run the example code and visualize the telemetry data in Jaeger running in a [Docker](https://docs.docker.com/engine/install/) container. | ||
|
|
||
| To start the preconfigured Docker container, create a `docker-compose.yml` file: | ||
|
|
||
| ```yaml | ||
| services: | ||
| jaeger: | ||
| image: jaegertracing/all-in-one:1.53 | ||
| container_name: jaeger | ||
| ports: | ||
| # Jaeger UI | ||
| - "16686:16686" | ||
| # OTLP gRPC | ||
| - "4317:4317" | ||
| # OTLP HTTP | ||
| - "4318:4318" | ||
| environment: | ||
| - COLLECTOR_OTLP_ENABLED=true | ||
| restart: unless-stopped | ||
| ``` | ||
|
|
||
| Then start it with: | ||
|
|
||
| ```bash | ||
| docker compose up -d | ||
| ``` | ||
|
|
||
| For more details about the Jaeger setup, see the [getting started section](https://www.jaegertracing.io/docs/latest/getting-started/) in their documentation. You can see the Jaeger UI in your browser by navigating to [http://localhost:16686](http://localhost:16686). | ||
|
|
||
| ## Install dependencies | ||
|
|
||
| To instrument your Crawlee crawler, you need to install the `@crawlee/otel` package along with the OpenTelemetry SDK packages: | ||
|
|
||
| ```bash npm2yarn | ||
| npm install @crawlee/otel @opentelemetry/api @opentelemetry/api-logs @opentelemetry/sdk-node @opentelemetry/sdk-trace-base @opentelemetry/sdk-logs @opentelemetry/resources @opentelemetry/semantic-conventions @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/exporter-logs-otlp-grpc | ||
| ``` | ||
|
|
||
| ## Instrument the crawler | ||
|
|
||
| OpenTelemetry instrumentation must be set up **before** importing Crawlee or any other instrumented modules. The easiest way to do this is to create a separate setup file and import it first using Node.js's `--import` flag. | ||
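Why the order matters can be illustrated with a toy model. The sketch below assumes the instrumentation follows the usual OpenTelemetry approach of patching a module the first time it is loaded; `ToyLoader`, `onLoad`, and `load` are hypothetical names made up for this illustration and are not part of Node.js, OpenTelemetry, or `@crawlee/otel`.

```typescript
// Toy module loader: NOT the real Node.js loader or the @crawlee/otel implementation.
type ModuleExports = Record<string, unknown>;
type Patch = (exports: ModuleExports) => void;

class ToyLoader {
    private readonly cache = new Map<string, ModuleExports>();
    private readonly hooks = new Map<string, Patch[]>();
    private readonly factories: Record<string, () => ModuleExports>;

    constructor(factories: Record<string, () => ModuleExports>) {
        this.factories = factories;
    }

    // Register a patch that runs the first time `name` is loaded.
    onLoad(name: string, patch: Patch): void {
        const patches = this.hooks.get(name) ?? [];
        patches.push(patch);
        this.hooks.set(name, patches);
    }

    // Load a module once; later calls return the cached exports untouched.
    load(name: string): ModuleExports {
        const cached = this.cache.get(name);
        if (cached) return cached;
        const factory = this.factories[name];
        if (!factory) throw new Error(`Unknown module: ${name}`);
        const exports = factory();
        for (const patch of this.hooks.get(name) ?? []) patch(exports);
        this.cache.set(name, exports);
        return exports;
    }
}

const factories: Record<string, () => ModuleExports> = {
    crawler: () => ({ instrumented: false }),
};
const markInstrumented: Patch = (exports) => {
    exports.instrumented = true;
};

// Hook registered before the first load: the patch is applied.
const early = new ToyLoader(factories);
early.onLoad('crawler', markInstrumented);
const earlyResult = early.load('crawler').instrumented;

// Module loaded first, hook registered afterwards: the patch never runs.
const late = new ToyLoader(factories);
late.load('crawler');
late.onLoad('crawler', markInstrumented);
const lateResult = late.load('crawler').instrumented;

console.log({ earlyResult, lateResult });
```

In this model the second loader ends up with an unpatched module, which mirrors what can happen when Crawlee is imported before the setup file has run.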
|
|
||
| ### Setup file | ||
|
|
||
| Create a setup file that initializes OpenTelemetry with the Crawlee instrumentation: | ||
|
|
||
| <CodeBlock language="ts" title="src/setup.ts"> | ||
| {SetupSource} | ||
| </CodeBlock> | ||
|
|
||
| ### Main crawler file | ||
|
|
||
| Now create your crawler. The `CrawleeInstrumentation` will automatically instrument the core crawler methods: | ||
|
|
||
| <CodeBlock language="ts" title="src/main.ts"> | ||
| {BasicExampleSource} | ||
| </CodeBlock> | ||
|
|
||
| ### Run the crawler | ||
|
|
||
| Run your crawler with the setup file imported first: | ||
|
|
||
| ```bash | ||
| npx tsx --import ./src/setup.ts ./src/main.ts | ||
| ``` | ||
|
|
||
| The `--import` flag ensures the OpenTelemetry setup runs before any other code, which is required for the automatic instrumentation to work properly. | ||
|
|
||
| ## Analyze the results | ||
|
|
||
| In the Jaeger UI, you can search for different traces, apply filtering, compare traces, view their detailed attributes, view timing details, and more. For a detailed description of the tool's capabilities, please refer to the [Jaeger documentation](https://www.jaegertracing.io/docs/latest/). | ||
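Traces can also be fetched programmatically. The sketch below builds a query URL for the HTTP JSON endpoint that the Jaeger UI itself uses (`/api/traces`). Jaeger treats this endpoint as internal, so it may change between versions; the helper name `jaegerTracesUrl` and its defaults are assumptions for illustration, and the service name should match the one set in your setup file.

```typescript
// Build a URL for Jaeger's internal query endpoint (used by the Jaeger UI).
// `baseUrl` assumes the Docker setup from this guide; adjust it for other deployments.
function jaegerTracesUrl(service: string, limit = 20, baseUrl = 'http://localhost:16686'): string {
    const url = new URL('/api/traces', baseUrl);
    url.searchParams.set('service', service);
    url.searchParams.set('limit', String(limit));
    return url.toString();
}

const tracesUrl = jaegerTracesUrl('my-crawler');
console.log(tracesUrl);

// With Jaeger running, the traces could then be fetched like this:
// const response = await fetch(tracesUrl);
// const body = await response.json();
```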
|
|
||
|  | ||
|
|
||
| You can use different tools to consume the OpenTelemetry data that might better suit your needs. Please see the list of known [vendors in OpenTelemetry documentation](https://opentelemetry.io/ecosystem/vendors/). | ||
|
Contributor
This was already mentioned on line 19. I would keep just one of those. |
||
|
|
||
| ## Customize the instrumentation | ||
|
|
||
| The `CrawleeInstrumentation` class provides several configuration options to customize what gets instrumented: | ||
|
|
||
| | Option | Default | Description | | ||
| |--------|---------|-------------| | ||
| | `enabled` | `true` | Enable or disable the instrumentation entirely | | ||
| | `requestHandlingInstrumentation` | `true` | Instrument core crawler methods like `run`, `_runTaskFunction`, and navigation handlers | | ||
| | `logInstrumentation` | `true` | Forward Crawlee logs to OpenTelemetry logs | | ||
| | `customInstrumentation` | `[]` | Array of custom class methods to instrument | | ||
|
|
||
| ### Configuration example | ||
|
|
||
| ```ts | ||
| import { CrawleeInstrumentation } from '@crawlee/otel'; | ||
|
|
||
| const crawleeInstrumentation = new CrawleeInstrumentation({ | ||
| // Disable automatic request handling instrumentation | ||
| requestHandlingInstrumentation: false, | ||
| // Disable log forwarding | ||
| logInstrumentation: false, | ||
| // Add custom instrumentation | ||
| customInstrumentation: [ | ||
| { | ||
| moduleName: '@crawlee/basic', | ||
| className: 'BasicCrawler', | ||
| methodName: 'run', | ||
| spanName: 'my-custom-span-name', | ||
| }, | ||
| ], | ||
| }); | ||
| ``` | ||
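Conceptually, a `customInstrumentation` entry names a class method to wrap in a span. The following is a minimal sketch of that idea, not the actual `@crawlee/otel` implementation: `patchMethod`, `RecordedSpan`, and `ToyCrawler` are hypothetical, and spans are recorded in a plain array instead of going through `@opentelemetry/api`.

```typescript
// Hypothetical in-memory stand-in for a tracer; real spans come from @opentelemetry/api.
interface RecordedSpan {
    name: string;
    ended: boolean;
    error?: string;
}
const recorded: RecordedSpan[] = [];

// Replace `target[methodName]` with a wrapper that records a span around each call.
function patchMethod(target: object, methodName: string, spanName: string): void {
    const methods = target as Record<string, unknown>;
    const original = methods[methodName];
    if (typeof original !== 'function') throw new Error(`${methodName} is not a method`);

    methods[methodName] = async function (this: unknown, ...args: unknown[]): Promise<unknown> {
        const span: RecordedSpan = { name: spanName, ended: false };
        recorded.push(span);
        try {
            // Preserve `this` so the original method still sees its instance.
            return await original.apply(this, args);
        } catch (err) {
            span.error = err instanceof Error ? err.message : String(err);
            throw err;
        } finally {
            span.ended = true;
        }
    };
}

// Toy class standing in for a crawler.
class ToyCrawler {
    async run(urls: string[]): Promise<number> {
        return urls.length;
    }
}

patchMethod(ToyCrawler.prototype, 'run', 'my-custom-span-name');

const pending = new ToyCrawler().run(['https://crawlee.dev']);
pending.then((count) => console.log(count, recorded));
```

Patching the prototype means every instance created afterwards goes through the wrapper, which is why a single configuration entry can cover all crawlers of a given class.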
|
|
||
| ## Manual span instrumentation with wrapWithSpan | ||
|
|
||
| For more fine-grained control, you can use the `wrapWithSpan` utility to wrap specific functions with OpenTelemetry spans. This is particularly useful for instrumenting request handlers, hooks, and error handlers. | ||
|
|
||
| <CodeBlock language="ts" title="src/main.ts"> | ||
| {WrapWithSpanSource} | ||
| </CodeBlock> | ||
|
|
||
| ### wrapWithSpan options | ||
|
|
||
| The `wrapWithSpan` function accepts these options: | ||
|
|
||
| | Option | Type | Description | | ||
| |--------|------|-------------| | ||
| | `spanName` | `string \| ((...args) => string)` | Static name or function that receives the handler arguments and returns a span name | | ||
| | `spanOptions` | `SpanOptions \| ((...args) => SpanOptions)` | Static options or function that returns OpenTelemetry SpanOptions including attributes | | ||
| | `tracer` | `Tracer` | Custom tracer instance (defaults to `trace.getTracer('crawlee')`) | | ||
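Both `spanName` and `spanOptions` accept either a static value or a function of the handler arguments. A wrapper could resolve such an option along the following lines; this is a sketch under stated assumptions, with `ToyContext`, `SpanNameOption`, and `resolveSpanName` invented for illustration rather than taken from `@crawlee/otel`.

```typescript
// Simplified stand-in for the crawling context passed to a request handler.
interface ToyContext {
    request: { url: string; method: string };
}

// A span name is either fixed or computed from the handler arguments.
type SpanNameOption = string | ((context: ToyContext) => string);

function resolveSpanName(option: SpanNameOption | undefined, context: ToyContext, fallback: string): string {
    if (option === undefined) return fallback;
    return typeof option === 'function' ? option(context) : option;
}

const toyContext: ToyContext = { request: { url: 'https://crawlee.dev', method: 'GET' } };

const staticName = resolveSpanName('request-handler', toyContext, 'handler');
const dynamicName = resolveSpanName((ctx) => `${ctx.request.method} ${ctx.request.url}`, toyContext, 'handler');
const defaultName = resolveSpanName(undefined, toyContext, 'handler');

console.log({ staticName, dynamicName, defaultName });
```

Dynamic names are convenient for reading individual traces, but a high number of distinct span names can make aggregation in the backend harder, so a static name plus attributes is often the safer choice.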
|
|
||
| ### Accessing the current span | ||
|
|
||
| Inside a wrapped function, you can access the current span to add additional attributes or events: | ||
|
|
||
| ```ts | ||
| import { context, trace } from '@opentelemetry/api'; | ||
|
|
||
| requestHandler: wrapWithSpan( | ||
| async ({ request, $ }) => { | ||
| const span = trace.getSpan(context.active()); | ||
|
|
||
| const title = $('title').text(); | ||
|
|
||
| if (span) { | ||
| span.setAttribute('page.title', title); | ||
| span.addEvent('page_scraped', { url: request.url }); | ||
| } | ||
|
|
||
| // ... rest of your handler | ||
| }, | ||
| { spanName: 'request-handler' } | ||
| ), | ||
| ``` | ||
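The lookup of the active span works across `await` boundaries because the Node.js context manager in OpenTelemetry is typically backed by `AsyncLocalStorage`. The sketch below shows that mechanism in isolation; `ToySpan`, `withSpan`, and `getActiveSpan` are hypothetical helpers written for this example and do not use the OpenTelemetry API.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Minimal stand-in for a span.
interface ToySpan {
    name: string;
    attributes: Record<string, string>;
}

const activeSpanStorage = new AsyncLocalStorage<ToySpan>();

// Run `fn` with a new span set as the active one for its whole async call tree.
function withSpan<T>(name: string, fn: () => T): T {
    const span: ToySpan = { name, attributes: {} };
    return activeSpanStorage.run(span, fn);
}

// Return the span active in the current async context, if any.
function getActiveSpan(): ToySpan | undefined {
    return activeSpanStorage.getStore();
}

async function handler(): Promise<string | undefined> {
    // The store survives timers and awaits inside the wrapped call.
    await new Promise((resolve) => setTimeout(resolve, 1));
    const span = getActiveSpan();
    if (span) span.attributes['page.title'] = 'Example';
    return span?.name;
}

withSpan('request-handler', handler).then((name) => console.log(name));
```

Outside of `withSpan` there is no active span, which is the reason the guide's snippet checks `if (span)` before setting attributes.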
|
|
||
| ## Custom class instrumentation | ||
|
|
||
| You can also create your own instrumentation by selecting only the methods you want to instrument. Here's an example of adding custom instrumentation for specific crawler methods: | ||
|
|
||
| <CodeBlock language="ts" title="src/setup.ts"> | ||
| {CustomInstrumentationSource} | ||
| </CodeBlock> | ||
|
|
||
| ## What gets instrumented automatically | ||
|
|
||
| When `requestHandlingInstrumentation` is enabled (the default), the following methods are automatically instrumented: | ||
|
|
||
| | Crawler | Method | Span Name | | ||
| |---------|--------|-----------| | ||
| | `BasicCrawler` | `run` | `crawlee.crawler.run` | | ||
| | `BasicCrawler` | `_runTaskFunction` | `crawlee.crawler.runTaskFunction` | | ||
| | `BasicCrawler` | `_requestFunctionErrorHandler` | `crawlee.crawler.requestFunctionErrorHandler` | | ||
| | `BasicCrawler` | `_handleFailedRequestHandler` | `crawlee.crawler.handleFailedRequestHandler` | | ||
| | `BasicCrawler` | `_executeHooks` | `crawlee.crawler.executeHooks` | | ||
| | `BrowserCrawler` | `_handleNavigation` | `crawlee.browser.handleNavigation` | | ||
| | `BrowserCrawler` | `_runRequestHandler` | `crawlee.browser.runRequestHandler` | | ||
| | `HttpCrawler` | `_handleNavigation` | `crawlee.http.handleNavigation` | | ||
| | `HttpCrawler` | `_runRequestHandler` | `crawlee.http.runRequestHandler` | | ||
|
|
||
| Request handler spans include these attributes automatically: | ||
| - `crawlee.request.id` | ||
| - `crawlee.request.url` | ||
| - `crawlee.request.method` | ||
|
Comment on lines +199 to +200
Contributor
I would consider using stable attributes from semantic-conventions instead of But I am not really sure what is the common practice in this case. |
||
| - `crawlee.request.retry_count` | ||
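To make the attribute list concrete, here is a sketch of how such attributes could be derived from a request. The attribute keys are the ones listed above; the `ToyRequest` shape and the `requestAttributes` helper are assumptions for illustration and may differ from what `@crawlee/otel` does internally.

```typescript
// Simplified stand-in for a Crawlee request.
interface ToyRequest {
    id?: string;
    url: string;
    method: string;
    retryCount: number;
}

// Map a request to the span attributes listed in this guide.
function requestAttributes(request: ToyRequest): Record<string, string | number> {
    const attributes: Record<string, string | number> = {
        'crawlee.request.url': request.url,
        'crawlee.request.method': request.method,
        'crawlee.request.retry_count': request.retryCount,
    };
    // Only include the ID when the request has one.
    if (request.id !== undefined) attributes['crawlee.request.id'] = request.id;
    return attributes;
}

const exampleAttributes = requestAttributes({
    id: 'abc123',
    url: 'https://crawlee.dev',
    method: 'GET',
    retryCount: 0,
});
console.log(exampleAttributes);
```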
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| import { CheerioCrawler } from 'crawlee'; | ||
| import { sdk } from './setup.js'; | ||
|
|
||
| const crawler = new CheerioCrawler({ | ||
| maxRequestsPerCrawl: 10, | ||
|
|
||
| async requestHandler({ request, $, enqueueLinks, log }) { | ||
| const title = $('title').text(); | ||
| log.info(`Crawled ${request.url}`, { title }); | ||
|
|
||
| await enqueueLinks({ | ||
| globs: ['https://crawlee.dev/**'], | ||
| }); | ||
| }, | ||
| }); | ||
|
|
||
| await crawler.run(['https://crawlee.dev']); | ||
|
|
||
| // Ensure all telemetry is flushed before exiting | ||
| await sdk.shutdown(); | ||
| console.log('Crawl complete. View traces at http://localhost:16686'); |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'; | ||
| import { resourceFromAttributes } from '@opentelemetry/resources'; | ||
| import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'; | ||
| import { NodeSDK } from '@opentelemetry/sdk-node'; | ||
| import { CrawleeInstrumentation } from '@crawlee/otel'; | ||
| import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions'; | ||
|
|
||
| const crawleeInstrumentation = new CrawleeInstrumentation({ | ||
| // Disable default request handling instrumentation | ||
| requestHandlingInstrumentation: false, | ||
| // Disable log forwarding to OpenTelemetry | ||
| logInstrumentation: false, | ||
| // Define custom methods to instrument | ||
| customInstrumentation: [ | ||
| { | ||
| moduleName: '@crawlee/basic', | ||
| className: 'BasicCrawler', | ||
| methodName: 'run', | ||
| spanName: 'crawler.run', | ||
| spanOptions() { | ||
| return { | ||
| attributes: { | ||
| 'crawler.type': this.constructor.name, | ||
| }, | ||
| }; | ||
| }, | ||
| }, | ||
| { | ||
| moduleName: '@crawlee/http', | ||
| className: 'HttpCrawler', | ||
| methodName: '_runRequestHandler', | ||
| // Dynamic span name using the context argument | ||
| spanName(context: any) { | ||
| return `http.request ${context.request.url}`; | ||
| }, | ||
| spanOptions(context: any) { | ||
| return { | ||
| attributes: { | ||
| 'http.url': context.request.url, | ||
| 'http.method': context.request.method, | ||
| }, | ||
| }; | ||
| }, | ||
| }, | ||
| ], | ||
| }); | ||
|
|
||
| const resource = resourceFromAttributes({ | ||
| [ATTR_SERVICE_NAME]: 'custom-instrumented-crawler', | ||
| }); | ||
|
|
||
| const traceExporter = new OTLPTraceExporter({ | ||
| url: 'http://localhost:4317', | ||
| }); | ||
|
|
||
| export const sdk = new NodeSDK({ | ||
| resource, | ||
| spanProcessors: [new BatchSpanProcessor(traceExporter)], | ||
| instrumentations: [crawleeInstrumentation], | ||
| }); | ||
|
|
||
| sdk.start(); |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'; | ||
| import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc'; | ||
| import { resourceFromAttributes } from '@opentelemetry/resources'; | ||
| import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'; | ||
| import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs'; | ||
| import { NodeSDK } from '@opentelemetry/sdk-node'; | ||
| import { CrawleeInstrumentation } from '@crawlee/otel'; | ||
| import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions'; | ||
|
|
||
| // Create a resource that identifies your service | ||
| const resource = resourceFromAttributes({ | ||
| [ATTR_SERVICE_NAME]: 'my-crawler', | ||
| [ATTR_SERVICE_VERSION]: '1.0.0', | ||
| 'deployment.environment': 'development', | ||
| }); | ||
|
|
||
| // Configure exporters to send data over OTLP gRPC. Jaeger ingests traces only, so logs need an OTLP backend that accepts them. | ||
| const traceExporter = new OTLPTraceExporter({ | ||
| url: 'http://localhost:4317', | ||
| }); | ||
|
|
||
| const logExporter = new OTLPLogExporter({ | ||
| url: 'http://localhost:4317', | ||
| }); | ||
|
|
||
| // Create the Crawlee instrumentation | ||
| const crawleeInstrumentation = new CrawleeInstrumentation(); | ||
|
|
||
| // Initialize the OpenTelemetry SDK | ||
| export const sdk = new NodeSDK({ | ||
| resource, | ||
| spanProcessors: [new BatchSpanProcessor(traceExporter)], | ||
| logRecordProcessors: [new BatchLogRecordProcessor(logExporter)], | ||
| instrumentations: [crawleeInstrumentation], | ||
| }); | ||
|
|
||
| // Start the SDK | ||
| sdk.start(); | ||
|
|
||
| console.log('OpenTelemetry initialized'); | ||
|
|
||
| // Graceful shutdown | ||
| process.on('SIGTERM', async () => { | ||
| await sdk.shutdown(); | ||
| process.exit(0); | ||
| }); |