Verfassungsschutzberichte.de

What does the Verfassungsschutz (Germany's domestic intelligence service; literally "protection of the constitution") inform the public about? The intelligence service's reports: collected, searchable, and analyzed.

The Verfassungsschutz has the task of informing the public about anti-constitutional efforts. The 16 state offices and the Federal Office for the Protection of the Constitution publish annual reports (Verfassungsschutzberichte). This website is a civil society archive that makes these reports easier to access.

Development

  1. Install and run Docker
  2. git clone https://github.com/beyondopen/verfassungsschutzberichte.de && cd verfassungsschutzberichte.de
  3. docker compose up
  4. Open http://localhost:5001 in a browser

Add PDFs for Development

Put some PDFs into data/pdfs/, then process them:

docker compose exec web flask update-docs '*'

Testing

End-to-end tests verify that the application works correctly after updates (e.g., Python version upgrades).

Run tests locally:

# Simply run the test script (requires Docker)
./scripts/run_tests.sh

All tests run inside Docker containers - no local Python dependencies needed!

CI/CD: Tests run automatically on every push and pull request via GitHub Actions.

Production

Deploy with Dokku.

  1. Create a Dokku app, e.g. with the name vsb
  2. Link a Postgres database
  3. Link a Redis cache
  4. Mount a data directory: dokku storage:mount <app> <data-dir>:/data

Adjust the Postgres config and increase shared_buffers and work_mem to, e.g., 1GB and 128MB respectively.

Data Export & Import

Export and import all PDF data (processed, cleaned, raw, deleted) as a tar archive.

Export

dokku run <app> flask export-data /data/export.tar

Then copy the tar from the server's mounted data directory.

Import to a New Instance

# 1. Copy the tar to the server's mounted data directory
scp export.tar <server>:<data-dir>/export.tar

# 2. Import the PDFs
dokku run <app> flask import-data /data/export.tar

# 3. Process PDFs into the database
dokku run <app> flask update-docs '*'

# 4. Generate page images
dokku run <app> flask generate-images '*'

# 5. Clean up
rm <data-dir>/export.tar

One-off commands

  • clear cache: dokku run <app> flask clear-cache
  • add documents: dokku run <app> flask update-docs '*'
  • remove all documents: dokku run <app> flask remove-docs '*'
  • remove one document: dokku run <app> flask remove-docs 'vsbericht-th-2002.pdf'
  • clean all data from the database and add all documents again: dokku run <app> flask clear-data
  • initialize database schema: dokku run <app> flask init-db

Data Storage

The reports are organized by folders and filenames. This has the limitation that we can't store different versions of a yearly report.

PDF preprocessing

All documents should be PDFs in A4 or A5 portrait format. Several PDF scripts exist, but occasionally manual work is required. See scripts.

Folder Structure

/data/
├── pdfs/       # processed PDFs used by the app
├── cleaned/    # normalized PDFs, before OCR & file reduction
├── raw/        # original unprocessed PDFs
├── deleted/    # removed PDFs kept for reference
└── images/     # generated page scans (JPG + AVIF)
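
For local experiments, the layout above could be created with a few lines of Python. This is a minimal sketch, not part of the repository: the function name `ensure_data_layout` is hypothetical, and it is an assumption that creating the folders by hand is ever needed (the app or Docker setup may already do this).

```python
from pathlib import Path

# Subfolder names copied from the folder structure above.
SUBFOLDERS = ("pdfs", "cleaned", "raw", "deleted", "images")


def ensure_data_layout(root):
    """Create the expected subfolders under `root` and return their paths.

    Hypothetical helper for illustration; existing folders are left untouched.
    """
    root = Path(root)
    layout = {}
    for name in SUBFOLDERS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        layout[name] = path
    return layout
```

Calling `ensure_data_layout("data")` from the repository root would then prepare `data/pdfs/` and its sibling folders before any PDFs are copied in.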

Adding a New Report

Naming: vsbericht-nw-2000.pdf for NRW 2000, vsbericht-2000.pdf for the federal report 2000.

If a report covers multiple years, use the latest year as the main date and update the title in src/report_info.py accordingly.

  1. Put the file in data/pdfs/
  2. Process it: dokku run <app> flask update-docs 'vsbericht-nw-2000'
  3. Optionally store the original in data/raw/ and the cleaned version in data/cleaned/
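
The naming convention can be checked before a file is copied into data/pdfs/. The following is a minimal sketch under stated assumptions, not the parsing logic used by the app: `parse_report_name` and `ReportName` are hypothetical names, and the pattern assumes that state codes are always two lowercase letters (as in `nw` and `th`) and that years have four digits.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "vsbericht-<state>-<year>.pdf" for state reports,
# "vsbericht-<year>.pdf" for the federal report.
_PATTERN = re.compile(r"^vsbericht-(?:(?P<state>[a-z]{2})-)?(?P<year>\d{4})\.pdf$")


class ReportName(NamedTuple):
    state: Optional[str]  # None means the federal report
    year: int


def parse_report_name(filename: str) -> ReportName:
    """Split a report filename into state code and year, or raise ValueError."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected report filename: {filename!r}")
    return ReportName(match.group("state"), int(match.group("year")))
```

For example, `parse_report_name("vsbericht-nw-2000.pdf")` yields the state `nw` and the year 2000, while `parse_report_name("vsbericht-2000.pdf")` yields no state, which this sketch treats as the federal report.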

Search

Search uses Postgres' full-text search features via sqlalchemy-searchable. Right now, there are some shortcomings: trigram similarity cannot be used; wildcard queries are the default and can only be deactivated by putting the query in quotes, i.e., "query"; and the matching tokens are not displayed on the page/image. Further work is required to improve the search.
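
A client that wants to avoid the default wildcard behaviour has to quote its query, as described above. A minimal sketch of such a helper follows; `exact_query` is a hypothetical name, and stripping embedded double quotes is an assumption made for illustration, not something the search backend is known to require.

```python
def exact_query(term: str) -> str:
    """Wrap a search term in double quotes to request non-wildcard matching.

    Hypothetical helper: embedded double quotes are dropped (an assumption)
    so that the result contains exactly one quoted phrase.
    """
    cleaned = term.replace('"', "").strip()
    if not cleaned:
        raise ValueError("search term is empty")
    return f'"{cleaned}"'
```

The returned string can be used as the search input in place of the unquoted term.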

License

MIT
