# Tableau DQ Metrics

This README describes the process for publishing DQ metrics data from AWS S3 to Tableau Server.

There are two methods for publishing the data to Tableau Server:

1. [Running the Python scripts locally](#python-scripts)
2. Using the [GitHub Actions workflow](#github-actions-workflow) to run the scripts on a schedule

## Overview

This section provides an overview of the process for publishing DQ metrics data to Tableau Server.

The Python scripts generate a Tableau `.hyper` datasource file from the DQ metric
JSON files stored in S3.
The `.hyper` file is then published to Tableau Server by overwriting an existing datasource.
Once the datasource is updated, a ping is sent to a Tableau view to trigger a cache refresh,
ensuring that the latest data is available for visualisation.

The following sections provide more detail on the Python scripts and the GitHub Actions workflow that automates this process.
## Python Scripts

There are two Python scripts involved:

- `generate_tableau_data.py`
  Reads DQ metric JSON files from the S3 bucket `eligibility-signposting-api-dev-dq-metrics`, filters to approximately the last 3 months, and writes a Tableau Hyper extract called `converted.hyper`.

- `tableau_refresh.py`
  Publishes `./converted.hyper` to Tableau Server by overwriting an existing datasource, then pings a Tableau view to trigger a cache refresh.

This work supports the EliD DQ metrics Tableau MVP, where Tableau is being used to visualise DQ metrics for monitoring and comparison against expected thresholds.

---

### What the scripts do

#### 1. Generate Hyper extract

`generate_tableau_data.py`:

- connects to S3
- scans daily `processing_date=YYYYMMDD/` prefixes for the last 90 days
- reads `.json` files from the bucket
- parses JSON or JSONL content into a pandas DataFrame
- creates a Tableau Hyper file named `converted.hyper`

The S3 source bucket is currently hard-coded as:

```python
S3_BUCKET = "eligibility-signposting-api-dev-dq-metrics"
```

and the output file is:

```python
LOCAL_HYPER_PATH = "converted.hyper"
```
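
The script's internals are not reproduced here, but a minimal sketch of the same approach might look like the following. The column typing (everything stored as text), the JSON record layout, and the function names are assumptions for illustration, not the script's actual code:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3
import pandas as pd
from tableauhyperapi import (
    Connection, CreateMode, HyperProcess, Inserter,
    SqlType, TableDefinition, TableName, Telemetry,
)

S3_BUCKET = "eligibility-signposting-api-dev-dq-metrics"
LOCAL_HYPER_PATH = "converted.hyper"
LOOKBACK_DAYS = 90


def load_recent_metrics() -> pd.DataFrame:
    """Collect DQ metric JSON/JSONL records from the last 90 days of daily prefixes."""
    s3 = boto3.client("s3")
    frames = []
    today = datetime.now(timezone.utc).date()
    for offset in range(LOOKBACK_DAYS):
        prefix = f"processing_date={(today - timedelta(days=offset)):%Y%m%d}/"
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                if not obj["Key"].endswith(".json"):
                    continue
                body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
                try:
                    # Plain JSON: a single object or a list of objects.
                    records = json.loads(body)
                    records = records if isinstance(records, list) else [records]
                except json.JSONDecodeError:
                    # JSONL: one JSON object per line.
                    records = [json.loads(line) for line in body.splitlines() if line.strip()]
                frames.append(pd.DataFrame(records))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


def write_hyper(df: pd.DataFrame) -> None:
    """Write the DataFrame to converted.hyper, with every column stored as text."""
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(hyper.endpoint, LOCAL_HYPER_PATH, CreateMode.CREATE_AND_REPLACE) as conn:
            table = TableDefinition(
                TableName("Extract", "Extract"),
                [TableDefinition.Column(col, SqlType.text()) for col in df.columns],
            )
            conn.catalog.create_schema("Extract")
            conn.catalog.create_table(table)
            with Inserter(conn, table) as inserter:
                inserter.add_rows(df.astype(str).values.tolist())
                inserter.execute()


if __name__ == "__main__":
    df = load_recent_metrics()
    if df.empty:
        print("No data found for the selected date range.")
    else:
        write_hyper(df)
```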

#### 2. Publish to Tableau

`tableau_refresh.py`:

- checks that `./converted.hyper` exists
- validates the file type
- reads Tableau credentials and settings from environment variables
- signs in using a Tableau Personal Access Token (PAT)
- overwrites the configured datasource
- pings the Tableau view `EligibilityData-DQMetrics/DataQualityMetrics?:refresh=y` to trigger a cache refresh

NOTE: PAT credentials must be set as GitHub secrets for the workflow, and as environment variables for local testing.

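A minimal sketch of this publish-and-ping flow using `tableauserverclient` is shown below. The exact view URL format and the error handling in the real script are assumptions:

```python
import os

import requests
import tableauserverclient as TSC

server_url = os.environ["TABLEAU_SERVER_URL"]
site_id = os.environ.get("TABLEAU_SITE_ID", "NHSD_DEV")

auth = TSC.PersonalAccessTokenAuth(
    os.environ["TABLEAU_TOKEN_NAME"],
    os.environ["TABLEAU_TOKEN_VALUE"],
    site_id=site_id,
)
server = TSC.Server(server_url, use_server_version=True)

with server.auth.sign_in(auth):
    # Overwrite the existing datasource with the freshly generated extract.
    datasource = server.datasources.get_by_id(os.environ["TABLEAU_DATASOURCE_ID"])
    server.datasources.publish(datasource, "./converted.hyper", TSC.Server.PublishMode.Overwrite)

# Ping the view with ?:refresh=y so Tableau serves fresh data rather than a cached page.
view_url = f"{server_url}/t/{site_id}/views/EligibilityData-DQMetrics/DataQualityMetrics?:refresh=y"
requests.get(view_url, timeout=60).raise_for_status()
```
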
---

### Repository structure

The GitHub Actions workflow expects the scripts to exist at:

```text
scripts/tableau/generate_tableau_data.py
scripts/tableau/tableau_refresh.py
```

because it runs:

```bash
python scripts/tableau/generate_tableau_data.py
python scripts/tableau/tableau_refresh.py
```

---

### Running locally

#### Prerequisites

You will need:

- Python 3.13 (recommended, to match the workflow setup)
- AWS access with permission to read from the S3 bucket `eligibility-signposting-api-dev-dq-metrics`
- Tableau Personal Access Token (PAT) credentials
- The required Python packages installed:
  - `boto3`
  - `pandas`
  - `tableauserverclient`
  - `tableauhyperapi`
  - `requests`

#### Install dependencies

If installing the dependencies locally with `pip`:

```bash
pip install boto3 pandas tableauserverclient tableauhyperapi requests
```

#### Required environment variables

Before publishing to Tableau, set the following environment variables:

```bash
export TABLEAU_TOKEN_NAME="your-token-name"
export TABLEAU_TOKEN_VALUE="your-token-value"
export TABLEAU_SERVER_URL="https://your-tableau-server"
export TABLEAU_DATASOURCE_ID="your-datasource-id"
export TABLEAU_SITE_ID="NHSD_DEV"
```

`TABLEAU_SERVER_URL` is the base URL of the Tableau Server instance, for example `https://tableau.nhsd.com`.

`TABLEAU_DATASOURCE_ID` is the LUID of the datasource to overwrite,
which can be found in the Tableau Server URL when viewing the datasource.

`TABLEAU_SITE_ID` defaults to `NHSD_DEV` if not set.
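
`tableau_refresh.py` fails fast when any of these are missing. A check along these lines (the real script's variable handling may differ) would produce the `Missing required environment variables` error described under Troubleshooting:

```python
import os
import sys

REQUIRED_VARS = [
    "TABLEAU_TOKEN_NAME",
    "TABLEAU_TOKEN_VALUE",
    "TABLEAU_SERVER_URL",
    "TABLEAU_DATASOURCE_ID",
]

# Collect every unset or empty variable so the error names them all at once.
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
```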

#### AWS credentials

You also need AWS credentials available locally so `boto3` can read from S3.
You may also need to set the AWS region if it is not configured globally:

```bash
export AWS_REGION=eu-west-2
```

The workflow uses `eu-west-2`.
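
Credentials can be supplied in any of the standard ways `boto3` supports, for example via a profile configured with `aws configure`, or via the standard AWS environment variables (the values below are placeholders):

```bash
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_SESSION_TOKEN="your-session-token"  # only required for temporary credentials
```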

#### Run the scripts

Step 1: Generate the Hyper file

```bash
python scripts/tableau/generate_tableau_data.py
```

This should create:

```text
converted.hyper
```

Step 2: Publish to Tableau

```bash
python scripts/tableau/tableau_refresh.py
```

If you want the publish step to continue even when the cache refresh ping fails:

```bash
python scripts/tableau/tableau_refresh.py --ignore-refresh-failure
```

The optional `--ignore-refresh-failure` flag prevents the script from exiting with an error
if the Tableau refresh ping fails.
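
One plausible implementation of this flag, shown as a hedged sketch rather than the script's actual code (`ping_view` is a hypothetical helper, and the URL is a placeholder):

```python
import argparse

import requests


def ping_view(url: str) -> None:
    # Hypothetical helper: request the view with ?:refresh=y to bust the cache.
    requests.get(url, timeout=60).raise_for_status()


parser = argparse.ArgumentParser(description="Publish converted.hyper to Tableau Server")
parser.add_argument(
    "--ignore-refresh-failure",
    action="store_true",
    help="exit successfully even if the cache refresh ping fails",
)
args = parser.parse_args()

try:
    ping_view("https://your-tableau-server/t/NHSD_DEV/views/EligibilityData-DQMetrics/DataQualityMetrics?:refresh=y")
except requests.RequestException as exc:
    if not args.ignore_refresh_failure:
        raise SystemExit(f"Cache refresh failed: {exc}")
    print(f"Cache refresh failed ({exc}); continuing because --ignore-refresh-failure was set")
```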

---

### Expected local flow

1. Read recent DQ metric JSON data from S3
2. Build `converted.hyper`
3. Sign in to Tableau using PAT credentials
4. Overwrite the target datasource
5. Trigger cache refresh for the relevant workbook view

---

## GitHub Actions workflow

The GitHub Actions workflow is named:

```yaml
Daily Tableau Data Update
```

It supports:

- scheduled execution every day at `10:00 AM UTC`
- manual triggering using `workflow_dispatch` for testing

### Workflow triggers

```yaml
on:
  schedule:
    - cron: '0 10 * * *'
  workflow_dispatch:
```

### Workflow jobs

The workflow has two jobs:

#### 1. `metadata`

This job (sketched below):

- checks out the repo
- reads versions from `.tool-versions`
- sets CI metadata such as build timestamp and version string

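The exact job definition is not shown in this README; an illustrative shape for it (the step names, action versions, and outputs are assumptions) might be:

```yaml
metadata:
  runs-on: ubuntu-latest
  outputs:
    build_timestamp: ${{ steps.meta.outputs.build_timestamp }}
  steps:
    - uses: actions/checkout@v4
    - id: meta
      # Expose a build timestamp for later steps to consume.
      run: echo "build_timestamp=$(date -u +'%Y-%m-%dT%H:%M:%SZ')" >> "$GITHUB_OUTPUT"
```
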
#### 2. `publish`

This job (see the sketch after this list):

- sets up Python 3.13
- checks out the repository
- installs the required Python packages
- assumes the AWS deployment role using GitHub OIDC
- runs the S3 to Hyper script
- publishes the datasource to Tableau

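A hedged sketch of how this job might plug the secrets and variables together (the action versions and the role name are assumptions, not taken from the actual workflow):

```yaml
publish:
  runs-on: ubuntu-latest
  environment: dev
  permissions:
    id-token: write  # required to assume the AWS role via OIDC
    contents: read
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: "3.13"
    - run: pip install boto3 pandas tableauserverclient tableauhyperapi requests
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/your-deployment-role
        aws-region: eu-west-2
    - run: python scripts/tableau/generate_tableau_data.py
    - env:
        TABLEAU_TOKEN_NAME: ${{ secrets.TABLEAU_TOKEN_NAME }}
        TABLEAU_TOKEN_VALUE: ${{ secrets.TABLEAU_TOKEN_VALUE }}
        TABLEAU_DATASOURCE_ID: ${{ secrets.TABLEAU_DATASOURCE_ID }}
        TABLEAU_SERVER_URL: ${{ vars.TABLEAU_SERVER_URL }}
        TABLEAU_SITE_ID: ${{ vars.TABLEAU_SITE_ID }}
      run: python scripts/tableau/tableau_refresh.py
```
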
---

## GitHub Actions secrets and variables

The workflow requires the following GitHub environment configuration.

### Secrets

- `AWS_ACCOUNT_ID`
- `TABLEAU_TOKEN_NAME`
- `TABLEAU_TOKEN_VALUE`
- `TABLEAU_DATASOURCE_ID`

### Variables

- `TABLEAU_SITE_ID`
- `TABLEAU_SERVER_URL`

### GitHub environment

The workflow runs under the `dev` environment.

---

## Example GitHub Actions execution flow

```text
Schedule or manual trigger
  -> metadata job
  -> publish job
       -> setup Python
       -> install dependencies
       -> assume AWS role
       -> generate converted.hyper from S3 JSON files
       -> publish converted.hyper to Tableau datasource
       -> trigger Tableau cache refresh
```

---

## Troubleshooting

### `Datasource file not found: ./converted.hyper`

The publish script expects `converted.hyper` to exist in the current working directory. Run the data generation script first.

### `Missing required environment variables`

Ensure the required Tableau environment variables are set before running `tableau_refresh.py`.

### No data found

If no JSON files are found for the date range, the generation script will print:

```text
No data found for the selected date range.
```

and no Hyper file will be created.

### Cache refresh fails

By default, a Tableau cache refresh failure causes the script to exit with a non-zero status. Use:

```bash
python scripts/tableau/tableau_refresh.py --ignore-refresh-failure
```

if you want the publish to succeed even when the refresh ping fails.