Commit a877fc2

added readme file stub (#623)
* added readme file stub
* added details for updating tableau
* Renamed file to pass build checks.
* P in python upper case.
* Updated word readme.
* Edited file for markdown checks.
* Updates
1 file changed: scripts/tableau/README.md (302 additions, 0 deletions)

---

# Tableau DQ Metrics

This README describes the process for publishing DQ metrics data from AWS S3 to Tableau Server.

There are two methods for publishing the data to Tableau Server:

1. [Running the Python scripts locally](#python-scripts)
2. Using the [GitHub Actions workflow](#github-actions-workflow) to run the scripts on a schedule

## Overview

This section provides an overview of the process for publishing DQ metrics data to Tableau Server.

The Python scripts generate a Tableau `.hyper` datasource file from the DQ metric JSON files stored in S3.
This `.hyper` file is then published to Tableau Server by overwriting an existing datasource.
Once this datasource is updated, a ping is sent to a Tableau view to trigger a cache refresh,
ensuring that the latest data is available for visualisation.

The following sections provide more details on the Python scripts and the GitHub Actions workflow that automates this process.

## Python Scripts

There are two Python scripts involved:

- `generate_tableau_data.py`
  Reads DQ metric JSON files from the S3 bucket `eligibility-signposting-api-dev-dq-metrics`, filters to approximately the last 3 months, and writes a Tableau Hyper extract called `converted.hyper`.

- `tableau_refresh.py`
  Publishes `./converted.hyper` to Tableau Server by overwriting an existing datasource, then pings a Tableau view to trigger a cache refresh.

This work supports the EliD DQ metrics Tableau MVP, where Tableau is being used to visualise DQ metrics for monitoring and comparison against expected thresholds.

---

### What the scripts do

#### 1. Generate Hyper extract

`generate_tableau_data.py`:

- connects to S3
- scans daily `processing_date=YYYYMMDD/` prefixes for the last 90 days
- reads `.json` files from the bucket
- parses JSON or JSONL content into a pandas DataFrame
- creates a Tableau Hyper file named `converted.hyper`

The S3 source bucket is currently hard-coded as:

```python
S3_BUCKET = "eligibility-signposting-api-dev-dq-metrics"
```

and the output file is:

```python
LOCAL_HYPER_PATH = "converted.hyper"
```
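
For orientation, the generation flow can be sketched as below. This is a minimal sketch, not the actual script: the helper names (`read_recent_metrics`, `write_hyper`), the all-text column typing, and the exact pagination and error handling are illustrative assumptions.

```python
# Illustrative sketch of the generation flow described above; not the actual script.
import json
from datetime import datetime, timedelta, timezone

import boto3
import pandas as pd

S3_BUCKET = "eligibility-signposting-api-dev-dq-metrics"
DAYS_BACK = 90  # assumption: mirrors the "last 90 days" window above


def read_recent_metrics() -> pd.DataFrame:
    """Collect DQ metric JSON/JSONL objects from daily prefixes into one DataFrame."""
    s3 = boto3.client("s3")
    records = []
    today = datetime.now(timezone.utc).date()
    for offset in range(DAYS_BACK):
        day = today - timedelta(days=offset)
        prefix = f"processing_date={day.strftime('%Y%m%d')}/"
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET, Prefix=prefix)
        for page in pages:
            for obj in page.get("Contents", []):
                if not obj["Key"].endswith(".json"):
                    continue
                body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
                try:
                    parsed = json.loads(body)  # plain JSON: a single object or an array
                    records.extend(parsed if isinstance(parsed, list) else [parsed])
                except json.JSONDecodeError:
                    # fall back to JSONL: one JSON object per line
                    records.extend(json.loads(line) for line in body.splitlines() if line.strip())
    return pd.DataFrame(records)


def write_hyper(df: pd.DataFrame, path: str = "converted.hyper") -> None:
    """Write the DataFrame to a single-table Hyper extract, treating every column as text."""
    from tableauhyperapi import (
        Connection, CreateMode, HyperProcess, Inserter,
        SqlType, TableDefinition, TableName, Telemetry,
    )

    table = TableDefinition(
        TableName("Extract", "Extract"),
        [TableDefinition.Column(str(col), SqlType.text()) for col in df.columns],
    )
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(hyper.endpoint, path, CreateMode.CREATE_AND_REPLACE) as conn:
            conn.catalog.create_schema(table.table_name.schema_name)
            conn.catalog.create_table(table)
            with Inserter(conn, table) as inserter:
                inserter.add_rows(df.astype(str).itertuples(index=False, name=None))
                inserter.execute()
```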

#### 2. Publish to Tableau

`tableau_refresh.py`:

- checks that `./converted.hyper` exists
- validates the file type
- reads Tableau credentials and settings from environment variables
- signs in using a Tableau Personal Access Token (PAT)
- overwrites the configured datasource
- pings the Tableau view `EligibilityData-DQMetrics/DataQualityMetrics?:refresh=y` to trigger a refresh

NOTE: PAT credentials must be set as GitHub secrets for the workflow, and as environment variables for local testing.
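
The publish-and-ping step looks roughly like the sketch below, assuming standard `tableauserverclient` and `requests` usage. The real script's validation, error handling, and view URL construction may differ; the bare unauthenticated GET is shown for shape only.

```python
# Illustrative sketch of the publish-and-ping step; not the actual script.
import os

import requests
import tableauserverclient as TSC

HYPER_PATH = "./converted.hyper"

auth = TSC.PersonalAccessTokenAuth(
    token_name=os.environ["TABLEAU_TOKEN_NAME"],
    personal_access_token=os.environ["TABLEAU_TOKEN_VALUE"],
    site_id=os.environ.get("TABLEAU_SITE_ID", "NHSD_DEV"),
)
server = TSC.Server(os.environ["TABLEAU_SERVER_URL"], use_server_version=True)

with server.auth.sign_in(auth):
    # Overwrite the existing datasource, identified by its LUID.
    datasource = server.datasources.get_by_id(os.environ["TABLEAU_DATASOURCE_ID"])
    server.datasources.publish(datasource, HYPER_PATH, mode=TSC.Server.PublishMode.Overwrite)

# Ping the view with ?:refresh=y so Tableau discards the cached visualisation.
view_url = f"{os.environ['TABLEAU_SERVER_URL']}/views/EligibilityData-DQMetrics/DataQualityMetrics?:refresh=y"
requests.get(view_url, timeout=30)
```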

---

### Repository structure

The GitHub Actions workflow expects the scripts to exist at:

```text
scripts/tableau/generate_tableau_data.py
scripts/tableau/tableau_refresh.py
```

because it runs:

```bash
python scripts/tableau/generate_tableau_data.py
python scripts/tableau/tableau_refresh.py
```

---

### Running locally

#### Prerequisites

You will need:

- Python 3.13 (recommended, to match the workflow setup)
- Access to AWS with permission to read from the S3 bucket `eligibility-signposting-api-dev-dq-metrics`
- Tableau Personal Access Token (PAT) credentials
- The required Python packages installed:
  - `boto3`
  - `pandas`
  - `tableauserverclient`
  - `tableauhyperapi`
  - `requests`

#### Install dependencies

If installing dependencies locally:

```bash
pip install boto3 pandas tableauserverclient tableauhyperapi requests
```

#### Required environment variables

Before publishing to Tableau, set the following environment variables:

```bash
export TABLEAU_TOKEN_NAME="your-token-name"
export TABLEAU_TOKEN_VALUE="your-token-value"
export TABLEAU_SERVER_URL="https://your-tableau-server"
export TABLEAU_DATASOURCE_ID="your-datasource-id"
export TABLEAU_SITE_ID="NHSD_DEV"
```

`TABLEAU_SERVER_URL` is the base URL of the Tableau Server instance, for example `https://tableau.nhsd.com`.

`TABLEAU_DATASOURCE_ID` is the ID (LUID) of the datasource to overwrite, which can be found in the Tableau Server URL when viewing the datasource.

`TABLEAU_SITE_ID` defaults to `NHSD_DEV` if not set.
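
The refresh script reads these settings at startup. A minimal sketch of that pattern (the variable names follow the list above, but the exact validation logic is an assumption):

```python
# Illustrative sketch of reading and validating the Tableau settings.
import os
import sys

REQUIRED = [
    "TABLEAU_TOKEN_NAME",
    "TABLEAU_TOKEN_VALUE",
    "TABLEAU_SERVER_URL",
    "TABLEAU_DATASOURCE_ID",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")

site_id = os.environ.get("TABLEAU_SITE_ID", "NHSD_DEV")  # optional, with the documented default
```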

#### AWS credentials

You also need AWS credentials available locally so `boto3` can read from S3.
You may also need to set the AWS region if it is not configured globally:

```bash
export AWS_REGION=eu-west-2
```

The workflow uses `eu-west-2`.

#### Run the scripts

Step 1: Generate the Hyper file

```bash
python scripts/tableau/generate_tableau_data.py
```

This should create:

```text
converted.hyper
```

Step 2: Publish to Tableau

```bash
python scripts/tableau/tableau_refresh.py
```

If you want the publish step to continue even when the cache refresh ping fails:

```bash
python scripts/tableau/tableau_refresh.py --ignore-refresh-failure
```

The optional `--ignore-refresh-failure` flag prevents the script from exiting with an error if the Tableau refresh ping fails.
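
One plausible shape for that flag, assuming the script uses `argparse` (illustrative only; the helper `ping_tableau_view` is hypothetical and the actual argument handling may differ):

```python
# Illustrative sketch of the --ignore-refresh-failure flag handling.
import argparse
import sys

parser = argparse.ArgumentParser(description="Publish converted.hyper and ping the Tableau view.")
parser.add_argument(
    "--ignore-refresh-failure",
    action="store_true",
    help="Do not exit with an error if the cache refresh ping fails.",
)
args = parser.parse_args()

refresh_ok = ping_tableau_view()  # hypothetical helper returning True on success
if not refresh_ok and not args.ignore_refresh_failure:
    sys.exit("Tableau cache refresh failed.")
```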

---

### Expected local flow

1. Read recent DQ metric JSON data from S3
2. Build `converted.hyper`
3. Sign in to Tableau using PAT credentials
4. Overwrite the target datasource
5. Trigger cache refresh for the relevant workbook view

---

## GitHub Actions workflow

The GitHub Actions workflow is named:

```yaml
Daily Tableau Data Update
```

It supports:

- scheduled execution every day at `10:00 UTC`
- manual triggering using `workflow_dispatch` for testing

### Workflow triggers

```yaml
on:
  schedule:
    - cron: '0 10 * * *'
  workflow_dispatch:
```

### Workflow jobs

The workflow has two jobs:

#### 1. `metadata`

This job:

- checks out the repo
- reads versions from `.tool-versions`
- sets CI metadata such as build timestamp and version string

#### 2. `publish`

This job:

- sets up Python 3.13
- checks out the repository
- installs the required Python packages
- assumes the AWS deployment role using GitHub OIDC
- runs the S3 to Hyper script
- publishes the datasource to Tableau

---

## GitHub Actions secrets and variables

The workflow requires the following GitHub environment configuration.

### Secrets

- `AWS_ACCOUNT_ID`
- `TABLEAU_TOKEN_NAME`
- `TABLEAU_TOKEN_VALUE`
- `TABLEAU_DATASOURCE_ID`

### Variables

- `TABLEAU_SITE_ID`
- `TABLEAU_SERVER_URL`

### GitHub environment

The workflow runs under the `dev` environment.

---

## Example GitHub Actions execution flow

```text
Schedule or manual trigger
  -> metadata job
  -> publish job
       -> setup Python
       -> install dependencies
       -> assume AWS role
       -> generate converted.hyper from S3 JSON files
       -> publish converted.hyper to Tableau datasource
       -> trigger Tableau cache refresh
```

---

## Troubleshooting

### `Datasource file not found: ./converted.hyper`

The publish script expects `converted.hyper` to exist in the current working directory. Run the data generation script first.

### `Missing required environment variables`

Ensure the required Tableau environment variables are set before running `tableau_refresh.py`.

### No data found

If no JSON files are found for the date range, the generation script will print:

```text
No data found for the selected date range.
```

and no Hyper file will be created.
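
That guard presumably looks something like this (illustrative; `read_recent_metrics` and `write_hyper` are the hypothetical helpers from the sketch earlier in this README):

```python
# Illustrative sketch of the empty-data guard in the generation script.
df = read_recent_metrics()
if df.empty:
    print("No data found for the selected date range.")
else:
    write_hyper(df)  # converted.hyper is only created when there is data
```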

### Cache refresh fails

By default, a Tableau cache refresh failure causes the script to exit with a non-zero status. Use:

```bash
python scripts/tableau/tableau_refresh.py --ignore-refresh-failure
```

if you want the publish step to be treated as successful even when the refresh ping fails.