Commit 25cace6

docs: update generated documentation (#3507)
Co-authored-by: liferoad <7833268+liferoad@users.noreply.github.com>
1 parent 3a3605b commit 25cace6

3 files changed: 259 additions & 2 deletions

New file (257 additions & 0 deletions):

GCS Spanner Data Validation template
---
Batch pipeline that reads data from GCS and Spanner and compares them to
validate migration correctness.

:memo: This is a Google-provided template! Please
check the [Provided templates documentation](https://cloud.google.com/dataflow/docs/guides/templates/provided/gcs-spanner-dv)
on how to use it without having to build from sources using [Create job from template](https://console.cloud.google.com/dataflow/createjob?template=GCS_Spanner_Data_Validator).

:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.

## Parameters

### Required parameters

* **instanceId**: The destination Cloud Spanner instance.
* **databaseId**: The destination Cloud Spanner database.
* **bigQueryDataset**: The BigQuery dataset ID where the validation results will be stored. For example, `validation_report_dataset`.
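
The template writes its results into the dataset named by `bigQueryDataset`. If that dataset does not already exist, one way to create it beforehand is with the `bq` CLI; a minimal sketch (the project, location, and dataset name below are placeholders):

```shell
# Create the results dataset up front; substitute your own project and location.
bq mk --dataset --location=US <my-project>:validation_report_dataset
```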

### Optional parameters

* **gcsInputDirectory**: The Cloud Storage directory from which the Avro files containing the source records are read. For example, `gs://your-bucket/your-path`.
* **projectId**: The name of the Cloud Spanner project.
* **spannerHost**: The Cloud Spanner endpoint to call in the template. For example, `https://batch-spanner.googleapis.com`. Defaults to: https://batch-spanner.googleapis.com.
* **spannerPriority**: The request priority for Cloud Spanner calls. The value must be one of: [`HIGH`,`MEDIUM`,`LOW`]. Defaults to `HIGH`.
* **sessionFilePath**: Path to a session file in Cloud Storage that contains mapping information from the Spanner Migration Tool. Defaults to empty.
* **schemaOverridesFilePath**: A file that specifies the table and column name overrides from source to Spanner. Defaults to empty.
* **tableOverrides**: The table name overrides from source to Spanner, written in the following format: `[{SourceTableName1, SpannerTableName1}, {SourceTableName2, SpannerTableName2}]`. For example, `[{Singers, Vocalists}, {Albums, Records}]` maps the Singers table to Vocalists and the Albums table to Records (see the quoting example after this list). Defaults to empty.
* **columnOverrides**: The column name overrides from source to Spanner, written in the following format: `[{SourceTableName1.SourceColumnName1, SourceTableName1.SpannerColumnName1}, {SourceTableName2.SourceColumnName1, SourceTableName2.SpannerColumnName1}]`. Note that the SourceTableName should remain the same in both the source and Spanner pair; to override table names, use tableOverrides. For example, `[{Singers.SingerName, Singers.TalentName}, {Albums.AlbumName, Albums.RecordName}]` maps SingerName to TalentName and AlbumName to RecordName in the Singers and Albums tables respectively. Defaults to empty.
* **runId**: A unique identifier for the validation run. If not provided, the Dataflow Job Name will be used. For example, `run_20230101_120000`.
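
Because the `tableOverrides` and `columnOverrides` values contain brackets, commas, and spaces, quote each one as a single shell string when exporting it for the commands shown later. A minimal sketch, reusing the hypothetical Singers/Albums mappings from the examples above:

```shell
# Quote the whole value so the shell keeps brackets, commas, and spaces intact.
export TABLE_OVERRIDES="[{Singers, Vocalists}, {Albums, Records}]"
export COLUMN_OVERRIDES="[{Singers.SingerName, Singers.TalentName}, {Albums.AlbumName, Albums.RecordName}]"
```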

## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
    * `gcloud auth login`
    * `gcloud auth application-default login`
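
Before building, it can help to confirm the toolchain is in place. A quick, optional sanity check:

```shell
java -version     # expect a 17.x JDK
mvn -version
gcloud auth list  # confirms an active, authenticated account
```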

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=v2/gcs-spanner-dv/src/main/java/com/google/cloud/teleport/v2/templates/GCSSpannerDV.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
  -DskipTests -am \
  -pl v2/gcs-spanner-dv
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is to just stage the template (i.e., make it available for use via
the `gcloud` command or the Dataflow "Create job from template" UI),
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
  -DstagePrefix="templates" \
  -DtemplateName="GCS_Spanner_Data_Validator" \
  -pl v2/gcs-spanner-dv -am
```

The `-DartifactRegistry` parameter can be specified to set the Artifact Registry repository for the Flex Template image.
If not provided, it defaults to `gcr.io/<project>`.
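
If the Artifact Registry repository does not exist yet, it can be created up front. A minimal sketch, assuming a Docker-format repository; the repository name and region are placeholders:

```shell
# Create a Docker-format repository to hold the Flex Template image.
gcloud artifacts repositories create <repo> \
  --project="$PROJECT" \
  --location=<region> \
  --repository-format=docker
```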

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/GCS_Spanner_Data_Validator
```

The specific path should be copied as it will be used in the following steps.
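
To confirm the template spec is in place before running, an optional check with `gsutil`, using the bucket exported above:

```shell
# The staged spec should appear as an object at this path.
gsutil ls "gs://$BUCKET_NAME/templates/flex/GCS_Spanner_Data_Validator"
```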

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you are going to
need valid resources for the required parameters.

With those in place, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/GCS_Spanner_Data_Validator"

### Required
export INSTANCE_ID=<instanceId>
export DATABASE_ID=<databaseId>
export BIG_QUERY_DATASET=<bigQueryDataset>

### Optional
export GCS_INPUT_DIRECTORY=<gcsInputDirectory>
export PROJECT_ID=<projectId>
export SPANNER_HOST=https://batch-spanner.googleapis.com
export SPANNER_PRIORITY=HIGH
export SESSION_FILE_PATH=""
export SCHEMA_OVERRIDES_FILE_PATH=""
export TABLE_OVERRIDES=""
export COLUMN_OVERRIDES=""
export RUN_ID=<runId>

gcloud dataflow flex-template run "gcs-spanner-data-validator-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "gcsInputDirectory=$GCS_INPUT_DIRECTORY" \
  --parameters "projectId=$PROJECT_ID" \
  --parameters "spannerHost=$SPANNER_HOST" \
  --parameters "instanceId=$INSTANCE_ID" \
  --parameters "databaseId=$DATABASE_ID" \
  --parameters "spannerPriority=$SPANNER_PRIORITY" \
  --parameters "sessionFilePath=$SESSION_FILE_PATH" \
  --parameters "schemaOverridesFilePath=$SCHEMA_OVERRIDES_FILE_PATH" \
  --parameters "tableOverrides=$TABLE_OVERRIDES" \
  --parameters "columnOverrides=$COLUMN_OVERRIDES" \
  --parameters "bigQueryDataset=$BIG_QUERY_DATASET" \
  --parameters "runId=$RUN_ID"
```

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run
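
After submission, the job's status can be checked from the same shell; a minimal sketch using the job name from the command above:

```shell
# Shows the most recent job with this name, including its current state.
gcloud dataflow jobs list --project "$PROJECT" --region "$REGION" \
  --filter="name=gcs-spanner-data-validator-job" --limit=1
```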

**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export INSTANCE_ID=<instanceId>
export DATABASE_ID=<databaseId>
export BIG_QUERY_DATASET=<bigQueryDataset>

### Optional
export GCS_INPUT_DIRECTORY=<gcsInputDirectory>
export PROJECT_ID=<projectId>
export SPANNER_HOST=https://batch-spanner.googleapis.com
export SPANNER_PRIORITY=HIGH
export SESSION_FILE_PATH=""
export SCHEMA_OVERRIDES_FILE_PATH=""
export TABLE_OVERRIDES=""
export COLUMN_OVERRIDES=""
export RUN_ID=<runId>

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -Dregion="$REGION" \
  -DjobName="gcs-spanner-data-validator-job" \
  -DtemplateName="GCS_Spanner_Data_Validator" \
  -Dparameters="gcsInputDirectory=$GCS_INPUT_DIRECTORY,projectId=$PROJECT_ID,spannerHost=$SPANNER_HOST,instanceId=$INSTANCE_ID,databaseId=$DATABASE_ID,spannerPriority=$SPANNER_PRIORITY,sessionFilePath=$SESSION_FILE_PATH,schemaOverridesFilePath=$SCHEMA_OVERRIDES_FILE_PATH,tableOverrides=$TABLE_OVERRIDES,columnOverrides=$COLUMN_OVERRIDES,bigQueryDataset=$BIG_QUERY_DATASET,runId=$RUN_ID" \
  -f v2/gcs-spanner-dv
```

## Terraform

Dataflow supports using Terraform to manage template jobs;
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository, including the parameters
specific to each template. If available, they may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd v2/gcs-spanner-dv/terraform/GCS_Spanner_Data_Validator
terraform init
terraform apply
```
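
The exact inputs exposed by the generated module live in its `variables.tf`; assuming it accepts the usual `project` and `region` variables, values can be passed on the command line:

```shell
# Variable names are assumptions; check the module's variables.tf first.
terraform apply -var="project=<my-project>" -var="region=us-central1"
```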

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:

```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "gcs_spanner_data_validator" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/GCS_Spanner_Data_Validator"
  name                    = "gcs-spanner-data-validator"
  region                  = var.region
  parameters              = {
    instanceId      = "<instanceId>"
    databaseId      = "<databaseId>"
    bigQueryDataset = "<bigQueryDataset>"
    # gcsInputDirectory = "<gcsInputDirectory>"
    # projectId = "<projectId>"
    # spannerHost = "https://batch-spanner.googleapis.com"
    # spannerPriority = "HIGH"
    # sessionFilePath = ""
    # schemaOverridesFilePath = ""
    # tableOverrides = ""
    # columnOverrides = ""
    # runId = "<runId>"
  }
}
```

v2/googlecloud-to-googlecloud/README_Stream_GCS_Text_to_BigQuery_Flex.md (1 addition & 1 deletion)

```diff
@@ -41,7 +41,7 @@
 
 * **outputDeadletterTable**: Table for messages that failed to reach the output table. If a table doesn't exist, it is created during pipeline execution. If not specified, `<outputTableSpec>_error_records` is used. For example, `<PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>`.
 * **useStorageWriteApiAtLeastOnce**: This parameter takes effect only if `Use BigQuery Storage Write API` is enabled. If enabled the at-least-once semantics will be used for Storage Write API, otherwise exactly-once semantics will be used. Defaults to: false.
-* **useStorageWriteApi**: If `true`, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
+* **useStorageWriteApi**: If true, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
 * **numStorageWriteApiStreams**: When using the Storage Write API, specifies the number of write streams. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter. Defaults to: 0.
 * **storageWriteApiTriggeringFrequencySec**: When using the Storage Write API, specifies the triggering frequency, in seconds. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter.
 * **pythonExternalTextTransformGcsPath**: The Cloud Storage path pattern for the Python code containing your user-defined functions. For example, `gs://your-bucket/your-function.py`.
```

v2/googlecloud-to-googlecloud/README_Stream_GCS_Text_to_BigQuery_Xlang.md (1 addition & 1 deletion)

```diff
@@ -39,7 +39,7 @@
 
 * **outputDeadletterTable**: Table for messages that failed to reach the output table. If a table doesn't exist, it is created during pipeline execution. If not specified, `<outputTableSpec>_error_records` is used. For example, `<PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>`.
 * **useStorageWriteApiAtLeastOnce**: This parameter takes effect only if `Use BigQuery Storage Write API` is enabled. If enabled the at-least-once semantics will be used for Storage Write API, otherwise exactly-once semantics will be used. Defaults to: false.
-* **useStorageWriteApi**: If `true`, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
+* **useStorageWriteApi**: If true, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
 * **numStorageWriteApiStreams**: When using the Storage Write API, specifies the number of write streams. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter. Defaults to: 0.
 * **storageWriteApiTriggeringFrequencySec**: When using the Storage Write API, specifies the triggering frequency, in seconds. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter.
 * **pythonExternalTextTransformGcsPath**: The Cloud Storage path pattern for the Python code containing your user-defined functions. For example, `gs://your-bucket/your-function.py`.
```
