GCS Spanner Data Validation template
---
Batch pipeline that reads data from GCS and Cloud Spanner and compares the two
datasets to validate migration correctness.


:memo: This is a Google-provided template! Please
check [Provided templates documentation](https://cloud.google.com/dataflow/docs/guides/templates/provided/gcs-spanner-dv)
on how to use it without having to build from sources using [Create job from template](https://console.cloud.google.com/dataflow/createjob?template=GCS_Spanner_Data_Validator).

:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.

## Parameters

### Required parameters

* **instanceId**: The destination Cloud Spanner instance.
* **databaseId**: The destination Cloud Spanner database.
* **bigQueryDataset**: The BigQuery dataset ID where the validation results will be stored. For example, `validation_report_dataset`.

### Optional parameters

* **gcsInputDirectory**: The directory from which the AVRO files of the records read from the source are read. For example, `gs://your-bucket/your-path`.
* **projectId**: The ID of the Cloud Spanner project.
* **spannerHost**: The Cloud Spanner endpoint to call in the template. For example, `https://batch-spanner.googleapis.com`. Defaults to: https://batch-spanner.googleapis.com.
* **spannerPriority**: The request priority for Cloud Spanner calls. The value must be one of: [`HIGH`,`MEDIUM`,`LOW`]. Defaults to `HIGH`.
* **sessionFilePath**: Session file path in Cloud Storage that contains mapping information from Spanner Migration Tool. Defaults to empty.
* **schemaOverridesFilePath**: A file which specifies the table and column name overrides from source to Spanner. Defaults to empty.
* **tableOverrides**: The table name overrides from source to Spanner, written in the following format: `[{SourceTableName1, SpannerTableName1}, {SourceTableName2, SpannerTableName2}]`. The example below shows mapping the Singers table to Vocalists and the Albums table to Records. For example, `[{Singers, Vocalists}, {Albums, Records}]`. Defaults to empty.
* **columnOverrides**: The column name overrides from source to Spanner, written in the following format: `[{SourceTableName1.SourceColumnName1, SourceTableName1.SpannerColumnName1}, {SourceTableName2.SourceColumnName1, SourceTableName2.SpannerColumnName1}]`. Note that the SourceTableName should remain the same in both the source and Spanner pair. To override table names, use tableOverrides. The example below shows mapping SingerName to TalentName and AlbumName to RecordName in the Singers and Albums tables respectively. For example, `[{Singers.SingerName, Singers.TalentName}, {Albums.AlbumName, Albums.RecordName}]`. Defaults to empty.
* **runId**: A unique identifier for the validation run. If not provided, the Dataflow job name will be used. For example, `run_20230101_120000`.

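As a concrete illustration of the override formats, the example mappings from the parameter descriptions above can be exported as environment variables before launching a job (a sketch; the variable names match the `gcloud` examples later in this README):

```shell
# Map the Singers table to Vocalists and the Albums table to Records
# (example values taken from the parameter descriptions above).
export TABLE_OVERRIDES="[{Singers, Vocalists}, {Albums, Records}]"

# Rename columns without renaming tables: SingerName -> TalentName,
# AlbumName -> RecordName. The table name stays the same on both sides.
export COLUMN_OVERRIDES="[{Singers.SingerName, Singers.TalentName}, {Albums.AlbumName, Albums.RecordName}]"

echo "$TABLE_OVERRIDES"
echo "$COLUMN_OVERRIDES"
```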


## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[Open in Cloud Shell](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=v2/gcs-spanner-dv/src/main/java/com/google/cloud/teleport/v2/templates/GCSSpannerDV.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
-DskipTests -am \
-pl v2/gcs-spanner-dv
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is to just stage the template (i.e., make it available to use) by
the `gcloud` command or Dataflow "Create job from template" UI,
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
-DstagePrefix="templates" \
-DtemplateName="GCS_Spanner_Data_Validator" \
-pl v2/gcs-spanner-dv -am
```

The `-DartifactRegistry` parameter can be specified to set the Artifact Registry repository for the Flex Template image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/GCS_Spanner_Data_Validator
```

The specific path should be copied as it will be used in the following steps.
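The staged path composes as `gs://<bucketName>/<stagePrefix>/flex/<templateName>`, matching the values passed to the staging command. A minimal sketch of reconstructing it in a variable for reuse (the bucket name here is a hypothetical placeholder; use the bucket passed to `-DbucketName` above):

```shell
# Hypothetical bucket name; substitute your own staging bucket.
BUCKET_NAME="my-bucket"
STAGE_PREFIX="templates"                    # matches -DstagePrefix above
TEMPLATE_NAME="GCS_Spanner_Data_Validator" # matches -DtemplateName above

export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/$STAGE_PREFIX/flex/$TEMPLATE_NAME"
echo "$TEMPLATE_SPEC_GCSPATH"
```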

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you are going to
need valid resources for the required parameters.

With those in place, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/GCS_Spanner_Data_Validator"

### Required
export INSTANCE_ID=<instanceId>
export DATABASE_ID=<databaseId>
export BIG_QUERY_DATASET=<bigQueryDataset>

### Optional
export GCS_INPUT_DIRECTORY=<gcsInputDirectory>
export PROJECT_ID=<projectId>
export SPANNER_HOST=https://batch-spanner.googleapis.com
export SPANNER_PRIORITY=HIGH
export SESSION_FILE_PATH=""
export SCHEMA_OVERRIDES_FILE_PATH=""
export TABLE_OVERRIDES=""
export COLUMN_OVERRIDES=""
export RUN_ID=<runId>

gcloud dataflow flex-template run "gcs-spanner-data-validator-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "gcsInputDirectory=$GCS_INPUT_DIRECTORY" \
  --parameters "projectId=$PROJECT_ID" \
  --parameters "spannerHost=$SPANNER_HOST" \
  --parameters "instanceId=$INSTANCE_ID" \
  --parameters "databaseId=$DATABASE_ID" \
  --parameters "spannerPriority=$SPANNER_PRIORITY" \
  --parameters "sessionFilePath=$SESSION_FILE_PATH" \
  --parameters "schemaOverridesFilePath=$SCHEMA_OVERRIDES_FILE_PATH" \
  --parameters "tableOverrides=$TABLE_OVERRIDES" \
  --parameters "columnOverrides=$COLUMN_OVERRIDES" \
  --parameters "bigQueryDataset=$BIG_QUERY_DATASET" \
  --parameters "runId=$RUN_ID"
```

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run


**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export INSTANCE_ID=<instanceId>
export DATABASE_ID=<databaseId>
export BIG_QUERY_DATASET=<bigQueryDataset>

### Optional
export GCS_INPUT_DIRECTORY=<gcsInputDirectory>
export PROJECT_ID=<projectId>
export SPANNER_HOST=https://batch-spanner.googleapis.com
export SPANNER_PRIORITY=HIGH
export SESSION_FILE_PATH=""
export SCHEMA_OVERRIDES_FILE_PATH=""
export TABLE_OVERRIDES=""
export COLUMN_OVERRIDES=""
export RUN_ID=<runId>

mvn clean package -PtemplatesRun \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-Dregion="$REGION" \
-DjobName="gcs-spanner-data-validator-job" \
-DtemplateName="GCS_Spanner_Data_Validator" \
-Dparameters="gcsInputDirectory=$GCS_INPUT_DIRECTORY,projectId=$PROJECT_ID,spannerHost=$SPANNER_HOST,instanceId=$INSTANCE_ID,databaseId=$DATABASE_ID,spannerPriority=$SPANNER_PRIORITY,sessionFilePath=$SESSION_FILE_PATH,schemaOverridesFilePath=$SCHEMA_OVERRIDES_FILE_PATH,tableOverrides=$TABLE_OVERRIDES,columnOverrides=$COLUMN_OVERRIDES,bigQueryDataset=$BIG_QUERY_DATASET,runId=$RUN_ID" \
-f v2/gcs-spanner-dv
```

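Note that `-Dparameters` takes a single comma-separated string, so it can help to assemble and inspect that string separately before invoking Maven. A minimal sketch covering only the three required parameters, with hypothetical example values:

```shell
# Hypothetical example values for the three required parameters.
INSTANCE_ID="my-instance"
DATABASE_ID="my-database"
BIG_QUERY_DATASET="validation_report_dataset"

# Join the key=value pairs with commas, exactly as -Dparameters expects.
PARAMETERS="instanceId=$INSTANCE_ID,databaseId=$DATABASE_ID,bigQueryDataset=$BIG_QUERY_DATASET"
echo "$PARAMETERS"
```

The optional parameters can be appended to the same string in the same `key=value` form.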
## Terraform

Dataflow supports using Terraform to manage template jobs;
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository, including the relevant parameters
specific to the template. If available, they may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd v2/gcs-spanner-dv/terraform/GCS_Spanner_Data_Validator
terraform init
terraform apply
```

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:

```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "gcs_spanner_data_validator" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/GCS_Spanner_Data_Validator"
  name                    = "gcs-spanner-data-validator"
  region                  = var.region
  parameters              = {
    instanceId      = "<instanceId>"
    databaseId      = "<databaseId>"
    bigQueryDataset = "<bigQueryDataset>"
    # gcsInputDirectory = "<gcsInputDirectory>"
    # projectId = "<projectId>"
    # spannerHost = "https://batch-spanner.googleapis.com"
    # spannerPriority = "HIGH"
    # sessionFilePath = ""
    # schemaOverridesFilePath = ""
    # tableOverrides = ""
    # columnOverrides = ""
    # runId = "<runId>"
  }
}
```