# AWS Glue to Google Cloud Dataplex Connector

This connector extracts metadata from AWS Glue and transforms it into a format that can be imported into Google Cloud Dataplex. It captures database, table, and lineage information from AWS Glue and prepares it for ingestion into Dataplex, allowing you to catalog your AWS data assets within Google Cloud.

This connector is designed to be run from a Python virtual environment.

***

## Prerequisites

Before using this connector, you need the following set up:

1. **AWS Credentials**: An AWS access key ID and secret access key with permissions to access AWS Glue.
2. **Google Cloud Project**: A Google Cloud project to run the script and store the output.
3. **GCP Secret Manager**: The AWS credentials must be stored as a secret in Google Cloud Secret Manager.
4. **Python 3** and **pip** installed.

***

## AWS Credentials Setup

This connector requires an IAM user with `AWSGlueConsoleFullAccess` (or a read-only equivalent) and `AmazonS3ReadOnlyAccess` (to download job scripts for lineage).

1. Create an IAM user in the AWS Console.
2. Attach the policies `AWSGlueConsoleFullAccess` and `AmazonS3ReadOnlyAccess`.
3. Generate an **Access Key ID** and **Secret Access Key**.
4. Store these in GCP Secret Manager as a **JSON object**:
   ```json
   {
     "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
     "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY"
   }
   ```
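
For example, you can store the credentials with the `gcloud` CLI. The secret ID `aws-glue-credentials` below is illustrative; use whatever ID you set as `gcp_secret_id` in `config.json`:

```bash
# Create the secret (the ID "aws-glue-credentials" is an assumption;
# it must match gcp_secret_id in config.json)
gcloud secrets create aws-glue-credentials --replication-policy="automatic"

# aws_credentials.json is a local file holding the JSON object shown above
gcloud secrets versions add aws-glue-credentials --data-file="aws_credentials.json"
```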

***

## Setup Resources

To run this connector, you must first create the required Dataplex resources.

### Required Catalog Objects

Note: before importing metadata, the Entry Group and all Entry Types and Aspect Types referenced in the metadata import file must already exist in the target project and location. This connector requires the following:

| Catalog Object | IDs required by connector |
| :--- | :--- |
| **Entry Group** | Defined in `config.json` as `entry_group_id` |
| **Entry Types** | `aws-glue-database`, `aws-glue-table`, `aws-glue-view` |
| **Aspect Types** | `aws-glue-database`, `aws-glue-table`, `aws-glue-view`, `aws-lineage-aspect` |

See [manage entries and create custom sources](https://cloud.google.com/dataplex/docs/ingest-custom-sources) for instructions on creating Entry Groups, Entry Types, and Aspect Types.

### Option 1: Automated Setup (Recommended)

Run the provided script to create all resources automatically:

```bash
# Set your project and location
export PROJECT_ID=your-project-id
export LOCATION=us-central1
export ENTRY_GROUP_ID=aws-glue-entries

# Run the setup script
chmod +x scripts/setup_dataplex_resources.sh
./scripts/setup_dataplex_resources.sh
```

### Option 2: Manual Setup

If you prefer to create the resources manually, ensure you define the following (a `gcloud` sketch follows the schema below):

**Entry Types:**
* `aws-glue-database`
* `aws-glue-table`
* `aws-glue-view`

**Aspect Types:**
* `aws-glue-database`, `aws-glue-table`, `aws-glue-view` (marker aspects)
* `aws-lineage-aspect` (schema below)

<details>
<summary>Click to see the schema for aws-lineage-aspect</summary>

```json
{
  "type": "record",
  "recordFields": [
    {
      "name": "links",
      "type": "array",
      "index": 1,
      "arrayItems": {
        "type": "record",
        "recordFields": [
          {
            "name": "source",
            "type": "record",
            "index": 1,
            "recordFields": [
              { "name": "fully_qualified_name", "type": "string", "index": 1 }
            ]
          },
          {
            "name": "target",
            "type": "record",
            "index": 2,
            "recordFields": [
              { "name": "fully_qualified_name", "type": "string", "index": 1 }
            ]
          }
        ]
      }
    }
  ]
}
```
</details>
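
If you go this route, the resources can be created with the `gcloud` CLI. This is a minimal sketch, assuming the environment variables from Option 1 and a local file `aws_lineage_aspect.json` containing the schema above (the file name is an assumption):

```bash
# Entry group that will hold the imported entries
gcloud dataplex entry-groups create $ENTRY_GROUP_ID \
  --project=$PROJECT_ID --location=$LOCATION

# One entry type per Glue object kind (repeat for aws-glue-database and aws-glue-view)
gcloud dataplex entry-types create aws-glue-table \
  --project=$PROJECT_ID --location=$LOCATION

# Aspect type carrying the lineage schema shown above
gcloud dataplex aspect-types create aws-lineage-aspect \
  --project=$PROJECT_ID --location=$LOCATION \
  --metadata-template-file-name=aws_lineage_aspect.json
```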

For more details, see [manage entries and create custom sources](https://cloud.google.com/dataplex/docs/ingest-custom-sources).

***

## Configuration

The connector is configured using the `config.json` file. Ensure this file is present in the same directory as `main.py`.

| Parameter | Description |
| :--- | :--- |
| **`aws_region`** | The AWS region where your Glue Data Catalog is located (e.g., `eu-north-1`). |
| **`project_id`** | Your Google Cloud Project ID. |
| **`location_id`** | The Google Cloud region where you want to run the script (e.g., `us-central1`). |
| **`entry_group_id`** | The Dataplex entry group ID where the metadata will be imported. |
| **`gcs_bucket`** | The Google Cloud Storage bucket where the output metadata file will be stored. |
| **`aws_account_id`** | Your AWS account ID. |
| **`output_folder`** | The folder within the GCS bucket where the output file will be stored. |
| **`gcp_secret_id`** | The ID of the secret in GCP Secret Manager that contains your AWS credentials. |
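
A filled-in `config.json` might look like the following; every value here is an illustrative placeholder, not a default:

```json
{
  "aws_region": "eu-north-1",
  "project_id": "my-gcp-project",
  "location_id": "us-central1",
  "entry_group_id": "aws-glue-entries",
  "gcs_bucket": "my-dataplex-import-bucket",
  "aws_account_id": "123456789012",
  "output_folder": "glue-metadata",
  "gcp_secret_id": "aws-glue-credentials"
}
```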

***

## Running the Connector

You can run the connector from your local machine using a Python virtual environment.

### Setup and Execution

1. **Create a virtual environment:**
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```
2. **Install the required dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
3. **Run the connector:**
   Execute the `main.py` script. It will read settings from `config.json` in the current directory.
   ```bash
   python3 main.py
   ```
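
Note that the connector also needs Google Cloud credentials to read the secret from Secret Manager and write to GCS. Assuming it uses the standard Google Cloud client libraries with Application Default Credentials, this is usually sufficient for local runs:

```bash
# Set up Application Default Credentials so the Python client
# libraries can reach Secret Manager and Cloud Storage
gcloud auth application-default login
```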

***

## Output

The connector generates a JSONL file in the specified GCS bucket and folder. This file contains the extracted metadata in a format that can be imported into Dataplex.
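
For orientation, each line of the file is a metadata import item. The sketch below follows the Dataplex metadata import file format, pretty-printed for readability (the real file has one JSON object per line), with illustrative names rather than the connector's actual output:

```json
{
  "entry": {
    "name": "projects/my-gcp-project/locations/us-central1/entryGroups/aws-glue-entries/entries/orders-table",
    "entryType": "projects/my-gcp-project/locations/us-central1/entryTypes/aws-glue-table",
    "aspects": {
      "my-gcp-project.us-central1.aws-glue-table": { "data": {} }
    }
  },
  "updateMask": "aspects",
  "aspectKeys": ["my-gcp-project.us-central1.aws-glue-table"]
}
```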

***

## Importing Metadata into Dataplex

Once the metadata file has been generated, you can import it into Dataplex using a metadata import job.

1. **Prepare the Request File:**
   Open the `request.json` file and replace the following placeholders with your actual values (a sketch of this file follows these steps):
   * `<YOUR_GCS_BUCKET>`: The bucket where the output file was saved.
   * `<YOUR_OUTPUT_FOLDER>`: The folder where the output file was saved.
   * `<YOUR_PROJECT_ID>`: Your Google Cloud Project ID.
   * `<YOUR_LOCATION>`: Your Google Cloud Location (e.g., `us-central1`).
   * `<YOUR_ENTRY_GROUP_ID>`: The Dataplex Entry Group ID.

2. **Run the Import Command:**
   Use `curl` to initiate the import. Replace `{project-id}`, `{location}`, and `{job-id}` in the URL.

   ```bash
   curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataplex.googleapis.com/v1/projects/{project-id}/locations/{location}/metadataJobs?metadataJobId={job-id}"
   ```
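
The repository's `request.json` is the source of truth; for reference, a metadata import job request typically looks like this sketch, which is based on the Dataplex metadata import API and reuses the placeholders from step 1:

```json
{
  "type": "IMPORT",
  "import_spec": {
    "source_storage_uri": "gs://<YOUR_GCS_BUCKET>/<YOUR_OUTPUT_FOLDER>/",
    "entry_sync_mode": "FULL",
    "aspect_sync_mode": "INCREMENTAL",
    "scope": {
      "entry_groups": [
        "projects/<YOUR_PROJECT_ID>/locations/<YOUR_LOCATION>/entryGroups/<YOUR_ENTRY_GROUP_ID>"
      ],
      "entry_types": [
        "projects/<YOUR_PROJECT_ID>/locations/<YOUR_LOCATION>/entryTypes/aws-glue-database",
        "projects/<YOUR_PROJECT_ID>/locations/<YOUR_LOCATION>/entryTypes/aws-glue-table",
        "projects/<YOUR_PROJECT_ID>/locations/<YOUR_LOCATION>/entryTypes/aws-glue-view"
      ],
      "aspect_types": [
        "projects/<YOUR_PROJECT_ID>/locations/<YOUR_LOCATION>/aspectTypes/aws-lineage-aspect"
      ]
    }
  }
}
```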

***

## Metadata Extracted

The connector maps AWS Glue objects to Dataplex entries as follows:

| AWS Glue Object | Dataplex Entry Type |
| :--- | :--- |
| **Database** | `aws-glue-database` |
| **Table** | `aws-glue-table` |
| **View** | `aws-glue-view` |

### Lineage

The connector parses AWS Glue job scripts (Python/Scala) to extract lineage (see the sketch after this list):
- **Source**: `DataSource` nodes in the Glue job graph.
- **Target**: `DataSink` nodes in the Glue job graph.
- **Result**: Lineage is visualized in Dataplex from source table -> target table.
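
To see the raw material the connector works with, you can fetch a job's script yourself with the AWS CLI. The job name `my-etl-job` and bucket below are hypothetical; the connector does the equivalent programmatically, which is why `AmazonS3ReadOnlyAccess` is required:

```bash
# Look up where the job's script lives (job name is hypothetical)
aws glue get-job --job-name my-etl-job \
  --query 'Job.Command.ScriptLocation' --output text
# e.g. s3://my-bucket/scripts/my-etl-job.py

# Download the script for inspection
aws s3 cp s3://my-bucket/scripts/my-etl-job.py .
```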

***

## Docker Setup

You can containerize this connector to run on Cloud Run, Dataproc, or Kubernetes (a Cloud Run sketch follows the steps below).

1. **Build the Image**:
   ```bash
   docker build -t aws-glue-connector:latest .
   ```

2. **Run Locally** (passing config):
   Ensure `config.json` is in the current directory or mounted.
   ```bash
   docker run -v $(pwd)/config.json:/app/config.json -v $(pwd)/src:/app/src aws-glue-connector:latest
   ```

3. **Push to GCR/Artifact Registry**:
   ```bash
   gcloud auth configure-docker
   docker tag aws-glue-connector:latest gcr.io/YOUR_PROJECT/aws-glue-connector:latest
   docker push gcr.io/YOUR_PROJECT/aws-glue-connector:latest
   ```
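
Once the image is pushed, one way to run it is as a Cloud Run job. This is a minimal sketch; the job name and region are placeholders, and it assumes the container exits after a single extraction run:

```bash
# Create a Cloud Run job from the pushed image
gcloud run jobs create aws-glue-connector \
  --image=gcr.io/YOUR_PROJECT/aws-glue-connector:latest \
  --region=us-central1

# Trigger one execution of the connector
gcloud run jobs execute aws-glue-connector --region=us-central1
```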