
Commit 56cd6cb

Adding AWS Glue connector
1 parent 944889c commit 56cd6cb

19 files changed

Lines changed: 1192 additions & 0 deletions
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Step 1: Use an official OpenJDK base image, as Spark requires Java
FROM openjdk:11-jre-slim

# Step 2: Set environment variables for Spark and Python
ENV SPARK_VERSION=3.5.0
ENV HADOOP_VERSION=3
ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH
ENV PYTHONUNBUFFERED=1

# Step 3: Install Python, pip, and other necessary tools
RUN apt-get update && \
    apt-get install -y python3 python3-pip curl && \
    rm -rf /var/lib/apt/lists/*

# Step 4: Download and install Spark
RUN curl -fSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" -o /tmp/spark.tgz && \
    tar -xvf /tmp/spark.tgz -C /opt/ && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} ${SPARK_HOME} && \
    rm /tmp/spark.tgz

# Step 5: Set up the application directory
WORKDIR /app

# Step 6: Copy and install Python dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Step 7: Copy your application source code
COPY src ./src
COPY config.json .
COPY pyspark_job.py .

# Step 8: Define the entry point for running the PySpark job
ENTRYPOINT ["spark-submit", "pyspark_job.py"]
Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
# AWS Glue to Google Cloud Dataplex Connector

This connector extracts metadata from AWS Glue and transforms it into a format that can be imported into Google Cloud Dataplex. It captures database, table, and lineage information from AWS Glue and prepares it for ingestion into Dataplex, allowing you to catalog your AWS data assets within Google Cloud.

This connector is designed to be run from a Python virtual environment.

***

## Prerequisites

Before using this connector, you need to have the following set up:

1. **AWS Credentials**: You will need an AWS access key ID and a secret access key with permissions to access AWS Glue.
2. **Google Cloud Project**: A Google Cloud project is required to run the script and store the output.
3. **GCP Secret Manager**: The AWS credentials must be stored in a secret in Google Cloud Secret Manager.
4. **Python 3** and **pip** installed.

***
## AWS Credentials Setup

This connector requires an IAM user with `AWSGlueConsoleFullAccess` (or a read-only equivalent) and `AmazonS3ReadOnlyAccess` (needed to download job scripts for lineage extraction).

1. Create an IAM user in the AWS Console.
2. Attach the policies `AWSGlueConsoleFullAccess` and `AmazonS3ReadOnlyAccess`.
3. Generate an **Access Key ID** and **Secret Access Key**.
4. Store these in GCP Secret Manager as a **JSON object**:
   ```json
   {
     "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
     "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY"
   }
   ```
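At runtime the connector reads this secret and uses it to authenticate against AWS. A minimal sketch of that flow, assuming the `google-cloud-secret-manager` and `boto3` packages are in `requirements.txt` (the function name here is illustrative, not the connector's actual API):

```python
import json

import boto3
from google.cloud import secretmanager

def make_glue_client(project_id: str, secret_id: str, aws_region: str):
    """Fetch the AWS credential JSON from Secret Manager and build a Glue client."""
    sm = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    payload = sm.access_secret_version(request={"name": name}).payload.data
    creds = json.loads(payload.decode("utf-8"))
    return boto3.client(
        "glue",
        region_name=aws_region,
        aws_access_key_id=creds["access_key_id"],
        aws_secret_access_key=creds["secret_access_key"],
    )
```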
***
## Setup Resources

To run this connector, you must first create the required Dataplex resources.

### Required Catalog Objects

Note: Before importing metadata, the Entry Group and all Entry Types and Aspect Types referenced in the metadata import file must exist in the target project and location. This connector requires the following Entry Group, Entry Types, and Aspect Types:

| Catalog Object | IDs required by connector |
| :--- | :--- |
| **Entry Group** | Defined in `config.json` as `entry_group_id` |
| **Entry Types** | `aws-glue-database`, `aws-glue-table`, `aws-glue-view` |
| **Aspect Types** | `aws-glue-database`, `aws-glue-table`, `aws-glue-view`, `aws-lineage-aspect` |

See [manage entries and create custom sources](https://cloud.google.com/dataplex/docs/ingest-custom-sources) for instructions on creating Entry Groups, Entry Types, and Aspect Types.
### Option 1: Automated Setup (Recommended)

Run the provided script to create all resources automatically:

```bash
# Set your project and location
export PROJECT_ID=your-project-id
export LOCATION=us-central1
export ENTRY_GROUP_ID=aws-glue-entries

# Run the setup script
chmod +x scripts/setup_dataplex_resources.sh
./scripts/setup_dataplex_resources.sh
```
### Option 2: Manual Setup

If you prefer to create them manually, ensure you define the following:

**Entry Types:**
* `aws-glue-database`
* `aws-glue-table`
* `aws-glue-view`

**Aspect Types:**
* `aws-glue-database`, `aws-glue-table`, `aws-glue-view` (marker aspects)
* `aws-lineage-aspect` (schema below)

<details>
<summary>Click to see the schema for aws-lineage-aspect</summary>

```json
{
  "type": "record",
  "recordFields": [
    {
      "name": "links",
      "type": "array",
      "index": 1,
      "arrayItems": {
        "type": "record",
        "recordFields": [
          {
            "name": "source",
            "type": "record",
            "index": 1,
            "recordFields": [
              { "name": "fully_qualified_name", "type": "string", "index": 1 }
            ]
          },
          {
            "name": "target",
            "type": "record",
            "index": 2,
            "recordFields": [
              { "name": "fully_qualified_name", "type": "string", "index": 1 }
            ]
          }
        ]
      }
    }
  ]
}
```
</details>
For more details, see [manage entries and create custom sources](https://cloud.google.com/dataplex/docs/ingest-custom-sources).
***
## Configuration

The connector is configured using the `config.json` file. Ensure this file is present in the same directory as `main.py`.

| Parameter | Description |
| :--- | :--- |
| **`aws_region`** | The AWS region where your Glue Data Catalog is located (e.g., "eu-north-1"). |
| **`project_id`** | Your Google Cloud Project ID. |
| **`location_id`** | The Google Cloud region where you want to run the script (e.g., "us-central1"). |
| **`entry_group_id`** | The Dataplex entry group ID where the metadata will be imported. |
| **`gcs_bucket`** | The Google Cloud Storage bucket where the output metadata file will be stored. |
| **`aws_account_id`** | Your AWS account ID. |
| **`output_folder`** | The folder within the GCS bucket where the output file will be stored. |
| **`gcp_secret_id`** | The ID of the secret in GCP Secret Manager that contains your AWS credentials. |
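A filled-in `config.json` might look like this (all values are illustrative):

```json
{
  "aws_region": "eu-north-1",
  "project_id": "my-gcp-project",
  "location_id": "us-central1",
  "entry_group_id": "aws-glue-entries",
  "gcs_bucket": "my-metadata-bucket",
  "aws_account_id": "123456789012",
  "output_folder": "glue-metadata",
  "gcp_secret_id": "aws-glue-credentials"
}
```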
***
## Running the Connector

You can run the connector from your local machine using a Python virtual environment.

### Setup and Execution

1. **Create a virtual environment:**
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```
2. **Install the required dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
3. **Run the connector:**
   Execute the `main.py` script. It will read settings from `config.json` in the current directory.
   ```bash
   python3 main.py
   ```
***
## Output

The connector generates a JSONL file in the specified GCS bucket and folder. This file contains the extracted metadata in a format that can be imported into Dataplex.
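Each line of the JSONL file is one import item. An illustrative item for a single Glue table, following the Dataplex metadata import file format (pretty-printed here for readability; the exact fields and FQN format the connector emits may differ):

```json
{
  "entry": {
    "name": "projects/my-gcp-project/locations/us-central1/entryGroups/aws-glue-entries/entries/sales_db.orders",
    "entry_type": "projects/my-gcp-project/locations/us-central1/entryTypes/aws-glue-table",
    "fully_qualified_name": "glue:123456789012.sales_db.orders",
    "aspects": {
      "my-gcp-project.us-central1.aws-glue-table": { "data": {} }
    }
  },
  "aspect_keys": ["my-gcp-project.us-central1.aws-glue-table"],
  "update_mask": "aspects"
}
```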
***
## Importing Metadata into Dataplex

Once the metadata file has been generated, you can import it into Dataplex using a metadata import job.

1. **Prepare the Request File:**
   Open the `request.json` file and replace the following placeholders with your actual values (an illustrative filled-in request is shown after these steps):
   * `<YOUR_GCS_BUCKET>`: The bucket where the output file was saved.
   * `<YOUR_OUTPUT_FOLDER>`: The folder where the output file was saved.
   * `<YOUR_PROJECT_ID>`: Your Google Cloud Project ID.
   * `<YOUR_LOCATION>`: Your Google Cloud Location (e.g., `us-central1`).
   * `<YOUR_ENTRY_GROUP_ID>`: The Dataplex Entry Group ID.
2. **Run the Import Command:**
180+
Use `curl` to initiate the import. Replace `{project-id}`, `{location}`, and `{job-id}` in the URL.
181+
182+
```bash
183+
curl -X POST \
184+
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
185+
-H "Content-Type: application/json; charset=utf-8" \
186+
-d @request.json \
187+
"https://dataplex.googleapis.com/v1/projects/{project-id}/locations/{location}/metadataJobs?metadataJobId={job-id}"
188+
```
189+
190+
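For reference, a filled-in `request.json` might look like the following. This is an illustrative sketch based on the Dataplex metadata import job API; check the `request.json` shipped with the connector for the exact structure it expects:

```json
{
  "type": "IMPORT",
  "import_spec": {
    "source_storage_uri": "gs://my-metadata-bucket/glue-metadata/",
    "entry_sync_mode": "FULL",
    "aspect_sync_mode": "INCREMENTAL",
    "scope": {
      "entry_groups": [
        "projects/my-gcp-project/locations/us-central1/entryGroups/aws-glue-entries"
      ],
      "entry_types": [
        "projects/my-gcp-project/locations/us-central1/entryTypes/aws-glue-database",
        "projects/my-gcp-project/locations/us-central1/entryTypes/aws-glue-table",
        "projects/my-gcp-project/locations/us-central1/entryTypes/aws-glue-view"
      ],
      "aspect_types": [
        "projects/my-gcp-project/locations/us-central1/aspectTypes/aws-glue-database",
        "projects/my-gcp-project/locations/us-central1/aspectTypes/aws-glue-table",
        "projects/my-gcp-project/locations/us-central1/aspectTypes/aws-glue-view",
        "projects/my-gcp-project/locations/us-central1/aspectTypes/aws-lineage-aspect"
      ]
    }
  }
}
```

The call creates a long-running metadata job; you can poll its progress with a `GET` on the same `metadataJobs/{job-id}` resource.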
***
## Metadata Extracted

The connector maps AWS Glue objects to Dataplex entries as follows:

| AWS Glue Object | Dataplex Entry Type |
| :--- | :--- |
| **Database** | `aws-glue-database` |
| **Table** | `aws-glue-table` |
| **View** | `aws-glue-view` |
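As a sketch of this mapping (illustrative only; the field names and FQN format here are assumptions, not the connector's actual code), a table description from boto3's `get_tables()` response could be turned into an import entry like this:

```python
def glue_table_to_entry(table: dict, cfg: dict) -> dict:
    """Map one boto3 Glue table description to a Dataplex entry dict."""
    prefix = f"projects/{cfg['project_id']}/locations/{cfg['location_id']}"
    entry_id = f"{table['DatabaseName']}.{table['Name']}"
    # Glue represents views as tables whose TableType is VIRTUAL_VIEW.
    kind = "aws-glue-view" if table.get("TableType") == "VIRTUAL_VIEW" else "aws-glue-table"
    return {
        "name": f"{prefix}/entryGroups/{cfg['entry_group_id']}/entries/{entry_id}",
        "entry_type": f"{prefix}/entryTypes/{kind}",
        "fully_qualified_name": f"glue:{cfg['aws_account_id']}.{entry_id}",
    }
```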
### Lineage

The connector parses AWS Glue job scripts (Python/Scala) to extract lineage:
- **Source**: `DataSource` nodes in the Glue job graph.
- **Target**: `DataSink` nodes in the Glue job graph.
- **Result**: Lineage is visualized in Dataplex from source table to target table.
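A minimal sketch of what such script parsing can look like (illustrative only, assuming the Glue-generated `from_catalog(...)` call style with `transformation_ctx` names like `DataSource0`/`DataSink0`):

```python
import re

# Matches catalog reads/writes in generated Glue scripts and captures the
# database, table, and whether the node is a DataSource or a DataSink.
CATALOG_CALL = re.compile(
    r'database\s*=\s*"(?P<db>[^"]+)"\s*,\s*table_name\s*=\s*"(?P<tbl>[^"]+)"'
    r'[^)]*transformation_ctx\s*=\s*"(?P<ctx>DataS(?:ource|ink)\d*)"'
)

def extract_lineage(script_text: str) -> list[dict]:
    """Return source -> target links found in one Glue job script."""
    sources, targets = [], []
    for m in CATALOG_CALL.finditer(script_text):
        ref = f"{m.group('db')}.{m.group('tbl')}"
        (sources if m.group("ctx").startswith("DataSource") else targets).append(ref)
    return [{"source": s, "target": t} for s in sources for t in targets]
```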
***
## Docker Setup

You can containerize this connector to run on Cloud Run, Dataproc, or Kubernetes.

1. **Build the Image**:
   ```bash
   docker build -t aws-glue-connector:latest .
   ```

2. **Run Locally** (passing config):
   Ensure `config.json` is in the current directory or mounted.
   ```bash
   docker run -v "$(pwd)/config.json:/app/config.json" -v "$(pwd)/src:/app/src" aws-glue-connector:latest
   ```

3. **Push to GCR/Artifact Registry**:
   ```bash
   gcloud auth configure-docker
   docker tag aws-glue-connector:latest gcr.io/YOUR_PROJECT/aws-glue-connector:latest
   docker push gcr.io/YOUR_PROJECT/aws-glue-connector:latest
   ```
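If the containerized job also needs Google Cloud credentials when run locally, mount your gcloud configuration as well (for example, `-v "$HOME/.config/gcloud:/root/.config/gcloud"`) so the container can use your Application Default Credentials; the helper script in this commit takes the same approach.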
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
#!/bin/bash

# Terminate script on error
set -e

# --- Read script arguments ---
POSITIONAL=()
while [[ $# -gt 0 ]]
do
key="$1"

case $key in
    -p|--project_id)
    PROJECT_ID="$2"
    shift # past argument
    shift # past value
    ;;
    -r|--repo)
    REPO="$2"
    shift # past argument
    shift # past value
    ;;
    -i|--image_name)
    IMAGE_NAME="$2"
    shift # past argument
    shift # past value
    ;;
    *) # unknown option
    POSITIONAL+=("$1") # save it in an array for later
    shift # past argument
    ;;
esac
done
set -- "${POSITIONAL[@]}" # restore positional parameters

# --- Validate arguments ---
if [ -z "$PROJECT_ID" ]; then
    echo "Project ID not provided. Please provide a project ID with the -p flag."
    exit 1
fi

if [ -z "$REPO" ]; then
    # Default to gcr.io/[PROJECT_ID] if no repo is provided
    REPO="gcr.io/${PROJECT_ID}"
    echo "Repository not provided, defaulting to: ${REPO}"
fi

if [ -z "$IMAGE_NAME" ]; then
    IMAGE_NAME="aws-glue-to-dataplex-pyspark"
    echo "Image name not provided, defaulting to: ${IMAGE_NAME}"
fi

IMAGE_TAG="latest"
IMAGE_URI="${REPO}/${IMAGE_NAME}:${IMAGE_TAG}"

# --- Build the Docker Image ---
echo "Building Docker image: ${IMAGE_URI}..."
# Use the Dockerfile for PySpark. Under 'set -e' a failed command exits the
# script before a '$?' check could run, so report failures inline instead.
docker build -t "${IMAGE_URI}" -f Dockerfile . || { echo "Docker build failed."; exit 1; }
echo "Docker build successful."

# --- Run the Docker Container ---
echo "Running the PySpark job in a Docker container..."
echo "Using local gcloud credentials for authentication."

# We mount the local gcloud config directory into the container.
# This allows the container to use your Application Default Credentials.
# Make sure you have run 'gcloud auth application-default login' on your machine.
docker run --rm \
    -v ~/.config/gcloud:/root/.config/gcloud \
    "${IMAGE_URI}" || { echo "Docker run failed."; exit 1; }

echo "PySpark job completed successfully."

# --- Optional: Push to Google Container Registry ---
read -p "Do you want to push the image to ${REPO}? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Pushing image to ${REPO}..."
    gcloud auth configure-docker
    docker push "${IMAGE_URI}"
    echo "Image pushed successfully."
fi
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
  "aws_region": "<YOUR_AWS_REGION>",
  "project_id": "<GCP_PROJECT>",
  "location_id": "<GCP_REGION>",
  "entry_group_id": "<DATAPLEX_ENTRY_GROUP>",
  "gcs_bucket": "<GCS_BUCKET>",
  "aws_account_id": "<AWS_ACCOUNT_ID>",
  "output_folder": "<GCS_FOLDER_NAME>",
  "gcp_secret_id": "<GCP_SECRET_ID>"
}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
import sys

# Allow shared files to be found when running from the command line.
# The path must be on sys.path before src.bootstrap (or anything it
# imports) tries to load modules from src/shared.
sys.path.insert(1, '../src/shared')

from src import bootstrap

if __name__ == '__main__':
    bootstrap.run()
