AWS::Glue::Crawler
The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.
Syntax
To declare this entity in your AWS CloudFormation template, use the following syntax:
JSON
{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
    "Role" : String,
    "Classifiers" : [ String, ... ],
    "Configuration" : String,
    "Description" : String,
    "SchemaChangePolicy" : SchemaChangePolicy,
    "Schedule" : Schedule,
    "DatabaseName" : String,
    "Targets" : Targets,
    "TablePrefix" : String,
    "Name" : String
  }
}
YAML
Type: AWS::Glue::Crawler
Properties:
  Role: String
  Classifiers:
    - String
  Configuration: String
  Description: String
  SchemaChangePolicy: SchemaChangePolicy
  Schedule: Schedule
  DatabaseName: String
  Targets: Targets
  TablePrefix: String
  Name: String
Properties
Role-
The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon S3 data.
Required: Yes
Type: String
Update requires: No interruption
Classifiers-
A list of UTF-8 strings that specify the custom classifiers that are associated with the crawler.
Required: No
Type: List of String values
Update requires: No interruption
Configuration-
Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.
Required: No
Type: String
Update requires: No interruption
Description-
A description of the crawler and where it should be used. It must match the URI address multi-line string pattern:
[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*
Required: No
Type: String
Update requires: No interruption
SchemaChangePolicy-
The policy that specifies update and delete behaviors for the crawler.
Required: No
Type: SchemaChangePolicy
Update requires: No interruption
Schedule-
The schedule on which the crawler runs, given as a cron expression in ScheduleExpression.
Required: No
Type: Schedule
Update requires: No interruption
DatabaseName-
The name of the database where the crawler's output is stored.
Required: Yes
Type: String
Update requires: No interruption
Targets-
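The crawler targets, that is, the data stores that the crawler scans, such as the Amazon S3 paths listed under S3Targets.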
Required: Yes
Type: Targets
Update requires: No interruption
TablePrefix-
The table prefix that's used for catalog tables that are created.
Required: No
Type: String
Update requires: No interruption
Name-
The name of the crawler. It must match the single-line string pattern:
[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
Required: No
Type: String
Update requires: Replacement
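As a rough illustration of the properties above, a crawler that sets only the three required properties (Role, DatabaseName, and Targets) might look like the following sketch. The logical ID, role ARN, database name, and bucket path are placeholders invented for this illustration and do not come from this page.

```yaml
Resources:
  MinimalCrawler:                       # hypothetical logical ID
    Type: AWS::Glue::Crawler
    Properties:
      # Placeholder ARN; substitute an IAM role that AWS Glue can assume.
      Role: arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole
      # Placeholder name of the database that receives the crawler's output.
      DatabaseName: example_db
      Targets:
        S3Targets:
          # Placeholder S3 location to crawl.
          - Path: s3://amzn-s3-demo-bucket/data/
```

Because Name is omitted here, CloudFormation would be expected to generate a name for the crawler. Setting Name explicitly means that changing it later replaces the resource, as noted under Update requires.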
Return Values
Ref
When the logical ID of this resource is provided to the Ref intrinsic function, Ref returns the resource name.
For more information about using the Ref function, see Ref.
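One way to see the Ref return value in practice is to export it as a stack output. The sketch below assumes a crawler with the logical ID MyCrawler2, as declared in the Examples section. The output name CrawlerName is a hypothetical label chosen for this illustration.

```yaml
Outputs:
  CrawlerName:                 # hypothetical output name
    Description: Name of the crawler, as returned by Ref
    # Ref on an AWS::Glue::Crawler resource yields the crawler's name.
    Value: !Ref MyCrawler2
```

With the Examples template, this output would carry the value given in the crawler's Name property.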
Examples
The following example creates a crawler for an Amazon S3 target.
JSON
{
  "Description": "AWS Glue Crawler Test",
  "Resources": {
    "MyRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [ "glue.amazonaws.com" ]
              },
              "Action": [ "sts:AssumeRole" ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "root",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": "*",
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      }
    },
    "MyDatabase": {
      "Type": "AWS::Glue::Database",
      "Properties": {
        "CatalogId": { "Ref": "AWS::AccountId" },
        "DatabaseInput": {
          "Name": "dbCrawler",
          "Description": "TestDatabaseDescription",
          "LocationUri": "TestLocationUri",
          "Parameters": {
            "key1": "value1",
            "key2": "value2"
          }
        }
      }
    },
    "MyClassifier": {
      "Type": "AWS::Glue::Classifier",
      "Properties": {
        "GrokClassifier": {
          "Name": "CrawlerClassifier",
          "Classification": "wikiData",
          "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
        }
      }
    },
    "MyS3Bucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": "crawlertesttarget",
        "AccessControl": "BucketOwnerFullControl"
      }
    },
    "MyCrawler2": {
      "Type": "AWS::Glue::Crawler",
      "Properties": {
        "Name": "testcrawler1",
        "Role": { "Fn::GetAtt": [ "MyRole", "Arn" ] },
        "DatabaseName": { "Ref": "MyDatabase" },
        "Classifiers": [
          { "Ref": "MyClassifier" }
        ],
        "Targets": {
          "S3Targets": [
            {
              "Path": { "Ref": "MyS3Bucket" }
            }
          ]
        },
        "SchemaChangePolicy": {
          "UpdateBehavior": "UPDATE_IN_DATABASE",
          "DeleteBehavior": "LOG"
        },
        "Schedule": {
          "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
        }
      }
    }
  }
}
YAML
Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: "*"
                Resource: "*"
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbCrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1: "value1"
          key2: "value2"
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"
Crawler Configuration
The following example specifies the crawler configuration that controls a crawler's behavior.
{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
    "Role" : "role1",
    "Classifiers" : [ ],
    "Description" : "example classifier",
    "SchemaChangePolicy" : SchemaChangePolicy,
    "Schedule" : Schedule,
    "DatabaseName" : "test",
    "Targets" : Targets,
    "TablePrefix" : "test-",
    "Name" : "my-crawler",
    "Configuration" : "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
  }
}
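In a YAML template, the same Configuration value can be written without backslash escapes by wrapping the JSON in single quotes, since YAML treats the contents of a single-quoted scalar literally. The following is a sketch only: the logical ID MyConfiguredCrawler is hypothetical, and the required Role, DatabaseName, and Targets properties are left out for brevity and would have to be supplied in a real template.

```yaml
MyConfiguredCrawler:             # hypothetical logical ID
  Type: AWS::Glue::Crawler
  Properties:
    Name: my-crawler
    TablePrefix: test-
    # Role, DatabaseName, and Targets are required but omitted in this sketch.
    # Single quotes keep the embedded JSON double quotes intact without escaping.
    Configuration: '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"},"Tables":{"AddOrUpdateBehavior":"MergeNewColumns"}}}'
```

The property value is still a single string that happens to contain JSON, so it matches the String type listed for Configuration.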
