Connect to Amazon Web Services (AWS) in order to:
Related integrations include:
| API Gateway | create, publish, maintain, and secure APIs |
| Autoscaling | scale EC2 capacity |
| Billing | billing and budgets |
| CloudFront | global content delivery network |
| CloudTrail | access to log files and AWS API calls |
| CloudSearch | managed search service |
| DynamoDB | NoSQL database |
| EC2 Container Service (ECS) | container management service that supports Docker containers |
| Elastic Beanstalk | easy-to-use service for deploying and scaling web applications and services |
| Elastic Block Store (EBS) | persistent block level storage volumes |
| ElastiCache | in-memory cache in the cloud |
| Elastic Cloud Compute (EC2) | resizable compute capacity in the cloud |
| Elastic File System (EFS) | shared file storage |
| Elastic Load Balancing (ELB) | distributes incoming application traffic across multiple Amazon EC2 instances |
| Elastic Map Reduce (EMR) | data processing using Hadoop |
| Elasticsearch Service (ES) | deploy, operate, and scale Elasticsearch clusters |
| Firehose | capture and load streaming data |
| IoT | connect IoT devices with cloud services |
| Kinesis | service for real-time processing of large, distributed data streams |
| Key Management Service (KMS) | create and control encryption keys |
| Lambda | serverless computing |
| Machine Learning (ML) | create machine learning models |
| OpsWorks | configuration management |
| Polly | text-to-speech service |
| Redshift | data warehouse solution |
| Relational Database Service (RDS) | relational database in the cloud |
| Route 53 | DNS and traffic management with availability monitoring |
| Simple Email Service (SES) | cost-effective, outbound-only email-sending service |
| Simple Notification Service (SNS) | alerts and notifications |
| Simple Queue Service (SQS) | messaging queue service |
| Simple Storage Service (S3) | highly available and scalable cloud storage service |
| Simple Workflow Service (SWF) | cloud workflow management |
| Storage Gateway | hybrid cloud storage |
| Web Application Firewall (WAF) | protect web applications from common web exploits |
| Workspaces | secure desktop computing service |
Setting up the Datadog integration with Amazon Web Services requires configuring role delegation using AWS IAM. To get a better understanding of role delegation, refer to the AWS IAM Best Practices guide.
Note: The GovCloud and China regions do not currently support IAM role delegation. If you are deploying in these regions, skip to the configuration section below.
First, create a new policy in the IAM Console. Name the policy DatadogAWSIntegrationPolicy, or choose a name that is more relevant for you. To take advantage of every AWS integration offered by Datadog, use the following in the Policy Document textbox. As we add other components to the integration, these permissions may change.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:Describe*",
        "budgets:ViewBudget",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "dynamodb:list*",
        "dynamodb:describe*",
        "ec2:Describe*",
        "ec2:Get*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticfilesystem:DescribeTags",
        "elasticfilesystem:DescribeFileSystems",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "elasticmapreduce:Describe*",
        "es:ListTags",
        "es:ListDomainNames",
        "es:DescribeElasticsearchDomains",
        "kinesis:List*",
        "kinesis:Describe*",
        "logs:Get*",
        "logs:Describe*",
        "logs:FilterLogEvents",
        "logs:TestMetricFilter",
        "rds:Describe*",
        "rds:List*",
        "route53:List*",
        "s3:GetBucketTagging",
        "s3:ListAllMyBuckets",
        "ses:Get*",
        "sns:List*",
        "sns:Publish",
        "sqs:ListQueues",
        "support:*",
        "tag:getResources",
        "tag:getTagKeys",
        "tag:getTagValues"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
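Before pasting the document into the IAM Console, it can be sanity-checked locally. A minimal sketch in Python (the action list is trimmed for brevity; substitute the full list from the document above):

```python
import json

# Trimmed version of the Policy Document above; paste in the full
# action list when checking the real document.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:Describe*",
                "cloudwatch:Get*",
                "cloudwatch:List*",
                "ec2:Describe*",
            ],
            "Effect": "Allow",
            "Resource": "*",
        }
    ],
}

# Basic structural checks before uploading to the IAM Console.
assert policy["Version"] == "2012-10-17"
for stmt in policy["Statement"]:
    assert stmt["Effect"] in ("Allow", "Deny")
    # Every action should be of the form "service:Action".
    assert all(":" in action for action in stmt["Action"])

print(json.dumps(policy, indent=2))
```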
If you are not comfortable with granting all of these permissions, at the very least use the existing policies named AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For more detailed information regarding permissions, please see the Permissions section below.
Next, create a new role named DatadogAWSIntegrationRole (or a name of your choosing). For Account ID, enter 464622532012 (Datadog's account ID). This means that you grant Datadog, and Datadog only, read access to your AWS data. For External ID, enter the one generated on our website. Make sure you leave Require MFA disabled. For more information about the External ID, refer to this document in the IAM User Guide.
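For context, the role-creation wizard produces a trust relationship along these lines. A sketch, with the External ID as a placeholder (use the value generated on our website):

```python
import json

DATADOG_ACCOUNT_ID = "464622532012"  # Datadog's account ID, from the text above
EXTERNAL_ID = "YOUR_EXTERNAL_ID"     # placeholder: the ID generated on the Datadog site

# Sketch of the trust policy: only Datadog's account, presenting your
# External ID, may assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{DATADOG_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```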
| aws.logs.incoming_bytes (gauge) | The volume of log events in uncompressed bytes uploaded to CloudWatch Logs. Shown as byte |
| aws.logs.incoming_log_events (count) | The number of log events uploaded to CloudWatch Logs. Shown as event |
| aws.logs.forwarded_bytes (gauge) | The volume of log events in compressed bytes forwarded to the subscription destination. Shown as byte |
| aws.logs.forwarded_log_events (count) | The number of log events forwarded to the subscription destination. Shown as event |
| aws.logs.delivery_errors (count) | The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination. Shown as event |
| aws.logs.delivery_throttling (count) | The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination. Shown as event |
| aws.ec2spot.available_instance_pools_count (count) | The Spot Instance pools specified in the Spot Fleet request. Shown as instance |
| aws.ec2spot.bids_submitted_for_capacity (count) | The capacity for which Amazon EC2 has submitted bids. Shown as instance |
| aws.ec2spot.eligible_instance_pool_count (count) | The Spot Instance pools specified in the Spot Fleet request where Amazon EC2 can fulfill bids. Shown as instance |
| aws.ec2spot.fulfilled_capacity (count) | The capacity that Amazon EC2 has fulfilled. Shown as instance |
| aws.ec2spot.max_percent_capacity_allocation (gauge) | The maximum value of PercentCapacityAllocation across all Spot Instance pools specified in the Spot Fleet request. Shown as percent |
| aws.ec2spot.pending_capacity (count) | The difference between TargetCapacity and FulfilledCapacity. Shown as instance |
| aws.ec2spot.percent_capacity_allocation (gauge) | The capacity allocated for the Spot Instance pool for the specified dimensions. Shown as percent |
| aws.ec2spot.target_capacity (count) | The target capacity of the Spot Fleet request. Shown as instance |
| aws.ec2spot.terminating_capacity (count) | The capacity that is being terminated due to Spot Instance interruptions. Shown as instance |
| aws.dms.cpuutilization (gauge) | Average percentage of allocated EC2 compute units that are currently in use on the instance. |
| aws.dms.free_storage_space (gauge) | The amount of available storage space. Shown as byte |
| aws.dms.freeable_memory (gauge) | The amount of available random access memory. Shown as byte |
| aws.dms.write_iops (gauge) | The average number of disk write I/O operations per second. Shown as operation/second |
| aws.dms.read_iops (gauge) | The average number of disk read I/O operations per second. Shown as operation/second |
| aws.dms.write_throughput (gauge) | The average number of bytes written to disk per second. Shown as byte/second |
| aws.dms.read_throughput (gauge) | The average number of bytes read from disk per second. Shown as byte/second |
| aws.dms.write_latency (gauge) | The average amount of time taken per write disk I/O operation. Shown as second |
| aws.dms.read_latency (gauge) | The average amount of time taken per read disk I/O operation. Shown as second |
| aws.dms.swap_usage (gauge) | The amount of swap space used on the DB instance. Shown as byte |
| aws.dms.network_transmit_throughput (gauge) | The outgoing (transmit) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication. Shown as byte/second |
| aws.dms.network_receive_throughput (gauge) | The incoming (receive) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication. Shown as byte/second |
| aws.dms.full_load_throughput_bandwidth_source (gauge) | Incoming network bandwidth from a full load from the source. Shown as kibibyte/second |
| aws.dms.full_load_throughput_bandwidth_target (gauge) | Outgoing network bandwidth from a full load for the target. Shown as kibibyte/second |
| aws.dms.full_load_throughput_rows_source (gauge) | Incoming changes from a full load from the source, in rows per second. Shown as row/second |
| aws.dms.full_load_throughput_rows_target (gauge) | Outgoing changes from a full load for the target. Shown as row |
| aws.dms.cdcincoming_changes (gauge) | Total row count of changes for the task. Shown as row |
| aws.dms.cdcchanges_memory_source (gauge) | The number of rows accumulating in memory and waiting to be committed from the source. Shown as row |
| aws.dms.cdcchanges_memory_target (gauge) | The number of rows accumulating in memory and waiting to be committed to the target. Shown as row |
| aws.dms.cdcchanges_disk_source (gauge) | The number of rows accumulating on disk and waiting to be committed from the source. Shown as row |
| aws.dms.cdcchanges_disk_target (gauge) | The number of rows accumulating on disk and waiting to be committed to the target. Shown as row |
| aws.dms.cdcthroughput_bandwidth_source (gauge) | Incoming task network bandwidth from the source. Shown as kibibyte/second |
| aws.dms.cdcthroughput_bandwidth_target (gauge) | Outgoing task network bandwidth for the target. Shown as kibibyte/second |
| aws.dms.cdcthroughput_rows_source (gauge) | Incoming task changes from the source. Shown as row/second |
| aws.dms.cdcthroughput_rows_target (gauge) | Outgoing task changes for the target. Shown as row/second |
| aws.dms.cdclatency_source (gauge) | Latency reading from the source. Shown as second |
| aws.dms.cdclatency_target (gauge) | Latency writing to the target. Shown as second |
| aws.events.invocations (count) | Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations, but does not include throttled or retried attempts until they fail permanently. |
| aws.events.failed_invocations (count) | Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt. |
| aws.events.triggered_rules (count) | Measures the number of triggered rules that matched with any event. |
| aws.events.matched_events (count) | Measures the number of events that matched with any rule. |
| aws.events.throttled_rules (count) | Measures the number of triggered rules that are being throttled. |
| aws.states.execution_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.executions_aborted (count every 60 seconds) | The number of executions that were aborted/terminated. |
| aws.states.executions_failed (count every 60 seconds) | The number of executions that failed. |
| aws.states.executions_started (count every 60 seconds) | The number of executions started. |
| aws.states.executions_succeeded (count every 60 seconds) | The number of executions that completed successfully. |
| aws.states.executions_timed_out (count every 60 seconds) | The number of executions that timed out for any reason. |
| aws.states.lambda_function_run_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_schedule_time (gauge every 60 seconds) | The average time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_functions_failed (count every 60 seconds) | The number of lambda functions that failed. |
| aws.states.lambda_functions_heartbeat_timed_out (count every 60 seconds) | The number of lambda functions that were timed out due to a heartbeat timeout. |
| aws.states.lambda_functions_scheduled (count every 60 seconds) | The number of lambda functions that were scheduled. |
| aws.states.lambda_functions_started (count every 60 seconds) | The number of lambda functions that were started. |
| aws.states.lambda_functions_succeeded (count every 60 seconds) | The number of lambda functions that completed successfully. |
| aws.states.lambda_functions_timed_out (count every 60 seconds) | The number of lambda functions that were timed out on close. |
| aws.states.activity_run_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_schedule_time (gauge every 60 seconds) | The average time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activitys_failed (count every 60 seconds) | The number of activities that failed. |
| aws.states.activitys_heartbeat_timed_out (count every 60 seconds) | The number of activities that were timed out due to a heartbeat timeout. |
| aws.states.activitys_scheduled (count every 60 seconds) | The number of activities that were scheduled. |
| aws.states.activitys_started (count every 60 seconds) | The number of activities that were started. |
| aws.states.activitys_succeeded (count every 60 seconds) | The number of activities that completed successfully. |
| aws.states.activitys_timed_out (count every 60 seconds) | The number of activities that were timed out on close. |
The core Datadog-AWS integration pulls data from AWS CloudWatch. At a minimum, your Policy Document will need to allow the following actions:
- cloudwatch:ListMetrics to list the available CloudWatch metrics.
- cloudwatch:GetMetricStatistics to fetch data points for a given metric.

Note that these actions and the ones listed below are included in the Policy Document using wildcards such as List* and Get*. If you require strict policies, please use the complete action names as listed and reference the Amazon API documentation for the services you require.
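As an illustration of how those two actions pair up, a crawl calls ListMetrics once per namespace and GetMetricStatistics once per metric. A hedged sketch using boto3 (the instance ID is a placeholder, and the actual API calls are commented out since they require AWS credentials):

```python
from datetime import datetime, timedelta, timezone

def get_metric_statistics_params(namespace, metric_name, dimensions, minutes=10):
    """Build the parameters for a single GetMetricStatistics call."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,  # one data point per minute
        "Statistics": ["Average", "Maximum"],
    }

params = get_metric_statistics_params(
    "AWS/EC2",
    "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
)

# With credentials configured, the calls themselves would look like:
# import boto3
# cw = boto3.client("cloudwatch")
# for metric in cw.list_metrics(Namespace="AWS/EC2")["Metrics"]:
#     datapoints = cw.get_metric_statistics(**params)["Datapoints"]
```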
By allowing Datadog to read the following additional endpoints, the AWS integration will be able to add tags to CloudWatch metrics and generate additional metrics.
- autoscaling:DescribeAutoScalingGroups: Used to list all autoscaling groups.
- autoscaling:DescribePolicies: List available policies (for autocompletion in events and monitors).
- autoscaling:DescribeTags: Used to list tags for a given autoscaling group. This adds ASG custom tags to ASG CloudWatch metrics.
- autoscaling:DescribeScalingActivities: Used to generate events when an ASG scales up or down.
- autoscaling:ExecutePolicy: Execute a policy (scale up or down from a monitor or the events feed). Note: this is not included in the installation Policy Document and should only be included if you are using monitors or events to execute an autoscaling policy.

For more information on Autoscaling policies, review the documentation on the AWS website.

- budgets:ViewBudget: Used to view budget metrics.

For more information on Budget policies, review the documentation on the AWS website.

- cloudtrail:DescribeTrails: Used to list trails and find which S3 bucket they store their trails in.
- cloudtrail:GetTrailStatus: Used to skip inactive trails.

For more information on CloudTrail policies, review the documentation on the AWS website.

CloudTrail also requires some S3 permissions to access the trails. These are required on the CloudTrail bucket only.

- s3:ListBucket: List objects in the CloudTrail bucket to get available trails.
- s3:GetBucketLocation: Get the bucket's region to download trails.
- s3:GetObject: Fetch available trails.

For more information on S3 policies, review the documentation on the AWS website.

- dynamodb:ListTables: Used to list available DynamoDB tables.
- dynamodb:DescribeTable: Used to add metrics on a table size and item count.
- dynamodb:ListTagsOfResource: Used to collect all tags on a DynamoDB resource.

For more information on DynamoDB policies, review the documentation on the AWS website.

- ec2:DescribeInstanceStatus: Used by the ELB integration to assert the health of an instance. Used by the EC2 integration to describe the health of all instances.
- ec2:DescribeSecurityGroups: Adds SecurityGroup names and custom tags to EC2 instances.
- ec2:DescribeInstances: Adds tags to EC2 instances and EC2 CloudWatch metrics.

For more information on EC2 policies, review the documentation on the AWS website.

- ecs:ListClusters: List available clusters.
- ecs:ListContainerInstances: List instances of a cluster.
- ecs:DescribeContainerInstances: Describe instances to add metrics on resources and tasks running; adds the cluster tag to EC2 instances.

For more information on ECS policies, review the documentation on the AWS website.

- elasticache:DescribeCacheClusters: List and describe cache clusters, to add tags and additional metrics.
- elasticache:ListTagsForResource: List custom tags of a cluster, to add custom tags.
- elasticache:DescribeEvents: Add events about snapshots and maintenance.

For more information on ElastiCache policies, review the documentation on the AWS website.

- elasticfilesystem:DescribeTags: Gets custom tags applied to file systems.
- elasticfilesystem:DescribeFileSystems: Provides a list of active file systems.

For more information on EFS policies, review the documentation on the AWS website.

- elasticloadbalancing:DescribeLoadBalancers: List ELBs, add additional tags and metrics.
- elasticloadbalancing:DescribeTags: Add custom ELB tags to ELB metrics.

For more information on ELB policies, review the documentation on the AWS website.

- elasticmapreduce:ListClusters: List available clusters.
- elasticmapreduce:DescribeCluster: Add tags to CloudWatch EMR metrics.

For more information on EMR policies, review the documentation on the AWS website.

- es:ListTags: Add custom ES domain tags to ES metrics.
- es:ListDomainNames: Add custom ES domain tags to ES metrics.
- es:DescribeElasticsearchDomains: Add custom ES domain tags to ES metrics.

For more information on ES policies, review the documentation on the AWS website.

- kinesis:ListStreams: List available streams.
- kinesis:DescribeStream: Add tags and new metrics for Kinesis streams.
- kinesis:ListTagsForStream: Add custom tags.

For more information on Kinesis policies, review the documentation on the AWS website.

- logs:DescribeLogGroups: List available groups.
- logs:DescribeLogStreams: List available streams for a group.
- logs:FilterLogEvents: Fetch specific log events for a stream to generate metrics.

For more information on CloudWatch Logs policies, review the documentation on the AWS website.

- rds:DescribeDBInstances: Describe RDS instances to add tags.
- rds:ListTagsForResource: Add custom tags on RDS instances.
- rds:DescribeEvents: Add events related to RDS databases.

For more information on RDS policies, review the documentation on the AWS website.

- route53:listHealthChecks: List available health checks.
- route53:listTagsForResources: Add custom tags on Route53 CloudWatch metrics.

For more information on Route53 policies, review the documentation on the AWS website.

- s3:ListAllMyBuckets: Used to list available buckets.
- s3:GetBucketTagging: Used to get custom bucket tags.

For more information on S3 policies, review the documentation on the AWS website.

- ses:GetSendQuota: Add metrics about send quotas.
- ses:GetSendStatistics: Add metrics about send statistics.

For more information on SES policies, review the documentation on the AWS website.

- sns:ListTopics: Used to list available topics.
- sns:Publish: Used to publish notifications (monitors or event feed).

For more information on SNS policies, review the documentation on the AWS website.

- sqs:ListQueues: Used to list active queues.

For more information on SQS policies, review the documentation on the AWS website.

- support:*: Used to add metrics about service limits. Note: this requires full access because of AWS limitations.

- tag:getResources: Used to get custom tags by resource type.
- tag:getTagKeys: Used to get tag keys by region within an AWS account.
- tag:getTagValues: Used to get tag values by region within an AWS account.

The main use of the Resource Group Tagging API is to reduce the number of API calls needed to collect custom tags. For more information on Tag policies, review the documentation on the AWS website.
Do you believe you’re seeing a discrepancy between your data in CloudWatch and Datadog?
There are two important distinctions to be aware of:
system.cpu.idle without any filter would return one series for each host that reports that metric, and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.

Metrics delayed?

When using the AWS integration, we're pulling in metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.
To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. Metrics are made available by AWS depending on your account level. For example, if you are paying for "detailed metrics" within AWS, they are available sooner. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.
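The metric-by-metric crawl combined with a rate limit is why a delay can accumulate. A toy sketch of the arithmetic, with made-up numbers for the metric count and the rate limit:

```python
def crawl_delay_seconds(metric_count, calls_per_second):
    """Worst-case time to pull every metric once, at one API call per metric."""
    return metric_count / calls_per_second

# Hypothetical numbers: 12,000 metrics behind a 20 calls/second limit
# means a single full pass over the account takes 10 minutes on its own.
delay = crawl_delay_seconds(12_000, 20)
print(f"{delay / 60:.0f} minutes per full crawl")  # prints "10 minutes per full crawl"
```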
On the Datadog side, we do have the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact [email protected] for more info on this.
To obtain metrics with virtually zero delay, we recommend installing the Datadog Agent on those hosts. We’ve written a bit about this here, especially in relation to CloudWatch.
Missing metrics?
CloudWatch's API returns only metrics with data points, so if, for instance, an ELB has no attached instances, you should not expect to see metrics related to this ELB in Datadog.
Wrong count of aws.elb.healthy_host_count?
When the Cross-Zone Load Balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch's side), so if you have 2 instances in 1a and 3 in 1b, the metric displays 5 instances per availability zone. As this can be counter-intuitive, we've added a new metric, aws.elb.host_count, that displays the count of healthy instances per availability zone, regardless of whether the Cross-Zone Load Balancing option is enabled. This metric should have the value you would expect.
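The cross-zone behavior can be made concrete with a small sketch; the zone names and instance counts mirror the example above:

```python
instances_per_az = {"us-east-1a": 2, "us-east-1b": 3}

def healthy_host_count(per_az, cross_zone_enabled):
    """Per-AZ host count as CloudWatch reports it for an ELB."""
    total = sum(per_az.values())
    if cross_zone_enabled:
        # With cross-zone load balancing, every instance counts in every AZ.
        return {az: total for az in per_az}
    return dict(per_az)

# aws.elb.healthy_host_count with cross-zone enabled: 5 per AZ
print(healthy_host_count(instances_per_az, cross_zone_enabled=True))
# aws.elb.host_count-style view: the actual per-AZ counts
print(healthy_host_count(instances_per_az, cross_zone_enabled=False))
```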
Duplicated hosts when installing the agent?
When installing the Agent on an AWS host, you might see duplicated hosts on the infrastructure page for a few hours if you manually set the hostname in the Agent's configuration. The second host disappears a few hours later and does not affect your billing.