Connect to Amazon Web Services (AWS) in order to:
Related integrations include:
| API Gateway | create, publish, maintain, and secure APIs |
| Autoscaling | scale EC2 capacity |
| Billing | billing and budgets |
| CloudFront | global content delivery network |
| CloudTrail | access to log files and AWS API calls |
| CloudSearch | managed search service |
| DynamoDB | NoSQL database |
| EC2 Container Service (ECS) | container management service that supports Docker containers |
| Elastic Beanstalk | easy-to-use service for deploying and scaling web applications and services |
| Elastic Block Store (EBS) | persistent block level storage volumes |
| ElastiCache | in-memory cache in the cloud |
| Elastic Cloud Compute (EC2) | resizable compute capacity in the cloud |
| Elastic File System (EFS) | shared file storage |
| Elastic Load Balancing (ELB) | distributes incoming application traffic across multiple Amazon EC2 instances |
| Elastic Map Reduce (EMR) | data processing using Hadoop |
| Elasticsearch Service (ES) | deploy, operate, and scale Elasticsearch clusters |
| Firehose | capture and load streaming data |
| IoT | connect IoT devices with cloud services |
| Kinesis | service for real-time processing of large, distributed data streams |
| Key Management Service (KMS) | create and control encryption keys |
| Lambda | serverless computing |
| Machine Learning (ML) | create machine learning models |
| OpsWorks | configuration management |
| Polly | text-to-speech service |
| Redshift | data warehouse solution |
| Relational Database Service (RDS) | relational database in the cloud |
| Route 53 | DNS and traffic management with availability monitoring |
| Simple Email Service (SES) | cost-effective, outbound-only email-sending service |
| Simple Notification Service (SNS) | alerts and notifications |
| Simple Queue Service (SQS) | messaging queue service |
| Simple Storage Service (S3) | highly available and scalable cloud storage service |
| Simple Workflow Service (SWF) | cloud workflow management |
| Storage Gateway | hybrid cloud storage |
| Web Application Firewall (WAF) | protect web applications from common web exploits |
| Workspaces | secure desktop computing service |
Setting up the Datadog integration with Amazon Web Services requires configuring role delegation using AWS IAM. To get a better understanding of role delegation, refer to the AWS IAM Best Practices guide.
Note: The GovCloud and China regions do not currently support IAM role delegation. If you are deploying in these regions, skip to the configuration section below.
First, create a new policy in the IAM Console. Name the policy DatadogAWSIntegrationPolicy, or choose a name that is more relevant for you. To take advantage of every AWS integration offered by Datadog, use the following in the Policy Document textbox. As we add other components to the integration, these permissions may change.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:Describe*",
        "budgets:ViewBudget",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "dynamodb:list*",
        "dynamodb:describe*",
        "ec2:Describe*",
        "ec2:Get*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticfilesystem:DescribeTags",
        "elasticfilesystem:DescribeFileSystems",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "elasticmapreduce:Describe*",
        "es:ListTags",
        "es:ListDomainNames",
        "es:DescribeElasticsearchDomains",
        "kinesis:List*",
        "kinesis:Describe*",
        "logs:Get*",
        "logs:Describe*",
        "logs:FilterLogEvents",
        "logs:TestMetricFilter",
        "rds:Describe*",
        "rds:List*",
        "route53:List*",
        "s3:GetBucketTagging",
        "s3:ListAllMyBuckets",
        "ses:Get*",
        "sns:List*",
        "sns:Publish",
        "sqs:ListQueues",
        "support:*",
        "tag:getResources",
        "tag:getTagKeys",
        "tag:getTagValues"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
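Before pasting the document into the IAM Console, it can be sanity-checked locally. A minimal sketch in Python (the action list is trimmed for brevity; substitute the full list from the document above):

```python
import json

# Trimmed version of the Policy Document above; paste in the full
# action list when checking the real document.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:Describe*",
                "cloudwatch:Get*",
                "cloudwatch:List*",
                "ec2:Describe*",
            ],
            "Effect": "Allow",
            "Resource": "*",
        }
    ],
}

# Basic structural checks before uploading to the IAM Console.
assert policy["Version"] == "2012-10-17"
for stmt in policy["Statement"]:
    assert stmt["Effect"] in ("Allow", "Deny")
    # Every action should be of the form "service:Action".
    assert all(":" in action for action in stmt["Action"])

print(json.dumps(policy, indent=2))
```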
If you are not comfortable with granting all of these permissions, at the very least use the existing policies named AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For more detailed information regarding permissions, please see the Permissions section below.
Next, create a new role named DatadogAWSIntegrationRole (or a name of your choosing). For Account ID, enter 464622532012 (Datadog's account ID). This means that you grant Datadog, and Datadog only, read access to your AWS data. For External ID, enter the one generated on our website. Make sure you leave Require MFA disabled. For more information about the External ID, refer to this document in the IAM User Guide.
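For context, the role-creation wizard produces a trust relationship along these lines. A sketch, with the External ID as a placeholder (use the value generated on our website):

```python
import json

DATADOG_ACCOUNT_ID = "464622532012"  # Datadog's account ID, from the text above
EXTERNAL_ID = "YOUR_EXTERNAL_ID"     # placeholder: the ID generated on the Datadog site

# Sketch of the trust policy: only Datadog's account, presenting your
# External ID, may assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{DATADOG_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```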
| aws.logs.incoming_bytes (gauge) | The volume of log events in uncompressed bytes uploaded to CloudWatch Logs. Shown as byte |
| aws.logs.incoming_log_events (count) | The number of log events uploaded to CloudWatch Logs. Shown as event |
| aws.logs.forwarded_bytes (gauge) | The volume of log events in compressed bytes forwarded to the subscription destination. Shown as byte |
| aws.logs.forwarded_log_events (count) | The number of log events forwarded to the subscription destination. Shown as event |
| aws.logs.delivery_errors (count) | The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination. Shown as event |
| aws.logs.delivery_throttling (count) | The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination. Shown as event |
| aws.ec2spot.available_instance_pools_count (count) | The Spot Instance pools specified in the Spot Fleet request. Shown as instance |
| aws.ec2spot.bids_submitted_for_capacity (count) | The capacity for which Amazon EC2 has submitted bids. Shown as instance |
| aws.ec2spot.eligible_instance_pool_count (count) | The Spot Instance pools specified in the Spot Fleet request where Amazon EC2 can fulfill bids. Shown as instance |
| aws.ec2spot.fulfilled_capacity (count) | The capacity that Amazon EC2 has fulfilled. Shown as instance |
| aws.ec2spot.max_percent_capacity_allocation (gauge) | The maximum value of PercentCapacityAllocation across all Spot Instance pools specified in the Spot Fleet request. Shown as percent |
| aws.ec2spot.pending_capacity (count) | The difference between TargetCapacity and FulfilledCapacity. Shown as instance |
| aws.ec2spot.percent_capacity_allocation (gauge) | The capacity allocated for the Spot Instance pool for the specified dimensions. Shown as percent |
| aws.ec2spot.target_capacity (count) | The target capacity of the Spot Fleet request. Shown as instance |
| aws.ec2spot.terminating_capacity (count) | The capacity that is being terminated due to Spot Instance interruptions. Shown as instance |
| aws.dms.cpuutilization (gauge) | Average percentage of allocated EC2 compute units that are currently in use on the instance. |
| aws.dms.free_storage_space (gauge) | The amount of available storage space. Shown as byte |
| aws.dms.freeable_memory (gauge) | The amount of available random access memory. Shown as byte |
| aws.dms.write_iops (gauge) | The average number of disk write I/O operations per second. Shown as operation/second |
| aws.dms.read_iops (gauge) | The average number of disk read I/O operations per second. Shown as operation/second |
| aws.dms.write_throughput (gauge) | The average number of bytes written to disk per second. Shown as byte/second |
| aws.dms.read_throughput (gauge) | The average number of bytes read from disk per second. Shown as byte/second |
| aws.dms.write_latency (gauge) | The average amount of time taken per write disk I/O operation. Shown as second |
| aws.dms.read_latency (gauge) | The average amount of time taken per read disk I/O operation. Shown as second |
| aws.dms.swap_usage (gauge) | The amount of swap space used on the DB instance. Shown as byte |
| aws.dms.network_transmit_throughput (gauge) | The outgoing (transmit) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication. Shown as byte/second |
| aws.dms.network_receive_throughput (gauge) | The incoming (receive) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication. Shown as byte/second |
| aws.dms.full_load_throughput_bandwidth_source (gauge) | Incoming network bandwidth from a full load from the source. Shown as kibibyte/second |
| aws.dms.full_load_throughput_bandwidth_target (gauge) | Outgoing network bandwidth from a full load for the target. Shown as kibibyte/second |
| aws.dms.full_load_throughput_rows_source (gauge) | Incoming changes from a full load from the source, in rows per second. Shown as row/second |
| aws.dms.full_load_throughput_rows_target (gauge) | Outgoing changes from a full load for the target. Shown as row |
| aws.dms.cdcincoming_changes (gauge) | Total row count of changes for the task. Shown as row |
| aws.dms.cdcchanges_memory_source (gauge) | The number of rows accumulating in memory and waiting to be committed from the source. Shown as row |
| aws.dms.cdcchanges_memory_target (gauge) | The number of rows accumulating in memory and waiting to be committed to the target. Shown as row |
| aws.dms.cdcchanges_disk_source (gauge) | The number of rows accumulating on disk and waiting to be committed from the source. Shown as row |
| aws.dms.cdcchanges_disk_target (gauge) | The number of rows accumulating on disk and waiting to be committed to the target. Shown as row |
| aws.dms.cdcthroughput_bandwidth_source (gauge) | Incoming task network bandwidth from the source. Shown as kibibyte/second |
| aws.dms.cdcthroughput_bandwidth_target (gauge) | Outgoing task network bandwidth for the target. Shown as kibibyte/second |
| aws.dms.cdcthroughput_rows_source (gauge) | Incoming task changes from the source. Shown as row/second |
| aws.dms.cdcthroughput_rows_target (gauge) | Outgoing task changes for the target. Shown as row/second |
| aws.dms.cdclatency_source (gauge) | Latency reading from the source. Shown as second |
| aws.dms.cdclatency_target (gauge) | Latency writing to the target. Shown as second |
| aws.events.invocations (count) | Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations, but does not include throttled or retried attempts until they fail permanently. |
| aws.events.failed_invocations (count) | Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt. |
| aws.events.triggered_rules (count) | Measures the number of triggered rules that matched with any event. |
| aws.events.matched_events (count) | Measures the number of events that matched with any rule. |
| aws.events.throttled_rules (count) | Measures the number of triggered rules that are being throttled. |
| aws.states.execution_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.execution_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed. Shown as millisecond |
| aws.states.executions_aborted (count every 60 seconds) | The number of executions that were aborted/terminated. |
| aws.states.executions_failed (count every 60 seconds) | The number of executions that failed. |
| aws.states.executions_started (count every 60 seconds) | The number of executions started. |
| aws.states.executions_succeeded (count every 60 seconds) | The number of executions that completed successfully. |
| aws.states.executions_timed_out (count every 60 seconds) | The number of executions that timed out for any reason. |
| aws.states.lambda_function_run_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_run_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed. Shown as millisecond |
| aws.states.lambda_function_schedule_time (gauge every 60 seconds) | The average time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_schedule_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state. Shown as millisecond |
| aws.states.lambda_function_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_function_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. Shown as millisecond |
| aws.states.lambda_functions_failed (count every 60 seconds) | The number of lambda functions that failed. |
| aws.states.lambda_functions_heartbeat_timed_out (count every 60 seconds) | The number of lambda functions that were timed out due to a heartbeat timeout. |
| aws.states.lambda_functions_scheduled (count every 60 seconds) | The number of lambda functions that were scheduled. |
| aws.states.lambda_functions_started (count every 60 seconds) | The number of lambda functions that were started. |
| aws.states.lambda_functions_succeeded (count every 60 seconds) | The number of lambda functions that completed successfully. |
| aws.states.lambda_functions_timed_out (count every 60 seconds) | The number of lambda functions that were timed out on close. |
| aws.states.activity_run_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_run_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed. Shown as millisecond |
| aws.states.activity_schedule_time (gauge every 60 seconds) | The average time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_schedule_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state. Shown as millisecond |
| aws.states.activity_time (gauge every 60 seconds) | The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.maximum (gauge every 60 seconds) | The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.minimum (gauge every 60 seconds) | The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.p95 (gauge every 60 seconds) | The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activity_time.p99 (gauge every 60 seconds) | The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed. Shown as millisecond |
| aws.states.activitys_failed (count every 60 seconds) | The number of activities that failed. |
| aws.states.activitys_heartbeat_timed_out (count every 60 seconds) | The number of activities that were timed out due to a heartbeat timeout. |
| aws.states.activitys_scheduled (count every 60 seconds) | The number of activities that were scheduled. |
| aws.states.activitys_started (count every 60 seconds) | The number of activities that were started. |
| aws.states.activitys_succeeded (count every 60 seconds) | The number of activities that completed successfully. |
| aws.states.activitys_timed_out (count every 60 seconds) | The number of activities that were timed out on close. |
The core Datadog-AWS integration pulls data from AWS CloudWatch. At a minimum, your Policy Document will need to allow the following actions:
- cloudwatch:ListMetrics to list the available CloudWatch metrics.
- cloudwatch:GetMetricStatistics to fetch data points for a given metric.

Note that these actions and the ones listed below are included in the Policy Document using wildcards such as List* and Get*. If you require strict policies, please use the complete action names as listed and reference the Amazon API documentation for the services you require.
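As an illustration of how those two actions pair up, a crawl calls ListMetrics once per namespace and GetMetricStatistics once per metric. A hedged sketch using boto3 (the instance ID is a placeholder, and the actual API calls are commented out since they require AWS credentials):

```python
from datetime import datetime, timedelta, timezone

def get_metric_statistics_params(namespace, metric_name, dimensions, minutes=10):
    """Build the parameters for a single GetMetricStatistics call."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,  # one data point per minute
        "Statistics": ["Average", "Maximum"],
    }

params = get_metric_statistics_params(
    "AWS/EC2",
    "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
)

# With credentials configured, the calls themselves would look like:
# import boto3
# cw = boto3.client("cloudwatch")
# for metric in cw.list_metrics(Namespace="AWS/EC2")["Metrics"]:
#     datapoints = cw.get_metric_statistics(**params)["Datapoints"]
```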
By allowing Datadog to read the following additional endpoints, the AWS integration will be able to add tags to CloudWatch metrics and generate additional metrics.
- autoscaling:DescribeAutoScalingGroups: Used to list all autoscaling groups.
- autoscaling:DescribePolicies: List available policies (for autocompletion in events and monitors).
- autoscaling:DescribeTags: Used to list tags for a given autoscaling group. This adds ASG custom tags to ASG CloudWatch metrics.
- autoscaling:DescribeScalingActivities: Used to generate events when an ASG scales up or down.
- autoscaling:ExecutePolicy: Execute a policy (scale up or down from a monitor or the events feed). Note: this is not included in the installation Policy Document and should only be included if you are using monitors or events to execute an autoscaling policy.

For more information on Autoscaling policies, review the documentation on the AWS website.

- budgets:ViewBudget: Used to view budget metrics.

For more information on Budget policies, review the documentation on the AWS website.

- cloudtrail:DescribeTrails: Used to list trails and find which S3 bucket they store their trails in.
- cloudtrail:GetTrailStatus: Used to skip inactive trails.

For more information on CloudTrail policies, review the documentation on the AWS website.

CloudTrail also requires some S3 permissions to access the trails. These are required on the CloudTrail bucket only.

- s3:ListBucket: List objects in the CloudTrail bucket to get available trails.
- s3:GetBucketLocation: Get the bucket's region to download trails.
- s3:GetObject: Fetch available trails.

For more information on S3 policies, review the documentation on the AWS website.

- dynamodb:ListTables: Used to list available DynamoDB tables.
- dynamodb:DescribeTable: Used to add metrics on a table size and item count.
- dynamodb:ListTagsOfResource: Used to collect all tags on a DynamoDB resource.

For more information on DynamoDB policies, review the documentation on the AWS website.

- ec2:DescribeInstanceStatus: Used by the ELB integration to assert the health of an instance. Used by the EC2 integration to describe the health of all instances.
- ec2:DescribeSecurityGroups: Adds SecurityGroup names and custom tags to EC2 instances.
- ec2:DescribeInstances: Adds tags to EC2 instances and EC2 CloudWatch metrics.

For more information on EC2 policies, review the documentation on the AWS website.

- ecs:ListClusters: List available clusters.
- ecs:ListContainerInstances: List instances of a cluster.
- ecs:DescribeContainerInstances: Describe instances to add metrics on resources and tasks running; adds the cluster tag to EC2 instances.

For more information on ECS policies, review the documentation on the AWS website.

- elasticache:DescribeCacheClusters: List and describe cache clusters, to add tags and additional metrics.
- elasticache:ListTagsForResource: List custom tags of a cluster, to add custom tags.
- elasticache:DescribeEvents: Add events about snapshots and maintenance.

For more information on ElastiCache policies, review the documentation on the AWS website.

- elasticfilesystem:DescribeTags: Gets custom tags applied to file systems.
- elasticfilesystem:DescribeFileSystems: Provides a list of active file systems.

For more information on EFS policies, review the documentation on the AWS website.

- elasticloadbalancing:DescribeLoadBalancers: List ELBs, add additional tags and metrics.
- elasticloadbalancing:DescribeTags: Add custom ELB tags to ELB metrics.

For more information on ELB policies, review the documentation on the AWS website.

- elasticmapreduce:ListClusters: List available clusters.
- elasticmapreduce:DescribeCluster: Add tags to CloudWatch EMR metrics.

For more information on EMR policies, review the documentation on the AWS website.

- es:ListTags: Add custom ES domain tags to ES metrics.
- es:ListDomainNames: Add custom ES domain tags to ES metrics.
- es:DescribeElasticsearchDomains: Add custom ES domain tags to ES metrics.

For more information on ES policies, review the documentation on the AWS website.

- kinesis:ListStreams: List available streams.
- kinesis:DescribeStream: Add tags and new metrics for Kinesis streams.
- kinesis:ListTagsForStream: Add custom tags.

For more information on Kinesis policies, review the documentation on the AWS website.

- logs:DescribeLogGroups: List available groups.
- logs:DescribeLogStreams: List available streams for a group.
- logs:FilterLogEvents: Fetch specific log events for a stream to generate metrics.

For more information on CloudWatch Logs policies, review the documentation on the AWS website.

- rds:DescribeDBInstances: Describe RDS instances to add tags.
- rds:ListTagsForResource: Add custom tags on RDS instances.
- rds:DescribeEvents: Add events related to RDS databases.

For more information on RDS policies, review the documentation on the AWS website.

- route53:listHealthChecks: List available health checks.
- route53:listTagsForResources: Add custom tags on Route53 CloudWatch metrics.

For more information on Route53 policies, review the documentation on the AWS website.

- s3:ListAllMyBuckets: Used to list available buckets.
- s3:GetBucketTagging: Used to get custom bucket tags.

For more information on S3 policies, review the documentation on the AWS website.

- ses:GetSendQuota: Add metrics about send quotas.
- ses:GetSendStatistics: Add metrics about send statistics.

For more information on SES policies, review the documentation on the AWS website.

- sns:ListTopics: Used to list available topics.
- sns:Publish: Used to publish notifications (monitors or event feed).

For more information on SNS policies, review the documentation on the AWS website.

- sqs:ListQueues: Used to list active queues.

For more information on SQS policies, review the documentation on the AWS website.

- support:*: Used to add metrics about service limits. Note: this requires full access because of AWS limitations.

- tag:getResources: Used to get custom tags by resource type.
- tag:getTagKeys: Used to get tag keys by region within an AWS account.
- tag:getTagValues: Used to get tag values by region within an AWS account.

The main use of the Resource Group Tagging API is to reduce the number of API calls needed to collect custom tags. For more information on Tag policies, review the documentation on the AWS website.
Do you believe you’re seeing a discrepancy between your data in CloudWatch and Datadog?
There are two important distinctions to be aware of:
system.cpu.idle without any filter would return one series for each host that reports that metric, and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.

Metrics delayed?

When using the AWS integration, we're pulling in metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.
To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. Metrics are made available by AWS depending on your account level. For example, if you are paying for "detailed metrics" within AWS, they are available sooner. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.
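The metric-by-metric crawl combined with a rate limit is why a delay can accumulate. A toy sketch of the arithmetic, with made-up numbers for the metric count and the rate limit:

```python
def crawl_delay_seconds(metric_count, calls_per_second):
    """Worst-case time to pull every metric once, at one API call per metric."""
    return metric_count / calls_per_second

# Hypothetical numbers: 12,000 metrics behind a 20 calls/second limit
# means a single full pass over the account takes 10 minutes on its own.
delay = crawl_delay_seconds(12_000, 20)
print(f"{delay / 60:.0f} minutes per full crawl")  # prints "10 minutes per full crawl"
```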
On the Datadog side, we do have the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact [email protected] for more info on this.
To obtain metrics with virtually zero delay, we recommend installing the Datadog Agent on those hosts. We’ve written a bit about this here, especially in relation to CloudWatch.
Missing metrics?
CloudWatch's API returns only metrics with data points, so if, for instance, an ELB has no attached instances, you should not expect to see metrics related to this ELB in Datadog.
Wrong count of aws.elb.healthy_host_count?
When the Cross-Zone Load Balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch's side), so if you have 2 instances in 1a and 3 in 1b, the metric displays 5 instances per availability zone. As this can be counter-intuitive, we've added a new metric, aws.elb.host_count, that displays the count of healthy instances per availability zone, regardless of whether the Cross-Zone Load Balancing option is enabled. This metric should have the value you would expect.
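The cross-zone behavior can be made concrete with a small sketch; the zone names and instance counts mirror the example above:

```python
instances_per_az = {"us-east-1a": 2, "us-east-1b": 3}

def healthy_host_count(per_az, cross_zone_enabled):
    """Per-AZ host count as CloudWatch reports it for an ELB."""
    total = sum(per_az.values())
    if cross_zone_enabled:
        # With cross-zone load balancing, every instance counts in every AZ.
        return {az: total for az in per_az}
    return dict(per_az)

# aws.elb.healthy_host_count with cross-zone enabled: 5 per AZ
print(healthy_host_count(instances_per_az, cross_zone_enabled=True))
# aws.elb.host_count-style view: the actual per-AZ counts
print(healthy_host_count(instances_per_az, cross_zone_enabled=False))
```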
Duplicated hosts when installing the agent?
When installing the Agent on an AWS host, you might see duplicated hosts on the infrastructure page for a few hours if you manually set the hostname in the Agent's configuration. The second host disappears a few hours later and does not affect your billing.