Troubleshooting the Elastic Network Adapter (ENA)
The Elastic Network Adapter (ENA) is designed to improve operating system health and reduce the chances of long-term disruption caused by unexpected hardware behavior or failures. The ENA architecture keeps device or driver failures as transparent to the system as possible. This topic provides troubleshooting information for ENA.
If you are unable to connect to your instance, start with the Troubleshooting Connectivity Issues section.
If you are able to connect to your instance, you can gather diagnostic information by using the failure detection and recovery mechanisms that are covered in the later sections of this topic.
Troubleshooting Connectivity Issues
If you lose connectivity while enabling enhanced networking, the
ena module might be incompatible with your instance's current
running kernel. This can happen if you install the module for a specific kernel version
(without dkms, or with an improperly configured
dkms.conf file) and then your instance kernel is updated. If the
instance kernel that is loaded at boot time does not have the ena
module properly installed, your instance does not recognize the network adapter and
becomes unreachable.
If you enable enhanced networking for a PV instance or AMI, this can also make your instance unreachable.
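The kernel and module mismatch described above can be checked before a reboot. The following is a minimal sketch, not an official procedure: the function name ena_module_present is hypothetical, and it assumes that the driver is built as a loadable module file named ena.ko (optionally compressed) under /lib/modules/<kernel-version>/. A kernel with the driver built in would not be detected this way.

```shell
# Sketch: report whether an ena module file exists for a given kernel version.
# Assumption: modules live under <root>/<kernel-version>/ (root defaults to
# /lib/modules) and the driver is a loadable module named ena.ko*.
ena_module_present() {
    kernel_version=$1
    modules_root=${2:-/lib/modules}
    # find prints any matching paths; grep -q succeeds if at least one exists
    find "$modules_root/$kernel_version" -name 'ena.ko*' 2>/dev/null | grep -q .
}

# Usage (on the instance, for the running kernel):
#   ena_module_present "$(uname -r)" || echo "no ena module for this kernel"
```

Running the same check against every kernel version directory after a kernel update shows whether the newly installed kernel also has the module.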
If your instance becomes unreachable after enabling enhanced networking with ENA, you
can disable the enaSupport attribute for your instance and it will
fall back to the stock network adapter.
To disable enhanced networking with ENA (EBS-backed instances)
From your local computer, stop the instance using the Amazon EC2 console or the following command: stop-instances (AWS CLI). If your instance is managed by AWS OpsWorks, you should stop the instance in the AWS OpsWorks console so that the instance state remains in sync.
Important
If you are using an instance store-backed instance, you can't stop the instance. Instead, proceed to To disable enhanced networking with ENA (instance store-backed instances).
From your local computer, disable the enhanced networking attribute using the following command.
modify-instance-attribute (AWS CLI)
$ aws ec2 modify-instance-attribute --instance-id instance_id --no-ena-support
From your local computer, start the instance using the Amazon EC2 console or the following command: start-instances (AWS CLI). If your instance is managed by AWS OpsWorks, you should start the instance in the AWS OpsWorks console so that the instance state remains in sync.
(Optional) Connect to your instance and try reinstalling the
ena module with your current kernel version by following the steps in Enabling Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances in a VPC.
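The stop, modify, and start steps above can be chained into one function. This is a sketch under stated assumptions: it requires a configured AWS CLI, the function name disable_ena_support is hypothetical, and the aws ec2 wait instance-stopped step is an addition that the procedure above does not mention. It does not apply to instance store-backed instances, and instances managed by AWS OpsWorks should be stopped and started from the AWS OpsWorks console instead.

```shell
# Sketch of the EBS-backed procedure: stop, clear enaSupport, start.
# Assumption: the caller passes a valid instance ID and the AWS CLI is configured.
disable_ena_support() {
    instance_id=$1
    aws ec2 stop-instances --instance-ids "$instance_id" || return 1
    # Added step (not in the procedure above): block until the instance is stopped
    aws ec2 wait instance-stopped --instance-ids "$instance_id" || return 1
    aws ec2 modify-instance-attribute --instance-id "$instance_id" --no-ena-support || return 1
    aws ec2 start-instances --instance-ids "$instance_id"
}

# Usage (instance_id is a placeholder):
#   disable_ena_support instance_id
```

Each step returns early on failure so that the instance is not restarted with the attribute unchanged.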
To disable enhanced networking with ENA (instance store-backed instances)
If your instance is an instance store-backed instance, create a new AMI as
described in Creating an Instance Store-Backed Linux
AMI. Be sure to disable the
enhanced networking enaSupport attribute when you register the
AMI.
register-image (AWS CLI)
$ aws ec2 register-image --no-ena-support ...
Keep-Alive Mechanism
The ENA device posts keep-alive events at a fixed rate (usually once every second). The ENA driver implements a watchdog mechanism, which periodically checks for the presence of these keep-alive messages. If one or more messages are present, the watchdog is rearmed; otherwise, the driver concludes that the device experienced a failure and then does the following:
Dumps its current statistics to syslog
Resets the ENA device
Resets the ENA driver state
The above reset procedure may result in some traffic loss for a short period of time (TCP connections should be able to recover), but should not otherwise affect the user.
The ENA device may also indirectly request a device reset by not sending a keep-alive notification; for example, if the ENA device reaches an unknown state after loading an irrecoverable configuration.
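To see whether the watchdog has triggered resets on an instance, the kernel log can be scanned for the relevant messages. The sketch below is illustrative only: the function name ena_reset_summary is hypothetical, and the two message strings are taken from the example log in this topic, so other driver versions may word them differently.

```shell
# Sketch: summarize keep-alive watchdog resets from kernel log text on stdin.
# Assumption: the driver logs the same message text as the example in this topic.
ena_reset_summary() {
    awk '
        /Keep alive watchdog timeout/         { timeouts++ }
        /Device reset completed successfully/ { completed++ }
        END { printf "watchdog_timeouts=%d resets_completed=%d\n", timeouts, completed }
    '
}

# Usage:
#   dmesg | ena_reset_summary
```

A timeout count that is higher than the completed count suggests that at least one reset did not finish and is worth a closer look at the surrounding log entries.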
Below is an example of the reset procedure:
[18509.800135] ena 0000:00:07.0 eth1: Keep alive watchdog timeout. // The watchdog process initiates a reset
[18509.815244] ena 0000:00:07.0 eth1: Trigger reset is on
[18509.825589] ena 0000:00:07.0 eth1: tx_timeout: 0 // The driver logs the current statistics
[18509.834253] ena 0000:00:07.0 eth1: io_suspend: 0
[18509.842674] ena 0000:00:07.0 eth1: io_resume: 0
[18509.850275] ena 0000:00:07.0 eth1: wd_expired: 1
[18509.857855] ena 0000:00:07.0 eth1: interface_up: 1
[18509.865415] ena 0000:00:07.0 eth1: interface_down: 0
[18509.873468] ena 0000:00:07.0 eth1: admin_q_pause: 0
[18509.881075] ena 0000:00:07.0 eth1: queue_0_tx_cnt: 0
[18509.888629] ena 0000:00:07.0 eth1: queue_0_tx_bytes: 0
[18509.895286] ena 0000:00:07.0 eth1: queue_0_tx_queue_stop: 0
.......
........
[18511.280972] ena 0000:00:07.0 eth1: free uncompleted tx skb qid 3 idx 0x7 // At the end of the down process, the driver discards incomplete packets.
[18511.420112] [ENA_COM: ena_com_validate_version] ena device version: 0.10 //The driver begins its up process
[18511.420119] [ENA_COM: ena_com_validate_version] ena controller version: 0.0.1 implementation version 1
[18511.420127] [ENA_COM: ena_com_admin_init] ena_defs : Version:[b9692e8] Build date [Wed Apr 6 09:54:21 IDT 2016]
[18512.252108] ena 0000:00:07.0: Device watchdog is Enabled
[18512.674877] ena 0000:00:07.0: irq 46 for MSI/MSI-X
[18512.674933] ena 0000:00:07.0: irq 47 for MSI/MSI-X
[18512.674990] ena 0000:00:07.0: irq 48 for MSI/MSI-X
[18512.675037] ena 0000:00:07.0: irq 49 for MSI/MSI-X
[18512.675085] ena 0000:00:07.0: irq 50 for MSI/MSI-X
[18512.675141] ena 0000:00:07.0: irq 51 for MSI/MSI-X
[18512.675188] ena 0000:00:07.0: irq 52 for MSI/MSI-X
[18512.675233] ena 0000:00:07.0: irq 53 for MSI/MSI-X
[18512.675279] ena 0000:00:07.0: irq 54 for MSI/MSI-X
[18512.772641] [ENA_COM: ena_com_set_hash_function] Feature 10 isn't supported
[18512.772647] [ENA_COM: ena_com_set_hash_ctrl] Feature 18 isn't supported
[18512.775945] ena 0000:00:07.0: Device reset completed successfully // The reset process is complete
Register Read Timeout
The ENA architecture limits the use of memory-mapped I/O (MMIO) read operations. The ENA device driver accesses MMIO registers only during its initialization procedure.
If the driver logs (available in dmesg output) indicate failures of read operations, this may be caused by an incompatible or incorrectly compiled driver, a busy hardware device, or hardware failure.
Intermittent log entries that indicate failures on read operations should not be considered an issue; the driver retries the operations in this case. However, a sequence of log entries containing read failures indicates a driver or hardware problem.
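One way to tell intermittent failures from a sequence is to count the timeout entries and measure the longest run of back-to-back entries. The following sketch makes several assumptions: the function name ena_read_timeouts is hypothetical, the message text matches the example log in this topic, and any other log line is treated as ending a run. This topic does not define how long a run must be to count as a sequence, so the output is a starting point for judgment rather than a verdict.

```shell
# Sketch: count register read timeouts in kernel log text on stdin and report
# the longest run of consecutive timeout entries.
# Assumption: the message text matches the example log in this topic.
ena_read_timeouts() {
    awk '
        /reading reg failed for timeout/ {
            total++
            run++
            if (run > longest) longest = run
            next
        }
        { run = 0 }   # any other log line ends the current run
        END { printf "total=%d longest_run=%d\n", total, longest }
    '
}

# Usage (filtering to ENA_COM lines first keeps unrelated kernel messages
# from splitting a run):
#   dmesg | grep ENA_COM | ena_read_timeouts
```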
Below is an example of driver log entries indicating read operation failures due to a timeout:
[ 47.113698] [ENA_COM: ena_com_reg_bar_read32] reading reg failed for timeout. expected: req id[1] offset[88] actual: req id[57006] offset[0]
[ 47.333715] [ENA_COM: ena_com_reg_bar_read32] reading reg failed for timeout. expected: req id[2] offset[8] actual: req id[57007] offset[0]
[ 47.346221] [ENA_COM: ena_com_dev_reset] Reg read32 timeout occurred
Statistics
If you experience insufficient network performance or latency issues, you should retrieve the device statistics and examine them. These statistics can be obtained using ethtool, as shown below:
[ec2-user ~]$ ethtool -S ethN
NIC statistics:
tx_timeout: 0
io_suspend: 0
io_resume: 0
wd_expired: 0
interface_up: 1
interface_down: 0
admin_q_pause: 0
queue_0_tx_cnt: 4329
queue_0_tx_bytes: 1075749
queue_0_tx_queue_stop: 0
...
The command output parameters are described below:
tx_timeout: N
The number of times that the Netdev watchdog was activated.
io_suspend: N
Unsupported. This value should always be zero.
io_resume: N
Unsupported. This value should always be zero.
wd_expired: N
The number of times that the driver did not receive the keep-alive event in the preceding 3 seconds.
interface_up: N
The number of times that the ENA interface was brought up.
interface_down: N
The number of times that the ENA interface was brought down.
admin_q_pause: N
The admin queue is in an unstable state. This value should always be zero.
queue_N_tx_cnt: N
The number of transmitted packets for queue N.
queue_N_tx_bytes: N
The number of transmitted bytes for queue N.
queue_N_tx_queue_stop: N
The number of times that queue N was full and stopped.
queue_N_tx_queue_wakeup: N
The number of times that queue N resumed after being stopped.
queue_N_tx_dma_mapping_err: N
Direct memory access error count. If this value is not 0, it indicates low system resources.
queue_N_tx_napi_comp: N
The number of times the napi handler called napi_complete for queue N.
queue_N_tx_poll: N
The number of times the napi handler was scheduled for queue N.
queue_N_tx_doorbells: N
The number of transmission doorbells for queue N.
queue_N_tx_linearize: N
The number of times SKB linearization was attempted for queue N.
queue_N_tx_linearize_failed: N
The number of times SKB linearization failed for queue N.
queue_N_tx_prepare_ctx_err: N
The number of times ena_com_prepare_tx failed for queue N. This value should always be zero; if not, see the driver logs.
queue_N_tx_missing_tx_comp: N
The number of packets that were left uncompleted for queue N. This value should always be zero.
queue_N_tx_bad_req_id: N
Invalid req_id for queue N. Valid req_id values range from zero to queue_size minus 1.
queue_N_rx_cnt: N
The number of received packets for queue N.
queue_N_rx_bytes: N
The number of received bytes for queue N.
queue_N_rx_refil_partial: N
The number of times the driver did not succeed in refilling the empty portion of the rx queue with the buffers for queue N. If this value is not zero, it indicates low memory resources.
queue_N_rx_bad_csum: N
The number of times the rx queue had a bad checksum for queue N (only if rx checksum offload is supported).
queue_N_rx_page_alloc_fail: N
The number of times that page allocation failed for queue N. If this value is not zero, it indicates low memory resources.
queue_N_rx_skb_alloc_fail: N
The number of times that SKB allocation failed for queue N. If this value is not zero, it indicates low system resources.
queue_N_rx_dma_mapping_err: N
Direct memory access error count. If this value is not 0, it indicates low system resources.
queue_N_rx_bad_desc_num: N
Too many buffers per packet. If this value is not 0, it indicates usage of very small buffers.
queue_N_rx_small_copy_len_pkt: N
Optimization: For packets smaller than this threshold, which is set by sysfs, the packet is copied directly to the stack to avoid allocation of a new page.
ena_admin_q_aborted_cmd: N
The number of admin commands that were aborted. This usually happens during the auto-recovery procedure.
ena_admin_q_submitted_cmd: N
The number of admin queue doorbells.
ena_admin_q_completed_cmd: N
The number of admin queue completions.
ena_admin_q_out_of_space: N
The number of times that the driver tried to submit a new admin command, but the queue was full.
ena_admin_q_no_completion: N
The number of times that the driver did not get an admin completion for a command.
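Several of the counters described above should always be zero or indicate low resources when they are not. The sketch below prints only those counters when they are nonzero. The function name ena_nonzero_error_counters and the choice of counters to watch are assumptions drawn from the descriptions above, and the parsing assumes the "name: value" layout shown in the example ethtool output.

```shell
# Sketch: print counters from `ethtool -S` output (on stdin) that the
# descriptions above say should be zero or that indicate low resources.
# Assumption: one "name: value" pair per line, as in the example output.
ena_nonzero_error_counters() {
    awk -F': *' '
        {
            name = $1
            gsub(/^[ \t]+/, "", name)   # ethtool indents each counter
            watched = 0
            if (name ~ /^(io_suspend|io_resume|admin_q_pause)$/) watched = 1
            if (name ~ /_(tx_dma_mapping_err|tx_prepare_ctx_err|tx_missing_tx_comp)$/) watched = 1
            if (name ~ /_(rx_refil_partial|rx_page_alloc_fail|rx_skb_alloc_fail)$/) watched = 1
            if (name ~ /_(rx_dma_mapping_err|rx_bad_desc_num)$/) watched = 1
            if (watched && $2 + 0 != 0) print name ": " $2
        }
    '
}

# Usage (ethN is a placeholder for the interface name):
#   ethtool -S ethN | ena_nonzero_error_counters
```

Empty output means none of the watched counters has incremented; it does not rule out other causes of poor network performance.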
Driver Error Logs in syslog
The ENA driver writes log messages to syslog during system boot. You can examine these logs to look for errors if you are experiencing issues. Below is an example of information logged by the ENA driver in syslog during system boot, along with some annotations for select messages.
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 478.416939] [ENA_COM: ena_com_validate_version] ena device version: 0.10
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 478.420915] [ENA_COM: ena_com_validate_version] ena controller version: 0.0.1 implementation version 1
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.256831] ena 0000:00:03.0: Device watchdog is Enabled
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.672947] ena 0000:00:03.0: creating 8 io queues. queue size: 1024
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.680885] [ENA_COM: ena_com_init_interrupt_moderation] Feature 20 isn't supported // Interrupt moderation is not supported by the device
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.691609] [ENA_COM: ena_com_get_feature_ex] Feature 10 isn't supported // RSS HASH function configuration is not supported by the device
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.694583] [ENA_COM: ena_com_get_feature_ex] Feature 18 isn't supported //RSS HASH input source configuration is not supported by the device
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.697433] [ENA_COM: ena_com_set_host_attributes] Set host attribute isn't supported
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.701064] ena 0000:00:03.0 (unnamed net_device) (uninitialized): Cannot set host attributes
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.704917] ena 0000:00:03.0: Elastic Network Adapter (ENA) found at mem f3000000, mac addr 02:8a:3c:1e:13:b5 Queues 8
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 480.805037] EXT4-fs (xvda1): re-mounted. Opts: (null)
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 481.025842] NET: Registered protocol family 10
Which errors can I ignore?
The following warnings that may appear in your system's error logs can be ignored for the Elastic Network Adapter:
Set host attribute isn't supported
Host attributes are not supported for this device.
failed to alloc buffer for rx queue
This is a recoverable error, and it indicates that there may have been a memory pressure issue when the error was thrown.
Feature X isn't supported
The referenced feature is not supported by the Elastic Network Adapter. Possible values for X include:
10: RSS Hash function configuration is not supported for this device.
12: RSS Indirection table configuration is not supported for this device.
18: RSS Hash Input configuration is not supported for this device.
20: Interrupt moderation is not supported for this device.
Failed to config AENQ
The Elastic Network Adapter does not support AENQ configuration.
Trying to set unsupported AENQ events
This error indicates an attempt to set an AENQ events group that is not supported by the Elastic Network Adapter.
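When reviewing logs, it can help to hide the warnings listed above so that the remaining ENA messages stand out. The sketch below is one possible filter, not a supported tool: the function name ena_messages_minus_ignorable is hypothetical, and selecting ENA lines by the strings "ENA_COM" and " ena " is an assumption based on the example logs in this topic. The remaining output still includes normal informational messages.

```shell
# Sketch: from kernel log text on stdin, keep ENA driver lines and drop the
# warnings that this topic lists as safe to ignore.
# Assumption: ENA lines contain "ENA_COM" or " ena " as in the example logs.
ena_messages_minus_ignorable() {
    grep -e 'ENA_COM' -e ' ena ' |
    grep -E -v \
        -e "Set host attribute isn't supported" \
        -e 'failed to alloc buffer for rx queue' \
        -e "Feature (10|12|18|20) isn't supported" \
        -e 'Failed to config AENQ' \
        -e 'Trying to set unsupported AENQ events'
}

# Usage:
#   dmesg | ena_messages_minus_ignorable
```

Because the last command is grep, the function returns a nonzero status when no lines remain after filtering.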

