Diagnostics and Monitoring
-
Audit logs for actions account admins perform in the Azure management portal
While pursuing a security certification for our Azure-based product, we came across the following control that the Azure management website does not yet satisfy:
"Audit logs that record user activities, exceptions, and information security events should be produced, and kept for an agreed-upon time period, to assist in future investigations and access control monitoring.
Authorities: ISO-27002:2005 10.10.1.; HIPAA 164.312(b)"
265 votes
We currently log a large percentage of actions that take place in the portal, including actions for Cloud Services, Virtual Machines, Websites and others. However, you are correct that this doesn’t yet cover 100% of the actions that users can take in the portal. We are working on expanding this capability.
-
Add web based logging tools for Windows Azure
Make it easy for developers to see errors and events by exposing all logs and events right in the Azure developer portal.
228 votes
We do hope to offer a way for users to access logs in the portal. Today you can already view logs emitted by your Windows Azure Mobile Services directly in the portal.
-
Show if an application is affected by a service-level health problem
Today (02/22/2013) there was a global problem affecting storage access via HTTPS. It affected part of our application. It took our team a long time to figure out that the problem was not with our application and the changes we had recently made to it, but was due to a global service problem. The management portal worked perfectly, all systems seemed to be live, and we were able to access the storage via HTTP from our local environment (because that was the predefined protocol in our settings). But one of our runtime applications had an HTTPS connection string…
215 votes
We are looking at ways to make information about service outages more specific to your own services.
-
Alerts based on Queue Size
I would like to be able to set up an alert and monitor a Cloud Service based on queue size. For example: if a queue has more than 10,000 items for 15 minutes, send an alert.
213 votes
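A minimal sketch of what such a poller could look like today, assuming the azure-storage-queue Python package; the queue name, threshold, and alert hook are placeholders, and a built-in alert rule like the one requested would replace all of this:

```python
# Sketch: poll a Storage queue's approximate length and raise an alert when it
# stays above a threshold for a sustained window.
import time

from azure.storage.queue import QueueClient

QUEUE_NAME = "orders"      # hypothetical queue name
THRESHOLD = 10_000         # messages
WINDOW_MINUTES = 15

def send_alert(message: str) -> None:
    # Placeholder: wire this to email, SMS, or a webhook of your choice.
    print(f"ALERT: {message}")

def monitor(connection_string: str) -> None:
    queue = QueueClient.from_connection_string(connection_string, QUEUE_NAME)
    breached_since = None
    while True:
        count = queue.get_queue_properties().approximate_message_count
        if count > THRESHOLD:
            if breached_since is None:
                breached_since = time.monotonic()
            elif time.monotonic() - breached_since >= WINDOW_MINUTES * 60:
                send_alert(f"{QUEUE_NAME} has held over {THRESHOLD} messages "
                           f"for {WINDOW_MINUTES} minutes ({count} now)")
                breached_since = None   # reset so we do not alert every minute
        else:
            breached_since = None
        time.sleep(60)
```
-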
All kinds of useful alerts
I would love to get all kinds of alerts. Here are some examples:
* When an autoscale event triggers
* When a Cloud Service VM is patched
* When a new Azure OS is available
* When a service is added or deleted by a co-administrator
* If a certificate expires or is about to expire (a sketch of this check appears below)
* If a staging service or VM *IS* running (sometimes we forget to turn them off)
* If a Cloud Service worker role or VM is *NOT* running. It could also be a custom job that checks a database for "No new orders…
156 votes
Yes, good feedback. Would others want this as well?
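As an illustration of one of the alerts listed above, here is a hedged sketch of a certificate-expiry check using only the Python standard library; the hostname and the 30-day threshold are placeholders:

```python
# Sketch: warn when a service's TLS certificate is close to expiry.
import socket
import ssl
import time

def days_until_expiry(hostname: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires_at - time.time()) // 86400)

if __name__ == "__main__":
    remaining = days_until_expiry("example.cloudapp.net")   # hypothetical hostname
    if remaining < 30:
        print(f"ALERT: certificate expires in {remaining} days")
```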
-
Fix the "No available data" problem in the Monitoring pane
http://i.imgur.com/ZrRx6h1.png
That database has been up and running for a month. But it has no metrics for today? Doubt it. I see this "no available data" issue frequently. It's frustrating.
84 votes
-
Retention Policy for Diagnostics
Add a retention policy to Azure Diagnostics much like Azure Storage has for logging and analytics. It is currently WAY too hard to clean up old diagnostics data.
64 votes
There is an option to configure the Retention policy for verbose diagnostics — go to the Configure tab for your Cloud Services. I do not believe that there is a way in the portal to configure retention for other types of diagnostics data, but I have passed the feedback to the WAD team.
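Until retention covers every data type, a manual clean-up is possible but fiddly. The sketch below assumes the azure-data-tables package and the commonly documented WAD convention that PartitionKey encodes the event time as .NET ticks prefixed with "0"; confirm that convention against your own table before deleting anything:

```python
# Sketch: delete WAD table rows older than a retention window. The PartitionKey
# format ("0" + .NET ticks of the event time) is an assumption, not verified here.
from datetime import datetime, timedelta, timezone

from azure.data.tables import TableClient

RETENTION_DAYS = 14
TABLE_NAME = "WADLogsTable"

def to_ticks(dt: datetime) -> int:
    # .NET ticks are 100-nanosecond intervals since 0001-01-01.
    return int((dt - datetime(1, 1, 1, tzinfo=timezone.utc)).total_seconds() * 10_000_000)

def prune(connection_string: str) -> None:
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    cutoff_pk = "0" + str(to_ticks(cutoff))
    table = TableClient.from_connection_string(connection_string, TABLE_NAME)
    for row in table.query_entities(f"PartitionKey lt '{cutoff_pk}'"):
        table.delete_entity(partition_key=row["PartitionKey"], row_key=row["RowKey"])
```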
-
Add CPU and Memory usage metrics per instance
We use New Relic to monitor our cloud services. Sometimes we see an instance going above 80% CPU, which could possibly be solved by rebooting the instance, but it's impossible to identify in the portal which of the instances is in trouble (the Azure portal uses _N to name the instances, while New Relic uses an ID).
60 votes
Guillermo, you can view the instance level data for CPU, memory and other metrics by turning on ‘Verbose’ level in the ‘Configure’ tab on the portal (https://manage.windowsazure.com). Please note that this is currently supported for Production servers only; support for Staging is coming soon.
- Ashwin Kamath, PM, Azure Insights
-
Setting PartitionKey when streaming diagnostics to EventHub
Thanks for adding the possibility to stream diagnostic events to Event Hubs, but this is unusable at the moment since there is no way to control how the events are distributed to the different Event Hub partitions. As a result, we are unable to guarantee that events can be received and processed in the correct sequence.
One way to fix this would be to offer a way to set the partition key.
52 votes
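For reference, this is the behaviour being asked for, shown with the azure-eventhub Python SDK when publishing directly; names and connection strings are placeholders, and the request is for the Diagnostics extension to expose the same partition-key control:

```python
# Illustration: publishing directly with the azure-eventhub SDK lets you pin
# related events to one partition via a partition key, which preserves their
# relative order for consumers.
from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="diagnostics",            # hypothetical hub name
)
with producer:
    # Every event in this batch hashes to the same partition.
    batch = producer.create_batch(partition_key="role-instance-0")
    batch.add(EventData('{"counter": "cpu", "value": 73}'))
    batch.add(EventData('{"counter": "cpu", "value": 78}'))
    producer.send_batch(batch)
```
-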
Show memory and network metrics as percentages
Since each VM / Cloud Service has three major metrics (CPU, memory, network), diagnostics should collect these metrics by default (logging level Minimal).
Because there are a lot of instance types, absolute values of the network and memory counters are not informative, and percentages should be used.
It is more informative to know that an instance is using 90% of memory than to have a value of 600 MB. The same goes for network channel load. For example, my service is network intensive, and right now it is really hard to understand when the network channel load reaches its limit and the service…
47 votes
This is very good feedback. Today we have separate data sources for what your quotas are (e.g. network, memory) and the metrics that we emit. Ideally, we could bring those two points together to give you a percentage of those metrics.
Also, once these are exposed, they will automatically be available for autoscale — already today you can use any exposed metric for scaling.
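As a stop-gap, the percentage view can be computed on the instance itself. This sketch assumes the third-party psutil package and is not part of Azure Diagnostics; a network percentage would additionally need the bandwidth allowance of the instance size, which psutil cannot know:

```python
# Sketch: report memory and CPU as percentages from inside the instance.
import psutil

def resource_percentages() -> dict:
    mem = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": mem.percent,                # used / total * 100
        "memory_used_mb": mem.used / (1024 * 1024),   # absolute value, for comparison
    }

if __name__ == "__main__":
    print(resource_percentages())
```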
-
Support auto-scaling cloud services by service bus topic / subscription
At the moment you can only scale by queue size, not by topic/subscription size. Given that subscriptions can have filters, it would be ideal to be able to scale by subscription size and not just overall topic size.
45 votes
This is something that we are looking at. In the meantime, there is a workaround: forward the items you want to scale by to a queue and scale by that queue.
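A rough sketch of that workaround, assuming the azure-servicebus package; the topic, subscription, and queue names are placeholders. Service Bus auto-forwarding on the subscription can achieve the same result without custom code.

```python
# Sketch: drain a topic subscription into a queue so autoscale can key off the
# queue length.
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"

def forward(topic: str = "orders", subscription: str = "priority", queue: str = "scale-signal") -> None:
    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        receiver = client.get_subscription_receiver(topic_name=topic, subscription_name=subscription)
        sender = client.get_queue_sender(queue_name=queue)
        with receiver, sender:
            for message in receiver:     # blocks, yielding filtered messages
                sender.send_messages(ServiceBusMessage(str(message)))
                receiver.complete_message(message)
```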
-
Send an email when an Azure deployment succeeds or fails
So far as I can tell, right now, the only way to tell whether your Azure deployment succeeded or failed is to go into the Azure portal, navigate through half a dozen options, and check. There's precisely zero alerting if your deployment fails. That's a horrible user experience. Azure really ought to send an email (at least to the main account holder, but it should be configurable) if the deployment succeeds or fails.
44 votes
This is a very interesting suggestion — likely it’s not scoped to just deployment, but other long-running operations in the portal. We likely will want to come up with a holistic system to alert you when something fails.
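Until such alerting exists, a deployment script can do its own notification. This is a generic sketch rather than anything Azure-specific; the deploy.sh step, SMTP host, and addresses are placeholders:

```python
# Sketch: wrap your own deployment step and email the outcome.
import smtplib
import subprocess
from email.message import EmailMessage

def notify(subject: str, body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "deploy-bot@example.com"
    msg["To"] = "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

def deploy() -> None:
    result = subprocess.run(["./deploy.sh"], capture_output=True, text=True)   # your deployment step
    status = "succeeded" if result.returncode == 0 else "FAILED"
    notify(f"Azure deployment {status}", result.stdout + result.stderr)

if __name__ == "__main__":
    deploy()
```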
-
Have the diagnostics engine push remaining data to storage on shutdown
I'd like the Diagnostics agent to be aware of a graceful shutdown scenario (the instance count being lowered, for example) and, if it has been pushing data to storage on a schedule, attempt to push any data collected since the last scheduled transfer before the system is fully shut down.
If you have a schedule set up to move data every 5 or 10 minutes (or really any schedule that isn't measured in seconds), you could lose a considerable amount of data if the role is shut down between scheduled pushes. It would be nice if an attempt were actually made…
43 votes
Thank you for the suggestion. This seems like a reasonable way to make sure that you collect all of the diagnostic data from your virtual machines. I have passed this idea on to the WAD team.
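The same effect can be approximated from application code today. A sketch, assuming the azure-storage-blob package and a placeholder container name and connection string; the request above is for the Diagnostics agent itself to do this:

```python
# Sketch: buffer log lines in the application and flush whatever remains to
# blob storage when the process shuts down gracefully.
import atexit
from datetime import datetime, timezone

from azure.storage.blob import BlobClient

_buffer = []

def log(line: str) -> None:
    _buffer.append(line)

def flush_to_blob() -> None:
    if not _buffer:
        return
    name = f"shutdown-flush-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.log"
    blob = BlobClient.from_connection_string("<storage-connection-string>", "logs", name)
    blob.upload_blob("\n".join(_buffer))

# Runs on normal interpreter shutdown, e.g. when the role stops gracefully.
atexit.register(flush_to_blob)
```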
-
Don't limit the number of automated database exports
Backups are a sad story for Azure Databases, but the new export feature is at least a step in the right direction.
However, there's a limit (10) on the number of automated exports that can be enabled, and that is both strange and crippling. Strange, because it goes against the scalability goal of Azure. We pay per database. ****, we even pay per export. Why the limit? Crippling, because it effectively makes Azure Databases a no go if you want to have more than one database.
Also, the only information about this limit comes when it's too late.
40 votes
The reason for this limit is that automated database export is currently a preview feature. When this feature is GA’d, the limit will be removed.
-
Provide functionality to address the character limit on API requests
Hi,
We are using the new Azure SDK 2.5 Diagnostics for our Cloud Service.
Since our Performance Counters are dynamic and rely on the process ID and process identifier (the thing with the #), we cannot set the configuration in the wadcfg files but have to create the configuration at runtime and upload it via the REST API, e.g. ChangeDeploymentConfiguration:
https://management.core.windows.net/{0}/services/hostedservices/{1}/deploymentslots/{2}/?comp=config ("POST")
Today we received an error saying that the maximum length for the public configuration cannot exceed 20,480 characters.
That is not acceptable.
First of all, we have so many performance counters that we can easily reach this limit. And secondly…
35 votes
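A small defensive check, based on the 20,480-character figure quoted in the error above, can at least fail fast before the REST call is made; the file name here is a placeholder:

```python
# Sketch: fail fast before calling ChangeDeploymentConfiguration if the public
# configuration would be rejected for exceeding the documented error's limit.
MAX_PUBLIC_CONFIG_CHARS = 20480

def check_public_config(xml_text: str) -> None:
    if len(xml_text) > MAX_PUBLIC_CONFIG_CHARS:
        raise ValueError(
            f"Public diagnostics configuration is {len(xml_text)} characters; "
            f"the management API currently rejects anything over {MAX_PUBLIC_CONFIG_CHARS}."
        )

if __name__ == "__main__":
    with open("PublicConfig.xml", encoding="utf-8") as f:   # hypothetical file
        check_public_config(f.read())
```
-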
Autoscale Action With Multiple Rules
Currently, every autoscale action can only have one rule associated with it. There is no option to autoscale based on multiple rules.
For example: scale in by one instance if (CPU% < 40%) AND (Memory% < 40%).
34 votes
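The requested AND semantics are easy to express in a custom scaling job, which is the usual workaround today. A tiny sketch; the thresholds mirror the example above, and how the metrics are fetched and the scale operation is triggered is left open:

```python
# Sketch: AND of two metric conditions as a custom scale-in check.
def should_scale_in(cpu_percent: float, memory_percent: float) -> bool:
    return cpu_percent < 40 and memory_percent < 40

assert should_scale_in(35, 30) is True    # both quiet: remove an instance
assert should_scale_in(35, 80) is False   # memory still busy: keep the instance
```
-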
Monitor VM Status
Add a feature to monitor the Status of a VM with some conditions.
Ex.: I want to receive an alert when the status of VM "X" is not "Running".
24 votes
We don’t currently have any way to alert on conditions like these (you can only alert on metrics), but I think I understand the scenario and we can take it under consideration. Thanks,
-
SMS warning, server heartbeat function
As a partner, we monitor many sites for customers who expect us to act if something goes wrong or goes offline. We could use a heartbeat service that sends an SMS / mobile text message when a service has been offline for x minutes, defined individually per site.
24 votes
We are considering adding SMS as a way to notify you of alerts.
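A self-hosted version of that heartbeat is sketched below; the probe URLs and per-site thresholds are placeholders, and the SMS call is a stub to be wired to whatever gateway you use:

```python
# Sketch: probe each site on a schedule and notify once it has been unreachable
# for its configured number of minutes.
import time
import urllib.request

SITES = {
    "https://customer-a.example.com/health": 5,    # minutes offline before alerting
    "https://customer-b.example.com/health": 15,
}

def send_sms(text: str) -> None:
    print("SMS:", text)   # placeholder: substitute your SMS gateway or provider API

def is_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status < 500
    except Exception:
        return False

def monitor() -> None:
    down_since = {}
    while True:
        now = time.monotonic()
        for url, minutes in SITES.items():
            if is_up(url):
                down_since.pop(url, None)
            elif now - down_since.setdefault(url, now) >= minutes * 60:
                send_sms(f"{url} has been offline for {minutes}+ minutes")
                down_since[url] = now   # reset to avoid re-alerting every loop
        time.sleep(60)
```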
-
Add the ability to monitor total RAM usage on a VM
We have a graph that monitors CPU usage, network traffic, and disk read/write, but it would be very nice to have a graph that shows RAM usage on a VM over a period of time (much like the CPU graph), especially when deciding whether to switch between, say, an A2 and an A5.
23 votes
-
Allow Users Added via RBAC the Ability to Create Alerts and Availability Monitors in AI
I have users that I'd like to grant the ability to create, modify, and delete Alerts and Availability Monitors. Even adding the users to the Owners role does not give them permission to create the alerts via the portal (I have not tried the API yet). If it were up to me, a Contributor should be able to do so.
22 votes
This is possible today. Instead of granting a user access to a specific resource, grant access to the resource group. Owners and contributors will then be able to create alert rules.
Do you specifically need the ability to manage alert rules for a single AI component? Currently, alert rules are global and don’t inherit permissions from AI components. It’d be good to understand your needs and expectations and if resource group access suffices.
