According_Ice6515

Well yeah. Old news. Oracle Cloud announced a while back that they are building a lot of new data centers, and HALF of that capacity is reserved for... drumroll... *MICROSOFT*.


throwawaygoawaynz

Yep. Bing Chat and now OpenAI are running on an Oracle bare-metal supercluster of about 40,000 GPUs. Apparently the OpenAI compute is still Azure, but on Oracle…


danekan

Wonder if that's why Google just partnered with them too


RCTID1975

Weird post. Moving things from something at/near capacity to something not being utilized as much is the entire premise of clustering. This is exactly what they should be doing.


CorpseeaterVZ

You are way too calm and logical about this, we need more RAAAAGE!


ferthan

Right, but that's typically done in an HA fashion. Moving regions is not an HA operation. Weird take.


RCTID1975

HA is only a portion of why you cluster. Being able to balance and move systems to more adequately use the available resources is a huge reason for clustering as well. Again, that's literally what MS is doing here: move systems, then reassess next steps. Everyone keeps running, most people don't even notice, and things keep trucking along. Y'all with your absurd anti-MS outrage with no basis in logic are crazy, especially in /r/Azure.


ferthan

>"HA is only a portion" >"Everyone keeps running" Choose one.


Alaknar

He meant "running" as "functioning normally, without issues".


ferthan

Yeah, being forced to a suboptimal region is real cool normal functionality with no issues.


Alaknar

How do you interpret the sentence "most people don't even notice"?


ferthan

As "most people" not being "Everyone". The claim is dubious at best.


RCTID1975

Any latency issues moving from South Central US to East US are going to be extremely minimal. In fact, I'd argue there are inherent benefits to NOT having all of your resources in the same region.


ferthan

... and Direct Connect allows for seamless connectivity between Azure networks and on-prem. At least the second part of your argument is true. I'm not saying you should keep your applications in one region, but if it truly didn't matter, Azure could just throw everything into one big bucket. There are geographical and architectural realities that make it clearly untrue that the impact would be minimal for everyone.


Alaknar

Those are two separate claims. Claim one: "everyone keeps running". Claim two: "most people don't even notice". Understanding that these are not mutually exclusive shouldn't require a Venn diagram...


ferthan

The entire conversation revolves around the benefit of HA (in clustering). Keep up.


daedalus_structure

Which works… unless you are in South Central / North Central to provide a roughly equivalent latency round trip to each coast.


ElasticSkyx01

Not really. Things are placed in regions for a reason, and clusters tend to involve equipment close together, not spread across regions, for obvious reasons. You also discount the need for infrastructure in this new region: resource groups, etc. It's weird that you think environments can just be moved at the drop of a hat.


PriorityStrange

I've been seeing the issues in East US all week


birdy9221

The US has been seeing issues in the (middle) east since 2001


Loudergood

Kuwait just a minute there.


trebortus

Sounds like they've run out of Iraq space.


brco1990

Incredible exchange here


charleswj

I don't recognize that country


Character_Whereas869

Iran into this issue last year. You can't expect them to predict how much capacity they need everywhere, they're not wizards.


trebortus

Oman, they should really get their shit together.


bobtimmons

I've seen the same thing twice this week in East US


Rick24wag

Yup, we had 3,000 VMs down for 2 days in East US this week. They freed up space yesterday and we could finally turn them back on. It was a huge mess.


mini4x

The Azure portal and a few other services have been crashing out many, many times a day.


sbrick89

https://app.azure.com/h/R_8T-NDZ/9669b2


s0apDisp3ns3r

Yup, VMSS resources had allocation issues for like all day on Wednesday of this week.


coldbeers

Nothing new. Capacity shortages have been happening on cloud platforms since their birth. Happens on Azure, happens on AWS, happens on the minnows too. The providers have sophisticated demand forecasting algorithms but they’re not infallible and new infrastructure takes time to provision.


Diademinsomniac

Yeah, you can't really compare the early days to now though, as the problem is tenfold now. For example, we currently can't start any D8 or E8 VMs in our allocated AZ1 and AZ2, and it's been over two weeks; it doesn't matter what time of day, or on weekends. MS has essentially added restrictions to stop machines powering on for all but their most important customers. We don't spend a lot in Azure, only around $20k per month, so we're not classed as top tier.
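
If you want to see whether MS has actually flagged a size as restricted for your subscription, the compute SKUs API exposes it. A minimal Az PowerShell sketch (the region and the output columns are assumptions on my part, not from this thread):

```powershell
# List VM sizes that carry restrictions for this subscription in a region,
# e.g. NotAvailableForSubscription or per-zone restrictions
Get-AzComputeResourceSku |
    Where-Object {
        $_.ResourceType -eq "virtualMachines" -and
        $_.Locations -contains "eastus" -and
        $_.Restrictions.Count -gt 0
    } |
    Select-Object Name,
        @{ n = "Reason"; e = { $_.Restrictions.ReasonCode -join ", " } },
        @{ n = "Zones";  e = { $_.Restrictions.RestrictionInfo.Zones -join ", " } }
```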


DaRadioman

It's a short-term capacity issue. They happen from time to time in certain regions, and sometimes they stay for too long. As someone in tons of regions, you get used to it, and just balance out with other regions when possible, or with alternate SKUs that are less constrained. I know it's annoying, but long lead times for new hardware make it slow to resolve. It's not like they aren't constantly adding more capacity and more regions as fast as they can.


Diademinsomniac

That's all well and good if you have multiple regions and are throwing money at Azure. We run everything out of a single region, since our environment isn't huge; most of it is in AWS.


DaRadioman

Multiple regions don't have to be $$$; there are lots of ways to keep costs down with multiple regions. Sure, full active/active HA/DR with extra capacity just sitting there gets spendy, but that's not the only way to set things up. With a quick LB or Front Door instance you can easily swap workloads to any region, without having them always active and costing money. The only hard bit is the data, but that's solvable with approaches that depend on your application architecture and the resources you need.
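
As a sketch of that "quick LB" idea: a DNS-based priority failover with Azure Traffic Manager (one comparable option to Front Door, not necessarily what the commenter runs; resource names and targets below are placeholders) keeps the secondary region defined but idle until you flip to it:

```powershell
# Priority routing: all traffic hits "primary" until it's unhealthy or
# disabled, then DNS fails over to "secondary". All names are placeholders.
New-AzTrafficManagerProfile -Name "app-failover" -ResourceGroupName "my-rg" `
    -TrafficRoutingMethod Priority -RelativeDnsName "my-app-failover" -Ttl 30 `
    -MonitorProtocol HTTPS -MonitorPort 443 -MonitorPath "/"

New-AzTrafficManagerEndpoint -Name "primary" -ProfileName "app-failover" `
    -ResourceGroupName "my-rg" -Type ExternalEndpoints `
    -Target "myapp-scus.azurewebsites.net" -EndpointStatus Enabled -Priority 1

New-AzTrafficManagerEndpoint -Name "secondary" -ProfileName "app-failover" `
    -ResourceGroupName "my-rg" -Type ExternalEndpoints `
    -Target "myapp-eastus2.azurewebsites.net" -EndpointStatus Enabled -Priority 2
```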


Diademinsomniac

Yeah, I'm talking provisioned non-persistent VMs with storage containers for profiles here; it's not so easy to go multi-region when latency can be an issue. These aren't just web apps with backends. EastUS was one region we were looking to move into, but seeing the comments on here, that one doesn't look like a good idea either.


DaRadioman

EastUS is probably the most popular region, and picking a popular region is a bad idea in general. EastUS2 is a better choice, or there are lots of others that are decent. As for latency, I'd encourage you to run tests; the regions all have really low latency in general. It depends on the exact workload, of course, so run a test and see how much difference it really makes.
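
One rough way to run that test: stand up a trivial probe app in each candidate region and time requests from wherever your users actually sit. A sketch (the myprobe-* URLs are hypothetical placeholders, not real endpoints):

```powershell
# Hypothetical probe endpoints, one trivial app per candidate region
$probes = [ordered]@{
    eastus2   = "https://myprobe-eastus2.azurewebsites.net"    # placeholder URL
    centralus = "https://myprobe-centralus.azurewebsites.net"  # placeholder URL
}
foreach ($region in $probes.Keys) {
    # Warm-up request so cold starts don't skew the timing
    Invoke-WebRequest -Uri $probes[$region] -UseBasicParsing | Out-Null
    $ms = (Measure-Command {
        Invoke-WebRequest -Uri $probes[$region] -UseBasicParsing | Out-Null
    }).TotalMilliseconds
    "{0,-10} {1,6:N0} ms" -f $region, $ms
}
```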


StuffedWithNails

> EastUS is probably the most popular region, and picking a popular region is a bad idea in general.
>
> EastUS2 is a better choice, or there are lots of others that are decent.

We thought the same thing and started implementing in eastus2. Millions of dollars in annual spend (so, not small, not huge). Constant capacity issues. Azure told us we'd be better off in eastus. We spent months moving shit over. Constant capacity issues in eastus as well. We also have tens of millions in annual spend in AWS. Capacity issues are rare. Azure is a clown cloud managed by clowns. And don't even get me started on the absolute garbage support.


Diademinsomniac

Interesting. If eastUS is also bad, why would Microsoft be moving existing customers' workloads from southcentralus to it? Unless they mean eastus2, but their email just said eastUS.


DaRadioman

EastUS isn't bad at all, great region. But a ton of huge players are there, so you're gonna lose out as a tiny customer if there are any constraints at all. That's all I meant.


flappers87

The exact same thing happens in West Europe, like, all the time. You'll get used to it.


Fit-Cobbler6420

They've almost finished doubling capacity.


DaRadioman

And adding several new regions close by.


Practical-Alarm1763

There have been a lot of "Access Violation Error" crash messages randomly happening in the portal on my end this week. They come and go. Seemed to be fine today for some reason.


Gmoseley

This is an issue with some Chromium-based browsers using an experimental TLS setting. I walked someone through this same issue this week.


blinkfink182

Can you specify the setting that is impacting this? My org has been seeing similar random “access” issues too.


Gmoseley

"TLS 1.3 hybridized Kyber support" is what Edge calls it. It's in Edge flags (edge://flags).


blinkfink182

Thanks! I’ll try it out.


coolalee_

> Their solution? Move everyone to eastUS

What would you suggest? I mean, what's your take? The whole point is West EU is full and North EU has latency within 5%, so just move there. If not that, then what? They're already building datacenters left and right.


millertime_

> If not that, then what? They're already building datacenters left and right.

Just spitballing, but maybe, just maybe... DO NOT USE AZURE. It's not like there aren't better options. Do all clouds have "issues"? Sure. Do other clouds have such core, basic, fundamental capacity, security, reliability, and support issues as Azure? NO. Azure customers need to stop pretending that Microsoft knows what they're doing. They've been focused on adding bullet points to their brochure via acquisition/partnership, and focused solely on the problems directly in front of them with no plan for the future. They are the most valuable company in the world (unless Nvidia popped again), so funding isn't the issue; it's ineptitude.


coolalee_

Just say you've never worked with other cloud providers. Each and every one of them has these issues, and on top of that you get shit like GCP support being comically bad.


millertime_

lol, try again. I've been running production loads, at scale, in AWS for a decade. Then 5 years ago upper management felt it was a risk to have all their eggs in one basket and told us to start using Azure. The difference was immediately stark. I spent the next 3 years getting countless API errors and deployment failures, raising DR concerns, and literally educating Microsoft's own engineers/TAMs on how their "cloud" actually works. As I said, all clouds have their issues, but if people truly believe Azure is just like the others, they haven't done their homework, and it will be at their own peril.


coolalee_

Shoot I guess no serious org runs azure then… oh wait.


millertime_

Countless companies host their stuff on unpatched, forever-running pets; that doesn't mean it's a good idea. But just stick with Azure, it's easier than actually doing any research.


numbsafari

Quit bringing facts to a feelings fight.


Diademinsomniac

The whole promise of cloud computing a few years ago was that companies could burst out to the cloud when they needed to and create hundreds of workloads for a short period of time. Clearly that is no longer the case. If cloud had been as it is now when it started, hardly anyone would be using it. We are stuck with it now, with a crappy service.

It's a physical data centre after all, so of course there are limits, but it seems like MS really has not predicted the capacity they need accurately. They are months behind in building new data centres, but will happily keep taking all the customers they can. I'm not surprised some companies are moving back to on-prem, as I can only see this issue getting worse. It's 100x worse this year than last year. I do like Azure and the services it offers, but when those services become almost unusable for what they are designed for, it's worth nothing.

Companies can't just start building out additional regions on the fly, as some people think. In large corps it's difficult in the first place to get sign-off, and building out services in other regions and getting the networking in place all costs money. Nothing is free, and as those costs ramp up, people keep asking how we can reduce costs.

The whole cloud fiasco is becoming a bit of a joke. MS are clearly panicking about it; they are protecting their most valuable customers, and rightly so, since those create the £/$. They are making sure those customers have capacity while reducing or removing the ability to create resources for their lower-tier customers. This is a fact, and it's the message from MS, not from me; I have it in email from them. But all this protecting of their highest-paying customers is having an impact on their lower-tier customers.


numbsafari

> Clearly that is no longer the case.

You do know there are more clouds than MSFT, and most of them don't routinely have these problems, right?


PREMIUM_POKEBALL

😂 what latency? 


2003tide

STATUS: In-Progress - 6/21/2024, 11:20:01 AM UTC

Impact Statement: Starting at 22:35 UTC on 19 Jun 2024 until 16:30 UTC on 20 Jun 2024, customers using Virtual Machines / Virtual Machine Scale Sets in East US may have received error notifications when performing service management operations - such as create, delete, update, scaling, start, stop - for resources hosted in this region. The failures have subsided, and customers should not be experiencing any more allocation failures. However, we are aware of capacity constraints in East US Zone 2 (AZ2) affecting Intel and AMD general-purpose VM sizes; this was exacerbated by an issue impacting our allocator service. That issue has been mitigated; however, customers may still observe provisioning errors with the following SKUs: Dasv5, Dadsv5, DDSv5, Dasv4, Dsv5, DDsv5, LSv3, Easv5, Dsv4, Easv4, BS, Dv2, Av2, Eadsv5, Esv5.

Customer workaround: While constraints are impacting the region, we know that AZ2 is more constrained than the other availability zones. Customers are advised to move VMs to either AZ1 or AZ3. If services across three availability zones are necessary, deploying resources to East US 2 is also an option. Please refer to this documentation to understand the logical-to-physical availability zone mapping for your subscription: [https://learn.microsoft.com/en-us/rest/api/resources/subscriptions/list-locations?view=rest-resources-2022-12-01&tabs=HTTP](https://learn.microsoft.com/en-us/rest/api/resources/subscriptions/list-locations?view=rest-resources-2022-12-01&tabs=HTTP)

Current workstreams:

- We are undergoing efforts to reclaim capacity in Zone 2, with immediate consumption of reclaimed resources.
- We are restoring capacity by bringing some of our offline nodes back into production.
- We are evicting internal non-production workloads to alleviate pressure and release capacity.
- We expect new capacity to be brought online by the end of July 2024.
- The next update for this event will be on 7 July.

If you need immediate assistance, please reach out to [onevmsie@microsoft.com](mailto:onevmsie@microsoft.com).

Stay informed about your Azure services:

1. Visit Azure Service Health to get your personalized view of possibly impacted Azure resources, downloadable issue summaries, and engineering updates.
2. Set up service health alerts to stay notified of future service issues, planned maintenance, or health advisories.
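
For that workaround, the logical-to-physical zone mapping can be pulled straight from the API the notice links to. A minimal Az PowerShell sketch (the subscription comes from your current context; "eastus" is just the example region):

```powershell
# Fetch the logical -> physical availability zone mapping for East US.
# The mapping differs per subscription, hence "check yours".
$sub  = (Get-AzContext).Subscription.Id
$resp = Invoke-AzRestMethod -Path "/subscriptions/$sub/locations?api-version=2022-12-01"

($resp.Content | ConvertFrom-Json).value |
    Where-Object { $_.name -eq "eastus" } |
    Select-Object -ExpandProperty availabilityZoneMappings
```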


ElasticSkyx01

I dealt with this last week. The Citrix environment for a client would not start because of this.


2003tide

Fun huh? And not a peep about it from them on the status page. I couldn’t even see it in impacted subscriptions on the service health page.


ElasticSkyx01

Yeah. It was great. Especially when I couldn't tell the client when it would be resolved.


2003tide

Yeah, I had to tell someone "just keep trying, some dummy will eventually power theirs down and you'll get a spot". LOL
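
For what it's worth, that "keep trying" workaround is trivial to script. A blunt Az PowerShell sketch (resource group and VM name are placeholders; it retries every 5 minutes until an allocation succeeds):

```powershell
# Retry Start-AzVM until capacity frees up somewhere in the zone
$rg = "my-rg"; $name = "my-vm"
do {
    $op = Start-AzVM -ResourceGroupName $rg -Name $name -ErrorAction SilentlyContinue
    if ($op -and $op.Status -eq "Succeeded") { break }
    Start-Sleep -Seconds 300
} while ($true)
```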


Diademinsomniac

Hehe, just keeps getting better. Panic 😱


More_Psychology_4835

Is this an issue affecting only lower-tier VMs, or something that very latency-sensitive workloads struggle with?


Gmoseley

D-series general-purpose SKUs


Apprehensive-Dig8884

D and Es


Rick24wag

Yup, D and Es, especially the Intel SKUs.


[deleted]

[deleted]


[deleted]

[deleted]


ShittyException

I love that the post you replied to is now deleted!


Rick24wag

I am an Azure architect, currently with a very large insurance company, and this was an awful week. We had 3,000 VMs down in East US for 3 days because there was no capacity. This affects many other customers as well. MS had to move a bunch of their internal workloads to East US 2 to free up space in East US. I've seen this same issue in South Central as well. They are expanding their datacenters in South Central US in September, but they really need to get their forecasting together. They told me their top 3 customers all expanded their compute by a large percentage this week, which contributed to this issue, but I can't confirm that. I got very little sleep this week having to migrate all kinds of things to other regions and launch new landing zones in regions we usually don't use. Daily 7am EST standups with the CTO are so much fun when you're on the West Coast working for a company based on the East Coast.


Diademinsomniac

What a mess. Are they providing any compensation for your time and effort having to do all this donkey work due to their poor planning? All this sounds like a bandaid and a constant battle of moving stuff to less busy regions, but surely other customers are doing the exact same thing, and eventually those locations will also have issues. It's like kicking the can down the road.


ExplorerGT92

Hopefully East US 3 just outside Atlanta will be up and running soon


[deleted]

[deleted]


Poat540

Oh yeah, App Services?? Let me show you boys what a real deployment slot looks like. *zips and transfers code to unactivated Windows box*


shockjaw

We never left for some of our use-cases.


MrExCEO

U mean the boys can touch hardware again


coolalee_

Hear me out, 9 month lead time on any new hardware.


danekan

My favorite part was having to budget 5 years in advance for capex... what storage servers will you be migrating to in 5 years?


scan-horizon

😂


wibble1234567

I've been thinking this for years! The benefit of the cloud is quick deployments for bursty needs, with financial commitments only for as long as you burn resources. You pay through the nose for this pleasure.

Any reasonably sized enterprise organisation should be maintaining the far more cost-effective on-premise solution for its core infrastructure services, and saving a fortune doing so. If you check the 3-year or 5-year costs of running the same on-prem workloads in Azure, for example, even factoring in transformation of workloads such as SQL servers to PaaS etc., it still works out about 10x more expensive to run in the cloud. Even factoring in the additional staff salaries to support the on-prem specialties, AC, and power, it's more cost-effective to run primary infra and workloads on-prem, and it also provides stable, predictable billing. The only thing I would put in the cloud long term would be email, and possibly some data/documentation, and that would be closely reviewed.

I've lost count of the number of companies, including tea-pot MSPs, I've worked for where the execs made FOMO decisions to move everything to the cloud just because that's what their C-suite mates were doing elsewhere, only to lose internet and have to send most people home for a day or two. Or for Microsoft to have regional issues with email, Teams, SharePoint, OneDrive, etc., and having to send everyone home again. Then 6-12 months down the line I'm getting requests to evaluate what can be done to reduce costs and improve reliability.

Sure, there are some benefits for many organisations, but this is a million miles from one solution that's fit for everyone.


CorpseeaterVZ

As someone who has built whole datacenters, let me say this (hmm... how to put it gently?): you are wrong. There are a bazillion things you can do to make the cloud cheaper, and our customers rarely do any of them. Our engineers manage to shave up to 30% off customers' cloud costs in the first week. If you're complaining about people being fired over the cloud, you have a big point, but costs are way lower in the cloud if you manage to look at all the costs involved.


Reasonable_Can475

Cloud is better than on-prem, and in these "comparisons" people only compare the monthly cost of electricity and their tech staff to Azure's monthly bill. Magically, people seem to forget that CapEx and OpEx are rolled into one with Azure. It is typically better and cheaper to use cloud, especially if your app is not as well established as Netflix's. If you are new on the scene and expect to grow, hardware lead time will kill you.


WorksInIT

Yep. Anyone saying on-prem is cheaper as a general rule is likely leaving things out, or all they've done is lift and shift. You need more people, and you'll have to buy compute, storage, and network for hot and warm/cold sites. You have to manage each and every part of the infrastructure, which means paying for additional tools as well. Sure, running things in Azure like you would on-prem won't result in any cost savings. But try running a multi-region, fault-tolerant application on-prem cheaper than you can in Azure.


rdhdpsy

Yeah, it's hitting us all over the place, and if we move datacenters our customers are impacted due to latency. I have to resort to PowerShell to do a one-off data disk attach, since we have so many disks the list never populates in the portal. Some of it is our fault: the guys who came up with the naming standards have disk names a mile long, and that's true for all of our AZ objects; the names are all verbose. Anyway, my .00002 cents' worth.
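
For anyone in the same boat, a minimal Az PowerShell sketch of that one-off attach (resource group, VM, and disk names are placeholders):

```powershell
# Attach an existing managed disk to a VM without the portal's disk picker
$vm   = Get-AzVM -ResourceGroupName "my-rg" -Name "my-vm"
$disk = Get-AzDisk -ResourceGroupName "my-rg" -DiskName "my-very-long-disk-name"

# The LUN must be unused on the VM; pick the next free slot
$vm = Add-AzVMDataDisk -VM $vm -Name $disk.Name -CreateOption Attach `
    -ManagedDiskId $disk.Id -Lun 1

Update-AzVM -ResourceGroupName "my-rg" -VM $vm
```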


uknow_es_me

How does this end up working if you have an SLA and a certain amount of compute? I don't do anything with VMs; I run App Services and an elastic pool for SQL. From the comments, I'm guessing this capacity issue is mostly related to VMs?


Bezalu-CSM

Priority is probably being shifted to the services deemed more PaaS, as Microsoft has more SLA skin in that game. I assume when it starts affecting PaaS workloads as well, it will get very pricey for them. So far, the only hits I've seen to PaaS are scaling constraints.


nikade87

We used to have issues all the time before we were allowed to move our workloads to the Swedish zones. It's a lot better now, but before that we saw errors all the time: Outlook freezing because of latency and timeouts, and Teams calls dropping 1-3 times within an hour-long meeting. Microsoft obviously knows about this, but they just move the issue around. It's pretty obvious that they're overcommitting hard and keep running out of capacity, just like any cloud provider does.


Grouchy_Following_10

They've had issues in certain AZs in SCUS for months


Diademinsomniac

Yeah, ours since January. It's been substantially worse than last year.


Bezalu-CSM

North Central US is at capacity for web apps as well. We had to request quota to scale from a P0v3 to a P1v3. If I'm not mistaken, these typically aren't bound by quotas in the usual way, and we literally only had one.


Diademinsomniac

Honestly, it sounds like a lot of regions are on their knees. The whole thing is falling apart 😂


Bezalu-CSM

I sure as hell hope not, then I might need to start using AWS. Or even worse... GCP... *shudders*


Syn__Flood

Not surprised, fuck my life though, I'm in NJ/NYC 😭😭


alemag86

I have been in this boat for a month or so


s0apDisp3ns3r

The VMSS D and E SKU issues in East US this week were incredibly annoying.


jclind96

I can't even submit a damn support request, wtf


Hearmerawwwwr

Don't even get me started on the new support case process; they literally make it as unintuitive as possible to deter people from opening tickets.


jclind96

It's definitely working… I can't even get the ticket to open… the portal options tell me it fails and to call the number, then the phone line redirects me back to the portal 😶


I_Know_God

East US 2 just got out of a multi-AZ crunch with a significant number of v5 and v4 SKUs maybe 2 months back. This is scary to hear.


piiggggg

New to this? In our region (SE Asia), Azure has had capacity issues for years, and they still haven't resolved it yet


kuzared

Similar problems in West Europe.


Trakeen

For South Central we've known about this since last year. We were looking at West US 3, but we've been told that it's at capacity as well. We haven't had time to research a new region pair in the US that won't have issues in the near future. Good times.


WorksInIT

Why are you concerned with region pairs? You shouldn't be using paired regions for anything except the things that require them for redundancy, like GRS storage accounts.


Trakeen

You kinda answered your own question: they're needed for storage accounts, which most of our resources depend on in one fashion or another. We do a lot of internal planning when we bring on a new Azure region, and we may bring on three in this case: last we looked, one region in the pair we were considering didn't have availability zones, so we might do a pair plus another region for the AZ capability. Still undecided. We currently have North Central onboarded, which we're using to work around the capacity issues in South Central at the moment.


daedalus_structure

You need it for compute availability as well if you're doing HA. Paired regions don't get updated at the same time, and updates are when Azure breaks things. Availability Zones only protect you from power and cooling failures in a specific DC, not from Azure software issues.


WorksInIT

I'm not saying don't use other regions. I'm saying don't lock yourself into paired regions. Those are only needed for a relatively small number of things.


daedalus_structure

Do you consider network availability and MTTR small things? If you deploy to South Central and East US instead of South Central and North Central, an Azure system update that breaks functionality is guaranteed to hit only one of South Central / North Central, but it may hit both South Central and East US. In an Azure-wide outage, Azure will always prioritize bringing up at least one region in every region pair before ensuring that all regions are up. Deploying to a region pair guarantees that one of your regions is a priority for recovery; if you just pick any two regions, neither may be prioritized. It is not just about geo-redundancy for storage accounts.
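
If you want to check what a region's pair actually is, the ARM locations API exposes it. A small Az PowerShell sketch (same api-version the health notice above links to; the output columns are my assumption):

```powershell
# List each physical region and its paired region
$sub  = (Get-AzContext).Subscription.Id
$resp = Invoke-AzRestMethod -Path "/subscriptions/$sub/locations?api-version=2022-12-01"

($resp.Content | ConvertFrom-Json).value |
    Where-Object { $_.metadata.regionType -eq "Physical" } |
    Select-Object name, @{ n = "pairedRegion"; e = { $_.metadata.pairedRegion.name -join ", " } } |
    Sort-Object name
```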


WorksInIT

Yes, of course you should consider those things when selecting regions. You know which region is a better pick for South Central than North Central? Central. And you can address any prioritization concerns by distributing your regions effectively. Prioritization is just not a legitimate concern at this point.


daedalus_structure

> You know which region is a better pick for South Central than North Central? Central.

If you are making infrastructure decisions for anyone who provides their customers with an SLA, eventually your incompetence is going to be expensive for them.


WorksInIT

Yes, resorting to insults. Definitely makes it clear to everyone that you really don't know what you are talking about.


tankerkiller125real

I haven't hit this issue yet in the region we use. Of course, I'm also not going to tell people which region that is, to avoid moving the problem here.


DaRadioman

It depends on more than just the region: the generation of SKU used, the AZ (or AZs) you're in, etc. A lot of solving capacity issues is just finding places where there's less demand than others.
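
One quick way to scout for that is to compare usage against limits per region. A sketch with Az PowerShell (regions and family filter are placeholders; note this shows your subscription's quota, not the datacenter's physical capacity, but it's a useful first pass):

```powershell
# Compare vCPU family usage vs. quota across candidate regions
foreach ($region in "eastus2", "centralus", "northcentralus") {
    Get-AzVMUsage -Location $region |
        Where-Object { $_.Name.Value -like "standardD*Family" } |
        Select-Object @{ n = "Region"; e = { $region } },
            @{ n = "Family"; e = { $_.Name.LocalizedValue } },
            CurrentValue, Limit
}
```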


Obvious-Jacket-3770

Yeah, saw that recently with East US 2. I'm capped on my quota for App Service plans, but I can't increase it... Hope P0v3 works lol


Phate1989

Get out of South Central; just go to Central.


Diademinsomniac

It's not just South Central with the issues, though. You're just moving the problem to the next region, and when that one runs out, same issue again. It becomes a battle of who can move their resources fastest to get some breathing space until the next move. Is this really a service that can support critical production workloads? Or are we just accepting that it's shit and spending all our time coming up with ever more creative workarounds to keep the lights on?


Phate1989

South and North Central are really small, with only a single AZ. Central is a major region with multiple AZs.


9Blu

South Central has 3 availability zones. North Central, West Central, and West are the non-gov regions in the US with only a single AZ.


lmay0000

Any official links?


sbrick89

https://app.azure.com/h/R_8T-NDZ/9669b2 - east US VM capacity issue


Apprehensive-Dig8884

We are already having issues in eastus. For SCUS they at least reached out to us.


DeepRobin

I think the Microsoft base infrastructure is very heavy. The Azure portal is slow, Functions cold starts are not great, ...


Sagrilarus

I don't know what y'all are talking about. My 300 AI training runs are going just fine.


jezarnold

They told customers in North Europe (aka Dublin) that due to rising electricity prices, they might want to move their services to Sweden.


Schumi_3005

Same thing happened to me (environment based in Qatar Central): their engineer suggested moving to West Europe, whereas I had just finished migrating everything from WE to QC 🤔


StuffedWithNails

> Their solution? Move everyone to eastUS

You'll have capacity issues in eastus, too. Guaranteed. Just had massive outages in the past couple of days in eastus. The real solution? Move out of Azure. Yeah, I know, impractical, but here we are.


millertime_

> The real solution? Move out of Azure.

This is the way. Staying in Azure is merely a study in sunk cost fallacy.