Cloud architecture is regularly discussed as if the network layer designs itself -- a commodity handled by the hyperscaler that the architect does not need to think about too carefully. Spin up a few VPCs, peer them together, stand up a VPN tunnel or two to on-premises, and the cloud transformation journey has begun.
After reviewing cloud migration and transformation programmes across financial services, biopharmaceutical, healthcare and automotive environments, I can say with confidence that this assumption is one of the most consistent sources of post-migration failure. The network is not a commodity. In complex enterprise environments, it is the architecture. Everything else -- application performance, security posture, compliance, resilience -- depends on the network design being correct. The platform is different. The principles are not.
What cloud network design actually involves
A cloud network design for an enterprise programme involves decisions that cannot be reversed cheaply once made. VPC and VNet architecture -- how the network is segmented, how traffic is routed and inspected, how boundaries are enforced -- shapes every security, compliance and connectivity decision that follows. Getting this wrong at the start means rebuilding it under pressure, with production systems in place.
Central to any cloud network design is understanding how traffic actually moves. There are two fundamentally different traffic patterns that must be designed for separately, because the failure modes, security controls and performance constraints for each are entirely different.
North-south flows: where most designs start, and where most mistakes are visible
North-south traffic enters or leaves the cloud environment -- from the internet into workloads, from data centres into cloud, or from cloud back to on-premises systems. This is where most architects start because it is the most visible: users cannot reach applications, partners cannot connect to APIs, data cannot flow between sites.
The most common north-south failure is over-reliance on a single connectivity path without genuine resilience modelling. ExpressRoute and Direct Connect provide private, dedicated connectivity -- but a single circuit is not a resilient design. Diverse physical paths, clear failover behaviour and deterministic BGP reconvergence need to be specified at design stage, not resolved during an outage. The LLD must specify actual BGP communities, AS path prepend values and route filtering policies -- not describe the intent and leave implementation detail to the engineer who builds it.
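To illustrate the level of specificity I mean, here is a minimal sketch of a per-circuit BGP specification with a completeness check -- every value is a hypothetical example, not a recommendation:

    # Sketch: the level of per-circuit BGP detail an LLD should pin down.
    # All values here are illustrative examples, not recommendations.
    CIRCUITS = {
        "expressroute-primary": {
            "local_asn": 65010,
            "peer_asn": 12076,             # Microsoft's ASN for ExpressRoute peering
            "advertised_prefixes": ["10.20.0.0/16"],
            "as_path_prepend": 0,          # preferred path: no prepend
            "bgp_community": "65010:100",  # tag: primary circuit
        },
        "expressroute-secondary": {
            "local_asn": 65010,
            "peer_asn": 12076,
            "advertised_prefixes": ["10.20.0.0/16"],
            "as_path_prepend": 3,          # de-preferred: prepend own ASN three times
            "bgp_community": "65010:200",  # tag: secondary circuit
        },
    }

    REQUIRED = {"local_asn", "peer_asn", "advertised_prefixes",
                "as_path_prepend", "bgp_community"}

    for name, attrs in CIRCUITS.items():
        missing = REQUIRED - attrs.keys()
        if missing:
            raise ValueError(f"{name}: LLD leaves {missing} unspecified")

An LLD that can populate this table completely, per circuit, is specifying a design. One that cannot is describing an intention.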
MTU is a consistent north-south failure point that standard testing almost never catches. ExpressRoute and Direct Connect circuits, VPN tunnels, SD-WAN overlays and encapsulation protocols all introduce MTU constraints that interact with each other. A network that performs correctly in a proof of concept -- where traffic volumes are low and the impact of fragmentation is masked -- can degrade significantly under production load. Every connectivity path in the LLD must specify the MTU value at each segment, and application teams must validate their traffic profiles against those constraints before go-live.
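The interaction is arithmetic and can be modelled at design time. A minimal sketch computing the effective MTU along a path -- the overhead figures are typical values and must be confirmed against the actual protocols and cipher suites in the design:

    # Sketch: effective MTU after stacked encapsulation overheads.
    # Overhead figures are approximate; confirm against the actual
    # encapsulation stack and cipher suites in use.
    OVERHEAD_BYTES = {
        "geneve": 50,         # GENEVE outer headers (approximate)
        "ipsec_tunnel": 73,   # ESP tunnel mode, AES-GCM (approximate)
        "vxlan": 50,
        "gre": 24,
    }

    def effective_mtu(link_mtu: int, encapsulations: list[str]) -> int:
        """MTU available to the inner packet after each encapsulation."""
        mtu = link_mtu
        for encap in encapsulations:
            mtu -= OVERHEAD_BYTES[encap]
        return mtu

    # Example: a 1500-byte circuit carrying an IPsec tunnel inside GRE
    print(effective_mtu(1500, ["gre", "ipsec_tunnel"]))   # 1403
    # TCP MSS must then be clamped to effective MTU - 40 (IPv4 + TCP headers)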
East-west flows: where the real risk lives, and where most designs are silent
East-west traffic moves between workloads inside the cloud environment -- between VPCs or VNets, between accounts or subscriptions, between application tiers. This is where most cloud network designs are dangerously incomplete, because east-west flows are not visible until something goes wrong.
The most common east-west failure is a flat network with implicit trust between workloads. A compromised workload in one VPC can reach workloads in any other with no inspection, no enforcement boundary and no detection. In regulated environments -- financial services, healthcare, biopharmaceutical -- this is not a theoretical risk. It is the design condition that turns a containable security incident into a reportable breach.
The diagnostic question every east-west design must answer explicitly is this: if a workload in environment A is compromised, what can it reach? If the honest answer is "most things," the design is not segmented -- it is flat with extra routing.
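On AWS, that question can be answered with evidence rather than assertion using Reachability Analyzer. A minimal boto3 sketch, with placeholder ENI IDs and a hypothetical database port:

    # Sketch: ask AWS Reachability Analyzer whether a workload in
    # environment A can reach a workload in environment B.
    # ENI IDs and port below are placeholders.
    import time
    import boto3

    ec2 = boto3.client("ec2")

    path_id = ec2.create_network_insights_path(
        Source="eni-0123456789abcdef0",        # compromised-workload stand-in
        Destination="eni-0fedcba9876543210",   # sensitive workload
        Protocol="tcp",
        DestinationPort=5432,                  # e.g. a database port
    )["NetworkInsightsPath"]["NetworkInsightsPathId"]

    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path_id
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # The analysis runs asynchronously; poll until it completes.
    while True:
        result = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if result["Status"] != "running":
            break
        time.sleep(5)

    # NetworkPathFound == True means the boundary the design claims
    # to enforce does not exist for this flow.
    print(result.get("NetworkPathFound"))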
Centralised inspection and cloud-native security
The decision of where and how to inspect traffic -- both north-south and east-west -- is one of the most consequential choices in a cloud network design. Done correctly, it enforces consistent policy across all traffic flows. Done poorly, it creates inspection gaps that are invisible until they are exploited.
For AWS, the centralised inspection model is built around a dedicated inspection VPC, with AWS Transit Gateway routing all inter-VPC and internet-bound traffic through that enforcement point before it reaches its destination. Gateway Load Balancer (GWLB) and its endpoint (GWLBe) make this scalable and transparent -- distributing traffic across a fleet of inspection appliances using GENEVE encapsulation, preserving the original source and destination IP addresses so the appliance sees the actual traffic flow. A GWLBe is deployed in each spoke VPC as the next hop for traffic requiring inspection. The appliance processes and returns traffic without the source or destination being aware that inspection occurred. This is the correct architecture for centralised inspection at scale.
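The spoke-side wiring that makes this work is a route whose next hop is the GWLB endpoint. A minimal boto3 sketch of that route entry, with placeholder resource IDs:

    # Sketch: point a spoke VPC's default route at the GWLB endpoint so
    # egress traffic traverses the inspection fleet. IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2")

    ec2.create_route(
        RouteTableId="rtb-0123456789abcdef0",
        DestinationCidrBlock="0.0.0.0/0",
        VpcEndpointId="vpce-0123456789abcdef0",   # the GWLBe in this spoke
    )
    # The return path matters just as much: the inspection VPC's route
    # tables must send traffic back through the same endpoint, or flows
    # become asymmetric and stateful appliances will drop them.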
For Azure, the equivalent model uses Azure Firewall in a hub VNet -- or Azure Virtual WAN with Firewall Manager for multi-region deployments -- with forced tunnelling ensuring all spoke traffic traverses the firewall. vWAN simplifies the routing model significantly but introduces constraints on custom routing that must be understood before committing to the architecture.
The failure mode I see most frequently is an inspection architecture that looks complete on a diagram but has gaps in the route table configuration that allow certain traffic flows to bypass the inspection point entirely. Centralised inspection that does not inspect all traffic flows is not a security control -- it is an audit finding waiting to happen.
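Verifying this is not exotic work -- it is reading the route tables. A minimal audit sketch, assuming for illustration that spoke route tables carry a role tag and that the only legitimate next hop for non-local traffic is the inspection Transit Gateway:

    # Sketch: flag spoke routes that bypass the inspection path.
    # Assumes (for this example) every non-local spoke route should
    # point at the Transit Gateway; adjust to the actual design.
    import boto3

    ec2 = boto3.client("ec2")
    INSPECTION_TGW = "tgw-0123456789abcdef0"   # placeholder

    paginator = ec2.get_paginator("describe_route_tables")
    for page in paginator.paginate(
        Filters=[{"Name": "tag:role", "Values": ["spoke"]}]
    ):
        for rt in page["RouteTables"]:
            for route in rt["Routes"]:
                if route.get("GatewayId") == "local":
                    continue  # intra-VPC routing is expected
                if route.get("TransitGatewayId") != INSPECTION_TGW:
                    print(f"BYPASS: {rt['RouteTableId']} -> "
                          f"{route.get('DestinationCidrBlock')} via "
                          f"{route.get('GatewayId') or route.get('VpcPeeringConnectionId')}")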
Secure connectivity at scale: PrivateLink and cloud-native service access
As cloud environments grow, how workloads access cloud services -- and how those services are exposed to other workloads -- becomes an architectural decision in its own right. The default -- access over the public endpoint -- is the wrong answer for most enterprise environments, and particularly so for regulated ones.
AWS PrivateLink creates a private endpoint within the VPC representing a specific service, whether AWS-managed or hosted in another VPC or account. Traffic travels entirely within the AWS network, governed by the security group of the endpoint rather than requiring broad outbound internet access rules. An environment designed with VPC endpoints for all applicable services can enforce a default-deny outbound policy. Without them, broad outbound rules are typically required -- which undermines the segmentation the rest of the design is working to achieve.
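Whether that default-deny condition actually holds is checkable: compare the services workloads depend on against the endpoints that exist. A minimal sketch -- the required-services set is illustrative and should be derived from the application's real dependencies:

    # Sketch: check that every service the VPC depends on has a VPC
    # endpoint, so outbound internet access is never the fallback path.
    # REQUIRED is illustrative; derive it from actual dependencies.
    import boto3

    REGION = "eu-west-1"
    VPC_ID = "vpc-0123456789abcdef0"   # placeholder
    REQUIRED = {
        f"com.amazonaws.{REGION}.s3",
        f"com.amazonaws.{REGION}.sts",
        f"com.amazonaws.{REGION}.ecr.api",
        f"com.amazonaws.{REGION}.ecr.dkr",
        f"com.amazonaws.{REGION}.logs",
    }

    ec2 = boto3.client("ec2", region_name=REGION)
    resp = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [VPC_ID]}]
    )
    present = {ep["ServiceName"] for ep in resp["VpcEndpoints"]}

    for service in sorted(REQUIRED - present):
        print(f"MISSING endpoint: {service} -- broad outbound rules will "
              f"be needed to reach it, weakening default-deny")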
Azure Private Link provides the equivalent capability. The DNS configuration is where Azure Private Link designs most commonly fail -- private DNS zones must be linked to every VNet that needs to resolve private endpoint names. In hub-spoke topologies this is frequently misconfigured, causing resolution to fall back to public IPs and traffic to bypass the private path entirely.
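The verification is cheap and should run from inside every consuming VNet before go-live: resolve the service name and confirm the answer is a private address. A minimal sketch, with a placeholder hostname:

    # Sketch: confirm a Private Link FQDN resolves to a private IP from
    # inside the consuming VNet. Hostname is a placeholder.
    import ipaddress
    import socket

    FQDN = "examplestore.blob.core.windows.net"   # placeholder

    addrs = {info[4][0] for info in socket.getaddrinfo(FQDN, 443)}
    for addr in addrs:
        if ipaddress.ip_address(addr).is_private:
            print(f"OK: {FQDN} -> {addr} (private endpoint)")
        else:
            print(f"FAIL: {FQDN} -> {addr} -- resolution has fallen back "
                  f"to the public endpoint; check private DNS zone links")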
The Landing Zone is not the network design
AWS and Azure Landing Zone frameworks provide an opinionated starting point for account structure and management hierarchy. They are not a substitute for network design. I have reviewed programmes where a Landing Zone was deployed and the team considered the network design done. The Landing Zone was correctly configured. The routing, segmentation, connectivity model, MTU strategy and DNS architecture -- the decisions that determine whether the environment will actually work -- had not been done at all. A framework deployment guide is not a network architect.
Replicating on-premises architecture in cloud
One of the most consistent design errors I see is the replication of on-premises services -- firewalls, load balancers, monitoring tools -- into the cloud environment without redesigning them for cloud-native operation. The motivation is understandable: it appears to reduce risk by preserving familiar operational models. In practice, it creates architectures that are costly to run, difficult to scale and poorly aligned with how cloud networking actually works.
I reviewed a programme where a customer migrated their on-premises vendor firewall and load balancer into cloud as virtual appliances. When those appliances could not integrate cleanly with the cloud environment's routing and inspection requirements, the team deployed cloud-native equivalents in front of them. The result was two overlapping layers of security and load-balancing infrastructure -- double the cost, double the management overhead -- and a remediation bill significantly higher than an independent design review would have been. Cloud-native services -- AWS Gateway Load Balancer, Azure Firewall, private endpoints -- exist because the traffic patterns and operational requirements of cloud environments are different. Designs that ignore them are not reducing risk. They are deferring it.
What independent cloud network review looks for
When I review a cloud network design, the questions I am asking are the same questions a CCDE-level network architect asks of any design -- applied to a cloud context.
Are east-west traffic flows segmented and inspected, or is the environment flat behind a perimeter control? Does every traffic flow traverse the inspection point, verified in the route tables -- not just shown on a diagram? Are north-south connectivity paths genuinely resilient, with specific failover behaviour documented and tested? Are VPC endpoints and Private Link configured for all applicable services, with DNS resolution verified for every consuming VNet?
Does the LLD specify actual configuration values -- CIDR ranges, BGP attributes, MTU settings, security group rules, route table entries, GWLB listener configurations -- or does it describe intent and leave implementation to the engineer who builds it?
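Some of that verification can even be mechanical. A crude but useful first pass, assuming the LLD is available as plain text, is to scan for the placeholder language that signals intent rather than specification:

    # Sketch: flag LLD placeholder language that signals unfinished design.
    # The pattern list is illustrative; extend with programme-specific markers.
    import re
    import sys

    PLACEHOLDERS = re.compile(
        r"\b(TBC|TBD|to be (confirmed|agreed|defined)|placeholder)\b",
        re.IGNORECASE,
    )

    with open(sys.argv[1]) as lld:
        for line_no, line in enumerate(lld, start=1):
            if PLACEHOLDERS.search(line):
                print(f"line {line_no}: {line.strip()}")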
Cloud network design is still network design. The platform changes the tools. It does not change the discipline required to use them correctly.
The failure patterns I see most often
- The Landing Zone is treated as the network design. Account structure and management hierarchy are in place. Routing, segmentation, MTU strategy, connectivity model and DNS architecture have not been designed at all.
- East-west traffic flows are uncontrolled. Spoke VPCs route directly to each other without traversing an inspection point. The inspection architecture exists on the diagram but is not enforced in the route tables.
- MTU and latency have not been modelled against application requirements. The design assumes performance will be acceptable. Production reveals it is not -- particularly for applications migrated from low-latency data centre environments.
- On-premises services replicated without redesign. Virtual firewall appliances and load balancers deployed with the same operational model as on-premises equivalents -- creating architectures that are expensive, difficult to scale and misaligned with cloud-native design.
- Private Link DNS misconfigured in hub-spoke topologies. Private DNS zones not linked to all consuming VNets. Resolution falls back to public IPs. Traffic bypasses the private path and the endpoint provides no security value.
- The LLD contains intent, not configuration. IP ranges to be confirmed. BGP communities to be agreed. MTU values absent. Private DNS resolution undefined. The document describes a design. It does not specify an implementation.