Cloud Computing Interview Questions and Answers (2026)

Preparing for a cloud or DevOps interview in 2026? This is the most comprehensive guide covering 200+ Cloud Computing interview questions with detailed answers — from beginner to advanced — for Cloud Engineers, DevOps Engineers, SREs, and Cloud Architects.

Topics covered: Cloud fundamentals · IaaS/PaaS/SaaS · Networking & VPC · Security & IAM · Containers & Kubernetes · Serverless · AWS/Azure/GCP services · CI/CD & IaC · High availability & DR · Cost optimization · Cloud migration · Microservices architecture

Who this is for: Freshers targeting cloud roles, experienced engineers preparing for senior interviews, and anyone interviewing at AWS, Google, Microsoft, Datadog, Cloudflare, or any cloud-native company.

1. Basic Cloud Computing

Q1. What is cloud computing?

Cloud computing is the on-demand delivery of computing resources — servers, storage, databases, networking, software, analytics — over the internet with pay-as-you-go pricing. Instead of owning physical data centers, you access technology services from a cloud provider like AWS, Azure, or GCP.

Q2. What are the key characteristics of cloud computing?

NIST (Special Publication 800-145) defines five essential characteristics:

On-demand self-service — provision resources instantly without human interaction
Broad network access — accessible via any internet-connected device
Resource pooling — multi-tenant shared infrastructure
Rapid elasticity — scale resources up/down automatically
Measured service — pay only for what you consume (metered billing)

Q3. What are the main advantages of using cloud computing?

Cost savings — no upfront capital expenditure, pay-as-you-go
Scalability — scale instantly to handle traffic spikes
Global reach — deploy to data centers worldwide in minutes
High availability — built-in redundancy across multiple zones
Faster innovation — focus on code, not hardware management
Disaster recovery — automated backups, geo-redundancy
Security — world-class security from providers with dedicated security teams

Q4. What are the different types of cloud deployment models?

Model	Who owns it	Best for	Examples
Public	Cloud provider	Startups, scalable workloads	AWS, Azure, GCP
Private	Organisation	Banks, government, compliance	OpenStack, VMware
Hybrid	Both	Burst to public, sensitive data on-prem	AWS Outposts, Azure Arc
Multi-cloud	Multiple providers	Avoid vendor lock-in	AWS + GCP + Azure
Community	Shared org group	Research, government agencies	GovCloud

Q5. What is the difference between public, private, and hybrid cloud?

Public cloud — infrastructure owned and operated by a third-party provider, shared among multiple customers. Cheapest, most scalable.
Private cloud — infrastructure dedicated to a single organisation. More control, higher cost, better compliance.
Hybrid cloud — combination of public and private cloud connected by technology that allows data and applications to move between them. Best of both worlds.

Q6. How does cloud computing differ from traditional data center operations?

Aspect	Traditional Data Center	Cloud Computing
Capital cost	High upfront (servers, networking)	Zero upfront — pay as you go
Scaling	Months to procure new hardware	Minutes to spin up new instances
Maintenance	In-house team manages hardware	Provider handles hardware
Global reach	Limited to owned locations	Dozens of regions worldwide per major provider
Elasticity	Fixed capacity — must over-provision	Elastic — scale with demand
Innovation speed	Slow (hardware procurement)	Fast (API-driven provisioning)

Q7. What is virtualization and how does it relate to cloud computing?

Virtualization creates software-based (virtual) versions of physical hardware. A hypervisor runs on physical hardware and creates multiple isolated virtual machines (VMs), each with its own OS.

Cloud computing is built on virtualization — providers run thousands of VMs on shared physical servers and rent them to customers. Without virtualization, multi-tenancy, elasticity, and cost efficiency would not be possible.

Type 1 hypervisors (bare-metal): VMware ESXi, KVM, Hyper-V — run directly on hardware, used in production. Type 2 hypervisors (hosted): VirtualBox, VMware Workstation — run on top of an OS, used for dev/test.

Q8. What is a hypervisor and its types?

A hypervisor (Virtual Machine Monitor) is software that creates and manages virtual machines by abstracting physical hardware.

Type 1 (Bare-metal) — runs directly on hardware, no host OS. Faster and more secure. Examples: VMware ESXi, Microsoft Hyper-V, KVM, Xen. Used in all major cloud providers.
Type 2 (Hosted) — runs on top of an existing OS. Slower but easier for dev/test. Examples: VirtualBox, VMware Workstation, Parallels.

Q9. What is elasticity in cloud computing?

Elasticity is the ability to automatically provision and de-provision resources in real time based on demand — scale out when load increases, scale in when load decreases.

Example: An e-commerce site that automatically adds 50 EC2 instances during Black Friday traffic and removes them afterward is elastic. Elasticity = scalability + automation + cost efficiency.

Q10. What is the difference between scalability and elasticity?

Scalability — the ability to handle increased load by adding resources. Can be manual or planned.
- Vertical (scale-up): bigger server (more CPU/RAM)
- Horizontal (scale-out): more servers behind a load balancer
Elasticity — automatic, real-time scaling up AND down based on current demand. Elasticity is scalability + automation.

Key difference: scalability is about capacity; elasticity is about automatic, dynamic adjustment.

Q11. What are the primary components of cloud architecture?

Front-end — client devices and interfaces (browser, mobile app)
Back-end — cloud infrastructure (servers, databases, storage)
Network — internet connection between front and back end
Cloud delivery model — IaaS, PaaS, SaaS
Security layer — IAM, encryption, firewalls
Management layer — monitoring, orchestration, automation tools

Q12. What is a Service Level Agreement (SLA)?

An SLA is a contract between cloud provider and customer defining the guaranteed level of service.

Key SLA metrics:

Availability/Uptime: 99.9% ≈ 8.76 hrs downtime/year, 99.99% ≈ 52.6 min/year, 99.999% ≈ 5.3 min/year
Response time: how quickly support responds
Incident resolution time: how quickly issues are resolved
Remedies: service credits if SLA is missed

Always check SLA exclusions — scheduled maintenance, customer-caused outages, and force majeure events are typically excluded.
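
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from any provider SDK):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% usually requires architectural changes rather than tuning.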

Q13. What is the difference between scalability and elasticity in cloud computing?

See Q10 above — scalability is the ability to grow capacity (manual or planned), while elasticity is automatic, real-time dynamic scaling in both directions.

Q14. What are the four layers of cloud architecture?

Physical layer — hardware (servers, storage, networking)
Virtualization layer — hypervisor creates VMs from physical resources
Cloud platform layer — cloud OS, APIs, management tools
Application layer — end-user applications and services running on the cloud

2. Cloud Service Models

Q15. What are the main cloud service models?

Model	You manage	Provider manages	Examples
IaaS	OS, runtime, apps, data	Servers, storage, networking	EC2, Azure VMs, GCP Compute Engine
PaaS	Apps and data	OS, runtime, middleware, servers	Heroku, Google App Engine, Azure App Service
SaaS	Just your data	Everything	Gmail, Salesforce, Dropbox, Zoom
FaaS	Function code only	Runtime, scaling, infra	AWS Lambda, Azure Functions, Google Cloud Functions

Q16. What is Infrastructure as a Service (IaaS)?

IaaS provides virtualised computing resources over the internet. You get raw compute, storage, and networking — you manage the OS, middleware, runtime, and applications. The provider manages the physical hardware.

Best for: hosting custom applications, running VMs, data storage, DR, dev/test environments. Examples: AWS EC2, Azure Virtual Machines, Google Compute Engine, DigitalOcean Droplets.

Q17. What is Platform as a Service (PaaS)?

PaaS provides a platform for developing, deploying, and managing applications without managing the underlying infrastructure. The provider manages OS, runtime, middleware, and servers.

Best for: web app development, API backends, database management, microservices without K8s complexity. Examples: Google App Engine, Azure App Service, Heroku, AWS Elastic Beanstalk, Render.

Q18. What is Software as a Service (SaaS)?

SaaS delivers software applications over the internet on subscription. The provider manages everything — infrastructure, platform, and software. Customers just use the application.

Best for: email, CRM, HR tools, collaboration software — no installation or maintenance. Examples: Gmail, Microsoft 365, Salesforce, Zoom, Slack, Dropbox, Notion.

3. Cloud Storage and Networking

Q19. What are the different types of cloud storage?

Type	Description	Access method	Use case	AWS example
Object storage	Flat namespace, key-value, highly scalable	HTTP API	Images, videos, backups, static websites	S3
Block storage	Raw storage volumes attached to VMs like a hard drive	Mounted as disk	OS disks, databases	EBS
File storage	Shared file system mountable by multiple VMs (NFS/SMB)	Network file protocol	Shared app files, home directories	EFS
Archive storage	Very low-cost, high-latency, for long-term retention	HTTP API; retrieval can take minutes to hours	Compliance archives, cold data	S3 Glacier

Q20. What is a Content Delivery Network (CDN) and how does it work?

A CDN is a globally distributed network of edge servers that cache static content (images, CSS, JS, videos) close to end users, reducing latency.

How it works:

User requests a file
DNS routes to nearest edge server (Point of Presence)
Cache hit → served immediately from edge
Cache miss → fetch from origin, cache it, serve

Benefits: lower latency, reduced origin load, DDoS absorption, global availability. Examples: AWS CloudFront, Cloudflare, Akamai, Azure CDN, Fastly.
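
The hit/miss flow above can be sketched with a toy in-memory edge cache. This is an illustration only: the `EdgeCache` class, the `origin` function, and the 60-second TTL are assumptions made for the example, and real CDNs also honour cache-control headers, invalidations, and more.

```python
import time


class EdgeCache:
    """Toy edge server: serve from cache on a hit, fetch from origin on a miss."""

    def __init__(self, fetch_from_origin, ttl_seconds=60.0):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds          # assumed freshness window
        self._store = {}                 # path -> (content, expiry timestamp)

    def get(self, path):
        now = time.monotonic()
        entry = self._store.get(path)
        if entry is not None and entry[1] > now:
            return entry[0], "HIT"       # served from the edge
        content = self._fetch(path)      # cache miss: go back to the origin
        self._store[path] = (content, now + self._ttl)
        return content, "MISS"


origin_requests = []


def origin(path):
    """Hypothetical origin server; records every request it receives."""
    origin_requests.append(path)
    return f"<contents of {path}>"


edge = EdgeCache(origin)
first = edge.get("/logo.png")
second = edge.get("/logo.png")
print(first[1], second[1], "origin requests:", len(origin_requests))
```

The second request never reaches the origin, which is where the reduced origin load comes from.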

Q21. What is a Virtual Private Cloud (VPC)?

A VPC is a logically isolated section of the cloud provider's network where you launch resources in a virtual network you control. You define IP ranges, subnets, route tables, and security rules.

Key VPC components:

Subnets — public (internet-facing) or private (internal)
Internet Gateway — connects VPC to internet
NAT Gateway — lets private subnet instances reach internet without being exposed
Security Groups — stateful instance-level firewall (allow rules only)
Network ACLs — stateless subnet-level firewall (allow + deny rules)
Route Tables — control traffic routing within VPC
VPC Peering — connect two VPCs privately
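
Defining IP ranges and subnets is mostly CIDR arithmetic. A small sketch using Python's standard `ipaddress` module; the 10.0.0.0/16 range and the subnet names are hypothetical choices for illustration:

```python
import ipaddress
from itertools import islice

vpc = ipaddress.ip_network("10.0.0.0/16")  # hypothetical VPC range

# Carve the VPC into /24 subnets and take the first four.
first_four = list(islice(vpc.subnets(new_prefix=24), 4))
plan = dict(zip(["public-a", "public-b", "private-a", "private-b"], first_four))

for name, net in plan.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")

# Sanity check: every subnet must sit inside the VPC range.
assert all(net.subnet_of(vpc) for net in plan.values())
```

Planning non-overlapping ranges up front matters because overlapping CIDR blocks generally prevent VPC peering later.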

Q22. What is a cloud VPN?

A Cloud VPN creates an encrypted tunnel between an on-premises network and a cloud VPC over the public internet. Used to securely extend an on-premises data center into the cloud.

Types:

Site-to-site VPN — connects entire on-prem network to cloud VPC
Client VPN — individual user connects to cloud network

AWS: AWS Site-to-Site VPN, AWS Client VPN. For higher bandwidth and lower latency: use AWS Direct Connect or Azure ExpressRoute (dedicated private fiber connection).

Q23. What is cloud latency?

Latency is the round-trip time (RTT) for data to travel from client to cloud server and back. Affected by physical distance, network congestion, number of routing hops, and server processing time.

To minimise latency: choose the closest region, use a CDN for static assets, use edge computing, optimise database queries, use connection pooling.

Q24. What is load balancing in cloud computing?

A load balancer distributes incoming traffic across multiple backend instances to prevent overload and ensure availability.

Types by OSI layer:

Layer 4 (Transport/TCP): routes based on IP address and TCP/UDP port. Faster, no content inspection. AWS NLB.
Layer 7 (Application/HTTP): routes based on URL path, HTTP headers, hostname, cookies. AWS ALB.

Algorithms: Round Robin, Least Connections, IP Hash, Weighted. Load balancers perform health checks — stop routing to unhealthy instances automatically.
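
Two of the algorithms named above, Round Robin and Least Connections, can be sketched in a few lines. The class names and backend names are hypothetical, and a real load balancer would also apply health checks before picking a backend:

```python
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the backend with the fewest in-flight connections."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


backends = ["app-1", "app-2", "app-3"]  # hypothetical instance names

rr = RoundRobin(backends)
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnections(backends)
lc_picks = [lc.pick() for _ in range(3)]
lc.release("app-2")        # app-2 finishes its request first
next_pick = lc.pick()      # so it receives the next request

print(rr_picks, lc_picks, next_pick)
```

Round Robin ignores how busy each backend is, while Least Connections adapts when requests have uneven durations.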

Q25. What is a cloud database and its types?

Type	Description	Examples
Relational (SQL)	Structured data, ACID transactions, complex queries	AWS RDS, Azure SQL, Cloud SQL, Aurora
NoSQL Key-Value	Simple key-value, ultra-fast, massive scale	DynamoDB, Redis, Memcached
NoSQL Document	JSON documents, flexible schema	MongoDB Atlas, Firestore, Cosmos DB
NoSQL Wide-Column	Rows with flexible columns, high write throughput at large scale	Cassandra, Bigtable, HBase
Graph	Relationships between entities	Neptune, Neo4j
Time-series	Time-stamped metrics and events	InfluxDB, Timestream, TimescaleDB
Data Warehouse	Analytical queries on petabytes	Snowflake, BigQuery, Redshift

Q26. What is cloud networking?

Cloud networking provides virtual networking components — VPCs, subnets, load balancers, DNS, VPNs, firewalls — that connect cloud resources to each other and to the internet. Cloud networking is defined in software (Software-Defined Networking / SDN) rather than physical cables.

Q27. What is a cloud router?

A cloud router is a software-defined router that dynamically manages routing between VPC networks, on-premises networks, and other cloud regions. It uses BGP (Border Gateway Protocol) to exchange routing information.

AWS: AWS Transit Gateway (connect multiple VPCs and on-prem networks). GCP: Cloud Router.

Q28. What is cloud interconnect?

Cloud Interconnect provides a dedicated, private, high-bandwidth connection between your on-premises network and the cloud provider — bypassing the public internet.

Better performance, lower latency, and more consistent throughput than VPN over internet. Examples: AWS Direct Connect, Azure ExpressRoute, Google Cloud Interconnect.

Q29. What is a cloud storage gateway?

A cloud storage gateway is a software appliance or hardware device at an on-premises data center that bridges on-prem applications to cloud storage. It translates local file/block requests to cloud object storage APIs.

Use case: backup on-prem data to S3, extend NAS storage to cloud without changing applications. AWS: AWS Storage Gateway (File Gateway, Volume Gateway, Tape Gateway).

4. Cloud Deployment and Management

Q30. What is auto-scaling and how is it implemented?

Auto-scaling automatically adjusts the number of compute resources based on demand. Types:

Reactive (dynamic) — scale based on real-time metrics (CPU > 70% → add instance)
Scheduled — scale at predicted times (scale up every Monday 9AM)
Predictive — ML-based demand forecasting (AWS Predictive Scaling)

AWS Auto Scaling Group example: min=2, max=20, desired=4. Scale-out: add 2 when CPU > 80% for 5 min. Scale-in: remove 1 when CPU < 30% for 10 min.
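
The example policy above can be expressed as a small decision function. This is a sketch of the logic only, not how AWS implements it; it assumes one average CPU sample per minute, and the function and parameter names are made up for illustration:

```python
def next_capacity(current, cpu_samples, minimum=2, maximum=20):
    """Return the new instance count given per-minute CPU averages (newest last)."""
    last_5, last_10 = cpu_samples[-5:], cpu_samples[-10:]
    if len(last_5) == 5 and all(cpu > 80 for cpu in last_5):
        return min(current + 2, maximum)   # scale out by 2, capped at max
    if len(last_10) == 10 and all(cpu < 30 for cpu in last_10):
        return max(current - 1, minimum)   # scale in by 1, floored at min
    return current                         # no sustained breach: hold steady


print(next_capacity(4, [85, 90, 88, 92, 95]))   # sustained high CPU
print(next_capacity(4, [20] * 10))              # sustained low CPU
print(next_capacity(4, [85, 90, 88, 92, 50]))   # spike not sustained
```

Scaling out faster than scaling in, as in this policy, is a common way to avoid flapping when load hovers near a threshold.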

Q31. What is cloud bursting and when is it used?

Cloud bursting is a hybrid configuration where an app runs on-premises but automatically bursts to a public cloud when local capacity is exceeded. You only pay for the burst capacity.

When to use: predictable seasonal spikes (retail on Black Friday), batch processing overflow, dev/test capacity. Challenges: network latency, data transfer costs, security at the boundary.

Q32. What is cloud provisioning?

Cloud provisioning is the process of allocating cloud resources to users or applications — selecting appropriate services, configuring settings, and making them available. Modern provisioning is done via IaC tools (Terraform, CloudFormation) or APIs, not manually through consoles.

Q33. What are cloud regions and availability zones?

Region — a geographic area with multiple data centers (e.g., us-east-1, eu-west-2). Choose closest to users for lowest latency.
Availability Zone (AZ) — one or more isolated data centers in a region with independent power, cooling, and networking. Typically 3+ AZs per region.
Edge Location — CDN cache points closer to users (AWS CloudFront).

Best practice: distribute across 3 AZs for HA; across 2+ regions for disaster recovery.
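
A rough way to see why spreading across AZs helps: if each AZ fails independently and any single surviving AZ can carry the load (both are simplifying assumptions), the chance that all copies are down at once shrinks quickly. The 99% per-AZ figure below is a hypothetical input, not a provider guarantee.

```python
def combined_availability(single_az_availability, az_count):
    """Availability of N redundant copies, assuming independent failures."""
    all_down = (1 - single_az_availability) ** az_count
    return 1 - all_down


for az_count in (1, 2, 3):
    print(az_count, f"{combined_availability(0.99, az_count):.6f}")
```

Real failures are not fully independent (shared control planes, bad deployments), which is one reason a second region is still recommended for disaster recovery.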

Q34. What is cloud automation?

Cloud automation uses tools and scripts to automatically provision, configure, deploy, and manage cloud infrastructure without manual intervention.

Tools: Terraform (IaC), Ansible (configuration), AWS Systems Manager (patching, commands), AWS Lambda (event-driven automation), CloudFormation (AWS-native IaC).

Benefits: consistency, speed, reduced human error, version-controlled infrastructure.

Q35. What is cloud monitoring?

Cloud monitoring collects and analyses metrics, logs, and traces from cloud resources to detect issues, track performance, and ensure availability.

Key areas: infrastructure metrics (CPU, memory, disk), application metrics (request rate, error rate, latency), logs (application, access, audit), distributed tracing.

Tools: AWS CloudWatch, Datadog, Prometheus + Grafana, New Relic, Dynatrace, Azure Monitor.

Q36. What is cloud governance?

Cloud governance is the set of rules, policies, and processes that ensure cloud usage aligns with business objectives, security requirements, and compliance standards.

Key elements: cost management (tagging, budgets, alerts), security policies (SCPs in AWS), compliance frameworks (CIS, SOC2, GDPR), naming conventions, access controls.

AWS tools: AWS Organizations, Service Control Policies (SCPs), AWS Config, AWS Control Tower.

Q37. What is the difference between horizontal and vertical scaling?

Vertical scaling (scale-up) — increase the size of an existing instance (more CPU, RAM). Has an upper limit. Requires restart/downtime in some cases.
Horizontal scaling (scale-out) — add more instances behind a load balancer. No hard limit. Cloud-native apps are designed for horizontal scaling.

Best practice: design stateless applications that scale horizontally. Store session state in Redis/DynamoDB.

Q38. What is multi-cloud strategy?

Multi-cloud means using services from multiple cloud providers (e.g., AWS for compute, GCP for ML/BigQuery, Cloudflare for edge). Benefits: avoid vendor lock-in, best-of-breed services, data sovereignty, resilience. Challenges: increased complexity, different APIs, higher operational overhead, data transfer costs. Management tools: Terraform, Crossplane, Anthos, Azure Arc.

Q39. What is resource replication and why is it important?

Resource replication copies data or configuration across multiple locations (AZs, regions) to ensure availability, durability, and disaster recovery.

Examples: S3 Cross-Region Replication, RDS Multi-AZ (synchronous replication), DynamoDB global tables (multi-region replication), database read replicas for performance.

Q40. What is cloud federation?

Cloud federation is the integration of multiple cloud environments (from different providers or on-premises) into a unified management layer, allowing workloads to move seamlessly between them based on cost, performance, or compliance needs.

5. Cloud Migration

Q41. What is cloud migration?

Cloud migration is the process of moving applications, data, and infrastructure from on-premises data centers (or from one cloud) to another cloud environment.

Migration phases: Discovery (inventory current state) → Assessment (analyse dependencies) → Planning (choose strategy) → Migration (move workloads) → Optimisation (right-size, cost-optimise).

Q42. What is cloud orchestration?

Cloud orchestration is the automated coordination and management of multiple cloud services and workflows to create a cohesive system. It ensures cloud resources work together in the right order.

Example: provision a VPC → launch EC2 instances → configure RDS → deploy application → configure load balancer — all automated via Terraform or AWS CloudFormation.
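
Working out "the right order" is a dependency-graph problem. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to order the resources from the example; the dependency edges are an assumption made for illustration, and tools such as Terraform derive a similar graph from resource references.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each resource maps to the set of resources it depends on (assumed edges).
dependencies = {
    "vpc": set(),
    "ec2_instances": {"vpc"},
    "rds_database": {"vpc"},
    "application": {"ec2_instances", "rds_database"},
    "load_balancer": {"application"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Resources with no dependency between them (here the EC2 instances and the database) may appear in either order, and an orchestrator is free to create them in parallel.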

Q43. Explain the "lift and shift" approach and the 6 Rs of migration.

Strategy	Description	Effort	Cloud benefit
Rehost (lift & shift)	Move VM as-is to cloud	Low	Fast migration, immediate cloud benefits
Replatform (lift, tinker, shift)	Minor optimisations (e.g., move to RDS)	Low-medium	Reduced ops overhead
Refactor / Re-architect	Redesign for cloud-native (microservices, serverless)	High	Maximum agility and scalability
Repurchase	Replace with SaaS (e.g., CRM → Salesforce)	Medium	No more software maintenance
Retire	Decommission apps no longer needed	Low	Cost savings
Retain	Keep on-premises (compliance, latency)	None	Compliance or business need

Q44. What are the key considerations for cloud migration?

Application dependencies — map all dependencies before migrating
Data migration strategy — online (live) vs offline migration, data size, downtime tolerance
Security and compliance — data sovereignty, GDPR, encryption requirements
Network connectivity — VPN or Direct Connect during migration
Cost estimation — TCO analysis, reserved instance planning
Rollback plan — ability to revert if migration fails
Team upskilling — train teams on cloud-native tools

Q45. What is data residency and data sovereignty in cloud?

Data residency — specifies where data is physically stored (which country/region). Regulatory compliance often requires data to stay within borders.
Data sovereignty — data is subject to the laws of the country where it is stored. Even if your company is in India, data stored in the US is subject to US laws (e.g., CLOUD Act).

Solution: choose cloud regions that match your regulatory requirements, use data localisation features, or use region-locked services.

6. Cloud Security & Compliance

Q46. What is Identity and Access Management (IAM)?

IAM controls who (identity) can do what (access) on which resources. Core components:

Users — individual identities
Groups — collection of users with shared permissions
Roles — temporary permissions for services or cross-account access
Policies — JSON documents defining allow/deny actions on resources
Principle of least privilege — grant minimum permissions required
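
To make the allow/deny logic concrete, here is a deliberately simplified evaluator: an explicit Deny wins, otherwise a matching Allow is required, and everything else is denied by default. It ignores conditions, principals, SCPs, and permission boundaries, so treat it as a teaching sketch rather than AWS's actual algorithm. The bucket name and helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_allowed(policies, action, resource):
    """Simplified evaluation: explicit Deny > Allow > implicit deny."""
    allowed = False
    for policy in policies:
        for statement in policy["Statement"]:
            action_match = any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
            resource_match = any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"]))
            if not (action_match and resource_match):
                continue
            if statement["Effect"] == "Deny":
                return False      # an explicit Deny always wins
            allowed = True        # remember the Allow, keep looking for a Deny
    return allowed                # no matching Allow means implicit deny


read_only_reports = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::example-reports", "arn:aws:s3:::example-reports/*"]},
        {"Effect": "Deny",
         "Action": "s3:*",
         "Resource": "arn:aws:s3:::example-reports/secret/*"},
    ],
}

bucket = "arn:aws:s3:::example-reports"
print(is_allowed([read_only_reports], "s3:GetObject", f"{bucket}/q1.csv"))
print(is_allowed([read_only_reports], "s3:GetObject", f"{bucket}/secret/keys.txt"))
print(is_allowed([read_only_reports], "s3:PutObject", f"{bucket}/q1.csv"))
```

The third call is denied even though no statement mentions `s3:PutObject`, which is least privilege in practice: anything not explicitly allowed is refused.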

Q47. What is data encryption in cloud computing?

At rest — data encrypted when stored. AWS uses AES-256 via KMS. S3, EBS, RDS support server-side encryption.
In transit — data encrypted while moving between systems using TLS/SSL (HTTPS). Enforced via ACM certificates.
Client-side — data encrypted before sending to cloud. Customer controls keys.

S3 server-side encryption options: SSE-S3 (keys managed by S3), SSE-KMS (keys held in AWS KMS, either AWS-managed or customer-managed), SSE-C (customer-provided keys).

Q48. What is the shared responsibility model?

Model	Provider secures	Customer secures
IaaS	Hardware, data centers, hypervisor, network	OS patching, app security, data encryption, IAM, firewall config
PaaS	Everything in IaaS + OS, runtime, middleware	App code, data, user access
SaaS	Everything except customer data	Data governance, user access management

Q49. What is cloud workload protection?

Cloud Workload Protection Platform (CWPP) secures workloads at runtime — VMs, containers, serverless functions. It monitors for malicious behaviour, enforces policies, and provides vulnerability scanning.

Examples: Prisma Cloud, Sysdig Secure, Aqua Security, AWS Inspector (vulnerability scanning).

Q50. What is a Cloud Access Security Broker (CASB)?

A CASB sits between users and cloud services to enforce security policies — visibility, compliance, data security, and threat protection for SaaS applications.

Use cases: prevent sensitive data from being uploaded to personal Dropbox, enforce MFA for Salesforce, audit shadow IT usage. Examples: Netskope, Microsoft Defender for Cloud Apps, Zscaler CASB.

Q51. What is Cloud Security Posture Management (CSPM)?

CSPM tools continuously scan cloud environments for misconfigurations and compliance violations. They detect open S3 buckets, public RDS instances, overly permissive IAM policies, and unencrypted storage.

Examples: Wiz, Orca Security, Prisma Cloud, AWS Security Hub, Microsoft Defender for Cloud.

Q52. What is cloud backup and recovery?

Cloud backup copies data to cloud storage to protect against loss. Cloud recovery restores that data after a failure.

Best practices: 3-2-1 rule (3 copies, 2 media types, 1 offsite), automated scheduled backups, test recovery regularly, use point-in-time recovery for databases (RDS PITR), use versioning on S3.

Q53. What are the best practices for securing API endpoints?

Authentication — require API keys, JWT tokens, or OAuth2
Authorisation — enforce least-privilege RBAC
Rate limiting — prevent brute force and DDoS
Input validation — reject malformed requests
HTTPS only — never allow plain HTTP
WAF — protect against OWASP Top 10 (SQL injection, XSS)
API Gateway — centralise auth, rate limiting, and logging
Secrets — never expose API keys in logs or error messages
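
Rate limiting is often implemented with a token bucket. The sketch below is one possible version, not a specific gateway's implementation; the class name, parameters, and the injectable clock (used here so the demo is deterministic) are choices made for this example.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False          # caller would typically respond with HTTP 429


fake_now = [0.0]              # fake clock so the demo does not need to sleep
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]   # third request exceeds the burst
fake_now[0] = 1.0                            # one second later: one token refilled
after_wait = bucket.allow()
print(burst, after_wait)
```

In production the bucket state usually lives in a shared store keyed by API key or client IP, so that every gateway instance enforces the same limit.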

Q54. What are key security challenges in cloud computing?

Misconfiguration — #1 cause of cloud breaches (open S3 buckets, public RDS)
Insecure APIs — attack surface for credential theft and data exfiltration
Overly permissive IAM — excessive permissions enable lateral movement
Data breaches — sensitive data in unencrypted storage or application logs
Shadow IT — employees using unsanctioned cloud services
Insider threats — malicious or negligent employees
DDoS attacks — volumetric attacks on cloud-exposed services
Supply chain attacks — compromised dependencies or container images

Q55. How do you implement encryption for data at rest and in transit?

At rest (AWS):

S3: enable default encryption with SSE-KMS
RDS: enable encryption at creation (AES-256, KMS-managed key)
EBS: enable EBS encryption by default at account level

In transit:

Enforce HTTPS via load balancer listeners (redirect HTTP to HTTPS)
Use ACM (AWS Certificate Manager) for free TLS certificates
Enable TLS for RDS connections (e.g., sslmode=require for PostgreSQL, --ssl-mode=REQUIRED for MySQL)
Use VPC endpoints to keep traffic within AWS network

7. Modern Cloud Architecture

Q56. What is multi-tenancy in cloud computing?

Multi-tenancy is an architecture where a single instance of software serves multiple customers (tenants). Each tenant's data is isolated and invisible to others, but they share the same underlying infrastructure.

Cloud providers are inherently multi-tenant. Isolation is achieved through virtualisation (VMs), containers, and logical data separation.

Q57. What are containers and how do they relate to cloud computing?

Containers package application code and all its dependencies (libraries, config) into a portable unit that runs consistently anywhere.

Unlike VMs, containers share the host OS kernel, making them lighter (MBs vs GBs), faster to start (seconds vs minutes), and more portable.

In cloud: containers are the foundation of modern cloud-native applications. Kubernetes orchestrates containers at scale. Every major cloud has a managed Kubernetes service (EKS, AKS, GKE).

Q58. What is the difference between containerization and virtualization?

	Containers	Virtual Machines
OS	Shares host OS kernel	Each has its own OS
Size	MBs	GBs
Startup time	Seconds	Minutes
Isolation	Process-level (namespaces, cgroups)	Full hardware-level
Portability	Runs anywhere Docker runs	Hypervisor-dependent
Overhead	Very low	Higher (full OS per VM)

Q59. What is Kubernetes and how does it work in cloud environments?

Kubernetes (K8s) automates deployment, scaling, and management of containerised applications.

Architecture:

Control Plane: API Server (entry point), etcd (state store), Scheduler (assigns pods to nodes), Controller Manager
Worker Nodes: kubelet (agent), kube-proxy (networking), container runtime (e.g., containerd or CRI-O)

Key objects: Pod (smallest unit), Deployment (manages replicas), Service (stable networking), Ingress (HTTP routing), ConfigMap/Secret (configuration), HPA (auto-scaling).

Cloud managed K8s: AWS EKS, Azure AKS, GCP GKE — provider manages the control plane.
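
The HPA mentioned above scales on a simple ratio: desired replicas = ceil(current replicas × current metric / target metric). A sketch of that core formula follows; the real controller also applies a tolerance band and stabilisation windows, which are omitted here, and the function name and default bounds are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA ratio, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
# 4 pods averaging 30% CPU against a 60% target.
print(desired_replicas(4, current_metric=30, target_metric=60))
```

Because the formula is proportional, the further the metric is from its target, the larger the single scaling step.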

Q60. What are microservices and how do they benefit cloud deployments?

Microservices architecture splits an application into small, independently deployable services, each owning its data and communicating via APIs.

Benefits for cloud:

Scale each service independently (only scale the checkout service on Black Friday)
Deploy services independently (no full-app re-deploy for a small change)
Different tech stacks per service
Fault isolation (one service failing doesn't bring down everything)

Challenges: distributed tracing, service discovery, network latency, data consistency.

Q61. What is serverless computing and what are its use cases?

Serverless — write and deploy code without managing servers. Provider automatically scales and bills per invocation.

Use cases: event-driven processing (S3 → Lambda → resize image), API backends, scheduled jobs, webhooks, stream processing, chatbots, IoT data ingestion.

Limitations: cold starts, max execution time (15 min for Lambda), stateless (state must be external), vendor lock-in.

Q62. What is a cold start in serverless computing?

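A cold start occurs when a serverless function is invoked after being idle: the provider must spin up a new execution environment, load the runtime, and load the function code. This typically adds anywhere from tens of milliseconds to a few seconds of latency, depending on the runtime and package size.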

Mitigation: provisioned concurrency (Lambda keeps warm instances), lightweight runtimes (Node.js and Python typically start faster than Java), smaller package sizes, scheduled warm-up pings.

Q63. What is cloud-native architecture?

Cloud-native applications are designed specifically to exploit cloud capabilities — built as microservices, containerised, dynamically orchestrated, and managed via DevOps/CI-CD.

12-Factor App principles define cloud-native best practices: stateless processes, config in env vars, disposable processes, dev/prod parity, logs as streams.

Cloud-native stack: Docker + Kubernetes + Helm + Istio + Prometheus/Grafana + ArgoCD + Terraform.

Q64. What is immutable infrastructure?

Immutable infrastructure means servers are never modified after deployment. Instead of patching in-place, you replace the entire server with a new image.

Benefits: no configuration drift, consistent environments, easy rollback (re-deploy previous image), simpler troubleshooting. Enabled by: AMIs (AWS), Docker images, Terraform destroy/apply, GitOps.

Q65. What is an API Gateway?

An API Gateway is the single entry point for all client requests to backend microservices. It handles routing, authentication, rate limiting, SSL termination, request transformation, and logging.

Examples: AWS API Gateway, Kong, NGINX, Traefik, Azure API Management. AWS API Gateway integrates natively with Lambda, ECS, and EC2.

Q66. What is a service mesh?

A service mesh is an infrastructure layer for handling service-to-service communication in a microservices architecture. It provides traffic management, mutual TLS, observability, and circuit breaking — without changing application code.

How it works: a sidecar proxy (Envoy) is injected into each pod. All traffic between services goes through proxies. Examples: Istio, Linkerd, AWS App Mesh, Consul Connect.

Q67. What is a cloud message queue?

A message queue decouples services by holding messages until the consumer is ready to process them, enabling asynchronous communication.

Examples: AWS SQS (queue), AWS SNS (pub/sub), RabbitMQ, Apache Kafka (event streaming), Azure Service Bus.

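Use case: order service places order → publishes a message → inventory service and email service each consume it asynchronously from their own SQS queue (fanned out via SNS, since each message in a single SQS queue is delivered to only one consumer).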
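
The decoupling can be illustrated with Python's standard `queue` module standing in for a managed queue. This is an in-process sketch only; a real deployment would use SQS or a similar service, and the worker and message shapes here are hypothetical.

```python
import queue
import threading

orders = queue.Queue()      # in-process stand-in for a managed message queue
processed = []


def inventory_worker():
    """Consumer: pulls messages whenever it is ready."""
    while True:
        order = orders.get()
        if order is None:               # sentinel value: stop the worker
            orders.task_done()
            break
        processed.append(f"reserved stock for order {order['id']}")
        orders.task_done()


worker = threading.Thread(target=inventory_worker)
worker.start()

# Producer: the order service enqueues and moves on; it never calls the consumer.
for order_id in (101, 102, 103):
    orders.put({"id": order_id})

orders.put(None)
worker.join()
print(processed)
```

If the consumer is slow or temporarily down, messages simply wait in the queue, which is the property that makes the producer resilient to downstream failures.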

Q68. What is event-driven architecture?

Event-driven architecture produces, detects, and responds to events. Instead of synchronous request/response, services communicate via events (state changes).

Pattern: Producer emits event → Event bus/broker → Consumer reacts. AWS tools: EventBridge, SNS, SQS, Kinesis, Lambda. Benefits: loose coupling, scalability, real-time processing.
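
The producer, broker, and consumer pattern can be sketched with a tiny in-memory event bus. The `EventBus` class and the event name are invented for this example, and delivery here is synchronous; managed brokers such as EventBridge deliver asynchronously and add retries and filtering.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for an event broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer only knows the event type, not who is listening.
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = EventBus()
log = []
bus.subscribe("order.placed", lambda e: log.append(f"inventory: reserve {e['sku']}"))
bus.subscribe("order.placed", lambda e: log.append(f"email: confirm order {e['id']}"))

bus.publish("order.placed", {"id": 42, "sku": "ABC-1"})
print(log)
```

Adding a third consumer requires only another `subscribe` call, with no change to the producer, which is what loose coupling means in practice.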

Q69. What is the difference between stateful and stateless applications?

Stateless — each request is independent; no session state stored on server. Scale horizontally. Any instance can handle any request. Examples: REST APIs, Lambda functions.
Stateful — server retains session state between requests. Harder to scale. Examples: traditional WebSockets, databases, Kafka consumers.

Cloud-native best practice: design stateless services, store state in managed external services (Redis, DynamoDB, S3).

Q70. What is edge computing and how does it complement cloud computing?

Edge computing processes data closer to where it is generated (IoT devices, end users) rather than sending everything to a central cloud data center. Reduces latency, bandwidth usage, and dependency on connectivity.

Complement to cloud: edge handles real-time, low-latency processing; cloud handles heavy compute, storage, and analytics. Examples: AWS Lambda@Edge, Cloudflare Workers, AWS IoT Greengrass, Azure IoT Edge.

8. Major Cloud Platforms

Q71. What are the differences between AWS, Azure, and GCP?

Aspect	AWS	Azure	GCP
Market share	~33% (largest)	~22%	~11%
Strength	Broadest service catalogue, most mature	Enterprise/hybrid (Microsoft integration)	AI/ML, Kubernetes (GKE), data analytics
Identity	IAM	Azure Active Directory (Entra ID)	Cloud IAM
Kubernetes	EKS	AKS	GKE (most advanced managed K8s)
Serverless	Lambda	Azure Functions	Cloud Functions / Cloud Run
Data warehouse	Redshift	Azure Synapse	BigQuery
Best for	General workloads, startups, enterprise	Microsoft-heavy enterprise, Office 365 integration	AI/ML workloads, data analytics

Q72. What is Amazon EC2?

EC2 (Elastic Compute Cloud) provides resizable virtual machines in AWS.

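Key concepts:

Instance types: t3 = burstable general purpose, m5 = general purpose, c5 = compute optimised, r5 = memory optimised, p3 = GPU.
Pricing: On-Demand (no commitment, billed per second or per hour), Reserved (1- or 3-year term, up to 72% off), Spot (spare capacity, up to 90% off, can be interrupted).
AMI = pre-configured OS image. Security Groups = stateful instance-level firewall. Elastic IP = static public IPv4 address.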

Q73. What is Amazon S3 and how does it differ from EBS and EFS?

Aspect	S3	EBS	EFS
Type	Object storage	Block storage	File storage (NFS)
Access	HTTP API, globally	Attach to one EC2 in same AZ	Mount on multiple EC2
Use case	Backups, static websites, media	OS disk, database	Shared files across instances
Durability	99.999999999% (11 nines)	Replicated within AZ	Multi-AZ by default
Pricing	Per GB stored + requests	Per GB provisioned	Per GB used

Q74. What is AWS Lambda?

Lambda is AWS's serverless compute service: it runs code in response to events without provisioning servers. Triggers: S3, API Gateway, DynamoDB Streams, SQS, SNS, EventBridge (formerly CloudWatch Events), Kinesis. Limits: 15-minute maximum execution time, up to 10 GB memory, 1,000 concurrent executions per Region (default, can be raised). Pricing: per request plus per GB-second of compute. The free tier includes 1M requests per month.
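
To make the pricing model concrete, here is a rough estimate of a monthly bill. The default rates in `lambda_monthly_cost` are illustrative assumptions (they vary by Region and architecture and change over time), and the free tier is ignored.

```python
def lambda_monthly_cost(invocations, avg_duration_ms, memory_mb,
                        price_per_million_requests=0.20,
                        price_per_gb_second=0.0000166667):
    """Estimate a monthly bill in USD; the default rates are assumptions."""
    # Compute is billed on duration multiplied by configured memory.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * price_per_million_requests
    compute_cost = gb_seconds * price_per_gb_second
    return round(request_cost + compute_cost, 2)
```

For 5 million invocations at 200 ms and 512 MB, that is 500,000 GB-seconds: about 8.33 for compute plus 1.00 for requests, so `lambda_monthly_cost(5_000_000, 200, 512)` gives 9.33 under these assumed rates.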

Q75. What is AWS Route 53?

Route 53 is AWS's scalable DNS (Domain Name System) web service. It translates domain names to IP addresses, routes internet traffic, and performs health checks.

Routing policies: Simple, Weighted, Latency-based, Geolocation, Failover, Multi-value. Use failover routing for DR — automatically switch to healthy endpoint when primary fails.

Q76. What is Microsoft Azure Blob Storage?

Azure Blob Storage is Azure's object storage — equivalent to AWS S3. Stores unstructured data (images, videos, backups, logs).

Tiers: Hot (frequently accessed), Cool (infrequent, lower cost), Cold (rarely accessed, still online), Archive (offline, cheapest, needs rehydration before reads). Integrated with Azure CDN, Azure Data Factory, and Azure ML.

Q77. What is Azure Active Directory?

Azure AD (now Microsoft Entra ID) is Microsoft's cloud identity platform — SSO, MFA, Conditional Access, B2B/B2C federation, and hybrid identity with on-premises Active Directory.

Critical for enterprise Azure deployments. Every Azure resource access is controlled through Azure AD identities and RBAC roles.

Q78. What is Google Cloud BigQuery?

BigQuery is GCP's serverless, petabyte-scale data warehouse. You query massive datasets with SQL without managing servers. Features: columnar storage, separation of compute/storage, built-in ML (BigQuery ML), streaming ingestion, pay-per-query pricing.

Q79. What is AWS Elastic Beanstalk vs ECS?

Elastic Beanstalk — PaaS. Upload code, Beanstalk handles provisioning (EC2, load balancer, auto-scaling). No container knowledge required. Less control.
ECS (Elastic Container Service) — CaaS. Run Docker containers. More control over infrastructure. Better for microservices.
EKS — managed Kubernetes. Full K8s API. Most complex, most powerful.

Q80. What is an AMI (Amazon Machine Image)?

An AMI is a pre-configured virtual machine image used to launch EC2 instances. Contains: OS, application server, application configuration.

Types: Amazon-provided (Amazon Linux, Ubuntu), AWS Marketplace (pre-installed software), community AMIs, custom AMIs (created from your instances). Use custom AMIs for faster scaling — bake all dependencies into the AMI so new instances launch ready to serve.

9. Advanced Cloud Architecture

Q81. What is high availability and how is it achieved in cloud?

High Availability (HA) minimises downtime by eliminating single points of failure.

HA design pattern on AWS:

Deploy across 3+ Availability Zones
Application Load Balancer distributes traffic
Auto Scaling Group replaces unhealthy instances
Multi-AZ RDS with automatic failover
ElastiCache (Redis) with Multi-AZ for session storage
S3 for static assets (99.99% availability)
Route 53 health checks for DNS failover

Q82. What is auto-healing in cloud systems?

Auto-healing automatically detects and replaces failed components without human intervention.

In AWS: Auto Scaling Group monitors EC2 health. If an instance fails health check → ASG terminates it → launches a replacement automatically. In Kubernetes: liveness probes detect unhealthy pods → K8s restarts them. Deployments ensure desired replica count is maintained.
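
Both mechanisms follow the same idea: a control loop compares actual state with desired state and corrects the difference. A minimal sketch of one pass of such a loop, with hypothetical `reconcile` and `make_launcher` helpers standing in for the Auto Scaling Group or the Kubernetes controller:

```python
def reconcile(instances, desired_count, launch):
    """One pass of a control loop: drop unhealthy instances, top up to desired."""
    healthy = [i for i in instances if i["healthy"]]
    while len(healthy) < desired_count:
        healthy.append(launch())
    return healthy


def make_launcher():
    """Return a function that 'launches' a new healthy instance each call."""
    counter = {"n": 0}

    def launch():
        counter["n"] += 1
        return {"id": f"new-{counter['n']}", "healthy": True}

    return launch


fleet = [
    {"id": "a", "healthy": True},
    {"id": "b", "healthy": False},   # failed its health check
    {"id": "c", "healthy": True},
]
healed = reconcile(fleet, desired_count=3, launch=make_launcher())
```

After the pass, `healed` contains instances `a`, `c`, and one replacement, so the fleet is back at the desired size without anyone intervening.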

Q83. How would you design a multi-region disaster recovery solution?

Active-Passive DR design:

Primary region — full production stack running
DR region — warm standby (scaled-down running copy)
Data replication — S3 Cross-Region Replication, RDS Read Replica in DR region, DynamoDB Global Tables
DNS failover — Route 53 health checks detect primary failure → switch DNS to DR region (RTO: minutes)
Regular DR drills — test failover quarterly

Key metrics: RTO (how long to recover) and RPO (how much data loss is acceptable).
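
A back-of-the-envelope way to check a design against its RTO and RPO targets. The breakdown below (detection time, DNS TTL, standby scale-up) and all the numbers are assumptions for illustration; real failovers also depend on resolver caching, data promotion steps, and human decision time.

```python
def estimate_rto_seconds(health_check_interval, failure_threshold,
                         dns_ttl, scale_up_time):
    """Rough recovery time for DNS failover to a warm standby."""
    # Failure is declared only after several consecutive failed checks.
    detection = health_check_interval * failure_threshold
    return detection + dns_ttl + scale_up_time


def meets_objectives(rto, rpo, rto_target, rpo_target):
    """True when both the estimated RTO and RPO are within target."""
    return rto <= rto_target and rpo <= rpo_target


# Hypothetical inputs: 30 s checks, 3 failures to trip, 60 s TTL, 5 min scale-up.
rto = estimate_rto_seconds(30, 3, 60, 300)
# RPO is approximated by the replication lag, assumed here to be 5 seconds.
ok = meets_objectives(rto, rpo=5, rto_target=900, rpo_target=60)
```

With these inputs the estimated RTO is 450 seconds (90 + 60 + 300), which fits a 15-minute target.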

Q84. What is disaster recovery in cloud computing?

DR is the process of restoring systems after catastrophic failure. DR strategies by RTO/RPO/cost:

Backup & Restore — cheapest, highest RTO/RPO (hours)
Pilot Light — minimal core services running in DR region, scale up on failure (minutes)
Warm Standby — scaled-down fully functional copy (minutes)
Multi-Site Active/Active — full capacity in multiple regions simultaneously (near-zero RTO/RPO, most expensive)

Q85. How would you design a scalable microservices architecture?

Key design decisions:

Service boundaries — define by business capability (Order Service, Payment Service, Inventory Service)
Communication — synchronous (REST/gRPC) for real-time, async (SQS/Kafka) for decoupled workflows
Data isolation — each service owns its own database (database-per-service pattern)
API Gateway — single entry point, handles auth and routing
Service discovery — Kubernetes Services or AWS Cloud Map
Circuit breaker — prevent cascade failures (Resilience4j, Istio)
Distributed tracing — track requests across services (Jaeger, AWS X-Ray, Datadog APM)
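
The circuit breaker item can be sketched in a few lines. This is a simplified illustration, not how Resilience4j or Istio implement it: `CircuitBreaker` and its parameters are hypothetical, and production libraries add sliding windows, metrics, and per-exception rules.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a downstream call as `breaker.call(fetch_inventory, sku)` (where `fetch_inventory` is whatever client function you already have) means that once the threshold is hit, callers get an immediate error instead of piling up on a failing service.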

Q86. What is the role of a firewall in cloud computing?

Cloud firewalls control inbound/outbound traffic to cloud resources.

Layers:

Security Groups — stateful, instance-level (EC2, RDS). Allow rules only.
Network ACLs — stateless, subnet-level. Allow and deny rules.
WAF (Web Application Firewall) — Layer 7, protects against OWASP Top 10 (SQL injection, XSS). AWS WAF, Cloudflare WAF.
AWS Network Firewall — managed stateful firewall for VPC perimeter.
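
The difference between the first two layers is easier to see in code. The sketch below models only inbound rule matching under simplifying assumptions: rules are plain dicts with hypothetical keys (`number`, `cidr`, `port`, `action`), protocols and port ranges are ignored, and the stateful handling of return traffic is not modelled.

```python
import ipaddress


def nacl_allows(rules, source_ip, port):
    """Network ACL: evaluate rules in ascending number order, first match wins."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        in_cidr = ip in ipaddress.ip_network(rule["cidr"])
        if in_cidr and rule["port"] in (port, "*"):
            return rule["action"] == "allow"
    return False   # implicit deny when nothing matches


def security_group_allows(rules, source_ip, port):
    """Security group: allow rules only; anything unmatched is denied."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(r["cidr"]) and r["port"] == port
               for r in rules)


nacl = [
    {"number": 100, "cidr": "203.0.113.0/24", "port": "*", "action": "deny"},
    {"number": 200, "cidr": "0.0.0.0/0", "port": 443, "action": "allow"},
]
sg = [{"cidr": "10.0.0.0/16", "port": 5432}]
```

Here the NACL blocks `203.0.113.7` on port 443 because the lower-numbered deny rule matches first, while other sources reach 443. The security group has no way to express that deny; it can only omit an allow.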

10. Cloud Optimization and Management

Q87. What is vendor lock-in and how do you handle it?

Vendor lock-in — deep dependency on a single cloud provider's proprietary services makes migration very costly.

Mitigation: use Kubernetes (portable) instead of ECS-only, use Terraform (vs CloudFormation), open-source databases (PostgreSQL vs Aurora), abstraction layers in code, multi-cloud architecture for critical services.

Q88. What strategies do you use to optimise cloud costs?

Right-sizing — downsize over-provisioned instances based on actual usage metrics
Reserved Instances/Savings Plans — commit to a 1- or 3-year term for up to 72% discount
Spot Instances — up to 90% discount for fault-tolerant batch/stateless workloads that can tolerate interruption
Auto Scaling — scale down during off-peak hours (night, weekends)
S3 lifecycle policies — move cold data to Glacier automatically
Delete idle resources — unattached EBS volumes, unused Elastic IPs, orphaned snapshots
Serverless — pay per invocation, zero cost when idle
FinOps tooling — AWS Cost Explorer, Kubecost, CloudHealth, Infracost

Q89. How would you handle updates and patches in cloud infrastructure?

Best practice: immutable infrastructure — replace instead of patch.

Build new AMI/container image with patches applied
Test in staging
Rolling deploy via Auto Scaling (launch new instances, terminate old ones)
Blue/green deployment for zero downtime

For OS patches: AWS Systems Manager Patch Manager, AWS SSM Run Command. For containers: rebuild image from patched base image, push to ECR, roll out via Kubernetes.

Q90. What are the key metrics to monitor in cloud infrastructure?

Infrastructure: CPU utilisation, memory, disk I/O, network in/out, instance health
Application: request rate (RPS), error rate, response latency (p50/p95/p99), queue depth
Database: connections, query latency, replication lag, IOPS
Cost: spend by service/team, reserved instance utilisation, savings plan coverage
Security: failed login attempts, IAM changes, config changes (CloudTrail)

Golden Signals (SRE): Latency, Traffic, Errors, Saturation.
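
Latency is tracked as percentiles rather than an average because a few slow requests can hide behind a healthy-looking mean. A small sketch using the nearest-rank method (one of several percentile definitions; monitoring tools may interpolate differently), with made-up sample latencies:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, including two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

In this sample the median is 14 ms and the mean is about 126 ms, while p95 is 900 ms: the tail is what a slice of users actually experiences, and neither the median nor the mean shows it.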

Q91. What is data governance in cloud and how do you ensure compliance?

Cloud data governance ensures data is managed consistently, securely, and in compliance with regulations.

Key practices: data classification (PII, confidential, public), data lineage tracking, access controls (RBAC on databases), data retention policies, audit logging (CloudTrail, S3 access logs), compliance frameworks (GDPR, HIPAA, PCI-DSS, SOC2).

AWS tools: AWS Config (compliance rules), Amazon Macie (PII detection in S3), Lake Formation (data lake access control).

Q92. What are the challenges of cloud computing?

Security and compliance — shared infrastructure, data sovereignty
Cost management — easy to over-spend, complex billing
Vendor lock-in — proprietary services create migration barriers
Complexity — managing distributed systems at scale
Performance unpredictability — noisy neighbour problem in shared infrastructure
Data transfer costs — egress fees when moving data out of cloud
Skills gap — cloud expertise is scarce and expensive

11. Emerging Trends

Q93. How does AI integrate with cloud platforms?

Every major cloud now offers managed AI/ML services:

AWS: SageMaker (ML platform), Bedrock (LLM APIs — Claude, Llama), Rekognition (vision), Comprehend (NLP)
Azure: Azure OpenAI Service (GPT-4, DALL-E), Azure ML, Cognitive Services
GCP: Vertex AI, Gemini API, BigQuery ML, AutoML

Pattern: use managed AI services via API rather than building from scratch. Cloud provides the GPU infrastructure and model serving infrastructure.

Q94. How might quantum computing impact cloud infrastructure?

Quantum computing uses quantum mechanics (superposition, entanglement) to solve certain problems exponentially faster than classical computers.

Cloud impact:

Current encryption (RSA, ECC) would be broken by quantum computers — cloud providers are preparing post-quantum cryptography
Quantum as-a-service: AWS Braket, Azure Quantum, GCP (Cirq) — access quantum hardware via cloud APIs
Near-term use cases: drug discovery, financial optimisation, logistics, materials science
Timeline: mainstream quantum advantage is still 5-10+ years away

12. Practical Scenario Questions

Q95. How would you handle a cloud service outage affecting a critical application?

Incident response steps:

Detect — CloudWatch alarms, PagerDuty alert fires
Assess — check the AWS Health Dashboard, identify affected services/regions
Communicate — notify stakeholders, update status page
Mitigate — activate DR plan: failover to backup region, switch DNS via Route 53 health checks
Resolve — wait for provider fix or keep serving from DR
Post-mortem — document timeline, impact, root cause, and preventive measures

Q96. How would you migrate an on-premises application to Azure?

Assess — Azure Migrate to discover on-prem VMs, assess dependencies
Plan — choose migration strategy (rehost to Azure VMs vs replatform to App Service)
Replicate — use Azure Site Recovery (ASR) to replicate VMs to Azure
Test — test failover, validate application works in Azure
Cutover — final sync, update DNS, cut traffic to Azure
Optimise — right-size VMs, set up auto-scaling, enable backups

Q97. How would you secure a cloud-native application against data breaches?

Defence-in-depth:

IAM — least-privilege, MFA on all accounts, no hardcoded credentials (use Secrets Manager)
Network — private subnets, Security Groups, WAF, DDoS protection
Data — encrypt at rest (KMS) and in transit (TLS), S3 bucket policies, block public access by default
Application — SAST/DAST in CI/CD pipeline, dependency scanning (Snyk), container image scanning
Monitoring — GuardDuty (threat detection), CloudTrail (audit), Security Hub (compliance)
Incident response — automated remediation (Lambda + EventBridge), documented runbooks

Q98. How would you design a fault-tolerant cloud architecture?

Fault-tolerant = the system continues operating even when components fail.

Design principles:

Eliminate single points of failure — multi-AZ for every tier
Degrade gracefully — serve cached responses when DB is unavailable
Health checks + auto-healing — ASG replaces unhealthy EC2 automatically
Circuit breakers — stop calling failing downstream services
Retry with exponential backoff — retry transient failures with increasing, jittered delays so clients do not retry in lockstep (thundering herd)
Bulkhead pattern — isolate resources to prevent cascade failures
Chaos engineering — proactively inject failures to test resilience (Netflix Chaos Monkey)
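
As an illustration of the retry principle in the list above, here is a minimal retry helper with exponential backoff and full jitter. `retry_with_backoff` is a hypothetical name; for brevity it retries on any exception, whereas real code should retry only errors known to be transient.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Retry func on exception, sleeping a random time up to a growing cap."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the error
            # The cap doubles each attempt: 0.1 s, 0.2 s, 0.4 s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)               # full jitter spreads retries out
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without waiting; in normal use you would call `retry_with_backoff(lambda: client_call())` with whatever call you need to protect.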

Q99. How would you approach capacity planning for a rapidly growing application?

Baseline metrics — measure current CPU, memory, RPS, and DB connections at peak
Growth projections — estimate traffic growth over 3/6/12 months
Load testing — simulate projected traffic with k6, Gatling, or Locust
Set auto-scaling policies — ensure ASG and HPA can handle 3-5x current peak
Database scaling plan — plan for read replicas, connection pooling (RDS Proxy), or migration to DynamoDB
Reserved capacity — buy reserved instances for baseline, use spot/on-demand for spikes
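
The growth projection and sizing steps above reduce to simple arithmetic. A sketch assuming compound monthly growth and a hypothetical per-instance throughput measured in load tests; the 60% target utilisation is an assumed headroom policy, not a rule.

```python
import math


def projected_peak_rps(current_peak_rps, monthly_growth_rate, months):
    """Compound growth: peak traffic after the given number of months."""
    return current_peak_rps * (1 + monthly_growth_rate) ** months


def instances_needed(peak_rps, rps_per_instance, target_utilisation=0.6):
    """Instances required so each runs at or below the target utilisation."""
    return math.ceil(peak_rps / (rps_per_instance * target_utilisation))


# Hypothetical inputs: 1,000 RPS today, 10% growth per month, 200 RPS per instance.
peak_in_6_months = projected_peak_rps(1000, 0.10, 6)
fleet_size = instances_needed(peak_in_6_months, rps_per_instance=200)
```

With these inputs the peak grows to roughly 1,772 RPS in six months, which needs 15 instances at 60% utilisation. The auto-scaling maximum should sit comfortably above that figure.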

Q100. Describe how you would optimise cloud costs for an organisation.

Immediate wins (week 1):

Delete unattached EBS volumes, unused Elastic IPs, old snapshots
Enable S3 lifecycle rules to move old data to Glacier
Stop non-production instances outside business hours

Short-term (month 1):

Right-size over-provisioned instances using CloudWatch metrics
Buy Reserved Instances for stable production workloads (1-year commitment)
Migrate batch jobs to Spot Instances

Long-term:

Adopt serverless for event-driven workloads
Set up FinOps practice: tagging all resources, team-level cost allocation, monthly reviews
Use Savings Plans for flexible commitment discounts
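
One of the immediate wins, stopping non-production instances outside business hours, is easy to quantify. The sketch below assumes a hypothetical hourly rate and a 12-hour, 5-day schedule; it counts compute only, since stopped instances still pay for their EBS volumes.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings(hourly_rate, instance_count,
                      hours_on_per_day=12, days_on_per_week=5):
    """Weekly compute cost with and without an on/off schedule."""
    always_on = hourly_rate * instance_count * HOURS_PER_WEEK
    scheduled = hourly_rate * instance_count * hours_on_per_day * days_on_per_week
    saving_pct = round(100 * (1 - scheduled / always_on), 1)
    return always_on, scheduled, saving_pct


# Hypothetical fleet: 20 non-production instances at 0.10 per hour.
always_on, scheduled, saving_pct = scheduled_savings(0.10, 20)
```

Running 60 hours a week instead of 168 cuts the compute bill for those instances by about 64%, whatever the hourly rate is.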

13. Cloud Deployment — Additional Topics

Q101. What is cloud-init?

cloud-init is the industry standard tool for cloud instance initialisation. When a new VM (EC2, Azure VM, GCE) is launched, cloud-init runs on first boot to configure the instance — install packages, set hostnames, create users, write config files, run scripts.

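Example (EC2 User Data: a shell script that cloud-init runs on first boot; assumes a Debian/Ubuntu image):

#!/bin/bash
# Stop at the first failing command
set -e
# Refresh the package index, then install nginx without prompts
apt-get update
apt-get install -y nginx
# Start nginx now and on every later boot
systemctl enable --now nginx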

This runs automatically when the instance first starts. Used in auto-scaling groups to bootstrap instances automatically.

Q102. What is a cloud instance?

A cloud instance is a virtual machine running in the cloud. It is a computing environment with a defined amount of CPU, memory, storage, and networking capacity, created from a machine image (AMI on AWS).

Instance lifecycle: Pending → Running → Stopping → Stopped → Terminated. You pay for running instances (and stopped instances still incur EBS costs).

Q103. What is resource pooling in cloud computing?

Resource pooling is a core cloud characteristic where the provider's computing resources (compute, storage, networking) are pooled to serve multiple consumers using a multi-tenant model. Resources are dynamically assigned and reassigned based on consumer demand.

The customer typically has no control over the exact physical location of resources (though you can specify region/AZ). This pooling enables economies of scale that make cloud more cost-effective than dedicated hardware.

Q104. What is measured service in cloud computing?

Measured service means cloud resource usage is monitored, controlled, and reported — enabling pay-per-use billing. You pay only for what you consume (compute hours, GB stored, API calls, GB transferred).

Examples: AWS bills EC2 by running time (per second for most Linux instances), S3 per GB stored, and Lambda per million requests plus compute time. This metering provides transparency and enables cost optimisation.

Q105. What is cloud performance monitoring?

Cloud performance monitoring tracks the performance of cloud infrastructure and applications to ensure they meet SLAs and user expectations.

Key metrics monitored:

Infrastructure: CPU, memory, disk I/O, network throughput
Application: response time, error rate, request rate (RPS)
Database: query latency, connections, slow queries
User experience: page load time, Apdex score

Tools: AWS CloudWatch, Datadog, New Relic, Dynatrace, Grafana + Prometheus.

Q106. What is cloud log management?

Cloud log management involves collecting, storing, searching, and analysing log data from cloud resources — application logs, access logs, audit logs, infrastructure logs.

Best practices: centralise logs in one service (CloudWatch Logs, ELK Stack, Datadog), set retention policies, enable structured logging (JSON), set up alerts on error patterns.

AWS: CloudWatch Logs + Logs Insights for querying. Ship logs from EC2 via CloudWatch Agent, from Lambda automatically.
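
Structured logging means emitting each log entry as a machine-parseable record so that a log service can filter on fields instead of matching free text. A minimal sketch with Python's standard `logging` module; the `JsonFormatter` class, the `orders` logger name, and the `request_id` field are hypothetical, and a timestamp is left out only to keep the output deterministic.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        if hasattr(record, "request_id"):
            entry["request_id"] = record.request_id
        return json.dumps(entry)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()   # stands in for stdout, which log agents typically collect
log = build_logger(buffer)
log.error("payment failed", extra={"request_id": "req-123"})
```

Each line written to the stream is a JSON object with `level`, `logger`, `message`, and `request_id` keys, so an alert or query can match on `level` and `request_id` directly.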

Q107. What is cloud configuration management?

Cloud configuration management ensures cloud infrastructure is consistently configured according to defined standards — preventing configuration drift (where servers gradually diverge from their desired state).

Tools: Ansible (agentless, YAML playbooks), Chef, Puppet, AWS Systems Manager State Manager (enforce configuration on EC2 fleets), AWS Config (detect and alert on non-compliant resource configurations).

Q108. What is a cloud management platform?

A Cloud Management Platform (CMP) is a suite of tools for managing multi-cloud environments from a single interface — provisioning, governance, cost management, compliance, and automation.

Examples: CloudHealth (VMware), Morpheus, Apptio Cloudability, AWS Management Console (single cloud), Azure Arc (hybrid/multi-cloud management).

Q109. What is cloud compliance?

Cloud compliance ensures cloud deployments meet regulatory and industry standards — GDPR (EU data privacy), HIPAA (healthcare), PCI-DSS (payment card), SOC2 (security controls), ISO 27001.

How to achieve it: use compliance frameworks, enable audit logging (CloudTrail), use CSPM tools to detect violations, encrypt sensitive data, implement IAM least privilege, maintain documentation for auditors.

AWS: AWS Artifact provides compliance reports. AWS Config has pre-built conformance packs for PCI-DSS, HIPAA, CIS benchmarks.

Q110. What is cloud arbitrage?

Cloud arbitrage is the strategy of dynamically shifting workloads between cloud providers or regions to take advantage of price differences, performance advantages, or resource availability.

Example: run batch jobs on whichever cloud has the lowest spot instance prices at that moment. Requires multi-cloud tooling (Terraform, Crossplane) and application portability (Kubernetes).

14. Cloud Migration — Additional Topics

Q111. What is cloud repatriation?

Cloud repatriation (reverse migration) is moving workloads BACK from public cloud to on-premises or private cloud — the opposite of cloud migration.

Reasons: unexpected high cloud costs for predictable workloads, data sovereignty requirements, latency issues, better ROI with owned hardware at scale.

Example: Dropbox famously moved most of its storage infrastructure from AWS to its own data centers, reporting savings of roughly $75M over 2 years. This works when workloads are predictable and stable at scale.

Q112. What is cloud data integration?

Cloud data integration connects data from multiple sources (on-premises databases, SaaS applications, cloud storage, APIs) into a unified view for analytics or operational use.

Approaches: ETL (Extract, Transform, Load), ELT (Extract, Load, Transform — load raw first, transform in cloud), CDC (Change Data Capture — stream database changes in real time).

Tools: AWS Glue, Azure Data Factory, Fivetran, Airbyte, Apache Kafka (real-time), dbt (transformation).

Q113. What is a cloud ETL service?

ETL (Extract, Transform, Load) services extract data from sources, transform it (clean, filter, aggregate), and load it into a destination (data warehouse, data lake).

Cloud-managed ETL services:

AWS Glue — serverless ETL with auto-generated code
Azure Data Factory — visual pipeline designer
GCP Cloud Dataflow — Apache Beam-based stream and batch processing
Fivetran, Airbyte — no-code connectors for SaaS data sources
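
The three ETL stages can be shown end to end with the standard library. This is a toy sketch, not how Glue or Data Factory work internally: the CSV content, the `orders` table, and the function names are invented for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

RAW_CSV = """order_id,amount,country
1,19.99,de
2,,fr
3,5.00,DE
"""


def extract(text):
    """Extract: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, cast types, normalise country codes."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["country"].upper())
        for r in rows if r["amount"]
    ]


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total_by_country = conn.execute(
    "SELECT country, ROUND(SUM(amount), 2) FROM orders GROUP BY country"
).fetchall()
```

The row with a missing amount is dropped during the transform step, and the two German orders are aggregated under a single normalised `DE` code. In an ELT design the raw rows would be loaded first and the same cleaning would run as SQL inside the warehouse.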

15. Cloud Security — Additional Topics

Q114. What are the key components of AWS IAM?

Component	Description
Users	Individual identities with long-term credentials (username/password, access keys)
Groups	Collection of IAM users; assign policies to groups, not individual users
Roles	Temporary identities assumed by services, applications, or users (no permanent credentials)
Policies	JSON documents defining what actions are allowed/denied on which resources
Permission Boundaries	Limit the maximum permissions a user or role can have
Identity Providers	Federate external identity (Okta, Google, Active Directory) with AWS

Best practice: use roles (not users) for EC2/Lambda, enable MFA on root account, never use root for daily tasks, audit permissions with IAM Access Analyzer.
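
To show how policy statements combine, the sketch below evaluates a request against a policy document using the core precedence rule: an explicit Deny always wins, an Allow is needed to grant access, and everything else is implicitly denied. It is a deliberate simplification: `is_allowed` is a hypothetical helper, wildcard matching uses `fnmatch`, the bucket name is made up, and conditions, principals, permission boundaries, and SCPs are ignored.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches any pattern; a single string is treated as a list."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def is_allowed(policy, action, resource):
    """Explicit Deny wins, then Allow, otherwise implicit deny."""
    decision = False
    for stmt in policy["Statement"]:
        if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
            if stmt["Effect"] == "Deny":
                return False
            decision = True
    return decision


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::example-reports",
                      "arn:aws:s3:::example-reports/*"]},
        {"Effect": "Deny",
         "Action": "s3:*",
         "Resource": "arn:aws:s3:::example-reports/secret/*"},
    ],
}
```

Reading an ordinary report is allowed, reading anything under `secret/` is denied even though the Allow statement also matches, and `s3:PutObject` is denied because no statement grants it. That last case is least privilege in practice.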

Q115. What is a SIEM in cloud?

SIEM (Security Information and Event Management) collects, correlates, and analyses security events from across cloud infrastructure in real time to detect threats and support incident response.

Cloud SIEM capabilities: log ingestion from all cloud services, correlation rules (e.g., "failed login + privilege escalation = alert"), threat intelligence integration, automated response playbooks.

Examples: Microsoft Sentinel (Azure-native), Splunk, Sumo Logic, AWS Security Lake + third-party SIEM, IBM QRadar.

Q116. What is cloud forensics?

Cloud forensics is the application of digital forensics techniques to cloud environments — collecting, preserving, and analysing evidence after a security incident.

Challenges unique to cloud: shared infrastructure (can't image physical servers), ephemeral resources (Lambda functions, containers disappear), multi-jurisdiction (data across countries), provider access limitations.

Techniques: capture CloudTrail logs, VPC Flow Logs, memory dumps from EC2, container runtime forensics (Sysdig Falco). Preserve evidence before terminating compromised instances.

Q117. What are the security best practices for Amazon EC2?

Use IAM roles — not access keys embedded in EC2 instances
Security Groups — least-privilege inbound rules (no 0.0.0.0/0 for SSH)
Use SSH keys — disable password authentication; prefer SSM Session Manager (no inbound port 22 needed) or EC2 Instance Connect over long-lived keys and a publicly open port 22
Patch regularly — use SSM Patch Manager for automated OS patching
Enable IMDSv2 — prevents SSRF attacks against the metadata service
Private subnets — keep application servers in private subnets, expose only load balancer publicly
Enable CloudWatch monitoring — detect unusual CPU/network behaviour
EBS encryption — enable by default for all new volumes

Q118. What are the risks of shadow IT in cloud?

Shadow IT = employees using cloud services (S3 buckets, SaaS tools) without IT approval or knowledge.

Risks: sensitive data in uncontrolled storage, no encryption or access controls, compliance violations, data breaches, no visibility for security team.

Mitigation: use a CASB to discover and control cloud app usage, set up AWS Organizations SCPs to restrict service usage, educate employees, provide approved self-service alternatives.

Q119. How do you conduct a security audit for cloud infrastructure?

IAM audit — review all users, roles, policies. Use IAM Access Analyzer, remove unused credentials (Credential Report)
Network audit — check Security Group rules for overly permissive access (0.0.0.0/0)
Storage audit — check for public S3 buckets, unencrypted volumes (AWS Trusted Advisor)
Logging audit — verify CloudTrail enabled in all regions, VPC Flow Logs enabled, S3 access logging enabled
Compliance check — run AWS Config conformance packs against CIS benchmark
Penetration testing — authorised testing of cloud environment
CSPM scan — Wiz/Orca/Security Hub automated findings

Q120. How do you ensure GDPR compliance in the cloud?

Data residency — store EU citizen data in EU regions only (use AWS EU regions)
Data minimisation — only collect and retain what is necessary
Encryption — encrypt personal data at rest and in transit
Access controls — role-based access, audit logging for data access
Data subject rights — ability to export or delete user data on request
Breach notification — notify the supervisory authority within 72 hours of becoming aware of a breach; detect breaches quickly (GuardDuty + SIEM)
Data Processing Agreements — sign DPAs with cloud providers (AWS has a standard DPA)
Right to be forgotten — implement data deletion workflows

16. Modern Architecture — Additional Topics

Q121. What is a container orchestration platform?

A container orchestration platform automates the deployment, scaling, networking, and management of containers across a cluster of machines.

Key capabilities: scheduling (place containers on appropriate nodes), service discovery, load balancing, rolling updates, self-healing (restart failed containers), secrets management.

Kubernetes is the dominant standard. Cloud-managed options: EKS (AWS), AKS (Azure), GKE (GCP). Alternatives: HashiCorp Nomad, Docker Swarm (legacy).

Q122. What is a cloud function?

A cloud function (FaaS — Function as a Service) is a small, single-purpose piece of code that runs in response to an event. The cloud provider manages the infrastructure completely.

Properties: stateless, event-triggered, ephemeral, auto-scaling, pay-per-execution.

Examples: AWS Lambda, Azure Functions, GCP Cloud Functions. Each function handles one specific task — resize image, send email, process payment event, validate form input.

Q123. What are the challenges of serverless architecture?

Cold starts — latency on first invocation after idle period
Execution limits — max 15 minutes (Lambda), not suitable for long-running tasks
Stateless — state must be stored externally (DynamoDB, S3, Redis)
Debugging complexity — harder to reproduce issues locally; distributed tracing needed (X-Ray)
Vendor lock-in — Lambda-specific event schemas tie you to AWS
Local development — testing event-driven functions locally is harder (use SAM, Serverless Framework)
Monitoring — traditional agent-based APM is a poor fit for short-lived functions; specialised tools help (Lumigo, Epsagon)
Cost at high scale — at very high throughput, containers can be cheaper

Q124. What is DevOps and how does it relate to cloud computing?

DevOps is a cultural and technical movement combining development (Dev) and operations (Ops) to shorten the software delivery cycle through automation, collaboration, and continuous improvement.

DevOps + Cloud synergy:

Cloud provides on-demand infrastructure (no waiting for hardware)
IaC (Terraform, CDK) makes infrastructure version-controlled like code
Managed services (RDS, EKS, Lambda) reduce operational burden
Cloud CI/CD services (CodePipeline, GitHub Actions) enable fast automated delivery
Cloud monitoring (CloudWatch, Datadog) provides feedback loop

Cloud is the natural environment for DevOps practices.

Q125. What is the role of DevOps in modern cloud environments?

DevOps engineers in cloud are responsible for:

Designing and maintaining CI/CD pipelines
Writing and maintaining IaC (Terraform, CloudFormation)
Managing Kubernetes clusters and container platforms
Setting up monitoring, alerting, and observability
Implementing cloud security controls
Enabling developer self-service through internal developer platforms
Incident management and on-call rotation
Optimising cloud costs

Q126. What are APIs in cloud computing?

APIs (Application Programming Interfaces) are the primary way to interact with cloud services. Every cloud provider exposes APIs to provision resources, manage services, and query data programmatically.

Types in cloud:

REST APIs — most common (AWS, GCP, Azure all use REST)
gRPC — binary protocol, faster for microservice-to-service communication
GraphQL — flexible query language (used in some SaaS products)
Event APIs — webhooks, event streams (SNS, EventBridge)

AWS SDK, Azure SDK, GCP client libraries wrap the underlying REST APIs for developers.

Q127. What is a webhook in cloud applications?

A webhook is an HTTP callback — instead of polling an API repeatedly, a server sends an HTTP POST request to a specified URL when an event occurs.

Example: GitHub sends a webhook to your CI/CD system when code is pushed. Stripe sends a webhook when a payment succeeds. AWS SNS delivers notifications to HTTP endpoints.

Cloud pattern: API Gateway endpoint → Lambda → process webhook → update database.
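
A sketch of the Lambda step in that pattern, written as a handler for an API Gateway proxy event (the request body arrives as a string in `event["body"]`). The payload fields `id` and `type` are hypothetical, and a dict stands in for the database. A production handler would also verify the sender's signature before trusting the payload.

```python
import json

processed_events = {}   # stand-in for a database table keyed by event id


def handler(event, context=None):
    """Lambda-style handler for an API Gateway proxy event carrying a webhook."""
    try:
        payload = json.loads(event.get("body") or "")
        event_id = payload["id"]
    except (ValueError, KeyError, TypeError):
        return {"statusCode": 400,
                "body": json.dumps({"error": "invalid payload"})}

    # Senders often retry, so handling the same event twice must be harmless.
    if event_id not in processed_events:
        processed_events[event_id] = payload.get("type", "unknown")
    return {"statusCode": 200, "body": json.dumps({"received": event_id})}
```

Keying the write on the event ID makes the handler idempotent: a redelivered webhook returns 200 again without creating a second record.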

Q128. What is the difference between synchronous and asynchronous processing in cloud?

Aspect	Synchronous	Asynchronous
Model	Request waits for response	Request submitted, response comes later
Coupling	Tight — caller blocked until done	Loose — caller continues immediately
Failure handling	Immediate error if downstream fails	Messages queued; retry on failure
Scalability	Limited by slowest service	Queue absorbs bursts
Use case	Real-time API (user login, search)	Order processing, email sending, batch jobs
Cloud tools	API Gateway + Lambda (sync)	SQS + Lambda, SNS, EventBridge

Q129. What are cloud-specific design patterns?

Pattern	Description
Ambassador	Proxy for outbound calls — adds retry, circuit breaker, logging
Circuit Breaker	Stop calling failing service; fail fast to prevent cascade failure
CQRS	Separate read and write models for scalability
Event Sourcing	Store state changes as events; rebuild state by replaying events
Saga	Manage distributed transactions across microservices
Strangler Fig	Gradually replace legacy monolith with microservices
Bulkhead	Isolate resource pools to prevent one failure from consuming all resources
Retry with backoff	Retry transient failures with exponential delay to avoid thundering herd
Sidecar	Attach helper containers to main app container (logging, service mesh proxy)
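
Of the patterns in the table, Event Sourcing is perhaps the least intuitive, so here is a minimal sketch: state is never stored directly, only derived by replaying the event log. The account example, event names, and functions are hypothetical.

```python
def apply_event(balance, event):
    """Pure function: current state plus one event gives the next state."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance   # ignore unknown event types


def replay(events, initial=0):
    """Rebuild the current state by folding over the event log in order."""
    balance = initial
    for event in events:
        balance = apply_event(balance, event)
    return balance


event_log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]
```

`replay(event_log)` yields the current balance of 75, and replaying only a prefix such as `event_log[:2]` reconstructs the state as it was at that point (70), which is what makes auditing and debugging straightforward with this pattern.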

17. Azure & GCP — Additional Services

Q130. What is Azure App Service?

Azure App Service is a fully managed PaaS for hosting web apps, REST APIs, and mobile backends. It supports .NET, Java, Python, Node.js, PHP, Ruby. You deploy code; Azure manages the OS, patching, scaling, and load balancing.

Features: auto-scaling, custom domains with free SSL, deployment slots (blue/green), GitHub/Azure DevOps CI/CD integration, VNet integration for private backend access.

Equivalent: AWS Elastic Beanstalk, GCP App Engine.

Q131. How does Google Cloud Storage compare to AWS S3?

Feature	AWS S3	Google Cloud Storage (GCS)
Access control	Bucket policies, ACLs, IAM	Uniform bucket-level access (recommended), IAM, ACLs
Storage classes	Standard, IA, Glacier, Glacier Instant	Standard, Nearline, Coldline, Archive
Strong consistency	Default (since Dec 2020)	Always strongly consistent
Multipart upload	Yes	Resumable uploads
Lifecycle rules	Yes	Yes
Global namespace	Yes	Yes

Both are highly durable (11 nines), scalable object stores, and both now provide strong read-after-write consistency.

Q132. How does GCP support machine learning workloads?

GCP has strong ML/AI capabilities:

Vertex AI — managed ML platform (AutoML, custom training, model deployment, MLOps)
Tensor Processing Units (TPUs) — custom Google hardware for training large ML models; can be faster and more cost-effective than GPUs for suitable TensorFlow workloads
BigQuery ML — run ML models directly in SQL on BigQuery data
Gemini API — access Google's Gemini large language models
Pre-trained APIs — Vision AI, Natural Language AI, Speech-to-Text, Translation API

GCP is preferred for ML workloads involving TensorFlow and large-scale data in BigQuery.

Q133. What is a cloud resource group?

A resource group (Azure) is a logical container that holds related Azure resources for an application — VMs, databases, storage, networking — managed as a single unit.

Benefits: deploy/delete all resources together, apply RBAC and policies at group level, unified billing view, ARM templates target resource groups.

AWS equivalent: Tags (no formal grouping, but you tag resources with project/team) or AWS Resource Groups (group by tags or CloudFormation stack).

Q134. What is a cloud marketplace?

A cloud marketplace is an online store where you can find, test, buy, and deploy pre-configured software and services from third-party vendors, integrated with your cloud account.

Examples: AWS Marketplace (15,000+ listings), Azure Marketplace, GCP Marketplace. Products: pre-configured AMIs, SaaS subscriptions, ML models, security tools, databases, developer tools. Billing is consolidated into your cloud bill.

18. Architecture — Additional Topics

Q135. What is a cloud service broker?

A cloud service broker (CSB) is an intermediary that helps organisations manage multiple cloud services — acting as a negotiator, aggregator, and integrator between cloud consumers and multiple cloud providers.

CSBs handle: vendor negotiation, unified billing, compliance management, identity federation across clouds, service catalogue management.

Examples: IBM Cloud Brokerage, Flexera, Apptio Cloudability.

Q136. What is the role of a systems integrator in cloud?

A systems integrator (SI) in cloud helps organisations design, implement, and manage cloud solutions — bridging legacy on-premises systems with cloud infrastructure.

SIs handle: cloud strategy and architecture, migration projects, custom integration development, managed services, training. Major cloud SIs: Accenture, Deloitte, Infosys, Wipro, Capgemini, TCS.

19. Optimization — Additional Topics

Q137. How do you optimize database performance in cloud?

Connection pooling — use RDS Proxy (AWS) to manage connection pools, reducing overhead
Read replicas — offload read traffic from primary database
Caching — use ElastiCache (Redis) to cache frequent queries; for read-heavy workloads with a high hit rate this can cut DB hits by 80% or more
Query optimisation — add indexes on columns used in WHERE/JOIN, avoid SELECT *
Vertical scaling — upgrade to larger DB instance class for CPU/memory-bound queries
Partitioning — partition large tables by date/range for faster scans
Aurora Serverless — scales database capacity automatically with load
Database-per-service — prevent noisy-neighbour between microservices
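
The caching item above usually means the cache-aside pattern: check the cache first and query the database only on a miss. Below is a minimal sketch, assuming an in-process dictionary as a stand-in for Redis; `TTLCache` and `get_or_load` are hypothetical names for this example, not part of any AWS SDK.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """In-process stand-in for a cache such as Redis (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, cached value)
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]  # cache hit: the database is not touched
        value = loader()  # cache miss or expired entry: run the real query
        self._entries[key] = (self._clock() + self._ttl, value)
        return value
```

A caller would pass the real query as the loader, for example `cache.get_or_load("user:42", lambda: fetch_user(42))`, where `fetch_user` is whatever database call the application already has. The TTL bounds how stale a cached row can get.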

Q138. How would you troubleshoot performance issues in a cloud application?

Systematic approach:

Identify the bottleneck — check CloudWatch metrics (CPU, memory, DB connections, queue depth)
Analyse logs — search for errors, slow queries, timeouts (CloudWatch Logs Insights)
Distributed tracing — trace a slow request across services (AWS X-Ray, Jaeger, Datadog APM)
Load test — reproduce under controlled load to confirm root cause
Database — check slow query log, explain query plans, look for missing indexes
Network — check VPC flow logs for unusual latency, NAT Gateway metrics
Application code — profiling, look for N+1 queries, synchronous blocking calls
Fix and verify — deploy fix, monitor metrics to confirm improvement

20. Practical Scenarios

Q139. Describe a situation where you migrated an application to the cloud.

Example answer structure:

"In a previous project, we migrated a monolithic Java application from on-premises to AWS. The main challenges were:

Database migration — used AWS Database Migration Service (DMS) with continuous replication, keeping replication lag under 1 second during cutover
Zero-downtime deployment — used blue/green deployment: spun up the new AWS environment, tested it, then cut over DNS via Route 53 weighted routing (10% → 50% → 100%)
Stateful sessions — moved session storage from in-memory to ElastiCache Redis so any instance could handle any request
Secrets — moved all hardcoded credentials to AWS Secrets Manager
Monitoring gap — set up CloudWatch dashboards and alarms from day one

Result: 40% infrastructure cost reduction, deployment frequency improved from monthly to daily."

Q140. How would you explain a complex cloud concept to a non-technical stakeholder?

Analogies that work:

Cloud computing = "renting computing power from a massive shared power plant instead of buying your own generator"
Auto-scaling = "a restaurant that automatically adds more chefs during lunch rush and sends them home after"
Load balancer = "a traffic cop directing customers to whichever checkout lane has the shortest queue"
CDN = "storing copies of your website's photos in warehouses close to every customer, so delivery is instant"
Microservices = "instead of one big factory that makes everything, you have specialised shops — each does one thing really well"

Key principle: map cloud concepts to things stakeholders already understand (traffic, warehouses, restaurants, factories).

Q141. What steps would you take to recover data lost due to cloud misconfiguration?

Stop the bleeding — identify the misconfiguration (accidental S3 bucket deletion, wrong lifecycle policy) and fix it immediately to prevent further loss
Check versioning — if S3 versioning was enabled, restore from a previous version
Point-in-time recovery — for RDS, restore to any point within the retention window (1-35 days)
Backup restore — restore from latest automated snapshot
CloudTrail investigation — identify who/what made the change and when
Data reconstruction — if no backup, attempt recovery from application logs or partner systems
Post-mortem — enable versioning, MFA delete, cross-region backup, and CSPM alerts to prevent recurrence

Q142. How would you approach integration of multiple cloud providers?

Architecture principles:

Abstraction layer — use Kubernetes to abstract compute; apps don't know which cloud they're on
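Terraform — define resources for every cloud with one IaC tool and workflow; resource definitions remain provider-specific, so moving a workload to another provider means rewriting its resources, not just changing the provider block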
API-first design — all communication between services via APIs, not provider-specific mechanisms
Data layer — use a cloud-agnostic database (self-managed PostgreSQL or Snowflake) accessible from all clouds
Identity federation — federate IAM identities across providers (OIDC)
Networking — VPN or dedicated connection (Megaport) between cloud VPCs
Observability — centralised logging and monitoring across all clouds (Datadog, Grafana Cloud)

Q143. What would you do if a cloud deployment failed due to an infrastructure issue?

Immediate response — trigger the rollback: deploy the previous stable version (kubectl rollout undo on Kubernetes, or switch DNS back for blue/green)
Assess impact — is there a user-facing outage? Activate incident response
Identify root cause — check deployment logs, CloudWatch, Terraform plan output for what changed
Fix forward or rollback — if fix is quick, fix and redeploy; if complex, roll back first, investigate in staging
Communication — update status page, notify affected stakeholders
Post-deployment gates — add automated smoke tests and health checks to prevent future occurrences
Post-mortem — document root cause and preventive measures

Q144. How would you prioritise tasks when managing multiple cloud projects?

Framework:

Severity-first — production incidents always take priority over planned work
Impact × Urgency — use Eisenhower matrix: urgent+important (do now), important+not urgent (schedule), urgent+not important (delegate), neither (eliminate)
Business value — align priorities with business goals (revenue-impacting work first)
Dependencies — unblock other teams first (if your work blocks 5 other engineers, prioritise it)
Communication — when genuinely overloaded, communicate capacity constraints to stakeholders and get priority decisions from management rather than silently context-switching

Tools: Jira/Linear for tracking, daily standups to surface blockers early.

21. Remaining Original Questions

Q145. What is a cloud management gateway?

A cloud management gateway bridges on-premises management systems with cloud resources — allowing IT teams to manage cloud infrastructure using existing on-premises tools without direct internet exposure.

In Azure, Azure Arc acts as this bridge: onboards on-premises servers and Kubernetes clusters into Azure Resource Manager so they appear as Azure resources, managed from Azure Portal, Azure Policy, and Defender for Cloud. In AWS, AWS Systems Manager manages hybrid EC2 + on-premises servers through a single pane.

Q146. How would you implement a CI/CD pipeline in a cloud environment?

Complete pipeline stages:

Source — developer pushes to Git (GitHub, CodeCommit, GitLab)
Build — compile code, run unit tests, build Docker image
Security scan — SAST (code), SCA (dependencies), container image scan (Trivy, Snyk)
Push — push Docker image to registry (ECR, Google Artifact Registry)
Deploy to staging — Kubernetes rolling update or blue/green
Integration tests — automated smoke/API tests against staging
Deploy to production — canary (5% → 25% → 100%) with automatic rollback on error rate spike

Tools: GitHub Actions, GitLab CI, AWS CodePipeline + CodeBuild, Jenkins, ArgoCD (GitOps for K8s).

Q147. What is cloud repatriation and why do companies do it?

Cloud repatriation = moving workloads back from public cloud to on-premises or private cloud.

Reasons: cost (Dropbox saved $75M repatriating from AWS — owned hardware is cheaper at massive predictable scale), performance (ultra-low-latency requirements), data sovereignty (physical control needed), security (air-gapped systems).

Note: repatriation does not mean cloud failed — it means the workload economics changed. Most companies maintain hybrid.

Q148. What is a cloud-based continuous monitoring solution?

Continuous cloud monitoring = automated, real-time surveillance of security posture, compliance, and performance — 24/7 with no manual checks.

Layers:

Infrastructure metrics: CloudWatch, Datadog
Security: AWS GuardDuty (threat detection), AWS Config (compliance drift), Security Hub (aggregated findings)
Logs: CloudWatch Logs Insights, ELK Stack, Splunk
Application: APM (Datadog, New Relic) with distributed tracing

Goal: detect and alert on issues within minutes, not hours.

22. Advanced DevOps Topics

Q149. What are the differences between Terraform and CloudFormation?

	Terraform	CloudFormation
Provider	HashiCorp (source-available under the BSL since 2023; OpenTofu is the open-source fork)	AWS
Cloud support	Multi-cloud — AWS, Azure, GCP, 1000+ providers	AWS only
Language	HCL (HashiCorp Configuration Language)	YAML or JSON
State	Terraform state file (local or S3 backend)	Managed by AWS automatically
Import existing infra	terraform import	Limited support
Ecosystem	Terraform Registry, modules, providers	Nested stacks, StackSets
Best for	Multi-cloud environments, broad provider ecosystem	AWS-only shops, tight AWS integration

Q150. What is the difference between blue/green and canary deployments?

	Blue/Green	Canary
How	Run two full environments. Switch all traffic at once.	Gradually shift % of traffic to new version (5% → 25% → 100%)
Rollback	Instant — switch traffic back	Fast — reduce canary % to 0
Risk	Low — instant rollback	Lower — only small % of users affected at any time
Cost	2x infrastructure during deployment	Small % overhead only
Best for	Major releases, database changes	Stateless API gradual rollouts
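
The canary column above can be sketched as a loop over traffic steps with an automatic rollback check. This is a simplified model: `run_canary`, the 1% error threshold, and the `error_rate_at` callback (which would query a monitoring system in practice) are assumptions made for this example.

```python
from typing import Callable, Sequence


def run_canary(steps: Sequence[int],
               error_rate_at: Callable[[int], float],
               max_error_rate: float = 0.01) -> int:
    """Walk through traffic percentages for the new version.

    Returns the final percentage served by the new version,
    or 0 if any step breaches the error threshold (rollback).
    """
    for percent in steps:
        # In a real pipeline: shift traffic, wait, then read the observed error rate.
        if error_rate_at(percent) > max_error_rate:
            return 0  # roll back: all traffic returns to the old version
    return steps[-1] if steps else 0
```

With blue/green the same check happens once, after the full switch; with canary it happens at every step, which is why fewer users are exposed to a bad release.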

Q151. What is Helm in Kubernetes?

Helm is the package manager for Kubernetes. It packages K8s resources as charts — reusable, versioned templates with configurable values.

Without Helm: manage 20+ separate YAML files per application. With Helm: one command deploys everything: helm install myapp ./chart --values prod.yaml

Key concepts: Chart (package of K8s templates), Release (deployed chart instance), Values (per-environment configuration), Repository (chart registry like Artifact Hub).

Q152. What is service discovery in microservices?

Service discovery allows services to find and communicate with each other dynamically without hardcoded IP addresses.

Patterns:

Client-side discovery — service queries a registry (Consul, Eureka) and picks an instance
Server-side discovery — client calls a load balancer which routes to healthy instances

In Kubernetes: built-in DNS service discovery. Every Service gets a stable DNS name (payment-service.default.svc.cluster.local). No external registry needed.
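
To make the client-side pattern concrete, here is a toy in-memory registry with round-robin selection. It is only a sketch: `ServiceRegistry` and its methods are hypothetical, and real registries such as Consul or Eureka add health checks, leases, and replication.

```python
import itertools
from typing import Dict, Iterator, List


class ServiceRegistry:
    """Toy in-memory registry used to illustrate client-side discovery."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}
        self._cursors: Dict[str, Iterator[int]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        self._instances[service].remove(address)

    def resolve(self, service: str) -> str:
        """Pick the next registered instance in round-robin order."""
        instances = self._instances.get(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        counter = self._cursors.setdefault(service, itertools.count())
        return instances[next(counter) % len(instances)]
```

The caller asks for a service by name and never hardcodes an address, so instances can come and go without any client configuration change.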

Q153. What is an Ingress controller in Kubernetes?

An Ingress controller manages external HTTP/HTTPS access to Kubernetes services based on routing rules defined in Ingress resources.

Example routing rule: api.example.com/users → users-service, api.example.com/orders → orders-service.

Popular controllers: NGINX Ingress, Traefik, AWS Load Balancer Controller (formerly ALB Ingress Controller; creates an ALB automatically), GKE Ingress. An Ingress controller handles TLS termination, path-based routing, and load balancing in one place.

Q154. What is Horizontal Pod Autoscaler (HPA)?

HPA automatically adjusts the number of pod replicas based on observed metrics (CPU, memory, or custom metrics like RPS).

How it works: HPA queries metrics every 15 seconds. If average CPU across pods exceeds the target, it scales out. When load drops, it scales in down to minReplicas.
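
The scaling decision follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch of that calculation follows; the function name and the min/max defaults are illustrative, the 10% tolerance mirrors the controller's default, and stabilisation windows are not modelled.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 pods averaging 90% CPU against a 60% target give a ratio of 1.5, so the function returns 6 replicas.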

For event-driven scaling (scale on SQS queue depth, Kafka lag): use KEDA (Kubernetes Event-driven Autoscaling), which feeds external event sources into HPA and can also scale workloads to zero.

Q155. What are PersistentVolumes and PersistentVolumeClaims in Kubernetes?

PersistentVolume (PV) — a piece of storage provisioned in the cluster (AWS EBS, NFS). Has a lifecycle independent of any pod.
PersistentVolumeClaim (PVC) — a request for storage by a pod. Pods use PVCs without knowing the underlying storage.
StorageClass — enables dynamic provisioning. When PVC is created, K8s automatically creates a PV (e.g., provisions an EBS volume).

Use for databases (MySQL, PostgreSQL) and any stateful workloads in K8s.

23. FinOps & Cloud Cost Management

Q156. What is FinOps?

FinOps (Cloud Financial Operations) is a practice that brings engineering, finance, and business together to maximise cloud value — making data-driven decisions about cost, speed, and quality.

Three phases: Inform (cost visibility via tagging + dashboards) → Optimise (right-sizing, reserved instances, waste removal) → Operate (cloud cost as ongoing shared responsibility).

Tools: AWS Cost Explorer, Azure Cost Management, GCP Cost Table, Infracost (IaC cost), CloudHealth.

Q157. What is a Reserved Instance vs Savings Plan?

	Reserved Instance (RI)	Savings Plan
Commitment	Specific instance family, region, OS	Hourly spend commitment (e.g. $10/hr for 1 year)
Flexibility	Limited — tied to specific instance type	High — applies to EC2, Lambda, Fargate automatically
Discount	Up to 72% vs On-Demand	Up to 66% (Compute Savings Plans) or up to 72% (EC2 Instance Savings Plans) vs On-Demand
Best for	Predictable workloads with known instance type	Mixed/microservices/Lambda-heavy architectures

Q158. How do you use AWS Cost Explorer for cost optimisation?

AWS Cost Explorer provides: cost trends by service/region/account/tag, spend forecasting, Reserved Instance recommendations, right-sizing recommendations for EC2, cost anomaly detection.

Best practice: tag ALL resources with project, team, and environment from day one. Without consistent tagging, cost attribution is impossible and Cost Explorer becomes much less useful.

24. Cloud Security Advanced Topics

Q159. What is Zero Trust security in cloud?

"Never trust, always verify" — no user or service is trusted by default, even inside the corporate network.

Key principles: verify explicitly (always authenticate based on identity + device + context), least privilege access (minimum permissions, time-limited), assume breach (minimise blast radius).

Cloud implementation: Microsoft Entra ID (formerly Azure AD) Conditional Access, AWS IAM Identity Center, service mesh with mutual TLS (Istio), micro-segmentation with Security Groups.

Q160. What is AWS GuardDuty?

GuardDuty is a managed threat detection service that continuously monitors for malicious activity using ML and threat intelligence.

Analyses: CloudTrail event logs, VPC Flow Logs, DNS logs, EKS audit logs. Detects: cryptocurrency mining on EC2, unusual API calls from malicious IPs, credential exfiltration, compromised instances, unusual S3 data access.

No agents, no performance impact. Enable in all regions with one click. Findings integrate with Security Hub.

Q161. What is AWS CloudTrail?

CloudTrail records all API calls in your AWS account — who, what, when, from where. Primary audit log for AWS.

Captures: management events (create/delete/modify resources), data events (S3 object access, Lambda invocations), Insights events (unusual API activity spikes).

Best practice: enable in all regions, send logs to a dedicated security account S3 bucket with MFA delete, set retention to 7+ years for compliance.

Q162. What is secrets management in cloud?

Never store secrets in code, Git, environment variables in images, EC2 User Data, or config files.

Correct approach: AWS Secrets Manager (stores, rotates, and audits secrets; auto-rotates RDS passwords), AWS Systems Manager Parameter Store (cheaper; suits configuration values, with SecureString parameters for simple secrets), HashiCorp Vault (multi-cloud, fine-grained dynamic secrets).

Applications retrieve secrets at runtime via SDK calls: const secret = await secretsManager.getSecretValue({ SecretId: "prod/db/password" })

25. Cloud-Native Advanced Topics

Q163. What is SLO, SLI, and SLA?

	SLA	SLO	SLI
Full name	Service Level Agreement	Service Level Objective	Service Level Indicator
What	Legal contract with customers	Internal reliability target	The actual measurement metric
Who sets	Legal + business + engineering	Engineering/SRE team	Engineering team
Example	99.9% uptime guaranteed to customers	Internal target: 99.95% uptime	Measured uptime: 99.97%
Miss consequence	Customer credits	Error budget consumed, alerts	Data point for SLO tracking

Error budget = 100% - SLO. 99.9% SLO = 43 min/month error budget. Exhausted budget = freeze new features, focus on reliability.
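
The 43-minute figure comes from simple arithmetic over a 30-day month. A small sketch of the calculation is shown below; the function name and the 30-day default window are assumptions made for this example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        print(f"SLO {slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

A 99.9% SLO allows about 43.2 minutes per 30 days, while 99.99% allows only about 4.3 minutes, which is why each extra nine is so much more expensive to operate.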

Q164. What is chaos engineering?

Chaos engineering intentionally injects failures into production to discover weaknesses before they cause unplanned outages.

Process: define steady state (normal behaviour) → hypothesise it continues under failure → inject fault (kill servers, add latency, block network) → observe result.

Tools: Netflix Chaos Monkey, AWS Fault Injection Service (FIS, formerly Fault Injection Simulator), Gremlin, LitmusChaos (Kubernetes). Netflix runs chaos in production continuously.

Q165. What are the three pillars of observability?

Pillar	What it captures	Tool examples
Metrics	Numerical measurements over time (CPU %, request rate, error rate)	Prometheus, CloudWatch, Datadog
Logs	Timestamped records of events (errors, requests, state changes)	CloudWatch Logs, ELK Stack, Loki, Splunk
Traces	End-to-end journey of a request across distributed services	AWS X-Ray, Jaeger, Datadog APM, Zipkin

OpenTelemetry (OTel) is the vendor-neutral standard for instrumenting all three. Instrument once, send to any backend.

Q166. What is Platform Engineering?

Platform Engineering builds internal developer platforms (IDPs) — self-service infrastructure and tooling that lets development teams deploy, operate, and monitor applications without deep infra knowledge.

Platform teams provide: golden path CI/CD templates, self-service K8s namespaces, secrets management, observability stack, IaC modules, internal developer portals (Backstage, Port).

Goal: reduce developer cognitive load, increase deployment frequency, standardise security and compliance across all teams.

Q167. What is GitOps?

GitOps = Git is the single source of truth for infrastructure and application state. Changes are made via Git commits. An automated operator (ArgoCD, Flux) continuously reconciles actual cluster state with Git.

	Traditional CI/CD	GitOps
Trigger	Pipeline pushes changes to cluster	Operator pulls changes from Git
Source of truth	CI/CD system	Git repository
Audit trail	Pipeline logs	Git commit history
Drift correction	Manual or re-trigger pipeline	Automatic continuous reconciliation
Tools	Jenkins, CodePipeline	ArgoCD, Flux

Q168. What is AWS ECS vs EKS?

	ECS	EKS
Orchestrator	AWS-proprietary	Kubernetes (industry standard)
Learning curve	Lower — simpler to operate	Higher — full K8s knowledge required
Portability	AWS-only	Portable to any K8s environment
Ecosystem	AWS tools only	Massive K8s ecosystem (Helm, ArgoCD, Istio, Prometheus)
Control plane cost	Free	$0.10/hr (~$73/month)
Best for	Simpler AWS container workloads	Complex microservices, multi-cloud, enterprise

26. Data, AI & Streaming in Cloud

Q169. What is a data lake vs data warehouse?

	Data Lake	Data Warehouse
Data type	Raw — structured, semi-structured, unstructured	Structured, processed, cleaned
Schema	Schema-on-read (define when querying)	Schema-on-write (defined before loading)
Storage cost	Very low (S3, GCS)	Higher (Redshift, BigQuery, Snowflake)
Query performance	Slower (scan raw files)	Faster (optimised columnar storage)
Use case	ML training data, raw event storage, data exploration	Business intelligence, dashboards, reporting
Cloud examples	S3 + Athena, GCS + Dataproc	Redshift, BigQuery, Snowflake, Azure Synapse

Modern pattern: Lakehouse (Databricks Delta Lake, BigQuery) combines lake storage cost with warehouse query performance.

Q170. What is AWS Kinesis?

AWS Kinesis is a real-time streaming data platform:

Kinesis Data Streams — ingest and process real-time data (clickstreams, IoT, logs)
Amazon Data Firehose (formerly Kinesis Data Firehose) — delivery to S3, Redshift, OpenSearch automatically (no code needed)
Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) — run Apache Flink applications on streaming data
Kinesis Video Streams — video streaming from IoT devices

Use case: user clicks → Kinesis → Lambda processes in real time → DynamoDB for live data + S3 via Firehose for analytics.

Q171. What is AWS SageMaker?

SageMaker is AWS's end-to-end managed ML platform covering: data preparation (Data Wrangler, Feature Store), model training (managed GPU/CPU clusters, built-in algorithms), experiment tracking, model deployment (real-time endpoints, serverless inference), and MLOps (Pipelines, Model Monitor for drift detection).

It removes the heavy lifting of ML infrastructure — data scientists focus on model quality, not cluster management.

27. Advanced Networking

Q172. What is AWS Direct Connect?

Direct Connect = dedicated private fiber connection between your data center and AWS, bypassing the public internet. Provides: lower consistent latency, higher bandwidth (up to 100 Gbps), reduced egress costs, private connectivity.

When to use: high-volume data migration, real-time latency-sensitive workloads (trading systems), compliance requiring private connectivity, consistent performance for hybrid applications.

Q173. What is AWS Transit Gateway?

Transit Gateway = a hub that connects multiple VPCs and on-premises networks centrally. Without it: each VPC needs a peering connection to every other (N*(N-1)/2 connections — doesn't scale). With it: all VPCs connect to the hub (N connections).

Supports cross-account and cross-region connectivity. Can attach VPNs and Direct Connect. Simplifies hub-and-spoke network architecture drastically.
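
The scaling argument is easy to check numerically. The sketch below compares the two connection counts; the function names are made up for this example.

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections needed so every VPC reaches every other: N*(N-1)/2."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count: int) -> int:
    """With a hub, each VPC needs exactly one attachment: N."""
    return vpc_count


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, full_mesh_peerings(n), transit_gateway_attachments(n))
```

At 10 VPCs a full mesh already needs 45 peering connections versus 10 attachments; at 50 VPCs it is 1,225 versus 50.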

Q174. What is a NAT Gateway and when do you need it?

NAT Gateway allows instances in a private subnet to initiate outbound internet connections (for software updates, external API calls) without accepting inbound internet traffic.

Architecture: NAT Gateway sits in the public subnet with an Elastic IP. Private subnet route table sends 0.0.0.0/0 traffic to the NAT Gateway. Return traffic is allowed because NAT is stateful.

Cost: ~$0.045/hour + $0.045/GB processed. Use VPC Endpoints for AWS services (S3, DynamoDB) to avoid NAT Gateway charges for AWS traffic.
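
A rough monthly estimate shows why high-volume AWS traffic is worth moving to VPC Endpoints. The sketch below uses the approximate prices quoted above and assumes a 730-hour month; actual rates vary by region, and the function name is illustrative.

```python
def nat_gateway_monthly_cost(gb_processed: float,
                             hourly_rate: float = 0.045,
                             per_gb_rate: float = 0.045,
                             hours: float = 730.0) -> float:
    """Estimated monthly cost in USD: fixed hourly charge plus data processing."""
    return hours * hourly_rate + gb_processed * per_gb_rate


if __name__ == "__main__":
    for gb in (0, 1_000, 10_000):
        print(f"{gb} GB -> ${nat_gateway_monthly_cost(gb):.2f}")
```

Under these assumptions an idle NAT Gateway costs about $32.85 per month, and each additional terabyte processed adds about $45.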

Q175. What are VPC Endpoints?

VPC Endpoints connect EC2 instances in private subnets to AWS services (S3, DynamoDB, SQS, etc.) without going through the internet or NAT Gateway — traffic stays within the AWS network.

Types: Gateway Endpoints (S3, DynamoDB only — free, uses route table), Interface Endpoints/PrivateLink (most other AWS services — creates an ENI in your subnet, costs per hour).

Benefits: better security (no internet exposure), lower cost (no NAT Gateway charges), lower latency.

28. Cloud Architecture Reference

Q176. What is the AWS Well-Architected Framework?

Pillar	Focus areas
Operational Excellence	Automate operations, run workloads effectively, improve processes, respond to events
Security	Protect data and systems — IAM, encryption, detection, incident response
Reliability	Recover from failures, scale dynamically, mitigate disruptions automatically
Performance Efficiency	Use right resources for the right tasks at scale, monitor and evolve
Cost Optimisation	Eliminate waste, right-size, use managed services, analyse spend
Sustainability	Minimise environmental impact — right-size, use efficient code, pick green regions

AWS Well-Architected Tool in the console reviews workloads against these pillars and provides improvement recommendations.

Q177. What is the CAP theorem?

In a distributed system, you can only guarantee 2 of: Consistency (all nodes see same data), Availability (every request gets a response), Partition Tolerance (works despite network splits).

Since network partitions are unavoidable:

CP systems (sacrifice availability): HBase, ZooKeeper, MongoDB (strong consistency) — may reject requests during partition
AP systems (sacrifice consistency): Cassandra, DynamoDB (eventually consistent), CouchDB — responds with possibly stale data

Q178. What is RTO vs RPO?

RTO (Recovery Time Objective) — maximum acceptable downtime after a disaster. How quickly must the system be restored?
RPO (Recovery Point Objective) — maximum acceptable data loss measured in time. How much data can we afford to lose?

Lower RTO/RPO = higher cost. Design DR to meet business requirements at minimum cost. E.g., RTO=4 hours, RPO=1 hour → warm standby with hourly DB backups is sufficient.
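
The example above can be expressed as a simple check: worst-case data loss is roughly the time since the last backup, and worst-case downtime is the measured restore time. `DrPlan` and `meets_objectives` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrPlan:
    backup_interval_minutes: float  # worst-case data loss is about one interval
    restore_minutes: float          # measured time to bring the service back


def meets_objectives(plan: DrPlan, rto_minutes: float, rpo_minutes: float) -> bool:
    """True when the plan satisfies both the recovery time and recovery point objectives."""
    return (plan.backup_interval_minutes <= rpo_minutes
            and plan.restore_minutes <= rto_minutes)
```

Hourly backups with a 3-hour restore satisfy RTO = 4 hours and RPO = 1 hour; nightly backups with the same restore time do not, because up to a day of data could be lost.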

Q179. What is infrastructure drift?

Infrastructure drift = actual cloud resource state diverges from IaC-defined desired state, usually due to manual console/CLI changes bypassing Terraform/CloudFormation.

Danger: Terraform may overwrite manual changes on next apply; different environments have inconsistent configs; debugging becomes much harder.

Prevention: terraform plan in CI/CD before every apply; AWS Config drift detection; restrict console access via SCPs; treat infrastructure as immutable — change only through IaC.

Q180. What is AWS Lambda@Edge?

Lambda@Edge runs Lambda functions at CloudFront edge locations worldwide, close to users.

Use cases: A/B testing (route based on cookies at edge), authentication (validate JWT before hitting origin), URL rewriting, image resizing based on device, personalisation based on geolocation.

4 trigger points: Viewer Request (before cache check), Origin Request (cache miss → before origin), Origin Response (after origin), Viewer Response (before sending to user).
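
As an example of the URL rewriting use case, here is a minimal sketch of a handler for the Viewer Request trigger. It relies on the CloudFront event shape (`Records[0].cf.request`); the function name `handler` and the index.html rule are assumptions for illustration.

```python
from typing import Any, Dict


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Viewer Request trigger: append index.html to directory-style URIs."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]
    if uri.endswith("/"):
        request["uri"] = uri + "index.html"
    # Returning the request object tells CloudFront to continue processing it.
    return request
```

Because this runs before the cache check, the rewritten URI is the one CloudFront looks up in its cache and forwards to the origin on a miss.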

Q181. What is Amazon CloudFront?

CloudFront is AWS's CDN with 450+ edge locations globally. Use cases: static website assets (HTML/CSS/JS/images from S3), video streaming, API response caching, DDoS protection at edge, reducing origin server load.

Features: Lambda@Edge for edge compute, origin failover, signed URLs/cookies for private content, real-time logs, TLS with ACM certificates (free).

Q182. What is a multi-region architecture?

Running your application in two or more cloud regions simultaneously for: disaster recovery (survive complete region failure), low latency for global users, compliance (data residency), and 99.999% availability.

Complexity: data replication (DynamoDB Global Tables, RDS Global Database, S3 CRR), global routing (Route 53 latency/geolocation/failover routing), stateless application design, increased operational overhead and cost.

Q183. What is the difference between stateful and stateless firewalls in AWS?

Security Groups (stateful) — track connection state. Allow inbound on port 443 and the return traffic is automatically allowed. Write allow rules only. Applied to EC2, RDS, Lambda.
Network ACLs (stateless) — inspect each packet independently. Must explicitly allow BOTH inbound AND outbound for each connection. Applied to subnets.
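
The difference can be modelled in a few lines. In this simplified sketch a connection is a client address, the client's ephemeral port, and the server port; the class and method names are made up for the example and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from typing import Container, Set


@dataclass(frozen=True)
class Connection:
    client_ip: str
    client_port: int  # ephemeral port chosen by the client
    server_port: int  # port the client connects to, e.g. 443


class StatefulFirewall:
    """Security-group-like: allow rules only; replies to tracked connections pass."""

    def __init__(self, inbound_ports: Container[int]) -> None:
        self._inbound_ports = inbound_ports
        self._tracked: Set[Connection] = set()

    def allow_inbound(self, conn: Connection) -> bool:
        if conn.server_port in self._inbound_ports:
            self._tracked.add(conn)  # remember the connection
            return True
        return False

    def allow_reply(self, conn: Connection) -> bool:
        return conn in self._tracked  # no outbound rule needed


class StatelessFirewall:
    """NACL-like: each packet is checked against rules; nothing is remembered."""

    def __init__(self, inbound_ports: Container[int],
                 outbound_ports: Container[int]) -> None:
        self._inbound_ports = inbound_ports
        self._outbound_ports = outbound_ports

    def allow_inbound(self, conn: Connection) -> bool:
        return conn.server_port in self._inbound_ports

    def allow_reply(self, conn: Connection) -> bool:
        # The reply goes to the client's ephemeral port, which must be allowed explicitly.
        return conn.client_port in self._outbound_ports
```

This is why NACL configurations usually include an outbound rule for the ephemeral port range (for example `range(1024, 65536)` in this model), while Security Groups need only the inbound rule.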

29. More AWS Services

Q184. What is AWS CloudFormation?

CloudFormation is AWS-native Infrastructure as Code. Define AWS resources in YAML/JSON templates; CloudFormation provisions them in the correct dependency order.

Key concepts: Stack (deployed template instance), Change Set (preview before applying), Drift Detection (detect manual changes), Nested Stacks (modular templates), StackSets (multi-account/region deployment).

When to use: AWS-only environments, tight integration with AWS services (e.g., SAM for serverless). Use Terraform for multi-cloud.

Q185. What is AWS Fargate?

Fargate = serverless compute for containers. Run ECS tasks or EKS pods without managing EC2 instances. Define CPU/memory requirements; AWS handles the underlying compute, OS, and scaling.

Best for: variable/bursty workloads, batch jobs, teams that don't want to manage node groups. Cost is higher per vCPU/hour than EC2 but eliminates node management overhead.

Q186. What is Amazon ElastiCache?

ElastiCache is a managed in-memory caching service supporting Redis and Memcached.

Use cases: session storage (replace sticky sessions), database query result caching (reduce DB load by 80%+), rate limiting counters, leaderboards, pub/sub messaging (Redis Pub/Sub), geospatial queries.

Multi-AZ Redis with automatic failover for production. Redis > Memcached for most use cases (persistence, data structures, replication).

Q187. What is the difference between SNS and SQS?

	SNS (Simple Notification Service)	SQS (Simple Queue Service)
Pattern	Pub/Sub — one message to many subscribers	Point-to-point queue — one message to one consumer
Delivery	Push — subscribers receive immediately	Pull — consumers poll for messages
Persistence	No — if subscriber is down, message is lost	Yes — messages stored up to 14 days
Use case	Fan-out (notify email + Lambda + SQS simultaneously)	Decoupled async processing, task queue
Common pattern	SNS → SQS (fan-out + durability)	SQS → Lambda (event-driven processing)
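
The SNS → SQS fan-out pattern in the table can be illustrated with an in-memory model. `Topic` and `Queue` here are toy classes written for this sketch, not the AWS SDK; they show only the push-to-many and store-until-polled behaviour.

```python
from collections import deque
from typing import Deque, List, Optional


class Queue:
    """SQS-like: messages are stored until a consumer polls for them."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None


class Topic:
    """SNS-like: every published message is pushed to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Queue] = []

    def subscribe(self, queue: Queue) -> None:
        self._subscribers.append(queue)

    def publish(self, message: str) -> None:
        for queue in self._subscribers:
            queue.send(message)
```

Subscribing one queue per downstream service gives each service its own durable copy of every event, so a slow or offline consumer does not affect the others.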

Q188. What is AWS Route 53 routing policies?

Route 53 supports multiple routing policies beyond simple DNS:

Simple — return fixed IP/value (default)
Weighted — split traffic by percentage (10% to v2, 90% to v1) — for canary deployments
Latency-based — route to region with lowest latency for the user
Geolocation — route based on user's country/continent (GDPR compliance)
Failover — primary/secondary with health checks — automatic DNS failover for DR
Multi-value — return multiple IPs; Route 53 removes unhealthy ones (client-side load balancing)
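
Weighted routing can be pictured as choosing a record with probability weight / sum of weights. The sketch below models that selection; `pick_weighted` is a hypothetical helper, and real Route 53 behaviour also depends on health checks and DNS caching (TTL).

```python
import random
from typing import Dict, Optional


def pick_weighted(records: Dict[str, int],
                  rng: Optional[random.Random] = None) -> str:
    """Return one target, chosen in proportion to its weight."""
    if rng is None:
        rng = random.Random()
    targets = list(records)
    weights = [records[target] for target in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


if __name__ == "__main__":
    picks = [pick_weighted({"v1": 90, "v2": 10}) for _ in range(10_000)]
    print("share sent to v2:", picks.count("v2") / len(picks))
```

Setting a record's weight to 0 removes it from rotation, which is how a canary is rolled back with this policy.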

Q189. What is AWS WAF?

AWS WAF (Web Application Firewall) protects web apps from common web exploits — OWASP Top 10 (SQL injection, XSS, CSRF), bot attacks, and DDoS at Layer 7.

Deploy in front of: CloudFront, Application Load Balancer, API Gateway, AppSync.

Rules: AWS Managed Rules (pre-built rule sets), custom rules (block specific IPs, rate limit, geo-block), bot control (CAPTCHA for suspicious traffic).

Q190. What are EC2 instance types and when to use each?

Family	Optimised for	Use cases
T3/T4g (burstable)	Cost — baseline CPU with burst credits	Dev/test, low-traffic web servers, microservices
M6/M7 (general)	Balance of CPU + memory	Application servers, backend APIs, databases
C6/C7 (compute)	High CPU	Web servers, batch processing, ML inference, game servers
R6/R7 (memory)	Large RAM	In-memory databases, Redis, SAP HANA, analytics
P4/G5 (GPU)	GPU compute	ML training, deep learning, video encoding
I3/I4 (storage)	High local NVMe IOPS	NoSQL databases, data warehousing, Elasticsearch
Inf2 (Inferentia)	ML inference chips	Cost-efficient LLM inference at scale

Q191. What is Amazon Aurora and how is it different from RDS?

Aurora is AWS's cloud-native relational database compatible with MySQL and PostgreSQL, but redesigned for cloud scale.

Feature	Standard RDS (MySQL)	Aurora
Storage	EBS volume per instance	Distributed, shared storage cluster
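Read replicas	Up to 15 (historically 5), async replication	Up to 15, replica lag typically well under 100 ms
Failover	Typically 60-120 seconds (Multi-AZ)	Typically under 30 seconds
Storage scaling	Manual, or opt-in storage autoscaling	Auto-grows in 10 GB increments up to 128 TiB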
Cost	Lower	~20% more than RDS, but better performance

Aurora Serverless v2 auto-scales database capacity in fine-grained steps (as small as 0.5 ACU, or Aurora Capacity Unit), which suits variable workloads.

Q192. What is Amazon DynamoDB and when should you use it?

DynamoDB is a fully managed NoSQL key-value and document database with single-digit millisecond performance at any scale.

Best for: high-throughput applications (gaming leaderboards, IoT, e-commerce carts, session storage, real-time bidding). Not good for: complex queries with JOINs, ad-hoc analytics, ACID transactions across many tables.

Key features: on-demand capacity (pay per request), DynamoDB Streams (change data capture), Global Tables (multi-region active/active), DynamoDB Accelerator (DAX) for microsecond caching.

30. Final Exam Questions

Q193. What is serverless and when should you NOT use it?

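Serverless means the cloud provider runs, scales, and patches the compute for you: you deploy code (for example, AWS Lambda functions) and pay per use instead of managing servers.

Serverless is wrong for: long-running tasks (Lambda max 15 min), high sustained throughput at massive scale (containers are cheaper), strict latency requirements (cold starts unacceptable), complex stateful workflows, and existing containerised apps that would require an expensive refactor.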

Serverless is right for: event-driven processing, variable/unpredictable load, simple API backends, batch jobs triggered by events, scheduled tasks.

Q194. What is the difference between EC2 On-Demand, Reserved, and Spot instances?

Type	Pricing	Use case	Risk
On-Demand	Full price, no commitment	Unpredictable workloads, dev/test, short-term	Highest cost; no interruption, though capacity in a given AZ is not guaranteed
Reserved (1yr/3yr)	Up to 72% discount	Steady, predictable production workloads	Commitment — pay even if unused
Spot	Up to 90% discount	Fault-tolerant batch, CI/CD runners, stateless workers	2-min termination notice when capacity needed
Savings Plans	Up to 66% discount (Compute Savings Plans)	Flexible commitment across EC2+Lambda+Fargate	Hourly spend commitment
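To make the discounts in the table concrete, the sketch below computes a best-case monthly cost per option. The $0.10 hourly rate is a hypothetical on-demand price, not a real quote, and because each discount is an "up to" figure, the results are lower bounds rather than expected bills.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12 months

# Maximum discounts taken from the table above; real discounts vary.
MAX_DISCOUNT = {
    "on_demand": 0.00,
    "reserved": 0.72,
    "spot": 0.90,
    "savings_plan": 0.66,
}


def monthly_cost(on_demand_hourly: float, option: str, instances: int = 1) -> float:
    """Best-case monthly cost for one purchasing option, rounded to cents."""
    discount = MAX_DISCOUNT[option]
    hourly = on_demand_hourly * (1 - discount)
    return round(hourly * HOURS_PER_MONTH * instances, 2)


rate = 0.10  # hypothetical on-demand price in USD per hour
costs = {option: monthly_cost(rate, option) for option in MAX_DISCOUNT}
cheapest = min(costs, key=costs.get)
```

Spot comes out cheapest, but only for workloads that tolerate the 2-minute termination notice. A common approach is to cover the steady baseline with Reserved Instances or Savings Plans and use Spot for interruptible work.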

Q195. What is object storage consistency?

AWS S3 provides strong read-after-write consistency for all operations since December 2020. PUT an object, immediately GET it — you always get the latest version. Applies to new objects, overwrites, and deletes.

GCS (Google Cloud Storage) has always been strongly consistent. Both eliminate the need to worry about eventual consistency in your application design.

Q196. What is AWS Lambda concurrency?

Concurrency = number of function instances handling requests simultaneously.

Unreserved concurrency: default pool shared across all Lambda functions in an account, per Region (1,000 concurrent executions by default; can be raised with a quota increase)
Reserved concurrency: guarantees a specific number of concurrent executions for a critical function and also caps that function at the same number; prevents other functions from consuming all concurrency
Provisioned concurrency: pre-initialises execution environments so requests within the provisioned level avoid cold starts; use for latency-sensitive production APIs

When concurrency limit is hit: throttling (429 error). SQS buffers requests naturally; API Gateway returns 429 to clients.
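A quick way to check whether a workload will hit the limit is the usual estimate: concurrency is roughly average requests per second multiplied by average duration in seconds. The sketch below applies it against the default limit of 1,000; the traffic figures are hypothetical.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Estimate concurrent executions: request rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def would_throttle(required: int, limit: int = 1000) -> bool:
    """True when the estimate exceeds the available concurrency."""
    return required > limit


# Hypothetical traffic profiles.
steady = required_concurrency(200, 0.5)    # 200 req/s, 500 ms each
spike = required_concurrency(6000, 0.25)   # 6,000 req/s, 250 ms each

steady_throttled = would_throttle(steady)
spike_throttled = would_throttle(spike)
```

The steady profile needs about 100 concurrent executions and fits comfortably, while the spike needs about 1,500 and would be throttled unless the limit is raised or the requests are buffered, for example through SQS.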

Q197. How would you build a cost-optimised, highly available 3-tier architecture on AWS?