Cloud Computing Interview Questions and Answers (2026)

Preparing for a cloud or DevOps interview in 2026? This is the most comprehensive guide covering 200+ Cloud Computing interview questions with detailed answers — from beginner to advanced — for Cloud Engineers, DevOps Engineers, SREs, and Cloud Architects.

Topics covered: Cloud fundamentals · IaaS/PaaS/SaaS · Networking & VPC · Security & IAM · Containers & Kubernetes · Serverless · AWS/Azure/GCP services · CI/CD & IaC · High availability & DR · Cost optimization · Cloud migration · Microservices architecture

Who this is for: Freshers targeting cloud roles, experienced engineers preparing for senior interviews, and anyone interviewing at AWS, Google, Microsoft, Datadog, Cloudflare, or any cloud-native company.

1. Basic Cloud Computing

Q1. What is cloud computing?

Cloud computing is the on-demand delivery of computing resources — servers, storage, databases, networking, software, analytics — over the internet with pay-as-you-go pricing. Instead of owning physical data centers, you access technology services from a cloud provider like AWS, Azure, or GCP.


Q2. What are the key characteristics of cloud computing?

NIST (Special Publication 800-145) defines five essential characteristics:

  • On-demand self-service — provision resources instantly without human interaction
  • Broad network access — accessible via any internet-connected device
  • Resource pooling — multi-tenant shared infrastructure
  • Rapid elasticity — scale resources up/down automatically
  • Measured service — pay only for what you consume (metered billing)

Q3. What are the main advantages of using cloud computing?

  • Cost savings — no upfront capital expenditure, pay-as-you-go
  • Scalability — scale instantly to handle traffic spikes
  • Global reach — deploy to data centers worldwide in minutes
  • High availability — built-in redundancy across multiple zones
  • Faster innovation — focus on code, not hardware management
  • Disaster recovery — automated backups, geo-redundancy
  • Security — world-class security from providers with dedicated security teams

Q4. What are the different types of cloud deployment models?

Model | Who owns it | Best for | Examples
Public | Cloud provider | Startups, scalable workloads | AWS, Azure, GCP
Private | Organisation | Banks, government, compliance | OpenStack, VMware
Hybrid | Both | Burst to public, sensitive data on-prem | AWS Outposts, Azure Arc
Multi-cloud | Multiple providers | Avoid vendor lock-in | AWS + GCP + Azure
Community | Shared org group | Research, government agencies | GovCloud

Q5. What is the difference between public, private, and hybrid cloud?

  • Public cloud — infrastructure owned and operated by a third-party provider, shared among multiple customers. Cheapest, most scalable.
  • Private cloud — infrastructure dedicated to a single organisation. More control, higher cost, better compliance.
  • Hybrid cloud — combination of public and private cloud connected by technology that allows data and applications to move between them. Best of both worlds.

Q6. How does cloud computing differ from traditional data center operations?

Aspect | Traditional Data Center | Cloud Computing
Capital cost | High upfront (servers, networking) | Zero upfront; pay as you go
Scaling | Months to procure new hardware | Minutes to spin up new instances
Maintenance | In-house team manages hardware | Provider handles hardware
Global reach | Limited to owned locations | Dozens of regions worldwide per major provider
Elasticity | Fixed capacity; must over-provision | Elastic; scale with demand
Innovation speed | Slow (hardware procurement) | Fast (API-driven provisioning)

Q7. What is virtualization and how does it relate to cloud computing?

Virtualization creates software-based (virtual) versions of physical hardware. A hypervisor runs on physical hardware and creates multiple isolated virtual machines (VMs), each with its own OS.

Cloud computing is built on virtualization — providers run thousands of VMs on shared physical servers and rent them to customers. Without virtualization, multi-tenancy, elasticity, and cost efficiency would not be possible.

Type 1 hypervisors (bare-metal): VMware ESXi, KVM, Hyper-V — run directly on hardware, used in production. Type 2 hypervisors (hosted): VirtualBox, VMware Workstation — run on top of an OS, used for dev/test.


Q8. What is a hypervisor and its types?

A hypervisor (Virtual Machine Monitor) is software that creates and manages virtual machines by abstracting physical hardware.

  • Type 1 (Bare-metal) — runs directly on hardware, no host OS. Faster and more secure. Examples: VMware ESXi, Microsoft Hyper-V, KVM, Xen. Used in all major cloud providers.
  • Type 2 (Hosted) — runs on top of an existing OS. Slower but easier for dev/test. Examples: VirtualBox, VMware Workstation, Parallels.

Q9. What is elasticity in cloud computing?

Elasticity is the ability to automatically provision and de-provision resources in real time based on demand — scale out when load increases, scale in when load decreases.

Example: An e-commerce site that automatically adds 50 EC2 instances during Black Friday traffic and removes them afterward is elastic. Elasticity = scalability + automation + cost efficiency.


Q10. What is the difference between scalability and elasticity?

  • Scalability — the ability to handle increased load by adding resources. Can be manual or planned.
    • Vertical (scale-up): bigger server (more CPU/RAM)
    • Horizontal (scale-out): more servers behind a load balancer
  • Elasticity — automatic, real-time scaling up AND down based on current demand. Elasticity is scalability + automation.

Key difference: scalability is about capacity; elasticity is about automatic, dynamic adjustment.


Q11. What are the primary components of cloud architecture?

  • Front-end — client devices and interfaces (browser, mobile app)
  • Back-end — cloud infrastructure (servers, databases, storage)
  • Network — internet connection between front and back end
  • Cloud delivery model — IaaS, PaaS, SaaS
  • Security layer — IAM, encryption, firewalls
  • Management layer — monitoring, orchestration, automation tools

Q12. What is a Service Level Agreement (SLA)?

An SLA is a contract between cloud provider and customer defining the guaranteed level of service.

Key SLA metrics:

  • Availability/Uptime: 99.9% ≈ 8.76 hrs downtime/year, 99.99% ≈ 52.6 min/year, 99.999% ≈ 5.3 min/year
  • Response time: how quickly support responds
  • Incident resolution time: how quickly issues are resolved
  • Remedies: service credits if SLA is missed

Always check SLA exclusions — scheduled maintenance, customer-caused outages, and force majeure events are typically excluded.
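
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and ignoring any SLA exclusions (the function name is illustrative, not part of any provider's tooling):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the maximum downtime per year, in hours, for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% usually requires redundancy rather than just faster repairs.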


Q13. What is the difference between scalability and elasticity in cloud computing?

See Q10 above — scalability is the ability to grow capacity (manual or planned), while elasticity is automatic, real-time dynamic scaling in both directions.


Q14. What are the four layers of cloud architecture?

  1. Physical layer — hardware (servers, storage, networking)
  2. Virtualization layer — hypervisor creates VMs from physical resources
  3. Cloud platform layer — cloud OS, APIs, management tools
  4. Application layer — end-user applications and services running on the cloud

2. Cloud Service Models

Q15. What are the main cloud service models?

Model | You manage | Provider manages | Examples
IaaS | OS, runtime, apps, data | Servers, storage, networking | EC2, Azure VMs, GCP Compute Engine
PaaS | Apps and data | OS, runtime, middleware, servers | Heroku, Google App Engine, Azure App Service
SaaS | Just your data | Everything | Gmail, Salesforce, Dropbox, Zoom
FaaS | Function code only | Runtime, scaling, infra | AWS Lambda, Azure Functions, Google Cloud Functions

Q16. What is Infrastructure as a Service (IaaS)?

IaaS provides virtualised computing resources over the internet. You get raw compute, storage, and networking — you manage the OS, middleware, runtime, and applications. The provider manages the physical hardware.

Best for: hosting custom applications, running VMs, data storage, DR, dev/test environments. Examples: AWS EC2, Azure Virtual Machines, Google Compute Engine, DigitalOcean Droplets.


Q17. What is Platform as a Service (PaaS)?

PaaS provides a platform for developing, deploying, and managing applications without managing the underlying infrastructure. The provider manages OS, runtime, middleware, and servers.

Best for: web app development, API backends, database management, microservices without K8s complexity. Examples: Google App Engine, Azure App Service, Heroku, AWS Elastic Beanstalk, Render.


Q18. What is Software as a Service (SaaS)?

SaaS delivers software applications over the internet on subscription. The provider manages everything — infrastructure, platform, and software. Customers just use the application.

Best for: email, CRM, HR tools, collaboration software — no installation or maintenance. Examples: Gmail, Microsoft 365, Salesforce, Zoom, Slack, Dropbox, Notion.


3. Cloud Storage and Networking

Q19. What are the different types of cloud storage?

Type | Description | Access method | Use case | AWS example
Object storage | Flat namespace, key-value, highly scalable | HTTP API | Images, videos, backups, static websites | S3
Block storage | Raw storage volumes attached to VMs like a hard drive | Mounted as disk | OS disks, databases | EBS
File storage | Shared file system mountable by multiple VMs (NFS/SMB) | Network file protocol | Shared app files, home directories | EFS
Archive storage | Very low-cost, high-latency, for long-term retention | API; retrieval can take minutes to hours depending on tier | Compliance archives, cold data | S3 Glacier

Q20. What is a Content Delivery Network (CDN) and how does it work?

A CDN is a globally distributed network of edge servers that cache static content (images, CSS, JS, videos) close to end users, reducing latency.

How it works:

  1. User requests a file
  2. DNS routes to nearest edge server (Point of Presence)
  3. Cache hit → served immediately from edge
  4. Cache miss → fetch from origin, cache it, serve

Benefits: lower latency, reduced origin load, DDoS absorption, global availability. Examples: AWS CloudFront, Cloudflare, Akamai, Azure CDN, Fastly.
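
The hit/miss flow in steps 3 and 4 can be illustrated with a toy edge cache. This is only a sketch: the class and attribute names are hypothetical, a dictionary stands in for the origin server, and real CDNs add TTLs, eviction, and cache-control handling that are omitted here.

```python
class EdgeCache:
    """Toy edge server: serves from cache on a hit, falls back to the origin on a miss."""

    def __init__(self, origin: dict) -> None:
        self.origin = origin      # stands in for the origin server
        self.cache = {}           # path -> cached body
        self.origin_fetches = 0   # how many times the origin was contacted

    def get(self, path: str):
        if path in self.cache:
            return self.cache[path], "HIT"
        body = self.origin[path]  # cache miss: fetch from origin
        self.origin_fetches += 1
        self.cache[path] = body   # store it for later requests
        return body, "MISS"


origin = {"/logo.png": b"<png bytes>"}
edge = EdgeCache(origin)
first = edge.get("/logo.png")
second = edge.get("/logo.png")
print(first[1], second[1], edge.origin_fetches)  # prints: MISS HIT 1
```

Only the first request reaches the origin, which is where the reduced origin load comes from.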


Q21. What is a Virtual Private Cloud (VPC)?

A VPC is a logically isolated section of the cloud provider's network where you launch resources in a virtual network you control. You define IP ranges, subnets, route tables, and security rules.

Key VPC components:

  • Subnets — public (internet-facing) or private (internal)
  • Internet Gateway — connects VPC to internet
  • NAT Gateway — lets private subnet instances reach internet without being exposed
  • Security Groups — stateful instance-level firewall (allow rules only)
  • Network ACLs — stateless subnet-level firewall (allow + deny rules)
  • Route Tables — control traffic routing within VPC
  • VPC Peering — connect two VPCs privately
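
Planning subnets is mostly CIDR arithmetic. The sketch below uses Python's standard ipaddress module to carve a VPC range into /24 subnets; the 10.0.0.0/16 range and the subnet names are assumptions chosen for illustration, not a recommended layout.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")   # hypothetical VPC CIDR
subnets = list(vpc.subnets(new_prefix=24))  # every possible /24 inside the VPC

# Hypothetical layout: two public and two private subnets.
layout = {
    "public-a": subnets[0],
    "public-b": subnets[1],
    "private-a": subnets[10],
    "private-b": subnets[11],
}

for name, net in layout.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")

# Subnets inside one VPC must not overlap.
nets = list(layout.values())
assert not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```

Note that providers reserve a few addresses in every subnet (AWS reserves five), so the usable count is slightly lower than num_addresses.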

Q22. What is a cloud VPN?

A Cloud VPN creates an encrypted tunnel between an on-premises network and a cloud VPC over the public internet. Used to securely extend an on-premises data center into the cloud.

Types:

  • Site-to-site VPN — connects entire on-prem network to cloud VPC
  • Client VPN — individual user connects to cloud network

AWS: AWS Site-to-Site VPN, AWS Client VPN. For higher bandwidth and lower latency: use AWS Direct Connect or Azure ExpressRoute (dedicated private fiber connection).


Q23. What is cloud latency?

Latency is the round-trip time (RTT) for data to travel from client to cloud server and back. Affected by physical distance, network congestion, number of routing hops, and server processing time.

To minimise latency: choose the closest region, use a CDN for static assets, use edge computing, optimise database queries, use connection pooling.


Q24. What is load balancing in cloud computing?

A load balancer distributes incoming traffic across multiple backend instances to prevent overload and ensure availability.

Types by OSI layer:

  • Layer 4 (Transport, TCP/UDP): routes based on IP address and port. Faster, no content inspection. AWS NLB.
  • Layer 7 (Application, HTTP/HTTPS): routes based on URL path, HTTP headers, hostname, cookies. AWS ALB.

Algorithms: Round Robin, Least Connections, IP Hash, Weighted. Load balancers perform health checks — stop routing to unhealthy instances automatically.
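
Two of the algorithms named above, Round Robin and Least Connections, can be sketched in a few lines. This is an illustrative in-memory model, not how any managed load balancer is implemented; the Backend class and the backend names are hypothetical, and health is a simple flag rather than a real health check.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True
    active_connections: int = 0


class RoundRobin:
    """Rotate through backends in order, skipping unhealthy ones."""

    def __init__(self, backends):
        self.backends = backends
        self.index = 0

    def pick(self) -> Backend:
        for _ in range(len(self.backends)):
            backend = self.backends[self.index]
            self.index = (self.index + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends")


def least_connections(backends) -> Backend:
    """Pick the healthy backend with the fewest active connections."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends")
    return min(healthy, key=lambda b: b.active_connections)


pool = [Backend("web-1"), Backend("web-2", healthy=False), Backend("web-3")]
rr = RoundRobin(pool)
rr_order = [rr.pick().name for _ in range(4)]

pool[0].active_connections = 5
pool[2].active_connections = 2
lc_choice = least_connections(pool).name
print(rr_order, lc_choice)
```

The unhealthy web-2 never receives traffic under either algorithm, which mirrors what a health check achieves.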


Q25. What is a cloud database and its types?

Type | Description | Examples
Relational (SQL) | Structured data, ACID transactions, complex queries | AWS RDS, Azure SQL, Cloud SQL, Aurora
NoSQL Key-Value | Simple key-value, ultra-fast, massive scale | DynamoDB, Redis, Memcached
NoSQL Document | JSON documents, flexible schema | MongoDB Atlas, Firestore, Cosmos DB
NoSQL Wide-Column | Very large datasets with high write throughput | Cassandra, Bigtable, HBase
Graph | Relationships between entities | Neptune, Neo4j
Time-series | Time-stamped metrics and events | InfluxDB, Timestream, TimescaleDB
Data Warehouse | Analytical queries on petabytes | Snowflake, BigQuery, Redshift

Q26. What is cloud networking?

Cloud networking provides virtual networking components — VPCs, subnets, load balancers, DNS, VPNs, firewalls — that connect cloud resources to each other and to the internet. Cloud networking is defined in software (Software-Defined Networking / SDN) rather than physical cables.


Q27. What is a cloud router?

A cloud router is a software-defined router that dynamically manages routing between VPC networks, on-premises networks, and other cloud regions. It uses BGP (Border Gateway Protocol) to exchange routing information.

AWS: AWS Transit Gateway (connect multiple VPCs and on-prem networks). GCP: Cloud Router.


Q28. What is cloud interconnect?

Cloud Interconnect provides a dedicated, private, high-bandwidth connection between your on-premises network and the cloud provider — bypassing the public internet.

Better performance, lower latency, and more consistent throughput than VPN over internet. Examples: AWS Direct Connect, Azure ExpressRoute, Google Cloud Interconnect.


Q29. What is a cloud storage gateway?

A cloud storage gateway is a software appliance or hardware device at an on-premises data center that bridges on-prem applications to cloud storage. It translates local file/block requests to cloud object storage APIs.

Use case: backup on-prem data to S3, extend NAS storage to cloud without changing applications. AWS: AWS Storage Gateway (File Gateway, Volume Gateway, Tape Gateway).


4. Cloud Deployment and Management

Q30. What is auto-scaling and how is it implemented?

Auto-scaling automatically adjusts the number of compute resources based on demand. Types:

  • Reactive (dynamic) — scale based on real-time metrics (CPU > 70% → add instance)
  • Scheduled — scale at predicted times (scale up every Monday 9AM)
  • Predictive — ML-based demand forecasting (AWS Predictive Scaling)

AWS Auto Scaling Group example: min=2, max=20, desired=4. Scale-out: add 2 when CPU > 80% for 5 min. Scale-in: remove 1 when CPU < 30% for 10 min.
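
The example policy above can be read as a small decision function. The sketch below is a simplification under stated assumptions: cpu_percent is taken to be the average over the evaluation window (the 5- and 10-minute durations are not modelled), and the function name is illustrative rather than an AWS API.

```python
def desired_capacity(current: int, cpu_percent: float,
                     minimum: int = 2, maximum: int = 20) -> int:
    """Apply the scale-out and scale-in rules, clamped to the group's min and max."""
    if cpu_percent > 80:
        target = current + 2   # scale out: add 2 instances
    elif cpu_percent < 30:
        target = current - 1   # scale in: remove 1 instance
    else:
        target = current       # within the comfortable band: no change
    return max(minimum, min(maximum, target))


print(desired_capacity(4, 85))   # 6
print(desired_capacity(4, 20))   # 3
print(desired_capacity(2, 10))   # 2 (never below the minimum)
print(desired_capacity(19, 95))  # 20 (never above the maximum)
```

Scaling out faster than scaling in, as in this policy, is a common way to protect availability while avoiding thrashing.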


Q31. What is cloud bursting and when is it used?

Cloud bursting is a hybrid configuration where an app runs on-premises but automatically bursts to a public cloud when local capacity is exceeded. You only pay for the burst capacity.

When to use: predictable seasonal spikes (retail on Black Friday), batch processing overflow, dev/test capacity. Challenges: network latency, data transfer costs, security at the boundary.


Q32. What is cloud provisioning?

Cloud provisioning is the process of allocating cloud resources to users or applications — selecting appropriate services, configuring settings, and making them available. Modern provisioning is done via IaC tools (Terraform, CloudFormation) or APIs, not manually through consoles.


Q33. What are cloud regions and availability zones?

  • Region — a geographic area with multiple data centers (e.g., us-east-1, eu-west-2). Choose closest to users for lowest latency.
  • Availability Zone (AZ) — one or more isolated data centers in a region with independent power, cooling, and networking. Typically 3+ AZs per region.
  • Edge Location — CDN cache points closer to users (AWS CloudFront).

Best practice: distribute across 3 AZs for HA; across 2+ regions for disaster recovery.
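
A rough way to see why spreading across AZs helps: if each copy of a service fails independently, the chance that all copies are down at once shrinks geometrically. The sketch below assumes independent failures and a hypothetical 99% availability per AZ; real AZ failures are not perfectly independent, so treat the result as an upper bound.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, assuming failures are independent."""
    return 1 - (1 - single) ** copies


per_az = 0.99  # hypothetical availability of the service in one AZ
for copies in (1, 2, 3):
    print(f"{copies} AZ(s): {combined_availability(per_az, copies):.6f}")
```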


Q34. What is cloud automation?

Cloud automation uses tools and scripts to automatically provision, configure, deploy, and manage cloud infrastructure without manual intervention.

Tools: Terraform (IaC), Ansible (configuration), AWS Systems Manager (patching, commands), AWS Lambda (event-driven automation), CloudFormation (AWS-native IaC).

Benefits: consistency, speed, reduced human error, version-controlled infrastructure.


Q35. What is cloud monitoring?

Cloud monitoring collects and analyses metrics, logs, and traces from cloud resources to detect issues, track performance, and ensure availability.

Key areas: infrastructure metrics (CPU, memory, disk), application metrics (request rate, error rate, latency), logs (application, access, audit), distributed tracing.

Tools: AWS CloudWatch, Datadog, Prometheus + Grafana, New Relic, Dynatrace, Azure Monitor.


Q36. What is cloud governance?

Cloud governance is the set of rules, policies, and processes that ensure cloud usage aligns with business objectives, security requirements, and compliance standards.

Key elements: cost management (tagging, budgets, alerts), security policies (SCPs in AWS), compliance frameworks (CIS, SOC2, GDPR), naming conventions, access controls.

AWS tools: AWS Organizations, Service Control Policies (SCPs), AWS Config, AWS Control Tower.


Q37. What is the difference between horizontal and vertical scaling?

  • Vertical scaling (scale-up) — increase the size of an existing instance (more CPU, RAM). Has an upper limit. Requires restart/downtime in some cases.
  • Horizontal scaling (scale-out) — add more instances behind a load balancer. No hard limit. Cloud-native apps are designed for horizontal scaling.

Best practice: design stateless applications that scale horizontally. Store session state in Redis/DynamoDB.


Q38. What is multi-cloud strategy?

Multi-cloud means using services from multiple cloud providers (e.g., AWS for compute, GCP for ML/BigQuery, Cloudflare for edge). Benefits: avoid vendor lock-in, best-of-breed services, data sovereignty, resilience. Challenges: increased complexity, different APIs, higher operational overhead, data transfer costs. Management tools: Terraform, Crossplane, Anthos, Azure Arc.


Q39. What is resource replication and why is it important?

Resource replication copies data or configuration across multiple locations (AZs, regions) to ensure availability, durability, and disaster recovery.

Examples: S3 Cross-Region Replication, RDS Multi-AZ (synchronous replication), DynamoDB global tables (multi-region replication), database read replicas for performance.


Q40. What is cloud federation?

Cloud federation is the integration of multiple cloud environments (from different providers or on-premises) into a unified management layer, allowing workloads to move seamlessly between them based on cost, performance, or compliance needs.


5. Cloud Migration

Q41. What is cloud migration?

Cloud migration is the process of moving applications, data, and infrastructure from on-premises data centers (or from one cloud) to another cloud environment.

Migration phases: Discovery (inventory current state) → Assessment (analyse dependencies) → Planning (choose strategy) → Migration (move workloads) → Optimisation (right-size, cost-optimise).


Q42. What is cloud orchestration?

Cloud orchestration is the automated coordination and management of multiple cloud services and workflows to create a cohesive system. It ensures cloud resources work together in the right order.

Example: provision a VPC → launch EC2 instances → configure RDS → deploy application → configure load balancer — all automated via Terraform or AWS CloudFormation.
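
The "right order" in the example above is a dependency graph, and orchestration tools resolve it with a topological sort. A minimal sketch using Python's standard graphlib module (Python 3.9+); the resource names are hypothetical labels, and this illustrates the ordering idea only, not how Terraform or CloudFormation work internally.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on.
dependencies = {
    "vpc": set(),
    "ec2_instances": {"vpc"},
    "rds": {"vpc"},
    "application": {"ec2_instances", "rds"},
    "load_balancer": {"application"},
}

# static_order() yields every resource after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Resources with no dependency between them (here the EC2 instances and the RDS database) may appear in either order, which is also why orchestrators can create them in parallel.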


Q43. Explain the "lift and shift" approach and the 6 Rs of migration.

Strategy | Description | Effort | Cloud benefit
Rehost (lift & shift) | Move VM as-is to cloud | Low | Fast migration, immediate cloud benefits
Replatform (lift, tinker, shift) | Minor optimisations (e.g., move to RDS) | Low-medium | Reduced ops overhead
Refactor / Re-architect | Redesign for cloud-native (microservices, serverless) | High | Maximum agility and scalability
Repurchase | Replace with SaaS (e.g., CRM → Salesforce) | Medium | No more software maintenance
Retire | Decommission apps no longer needed | Low | Cost savings
Retain | Keep on-premises (compliance, latency) | None | Compliance or business need

Q44. What are the key considerations for cloud migration?

  1. Application dependencies — map all dependencies before migrating
  2. Data migration strategy — online (live) vs offline migration, data size, downtime tolerance
  3. Security and compliance — data sovereignty, GDPR, encryption requirements
  4. Network connectivity — VPN or Direct Connect during migration
  5. Cost estimation — TCO analysis, reserved instance planning
  6. Rollback plan — ability to revert if migration fails
  7. Team upskilling — train teams on cloud-native tools

Q45. What is data residency and data sovereignty in cloud?

  • Data residency: where data is physically stored (which country or region). Regulations often require data to stay within national borders.
  • Data sovereignty: data is subject to the laws of the country where it is stored. Even if your company is in India, data stored in the US is subject to US law. Jurisdiction can also follow the provider: the US CLOUD Act lets US authorities request data from US-based providers even when it is stored abroad.

Solution: choose cloud regions that match your regulatory requirements, use data localisation features, or use region-locked services.


6. Cloud Security & Compliance

Q46. What is Identity and Access Management (IAM)?

IAM controls who (identity) can do what (access) on which resources. Core components:

  • Users — individual identities
  • Groups — collection of users with shared permissions
  • Roles — temporary permissions for services or cross-account access
  • Policies — JSON documents defining allow/deny actions on resources
  • Principle of least privilege — grant minimum permissions required
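
To make the policy idea concrete, here is a simplified evaluator in the spirit of AWS IAM's rules: an explicit Deny always wins, an Allow is otherwise required, and anything unmatched is denied by default. It is a sketch under assumptions: the bucket name is hypothetical, each statement has a single action and resource, and real IAM features such as conditions, principals, and case-insensitive action matching are left out.

```python
from fnmatch import fnmatchcase

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:Get*",
         "Resource": "arn:aws:s3:::reports/*"},
        {"Effect": "Deny", "Action": "s3:*",
         "Resource": "arn:aws:s3:::reports/secret/*"},
    ],
}


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Explicit Deny wins; otherwise an Allow is required; the default is deny."""
    allowed = False
    for stmt in policy["Statement"]:
        matches = (fnmatchcase(action, stmt["Action"])
                   and fnmatchcase(resource, stmt["Resource"]))
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::reports/2026/q1.csv"))
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::reports/secret/keys.txt"))
print(is_allowed(policy, "s3:PutObject", "arn:aws:s3:::reports/2026/q1.csv"))
```

The third call is denied even though no statement mentions it, which is least privilege in practice: access exists only where it was granted.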

Q47. What is data encryption in cloud computing?

  • At rest — data encrypted when stored. AWS uses AES-256 via KMS. S3, EBS, RDS support server-side encryption.
  • In transit — data encrypted while moving between systems using TLS/SSL (HTTPS). Enforced via ACM certificates.
  • Client-side — data encrypted before sending to cloud. Customer controls keys.

S3 server-side encryption options: SSE-S3 (keys managed by S3), SSE-KMS (keys held in AWS KMS, either AWS-managed or customer-managed), SSE-C (customer-provided keys).


Q48. What is the shared responsibility model?

Model | Provider secures | Customer secures
IaaS | Hardware, data centers, hypervisor, network | OS patching, app security, data encryption, IAM, firewall config
PaaS | Everything in IaaS + OS, runtime, middleware | App code, data, user access
SaaS | Everything except customer data | Data governance, user access management

Q49. What is cloud workload protection?

Cloud Workload Protection Platform (CWPP) secures workloads at runtime — VMs, containers, serverless functions. It monitors for malicious behaviour, enforces policies, and provides vulnerability scanning.

Examples: Prisma Cloud, Sysdig Secure, Aqua Security, Amazon Inspector (vulnerability scanning).


Q50. What is a Cloud Access Security Broker (CASB)?

A CASB sits between users and cloud services to enforce security policies — visibility, compliance, data security, and threat protection for SaaS applications.

Use cases: prevent sensitive data from being uploaded to personal Dropbox, enforce MFA for Salesforce, audit shadow IT usage. Examples: Netskope, Microsoft Defender for Cloud Apps, Zscaler CASB.


Q51. What is Cloud Security Posture Management (CSPM)?

CSPM tools continuously scan cloud environments for misconfigurations and compliance violations. They detect open S3 buckets, public RDS instances, overly permissive IAM policies, and unencrypted storage.

Examples: Wiz, Orca Security, Prisma Cloud, AWS Security Hub, Microsoft Defender for Cloud.


Q52. What is cloud backup and recovery?

Cloud backup copies data to cloud storage to protect against loss. Cloud recovery restores that data after a failure.

Best practices: 3-2-1 rule (3 copies, 2 media types, 1 offsite), automated scheduled backups, test recovery regularly, use point-in-time recovery for databases (RDS PITR), use versioning on S3.


Q53. What are the best practices for securing API endpoints?

  1. Authentication — require API keys, JWT tokens, or OAuth2
  2. Authorisation — enforce least-privilege RBAC
  3. Rate limiting — prevent brute force and DDoS
  4. Input validation — reject malformed requests
  5. HTTPS only — never allow plain HTTP
  6. WAF — protect against OWASP Top 10 (SQL injection, XSS)
  7. API Gateway — centralise auth, rate limiting, and logging
  8. Secrets — never expose API keys in logs or error messages
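
Rate limiting (item 3) is often implemented with a token bucket, which allows short bursts while capping the sustained rate. The sketch below is one possible in-process version; the class name and parameters are illustrative, and a production gateway would keep this state in a shared store and track one bucket per client.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the demo is deterministic
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                    # a caller would typically return HTTP 429


fake_now = [0.0]                        # fake clock for a repeatable demo
bucket = TokenBucket(rate=1, capacity=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]
fake_now[0] += 2.0                      # two seconds pass, two tokens refill
after_wait = bucket.allow()
print(burst, after_wait)
```

The first three requests pass as a burst, the fourth is rejected, and requests are accepted again once tokens have refilled.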

Q54. What are key security challenges in cloud computing?

  1. Misconfiguration — #1 cause of cloud breaches (open S3 buckets, public RDS)
  2. Insecure APIs — attack surface for credential theft and data exfiltration
  3. Overly permissive IAM — excessive permissions enable lateral movement
  4. Data breaches — sensitive data in unencrypted storage or application logs
  5. Shadow IT — employees using unsanctioned cloud services
  6. Insider threats — malicious or negligent employees
  7. DDoS attacks — volumetric attacks on cloud-exposed services
  8. Supply chain attacks — compromised dependencies or container images

Q55. How do you implement encryption for data at rest and in transit?

At rest (AWS):

  • S3: enable default encryption with SSE-KMS
  • RDS: enable encryption at creation (AES-256, KMS-managed key)
  • EBS: enable EBS encryption by default at account level

In transit:

  • Enforce HTTPS via load balancer listeners (redirect HTTP to HTTPS)
  • Use ACM (AWS Certificate Manager) for free TLS certificates
  • Enable TLS for RDS connections (for example, sslmode=require for PostgreSQL or --ssl-mode=REQUIRED for the MySQL client)
  • Use VPC endpoints to keep traffic within AWS network

7. Modern Cloud Architecture

Q56. What is multi-tenancy in cloud computing?

Multi-tenancy is an architecture where a single instance of software serves multiple customers (tenants). Each tenant's data is isolated and invisible to others, but they share the same underlying infrastructure.

Cloud providers are inherently multi-tenant. Isolation is achieved through virtualisation (VMs), containers, and logical data separation.


Q57. What are containers and how do they relate to cloud computing?

Containers package application code and all its dependencies (libraries, config) into a portable unit that runs consistently anywhere.

Unlike VMs, containers share the host OS kernel, making them lighter (MBs vs GBs), faster to start (seconds vs minutes), and more portable.

In cloud: containers are the foundation of modern cloud-native applications. Kubernetes orchestrates containers at scale. Every major cloud has a managed Kubernetes service (EKS, AKS, GKE).


Q58. What is the difference between containerization and virtualization?

Aspect | Containers | Virtual Machines
OS | Shares host OS kernel | Each has its own OS
Size | MBs | GBs
Startup time | Seconds | Minutes
Isolation | Process-level (namespaces, cgroups) | Full hardware-level
Portability | Runs anywhere Docker runs | Hypervisor-dependent
Overhead | Very low | Higher (full OS per VM)

Q59. What is Kubernetes and how does it work in cloud environments?

Kubernetes (K8s) automates deployment, scaling, and management of containerised applications.

Architecture:

  • Control Plane: API Server (entry point), etcd (state store), Scheduler (assigns pods to nodes), Controller Manager
  • Worker Nodes: kubelet (agent), kube-proxy (networking), container runtime (containerd or CRI-O; direct Docker Engine support via dockershim was removed in Kubernetes 1.24)

Key objects: Pod (smallest unit), Deployment (manages replicas), Service (stable networking), Ingress (HTTP routing), ConfigMap/Secret (configuration), HPA (auto-scaling).

Cloud managed K8s: AWS EKS, Azure AKS, GCP GKE — provider manages the control plane.


Q60. What are microservices and how do they benefit cloud deployments?

Microservices architecture splits an application into small, independently deployable services, each owning its data and communicating via APIs.

Benefits for cloud:

  • Scale each service independently (only scale the checkout service on Black Friday)
  • Deploy services independently (no full-app re-deploy for a small change)
  • Different tech stacks per service
  • Fault isolation (one service failing doesn't bring down everything)

Challenges: distributed tracing, service discovery, network latency, data consistency.


Q61. What is serverless computing and what are its use cases?

Serverless — write and deploy code without managing servers. Provider automatically scales and bills per invocation.

Use cases: event-driven processing (S3 → Lambda → resize image), API backends, scheduled jobs, webhooks, stream processing, chatbots, IoT data ingestion.

Limitations: cold starts, max execution time (15 min for Lambda), stateless (state must be external), vendor lock-in.


Q62. What is a cold start in serverless computing?

A cold start occurs when a serverless function is invoked after being idle: the provider must spin up a new container, load the runtime, and load the function code. This typically adds roughly 50 ms to 2 seconds of latency.

Mitigation: provisioned concurrency (Lambda keeps instances initialised and ready), lightweight runtimes (Node.js and Python generally start faster than Java), smaller package sizes, scheduled warm-up pings.


Q63. What is cloud-native architecture?

Cloud-native applications are designed specifically to exploit cloud capabilities — built as microservices, containerised, dynamically orchestrated, and managed via DevOps/CI-CD.

12-Factor App principles define cloud-native best practices: stateless processes, config in env vars, disposable processes, dev/prod parity, logs as streams.

Cloud-native stack: Docker + Kubernetes + Helm + Istio + Prometheus/Grafana + ArgoCD + Terraform.


Q64. What is immutable infrastructure?

Immutable infrastructure means servers are never modified after deployment. Instead of patching in-place, you replace the entire server with a new image.

Benefits: no configuration drift, consistent environments, easy rollback (re-deploy previous image), simpler troubleshooting. Enabled by: AMIs (AWS), Docker images, Terraform destroy/apply, GitOps.


Q65. What is an API Gateway?

An API Gateway is the single entry point for all client requests to backend microservices. It handles routing, authentication, rate limiting, SSL termination, request transformation, and logging.

Examples: AWS API Gateway, Kong, NGINX, Traefik, Azure API Management. AWS API Gateway integrates natively with Lambda, ECS, and EC2.


Q66. What is a service mesh?

A service mesh is an infrastructure layer for handling service-to-service communication in a microservices architecture. It provides traffic management, mutual TLS, observability, and circuit breaking — without changing application code.

How it works: a sidecar proxy (such as Envoy) is injected into each pod, and all traffic between services passes through these proxies. Examples: Istio, Linkerd, AWS App Mesh, Consul Connect.


Q67. What is a cloud message queue?

A message queue decouples services by holding messages until the consumer is ready to process them, enabling asynchronous communication.

Examples: AWS SQS (queue), AWS SNS (pub/sub), RabbitMQ, Apache Kafka (event streaming), Azure Service Bus.

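Use case: the order service places an order and sends a message to SQS; the inventory service consumes it asynchronously. To deliver the same order to both an inventory service and an email service, publish to an SNS topic that fans out to one SQS queue per consumer, because each message in a single SQS queue is processed by only one consumer.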
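
The decoupling can be demonstrated in-process with Python's standard queue module standing in for a managed queue such as SQS. This is only an analogy under stated assumptions: the worker function, message shape, and order IDs are hypothetical, and a real queue adds durability, visibility timeouts, and retries.

```python
import queue
import threading

orders = queue.Queue()   # stands in for the message queue
processed = []           # what the consumer has handled so far


def inventory_worker() -> None:
    """Consume messages until the sentinel None arrives."""
    while True:
        message = orders.get()
        if message is None:
            break
        processed.append(f"reserved stock for order {message['order_id']}")


worker = threading.Thread(target=inventory_worker)
worker.start()

# The producer enqueues and moves on without waiting for the consumer.
for order_id in (101, 102, 103):
    orders.put({"order_id": order_id})
orders.put(None)         # sentinel: no more messages

worker.join()
print(processed)
```

The producer never calls the consumer directly, so either side can be slow, restarted, or scaled without the other noticing.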


Q68. What is event-driven architecture?

Event-driven architecture produces, detects, and responds to events. Instead of synchronous request/response, services communicate via events (state changes).

Pattern: Producer emits event → Event bus/broker → Consumer reacts. AWS tools: EventBridge, SNS, SQS, Kinesis, Lambda. Benefits: loose coupling, scalability, real-time processing.
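
The producer, bus, and consumer roles can be sketched with a tiny in-process event bus. All names here (EventBus, the "order.placed" event type, the handlers) are hypothetical, and delivery is synchronous for simplicity, whereas a broker such as EventBridge or SNS delivers asynchronously.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe bus: producers and consumers only share event types."""

    def __init__(self) -> None:
        self.handlers = defaultdict(list)   # event type -> list of handlers

    def subscribe(self, event_type: str, handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Copy the handler list so subscribing during delivery is safe.
        for handler in list(self.handlers.get(event_type, [])):
            handler(payload)


log = []
bus = EventBus()
bus.subscribe("order.placed", lambda e: log.append(f"inventory: {e['order_id']}"))
bus.subscribe("order.placed", lambda e: log.append(f"email: {e['order_id']}"))

bus.publish("order.placed", {"order_id": 101})
bus.publish("order.cancelled", {"order_id": 101})   # no subscribers: ignored
print(log)
```

Adding a third consumer requires no change to the producer, which is the loose coupling the pattern is meant to provide.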


Q69. What is the difference between stateful and stateless applications?

  • Stateless — each request is independent; no session state stored on server. Scale horizontally. Any instance can handle any request. Examples: REST APIs, Lambda functions.
  • Stateful — server retains session state between requests. Harder to scale. Examples: traditional WebSockets, databases, Kafka consumers.

Cloud-native best practice: design stateless services, store state in managed external services (Redis, DynamoDB, S3).
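
The best practice above can be sketched as a handler that keeps nothing between calls. `ExternalStore` is a hypothetical in-memory stand-in for Redis or DynamoDB, and `add_to_cart` is an illustrative name.

```python
class ExternalStore:
    """In-memory stand-in for a shared store such as Redis or DynamoDB."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


def add_to_cart(store: ExternalStore, session_id: str, item: str) -> list:
    """Stateless handler: all input arrives as arguments or lives in the store."""
    cart = store.get(session_id, [])
    updated = cart + [item]          # build a new list rather than mutating shared state
    store.put(session_id, updated)
    return updated


store = ExternalStore()
# Two calls stand in for two different instances behind a load balancer.
first = add_to_cart(store, "session-1", "book")
second = add_to_cart(store, "session-1", "pen")
```

Since the handler holds no session data itself, any instance can serve the second request and still see the first item.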


Q70. What is edge computing and how does it complement cloud computing?

Edge computing processes data closer to where it is generated (IoT devices, end users) rather than sending everything to a central cloud data center. Reduces latency, bandwidth usage, and dependency on connectivity.

Complement to cloud: edge handles real-time, low-latency processing; cloud handles heavy compute, storage, and analytics. Examples: AWS Lambda@Edge, Cloudflare Workers, AWS IoT Greengrass, Azure IoT Edge.


8. Major Cloud Platforms

Q71. What are the differences between AWS, Azure, and GCP?

Aspect | AWS | Azure | GCP
Market share | ~33% (largest) | ~22% | ~11%
Strength | Broadest service catalogue, most mature | Enterprise/hybrid (Microsoft integration) | AI/ML, Kubernetes (GKE), data analytics
Identity | IAM | Azure Active Directory (Entra ID) | Cloud IAM
Kubernetes | EKS | AKS | GKE (most advanced managed K8s)
Serverless | Lambda | Azure Functions | Cloud Functions / Cloud Run
Data warehouse | Redshift | Azure Synapse | BigQuery
Best for | General workloads, startups, enterprise | Microsoft-heavy enterprise, Office 365 integration | AI/ML workloads, data analytics

Q72. What is Amazon EC2?

EC2 (Elastic Compute Cloud) provides resizable virtual machines in AWS.

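Key concepts: Instance types (t3 = burstable general purpose, m5 = general purpose, c5 = compute optimised, r5 = memory optimised, p3 = GPU). Pricing: On-Demand (no commitment, billed by running time), Reserved (1- or 3-year commitment, up to 72% off), Spot (spare capacity, up to 90% off, can be reclaimed at short notice). AMI = pre-configured OS image. Security Groups = stateful instance firewall. Elastic IP = static public IP.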
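
To make the pricing models concrete, here is a small arithmetic sketch. The hourly rate is a made-up figure, not a real AWS price, and the discount figures quoted above are treated as best-case upper bounds rather than guaranteed savings.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def monthly_cost(on_demand_hourly: float, discount: float = 0.0,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Cost of one instance for `hours`, after a fractional discount (0.72 = 72% off)."""
    return round(on_demand_hourly * (1 - discount) * hours, 2)


rate = 0.10  # hypothetical on-demand price in USD per hour

on_demand = monthly_cost(rate)
reserved_best_case = monthly_cost(rate, discount=0.72)
spot_best_case = monthly_cost(rate, discount=0.90)
```

Real discounts depend on instance family, Region, term, and payment option, so production estimates should come from the provider's pricing calculator.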


Q73. What is Amazon S3 and how does it differ from EBS and EFS?

Aspect | S3 | EBS | EFS
Type | Object storage | Block storage | File storage (NFS)
Access | HTTP API, globally | Attach to one EC2 in same AZ | Mount on multiple EC2
Use case | Backups, static websites, media | OS disk, database | Shared files across instances
Durability | 99.999999999% (11 nines) | Replicated within AZ | Multi-AZ by default
Pricing | Per GB stored + requests | Per GB provisioned | Per GB used

Q74. What is AWS Lambda?

Lambda is AWS's serverless compute service: it runs code in response to events without provisioning servers. Triggers: S3, API Gateway, DynamoDB Streams, SQS, SNS, EventBridge (formerly CloudWatch Events), Kinesis. Limits: 15-minute maximum execution time, up to 10 GB memory, 1,000 concurrent executions per Region by default (a soft limit that can be raised). Pricing: per request plus per GB-second of compute. The free tier includes 1M requests and 400,000 GB-seconds per month.
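
As an illustration of the event-driven model, the sketch below shows a Python handler for an S3 trigger. The `lambda_handler(event, context)` signature and the `Records[].s3` event layout follow the standard S3 notification shape. The bucket name in the usage note and the decision to simply echo the object URIs are assumptions made for the example.

```python
import json
import urllib.parse


def lambda_handler(event, context):
    """Entry point invoked by Lambda; `event` is an S3 object-created notification."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}
```

The handler can be exercised locally by passing a hand-built event dictionary and `None` for the context, which is handy for unit tests before deployment.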


Q75. What is AWS Route 53?

Route 53 is AWS's scalable DNS (Domain Name System) web service. It translates domain names to IP addresses, routes internet traffic, and performs health checks.

Routing policies: Simple, Weighted, Latency-based, Geolocation, Geoproximity, IP-based, Failover, Multivalue answer. Use failover routing for DR: Route 53 automatically switches to the healthy endpoint when the primary fails its health checks.
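
The failover policy can be sketched as a health check with a consecutive-failure threshold feeding a resolver. This is a simplified model, not Route 53's implementation: the threshold of 3 is an assumed parameter, recovery on the first success is a simplification, and `HealthCheck` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    failure_threshold: int = 3     # consecutive failures before marking unhealthy
    consecutive_failures: int = 0

    def record(self, success: bool) -> None:
        # Any success resets the streak; any failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def resolve(check: HealthCheck, primary: str, secondary: str) -> str:
    """Failover routing: answer with the primary unless its health check has failed."""
    return primary if check.healthy else secondary
```

Detection time is roughly the check interval multiplied by the failure threshold, which, together with the DNS TTL, puts a floor under the achievable RTO.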


Q76. What is Microsoft Azure Blob Storage?

Azure Blob Storage is Azure's object storage — equivalent to AWS S3. Stores unstructured data (images, videos, backups, logs).

Tiers: Hot (frequently accessed), Cool (infrequent, lower cost), Cold (rarely accessed but still online), Archive (offline, cheapest storage, retrieval can take hours). Integrated with Azure CDN, Azure Data Factory, and Azure ML.


Q77. What is Azure Active Directory?

Azure AD (now Microsoft Entra ID) is Microsoft's cloud identity platform — SSO, MFA, Conditional Access, B2B/B2C federation, and hybrid identity with on-premises Active Directory.

Critical for enterprise Azure deployments. Every Azure resource access is controlled through Azure AD identities and RBAC roles.


Q78. What is Google Cloud BigQuery?

BigQuery is GCP's serverless, petabyte-scale data warehouse. You query massive datasets with SQL without managing servers. Features: columnar storage, separation of compute/storage, built-in ML (BigQuery ML), streaming ingestion, pay-per-query pricing.


Q79. What is AWS Elastic Beanstalk vs ECS?

  • Elastic Beanstalk — PaaS. Upload code, Beanstalk handles provisioning (EC2, load balancer, auto-scaling). No container knowledge required. Less control.
  • ECS (Elastic Container Service) — CaaS. Run Docker containers. More control over infrastructure. Better for microservices.
  • EKS — managed Kubernetes. Full K8s API. Most complex, most powerful.

Q80. What is an AMI (Amazon Machine Image)?

An AMI is a pre-configured virtual machine image used to launch EC2 instances. Contains: OS, application server, application configuration.

Types: Amazon-provided (Amazon Linux, Ubuntu), AWS Marketplace (pre-installed software), community AMIs, custom AMIs (created from your instances). Use custom AMIs for faster scaling — bake all dependencies into the AMI so new instances launch ready to serve.


9. Advanced Cloud Architecture

Q81. What is high availability and how is it achieved in cloud?

High Availability (HA) minimises downtime by eliminating single points of failure.

HA design pattern on AWS:

  1. Deploy across 3+ Availability Zones
  2. Application Load Balancer distributes traffic
  3. Auto Scaling Group replaces unhealthy instances
  4. Multi-AZ RDS with automatic failover
  5. ElastiCache (Redis) with Multi-AZ for session storage
  6. S3 for static assets (99.99% availability)
  7. Route 53 health checks for DNS failover

Q82. What is auto-healing in cloud systems?

Auto-healing automatically detects and replaces failed components without human intervention.

In AWS: Auto Scaling Group monitors EC2 health. If an instance fails health check → ASG terminates it → launches a replacement automatically. In Kubernetes: liveness probes detect unhealthy pods → K8s restarts them. Deployments ensure desired replica count is maintained.
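
Both the Auto Scaling Group and the Kubernetes Deployment behave like a control loop that compares actual state with desired state. The sketch below shows one pass of such a loop. `reconcile`, `launch_replacement`, and the instance IDs are hypothetical, and real controllers also handle draining, cooldowns, and rate limits.

```python
import itertools


def reconcile(instances: dict, desired: int, launch) -> dict:
    """One control-loop pass: drop unhealthy instances, then top up to `desired`."""
    healthy = {name: ok for name, ok in instances.items() if ok}
    while len(healthy) < desired:
        healthy[launch()] = True   # a replacement is assumed healthy once launched
    return healthy


_ids = itertools.count(1)


def launch_replacement() -> str:
    """Pretend to launch an instance and return its ID."""
    return f"i-new{next(_ids)}"


fleet = {"i-a": True, "i-b": False, "i-c": True}   # i-b failed its health check
fleet = reconcile(fleet, desired=3, launch=launch_replacement)
```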


Q83. How would you design a multi-region disaster recovery solution?

Active-Passive DR design:

  1. Primary region — full production stack running
  2. DR region — warm standby (scaled-down running copy)
  3. Data replication — S3 Cross-Region Replication, RDS Read Replica in DR region, DynamoDB Global Tables
  4. DNS failover — Route 53 health checks detect primary failure → switch DNS to DR region (RTO: minutes)
  5. Regular DR drills — test failover quarterly

Key metrics: RTO (how long to recover) and RPO (how much data loss is acceptable).


Q84. What is disaster recovery in cloud computing?

DR is the process of restoring systems after catastrophic failure. DR strategies by RTO/RPO/cost:

  1. Backup & Restore — cheapest, highest RTO/RPO (hours)
  2. Pilot Light — minimal core services running in DR region, scale up on failure (minutes)
  3. Warm Standby — scaled-down fully functional copy (minutes)
  4. Multi-Site Active/Active — full capacity in multiple regions simultaneously (near-zero RTO/RPO, most expensive)

Q85. How would you design a scalable microservices architecture?

Key design decisions:

  1. Service boundaries — define by business capability (Order Service, Payment Service, Inventory Service)
  2. Communication — synchronous (REST/gRPC) for real-time, async (SQS/Kafka) for decoupled workflows
  3. Data isolation — each service owns its own database (database-per-service pattern)
  4. API Gateway — single entry point, handles auth and routing
  5. Service discovery — Kubernetes Services or AWS Cloud Map
  6. Circuit breaker — prevent cascade failures (Resilience4j, Istio)
  7. Distributed tracing — track requests across services (Jaeger, AWS X-Ray, Datadog APM)
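
The circuit breaker in step 6 can be sketched in a few lines. This is a minimal illustration, not how Resilience4j or Istio implement it. `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open gives the downstream service time to recover instead of piling more requests onto it.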

Q86. What is the role of a firewall in cloud computing?

Cloud firewalls control inbound/outbound traffic to cloud resources.

Layers:

  • Security Groups — stateful, instance-level (EC2, RDS). Allow rules only.
  • Network ACLs — stateless, subnet-level. Allow and deny rules.
  • WAF (Web Application Firewall) — Layer 7, protects against OWASP Top 10 (SQL injection, XSS). AWS WAF, Cloudflare WAF.
  • AWS Network Firewall — managed stateful firewall for VPC perimeter.

10. Cloud Optimization and Management

Q87. What is vendor lock-in and how do you handle it?

Vendor lock-in is a deep dependency on a single cloud provider's proprietary services that makes migration very costly.

Mitigation: use Kubernetes (portable) instead of ECS-only, use Terraform (vs CloudFormation), open-source databases (PostgreSQL vs Aurora), abstraction layers in code, multi-cloud architecture for critical services.


Q88. What strategies do you use to optimise cloud costs?

  1. Right-sizing — downsize over-provisioned instances based on actual usage metrics
  2. Reserved Instances/Savings Plans — commit to 1 or 3 years for up to 72% discount
  3. Spot Instances — up to 90% discount for fault-tolerant batch/stateless workloads that tolerate interruption
  4. Auto Scaling — scale down during off-peak hours (night, weekends)
  5. S3 lifecycle policies — move cold data to Glacier automatically
  6. Delete idle resources — unattached EBS volumes, unused Elastic IPs, orphaned snapshots
  7. Serverless — pay per invocation, zero cost when idle
  8. FinOps tooling — AWS Cost Explorer, Kubecost, CloudHealth, Infracost

Q89. How would you handle updates and patches in cloud infrastructure?

Best practice: immutable infrastructure — replace instead of patch.

  1. Build new AMI/container image with patches applied
  2. Test in staging
  3. Rolling deploy via Auto Scaling (launch new instances, terminate old ones)
  4. Blue/green deployment for zero downtime

For OS patches: AWS Systems Manager Patch Manager, AWS SSM Run Command. For containers: rebuild image from patched base image, push to ECR, roll out via Kubernetes.


Q90. What are the key metrics to monitor in cloud infrastructure?

Infrastructure: CPU utilisation, memory, disk I/O, network in/out, instance health. Application: request rate (RPS), error rate, response latency (p50/p95/p99), queue depth. Database: connections, query latency, replication lag, IOPS. Cost: spend by service/team, reserved instance utilisation, savings plan coverage. Security: failed login attempts, IAM changes, config changes (CloudTrail).

Golden Signals (SRE): Latency, Traffic, Errors, Saturation.
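
Latency percentiles matter because averages hide tail behaviour. The sketch below uses the nearest-rank method on made-up sample data. Monitoring tools may use interpolation or histogram approximations instead, so treat it as an illustration of the idea rather than the calculation any specific tool performs.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 14, 900]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the median stays close to typical requests while the mean and p95 are pulled up by the outliers, which is why alerting on p95/p99 is usually more informative than alerting on the mean.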


Q91. What is data governance in cloud and how do you ensure compliance?

Cloud data governance ensures data is managed consistently, securely, and in compliance with regulations.

Key practices: data classification (PII, confidential, public), data lineage tracking, access controls (RBAC on databases), data retention policies, audit logging (CloudTrail, S3 access logs), compliance frameworks (GDPR, HIPAA, PCI-DSS, SOC2).

AWS tools: AWS Config (compliance rules), AWS Macie (PII detection in S3), Lake Formation (data lake access control).


Q92. What are the challenges of cloud computing?

  1. Security and compliance — shared infrastructure, data sovereignty
  2. Cost management — easy to over-spend, complex billing
  3. Vendor lock-in — proprietary services create migration barriers
  4. Complexity — managing distributed systems at scale
  5. Performance unpredictability — noisy neighbour problem in shared infrastructure
  6. Data transfer costs — egress fees when moving data out of cloud
  7. Skills gap — cloud expertise is scarce and expensive

Q93. How does AI integrate with cloud platforms?

Every major cloud now offers managed AI/ML services:

  • AWS: SageMaker (ML platform), Bedrock (LLM APIs — Claude, Llama), Rekognition (vision), Comprehend (NLP)
  • Azure: Azure OpenAI Service (GPT-4, DALL-E), Azure ML, Cognitive Services
  • GCP: Vertex AI, Gemini API, BigQuery ML, AutoML

Pattern: use managed AI services via API rather than building from scratch. Cloud provides the GPU infrastructure and model serving infrastructure.


Q94. How might quantum computing impact cloud infrastructure?

Quantum computing uses quantum mechanics (superposition, entanglement) to solve certain problems exponentially faster than classical computers.

Cloud impact:

  • Current public-key encryption (RSA, ECC) could be broken by a sufficiently large, fault-tolerant quantum computer, so cloud providers are preparing post-quantum cryptography
  • Quantum as-a-service: AWS Braket, Azure Quantum, GCP (Cirq) — access quantum hardware via cloud APIs
  • Near-term use cases: drug discovery, financial optimisation, logistics, materials science
  • Timeline: mainstream quantum advantage is still 5-10+ years away

12. Practical Scenario Questions

Q95. How would you handle a cloud service outage affecting a critical application?

Incident response steps:

  1. Detect — CloudWatch alarms, PagerDuty alert fires
  2. Assess — check AWS Service Health Dashboard, identify affected services/regions
  3. Communicate — notify stakeholders, update status page
  4. Mitigate — activate DR plan: failover to backup region, switch DNS via Route 53 health checks
  5. Resolve — wait for provider fix or keep serving from DR
  6. Post-mortem — document timeline, impact, root cause, and preventive measures

Q96. How would you migrate an on-premises application to Azure?

  1. Assess — Azure Migrate to discover on-prem VMs, assess dependencies
  2. Plan — choose migration strategy (rehost to Azure VMs vs replatform to App Service)
  3. Replicate — use Azure Site Recovery (ASR) to replicate VMs to Azure
  4. Test — test failover, validate application works in Azure
  5. Cutover — final sync, update DNS, cut traffic to Azure
  6. Optimise — right-size VMs, set up auto-scaling, enable backups

Q97. How would you secure a cloud-native application against data breaches?

Defence-in-depth:

  1. IAM — least-privilege, MFA on all accounts, no hardcoded credentials (use Secrets Manager)
  2. Network — private subnets, Security Groups, WAF, DDoS protection
  3. Data — encrypt at rest (KMS) and in transit (TLS), S3 bucket policies, block public access by default
  4. Application — SAST/DAST in CI/CD pipeline, dependency scanning (Snyk), container image scanning
  5. Monitoring — GuardDuty (threat detection), CloudTrail (audit), Security Hub (compliance)
  6. Incident response — automated remediation (Lambda + EventBridge), documented runbooks

Q98. How would you design a fault-tolerant cloud architecture?

Fault-tolerant = the system continues operating even when components fail.

Design principles:

  1. Eliminate single points of failure — multi-AZ for every tier
  2. Degrade gracefully — serve cached responses when DB is unavailable
  3. Health checks + auto-healing — ASG replaces unhealthy EC2 automatically
  4. Circuit breakers — stop calling failing downstream services
  5. Retry with exponential backoff — retry transient failures without thundering herd
  6. Bulkhead pattern — isolate resources to prevent cascade failures
  7. Chaos engineering — proactively inject failures to test resilience (Netflix Chaos Monkey)
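
Principle 5 can be sketched as a retry helper with exponential backoff and full jitter. The function name, defaults, and the injectable `sleep` and `rng` parameters are assumptions made for this example. A production version would catch only exceptions known to be transient.

```python
import random
import time


def retry(func, attempts: int = 5, base_delay: float = 0.1, max_delay: float = 5.0,
          sleep=time.sleep, rng=random.random):
    """Call `func`, retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, ceiling) spreads clients out,
            # which avoids synchronised retry storms (the thundering herd).
            sleep(rng() * ceiling)
```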

Q99. How would you approach capacity planning for a rapidly growing application?

  1. Baseline metrics — measure current CPU, memory, RPS, and DB connections at peak
  2. Growth projections — estimate traffic growth over 3/6/12 months
  3. Load testing — simulate projected traffic with k6, Gatling, or Locust
  4. Set auto-scaling policies — ensure ASG and HPA can handle 3-5x current peak
  5. Database scaling plan — plan for read replicas, connection pooling (RDS Proxy), or migration to DynamoDB
  6. Reserved capacity — buy reserved instances for baseline, use spot/on-demand for spikes
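
Steps 1, 2, and 4 reduce to simple compound-growth arithmetic. The figures below (1,000 RPS today, 10% monthly growth, 200 RPS per instance, 30% headroom) are hypothetical inputs chosen for illustration. Real values would come from the baseline metrics and load tests.

```python
import math


def projected_peak(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Compound growth: peak traffic after `months` at a fixed monthly growth rate."""
    return current_peak_rps * (1 + monthly_growth) ** months


def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.3) -> int:
    """Instances required to serve `peak_rps` with spare capacity for bursts."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


peak_in_a_year = projected_peak(1000, 0.10, 12)
fleet_size = instances_needed(peak_in_a_year, rps_per_instance=200)
```

The per-instance throughput should be taken from load testing, since it rarely scales linearly once a shared dependency such as the database saturates.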

Q100. Describe how you would optimise cloud costs for an organisation.

Immediate wins (week 1):

  • Delete unattached EBS volumes, unused Elastic IPs, old snapshots
  • Enable S3 lifecycle rules to move old data to Glacier
  • Stop non-production instances outside business hours

Short-term (month 1):

  • Right-size over-provisioned instances using CloudWatch metrics
  • Buy Reserved Instances for stable production workloads (1-year commitment)
  • Migrate batch jobs to Spot Instances

Long-term:

  • Adopt serverless for event-driven workloads
  • Set up FinOps practice: tagging all resources, team-level cost allocation, monthly reviews
  • Use Savings Plans for flexible commitment discounts

13. Cloud Deployment — Additional Topics

Q101. What is cloud-init?

cloud-init is the industry standard tool for cloud instance initialisation. When a new VM (EC2, Azure VM, GCE) is launched, cloud-init runs on first boot to configure the instance — install packages, set hostnames, create users, write config files, run scripts.

Example (EC2 User Data):

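#!/bin/bash
# Passed as EC2 user data; cloud-init runs it once, as root, on first boot.
# Assumes a Debian/Ubuntu image (apt-get).
set -e                          # stop at the first failing command
apt-get update
apt-get install -y nginx
systemctl enable --now nginx    # start now and on every reboot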

This runs automatically when the instance first starts. Used in auto-scaling groups to bootstrap instances automatically.


Q102. What is a cloud instance?

A cloud instance is a virtual machine running in the cloud. It is a computing environment with a defined amount of CPU, memory, storage, and networking capacity, created from a machine image (AMI on AWS).

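Instance lifecycle (EC2): Pending → Running → Stopping → Stopped, and from Running or Stopped → Shutting-down → Terminated. You are billed for compute only while an instance is running; a stopped instance incurs no compute charge, but its attached EBS volumes are still billed.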


Q103. What is resource pooling in cloud computing?

Resource pooling is a core cloud characteristic where the provider's computing resources (compute, storage, networking) are pooled to serve multiple consumers using a multi-tenant model. Resources are dynamically assigned and reassigned based on consumer demand.

The customer typically has no control over the exact physical location of resources (though you can specify region/AZ). This pooling enables economies of scale that make cloud more cost-effective than dedicated hardware.


Q104. What is measured service in cloud computing?

Measured service means cloud resource usage is monitored, controlled, and reported — enabling pay-per-use billing. You pay only for what you consume (compute hours, GB stored, API calls, GB transferred).

Examples: AWS bills EC2 by instance running time (per second for most Linux and Windows instances, with a 60-second minimum), S3 per GB stored, and Lambda per million requests. This metering provides transparency and enables cost optimisation.


Q105. What is cloud performance monitoring?

Cloud performance monitoring tracks the performance of cloud infrastructure and applications to ensure they meet SLAs and user expectations.

Key metrics monitored:

  • Infrastructure: CPU, memory, disk I/O, network throughput
  • Application: response time, error rate, request rate (RPS)
  • Database: query latency, connections, slow queries
  • User experience: page load time, Apdex score

Tools: AWS CloudWatch, Datadog, New Relic, Dynatrace, Grafana + Prometheus.


Q106. What is cloud log management?

Cloud log management involves collecting, storing, searching, and analysing log data from cloud resources — application logs, access logs, audit logs, infrastructure logs.

Best practices: centralise logs in one service (CloudWatch Logs, ELK Stack, Datadog), set retention policies, enable structured logging (JSON), set up alerts on error patterns.

AWS: CloudWatch Logs + Logs Insights for querying. Ship logs from EC2 via CloudWatch Agent, from Lambda automatically.
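
Structured logging can be sketched with Python's standard `logging` module. `JsonFormatter`, the `orders` logger name, and the `request_id` field are illustrative choices. Many teams use an existing library for this rather than writing their own formatter.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        if hasattr(record, "request_id"):
            payload["request_id"] = record.request_id
        return json.dumps(payload)


handler = logging.StreamHandler()          # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"request_id": "req-123"})
```

One JSON object per line lets a log service index fields such as `level` or `request_id` directly, instead of parsing free text with regular expressions.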


Q107. What is cloud configuration management?

Cloud configuration management ensures cloud infrastructure is consistently configured according to defined standards — preventing configuration drift (where servers gradually diverge from their desired state).

Tools: Ansible (agentless, YAML playbooks), Chef, Puppet, AWS Systems Manager State Manager (enforce configuration on EC2 fleets), AWS Config (detect and alert on non-compliant resource configurations).


Q108. What is a cloud management platform?

A Cloud Management Platform (CMP) is a suite of tools for managing multi-cloud environments from a single interface — provisioning, governance, cost management, compliance, and automation.

Examples: CloudHealth (VMware), Morpheus, Apptio Cloudability, AWS Management Console (single cloud), Azure Arc (hybrid/multi-cloud management).


Q109. What is cloud compliance?

Cloud compliance ensures cloud deployments meet regulatory and industry standards — GDPR (EU data privacy), HIPAA (healthcare), PCI-DSS (payment card), SOC2 (security controls), ISO 27001.

How to achieve it: use compliance frameworks, enable audit logging (CloudTrail), use CSPM tools to detect violations, encrypt sensitive data, implement IAM least privilege, maintain documentation for auditors.

AWS: AWS Artifact provides compliance reports. AWS Config has pre-built conformance packs for PCI-DSS, HIPAA, CIS benchmarks.


Q110. What is cloud arbitrage?

Cloud arbitrage is the strategy of dynamically shifting workloads between cloud providers or regions to take advantage of price differences, performance advantages, or resource availability.

Example: run batch jobs on whichever cloud has the lowest spot instance prices at that moment. Requires multi-cloud tooling (Terraform, Crossplane) and application portability (Kubernetes).


14. Cloud Migration — Additional Topics

Q111. What is cloud repatriation?

Cloud repatriation (reverse migration) is moving workloads BACK from public cloud to on-premises or private cloud — the opposite of cloud migration.

Reasons: unexpected high cloud costs for predictable workloads, data sovereignty requirements, latency issues, better ROI with owned hardware at scale.

Example: Dropbox famously moved most of its infrastructure from AWS to its own data centers, reportedly saving about $75M over two years. This works when workloads are predictable and stable at scale.


Q112. What is cloud data integration?

Cloud data integration connects data from multiple sources (on-premises databases, SaaS applications, cloud storage, APIs) into a unified view for analytics or operational use.

Approaches: ETL (Extract, Transform, Load), ELT (Extract, Load, Transform — load raw first, transform in cloud), CDC (Change Data Capture — stream database changes in real time).

Tools: AWS Glue, Azure Data Factory, Fivetran, Airbyte, Apache Kafka (real-time), dbt (transformation).


Q113. What is a cloud ETL service?

ETL (Extract, Transform, Load) services extract data from sources, transform it (clean, filter, aggregate), and load it into a destination (data warehouse, data lake).

Cloud-managed ETL services:

  • AWS Glue — serverless ETL with auto-generated code
  • Azure Data Factory — visual pipeline designer
  • GCP Cloud Dataflow — Apache Beam-based stream and batch processing
  • Fivetran, Airbyte — no-code connectors for SaaS data sources
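
The three ETL stages can be shown end to end with the standard library alone. The CSV content, column names, and cleaning rules below are invented for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row has a missing amount, country codes are inconsistent.
RAW = "order_id,amount,country\n1,10.50,de\n2,,fr\n3,4.25,DE\n"


def extract(text: str) -> list:
    """Extract: parse the CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Transform: drop rows without an amount, cast types, normalise country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((int(row["order_id"]), float(row["amount"]), row["country"].upper()))
    return cleaned


def load(rows: list, conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
totals = conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country").fetchall()
```

In an ELT variant, the raw rows would be loaded first and the cleaning step would run as SQL inside the warehouse.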

15. Cloud Security — Additional Topics

Q114. What are the key components of AWS IAM?

Component | Description
Users | Individual identities with long-term credentials (username/password, access keys)
Groups | Collection of IAM users; assign policies to groups, not individual users
Roles | Temporary identities assumed by services, applications, or users (no permanent credentials)
Policies | JSON documents defining what actions are allowed/denied on which resources
Permission Boundaries | Limit the maximum permissions a user or role can have
Identity Providers | Federate external identity (Okta, Google, Active Directory) with AWS

Best practice: use roles (not users) for EC2/Lambda, enable MFA on root account, never use root for daily tasks, audit permissions with IAM Access Analyzer.
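
A policy document is plain JSON, so it can be built and inspected as a Python dictionary. The sketch below generates a least-privilege, read-only policy for a single S3 bucket. The helper name and the bucket name are hypothetical. The `Version`, `Statement`, `Effect`, `Action`, and `Resource` keys follow the IAM policy grammar.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy allowing listing and reading of one bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",       # the bucket itself
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",     # objects inside the bucket
            },
        ],
    }


policy = read_only_bucket_policy("example-reports")   # hypothetical bucket name
policy_json = json.dumps(policy, indent=2)
```

Note that `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to object ARNs. Mixing the two up is a common cause of unexpected access-denied errors.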


Q115. What is a SIEM in cloud?

SIEM (Security Information and Event Management) collects, correlates, and analyses security events from across cloud infrastructure in real time to detect threats and support incident response.

Cloud SIEM capabilities: log ingestion from all cloud services, correlation rules (e.g., "failed login + privilege escalation = alert"), threat intelligence integration, automated response playbooks.

Examples: Microsoft Sentinel (Azure-native), Splunk, Sumo Logic, AWS Security Lake + third-party SIEM, IBM QRadar.


Q116. What is cloud forensics?

Cloud forensics is the application of digital forensics techniques to cloud environments — collecting, preserving, and analysing evidence after a security incident.

Challenges unique to cloud: shared infrastructure (can't image physical servers), ephemeral resources (Lambda functions, containers disappear), multi-jurisdiction (data across countries), provider access limitations.

Techniques: capture CloudTrail logs, VPC Flow Logs, memory dumps from EC2, container runtime forensics (Sysdig Falco). Preserve evidence before terminating compromised instances.


Q117. What are the security best practices for Amazon EC2?

  1. Use IAM roles — not access keys embedded in EC2 instances
  2. Security Groups — least-privilege inbound rules (no 0.0.0.0/0 for SSH)
  3. Harden remote access — disable SSH password authentication and use key pairs; prefer SSM Session Manager, which needs no inbound port 22, or EC2 Instance Connect for short-lived keys
  4. Patch regularly — use SSM Patch Manager for automated OS patching
  5. Enforce IMDSv2 — session tokens mitigate SSRF attacks against the instance metadata service
  6. Private subnets — keep application servers in private subnets, expose only load balancer publicly
  7. Enable CloudWatch monitoring — detect unusual CPU/network behaviour
  8. EBS encryption — enable by default for all new volumes

Q118. What are the risks of shadow IT in cloud?

Shadow IT = employees using cloud services (S3 buckets, SaaS tools) without IT approval or knowledge.

Risks: sensitive data in uncontrolled storage, no encryption or access controls, compliance violations, data breaches, no visibility for security team.

Mitigation: use a CASB to discover and control cloud app usage, set up AWS Organizations SCPs to restrict service usage, educate employees, provide approved self-service alternatives.


Q119. How do you conduct a security audit for cloud infrastructure?

  1. IAM audit — review all users, roles, policies. Use IAM Access Analyzer, remove unused credentials (Credential Report)
  2. Network audit — check Security Group rules for overly permissive access (0.0.0.0/0)
  3. Storage audit — check for public S3 buckets, unencrypted volumes (AWS Trusted Advisor)
  4. Logging audit — verify CloudTrail enabled in all regions, VPC Flow Logs enabled, S3 access logging enabled
  5. Compliance check — run AWS Config conformance packs against CIS benchmark
  6. Penetration testing — authorised testing of cloud environment
  7. CSPM scan — Wiz/Orca/Security Hub automated findings

Q120. What are key considerations for GDPR compliance in cloud?

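  1. Data residency and transfers — know where personal data is stored; keeping EU residents' data in EU regions is the simplest approach, and transfers outside the EEA need a legal mechanism such as Standard Contractual Clauses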
  2. Data minimisation — only collect and retain what is necessary
  3. Encryption — encrypt personal data at rest and in transit
  4. Access controls — role-based access, audit logging for data access
  5. Data subject rights — ability to export or delete user data on request
  6. Breach notification — notify the supervisory authority within 72 hours of becoming aware of a breach, so fast detection matters (GuardDuty + SIEM)
  7. Data Processing Agreements — sign DPAs with cloud providers (AWS has a standard DPA)
  8. Right to be forgotten — implement data deletion workflows

16. Modern Architecture — Additional Topics

Q121. What is a container orchestration platform?

A container orchestration platform automates the deployment, scaling, networking, and management of containers across a cluster of machines.

Key capabilities: scheduling (place containers on appropriate nodes), service discovery, load balancing, rolling updates, self-healing (restart failed containers), secrets management.

Kubernetes is the dominant standard. Cloud-managed options: EKS (AWS), AKS (Azure), GKE (GCP). Alternatives: HashiCorp Nomad, Docker Swarm (legacy).


Q122. What is a cloud function?

A cloud function (FaaS — Function as a Service) is a small, single-purpose piece of code that runs in response to an event. The cloud provider manages the infrastructure completely.

Properties: stateless, event-triggered, ephemeral, auto-scaling, pay-per-execution.

Examples: AWS Lambda, Azure Functions, GCP Cloud Functions. Each function handles one specific task — resize image, send email, process payment event, validate form input.


Q123. What are the challenges of serverless architecture?

  1. Cold starts — latency on first invocation after idle period
  2. Execution limits — max 15 minutes (Lambda), not suitable for long-running tasks
  3. Stateless — state must be stored externally (DynamoDB, S3, Redis)
  4. Debugging complexity — harder to reproduce issues locally; distributed tracing needed (X-Ray)
  5. Vendor lock-in — Lambda-specific event schemas tie you to AWS
  6. Local development — testing event-driven functions locally is harder (use SAM, Serverless Framework)
  7. Monitoring — agent-based APM is a poor fit for short-lived functions; use tooling with serverless support (X-Ray, Datadog, Lumigo)
  8. Cost at high scale — at very high throughput, containers can be cheaper

Q124. What is DevOps and how does it relate to cloud computing?

DevOps is a cultural and technical movement combining development (Dev) and operations (Ops) to shorten the software delivery cycle through automation, collaboration, and continuous improvement.

DevOps + Cloud synergy:

  • Cloud provides on-demand infrastructure (no waiting for hardware)
  • IaC (Terraform, CDK) makes infrastructure version-controlled like code
  • Managed services (RDS, EKS, Lambda) reduce operational burden
  • Cloud CI/CD services (CodePipeline, GitHub Actions) enable fast automated delivery
  • Cloud monitoring (CloudWatch, Datadog) provides feedback loop

Cloud is the natural environment for DevOps practices.


Q125. What is the role of DevOps in modern cloud environments?

DevOps engineers in cloud are responsible for:

  • Designing and maintaining CI/CD pipelines
  • Writing and maintaining IaC (Terraform, CloudFormation)
  • Managing Kubernetes clusters and container platforms
  • Setting up monitoring, alerting, and observability
  • Implementing cloud security controls
  • Enabling developer self-service through internal developer platforms
  • Incident management and on-call rotation
  • Optimising cloud costs

Q126. What are APIs in cloud computing?

APIs (Application Programming Interfaces) are the primary way to interact with cloud services. Every cloud provider exposes APIs to provision resources, manage services, and query data programmatically.

Types in cloud:

  • REST APIs — most common (AWS, GCP, Azure all use REST)
  • gRPC — binary protocol, faster for microservice-to-service communication
  • GraphQL — flexible query language (used in some SaaS products)
  • Event APIs — webhooks, event streams (SNS, EventBridge)

AWS SDK, Azure SDK, GCP client libraries wrap the underlying REST APIs for developers.


Q127. What is a webhook in cloud applications?

A webhook is an HTTP callback — instead of polling an API repeatedly, a server sends an HTTP POST request to a specified URL when an event occurs.

Example: GitHub sends a webhook to your CI/CD system when code is pushed. Stripe sends a webhook when a payment succeeds. AWS SNS delivers notifications to HTTP endpoints.

Cloud pattern: API Gateway endpoint → Lambda → process webhook → update database.
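
The cloud pattern above can be sketched as a Lambda handler behind an API Gateway proxy integration. The signature check is an addition not mentioned in the text but common in practice. The `x-signature` header name, the `SECRET` value, and the `sign` helper are hypothetical, since each provider documents its own header and signing scheme.

```python
import hashlib
import hmac
import json

SECRET = b"replace-me"  # hypothetical shared secret; load it from a secrets manager in practice


def sign(body: str) -> str:
    """HMAC-SHA256 of the raw request body, hex encoded."""
    return hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()


def lambda_handler(event, context):
    """Verify the sender, parse the payload, and acknowledge quickly."""
    body = event.get("body") or ""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    supplied = headers.get("x-signature", "")   # hypothetical header name
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(supplied, sign(body)):
        return {"statusCode": 401, "body": "invalid signature"}
    payload = json.loads(body)
    # A real handler would update the database here, or enqueue the event
    # and return immediately so the sender does not time out.
    return {"statusCode": 200, "body": json.dumps({"received": payload.get("type")})}
```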


Q128. What is the difference between synchronous and asynchronous processing in cloud?

Aspect | Synchronous | Asynchronous
Model | Request waits for response | Request submitted, response comes later
Coupling | Tight — caller blocked until done | Loose — caller continues immediately
Failure handling | Immediate error if downstream fails | Messages queued; retry on failure
Scalability | Limited by slowest service | Queue absorbs bursts
Use case | Real-time API (user login, search) | Order processing, email sending, batch jobs
Cloud tools | API Gateway + Lambda (sync) | SQS + Lambda, SNS, EventBridge

Q129. What are cloud-specific design patterns?

Pattern | Description
Ambassador | Proxy for outbound calls — adds retry, circuit breaker, logging
Circuit Breaker | Stop calling failing service; fail fast to prevent cascade failure
CQRS | Separate read and write models for scalability
Event Sourcing | Store state changes as events; rebuild state by replaying events
Saga | Manage distributed transactions across microservices
Strangler Fig | Gradually replace legacy monolith with microservices
Bulkhead | Isolate resource pools to prevent one failure from consuming all resources
Retry with backoff | Retry transient failures with exponential delay to avoid thundering herd
Sidecar | Attach helper containers to main app container (logging, service mesh proxy)

17. Azure & GCP — Additional Services

Q130. What is Azure App Service?

Azure App Service is a fully managed PaaS for hosting web apps, REST APIs, and mobile backends. It supports .NET, Java, Python, Node.js, PHP, Ruby. You deploy code; Azure manages the OS, patching, scaling, and load balancing.

Features: auto-scaling, custom domains with free SSL, deployment slots (blue/green), GitHub/Azure DevOps CI/CD integration, VNet integration for private backend access.

Equivalent: AWS Elastic Beanstalk, GCP App Engine.


Q131. How does Google Cloud Storage compare to AWS S3?

Feature | AWS S3 | Google Cloud Storage (GCS)
Access control | Bucket policies, ACLs, IAM | Uniform bucket-level access (recommended), IAM, ACLs
Storage classes | Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier (Instant Retrieval, Flexible Retrieval, Deep Archive) | Standard, Nearline, Coldline, Archive
Strong consistency | Default (since Dec 2020) | Always strongly consistent
Large-object upload | Multipart upload | Resumable uploads
Lifecycle rules | Yes | Yes
Globally unique bucket names | Yes | Yes

Both are highly durable (11 nines), scalable object stores. Both now provide strong read-after-write consistency, so consistency is no longer a differentiator between them.


Q132. How does GCP support machine learning workloads?

GCP has strong ML/AI capabilities:

  • Vertex AI — managed ML platform (AutoML, custom training, model deployment, MLOps)
  • Tensor Processing Units (TPUs) — custom Google accelerators for training and serving large ML models; often more cost-efficient than GPUs for large TensorFlow and JAX workloads
  • BigQuery ML — run ML models directly in SQL on BigQuery data
  • Gemini API — access Google's LLM models
  • Pre-trained APIs — Vision AI, Natural Language AI, Speech-to-Text, Translation API

GCP is preferred for ML workloads involving TensorFlow and large-scale data in BigQuery.


Q133. What is a cloud resource group?

A resource group (Azure) is a logical container that holds related Azure resources for an application — VMs, databases, storage, networking — managed as a single unit.

Benefits: deploy/delete all resources together, apply RBAC and policies at group level, unified billing view, ARM templates target resource groups.

AWS equivalent: Tags (no formal grouping, but you tag resources with project/team) or AWS Resource Groups (group by tags or CloudFormation stack).


Q134. What is a cloud marketplace?

A cloud marketplace is an online store where you can find, test, buy, and deploy pre-configured software and services from third-party vendors, integrated with your cloud account.

Examples: AWS Marketplace (15,000+ listings), Azure Marketplace, GCP Marketplace. Products: pre-configured AMIs, SaaS subscriptions, ML models, security tools, databases, developer tools. Billing is consolidated into your cloud bill.


18. Architecture — Additional Topics

Q135. What is a cloud service broker?

A cloud service broker (CSB) is an intermediary that helps organisations manage multiple cloud services — acting as a negotiator, aggregator, and integrator between cloud consumers and multiple cloud providers.

CSBs handle: vendor negotiation, unified billing, compliance management, identity federation across clouds, service catalogue management.

Examples: IBM Cloud Brokerage, Flexera, Apptio Cloudability.


Q136. What is the role of a systems integrator in cloud?

A systems integrator (SI) in cloud helps organisations design, implement, and manage cloud solutions — bridging legacy on-premises systems with cloud infrastructure.

SIs handle: cloud strategy and architecture, migration projects, custom integration development, managed services, training. Major cloud SIs: Accenture, Deloitte, Infosys, Wipro, Capgemini, TCS.


19. Optimization — Additional Topics

Q137. How do you optimize database performance in cloud?

  1. Connection pooling — use RDS Proxy (AWS) to manage connection pools, reducing overhead
  2. Read replicas — offload read traffic from primary database
  3. Caching — use ElastiCache (Redis) to cache frequent queries; in read-heavy workloads most repeated reads can be served from cache instead of the database
  4. Query optimisation — add indexes on columns used in WHERE/JOIN, avoid SELECT *
  5. Vertical scaling — upgrade to larger DB instance class for CPU/memory-bound queries
  6. Partitioning — partition large tables by date/range for faster scans
  7. Aurora Serverless — auto-scales database capacity automatically
  8. Database-per-service — prevent noisy-neighbour between microservices
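
The caching item above usually means the cache-aside pattern. The sketch below uses a dict in place of Redis and a hypothetical load_from_db callable; the class name, TTL default, and db_hits counter are illustrative assumptions.

```python
import time


class CacheAside:
    """Cache-aside: check the cache first, fall back to the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db   # hypothetical callable: key -> value
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}            # key -> (expires_at, value); stands in for Redis
        self.db_hits = 0

    def get(self, key):
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit
        self.db_hits += 1           # miss or expired entry: query the database
        value = self._load(key)
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after a write so readers do not keep seeing stale data.
        self._store.pop(key, None)
```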

Q138. How would you troubleshoot performance issues in a cloud application?

Systematic approach:

  1. Identify the bottleneck — check CloudWatch metrics (CPU, memory, DB connections, queue depth)
  2. Analyse logs — search for errors, slow queries, timeouts (CloudWatch Logs Insights)
  3. Distributed tracing — trace a slow request across services (AWS X-Ray, Jaeger, Datadog APM)
  4. Load test — reproduce under controlled load to confirm root cause
  5. Database — check slow query log, explain query plans, look for missing indexes
  6. Network — check VPC flow logs for unusual latency, NAT Gateway metrics
  7. Application code — profiling, look for N+1 queries, synchronous blocking calls
  8. Fix and verify — deploy fix, monitor metrics to confirm improvement

20. Practical Scenarios

Q139. Describe a situation where you migrated an application to the cloud.

Example answer structure:

"In a previous project, we migrated a monolithic Java application from on-premises to AWS. The main challenges were:

  1. Database migration — used AWS Database Migration Service (DMS) with continuous replication (CDC), keeping replication lag under 1 second during cutover
  2. Zero-downtime deployment — used blue/green deployment: spun up the new AWS environment, tested it, then cut over DNS via Route 53 weighted routing (10% → 50% → 100%)
  3. Stateful sessions — moved session storage from in-memory to ElastiCache Redis so any instance could handle any request
  4. Secrets — moved all hardcoded credentials to AWS Secrets Manager
  5. Monitoring gap — set up CloudWatch dashboards and alarms from day one

Result: 40% infrastructure cost reduction, deployment frequency improved from monthly to daily."


Q140. How would you explain a complex cloud concept to a non-technical stakeholder?

Analogies that work:

  • Cloud computing = "renting computing power from a massive shared power plant instead of buying your own generator"
  • Auto-scaling = "a restaurant that automatically adds more chefs during lunch rush and sends them home after"
  • Load balancer = "a traffic cop directing customers to whichever checkout lane has the shortest queue"
  • CDN = "storing copies of your website's photos in warehouses close to every customer, so delivery is instant"
  • Microservices = "instead of one big factory that makes everything, you have specialised shops — each does one thing really well"

Key principle: map cloud concepts to things stakeholders already understand (traffic, warehouses, restaurants, factories).


Q141. What steps would you take to recover data lost due to cloud misconfiguration?

  1. Stop the bleeding — identify the misconfiguration (accidental S3 bucket deletion, wrong lifecycle policy) and fix it immediately to prevent further loss
  2. Check versioning — if S3 versioning was enabled, restore from a previous version
  3. Point-in-time recovery — for RDS, restore to any point within the retention window (1-35 days)
  4. Backup restore — restore from latest automated snapshot
  5. CloudTrail investigation — identify who/what made the change and when
  6. Data reconstruction — if no backup, attempt recovery from application logs or partner systems
  7. Post-mortem — enable versioning, MFA delete, cross-region backup, and CSPM alerts to prevent recurrence

Q142. How would you approach integration of multiple cloud providers?

Architecture principles:

  1. Abstraction layer — use Kubernetes to abstract compute; apps don't know which cloud they're on
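  2. Terraform — manage every cloud with one IaC tool and workflow; resource definitions remain provider-specific, so moving a workload to another provider means rewriting those resources, not just changing the provider block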
  3. API-first design — all communication between services via APIs, not provider-specific mechanisms
  4. Data layer — use a cloud-agnostic database (self-managed PostgreSQL or Snowflake) accessible from all clouds
  5. Identity federation — federate IAM identities across providers (OIDC)
  6. Networking — VPN or dedicated connection (Megaport) between cloud VPCs
  7. Observability — centralised logging and monitoring across all clouds (Datadog, Grafana Cloud)

Q143. What would you do if a cloud deployment failed due to an infrastructure issue?

  1. Immediate response — trigger the rollback: deploy the previous stable version (kubectl rollout undo, or switch blue/green DNS back)
  2. Assess impact — is there a user-facing outage? Activate incident response
  3. Identify root cause — check deployment logs, CloudWatch, Terraform plan output for what changed
  4. Fix forward or rollback — if fix is quick, fix and redeploy; if complex, roll back first, investigate in staging
  5. Communication — update status page, notify affected stakeholders
  6. Post-deployment gates — add automated smoke tests and health checks to prevent future occurrences
  7. Post-mortem — document root cause and preventive measures

Q144. How would you prioritise tasks when managing multiple cloud projects?

Framework:

  1. Severity-first — production incidents always take priority over planned work
  2. Impact × Urgency — use Eisenhower matrix: urgent+important (do now), important+not urgent (schedule), urgent+not important (delegate), neither (eliminate)
  3. Business value — align priorities with business goals (revenue-impacting work first)
  4. Dependencies — unblock other teams first (if your work blocks 5 other engineers, prioritise it)
  5. Communication — when genuinely overloaded, communicate capacity constraints to stakeholders and get priority decisions from management, don't silently context-switch

Tools: Jira/Linear for tracking, daily standups to surface blockers early.


21. Remaining Original Questions

Q145. What is a cloud management gateway?

A cloud management gateway bridges on-premises management systems with cloud resources — allowing IT teams to manage cloud infrastructure using existing on-premises tools without direct internet exposure.

In Azure, Azure Arc acts as this bridge: it onboards on-premises servers and Kubernetes clusters into Azure Resource Manager so they appear as Azure resources and can be managed through the Azure Portal, Azure Policy, and Defender for Cloud. In AWS, Systems Manager manages EC2 instances and on-premises servers through a single pane.


Q146. How would you implement a CI/CD pipeline in a cloud environment?

Complete pipeline stages:

  1. Source — developer pushes to Git (GitHub, CodeCommit, GitLab)
  2. Build — compile code, run unit tests, build Docker image
  3. Security scan — SAST (code), SCA (dependencies), container image scan (Trivy, Snyk)
  4. Push — push Docker image to registry (ECR, Google Artifact Registry)
  5. Deploy to staging — Kubernetes rolling update or blue/green
  6. Integration tests — automated smoke/API tests against staging
  7. Deploy to production — canary (5% → 25% → 100%) with automatic rollback on error rate spike

Tools: GitHub Actions, GitLab CI, AWS CodePipeline + CodeBuild, Jenkins, ArgoCD (GitOps for K8s).


Q147. What is cloud repatriation and why do companies do it?

Cloud repatriation = moving workloads back from public cloud to on-premises or private cloud.

Reasons: cost (Dropbox reported saving roughly $75M over two years after moving most of its storage off AWS; owned hardware can be cheaper at massive, predictable scale), performance (ultra-low-latency requirements), data sovereignty (physical control needed), security (air-gapped systems).

Note: repatriation does not mean cloud failed — it means the workload economics changed. Most companies maintain hybrid.


Q148. What is a cloud-based continuous monitoring solution?

Continuous cloud monitoring = automated, real-time surveillance of security posture, compliance, and performance — 24/7 with no manual checks.

Layers:

  • Infrastructure metrics: CloudWatch, Datadog
  • Security: AWS GuardDuty (threat detection), AWS Config (compliance drift), Security Hub (aggregated findings)
  • Logs: CloudWatch Logs Insights, ELK Stack, Splunk
  • Application: APM (Datadog, New Relic) with distributed tracing

Goal: detect and alert on issues within minutes, not hours.


22. Advanced DevOps Topics

Q149. What are the differences between Terraform and CloudFormation?

Aspect | Terraform | CloudFormation
Provider | HashiCorp (source-available under the BSL since 2023; OpenTofu is the open-source fork) | AWS-native only
Cloud support | Multi-cloud: AWS, Azure, GCP, 1000+ providers | AWS only
Language | HCL (HashiCorp Configuration Language) | YAML or JSON
State | Terraform state file (local or a remote backend such as S3) | Managed by AWS automatically
Import existing infra | terraform import | Resource import, for supported resource types only
Ecosystem | Terraform Registry, modules, providers | Nested stacks, StackSets
Best for | Multi-cloud, open ecosystems | AWS-only shops, tight AWS integration

Q150. What is the difference between blue/green and canary deployments?

Aspect | Blue/Green | Canary
How | Run two full environments. Switch all traffic at once. | Gradually shift % of traffic to new version (5% → 25% → 100%)
Rollback | Instant: switch traffic back | Fast: reduce canary % to 0
Risk | Low: instant rollback | Lower: only a small % of users affected at any time
Cost | 2x infrastructure during deployment | Small % overhead only
Best for | Major releases, database changes | Stateless API gradual rollouts
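
The canary column can be read as a simple control loop. The sketch below is an assumption-laden illustration: error_rate_at is a hypothetical callable standing in for a metrics query, and the 1% threshold is an arbitrary example value.

```python
def canary_rollout(steps, error_rate_at, max_error_rate=0.01):
    """Walk through traffic percentages; roll back to 0% on a high error rate.

    steps: increasing traffic percentages, e.g. [5, 25, 100]
    error_rate_at: hypothetical callable returning the observed error rate
                   of the new version at a given traffic percentage
    """
    history = []
    for percent in steps:
        history.append(percent)
        if error_rate_at(percent) > max_error_rate:
            history.append(0)  # rollback: all traffic returns to the old version
            return "rolled_back", history
    return "promoted", history
```

A blue/green switch is the degenerate case of the same loop with a single step of 100%.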

Q151. What is Helm in Kubernetes?

Helm is the package manager for Kubernetes. It packages K8s resources as charts — reusable, versioned templates with configurable values.

Without Helm: manage 20+ separate YAML files per application. With Helm: one command deploys everything: helm install myapp ./chart --values prod.yaml

Key concepts: Chart (package of K8s templates), Release (deployed chart instance), Values (per-environment configuration), Repository (chart registry like Artifact Hub).


Q152. What is service discovery in microservices?

Service discovery allows services to find and communicate with each other dynamically without hardcoded IP addresses.

Patterns:

  • Client-side discovery — service queries a registry (Consul, Eureka) and picks an instance
  • Server-side discovery — client calls a load balancer which routes to healthy instances

In Kubernetes: built-in DNS service discovery. Every Service gets a stable DNS name (payment-service.default.svc.cluster.local). No external registry needed.


Q153. What is an Ingress controller in Kubernetes?

An Ingress controller manages external HTTP/HTTPS access to Kubernetes services based on routing rules defined in Ingress resources.

Example routing rule: api.example.com/users → users-service, api.example.com/orders → orders-service.

Popular controllers: NGINX Ingress, Traefik, AWS Load Balancer Controller (formerly AWS ALB Ingress Controller; provisions an ALB automatically), GKE Ingress. A controller handles TLS termination, path-based routing, and load balancing in one place.


Q154. What is Horizontal Pod Autoscaler (HPA)?

HPA automatically adjusts the number of pod replicas based on observed metrics (CPU, memory, or custom metrics like RPS).

How it works: HPA queries metrics every 15 seconds. If average CPU across pods exceeds the target, it scales out. When load drops, it scales in down to minReplicas.

For event-driven scaling (SQS queue depth, Kafka lag), use KEDA (Kubernetes Event-driven Autoscaling), which feeds external metrics to the HPA and can also scale a workload down to zero.
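
The core HPA calculation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies that formula with a 10% tolerance band and min/max clamping. It is a simplification: the real controller also accounts for pod readiness, missing metrics, and stabilisation windows, and the function and parameter names here are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply ceil(current_replicas * current/target), clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 pods averaging 90% CPU against a 60% target give ceil(4 × 1.5) = 6 replicas.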


Q155. What are PersistentVolumes and PersistentVolumeClaims in Kubernetes?

  • PersistentVolume (PV) — a piece of storage provisioned in the cluster (AWS EBS, NFS). Has a lifecycle independent of any pod.
  • PersistentVolumeClaim (PVC) — a request for storage by a pod. Pods use PVCs without knowing the underlying storage.
  • StorageClass — enables dynamic provisioning. When PVC is created, K8s automatically creates a PV (e.g., provisions an EBS volume).

Use for databases (MySQL, PostgreSQL) and any stateful workloads in K8s.


23. FinOps & Cloud Cost Management

Q156. What is FinOps?

FinOps (Cloud Financial Operations) is a practice that brings engineering, finance, and business together to maximise cloud value — making data-driven decisions about cost, speed, and quality.

Three phases: Inform (cost visibility via tagging + dashboards) → Optimise (right-sizing, reserved instances, waste removal) → Operate (cloud cost as ongoing shared responsibility).

Tools: AWS Cost Explorer, Azure Cost Management, Google Cloud Billing reports, Infracost (IaC cost), CloudHealth.


Q157. What is a Reserved Instance vs Savings Plan?

Aspect | Reserved Instance (RI) | Savings Plan
Commitment | Specific instance family, region, OS | Hourly spend commitment (e.g. $10/hr for 1 year)
Flexibility | Limited: tied to specific instance type | High: Compute Savings Plans apply to EC2, Lambda, Fargate automatically
Discount | Up to 72% vs On-Demand | Up to 66% (Compute Savings Plans) or up to 72% (EC2 Instance Savings Plans) vs On-Demand
Best for | Predictable workloads with known instance type | Mixed/microservices/Lambda-heavy architectures

Q158. How do you use AWS Cost Explorer for cost optimisation?

AWS Cost Explorer provides: cost trends by service/region/account/tag, spend forecasting, Reserved Instance recommendations, right-sizing recommendations for EC2, cost anomaly detection.

Best practice: tag ALL resources with project, team, and environment from day one. Without consistent tagging, cost attribution is impossible and Cost Explorer becomes much less useful.


24. Cloud Security Advanced Topics

Q159. What is Zero Trust security in cloud?

"Never trust, always verify" — no user or service is trusted by default, even inside the corporate network.

Key principles: verify explicitly (always authenticate based on identity + device + context), least privilege access (minimum permissions, time-limited), assume breach (minimise blast radius).

Cloud implementation: Microsoft Entra ID (formerly Azure AD) Conditional Access, AWS IAM Identity Center, service mesh with mutual TLS (Istio), micro-segmentation with Security Groups.


Q160. What is AWS GuardDuty?

GuardDuty is a managed threat detection service that continuously monitors for malicious activity using ML and threat intelligence.

Analyses: CloudTrail event logs, VPC Flow Logs, DNS logs, EKS audit logs. Detects: cryptocurrency mining on EC2, unusual API calls from malicious IPs, credential exfiltration, compromised instances, unusual S3 data access.

No agents, no performance impact. GuardDuty is a regional service, so enable it in every region (AWS Organizations can automate this across accounts). Findings integrate with Security Hub.


Q161. What is AWS CloudTrail?

CloudTrail records all API calls in your AWS account — who, what, when, from where. Primary audit log for AWS.

Captures: management events (create/delete/modify resources), data events (S3 object access, Lambda invocations), Insights events (unusual API activity spikes).

Best practice: enable in all regions, send logs to a dedicated security account S3 bucket with MFA delete, set retention to 7+ years for compliance.


Q162. What is secrets management in cloud?

Never store secrets in code, Git, environment variables in images, EC2 User Data, or config files.

Correct approach: AWS Secrets Manager (stores, rotates, audits — auto-rotates RDS passwords), AWS Parameter Store (cheaper, good for non-sensitive config + secrets), HashiCorp Vault (multi-cloud, fine-grained dynamic secrets).

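Applications retrieve secrets at runtime via SDK calls, for example with the AWS SDK for JavaScript v3 (where client is a SecretsManagerClient): const { SecretString } = await client.send(new GetSecretValueCommand({ SecretId: "prod/db/password" }))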


25. Cloud-Native Advanced Topics

Q163. What is SLO, SLI, and SLA?

Aspect | SLA | SLO | SLI
Full name | Service Level Agreement | Service Level Objective | Service Level Indicator
What | Legal contract with customers | Internal reliability target | The actual measurement metric
Who sets | Legal + business + engineering | Engineering/SRE team | Engineering team
Example | 99.9% uptime guaranteed to customers | Internal target: 99.95% uptime | Measured uptime: 99.97%
Miss consequence | Customer credits | Error budget consumed, alerts | Data point for SLO tracking

Error budget = 100% - SLO. 99.9% SLO = 43 min/month error budget. Exhausted budget = freeze new features, focus on reliability.
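
The arithmetic behind the error budget can be checked with a few lines. This sketch assumes a 30-day window and an availability-style SLO; the function names are illustrative.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget
```

A 99.9% SLO over 30 days allows 43,200 × 0.001 = 43.2 minutes of downtime; tightening it to 99.95% halves the budget to 21.6 minutes.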


Q164. What is chaos engineering?

Chaos engineering intentionally injects failures into production to discover weaknesses before they cause unplanned outages.

Process: define steady state (normal behaviour) → hypothesise it continues under failure → inject fault (kill servers, add latency, block network) → observe result.

Tools: Netflix Chaos Monkey, AWS Fault Injection Service (FIS, formerly Fault Injection Simulator), Gremlin, LitmusChaos (Kubernetes). Netflix runs chaos experiments in production continuously.


Q165. What are the three pillars of observability?

Pillar | What it captures | Tool examples
Metrics | Numerical measurements over time (CPU %, request rate, error rate) | Prometheus, CloudWatch, Datadog
Logs | Timestamped records of events (errors, requests, state changes) | CloudWatch Logs, ELK Stack, Loki, Splunk
Traces | End-to-end journey of a request across distributed services | AWS X-Ray, Jaeger, Datadog APM, Zipkin

OpenTelemetry (OTel) is the vendor-neutral standard for instrumenting all three. Instrument once, send to any backend.


Q166. What is Platform Engineering?

Platform Engineering builds internal developer platforms (IDPs) — self-service infrastructure and tooling that lets development teams deploy, operate, and monitor applications without deep infra knowledge.

Platform teams provide: golden path CI/CD templates, self-service K8s namespaces, secrets management, observability stack, IaC modules, internal developer portals (Backstage, Port).

Goal: reduce developer cognitive load, increase deployment frequency, standardise security and compliance across all teams.


Q167. What is GitOps?

GitOps = Git is the single source of truth for infrastructure and application state. Changes are made via Git commits. An automated operator (ArgoCD, Flux) continuously reconciles actual cluster state with Git.

Aspect | Traditional CI/CD | GitOps
Trigger | Pipeline pushes changes to cluster | Operator pulls changes from Git
Source of truth | CI/CD system | Git repository
Audit trail | Pipeline logs | Git commit history
Drift correction | Manual or re-trigger pipeline | Automatic continuous reconciliation
Tools | Jenkins, CodePipeline | ArgoCD, Flux
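
Continuous reconciliation boils down to diffing desired state (from Git) against actual state (from the cluster) and acting on the difference. The sketch below is a toy model with plain dicts; it is not how ArgoCD or Flux are implemented, and the action names are assumptions.

```python
def reconcile(desired, actual):
    """Return the actions that bring `actual` in line with `desired`.

    Both arguments map a resource name to its spec (any comparable value).
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift: live state differs from Git
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # pruning: no longer declared in Git
    return actions
```

An operator runs this kind of comparison on a loop, which is why a manual change to the cluster is reverted without anyone re-running a pipeline.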

Q168. What is AWS ECS vs EKS?

Aspect | ECS | EKS
Orchestrator | AWS-proprietary | Kubernetes (industry standard)
Learning curve | Lower: simpler to operate | Higher: full K8s knowledge required
Portability | AWS-only | Portable to any K8s environment
Ecosystem | AWS tools only | Massive K8s ecosystem (Helm, ArgoCD, Istio, Prometheus)
Control plane cost | Free | $0.10/hr (~$73/month)
Best for | Simpler AWS container workloads | Complex microservices, multi-cloud, enterprise

26. Data, AI & Streaming in Cloud

Q169. What is a data lake vs data warehouse?

Aspect | Data Lake | Data Warehouse
Data type | Raw: structured, semi-structured, unstructured | Structured, processed, cleaned
Schema | Schema-on-read (define when querying) | Schema-on-write (defined before loading)
Storage cost | Very low (S3, GCS) | Higher (Redshift, BigQuery, Snowflake)
Query performance | Slower (scan raw files) | Faster (optimised columnar storage)
Use case | ML training data, raw event storage, data exploration | Business intelligence, dashboards, reporting
Cloud examples | S3 + Athena, GCS + Dataproc | Redshift, BigQuery, Snowflake, Azure Synapse

Modern pattern: Lakehouse (Databricks Delta Lake, BigQuery) combines lake storage cost with warehouse query performance.


Q170. What is AWS Kinesis?

AWS Kinesis is a real-time streaming data platform:

  • Kinesis Data Streams — ingest and process real-time data (clickstreams, IoT, logs)
  • Amazon Data Firehose (formerly Kinesis Data Firehose) — delivers streams to S3, Redshift, OpenSearch automatically (no code needed)
  • Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) — run Apache Flink applications on streaming data
  • Kinesis Video Streams — video streaming from IoT devices

Use case: user clicks → Kinesis → Lambda processes in real time → DynamoDB for live data + S3 via Firehose for analytics.


Q171. What is AWS SageMaker?

SageMaker is AWS's end-to-end managed ML platform covering: data preparation (Data Wrangler, Feature Store), model training (managed GPU/CPU clusters, built-in algorithms), experiment tracking, model deployment (real-time endpoints, serverless inference), and MLOps (Pipelines, Model Monitor for drift detection).

It removes the heavy lifting of ML infrastructure — data scientists focus on model quality, not cluster management.


27. Advanced Networking

Q172. What is AWS Direct Connect?

Direct Connect = dedicated private fiber connection between your data center and AWS, bypassing the public internet. Provides: lower consistent latency, higher bandwidth (up to 100 Gbps), reduced egress costs, private connectivity.

When to use: high-volume data migration, real-time latency-sensitive workloads (trading systems), compliance requiring private connectivity, consistent performance for hybrid applications.


Q173. What is AWS Transit Gateway?

Transit Gateway = a hub that connects multiple VPCs and on-premises networks centrally. Without it: each VPC needs a peering connection to every other (N*(N-1)/2 connections — doesn't scale). With it: all VPCs connect to the hub (N connections).

Supports cross-account and cross-region connectivity. Can attach VPNs and Direct Connect. Simplifies hub-and-spoke network architecture drastically.
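
The scaling argument is easy to verify numerically. The two helper names below are illustrative only.

```python
def full_mesh_peerings(n_vpcs):
    """Peering connections needed so every VPC reaches every other: N*(N-1)/2."""
    return n_vpcs * (n_vpcs - 1) // 2


def transit_gateway_attachments(n_vpcs):
    """Hub-and-spoke: one attachment per VPC."""
    return n_vpcs
```

With 10 VPCs a full mesh needs 45 peering connections against 10 attachments; at 50 VPCs the gap is 1,225 against 50.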


Q174. What is a NAT Gateway and when do you need it?

NAT Gateway allows instances in a private subnet to initiate outbound internet connections (for software updates, external API calls) without accepting inbound internet traffic.

Architecture: NAT Gateway sits in the public subnet with an Elastic IP. Private subnet route table sends 0.0.0.0/0 traffic to the NAT Gateway. Return traffic is allowed because NAT is stateful.

Cost: ~$0.045/hour + $0.045/GB processed (us-east-1 list price; varies by region). Use VPC Endpoints for AWS services (S3, DynamoDB) to avoid NAT Gateway charges for AWS traffic.
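
A rough monthly estimate follows directly from those two rates. The sketch assumes the approximate rates quoted above and about 730 hours in a month; actual pricing varies by region, so treat the defaults as placeholders.

```python
def nat_gateway_monthly_cost(gb_processed, hourly_rate=0.045,
                             per_gb_rate=0.045, hours=730):
    """Estimated monthly USD cost for one NAT Gateway: hourly charge plus data processing."""
    return hours * hourly_rate + gb_processed * per_gb_rate
```

An idle gateway costs about $32.85 a month under these assumptions, and pushing 1 TB (1,000 GB) through it adds about $45, which is why routing S3 and DynamoDB traffic through Gateway Endpoints is a common saving.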


Q175. What are VPC Endpoints?

VPC Endpoints connect EC2 instances in private subnets to AWS services (S3, DynamoDB, SQS, etc.) without going through the internet or NAT Gateway — traffic stays within the AWS network.

Types: Gateway Endpoints (S3 and DynamoDB only; free; added as a route table entry), Interface Endpoints/PrivateLink (most other AWS services; creates an ENI in your subnet; billed per hour plus per GB processed).

Benefits: better security (no internet exposure), lower cost (no NAT Gateway charges), lower latency.


28. Cloud Architecture Reference

Q176. What is the AWS Well-Architected Framework?

Pillar | Focus areas
Operational Excellence | Automate operations, run workloads effectively, improve processes, respond to events
Security | Protect data and systems: IAM, encryption, detection, incident response
Reliability | Recover from failures, scale dynamically, mitigate disruptions automatically
Performance Efficiency | Use right resources for the right tasks at scale, monitor and evolve
Cost Optimisation | Eliminate waste, right-size, use managed services, analyse spend
Sustainability | Minimise environmental impact: right-size, use efficient code, pick green regions

AWS Well-Architected Tool in the console reviews workloads against these pillars and provides improvement recommendations.


Q177. What is the CAP theorem?

In a distributed system, you can only guarantee 2 of: Consistency (all nodes see same data), Availability (every request gets a response), Partition Tolerance (works despite network splits).

Since network partitions are unavoidable:

  • CP systems (sacrifice availability): HBase, ZooKeeper, MongoDB (strong consistency) — may reject requests during partition
  • AP systems (sacrifice consistency): Cassandra, DynamoDB (eventually consistent), CouchDB — responds with possibly stale data

Q178. What is RTO vs RPO?

  • RTO (Recovery Time Objective) — maximum acceptable downtime after a disaster. How quickly must the system be restored?
  • RPO (Recovery Point Objective) — maximum acceptable data loss measured in time. How much data can we afford to lose?

Lower RTO/RPO = higher cost. Design DR to meet business requirements at minimum cost. E.g., RTO=4 hours, RPO=1 hour → warm standby with hourly DB backups is sufficient.
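
The example above can be turned into a quick sanity check. The sketch assumes the worst-case data loss equals one full backup interval and the worst-case downtime equals the measured restore time; the function and parameter names are illustrative.

```python
def meets_objectives(backup_interval_min, restore_time_min, rpo_min, rto_min):
    """Compare a backup/restore design against RPO and RTO targets (all in minutes)."""
    return {
        # Worst case: failure strikes just before the next backup runs.
        "rpo_met": backup_interval_min <= rpo_min,
        # Worst case: the full restore procedure has to run end to end.
        "rto_met": restore_time_min <= rto_min,
    }
```

Hourly backups with a 3-hour restore satisfy RTO = 4 hours and RPO = 1 hour; nightly backups with the same restore time would meet the RTO but miss the RPO.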


Q179. What is infrastructure drift?

Infrastructure drift = actual cloud resource state diverges from IaC-defined desired state, usually due to manual console/CLI changes bypassing Terraform/CloudFormation.

Danger: Terraform may overwrite manual changes on next apply; different environments have inconsistent configs; debugging becomes much harder.

Prevention: terraform plan in CI/CD before every apply; AWS Config drift detection; restrict console access via SCPs; treat infrastructure as immutable — change only through IaC.


Q180. What is AWS Lambda@Edge?

Lambda@Edge runs Lambda functions at CloudFront edge locations worldwide, close to users.

Use cases: A/B testing (route based on cookies at edge), authentication (validate JWT before hitting origin), URL rewriting, image resizing based on device, personalisation based on geolocation.

4 trigger points: Viewer Request (before cache check), Origin Request (cache miss → before origin), Origin Response (after origin), Viewer Response (before sending to user).
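
As an example of the URL rewriting use case, a viewer-request function might look like the sketch below. It reads the request from the CloudFront event record and returns it so processing continues; the function name and the index.html rewrite rule are assumptions chosen for illustration.

```python
def viewer_request_handler(event, context=None):
    """Viewer-request trigger: rewrite directory-style URIs to their index.html."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]
    if uri.endswith("/"):
        request["uri"] = uri + "index.html"
    # Returning the (possibly modified) request lets CloudFront carry on with it.
    return request
```

Because this runs before the cache check, the rewritten URI is also the one used as the cache key.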


Q181. What is Amazon CloudFront?

CloudFront is AWS's CDN with 450+ edge locations globally. Use cases: static website assets (HTML/CSS/JS/images from S3), video streaming, API response caching, DDoS protection at edge, reducing origin server load.

Features: Lambda@Edge for edge compute, origin failover, signed URLs/cookies for private content, real-time logs, TLS with ACM certificates (free).


Q182. What is a multi-region architecture?

Running your application in two or more cloud regions simultaneously for: disaster recovery (survive complete region failure), low latency for global users, compliance (data residency), and 99.999% availability.

Complexity: data replication (DynamoDB Global Tables, RDS Global Database, S3 CRR), global routing (Route 53 latency/geolocation/failover routing), stateless application design, increased operational overhead and cost.


Q183. What is the difference between stateful and stateless firewalls in AWS?

  • Security Groups (stateful) — track connection state. Allow inbound on port 443 and the return traffic is automatically allowed. Write allow rules only. Applied to EC2, RDS, Lambda.
  • Network ACLs (stateless) — inspect each packet independently. Must explicitly allow BOTH inbound AND outbound for each connection. Applied to subnets.

29. More AWS Services

Q184. What is AWS CloudFormation?

CloudFormation is AWS-native Infrastructure as Code. Define AWS resources in YAML/JSON templates; CloudFormation provisions them in the correct dependency order.

Key concepts: Stack (deployed template instance), Change Set (preview before applying), Drift Detection (detect manual changes), Nested Stacks (modular templates), StackSets (multi-account/region deployment).

When to use: AWS-only environments, tight integration with AWS services (e.g., SAM for serverless). Use Terraform for multi-cloud.


Q185. What is AWS Fargate?

Fargate = serverless compute for containers. Run ECS tasks or EKS pods without managing EC2 instances. Define CPU/memory requirements; AWS handles the underlying compute, OS, and scaling.

Best for: variable/bursty workloads, batch jobs, teams that don't want to manage node groups. Cost is higher per vCPU/hour than EC2 but eliminates node management overhead.


Q186. What is Amazon ElastiCache?

ElastiCache is a managed in-memory caching service supporting Redis and Memcached.

Use cases: session storage (replace sticky sessions), database query result caching (can offload most repeated reads from the database), rate limiting counters, leaderboards, pub/sub messaging (Redis Pub/Sub), geospatial queries.

Multi-AZ Redis with automatic failover for production. Redis > Memcached for most use cases (persistence, data structures, replication).


Q187. What is Amazon SNS and SQS, and when do you use each?

Aspect | SNS (Simple Notification Service) | SQS (Simple Queue Service)
Pattern | Pub/Sub: one message to many subscribers | Point-to-point queue: one message to one consumer
Delivery | Push: subscribers receive immediately | Pull: consumers poll for messages
Persistence | No: messages are not stored; once delivery retries are exhausted the message is lost unless a dead-letter queue is configured | Yes: messages stored up to 14 days
Use case | Fan-out (notify email + Lambda + SQS simultaneously) | Decoupled async processing, task queue
Common pattern | SNS → SQS (fan-out + durability) | SQS → Lambda (event-driven processing)
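
The SNS → SQS fan-out pattern can be sketched with two small in-memory classes. `Topic` and `Queue` are hypothetical stand-ins, not the AWS SDK, and the queue omits real SQS details such as visibility timeouts and explicit deletes.

```python
from collections import deque


class Queue:
    """SQS-like: stores messages until a consumer polls for them."""

    def __init__(self, name):
        self.name = name
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Pull model: returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """SNS-like: pushes each published message to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, deliver):
        self._subscribers.append(deliver)

    def publish(self, body):
        for deliver in self._subscribers:
            deliver(body)


orders = Topic()
billing, shipping = Queue("billing"), Queue("shipping")
orders.subscribe(billing.send)   # each queue gets its own copy
orders.subscribe(shipping.send)

orders.publish({"order_id": 42})

# Consumers poll independently, whenever they are ready.
print(billing.receive())   # {'order_id': 42}
print(shipping.receive())  # {'order_id': 42}
print(billing.receive())   # None
```

The topic provides the one-to-many delivery and each queue provides the buffering, so a slow or offline consumer does not lose messages.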

Q188. What are the AWS Route 53 routing policies?

Route 53 supports multiple routing policies beyond simple DNS:

  • Simple — return fixed IP/value (default)
  • Weighted — split traffic by percentage (10% to v2, 90% to v1) — for canary deployments
  • Latency-based — route to region with lowest latency for the user
  • Geolocation — route based on user's country/continent (GDPR compliance)
  • Failover — primary/secondary with health checks — automatic DNS failover for DR
  • Multi-value — return multiple IPs; Route 53 removes unhealthy ones (client-side load balancing)
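
As a rough illustration of the weighted and failover policies above, the sketch below picks an answer the way a weighted record set might behave during a canary rollout. The function names and IP addresses are illustrative assumptions, and returning `None` when nothing is healthy is a simplification rather than Route 53's real behaviour.

```python
import random
from collections import Counter


def weighted_answer(records, rng):
    """records: list of (value, weight, healthy). Pick a healthy record in proportion to its weight."""
    healthy = [(value, weight) for value, weight, ok in records if ok and weight > 0]
    if not healthy:
        return None
    values, weights = zip(*healthy)
    return rng.choices(values, weights=weights, k=1)[0]


def failover_answer(primary, secondary, primary_healthy):
    """Return the primary while its health check passes, otherwise the secondary."""
    return primary if primary_healthy else secondary


rng = random.Random(0)  # seeded so the run is repeatable
canary = [("192.0.2.10", 90, True),     # v1
          ("198.51.100.20", 10, True)]  # v2
counts = Counter(weighted_answer(canary, rng) for _ in range(10_000))
print(counts["198.51.100.20"] / 10_000)  # roughly 0.10

print(failover_answer("192.0.2.10", "198.51.100.20", primary_healthy=False))
```

Because DNS answers are cached for the record's TTL, the real traffic split only approximates the configured weights.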

Q189. What is AWS WAF?

AWS WAF (Web Application Firewall) protects web apps from common web exploits — OWASP Top 10 (SQL injection, XSS, CSRF), bot attacks, and DDoS at Layer 7.

Deploy in front of: CloudFront, Application Load Balancer, API Gateway, AppSync.

Rules: AWS Managed Rules (pre-built rule sets), custom rules (block specific IPs, rate limit, geo-block), bot control (CAPTCHA for suspicious traffic).
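
The idea behind a rate-limit rule can be sketched as a per-IP sliding window. `RateLimitRule` is a hypothetical class for illustration; it does not reproduce how AWS WAF counts or evaluates requests, and the limit and window values are arbitrary.

```python
from collections import defaultdict, deque


class RateLimitRule:
    """Block a client once it exceeds `limit` allowed requests in the trailing window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip, now):
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: block
        hits.append(now)
        return True


rule = RateLimitRule(limit=3, window_seconds=60)
print([rule.allow("203.0.113.5", t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(rule.allow("203.0.113.5", 61))   # True, the oldest requests have expired
print(rule.allow("198.51.100.9", 3))   # True, other clients are counted separately
```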


Q190. What are EC2 instance types and when to use each?

Family | Optimised for | Use cases
T3/T4 (burstable) | Cost: baseline CPU with burst credits | Dev/test, low-traffic web servers, microservices
M6/M7 (general) | Balance of CPU + memory | Application servers, backend APIs, databases
C6/C7 (compute) | High CPU | Web servers, batch processing, ML inference, game servers
R6/R7 (memory) | Large RAM | In-memory databases, Redis, SAP HANA, analytics
P4/G5 (GPU) | GPU compute | ML training, deep learning, video encoding
I3/I4 (storage) | High local NVMe IOPS | NoSQL databases, data warehousing, Elasticsearch
Inf2 (Inferentia) | ML inference chips | Cost-efficient LLM inference at scale

Q191. What is Amazon Aurora and how is it different from RDS?

Aurora is AWS's cloud-native relational database compatible with MySQL and PostgreSQL, but redesigned for cloud scale.

Feature | Standard RDS (MySQL) | Aurora
Storage | EBS volume per instance | Distributed, shared storage cluster
Read replicas | Up to 5, async replication | Up to 15, sub-10 ms replica lag
Failover | 60-120 seconds | 30 seconds
Storage scaling | Manual | Auto-grows in 10 GB increments up to 128 TB
Cost | Lower | ~20% more than RDS, but better performance

Aurora Serverless v2: auto-scales database capacity in fractions of ACUs — ideal for variable workloads.


Q192. What is Amazon DynamoDB and when should you use it?

DynamoDB is a fully managed NoSQL key-value and document database with single-digit millisecond performance at any scale.

Best for: high-throughput applications (gaming leaderboards, IoT, e-commerce carts, session storage, real-time bidding). Not good for: complex queries with JOINs, ad-hoc analytics, ACID transactions across many tables.

Key features: on-demand capacity (pay per request), DynamoDB Streams (change data capture), Global Tables (multi-region active/active), DynamoDB Accelerator (DAX) for microsecond caching.
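
The key-value access pattern behind these use cases can be sketched with a toy table keyed by a partition key and a sort key. `KeyValueTable`, its method names, and the `USER#`/`CART#` key format are illustrative assumptions, not the DynamoDB API.

```python
from collections import defaultdict


class KeyValueTable:
    """Toy table: items are addressed by (partition key, sort key)."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def put_item(self, pk, sk, attributes):
        self._partitions[pk][sk] = dict(attributes)

    def get_item(self, pk, sk):
        # Point lookup by full key: the cheapest access path.
        return self._partitions.get(pk, {}).get(sk)

    def query(self, pk, sk_prefix=""):
        # Items within one partition, ordered by sort key.
        items = self._partitions.get(pk, {})
        return [items[sk] for sk in sorted(items) if sk.startswith(sk_prefix)]


table = KeyValueTable()
table.put_item("USER#42", "PROFILE", {"name": "Ada"})
table.put_item("USER#42", "CART#sku-2", {"qty": 1})
table.put_item("USER#42", "CART#sku-1", {"qty": 2})

print(table.get_item("USER#42", "PROFILE"))  # {'name': 'Ada'}
print(table.query("USER#42", "CART#"))       # [{'qty': 2}, {'qty': 1}]
```

Every read names a partition key up front. There is no way to ask for "all carts with qty > 1" without scanning, which is why JOIN-style and ad-hoc queries fit poorly.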


30. Final Exam Questions

Q193. What is serverless and when should you NOT use it?

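Serverless means running code without provisioning or managing servers: the provider allocates capacity per request and bills for actual usage.

Serverless is wrong for: long-running tasks (Lambda max 15 min), high sustained throughput at massive scale (containers are cheaper), strict latency requirements (cold starts unacceptable), complex stateful workflows, and existing containerised apps that would require an expensive refactor.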

Serverless is right for: event-driven processing, variable/unpredictable load, simple API backends, batch jobs triggered by events, scheduled tasks.


Q194. What is the difference between EC2 On-Demand, Reserved, and Spot instances?

Type | Pricing | Use case | Risk
On-Demand | Full price, no commitment | Unpredictable workloads, dev/test, short-term | None: always available
Reserved (1yr/3yr) | Up to 72% discount | Steady, predictable production workloads | Commitment: pay even if unused
Spot | Up to 90% discount | Fault-tolerant batch, CI/CD runners, stateless workers | 2-min termination notice when capacity needed
Savings Plans | Up to 66% discount | Flexible commitment across EC2+Lambda+Fargate | Hourly spend commitment
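
To make the discounts concrete, the sketch below applies the "up to" figures from the table to a placeholder hourly price. The $0.10/hour rate and the 730-hour month are assumptions for illustration, not real AWS prices, and actual discounts are often lower than the maximum.

```python
HOURS_PER_MONTH = 730  # assumed average month

# Maximum discounts taken from the table above.
MAX_DISCOUNT = {
    "on_demand": 0.00,
    "reserved": 0.72,
    "spot": 0.90,
    "savings_plan": 0.66,
}


def monthly_cost(on_demand_hourly, option, instances=1, hours=HOURS_PER_MONTH):
    """Best-case monthly cost for one purchasing option."""
    rate = on_demand_hourly * (1 - MAX_DISCOUNT[option])
    return round(rate * hours * instances, 2)


price = 0.10  # hypothetical on-demand price in USD per hour
for option in MAX_DISCOUNT:
    print(f"{option:>12}: ${monthly_cost(price, option):.2f}/month")
```

With these assumptions one instance costs $73.00 on demand, and the best case is about $20.44 reserved, $24.82 on a Savings Plan, and $7.30 on Spot.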

Q195. What is object storage consistency?

AWS S3 provides strong read-after-write consistency for all operations since December 2020. PUT an object, immediately GET it — you always get the latest version. Applies to new objects, overwrites, and deletes.

GCS (Google Cloud Storage) has always been strongly consistent. Both eliminate the need to worry about eventual consistency in your application design.


Q196. What is AWS Lambda concurrency?

Concurrency = number of function instances handling requests simultaneously.

  • Unreserved concurrency: default pool shared across all Lambda functions in the account (1,000 per Region by default, can be increased)
  • Reserved concurrency: guarantees a specific number of concurrent executions for a critical function and also caps that function at the same number; prevents other functions from consuming all concurrency
  • Provisioned concurrency: pre-initialises a configured number of execution environments so requests served by them avoid cold starts; use for latency-sensitive production APIs

When the concurrency limit is hit, Lambda throttles new invocations (429 error). SQS buffers requests naturally; API Gateway returns 429 to clients.
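
A common sizing rule is that required concurrency is roughly requests per second multiplied by average duration in seconds. The sketch below applies that rule and models how reservations carve up the account pool. `ConcurrencyPool` and the function names `checkout` and `reports` are hypothetical, and the model ignores details such as the minimum unreserved pool AWS keeps aside.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds):
    """Concurrency needed to serve the load without throttling."""
    return math.ceil(requests_per_second * avg_duration_seconds)


class ConcurrencyPool:
    """Account-level pool with optional per-function reservations."""

    def __init__(self, account_limit=1000):
        self.account_limit = account_limit
        self.reserved = {}

    @property
    def unreserved(self):
        return self.account_limit - sum(self.reserved.values())

    def reserve(self, function_name, amount):
        available = self.unreserved + self.reserved.get(function_name, 0)
        if amount > available:
            raise ValueError("reservation exceeds the account limit")
        self.reserved[function_name] = amount

    def limit_for(self, function_name):
        # A reservation is both a guarantee and a cap for that function;
        # everything else shares what is left.
        return self.reserved.get(function_name, self.unreserved)

    def is_throttled(self, function_name, in_flight):
        # True if one more invocation would exceed the applicable limit.
        return in_flight >= self.limit_for(function_name)


print(required_concurrency(200, 0.5))  # 100

pool = ConcurrencyPool()
pool.reserve("checkout", 200)
print(pool.limit_for("checkout"), pool.limit_for("reports"))  # 200 800
print(pool.is_throttled("checkout", in_flight=200))           # True
```

Reserving 200 for one function shrinks the shared pool for every other function to 800, which is the trade-off to weigh before reserving.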


Q197. How would you build a cost-optimised, highly available 3-tier architecture on AWS?

Architecture:

  • Tier 1 (Web/CDN): CloudFront → ALB (in 3 AZs)
  • Tier 2 (App): Auto Scaling Group of t3.small EC2 in private subnets across 3 AZs. Spot Instances for stateless app servers (70% savings). Min 2, max 20.
  • Tier 3 (Database): Aurora MySQL with 1 writer + 1 read replica, Multi-AZ enabled.
  • Session storage: ElastiCache Redis Multi-AZ (stateless app servers)
  • Static assets: S3 + CloudFront (eliminate App tier for static content)
  • Secrets: AWS Secrets Manager
  • Monitoring: CloudWatch dashboards + alarms, GuardDuty, Config

Cost optimisation: Spot for app tier, Reserved for DB (1-year), S3 + CloudFront for assets, right-sized instances using CloudWatch metrics.

