Amazon ML Summer School Previous Year Questions 2025

The Amazon Machine Learning (ML) Summer School is a premier program aimed at equipping engineering students in India with industry-relevant ML skills. Launched in 2021, its fifth edition in 2025 promises an immersive learning experience led by Amazon’s top scientists. This blog covers what the program is, who can apply, how to apply, and what to expect.

What is the Amazon ML Summer School?

This free four-weekend program (Aug 9–31, 2025) blends theory and practice across key ML topics such as Supervised Learning, Deep Neural Networks, Generative AI, Reinforcement Learning, and more. Sessions include live lectures and Q&As with Amazon scientists.

Key Highlights:

  • Curriculum: 8 modules covering ML topics including Generative AI, Causal Inference, and Reinforcement Learning.
  • Schedule: Saturdays and Sundays, 9 AM–1 PM IST (sessions) and 2 PM–5 PM IST (Q&A).
  • Networking: Includes access to Amazon Research Days (ARD) and mentorship from scientists.
  • Internship Pathway: Top performers may be eligible for internships in Data Science or Applied Scientist roles at Amazon.

Over 17,500 students applied in 2022 alone—underscoring the program’s growing popularity.

Eligibility

  • Who Can Apply: Engineering students enrolled in Bachelor’s, Master’s, or PhD programs at recognized Indian institutes.
  • Graduation Year: 2026 or 2027.
  • Background: All engineering disciplines welcome. Familiarity with Python and basic ML concepts is helpful but not mandatory.

Selection Process

Candidates take a 60-minute test assessing math and programming skills:

  • Part A: 20 MCQs on probability, statistics, and linear algebra.
  • Part B: 2 DSA programming problems (Python recommended).

Top 3,000 scorers will be selected for the program.

How to apply for Amazon ML Summer School: Check now

💻 Part A – Amazon ML Summer School MCQ Questions

Question 1: Variance Calculation

Given a dataset, calculate the variance of the numbers: [2, 4, 6, 8, 10].

a) 4
b) 6
c) 8
d) 10


Answer: c) 8

Explanation: To calculate variance:

  1. Find the mean: (2+4+6+8+10)/5 = 30/5 = 6
  2. Calculate squared deviations from mean: (2-6)² + (4-6)² + (6-6)² + (8-6)² + (10-6)² = 16 + 4 + 0 + 4 + 16 = 40
  3. Divide by n: 40/5 = 8
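
A quick Python check (a minimal sketch; note that the question uses the population variance, i.e. dividing by n rather than n-1):

data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)                                 # 6.0
variance = sum((x - mean) ** 2 for x in data) / len(data)    # 40 / 5 = 8.0
print(mean, variance)                                        # 6.0 8.0
# numpy.var(data) also divides by n by default and gives 8.0,
# while statistics.variance(data) divides by n-1 and would give 10.0.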

Question 2: Matrix Eigenvalues

The eigenvalues of a 4x4 square matrix having 0's as the diagonal elements and 1's as the off-diagonal elements are:

a) 2, -2, 0, 0
b) 1, -1, 1, -1
c) 3, -1, -1, -1
d) 4, 0, 0, 0


Answer: c) 3, -1, -1, -1

Explanation: The matrix is J - I where J is the all-ones 4×4 matrix and I is the identity matrix. The eigenvalues of the all-ones matrix J are: n=4 (with multiplicity 1) and 0 (with multiplicity 3). For J - I, the eigenvalues become: 4-1=3, 0-1=-1, 0-1=-1, 0-1=-1.
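
This can be confirmed numerically with NumPy (a minimal sketch):

import numpy as np

J = np.ones((4, 4))           # all-ones matrix
A = J - np.eye(4)             # 0s on the diagonal, 1s off the diagonal
print(np.linalg.eigvalsh(A))  # [-1. -1. -1.  3.]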


Question 3: Probability (Two-Headed Coin)

You pick a coin at random from a bag containing a fair coin and a two-headed coin. You flip the coin twice and get heads both times. What is the probability that you picked the fair coin?

a) 1/3
b) 4/13
c) 1/5
d) 2/5


Answer: c) 1/5

Explanation: Using Bayes' theorem: P(Fair|HH) = P(HH|Fair) × P(Fair) / P(HH)

  • P(HH|Fair) = 1/4, P(HH|Two-headed) = 1
  • P(Fair) = P(Two-headed) = 1/2
  • P(HH) = P(HH|Fair) × P(Fair) + P(HH|Two-headed) × P(Two-headed) = (1/4)(1/2) + (1)(1/2) = 1/8 + 1/2 = 5/8
  • P(Fair|HH) = (1/4)(1/2) / (5/8) = (1/8) / (5/8) = 1/5
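
The same Bayes computation as a small Python sketch using exact fractions:

from fractions import Fraction

p_fair = Fraction(1, 2)
p_hh_fair = Fraction(1, 4)        # (1/2)^2
p_hh_two_headed = Fraction(1, 1)  # a two-headed coin always shows heads

p_hh = p_hh_fair * p_fair + p_hh_two_headed * (1 - p_fair)
print(p_hh_fair * p_fair / p_hh)  # 1/5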

Question 4: Medical Screening P-Value

In medical screening, it is sometimes more important to avoid false negatives than false positives. How would this affect the p-value we might use for a medical test?

a) we would choose a higher p-value
b) we would choose a lower p-value
c) we would choose the same p-value
d) false negatives do not depend on p-value


Answer: a) we would choose a higher p-value

Explanation: To avoid false negatives (missing actual cases), the test needs high sensitivity, which means accepting more false positives (Type I errors). A higher p-value threshold (significance level) makes it easier to reject the null hypothesis of "no disease", increasing sensitivity at the cost of a higher Type I error rate.


Question 5: Derivative Calculation

Given y=x^x, what is dy/dx at x = 2?

a) 4(1+log2)
b) 4(2+log2)
c) 4
d) 8


Answer: a) 4(1+log2)

Explanation: For y = x^x, use logarithmic differentiation:

  • ln(y) = x ln(x)
  • Differentiating: (1/y)(dy/dx) = ln(x) + 1
  • Therefore: dy/dx = x^x(ln(x) + 1)
  • At x = 2: dy/dx = 2^2(ln(2) + 1) = 4(ln(2) + 1) = 4(1 + log2)
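
A quick symbolic check with SymPy (assumes sympy is installed):

import sympy as sp

x = sp.symbols('x', positive=True)
dy_dx = sp.diff(x**x, x)   # x**x * (log(x) + 1)
print(dy_dx.subs(x, 2))    # 4*log(2) + 4, i.e. 4(1 + ln 2)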

Question 6: Algorithm Selection (Clustering)

In a scenario where you need to group customers based on purchasing behavior without prior labels, which algorithm should you use?

a) Linear regression
b) K-means clustering
c) Logistic regression
d) Random forest


Answer: b) K-means clustering

Explanation: This is an unsupervised learning problem since there are no prior labels. K-means clustering is designed to group similar data points together without requiring labeled training data. Linear regression and logistic regression are supervised learning algorithms that need labeled data.


Question 7: Correlation Calculation

For random variables X and Y with Var(X)=1, Var(Y)=4, and Var(2X-3Y)=34, the correlation between X and Y is:

a) 1/2
b) 1/4
c) 1/3
d) None of the above


Answer: b) 1/4

Explanation:

  • Var(2X - 3Y) = 4Var(X) + 9Var(Y) - 12Cov(X,Y)
  • 34 = 4(1) + 9(4) - 12Cov(X,Y)
  • 34 = 4 + 36 - 12Cov(X,Y)
  • 34 = 40 - 12Cov(X,Y)
  • 12Cov(X,Y) = 6
  • Cov(X,Y) = 1/2
  • Correlation = Cov(X,Y)/√(Var(X)×Var(Y)) = (1/2)/√(1×4) = (1/2)/2 = 1/4

Question 8: Probability (Chessboard - Version 1)

Two squares are chosen at random on a chessboard. What is the probability that they have a side in common?

a) 8/13
b) 17/18
c) 5/13
d) 1/18


Answer: d) 1/18

Explanation:

  • Total ways to choose 2 squares from 64: C(64,2) = 2016
  • Adjacent pairs: 4 corner squares (2 neighbors each) + 24 edge squares (3 neighbors each) + 36 interior squares (4 neighbors each)
  • Total adjacent pairs = (4×2 + 24×3 + 36×4)/2 = 112
  • Probability = 112/2016 = 1/18
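
A brute-force count in Python confirms the figure (a minimal sketch):

from itertools import combinations

squares = [(r, c) for r in range(8) for c in range(8)]
adjacent = sum(1 for (r1, c1), (r2, c2) in combinations(squares, 2)
               if abs(r1 - r2) + abs(c1 - c2) == 1)   # squares sharing a side
total = len(squares) * (len(squares) - 1) // 2
print(adjacent, total)   # 112 2016, and 112/2016 = 1/18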

Question 9: Gradient Descent vs Normal Equation

Suppose you have a dataset with m=50 examples and n=200000 features for each example. You want to use multivariate linear regression to fit the parameters Θ. Should you prefer gradient descent or the normal equation?

a) Gradient descent, since inverse(X^T X) will be very slow to compute
b) Gradient descent, since it will always converge to the optimal Θ
c) The normal equation, since it provides an efficient way to directly find the solution
d) The normal equation, since gradient descent might be unable to find the optimal Θ


Answer: a) Gradient descent, since inverse(X^T X) will be very slow to compute

Explanation: With n=200000 features, computing (X^T X)^(-1) requires O(n³) operations, which is computationally expensive. Gradient descent scales much better with the number of features and is preferred for high-dimensional problems. The normal equation becomes impractical when n is very large.


Question 10: Unbiased Estimators

Let M and S² be the mean and variance of a random sample of size > 1 from a normal population with unknown mean μ and unknown finite variance σ²>0. Which statements are true?

  1. S² is an unbiased estimator of σ², and S is an unbiased estimator of σ
  2. ((n-1)/n)M is an unbiased estimator of μ, and ((n-1)/n)S² is an unbiased estimator of σ²

a) 1 only
b) 2 only
c) Both 1 and 2
d) Neither 1 nor 2


Answer: d) Neither 1 nor 2

Explanation: Statement 1: S² (sample variance with n-1 denominator) is unbiased for σ², but S (sample standard deviation) is NOT unbiased for σ due to Jensen's inequality. Statement 2: M is unbiased for μ, but ((n-1)/n)M is biased. Also, ((n-1)/n)S² is biased for σ². Therefore, neither statement is completely true.


Question 11: Overfitting/Underfitting Detection

Given a graph of a model's training and validation error, determine if it represents underfitting or overfitting.

a) Underfitting
b) Overfitting
c) Neither
d) Both


Answer: Cannot be determined without the graph

Explanation: This question refers to a graph that is not provided in the document. To determine overfitting vs underfitting:

  • Overfitting: Training error much lower than validation error
  • Underfitting: Both training and validation errors are high and similar

Question 12: Probability (Biased Coins)

You have two coins. One is fair (1/2 heads probability) and the other is biased (3/4 heads probability). You randomly pick a coin and flip it twice, getting heads both times. What is the probability that you picked the fair coin?

a) 13/32
b) 4/13
c) 2/13
d) 19/32


Answer: b) 4/13

Explanation: Using Bayes' theorem:

  • P(HH|fair) = (1/2)² = 1/4
  • P(HH|biased) = (3/4)² = 9/16
  • P(fair) = P(biased) = 1/2
  • P(fair|HH) = P(HH|fair)×P(fair) / [P(HH|fair)×P(fair) + P(HH|biased)×P(biased)]
  • = (1/4)(1/2) / [(1/4)(1/2) + (9/16)(1/2)] = (1/8) / (1/8 + 9/32) = (1/8) / (13/32) = 4/13

Question 13: Matrix Rank Relationships

If rank(A) is 2 and rank(AB) is 3, then:

a) rank(B) = 3
b) rank(B) <= 3
c) rank(B) >= 3
d) data insufficient


Answer: c) rank(B) >= 3

Explanation: We know that rank(AB) ≤ min(rank(A), rank(B)), so rank(AB) = 3 immediately forces rank(B) ≥ 3. (Strictly, it would also force rank(A) ≥ 3, so the premise rank(A) = 2 is inconsistent as stated; the intended takeaway is the lower bound on rank(B).)


Question 14: System of Equations (Version 1)

The number of solutions for the system: 2x+y-z=4, x-2y+z=-2, -x+2y-z=-2 is:

a) 0
b) 1
c) Inf
d) Can't be determined


Answer: a) 0

Explanation: Looking at equations 2 and 3:

  • x - 2y + z = -2
  • -x + 2y - z = -2

Adding these equations gives 0 = -4, which is a contradiction. Therefore, the system has no solutions.
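
The inconsistency can also be seen from the ranks of the coefficient and augmented matrices (a quick SymPy sketch):

import sympy as sp

A = sp.Matrix([[2, 1, -1], [1, -2, 1], [-1, 2, -1]])
b = sp.Matrix([4, -2, -2])
print(A.rank(), A.row_join(b).rank())   # 2 3 -> inconsistent, so no solutions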

Question 15: Logistic Classification Likelihood

When classifying data with logistic classification, what is the upper bound of the likelihood in the maximum likelihood method? Is this value attainable?

a) 1, Yes
b) e, No
c) 1, No
d) 0, Yes


Answer: a) 1, Yes

Explanation: The likelihood is the product of probabilities, each between 0 and 1. The maximum possible value is 1, achieved when all predictions are correct with probability 1. This is theoretically attainable when the data is perfectly separable.


Question 16: Neural Network Output Calculation

A 3-input neuron has weights 1, 4 and 3. The transfer function is linear with constant of proportionality = 3. The inputs are 4, 8 and 5 respectively. What will be the output?

a) 51
b) 153
c) 54
d) 160


Answer: b) 153

Explanation:

  • Weighted sum = w₁x₁ + w₂x₂ + w₃x₃ = 1×4 + 4×8 + 3×5 = 4 + 32 + 15 = 51
  • Output = transfer function × weighted sum = 3 × 51 = 153
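
The same calculation in Python (a minimal sketch):

import numpy as np

weights = np.array([1, 4, 3])
inputs = np.array([4, 8, 5])
k = 3                                 # constant of proportionality
print(k * np.dot(weights, inputs))    # 3 * 51 = 153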

Question 17: PCA Statements (Version 1)

Which statements about PCA are correct?

(i) We must standardize the data before applying
(ii) We should select principal components which explain the highest variance
(iii) We should select principal components which explain the lowest variance
(iv) We can use PCA for visualizing data in lower dimensions

a) (i), (ii) and (iv)
b) (ii) and (iv)
c) (iii) and (iv)
d) (i) and (iii)


Answer: b) (ii) and (iv)

Explanation:

  • (i) False: Standardization is recommended but not always necessary
  • (ii) True: We select components explaining the highest variance to retain most information
  • (iii) False: We want high variance components, not low variance ones
  • (iv) True: PCA is commonly used for dimensionality reduction and visualization
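
A minimal scikit-learn sketch illustrating points (ii) and (iv), using the iris dataset purely as an example (assumes scikit-learn is installed):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # standardizing is usually advisable

pca = PCA(n_components=2)                      # keep the 2 highest-variance components
X_2d = pca.fit_transform(X_scaled)             # 2-D projection, ready for plotting
print(pca.explained_variance_ratio_)           # fraction of variance each component explains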

Question 18: Algorithm Selection (Regression)

Given a scenario involving a dataset with labeled data and a need to predict a continuous output, which algorithm is most suitable?

a) K-means clustering
b) Linear regression
c) Decision tree classifier
d) Support vector machine (SVM) with RBF kernel


Answer: b) Linear regression

Explanation: For predicting continuous output with labeled data, this is a supervised regression problem. Linear regression is specifically designed for predicting continuous values. K-means is unsupervised, decision tree classifier is for classification, and SVM with RBF can work but linear regression is more direct.


Question 19: Infinite Solutions Parameter

The system x + y + z = 1, ax - ay + 3z = 5, 5x - 3y + az = 6 has infinite solutions if a =?

a) -3
b) 3
c) -4
d) 4


Answer: d) 4

Explanation: The determinant of the coefficient matrix is -2(a - 4)(a + 3), which vanishes at a = 4 and a = -3. At a = -3 the system becomes inconsistent (no solutions), but at a = 4 adding the first two equations reproduces the third, so rank(A) = rank([A|b]) = 2 < 3 and the system has infinitely many solutions. This can be verified with the sketch below.
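
A SymPy sketch of that verification:

import sympy as sp

a = sp.symbols('a')
A = sp.Matrix([[1, 1, 1], [a, -a, 3], [5, -3, a]])
b = sp.Matrix([1, 5, 6])
print(sp.factor(A.det()))   # -2*(a - 4)*(a + 3)

A4 = A.subs(a, 4)
print(A4.rank(), A4.row_join(b).rank())   # 2 2 -> consistent with rank < 3: infinite solutions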


Question 20: Probability (Speed Trap - Version 1)

Police operate a speed trap on routes A, B, C, and D for 40%, 30%, 20%, and 30% of the time, respectively. Biff, who always speeds, takes routes A, B, C, and D with probabilities 0.2, 0.1, 0.5, and 0.2. What is the probability that he gets a ticket?

a) 0.27
b) 0.93
c) 0.73
d) 0.07


Answer: a) 0.27

Explanation: P(ticket) = Σ P(route i) × P(enforcement on route i) = 0.2×0.4 + 0.1×0.3 + 0.5×0.2 + 0.2×0.3 = 0.08 + 0.03 + 0.10 + 0.06 = 0.27


Question 21: Matrix Eigenvalues (A^19)

Let A be a 2×2 matrix with a11=a12=a21=+1 and a22=-1. The eigenvalues of matrix A^19 are:

a) 1024 and -1024
b) 1024√2 and -1024√2
c) 4√2 and -4√2
d) 512√2 and -512√2


Answer: d) 512√2 and -512√2

Explanation:

  • Matrix A = [1 1; 1 -1]
  • Characteristic equation: det(A - λI) = (1-λ)(-1-λ) - 1 = λ² - 2 = 0
  • Eigenvalues of A: λ₁ = √2, λ₂ = -√2
  • Eigenvalues of A^19: (√2)^19 = 2^(19/2) = 2^9.5 = 512√2, (-√2)^19 = -512√2
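
A NumPy check (a minimal sketch; 512√2 ≈ 724.08):

import numpy as np

A = np.array([[1, 1], [1, -1]])
A19 = np.linalg.matrix_power(A, 19)
print(np.linalg.eigvals(A19))   # approximately [ 724.08 -724.08]
print(512 * np.sqrt(2))         # 724.077...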

Question 22: Characteristic Equation and Matrix Inverse

If the characteristic equation of matrix A is t²-t-1=0, then:

a) A^(-1) does not exist
b) A^(-1) exists but cannot be determined
c) A^(-1) = A - I
d) A^(-1) = A + I


Answer: c) A^(-1) = A - I

Explanation:

  • From the characteristic equation: A² - A - I = 0
  • Rearranging: A² = A + I
  • Multiplying both sides by A^(-1): A = I + A^(-1)
  • Therefore: A^(-1) = A - I
  • Since 0 is not an eigenvalue (from t²-t-1=0), A^(-1) exists.
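
A numerical sanity check, using the companion matrix of t² - t - 1 as one concrete matrix with this characteristic polynomial (chosen only for illustration):

import numpy as np

A = np.array([[0.0, 1.0], [1.0, 1.0]])                 # characteristic polynomial t^2 - t - 1
print(np.allclose(np.linalg.inv(A), A - np.eye(2)))    # True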

Question 23: Bias-Variance Tradeoff

Data generated from a cubic function and corrupted by Gaussian noise is fitted with a linear model (M1) and a 5th-degree polynomial model (M5). Which of the following holds?

a) Bias(M1) ≤ Bias(M5), Variance(M1) ≤ Variance(M5)
b) Bias(M1) ≥ Bias(M5), Variance(M1) ≤ Variance(M5)
c) Bias(M1) ≤ Bias(M5), Variance(M1) ≥ Variance(M5)
d) Bias(M1) ≥ Bias(M5), Variance(M1) ≥ Variance(M5)


Answer: b) Bias(M1) ≥ Bias(M5), Variance(M1) ≤ Variance(M5)

Explanation: Linear model (M1) cannot capture the cubic relationship, so it has high bias but low variance (simple model). 5th degree polynomial (M5) can fit the cubic function well, so it has low bias but high variance (complex model). This is the classic bias-variance tradeoff: simpler models have higher bias and lower variance.
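
A small simulation can make the tradeoff concrete. The sketch below uses an assumed cubic target x³ − x, noise with σ = 0.3, and 200 resampled training sets of 30 points each; the exact numbers are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(-1, 1, 50)
true_test = x_test**3 - x_test

preds = {1: [], 5: []}                       # polynomial degree -> predictions per training set
for _ in range(200):
    x = rng.uniform(-1, 1, 30)
    y = x**3 - x + rng.normal(0, 0.3, 30)
    for deg in preds:
        coeffs = np.polyfit(x, y, deg)
        preds[deg].append(np.polyval(coeffs, x_test))

for deg, p in preds.items():
    p = np.array(p)
    bias2 = np.mean((p.mean(axis=0) - true_test) ** 2)
    variance = np.mean(p.var(axis=0))
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
# Expected pattern: degree 1 shows the larger bias^2, degree 5 the larger variance.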


Question 24: Probability (Chessboard - Version 2)

Two squares are chosen at random on a chessboard. What is the probability that they have a side in common?

a) 1/36
b) 1/9
c) 2/9
d) 1/18


Answer: d) 1/18

Explanation: Same as Question 8:

  • Total ways to choose 2 squares from 64: C(64,2) = 2016
  • Adjacent pairs: 4 corner squares (2 neighbors each) + 24 edge squares (3 neighbors each) + 36 interior squares (4 neighbors each)
  • Total adjacent pairs = (4×2 + 24×3 + 36×4)/2 = 112
  • Probability = 112/2016 = 1/18

Question 25: System of Equations (Version 2)

The system 2x+y-z=1, x-2y+z=-2, -x+2y-z=-1 has how many solutions?

a) 0
b) 1
c) 2
d) Infinitely many


Answer: a) 0

Explanation: Looking at equations 2 and 3:

  • x - 2y + z = -2
  • -x + 2y - z = -1

Adding these equations gives 0 = -3, which is a contradiction. Therefore, the system has no solutions.

💻 Part B – Amazon ML Summer School Coding Questions

Problem 1: Statistical Calculations (Mean, Median, Mode)

Given 'n' integers, calculate their mean, median and mode with specific precision requirements.

Function Signature

function calculateStats(input1: number, input2: number[]): [number, number, number]

Requirements

  • Mean and median: Must be correct to six decimal places
  • Mode: Return as integer (smallest number in case of ties)

Definitions

Mean

Defined as the average of all numbers in the array:

Mean = (sum of all elements) / (number of elements)

Median

Defined as the middle element of the sorted array:

  • If n is even: median = average of two middle elements
  • If n is odd: median = middle element

Note: Array must be sorted in ascending order for median calculation.

Mode

Defined as the number with the highest frequency:

  • If multiple numbers have same highest frequency, choose the smallest number

Input Specification

  • input1: Integer (1 ≤ input1 ≤ 1000) - length of array
  • input2: Integer array containing 'input1' integers

Output Specification

  • output1: Mean (double with 6 decimal places)
  • output2: Median (double with 6 decimal places)
  • output3: Mode (integer)

Test Cases

Example 1

Input1: 3
Input2: [1, 2, 3]
Output: 2.000000, 2.000000, 1

Example 2

Input1: 5
Input2: [41, 18467, 6334, 26500, 19169]
Output: 14102.200000, 18467.000000, 41

Calculation for Example 2:

  • Mean: (41 + 18467 + 6334 + 26500 + 19169) / 5 = 14102.200000
  • Median: Sorted array [41, 6334, 18467, 19169, 26500] → 18467.000000
  • Mode: All elements appear once, smallest is 41
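
One possible Python solution, sketched with my own function name (Python is allowed in the test even though the signature above is written in a generic pseudo-TypeScript style):

from collections import Counter

def calculate_stats(n, arr):
    mean = sum(arr) / n
    s = sorted(arr)
    median = float(s[n // 2]) if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    counts = Counter(arr)
    top = max(counts.values())
    mode = min(v for v, c in counts.items() if c == top)   # smallest value on ties
    return mean, median, mode

mean, median, mode = calculate_stats(5, [41, 18467, 6334, 26500, 19169])
print(f"{mean:.6f} {median:.6f} {mode}")   # 14102.200000 18467.000000 41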

Problem 2: Robot String Manipulation

There are three robots named Ray, Ben and Kevin. Initially Ray has a string S of length N, while the other two robots have empty strings. We can make either of the following moves:

  • Move 1: Remove the first character from Ray's string and append it to Ben's string
  • Move 2: Remove the last character from Ben's string and append it to Kevin's string

You must perform a sequence of these moves such that the strings left with Ray and Ben are empty and the string left with Kevin is lexicographically the smallest.

Task

Return the lexicographically smallest string that Kevin has after completing this activity.

Important Note

For any two given strings, a string is said to be lexicographically smaller than the other if it comes before the other string in the dictionary.

Constraints

  • Input: String S of length N
  • Output: Lexicographically smallest string that Kevin will have

Approach Hints

  • This is essentially a stack-based problem
  • Ben's string acts like a stack (LIFO - Last In, First Out)
  • Need to find optimal sequence of moves to minimize lexicographic order
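
A sketch of the greedy, stack-based idea (variable names are my own; assumes lowercase input): pop from Ben to Kevin whenever Ben's top character is no larger than the smallest character still waiting in Ray's string.

def kevins_string(s):
    n = len(s)
    suffix_min = ['~'] * (n + 1)          # '~' sorts after any lowercase letter
    for i in range(n - 1, -1, -1):
        suffix_min[i] = min(s[i], suffix_min[i + 1])

    ben, kevin = [], []
    for i, ch in enumerate(s):
        ben.append(ch)                                  # Move 1
        while ben and ben[-1] <= suffix_min[i + 1]:     # nothing better left in Ray
            kevin.append(ben.pop())                     # Move 2
    return "".join(kevin)                 # Ben is emptied during the final iteration

print(kevins_string("cab"))   # "abc"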

Problem 3: Minimum Cost Flight Path

There are N cities in a country. George starts at city 1 (airport) and wants to reach city N. From any city i, he can travel to:

  • City (i+1) if it exists, OR
  • City (i+3) if it exists

Cost Calculation

Flight cost between cities i and j is calculated as:

Cost = |Cost[i] - Cost[j]|

where Cost[i] represents the cost value for city i in the given array.

Task

Find and return the minimum possible cost to reach city N from city 1.

Constraints

  • Number of cities N > 3
  • Use 1-based indexing
  • Can only move to (i+1) or (i+3) from city i

Input Specification

  • input1: Integer N (number of cities)
  • input2: Integer array A (cost values for each city)

Output Specification

  • Return minimum total cost to reach city N

Test Cases

Example 1

Input1: 4
Input2: [1, 4, 5, 2]
Output: 1

Explanation:

  • Direct path: City 1 → City 4 (using i+3 rule)
  • Cost: |Cost[1] - Cost[4]| = |1 - 2| = 1

Example 2

Input1: 6
Input2: [4, 12, 13, 18, 10, 12]
Output: 10

Explanation: Optimal path: City 1 → City 2 → City 3 → City 6

  • City 1 → City 2: |4 - 12| = 8
  • City 2 → City 3: |12 - 13| = 1
  • City 3 → City 6: |13 - 12| = 1
  • Total cost: 8 + 1 + 1 = 10

Algorithm Approach

  • This is a Dynamic Programming problem
  • Can also be solved using BFS/Dijkstra's algorithm
  • State: dp[i] = minimum cost to reach city i
  • Transition: dp[i] = min(dp[i-1] + cost(i-1,i), dp[i-3] + cost(i-3,i))
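
A minimal DP sketch following the approach above (0-based indexing internally; names are my own):

def min_flight_cost(n, cost):
    INF = float("inf")
    dp = [INF] * n                        # dp[i] = min cost to reach city i+1
    dp[0] = 0
    for i in range(1, n):
        dp[i] = min(dp[i], dp[i - 1] + abs(cost[i] - cost[i - 1]))
        if i >= 3:
            dp[i] = min(dp[i], dp[i - 3] + abs(cost[i] - cost[i - 3]))
    return dp[n - 1]

print(min_flight_cost(4, [1, 4, 5, 2]))             # 1
print(min_flight_cost(6, [4, 12, 13, 18, 10, 12]))  # 10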


Below are problem statements similar to those reportedly asked in previous Amazon ML Summer School coding tests, based on participant experiences:

1. Mean, Median, Mode Calculation

Given an array of integers, write a function to compute the mean, median, and mode.

2. Climbing Stairs Variant

Given a staircase with n steps, you can climb 1 or 2 steps at a time. Find the number of distinct ways to reach the top.

3. Array Manipulation with Constraints

Given an array of package weights and a maximum weight limit, compute the minimum cost to deliver all packages by choosing consecutive packages in a single trip, where the cost is based on the sum of weights in each trip.

4. Run Length Encoding

Given a string, return its run-length encoding (e.g., "aaaaabb" → "a5b2").

5. Median in a Stream

Given a stream of integers, write a function to add numbers to the stream and return the median after each addition.

6. Maximum Subarray Sum

Given an array of integers (positive and negative), find the maximum sum of any contiguous subarray.

7. Valid Parentheses

Given a string containing only parentheses characters ('(', ')', '{', '}', '[', ']'), determine if the input string is valid. A string is valid if all open brackets are closed in the correct order.

8. Top K Frequent Elements

Given an array of integers and an integer k, return the k most frequent elements in the array. You can return the answer in any order.

9. Longest Substring Without Repeating Characters

Given a string, find the length of the longest substring without repeating characters.

10. Minimum Cost to Merge Stones

Given an array of integers representing stone weights and an integer k, you can merge exactly k consecutive stones into one stone with weight equal to their sum. Find the minimum cost (sum of merged weights) to merge all stones into one.



Apply now if you haven't applied yet: Check now

Join the Telegram group for any doubts & discussions!

🧰 Useful Resources for Your Placement Prep