Batch Size Analysis

Batch size determines how many samples are processed before updating model weights. Four values were tested: 16, 32, 64, and 128.

Trade-offs

Batch Size | Gradient Updates per Epoch | Update Noise | Generalization
-----------|----------------------------|--------------|---------------
16         | High                       | High (noisy) | Often better
32         | Moderate                   | Moderate     | Good balance
64         | Moderate-low               | Lower        | Moderate
128        | Low                        | Low (smooth) | May overfit

How It Works

Each training step computes the gradient on a mini-batch and updates the weights. With a smaller batch:
  • More updates happen per epoch (more gradient steps).
  • Each gradient estimate is noisier (fewer samples to average over), but this noise can act as implicit regularization — helping the model escape sharp local minima and find flatter, better-generalizing solutions.
With a larger batch:
  • Fewer updates per epoch.
  • Gradient estimates are smoother and more accurate, but the model may converge to sharper minima that generalize less well.
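The update loop described above can be sketched as minimal mini-batch SGD on a toy 1-D linear regression. This is an illustrative example, not the original experiment: the dataset, learning rate, and `train_epoch` function are all made up for demonstration. It shows the core trade-off directly: a smaller batch means more (noisier) updates per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (hypothetical): y = 2x plus a little noise
X = rng.uniform(-1, 1, size=(1024, 1))
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=1024)

def train_epoch(w, batch_size, lr=0.1):
    """One epoch of mini-batch SGD on squared error.
    Returns the updated weight and the number of gradient updates made."""
    idx = rng.permutation(len(X))              # shuffle each epoch
    updates = 0
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        pred = X[batch, 0] * w
        grad = 2.0 * np.mean((pred - y[batch]) * X[batch, 0])  # d(MSE)/dw
        w -= lr * grad                         # one weight update per mini-batch
        updates += 1
    return w, updates

w = 0.0
for _ in range(20):
    w, n_updates = train_epoch(w, batch_size=16)

# With 1,024 samples, batch size 16 gives 64 updates per epoch;
# batch size 128 would give only 8, so each epoch does far less work.
```

After a few epochs the weight converges near the true slope of 2.0; the per-batch gradients differ from the full-batch gradient, and that sampling noise is exactly the implicit-regularization effect described above.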

Practical Considerations

For a dataset of 1,025 samples:
  • Batch size 16 or 32 keeps gradient updates frequent and noisy enough to regularize effectively.
  • Batch size 128 leaves only ~8 gradient updates per epoch, which may slow learning or hurt generalization on a small dataset.
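The update counts behind these points follow directly from the dataset size. A quick calculation (using the 1,025-sample count from the text) shows both the number of full mini-batches per epoch and the total including the final partial batch:

```python
import math

n_samples = 1025  # dataset size from the text

for batch_size in (16, 32, 64, 128):
    full = n_samples // batch_size               # full mini-batches (drop-last behavior)
    total = math.ceil(n_samples / batch_size)    # including the final partial batch
    print(f"batch={batch_size:>3}: {full} full updates, {total} total per epoch")
```

Batch size 16 yields 64 full updates per epoch (65 counting the final partial batch of 1 sample), while batch size 128 yields only 8 (9 with the partial batch), an 8x difference in how often the weights move each epoch.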

Takeaway

Smaller batch sizes (16–32) tend to work better on small datasets like this one. They provide more gradient updates and their inherent noise acts as a form of regularization alongside Dropout.