Approximating Memorization Using Loss Surface Geometry for Dataset Pruning and Summarization

Andrea Agiollo, Young In Kim, Rajiv Khanna
ACM
August 2024

The sustainable training of modern neural network models represents an open challenge. Several existing methods approach this issue by identifying a subset of relevant samples from the full training data to use in model optimization, with the goal of matching the performance obtained by training on the full dataset. Our work explores using memorization scores to find representative and atypical samples. We demonstrate that memorization-aware dataset summarization improves subset construction. However, computing memorization scores is notably resource-intensive. To this end, we propose a novel method that leverages the discrepancy between sharpness-aware minimization and stochastic gradient descent to capture the atypicality of data points. We evaluate our metric against several efficient approximation functions for memorization scores (proxies), empirically showing superior correlation and effectiveness. We explore the causes behind our approximation quality, highlighting how typical data points induce a flatter loss landscape than atypical ones. Extensive experiments confirm the effectiveness of our proxy for dataset pruning and summarization tasks, surpassing state-of-the-art approaches both in canonical setups, where atypical data points benefit performance, and in few-shot learning scenarios, where atypical data points can be detrimental.
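
The abstract's core idea lends itself to a simple per-sample score. The following is a minimal PyTorch sketch of one way such a proxy could be computed, assuming the gap between the loss at SAM-perturbed weights and the loss at the current weights is taken as the atypicality signal; the function name `sam_sgd_discrepancy` and the perturbation radius `rho` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact method): score each sample by the
# increase in its loss after a single SAM-style ascent step on the weights.
import torch
import torch.nn.functional as F


def sam_sgd_discrepancy(model, x, y, rho=0.05):
    """Per-sample score: loss at SAM-perturbed weights minus loss at current weights.

    A larger gap suggests the sample sits in a sharper region of the loss
    surface, which this sketch treats as a signal of atypicality.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Plain (SGD-view) per-sample loss at the current weights.
    base_loss = F.cross_entropy(model(x), y, reduction="none")

    # Gradient of the mean loss, used to build the SAM ascent direction.
    grads = torch.autograd.grad(base_loss.mean(), params)
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12

    with torch.no_grad():
        # Perturb weights toward the approximate worst case within an L2 ball of radius rho.
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(params, eps):
            p.add_(e)
        perturbed_loss = F.cross_entropy(model(x), y, reduction="none")
        # Restore the original weights.
        for p, e in zip(params, eps):
            p.sub_(e)

    return (perturbed_loss - base_loss.detach()).cpu()
```

Under these assumptions, samples with larger gaps would be ranked as more atypical, and that ranking could then drive which points are kept or dropped during pruning or summarization.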

Keywords: Neural Networks, Data-efficient Learning, Memorization, Flatness
Reference talk: Approximating Memorization Using Loss Surface Geometry for Dataset Pruning and Summarization (KDD 2024, 27/08/2024) – Andrea Agiollo (Andrea Agiollo, Young In Kim, Rajiv Khanna)
Funding project: ENGINES – ENGineering INtElligent Systems around intelligent agent technologies (28/09/2023–27/09/2025)