
Data Science Community Knowledge Base

What are proxy datasets for machine learning?

In machine learning, a proxy dataset is a stand-in for the original dataset in the model training process. When running ML research and experiments, it is good practice to create and use a smaller, high-quality dataset, distinct from the original, called a proxy dataset. Designing a system on a proxy dataset is fast and cost-effective: it enables quicker iteration on architectural choices and hyperparameter tuning. Architecture selection and tuning is a common bottleneck in developing deep learning models, and a compressed, efficient dataset speeds up this process without compromising performance on the unseen test set. Experimental results obtained on a well-constructed proxy dataset correlate with the results obtained on the full original dataset, so model quality can be improved, even without access to the trained model, by generating and using higher-quality proxy datasets.

A proxy dataset is often a subset of the original dataset. Experiments show that proxy datasets of roughly 1-10% of the original size, when used for training, can match the performance obtained from training on the full dataset. Proxy datasets can be created in two ways. One approach uses deep learning techniques to synthesize small, compressed versions of the original dataset as synthetic samples, e.g. dataset distillation and dataset condensation. The other samples data directly from the original dataset and uses the resulting subset in the ML cycle, applying sampling strategies that select the samples most beneficial for training, e.g. random sampling, class-conditional random sampling, and outlier-removal sampling. Proxy datasets are now widely used in computer vision, where real datasets containing low-resolution, corrupted, or high-dimensional images are replaced with compressed, crisp synthetic images.
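As a minimal sketch of the sampling-based approach, the function below (a hypothetical helper, not from any specific library) performs class-conditional random sampling: it draws the same fraction of examples from each class, so the proxy subset preserves the class balance of the original dataset.

```python
import numpy as np

def class_conditional_sample(X, y, fraction, seed=0):
    """Draw `fraction` of the samples from each class at random,
    so the proxy dataset keeps the original class distribution."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        n = max(1, int(round(fraction * idx.size)))  # at least 1 per class
        keep.append(rng.choice(idx, size=n, replace=False))
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 100 samples, two classes with a 60/40 split
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 60 + [1] * 40)

# A 10% proxy dataset: 6 samples of class 0 and 4 of class 1
X_proxy, y_proxy = class_conditional_sample(X, y, fraction=0.1)
print(len(y_proxy))          # 10
print(np.bincount(y_proxy))  # [6 4]
```

Plain random sampling can under-represent rare classes in small proxies; stratifying per class, as above, is a simple way to avoid that while keeping the proxy a faithful miniature of the original dataset.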
