
Data Science Community Knowledge Base

What are proxy datasets for machine learning?

In machine learning, a proxy dataset is a stand-in for the original dataset in the model training process. When running ML research and experiments, it is good practice to create and use a smaller, high-quality dataset, distinct from the original, called a proxy dataset. Designing a system on a proxy dataset is fast and cost-effective: it enables quicker iteration on architectural choices and hyperparameter tuning. Architecture selection and tuning is a common bottleneck in developing deep learning models, and a compressed, efficient dataset speeds up this process without compromising performance on the unseen test set. Experimental results obtained on a well-constructed proxy dataset correlate with the results obtained on the full original dataset, so model quality can be improved, even without access to the trained model, by generating and using higher-quality proxy datasets.

A proxy dataset is often a subset of the original dataset. Experiments show that proxy datasets of roughly 1-10% of the original size, when used for training, can match the performance obtained from training on the full dataset. Proxy datasets can be created in two ways. One approach uses deep learning techniques to synthesize small, compressed versions of the original dataset as synthetic samples, e.g. dataset distillation and dataset condensation. The other samples data directly from the original dataset and uses the resulting subset in the ML cycle, applying sampling strategies that select the samples most beneficial for training, e.g. random sampling, class-conditional random sampling, and outlier-removal sampling. Proxy datasets are now widely used in computer vision, where real datasets containing low-resolution, corrupted, or high-dimensional images are replaced with compressed, crisp synthetic images.
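As a minimal sketch of the sampling-based approach, the function below (a hypothetical helper, not from any specific library) performs class-conditional random sampling: it draws the same fraction of examples from each class, so the proxy subset preserves the class balance of the original dataset.

```python
import numpy as np

def class_conditional_sample(X, y, fraction, seed=0):
    """Draw `fraction` of the samples from each class at random,
    so the proxy dataset keeps the original class distribution."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        n = max(1, int(round(fraction * idx.size)))  # at least 1 per class
        keep.append(rng.choice(idx, size=n, replace=False))
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 100 samples, two classes with a 60/40 split
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 60 + [1] * 40)

# A 10% proxy dataset: 6 samples of class 0 and 4 of class 1
X_proxy, y_proxy = class_conditional_sample(X, y, fraction=0.1)
print(len(y_proxy))          # 10
print(np.bincount(y_proxy))  # [6 4]
```

Plain random sampling can under-represent rare classes in small proxies; stratifying per class, as above, is a simple way to avoid that while keeping the proxy a faithful miniature of the original dataset.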
