Taking Subsets of a PyTorch Dataset
PyTorch’s Dataset class provides a flexible way to manage your data, but sometimes you only need a smaller subset of it. Common reasons include:
- Debugging
- Experimenting with different models
- Performing stratified sampling
Let’s explore various methods to take subsets of your PyTorch datasets.
1. Slicing
1.1 Basic Slicing
You can use Python’s standard slicing syntax, as long as the dataset’s __getitem__ forwards the index to a container that accepts slices (here, a tensor):
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

data = torch.arange(10)
dataset = MyDataset(data)
# Select elements from index 2 to 5 (exclusive); the result is
# whatever self.data[2:5] returns -- here a tensor, not a Dataset
subset = dataset[2:5]  # tensor([2, 3, 4])
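Slicing is not part of the Dataset contract; it works above only because the index is forwarded to a tensor. As a contrast, here is a hypothetical dataset whose __getitem__ coerces the index to int, and therefore rejects slices:

class StrictDataset(Dataset):
    """A dataset whose __getitem__ only handles integer indices."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[int(idx)]  # int() fails for a slice

strict = StrictDataset(torch.arange(10))
print(strict[3])       # tensor(3) -- integer indexing is fine
try:
    strict[2:5]        # int() of a slice raises TypeError
except TypeError as e:
    print("slicing unsupported:", e)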
1.2 Selecting Specific Indices
You can also pass a list of indices to access specific data points; again, this relies on the underlying tensor’s fancy indexing:

indices = [0, 3, 7]
subset = dataset[indices]  # tensor([0, 3, 7])
2. Subset Class
The indexing tricks above only work for tensor-backed datasets. To take a subset of any Dataset, whatever its __getitem__ does, use the torch.utils.data.Subset class.
2.1 Creating a Subset
from torch.utils.data import Subset
# Create a Subset using a list of indices
subset = Subset(dataset, indices)
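A Subset is itself a Dataset, so it drops straight into a DataLoader; a minimal sketch reusing the dataset and indices from above:

from torch.utils.data import DataLoader

loader = DataLoader(subset, batch_size=2)
for batch in loader:
    print(batch)  # tensor([0, 3]), then tensor([7])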
2.2 Using a Filter Function
You can define a filter function to select elements based on specific criteria. Subset itself only accepts a dataset and a sequence of indices, so apply the filter while building the index list:

def filter_function(idx):
    # Return True for indices to include, False otherwise
    return dataset.data[idx] % 2 == 0

indices = [i for i in range(len(dataset)) if filter_function(i)]
subset = Subset(dataset, indices)  # elements 0, 2, 4, 6, 8
3. Custom Subset Class
For maximum control, create a custom subclass of Dataset:
class MySubset(Dataset):
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

subset = MySubset(dataset, indices)
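MySubset mirrors what torch.utils.data.Subset does internally. A custom class pays off when you want extra behavior; for example, this sketch (the transform argument is an illustration, not a PyTorch API) applies a function to every selected item:

class TransformedSubset(Dataset):
    """Subset that applies a transform to each selected item."""
    def __init__(self, dataset, indices, transform=None):
        self.dataset = dataset
        self.indices = indices
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        item = self.dataset[self.indices[idx]]
        return self.transform(item) if self.transform else item

doubled = TransformedSubset(dataset, [0, 3, 7], transform=lambda x: x * 2)
print([doubled[i] for i in range(len(doubled))])  # [tensor(0), tensor(6), tensor(14)]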
4. Stratified Sampling
For balanced subsets, consider stratified sampling. This technique ensures that the class distribution of the subset matches that of the original dataset.
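Both approaches below assume the dataset exposes its labels, e.g. as a labels attribute. MyDataset above has no labels, so here is a minimal labeled variant with toy, deliberately imbalanced data for illustration:

import numpy as np

class MyLabeledDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# 100 samples with an 80/20 class imbalance
data = torch.arange(100)
labels = np.array([0] * 80 + [1] * 20)
dataset = MyLabeledDataset(data, labels)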
4.1 Using a Library
Several libraries, such as scikit-learn, provide functions for stratified sampling.
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, val_index in splitter.split(dataset.data, dataset.labels):
    train_subset = Subset(dataset, train_index)
    val_subset = Subset(dataset, val_index)
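Because the split is stratified, each side keeps the original class balance; a quick check on the validation indices from the loop above (the counts assume the 80/20 toy labels defined earlier):

print(np.bincount(dataset.labels[val_index]))  # e.g. [16  4] -- same 80/20 ratio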
4.2 Manual Stratification
You can manually implement stratified sampling:
import numpy as np

# Assuming labels are available as 'dataset.labels'
labels = np.asarray(dataset.labels)
fraction = 0.2  # fraction of the dataset to keep

# Draw the same fraction from every class so the subset's
# class proportions match the original dataset's
subset_indices = []
for cls in np.unique(labels):
    cls_indices = np.where(labels == cls)[0]
    num_samples = int(round(fraction * len(cls_indices)))
    chosen = np.random.choice(cls_indices, size=num_samples, replace=False)
    subset_indices.append(chosen)

# Concatenate the per-class selections
all_indices = np.concatenate(subset_indices)
subset = Subset(dataset, all_indices)
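To confirm the manual version also preserves the balance, compare label counts (the example counts again assume the 80/20 toy labels from above):

print(np.unique(labels, return_counts=True))               # e.g. (array([0, 1]), array([80, 20]))
print(np.unique(labels[all_indices], return_counts=True))  # e.g. (array([0, 1]), array([16, 4]))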
5. Best Practices
- Use torch.utils.data.Subset when possible for simple selections.
- Consider custom subsets for more complex scenarios.
- Prioritize stratified sampling for balanced and representative subsets.
Conclusion
This article covered several ways to create subsets of PyTorch datasets: direct indexing for tensor-backed data, torch.utils.data.Subset for arbitrary datasets, custom subset classes for extra control, and stratified sampling for balanced splits. Choose the approach that best fits your task.