Scalability Platforms
PyTorch Lightning
Integrating with PyTorch Lightning was the first major exploration into generating saliency maps over many images (datasets) instead of just a few selected samples so there was a learning curve in how to best achieve that goal in addition to PyTorch Lightning specific lessons learned. This exploration identified two levels of parallelization:
Across masks for a single image.
Across many images, such as a dataset.
Initially, the “standard” integration strategy of implementing the
ClassifyImage interface was attempted. This strategy was dubbed the
“high-level” strategy in the example notebook. However, the ClassifyImage
interface is really only the prediction part of the saliency map computation.
Therefore, the remainder of the computation remained outside the scope of
Lightning’s framework, limiting the scalability that could be leveraged. This
strategy has the most appeal when a large number of pertubation masks is
needed, as masks from each individual image are split up into batches but
saliency maps are still computed serially.
The structure that the high-level API currently imposes upon the user
essentially limits saliency map generation to one at a time, unless the user
implements a threaded strategy or some other parallelization technique. The
second integration strategy in the example notebook, the “low-level” strategy,
aims to more effectively utilize PyTorch Lightning as such. When using the low-level
API, the user is responsible for driving all components of saliency map
generation: generating perturbed data, predicting on this data, and generating
saliency maps from these results. While the “brains” of the computation still
remain within xaitk-saliency itself, the finer-grained control of the
individual components give the user more control over when/where/how each
component happens. If the user is able to trigger these computations
after/where scaling takes place within the framework, then scalability can be
taken advantage of, with minimal/no resource management from the user’s
perspective. Thus, this method is most advantageous where the “many images”
level of parallelization is concerned. The question then becomes more platform
specific: where is the ideal location for these API calls to go?
A few different options were considered to integrate the low-level API calls
into the PyTorch Lightning framework including callbacks, loops, and *_step
methods. Callbacks were considered due to their isolated nature; they just
need to be attached to the trainer to add on saliency map generation. Callbacks
in Lightning have many entry points and are relatively flexible. However, it
was found ill-advised to get new model outputs (for perturbed data) within a
callback as that can again trigger other callbacks causing a cascading effect
among other issues. The next option that was considered was overriding or
writing a custom loop. Loops are once again easy to add to a Lightning
trainer and are still separate from the LightningModule itself, but it takes
some effort to maintain existing functionality within these loops
(bookkeeping/hooks) and ensuring resources (e.g. perturbed data) end up on the
correct device can be challenging. Ultimately, it was decided the best option,
in the case of the example notebook, was overriding a *_step method,
the predict step in particular. With this option, the model of interest is
wrapped with another LightningModule overriding the relevant *_step
method(s). While this option is slightly more integrated with the
LightningModule than the other options, it is very easy to get generated data
on the correct device and the user doesn’t need to be concerened with
mantaining any bookkeeping or hooks. Additionally, it is easy to load the
wrapped model with an existing checkpoint so no re-training needs to occur.
The example notebook generates saliency maps for every image that is predicted
upon, with the idea that a condition-check could be included if desired.
Realistically, any strategy that generates many saliency maps likely needs to be integrated with an artifact tracking system which may pose further considerations when choosing a strategy.
While a certain strategy may be better in a particular situation, the results produced via each method should be consistent. We can see this is the case in our example notebook with the results comparison cell.
Benchmarking data for each integration strategy is available in
examples/lightning/benchmarking.ipynb. These results showcase that the
high-level integration is better suited for scaling across number of masks,
while the low-level integration strategy achieves some level of effective
scaling across both number of masks and number of total images.
Speaking more generally about scaling again, xaitk-saliency currently uses
numpy as its “universal” input format. Depending on the framework being
used, there is overhead in the form of tensor conversions from GPU <-> CPU and
also some data format (e.g. torch tensors) <-> numpy, which may be
suboptimal. There is potential that adding support to xaitk-saliency for
specific data types could improve both performance and usability, however,
care must be taken so that xaitk-saliency does not lose its
interoperability between frameworks/platforms/etc. Furthermore, it may be worth
exploring whether or not additional speedup can be gained from moving key
operations, which are currently CPU only, to GPU processing. Again, any
benefits this might provide should be weighed against the change in complexity
and maintainability required to support this functionality.
Other PyTorch Lightning specific “gotchas”:
Scaling strategies are limited within interactive environments
dpis noted to be slow by documentation.ddp_notebookcan only be initialized once in a session.submitit can be used as a workaround to enable true
ddpmode but some interactiveness will be lost (see Lightning benchmarking notebook for example).
If reproducability is required, ensure the global seed is set. The number of accelerators may also need to be reduced if results must be in a specific order.
Benchmarking PyTorch Lightning code should be done with PyTorch’s benchmarking module.
Lightning operates mainly on
DataLoaders. AnIterableDatasetcan be used with the generatorsxaitk-saliencyyields to prevent loading everything into memory at once. Note that usingnum_workers > 1with anIterableDatasetmay result in duplicating data. Care should be taken to split data assignments across workers so that work is not also duplicated. In the example notebook, worker ID is used to carry out this assignment process. If data cannot be duplicated,num_workersshould be limited.The fill value used by the saliency generator depends on when perturbed data is generated (before or after pre-processing).
Ensure the model outputs given to the saliency generator correspond to what the interface is asking for, such as the result of a softmax operation for probabilities/confidence values. Providing other values may result in strange-looking saliency maps that are difficult (or impossible) to interpret.
Hugging Face Accelerate
Hugging Face Accelerate enables PyTorch code to be run across any distributed
configuration, just by utilizing the Accelerator class. At a high-level,
at least in the multi-GPU use case, Accelerate acts as a multiprocessing
adapter, which lends itself very well to parallelization across many images.
This capability bypasses the single-threaded limitation that
xaitk-saliency’s high-level API has, without requiring the user to fully
implement a parallelization technique themselves.
Overall, Hugging Face Accelerate provides a fairly low-barrier way to run
across various distributed configurations with regards to code changes. Due to
this, the example notebook created for exploring this integration largely
follows the “typical” integration strategy used when implementing the
ClassifyImage interface, with the biggest difference in integration being
the need to gather results as an artifact of the truly multiprocessing nature
of Accelerate. It should also be noted that this gethering of results may
result in some data transfer overhead. To gather results across processes,
results must be PyTorch Tensors and on the appropriate device (i.e. GPU not
CPU if GPU is being utilized). However, as xaitk-saliency works with
numpy as its universal format, these results must be converted back into
Tensors on GPU to be gathered across processes. It is possible, especially with
artifact tracking in place, that this gathering of results could be unnecessary
depending on the application.
Benchmarking data for the integration strategy is available in
examples/huggingface/benchmarking.ipynb. These results showcase that the
integration effectively reduces computation time with an increase in the
number of GPUs used. The improvement is not quite linear due to the overhead
in managing data across multiple processes. As this integration does not
specifically consider parallelizing computation within the computation of
saliency maps for a singular image, we see limited improvement as the number
of masks increases, as expected.
It was noted during this exploration that an incongruence between
xaitk-saliency and these scalability platforms may exist. xaitk-saliency
uses a channel-last format while both Lightning and Accelerate used channel-first
formats for the given integration use cases. This difference incurs
potentially significant overhead cost to get the data in the appropriate
format.
Other Hugging Face Accelerate specific “gotchas”:
Use caution when selecting batch size. Using a batch size larger than the number of image samples (potentially relative to the number of processes, based on the
Acceleratorsettings) can resultNone, nonsense, or duplicate data.Masked data needs to be moved to the appropriate device.
Avoid initializing
cudabefore theAccelerator. Initialize theAcceleratoras soon as possible. The easiest way to do this is wrap all relevant code in a function that thenotebook_launchercalls.Like Lightning, submitit can be used as a workaround to enable multiple launches.