Nearing the home stretch. As part of trying to improve the efficiency of the GNN runtime across many GPUs, I started experimenting with a different approach to using so many GPUs. A “normal” GNN-GPU setup has one Thor node per GPU, which does work, but there are performance issues when the model weights need to be aggregated. Weight aggregation has to occur fairly often during training in order to produce a single cohesive model from the work done across several nodes. If this aggregation step happens too frequently, the system spends more time communicating weights back and forth than it does computing gradients; if there are no aggregation steps at all, you essentially end up with an ensemble of models, each trained on a separate slice of the dataset with no overlap.
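To make that tradeoff concrete, here is a minimal sketch of periodic weight averaging between two workers. This is not the GNN runtime's actual aggregation code; the toy model, the random data shards, and the `aggregation_interval` value are all hypothetical and just illustrate the "too often vs. never" balance described above.

```python
# Sketch: periodic weight averaging across two simulated workers.
import numpy as np
import tensorflow as tf

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])

def average_weights(models):
    # Element-wise mean of each weight tensor across all replicas.
    averaged = [np.mean([m.get_weights()[i] for m in models], axis=0)
                for i in range(len(models[0].get_weights()))]
    for m in models:
        m.set_weights(averaged)

# Two "workers", each training on its own shard of the data.
workers = [build_model() for _ in range(2)]
for m in workers:
    m.compile(optimizer="adam", loss="mse")

shards = [(np.random.rand(64, 8), np.random.rand(64, 1)) for _ in workers]

aggregation_interval = 5  # batches between syncs (hypothetical value);
                          # too small -> mostly communication overhead,
                          # no sync at all -> an ensemble of divergent models
for step in range(20):
    for m, (x, y) in zip(workers, shards):
        m.train_on_batch(x, y)
    if (step + 1) % aggregation_interval == 0:
        average_weights(workers)
```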
It may be a bit late in the program to start something so new, but I feel it is very important, and even if I cannot complete it, it will provide a starting point for future work. Ideally this would have been started much earlier, but with all of the delays it unfortunately got pushed back.
The idea is to have one Thor node per physical machine, each with many GPUs, instead of one Thor per GPU. This way, TF can “control” all of the GPUs in the system. In short, the benefit is that TF can use NVIDIA’s NCCL library so that all GPUs on a single machine communicate directly with each other, without having to route data through the CPU over the motherboard’s PCIe bus. This is another layer of hardware acceleration for the distributed training of NN models. There were some issues with session management across multiple Thors, so unfortunately this was not completed.
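For reference, this is roughly what the one-Thor-per-machine setup looks like from the TF side, assuming TF 2.x: a single process wraps the model in a MirroredStrategy with NCCL all-reduce, so gradient exchange happens directly between the local GPUs. The placeholder model below is just for illustration, not the GNN model itself.

```python
# Sketch: one process driving all local GPUs with NCCL all-reduce.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) now replicates the model on every visible GPU in this
# machine and uses NCCL to combine gradients between the devices.
```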