Robert Kennedy

Week 10

Nearing the home stretch. As part of trying to improve the efficiency of the GNN runtime across many GPUs, I started experimenting with a different approach to distributing the work. A "normal" GNN-GPU setup has 1 Thor per 1 GPU, and it does work, but there are performance issues when the model weights need to be aggregated. Weight aggregation has to occur fairly often during training in order to produce a single cohesive model from a dataset that is split across several nodes. If the aggregation step happens too frequently, the system spends more time communicating weights back and forth than it does computing gradients; if there are no aggregation steps at all, you essentially end up with an ensemble of models, each trained on a separate, non-overlapping slice of the data.
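To make that tradeoff concrete, here is a minimal sketch (not the actual GNN runtime code) of periodic weight averaging across workers. The worker objects and their train_step/get_weights/set_weights methods are hypothetical stand-ins; the only point is how the sync interval controls the balance between communication and gradient computation.

```python
import numpy as np

def train_with_periodic_averaging(workers, steps, sync_every):
    """Each worker trains on its own data shard; every `sync_every` steps
    the workers' weights are averaged and broadcast back (the aggregation
    step described above).  `workers` is a list of hypothetical objects
    exposing train_step(), get_weights(), and set_weights()."""
    for step in range(1, steps + 1):
        for w in workers:
            w.train_step()                      # local gradient work
        if step % sync_every == 0:              # aggregation step
            # Average each layer's weights across all workers.
            avg = [np.mean(layer_group, axis=0)
                   for layer_group in zip(*(w.get_weights() for w in workers))]
            for w in workers:
                w.set_weights(avg)
```

With sync_every = 1, nearly every step pays the communication cost; with sync_every larger than the total number of steps, the workers never synchronize and you are left with the ensemble-of-models case described above.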


It may be a bit late in the program to start something so new, but I feel it is very important, and even if I cannot complete it, it will provide a starting point for future work. Ideally this would have been started much earlier, but with all of the delays it unfortunately got pushed back.

The idea is to have one Thor per physical computer, each with many GPUs, instead of one Thor per GPU. This way, TF "controls" all of the GPUs in the system. In short, the benefit is that TF can use NVIDIA's NCCL library so that all GPUs on a single machine communicate directly with each other, without having to pass data through the motherboard's bus via the PCIe channels and through the CPU. This is another layer of hardware acceleration for the distributed training of NN models. There were some issues with session management across multiple Thors, so unfortunately this was not completed.
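For reference, this is roughly the kind of single-machine, multi-GPU setup I was aiming for, written as a small TensorFlow 2 style sketch; the GNN bundle's actual session handling is different, and the model below is just a hypothetical placeholder. MirroredStrategy with NcclAllReduce is what lets gradients be reduced directly between the GPUs on one machine.

```python
import tensorflow as tf

# One Thor node per physical machine; within that machine, MirroredStrategy
# replicates the model across all local GPUs and uses NCCL all-reduce to
# aggregate gradients directly between the GPUs.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

with strategy.scope():
    # Placeholder model; the real GNN model would be built here instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then splits each batch across the local GPUs automatically,
# so a single Thor process drives every GPU in the machine.
```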
