• https://www.youtube.com/watch?v=LLUxwHc7O4A
    • when we aggregate, we can take in the features of our neighbours AND our neighbour’s neighbours
      • and our own feature vector! notice how the feature vector for node 0 is fed into the three lowest convolution blocks (see the sketch below)
    • “A 2-layer GNN generates embedding of node 0 using 2-hop neighbourhood structure and features”
    • this changed how we thought about graph neural nets
      • before, GNNs needed the entire graph to make a prediction about a single node
        • older GNNs were also limited to making predictions on nodes that were present during training
      • but now, batches of node neighbourhoods can be loaded onto the GPU, so we don’t need to care about the rest of the graph
      • This gave us the ability to process:
          1. much larger graphs, and
          1. structures that weren’t there during training
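    • a minimal sketch of this 2-layer, 2-hop aggregation (the toy graph, the 4-dim features, the mean aggregation, and the random weights are my own illustrative assumptions, not the lecture’s exact model):

```python
# Sketch: a 2-layer GNN computes node 0's embedding from its 2-hop
# neighbourhood AND its own feature vector. Toy graph/features/weights
# are illustrative assumptions; aggregation here is a simple mean.
import numpy as np

rng = np.random.default_rng(0)

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}     # toy undirected graph
feats = {v: rng.normal(size=4) for v in adj}           # 4-dim input features
W = [rng.normal(size=(4, 4)) * 0.1 for _ in range(2)]  # one weight matrix per layer

def embed(v, layer):
    """Embedding of node v after `layer` GNN layers (layer 0 = raw features)."""
    if layer == 0:
        return feats[v]
    # aggregate the neighbours' embeddings from the previous layer ...
    neigh = np.mean([embed(u, layer - 1) for u in adj[v]], axis=0)
    # ... and mix in our OWN previous-layer embedding (the self feature vector)
    return np.maximum(W[layer - 1] @ (neigh + embed(v, layer - 1)), 0)  # ReLU

# node 0's 2-layer embedding only ever touches its 2-hop neighbourhood
print(embed(0, layer=2))
```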
  • Here’s how we can perform stochastic gradient descent:
      1. randomly sample M << N nodes to put into our minibatch
      1. generate the computation graph (the K-hop neighbourhood) for each of the M nodes and move those onto the GPU
      1. compute loss and get our gradients!
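    • a runnable sketch of these three steps (the toy graph, labels, the 2-layer mean-aggregation model, and hyperparameters like M are all illustrative assumptions):

```python
# Sketch: minibatch SGD over node neighbourhoods. Only the sampled nodes'
# computation graphs (built implicitly by the recursion) are touched per step.
# The graph, labels, and hyperparameters below are made up for illustration.
import random
import torch
import torch.nn.functional as F

N = 100
adj = {v: [(v + d) % N for d in (1, 2, 3)] for v in range(N)}  # toy graph
feats = torch.randn(N, 8)
labels = torch.randint(0, 2, (N,))

W1 = torch.nn.Linear(8, 16)   # layer-1 weights
W2 = torch.nn.Linear(16, 2)   # layer-2 weights (outputs class logits)
opt = torch.optim.SGD(list(W1.parameters()) + list(W2.parameters()), lr=0.1)

def embed(v, layer):
    """2-layer mean-aggregation embedding, including the node's own features."""
    if layer == 0:
        return feats[v]
    neigh = torch.stack([embed(u, layer - 1) for u in adj[v]]).mean(0)
    h = (W1 if layer == 1 else W2)(neigh + embed(v, layer - 1))
    return torch.relu(h) if layer == 1 else h

M = 8                                       # minibatch size, M << N
for step in range(50):
    batch = random.sample(range(N), M)      # 1. randomly sample M nodes
    # 2. the recursion builds each node's computation graph on the fly
    logits = torch.stack([embed(v, layer=2) for v in batch])
    loss = F.cross_entropy(logits, labels[batch])  # 3. loss + gradients
    opt.zero_grad()
    loss.backward()
    opt.step()
```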
  • but this is compute-intensive because
      1. for each node, we need to build its own computation graph
         • and increasing the number of hops grows the size of that graph exponentially
      1. if we hit a hub node (one with huge degree), we’re screwed
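    • back-of-the-envelope numbers for this blow-up (the degrees d and hop counts K below are just illustrative):

```python
# Sketch: with average degree d and K hops, the computation graph has on the
# order of d**K nodes; a hub node (huge d) blows this up immediately.
for d in (5, 50, 5000):            # 5000 ~ a hub node
    for K in (1, 2, 3):
        print(f"d={d:>5}, K={K}: ~{d**K:,} nodes in the computation graph")
```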
  • Solution: use neighbourhood sampling
    • we just sample at most H neighbours when doing the calculation (instead of every neighbour)
      • we are capping the fan-out by H
    • 3 remarks
      • choosing H is a tradeoff for accuracy vs training time
      • H doesn’t change the fact that the fan-out is exponential in the number of layers (roughly H^K instead of d^K)
      • random sampling of neighbours may not be optimal
        • we can use random walk with restarts
          • basically, we run random walks with restarts from the target node and score each neighbour (e.g. by how often the walks visit it)
            • we perform a few of these random walks
          • then we just sample the H neighbouring nodes with the highest scores
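    • a minimal sketch of this capped, random-walk-scored neighbour sampling (the toy graph, H, the walk length, and the restart probability are illustrative assumptions):

```python
# Sketch: cap the fan-out at H neighbours, and instead of sampling them
# uniformly, score neighbours with short random walks with restarts and
# keep the top-H. Graph and parameters are made up for illustration.
import random
from collections import Counter

adj = {0: [1, 2, 3, 4, 5], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0], 5: [0, 6], 6: [5]}

def rwr_scores(v, num_walks=10, walk_len=5, restart_p=0.3):
    """Score v's neighbours by how often random walks with restarts visit them."""
    visits = Counter()
    for _ in range(num_walks):
        cur = v
        for _ in range(walk_len):
            if random.random() < restart_p:
                cur = v                        # restart at the source node
            else:
                cur = random.choice(adj[cur])  # take a random step
            if cur in adj[v]:
                visits[cur] += 1
    return visits

def sample_neighbours(v, H=2):
    """Keep at most H neighbours: the ones the random walks visit most."""
    scores = rwr_scores(v)
    ranked = sorted(adj[v], key=lambda u: scores[u], reverse=True)
    return ranked[:H]

print(sample_neighbours(0, H=2))
```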
  • https://youtu.be/JtDgmmQ60x8?si=7DxQdzoiINZia0xE&t=1430
    • 2 main modifications in GraphSAGE:
        1. rather than taking the sum of the neighbours’ hidden embeddings, we can make it more general and say we’re using a general AGG function (could be the mean)
        • note: AGG could also be max pooling, or an LSTM
          • problem: an LSTM gives a different output if you feed in the neighbour features in a different order
          • solution: we can run the LSTM on several random permutations of the input features and combine the results (e.g. average them) so the output is roughly order-invariant
        1. rather than ADDING our own previous hidden embedding, we CONCATENATE our previous hidden embedding
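      • a minimal sketch of a GraphSAGE-style layer with a pluggable AGG and CONCAT of our own embedding (the dimensions, toy inputs, and the 3-permutation averaging for the LSTM are illustrative assumptions, not the paper’s exact recipe):

```python
# Sketch: a GraphSAGE-style layer. AGG is pluggable (mean / max-pool / LSTM
# over shuffled neighbours), and the node's OWN previous embedding is
# CONCATENATED with the aggregated message rather than added to it.
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim, agg="mean"):
        super().__init__()
        self.agg = agg
        self.lstm = nn.LSTM(in_dim, in_dim, batch_first=True)
        # weight applied to [own embedding ; aggregated neighbour message]
        self.W = nn.Linear(2 * in_dim, out_dim)

    def aggregate(self, neigh):                  # neigh: (num_neigh, in_dim)
        if self.agg == "mean":
            return neigh.mean(dim=0)
        if self.agg == "pool":
            return neigh.max(dim=0).values       # element-wise max pooling
        if self.agg == "lstm":
            # the LSTM is order-sensitive, so average over a few random
            # permutations of the neighbours to get a stabler output
            outs = []
            for _ in range(3):
                perm = torch.randperm(neigh.size(0))
                _, (h_n, _) = self.lstm(neigh[perm].unsqueeze(0))
                outs.append(h_n.squeeze(0).squeeze(0))
            return torch.stack(outs).mean(dim=0)
        raise ValueError(f"unknown aggregator: {self.agg}")

    def forward(self, h_self, h_neigh):
        msg = self.aggregate(h_neigh)
        # CONCAT keeps our own information separate from the neighbour message
        return torch.relu(self.W(torch.cat([h_self, msg], dim=-1)))

layer = SAGELayer(in_dim=8, out_dim=16, agg="mean")
h_self = torch.randn(8)        # our own previous-layer embedding
h_neigh = torch.randn(5, 8)    # 5 sampled neighbours' previous embeddings
print(layer(h_self, h_neigh).shape)   # torch.Size([16])
```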