PyTorch Headfirst

Sun Jun 14 2020

This post dives headfirst into PyTorch: a powerful open-source ML library for Python. Google's flagship ML library is TensorFlow, whereas Facebook developed PyTorch. Researchers are gravitating towards PyTorch due to its flexibility and efficiency. This tutorial goes over the basics of PyTorch, including tensors and a simple perceptron. This tutorial requires knowledge of Python, Numpy, and neural networks. If you don't know anything about neural networks, I suggest that you watch this amazing video by 3Blue1Brown:

You can install both Numpy and PyTorch using pip. Check out the PyTorch website to get the version of PyTorch that works with your version of CUDA installed on your computer. If you don't have an NVIDIA GPU with CUDA, you can still run PyTorch, but you won't be able to run your programs on the GPU.

!pip install numpy
!pip install torch torchvision

Pro tip: if you are in a notebook adding a "!" will execute your command on the terminal.


1 Tensors

The core of the PyTorch library is centered around Tensors. Tensors are analogous to Numpy matrices, however, the benefit of tensors is their ability to get placed on the GPU. Tensors also allow you to do auto gradients, which makes doing backpropagation in neural networks a lot faster.

Creating an empty tensor is similar to creating a new C array: anything can be in the memory that you grabbed, so don't expect it to be zero, ones, or "random."

import torch

torch.empty(5, 2)
    tensor([[4.8132e-36, 4.5597e-41],
            [1.4906e-11, 3.0957e-41],
            [4.4842e-44, 0.0000e+00],
            [8.9683e-44, 0.0000e+00],
            [7.7759e-13, 3.0957e-41]])

If you explicitly create a random matrix, you will get values between zero and one.

x = torch.rand(5, 3)
    tensor([[0.7825, 0.7864, 0.1257],
            [0.7588, 0.6572, 0.9262],
            [0.4881, 0.6329, 0.3424],
            [0.1333, 0.4235, 0.6760],
            [0.9737, 0.6657, 0.9946]])
    torch.Size([5, 3])

Similarly, there is a function for random integers.

torch.randint(low=0, high=5, size=(3,3))
    tensor([[4, 2, 1],
            [2, 0, 3],
            [2, 2, 2]])

Similar to numpy, you can also specify an empty array filled with zeros and specify a data type.

Common data types: - torch.long - torch.bool - torch.float - - torch.int8 - torch.int16 - torch.int32 - torch.int64 - torch.double

x = torch.zeros(5, 2, dtype=torch.long)
    tensor([[0, 0],
            [0, 0],
            [0, 0],
            [0, 0],
            [0, 0]])

Size returns a tuple. In PyTorch it is common to do .size(), however .shape will return the same thing.

    torch.Size([5, 2])
    torch.Size([5, 2])

Like Numpy, Pytorch supports a ton of operators on a Tensor.

Check out the documentation at the official website.

y = torch.rand(5,2)

x + y # same as torch.add(x, y)

result = torch.add(x, y)

torch.add(x, y, out= result) 
    tensor([[0.4942, 0.7370],
            [0.9927, 0.7068],
            [0.1702, 0.9578],
            [0.6510, 0.4992],
            [0.2482, 0.4928]])

Pytorch added multiple functions with "_" for standard operators that operate on the calling tensor.

# adds the result to y
    tensor([[0.9885, 1.4740],
            [1.9855, 1.4135],
            [0.3405, 1.9155],
            [1.3020, 0.9984],
            [0.4964, 0.9856]])

Tensors support pythonic ways of accessing each element.

print(y[0][0]) # first element as a tensor
print(y[0, 0]) # same thing as y[0][0]
print(y[0][0].item()) # grabs data inside tensor
print(y[:, 0]) # gets first col
print(y[1, :]) # gets second row
    tensor([0.9885, 1.9855, 0.3405, 1.3020, 0.4964])
    tensor([1.9855, 1.4135])

You can resize the tensor using the view function.


    tensor([[0.9885, 1.4740, 1.9855, 1.4135, 0.3405, 1.9155, 1.3020, 0.9984, 0.4964,
    tensor([[0.9885, 1.4740, 1.9855, 1.4135, 0.3405],
            [1.9155, 1.3020, 0.9984, 0.4964, 0.9856]])

1.1 CUDA

One of the great things about PyTorch is that you can run everything on either the GPU or the CPU. To make code more flexible to run on either device, most people set the device dynamically. Keeping your devices consistent is crucial because you can't do operations to a "CUDA" tensor by a "CPU" tensor-- this makes sense because one is on the GPU's memory, where the other is in the computer's main memory -- RAM.

torch.cuda.is_available() # prints if CUDA is available on system
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda") # puts the x matrix on device selected
    tensor([[0.7825, 0.7864, 0.1257],
            [0.7588, 0.6572, 0.9262],
            [0.4881, 0.6329, 0.3424],
            [0.1333, 0.4235, 0.6760],
            [0.9737, 0.6657, 0.9946]], device='cuda:0')

1.2 NumPy to Tensor

It is possible to switch between Numpy arrays and Tensors. Note, that this is now a shadow reference. Anything done to the NumPy array will get reflected in the original tensor and vice versa.

import numpy as np

g = np.zeros(5)

gg = torch.from_numpy(g)

    tensor([0., 0., 0., 0., 0.], dtype=torch.float64)

1.3 CUDA Performance

Without question, the performance of matrix operations on the GPU is lightyears faster than on the CPU. The following code is an example of the speed difference.

import time # times in seconds
def time_torch(size):
    x = torch.rand(size, size, device=torch.device("cuda"))
    start = time.time()
    end = time.time()
    return(end - start)

def time_numpy(size):
    x = np.random.rand(size, size)
    start = time.time()
    np.sin(x, out=x)
    end = time.time()
    return(end - start)


On the CPU, it took 1.9 seconds to take the sin of a 10k by 10k matrix, on my GPU (Nvidia 1060), it only took 0.003 seconds! It is worth pointing out that there is some overhead when transferring data from the GPU's memory to the main memory. For this reason, when designing algorithms, you should avoid swapping data on and off the GPU.

2 Basic Perceptron

Now that we have seen Tensors, we can look at a basic neural network example. In this example, we are merely going to be doing linear regression. IE: our algorithm takes in a single input and tries to predict the output using the equation:

\[ y = mx+b \]

The "x" is our input, and the "m" is the weight, and the b is the bias.

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

There is a lot to be said about how you define a neural network in PyTorch; however, most follow this basic example. The constructor creates each layer, and the forward function defines how data gets calculated as it gets pushed through the network.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1,1)
    def forward(self, x):
        x = self.fc1(x)
        return x
net = Net()
net.cuda() # puts the NN on the GPU
print(net) # displays NN structure
      (fc1): Linear(in_features=1, out_features=1, bias=True)

Printing the network like this is useful because you can see the dimensions of the data going in and out of the neural net.

    [Parameter containing:
    tensor([[0.2431]], device='cuda:0', requires_grad=True), Parameter containing:
    tensor([0.3372], device='cuda:0', requires_grad=True)]

As to be expected, we see that the NN has two parameters initialized to random values.

Next, we have to define our loss function before we can train. Our loss measures the magnitude of how incorrect our prediction made by the network was. For this example, we are using the squared difference; this can be changed to other loss functions. The important thing is that the loss is positive so that we can do backpropagation on it.

def criterion(out, label):
    return (label - out)**2

An optimizer is a special object by Pytorch that will adjust the neural network weights based on how they affected the gradient of the error.

import torch.optim as optim
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5)

The dummy data simply follows the equation y = 3x + 0.

data = [(1,3), (2,6), (3,9), (4,12), (5,15), (6,18)]

All training loops look something like the following example. An epoch is simply how many iterations the neural network looks at the data. Optionally people will include another outer loop for "runs" this will simply run this training process multiple times to see if we are always converging on the same answer or are we getting stuck at local minimums.

for epoch in range(100):
    for i, data2 in enumerate(data):
        X, Y = iter(data2)
        X, Y = Variable(torch.FloatTensor([X]), requires_grad=True).cuda(), Variable(torch.FloatTensor([Y])).cuda()
        outputs = net(X)
        loss = criterion(outputs, Y)
        if (i % 10 == 0):
            print("Epoch {} - loss: {}".format(epoch,[0]))
    Epoch 0 - loss: 5.854729652404785
    Epoch 1 - loss: 2.294259548187256
    Epoch 2 - loss: 0.5001814961433411
    Epoch 3 - loss: 0.8155164122581482
    Epoch 4 - loss: 0.6028059720993042
    Epoch 95 - loss: 4.468947372515686e-05
    Epoch 96 - loss: 4.02352525270544e-05
    Epoch 97 - loss: 3.622113945311867e-05
    Epoch 98 - loss: 3.2605526939732954e-05
    Epoch 99 - loss: 2.934764779638499e-05
    [Parameter containing:
    tensor([[2.9989]], device='cuda:0', requires_grad=True), Parameter containing:
    tensor([0.0063], device='cuda:0', requires_grad=True)]

We can see that our NN has extimated the data generated by "y= 3x" to be "y= 2.999x + 0.006".

We can now use this network to make predictions. Note: the shape of the input has to comply with the forward function, and the device of your input tensor must be the same device that the network is on.

input = Variable(torch.ones(1,1,1).cuda())
    tensor([[[1.]]], device='cuda:0')
    tensor([[[3.0051]]], device='cuda:0', grad_fn=<AddBackward0>)

From this example, we can quickly create more intricate neural networks. By adding a few lines of code, we can create a multi-layer perceptron.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1,10)
        self.fc2 = nn.Linear(10,1)
    def forward(self, x):
        x = self.fc2(self.fc1(x))
        return x
net = Net().cuda()
      (fc1): Linear(in_features=1, out_features=10, bias=True)
      (fc2): Linear(in_features=10, out_features=1, bias=True)
criterion = nn.MSELoss() # mean squared error for loss function
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.01)
for epoch in range(100):
    for i, data2 in enumerate(data):
        X, Y = iter(data2)
        X, Y = Variable(torch.FloatTensor([X]), requires_grad=True).cuda(), Variable(torch.FloatTensor([Y])).cuda()
        outputs = net(X)
        loss = criterion(outputs, Y)
        if (i % 10 == 0):
            print("Epoch {} - loss: {}".format(epoch,
    Epoch 0 - loss: 7.190098285675049
    Epoch 1 - loss: 1.51701192407927e-06
    Epoch 2 - loss: 0.1253584325313568
    Epoch 3 - loss: 0.5402220487594604
    Epoch 4 - loss: 1.1704645156860352
    Epoch 95 - loss: 3.3565470403118525e-09
    Epoch 96 - loss: 5.913989298278466e-10
    Epoch 97 - loss: 1.8417267710901797e-11
    Epoch 98 - loss: 2.3283064365386963e-10
    Epoch 99 - loss: 7.130438461899757e-10