
Automatic Mixed Precision (AMP)

The previous post summarized PyTorch; this one summarizes automatic mixed precision (AMP) training.
References: AUTOMATIC MIXED PRECISION PACKAGE (official docs), PyTorch的自动混合精度(AMP), PyTorch 源码解读之 torch.cuda.amp: 自动混合精度详解.

Automatic Mixed Precision (AMP) training means training a model whose parameters are kept in FP32 while a subset of operators run in FP16 and the rest run in FP32. Which operators use FP16 and which use FP32 is not something the user has to decide; AMP assigns the precision of each operator automatically. Without changing the model or degrading training accuracy, this shortens training time and reduces memory usage, which in turn makes it possible to train with larger batch sizes, larger models, and larger inputs. Since version 1.6, PyTorch supports AMP natively through the torch.cuda.amp module (before that, OpenMMLab already supported mixed precision training via Fp16OptimizerHook).
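
As a quick illustration of this automatic per-operator precision selection (a minimal sketch, not from the original post; it assumes a CUDA-capable GPU is available), a linear layer run under autocast produces an FP16 output while its parameters stay in FP32:

import torch

# Minimal sketch: autocast runs matmul-like ops in FP16, while parameters remain FP32
linear = torch.nn.Linear(4, 4).cuda()   # parameters are created in FP32
x = torch.randn(4, 4, device='cuda')

with torch.cuda.amp.autocast():
    y = linear(x)                        # this op is autocast to FP16
    print(linear.weight.dtype)           # torch.float32 -- parameters are not converted
    print(y.dtype)                       # torch.float16 -- output of the FP16 op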

torch.cuda.amp.autocast automatically casts the inputs of selected operations to the appropriate dtype during the forward pass; torch.cuda.amp.GradScaler scales the loss (and hence the gradients) so that small FP16 gradient values do not underflow to zero, and it automatically unscales the gradients before the optimizer updates the parameters.
The relevant code is as follows:

def main_worker(gpu, ngpus_per_node, args):
    model = ...
    criterion = ...
    optimizer = ...
    scheduler = ...
    train_loader = ...

    # ********** The GradScaler object handles gradient scaling automatically **********
    torch_scaler = torch.cuda.amp.GradScaler()

    for epoch in range(args.epochs):
        for i, (images, target) in enumerate(train_loader):
            ...

            # ********** Run the forward pass (model + loss) under autocast **********
            with torch.cuda.amp.autocast():
                output = model(images)
                loss = criterion(output, target)

            # ********** The FP16 dynamic range is limited, so the loss is scaled before backward **********
            torch_scaler.scale(loss).backward()
            torch_scaler.step(optimizer)
            torch_scaler.update()
            ...
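
If the gradients need to be inspected or modified between backward() and step(), for example for gradient clipping, they have to be unscaled first; otherwise the clipping threshold would be applied to the scaled values. A minimal sketch of this pattern (the max_norm value of 1.0 is only illustrative, not from the original code):

torch_scaler.scale(loss).backward()
# Unscale the gradients owned by this optimizer in-place before clipping
torch_scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# step() detects that the gradients were already unscaled and does not unscale them again
torch_scaler.step(optimizer)
torch_scaler.update()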

An example that combines distributed training (DistributedDataParallel) with AMP:

import argparse
import random

import numpy as np
import torch
import torch.distributed as dist
from torchvision.models import resnet50


def myargs():
    parser = argparse.ArgumentParser(description='training example')
    parser.add_argument('--batch_size', default=256, type=int, help='total batch size of all GPUs on the current node when using Distributed Data Parallel')
    parser.add_argument('--workers', default=8, type=int, help='number of data loading workers')
    parser.add_argument('--epochs', default=100, type=int, help='number of total epochs to run')
    parser.add_argument('--seed', default=20, type=int, help='seed for initializing training')
    # Distributed-training options
    parser.add_argument('--dist_url', default='tcp://127.0.0.1:23456', type=str, help='url used to set up distributed training')
    parser.add_argument('--dist_backend', default='nccl', type=str, help='distributed backend')
    parser.add_argument('--world_size', default=1, type=int, help='number of nodes for distributed training')
    parser.add_argument('--rank', default=0, type=int, help='node rank for distributed training')
    parser.add_argument('--gpu', default=None, type=int, help='GPU id to use')
    myargs = parser.parse_args()
    return myargs


def main_worker(gpu, ngpus_per_node, args):
    # Fix the random seeds and disable cudnn.benchmark for reproducibility
    # (benchmark=True selects the fastest algorithm per input shape, which is non-deterministic)
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed_all(args.seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Local rank passed in by spawn, i.e. the GPU index on this node
    args.gpu = gpu
    print("Use GPU: {} for distributed training".format(args.gpu))

    # init_process_group needs the global rank of the current process
    args.rank = args.rank * ngpus_per_node + gpu  # global rank
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank)

    # Build the model; convert batch_size and num_workers to per-GPU values
    model = resnet50()
    torch.cuda.set_device(args.gpu)  # set the current device to this GPU
    model.cuda(args.gpu)
    args.batch_size = int(args.batch_size / ngpus_per_node)  # per-GPU batch size
    args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

    # Loss, optimizer and scheduler
    criterion = torch.nn.CrossEntropyLoss().cuda(args.gpu)
    optimizer = torch.optim.SGD(model.parameters(), 0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    # A DistributedSampler is required so that each process sees a different shard of the data
    train_dataset = ...  # build your dataset here
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=False, sampler=train_sampler, num_workers=args.workers, pin_memory=True, drop_last=True)

    # ********** The GradScaler object handles gradient scaling automatically **********
    torch_scaler = torch.cuda.amp.GradScaler()

    for epoch in range(args.epochs):
        # Make sure each process gets a different, epoch-dependent shuffle
        train_sampler.set_epoch(epoch)

        model.train()
        for i, (images, target) in enumerate(train_loader):
            optimizer.zero_grad()

            images = images.cuda(args.gpu, non_blocking=True)
            target = target.cuda(args.gpu, non_blocking=True)

            # ********** Run the forward pass (model + loss) under autocast **********
            with torch.cuda.amp.autocast():
                output = model(images)
                loss = criterion(output, target)

            # ********** The FP16 dynamic range is limited, so the loss is scaled before backward **********
            torch_scaler.scale(loss).backward()
            torch_scaler.step(optimizer)
            torch_scaler.update()

        scheduler.step()
        # Save the checkpoint only on the first process of each node
        if args.rank % ngpus_per_node == 0:
            torch.save(model.module.state_dict(), 'checkpoint.pth.tar')

    torch.cuda.empty_cache()


if __name__ == '__main__':
    args = myargs()

    ngpus_per_node = torch.cuda.device_count()           # number of GPUs on this node
    args.world_size = ngpus_per_node * args.world_size   # total number of processes (GPUs) across all nodes
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
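
When training is resumed from a checkpoint, the GradScaler's internal state (its current scale factor and growth tracking) can be saved and restored through its state_dict, just like the model and optimizer. A minimal sketch (the checkpoint dictionary keys here are only illustrative, not from the original code):

# Saving: include the scaler state alongside the model and optimizer
checkpoint = {
    'model': model.module.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scaler': torch_scaler.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pth.tar')

# Resuming: load everything back onto the current GPU
checkpoint = torch.load('checkpoint.pth.tar', map_location='cuda:{}'.format(args.gpu))
model.module.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
torch_scaler.load_state_dict(checkpoint['scaler'])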