I've spent a fair amount of time running ML experiments on remote GPU clusters, and I kept running into the same frustration: the code that actually matters — the training logic — drowns in infrastructure noise.
A typical training script looks something like this:
import argparse

import torch
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--resume", type=str, default=None)
args = parser.parse_args()

# Restore model, optimizer, and step counter when resuming from a checkpoint.
if args.resume:
    ckpt = torch.load(args.resume)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"]
else:
    start_step = 0

for step in range(start_step, total_steps):
    loss = train_step(batch)
    wandb.log({"loss": loss})
    # Periodically write a checkpoint so the run can be resumed later.
    if step % 1000 == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, f"ckpt_{step}.pt")
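For contrast, here is a minimal sketch of the part the script is really about, with the argument parsing, resume handling, logging, and checkpointing stripped away (reusing the same assumed train_step, batch, and total_steps from above):

# The training logic itself, once the infrastructure noise is removed.
for step in range(total_steps):
    loss = train_step(batch)

Everything else in the full script exists only so the run can be configured, resumed, and monitored.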
