Using MLflow Tracking and running an MLflow Tracking server in Docker

죽난

2024. 10. 17. 22:20

For an introduction to what MLflow is, see https://dream2reality.tistory.com/13.

🔧 MLflow Common Setups

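MLflow has three common setups. In the MLflow Tracking documentation these are (1) local tracking with the default file store, (2) local tracking with a local database, and (3) remote tracking through an MLflow Tracking server. Setups 1 and 2 are both meant for individual use, while setup 3 is the form used for team development. With an MLflow Tracking server configured as in setup 3, experiment results are easy to share. Because it runs on a server, it can also be left on at all times.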

🏃‍♂️ Running the MLflow Tracking Server

Start the MLflow server with the commands below. (If you have two PCs, run them on the server PC.)

1. Pull the Docker image

docker pull ghcr.io/mlflow/mlflow

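2. Run the Docker image

After running the command below, the server is reachable at http://localhost:5000 or http://<host-server-ip>:5000.

`--host 0.0.0.0` makes the server listen on all network interfaces, which is what allows access from other machines.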

docker run -it -p 5000:5000 --rm --name mlflow-server ghcr.io/mlflow/mlflow mlflow server --host 0.0.0.0

If the server started correctly, the MLflow UI appears when you open that address in a browser.
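Reachability can also be checked from Python instead of a browser. The sketch below assumes the tracking server answers `GET /health` with HTTP 200, which recent MLflow releases do as far as I know; treat that as an assumption for your version. `server_is_up` is a hypothetical helper name, not part of MLflow.

```python
import urllib.error
import urllib.request


def server_is_up(base_url: str = "http://localhost:5000", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("MLflow server reachable:", server_is_up())
```

On a second PC, pass the server address instead, for example `server_is_up("http://<host-server-ip>:5000")`.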

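➕ Additional notes

A container started as above is deleted the moment it stops, because of the `--rm` flag. Removing `--rm` keeps the stopped container around. In addition, the logs are stored in the mlruns folder inside the container, so use the `-v` flag to mount that folder onto the host path where you want them saved. With the mount in place, the logs stay on the host even if the container itself is removed.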

docker run -it -v ~/workspace/mlflow/:/mlruns -p 5000:5000 --rm --name mlflow-server ghcr.io/mlflow/mlflow mlflow server --host 0.0.0.0
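To make the effect of the two flags easier to see, here is a small sketch that assembles the same `docker run` arguments in Python. `build_run_command` and its parameters are hypothetical names for illustration; the image name, port, container name, and `/mlruns` mount target are taken from the commands in this post.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_run_command(host_dir: Optional[str] = None, remove_on_exit: bool = True) -> List[str]:
    """Build the `docker run` argument list used in this post."""
    cmd = ["docker", "run", "-it", "-p", "5000:5000"]
    if host_dir is not None:
        # Bind-mount the host directory onto /mlruns inside the container.
        cmd += ["-v", f"{Path(host_dir).expanduser()}:/mlruns"]
    if remove_on_exit:
        # --rm deletes the container as soon as it stops.
        cmd.append("--rm")
    cmd += ["--name", "mlflow-server", "ghcr.io/mlflow/mlflow",
            "mlflow", "server", "--host", "0.0.0.0"]
    return cmd


if __name__ == "__main__":
    # Keep the container after it stops and persist logs on the host.
    print(shlex.join(build_run_command("~/workspace/mlflow", remove_on_exit=False)))
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)` if you prefer to launch the container from a script.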

🧑‍💻 Installing MLflow and modifying the ML code

The full source code is available in the repo below.

https://github.com/dev-jinwoohong/mlflow-quickstart

1. Install MLflow in your local environment.

pip install mlflow

2. Add code that sets the MLflow tracking URI and the experiment name

import mlflow

mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")
mlflow.set_experiment("Learning Fashion MNIST Dataset with Resnet")
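When the training script runs on a different PC than the server, the hard-coded `127.0.0.1` has to change. One option, sketched below, is to read the address from the `MLFLOW_TRACKING_URI` environment variable (which MLflow itself also honors) and fall back to the local address otherwise. `resolve_tracking_uri` and `configure_mlflow` are hypothetical helper names, not MLflow functions.

```python
import os


def resolve_tracking_uri(host: str = "127.0.0.1", port: int = 5000) -> str:
    """Prefer MLFLOW_TRACKING_URI if it is set, else build http://host:port."""
    return os.environ.get("MLFLOW_TRACKING_URI") or f"http://{host}:{port}"


def configure_mlflow(experiment_name: str, host: str = "127.0.0.1", port: int = 5000) -> str:
    """Point MLflow at the tracking server and select the experiment."""
    import mlflow  # imported here so the URI helper works without MLflow installed

    uri = resolve_tracking_uri(host, port)
    mlflow.set_tracking_uri(uri)
    mlflow.set_experiment(experiment_name)
    return uri
```

Calling `configure_mlflow("Learning Fashion MNIST Dataset with Resnet")` then behaves like the two lines above, but the server address can be changed without editing the script.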

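3. Modify the training code

Add calls to the existing training code so that the hyperparameters, metrics, and other values you want to track are logged to the MLflow Tracking server.

Original training code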

for epoch in range(1, opt.epochs + 1):
    train_acc, train_loss = train(model, train_loader, criterion, optimizer, epoch)
    val_acc, val_loss = test(model, val_loader, criterion, epoch)

    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), os.path.join(opt.save_dir, 'best_model.pth'))

Modified training code

with mlflow.start_run() as run:
    # Choose which hyperparameters to record
    params = {
        "model": opt.model,
        "batch_size": opt.batch_size,
        "learning_rate": opt.learning_rate,
        "weight_decay": opt.weight_decay,
    }

    mlflow.log_params(params)

    for epoch in range(1, opt.epochs + 1):
        train_acc, train_loss = train(model, train_loader, criterion, optimizer, epoch)
        val_acc, val_loss = test(model, val_loader, criterion, epoch)

        # Record the metrics for this epoch
        mlflow.log_metric("train_accuracy", train_acc, step=epoch)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

        if val_acc > best_acc:
            best_acc = val_acc
            # Save the model whenever validation accuracy improves
            mlflow.pytorch.log_model(model, "best_model")
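The four `mlflow.log_metric` calls can also be grouped into a single `mlflow.log_metrics` call, which takes a dict of metrics and a `step`. The sketch below is one possible way to do that; `epoch_metrics` and `log_epoch` are hypothetical helpers, and the `log_fn` parameter exists only so the helper can be tried without a running server.

```python
from typing import Callable, Dict, Optional


def epoch_metrics(train_acc, train_loss, val_acc, val_loss) -> Dict[str, float]:
    """Collect one epoch's values under the metric names used in this post."""
    # float() also converts single-element tensors or NumPy scalars.
    return {
        "train_accuracy": float(train_acc),
        "train_loss": float(train_loss),
        "val_accuracy": float(val_acc),
        "val_loss": float(val_loss),
    }


def log_epoch(epoch: int, metrics: Dict[str, float],
              log_fn: Optional[Callable[..., None]] = None) -> None:
    """Send all metrics for one epoch in a single call."""
    if log_fn is None:
        import mlflow  # only needed when logging to a real MLflow run

        log_fn = mlflow.log_metrics
    log_fn(metrics, step=epoch)
```

Inside the training loop this would replace the four separate calls with `log_epoch(epoch, epoch_metrics(train_acc, train_loss, val_acc, val_loss))`.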

📜 Results

Open the same address as before, http://localhost:5000 or http://<host-server-ip>:5000, to see the parameters used in the experiment and the experiment results.

📄 References

https://mlflow.org/docs/latest/tracking.html#

