Supplementary Notes for Cluster Users

SJTU HTCondor Tips

1. How do I install Anaconda?

If you need Anaconda (conda), you can use the installation provided on the cluster, for example:

source /sw/anaconda/3.7-2020.02/thisconda.sh

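Then create your own environment with conda create. By default the environment is stored under /lustre/….//.conda, where you only have 20 GB of storage. To avoid filling that space, state the location of the environment explicitly when you create it (for example, somewhere else under lustre; in that case check your inode quota with df-inpac and do not exceed it!), for example: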

conda create -p <lustre…./<user>/<dir_name>>
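
Since a conda environment consists of many small files, it can help to see roughly how many inodes a directory uses before and after creating it. The following is only a sketch: the function name count_entries is made up for illustration, it is not a replacement for df-inpac, and because hard-linked files are counted once per name the result is an upper estimate.

```python
import os


def count_entries(root):
    """Return (number of files and directories, total file size in bytes) under root."""
    n_entries = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Each directory and each file name is counted as one entry.
        n_entries += len(dirnames) + len(filenames)
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so that broken links do not raise an error.
            if not os.path.islink(path):
                total_bytes += os.path.getsize(path)
    return n_entries, total_bytes
```

Call it with the directory you passed to conda create, for example count_entries("/path/to/your/env"), where the path is a placeholder.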

2. How do I make sure my environment works on the remote nodes?

2.1 How to install the correct PyTorch version

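Because the servers still run os7 rather than el9, you need to pay attention to software versions during installation, especially the version of the CUDA driver. Here is an example that you can follow to install PyTorch and the CUDA toolkit. Start by using pip to install the GPU build of PyTorch: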

pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118

pip installs the matching CUDA toolkit libraries at the same time. After that you can install the other packages you need with pip.

If you are not confident about the environment you installed, you can log in to a remote node, much as you would with ssh.

First write an empty job that does nothing:

Universe   = vanilla
Executable =
#Arguments  =
#request_CPUs = 1   # no more than 4 CPUs
request_GPUs = 1
+SJTU_GPUModel = "V100_32G_Test"
should_transfer_files = NO
Queue

Save the file as submit_inpac_test.sub and submit it with the "-i" (interactive) option:

condor_submit -i submit_inpac_test.sub

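At this point you have effectively submitted a job, but you are also logged in to the bl-4 node (a remote node). You can then check whether the correct PyTorch is installed, just as you would on the login node.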
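
One way to do that check from inside the interactive session is a small Python script such as the sketch below. The function name describe_torch is made up for illustration; the torch attributes it reads are standard PyTorch ones. It reports which CUDA version the wheel was built for and whether a GPU is actually visible on the node.

```python
def describe_torch():
    """Collect basic facts about the PyTorch install visible to this interpreter."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"installed": False}
    info = {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a GPU node although built_for_cuda is set, the installed wheel and the driver on the node probably do not match.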

3. Using Python to split HTCondor jobs for parallel execution on multiple CPUs

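For large parameter scans or batch computations, splitting the work into many independent sub-jobs that run in parallel can greatly reduce the total run time. This section shows how to use the HTCondor Python API to build a batch of jobs, submit them in chunks, top up the queue dynamically, and keep monitoring the job status. This is especially useful when jobs get evicted, and it also keeps you below the submission limit of 500 jobs.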

After installing the htcondor Python package, you can submit from a Python script as follows:

import time

# Note: max_materialize (maximum number of jobs kept in the queue) and init_N
# (the total printed next to the running count) are read as module-level
# variables and must be defined before this function is called.
def wait_for_complete(wait_time, constraint, schedd, itemdata, submit_result, sub_job):
    time.sleep(1)
    print(constraint)
    while True:
        # Ask the schedd for all jobs in the clusters submitted so far.
        ads = schedd.query(
            constraint=constraint,
            projection=["ClusterId", "ProcId", "Out", "JobStatus"],
        )
        # Nothing left to submit: stop topping up the queue.
        if len(itemdata) == 0:
            return
        # Fill the queue up to at most max_materialize jobs.
        if len(ads) < max_materialize:
            n_free = max_materialize - len(ads)
            sub_data = itemdata[:n_free]
            submit_result += [schedd.submit(sub_job, itemdata=iter(sub_data))]
            print(f"==> Submitting {len(sub_data)} jobs to cluster {submit_result[-1].cluster()}")
            itemdata = itemdata[n_free:]
            # Extend the query so that it also covers the new cluster.
            constraint = ' || '.join(f'ClusterId == {res.cluster()}' for res in submit_result)

        # JobStatus: 1 = idle, 2 = running, above 2 = removed, completed, held, ...
        n_runs = sum(1 for ad in ads if ad["JobStatus"] == 2)
        n_idle = sum(1 for ad in ads if ad["JobStatus"] == 1)
        n_other = sum(1 for ad in ads if ad["JobStatus"] > 2)
        print(f"-- {n_idle} idle, {n_runs}/{init_N} running ({len(itemdata)} left)... (wait for another {wait_time} seconds)")

        if n_other > 0:
            print(f"-- {n_other} jobs in other status, please check")
        if n_other > 0 and (n_runs + n_idle == 0):
            print(f"-- {n_other} jobs in other status, other's done, please check")
            return

        time.sleep(wait_time)
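
The function above groups every JobStatus value larger than 2 into "other". If you want to see what those jobs are actually doing, the query results can be broken down by status name, as in the sketch below. The helper name summarize_status is made up for illustration; the numeric codes are the standard HTCondor JobStatus values.

```python
from collections import Counter

# Standard HTCondor JobStatus codes.
JOB_STATUS_NAMES = {
    1: "idle",
    2: "running",
    3: "removed",
    4: "completed",
    5: "held",
    6: "transferring_output",
    7: "suspended",
}


def summarize_status(ads):
    """Count job ads per status name; ads only need to support ad["JobStatus"]."""
    counts = Counter(JOB_STATUS_NAMES.get(ad["JobStatus"], "unknown") for ad in ads)
    return dict(counts)


# Plain dicts stand in for the ads returned by schedd.query() in this example.
example_ads = [{"JobStatus": 1}, {"JobStatus": 2}, {"JobStatus": 2}, {"JobStatus": 5}]
print(summarize_status(example_ads))
```

A growing "held" count usually deserves a look at the job log before more jobs are submitted.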

The job itself is then described in the Python script:

import htcondor

# $(cur), $(fit_type), $(point_n) and $(job_tag) are filled in per job from itemdata;
# data_dir is a Python variable that must be defined beforehand.
sub_job = htcondor.Submit({
    "executable": "/home/chenxiang/.conda/envs/darkshine/bin/python",
    "arguments": f"./llpstats/make_signal_distributed.py -i {data_dir}/out_$(cur)ns.json -f $(fit_type) -n $(point_n)",
    "output": "log/$(job_tag).out",
    "error": "log/$(job_tag).err",
    "log": "log/$(ClusterID).log",
    "rank": '(OpSysName == "CentOS")',
    "getenv": "True",
})
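
The per-job macros in the description above have to come from itemdata, which the text does not show. A minimal sketch of how such a list could be built for a parameter scan follows. The function name build_itemdata, the scan values and the job_tag format are all assumptions made for illustration; values are stored as strings on the assumption that submit macros are plain text substitutions.

```python
from itertools import product


def build_itemdata(cur_values, fit_types, point_ns):
    """Return one dict per job, covering every combination of the scan values."""
    itemdata = []
    for cur, fit_type, point_n in product(cur_values, fit_types, point_ns):
        itemdata.append({
            "cur": str(cur),
            "fit_type": str(fit_type),
            "point_n": str(point_n),
            # Hypothetical tag format, used for the .out and .err file names.
            "job_tag": f"{cur}ns_{fit_type}_{point_n}",
        })
    return itemdata


# Hypothetical scan: 2 values of cur, 1 fit type, 3 points, giving 6 jobs.
itemdata = build_itemdata([1, 10], ["linear"], [0, 1, 2])
print(len(itemdata), itemdata[0])
```

Keeping job_tag unique per combination prevents two jobs from writing to the same output file.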

Then submit the first batch with the following snippet:

schedd = htcondor.Schedd()
submit_result = []
# itemdata is expected to be a list of dicts, one per job, with the keys used as macros above.
sub_data = itemdata[:max_materialize]
submit_result += [schedd.submit(sub_job, itemdata=iter(sub_data))]
print(f"==> Submitting {len(sub_data)} jobs to cluster {submit_result[-1].cluster()}")
itemdata = itemdata[max_materialize:]
constraint = ' || '.join(f'ClusterId == {res.cluster()}' for res in submit_result)

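This way no more than 500 jobs are submitted at any one time, while new jobs keep being submitted as earlier ones leave the queue.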
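
The slicing that keeps the queue below the limit can be tried out without a cluster. The sketch below isolates that step; the function name split_next_batch and the default cap of 500 are assumptions made for illustration, with the cap playing the role of max_materialize.

```python
def split_next_batch(itemdata, n_in_queue, cap=500):
    """Return (items to submit now, items still waiting) without exceeding cap."""
    # Number of jobs that can be added before the queue reaches the cap.
    free_slots = max(cap - n_in_queue, 0)
    return itemdata[:free_slots], itemdata[free_slots:]


# 1200 pending items and an empty queue: the first batch fills the queue.
pending = list(range(1200))
batch, pending = split_next_batch(pending, n_in_queue=0)
print(len(batch), len(pending))

# Later, 480 jobs are still queued, so only 20 more are submitted.
batch, pending = split_next_batch(pending, n_in_queue=480)
print(len(batch), len(pending))
```

Clamping free_slots at zero keeps the slice from going negative when the queue already holds more jobs than the cap.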