You can use HTCondor to run the nextnano software on your local computer infrastructure (“on-premise”). Essentially, the nextnanomat software submits the job either locally or on the “HTCondor” cluster. In both cases, the results of the calculations are located on your local computer.
This feature is only supported with our new license system.
The following shows a screenshot. 6 computers are connected to the HTCondor pool called e25nn.
120 slots are configured, 44 are currently available.
Computers 2, 3, 4 and 6 are selected to accept jobs.
Computers 2 and 6 are currently not available as they are in use.
Download the HTCondor installer from HTCondor.
Click Download and go to the Current Stable Release of UW Madison (as of September 24 2020, HTCondor 8.8.10), then Native Packages.
The filenames look similar to this one: condor-8.8.10-513586-Windows-x64.msi (Version 8.8.10).
Download the .msi file. When you download it, you can optionally enter your name, email address and institution and subscribe to the HTCondor newsletter.
Install HTCondor.
Click Next and then accept the License Agreement.
Select Create a new HTCondor Pool and fill in the name of the pool, e.g. nextnanoHTCondorPool. This is a unique name for your pool of machines.
Alternatively, select Join an existing HTCondor Pool and fill in the hostname of the central manager, e.g. computername where nextnanoHTCondorPool has been created.
Select Submit jobs to HTCondorPool and choose Always run jobs and never suspend them. (Alternative: If you do not want other people to run jobs on your machine at all, select Do not run jobs on this machine, or if you do not want other people to run jobs on your machine while you are working, select When keyboard has been idle for 15 minutes. You can of course modify these settings later.)
Accounting domain: yourcompanyname.com (without www). All PCs of your network should get the same domain name; this does not necessarily have to be your Windows domain.
Read access: *
Write access: $(CONDOR_HOST), $(IP_ADDRESS), *.yourcompanyname.com, 192.168.178.* (Replace *.cs.wisc.edu with your domain name and add your local IP subnet, e.g. 192.168.178.*). On Windows you can find your IP subnet by opening the Command Prompt cmd.exe and typing in ipconfig.
Administrator: * (or $(IP_ADDRESS))
No
Installation path: C:\condor\. The directory Program Files is problematic due to write permissions, so we do not recommend using it.
Click Install and type in the Administrator password of your PC. (You need Administrator rights.)
A few more setups
Open a command prompt and enter condor_store_cred add to add your password.
If this does not work, try to enter condor_store_cred add -debug for more output information on the error.
In nextnanomat, open Tools → Options → Cloud computing. If everything is correctly set up, you will find the “HTCondor” section highlighted with green color, and the available computers show up in “Cluster”. If this is not the case, maybe you have not installed HTCondor on the computer where you are running nextnanomat. Please also check that the HTCondor installation path is correctly set within nextnanomat, e.g. the default path C:\condor might not be the one where you installed HTCondor.
Hostname (for HTCondor pool): computername.yourcompanyname.com
Policy: "Always run jobs"
Accounting domain: yourcompanyname.com
Read access: *
Write access: $(CONDOR_HOST), $(IP_ADDRESS), *.yourcompanyname.com, 192.168.178.*
Administrator: $(IP_ADDRESS)
You can find your HTCondor config settings in the file C:\condor\condor_config.
Let's look at an example below.
Company name: Simpson
Domain: simpson.com
Name of the HTCondor pool: TheSimpsonsCondorPool
Central manager: homer.simpson.com
Another computer in the pool: lisa.simpson.com
IP subnet: 192.168.188.* (or 2001:db8:2042::* in IPv6)
    RELEASE_DIR = C:\condor
    LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
    REQUIRE_LOCAL_CONFIG_FILE = FALSE
    LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
    use SECURITY : HOST_BASED
    #CONDOR_HOST: $(FULL_HOSTNAME) # on computer called homer
    CONDOR_HOST: homer # on computer called lisa
    COLLECTOR_NAME = TheSimpsonsCondorPool # only on computer called homer
    #UID_DOMAIN = # empty if you do not have a domain
    UID_DOMAIN = simpson.com
    SOFT_UID_DOMAIN=TRUE # entry is missing if you do not have a domain
    FILESYSTEM_DOMAIN = simpson.com # entry is missing if you do not have a domain
    CONDOR_ADMIN =
    SMTP_SERVER =
    ALLOW_READ = *
    ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.simpson.com, 192.168.188.*, 2001:db8:2042::*
    ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
    use POLICY : ALWAYS_RUN_JOBS
    #use POLICY : DESKTOP
    WANT_VACATE = FALSE
    WANT_SUSPEND = TRUE
    #DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR STARTD # on computer called homer
    #DAEMON_LIST = MASTER SCHEDD STARTD KBDD # on computer called lisa if keyboard idle 15 minutes option was chosen
    DAEMON_LIST = MASTER SCHEDD STARTD # on computer called lisa
Submit job
Show information on HTCondor cluster
The results of the condor_status command are shown, i.e. the number of compute slots is displayed.
Use condor_q to show the status of your submitted jobs, i.e. select condor_q, and then press the Refresh button.
dir.
Results of HTCondor simulations
<nextnano simulation output folder>\<name of input file>\.
<input file name>.log.
condor_submit <filename>.sub
    Submit a job to the pool.
condor_q
    Shows current state of own jobs in the queue.
condor_q -nobatch -global -allusers
    Shows state of all jobs of all users in the cluster.
condor_q -goodput -global -allusers
    Shows state and occupied CPU of all jobs in the cluster.
condor_q -allusers -global -analyze
    Detailed information for every job in the cluster.
condor_q -global -allusers -hold
    Shows why jobs are in hold state.
condor_status
    Shows state of all available resources.
condor_status -long
    Shows state of all available resources and much other information.
condor_status -debug
    Shows state of all available resources and some additional information, e.g. WARNING: Saw slow DNS query, which may impact entire system: getaddrinfo(<Computername>) took 11.083566 seconds.
condor_rm
    Remove jobs from a queue:
condor_rm -all
    Removes all jobs from a queue.
condor_rm <cluster>.<id>
    Removes jobs on cluster <cluster> with id <id>. (It seems <cluster>. can be omitted, and id is the JOB_IDS number.)
condor_release -all
    If any jobs are in state hold, use this command to restart them.
condor_restart
    Restarts all HTCondor daemons/services after changes in the config file.
condor_version
    Returns the version number of HTCondor.
condor_store_cred query
    Returns info about the credentials stored for HTCondor jobs.
condor_history
    Lists the recently submitted jobs. If for a specific job ID the status has the value ST=C, then this job has been completed (C) successfully.
condor_status -master
    Returns Name, HTCondor Version, CPU and Memory of the central manager.
Open cmd.exe as Administrator. Type in: net start condor. This has the same effect as restarting your computer, i.e. the networking service condor is started. This is useful if you have changed your local condor_config file.
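The condor_submit command listed above expects a submit description file. nextnanomat generates and submits its own file, so the following Python sketch is not what nextnanomat writes; it only illustrates what a minimal submit description could look like. The function name build_submit_description, the file names job.out, job.err and job.log, and the example arguments are assumptions chosen for illustration.

```python
def build_submit_description(executable, arguments, request_cpus=1, input_files=()):
    """Return the text of a minimal HTCondor submit description file.

    All values are illustrative; adapt them to your own job.
    """
    lines = [
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"request_cpus = {request_cpus}",
        # Copy files to the execute machine and results back on exit.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
    ]
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    lines += [
        "output = job.out",  # hypothetical file name for stdout
        "error = job.err",   # hypothetical file name for stderr
        "log = job.log",     # hypothetical file name for the job event log
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only; the paths are placeholders.
    print(build_submit_description(
        r"D:\HW\HelloWorld.exe",
        "input_file_for_HelloWorld.in",
        request_cpus=1,
        input_files=["input_file_for_HelloWorld.in"],
    ))
```

Saving the printed text as <filename>.sub and running condor_submit <filename>.sub would then submit one job, assuming the executable and input file exist at the given locations.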
With this option in the condor_config file on the central manager, one can set a policy that the jobs are spread out over several machines rather than filling all slots of one computer before filling the slots of the other computers.
    ##------nn: SPREAD JOBS BREADTH-FIRST OVER SERVERS
    ##-- Jobs are "spread out" as much as possible,
    ## so that each machine is running the fewest number of jobs.
    NEGOTIATOR_PRE_JOB_RANK = isUndefined(RemoteOwner) * (- SlotId)
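To see why this expression spreads jobs out, the following Python sketch mimics the ranking for a small pool. It is a simplified toy model and not the negotiator's actual algorithm: it assumes static slots numbered 1, 2, ... on each machine, considers only unclaimed slots, and the machine names pc1 and pc2 as well as the helper names are made up for illustration.

```python
def pre_job_rank(slot):
    """Mimic isUndefined(RemoteOwner) * (-SlotId) for a single slot."""
    unclaimed = 1 if slot["RemoteOwner"] is None else 0
    return unclaimed * (-slot["SlotId"])


def make_slots(machines, slots_per_machine):
    """Build a list of unclaimed slots for hypothetical machines."""
    return [
        {"Machine": machine, "SlotId": slot_id, "RemoteOwner": None}
        for machine in machines
        for slot_id in range(1, slots_per_machine + 1)
    ]


def assign_jobs(slots, n_jobs):
    """Place jobs one by one on the unclaimed slot with the highest rank."""
    placements = []
    for job in range(n_jobs):
        free = [s for s in slots if s["RemoteOwner"] is None]
        if not free:
            break  # no unclaimed slot left
        best = max(free, key=pre_job_rank)  # ties: first slot in the list wins
        best["RemoteOwner"] = f"job{job}"
        placements.append((best["Machine"], best["SlotId"]))
    return placements


if __name__ == "__main__":
    pool = make_slots(["pc1", "pc2"], slots_per_machine=2)
    print(assign_jobs(pool, 3))
    # [('pc1', 1), ('pc2', 1), ('pc1', 2)]
```

Because a lower SlotId gives a higher rank, slot 1 of every machine is used before slot 2 of any machine, which is the breadth-first behaviour described above.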
Q: I submitted a job to HTCondor, but nothing happens. nextnanomat says “transmitted”.
A: It could be that nextnanomat has not read in all required settings. You can try to type condor_restart in the command line. Please make sure that you entered your credentials using condor_store_cred add -debug. You should then start nextnanomat again.
Q: I submitted a job to HTCondor, but the Batch line of nextnanomat is stuck with preparing. What is wrong?
A1: Did you store your credentials after the installation of HTCondor? If not, enter condor_store_cred add into the command prompt to add your password, see above (Recommended Installation Process).
A2: Did you change your password recently? If yes, you have to reenter your credentials for HTCondor. Enter condor_store_cred add into the command prompt to add your password, see above (Recommended Installation Process). If this does not work, try to enter condor_store_cred add -debug for more output information on the error.
Q: I specified target machines in Tools - Options. Afterwards every job submitted to HTCondor is stuck with transmitting. What is wrong?
A: The value for UID_DOMAIN within the condor_config file needs to be the same for every computer of your cluster. (You can easily test it in a command prompt with condor_status -af uiddomain.) If it's not the same value, no matching computer will be found and the job won't be transmitted successfully.
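The check described in this answer can be automated. The following Python sketch assumes that the output of condor_status -af uiddomain contains one value per line and has been captured as text; the function name distinct_uid_domains and the sample output are made up for illustration.

```python
def distinct_uid_domains(status_output):
    """Return the set of distinct values found in the captured output.

    Assumes one UID_DOMAIN value per line, as printed by
    'condor_status -af uiddomain'. Empty lines are ignored.
    """
    return {line.strip() for line in status_output.splitlines() if line.strip()}


if __name__ == "__main__":
    # Hypothetical output: one machine reports a different value.
    sample = "simpson.com\nsimpson.com\nlisa\n"
    domains = distinct_uid_domains(sample)
    if len(domains) > 1:
        print("UID_DOMAIN differs between machines:", sorted(domains))
    else:
        print("UID_DOMAIN is consistent:", sorted(domains))
```

If more than one distinct value is reported, the condor_config files of the affected computers should be compared.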
If you receive the following error when you type in condor_status,
    C:\Users\"<your user name>">condor_status
    Error: communication error
    CEDAR:6001:Failed to connect to <123.456.789.123>
you can check whether the computer associated with this IP address is your HTCondor computer using the following command.
    nslookup 123.456.789.123
It is also a good idea to type in
    nslookup
This will return the name of the Default Server that resolves DNS names.
If it is not the expected computer, you can open a Command Prompt as Administrator and type in ipconfig /flushdns to flush the DNS Resolver Cache.
    C:\Users\"<your user name>">ipconfig /flushdns
If the DNS address cannot be resolved correctly, it could be related to a VPN connection that has configured a different default server for Domain Name to IP address mapping. This can happen, for example, if your Windows Domain is called contoso.com (which is only visible within your own network and your own HTCondor pool) but your DNS is resolved to www.contoso.com (which might be outside your local HTCondor pool).
Solution:
Edit the condor_config file and add the host, i.e. the local computer name (here: nn-delta).
    ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS)
==>
    ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), nn-delta
If you encounter any strange errors, you can find some hints in the history or Log files generated by HTCondor. You can find them here:
C:\condor\spool
C:\condor\log
More details can be found here: Logging in HTCondor
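When searching these folders for hints, it can help to filter the log files for suspicious lines first. The following Python sketch is one possible way to do this. The keywords ERROR and Failed are an assumption about what is worth looking for, and the function names are made up for illustration.

```python
from pathlib import Path


def scan_log_text(name, text, keywords=("ERROR", "Failed")):
    """Return (file name, line number, line) for lines containing a keyword."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(keyword in line for keyword in keywords):
            hits.append((name, number, line.rstrip()))
    return hits


def scan_log_dir(log_dir, keywords=("ERROR", "Failed")):
    """Scan every regular file directly inside log_dir."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file():
            text = path.read_text(errors="replace")
            hits.extend(scan_log_text(path.name, text, keywords))
    return hits


if __name__ == "__main__":
    log_dir = Path(r"C:\condor\log")  # default location mentioned above
    if log_dir.is_dir():
        for name, number, line in scan_log_dir(log_dir):
            print(f"{name}:{number}: {line}")
```

The script only reads the files; it does not modify or delete anything in the HTCondor folders.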
You can even run your own executable with nextnanomat locally or on HTCondor! We tested the following programs:
An input file identifier is a special string in the input file that signals to nextnanomat whether the input file is an input file for the nextnano++, nextnano³, nextnano.QCL or nextnano.MSB software, or for a custom executable.
In nextnanomat, we need the following settings:
Path to executable: D:\HW\HelloWorld.exe
Input file identifier: HelloWorld
Open input file input_file_for_HelloWorld.in (or any other input file that contains the string HelloWorld) and run the simulation either locally or on HTCondor.
Our folder structure is
D:\QE\inputfile\My_QE_inputfile.in (QE input file)
D:\QE\input\pseudo\C.UPF (pseudopotential file for atom species 'C' as specified in input file)
D:\QE\exe\pw.exe (QE executable file)
D:\QE\exe\*.dll (all dll files needed by pw.exe)
D:\QE\working_directory\QE_nextnanomat_HTCondor.bat (batch file)
In nextnanomat, we need the following settings:
Path to executable: D:\QE\working_directory\QE_nextnanomat_HTCondor.bat
Working directory: D:\QE\
Input file identifier: &control
$INPUTFILE
)
The batch file (*.bat) contains the following content:
    .\exe\pw.exe -in .\inputfile\My_QE_inputfile.in
This means that relative to the working directory, pw.exe is started, and the specified input file is read in. In this input file, the following quantities are specified:
C.UPF: name of pseudopotential file
./input/pseudo/: path to pseudopotential file C.UPF
Open input file My_QE_inputfile.in and run the simulation either locally or on HTCondor.
Things that could be improved:
condor_exec.exe is deleted (better: do not copy it back)
*.dll files should be deleted (better: do not copy them back)
Our folder structure is
D:\abinit\inputfile\t30.in (ABINIT input file)
D:\abinit\input\* (input files needed by ABINIT)
D:\abinit\exe\abinit.exe (ABINIT executable file)
D:\abinit\exe\*.dll (all dll files needed by abinit.exe)
D:\abinit\working_directory\abinit_nextnanomat.bat (batch file)
In nextnanomat, we need the following settings:
Path to executable: D:\abinit\working_directory\abinit_nextnanomat.bat
Working directory: D:\abinit\
Input file identifier: acell
The batch file (*.bat) contains the following content:
    .\exe\abinit.exe < .\input\ab_nextnanomat_HTCondor.files
This means that relative to the working directory, abinit.exe is started, and the specified input file is read in. In this input file, the following quantities are specified:
.\inputfile\t30.in: name of input file
.\input\14si.pspnc:
Open input file t30.in and run the simulation either locally or on HTCondor.
Things that could be improved:
condor_exec.exe is deleted (better: do not copy it back)
*.dll files should be deleted (better: do not copy them back)
By default, the condor_startd daemon will automatically divide the machine into slots, placing one core in each slot. E.g. a 6-core computer with hyperthreading has 12 logical processors. Alternatively, the number of cores (or logical processors) can be distributed to the slots as follows.
    SLOT_TYPE_1 = cpus=4
    SLOT_TYPE_2 = cpus=4
    SLOT_TYPE_3 = cpus=2
    SLOT_TYPE_4 = cpus=1
    SLOT_TYPE_5 = cpus=1
    SLOT_TYPE_1_PARTITIONABLE = TRUE
    SLOT_TYPE_2_PARTITIONABLE = TRUE
    SLOT_TYPE_3_PARTITIONABLE = TRUE
    SLOT_TYPE_4_PARTITIONABLE = TRUE
    SLOT_TYPE_5_PARTITIONABLE = TRUE
    NUM_SLOTS_TYPE_1 = 1
    NUM_SLOTS_TYPE_2 = 1
    NUM_SLOTS_TYPE_3 = 1
    NUM_SLOTS_TYPE_4 = 1
    NUM_SLOTS_TYPE_5 = 1
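When distributing cores by hand like this, it is easy to configure more or fewer CPUs than the machine has. The following Python sketch adds up the CPUs of such a slot configuration so the total can be compared with the number of logical processors (12 in the example above). It only understands the two kinds of lines shown here, SLOT_TYPE_<n> = cpus=<k> and NUM_SLOTS_TYPE_<n> = <m>; the function name is made up for illustration.

```python
import re


def total_configured_cpus(config_text):
    """Sum cpus over all slot types, weighted by NUM_SLOTS_TYPE_<n>."""
    cpus_per_type = {}
    slots_per_type = {}
    for line in config_text.splitlines():
        match = re.match(r"\s*SLOT_TYPE_(\d+)\s*=\s*cpus\s*=\s*(\d+)", line)
        if match:
            cpus_per_type[int(match.group(1))] = int(match.group(2))
            continue
        match = re.match(r"\s*NUM_SLOTS_TYPE_(\d+)\s*=\s*(\d+)", line)
        if match:
            slots_per_type[int(match.group(1))] = int(match.group(2))
    # Slot types without a NUM_SLOTS_TYPE entry contribute nothing here.
    return sum(cpus * slots_per_type.get(slot_type, 0)
               for slot_type, cpus in cpus_per_type.items())


EXAMPLE = """
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_2 = cpus=4
SLOT_TYPE_3 = cpus=2
SLOT_TYPE_4 = cpus=1
SLOT_TYPE_5 = cpus=1
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS_TYPE_2 = 1
NUM_SLOTS_TYPE_3 = 1
NUM_SLOTS_TYPE_4 = 1
NUM_SLOTS_TYPE_5 = 1
"""

if __name__ == "__main__":
    print(total_configured_cpus(EXAMPLE))  # 12
```

For the configuration above the total is 4 + 4 + 2 + 1 + 1 = 12, which matches the 12 logical processors of the 6-core computer with hyperthreading.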
PartitionableSlot: For SMP (symmetric multiprocessing) machines, a boolean value identifying that this slot may be partitioned.
DynamicSlot: For SMP machines that allow dynamic partitioning of a slot, this boolean value identifies that this dynamic slot may be partitioned.
SlotID: For SMP machines, the integer that identifies the slot.
condor_status -af Name TotalCpus DynamicSlot PartitionableSlot SlotID
It returns the requested properties of each slot:
Name  TotalCpus  DynamicSlot  PartitionableSlot  SlotID
In our pool we have chosen dynamic partitioning which gives full flexibility. For instance, a quad-core CPU that is dynamically partitioned can accept
4 jobs that use 1 thread each (request_cpus = 1)
2 jobs that use 2 threads each (request_cpus = 2)
2 jobs where one uses 1 thread (request_cpus = 1) and the other uses 3 threads (request_cpus = 3)
1 job that uses 4 threads (request_cpus = 4).
    ####################################################
    # Dynamic partitioning
    # We use HTCondor's dynamic partitioning mechanism.
    # Each PC has one partitionable whole machine slot.
    # (It seems that hyperthreading is not taken into account.)
    ####################################################
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = 100%
    SLOT_TYPE_1_PARTITIONABLE = true
    SlotWeight = Cpus
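The idea of a partitionable slot can be illustrated with a small toy model in Python. It only tracks CPUs and ignores memory, disk and all other resources that HTCondor takes into account, so it is a sketch of the concept and not of HTCondor's implementation. The class and function names are made up for illustration.

```python
class PartitionableSlot:
    """Toy model of one partitionable slot that hands out CPUs."""

    def __init__(self, cpus):
        self.free_cpus = cpus
        self.dynamic_slots = []  # CPUs of each dynamic slot carved out so far

    def request(self, request_cpus):
        """Carve out a dynamic slot if enough CPUs are free."""
        if request_cpus < 1 or request_cpus > self.free_cpus:
            return False
        self.free_cpus -= request_cpus
        self.dynamic_slots.append(request_cpus)
        return True


def fits(total_cpus, requests):
    """Return True if all requests can run at the same time."""
    slot = PartitionableSlot(total_cpus)
    return all(slot.request(cpus) for cpus in requests)


if __name__ == "__main__":
    # Combinations on a quad-core machine
    print(fits(4, [1, 1, 1, 1]))  # True
    print(fits(4, [2, 2]))        # True
    print(fits(4, [1, 3]))        # True
    print(fits(4, [4]))           # True
    print(fits(4, [3, 2]))        # False: 5 CPUs requested, 4 available
```

In this model a job that does not fit simply has to wait until a running job finishes and returns its CPUs to the partitionable slot.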
A machine is in any of the following 6 states. The most important ones are Owner, Unclaimed, Claimed.
Owner: The machine is being used by the machine owner, and/or is not available to run HTCondor jobs. When the machine first starts up, it begins in this state.
Unclaimed: The machine is available to run HTCondor jobs, but it is not currently doing so.
Matched: The machine is available to run jobs, and it has been matched by the negotiator with a specific schedd. That schedd just has not yet claimed this machine. In this state, the machine is unavailable for further matches.
Claimed: The machine has been claimed by a schedd.
Preempting: The machine was claimed by a schedd, but is now preempting that claim for one of the following reasons.
Backfill: (not relevant for us)
Each machine state can have different activities. The machine state Claimed can have one out of these four activities.
Idle:
Busy:
Suspended:
Retiring:
(We are working on it.)