Adamis-cluster

Good usage of the ADAMIS cluster


I° Introduction

In order to prevent problems caused by the misuse of resources, we think it is necessary to remind users of a few rules that will allow us to reduce the number of downtime periods and the time the administrators lose on these occasions.

This is not about constraining users with rules that would prevent the best use of their time. On the contrary, we think that a reminder of these few rules will allow us to maintain:

  • flexible usage, allowing all users to prototype their codes and have easy access to the computational resources they need
  • communication and friendly sharing of information between the different users and administrators

II° Respect for users and administrators

It goes without saying that respect among the different users and between users and administrators is the first rule. It is important to keep in mind:

  • there is no privileged user on the machine. By default, everyone has to follow the same usage rules. Exceptions can nevertheless be made for temporary and exceptional needs, IF this decision results from a consultation between all of the users and administrators (administrators being asked first).

  • the administrators of the machine:
    • are not specialized in either network administration or user support
    • were not recruited to answer that need. They share the same activities as the users, which are just as demanding. Nevertheless, they do their best to answer the various problems (hardware or software) that can arise as quickly as possible, so as not to block the work of other users. It is therefore important not to take up the time they donate to the administration of the cluster with questions that could be answered by a good search engine.
    • can be consulted about the scaling of the codes developed by the users. This advice is given within the framework of a short- or long-term project, not in an emergency situation for a paper or report.

In order to mitigate the lack of human resources in the "user-support" area, a wiki has been created. It allows the administrators to easily pass the baton in case of departure. Above all, it should allow the users to find as much as possible of the information they need to use the resources of the machine at their best (that is, as efficiently as possible while following the rules of good usage). Improving the wiki is the responsibility of the users.

III° Utilization of the storage resources

Summary

As shown on the wiki, the storage resources are the following:
/home    : 46 GB total                   - medium-speed access - daily backup
/data    : 11 TB total (1.2 TB per node) - slow access         - no backup
/scratch : 150 GB per node               - fast access         - may be erased at any time

Home directory

We ask that each user stay under 1 GB in this directory. If compiling certain codes requires more space, we recommend using the /scratch of the master node and copying the resulting binary to your home, as in the sketch below.
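
As an illustration, a build can be staged on the /scratch of the master node, keeping only the result in /home (the project name, binary name, and ~/bin destination below are placeholders):

  # On the master node: build under /scratch and keep only the binary in /home
  mkdir -p /scratch/$USER/build
  cp -r ~/my_project /scratch/$USER/build/
  cd /scratch/$USER/build/my_project
  make
  mkdir -p ~/bin && cp ./my_binary ~/bin/
  rm -rf /scratch/$USER/build/my_project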

Data directory

In this directory we ask each user not to go above 800 GB for long periods, and exceptionally 1.5 TB for less than a month.
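
To keep an eye on your own footprint, a simple check is the following (assuming your data live under /data/your_username; adjust the path to your actual layout):

  du -sh /data/$USER    # total size of your personal /data area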

This directory must only be used to save "final" data. Concurrent access (several jobs accessing the same file at the same time) creates problems on the file system. Therefore, users are asked to use the /scratch directory for the data their codes are actively accessing.

It is very important to keep a well-balanced usage of the different disks that compose this directory (/glfsdata). When new data are copied to /data, the choice of a particular disk is made according to the following rules:

  • the node's own local disk if the data are written from a computation node
  • a random node's disk if the copy is done from the master node

Therefore, each user is asked to use the master node to copy large amounts of data into /data, as sketched below.
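
For example, a large transfer launched from the master node (source and destination paths are purely illustrative) could look like:

  # Run this on the master node, not on a computation node
  rsync -av /scratch/$USER/results/ /data/$USER/results/
  # plain cp -r works as well if rsync is not available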

Given the current level of used space in /data, each user is asked to delete obsolete data from that directory. A reminder will be sent on the mailing list every month.
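
One way to spot candidates for deletion is sketched below (the 90-day threshold and the /data/your_username path are only examples):

  # List files in your /data area that have not been accessed for more than 90 days
  find /data/$USER -type f -atime +90 -exec ls -lh {} +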

Scratch directory

The directory /scratch is for saving data that are accessed by different jobs. It is a scratch zone that needs to be cleaned at the end of each job.

Be aware that the /tmp directory must only be used by essential system routines. It is extremely important to check that each code uses /scratch for its temporary data. In general, the location is taken from the environment variable TEMPDIR, which can be defined in your bash profile:

  export TEMPDIR=/scratch/your_username
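
Inside a job script, a sketch of this pattern could be the following (the use of $PBS_JOBID assumes a PBS/Torque-style scheduler, and whether your code actually reads TEMPDIR is for you to verify):

  # At the start of the job: create a per-job temporary area on /scratch
  export TEMPDIR=/scratch/$USER/$PBS_JOBID
  mkdir -p "$TEMPDIR"

  # ... run the code that writes its temporary data to $TEMPDIR ...

  # At the end of the job: clean the scratch zone, as required above
  rm -rf "$TEMPDIR"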

IV° Usage of the computational resources

There is only one way to use the computational resources: submitting a job to the scheduler. Any other way of taking resources (launching a job from an ssh connection on a computation node, or directly on the master node) can:

  • create an overload of resources
  • harm other users

both of which are against the principle of respect between users and are therefore unacceptable.

Only submission via the scheduler ensures a fair distribution and protection of the resources.

It is important to ensure a realistic declaration of the resources used. In that respect, only one option has to be specified, namely the number of cores (CPUs). Each computation node has 8 CPUs and 16 GB of RAM, so the number of cores requested needs to reflect your usage of both of these elements. For example, a user using 1 CPU but 8 GB of RAM needs to declare 4 cores, since 8 GB is half of a node's memory and therefore corresponds to half of its 8 cores. A sketch of such a submission is given below.
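
As a sketch, a submission script for the example above (1 CPU of computation but 8 GB of RAM, hence 4 declared cores) could look like the following; it assumes a PBS/Torque-style qsub, and the job name, executable and exact options should be checked against the wiki:

  #!/bin/bash
  #PBS -N my_job                # job name (placeholder)
  #PBS -l nodes=1:ppn=4         # 4 cores declared: the job uses ~8 GB, half a node's RAM

  cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
  ./my_code                     # placeholder for the actual executable

Submit it from the master node with: qsub my_job.pbs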

To answer the need for reactivity (for example during the debugging phase), it is possible to use the interactive option of qsub (as explained in the wiki), as in the example below.
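
For instance, a single core can be requested for a short debugging session with something like the following (again assuming PBS-style options; see the wiki for the exact syntax):

  qsub -I -l nodes=1:ppn=1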

V° Usage of IDL tokens

Each and every IDL session requires 6 tokens from the Centre de calcul de Lyon (CC-IN2P3). Running out of tokens is a frequent problem, so please moderate your usage of them. For the same reason, IDL must not be used for parallel computing: each parallel session would consume 6 additional tokens.

VI° Conclusion

For questions or discussion concerning this document, we ask that you use the corresponding wiki page or the mailing list. This document will be updated with:

  • suggestions of modifications and additions
  • any changes in the software/hardware