Advanced usage
--------------

This section describes more advanced (and useful) features and requires good
familiarity with the concepts introduced in the previous section.

Multiplex directive
~~~~~~~~~~~~~~~~~~~

The default behavior can be altered by specifying a ``#multiplex`` directive
in the commentary of the segment code. If several multiplex directives are
present in the segment code, the last one is retained.

The multiplex directive can be one of:

+ ``#multiplex cross_prod`` : default behavior, returns the Cartesian product.
+ ``#multiplex union`` : takes the union of the inputs.

Moreover, the ``#multiplex cross_prod`` directive admits filtering and
grouping by class, similarly to SQL queries::

  #multiplex cross_prod where "condition" group_by "class_function"

``condition`` and ``class_function`` are Python code evaluated for each
element of the product set.

The argument of ``where`` is a condition. An element will be part of the
input set only if the condition evaluates to ``True``.

The ``group_by`` directive groups elements into classes according to the
result of the evaluation of the given class function. The input set then
contains all the resulting classes. For example, if the function is a
constant, the input set will contain only one element: the class containing
all elements.

During the evaluation, the values of the tuple elements are accessible as
variables bearing the names of the corresponding parent segments.

Given the Cartesian product set::

  [('Lancelot','the Brave'), ('Lancelot','the Pure'), ('Galahad','the Brave'), ('Galahad','the Pure')]

one can use::

  #multiplex cross_prod where "quality=='the Brave'"

to get 2 instances of the following segment (``melt``) running on
``('Lancelot','the Brave')`` and ``('Galahad','the Brave')``, or::

  #multiplex cross_prod group_by "knights"

to get 2 instances of the ``melt`` segment running on ``'Lancelot'`` and
``'Galahad'``, or::

  #multiplex cross_prod group_by "0"

to get 1 instance of the ``melt`` segment running on ``0``.

Note that to make use of ``group_by``, the values returned by the class
function have to be hashable.

Another caution on the use of ``group_by``: the segment input is no longer a
dictionary in this case, as the original tuple is lost and replaced by the
result of the class function. See the :ref:`The segment environment` section
for more details.

Depend directive
~~~~~~~~~~~~~~~~

As explained in the introduction section, Pipelet offers the possibility to
save CPU time by storing intermediate products on disk. Intermediate products
are the input/output data files of the different segments.

Each segment repository is identified by a unique key which depends on:

- the segment processing code and parameters (segment and hook scripts)
- the input data (identified from the keys of the parent segments)

Every change made to a segment (new parameter or new parent) will then give a
different key, and tell the Pipelet engine to compute a new segment instance.

It is possible to add external dependencies to the key computation using the
``#depend`` directive::

  #depend file1 file2

At the very beginning of the pipeline execution, all dependencies will be
stored, to prevent any change (code edition) between the key computation and
the actual processing. Note that this mechanism works only for segment and
hook scripts. External dependencies are also read at the beginning of the
pipeline execution, but are only used for the key computation.
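To fix ideas, here is a minimal sketch (illustrative only, not Pipelet's
actual key computation) of how such a key can combine the segment code, the
hook scripts, the parent keys and the content of the files listed in the
``#depend`` directive::

  # Illustrative sketch only -- not Pipelet's actual implementation.
  import hashlib

  def segment_key(seg_code, hook_codes, parent_keys, depend_files=()):
      """Hash together everything a segment instance depends on."""
      h = hashlib.md5()
      h.update(seg_code.encode('utf-8'))       # segment script (code + parameters)
      for code in hook_codes:                  # hook scripts
          h.update(code.encode('utf-8'))
      for key in sorted(parent_keys):          # input data, through the parents' keys
          h.update(key.encode('utf-8'))
      for fn in depend_files:                  # external #depend files
          with open(fn, 'rb') as f:
              h.update(f.read())
      return h.hexdigest()

Any change in one of these ingredients yields a different key, hence a new
segment instance.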
Database reconstruction
~~~~~~~~~~~~~~~~~~~~~~~

In case of an unfortunate loss of the pipeline SQL database, it is possible
to reconstruct it from the data stored on disk::

  import pipelet
  pipelet.utils.rebuild_db_from_disk(prefix, sqlfile)

All information will be retrieved, but with new identifiers.

The hooking system
~~~~~~~~~~~~~~~~~~

As described in the :ref:`The segment environment` section, Pipelet supports
a hooking system which allows the use of generic processing code and code
sectioning.

Consider a set of instructions that has to be systematically applied at the
end of a segment (post processing): one can put those instructions in a
separate script file named, for example, ``segname_postproc.py`` and call the
hook function::

  hook('postproc', globals())

A specific dictionary can be passed to the hook script to avoid confusion.

The hook scripts are included in the hash key computation.

Segment script repository
~~~~~~~~~~~~~~~~~~~~~~~~~

Local repository
****************

By default, segment scripts are read from a local directory, specified at
pipeline initialization with the parameter named ``code_dir``::

  from pipelet.pipeline import Pipeline
  P = Pipeline(pipedot, code_dir="./", prefix="./")

The segment script contents are immediately stored, to prevent any
modification between the pipeline start time and the actual execution of each
segment.

It is generally a good idea to put this directory under revision control, to
ease the reproducibility of the pipeline (even if the Pipelet engine makes a
copy of the segment script in the segment output directory). If using Git,
the revision number will be stored at the beginning of the copy of the
segment script.

Writing custom environments
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Pipelet software provides a set of default utilities available from the
segment environment. It is possible to extend this default environment or
even to write a completely new one.

Extending the default environment
*********************************

The different environment utilities are actually methods of the class
``Environment``. It is possible to add new functionalities by using the
Python inheritance mechanism.

File ``myenvironment.py``::

  from pipelet.environment import *

  class MyEnvironment(Environment):
      def my_function(self):
          """ My function does nothing. """
          return

The Pipelet engine objects (segments, tasks, pipeline) are available from the
worker attribute ``self._worker``. See the :ref:`The pipelet actors` section
for more details about the Pipelet machinery.

Writing a new environment
*************************

In order to start with a completely new environment, extend the base
environment.

File ``myenvironment.py``::

  from pipelet.environment import *

  class MyEnvironment(EnvironmentBase):
      def my_get_data_fn(self, x):
          """ New name for get_data_fn. """
          return self._get_data_fn(x)

      def _close(self, glo):
          """ Post processing code. """
          return glo['seg_output']

From the base environment, the basic functionalities for getting file names
and executing hook scripts are still available through:

- ``self._get_data_fn``
- ``self._hook``

The segment input argument is stored in ``self._seg_input``, and the segment
output argument has to be returned by the ``_close(self, glo)`` method.

The Pipelet engine objects (segments, tasks, pipeline) are available from the
worker attribute ``self._worker``. See the doxygen documentation for more
details about the Pipelet machinery.
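Once such an environment is loaded (see the next section), the new utility
can be used from a segment script. The following is a hypothetical sketch
which assumes that, like the default utilities, the public methods of the
environment class are exposed by name in the segment namespace::

  # Hypothetical segment script running under MyEnvironment (sketch only).
  fn = my_get_data_fn('result.txt')    # the renamed get_data_fn defined above
  with open(fn, 'w') as f:
      f.write('produced by the custom environment\n')
  seg_output = [fn]                    # collected by MyEnvironment._close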
Loading another environment
***************************

To load another environment, set the pipeline environment attribute
accordingly::

  Pipeline(pipedot, code_dir=, prefix=, env=MyEnvironment)

Writing custom main files
~~~~~~~~~~~~~~~~~~~~~~~~~

Launching pipeweb behind apache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pipeweb uses the CherryPy web framework and can be run behind an apache web
server, which brings essentially two advantages:

- access to the apache ``mod_*`` facilities (https, gzip, authentication
  facilities ...);
- faster serving of static files (the pipelet application actually uses quite
  few of them, so the gain is marginal; having the actual data served by
  apache may be feasible but is not planned yet).

There are actually several ways of doing so, and the CherryPy documentation
gives hints about each. We describe here an example case using
``mod_rewrite`` and virtual hosting.

1. The first thing we need is a working installation of apache with
   ``mod_rewrite`` activated. On a debian-like distribution this is usually
   obtained by::

     sudo a2enmod rewrite
     sudo a2enmod proxy
     sudo a2enmod proxy_http

2. We then configure apache to rewrite requests to the cherrypy application,
   except for the static files of the application, which will be served
   directly. Here is a sample configuration file for a dedicated virtual host
   named pipeweb, with pipelet installed under
   ``/usr/local/lib/python2.6/dist-packages/``::

     # Adjust the virtual host name and port to your setup
     <VirtualHost *:80>
         ServerName pipeweb
         ServerAdmin pipeweb_admin@localhost
         DocumentRoot /usr/local/lib/python2.6/dist-packages/pipelet
         # ErrorLog /some/custom/error_file.log
         # CustomLog /some/custom/access_file.log common

         RewriteEngine on
         RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
         RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
     </VirtualHost>

3. Restart apache and start the pipeweb application to serve on the specified
   address and port: ``pipeweb start -H 127.0.0.1``

It is also possible to start the application on demand using a CGI script
like::

  #!/usr/local/bin/python
  print "Content-type: text/html\r\n"
  print """<html><body>Restarting site ... <a href="/">click here</a></body></html>"""
  import os
  os.system('pipeweb start -H 127.0.0.1')

To have it executed when the proxy detects the absence of the application::

  #...
  ScriptAliasMatch ^/pipeweb_autostart\.cgi$ /usr/local/bin/pipeweb_autostart.cgi
  RewriteEngine on
  RewriteCond %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
  RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
  RewriteCond %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
  RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
  ErrorDocument 503 /pipeweb_autostart.cgi
  #...

You may want to adjust the ownership and suid bit of the
``pipeweb_autostart.cgi`` script so that it executes with the correct rights.

Pipeweb handles access rights using per pipeline ACL registered in the
database file. It supports Basic and Digest HTTP authentication. When
deploying the pipeweb interface in a production environment, one may want to
defer part of the authorization process to external and potentially more
secure systems. The pipeweb behavior in terms of authorization is controlled
by the ``-A`` option, which accepts the following arguments:

- ``Digest`` (default): authenticate users via HTTP Digest authentication
  according to the user:passwd list stored in the database.
- ``Basic``: authenticate users via HTTP Basic (clear text) authentication
  according to the user:passwd list stored in the database.
- ``ACL``: check the access rights of otherwise authenticated users according
  to the user list stored in the database.
- ``None``: do no check. (Defer the whole authentication/authorization
  process to the proxy.)
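For instance, to let apache handle authentication and keep only the per
pipeline access right check on the pipeweb side (as in the sample
configuration below), the application would be started with::

  pipeweb start -H 127.0.0.1 -A ACL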
Here is a complete configuration sample making use of https, basic
authentication, and per pipeline ACL to secure data browsing::

  # Adjust the virtual host name and port to your setup
  <VirtualHost *:443>
      ServerName pipeweb
      ServerAdmin pipeweb_admin@localhost
      DocumentRoot /usr/local/lib/python2.6/dist-packages/pipelet
      # ErrorLog /some/custom/error_file.log
      # CustomLog /some/custom/access_file.log common

      # Adjust the ssl configuration to fit your needs
      SSLEngine on
      SSLCertificateFile /etc/ssl/certs/ssl-cert-snakeoil.pem
      SSLCertificateKeyFile /etc/ssl/private/ssl-cert-snakeoil.key

      # This handles authentication and access to the index page.
      # Access right checking to the various registered pipelines
      # is left to pipeweb.
      # Replace with any suitable authentication system.
      <Location "/">
          AuthName "pipeweb"
          AuthType Basic
          AuthUserFile /etc/apache2/pipelet.pwd
          require valid-user
      </Location>

      ScriptAliasMatch ^/pipeweb_autostart\.cgi$ /usr/local/bin/pipeweb_autostart.cgi
      RewriteEngine on
      RewriteCond %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
      RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
      RewriteCond %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
      RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
      ErrorDocument 503 /pipeweb_autostart.cgi
  </VirtualHost>

And the corresponding CGI script::

  #!/usr/local/bin/python
  print "Content-type: text/html\r\n"
  print """<html><body>Restarting site ... <a href="/">click here</a></body></html>"""
  import os
  os.system('pipeweb start -H 127.0.0.1 -A ACL')