Advanced usage

This section describes more complicated (and useful) features and requires good familiarity with the concepts introduced in the previous section.

Multiplex directive

The default behavior can be altered by specifying a #multiplex directive in a comment of the segment code. If several multiplex directives are present in the segment code, the last one is retained.

The multiplex directive can be one of:

  • #multiplex cross_prod : default behavior, returns the Cartesian product.
  • #multiplex union : returns the union of the inputs.

Moreover, the #multiplex cross_prod directive supports filtering and grouping by class, similarly to SQL queries:

#multiplex cross_prod where "condition" group_by "class_function"

condition and class_function are Python code evaluated for each element of the product set.

The argument of where is a condition. An element will be part of the input set only if the condition evaluates to True.

The group_by directive groups elements into classes according to the result of the evaluation of the given class function. The input set then contains all the resulting classes. For example, if the function is a constant, the input set will contain only one element: the class containing all elements.

During the evaluation, the values of the tuple elements are accessible as variables bearing the names of the corresponding parent segments.

Given the Cartesian product set:

[('Lancelot','the Brave'), ('Lancelot','the Pure'), ('Galahad','the Brave'), ('Galahad','the Pure')]

one can use

#multiplex cross_prod where "quality=='the Brave'"

to get 2 instances of the following segment (melt), running respectively on ('Lancelot','the Brave') and ('Galahad','the Brave')

or:

#multiplex cross_prod group_by "knights"

to get 2 instances of the melt segment running on ('Lancelot'), ('Galahad')

or:

#multiplex cross_prod group_by "0"

to get 1 instance of the melt segment running on: (0)

Note that to make use of group_by, elements of the output set have to be hashable.

Another caution on the use of group_by: the segment input data type is no longer a dictionary in this case, as the original tuple is lost and replaced by the result of the class function.

See the The segment environment section for more details.
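The where and group_by semantics can be pictured with the following short sketch (purely illustrative, not Pipelet's implementation; the variable names knights and quality are the parent names used in the example above):

import itertools

knights = ['Lancelot', 'Galahad']
qualities = ['the Brave', 'the Pure']
# Default cross_prod behavior: the Cartesian product of the parent outputs.
product = list(itertools.product(knights, qualities))

# where "quality=='the Brave'": keep only the tuples satisfying the condition.
condition = "quality=='the Brave'"
filtered = [t for t in product
            if eval(condition, {'knights': t[0], 'quality': t[1]})]
# filtered == [('Lancelot', 'the Brave'), ('Galahad', 'the Brave')]

# group_by "knights": one class (hence one segment instance) per distinct
# result of the class function; the class key must be hashable.
class_function = "knights"
classes = {}
for t in product:
    key = eval(class_function, {'knights': t[0], 'quality': t[1]})
    classes.setdefault(key, []).append(t)
# classes has two keys: 'Lancelot' and 'Galahad'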

Depend directive

As explained in the introduction section, Pipelet offers the possibility to save CPU time by storing intermediate products on disk. We call intermediate products the input/output data files of the different segments.

Each segment repository is identified by a unique key which depends on:

  • the segment processing code and parameters (segment and hook scripts)
  • the input data (identified from the key of the parent segments)

Every change made to a segment (new parameter or new parent) will then give a different key, and tell the Pipelet engine to compute a new segment instance.
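The following toy sketch only illustrates this keying idea (it is not Pipelet's actual hashing code): the key is a digest of the segment code, its hook scripts, and the keys of the parent segments, so touching any of them produces a new key.

import hashlib

def toy_segment_key(segment_code, hook_codes, parent_keys):
    """ Illustrative only: combine code, hooks and parent keys into one digest. """
    h = hashlib.md5()
    h.update(segment_code.encode('utf-8'))
    for code in hook_codes:
        h.update(code.encode('utf-8'))
    for key in sorted(parent_keys):
        h.update(key.encode('utf-8'))
    return h.hexdigest()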

It is possible to add some external dependencies to the key computation using the depend directive:

#depend file1 file2

At the very beginning of the pipeline execution, all dependencies will be stored, to prevent any change (code editing) between the key computation and the actual processing.

Note that this mechanism works only for segment and hook scripts. External dependencies are also read at the beginning of the pipeline execution, but only used for the key computation.
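For instance, a segment script importing routines from a module that lives outside the code_dir could declare that module as an external dependency. The module name and path below are purely illustrative:

#depend /home/user/tools/mytools.py
import sys
sys.path.append('/home/user/tools')   # hypothetical location of the external module
from mytools import some_routine      # editing mytools.py now changes the segment key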

Database reconstruction

In case of unfortunate loss of the pipeline SQL database, it is possible to reconstruct it from the disk:

import pipelet.utils
pipelet.utils.rebuild_db_from_disk(prefix, sqlfile)

All information will be retrieved, but with new identifiers.

The hooking system

As described in the The segment environment section, Pipelet supports a hooking system which allows the use of generic processing code and code sectioning.

Let’s consider a set of instructions that has to be systematically applied at the end of a segment (post processing). One can put those instructions in a separate script file named, for example, segname_postproc.py and call the hook function:

hook('postproc', globals())

A specific dictionary can be passed to the hook script to avoid confusion.

The hook scripts are included in the hash key computation.
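As a sketch, for a segment named melt, the end of the segment script calls the hook, and the post processing instructions live in a melt_postproc.py file next to it. The content below is purely illustrative; seg_output is the segment output variable discussed in the The segment environment section.

File : melt.py (end of the segment script):

# ... main processing of the melt segment ...
seg_output = [0]
hook('postproc', globals())   # executes melt_postproc.py with the segment globals

File : melt_postproc.py:

# Instructions systematically applied at the end of the melt segment.
# The dictionary passed to hook() becomes the global namespace here, so
# seg_output (and the other segment variables) can be read or modified.
print 'melt post processing, output is: ', seg_output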

Segment script repository

Local repository

By default, segment scripts are read from a local directory, specified at the pipeline initialization with the parameter named code_dir:

from pipelet.pipeline import Pipeline
P = Pipeline(pipedot, code_dir="./", prefix="./")

The segment script contents are immediately stored, to prevent any modification between the pipeline start time and the actual execution of each segment.

It is generally a good idea to keep this directory under a revision control system (RCS), to ease the reproducibility of the pipeline (even if the pipelet engine makes a copy of the segment script in the segment output directory).

If using Git, the revision number will be stored at the beginning of the copy of the segment script.

Writing custom environments

The Pipelet software provides a set of default utilities available from the segment environment. It is possible to extend this default environment or even write a completely new one.

Extending the default environment

The different environment utilities are actually methods of the class Environment. It is possible to add new functionalities by using the Python inheritance mechanism:

File : myenvironment.py:

from pipelet.environment import *

class MyEnvironment(Environment):
    def my_function(self):
        """ My function does nothing.
        """
        return

The Pipelet engine objects (segments, tasks, pipeline) are available from the worker attribute self._worker. See the The pipelet actors section for more details about the Pipelet machinery.

Writing new environment

In order to start with a completely new environment, extend the base environment:

File : myenvironment.py:

from pipelet.environment import *

class MyEnvironment(EnvironmentBase):
    def my_get_data_fn(self, x):
        """ New name for get_data_fn.
        """
        return self._get_data_fn(x)

    def _close(self, glo):
        """ Post processing code.
        """
        return glo['seg_output']

From the base environment, the basic functionalities for getting file names and executing hook scripts are still available through:

  • self._get_data_fn
  • self._hook

The segment input argument is also stored in self._seg_input. The segment output argument has to be returned by the _close(self, glo) method.

The Pipelet engine objects (segments, tasks, pipeline) are available from the worker attribute self._worker. See the doxygen documentation for more details about the Pipelet machinery.

Loading another environment

To load another environment, set the pipeline environment attribute accordingly:

from myenvironment import MyEnvironment
P = Pipeline(pipedot, code_dir="./", prefix="./", env=MyEnvironment)

Writing custom main files

Launching pipeweb behind apache

Pipeweb uses the cherrypy web framework and can be run behind an apache web server, which brings essentially two advantages:

  • access to mod_* apache facilities (https, gzip, authentication facilities, ...).
  • faster static file serving (the pipelet application actually uses only a few of them, so the actual gain is marginal; getting the actual data served by apache may be feasible but is not planned yet).

There are actually several ways of doing so; the cherrypy documentation gives hints about each. We describe here an example case using mod_rewrite and virtual hosting.

  1. The first thing we need is a working installation of apache with mod_rewrite activated. On a debian-like distribution this is usually obtained by:

    sudo a2enmod rewrite
    sudo a2enmod proxy
    sudo a2enmod proxy_http
  2. We then configure apache to rewrite requests to the cherrypy application, except for the static files of the application, which will be served directly. Here is a sample configuration file for a dedicated virtual host named pipeweb, with pipelet installed under /usr/local/lib/python2.6/dist-packages/:

    <VirtualHost pipeweb:80>
    ServerAdmin pipeweb_admin@localhost
    DocumentRoot /usr/local/lib/python2.6/dist-packages/pipelet
    #    ErrorLog /some/custom/error_file.log
    #    CustomLog /some/custom/access_file.log common
    
    RewriteEngine on
    RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
    RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
    </VirtualHost>
  3. Restart apache and start the pipeweb application to serve on the specified address and port: pipeweb start -H 127.0.0.1

It is also possible to start the application on demand using a cgi script like:

#!/usr/local/bin/python
print "Content-type: text/html\r\n"
print """<html><head><META HTTP-EQUIV="Refresh" CONTENT="1; URL=/"></head><body>Restarting site ...<a href="/">click here<a></body></html>"""
import os
os.system('pipeweb start -H 127.0.0.1')

To have it executed when the proxy detects the absence of the application:

<VirtualHost pipeweb:80>
#...
ScriptAliasMatch ^/pipeweb_autostart\.cgi$ /usr/local/bin/pipeweb_autostart.cgi
RewriteEngine on
RewriteCond  %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
RewriteCond  %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
ErrorDocument 503 /pipeweb_autostart.cgi
#...
</VirtualHost>

You may want to adjust ownership and suid of the pipeweb_autostart.cgi script so that it executes with the correct rights.

Pipeweb handles access rights using per-pipeline ACLs registered in the database file. It supports Basic and Digest HTTP authentication. When deploying the pipeweb interface in a production environment, one may want to defer part of the authorization process to external and potentially more secure systems. The pipeweb behavior in terms of authorization is controlled by the -A option, which accepts the following arguments:

  • Digest (default) Authenticate users via HTTP Digest authentication according to the user:passwd list stored in the database.
  • Basic Authenticate users via HTTP Basic (clear text) authentication according to the user:passwd list stored in the database.
  • ACL Check the access rights of otherwise authenticated users according to the user list stored in the database.
  • None Do not check. (Defer the whole authentication/authorization process to the proxy.)
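For example, to let apache handle the authentication itself (as in the sample below) while keeping only the per-pipeline access control on the pipeweb side, one would start the application with:

pipeweb start -H 127.0.0.1 -A ACL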

Here is a complete configuration sample making use of https, basic authentication, and per-pipeline ACLs to secure data browsing:

<VirtualHost _default_:443>
    ServerAdmin pipeweb_admin@localhost
    DocumentRoot /usr/local/lib/python2.6/dist-packages/pipelet

    # ErrorLog /some/custom/error_file.log
    # CustomLog /some/custom/access_file.log common

    # Adjust the ssl configuration to fit your needs
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/ssl-cert-snakeoil.pem
    SSLCertificateKeyFile /etc/ssl/private/ssl-cert-snakeoil.key

    # This handles authentication and access to the index page
    # Access right checking to the various registered pipelines
    # is left to pipeweb
    <Location />
        #Replace with Any suitable authentication system
        AuthName             "pipeweb"
        AuthType             Basic
        AuthUserFile         /etc/apache2/pipelet.pwd
        require              valid-user
    </Location>

    ScriptAliasMatch ^/pipeweb_autostart\.cgi$ /usr/local/bin/pipeweb_autostart.cgi
    RewriteEngine on
    RewriteCond  %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
    RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
    RewriteCond  %{SCRIPT_FILENAME} !pipeweb_autostart\.cgi$
    RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
    ErrorDocument 503 /pipeweb_autostart.cgi
</VirtualHost>

And the corresponding cgi script:

#!/usr/local/bin/python
print "Content-type: text/html\r\n"
print """<html><head><META HTTP-EQUIV="Refresh" CONTENT="1; URL=/"></head><body>Restarting site ...<a href="/">click here<a></body></html>"""
import os
os.system('pipeweb start -H 127.0.0.1 -A ACL')