xtl.autoproc#
The xtl.autoproc command is a wrapper around autoPROC from Global Phasing Ltd. It helps process several datasets with autoPROC in batch mode, either with the same or different parameters. In addition, it can also take care of loading the dependencies, setting up file permissions and parsing the results, with minimal user intervention.
Prerequisites#
Caution
xtl.autoproc is currently only tested and officially supported on Linux systems.
In order for xtl.autoproc to run, a working installation of autoPROC is required (instructions can be found here). This implies that all autoPROC dependencies are also installed.
Once installed, the autoPROC process command should be discoverable in the system’s PATH. You can check this by running the following command in your terminal:
$ which process
If the above command returns a path, then you are good to go.
Hint
If autoPROC is not available on the system PATH, then xtl.autoproc can also leverage modules (e.g. Environment Modules) to load the required dependencies. Detailed documentation will be provided in the future.
process command#
The process command is the main entry point for running autoPROC on multiple datasets. It takes as input a CSV file containing paths to the first image of each dataset. In the simplest case, the command can be run as follows:
$ xtl.autoproc process datasets.csv
where datasets.csv is a CSV file containing the paths to the first image of each dataset to be processed, i.e.
datasets.csv
# first_image
/path/to/dataset1/dataset1_00001.cbf.gz
/path/to/dataset2/dataset2_00001.cbf.gz
/path/to/dataset3/dataset3_00001.cbf.gz
...
You will be prompted to confirm that the input parameters are correct.
Important
Before proceeding, take a moment to check the output paths. autoPROC jobs can generate hundreds of files and directories that might be difficult or annoying to clean up if they end up in the wrong location.
A few examples#
Hint
To print a detailed list of all available options, run:
$ xtl.autoproc process --help
When options are provided directly to the process command, they will be applied to all datasets by default, unless explicitly overridden in the CSV file. Some frequently used options are:
Unit-cell & space group#
Specify starting unit-cell parameters and space group for indexing:
$ xtl.autoproc process datasets.csv --unit-cell="78 78 37 90 90 90" --space-group="P43212"
Reference MTZ file#
Specify a reference MTZ file to use for unit-cell parameters, space group and R-free flags:
$ xtl.autoproc process datasets.csv --mtz-ref="/path/to/reference.mtz"
Resolution cutoff#
Apply a resolution range cutoff:
$ xtl.autoproc process datasets.csv --resolution=80-1.2
$ xtl.autoproc process datasets.csv --resolution=80-
$ xtl.autoproc process datasets.csv --resolution=1.2
In the first case, both a low and a high resolution cutoff are applied, while in the other two only a low or only a high resolution cutoff is applied, respectively.
Anomalous signal#
By default, xtl.autoproc will run autoPROC process with the -ANO flag, meaning that the Friedel pairs will be kept separate. To enforce merging of Friedel pairs:
$ xtl.autoproc process datasets.csv --no-anomalous
Ice rings#
autoPROC can automatically detect and exclude ice rings from the data. In cases where the datasets are heavily contaminated with ice, one can force the ice-ring exclusion:
$ xtl.autoproc process datasets.csv --exclude-ice
This option will set the following two autoPROC parameters:
XdsExcludeIceRingsAutomatically=yes
RunIdxrefExcludeIceRingShells=yes
Beamline macros#
Certain beamline-specific macros exist in autoPROC. These can be selected as follows:
$ xtl.autoproc process datasets.csv --beamline="PetraIIIP14"
This is equivalent to -M PetraIIIP14 when running the autoPROC process command directly.
Note that only beamline macro files can be passed with this mechanism, not arbitrary macros. The list of supported beamline macros can be found via xtl.autoproc process --help.
Dataset merging#
In case of incomplete data, or multiple sweeps of the same crystal, one can process multiple datasets in a single autoPROC run and try to merge them, if they are compatible. In xtl.autoproc this can only be achieved by providing a group_id column in the CSV file, e.g.:
datasets.csv
# group_id,first_image
1,/path/to/dataset1/dataset1_00001.cbf.gz
2,/path/to/dataset2/dataset2_00001.cbf.gz
2,/path/to/dataset3/dataset3_00002.cbf.gz
In the above example, dataset2_00001.cbf.gz and dataset3_00002.cbf.gz will be merged into the same dataset, since they have the same group_id. This option essentially passes multiple -Id flags to the autoPROC process command (see Multi-sweep dataset processing), as illustrated below.
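For the CSV file above, the second and third datasets would end up in a single autoPROC invocation that receives two -Id flags, roughly of the following form (the sweep identifiers, image templates and image numbers here are illustrative placeholders; the exact format is described in Dataset name determination):
$ process -Id "xtl0001,/path/to/dataset2/,dataset2_0####.cbf.gz,1,3600" -Id "xtl0002,/path/to/dataset3/,dataset3_0####.cbf.gz,2,3600" ...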
See also
See CSV specification for more details on how to structure the input CSV file.
Output directory#
By default, the output of xtl.autoproc process will be saved in the current working directory. To specify a different output directory:
$ xtl.autoproc process datasets.csv --out-dir="/path/to/output"
This will create subdirectories for each dataset in the specified output directory.
Note
xtl.autoproc internally splits the input path to the first image into raw_data_dir, dataset_dir and dataset_name components, so that first_image = {raw_data_dir}/{dataset_dir}/{dataset_name}_00001.ext. Therefore, the final output directory for each dataset will be {output_dir}/{dataset_dir}. To further understand the dataset discovery process, its gotchas and how to overcome them, read the Dataset discovery section.
A very common convention for diffraction data is to store the raw images and processed data in separate locations that have a similar file structure, e.g.:
RAW_DATA: /path/to/raw_data/datasets/dataset1/dataset1_00001.cbf.gz
PROCESSED: /path/to/processed/datasets/dataset1/
Here, the top-level directory is different (/path/to/raw_data vs. /path/to/processed), but the file tree is the same after some point (datasets/dataset1/). In this case, one can specify both --raw-dir and --out-dir to influence the dataset discovery process and ensure that all output files will be saved in the correct subdirectory:
$ xtl.autoproc process datasets.csv --raw-dir="/path/to/raw_data" --out-dir="/path/to/processed"
If we run the above command with the following CSV file:
datasets.csv
# first_image
/path/to/raw_data/datasets/dataset1/dataset1_00001.cbf.gz
/path/to/raw_data/datasets/dataset2/dataset2_00001.cbf.gz
/path/to/raw_data/datasets/dataset3/dataset3_00001.cbf.gz
then the output directories will be /path/to/processed/datasets/datasetX.
See also
See the Dataset discovery section for more details.
Parallelization#
By default, xtl.autoproc process will wait until each dataset has finished processing before starting the next one. However, if the system has adequate resources (e.g. on a high-performance cluster), one can run multiple autoPROC jobs in parallel:
$ xtl.autoproc process datasets.csv --no-jobs=2
Danger
Be careful when running multiple jobs in parallel, as autoPROC can be quite resource-intensive. It is recommended to monitor the system’s resources during the run to determine the optimal number of parallel jobs.
Additionally, one can specify the number of XDS jobs and processors for each autoPROC run, using:
$ xtl.autoproc process datasets.csv --xds-jobs=4 --xds-proc=8
This essentially sets the following two autoPROC parameters:
autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_JOBS=4
autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_PROCESSORS=8
process_wf command#
The process_wf command is a variation of the process command, intended to process datasets that have been collected using the Global Phasing workflow, available at certain synchrotrons. When collecting data with the GPhL workflow, a stratcal_gen.nml file is generated, containing information about the different sweeps, how they are related to each other, etc., but most importantly (at least for xtl.autoproc), the paths to the images.
While these data can be merged manually (see: Dataset merging), it is recommended to make use of the information available in the stratcal_gen.nml file. Typically, one would run the autoPROC process_wf command instead, providing the NML file as input. The xtl.autoproc process_wf command does exactly that, but also enables queuing multiple runs with the use of a CSV file, similar to the xtl.autoproc process command.
The typical usage of the xtl.autoproc process_wf command is as follows:
$ xtl.autoproc process_wf datasets.csv
where datasets.csv now contains a list of paths to NML files, instead of first images, e.g.:
datasets.csv
# nml_file
/path/to/dataset1/stratcal_gen.nml
/path/to/dataset2/stratcal_gen.nml
/path/to/dataset3/stratcal_gen.nml
Although additional options can be passed along using the CSV file, do note that the NML files already contain information about the unit cell, space group, relative crystal orientation between sweeps, etc.
Updating NML files#
The stratcal_gen.nml files are generated during data collection at the synchrotron. This means that the paths to the collected images will follow the file structure of the light source. If the data have been transferred to a different location (e.g. a local drive), then all the paths in the NML file would be invalid. However, the NML file still contains valuable information and should be the preferred way of processing GPhL workflow data.
In order to update the image directories within an NML file, a simple command is provided:
$ xtl.autoproc fixnml stratcal_gen.nml --from="/synchrotron/path/to/raw_data" --to="/local/path/to/raw_data" --check
This will read the stratcal_gen.nml file, and update the NAME_TEMPLATE for each sweep, by replacing the value of --from with the value of --to. For example, if the NML file contains:
stratcal_gen.nml
&SIMCAL_SWEEP_LIST
NAME_TEMPLATE = '/synchrotron/path/to/raw_data/datasets/dataset1/dataset1_####.cbf.gz'
then a stratcal_gen_updated.nml file will be created with:
stratcal_gen_updated.nml
&SIMCAL_SWEEP_LIST
NAME_TEMPLATE = '/local/path/to/raw_data/datasets/dataset1/dataset1_####.cbf.gz'
The --check flag will perform a GLOB search for dataset1_*.cbf.gz within the new directory, i.e. /local/path/to/raw_data/datasets/dataset1/, and if no files match that pattern, the user will be notified.
Multiple NML files can be updated at once, by providing a list of them as arguments to the command. At the end, an updated_nml.csv file will be saved in the current working directory, which will contain the absolute paths to the updated NML files and can be passed directly to xtl.autoproc process_wf.
options command#
The options command prints a table of all supported options for dataset discovery and autoPROC configuration that can be parsed from the CSV file.
CSV specification#
The datasets.csv file is a powerful way to fully customize the autoPROC runs on a per-dataset basis. Various options influencing the dataset discovery or autoPROC configuration can be passed along.
See also
See the options command for more details on the available options.
When preparing the input CSV file, a few rules should be followed:
The first line should start with # followed by a space, and then a comma-separated list of column names (without spaces). The order of the columns is not important.
If an unknown column is found in the header, it will be ignored.
Any subsequent lines starting with # will be treated as comments and ignored.
Each line should contain values for one dataset, in the same order as in the header.
Each line should contain the same number of values as the header, but one or more of them can be empty.
Each line should be terminated with a newline character.
Spaces in values will not be trimmed!
Taking all the above into account, a more complex CSV file might look like this:
datasets.csv
# first_image,unit_cell,space_group,reference_mtz,beamline
/path/to/dataset1/dataset1_00001.cbf.gz,78;78;37;90;90;90,P 43 21 2,,
/path/to/dataset2/dataset2_00002.cbf.gz,,,/path/to/reference.mtz,
/path/to/dataset3/dataset3_00003.cbf.gz,,,,PetraIIIP14
This will run the first dataset with the specified space group and unit cell, the second dataset with a reference MTZ file and the last one with the PetraIIIP14 macro file. Notice that each line contains the same number of commas, meaning that they all have the same number of columns. Also note that the unit_cell parameter is provided as a semicolon-separated list of values.
Any dataset-specific options specified in the CSV file will first be merged with the global options passed to the xtl.autoproc process command. If an option is specified both on a global and a dataset level, then the dataset one will take precedence. For example, when running the above CSV file with:
$ xtl.autoproc process datasets.csv --space-group="P 21"
the first dataset will be processed with space group \(P 4_3 2_1 2\), while the rest with \(P 2_1\). One can easily imagine the flexibility for fully customized runs that is provided with this architecture.
Technical documentation#
Dataset discovery#
The dataset discovery process is a crucial part of xtl.autoproc as it determines the output directories for each autoPROC run. When provided with an absolute path to the first image, xtl.autoproc tries to extract three values from that path: raw_data_dir, dataset_dir and dataset_name. The dataset_dir is particularly important, because the same value will be used to determine the output path, i.e. {output_dir}/{dataset_dir}.
Let’s consider the following example:
datasets.csv
# first_image
/path/to/raw_data/datasets/dataset1/dataset1_measurement1_00001.cbf.gz
When no additional information is provided, the dataset_dir is assumed to be the parent directory of the first image, and everything preceding that will be the raw_data_dir, i.e.:
/path/to/raw_data/datasets/dataset1/dataset1_measurement1_00001.cbf.gz
\________________________/ \______/ \___________________/
raw_data_dir dataset_dir dataset_name
In this case, the output directory for the dataset will be {output_dir}/dataset1, where output_dir can be specified using the --out-dir option (default: current working directory).
However, the discovery process can be influenced by providing a raw data directory, either globally with the --raw-dir option or on a per-dataset basis with the raw_data_dir parameter in the CSV file (see: CSV specification). If, for example, the user specifies --raw-dir="/path/to/raw_data", then the same path will be split as follows:
/path/to/raw_data/datasets/dataset1/dataset1_measurement1_00001.cbf.gz
\_______________/ \_______________/ \___________________/
raw_data_dir dataset_dir dataset_name
and the output directory for the dataset will be {output_dir}/datasets/dataset1.
Optionally, one can specify subsequent subdirectories within the above location with the --out-subdir flag or the output_subdir column in the CSV file. For example, setting --out-subdir=my/output will put the output of autoPROC in {output_dir}/datasets/dataset1/my/output.
Dataset name determination#
In order to explicitly instruct autoPROC to process a specific dataset, xtl.autoproc constructs a -Id flag, which is of the form:
-Id xtl1234,/path/to/raw_data/datasets/dataset1/,dataset1_measurement1_0####.cbf.gz,1,3600
\_____/ \_________________________________/ \_________________________________/ | \__/
sweep_id image_directory image_template | last_image_no
first_image_no
The sweep_id is a unique alphanumeric identifier (irrelevant for the user), image_directory is set to {raw_data_dir}/{dataset_dir}/, while image_template, first_image_no and last_image_no are required for XDS.INP. The part of the image_template preceding the # characters is called dataset_name within xtl.autoproc.
Essentially, the dataset_name is the part of the image filename preceding the image number (dataset1_measurement1 in the above example). To determine that value, a GLOB search is performed within the dataset directory, the results are sorted alphabetically, and a character-by-character comparison is performed between the first and last result, i.e. the first and last image (hopefully). Then dataset_name is set to the longest common prefix of the two filenames, e.g. dataset1_measurement1_0 if the first and last images are dataset1_measurement1_00001.cbf.gz and dataset1_measurement1_03600.cbf.gz.
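The core of this heuristic fits in a few lines; the following Python sketch (illustrative, not the actual implementation) shows the glob-sort-compare step:
import glob
import os

def guess_dataset_name(dataset_dir, pattern="*.cbf.gz"):
    # Glob all images, sort alphabetically, and take the longest common
    # prefix of the first and last filenames (assumes at least one match)
    images = sorted(glob.glob(os.path.join(dataset_dir, pattern)))
    first = os.path.basename(images[0])
    last = os.path.basename(images[-1])
    return os.path.commonprefix([first, last])

# e.g. returns 'dataset1_measurement1_0' for images ranging from
# dataset1_measurement1_00001.cbf.gz to dataset1_measurement1_03600.cbf.gz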
In practice, there are a few more tricks in place to ensure that the dataset name determination is robust enough, although it can never be foolproof. In case of atypical naming conventions, the automatic dataset name determination might fail. In such cases, one can explicitly define all the above parameters in the CSV file, e.g.:
datasets.csv
# raw_data_dir,dataset_dir,dataset_name
/path/to/raw_data,/datasets/dataset1,dataset1_measurement1.cbf.gz
Notice that dataset1_measurement1.cbf.gz is not the same as the first image filename (dataset1_measurement1_00001.cbf.gz), but rather the dataset_name plus the file extension. This is enough to determine the first image by performing a GLOB search for {dataset_name}*{file_extension} within {raw_data_dir}/{dataset_dir}. The first alphabetically sorted result will be considered as the first image.
Job execution#
For each dataset to be processed, xtl.autoproc will create a job. A job is a subprocess that runs autoPROC in the background. To better organize the autoPROC runs, we have opted to include the input to autoPROC in a .dat macro file, which is in turn passed to the autoPROC process command via an intermediate shell script.
Once a job is launched, the job directory is first created within {output_dir}/{dataset_dir}, typically in the form of autoproc_runXX, and then the macro file and shell script are created within that directory. A typical shell script and macro file will look like this:
xtl_autoPROC.sh
#!/bin/bash
process -M /path/to/processed/datasets/dataset1/autoproc_run01/xtl_autoPROC.dat -d /path/to/processed/datasets/dataset1/autoproc_run01/autoproc
xtl_autoPROC.dat
# autoPROC macro file
# Generated by xtl v.0.1.0 on 2025-12-30T19:47:35.620825
# user@host [distro]
### Dataset definitions
# autoproc_id = xtl7156
# no_sweeps = 1
## Sweep 1 [xtl7156]: dataset1_measurement1
# raw_data = /path/to/raw_data/datasets/dataset1/
# first_image = dataset1_measurement1_00001.cbf.gz
# image_template = dataset1_measurement1_#####.cbf.gz
# img_no_first = 1
# img_no_last = 3600
# idn = xtl7156,/path/to/raw_data/datasets/dataset1/,dataset1_measurement1_#####.cbf.gz,1,3600
### CLI arguments (including dataset definitions and macros)
__args='-Id "xtl7156,/path/to/raw_data/datasets/dataset1/,dataset1_measurement1_#####.cbf.gz,1,3600" -B -M HighResCutOnCChalf'
### User parameters
cell="79.0 79.0 37.0 90.0 90.0 90.0"
symm="P43212"
nres=129
### XDS parameters
autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_JOBS=16
autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_PROCESSORS=4
...
As you can see, the xtl_autoPROC.sh script runs the autoPROC process command, passing the xtl_autoPROC.dat macro file as input along with the output directory for the autoPROC run. In turn, the macro file contains the rest of the autoPROC parameters, as well as some debug information for the dataset discovery process of xtl.autoproc.
When the xtl_autoPROC.sh script is executed as a subprocess, its standard output and error streams (and subsequently those of the autoPROC process command) are redirected to xtl_autoPROC.stdout.log and xtl_autoPROC.stderr.log, respectively, within the job directory.
The jobs are executed asynchronously, and the main event loop ensures that only a certain number of jobs are running simultaneously (controlled by the --no-jobs option, default: 1).
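This concurrency pattern can be pictured as a semaphore-guarded event loop; the following Python sketch illustrates the idea (it is not the actual xtl.autoproc implementation):
import asyncio

async def run_job(script, semaphore):
    # Run one job script as a subprocess; the semaphore caps how many
    # jobs may run at the same time
    async with semaphore:
        proc = await asyncio.create_subprocess_shell(f"bash {script}")
        return await proc.wait()

async def run_all(scripts, no_jobs=1):
    # no_jobs mirrors the --no-jobs option (default: 1)
    semaphore = asyncio.Semaphore(no_jobs)
    return await asyncio.gather(*(run_job(s, semaphore) for s in scripts))

# asyncio.run(run_all(["job1/xtl_autoPROC.sh", "job2/xtl_autoPROC.sh"], no_jobs=2))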
Post-completion tidy-up#
Once the autoPROC process command exits, a few tidy-up tasks are triggered. First, xtl.autoproc will try to determine if the autoPROC run was successful (i.e. whether it yielded a reflection file), by checking for the presence of the staraniso_alldata-unique.mtz file. It then copies the following files from the autoPROC output directory to the job directory, for easier access:
summary.html
report.pdf
report_staraniso.pdf
truncate-unique.mtz
staraniso_alldata-unique.mtz
xtlXXXX.dat
The xtlXXXX.dat file contains all the parameters that autoPROC digested from the user’s input. The two MTZ files will be prepended with the dataset_name.
Finally, if the autoPROC run was deemed successful, an xtl_autoPROC.json file will also be generated within the job directory. This JSON file combines results from imginfo.xml, truncate.xml, staraniso.xml and CORRECT.LP, and can be very convenient for downstream programmatic parsing of autoPROC results. A little jiffy (xtl.autoproc json2csv) is provided to convert the JSON files from all jobs into a single monolithic CSV file, which may or may not be more convenient to work with than the individual JSON files.
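As an example of such downstream parsing, the JSON files from all job directories can be collected with a few lines of Python (the directory layout below is an assumption based on the examples in this page, and the JSON keys depend on your runs; this is only a sketch):
import json
from pathlib import Path

# Collect every xtl_autoPROC.json found under the processed-data tree,
# keyed by the name of the job directory that contains it
results = {
    path.parent.name: json.loads(path.read_text())
    for path in Path("/path/to/processed").rglob("xtl_autoPROC.json")
}
print(f"Parsed {len(results)} autoPROC runs")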
Advanced parametrization#
Although xtl.autoproc includes options for the most frequently used autoPROC parameters, sometimes this may not be enough. To ensure that any possible autoPROC configuration can be launched within the provided framework, any arbitrary parameter can be passed along to autoPROC in one of two ways.
On a global level, using the -x/--extra option will pass a single parameter=value pair. If more than one parameter needs to be passed along, then the -x option can be provided multiple times, e.g.:
$ xtl.autoproc process datasets.csv -x autoPROC_XdsIntegPostrefNumCycle=5 -x wave=0.9876
On a dataset level, the same can be achieved by specifying an extra_params column, which expects a semicolon-separated list of parameter=value pairs, e.g.:
datasets.csv
# first_image,extra_params
/path/to/dataset1/dataset1_00001.cbf.gz,autoPROC_XdsIntegPostrefNumCycle=5;wave=0.9876
/path/to/dataset2/dataset2_00001.cbf.gz,wave=0.9876
Internally, the provided arguments will be converted into a {parameter: value} dictionary by splitting on the = character; proper character escaping is then applied to each value (e.g. padding with double quotes if it contains spaces) before it is passed to the autoPROC command. However, do note that no checks will be performed on the parameter names to ensure that they are valid autoPROC parameters. This responsibility falls on the user. A list of all the supported autoPROC parameters can be found here.
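For illustration, the conversion could look roughly like the following Python sketch (not the actual implementation; the quoting rules in xtl.autoproc may differ):
def parse_extra_params(extra):
    # Split a semicolon-separated list of parameter=value pairs into a
    # dictionary, quoting values that contain spaces
    params = {}
    for pair in extra.split(";"):
        key, _, value = pair.partition("=")
        if " " in value:
            value = f'"{value}"'
        params[key] = value
    return params

# {'autoPROC_XdsIntegPostrefNumCycle': '5', 'wave': '0.9876'}
print(parse_extra_params("autoPROC_XdsIntegPostrefNumCycle=5;wave=0.9876"))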