Skip to main content

Path Regularization

In this entry I'd like to document something that I get asked by few people and for which I assembled a script several years ago. I'd like to put it here for everyone to use (and perhaps for some to contribute, just a wish).

In the past I've worked on something called "Path Collective Variables" (PCVs) which are some useful descriptors that are used in enhanced sampling (umbrella sampling, metadynamics, steered-MD and others). 

The essence of the PCVs is that if you have a trajectory, being it true or fictitious, of a reactive event you can use this as a blueprint, a sort of "track" and ask the system to walk over it using some enhanced sampling method so that you will be able to measure the thermodynamic force along it and calculate the free energy. 

Getting a rough reactive trajectory can be simple in many cases. One can put a harmonic restraint on the coordinate of your target (say "state B") structure and pull the system toward it with some beastly method like Steered MD (SMD), dragging it from your initial state "A" to state "B".  What you will get is probably a reactive path which is not great but it is a start, and sometimes you can also improve on it  rather easily (something me and Alessio Lodola have shown in the past).

PCVs need a "template path" to draw the initial track. So what is needed once you have the reactive trajectory is a set of snapshot which are equally spaced in the PCVs metrics.

If you're using SMD most of the time you're in luck and the snapshots are almost equally spaced between the state "A" (the starting structure) and state "B" (the target structure), but they're not homogeneously spaced in the metrics used by PCVs and this is a sort of requirement for the most naive version of PCVs (this is not the case with Bernd Ensing geometric path formulation for example but there are other implications there).

The reason why the snapshots are inhomogeneously spaced may be due to the fact that you've been steering for example with a distance between two centers of mass but the PCVs you want to use are defined say in RMSD or contact map.

You may ask "If I can do everything with a distance why should I bother doing it with PCVs which are complex and tedious to setup". Good question.

My personal answer is that when restraining a distance you can control only that degree-of-freedom (DOF). With PCVs you can restrain the change of multiple coordinated degrees of freedom at the same time and therefore have a thermodynamic view on a specific realization of a given event.

This is an overkill when you have a single pathway between two states, but if you suspect you have multiple pathways, once you manage to harvest them (by multiple SMD for example), then you can use PCVs to calculate the barrier of each individual pathway without the need of finding a specific set of CVs per each pathway that would be able to drive each of them (because, if you're unlucky as I am, you may need different CVs to drive different pathways :-/ )

A way to homogenize the snapshot was formulated a while ago by Luca Maragliano and Eric Vanden-Eijnden in the context of the "string method" and I took inspiration from them to do put this together. 

Assume I want to extract N "milestones" to form my PCVs. What I do here is first read a PDB file containing the reactive trajectory. For the sake of the discussion, s uppose that a representation of your initial set of trajectory snapshot is like this (here idealized as dots)

The line represent the interpolation between two frames acquired at consecutive times. One can get this by doing linear interpolation after optimal alignment. We'll see later how this is used.

Now, let's say you want a set of 6 equally distributed milestones for your PCVs. You first calculate the "total distance" as a sum of the red segments and you divide by 5 (6 snapshots correspond to 5 segments) and you calculate the "ideal" location of these along the red path.

ß

Unfortunately that's not the end because now when you calculate the distance over the blue dots as a sum of the blue dot - to blue dot distance (blue segments) this is different from the sum of the red dots. In particular you've been cutting some corners here and there and here the distance measured will be shorter.
Therefore you may want to bring some snapshot closer together to close the gaps.

Now if you would evolve according to the arrows your new points will cut corners and you don't want that. The corners are actually interesting: it is where a reaction changes its behaviour and relative weight of its components. So what you would rather do is, instead of moving along the blue line, move along the red line. The red line is the interpolation of the trajectory and I assume this to be more reliable because is done between real trajectory snapshots, while an interpolation between blue snapshot (which are already fictitious) can be very misleading IMHO.

Since I don't know how much I need to move, I just record the direction of motion and move for an arbitrary step I decide upfront (yup, this means there's a parameter involved). Note that I don't move the milestones at the edge (dark blue) otherwise the string will contract indefinitely.
Now the hope is that new milestones may be better spaced and would follow the natural path in between the red lines.
The process needs to be iterated a number of times to get to a minimum of something I call "harmonic energy" which is essentially the sum of square distance between blue frames.
This is ideally minimized (but will never be zero).
Okay, now that we have the idea of what is done, let's see the script that you can find here.
To run this script you need a linux/mac machine, VMD, perl, bash. In the folder you also have a test trajectory of a sugar molecule I run a long time ago.
You can run a first example using this

./rmsd_mc.sh  -f trajframes.pdb  -r 20 -t 1
where "-f trajframes.pdb" indicates the input trajectory, "-r 20" indicates that you want to reparametrize in 20 frames, and "-t 1" means that you read every snapshot of the input trajectory (sometimes is useful a bit of coarsening in input if your trajectory is humongous, and in general useful to run some initial tests).
This will do zero step of the previous process: basically will calculate the blue squares the first time without trying to evolve them. Sometimes this is quite good already.
Here is the output

DELTA from 0 to 1 : 0.1592
DELTA from 1 to 2 : 0.1083
DELTA from 2 to 3 : 0.1312
DELTA from 3 to 4 : 0.1455
DELTA from 4 to 5 : 0.1191
DELTA from 5 to 6 : 0.1723
DELTA from 6 to 7 : 0.3124
DELTA from 7 to 8 : 0.2749
DELTA from 8 to 9 : 0.2412
DELTA from 9 to 10 : 0.2464
DELTA from 10 to 11 : 0.1935
DELTA from 11 to 12 : 0.1608
DELTA from 12 to 13 : 0.2007
DELTA from 13 to 14 : 0.1774
DELTA from 14 to 15 : 0.2200
DELTA from 15 to 16 : 0.0971
DELTA from 16 to 17 : 0.1091
DELTA from 17 to 18 : 0.1365
DELTA from 18 to 19 : 0.0907
HARMONIC_ENERGY 0.6430
******************************
>>>> MSD        0.03384385352114148  Ang^2
>>>> MSD_STDEV  0.023862404602339612  Ang^2
>>>> LAMBDA 67.9591642412482  [Ang ^ -2]
>>>> SAFE_LIMIT 39.8570289391908 [Ang ^ -2]
>>>> Note: if you use gromacs you have to increase these values of 100 since units are [ nm ^-2 ]
>>>> Note2: LAMBDA and SAFE_LIMIT are truly different your matrix could be messed up: check MSD_MATRIX FILE
******************************

you can see that the distance between milestones is indicated and the harmonic energy is reported. 
You can see that distances are quite different from each other, and there are fluctuations which are significant.
At the bottom you also get the average MSD (Mean Square Deviation) between milestones and its standard deviation. This is useful to calculate the lambda parameter for PCVs. 

 
or, if you have a large inhomogeneity is slightly better to accept something larger, calibrated on the standard deviation between milestones MSD (this is called SAFE LIMIT in output).

 
Note that the units are Angstrom squared. You may need to convert if you use a MD software that uses nm instead (who on earth is using nm anyways? You-know-who! :-) 
In output it produces a file REPARAM.pdb that contains the reparametrized milestones.

Now you may want to try multiple iterations.  
A good place to start is using 10 steps (-n 10) and a small step ( -s 0.1)

./rmsd_mc.sh  -f trajframes.pdb  -r 20 -t 1 -n 10 -s 0.1. | grep HARMONIC_ENERGY

In output you see something like this


HARMONIC_ENERGY 0.6511
HARMONIC_ENERGY 0.6430
HARMONIC_ENERGY 0.6357
HARMONIC_ENERGY 0.6225
HARMONIC_ENERGY 0.6128
HARMONIC_ENERGY 0.6050
HARMONIC_ENERGY 0.5978
HARMONIC_ENERGY 0.5923
HARMONIC_ENERGY 0.5880
HARMONIC_ENERGY 0.5838
HARMONIC_ENERGY 0.5797
HARMONIC_ENERGY 0.5763
HARMONIC_ENERGY 0.5736
HARMONIC_ENERGY 0.5708
HARMONIC_ENERGY 0.5674
HARMONIC_ENERGY 0.5645
HARMONIC_ENERGY 0.5624
HARMONIC_ENERGY 0.5610
HARMONIC_ENERGY 0.5604
HARMONIC_ENERGY 0.5605
HARMONIC_ENERGY 0.5609
 so you can clearly see that the inter-milestones distance is kind of converging and then doesn't change

and the last iteration will look like this:

DELTA from 0 to 1 : 0.1406
DELTA from 1 to 2 : 0.1035
DELTA from 2 to 3 : 0.1742
DELTA from 3 to 4 : 0.1386
DELTA from 4 to 5 : 0.1333
DELTA from 5 to 6 : 0.1873
DELTA from 6 to 7 : 0.2155
DELTA from 7 to 8 : 0.2389
DELTA from 8 to 9 : 0.2429
DELTA from 9 to 10 : 0.2279
DELTA from 10 to 11 : 0.1969
DELTA from 11 to 12 : 0.1740
DELTA from 12 to 13 : 0.1921
DELTA from 13 to 14 : 0.1816
DELTA from 14 to 15 : 0.1569
DELTA from 15 to 16 : 0.1119
DELTA from 16 to 17 : 0.1172
DELTA from 17 to 18 : 0.1210
DELTA from 18 to 19 : 0.0957
HARMONIC_ENERGY 0.5609
******************************
>>>> MSD        0.029523467304599694  Ang^2
>>>> MSD_STDEV  0.015375323992993602  Ang^2
>>>> LAMBDA 77.90412881625407  [Ang ^ -2]
>>>> SAFE_LIMIT 51.22632332695527 [Ang ^ -2]
>>>> Note: if you use gromacs you have to increase these values of 100 since units are [ nm ^-2 ]
>>>> Note2: LAMBDA and SAFE_LIMIT are truly different your matrix could be messed up: check MSD_MATRIX FILE
******************************

You can clearly see that now the standard deviation is smaller and this is a sign that the optimization has worked properly.

There's also another option which is a Monte Carlo optimizaiton instead of a steepest-descent like but I've just used it for experimenting and in the end I always ended using more the default steepest descent. So I won't go in details. 

A last paragraph about how this script is assembled. 

It is horrible :)

I just wanted to use VMD in a vanilla way (hence tcl) and VMD had already a bunch of functionality for trajectory manipulation so what I did is a bash script which prints a tcl executable and execute it in batch mode.  Not very classy but it is kind of functional because one doesn't have to tinker with paths and other interesting things. Happy if anyone picks it up and repackages in a more stylish way.




Comments