Commit 4c06bdb (parent 950774a)

generalfit.py: number_of_functions_per_element parameter
update documentation

5 files changed, 107 additions, 29 deletions

docs/pacemaker/inputfile.md (4 additions, 3 deletions)

@@ -48,7 +48,7 @@ data: # dataset specification section
   # query_limit: 1000 # limiting number of entries to query from `structdb`
   # ignored if reading from cache

-  # cache_ref_df: True # whether to store the queried or modified dataset into file, default - True
+  # cache_ref_df: False # whether to store the queried or modified dataset into file, default - True
   # filename: some.pckl.gzip # force to read reference pickled dataframe from given file
   # ignore_weights: False # whether to ignore energy and force weighting columns in dataframe
   # datapath: ../data # path to folder with cache files with pickled dataframes

@@ -66,7 +66,7 @@ Example of creating the **subselection of fitting dataframe** and saving it is g

 Example of generating **custom energy/forces weights** is given in `notebooks/data_custom_weights.ipynb`

-### Querying data
+### Querying data (using structDB only)
 You can just query and preprocess data, without running potential fitting.
 Here is the minimalistic input YAML:

@@ -197,7 +197,8 @@ potential:

   ## possible keywords: ALL, UNARY, BINARY, TERNARY, QUATERNARY, QUINARY,
   ## element combinations as (Al,Al), (Al, Ni), (Al, Ni, Zn), etc...
-  functions:
+  functions:
+    # number_of_functions_per_element: 700 # specify the total number of functions per element to keep
     UNARY: {
       nradmax_by_orders: [15, 3, 2, 2, 1],
       lmax_by_orders: [ 0, 2, 2, 1, 1],

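Read together, the inputfile.md hunks mean the new keyword lives one level below `functions:` in the `potential` section. A sketch of how it would look when enabled (uncommented here for illustration; 700 is the example value from the diff, and the `UNARY` block is copied from the surrounding context):

```yaml
potential:
  functions:
    number_of_functions_per_element: 700  # total number of functions per element to keep
    UNARY: {
      nradmax_by_orders: [15, 3, 2, 2, 1],
      lmax_by_orders: [ 0, 2, 2, 1, 1],
    }
```
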
docs/pacemaker/quickstart.md (42 additions, 9 deletions)

@@ -7,7 +7,35 @@ process.
 In this section we will describe the format of the fitting dataset, we will run a fit with an example dataset and
 overview the output produced by `pacemaker`. Input parameters are detailed in the [section](inputfile.md#Input_file) below.

-## Fitting dataset preparation
+## Automatic DFT data collection
+
+You can collect DFT calculations (currently only for VASP from `vasprun.xml` or `OUTCAR` files) by using `pace_collect`
+utility. For example, if your data is in `my_dft_calculation/` folder and subfolders, and single atoms reference energies
+are -0.123 eV for Al and -0.456 eV for Cu, then run command
+```
+pace_collect -wd path/to/my_dft_calculation --free-atom-energy Al:-0.123 Cu:-0.456
+```
+that will scan through all folders and subfolders and collect DFT free energies (that are force-consistent) and forces
+and make a single atom corrections. Resulting dataset will be stored into `collected.pckl.gzip` file.
+
+If you need more flexibility for DFT dataset manipulation,
+please check [Manual fitting dataset preparation](#markdown-header-manual-fitting-dataset-preparation).
+
+## Automatic input file generation
+
+In order to fit an ACE potential, one need to create a configurational file with relevant settings.
+`pacemaker` utilizes `.yaml` format for configurations.
+
+In order to interactively generate default `pacemaker` input file `input.yaml`, please run
+```
+pacemaker -t
+```
+and enter requested information, such as dataset filename, test set size (optional), list of elements, cutoff,
+number of functions. Doing so will produce an `input.yaml` file with the most general
+settings that can be adjusted for a particular task. Detailed overview of the input file parameters can be found in the
+[section](#input-file-overview) below.
+
+## Manual fitting dataset preparation

 In order to use your data for fitting with `pacemaker` one would need to provide it in the form of `pandas` DataFrame.
 An example DataFrame can be red as:

@@ -97,12 +125,7 @@ or use the utility `pace_collect` from a top-level directory to collect VASP cal
 The resulting dataframe can be used for fitting with `pacemaker`.

 ## Creating an input file
-
-In order to fit an ACE potential to the data prepared following the previous section, one need to create a configurational
-file with relevant settings. `pacemaker` utilizes `.yaml` format for configurations. An input file template can be created
-by running `pacemaker --template` (or `pacemaker -t`). Doing so will produce an `input.yaml` file with the most general
-settings that can be adjusted for a particular task. Detailed overview of the input file parameters can be found in the
-[section](#input-file-overview) below.
+
 In this example we will use template as it is, however one would need to provide a path to the
 example dataset `exmpl_df.pckl.gzip`. This can be done by changing `filename` parameter in the `data` section of the
 `input.yaml`:

@@ -129,8 +152,8 @@ nohup pacemaker input.yaml &
 ```
 For more `pacemaker` command options see the corresponding [section](#pacemaker-commands).

-Default behavior of pacemaker is to utilize a GPU accelerated fitting of ACE using `tensorpotential`. However, GPU
-parallelization is not supported at the moment. Therefore, if your machine has a multi GPU setup one would need to select
+Default behavior of pacemaker is to utilize a GPU accelerated fitting of ACE using `tensorpotential`. However,
+parallelization over multiple GPU is not supported at the moment. Therefore, if your machine has a multi GPU setup one would need to select
 a single one before running `pacemaker`. This can be done by executing `export CUDA_VISIBLE_DEVICES=ind` in the shell
 replacing `ind` with the GPU index (i.g. 0, 1, ...) or -1 to disable GPU usage.
 Note, that `tensorpotential` can be used without a GPU as well.

@@ -277,6 +300,16 @@ For more information see [here](https://lammps.sandia.gov/doc/Build_cmake.html).
 3. Build LAMMPS using `cmake --build .` or `make`


+Please note, that there is a KOKKOS implementation of PACE for LAMMPS as `pair_style pace/kk`, but you need to compile
+LAMMPS with this support, see official documentation [here](https://docs.lammps.org/Build_extras.html#kokkos).
+This implementation allows to run calculations on GPU which give the speedup of **up to x100** on modern GPU architectures
+in comparison to single-core CPU. In that case you should modify LAMMPS input script as
+```
+## in.lammps
+
+pair_style pace product
+pair_coeff * * output_potential.yace Al Ni
+```

 ## More examples

src/pyace/generalfit.py (32 additions, 1 deletion)

@@ -217,8 +217,34 @@ def __init__(self,
             else:
                 self.target_bbasisconfig = construct_bbasisconfiguration(potential_config,
                                                                          initial_basisconfig=self.initial_bbasisconfig)
+            if ("functions" in potential_config and
+                    "number_of_functions_per_element" in potential_config['functions']):
+                num_block = len(self.target_bbasisconfig.funcspecs_blocks)
+                number_of_functions_per_element = potential_config["functions"]["number_of_functions_per_element"]
+                target_bbasis = ACEBBasisSet(self.target_bbasisconfig)
+                nelements = target_bbasis.nelements
+                ladder_step = number_of_functions_per_element * nelements // num_block
+                expected_number_of_functions = ladder_step * num_block
+                log.info(
+                    """Target potential contains {total_number_of_functions} functions,"""
+                    """ but is limited to maximum {number_of_functions_per_element}"""
+                    """ functions per element for {nelements} elements ({num_block} blocks)""".format(
+                        total_number_of_functions=self.target_bbasisconfig.total_number_of_functions,
+                        number_of_functions_per_element=number_of_functions_per_element,
+                        nelements=nelements,
+                        num_block=num_block))
+
+                initial_basisconfig = self.target_bbasisconfig.copy()
+                clean_bbasisconfig(initial_basisconfig)
+                current_bbasisconfig = extend_multispecies_basis(initial_basisconfig, self.target_bbasisconfig,
+                                                                 "power_order", ladder_step)
+                self.target_bbasisconfig = current_bbasisconfig
+                log.info("Resulted potential contains {} functions".format(
+                    self.target_bbasisconfig.total_number_of_functions))
+
             log.info("Target potential shape constructed from dictionary, it contains {} functions".format(
                 self.target_bbasisconfig.total_number_of_functions))
+
         elif isinstance(potential_config, str):
             self.target_bbasisconfig = BBasisConfiguration(potential_config)
             log.info("Target potential loaded from file '{}', it contains {} functions".format(potential_config,

@@ -236,6 +262,8 @@ def __init__(self,
                 ("Non-supported type: {}. Only dictionary (configuration), " +
                  "str (YAML file name) or BBasisConfiguration are supported").format(
                     type(potential_config)))
+        # save target_potential.yaml
+        self.target_bbasisconfig.save(TARGET_POTENTIAL_YAML)

         if FIT_LADDER_STEP_KW in fit_config and not self.ladder_scheme:
             if self.initial_bbasisconfig is None:

@@ -374,6 +402,9 @@ def test_metric_callback(self, metrics_dict, extended_display_step=None):
         self.metrics_aggregator.test_metric_callback(metrics_dict, extended_display_step=extended_display_step)

     def save_fitting_data_info(self):
+        # columns to save: w_energy, w_forces, NUMBER_OF_ATOMS, PROTOTYPE_NAME, prop_id,structure_id, gen_id, if any
+        # columns_to_save = ["PROTOTYPE_NAME", "NUMBER_OF_ATOMS", "prop_id", "structure_id", "gen_id", "pbc"] + \
+        #                   [ENERGY_CORRECTED_COL, EWEIGHTS_COL, FWEIGHTS_COL]
         columns_to_drop = ["tp_atoms", "atomic_env"]
         fitting_data_columns = self.fitting_data.columns

@@ -467,7 +498,7 @@ def cycle_fitting(self, bbasisconfig: BBasisConfiguration) -> BBasisConfiguratio
             log.warning(
                 ("Number of finished fit cycles ({}) >= number of expected fit cycles ({}). " +
                  "Use another potential or remove `{}` from potential metadata")
-                .format(finished_fit_cycles, fit_cycles, "_" + FIT_FIT_CYCLES_KW))
+                    .format(finished_fit_cycles, fit_cycles, "_" + FIT_FIT_CYCLES_KW))
             return current_bbasisconfig

         fitting_attempts_list = []

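The budget arithmetic introduced in generalfit.py spreads the per-element function cap evenly over all species blocks via floor division. A standalone sketch with illustrative numbers (700 is the documentation's example value; the element and block counts are made up, not from the commit):

```python
number_of_functions_per_element = 700
nelements = 2   # hypothetical: Al, Ni
num_block = 3   # hypothetical: unary Al, unary Ni, binary Al-Ni blocks

# total budget spread evenly over all species blocks (same expression as the diff)
ladder_step = number_of_functions_per_element * nelements // num_block
expected_number_of_functions = ladder_step * num_block

print(ladder_step, expected_number_of_functions)  # 466 1398
```

Note that because of the floor division, the resulting total (1398 here) can come out slightly below `nelements * number_of_functions_per_element` (1400).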
src/pyace/multispecies_basisextension.py (4 additions, 3 deletions)

@@ -25,7 +25,7 @@
 TERNARY = "TERNARY"
 QUATERNARY = "QUATERNARY"
 QUINARY = "QUINARY"
-KEYWORDS = [ALL, UNARY, BINARY, TERNARY, QUATERNARY, QUINARY]
+KEYWORDS = [ALL, UNARY, BINARY, TERNARY, QUATERNARY, QUINARY, 'number_of_functions_per_element']

 NARY_MAP = {UNARY: 1, BINARY: 2, TERNARY: 3, QUATERNARY: 4, QUINARY: 5}
 PERIODIC_ELEMENTS = chemical_symbols = [

@@ -225,7 +225,8 @@ def species_key_to_bonds(key):
     return bonds


-def create_multispecies_basis_config(potential_config: Dict, unif_mus_ns_to_lsLScomb_dict: Dict = None,
+def create_multispecies_basis_config(potential_config: Dict,
+                                     unif_mus_ns_to_lsLScomb_dict: Dict = None,
                                      func_coefs_initializer="zero",
                                      initial_basisconfig: BBasisConfiguration = None) -> BBasisConfiguration:
     """

@@ -636,7 +637,7 @@ def create_species_block(elements_vec: List, block_spec_dict: Dict,
     if func_coefs_initializer == "zero":
         coefs = [0] * ndensity
     elif func_coefs_initializer == "random":
-        coefs = np.random.randn(ndensity)
+        coefs = np.random.randn(ndensity)*1e-4
     else:
         raise ValueError(
             "Unknown func_coefs_initializer={}. Could be only 'zero' or 'random'".format(

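The last hunk scales the "random" coefficient initializer by 1e-4, so a freshly created species block starts with near-zero (but not exactly zero) coefficients instead of O(1) values. A stdlib stand-in for the idea (the diff itself uses `np.random.randn`):

```python
import random

random.seed(0)
ndensity = 2
# standard-normal draws scaled down by 1e-4, mirroring np.random.randn(ndensity)*1e-4
coefs = [random.gauss(0.0, 1.0) * 1e-4 for _ in range(ndensity)]
print(coefs)  # two values with magnitude on the order of 1e-4
```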
src/pyace/preparedata.py (25 additions, 13 deletions)

@@ -28,6 +28,9 @@

 log = logging.getLogger(__name__)

+REF_PROP_NAME = '1-body-000001:static'
+REF_GENERIC_PROTOTYPE_NAME = '1-body-000001'
+
 # ## QUERY DATA
 LATTICE_COLUMNS = ["_lat_ax", "_lat_ay", "_lat_az",
                    "_lat_bx", "_lat_by", "_lat_bz",

@@ -129,17 +132,7 @@ def query_data(config: Dict, seed=None, query_limit=None, db_conn_string=None):
     if REF_ENERGY_KW not in config:
         try:
             # TODO: generalize query of reference property
-            REF_PROP_NAME = '1-body-000001:static'
-            REF_GENERIC_PROTOTYPE_NAME = '1-body-000001'
-            ref_prop = storage.query(StaticProperty).join(StructureEntry, GenericEntry).filter(
-                Property.CALCULATOR == reference_calculator,
-                Property.NAME == REF_PROP_NAME,
-                StructureEntry.COMPOSITION.like(config["element"] + "-%"),
-                StructureEntry.NUMBER_OF_ATOMS == 1,
-                GenericEntry.PROTOTYPE_NAME == REF_GENERIC_PROTOTYPE_NAME
-            ).one()
-            # free atom reference energy
-            ref_energy = ref_prop.energy / ref_prop.n_atom
+            ref_energy = query_reference_energy(config["element"], reference_calculator, storage)
         except NoResultFound as e:
             log.error(("No reference energy for {} was found in database. " +
                        "Either add property named `{}` with generic named `{}` to database or use `{}` " +

@@ -214,6 +207,20 @@ def query_data(config: Dict, seed=None, query_limit=None, db_conn_string=None):
     return df_total, ref_energy


+def query_reference_energy(element, reference_calculator, storage):
+    from structdborm import StructureEntry, StaticProperty, GenericEntry, Property
+    ref_prop = storage.query(StaticProperty).join(StructureEntry, GenericEntry).filter(
+        Property.CALCULATOR == reference_calculator,
+        Property.NAME == REF_PROP_NAME,
+        StructureEntry.COMPOSITION.like(element + "-%"),
+        StructureEntry.NUMBER_OF_ATOMS == 1,
+        GenericEntry.PROTOTYPE_NAME == REF_GENERIC_PROTOTYPE_NAME
+    ).one()
+    # free atom reference energy
+    ref_energy = ref_prop.energy / ref_prop.n_atom
+    return ref_energy
+
+
 class StructuresDatasetWeightingPolicy:
     def generate_weights(self, df):
         raise NotImplementedError

@@ -639,7 +646,7 @@ def get_fit_dataframe(self, force_query=None, weights_policy=None, ignore_weight

 class EnergyBasedWeightingPolicy(StructuresDatasetWeightingPolicy):

-    def __init__(self, nfit=20000,
+    def __init__(self, nfit=None,
                  cutoff=None,
                  DElow=1.0,
                  DEup=10.0,

@@ -705,6 +712,10 @@ def __str__(self):
                    reftype=self.reftype, seed=self.seed)

     def generate_weights(self, df):
+        if self.nfit is None:
+            self.nfit = len(df)
+            log.info("Set nfit to the dataset size {}".format(self.nfit))
+
         if self.reftype == "bulk":
             log.info("Reducing to bulk data")
             df = df[df.pbc]

@@ -1019,7 +1030,8 @@ def generate_weights(self, df):
             if col_to_drop in df.columns:
                 df.drop(columns=col_to_drop, inplace=True)

-        mdf = pd.merge(df, self.weights_df[[WEIGHTS_ENERGY_COLUMN, WEIGHTS_FORCES_COLUMN]], left_index=True, right_index=True)
+        mdf = pd.merge(df, self.weights_df[[WEIGHTS_ENERGY_COLUMN, WEIGHTS_FORCES_COLUMN]], left_index=True,
+                       right_index=True)
         if not (mdf[FORCES_COLUMN].map(len) == mdf[WEIGHTS_FORCES_COLUMN].map(len)).all():
             error_msg = ("Shape of the `{}` column doesn't correspond to the shape of "
                          "`forces` column in original dataframe").format(WEIGHTS_FORCES_COLUMN)

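The `nfit` default in `EnergyBasedWeightingPolicy` changes from 20000 to `None`, which `generate_weights` now interprets as "use the whole dataframe". A minimal stand-in for that fallback (the `resolve_nfit` helper is hypothetical; the real code mutates `self.nfit` in place):

```python
def resolve_nfit(nfit, df):
    """Fall back to the dataset size when no explicit nfit is given (sketch)."""
    if nfit is None:
        nfit = len(df)
    return nfit

print(resolve_nfit(None, range(1234)))   # 1234 (dataset size)
print(resolve_nfit(20000, range(1234)))  # 20000 (explicit value wins)
```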