Datasets (#18)

sarahbeenie · sarah.mubeen · web-flow · commit d6772308e4a5 · 2020-04-22T17:01:05.000+02:00
* update example datasets

* update readme

* update docs

* update example scripts

* update constants

* refactor

* try diffuse

* cleaning

* remove unused nodetype

* cleaning

* cleaning

* cleaning

* flake8 fixes

* more flake8 fixes

* try docs fix

Co-authored-by: sarah.mubeen <sarah.mubeen@scai.fraunhofer.de>
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -56,8 +56,8 @@
 # Example configuration for intersphinx: refer to the Python standard library.
 intersphinx_mapping = {
     'python': ('https://docs.python.org/3', None),
-    'networkx': ('https://networkx.github.io/', None),
-    'sqlalchemy': ('https://docs.sqlalchemy.org/en/latest', None),
+    'networkx': ('https://networkx.github.io/documentation/stable', None),
+    'sqlalchemy': ('https://docs.sqlalchemy.org/en/13/', None),
     'pybel': ('https://pybel.readthedocs.io/en/latest/', None),
 }
 
diff --git a/docs/source/diffusion.rst b/docs/source/diffusion.rst
@@ -25,7 +25,7 @@ Methods without statistical normalisation
   a graph kernel, see :doc:`kernels <kernels>`. These scores treat negative and unlabelled nodes equivalently.
 
 - **ml**: Same as raw, but negative nodes introduce a negative unit of flow. Therefore not equivalent to unlabelled
-  nodes. [2]_
+  nodes [2]_.
 
 - **gl**: Same as ml, but the unlabelled nodes are assigned a (generally non-null) bias term based on the total number
   of positives, negatives and unlabelled nodes [3]_.
diff --git a/docs/source/intro.rst b/docs/source/intro.rst
@@ -2,6 +2,7 @@ First Steps
 ===========
 The first step before running diffusion algorithms on your network using DiffuPy is to learn which graph and data
 formats are supported. Next, you can find samples of input datasets and networks to run diffusion methods over.
+
 Input Data
 ----------
 
@@ -10,9 +11,8 @@ You can submit your dataset in any of the following formats:
 - CSV (.csv)
 - TSV (.tsv)
 
-Please ensure that the dataset has a column 'Node' containing node IDs. If you only provide the node IDs, you can
-also include a column in your dataset 'NodeType' indicating the entity type for each node. You can also optionally add
-the following columns to your dataset:
+Please ensure that the dataset has at least a column 'Node' containing node IDs. You can also optionally add the
+following columns to your dataset:
 
 - LogFC [*]_
 - p-value
@@ -28,20 +28,19 @@ DiffuPath accepts several input formats which can be codified in different ways.
 `diffusion scores <https://github.com/multipaths/DiffuPy/blob/master/docs/source/diffusion.rst>`_ summary for more
 details.
 
-1. You can provide a dataset with a column 'Node' containing node IDs along with a column 'NodeType' indicating the
-entity type.
-
-+------------+--------------+
-|     Node   |   NodeType   |
-+============+==============+
-|      A     |     Gene     |
-+------------+--------------+
-|      B     |     Gene     |
-+------------+--------------+
-|      C     |  Metabolite  |
-+------------+--------------+
-|      D     |    Gene      |
-+------------+--------------+
+1. You can provide a dataset with a column 'Node' containing node IDs.
+
++------------+
+|     Node   |
++============+
+|      A     |
++------------+
+|      B     |
++------------+
+|      C     |
++------------+
+|      D     |
++------------+
 
 2. You can also choose to provide a dataset with a column 'Node' containing node IDs as well as a column 'logFC' with
 their abs(LogFC).
diff --git a/examples/README.rst b/examples/README.rst
@@ -8,9 +8,8 @@ You can submit your dataset in any of the following formats:
 - CSV (.csv)
 - TSV (.tsv)
 
-Please ensure that the dataset has a column 'Node' containing node IDs. If you only provide the node IDs, you can
-also include a column in your dataset 'NodeType' indicating the entity type for each node. You can also optionally add
-the following columns to your dataset:
+Please ensure that the dataset has at least a column 'Node' containing node IDs. You can also optionally add the
+following columns to your dataset:
 
 - LogFC [*]_
 - p-value
@@ -26,20 +25,19 @@ DiffuPath accepts several input formats which can be codified in different ways.
 `diffusion scores <https://github.com/multipaths/DiffuPy/blob/master/docs/source/diffusion.rst>`_ summary for more
 details.
 
-1. You can provide a dataset with a column 'Node' containing node IDs along with a column 'NodeType' indicating the
-entity type.
-
-+------------+--------------+
-|     Node   |   NodeType   |
-+============+==============+
-|      A     |     Gene     |
-+------------+--------------+
-|      B     |     Gene     |
-+------------+--------------+
-|      C     |  Metabolite  |
-+------------+--------------+
-|      D     |    Gene      |
-+------------+--------------+
+1. You can provide a dataset with a column 'Node' containing node IDs.
+
++------------+
+|     Node   |
++============+
+|      A     |
++------------+
+|      B     |
++------------+
+|      C     |
++------------+
+|      D     |
++------------+
 
 2. You can also choose to provide a dataset with a column 'Node' containing node IDs as well as a column 'logFC' with
 their | logFC |.
diff --git a/examples/datasets/node.csv b/examples/datasets/node.csv
@@ -0,0 +1,6 @@
+Node
+A
+B
+C
+D
+E
diff --git a/examples/datasets/node_logfc.csv b/examples/datasets/node_logfc.csv
@@ -0,0 +1,6 @@
+Node,LogFC
+A,0.7
+B,1.2
+C,-0.2
+D,-0.4
+E,-2.2
diff --git a/examples/datasets/node_logfc_pval.csv b/examples/datasets/node_logfc_pval.csv
@@ -0,0 +1,6 @@
+Node,LogFC,p-value
+A,0.7,0.2
+B,1.2,0.01
+C,-0.2,0.01
+D,-0.4,0.3
+E,-2.2,0.005
diff --git a/examples/datasets/sample_dataset.csv b/examples/datasets/sample_dataset.csv
@@ -0,0 +1,8 @@
+Node
+A
+B
+C
+D
+E
+F
+G
diff --git a/examples/datasets/sample_dataset_with_ids.csv b/examples/datasets/sample_dataset_with_ids.csv
diff --git a/examples/datasets/sample_dataset_with_logfc.csv b/examples/datasets/sample_dataset_with_logfc.csv
diff --git a/examples/datasets/sample_dataset_with_logfc_and_p_value.csv b/examples/datasets/sample_dataset_with_logfc_and_p_value.csv
diff --git a/examples/scripts/example.sh b/examples/scripts/example.sh
@@ -1 +1,18 @@
-#!/usr/bin/env bash
+#!/usr/bin/env bash
+
+# network=network
+# data=data file
+# output=output file results
+# method=diffusion method (e.g., gm, ml, raw and z)
+# binarize=binarize labels (default True)
+# absolute_value=apply absolute value to logFC in data (default True)
+# threshold=threshold to apply if logFC in data
+# p_value=statistical significance (default 0.05)
+
+
+
+python3 -m diffupath diffusion diffuse \
+  --network ../networks/sample_network_2.csv \
+  --data ../datasets/sample_dataset.csv \
+  --method raw
+
diff --git a/examples/scripts/example2.sh b/examples/scripts/example2.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+# network=network
+# data=data file
+# output=output file results
+# method=diffusion method (e.g., gm, ml, raw and z)
+# binarize=binarize labels (default True)
+# absolute_value=apply absolute value to logFC in data (default True)
+# threshold=threshold to apply if logFC in data
+# p_value=statistical significance (default 0.05)
+
+python3 -m diffupath diffusion diffuse \
+  --network ../networks/sample_network_2.csv \
+  --data ../datasets/sample_dataset_with_logfc.csv \
+  --method raw \
+  --binarize True \
+  --absolute_value True \
+  --threshold 0.5 \
+  --p_value 0.05
+
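Note on the example flags: the script comments above describe `binarize`, `absolute_value`, `threshold` and `p_value`, but the `process_input.py` changes are not shown in this diff. The following is a minimal sketch of one plausible reading of those options, applied to the rows of `node_logfc_pval.csv`. The helper `codify_rows` and its exact semantics (a node is labelled positive only when its p-value is at or below the cutoff and its LogFC passes the threshold) are assumptions for illustration, not DiffuPy's actual implementation.

```python
import csv
import io

# Contents of examples/datasets/node_logfc_pval.csv as added in this commit.
CSV_TEXT = """Node,LogFC,p-value
A,0.7,0.2
B,1.2,0.01
C,-0.2,0.01
D,-0.4,0.3
E,-2.2,0.005
"""


def codify_rows(text, binarize=True, absolute_value=True, threshold=0.5, p_value=0.05):
    """Hypothetical helper: turn dataset rows into a {node: label} dict.

    Assumed semantics (not taken from DiffuPy): a node passes when it is
    statistically significant and its (optionally absolute) LogFC reaches
    the threshold.
    """
    labels = {}
    for row in csv.DictReader(io.StringIO(text)):
        log_fc = float(row['LogFC'])
        significant = float(row['p-value']) <= p_value
        magnitude = abs(log_fc) if absolute_value else log_fc
        passes = significant and magnitude >= threshold
        if binarize:
            # Binary labels: 1 for passing nodes, 0 otherwise.
            labels[row['Node']] = 1 if passes else 0
        else:
            # Quantitative labels: keep the value for passing nodes.
            labels[row['Node']] = magnitude if passes else 0.0
    return labels


labels = codify_rows(CSV_TEXT)
print(labels)
```

Under these assumptions only B and E end up labelled positive. With `absolute_value=False` the strongly down-regulated node E would no longer pass a positive threshold, which is one reason the flag matters for signed LogFC data.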
diff --git a/src/diffupy/cli.py b/src/diffupy/cli.py
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 
-"""Command line interface for diffuPy."""
+"""Command line interface for DiffuPy."""
 
 import json
 import logging
@@ -152,17 +152,23 @@ def diffuse(
 
     click.secho(f'Codifying data from {data}.')
 
-    input_scores_dict = process_input(data, method, binarize, absolute_value, p_value, threshold)
+    label_dict = process_input(data, method, binarize, absolute_value, p_value, threshold)
 
     click.secho(f'Running the diffusion algorithm.')
 
     results = run_diffusion(
-        input_scores_dict,
+        label_dict,
         method,
         graph,
     )
 
-    json.dump(results, output, indent=2)
+    # results = run_diffusion(
+    #     label_dict,
+    #     method,
+    #     graph,
+    # )
+
+    # json.dump(results, output, indent=2)
 
     click.secho(f'Finished!')
 
diff --git a/src/diffupy/constants.py b/src/diffupy/constants.py
@@ -109,8 +109,6 @@ def ensure_output_dirs():
 
 #: Node name
 NODE = 'Node'
-#: Node type
-NODE_TYPE = 'NodeType'
 #: Log2 fold change (logFC)
 LOG_FC = 'LogFC'
 #: Statistical significance (p-value)
diff --git a/src/diffupy/diffuse.py b/src/diffupy/diffuse.py
@@ -4,10 +4,12 @@
 
 import copy
 import logging
+from typing import Dict
 
 import networkx as nx
 import numpy as np
 
+from .constants import *
 from .diffuse_raw import diffuse_raw
 from .matrix import Matrix
 from .utils import get_label_list_graph
@@ -19,6 +21,27 @@
     'diffuse',
 ]
 
+"""Map nodes from input to network"""
+
+
+def run_diffusion_algorithm(
+        input_labels: Dict[str, int],
+        method: str,
+        network: nx.Graph
+):
+    """Run diffusion algorithm."""
+    # List of nodes in network
+    network_nodes = list(network.nodes)
+    logging.debug('input_labels: %s', input_labels)
+
+    logging.debug('network_nodes: %s', network_nodes)
+    # Map nodes from input dataset to nodes in network to get a set of labelled and unlabelled nodes
+    label_vector = [input_labels[node] if node in input_labels else None for node in network_nodes]
+    logging.debug('label_vector: %s', label_vector)
+
+    if method == RAW:
+        return diffuse_raw(network, label_vector)
+
 
 def diffuse(
         input_scores,
@@ -29,15 +52,15 @@ def diffuse(
     """Run diffusion on a network given an input and a diffusion method.
 
     :param input_scores: score collection, supplied as n-dimensional array. Could be 1-dimensional (List) or n-dimensional (Matrix).
-    :param method: Elected method ["raw", "ml", "gm", "ber_s", "ber_p", "mc", "z"]
+    :param method: Selected method ["raw", "ml", "gm", "ber_s", "ber_p", "mc", "z"]
     :param graph: A network as a graph. It could be optional if a Kernel is provided
     :param kwargs: Optional arguments:
-                    - k: a  kernel [matrix] steaming from a graph, thus sparing the graph transformation process
+                    - k: a  kernel [matrix] stemming from a graph, thus sparing the graph transformation process
                     - Other arguments which would differ depending on the chosen method
     :return: The diffused scores within the matrix transformation of the network, with the diffusion operation
              [k x input_vector] performed
     """
-    # Sanity checks
+    # Sanity checks; create copy of input labels
     scores = copy.copy(input_scores)
 
     _validate_scores(scores)
@@ -50,13 +73,13 @@ def diffuse(
             raise ValueError("Neither a graph 'graph' nor a kernel 'k' has been provided.")
         format_network = 'kernel'
 
-    if method == 'raw':
+    if method == RAW:
         return diffuse_raw(graph, scores, **kwargs)
 
-    elif method == 'z':
+    elif method == Z:
         return diffuse_raw(graph, scores, z=True, **kwargs)
 
-    elif method == 'ml':
+    elif method == ML:
         for score, i, j in scores.__iter__(get_labels=False, get_indices=True):
             if score not in [-1, 0, 1]:
                 raise ValueError("Input scores must be binary.")
@@ -65,7 +88,7 @@ def diffuse(
 
         return diffuse_raw(graph, scores, **kwargs)
 
-    elif method == 'gm':
+    elif method == GM:
         for score, i, j in scores.__iter__(get_labels=False, get_indices=True):
             if score not in [0, 1]:
                 raise ValueError("Input scores must be binary.")
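Note on the label mapping: `run_diffusion_algorithm` above aligns the input labels with the network's node order and marks nodes missing from the input as `None`. The sketch below reproduces that mapping without networkx and then shows how such a vector might be codified for the `raw` and `ml` methods, following the descriptions in `diffusion.rst` (raw treats negative and unlabelled nodes alike; ml gives negatives a negative unit of flow). The helper names and the convention that an input label of 1 means positive and 0 means negative are assumptions for illustration, not part of this commit.

```python
def build_label_vector(input_labels, network_nodes):
    """Align input labels with the network node order; None marks unlabelled nodes."""
    return [input_labels[node] if node in input_labels else None for node in network_nodes]


def codify(label_vector, method):
    """Hypothetical helper: codify a label vector for the 'raw' or 'ml' method.

    Assumes input label 1 = positive and 0 = negative.
    """
    codified = []
    for label in label_vector:
        if label is None:
            codified.append(0)   # unlabelled nodes contribute no flow
        elif label > 0:
            codified.append(1)   # positive nodes introduce a unit of flow
        elif method == 'ml':
            codified.append(-1)  # ml: negatives introduce a negative unit of flow
        else:
            codified.append(0)   # raw: negatives are treated like unlabelled nodes
    return codified


network_nodes = ['A', 'B', 'C', 'F']
input_labels = {'A': 1, 'B': 0, 'E': 1}  # 'E' is not in the network and is dropped

label_vector = build_label_vector(input_labels, network_nodes)
print(label_vector)
print(codify(label_vector, 'raw'))
print(codify(label_vector, 'ml'))
```

Nodes present in the input but absent from the network (here `E`) silently drop out of the vector, so it may be worth reporting how many input nodes were actually mapped.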
diff --git a/src/diffupy/diffuse_raw.py b/src/diffupy/diffuse_raw.py
@@ -29,10 +29,10 @@ def _calculate_scores(
 
     :param col_ind: index of the column to operate
     :param scores: array of score matrices
-    :param raw_diff_scores: precomputatated raw diffusion scores
-    :param const_mean: precalculated constant mean over columns
-    :param const_var: precalculated constant variance over columns
-    :return:  Calculated column z-score
+    :param raw_diff_scores: pre-computed raw diffusion scores
+    :param const_mean: pre-calculated constant mean over columns
+    :param const_var: pre-calculated constant variance over columns
+    :return: Calculated column z-score
     """
     col_in = scores[:, col_ind]
     col_raw = raw_diff_scores[:, col_ind]
@@ -55,7 +55,7 @@ def diffuse_raw(
     z: bool = False,
     k: Matrix = None,
 ) -> Matrix:
-    """Compute the score diffusion procedure, given an initial state as a set of scores and a network where diffuse it.
+    """Compute the score diffusion procedure, given an initial state as a set of scores and a network to diffuse over.
 
     :param graph: background network
     :param scores: array of score matrices. For a single path with a single background, supply a list with a vector col
@@ -64,7 +64,7 @@ def diffuse_raw(
     :return: A list of scores, with the same length and dimensions as scores
     """
     # Sanity checks
-    _validate_scores(scores)
+    # _validate_scores(scores)
     logging.info('Scores validated.')
 
     # Get the Kernel
@@ -119,7 +119,7 @@ def diffuse_raw(
          for row in kernel[:, :n] ** 2]
     )
 
-    logging.info('Rowmeans and rowmeans2 computatated.')
+    logging.info('Rowmeans and rowmeans2 computed.')
 
     # Constant terms over columns
     const_mean = row_sums / n
diff --git a/src/diffupy/kernels.py b/src/diffupy/kernels.py
@@ -18,14 +18,14 @@
 
 __all__ = [
     'diffusion_kernel',
-    'commute_time_kernel',
+    'compute_time_kernel',
     'inverse_cosine_kernel',
     'regularised_laplacian_kernel',
     'p_step_kernel',
 ]
 
 
-def commute_time_kernel(graph: nx.Graph, normalized: bool = False) -> Matrix:
+def compute_time_kernel(graph: nx.Graph, normalized: bool = False) -> Matrix:
     """Compute the commute-time kernel, which is the expected time of going back and forth between a couple of nodes.
 
     If the network is connected, then the commuted time kernel will be totally dense, therefore reflecting global
@@ -97,7 +97,7 @@ def p_step_kernel(graph: nx.Graph, a: int = 2, p: int = 5) -> Matrix:
     :param graph: A graph
     :param a: regularising summed to the spectrum. Spectrum of the normalised Laplacian matrix.
     :param p: p-step kernels can be cheaper to compute and have been successful in biological tasks.
-    :return: Laplacian repr'esentation of the graph.
+    :return: Laplacian representation of the graph.
     """
     laplacian = LaplacianMatrix(graph, normalized=True)
     laplacian.mat = -laplacian.mat
diff --git a/src/diffupy/matrix.py b/src/diffupy/matrix.py
diff --git a/src/diffupy/process_input.py b/src/diffupy/process_input.py
diff --git a/src/diffupy/validate_input.py b/src/diffupy/validate_input.py