proxi.algorithms package

Submodules

proxi.algorithms.knng module

Wrapper for the scikit-learn kNN graph (KNNG) construction method (see http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html).

proxi.algorithms.knng.get_knn_graph(data, k, metric='correlation', p=2, metric_params=None, OTU_column=None, is_undirected=True, is_normalize_samples=True, is_standardize_otus=True)

Compute the (directed or undirected) k-nearest-neighbors graph for the points in the input data. The kNN graph is constructed with the scikit-learn function sklearn.neighbors.kneighbors_graph.

Parameters:
  • data (DataFrame) – Input data as pandas DataFrame object. Each row is an OTU and each column is a sample
  • k (int) – Number of neighbors for each node
  • metric (string or callable, default 'correlation') –

    metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

    If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

    Valid values for metric are:

    • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
    • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

    See the documentation for scipy.spatial.distance for details on these metrics.

    • any callable function (e.g., distance functions in proxi.utils.distance module)
  • p (int, optional, default = 2) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
  • metric_params (dict, optional, default = None) – Additional keyword arguments for the scipy metric function.
  • OTU_column (string, optional, default = None) – Name of the DataFrame column that contains the OTU IDs (i.e., node IDs). If OTU_column is None, the first column in the DataFrame is treated as the OTU_column.
  • is_undirected (bool, optional, default = True) – Whether to compute an undirected (True) or directed (False) graph. Default is undirected.
  • is_weighted (bool, optional, default = False) – Whether to compute a weighted graph. Default is unweighted.
  • is_normalize_samples (bool, optional, default = True) – Whether to normalize each sample (i.e., column in the input data).
  • is_standardize_otus (bool, optional, default = True) – Whether to standardize each OTU (i.e., row in the input data).
Returns:
  • nodes_id (array_like) – list of nodes.
  • _A (scipy sparse matrix) – Adjacency matrix of the constructed graph.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from proxi.algorithms.knng import get_knn_graph
>>> df = pd.read_csv(in_file)
>>> # construct the kNN graph
>>> nodes, a = get_knn_graph(df, 5, metric='braycurtis')
>>> # Note that a is a sparse matrix.
>>> # Use 'todense' to convert it into the numpy matrix format required for NetworkX
>>> a = a.todense()
>>> print('Shape of adjacency matrix is {}'.format(np.shape(a)))
>>> # save the constructed graph in graphml format
>>> save_graph(a, nodes, out_file)
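
For orientation, the underlying scikit-learn call can be sketched directly. This is a minimal sketch under stated assumptions, not proxi's implementation: the helper name knn_adjacency is hypothetical, the symmetrization rule (keep an edge if either endpoint lists the other as a neighbor) is an assumption, and the preprocessing controlled by is_normalize_samples and is_standardize_otus is omitted.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph


def knn_adjacency(X, k, metric="correlation", undirected=True):
    """Hypothetical helper: kNN adjacency for the rows of X (one row per node)."""
    # Directed graph: entry (i, j) is 1 if row j is among the k nearest rows of row i.
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity", metric=metric)
    if undirected:
        # Assumed rule: keep an edge if it exists in either direction.
        A = A.maximum(A.T)
    return A


rng = np.random.default_rng(0)
X = rng.random((8, 12))  # 8 toy OTUs (rows) measured in 12 samples (columns)

A_dir = knn_adjacency(X, k=3, undirected=False)
A_und = knn_adjacency(X, k=3, undirected=True)
print(A_dir.shape, A_dir.nnz, A_und.nnz)
```

In the directed graph every node has exactly k outgoing edges; after symmetrization a node can have more than k neighbors, because it may be chosen as a neighbor by nodes it did not choose itself.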

proxi.algorithms.pairwise module

Construct a graph using a pairwise similarity metric (e.g., the Pearson correlation coefficient, PCC).

proxi.algorithms.pairwise.create_graph_using_pairwise_metric(data, similarity_metric, threshold, is_symmetric=True, OTU_column=None, is_normalize_samples=True, is_standardize_otus=True, is_weighted=False)

Construct a graph using a pairwise similarity metric.

Parameters:
  • data (DataFrame) – Input data as pandas DataFrame object. Each row is an OTU and each column is a sample.
  • similarity_metric (callable similarity function) – A callable function for computing the similarity between two vectors.
  • threshold (float) – Minimum similarity score between two vectors required for having an edge between their corresponding nodes.
  • is_symmetric (bool, optional, default=True) – Set this parameter to False if the similarity function is not symmetric.
  • OTU_column (string, optional, default = None) – Name of the DataFrame column that contains the OTU IDs (i.e., node IDs). If OTU_column is None, the first column in the DataFrame is treated as the OTU_column.
  • is_normalize_samples (bool, optional, default = True) – Whether to normalize each sample (i.e., column in the input data).
  • is_standardize_otus (bool, optional, default = True) – Whether to standardize each OTU (i.e., row in the input data).
  • is_weighted (bool, optional, default = False) – Whether to compute a weighted graph. Default is unweighted.
Returns:
  • nodes_IDs (array_like) – list of nodes.
  • A (array_like, Shape(N,N)) – Adjacency matrix of the constructed graph.
  • W (array_like, Shape(N,N)) – Weight matrix.

Examples

>>> import pandas as pd
>>> from proxi.algorithms.pairwise import create_graph_using_pairwise_metric
>>> df = pd.read_csv(in_file)
>>> nodes, a, weights = create_graph_using_pairwise_metric(df, similarity_metric=abs_pcc,
...                            threshold=0.5, is_weighted=True)
>>> # save unweighted graph in graphml format
>>> save_graph(a, nodes, out_file)
>>> # save weighted graph in graphml format
>>> save_weighted_graph(a, nodes, weights, out_file2)
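
The thresholding idea itself can be illustrated with plain NumPy. The sketch below is an assumption-laden illustration rather than the library's code: abs_pearson is a stand-in for a similarity function such as abs_pcc, threshold_graph is a hypothetical helper, the similarity is assumed symmetric, and W here simply holds every pairwise score.

```python
import numpy as np


def abs_pearson(u, v):
    """Stand-in similarity: absolute Pearson correlation of two vectors."""
    return abs(np.corrcoef(u, v)[0, 1])


def threshold_graph(X, similarity, threshold):
    """Hypothetical helper: 0/1 adjacency A and score matrix W for the rows of X."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # A symmetric similarity is assumed, so each pair is scored once.
            W[i, j] = W[j, i] = similarity(X[i], X[j])
    # An edge is kept when the similarity reaches the threshold.
    A = (W >= threshold).astype(int)
    return A, W


X = np.array([
    [1.0, 2.0, 3.0, 4.0],  # OTU 0
    [2.0, 4.0, 6.0, 8.1],  # OTU 1, almost proportional to OTU 0
    [4.0, 1.0, 3.0, 2.0],  # OTU 2, weakly anti-correlated with OTU 0
])
A, W = threshold_graph(X, abs_pearson, threshold=0.5)
print(A)
```

With a threshold of 0.5, only OTUs 0 and 1 are connected: their absolute correlation is close to 1, while OTU 2 scores about 0.4 against both.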

proxi.algorithms.pknng module

Implementation of Perturbed kNN Graph (PKNNG) [1].

1- Generate T bootstrapped kNN graphs where at each iteration a new dataset is generated by resampling with replacement from the original dataset.

2- Aggregate the T graphs into a single one by keeping edges that appear in more than cT of the bootstrapped graphs with the sample weights for those edges.
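
The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the module's implementation: the function name pknng_sketch is hypothetical, the bootstrap is assumed to resample the samples (columns) so that the nodes (rows) stay fixed, and each edge weight is assumed to be the fraction of bootstrapped graphs that contain the edge.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph


def pknng_sketch(X, k, c=0.5, T=100, metric="euclidean", seed=0):
    """Hypothetical helper: aggregate T bootstrapped kNN graphs over the rows of X."""
    rng = np.random.default_rng(seed)
    n_nodes, n_samples = X.shape
    counts = np.zeros((n_nodes, n_nodes))
    for _ in range(T):
        # Step 1: resample the columns with replacement and build a kNN graph.
        cols = rng.integers(0, n_samples, size=n_samples)
        A = kneighbors_graph(X[:, cols], n_neighbors=k,
                             mode="connectivity", metric=metric)
        counts += A.toarray()
    # Step 2: keep the edges that appear in more than c*T bootstrapped graphs.
    A_agg = (counts > c * T).astype(int)
    freq = counts / T  # assumed edge weight: fraction of graphs containing the edge
    return A_agg, freq


data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(10, 30))  # 10 toy OTUs (rows), 30 samples (columns)
A_agg, freq = pknng_sketch(X, k=3, c=0.5, T=20)
print(A_agg.sum(), freq.max())
```

Raising c keeps only the edges that are stable across resamples, so the aggregated graph becomes sparser; a node can end up with fewer than k neighbors.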

References

[1] Wagaman, A. (2013). Efficient k‐NN graph construction for graphs on variables. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(5), 443-455.

proxi.algorithms.pknng.get_pknn_graph(data, k, c=0.5, T=100, metric='correlation', p=2, metric_params=None, OTU_column=None, random_state=0, is_undirected=True, is_weighted=False, is_normalize_samples=True, is_standardize_otus=True)

Compute the (directed or undirected) perturbed kNN graph (PKNNG) for the points in the input data. Each bootstrapped kNN graph is constructed with the scikit-learn function sklearn.neighbors.kneighbors_graph.

Parameters:
  • data (DataFrame) – Input data as pandas DataFrame object. Each row is an OTU and each column is a sample.
  • k (integer) – Number of neighbors for each node.
  • c (float, optional, default=0.5) – Graph aggregation tuning parameter. An edge is kept if it appears in more than cT of the bootstrapped graphs.
  • T (integer, optional, default=100) – Number of bootstrap iterations.
  • metric (string or callable, default='correlation') –

    metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

    If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

    Valid values for metric are:

    • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
    • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

    See the documentation for scipy.spatial.distance for details on these metrics.

    • any callable function (e.g., distance functions in proxi.utils.distance module)
  • p (int, optional, default = 2) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
  • metric_params (dict, optional, default = None) – Additional keyword arguments for the scipy metric function.
  • OTU_column (string, optional, default = None) – Name of the DataFrame column that contains the OTU IDs (i.e., node IDs). If OTU_column is None, the first column in the DataFrame is treated as the OTU_column.
  • random_state (integer, optional, default=0) – #TODO
  • is_undirected (bool, optional, default = True) – Whether to compute an undirected (True) or directed (False) graph. Default is undirected.
  • is_weighted (bool, optional, default = False) – Whether to compute a weighted graph. Default is unweighted.
  • is_normalize_samples (bool, optional, default = True) – Whether to normalize each sample (i.e., column in the input data).
  • is_standardize_otus (bool, optional, default = True) – Whether to standardize each OTU (i.e., row in the input data).
Returns:
  • nodes_id (array_like) – list of nodes.
  • _A (scipy sparse matrix) – Adjacency matrix of the constructed graph.

Examples

>>> import networkx as nx
>>> import numpy as np
>>> import pandas as pd
>>> from proxi.algorithms.pknng import get_pknn_graph
>>> df = pd.read_csv(in_file)
>>> # construct the perturbed kNN graph
>>> nodes, a = get_pknn_graph(df, 5, metric='braycurtis', T=10, c=0.5, is_weighted=True,
...                            OTU_column='SID')
>>> print('Shape of adjacency matrix is {}'.format(np.shape(a)))
>>> # save the constructed graph in graphml format
>>> save_graph(a, nodes, out_file)
>>> # save the directed graph in graphml format
>>> save_graph(a, nodes, out_file2, create_using=nx.DiGraph())

References

[1] Dong, W., Moses, C., & Li, K. (2011). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web (pp. 577-586). ACM.

proxi.algorithms.rng module

Computes a (weighted) graph of neighbors for each data point. Neighborhoods are restricted to the points at a distance lower than radius. This is simply a wrapper for the scikit-learn radius_neighbors_graph method.

proxi.algorithms.rng.get_rn_graph(data, radius, metric='braycurtis', p=2, metric_params=None, OTU_column=None, is_undirected=True, is_normalize_samples=True, is_standardize_otus=True)

Compute the (directed or undirected) radius-neighbors graph for the points in the input data.

Parameters:
  • data (DataFrame) – Input data as pandas DataFrame object. Each row is an OTU and each column is a sample.
  • radius (float) – Radius of neighborhoods.
  • metric (string or callable, default 'braycurtis') – The distance metric used to calculate the neighbors within a given radius for each sample point. The DistanceMetric class gives a list of available metrics. According to the function signature, the default distance is 'braycurtis'.
  • p (int, optional, default = 2) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
  • metric_params (dict, optional, default = None) – Additional keyword arguments for the scipy metric function.
  • OTU_column (string, optional, default = None) – Name of the DataFrame column that contains the OTU IDs (i.e., node IDs). If OTU_column is None, the first column in the DataFrame is treated as the OTU_column.
  • is_undirected (bool, optional, default = True) – Whether to compute an undirected (True) or directed (False) graph. Default is undirected.
  • is_weighted (bool, optional, default = False) – Whether to compute a weighted graph. Default is unweighted.
  • is_normalize_samples (bool, optional, default = True) – Whether to normalize each sample (i.e., column in the input data).
  • is_standardize_otus (bool, optional, default = True) – Whether to standardize each OTU (i.e., row in the input data).
Returns:
  • nodes_id (array_like) – list of nodes.
  • _A (scipy sparse matrix) – Adjacency matrix of the constructed graph.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from proxi.algorithms.rng import get_rn_graph
>>> df = pd.read_csv(in_file)
>>> # construct the radius-neighbors graph
>>> nodes, a = get_rn_graph(df, 0.3, metric='braycurtis')
>>> # Note that a is a sparse matrix.
>>> # Use 'todense' to convert it into the numpy matrix format required for NetworkX
>>> a = a.todense()
>>> print('Shape of adjacency matrix is {}'.format(np.shape(a)))
>>> # save the constructed graph in graphml format
>>> save_graph(a, nodes, out_file)
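
To show what the radius restriction does, the scikit-learn function can be called directly on a toy input. This is only an illustrative sketch, not proxi's wrapper: the three points and the Euclidean metric are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph

# Three toy points on a line: the first two are 1 apart, the third is far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])

# Connectivity mode: 1 where two distinct points lie within the radius.
A = radius_neighbors_graph(X, radius=1.5, mode="connectivity", metric="euclidean")

# Distance mode: the same edges, weighted by the actual distance.
D = radius_neighbors_graph(X, radius=1.5, mode="distance", metric="euclidean")

print(A.toarray())
print(D.toarray())
```

Unlike a kNN graph, a radius-neighbors graph built with a symmetric metric is symmetric by construction, and isolated nodes (here the third point) are possible when the radius is small.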

Module contents