API

Risk Assessment

SequentialPrivacyFrame

class privlib.riskAssessment.sequentialprivacyframe.SequentialPrivacyFrame(data, user_id='uid', datetime='datetime', order_id='order', sequence_id='sequence', elements='elements', timestamp=False, check_order_date=True)

A SequentialPrivacyFrame object is a pandas.DataFrame that represents sequences. A sequence has at least the following attributes: user_id, datetime, order_id, sequence_id, elements.

Parameters:
data : list or dict or pandas DataFrame

the data that must be embedded into a SequentialPrivacyFrame.

user_id : int or str, optional

the position or the name of the column in data containing the user identifier. The default is constants.UID.

datetime : int or str, optional

the position or the name of the column in data containing the datetime. The default is constants.DATETIME.

order_id : int or str, optional

the position or the name of the column in data containing the order identifier for the sequences. The default is constants.ORDER_ID.

sequence_id : int or str, optional

the position or the name of the column in data containing the sequence identifier. The default is constants.SEQUENCE_ID.

elements : int or str, or list of int or list of str

the positions or the names of the columns in data containing the elements of the sequences. Elements can be represented by any number of attributes that will be grouped together to represent the single element of the sequence. The default is constants.ELEMENTS.

timestamp : boolean, optional

if True, the datetime is a timestamp. The default is False.

check_order_date : boolean, optional

if True, the order of the various elements in the sequences of each user will be checked against the timestamp to ensure consistency. If some ordering attributes were not present in the original data, they will be computed based on what is available in the data. The default is True.

Attributes:
datetime
elements
order
sequence
uid

Methods

from_file
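
A minimal usage sketch (hypothetical toy data; only the documented constructor is used, with column names matching the defaults):

import pandas as pd
from privlib.riskAssessment.sequentialprivacyframe import SequentialPrivacyFrame

# Toy data: two users, each with one sequence of visited shops.
df = pd.DataFrame({
    "uid":      [1, 1, 2, 2],
    "datetime": ["2021-01-01 10:00", "2021-01-01 11:00",
                 "2021-01-02 09:30", "2021-01-02 10:15"],
    "order":    [1, 2, 1, 2],
    "sequence": [1, 1, 1, 1],
    "elements": ["shop_A", "shop_B", "shop_A", "shop_C"],
})

spf = SequentialPrivacyFrame(df)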

Risk Evaluators

class privlib.riskAssessment.riskevaluators.IndividualElementEvaluator(data, attack, knowledge_length, **kwargs)

Class for evaluating risk at the individual level: risk is computed based on the whole data of each individual, i.e., each individual's risk will be equal to the inverse of the number of other individuals in the data that match the background knowledge.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment.

attack : BackgroundKnowledgeAttack

an attack to be simulated. Must be a class implementing the BackgroundKnowledgeAttack abstract class.

knowledge_length : int

the length of the knowledge of the simulated attack, i.e., how many data points are assumed to be in the background knowledge of the adversary.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

References

[TIST2018]

Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December 2017), 27 pages. DOI: https://doi.org/10.1145/3106774

[MOB2018]

Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, Anna Monreale: Analyzing Privacy Risk in Human Mobility Data. STAF Workshops 2018: 114-129

Methods

aggregation_levels()

Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed.

background_knowledge_gen(single_priv_df)

Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.

risk(single_privacy_frame[, complete])

Computes the privacy risk for a single individual.

aggregation_levels()

Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed. For IndividualElementEvaluator, aggregation is done for each individual in the data and for each distinct element belonging to the individual.

Returns:
list

a list with the attributes to be aggregated, should an attack need it. For IndividualElementEvaluator these are user id and the elements of the sequence.

background_knowledge_gen(single_priv_df)

Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of the single individual from which to generate all possible background knowledge instances.

Returns:
cases : iterator

an iterator over all possible combinations of data points, i.e., all possible background knowledge instances.

risk(single_privacy_frame, complete=False)

Computes the privacy risk for a single individual.

Parameters:
single_privacy_frame : SequentialPrivacyFrame

the data of the single individual for whom to compute the privacy risk.

Returns:
privacy_risk : float

the privacy risk for the individual, computed as the inverse of the number of other individuals in the data that match the background knowledge.
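
A hedged sketch of an individual-level assessment, reusing the spf object built above. Only the documented constructor and risk() signature are used; the per-user grouping loop is an assumption (a SequentialPrivacyFrame is a pandas.DataFrame, so groupby is available):

from privlib.riskAssessment.riskevaluators import IndividualElementEvaluator
from privlib.riskAssessment.attacks import ElementsAttack

# Simulate an adversary who knows 2 elements of each individual's data.
evaluator = IndividualElementEvaluator(spf, ElementsAttack, knowledge_length=2)

# Assumed driver loop: compute the risk of each individual separately.
risks = {uid: evaluator.risk(single_df) for uid, single_df in spf.groupby("uid")}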

class privlib.riskAssessment.riskevaluators.IndividualSequenceEvaluator(data, attack, knowledge_length, **kwargs)

Class for evaluating risk at the sequence level: risk is computed based on the different sequences of each individual, i.e., each individual's risk will be equal to the number of sequences in her own data divided by the total number of sequences belonging to other individuals in the data that match the background knowledge.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment.

attack : BackgroundKnowledgeAttack

an attack to be simulated. Must be a class implementing the BackgroundKnowledgeAttack abstract class.

knowledge_length : int

the length of the knowledge of the simulated attack, i.e., how many data points are assumed to be in the background knowledge of the adversary.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

References

[TIST2018]

Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December 2017), 27 pages. DOI: https://doi.org/10.1145/3106774

[MOB2018]

Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, Anna Monreale: Analyzing Privacy Risk in Human Mobility Data. STAF Workshops 2018: 114-129

Methods

aggregation_levels()

Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed.

background_knowledge_gen(single_priv_df)

Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.

risk(single_privacy_frame[, complete])

Computes the privacy risk for a single individual.

aggregation_levels()

Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed. For IndividualSequenceEvaluator, aggregation is done for each individual in the data and for each sequence and distinct element that belong to the individual.

Returns:
list

a list with the attributes to be aggregated, should an attack need it. For IndividualSequenceEvaluator these are user id, sequence id and the elements of the sequence.

background_knowledge_gen(single_priv_df)

Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of the single individual from which to generate all possible background knowledge instances.

Returns:
cases : iterator

an iterator over all possible combinations of data points, i.e., all possible background knowledge instances.

risk(single_privacy_frame, complete=False)

Computes the privacy risk for a single individual.

Parameters:
single_privacy_frame : SequentialPrivacyFrame

the data of the single individual for whom to compute the privacy risk.

Returns:
privacy_risk : float

the privacy risk for the individual, computed as the number of sequences belonging to the individual divided by the number of all sequences in the data that match the background knowledge.
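
The sequence-level evaluator is a drop-in replacement for the element-level one; a sketch under the same assumptions as the previous example:

from privlib.riskAssessment.riskevaluators import IndividualSequenceEvaluator
from privlib.riskAssessment.attacks import SequenceAttack

# Adversary knows 2 elements and the relative order in which they appear.
seq_evaluator = IndividualSequenceEvaluator(spf, SequenceAttack, knowledge_length=2)
seq_risks = {uid: seq_evaluator.risk(g) for uid, g in spf.groupby("uid")}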

Attacks

class privlib.riskAssessment.attacks.ElementsAttack

In an ElementsAttack the adversary knows some elements in the sequences of an individual.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment simulating this attack.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

Methods

matching(case)

Matching function for the attack.

preprocess(**kwargs)

Function to perform preprocessing of the data.

matching(case)

Matching function for the attack. For ElementsAttack, only the elements are used in the matching.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of a single individual.

case : list or numpy array or dict

the background knowledge instance.

Returns:
int

1 if the instance matches the single_priv_df, 0 otherwise.

preprocess(**kwargs)

Function to perform preprocessing of the data.

Parameters:
data : SequentialPrivacyFrame

the entire data to be preprocessed before attack simulation.

**kwargs : mapping, optional

further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels.

class privlib.riskAssessment.attacks.FrequencyAttack

In a FrequencyAttack the adversary knows some elements in the sequences of an individual and the frequency with which they appear.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment simulating this attack.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

Methods

matching(case)

Matching function for the attack.

preprocess(**kwargs)

Function to perform preprocessing of the data.

matching(case)

Matching function for the attack. For FrequencyAttack, elements and their frequency are used in the matching.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of a single individual.

case : list or numpy array or dict

the background knowledge instance.

Returns:
int

1 if the instance matches the single_priv_df, 0 otherwise.

preprocess(**kwargs)

Function to perform preprocessing of the data.

Parameters:
data : SequentialPrivacyFrame

the entire data to be preprocessed before attack simulation.

**kwargs : mapping, optional

further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels.

class privlib.riskAssessment.attacks.ProbabilityAttack

In a ProbabilityAttack the adversary knows some elements in the sequences of an individual and the probability with which they appear.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment simulating this attack.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

Methods

matching(case)

Matching function for the attack.

preprocess(**kwargs)

Function to perform preprocessing of the data.

matching(case)

Matching function for the attack. For ProbabilityAttack, elements and their probability are used in the matching.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of a single individual.

case : list or numpy array or dict

the background knowledge instance.

Returns:
int

1 if the instance matches the single_priv_df, 0 otherwise.

preprocess(**kwargs)

Function to perform preprocessing of the data.

Parameters:
data : SequentialPrivacyFrame

the entire data to be preprocessed before attack simulation.

**kwargs : mapping, optional

further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels.

class privlib.riskAssessment.attacks.ProportionAttack

In a ProportionAttack the adversary knows some elements in the sequences of an individual and the proportion with which they appear w.r.t. the most frequent elements in the sequences.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment simulating this attack.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

Methods

matching(case)

Matching function for the attack.

preprocess(**kwargs)

Function to perform preprocessing of the data.

matching(case)

Matching function for the attack. For ProportionAttack, elements and their proportion w.r.t. the most frequent element are used in the matching.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of a single individual.

case : list or numpy array or dict

the background knowledge instance.

Returns:
int

1 if the instance matches the single_priv_df, 0 otherwise.

preprocess(**kwargs)

Function to perform preprocessing of the data.

Parameters:
data : SequentialPrivacyFrame

the entire data to be preprocessed before attack simulation.

**kwargs : mapping, optional

further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels.

class privlib.riskAssessment.attacks.SequenceAttack

In a SequenceAttack the adversary knows some elements in the sequences of an individual and the order in which they appear.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment simulating this attack.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

Methods

matching(case)

Matching function for the attack.

preprocess(**kwargs)

Function to perform preprocessing of the data.

matching(case)

Matching function for the attack. For SequenceAttack, elements and their relative order are used in the matching.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of a single individual.

case : list or numpy array or dict

the background knowledge instance.

Returns:
int

1 if the instance matches the single_priv_df, 0 otherwise.

preprocess(**kwargs)

Function to perform preprocessing of the data.

Parameters:
data : SequentialPrivacyFrame

the entire data to be preprocessed before attack simulation.

**kwargs : mapping, optional

further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels.

class privlib.riskAssessment.attacks.TimeAttack

In a TimeAttack the adversary knows some elements in the sequences of an individual and the datetime at which they appear.

Parameters:
data : SequentialPrivacyFrame

the data on which to perform privacy risk assessment simulating this attack.

**kwargs : mapping, optional

a dictionary of keyword arguments passed into the preprocessing of attack.

Methods

matching(case)

Matching function for the attack.

preprocess(**kwargs)

Function to perform preprocessing of the data.

matching(case)

Matching function for the attack. For TimeAttack, elements and their datetime are used in the matching.

Parameters:
single_priv_df : SequentialPrivacyFrame

the data of a single individual.

case : list or numpy array or dict

the background knowledge instance.

Returns:
int

1 if the instance matches the single_priv_df, 0 otherwise.

preprocess(**kwargs)

Function to perform preprocessing of the data.

Parameters:
data : SequentialPrivacyFrame

the entire data to be preprocessed before attack simulation.

**kwargs : mapping, optional

further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels.
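
All six attacks expose the same matching/preprocess interface, so they can be swapped freely into an evaluator. A hypothetical comparison loop (driver code, not part of the API):

from privlib.riskAssessment import attacks
from privlib.riskAssessment.riskevaluators import IndividualSequenceEvaluator

for attack in (attacks.ElementsAttack, attacks.FrequencyAttack,
               attacks.ProbabilityAttack, attacks.ProportionAttack,
               attacks.SequenceAttack, attacks.TimeAttack):
    evaluator = IndividualSequenceEvaluator(spf, attack, knowledge_length=2)
    risks = {uid: evaluator.risk(g) for uid, g in spf.groupby("uid")}
    print(attack.__name__, max(risks.values()))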

Discrimination Discovery

Discrimination discovery

dd Discrimination Discovery version 1.0

@author: Salvatore Ruggieri

class privlib.discriminationDiscovery.discrimination_discovery.tDBIndex(tDB)

Loads the transactions.

Methods

cover

supp

Anonymization

Algorithms

class privlib.anonymization.src.algorithms.algorithm.Algorithm

Abstract class that represents a clustering algorithm. Defines a series of functions necessary in all clustering algorithms. Classes implementing a clustering algorithm must extend this class.

Methods

calculate_centroid(records, **kwargs)

Function that calculates the centroid of a list of records.

create_clusters(records, k)

Function to perform the clustering of the records given as parameter. Abstract method; all clustering algorithms must implement it.

static calculate_centroid(records, **kwargs)

Function that calculates the centroid of a list of records. The centroid is formed as the centroid of each attribute. Each attribute type value implements its own centroid calculation (see Value).

Parameters:
records : list of Record

the list of records to calculate the centroid.

**kwargs : optional

Additional arguments that the specific attribute type value may need to calculate the centroid.

Returns:
Record

A record that is the centroid of the list of records.

See also

Record
Value

abstract static create_clusters(records, k)

Function to perform the clustering of the records given as parameter. Abstract method; all clustering algorithms must implement it.

Parameters:
records : list of Record

the list of records to perform the clustering.

k : integer

The minimum number of records in each cluster.

Returns:
list

a list where each item is a list of Record corresponding to a cluster of size >= k.

See also

Record
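
A hedged sketch of a custom clustering algorithm (a naive greedy grouping, purely illustrative; only the documented create_clusters contract is assumed):

from privlib.anonymization.src.algorithms.algorithm import Algorithm

class Greedy_clustering(Algorithm):
    """Toy algorithm: group consecutive records into clusters of at least k."""

    @staticmethod
    def create_clusters(records, k):
        clusters = [records[i:i + k] for i in range(0, len(records), k)]
        # Every cluster must hold >= k records, so merge a short tail cluster.
        if len(clusters) > 1 and len(clusters[-1]) < k:
            clusters[-2].extend(clusters.pop())
        return clusters
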
class privlib.anonymization.src.algorithms.anonymization_scheme.Anonymization_scheme(original_dataset)

Abstract class that represents the anonymization scheme. Defines a series of functions and attributes necessary in all anonymization methods. Classes implementing an anonymization method must extend this class. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”.)

References

[1]

Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5

[4]

Josep Domingo-Ferrer and Vicenç Torra, “Disclosure risk assessment in statistical data protection”, Journal of Computational and Applied Mathematics, Vol. 164, pp. 285-293, Mar 2004. DOI: https://doi.org/10.1016/S0377-0427(03)00643-5

Methods

anonymized_dataset_to_SPF()

Function Called to convert the anonymized dataset to a SequentialPrivacyFrame.

anonymized_dataset_to_dataframe()

Function Called to convert the anonymized dataset to a pandas dataframe.

calculate_anonymization(algorithm)

Function to perform the anonymization of the dataset given in the constructor Abstract method, all anonymization methods must implement it.

calculate_fast_record_linkage(...[, window_size])

Function to Calculates the disclosure risk of the anonymized data set by comparing it with the original one.

calculate_information_loss(original_dataset, ...)

Function to perform the clustering of the records given as parameter Abstract method, all clustering algorithms must implement it.

calculate_record_linkage(original_dataset, ...)

Function to Calculates the disclosure risk of the anonymized data set by comparing it with the original one.

save_anonymized_dataset(path)

Function Called to save the anonymized dataset.

suppress_identifiers()

Function that removes the identifiers attribute values from the data set.

list_to_string

anonymized_dataset_to_SPF()

Function called to convert the anonymized dataset to a SequentialPrivacyFrame.

Returns:
SequentialPrivacyFrame

The SequentialPrivacyFrame data set.

anonymized_dataset_to_dataframe()

Function called to convert the anonymized dataset to a pandas dataframe.

Returns:
DataFrame

The pandas dataframe.

abstract calculate_anonymization(algorithm)

Function to perform the anonymization of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.

Parameters:
algorithm : Algorithm

the clustering algorithm used to group records during the anonymization.

See also

Algorithm

static calculate_fast_record_linkage(original_dataset, anonymized_dataset, window_size=None)

Function to calculate the disclosure risk of the anonymized data set by comparing it with the original one. This is a fast but less accurate version of the record linkage calculation.

Parameters:
original_dataset : Dataset

The original data set.

anonymized_dataset : Dataset

The anonymized version of the original dataset.

window_size : int, optional

The desired size of the window; the greater the window, the more accurate but slower the calculation. If it is omitted, 1% of the data set is taken.

Returns:
Disclosure_risk_result

The disclosure risk.

See also

Disclosure_risk_result

static calculate_information_loss(original_dataset, anonymized_dataset)

Function to calculate the information loss of the anonymized data set by comparing it with the original one.

Parameters:
original_dataset : Dataset

The original data set.

anonymized_dataset : Dataset

The anonymized version of the original dataset.

Returns:
Information_loss_result

Information loss statistics.

See also

Record
Information_loss_result

static calculate_record_linkage(original_dataset, anonymized_dataset)

Function to calculate the disclosure risk of the anonymized data set by comparing it with the original one.

Parameters:
original_dataset : Dataset

The original data set.

anonymized_dataset : Dataset

The anonymized version of the original dataset.

Returns:
Disclosure_risk_result

The disclosure risk.

See also

Disclosure_risk_result

save_anonymized_dataset(path)

Function called to save the anonymized dataset.

Parameters:
path : str

desired path to save the anonymized dataset.

suppress_identifiers()

Function that removes the identifiers attribute values from the data set.
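
A minimal sketch of the evaluation utilities, assuming two Dataset objects named original and anonymized are at hand (how a concrete scheme exposes its anonymized Dataset is not shown in this excerpt):

from privlib.anonymization.src.algorithms.anonymization_scheme import Anonymization_scheme

il = Anonymization_scheme.calculate_information_loss(original, anonymized)
il.description()  # prints the Information_loss_result statistics

dr = Anonymization_scheme.calculate_record_linkage(original, anonymized)
dr.description()  # prints the Disclosure_risk_result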

class privlib.anonymization.src.algorithms.differential_privacy.Differential_privacy(original_dataset, k, epsilon)

Class that implements differential privacy via individual-ranking microaggregation-based perturbation. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in section 5 of the jupyter notebook test_anonymization.ipynb and the file “test_differential_privacy.py” in the folder “tests”.)

See also

Anonymization_scheme

References

[3]

Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “Enhancing data utility in differential privacy via microaggregation-based k-anonymity”, The VLDB Journal, Vol. 23, no. 5, pp. 771-794, Sep 2014. DOI: https://doi.org/10.1007/s00778-014-0351-4

Methods

calculate_anonymization(algorithm)

Function to perform the differential privacy anonymization.

individual_ranking

calculate_anonymization(algorithm)

Function to perform the differential privacy anonymization.

Parameters:
algorithmAlgorithm

The clustering algorithm used during the anonymization.

See also

Algorithm
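
A hedged sketch using the documented constructor (original_dataset, k, epsilon); the dataset object is assumed to have been loaded as shown in the Entities section, and passing the Mdav class itself as the clustering algorithm is an assumption:

from privlib.anonymization.src.algorithms.differential_privacy import Differential_privacy
from privlib.anonymization.src.algorithms.mdav import Mdav

dp = Differential_privacy(dataset, k=5, epsilon=1.0)
dp.calculate_anonymization(Mdav)
dp.save_anonymized_dataset("dataset_dp.csv")  # hypothetical output path
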
class privlib.anonymization.src.algorithms.k_anonymity.K_anonymity(original_dataset, k)

Class that implements k-anonymity anonymization. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in sections 2 and 3 of the jupyter notebook test_anonymization.ipynb and the file “test_k_anonymity.py” in the folder “tests”.)

See also

Anonymization_scheme

References

[1]

Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5

Methods

calculate_anonymization(algorithm)

Function to perform the k-anonymity anonymization.

calculate_anonymization(algorithm)

Function to perform the k-anonymity anonymization.

Parameters:
algorithmAlgorithm

The clustering algorithm used during the anonymization.

See also

Algorithm

class privlib.anonymization.src.algorithms.mdav.Mdav

Class that implements the MDAV clustering algorithm. MDAV performs an accurate clustering of records, although its computational cost is quadratic. This algorithm implementation can be executed by the anonymization scheme because it extends the Algorithm class and implements the necessary methods. (See examples of use in section 2 of the jupyter notebook test_anonymization.ipynb and the files “test_k_anonymity.py” and “test_differential_privacy.py” in the folder “tests”.)

See also

Algorithm

References

[1]

Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5

Methods

create_clusters(records, k)

Function to perform the clustering of the list of records given as parameter.

calculate_furthest

create_cluster

distance

static create_clusters(records, k)

Function to perform the clustering of the list of records given as parameter. The size of each resulting cluster will be >= k.

Parameters:
records : list of Record

The list of records to perform the clustering.

k : int

The desired cluster size level (size of each cluster >= k).

Returns:
list of list of Record

A list where each item is a cluster of records.

See also

Record
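
Putting the pieces together, a hedged end-to-end k-anonymity sketch with MDAV (file names and the settings path are hypothetical; passing the Mdav class to calculate_anonymization is an assumption):

from privlib.anonymization.src.entities.dataset_CSV import Dataset_CSV
from privlib.anonymization.src.algorithms.k_anonymity import K_anonymity
from privlib.anonymization.src.algorithms.mdav import Mdav

# Hypothetical inputs: a CSV file plus the metadata describing each attribute.
dataset = Dataset_CSV("adult.csv", "adult_settings.json", separator=",")

k_anon = K_anonymity(dataset, k=5)
k_anon.calculate_anonymization(Mdav)

df_anon = k_anon.anonymized_dataset_to_dataframe()  # inspect as a pandas DataFrame
k_anon.save_anonymized_dataset("adult_k5.csv")
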
class privlib.anonymization.src.algorithms.t_closeness.T_closeness(original_dataset, k, t)

Class that implements the k-t-closeness anonymization method. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in section 4 of the jupyter notebook test_anonymization.ipynb and the file “test_t_closeness.py” in the folder “tests”.)

See also

Anonymization_scheme

References

[2]

Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “t-Closeness through microaggregation: strict privacy with enhanced utility preservation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 11, pp. 3098-3110, Oct 2015. DOI: https://doi.org/10.1109/TKDE.2015.2435777

Methods

calculate_anonymization(algorithm)

Function to perform the k-t-closeness anonymization method.

create_k_t_clusters

get_index_confidential_attribute

calculate_anonymization(algorithm)

Function to perform the k-t-closeness anonymization method.

Parameters:
algorithmAlgorithm

The clustering algorithm used during the anonymization.

See also

Algorithm
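
A hedged sketch under the same assumptions as the k-anonymity example above, using the documented constructor (original_dataset, k, t):

from privlib.anonymization.src.algorithms.t_closeness import T_closeness
from privlib.anonymization.src.algorithms.mdav import Mdav

kt = T_closeness(dataset, k=5, t=0.2)
kt.calculate_anonymization(Mdav)
kt.save_anonymized_dataset("adult_k5_t02.csv")  # hypothetical output path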

Entities

class privlib.anonymization.src.entities.dataset.Dataset(name, settings_path, attrs_settings, separator, sample=None)

Abstract class that represents a dataset of records, described by the metadata stored in settings_path or attrs_settings. Different dataset formats have to inherit from this class. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”.)

Methods

add_record(values_in)

Adds a record to the dataset.

calculate_standard_deviations(records)

Calculates the standard deviations of the list of records given as parameter

dataset_description()

Shows a description of the dataset using pandas Dataframe

description()

Shows a description of the dataset.

load_available_attribute_types()

Loads all available attribute types.

load_dataset()

Load the dataset.

load_dataset_settings()

Loads the dataset metadata describing each attribute type

load_header()

Load the header of the dataset.

set_header(header)

Sets the header of the dataset.

set_reference_record

take_sample

add_record(values_in)

Adds a record to the dataset. The load_dataset method implementation should call this method to store the data.

Parameters:
values_in

The record to be stored in the dataset. It consists of a list of values. The name of each value matches the header, and the attribute type is defined in the metadata.

See also

Record

static calculate_standard_deviations(records)

Calculates the standard deviations of the list of records given as parameter.

Parameters:
records

The list of records for which to calculate the standard deviations. The standard deviation calculation specific to each attribute value implementation is applied. The standard deviations are used to normalize the values.

dataset_description()

Shows a description of the dataset using pandas Dataframe

abstract description()

Shows a description of the dataset.

load_available_attribute_types()

Loads all available attribute types.

See also

Attribute_type
abstract load_dataset()

Load the dataset. The specific implementation should call the add_record method.

load_dataset_settings()

Loads the dataset metadata describing each attribute type

abstract load_header()

Load the header of the dataset. The header consists of the names of the attributes.

set_header(header)

Sets the header of the dataset.

Parameters:
header

The header including the names of the attributes in the dataset. The attribute names have to be included in the dataset metadata description.

class privlib.anonymization.src.entities.dataset_CSV.Dataset_CSV(dataset_path, settings_path, separator, sample=None)

Class that represents a dataset of records stored in CSV format. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”.)

Methods

description()

Shows a description of the dataset.

load_dataset()

Load the dataset.

load_header()

Load the header of the dataset.

description()

Shows a description of the dataset.

load_dataset()

Load the dataset. Implements the inherited load_dataset method for the CSV-formatted file.

See also

Dataset

load_header()

Load the header of the dataset. The header consists of the names of the attributes.

See also

Dataset

class privlib.anonymization.src.entities.dataset_DataFrame.Dataset_DataFrame(dataframe, settings_path, sample=None)

Class that represents a dataset of records stored in pandas DataFrame format. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”.)

Methods

description()

Shows a description of the dataset.

load_dataset()

Load the dataset.

load_header()

Load the header of the dataset. Implements the inherited load_header method for the pandas DataFrame format.

description()

Shows a description of the dataset.

load_dataset()

Load the dataset. Implements the inherited load_dataset method for the pandas DataFrame format.

See also

Dataset

load_header()

Load the header of the dataset. Implements the inherited load_header method for the pandas DataFrame format. The header consists of the names of the attributes.

See also

Dataset
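
A hedged sketch of wrapping an in-memory pandas DataFrame; the settings path describing the attribute types is a hypothetical placeholder:

import pandas as pd
from privlib.anonymization.src.entities.dataset_DataFrame import Dataset_DataFrame

df = pd.read_csv("adult.csv")  # any pandas DataFrame works
dataset = Dataset_DataFrame(df, "adult_settings.json")

# `dataset` can now be passed to K_anonymity, T_closeness or Differential_privacy.
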
class privlib.anonymization.src.entities.dataset_SPF.Dataset_SPF(spf, path_settings=None, attrs_settings=None, sample=None)

Class that represents a dataset of records stored as an SPF (SequentialPrivacyFrame) object. (See examples of use in section 7 of the jupyter notebook test_anonymization.ipynb and the file “test_spf.py” in the folder “tests”.)

Methods

add_attrs_settings_to_spf()

Adds the attribute description settings into the SPF

description()

Shows a description of the dataset.

load_dataset()

Load the dataset.

load_header()

Load the header of the dataset. Implements the inherited load_header method for the SPF format.

add_attrs_settings_to_spf()

Adds the attribute description settings into the SPF

See also

SequentialPrivacyFrame

description()

Shows a description of the dataset.

load_dataset()

Load the dataset. Implements the inherited load_dataset method for the SPF format.

See also

Dataset
SequentialPrivacyFrame

load_header()

Load the header of the dataset. Implements the inherited load_header method for the SPF format. The header consists of the names of the attributes.

See also

Dataset
SequentialPrivacyFrame
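
This class bridges the risk assessment and anonymization modules; a hedged sketch reusing the spf object from the Risk Assessment examples (the settings path is hypothetical, and passing the Mdav class is an assumption):

from privlib.anonymization.src.entities.dataset_SPF import Dataset_SPF
from privlib.anonymization.src.algorithms.k_anonymity import K_anonymity
from privlib.anonymization.src.algorithms.mdav import Mdav

spf_dataset = Dataset_SPF(spf, path_settings="spf_settings.json")

k_anon = K_anonymity(spf_dataset, k=2)
k_anon.calculate_anonymization(Mdav)
spf_anon = k_anon.anonymized_dataset_to_SPF()  # back to a SequentialPrivacyFrame
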
class privlib.anonymization.src.entities.record.Record(id_rec, values)

Class that represents a record. A record consists of a list of values. A Dataset is formed by a list of records.

Attributes:
reference_record

Methods

calculate_distances_to_reference_record(records)

Calculates and stores for each record in the list given as parameter the distance to the reference record.

distance(record)

Calculates the distance between this record and the record given as parameter.

distance_all_attributes(record)

Calculates the distance between this record and the record given as parameter.

set_reference_record(dataset)

Creates and stores a record that is formed by the reference value of each attribute.

static calculate_distances_to_reference_record(records)

Calculates and stores for each record in the list given as parameter the distance to the reference record.

Parameters:
recordslist

The list of records to calculate and store the distance to the reference record

distance(record)

Calculates the distance between this record and the record given as parameter. The distance is calculated as the Euclidean distance normalized by the standard deviation. The distance of each attribute value is calculated with the specific attribute value implementation. This method takes only the quasi-identifier attributes to calculate the distance between records.

Parameters:
recordRecord

The record to calculate the distance

Returns:
float

The distance between this record and the given one.

distance_all_attributes(record)

Calculates the distance between this record and the record given as parameter. The distance is calculated as the Euclidean distance normalized by the standard deviation. The distance of each attribute value is calculated with the specific attribute value implementation. This method takes all attributes to calculate the distance between records.

Parameters:
recordRecord

The record to calculate the distance

Returns:
float

The distance between this record and the given one.

static set_reference_record(dataset)

Creates and stores a record that is formed by the reference value of each attribute.

Parameters:
datasetDataset

The dataset to create the reference record
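
Conceptually, the normalized distance behaves as in the following minimal sketch, assuming purely numeric attributes (privlib's per-attribute-type value implementations generalize this):

import math

def normalized_distance(rec_a, rec_b, std_devs):
    """Euclidean distance with each attribute scaled by its standard deviation."""
    return math.sqrt(sum(
        ((a - b) / sd) ** 2
        for a, b, sd in zip(rec_a, rec_b, std_devs)
        if sd > 0  # skip constant attributes
    ))

# e.g. normalized_distance([30, 50000], [40, 52000], std_devs=[12.0, 8000.0])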

class privlib.anonymization.src.entities.disclosure_risk_result.Disclosure_risk_result(disclosure_risk, dataset_size)

Class that stores the results of the disclosure risk calculation. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”.)

Methods

description()

Shows the results description of the disclosure risk calculation

description()

Shows the results description of the disclosure risk calculation

class privlib.anonymization.src.entities.information_loss_result.Information_loss_result(SSE, attribute_name, original_mean, anonymized_mean, original_variance, anonymized_variance)

Class that stores the results of the information loss calculation. This class is used to store the information loss calculated with the calculate_information_loss method of class Anonymization_scheme. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”.)

Methods

description

AntiDiscrimination

Algorithms

class privlib.antiDiscrimination.src.algorithms.anonymization_scheme.Anonymization_scheme(original_dataset)

Abstract class that represents the anonymization scheme. Defines a series of functions and attributes necessary in all anonymization methods. Classes implementing an anonymization method must extend this class. (See examples of use in sections 1 and 2 of the jupyter notebook test_antiDiscrimination.ipynb and the file “anti_discrimination_test.py” in the folder “tests”.)

References

[1]

Sara Hajian and Josep Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, no. 7, pp. 1445-1459, Jun 2013. DOI: https://doi.org/10.1109/TKDE.2012.72

Methods

anonymized_dataset_to_dataframe()

Function called to convert the anonymized dataset to a pandas dataframe.

calculate_anonymization()

Function to perform the anonymization (anti-discrimination) of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.

calculate_metrics()

Function to calculate the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.

save_anonymized_dataset(path)

Function called to save the anonymized dataset.

list_to_string

anonymized_dataset_to_dataframe()

Function called to convert the anonymized dataset to a pandas dataframe.

Returns:
DataFrame

The pandas dataframe.

abstract calculate_anonymization()

Function to perform the anonymization (anti-discrimination) of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.

abstract calculate_metrics()

Function to calculate the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.

save_anonymized_dataset(path)

Function called to save the anonymized dataset.

Parameters:
path : str

desired path to save the anonymized dataset.

class privlib.antiDiscrimination.src.algorithms.anti_discrimination.Anti_discrimination(original_dataset, min_support, min_confidence, alfa, DI)

Class that implements anti-discrimination anonymization. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in sections 1 and 2 of the jupyter notebook test_antiDiscrimination.ipynb and the file “anti_discrimination_test.py” in the folder “tests”.)

See also

Anonymization_scheme

References

[1]

Sara Hajian and Josep Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, no. 7, pp. 1445-1459, Jun 2013. DOI: https://doi.org/10.1109/TKDE.2012.72

Methods

calculate_anonymization()

Function to perform the anti-discrimination anonymization.

calculate_metrics()

Function to calculate the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.

anonymize_direct_indirect

anonymize_direct_rules

anonymize_indirect_rules

calculate_MR_rules

calculate_RR_rules

calculate_and_save_FR_rules

calculate_and_save_rules_direct

calculate_and_save_rules_indirect

calculate_impact

calculate_maxs_index

calculate_next_index

count_items_hash

count_items_no_hash

create_rules

elb

f

get_noA_B_noC

get_noA_B_noD_noC

inspect

is_PD_rule

is_all_item_set_in_record

is_any_A_in_X

is_any_item_set_in_record

is_rule_in_rule_set

is_rule_possible

load_rules_FR

load_rules_direct

load_rules_indirect

save_rules_FR

save_rules_direct

save_rules_indirect

to_item_DI

calculate_anonymization()

Function to perform the anti-discrimination anonymization.

calculate_metrics()

Function to calculate the metrics of the datasets given in the constructor.
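
A hedged sketch using the documented constructor (original_dataset, min_support, min_confidence, alfa, DI); the dataset object, the DI item format, and the return value of calculate_metrics are assumptions:

from privlib.antiDiscrimination.src.algorithms.anti_discrimination import Anti_discrimination

# DI: potentially discriminatory items, e.g. attribute=value strings (format assumed).
DI = ["sex=female", "age=young"]

ad = Anti_discrimination(dataset, min_support=0.02, min_confidence=0.1,
                         alfa=1.2, DI=DI)
ad.calculate_anonymization()      # direct and indirect discrimination prevention
metrics = ad.calculate_metrics()  # assumed to return Anti_discrimination_metrics
metrics.description()             # DDPD, DDPP, IDPD, IDPP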

Entities

class privlib.antiDiscrimination.src.entities.anti_discrimination_metrics.Anti_discrimination_metrics(DDPD, DDPP, IDPD, IDPP)

Class that stores the results of the anti-discrimination metrics calculation. (See examples of use in sections 1 and 2 of the jupyter notebook test_antiDiscrimination.ipynb and the file “anti_discrimination_test.py” in the folder “tests”.)

Methods

description()

Shows the results description of the anti discrimination metrics calculation

description()

Shows the results description of the anti discrimination metrics calculation

class privlib.antiDiscrimination.src.entities.dataset.Dataset(name, separator, sample=None)

Abstract class that represents a dataset of records. Different dataset formats have to inherit from this class. (See examples of use in sections 1 and 2 of the jupyter notebook test_antiDiscrimination.ipynb and the file “anti_discrimination_test.py” in the folder “tests”.)

Methods

add_record(values_in)

Adds a record to the dataset.

dataset_description()

Shows a description of the dataset using pandas Dataframe

description()

Shows a description of the dataset.

load_dataset()

Load the dataset.

load_header()

Load the header of the dataset.

set_header(header)

Sets the header of the dataset.

take_sample

add_record(values_in)

Adds a record to the dataset. The load_dataset method implementation should call this method to store the data.

Parameters:
values_in

The record to be stored in the dataset. It consists of a list of values.

See also

Record

dataset_description()

Shows a description of the dataset using pandas Dataframe

abstract description()

Shows a description of the dataset.

abstract load_dataset()

Load the dataset. The specific implementation should call the add_record method.

abstract load_header()

Load the header of the dataset. The header consists of the names of the attributes.

set_header(header)

Sets the header of the dataset.

Parameters:
header

The header including the names of the attributes in the dataset.
