API
Risk Assessment
SequentialPrivacyFrame
- class privlib.riskAssessment.sequentialprivacyframe.SequentialPrivacyFrame(data, user_id='uid', datetime='datetime', order_id='order', sequence_id='sequence', elements='elements', timestamp=False, check_order_date=True)
A SequentialPrivacyFrame object is a pandas.DataFrame that represents sequences. A sequence has at least the following attributes: user_id, datetime, order_id, sequence_id, elements.
- Parameters:
- datalist or dict or pandas DataFrame
the data that must be embedded into a SequentialPrivacyFrame.
- user_idint or str, optional
the position or the name of the column in data containing the user identifier. The default is constants.UID.
- datetimeint or str, optional
the position or the name of the column in data containing the datetime. The default is constants.DATETIME.
- order_idint or str, optional
the position or the name of the column in data containing the order identifier for the sequences. The default is constants.ORDER_ID.
- sequence_idint or str, optional
the position or the name of the column in data containing the sequence identifier. The default is constants.SEQUENCE_ID.
- elementsint or str, or list of int or list of str
the positions or the names of the columns in data containing the elements of the sequences. Elements can be represented by any number of attributes that will be grouped together to represent the single element of the sequence. The default is constants.ELEMENTS.
- timestampboolean, optional
if True, the datetime is a timestamp. The default is False.
- check_order_dateboolean, optional
if True, the order of the various elements in the sequences of each user will be checked against the timestamp to ensure consistency. If some ordering attributes were not present in the original data, they will be computed based on what is available in the data. The default is True.
- Attributes:
- datetime
- elements
- order
- sequence
- uid
Methods
from_file
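The tabular shape a SequentialPrivacyFrame wraps can be sketched with plain Python rows; this is a hypothetical illustration (the row values and the commented constructor call only mirror the signature documented above, they are not taken from privlib):

```python
# Hypothetical rows in the shape a SequentialPrivacyFrame expects: one row per
# element, keyed by user id, datetime, order within the sequence, and sequence id.
rows = [
    {"uid": 1, "datetime": "2021-05-01 09:00", "order": 1, "sequence": 1, "elements": "home"},
    {"uid": 1, "datetime": "2021-05-01 10:30", "order": 2, "sequence": 1, "elements": "work"},
    {"uid": 2, "datetime": "2021-05-01 09:15", "order": 1, "sequence": 1, "elements": "gym"},
]

# With pandas available, the documented constructor call would look like:
#   spf = SequentialPrivacyFrame(pd.DataFrame(rows), user_id="uid",
#                                datetime="datetime", order_id="order",
#                                sequence_id="sequence", elements="elements")

# Every row carries the five attributes a sequence needs.
required = {"uid", "datetime", "order", "sequence", "elements"}
assert all(required <= set(row) for row in rows)
```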
Risk Evaluators
- class privlib.riskAssessment.riskevaluators.IndividualElementEvaluator(data, attack, knowledge_length, **kwargs)
Class for evaluating risk on individual level: risk is computed based on the whole data of each individual, i.e., each individual risk will be equal to the inverse of the number of other individuals in the data that match the background knowledge.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment.
- attackBackgroundKnowledgeAttack
an attack to be simulated. Must be a class implementing the BackgroundKnowledgeAttack abstract class
- knowledge_lengthint
the length of the knowledge of the simulated attack, i.e., how many data points are assumed to be in the background knowledge of the adversary
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
References
[TIST2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December 2017), 27 pages. DOI: https://doi.org/10.1145/3106774
[MOB2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, Anna Monreale: Analyzing Privacy Risk in Human Mobility Data. STAF Workshops 2018: 114-129
Methods
aggregation_levels
()Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed.
background_knowledge_gen
(single_priv_df)Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
risk
(single_privacy_frame[, complete])Computes the privacy risk for a single individual
- aggregation_levels()
Allows attack preprocess to be dependant on the logic of the RiskEvaluator if needed. For IndividualElementEvaluator, aggregation is done for each individual in the data and for each distinct element belonging to the individual.
- Returns:
- list
a list with the attributes to be aggregated, should an attack need it. For IndividualElementEvaluator these are user id and the elements of the sequence.
- background_knowledge_gen(single_priv_df)
Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- casesiterator
an iterator over all possible combinations of data points, i.e., all possible background knowledge instances.
- risk(single_privacy_frame, complete=False)
Computes the privacy risk for a single individual
- Parameters:
- single_privacy_frameSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- privacy_riskfloat
the privacy risk for the individual, computed as the inverse of the number of other individuals in the data that match the background knowledge.
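The individual-level risk described above can be illustrated with a small, self-contained sketch (toy data and helper names are hypothetical, not the privlib implementation). The adversary is assumed to use the most effective background knowledge instance, so the risk is the maximum over all combinations of knowledge_length elements:

```python
from itertools import combinations

# Toy data: each individual's elements (hypothetical, for illustration only).
data = {
    "u1": ["home", "work", "gym"],
    "u2": ["home", "work", "bar"],
    "u3": ["home", "gym", "bar"],
}

def matches(elements, case):
    # An individual matches the instance if all its elements appear in her data.
    return all(e in elements for e in case)

def element_risk(uid, knowledge_length):
    # Risk for one instance = 1 / (number of individuals that match it);
    # overall risk = worst case over all instances drawn from the user's data.
    risk = 0.0
    for case in combinations(data[uid], knowledge_length):
        matching = sum(1 for els in data.values() if matches(els, case))
        risk = max(risk, 1 / matching)
    return risk

# ("work", "gym") occurs only in u1's data, so some instance re-identifies u1.
assert element_risk("u1", 2) == 1.0
```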
- class privlib.riskAssessment.riskevaluators.IndividualSequenceEvaluator(data, attack, knowledge_length, **kwargs)
Class for evaluating risk on sequence level: risk is computed based on the different sequences of each individual, i.e., each individual risk will be equal to the number of sequences in her own data divided by the total number of sequences belonging to other individuals in the data that match the background knowledge.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment.
- attackBackgroundKnowledgeAttack
an attack to be simulated. Must be a class implementing the BackgroundKnowledgeAttack abstract class
- knowledge_lengthint
the length of the knowledge of the simulated attack, i.e., how many data points are assumed to be in the background knowledge of the adversary
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
References
[TIST2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December 2017), 27 pages. DOI: https://doi.org/10.1145/3106774
[MOB2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, Anna Monreale: Analyzing Privacy Risk in Human Mobility Data. STAF Workshops 2018: 114-129
Methods
aggregation_levels
()Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed.
background_knowledge_gen
(single_priv_df)Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
risk
(single_privacy_frame[, complete])Computes the privacy risk for a single individual
- aggregation_levels()
Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed. For IndividualSequenceEvaluator, aggregation is done for each individual in the data and for each sequence and distinct element that belong to the individual.
- Returns:
- list
a list with the attributes to be aggregated, should an attack need it. For IndividualSequenceEvaluator these are user id, sequence id and the elements of the sequence.
- background_knowledge_gen(single_priv_df)
Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- casesiterator
an iterator over all possible combinations of data points, i.e., all possible background knowledge instances.
- risk(single_privacy_frame, complete=False)
Computes the privacy risk for a single individual
- Parameters:
- single_privacy_frameSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- privacy_riskfloat
the privacy risk for the individual, computed as the number of sequences belonging to the individual divided by the number of all sequences in the data that match the background knowledge.
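The sequence-level risk can be sketched the same way; this toy version (data and helper names hypothetical, not the privlib implementation) uses one plausible reading of the documented formula, comparing the individual's matching sequences against all matching sequences in the data:

```python
from itertools import combinations

# Toy data: each individual owns a list of sequences (hypothetical).
sequences = {
    "u1": [["home", "work"], ["home", "gym"]],
    "u2": [["home", "work"], ["bar", "home"]],
    "u3": [["home", "gym"]],
}

def seq_matches(seq, case):
    # A sequence matches if it contains every element of the instance.
    return all(e in seq for e in case)

def sequence_risk(uid, knowledge_length):
    # Simplified reading: (the individual's sequences that match) divided by
    # (all sequences in the data that match), worst case over instances.
    elements = sorted({e for seq in sequences[uid] for e in seq})
    risk = 0.0
    for case in combinations(elements, knowledge_length):
        own = sum(1 for s in sequences[uid] if seq_matches(s, case))
        total = sum(1 for ss in sequences.values() for s in ss if seq_matches(s, case))
        if own:
            risk = max(risk, own / total)
    return risk

# u1 shares every 2-element pattern with someone else, so her risk is 0.5.
assert sequence_risk("u1", 2) == 0.5
```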
Attacks
- class privlib.riskAssessment.attacks.ElementsAttack
In an ElementsAttack the adversary knows some elements in the sequences of an individual.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For ElementsAttack, only the elements are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
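The matching semantics for this attack can be sketched in a few lines (helper name and data are hypothetical; the real method works on a SequentialPrivacyFrame): the instance matches when every known element appears in the individual's data.

```python
# Hedged sketch of ElementsAttack-style matching: plain containment of the
# known elements in the individual's elements.
def matching(individual_elements, case):
    return 1 if set(case) <= set(individual_elements) else 0

assert matching(["home", "work", "gym"], ["work", "gym"]) == 1
assert matching(["home", "work"], ["gym"]) == 0
```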
- class privlib.riskAssessment.attacks.FrequencyAttack
In a FrequencyAttack the adversary knows some elements in the sequences of an individual and the frequency with which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For FrequencyAttack, elements and their frequency are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
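Compared with ElementsAttack, frequency knowledge makes the instance harder to satisfy. A hedged sketch (helper name hypothetical, not the privlib implementation): a candidate matches only if each known element occurs at least as often in the candidate's data as in the background knowledge.

```python
from collections import Counter

# Hedged sketch of FrequencyAttack-style matching with per-element counts.
def freq_matching(individual_elements, case_counts):
    have = Counter(individual_elements)
    return 1 if all(have[e] >= n for e, n in case_counts.items()) else 0

# The adversary knows "home" appears twice.
assert freq_matching(["home", "home", "work"], {"home": 2}) == 1
assert freq_matching(["home", "work"], {"home": 2}) == 0
```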
- class privlib.riskAssessment.attacks.ProbabilityAttack
In a ProbabilityAttack the adversary knows some elements in the sequences of an individual and the probability with which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For ProbabilityAttack, elements and their probability are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
- class privlib.riskAssessment.attacks.ProportionAttack
In a ProportionAttack the adversary knows some elements in the sequences of an individual and the proportion with which they appear w.r.t. the most frequent elements in the sequences.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For ProportionAttack, elements and their proportion w.r.t. the most frequent element are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
- class privlib.riskAssessment.attacks.SequenceAttack
In a SequenceAttack the adversary knows some elements in the sequences of an individual and the order in which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For SequenceAttack, elements and their relative order are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
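Order-aware matching amounts to a subsequence test: the known elements must occur in the sequence in the same relative order, though not necessarily contiguously. A hedged, self-contained sketch (helper name hypothetical, not the privlib implementation):

```python
# Hedged sketch of SequenceAttack-style matching: the instance must be an
# order-preserving (not necessarily contiguous) subsequence of the sequence.
def is_subsequence(sequence, case):
    it = iter(sequence)
    # Each `e in it` advances the iterator, so elements must appear in order.
    return 1 if all(e in it for e in case) else 0

assert is_subsequence(["home", "work", "gym", "home"], ["work", "home"]) == 1
assert is_subsequence(["home", "work"], ["work", "home"]) == 0
```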
- class privlib.riskAssessment.attacks.TimeAttack
In a TimeAttack the adversary knows some elements in the sequences of an individual and the datetime at which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For TimeAttack, elements and their datetime are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
Discrimination Discovery
Discrimination discovery
dd Discrimination Discovery version 1.0
@author: Salvatore Ruggieri
- class privlib.discriminationDiscovery.discrimination_discovery.tDBIndex(tDB)
load transactions
Methods
cover
supp
Anonymization
Algorithms
- class privlib.anonymization.src.algorithms.algorithm.Algorithm
Abstract class that represents a clustering algorithm. Defines a series of functions necessary in all clustering algorithms. Classes implementing a clustering algorithm must extend this class.
Methods
calculate_centroid
(records, **kwargs)Function that calculates the centroid of a list of records.
create_clusters
(records, k)Function to perform the clustering of the records given as parameter. Abstract method; all clustering algorithms must implement it.
- static calculate_centroid(records, **kwargs)
Function that calculates the centroid of a list of records. The centroid is formed from the centroid of each attribute. Each attribute type value implements its own centroid calculation (see Value).
- Parameters:
- recordslist of Record
the list of records to calculate the centroid.
- **kwargsoptional
Additional arguments that the specific attribute type value may need to calculate the centroid
- Returns:
- Record
A record that is the centroid of the list of records
See also
Record
Value
- abstract static create_clusters(records, k)
Function to perform the clustering of the records given as parameter. Abstract method; all clustering algorithms must implement it.
- Parameters:
- recordslist of Record
the list of records to perform the clustering.
- kinteger
The minimum number of records in each cluster
- Returns:
- list
return a list where each item is a list of Record corresponding to a cluster of size >= k.
See also
Record
- class privlib.anonymization.src.algorithms.anonymization_scheme.Anonymization_scheme(original_dataset)
Abstract class that represents the anonymization scheme. Defines a series of functions and attributes necessary in all anonymization methods. Classes implementing an anonymization method must extend this class. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
References
[1]Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
[4]Josep Domingo-Ferrer and Vicenç Torra, “Disclosure risk assessment in statistical data protection”, Journal of Computational and Applied Mathematics, Vol. 164, pp. 285-293, Mar 2004. DOI: https://doi.org/10.1016/S0377-0427(03)00643-5
Methods
anonymized_dataset_to_SPF
()Function called to convert the anonymized dataset to a SequentialPrivacyFrame.
anonymized_dataset_to_dataframe
()Function called to convert the anonymized dataset to a pandas dataframe.
calculate_anonymization
(algorithm)Function to perform the anonymization of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
calculate_fast_record_linkage
(...[, window_size])Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one.
calculate_information_loss
(original_dataset, ...)Function to calculate the information loss of the anonymized data set by comparing it with the original one.
calculate_record_linkage
(original_dataset, ...)Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one.
save_anonymized_dataset
(path)Function called to save the anonymized dataset.
suppress_identifiers
()Function that removes the identifier attribute values from the data set.
list_to_string
- anonymized_dataset_to_SPF()
Function called to convert the anonymized dataset to a SequentialPrivacyFrame.
- Returns:
- SequentialPrivacyFrame
The SequentialPrivacyFrame data set.
- anonymized_dataset_to_dataframe()
Function called to convert the anonymized dataset to a pandas dataframe.
- Returns:
- DataFrame
The pandas dataframe.
- abstract calculate_anonymization(algorithm)
Function to perform the anonymization of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
- Parameters:
- algorithmAlgorithm
the clustering algorithm used to group records during the anonymization.
See also
Algorithm
- static calculate_fast_record_linkage(original_dataset, anonymized_dataset, window_size=None)
Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one. This is a fast but less accurate version of the record linkage calculation.
- Parameters:
- original_datasetDataset
The original data set.
- anonymized_datasetDataset
The anonymized version of the original data set.
- window_sizeint, optional
The desired size of the window; the larger the window, the more accurate but slower the calculation. If omitted, 1% of the data set is taken.
- Returns:
- Disclosure_risk_result
The disclosure risk.
See also
Disclosure_risk_result
- static calculate_information_loss(original_dataset, anonymized_dataset)
Function to calculate the information loss of the anonymized data set by comparing it with the original one.
- Parameters:
- original_datasetDataset
The original data set.
- anonymized_datasetDataset
The anonymized version of the original data set.
- Returns:
- Information_loss_result
Information loss statistics.
See also
Information_loss_result
- static calculate_record_linkage(original_dataset, anonymized_dataset)
Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one.
- Parameters:
- original_datasetDataset
The original data set.
- anonymized_datasetDataset
The anonymized version of the original data set.
- Returns:
- Disclosure_risk_result
The disclosure risk.
See also
Disclosure_risk_result
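The record-linkage idea behind this disclosure risk can be sketched in one dimension (data and helper names are hypothetical, not the privlib implementation): link every anonymized record back to its nearest original record, and count how often the true original is among the nearest candidates, with ties sharing the probability mass.

```python
# Hedged 1-D sketch of record-linkage disclosure risk.
original = [1.0, 4.0, 9.0, 16.0]
anonymized = [2.5, 2.5, 12.5, 12.5]  # e.g. 2-anonymous cluster means

def linkage_risk(original, anonymized):
    correct = 0.0
    for i, a in enumerate(anonymized):
        dists = [abs(a - o) for o in original]
        nearest = [j for j, d in enumerate(dists) if d == min(dists)]
        if i in nearest:
            correct += 1 / len(nearest)  # ties share the linkage probability
    return correct / len(original)

# Each anonymized value is equidistant from two originals, so risk is 0.5.
assert linkage_risk(original, anonymized) == 0.5
```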
- save_anonymized_dataset(path)
Function called to save the anonymized dataset.
- Parameters:
- pathstr
desired path to save the anonymized dataset.
- suppress_identifiers()
Function that removes the identifier attribute values from the data set.
- class privlib.anonymization.src.algorithms.differential_privacy.Differential_privacy(original_dataset, k, epsilon)
Class that implements differential privacy via individual-ranking microaggregation-based perturbation. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in section 5 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_differential_privacy.py” in the folder “tests”)
See also
Anonymization_scheme
References
[3]Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “Enhancing data utility in differential privacy via microaggregation-based k-anonymity”, The VLDB Journal, Vol. 23, no. 5, pp. 771-794, Sep 2014. DOI: https://doi.org/10.1007/s00778-014-0351-4
Methods
calculate_anonymization
(algorithm)Function to perform the differential privacy anonymization.
individual_ranking
- calculate_anonymization(algorithm)
Function to perform the differential privacy anonymization.
- Parameters:
- algorithmAlgorithm
The clustering algorithm used during the anonymization.
See also
Algorithm
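The microaggregation-plus-noise idea from [3] can be sketched on a single numeric attribute (all names, data and the noise scale below are hypothetical simplifications, not the privlib implementation): replacing each value by its k-group mean first means the Laplace noise only has to cover the sensitivity of a mean over k values.

```python
import math
import random

# Hedged sketch: individual-ranking microaggregation, then Laplace noise.
def microaggregate(values, k):
    # Sort the attribute, group consecutive runs of k values, and replace each
    # value by its group mean (assumes distinct values for this toy mapping).
    ordered = sorted(values)
    mean_of = {}
    for i in range(0, len(ordered), k):
        group = ordered[i:i + k]
        mean = sum(group) / len(group)
        for v in group:
            mean_of[v] = mean
    return [mean_of[v] for v in values]

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

rng = random.Random(0)
values = [1.0, 2.0, 8.0, 9.0]
k, epsilon = 2, 1.0
sensitivity = max(values) - min(values)
centroids = microaggregate(values, k)
# Averaging over k records divides the sensitivity the noise must cover by k.
private = [v + laplace_noise(sensitivity / (k * epsilon), rng) for v in centroids]
assert centroids == [1.5, 1.5, 8.5, 8.5]
```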
- class privlib.anonymization.src.algorithms.k_anonymity.K_anonymity(original_dataset, k)
Class that implements k-anonymity anonymization. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in sections 2 and 3 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_k_anonymity.py” in the folder “tests”)
See also
Anonymization_scheme
References
[1]Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
Methods
calculate_anonymization
(algorithm)Function to perform the k-anonymity anonymization.
- calculate_anonymization(algorithm)
Function to perform the k-anonymity anonymization.
- Parameters:
- algorithmAlgorithm
The clustering algorithm used during the anonymization.
See also
Algorithm
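The property this method enforces can be checked directly: after microaggregation-based k-anonymity, every combination of quasi-identifier values is shared by at least k records. A hedged, self-contained sketch (data and helper name hypothetical):

```python
from collections import Counter

# Hedged sketch: verify the k-anonymity property over quasi-identifier tuples.
def is_k_anonymous(quasi_identifier_rows, k):
    counts = Counter(tuple(row) for row in quasi_identifier_rows)
    return all(c >= k for c in counts.values())

# Toy output of a microaggregation step: two clusters of two records each.
anonymized = [(1.5, "A"), (1.5, "A"), (8.5, "B"), (8.5, "B")]
assert is_k_anonymous(anonymized, k=2)
assert not is_k_anonymous(anonymized, k=3)
```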
- class privlib.anonymization.src.algorithms.mdav.Mdav
Class that implements the MDAV clustering algorithm. The MDAV algorithm performs an accurate clustering of records at a quadratic computational cost. This algorithm implementation can be executed by the anonymization scheme because it extends the Algorithm class and implements the necessary methods. (See examples of use in section 2 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py” and “test_differential_privacy.py” in the folder “tests”)
See also
Algorithm
References
[1]Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
Methods
create_clusters
(records, k)Function to perform the clustering of the list of records given as parameter.
calculate_furthest
create_cluster
distance
- static create_clusters(records, k)
Function to perform the clustering of the list of records given as parameter. The size of the resulting clusters will be >= k.
- Parameters:
- recordslist of Record
The list of records to perform the clustering.
- kint
The minimum cluster size (each resulting cluster has size >= k)
- Returns:
- list of list of Record
A list where each item is a cluster of records.
See also
Record
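The core MDAV loop from [1] can be sketched on 1-D values (a simplified, hypothetical version, not the privlib implementation, and without the full boundary handling of the original algorithm): take the record furthest from the centroid, cluster it with its k-1 nearest neighbours, do the same around the record furthest from it, and repeat until fewer than 3k records remain.

```python
# Hedged 1-D sketch of the MDAV clustering loop.
def mdav(values, k):
    remaining = sorted(values)
    clusters = []
    while len(remaining) >= 3 * k:
        centroid = sum(remaining) / len(remaining)
        # r: record furthest from the centroid; s: record furthest from r.
        r = max(remaining, key=lambda v: abs(v - centroid))
        s = max(remaining, key=lambda v: abs(v - r))
        for seed in (r, s):
            # Cluster the seed with its k-1 nearest remaining records.
            cluster = sorted(remaining, key=lambda v: abs(v - seed))[:k]
            for v in cluster:
                remaining.remove(v)
            clusters.append(cluster)
    if remaining:
        clusters.append(remaining)  # leftovers form the final cluster
    return clusters

clusters = mdav([1, 2, 3, 10, 11, 12, 20, 21, 22], k=3)
assert all(len(c) >= 3 for c in clusters)
```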
- class privlib.anonymization.src.algorithms.t_closeness.T_closeness(original_dataset, k, t)
Class that implements the k-t-closeness anonymization method. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in section 4 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_t_closeness.py” in the folder “tests”)
See also
Anonymization_scheme
References
[2]Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “t-Closeness through microaggregation: strict privacy with enhanced utility preservation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 11, pp. 3098-3110, Oct 2015. DOI: https://doi.org/10.1109/TKDE.2015.2435777
Methods
calculate_anonymization
(algorithm)Function to perform the k-t-closeness anonymization method.
create_k_t_clusters
get_index_confidential_attribute
- calculate_anonymization(algorithm)
Function to perform the k-t-closeness anonymization method.
- Parameters:
- algorithmAlgorithm
The clustering algorithm used during the anonymization.
See also
Algorithm
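The t-closeness criterion behind this method can be illustrated numerically (distributions and helper name are hypothetical): within each cluster, the distribution of the confidential attribute must stay within distance t of the global distribution, and for an ordered attribute the Earth Mover's Distance reduces to an average of cumulative-distribution differences.

```python
# Hedged sketch of the Earth Mover's Distance for an ordered attribute with
# m categories: (1/(m-1)) * sum of |cumulative(p) - cumulative(q)|.
def emd_ordered(p, q):
    cp = cq = 0.0
    total = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        total += abs(cp - cq)
    return total / (len(p) - 1)

# Global vs. one cluster's distribution over 3 ordered categories (toy data).
global_dist = [1/3, 1/3, 1/3]
cluster_dist = [1/2, 1/3, 1/6]
t = 0.25
# This cluster satisfies t-closeness for t = 0.25 (EMD = 1/6).
assert emd_ordered(global_dist, cluster_dist) <= t
```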
Entities
- class privlib.anonymization.src.entities.dataset.Dataset(name, settings_path, attrs_settings, separator, sample=None)
Abstract class that represents a dataset of records described by the metadata stored in settings_path or attrs_settings. Different dataset formats have to inherit from this class. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
Methods
add_record
(values_in)Adds a record to the dataset.
calculate_standard_deviations
(records)Calculates the standard deviations of the list of records given as parameter
dataset_description
()Shows a description of the dataset using a pandas Dataframe.
description
()Shows a description of the dataset.
load_available_attribute_types
()Loads all available attribute types.
load_dataset
()Loads the dataset.
load_dataset_settings
()Loads the dataset metadata describing each attribute type.
load_header
()Loads the header of the dataset.
set_header
(header)Sets the header of the dataset.
set_reference_record
take_sample
- add_record(values_in)
Adds a record to the dataset. The load_dataset method implementation should call this method to store the data.
- Parameters:
- values_in
The record to be stored in the dataset. It consists of a list of values. The name of each value matches the header, and the attribute type is defined in the metadata.
See also
Record
- static calculate_standard_deviations(records)
Calculates the standard deviations of the list of records given as parameter.
- Parameters:
- records
The list of records for which to calculate the standard deviations. Each attribute value applies its own standard deviation calculation, depending on the specific implementation. The standard deviations are used to normalize values.
- dataset_description()
Shows a description of the dataset using a pandas Dataframe.
- abstract description()
Shows a description of the dataset.
- load_available_attribute_types()
Loads all available attribute types.
See also
Attribute_type
- abstract load_dataset()
Load the dataset. The specific implementation should call the add_record method.
- load_dataset_settings()
Loads the dataset metadata describing each attribute type
- abstract load_header()
Load the header of the dataset. The header consists of the names of the attributes.
- set_header(header)
Sets the header of the dataset.
- Parameters:
- header
The header including the name of the attributes in the dataset. The attribute names have to be included in the dataset metadata description
- class privlib.anonymization.src.entities.dataset_CSV.Dataset_CSV(dataset_path, settings_path, separator, sample=None)
Class that represents a dataset of records stored in csv format. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
Methods
description
()Shows a description of the dataset.
load_dataset
()Loads the dataset.
load_header
()Loads the header of the dataset.
- description()
Shows a description of the dataset.
- load_dataset()
Load the dataset. Implements the inherited load_dataset method for the csv-formatted file.
See also
Dataset
- load_header()
Load the header of the dataset. The header consists of the names of the attributes.
See also
Dataset
- class privlib.anonymization.src.entities.dataset_DataFrame.Dataset_DataFrame(dataframe, settings_path, sample=None)
Class that represents a dataset of records stored in pandas Dataframe format. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
Methods
description
()Shows a description of the dataset.
load_dataset
()Loads the dataset.
load_header
()Loads the header of the dataset.
- description()
Shows a description of the dataset.
- load_dataset()
Load the dataset. Implements the inherited load_dataset method for the pandas Dataframe format
See also
Dataset
- load_header()
Implements the inherited load_header method for the pandas Dataframe format. The header consists of the names of the attributes.
See also
Dataset
- class privlib.anonymization.src.entities.dataset_SPF.Dataset_SPF(spf, path_settings=None, attrs_settings=None, sample=None)
Class that represents a dataset of records stored in SPF (sequential privacy frame) object format (See examples of use in section 7 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_spf.py” in the folder “tests”)
Methods
add_attrs_settings_to_spf
()Adds the attribute description settings into the SPF.
description
()Shows a description of the dataset.
load_dataset
()Loads the dataset.
load_header
()Loads the header of the dataset.
- add_attrs_settings_to_spf()
Adds the attribute description settings into the SPF
See also
SequentialPrivacyFrame
- description()
Shows a description of the dataset.
- load_dataset()
Load the dataset. Implements the inherited load_dataset method for the SPF format
See also
Dataset
SequentialPrivacyFrame
- load_header()
Implements the inherited load_header method for the SPF format. The header consists of the names of the attributes.
See also
Dataset
SequentialPrivacyFrame
- class privlib.anonymization.src.entities.record.Record(id_rec, values)
Class that represents a record. A record consists of a list of values. A Dataset is formed by a list of records.
- Attributes:
- reference_record
Methods
calculate_distances_to_reference_record(records): Calculates and stores, for each record in the given list, the distance to the reference record.
distance(record): Calculates the distance between this record and the given record, using the quasi-identifier attributes.
distance_all_attributes(record): Calculates the distance between this record and the given record, using all attributes.
set_reference_record(dataset): Creates and stores a record formed by the reference value of each attribute.
- static calculate_distances_to_reference_record(records)
Calculates and stores, for each record in the given list, the distance to the reference record.
- Parameters:
- records : list
The list of records for which to calculate and store the distance to the reference record.
- distance(record)
Calculates the distance between this record and the record given as parameter. The distance is the Euclidean distance normalized by standard deviation; the distance of each attribute value is calculated by the specific attribute value implementation. This method uses only the quasi-identifier attributes to calculate the distance between records.
- Parameters:
- record : Record
The record to calculate the distance to.
- Returns:
- float
The distance between this record and the given one.
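The normalized Euclidean distance described above can be sketched as follows. This is a minimal illustration of the metric, not privlib's implementation; the attribute names and the choice of quasi-identifiers are hypothetical:

```python
import math

def normalized_euclidean(rec_a, rec_b, stds, quasi_identifiers):
    """Euclidean distance normalized by per-attribute standard deviation,
    restricted to the quasi-identifier attributes (as Record.distance does).
    distance_all_attributes would simply pass every attribute name instead."""
    total = 0.0
    for attr in quasi_identifiers:
        # Normalize each attribute difference by its standard deviation
        diff = (rec_a[attr] - rec_b[attr]) / stds[attr]
        total += diff * diff
    return math.sqrt(total)

# Hypothetical records with one quasi-identifier ("age") and one confidential
# attribute ("salary"); only "age" contributes to the distance.
a = {"age": 30, "salary": 40000}
b = {"age": 40, "salary": 90000}
stds = {"age": 5.0, "salary": 20000.0}
print(normalized_euclidean(a, b, stds, ["age"]))  # (40-30)/5 = 2.0
```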
- distance_all_attributes(record)
Calculates the distance between this record and the record given as parameter. The distance is the Euclidean distance normalized by standard deviation; the distance of each attribute value is calculated by the specific attribute value implementation. This method uses all attributes to calculate the distance between records.
- Parameters:
- record : Record
The record to calculate the distance to.
- Returns:
- float
The distance between this record and the given one.
- static set_reference_record(dataset)
Creates and stores a record formed by the reference value of each attribute.
- Parameters:
- dataset : Dataset
The dataset from which to create the reference record.
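A reference record of this kind can be sketched as below. The choice of the mean as the reference value of a numerical attribute is an assumption for illustration; privlib's per-attribute reference values may be computed differently:

```python
from statistics import mean

def build_reference_record(records, attributes):
    """Build a record holding the reference value (here, assumed to be the
    mean) of each numerical attribute across the dataset."""
    return {attr: mean(rec[attr] for rec in records) for attr in attributes}

# Hypothetical two-record dataset
dataset = [
    {"age": 20, "salary": 30000},
    {"age": 40, "salary": 50000},
]
ref = build_reference_record(dataset, ["age", "salary"])
print(ref)  # reference record with the mean of each attribute
```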
- class privlib.anonymization.src.entities.disclosure_risk_result.Disclosure_risk_result(disclosure_risk, dataset_size)
Class that stores the results of the disclosure risk calculation (see examples of use in sections 1, 2, 3, 4, and 5 of the Jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”).
Methods
description(): Shows a description of the disclosure risk calculation results.
- description()
Shows a description of the disclosure risk calculation results.
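The constructor takes a disclosure risk value and the dataset size. As an illustration of the kind of quantity such a result object might hold (the exact definition privlib uses is not given here, so this is an assumption), disclosure risk is often reported as the proportion of original records whose quasi-identifiers uniquely match an anonymized record:

```python
from collections import Counter

def disclosure_risk(original, anonymized, quasi_identifiers):
    """Fraction of original records whose quasi-identifier values match
    exactly one anonymized record (unique re-identification). This is one
    common record-linkage-style measure, not necessarily privlib's."""
    def key(rec):
        return tuple(rec[a] for a in quasi_identifiers)
    counts = Counter(key(rec) for rec in anonymized)
    linked = sum(1 for rec in original if counts.get(key(rec)) == 1)
    return linked / len(original)

# Hypothetical data: only the record with age 30 is uniquely linkable
original = [{"age": 30}, {"age": 40}, {"age": 50}]
anonymized = [{"age": 30}, {"age": 40}, {"age": 40}]
print(disclosure_risk(original, anonymized, ["age"]))
```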
- class privlib.anonymization.src.entities.information_loss_result.Information_loss_result(SSE, attribute_name, original_mean, anonymized_mean, original_variance, anonymized_variance)
Class that stores the results of the information loss calculation, as produced by the calculate_information_loss method of Anonymization_scheme (see examples of use in sections 1, 2, 3, 4, and 5 of the Jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”).
Methods
description(): Shows a description of the information loss calculation results.
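The quantities in the constructor signature (SSE plus the mean and variance of an attribute before and after anonymization) can be computed as in this self-contained sketch. Whether privlib uses population or sample variance is not stated here, so population variance is an assumption:

```python
from statistics import mean, pvariance

def information_loss_stats(original_values, anonymized_values):
    """Compute per-attribute statistics of the kind stored in
    Information_loss_result: the sum of squared errors between original and
    anonymized values, plus the means and variances of both versions."""
    sse = sum((o - a) ** 2 for o, a in zip(original_values, anonymized_values))
    return {
        "SSE": sse,
        "original_mean": mean(original_values),
        "anonymized_mean": mean(anonymized_values),
        "original_variance": pvariance(original_values),
        "anonymized_variance": pvariance(anonymized_values),
    }

# Anonymization collapsed every value to the mean: SSE = 1 + 0 + 1 = 2
stats = information_loss_stats([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(stats["SSE"])  # 2.0
```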
AntiDiscrimination
Algorithms
- class privlib.antiDiscrimination.src.algorithms.anonymization_scheme.Anonymization_scheme(original_dataset)
Abstract class that represents the anonymization scheme. Defines a series of functions and attributes necessary in all anonymization methods. Classes implementing an anonymization method must extend this class. (See examples of use in sections 1 and 2 of the Jupyter notebook “test_antiDiscrimination.ipynb” and the file “anti_discrimination_test.py” in the folder “tests”.)
References
[1]Sara Hajian and Josep Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, no. 7, pp. 1445-1459, Jun 2013. DOI: https://doi.org/10.1109/TKDE.2012.72
Methods
anonymized_dataset_to_dataframe(): Converts the anonymized dataset to a pandas DataFrame.
calculate_anonymization(): Performs the anonymization (anti-discrimination) of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
calculate_metrics(): Calculates the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.
save_anonymized_dataset(path): Saves the anonymized dataset to the given path.
list_to_string
- anonymized_dataset_to_dataframe()
Converts the anonymized dataset to a pandas DataFrame.
- Returns:
- DataFrame
The pandas DataFrame containing the anonymized dataset.
- abstract calculate_anonymization()
Performs the anonymization (anti-discrimination) of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
- abstract calculate_metrics()
Calculates the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.
- save_anonymized_dataset(path)
Saves the anonymized dataset to the given path.
- Parameters:
- path : str
The desired path where the anonymized dataset will be saved.
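The extension pattern this class describes, a concrete scheme implementing the two abstract operations, looks roughly like the sketch below. The class names and the trivial suppression behaviour are hypothetical, for illustration only:

```python
from abc import ABC, abstractmethod

class AnonymizationSchemeSketch(ABC):
    """Minimal stand-in for Anonymization_scheme: subclasses must provide
    calculate_anonymization and calculate_metrics."""
    def __init__(self, original_dataset):
        self.original_dataset = original_dataset
        self.anonymized_dataset = None

    @abstractmethod
    def calculate_anonymization(self): ...

    @abstractmethod
    def calculate_metrics(self): ...

class SuppressionSketch(AnonymizationSchemeSketch):
    """Hypothetical scheme that suppresses every value with a '*'."""
    def calculate_anonymization(self):
        self.anonymized_dataset = [["*"] * len(rec) for rec in self.original_dataset]

    def calculate_metrics(self):
        return {"records": len(self.anonymized_dataset)}

scheme = SuppressionSketch([[30, "engineer"], [40, "nurse"]])
scheme.calculate_anonymization()
print(scheme.calculate_metrics())  # {'records': 2}
```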
- class privlib.antiDiscrimination.src.algorithms.anti_discrimination.Anti_discrimination(original_dataset, min_support, min_confidence, alfa, DI)
Class that implements anti-discrimination anonymization. This algorithm can be executed by the anonymization scheme because it extends the Anonymization_scheme class (see examples of use in sections 1 and 2 of the Jupyter notebook test_antiDiscrimination.ipynb, and the file “anti_discrimination_test.py” in the folder “tests”).
See also
Anonymization_scheme
References
[1]Sara Hajian and Josep Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, no. 7, pp. 1445-1459, Jun 2013. DOI: https://doi.org/10.1109/TKDE.2012.72
Methods
calculate_anonymization(): Performs the anti-discrimination anonymization.
calculate_metrics(): Calculates the metrics of the datasets given in the constructor.
anonymize_direct_indirect
anonymize_direct_rules
anonymize_indirect_rules
calculate_MR_rules
calculate_RR_rules
calculate_and_save_FR_rules
calculate_and_save_rules_direct
calculate_and_save_rules_indirect
calculate_impact
calculate_maxs_index
calculate_next_index
count_items_hash
count_items_no_hash
create_rules
elb
f
get_noA_B_noC
get_noA_B_noD_noC
inspect
is_PD_rule
is_all_item_set_in_record
is_any_A_in_X
is_any_item_set_in_record
is_rule_in_rule_set
is_rule_possible
load_rules_FR
load_rules_direct
load_rules_indirect
save_rules_FR
save_rules_direct
save_rules_indirect
to_item_DI
- calculate_anonymization()
Function to perform the anti-discrimination anonymization.
- calculate_metrics()
Calculates the metrics of the datasets given in the constructor. Implements the inherited abstract calculate_metrics method.
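The methodology of reference [1] quantifies direct discrimination of classification rules with measures such as the extended lift (elift): the confidence of a rule with a potentially discriminatory item set A in its antecedent, divided by the confidence of the same rule without A. A minimal, self-contained computation (not privlib's code; the item-set representation is an assumption):

```python
def confidence(records, antecedent, consequent):
    """conf(X -> C): fraction of records containing X that also contain C."""
    matching = [r for r in records if antecedent <= r]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent <= r) / len(matching)

def elift(records, discriminatory, context, consequent):
    """Extended lift of the rule (A, B -> C):
    conf(A, B -> C) / conf(B -> C), where A is the potentially
    discriminatory item set and B the context item set."""
    return (confidence(records, discriminatory | context, consequent)
            / confidence(records, context, consequent))

# Hypothetical transactions: each record is a set of items
data = [
    {"female", "urban", "deny"},
    {"female", "urban", "deny"},
    {"male", "urban", "deny"},
    {"male", "urban", "grant"},
]
# conf(female, urban -> deny) = 1.0; conf(urban -> deny) = 0.75
print(elift(data, {"female"}, {"urban"}, {"deny"}))
```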
Entities
- class privlib.antiDiscrimination.src.entities.anti_discrimination_metrics.Anti_discrimination_metrics(DDPD, DDPP, IDPD, IDPP)
Class that stores the results of the anti-discrimination metrics calculation (see examples of use in sections 1 and 2 of the Jupyter notebook test_antiDiscrimination.ipynb, and the file “anti_discrimination_test.py” in the folder “tests”).
See also
Anti_discrimination_result
Methods
description(): Shows a description of the anti-discrimination metrics calculation results.
- description()
Shows a description of the anti-discrimination metrics calculation results.
- class privlib.antiDiscrimination.src.entities.dataset.Dataset(name, separator, sample=None)
Abstract class that represents a dataset of records. Classes for different dataset formats have to inherit from this class (see examples of use in sections 1 and 2 of the Jupyter notebook test_antiDiscrimination.ipynb, and the file “anti_discrimination_test.py” in the folder “tests”).
Methods
add_record(values_in): Adds a record to the dataset.
dataset_description(): Shows a description of the dataset using a pandas DataFrame.
description(): Shows a description of the dataset.
load_dataset(): Loads the dataset.
load_header(): Loads the header of the dataset.
set_header(header): Sets the header of the dataset.
take_sample
- add_record(values_in)
Adds a record to the dataset. The load_dataset method implementation should call this method to store the data.
- Parameters:
- values_in
The record to be stored in the dataset. It consists of a list of values.
See also
Record
- dataset_description()
Shows a description of the dataset using a pandas DataFrame.
- abstract description()
Shows a description of the dataset.
- abstract load_dataset()
Loads the dataset. The specific implementation should call the add_record method.
- abstract load_header()
Loads the header of the dataset. The header consists of the names of the attributes.
- set_header(header)
Sets the header of the dataset.
- Parameters:
- header
The header, including the names of the attributes in the dataset.
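The contract described above, where load_dataset implementations feed records through add_record, can be illustrated with a minimal in-memory subclass. The class and attribute names here are illustrative stand-ins, not privlib's:

```python
from abc import ABC, abstractmethod

class DatasetSketch(ABC):
    """Minimal stand-in for the abstract Dataset class."""
    def __init__(self):
        self.header = []
        self.records = []

    def add_record(self, values_in):
        # load_dataset implementations store each record through this method
        self.records.append(values_in)

    def set_header(self, header):
        self.header = header

    @abstractmethod
    def load_dataset(self): ...

class InMemoryDataset(DatasetSketch):
    """Loads records from a plain Python list, calling add_record as the
    abstract contract requires."""
    def __init__(self, rows, header):
        super().__init__()
        self.set_header(header)
        self._rows = rows

    def load_dataset(self):
        for row in self._rows:
            self.add_record(row)

ds = InMemoryDataset([[30, "engineer"], [40, "nurse"]], ["age", "job"])
ds.load_dataset()
print(len(ds.records))  # 2
```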
- class privlib.anonymization.src.entities.dataset_CSV.Dataset_CSV(dataset_path, settings_path, separator, sample=None)
Class that represents a dataset of records stored in CSV format (see examples of use in sections 1, 2, 3, 4, and 5 of the Jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”).
Methods
description(): Shows a description of the dataset.
load_dataset(): Loads the dataset.
load_header(): Loads the header of the dataset.
- description()
Shows a description of the dataset.
- load_dataset()
Loads the dataset. Implements the inherited load_dataset method for CSV-formatted files.
See also
Dataset
- load_header()
Loads the header of the dataset. The header consists of the names of the attributes.
See also
Dataset
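A CSV-backed implementation of the same header/records split might look like the sketch below, using only the standard-library csv module. This is not privlib's Dataset_CSV; the function name and the in-memory input are illustrative:

```python
import csv
import io

def load_csv(text, separator=","):
    """Read CSV text: the first row becomes the header, the remaining rows
    the records (mirroring the load_header / load_dataset split)."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = list(reader)
    header, records = rows[0], rows[1:]
    return header, records

# A dataset with a custom separator, as the constructor's separator
# parameter allows
header, records = load_csv("age;job\n30;engineer\n40;nurse\n", separator=";")
print(header)   # ['age', 'job']
print(records)  # [['30', 'engineer'], ['40', 'nurse']]
```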