API
Risk Assessment
SequentialPrivacyFrame
- class privlib.riskAssessment.sequentialprivacyframe.SequentialPrivacyFrame(data, user_id='uid', datetime='datetime', order_id='order', sequence_id='sequence', elements='elements', timestamp=False, check_order_date=True)
A SequentialPrivacyFrame object is a pandas.DataFrame that represents sequences. A sequence has at least the following attributes: user_id, datetime, order_id, sequence_id, elements.
- Parameters:
- datalist or dict or pandas DataFrame
the data that must be embedded into a SequentialPrivacyFrame.
- user_idint or str, optional
the position or the name of the column in data containing the user identifier. The default is constants.UID.
- datetimeint or str, optional
the position or the name of the column in data containing the datetime. The default is constants.DATETIME.
- order_idint or str, optional
the position or the name of the column in data containing the order identifier for the sequences. The default is constants.ORDER_ID.
- sequence_idint or str, optional
the position or the name of the column in data containing the sequence identifier. The default is constants.SEQUENCE_ID.
- elementsint or str, or list of int or list of str
the positions or the names of the columns in data containing the elements of the sequences. Elements can be represented by any number of attributes that will be grouped together to represent the single element of the sequence. The default is constants.ELEMENTS.
- timestampboolean, optional
if True, the datetime is a timestamp. The default is False.
- check_order_dateboolean, optional
if True, the order of the various elements in the sequences of each user will be checked against the timestamp to ensure consistency. If some ordering attributes were not present in the original data, they will be computed based on what is available in the data. The default is True.
- Attributes:
- datetime
- elements
- order
- sequence
- uid
Methods
from_file
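The tabular shape a SequentialPrivacyFrame wraps can be sketched with plain Python rows; this is a hypothetical illustration (the row values and the commented constructor call only mirror the signature documented above, they are not taken from privlib):

```python
# Hypothetical rows in the shape a SequentialPrivacyFrame expects: one row per
# element, keyed by user id, datetime, order within the sequence, and sequence id.
rows = [
    {"uid": 1, "datetime": "2021-05-01 09:00", "order": 1, "sequence": 1, "elements": "home"},
    {"uid": 1, "datetime": "2021-05-01 10:30", "order": 2, "sequence": 1, "elements": "work"},
    {"uid": 2, "datetime": "2021-05-01 09:15", "order": 1, "sequence": 1, "elements": "gym"},
]

# With pandas available, the documented constructor call would look like:
#   spf = SequentialPrivacyFrame(pd.DataFrame(rows), user_id="uid",
#                                datetime="datetime", order_id="order",
#                                sequence_id="sequence", elements="elements")

# Every row carries the five attributes a sequence needs.
required = {"uid", "datetime", "order", "sequence", "elements"}
assert all(required <= set(row) for row in rows)
```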
Risk Evaluators
- class privlib.riskAssessment.riskevaluators.IndividualElementEvaluator(data, attack, knowledge_length, **kwargs)
Class for evaluating risk on individual level: risk is computed based on the whole data of each individual, i.e., each individual risk will be equal to the inverse of the number of other individuals in the data that match the background knowledge.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment.
- attackBackgroundKnowledgeAttack
an attack to be simulated. Must be a class implementing the BackgroundKnowledgeAttack abstract class
- knowledge_lengthint
the length of the knowledge of the simulated attack, i.e., how many data points are assumed to be in the background knowledge of the adversary
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
References
[TIST2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December 2017), 27 pages. DOI: https://doi.org/10.1145/3106774
[MOB2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, Anna Monreale: Analyzing Privacy Risk in Human Mobility Data. STAF Workshops 2018: 114-129
Methods
aggregation_levels
()Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed.
background_knowledge_gen
(single_priv_df)Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
risk
(single_privacy_frame[, complete])Computes the privacy risk for a single individual
- aggregation_levels()
Allows attack preprocess to be dependant on the logic of the RiskEvaluator if needed. For IndividualElementEvaluator, aggregation is done for each individual in the data and for each distinct element belonging to the individual.
- Returns:
- list
a list with the attributes to be aggregated, should an attack need it. For IndividualElementEvaluator these are user id and the elements of the sequence.
- background_knowledge_gen(single_priv_df)
Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- casesiterator
an iterator over all possible combinations of data points, i.e., all possible background knowledge instances.
- risk(single_privacy_frame, complete=False)
Computes the privacy risk for a single individual
- Parameters:
- single_privacy_frameSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- privacy_riskfloat
the privacy risk for the individual, computed as the inverse of the number of other individuals in the data that match the background knowledge.
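The individual-level risk described above can be illustrated with a small, self-contained sketch (toy data and helper names are hypothetical, not the privlib implementation). The adversary is assumed to use the most effective background knowledge instance, so the risk is the maximum over all combinations of knowledge_length elements:

```python
from itertools import combinations

# Toy data: each individual's elements (hypothetical, for illustration only).
data = {
    "u1": ["home", "work", "gym"],
    "u2": ["home", "work", "bar"],
    "u3": ["home", "gym", "bar"],
}

def matches(elements, case):
    # An individual matches the instance if all its elements appear in her data.
    return all(e in elements for e in case)

def element_risk(uid, knowledge_length):
    # Risk for one instance = 1 / (number of individuals that match it);
    # overall risk = worst case over all instances drawn from the user's data.
    risk = 0.0
    for case in combinations(data[uid], knowledge_length):
        matching = sum(1 for els in data.values() if matches(els, case))
        risk = max(risk, 1 / matching)
    return risk

# ("work", "gym") occurs only in u1's data, so some instance re-identifies u1.
assert element_risk("u1", 2) == 1.0
```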
- class privlib.riskAssessment.riskevaluators.IndividualSequenceEvaluator(data, attack, knowledge_length, **kwargs)
Class for evaluating risk on sequence level: risk is computed based on the different sequences of each individual, i.e., each individual risk will be equal to the number of sequences in her own data divided by the total number of sequences belonging to other individuals in the data that match the background knowledge.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment.
- attackBackgroundKnowledgeAttack
an attack to be simulated. Must be a class implementing the BackgroundKnowledgeAttack abstract class
- knowledge_lengthint
the length of the knowledge of the simulated attack, i.e., how many data points are assumed to be in the background knowledge of the adversary
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
References
[TIST2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, and Anna Monreale. 2017. A Data Mining Approach to Assess Privacy Risk in Human Mobility Data. ACM Trans. Intell. Syst. Technol. 9, 3, Article 31 (December 2017), 27 pages. DOI: https://doi.org/10.1145/3106774
[MOB2018]Roberto Pellungrini, Luca Pappalardo, Francesca Pratesi, Anna Monreale: Analyzing Privacy Risk in Human Mobility Data. STAF Workshops 2018: 114-129
Methods
aggregation_levels
()Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed.
background_knowledge_gen
(single_priv_df)Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
risk
(single_privacy_frame[, complete])Computes the privacy risk for a single individual
- aggregation_levels()
Allows attack preprocess to be dependent on the logic of the RiskEvaluator if needed. For IndividualSequenceEvaluator, aggregation is done for each individual in the data and for each sequence and distinct element that belong to the individual.
- Returns:
- list
a list with the attributes to be aggregated, should an attack need it. For IndividualSequenceEvaluator these are user id, sequence id and the elements of the sequence.
- background_knowledge_gen(single_priv_df)
Generates all possible combinations of length knowledge_length from the data of an individual, to provide all possible background knowledge instances to the simulation.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- casesiterator
an iterator over all possible combinations of data points, i.e., all possible background knowledge instances.
- risk(single_privacy_frame, complete=False)
Computes the privacy risk for a single individual
- Parameters:
- single_privacy_frameSequentialPrivacyFrame
the data of the single individual from which to generate all possible background knowledge instances.
- Returns:
- privacy_riskfloat
the privacy risk for the individual, computed as the number of sequences belonging to the individual divided by the number of all sequences in the data that match the background knowledge.
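The sequence-level risk can be sketched the same way; this toy version (data and helper names hypothetical, not the privlib implementation) uses one plausible reading of the documented formula, comparing the individual's matching sequences against all matching sequences in the data:

```python
from itertools import combinations

# Toy data: each individual owns a list of sequences (hypothetical).
sequences = {
    "u1": [["home", "work"], ["home", "gym"]],
    "u2": [["home", "work"], ["bar", "home"]],
    "u3": [["home", "gym"]],
}

def seq_matches(seq, case):
    # A sequence matches if it contains every element of the instance.
    return all(e in seq for e in case)

def sequence_risk(uid, knowledge_length):
    # Simplified reading: (the individual's sequences that match) divided by
    # (all sequences in the data that match), worst case over instances.
    elements = sorted({e for seq in sequences[uid] for e in seq})
    risk = 0.0
    for case in combinations(elements, knowledge_length):
        own = sum(1 for s in sequences[uid] if seq_matches(s, case))
        total = sum(1 for ss in sequences.values() for s in ss if seq_matches(s, case))
        if own:
            risk = max(risk, own / total)
    return risk

# u1 shares every 2-element pattern with someone else, so her risk is 0.5.
assert sequence_risk("u1", 2) == 0.5
```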
Attacks
- class privlib.riskAssessment.attacks.ElementsAttack
In an ElementsAttack the adversary knows some elements in the sequences of an individual.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For ElementsAttack, only the elements are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
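The matching semantics for this attack can be sketched in a few lines (helper name and data are hypothetical; the real method works on a SequentialPrivacyFrame): the instance matches when every known element appears in the individual's data.

```python
# Hedged sketch of ElementsAttack-style matching: plain containment of the
# known elements in the individual's elements.
def matching(individual_elements, case):
    return 1 if set(case) <= set(individual_elements) else 0

assert matching(["home", "work", "gym"], ["work", "gym"]) == 1
assert matching(["home", "work"], ["gym"]) == 0
```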
- class privlib.riskAssessment.attacks.FrequencyAttack
In a FrequencyAttack the adversary knows some elements in the sequences of an individual and the frequency with which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For FrequencyAttack, elements and their frequency are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
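Compared with ElementsAttack, frequency knowledge makes the instance harder to satisfy. A hedged sketch (helper name hypothetical, not the privlib implementation): a candidate matches only if each known element occurs at least as often in the candidate's data as in the background knowledge.

```python
from collections import Counter

# Hedged sketch of FrequencyAttack-style matching with per-element counts.
def freq_matching(individual_elements, case_counts):
    have = Counter(individual_elements)
    return 1 if all(have[e] >= n for e, n in case_counts.items()) else 0

# The adversary knows "home" appears twice.
assert freq_matching(["home", "home", "work"], {"home": 2}) == 1
assert freq_matching(["home", "work"], {"home": 2}) == 0
```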
- class privlib.riskAssessment.attacks.ProbabilityAttack
In a ProbabilityAttack the adversary knows some elements in the sequences of an individual and the probability with which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For ProbabilityAttack, elements and their probability are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
- class privlib.riskAssessment.attacks.ProportionAttack
In a ProportionAttack the adversary knows some elements in the sequences of an individual and the proportion with which they appear w.r.t. the most frequent elements in the sequences.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For ProportionAttack, elements and their proportion w.r.t. the most frequent element are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
- class privlib.riskAssessment.attacks.SequenceAttack
In a SequenceAttack the adversary knows some elements in the sequences of an individual and the order in which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For SequenceAttack, elements and their relative order are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
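Order-aware matching amounts to a subsequence test: the known elements must occur in the sequence in the same relative order, though not necessarily contiguously. A hedged, self-contained sketch (helper name hypothetical, not the privlib implementation):

```python
# Hedged sketch of SequenceAttack-style matching: the instance must be an
# order-preserving (not necessarily contiguous) subsequence of the sequence.
def is_subsequence(sequence, case):
    it = iter(sequence)
    # Each `e in it` advances the iterator, so elements must appear in order.
    return 1 if all(e in it for e in case) else 0

assert is_subsequence(["home", "work", "gym", "home"], ["work", "home"]) == 1
assert is_subsequence(["home", "work"], ["work", "home"]) == 0
```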
- class privlib.riskAssessment.attacks.TimeAttack
In a TimeAttack the adversary knows some elements in the sequences of an individual and the datetime at which they appear.
- Parameters:
- dataSequentialPrivacyFrame
the data on which to perform privacy risk assessment simulating this attack.
- **kwargsmapping, optional
a dictionary of keyword arguments passed into the preprocessing of attack.
Methods
matching
(case)Matching function for the attack.
preprocess
(**kwargs)Function to perform preprocessing of the data.
- matching(case)
Matching function for the attack. For TimeAttack, elements and their datetime are used in the matching.
- Parameters:
- single_priv_dfSequentialPrivacyFrame
the data of a single individual.
- caselist or numpy array or dict
the background knowledge instance.
- Returns:
- int
1 if the instance matches the single_priv_df, 0 otherwise.
- preprocess(**kwargs)
Function to perform preprocessing of the data.
- Parameters:
- dataSequentialPrivacyFrame
the entire data to be preprocessed before attack simulation.
- **kwargsmapping, optional
further arguments for preprocessing that can be passed from the RiskEvaluator, for example aggregation_levels
Discrimination Discovery
Discrimination discovery
dd Discrimination Discovery version 1.0
@author: Salvatore Ruggieri
- class privlib.discriminationDiscovery.discrimination_discovery.tDBIndex(tDB)
load transactions
Methods
cover
supp
Anonymization
Algorithms
- class privlib.anonymization.src.algorithms.algorithm.Algorithm
Abstract class that represents a clustering algorithm. Defines a series of functions necessary in all clustering algorithms. Classes implementing a clustering algorithm must extend this class.
Methods
calculate_centroid
(records, **kwargs)Function that calculates the centroid of a list of records.
create_clusters
(records, k)Function to perform the clustering of the records given as parameter. Abstract method; all clustering algorithms must implement it.
- static calculate_centroid(records, **kwargs)
Function that calculates the centroid of a list of records. The centroid is formed from the centroid of each attribute. Each attribute type value implements its own centroid calculation (see Value).
- Parameters:
- recordslist of Record
the list of records to calculate the centroid.
- **kwargsoptional
Additional arguments that the specific attribute type value may need to calculate the centroid
- Returns:
- Record
A record that is the centroid of the list of records
See also
Record
Value
- abstract static create_clusters(records, k)
Function to perform the clustering of the records given as parameter. Abstract method; all clustering algorithms must implement it.
- Parameters:
- recordslist of Record
the list of records to perform the clustering.
- kinteger
The minimum number of records in each cluster
- Returns:
- list
return a list where each item is a list of Record corresponding to a cluster of size >= k.
See also
Record
- class privlib.anonymization.src.algorithms.anonymization_scheme.Anonymization_scheme(original_dataset)
Abstract class that represents the anonymization scheme. Defines a series of functions and attributes necessary in all anonymization methods. Classes implementing an anonymization method must extend this class. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
References
[1]Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
[4]Josep Domingo-Ferrer and Vicenç Torra, “Disclosure risk assessment in statistical data protection”, Journal of Computational and Applied Mathematics, Vol. 164, pp. 285-293, Mar 2004. DOI: https://doi.org/10.1016/S0377-0427(03)00643-5
Methods
anonymized_dataset_to_SPF
()Function called to convert the anonymized dataset to a SequentialPrivacyFrame.
anonymized_dataset_to_dataframe
()Function called to convert the anonymized dataset to a pandas dataframe.
calculate_anonymization
(algorithm)Function to perform the anonymization of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
calculate_fast_record_linkage
(...[, window_size])Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one.
calculate_information_loss
(original_dataset, ...)Function to calculate the information loss of the anonymized data set by comparing it with the original one.
calculate_record_linkage
(original_dataset, ...)Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one.
save_anonymized_dataset
(path)Function called to save the anonymized dataset.
suppress_identifiers
()Function that removes the identifier attribute values from the data set.
list_to_string
- anonymized_dataset_to_SPF()
Function called to convert the anonymized dataset to a SequentialPrivacyFrame.
- Returns:
- SequentialPrivacyFrame
The SequentialPrivacyFrame data set.
- anonymized_dataset_to_dataframe()
Function called to convert the anonymized dataset to a pandas dataframe.
- Returns:
- DataFrame
The pandas dataframe.
- abstract calculate_anonymization(algorithm)
Function to perform the anonymization of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
- Parameters:
- algorithmAlgorithm
the clustering algorithm used to group records during the anonymization.
See also
Algorithm
- static calculate_fast_record_linkage(original_dataset, anonymized_dataset, window_size=None)
Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one. This is a fast but less accurate version of the record linkage calculation.
- Parameters:
- original_datasetDataset
The original data set.
- anonymized_datasetDataset
The anonymized version of the original data set.
- window_sizeint, optional
The desired size of the window; the larger the window, the more accurate but slower the calculation. If omitted, 1% of the data set is taken.
- Returns:
- Disclosure_risk_result
The disclosure risk.
See also
Disclosure_risk_result
- static calculate_information_loss(original_dataset, anonymized_dataset)
Function to calculate the information loss of the anonymized data set by comparing it with the original one.
- Parameters:
- original_datasetDataset
The original data set.
- anonymized_datasetDataset
The anonymized version of the original data set.
- Returns:
- Information_loss_result
Information loss statistics.
See also
Information_loss_result
- static calculate_record_linkage(original_dataset, anonymized_dataset)
Function that calculates the disclosure risk of the anonymized data set by comparing it with the original one.
- Parameters:
- original_datasetDataset
The original data set.
- anonymized_datasetDataset
The anonymized version of the original data set.
- Returns:
- Disclosure_risk_result
The disclosure risk.
See also
Disclosure_risk_result
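The record-linkage idea behind this disclosure risk can be sketched in one dimension (data and helper names are hypothetical, not the privlib implementation): link every anonymized record back to its nearest original record, and count how often the true original is among the nearest candidates, with ties sharing the probability mass.

```python
# Hedged 1-D sketch of record-linkage disclosure risk.
original = [1.0, 4.0, 9.0, 16.0]
anonymized = [2.5, 2.5, 12.5, 12.5]  # e.g. 2-anonymous cluster means

def linkage_risk(original, anonymized):
    correct = 0.0
    for i, a in enumerate(anonymized):
        dists = [abs(a - o) for o in original]
        nearest = [j for j, d in enumerate(dists) if d == min(dists)]
        if i in nearest:
            correct += 1 / len(nearest)  # ties share the linkage probability
    return correct / len(original)

# Each anonymized value is equidistant from two originals, so risk is 0.5.
assert linkage_risk(original, anonymized) == 0.5
```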
- save_anonymized_dataset(path)
Function called to save the anonymized dataset.
- Parameters:
- pathstr
desired path to save the anonymized dataset.
- suppress_identifiers()
Function that removes the identifier attribute values from the data set.
- class privlib.anonymization.src.algorithms.differential_privacy.Differential_privacy(original_dataset, k, epsilon)
Class that implements differential privacy via individual-ranking microaggregation-based perturbation. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in section 5 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_differential_privacy.py” in the folder “tests”)
See also
Anonymization_scheme
References
[3]Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “Enhancing data utility in differential privacy via microaggregation-based k-anonymity”, The VLDB Journal, Vol. 23, no. 5, pp. 771-794, Sep 2014. DOI: https://doi.org/10.1007/s00778-014-0351-4
Methods
calculate_anonymization
(algorithm)Function to perform the differential privacy anonymization.
individual_ranking
- calculate_anonymization(algorithm)
Function to perform the differential privacy anonymization.
- Parameters:
- algorithmAlgorithm
The clustering algorithm used during the anonymization.
See also
Algorithm
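The microaggregation-plus-noise idea from [3] can be sketched on a single numeric attribute (all names, data and the noise scale below are hypothetical simplifications, not the privlib implementation): replacing each value by its k-group mean first means the Laplace noise only has to cover the sensitivity of a mean over k values.

```python
import math
import random

# Hedged sketch: individual-ranking microaggregation, then Laplace noise.
def microaggregate(values, k):
    # Sort the attribute, group consecutive runs of k values, and replace each
    # value by its group mean (assumes distinct values for this toy mapping).
    ordered = sorted(values)
    mean_of = {}
    for i in range(0, len(ordered), k):
        group = ordered[i:i + k]
        mean = sum(group) / len(group)
        for v in group:
            mean_of[v] = mean
    return [mean_of[v] for v in values]

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

rng = random.Random(0)
values = [1.0, 2.0, 8.0, 9.0]
k, epsilon = 2, 1.0
sensitivity = max(values) - min(values)
centroids = microaggregate(values, k)
# Averaging over k records divides the sensitivity the noise must cover by k.
private = [v + laplace_noise(sensitivity / (k * epsilon), rng) for v in centroids]
assert centroids == [1.5, 1.5, 8.5, 8.5]
```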
- class privlib.anonymization.src.algorithms.k_anonymity.K_anonymity(original_dataset, k)
Class that implements k-anonymity anonymization. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in sections 2 and 3 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_k_anonymity.py” in the folder “tests”)
See also
Anonymization_scheme
References
[1]Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
Methods
calculate_anonymization
(algorithm)Function to perform the k-anonymity anonymization.
- calculate_anonymization(algorithm)
Function to perform the k-anonymity anonymization.
- Parameters:
- algorithmAlgorithm
The clustering algorithm used during the anonymization.
See also
Algorithm
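The property this method enforces can be checked directly: after microaggregation-based k-anonymity, every combination of quasi-identifier values is shared by at least k records. A hedged, self-contained sketch (data and helper name hypothetical):

```python
from collections import Counter

# Hedged sketch: verify the k-anonymity property over quasi-identifier tuples.
def is_k_anonymous(quasi_identifier_rows, k):
    counts = Counter(tuple(row) for row in quasi_identifier_rows)
    return all(c >= k for c in counts.values())

# Toy output of a microaggregation step: two clusters of two records each.
anonymized = [(1.5, "A"), (1.5, "A"), (8.5, "B"), (8.5, "B")]
assert is_k_anonymous(anonymized, k=2)
assert not is_k_anonymous(anonymized, k=3)
```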
- class privlib.anonymization.src.algorithms.mdav.Mdav
Class that implements the MDAV clustering algorithm. The MDAV algorithm performs an accurate clustering of records at a quadratic computational cost. This algorithm implementation can be executed by the anonymization scheme because it extends the Algorithm class and implements the necessary methods. (See examples of use in section 2 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py” and “test_differential_privacy.py” in the folder “tests”)
See also
Algorithm
References
[1]Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
Methods
create_clusters
(records, k)Function to perform the clustering of the list of records given as parameter.
calculate_furthest
create_cluster
distance
- static create_clusters(records, k)
Function to perform the clustering of the list of records given as parameter. The size of the resulting clusters will be >= k.
- Parameters:
- recordslist of Record
The list of records to perform the clustering.
- kint
The minimum cluster size (each resulting cluster has size >= k)
- Returns:
- list of list of Record
A list where each item is a cluster of records.
See also
Record
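The core MDAV loop from [1] can be sketched on 1-D values (a simplified, hypothetical version, not the privlib implementation, and without the full boundary handling of the original algorithm): take the record furthest from the centroid, cluster it with its k-1 nearest neighbours, do the same around the record furthest from it, and repeat until fewer than 3k records remain.

```python
# Hedged 1-D sketch of the MDAV clustering loop.
def mdav(values, k):
    remaining = sorted(values)
    clusters = []
    while len(remaining) >= 3 * k:
        centroid = sum(remaining) / len(remaining)
        # r: record furthest from the centroid; s: record furthest from r.
        r = max(remaining, key=lambda v: abs(v - centroid))
        s = max(remaining, key=lambda v: abs(v - r))
        for seed in (r, s):
            # Cluster the seed with its k-1 nearest remaining records.
            cluster = sorted(remaining, key=lambda v: abs(v - seed))[:k]
            for v in cluster:
                remaining.remove(v)
            clusters.append(cluster)
    if remaining:
        clusters.append(remaining)  # leftovers form the final cluster
    return clusters

clusters = mdav([1, 2, 3, 10, 11, 12, 20, 21, 22], k=3)
assert all(len(c) >= 3 for c in clusters)
```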
- class privlib.anonymization.src.algorithms.t_closeness.T_closeness(original_dataset, k, t)
Class that implements the k-t-closeness anonymization method. This algorithm implementation can be executed by the anonymization scheme because it extends the Anonymization_scheme class. (See examples of use in section 4 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_t_closeness.py” in the folder “tests”)
See also
Anonymization_scheme
References
[2]Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “t-Closeness through microaggregation: strict privacy with enhanced utility preservation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 11, pp. 3098-3110, Oct 2015. DOI: https://doi.org/10.1109/TKDE.2015.2435777
Methods
calculate_anonymization
(algorithm)Function to perform the k-t-closeness anonymization method.
create_k_t_clusters
get_index_confidential_attribute
- calculate_anonymization(algorithm)
Function to perform the k-t-closeness anonymization method.
- Parameters:
- algorithmAlgorithm
The clustering algorithm used during the anonymization.
See also
Algorithm
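The t-closeness criterion behind this method can be illustrated numerically (distributions and helper name are hypothetical): within each cluster, the distribution of the confidential attribute must stay within distance t of the global distribution, and for an ordered attribute the Earth Mover's Distance reduces to an average of cumulative-distribution differences.

```python
# Hedged sketch of the Earth Mover's Distance for an ordered attribute with
# m categories: (1/(m-1)) * sum of |cumulative(p) - cumulative(q)|.
def emd_ordered(p, q):
    cp = cq = 0.0
    total = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        total += abs(cp - cq)
    return total / (len(p) - 1)

# Global vs. one cluster's distribution over 3 ordered categories (toy data).
global_dist = [1/3, 1/3, 1/3]
cluster_dist = [1/2, 1/3, 1/6]
t = 0.25
# This cluster satisfies t-closeness for t = 0.25 (EMD = 1/6).
assert emd_ordered(global_dist, cluster_dist) <= t
```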
Entities
- class privlib.anonymization.src.entities.dataset.Dataset(name, settings_path, attrs_settings, separator, sample=None)
Abstract class that represents a dataset of records described by the metadata stored in settings_path or attrs_settings. Different dataset formats have to inherit from this class. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
Methods
add_record
(values_in)Adds a record to the dataset.
calculate_standard_deviations
(records)Calculates the standard deviations of the list of records given as parameter
dataset_description
()Shows a description of the dataset using a pandas Dataframe.
description
()Shows a description of the dataset.
load_available_attribute_types
()Loads all available attribute types.
load_dataset
()Loads the dataset.
load_dataset_settings
()Loads the dataset metadata describing each attribute type.
load_header
()Loads the header of the dataset.
set_header
(header)Sets the header of the dataset.
set_reference_record
take_sample
- add_record(values_in)
Adds a record to the dataset. The load_dataset method implementation should call this method to store the data.
- Parameters:
- values_in
The record to be stored in the dataset. It consists of a list of values. The name of each value matches the header, and the attribute type is defined in the metadata.
See also
Record
- static calculate_standard_deviations(records)
Calculates the standard deviations of the list of records given as parameter.
- Parameters:
- records
The list of records for which to calculate the standard deviations. Each attribute value applies its own standard deviation calculation, depending on the specific implementation. The standard deviations are used to normalize values.
- dataset_description()
Shows a description of the dataset using a pandas Dataframe.
- abstract description()
Shows a description of the dataset.
- load_available_attribute_types()
Loads all available attribute types.
See also
Attribute_type
- abstract load_dataset()
Load the dataset. The specific implementation should call the add_record method.
- load_dataset_settings()
Loads the dataset metadata describing each attribute type
- abstract load_header()
Load the header of the dataset. The header consists of the names of the attributes.
- set_header(header)
Sets the header of the dataset.
- Parameters:
- header
The header including the name of the attributes in the dataset. The attribute names have to be included in the dataset metadata description
- class privlib.anonymization.src.entities.dataset_CSV.Dataset_CSV(dataset_path, settings_path, separator, sample=None)
Class that represents a dataset of records stored in csv format. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
Methods
description
()Shows a description of the dataset.
load_dataset
()Loads the dataset.
load_header
()Loads the header of the dataset.
- description()
Shows a description of the dataset.
- load_dataset()
Load the dataset. Implements the inherited load_dataset method for the csv-formatted file.
See also
Dataset
- load_header()
Load the header of the dataset. The header consists of the names of the attributes.
See also
Dataset
- class privlib.anonymization.src.entities.dataset_DataFrame.Dataset_DataFrame(dataframe, settings_path, sample=None)
Class that represents a dataset of records stored in pandas Dataframe format. (See examples of use in sections 1, 2, 3, 4 and 5 of the jupyter notebook: test_anonymization.ipynb) (See also the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”)
Methods
description
()Shows a description of the dataset.
load_dataset
()Loads the dataset.
load_header
()Loads the header of the dataset.
- description()
Shows a description of the dataset.
- load_dataset()
Load the dataset. Implements the inherited load_dataset method for the pandas Dataframe format
See also
Dataset
- load_header()
Implements the inherited load_header method for the pandas Dataframe format. The header consists of the names of the attributes.
See also
Dataset
- class privlib.anonymization.src.entities.dataset_SPF.Dataset_SPF(spf, path_settings=None, attrs_settings=None, sample=None)
Class that represents a dataset of records stored in SPF (sequential privacy frame) object format (See examples of use in section 7 of the jupyter notebook: test_anonymization.ipynb) (See also the file “test_spf.py” in the folder “tests”)
Methods
add_attrs_settings_to_spf
()Adds the attribute description settings into the SPF.
description
()Shows a description of the dataset.
load_dataset
()Loads the dataset.
load_header
()Loads the header of the dataset.
- add_attrs_settings_to_spf()
Adds the attribute description settings into the SPF
See also
SequentialPrivacyFrame
- description()
Shows a description of the dataset.
- load_dataset()
Load the dataset. Implements the inherited load_dataset method for the SPF format
See also
Dataset
SequentialPrivacyFrame
- load_header()
Implements the inherited load_header method for the SPF format. The header consists of the names of the attributes.
See also
Dataset
SequentialPrivacyFrame
- class privlib.anonymization.src.entities.record.Record(id_rec, values)
Class that represents a record. A record consists of a list of values. A Dataset is formed by a list of records.
- Attributes:
- reference_record
Methods
calculate_distances_to_reference_record(records): Calculates and stores, for each record in the given list, the distance to the reference record.
distance(record): Calculates the distance between this record and the given record, using the quasi-identifier attributes.
distance_all_attributes(record): Calculates the distance between this record and the given record, using all attributes.
set_reference_record(dataset): Creates and stores a record formed by the reference value of each attribute.
- static calculate_distances_to_reference_record(records)
Calculates and stores, for each record in the given list, the distance to the reference record.
- Parameters:
- records : list
The list of records for which to calculate and store the distance to the reference record.
- distance(record)
Calculates the distance between this record and the record given as parameter. The distance is the Euclidean distance normalized by standard deviation; the distance of each attribute value is calculated by the specific attribute value implementation. This method uses only the quasi-identifier attributes to calculate the distance between records.
- Parameters:
- record : Record
The record to calculate the distance to.
- Returns:
- float
The distance between this record and the given one.
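The normalized Euclidean distance described above can be sketched as follows. This is a minimal illustration of the metric, not privlib's implementation; the attribute names and the choice of quasi-identifiers are hypothetical:

```python
import math

def normalized_euclidean(rec_a, rec_b, stds, quasi_identifiers):
    """Euclidean distance normalized by per-attribute standard deviation,
    restricted to the quasi-identifier attributes (as Record.distance does).
    distance_all_attributes would simply pass every attribute name instead."""
    total = 0.0
    for attr in quasi_identifiers:
        # Normalize each attribute difference by its standard deviation
        diff = (rec_a[attr] - rec_b[attr]) / stds[attr]
        total += diff * diff
    return math.sqrt(total)

# Hypothetical records with one quasi-identifier ("age") and one confidential
# attribute ("salary"); only "age" contributes to the distance.
a = {"age": 30, "salary": 40000}
b = {"age": 40, "salary": 90000}
stds = {"age": 5.0, "salary": 20000.0}
print(normalized_euclidean(a, b, stds, ["age"]))  # (40-30)/5 = 2.0
```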
- distance_all_attributes(record)
Calculates the distance between this record and the record given as parameter. The distance is the Euclidean distance normalized by standard deviation; the distance of each attribute value is calculated by the specific attribute value implementation. This method uses all attributes to calculate the distance between records.
- Parameters:
- record : Record
The record to calculate the distance to.
- Returns:
- float
The distance between this record and the given one.
- static set_reference_record(dataset)
Creates and stores a record formed by the reference value of each attribute.
- Parameters:
- dataset : Dataset
The dataset from which to create the reference record.
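A reference record of this kind can be sketched as below. The choice of the mean as the reference value of a numerical attribute is an assumption for illustration; privlib's per-attribute reference values may be computed differently:

```python
from statistics import mean

def build_reference_record(records, attributes):
    """Build a record holding the reference value (here, assumed to be the
    mean) of each numerical attribute across the dataset."""
    return {attr: mean(rec[attr] for rec in records) for attr in attributes}

# Hypothetical two-record dataset
dataset = [
    {"age": 20, "salary": 30000},
    {"age": 40, "salary": 50000},
]
ref = build_reference_record(dataset, ["age", "salary"])
print(ref)  # reference record with the mean of each attribute
```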
- class privlib.anonymization.src.entities.disclosure_risk_result.Disclosure_risk_result(disclosure_risk, dataset_size)
Class that stores the results of the disclosure risk calculation (see examples of use in sections 1, 2, 3, 4, and 5 of the Jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”).
Methods
description(): Shows a description of the disclosure risk calculation results.
- description()
Shows a description of the disclosure risk calculation results.
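The constructor takes a disclosure risk value and the dataset size. As an illustration of the kind of quantity such a result object might hold (the exact definition privlib uses is not given here, so this is an assumption), disclosure risk is often reported as the proportion of original records whose quasi-identifiers uniquely match an anonymized record:

```python
from collections import Counter

def disclosure_risk(original, anonymized, quasi_identifiers):
    """Fraction of original records whose quasi-identifier values match
    exactly one anonymized record (unique re-identification). This is one
    common record-linkage-style measure, not necessarily privlib's."""
    def key(rec):
        return tuple(rec[a] for a in quasi_identifiers)
    counts = Counter(key(rec) for rec in anonymized)
    linked = sum(1 for rec in original if counts.get(key(rec)) == 1)
    return linked / len(original)

# Hypothetical data: only the record with age 30 is uniquely linkable
original = [{"age": 30}, {"age": 40}, {"age": 50}]
anonymized = [{"age": 30}, {"age": 40}, {"age": 40}]
print(disclosure_risk(original, anonymized, ["age"]))
```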
- class privlib.anonymization.src.entities.information_loss_result.Information_loss_result(SSE, attribute_name, original_mean, anonymized_mean, original_variance, anonymized_variance)
Class that stores the results of the information loss calculation, as produced by the calculate_information_loss method of Anonymization_scheme (see examples of use in sections 1, 2, 3, 4, and 5 of the Jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”).
Methods
description(): Shows a description of the information loss calculation results.
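The quantities in the constructor signature (SSE plus the mean and variance of an attribute before and after anonymization) can be computed as in this self-contained sketch. Whether privlib uses population or sample variance is not stated here, so population variance is an assumption:

```python
from statistics import mean, pvariance

def information_loss_stats(original_values, anonymized_values):
    """Compute per-attribute statistics of the kind stored in
    Information_loss_result: the sum of squared errors between original and
    anonymized values, plus the means and variances of both versions."""
    sse = sum((o - a) ** 2 for o, a in zip(original_values, anonymized_values))
    return {
        "SSE": sse,
        "original_mean": mean(original_values),
        "anonymized_mean": mean(anonymized_values),
        "original_variance": pvariance(original_values),
        "anonymized_variance": pvariance(anonymized_values),
    }

# Anonymization collapsed every value to the mean: SSE = 1 + 0 + 1 = 2
stats = information_loss_stats([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(stats["SSE"])  # 2.0
```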
AntiDiscrimination
Algorithms
- class privlib.antiDiscrimination.src.algorithms.anonymization_scheme.Anonymization_scheme(original_dataset)
Abstract class that represents the anonymization scheme. Defines a series of functions and attributes necessary in all anonymization methods. Classes implementing an anonymization method must extend this class. (See examples of use in sections 1 and 2 of the Jupyter notebook “test_antiDiscrimination.ipynb” and the file “anti_discrimination_test.py” in the folder “tests”.)
References
[1]Sara Hajian and Josep Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, no. 7, pp. 1445-1459, Jun 2013. DOI: https://doi.org/10.1109/TKDE.2012.72
Methods
anonymized_dataset_to_dataframe(): Converts the anonymized dataset to a pandas DataFrame.
calculate_anonymization(): Performs the anonymization (anti-discrimination) of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
calculate_metrics(): Calculates the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.
save_anonymized_dataset(path): Saves the anonymized dataset to the given path.
list_to_string
- anonymized_dataset_to_dataframe()
Converts the anonymized dataset to a pandas DataFrame.
- Returns:
- DataFrame
The pandas DataFrame containing the anonymized dataset.
- abstract calculate_anonymization()
Performs the anonymization (anti-discrimination) of the dataset given in the constructor. Abstract method; all anonymization methods must implement it.
- abstract calculate_metrics()
Calculates the metrics of the datasets given in the constructor. Abstract method; all anonymization methods must implement it.
- save_anonymized_dataset(path)
Saves the anonymized dataset to the given path.
- Parameters:
- path : str
The desired path where the anonymized dataset will be saved.
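The extension pattern this class describes, a concrete scheme implementing the two abstract operations, looks roughly like the sketch below. The class names and the trivial suppression behaviour are hypothetical, for illustration only:

```python
from abc import ABC, abstractmethod

class AnonymizationSchemeSketch(ABC):
    """Minimal stand-in for Anonymization_scheme: subclasses must provide
    calculate_anonymization and calculate_metrics."""
    def __init__(self, original_dataset):
        self.original_dataset = original_dataset
        self.anonymized_dataset = None

    @abstractmethod
    def calculate_anonymization(self): ...

    @abstractmethod
    def calculate_metrics(self): ...

class SuppressionSketch(AnonymizationSchemeSketch):
    """Hypothetical scheme that suppresses every value with a '*'."""
    def calculate_anonymization(self):
        self.anonymized_dataset = [["*"] * len(rec) for rec in self.original_dataset]

    def calculate_metrics(self):
        return {"records": len(self.anonymized_dataset)}

scheme = SuppressionSketch([[30, "engineer"], [40, "nurse"]])
scheme.calculate_anonymization()
print(scheme.calculate_metrics())  # {'records': 2}
```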
- class privlib.antiDiscrimination.src.algorithms.anti_discrimination.Anti_discrimination(original_dataset, min_support, min_confidence, alfa, DI)
Class that implements anti-discrimination anonymization. This algorithm can be executed by the anonymization scheme because it extends the Anonymization_scheme class (see examples of use in sections 1 and 2 of the Jupyter notebook test_antiDiscrimination.ipynb, and the file “anti_discrimination_test.py” in the folder “tests”).
See also
Anonymization_scheme
References
[1]Sara Hajian and Josep Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, no. 7, pp. 1445-1459, Jun 2013. DOI: https://doi.org/10.1109/TKDE.2012.72
Methods
calculate_anonymization(): Performs the anti-discrimination anonymization.
calculate_metrics(): Calculates the metrics of the datasets given in the constructor.
anonymize_direct_indirect
anonymize_direct_rules
anonymize_indirect_rules
calculate_MR_rules
calculate_RR_rules
calculate_and_save_FR_rules
calculate_and_save_rules_direct
calculate_and_save_rules_indirect
calculate_impact
calculate_maxs_index
calculate_next_index
count_items_hash
count_items_no_hash
create_rules
elb
f
get_noA_B_noC
get_noA_B_noD_noC
inspect
is_PD_rule
is_all_item_set_in_record
is_any_A_in_X
is_any_item_set_in_record
is_rule_in_rule_set
is_rule_possible
load_rules_FR
load_rules_direct
load_rules_indirect
save_rules_FR
save_rules_direct
save_rules_indirect
to_item_DI
- calculate_anonymization()
Function to perform the anti-discrimination anonymization.
- calculate_metrics()
Calculates the metrics of the datasets given in the constructor. Implements the inherited abstract calculate_metrics method.
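The methodology of reference [1] quantifies direct discrimination of classification rules with measures such as the extended lift (elift): the confidence of a rule with a potentially discriminatory item set A in its antecedent, divided by the confidence of the same rule without A. A minimal, self-contained computation (not privlib's code; the item-set representation is an assumption):

```python
def confidence(records, antecedent, consequent):
    """conf(X -> C): fraction of records containing X that also contain C."""
    matching = [r for r in records if antecedent <= r]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent <= r) / len(matching)

def elift(records, discriminatory, context, consequent):
    """Extended lift of the rule (A, B -> C):
    conf(A, B -> C) / conf(B -> C), where A is the potentially
    discriminatory item set and B the context item set."""
    return (confidence(records, discriminatory | context, consequent)
            / confidence(records, context, consequent))

# Hypothetical transactions: each record is a set of items
data = [
    {"female", "urban", "deny"},
    {"female", "urban", "deny"},
    {"male", "urban", "deny"},
    {"male", "urban", "grant"},
]
# conf(female, urban -> deny) = 1.0; conf(urban -> deny) = 0.75
print(elift(data, {"female"}, {"urban"}, {"deny"}))
```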
Entities
- class privlib.antiDiscrimination.src.entities.anti_discrimination_metrics.Anti_discrimination_metrics(DDPD, DDPP, IDPD, IDPP)
Class that stores the results of the anti-discrimination metrics calculation (see examples of use in sections 1 and 2 of the Jupyter notebook test_antiDiscrimination.ipynb, and the file “anti_discrimination_test.py” in the folder “tests”).
See also
Anti_discrimination_result
Methods
description(): Shows a description of the anti-discrimination metrics calculation results.
- description()
Shows a description of the anti-discrimination metrics calculation results.
- class privlib.antiDiscrimination.src.entities.dataset.Dataset(name, separator, sample=None)
Abstract class that represents a dataset of records. Classes for different dataset formats have to inherit from this class (see examples of use in sections 1 and 2 of the Jupyter notebook test_antiDiscrimination.ipynb, and the file “anti_discrimination_test.py” in the folder “tests”).
Methods
add_record(values_in): Adds a record to the dataset.
dataset_description(): Shows a description of the dataset using a pandas DataFrame.
description(): Shows a description of the dataset.
load_dataset(): Loads the dataset.
load_header(): Loads the header of the dataset.
set_header(header): Sets the header of the dataset.
take_sample
- add_record(values_in)
Adds a record to the dataset. The load_dataset method implementation should call this method to store the data.
- Parameters:
- values_in
The record to be stored in the dataset. It consists of a list of values.
See also
Record
- dataset_description()
Shows a description of the dataset using a pandas DataFrame.
- abstract description()
Shows a description of the dataset.
- abstract load_dataset()
Loads the dataset. The specific implementation should call the add_record method.
- abstract load_header()
Loads the header of the dataset. The header consists of the names of the attributes.
- set_header(header)
Sets the header of the dataset.
- Parameters:
- header
The header, including the names of the attributes in the dataset.
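The contract described above, where load_dataset implementations feed records through add_record, can be illustrated with a minimal in-memory subclass. The class and attribute names here are illustrative stand-ins, not privlib's:

```python
from abc import ABC, abstractmethod

class DatasetSketch(ABC):
    """Minimal stand-in for the abstract Dataset class."""
    def __init__(self):
        self.header = []
        self.records = []

    def add_record(self, values_in):
        # load_dataset implementations store each record through this method
        self.records.append(values_in)

    def set_header(self, header):
        self.header = header

    @abstractmethod
    def load_dataset(self): ...

class InMemoryDataset(DatasetSketch):
    """Loads records from a plain Python list, calling add_record as the
    abstract contract requires."""
    def __init__(self, rows, header):
        super().__init__()
        self.set_header(header)
        self._rows = rows

    def load_dataset(self):
        for row in self._rows:
            self.add_record(row)

ds = InMemoryDataset([[30, "engineer"], [40, "nurse"]], ["age", "job"])
ds.load_dataset()
print(len(ds.records))  # 2
```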
- class privlib.anonymization.src.entities.dataset_CSV.Dataset_CSV(dataset_path, settings_path, separator, sample=None)
Class that represents a dataset of records stored in CSV format (see examples of use in sections 1, 2, 3, 4, and 5 of the Jupyter notebook test_anonymization.ipynb, and the files “test_k_anonymity.py”, “test_k_t_closeness.py” and “test_differential_privacy.py” in the folder “tests”).
Methods
description(): Shows a description of the dataset.
load_dataset(): Loads the dataset.
load_header(): Loads the header of the dataset.
- description()
Shows a description of the dataset.
- load_dataset()
Loads the dataset. Implements the inherited load_dataset method for CSV-formatted files.
See also
Dataset
- load_header()
Loads the header of the dataset. The header consists of the names of the attributes.
See also
Dataset
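A CSV-backed implementation of the same header/records split might look like the sketch below, using only the standard-library csv module. This is not privlib's Dataset_CSV; the function name and the in-memory input are illustrative:

```python
import csv
import io

def load_csv(text, separator=","):
    """Read CSV text: the first row becomes the header, the remaining rows
    the records (mirroring the load_header / load_dataset split)."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = list(reader)
    header, records = rows[0], rows[1:]
    return header, records

# A dataset with a custom separator, as the constructor's separator
# parameter allows
header, records = load_csv("age;job\n30;engineer\n40;nurse\n", separator=";")
print(header)   # ['age', 'job']
print(records)  # [['30', 'engineer'], ['40', 'nurse']]
```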