When working with large datasets, it is crucial to ensure there are no duplicates so that downstream applications can process the data effectively. Oracle Big Data Preparation Cloud Service lets me run a duplicate analysis from the toolbar on the Transform Authoring page. The Duplicate Analysis dialog lists all the columns available in the transform. I select the columns in which I want to identify duplicates and move them to the Selected Columns region. I then specify the precision level that I want the engine to use when searching for duplicates.
For optimal results, I set the match precision value near the center of the slider. When I apply the duplicate analysis settings, the results are displayed on the last page of the profile drawer. These metrics help me analyze how many duplicates my dataset might contain. I see a count of the total number of distinct values and a count of possible duplicate records. I click the number links to view the actual row values. I am interested in viewing the duplicate data values. A list of similar data clusters is displayed, along with the number of records in each cluster. I view the similar records and confirm whether they are duplicates. Once I have reviewed my data, I use the icons on the toolbar to return to the spreadsheet or metadata view. Duplicate analysis also creates a column named cluster_id in the dataset that I can use to create reports on these duplicates, if necessary.
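To make the idea of match precision and the cluster_id column more concrete, here is a minimal Python sketch. It is not how the service works internally, since the matching algorithm is not described here. It assumes a simple greedy approach that compares rows with the standard library's difflib, and the function names, the sample rows, and the mapping of the slider to a 0 to 1 `precision` threshold are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase a value and collapse runs of whitespace."""
    return " ".join(str(value).lower().split())


def row_key(row, columns):
    """Build one comparison string from the selected columns."""
    return " | ".join(normalize(row[c]) for c in columns)


def assign_clusters(rows, columns, precision=0.5):
    """Greedy clustering (an assumed stand-in, not the product's algorithm).

    A row joins the first cluster whose representative key is at least
    `precision` similar; otherwise it starts a new cluster.
    """
    representatives = []  # key of the first row seen in each cluster
    cluster_ids = []
    for row in rows:
        key = row_key(row, columns)
        for cid, rep in enumerate(representatives):
            if SequenceMatcher(None, key, rep).ratio() >= precision:
                cluster_ids.append(cid)
                break
        else:
            representatives.append(key)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids


# Hypothetical sample data
rows = [
    {"name": "Jon Smith", "city": "Austin"},
    {"name": "John Smith", "city": "Austin"},
    {"name": "Maria Garcia", "city": "Denver"},
    {"name": "jon  smith", "city": "austin"},
]

cluster_ids = assign_clusters(rows, ["name", "city"], precision=0.85)
for row, cid in zip(rows, cluster_ids):
    row["cluster_id"] = cid  # analogous to the generated cluster_id column

sizes = Counter(cluster_ids)
distinct_count = len(sizes)
possible_duplicates = sum(n for n in sizes.values() if n > 1)
print(distinct_count, possible_duplicates)
```

In this sketch, a lower `precision` groups more loosely related rows together, and a higher one keeps only near-identical rows in the same cluster, which is one plausible reason to start near the middle of the range and adjust from there.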
Handling Sensitive Information
While preparing data for downstream applications, you often come across confidential or sensitive data such as credit card numbers, social security numbers, and personal information like birthdates. Oracle Big Data Preparation Cloud Service gives you options to handle sensitive information by completely or partially removing data, or by extracting just the information you need. You edit a transform on its authoring page. The recommendation engine automatically identifies columns that could contain sensitive information and displays privacy alerts. When you select a column with a privacy alert, the options to handle the sensitive information are listed in the recommendations panel.
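As a rough illustration of how columns might be flagged for privacy alerts, the following Python sketch checks sample values against a few patterns. The recommendation engine's actual detection logic is not described here, so the patterns, the `privacy_alerts` function, the 0.8 threshold, and the sample data are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; real detection is likely far more thorough.
PATTERNS = {
    "credit_card": re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def privacy_alerts(columns, threshold=0.8):
    """Return {column_name: label} for columns whose non-empty sample
    values match one pattern in at least `threshold` of cases."""
    alerts = {}
    for name, values in columns.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.match(v))
            if hits / len(non_empty) >= threshold:
                alerts[name] = label
                break
    return alerts


# Hypothetical sample columns
sample = {
    "card": ["4111-1111-1111-1111", "5500 0000 0000 0004"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "birthdate": ["1990-04-12", "1985-11-30"],
    "city": ["Austin", "Denver"],
}

alerts = privacy_alerts(sample)
print(alerts)
```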
Highly Sensitive Data
You can completely obfuscate highly sensitive data like credit card numbers by invoking the context menu for the column and selecting obfuscate, or by applying the recommendation. To apply a recommendation to your transform script, select the check mark next to the recommendation. The action is added to the transform script, and the sample data is updated immediately. If you only need partial results for certain data categories, such as a birth month or the last four digits of a social security number, then partial obfuscation or extraction might be a good option for you. You can execute these actions by applying a recommendation or by using the “extract” option from the More Actions menu.
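The difference between full obfuscation, partial obfuscation, and extraction can be sketched in a few lines of Python. These helpers are hypothetical and only illustrate the effect on a value. They do not reproduce the service's transform script syntax or its actual masking behavior, and the "X" mask character is an assumption.

```python
from datetime import date


def obfuscate(value, mask="X"):
    """Full obfuscation: mask every letter and digit, keep separators."""
    return "".join(mask if ch.isalnum() else ch for ch in value)


def obfuscate_except_last(value, keep=4, mask="X"):
    """Partial obfuscation: mask all letters and digits except the last `keep`."""
    total = sum(ch.isalnum() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isalnum():
            seen += 1
            out.append(ch if seen > total - keep else mask)
        else:
            out.append(ch)
    return "".join(out)


def extract_last_digits(value, count=4):
    """Extraction: return only the last `count` digits of a value."""
    digits = [ch for ch in value if ch.isdigit()]
    return "".join(digits[-count:])


def extract_birth_month(iso_date):
    """Extraction: return the month number from a YYYY-MM-DD date string."""
    return date.fromisoformat(iso_date).month


# Hypothetical sample values
print(obfuscate("4111-1111-1111-1111"))
print(obfuscate_except_last("123-45-6789"))
print(extract_last_digits("123-45-6789"))
print(extract_birth_month("1990-04-12"))
```

Full obfuscation leaves nothing recoverable in the column, while the partial and extraction variants keep just enough for tasks such as verifying the last four digits or grouping by birth month.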