Personal Data is typically any data that can be used to distinguish or trace a living person (the data subject). The exact definition of Personal Data may differ depending on the legislation.
A data subject is an identifiable living person to whom a particular data item relates. A data subject may be given the ability to inquire about or remove their data according to a particular practice, standard, rule or regulation.
The training data is an initial set of data items (examples) used to train an AI model to produce results. Training sets typically make up the majority of the total data available for model development. The allocation ratio between Training, Test, and Validation data is usually around 60%:20%:20%.
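As an illustration of the 60%:20%:20% allocation, here is a minimal sketch that shuffles a data set and splits it into training, test, and validation parts. The function name `split_dataset`, its parameters, and the fixed seed are assumptions made for this example only.

```python
import random

def split_dataset(items, train=0.6, test=0.2, seed=0):
    """Shuffle items and split them into training, test, and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains after the first two
    return training, testing, validation

training, testing, validation = split_dataset(range(100))
print(len(training), len(testing), len(validation))  # 60 20 20
```

Shuffling before splitting is a common precaution so that no part ends up with only the first or last items of an ordered source.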
In supervised machine learning, a model is trained using a labeled training data set. The labeling (or tagging, annotation) procedure typically involves a human Labeler who augments each item of unlabeled data with meaningful, informative tags (or labels). For example, labels might indicate whether a photo contains a dog or a cat, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, whether a dot in an X-ray is a tumor, etc.
In Machine Learning, a Labeler is an individual typically involved in constructing Training Data sets by assigning a label to every data item. Labelers rely on their own knowledge and experience to assign labels, which can introduce bias into the labels and therefore into the predictions of a machine learning model.
The testing data set is used to provide an unbiased evaluation of a final model fit on the training data set. After a model has been trained on the training data set, you test it by making predictions against the test data set.
A validation data set is a set of data held out from training and used to compare candidate models and tune their settings (hyperparameters) in order to find the best model for a given problem. Validation sets are also known as dev sets.
Artificial intelligence (AI)
A system that makes it possible for a computer to learn from experience, adjust to new inputs, and perform tasks commonly associated with human intelligence. Today's AI is more properly described as “narrow AI” (or “weak AI”), as it is designed to perform a narrow task (e.g. recognize objects, classify images, recommend products, conduct coaching conversations).
A decision space is the set of all possible decisions that the system can make. For example, the decision space of a traffic light is a set of states that it shows, such as Green, Yellow, Red, blinking Yellow, and Not working at all.
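The traffic light example can be written down as an enumeration. This is only a sketch; the names `TrafficLightState` and `decision_space` are assumptions chosen for illustration.

```python
from enum import Enum

class TrafficLightState(Enum):
    """Every state the traffic light can show."""
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    BLINKING_YELLOW = "blinking yellow"
    NOT_WORKING = "not working"

# The decision space is the set of all members of the enumeration.
decision_space = set(TrafficLightState)
print(len(decision_space))  # 5
```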
In computer science, an algorithm is a finite sequence of well-defined instructions that could be implemented in a computer program, typically to solve a class of problems or to perform a computation. The word is often paired with a term naming the activity the rules were designed for, as in "sorting algorithm" or "search algorithm".
Open-source software (OSS) is a type of computer software in which source code is released under a license where the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. The main principle of Open Source is peer production, with products such as source code, blueprints, and documentation freely available to the public.
Open data is data that can be freely accessed, used, shared and built on by anyone, anywhere, for any purpose. This is a summary of the full Open Definition.
Machine learning is a subset of methods in Artificial Intelligence (AI) that studies algorithms and statistical models which give systems the ability to learn and improve autonomously from experience without being explicitly programmed.
Supervised learning consists of learning a mapping from data to known labels, which together make up a Training Data Set. The labels are provided by human Subject-matter Experts during the Data Labeling process. Training the model means achieving sufficient prediction accuracy on real-world examples. Such models are tested using Test and Validation data sets.
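To make the idea of learning from labeled examples concrete, the following sketch uses a 1-nearest-neighbor rule, one of the simplest supervised methods. The measurements, labels, and the names `training_set`, `test_set`, and `predict` are all made up for illustration.

```python
import math

# Hypothetical labeled training set: (height_cm, weight_kg) -> label
training_set = [
    ((25.0, 4.0), "cat"),
    ((23.0, 3.5), "cat"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(training_set, key=lambda example: math.dist(example[0], features))
    return nearest[1]

# Hypothetical labeled test set, never used when building the model
test_set = [((24.0, 4.2), "cat"), ((58.0, 28.0), "dog")]
correct = sum(predict(features) == label for features, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0
```

The test examples are kept separate from the training examples so that the measured accuracy says something about data the model has not seen.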
Unsupervised learning is where the input data is unlabeled and the system tries to learn the structure from that data automatically, without any human guidance. Anomaly detection, such as flagging unusual credit card transactions to prevent fraud, is an example of unsupervised learning.
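As a rough sketch of anomaly detection without labels, the code below flags values that lie far from the mean, measured in standard deviations (a z-score rule). This is one simple technique among many, and the transaction amounts, the `find_anomalies` name, and the threshold of 2.0 are assumptions for this example.

```python
import statistics

def find_anomalies(amounts, threshold=2.0):
    """Return the amounts whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# Hypothetical card transactions; no item carries a "fraud" label
transactions = [12.5, 9.9, 14.2, 11.0, 10.4, 13.1, 950.0, 12.0]
print(find_anomalies(transactions))  # [950.0]
```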
Semi-supervised learning is a combination of supervised and unsupervised learning. That is, the system trains on partially labeled input data, usually a lot of unlabeled data and a little bit of labeled data. Facial recognition in photo services from Facebook and Google is a real-world application of this approach.
Reinforcement learning occurs when a computer system receives data in a specific environment and then learns how to maximize its outcomes for particular criteria. Reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Applications range from robotics and personalized recommendations to drug development.
Robotics is the intersection of science, engineering and technology that produces machines called robots. A robot has three consistent characteristics. First, it has a mechanical component that allows it to complete tasks in the environment for which it is designed. Second, it requires a source of power, typically electrical. Third, it executes a computer program in order to operate. Robots can take any form, but some are made to resemble humans in appearance (called “androids”).
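A very small sketch of learning from rewards rather than labels is an epsilon-greedy agent choosing between a few actions. The setting (fixed reward per action), the function name `run_bandit`, and all parameter values are assumptions made for illustration; real reinforcement learning problems usually involve states and noisy rewards as well.

```python
import random

def run_bandit(rewards, steps=500, epsilon=0.1, seed=0):
    """Estimate the value of each action by trial and error (epsilon-greedy)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # current value estimate per action
    counts = [0] * len(rewards)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore: pick a random action
        else:
            action = max(range(len(rewards)), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]  # the environment only reports a reward, never a label
        counts[action] += 1
        # Incremental average of the rewards observed for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

# Hypothetical environment: action 0 always pays 0.2, action 1 always pays 1.0
estimates = run_bandit([0.2, 1.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action)
```

Nothing tells the agent which action is correct; it only discovers the better-paying action because occasional random exploration lets it observe that action's reward.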
Transfer learning involves reusing a model that was trained while solving one problem and applying it to a different but related problem. For example, a deep learning model trained on millions of images of cats could be “fine-tuned” to detect melanoma in medical imaging.
Subject-matter Expert (SME)
The term Subject-matter Expert or Domain Expert is frequently used in software development, where it refers to a person with knowledge in a domain other than software. Typically, SMEs have developed their expertise in their particular discipline over a long period of time and after a great deal of immersion in the topic.
In a biased prediction (estimation, decision), the expected value of the result differs from the true underlying value being estimated. Bias is a statistical property and can be viewed as a systematic error introduced into measurement, sampling, or testing by selecting or encouraging one outcome or answer over the others.
Human-in-the-loop is a design pattern in AI that leverages both human and machine intelligence to create machine learning models and to bring meaningful automation scenarios into the real world. The human-in-the-loop approach reframes an automation problem as a Human-Computer Interaction (HCI) design problem. With this approach, AI systems are designed to augment or enhance human capacity, serving as a tool to be exercised through human interaction.
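The statistical definition can be written compactly. The symbols are introduced here for illustration: $\theta$ is the true underlying value and $\hat{\theta}$ is the prediction or estimate of it.

```latex
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
```

A prediction is unbiased when this difference is zero, and biased otherwise.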
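One common way to put a human in the loop is to let the model act alone only when it is confident and to hand everything else to a person. The sketch below assumes a hypothetical `route_prediction` function and a confidence threshold of 0.9; neither comes from a specific library.

```python
def route_prediction(label, confidence, threshold=0.9):
    """Accept confident predictions automatically; send the rest to a human reviewer."""
    if confidence >= threshold:
        return ("auto", label)
    return ("human_review", label)

# Hypothetical model outputs: (predicted label, confidence between 0 and 1)
decisions = [route_prediction("tumor", 0.97), route_prediction("tumor", 0.55)]
print(decisions)  # [('auto', 'tumor'), ('human_review', 'tumor')]
```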
Ethics is a system of moral values and principles of conduct governing an individual, a group, or an AI system when it acts as an autonomous decision-maker.
Limited Access Data
Limited Access Data is a set of data that may be disclosed to an outside party without data subject authorization if certain conditions are met. First, the purpose of the disclosure may usually only be research, security, public health or health care operations. Second, the organization or person receiving the Limited Access Data must sign a data use agreement with the Data Controller.
Data Controller is the entity that takes ownership of personal data and determines how and by whom it is handled.
Data Processor is the entity that handles personal data on behalf of the controller. A data processor may have subprocessors.