Aug 11, 2021, 10:53am

Data Terminology: Important concepts and definitions in data.

Data
Personal data
Data dividend
Data income
Data brokerage
First-party data
Cohorts
Data misuse
Data poisoning
Data pooling
Data subject

1. What is data?

Data is knowledge; it’s power, it’s insight. But at its core, data is information. Specifically, data is a unit of information, just like a mile is a unit of distance. Collected through observation, data is stored in notebooks, on hard drives, and even in your own mind. That’s right: every memory, idea, recollection, and association you can think up is a result of data you’ve collected. Data comes in two types: qualitative and quantitative. Qualitative data describes the qualities of something (what is it? what is it like?). Quantitative data measures the quantity of something (how many are there?). Together, these collections of numbers, text, and facts are used to calculate, analyze, predict, or solve for some unknown.

Unprocessed data—loose collections of numbers and facts without any clear structure or intention—is known as raw data. Raw data by itself offers little value, but when that data is refined, analyzed, and measured, analysts and data scientists can use that information to make informed decisions.

Right now, privacy concerns have made data a hot topic in our cultural conversation. Unfortunately, many people don’t know how or when their data is collected or used, raising serious concerns with governing bodies and consumers alike.

Companies can access your data in a variety of ways. Sometimes it’s first-party data that belongs to and is used by a company you already frequent, and other times advertisers purchase this data from a brokerage to market their products to you. Specifically, data can include anything from personal information about your past purchases and online browsing history to your gender and even your physical location. One thing’s for sure: as long as you have a smartphone or online presence, there will be companies collecting your data and using it to advertise their products.

2. What is personal data?

Personal data is information about a person that can identify them. The Internet has made it easy to collect large amounts of personally identifiable information (PII) on people. Names, social security numbers, dates or places of birth, phone numbers, IP addresses, and internet cookies are all considered types of PII. The collection and analysis of personal data created the multi-billion-dollar data brokerage industry where companies collect and resell personal data, primarily for advertising purposes. The personal data market is so profitable that many social media platforms and apps maintain their “free” status through the massive amounts of user data they collect and sell.

Many places are beginning to regulate the use and processing of personal data. For example, the California Consumer Privacy Act (CCPA) and the European Union’s General Data Protection Regulation (GDPR) establish clear data privacy principles that limit the distribution and accessibility of personally identifiable information, while many other governing bodies worldwide are creating and enacting their own legislation. However, there is some confusion about precisely how to define personally identifiable information for the sake of regulatory standards. Some personal data is considered linkable PII: something that is not directly identifying on its own, like an IP address, but that can be used with some effort to identify a person.

While there is no set global standard or definition of PII, many regulatory agencies, including the National Institute of Standards and Technology (NIST), have come up with their own definitions of what constitutes personally identifiable information.

3. What is a data dividend?

Your personal data dividend is your cut of the profits made from the data that you have created. It’s exactly what Invisibly advocates for, and helps you to claim.

4. What is data income?

Data income is money earned from sharing or licensing your personal data. Big tech companies, data brokers, and advertisers make nearly $24.2 billion every year aggregating, analyzing, and selling our personal data. A data income model removes the middleman: it connects you with the companies that were already buying your data in the first place and puts that cash directly into your pocket.

This model gives everyday people a chance not only to profit from their digital habits but also to exercise greater control over which companies access their data and how that data is used. By wresting our data from platforms and brokerages, we finally have the option to make money by licensing consented data directly to advertisers.

5. What is data brokerage?

Data brokerages are businesses that compile raw data from numerous sources and then sort and analyze it for meaning. These brokerages then license the analyzed data to other organizations. Data brokers can also directly license another company’s data or help companies process their data to uncover more valuable insights.

Data brokers source their data based on the products their customers sell. Generally, that information is gathered through website cookies and free apps, which collect mountains of information simply as people use their connected devices.

Data brokering has become a highly lucrative industry because the insights these firms generate help companies target new customers. Without data brokerages, companies have to rely on first-party data and cast a much wider net with their advertising dollars, placing ads in front of anyone and everyone without any clue who may be interested in their product. As a result, companies of all sizes and industries, and even government entities, rely on data brokers to help share their message with the ideal audience.

6. What is first-party data?

First-party data is data that belongs to, and is used by, the company that collected it directly from its own customers. More limits and regulations mean fewer resources for marketers, developers, sales teams, and any other division reliant on data-driven customer insights. Simply put, many of the tools companies have relied on in recent years may be forced out of existence, either through legislation or in the court of public opinion. This shift will force data workers to revisit more traditional engagement and development processes to create a new era of audience identification strategies.

7. What are cohorts?

Despite GDPR policies, many companies outside of the EU have huge swaths of customer data just stored away collecting digital dust. Companies didn’t have a plan for their customer data; they just knew that they wanted as much of it as they could scavenge. As regulations continue to redefine what is and isn’t acceptable data practice, what brands do with these data graveyards matters a great deal to their customers. Not only are these wastelands significant financial burdens for companies to structure and maintain, but they’re also an obvious target for bad actors looking to steal customer data.

8. What is data misuse?

Data misuse is the inappropriate use of data, where data is used in ways or by people beyond its stated intention. Every region has its own laws and policies that shape data use protocols. Generally, when data is collected, the collector is expected to outline that data’s specific intended and acceptable use.

Today, data misuse is more common because so many employees and third-party partners have access to sensitive company information. Not to be confused with data theft, data misuse is rarely due to malicious intent or collection without consent; it’s more often a result of ignoring specific permissions and allowable use cases for personal data.

For example, a credit card company employee might peek at a friend’s balance, or a person working for a ride-sharing app might track a specific customer’s location. Both are no-nos by all standards. Even something as simple as using company software for personal use can be considered data misuse.

Data misuse is a huge threat to privacy and security and often comes with specific penalties outlined in company policies. Unfortunately, many companies don’t have clearly defined cyber policies to prevent data misuse other than terminating an employee.
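One way to make "intended and acceptable use" enforceable is to record it next to the data and check it on every access. The sketch below is a minimal illustration under stated assumptions, not a description of any particular company's policy system: the `DataGrant` class, the `is_permitted` function, and the role and purpose names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataGrant:
    """Hypothetical record of who may use a dataset and for what purpose."""
    dataset: str
    allowed_roles: frozenset
    allowed_purposes: frozenset


def is_permitted(grant, role, purpose):
    """Return True only if both the role and the stated purpose are allowed."""
    return role in grant.allowed_roles and purpose in grant.allowed_purposes


# Hypothetical grant: support agents may view balances to resolve billing disputes.
balances = DataGrant(
    dataset="card_balances",
    allowed_roles=frozenset({"support_agent"}),
    allowed_purposes=frozenset({"billing_dispute"}),
)

legitimate = is_permitted(balances, "support_agent", "billing_dispute")
curious = is_permitted(balances, "support_agent", "personal_curiosity")
```

In this sketch the same employee is allowed through for a billing dispute and refused when the stated purpose falls outside the grant, which mirrors the credit card example above.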

9. What is data poisoning?

Data poisoning occurs when the training data of an AI or machine learning model is corrupted, producing inaccurate output. Data poisoning is often a deliberate criminal attack on the integrity of the system. The difference between data poisoning and other cyber-attacks is that eventually, the poisoning becomes an accepted part of the AI. Attackers learn how the system learns and feed it the wrong information in order to exploit the model.

Another way to corrupt data is to introduce the attack before the machine learning can begin. Compared to corrupting an existing system, this approach gives bad actors an easier way in because it reduces the security protocols they have to bypass: criminals can poison the learning process before it starts. By the time a developer or engineer realizes something’s wrong, it’s already too late.

Data poisoning takes much longer than most other cyber-attacks, so it’s difficult to pinpoint exactly when the data was corrupted. AIs are constantly learning and updating to make the most accurate predictions based on their inputs. Once a model has been corrupted, the only way to fix it is to retrain the system from the ground up. Realistically, the best defense against data poisoning is preventing it in the first place through validity checks, regression testing, rate limits, and various other measures. General digital hygiene can help as well: limiting who has access to the machine learning system, not sharing passwords, and so on.
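Two of the preventive measures named above, validity checks and rate limits, can be sketched in a few lines. This is a minimal illustration, assuming training samples arrive as `(source_id, value)` pairs with a single numeric feature; the function name `filter_training_samples`, the bounds in `VALID_RANGE`, and the cap in `MAX_SAMPLES_PER_SOURCE` are hypothetical choices, and a real pipeline would need checks suited to its own data.

```python
from collections import defaultdict

# Hypothetical bounds for one numeric feature; real limits depend on the data.
VALID_RANGE = (0.0, 100.0)
# Hypothetical rate limit: how many samples any one source may contribute.
MAX_SAMPLES_PER_SOURCE = 3


def filter_training_samples(samples):
    """Keep only samples that pass a range check and a per-source rate limit.

    `samples` is an iterable of (source_id, value) pairs.
    """
    accepted = []
    per_source = defaultdict(int)
    low, high = VALID_RANGE
    for source_id, value in samples:
        # Validity check: reject non-numeric values (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Validity check: reject out-of-range values (NaN also fails this test).
        if not (low <= value <= high):
            continue
        # Rate limit: cap how much a single source can influence training.
        if per_source[source_id] >= MAX_SAMPLES_PER_SOURCE:
            continue
        per_source[source_id] += 1
        accepted.append((source_id, value))
    return accepted


incoming = [
    ("a", 10), ("a", 20), ("a", 30), ("a", 40),  # fourth sample exceeds the cap
    ("b", 500),                                   # out of range
    ("b", "x"),                                   # not numeric
    ("c", 50.5),
]
clean = filter_training_samples(incoming)
```

Checks like these do not stop a patient attacker who stays within the allowed range, which is why they are usually combined with the other measures listed above.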

10. What is data pooling?

Data pooling is the process of consolidating data from a large number of sources in a single, centralized database where it can be analyzed and compiled into a standardized format. Database software then verifies and synchronizes that information.

A data pool, in turn, is a related set of values obtained from a single source. A data pool can be any data set meant for analysis: employee records, patient medical information, Global Trade Item Numbers (GTIN), etc. Data pools can be private or shared. A private pool can be seen only by its administrator, while most web-based data pools are shared between different sources and can be accessed by anyone with permission.

The key attributes within each data pool allow trading partners to synchronize information easily. While most data is collected through automation, how that information is collected can affect its accuracy and, in many cases, its usability.
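The pooling process described in this section can be sketched as follows: records from several sources are standardized and merged on a shared key attribute. This is a minimal illustration, not how any particular data pool product works; the function name `pool_records`, the choice of `gtin` as the key, and the sample records are assumptions made for the example.

```python
def pool_records(sources, key="gtin"):
    """Merge records from several sources into one pool, keyed on `key`.

    `sources` maps a source name to a list of dict records.
    """
    pool = {}
    for source_name, records in sources.items():
        for record in records:
            # Standardize: lowercase field names, trim whitespace in string values.
            clean = {
                str(field).strip().lower(): (
                    value.strip() if isinstance(value, str) else value
                )
                for field, value in record.items()
            }
            record_key = clean.get(key)
            if record_key is None:
                continue  # a record without the key attribute cannot be synchronized
            merged = pool.setdefault(record_key, {"sources": []})
            # Later sources fill in missing fields but never overwrite existing ones.
            for field, value in clean.items():
                merged.setdefault(field, value)
            merged["sources"].append(source_name)
    return pool


# Hypothetical inputs from two trading partners describing the same product.
sources = {
    "retailer": [{"GTIN": "0001", "Name": " Widget "}],
    "supplier": [{"gtin": "0001", "weight_g": 120}, {"name": "no key"}],
}
pool = pool_records(sources)
```

Here the first source to supply a field wins. That is only one possible conflict rule, and a shared pool would normally need an agreed policy for deciding which source is authoritative.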

11. What is a data subject?

A data subject is any person whose personal data has been collected. Data subjects can potentially be identified, either directly or indirectly, through the data that has been collected on them. Companies use this data for various reasons, which should be clearly communicated to the subject: specifically, what data is being collected, why, and how it is being used.

Depending on the jurisdiction, data subjects may have different rights under data protection laws such as the GDPR or the CCPA. Data subjects should know their rights, how to exercise them, and what to do when their rights are violated.

Frequently, those rights include the right to know what data is being collected, how it is being used, how long it will be kept, and whether it is shared with any third parties. In many regions, data subjects have the right to request their own data at any time for any reason, and if they find the data is incorrect, they have the right to have it updated. They may also have the right to request that their data be erased or restricted, and the right to use their data for their own purposes. Most importantly, data subjects should have the right to refuse to let companies collect data on them.
