Data profiling involves analyzing data to understand its structure, content, and quality. Here's a vocabulary list associated with data profiling:
Basic Concepts:
- Data Profiling: The process of examining data to collect statistics and metadata about it.
- Data Source: The origin of the data being profiled (e.g., a database table, a file, an API).
- Dataset (Data Set): A collection of data being analyzed.
- Column/Field/Attribute: A specific data element within a dataset (e.g., "customer name," "product price").
- Row/Record/Tuple: A single instance of data within a dataset.
- Metadata: Data about data (e.g., data type, length, format).
Statistical Measures:
- Count: The number of rows in a dataset or the number of non-null values in a column.
- Distinct Count: The number of unique values in a column.
- Null Count/Missing Values: The number of missing or null values in a column.
- Min/Max: The minimum and maximum values in a column.
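- Mean/Average: The sum of the values in a numeric column divided by the number of non-null values.
- Median: The middle value in a sorted column; with an even number of values, the average of the two middle values.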
- Mode: The most frequent value in a column.
- Standard Deviation: A measure of the spread or dispersion of values in a column.
- Variance: The average squared deviation from the mean; equal to the square of the standard deviation.
- Range: The difference between the maximum and minimum values.
- Frequency Distribution: A summary of how often each value occurs in a column.
- Data Type: The type of data stored in a column (e.g., integer, string, date).
- Data Length: The number of characters or bytes in a data value.
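
The measures above can be sketched with Python's standard library. This is a minimal illustration, not a full profiler: the function name `profile_numeric` and the sample `prices` column are hypothetical, `None` is assumed to mark a missing value, and the population forms of variance and standard deviation are used (a tool may use the sample forms instead).

```python
import statistics
from collections import Counter

def profile_numeric(values):
    """Compute basic statistical measures for one numeric column.

    Assumes None marks a missing value and at least one value is present.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(values),                  # all rows
        "non_null_count": len(present),        # rows with a value
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "variance": statistics.pvariance(present),  # population variance
        "std_dev": statistics.pstdev(present),      # population standard deviation
        "frequency": Counter(present),              # value -> occurrences
    }

# Hypothetical sample column with one missing value.
prices = [10, 20, 20, None, 50]
profile = profile_numeric(prices)
```

Here the null is excluded before any statistic is computed, so the mean is taken over the four present values only.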
Data Quality Metrics:
- Completeness: The extent to which data is not missing.
- Accuracy: The extent to which data is correct and free from errors.
- Consistency: The extent to which the same data agrees across different sources and formats.
- Validity: The extent to which data conforms to defined rules and constraints.
- Uniqueness: The extent to which values are unique within a column.
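
Some of these metrics can be expressed as simple ratios. The sketch below shows one common convention, not the only one: the function names, the sample `ages` column, and the 0 to 120 age rule are all assumptions made for illustration, and `None` is assumed to mark a missing value. Accuracy and consistency are left out because they need a trusted reference or a second source to compare against.

```python
def completeness(values):
    """Share of values that are present (not None)."""
    return sum(v is not None for v in values) / len(values)

def uniqueness(values):
    """Share of non-null values that are distinct."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present)

def validity(values, rule):
    """Share of non-null values that satisfy a rule (a function returning bool)."""
    present = [v for v in values if v is not None]
    return sum(1 for v in present if rule(v)) / len(present)

# Hypothetical sample column: one missing value, one duplicate, one invalid age.
ages = [34, 27, None, 34, -5]

completeness_score = completeness(ages)
uniqueness_score = uniqueness(ages)
validity_score = validity(ages, lambda age: 0 <= age <= 120)
```

Each function assumes a non-empty column; a real profiler would need to decide how to report a metric when there are no values to measure.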
Data Patterns and Structures:
- Data Pattern: The format or structure of data values (e.g., date format, email format).
- Regular Expression: A sequence of characters that defines a search pattern, often used to validate data patterns.
- Data Domain: The set of valid values for a column.
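
As a rough illustration of these three terms, the sketch below checks values against a pattern and a domain using Python's `re` module. The patterns are deliberately simplified, and the names `EMAIL_PATTERN`, `ISO_DATE_PATTERN`, `STATUS_DOMAIN`, and `value_shape` are hypothetical; a real email or date check would need stricter rules.

```python
import re

# Simplified patterns for illustration only; not complete validators.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
ISO_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical domain: the set of valid values for a "status" column.
STATUS_DOMAIN = {"active", "inactive", "pending"}

def matches_pattern(value, pattern):
    """True if the whole value matches the regular expression."""
    return pattern.fullmatch(value) is not None

def in_domain(value, domain):
    """True if the value belongs to the set of valid values."""
    return value in domain

def value_shape(value):
    """Reduce a value to its pattern: digits become 9, ASCII letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

email_ok = matches_pattern("jane@example.com", EMAIL_PATTERN)
date_ok = matches_pattern("31/01/2024", ISO_DATE_PATTERN)
status_ok = in_domain("archived", STATUS_DOMAIN)
shape = value_shape("AB-123")
```

Counting how often each shape from `value_shape` occurs in a column is one way to discover the data patterns actually present before writing validation rules.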
Data Profiling Outputs:
- Data Profile: A summary of the statistics and metadata collected during data profiling.
- Data Quality Report: A report summarizing the quality of the data based on various metrics.
Tools and Techniques:
- Data Profiling Tools: Software tools designed to automate the data profiling process.
- SQL Queries: Used to extract and analyze data for profiling.
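
To show what a profiling query might look like, the sketch below runs plain SQL through Python's built-in `sqlite3` module against an in-memory database. The `customers` table, its `city` column, and the sample rows are hypothetical, and the SQL is written for SQLite; other databases may need small syntax changes.

```python
import sqlite3

# In-memory database with a hypothetical table to profile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Oslo"), (2, "Lima"), (3, None), (4, "Oslo")],
)

# One pass over the table collects several column-level statistics.
column_profile = conn.execute(
    """
    SELECT COUNT(*)               AS row_count,
           COUNT(city)            AS non_null_count,
           COUNT(*) - COUNT(city) AS null_count,
           COUNT(DISTINCT city)   AS distinct_count,
           MIN(LENGTH(city))      AS min_length,
           MAX(LENGTH(city))      AS max_length
    FROM customers
    """
).fetchone()

# Frequency distribution of the non-null values, most frequent first.
frequencies = conn.execute(
    """
    SELECT city, COUNT(*) AS occurrences
    FROM customers
    WHERE city IS NOT NULL
    GROUP BY city
    ORDER BY occurrences DESC, city
    """
).fetchall()

conn.close()
```

`COUNT(*)` counts every row while `COUNT(city)` skips nulls, so the difference between the two gives the null count without a separate query.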
