
Thursday, January 23, 2025

Data profiling vocabulary

Data profiling is the analysis of data to understand its structure, content, and quality. The terms below make up its core vocabulary:

Basic Concepts:

  • Data Profiling: The process of examining data to collect statistics and metadata about it.
  • Data Source: The origin of the data being profiled (e.g., a database table, a file, an API).
  • Data Set (Dataset): The collection of data being analyzed, such as a single table or file.
  • Column/Field/Attribute: A specific data element within a dataset (e.g., "customer name," "product price").
  • Row/Record/Tuple: A single instance of data within a dataset.
  • Metadata: Data about data (e.g., data type, length, format).

Statistical Measures:

  • Count: The number of rows in a dataset or the number of non-null values in a column.
  • Distinct Count: The number of unique values in a column.
  • Null Count/Missing Values: The number of missing or null values in a column.
  • Min/Max: The minimum and maximum values in a column.
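  • Mean/Average: The sum of the values in a column divided by the number of values.
  • Median: The middle value of a sorted column; when the number of values is even, the average of the two middle values.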
  • Mode: The most frequent value in a column.
  • Standard Deviation: A measure of the spread or dispersion of values in a column.
  • Variance: The average squared deviation of values from the mean; equal to the square of the standard deviation.
  • Range: The difference between the maximum and minimum values.
  • Frequency Distribution: A summary of how often each value occurs in a column.
  • Data Type: The type of data stored in a column (e.g., integer, string, date).
  • Data Length: The number of characters or bytes in a data value.
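
To make the measures above concrete, here is a minimal sketch that computes them for a single numeric column using only the Python standard library. The function name profile_column and the sample prices list are hypothetical, None is assumed to mark a missing value, and the population formulas (pvariance, pstdev) are an arbitrary choice; a profiling tool may use the sample formulas instead.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Return basic profiling statistics for one column.

    Assumption: None marks a missing value and the rest are numbers.
    """
    present = [v for v in values if v is not None]
    profile = {
        "row_count": len(values),
        "non_null_count": len(present),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "frequency": Counter(present),  # value -> number of occurrences
    }
    if present:
        profile["min"] = min(present)
        profile["max"] = max(present)
        profile["range"] = max(present) - min(present)
        profile["mean"] = statistics.mean(present)
        profile["median"] = statistics.median(present)
        profile["mode"] = statistics.mode(present)
        # Population formulas: divide by n, not n - 1.
        profile["variance"] = statistics.pvariance(present)
        profile["std_dev"] = statistics.pstdev(present)
    return profile


# Hypothetical sample column with one missing value.
prices = [10, 20, 20, None, 50]
profile = profile_column(prices)
```

Here profile["null_count"] is 1, profile["distinct_count"] is 3, and profile["frequency"] shows that 20 occurs twice.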

Data Quality Metrics:

  • Completeness: The extent to which expected values are present rather than missing or null.
  • Accuracy: The extent to which data is correct and free from errors.
  • Consistency: The extent to which the same data agrees across different sources and formats.
  • Validity: The extent to which data conforms to defined rules and constraints.
  • Uniqueness: The extent to which values in a column are free of duplicates.
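
Some of these metrics are often expressed as ratios between 0 and 1. The sketch below shows one possible way to do that for completeness, uniqueness, and validity. The function names, the sample ages column, the 0 to 120 age rule, and the convention that an empty column scores 1.0 are all assumptions made for illustration. Accuracy and consistency are left out because they need a trusted reference or a second source to compare against.

```python
def completeness(values):
    """Fraction of values that are present (not None)."""
    if not values:
        return 1.0  # assumed convention for an empty column
    return sum(v is not None for v in values) / len(values)


def uniqueness(values):
    """Fraction of non-null values that are distinct."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0  # assumed convention when nothing is present
    return len(set(present)) / len(present)


def validity(values, is_valid):
    """Fraction of non-null values that satisfy the rule is_valid."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0  # assumed convention when nothing is present
    return sum(1 for v in present if is_valid(v)) / len(present)


# Hypothetical column: one missing value, one duplicate, one impossible age.
ages = [34, None, 29, 34, -5]

completeness_score = completeness(ages)
uniqueness_score = uniqueness(ages)
validity_score = validity(ages, lambda age: 0 <= age <= 120)
```

With this sample, four of the five values are present, three of the four present values are distinct, and three of the four present values pass the age rule.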

Data Patterns and Structures:

  • Data Pattern: The format or structure of data values (e.g., date format, email format).
  • Regular Expression: A sequence of characters that defines a search pattern, often used to validate data patterns.
  • Data Domain: The set of valid values for a column.
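
The three terms above can be illustrated with a short sketch: a regular expression check, a helper that reduces a value to its pattern, and a data domain check. The patterns are deliberately simplified (the email one is not a full address validator), and the names EMAIL_PATTERN, ISO_DATE_PATTERN, STATUS_DOMAIN, shape_of, and outside_domain are hypothetical.

```python
import re

# Simplified patterns for illustration only.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
ISO_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

# A data domain: the set of values a hypothetical status column may hold.
STATUS_DOMAIN = {"active", "inactive", "pending"}


def matches_pattern(value, pattern):
    """True if the whole value matches the compiled regular expression."""
    return pattern.fullmatch(value) is not None


def shape_of(value):
    """Reduce a value to its pattern: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and separators as they are
    return "".join(out)


def outside_domain(values, domain):
    """Return the values that are not members of the domain."""
    return [v for v in values if v not in domain]


email_ok = matches_pattern("ana@example.com", EMAIL_PATTERN)
date_shape = shape_of("2025-01-23")
bad_statuses = outside_domain(["active", "closed", "pending"], STATUS_DOMAIN)
```

Counting how often each shape occurs in a column is one way to discover its dominant data pattern and spot the values that deviate from it.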

Data Profiling Outputs:

  • Data Profile: A summary of the statistics and metadata collected during data profiling.
  • Data Quality Report: A report summarizing the quality of the data based on various metrics.

Tools and Techniques:

  • Data Profiling Tools: Software tools designed to automate the data profiling process.
  • SQL Queries: Queries that extract and aggregate data for profiling, typically with functions such as COUNT, MIN, and MAX and with GROUP BY.
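
As a rough illustration of profiling with SQL, the sketch below runs two queries against an in-memory SQLite database through Python's built-in sqlite3 module. The customers table, its columns, and the sample rows are hypothetical; the same queries should carry over to other SQL databases with little or no change.

```python
import sqlite3

# In-memory database with a hypothetical table to profile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, None),
        (3, "a@example.com"),
        (4, "b@example.com"),
    ],
)

# Row count, non-null count, distinct count, and null count in one pass.
# COUNT(email) and COUNT(DISTINCT email) both ignore NULLs.
row_count, non_null_count, distinct_count, null_count = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(email),
           COUNT(DISTINCT email),
           SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)
    FROM customers
    """
).fetchone()

# Frequency distribution of the non-null values, most common first.
frequency = conn.execute(
    """
    SELECT email, COUNT(*) AS n
    FROM customers
    WHERE email IS NOT NULL
    GROUP BY email
    ORDER BY n DESC, email
    """
).fetchall()

conn.close()
```

The first query returns the counts defined under Statistical Measures, and the second returns a frequency distribution as (value, count) pairs.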







By Jerry Ramonyai

