How Rich Metadata Powers Data Discovery in Modern Data Catalogs
Anuar Ustayev
Introduction
In our previous post, we explored what metadata is and how it transforms raw data files into understandable resources. But metadata's true power emerges when you're managing not just one CSV file, but hundreds or thousands of datasets in a data catalog or portal.
Imagine you’re a researcher looking for climate data about New York City in a government data portal with 5,000 datasets. Without good metadata and search capabilities, you’d need to browse through endless pages of cryptically named files. With rich metadata powering a search engine, you can simply type “climate data New York” and instantly see relevant datasets, ranked by how well they match your query. And that’s before we even get to natural language queries for data discovery, which we’ll explore in upcoming posts on modern data catalogs and portals.
The Scale Challenge: From Dozens to Thousands of Datasets
When Browsing Breaks Down
Small data collections work fine with simple browsing. A research team with 20 datasets can organize them in folders or basic categories. Users can scan through everything and find what they need.
But data catalogs grow quickly:
- Municipal open data portals: Often contain 50-2,000 datasets covering everything from parking violations to budget data.
- Enterprise data catalogs: Can house 500-50,000 datasets across departments and systems.
- Research repositories: May contain hundreds of thousands of datasets from different studies and institutions.
At this scale, browsing becomes impossible. Users need search capabilities—and search engines need rich, consistent metadata to deliver relevant results.
A common question:
Can our search engine query the actual data rows within tables?
Answer:
Not directly. Indexing every data row—especially for tables with millions of time-series records—would be prohibitively large and memory-intensive for engines like Elasticsearch. Instead, we index structured metadata (e.g., table schemas, column descriptions, data dictionaries) and generate salient “key points” or summaries. This approach keeps the search index lean and performant while still guiding users to the most relevant tables.
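To make that concrete, here is a minimal Python sketch of such an index document: it captures the table's schema and a few distilled key points, but none of the rows. The file and column names are hypothetical, and the exact fields would depend on your catalog.
import pandas as pd
# Hypothetical table with millions of rows; we index a compact summary, not the rows.
df = pd.read_csv("nyc_311_jan2023.csv")  # hypothetical file
index_document = {
    "title": "NYC 311 Service Requests - January 2023",
    "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
    "row_count": len(df),
    # Salient "key points" distilled from the data, not the data itself
    # ('created_date' and 'complaint_type' are assumed column names):
    "key_points": [
        f"{len(df):,} requests from {df['created_date'].min()} to {df['created_date'].max()}",
        "Top complaint types: " + ", ".join(df["complaint_type"].value_counts().head(3).index),
    ],
}
# The index document stays a few hundred bytes whether the table has
# a thousand rows or a hundred million.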
The Standardization Imperative
When you have thousands of datasets, inconsistent or insufficient metadata creates chaos:
// Inconsistent or insufficient metadata across datasets:
Dataset 1: "nyc_311_jan2023.csv"
Dataset 2: "New York City 311 Service Requests - January 2023.xlsx"
Dataset 3: "January 2023 NYC Citizen Complaints and Service Calls.pdf"
Search engines can't effectively match these three related datasets to a user query like "NYC 311 data" because the metadata varies wildly.
The solution is a metadata schema—a standardized list of metadata fields that all datasets must follow. A simple example:
# Example metadata schema
required_fields:
  - title                 # Human-readable dataset name
  - description           # 2-3 sentence explanation
  - owner                 # Dataset maintainer/publisher
  - license               # Usage permissions
  - last_updated          # When data was last modified
recommended_fields:
  - tags                  # Keywords for discovery
  - category              # Thematic classification
  - format                # File format (CSV, JSON, etc.)
  - temporal_coverage     # Date range covered
optional_fields:
  - geographic_coverage   # Spatial extent
  - update_frequency      # How often data changes
  - data_quality_score    # Completeness/accuracy rating
This schema enables three critical capabilities:
- Systematic indexing: Search engines can reliably index the same fields across all datasets, ensuring consistent search behavior.
- Enforced quality: Required fields ensure every dataset has minimum discoverable information before publication (see the validation sketch after this list).
- Guided creation: Data publishers know exactly what metadata to provide, with clear distinctions between required, recommended, and optional fields.
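Here is a minimal Python sketch of that enforced-quality check, not tied to any particular catalog (the field names follow the example schema above): missing required fields block publication, while missing recommended fields only produce warnings.
REQUIRED_FIELDS = ["title", "description", "owner", "license", "last_updated"]
RECOMMENDED_FIELDS = ["tags", "category", "format", "temporal_coverage"]
def validate_metadata(metadata: dict) -> tuple[bool, list[str]]:
    """Return (is_valid, messages) for a dataset's metadata dict."""
    messages, valid = [], True
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            messages.append(f"ERROR: required field '{field}' is missing")
            valid = False  # blocks publication
    for field in RECOMMENDED_FIELDS:
        if not metadata.get(field):
            messages.append(f"WARNING: recommended field '{field}' is empty")
    return valid, messages
# Usage: this dataset is valid but collects warnings for the empty recommended fields.
ok, messages = validate_metadata({
    "title": "New York City 311 Service Requests - January 2023",
    "description": "Citizen complaints and service requests from NYC's 311 system.",
    "owner": "NYC Open Data",
    "license": "CC-BY-4.0",
    "last_updated": "2023-02-01",
})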
Note about our simple yet powerful metadata standard and tooling
Frictionless Data is a progressive open-source framework for building data infrastructure – data management, data integration, data flows, etc. It includes various data standards and provides software to work with data.
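For instance, the frictionless Python package can infer a metadata descriptor (schema, format, encoding) from a raw CSV and then validate the file against it. A quick sketch, with a hypothetical filename:
# pip install frictionless
from frictionless import describe, validate
resource = describe("weather_station_data_2023.csv")  # hypothetical file
print(resource.to_yaml())    # inferred metadata: path, format, table schema, ...
report = validate(resource)  # check the data against the inferred descriptor
print(report.valid)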
Search Engines: The Discovery Engine Behind Data Catalogs
Popular Platform Combinations
Modern data catalogs pair with powerful search engines to handle discovery at scale:
- CKAN + Apache Solr: CKAN uses Solr's faceted search and text analysis capabilities to index dataset metadata, enabling complex filtering by organization, tags, formats, and date ranges.
- OpenMetadata + Elasticsearch: OpenMetadata leverages Elasticsearch's real-time indexing to make data assets searchable immediately after ingestion, with support for complex queries across schemas, lineage, and usage patterns.
- Custom portals + various engines: Many organizations build custom data portals using search engines like Elasticsearch, Solr, or even newer solutions like Meilisearch for specific performance requirements.
How Search Engines Index Metadata
Here's what happens when you add a new dataset to a catalog:
- Ingestion: The catalog system reads both built-in metadata (filename, format, size) and external metadata files (title, description, tags).
- Parsing & Validation: Metadata is parsed according to the catalog's schema and validated for required fields.
- Transformation: Data is normalized—dates converted to standard formats, tags split into arrays, categories mapped to controlled vocabularies.
- Indexing: The search engine creates searchable indexes from metadata fields, with different indexing strategies for text search vs. faceted filtering (the transformation and indexing steps are sketched in code after this list).
- Ranking: Search algorithms determine relevance based on text matching, metadata completeness, dataset popularity, and recency.
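A minimal sketch of the transformation and indexing steps with the official Elasticsearch Python client (8.x-style API; the index name, source formats, and field handling are illustrative assumptions, not any particular catalog's implementation):
from datetime import datetime
from elasticsearch import Elasticsearch  # pip install elasticsearch
es = Elasticsearch("http://localhost:9200")
raw = {
    "title": "New York City 311 Service Requests - Q1 2023",
    "tags": "311; complaints; service-requests",  # inconsistent source format
    "last_updated": "03/20/2023",                 # non-ISO date
}
# Transformation: normalize before indexing.
doc = {
    "title": raw["title"],
    "tags": [t.strip() for t in raw["tags"].split(";")],  # tags split into an array
    "last_updated": datetime.strptime(raw["last_updated"], "%m/%d/%Y").date().isoformat(),
}
# Indexing: the engine builds searchable indexes from these normalized fields.
es.index(index="datasets", id="nyc-311-2023-q1", document=doc)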
Rich Metadata = Better Matches
Title & Description: The Foundation of Text Search
Compare these two approaches to naming the same dataset:
Minimal metadata approach:
filename: NYC-311-2023-Q1.csv
title: [same as filename]
description: [empty]
Rich metadata approach:
filename: NYC-311-2023-Q1.csv
title: New York City 311 Service Requests - Q1 2023
description: Comprehensive dataset of citizen complaints, service requests,
and municipal responses from NYC's 311 system during January-March 2023.
Includes request types, geographic distribution, response times, and
resolution status for over 500,000 service requests.
When someone searches for "NYC citizen complaints," the rich metadata provides multiple text matching opportunities:
- "NYC" matches "New York City"
- "citizen" matches "citizen complaints"
- "complaints" matches the description text
- "service requests" provides additional context
The search engine can confidently rank this dataset highly for the user's query.
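One caveat worth flagging: "NYC" matching "New York City" is not automatic text matching; it usually requires a synonym mapping in the engine's analyzer. A sketch of such a configuration with the Elasticsearch Python client (the index name and synonym list are illustrative):
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# Create an index whose analyzer treats "nyc" and "new york city" as equivalent,
# so a query for either form matches documents containing the other.
es.indices.create(
    index="datasets",
    settings={
        "analysis": {
            "filter": {
                "city_synonyms": {"type": "synonym", "synonyms": ["nyc, new york city"]}
            },
            "analyzer": {
                "metadata_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "city_synonyms"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "metadata_text"},
            "description": {"type": "text", "analyzer": "metadata_text"},
        }
    },
)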
Tags & Keywords: Enabling Faceted Discovery
Tags transform search from simple text matching to structured exploration:
# Example rich tagging for a transportation dataset
tags:
  - transportation
  - public-transit
  - bus-routes
  - geographic-data
  - real-time
  - gtfs-format
These tags enable users to:
- Filter results: "Show me only transportation datasets"
- Discover related data: "Other datasets tagged 'gtfs-format'"
- Refine searches: "Real-time transportation data" (see the query sketch below)
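Behind the scenes, these interactions typically combine a full-text query, a tag filter, and an aggregation that counts matching datasets per tag. A rough Elasticsearch sketch, assuming tags are indexed as exact-match keyword values:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# "Real-time transportation data": text match, restricted to one tag,
# plus a facet counting how many results carry each tag.
response = es.search(
    index="datasets",
    query={
        "bool": {
            "must": {"match": {"description": "real-time transportation"}},
            "filter": {"term": {"tags": "transportation"}},
        }
    },
    aggs={"tag_facets": {"terms": {"field": "tags"}}},
)
for bucket in response["aggregations"]["tag_facets"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # each tag and its dataset count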
Categories & Themes: Structured Browsing Paths
While search handles specific queries, categories provide structure for exploration:
Government Data Portal Categories:
├── Transportation
│   ├── Public Transit
│   ├── Traffic & Parking
│   └── Infrastructure
├── Environment
│   ├── Air Quality
│   ├── Water Resources
│   └── Climate & Weather
└── Demographics
    ├── Census Data
    ├── Housing
    └── Economic Indicators
Good categorization helps users discover datasets they didn't know existed.
Temporal & Spatial Metadata: Context That Matters
Time and location metadata enable powerful filtering:
# Temporal metadata
date_created: "2023-01-15"
date_modified: "2023-03-20"
temporal_coverage_start: "2023-01-01"
temporal_coverage_end: "2023-03-31"
update_frequency: "daily"
# Spatial metadata
geographic_coverage: "New York City, NY, USA"
bounding_box:
  north: 40.9176
  south: 40.4774
  east: -73.7004
  west: -74.2591
This enables queries like:
- "Show me datasets updated in the last week"
- "Find data covering Manhattan from 2020-2023"
- "Daily updated transportation data"
Real-World Example: Search in Action
Let's trace how rich metadata helps a user find relevant data:
User query: "climate data New York"
Dataset 1 (Poor metadata):
filename: weather_station_data_2023.csv
title: weather_station_data_2023
description: [empty]
tags: [empty]
Dataset 2 (Rich metadata):
filename: weather_station_data_2023.csv
title: New York State Climate Monitoring Network - 2023 Weather Data
description: Hourly temperature, precipitation, humidity, and wind measurements
from 45 weather stations across New York State. Data quality controlled and
validated by NYS Climate Office. Includes extreme weather events and monthly
climate summaries.
tags: climate, weather, temperature, precipitation, new-york, monitoring, hourly-data
geographic_coverage: New York State, USA
temporal_coverage: 2023-01-01 to 2023-12-31
Search engine matching process:
- Text relevance: Dataset 2 matches "climate" (exact match in tags and description) and "New York" (in title and geographic coverage).
- Metadata completeness bonus: Dataset 2 has rich descriptions and complete metadata fields, suggesting higher quality.
- Geographic precision: "New York State" closely matches the user's "New York" query.
- Thematic relevance: Climate-related tags confirm this dataset is specifically about climate data.
Result: Dataset 2 ranks much higher and provides the user with clear information about whether it meets their needs—all before they even download it.
Building for Discovery: Metadata Standards
Popular Metadata Schemas
Successful data catalogs adopt standardized metadata schemas:
DCAT (Data Catalog Vocabulary)
- W3C standard for web-based data catalogs
- Defines common properties: title, description, keywords, themes, publisher, license
- Enables cross-catalog data discovery and harvesting
Schema.org Dataset
- Structured data markup for web search engines
- Helps Google and other search engines understand and index your datasets (see the example markup after this list)
- Includes properties for data downloads, temporal coverage, and spatial coverage
Frictionless Data Package
- Lightweight standard with focus on data usability
- Emphasizes clear resource descriptions and table schemas
- Popular in research and scientific data communities
Custom Organizational Schemas
- Extended versions of standards with domain-specific fields
- Example: Adding "data_sensitivity_level" for enterprise catalogs
- Balance between standardization and specific organizational needs
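As an example of the second standard above, Schema.org Dataset markup is JSON-LD embedded in a dataset's landing page. A minimal sketch, generated here with Python (all values are illustrative):
import json
# Minimal Schema.org Dataset markup; embed the output in the landing page
# inside <script type="application/ld+json"> ... </script>.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "New York State Climate Monitoring Network - 2023 Weather Data",
    "description": "Hourly weather measurements from 45 stations across New York State.",
    "keywords": ["climate", "weather", "new-york"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "temporalCoverage": "2023-01-01/2023-12-31",
    "spatialCoverage": "New York State, USA",
}
print(json.dumps(dataset_markup, indent=2))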
The Consistency Advantage
The most important factor isn't which schema you choose—it's applying it consistently across all datasets. A catalog where every dataset has a clear title, description, tags, and category will outperform one where a handful of datasets carry exhaustive metadata and the rest have almost none.
# Minimal but consistent schema across all datasets:
title: [required - human readable name]
description: [required - 2-3 sentences explaining the data]
tags: [required - 3-5 relevant keywords]
category: [required - from controlled vocabulary]
license: [required - usage terms]
last_updated: [required - ISO date format]
This consistency enables:
- Reliable search results: Users can depend on finding complete information
- Effective filtering: Faceted search works when fields are consistently populated
- Automated processing: APIs and integrations can rely on standard fields being present
Conclusion
Rich metadata is the bridge between data creators and data users at scale. While a single CSV file might need only basic context, data catalogs with hundreds or thousands of datasets require standardized, searchable metadata to remain useful.
The combination of rich metadata and powerful search engines transforms data discovery from a frustrating browsing experience into targeted, relevant search results. Whether you're using CKAN with Solr, OpenMetadata with Elasticsearch, or building a custom solution, the principle remains the same: invest in consistent, descriptive metadata and your users will find exactly what they need.
In our next post, we'll dive into the practical aspects of implementing metadata standards in your data portal—covering schema design, validation workflows, and migration strategies for existing catalogs.