How Rich Metadata Powers Data Discovery in Modern Data Catalogs
Anuar Ustayev
Introduction
In our previous post, we explored what metadata is and how it transforms raw data files into understandable resources. But metadata's true power emerges when you're managing not just one CSV file, but hundreds or thousands of datasets in a data catalog or portal.
Imagine you’re a researcher looking for climate data about New York City in a government data portal with 5,000 datasets. Without good metadata and search capabilities, you’d need to browse through endless pages of cryptically named files. With rich metadata powering a search engine, you can simply type “climate data New York” and instantly see relevant datasets, ranked by how well they match your query. And that’s before we even get to natural language queries for data discovery, which we’ll explore in upcoming posts on modern data catalogs and portals.
The Scale Challenge: From Dozens to Thousands of Datasets
When Browsing Breaks Down
Small data collections work fine with simple browsing. A research team with 20 datasets can organize them in folders or basic categories. Users can scan through everything and find what they need.
But data catalogs grow quickly:
- Municipal open data portals: Often contain 50-2,000 datasets covering everything from parking violations to budget data.
- Enterprise data catalogs: Can house 500-50,000 datasets across departments and systems.
- Research repositories: May contain hundreds of thousands of datasets from different studies and institutions.
At this scale, browsing becomes impossible. Users need search capabilities—and search engines need rich, consistent metadata to deliver relevant results.
A common question:
Can our search engine query the actual data rows within tables?
Answer:
Not directly. Indexing every data row—especially for tables with millions of time-series records—would be prohibitively large and memory-intensive for engines like Elasticsearch. Instead, we index structured metadata (e.g., table schemas, column descriptions, data dictionaries) and generate salient “key points” or summaries. This approach keeps the search index lean and performant while still guiding users to the most relevant tables.
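To make that concrete, here is a minimal Python sketch of such an index document: it captures the table's schema and a few distilled key points, but none of the rows. The file and column names are hypothetical, and the exact fields would depend on your catalog.
import pandas as pd
# Hypothetical table with millions of rows; we index a compact summary, not the rows.
df = pd.read_csv("nyc_311_jan2023.csv")  # hypothetical file
index_document = {
    "title": "NYC 311 Service Requests - January 2023",
    "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
    "row_count": len(df),
    # Salient "key points" distilled from the data, not the data itself
    # ('created_date' and 'complaint_type' are assumed column names):
    "key_points": [
        f"{len(df):,} requests from {df['created_date'].min()} to {df['created_date'].max()}",
        "Top complaint types: " + ", ".join(df["complaint_type"].value_counts().head(3).index),
    ],
}
# The index document stays a few hundred bytes whether the table has
# a thousand rows or a hundred million.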
The Standardization Imperative
When you have thousands of datasets, inconsistent or insufficient metadata creates chaos:
// Inconsistent or insufficient metadata across datasets:
Dataset 1: "nyc_311_jan2023.csv"
Dataset 2: "New York City 311 Service Requests - January 2023.xlsx"
Dataset 3: "January 2023 NYC Citizen Complaints and Service Calls.pdf"
Search engines can't effectively match these three related datasets to a user query like "NYC 311 data" because the metadata varies wildly.
The solution is a metadata schema—a standardized list of metadata fields that all datasets must follow. A simple example:
# Example metadata schema
required_fields:
  - title                 # Human-readable dataset name
  - description           # 2-3 sentence explanation
  - owner                 # Dataset maintainer/publisher
  - license               # Usage permissions
  - last_updated          # When data was last modified
recommended_fields:
  - tags                  # Keywords for discovery
  - category              # Thematic classification
  - format                # File format (CSV, JSON, etc.)
  - temporal_coverage     # Date range covered
optional_fields:
  - geographic_coverage   # Spatial extent
  - update_frequency      # How often data changes
  - data_quality_score    # Completeness/accuracy rating
This schema enables three critical capabilities:
- Systematic indexing: Search engines can reliably index the same fields across all datasets, ensuring consistent search behavior.
- Enforced quality: Required fields ensure every dataset has minimum discoverable information before publication (see the validation sketch after this list).
- Guided creation: Data publishers know exactly what metadata to provide, with clear distinctions between required, recommended, and optional fields.
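Here is a minimal Python sketch of that enforced-quality check, not tied to any particular catalog (the field names follow the example schema above): missing required fields block publication, while missing recommended fields only produce warnings.
REQUIRED_FIELDS = ["title", "description", "owner", "license", "last_updated"]
RECOMMENDED_FIELDS = ["tags", "category", "format", "temporal_coverage"]
def validate_metadata(metadata: dict) -> tuple[bool, list[str]]:
    """Return (is_valid, messages) for a dataset's metadata dict."""
    messages, valid = [], True
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            messages.append(f"ERROR: required field '{field}' is missing")
            valid = False  # blocks publication
    for field in RECOMMENDED_FIELDS:
        if not metadata.get(field):
            messages.append(f"WARNING: recommended field '{field}' is empty")
    return valid, messages
# Usage: this dataset is valid but collects warnings for the empty recommended fields.
ok, messages = validate_metadata({
    "title": "New York City 311 Service Requests - January 2023",
    "description": "Citizen complaints and service requests from NYC's 311 system.",
    "owner": "NYC Open Data",
    "license": "CC-BY-4.0",
    "last_updated": "2023-02-01",
})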
Note about our simple yet powerful metadata standard and tooling
Frictionless Data is a progressive open-source framework for building data infrastructure – data management, data integration, data flows, etc. It includes various data standards and provides software to work with data.
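For instance, the frictionless Python package can infer a metadata descriptor (schema, format, encoding) from a raw CSV and then validate the file against it. A quick sketch, with a hypothetical filename:
# pip install frictionless
from frictionless import describe, validate
resource = describe("weather_station_data_2023.csv")  # hypothetical file
print(resource.to_yaml())    # inferred metadata: path, format, table schema, ...
report = validate(resource)  # check the data against the inferred descriptor
print(report.valid)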
Search Engines: The Discovery Engine Behind Data Catalogs
Popular Platform Combinations
Modern data catalogs pair with powerful search engines to handle discovery at scale:
- CKAN + Apache Solr: CKAN uses Solr's faceted search and text analysis capabilities to index dataset metadata, enabling complex filtering by organization, tags, formats, and date ranges.
- OpenMetadata + Elasticsearch: OpenMetadata leverages Elasticsearch's real-time indexing to make data assets searchable immediately after ingestion, with support for complex queries across schemas, lineage, and usage patterns.
- Custom portals + various engines: Many organizations build custom data portals using search engines like Elasticsearch, Solr, or even newer solutions like Meilisearch for specific performance requirements.
How Search Engines Index Metadata
Here's what happens when you add a new dataset to a catalog:
- Ingestion: The catalog system reads both built-in metadata (filename, format, size) and external metadata files (title, description, tags).
- Parsing & Validation: Metadata is parsed according to the catalog's schema and validated for required fields.
- Transformation: Data is normalized—dates converted to standard formats, tags split into arrays, categories mapped to controlled vocabularies.
- Indexing: The search engine creates searchable indexes from metadata fields, with different indexing strategies for text search vs. faceted filtering (the transformation and indexing steps are sketched in code after this list).
- Ranking: Search algorithms determine relevance based on text matching, metadata completeness, dataset popularity, and recency.
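A minimal sketch of the transformation and indexing steps with the official Elasticsearch Python client (8.x-style API; the index name, source formats, and field handling are illustrative assumptions, not any particular catalog's implementation):
from datetime import datetime
from elasticsearch import Elasticsearch  # pip install elasticsearch
es = Elasticsearch("http://localhost:9200")
raw = {
    "title": "New York City 311 Service Requests - Q1 2023",
    "tags": "311; complaints; service-requests",  # inconsistent source format
    "last_updated": "03/20/2023",                 # non-ISO date
}
# Transformation: normalize before indexing.
doc = {
    "title": raw["title"],
    "tags": [t.strip() for t in raw["tags"].split(";")],  # tags split into an array
    "last_updated": datetime.strptime(raw["last_updated"], "%m/%d/%Y").date().isoformat(),
}
# Indexing: the engine builds searchable indexes from these normalized fields.
es.index(index="datasets", id="nyc-311-2023-q1", document=doc)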
Rich Metadata = Better Matches
Title & Description: The Foundation of Text Search
Compare these two approaches to naming the same dataset:
Minimal metadata approach:
filename: NYC-311-2023-Q1.csv
title: [same as filename]
description: [empty]
Rich metadata approach:
filename: NYC-311-2023-Q1.csv
title: New York City 311 Service Requests - Q1 2023
description: Comprehensive dataset of citizen complaints, service requests,
and municipal responses from NYC's 311 system during January-March 2023.
Includes request types, geographic distribution, response times, and
resolution status for over 500,000 service requests.
When someone searches for "NYC citizen complaints," the rich metadata provides multiple text matching opportunities:
- "NYC" matches "New York City"
- "citizen" matches "citizen complaints"
- "complaints" matches the description text
- "service requests" provides additional context
The search engine can confidently rank this dataset highly for the user's query.
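One caveat worth flagging: "NYC" matching "New York City" is not automatic text matching; it usually requires a synonym mapping in the engine's analyzer. A sketch of such a configuration with the Elasticsearch Python client (the index name and synonym list are illustrative):
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# Create an index whose analyzer treats "nyc" and "new york city" as equivalent,
# so a query for either form matches documents containing the other.
es.indices.create(
    index="datasets",
    settings={
        "analysis": {
            "filter": {
                "city_synonyms": {"type": "synonym", "synonyms": ["nyc, new york city"]}
            },
            "analyzer": {
                "metadata_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "city_synonyms"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "metadata_text"},
            "description": {"type": "text", "analyzer": "metadata_text"},
        }
    },
)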
Tags & Keywords: Enabling Faceted Discovery
Tags transform search from simple text matching to structured exploration:
# Example rich tagging for a transportation dataset
tags:
  - transportation
  - public-transit
  - bus-routes
  - geographic-data
  - real-time
  - gtfs-format
These tags enable users to:
- Filter results: "Show me only transportation datasets"
- Discover related data: "Other datasets tagged 'gtfs-format'"
- Refine searches: "Real-time transportation data" (see the query sketch below)
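Behind the scenes, these interactions typically combine a full-text query, a tag filter, and an aggregation that counts matching datasets per tag. A rough Elasticsearch sketch, assuming tags are indexed as exact-match keyword values:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# "Real-time transportation data": text match, restricted to one tag,
# plus a facet counting how many results carry each tag.
response = es.search(
    index="datasets",
    query={
        "bool": {
            "must": {"match": {"description": "real-time transportation"}},
            "filter": {"term": {"tags": "transportation"}},
        }
    },
    aggs={"tag_facets": {"terms": {"field": "tags"}}},
)
for bucket in response["aggregations"]["tag_facets"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # each tag and its dataset count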
Categories & Themes: Structured Browsing Paths
While search handles specific queries, categories provide structure for exploration:
Government Data Portal Categories:
├── Transportation
│   ├── Public Transit
│   ├── Traffic & Parking
│   └── Infrastructure
├── Environment
│   ├── Air Quality
│   ├── Water Resources
│   └── Climate & Weather
└── Demographics
    ├── Census Data
    ├── Housing
    └── Economic Indicators
Good categorization helps users discover datasets they didn't know existed.
Temporal & Spatial Metadata: Context That Matters
Time and location metadata enable powerful filtering:
# Temporal metadata
date_created: "2023-01-15"
date_modified: "2023-03-20"
temporal_coverage_start: "2023-01-01"
temporal_coverage_end: "2023-03-31"
update_frequency: "daily"
# Spatial metadata
geographic_coverage: "New York City, NY, USA"
bounding_box:
  north: 40.9176
  south: 40.4774
  east: -73.7004
  west: -74.2591
This enables queries like:
- "Show me datasets updated in the last week"
- "Find data covering Manhattan from 2020-2023"
- "Daily updated transportation data"
Real-World Example: Search in Action
Let's trace how rich metadata helps a user find relevant data:
User query: "climate data New York"
Dataset 1 (Poor metadata):
filename: weather_station_data_2023.csv
title: weather_station_data_2023
description: [empty]
tags: [empty]
Dataset 2 (Rich metadata):
filename: weather_station_data_2023.csv
title: New York State Climate Monitoring Network - 2023 Weather Data
description: Hourly temperature, precipitation, humidity, and wind measurements
from 45 weather stations across New York State. Data quality controlled and
validated by NYS Climate Office. Includes extreme weather events and monthly
climate summaries.
tags: climate, weather, temperature, precipitation, new-york, monitoring, hourly-data
geographic_coverage: New York State, USA
temporal_coverage: 2023-01-01 to 2023-12-31
Search engine matching process:
- Text relevance: Dataset 2 matches "climate" (exact match in tags and description) and "New York" (in title and geographic coverage).
- Metadata completeness bonus: Dataset 2 has rich descriptions and complete metadata fields, suggesting higher quality.
- Geographic precision: "New York State" closely matches the user's "New York" query.
- Thematic relevance: Climate-related tags confirm this dataset is specifically about climate data.
Result: Dataset 2 ranks much higher and provides the user with clear information about whether it meets their needs—all before they even download it.
Building for Discovery: Metadata Standards
Popular Metadata Schemas
Successful data catalogs adopt standardized metadata schemas:
DCAT (Data Catalog Vocabulary)
- W3C standard for web-based data catalogs
- Defines common properties: title, description, keywords, themes, publisher, license
- Enables cross-catalog data discovery and harvesting
Schema.org Dataset
- Structured data markup for web search engines
- Helps Google and other search engines understand and index your datasets (see the example markup after this list)
- Includes properties for data downloads, temporal coverage, and spatial coverage
Frictionless Data Package
- Lightweight standard with focus on data usability
- Emphasizes clear resource descriptions and table schemas
- Popular in research and scientific data communities
Custom Organizational Schemas
- Extended versions of standards with domain-specific fields
- Example: Adding "data_sensitivity_level" for enterprise catalogs
- Balance between standardization and specific organizational needs
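As an example of the second standard above, Schema.org Dataset markup is JSON-LD embedded in a dataset's landing page. A minimal sketch, generated here with Python (all values are illustrative):
import json
# Minimal Schema.org Dataset markup; embed the output in the landing page
# inside <script type="application/ld+json"> ... </script>.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "New York State Climate Monitoring Network - 2023 Weather Data",
    "description": "Hourly weather measurements from 45 stations across New York State.",
    "keywords": ["climate", "weather", "new-york"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "temporalCoverage": "2023-01-01/2023-12-31",
    "spatialCoverage": "New York State, USA",
}
print(json.dumps(dataset_markup, indent=2))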
The Consistency Advantage
The most important factor isn't which schema you choose—it's applying it consistently across all datasets. A catalog where every dataset has a clear title, description, tags, and category will outperform one where a handful of datasets carry exhaustive metadata and the rest have almost none.
# Minimal but consistent schema across all datasets:
title: [required - human readable name]
description: [required - 2-3 sentences explaining the data]
tags: [required - 3-5 relevant keywords]
category: [required - from controlled vocabulary]
license: [required - usage terms]
last_updated: [required - ISO date format]
This consistency enables:
- Reliable search results: Users can depend on finding complete information
- Effective filtering: Faceted search works when fields are consistently populated
- Automated processing: APIs and integrations can rely on standard fields being present
Conclusion
Rich metadata is the bridge between data creators and data users at scale. While a single CSV file might need only basic context, data catalogs with hundreds or thousands of datasets require standardized, searchable metadata to remain useful.
The combination of rich metadata and powerful search engines transforms data discovery from a frustrating browsing experience into targeted, relevant search results. Whether you're using CKAN with Solr, OpenMetadata with Elasticsearch, or building a custom solution, the principle remains the same: invest in consistent, descriptive metadata and your users will find exactly what they need.
In our next post, we'll dive into the practical aspects of implementing metadata standards in your data portal—covering schema design, validation workflows, and migration strategies for existing catalogs.