From Audio to Knowledge: How We Turned CUAC FM's Sound Archive into a Queryable Knowledge Graph

When a community radio station accumulates years of shows, interviews, debates, and local memory, the biggest risk is not losing the audio.
The biggest risk is not being able to find it.

That was exactly the starting point for this CUAC FM project: transforming a historical sound archive into something navigable, searchable, and reusable with AI, without losing the community context that gives it meaning.

The real problem: high value, low accessibility

A sound archive is not just a podcast collection. It is cultural, political, and social memory.

The challenge is that even when everything is published, answering specific questions is still hard:

  • “In which episode did they talk about X?”
  • “What did they say about Y in the latest show?”
  • “Which programs covered this topic during 2024?”

Without a structure layer, the audio remains “locked” inside thousands of minutes of content.

The core idea: treat radio as a knowledge graph

The solution was to model the archive with graph logic:

  • Programme
  • Episode
  • Segment (time-aligned fragment of transcribed audio)
  • Entity (people, organizations, places, concepts)
  • Topic

Connected through explicit relationships:

  • Programme -> PRODUCED -> Episode
  • Episode -> CONTAINS -> Segment
  • Episode/Segment -> DISCUSSES -> Entity
  • Episode/Segment -> ABOUT -> Topic
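The node types and relationships above can be sketched as a tiny in-memory graph. This is a stand-in for the actual Neo4j schema, not the production loader; the `ArchiveGraph` class and the sample names (like "Morning Show") are purely illustrative:

```python
# Minimal in-memory sketch of the archive graph model.
# In production this lives in Neo4j; the labels and relationship
# types mirror the schema described above.
from collections import defaultdict


class ArchiveGraph:
    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., "props": ...}
        self.edges = defaultdict(list)  # node_id -> [(rel_type, target_id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, rel_type, target):
        self.edges[source].append((rel_type, target))

    def neighbors(self, node_id, rel_type):
        return [t for r, t in self.edges[node_id] if r == rel_type]


g = ArchiveGraph()
g.add_node("prog:1", "Programme", name="Morning Show")        # hypothetical name
g.add_node("ep:1", "Episode", date="2024-03-12")
g.add_node("seg:1", "Segment", start=120.0, end=185.5)        # time-aligned fragment
g.add_node("ent:1", "Entity", name="CUAC FM", kind="organization")
g.add_edge("prog:1", "PRODUCED", "ep:1")
g.add_edge("ep:1", "CONTAINS", "seg:1")
g.add_edge("seg:1", "DISCUSSES", "ent:1")
```

With this shape, a question like "in which episode did they mention entity X?" becomes a walk along `DISCUSSES` and `CONTAINS` edges rather than a text scan.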

When you represent the archive this way, you no longer have only indexed text.
You get structured context.

Keyword search works for simple cases, but it fails on real conversational questions.

The graph adds three key benefits:

  1. Relational context
    It does not just find words. It understands which entity appears in which episode, in which program, and at what exact point.

  2. More precise queries
    It allows combining constraints such as program, date, episode, entities, and topics.

  3. Better retrieval for chat
    Combined with hybrid retrieval (vector + keyword + graph signals), it becomes much more robust.
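A common way to combine those three signals is a weighted score per candidate segment. The weights below are illustrative defaults, not the ones tuned in production:

```python
def hybrid_score(vector_sim, keyword_score, graph_boost,
                 w_vec=0.6, w_kw=0.3, w_graph=0.1):
    """Blend vector similarity, keyword match, and graph signals.
    All inputs are assumed normalized to [0, 1]; weights are illustrative."""
    return w_vec * vector_sim + w_kw * keyword_score + w_graph * graph_boost


# Two hypothetical candidate segments with their per-signal scores.
candidates = [
    {"id": "seg:1", "vec": 0.82, "kw": 0.40, "graph": 1.0},
    {"id": "seg:2", "vec": 0.75, "kw": 0.90, "graph": 0.0},
]
ranked = sorted(candidates,
                key=lambda c: hybrid_score(c["vec"], c["kw"], c["graph"]),
                reverse=True)
```

Note how the strong keyword match lets `seg:2` edge out the segment with the higher vector similarity; that interplay is exactly what makes the hybrid approach more robust than any single signal.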

From audio to answer: high-level pipeline

This is the flow we are using:

  1. Episode ingestion and time segmentation.
  2. Transcription and content normalization.
  3. Entity/topic extraction.
  4. Neo4j graph load.
  5. Hybrid retrieval for each question.
  6. Chat response with sources and traceability.
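The six steps above can be wired together as a simple function pipeline. Every function here is a hypothetical placeholder for the real component (ASR model, entity extractor, Neo4j loader, hybrid retriever, chat LLM); only the shape of the data flow is meant to be accurate:

```python
# Hypothetical sketch of the six pipeline stages; each function
# stands in for a real component and returns stub data.

def segment_audio(episode):            # 1. ingestion + time segmentation
    return [{"start": 0.0, "end": 60.0, "source": episode}]

def transcribe(segments):              # 2. transcription + normalization
    return [dict(s, text="...transcript...") for s in segments]

def extract_entities(segments):        # 3. entity/topic extraction
    return [dict(s, entities=["CUAC FM"], topics=["community radio"])
            for s in segments]

def load_graph(segments):              # 4. graph load (Neo4j in production)
    return {"segments": segments}

def retrieve(graph, question):         # 5. hybrid retrieval
    return graph["segments"][:3]

def answer(question, hits):            # 6. grounded answer with traceable sources
    sources = [(h["start"], h["end"]) for h in hits]
    return {"answer": "...", "sources": sources}


graph = load_graph(extract_entities(transcribe(segment_audio("episode.mp3"))))
result = answer("What did they discuss?",
                retrieve(graph, "What did they discuss?"))
```

The key design point is the last stage: the answer carries the time ranges of the segments it was built from, which is what makes every response verifiable against the archive.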

The goal is not “make AI sound good.”
The goal is reliable answers grounded in the archive, with verifiable references.

Why this matters for community radio

In community media, technology should not replace local voices.
It should make them more accessible.

This approach helps:

  • Reuse years of radio production.
  • Support research and documentation.
  • Improve access for first-time listeners.
  • Give historical content a second life without redoing editorial work.

In short: turn a passive archive into living memory infrastructure.

What results are already showing

In our internal retriever evaluations (large question set), we see quality improvements versus the baseline, especially in keyword coverage and top-k retrieval accuracy.
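Top-k retrieval accuracy, one of the metrics mentioned, is simply the fraction of questions whose reference segment appears among the first k retrieved results. The evaluation harness itself is not shown here; this is just the standard definition:

```python
def topk_accuracy(results, k=5):
    """results: list of (retrieved_ids, relevant_id) pairs, one per question.
    Returns the fraction of questions whose relevant segment
    appears among the first k retrieved ids."""
    hits = sum(1 for retrieved, relevant in results if relevant in retrieved[:k])
    return hits / len(results)


# Hypothetical two-question evaluation.
evaluation = [
    (["seg:3", "seg:1", "seg:7"], "seg:1"),   # hit at rank 2
    (["seg:9", "seg:4", "seg:2"], "seg:8"),   # miss
]
print(topk_accuracy(evaluation, k=3))  # -> 0.5
```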

That said, there is still work ahead: the hardest problem is not only better retrieval, but maintaining conversational context across follow-up questions (“that show,” “that topic,” “what we said before”).

That is the next frontier: conversation memory + useful compaction without losing key entities and facts.
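One way to approach that compaction is to keep recent turns verbatim and collapse older ones into a stub that pins the entities they mentioned, so a follow-up like "that show" can still be resolved. This is a sketch of the idea, an assumption about the design rather than the shipped implementation:

```python
def compact_history(turns, max_turns=4):
    """Keep the last max_turns turns verbatim; collapse older turns
    into a summary stub that preserves the entities they mentioned."""
    if len(turns) <= max_turns:
        return turns
    older, recent = turns[:-max_turns], turns[-max_turns:]
    entities = sorted({e for t in older for e in t.get("entities", [])})
    summary = {"role": "summary",
               "text": "Earlier the user asked about: " + ", ".join(entities),
               "entities": entities}
    return [summary] + recent


# Hypothetical conversation: "it" in turn 3 refers back to "show X".
history = [
    {"role": "user", "text": "Tell me about show X", "entities": ["show X"]},
    {"role": "assistant", "text": "...", "entities": ["show X"]},
    {"role": "user", "text": "Who hosts it?", "entities": []},
    {"role": "assistant", "text": "...", "entities": []},
    {"role": "user", "text": "And topic Y?", "entities": ["topic Y"]},
]
compacted = compact_history(history, max_turns=3)
```

Even after the first two turns are collapsed, "show X" survives in the summary stub, which is exactly the property the compaction must guarantee.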

AI at CUAC FM: experimentation with purpose

This project is part of a broader CUAC FM initiative exploring AI in community radio.

Closing

For me, this is the most meaningful part of the project: it is not about “adding AI for the sake of AI.”

It is about preserving collective memory, making it queryable, and returning it to the community as real utility.

If you have a media archive, a podcast library, or years of underused audio, this shift is not only technical.

It is editorial and cultural.

You move from storing content to building knowledge.
