The bigger picture - Search for Humans and Machines alike
A good decision is an informed one. Policy officers and civil society
need information to make decisions and learn about areas of public interest;
improving search will help move us closer to this goal.
But good search isn’t limited to human usage. Advances in Large Language
Models have led to a proliferation of data-augmented LLM generation,
more commonly known as Retrieval Augmented Generation (RAG). At the heart
of a good RAG system is a good search engine that retrieves the relevant data
chunks for ingestion. Figuring out the ins and outs of how to build a good
search engine is essential not just to our work on the Pair suite of products,
but potentially to other LLM products in government as well.
Every word that is said in Parliament is recorded and published as the
Official Report of Parliamentary debates or “Hansard”. These reports can
be found as far back as 1955, when Parliament was known as the Legislative
Assembly. This treasure trove of information is used by policy makers,
professionals and members of the public alike.
However, searching for relevant Reports is currently difficult and unintuitive.
Results are poor because the search is purely keyword-based: documents
that frequently mention a single word from a query are often ranked
highly, when the entire search phrase should be taken into account.
Because Covid is mentioned so frequently in Parliamentary debates, the
results are flooded with reports that are only tangentially related
to the query, if at all. Additionally, only the titles are presented, without
any smart text snippets to help a user confirm whether a link is likely
to be useful. Compare that to the results from searching the same query in Pair Search:
Search smarter, not harder
Current public and enterprise search experiences are generally low quality,
expensive and slow. We believe that this does not have to be the case.
Information should and can be readily accessible at everyone’s fingertips.
Pair Search was prototyped during OGP’s Hack for Public Good 2024 Hackathon,
under the guiding principle that search should be:
State of the Art (SOTA) - Search should leverage all
the recent advances in search technology
Fast - The complexity of the search algorithms should
not impact the latency of a search
Simple, Intuitive and Effective - Usage of the app should be intuitive,
and information should be immediately presented in as useful a format as possible,
to aid users in deciding which results are worth investigating further
The site is currently live,
and indexes the full Hansard database from 1955 to January 2024, spanning
over 30,000 reports. Already we are seeing dramatic improvements in search
results using Pair Search.
Search that actually works
Pair Search leverages Vespa.ai, a highly versatile and scalable open-source
big data serving engine, to offer state-of-the-art search capabilities.
We leverage its extensive text-search capabilities, as well as its continuous
integration of state-of-the-art techniques and models (e.g. the e5/m3 models
and ColbertV2 reranking).
*Note: the above video showcases a "Magic Summary" feature that automatically
generates a summary of the best results in a chronological timeline. However,
we are rolling Pair Search out without it first, to focus on the core search experience.
The search process in Pair Search is divided into three phases:
This initial phase involved the challenging task of scraping and parsing
the extensive Hansard database, which contains records starting from 1955.
Given the evolving data formats over the decades, standardizing this diverse
information for search purposes presented significant hurdles. Nonetheless,
these were successfully navigated to create a uniform format conducive
to efficient search operations.
The retrieval mechanism combines keyword-based and semantic search strategies.
The keyword-based search utilizes Vespa's weakAnd operator alongside the nativeRank
and BM25 text matching algorithms to efficiently sift through vast amounts
of text.
On the semantic front, Pair Search incorporates e5 embeddings, which offer
a balance of speed, cost-effectiveness, and enhanced performance compared
to alternatives like OpenAI's ada embeddings model.
This dual approach ensures a robust and nuanced search process that captures
both the intent behind user queries and the actual textual content.
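As a rough illustration of the blending idea (this is not Vespa's actual implementation; the normalization constant and the weighting parameter are our assumptions), a hybrid score can combine a keyword score such as BM25 with an embedding similarity:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_score(bm25_score, query_emb, doc_emb, alpha=0.5):
    # Squash the unbounded BM25 score into [0, 1) so it is
    # comparable to cosine similarity before blending.
    keyword = bm25_score / (bm25_score + 10.0)
    semantic = cosine(query_emb, doc_emb)
    return alpha * keyword + (1 - alpha) * semantic
```

The `alpha` weight controls how much the keyword signal dominates; in practice such weights are tuned against click data rather than fixed.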
To keep results speedy despite our usage of complex re-ranking algorithms,
we utilize a three phase approach to iteratively refine the search results.
Phase 1: each content node employs cost-effective algorithms
to narrow down the initial set of results.
Phase 2: a more resource-intensive re-ranking is performed
using the ColbertV2 model, ensuring that the most relevant documents are
surfaced.
Phase 3 (global phase): The final phase amalgamates the
top results from all content nodes, creating a hybrid score that integrates
semantic, keyword-based, and ColbertV2 scores. Using a hybrid score
gives us significantly better performance than relying on any single metric,
which tends to be overly biased towards one dimension of result quality.
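The three phases above can be sketched in miniature (the scoring functions, cut-offs, and equal blend weights are placeholders, not the production configuration):

```python
def phased_rank(docs, cheap_score, expensive_score, k1=1000, k2=100):
    # Phase 1: cheap per-node scoring narrows the initial candidate set.
    phase1 = sorted(docs, key=cheap_score, reverse=True)[:k1]
    # Phase 2: resource-intensive re-ranking (e.g. a ColBERT-style model)
    # runs only on the survivors of phase 1.
    phase2 = sorted(phase1, key=expensive_score, reverse=True)[:k2]
    # Phase 3 (global): blend both signals into a final hybrid score.
    final = sorted(
        phase2,
        key=lambda d: 0.5 * cheap_score(d) + 0.5 * expensive_score(d),
        reverse=True,
    )
    return final
```

The point of the structure is cost control: the expensive model only ever sees the small set of candidates the cheap phase lets through.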
We have currently soft-launched Pair Search (Hansard) with AGC, and a
selection of users who frequently utilize Hansard Search, including Ministry
of Law legal policy officers, Comm Ops officers at MCI and PMO, and COS coordinators,
among others. We are averaging ~150 daily users and ~200 daily searches
in the early days of our soft launch, and initial user feedback has been
encouraging:
"This will improve productivity by a few folds. Hansard search was the most painful thing previously" - MCI Policy Officer
"Wah useful. Returns results much faster than the current one" - MHA Policy Officer #1
"Wow this is really impressive!" - MHA Policy Officer #2
"The Hansard is an open book. We can all refer to it readily, and soon, we will be able to do a generative AI search on it." - Prime Minister Lee Hsien Loong
We have also seen interest from multiple agencies looking to leverage
our core search engine on their own data.
Expanding our Data Sources
Hansard is not the only useful public corpus of data with a less-than-ideal
search experience. For instance, even the search for a simple query such
as "Covid 19" on the SG Press Centre page yields zero results
(you have to search for "covid-19" with a dash for it to return some documents).
One of our immediate goals is hence to expand the number of data corpora
we make available for search. We are currently working
with the Judiciary to incorporate case judgements from the High Court and
the Court of Appeal (to go hand-in-hand with the current Hansard dataset for
any legal research use-cases).
Going from Search to Recommendations
The next goal for our search engine is to fully leverage the search index
we've built for not just direct queries, but also for enhancing discovery
and recommendations. By analyzing search patterns, user interactions, and
content relationships within our index, we can proactively surface relevant
content and personalized suggestions.
This approach will transform the search experience from being purely query-driven
into a dynamic platform for content discovery, tailored to meet
individual user needs and preferences, thereby enriching user engagement.
Increasing the Search Performance
We also constantly aim to improve our search performance, which we chart
through metrics such as the average rank of the results users click, or
the number of pages a user has to go through before finding the results they
want. These metrics can be used to tune our hybrid algorithm and/or its weights
to improve both the accuracy and relevance of the results generated.
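The click-rank metrics described above are straightforward to compute; a minimal sketch (the function names are ours, not from any specific analytics library):

```python
def mean_click_rank(click_ranks):
    # Average 1-based rank of the result a user eventually clicked;
    # lower is better (1.0 means the top hit was always the right one).
    return sum(click_ranks) / len(click_ranks)

def mean_reciprocal_rank(click_ranks):
    # Standard MRR: higher is better, bounded by 1.0.
    return sum(1.0 / r for r in click_ranks) / len(click_ranks)
```

Tracking both is useful: average click rank is easy to explain to stakeholders, while MRR penalizes a few very deep clicks less harshly.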
Another thing we are exploring is the integration of LLMs to augment
search performance via i) enriching the search index through automated
tagging and potential-question generation, and ii) enhancing the retrieval
/ ranking process through query expansion (appending additional related
terms and phrases to a query to increase the chance of retrieving a relevant
document).
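A toy version of query expansion, assuming a hypothetical hand-written expansion table (in practice an LLM or synonym model would generate the related terms):

```python
# Hypothetical expansion table for illustration only.
EXPANSIONS = {
    "covid": ["covid-19", "coronavirus", "pandemic"],
    "housing": ["hdb", "public housing"],
}

def expand_query(query):
    # Append related terms for each query word, deduplicating
    # while preserving the original word order.
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(EXPANSIONS.get(t, []))
    return " ".join(dict.fromkeys(expanded))
```

The expanded string is then submitted to the retrieval phase in place of the raw query, widening the net of candidate documents.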
Broadening the Applications of Our Search Engine
As a search engine, Pair Search is designed to work out of the box as the retrieval
stack for a Retrieval Augmented Generation system as well, such as the
one we have been trialling with the Assistants feature in Pair Chat. By
providing an API to both the base search functionality and the RAG retrieval
functionality, we hope to have our base engine serving multiple applications.
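A minimal sketch of how a search API can back a RAG pipeline (the function shape and prompt template are illustrative assumptions, not the Pair Chat implementation):

```python
def build_rag_prompt(question, search, k=3):
    # `search` is assumed to be a callable returning ranked text
    # chunks for a query, e.g. a thin wrapper over the search API.
    chunks = search(question)[:k]
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The resulting prompt is what gets sent to the LLM, so retrieval quality directly bounds answer quality: the model can only be as good as the chunks the engine surfaces.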
If you're currently using a search that doesn't quite work, or are interested
in collaborating, reach out to us here!
Meet the team
The prototype was developed by the following team.