A semantic cache is a sophisticated caching mechanism that differs from conventional caching, which relies on exact keyword matching; instead, it stores and retrieves data based on semantic similarity. Redis LangCache is a fully hosted semantic caching service that helps cache LLM prompts and responses semantically, thereby reducing LLM usage costs.
In this tutorial, let's learn how to quickly build a simple application and use LangCache to cache LLM queries. We'll also see whether we can combine fuzzy logic to improve the responses.
Step 1: Set Up Redis LangCache
If you don't have an account yet, create one at https://cloud.redis.io/. Once you have logged in:
- Navigate to Databases and create a new database (I used the free plan here).
- Click on LangCache in the left menu and create an instance of the LangCache service. I used the "Quick service creation" option to create the LangCache service.
- Copy the API key and keep it safe.
- Click on the LangCache service you just created. When you click the "Connect" button, a quick connect guide will appear on the right with examples of how to connect to your LangCache instance.
Step 2: Create a Simple Python Script
Let's create a simple Python script that first checks the cache for a prompt. If a match is found, it returns the cached response. If not, it sends the prompt to the LLM, caches the response, and returns it.
Connect to LangCache:
from langcache import LangCache  # pip install langcache

lang_cache = LangCache(
    server_url="",  # your LangCache server URL from the Connect guide
    cache_id="",    # your cache ID
    # the API key from Step 1 is also required; see the Connect guide for the exact parameter name
)
Search the cache before sending the prompt to the LLM:
result = lang_cache.search(
    prompt=query,
    similarity_threshold=0.90,
)
The similarity_threshold determines how closely the prompt must match. A higher value means stricter matching.
Handle a cache hit:
if result:
    for entry in result.data:
        print("Cache Hit!")
        print("Cache Response:::")
        print(f"Prompt: {entry.prompt}")
        print(f"Response: {entry.response}")
        print(f"Score: {entry.similarity}")
    return
Handle a cache miss and store the response:
# Calling the LLM here
response = requests.post(url, json=payload, headers=headers)
response_json = response.json()
response_text = response_json["choices"][0]["message"]["content"]

# --- Storing the response from the LLM in LangCache ---
save_response = lang_cache.set(
    prompt=query,
    response=response_text,
)
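Putting the pieces together, a minimal end-to-end sketch might look like the following. The LLM endpoint, headers, payload shape, and model name are placeholders, and the LangCache constructor may also need your API key (check the Connect guide for the exact parameter):

```python
import requests
from langcache import LangCache  # pip install langcache

lang_cache = LangCache(
    server_url="",  # your LangCache server URL
    cache_id="",    # your cache ID
)

LLM_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer <llm-api-key>"}       # placeholder credentials


def ask(query: str) -> str:
    # 1. Look for a semantically similar prompt in the cache
    result = lang_cache.search(prompt=query, similarity_threshold=0.90)
    if result and result.data:
        entry = result.data[0]
        print(f"Cache Hit! Score: {entry.similarity}")
        return entry.response

    # 2. Cache miss: call the LLM
    print("Cache Miss! Redirecting to LLM")
    payload = {
        "model": "<model-name>",  # placeholder model
        "messages": [{"role": "user", "content": query}],
    }
    response_json = requests.post(LLM_URL, json=payload, headers=HEADERS).json()
    response_text = response_json["choices"][0]["message"]["content"]

    # 3. Store the prompt/response pair so similar queries hit the cache next time
    lang_cache.set(prompt=query, response=response_text)
    return response_text


print(ask("Brief history on Capital of France"))
```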
Benefits of Semantic Caching
- For similar queries, responses are fetched from the cache, avoiding expensive LLM calls.
- Faster response times.
Things to Watch Out For
- Similarity threshold: Set it thoughtfully. Too high and you'll miss useful matches; too low and you'll get irrelevant results.
- Accuracy: Even with an optimal threshold, results may not always be perfect.
- Data privacy: In multi-tenant architectures, ensure proper data partitioning so users don't see each other's data (see the sketch after this list).
- Cache eviction: Know when and how to evict cache entries.
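For the data-privacy point, one possible approach is to tag every cached entry with a tenant identifier and always search within the same tag. This is only a sketch: it assumes LangCache's attribute support on set and search, and the tenant_id attribute name is made up for illustration (check the SDK docs for the exact parameter):

```python
TENANT_ID = "customer-42"  # hypothetical tenant identifier

# Store the response scoped to this tenant only (assumed 'attributes' parameter)
lang_cache.set(
    prompt=query,
    response=response_text,
    attributes={"tenant_id": TENANT_ID},
)

# Search only this tenant's entries, so users never see each other's data
result = lang_cache.search(
    prompt=query,
    similarity_threshold=0.90,
    attributes={"tenant_id": TENANT_ID},
)
```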
Let’s Run the Code
First Query (Cache Is Empty)
Query: Brief history on Capital of France
Response:
    Cache Miss!
    Redirecting to LLM
The query and the response are now stored in the cache.
Modified Query
Query: ----Brief history on Paris---
Response:
    Cache Hit!
    Cache Response:::
    Prompt: Brief history on Capital of France
    Score: 0.92440444
Although the query changed, the cache returned a semantically similar result with a similarity score of 0.92.
Another Variation
Query: ----Brief history on France---
Response:
    Cache Hit!
    Cache Response:::
    Prompt: Brief history on Capital of France
    Score: 0.9121176
Oops! We asked for the history of France, but we got the history of the capital of France. Although Paris plays a major role in France's history, the context is different. One is a city, and the other is a country!
Tuning the Threshold
Let's increase the threshold to 0.92 and clear the cache.
Query: ----Brief history on Capital of France---
Response:
    Cache Miss!
    LLM Response:
#####################################
Query: ----Brief history on Paris---
Response:
    Cache Hit!
    Cache Response:::
        Prompt: Brief history on Capital of France
        Score: 0.92440444
#####################################
Query: ----Brief history on France---
Response:
    Cache Miss!
    LLM Response:
It seems to be working better!
Performance Comparison
Let's compare the time it takes to fetch a response from the cache vs. querying the LLM.
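Timings like the ones in the table below can be captured with a simple wrapper along these lines (a sketch, not necessarily the exact measurement code used for these numbers); LLM_URL, HEADERS, and the payload shape are the placeholders from the earlier sketch:

```python
import time
import requests


def compare_latency(query: str) -> None:
    # Time a cache lookup
    start = time.perf_counter()
    lang_cache.search(prompt=query, similarity_threshold=0.90)
    print(f"Time for cache lookup: {time.perf_counter() - start:.4f} seconds")

    # Time a direct LLM call
    start = time.perf_counter()
    payload = {"model": "<model-name>", "messages": [{"role": "user", "content": query}]}
    requests.post(LLM_URL, json=payload, headers=HEADERS)
    print(f"Time to get response from LLM API: {time.perf_counter() - start:.4f} seconds")


compare_latency("Brief history on Paris")
```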
Semantic Cache Results (Sorted by Time)

| Query | Result | Matched Prompt | Similarity Score | Response Source | Time (seconds) |
|---|---|---|---|---|---|
| Brief history on the Capital of France | Cache Miss | | | LLM | 0.8499 |
| Brief history on Paris | Cache Hit | Brief history on the Capital of France | 0.9244 | Cache | 0.2705 |
| Brief history on France | Cache Miss | | | LLM | 1.2543 |
| Brief history on the Capital of France | Cache Hit | Brief history on the Capital of France | 1.0 | Cache | 1.1139 |
| Brief history on Paris | Cache Hit | Brief history on France | 0.9386 | Cache | 0.2761 |
| Brief history on France | Cache Hit | Brief history on France | 1.0 | Cache | 0.2798 |
| Brief history on the Capital of France | Cache Hit | Brief history on the Capital of France | 1.0 | Cache | 1.0178 |
| Brief history on Paris | Cache Hit | Brief history on France | 0.9386 | Cache | 0.2806 |
| Brief history on France | Cache Hit | Brief history on France | 1.0 | Cache | 0.2778 |
| How does Langcache work? explain… | Cache Hit | How does Langcache work? explain… | 1.0 | Cache | 0.2930 |
Observations:
- Although there are some anomalies, the response from the cache is much faster, as expected.
- A high similarity threshold must be maintained to reuse the cached response.
- For distinct answers, a higher similarity threshold is recommended.
- Always tune and experiment based on the query and business requirements, as results vary with embedding models, similarity thresholds, and so on (a sketch follows this list).
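One simple way to experiment is to sweep a few candidate thresholds over a handful of representative queries and watch how the hits and matched prompts change. This is a sketch using the test queries from this tutorial:

```python
test_queries = [
    "Brief history on Capital of France",
    "Brief history on Paris",
    "Brief history on France",
]

for threshold in (0.85, 0.90, 0.92, 0.95):
    print(f"--- similarity_threshold = {threshold} ---")
    for query in test_queries:
        result = lang_cache.search(prompt=query, similarity_threshold=threshold)
        if result and result.data:
            best = result.data[0]
            print(f"HIT  {query!r} -> {best.prompt!r} (score {best.similarity:.4f})")
        else:
            print(f"MISS {query!r}")
```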
One More Example
Query: ----How does Langcache work---
Response:
    Cache Miss!
    LLM Response: <Gives a 20-line response>
    Time to get response from LLM API: 0.8144 seconds
----#####################################---
Query: ----How does Langcache work---
Response:
    Cache Hit!
    Time for cache lookup: 0.3162 seconds
    Cache Response:::
    Prompt: How does Langcache work
    Response: <Gives the same 20-line response>
    Score: 1.0
All is well so far. Let's modify the query a bit!
----#####################################---
Query: ----How does Langcache work, explain in 5 lines---
Response:
    Cache Hit!
    Time for cache lookup: 0.2719 seconds
    Cache Response:::
    Prompt: How does Langcache work
    Response: <Gives the same 20-line response>
    Score: 0.9714471
----#####################################---
We asked for a 5-line response but got the cached 20-line one. This highlights the importance of tuning and of using attributes to scope responses.
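One way to reduce this kind of mismatch is to tag entries by the style of answer they contain and scope searches the same way. As with the earlier multi-tenant sketch, this assumes LangCache's attribute support, and the response_style attribute name is invented for illustration:

```python
# Store the long-form answer tagged with its style (assumed 'attributes' parameter)
lang_cache.set(
    prompt="How does Langcache work",
    response=response_text,  # the 20-line answer returned by the LLM
    attributes={"response_style": "detailed"},
)

# A request for a short answer only searches short-form entries,
# so the detailed 20-line response is not reused by mistake
result = lang_cache.search(
    prompt="How does Langcache work, explain in 5 lines",
    similarity_threshold=0.90,
    attributes={"response_style": "brief"},
)
```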
Semantic Cache vs. Fuzzy Match
Fuzzy matching is based on approximate string matching. It works best for handling typos, spelling variants, and near-duplicate strings, whereas semantic matching operates at the level of meaning and context.
Let's look at the difference between them in action by comparing the semantic score (LangCache) with the fuzzy score (Ratcliff–Obershelp algorithm) when matching two strings.
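The fuzzy score can be computed with Python's standard library: difflib.SequenceMatcher implements a Ratcliff–Obershelp-style ratio. A minimal sketch (the helper name fuzzy_score is ours):

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (Ratcliff-Obershelp)."""
    return SequenceMatcher(None, a, b).ratio()


print(fuzzy_score("How does Semantic cache work?", "Will Semantic cache work?"))
print(fuzzy_score("How does Semantic cache work?", "What does Semantic cache mean?"))
```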
Querying for the first time:
Query: Does Semantic cache work?
Response:
    Cache Miss!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
Querying the same again, notice that the semantic score and the fuzzy score are quite close.
Query: How does Semantic cache work?
Response:
    Cache Hit!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
    Semantic score (LangCache): 0.95
    Fuzzy match score: 0.93
Let's try a few more variations:
Query: Will Semantic cache work?
Response:
    Cache Hit!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
    Semantic score (LangCache): 0.94
    Fuzzy match score: 0.83
Query: What does Semantic cache mean?
Response:
    Cache Hit!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
    Semantic score (LangCache): 0.94
    Fuzzy match score: 0.79
As you can see, fuzzy matching focuses on "looks like," whereas semantic matching focuses on "means like."
The full implementation of the above is available here.
Fuzzy matching can be combined with the semantic cache in a few ways:
- Store the last 'n' prompts and do a fuzzy match on those; use semantic caching only if no match is found (a sketch follows this list).
- When a high similarity threshold/score is used for the semantic cache (e.g., > 0.95), we end up caching prompts for every cache miss. This leads to many near-duplicates. We can use fuzzy matching to identify these near-duplicates and store only the prompts that are different.
- If the caching layer contains many near-duplicates, we can use fuzzy matching for compaction.
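A minimal sketch of the first idea: keep the last n prompts in memory, try a cheap fuzzy match on them first, and fall back to the semantic cache only when that misses. The deque size and the 0.9 fuzzy cutoff are arbitrary assumptions, and lang_cache is the client created earlier:

```python
from collections import deque
from difflib import SequenceMatcher

recent_prompts = deque(maxlen=50)  # the last 'n' prompt/response pairs
FUZZY_CUTOFF = 0.9                 # assumed cutoff for "close enough" strings


def lookup(query: str):
    # 1. Cheap fuzzy pass over the most recent prompts
    for prompt, response in recent_prompts:
        if SequenceMatcher(None, query, prompt).ratio() >= FUZZY_CUTOFF:
            return response

    # 2. No fuzzy match: fall back to the semantic cache
    result = lang_cache.search(prompt=query, similarity_threshold=0.90)
    if result and result.data:
        return result.data[0].response
    return None


def remember(query: str, response: str):
    # Track locally for fuzzy matching and persist in the semantic cache
    recent_prompts.append((query, response))
    lang_cache.set(prompt=query, response=response)
```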
Final Thoughts
When building applications with semantic caching, effective results depend on continuous testing and context-aware tuning. Similarity thresholds, prompt patterns, and cache scope should be adjusted based on workload behavior and accuracy requirements. Redis LangCache enables fine-grained control through attributes that partition and scope cached responses. Semantic caching becomes even more efficient when fuzzy matching logic is added, striking a balance between accuracy and increased cache hit rates. Combined, these techniques can improve latency, lower LLM costs, and deliver consistent results while maintaining accuracy and relevance.
Happy coding!







