Compression For LLM Generation Optimization & Cost Reduction

Category: AI


Large language models (LLMs) are trained primarily to produce text responses to user queries or prompts, with complex reasoning at work. This involves not only language generation, in which the model predicts each subsequent token of the output sequence, but also a thorough understanding of the linguistic patterns in the user's input text.

 

The need to address the slow, costly inference caused by longer user prompts and context windows has recently drawn attention to prompt compression techniques in the LLM landscape. These methods aim to reduce token usage, speed up token generation, and lower overall computing costs while preserving output quality.

This article presents and explains five popular prompt compression techniques for speeding up LLM generation in demanding scenarios.

 

1. Semantic summarization is the process of condensing lengthy or repetitive text into a more concise version while preserving its core semantics. Instead of feeding the model full documents or an entire conversation history iteratively, a digest containing only the most important information is supplied. As a result, the model has to “read” fewer input tokens, which speeds up next-token generation and cuts costs without discarding essential information.

 

For example, a lengthy prompt context drawn from meeting minutes might begin, “In yesterday’s meeting, Iván reviewed the quarterly numbers…” and run for five paragraphs. After semantic summarization, the condensed context might read: “Summary: Iván reviewed quarterly numbers, highlighted a sales dip in Q4, and proposed cost-saving measures.”
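In practice the digest is often produced by an LLM itself, but the idea can be illustrated with a minimal, purely extractive sketch: score each sentence by word frequency and keep only the top-scoring ones. The function name and scoring heuristic below are illustrative assumptions, not a specific library's API.

```python
import re
from collections import Counter

def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep only the highest-scoring sentences, scored by word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Rank sentence indices by the total frequency of the words they contain.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])  # restore original reading order
    return " ".join(sentences[i] for i in keep)
```

The shortened digest is then passed to the model in place of the original context, so the model reads far fewer input tokens.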

 

 

2. JSON-structured prompting aims to convey lengthy, free-flowing text passages in condensed, semi-structured formats such as bullet lists or JSON key-value pairs. The target formats used in structured prompting typically reduce token counts. Along the way, this improves model consistency, reduces ambiguity, and shortens prompts by helping the model interpret user instructions more precisely.

 

For instance, a raw prompt such as “Please compare Product X and Product Y in depth, paying particular attention to price, features, and user reviews, and present the result in a systematic format” can be transformed into: {“task”: “compare”, “objects”: [“Product X”, “Product Y”], “criteria”: [“price”, “features”, “user reviews”]}
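A minimal sketch of this transformation, assuming the verbose request has already been parsed into a task, a list of objects, and a list of criteria (the helper name and JSON keys are illustrative, not a standard):

```python
import json

def structured_prompt(task: str, objects: list, criteria: list) -> str:
    """Pack a verbose natural-language request into compact JSON key-value pairs."""
    payload = {"task": task, "objects": objects, "criteria": criteria}
    # separators=(",", ":") strips the whitespace json.dumps emits by default
    return json.dumps(payload, separators=(",", ":"))

verbose = ("Please compare Product X and Product Y in depth, paying particular "
           "attention to price, features, and user reviews, and present the "
           "result in a systematic format.")
compact = structured_prompt("compare", ["Product X", "Product Y"],
                            ["price", "features", "user reviews"])
```

The compact form carries the same task specification in noticeably fewer characters, which usually translates into fewer tokens.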

 

 

3. Relevance filtering applies the idea of “focusing on what really matters” by scoring the relevance of individual passages and including in the final prompt only the context that is actually pertinent to the task at hand. Rather than dumping entire pieces of information, such as whole documents, into the context, only the small portions most relevant to the target request are retained. This is another way to significantly reduce prompt size, sharpen the model’s focus, and improve prediction accuracy (recall that LLM token generation is essentially an iterative next-word prediction task).

 

Consider a 10-page mobile product manual attached as prompt context, with the user asking about potential safety risks when charging the device. After relevance filtering, only a few brief, pertinent sections about “battery life” and the “charging process” remain.
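Production systems usually score relevance with embeddings, but a minimal keyword-overlap sketch conveys the mechanism; the function name and scoring rule here are assumptions for illustration:

```python
import re

def filter_relevant(query: str, passages: list, top_k: int = 2) -> list:
    """Keep only the passages whose word overlap with the query is highest."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def score(passage: str) -> int:
        # Relevance = number of distinct query words appearing in the passage.
        return len(query_words & set(re.findall(r"\w+", passage.lower())))

    ranked = sorted(passages, key=score, reverse=True)
    return [p for p in ranked[:top_k] if score(p) > 0]
```

Only the surviving passages are concatenated into the final prompt, so the model never sees the irrelevant bulk of the document.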

 

 

4. Instruction referencing addresses the fact that many prompts repeatedly state the same kinds of instructions, such as “use concise sentences,” “adopt this tone,” and “reply in this format,” to name a few. Instruction referencing registers each common instruction (itself a collection of tokens) once, creates a reference for it, and reuses that reference as a single short identifier. The identifier is then used whenever a registered “common instruction” is needed in subsequent prompts. Besides making prompts shorter, this technique helps sustain consistent task behavior over time.

 

A collection of guidelines such as “Write in a friendly tone. Avoid jargon. Keep sentences short. Give examples.” might be shortened to “Use Style Guide X.” and then reused whenever the same instructions apply.
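One simple way to realize this is a registry that stores each instruction block once under a short identifier; the class and identifier “Style Guide X” below are hypothetical names chosen to mirror the example above:

```python
class InstructionRegistry:
    """Register a recurring block of instructions once; reuse a short reference ID."""

    def __init__(self):
        self._store = {}

    def register(self, ref_id: str, instructions: str) -> str:
        self._store[ref_id] = instructions
        return ref_id

    def expand(self, prompt: str) -> str:
        """Replace any registered ID in a prompt with its full instruction text."""
        for ref_id, text in self._store.items():
            prompt = prompt.replace(ref_id, text)
        return prompt

registry = InstructionRegistry()
registry.register(
    "Style Guide X",
    "Write in a friendly tone. Avoid jargon. Keep sentences short. Give examples.",
)
short_prompt = "Draft the release notes. Use Style Guide X."
```

Depending on the setup, the short reference may be expanded server-side before the call, or the model itself may be primed once with the mapping so every later prompt stays short.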

 

 

5. Template abstraction builds on the observation that certain patterns or structures, such as report outlines, evaluation forms, or step-by-step procedures, frequently recur across prompts. Similar to instruction referencing, template abstraction focuses on the format and shape of the generated outputs, encapsulating such common patterns under a template name. The template is then referenced by name, and the LLM fills in the details. This significantly reduces repeated tokens while also keeping prompts more readable.

 

Following template abstraction, a prompt might be reduced to something like “Produce a Competitive Analysis using Template AB-3,” where AB-3 refers to a list of the analysis’s requested content sections, each precisely described. Something along the lines of:

 

Create a four-section competitive analysis:

  • Market Overview (two to three paragraphs highlighting market trends)
  • Competitor Breakdown (a table comparing at least five competitors)
  • Strengths and Weaknesses (bullet points)
  • Strategic Recommendations (three concrete actions)
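A minimal sketch of template referencing, mirroring the AB-3 example above (the template ID, dictionary layout, and “using Template …” phrasing are all illustrative assumptions):

```python
TEMPLATES = {
    # Hypothetical template ID; the body spells out the requested sections once.
    "AB-3": (
        "Create a four-section competitive analysis:\n"
        "- Market Overview (two to three paragraphs highlighting market trends)\n"
        "- Competitor Breakdown (a table comparing at least five competitors)\n"
        "- Strengths and Weaknesses (bullet points)\n"
        "- Strategic Recommendations (three concrete actions)"
    ),
}

def expand_template(prompt: str, templates: dict = TEMPLATES) -> str:
    """Swap a short 'using Template <ID>' reference for the full template body."""
    for ref_id, body in templates.items():
        prompt = prompt.replace(f"using Template {ref_id}", "as follows:\n" + body)
    return prompt

short = "Produce a Competitive Analysis using Template AB-3"
```

As with instruction referencing, each prompt stays short while the full section specification is stored and maintained in one place.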

 

Conclusion

This article has outlined five popular techniques for compressing user prompts to speed up LLM generation in demanding scenarios. These techniques mostly target the context portion of prompts, which is typically the main reason “overloaded prompts” slow LLMs down.