Context Entropy
At Convictional, we're building infrastructure that helps businesses make better decisions by bridging the gap between AI capabilities and human judgment. One of our core challenges has been developing systems that can effectively capture and utilize an organization's knowledge across all its documents, activities, and people. This piece shares insights from our research into how business knowledge naturally distributes itself across documents, and what that means for building effective knowledge systems.
Large Language Models are incredibly capable, but in business applications they are limited by their knowledge and understanding of a given business's operating model. Context and information about topics within the business sprawl across documentation, activity, and people. Not all topics are documented equally, however, and so we observe entropy-like behavior in how information about a topic is distributed. While some topics are structured and targeted (low entropy), others are highly fragmented, spread almost homogeneously across all documents (high entropy).
To solve for this entropy in 2024, you've probably implemented, or at least considered implementing, RAG (Retrieval Augmented Generation). The pattern is elegant in its simplicity: when a user asks a question, find relevant documents, stuff them into the context window of an LLM, and let the model work its magic (a minimal sketch follows the list below). But if you've deployed RAG in production, you've likely encountered its limitations in addressing core business context challenges:
- Information Silos: While RAG can search across repositories, it struggles to meaningfully connect information that lives in disconnected islands across customer, operations, and financial systems. As a topic’s entropy increases, this siloed nature is exacerbated.
- Incomplete Context: Organizational dynamics, such as the way executive status creates barriers to authentic information flow, shape how information should be interpreted. RAG can retrieve documents, but it can't account for these dynamics.
- Historical Amnesia: Even when past decisions and their rationale exist in documents, RAG might miss crucial connections or fail to surface them at the right moment, leading to missed learning opportunities.
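Before weighing the alternatives, it helps to pin down what the simple pattern actually is. Here is a minimal sketch: TF-IDF stands in for an embedding model so the example runs locally, and the documents and prompt template are invented for illustration.

```python
# Minimal RAG sketch: retrieve top-k documents for a query, then build an
# LLM prompt. TF-IDF stands in for an embedding model so the example runs
# without external services; swap in your embedder in practice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Q3 planning doc: Project Atlas owns the new billing pipeline.",
    "Meeting summary: Atlas launch slipped two weeks due to auth work.",
    "GitHub issue: refactor invoice exporter for Project Atlas.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(zip(scores, docs), reverse=True)
    return [doc for _, doc in ranked[:k]]

def build_prompt(query: str, context_docs: list[str]) -> str:
    context = "\n---\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "What is Project Atlas?"
print(build_prompt(question, retrieve(question, documents)))
```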
This leads to a crucial architectural decision: Should you stick with simple document retrieval and RAG, or invest in building a more sophisticated pre-processed knowledge store that can better handle these organizational complexities?
After analyzing over 3,200 entities extracted from approximately 2,400 documents spanning decisions, meeting summaries, GitHub issues, and Google Drive docs, we have some concrete answers – and they might surprise you.
The Evidence: What We Found
We analyzed 3,281 distinct entities across 16 categories, tracking over 100,000 individual facts about these entities. What emerged was a fascinating pattern in how knowledge is distributed across documents.
The Power Law of Knowledge Distribution
The most striking finding was just how concentrated knowledge tends to be. For most entities:
- 50% of everything we know about them comes from just 1-2 documents
- 95% of their knowledge is contained in 5-6 documents
- Only 4.7% of entities need more than 10 documents to capture 95% of their knowledge
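For concreteness, this is roughly how such a coverage metric can be computed from per-document fact counts. The fact counts below are invented for illustration, not our data.

```python
# Sketch of the coverage metric behind the numbers above: for one entity,
# how many documents are needed to reach a target share of its known facts?
def docs_for_coverage(facts_per_doc: list[int], target: float = 0.95) -> int:
    total = sum(facts_per_doc)
    covered = 0
    # Greedily take the most fact-dense documents first.
    for i, count in enumerate(sorted(facts_per_doc, reverse=True), start=1):
        covered += count
        if covered / total >= target:
            return i
    return len(facts_per_doc)

entity_facts = [40, 12, 5, 2, 1, 1]  # facts about one entity, per document
print(docs_for_coverage(entity_facts))        # 4 documents for 95% coverage
print(docs_for_coverage(entity_facts, 0.5))   # 1 document for 50% coverage
```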
This has profound implications for RAG implementation. With typical LLM context windows now handling 8-10 documents comfortably (although LLM attention can still be a limitation), pure RAG is theoretically sufficient for over 95% of entities. But the devil, as always, is in the details.
The Impact of Entity Size on Knowledge Distribution
Our analysis revealed a clear pattern: the size of an entity - measured by the number of facts about it - strongly influences how its knowledge is distributed across documents. We identified three distinct size categories: small, medium, and large entities.
This size distribution has important implications for system design. While pure RAG works well for over 95% of entities (the small and medium categories), that crucial 3.4% of large entities may benefit from more sophisticated handling through entity pre-processing.
Implementation Approaches: Finding the Right Balance
While our data suggests RAG is sufficient for most cases, implementing any approach requires careful consideration of practical tradeoffs. Our production experience has revealed key insights about different implementation strategies.
A Practical Implementation Strategy
Based on our research and production experience, we recommend a pragmatic approach that starts simple and adds complexity only where it demonstrably adds value:
- Start with Document Processing
  - Implement basic document indexing and RAG
  - Add named entity detection during indexing
  - Track entity mentions and their distribution across documents
  - Monitor which entities are frequently queried
- Measure Entity Patterns
  - Track how knowledge about entities spreads across documents
  - Identify entities that consistently require synthesizing information from many sources
  - Monitor query patterns to understand which entities are most important to your users
- Selective Pre-processing
  - Begin pre-processing only for entities that:
    - Span many documents (typically more than 10, if using full documents)
    - Are frequently queried
    - Require consistent responses
    - Have high business impact
  - Continue using simple RAG for the majority of entities
  - Regularly evaluate the effectiveness of pre-processing decisions (a sketch of this gating logic follows the list)
  - We maintain pre-processed entities only for certain high-value, deterministic knowledge areas; this list is still growing and changing:
    - Company information
    - Team member profiles
    - Company, Team and User Goals
    - Core organizational structure
- Search-Based Retrieval: We use hybrid search to find relevant pre-processed entities, documents, and activity, ranked by a combined score across the search methods (a scoring sketch also follows the list).
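To make the pre-processing gate concrete, here is a minimal sketch of the kind of decision rule described above. The thresholds, field names, and `EntityStats` structure are illustrative assumptions, not our production logic.

```python
# Hypothetical decision rule for selective pre-processing: pre-process an
# entity only when it is sprawling, hot, and high-stakes. All thresholds
# here are illustrative.
from dataclasses import dataclass

@dataclass
class EntityStats:
    doc_count: int           # documents the entity appears in
    query_count_30d: int     # how often users asked about it recently
    needs_consistency: bool  # e.g. goals, org structure
    business_impact: str     # "low" | "medium" | "high"

def should_preprocess(stats: EntityStats) -> bool:
    sprawling = stats.doc_count > 10          # 95% coverage no longer fits
    frequently_queried = stats.query_count_30d >= 20
    high_impact = stats.business_impact == "high"
    return sprawling and frequently_queried and (stats.needs_consistency or high_impact)

print(should_preprocess(EntityStats(14, 35, True, "high")))  # True
print(should_preprocess(EntityStats(3, 50, False, "high")))  # False: simple RAG suffices
```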
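Similarly, hybrid ranking can be sketched as normalize-and-blend over per-method scores. The min-max normalization and the 0.5 weighting below are illustrative choices, not a description of our exact scoring function.

```python
# Hybrid search sketch: blend a lexical score (e.g. BM25) with a vector
# similarity score per result. Tuning the normalization and weight is
# where the real work lives.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def hybrid_rank(keyword: dict[str, float], vector: dict[str, float], alpha: float = 0.5):
    kw, vec = normalize(keyword), normalize(vector)
    ids = set(kw) | set(vec)
    combined = {i: alpha * kw.get(i, 0.0) + (1 - alpha) * vec.get(i, 0.0) for i in ids}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

keyword_hits = {"doc_a": 7.2, "entity_goals": 5.1, "doc_b": 1.3}
vector_hits = {"entity_goals": 0.91, "doc_c": 0.88, "doc_a": 0.40}
print(hybrid_rank(keyword_hits, vector_hits))  # entity_goals ranks first
```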
This balanced approach gives us RAG's simplicity for most cases while ensuring consistency for critical knowledge areas. The key is being selective about where we apply additional complexity, maximizing the return on engineering investment while keeping the system maintainable.
Conclusion: Beyond RAG
Just as the second law of thermodynamics tells us that the entropy of the universe always increases, the entropy of your business context increases too. This means that while RAG is indeed sufficient for about 95% of your entities, that critical 5% – often your most important business knowledge – requires more sophisticated handling. But building and maintaining these hybrid systems is complex and resource-intensive. That's why we've built Convictional: to give you the benefits of both approaches without the implementation overhead. Instead of investing in building and maintaining your own knowledge management infrastructure, you can leverage our platform for:
- Intelligent Context Management: Our system automatically determines when to use simple RAG versus more sophisticated entity processing based on the patterns we've discovered.
- Adaptive Knowledge Integration: Rather than forcing you to choose between RAG and entity pre-processing, we handle the complexity of when and how to use each approach.
- Seamless Scaling: As your organization's knowledge grows, our platform automatically adjusts its approach to maintain optimal performance without requiring architectural changes on your part.
If you're interested in learning more about how Convictional can help your organization better manage and utilize its knowledge, reach out to us at decide@convictional.com.
This research was conducted by the Convictional research team as part of our ongoing work to understand and improve how organizations manage and utilize their knowledge. For more insights from our team, check out convictional.com.
Appendix (For the Technically Curious)
Our Research Implementation
Our analysis is based on experimental processing of approximately 2,400 documents spanning GitHub Issues and PRs, Google Drive docs, in-app activity, and meeting transcripts. Here's what we learned from building and analyzing this test dataset:
Entity Extraction Process
We used LLMs (specifically claude-3-5-sonnet-20241022) to do the following (illustrative sketches of the extraction and de-duplication steps follow this list):
- Extract entities from documents, categorizing them into 16 types
  - We use Instructor to ensure structured responses
  - We perform an individual glean of each document to extract entities
  - We only chunk documents if they surpass the context limit
- Extract facts about each entity, with confidence scores
  - To drive completeness, we perform an individual glean of the document for each entity
  - While highly token-inefficient, this was necessary to help the LLM properly apply attention, particularly on long documents
- De-duplicate entities using embedding similarity, but simply concatenate facts, preserving their ties to documents
  - We don't de-duplicate facts, as we want to inherit permissions from the underlying documents and provide only Facts that the user (or users) would have appropriate access to
- Generate entity summaries from consolidated facts
  - We generate a summary for debugging purposes, but it is important to note that this could be replaced with any job over the individually retrieved Facts from the entity at query time
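As a rough sketch of the extraction step, here is the Instructor-plus-Anthropic pattern described above. The `Entity` schema and prompt are simplified stand-ins for our production setup.

```python
# Sketch of structured entity extraction with Instructor + Anthropic.
# The schema and prompt are simplified stand-ins, not our production setup.
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

class Entity(BaseModel):
    name: str
    category: str  # one of the 16 types in practice

class ExtractedEntities(BaseModel):
    entities: list[Entity]

client = instructor.from_anthropic(Anthropic())

def glean_entities(document_text: str) -> ExtractedEntities:
    # One "glean" pass over a single document; chunk only if the
    # document exceeds the context limit.
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        response_model=ExtractedEntities,
        messages=[{
            "role": "user",
            "content": f"Extract the named entities in this document:\n\n{document_text}",
        }],
    )
```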
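And a minimal sketch of the de-duplication step: merge entities whose embeddings are close, concatenating facts so each fact keeps its tie to a source document. The 0.9 threshold and the entity dictionary shape are illustrative assumptions.

```python
# Sketch of entity de-duplication by embedding similarity. Each entity is
# a dict with an "embedding" (np.ndarray) and "facts" (list of
# (fact_text, source_doc_id) pairs, preserving document ties).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe(entities: list[dict], threshold: float = 0.9) -> list[dict]:
    merged: list[dict] = []
    for entity in entities:
        for kept in merged:
            if cosine(entity["embedding"], kept["embedding"]) >= threshold:
                # Same real-world entity: concatenate facts, keep doc ties.
                kept["facts"].extend(entity["facts"])
                break
        else:
            merged.append(entity)
    return merged
```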
This process gave us valuable insights into knowledge distribution patterns and the challenges of entity-based knowledge management, though it wasn't implemented as a production system.
Considerations
Document Chunking
While our analysis has focused on document-level retrieval, it's important to acknowledge the role of document chunking in RAG systems. Chunking - breaking documents into smaller pieces - can increase the number of "documents" you can include in a context window and often improves retrieval relevancy by creating more focused semantic units.
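For reference, the simplest variant is a fixed-size chunker with overlap; the whitespace split below stands in for a real tokenizer.

```python
# Minimal fixed-size chunker with overlap. Splitting on whitespace stands
# in for a real tokenizer; the overlap softens cuts across semantic
# boundaries, but cannot eliminate them.
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    tokens = text.split()
    step = size - overlap
    return [
        " ".join(tokens[i:i + size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]
```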
However, chunking introduces its own tradeoffs. While it allows for more granular retrieval, it can actually exacerbate entity fragmentation. Consider a document that discusses a project holistically - chunking might spread this naturally cohesive information across multiple pieces, potentially losing important context in the process. Our analysis suggests that, with aggressive chunking, entities can end up spread across even more chunks than their original document count would suggest.
Chunking also doesn't fully solve the token efficiency problem. While you can fit more "documents" in a context window, you still risk including irrelevant information within chunks, and the semantic boundaries created by chunking may not align with natural knowledge boundaries about entities.
We chose to focus our analysis on document-level metrics as they provide a clearer picture of how knowledge naturally clusters. However, practitioners should consider how their chunking strategy might affect these patterns. The choice of chunking approach - whether by fixed token count, semantic boundaries, or maintaining document integrity - can significantly impact both retrieval effectiveness and entity coverage.
Multi-Entity Queries
Most practical business queries involve understanding relationships between multiple entities simultaneously. While our analysis has focused on single-entity RAG performance, it's important to acknowledge that real-world usage often requires retrieving information about several interrelated entities.
This multi-entity reality affects RAG's scalability in two key ways. First, when retrieving documents for multiple entities, there's often overlap in the relevant documents. Our analysis shows that related entities frequently share documentation context - for instance, projects within the same department or team members working on related initiatives. This document overlap can help mitigate the potential explosion of context window usage.
However, in cases where there's minimal document overlap between queried entities, the total document retrieval load grows roughly linearly with the number of entities. This can push up against context window limits more quickly than our single-entity analysis might suggest.
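The arithmetic here is just a set union over each entity's retrieved documents: overlap shrinks the context load, while disjoint sets grow it linearly. A toy example with invented document IDs:

```python
# Context load for a multi-entity query is the union of each entity's
# document set, so overlap between related entities shrinks it.
docs_for_entity = {
    "project_atlas": {"d1", "d2", "d3"},
    "billing_team": {"d2", "d3", "d4"},  # heavy overlap with the project
    "q3_offsite": {"d7", "d8"},          # disjoint: adds its full cost
}

query_entities = ["project_atlas", "billing_team", "q3_offsite"]
context_docs = set().union(*(docs_for_entity[e] for e in query_entities))
print(len(context_docs))  # 6 documents, not the naive 8
```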
The scatter plots below show how the most well-documented topics often overlap (Fig 2, right), while we see much sparser density among smaller entities.
The practical implication is that while RAG is sufficient for most single-entity queries, teams should carefully consider their typical query patterns when designing their system. For workflows heavily focused on understanding relationships between many entities, selective pre-processing of highly-connected (highly documented) entities may be warranted.
Documents per Entity
A good indication of the sprawl of an Entity is the number of documents associated with it. However, the Fact density of each Document can either mitigate or exacerbate the 'entropy' of the Entity:
- For example, an Entity with 100 associated Documents could get 95% Fact coverage from a single Document if that Document were extremely Fact dense
- More on the Fact density of Documents below
Facts Per Document
Facts within Documents are more normally distributed than the other distributions we’ve observed, but still highly skewed. When combined with the pattern of Documents per Entity, we observe the inverse super-exponential distribution of Fact coverage across Entities (Fig 1).