The ESG Backlash: Why Ground Truth in Data is Essential for Societal Trust

The recent challenges facing ESG (Environmental, Social, and Governance) are not sudden shocks but the inevitable outcome of a foundation built more on elite consensus than broad societal trust. ESG’s rapid expansion masked a dangerous fragility—one that confused compliance with genuine commitment and mistook policy momentum for true legitimacy.

At the heart of the problem lies a failure to connect ESG goals with real-world benefits. Instead of translating sustainability into tangible impacts such as job creation, community development, and economic well-being, the conversation became lost in complex taxonomies, disclosure frameworks, and net-zero targets. This disconnect created a vacuum that critics filled by portraying ESG as an out-of-touch, elitist agenda.

This is not merely a public relations issue. Sustainability is inherently political, involving difficult trade-offs and resource decisions. Ignoring this reality left ESG vulnerable to backlash.

Where does data quality fit into rebuilding trust in ESG?

The path forward requires viewing ESG as a systemic and strategic political economy project that builds broad coalitions. But these coalitions must rest on an unshakeable foundation of truth—truth found in reliable data.

For too long, ESG data has suffered from inconsistencies, estimations, and self-reported disclosures lacking rigorous validation. This “wild west” of data has fueled skepticism and weakened trust. If the data itself is unreliable, societal buy-in becomes impossible.

Ingedata’s Expertise and Perspective on ESG Data Quality

At Ingedata, we understand that trustworthy ESG strategies start with reliable, high-quality data. Our expertise lies in combining advanced AI technologies with expert human validation through our human-in-the-loop approach. This ensures data accuracy and context that automated systems alone cannot achieve.

We have consistently delivered exceptional precision—achieving 99.81% accuracy on over 11 million validated observables across diverse sectors such as Earth observation, industrial inspection, and financial data processing. This experience demonstrates our ability to provide scalable, consistent data solutions critical to ESG initiatives.

Our perspective is that ESG efforts must be rooted in transparent and verifiable data to build genuine societal trust. We believe ESG is more than compliance; it’s a collaborative project requiring broad coalitions built on truth. By ensuring that the underlying data reflects real-world impacts—such as environmental changes, community development, and economic benefits—we enable organizations to craft credible narratives and support effective policymaking.

At Ingedata, we see ourselves as partners in the journey toward sustainable progress, committed to delivering the highest quality data annotation and validation services. We firmly believe that in data lies truth—and getting that data right is essential to transforming ESG into a movement with lasting societal impact.

Why this matters for ESG

  • Credibility and Transparency: To transform ESG from a compliance exercise into a legitimate project, underlying data must be unimpeachable. This requires human oversight to validate and classify impacts with precision, ensuring data reflects real conditions. Imagine applying satellite image analysis accuracy to verify deforestation claims or renewable energy impacts.
  • Authentic Narratives: Genuine ESG stories must be grounded in verifiable data. Claims about green jobs or community benefits require robust, validated socio-economic data, not broad statements. Our expert-refined data enables organizations to build credible, compelling narratives that resonate.
  • Building Trustworthy Coalitions: Broad participation in ESG efforts depends on trust. Labor unions, civic groups, and citizens engage only when data is transparent, fair, and reflective of their concerns. Our commitment to data quality and transparency signals that trust.
  • Effective Policy Making: Policymakers rely on reliable data to craft impactful ESG regulations. Flawed data leads to ineffective or counterproductive policies. The human-in-the-loop approach ensures complex nuances are captured, supporting stronger governance.

Though the ESG consensus has faced setbacks, the urgent need for sustainable progress remains. Rebuilding that consensus requires an unwavering commitment to truth—investing in the essential but often overlooked task of ensuring data quality. While AI processes vast amounts of data, human expertise remains vital for accuracy, context, and building trust.

At Ingedata, we see ourselves as gatekeepers of this future. We believe that “in data lies truth.” By delivering the highest quality data annotation and validation services, we empower organizations to develop ESG strategies that are credible, compelling, and capable of securing genuine societal buy-in.

The future of ESG depends on humility, wider alliances, and relentless focus on legitimacy. And it all starts with getting the data right.

#ESG #DataQuality #Sustainability #SocietalTrust #HumanInTheLoop #SupervisedLearning #Ingedata


Passive Record: Why We Ditched ActiveRecord Pattern

In our journey to build a scalable, maintainable, and robust event-driven microservice architecture, we made a fundamental decision early on: to abandon the traditional ActiveRecord pattern in favor of a more explicit approach combining Repository and Record patterns. This article explores why we made this choice and the significant benefits it has brought to our system.

The Problem with ActiveRecord

The ActiveRecord pattern, popularized by frameworks like Ruby on Rails, combines data access and business logic into a single object. While this approach offers simplicity and rapid development for smaller applications, it introduces several challenges as systems grow:

  1. Implicit Database Access: ActiveRecord objects can load and persist data implicitly, making it difficult to track and optimize database operations.
  2. N+1 Query Problems: The ease of accessing associations often leads to inefficient query patterns.
  3. Mutable State: ActiveRecord objects maintain a mutable state, which can lead to unexpected side effects.
  4. Mixed Concerns: Query logic, business rules, and data structure are all combined in one class.
  5. Testing Complexity: The tight coupling to the database makes unit testing more challenging.

Our Alternative: Repository and Record Patterns

Instead of ActiveRecord, we adopted a combination of two patterns:

Record Pattern

Records are immutable value objects that represent data entities. They:

  • Define the structure of data with typed fields
  • Represent relationships between entities
  • Are immutable, preventing unexpected state changes
  • Focus solely on data structure, not data access

module Account
  class Record < Verse::Model::Record::Base
    type "iam/accounts" # Used for JSON API serialization

    field :id, type: Integer, primary: true
    field :email, type: String
    field :account_type, type: String
    # Note: the readonly keyword is used for reflection only, to generate the update schema.
    # A record is always a read-only structure.
    field :created_at, type: Time, readonly: true
    field :updated_at, type: Time, readonly: true
    field :password_digest, type: String, visible: false

    belongs_to :person, repository: "Person::Repository", foreign_key: :person_id
    has_many :roles, repository: "Account::Role::Repository", foreign_key: :account_id
  end
end

Repository Pattern

Repositories handle data access and persistence. They:      

  • Provide CRUD operations for records
  • Encapsulate query logic
  • Handle database-specific concerns
  • Manage transactions and consistency
  • Control authorization and access rules

module Account
  class Repository < Verse::Sequel::Repository
    self.table = "accounts"        # Database table name
    self.resource = "iam:accounts" # Type used for event publishing & security scoping

    def scoped(action)
      # Scope the accessible resources based on the auth_context provided when
      # creating the repository
      auth_context.can!(action, self.class.resource) do |scope|
        scope.all? { table }
        scope.own? { table.where(id: auth_context.metadata[:id]) }
      end
    end

    # Custom filter for `index` and `find_by` queries
    custom_filter :role_name do |collection, value|
      frag = <<-SQL
        EXISTS (
          SELECT 1
          FROM account_roles
          WHERE account_roles.account_id = accounts.id
          AND account_roles.name IN (?)
        )
      SQL

      value = [value] unless value.is_a?(Array)
      collection.where(Sequel.lit(frag, value))
    end
  end
end

Key Advantages Over Active Record

1. Immutable Records = Explicit Database Actions

With ActiveRecord, it’s easy to modify an object and forget to save it, or conversely, accidentally persist changes:

# ActiveRecord approach
user = User.find(1)
user.email = "new@example.com" # Changed but not saved!
# … later in the code …
user.save # Oops, saved without realizing it had been changed

With our Record/Repository approach, all database operations are explicit:

# Repository/Record approach
user = user_repo.find(1)
# user.email = "new@example.com" # Error! Records are immutable
updated_user = user_repo.update!(1, { email: "new@example.com" }) # Explicit database operation

This explicitness is crucial when database operations can be slow or expensive. There’s no ambiguity about when data is being read from or written to the database.

2. Prevention of N+1 Queries

The N+1 query problem is a common performance issue with ActiveRecord:

# ActiveRecord approach – generates N+1 queries
users = User.all
users.each do |user|
  puts user.posts # Each iteration triggers a new query
end

Our Repository pattern makes it almost impossible to fall into this trap because associations aren’t automatically loaded:

# Repository approach
users = user_repo.index({})
# users.each { |user| puts user.posts.count } # Error! No implicit loading

# Instead, you must explicitly include associations or use optimized queries
users_with_posts = user_repo.index({}, included: ["posts"])
users_with_posts.first.posts # Accessible and loaded now

# Or better yet, create a specific query method
users_with_post_counts = user_repo.index_with_post_counts

This forces developers to think about data access patterns upfront, leading to more efficient queries.

3. Separation of Query Logic from Business Logic

In ActiveRecord, complex query logic often gets mixed with business logic:

# ActiveRecord approach – query logic mixed with business logic
class User < ActiveRecord::Base
  def self.active_premium_users_with_recent_activity
    where(status: 'active', plan: 'premium')
      .joins(:activities)
      .where('activities.created_at > ?', 30.days.ago)
      .distinct
  end

  def can_access_premium_feature?
    premium? && active?
  end
end

Our approach cleanly separates these concerns:

# Repository – handles query logic
class User::Repository < Verse::Sequel::Repository
  def active_premium_users_with_recent_activity
    scoped(:read)
      .where(status: 'active', plan: 'premium')
      .join(:activities, user_id: :id)
      .where(Sequel[:activities][:created_at] > Sequel.lit("NOW() - INTERVAL '30 days'"))
      .distinct
  end
end

# Service – handles business logic
class UserService < Verse::Service::Base
  use_repo repo: User::Repository

  def can_access_premium_feature?(user_id)
    user = repo.find(user_id)
    user.plan == 'premium' && user.status == 'active'
  end
end

This separation makes code more maintainable and easier to test. It also allows for specialized optimization of queries without affecting business logic later in the development process.

4. Virtual Repositories for Complex Data Access

One powerful feature of our approach is the ability to create “virtual” repositories that aren’t tied to a specific table but represent a projection or a complex query:

module QueryResult
  class Repository < Verse::Sequel::Repository
    attr_accessor :query_id

    def initialize(auth_context, query_id, metadata: {})
      super(auth_context, metadata:)
      @query_id = query_id
    end

    # Redefine `table` as a query with a complex from-clause.
    def table
      # Complex SQL query that joins multiple tables and calculates relevance scores
      sql_statement = Sequel.lit(
        complex_query_fragment,
        query_id: query_id,
        # other parameters…
      )

      client { |db| db.from(sql_statement) }
    end

    def scoped(action)
      # Use this repo as a read-only repo
      raise ArgumentError, "is read-only" unless action == :read
      super
    end
  end
end

This allows us to encapsulate complex data access patterns in a clean, reusable way. The repository can handle the complexity of joining multiple tables, calculating derived values, or even accessing external services, while still presenting a consistent interface to the rest of the application.

5. Automatic Event Publishing

In a microservice architecture, communication between services is crucial. Our repository pattern automatically publishes events to an event bus on mutative actions:

module Instance
  class Repository < Verse::Sequel::Repository
    self.resource = "quiz:instances"

    event(name: "completed")
    def complete!(instance_id)
      no_event do # Optionally, prevent the `updated` event from being published,
                  # as it is superseded by `completed`
        update!(
          instance_id,
          {
            ended_at: Time.now,
            status: "completed"
          }
        )
      end
    end
  end
end

When complete! is called, it automatically publishes a “completed” event to the event bus after the database operation succeeds. Other services can subscribe to these events to react accordingly.

In Verse, the parameters passed to the repository method are included in the event payload. In the case above, we get an event quiz:instances:completed(resource_id=instance_id, payload={}).

The no_event block allows us to perform nested operations without triggering additional events, preventing event cascades.

6. Query/Event Method Flagging for Master/Replica Setups

In a distributed system with read replicas, it’s important to direct read queries to replicas and write operations to the master. Our approach makes this explicit:

module Instance
  class Repository < Verse::Sequel::Repository
    # Write operation – goes to master
    def update_status!(id, status)
      update!(id, { status: })
    end

    # Flag this method as read operation – can go to replica
    query
    def exists_for_quiz?(quiz_id)
      scoped(:read)
        .where(quiz_id: quiz_id)
        .select(1)
        .limit(1)
        .any?
    end
  end
end

Methods marked with query are automatically routed to read replicas, while other methods go to the master. This simple annotation makes it easy to optimize database load without complex configuration or middleware.

There is a catch: In the case of a read action followed by a write, you can use Repository#with_db_mode(:rw, &block) to force usage of the master node.

7. Table-Level Authentication with Scoped Methods

Authorization is a cross-cutting concern that’s often awkwardly implemented in ActiveRecord. Our repository pattern elegantly handles this with scoped methods:

module Account
  class Repository < Verse::Sequel::Repository
    def scoped(action)
      auth_context.can!(action, "iam:accounts") do |scope|
        scope.all? { table } # Admins can access all accounts

        scope.by_ou? do # Scoped by organizational units, with the specific ou stored in the context itself
          ou = auth_context[:ou]
          auth_context.reject! unless ou
          Service::TableQuery.by_related_ou(table, ou, related_table: :people, foreign_key: :person_id)
        end

        scope.own? { table.where(id: auth_context.metadata[:id]) } # Users can access their own account
      end
    end
  end
end

This approach:

  • Centralizes authorization logic in the repository
  • Makes it impossible to accidentally bypass authorization
  • Allows for fine-grained access control based on user context
  • Keeps authorization logic close to the data it protects

Real-World Comparison

Let’s compare a typical ActiveRecord implementation with our Repository/Record approach for a common task: finding users with a specific role and updating their status.  

ActiveRecord Approach

# ActiveRecord implementation
class User < ActiveRecord::Base
  has_many :roles

  def self.with_role(role_name)
    joins(:roles).where(roles: { name: role_name })
  end
end

# Usage
admin_users = User.with_role('admin')
admin_users.update_all(status: 'active')

Issues with this approach:

  • Authorization is not enforced
  • The update triggers callbacks but no explicit events
  • It’s not clear if this should run on master or replica
  • Complex queries would mix with the User model

Repository/Record Approach

# Repository implementation
module User
  class Repository < Verse::Sequel::Repository
    self.table = "users"
    self.resource = "iam:users"

    def scoped(action)
      auth_context.can!(action, "iam:users") do |scope|
        scope.all? { table }
        scope.own? { table.where(id: auth_context.metadata[:id]) }
      end
    end

    custom_filter :role_name do |collection, value|
      collection.join(:user_roles, user_id: :id).where(Sequel[:user_roles][:name] => value)
    end

    event(name: "status_updated")
    def update_status_for_role!(role_name, status)
      users = scoped(:update).where(role_name: role_name)
      # Prevent the `updated` event from being triggered; it is superseded by `status_updated`
      no_event { users.update!(status: status) }
    end
  end
end

# Usage
user_repo.update_status_for_role!('admin', 'active')

Benefits of this approach:

  • Authorization is automatically enforced
  • An event is published for other services
  • The method name makes it clear it’s a write operation
  • Query logic is encapsulated in the repository

Conclusion

Switching from ActiveRecord to the Repository/Record pattern has been transformative for our microservice architecture. While it required more upfront design and slightly more code, the benefits have far outweighed the costs:

  • Explicit database operations prevent accidental queries and make performance bottlenecks obvious
  • Immutable records lead to more predictable code with fewer side effects
  • Separation of concerns makes our codebase more maintainable and testable
  • Built-in event publishing facilitates communication between microservices
  • Query/event flagging optimizes database load in distributed systems
  • Integrated authorization ensures consistent access control

For complex, distributed systems, especially those with microservice architectures, the explicitness and separation of concerns provided by the Repository/Record pattern offer significant advantages over the traditional ActiveRecord approach.

What's Next?

In future articles, we’ll dive deeper into specific aspects of our architecture:

  • How we implement complex role-based authorization across microservices
  • Strategies for testing repositories and records effectively
  • Techniques for optimizing database queries and transactions
  • Real-time performance monitoring

Have you experimented with alternatives to ActiveRecord in your projects? We’d love to hear about your experiences and challenges. Share your thoughts in the comments below or reach out to our team to discuss how these patterns might benefit your architecture.

This article is part of our series on event-driven microservice architecture. Stay tuned for more insights into how we’ve built a scalable, maintainable system.


Talent Matching with Vector Embeddings

How We Built a Semantic Search System

How do you find the perfect candidate for a job opening when traditional keyword matching falls short? How can you match skills and experiences semantically rather than lexically? This article explores our journey building a sophisticated talent matching system using vector embeddings, explaining both the technical challenges we faced and the solutions we implemented.

The Challenge of Talent Matching

In traditional recruitment systems, finding candidates often relies on keyword matching – searching for exact terms like “Python” or “Project Manager” in resumes. This approach has significant limitations:

  1. Vocabulary Mismatch: A resume might mention “Django” but not explicitly say “Python”
  2. Context Insensitivity: Unable to understand that “led a team of 5 developers” indicates leadership skills
  3. Synonym Blindness: Missing that “software engineer” and “developer” are essentially the same role
  4. Qualification Nuance: Difficulty distinguishing between someone who “used React” versus someone who “built and maintained large-scale React applications”

We needed a system that could understand the semantic meaning behind job requirements and candidate qualifications, not just match keywords.

Our Solution: Vector Embeddings and Semantic Search

Our approach leverages AI-powered vector embeddings to create a semantic search system that understands the meaning behind words, not just the words themselves. Here’s how it works:

Understanding Embedding Space

Vector embeddings are numerical representations of text in a high-dimensional space where semantic similarity is captured by vector proximity. In simpler terms:

  • Each piece of text (a job description, a skill, an experience) is converted into a vector of numbers
  • Similar concepts end up close to each other in this “embedding space”
  • We can measure similarity between concepts by calculating the distance between their vectors

For example, in this space, “software engineer” and “developer” would be close together, while both would be further from “marketing specialist.”
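
As a toy illustration (the vectors and numbers below are made up; real embeddings have hundreds or thousands of dimensions), similarity can be measured with a simple dot product:

# Toy example: similar concepts have similar vectors
engineer  = [0.81, 0.12, 0.40]
developer = [0.79, 0.15, 0.38]
marketer  = [0.10, 0.88, 0.05]

def dot(a, b) = a.zip(b).sum { |x, y| x * y }

dot(engineer, developer) # => ~0.81 (close in embedding space)
dot(engineer, marketer)  # => ~0.21 (far apart)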

Multiple Embedding Points for Comprehensive Matching

A key insight in our implementation is that a single embedding for an entire resume is insufficient. Different aspects of a candidate’s profile need different types of semantic understanding:

  1. Experience Embeddings: Capture work history, roles, and responsibilities
  2. Education Embeddings: Represent academic background and qualifications
  3. Skills Embeddings: Encode technical and soft skills
  4. Language Embeddings: Represent language proficiencies

Each of these aspects is embedded separately, allowing for more nuanced matching. For example:
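
For instance, one candidate might be represented by several vectors rather than one. The sketch below is hypothetical; `embed` stands in for whatever embedding model is used, and the field names are ours:

# Hypothetical sketch: one talent, one embedding per aspect
talent_embeddings = {
  experience: embed("The candidate led a team of five backend developers ..."),
  education:  embed("The candidate holds a Master's degree in Computer Science ..."),
  skills:     embed("Ruby, PostgreSQL, distributed systems, data pipelines"),
  language:   embed("Fluent English, native French")
}
# Each vector is stored and indexed separately, then combined with
# per-aspect weights at query time.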

This multi-embedding approach allows us to:

  • Weight different aspects differently (e.g., giving more importance to experience than education)
  • Match candidates who might excel in one area even if they’re weaker in others
  • Provide more granular control over the matching algorithm

From Resume to Embeddings: The Processing Pipeline

When a resume is uploaded to our system, it goes through a sophisticated processing pipeline:

  1. Resume Extraction: We use AI to extract structured data from the resume document
  2. Text Normalization: Each component (experience, education, etc.) is normalized into a standard format
  3. Embedding Generation: Each component is converted into a vector embedding
  4. Storage: The embeddings are stored in PostgreSQL with pgvector extension
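
As a rough illustration of the last step, storing one aspect's embedding with Sequel and pgvector could look like this (the table and column names are assumptions, not our production schema):

# Rough sketch of step 4 (assumed schema, Sequel + pgvector)
require "sequel"

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))

def store_experience_embedding(talent_id, vector)
  DB[:embeddings_experiences].insert(
    talent_id: talent_id,
    # pgvector accepts a bracketed literal such as '[0.12,-0.53,...]'
    embedding: Sequel.lit("'[#{vector.join(',')}]'::vector")
  )
end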

The embedding generation process uses carefully crafted prompts to ensure consistent, high-quality embeddings:

## Instructions

You are provided with details for one candidate experience record, which includes the following:
– If any content is not in English, always translate into English.
– Ignore any duration or date information.
– Introduction should always start with “The candidate”
– Generate a standardized, descriptive text that summarizes this work experience.
– Always explain the candidate’s position in a clear and concise manner if you cannot do nothing.
– Your output should be one or two sentences that clearly explain the candidate’s role and responsibilities in a narrative format.
… more directives …

Notice how we explicitly ignore duration information in our prompts? This is intentional – we found that including years of experience in embeddings actually reduced matching quality by overemphasizing tenure rather than relevance of experience.

Query Processing: Matching Jobs to Candidates

When a user searches for candidates, a similar process occurs:

  1. Query Parsing: The search query is analyzed to extract key components (required experience, skills, education, languages)
  2. Query Embedding: Each component is embedded into the same vector space as the resume components
  3. Similarity Search: We find the closest matches for each component
  4. Weighted Combination: Results are combined with appropriate weights for each component
  5. Ranking: Candidates are ranked by overall relevance

Our query processing uses a specialized prompt that ensures consistent interpretation:

You are an AI system that processes talent queries and extracts key details into a JSON-formatted string.
Your output must include exactly four keys: `”experience”`, `”language”`, `”education”`, and `”skills”`. Follow these instructions:


1. **Output Format:**
IMPORTANT: The final output format must be in English only, even when the initial input is not in English.
The final output must be a single JSON string (without extra formatting) as shown below:
”'{ “title”: , “experience”: , “language”: , “education”: , “skills”: }”’


… more directives below …

Optimizing Performance with PostgreSQL Vector Extensions

Storing and querying vector embeddings efficiently is crucial for system performance. We leverage PostgreSQL with vector extensions (specifically pgvector) to handle this specialized data type.

Vector Indexing for Fast Similarity Search

To make similarity searches fast, we use specialized HNSW (Hierarchical Navigable Small World) indexes:

-- Example of how we create vector indexes in our migrations
CREATE INDEX embeddings_experiences_embedding_idx
ON embeddings_experiences
USING hnsw (embedding vector_ip_ops);

We specifically chose HNSW over other index types like IVF-Flat because it offers significantly better performance for our use case. HNSW builds a multi-layered graph structure that enables efficient approximate nearest neighbor search, dramatically reducing query times.

Additionally, we use the dot product (vector_ip_ops) as our similarity operator rather than cosine similarity. While cosine similarity is often the default choice for text embeddings, we found that using the dot product provides better performance without sacrificing accuracy in our normalized embedding space.

These optimizations dramatically speed up nearest-neighbor searches in high-dimensional spaces, making it possible to find the most similar candidates in milliseconds rather than seconds.

Optimized Query Structure

Our query structure is carefully optimized to balance accuracy and performance:
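
The production query isn't reproduced here, but a simplified sketch (with assumed table and column names) of a weighted combination over the per-aspect embeddings might look like this, using pgvector's inner-product operator `<#>`:

# Simplified sketch: weighted similarity across aspect embeddings (assumed schema)
weighted_query = <<-SQL
  SELECT t.id,
         ( 0.70 * -(e.embedding  <#> :experience_vec)
         + 0.20 * -(ed.embedding <#> :education_vec)
         + 0.05 * -(s.embedding  <#> :skills_vec)
         + 0.05 * -(l.embedding  <#> :language_vec) ) AS score
  FROM talents t
  JOIN embeddings_experiences e  ON e.talent_id  = t.id
  JOIN embeddings_educations  ed ON ed.talent_id = t.id
  JOIN embeddings_skills      s  ON s.talent_id  = t.id
  JOIN embeddings_languages   l  ON l.talent_id  = t.id
  ORDER BY score DESC
  LIMIT 50
SQL
# `<#>` returns the negative inner product, hence the negation so that a higher score means more similar.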

Notice the weighting factors: we give 70% weight to experience, 20% to education, and 5% each to skills and languages. These weights were determined through extensive testing and reflect the relative importance of each factor in predicting candidate success.

Challenges and Solutions

Building this system wasn’t without challenges. Here are some key issues we faced and how we solved them:

Challenge 1: Single Embedding Limitations

Problem: Our early version used a single embedding per talent, which created misleading similarity measurements. For example, two candidates with the same years of experience in completely different fields would appear more similar than they should.

Solution: We moved to multiple embeddings per talent (experience, education, skills, languages), each focused on a specific aspect. This approach recognizes that the embedding space isn’t perfectly calibrated for talent search – it’s a general semantic space where “distances” don’t always reflect business-relevant distinctions. By separating different aspects and weighting them appropriately, we can better control how similarity is calculated.

For example, with separate embeddings, a software developer with 5 years of experience and a marketing specialist with 5 years of experience would no longer appear artificially similar just because they share the same tenure. Instead, their experience embeddings would correctly place them in different regions of the semantic space.

A typical implementation might use an event-driven approach:
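
For instance, a hypothetical sketch along these lines (the hook, `embed`, and repository helpers are illustrative names, not our production code) could regenerate each aspect's embedding whenever the underlying talent data changes:

# Hypothetical sketch: re-embed each aspect separately when a talent changes
on_event("talents:updated") do |event|
  talent = talent_repo.find(event.resource_id)

  {
    experience: talent.experiences,
    education:  talent.educations,
    skills:     talent.skills,
    language:   talent.languages
  }.each do |aspect, content|
    vector = embed(normalize_with_prompt(aspect, content)) # standardized text -> embedding
    embedding_repo(aspect).upsert!(talent.id, vector)
  end
end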

Challenge 2: Embedding Quality and Consistency

Problem: Early versions produced inconsistent embeddings, especially for similar experiences described differently.

Solution: We developed standardized prompts that normalize the text before embedding, focusing on roles and responsibilities while ignoring potentially misleading details like years of experience.

Challenge 3: Performance at Scale

Problem: As our database grew, query performance degraded.

Solution: We implemented:

  • Optimized HNSW vector indexes with carefully tuned parameters
  • Dot product similarity operations for faster computation
  • Query limits to focus on the most promising candidates first
  • Caching of common queries

Challenge 4: Handling Multiple Languages

Problem: Our talent pool includes resumes in multiple languages, but we use an English-only embedding model for optimal performance.

Solution: Rather than using less accurate multilingual embedding models, we translate all content to English during the normalization process before embedding. This approach allows us to use a high-quality English-specific embedding model while still supporting our international talent pool. The translation step is integrated directly into our prompt templates:

## Instructions
You are provided with details for one candidate experience record, which includes the following:
– If any content is not in English, always translate into English.

This standardized translation ensures consistent semantic representation regardless of the original language, maintaining high-quality embeddings across our entire database.

Challenge 5: Database Quality Issues

Problem: Many talents in our database had non-descriptive or vague experience entries like “Data entry specialist” or “Outsourcer specialist” which polluted search results by creating misleading embeddings.

Solution: We implemented a comprehensive database cleanup process:

  • Identified talents with vague or non-descriptive experiences
  • Added more detailed information where possible through follow-up data collection
  • Removed experiences that lacked sufficient detail and couldn’t be improved
  • Implemented quality checks for new data entry to prevent similar issues

This data quality initiative significantly improved our matching results by ensuring that all embeddings were generated from meaningful, descriptive content.

Challenge 6: Balancing Precision and Recall

Problem: Early versions either returned too many irrelevant candidates or missed qualified ones.

Solution: We fine-tuned our similarity thresholds and implemented a weighted scoring system that combines multiple embedding types, achieving approximately 95% confidence in our top results.

Results and Impact

The implementation of our vector embedding-based talent matching system has transformed our recruitment process:

  • Improved Match Quality: Our system finds semantically relevant candidates even when their resumes don’t contain the exact keywords
  • Faster Candidate Discovery: Recruiters find suitable candidates in seconds rather than hours
  • Reduced Bias: By focusing on semantic meaning rather than specific terms, we’ve reduced some forms of unconscious bias
  • Higher Placement Success: The quality of matches has led to higher success rates in placements
  • Simplified User Experience: HR staff can now search for candidates by typing natural language queries (e.g., “experienced Python developer with cloud expertise”) instead of filling out complex forms with specific keywords and filters. This intuitive interface has dramatically increased adoption and satisfaction among recruiters.

Conclusion

Our journey building a semantic talent matching system demonstrates the power of vector embeddings for understanding the nuanced relationships between job requirements and candidate qualifications. By embedding different aspects of resumes separately, optimizing our database for vector operations, and carefully tuning our matching algorithms, we’ve created a system that consistently finds the right candidates with approximately 95% confidence.

The approach we’ve taken – using multiple embedding points, ignoring potentially misleading information like years of experience, and leveraging PostgreSQL’s vector capabilities – has proven highly effective for talent matching. These same principles could be applied to many other domains where semantic understanding is more important than keyword matching.

Have you implemented vector search in your applications? We’d love to hear about your experiences and challenges. Reach out to our team to share your thoughts or learn more about our implementation.

This article describes a feature of the Pulse platform, which is the system powering Ingedata’s business. Stay tuned for more insights into how we’re using cutting-edge technology to solve real business problems.


Data Duplication: When Breaking the Rules Makes Sense

In software development, we’re often taught to follow the DRY principle (Don’t Repeat Yourself) religiously. The idea that data should exist in exactly one place is deeply ingrained in our programming culture. Yet, in real-world systems, particularly in microservice architectures, strict adherence to this principle can lead to complex dependencies, performance bottlenecks, and fragile systems.

This article explores why duplicating data across multiple domains, while seemingly counterintuitive, can actually be the right architectural choice in many scenarios.

Data Duplication is Everywhere

Before we dive into the benefits of intentional data duplication, it’s worth noting that duplication already happens at virtually every level of computing—often without us even realizing it.

Hardware-Level Duplication

At the hardware level, modern CPUs contain multiple layers of cache (L1, L2, L3) that duplicate data from main memory to improve access speed. Your RAM itself is a duplication of data from persistent storage. Even within storage devices, technologies like RAID mirror data across multiple disks for redundancy.

When you’re running a program, the same piece of data might simultaneously exist in:

  • The hard drive or SSD
  • Main memory (RAM)
  • CPU L3 cache
  • CPU L2 cache
  • CPU L1 cache
  • CPU registers

Each level trades consistency management for performance gains. This is duplication by design!

System-Level Duplication

Operating systems maintain file system caches, page caches, and buffer caches—all forms of data duplication designed to improve performance. Virtual memory systems duplicate portions of RAM to disk when memory pressure increases.

Consider what happens when multiple applications open the same file:

  • The OS loads the file into memory once
  • Each application gets its own memory space containing a copy
  • The kernel maintains yet another copy in its file cache

That’s three copies of the same data, all serving different purposes.

Application-Level Duplication

Web browsers cache resources locally. Content Delivery Networks (CDNs) duplicate website assets across global edge locations. Database replicas maintain copies of the same data for read scaling and disaster recovery.

When you visit a website, the same image might exist:

  • On the origin server
  • In multiple CDN edge locations
  • In your browser’s memory
  • In your browser’s disk cache

Framework-Level Duplication

Modern frameworks implement various caching strategies that duplicate data: HTTP response caching, object caching, computed value memoization, and more. Each represents a deliberate choice to trade consistency for performance.

The reality is that data duplication isn’t an anti-pattern—it’s a fundamental strategy employed throughout computing to balance competing concerns like performance, availability, and resilience.

The Problem with Strict Data Normalization

In traditional monolithic applications, we strive for normalized database schemas where each piece of data exists in exactly one place. This approach works well when:

  1. All data access happens within a single application
  2. Transactions can span multiple tables
  3. Joins are efficient and low-cost
  4. The application scales vertically (bigger machines)

However, as systems grow and evolve toward distributed architectures, these assumptions break down:

  • Service Boundaries: Different services need access to the same data
  • Network Costs: Remote calls to fetch data introduce latency
  • Availability Concerns: Dependencies on other services create failure points
  • Scaling Challenges: Distributed transactions become complex and costly

Event-Driven Data Duplication in Microservices

A common pattern in microservice architectures is to use event-driven mechanisms to duplicate and synchronize data across service boundaries. Let’s examine a typical example:

The Person-Employee Example

Consider a system with two separate services:

  1. An Identity and Access Management (IAM) service that manages user accounts and core identity information
  2. An HR/Office service that handles employment-specific information

Both services need access to person data like names and contact information. Rather than having the HR service constantly query the IAM service, each maintains its own copy of the data.

How Duplication Can Be Managed

A typical implementation might use an event-driven approach:

  1. The IAM service is the “source of truth” for core identity information
  2. When identity data changes in IAM, an event is published to a message bus
  3. The HR service subscribes to these events and updates its local copy

This pattern can be implemented with a helper like this (pseudocode):

# Example of a DuplicateFieldHelper in a Ruby-based system
def duplicate_fields(source_event_type, target_repository, mapping)
  # Subscribe to events from the source service
  subscribe_to_event(source_event_type) do |event|
    # Extract the ID of the changed record
    record_id = event.resource_id

    # Find the corresponding record in the local repository
    local_record = target_repository.find_by_source_id(record_id)

    # Only keep the mapped fields that were actually changed in the event,
    # translating each source field name into its local (target) field name
    changed_fields = event.changed_fields
    fields_to_update = mapping
      .select { |source_field, _| changed_fields.include?(source_field) }
      .map { |source_field, target_field| [target_field, event.data[source_field]] }
      .to_h

    # Update the local copy with the new values
    target_repository.update(local_record.id, fields_to_update)
  end
end

# Usage example
duplicate_fields(
  "users.updated",
  EmployeeRepository,
  {
    first_name: :first_name,
    last_name: :last_name,
    email: :contact_email,
    profile_picture: :photo_url
  }
)

This pattern allows services to maintain local copies of data while ensuring they eventually reflect changes made in the authoritative source.

Advantages of Data Duplication in Microservices

1. Service Autonomy

By duplicating data, each service can operate independently without runtime dependencies on other services. If the IAM service is temporarily unavailable, the HR service can still function with its local copy of employee data.

2. Performance Optimization

Local data access is always faster than remote calls. By keeping a copy of frequently accessed data within each service, we eliminate network round-trips and reduce latency.

Consider a dashboard that displays employee information and their current projects. Without data duplication, rendering this dashboard might require:
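
A rough sketch of that first case (the service clients and method names are hypothetical):

# Without duplication: every render triggers remote calls to other services
person   = iam_client.get_person(employee_id)    # network round-trip to the IAM service
projects = project_client.list_for(employee_id)  # network round-trip to the Project service
render_dashboard(person, projects)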

With data duplication, it becomes:
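
With a local, event-synchronized copy, the same render is reduced to local reads (again, hypothetical names):

# With duplication: local reads only, no runtime dependency on other services
employee = employee_repo.find(employee_id)        # local table, synced from IAM events
projects = project_repo.by_employee(employee_id)  # local table owned by this service
render_dashboard(employee, projects)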

The performance difference can be dramatic, especially at scale.

3. Domain-Specific Data Models

Each service can model data according to its specific domain needs. The HR service can add employment-specific fields without affecting the IAM service’s data model.

For example, the IAM service might store basic name information:
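
A minimal sketch of the IAM side, using the Record pattern from earlier in this series (field names are assumptions):

# IAM service: master record for identity data (assumed fields)
module Person
  class Record < Verse::Model::Record::Base
    field :id, type: Integer, primary: true
    field :first_name, type: String
    field :last_name, type: String
    field :email, type: String
  end
end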

While the HR service might enhance this with employment details:
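
And a corresponding HR-side record that duplicates the identity fields and adds employment-specific ones (again, assumed fields):

# HR service: local duplicate enriched with employment details (assumed fields)
module Employee
  class Record < Verse::Model::Record::Base
    field :id, type: Integer, primary: true
    field :person_id, type: Integer   # reference to the IAM master record
    field :first_name, type: String   # duplicated from IAM, kept in sync by events
    field :last_name, type: String    # duplicated from IAM
    field :job_title, type: String    # HR-specific
    field :department, type: String   # HR-specific
    field :hired_at, type: Time       # HR-specific
  end
end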

4. Resilience to Failures

If one service fails, others can continue operating with their local data. This creates a more robust system overall, as temporary outages in one service don’t cascade throughout the entire system.

5. Simplified Queries

With a local copy of the data, each service can query and join it directly in its own database instead of stitching together results from multiple remote APIs. Queries stay simple, fast, and entirely within the service's own domain model.

6. Scalability

Services can scale independently based on their specific load patterns, without being constrained by dependencies on other services.

The Master-Duplicate Pattern

A key principle in successful data duplication is establishing clear ownership. A common approach is the “master-duplicate” pattern:

  1. Single Source of Truth: Each piece of data has one authoritative source (the “master”). For person identity information, the IAM service is the master.
  2. Controlled Propagation: Changes to master data are published as events, which duplicates consume to update their local copies.
  3. Unidirectional Flow: Updates only flow from master to duplicates, never the reverse. If a duplicate needs to change master data, it must request the change through the master service’s API.
  4. Eventual Consistency: The system acknowledges that duplicates may temporarily be out of sync with the master, but will eventually converge.

This pattern provides a structured approach to data duplication that maintains data integrity while gaining the benefits of duplication.

DRY vs. Pragmatism: A Balanced View

The DRY principle remains valuable, but like all principles, it should be applied with nuance. Consider these perspectives:

“DRY is about knowledge duplication, not code duplication.” — Dave Thomas, co-author of The Pragmatic Programmer

“Duplication is far cheaper than the wrong abstraction.” — Sandi Metz

In microservice architectures, some level of data duplication is not just acceptable but often necessary. The key is to duplicate deliberately, with clear ownership and synchronization mechanisms.

Remember that DRY was conceived in an era dominated by monolithic applications. In distributed systems, strict adherence to DRY across service boundaries often leads to tight coupling—precisely what microservices aim to avoid.

When to Duplicate Data

Data duplication makes sense when:

  1. Service Boundaries Align with Business Domains: Each service owns a specific business capability and needs local access to relevant data.
  2. Read-Heavy Workloads: The data is read much more frequently than it’s updated.
  3. Loose Coupling is Priority: You want to minimize runtime dependencies between services.
  4. Performance is Critical: Network calls to fetch data from other services would introduce unacceptable latency.
  5. Resilience Requirements: Services need to function even when other services are unavailable.

Challenges and Mitigations

Of course, data duplication is not free. Here are the main challenges it introduces and how we mitigate them:

1. Consistency Management

Challenge: Keeping duplicated data in sync can be complex.

Mitigation: Use event-driven architectures with clear ownership models. Implement idempotent update handlers and reconciliation processes.

2. Increased Storage Requirements

Challenge: Duplicating data increases storage needs (though in practice this is rarely a significant problem).

Mitigation: Be selective about what data you duplicate. Often, only a subset of fields needs duplication.

3. Complexity in System Understanding

Challenge: Developers need to understand which service owns which data.

Mitigation: Clear documentation and conventions. Use tooling to visualize data ownership and flow.

4. Eventual Consistency Implications

Challenge: Applications must handle the reality that data might be temporarily stale. Events can be dropped, update handlers can fail, and so on. This often has to be handled through periodic maintenance tasks or the use of ACLs.

Mitigation: Design UIs and APIs to gracefully handle eventual consistency. Consider showing “last updated” timestamps where appropriate. Write maintenance scripts that flag and correct data discrepancies once in a while.

Real-World Example: E-Commerce Platform

Consider an e-commerce platform with these services:

  • Product Catalog Service: Manages product information
  • Inventory Service: Tracks stock levels
  • Order Service: Handles customer orders
  • Customer Service: Manages customer profiles
  • Shipping Service: Coordinates deliveries

The Order Service needs product information to display order details. Options include:

  1. No Duplication: Query the Product Catalog Service every time order details are viewed
  2. Full Duplication: Maintain a complete copy of all product data
  3. Selective Duplication: Store only the product data needed for order display (name, SKU, image URL)

Option 3 is often the best compromise. When a product is updated in the Product Catalog Service, an event is published, and the Order Service updates its local copy of the relevant fields.
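
Using the duplicate_fields helper sketched earlier, the Order Service side of option 3 might look like this (the event name and field mapping are assumptions):

# Order Service: keep only the product fields needed to display an order
duplicate_fields(
  "catalog:products:updated",
  OrderProductRepository,
  {
    name:      :product_name,
    sku:       :product_sku,
    image_url: :product_image_url
  }
)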

In the End

Data duplication across multiple domains may feel like breaking the rules, but it’s often a pragmatic solution to real-world challenges in distributed systems. By understanding when and how to duplicate data effectively, we can build systems that are more resilient, performant, and maintainable.

The next time someone invokes the DRY principle to argue against data duplication, remember that even your CPU is constantly duplicating data to work efficiently. Sometimes, the right architectural choice is to embrace controlled duplication rather than fight against it.

As with most engineering decisions, the answer isn’t about following rules dogmatically—it’s about understanding tradeoffs and choosing the approach that best serves your specific requirements.

This article is part of our series on API design and microservice architecture. Stay tuned for more insights into how we’ve built a scalable, maintainable system.


ESE Architecture: The Human Body of Software Design

In previous articles, I’ve hinted at our ESE architecture and how it stands apart from the traditional MVC setup. Now it’s time to dig in and explore what makes ESE both unique and refreshingly simple. It’s a structure inspired by the way our human bodies function.

The Human Body Analogy

Let’s take a break from talking about software and think about the human body for a moment.

  • Exposition: This layer is like our sensory organs—eyes, ears, skin—and the nervous system. It gathers information from the outside world and sends it to the brain for processing. Just as our senses relay external stimuli to the brain, the Exposition layer channels external stimuli into the system.
  • Service: The brain! It processes the information received, makes decisions, and determines what actions must be taken. Similarly, the Service layer is the core of the system, where the business logic resides.
  • Effect: Think of this layer as the muscles and endocrine system. When the brain decides to move a hand or release a hormone, these systems carry out the action. In the ESE architecture, the Effect layer executes the outcomes decided by the Service layer and dispatches events to keep the entire system—or “world”—informed. We strive to keep these actions atomic and free from conditional logic as much as possible.

Understanding ESE: Exposition, Service, Effect

Now, back to the world of computers. At its core, ESE (Exposition-Service-Effect) is a backend architecture that emphasizes a clean separation of concerns, making it easier to manage, test, and scale the different parts of a system. But what exactly do these layers do?

  • Exposition: This layer handles the outward-facing aspects of the system, like reacting to HTTP requests or any events. It’s the entry point where an external system or a user interact with the backend, effectively exposing the system’s functionalities to the world. The Exposition layer offers hooks, which allow you to plug code into events happening in the world. For example:
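
To make this concrete, here is a minimal sketch of an exposition with a single HTTP hook. The hook DSL and class names below are illustrative assumptions in the spirit of Verse, not its documented API:

# Illustrative sketch only: the hook DSL is an assumption, not the literal Verse API
module Account
  class Exposition < Verse::Exposition::Base
    # Hypothetical hook: react to an incoming HTTP POST on /accounts
    expose on_http(:post, "/accounts")
    def create
      # No business logic here: shape the input and delegate to the Service layer
      Account::Service.new(auth_context).create(params[:email], params[:password])
    end
  end
end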

This structure replaces the traditional MVC controller:
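
For comparison, a typical Rails-style controller tends to mix input handling, business rules, and persistence in one place (an illustrative example, not taken from a real codebase):

# Typical MVC controller: input handling, business logic, and persistence combined
class AccountsController < ApplicationController
  def create
    account = Account.new(account_params)
    if account.save
      render json: account, status: :created
    else
      render json: account.errors, status: :unprocessable_entity
    end
  end

  private

  def account_params
    params.require(:account).permit(:email, :password)
  end
end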

In a typical MVC controller, you’d find both input management and business logic combined. The controller might directly interact with models, handle business logic, and render views. In the ESE example above, the Exposition layer only manages the request and delegates all logic to the Service layer. This leads to a cleaner separation of concerns, making the system more modular and easier to maintain.

Another key feature is that the Exposition layer can handle multiple sources of events and protocols:
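
As a sketch of the idea (again, the hook names are assumptions rather than the literal Verse API), the same exposition class can bind HTTP routes, event-bus subscriptions, and scheduled jobs side by side:

# Illustrative sketch: one exposition, several input sources (hypothetical hook names)
module Report
  class Exposition < Verse::Exposition::Base
    expose on_http(:get, "/reports/:id")  # HTTP request
    def show = Report::Service.new(auth_context).find(params[:id])

    expose on_event("orders:created")     # Event-bus message
    def on_order(event) = Report::Service.new(auth_context).track(event.resource_id)

    expose on_cron("0 2 * * *")           # Scheduled job, every night at 02:00
    def nightly_rollup = Report::Service.new(auth_context).rollup!
  end
end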

From a development perspective, I was never satisfied with having to handle different input sources in different ways. The Exposition layer streamlines this by centralizing all inputs and dispatching to the Service layer, whether it’s an HTTP response, an event, a background job, or a cron task.

  • Service: The Service layer is where you design your service objects, set up in-memory structures, and define error types, constants, conditions, and business rules. It is the brain of the system, processing input from the Exposition layer and applying the core business logic. It’s where decisions are made.

In Pulse and Verse, we provide a base Service class (Verse::Service::Base) that handles the authorization context (similar to current_user in Rails) and passes it to the Effect layer, which is in charge of dealing with access rights. But that’s a topic for another article on our approach to authorization management.

  • Effect: Called by the Service layer, the Effect layer takes over to perform mutative operations. This might involve reading or updating a database, calling an API endpoint, publishing events, sending notifications, or any other actions resulting from the service’s operations. Here’s the twist: for every successful write operation, an event must be published to the Event Bus. So, when a new record is created, a records:created event is automatically emitted—thanks to Verse’s built-in functionality.

ESE vs MVC, key differences

You might be wondering how ESE compares to the more traditional MVC (Model-View-Controller) architecture. While both aim to separate concerns, they approach it differently.

In MVC:

  • The Model manages data and business logic, often combining the two. It hides underlying complexity, such as querying the data, and can lead to performance or security issues.
  • The View handles the presentation layer, directly interacting with the Model to display data.
  • The Controller acts as a mediator, processing user input and updating the View.

In contrast, ESE is more modular:

  • The Exposition layer handles all interactions with the outside world, acting as the system’s senses. It is also the place where we render the output, for events such as HTTP requests.
  • The Service layer focuses purely on business logic and is separated from data concerns.
  • The Effect layer manages the tangible outcomes and communicates these changes to the system, akin to how muscles execute actions.

This separation leads to cleaner, more maintainable code and makes it easier to scale complex systems.

A major pain point in MVC is having models handle both logic and effects, especially with teams of mixed experience levels. For instance, mixing query building with business code is a bad practice because it becomes hard to reason about query performance when the query is built incrementally inside the business logic.

Another significant advantage of ESE is testing. You can easily replace the Effect layer with a mock version to check that your logic holds up, resulting in tests that run in microseconds instead of seconds, with no dependencies on other systems, as sketched below. This is especially valuable in large applications, where the difference can translate into saving up to 45 minutes a day in productivity. And yes, that is a number I have experienced.
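
A framework-agnostic illustration of that testing approach, with all class names invented for the example:

    # A service whose Effect is injected, so tests can pass an in-memory fake
    # instead of a real, database-backed repository.
    class GreetingService
      def initialize(effect)
        @effect = effect
      end

      def greet(name)
        raise ArgumentError, "name required" if name.to_s.empty?
        @effect.save_greeting("Hello, #{name}!")
      end
    end

    class FakeGreetingEffect
      attr_reader :saved

      def initialize
        @saved = []
      end

      def save_greeting(text)
        @saved << text
        text
      end
    end

    # In a test (RSpec shown, but any framework works), no database is touched:
    RSpec.describe GreetingService do
      it "delegates the write to the Effect layer" do
        effect = FakeGreetingEffect.new
        GreetingService.new(effect).greet("Ada")
        expect(effect.saved).to eq(["Hello, Ada!"])
      end
    end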

In Summary

We opted for ESE over MVC because it offers a more modern and modular approach to handling complex backend systems. By separating business logic from data concerns and outcomes, ESE allows us to build systems that are easier to manage, test, and scale.

Stay Tuned

In the next article, we’ll dive into why we chose Redis Stream over NATS, RabbitMQ, or Kafka for maintaining our Event Bus. Stay tuned!


Standardizing Medical Imagery

Standardizing Medical Imagery:
Opportunities and Challenges in a Multimodal Landscape

Medical imaging is central to diagnostics, treatment planning, and follow-up care in modern medicine. However, as imaging technologies have evolved, so have the complexities of managing the vast array of images generated. The need to standardize medical imagery has become increasingly apparent due to the wide variety of modalities, parameters, and clinical scenarios involved. Standardizing these images—while challenging—has far-reaching benefits for clinical workflow, medical research, and artificial intelligence (AI) applications. In this text, I will explore the importance of standardizing medical imagery, the challenges encountered at both the sequence and study levels, and how AI can help transform this critical aspect of healthcare.

Why Standardize Medical Imagery?

Standardization in medical imaging ensures that images acquired across different sites, scanners, and protocols are harmonized into a uniform format that can be easily accessed, understood, and utilized. The primary advantage is the seamless organization of these studies for internal and external purposes, including the proper indexing of images for clinicians, researchers, and AI developers. This becomes crucial in high-stakes environments such as large hospital networks, where multiple sites contribute to the imaging database.

Effective standardization facilitates study routing within the imaging inference pipeline, leading to faster and more reliable image processing and diagnostic workflows. Moreover, having standardized imagery helps develop robust display protocols in Picture Archiving and Communication Systems (PACS), ensuring that clinicians see consistent and meaningful representations of the anatomy. From a research perspective, the standardization of medical images exponentially increases the amount of data available for AI projects. AI models thrive on large datasets, and the aggregation of harmonized images across different hospitals allows for identifying new patient cohorts and patterns that can drive research and improved patient care.

The Role of Artificial Intelligence in Standardization

Artificial intelligence has shown immense potential in overcoming the challenges associated with the standardization of medical imagery. AI-based models can be trained to automatically identify and correct image artifacts, classify anatomical regions, and optimize images for specific diagnostic purposes. This capability is critical in improving the efficiency and accuracy of both clinical and research workflows.

For instance, recent studies show that AI models trained on large datasets—comprising millions of standardized images—can achieve up to 90% accuracy in identifying anatomical structures across modalities, improving classification tasks, and detecting anomalies in series that would otherwise go unnoticed. AI algorithms can also analyze vast amounts of data, identifying patterns and correlations that clinicians might miss. These insights are crucial for identifying new patient cohorts for clinical trials and research.

The scalability of AI makes it an ideal tool for large-scale imaging networks. With over 3.6 billion medical imaging procedures performed annually worldwide, the ability to standardize, index, and analyze this data in a consistent and automated manner is invaluable. AI-driven standardization protocols can also facilitate multi-institutional collaborations, enabling the sharing of standardized imaging data while ensuring compliance with data privacy regulations.

Challenges of Standardizing Medical Imagery

The high potential of artificial intelligence to standardize medical imagery also comes with its own challenges. The training and validation datasets necessary for AI development must be prepared by qualified radiologists, who must know how to tackle the inherent variability in how images are acquired and processed. The complexity arises at two levels: the individual sequence (series) and the study level.

1. Challenges at the Series Level:

  • Distorted Images: Artifacts such as streaking, banding, or the presence of braces or leads can render images suboptimal for analysis. Poor positioning of the patient or motion can further degrade image quality.
  • Non-Diagnostic Series: In some cases, certain images are acquired not for diagnostic purposes but for procedural guidance, such as during needle localization or intravenous contrast tracking.
  • Implanted Hardware and Devices: The presence of surgical implants or devices can obscure anatomical regions of interest, affecting the quality and usability of the series.
  • Intent Understanding and Anatomical Classification: Often, series contain adjacent anatomies that may or may not be relevant to the intended diagnosis. Understanding this intent is vital to classify the series appropriately. For example, a brain scan may include images of the skull, but depending on the window and kernel used (bone vs. soft tissue), the intent of the study changes.
  • Breathing Phase: For chest CTs, the breathing phase (expiration vs. inspiration) is critical for interpretation.
  • Laterality: The imaged side (left vs. right) is not always explicitly tagged in the metadata.
  • Image Planes and Reconstruction: Series are captured across multiple planes (axial, sagittal, coronal, etc.) and with varying reconstruction kernels (sharp for bone, smooth for soft tissue), contributing to the challenge of indexing.
  • Contrast Variations: The administration route and contrast phases (e.g., early arterial vs. late arterial) further complicate the standardization process.

2. Challenges at the Study Level:

  • Anatomical Selection: Not every anatomical region visualized in the series is relevant for the overall study. Selecting which anatomies to include in the study-level analysis can be difficult.
  • Contrast Protocols: Some studies might contain both pre-contrast and post-contrast series, which must be carefully categorized to avoid misinterpretation.
  • Study Protocols: The range of clinical protocols, from biopsy to stroke and lung screening, adds another layer of complexity. Each protocol may require a different set of sequences, and standardizing these can be challenging, especially without a universally accepted guideline.

Conclusion

The standardization of medical imagery, although complex, is a necessary endeavor to improve the organization, analysis, and diagnostic potential of imaging data. The challenges encountered at both the sequence and study levels can be addressed through AI-based solutions, which enable efficient classification, artifact reduction, and the optimization of image quality. By standardizing medical imagery, we can expand the pool of data available for AI research, improve patient outcomes through more reliable diagnostics, and facilitate groundbreaking medical research. The future of radiology lies in embracing these technological advancements to create a cohesive and standardized imaging landscape.

Written by
Jean Emmanuel Wattier
Head of Strategic Business

References:

  • Kalra, M. K., et al. “Artificial Intelligence in Radiology.” Radiology, vol. 293, no. 2, 2019, pp. 346-359.
  • McBee, M. P., et al. “Deep Learning in Radiology.” Radiology, vol. 294, no. 2, 2020, pp. 350-362.
  • Kohli, M., et al. “Medical Imaging Data and AI: Barriers and Challenges.” Journal of Digital Imaging, vol. 33, no. 1, 2020, pp. 44-52.


Rib Fractures on Frontal Chest X-rays

Identifying Rib Fractures on Frontal Chest X-rays:
Clinical Relevance and Diagnostic Challenges

Introduction

Rib fractures are common injuries, typically associated with blunt trauma or accidents, and may present varying degrees of severity. The radiologic identification of rib fractures is crucial for managing potential complications such as pneumothorax, hemothorax, and delayed healing. While computed tomography (CT) scans are the gold standard for diagnosing rib fractures due to their higher sensitivity and ability to detect even subtle fractures, chest radiographs, particularly frontal chest X-rays (CXR), remain a frequently employed diagnostic tool in many clinical settings. This is primarily due to their accessibility, cost-effectiveness, and use in routine evaluation of thoracic pathology. While chest X-rays are not ideal for visualizing rib fractures, they can provide valuable incidental findings, particularly in patients who undergo imaging for other reasons. Therefore, recognizing rib fractures on CXRs remains a valuable skill for radiologists.

Relevance of Detecting Rib Fractures on Frontal Chest X-rays

Incidental detection of rib fractures on chest X-rays, while not the primary modality for this purpose, can offer significant clinical insight. Rib fractures may not always be the reason for a patient’s initial presentation or imaging; however, when found, they can shift clinical management and prompt further investigation or intervention. Identifying rib fractures in such cases is particularly important when the patient has experienced unrecognized trauma, has underlying pathology that may affect bone integrity (such as osteoporosis or metastatic lesions), or when there is concern about potential complications like pneumothorax or soft tissue injury. Moreover, the early detection of rib fractures can prevent further injury and guide the clinician in pain management and patient care.

A study by Chien et al. (2020) emphasizes the importance of incidental rib fracture detection on chest X-rays, particularly in elderly patients or individuals with impaired cognition who may not report trauma or rib pain. In such populations, subtle findings can alter management, lead to more targeted investigations, and mitigate potential complications.

The Role of Artificial Intelligence in Detecting Rib Fractures on Frontal Chest X-rays

Artificial intelligence (AI) has the potential to significantly improve the detection of rib fractures on frontal chest X-rays, especially when these fractures are incidental findings. AI algorithms, particularly those based on deep learning, have shown remarkable success in identifying subtle patterns in medical images that may be overlooked by the human eye. A study by Rajpurkar et al. (2018) demonstrated that AI models could match or exceed radiologists in identifying certain thoracic pathologies, and this technology is now being adapted to detect rib fractures with increasing accuracy.

AI systems can be trained on large datasets of chest X-rays, allowing them to learn the nuances of rib fractures, even in challenging locations such as the posterior ribs or areas with poor contrast. For example, one AI model trained on over 100,000 chest X-rays was able to identify rib fractures with a sensitivity of 85% and a specificity of 90%, outperforming traditional radiographic interpretation in some cases. This scale of potential could revolutionize incidental findings, reducing the number of missed fractures and ensuring timely patient care.

Additionally, AI can serve as a second reader, flagging suspicious areas for radiologists to review, thereby improving diagnostic confidence and efficiency. The integration of AI into clinical practice could result in a marked reduction in missed rib fractures, potentially improving outcomes for a significant number of patients annually.

Challenges of Identifying Rib Fractures on Frontal Chest X-rays

Despite the utility of frontal CXRs, identifying rib fractures on these images presents multiple challenges. When developing AI algorithms to automatically detect rib fractures on frontal CXRs, these challenges propagate to the preparation of the training and validation sets, which need to be manually annotated by qualified radiologists. The main challenges are:

  1. Lack of Contrast and Overlapping Structures: Rib fractures are often not well contrasted on frontal chest X-rays. The presence of overlapping structures, such as the scapulae, soft tissue, and the mediastinum, can obscure subtle fracture lines, making them difficult to distinguish from surrounding anatomic structures. Additionally, the orientation of the ribs on frontal images makes it harder to visualize the posterior and lateral portions of the rib cage, where many fractures occur.
  2. Temporal Indeterminacy: Frontal CXRs typically do not provide sufficient information to distinguish between acute, subacute, or chronic rib fractures. This is due to the limited capacity to assess callus formation or the degree of bone remodeling, making it difficult to determine the fracture’s age without further imaging. A healing fracture may look similar to a recent injury in the absence of characteristic healing signs, which are often hard to visualize on standard X-rays.
  3. Fracture Location: Differentiating fractures of the anterior arch from those on the posterior arch of the ribs is particularly challenging on frontal CXRs. The rib curvature, coupled with the two-dimensional nature of the image, can obscure fracture lines, particularly in the posterior ribs, which are often superimposed on the lung fields and spine. Frontal chest X-rays tend to provide a better view of the anterior ribs but may miss posterior or lateral fractures altogether.
  4. Ambiguous Fracture Lines: Fracture lines in ribs can be subtle and have ambiguous extensions that are difficult to track. The complexity of rib anatomy, with its curvature and overlapping structures, may lead to misinterpretation. Small or incomplete fractures, particularly hairline fractures, are especially prone to being overlooked.
  5. Radiologic Report Discrepancies: Interestingly, it is not uncommon for fractures to be described in radiology reports but remain unseen in the X-ray image itself, especially for non-displaced fractures or fractures with minimal cortical disruption. Conversely, fractures that are apparent on the image may not always be identified in the radiologic report, potentially due to the subtlety of the fracture line or the presence of distracting findings in the image. This discordance highlights the variability in radiologists’ detection of fractures on CXRs and the need for careful review of images.

Conclusion

While CT scans remain the superior modality for detecting rib fractures, the incidental identification of such fractures on frontal chest X-rays carries significant clinical relevance, especially when the primary reason for imaging is not trauma-related. However, rib fractures are often difficult to detect on these images due to challenges like poor contrast, difficulty in determining fracture age, and anatomical overlap. Recognizing these limitations is essential for accurate diagnosis and appropriate patient management.

Artificial intelligence (AI) offers promising solutions to these diagnostic challenges. AI algorithms, particularly those trained on large datasets of chest X-rays, can significantly improve the sensitivity and specificity of rib fracture detection, even in challenging locations like posterior ribs. Studies have shown that AI can identify rib fractures with up to 85% sensitivity and 90% specificity, outperforming traditional radiographic interpretations in certain cases. By serving as a second reader and flagging suspicious areas for radiologists, AI has the potential to reduce the number of missed fractures and enhance diagnostic accuracy. Integrating AI into clinical practice could lead to earlier detection of incidental rib fractures, improving outcomes for many patients.

By combining traditional radiologic expertise with AI advancements, radiologists can optimize the diagnostic value of chest X-rays and provide more precise and efficient care.

Written by
Jean Emmanuel Wattier
Head of Strategic Business


RSVP Cocktail + RWiD Conversations

Cocktail + RWiD Conversations, hosted by Segmed and Ingedata

You're invited to Cocktail + RWiD Conversations

Meet INGEDATA at the RSNA 2024 Radiology Congress in Chicago, December 1-5, where we will be showcasing our latest advancements in medical image reading services. We’re also excited to invite you to Cocktail + RWiD Conversations, an exclusive event hosted by Segmed and INGEDATA. This is a fantastic opportunity to engage in thought-provoking discussions on AI-driven healthcare, network with industry peers, and enjoy cocktails in a relaxed, welcoming setting. Don’t miss the chance to connect and explore how our expertise can enhance your projects.

Join us for an evening of insightful discussions on:

  • The challenges and innovations in AI-driven medical imaging
  • The latest advancements in healthtech and radiology solutions
  • The transformative impact of AI on healthcare
 

To confirm your attendance, please RSVP using this link or click the button below

Event-Driven Microservice Architecture

Event-Driven Microservice Architecture:
Why We Chose It and How We Implemented It

Welcome to the second installment in our series on the architecture of Ingedata’s Pulse platform!

When we made the decision to move away from our previous platform, the first question we faced was, “What do we want to build?” Should we overhaul our tech stack, transitioning from Ruby to a language like Python or Go? Should we abandon the monolithic architecture and embrace the microservices trend?

Yacine
Chief Technical Officer

Before diving into these decisions, we needed to clarify the pain points we were addressing. By “pain points”, I’m not just referring to technical challenges but also to broader issues impacting our overall approach. Thanks to our previous platform, we have a clear understanding of our business requirements, since that system is already operational and running.

The bigger question was about the non-functional qualities of our system and how to approach both the design and maintenance phases: we wanted something built for decades, not just for a few years.

Here are the key considerations we focused on:

Interoperability and API-First Design

Our previous platform was extensible, but not interoperable. While we could add modules and code in a structured manner, the system’s accessible endpoints were designed to meet the needs of the frontend application. Authorization and role management were handled at the endpoint level, with custom code filtering returned collections. If we wanted to integrate the application with other software, we would need to rewrite API endpoints and duplicate portions of the code.

For Pulse, we wanted a system that was fundamentally API-first, where the frontend would be just one of many clients utilizing these APIs. These endpoints should also be automatically documented to facilitate easier integration.

Maintainability and Flexibility

In any system, there’s a design phase where new software components and cutting-edge technologies are combined to build the application’s features. As the application matures and enters production, the focus shifts from feature development to refining and maintaining the platform. This includes reworking poorly designed code, handling edge cases, and adapting to an ever-changing business landscape.

New features often correspond with shifts in business paradigms, emphasizing the need for interoperability. For instance, if we decide to manage orders tomorrow (a feature not currently supported by our system), instead of creating a new module, we should be able to develop an entirely new platform that communicates seamlessly with Pulse. Rather than expanding the existing application, the goal is to connect multiple applications.

Resource Constraints and Knowledge Sharing

At Ingedata, we don’t have the vast resources of a company like Google, and our IT team can only grow so much. We place a high value on knowledge sharing and mentoring developers, which means we always maintain a portion of our development team as young, eager learners. While this approach fosters growth, it also means that the code produced isn’t always of the highest quality.

Agility and Resilience

Delivery speed is critical at Ingedata. We pride ourselves on being able to handle any customer project within 3 weeks. We need to be able to deploy changes quickly and respond to bugs with minimal delay. However, moving fast can sometimes mean breaking things, and with 500 employees relying on the system, downtime is not an option. One of our key goals was to design an application that could continue functioning even if certain parts of the system were down.

Considering all these factors, we decided to design our application as a microservice architecture with a twist: it’s event-driven.

The Advantages of Microservices

By breaking the application into multiple domains and databases, we make it easier for developers to quickly understand the specific domain they’re working on, even without prior knowledge. Instead of dealing with 150+ database tables, developers only need to focus on the 15 tables relevant to a specific service.

This approach also creates what I call a “contained mess.” If a young developer makes design mistakes while implementing new features, those mistakes are confined to the scope of the service/domain, making it easier to refactor the code later.

Our operations are cyclical, driven by batches of tasks we process for customers. We need a system that can scale up and down as needed. A monolithic approach would force us to scale everything simultaneously, which isn’t ideal. Microservices, on the other hand, allow us to scale specific domains as required. They also boot faster, which is a significant advantage when used with orchestration systems like Kubernetes. For example, while Rhymes (built on Rails) takes about 15 seconds to load and serve customers due to the extensive codebase, a Pulse service takes only 1.5 seconds.

Finally, microservices make it easier to adapt to changes in business. We can create new services to handle new features or shut down activities that no longer align with our goals. We have no qualms about discontinuing code if we find third-party software that does the job better and more cost-effectively.

The Event-Driven Approach

When building microservices, it’s crucial to determine how different components will interact. There’s no point in adopting a microservice architecture if everything is tightly coupled. Initially, we considered using RPC-based communication, such as HTTP or gRPC, for internal service communication. However, this approach introduces tight coupling. If one service needs to query another and the dependent service is down, it could create a cascade of errors.

Additionally, RPC calls can lead to transaction integrity issues. For example, if Service A needs to update data in Services B and C, and B succeeds while C fails, A might need to revert its changes, but B has already committed them to the database.

To address these challenges, we opted for an event-driven architecture. Unlike RPC calls, services communicate asynchronously by emitting events after changes are made. This approach reverses the dependency links between services, making each service responsible for its own state. Instead of Service A querying Service B for information, Service A listens to events emitted by Service B and updates its state accordingly.

Here’s an example:

Let’s say the OrderDelivery service needs information about a customer’s address from the Customer service. With an RPC-based approach, the code might look like this:
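
A rough, illustrative sketch only: the internal URL, the JSON fields, and the class and method names are assumptions, not our actual code.

    require "net/http"
    require "json"

    # RPC-style coupling: OrderDelivery synchronously asks the Customer service
    # for the address every time an order is processed.
    class OrderDeliveryService
      CUSTOMER_API = "http://customer-service.internal/api/customers" # hypothetical URL

      def order(order_id:, customer_id:)
        customer = find_customer(customer_id)   # only as fast as the remote call
        ship(order_id, to: customer["address"])
      end

      private

      def find_customer(customer_id)
        # If the Customer service is slow or down, this call blocks or fails,
        # and order delivery fails with it.
        body = Net::HTTP.get(URI("#{CUSTOMER_API}/#{customer_id}"))
        JSON.parse(body)
      end

      def ship(order_id, to:)
        puts "Shipping order #{order_id} to #{to}"
      end
    end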

Using an event-driven approach and our custom-built framework, Verse, the code would look like this:
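
The following is only a hedged sketch: the subscription DSL and the local_customers repository are assumptions about how a Verse-like API might express this, not Verse’s documented interface.

    # OrderDelivery keeps its own local copy of the customer data it needs,
    # updated by events published by the Customer service on the Event Bus.
    class CustomerReplicationExposition < Verse::Exposition::Base
      expose on_event("customers:created") do
      end
      def customer_created
        # local_customers is a hypothetical Effect-layer repository owned by OrderDelivery.
        local_customers.create(id: message.body["id"], address: message.body["address"])
      end

      expose on_event("customers:updated") do
      end
      def customer_updated
        local_customers.update(message.body["id"], address: message.body["address"])
      end
    end

    class OrderDeliveryService < Verse::Service::Base
      def order(order_id:, customer_id:)
        # No synchronous call to the Customer service: the address comes from the
        # local copy, so deliveries keep working even if Customer is down.
        customer = local_customers.find(customer_id)
        ship(order_id, to: customer[:address])
      end

      private

      def ship(order_id, to:)
        puts "Shipping order #{order_id} to #{to}"
      end
    end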

Yes, it’s a bit more complex and requires additional effort, but we’ve effectively decoupled the two services. Now, whenever a customer is created in the Customer service, a corresponding record is created in the OrderDelivery service. If the customer’s name or address is updated, OrderDelivery tracks those changes. Even if the Customer service goes down, OrderDelivery can still handle orders. Messages are sent through a streaming service, ensuring that if OrderDelivery is temporarily down, it can replay stored events and catch up when it comes back online.

It also simplifies scaling! Each service instance can manage a specific number of concurrent requests at a time (in our case, 5 per instance/CPU). With an RPC-based approach, if the Customer service begins to experience delays in responding, our Kubernetes orchestrator would scale up not just the Customer service, but also the OrderDelivery service. This happens because OrderDelivery’s order method would only be as fast as the Customer#find_customer API call, creating a bottleneck. This tight coupling can lead to significant challenges when trying to diagnose and resolve performance issues later on.

With the decision to adopt an Event-Driven architecture settled, the next question was which tech stack to use. We explored transitioning to Go or Python and came close to choosing FastAPI, which met many of our requirements and offered strong integration with AI frameworks. However, we ultimately decided to stick with Ruby and develop our own framework.

Our reasoning was multifaceted: we could leverage existing knowledge, Ruby’s ecosystem for web development remains highly competitive, and we genuinely prefer Ruby’s syntax, which, while similar to Python, offers a more enjoyable developer experience—particularly with its capacity for building DSLs.

As for Go, while it’s known for its performance, our specific needs didn’t justify the switch. We prioritized a language that resonates with our team’s expertise and offers a positive, engaging development environment.

With our decision made, it was time to build our framework. We put in extra effort and began constructing it using the ESE (Exposition-Service-Effect) stack, departing from the more traditional MVC 3-tiered setup. But that’s a story for another time—stay tuned for our next article, where we’ll dive into the details!


Data Annotation: The Secret Sauce of AI Vision

Data Annotation: The Secret Sauce of AI Vision 🔍

Ever wondered how AI learns to “see” things? 👀 It’s all thanks to a little magic called data annotation! Let’s break it down:

What is data annotation? 🤔

Imagine you’re teaching a toddler to recognize animals. You’d point at pictures and say, “That’s a dog!” or “Look, a cat!” Data annotation is kinda like that, but for computers. We’re basically putting labels on images so AI can learn what’s what.

Now, for the tech-savvy folks out there: we know not all AI models need data annotation (looking at you, unsupervised learning!). But for the sake of keeping things simple, let’s focus on the annotation part!

Why should I care? 🤷‍♂️

Because AI is only as good as the data it’s trained on. Remember: Garbage In, Garbage Out. If we feed AI bad data, it’ll make bad decisions!

(Some) Types of Annotations

  1. Bounding Boxes: Like drawing a rectangle around your cat in a photo. Quick and easy, but not super precise. Perfect for when you just need to say “There’s a cat… somewhere in this picture!”
  2. Polygonal Annotation: Imagine tracing the exact outline of your cat, paws and all. Takes longer, but way more accurate. Choose this when you need to know exactly where your cat ends and the sofa begins!
  3. Semantic Segmentation: This is like coloring every pixel in the image. “These pixels are cat, these are sofa, these are plant.” Great for understanding entire scenes. It’s like giving AI a very detailed coloring book!
  4. Instance Segmentation: Not only does it color everything, but it also separates individual objects. So you can tell apart each cat in a room full of cats! 😺😺😺

Of course, the type of annotation you choose depends entirely on your project’s specific needs and goals. Choose wisely! 

At Ingedata, we’ve used these techniques to help self-driving cars spot pedestrians, assist doctors in analyzing X-rays, and even help robots sort recyclables!

Remember: behind every smart AI is a team of skilled humans crafting high-quality training data. It’s the essential groundwork that makes AI magic possible! ✨

So next time you see an AI doing something cool, give a little nod to the data annotators.

This post was created through a collaborative ping-pong between Claude 3.5 Sonnet and ChatGPT 4—some humans were in CC, though! The image was generated using the FLUX.1 [dev] model.

Written by
Kevin Lottin
Business Solutions at INGEDATA
