Fontalike: Font Similarity Through Geometric Analysis

Building a font similarity engine that analyzes 7,441 Google Fonts using custom geometric algorithms, achieving 94% classification accuracy and sub-50ms query times on 55 million character comparisons.

The Challenge

Fontalike started with a simple question: Can we find visually similar fonts programmatically? With 7,441 fonts in Google Fonts alone, manually browsing for the perfect typeface is time-consuming. Designers need a tool that understands font similarity beyond simple metadata like "sans-serif" or "serif."

The technical challenge became clear quickly: fonts are complex geometric shapes that vary wildly in their structure. A naive pixel-by-pixel comparison would fail—fonts rendered at different sizes, weights, or with anti-aliasing variations would appear completely different. We needed algorithms that understood the structural essence of letterforms.

Design Goals

The system needed to:

  1. Classify fonts accurately into Sans-Serif, Serif, and Script categories
  2. Find similar fonts within each category (no one wants a script font similar to a geometric sans)
  3. Handle scale efficiently—55+ million character-to-character comparisons
  4. Return results fast—under 50ms for any similarity query
  5. Work with incomplete fonts—not all fonts support all 94 ASCII printable characters

The approach would need to balance accuracy, interpretability, and computational efficiency.

Geometric Classification: The Two-Index Approach

After testing seven different algorithms for serif detection and multiple ornament metrics, the winning approach emerged: a simple two-index system that achieved 93.8% accuracy on hand-labeled test fonts.

Serif Index (v5): Detecting Horizontal Protrusions

Serif fonts have distinctive horizontal strokes at the ends of letterforms. To detect these programmatically:

  1. Render each character at 500pt into a 1000x1000 bitmap
  2. Apply morphological thinning to reduce each stroke to a 1-pixel skeleton
  3. Analyze skeleton endpoints: points with exactly one neighbor
  4. Classify endpoints as horizontal or vertical based on their orientation
  5. Calculate ratio: horizontal endpoints / total endpoints

The key insight: serif fonts have significantly more horizontal endpoints (serifs) compared to sans-serif fonts where endpoints tend to be vertical (stem terminals).

Thresholds discovered empirically:

  • Sans-Serif: < 0.16 horizontal endpoint ratio
  • Serif: ≥ 0.16 horizontal endpoint ratio
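
A minimal sketch of steps 2-5, using scikit-image for thinning; the function, parameter names, and the orientation test are illustrative simplifications rather than the production code:

import numpy as np
from skimage.morphology import skeletonize

def serif_index(glyph_bitmap, window=5):
    # Horizontal-endpoint ratio for a single binary glyph bitmap (illustrative)
    skel = skeletonize(glyph_bitmap > 0)
    horizontal = total = 0
    for y, x in zip(*np.nonzero(skel)):
        # An endpoint has exactly one neighbor in its 8-connected neighborhood
        neighborhood = skel[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
        if neighborhood.sum() - 1 != 1:
            continue
        total += 1
        # Estimate local stroke orientation from skeleton pixels around the endpoint
        ys, xs = np.nonzero(skel[max(y - window, 0):y + window + 1,
                                 max(x - window, 0):x + window + 1])
        if xs.max() - xs.min() > ys.max() - ys.min():
            horizontal += 1  # stroke runs mostly left-right: a serif-like terminal
    return horizontal / total if total else 0.0

# Per-font index: average serif_index over all rendered characters,
# then apply the 0.16 threshold above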

Calligraphic Index: Detecting Flowing Scripts

Script fonts exhibit smooth, flowing curves—the hallmark of handwritten calligraphy. To capture this:

  1. Sample skeleton points at regular intervals (every 2 pixels)
  2. Calculate local curvature using tangent vectors with a 5-pixel window
  3. Measure smoothness: how gradually the direction changes
  4. Average across all characters to get a "flow score"

The discovery: script fonts maintain consistent curvature changes (smooth flowing lines), while print fonts have more abrupt direction changes (geometric corners).

Thresholds:

  • Script: ≥ 0.88 calligraphic index
  • Decorative Serif (edge case): 0.84-0.87 range
  • Sans/Serif: < 0.88 calligraphic index
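
A minimal sketch of the curvature measurement, assuming the skeleton pixels of one stroke have already been ordered into a path (that ordering step is omitted); the absolute scale of this sketch will not match the published thresholds exactly:

import numpy as np

def flow_score(path, step=2, window=5):
    # `path` is an ordered array of (x, y) skeleton points for one stroke
    pts = np.asarray(path, dtype=float)[::step]      # sample every `step` pixels
    if len(pts) < window + 2:
        return 0.0
    deltas = pts[window:] - pts[:-window]            # tangent over a `window`-point span
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])
    turns = np.diff(angles)                          # direction change between samples
    turns = (turns + np.pi) % (2 * np.pi) - np.pi    # wrap to [-pi, pi]
    # Flowing scripts change direction gradually (small turns), so the score nears 1
    return float(1.0 - np.mean(np.abs(turns)) / np.pi)

# Font-level calligraphic index: average flow_score over all character strokes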

Combined Classification Logic

if calligraphic_index >= 0.88:
    category = "Script"
elif serif_index >= 0.16:
    category = "Serif"
else:
    category = "Sans-Serif"

This simple decision tree achieved 93.8% accuracy (30/32 correct) on diverse test fonts, outperforming more complex multi-index approaches.

Why Simpler Metrics Failed

Before arriving at this solution, I tested numerous other approaches that seemed promising but failed in practice:

  • Thinning Index (skeleton pixels / total pixels): Complete overlap between all categories
  • Ornament Length (distance from endpoints to junctions): Script fonts varied wildly (0.0 to 0.54)
  • Skeleton Complexity (perimeter / area): Simple scripts scored lower than decorative serifs
  • Curved Section Ratio: Counter-intuitive—serif fonts had more curved sections than smooth scripts

The lesson: more metrics ≠ better accuracy. Focus on orthogonal features that capture fundamentally different aspects of typography.

Similarity Algorithm: Contour Matching

Once fonts are classified, finding similar fonts requires measuring shape similarity. The challenge: comparing letterforms that might vary in size, thickness, or subtle stylistic details while capturing their core geometric structure.

Character-Level Comparison

For each pair of fonts, we compare 94 ASCII printable characters individually:

  1. Render & skeletonize both characters at the same size
  2. Extract contours from the skeletonized shapes
  3. Compute contour distance using shape moment invariants
  4. Handle missing glyphs: store NULL for characters not supported by a font
  5. Average across characters: only non-NULL comparisons contribute

The database stores all 55+ million individual character comparisons, allowing flexible aggregation strategies later.
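
The post identifies the descriptor only as "shape moment invariants"; OpenCV's matchShapes, which compares Hu-moment invariants, is one plausible way to implement steps 2-5 (the exact production descriptor may differ):

import cv2
import numpy as np

def char_distance(skel_a, skel_b):
    # Contour distance between two skeletonized glyph bitmaps (uint8, 0/255); OpenCV 4 API.
    # Returns None when either glyph is missing, mirroring the NULL rows for unsupported characters.
    if skel_a is None or skel_b is None:
        return None
    contours_a, _ = cv2.findContours(skel_a, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours_b, _ = cv2.findContours(skel_b, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours_a or not contours_b:
        return None
    ca = max(contours_a, key=len)  # longest contour of each glyph
    cb = max(contours_b, key=len)
    return cv2.matchShapes(ca, cb, cv2.CONTOURS_MATCH_I1, 0.0)

def font_distance(chars_a, chars_b):
    # Average the non-NULL character distances between two fonts
    dists = [d for d in map(char_distance, chars_a, chars_b) if d is not None]
    return float(np.mean(dists)) if dists else None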

Why Contour Matching Won

I tested two similarity approaches:

Structural Similarity (SSIM): Pixel-level comparison

  • Performance: ~20 comparisons/second
  • Accuracy: Good
  • Issue: Extremely slow for 55M comparisons

Contour Matching (shape moments): Shape descriptor comparison

  • Performance: 400+ comparisons/second (20x faster!)
  • Accuracy: Comparable to SSIM
  • Benefit: Scale-invariant, focuses on shape structure

The 20x performance gain made full-corpus processing feasible: 2 days on 8 cores on a modest laptop instead of 40 days.

Category-Aware Filtering

Crucial design decision: similarity results are filtered by category. When you search for fonts similar to Pacifico (a script font), you only get other script fonts—not decorative serifs that might have similar flourishes.

This simple rule transformed result quality from "interesting but mixed" to "immediately useful."

API Query Strategy

-- Normalized similarity score (lower contour distance = higher similarity)
SELECT 
    font_id,
    font_name,
    AVG(EXP(-distance * 200)) as similarity_score
FROM char_results
WHERE reference_font_id = ?
    AND distance IS NOT NULL  -- Skip missing glyph comparisons
    AND category = ?          -- Same category only
GROUP BY font_id, font_name
ORDER BY similarity_score DESC
LIMIT 20

Query time: <50ms on 55M rows, thanks to proper indexing on (reference_font_id, category).
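
The post doesn't name the database engine; in standard SQL, a composite index matching that description would look roughly like:

-- Composite index backing the similarity query above
CREATE INDEX idx_char_results_ref_cat
    ON char_results (reference_font_id, category);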

Performance at Scale

Processing 7,441 fonts with 94 characters each means 55+ million character-pair comparisons. This section documents the optimization journey from infeasible to production-ready.

Computational Bottleneck: The Profile Surprise

Initial assumption: Font rendering must be slow.

Reality after profiling:

  • Font rendering: 5% of CPU time
  • Structural similarity (SSIM): 95% of CPU time

The bottleneck was entirely in the similarity calculation, not data preparation.
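
The post doesn't say which profiler was used; with the standard library, a breakdown like this can be obtained around a hypothetical compare_fonts driver:

import cProfile
import pstats

# Profile one end-to-end font comparison (compare_fonts, font_a, font_b are hypothetical)
cProfile.run("compare_fonts(font_a, font_b)", "pair.prof")
pstats.Stats("pair.prof").sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time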

Optimization 1: Algorithm Swap

Switching from SSIM to contour matching delivered a 20x speedup:

  • Before: 19.98 comp/sec
  • After: 400+ comp/sec

This single change made full-corpus processing feasible.

Optimization 2: Database Commit Strategy

With 55+ million character comparisons to store, database writes became a critical bottleneck. The challenge: balance write performance with crash recovery.

The tradeoff: Commit every N comparisons vs. commit at the end.

  • Small batches (commit every 1,000 rows): Slow writes, but easy to resume after crashes
  • Large batches (commit every 1M rows): Fast writes, but lose hours of work on crashes
  • Sweet spot (commit every 100,000 rows): Balance between speed and resumability

This strategy allowed the process to recover from interruptions without re-computing millions of comparisons, while still achieving reasonable write throughput.
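
A minimal sketch of that commit cadence, assuming SQLite as the storage layer and a hypothetical compare_all_pairs() generator (the real table has more columns than shown):

import sqlite3

BATCH_SIZE = 100_000  # the sweet spot described above
INSERT = "INSERT INTO char_results (reference_font_id, font_id, char, distance) VALUES (?, ?, ?, ?)"

conn = sqlite3.connect("fontalike.db")
buffer = []
for row in compare_all_pairs():   # hypothetical generator of result tuples
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        conn.executemany(INSERT, buffer)
        conn.commit()             # durable checkpoint: a crash loses at most one batch
        buffer.clear()
if buffer:                        # flush the final partial batch
    conn.executemany(INSERT, buffer)
    conn.commit()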

Optimization 3: Pickle Caching

Font rendering is not the dominant cost, but it is still expensive (TTF parsing, rasterization, thresholding). Precompute once, reuse forever:

import pickle

# First run: render, skeletonize, and cache each font once
for font in all_fonts:
    rendered_chars = render_all_chars(font)
    skeletons = skeletonize_all(rendered_chars)
    with open(f"cache/{font.id}.pkl", "wb") as fh:
        pickle.dump(skeletons, fh)

# Subsequent runs: load the precomputed skeletons from cache
with open(f"cache/{font.id}.pkl", "rb") as fh:
    skeletons = pickle.load(fh)

Cache size: 2.1 GB for 7,441 fonts × 94 characters. It's a one-time cost that eliminates roughly 95% of the preprocessing time on subsequent runs; reading cached skeletons from an NVMe disk is still far faster than re-rendering each character.

Parallel Processing

With optimizations in place, the workload became embarrassingly parallel:

  • Split fonts into 8 groups
  • Run 8 processes simultaneously (one per CPU core)
  • Each process handles ~930 fonts
  • Monitor progress with a simple shell script
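
A minimal sketch of that fan-out, with hypothetical load_all_font_ids() and compare_font_against_corpus() helpers standing in for the real job code:

from multiprocessing import Pool

NUM_WORKERS = 8

def process_font_group(font_ids):
    # Worker: run every comparison job for the fonts in this group
    for font_id in font_ids:
        compare_font_against_corpus(font_id)  # hypothetical per-font job

if __name__ == "__main__":
    all_ids = load_all_font_ids()             # hypothetical: returns ~7,441 font ids
    groups = [all_ids[i::NUM_WORKERS] for i in range(NUM_WORKERS)]  # ~930 fonts each
    with Pool(NUM_WORKERS) as pool:
        pool.map(process_font_group, groups)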

Final performance:

  • 2 days on 8-core machine
  • 55,331,282 rows in database
  • <50ms average query time

For comparison, the initial SSIM approach would have taken 40+ days on the same hardware.

Results & Lessons Learned

Classification Accuracy

Test set: 32 hand-labeled fonts across all categories

Category     Accuracy
Sans-Serif   100%  (7/7)
Serif        90%   (9/10)
Script       93.3% (14/15)
TOTAL        93.8% (30/32)

For comparison, earlier approaches achieved only 81-84% accuracy. The two-index method's simplicity proved to be its strength.

Similarity Quality

Before category filtering, searching for fonts similar to "Pacifico" (a script font) would return decorative serifs mixed with actual scripts. After filtering by category: 100% relevant results.

The system successfully handles edge cases like decorative serifs (a high calligraphic index, but low enough to stay out of script territory) and print-handwriting fonts (script-like, but with less fluid curves).

Key Technical Lessons

1. Profile Before Optimizing

My initial guess about the bottleneck (font rendering) was completely wrong. The actual bottleneck was SSIM computation, not font rendering (even though we optimized the latter as well). Always profile first.

2. Simpler Features Win

After testing 7 serif detection algorithms and 4 ornament metrics, the simplest two-index approach achieved the best accuracy. More metrics ≠ better classification.

3. Algorithm Choice Matters More Than Code Optimization

Switching from SSIM to contour matching delivered 20x speedup—far more than any micro-optimizations could achieve.

4. Domain Knowledge Beats Black Boxes

Hand-crafted geometric features (serif index, calligraphic index) are interpretable and debuggable. When classification fails, I can examine the indices and understand why. This interpretability was crucial for iterating on the algorithm.

5. User Needs Drive Architecture

The requirement "filter similar fonts by category" fundamentally shaped the database schema (store individual character scores), classification algorithm (need accurate categories), and API design (category-aware queries). Design from use cases, not from algorithms.

What's Next

Potential improvements for future iterations:

  • Deep learning classification: Train a CNN on font images to potentially exceed 94% accuracy
  • Character-specific weighting: Not all characters matter equally. Some characters remain highly consistent across the whole dataset, while others show much higher variation
  • Multi-dimensional similarity: Separate metrics for shape, weight, width, and proportion
  • Contextual pairing: Instead of "similar fonts," suggest "complementary fonts" for design pairings
  • Extracting particularly characteristic parts of every character and scoring each of them separately
  • Allowing user-submitted samples for similarity matching (a consequence of the above)

Conclusion

Fontalike demonstrates that carefully engineered geometric algorithms can achieve excellent results for specialized domains. By developing custom features and combining orthogonal metrics, the system achieves production-scale performance with interpretable results.

The journey from 20 comparisons/second to 400+ comparisons/second, and from 81% to 94% accuracy, showcases the value of iterative refinement, rigorous testing, and thoughtful feature engineering.

Most importantly: documenting failures provides as much value as documenting successes. The path to 94% accuracy was paved with seven failed serif detection algorithms, four rejected ornament metrics, and countless parameter tuning experiments—all of which taught me what doesn't work and why.

Partial demo available: demo.fontalike.net shows our purely algorithmic filtering of fonts across Script/Serif/Sans-Serif categories. We are still running further computations and exploring new algorithmic approaches.

Source code: The full code will be available on GitHub once we are satisfied with the results.