This guide covers the development and testing of CodeSage analyzers, focusing on the Phase 1 stabilization improvements implemented for Python, Go, and Java parsers.
The parser_factory.py module provides a unified interface for creating language-specific parsers:
```python
from codesage.analyzers.parser_factory import create_parser

# Create a parser for a specific language
parser = create_parser('python')  # or 'go', 'java'
parser.parse(source_code)
functions = parser.extract_functions()
```

All language parsers inherit from BaseParser, which provides common functionality:
- Tree-sitter integration
- AST traversal utilities
- Error handling
- Performance monitoring
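The responsibilities above can be sketched as a template-method base class. This is a hypothetical outline, not the actual CodeSage API; the method names (`_parse_impl`) are illustrative:

```python
import logging
import time


class BaseParser:
    """Sketch of a common parser base: subclasses supply the
    language-specific (e.g. Tree-sitter) parse step."""

    def __init__(self, language: str):
        self.language = language
        self.log = logging.getLogger(f"codesage.analyzers.{language}")

    def parse(self, source: str):
        start = time.perf_counter()            # performance monitoring
        try:
            tree = self._parse_impl(source)    # Tree-sitter integration point
        except Exception as exc:               # error handling
            self.log.warning("parse failed for %s: %s", self.language, exc)
            raise
        self.log.debug("parsed in %.3fs", time.perf_counter() - start)
        return tree

    def _parse_impl(self, source: str):
        raise NotImplementedError
```

A concrete parser only overrides `_parse_impl`, so timing, logging, and error handling stay in one place.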
The Python parser now correctly handles nested async functions with proper parent scope tracking:
```python
async def outer():
    async def inner():  # Correctly tracked as nested in 'outer'
        pass
    return inner
```

Implementation Details:
- Recursive function extraction with scope stack
- Parent scope attribution for nested functions
- Proper async/await detection
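The scope-stack approach can be illustrated with Python's own `ast` module (the real parser works on a Tree-sitter tree, so this is a simplified stand-in):

```python
import ast


def extract_functions(source: str) -> list[dict]:
    """Walk the AST with an explicit scope stack so nested (async)
    functions are attributed to their enclosing function."""
    results = []

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                results.append({
                    "name": child.name,
                    "parent": scope[-1] if scope else None,   # parent scope attribution
                    "is_async": isinstance(child, ast.AsyncFunctionDef),
                })
                visit(child, scope + [child.name])            # push new scope
            else:
                visit(child, scope)

    visit(ast.parse(source), [])
    return results
```

For the example above, `inner` is reported with `parent == "outer"` and `is_async == True`.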
Enhanced complexity calculation for match statements:
```python
def process_data(data):
    match data:
        case int() if data > 0:  # Each case adds to complexity
            return "positive"
        case int() if data < 0:
            return "negative"
        case _:
            return "unknown"
```

Complexity Calculation:
- Base complexity: 1
- Each case clause: +1
- Guard conditions: +1 each
- Logical operators (and/or): +1 each
Robust error handling for syntax errors:
```python
def valid_function():
    return "valid"

def broken_function(
    # Missing closing parenthesis - syntax error
    param1: str

def another_valid_function():  # Still parsed correctly
    return "also valid"
```

Features:
- Partial AST extraction on syntax errors
- Graceful degradation
- Warning logging for parsing issues
Full support for generic functions and structs:
```go
func Add[T constraints.Ordered](a, b T) T {
    return a + b
}

type Container[T any] struct {
    Value T
    Items []T
}
```

Extracted Information:
- Type parameter names
- Type constraints
- Generic decorators
- Constraint validation
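Extracting type parameter names and constraints from generic function signatures can be sketched with a regex (a toy stand-in for the Tree-sitter query the real parser uses; it handles only `func` declarations, not generic types):

```python
import re

GENERIC_FUNC = re.compile(r'func\s+(\w+)\[([^\]]+)\]')


def extract_type_params(go_source: str) -> dict[str, list[tuple[str, str]]]:
    """Map each generic function name to its (type parameter, constraint) pairs."""
    result = {}
    for name, params in GENERIC_FUNC.findall(go_source):
        pairs = []
        for param in params.split(","):
            ident, _, constraint = param.strip().partition(" ")
            pairs.append((ident, constraint))
        result[name] = pairs
    return result
```

For `func Add[T constraints.Ordered](a, b T) T`, this yields `{"Add": [("T", "constraints.Ordered")]}`.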
Enhanced struct tag extraction:
```go
type User struct {
    ID   int    `json:"id" db:"user_id" validate:"required"`
    Name string `json:"name" db:"full_name" validate:"required,min=2"`
}
```

Features:
- Complete tag preservation
- Multiple tag format support
- Tag-based semantic analysis
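Go's conventional `key:"value"` tag format can be decomposed with a short regex; this sketch assumes the parser has already extracted the raw tag string:

```python
import re


def parse_struct_tag(tag: str) -> dict[str, str]:
    """Split a Go struct tag like `json:"id" db:"user_id"` into a
    key -> value mapping (conventional format only)."""
    return dict(re.findall(r'(\w+):"([^"]*)"', tag))
```

This preserves every tag key, which is what enables tag-based semantic analysis (e.g. spotting `validate:"required"` fields).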
Proper method receiver parsing:
```go
func (u *User) SetName(name string) { // Pointer receiver
    u.Name = name
}

func (u User) GetName() string { // Value receiver
    return u.Name
}
```

Full support for Java 14+ record classes:
```java
public record Person(String name, int age) {
    public Person { // Compact constructor
        if (name == null) throw new IllegalArgumentException();
    }

    public boolean isAdult() {
        return age >= 18;
    }
}
```

Features:
- Record component extraction
- Compact constructor detection
- Record method identification
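Record component extraction can be illustrated with a regex sketch over the record header (illustrative only; CodeSage uses Tree-sitter, and this handles only simple, non-generic component lists):

```python
import re


def extract_record_components(java_source: str):
    """Return (record name, [(type, component name), ...]) for the
    first record declaration found, or None."""
    m = re.search(r'record\s+(\w+)\s*\(([^)]*)\)', java_source)
    if m is None:
        return None
    components = []
    for part in m.group(2).split(","):
        type_name, _, component = part.strip().rpartition(" ")
        components.append((type_name, component))
    return m.group(1), components
```

For the `Person` record above, this yields the name plus `[("String", "name"), ("int", "age")]`.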
Improved nested annotation support:
```java
@ApiOperation(
    value = "Get user",
    authorizations = {
        @Authorization(
            value = "oauth2",
            scopes = {@AuthorizationScope(scope = "read")}
        )
    }
)
public User getUser(@PathVariable Long id) {
    return userService.findById(id);
}
```

Features:
- Nested annotation extraction
- Annotation parameter parsing
- Semantic tag generation from annotations
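A minimal sketch of pulling annotation names out of a Java snippet, nested or not (name extraction only; parsing annotation parameters properly requires a real parser like Tree-sitter):

```python
import re


def annotation_names(java_source: str) -> list[str]:
    """List every annotation name in source order, including annotations
    nested inside other annotations' arguments."""
    return re.findall(r'@(\w+)', java_source)
```

The name list is the raw material for semantic tag generation (e.g. mapping `@PathVariable` to an HTTP-parameter tag).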
Proper filtering of lambda expressions from function extraction:
```java
items.stream()
    .filter(item -> item.length() > 1)   // Not extracted as function
    .map(item -> item.toUpperCase())     // Not extracted as function
    .forEach(System.out::println);
```

```
tests/
├── unit/analyzers/
│   ├── test_python_parser_comprehensive.py
│   ├── test_go_parser_edge_cases.py
│   ├── test_java_parser_advanced.py
│   └── test_ground_truth_validation.py
├── performance/
│   └── test_analyzer_performance.py
└── fixtures/analyzer-validation-set/
    ├── python/
    ├── go/
    ├── java/
    └── ground-truth/
```
Comprehensive unit tests for each parser:
- Python: Nested async functions, match statements, error recovery
- Go: Generic constraints, struct tags, method receivers
- Java: Record classes, nested annotations, lambda filtering
Benchmarking for parsing speed and memory usage:
```python
@pytest.mark.benchmark
def test_python_parsing_speed_1000_loc(self, benchmark):
    code = generate_python_code(1000)
    result = benchmark(parse_python_code, code)
    assert benchmark.stats.mean < 0.5  # < 500ms
```

Accuracy validation against manually curated test cases:
```python
def test_python_nested_async_functions_accuracy(self):
    code = load_fixture("complex_nested_async.py")  # load the matching source fixture
    expected = load_ground_truth("complex_nested_async.py")
    actual = parse_and_extract(code)
    validate_accuracy(expected, actual)
```

The .codesage/test-config.yaml file defines:
- Coverage thresholds (95% minimum)
- Performance requirements (500ms for 1000 LOC)
- Accuracy targets (95% minimum)
- Test fixture specifications
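A configuration covering those four areas might look like the following sketch (field names are illustrative, not the actual test-config.yaml schema):

```yaml
# Hypothetical shape of .codesage/test-config.yaml
coverage:
  minimum_percent: 95
performance:
  max_parse_time_ms: 500
  reference_loc: 1000
accuracy:
  minimum_percent: 95
fixtures:
  root: tests/fixtures/analyzer-validation-set
```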
```shell
# Run all analyzer tests with coverage
pytest tests/unit/analyzers/ --cov=codesage/analyzers --cov-fail-under=95

# Run performance benchmarks
pytest tests/performance/ --benchmark-only

# Run ground truth validation
pytest tests/unit/analyzers/test_ground_truth_validation.py

# Run specific language tests
pytest tests/unit/analyzers/test_python_parser_comprehensive.py -v
```

- Target: Parse 1000 lines of code in < 500ms
- Measurement: Average time across multiple runs
- Optimization: Tree-sitter query optimization, caching
- Target: < 200MB peak memory for 10,000 LOC
- Measurement: Peak memory usage during parsing
- Optimization: Efficient AST traversal, garbage collection
- Target: > 95% accuracy against ground truth
- Measurement: Function/class/import detection accuracy
- Validation: Manual verification of complex test cases
The .github/workflows/analyzer-tests.yml workflow:
- Test Matrix: Python 3.10, 3.11, 3.12
- Coverage: 95% minimum with HTML reports
- Performance: Benchmark validation
- Quality: Code style and lint checks
- Documentation: Automated test report generation
- All tests must pass
- Coverage must be ≥ 95%
- Performance benchmarks must meet targets
- Code quality checks must pass
- Ground truth validation must achieve ≥ 95% accuracy
- Design: Define the feature requirements
- Implementation: Add parser logic
- Testing: Create comprehensive tests
- Validation: Add ground truth test cases
- Documentation: Update this guide
- Performance: Verify performance impact
- Reproduction: Create failing test case
- Fix: Implement the fix
- Validation: Ensure test passes
- Regression: Run full test suite
- Performance: Verify no performance degradation
- Profiling: Identify bottlenecks
- Optimization: Implement improvements
- Benchmarking: Measure performance gains
- Validation: Ensure accuracy maintained
- Documentation: Update performance metrics
- Use Tree-sitter queries for efficient AST traversal
- Implement error recovery for robust parsing
- Cache expensive operations
- Validate against real-world code samples
- Write tests before implementing features (TDD)
- Use property-based testing for edge cases
- Benchmark performance regularly
- Maintain comprehensive ground truth dataset
- Follow language-specific conventions
- Use type hints and documentation
- Implement proper error handling
- Optimize for readability and maintainability
- Check Tree-sitter grammar compatibility
- Verify source code encoding
- Review error recovery logic
- Profile with cProfile or similar tools
- Check for inefficient AST traversal
- Optimize Tree-sitter queries
- Review ground truth expectations
- Check for language version compatibility
- Verify test fixture integrity
- Tree-sitter playground for query testing
- Python debugger for step-through debugging
- Memory profilers for memory usage analysis
- Benchmark tools for performance analysis
- Python: Pattern matching optimization, walrus operator support
- Go: Workspace module support, build constraint parsing
- Java: Virtual threads, pattern matching for switch
- Speed: 1000 LOC in < 250ms (50% improvement)
- Memory: < 100MB for 10,000 LOC (50% reduction)
- Accuracy: > 98% ground truth validation
- Plugin architecture for custom analyzers
- Language-agnostic semantic analysis
- Real-time parsing for IDE integration