    Explanations

    comparing features in tables

    Negative Logits
    Token       Value
    "ouvoir"    0.44
    "िनय"       0.43
    "первых"    0.41
    "oforte"    0.41
    "ים"        0.40
    "用の"      0.40
    "füh"       0.39
    "¬"         0.39
    "→</"       0.38
    "’:"        0.38
    Positive Logits
    Token                Value
    " Feature"           0.57
    (whitespace)         0.55
    (whitespace)         0.54
    " Features"          0.54
    (whitespace)         0.52
    " 특징"              0.50
    " Characteristics"   0.49
    "Features"           0.49
    " FEATURE"           0.49
    (whitespace)         0.49
    Activation Density: 0.010%

    No Known Activations