    Explanations

    Purity/cleanliness

    Negative Logits
    Token             Logit
    " pure"           -1.05
    "Pure"            -0.91
    "pure"            -0.89
    " purity"         -0.88
    " Pure"           -0.85
    " impurities"     -0.84
    " impurity"       -0.79
    " impure"         -0.75
    " Great"          -0.74
    " purify"         -0.74
    Positive Logits
    Token             Logit
    "ly"              0.96
    "ArrowToggle"     0.94
    "ness"            0.90
    "ième"            0.88
    "AnchorStyles"    0.86
    "ening"           0.83
    "цездатний"       0.78
    " ویکی‌پدی"        0.73
    "^(@)"            0.73
    " déchir"         0.73
    Activation Density 0.061%

    No Known Activations