    Explanations

    words relating to principles, ethics, or moral considerations

    Negative Logits
    Token          Logit
    " myster"      -0.78
    " vulner"      -0.76
    " sacrific"    -0.74
    " limb"        -0.74
    " mathemat"    -0.73
    " writ"        -0.73
    " trainers"    -0.70
    " conduc"      -0.69
    " builders"    -0.69
    " destro"      -0.68
    Positive Logits
    Token               Logit
    "ï¸ı"               1.31
    "vernment"          1.04
    "SpaceEngineers"    0.95
    "lean"              0.95
    "log"               0.92
    "ove"               0.91
    "ï¸"                0.91
    "ËĪ"                0.90
    "deg"               0.89
    "âĹ¼"               0.89
    Activation Density: 0.036%

    No Known Activations