INDEX
    Explanations

    recommendations for safety and prevention measures

    Negative Logits
    Token            Logit
    "ego"            -0.15
    "anz"            -0.15
    "scal"           -0.14
    "porto"          -0.14
    "erland"         -0.14
    "коз"            -0.14
    "PixelFormat"    -0.14
    "maal"           -0.14
    " Tunnel"        -0.14
    "ancybox"        -0.14
    Positive Logits
    Token            Logit
    " never"         0.20
    " avoided"       0.20
    " avoid"         0.19
    " Avoid"         0.19
    " NEVER"         0.19
    " avoiding"      0.19
    " familiar"      0.18
    " avoidance"     0.17
    " avoids"        0.17
    " always"        0.17
    Act Density: 0.078%

    No Known Activations