    Explanations

    phrases indicating safety and security

    Negative Logits
    Token         Logit
    "loth"        -0.18
    "polator"     -0.16
    "ideon"       -0.15
    "oles"        -0.15
    "рава"        -0.15
    "strcasecmp"  -0.15
    "ogne"        -0.14
    "-piece"      -0.14
    "atre"        -0.14
    "aceous"      -0.14
    Positive Logits
    Token         Logit
    " safe"       0.31
    " Safe"       0.30
    ".safe"       0.28
    "safe"        0.26
    "Safe"        0.24
    "-safe"       0.22
    "unsafe"      0.22
    ".Safe"       0.21
    " safely"     0.21
    " safer"      0.20
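    The page does not say how the logit lists are produced. A common approach, assumed here, is to project the feature's decoder direction through the model's unembedding matrix and take the k highest and lowest scores (ignoring any final layer norm). The names `top_logits`, `decoder_dir` and `W_U` are hypothetical; this is a sketch, not the dashboard's actual code.

```python
import numpy as np


def top_logits(decoder_dir: np.ndarray, W_U: np.ndarray, vocab: list, k: int = 10):
    """Rank vocabulary tokens by how much a feature direction boosts or suppresses them.

    Assumptions (hypothetical, not from the source page):
      decoder_dir: shape (d_model,), the feature's decoder vector
      W_U:         shape (d_model, vocab_size), the unembedding matrix
    Returns (top_positive, top_negative) as lists of (token, score) pairs.
    """
    scores = decoder_dir @ W_U          # one direct-logit score per vocabulary token
    order = np.argsort(scores)          # token indices, lowest score first
    negative = [(vocab[i], float(scores[i])) for i in order[:k]]
    positive = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return positive, negative


# Toy usage: an identity unembedding makes each score equal to one entry of decoder_dir.
vocab = [" safe", "loth", "unsafe"]
pos, neg = top_logits(np.array([0.31, -0.18, 0.22]), np.eye(3), vocab, k=2)
print(pos)
print(neg)
```

    Sorting once with `argsort` gives both ends of the ranking; for a real vocabulary of tens of thousands of tokens, `np.argpartition` would avoid a full sort.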
    Activation Density: 0.027%
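    A density of 0.027% means the feature fires on roughly 27 of every 100,000 tokens (about 1 in 3,700). The page does not define the measure; the sketch below assumes it is the fraction of token positions whose activation exceeds zero. The function name `activation_density` and its `threshold` parameter are hypothetical.

```python
import numpy as np


def activation_density(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Fraction of token positions where the feature's activation exceeds `threshold`.

    Assumption (hypothetical): "density" counts activations strictly above zero;
    some tools may use a different threshold.
    """
    return float(np.mean(acts > threshold))


# Toy data: 27 active positions out of 100,000 tokens.
acts = np.zeros(100_000)
acts[:27] = 1.0
print(f"{activation_density(acts):.3%}")  # 0.027%
```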

    No Known Activations