INDEX
    Explanations
    Negative Logits
    安全    0.50
    安全性    0.47
    Safety    0.46
     안전    0.45
     безпе    0.45
    safety    0.44
     安全    0.44
    SAFETY    0.44
     Sicherheit    0.43
     безопасности    0.43
    Positive Logits
     guidelines    0.49
     protocols    0.47
     barriers    0.47
     protocolos    0.45
    protocols    0.41
     Barriers    0.39
     Protocols    0.38
     Guidelines    0.38
    ető    0.38
    одо    0.38
    Activation Density: 0.029%

    No Known Activations