INDEX
NEGATIVE LOGITS
VM              0.40
önes            0.39
StudentRecord   0.39
प्रय             0.38
selfish         0.38
incorrect       0.38
simpl           0.38
giỏi            0.37
যথার্থ           0.37
Incorrect       0.37
POSITIVE LOGITS
safe            0.94
harmless        0.88
безопас         0.87
Safe            0.83
safe            0.81
Safe            0.79
bezpie          0.77
safest          0.77
SAFE            0.75
safes           0.73
Activations Density 0.099%