Explanations
explains that requests violate safety
Negative Logits
always  0.43
dieser  0.38
neste  0.37
pip  0.37
diesen  0.36
مل  0.35
任何  0.35
Always  0.35
thằng  0.35
toujours  0.34
Positive Logits
two  0.55
quite  0.52
覺  0.47
αρκε  0.47
several  0.47
khá  0.46
understandably  0.46
quite  0.46
രണ്ട്  0.46
രണ്ടു  0.45
Activations Density 0.168%