Explanations
safety disclaimers and warnings
Negative Logits
initial      0.42
perceiving   0.42
Selection    0.41
Selected     0.40
INITIAL      0.40
Responding   0.40
selection    0.40
ྷ            0.40
Initial      0.39
responding   0.39
Positive Logits
dangerous    0.76
dangereux    0.75
dangerous    0.75
berbahaya    0.71
Dangerous    0.70
危险         0.68
危険         0.68
Dangerous    0.66
खतरनाक       0.65
peligros     0.65
Activations Density: 0.317%