INDEX
Explanations
words related to danger or potential harm
references to danger or harmful situations
Negative Logits
issance       -0.81
ļéĨĴ          -0.78
galitarian    -0.73
guyen         -0.73
ergy          -0.71
owned         -0.71
Ħ¢            -0.70
elle          -0.70
urally        -0.69
eenth         -0.68
Positive Logits
danger        0.95
Danger        0.91
endanger      0.82
danger        0.82
ously         0.82
dangers       0.81
lur           0.81
lurking       0.78
dangerous     0.74
peril         0.73
Activations Density 0.054%
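
The density figure above can be read as the share of token positions on which the feature fires at all. The sketch below shows one way such a number might be computed; it assumes "active" means an activation strictly greater than zero, and the function name `activation_density` and the sample data are hypothetical, not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a position counts as active when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 5 of which fire.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.7, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density of 0.054% would correspond to roughly 54 active positions per 100,000 tokens under this definition.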