INDEX
Explanations
phrases related to danger and warning
references to danger or harmfulness
Negative Logits
via        -0.83
rix        -0.82
elle       -0.75
ILA        -0.75
ļéĨĴ       -0.73
ARCH       -0.72
angular    -0.72
roma       -0.72
ann        -0.72
arger      -0.71
Positive Logits
dangerous  1.11
undermin   1.02
endanger   1.00
adolesc    0.91
danger     0.89
danger     0.87
hazardous  0.85
mosqu      0.84
dangers    0.83
deadly     0.80
Activations Density 0.015%