INDEX
Explanations
words related to threats, risks, and dangerous situations
references to various forms of danger
Negative Logits
ergy      -0.84
orney     -0.79
owned     -0.79
olitan    -0.75
anmar     -0.71
issance   -0.70
urally    -0.69
guyen     -0.67
ulous     -0.65
eenth     -0.64
Positive Logits
ously     0.93
lurking   0.93
lur       0.89
posed     0.83
Danger    0.83
endanger  0.81
lessly    0.80
mong      0.79
crow      0.77
danger    0.74
Activations Density: 0.027%