INDEX
Explanations
references to various types of threats
Negative Logits
urses      -0.84
ricks      -0.83
tein       -0.73
arist      -0.71
raham      -0.68
ools       -0.68
gian       -0.68
gown       -0.67
ributes    -0.67
uties      -0.67
Positive Logits
posed      1.27
threat     0.99
threats    0.94
threat     0.87
emanating  0.82
Threat     0.81
glare      0.77
lessly     0.76
xual       0.75
sov        0.75
Activations Density 0.024%