INDEX
Explanations
words related to risks or dangers
references to perceived dangers or risks
Negative Logits
Token      Logit
puted      -0.81
mys        -0.78
raham      -0.74
bits       -0.74
ilts       -0.71
arist      -0.71
ria        -0.69
gans       -0.68
unes       -0.68
ANN        -0.67
Positive Logits
Token      Logit
threat     1.13
threats    0.98
Threat     0.93
proble     0.92
threat     0.89
posed      0.86
deterrent  0.85
menace     0.84
challeng   0.84
undermin   0.83
Activations Density: 0.017%