INDEX
Explanations
words related to danger and risk
references to danger or threats
Negative Logits
issance   -0.71
ulous     -0.69
eenth     -0.66
orney     -0.66
atters    -0.65
ergy      -0.64
GB        -0.64
pel       -0.61
guyen     -0.60
anmar     -0.59
Positive Logits
ously     1.12
lur       1.01
posed     0.97
ous       0.94
lurking   0.88
Danger    0.84
zone      0.84
OUS       0.83
saf       0.83
hazards   0.81
Activations Density 0.035%