INDEX
Explanations
phrases related to suffering or health issues
Negative Logits
Token      Logit
Collider   -0.75
hens       -0.68
onomy      -0.68
TRY        -0.66
clusive    -0.65
amera      -0.64
tarians    -0.64
ouf        -0.64
uese       -0.63
appro      -0.63
Positive Logits
Token          Logit
setbacks       1.15
losses         1.08
fools          1.00
terribly       0.98
debilitating   0.96
horrend        0.95
severe         0.93
horribly       0.93
embarrassment  0.91
injuries       0.91
Activations Density: 0.058%