INDEX
Explanations
mentions of intense suffering, pain, and related concepts
references to suffering and its various contexts
Negative Logits
ouncing     -0.68
sure        -0.67
Collider    -0.67
clude       -0.66
sports      -0.64
cluding     -0.64
leans       -0.63
reek        -0.62
lev         -0.62
wed         -0.62
Positive Logits
inflicted   0.97
hani        0.83
Nadu        0.82
miser       0.78
horribly    0.77
fools       0.76
lessly      0.76
havoc       0.75
agony       0.75
endured     0.74
Activations Density 0.034%