Explanations
negative situations or outcomes
Negative Logits
Token       Logit
rouse       -0.80
ynthesis    -0.77
glas        -0.75
cript       -0.72
cellent     -0.71
Collider    -0.71
uador       -0.71
leans       -0.70
amaru       -0.70
ools        -0.69
Positive Logits
Token          Logit
inflicted      0.89
plag           0.88
havoc          0.86
fully          0.77
consequences   0.77
horribly       0.77
imaru          0.76
nightmares     0.76
der            0.74
consequence    0.73
Activations Density: 2.512%