INDEX
Explanations
phrases that suggest potential negative outcomes or consequences
phrases indicating results or consequences of actions
Negative Logits
aeper       -0.75
kt          -0.70
ian         -0.69
ault        -0.65
ramid       -0.64
atu         -0.63
ilings      -0.63
aredevil    -0.63
Quart       -0.62
Technique   -0.62
Positive Logits
cele        1.08
havoc       0.86
irre        0.78
trouble     0.77
ãĥĨãĤ£      0.75
undue       0.74
heat        0.69
facts       0.69
cause       0.68
nightmares  0.67
Activations Density 0.025%