INDEX
Explanations
actions related to legal or medical outcomes with potential negative consequences
Negative Logits
idth         -0.79
reci         -0.61
lished       -0.60
overty       -0.59
hov          -0.58
heny         -0.56
rouse        -0.56
icipated     -0.55
RPGs         -0.55
Peaks        -0.54
Positive Logits
anyway       1.00
afterwards   0.98
afterward    0.96
anyways      0.95
instantly    0.93
promptly     0.92
accordingly  0.92
unanimously  0.91
luckily      0.89
shortly      0.88
Activations Density: 0.464%