Explanations
phrases that indicate responsibility or accountability
Negative Logits
unbelievably    -0.95
fucking         -0.94
insanely        -0.93
goddamn         -0.93
fucking         -0.90
absolutely      -0.90
utterly         -0.89
FUCKING         -0.89
absolutely      -0.87
EVERY           -0.85
Positive Logits
perhaps         1.21
perhaps         1.10
Perhaps         1.02
somewhat        0.97
Perhaps         0.95
maybe           0.95
<bos>           0.92
anskje          0.91
Somewhat        0.89
vielleicht      0.88
Activations Density: 0.862%