INDEX
Explanations
references to political figures and their affiliations
Negative Logits
itſelf       -1.18
myſelf       -1.10
)"),         -1.05
Jefus        -0.99
pleaſure     -0.98
nakalista    -0.95
ſind         -0.95
―――――        -0.95
་་           -0.94
whoſe        -0.94
Positive Logits
or            1.03
et            0.64
if            0.63
without       0.63
would         0.61
for           0.61
just          0.60
might         0.59
?             0.58
even          0.57
Activation Density: 0.278%