Explanations
words that indicate a problem or limitation
scientific writing
Negative Logits
Token         Logit
houſe         -0.97
فريبيس        -0.97
despite       -0.96
<?            -0.94
pleaſure      -0.94
wikipagina    -0.93
becauſe       -0.93
Monfieur      -0.92
Theſe         -0.92
ſtate         -0.91
Positive Logits
Token         Logit
,             0.56
b             0.56
M             0.55
De            0.55
al            0.55
<eos>         0.54
des           0.54
b             0.54
v             0.53
ib            0.52
Activations Density: 6.820%