Explanations
instances of confusion and the need for clarification
Negative Logits
zos          -0.17
tak          -0.17
Wy           -0.15
isposable    -0.15
ughs         -0.15
uels         -0.14
unya         -0.14
manship      -0.14
rio          -0.14
اÛĮØ´         -0.14
Positive Logits
/conf        0.27
confuse      0.24
confusion    0.24
confusing    0.23
confused     0.20
ingly        0.17
xes          0.16
olini        0.16
Conf         0.15
-cut         0.15
Activations Density: 0.044%