INDEX
Explanations
negative qualifiers and counterarguments regarding statements or claims
Negative Logits
ubar     -0.16
igel     -0.15
quoi     -0.14
oning    -0.14
iling    -0.14
stry     -0.14
894      -0.14
527      -0.14
yntax    -0.14
boats    -0.13
Positive Logits
necessarily    0.28
actual         0.24
mere           0.20
merely         0.20
actual         0.18
intended       0.18
nor            0.18
gua            0.17
directly       0.17
Da             0.17
Activations Density 0.115%
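
A density figure like the one above is commonly read as the fraction of token positions on which the feature is active. The sketch below illustrates that reading only. The function name `activation_density`, the zero threshold, and the sample counts are all assumptions made for illustration, not values or methods taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The page itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 20,000 token positions, feature active on 23 of them.
sample = [0.0] * 19977 + [0.5] * 23
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density near 0.1% would mean the feature fires on roughly one token in a thousand.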