Explanations
statements about truthfulness or falsehood
Negative Logits
ReusableCell    -0.46
)"),            -0.44
taal            -0.42
\{\\            -0.42
indietro        -0.41
îna             -0.41
tilbake         -0.39
vermelhas       -0.39
rojas           -0.39
tilbage         -0.38
Positive Logits
kasarigan       0.99
intptr          0.83
autorytatywna   0.78
estekak         0.75
twimg           0.75
typelib         0.72
ivelany         0.72
awaiter         0.70
<<<<<<<<<<<<<<  0.69
rungsseite      0.68
Activations Density 0.859%