INDEX
Explanations
terms indicating degrees of correctness or falseness
Negative Logits
er      -0.34
ar      -0.27
ore     -0.25
eru     -0.22
ORE     -0.21
at      -0.21
erse    -0.20
arro    -0.20
erer    -0.20
arb     -0.19
Positive Logits
hetics  0.22
hetic   0.21
ev      0.20
sal     0.19
ead     0.19
eb      0.18
ing     0.18
imate   0.18
ee      0.17
t       0.17
Activations Density: 0.044%