INDEX
Explanations
assertions or statements of knowledge
Negative Logits
Token      Logit
521        -0.16
492        -0.15
ipur       -0.15
okane      -0.14
279        -0.14
ghi        -0.14
Regents    -0.14
Oak        -0.14
gro        -0.14
circuit    -0.14
Positive Logits
Token      Logit
emer       0.16
orman      0.16
utz        0.15
Ñıж        0.15
chấm       0.14
fingert    0.14
зн         0.14
forc       0.14
zon        0.14
irth       0.14
Activations Density: 0.001%