Explanations
mentions of experiments involving human-like creatures
Negative Logits
interpol    -0.16
ugins       -0.15
interp      -0.14
afen        -0.14
quel        -0.14
aters       -0.14
dg          -0.13
stell       -0.13
_GC         -0.13
erland      -0.13
Positive Logits
experiments     0.28
experimental    0.28
research        0.26
experiment      0.25
Experimental    0.25
testing         0.24
experimental    0.23
experiment      0.23
tests           0.22
研究            0.22
Activations Density 0.050%