Explanations
This neuron activates on seemingly random sets of words and does not appear to have a clear function.
Negative Logits
Token             Logit
ra                -0.63
Ră                -0.50
RAD               -0.47
Rad               -0.46
RAD               -0.44
Ra                -0.44
Rav               -0.43
########.         -0.43
rad               -0.43
ram               -0.43
Positive Logits
Token             Logit
InstrumentedTest  0.69
esternos          0.68
ogaster           0.63
CreateIndex       0.60
})`               0.60
enfans            0.59
SourceChecksum    0.59
vorbehalten       0.59
setViewName       0.59
<bos>             0.59
Activations Density: 1.407%