INDEX
Explanations
phrases discussing moral or judgmental assessments
Negative Logits
cci         -0.15
Briggs      -0.15
aint        -0.14
del         -0.14
usra        -0.14
uf          -0.14
aget        -0.14
shi         -0.14
Keystone    -0.13
lou         -0.13
Positive Logits
cis         0.15
erb         0.15
hoo         0.15
ìĸ´ì§Ħ      0.15
plies       0.15
pedo        0.15
.elapsed    0.14
favor       0.14
rupa        0.14
routing     0.14
Activations Density 0.250%