INDEX
Explanations
phrases indicating consensus or agreement
Negative Logits
phrase      -0.06
ERA         -0.06
afari       -0.06
grains      -0.06
word        -0.06
orris       -0.05
olith       -0.05
revealed    -0.05
ystick      -0.05
dux         -0.05
Positive Logits
ÏĮÏĦε        0.07
é³´         0.07
aight       0.07
_past       0.07
é           0.07
λεÏħ        0.07
νÏİ         0.07
_cached     0.07
xDD         0.07
LabelText   0.07
Activations Density: 0.001%