INDEX
Explanations
references to concepts and ideas
Negative Logits
Token    Logit
OUR      -0.17
deer     -0.17
coming   -0.17
adow     -0.16
eeper    -0.16
alnız    -0.15
agi      -0.15
aces     -0.15
ity      -0.15
ings     -0.15
Positive Logits
Token    Logit
ually    0.51
ual      0.40
UAL      0.30
uality   0.26
uali     0.26
tual     0.25
ively    0.24
uele     0.22
uale     0.22
ors      0.20
Activations Density: 0.022%