INDEX
Explanations
references to traditional concepts, practices, or items across various contexts
Negative Logits
arel      -0.16
ogl       -0.15
indr      -0.14
bras      -0.13
thing     -0.13
ings      -0.13
mented    -0.13
sburg     -0.13
.joda     -0.13
/he       -0.13
Positive Logits
ists        0.36
ist         0.31
ISTS        0.24
ism         0.24
itionally   0.23
/current    0.22
isti        0.21
ista        0.21
-looking    0.21
istic       0.20
Activations Density 0.029%