Explanations
terms related to abstraction and abstract concepts
Negative Logits
уча
-0.17
лас
-0.16
одо
-0.16
-------------------------------------------------------------------------↵
-0.16
ern
-0.15
shaw
-0.15
)prepare
-0.15
unta
-0.15
ermo
-0.15
agra
-0.14
Positive Logits
ed
0.31
edly
0.24
ivism
0.23
ified
0.23
ively
0.20
ivist
0.19
-syntax
0.19
s
0.18
ing
0.18
ly
0.17
Activations Density 0.017%