Explanations
terms related to membership or possession
Negative Logits
llib       -0.16
lle        -0.15
ffe        -0.15
uell       -0.15
resse      -0.15
اÙģØª      -0.15
riel       -0.15
rana       -0.14
äll        -0.14
utr        -0.14
Positive Logits
(ed         0.20
gers        0.20
nowhere     0.20
ents        0.19
ÂŃing       0.17
ading       0.17
ent         0.16
Sizer       0.16
belong      0.16
together    0.16
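The two logit lists rank vocabulary tokens by how strongly this feature pushes their output logits up or down. Below is a minimal sketch of how such lists are commonly produced. It assumes each score is the dot product of the feature's decoder direction with the token's unembedding vector. `logit_effects`, `top_and_bottom`, and the toy 2-dimensional vectors are hypothetical and not taken from the dashboard.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]

# Hypothetical helper (assumption): projects the feature's decoder direction
# onto each token's unembedding vector to get a direct logit effect per token.
def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    return {
        token: sum(d * u for d, u in zip(decoder_dir, vec))
        for token, vec in unembed.items()
    }

# Hypothetical helper: returns the k highest-scoring (positive) and the
# k lowest-scoring (negative) tokens, mirroring the two lists above.
def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[Ranked, Ranked]:
    negative = sorted(effects.items(), key=lambda kv: kv[1])[:k]
    positive = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return positive, negative

# Toy example with made-up vectors; effects are approximately
# (ed: 0.20, belong: 0.16, llib: -0.16, lle: -0.15.
decoder_dir = [1.0, 0.5]
unembed = {
    "(ed": [0.10, 0.20],
    "belong": [0.06, 0.20],
    "llib": [-0.20, 0.08],
    "lle": [-0.15, 0.0],
}

positive, negative = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print([t for t, _ in positive])  # ['(ed', 'belong']
print([t for t, _ in negative])  # ['llib', 'lle']
```

Sorting the full score dictionary is fine for a sketch. For a real vocabulary of tens of thousands of tokens, a partial selection such as `heapq.nlargest` avoids a full sort.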
Activation density: 0.012%
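Activation density is usually reported as the share of tokens in a sample on which the feature is active. The sketch below assumes that "active" means an activation strictly above zero, exposed here as the hypothetical `threshold` parameter. It also assumes the activations for the token sample are already available as a flat list. `activation_density` is a hypothetical helper, not a library API.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold` (assumed definition)."""
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# 1 of 4 positions is active.
print(activation_density([0.0, 0.0, 1.3, 0.0]))  # 0.25

# 12 active positions out of 100,000 tokens gives a density of 0.012%.
acts = [0.0] * 99_988 + [1.0] * 12
density = activation_density(acts)
print(f"{density:.3%}")  # 0.012%
```

Under this definition, a density of 0.012% means the feature fires on roughly one token in every 8,300, which marks it as a very sparse feature.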