Explanations
"the" followed by descriptive nouns
Negative Logits
whose    0.18
مدينة    0.17
inguistic    0.17
های    0.17
idespread    0.17
Saw    0.17
From    0.16
from    0.16
-    0.16
à    0.16
Positive Logits
guy    0.20
老師    0.18
entire    0.18
it    0.18
editor    0.18
onus    0.18
only    0.17
guys    0.17
designers    0.17
slightest    0.17
Activations Density 0.482%