INDEX
Explanations
words that indicate a specific category or classification
Negative Logits
er      -0.23
thon    -0.18
iser    -0.17
995     -0.16
ãģĤ     -0.15
ARSE    -0.15
é¸      -0.15
s       -0.15
sı      -0.15
кав     -0.15
Positive Logits
opher   0.21
otle    0.21
ream    0.20
otel    0.20
ead     0.20
ea      0.19
ortion  0.19
ries    0.18
rik     0.17
Ø©      0.17
Activations Density 0.037%