Explanations
phrases indicating strong positive evaluations
Negative Logits
aldo      -0.18
tees      -0.16
at        -0.16
elli      -0.16
eum       -0.15
elight    -0.15
nal       -0.15
ofile     -0.15
yen       -0.15
ziel      -0.14
Positive Logits
spring    0.21
ington    0.19
못        0.19
ows       0.19
-known    0.18
enough    0.17
akit      0.17
acre      0.17
ender     0.16
iam       0.16
Activations Density 0.054%