Explanations
repetitive or comparative phrases indicating similarity and difference
Negative Logits
seau            -0.15
anford          -0.15
tet             -0.15
yro             -0.14
pong            -0.14
dependencies    -0.14
ССР             -0.14
анг             -0.14
Verifier        -0.14
doing           -0.14
Positive Logits
justice         0.35
Justice         0.25
job             0.25
justice         0.24
things          0.24
thing           0.23
damage          0.23
wrong           0.23
jobs            0.22
cket            0.21
Activations Density 0.174%