INDEX
Explanations
repeated phrases or concepts that indicate similarity
Negative Logits
related    -0.16
itself     -0.15
#ac        -0.15
loff       -0.14
the        -0.14
rac        -0.14
requ       -0.13
better     -0.13
any        -0.13
rious      -0.13
Positive Logits
exact      0.43
thing      0.42
-sex       0.41
kind       0.38
kinds      0.35
sort       0.35
amount     0.33
type       0.33
exact      0.32
-old       0.31
Activations Density 0.083%