INDEX
Explanations
phrases indicating alignment or similarity
Negative Logits
token        logit
ı            -0.06
normally     -0.06
Lon          -0.06
Usually      -0.05
Blocking     -0.05
quez         -0.05
¢            -0.05
oard         -0.05
specified    -0.05
ment         -0.05
Positive Logits
token        logit
match        0.08
corre        0.08
exactly      0.08
matched      0.08
same         0.07
identical    0.07
same         0.07
-match       0.07
match        0.07
.Match       0.07
Activation density: 0.031%