INDEX
Explanations
phrases emphasizing consistency or similarity
Negative Logits
etched      -0.66
dale        -0.63
り          -0.60
spaced      -0.60
bard        -0.59
breath      -0.59
omn         -0.56
com         -0.55
quoted      -0.55
initially   -0.55
Positive Logits
same        0.87
same        0.77
ourke       0.74
chwitz      0.73
result      0.72
conn        0.70
ouses       0.70
olini       0.69
Same        0.67
ighed       0.67
Activations Density: 0.162%