INDEX
Explanations
phrases that denote precision or specificity in statements
Negative Logits
Flavoring  -0.70
oké        -0.70
oscopic    -0.69
ngth       -0.66
rift       -0.66
cffff      -0.64
rug        -0.64
ocene      -0.64
sacrific   -0.63
itiz       -0.62
Positive Logits
ãĤ¨        0.78
opposite   0.76
wrong      0.68
why        0.68
analogous  0.65
minus      0.64
correct    0.64
µ          0.61
Horowitz   0.61
¯          0.61
Activations Density: 0.006%