INDEX
Explanations
verbs indicating action or intention
phrases that indicate necessity, capability, or ongoing actions
Negative Logits
bury         -0.64
KL           -0.56
hack         -0.54
Sly          -0.52
Britain      -0.52
paternity    -0.51
Pigs         -0.50
Truth        -0.50
Seah         -0.49
hin          -0.49
Positive Logits
ãĥĩãĤ£         0.65
âĵĺ            0.65
ãĥĦ            0.64
pmwiki         0.64
Tokens         0.63
nevertheless   0.61
nonetheless    0.59
partName       0.59
sugg           0.58
downright      0.58
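The first three positive-logit tokens look like encoding debris, but they are consistent with how a GPT-2-style byte-level BPE vocabulary displays raw bytes. Assuming that is the tokenizer behind this dashboard (the page itself does not say), the sketch below rebuilds the byte-to-character table and maps each token string back to UTF-8 text. The function names `bytes_to_unicode` and `decode_token` are illustrative, not part of the dashboard.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2-style byte-level BPE.

    Assumption: the dashboard's tokenizer follows this scheme.
    """
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to code point 256 + n, in byte order.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


def decode_token(token):
    """Map a displayed token string back to the text its bytes encode."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in token)
    # Partial byte sequences can occur in a vocabulary; replace rather than fail.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥĩãĤ£", "âĵĺ", "ãĥĦ"]:
        print(tok, "->", decode_token(tok))
```

Under that assumption the three tokens decode to the katakana ディ, the circled letter ⓘ, and the katakana ツ, so they are legitimate vocabulary entries rather than corruption.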
Activations Density 0.986%