INDEX
Explanations
mentions of specific people or possibly social media handles
references to individuals and their associations with specific actions or roles
Negative Logits
usc            -0.74
augment        -0.70
ãĤ¼ãĤ¦ãĤ¹      -0.69
displayText    -0.68
Ctrl           -0.68
govern         -0.68
subp           -0.66
ãĥŁ            -0.65
Afric          -0.63
ward           -0.63
Positive Logits
bies           0.89
smoking        0.81
ĸļ士           0.79
TAMADRA        0.79
zees           0.77
phies          0.74
Reviewer       0.73
zee            0.73
ansky          0.73
pty            0.73
Activations Density 0.530%