INDEX
Explanations
mentions of individuals in specific scenarios or interactions
the word "who" and its variations related to individuals in various contexts
Negative Logits
Anything    -0.69
Beginning   -0.63
Gothic      -0.59
Texture     -0.58
å½          -0.58
Ending      -0.56
Meaning     -0.55
creation    -0.55
stop        -0.55
Spending    -0.54
Positive Logits
upon          1.36
promptly      1.10
subsequently  1.06
oping         1.03
oped          1.02
resembled     0.97
then          0.97
proceeded     0.96
resided       0.96
wished        0.95
Activations Density 0.176%