INDEX
Explanations
mentions of names followed by numbers or the abbreviation "ON"
instances of the word "on"
Negative Logits
Token          Logit
cow            -0.61
Liberties      -0.59
eg             -0.58
looking        -0.58
refuge         -0.58
revenge        -0.57
functioning    -0.57
median         -0.55
fam            -0.55
shopping       -0.55

Positive Logits
Token    Logit
ON        3.94
ONS       2.72
ons       1.96
ONY       1.93
OND       1.88
on        1.76
ONES      1.59
ONE       1.58
ONT       1.50
OFF       1.45
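Feature dashboards of this kind commonly build the positive and negative logit lists by projecting the feature's decoder direction onto the model's unembedding matrix and ranking the resulting per-token scores. The sketch below assumes that convention (ignoring any final layer-norm scaling). `top_logit_tokens`, `decoder_direction`, `unembed`, and `vocab` are hypothetical names, and the toy inputs are not taken from this feature.

```python
import numpy as np

def top_logit_tokens(decoder_direction, unembed, vocab, k=10):
    """Return the k highest and k lowest (token, logit) pairs.

    Assumes decoder_direction has shape (d_model,) and
    unembed has shape (d_model, n_vocab).
    """
    logits = decoder_direction @ unembed      # direct effect on each token, shape (n_vocab,)
    order = np.argsort(logits)                # token indices, ascending by logit
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy example: with an identity unembedding, the logits equal the direction itself.
vocab = ["ON", "cow", "on"]
unembed = np.eye(3)
direction = np.array([1.0, -2.0, 0.5])
pos, neg = top_logit_tokens(direction, unembed, vocab, k=1)
print(pos, neg)
```

The positive list is read from the top of the ranking and the negative list from the bottom, which is why the two tables above are sorted in opposite directions.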
Activation Density: 0.010%
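Activation density is usually reported as the percentage of sampled tokens on which the feature fires (activation above zero). A density of 0.010% is 1 × 10⁻⁴, or roughly one token in every 10,000. A minimal sketch, assuming a 1-D array of the feature's activations over a token sample; `activation_density`, `activations`, and the zero threshold are illustrative assumptions.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds threshold."""
    activations = np.asarray(activations)
    return 100.0 * np.count_nonzero(activations > threshold) / activations.size

# One active token out of 10,000 gives the density shown above.
acts = np.zeros(10_000)
acts[42] = 3.1
print(f"{activation_density(acts):.3f}%")  # 0.010%
```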