INDEX
Explanations
references to oneself or reflexive actions
references to the concept of "self."
Negative Logits
akings       -0.76
microsoft    -0.66
airspace     -0.64
Amend        -0.63
wedge        -0.62
MSN          -0.58
Learns       -0.58
osa          -0.57
arrivals     -0.57
rought       -0.56
Positive Logits
ortium       1.09
selves       1.04
destruct     0.92
self         0.84
ridges       0.83
theless      0.83
terday       0.81
same         0.75
ridge        0.73
acid         0.73
Activations Density 0.014%
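
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero: the function name `activation_density`, the `threshold` parameter, and the sample data below are all hypothetical and chosen only to illustrate a value of that size.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 14 of which activate the feature.
sample = [0.0] * 99_986 + [2.5] * 14
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.014%
```

Under this reading, a density of 0.014% means the feature fires on roughly 14 of every 100,000 tokens, which would fit a feature tied to a narrow concept such as "self".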