INDEX
Explanations
self-referential language, such as words that indicate self-identity or self-perception
references to self-identification and personal perspective
Negative Logits
Token              Logit
Meth               -0.70
--+                -0.67
airst              -0.62
Lear               -0.61
Via                -0.61
methamphetamine    -0.60
ARB                -0.60
Madden             -0.60
frogs              -0.59
////               -0.59
Positive Logits
Token              Logit
limits             0.76
zbek               0.75
ãĥ¤                0.72
tical              0.69
animous            0.69
priv               0.68
agi                0.67
favorably          0.66
ilitarian          0.66
polit              0.66
Activations Density 0.135%
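
As a rough illustration of what a density figure like this measures, the sketch below computes the fraction of token positions on which a feature is active. It assumes that "density" means the share of positions whose activation is above zero; the function name, the threshold parameter, and the sample data are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: a position counts as "active" when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 13 of them active.
sample = [0.0] * 9987 + [1.5] * 13
density = activation_density(sample)

# Format as a percentage, the way the dashboard reports it.
print(f"Activation density: {density:.3%}")
```

A density near 0.1% would mean the feature fires on roughly one token position in a thousand, which is sparse.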