INDEX
Explanations
references to interactions or behaviors involving helping, influence, or manipulation
references to power dynamics and relationships in social contexts
Negative Logits
Hopefully      -0.73
chester        -0.69
Ultra          -0.66
fortunately    -0.65
almost         -0.65
EVEN           -0.64
Ĥª             -0.64
redibly        -0.62
Pretty         -0.62
âĤ¬            -0.61
Positive Logits
oneself        1.11
spouse         0.82
omission       0.79
periphery      0.78
duty           0.77
infancy        0.75
effic          0.75
superiors      0.74
perman         0.74
nonex          0.73
Activations Density 0.310%