Explanations
mention of past actions or hypothetical scenarios
expressions of personal agency and responsibility
Negative Logits
izens        -0.65
quirks       -0.65
Bits         -0.63
BUG          -0.62
Colleges     -0.61
freezes      -0.60
loops        -0.60
themselves   -0.59
Lovecraft    -0.59
seams        -0.59
Positive Logits
myself        1.40
personally    0.98
â̦"           0.87
my            0.80
displayText   0.75
ministerial   0.75
sworn         0.74
%"            0.73
chair         0.72
privileged    0.72
Activations Density 0.760%