INDEX
Explanations
references to governmental actions or decisions related to legal or political contexts
Negative Logits
Token        Logit
Instruments  -0.70
idth         -0.70
Digest       -0.65
Nano         -0.65
Angry        -0.64
Confeder     -0.60
Admir        -0.59
Fine         -0.58
Politics     -0.57
fuck         -0.57
Positive Logits
Token        Logit
alian        1.18
self         1.18
chy          1.05
atically     0.95
atic         0.93
unes         0.89
selves       0.86
asca         0.78
ELF          0.77
squarely     0.77
Activations Density: 0.116%