INDEX
Explanations
single words and phrases related to rebellion or opposition
references to "counter" concepts or movements
Negative Logits
OOD        -0.67
goodbye    -0.64
Forge      -0.63
Franks     -0.61
Finch      -0.61
Slime      -0.60
Tornado    -0.60
Bram       -0.60
icity      -0.59
Bib        -0.58
Positive Logits
measures       1.47
balance        1.38
intuitive      1.38
attack         1.31
fact           1.25
clock          1.24
culture        1.23
offensive      1.22
intelligence   1.22
cultural       1.21
Activations Density: 0.019%