Explanations
words related to unusual or strange concepts or events
Negative Logits
Token      Logit
ptive      -0.87
ptives     -0.84
FORE       -0.78
vation     -0.77
ignty      -0.74
cussion    -0.70
aders      -0.70
HCR        -0.69
ailable    -0.67
ILA        -0.67
Positive Logits
Token      Logit
ness       1.36
nesses     1.19
ly         1.15
est        1.01
ed         0.97
oes        0.97
ety        0.95
os         0.92
er         0.89
ening      0.88
Activations Density: 0.034%