INDEX
Explanations
references to historical figures or texts
the end of a document or text
Negative Logits
interns    -0.87
inputs     -0.84
backdoor   -0.81
TSA        -0.80
lasers     -0.79
Intercept  -0.79
waivers    -0.78
monitors   -0.77
triggers   -0.77
rollout    -0.77
Positive Logits
æ          1.30
ocrates    1.25
û          1.20
á¸         1.18
â          1.17
olkien     1.15
akespe     1.14
ospels     1.14
Å          1.13
Åį         1.11
Activations Density 0.374%