INDEX
Explanations
references to principles and values in various contexts
Negative Logits
gie           -0.19
ÙĪØ·          -0.16
ney           -0.16
akan          -0.16
reat          -0.15
itude         -0.15
elian         -0.15
ÑĢÑĥ          -0.15
NESS          -0.15
ropolis       -0.14
Positive Logits
-agent        0.30
ities         0.24
investigator  0.22
-Agent        0.20
Investig      0.20
ps            0.19
stown         0.18
pal           0.18
ized          0.18
/pr           0.18
Activations Density: 0.018%