INDEX
Explanations
contrasts between opposing viewpoints or groups
Negative Logits
edin       -0.16
unas       -0.15
Î          -0.15
RIORITY    -0.15
ola        -0.14
IENTATION  -0.14
chaft      -0.14
idd        -0.14
彩         -0.14
Cornel     -0.14
Positive Logits
905        0.15
isha       0.15
decl       0.14
aggio      0.14
μÏĮ        0.14
ãĤį        0.14
fur        0.14
Ùħباش      0.13
classical  0.13
atters     0.13
Activations Density 0.158%