Explanations
references to societal norms and perceptions
Negative Logits
amt      -0.15
är       -0.15
aram     -0.14
Trad     -0.14
erties   -0.14
ilim     -0.13
orama    -0.13
onald    -0.13
ãĤ§      -0.13
ollen    -0.13
Positive Logits
routine      0.20
normal       0.18
background   0.18
NORMAL       0.17
normal       0.17
NORMAL       0.17
-normal      0.17
Routine      0.16
normalize    0.16
routine      0.16
Activations Density 0.205%