INDEX
Explanations
references to criticism and public backlash
Negative Logits
alez    -0.17
IFn     -0.15
idth    -0.15
bjerg   -0.14
blat    -0.14
ấn      -0.13
oss     -0.13
ysl     -0.13
aze     -0.13
tron    -0.13
Positive Logits
directed   0.29
online     0.25
hur        0.22
Directed   0.21
leveled    0.21
aimed      0.21
lev        0.21
level      0.20
voices     0.20
Directed   0.19
Activations Density 0.177%