INDEX
Explanations
mentions of specific celebrities, particularly Brad Pitt and Angelina Jolie
Negative Logits
oon     -0.16
czy     -0.16
ASE     -0.15
åľŃ     -0.15
liá»ģn   -0.15
oons    -0.15
emics   -0.14
rys     -0.14
mate    -0.13
pers    -0.13
Positive Logits
eref    0.17
pedo    0.16
bane    0.16
ileri   0.15
IMIT    0.15
ungen   0.14
Shift   0.14
URE     0.14
Shift   0.14
acket   0.14
Activations Density 0.001%