Explanations
negations and assertions that challenge common beliefs or misconceptions
Negative Logits
.Apis    -0.18
iger     -0.16
berger   -0.16
zan      -0.16
wyn      -0.15
Faker    -0.15
Pais     -0.15
705      -0.15
heels    -0.14
shr      -0.14
Positive Logits
anymore       0.21
nor           0.16
today         0.16
ItemSelected  0.15
uti           0.14
ob            0.14
933           0.14
inter         0.14
as            0.14
ab            0.14
Activations Density: 0.066%