Explanations
references to characters and episodes from well-known stories or narratives
Negative Logits
apur      -0.17
aley      -0.15
.gt       -0.15
anagan    -0.14
ench      -0.13
uml       -0.13
voksne    -0.13
ấp        -0.13
bt        -0.13
lname     -0.13

Positive Logits
linger    0.17
Dirty     0.15
tir       0.14
-chat     0.14
Bare      0.14
Vend      0.14
@(        0.14
932       0.13
wnd       0.13
rew       0.13

Activations Density 0.015%