Explanations
conversations about identity and personal experiences
Japanese sentence fragments
what are you doing
Negative Logits
Token       Logit
itſelf      -0.98
ſelf        -0.94
myſelf      -0.89
auffi       -0.82
sahiptir    -0.82
iſt         -0.81
―――――       -0.81
―――         -0.80
pleaſure    -0.79
ſelves      -0.79
Positive Logits
Token       Logit
really      1.03
pretty      0.99
guys        0.89
nice        0.89
weird       0.86
shit        0.85
fucking     0.84
REALLY      0.82
like        0.82
Really      0.81
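The negative and positive lists can be read as the two extremes of a single per-token score. A minimal Python sketch, assuming each listed value is one token's score and the page shows the k lowest and k highest of them; the helper `top_and_bottom_k` is hypothetical, and the sample dict uses only values listed above.

```python
import heapq


def top_and_bottom_k(scores: dict[str, float], k: int):
    """Return the k highest- and k lowest-scoring (token, score) pairs.

    Hypothetical helper: assumes the dashboard lists are the extremes of one
    per-token score vector.
    """
    top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    bottom = heapq.nsmallest(k, scores.items(), key=lambda kv: kv[1])
    return top, bottom


# A subset of the (token, logit) pairs listed above.
scores = {
    "really": 1.03, "pretty": 0.99, "guys": 0.89,
    "itſelf": -0.98, "ſelf": -0.94, "myſelf": -0.89,
}
top, bottom = top_and_bottom_k(scores, 2)
print(top)     # [('really', 1.03), ('pretty', 0.99)]
print(bottom)  # [('itſelf', -0.98), ('ſelf', -0.94)]
```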
Activation density: 0.197%
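A density of 0.197% means the feature is active on roughly 2 of every 1,000 tokens. A minimal Python sketch, assuming density is the percentage of tokens whose activation is above zero; the function `activation_density` and its `threshold` parameter are hypothetical, and the sample activations are invented for illustration, not taken from this feature.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold`.

    Assumption: "active" means strictly greater than the threshold (default 0).
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Invented example: 2 active tokens out of 1,000.
acts = [0.0] * 998 + [1.5, 0.7]
print(f"{activation_density(acts):.3f}%")  # 0.200%
```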