INDEX
Explanations
actions related to helping or assisting others
references to frequent actions or common experiences
Negative Logits
Bam       -0.70
Showdown  -0.69
Variety   -0.67
Kush      -0.65
Bern      -0.64
Kob       -0.63
ASAP      -0.63
Ys        -0.62
PLEASE    -0.62
Kitty     -0.60
Positive Logits
ttes      0.89
depending 0.83
oots      0.78
pas       0.72
times     0.70
rist      0.69
pmwiki    0.68
entimes   0.68
ensical   0.68
rences    0.66
Activations Density 0.403%
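
The density figure above can be read as the share of sampled token positions on which the feature fires. The sketch below illustrates that reading. It is an assumption, not this dashboard's actual code: the function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of sampled token positions
    where the feature is active (activation > threshold).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [0.5, 1.2, 0.1, 2.3]
print(f"{activation_density(sample):.3%}")  # prints 0.400%
```

Under this reading, a density of 0.403% would mean the feature is active on roughly 4 of every 1000 sampled tokens.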