Explanations
words and phrases related to roles and classifications within various groups
Negative Logits
cherchés        -0.71
were            -0.64
Were            -0.58
gdyby           -0.57
Has             -0.55
extAlignment    -0.55
Has             -0.54
Were            -0.53
theyre          -0.52
سمبر            -0.51
Positive Logits
love            0.92
often           0.88
rarely          0.81
tend            0.81
seldom          0.76
spend           0.75
typically       0.75
learn           0.71
shouldn         0.69
LOVE            0.69
Activations Density 0.497%