Explanations
references to collective human experiences and common social behaviors
Negative Logits
doesn      -0.85
não        -0.83
doesn      -0.77
didn       -0.76
isn        -0.74
weren      -0.73
never      -0.72
neither    -0.72
Doesn      -0.72
niet       -0.71
Positive Logits
đều             1.07
except          0.95
individually    0.89
alike           0.86
except          0.83
câte            0.83
kecuali         0.82
equally         0.81
ล้ว              0.81
sauf            0.80
Activations Density: 0.329%