INDEX
Explanations
conjunctions and phrases indicating conditions or contrasts
Negative Logits
ゴン        -0.73
®           -0.67
zu          -0.65
!:          -0.62
hered       -0.60
aily        -0.60
ニ          -0.59
antine      -0.59
iren        -0.58
iden        -0.58
Positive Logits
blah        1.11
stuff       1.11
everybody   1.06
romeda      1.00
hopefully   0.98
yeah        0.97
maybe       0.94
secondly    0.93
frankly     0.92
basically   0.89
Activations Density 0.187%
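
The density figure can be read as the share of token positions on which the feature fires at all. The page does not define it, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (here: above-threshold) activation. The source page does not define it.
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up example: 1000 token positions, the feature fires on 2 of them.
sample = [0.0] * 1000
sample[3] = 2.5
sample[500] = 0.7

print(f"{activation_density(sample):.3%}")  # prints 0.200%
```

Under this reading, a density of 0.187% would mean the feature is active on roughly 2 of every 1,000 tokens in whatever corpus was sampled.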