INDEX
Explanations
negations or expressions of doubt and disbelief
Negative Logits
Token           Logit
tilgjenge       -0.63
Jefus           -0.62
becauſe         -0.59
eſt             -0.57
againſt         -0.55
Available       -0.54
circonst        -0.53
interessanti    -0.52
acestea         -0.52
perfons         -0.52
Positive Logits
Token           Logit
want            1.05
knew            1.02
wanted          1.00
didn            0.98
want            0.98
thought         0.96
liked           0.95
think           0.94
hate            0.94
know            0.94
Activations Density: 0.164%