What’s The Difference Between utf8_general_ci and utf8_unicode_ci ?

Related Articles




There are at least two important differences:

Accuracy of sorting

  • utf8_unicode_ci is based on the Unicode standard for sorting, and sorts accurately in a very wide range of languages.
  • utf8_general_ci comes close to correct Unicode sorting in many common languages, but has a number of limitations: in some languages, it won’t sort correctly at all. In others, it will merely have some quirks.

Performance

  • utf8_general_ci is faster at comparisons and sorting, because it takes a bunch of performance-related shortcuts.
  • utf8_unicode_ci uses a much more complex comparison algorithm which aims for correct sorting according in a very wide range of languages. This makes it slower to sort and compare large numbers of fields.

Unicode defines complex sets of rules for how characters should be sorted. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call ‘alphabetical order’.

As far as Latin (ie “European”) languages go, there is not much difference between the Unicode sorting and the simplified utf8_general_ci sorting in MySQL, but there are still a few differences:

For examples, the Unicode collation sorts “ß” like “ss”, and “Œ” like “OE” as people using those characters would normally want, whereas utf8_general_ci sorts them as single characters (presumably like “s” and “e” respectively).

In non-latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8_general_ci sorting. The suitability of utf8_general_ci will depend heavily on the language used. For some languages, it’ll be quite inadequate.

Some Unicode characters are defined as ignorable, which means they shouldn’t count toward the sort order and the comparison should move on to the next character instead. utf8_unicode_ci handles these properly.

What should you use?

There is almost never any reason to use utf_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by quite other bottlenecks than this nowadays. The difference in performance is only going to be measurable in extremely specialized situations, and if that’s you, you’d already know about it. If you’re experiencing slow sorting, in almost all cases it’ll be an issue with your indexes/query plan. Changing your collation function should not be high on the list of things to troubleshoot.

When we originally wrote this answer (over 4 years ago) we said that if you wanted, you could use utf8_general_ci most of the time, and only use utf8_unicode_ci when sorting was going to be important enough to justify the performance cost. However, the performance cost is no longer really relevant (and it may not have been back then, either). It’s more important to sort properly in whichever language your users are using.

One other thing I’ll add is that even if you know your application only supports the English language, it may still need to deal with people’s names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

  • utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store English text, these shouldn’t differ.

Benchmark Test -Performance difference between using utf8_general_ci and utf8_unicode_ci [For Very Simple Table Of 500000 Rows]

In the stored procedures above utf8_general_ci collation is used, but of course during the tests I used both utf8_general_ci and utf8_unicode_ci.

We called each stored procedure 5 times for each collation (5 times for utf8_general_ci and 5 times for utf8_unicode_ci) and then calculated the average values.

Here is the results:

benchmark_simple_select() with utf8_general_ci: 9957 ms
benchmark_simple_select() with utf8_unicode_ci: 10271 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 3.2%.

benchmark_select_like() with utf8_general_ci: 11441 ms
benchmark_select_like() with utf8_unicode_ci: 12811 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 12%.

benchmark_order_by() with utf8_general_ci: 11944 ms
benchmark_order_by() with utf8_unicode_ci: 12887 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 7.9%.

[divider scroll_text=”Back To Top”]








21 Must Know Facts About The Golden Gate...

Interesting facts about the Golden Gate Bridge – It was called a mousetrap upside down, which is “turned up” the natural beauty of the...

In Seconds, These People Will Witness Something So...

This Massive Volcano Erupts In Front Of A Boat Full Of People This boat was calmly cruising along this picturesque body of water. All of...

10 Effective Home Remedies to Boost Your Sperm...

Inadequate sperm production and quality is one of the most common reasons for male infertility as it decreases the odds of one of the...

President Obama Announces Plans For Executive Action On...

President Barack Obama announced Monday he's done waiting for the House of Representatives to pass an immigration bill, and plans to take executive action...

How to Make Apple ID – Create Apple...

Apple ID is a must every iOS device user should have to use his or her iPhone, iPad or iPod touch. It is needed...

15 Tips to Save Your Hard Earned Money...

Since we’re not all schemers and with the year rounding up soon, it’s good to take note of these tips for saving money—especially for...