Backport #19773
Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.
This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.
Fix#19743
Signed-off-by: Andrew Thornton <art27@cantab.net>
TestToUTF8WithFallback is the cause of recurrent spurious test failures
even despite code to set the detected charset order.
The reason why this happens is because the preferred detected charset order
is not being initialised for these tests.
This PR simply ensures that this is set at the start of each test and would
allow different tests to be written to allow differing orders.
Replaces #12571Close#12571
Signed-off-by: Andrew Thornton <art27@cantab.net>
* Fix chardet test and add ordering option
Signed-off-by: Andrew Thornton <art27@cantab.net>
* minor fixes
Signed-off-by: Andrew Thornton <art27@cantab.net>
* remove log
Signed-off-by: Andrew Thornton <art27@cantab.net>
* remove log2
Signed-off-by: Andrew Thornton <art27@cantab.net>
* only iterate through top results
Signed-off-by: Andrew Thornton <art27@cantab.net>
* Update docs/content/doc/advanced/config-cheat-sheet.en-us.md
* slight restructure of for loop
Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
* Convert files to utf-8 for indexing
* Move utf8 functions to modules/base
* Bump repoIndexerLatestVersion to 3
* Add tests for base/encoding.go
* Changes to pass gosimple
* Move UTF8 funcs into new modules/charset package