Backport #19773
Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.
This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.
Fix#19743
Signed-off-by: Andrew Thornton <art27@cantab.net>
Backport #18820
There is a potential panic due to a mistaken resetting of the length parameter when
multibyte characters go over a read boundary.
Signed-off-by: Andrew Thornton <art27@cantab.net>
Fix#17514
Given the comments I've adjusted this somewhat. The numbers of characters detected are increased and include things like the use of U+300 to make à instead of à and non-breaking spaces.
There is a button which can be used to escape the content to show it.
Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: Gwyneth Morgan <gwymor@tilde.club>
Co-authored-by: silverwind <me@silverwind.io>
Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
TestToUTF8WithFallback is the cause of recurrent spurious test failures
even despite code to set the detected charset order.
The reason why this happens is because the preferred detected charset order
is not being initialised for these tests.
This PR simply ensures that this is set at the start of each test and would
allow different tests to be written to allow differing orders.
Replaces #12571Close#12571
Signed-off-by: Andrew Thornton <art27@cantab.net>
* Fix chardet test and add ordering option
Signed-off-by: Andrew Thornton <art27@cantab.net>
* minor fixes
Signed-off-by: Andrew Thornton <art27@cantab.net>
* remove log
Signed-off-by: Andrew Thornton <art27@cantab.net>
* remove log2
Signed-off-by: Andrew Thornton <art27@cantab.net>
* only iterate through top results
Signed-off-by: Andrew Thornton <art27@cantab.net>
* Update docs/content/doc/advanced/config-cheat-sheet.en-us.md
* slight restructure of for loop
Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
* Convert files to utf-8 for indexing
* Move utf8 functions to modules/base
* Bump repoIndexerLatestVersion to 3
* Add tests for base/encoding.go
* Changes to pass gosimple
* Move UTF8 funcs into new modules/charset package