For anyone who wants to learn more about specific Unicode stuff, the three big data sources are the Core Spec, the Unicode Technical Annexes (UAXs), and the Unicode Character Database itself (the database is a bunch of text files; there's an XML version now as well).
For further reading on this specifically, it might be worth looking at:
[Unicode Core Spec - Chapter 4: Character Properties] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
├ [General Category] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
└ [Properties for Text Boundaries] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
[UAX #44 - Unicode Character Database (Technical Report)] https://www.unicode.org/reports/tr44/
├ [General Category Values] https://www.unicode.org/reports/tr44/#General_Category_Value...
└ [Property Definitions] https://www.unicode.org/reports/tr44/#Property_Definitions
And, if you're brave and want to see the data itself (skim through UAX #44 first):
[Unicode Character Database] https://www.unicode.org/Public/17.0.0/ucd/
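If you just want to poke at a character's properties from Python rather than read the raw data files, the stdlib unicodedata module exposes a few of them. A minimal sketch (the characters here are arbitrary examples):
>>> import unicodedata
>>> unicodedata.category('A')        # General Category: Lu = Uppercase Letter
'Lu'
>>> unicodedata.category('\u0bcd')   # U+0BCD -> Mn = Nonspacing Mark
'Mn'
>>> unicodedata.name('\u0bcd')
'TAMIL SIGN VIRAMA'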
When I tried to raise this on the mailing list and get it rectified, the response I got was pretty much that many properties need language-specific processing anyway and shouldn't be relied upon fully, so this wasn't worth fixing.
[1] https://util.unicode.org/UnicodeJsps/character.jsp?a=0BCD&B1...
For a string you'd need to:
import io

for line in io.StringIO(s):  # s is the string; yields one line at a time, like a file object
    pass
In my experience, they're not. It's strings.
From database fields, API calls, JSON values, HTML tag content, function inputs generally, you know -- the normal places.
In my experience, most people aren't dealing directly with files (or streams) most of the time.
I also used the word generally, so your insistence on quantifying the proportion is a complete waste of time.
And I said your "generally" was wrong. You were provided general advice, I'm saying it's wrong in general. Do you see me giving numerical quantities anywhere?
There are countless sources one can get a string from. Surely you don't think filesystems are the only source of strings?
Surely you haven't misread my comments above to such an extent? Perhaps you're not familiar with sockets.
>>> s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
>>> s.split()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
>>> s.splitlines()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
But split() has a sep argument to define the delimiter on which to split the string, in which case it does what you expected to happen:
>>> s.split('\n')
['line1', 'line2\rline3\r', 'line4\x0bline5\x1dhello']
In general you want this:
>>> import re
>>> linesep_splitter = re.compile(r'\n|\r\n?')
>>> linesep_splitter.split(s)
['line1', 'line2', 'line3', 'line4\x0bline5\x1dhello']
str.split() splits on runs of consecutive whitespace of any kind, including tabs and spaces, which splitlines() doesn't do.
>>> 'one two'.split()
['one', 'two']
>>> 'one two'.splitlines()
['one two']
split() without a custom delimiter also splits on runs of whitespace, which splitlines() doesn't do (except for \r\n, because that combination counts as one line ending):
>>> 'one\n\ntwo'.split()
['one', 'two']
>>> 'one\n\ntwo'.splitlines()
['one', '', 'two']
The #1 use case of that for me is probably just avoiding the cases where there's a trailing newline character in the output of a command I ran via subprocess.
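Roughly what I mean, as a minimal sketch (assuming "that" is splitlines(); echo is just a stand-in command):
>>> import subprocess
>>> out = subprocess.run(['echo', 'hello'], capture_output=True, text=True).stdout
>>> out
'hello\n'
>>> out.split('\n')      # a plain split leaves a trailing empty string
['hello', '']
>>> out.splitlines()     # splitlines() just drops the trailing newline
['hello']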
https://www.sqlite.org/c3ref/set_authorizer.html
https://docs.python.org/3/library/sqlite3.html#sqlite3.Conne...
That feature was added in 3.9 with the addition of `removeprefix` and `removesuffix`.
Sadly (both illustrated below):
1. unlike Rust's version, they provide no way of knowing whether they actually stripped anything
2. unlike startswith/endswith, they do not take tuples of prefixes/suffixes
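Both points in a quick REPL sketch (the '.txt'/'.md' suffixes are just made-up examples):
>>> s = 'report.txt'
>>> s.removesuffix('.txt')          # suffix present: it gets removed
'report'
>>> s.removesuffix('.md')           # no match: the same string comes back, with no way to tell
'report.txt'
>>> s.endswith(('.txt', '.md'))     # endswith() accepts a tuple; removesuffix() only takes one str
True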
... Guilty, actually.