Description
As @fonsp pointed out in #39, URIs.jl does not technically handle Unicode characters correctly, at least according to RFC 3986. IETF RFC 3986 Sec. 1.2.1 implies that URIs should only contain characters from the US-ASCII charset and should percent-encode additional characters (RFC 3987 makes this a little more explicit). URIs.jl, however, will accept and work with any string as its input regardless of the underlying character set:
julia> using URIs
julia> url = URI("https://a/🌟/e")
URI("https://a/🌟/e")
julia> url.path
"/🌟/e"
After digging into it for a bit, the standard or canonical URI libraries of other languages seem to be split on this. In JavaScript, Go, and Rust, passing in a URI that contains Unicode characters either causes the URI to be percent-encoded or raises an error:
JavaScript
>> new URL("https://a/🌟/e").pathname
"/%F0%9F%8C%9F/e"
Go
package main

import (
	"fmt"
	"net/url"
	"os"
)

func main() {
	url, err := url.Parse("https://a/🌟/e")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error parsing url: %s", err)
		return
	}
	fmt.Printf("%s\n", url)
	// Prints https://a/%F0%9F%8C%9F/e
}
Rust
Rust's http crate rejects a Unicode URI outright; Uri::from_static, which panics on invalid input, will panic if you feed it one, e.g.:
use http::Uri;

fn main() {
    let uri = Uri::from_static("https://a/🌟/e");
    println!("{}", uri.path());
}
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/uri`
thread 'main' panicked at 'static str is not valid URI: invalid uri character', /home/kernelmethod/.cargo/registry/src/github.com-1ecc6299db9ec823/http-0.2.7/src/uri/mod.rs:365:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
But this isn't universally the case: in Python and Java, the Unicode characters are preserved as-is rather than percent-encoded:
Python
>>> from urllib.parse import urlparse
>>> url = urlparse("https://a/🌟/e")
>>> url.path
'/🌟/e'
Java
import java.net.*;

class URITesting {
    public static void main(String[] args) {
        try {
            URI url = new URI("https://a/🌟/e");
            System.out.printf("path = %s\n", url.getPath());
        }
        catch (URISyntaxException ex) {
            System.out.println(ex);
        }
    }
}
One potential difference between these languages is that Java's java.net.URI tries to comply with RFC 2396 (the predecessor of RFC 3986), whereas Python's urllib.parse.urlparse seems to try to comply with a mix of standards.
In any case, there's a bit of a dilemma here -- this library doesn't quite implement the RFC as specified, which has also come up in other places, e.g. in the implementations of normpath (#20) and joinpath (related issue: #18). As far as this issue is concerned, it seems like there are three ways URIs.jl could go:
1. Percent-encode strings when we generate a URI to ensure compliance with the spec;
2. Implement RFC 3987 under the hood, which does permit Unicode characters; or
3. Keep the library's current behavior and try to specify which parts of URIs.jl comply with which RFCs, similar to what Python does for its urllib.parse module.
I would think that option (1), percent-encoding, is the best of these -- this library says that it implements URIs according to RFC 3986, so it should comply with that RFC.
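For what it's worth, here is a rough sketch of what percent-encoding on construction could look like. It is written against Python's urllib.parse (already used above) rather than Julia, purely as an illustration of the behavior. The name percent_encode_uri and the two *_SAFE constants are hypothetical and are not part of URIs.jl or urllib. The sketch assumes the host is already ASCII (internationalized hostnames would need IDNA handling, which is out of scope here), and it treats '%' as safe so that existing escapes are not double-encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component (hypothetical constants).
# RFC 3986 unreserved characters (letters, digits, "-._~") are never
# encoded by quote(); "%" is listed so existing escapes are preserved.
PATH_SAFE = "/:@!$&'()*+,;=%"
QUERY_SAFE = PATH_SAFE + "?"


def percent_encode_uri(uri: str) -> str:
    """Hypothetical helper: percent-encode the path, query, and fragment.

    Anything outside the safe sets (including all non-ASCII characters)
    is encoded as UTF-8 bytes. The scheme and authority are passed
    through unchanged, so this assumes the host is already ASCII.
    """
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # assumption: ASCII host, no IDNA handling
        quote(parts.path, safe=PATH_SAFE),
        quote(parts.query, safe=QUERY_SAFE),
        quote(parts.fragment, safe=QUERY_SAFE),
    ))


if __name__ == "__main__":
    print(percent_encode_uri("https://a/🌟/e"))
```

With this approach, percent_encode_uri("https://a/🌟/e") gives "https://a/%F0%9F%8C%9F/e", which matches the JavaScript and Go results above, and applying the function a second time leaves the result unchanged. One trade-off of treating '%' as safe is that a stray literal '%' in the input is passed through rather than encoded.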