Incorrect handling of Unicode characters #41

As @fonsp pointed out in #39, URIs.jl does not technically handle Unicode characters correctly, at least according to RFC 3986. [IETF RFC 3986 Sec. 1.2.1](https://datatracker.ietf.org/doc/html/rfc3986#section-1.2.1) implies that URIs should only contain characters from the US-ASCII charset and should percent-encode additional characters ([RFC 3987](https://datatracker.ietf.org/doc/html/rfc3987) makes this a little more explicit). URIs.jl, however, will accept and work with any string as its input regardless of the underlying character set:

```julia
julia> using URIs

julia> url = URI("https://a/🌟/e")
URI("https://a/🌟/e")

julia> url.path
"/🌟/e"
```

After digging into it for a bit, the standard or de facto URI libraries in other languages appear to be split on this. In JavaScript, Go, and Rust, passing in a URI that contains non-ASCII characters will either cause those characters to be percent-encoded or raise an error:

<details>
<summary>JavaScript</summary>

```javascript
>> new URL("https://a/🌟/e").pathname
"/%F0%9F%8C%9F/e"
```
</details>

<details>
<summary>Go</summary>

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

func main() {
	url, err := url.Parse("https://a/🌟/e")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error parsing url: %s", err)
		return
	}
	fmt.Printf("%s\n", url)
	// Prints https://a/%F0%9F%8C%9F/e
}
```
</details>

<details>
<summary>Rust</summary>

Rust's `http` crate will actually panic if you try to feed it a Unicode URI at all, e.g.:

```rust
use http::Uri;

fn main() {
    let uri = Uri::from_static("https://a/🌟/e");
    println!("{}", uri.path());
}
```

```
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/uri`
thread 'main' panicked at 'static str is not valid URI: invalid uri character', /home/kernelmethod/.cargo/registry/src/github.com-1ecc6299db9ec823/http-0.2.7/src/uri/mod.rs:365:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

</details>

But this isn't universally the case: in Python and Java, the Unicode characters are preserved as-is rather than percent-encoded:

<details>
<summary>Python</summary>

```python
>>> from urllib.parse import urlparse
>>> url = urlparse("https://a/🌟/e")
>>> url.path
'/🌟/e'
```
</details>

<details>
<summary>Java</summary>

```java
import java.net.*;

class URITesting {
    public static void main(String[] args) {
        try {
            URI url = new URI("https://a/🌟/e");
            System.out.printf("path = %s\n", url.getPath());
        }
        catch (URISyntaxException ex) {
            System.out.println(ex);
        }
    }
}
```

</details>

One potential difference between these languages is that Java's [`java.net.URI`](https://docs.oracle.com/javase/7/docs/api/java/net/URI.html) tries to comply with the older RFC 2396, whereas Python's [`urllib.parse.urlparse`](https://docs.python.org/3/library/urllib.parse.html) seems to try to comply with a mix of standards.

---

In any case, there's a bit of a dilemma here: this library doesn't quite implement the RFC as specified, a problem that has also cropped up elsewhere, e.g. in the implementations of `normpath` (#20) and `joinpath` (related issue: [#18](https://github.com/JuliaWeb/URIs.jl/issues/18#issuecomment-798986903)). As far as this issue is concerned, it seems like there are three ways URIs.jl could go:

1. Percent-encode strings when we generate a URI to ensure compliance with the spec;
2. Implement [RFC 3987](https://datatracker.ietf.org/doc/html/rfc3987) under the hood, which *does* permit Unicode characters; or
3. Keep the library's current behavior and try to specify which parts of URIs.jl comply with which RFCs, similar to what Python does for its `urllib.parse` module.

I would think that option (1) is the most preferable of all of these -- this library says that it implements URIs according to RFC 3986, so it should comply with that RFC.
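
To make option (1) a bit more concrete, here is a rough sketch of the transformation I have in mind. It's written in Python purely for illustration and is not URIs.jl code; `encode_non_ascii` is a hypothetical helper name, and the sketch assumes the input is UTF-8 text and that everything already in US-ASCII (delimiters, existing `%XX` escapes) should be left untouched. It also doesn't treat the host component specially, which a real implementation would probably need to think about.

```python
def encode_non_ascii(uri: str) -> str:
    """Percent-encode every character outside US-ASCII as its UTF-8 bytes.

    Hypothetical helper for illustration only: ASCII characters, including
    URI delimiters and existing percent escapes, are passed through as-is.
    """
    out = []
    for ch in uri:
        if ord(ch) < 128:
            # Already in the US-ASCII range: keep unchanged.
            out.append(ch)
        else:
            # Encode the character as UTF-8 and escape each byte as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(encode_non_ascii("https://a/🌟/e"))
# https://a/%F0%9F%8C%9F/e
```

This gives the same result as the JavaScript and Go examples above for the test URI, and a URI that is already pure ASCII comes back unchanged.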
