Uniform Resource Identifiers (URI): Generic Syntax

Status of this Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the “Internet
Official Protocol Standards” (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1998). All Rights Reserved.

IESG Note

This paper describes a “superset” of operations that can be applied
to URI. It consists of both a grammar and a description of basic
functionality for URI. To understand what is a valid URI, both the
grammar and the associated description have to be studied. Some of
the functionality described is not applicable to all URI schemes, and
some operations are only possible when certain media types are
retrieved using the URI, regardless of the scheme used.

Abstract

A Uniform Resource Identifier (URI) is a compact string of characters
for identifying an abstract or physical resource. This document
defines the generic syntax of URI, including both absolute and
relative forms, and guidelines for their use; it revises and replaces
the generic definitions in RFC 1738 and RFC 1808.

This document defines a grammar that is a superset of all valid URI,
such that an implementation can parse the common components of a URI
reference without knowing the scheme-specific requirements of every
possible identifier type. This document does not define a generative
grammar for URI; that task will be performed by the individual
specifications of each URI scheme.

Berners-Lee, et. al. Standards Track [Page 1]

RFC 2396 URI Generic Syntax August 1998

1. Introduction

Uniform Resource Identifiers (URI) provide a simple and extensible
means for identifying a resource. This specification of URI syntax
and semantics is derived from concepts introduced by the World Wide
Web global information initiative, whose use of such objects dates
from 1990 and is described in “Universal Resource Identifiers in WWW”
[RFC1630]. The specification of URI is designed to meet the
recommendations laid out in “Functional Recommendations for Internet
Resource Locators” [RFC1736] and “Functional Requirements for Uniform
Resource Names” [RFC1737].

This document updates and merges “Uniform Resource Locators”
[RFC1738] and “Relative Uniform Resource Locators” [RFC1808] in order
to define a single, generic syntax for all URI. It excludes those
portions of RFC 1738 that defined the specific syntax of individual
URL schemes; those portions will be updated as separate documents, as
will the process for registration of new URI schemes. This document
does not discuss the issues and recommendations for dealing with
characters outside of the US-ASCII character set [ASCII]; those
recommendations are discussed in a separate document.

All significant changes from the prior RFCs are noted in Appendix G.

1.1. Overview of URI

URI are characterized by the following definitions:

Uniform
Uniformity provides several benefits: it allows different types
of resource identifiers to be used in the same context, even
when the mechanisms used to access those resources may differ;
it allows uniform semantic interpretation of common syntactic
conventions across different types of resource identifiers; it
allows introduction of new types of resource identifiers
without interfering with the way that existing identifiers are
used; and, it allows the identifiers to be reused in many
different contexts, thus permitting new applications or
protocols to leverage a pre-existing, large, and widely-used
set of resource identifiers.

Resource
A resource can be anything that has identity. Familiar
examples include an electronic document, an image, a service
(e.g., “today’s weather report for Los Angeles”), and a
collection of other resources. Not all resources are network
“retrievable”; e.g., human beings, corporations, and bound
books in a library can also be considered resources.

The resource is the conceptual mapping to an entity or set of
entities, not necessarily the entity which corresponds to that
mapping at any particular instance in time. Thus, a resource
can remain constant even when its content—the entities to
which it currently corresponds—changes over time, provided
that the conceptual mapping is not changed in the process.

Identifier
An identifier is an object that can act as a reference to
something that has identity. In the case of URI, the object is
a sequence of characters with a restricted syntax.

Having identified a resource, a system may perform a variety of
operations on the resource, as might be characterized by such words
as ‘access’, ‘update’, ‘replace’, or ‘find attributes’.

1.2. URI, URL, and URN

A URI can be further classified as a locator, a name, or both. The
term “Uniform Resource Locator” (URL) refers to the subset of URI
that identify resources via a representation of their primary access
mechanism (e.g., their network “location”), rather than identifying
the resource by name or by some other attribute(s) of that resource.
The term “Uniform Resource Name” (URN) refers to the subset of URI
that are required to remain globally unique and persistent even when
the resource ceases to exist or becomes unavailable.

The URI scheme (Section 3.1) defines the namespace of the URI, and
thus may further restrict the syntax and semantics of identifiers
using that scheme. This specification defines those elements of the
URI syntax that are either required of all URI schemes or are common
to many URI schemes. It thus defines the syntax and semantics that
are needed to implement a scheme-independent parsing mechanism for
URI references, such that the scheme-dependent handling of a URI can
be postponed until the scheme-dependent semantics are needed. We use
the term URL below when describing syntax or semantics that only
apply to locators.

Although many URL schemes are named after protocols, this does not
imply that the only way to access the URL’s resource is via the named
protocol. Gateways, proxies, caches, and name resolution services
might be used to access some resources, independent of the protocol
of their origin, and the resolution of some URL may require the use
of more than one protocol (e.g., both DNS and HTTP are typically used
to access an “http” URL’s resource when it can’t be found in a local
cache).

A URN differs from a URL in that its primary purpose is persistent
labeling of a resource with an identifier. That identifier is drawn
from one of a set of defined namespaces, each of which has its own
name structure and assignment procedures. The “urn” scheme has
been reserved to establish the requirements for a standardized URN
namespace, as defined in “URN Syntax” [RFC2141] and its related
specifications.

Most of the examples in this specification demonstrate URL, since
they allow the most varied use of the syntax and often have a
hierarchical namespace. A parser of the URI syntax is capable of
parsing both URL and URN references as a generic URI; once the scheme
is determined, the scheme-specific parsing can be performed on the
generic URI components. In other words, the URI syntax is a superset
of the syntax of all URI schemes.
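
As a rough illustration of scheme-independent parsing, the sketch below splits a URI reference into its generic components with a single regular expression. The expression is the one given in Appendix B of RFC 2396, which is not reproduced in this section; the names `split_uri` and `Parts` are hypothetical and chosen only for this example.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 2396, Appendix B. Every group is optional,
# so the pattern matches any string, including an empty one.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


class Parts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_uri(ref: str) -> Parts:
    """Split a URI reference into generic components, ignoring scheme rules."""
    m = URI_RE.match(ref)
    assert m is not None  # cannot fail: all parts of the pattern are optional
    return Parts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


# The same function handles a hierarchical URL and an opaque "mailto" URI.
for ref in ("ftp://ftp.is.co.za/rfc/rfc1808.txt", "mailto:mduerst@ifi.unizh.ch"):
    print(split_uri(ref))
```

Only after the scheme component is known would a caller hand the remaining components to scheme-specific code, which is the postponement described above.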

1.3. Example URI

The following examples illustrate URI that are in common use.

ftp://ftp.is.co.za/rfc/rfc1808.txt
— ftp scheme for File Transfer Protocol services

gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
— gopher scheme for Gopher and Gopher+ Protocol services

http://www.math.uio.no/faq/compression-faq/part1.html
— http scheme for Hypertext Transfer Protocol services

mailto:mduerst@ifi.unizh.ch
— mailto scheme for electronic mail addresses

news:comp.infosystems.www.servers.unix
— news scheme for USENET news groups and articles

telnet://melvyl.ucop.edu/
— telnet scheme for interactive services via the TELNET Protocol

1.4. Hierarchical URI and Relative Forms

An absolute identifier refers to a resource independent of the
context in which the identifier is used. In contrast, a relative
identifier refers to a resource by describing the difference within a
hierarchical namespace between the current context and an absolute
identifier of the resource.

Some URI schemes support a hierarchical naming system, where the
hierarchy of the name is denoted by a “/” delimiter separating the
components in the scheme. This document defines a scheme-independent
‘relative’ form of URI reference that can be used in conjunction with
a ‘base’ URI (of a hierarchical scheme) to produce another URI. The
syntax of hierarchical URI is described in Section 3; the relative
URI calculation is described in Section 5.
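
The idea of a relative identifier as a "difference" within a hierarchical namespace can be sketched as follows. This is a minimal sketch that only considers the path hierarchy, ignores query and fragment components, and does not preserve trailing slashes; the function name `relativize` is hypothetical, and Python's `urllib.parse.urljoin` is used merely to check that the relative form leads back to the absolute one.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def relativize(base: str, target: str) -> str:
    """Express `target` relative to `base` when scheme and authority match."""
    b, t = urlsplit(base), urlsplit(target)
    if (b.scheme, b.netloc) != (t.scheme, t.netloc):
        return target  # no shared hierarchy: keep the absolute form
    # The current context is the "directory" part of the base path.
    base_dir = posixpath.dirname(b.path) or "/"
    return posixpath.relpath(t.path, start=base_dir)


base = "http://a/b/c/d"
for target in ("http://a/b/c/g", "http://a/b/g", "http://a/g"):
    rel = relativize(base, target)
    # Combining the relative form with the base gives back the absolute URI.
    assert urljoin(base, rel) == target
    print(f"{rel:10} relative to {base} identifies {target}")
```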

1.5. URI Transcribability

The URI syntax was designed with global transcribability as one of
its main concerns. A URI is a sequence of characters from a very
limited set, i.e., the letters of the basic Latin alphabet, digits,
and a few special characters. A URI may be represented in a variety
of ways: e.g., ink on paper, pixels on a screen, or a sequence of
octets in a coded character set. The interpretation of a URI depends
only on the characters used and not how those characters are
represented in a network protocol.

The goal of transcribability can be described by a simple scenario.
Imagine two colleagues, Sam and Kim, sitting in a pub at an
international conference and exchanging research ideas. Sam asks Kim
for a location to get more information, so Kim writes the URI for the
research site on a napkin. Upon returning home, Sam takes out the
napkin and types the URI into a computer, which then retrieves the
information to which Kim referred.

There are several design concerns revealed by the scenario:

o A URI is a sequence of characters, which is not always
represented as a sequence of octets.

o A URI may be transcribed from a non-network source, and thus
should consist of characters that are most likely to be able to
be typed into a computer, within the constraints imposed by
keyboards (and related input devices) across languages and
locales.

o A URI often needs to be remembered by people, and it is easier
for people to remember a URI when it consists of meaningful
components.

These design concerns are not always in alignment. For example, it
is often the case that the most meaningful name for a URI component
would require characters that cannot be typed into some systems. The
ability to transcribe the resource identifier from one medium to
another was considered more important than having its URI consist of
the most meaningful of components. In local and regional contexts
and with improving technology, users might benefit from being able to
use a wider range of characters; such use is not defined in this
document.
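
The trade-off between meaningful and transcribable components can be seen in the gopher example of Section 1.3, where the name "Los Angeles" appears as "Los%20Angeles". The sketch below is illustrative only: `is_transcribable` is a hypothetical helper that applies a deliberately rough test (visible US-ASCII characters only), not the character classes of the formal grammar, and `urllib.parse.quote` stands in for the escaping step.

```python
from urllib.parse import quote, unquote


def is_transcribable(text: str) -> bool:
    """Rough check: only visible US-ASCII characters, so no spaces,
    control characters, or characters outside US-ASCII."""
    return all(0x21 <= ord(ch) <= 0x7E for ch in text)


segment = "Los Angeles"    # meaningful, but contains a space
escaped = quote(segment)   # the space becomes the escape %20

print(segment, is_transcribable(segment))
print(escaped, is_transcribable(escaped))

# Escaping is reversible, so no information is lost in transcription.
assert unquote(escaped) == segment
```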

1.6. Syntax Notation and Common Elements

This document uses two conventions to describe and define the syntax
for URI. The first, called the layout form, is a general description
of the order of components and component separators, as in

<first>/<second>;<third>?<fourth>

The component names are enclosed in angle-brackets and any characters
outside angle-brackets are literal separators. Whitespace should be
ignored. These descriptions are used informally and do not define
the syntax requirements.

The second convention is a BNF-like grammar, used to define the
formal URI syntax. The grammar is that of [RFC822], except that “|”
is used to designate alternatives. Briefly, rules are separated from
definitions by an equal “=”, indentation is used to continue a rule
definition over more than one line, literals are quoted with “”,
parentheses “(” and “)” are used to group elements, optional elements
are enclosed in “[” and “]” brackets, and elements may be preceded
with <n>* to designate n or more repetitions of the following
element; n defaults to 0.

Unlike many specifications that use a BNF-like grammar to define the
bytes (octets) allowed by a protocol, the URI grammar is defined in
terms of characters. Each literal in the grammar corresponds to the
character it represents, rather than to the octet encoding of that
character in any particular coded character set. How a URI is
represented in terms of bits and bytes on the wire is dependent upon
the character encoding of the protocol used to transport it, or the
charset of the document which contains it.
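
To make the distinction between characters and octets concrete, the following sketch encodes one of the example URI from Section 1.3 in three coded character sets. The choice of encodings is an assumption made for illustration; `cp037` is an EBCDIC code page available in Python's standard codecs.

```python
uri = "http://www.math.uio.no/faq/compression-faq/part1.html"

# Three different octet representations of the same character sequence.
encodings = ("ascii", "utf-16-le", "cp037")
octets = {name: uri.encode(name) for name in encodings}

for name, data in octets.items():
    print(f"{name:10} {len(data):3} octets, first four: {data[:4].hex()}")
    # Decoding with the matching charset always recovers the same URI.
    assert data.decode(name) == uri
```

The octet sequences differ in both length and content, yet each decodes to the identical URI, which is why the grammar is stated in terms of characters.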

6. URI Normalization and Equivalence

In many cases, different URI strings may identify the same
resource. For example, the host names used in URL are
actually case insensitive, and the URL <http://www.XXX.com> is
equivalent to <http://www.xxx.com>. In general, the rules for
equivalence and definition of a normal form, if any, are scheme
dependent. When a scheme uses elements of the common syntax, it will
also use the common syntax equivalence rules, namely that the scheme
and hostname are case insensitive and a URL with an explicit “:port”,
where the port is the default for the scheme, is equivalent to one
where the port is elided.
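
The common-syntax equivalence rules can be sketched as a small normalization step. This is an illustration under stated assumptions, not a complete normal form: the `DEFAULT_PORTS` table, the function names, and the host names are chosen for the example, and paths are left untouched because their equivalence is scheme dependent.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports for a few schemes used in this document.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70, "telnet": 23}


def normalize(url: str) -> str:
    """Lowercase scheme and host name; drop a port equal to the default."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    host = host.lower()
    if colon and port.isdigit() and int(port) == DEFAULT_PORTS.get(scheme):
        hostport = host                   # explicit default port is elided
    elif colon:
        hostport = f"{host}:{port}"       # any other port is significant
    else:
        hostport = host
    return urlunsplit((scheme, userinfo + at + hostport,
                       parts.path, parts.query, parts.fragment))


def equivalent(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


print(equivalent("http://www.XXX.com", "http://www.xxx.com"))
print(equivalent("HTTP://www.xxx.com:80/", "http://www.xxx.com/"))
print(equivalent("http://www.xxx.com/A", "http://www.xxx.com/a"))
```

The third comparison is reported as not equivalent because only the scheme and host name are treated as case insensitive here.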

7. Security Considerations

A URI does not in itself pose a security threat. Users should beware
that there is no general guarantee that a URL, which at one time
located a given resource, will continue to do so. Nor is there any
guarantee that a URL will not locate a different resource at some
later point in time, due to the lack of any constraint on how a given
authority apportions its namespace. Such a guarantee can only be
obtained from the person(s) controlling that namespace and the
resource in question. A specific URI scheme may include additional
semantics, such as name persistence, if those semantics are required
of all naming authorities for that scheme.

It is sometimes possible to construct a URL such that an attempt to
perform a seemingly harmless, idempotent operation, such as the
retrieval of an entity associated with the resource, will in fact
cause a possibly damaging remote operation to occur. The unsafe URL
is typically constructed by specifying a port number other than that
reserved for the network protocol in question. The client
unwittingly contacts a site that is in fact running a different
protocol. The content of the URL contains instructions that, when
interpreted according to this other protocol, cause an unexpected
operation. An example has been the use of a gopher URL to cause an
unintended or impersonating message to be sent via a SMTP server.

Caution should be used when using any URL that specifies a port
number other than the default for the protocol, especially when it is
a number within the reserved space.
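
A client might apply this caution with a check along the following lines. The sketch rests on assumptions that are not part of this specification: the `DEFAULT_PORTS` table, the reading of "reserved space" as port numbers below 1024, the function name `port_risk`, and the placeholder host example.com.

```python
from urllib.parse import urlsplit

# Assumed default ports; a real client would use its own scheme registry.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70, "telnet": 23}
RESERVED_LIMIT = 1024  # assumption: "reserved space" means ports below 1024


def port_risk(url: str) -> str:
    """Classify the port of a URL for the purpose of warning the user."""
    parts = urlsplit(url)
    port = parts.port  # None when the URL carries no explicit port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return "default"
    if port < RESERVED_LIMIT:
        return "non-default, reserved range"
    return "non-default"


for url in ("http://example.com/",
            "http://example.com:8080/",
            "gopher://example.com:25/"):
    print(f"{url:28} {port_risk(url)}")
```

The last URL shows the pattern described above: a gopher URL aimed at port 25, where an SMTP server would normally be listening.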

Care should be taken when a URL contains escaped delimiters for a
given protocol (for example, CR and LF characters for telnet
protocols) that these are not unescaped before transmission. This
might violate the protocol, but avoids the potential for such
characters to be used to simulate an extra operation or parameter in
that protocol, which might lead to an unexpected and possibly harmful
remote operation to be performed.
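
One way to act on this advice is to look for escaped CR and LF characters before any unescaping takes place. The following is a minimal sketch; the function name `has_escaped_line_break` and the sample URL are hypothetical.

```python
import re
from urllib.parse import unquote

# %0D is an escaped CR and %0A an escaped LF; hex digits may be in either case.
ESCAPED_CR_LF = re.compile(r"%0[AD]", re.IGNORECASE)


def has_escaped_line_break(url: str) -> bool:
    """Report whether the URL carries an escaped CR or LF character."""
    return ESCAPED_CR_LF.search(url) is not None


url = "gopher://example.com:25/1HELO%0D%0AMAIL"
if has_escaped_line_break(url):
    # Unescaping would place a raw line break into the protocol stream,
    # so the URL is reported rather than decoded and sent.
    print("refusing to unescape:", url)

# What unescaping would produce if it were applied:
assert unquote("%0D%0A") == "\r\n"
```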

It is clearly unwise to use a URL that contains a password which is
intended to be secret. In particular, the use of a password within
the ‘userinfo’ component of a URL is strongly discouraged except
in those rare cases where the ‘password’ parameter is intended to be
public.
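
Software that logs or displays URL can reduce accidental disclosure by removing the password first. The sketch below assumes the common "user:password" layout of the ‘userinfo’ component; the function name `strip_password` and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_password(url: str) -> str:
    """Return the URL without any password in its userinfo component."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to remove
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user = userinfo.split(":", 1)[0]  # keep only the part before the first ":"
    return urlunsplit(parts._replace(netloc=f"{user}@{hostport}"))


print(strip_password("ftp://user:secret@ftp.example.com/pub"))
print(strip_password("ftp://ftp.example.com/pub"))
```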

References

[RFC2277] Alvestrand, H., “IETF Policy on Character Sets and
Languages”, BCP 18, RFC 2277, January 1998.

[RFC1630] Berners-Lee, T., “Universal Resource Identifiers in WWW: A
Unifying Syntax for the Expression of Names and Addresses
of Objects on the Network as used in the World-Wide Web”,
RFC 1630, June 1994.

[RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors,
“Uniform Resource Locators (URL)”, RFC 1738, December 1994.

[RFC1866] Berners-Lee, T., and D. Connolly, “HyperText Markup Language
Specification — 2.0”, RFC 1866, November 1995.

[RFC1123] Braden, R., Editor, “Requirements for Internet Hosts —
Application and Support”, STD 3, RFC 1123, October 1989.

[RFC822] Crocker, D., “Standard for the Format of ARPA Internet Text
Messages”, STD 11, RFC 822, August 1982.

[RFC1808] Fielding, R., “Relative Uniform Resource Locators”, RFC
1808, June 1995.

[RFC2046] Freed, N., and N. Borenstein, “Multipurpose Internet Mail
Extensions (MIME) Part Two: Media Types”, RFC 2046,
November 1996.

[RFC1736] Kunze, J., “Functional Recommendations for Internet
Resource Locators”, RFC 1736, February 1995.

[RFC2141] Moats, R., “URN Syntax”, RFC 2141, May 1997.

[RFC1034] Mockapetris, P., “Domain Names – Concepts and Facilities”,
STD 13, RFC 1034, November 1987.

[RFC2110] Palme, J., and A. Hopmann, “MIME E-mail Encapsulation of
Aggregate Documents, such as HTML (MHTML)”, RFC 2110, March
1997.

[RFC1737] Sollins, K., and L. Masinter, “Functional Requirements for
Uniform Resource Names”, RFC 1737, December 1994.

[ASCII] US-ASCII. “Coded Character Set — 7-bit American Standard
Code for Information Interchange”, ANSI X3.4-1986.

[UTF-8] Yergeau, F., “UTF-8, a transformation format of ISO 10646”,
RFC 2279, January 1998.

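C.1. Normal Examples

Within an object with a well-defined base URI of

http://a/b/c/d;p?q

the relative URI would be resolved as follows: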

g:h = g:h
g = http://a/b/c/g
./g = http://a/b/c/g
g/ = http://a/b/c/g/
/g = http://a/g
//g = http://g
?y = http://a/b/c/?y
g?y = http://a/b/c/g?y
#s = (current document)#s
g#s = http://a/b/c/g#s
g?y#s = http://a/b/c/g?y#s
;x = http://a/b/c/;x
g;x = http://a/b/c/g;x
g;x?y#s = http://a/b/c/g;x?y#s
. = http://a/b/c/
./ = http://a/b/c/
.. = http://a/b/
../ = http://a/b/
../g = http://a/b/g
../.. = http://a/
../../ = http://a/
../../g = http://a/g
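
Many of these examples can be checked mechanically. The sketch below assumes the base URI http://a/b/c/d;p?q, which is the base used for these examples in RFC 2396, and resolves a subset of the references with Python's `urllib.parse.urljoin`. The subset is deliberate: `urljoin` follows later revisions of the resolution rules, so references such as "?y" and "#s" may produce results that differ from the table and are left out.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # assumed base URI for the examples above

# Subset of the table above: relative reference -> expected absolute URI.
CASES = {
    "g:h": "g:h",
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "g?y": "http://a/b/c/g?y",
    "g#s": "http://a/b/c/g#s",
    ".": "http://a/b/c/",
    "..": "http://a/b/",
    "../g": "http://a/b/g",
    "../..": "http://a/",
    "../../g": "http://a/g",
}

for reference, expected in CASES.items():
    resolved = urljoin(BASE, reference)
    status = "ok" if resolved == expected else "DIFFERS"
    print(f"{reference:8} = {resolved:20} {status}")
```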
