Metadata for a Web of Data – Scott Davis on RDFa and Microformats
A very informative and entertaining presentation by Scott Davis of ThirstyHead on using RDFa and Microformats to build services that are OF the Web, not just ON the Web. Presentation slides can be found here.
glenn
Power of Metadata to enable Business Transformation
Introduction
I’m currently reading Pull: The Power of the Semantic Web to Transform Your Business, by David Siegel, published December 2009. It’s a business-oriented book on the power of the Semantic Web to enable new business models.
I’ve read several excellent technology-focused books on the Semantic Web, including Semantic Web Programming and Programming the Semantic Web. But this is the first book I’ve seen that specifically looks at the Semantic Web, and structured Metadata, from the vantage point of enabling Business Transformation and the development of new Business Models.
BTW, David Siegel recently delivered the keynote at SemTech 2010.
Metadata enables Smart Objects
I’m currently about 30 pages into the 250-ish page book, but several key messages have already been presented. The one that most immediately grabbed my attention is the notion of “smart objects”. [BTW, Siegel doesn’t explicitly use this term (at least he hasn’t thus far in the book), but it is a notion that underlies much of this message.]
The key idea here is that objects – products and content – have a unique ID, and associated metadata, such that they effectively “know about themselves”. They know what their meaning is and how to describe themselves, they know where they’ve been, they know what state they are in, and so forth. Obviously, these are not exactly the objects as we encounter them in the everyday world. Rather, we are speaking about an electronic representation of the object that has some smarts associated with it.
In the first two chapters, Siegel provides several examples in the Shipping and Retail industries. In the shipping industry, he talks about smart “packages”. Quoting Siegel:
Using a new universal tracking number and open standards for messaging creates package-level autonomy: the package itself will send a message to the customer to make sure he or she is there to receive it.
The basic idea here is that an electronic representation of a package exists, and that representation is basically “smart”. It knows about itself, and it can respond to events that are of interest to it. As well, the “vocabulary” that describes the data and events associated with packages is formalized as an industry standard, so that packages can easily cross process boundaries between companies that operate in different parts of the supply chain. This is the “package-level autonomy” that Siegel mentions above.
Another example is “smart products” in Retail. Here, Siegel provides examples of a Smart Cart and Smart Products. The Smart Cart knows what products you’ve put into it, and can take actions on the items currently in the cart – whether it’s adding up the total, applying coupons, or providing information at checkout.
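To make the “smart package” idea a bit more concrete, here is a minimal Python sketch. It is my own illustration, not something from Siegel’s book: the class name, the event names, and the tracking number format are all assumptions. The point is simply that the package carries its own ID, metadata, state, and history, and reacts to events itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SmartPackage:
    """Electronic representation of a package: an ID, metadata, state, history."""
    tracking_id: str  # hypothetical universal tracking number
    metadata: Dict[str, str] = field(default_factory=dict)
    state: str = "created"
    history: List[str] = field(default_factory=list)
    listeners: List[Callable[[str], None]] = field(default_factory=list)

    def handle(self, event: str) -> None:
        """Record the event, update the state, and notify listeners if needed."""
        self.history.append(event)
        self.state = event
        if event == "out_for_delivery":
            message = f"Package {self.tracking_id} arrives today. Will someone be there?"
            for notify in self.listeners:
                notify(message)


# The "customer" here is just a list that collects messages.
outbox: List[str] = []
package = SmartPackage("PKG-0001", {"recipient": "A. Customer", "contents": "books"})
package.listeners.append(outbox.append)

# Event names are made up; a real industry vocabulary would standardize them.
for event in ["picked_up", "in_transit", "out_for_delivery"]:
    package.handle(event)

print(package.state)
print(package.history)
print(outbox)
```

Because the event names would come from a shared vocabulary, any company in the supply chain could send the same events to the same package object without knowing who created it.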
Smart Products are tagged with bar codes and RFID codes as universal identifiers and tracking tags, whereby scanning these tags can provide product description information, competitive pricing information, and can be used to track the transport of product across various stages of its production and delivery lifecycle.
Smart Containers manage and leverage Smart Objects
It’s not, however, just the objects themselves that are smart. It is also the “containers” of these objects – whether the container is a Shopping Cart, a Shipment, a Carton of items, a Pallet of goods, or a Truck. Smart Containers know precisely the nature of the goods or products they contain, and what state they are in.
Applying the “Smart Object” concept to the Media Industry
Can these concepts be applied to the media industry? I think they can. The media industry has its own version of objects and containers of those objects. Our objects are most importantly Content – Articles, Photos, Videos, etc, and the discussions and conversations around that content. Our packages are the containers for this content – Publications, Websites, Web pages, and Stories that aggregate multiple types of content.
So, like the smart objects above, our objects – our Content – need to be “smart” or “intelligent”. Content needs to know about itself; it needs to be self-describing. And our “containers” and media products need to be able to take advantage of that intelligence – to understand what content is most relevant to our audiences, and make sure that content is available and discoverable by our users when they want it, where they want it, and in the form they want to consume it.
On the web, of course, one of the most important rationales for “smart content” is to make it easily discoverable by Search Engines – SEO-friendly, as they say. In this sense, a Search Results page is like a “dynamic container” that is constructed on-the-fly according to a Query that specifies relevance criteria (metadata) that express the intention of a user/consumer (machine or human) at that particular moment.
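Here is a rough Python sketch of that “dynamic container” idea. The content items, field names, and query shape are all invented for illustration; a real search engine does far more than this. The container is not stored anywhere: it is built on the fly from whichever items have metadata matching the query.

```python
from typing import Dict, List

# Hypothetical content items, each carrying its own descriptive metadata.
CONTENT: List[Dict[str, object]] = [
    {"id": "a1", "type": "article", "tags": {"climate", "policy"}},
    {"id": "v1", "type": "video", "tags": {"climate"}},
    {"id": "p1", "type": "photo", "tags": {"sports"}},
]


def build_container(query: Dict[str, object]) -> List[str]:
    """Return the IDs of items whose metadata satisfies every query criterion."""
    wanted_tags = set(query.get("tags", set()))
    wanted_type = query.get("type")
    results = []
    for item in CONTENT:
        if wanted_type is not None and item["type"] != wanted_type:
            continue  # wrong content type
        if not wanted_tags <= item["tags"]:
            continue  # item is missing at least one requested tag
        results.append(item["id"])
    return results


print(build_container({"tags": {"climate"}}))
print(build_container({"tags": {"climate"}, "type": "video"}))
```

The better the metadata on each item, the more precisely a query like this can express what the user wants at that moment.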
We also need to be able to learn from the behavior and media consumption patterns of our users, and to calibrate our content delivery to their preferences and behaviors.
In Summary
Well that’s it really. Just wanted to:
- Introduce the notion of “smart” or “intelligent” objects and containers, powered by metadata
- Suggest the power of these intelligent objects to transform existing, and enable new, business models, and
- Suggest that the Media industries have their own versions of smart objects and containers – their content, and the platforms, products, and delivery channels that showcase their content.
glenn
Understanding Classification and Taxonomies – Building Enterprise Taxonomies
Currently reading a very nice book on designing Classification systems and Taxonomies titled Building Enterprise Taxonomies, authored by Darin L. Stewart (Director of Web Strategies and Research Information Services for Oregon Health and Science University), published in 2008.
The title of the book, IMO, is a bit of a misnomer. This is not so much a book about designing Taxonomies for Enterprises, as it is an elegant, easy-to-digest framework for classifying knowledge and designing Knowledge Representation, Search, and Discovery environments. To prove the point, here are the chapter titles (with my comments appended):
- Findability – on Search and Information Discovery
- Metadata – including an overview of Dublin Core
- Taxonomy – an overview of Classification systems, and the role of Controlled Vocabularies
- Preparations – references what Stewart calls the Taxonomy Development Lifecycle
- Terms
- Structure – a deeper exploration of the task of Categorization
- Ontology – exploring the Semantic Web
- Folksonomy – community-generated Classification in an era of the Social Web
Pretty great stuff eh?
Stewart also makes reference to a very nice research article from 1999 by Barbara H. Kwasnik: Role of Classification in Knowledge Representation and Discovery. It’s a nice piece.
All for now.
glenn
What is RDFa? – Mark Birbeck
In a previous post, I referenced an excellent talk that Mark Birbeck gave at Google in 2009, as well as a couple excellent introductory articles he wrote on RDFa.
I was re-viewing Birbeck’s Google TechTalk on RDFa, and really liked his brief explanation about what RDFa actually is. So thought I’d quote Birbeck from his talk:
I’m using RDFa as a bit of a shorthand, because I’m saying really “embedded metadata”. I’m saying any way of actually putting information into the HTML page, rather than the traditional semantic web approach of having a “separate channel”. By separate channel, I’m saying you might have had an RDF-XML document, or even an RSS feed you could regard as a kind of semantic channel of information. But a channel of information that’s kind of distinct from the web page.
Whereas what we’ve done with RDFa, and what the people behind Microformats were doing, basically the same goal, was actually make the HTML page the carrier of the metadata. And some times it’s carrying metadata about other things, and sometimes it’s carrying metadata about itself. So really, when I say RDFa (throughout this talk) I’m generally meaning those kind of solutions that allow you to embed metadata.
The reason I’m favoring RDFa is because its very specific goal was to align itself with RDF, so it’s actually much more precise than Microformats, but the idea is the same that you embed information [in the HTML page].
So that’s the purpose of RDFa according to Birbeck. As far as what RDFa actually is:
As for what it is, it’s a W3C standard now. It’s something we’ve been working on for four or so years – which I guess is quick for the W3C, we’ve been working on it for quite a long time, and it recently became a standard.
And it’s very much about defining the syntax of how you embed information. It’s not really about saying what the vocabularies should be. Whereas Microformats is very much more about the vocabularies.
And a good example of the flexibility of what that brings is when Google did its Rich Snippets, it just came out with its own vocabulary. It got a lot of stick for it from the Semantic Web community, or some there. But the point is that you were able to just come out with your own vocabulary, because RDFa is about the syntax and the structure, rather than the actual terms.
So it’s very much in the spirit of the Web in the sense that it allows people to define their own vocabularies or reuse existing vocabularies, and put them into their documents however they see fit.
So RDFa is a standard, and its goal is embedding metadata in pages.
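To show what “the HTML page as the carrier of the metadata” can look like, here is a small Python sketch of my own; it is not from Birbeck’s talk. The HTML fragment is made up (the URL path, author name, and date are placeholders), though it uses the real RDFa `about`, `property`, and `content` attributes and the Dublin Core namespace. The extractor is deliberately naive: it only collects `property` values using the standard library’s `html.parser`, and is nowhere near a conformant RDFa processor.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Hypothetical page fragment with metadata embedded in the markup itself.
PAGE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/posts/what-is-rdfa">
  <h2 property="dc:title">What is RDFa?</h2>
  <span property="dc:creator">Example Author</span>
  <meta property="dc:date" content="2010-01-01" />
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs; a toy, not a real RDFa parser."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: List[Tuple[str, str]] = []
        self._open_property: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop is None:
            return
        if "content" in attributes:
            # The value is given explicitly in the content attribute.
            self.pairs.append((prop, attributes["content"]))
        else:
            # Otherwise the value is the element's text, read in handle_data.
            self._open_property = prop

    def handle_data(self, data):
        if self._open_property is not None and data.strip():
            self.pairs.append((self._open_property, data.strip()))
            self._open_property = None


extractor = PropertyExtractor()
extractor.feed(PAGE)
extractor.close()
for name, value in extractor.pairs:
    print(name, "=", value)
```

Note that the vocabulary (here `dc:`) is a separate choice from the syntax, which is exactly Birbeck’s point: swap in another vocabulary and the same attributes still carry it.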
That’s a very nice explanation I must say. Please view the entirety of Birbeck’s talk for deeper insight into the mechanics of RDFa.
glenn
Dublin Core Metadata Initiative (DCMI) – Learning Resources
A nice set of learning resources for the Dublin Core Metadata Initiative at the DCMI’s Metadata Training Resources page. In particular, there’s a series of links to presentations delivered by Makx Dekkers and Thomas Baker in December 2009 in Florence, Italy. For ease of access, here are the links:
- History, objectives and approaches of the Dublin Core Metadata Initiative – Makx Dekkers
- DCMI and the metadata landscape – Makx Dekkers
- Basics of Dublin Core Metadata – Thomas Baker
- Data Integration and Structured Search – Thomas Baker
- The “metadata record” and DCMI Abstract Model – Thomas Baker
- Web-enabled vocabularies – Thomas Baker
- Linking legacy data – Thomas Baker
- Outcomes of DC-2009 – Makx Dekkers
Just reading the Basics of Dublin Core Metadata presentation now, and for someone who’s relatively new to Dublin Core, it’s both fascinating and very well presented. Just a couple quick slide visuals to illustrate. First, the Dublin Core vocabulary circa 2000:
A very nice, clean, well-factored representation. And then there’s the important migration from 2003-2007 of the Dublin Core to RDF and the Semantic Web:
And then there’s the structured search scenarios that Dublin Core seeks to enable today:
And so the story progresses. Lots of semantic gold in them thar presentations. 🙂
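For readers who haven’t met Dublin Core before, here is a small Python sketch of a record built from the fifteen elements of the original Dublin Core element set. The function name is my own, and the output follows the common `DC.`-prefixed `meta` tag convention for HTML as I understand it; treat it as an illustration rather than a reference implementation. The sample record describes the Stewart book mentioned in an earlier post.

```python
from html import escape
from typing import Dict, List

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_meta_tags(record: Dict[str, str]) -> List[str]:
    """Render a record as HTML meta tags, rejecting non-Dublin Core keys."""
    unknown = sorted(set(record) - DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {unknown}")
    return [
        f'<meta name="DC.{name}" content="{escape(value, quote=True)}">'
        for name, value in record.items()
    ]


record = {
    "title": "Building Enterprise Taxonomies",
    "creator": "Darin L. Stewart",
    "date": "2008",
}
tags = to_meta_tags(record)
for tag in tags:
    print(tag)
```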
glenn
Enterprise Metadata – thoughts
The company I work for is about to embark on an Enterprise Metadata initiative. So I thought I’d write an introductory blog post on what metadata is, and how I think of metadata within the Enterprise. So here goes …
What is metadata?
So here’s the Wikipedia page on metadata. Blah, blah, blah. To me, saying metadata is “data about data” is about as useful as saying information is information about information. I mean, it’s a bit recursive don’t you think?
I view metadata as descriptive information about a “thing”, where the “thing” is anything that can be represented as a concept. Depending on the context, this “thing” could be a person, a topic, a piece of content, an ad, a real-world entity like a truck or a house, or even an abstract concept like “love” or “beauty”. Any piece of information, or “semantics”, describing the underlying entity can be viewed as metadata.
The descriptive information can be information about the thing itself (for example the name and address of a business), or the relational context of the thing (for example, as we will see below, related content or persons associated with that thing).
What is “enterprise” metadata?
Enterprise metadata, therefore, is descriptive information that describes the core concepts within an enterprise – customers, ads, content, business units, you name it – and the web of things that are related to it.
Representing Metadata
So how is metadata represented in information systems? Well, it can be represented in many ways. It can be represented as fields in a table (i.e. a relational database), as tags, as categories, or as Properties associated with an Object in code, or even associated with a variable baked into the code (bad, bad programming!). Metadata can be represented as “structured” data (as in a relational database or markup language), or “semi-structured” or “unstructured” data (as in a Word document).
Representing metadata as a Graph
However, the most powerful way of representing metadata that is highly-relational and subject to change is with a graph. Here, we don’t mean “graph” in the sense of a visual representation of data in Excel. But rather in the mathematical sense of the term, as a network of nodes and links (or edges). In social networking, this is how people are related together … in terms of a Social Graph.
With graph-based data representations, there’s essentially no difference between data and metadata. What is viewed as data from one perspective, can be viewed as metadata from another perspective. For more on this topic, see my previous posts here and here.
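Here is a minimal Python sketch of that point, using a list of (subject, predicate, object) triples as the graph. The node names and predicates are invented for illustration. Notice that `person:author` shows up as metadata about the article, and is also a node with metadata of its own.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical graph: every fact is one edge between two nodes.
GRAPH: List[Triple] = [
    ("article:42", "has_title", "Enterprise Metadata"),
    ("article:42", "written_by", "person:author"),
    ("person:author", "has_name", "Example Author"),
]


def describe(node: str) -> List[Tuple[str, str]]:
    """Outgoing edges: what the graph says about this node."""
    return [(p, o) for s, p, o in GRAPH if s == node]


def referenced_by(node: str) -> List[Tuple[str, str]]:
    """Incoming edges: where this node is used to describe something else."""
    return [(s, p) for s, p, o in GRAPH if o == node]


print(describe("article:42"))          # the author appears as metadata here
print(describe("person:author"))       # and as data with its own metadata here
print(referenced_by("person:author"))
```

Whether `person:author` counts as “data” or “metadata” depends only on which node you start from.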
Listings Metadata – an example
With this introduction, let’s consider what metadata might be associated with a Business Listing. So traditionally, when one thinks of a Business Listing, one might think of something that looks like this:
In this example above, you would probably say that the metadata associated with this Listing is the name of the business, the location of the business, phone #, etc.
However, what if the content associated with a business listing was displayed on an entire page, like this:
Here we see the entire page is full of metadata – or associated content – about the business listing. We have the business listing itself, but we also have all sorts of additional information/content associated with the listing: comments, editorial reviews, a map showing where the business is located, perhaps pricing information about the business’ products, and even a video supplied by the proprietor. We may also have descriptive tags associated with the listing or business, as well as people who “subscribe” or “follow” the listing, and want to be notified of updates to the listing.
Now the listing is less the descriptive data associated with the physical image of the listing provided in the initial example, but more like a “concept”, with all sorts of associated content and metadata – some basic textual information, and a whack of related content and even the social context of community members who might be interested in the listing, or who have contributed content.
This “web of related content” associated with the listing can be represented as a “graph” (as discussed above), which forms a “web” of related objects of content associated with the listing.
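As a rough sketch of what that graph might look like in code, here is a Python example that walks outward from a listing to collect everything connected to it. The listing, the related items, and the hop limit are all assumptions made up for the example.

```python
from collections import deque
from typing import Dict, List

# Hypothetical graph around one business listing (adjacency list).
EDGES: Dict[str, List[str]] = {
    "listing:cafe": ["review:1", "map:cafe", "video:tour", "tag:coffee", "user:ann"],
    "review:1": ["user:bob"],  # the review was written by another community member
}


def related(start: str, max_hops: int) -> List[str]:
    """Breadth-first walk: everything reachable within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    found: List[str] = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, hops + 1))
    return found


print(related("listing:cafe", 1))  # content directly attached to the listing
print(related("listing:cafe", 2))  # plus the people behind that content
```

One hop gives you the page described above; two hops starts to reveal the social context around the listing.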
In Summary …
And that is really how I view metadata. It’s the immediate information that describes or characterizes a “thing”. But it’s also the web of contextual information that is associated with the thing – related content, user-generated content, social context, and so forth.
Thoughts? Comments?
glenn
Semantics (and Metadata) at the New York Times
***** Nov 10 2009 Update:
I have uploaded a summary doc of the NY Times presentation. Please click the following link to access: Semantics at The New York Times – notes – SemTech 2009
*****
Yet another great presentation from the SemTech 2009 conference this past June in San Jose. This presentation is on Semantics at the New York Times.
Here is a slide presentation that the New York Times delivered at a different conference, but it’s very similar to the one delivered at SemTech.
The (Long) History of Metadata at the New York Times
The presentation starts out exploring the history of metadata at the New York Times, from the beginnings of their Morgue archive which was created at the newspaper’s inception in, if you can believe, 1851. The so-called Morgue was not a collection of corpses (thank goodness), but rather a collection of newspaper clippings and photos.
No subject was too big or small to be indexed in the Morgue. As the Times VP of Digital Production Rob Larson states in the presentation, in 1907 the Times’ Managing Editor Carr Van Anda invested in the Morgue to add staff and rigor of organization to the files, and a Tagging system grew up around this effort.
At its zenith a few decades ago, the Morgue had a staff of 24 persons, creating 600 new clip folders per week, cutting up 36 editions of the final New York City edition of the Times, as well as copies of other prominent newspapers.
Within its main operation on the third floor, there were more than 4,000 cabinet drawers of newspaper clippings, containing 1,126,000 named individuals (including animals, etc), 65,000 subject headings, 300,000 ships and planes, 500,000 places, and 500,000 corporations. (Wow!)
The Morgue is only one form of tagging system used at the Times – others include the New York Times Index and the NYTimes.com website.
So what is the Tagging workflow at the New York Times?
A few slides to show from the presentation. The first slide depicts the tagging workflow at the New York Times, and what roles apply metadata at what step in the workflow.
This visual oversimplifies the underlying complexity of the application of metadata, however, in the editorial workflow. Here’s a very-hard-to-read workflow diagram of the stages at which metadata is applied in the NY Times – which suggests the overall complexity of the end-to-end workflow, to both Print and Online channels.
Why Tag?
Another core visual is shown below, which summarizes the motivation for tagging – that is the various use cases for metadata-tagged content at the Times.
Rob Larson specifically addresses the importance of metadata for generating NY Times Topic Pages, 4 examples of which are provided below:
The Future
Next the presenters address the future of metadata (and now the talk turns more to “semantics”) at the NY Times.
What near-term plans does the Times have for evolving their metadata management practice? See the slide below:
Next up the presenters discuss the New York Times’ various Open Data initiatives, and the APIs the Times is making available to the public to access and build applications on top of its data.
New York Times and Linked Data
Finally, the New York Times announced at SemTech the next phase of their Open Data strategy, which is to prepare their Corpus to be exposed to the Linked Data Cloud.
Interesting stuff.
glenn
"People, Places, Subjects" – BBC Topic and Guardian keyword pages
More great content from the Guardian’s Information Architect Martin Belam. In this series of posts, he explores the metadata and taxonomy strategies at the BBC and Guardian. Here are the posts:
- “People, Places, Subjects” – BBC Topic and Guardian keyword pages: Part 1
- “People, Places, Subjects” – BBC Topic and Guardian keyword pages: Part 2
- “People, Places, Subjects” – BBC Topic and Guardian keyword pages: Part 3
- “People, Places, Subjects” – BBC Topic and Guardian keyword pages: Part 4
- “People, Places, Subjects” – BBC Topic and Guardian keyword pages: Part 5
Here’s the BBC’s presentation of a Topic – in this case, Climate Change. Note how videos, news, and blogs are aggregated for a particular topic. At first glance, the page has a nice look and feel. But the approach is somewhat brittle. The stories under the Topics don’t seem to be aggregated search results, but rather stories placed into the Topics as “one-off” stories.
By contrast, here is the Guardian’s Climate Change topic page. Note that the Guardian has also implemented a Taxonomy around its topics, with Climate Change being a sub-topic of the broader Environment topic.
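To illustrate the difference a taxonomy makes, here is a small Python sketch. It is purely my own guess at the general mechanism, not a description of how the Guardian or the BBC actually build their pages, and the topics and headlines are placeholders. Each story is tagged once, and a topic page is computed from the tags, so a parent topic automatically picks up stories tagged with its sub-topics.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy: each topic points to its parent (None for a root).
PARENT: Dict[str, Optional[str]] = {
    "environment": None,
    "climate-change": "environment",
    "recycling": "environment",
}

# Hypothetical stories, each tagged with a single topic keyword.
STORIES: List[Dict[str, str]] = [
    {"headline": "Story A", "topic": "climate-change"},
    {"headline": "Story B", "topic": "recycling"},
    {"headline": "Story C", "topic": "environment"},
]


def is_within(topic: Optional[str], ancestor: str) -> bool:
    """True if topic is the ancestor itself or sits somewhere beneath it."""
    while topic is not None:
        if topic == ancestor:
            return True
        topic = PARENT.get(topic)
    return False


def topic_page(topic: str) -> List[str]:
    """Aggregate headlines for a topic and all of its sub-topics."""
    return [s["headline"] for s in STORIES if is_within(s["topic"], topic)]


print(topic_page("climate-change"))
print(topic_page("environment"))
```

With this approach nobody places stories into a topic page by hand; tagging a new story is enough for it to appear on every relevant page.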
*** Update #1 ***
Kind of interesting exploring this whole “Topic” theme. Here’s the New York Times Topics page, which looks to be a pretty standard “sections” based approach that a Newspaper might be expected to take.
OTOH, here are all the topic pages about people, places, organizations, and subjects that start with the letter “A”. Clearly, there’s some keyword indexing going on here … although I’m not crazy about the presentation.
****************
More later,
glenn