ia play

the good life in a digital age

Archive for the ‘information architecture’ Category

why your search engine (probably) isn’t rubbish


Now all search engines struggle, to varying degrees, with the knotty mess that is natural language. But they don’t generally get called rubbish for not succeeding with the meaty search challenges.

Rubbish search engines are the ones that can’t seem to answer the most basic requests in a sensible manner. These are the ones that get mocked as “random link generators”, the gibbering wrecks of their breed.

Go to Homebase and search for “rabbit hutch” (we need another one as two of our girls are about to produce heaps of bunnies at the same time).

The first result is “Small plastic pet carrier”. There are a number of other carriers and cages. Then there’s a “Beech Finish Small Corner Desk with Hutch”. Finally there’s a Pentland Rabbit Hutch at result no. 8. I asked for “rabbit hutch” and they’ve got a rabbit hutch to sell me, but they’re showing me pet carriers and beech finish corner desks.

This is a rubbish set of results. But it doesn’t mean the search engine is rubbish.

Somebody made a rubbish decision. They’ve set it up shonky.

So before you reach for the million pound enterprise search project, try having a quick look under the bonnet with a spanner.

Is it AND or OR?

This is reasonably easy to test, if you can’t ask someone who knows.

Pick a word that will be rare on your site and another word that doesn’t appear with the rare one, e.g. “Topaz form” for my intranet. A rare word is one that should only appear once or twice in the entire dataset, so you can check that the other word doesn’t appear with it. You may need to be a bit imaginative, but unique things like product codes can be helpful here. If the query returns no results, you’ve probably got an AND search. More than a couple of results (and ones that don’t mention Topaz) and you’ve probably got OR.

(This can get messed up if there is query expansion going on, but hopefully the rare word isn’t one that whatever query expansion rules are in place will work on.)
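If it helps to see the difference in miniature, here’s a toy sketch in Python. The documents and query terms are invented for illustration, and a real engine works against an inverted index rather than raw word sets, but the logic of the setting is the same:

```python
# Toy illustration of AND vs OR matching.

def matches(doc_words, query_words, mode):
    """Does this document satisfy the query under AND or OR matching?"""
    hits = [word in doc_words for word in query_words]
    return all(hits) if mode == "AND" else any(hits)

docs = {
    "annual leave form": {"annual", "leave", "form"},
    "expenses form": {"expenses", "form"},
    "public holiday dates": {"public", "holiday", "dates"},
}

query = ["holiday", "form"]

and_results = [name for name, words in docs.items() if matches(words, query, "AND")]
or_results = [name for name, words in docs.items() if matches(words, query, "OR")]

# AND finds nothing (no page contains both words); OR finds all three,
# which is why the sort order then has to do the heavy lifting.
```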

AND is more likely to be problematic as a setting. You’ll get lots of “no results”. You’ll need your users to be super precise with their terminology and spell every word right. If they are looking for “holiday form” and the form is called “annual leave form”, they’ll get no results.

OR will generate lots of results. This is ok if the sort order is sensible. Very few people care that Google returned 2,009,990 results for their query. They just care that the first result is spot-on.

So most of the time you probably want an OR set-up.

(Preferably combined with support for phrase searching, so users can choose to put their searches in nice speech marks to run an AND search if they want to and know how to.)

Is there crazy stemming/query expansion going on?

Query expansion is search systems trying to be clever, often getting it wrong, and not telling you what they’ve done so you can unpick it. Basically the search system takes the words you gave it and gives you results for those words, plus some others that it thinks are relevant or related.

Typical types of expansion are stemming (expanding a search for fish to include fishes and fishing), misspelling correction, and synonyms (expanding a search for cockerel to include rooster).
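In miniature, expansion looks something like this. The stemming and synonym tables here are made up for the sketch; real stemmers derive variants from morphological rules rather than a hand-written lookup:

```python
# Toy query expansion: each query term is replaced by a set of terms.
# These tables are invented; real systems generate them automatically.

STEM_GROUPS = {"fish": {"fish", "fishes", "fishing"}}
SYNONYMS = {"cockerel": {"rooster"}}

def expand(term):
    """Return the set of words the engine will actually search for."""
    expanded = {term}
    expanded |= STEM_GROUPS.get(term, set())
    expanded |= SYNONYMS.get(term, set())
    return expanded

# A search for "fish" quietly becomes a search for three words, and a
# search for "cockerel" also matches "rooster" - usually without the
# engine telling you it has done so.
```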

This is probably what is happening if you are getting results that don’t seem to include the words you searched for anywhere on the page (although metadata is another option).

Now this stuff can be really, really helpful. If it is any good.

Have you got smart, sophisticated query expansion like Google? Or does it do silly (from a day-to-day, not a Latin, perspective) stemming like equating animation with animals? If it’s the silly version then definitely switch it off (or tweak it if you can).

Even if you’ve got smart expansion options available, it’s generally best practice to either give the user the option of running the expanding (or alternate) query, or at the very least of undoing it if you’ve got it wrong. They won’t always spot the options (Google puts lots of effort into coming up with the right way of doing this) but it’s bad search engine etiquette to force your query on a user.

Is the sort order sensible?

That Homebase example. The main problem there is sorting by price low-to-high. That’d be fine (actually very considerate of Homebase) if I’d navigated to a category full of rabbit hutches. But I didn’t. I searched for rabbit hutches and got a mixed bag of results that included plenty of things that a small child could tell you aren’t rabbit hutches.

The solution? Sort by relevancy.

I’ve seen quite a lot of bad search set-ups recently where the sort order was set to alphabetical. Why? Unless, as Martin said when I bemoaned this on Twitter, your main use case is “to enable people to find stuff about aardvarks”.

News sites sometimes go with most recent as the sort order. That kinda makes sense, but you need to be sure the top results are still relevant, not just recent.

Interestingly, sort order doesn’t matter so much if you’ve gone for AND searches and you haven’t got any query expansion going on. If you’re pretty sure that everything in the result set is relevant, then you’ve got more freedom over sort order. If not, stick with relevancy.
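The Homebase problem in miniature (a Python sketch; the prices and relevance scores are invented for illustration):

```python
# Why price low-to-high is the wrong default for a mixed result set.
# Prices and relevance scores are invented for illustration.

results = [
    {"name": "Small plastic pet carrier", "price": 9.99, "relevance": 0.2},
    {"name": "Beech finish corner desk with hutch", "price": 89.99, "relevance": 0.3},
    {"name": "Pentland Rabbit Hutch", "price": 119.99, "relevance": 0.9},
]

by_price = sorted(results, key=lambda r: r["price"])
by_relevance = sorted(results, key=lambda r: r["relevance"], reverse=True)

# Price order buries the only actual rabbit hutch at the bottom;
# relevance high-to-low puts it first.
```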

(I don’t need to tell you that you want relevancy sorted high to low, do I?)

So, people, stop giving me grief over navigation. Let’s talk about that rubbish search engine you’ve got. I could probably fix that for you.

Written by Karen

March 5th, 2010 at 6:04 am

Posted in search


worst drop down so far this year


Drop-down menus aren’t inherently evil but they do seem to encourage all sorts of terrible behaviour.

HMCS CourtFinder includes a menu that is certainly the worst I’ve had to interact with this year, and probably for quite a long time before that.

Stupid menu

The list is incredibly long. But more damagingly it isn’t in *any* order that I can see. Nor is this a list where you or I are likely to be sure exactly what the term we’re looking for is. After all, “types of court work” isn’t a classification that most of us know off by heart.

Written by Karen

February 9th, 2010 at 6:15 am

topical navigation on CHOW


CHOW has a nice example of topical navigation.

Timely nav

It’s cold, people are trying to eat healthily, and it is Superbowl time (for the Americans anyway). So the navigation includes nachos, snacks, braises and healthy recipes.

I’m very fond of this kind of navigation. For big sites it is rare that the navigation actually contains exactly what the user is looking for; instead it provides a starting point for a journey. But for any site where interest in content is influenced by outside events, you can use this knowledge to get users where they are going much, much faster and with greater confidence.

Written by Karen

February 8th, 2010 at 6:00 am

Posted in navigation

ways of adding metadata


I was digging around in my files this weekend and found this table I made once of different approaches to applying metadata to content. At first glance the volunteers example looks like it is only relevant to charities, but in a lot of scenarios that refer to users tagging, it is actually volunteers tagging. The difference is doing something for your own benefit (users) or contributing something to a greater cause (volunteers).

Users
Who does what: users apply metadata to their own content or content they have gathered for their own use.
Strengths: cheap; real user language; subjective value judgements; highly reactive; latest trend vocab.
Weaknesses: no guarantee of contributions; the same tag means different things; different tags mean the same thing; cryptic personal tags; smaller interpretations drowned out; hardly anyone goes back and changes out-of-date tagging.
Recommended environment: a large user base with a *selfish* motivation for users – often gathering/collecting – and a reasonably shared vocabulary. Rarely works on a single site where the user could instead aggregate links or content on a generic site like delicious.

Volunteers
Who does what: unpaid volunteers apply metadata to content produced by others, e.g. Freebase.
Strengths: depending on how it is handled, can be more predictable and reliable than users; may be close to user language; can be guided more like staff and asked to go back and make changes.
Weaknesses: can require more management/attention than users; smaller numbers may not add up to enough hours; probably not viable in most commercial enterprises – although it can still work if the company offers a free-at-consumption service that may be perceived as a public good.
Recommended environment: where you can rely on lots of good will. Probably in combination with another approach, unless a large number of volunteers are likely.

Staff (authors)
Who does what: the paid author applies metadata to their own content.
Strengths: small commitment required from each staff member; expert knowledge of the content.
Weaknesses: low motivation and interest; may be too close to the content to understand user needs; more likely to be formal/objective.
Recommended environment: where you have good historical examples of imposing new activities on the authors and getting them to follow them – probably quite a process- and guideline-driven organisation. Bad where your authors think of themselves as creatives… they’ll think metadata is beneath them.

Staff (specialists)
Who does what: paid metadata specialists apply metadata to content produced by others.
Strengths: highly motivated; objectives likely to be tied to the quality of this work.
Weaknesses: cost; needs to read the content first; may not necessarily be user focused; more likely to be formal/objective.
Recommended environment: strong information management skills in the organisation. The project needs to be resourced on an ongoing basis. The business probably needs to see a very close correlation between the quality of the metadata and profit.

Automatic (rules)
Who does what: software applies metadata to content based on rules defined by specialists.
Strengths: more efficient than the staff options.
Weaknesses: needs operational staffing.
Recommended environment: as for specialist staff. Strong technical and information management skills in the organisation, and an understanding from management of the ongoing need for operational staffing.

Automatic (training sets)
Who does what: software applies metadata to content based on training sets chosen by specialists.
Strengths: more efficient than the staff options.
Weaknesses: hard to control; can be a ‘black box’; needs a mechanism for addressing errors.
Recommended environment: management who do not believe the vendors’ promises.

Written by Karen

October 7th, 2009 at 6:51 am

Search Solutions 2009


Last week I went to the Search Solutions event, held by BCS in their lovely office in Southampton Street. There were maybe 50 people, 6 or 7 women and seemingly even fewer laptops (which rather made it stand out from the more web-focused events I usually attend – because of the lack of laptops, not the male-female ratio).

I didn’t make masses of notes but I did capture a few points and reminders:

Vivian Lin Dufour from Yahoo talked about Search Pad, an attempt to make search more “stateful”.

Richard Russell from Google explained how the auctions for Google Ads work. Always interesting to hear more about the money side of things.

Dave Mountain, a geographer (another example of Nominative Determinism?) talked about geographical aspects of searching. He explained that if the task is “finding the nearest cafe”, then the ‘near’ isn’t a simple statement. There are types of near: as the crow flies, in travel time, in the direction I’m already going. After all you may not be interested in a cafe that’s already 5 miles behind you on the motorway. He had some good slides covering this, so hopefully they’ll be made available.

Tony Russell-Rose discussed Endeca’s impending pattern library. Should be interesting – public version to be available in the new year.

David White of Web Optimiser talked amongst other things about the importance of cross-media optimisation. He asked why don’t more companies, especially b2b ones, have phone numbers in title/description of search results? He also touched on the growth of twitter as a substantial source of referrals (in response to a question about whether Bing was increasing referrals and thus changing optimisation tactics).

Richard Boulton, as well as discussing his efforts with open source search, introduced us to the marvellous concept of dev/fort/.

“Imagine a place of no distractions, no IM, no Twitter — in fact, no internet. Within, a group of a dozen or more developers, designers, thinkers and doers. And a lot of food.

Now imagine that place is a fort.”

Well marvellous to me but I wanted to get married in a Napoleonic fort so perhaps I’m not typical. He also mentioned searchevent.org, a day dedicated to open source search systems, which will hopefully happen again sometime.

Andrew Maisey talked about a school of thought that search will increasingly become less important on the site. Dynamic user journeys will encourage more browsing.

(Food was pretty good as usual for the venue.  I’m hoping that we’re going back to BCS for our team away-day later in the year and then I can have more of the strawberry tarts.)

Written by Karen

October 5th, 2009 at 6:54 am

Posted in events,search

SharePoint search: more insights


Surprisingly, this white paper on building multilingual solutions in SharePoint provides a good overview of how the search works, regardless of whether you are interested in the multilingual aspect.

White paper: Plan for building multilingual solutions.

Read page 15, titled “Overview of the language features in search”, for a description of content crawling and search query extraction. Then pages 16–18 provide a good overview of individual features and what they are doing.

Word breakers A word breaker is a component used by the query and index engines to break compound words and phrases into individual words or tokens. If there is no word breaker for a specific language, the neutral word breaker is used, in which case word breaking occurs where there are white spaces between the words and phrases. At indexing time, if there is any locale information associated with the document (for example, a Word document contains locale information for each text chunk), the index engine will try to use the word breaker for that locale. If the document does not contain any locale information, the user locale of the computer the indexer is installed on is used instead. At query time, the locale (HTTP_ACCEPT_LANGUAGE) of the browser from which the query was sent is used to perform word breaking on the query. Additional information about the language availability of the word breaker component is available in Appendix B: Search Language Considerations.

Stemming Stemming is a feature of the word breaker component used only by the query engine to determine where the word boundaries are in the stream of characters in the query. A stemmer extracts the root form of a given word. For example, ”running,” ”ran,” and ”runner“ are all variants of the verb ”to run.” In some languages, a stemmer expands the root form of a word to alternate forms. Stemming is turned off by default. Stemmers are available only for languages that have morphological expansion; this means that, for languages where stemmers are not available, turning on this feature in the Search Result Page (CoreResult Web Part) will not have any effect. Additional information about language availability for the Stemmer feature is available in Appendix B: Search Language Considerations.

Noise words dictionary Noise words are words that do not add value to a query, such as ”and,” ”the,” and ”a.” The indexing engine filters them to save index space and to increase performance. Noise word files are customizable, language-specific text files. These files are a simple list of words, one per line. If a noise word file is changed, you must perform a full update of the index to incorporate the changes. Additional information about the noise words dictionary and how to customize it is available at www.microsoft.com.

Custom dictionary The custom dictionary file contains values that the search server must include at index and query times. Custom dictionary lists are customizable, language-specific text files. These files are used by Search in both the index and query processes to identify exceptions to the noise word dictionaries. A word such as “AT&T,” for example, will never be indexed by default because the word breaker breaks it into single noise words. To avoid this, the user can add ”AT&T” to the custom dictionary file; as result, this word will be treated as an exception by the word breaker and will be indexed and queried. These files contain a simple list of words, one per line. If the custom dictionary file is changed, you must perform a full update of the index to incorporate the changes. By default, no custom dictionary file is installed during Office SharePoint Server 2007 Setup. Additional information about the custom dictionary file and how to customize it is available at www.microsoft.com.
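As a rough sketch of the noise-word and custom-dictionary behaviour described above (Python; the word lists are invented, and this is far cruder than a real word breaker):

```python
# Sketch of noise-word filtering with a custom-dictionary exception.
# The word lists are invented; a real word breaker is far more subtle.

NOISE_WORDS = {"and", "the", "a", "at", "t"}
CUSTOM_DICTIONARY = {"AT&T"}  # indexed whole, never broken into noise words

def index_tokens(text):
    tokens = []
    for word in text.split():
        if word in CUSTOM_DICTIONARY:
            tokens.append(word)  # exception: keep as-is
            continue
        # naive word breaking on "&", then drop noise words
        parts = word.lower().replace("&", " ").split()
        tokens.extend(p for p in parts if p not in NOISE_WORDS)
    return tokens

# "AT&T" survives only because of the custom dictionary entry; without
# it, "at" and "t" would both be filtered out as noise words.
```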

Thesaurus There is a configurable thesaurus file for each language that Search supports. Using the thesaurus, you can specify synonyms for words and also automatically replace words in a query with other words that you specify. The thesaurus used will always be in the language of the query, not necessarily the server’s user locale. If a language-specific thesaurus is not available, a neutral thesaurus (tseneu.xml) is used. Additional information about the thesaurus file and how to customize it is available at www.microsoft.com.

Language Auto Detection The Language Auto Detection (LAD) feature generates a best guess about the language of a text chunk based on the Unicode range and other language patterns. Basically, it’s used for relevance calculation by the index engine and in queries sent from the Advanced Search Web Part, where the user is able to specify constraints on the language of the documents returned by a query.

Did You Mean? The Did You Mean? feature is used by the query engine to catch possible spelling errors and to provide suggestions for queries. The Did You Mean? feature builds suggestions by using three components:

· Query log Information tracked in the query log includes the query terms used, when the search results were returned for search queries, and the pages that were viewed from search results. This search usage data helps you understand how people are using search and what information they are seeking. You can use this data to help determine how to improve the search experience for users.

· Dictionary lexicon A dictionary of most-used lexicons provided at installation time.

· Custom lexicon A collection of the most frequently occurring words in the corpus, built at query time by the query engine from indexed information.

The Did You Mean? suggestions are available only for English, French, German, and Spanish.
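A crude approximation of the suggestion idea, with Python’s difflib standing in for the lexicons. The lexicon here is invented, and a real implementation (as described above) also weights candidates by query-log and corpus frequency:

```python
# Toy "Did You Mean?" suggestion: pick the closest word in a lexicon.
import difflib

LEXICON = ["taxonomy", "metadata", "navigation", "search"]

def did_you_mean(term):
    """Return the closest lexicon word, or None if nothing is close."""
    close = difflib.get_close_matches(term.lower(), LEXICON, n=1, cutoff=0.7)
    return close[0] if close else None

# "metadta" is close enough to suggest "metadata"; a term nothing like
# the lexicon gets no suggestion at all.
```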

Definition Extraction The Definition Extraction feature finds definitions for candidate terms and identifies acronyms and their expansions by examining the grammatical structure of sentences that have been indexed (for example, NASA, radar, modem, and so on). It is only available for English.

Written by Karen

September 30th, 2009 at 6:56 am

Posted in search,sharepoint

BCS IRSG – Search Solutions 2009


I’m going to “Innovations in Web and Enterprise Search” at BCS next week.

Search Solutions is a special one-day event dedicated to the latest innovations in web and enterprise search. In contrast to other major industry events, Search Solutions aims to be highly interactive and collegial, with attendance limited to 60-80 delegates.

Provisional programme

09:30 – 10:00 Registration and coffee

Session 1: (Chair: Tony Russell-Rose)

* 10:00 Introduction – Alan Pollard, BCS President

* 10:10 “Enterprising Search” – Mike Taylor, Microsoft

* 10:35 Accessing Digital Memory: Yahoo! Search Pad – Vivian Lin Dufour, Yahoo

* 11:00 “How Google Ads Work” – Richard Russell, Google

11:25 – 11:45 COFFEE BREAK

Session 2: (Chair: Andy MacFarlane)

* 11:45 “Location-based services: Positioning, Geocontent and Location-aware Applications” – Dave Mountain, Placr

* 12:10 “Librarians, metadata, and search” – Alan Oliver, Ex Libris

* 12:35 “UI Design Patterns for Search & Information Discovery”- Tony Russell-Rose, Endeca

13:00 – 14:15 LUNCH

Session 3: (Chair: Leif Azzopardi)

* 14:15 “Search-Based Applications: the Maturation of Search” – Greg Grefenstette, Exalead

* 14:40 “How and why you need to calculate the true value of page 1 natural search engine positions” – Gary Jennings, WebOptimiser

* 15:05 “Search as a service with Xapian” – Richard Boulton, Lemur Consulting

15:30 – 16:00 TEA BREAK

Session 4: (Chair: Alex Bailey)

* 16:00 “The Benefits of Taxonomy in Content Management”, Andrew Maisey, Unified Solutions

* 16:25 Panel: “Interactive Information Retrieval” – details to follow

17:00 – 19:00 DRINKS RECEPTION

via BCS IRSG – Search Solutions 2009.

Written by Karen

September 24th, 2009 at 6:42 am

Posted in events,search

ia deliverables


A recent conversation with a friend generated shock (and even a little scorn) that I’d been producing wireframes. I was firmly entreated to sketch instead. Around the same time a recruiter approached me with information on a job that would require detailed annotated UI specs of around 40 pages every fortnight.

The profession is still judged, by and large, by the quality of our documentation. Most recruiters and hiring managers seem more interested in the quality of annotation than the quality of thinking.

I’m rather inconsistent in my approach to documentation. Mostly the medium is picked for the context. Is the project agile? How good are the developers? Is there a remote team? Do lots of people need to be consulted? What are their reading preferences?

Whilst I’m happier with pen and paper than computer, I think it is fair to say that I doodle a good deal more than I sketch. Now there’s always a way to get chickens into a blog post… this little trio were sketched during a conference presentation, presumably a scintillating one, and probably about something 2.0 related given the labelling of the fowl.

Chicken conference doodles

In fact, it appears I doodle most when irritated by the speaker. In this case, rather than asking an insightful question to highlight the clichéd and superficial nature of the argument, I wrote “blog, wisdom of the crowds, whatever”. That told him, I’m sure. I do still want this mug though:

Angry (?) conference doodles

None of this is what my friend had in mind though. She’d like this more: part user journeys, part concept map, but mostly not very pretty. Not really for sharing (apart from with you lot, of course) but it could be re-jigged into something more respectable.

Book discovery sketch

I do these little pages all the time but again they aren’t for collaborative purposes. This one was so I could sanity check we had all the functionality we’d need on the product backlog before the supplier drew up the drawbridge.

Homepage sketch

Then of course, there’s cheating. Those search forms I shared recently were created in Visio but with the sketchy stencil:

E-commerce search forms: scope drop-downs


I very rarely do this kind of documentation anymore. My business stakeholders are bored by them and the developers are best told what to do by pointing over their shoulders.

Wireframe and sitemap

I do still do content models. This kind of specification still gets traction with the developers:


Book content model

But, horror of horrors, a lot of my documentation these days is actually reasonably high-fidelity mock-ups. These are really aimed at the business stakeholders. Colours and fonts are pretty much fixed by our visibility requirements, so the business units know better than to ask for their favourite shade of puce.  And they worry less if they don’t have to try and visualise from wireframes. It doesn’t take me any longer as I’ve got a colour stencil and the choices are pretty limited.

Page mock-up

Is this ironic? I’m working for an organisation of and for blind people and I’m producing the most colourful deliverables ever.  But then you should see the colour of the office floors.

Written by Karen

September 23rd, 2009 at 5:26 pm

Posted in deliverables,ucd

search forms on online shops


I’ve been thinking about the search functionality for our online shop this week. I’ll write up our approach to search properly at a later date, but for now I thought I’d share the variety of search forms I’ve seen on other online shops.

E-commerce search forms: simple boxes

E-commerce search forms: labelled boxes

E-commerce search forms: scope drop-downs

E-commerce search forms: guidance text

Some things of note:

  • The longer search boxes were mostly on book sites.
  • 3 sites also offered “suggestions as you type” (Amazon, Borders, Ocado)
  • Only 1 site had an obvious link to an advanced search
  • All sites handled scopes with a dropdown

(Visio stencil is from GUUUI)

Written by Karen

September 4th, 2009 at 6:34 am

Posted in e-commerce,search

metadata driven websites, via CMS Watch

without comments

There’s a post on the CMS Watch Blog about the challenges of achieving a metadata-driven publishing model:

“The content needs metadata for this to work. Many will tell you that “people won’t tag.” No, seriously, they won’t tag content with the right labels, add the right metadata, or correctly categorize, “even if threatened with being fired.” And even if they do tag, it will be haphazard and inconsistent.

This is a very real problem. But at the same time it’s complete nonsense. Because if this were the case, why would people meticulously tag and file their holiday snapshots on Flickr and Facebook? Somehow, in their spare time, they do identify the people in a picture, add keywords to a shot, give it a meaningful title, and actually describe it. Without having to be threatened with being fired, or even having to be beaten with a stick.

Partly this is because they get the feedback that makes it worth their while to do so. If you identify your friends in a picture on Facebook, they (and then their friends) will immediately find it and start commenting, which creates a positive feedback loop to tag some more. More importantly though, it’s really easy.”

via Trends: Tagging your web content.

Written by Karen

September 2nd, 2009 at 8:46 am

Posted in metadata