Archive for the ‘search’ Category
There are lots of useful interface elements you can add to a search results page, but the general wisdom seems to be you shouldn’t put too many of them together.
Christies, the auction house, don’t seem to care for this perceived wisdom. Their search results are power searching heaven. I counted 12 filters (one of which you can search within), 4 sorts and 3 options for displaying the results.
So is this design overkill or understanding your audience? I’d have no qualms about including this number of controls for an audience of professional researchers but would usually avoid it for a general audience.
I’m not hugely knowledgeable about the audience. For all its complexity, Christies looks far more professional and considered than most antiques websites, especially sellingantiques.co.uk.
It’s easy to imagine that the audience is very particular about the item they are looking for. If you’re a collector then being able to locate Louis XVI chairs rather than Louis XV chairs, or find an Arts and Crafts lamp rather than an Art Deco one, is a fundamental part of the experience.
I enjoyed using it but I’d be fascinated to know how typical I am of this audience. Anyone done any research?
The bane of my life when trying to work out what’s gone wrong with a search engine is the hidden thesaurus.
Lots of search software comes with a thesaurus that is referred to ‘behind the scenes’ to expand queries to include other queries that are *known to be equivalent*. Anyone who has spent even a short amount of time thinking about language can see why these things might become a problem.
(these files are doubly irritating as they’re usually set up without any kind of admin interface…the assumption being only the system administrator would or should edit it…and that they will of course be technical)
The expansion happens behind the scenes and the user isn’t necessarily told it has happened. This is usually bad. You need to be really, really wary of expanding the user’s search queries without telling them. Don’t just give them results for aubergine and results for eggplant, when they only searched for aubergine. You think you are being clever and helpful. If you’re wrong about the expansion then you are just being extremely irritating.
Or possibly worse than irritating.
I read a comment on the Guardian recently that suggested hor = mum in Danish. I thought that was wrong and searched for “hor mum” in Google. It wasn’t my most thought through search query but I didn’t expect Google to automatically convert it into “hot mum”. That was a bit of a surprising set of results.
(the word the commenter had misheard was mor)
This Google example demonstrates how you can end up with a worse situation than the user simply not getting the results they were looking for. But it is also different from the thesaurus examples that I started this post talking about. Google do at least tell you what they’ve done and allow you to correct them. Given how uncertain query expansion is, best practice must be to tell the users what you’ve done.
If you tell the users, you have two choices about how to tell them:
a) Suggest the expansion but don’t run it for them. Risks them missing it as an option.
b) Run the expansion but tell them you’ve done it. Still risks them missing the option to un-do
Google’s experimented with both approaches over the years. And currently has a bit of a mixed approach. Don’t assume their approach has “cracked” the problem.
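The two options can be sketched in a few lines. This is only an illustration, not how Google or any particular product does it: the THESAURUS table, the function names and the wording of the notices are all made up for the example.

```python
# Hypothetical thesaurus: each term maps to terms treated as equivalent.
THESAURUS = {"aubergine": ["eggplant"], "cockerel": ["rooster"]}


def expand(query):
    """Return the extra terms the thesaurus would add to this query."""
    extra = []
    for word in query.lower().split():
        extra.extend(THESAURUS.get(word, []))
    return extra


def plan_search(query, mode):
    """Decide which terms to run and what to tell the user.

    mode "suggest": run the query as typed and offer the expansion (option a).
    mode "run":     run the expanded query and offer a way back (option b).
    """
    extra = expand(query)
    if not extra:
        return {"run": query.split(), "notice": None}
    if mode == "suggest":
        return {"run": query.split(), "notice": "Also try: " + " ".join(extra)}
    if mode == "run":
        notice = ("Including results for " + ", ".join(extra)
                  + ". Search only for: " + query)
        return {"run": query.split() + extra, "notice": notice}
    raise ValueError("mode must be 'suggest' or 'run'")


print(plan_search("aubergine", "suggest"))
print(plan_search("aubergine", "run"))
```

Either way the notice is part of the result, so the interface always has something to show. Whether the user actually spots it is the part the code can’t solve.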
This is the least visible thing, that you might not consider a feature, that mostly gets ignored and is absolutely the most important thing for you to dedicate time to getting right.
If the query isn’t particularly ambiguous then you need the top results to be right, without asking the searcher to do much else.
Ranking isn’t sexy and it takes care and attention. But it isn’t magic, it’s just rules. Ask what the rules are. Don’t be fobbed off. If no-one knows, work it out yourself.
2. Manual Suggestions (query expansion/narrowing)
This basically means Best Bets.
I’m very, very attached to Best Bets. This is mostly because I’ve been a search product manager as well as an IA on search re-design projects. Once the project team has packed up, the product manager (or web manager/editor) can still improve results and resolve problems using Best Bets. And they will need to. Promise.
3. Automated Suggestions (query expansion/narrowing)
We can’t spell and we can’t type. And then we blame the poor old search engine when it doesn’t find what we were looking for.
Any decent search solution needs to have some solution to misspellings (where to put them is a problem for another day!). You can do some of this with Best Bets, but with a big and diverse enough set of users you’ll probably need something a bit more automatic like Google’s Did You Mean?
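As a rough idea of what “a bit more automatic” might look like, here is a minimal sketch using the fuzzy matching in Python’s standard difflib module. The vocabulary list and the 0.8 cutoff are assumptions for the example; a real system would more likely build its vocabulary from the index or from query logs, and this is not a description of how Google’s feature works.

```python
import difflib

# Hypothetical vocabulary of terms we know produce good results.
VOCABULARY = ["magnifier", "braille", "big button phone", "large print books"]


def did_you_mean(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest known term, or None if the query is already
    known or nothing is similar enough to be worth suggesting."""
    query = query.lower().strip()
    if query in vocabulary:
        return None
    matches = difflib.get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(did_you_mean("magnifer"))
```

A higher cutoff means fewer suggestions, and fewer silly ones. It returns a suggestion rather than re-running the search, which leaves the choice with the user.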
A related but broader concept is suggesting related searches. You might have spelt your query correctly but there’s a similar term that would get you better results. Ask.com used to do this.
It might seem perverse to prioritise the manual intervention over the automated one. I’d usually expect to have both but I have a few reasons for picking manual if it comes to a choice:
- the manual option is probably cheaper to add on if neither comes as standard
- automated suggestions often get better over time but might start a bit ropy
- automated suggestions may be ‘black-box’: you might not be able to do anything with them if they are wrong/misleading. And every system I’ve worked with and/or used makes mistakes sometimes.
It’s worth asking whether there is any control over the automated suggestions. Is there a dictionary? Is it in the right language (esp. UK v US English)? Can we edit it? How?
4. Filters and sort options (after you’ve got search results)
These tend to get missed by users or interfere with their understanding of the page. Not all users will understand them, especially complex faceted filters. The positioning of filters/facets is very difficult to get right. Users home in on the top results, so above the first result is most likely to get noticed and also most likely to get noticed for being in an annoying position.
If you are doing product search then I’d probably still prioritise 1-3 but I’d strongly argue you need 4 as well.
5. Clever query language
Quote marks seem to be reasonably widely understood, so I might argue these should be higher up your expectation list.
But unless you’ll have access to your users and be able to train them all… I wouldn’t prioritise operators like wildcards, NOT/AND/OR etc.
Find out what you get out of the box. Make that information available to interested users. But don’t invest lots of development effort and money here.
6. Filters and sort options (before you run the search)
a) Radio buttons and drop-downs. These get missed, people don’t think about using them, they tend to just stick words in and hit go. Other users won’t use them because they don’t know they need to use them until they see the search results aren’t focused enough. So then they have to go backwards. So you might as well go with (4).
If you can sensibly default them then they can be more useful, but establishing what the sensible default is can be problematic.
b) Advanced search pages.
These are basically a collection of filters for the user to set before you run the search. Search specialists inevitably find advanced search useful but your average end-user doesn’t. The exception here is power users but be sure the users actually are “power” users. You are likely to find power users where there are time/cost pressures around searching e.g. staff answering customer calls or researchers using databases where they pay for searches. In these situations even reasonably techno-phobic users are motivated to get to grips with advanced searches including some of the more complex query building ones.
Another reason advanced search might be worthwhile is if your power users are also your most mouthy. If the segment of your audience that blogs/tweets is also the segment that might demand power features then you might consider the feature as marketing.
(Don’t be worried by people being intimidated by the label “advanced”. If they are intimidated by the word then they’ll be intimidated by the features.)
SharePoint search allows you to create Best Bets. They can be created by the Site Collection administrator.
If you go to Site Settings, you should see ‘Search Keywords’ under the Site Collection Administration heading. If you don’t see it you probably haven’t got the right permissions.
You create a keyword, associate some synonyms with it and then add one or more Best Bet links. You can set it to expire and/or be reviewed.
Keyword: The search term that will generate the Best Bets and also is displayed above the Best Bet e.g. PenFriend
Synonym: Other search terms that will also generate the Best Bet. These aren’t displayed e.g. Pen Friend
Best Bets: The editorially picked search result e.g. Penfriend Audio Labeller
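The relationship between the three is easier to see as data. This is a hypothetical in-memory model, not SharePoint’s storage or API, and it assumes matching ignores case and extra spaces, which I haven’t verified against SharePoint itself.

```python
# Hypothetical model of one keyword entry, using the example above.
KEYWORDS = [
    {
        "keyword": "PenFriend",
        "synonyms": ["Pen Friend"],
        "best_bets": ["Penfriend Audio Labeller"],
    },
]


def normalise(term):
    """Lower-case and collapse whitespace so 'Pen  Friend' == 'pen friend'."""
    return " ".join(term.lower().split())


def best_bets_for(query, keywords=KEYWORDS):
    """Return (keyword to display, best bets) if the query is a keyword
    or one of its synonyms; otherwise (None, [])."""
    wanted = normalise(query)
    for entry in keywords:
        triggers = [entry["keyword"]] + entry["synonyms"]
        if wanted in (normalise(t) for t in triggers):
            return entry["keyword"], entry["best_bets"]
    return None, []


print(best_bets_for("pen friend"))
```

Note that the synonym triggers the Best Bet, but it is the keyword that comes back for display.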
I can’t for the life of me figure out how to delete a keyword (Best Bet, yes. Keyword, no). Maybe it’s a permission thing again.
This article is part of a series about our e-commerce redesign.
Analysing your search referrals only tells you about the traffic you were successful in attracting. Even if you are getting lots of traffic for a particular keyword, that might be a tiny fraction of the number of people searching for that keyword. And the referrers say nothing about what you missed out on completely.
So it helps to look at search engine traffic for keywords in the kind of space your website sits in. The free tools like Google AdWords keyword tool have generated lots of debate about how useful they are but I tend to see them as worth a look if you’re just looking for rough ideas about language and relative popularity.
With our shop research, I didn’t get much data for easy to see, easy to read, giant print, big print, canes, liquid level indicators, and (my favourite) bumpons. I couldn’t find information about Moon (the alphabet) because it was drowned by references to the satellite and all the other things called moon.
What I’ve learnt:
Generally people refer to concrete properties of the product rather than their condition. So it is ‘big button phone’ rather than ‘easy to see phone’ or ‘low vision phone’.
Singular is much more important than plural for objects like clocks and watches but the opposite is true for book formats e.g. large print books. Which is kind of obvious…you only want one watch but you may want many books. This might have a bit of an effect on our labelling policy, but not much as Google doesn’t seem to make a huge deal about singular versus plural.
There’s clearly a big opportunity around low vision products. The interest in products for blind people (like Braille) is less significant, which makes perfect sense when you compare the size of the audiences.
And loads of people are interested in magnifiers.
SharePoint search features are managed at 3 levels
- Farm level (configure the search service, configure crawler timeout settings etc…)
- SSP (Shared Services Provider) level
- Site collection level
The SSP functions are accessed via the Shared Services Administration.
SSP search functions:
- add sources to the crawl
- block URLs and URL patterns from the crawl
- define crawl schedules
- inspect crawl logs and troubleshoot crawls
- emergency removal of items
- install IFilters to support non-default file types
- add/remove file types from the crawl
- specify authoritative pages
- create scopes for all site collections (you can also create at a site collection level)
And in theory specify noise words and create a custom thesaurus. See Inside the Index and Search Engines chapter 5 for more.
You can by default index these types of content source:
- SharePoint sites
- Non-SharePoint websites
- Windows file shares
- Microsoft Exchange Server public folders (you can index exchange mailboxes with a 3rd party add-on)
- Full crawl: indexes all content
- Incremental crawl: only accesses content that has been updated since last crawl. Faster, but slow if accessing an external website
- Crawl schedules can be specified for each content source
- Crawls should be scheduled for low usage times
- content can be excluded by defining a rule
- rules are applied in the specified order so you usually need to move exclude rules in front of include rules.
- a URL can be excluded by adding it as an exclude rule
- URL patterns can also be excluded and help keep the management of rules neat e.g. http://www.bbc.co.uk/* or http://www.amazon.co.uk/*/dp/*
- Exclude rules will remove any matched URLs during the next crawl
- If you need to remove a URL in an emergency you do this via “Search Result Removal” instead
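Why the order of rules matters is easiest to see with a toy version. This sketch assumes “first matching rule wins” and uses Python’s fnmatch wildcards as a stand-in for the pattern syntax; it is not SharePoint’s actual matching code, and the rule list and default are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical rule list, checked top to bottom; the first match wins.
RULES = [
    ("exclude", "http://www.amazon.co.uk/*/dp/*"),
    ("include", "http://www.amazon.co.uk/*"),
]


def should_crawl(url, rules=RULES, default=True):
    """Return True if the first rule matching the URL is an include rule.
    URLs that match no rule fall back to the default."""
    for action, pattern in rules:
        if fnmatchcase(url, pattern):
            return action == "include"
    return default


product_page = "http://www.amazon.co.uk/some-book/dp/123"
print(should_crawl(product_page))
print(should_crawl(product_page, rules=list(reversed(RULES))))
```

With the exclude rule first, the product page is skipped. Reverse the list and the broader include rule matches first, so the exclude rule never gets a look in.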
Now all search engines struggle, to varying degrees, with the knotty mess that is natural language. But they generally don’t get called rubbish for not succeeding with the meaty search challenges.
Rubbish search engines are the ones that can’t seem to answer the most basic requests in a sensible manner. These are ones that get mocked as “random link generators”, the gibbering wrecks of their breed.
Go to Homebase and search for “rabbit hutch” (we need another one as two of our girls are about to produce heaps of bunnies at the same time).
The first result is “Small plastic pet carrier”. There’s a number of other carriers and cages. Then there’s a “Beech Finish Small Corner Desk with Hutch”. Finally there’s a Pentland Rabbit Hutch at result #8. This is a rubbish set of results. I asked for “rabbit hutch” and they’ve got a rabbit hutch to sell me but they’re showing me pet carriers and beech finish corner desks.
This is a rubbish set of results. But it doesn’t mean the search engine is rubbish.
Somebody made a rubbish decision. They’ve set it up shonky.
So before you reach for the million pound enterprise search project, try having a quick look under the bonnet with a spanner.
Is it AND or OR?
This is reasonably easy to test, if you can’t ask someone who knows.
Pick a word that will be rare on your site and another word that doesn’t appear with the rare one e.g. “Topaz form” for my intranet. A rare word is one that should only appear one or two times in the entire dataset so you can check that the other word doesn’t appear with it. You may need to be a bit imaginative but unique things like product codes can be helpful here. If the query returns no results you’ve probably got an AND search. More than a couple of results (and ones that don’t mention Topaz) and you’ve probably got OR.
(this can get messed up if there is query expansion going on, but hopefully the rare word isn’t one that any of the query expansion rules will act on).
AND is more likely to be problematic as a setting. You’ll get lots of “no results”. You’ll need your users to be super precise with their terminology and spell every word right. If they are looking for “holiday form” and the form is called “annual leave form” they’ll get no results.
OR will generate lots of results. This is ok if the sort order is sensible. Very few people care that Google returned 2,009,990 results for their query. They just care that the first result is spot-on.
So most of the time you probably want an OR set-up.
(preferably combined with support for phrase searching so the users can choose to put their searches in nice speech marks to run an AND search if they want to and know how to).
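The difference is small in code and large in effect. A minimal sketch, with three invented intranet documents and simple whole-word matching (no ranking, no phrase support):

```python
# Hypothetical intranet documents: title -> searchable text.
DOCUMENTS = {
    "Annual leave form": "annual leave form request time off",
    "Expenses form": "expenses form claim travel costs",
    "Topaz user guide": "topaz user guide",
}


def search(query, mode="OR", documents=DOCUMENTS):
    """Return titles matching ALL query words (AND) or ANY of them (OR)."""
    terms = query.lower().split()
    combine = all if mode == "AND" else any
    return [
        title
        for title, text in documents.items()
        if combine(term in text.split() for term in terms)
    ]


print(search("holiday form", "AND"))
print(search("holiday form", "OR"))
```

Under AND, “holiday form” finds nothing because no document contains “holiday”. Under OR it finds both forms, and it is then the sort order’s job to put the leave form near the top.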
Is there crazy stemming/query expansion going on?
Query expansion is search systems trying to be clever, often getting it wrong and not telling you what they’ve done so you can unpick it. Basically the search system is taking the words you gave it and giving you results for those words, plus some others that it thinks are relevant or related.
Typical types of expansion are stemming (expand a search for fish to include fishes and fishing), misspellings and synonyms (expand a search for cockerel to include rooster).
This is probably what is happening if you are getting results that don’t seem to include the words you searched for anywhere on the page (although metadata is another option).
Now this stuff can be really, really helpful. If it is any good.
Have you got smart sophisticated query expansion like Google? Or does it do silly (from a day-to-day not a Latin perspective) stemming like equating animation with animals? If it is the silly version then definitely switch it off (or tweak it if you can).
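To show how the silly version can come about, here are two deliberately crude stemmers. Neither is taken from a real product: one strips a few common English endings, the other just keeps the first four letters, which is my guess at the sort of shortcut that ends up equating animation with animals.

```python
def suffix_stem(word):
    """Strip a few common English endings (a deliberately tiny rule set)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def truncate_stem(word, length=4):
    """Crude 'stemming' that keeps only the first few letters."""
    return word[:length]


for stem in (suffix_stem, truncate_stem):
    print(stem.__name__,
          [stem(w) for w in ("fish", "fishes", "fishing")],
          stem("animation") == stem("animals"))
```

Both treat fish, fishes and fishing as the same word. Only the truncating one also decides that animation and animals are the same thing.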
Even if you’ve got smart expansion options available, it’s generally best practice to either give the user the option of running the expanded (or alternate) query, or at the very least of undoing it if you’ve got it wrong. They won’t always spot the options (Google puts lots of effort into coming up with the right way of doing this) but it’s bad search engine etiquette to force your query on a user.
Is the sort order sensible?
That Homebase example. The main problem here is sorting by price low-high. That’d be fine (actually very considerate of Homebase) if I’d navigated to a category full of rabbit hutches. But I didn’t. I searched for rabbit hutches and got a mixed bag of results that included plenty of things that a small child could tell you aren’t rabbit hutches.
The solution? Sort by relevancy.
I’ve seen quite a lot of bad search set-ups recently where the sort order was set to alphabetical. Why? Unless, as Martin said when I bemoaned this on Twitter, your main use case is “to enable people to find stuff about aardvarks”.
News sites sometimes go with most recent as the sort order. Kinda makes sense but you need to be sure the top results are still relevant not just recent.
Interestingly sort order doesn’t matter so much if you’ve gone for AND searches and you haven’t got any query expansion going on. If you’re pretty sure that everything in the result set is relevant, then you’ve got more freedom over sort order. If not, stick with relevancy.
(I don’t need to tell you that you want relevancy sorted high-low, do I?)
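Here is the Homebase problem in miniature. The product titles come from the example above; the prices, the description text and the scoring rule (2 points per query word in the title, 1 per word elsewhere) are all invented, and real relevancy ranking is a good deal more involved.

```python
# Titles from the example; prices and description text are made up.
PRODUCTS = [
    {"title": "Small plastic pet carrier", "text": "suitable for a rabbit", "price": 8.99},
    {"title": "Beech Finish Small Corner Desk with Hutch", "text": "", "price": 49.99},
    {"title": "Pentland Rabbit Hutch", "text": "", "price": 79.99},
]


def relevance(product, terms):
    """Score 2 per query term in the title, 1 per term in the other text."""
    title = product["title"].lower().split()
    text = product["text"].lower().split()
    return sum(2 * (t in title) + (t in text) for t in terms)


def search_products(query, sort="relevance", products=PRODUCTS):
    """OR search over title and text, sorted by relevance or by price."""
    terms = query.lower().split()
    hits = [p for p in products if relevance(p, terms) > 0]
    if sort == "price":
        hits.sort(key=lambda p: p["price"])  # low-high
    else:
        hits.sort(key=lambda p: relevance(p, terms), reverse=True)  # high-low
    return [p["title"] for p in hits]


print(search_products("rabbit hutch", sort="price"))
print(search_products("rabbit hutch", sort="relevance"))
```

Same three hits both times. Sorted by price the pet carrier comes first; sorted by relevancy the actual rabbit hutch does.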
So people stop giving me grief over navigation. Let’s talk about that rubbish search engine you’ve got. I could probably fix that for you.
Last week I went to the Search Solutions event, held by BCS in their lovely office in Southampton Street. There were maybe 50 people, 6 or 7 women and seemingly even fewer laptops (which rather made it stand out from the more web-focused events I usually attend – because of lack of laptops not the male-female ratio).
I didn’t make masses of notes but I did capture a few points and reminders:
Vivian Lin Dufour from Yahoo talked about Search Pad, an attempt to make search more “stateful”.
Richard Russell from Google explained how the auctions for Google Ads work. Always interesting to hear more about the money side of things.
Dave Mountain, a geographer (another example of Nominative Determinism?) talked about geographical aspects of searching. He explained that if the task is “finding the nearest cafe”, then the ‘near’ isn’t a simple statement. There are types of near: as the crow flies, in travel time, in the direction I’m already going. After all you may not be interested in a cafe that’s already 5 miles behind you on the motorway. He had some good slides covering this, so hopefully they’ll be made available.
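Two of those kinds of near are simple enough to sketch. This is my own illustration, not anything from Dave’s slides: straight-line distance uses the standard haversine formula, and “in the direction I’m already going” is approximated as “within 90 degrees of my heading”. Travel time needs a road network, so it is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0


def crow_flies_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2 (0 = north)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(p2)
    x = (math.cos(p1) * math.sin(p2)
         - math.sin(p1) * math.cos(p2) * math.cos(dlam))
    return math.degrees(math.atan2(y, x)) % 360


def is_ahead(position, heading_deg, place):
    """True if the place lies within 90 degrees of the direction of travel."""
    diff = abs(bearing_deg(*position, *place) - heading_deg) % 360
    return min(diff, 360 - diff) < 90


me = (51.0, 0.0)  # hypothetical position, heading due north
print(is_ahead(me, 0, (51.1, 0.0)), is_ahead(me, 0, (50.9, 0.0)))
```

The two cafes are the same distance away as the crow flies, but only one of them is ahead of me.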
Tony Russell-Rose discussed Endeca’s impending pattern library. Should be interesting – public version to be available in the new year.
David White of Web Optimiser talked amongst other things about the importance of cross-media optimisation. He asked why don’t more companies, especially b2b ones, have phone numbers in title/description of search results? He also touched on the growth of twitter as a substantial source of referrals (in response to a question about whether Bing was increasing referrals and thus changing optimisation tactics).
Richard Boulton, as well as discussing his efforts with open source search, introduced us to the marvellous concept of dev/fort/.
“Imagine a place of no distractions, no IM, no Twitter — in fact, no internet. Within, a group of a dozen or more developers, designers, thinkers and doers. And a lot of a food.
Now imagine that place is a fort.”
Well marvellous to me but I wanted to get married in a Napoleonic fort so perhaps I’m not typical. He also mentioned searchevent.org, a day dedicated to open source search systems, which will hopefully happen again sometime.
Andrew Maisey talked about a school of thought that search will increasingly become less important on the site. Dynamic user journeys will encourage more browsing.
(Food was pretty good as usual for the venue. I’m hoping that we’re going back to BCS for our team away-day later in the year and then I can have more of the strawberry tarts.)
Surprisingly this white paper on building multilingual solutions in SharePoint provides a good overview of how the search works, regardless of whether you are interested in the multilingual aspect.
Read page 15, titled “overview of the language features in search” for a description of content crawling and search query extraction. Then 16-18 provide a good overview of individual features and what they are doing.
Word breakers A word breaker is a component used by the query and index engines to break compound words and phrases into individual words or tokens. If there is no word breaker for a specific language, the neutral word breaker is used, in which case word breaking occurs where there are white spaces between the words and phrases. At indexing time, if there is any locale information associated with the document (for example, a Word document contains locale information for each text chunk), the index engine will try to use the word breaker for that locale. If the document does not contain any locale information, the user locale of the computer the indexer is installed on is used instead. At query time, the locale (HTTP_ACCEPT_LANGUAGE) of the browser from which the query was sent is used to perform word breaking on the query. Additional information about the language availability of the word breaker component is available in Appendix B: Search Language Considerations.
Stemming Stemming is a feature of the word breaker component used only by the query engine to determine where the word boundaries are in the stream of characters in the query. A stemmer extracts the root form of a given word. For example, ”running,” ”ran,” and ”runner“ are all variants of the verb ”to run.” In some languages, a stemmer expands the root form of a word to alternate forms. Stemming is turned off by default. Stemmers are available only for languages that have morphological expansion; this means that, for languages where stemmers are not available, turning on this feature in the Search Result Page (CoreResult Web Part) will not have any effect. Additional information about language availability for the Stemmer feature is available in Appendix B: Search Language Considerations.
Noise words dictionary Noise words are words that do not add value to a query, such as ”and,” ”the,” and ”a.” The indexing engine filters them to save index space and to increase performance. Noise word files are customizable, language-specific text files. These files are a simple list of words, one per line. If a noise word file is changed, you must perform a full update of the index to incorporate the changes. Additional information about the noise words dictionary and how to customize it is available at www.microsoft.com.
Custom dictionary The custom dictionary file contains values that the search server must include at index and query times. Custom dictionary lists are customizable, language-specific text files. These files are used by Search in both the index and query processes to identify exceptions to the noise word dictionaries. A word such as “AT&T,” for example, will never be indexed by default because the word breaker breaks it into single noise words. To avoid this, the user can add ”AT&T” to the custom dictionary file; as result, this word will be treated as an exception by the word breaker and will be indexed and queried. These files contain a simple list of words, one per line. If the custom dictionary file is changed, you must perform a full update of the index to incorporate the changes. By default, no custom dictionary file is installed during Office SharePoint Server 2007 Setup. Additional information about the custom dictionary file and how to customize it is available at www.microsoft.com.
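The interaction between word breaking, noise words and the custom dictionary can be mimicked in a few lines. This is a toy under my own assumptions, not SharePoint’s implementation: the noise word list is invented (including “at” and “t” so that the AT&T example plays out as described), and the word breaker simply splits on anything that isn’t a letter or digit.

```python
import re

# Hypothetical noise word list; the real files are per-language text files.
NOISE_WORDS = {"a", "and", "the", "at", "t"}


def index_terms(text, custom_dictionary=frozenset()):
    """Break text into terms, dropping noise words but keeping any
    chunk that appears in the custom dictionary intact."""
    terms = []
    for chunk in text.lower().split():
        chunk = chunk.strip(".,;:!?\"'()")
        if chunk in custom_dictionary:
            terms.append(chunk)  # protected: not broken up or filtered
            continue
        for word in re.findall(r"[a-z0-9]+", chunk):
            if word not in NOISE_WORDS:
                terms.append(word)
    return terms


sentence = "The AT&T contract and the invoice"
print(index_terms(sentence))
print(index_terms(sentence, custom_dictionary={"at&t"}))
```

Without the custom dictionary, AT&T is broken into “at” and “t”, both of which are thrown away as noise. With it, the term survives to be indexed.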
Thesaurus There is a configurable thesaurus file for each language that Search supports. Using the thesaurus, you can specify synonyms for words and also automatically replace words in a query with other words that you specify. The thesaurus used will always be in the language of the query, not necessarily the server’s user locale. If a language-specific thesaurus is not available, a neutral thesaurus (tseneu.xml) is used. Additional information about the thesaurus file and how to customize it is available at www.microsoft.com.
Language Auto Detection The Language Auto Detection (LAD) feature generates a best guess about the language of a text chunk based on the Unicode range and other language patterns. Basically, it’s used for relevance calculation by the index engine and in queries sent from the Advanced Search Web Part, where the user is able to specify constraints on the language of the documents returned by a query.
Did You Mean? The Did You Mean? feature is used by the query engine to catch possible spelling errors and to provide suggestions for queries. The Did You Mean? feature builds suggestions by using three components:
· Query log Information tracked in the query log includes the query terms used, when the search results were returned for search queries, and the pages that were viewed from search results. This search usage data helps you understand how people are using search and what information they are seeking. You can use this data to help determine how to improve the search experience for users.
· Dictionary lexicon A dictionary of most-used lexicons provided at installation time.
· Custom lexicon A collection of the most frequently occurring words in the corpus, built at query time by the query engine from indexed information.
The Did You Mean? suggestions are available only for English, French, German, and Spanish.
Definition Extraction The Definition Extraction feature finds definitions for candidate terms and identifies acronyms and their expansions by examining the grammatical structure of sentences that have been indexed (for example, NASA, radar, modem, and so on). It is only available for English.