Archive for the ‘search’ Category
I’m going to “Innovations in Web and Enterprise Search” at BCS next week
Search Solutions is a special one-day event dedicated to the latest innovations in web and enterprise search. In contrast to other major industry events, Search Solutions aims to be highly interactive and collegial, with attendance limited to 60-80 delegates.
09:30 – 10:00 Registration and coffee
Session 1: (Chair: Tony Russell-Rose)
* 10:00 Introduction – Alan Pollard, BCS President
* 10:10 “Enterprising Search” – Mike Taylor, Microsoft
* 10:35 Accessing Digital Memory: Yahoo! Search Pad – Vivian Lin Dufour, Yahoo
* 11:00 “How Google Ads Work” – Richard Russell, Google
11:25 – 11:45 COFFEE BREAK
Session 2: (Chair: Andy MacFarlane)
* 11:45 “Location-based services: Positioning, Geocontent and Location-aware Applications” – Dave Mountain, Placr
* 12:10 “Librarians, metadata, and search” – Alan Oliver, Ex Libris
* 12:35 “UI Design Patterns for Search & Information Discovery” – Tony Russell-Rose, Endeca
13:00 – 14:15 LUNCH
Session 3: (Chair: Leif Azzopardi)
* 14:15 “Search-Based Applications: the Maturation of Search” – Greg Grefenstette, Exalead
* 14:40 “How and why you need to calculate the true value of page 1 natural search engine positions” – Gary Jennings, WebOptimiser
* 15:05 “Search as a service with Xapian” – Richard Boulton, Lemur Consulting
15:30 – 16:00 TEA BREAK
Session 4: (Chair: Alex Bailey)
* 16:00 “The Benefits of Taxonomy in Content Management”, Andrew Maisey, Unified Solutions
* 16:25 Panel: “Interactive Information Retrieval” – details to follow
17:00 – 19:00 DRINKS RECEPTION
I’ve been thinking about the search functionality for our online shop this week. I’ll write up our approach to search properly at a later date, but for now I thought I’d share the variety of search forms I’ve seen on other online shops.
Some things of note:
- The longer search boxes were mostly on book sites.
- 3 sites also offered “suggestions as you type” (Amazon, Borders, Ocado)
- Only 1 site had an obvious link to an advanced search
- All sites handled scopes with a dropdown
(Visio stencil is from GUUUI)
I often refer back to SEOMoz ranking factors article when I think teams are getting hung up on minor SEO issues.
Will from Distilled just ran a free webinar about the SEOMoz tools so it seemed a good opportunity to learn more about what more is available from SEOMoz.
Will says that SEO tools (some free) give you three things:
- Quick research (basic understanding)
- Deep dive research (actionable insights)
- Making things pretty for boss/client (ever important)
The Pro tools aren’t particularly cheap, so it was useful to have someone talk you through what the return on that investment would actually be. In places the data looks a lot like the stuff you get from your web analytics tool e.g. Google Analytics. But remember this is data on your competitors as well as your own site.
Using AutoTrader as an example, Will talked about
- SEOToolbox: Free tools. Will likes and uses Firefox plugins instead of some of these. Still likes and uses Domain Age tool
- Term Target: free, aggregates data on a given page, identifies keyphrases
- Term Extractor tool: free, uses for competitor and keyword research. 3 word phrases might give you something new.
- Geotarget. Get Listed is an alternative.
- Popular searches. Particularly likes the Amazon content.
- Trifecta. Useful aggregator. But has the comparison of your site to the rest of the web as a whole (possibly unique data).
- CrawlTest: pro-tool. Xenu is an alternative.
- JuicyLinkfinder: finds linking opportunities
- Keyword Difficulty: how hard a keyword is going to be to rank for, regardless of domain.
- Rank Tracker: Will keen to stress that individual keyword ranking isn’t the important thing. Often your boss will demand it. Makes little graphs and will export to CSV. Can combine with analytics data e.g. using Google Analytics API
- Firefox toolbar: Will loves this and uses it more than any other SEO tool. The Pro version is better. Shows some PageRank-esque data for page and domain. Going up one MozRank point is equivalent to being 8x stronger, so the decimal points are important. MozTrust is similar but restricted to links from trusted sites. Is Page Analysis also part of the toolbar? An alternative is Bronco tools.
- Linkscape: the tool SEOMoz are heavily investing in. Web graph of which pages link to each other on the web. Will doesn’t see an alternative to this. Free version does basic stuff. Pro version produces more data and prettier data. Will recommends the Adv Link Intelligence Report. You can get data on who links with “nofollow” which Will thinks is unique data.
- Labs: Online Non-Linear Regression is scary. Visualizing Link Data is more mortal-friendly. Link Acquisition Assistant helps you construct queries for search engines to find link opportunities. Other tools include Social Media Monitoring and Blogscape.
(As a side point, Will recommends learning Excel functions MATCH and LOOKUP. And pivot tables.)
Distilled are going to do more conference calls, including one on keyword research tactics. Could be useful. Free webinars are another useful alternative to conferences when budgets are tight but you need to keep learning.
There are lots of tools that help you choose terms to purchase in PPC campaigns and to target for SEO.
They can also be useful in helping you design navigation, choose your site name and even your company name.
Google provides all sorts of resources, some of which seem to do very similar things.
There are analytics specifically for your own site:
And some that anyone can use:
Of the ‘public’ tools I mostly use the AdWords Keyword Tool, in spite of not using AdWords.
Try searching for ‘phones’. From the results you can see whether ‘cell phone’, ‘wireless phone’ or ‘mobile phone’ is the dominant language in your area. When there are labels that my team is arguing about, I’ll sometimes see if the Keyword Tool can add evidence to the argument.
But beware, they can get addictive.
So you’ve tested your site search. You’ve submitted some bugs. You’ve probably got lots of responses to those bugs along the lines of “oh, that’s just a config setting” , “you don’t understand – that’s a feature of how this product works” and “the search is fine, you just need to get the authors to do their metadata properly”.
Now the config statement is fine. So long as changing the configurations actually sorts the problem. Don’t sit back at this point. Either make the recommended changes yourself or insist the supplier does. Don’t close the bug until they’ve proved the point.
Changes you can usually make to the configuration
- change the crawled pages
- change the indexed fields
- default query syntax
- change stop/noise words, stemming and the thesaurus
- ranking parameters
Be very, very careful if you are changing the ranking parameters. In fact, I’d suggest this is a mini-project in its own right. You’ll need to be able to make one change at a time and compare the new results with the old, across a large set of queries. You probably want to do this with someone who has experience with the specific search engine.
The other two scenarios/excuses are more problematic. If the search has a feature that you think makes the results bad, you’ll need to see if you can get it switched off/removed. If you can’t, you may have chosen the wrong product.
If your supplier thinks that teaching authors to do metadata properly is a simple goal then you may need a new supplier. This is hardly the attitude that made Google the search masters.
(I’m not contradicting my Best Bets post here: I think there are scenarios where properly motivated and focused editorial staff can do a better job than natural search results. But I’m not thinking of your average author, I mean your central web or search team. I mean people paid to care about search.)
You can change the guidelines/training for authors. You can probably get the current batch of authors to listen to some simple tips and pointers. They might remember. They might pass them on. But be realistic: how much control do you have over the authors? Metadata education is often a thankless and futile task. The best solutions are those that don’t require the authors to think about search, whether that is technology or intervention by search specialists.
Where the natural results just aren’t good enough and the authors can’t help there are things you can do on the search results page to help the user out.
Not really about testing but still coming soonish: Changing the interface
So you’ve prepared for testing site search. Now you have to run the tests.
Set aside a reasonable block of time where you won’t be interrupted. Schedule later sessions bearing in mind the crawl timescales. If you make changes you’ll need to wait for the crawl to run before you can test again.
You need content in the system before you can test search. The ideal scenario is to be testing search once a site or system is fully populated with real content but this often isn’t possible. Don’t wait for the system to be populated if that means you won’t be able to make any technical changes.
So allow time for content creation as part of testing. You’ll probably want a mix of real content and dummy content that has been specifically written to test an aspect of search.
You’ll need to record the results so you need a spreadsheet.
- Set up columns something like this: the query (linked if you are running the tests from here), whether the results are ok, a description of the issues, hypotheses about causes, changes or adjustments made to validate, bugs reported, screenshots (where necessary)
- Create new versions of the worksheet each time you test, and label accordingly. If you make changes to the content or the configuration then test again after the crawl has run
- Add queries to the spreadsheet as you go. No matter how good your original lists, you’ll explore other issues as you actually use the system.
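If you'd rather stamp out a fresh worksheet per round with a script than build it by hand, the setup above is a few lines of Python. This is a minimal sketch: the column names and file name are my own suggestions, and a plain CSV stands in for the spreadsheet.

```python
import csv

# Columns mirroring the test log described above -- rename to taste.
COLUMNS = [
    "query",          # the test query (or a link that runs it)
    "results_ok",     # yes/no judgement
    "issues",         # description of what's wrong
    "hypotheses",     # suspected causes
    "changes_made",   # adjustments made to validate a hypothesis
    "bugs_reported",  # bug tracker references
    "screenshot",     # path to a screenshot, where necessary
]

def new_test_round(path, queries):
    """Create a fresh worksheet (CSV) for one round of testing,
    pre-populated with the query list."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for q in queries:
            writer.writerow([q] + [""] * (len(COLUMNS) - 1))

# Label each round's file so you can compare across crawls.
new_test_round("round-1.csv", ["opening hours", "braille", "donate"])
```

Creating a new file per round (round-1.csv, round-2.csv, …) keeps the history the post recommends, and any queries you discover mid-test just get appended as new rows.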
I’m not merely testing. I’m attempting to analyse and resolve the issues. You could argue that I shouldn’t need to do this, I could just log all the issues with the supplier and get them to resolve them. In my experience it is more successful to do as much as possible yourself.
So what does ok mean? Inevitably it is subjective and it is also qualitative. You could compare with benchmarking metrics for the existing site but some part of the testing usually relies on the subjective judgement of the expert tester. Where time for testing is fixed, I raise the bar with different rounds of testing i.e. round one could be focusing on results that are patently unacceptable, with later rounds raising the standard of quality.
(This testing is in no way meant to replace user testing; the intention is more to test that the functionality works as promised and to get the results to the sort of quality that is worth putting in front of test participants!)
Mostly you’ll have no problem spotting bad results. Explaining the bad result is the challenge.
Possible sources of issues
Incomplete crawls. First check the search engine successfully completed a crawl. Testing is easiest if you can check yourself. Otherwise you’ll keep having to nag the suppliers/IT to tell you if the crawl went ok. Ask if there is an interface that shows how the crawl went and ask for access.
What is the default query syntax? This is a simple one to check off. If you thought the search was performing an OR search and it is actually running AND then that might well explain why you aren’t happy with the results. And vice versa.
Documents/pages that shouldn’t be crawled? Pages I’ve seen in the results that shouldn’t have been there include:
- admin pages (in one case the blocked profanity list!)
- permission controlled pages
- quiz answers
- form thank-you page
- user profile information
You may need to get rid of a lot of these pages before you can see the true quality of the results.
Documents/pages that should be crawled?
- other specified domains in addition to your main site e.g. www.rnibcollege.ac.uk as well as www.rnib.org.uk
- all sub-domains e.g. not just www.bbc.co.uk but also jobs.bbc.co.uk and news.bbc.co.uk.
- pages regardless of their position in the site
- Office and other documents
- images, video, audio (depending on how you want these assets to appear)
What is being indexed within a document/page? You can check by creating a variety of dummy content and adding your test keyword to a different field on each piece of dummy content. Choose an unusual keyword that won’t be appearing in the rest of the content (I tend to use my mother’s Polish maiden name). Fields to check:
- meta descriptions and keywords
- main page content
- authors and other metadata relevant to your content set
- navigation and page furniture (you’ll see this cause trouble more when the content set is small)
- full content of Office documents, PDFs etc.?
- metadata attached to multimedia assets
What filters are being applied? Check for:
- stop words
Ask if there is an interface where you can view/edit these filters. If not, ask for copies of the actual files.
What is affecting the ranking? This is complicated to test with any ease as most systems use a variety of factors and there’s usually a level of mystery in the supplier communications. Consider:
- where the keyword appears
- how many times the keyword appears
- the ratio of keywords/article length
- type of document
- links to the document, text of those links, authority/rank of the linking page
If you’ve been told that your search system utilises “previous user behaviour” to adjust ranking then this can make testing a bit tricky. It also gives the suppliers a black box to hide behind if you don’t think the search is working right.
I’ve been told “don’t worry about testing search, this is a learning system”. Which sounds lovely but on day one the search results still need to be good enough to go live and you’re going to have to really work hard to get a grip on how the system is working. And who says it is learning the right lessons? In this particular scenario I doubled the amount of time I had set aside for testing.
Next: Solutions to try
In last week’s post about Best Bets I commented that search software is “certainly not good enough without a lot of work. A lot of expensive work. If your supplier says ‘the search is really good, you don’t need to worry about it’ then you definitely need to worry about it.”
Worrying about and testing search systems has been a common theme in my working life: whether that involves benchmarking the performance of an existing system, testing a new one prior to launch, or comparing vendors when choosing a new system.
I’ve had varying levels of exposure to APR Smartlogik, Google, Inktomi/Yahoo, Fast, Verity, Autonomy, SharePoint. At this moment I’m in the middle of testing and tweaking the search for a SharePoint powered website. The challenges are surprisingly similar to those I encountered when working with Muscat in 2001.
Having gone through such similar processes so many times, now seemed a good time to write it all down. I’ve divided my process into three stages: preparation, running the tests, and making changes.
1. Ask the suppliers lots and lots of questions. You are after actual answers, testing their level of knowledge and letting them know that the quality of the search matters to you. Don’t rely wholly on the supplier’s answers. Find other users and do your own reading to validate what the supplier tells you.
Most important to find out:
- Ranking criteria
- What is configurable? Of those configurations, which have a graphical interface, and of those, which have a user-friendly graphical interface?
Other useful things to find out:
- What query syntax is supported? What is the default syntax?
- What are the stemming rules and which words are stop words? Ask for copies
- Is there a default thesaurus? Ask for a copy
- What will the crawl timescales be during testing?
- How to construct queries using the URL query strings
2. Build a list of test queries. You really need hundreds. Good sources are:
- Names of pages/articles on your current site or items in your catalogue
- Real queries from your search logs or from a similar site if you can find someone willing to share
- Obvious variants of these terms – thesaurus, misspellings, abbreviations
- Known problems – ask for feedback from users
- Include a range of specific items, broad topics and ambiguous queries
Your list could be a simple list of terms but you’ll find it easier to run many rounds of tests if you set your list up as http links that will run the query in your test search engine.
If you are testing multiple search engines and you have access to coding skills then you can set up the list to run automatically across the range of search engines and display your results back to you, saving lots of time. Or if you are running multiple rounds of testing on the same search system, an interface that checks to see if the results have changed since last time is invaluable.
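The change-detection idea can be sketched in a few lines. This is a hedged example, not a full tool: the `q` parameter name is an assumption (substitute your engine's query string), and in practice you'd extract just the result links before fingerprinting, since timestamps and ads will make whole pages differ on every run.

```python
import hashlib
import urllib.parse
import urllib.request

def query_url(base, term):
    """Build the URL that runs one test query.
    The 'q' parameter name is an assumption."""
    return base + "?" + urllib.parse.urlencode({"q": term})

def run_round(base, terms, fetch=None, previous=None):
    """Run every query, fingerprint each results page, and report
    which queries return different results than last round.

    previous is the {term: digest} dict returned by the last round;
    fetch is swappable so the logic can be tested without a network."""
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    previous = previous or {}
    current, changed = {}, []
    for term in terms:
        page = fetch(query_url(base, term))
        digest = hashlib.sha256(page).hexdigest()
        current[term] = digest
        if term in previous and previous[term] != digest:
            changed.append(term)
    return current, changed
```

Save the returned fingerprints (e.g. to a JSON file) between rounds and pass them back in as `previous`; the `changed` list then tells you exactly which queries to re-examine by hand.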
But for most of us, we’ll be working from a list of queries and running them one by one.
Next: Running the tests
I was working on a Best Bets system this week, which is essentially what I did 8 years ago on my first BBC project. It is nice to be working on something straightforward, but I’ve had to do a lot of explaining of the concept. What follows is my advice if you are thinking about adding Best Bets to your search.
What are Best Bets?
Best Bets are essentially editorial picks that appear at the top of the search results. They are a manual intervention for use when the search engine isn’t delivering the best results for the users. Some sites use them to fix just a couple of problematic queries but others have built up extensive databases of thousands of best bets.
You can see examples in Peter Morville’s Best Bets collection on Flickr.
Some search systems have Best Bets functionality as standard (surprisingly SharePoint is one of these) or you can have something bespoke added. The first system I ever worked with was just a basic text file that I edited and uploaded to server – you should be able to get something better than that!
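A text-file system like that first one can be surprisingly little code. A minimal sketch; the pipe-separated format is my own invention, not from any product.

```python
# Best Bets as a plain text file, one bet per line:
#   query | title | url
def load_best_bets(lines):
    """Parse best bets into {normalised query: [(title, url), ...]}."""
    bets = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        query, title, url = (part.strip() for part in line.split("|"))
        bets.setdefault(query.lower(), []).append((title, url))
    return bets

def best_bets_for(bets, query):
    """Editorial picks to show above the natural results, if any."""
    return bets.get(query.strip().lower(), [])

bets = load_best_bets([
    "opening hours | Opening hours | /about/hours",
    "jobs | Work for us | /careers",
    "careers | Work for us | /careers",
])
```

The point of keeping it this simple is that an editor can maintain the file without touching the engine: the search page just checks `best_bets_for()` before rendering the natural results.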
A Bad Idea?
Kas Thomas thinks that we shouldn’t do best bets:
“In point of fact, the search software should do all this for you. After all, that’s its job: to return relevant results (automatically) in response to queries. Why would you sink tens (or hundreds) of thousands of dollars into an enterprise search system only to override it with a manually assembled collection of point-hacks? Sure, search is a hard problem. But if your search system is so poor at delivering relevant results that it can’t figure out what your users need without someone in IT explicitly telling it the answer, maybe you should search for a new search vendor.”
This is the sort of language I expect from the vendors but it is a bit surprising from industry analysts. Yes, the search systems should be good enough. But they’re not. They’re certainly not good enough without a lot of work. A lot of expensive work. If your supplier says “the search is really good, you don’t need to worry about it” then you definitely need to worry about it.
As James Robertson says “No amount of tweaking of metadata or search configuration will… ensure that the most relevant results always appear at the beginning of the list.”
Oh, and IT shouldn’t be managing the Best Bets anyway. In the teams I’ve worked with it has always been an editorial or product management role. After all, why would you build a simple tool to allow editorial intervention and then ask IT to put the content in?
A simple best bets solution that can be maintained by editorial/product teams, rather than scarce technical experts (or worse, expensive consultants), is often a better business solution than battling with the search algorithm to try and get it right for all the scenarios. Particularly on a tight budget.
Other pros for Best Bets:
- Just fixes that problem. It doesn’t change any other results. There’s no mysterious black box that has you banging your head against the desk about why when you changed Property X to fix the results for Query Y the results for Query Z changed like that.
- Fixes the problem straight away. You don’t have to wait for the next crawl or even for an emergency crawl to finish. Sometimes it really is that important. Other times someone else thinks it really is that important and you want them to leave you alone now.
- Buys you time whilst you improve the algorithm.
Managing Best Bets
The critics are however correct that Best Bets have some drawbacks. You have to create and maintain them. If you let the links break then you’ve created a worse user experience than the one you set out to fix.
- Don’t go overboard. Only create them where there are clear problems
- Plan for maintenance time. Who is going to add Best Bets and when? Do they have time to check existing Best Bets?
- Make sure you have access to search logs so you can see what terms users might be having difficulties with
- If possible, set up a broken and redirected link checker to run over the Best Bets
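The link checker in the last point doesn't need to be elaborate. A minimal sketch, with the HTTP fetch swappable so it can be tested offline; the redirect detection here is simplistic (it just compares the final URL with the original).

```python
import urllib.error
import urllib.request

def check_best_bet(url, fetch_status=None):
    """Classify one Best Bet link as 'ok', 'redirected' or 'broken'.

    fetch_status takes a URL and returns (final_url, http_status);
    the default implementation fetches over HTTP."""
    if fetch_status is None:
        def fetch_status(u):
            try:
                with urllib.request.urlopen(u) as resp:
                    return resp.geturl(), resp.getcode()
            except urllib.error.URLError:
                return u, None  # unreachable
    final_url, status = fetch_status(url)
    if status is None or status >= 400:
        return "broken"
    if final_url != url:
        return "redirected"
    return "ok"

def check_all(urls, fetch_status=None):
    """Report every Best Bet that isn't a straightforward success at
    its original URL, so an editor can fix the link."""
    results = {u: check_best_bet(u, fetch_status) for u in urls}
    return {u: verdict for u, verdict in results.items() if verdict != "ok"}
```

Run it on a schedule over the Best Bets file and mail the non-empty report to whoever owns the maintenance time you planned for above.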
And yes, do look at what your Best Bets tell you about the weakness of your search system. If you have the permissions and the skills you may be able to put that knowledge to use in improving the algorithm. But even if you can’t make the changes yourself and there’s no budget for incremental changes (which there often isn’t) then you can at least start building a business case for a search improvement project.
Designing the display
It is tempting to strongly highlight the Best Bets to draw attention to them but this is one area where usability testing tells us a different story.
Users demonstrate a very strong preference for the first ‘ordinary’ looking search result, which is presumably a behaviour they have learnt from web search engines. With search engines any result that is styled slightly differently is probably an ad. Some users didn’t even notice the existence of best bets when we had tried to draw attention to them. This may be a similar situation to banner blindness.
So don’t make a song and dance about it. We might feel the need to tell the user all the effort we’ve put into helping them but ultimately they just want the right result for their query. And they don’t care how it gets to the top of the results, so long as it is at the top of the results.
(Think about it. You’d never highlight a set of the results with a label saying “Brought to you by the IA tweaking the algorithm to weight page title more heavily”)
3 steps to happy Best Bets
- If the system you are buying doesn’t come with a built-in Best Bets system, see if you can get a simple one added on. Think of it as a safety net for once all the developers and project managers have packed up and left you to your own devices.
- Put them at the top of the search results. If you feel the need to style them differently then keep the styling as minimal as possible
- Don’t get carried away and make sure you maintain those links!
Inside the Index and Search Engines is 624 pages of lovely SharePoint search info. It is the sort of book that sets me apart from my colleagues. I was delighted when it arrived, everyone else was sympathetic.
The audience is “administrators” and “developers”. I’m never sure how technical they are imagining when they say “administrators” so I waded in anyway. The book defines topics for administrators as: managing the index file; configuring the end-user experience; managing metadata; search usage reports; configuring BDC applications; monitoring performance; administering protocol handlers and iFilters. I skimmed through the content for developers and found some useful nuggets in there too.
1. Introducing Enterprise Search in SharePoint 2007
2. The End-User Search Experience
3. Customizing the Search User Interface
4. Search Usage Reports
5. Search Administration
6. Indexing and Searching Business Data
7. Search Deployment Considerations
8. Search APIs
9. Advanced Search Engine Topics
10. Searching with Windows SharePoint Services 3.0
The book begins by setting the scene, with lots of fluff about why search matters and some slightly awkward praise for Microsoft’s efforts. It gets much more interesting later, so you can probably skip most of the introduction.
Content I found useful:
Chapter 1. Introducing Enterprise Search in SharePoint 2007
p.28-33 includes a comparison of features for a quick overview of Search Server, Search Server Express and SharePoint Server.
“Queries that are submitted first go through layers of word breakers and stemmers before they are executed against the content index file. Word breaking is a technique for isolating the important words out of the content, and stemmers store the variations on a word” p.32
Keyword query syntax p.44
- maximum query length 1024 characters
- by default is not case sensitive
- defaults to AND queries
- phrase searches can be run with quote marks
- wildcard searching is not supported at the level of keyword syntax search queries. Developers could build this functionality using CONTAINS in the SQL query syntax
- exclude words with a leading minus sign (-)
- you can search for properties e.g. rnib author:loasby
- property searches can include prefix searches e.g. author:loas
- properties are ANDed unless the same property is repeated (which runs as an OR search)
Search URL parameters p.50
- k = keyword query
- s = the scope
- v = sort e.g “&v=date”
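Those three parameters make it easy to build SharePoint search result URLs programmatically, e.g. for a clickable list of test queries. A small sketch; the results page path is illustrative.

```python
from urllib.parse import urlencode

def sharepoint_search_url(results_page, keywords, scope=None, sort_by_date=False):
    """Build a SharePoint search results URL from the parameters above:
    k = keyword query, s = scope, v = sort."""
    params = {"k": keywords}
    if scope:
        params["s"] = scope
    if sort_by_date:
        params["v"] = "date"
    return results_page + "?" + urlencode(params)

# Hypothetical results page path -- substitute your own search centre.
url = sharepoint_search_url(
    "http://intranet/SearchCenter/Pages/results.aspx",
    "rnib author:loasby",
    sort_by_date=True,
)
```

`urlencode` takes care of escaping the colon in property searches and the spaces between keywords, which is easy to get wrong when building these URLs by hand.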
Chapter 4: The Search Usage Reports
Search queries report contains:
- number of queries
- query origin site collections
- number of queries per scope
- query terms
Search results report contains:
- search result destination pages (which URL was clicked by users)
- queries with zero results
- most clicked best bets
- search results with zero best bets
- queries with low clickthrough
Data can be exported to Excel (useful if I need to share the data in an accessible format).
You cannot view data beyond the 30 day data window. The suggested solution is to export every report!
Chapter 5: Search Administration
Can manage the crawl by:
- create content sources
- define crawl rules: exclude content (can use wildcard patterns), follow/noindex, crawl URLs with query strings
- define crawl schedules
- remove unwanted items with immediate effect
- troubleshoot crawls
There’s a useful but off-topic box about file shares vs. SharePoint on p.225
Crawler can discover metadata from:
- file properties e.g. name, extension, date and size
- additional Microsoft Office properties
- SharePoint list columns
- meta tags in HTML
- Email subject and to fields
- User profile properties
You can view the list of crawled properties via the Metadata Property Mappings link in the Configure Search Settings page. The Included In Index column indicates whether the property is searchable.
Managed properties can be:
- exposed in advanced search and in query syntax
- displayed in search results
- used in search scope rules
- used in custom relevancy ranking
Adjusting the weight of properties in ranking is not an admin interface task and can only be done via the programming interface.
High Confidence Results: A different (more detailed?) result for results that the search engine believes are an exact match for the query.
Authoritative pages
- sites central to high-priority business processes should be authoritative
- sites that encourage collaboration and actions should be authoritative
- external sites should not be authoritative
Thesaurus
- an XML file on the server with no admin interface
- no need to include stemming variations
- different language thesauri exist. The one used depends on the language specified by client apps sending requests
- tseng.xml and tsenu.xml
Noise words p.294
- language specific plain text files, in the same directory as the thesaurus
- for US english the file name is noiseenu.txt
- off by default
Chapter 8 – Search APIs
Mostly too technical but buried in the middle of chapter 8 are the ranking parameters:
- saturation constant for term frequency
- saturation constant for click distance
- weight of click distance for calculating relevance
- saturation constant for URL depth
- weight of URL depth for calculating relevance
- weight for ranking applied to non-default language
- weight of HTML, XML and TXT content type
- weight of document content types (Word, PP, Excel and Outlook)
- weight of list items content types
They’ll come in handy when I’m baffled by some of the random ranking decisions that SP has made.
Chapter 9 – Advanced Search Engine Topics
I skipped through most of this but it does cover the Codeplex Faceted Search on p.574-585
A good percentage of the book was valuable to a non-developer, particularly one who is happy to skip over chunks of code. I’ve seen and heard a lot of waffle about what SharePoint search does and doesn’t do, so it was great to get some solid answers.
Inside the Index and Search Engines: Microsoft® Office SharePoint® Server 2007
Where search analytics is concerned it appears the RNIB is actually doing what everyone else is doing i.e. using Google Analytics:
“The use of Google Analytics is very much on the increase. Just under a quarter of responding organisations (23%) now use Google Analytics exclusively compared to only 14% a year ago.
A further 57% of respondents are using Google Analytics in conjunction with another tool (up from 52% in 2008), which means that 80% of companies are now using Google for analytics compared to 66% last year…
The majority of responding companies believe that they have set up Google Analytics properly.
There is more doubt among those who do not use Google exclusively, with 23% of these respondents saying they don’t know if it has been properly configured”
And I’m firmly in the latter 46% camp these days:
“since 2008 there has been an increase from 8% to 15% of companies who have two dedicated web analysts and a decrease in the proportion of companies who have one analyst (from 32% to 26%).
But while this is a positive development, it can also be seen that exactly the same proportion of companies (46%) report that they do not have any web analysts.”