Archive for the ‘sharepoint’ Category
SharePoint search allows you to create Best Bets. They can be created by the Site Collection administrator.
If you go to Site Settings, you should see ‘Search Keywords’ under the Site Collection Administration heading. If you don’t see it you probably haven’t got the right permissions.
You create a keyword, associate some synonyms with it and then add one or more Best Bet links. You can set it to expire and/or be reviewed.
Keyword: The search term that will generate the Best Bets and also is displayed above the Best Bet e.g. PenFriend
Synonym: Other search terms that will also generate the Best Bet. These aren’t displayed e.g. Pen Friend
Best Bets: The editorially picked search result e.g. Penfriend Audio Labeller
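To make the relationship between keyword, synonym and Best Bet concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not SharePoint's own API: the `Keyword` class and `find_best_bets` function are made-up names, and the case-insensitive exact match is an assumption about how the lookup behaves.

```python
from dataclasses import dataclass, field


@dataclass
class Keyword:
    term: str                                       # shown above the Best Bet
    synonyms: list = field(default_factory=list)    # also trigger it, never shown
    best_bets: list = field(default_factory=list)   # editorially picked results


def find_best_bets(query, keywords):
    """Return (displayed keyword, best bets) when the query matches a keyword or synonym."""
    q = query.strip().lower()
    for kw in keywords:
        triggers = [kw.term] + kw.synonyms
        if q in (t.lower() for t in triggers):
            return kw.term, kw.best_bets
    return None, []


KEYWORDS = [Keyword("PenFriend", ["Pen Friend"], ["Penfriend Audio Labeller"])]
```

Searching for "pen friend" would return the PenFriend keyword and its Best Bet, while the synonym itself is never displayed.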
I can’t for the life of me figure out how to delete a keyword (Best Bet, yes. Keyword, no). Maybe it’s a permission thing again.
SharePoint search features are managed at 3 levels:
- Farm level (configure the search service, configure crawler timeout settings etc…)
- SSP (Shared Services Provider) level
- Site collection level
The SSP functions are accessed via the Shared Services Administration.
SSP search functions:
- add sources to the crawl
- block URLs and URL patterns from the crawl
- define crawl schedules
- inspect crawl logs and troubleshoot crawls
- emergency removal of items
- install IFilters to support non-default file types
- add/remove file types from the crawl
- specify authoritative pages
- create scopes for all site collections (you can also create at a site collection level)
And in theory specify noise words and create a custom thesaurus. See Inside the Index and Search Engines chapter 5 for more.
You can by default index these types of content source:
- SharePoint sites
- Non-SharePoint websites
- Windows file shares
- Microsoft Exchange Server public folders (you can index Exchange mailboxes with a 3rd party add-on)
- Full crawl: indexes all content
- Incremental crawl: only accesses content that has been updated since last crawl. Faster, but slow if accessing an external website
- Crawl schedules can be specified for each content source
- Crawls should be scheduled for low usage times
- content can be excluded by defining a rule
- rules are applied in the specified order so you usually need to move exclude rules in front of include rules.
- a URL can be excluded by adding it as an exclude rule
- URL patterns can also be excluded and help keep the management of rules neat e.g. http://www.bbc.co.uk/* or http://www.amazon.co.uk/*/dp/*
- Exclude rules will remove any matched URLs during the next crawl
- If you need to remove a URL in an emergency you do this via “Search Result Removal” instead
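The point about rule order is easier to see with a small example. The sketch below is Python, not anything SharePoint runs: it assumes rules are checked top to bottom and the first match wins, and it uses `fnmatch`-style wildcards as a stand-in for SharePoint's own pattern matching, which may differ in detail. The `RULES` list and `crawl_decision` function are hypothetical names.

```python
from fnmatch import fnmatchcase

# (pattern, action) pairs, checked top to bottom; the first match wins.
# The exclude rule sits in front of the broader include rule.
RULES = [
    ("http://www.amazon.co.uk/*/dp/*", "exclude"),
    ("http://www.amazon.co.uk/*", "include"),
]


def crawl_decision(url, rules, default="include"):
    """Return the action of the first rule whose pattern matches the URL."""
    for pattern, action in rules:
        if fnmatchcase(url, pattern):
            return action
    return default
```

With the rules in this order a product page under `/dp/` is excluded. Reverse the list and the broad include rule matches first, so the exclude rule never gets a look in.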
As a general principle it is best not to go overboard on defining SharePoint content types. They add power to information retrieval but also add content creation overheads. Keep the number of types reasonable and also the number of metadata fields. (Obviously the art is defining what ‘reasonable’ means)
A list of reasons to define a specific content type:
- you want to attach a document template for that content type
- there’s a standard workflow for that content type
- there’s a standard info policy for that content type
- you want properties of the content type to be searchable through advanced search
- you want to restrict a search to that content type
- you want to be able to sort a list or library by a specific metadata field of the content type
- you want to categorise a list or library by a specific metadata field of the content type
See also Microsoft’s Managing enterprise metadata with content types
Surprisingly this white paper on building multilingual solutions in SharePoint provides a good overview of how the search works, regardless of whether you are interested in the multilingual aspect.
Read page 15, titled “overview of the language features in search”, for a description of content crawling and search query extraction. Pages 16-18 then provide a good overview of individual features and what they are doing.
Word breakers A word breaker is a component used by the query and index engines to break compound words and phrases into individual words or tokens. If there is no word breaker for a specific language, the neutral word breaker is used, in which case word breaking occurs where there are white spaces between the words and phrases. At indexing time, if there is any locale information associated with the document (for example, a Word document contains locale information for each text chunk), the index engine will try to use the word breaker for that locale. If the document does not contain any locale information, the user locale of the computer the indexer is installed on is used instead. At query time, the locale (HTTP_ACCEPT_LANGUAGE) of the browser from which the query was sent is used to perform word breaking on the query. Additional information about the language availability of the word breaker component is available in Appendix B: Search Language Considerations.
Stemming Stemming is a feature of the word breaker component used only by the query engine to determine where the word boundaries are in the stream of characters in the query. A stemmer extracts the root form of a given word. For example, ”running,” ”ran,” and ”runner“ are all variants of the verb ”to run.” In some languages, a stemmer expands the root form of a word to alternate forms. Stemming is turned off by default. Stemmers are available only for languages that have morphological expansion; this means that, for languages where stemmers are not available, turning on this feature in the Search Result Page (CoreResult Web Part) will not have any effect. Additional information about language availability for the Stemmer feature is available in Appendix B: Search Language Considerations.
Noise words dictionary Noise words are words that do not add value to a query, such as ”and,” ”the,” and ”a.” The indexing engine filters them to save index space and to increase performance. Noise word files are customizable, language-specific text files. These files are a simple list of words, one per line. If a noise word file is changed, you must perform a full update of the index to incorporate the changes. Additional information about the noise words dictionary and how to customize it is available at www.microsoft.com.
Custom dictionary The custom dictionary file contains values that the search server must include at index and query times. Custom dictionary lists are customizable, language-specific text files. These files are used by Search in both the index and query processes to identify exceptions to the noise word dictionaries. A word such as “AT&T,” for example, will never be indexed by default because the word breaker breaks it into single noise words. To avoid this, the user can add ”AT&T” to the custom dictionary file; as result, this word will be treated as an exception by the word breaker and will be indexed and queried. These files contain a simple list of words, one per line. If the custom dictionary file is changed, you must perform a full update of the index to incorporate the changes. By default, no custom dictionary file is installed during Office SharePoint Server 2007 Setup. Additional information about the custom dictionary file and how to customize it is available at www.microsoft.com.
Thesaurus There is a configurable thesaurus file for each language that Search supports. Using the thesaurus, you can specify synonyms for words and also automatically replace words in a query with other words that you specify. The thesaurus used will always be in the language of the query, not necessarily the server’s user locale. If a language-specific thesaurus is not available, a neutral thesaurus (tseneu.xml) is used. Additional information about the thesaurus file and how to customize it is available at www.microsoft.com.
Language Auto Detection The Language Auto Detection (LAD) feature generates a best guess about the language of a text chunk based on the Unicode range and other language patterns. Basically, it’s used for relevance calculation by the index engine and in queries sent from the Advanced Search Web Part, where the user is able to specify constraints on the language of the documents returned by a query.
Did You Mean? The Did You Mean? feature is used by the query engine to catch possible spelling errors and to provide suggestions for queries. The Did You Mean? feature builds suggestions by using three components:
· Query log Information tracked in the query log includes the query terms used, when the search results were returned for search queries, and the pages that were viewed from search results. This search usage data helps you understand how people are using search and what information they are seeking. You can use this data to help determine how to improve the search experience for users.
· Dictionary lexicon A dictionary of most-used lexicons provided at installation time.
· Custom lexicon A collection of the most frequently occurring words in the corpus, built at query time by the query engine from indexed information.
The Did You Mean? suggestions are available only for English, French, German, and Spanish.
Definition Extraction The Definition Extraction feature finds definitions for candidate terms and identifies acronyms and their expansions by examining the grammatical structure of sentences that have been indexed (for example, NASA, radar, modem, and so on). It is only available for English.
Inside the Index and Search Engines is 624 pages of lovely SharePoint search info. It is the sort of book that sets me apart from my colleagues. I was delighted when it arrived; everyone else was sympathetic.
The audience is “administrators” and “developers”. I’m never sure how technical they are imagining when they say “administrators” so I waded in anyway. The book defines topics for administrators as: managing the index file; configuring the end-user experience; managing metadata; search usage reports; configuring BDC applications; monitoring performance; administering protocol handlers and iFilters. I skimmed through the content for developers and found some useful nuggets in there too.
1. Introducing Enterprise Search in SharePoint 2007
2. The End-User Search Experience
3. Customizing the Search User Interface
4. Search Usage Reports
5. Search Administration
6. Indexing and Searching Business Data
7. Search Deployment Considerations
8. Search APIs
9. Advanced Search Engine Topics
10. Searching with Windows SharePoint Services 3.0
The book begins by setting the scene, with lots of fluff about why search matters and some slightly awkward praise for Microsoft’s efforts. It gets much more interesting later, so you can probably skip most of the introduction.
Content I found useful:
Chapter 1. Introducing Enterprise Search in SharePoint 2007
p.28-33 includes a comparison of features for a quick overview of Search Server, Search Server Express and SharePoint Server.
“Queries that are submitted first go through layers of word breakers and stemmers before they are executed against the content index file is available. Word breaking is a technique for isolating the important words out of the content, and stemmers store the variations on a word” p.32
Keyword query syntax p.44
- maximum query length 1024 characters
- by default is not case sensitive
- defaults to AND queries
- phrase searches can be run with quote marks
- wildcard searching is not supported at the level of keyword syntax search queries. Developers could build this functionality using CONTAINS in the SQL query syntax
- exclude words with
- you can search for properties e.g. rnib author:loasby
- property searches can include prefix searches e.g. author:loas
- properties are ANDed unless it is the same property repeated (which would run as an OR search)
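As a rough illustration of the syntax rules above, this Python sketch assembles a keyword query string from plain terms, an optional quoted phrase and property filters, and rejects anything over the 1024 character limit. `build_keyword_query` is a made-up helper, not part of any SharePoint API.

```python
MAX_QUERY_LENGTH = 1024  # maximum keyword query length noted above


def build_keyword_query(terms, properties=None, phrase=None):
    """Join terms, an optional quoted phrase and property:value filters with spaces."""
    parts = list(terms)
    if phrase:
        parts.append('"' + phrase + '"')   # quote marks make it a phrase search
    for name, value in (properties or {}).items():
        parts.append(f"{name}:{value}")    # property search, e.g. author:loasby
    query = " ".join(parts)
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError("keyword query is longer than 1024 characters")
    return query
```

For example, `build_keyword_query(["rnib"], {"author": "loasby"})` gives the `rnib author:loasby` query from the list above.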
Search URL parameters p.50
- k = keyword query
- s = the scope
- v = sort e.g “&v=date”
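These parameters can be put together into a link to a results page. A minimal Python sketch, assuming a results page address of my own invention (`http://intranet.example/results.aspx` is a placeholder, as is the `search_url` helper name):

```python
from urllib.parse import urlencode


def search_url(results_page, keywords, scope=None, sort=None):
    """Build a results-page URL from the k, s and v parameters."""
    params = {"k": keywords}      # k = keyword query
    if scope:
        params["s"] = scope       # s = the scope
    if sort:
        params["v"] = sort        # v = sort, e.g. "date"
    return results_page + "?" + urlencode(params)


url = search_url("http://intranet.example/results.aspx", "rnib author:loasby", sort="date")
```

`urlencode` takes care of escaping spaces and colons in the query, which is easy to forget when building links by hand.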
Chapter 4: The Search Usage Reports
Search queries report contains:
- number of queries
- query origin site collections
- number of queries per scope
- query terms
Search results report contains:
- search result destination pages (which URL was clicked by users)
- queries with zero results
- most clicked best bets
- search results with zero best bets
- queries with low clickthrough
Data can be exported to Excel (useful if I need to share the data in an accessible format).
You cannot view data beyond the 30 day data window. The suggested solution is to export every report!
Chapter 5: Search Administration
Can manage the crawl by:
- create content sources
- define crawl rules : exclude content (can use wildcard patterns), follow/noindex, crawl URLs with query strings
- define crawl schedules
- remove unwanted items with immediate effect
- troubleshoot crawls
There’s a useful but off-topic box about file shares vs. SharePoint on p.225
Crawler can discover metadata from:
- file properties e.g name, extension, date and size
- additional Microsoft Office properties
- SharePoint list columns
- Meta tags in HTML
- Email subject and to fields
- User profile properties
You can view the list of crawled properties via the Metadata Property Mappings link in the Configure Search Settings page. The “Included In Index” value indicates whether the property is searchable.
Managed properties can be:
- exposed in advanced search and in query syntax
- displayed in search results
- used in search scope rules
- used in custom relevancy ranking
Adjusting the weight of properties in ranking is not an admin interface task and can only be done via the programming interface.
High Confidence Results: A different (more detailed?) result for results that the search engine believes are an exact match for the query.
Authoritative pages
- sites central to high priority business processes should be authoritative
- sites that encourage collaboration and actions should be authoritative
- external sites should not be authoritative
Thesaurus
- an XML file on the server with no admin interface
- no need to include stemming variations
- different language thesauri exist. The one used depends on the language specified by client apps sending requests
- tseng.xml and tsenu.xml
Noise words p.294
- language specific plain text files, in the same directory as the thesaurus
- for US English the file name is noiseenu.txt
- off by default
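Since a noise word file is just a plain list with one word per line, the filtering idea is simple to sketch. The Python below only illustrates the concept: it works on an in-memory string rather than the real noiseenu.txt, and `load_noise_words` and `strip_noise` are hypothetical names.

```python
def load_noise_words(text):
    """Parse a one-word-per-line noise word list into a lowercase set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def strip_noise(query, noise_words):
    """Drop noise words from a whitespace-separated query."""
    return [word for word in query.split() if word.lower() not in noise_words]


NOISE = load_noise_words("and\nthe\na\n")
remaining = strip_noise("the sickness form and policy", NOISE)
```

Here only "sickness", "form" and "policy" survive, which is why a change to the noise word list needs a full re-index before it shows up in results.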
Chapter 8 – Search APIs
Mostly too technical but buried in the middle of chapter 8 are the ranking parameters:
- saturation constant for term frequency
- saturation constant for click distance
- weight of click distance for calculating relevance
- saturation constant for URL depth
- weight of URL depth for calculating relevance
- weight for ranking applied to non-default language
- weight of HTML, XML and TXT content type
- weight of document content types (Word, PowerPoint, Excel and Outlook)
- weight of list items content types
They’ll come in handy when I’m baffling over some random ranking decisions that SP has made.
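For anyone wondering what a "saturation constant" does, here is a rough Python sketch. The formula is an assumption borrowed from the general BM25 family of ranking functions, not something confirmed as SharePoint's own calculation: the idea is simply that each extra occurrence of a term adds less than the one before, and the constant controls how quickly the benefit flattens out.

```python
def saturate(value, k):
    """Squash a non-negative raw value into [0, 1); a larger k saturates more slowly."""
    if value < 0 or k <= 0:
        raise ValueError("value must be >= 0 and k must be > 0")
    return value / (k + value)


# With k = 3, the third occurrence gets a term halfway to the maximum,
# and going from 30 to 300 occurrences barely moves the score.
scores = [saturate(tf, 3) for tf in (1, 3, 30, 300)]
```

Under this assumption, a page that repeats a keyword hundreds of times gains very little over one that mentions it a few dozen times.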
Chapter 9 – Advanced Search Engine Topics
Skipped through most of this but it does cover the Codeplex Faceted Search on p.574-585
A good percentage of the book was valuable to a non-developer, particularly one who is happy to skip over chunks of code. I’ve seen and heard a lot of waffle about what SharePoint search does and doesn’t do, so it was great to get some solid answers.
Inside the Index and Search Engines: Microsoft® Office SharePoint® Server 2007
One of our biggest challenges in rolling out SharePoint (and in many other projects) is getting an accessible rich text editor that our blind and partially sighted authors can use to enter content with.
Any other suggestions welcome. I’ll let you know how we get on.
SharePoint search takes into account keywords in titles, URLs and hyperlinks. In each case the keywords need to be separated by spaces/underscores.
It also favours:
- HTML over documents, PPT over Word, and Word over Excel
- high-level pages
- shorter content pages/documents
These can be changed but it is generally not advised (imagine the manual equivalent of a plumber sucking air through his teeth).
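The spaces/underscores point matters when naming files and pages. The Python sketch below shows the effect using a deliberately simple split on whitespace and underscores only, which is an assumption based on the description above rather than the behaviour of SharePoint's actual word breaker. `separated_keywords` is a made-up helper name.

```python
import re


def separated_keywords(name):
    """Split a title or file name into keywords on spaces and underscores only."""
    return [token.lower() for token in re.split(r"[\s_]+", name) if token]


good = separated_keywords("Sickness_Absence Form")
bad = separated_keywords("SicknessAbsenceForm")
```

The first name yields three findable keywords; the second yields a single run-together token that nobody is going to type into a search box.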
We’re working on SharePoint teamsite requirements today. Team-sites are permission controlled “collaboration spaces”.
We’re using them for three types of teams:
- formal organisational teams i.e. teams that share a line manager
- project teams, that might be within an organisational team or cross-team
- network teams, for communities of practice
Because of the accessibility changes required, all functionality requires development so it isn’t the case that we can just stick functionality in and see if people use it.
We’re planning to include content pages, document libraries, search functionality, discussion boards, lists and alerts. We’re not including calendars, wikis and blogs. We don’t think calendars are particularly useful when they don’t integrate elegantly with Outlook: we’re told you can’t send an Outlook appointment to a SharePoint calendar. Wikis are not widely used within the organisation and would require an awareness/training effort so they’re out for now. Blogs are similarly not a common tool at present, and I don’t buy the scenario of blogging for a very limited team audience.
I’ve been researching the reasons for not moving everything from your shared drives to SharePoint. These seemed to be the common factors mentioned (with varying levels of explanation/justification):
1. Storage costs
“SQL Server storage is more expensive and complicated than network storage” -objectmix.com
“The basic collaborative nature of Sharepoint probably doesn’t support long term historical archives of data.” -objectmix.com
4. Backup and restore issues
5. Types of files
File types not to store in SharePoint: scripts, executables, multi files, CIFS links, some Access databases, Outlook Personal Folders, application files (*.exe, *.dll, *.bat, *.log, etc.), large backup files (> 50 MB *.zip, *.iso, *.bak, etc.), DVD images (*.ifo, *.vob).
6. File usage
Usage reasons not to store in SharePoint: files not accessed for months and files without collaborative value
7. Size of files
File size restrictions seem to be the most commonly mentioned point, with most sources suggesting an upper limit of 50-100MB per file.
“To maintain optimum server performance and ease navigation of the document libraries and folder structures, use the following guidelines as the upper limits when organizing your files:
o 1,000 files in a folder
o 1,000 folders per Document Library
o 1,000 document libraries per site
o 50 megabytes (MB) per file”
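Before moving a shared drive across, it may be worth checking it against these guideline limits. The Python sketch below walks a folder tree and reports folders with more than 1,000 files and files over 50 MB. The function name and the way problems are reported are my own choices, and it only covers the per-folder and per-file limits, not the library or site counts.

```python
import os


def check_share(root, max_files=1000, max_bytes=50 * 1024 * 1024):
    """Return (path, reason) pairs for folders and files that exceed the limits."""
    problems = []
    for folder, _subdirs, files in os.walk(root):
        if len(files) > max_files:
            problems.append((folder, f"{len(files)} files in one folder"))
        for name in files:
            path = os.path.join(folder, name)
            size = os.path.getsize(path)
            if size > max_bytes:
                problems.append((path, f"{size} bytes"))
    return problems
```

Running `check_share` against the root of a drive gives a list to review before migration; an empty list means nothing breached the two limits checked.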
8. Linked documents
“Linked documents and files cannot be run from a SharePoint site, as the dependency on an external sources isn’t captured in SharePoint.”
9. Only in SharePoint for search purposes
The files in the drives can still be searched for in SharePoint. “Just index them with Microsoft Office SharePoint Server and they will become discoverable as well.”
One of my great hopes for our current intranet project is to significantly improve the intranet search. The current set-up uses the search bundled with Stellent. It is universally derided within the organisation and with good reason (the Stellent search itself may not be at fault; I imagine some changes to the configuration could fix some of the more significant problems).
I’ve heard mixed reports of SharePoint search. Our suppliers are very positive about it, and it does seem hard to imagine how it could be worse than what we currently have.
At the TFPL conference I attended Sharon Richardson of Joining Dots defended SharePoint search. She went a bit far with the statement “…so the problem with search is not the technology, it’s the users” but there’s some interesting stuff in the ‘research‘ she referred to.
55% The content was badly named, didn’t contain the words the user was searching for, or wasn’t easily identifiable in search results (e.g. if you have 2 results both called Cafe – which is for London and which is for Manchester?)
30% The content users were looking for didn’t exist
10% Users were using wide or strange search terms (why would somebody search for ‘google’ on the intranet? what exactly did they want to find when they searched for ‘form’?)
5% Search wasn’t finding appropriate content or ranking wasn’t appropriate
I’ve been keeping track of failed or problematic searches on our current intranet. Not particularly scientific but it has been an interesting starting point for evaluating the new search.
30% mismatches in language
25% inappropriate date ordering
15% lack of stemming
15% overly rigid phrase order matching
10% ambiguous queries
5% inappropriate alphabetic ordering of results
If a number of results are assigned the same relevancy then they are returned in date order, and if there are a number of results published on the same day then they are returned in alphabetical order. The relevancy scores don’t seem to distinguish between enough results, so the date and alpha ordering are regularly skewing the results.
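The tie-breaking described above can be sketched in a few lines of Python. This is a reconstruction of the observed behaviour, not the search engine's real code, and it assumes "date order" means newest first. Results are hypothetical `(title, relevancy, published)` tuples.

```python
from datetime import date


def order_results(results):
    """Sort by relevancy, then newest date, then title, using stable sorts."""
    by_title = sorted(results, key=lambda r: r[0].lower())
    by_date = sorted(by_title, key=lambda r: r[2], reverse=True)
    return sorted(by_date, key=lambda r: r[1], reverse=True)


RESULTS = [
    ("Sick form", 0.5, date(2008, 1, 1)),
    ("Absence form", 0.5, date(2008, 1, 1)),
    ("Holiday form", 0.5, date(2008, 3, 1)),
    ("Expenses", 0.9, date(2007, 1, 1)),
]
titles = [r[0] for r in order_results(RESULTS)]
```

When three of the four results share a relevancy score, as here, their order is decided entirely by date and then by title, which is exactly the skew described above.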
The mismatched language and the ambiguous queries are sure to still be problems with the new search. I’m not going to endeavour to ‘fix the users’ here. There are plenty of solutions (best bets, related searches, faceted filters and synonym control) that we can utilise.
Interestingly my experiences with our existing search have suggested that searching for just ‘form’ can be an intelligent, considered tactic in less than ideal circumstances. If you are looking for the sickness form but you are not sure if it is actually called that (absence form, sick form etc) then searching for form and scanning the results can be your least worst option. Given our current search is pedantic in its insistence on exact phrase order, I find myself conducting single word searches far more often than usual.