Knowledge Mining with Azure Cognitive Search: 12 Common Questions

While the Microsoft documentation is quite extensive and covers nearly any question you might have about Knowledge Mining with Azure Cognitive Search, we thought it would be helpful to collect some of the more salient questions into a single article for easy reference.

Over the past few months, I’ve been on what seems like a nonstop whirlwind of Knowledge Mining activity. And whether it’s delivering projects using Azure Cognitive Search, speaking with clients about the potential for Knowledge Mining, or hitting the road to give briefings and technical training with Microsoft, many of the same questions seem to come up. I find that while the power of Knowledge Mining is immediately evident, the specifics can be a little difficult to pin down when you’re first getting started. With this in mind, I figured it would be worthwhile to pull together a list of common questions and answers to help others get their arms around what Knowledge Mining in Azure might look like for them.

So, here are “12 Common Questions about Knowledge Mining and Azure Cognitive Search.” I hope you’ll find this article useful, and if you still have questions or would like to chat further about Knowledge Mining (I love to “talk shop”), please don’t hesitate to reach out. I’m always happy to lend a hand.

1) What types of documents can I index?

The blob indexer can extract text from the following document formats:

  • PDF
  • Microsoft Office formats:
    • Word – DOCX/DOC/DOCM
    • Excel – XLSX/XLS/XLSM
    • PowerPoint – PPTX/PPT/PPTM
    • Outlook emails – MSG
    • XML (both 2003 and 2006 Word XML)
  • Open Document formats: ODT, ODS, ODP
  • Other: HTML, XML, ZIP, GZ, EPUB, EML, RTF, plain text, JSON, CSV

(Source: https://docs.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage#supported-document-formats)

2) How much can I index on the "Free" tier?

Azure Cognitive Search limits how much text it extracts depending on the pricing tier:

  • 32,000 characters for the Free tier
  • 64,000 characters for Basic
  • 4 million characters for Standard
  • 8 million characters for Standard S2
  • 16 million characters for Standard S3

If your indexer runs out of characters before it runs out of content, a warning is included in the indexer status response in the Azure Portal identifying documents that are partially indexed for this reason.

(Source: https://docs.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage#how-azure-cognitive-search-indexes-blobs)

3) Can I index non-document data?

It is possible to index data from sources other than documents/files, including:

  • Azure Table Storage
  • Azure Cosmos DB
  • Azure Data Lake Storage Gen2
  • Azure SQL Database
  • Azure SQL Managed Instance
  • SQL Server on Azure Virtual Machines

(Source: https://docs.microsoft.com/en-us/azure/search/search-indexer-overview)

4) Are there alternative ways to index data?

You can upload your data directly to the index using a data “push”.

This technique uses the available REST API to POST data directly into the index. It is a flexible approach that supports any JSON-formatted data; however, since it updates the index directly, your data will not benefit from the cognitive services that are normally applied by the indexer.

(Source: https://docs.microsoft.com/en-us/azure/search/search-what-is-data-import)
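
As a rough sketch of the push model (the service name, index name, and key below are placeholders you would substitute), the batch is just JSON POSTed to the index’s `docs/index` endpoint:

```python
import json
import urllib.request

# Placeholders -- substitute your own service, index, and admin api-key.
SERVICE = "my-search-service"
INDEX = "my-index"
API_KEY = "<admin-api-key>"

def build_push_request(docs):
    """Build the POST that pushes JSON documents straight into the index."""
    url = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
           f"/docs/index?api-version=2020-06-30")
    # Each document carries a @search.action: upload, merge, mergeOrUpload, or delete.
    body = {"value": [{**doc, "@search.action": "upload"} for doc in docs]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )

req = build_push_request([{"id": "1", "content": "Hello, Knowledge Mining"}])
# urllib.request.urlopen(req)  # uncomment to actually send the batch
```

Note again that documents pushed this way skip the enrichment pipeline, so any cognitive skills you want applied must happen before the POST.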

5) Can I add support for custom document types?

Yes, you can. The simplest way to do this is to build a service (e.g. using Azure Functions) that reads the custom document type and either stores its contents as JSON data (e.g. in Blob Storage) or pushes it into a supported Azure database platform.

This approach also lets you leverage libraries in any of the languages Azure Functions supports:

  • C#
  • JavaScript
  • F#
  • Java
  • PowerShell
  • Python
  • TypeScript

(Source: https://docs.microsoft.com/en-us/azure/azure-functions/supported-languages)

Azure Functions can help minimize your Azure usage costs, since your function only runs when needed and doesn’t use any of the premium cognitive or other services that often drive cost in Knowledge Mining implementations.

Because it’s a flexible, multi-language environment, you can leverage existing libraries for file types that aren’t supported by Cognitive Search out of the box (e.g. engineering or CAD files, media files, etc.). Since you control the “cracking” process, you can make it as flexible as you see fit. For instance, you can handle a group of files as a single “entity” in the index. Say you had all the information on a client in a single folder: you could write Azure Function code that reads the files in that folder and writes their contents out as a single JSON file. Azure Cognitive Search could then ingest that one file and let your users search against it. The contents of such a file could even come from multiple places, including other services; you could start with the contents of a folder, as above, but augment that information with data from your CRM.
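
The folder-aggregation idea can be sketched like this; it is a local stand-in for what an Azure Function body might do, and the field names in the JSON entity are purely illustrative:

```python
import json
from pathlib import Path

def crack_client_folder(folder: Path) -> str:
    """Collapse every text file in a client's folder into one JSON "entity"
    that the indexer (or a push call) can ingest as a single document."""
    files = sorted(folder.glob("*.txt"))
    entity = {
        "id": folder.name,  # one searchable entry per client
        "files": [p.name for p in files],
        "content": "\n".join(p.read_text(encoding="utf-8") for p in files),
        # In a real Azure Function you might augment this with CRM data here.
    }
    return json.dumps(entity)
```

In production this would run inside an Azure Function and write the resulting JSON to Blob Storage (where the indexer picks it up) rather than just returning it.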

The sky really is the limit.

(Sources: https://docs.microsoft.com/en-us/azure/search/cognitive-search-working-with-skillsets and https://docs.microsoft.com/en-us/azure/search/cognitive-search-defining-skillset)


6) Can I control which fields can be used for sorting, filtering and faceting results?

Yes, you can.

These are attributes of the index and can be controlled through the Azure Portal UI or by creating or updating your index via HTTP POST or PUT. One tool we find very useful for this is Postman (https://www.postman.com).
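
As an illustration, these attributes live on each field in the index definition. The field names below are hypothetical, but the attribute names (“sortable”, “filterable”, “facetable”) match the Create/Update Index REST contract:

```python
import json

# Sketch of a (hypothetical) index definition showing per-field attributes.
index_definition = {
    "name": "documents-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String",
         "searchable": True, "sortable": False, "filterable": False},
        {"name": "category", "type": "Edm.String",
         "filterable": True, "facetable": True, "sortable": True},
    ],
}

# You would PUT this JSON to the service (e.g. from Postman):
#   PUT https://<service>.search.windows.net/indexes/documents-index?api-version=2020-06-30
payload = json.dumps(index_definition, indent=2)
```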

7) Can I index documents that are stored inside of SharePoint or Microsoft Teams?

Yes and no. Currently there is no out-of-the-box support for these environments; however, we have custom code that can help with this need. We have also spoken to the Microsoft product team, and they are aware that people would love to have this feature. You never know; they may already be working on it…

Contact us if you’d like to discuss how to build such a solution or stay tuned for any announcements from Microsoft should they decide to add this feature.

8) Can I secure my search to limit which documents users can see?

Yes, you can.

Keep in mind that the security model behind this service is still in flux, so there may soon be simpler ways to do this. At the same time, Microsoft has to be careful when making deep changes to the underlying technology (i.e. Apache Lucene) because they don’t want to impact performance, stability or the correctness of results.

The current approach to securely hiding select results from users uses native filtering to “trim” results from the returned set. While this technique has some challenges (e.g. updating permissions is a bit of a pain), it should have little to no impact on your app’s performance.

How it works: you add a field to each entry in the index identifying the “Principals” that are allowed to view it. This allows you to implement strict filtering against the current user’s group/role membership(s).

For instance: say you have two types of users, “public” and “private”. You tag each entry in the index with one or the other of these principals. When a user submits a search, your back-end service pulls the current user’s session and grabs their group memberships. The user’s groups are then added as filters on the query, which prevents any “private” search results from being returned to someone who isn’t a member of the “private” group (or who isn’t logged in at all).

(Source: https://docs.microsoft.com/en-us/azure/search/search-security-trimming-for-azure-search)
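
A minimal sketch of building that trimming filter, assuming you have added a `group_ids` collection field to each index entry (the field name is our own invention; `search.in` is the documented OData function for this pattern):

```python
def security_filter(user_groups):
    """Build the OData $filter for security trimming: only index entries
    whose 'group_ids' collection (an illustrative field name) overlaps
    the current user's groups are returned."""
    if not user_groups:
        user_groups = ["public"]  # anonymous users only see "public" entries
    joined = ",".join(user_groups)
    # search.in(variable, valueList, delimiter) does an efficient membership test
    return f"group_ids/any(g: search.in(g, '{joined}', ','))"

# Pass the result as the $filter parameter on every search query.
```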

9) What are the driving factors for the Azure cost in this type of solution?

When pricing out a possible Azure Search implementation, you’re going to have to look at the following:

  • Storage cost for holding the documents (e.g. in Azure Blob Storage or Azure Files)
  • Storage cost for the Knowledge Store
  • Ingress and egress of files and data from Storage
  • Cognitive Services
  • Size of the index
  • Any Azure Consumption from custom services you’ve built in Azure Functions or in VMs
  • User load on the search application UI

Some principles to keep in mind, from our experience:

  • Keep an eye on which Cognitive Services you use. Do not extract entities that aren’t useful to your users. You will pay for each of these services, and while they can be very useful, the costs can add up.
  • In the “Azure Cognitive Search pricing” page, when they refer to “Storage”, they are referring to the size of your index, not the size of your document store (which is quite a bit larger than the index). You can easily check your index size in the “Indexes” blade in the Azure Portal.
  • Every time the indexer runs, you accrue Azure costs. Therefore, only do incremental indexing unless you absolutely must re-build the whole index.
  • When they refer to “text records” in the Cognitive Services pricing grid, they are referring to 1,000 characters. So, “1,000 text records” translates to “1,000,000 characters”. In other words, for text content, the cost of your Cognitive Services usage will reflect the total number of the characters in your document set.
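That character arithmetic can be captured in a few lines; the price constant below is a placeholder, not a real quote, so check the current pricing page:

```python
# One "text record" = 1,000 characters; pricing is quoted per 1,000 records.
PRICE_PER_1000_RECORDS = 1.00  # hypothetical price, in your billing currency

def estimated_text_cost(total_characters: int) -> float:
    """Translate a raw character count into an estimated enrichment cost."""
    text_records = total_characters / 1_000      # characters -> text records
    return (text_records / 1_000) * PRICE_PER_1000_RECORDS

# 1,000,000 characters = 1,000 text records = one pricing unit
one_unit_cost = estimated_text_cost(1_000_000)
```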

10) How can I estimate my Azure costs before committing to Knowledge Mining?

We recommend that you start with a representative subset of your documents: grab files that represent the range of file and content types you would have in a production rollout. Set up indexing until the index captures the information you want, then track the Azure cost of indexing that entire sample set.

From that cost, you can extrapolate the cost for your whole document set. This will, of course, be an estimate, but it should give you an “order of magnitude” idea of the final costs.

Next, determine how much “churn” you will have on your document set. How often will you run the incremental indexing, and how many (i.e. what percentage) of your documents will have been changed, deleted, or created? This should allow you to roughly calculate your incremental indexing costs.

Add the initial indexing cost to the incremental cost multiplied by the number of periods you’re estimating, and that should give you a rough idea of the total indexing cost for that period.
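
The estimate boils down to a one-line formula; the dollar figures and churn rate in the example are purely illustrative:

```python
def total_indexing_cost(initial_cost: float, churn_fraction: float,
                        periods: int) -> float:
    """Rough total: one full indexing run plus per-period incremental runs,
    each re-indexing roughly churn_fraction of the document set."""
    incremental_per_period = initial_cost * churn_fraction
    return initial_cost + incremental_per_period * periods

# e.g. a $500 initial run with 5% monthly churn, estimated over 12 months:
# 500 + (500 * 0.05) * 12 = 800
yearly_estimate = total_indexing_cost(500.0, 0.05, 12)
```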

11) Is there a limit on the number of documents or the size of my document repository?

The maximum number of Lucene documents is roughly 25 billion per index, so there are no practical limits on the number of documents allowed.

The maximum document size when calling an Index API is approximately 16 megabytes.

Some limits vary by Azure region, so to confirm whether any will apply to your implementation, call the Get Service Statistics API within your target region; it returns information on any limits that may apply.

(Source: https://docs.microsoft.com/en-ca/rest/api/searchservice/get-service-statistics)
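
Calling Get Service Statistics is a simple authenticated GET; the service name and key below are placeholders:

```python
import urllib.request

SERVICE = "my-search-service"   # placeholder -- your service name
API_KEY = "<admin-api-key>"     # placeholder -- an admin key for that service

def build_stats_request() -> urllib.request.Request:
    """Build the Get Service Statistics call, which reports usage counters
    and the limits that apply to your service's region and tier."""
    url = (f"https://{SERVICE}.search.windows.net/servicestats"
           f"?api-version=2020-06-30")
    return urllib.request.Request(url, headers={"api-key": API_KEY}, method="GET")

req = build_stats_request()
# urllib.request.urlopen(req)  # uncomment to call for real; returns JSON stats
```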

There is a “complexity” limit that applies to composite elements, though this is rarely an issue. As per Microsoft, indexers enforce an upper limit of 3,000 elements across all complex collections per document, starting with the generally available API version “2019-05-06” and onwards. That means, for instance, that if a ZIP file is indexed, it should contain fewer than 3,000 items.

(Source: https://docs.microsoft.com/en-us/azure/search/search-limits-quotas-capacity)

12) What are some tricks for minimizing cost?

This toolset can do amazing things, but power comes at a cost. Before committing to a full production implementation of Azure Cognitive Search, you’re likely going to have to determine the cost/benefit relationship.

That said, there are many ways to shave off wasted or unnecessary costs. Start by following these basic tricks:

  • Only index files that are useful to your strategic goals.
  • Eliminate any unnecessary cognitive services. There’s no point in pulling attributes that don’t really matter to your users.
  • If your documents are large and people download them often, that can, in the aggregate, drive up “egress” costs. To minimize this:
    • Tune the UX design of your search application to discourage needless downloads.
    • If you have a large group of users and/or a small set of assets that are downloaded often, consider caching popular documents in memory in the custom search application back-end.
    • If you are using Azure Files to store your source documents, consider setting up Azure File Sync for hybrid cloud/on-premises file caching.
  • If 50% of your document store is useless to 90% of your users, you’ll pay just as much to store and index those files as their useful counterparts. Use your judgement: if you don’t know which files are important, do some analysis beforehand or implement search term and user behaviour analytics in your search application. This can give you unique insights into which files matter most to your users, and which files you should cache for quicker (and cheaper) access.
  • Do careful analysis, planning and design before committing to indexing all of your documents. Get the index design right the first time, because once you’ve clicked “run” on the indexers, you’re going to be billed for that Azure usage. It’s best to know exactly what you’re committing to.
  • If you can’t eliminate cost altogether, you may still benefit from splitting it among the different internal groups who own the documents or benefit from the search. Azure supports different indexing setups; for instance, you can have separate indexers (paid for by individual groups) feeding into a single index.

There are no one-size-fits-all solutions here. Please reach out to us and we can help you customize your Azure Cognitive Search solution to fit your user and organizational needs, as well as your pocketbook.

(Source: https://azure.microsoft.com/en-us/services/storage/files/)

Taylor Bastien

Taylor is a Solutions Architect at T4G, working closely with clients to bring their true needs into focus and forming the right team of professionals to deliver quality solutions. He takes a strategic view of each client’s challenges, helping them to make informed technical investments. When he’s not on the clock, he enjoys staying fit, learning languages, and spending time with his family.