daBongo LMS AI Training Courses

Understanding Google Gemini – What It Is and How It Thinks

Lesson 4: Gemini’s Unique Capabilities – Multimodal, Ecosystem, and More


Lesson Objectives

By the end of this lesson, students should be able to:

  • Explain what "multimodal" means and what Gemini can process beyond text
  • Identify at least four practical use cases enabled by Gemini's multimodal capabilities
  • Understand Gemini's Google ecosystem integration at a conceptual level
  • Know which capabilities require verification of current availability in their plan

Lesson Content

What "multimodal" means.

Most early AI assistants were text-only – you typed in text and received text back. Gemini is multimodal, meaning it can process and reason across multiple types of input, not just text. Depending on your interface and plan, Gemini may be able to work with:

  • Text: Natural language, code, structured data
  • Images: Photos, diagrams, screenshots, charts, documents photographed on a phone
  • Audio: Spoken input (voice conversations)
  • Video: Visual content for analysis (verify current availability in your plan)
  • Files: PDFs, documents, spreadsheets (verify supported formats in your plan)

The ability to send an image or document alongside your text changes what you can ask Gemini to do – and in some cases eliminates steps that would otherwise require manual data entry or description.
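For readers curious about what "sending an image alongside your text" looks like underneath the app, the sketch below builds a single request that carries both a text part and an image part. It is a minimal illustration, not how the Gemini app itself works internally. The field names (`contents`, `parts`, `inline_data`, `mime_type`, `data`) follow the request shape used by the Gemini developer API as we understand it, and should be checked against current documentation. The function name and the placeholder image bytes are hypothetical, and no request is actually sent.

```python
import base64
import json


def build_multimodal_request(prompt: str, image_bytes: bytes,
                             mime_type: str = "image/jpeg") -> dict:
    """Combine a text prompt and one image into a single request body.

    Hypothetical helper for illustration; it only builds a dictionary
    and does not contact any service.
    """
    # Binary image data is base64-encoded so it can travel inside JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},  # the question or instruction
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }


# Placeholder bytes stand in for a real photo (assumed data, not a valid image).
fake_photo = b"\xff\xd8\xff\xe0 not a real JPEG"
request_body = build_multimodal_request(
    "Transcribe what is written here and organize it into an action list.",
    fake_photo,
)
print(json.dumps(request_body, indent=2)[:200])
```

The point of the sketch is that the text and the image travel together in one message, which is why Gemini can answer questions about the image rather than about a description of it.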

Practical multimodal use cases.

*Analyzing images and documents*: Photograph a whiteboard from a brainstorming session and ask: "Transcribe what is written here and organize it into an action list." Photograph a receipt and ask: "What is the total, date, and vendor on this receipt?" Upload a chart from a report and ask: "Summarize the trend shown in this chart and what it implies."

*Working with uploaded documents*: Upload a contract PDF and ask: "Identify all the dates and deadlines in this contract." Upload a research paper and ask: "Summarize the methodology and key findings in plain language for a non-researcher." Upload multiple documents and ask: "Compare these two proposals on cost, timeline, and risk."

*Voice conversations*: On mobile devices with voice access enabled, you can speak to Gemini rather than type – useful for hands-free situations or accessibility needs. Verify which voice capabilities are currently available for your device and plan.

Google ecosystem integration.

Gemini can connect to Google's suite of services through a feature called Extensions (labeled "Apps" in more recent versions of Gemini). When Extensions are enabled and connected, Gemini can:

  • Read and summarize your Gmail messages
  • Access and discuss your Google Drive files
  • Work with content in Google Docs, Sheets, and Slides
  • Find information from Google Maps, YouTube, and other Google services

This integration means Gemini can, for example, help you draft a reply based on an email thread it has access to, or analyze a spreadsheet stored in your Drive – without you needing to manually copy and paste the content. Extensions are covered in depth in the features course (Part 11 of this bundle).

Important: verify current availability.

Gemini's multimodal and integration capabilities vary by interface version, device, and plan tier. Some capabilities described here may require a paid tier such as Gemini Advanced (Google changes plan names over time) or may not yet be available in your region. Always verify what is currently available in your specific account and plan at gemini.google.com before building workflows that depend on specific capabilities.

What multimodal does NOT mean.

Multimodal input does not mean Gemini can see your screen, access your camera without your action, or monitor your device. You must explicitly share images, files, or voice input – Gemini only accesses what you actively give it in the conversation.

Practical Example

A small retail business owner photographs her handwritten inventory notes from a stockroom walk-through and uploads the photo to Gemini:

"Here is a photo of my handwritten inventory notes. Please transcribe everything you can read, then organize it as a table with columns: Item, Quantity, Notes, and a blank Action column."

Gemini transcribes the handwritten content and produces a clean table – converting a photo of messy notes into a structured inventory list in under a minute. The business owner then pastes it into her spreadsheet.

Without multimodal input, this would have required typing every line by hand. With it, the whole task takes under a minute.
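If the table does not paste cleanly into a spreadsheet, a few lines of Python can convert it to CSV, a format every spreadsheet program can open. The sketch below assumes the table was copied out as pipe-delimited text (the Markdown table style), which may not match what your Gemini interface produces. The sample rows and the function name `markdown_table_to_csv` are invented for illustration.

```python
import csv
import io


def markdown_table_to_csv(table_text: str) -> str:
    """Convert a pipe-delimited (Markdown-style) table to CSV text."""
    rows = []
    for line in table_text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        # Drop the outer pipes, then split the remaining text into cells.
        line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the separator row under the header (e.g. |---|---|).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample output, with the Action column left blank as requested.
sample = """
| Item | Quantity | Notes | Action |
|------|----------|-------|--------|
| Blue mugs | 12 | Two chipped | |
| Gift wrap, red | 3 rolls | Low stock | |
"""

csv_text = markdown_table_to_csv(sample)
print(csv_text)
```

Using the `csv` module rather than joining cells with commas means an item such as "Gift wrap, red" is quoted correctly and stays in one column.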

Lesser-Known Tip

When uploading a photo of a document for analysis, take the photo in good lighting against a plain background and avoid angles that distort the text. Gemini's ability to read photographed documents is good but not perfect – blurry, skewed, or poorly lit photos produce less accurate transcriptions. If accuracy matters, spend an extra moment capturing a clean photo or use a document scanner app before uploading.

Safety Notes

When using Gemini's document analysis capabilities for professional content – contracts, financial documents, employee records, medical records – apply the same data handling standards as you would with any other document tool. Do not upload documents containing personally identifiable information, protected health information, or legally sensitive materials without verifying that Gemini's data handling practices align with your obligations. Review your organization's AI use policy before uploading work-related documents of any kind.

