Inspect Rich Documents with Gemini Multimodality and Multimodal RAG
Lab 3 of the course
Multimodal Retrieval Augmented Generation (RAG) using the Gemini API in Vertex AI
The lab is long, so I split the write-up into parts.
This post is part 2.
(See here for part 1.)
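Comparative reasoning
ex) Let's apply what we have built so far to comparative reasoning.
Step 1: Retrieve all images that match a specific query.
Step 2: Send those images to Gemini, ask several questions, and have it compare the images in its answers.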
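Step 1 relies on similarity search over embeddings. The lab's `get_similar_image_from_query` helper is not shown in this post, so the following is only a minimal sketch of the general idea, assuming cosine similarity and NumPy. The function name `top_n_by_cosine` and the toy 3-dimensional vectors are hypothetical stand-ins, not the lab's code or data.

```python
import numpy as np


def top_n_by_cosine(query_vec, candidate_vecs, top_n=3):
    """Return (index, score) pairs for the top_n candidates most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    candidates = np.asarray(candidate_vecs, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    # Sort by descending score and keep the best top_n.
    order = np.argsort(-scores)[:top_n]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for image-description embeddings (hypothetical data).
description_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.05, 0.0]

matches = top_n_by_cosine(query_embedding, description_embeddings, top_n=2)
print(matches)
```

The returned indices play the role of the rank keys (0, 1, 2, ...) used to index the matching results in the code that follows.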
matching_results_image_query_1 = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query="Show me all the graphs that shows Google Class A cumulative 5-year total return",
    column_name="text_embedding_from_image_description",  # text embedding of the image description (alternative column: mm_embedding_from_img_only)
    image_emb=False,  # Use text embedding instead of image embedding
    top_n=3,
    embedding_size=1408,
)
# Check matched images. With top_n=3, matching_results_image_query_1[2] is available as well.
print("---------------Matched Images------------------\n")
display_images(
    [
        matching_results_image_query_1[0]["img_path"],
        matching_results_image_query_1[1]["img_path"],
    ],
    resize_ratio=0.5,
)
ㄴ>
---------------Matched Images------------------


ex) Compare the images and the Gemini-extracted text provided as context
prompt = f""" Instructions: Compare the images and the Gemini extracted text provided as Context: to answer Question:
Make sure to think thoroughly before answering the question and put the necessary steps to arrive at the answer in bullet points for easy explainability.
Context:
Image_1: {matching_results_image_query_1[0]["image_object"]}
gemini_extracted_text_1: {matching_results_image_query_1[0]['image_description']}
Image_2: {matching_results_image_query_1[1]["image_object"]}
gemini_extracted_text_2: {matching_results_image_query_1[1]['image_description']}
Question:
- Key findings of Class A share?
- What are the critical differences between the graphs for Class A Share?
- What are the key findings of Class A shares concerning the S&P 500?
- Which index best matches Class A share performance closely where Google is not already a part? Explain the reasoning.
- Identify key chart patterns in both graphs.
"""
# Generate Gemini response with streaming output
rich_Markdown(
    get_gemini_response(
        multimodal_model,  # we are passing Gemini 2.0
        model_input=[prompt],
        stream=True,
        generation_config=GenerationConfig(temperature=1),
    )
)
ㄴ>
Here's an analysis of the images and text, broken down to answer each question:
Image 1 Analysis (Cumulative Return Graph):
• Key findings of Class A share? Alphabet Inc. Class A shares (red line) showed significant growth in cumulative
total return over the 5-year period (12/17 to 12/22). The graph illustrates its performance relative to other
market indices. The return appears to be substantially higher than the other indices, especially towards the
end of the period.
• What are the critical differences between the graphs for Class A Share? This question is phrased oddly, as
there's only one graph representing Class A shares. Assuming it means "What are the critical differences
between the Class A share performance and the other indices?": The key difference is the magnitude of the
return. While all indices generally trend upward, the Class A shares show a much steeper upward trajectory,
indicating higher returns than the S&P 500, NASDAQ, and RDG Internet Composite.
• What are the key findings of Class A shares concerning the S&P 500? Alphabet Inc. Class A shares significantly
outperformed the S&P 500 over the 5-year period. The red line (Alphabet) is consistently above the green line
(S&P 500) and the gap widens considerably over time.
• Which index best matches Class A share performance closely where Google is not already a part? Explain the
reasoning. This is a tricky question without knowing the exact composition of the RDG Internet Composite Index.
The RDG Internet Composite (blue line) appears visually to track the Class A share performance most closely
compared to the S&P 500 and NASDAQ until late 2021. The NASDAQ performance also follows similarly but Class A
shares perform better. The S&P 500 is the least similar. Reasoning: We are looking for an index that mirrors the
upward and downward swings of Class A shares. The RDG Internet Composite index follows it closely. If
Google/Alphabet is not part of the RDG Internet Composite, then it's the best match.
Image 2 Analysis (RSU Table):
• Key findings of Class A share? This table does not directly show findings about the performance of Class A
Shares. Instead, it provides information about Alphabet's Restricted Stock Units (RSUs), including the number of
shares unvested, granted, vested, and forfeited/canceled, along with their weighted-average grant-date fair
value. The table shows changes in unvested RSUs over the year 2021. The number of unvested shares decreased from
19,288,793 to 16,894,713, while the weighted-average grant-date fair value increased from $1,262.13 to
$1,626.13.
• What are the critical differences between the graphs for Class A Share? There is only a single table regarding
the RSUs. This question is also phrased oddly, as there's only one table representing Class A shares RSU
details.* Assuming it means "What is important to note in the table?": The key point to highlight is the
decrease in the number of unvested shares coupled with the increase in their average fair value. This suggests
that while shares are vesting and being forfeited, the overall value of the remaining unvested shares is
increasing.
• What are the key findings of Class A shares concerning the S&P 500? The table provides no direct information
about the S&P 500. It focuses solely on Alphabet's RSU activity.
• Which index best matches Class A share performance closely where Google is not already a part? Explain the
reasoning. The table has no bearing on this question, as it doesn't present any data related to market indices
or share price performance.
Identifying Key Chart Patterns:
• Image 1 (Return Graph):
• Outperformance: Clear and sustained outperformance of Alphabet Inc. Class A shares compared to all other
indices.
• Increasing Divergence: The gap between Alphabet and the other indices widens significantly over time,
indicating accelerating outperformance.
• Image 2 (RSU Table):
• Decreasing Unvested Shares: The number of unvested shares decreases from the beginning to the end of the
year.
• Increasing Fair Value: The weighted-average grant-date fair value of unvested shares increases, suggesting an
appreciation in the company's stock value.
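One detail of the prompt above is worth checking. A Python f-string converts every interpolated object to its string form, so an image object placed inside the f-string reaches the prompt as text, not as pixels. What the lab's `get_gemini_response` helper does with the prompt is not shown in this post, so the sketch below only demonstrates the Python behaviour. `FakeImage` is a hypothetical stand-in class, not a Vertex AI type, and the file name and caption are made up.

```python
class FakeImage:
    """Hypothetical stand-in for an image object; not a Vertex AI class."""

    def __init__(self, path):
        self.path = path

    def __repr__(self):
        return f"<FakeImage path={self.path!r}>"


image = FakeImage("chart_page_1.png")
caption = "Line chart of cumulative 5-year total return."

# Option 1: interpolate into an f-string. The image becomes plain text.
prompt_as_string = f"Image_1: {image}\ncaption_1: {caption}"

# Option 2: keep the parts separate, so each part keeps its own type.
prompt_as_parts = ["Image_1: ", image, "caption_1: ", caption]

print(prompt_as_string)
print([type(part).__name__ for part in prompt_as_parts])
```

With option 1 the model can only rely on the caption text, which may be why the answer above reads like an analysis of the extracted descriptions. Option 2 mirrors the list form that step 4 of the RAG section uses for `context_images`.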
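Task 7. Multimodal Retrieval Augmented Generation (RAG)
Now let's implement multimodal RAG using every component covered in the previous sections.
The steps are as follows.
- Step 1: The user enters a query as text, for information that is expected to be in the document and is contained in both its images and its text.
- Step 2: Find all text chunks from the document's pages, using a method similar to the one covered in text search.
- Step 3: Find all similar images from the pages by matching the user query against image_description, using the same method covered in image search.
- Step 4: Combine all similar text and images found in steps 2 and 3 into context_text and context_images.
- Step 5: With Gemini's help, pass the user query together with the text and image context found in steps 2 and 3. You can also add specific instructions that the model should keep in mind while answering the query.
- Step 6: Once Gemini generates the answer, print the citations to see all the relevant text and images used to address the query.
Note: You may need to wait a few minutes before the score for this task becomes available.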
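The six steps above can be sketched as one function. This is a minimal outline under stated assumptions, not the lab's implementation: `answer_with_rag`, the three callables it takes, and all stub data are hypothetical, and images are represented only by their path and description so the sketch runs without any cloud service.

```python
from typing import Callable, List, Tuple

ImageHit = Tuple[str, str]  # (image path, image description)


def answer_with_rag(
    query: str,
    retrieve_text: Callable[[str], List[str]],
    retrieve_images: Callable[[str], List[ImageHit]],
    generate: Callable[[str], str],
) -> Tuple[str, List[str], List[ImageHit]]:
    """Run steps 2-5 for a text query (step 1) and return the sources for step 6."""
    chunks = retrieve_text(query)  # step 2: similar text chunks
    images = retrieve_images(query)  # step 3: similar images via their descriptions
    # Step 4: combine both kinds of context.
    context_text = "\n".join(chunks)
    context_images = "\n".join(f"{path}: {desc}" for path, desc in images)
    # Step 5: send instructions, context and the query to the model.
    prompt = (
        'Instructions: answer from the context. If unsure, respond, "Not enough context to answer".\n'
        f"Text Context:\n{context_text}\n"
        f"Image Context:\n{context_images}\n"
        f"{query}\nAnswer:"
    )
    answer = generate(prompt)
    # Step 6: return the sources so the caller can print citations.
    return answer, chunks, images


# Stubs so the sketch runs offline; all data below is made up.
def stub_retrieve_text(query: str) -> List[str]:
    return ["toy chunk A", "toy chunk B"]


def stub_retrieve_images(query: str) -> List[ImageHit]:
    return [("images/toy_chart.png", "toy description of a chart")]


def stub_generate(prompt: str) -> str:
    return f"stub answer based on a prompt of {len(prompt.splitlines())} lines"


answer, text_sources, image_sources = answer_with_rag(
    "Questions:\n- toy question?", stub_retrieve_text, stub_retrieve_images, stub_generate
)
print(answer)
print(text_sources, image_sources)
```

Passing the retrievers and the generator as arguments keeps the flow testable: the stubs can be swapped for real retrieval helpers and a real model call without changing the function.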
Step 1: User query
- The user enters a query as text, for information that is expected to be in the document and is contained in both its images and its text.
# this time we are not passing any images, but just a simple text query.
query = """Questions:
- What are the critical difference between various graphs for Class A Share?
- Which index best matches Class A share performance closely where Google is not already a part? Explain the reasoning.
- Identify key chart patterns for Google Class A shares.
- What is cost of revenues, operating expenses and net income for 2020. Do mention the percentage change
- What was the effect of Covid in the 2020 financial year?
- What are the total revenues for APAC and USA for 2021?
- What is deferred income taxes?
- How do you compute net income per share?
- What drove percentage change in the consolidated revenue and cost of revenue for the year 2021 and was there any effect of Covid?
- What is the cause of 41% increase in revenue from 2020 to 2021 and how much is dollar change?
"""
Step 2: Get all relevant text chunks
# Retrieve relevant chunks of text based on the query
matching_results_chunks_data = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=10,
    chunk_text=True,
)
Step 3: Get all relevant images
# Get all relevant images based on user query
matching_results_image_fromdescription_data = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,
    column_name="text_embedding_from_image_description",
    image_emb=False,
    top_n=10,
    embedding_size=1408,
)
Step 4: Create context_text and context_images
# combine all the selected relevant text chunks
context_text = []
for key, value in matching_results_chunks_data.items():
    context_text.append(value["chunk_text"])
final_context_text = "\n".join(context_text)
# combine all the relevant images and their description generated by Gemini
context_images = []
for key, value in matching_results_image_fromdescription_data.items():
    context_images.extend(
        ["Image: ", value["image_object"], "Caption: ", value["image_description"]]
    )
Step 5: Pass the context to Gemini
prompt = f""" Instructions: Compare the images and the text provided as Context: to answer multiple Question:
Make sure to think thoroughly before answering the question and put the necessary steps to arrive at the answer in bullet points for easy explainability.
If unsure, respond, "Not enough context to answer".
Context:
- Text Context:
{final_context_text}
- Image Context:
{context_images}
{query}
Answer:
"""
# Generate Gemini response with streaming output
rich_Markdown(
    get_gemini_response(
        multimodal_model,
        model_input=[prompt],
        stream=True,
        generation_config=GenerationConfig(temperature=1),
    )
)
ㄴ>
Here's a breakdown of the answers to your questions, incorporating information from both the text and image
contexts:
• What are the critical differences between various graphs for Class A Share?
• The graphs compare the cumulative 5-year total stockholder return of Alphabet Inc. Class A common stock
against the S&P 500 index, the NASDAQ Composite index, and the RDG Internet Composite index.
• One graph tracks performance from December 31, 2016, to December 31, 2021, while the other tracks from
December 31, 2017, to December 31, 2022.
• The later graph shows a decline in Alphabet's Class A share performance relative to the earlier graph. All
indices exhibit different levels of growth and volatility across the two time periods.
• The RDG Internet Composite index had the highest return in the first graph, followed by Alphabet Inc. Class
A, then the NASDAQ Composite, and finally the S&P 500.
• Which index best matches Class A share performance closely where Google is not already a part? Explain the
reasoning.
• Based on the two graphs and without knowing the specific composition of the RDG Internet Composite index,
it's difficult to definitively determine which index best matches Class A share performance while excluding
Google/Alphabet's influence.
• To answer this perfectly, the exact constituents of RDG Internet Composite index would be needed and whether
Google is a significant constituent of S&P 500/NASDAQ.
• The RDG Internet Composite appears to track Class A shares most closely in both graphs.
• Identify key chart patterns for Google Class A shares.
• The graphs provided show cumulative total return, not price charts. It is hard to identify chart patterns.
• However, the graphs indicate relative performance against benchmark indices over 5-year periods. The second
graph (2017-2022) shows the Alphabet's Class A share underperforming RDG Internet Composite but outperforming
S&P 500 and NASDAQ.
• What is cost of revenues, operating expenses, and net income for 2020? Do mention the percentage change.
• Cost of Revenues (2020): $84,732 million
• Operating Expenses (2020): $56,571 million
• Net Income (2020): $40,269 million
• Percentage change is not applicable as only 2020 values are requested here.
• What was the effect of Covid in the 2020 financial year?
• The text states that in March 2020, Alphabet observed the effect of COVID-19 on financial results.
• Despite an increase in user search activity, advertising revenues declined compared to the prior year.
• This was due to a shift of user search activity to less commercial topics and reduced spending by
advertisers.
• For the quarter ended June 30, 2020, advertising revenues declined due to the continued effects of COVID-19.
• What are the total revenues for APAC and USA for 2021?
• APAC Revenues (2021): $46,123 million
• United States Revenues (2021): $117,854 million
• What are deferred income taxes?
• The images provide a table of deferred tax assets and liabilities, but the text does not directly define
"deferred income taxes."
• Generally, deferred tax assets arise when taxable income is less than accounting income, and deferred tax
liabilities arise when taxable income is more than accounting income. These differences reverse over time.
• How do you compute net income per share?
• The text explains that net income per share of Class A, Class B, and Class C stock is computed using the
two-class method.
• Basic net income per share: Computed using the weighted-average number of shares outstanding during the
period.
• Diluted net income per share: Computed using the weighted-average number of shares and the effect of
potentially dilutive securities outstanding during the period (restricted stock units and other contingently
issuable shares).
• What drove the percentage change in consolidated revenue and cost of revenue for the year 2021, and was there
any effect of Covid?
• Consolidated Revenue Increase (41%): Primarily driven by Google Services and Google Cloud. The adverse effect
of COVID-19 on 2020 advertising revenues also contributed to the year-over-year growth (a rebound effect).
• Cost of Revenues Increase (31%): Primarily driven by increases in TAC (Traffic Acquisition Costs) and content
acquisition costs.
• The text suggests a positive effect related to COVID-19 in the sense that the rebound from a weaker 2020
contributed to 2021's growth.
• What is the cause of the 41% increase in revenue from 2020 to 2021, and how much is the dollar change?
• Cause: Primarily driven by Google Services and Google Cloud. The adverse effect of COVID-19 on 2020
advertising revenues also contributed to the year-over-year growth.
• Dollar Change: $75,110 million (calculated as $257,637 - $182,527 - based on table)
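Numbers in a generated answer are worth re-checking by hand. The short check below recomputes the dollar change and the percentage change from the two revenue figures quoted in the last bullet of the answer. The variable names are mine, and the figures are taken from the model's output, not re-verified against the source document.

```python
revenue_2020 = 182_527  # USD millions, as quoted in the answer above
revenue_2021 = 257_637  # USD millions, as quoted in the answer above

dollar_change = revenue_2021 - revenue_2020
percent_change = dollar_change / revenue_2020 * 100

print(f"Dollar change: ${dollar_change:,} million")
print(f"Percentage change: {percent_change:.2f}%")
```

The difference comes out to 75,110 and the ratio rounds to 41%, so the two figures in the answer are consistent with each other.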
Step 6: Print citations and references
print("---------------Matched Images------------------\n")
display_images(
    [
        matching_results_image_fromdescription_data[0]["img_path"],
        matching_results_image_fromdescription_data[1]["img_path"],
        matching_results_image_fromdescription_data[2]["img_path"],
        matching_results_image_fromdescription_data[3]["img_path"],
    ],
    resize_ratio=0.5,
)
ㄴ>
---------------Matched Images------------------




+) Image citations. You can check how the Gemini-generated metadata helped ground the answer.
print_text_to_image_citation(
    matching_results_image_fromdescription_data, print_top=False
)
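The lab's `print_text_to_image_citation` helper is not shown in this post. As a rough idea of what a citation printer can do, here is a minimal sketch that assumes results are keyed by rank and carry the `img_path` and `image_description` fields used above. The function `format_image_citations`, the truncation length, and the toy paths and descriptions are hypothetical.

```python
def format_image_citations(results, max_len=60):
    """Build one citation line per retrieved image, ordered by rank."""
    lines = []
    for rank in sorted(results):
        item = results[rank]
        description = item["image_description"]
        # Truncate long descriptions so each citation stays on one line.
        if len(description) > max_len:
            description = description[: max_len - 3] + "..."
        lines.append(f"[{rank + 1}] {item['img_path']} | {description}")
    return lines


# Toy results in a rank-keyed shape (paths and descriptions are made up).
toy_results = {
    1: {
        "img_path": "images/page_7.png",
        "image_description": "Table of unvested RSUs.",
    },
    0: {
        "img_path": "images/page_5.png",
        "image_description": "Line chart comparing cumulative total return across indices.",
    },
}

for line in format_image_citations(toy_results):
    print(line)
```

Listing the image path next to its generated description makes it easy to see which description grounded which part of the answer.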
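Multimodal RAG can be very powerful, but it has some limitations:
- Data dependency: it needs high-quality text and visuals.
- Computational burden: processing multimodal data is resource-intensive.
- Domain specificity: models trained on general data may not shine in specialized fields such as medicine.
- Black box: understanding how these models work can be tricky, which can hinder trust and adoption.
Despite these challenges, multimodal RAG is a significant step toward search systems that can handle diverse multimodal data.
In this lab, you learned how to build a robust document search engine using multimodal Retrieval Augmented Generation (RAG).
You learned how to extract and store metadata from documents that contain both text and images, and how to generate embeddings for those documents.
You also learned how to search the metadata with text and image queries to find similar text and images.
Finally, you learned how to use a text query to retrieve contextual answers that draw on both text and images.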
Sources
https://www.cloudskillsboost.google/course_templates/981/labs/550043
generative-ai/gemini/use-cases/retrieval-augmented-generation/intro_multimodal_rag.ipynb at main · GoogleCloudPlatform/generati
Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI - GoogleCloudPlatform/generative-ai
github.com