Full Transcript
https://www.youtube.com/watch?v=X9SJXE6bARQ
[01:01] Hello everyone.
[01:03] I'm Janing and this webinar will begin in 3 minutes.
[01:07] Let's wait for everyone to join this meeting.
[01:09] Okay.
[01:09] Thank you.
[02:52] Okay, everyone. Hello. Good afternoon.
[02:54] I believe that we are all in the same time zone.
[02:56] So, good afternoon.
[02:59] My name is Shanning and I will be the PC for this Indonesian training set project that will kick off very soon.
[03:05] So today I will give you a brief introduction to the guidelines for this Indonesian training set project.
[03:13] So what is the difference between the training set project and the test set project?
[03:21] For those of you who have done the test set project prior to this one, you will know that when we encounter unclear speech parts, we can use means like asterisks or tildes to represent the unclear words or speech parts, or we can use angle brackets to represent overlapping speech parts.
[03:48] But for this training set project, what we want to have as learning materials for AI or large language models to learn from is clear speech parts.
[04:01] We do not want any interference to disturb the learning process of large language models.
[04:16] So if I were to summarize today's webinar in just one sentence, I would say that the across-the-board rule to complete this training set project is to keep the clear speech parts and to cut away, or intercept away, the unclear or overlapping speech parts.
[04:36] And yeah, that's the across-the-board rule. That's the rule of thumb for this project.
[04:41] Let's come to the operating interface. I will give you a brief introduction to the operating interface of this project on the platform.
[04:49] So on the left is the conversation context. It represents the conversation between the user and the chatbot that directly precedes the audio, but it only serves as a reference.
[04:58] It helps you better understand the audio and what you are to put in the transcription box.
[05:07] And the section on the right is the operational surface.
[05:11] Let's look at a case here.
[05:15] So on the top you can see an audio bar with the audio's waveform.
[05:20] And for example we can hit the play button to listen to the audio.
[05:26] Let's try it.
[05:29] >> Hello.
[05:31] Cholesterol.
[05:34] Hello Google, cholesterol.
[05:39] Yeah, like this.
[05:41] And below this audio bar, we can see two multiple-choice questions.
[05:47] So, first we need to choose whether we wish to discard this audio or to keep this audio for this case.
[05:52] And number two, we need to choose whether the speaker is using only the target language, which means Indonesian in this project, or a mix of target language and English in the final cut of the audio.
[06:07] And below these two multiple-choice questions we can see two text boxes.
[06:16] Well, the first text box is the ASR result provided by AI and it only serves as a reference. It does not necessarily correspond to the transcription or the final answer that we wish to obtain.
[06:40] And here, the transcription box is marked with a red asterisk. This is the box that you must fill in. And this is the final result that we wish to have.
[06:57] Okay, let's come back to the SOP.
[07:00] So there are two rounds for you to do for this project.
[07:02] Number one is the labeling round and number two is the QA round.
[07:04] So for the labeling round, what you should do first is to click process to enter the labeling interface.
[07:11] And this is the labeling interface.
[07:13] And number two, you should listen to the audio by clicking the play button, like what I did just now.
[07:22] And next, after you listen to the audio, you should select whether you wish to discard this audio or to keep it.
[07:31] And I will come back to the discard rules very soon.
[07:38] And if you wish to discard this audio, please provide a reason.
[07:39] And if you choose to keep this audio, there are three subsequent steps.
[07:47] Number one, you should intercept the audio according to the rules.
[07:53] So to intercept the audio, you can select a segment like this.
[08:01] And the selected part is the part that we want to keep.
[08:04] And the other two unselected parts are the parts that we wish to intercept away, that is, to discard.
[08:08] The other two parts are the unwanted parts.
[08:28] And if you want to make some modifications to the selected part, you could either, number one, drag the selected part like this,
[08:35] or, number two, click this trash can button to delete this segment and select another part.
[08:45] This is because we cannot select two parts in one audio.
[08:48] We can only have one selected part per audio.
[08:55] Yes, that's also a rule in this project.
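The one-selection rule above can be sketched in Python. This is only an illustration, not the platform's actual code: `Segment` and `unwanted_parts` are hypothetical names, and times are assumed to be in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A span of audio, in milliseconds (hypothetical representation)."""
    start_ms: int
    end_ms: int

    def __post_init__(self):
        if not 0 <= self.start_ms < self.end_ms:
            raise ValueError("segment must satisfy 0 <= start_ms < end_ms")


def unwanted_parts(audio_ms: int, kept: Segment) -> list[Segment]:
    """Return the parts that are intercepted away when `kept` is the single selection."""
    if kept.end_ms > audio_ms:
        raise ValueError("selection extends past the end of the audio")
    parts = []
    if kept.start_ms > 0:
        parts.append(Segment(0, kept.start_ms))        # part before the selection
    if kept.end_ms < audio_ms:
        parts.append(Segment(kept.end_ms, audio_ms))   # part after the selection
    return parts
```

Because only one selection is allowed per audio, there are at most two unwanted parts: one before the selection and one after it.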
[09:04] But how can we make sure that we have captured exactly the part that we want?
[09:11] How can we be sure that we do not miss, for example, the beginning vowel, the beginning syllable or the final syllable of the part that we want?
[09:20] We can use this magnifying glass button to zoom in.
[09:27] In this way, we can see that there are some redundant parts that we selected.
[09:33] So, we can make some adjustments like this.
[09:41] And you can see that if we include some redundant parts in the selection, there is nearly no difference once we zoom out, but we can only tolerate margins within 100 milliseconds.
[09:58] So you can take advantage of this magnifying glass button to make sure that we have captured precisely the part that we want. Okay.
[10:10] That is the interception. We'll also come to the interception rules in detail very soon.
[10:19] And next, once we have successfully intercepted the part that we want, we should choose the correct language label for the final cut.
[10:26] Now what do I mean by final cut?
[10:30] We can refer to the glossary here. So the final cut means the final audio segment selected for transcription.
[10:36] So for example the blue part is now the final cut.
[10:38] And if we want to choose the language label, we do not need to pay attention to the entire audio anymore.
[10:54] We should only pay attention to the blue part, the selected part, to choose between target language or target language and English.
[11:09] Next we're coming to the transcription.
[11:16] We should transcribe the final cut according to the rules, and we can use the conversation context for hints if needed.
[11:20] That is what I mentioned just now.
[11:24] And if you are sure that there is no problem with this case, if you have finished the two multiple-choice questions, intercepted the audio and finished the transcription, you can click submit.
[11:36] And submit normally would appear here.
[11:38] But right now I'm not processing this case,
[11:41] so there's no submit button.
[11:43] But if you enter this labeling interface via the process button, there will be a submit button here,
[11:48] in the top right corner.
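The labeling-round steps above amount to a checklist before clicking submit. A minimal Python sketch of that checklist follows; `LabeledCase`, its field names and the label strings are hypothetical, and it assumes that a kept case always needs a selected segment, which the webinar does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label values for the second multiple-choice question.
LANGUAGE_LABELS = {"target_language", "target_language_and_english"}


@dataclass
class LabeledCase:
    discard: bool
    discard_reason: Optional[str] = None
    segment_ms: Optional[tuple[int, int]] = None   # (start, end) of the final cut
    language_label: Optional[str] = None
    transcription: Optional[str] = None


def missing_before_submit(case: LabeledCase) -> list[str]:
    """List what is still missing before the case can be submitted."""
    if case.discard:
        # A discarded case only needs a reason.
        return [] if case.discard_reason else ["discard reason"]
    missing = []
    if case.segment_ms is None:
        missing.append("intercepted segment")
    if case.language_label not in LANGUAGE_LABELS:
        missing.append("language label")
    if not (case.transcription or "").strip():
        missing.append("transcription")
    return missing
```

An empty list means the case is ready to submit under these assumptions.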
[11:56] And for the QA round, the participants that will do the quality assessment should review the case classification, the language label and the transcription from the first labeler.
[12:13] If they are correct, if there's no problem, you should choose qualified, and if something is wrong, you should choose unqualified and be sure to correct the errors before submitting the cases.
[12:23] Only choosing qualified or unqualified is not enough.
[12:26] We want to have the answers that you believe to be correct in the end.
[12:33] Okay.
[12:33] And then after the QA round the cases will be circulated to us for further evaluation.
[12:44] Okay.
[12:44] Any questions on the overview for now?
[12:59] You can just turn on your mic if you have any problem.
[13:07] Okay. Then we'll move on to the guidelines.
[13:12] So according to the SOP that I introduced just now, the first operation that we should do is to choose discard or non-discard.
[13:26] So right now I will move on to the criteria concerning the situations in which we should discard an audio or a case.
[13:35] Criterion number one, if an audio contains any personal or personally identifiable data, we should discard it.
[13:43] For example, if this audio contains the full name of a private individual, like Bob Evans cited here, we should discard it.
[13:51] But if it only contains a single name, Bob, well, we do not know which Bob the speaker is referring to.
[13:58] So in this case, we do not need to discard it.
[14:00] Or in another case, when the audio contains the name of a public figure like Tom Cruise, for example the speaker is presenting his idol and he's saying Tom Cruise is an actor, we do not need to discard this case because we all know who Tom Cruise is.
[14:20] But let's hypothesize that Tom Cruise himself is using the AI and he says this is Tom Cruise.
[14:34] Well, in this case we need to discard it.
[14:37] And please note that fictional characters are considered public figures.
[14:42] For example, Edogawa Conan or Naruto, if you know them, should be considered public figures.
[14:53] And if influencers' names like IShowSpeed or MrBeast appear in the audio, we do not need to discard it, but if it is a private individual's username, we should discard it.
[15:17] If you're unsure about whether a name or username is public or private, you could do a fact check on the internet.
[15:21] And if you're still unsure, you could just discard it.
[15:23] Well, the safest practice is always to discard a case.
[15:25] But please try to use this practice as a last resort because we do not want to discard too many cases.
[15:37] Okay.
[15:41] And if the audio contains any phone number, email, postal address, ID number, password or credit card account number, please also discard the audio.
[15:52] But if it is a business number, business email or a public business address, you can keep it.
[16:01] You can refer to more details in this Appendix 2 for personal data.
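The name-related part of criterion number one can be sketched as a small decision function. This is an illustration only: `NameKind` and `discard_for_name` are hypothetical names, and the labeler still has to judge by hand which kind a name belongs to.

```python
from enum import Enum, auto


class NameKind(Enum):
    PRIVATE_FULL_NAME = auto()   # e.g. the full name of a private individual
    SINGLE_GIVEN_NAME = auto()   # e.g. just "Bob": not identifiable
    PUBLIC_FIGURE = auto()       # includes fictional characters and well-known influencers
    PRIVATE_USERNAME = auto()
    STILL_UNSURE = auto()        # still unclear after a fact check


# Kinds that do not require a discard on their own.
_KEPT = {NameKind.SINGLE_GIVEN_NAME, NameKind.PUBLIC_FIGURE}


def discard_for_name(kind: NameKind, speaker_claims_to_be_that_person: bool = False) -> bool:
    """Return True if the audio should be discarded because of this name."""
    if kind is NameKind.PUBLIC_FIGURE and speaker_claims_to_be_that_person:
        # "This is Tom Cruise", said by the person himself, identifies the speaker.
        return True
    # Discarding when still unsure is the last resort, not the default.
    return kind not in _KEPT
```

For example, `discard_for_name(NameKind.SINGLE_GIVEN_NAME)` keeps the audio, while `discard_for_name(NameKind.PRIVATE_USERNAME)` discards it.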
[16:09] Okay, criterion number two: when the audio contains no speech. If the audio only contains noises, instrumental music or silence, which means there's no speaker's voice, we should discard the audio.
[16:27] Number three, when the audio contains no target language, we should discard the audio. So the target languages for this project are the specific target language, which means Indonesian in this project, plus English.
[16:42] But proper nouns and adopted words that are accepted and commonly used by native speakers, like L'Oreal, or Naruto, which is a figure in Japanese anime, do not call for a discard. They should be considered a part of Indonesian.
[17:04] I remember some of you raised a question about a Korean word like jajang, which is a Korean food or a type of East Asian food.
[17:21] In that case, if you believe that it is already widely acknowledged by Indonesian people, it should be considered a part of the Indonesian language and not English, Korean or another foreign language.
[17:39] Okay. But please be careful. Please pay attention to rule number three: when the audio is a mix of target language and English, but the English part accounts for more than 30% of the audio, please discard the audio. That is because it is an Indonesian project that we're doing. So we do not want the models, the AI, to learn too much English. We're focusing on the Indonesian language in this project.
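The 30% threshold can be illustrated with a short Python sketch. The webinar does not say whether the share is measured by duration or by words; this sketch assumes a word count, and the function name and the `"id"` / `"en"` tags are hypothetical.

```python
def discard_for_english(word_langs: list[str], limit_percent: int = 30) -> bool:
    """Return True when English makes up more than `limit_percent` of the words.

    `word_langs` holds one language tag per spoken word, e.g. "id" or "en".
    Adopted words and proper nouns are assumed to be tagged "id" already.
    """
    if not word_langs:
        return False
    english = sum(1 for lang in word_langs if lang == "en")
    # Integer comparison avoids floating-point edge cases at exactly 30%.
    return english * 100 > limit_percent * len(word_langs)
```

With ten words, three English words sit exactly at 30% and are kept, while four English words exceed the limit and the audio is discarded.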
[18:14] Okay, that is criterion number three. Coming to criterion number four: when the entire audio is unclear speech, we should discard it.
[18:26] Unclear speech refers to audio in which the content is indistinguishable and agents are unable to identify any discernible words.
[18:31] For example, common scenarios include speech overlapping with loud background noises, or the speaker is mumbling or whispering very softly, or his or her pronunciation is so imprecise that we cannot discern what he or she is saying, or he or she is speaking in a very thick accent, or the speech is accelerated or slowed down in a very exaggerated way.
[19:00] These cases are considered unclear speech, and if the entire audio is like this, we should discard it.
[19:18] Okay. But if only part of the audio is unclear, please do not discard it immediately.
[19:28] If there are still clear parts, you can intercept away the unclear parts and keep the clear parts.
[19:32] Select the clear parts and transcribe them.
[19:34] And the same goes for overlapping speech.
[19:37] If only part of the audio overlaps, please do not discard it immediately either.
[19:41] And please check if the non-overlapping segments are clear enough to transcribe.
[19:49] Okay. Coming to the next criterion.
[19:51] If there is a song played in the audio, there are three independent circumstances.
[19:58] Number one, you should discard the audio when the entire audio clip consists of only a song or music, or the speaker's speech overlaps entirely with the song or music.
[20:10] You should discard the audio in these two cases.
[20:12] But if the speaker's speech overlaps only partially with the song or music, we should intercept away the overlapping part and transcribe the clear speech parts that the speaker produced.
[20:31] But if the speaker's speech overlaps entirely or partially with a song or music, but the speech remains clear because the music is soft or light background music, we should transcribe it as it is.
[20:47] Or if there is only accompaniment or background music and the speaker sings along with it and the singing is clear, we should transcribe the lyrics that the speaker is singing.
[21:00] But if the speaker sings along with a song with audible lyrics, we should discard this case.
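The song circumstances above can be laid out as one decision function. This is one possible reading of the spoken rules, not an official decision table: all names are hypothetical, and it assumes that "the speech remains clear" takes priority over the entire-overlap discard rule.

```python
from enum import Enum


class Overlap(Enum):
    NONE = "none"
    PARTIAL = "partial"
    ENTIRE = "entire"


class Action(Enum):
    DISCARD = "discard"
    INTERCEPT = "intercept away the overlapping part"
    TRANSCRIBE_AS_IS = "transcribe as it is"
    TRANSCRIBE_SUNG_LYRICS = "transcribe the lyrics the speaker sings"


def song_action(has_speech: bool, overlap: Overlap, speech_stays_clear: bool,
                sings_along: bool = False, song_has_audible_lyrics: bool = False) -> Action:
    if not has_speech:
        return Action.DISCARD                      # only a song or music
    if sings_along:
        if song_has_audible_lyrics:
            return Action.DISCARD                  # singing over a song with audible lyrics
        if speech_stays_clear:
            return Action.TRANSCRIBE_SUNG_LYRICS   # clear singing over accompaniment only
    if speech_stays_clear or overlap is Overlap.NONE:
        return Action.TRANSCRIBE_AS_IS             # soft background music
    if overlap is Overlap.ENTIRE:
        return Action.DISCARD                      # speech fully masked by the song
    return Action.INTERCEPT                        # keep only the clear, non-overlapping part
```

For instance, `song_action(True, Overlap.PARTIAL, False)` returns `Action.INTERCEPT`.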
[21:08] Okay.
[21:10] And the last criterion for the discard rules is like this.
[21:19] When the audio contains sexual or harmful content, moderate tolerance is allowed.
[21:27] We can tolerate such content to a certain extent as long as it remains non-explicit, nonviolent and contextually ambiguous.
[21:39] But if any content makes you unsure or uncomfortable, like I just said, the safest choice is always to discard it.
[21:45] But as always, please use this as a last resort,
[21:48] because we do not want to discard too many cases.
[21:53] Okay, that is all for the discard rules.
[21:57] Any questions for now?
[22:09] Okay, good.
[22:17] Now, we move on to the interception rules.
[22:21] If I were to summarize the interception rules in just one sentence, I would use this sentence: when an interception is made, retain as much transcribable and non-overlapping speech as possible.
[22:35] So we'll discuss the interception rules in two sections: number one, what to intercept away, and number two, what to keep.
[22:49] So what are the unwanted parts? Number one, when there's recorded sound playing in the background, please refer to the rules that we discussed just now. Okay.
[23:01] And number two, when there are foreign languages. Please note that foreign languages in this project refers to languages that are not Indonesian or English. So if there are foreign languages, please intercept away the parts that contain the foreign languages and keep only the parts with Indonesian and English.
[23:31] Coming up next, the rules for unclear speech. If there is unclear speech, please just intercept it away and keep the clear speech parts.
[23:47] If the unclear speech in the background does not impact the speaker's clarity, you can ignore it, and you do not need to cut the whole audio in half. Okay.
[24:03] And the same goes for the noises. If the noises do not impact the speaker's clarity, you can just ignore them, and you do not need to cut the whole audio in half.
[24:15] The rules for the margins and padding were mentioned just now. We only tolerate margins or deviations within 100 milliseconds, and any margins longer than 100 milliseconds will be marked as reject in our evaluation round.
[24:38] So in other words, please apply tight cropping when you're doing the interception.
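The 100-millisecond tolerance can be expressed as a simple check. This is a sketch under assumptions: the true speech boundaries are whatever the labeler or reviewer judges them to be by ear and by zooming in, the function name is hypothetical, and a deviation is counted in both directions (extra padding as well as cutting into the speech).

```python
MAX_MARGIN_MS = 100  # tolerance stated in the webinar


def margins_ok(selected: tuple[int, int], speech: tuple[int, int],
               max_margin_ms: int = MAX_MARGIN_MS) -> bool:
    """Check that the selection starts and ends within the tolerance of the speech.

    `selected` and `speech` are (start_ms, end_ms) pairs.
    """
    sel_start, sel_end = selected
    speech_start, speech_end = speech
    lead = abs(speech_start - sel_start)    # deviation at the beginning
    trail = abs(sel_end - speech_end)       # deviation at the end
    return lead <= max_margin_ms and trail <= max_margin_ms
```

A selection of 1000 to 5000 ms around speech running from 1050 to 4950 ms passes, while starting the selection at 800 ms leaves a 250 ms margin and would be rejected.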
[24:46] Let's look at a few shots here.
[24:50] For example, if there's an overlapping speech part preceding the clear speech and there's a song following this clear speech, we should only keep the clear speech part. We should intercept the beginning part and the ending part away.
[25:12] But if the two red parts do not impact the speaker's clarity, for example, if it is a click from starting the audio recorder and there's a traffic noise at the end that is quite constant, we could just ignore the noises, and we do not need to make two cuts here and here.
[25:44] Okay. But what if the noises or the unclear parts or the overlapping speech parts are in the middle?
[25:52] For example, clear speech part one contains "I want a cup of coffee", then comes a part with some lyrics, melodies and songs, and then comes clear speech part two, containing the speaker saying "with ice". In this case, we should only keep clear speech part one because, number one, we can only select one clear speech part per audio, and number two, "I want a cup of coffee" is longer and contains more information than "with ice", and it is more semantically complete than "with ice".
[26:32] That is why we choose clear speech part one as the part that we want to keep. Okay.
[26:44] But if the noise does not impact the speaker's clarity, it can be retained. For example, if it is a noise part or a silence part, we do not need to cut the clear speech part in half. We can just ignore it.
[27:18] But please pay attention: if the silence or the noise is longer than 3 seconds, we should only retain the longest and most semantically complete segment and intercept away the other part. But if the silence or the noise part is shorter than 3 seconds, we do not need to do interception.
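The middle-gap rules can be sketched as a small selection routine. This is an illustration under assumptions: the function and parameter names are hypothetical, duration stands in for "longest and most semantically complete" (semantic completeness still needs human judgment), and a gap of exactly 3 seconds, which the webinar does not cover, is treated as short.

```python
GAP_LIMIT_MS = 3000  # silence or noise longer than this splits the audio


def pick_final_cut(parts: list[tuple[int, int]], gap_is_harmless: list[bool]):
    """Pick the single segment to keep.

    `parts` are clear speech parts as (start_ms, end_ms), sorted by time.
    `gap_is_harmless[i]` says whether the gap between parts[i] and parts[i + 1]
    is only silence or noise that does not affect the speaker's clarity
    (False for songs, overlapping speech or unclear speech).
    """
    if not parts:
        return None
    if len(gap_is_harmless) != len(parts) - 1:
        raise ValueError("need exactly one gap flag between each pair of parts")
    merged = [parts[0]]
    for (start, end), harmless in zip(parts[1:], gap_is_harmless):
        prev_start, prev_end = merged[-1]
        if harmless and start - prev_end <= GAP_LIMIT_MS:
            merged[-1] = (prev_start, end)   # short harmless gap: no cut needed
        else:
            merged.append((start, end))      # the gap forces a cut
    # Only one selection is allowed per audio: keep the longest candidate.
    return max(merged, key=lambda p: p[1] - p[0])
```

In the coffee example, the song between the two parts is not harmless, so the parts stay separate and the longer first part is kept.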
[27:55] Situations are a bit different for paralanguage like coughing, sneezing or laughing, or other real-life sounds. We do not need to transcribe these paralinguistic sounds, and we do not need to do the interception operation in this case.
[28:19] But if the speaker is imitating these real-life sounds, for example if the speaker is saying "achoo", well, we can tell when a person is imitating sounds like this, right? So we should put "achoo" in the transcription box instead of not transcribing it.
[28:45] Okay.
[28:50] But an unclear speech part does not always mean we need to intercept it away. For example, if someone says "this job is tailored towar- his experience", we can be sure that the speaker is trying to say "towards", but perhaps he has a very strong accent, or he is speaking so fast that he dropped the ending of the word "towards". In this case we can calibrate "towar-" to "towards".
[29:39] We can make some modifications, but only on the condition that you are sure that the speaker is trying to say "towards". If you are not sure, please do the interception and treat "towar-" as an unclear word.
[30:10] Let's look at another example. We all know there is a song called "Young Dumb & Broke". If the speaker says "young done and broke", we can be sure that the speaker is pronouncing it wrong. So in this case we should make some calibrations as well. No, sorry: we should not correct the speaker's mistakes in this case. If the speaker is clearly saying "done", we should transcribe "done". We should not convert it to "dumb".
[31:01] But if the speaker says "dank", and it is very clear, we should do the interception, because "dank" is not a real word; in other words, it is a pseudo-word. So in this case we should treat "dank" as an unclear word and make an interception. And since "I really like that song, you know, young" is longer than "and broke", we should keep the first part, clear speech part one, and intercept away the unclear part and clear speech part two.
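The choice between clear speech part one and part two can be sketched as follows. This is only an illustration: `keep_longest_clear_run` is a hypothetical helper, the unclear word positions are assumed to be marked by the annotator, and length is measured in words rather than in audio duration.

```python
def keep_longest_clear_run(words: list[str], unclear: set[int]) -> list[str]:
    """Return the longest run of consecutive words containing no unclear index."""
    best: list[str] = []
    run: list[str] = []
    for i, word in enumerate(words):
        if i in unclear:
            run = []  # an unclear word ends the current clear run
            continue
        run.append(word)
        if len(run) > len(best):
            best = list(run)  # ties keep the earlier run
    return best


words = "i really like that song you know young dank and broke".split()
kept = keep_longest_clear_run(words, unclear={8})  # index 8 is "dank"
print(" ".join(kept))
```

Here the eight words before "dank" form the longer clear part, so "and broke" is intercepted away together with the unclear word.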
[31:56] Okay, coming up next is situations with multiple speakers. There are two circumstances for cases with multiple speakers. In the first circumstance, there are multiple speakers but only one speaker speaks at a time. In this case, we need to distinguish who the main speaker is. If two or more speakers all intend to interact with the AI, we do not need to use angle brackets. For example, A says "hello, tell me about submarines" and B says "about the biggest one". We can tell that B is supplementing A's request, so we do not need to use angle brackets here.
[32:53] But in this example, A in the background says "catch the news tonight" and B says "what's up CC" (maybe we should change it to Dola here: "what's up Dola, are you angry"). We can tell that A is definitely not interacting with Dola, so we need to put "catch the news tonight" in angle brackets.
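As a rough illustration of the angle-bracket rule for non-overlapping speakers, here is a sketch in which the annotator has already judged whether each utterance is addressed to the assistant. `Utterance` and `render_transcript` are hypothetical names, and the exact spacing around the brackets is an assumption, not something the guidelines state.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    addresses_assistant: bool  # the annotator's judgment, not auto-detected


def render_transcript(utterances: list[Utterance]) -> str:
    """Wrap speech that is not directed at the assistant in angle brackets."""
    parts = []
    for u in utterances:
        parts.append(u.text if u.addresses_assistant else f"<{u.text}>")
    return " ".join(parts)


supplementing = [
    Utterance("hello tell me about submarines", True),
    Utterance("about the biggest one", True),
]
background = [
    Utterance("catch the news tonight", False),
    Utterance("what's up dola are you angry", True),
]
print(render_transcript(supplementing))
print(render_transcript(background))
```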
[33:19] But if there are overlapping speech parts, we need to refer to the rules that we discussed just now. If only part of the audio overlaps, you can transcribe the parts that are not overlapped. If two or more people's speech overlaps entirely, we should discard the case. But if only one speaker can be heard clearly and the other speakers' voices are too low or too unclear to be heard, we can still keep the case and treat the other speakers' voices as background noise, but only on the condition that they are too low or too unclear for us to understand.
[34:19] You can refer to the four examples here.
[34:32] Okay. There are also situations where we encounter speakers who stutter or who repeat a word many times. In this case we should apply a three-count limit rule, but only for uncountable repetitions. For example, if you cannot determine how many times the speaker says the personal pronoun "I", we should transcribe it only three times. But if it is countable, for example the speaker said "I" four times, we should put four "I"s in the transcription box. We do not need to use a hyphen here, because "I" is a real word with standalone meaning.
[35:29] But "wh", for example, does not represent a word and has no standalone meaning, so we should put a hyphen after "wh". The same goes for "capa" in "capacity". For "cap" in "capacity", we do not need to put a hyphen after it, because "cap" is a word that exists, which means something like a hat that we wear.
[36:00] But in a case like "we have the abil- capacity to deliver", it is clear that the speaker is trying to change his wording. "Abil-" does not correspond to the next word, so it must be intercepted. In this case we can keep either "we have the" or "capacity to deliver". If you think "capacity to deliver" is longer than "we have the", you can keep the latter part. But they both have three words, and they are both semantically incomplete to a certain extent. So theoretically either can be kept, but since we can keep only one part per audio, you can choose the part according to your own judgment. Okay.
[37:01] The same three-count limit goes for interjections and modal words as well. I remember some of you asked me whether there should be a space between the "ha"s. There should be no spaces between them. But if the speaker produces his or her laughter in a very intermittent way, like "ha ha ha", there should be spaces in between, but not if he or she is laughing normally.
[37:45] Okay. Any questions about the interception rules? If not, let's practice. Let's hear the audio once again.
[38:02] Hello cholesterol.
[38:07] Hello Google cholesterol.
[38:12] So at the beginning of the audio we can hear a woman's voice and a man's voice overlapping with each other, so we need to intercept them away. I think the woman stopped talking from here. Let's see.
[38:29] >> Yeah. So from here, from "apakah", she stopped talking. So we need to select from here to here. Let's check whether we captured it precisely. We can hit this button, which plays only the segment that you have selected.
[38:54] >> And let's use this magnifying glass button to make sure once again. You can try different positions, actually.
[39:07] >> If you think that we have intercepted away too much, we can make adjustments.
[39:19] >> Okay. It seems that there is no problem with the interception, and we do not need to discard it. It is in pure target language. And since we intercepted away the part with "Hello Google", we only need to put the remaining part in the transcription box, like this.
[39:46] Please note that in the transcription box I have deleted the punctuation marks, including the comma, because we are doing only the ASR transcription right now. We are not doing the post-processing. Those of you who have done the test set project will know that the post-processing consists of the ITN (inverse text normalization) step and the punctuation step.
[40:20] So we do not wish to see any punctuation in the transcription box, nor do we wish to see any number in its Arabic form. If there is "14" in Arabic form in the ASR result, we should convert it to its word form, like F-O-U-R-T-E-E-N in English. We do not wish to see Arabic numerals in the transcription box.
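To make the "no punctuation, no Arabic numerals" requirement concrete, here is a minimal sketch using English number words for illustration. `to_spoken_form` and `number_to_words` are hypothetical helpers that handle only whole numbers from 0 to 99. The set of punctuation marks removed is an assumption, and hyphens are deliberately left alone because the guidelines use them for word fragments. An Indonesian project would need Indonesian number words instead.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0..99 in English; larger numbers are out of scope here."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def to_spoken_form(text: str) -> str:
    """Replace digit runs with words, then drop sentence punctuation."""
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    text = re.sub(r"[,.!?;:]", "", text)
    return " ".join(text.split())  # collapse any leftover double spaces


print(to_spoken_form("I have 14 cats, right?"))
```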
[40:56] So this is how we do the interception and the transcription. Is there any problem with this case?
[41:16] No. No questions.
[41:22] If there is no question, let's come to the next section.
[41:31] When we are choosing the language class label, we have two choices: number one, "target language"; number two, "target language and English". Select number one when the language used in the final cut is 100% target language only. Select number two if the final cut is a mix of target language and English.
[42:00] But please pay attention to an example here. When we are doing the Indonesian project, if someone says "I like Hermes and Chanel", I believe Hermes and Chanel are already widely recognized in Indonesian because they are world-class brands. So Hermes and Chanel can be considered part of the Indonesian language. We do not need to discard this case, because these two words are relatively long and already account for roughly over 70% of the whole audio.
[42:43] We should choose the language label "target language and English", with the target language part including the words "Hermes" and "Chanel", and the English part containing "I like" and the conjunction "and". These are the rules for choosing the correct language class label.
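The label choice can be sketched as below, assuming each word in the final cut has already been tagged by the annotator as target language or English. `language_label` and the `ADOPTED_BRANDS` set are hypothetical; the guidelines do not give a brand list, and the discard threshold mentioned in the example is not modelled here.

```python
# Hypothetical set of brand names treated as part of the target language.
ADOPTED_BRANDS = {"hermes", "chanel"}


def language_label(words: list[tuple[str, str]]) -> str:
    """words holds (word, lang) pairs, with lang 'target' or 'english'."""
    for word, lang in words:
        if lang == "english" and word.lower() not in ADOPTED_BRANDS:
            return "target language and English"
    return "target language"


mixed = [("I", "english"), ("like", "english"), ("Hermes", "english"),
         ("and", "english"), ("Chanel", "english")]
print(language_label(mixed))
```

In the webinar's example, "I like" and "and" remain English, so the mixed label applies even though the two brand names count toward the target language.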
[43:06] As for the transcription rules, I don't think I need to read each rule one by one for you, because they are quite clear and quite easy compared to the discard rules and interception rules. There is one section that I want to emphasize, and that is section 2.4.1.
[43:36] As I said just now, an unclear speech part does not always mean that we need to intercept it away or discard the case. For example, if someone says "I open the doll", we know that the speaker is trying to say "door". If his or her pronunciation falls somewhere between "doll" and "door", we can calibrate it to "door", because we are sure that he or she is saying "door". The same goes for the case with "towards". We also made a calibration there, right?
[44:23] Another example is like this. If the speaker pronounces the final consonant "t" in "finished" so lightly or softly that we do not know whether he or she pronounced it at all, we can calibrate it as well. But if you are sure that he or she has not pronounced the final consonant, we do not need to correct his or her grammar mistakes. We can just delete the "ed" in this case. If it is unclear, we can calibrate it.
[45:07] The same goes for the two examples above. If he or she clearly says something other than "a lot of", or clearly says "I do" instead of "I did", we do not need to correct his or her grammar mistakes.
[45:24] Okay, that is pretty much what I want to say about the training set guidelines. Let me think if there is anything that I want to add.
[45:39] Yes, there is one important thing. For this case we have selected from here to here, right? But let's hypothesize that there is no overlapping speech part in the whole audio. We should still do the interception from here to here, because if we don't, there will be no start time and no end time, and there will be no audio left when we export the data that you have labeled.
[46:26] So for any case that you do, please be sure to intercept one part per audio. Even if there are no unclear speech parts, even if there are no overlapping speech parts, please select at least one part in the audio, like this.
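The reason for always selecting one part can be shown with a small validation sketch. The labeling tool's real export format is not described in the webinar, so `Selection` and `export_bounds` are hypothetical and only illustrate that a case without a selection has no start or end time to export.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Selection:
    start: float  # seconds
    end: float    # seconds


def export_bounds(duration: float, selection: Optional[Selection]) -> tuple[float, float]:
    """Return (start, end) for export; a case with no selection cannot be exported."""
    if selection is None:
        raise ValueError("select one part of the audio, even if nothing needs cutting")
    if not 0 <= selection.start < selection.end <= duration:
        raise ValueError("selection must lie inside the audio")
    return (selection.start, selection.end)


# A fully clean 10-second clip still needs an explicit selection.
print(export_bounds(10.0, Selection(0.0, 10.0)))
```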
[46:49] Okay.
[46:53] Oh yeah, the English modal words list is in this document. For other languages, for example for this Indonesian project, you can always refer to this document that I sent you before.
[47:13] Okay, that is pretty much what I want to say. Any questions? I hope I have made everything clear, and if not, please correct me if there is anything unclear.
[47:38] Okay. If you have more questions when you are actually working on the cases in the pilot queue, or in the training set project once we officially kick it off, you can still post your questions in the group chat, if you are already with me in the group chat. If not, please turn to your PC for help, and your PC will forward your question to me.
[48:10] Okay, if no one has questions, I think our webinar can end here for today. Thank you very much for your cooperation and for your participation. I hope we can do this project together. How could I put it? I hope that everyone can get what he or she wants.
[48:43] Okay. Thank you very much. I'll stop this webinar now, and I will send the recording to the group and to your PCs afterwards. Thank you very much again.