Install TeX Live 2022 on Ubuntu 22.04 under Multipass

Multipass is a game changer. It is like a virtual environment for Python or VirtualBox for an OS, but for Ubuntu only. This short post documents my effort to install TeX Live 2022 so I could add the IEEE header and footer to the camera-ready APSIPA 2022 paper.

Host environment

bagus@L140MU:~$ snap --version
snap 2.56.2
snapd 2.56.2
series 16
ubuntu 20.04
kernel 5.15.0-46-generic
bagus@L140MU:~$ multipass --version
multipass 1.10.1
multipassd 1.10.1


1. Install Multipass (refer to this link for details).
2. Create an instance with Jammy (Ubuntu 22.04).
3. Update Jammy.
4. Install TeX Live and the required packages:

sudo apt install texlive-base texlive-fonts-extra texlive-fonts-recommended

5. Try on the desired latex template

$ sudo apt install unzip
$ unzip
$ cd APSIPA_ASC_2022_Template/Latex
$ pdflatex PaperSample_Guideline_tex.tex
Output written on PaperSample_Guideline_tex.pdf (3 pages, 126525 bytes).
Transcript written on PaperSample_Guideline_tex.log.

That’s all. On Ubuntu 20.04, compilation falls into an infinite recursion loop due to the older version of the fancyhdr package; the only solution I found is to install a newer TeX Live on Ubuntu 22.04 under Multipass. For real use cases, you may need to mount your local directory (which contains the TEX files) into the instance with `multipass mount`.

from bagustris@/home

Acoustic Feature Extraction with Transformers

The example in Transformers’ documentation here shows how to use the wav2vec2 model for automatic speech recognition. However, there are two crucial issues with that example. First, we usually use our own data (set) instead of the available dataset. Second, we need to extract acoustic features (the last hidden states instead of logits). The following is my example of adapting Transformers to extract an acoustic embedding from any audio file (WAV) using several models. It includes average pooling to go from frame-based to utterance-based processing. You can skip the average pooling if you want to process your audio file frame by frame.
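To illustrate the pooling step in isolation, here is a minimal numpy sketch (the frame count 249 and dimension 768 are hypothetical, matching a typical wav2vec2 base output) of averaging frame-level features into a single utterance-level vector:

```python
import numpy as np

# hypothetical frame-level features: 249 frames x 768 dimensions
frames = np.random.rand(249, 768)

# average pooling over the frame axis yields one utterance-level vector
utterance = frames.mean(axis=0)
print(utterance.shape)  # (768,)
```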

Basic syntax: wav2vec2 base model

This is the example from the documentation; I replaced the use of the dataset with a defined path to an audio file (‘00001.wav’).

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import torchaudio

# load the processor and model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# load and preprocess the audio file
array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
inputs = processor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

# apply the model to the processed input
with torch.no_grad():
    outputs = model(**inputs)

# extract the last hidden states, average over frames, convert to numpy
last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()

print(f"Hidden state shape: {last_hidden_states.shape}")
# Hidden state shape: (768,)

The syntax for the wav2vec2 large and robust model

In this second example, I replace the base model with the large robust model without finetuning. This example is adapted from here. Note that I replaced `Wav2Vec2ForCTC` with `Wav2Vec2Model`. The former is used when we want to obtain logits (for speech-to-text transcription) rather than hidden states.

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import torchaudio

# load the processor and model
processor = Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-swbd-300h")
model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-swbd-300h")

# load and preprocess the audio file
array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
inputs = processor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# extract the last hidden states, average over frames, convert to numpy
last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()
print(f"Hidden state shape: {last_hidden_states.shape}")

You can replace “facebook/wav2vec2-large-robust-ft-swbd-300h” with “facebook/wav2vec2-large-robust-ft-libri-960h” for the larger fine-tuned model.


The syntax for the custom model (wav2vec-R-emo-vad)

The last one is an example of a custom model: wav2vec 2.0 fine-tuned on the MSP-Podcast dataset for speech emotion recognition. This example differs from the previous ones since the model configuration is given by the authors of the model (read the code thoroughly to inspect the details). I replaced the dummy audio file with a real audio file, and batch processing (batch_size=2) is simulated by replicating the same audio file.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)
import torchaudio


class RegressionHead(nn.Module):
    r"""Classification head."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(self, input_values):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)
        return hidden_states, logits


def process_func(wavs, sampling_rate: int):
    r"""Predict emotions or extract embeddings from a raw audio signal."""
    # load the model from the hub
    device = 'cpu'
    model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = EmotionModel.from_pretrained(model_name)

    # run through the processor to normalize the signal;
    # it always returns a batch, which we then put on the device
    y = processor([wav.cpu().numpy() for wav in wavs],
                  sampling_rate=sampling_rate,
                  return_tensors="pt",
                  padding=True)
    y = y['input_values']
    y = y.to(device)

    # run through the model
    y = model(y)

    return {
        'hidden_states': y[0],
        'logits': y[1],
    }


# test on an audio file, batched by replicating the same signal (batch_size=2)
sampling_rate = 16000
signal = [torchaudio.load('train_001.wav')[0].squeeze().to('cpu') for _ in range(2)]

# extract hidden states
with torch.no_grad():
    hs = process_func(signal, sampling_rate)['hidden_states']
print(f"Hidden states shape={hs.shape}")

Please note that for all models, the audio file must be sampled at 16,000 Hz; otherwise, you must resample it before extracting acoustic embeddings with the methods above. The models may not throw an error even if the sampling rate is not 16,000 Hz, but the results will not be valid, since all models were trained on speech data with a 16 kHz sampling rate. You may also want to extract acoustic features using the openSMILE toolkit. The tutorial for Windows users using WSL is available here:

Happy reading; don’t wait any longer to apply these methods to your own audio files.

from bagustris@/home

Who should clean up B’s trash?

Consider the following thought experiment.

A organizes an event (the event committee). B attends the event (a participant). If B litters during the event, who is obliged to clean it up?

If you still answer A, let us add another case.

B is in his own house. B litters in his own house. Who should clean up B’s trash?

B should dispose of his own trash, no matter where he is. As long as it is his trash, he himself is obliged to dispose of it, not someone else.

from bagustris@/home

Maximum number of self-citations

Best practice for the number of self-citations in an academic paper is 10% of the total number of references [1]. Other sources allow 7-20%. For me personally, the maximum is given by the table and formula below.


Number of references    Max. self-citations
1-10                    1
11-20                   2
21-30                   3
...                     ...
91-100                  10


$$ n_{cite} = 10\% \times \lceil n_{ref}/10 \rceil \times 10 $$

where n_cite is the maximum number of self-citations and n_ref is the number of references.
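The formula can be checked with a small Python helper (my own sketch; note that 10% × ceil(n_ref/10) × 10 simplifies to ceil(n_ref/10)):

```python
import math

def max_self_citations(n_ref: int) -> int:
    # 10% of the reference count rounded up to the next ten,
    # which simplifies to ceil(n_ref / 10)
    return math.ceil(n_ref / 10)

print(max_self_citations(15))   # 2
print(max_self_citations(100))  # 10
```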

Why self-citation?

Because we (usually) cannot research and write an academic paper from scratch; it builds on our own previous research. This is where self-citation comes in.

The second reason is to boost the h-index (Scopus, Google Scholar, WOS) of the researcher concerned.



from bagustris@/home

Hiking Technique for Flat Trails: Flat Foot

The following notes cover hiking techniques for flat trails. Is there really a technique for walking on flat ground? There is. On long treks, using the following techniques will minimize the energy expended and, at the same time, minimize fatigue. The two important techniques concern posture and gait (the swing of the feet).


Posture

The following posture should be used on flat trails.

  1. Upright. Unlike ascending, where the body leans forward, or the gorilla-like descending technique, on flat trails the body position is upright.
  2. Head up. We sometimes forget to keep the head up, looking down at the trail or up at the scenery. The main technique on flat trails is to keep the head straight.

Walking Technique

  1. Flat foot. This is the main walking technique on flat trails. As the name suggests, flat foot means swinging the foot as flat as possible, both when lifting it and when placing it on the ground. In a normal walk, we lift the foot from the heel (the heel leaves the ground last) and land on the toes. With flat foot, we both lift and land with the foot flat. See the figures below for details.

     Lifting the foot with the flat foot technique [1]

     Landing the foot with the flat foot technique [2]

  2. Lift the feet as low as possible. The last technique is to lift the feet as low as possible to avoid fatigue (the higher you lift your feet, the more energy you need).

Watch the following video tutorial to see it in practice.



from bagustris@/home

Opening and saving JSON files

Case study
Suppose we want to save the following data set in JSON format, containing the files and their labels (speech emotion recognition data). For this purpose, we want to separate the training data (‘train_meta_data.json’) from the test data (‘test_meta_data.json’). The following script accomplishes this.

import os
import glob
import json

data_dir = '/data/Audio_Speech_Actors_01-24/'
files = glob.glob(os.path.join(data_dir, 'Actor_??', '*.wav'))

data_train = []
data_test = []

for file in files:
    lab = os.path.basename(file).split('-')[2]
    if int(file[-6:-4]) < 20:  # speakers 1-19 for training
        data_train.append({
            'path': file,
            'label': lab
        })
    else:  # remaining speakers for testing
        data_test.append({
            'path': file,
            'label': lab
        })

with open("train_meta_data.json", 'w') as f:
    json.dump(data_train, f)

with open("test_meta_data.json", 'w') as f:
    json.dump(data_test, f)

Opening a JSON file

import json

filepath = '/data/Audio_Speech_Actors_01-24/train_meta_data.json'
with open(filepath, 'r') as f:
    data_train = json.load(f)

from bagustris@/home

Benchmarking SSD: INTEL SSDPEKNW020T8 (NVMe)

The following are the benchmark results for the INTEL SSDPEKNW020T8 NVMe SSD.

Capacity: 2 TB
Partition format: FAT




– The format benchmarked this time is FAT; ext4 and xfs may well be faster.
– The Intel NVMe SSD looks more stable than the WDC one, with similar read/write speeds. See here for the benchmark results of the WDC NVMe SSD:

from bagustris@/home

Writing for impact, not for impact factor

Nowadays, research is measured by publication. Publish or perish. The pressure on researchers to publish is greater than ever. As a result, there are tons of research publications. Most of them may be garbage; only a small portion has an impact. So, what is impact in research?

Impact factor

At first, I thought that “writing for impact is writing for impact factor” (since “impact” is measured by “impact factor”). By this definition, authors will aim for journals with high impact factors, because “impact” is defined by the impact factor calculation. In fact, the impact factor calculation is based on citations and the number of publications. Hence, writing for impact factor is no more than writing for citations. I changed my mind recently: writing for impact is not writing for impact factor. Impact is different from the (calculation of the) impact factor. Now, some journals and conferences request this “social impact” as an additional section in the author’s manuscript [1, 2]. This is good. By requesting authors to show the impact of their research, the impact of research is now clearer than before.

Kinds of social impact

Now, when asked to write the social impact of my writing, I wonder what the social impact of my manuscript would be. Reference [1] explicitly asks authors what “positive impact” means. A positive impact could be one of the following (my own definitions).

1. Readers change their perspective. For instance, the paper entitled “Toward a consensus on symbolic notation of harmonics, resonances, and formants in vocalization” proposed a new standard notation for the fundamental frequency (in acoustics), i.e., writing it as $f_o$ (ef-oh) instead of F0, $F_0$, or $f_0$ (ef-zero). This paper has had a big social impact on the (acoustics) community.

2. Readers can learn. Many papers show their method clearly so readers can learn and benefit from reading the paper. An instance is the paper entitled “CALFEM as a Tool for Teaching University Mechanics.”

3. Readers can replicate. Open science is making a difference. Anyone can replicate the authors’ experiments. This kind of research is game-changing. Even big companies like Google, Microsoft, and Meta open their research publicly, along with open repositories to replicate it. Most of my research is also open science; one example is the paper entitled “Deep Multilayer Perceptrons for Dimensional Speech Emotion Recognition”.

4. Readers can improve the result. One way to enable improvement of the current result is to explicitly propose further directions. This statement is usually placed in the Conclusions or the section just before it.

5. A policy can be taken. This is the highest impact, a policy can be taken from a research result. For instance, to fight global warming (based on specific data), the government changes the policy to abandon the use of coal and move to nuclear energy. Or, based on the risk of nuclear energy, the government encourages the use of wind and solar energy.

I hope this opinion changes yours: do not write for impact factor (only), but do write for real (social) impact.




from bagustris@/home

Basic Audio Manipulation With Torchaudio

Recently, I moved my audio processing toolkit from librosa (and others) to Torchaudio. This short post documents the very basics of torchaudio for audio manipulation: reading, resampling, and writing an audio file.

Load audio file (read)

The process of loading (reading) an audio file is straightforward: just pass the audio path to `torchaudio.load`. We need to import the needed modules first. Most audio formats can be loaded by torchaudio (WAV, OGG, MP3, etc.).

import torchaudio
import torchaudio.transforms as T
wav0, sr0 = torchaudio.load("old_file_48k.wav", normalize=True)

where `wav0` is the output tensor (array) and `sr0` is the original sampling rate. The argument `normalize=True` is optional, to normalize the waveform. Note that one of my colleagues (a student) found that `librosa.util.normalize()` resulted in better normalization (peak-to-peak waveform from -1 to 1) than this torchaudio normalization.
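For reference, peak normalization (what `librosa.util.normalize` does by default, dividing by the maximum absolute value so the peak reaches exactly 1) can be sketched with plain numpy:

```python
import numpy as np

def peak_normalize(x: np.ndarray) -> np.ndarray:
    # divide by the maximum absolute amplitude so the peak reaches 1.0
    return x / np.max(np.abs(x))

x = np.array([0.1, -0.4, 0.2])
print(np.max(np.abs(peak_normalize(x))))  # 1.0
```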



Resample to another sampling rate (transform)

Resampling from one sampling rate to another is done by a class; its instance is a callable. Hence, we pass the old tensor to the resampler. Here is an example that converts a 48 kHz tensor to a 16 kHz tensor.

sr1 = 16000
resampler = T.Resample(sr0, sr1)
wav1 = resampler(wav0)

Save as a new audio file (write)

The process of saving a file is also straightforward: just pass the file name, tensor, and sampling rate, in that order.

torchaudio.save('new_file_16k.wav', wav1, sr1)

The new audio file then appears in the current directory. Just set the path and file name if you want to save it in another directory.




from bagustris@/home

Three Types of Scientists

Based on a dialogue between Peter Gruss (president of OIST) and Kazuhiko Nakamura (CEO of AIST) [1], the former explained that there are three types of scientists, as follows.

  1. The pure scientist, like Albert Einstein. Pure scientists only think about and research whatever is on their minds, without considering the wider effect (long-term impact). When Einstein was working on the theory of relativity, he never thought about global positioning system (GPS) technology, even though the concept of relativity was essential to the invention of GPS fifty years after the theory was published.

  2. The use-inspired scientist, like Pasteur. What this type of scientist asks is: “What can I do to improve a particular aspect of human life?” Examples are research to develop new diagnostic techniques, therapies, and drugs, as Pasteur did in discovering antibiotics.

  3. The engineer, like Thomas Alva Edison. This type of scientist dwells only a little on the science; what matters is the application. Edison focused only on how to invent the light bulb, whatever the science behind it.

If you are a scientist (or want to become one), which type would you like to be? I am more interested in the third type, because its contribution (social impact) is more tangible.




from bagustris@/home